# 🚀 Your Challenge: Boost Customer Retention for a Telco Company!
Your mission is to help a telecom company predict whether a customer will leave (churn) in the coming months,
and to develop strategies to keep them engaged and prevent revenue loss.

By doing this, you’ll enable the business to focus on retention programs for customers at risk, keeping them satisfied while reducing churn!

## 📊 Data Overview
You’re provided with **two datasets** that hold essential data about the customers:

1.	**customer_data**: Data about each customer
2.	**activity_data**: Monthly activity data from the year 2021
### Customer Data
- **customer_id (primary key)**: id of the customer (string - e.g. 100002)
- **birth_date**: birth date of the customer (string - e.g. 1976-12-11 00:00:00)
- **plan_type**: phone plan type (string - e.g. 'pay-as-you-go', 'prepaid', 'postpaid')
- **join_date**: the date the customer started using services (string - e.g. 2010-03-12 00:00:00)
- **churn_in_3mos**: whether the customer left in the first 3 months of 2022 (boolean - 0 for active customers, 1 for departed customers)
### Activity Data
- **customer_id (primary key, foreign key to customer_data)**: id of the customer (string - e.g. 100002)
- **month (primary key)**: the billing period (string - e.g. 1/01/2021, format: dd/mm/yyyy, range: 12 months, from 1/01/2021 to 1/12/2021)
- **data_usage**: how many GBs this customer used in a month (float - e.g. 21.23)
- **phone_usage**: how many minutes this customer used in a month (float - e.g. 534.47)
- **use_app**: whether the customer used the online app this month (boolean - 0 for customers who did not use the online app in the given month, 1 for customers who have used the online app)
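
The date columns arrive as strings in two different layouts, so it may help to parse them explicitly before doing any date arithmetic. A minimal sketch, assuming the formats listed above; the two tiny frames are made-up stand-ins for the real CSV files:

```python
import pandas as pd

# Made-up rows that mimic the documented string formats
customers = pd.DataFrame({
    "customer_id": ["100002"],
    "birth_date": ["1976-12-11 00:00:00"],
    "join_date": ["2010-03-12 00:00:00"],
})
activity = pd.DataFrame({
    "customer_id": ["100002", "100002"],
    "month": ["1/01/2021", "1/12/2021"],
})

# ISO-like timestamps parse without an explicit format
for col in ["birth_date", "join_date"]:
    customers[col] = pd.to_datetime(customers[col])

# month is dd/mm/yyyy: pass the format so 1/12/2021 is read as 1 December
activity["month"] = pd.to_datetime(activity["month"], format="%d/%m/%Y")

print(activity["month"].dt.month.tolist())  # [1, 12]
```

Without the explicit `format`, pandas would by default read `1/12/2021` month-first, as 12 January.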

In [1]:
# Suggestion: to keep your notebook organized and clean, maintain a cell to manage your imports.
# Imports
import pandas as pd
import numpy as np
import os 

In [3]:
# Loading the data
DATA_DIR = os.path.join("./","data")
customer_data = pd.read_csv(os.path.join(DATA_DIR, "customer_data.csv"))
activity_data = pd.read_csv(os.path.join(DATA_DIR, "activity_data.csv"))

### 🔎 Start with Some Exploration!

Before jumping into building a predictive model, first **explore** the data to uncover insights that may be relevant to churn.

In [4]:
print(customer_data)

      customer_id  birth_date   join_date      plan_type  churn_in_3mos
0           10000  1994-08-13  2015-11-22       postpaid              0
1           10001  1994-06-25  2015-01-12  pay-as-you-go              1
2           10002  2008-06-10  2020-05-22        prepaid              0
3           10003  1970-09-04  2017-11-10        prepaid              0
4           10004  1969-11-06  2019-05-19        prepaid              0
...           ...         ...         ...            ...            ...
9995        19995  2007-07-30  2015-11-23        prepaid              1
9996        19996  1981-10-26  2018-03-18        prepaid              0
9997        19997  1999-01-10  2012-12-02       postpaid              0
9998        19998  1993-09-10  2015-09-09       postpaid              0
9999        19999  1971-03-17  2018-02-23        prepaid              1

[10000 rows x 5 columns]


In [8]:
print(activity_data)

        customer_id      month  data_usage  phone_usage  use_app
0             10000  1/01/2021       43.61      4570.12        1
1             10001  1/01/2021        2.07      2038.61        0
2             10002  1/01/2021       45.69      1786.97        1
3             10003  1/01/2021       45.70      2450.95        1
4             10004  1/01/2021       15.28      4627.57        1
...             ...        ...         ...          ...      ...
119995        19995  1/12/2021       37.83      1733.67        1
119996        19996  1/12/2021       33.76      4220.29        1
119997        19997  1/12/2021       11.96      1659.82        1
119998        19998  1/12/2021       39.31      4154.94        0
119999        19999  1/12/2021       39.62      2707.97        1

[120000 rows x 5 columns]


### 🧠 Question 1: What’s the Average Tenure of Our Customers?

The client wants to know how long their customers have been with them, as of 2022-01-01.

**Task**: Calculate the **average tenure** (in years) of the customer base. Your function should return a number with **two decimal** places. 

This could help to identify loyal customers who might be at risk.

In [20]:
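from datetime import datetime, date

def duration(customer_data):
    # Tenure of each customer in years as of 2022-01-01 (a year is taken as 365 days)
    join_date = pd.to_datetime(customer_data['join_date'])
    return (pd.Timestamp('2022-01-01') - join_date).dt.days / 365

def avg_tenure(customer_data):
    # Mean tenure across the customer base, rounded to two decimal places
    return round(duration(customer_data).mean(), 2)

# Compute the average tenure
average_tenure = avg_tenure(customer_data)
print("Average Tenure:", average_tenure)
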

### 📶 Question 2: What’s the Average Data Usage by Plan Type?

**Task**: Analyze how different types of customers are using data! Calculate the average monthly data usage for each plan type, and sort the results from low to high. 

This insight could be impactful for the marketing team to understand customer habits and plan targeted promotions.

In [17]:
def avg_data_usage_by_type(customer_data, activity_data):
    # Attach each customer's plan_type to their monthly activity rows
    activity_data_updated = activity_data.merge(customer_data, on=['customer_id'], how='left')

    # Average monthly data usage per plan type, sorted from low to high
    return activity_data_updated.groupby('plan_type')['data_usage'].mean().sort_values()


avg_data_usage_by_type(customer_data, activity_data)

plan_type
pay-as-you-go    15.611766
prepaid          31.228732
postpaid         62.738026
Name: data_usage, dtype: float64

### 💼 Business Problem: Prevent Customer Churn
The **marketing department** wants to send personalized promotions to users who are **likely to churn**.

Your task is to build a **classification model** that predicts how likely a customer is to churn in the next 3 months. The model will help focus **retention efforts** on the right customers, minimizing revenue loss.

The target variable here is **churn_in_3mos** (1 = churned, 0 = active).

### 🔧 Question 3a: Feature Engineering
To make predictions, you’ll first need to create new features using both customer and activity data. 

Here is a suggested list of relevant features that you should add to the aggregated table:
- **tenure**: Customer tenure in years
- **total_phone_usage**: Total phone minutes used in 2021
- **app_usage_count**: Number of months the customer used the online app
- **phone_usage_ratio**: Ratio of phone usage in the last 3 months vs the entire year

Don’t forget to include the churn_in_3mos target variable!
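
One way the suggested features could be assembled is sketched below on made-up data. The toy frames, the variable names, and the choice of a 365-day year are assumptions for illustration; the column names follow the data overview.

```python
import pandas as pd

# Toy data: two customers, four months each (made-up values)
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "join_date": ["2020-01-01", "2017-01-01"],
    "churn_in_3mos": [0, 1],
})
activity = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "month": ["1/09/2021", "1/10/2021", "1/11/2021", "1/12/2021"] * 2,
    "phone_usage": [100.0, 100.0, 100.0, 100.0, 300.0, 50.0, 25.0, 25.0],
    "use_app": [1, 1, 0, 1, 0, 0, 1, 0],
})
activity["month"] = pd.to_datetime(activity["month"], format="%d/%m/%Y")

# Per-customer aggregates over the whole period
agg = activity.groupby("customer_id").agg(
    total_phone_usage=("phone_usage", "sum"),
    app_usage_count=("use_app", "sum"),
)

# Share of phone minutes used in the last three months (October to December)
last_3 = activity[activity["month"] >= pd.Timestamp("2021-10-01")]
agg["phone_usage_ratio"] = (
    last_3.groupby("customer_id")["phone_usage"].sum() / agg["total_phone_usage"]
)

# Tenure in years as of 2022-01-01, approximating a year as 365 days
cust = customers.set_index("customer_id")
reference_date = pd.Timestamp("2022-01-01")
agg["tenure"] = (reference_date - pd.to_datetime(cust["join_date"])).dt.days / 365

# Target variable, aligned on customer_id
agg["churn_in_3mos"] = cust["churn_in_3mos"]

features = agg.reset_index()
print(features)
```

Aggregating first and joining on `customer_id` afterwards keeps the table at one row per customer, which is the shape a classifier expects.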

### 🧑‍💻 Question 3b: Can You Create Additional Features?
Now that we have some basic features, can you **add 2 more features** that might help the model perform better? 

Think about what might influence customer behavior — use your creativity!
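
As a starting point, two candidate features are sketched below: the customer's age and the change in data usage between the first and last observed month. Both are illustrative suggestions rather than features known to help, and the toy data and names are made up.

```python
import pandas as pd

# Toy data (made-up values); activity rows are assumed sorted by month
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "birth_date": ["1990-01-01", "2000-07-01"],
})
activity = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 2],
    "data_usage": [10.0, 10.0, 10.0, 30.0, 20.0, 10.0],
})

reference_date = pd.Timestamp("2022-01-01")

# age: whole years between birth date and the reference date (365-day years)
birth = pd.to_datetime(customers.set_index("customer_id")["birth_date"])
extra = pd.DataFrame({"age": (reference_date - birth).dt.days // 365})

# data_usage_change: last month's usage minus first month's usage
usage = activity.groupby("customer_id")["data_usage"]
extra["data_usage_change"] = usage.last() - usage.first()

print(extra)
```

A falling `data_usage_change` might signal disengagement, but whether it actually improves the model is something to verify on the test set.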

## 🏋️‍♂️ Question 3c: Train a Churn Prediction Model
It’s time to build our model! 

First, **split the data**: 80% for training and 20% for testing. 

Then, train your model to predict customer churn and calculate the AUC score on the test set.

You should answer these three questions:

1.	**What is the AUC score?**
2.	**What is a good AUC score?**
3.	**What is one other evaluation metric** we could use for this problem? Why would it be useful?

In [55]:
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

In [51]:
def feature_set(activity_data, customer_data):
    """
    Build one row of model features per customer, plus the churn target.
    """
    # Clean data (copy so the column assignments below act on an independent frame)
    activity_data_clean = activity_data.dropna()
    customer_data_clean = customer_data.dropna().copy()

    # phone_usage_ratio: share of the year's phone minutes used in Oct-Dec 2021
    last_3_months = ['1/10/2021', '1/11/2021', '1/12/2021']
    phone_usage_last_3mos = (
        activity_data_clean[activity_data_clean['month'].isin(last_3_months)]
        .groupby('customer_id')['phone_usage'].sum()
    )
    phone_usage_total = activity_data_clean.groupby('customer_id')['phone_usage'].sum()
    phone_usage_ratio = phone_usage_last_3mos / phone_usage_total

    # total data usage (GB) over 2021
    total_data_usage = activity_data_clean.groupby('customer_id')['data_usage'].sum()

    # app_usage_count: number of months in which the app was used
    app_usage_count = activity_data_clean.groupby('customer_id')['use_app'].sum()

    # tenure in years as of 2022-01-01 (a year is taken as 365 days)
    customer_data_clean['join_date'] = pd.to_datetime(customer_data_clean['join_date'])
    customer_data_clean['tenure'] = (pd.to_datetime('2022-01-01') - customer_data_clean['join_date']).dt.days / 365

    # churn_in_3mos: target variable, one value per customer
    churn_in_3mos = customer_data.groupby('customer_id')['churn_in_3mos'].sum()

    # Combine features (aligned on customer_id)
    features = pd.DataFrame({
        'phone_usage_ratio': phone_usage_ratio,
        'total_data_usage': total_data_usage,
        'app_usage_count': app_usage_count,
        'tenure': customer_data_clean.set_index('customer_id')['tenure'],
        'churn_in_3mos': churn_in_3mos
    }).reset_index()

    return features

dataset = feature_set(activity_data, customer_data)
print(dataset['churn_in_3mos'])



0       0
1       1
2       0
3       0
4       0
       ..
9995    1
9996    0
9997    0
9998    0
9999    1
Name: churn_in_3mos, Length: 10000, dtype: int64


In [60]:
def train_test_model(X_train, y_train, X_test, y_test):
    
    # Initialize the model
    model = RandomForestClassifier(random_state=42)
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions on the test set
    y_pred = model.predict(X_test)
    
    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    
    # Report the evaluation metrics and return the fitted model
    print("Accuracy:", accuracy)
    print("Classification Report:\n", report)
    return model
    
# Split the data into features and target
X = dataset.drop(columns=['churn_in_3mos'])
y = dataset['churn_in_3mos']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


train_test_model(X_train, y_train, X_test, y_test)
    


Accuracy: 0.677
Classification Report:
               precision    recall  f1-score   support

           0       0.70      0.93      0.80      1377
           1       0.44      0.13      0.19       623

    accuracy                           0.68      2000
   macro avg       0.57      0.53      0.50      2000
weighted avg       0.62      0.68      0.61      2000
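
The report above gives accuracy, precision and recall, but not the AUC the task asks for. A minimal sketch of how it could be computed follows, using synthetic data from `make_classification` as a stand-in for the real feature table (an assumption, so the numbers it prints say nothing about the churn model). AUC is computed from predicted probabilities rather than hard labels; 0.5 corresponds to random ranking and 1.0 to perfect ranking. Recall on the churn class is shown as one other metric, since the report above shows only 0.13 of churners being caught.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn feature table
X, y = make_classification(
    n_samples=1000, n_features=6, weights=[0.7, 0.3], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# AUC needs scores, not labels: use the predicted probability of class 1
churn_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, churn_proba)

# Recall on the positive class at the default 0.5 threshold
recall = recall_score(y_test, model.predict(X_test))

print(f"AUC: {auc:.3f}  Recall (class 1): {recall:.3f}")
```

When applying this to the notebook's `dataset`, it is probably worth dropping `customer_id` from `X` first, since an identifier carries no real signal.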



### 🎯 Final Challenge: Explaining the Model

The last step is to make sure the client understands **why your model predicts** that certain customers will churn. 

Use an explainability framework to show which features are driving the predictions and ensure the solution is transparent and actionable!

💡 Hint: `shap` is one of the most widely used libraries to generate such insights.

In [2]:
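import shap
import matplotlib.pyplot as plt

def generate_shap_plot_feature_importance(model, train, test):
    # TreeExplainer handles tree ensembles such as RandomForestClassifier
    shap_values = shap.TreeExplainer(model).shap_values(test)
    # Keep the churn class (1); the return shape depends on the shap version
    if isinstance(shap_values, list):
        shap_values = shap_values[1]
    elif shap_values.ndim == 3:
        shap_values = shap_values[:, :, 1]
    # Bar chart of the mean absolute SHAP value per feature
    shap.summary_plot(shap_values, test, plot_type="bar")


# Run shap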

ModuleNotFoundError: No module named 'shap'

# 🤖 GenAI Challenge: Customer Complaints Categorization

# Data Overview
**Customer Complaints Data**: You are provided with a dataset containing customer complaints from various channels. Each complaint includes a text description of the customer's issue and a category.

 - Complaint: The actual text of the customer complaint (string)
 - Customer ID: Unique identifier of the customer submitting the complaint
 - Category: The assigned category for each complaint (e.g., Billing Issue, Service Disruption)
 
Your mission is to analyze each complaint, extract key information, and categorize the complaints into various actionable buckets to help the company take appropriate steps.

# Challenge: Extracting Advanced Features from Customer Complaints
In this challenge, you'll process the complaints to extract key insights, including:

- **Key Issues**: Extracting the most relevant keywords from the complaint that describe the main issue.
- **Sentiment Analysis**: Determining the sentiment expressed by the customer (neutral, negative, extremely negative).
- **Severity Rating**: Rating the severity of the complaint on a scale from 1 to 10, based on the impact and seriousness of the issue.

Then try to categorize each complaint into actionable buckets.



# Improving with LLMs
The only method we will use to improve the LLM's performance in this exercise will be through prompting. You will be tasked with crafting effective prompts to extract the required features from the complaints, rather than fine-tuning the model or making architectural changes.
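
Since prompting is the only lever, it can help to ask for a fixed output format and validate the reply before using it. The sketch below is one possible shape for that; the prompt wording, the JSON layout, and the helper names `build_prompt` and `parse_response` are all assumptions for illustration, and the reply is hand-written rather than real model output.

```python
import json

ALLOWED_SENTIMENTS = {"neutral", "negative", "extremely negative"}

def build_prompt(complaint: str) -> str:
    """Ask for the three features as one JSON object (wording is illustrative)."""
    return (
        "You are a telecom support analyst. Read the complaint and answer with "
        "a single JSON object with exactly these keys:\n"
        '  "key_issues": list of up to 5 short keywords\n'
        '  "sentiment": "neutral", "negative" or "extremely negative"\n'
        '  "severity": integer from 1 to 10\n'
        f"Complaint: {complaint}\nJSON:"
    )

def parse_response(text: str):
    """Return the validated dict, or None if the reply is unusable."""
    # Models often wrap the JSON in extra words, so cut out the outermost braces
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    if data.get("sentiment") not in ALLOWED_SENTIMENTS:
        return None
    severity = data.get("severity")
    if not isinstance(severity, int) or not 1 <= severity <= 10:
        return None
    return data

# Hand-written reply standing in for real model output
reply = 'Sure! {"key_issues": ["billing", "overcharge"], "sentiment": "negative", "severity": 6}'
print(parse_response(reply))
```

Returning `None` for malformed replies keeps bad rows visible, so they can be retried with a stricter prompt instead of silently polluting the results.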

In [62]:
import os
from langchain_huggingface import HuggingFaceEndpoint
import pandas as pd
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, Runnable
from multiprocessing import Pool
from tqdm import tqdm
from huggingface_hub import login

In [64]:
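# Read the token from the environment instead of hard-coding it in the notebook
# (HF_TOKEN is an assumed variable name; set it before running this cell)
login(token=os.environ.get("HF_TOKEN"), add_to_git_credential=True)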

In [66]:
df_complaints = pd.read_excel('data/complaints.xlsx')

In [69]:
# Instantiate the LLM
llm_model = "meta-llama/Llama-3.2-1B-Instruct"
hf_llm = HuggingFaceEndpoint(
    repo_id=llm_model,
    temperature=0.5,
    huggingfacehub_api_token=os.environ.get("HF_TOKEN"),  # assumed variable name
)

print(hf_llm)

HuggingFaceEndpoint
Params: {'endpoint_url': None, 'task': None, 'model_kwargs': {}}


# Question 1: 
- build the chain for the llm 
- create the prompt for the llm
- extract the key insights listed above from the complaints

In [6]:
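from langchain_core.prompts import PromptTemplate

# Class to extract features from complaints
class ExtractKeywords(Runnable):
    def invoke(self, input, config=None, **kwargs):
        # Collapse whitespace and expose the text under the prompt variable name
        return {"complaint": " ".join(str(input).split())}

# Prompt wording is an illustrative assumption; tune it for the chosen model
prompt = PromptTemplate.from_template(
    "For the telecom customer complaint below, return three lines:\n"
    "key_issues: up to 5 keywords\n"
    "sentiment: neutral, negative, or extremely negative\n"
    "severity: integer from 1 to 10\n\n"
    "Complaint: {complaint}\n"
)

# Define the transformation chain that is applied to each complaint
# - the chain starts by reading the complaint
# - the feature-extraction class is applied
# - the prompt is filled in and the result is given to the LLM
# - a parser is applied
chain = (
    RunnablePassthrough()
    | ExtractKeywords()
    | prompt
    | hf_llm
    | StrOutputParser()
)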





In [None]:

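# Function to process the whole DataFrame row by row
def process_df_with_chain(df):
    df = df.copy()
    # 'Complaint' follows the data overview; 'llm_output' is an assumed column name
    df["llm_output"] = [chain.invoke(text) for text in tqdm(df["Complaint"])]
    return df

df_complaints_processed = process_df_with_chain(df_complaints)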


# Question 2: Parallel Processing and Optimization:
In large datasets, processing customer complaints sequentially can be time-consuming. How would you optimize this task using parallel processing techniques? Can you identify potential challenges when scaling this approach, especially with regard to memory management and error handling?

In [None]:

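import concurrent.futures

# Function called from each thread to invoke the chain
def process_complaint_with_chain(complaint):
    try:
        return chain.invoke(complaint)
    except Exception as e:
        print(f"Error processing complaint: {e}")
        return None

# Function to parallelize the DataFrame processing using ThreadPoolExecutor
# (max_workers=8 is an assumed default; tune it to the endpoint's rate limits)
def process_df_with_chain_parallel(df, max_workers=8):
    df = df.copy()
    # executor.map preserves input order, so results line up with the rows
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(process_complaint_with_chain, df["Complaint"]))
    # 'llm_output' is an assumed column name
    df["llm_output"] = results
    return df

df_complaints_processed = process_df_with_chain_parallel(df_complaints)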
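
On the scaling challenges the question raises, two common mitigations are retrying transient failures with exponential backoff and working in batches so that only a bounded number of tasks is pending at once. The sketch below illustrates both with the standard library only; `call_with_retry`, `process_in_batches`, and the deliberately flaky stand-in function are hypothetical names, and the stand-in replaces the real chain call.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_with_retry(fn, arg, retries=3, base_delay=0.01):
    """Call fn(arg); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(retries):
        try:
            return fn(arg)
        except Exception:
            if attempt == retries - 1:
                return None  # give up, keep the row and mark it as missing
            time.sleep(base_delay * 2 ** attempt)

def process_in_batches(items, fn, batch_size=100, max_workers=4):
    """Submit one batch at a time so pending work stays bounded."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start in range(0, len(items), batch_size):
            batch = items[start:start + batch_size]
            # pool.map keeps the input order within each batch
            results.extend(pool.map(lambda x: call_with_retry(fn, x), batch))
    return results

# Stand-in for the chain call: fails the first time it sees each input
_seen = set()
def flaky_upper(text):
    if text not in _seen:
        _seen.add(text)
        raise RuntimeError("simulated transient error")
    return text.upper()

processed = process_in_batches(["slow internet", "wrong bill"], flaky_upper, batch_size=1)
print(processed)
```

Threads suit this workload because each call mostly waits on the network; the main limits in practice tend to be the endpoint's rate limits rather than CPU.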




# Question 3: Cluster the complaints into different buckets and suggest actionable insights

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans