**Review**

Hello James!

I'm happy to review your code today.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did an excellent job! The code is accepted. Based on your best model's ROC-AUC on the test set, you've got 4.5 SP. Congrats!

<div class="alert alert-success">
    Hi James! I have approved your initial report for the final project. Great work! Just wanted to add some helpful pointers:
    <br/><br/>
If end date is empty, assume that the customer is still with the company (no churn)
    <br/><br/>
    When merging dataframes, make sure that you do not miss any customers, since some IDs may not exist in certain tables. You can treat these missing cases as the customer not having signed up for the specific service
<br/><br/>
    Your target variable should be Churn / No Churn, this will be a binary classification task
<br/><br/>
    Make sure that the validation set is used to tune hyperparameters, good idea to use gridsearch
<br/><br/>
    There does indeed exist a class imbalance in the data, make sure this is accounted for
<br/><br/>
    Careful with data leakage: make sure not to include features that may be collinear (e.g., if using one-hot encoding for categorical variables, drop one of the columns) or that reference the target variable in some way (see the topic on data leakage)
<br/><br/>
    Good choice of models to try
<br/><br/>
    Very good EDA plan
<br/><br/>
    The goal of this project is to understand why customers churned or stopped their service. And we can use machine learning to identify features of the data that had high predictive power when determining which customers will churn. And if we know that, what can we do as a company to prevent ongoing churn (this part is more of a discussion question for your end
<br/><br/>
-Yervand, Data Science Tutor
</div>


<div class="alert alert-success">
<br/><br/>
    Hello Yervand, I believe I've gotten the project completed. I've been working on it on Jupyter and VSCode on my personal computer to see when the code isn't working. Could you assist me with how I needed to install libraries or whatever I was doing that kept giving me an error on the notebook?
-Student James 
</div>


# Predictive Churn Analysis for Telecom Services Using Machine Learning

## Project Description

This project aims to develop a predictive model to identify customers who are likely to continue or discontinue their telecom operator's services. The analysis will start with examining datasets related to contracts, personal details, and internet/phone service usage.

Comprehensive exploratory data analysis (EDA) will be conducted to identify trends and inform feature engineering. Categorical variables will be encoded using one-hot encoding, and new features will be created to capture customer behavior.

Boosting algorithms will be employed for their binary classification capabilities, and the predictive model will be fine-tuned through hyperparameter optimization. The primary metric for performance evaluation will be the AUC-ROC score, with a target of achieving 0.88 or higher.

The goal of the project is to enable the telecom operator to proactively identify customers at risk of churning and to send targeted promotions and customer retention plans to those individuals.

## Interconnect's Services

Interconnect mainly provides two types of services:

1. **Landline Communication**: The telephone can be connected to several lines simultaneously.
2. **Internet**: The network can be set up via a telephone line (DSL, digital subscriber line) or through a fiber optic cable.

Additional services offered by the company include:

- **Internet Security**: Antivirus software (DeviceProtection) and a malicious website blocker (OnlineSecurity)
- **Technical Support**: A dedicated technical support line (TechSupport)
- **Cloud Services**: Cloud file storage and data backup (OnlineBackup)
- **Streaming Services**: TV streaming (StreamingTV) and a movie directory (StreamingMovies)

Clients can choose either a monthly payment plan or sign a 1- or 2-year contract. They have various payment methods available and receive an electronic invoice after each transaction.

## Data Description

The data consists of files obtained from different sources:

- `contract.csv` — Contract information
- `personal.csv` — Client's personal data
- `internet.csv` — Information about internet services
- `phone.csv` — Information about telephone services

In each file, the column `customerID` contains a unique code assigned to each client.

## Import Libraries
### Initial Data Exploration

Import the datasets and combine them using the common column `customerID`.

In [None]:
# Install scikit-learn
!pip install scikit-learn
!pip install --upgrade scikit-learn

In [None]:
!pip install lightgbm
!pip install catboost
!pip install imbalanced-learn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import roc_auc_score, make_scorer

from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE

from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

In [None]:
#read in data - for personal computer
df_phone = pd.read_csv('/Users/James/Projects/Final Project/Datasets/phone.csv')
df_personal = pd.read_csv('/Users/James/Projects/Final Project/Datasets/personal.csv')
df_internet = pd.read_csv('/Users/James/Projects/Final Project/Datasets/internet.csv')
df_contract = pd.read_csv('/Users/James/Projects/Final Project/Datasets/contract.csv')

In [None]:
#read in data - for jupyter notebook
# df_phone = pd.read_csv('phone.csv')
# df_personal = pd.read_csv('personal.csv')
# df_internet = pd.read_csv('internet.csv')
# df_contract = pd.read_csv('contract.csv')
#read in data - for jupyter notebook
df_phone = pd.read_csv('/datasets/final_provider/phone.csv')
df_personal = pd.read_csv('/datasets/final_provider/personal.csv')
df_internet = pd.read_csv('/datasets/final_provider/internet.csv')
df_contract = pd.read_csv('/datasets/final_provider/contract.csv')

In [None]:
#display df_phone
display(df_phone.head())
df_phone.info()

In [None]:
#display df_personal
display(df_personal.head())
df_personal.info()

In [None]:
#display df_internet
display(df_internet.head())
df_internet.info()

In [None]:
#display df_contract
display(df_contract.head())
df_contract.info()

In [None]:
#create a merged df with all data based on customerID
df_all = pd.merge(df_contract, df_personal, on='customerID', how='outer')
df_all = pd.merge(df_all, df_phone, on='customerID', how='outer')
df_all = pd.merge(df_all, df_internet, on='customerID', how='outer')

df_all
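One way to confirm that an outer merge keeps every customer is to count the IDs after the merge and to use the `indicator` option of `pd.merge`. The sketch below runs on a tiny made-up dataset (the IDs and values are hypothetical, not taken from the project files).

```python
import pandas as pd

# Toy tables (hypothetical values): customer C3 has no internet record.
contract = pd.DataFrame({"customerID": ["C1", "C2", "C3"],
                         "Type": ["Month-to-month", "One year", "Two year"]})
internet = pd.DataFrame({"customerID": ["C1", "C2"],
                         "InternetService": ["DSL", "Fiber optic"]})

# An outer merge keeps customers that are missing from one of the tables.
merged = pd.merge(contract, internet, on="customerID", how="outer", indicator=True)

# Every customer from the contract table should still be present.
n_customers = merged["customerID"].nunique()
# 'left_only' marks customers that exist in contract but not in internet.
n_missing_internet = int((merged["_merge"] == "left_only").sum())

# A customer without a record did not sign up for the service.
merged["InternetService"] = merged["InternetService"].fillna("No")
print(merged.drop(columns="_merge"))
```

If the number of unique IDs after the merge equals the number of rows in the contract table, no customer was lost.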

### Data Cleaning and Feature Engineering

Here is what I will work on in this section:

- Fill in NA values.
- Lowercase column names.
- Change `totalcharges` feature from object to numeric.
- Use encoding to encode categorical features.
- Scale numerical features (`monthlycharges`, `totalcharges`, and `tenure`) using Min-Max scaling because the distribution is not normal.
- Add a column called `churn` based on the `enddate` column to create binary classification targets.
- Add a column for the number of months using the service.

In [None]:
#missing values come from customers without a phone or internet record, so mark them as 'No'
df_all = df_all.fillna('No')

In [None]:
#confirm that no null values remain after the fill
print(df_all.isnull().sum())

#keep only rows where TotalCharges is present
df_all = df_all[df_all['TotalCharges'].notna()]

In [None]:
#lowercase all column names 
df_all.columns = df_all.columns.str.lower()

In [None]:

#rename the columns to snake_case so the words are separated by underscores
column_names = ['customer_id', 'begin_date', 'end_date', 'type', 'paperless_billing', 'payment_method', 'monthly_charges', 'total_charges', 'gender', 'senior_citizen', 'partner', 'dependents', 'multiple_lines', 'internet_service', 'online_security', 'online_backup', 'device_protection', 'tech_support', 'streaming_tv', 'streaming_movies' ]

df_all.columns = column_names

In [None]:
#change totalcharges to numeric
df_all['total_charges'] = pd.to_numeric(df_all['total_charges'], errors='coerce')
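The effect of `errors='coerce'` is easiest to see on a small example: any value that cannot be parsed as a number becomes `NaN` instead of raising an error. The values below are hypothetical, and the blank string is only an assumption about what a non-numeric entry might look like.

```python
import pandas as pd

# Hypothetical raw values: the blank string cannot be parsed as a number.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors='coerce' turns unparseable entries into NaN instead of raising.
total = pd.to_numeric(raw, errors="coerce")

# Count how many entries could not be converted.
n_unparsed = int(total.isna().sum())
print(total)
print("unparsed values:", n_unparsed)
```

Counting the resulting `NaN` values right after the conversion shows how many rows will need to be dropped or filled later.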

In [None]:
#encode categorical features

#instantiate label encoder
le = LabelEncoder()

#create list of columns that need to be encoded
column_list = ['paperless_billing', 'gender', 'senior_citizen', 'partner', 'dependents', 'multiple_lines', 'online_security', 'online_backup', 'device_protection', 'tech_support', 'streaming_tv', 'streaming_movies']

#using a for loop to change all columns in the list to label encoded
for column in column_list:
    df_all[column] = le.fit_transform(df_all[column])
    print(column, le.classes_)

In [None]:
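#OneHotEncoder outputs its columns in sorted (alphabetical) category order, see ohe.categories_,
#so the labels below are listed in that order rather than in value_counts order

#internet service labels: 'DSL', 'Fiber optic', 'No'
internet_service_list = ['dsl', 'fiber_optic', 'no']

#payment method labels: 'Bank transfer', 'Credit card', 'Electronic check', 'Mailed check'
payment_method_list = ['bank_transfer', 'credit_card', 'electronic_check', 'mailed_check']

#contract type labels: 'Month-to-month', 'One year', 'Two year'
contract_list = ['Month_to_month', 'one_year', 'two_year']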

In [None]:
import sklearn
print(sklearn.__version__)

In [None]:
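#onehot encode categorical features
#instantiate onehot encoder; calling .toarray() on the result gives a dense array on any scikit-learn version
#(the sparse=False argument was renamed to sparse_output in newer releases)
ohe = OneHotEncoder()

#fit and transform payment_method, internet_service, and type
payment_method = ohe.fit_transform(df_all[['payment_method']]).toarray()
internet_service = ohe.fit_transform(df_all[['internet_service']]).toarray()
type_ = ohe.fit_transform(df_all[['type']]).toarray()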

In [None]:
#turn the transformed columns into dataframes, keeping df_all's index so the rows line up in concat
payment_method = pd.DataFrame(payment_method, columns=payment_method_list, index=df_all.index)
internet_service = pd.DataFrame(internet_service, columns=internet_service_list, index=df_all.index)
type_ = pd.DataFrame(type_, columns=contract_list, index=df_all.index)

#drop the original columns
df_all = df_all.drop(['payment_method', 'internet_service', 'type'], axis=1)

#concat the dataframes together
df_all = pd.concat([df_all, payment_method, internet_service, type_], axis=1)
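Because the encoded array carries no column names, any labels attached to it have to follow the encoder's own category order, which `OneHotEncoder` stores in `categories_`. The sketch below shows one way to take the labels directly from the fitted encoder; the rows are hypothetical toy values.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy column (hypothetical rows), deliberately not in alphabetical order.
df = pd.DataFrame({"payment_method": ["Mailed check", "Electronic check",
                                      "Bank transfer (automatic)", "Electronic check"]})

ohe = OneHotEncoder()
# .toarray() converts the sparse result into a dense NumPy array.
encoded = ohe.fit_transform(df[["payment_method"]]).toarray()

# categories_ lists the categories in the same order as the output columns.
labels = list(ohe.categories_[0])
encoded_df = pd.DataFrame(encoded, columns=labels, index=df.index)
print(encoded_df)
```

Reading the labels from `categories_` avoids having to keep a hand-written list in sync with the encoder.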

In [None]:
#create a new column called churn that is 1 if there is a date and 0 if it says no based on enddate column
df_all['churn'] = np.where(df_all['end_date'] == 'No', 0, 1)

In [None]:
#for active customers ('No'), use the dataset snapshot date 2020-02-01 as the end date
df_all['end_date'] = np.where(df_all['end_date'] == 'No', '2020-02-01', df_all['end_date'])

#change begindate and enddate to datetime
df_all['begin_date'] = pd.to_datetime(df_all['begin_date'])
df_all['end_date'] = pd.to_datetime(df_all['end_date'])

#create a new column called tenure that is the difference between enddate and begindate
df_all['tenure'] = df_all['end_date'] - df_all['begin_date']

In [None]:
#change tenure to numeric and divide by 30 to get tenure in months 
df_all['tenure'] = (df_all['tenure'].dt.days/30).round(0)
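The target and tenure logic can be checked by hand on two made-up customers. This is a minimal sketch with hypothetical dates; it assumes, as in this notebook, that `'No'` marks an active customer and that 2020-02-01 is the snapshot date.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 'No' marks a customer who is still active.
df = pd.DataFrame({"begin_date": ["2019-01-01", "2019-10-01"],
                   "end_date": ["No", "2019-12-01"]})

# Target: 1 if the customer has an end date, 0 otherwise.
df["churn"] = np.where(df["end_date"] == "No", 0, 1)

# Active customers are measured up to the snapshot date.
snapshot = "2020-02-01"
end = pd.to_datetime(df["end_date"].replace("No", snapshot))
begin = pd.to_datetime(df["begin_date"])

# Approximate months as days / 30, rounded to the nearest whole number.
df["tenure"] = ((end - begin).dt.days / 30).round(0)
print(df)
```

The first customer has been active for 396 days (about 13 months), and the second left after 61 days (about 2 months).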

### Exploratory Data Analysis (EDA)

For the EDA, my strategy is to investigate if each feature has a correlation with whether or not a customer is likely to churn. This will be done using visual methods to better understand the relationships between features and churn.

To achieve this, I will:

- Analyze each feature in relation to churn.
- Use visualizations to identify patterns and correlations.
- Check the imbalance of the dataset.

#### Steps for EDA:

1. **Analyze Each Feature in Relation to Churn**:
   - Examine individual features to see how they correlate with churn.

2. **Use Visualizations**:
   - Create visualizations such as bar plots, histograms, box plots, and scatter plots to explore the relationships between features and churn.

3. **Check Dataset Imbalance**:
   - Assess the balance between the churned and non-churned classes to understand the distribution of the target variable.


In [None]:
#see how many people churned vs how many didnt 
df_all['churn'].value_counts()

So we have quite an unbalanced dataset where a majority of people are still active on the service. I will use a method like SMOTE later on after creating train and test sets.
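The size of the imbalance is easier to judge as a share than as raw counts. A minimal sketch on a hypothetical toy target:

```python
import pandas as pd

# Toy target (hypothetical): 1 = churned, 0 = still active.
churn = pd.Series([0, 0, 0, 1, 0, 0, 1, 0])

# normalize=True returns class shares instead of raw counts.
shares = churn.value_counts(normalize=True)

# For a 0/1 target, the mean equals the share of the positive class.
churn_rate = churn.mean()
print(shares)
print("churn rate:", churn_rate)
```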

### When do people generally leave the service?

To analyze when people generally leave the service, we will explore the distribution of churn over different features such as tenure and contract type. This will help us understand the typical behavior of customers who leave the service.


In [None]:
#turn end_date into a datetime object
df_all['end_date'] = pd.to_datetime(df_all['end_date'])

#create a new column called month that is the month of the end_date
df_all['month'] = df_all['end_date'].dt.month

#count the terminations in each month
df_all['month'].value_counts()

We can observe that terminations are fairly evenly distributed across the months in the dataset. Note that February represents the end of the dataset period, so these users are still active.

### How many months does the average person use the service before leaving?

In [None]:
#for all people that have churned, lets see how long they were with the company using a histogram
plt.hist(df_all[df_all['churn'] == 1]['tenure'], bins=75)
#title the plot number of months with company for people that churned
plt.title('Number of months with company for people that churned')
plt.xlabel('Number of months')
plt.ylabel('Number of people')


plt.show()

The data reveals that the majority of customers who left the service did so within the first few months. This indicates a strong correlation between the number of months a customer has been using the service and their likelihood of leaving. Therefore, it is crucial to include this feature in our predictive model.

To improve retention, it may be necessary to target newer members with more aggressive promotions or ad campaigns, as they are the most likely to cancel early on.

### How much does the average user spend? 

Next, we will examine the spending habits of users, comparing the differences between churned and active customers.

In [None]:
#create plot between monthly charges and churn for people that churned and didnt churn
plt.boxplot([df_all[df_all['churn'] == 1]['monthly_charges'], df_all[df_all['churn'] == 0]['monthly_charges']])

#title the plot monthly charges for people that churned and didnt churn
plt.title('Monthly charges for people that churned and didnt churn')
plt.ylabel('Monthly charges')

#label the x axis
plt.xticks([1, 2], ['Churned', 'Didnt Churn'])

plt.show()

#delete rows with null values in total_charges
df_all = df_all[df_all['total_charges'].notna()]


#create boxplot between total charges and churn for people that churned and didnt churn
plt.boxplot([df_all[df_all['churn'] == 1]['total_charges'], df_all[df_all['churn'] == 0]['total_charges']])
#title the plot total charges for people that churned and didnt churn
plt.title('Total charges for people that churned and didnt churn')
plt.ylabel('Total charges')

#label the x axis
plt.xticks([1, 2], ['Churned', 'Didnt Churn'])

plt.show()


From this boxplot, we can see that, on average, people who leave the service spend more money per month, as indicated by the higher median and interquartile range.

For total charges, the trend is reversed. Active users spend more in total, which makes sense since they have likely been using the service for a longer period. An interesting observation is that churned users have many more outliers with high spending.

This information is significant because it may indicate that high monthly spenders are more likely to be dissatisfied with the service or a specific expensive add-on.

### Churn Based on Additional Services Purchased or Not

In [None]:
#create a 2x3 plot
fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(16, 8))

#use groupby to see the mean of churn for each column and then plot it
df_all.groupby('online_security')['churn'].mean().plot(kind='bar', ax=axs[0, 0])
df_all.groupby('online_backup')['churn'].mean().plot(kind='bar', ax=axs[0, 1])
df_all.groupby('device_protection')['churn'].mean().plot(kind='bar', ax=axs[0, 2])
df_all.groupby('tech_support')['churn'].mean().plot(kind='bar', ax=axs[1, 0])
df_all.groupby('streaming_tv')['churn'].mean().plot(kind='bar', ax=axs[1, 1])
df_all.groupby('streaming_movies')['churn'].mean().plot(kind='bar', ax=axs[1, 2])

plt.show()

### Churn Based on Additional Services

The chart above illustrates the churn rate for each type of extra service offered. We observe that services such as online security, backup, device protection, and tech support show a significantly higher churn rate among customers who do not purchase these additional services.

However, for streaming services, customers who subscribe to them have higher churn rates. This may correlate with our previous observation that higher spenders are more likely to leave. It suggests that users who rely on streaming services may be experiencing dissatisfaction.

In summary, adding extra services, except for streaming services, appears to decrease the likelihood of a customer leaving. These findings highlight the importance of including these features in our model training.
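The bar charts rely on the fact that the mean of a 0/1 target within a group is that group's churn rate. A small sketch with hypothetical values makes the calculation explicit:

```python
import pandas as pd

# Toy data (hypothetical): 1 = has the add-on / churned, 0 = otherwise.
df = pd.DataFrame({"tech_support": [0, 0, 0, 0, 1, 1, 1, 1],
                   "churn":        [1, 1, 0, 0, 1, 0, 0, 0]})

# Mean of the 0/1 target per group = share of churned customers in that group.
rate_by_support = df.groupby("tech_support")["churn"].mean()
print(rate_by_support)
```

In this toy example, customers without tech support churn at 50% and customers with it at 25%.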

### Churn Differences Between Types of People

Next, we will analyze churn differences based on various demographic factors such as gender, senior status, whether or not they have partners, and if they have dependents.


In [None]:
#create a 2x2 plot
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(16, 8))

#use groupby to see the mean of churn for each column and then plot it
df_all.groupby('gender')['churn'].mean().plot(kind='bar', ax=axs[0, 0])
df_all.groupby('senior_citizen')['churn'].mean().plot(kind='bar', ax=axs[0, 1])
df_all.groupby('partner')['churn'].mean().plot(kind='bar', ax=axs[1, 0])
df_all.groupby('dependents')['churn'].mean().plot(kind='bar', ax=axs[1, 1])

plt.show()

### Churn Differences Between Types of People

Senior citizens have a higher rate of leaving. However, couples and families have lower rates of leaving. This makes sense because couples and families typically have more than one member, making it more of an effort to change multiple people's service provider.

### Differences in Churn Rates Based on the Type of Payment

Next, we will examine the differences in churn rates based on the type of payment, including factors such as paperless billing and the payment method used.

In [None]:
#create 1x1 plot
fig, axs = plt.subplots(nrows=1, ncols=1, figsize=(16, 8))

#use groupby to see the mean of churn for each column and then plot it
df_all.groupby('paperless_billing')['churn'].mean().plot(kind='bar', ax=axs)

plt.show()

It seems that users who have paperless billing have higher rates of leaving. I'm not sure why this is the case, but it still indicates that it is an important piece of information.

### Differences in Churn Rates Based on the Payment Method

Next, we will examine the churn rate for each payment method.

In [None]:
#create a single plot
fig, axs = plt.subplots(nrows=1, ncols=1, figsize=(16, 8))

#payment method is one-hot encoded, so compute the mean churn among customers flagged in each column
payment_cols = ['bank_transfer', 'credit_card', 'electronic_check', 'mailed_check']
churn_by_payment = pd.Series({col: df_all.loc[df_all[col] == 1, 'churn'].mean() for col in payment_cols})

#the bars are labeled by column name, so the labels always follow the data
churn_by_payment.plot(kind='bar', ax=axs)

#title the plot mean churn for each payment method
plt.title('Mean churn for each payment method')
plt.ylabel('Mean churn')
plt.xlabel('Payment method')

plt.show()

It appears that customers who use credit cards have a higher churn rate, which is an interesting observation.

### EDA Conclusion

In conclusion, several distinct correlations between churn rates and various features in the dataset have been identified.

The most notable finding is that the majority of users who left the service did so within the first few months. To address this, the company could implement aggressive promotions and discounts targeting newer users to improve retention. Additionally, offering promotions for additional services like online security could be beneficial, as customers who purchase add-ons are more likely to remain with the service.

# WORK PLAN

### Clarifying Questions

1. **Definition of Churn**: What defines churn to the company? For example, how many months of inactivity classify a customer as churned?
2. **Model Constraints**: Are there any constraints on the types of models that can be used, such as interpretability requirements or computational limits?
3. **Evaluation Metrics**: Besides the AUC-ROC score, are there any other evaluation metrics that should be considered, such as precision, recall, or F1-score, especially given the class imbalance?
4. **High-Risk Customers**: Do we need to identify high-risk customers, and how frequently will the data be updated?

### Rough Plan for Solving the Task   

1. **Data Cleaning and Preprocessing**:   COMPLETED Please provide feedback or improvement
   - **Objective**: Ensure that the dataset is complete and in a format suitable for analysis.
   - **Steps**: 
     - Fill in missing values.
     - Standardize column names.
     - Convert necessary columns to the appropriate data types.
     - Encode categorical variables.

2. **Exploratory Data Analysis (EDA)**:  COMPLETED Please provide feedback or improvement
   - **Objective**: Identify trends, correlations, and patterns within the data that are relevant to customer churn.
   - **Steps**:
     - Visualize the distribution of key features in relation to churn.
     - Analyze the balance of the dataset and investigate class imbalances.
     - Identify potential features that correlate strongly with churn, such as tenure, monthly charges, and additional services.

3. **Feature Engineering**:
   - **Objective**: Create new features and refine existing ones to improve the model’s predictive power.
   - **Steps**:
     - Create binary indicators for whether customers use additional services.
     - Calculate the tenure of customers based on their contract dates.
     - Scale numerical features using Min-Max scaling.

4. **Model Building and Evaluation**:
   - **Objective**: Develop a predictive model to identify customers likely to churn and evaluate its performance.
   - **Steps**:
     - Split the data into training and testing sets.
     - Use boosting algorithms like XGBoost, LightGBM, and CatBoost for classification.
     - Perform hyperparameter optimization using GridSearchCV.
     - Evaluate model performance using AUC-ROC score and other relevant metrics.

5. **Model Interpretation and Actionable Insights**:
   - **Objective**: Interpret the model results and derive actionable insights to help the business reduce churn.
   - **Steps**:
     - Identify key features that contribute to churn predictions.
     - Provide recommendations for targeted promotions and retention strategies based on model insights.
     - Create visualizations to communicate findings to stakeholders.


<div class="alert alert-success">
<b>Reviewer's comment</b>

Excellent work on data preprocessing and EDA! Dataframes were merged correctly

</div>

## Creating Dataset for Model and Final Modifications after EDA

1. **Remove Unnecessary Columns**:
   - Delete all unnecessary columns, including one column from each one-hot encoded feature.

2. **Split the Dataset**:
   - Use `train_test_split` to divide the dataset into training and testing sets.

3. **Balance the Dataset**:
   - Use SMOTE to balance the dataset and address class imbalance.

In [None]:
#categorize tenure into 3 categories. 0-6 months, 6-12 months, and 12+ months

#use np.where to create a new column called tenure_cat that is 0 if tenure is 0-6 months, 1 if tenure is 6-12 months, and 2 if tenure is 12+ months

df_all['tenure_cat'] = np.where(df_all['tenure'] <= 6, 0, np.where(df_all['tenure'] <= 12, 1, 2))
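As an alternative to nested `np.where` calls, the same three groups can be produced with `pd.cut`. The sketch below compares both approaches on hypothetical tenure values; it is an optional variation, not a change to the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical tenure values in months, including the boundary cases 6 and 12.
tenure = pd.Series([0, 6, 7, 12, 13, 40])

# Nested np.where: <=6 -> 0, <=12 -> 1, otherwise 2.
cat_where = np.where(tenure <= 6, 0, np.where(tenure <= 12, 1, 2))

# pd.cut with right-closed bins gives the same groups; labels=False returns 0, 1, 2.
cat_cut = pd.cut(tenure, bins=[-np.inf, 6, 12, np.inf], labels=False)
print(list(cat_where), list(cat_cut))
```

Both approaches put a tenure of exactly 6 months in the first group and exactly 12 months in the second.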

In [None]:
#delete unnecessary columns
df_combined = df_all.drop(['gender','customer_id', 'month', 'begin_date', 'end_date', 'tenure'], axis=1)

#drop 1 of the columns from each of the onehot encoded columns
df_combined = df_combined.drop(['fiber_optic', 'Month_to_month', 'electronic_check'], axis=1)

In [None]:
df_combined.columns

<div class="alert alert-success">
<b>Reviewer's comment</b>

The features look good, there doesn't seem to be any data leakage

</div>

In [None]:
#reset the index after rows were dropped
df_combined = df_combined.reset_index(drop=True)

In [None]:
X = df_combined.drop('churn', axis=1)
y = df_combined['churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12345)

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data split is reasonable

</div>

In [None]:
print(X_train.shape)
print(X_test.shape)
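With an imbalanced target, one optional refinement is to pass `stratify=y` so that both sets keep the same share of churned customers. The sketch below uses synthetic data and is a variation on the split above, not what the notebook ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): 100 rows, 20% positive class.
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 20 + [0] * 80)

# stratify=y preserves the class proportions in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=12345, stratify=y)

print("train churn share:", y_tr.mean())
print("test churn share:", y_te.mean())
```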

In [None]:
#scale monthly charges and total charges
scaler = StandardScaler()

#for cross validation datasets so i can use pipeline
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

#scale for the training monthly charges
scaler.fit(X_train_scaled[['monthly_charges']])
#now use fitted scaler to transform the training and test data
X_train_scaled['monthly_charges_s'] = scaler.transform(X_train_scaled[['monthly_charges']])
X_test_scaled['monthly_charges_s'] = scaler.transform(X_test_scaled[['monthly_charges']])

#do the same for total charges
scaler.fit(X_train_scaled[['total_charges']])
X_train_scaled['total_charges_s'] = scaler.transform(X_train_scaled[['total_charges']])
X_test_scaled['total_charges_s'] = scaler.transform(X_test_scaled[['total_charges']])

#drop the original columns
X_train_scaled = X_train_scaled.drop(['monthly_charges', 'total_charges'], axis=1)
X_test_scaled = X_test_scaled.drop(['monthly_charges', 'total_charges'], axis=1)

<div class="alert alert-success">
<b>Reviewer's comment</b>

Scaling was applied correctly

</div>

### Using SMOTE to Address the Imbalance Problem of Churn vs. No Churn

To ensure our model is not biased towards the majority class, we will use SMOTE (Synthetic Minority Over-sampling Technique) to balance the dataset by generating synthetic samples for the minority class (churned customers).
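The core idea behind SMOTE is that a synthetic sample is placed on the line segment between a minority sample and one of its nearest minority neighbors. The sketch below is a simplified NumPy illustration of that idea only; `smote_like_sample` is a hypothetical helper, not the imbalanced-learn implementation.

```python
import numpy as np

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point from a (n_samples, n_features) minority array.

    Simplified illustration of the SMOTE idea, not the imbalanced-learn code.
    """
    rng = np.random.default_rng(rng)
    i = rng.integers(len(minority))                  # pick a random minority sample
    dists = np.linalg.norm(minority - minority[i], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]           # k nearest, skipping the sample itself
    j = rng.choice(neighbors)                        # pick one of those neighbors
    gap = rng.random()                               # interpolation factor in [0, 1)
    return minority[i] + gap * (minority[j] - minority[i])

# Four hypothetical minority points on the corners of the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(minority, k=2, rng=0)
print(synthetic)
```

Because each new point lies between two existing minority points, the synthetic samples stay inside the region the minority class already occupies.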

In [None]:
#start smote with a fixed seed so the resampling is reproducible
smote = SMOTE(random_state=12345)

#fit smote on the scaled training data only
X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)

#check the columns of the resampled data
print(X_train_smote.columns)

In [None]:
#see how many people churned vs how many didnt 
y_train_smote.value_counts()

<div class="alert alert-success">
<b>Reviewer's comment</b>

Good idea to use oversampling to deal with class imbalance and it was correctly applied only to the train set

</div>

# Training Logistic Regression on Balanced Dataset

In [None]:
#logistic regression
reg = LogisticRegression()

reg.fit(X_train_smote, y_train_smote)
y_pred = reg.predict_proba(X_test_scaled)

#to see positive instances
y_proba_pos = y_pred[:, 1]

#score
print(roc_auc_score(y_test, y_proba_pos))
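ROC-AUC can be read as the probability that a randomly chosen churned customer receives a higher score than a randomly chosen active one, with ties counted as half. The sketch below computes it that way on toy scores; `pairwise_auc` is a hypothetical helper written for illustration.

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities (hypothetical values).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, scores)
print(auc)
```

Here three of the four positive/negative pairs are ordered correctly, so the score is 0.75. The metric depends only on the ranking of the scores, which is why `predict_proba` is used rather than `predict`.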

Test out different boosting models:

- LightGBM
- CatBoost

Use grid search to determine the best parameters.

In [None]:
#create a gridsearch function that uses SMOTE for balancing and standard scaler to scale values
def grid_search(X_train, y_train, model):

    #define params
    param_grid = {
        'classifier__learning_rate': [0.001, 0.01, 0.1, 0.5],
        'classifier__max_depth': [3, 5, 7, 10],
        'classifier__n_estimators': [100, 200, 500, 1000, 2000]
        }

    #create pipeline with custom model; scaling and SMOTE are refit inside each CV fold
    pipeline = ImbPipeline([
        ('scaler', StandardScaler()),
        ('smote', SMOTE(random_state=12345)),
        ('classifier', model)
    ])

    #define gridsearch; the built-in 'roc_auc' scorer works on predicted probabilities
    #and avoids make_scorer(..., needs_proba=True), which is deprecated in recent scikit-learn releases
    search = GridSearchCV(estimator=pipeline,
                          param_grid=param_grid,
                          scoring='roc_auc',
                          cv=5,
                          verbose=1,
                          n_jobs=-1)

    #fit the pipeline
    search.fit(X_train, y_train)

    # Best ROC-AUC score
    print("Best ROC-AUC Score:", search.best_score_)

    # Best parameters found
    print("Best Parameters:", search.best_params_)

<div class="alert alert-success">
<b>Reviewer's comment</b>

Very nice! You used pipelines to correctly apply scaling and oversampling for cross-validation

</div>
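Keeping preprocessing inside the pipeline matters because each cross-validation fold then refits the scaler on that fold's training part only, so nothing leaks from the validation part. The sketch below shows the pattern on synthetic data with a plain scikit-learn `Pipeline` and `LogisticRegression` as a stand-in model; the data, the model, and the parameter grid are assumptions for illustration, not the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data (hypothetical stand-in for the churn features).
X, y = make_classification(n_samples=300, n_features=6, weights=[0.75, 0.25],
                           random_state=12345)

# The scaler is a pipeline step, so it is refit on the training part of every fold.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Parameters of a step are addressed as <step name>__<parameter>.
param_grid = {"classifier__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid=param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)

print("Best ROC-AUC:", search.best_score_)
print("Best params:", search.best_params_)
```

The `<step name>__<parameter>` naming is the same convention used for the `classifier__...` keys in the grid above.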

In [None]:
#run gridsearch on LGBMClassifier
grid_search(X_train, y_train, LGBMClassifier())

In [None]:
#run gridsearch on catboost
grid_search(X_train, y_train, CatBoostClassifier())

In [None]:
#use best params to create new model
lgbm = LGBMClassifier(learning_rate=0.01, max_depth=5, n_estimators=200)

#fit
lgbm.fit(X_train_smote, y_train_smote)

#predict
y_pred = lgbm.predict_proba(X_test_scaled)

#to see positive instances
y_proba_pos = y_pred[:, 1]

#score
print(roc_auc_score(y_test, y_proba_pos))

In [None]:
#use best params to create new model
cat = CatBoostClassifier(learning_rate=0.01, max_depth=3, n_estimators=1000)

#fit
cat.fit(X_train_smote, y_train_smote)

y_pred = cat.predict_proba(X_test_scaled)

#to see positive instances
y_proba_pos = y_pred[:, 1]

#score
print(roc_auc_score(y_test, y_proba_pos))

<div class="alert alert-success">
<b>Reviewer's comment</b>

Great, you tried a few different models, tuned their hyperparameters using cross-validation and evaluated the final models on the test set.

</div>

## Final Scores and Conclusion

Throughout this project, we examined customer churn with the aim of identifying patterns that distinguish churned customers from active members. Our exploratory data analysis (EDA) revealed several patterns in the data and set the stage for our predictive models. One of the most notable was that a significant portion of churn occurred among the newest members (less than 5 months of active membership).

We employed several models, primarily focusing on gradient boosting due to the nonlinearity in the data. This approach led to models that effectively distinguish churned from active customers, achieving performance above the minimum ROC-AUC threshold of 0.75.

The best score we achieved was an AUC-ROC of 0.84 with the CatBoost classifier. The hyperparameters for this model were: learning_rate = 0.01, max_depth = 3, and n_estimators = 1000.

### Report Questions

**What steps of the plan were performed and what steps were skipped (explain why)?**

- **Performed Steps**: We conducted comprehensive EDA, data preprocessing, feature engineering, and model training and evaluation. We focused on gradient boosting models due to the data's nonlinearity.
- **Skipped Steps**: Certain advanced hyperparameter tuning methods were skipped due to computational resource limitations and time constraints.

**What difficulties did you encounter and how did you manage to solve them?**

- **Difficulties**: One major challenge was handling the class imbalance in the dataset. We addressed this by using techniques like SMOTE (Synthetic Minority Over-sampling Technique) to balance the training data.

**What were some of the key steps to solving the task?**

- Conducting thorough EDA to uncover important patterns in the data.
- Implementing effective data preprocessing and feature engineering.
- Employing gradient boosting models, particularly CatBoost, to handle nonlinearity and achieve high predictive performance.

**What is your final model and what quality score does it have?**

Our final model is the CatBoost classifier, which achieved an AUC-ROC score of 0.84.

### Report

**What steps of the plan were performed and what steps were skipped (explain why)?**

I believe I completed all the steps that I had planned, and there was nothing that I felt necessary to skip. Data preparation, exploratory data analysis (EDA), final modifications, and model training were all necessary to complete the project.

**What difficulties did you encounter and how did you manage to solve them?**

I encountered some difficulties in the data preparation segment of the project. I made a couple of errors in properly scaling and preparing the data for use in cross-validation, and in encoding without data leakage. These were relatively simple problems that didn't require a lot of time to solve but had the potential to drastically alter the results of the project if they hadn't been corrected.

**What were some of the key steps to solving the task?**

In each step of the project, there were key actions that contributed to solving the task:

- **EDA**: Understanding all the features was crucial since there were many that covered different aspects of each customer. Spending time to decipher each feature and create visual diagrams to show relationships between them and churn was essential for understanding the data.
- **Data Preparation**: Being thorough and meticulous in handling the data was important. Ensuring the data was scaled without leakage between training and test sets, using proper One-Hot Encoding (OHE) techniques, and employing appropriate sampling techniques were key steps.
- **Model Training**: Using GridSearchCV was pivotal in finding the best hyperparameters, making the process simpler and more efficient.

**What is your final model and what quality score does it have?**

My final model is the CatBoost classifier with a learning rate (lr) of 0.01, max depth of 3, and 1000 estimators. I achieved an ROC-AUC score of 0.844.