# Thought Process

We test 4 models:
- Logisitic Regression
- Decision Tree
- Random Forest
- Gradient Boosting

Each of the models were tested using several methods to improve their performance, including normalized/scaled features, sampling methods (SMOTE), feature selection, and hyperparameter tuning. 

Currently we evaluated the models based off the Churned class, which is represented by a 1. For the churned class we wanted to prioritize f1-score because it balances precision and recall. In churn dection we couldn't figure out if we wanted to prioritize identifying false positives or false negatives so we chose f1 to balance the two. 

# Logistic Regression Classifier

## Baseline Model 1

In [42]:
#Import pandas
import pandas as pd

#Read in original dataframe
df = pd.read_csv('telco_churn_encoded.csv')

#Read in normalized dataframe 
normalized_df = pd.read_csv('gnb_logreg_telco_churn.csv') #Normalized total_charges by applying a cube root transformation

#Check head
normalized_df.head()

Unnamed: 0,tenure_months,monthly_charges,total_charges,gender_male,senior_citizen_yes,partner_yes,dependents_yes,phone_service_yes,paperless_billing_yes,multiple_lines_no,...,streaming_tv_yes,streaming_movies_no,streaming_movies_yes,contract_month-to-month,contract_one_year,payment_method_bank_transfer,payment_method_electronic_check,payment_method_mailed_check,churn_value,cbrt_total_charges
0,2,53.85,108.15,1,0,0,0,1,1,1,...,0,1,0,1,0,0,0,1,1,4.764407
1,2,70.7,151.65,0,0,0,1,1,1,1,...,0,1,0,1,0,0,1,0,1,5.332704
2,8,99.65,820.5,0,0,0,1,1,1,0,...,1,0,1,1,0,0,1,0,1,9.361804
3,28,104.8,3046.05,0,0,1,1,1,1,0,...,1,0,1,1,0,0,1,0,1,14.495916
4,49,103.7,5036.3,1,0,0,1,1,1,0,...,1,0,1,1,0,1,0,0,1,17.141041


In [40]:
#Import libraries
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

In [43]:
#Scale the dataset
#Import min-max scaling from scikit-learn
from sklearn.preprocessing import MinMaxScaler

#Initialize list of columns to scale
scaled_cols = ['tenure_months', 'monthly_charges', 'total_charges', 'cbrt_total_charges']

#Initialize scaler
scaler = MinMaxScaler()

#Make a copy of the normalized dataframe, called scaled_df
scaled_df = normalized_df.copy()

#Apply the scaler to the scaled dataframe
scaled_df[scaled_cols] = scaler.fit_transform(normalized_df[scaled_cols])

#Check result
scaled_df[['tenure_months', 'monthly_charges', 'total_charges', 'cbrt_total_charges']].head()

Unnamed: 0,tenure_months,monthly_charges,total_charges,cbrt_total_charges
0,0.027778,0.354229,0.01031,0.117646
1,0.027778,0.521891,0.01533,0.149401
2,0.111111,0.80995,0.092511,0.374539
3,0.388889,0.861194,0.349325,0.661424
4,0.680556,0.850249,0.578987,0.809228


Documentation for min-max scaling: https://www.kaggle.com/code/alexisbcook/scaling-and-normalization

**Why it's important:** 

When we orignally tried to run the logisitc regression we were was getting a max iterations warning message. While not completely aware of what this error message meant, we understood that to fix this problem we had to scale our numerical features. What scaling does it change the range of the data. For Min-Max scaling specifically it changes the range of the data to zeros (0) and one's (1). What scaling allows us to do is to compare our numerical features on equal footing. Machine learning models, especially linear models might give more weight to larger numerical values as opposed to smaller numerical values. This is especially true with distance based algorithms such as K-Nearest Neighbors. 

In [14]:
#Prepare X and y (first I'll test without cube root total_charges)
X = scaled_df.drop(columns=['churn_value', 'cbrt_total_charges']) #Explanatory variables
y = scaled_df['churn_value'] #Response variable

#Conduct a train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize and fit logistic regression model
logreg = LogisticRegression(class_weight='balanced', random_state=42)

#Fit on training data
logreg.fit(X_train, y_train)

#Make predictions
y_pred_log = logreg.predict(X_test)

#Conduct 5-fold cross-validation 
scores = cross_val_score(logreg, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for logistic regression
accuracy = accuracy_score(y_test, y_pred_log) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_log))

Accuracy: 74.63%
Best Cross-Validation Score: 75.91

Classification Report: 
               precision    recall  f1-score   support

           0       0.91      0.72      0.80      1525
           1       0.53      0.81      0.64       588

    accuracy                           0.75      2113
   macro avg       0.72      0.77      0.72      2113
weighted avg       0.80      0.75      0.76      2113



In [15]:
#Prepare X and y (first I'll test without total_charges)
X = scaled_df.drop(columns=['churn_value', 'total_charges']) #Explanatory variables
y = scaled_df['churn_value'] #Response variable

#Conduct a train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize and fit logistic regression model
logreg = LogisticRegression(class_weight='balanced', random_state=42)

#Fit on training data
logreg.fit(X_train, y_train)

#Make predictions
y_pred_log = logreg.predict(X_test)

#Conduct 5-fold cross-validation 
scores = cross_val_score(logreg, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for logistic regression
accuracy = accuracy_score(y_test, y_pred_log) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_log))

Accuracy: 75.01%
Best Cross-Validation Score: 76.46

Classification Report: 
               precision    recall  f1-score   support

           0       0.90      0.73      0.81      1525
           1       0.53      0.80      0.64       588

    accuracy                           0.75      2113
   macro avg       0.72      0.77      0.72      2113
weighted avg       0.80      0.75      0.76      2113



In [None]:
#Prepare X and y (first I'll test without either)
X = scaled_df.drop(columns=['churn_value', 'total_charges', 'cbrt_total_charges']) #Explanatory variables
y = scaled_df['churn_value'] #Response variable

#Conduct a train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize and fit logistic regression model
logreg = LogisticRegression(class_weight='balanced', random_state=42)

#Fit on training data
logreg.fit(X_train, y_train)

#Make predictions
y_pred_log = logreg.predict(X_test)

#Conduct 5-fold cross-validation 
scores = cross_val_score(logreg, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for logistic regression
accuracy = accuracy_score(y_test, y_pred_log) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_log))

Accuracy: 74.73%
Best Cross-Validation Score: 75.85

Classification Report: 
               precision    recall  f1-score   support

           0       0.91      0.72      0.80      1525
           1       0.53      0.81      0.64       588

    accuracy                           0.75      2113
   macro avg       0.72      0.77      0.72      2113
weighted avg       0.80      0.75      0.76      2113



**Note:** Not much difference in model performance when using total charges and it's cubed version. The purpose behind testing the different features of total charges is because from our earlier analysis we saw that total charges was highly correlated with several other features. Thus we thought it might not have been that important of a feature to the model. However, the differences in performance wasn't anything substantial. 

## Model 2: Sampling Methods

### Synthetic Minority Oversampling Technique (SMOTE)

In [None]:
#Import libraries
from collections import Counter #Use to count the class distribution of our response variable
from imblearn.over_sampling import SMOTE

**What is SMOTE:**

**Synthetic Minority Oversampling** is a sampling technique that is used to address class imbalance within machine learning classification tasks. Given our current dataset of churns and from our EDA we saw that around 26% of customers churned, meaning there is a substantial class imbalance. What class imbalance does is introduce bias into the model against the minority class (churned customers), meaning that the performance metrics for predicted churn is worse than it is for customers predicted not to churn.

What **SMOTE** does to counteract this is create synthetic samples of the minority class (churned customers) to make the class distribution even. It works by generating new, similar examples based on existing ones rather than just duplicating them. The goal is to increase performance for predicted churn customers.

**Documentation:** https://www.geeksforgeeks.org/smote-for-imbalanced-classification-with-python/

In [22]:
#Prepare X and y
X = scaled_df.drop(columns=['churn_value', 'total_charges']) #Explanatory variables
y = scaled_df['churn_value'] #Response variable

#Re-conduct train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Display class balance before SMOTE
print(f"Class Distribution before SMOTE: {Counter(y_train)}")

#Apply SMOTE to oversample the minority class
smote = SMOTE(sampling_strategy='minority', random_state=42) #Initialize SMOTE
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

#Display class balance after SMOTE
print(f"Class Distribution after SMOTE: {Counter(y_train_res)}")

#Initialize and fit logistic regression model
logreg = LogisticRegression(random_state=42)

#Fit on training data
logreg.fit(X_train_res, y_train_res)

#Make predictions
y_pred_log = logreg.predict(X_test)

#Conduct 5-fold cross-validation 
scores = cross_val_score(logreg, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for logistic regression
accuracy = accuracy_score(y_test, y_pred_log) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_log))

Class Distribution before SMOTE: Counter({0: 3649, 1: 1281})
Class Distribution after SMOTE: Counter({1: 3649, 0: 3649})
Accuracy: 76.10%
Best Cross-Validation Score: 81.14

Classification Report: 
               precision    recall  f1-score   support

           0       0.90      0.76      0.82      1525
           1       0.55      0.77      0.64       588

    accuracy                           0.76      2113
   macro avg       0.72      0.76      0.73      2113
weighted avg       0.80      0.76      0.77      2113



Doesn't appear to be much of a effect for SMOTE on the logisitic regression model. 

### Random Under-Sampling

**What is Random Under-Sampling:**

**Random Under-Sampling** is another sampling technique used in classification tasks where there is a class imbalance. As opposed to SMOTE instead of creating new instances of the minority class (churned customers), random undersampling randomly removes rows of data from the majority class (customers who didn't churn), so that the numbers of each class align. 

**Documentation:** https://www.geeksforgeeks.org/handling-imbalanced-data-for-classification/

In [None]:
#Import imblearn under_sampling library
from imblearn.under_sampling import RandomUnderSampler

#Prepare X and y 
X = scaled_df.drop(columns=['churn_value', 'total_charges']) #Explanatory variables
y = scaled_df['churn_value'] #Response variable

#Conduct train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Display class balance before under-sampling
print(f"Class Distribution before under-sampling: {Counter(y_train)}")

#Apply random under-sampling to reduce majority class
underS = RandomUnderSampler(random_state=42, replacement=True)

#Redo train split
X_under, y_under = underS.fit_resample(X_train, y_train)

#Display class balance after under-sampling
print(f"Class Distribution after under-sampling: {Counter(y_under)}")

#Initialize and fit logistic regression model
logreg = LogisticRegression(random_state=42)

#Fit on training data
logreg.fit(X_under, y_under)

#Make predictions
y_pred_log = logreg.predict(X_test)

#Conduct 5-fold cross-validation 
scores = cross_val_score(logreg, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for logistic regression
accuracy = accuracy_score(y_test, y_pred_log) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_log))

Class Distribution before under-sampling: Counter({0: 3649, 1: 1281})
Class Distribution after under-sampling: Counter({0: 1281, 1: 1281})
Accuracy: 75.06%
Best Cross-Validation Score: 81.14

Classification Report: 
               precision    recall  f1-score   support

           0       0.91      0.73      0.81      1525
           1       0.53      0.81      0.64       588

    accuracy                           0.75      2113
   macro avg       0.72      0.77      0.73      2113
weighted avg       0.81      0.75      0.76      2113



**Note:** Compared to SMOTE, random undersampling slightly boosted the recall for churned customers, meaning the model was better at correctly identifying customers who actually churned, reducing the number of false negatives.

## Model 3: Feature Selection

In [44]:
#Prepare X and y
X = scaled_df.drop(columns=['churn_value', 'total_charges']) #Explanatory variables
y = scaled_df['churn_value'] #Response variable

### Filter Method

In [45]:
#Import filter method
from sklearn.feature_selection import SelectKBest, mutual_info_classif 

**Explanation for SelectKBest (Filter Method):**

`SelectKBest` is a filter based feature selection method. What it does is select features independently from the machine learning algorithm. It does this based off several statistical measures to score and rank features. We chose to use mutual information classifier as the statistical test to feed into the filter method. 

**Explanation for Mutual Information Classifier:**

The reason I'm using mutual information classifier on the filter method is because it's best for datasets that have a mix of numerical and cateogorial features. Although I do have more categorical than numerical it's still is useful in this situation. 

**Documentation:** https://medium.com/@Kavya2099/optimizing-performance-selectkbest-for-efficient-feature-selection-in-machine-learning-3b635905ed48

In [46]:
#Find the best features using the filter method

#Use mutual info classifier due to combo of categorical and numerical features 
selector = SelectKBest(mutual_info_classif, k=15) #Select top 15 features
X_new = selector.fit_transform(X, y)

#Print out selected features
selected_features = X.columns[selector.get_support()]
print(f"Selected Features: {selected_features}")

Selected Features: Index(['tenure_months', 'monthly_charges', 'partner_yes', 'dependents_yes',
       'internet_service_fiber_optic', 'online_security_no',
       'online_backup_no', 'device_protection_no', 'tech_support_no',
       'streaming_tv_no', 'streaming_movies_no', 'contract_month-to-month',
       'contract_one_year', 'payment_method_electronic_check',
       'cbrt_total_charges'],
      dtype='object')


In [47]:
#Import libraries
from imblearn.over_sampling import SMOTE
from collections import Counter

#Prepare X and y
X = scaled_df[selected_features] #Explanatory variables
y = scaled_df['churn_value'] #Response variable

#Re-conduct train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Display class balance before SMOTE
print(f"Class Distribution before SMOTE: {Counter(y_train)}")

#Apply SMOTE to oversample the minority class
smote = SMOTE(sampling_strategy='minority', random_state=42) #Initialize SMOTE
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

#Display class balance after SMOTE
print(f"Class Distribution after SMOTE: {Counter(y_train_res)}")

#Initialize and fit logistic regression model
logreg = LogisticRegression(random_state=42)

#Fit on training data
logreg.fit(X_train_res, y_train_res)

#Make predictions
y_pred_log = logreg.predict(X_test)

#Conduct 5-fold cross-validation 
scores = cross_val_score(logreg, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for logistic regression
accuracy = accuracy_score(y_test, y_pred_log) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_log))

Class Distribution before SMOTE: Counter({0: 3649, 1: 1281})
Class Distribution after SMOTE: Counter({1: 3649, 0: 3649})
Accuracy: 75.44%
Best Cross-Validation Score: 80.79

Classification Report: 
               precision    recall  f1-score   support

           0       0.91      0.73      0.81      1525
           1       0.54      0.81      0.65       588

    accuracy                           0.75      2113
   macro avg       0.72      0.77      0.73      2113
weighted avg       0.81      0.75      0.77      2113



## Wrapper Method

In [52]:
#Import wrapper method
from sklearn.feature_selection import RFE

#Initialize logisitic regression model
log_reg = LogisticRegression(random_state=42)

#Initialize RFE and select top 15 features
rfe = RFE(log_reg, n_features_to_select=15)
rfe.fit(X, y) #Fit on X and y

#Print out selected features
selected_features = X.columns[rfe.support_]
print(f"Selected Features using the Wrapper Method: {selected_features}")

#Prepare X 
X_selected = scaled_df[selected_features]

#Conduct train-test split
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, train_size=0.7, random_state=42)

#Display class balance before SMOTE
print(f"Class Distribution before SMOTE: {Counter(y_train)}")

#Apply SMOTE to oversample the minority class
smote = SMOTE(sampling_strategy='minority', random_state=42) #Initialize SMOTE
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

#Display class balance after SMOTE
print(f"Class Distribution after SMOTE: {Counter(y_train_res)}")

#Fit on training data
log_reg.fit(X_train_res, y_train_res)

#Make predictions
y_pred_tree = log_reg.predict(X_test)

#Conduct 5-fold cross-validation 
scores = cross_val_score(log_reg, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for decision tree
accuracy = accuracy_score(y_test, y_pred_tree) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_tree))

Selected Features using the Wrapper Method: Index(['tenure_months', 'monthly_charges', 'partner_yes', 'dependents_yes',
       'internet_service_fiber_optic', 'online_security_no',
       'online_backup_no', 'device_protection_no', 'tech_support_no',
       'streaming_tv_no', 'streaming_movies_no', 'contract_month-to-month',
       'contract_one_year', 'payment_method_electronic_check',
       'cbrt_total_charges'],
      dtype='object')
Class Distribution before SMOTE: Counter({0: 3649, 1: 1281})
Class Distribution after SMOTE: Counter({1: 3649, 0: 3649})
Accuracy: 75.44%
Best Cross-Validation Score: 80.79

Classification Report: 
               precision    recall  f1-score   support

           0       0.91      0.73      0.81      1525
           1       0.54      0.81      0.65       588

    accuracy                           0.75      2113
   macro avg       0.72      0.77      0.73      2113
weighted avg       0.81      0.75      0.77      2113



**Explanation of Wrapper Method:**

The wrapper method or Recursive Feature Elimination (RFE) is a commonly used feature selection method. What the wrapper method/RFE does, is train the model based off the features, then iteratively remove the least important features to the model. It does this one by one until left with the specified number of features. Where the wrapper method differs from the filter method is that is ranks features based off how well they individually boost the models performance as opposed to statistical tests. 


**Result from classification report:**

Very similar result to the filter method. Both model's can be considered the same. 

**Documentation:** https://medium.com/@rithpansanga/logistic-regression-for-feature-selection-selecting-the-right-features-for-your-model-410ca093c5e0

## Logistic Regression with Hyperparameter Tuning

**What is GridSearchCV:**

GridSearchCV is a method from scikit-learn which works through multiple combinations of parameter tuning to provide the best set of parameters from the defined list of parameters (`param_grid`). Overall it's used to optimize the model's performance by providing the best combination of hyperparameters. 

**Documentation:** https://www.geeksforgeeks.org/how-to-optimize-logistic-regression-performance/

**Note:**

I used the same parameter grid from the geeksforgeeks link I posted above. I honestly have never used GridSearchCV for logisitic regression before so I wasn't too aware of the parameter tuning. For a more in depth look at each of the parameters of a logisitic regression look at the geeksforgeeks link above and read from scikit-learns documentation on logisitic regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [4]:
#Import GridSearchCV
from sklearn.model_selection import GridSearchCV

#Import numpy
import numpy as np

#Define a parameter grid
param_grid = {
    'penalty':['l1','l2','elasticnet','none'],
    'C' : np.logspace(-4,4,20),
    'solver': ['lbfgs','newton-cg','liblinear','sag','saga'],
    'max_iter'  : [100,1000,2500,5000]
}

#Prepare X and y
X = scaled_df.drop(columns=['churn_value', 'total_charges'])
y = scaled_df['churn_value']

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize logisitc regression model
logreg = LogisticRegression()

#Initialize GridSearchCV
grid_search = GridSearchCV(logreg, param_grid=param_grid, cv=5, verbose=True, n_jobs=1)

#Fit grid to training data
grid_search.fit(X_train, y_train)

#Get best hyperparameters
print(f"Best Hyperparameters: {grid_search.best_params_}")

#Put the best parameters on the logreg model
logreg = grid_search.best_estimator_

#Make predictions on testing data
y_pred_log = logreg.predict(X_test)

#Get best cross-val score
score_mean = grid_search.best_score_ * 100

#Display accuracy and classification
accuracy = accuracy_score(y_test, y_pred_log) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_log))

Fitting 5 folds for each of 1600 candidates, totalling 8000 fits




Best Hyperparameters: {'C': np.float64(0.615848211066026), 'max_iter': 100, 'penalty': 'l2', 'solver': 'lbfgs'}
Accuracy: 81.50%
Best Cross-Validation Score: 81.16

Classification Report: 
               precision    recall  f1-score   support

           0       0.85      0.90      0.88      1525
           1       0.70      0.59      0.64       588

    accuracy                           0.81      2113
   macro avg       0.77      0.75      0.76      2113
weighted avg       0.81      0.81      0.81      2113



5200 fits failed out of a total of 8000.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
400 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\KRAyu\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\KRAyu\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\KRAyu\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1194, in fit
    solver = _check_solve

**Results from classification Report:**

The GridSearch powered logisitic regression significantly improved in precision (for churned customers) however, it suffered in recall as a result. Regardless the f1-score didn't see much improvement. 

**Error Explanation:**

What this error "The max_iter was reached which means the coef_ did not converge" means is that there's a chance that my model could of reached better parameters but the full number of iterations was reached before this could happen. Overall it means my model could've been better. I could look into how to fix this error but I honestly don't know how to. Rather than trying to fix this issues I'll just test out other models. 

# Decision Trees

## Baseline Model

In [None]:
#Import decision tree and other libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import cross_val_score

#Prepare X and y
X = df.drop(columns='churn_value') #Explanatory variables
y = df['churn_value']

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize decision tree
decision_tree = DecisionTreeClassifier(random_state=42) #Decision Tree without any hyperparameter tuning

#Fit on training data
decision_tree.fit(X_train, y_train)

#Make predicitions
y_pred_tree = decision_tree.predict(X_test)

#Evaluate model

#Conduct 5-fold cross-validation 
scores = cross_val_score(decision_tree, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for decision tree
accuracy = accuracy_score(y_test, y_pred_tree) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_tree))

Accuracy: 74.49%
Best Cross-Validation Score: 73.22

Classification Report: 
               precision    recall  f1-score   support

           0       0.82      0.83      0.82      1525
           1       0.54      0.53      0.54       588

    accuracy                           0.74      2113
   macro avg       0.68      0.68      0.68      2113
weighted avg       0.74      0.74      0.74      2113



Performs pretty poorly for churned customers, especially compared to the baseline logisitic regression model. 

## Decision Tree with Normalized Features

In [None]:
#Use normalized features to see if there's a difference

#Prepare X and y
X = normalized_df.drop(columns= ['churn_value', 'total_charges']) #Explanatory variables
y = normalized_df['churn_value']

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize decision tree
decision_tree = DecisionTreeClassifier(random_state=42) #Decision Tree without any hyperparameter tuning

#Fit on training data
decision_tree.fit(X_train, y_train)

#Make predicitions
y_pred_tree = decision_tree.predict(X_test)

#Evaluate model

#Conduct 5-fold cross-validation 
scores = cross_val_score(decision_tree, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for decision tree
accuracy = accuracy_score(y_test, y_pred_tree) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_tree))

Accuracy: 74.30%
Best Cross-Validation Score: 73.22

Classification Report: 
               precision    recall  f1-score   support

           0       0.81      0.84      0.82      1525
           1       0.54      0.50      0.52       588

    accuracy                           0.74      2113
   macro avg       0.68      0.67      0.67      2113
weighted avg       0.74      0.74      0.74      2113



No improvement

## Decision Tree with Normalized and Scaled Features

In [None]:
#Use scaled + normalized features to see if there's a difference 

#Prepare X and y
X = scaled_df.drop(columns='churn_value') #Explanatory variables
y = scaled_df['churn_value']

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize decision tree
decision_tree = DecisionTreeClassifier(random_state=42) #Decision Tree without any hyperparameter tuning

#Fit on training data
decision_tree.fit(X_train, y_train)

#Make predicitions
y_pred_tree = decision_tree.predict(X_test)

#Evaluate model

#Conduct 5-fold cross-validation 
scores = cross_val_score(decision_tree, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for decision tree
accuracy = accuracy_score(y_test, y_pred_tree) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_tree))

Accuracy: 74.44%
Best Cross-Validation Score: 72.95

Classification Report: 
               precision    recall  f1-score   support

           0       0.82      0.83      0.82      1525
           1       0.54      0.53      0.53       588

    accuracy                           0.74      2113
   macro avg       0.68      0.68      0.68      2113
weighted avg       0.74      0.74      0.74      2113



No improvement.

## Decision Tree with different sampling methods

### Synthetic Minority Oversampling Method

In [None]:
#Import libraries
from imblearn.over_sampling import SMOTE
from collections import Counter

#Prepare X and y
X = df.drop(columns='churn_value') #Explanatory variables
y = df['churn_value'] #Response variable

#Re-conduct train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Display class balance before SMOTE
print(f"Class Distribution before SMOTE: {Counter(y_train)}")

#Apply SMOTE to oversample the minority class
smote = SMOTE(sampling_strategy='minority', random_state=42) #Initialize SMOTE
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

#Display class balance after SMOTE
print(f"Class Distribution after SMOTE: {Counter(y_train_res)}")

#Initialize and fit decision tree
decision_tree = DecisionTreeClassifier(random_state=42)

#Fit on training data
decision_tree.fit(X_train_res, y_train_res)

#Make predictions
y_pred_tree = decision_tree.predict(X_test)

#Conduct 5-fold cross-validation 
scores = cross_val_score(decision_tree, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for decision tree
accuracy = accuracy_score(y_test, y_pred_tree) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_tree))

Class Distribution before SMOTE: Counter({0: 3649, 1: 1281})
Class Distribution after SMOTE: Counter({1: 3649, 0: 3649})
Accuracy: 73.12%
Best Cross-Validation Score: 73.22

Classification Report: 
               precision    recall  f1-score   support

           0       0.83      0.79      0.81      1525
           1       0.52      0.57      0.54       588

    accuracy                           0.73      2113
   macro avg       0.67      0.68      0.68      2113
weighted avg       0.74      0.73      0.74      2113



Still under whelming performance. 

### Random Under-Sampling

In [None]:
#Import imblearn under_sampling library
from imblearn.under_sampling import RandomUnderSampler

#Prepare X and y 
X = df.drop(columns='churn_value') #Explanatory variables
y = df['churn_value'] #Response variable

#Conduct train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Display class balance before under-sampling
print(f"Class Distribution before under-sampling: {Counter(y_train)}")

#Apply random under-sampling to reduce majority class
underS = RandomUnderSampler(random_state=42, replacement=True)

#Redo train-test split
X_under, y_under = underS.fit_resample(X_train, y_train)

#Display class balance after under-sampling
print(f"Class Distribution after under-sampling: {Counter(y_under)}")

#Initialize and fit decision tree
decision_tree = DecisionTreeClassifier(random_state=42)


#Fit on training data
decision_tree.fit(X_under, y_under)

#Make predictions
y_pred_tree = decision_tree.predict(X_test)

#Conduct 5-fold cross-validation 
scores = cross_val_score(decision_tree, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for decision tree
accuracy = accuracy_score(y_test, y_pred_tree) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_tree))

Class Distribution before under-sampling: Counter({0: 3649, 1: 1281})
Class Distribution after under-sampling: Counter({0: 1281, 1: 1281})
Accuracy: 67.96%
Best Cross-Validation Score: 73.22

Classification Report: 
               precision    recall  f1-score   support

           0       0.86      0.67      0.75      1525
           1       0.45      0.71      0.55       588

    accuracy                           0.68      2113
   macro avg       0.65      0.69      0.65      2113
weighted avg       0.74      0.68      0.70      2113



Even worse performance for churned customers despite the rise in recall which came at the cost of precision and f1

## Decision Tree with Feature Selection

### Filter Method

In [11]:
#Import filter method and mutual info classifier
from sklearn.feature_selection import SelectKBest 
from sklearn.feature_selection import mutual_info_classif #B/c we have a mix of categorical and numerical features

In [106]:
#Prepare X and y
X = df.drop(columns='churn_value') #Explanatory variables
y = df['churn_value'] #Response variable

In [107]:
#Find the best features using the filter method

#Use mutual info classifier due to combo of categorical and numerical features 
selector = SelectKBest(mutual_info_classif, k=5) #Select top 5 features
X_new = selector.fit_transform(X, y)

#Print out selected features
selected_features = X.columns[selector.get_support()]
print(f"Selected Features: {selected_features}")

Selected Features: Index(['tenure_months', 'internet_service_fiber_optic', 'online_security_no',
       'tech_support_no', 'contract_month-to-month'],
      dtype='object')


In [None]:
#Re-prepare X and y
X = df[selected_features]
y = df['churn_value']

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize decision tree
decision_tree = DecisionTreeClassifier(random_state=42) #Decision Tree without any hyperparameter tuning

#Fit on training data
decision_tree.fit(X_train, y_train)

#Make predicitions
y_pred_tree = decision_tree.predict(X_test)

#Evaluate model

#Conduct 5-fold cross-validation 
scores = cross_val_score(decision_tree, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for decision tree
accuracy = accuracy_score(y_test, y_pred_tree) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_tree))

Accuracy: 77.33%
Best Cross-Validation Score: 77.61

Classification Report: 
               precision    recall  f1-score   support

           0       0.83      0.86      0.85      1525
           1       0.60      0.54      0.57       588

    accuracy                           0.77      2113
   macro avg       0.72      0.70      0.71      2113
weighted avg       0.77      0.77      0.77      2113



Still pretty bad performance for churned customers. 

### Wrapper Method

In [None]:
#Import wrapper method
from sklearn.feature_selection import RFE

#Initialize decision tree
decision_tree = DecisionTreeClassifier(random_state=42)

#Initialize RFE and select top 15 features
rfe = RFE(decision_tree, n_features_to_select=15)
rfe.fit(X, y) #Fit on X and y

#Print out selected features
selected_features = X.columns[rfe.support_]
print(f"Selected Features using the Wrapper Method: {selected_features}")

#Prepare X 
X_selected = df[selected_features]

#Conduct train-test split
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, train_size=0.7, random_state=42)

#Fit on training data
decision_tree.fit(X_train, y_train)

#Make predictions
y_pred_tree = decision_tree.predict(X_test)

#Conduct 5-fold cross-validation 
scores = cross_val_score(decision_tree, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for decision tree
accuracy = accuracy_score(y_test, y_pred_tree) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_tree))

Selected Features using the Wrapper Method: Index(['tenure_months', 'monthly_charges', 'total_charges', 'gender_male',
       'senior_citizen_yes', 'partner_yes', 'dependents_yes',
       'paperless_billing_yes', 'internet_service_fiber_optic',
       'online_security_yes', 'online_backup_no', 'device_protection_no',
       'tech_support_no', 'contract_month-to-month',
       'payment_method_electronic_check'],
      dtype='object')
Accuracy: 74.16%
Best Cross-Validation Score: 73.22

Classification Report: 
               precision    recall  f1-score   support

           0       0.82      0.82      0.82      1525
           1       0.54      0.53      0.53       588

    accuracy                           0.74      2113
   macro avg       0.68      0.68      0.68      2113
weighted avg       0.74      0.74      0.74      2113



Bad performance yet again. 

## Decision Tree with Hyperparameter tuning

Again I use GridSearchCV to optimize the parameters of the Decision Tree. For more information on the parameters of a decision tree:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [None]:
#Import GridSearchCV
from sklearn.model_selection import GridSearchCV

#Prepare X and y
X = df.drop(columns='churn_value') #Explanatory variables (No feature selection)
y = df['churn_value'] #Response variable

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize decision tree
decision_tree = DecisionTreeClassifier(random_state=42)

#Define parameters grid
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'criterion': ['gini', 'entropy']
}

#Set up GridSearchCV with 5-fold cross validation
grid_search = GridSearchCV(estimator=decision_tree, param_grid=param_grid, cv=5, scoring='accuracy')

#Fit to training data
grid_search.fit(X_train, y_train)

#Get best parameters
print("Best Hyperparameters: ", grid_search.best_params_)
print("Best cross-validation accuracy: ", grid_search.best_score_)

#Put best parameters on decision tree
decision_tree = grid_search.best_estimator_

#Make predictions on testing data
y_pred_tree = decision_tree.predict(X_test)

#Evaluate model

#Conduct 5-fold cross-validation 
scores = cross_val_score(decision_tree, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification
accuracy = accuracy_score(y_test, y_pred_tree) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_tree))

Best Hyperparameters:  {'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 10}
Best cross-validation accuracy:  0.7884381338742393
Accuracy: 78.89%
Best Cross-Validation Score: 78.76

Classification Report: 
               precision    recall  f1-score   support

           0       0.88      0.82      0.85      1525
           1       0.60      0.70      0.65       588

    accuracy                           0.79      2113
   macro avg       0.74      0.76      0.75      2113
weighted avg       0.80      0.79      0.79      2113



Pretty decent improvement, matches the f1-score from the logisitic regression with the filter method (65%). Relatively decent precision and recall as well.  

In [4]:
#Add more hyperparameter tuning

#Prepare X and y
X = df.drop(columns='churn_value') #Explanatory variables (No feature selection)
y = df['churn_value'] #Response variable

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize decision tree
decision_tree = DecisionTreeClassifier(random_state=42)

#Define parameters grid
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'criterion': ['gini', 'entropy'],
    'max_features': [None, 'sqrt', 'log2'],
    'class_weight': [None, 'balanced']
}

#Set up GridSearchCV with 5-fold cross validation
grid_search = GridSearchCV(estimator=decision_tree, param_grid=param_grid, cv=5, scoring='accuracy')

#Fit to training data
grid_search.fit(X_train, y_train)

#Get best parameters
print("Best Hyperparameters: ", grid_search.best_params_)
print("Best cross-validation accuracy: ", grid_search.best_score_)

#Put best parameters on decision tree
decision_tree = grid_search.best_estimator_

#Make predictions on testing data
y_pred_tree = decision_tree.predict(X_test)

#Evaluate model

#Conduct 5-fold cross-validation 
scores = cross_val_score(decision_tree, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification
accuracy = accuracy_score(y_test, y_pred_tree) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_tree))

Best Hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 5, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 10}
Best cross-validation accuracy:  0.7884381338742393
Accuracy: 78.89%
Best Cross-Validation Score: 78.76

Classification Report: 
               precision    recall  f1-score   support

           0       0.88      0.82      0.85      1525
           1       0.60      0.70      0.65       588

    accuracy                           0.79      2113
   macro avg       0.74      0.76      0.75      2113
weighted avg       0.80      0.79      0.79      2113



Similar performance. 

In [None]:
#Import GridSearchCV
from sklearn.model_selection import GridSearchCV

#Prepare X and y
X = df[selected_features] #Explanatory variables (No feature selection)
y = df['churn_value'] #Response variable

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize decision tree
decision_tree = DecisionTreeClassifier(random_state=42)

#Define parameters grid
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'criterion': ['gini', 'entropy']
}

#Set up GridSearchCV with 5-fold cross validation
grid_search = GridSearchCV(estimator=decision_tree, param_grid=param_grid, cv=5, scoring='accuracy')

#Fit to training data
grid_search.fit(X_train, y_train)

#Get best parameters
print("Best Hyperparameters: ", grid_search.best_params_)
print("Best cross-validation accuracy: ", grid_search.best_score_)

#Put best parameters on decision tree
decision_tree = grid_search.best_estimator_

#Make predictions on testing data
y_pred_tree = decision_tree.predict(X_test)

#Evaluate model

#Conduct 5-fold cross-validation 
scores = cross_val_score(decision_tree, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report
accuracy = accuracy_score(y_test, y_pred_tree) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_tree))

Best Hyperparameters:  {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best cross-validation accuracy:  0.7935091277890466
Accuracy: 79.22%
Best Cross-Validation Score: 79.38

Classification Report: 
               precision    recall  f1-score   support

           0       0.82      0.92      0.86      1525
           1       0.69      0.46      0.55       588

    accuracy                           0.79      2113
   macro avg       0.75      0.69      0.71      2113
weighted avg       0.78      0.79      0.78      2113



Tried using the list of selected features from the filter method but that didn't improve performance and instead drastically reduced it. 

# Random Forest

In [1]:
#Import libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import cross_val_score

## Baseline Model

In [12]:
#Prepare X and y
X = df.drop(columns='churn_value') #Explanatory variables
y = df['churn_value'] #Response variable

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize random forest classifier
random_forest = RandomForestClassifier(random_state=42) #No hyperparameter tuning 

#Fit on training data
random_forest.fit(X_train, y_train)

#Make predicitions
y_pred_forest = random_forest.predict(X_test)

#Evaluate model

#Conduct 5-fold cross-validation 
scores = cross_val_score(random_forest, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for logistic regression
accuracy = accuracy_score(y_test, y_pred_forest) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_forest))

Accuracy: 78.99%
Best Cross-Validation Score: 79.19

Classification Report: 
               precision    recall  f1-score   support

           0       0.83      0.89      0.86      1525
           1       0.65      0.52      0.58       588

    accuracy                           0.79      2113
   macro avg       0.74      0.71      0.72      2113
weighted avg       0.78      0.79      0.78      2113



Not as good performance compared to the baseline logisitic regression model, but still is better than the baseline decision tree. 

## Normalized Features

In [13]:
#Prepare X and y
X = normalized_df.drop(columns=['total_charges', 'churn_value']) #Explanatory variables
y = normalized_df['churn_value'] #Response variable

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize random forest classifier
random_forest = RandomForestClassifier(random_state=42) #No hyperparameter tuning 

#Fit on training data
random_forest.fit(X_train, y_train)

#Make predicitions
y_pred_forest = random_forest.predict(X_test)

#Evaluate model

#Conduct 5-fold cross-validation 
scores = cross_val_score(random_forest, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for logistic regression
accuracy = accuracy_score(y_test, y_pred_forest) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_forest))

Accuracy: 79.56%
Best Cross-Validation Score: 79.53

Classification Report: 
               precision    recall  f1-score   support

           0       0.83      0.90      0.86      1525
           1       0.67      0.53      0.59       588

    accuracy                           0.80      2113
   macro avg       0.75      0.71      0.73      2113
weighted avg       0.79      0.80      0.79      2113



Slightly improved f1 for churned customers. 

## Scaled Features

In [None]:
#Use scaled features to see if there's a difference 

#Prepare X and y
X = scaled_df.drop(columns='churn_value') #Explanatory variables
y = scaled_df['churn_value']

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize decision tree
random_forest =  RandomForestClassifier(random_state=42) #Without any hyperparameter tuning

#Fit on training data
random_forest.fit(X_train, y_train)

#Make predicitions
y_pred_forest = random_forest.predict(X_test)

#Evaluate model

#Conduct 5-fold cross-validation 
scores = cross_val_score(random_forest, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for random forest
accuracy = accuracy_score(y_test, y_pred_forest) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_forest))

Accuracy: 80.12%
Best Cross-Validation Score: 79.17

Classification Report: 
               precision    recall  f1-score   support

           0       0.83      0.91      0.87      1525
           1       0.68      0.53      0.60       588

    accuracy                           0.80      2113
   macro avg       0.76      0.72      0.73      2113
weighted avg       0.79      0.80      0.79      2113



Again improved the f1 for churned customers. 

## Random Forest with different sampling methods

### Synthetic Minority Oversampling Method

In [10]:
#Import libraries
from imblearn.over_sampling import SMOTE
from collections import Counter

#Prepare X and y
X = scaled_df.drop(columns=['total_charges', 'churn_value']) #Explanatory variables
y = scaled_df['churn_value'] #Response variable

#Conduct train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Display class balance before SMOTE
print(f"Class Distribution before SMOTE: {Counter(y_train)}")

#Apply SMOTE to oversample the minority class
smote = SMOTE(sampling_strategy='minority', random_state=42) #Initialize SMOTE
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

#Display class balance after SMOTE
print(f"Class Distribution after SMOTE: {Counter(y_train_res)}")

#Initialize and fit random forest
random_forest = RandomForestClassifier(random_state=42)

#Fit on training data
random_forest.fit(X_train, y_train)

#Make predictions
y_pred_forest = random_forest.predict(X_test)

#Conduct 5-fold cross-validation 
scores = cross_val_score(random_forest, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for logistic regression
accuracy = accuracy_score(y_test, y_pred_forest) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_forest))

Class Distribution before SMOTE: Counter({0: 3649, 1: 1281})
Class Distribution after SMOTE: Counter({1: 3649, 0: 3649})
Accuracy: 79.51%
Best Cross-Validation Score: 79.51

Classification Report: 
               precision    recall  f1-score   support

           0       0.83      0.90      0.86      1525
           1       0.67      0.53      0.59       588

    accuracy                           0.80      2113
   macro avg       0.75      0.71      0.73      2113
weighted avg       0.79      0.80      0.79      2113



Didn't improve performance. 

### Random Under-Sampling

In [9]:
#Import imblearn under_sampling library
from imblearn.under_sampling import RandomUnderSampler

#Prepare X and y 
X = scaled_df.drop(columns=['total_charges', 'churn_value']) #Explanatory variables
y = scaled_df['churn_value'] #Response variable

#Conduct train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Display class balance before under-sampling
print(f"Class Distribution before under-sampling: {Counter(y_train)}")

#Apply random under-sampling to reduce majority class
underS = RandomUnderSampler(random_state=42, replacement=True)

#Redo train-test split
X_under, y_under = underS.fit_resample(X_train, y_train)

#Display class balance after under-sampling
print(f"Class Distribution after under-sampling: {Counter(y_under)}")

#Initialize random forest model
random_forest = RandomForestClassifier(random_state=42)

#Fit on training data
random_forest.fit(X_under, y_under)

#Make predictions
y_pred_forest = random_forest.predict(X_test)

#Conduct 5-fold cross-validation 
scores = cross_val_score(random_forest, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for logistic regression
accuracy = accuracy_score(y_test, y_pred_forest) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_forest))

Class Distribution before under-sampling: Counter({0: 3649, 1: 1281})
Class Distribution after under-sampling: Counter({0: 1281, 1: 1281})
Accuracy: 74.49%
Best Cross-Validation Score: 79.51

Classification Report: 
               precision    recall  f1-score   support

           0       0.92      0.71      0.80      1525
           1       0.53      0.84      0.65       588

    accuracy                           0.74      2113
   macro avg       0.72      0.77      0.72      2113
weighted avg       0.81      0.74      0.76      2113



Best f1-score (65%) along with wrapper method powered Logistitic Regression and GridSearch powered Decision Tree. Also very high performance in recall, but at the tradeoff of lower performance for precision (for churned customers).

## Random Forest with Feature Selection

### Filter Method

In [96]:
#Prepare X and y
X = df.drop(columns='churn_value') #Explanatory variables
y = df['churn_value'] #Response variable

In [93]:
#Find the best features using the filter method

#Use mutual info classifier due to combo of categorical and numerical features 
selector = SelectKBest(mutual_info_classif, k=20) #Select top 20 features
X_new = selector.fit_transform(X, y)

#Print out selected features
selected_features = X.columns[selector.get_support()]
print(f"Selected Features: {selected_features}")

Selected Features: Index(['tenure_months', 'monthly_charges', 'total_charges', 'gender_male',
       'senior_citizen_yes', 'dependents_yes', 'paperless_billing_yes',
       'internet_service_dsl', 'internet_service_fiber_optic',
       'online_security_no', 'online_security_yes', 'online_backup_no',
       'online_backup_yes', 'device_protection_no', 'tech_support_no',
       'tech_support_yes', 'contract_month-to-month', 'contract_one_year',
       'payment_method_bank_transfer', 'payment_method_electronic_check'],
      dtype='object')


In [94]:
#Re-prepare X and y
X = scaled_df[selected_features]
y = scaled_df['churn_value']

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize Random Forest
random_forest = RandomForestClassifier(random_state=42)

#Fit on training data
random_forest.fit(X_train, y_train)

#Make predicitions
y_pred_forest = random_forest.predict(X_test)

#Evaluate model

#Conduct 5-fold cross-validation 
scores = cross_val_score(random_forest, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for logistic regression
accuracy = accuracy_score(y_test, y_pred_forest) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_forest))

Accuracy: 79.70%
Best Cross-Validation Score: 78.97

Classification Report: 
               precision    recall  f1-score   support

           0       0.83      0.90      0.86      1525
           1       0.67      0.54      0.60       588

    accuracy                           0.80      2113
   macro avg       0.75      0.72      0.73      2113
weighted avg       0.79      0.80      0.79      2113



Not great performance for churned customers. 

In [95]:
#Filter method with SMOTE

#Prepare X and y
X = scaled_df[selected_features] #Explanatory variables
y = scaled_df['churn_value'] #Response variable

#Conduct train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Display class balance before SMOTE
print(f"Class Distribution before SMOTE: {Counter(y_train)}")

#Apply SMOTE to oversample the minority class
smote = SMOTE(sampling_strategy='minority', random_state=42) #Initialize SMOTE
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

#Display class balance after SMOTE
print(f"Class Distribution after SMOTE: {Counter(y_train_res)}")

#Initialize and fit random forest
random_forest = RandomForestClassifier(random_state=42)

#Fit on training data
random_forest.fit(X_train, y_train)

#Make predictions
y_pred_forest = random_forest.predict(X_test)

#Conduct 5-fold cross-validation 
scores = cross_val_score(random_forest, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for logistic regression
accuracy = accuracy_score(y_test, y_pred_forest) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_forest))

Class Distribution before SMOTE: Counter({0: 3649, 1: 1281})
Class Distribution after SMOTE: Counter({1: 3649, 0: 3649})
Accuracy: 79.70%
Best Cross-Validation Score: 78.97

Classification Report: 
               precision    recall  f1-score   support

           0       0.83      0.90      0.86      1525
           1       0.67      0.54      0.60       588

    accuracy                           0.80      2113
   macro avg       0.75      0.72      0.73      2113
weighted avg       0.79      0.80      0.79      2113



Again, not great performance.

### Wrapper Method

In [121]:
#Import wrapper method
from sklearn.feature_selection import RFE

#Initialize random forest
random_forest = RandomForestClassifier(random_state=42)

#Initialize RFE and select top 15 features
rfe = RFE(random_forest, n_features_to_select=15)
rfe.fit(X, y) #Fit on X and y

#Print out selected features
selected_features = X.columns[rfe.support_]
print(f"Selected Features using the Wrapper Method: {selected_features}")

#Prepare X 
X_selected = normalized_df[selected_features]

#Conduct train-test split
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, train_size=0.7, random_state=42)

#Fit on training data
random_forest.fit(X_train, y_train)

#Make predictions
y_pred_forest = random_forest.predict(X_test)

#Conduct 5-fold cross-validation 
scores = cross_val_score(random_forest, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for logistic regression
accuracy = accuracy_score(y_test, y_pred_forest) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_forest))

Selected Features using the Wrapper Method: Index(['tenure_months', 'monthly_charges', 'gender_male', 'senior_citizen_yes',
       'partner_yes', 'dependents_yes', 'paperless_billing_yes',
       'multiple_lines_no', 'internet_service_fiber_optic',
       'online_security_no', 'online_backup_no', 'tech_support_no',
       'contract_month-to-month', 'payment_method_electronic_check',
       'cbrt_total_charges'],
      dtype='object')
Accuracy: 79.13%
Best Cross-Validation Score: 79.51

Classification Report: 
               precision    recall  f1-score   support

           0       0.83      0.89      0.86      1525
           1       0.65      0.54      0.59       588

    accuracy                           0.79      2113
   macro avg       0.74      0.71      0.72      2113
weighted avg       0.78      0.79      0.78      2113



In [122]:
#Import wrapper method
from sklearn.feature_selection import RFE

#Initialize random forest
random_forest = RandomForestClassifier(random_state=42)

#Initialize RFE and select top 15 features
rfe = RFE(random_forest, n_features_to_select=15)
rfe.fit(X, y) #Fit on X and y

#Print out selected features
selected_features = X.columns[rfe.support_]
print(f"Selected Features using the Wrapper Method: {selected_features}")

#Prepare X 
X_selected = scaled_df[selected_features]

#Conduct train-test split
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, train_size=0.7, random_state=42)

#Fit on training data
random_forest.fit(X_train, y_train)

#Make predictions
y_pred_forest = random_forest.predict(X_test)

#Conduct 5-fold cross-validation 
scores = cross_val_score(random_forest, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for logistic regression
accuracy = accuracy_score(y_test, y_pred_forest) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_forest))

Selected Features using the Wrapper Method: Index(['tenure_months', 'monthly_charges', 'gender_male', 'senior_citizen_yes',
       'partner_yes', 'dependents_yes', 'paperless_billing_yes',
       'multiple_lines_no', 'internet_service_fiber_optic',
       'online_security_no', 'online_backup_no', 'tech_support_no',
       'contract_month-to-month', 'payment_method_electronic_check',
       'cbrt_total_charges'],
      dtype='object')
Accuracy: 79.22%
Best Cross-Validation Score: 79.51

Classification Report: 
               precision    recall  f1-score   support

           0       0.83      0.89      0.86      1525
           1       0.65      0.54      0.59       588

    accuracy                           0.79      2113
   macro avg       0.74      0.71      0.73      2113
weighted avg       0.78      0.79      0.79      2113



In [123]:
#With SMOTE

#Prepare X and y
X = scaled_df.drop(columns='churn_value') #Explanatory variables
y = scaled_df['churn_value'] #Response variable

#Initialize random forest
random_forest = RandomForestClassifier(random_state=42)

#Initialize RFE and select top 15 features
rfe = RFE(random_forest, n_features_to_select=15)
rfe.fit(X, y) #Fit on X and y

#Print out selected features
selected_features = X.columns[rfe.support_]
print(f"Selected Features using the Wrapper Method: {selected_features}")

#Prepare X 
X_selected = scaled_df[selected_features]

#Conduct train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Display class balance before SMOTE
print(f"Class Distribution before SMOTE: {Counter(y_train)}")

#Apply SMOTE to oversample the minority class
smote = SMOTE(sampling_strategy='minority', random_state=42) #Initialize SMOTE
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

#Display class balance after SMOTE
print(f"Class Distribution after SMOTE: {Counter(y_train_res)}")

#Initialize and fit random forest
random_forest = RandomForestClassifier(random_state=42)

#Fit on training data
random_forest.fit(X_train, y_train)

#Make predictions
y_pred_forest = random_forest.predict(X_test)

#Conduct 5-fold cross-validation 
scores = cross_val_score(random_forest, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for logistic regression
accuracy = accuracy_score(y_test, y_pred_forest) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_forest))

Selected Features using the Wrapper Method: Index(['tenure_months', 'monthly_charges', 'total_charges', 'gender_male',
       'senior_citizen_yes', 'partner_yes', 'dependents_yes',
       'paperless_billing_yes', 'internet_service_fiber_optic',
       'online_security_no', 'online_backup_no', 'tech_support_no',
       'contract_month-to-month', 'payment_method_electronic_check',
       'cbrt_total_charges'],
      dtype='object')
Class Distribution before SMOTE: Counter({0: 3649, 1: 1281})
Class Distribution after SMOTE: Counter({1: 3649, 0: 3649})
Accuracy: 80.12%
Best Cross-Validation Score: 79.17

Classification Report: 
               precision    recall  f1-score   support

           0       0.83      0.91      0.87      1525
           1       0.68      0.53      0.60       588

    accuracy                           0.80      2113
   macro avg       0.76      0.72      0.73      2113
weighted avg       0.79      0.80      0.79      2113



## Random Forest with Hyperparameter Tuning

In [6]:
#Import RandomSearchCV
from sklearn.model_selection import RandomizedSearchCV

**What is RandomSearchCV:** 

RandomSearchCV is another hyperparameter optimization module from scikit-learn, similar to GridSearchCV. RandomSearchCV works by sampling random combinations of hyperparameters from the parameter grid (`param_grid`) and then evaluating them using 5 fold cross-validation. It continues to repeat this process until it's met the defined number of iterations (which we tested both at 25 and 100). It's important to note that RandomSearchCV doesn't explore the entire parameter grid but instead selects a random subset of it. This can be more efficient than GridSearchCv which goes through the whole parameter grid. 

**Documentation:** https://www.geeksforgeeks.org/comparing-randomized-search-and-grid-search-for-hyperparameter-estimation-in-scikit-learn/

For the parameters of a Random forest: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier

In [4]:
#Prepare X and y
X = df.drop(columns='churn_value') #Explanatory variables
y = df['churn_value'] #Response variable

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize random forest model
random_forest = RandomForestClassifier(random_state=42)

#Define parameter grid
param_grid = {
    'n_estimators': [100, 200, 300, 500],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False],
    'criterion': ['gini','entropy']
}

#Initialize Random Search CV
random_search = RandomizedSearchCV(estimator=random_forest, param_distributions=param_grid, n_iter=25, cv=5, random_state=42)

#Fit on training data 
random_search.fit(X_train, y_train)

#Get best parameters and display cross-validation score
print(f"Best Hyperparameters: {random_search.best_params_}")

#Get cross-validation score
cross_val = random_search.best_score_ * 100

#Put best parameters on random forest
random_forest = random_search.best_estimator_

#Make predictions
y_pred_forest = random_forest.predict(X_test)

#Display accuracy and classification report for model
accuracy = accuracy_score(y_test, y_pred_forest) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Cross-Validation Score: {cross_val:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_forest))


Best Hyperparameters: {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 10, 'criterion': 'entropy', 'bootstrap': True}
Accuracy: 80.31%
Cross-Validation Score: 80.51

Classification Report: 
               precision    recall  f1-score   support

           0       0.84      0.89      0.87      1525
           1       0.67      0.57      0.62       588

    accuracy                           0.80      2113
   macro avg       0.76      0.73      0.74      2113
weighted avg       0.80      0.80      0.80      2113



In [None]:
#Incrase the number of iterations to 100

#Prepare X and y
X = df.drop(columns='churn_value') #Explanatory variables
y = df['churn_value'] #Response variable

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize random forest model
random_forest = RandomForestClassifier(random_state=42)

#Define parameter grid
param_grid = {
    'n_estimators': [100, 200, 300, 500],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False],
    'criterion': ['gini','entropy']
}

#Initialize Random Search CV
random_search = RandomizedSearchCV(estimator=random_forest, param_distributions=param_grid, n_iter=100, cv=5, random_state=42)

#Fit on training data 
random_search.fit(X_train, y_train)

#Get best parameters and display cross-validation score
print(f"Best Hyperparameters: {random_search.best_params_}")

#Get cross-validation score
cross_val = random_search.best_score_ * 100

#Put best parameters on random forest
random_forest = random_search.best_estimator_

#Make predictions
y_pred_forest = random_forest.predict(X_test)

#Display accuracy and classification report for model
accuracy = accuracy_score(y_test, y_pred_forest) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Cross-Validation Score: {cross_val:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_forest))


Best Hyperparameters: {'n_estimators': 500, 'min_samples_split': 5, 'min_samples_leaf': 4, 'max_features': 'log2', 'max_depth': 10, 'criterion': 'gini', 'bootstrap': True}
Accuracy: 80.69%
Cross-Validation Score: 80.39

Classification Report: 
               precision    recall  f1-score   support

           0       0.84      0.90      0.87      1525
           1       0.68      0.57      0.62       588

    accuracy                           0.81      2113
   macro avg       0.76      0.73      0.75      2113
weighted avg       0.80      0.81      0.80      2113



In [8]:
#Normalized Features

#Prepare X and y
X = normalized_df.drop(columns='churn_value') #Explanatory variables
y = normalized_df['churn_value'] #Response variable

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize random forest model
random_forest = RandomForestClassifier(random_state=42)

#Define parameter grid
param_grid = {
    'n_estimators': [100, 200, 300, 500],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False],
    'criterion': ['gini','entropy']
}

#Initialize Random Search CV
random_search = RandomizedSearchCV(estimator=random_forest, param_distributions=param_grid, n_iter=100, cv=5, random_state=42)

#Fit on training data 
random_search.fit(X_train, y_train)

#Get best parameters and display cross-validation score
print(f"Best Hyperparameters: {random_search.best_params_}")

#Get cross-validation score
cross_val = random_search.best_score_ * 100

#Put best parameters on random forest
random_forest = random_search.best_estimator_

#Make predictions
y_pred_forest = random_forest.predict(X_test)

#Display accuracy and classification report for model
accuracy = accuracy_score(y_test, y_pred_forest) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Cross-Validation Score: {cross_val:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_forest))


Best Hyperparameters: {'n_estimators': 500, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'log2', 'max_depth': 10, 'criterion': 'entropy', 'bootstrap': True}
Accuracy: 80.79%
Cross-Validation Score: 80.43

Classification Report: 
               precision    recall  f1-score   support

           0       0.84      0.90      0.87      1525
           1       0.69      0.57      0.62       588

    accuracy                           0.81      2113
   macro avg       0.77      0.73      0.75      2113
weighted avg       0.80      0.81      0.80      2113



In [7]:
#Scaled + Normalized Features Features

#Prepare X and y
X = scaled_df.drop(columns='churn_value') #Explanatory variables
y = scaled_df['churn_value'] #Response variable

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize random forest model
random_forest = RandomForestClassifier(random_state=42)

#Define parameter grid
param_grid = {
    'n_estimators': [100, 200, 300, 500],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False],
    'criterion': ['gini','entropy']
}

#Initialize Random Search CV
random_search = RandomizedSearchCV(estimator=random_forest, param_distributions=param_grid, n_iter=100, cv=5, random_state=42)

#Fit on training data 
random_search.fit(X_train, y_train)

#Get best parameters and display cross-validation score
print(f"Best Hyperparameters: {random_search.best_params_}")

#Get cross-validation score
cross_val = random_search.best_score_ * 100

#Put best parameters on random forest
random_forest = random_search.best_estimator_

#Make predictions
y_pred_forest = random_forest.predict(X_test)

#Display accuracy and classification report for model
accuracy = accuracy_score(y_test, y_pred_forest) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Cross-Validation Score: {cross_val:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_forest))


Best Hyperparameters: {'n_estimators': 500, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'log2', 'max_depth': 10, 'criterion': 'entropy', 'bootstrap': True}
Accuracy: 80.79%
Cross-Validation Score: 80.45

Classification Report: 
               precision    recall  f1-score   support

           0       0.84      0.90      0.87      1525
           1       0.69      0.57      0.62       588

    accuracy                           0.81      2113
   macro avg       0.77      0.73      0.75      2113
weighted avg       0.80      0.81      0.80      2113



Good Precision for all the RandomSearch boosted Random Forests, but poor f1 for churned customers, especially compared to our best f1-score of 65%. 

# Gradient Boosting Model

## Baseline Model 1

In [56]:
#Import libraries
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

In [16]:
#Prepare X and y
X = df.drop(columns='churn_value') #Explanatory variables
y = df['churn_value'] #Response variable 

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize gradient boosting model
gbc = GradientBoostingClassifier(random_state=42) #No hyperparameter tuning

#Fit on training data
gbc.fit(X_train, y_train)

#Make predictions on testing data
y_pred_gbc = gbc.predict(X_test)

#Model evaluation

#Conduct 5-fold cross-validation 
scores = cross_val_score(gbc, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report
accuracy = accuracy_score(y_test, y_pred_gbc) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_gbc))


Accuracy: 80.64%
Best Cross-Validation Score: 80.75

Classification Report: 
               precision    recall  f1-score   support

           0       0.84      0.90      0.87      1525
           1       0.68      0.57      0.62       588

    accuracy                           0.81      2113
   macro avg       0.76      0.73      0.75      2113
weighted avg       0.80      0.81      0.80      2113



Decent performance, although not the best f1 for churned customers. 

**Note:** I did also test both the normalized and scaled dataframes on the gradient boosted model, but there wasn't much of a difference in model performance so I didn't bother keep the cell blocks. 

## SMOTE

In [55]:
#Import libraries
from imblearn.over_sampling import SMOTE
from collections import Counter

In [None]:
#Prepare X and y
X = df.drop(columns='churn_value') #Explanatory variables
y = df['churn_value'] #Response variable 

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Display class balance before SMOTE
print(f"Class Distribution before SMOTE: {Counter(y_train)}")

#Apply SMOTE to oversample the minority class
smote = SMOTE(sampling_strategy='minority', random_state=42) #Initialize SMOTE
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

#Initialize gradient boosting model
gbc_smote = GradientBoostingClassifier(random_state=42) #No hyperparameter tuning

#Fit on training data
gbc_smote.fit(X_train_res, y_train_res)

#Display class balance after SMOTE
print(f"Class Distribution after SMOTE: {Counter(y_train_res)}")

#Make predictions on testing data
y_pred_gbc = gbc_smote.predict(X_test)

#Model evaluation

#Conduct 5-fold cross-validation 
scores = cross_val_score(gbc_smote, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report
accuracy = accuracy_score(y_test, y_pred_gbc) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_gbc))


Class Distribution before SMOTE: Counter({0: 3649, 1: 1281})
Class Distribution after SMOTE: Counter({1: 3649, 0: 3649})
Accuracy: 79.27%
Best Cross-Validation Score: 80.75

Classification Report: 
               precision    recall  f1-score   support

           0       0.88      0.83      0.85      1525
           1       0.61      0.69      0.65       588

    accuracy                           0.79      2113
   macro avg       0.74      0.76      0.75      2113
weighted avg       0.80      0.79      0.80      2113



Goes along with our other best models that achieved a 65% f1-score for churned customers

Those being:
- Logisitic Regression + Filter/Wrapper Method
- Decision Tree + GridSearchCV
- Random Forest + Random-Under Sampling

## Random Under Sampling

In [61]:
#Import imblearn under_sampling library
from imblearn.under_sampling import RandomUnderSampler

In [None]:
#Prepare X and y
X = df.drop(columns='churn_value') #Explanatory variables
y = df['churn_value'] #Response variable 

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize gradient boosting model
gbc_under = GradientBoostingClassifier(random_state=42) #No hyperparameter tuning

#Display class balance before under-sampling
print(f"Class Distribution before under-sampling: {Counter(y_train)}")

#Apply random under-sampling to reduce majority class
underS = RandomUnderSampler(random_state=42, replacement=True)

#Redo train-test split
X_under, y_under = underS.fit_resample(X_train, y_train)

#Display class balance after under-sampling
print(f"Class Distribution after under-sampling: {Counter(y_under)}")

#Fit on training data
gbc_under.fit(X_under, y_under)

#Make predictions on testing data
y_pred_gbc = gbc_under.predict(X_test)

#Model evaluation

#Conduct 5-fold cross-validation 
scores = cross_val_score(gbc_under, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report
accuracy = accuracy_score(y_test, y_pred_gbc) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_gbc))


Class Distribution before under-sampling: Counter({0: 3649, 1: 1281})
Class Distribution after under-sampling: Counter({0: 1281, 1: 1281})
Accuracy: 74.54%
Best Cross-Validation Score: 80.75

Classification Report: 
               precision    recall  f1-score   support

           0       0.92      0.71      0.80      1525
           1       0.53      0.84      0.65       588

    accuracy                           0.75      2113
   macro avg       0.72      0.78      0.72      2113
weighted avg       0.81      0.75      0.76      2113



Along with our best f1-scores (65%) and very good recall. However it comes at the cost of lower precision for churned customers. 

## Feature Selection

### Filter

In [27]:
#Prepare X and y
X = df.drop(columns='churn_value') #Explanatory
y = df['churn_value'] #Response


#Import libraries for filter method
from sklearn.feature_selection import SelectKBest, mutual_info_classif

#Use mutual info classifier due to combo of categorical and numerical features 
selector = SelectKBest(mutual_info_classif, k=20) #Select top 20 features
X_new = selector.fit_transform(X, y)

#Print out selected features
selected_features = X.columns[selector.get_support()]
print(f"Selected Features: {selected_features}")

#Put selected features on X
X = df[selected_features] #Explanatory variables

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize Gradient boosting model
gbc = GradientBoostingClassifier(random_state=42)

#Fit on training data
gbc.fit(X_train, y_train)

#Make predicitions on test data
y_pred_gbc = gbc.predict(X_test)

#Model Evaluation

#Conduct 5-fold cross-validation 
scores = cross_val_score(gbc, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report
accuracy = accuracy_score(y_test, y_pred_gbc) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_gbc))

Selected Features: Index(['tenure_months', 'monthly_charges', 'total_charges',
       'senior_citizen_yes', 'partner_yes', 'dependents_yes',
       'paperless_billing_yes', 'internet_service_fiber_optic',
       'online_security_no', 'online_security_yes', 'online_backup_no',
       'device_protection_no', 'device_protection_yes', 'tech_support_no',
       'tech_support_yes', 'streaming_tv_no', 'contract_month-to-month',
       'contract_one_year', 'payment_method_electronic_check',
       'payment_method_mailed_check'],
      dtype='object')
Accuracy: 80.31%
Best Cross-Validation Score: 80.75

Classification Report: 
               precision    recall  f1-score   support

           0       0.85      0.89      0.87      1525
           1       0.67      0.58      0.62       588

    accuracy                           0.80      2113
   macro avg       0.76      0.74      0.74      2113
weighted avg       0.80      0.80      0.80      2113



Not great performance compared to the other "best" models. 

In [28]:
#With SMOTE


#Prepare X and y
X = df.drop(columns='churn_value') #Explanatory
y = df['churn_value'] #Response


#Import libraries for filter method
from sklearn.feature_selection import SelectKBest, mutual_info_classif

#Use mutual info classifier due to combo of categorical and numerical features 
selector = SelectKBest(mutual_info_classif, k=20) #Select top 20 features
X_new = selector.fit_transform(X, y)

#Print out selected features
selected_features = X.columns[selector.get_support()]
print(f"Selected Features: {selected_features}")

#Put selected features on X
X = df[selected_features] #Explanatory variables

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Display class balance before SMOTE
print(f"Class Distribution before SMOTE: {Counter(y_train)}")

#Apply SMOTE to oversample the minority class
smote = SMOTE(sampling_strategy='minority', random_state=42) #Initialize SMOTE
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

#Initialize Gradient boosting model
gbc = GradientBoostingClassifier(random_state=42)

#Fit on training data
gbc.fit(X_train_res, y_train_res)

#Display class balance before SMOTE
print(f"Class Distribution before SMOTE: {Counter(y_train_res)}")

#Make predicitions on test data
y_pred_gbc = gbc.predict(X_test)

#Model Evaluation

#Conduct 5-fold cross-validation 
scores = cross_val_score(gbc, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report
accuracy = accuracy_score(y_test, y_pred_gbc) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_gbc))

Selected Features: Index(['tenure_months', 'monthly_charges', 'total_charges',
       'senior_citizen_yes', 'dependents_yes', 'paperless_billing_yes',
       'internet_service_dsl', 'internet_service_fiber_optic',
       'online_security_no', 'online_security_yes', 'online_backup_no',
       'device_protection_no', 'device_protection_yes', 'tech_support_no',
       'tech_support_yes', 'streaming_tv_no', 'streaming_movies_no',
       'contract_month-to-month', 'contract_one_year',
       'payment_method_electronic_check'],
      dtype='object')
Class Distribution before SMOTE: Counter({0: 3649, 1: 1281})
Class Distribution before SMOTE: Counter({1: 3649, 0: 3649})
Accuracy: 77.99%
Best Cross-Validation Score: 80.48

Classification Report: 
               precision    recall  f1-score   support

           0       0.88      0.80      0.84      1525
           1       0.58      0.72      0.65       588

    accuracy                           0.78      2113
   macro avg       0.73      0.7

### Wrapper

In [29]:
#Import wrapper method
from sklearn.feature_selection import RFE

#Initialize gradint boosting model
gbc = GradientBoostingClassifier(random_state=42)


#Initialize RFE and select top 15 features
rfe = RFE(random_forest, n_features_to_select=15)
rfe.fit(X, y) #Fit on X and y

#Print out selected features
selected_features = X.columns[rfe.support_]
print(f"Selected Features using the Wrapper Method: {selected_features}")

#Prepare X 
X_selected = df[selected_features]

#Conduct train-test split
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, train_size=0.7, random_state=42)

#Fit on training data
gbc.fit(X_train, y_train)

#Make predictions
y_pred_gbc = gbc.predict(X_test)

#Conduct 5-fold cross-validation 
scores = cross_val_score(gbc, X, y, cv=5, scoring='accuracy')
score_mean = scores.mean() * 100

#Display accuracy and classification report for logistic regression
accuracy = accuracy_score(y_test, y_pred_gbc) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_gbc))

Selected Features using the Wrapper Method: Index(['tenure_months', 'monthly_charges', 'total_charges', 'dependents_yes',
       'paperless_billing_yes', 'internet_service_dsl',
       'internet_service_fiber_optic', 'online_security_no',
       'online_backup_no', 'device_protection_no', 'tech_support_no',
       'streaming_movies_no', 'contract_month-to-month', 'contract_one_year',
       'payment_method_electronic_check'],
      dtype='object')
Accuracy: 80.03%
Best Cross-Validation Score: 80.48

Classification Report: 
               precision    recall  f1-score   support

           0       0.84      0.89      0.87      1525
           1       0.66      0.57      0.62       588

    accuracy                           0.80      2113
   macro avg       0.75      0.73      0.74      2113
weighted avg       0.79      0.80      0.80      2113



## Gradient Boosting with Hyperparameter Tuning

In [32]:
#Import GridSearchCV
from sklearn.model_selection import GridSearchCV

#Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'max_features': [1, 3, 5],
}

#Prepare X and y
X = df.drop(columns='churn_value') #Explanatory variables
y = df['churn_value'] #Response variable 

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

#Initialize gradient boosting model
gbc = GradientBoostingClassifier()

#Initialize gridsearchcv
grid_search = GridSearchCV(estimator=gbc, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=1)

#Fit on training data
grid_search.fit(X_train, y_train)

#Get best parameters
print("Best Hyperparameters: ", grid_search.best_params_)

#Put best parameters on model
gbc = grid_search.best_estimator_


#Make predictions on testing data
y_pred_gbc = gbc.predict(X_test)

#Model evaluation

#Get cross-val score
score_mean = grid_search.best_score_ * 100

#Display accuracy and classification report
accuracy = accuracy_score(y_test, y_pred_gbc) * 100
print(f"Accuracy: {accuracy:.2f}%")
print(f"Best Cross-Validation Score: {score_mean:.2f}")
print("\nClassification Report: \n", classification_report(y_test,y_pred_gbc))


  _data = np.array(data, dtype=dtype, copy=copy,


Best Hyperparameters:  {'learning_rate': 0.1, 'max_depth': 3, 'max_features': 5, 'min_samples_leaf': 2, 'min_samples_split': 10, 'n_estimators': 100}
Accuracy: 80.69%
Best Cross-Validation Score: 81.18

Classification Report: 
               precision    recall  f1-score   support

           0       0.84      0.90      0.87      1525
           1       0.68      0.57      0.62       588

    accuracy                           0.81      2113
   macro avg       0.76      0.73      0.75      2113
weighted avg       0.80      0.81      0.80      2113



Not better than our "Best" models. 

# Best Model

**Explanation:**

Given our evalaution metric for churned customers being f1-score. We have 4 models tied currently for the best f1-score of 65% for churned customers. 

- Logisitic Regression + SMOTE + Filter/Wrapper Method 
- Decision Tree + GridSearchCV
- Random Forest + Under-Sampling
- Gradient Boosting Model + SMOTE

In terms of our simpliest model with the best performance it would probably be the Logisitic Regression Model with scaled/normalized features, which had a f1-score of 64% for churned customers. 

## Let's further dive into our "Best Models"

In [54]:
#Import numpy
import numpy as np

#Create a dataframe of each of our best models and their given perofmrance metrics
print("Best Model's\nNote: Metrics involve only class 1: Churned Customers")
model_eval = pd.DataFrame({
    'Accuracy': [75.44, 78.89, 74.49, 79.27, 74.54],
    'Cross-Val Score': [80.79, 78.76, 79.51, 80.75, 80.75],
    'Precision': [54, 60, 53, 61, 53],
    'Recall': [81, 70, 84, 69, 84],
    'F1-Score': [65, 65, 65, 65, 65],
}, index=['Log Reg + SMOTE+ Filter', 'Decision Tree + Grid', 'Random Forest + Random-Under', 'Gradient Boost + SMOTE', 
          'Gradient Boosting + Random-Under'])

#Change formatting by adding percentages
model_eval = model_eval.style.format({
    'Accuracy': "{}%",
    'Cross-Val Score': "{}%",
    'Precision': "{}%",
    'Recall': "{}%",
    'F1-Score': "{}%"
})

#Define a function to highlight the minimum and maximum metrics
def highlight_max(x, props='color:white;background-color: green; font-weight:bold;'):
    return np.where(x == np.nanmax(x.values), props, '')
def highlight_min(x, props='color:white;background-color: red; font-weight:bold;'):
    return np.where(x==np.nanmin(x.values), props, '')

#Get all columns expect for f1-score
columns_to_style = model_eval.columns[0:4]

#Apply to dataframe (expect for f1-score)
style_model_eval = model_eval.apply(highlight_max, axis=0, subset=columns_to_style)
style_model_eval = model_eval.apply(highlight_min, axis=0, subset=columns_to_style)

#Display
display(style_model_eval)

Best Model's
Note: Metrics involve only class 1: Churned Customers


Unnamed: 0,Accuracy,Cross-Val Score,Precision,Recall,F1-Score
Log Reg + SMOTE+ Filter,75.44%,80.79%,54%,81%,65%
Decision Tree + Grid,78.89%,78.76%,60%,70%,65%
Random Forest + Random-Under,74.49%,79.51%,53%,84%,65%
Gradient Boost + SMOTE,79.27%,80.75%,61%,69%,65%
Gradient Boosting + Random-Under,74.54%,80.75%,53%,84%,65%


## What is our best model?

From analyzing the classification report for churned customers, **two models clearly stand out**:

1. **Gradient Boost + SMOTE**

    This model is the **most accurate**, but given our class imbalance, accuracy isn’t the most reliable metric. More importantly, it has the **highest precision** among all models and reaches the **maximum F1-score (65%)**. While its recall is the lowest among the top contenders, it still maintains the **2nd highest cross-validation score**, suggesting strong generalizability.

    This model excels at avoiding **false positives** — in other words, it’s cautious about labeling customers as churners unless it's confident.
    **Business context:** This helps **minimize unnecessary spending** on retention efforts for customers who aren't likely to churn, making it ideal when retention costs are high and resources need to be used efficiently.

2. **Gradient Boost + Random-Under**

    This model takes the opposite approach: it achieves the **highest recall (tied)**, meaning it’s best at **catching actual churners**, even at the cost of some false positives. While it has one of the lower accuracy scores, that's expected and acceptable given the class imbalance. It also ties for second in **cross-validation score**, indicating it still performs consistently.

    This model is best for **minimizing false negatives** — ensuring churners don’t slip through the cracks.
    **Business context:** It's ideal in situations where **losing a customer is very costly**, and the priority is to retain as many at-risk customers as possible, even if it means occasionally overreacting.

**Overall:** the best model depends on the **business objective**. If the goal is to reduce **unnecessary retention costs**, prioritize precision. If the goal is to **prevent customer loss**, prioritize recall.

## Save best model's

In [58]:
#Save best model using pickle

#Use try and expect block to install/import pickle
try:
    import pickle
    print('Sucessfully Imported Pickle')
except:
    !pip install pickle
    import pickle
    print('Needed to install pickle first')

Sucessfully Imported Pickle


In [60]:
#Save best model's using pickle
file_to_write_smote = open('telco_churn_best_model_gbc_smote.saved', 'wb')
pickle.dump(gbc_smote, file_to_write_smote)
file_to_write_smote.close()

In [63]:
#Repeat the same process for Gradient boosting + Under-sampling
file_to_write_under = open('telco_churn_best_model_gbc_under.saved', 'wb')
pickle.dump(gbc, file_to_write_under)
file_to_write_under.close()