# Model Development

In this notebook, we will be developing predictive models using the `Telco Customer Churn` dataset. Using `sklearn` models: Logistic Regression, Decision Tree, and K-Nearest Neighbors; we will craft various classifiers as base models and then to optimize using the cost function, regularization and hyperparameter tuning.

By the end of this notebook, we will establish foundations for evaluating which model will perform best for predicting churn.

## Loading Tools and Dataset

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OrdinalEncoder, StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, log_loss, classification_report, precision_score, recall_score, ConfusionMatrixDisplay, roc_curve, roc_auc_score
import pickle
import warnings
warnings.filterwarnings('ignore')
plt.style.use('ggplot')
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:,.2f}'.format)

df = pd.read_csv('../data/encoded_telco_churn.csv')
df.head()

Unnamed: 0,Male,Partner,Dependents,SeniorCitizen,DurationMonths,PhoneService,MultipleLines,NoInternet,DSLInternet,FiberOpticInternet,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,MonthlyCharges,Contract,Churn
0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,0,0,29.85,Month-to-month,0
1,1,0,0,0,34,1,0,0,1,0,1,0,1,0,0,0,56.95,One year,0
2,1,0,0,0,2,1,0,0,1,0,1,1,0,0,0,0,53.85,Month-to-month,1
3,1,0,0,0,45,0,0,0,1,0,1,0,1,1,0,0,42.3,One year,0
4,0,0,0,0,2,1,0,0,0,1,0,0,0,0,0,0,70.7,Month-to-month,1


In [4]:
X = df.drop('Churn', axis=1)
y = df['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [6]:
numerical = ['DurationMonths','MonthlyCharges']
scaler = StandardScaler()
categorical = ['Contract']
ord_encoder = OrdinalEncoder()

preprocessor = ColumnTransformer(
    transformers = [
        ('num', scaler, numerical),
        ('cat', ord_encoder, categorical)
    ])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor)
])

X_train_processed = pipeline.fit_transform(X_train)
X_test_processed = pipeline.transform(X_test)


X_train_scaled = pd.DataFrame(X_train_processed, columns=numerical + categorical)
X_test_scaled = pd.DataFrame(X_test_processed, columns=numerical + categorical)

X_train_scaled.head()

Unnamed: 0,DurationMonths,MonthlyCharges,Contract
0,-0.47,-0.0,1.0
1,0.89,1.07,2.0
2,-1.28,-1.38,0.0
3,-1.16,0.18,0.0
4,-1.33,-0.1,2.0


In [15]:
scaler = pipeline.named_steps['preprocessor'].named_transformers_['num']
scaled_numerical = X_train_processed[:,:len(numerical)]
scaled_numerical
# # To inverse transform the scaled numerical data
# # Access the StandardScaler using the double underscore method
# scaler = pipeline.named_steps['preprocessor'].named_transformers_['num']

# # Select the transformed numerical columns
# scaled_numerical_data = transformed_data[:, :len(numerical)]

# # Apply the inverse transform to get the original numerical data
# original_numerical_data = scaler.inverse_transform(scaled_numerical_data)

# # Combine the inverse transformed numerical data with the rest of the data
# # Assuming the transformed data structure is: numerical columns first, then categorical columns
# inverse_transformed_data = np.hstack((original_numerical_data, transformed_data[:, len(numerical):]))

array([0.88553679, 1.07475386])

## Transformation

In [None]:


# Fit and transform the training data, and transform the test data
X_train_processed = pipeline.fit_transform(X_train)
X_test_processed = pipeline.transform(X_test)

# Convert to DataFrame for easier visualization if needed
X_train_scaled = pd.DataFrame(X_train_processed, columns=numerical_features + categorical_features)
X_test_scaled = pd.DataFrame(X_test_processed, columns=numerical_features + categorical_features)

X_train_scaled.head()

In [84]:
X_train_encoded = X_train.copy()
X_test_encoded = X_test.copy()
encoder = OrdinalEncoder()
X_train_encoded['Contract'] = encoder.fit_transform(X_train[['Contract']])
X_test_encoded['Contract'] = encoder.transform(X_test[['Contract']])

X_train_scaled = X_train_encoded.copy()
X_test_scaled = X_test_encoded.copy()
scaler = StandardScaler()
numerical_features = ['DurationMonths','MonthlyCharges']
for ft in numerical_features:
    X_train_scaled[ft] = scaler.fit_transform(X_train_encoded[[ft]])
    X_test_scaled[ft] = scaler.transform(X_test_encoded[[ft]])

X_train_scaled.head()

Unnamed: 0,Male,Partner,Dependents,SeniorCitizen,DurationMonths,PhoneService,MultipleLines,NoInternet,DSLInternet,FiberOpticInternet,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,MonthlyCharges,Contract
2142,0,0,1,0,-0.47,1,0,0,1,0,1,0,1,0,0,1,-0.0,1.0
1623,0,0,0,0,0.89,1,1,0,0,1,0,1,0,0,1,1,1.07,2.0
6074,1,1,0,0,-1.28,0,0,0,1,0,0,0,0,0,0,0,-1.38,0.0
1362,1,0,0,0,-1.16,1,0,0,0,1,0,0,0,0,0,0,0.18,0.0
6754,1,0,1,0,-1.33,1,1,0,1,0,1,1,0,1,0,0,-0.1,2.0


## Logistic Regression

In [85]:
logreg_base = LogisticRegression()
logreg_base.fit(X_train_scaled, y_train)

In [86]:
logreg_ypred = logreg_base.predict(X_test_scaled)
logreg_accuracy = accuracy_score(y_test, logreg_ypred)

logreg_ypred_proba = logreg_base.predict_proba(X_test_scaled)
logreg_logloss = log_loss(y_test, logreg_ypred_proba)

In [87]:
print(logreg_accuracy)
print(logreg_logloss)

0.8126330731014905
0.4020852819613363


In [88]:
C_list = [1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100, 1e3]
penalty_list = ['l1', 'l2']
class_weight_list = ['balanced', None]

cv_scores = []
cv_scores_std = []
best_score = 0
best_params = {}
results = []

for c in C_list:
    for penalty in penalty_list:
        for class_weight in class_weight_list:
            logreg = LogisticRegression(C=c, random_state=42, max_iter=1000, class_weight=class_weight, penalty=penalty, solver='liblinear' if penalty == 'l1' else 'lbfgs')
            cv_loop_results = cross_validate(
                logreg,
                X=X_train_scaled,
                y=y_train,
                cv=8,
                scoring='recall',
                return_train_score=True
            )
            mean_score = np.mean(cv_loop_results['test_score'])
            std_score = np.std(cv_loop_results['test_score'])
            cv_scores.append(mean_score)
            cv_scores_std.append(std_score)

            results.append((np.log10(c), penalty, class_weight, mean_score, std_score))

            if mean_score > best_score:
                best_score = mean_score
                best_params = {'C': c, 'penalty': penalty, 'class_weight': class_weight}
                
            print(c, penalty, class_weight, round(mean_score, 4), round(std_score, 4))

print('--------')
print(best_params)
print(round(best_score, 4))

0.0001 l1 balanced 0.0 0.0
0.0001 l1 None 0.0 0.0
0.0001 l2 balanced 0.8382 0.0291
0.0001 l2 None 0.0 0.0
0.001 l1 balanced 0.0 0.0
0.001 l1 None 0.0 0.0
0.001 l2 balanced 0.8255 0.0288
0.001 l2 None 0.0187 0.0096
0.01 l1 balanced 0.8269 0.0325
0.01 l1 None 0.4459 0.0503
0.01 l2 balanced 0.7975 0.0335
0.01 l2 None 0.4566 0.0455
0.1 l1 balanced 0.7941 0.035
0.1 l1 None 0.5087 0.04
0.1 l2 balanced 0.7941 0.0358
0.1 l2 None 0.5134 0.0412
1 l1 balanced 0.7961 0.0367
1 l1 None 0.5294 0.0397
1 l2 balanced 0.7968 0.0367
1 l2 None 0.5274 0.0394
10 l1 balanced 0.8001 0.0371
10 l1 None 0.5274 0.0418
10 l2 balanced 0.8001 0.0371
10 l2 None 0.5301 0.0414
100 l1 balanced 0.7995 0.0372
100 l1 None 0.5294 0.0424
100 l2 balanced 0.7995 0.0372
100 l2 None 0.5301 0.0414
1000.0 l1 balanced 0.7995 0.0372
1000.0 l1 None 0.5294 0.0424
1000.0 l2 balanced 0.7995 0.0372
1000.0 l2 None 0.5301 0.0414
--------
{'C': 0.0001, 'penalty': 'l2', 'class_weight': 'balanced'}
0.8382


## Logistic Regression with C Regularization

#### Testing To Find Best `Regularization Strength C`

In [None]:
C_list = [1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100, 1e3]
cv_scores = []
cv_scores_std = []

for c in C_list:
    logreg = LogisticRegression(C=c, random_state=42)
    cv_loop_results = cross_validate(
                                    X=X_train,
                                    y=y_train,
                                    estimator=logreg,
                                    cv=8)
    cv_scores.append(np.mean(np.sqrt(np.abs(cv_loop_results['test_score']))))
    cv_scores_std.append(np.std(np.sqrt(np.abs(cv_loop_results['test_score']))))

In [None]:
cv_scores, cv_scores_std

In [None]:
fig, ax = plt.subplots()
sns.lineplot(x = np.log10(C_list), y = cv_scores, marker = 's', ax = ax)
ax.set_xlabel('Log(C)')
ax.set_ylabel('Mean Accuracy')
ax.set_title('Accuracy Averaged on LogReg C Validation Folds')
plt.show()

Regularization Strength `C=1e-2` average cross-validation score is almost 0.898

#### After Finding the Best `Regularization Strength C`

In [None]:
logreg_best = LogisticRegression(C=1e-2)
logreg_best.fit(X_train, y_train)

In [None]:
logbest_ypred = logreg_best.predict(X_test)
logbest_accuracy = accuracy_score(y_test, logbest_ypred)
logbest_ypred_proba = logreg_best.predict_proba(X_test)
logbest_logloss = log_loss(y_test, logbest_ypred_proba)

In [None]:
print(logbest_accuracy)
print(logbest_logloss)

`Scores` resulted in being slightly `less` than the `base model`.

## Decision Tree

In [None]:
X = df.drop('Churn', axis=1)
y = df['Churn']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=42)

In [None]:
dtree_base = DecisionTreeClassifier()
dtree_base.fit(X_train, y_train)

In [None]:
dtree_base_ypred = dtree_base.predict(X_test)
dtree_base_accuracy = accuracy_score(y_test, dtree_base_ypred)
dtree_base_report = classification_report(y_test, dtree_base_ypred)

In [None]:
print(dtree_base_accuracy)
print(dtree_base_report)

In [91]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate
import numpy as np

max_depth_list = [None, 10, 20, 30, 40, 50]
min_samples_split_list = [2, 5, 10]
min_samples_leaf_list = [1, 2, 4]
class_weight_list = ['balanced', None]

cv_scores = []
cv_scores_std = []
best_score = 0
best_params = {}
results = []

for max_depth in max_depth_list:
    for min_samples_split in min_samples_split_list:
        for min_samples_leaf in min_samples_leaf_list:
            for class_weight in class_weight_list:
                dt = DecisionTreeClassifier(
                    max_depth=max_depth,
                    min_samples_split=min_samples_split,
                    min_samples_leaf=min_samples_leaf,
                    class_weight=class_weight,
                    random_state=42
                )
                cv_loop_results = cross_validate(
                    dt,
                    X=X_train_scaled,
                    y=y_train,
                    cv=8,
                    scoring='recall',
                    return_train_score=True
                )
                mean_score = np.mean(cv_loop_results['test_score'])
                std_score = np.std(cv_loop_results['test_score'])
                cv_scores.append(mean_score)
                cv_scores_std.append(std_score)

                results.append((max_depth, min_samples_split, min_samples_leaf, class_weight, mean_score, std_score))

                if mean_score > best_score:
                    best_score = mean_score
                    best_params = {
                        'max_depth': max_depth,
                        'min_samples_split': min_samples_split,
                        'min_samples_leaf': min_samples_leaf,
                        'class_weight': class_weight
                    }

                print(max_depth, min_samples_split, min_samples_leaf, class_weight, round(mean_score, 4), round(std_score, 4))

print('--------')
print(best_params)
print(round(best_score, 4))


None 2 1 balanced 0.484 0.0415
None 2 1 None 0.488 0.0278
None 2 2 balanced 0.5916 0.0293
None 2 2 None 0.4392 0.0334
None 2 4 balanced 0.617 0.0382
None 2 4 None 0.4626 0.0314
None 5 1 balanced 0.5461 0.0388
None 5 1 None 0.4639 0.0288
None 5 2 balanced 0.5949 0.0294
None 5 2 None 0.4385 0.0274
None 5 4 balanced 0.617 0.0382
None 5 4 None 0.4626 0.0314
None 10 1 balanced 0.6056 0.0382
None 10 1 None 0.4753 0.0302
None 10 2 balanced 0.629 0.0381
None 10 2 None 0.4733 0.034
None 10 4 balanced 0.6344 0.0418
None 10 4 None 0.4619 0.0293
10 2 1 balanced 0.7146 0.0289
10 2 1 None 0.5047 0.0632
10 2 2 balanced 0.7326 0.0342
10 2 2 None 0.488 0.0648
10 2 4 balanced 0.7239 0.035
10 2 4 None 0.49 0.0635
10 5 1 balanced 0.7226 0.0343
10 5 1 None 0.4987 0.0631
10 5 2 balanced 0.7366 0.0303
10 5 2 None 0.4893 0.0639
10 5 4 balanced 0.7239 0.035
10 5 4 None 0.49 0.0635
10 10 1 balanced 0.7293 0.0326
10 10 1 None 0.4987 0.0552
10 10 2 balanced 0.7413 0.032
10 10 2 None 0.492 0.0582
10 10 4 balanced 

## Decision Tree Testing Tuning Hyperparameters

#### `max_depth` Hyperparameter Tuning

In [None]:
max_depth_list = [10, 20, 30, 40, 50]
cv_scores = []
cv_scores_std = []

for depth in max_depth_list:
    dtree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    cv_loop_results = cross_validate(
                                X=X_train,
                                y=y_train,
                                estimator=dtree,
                                cv=8)
    cv_scores.append(np.mean(np.sqrt(np.abs(cv_loop_results['test_score']))))
    cv_scores_std.append(np.std(np.sqrt(np.abs(cv_loop_results['test_score']))))

In [None]:
best_depth = {'max_depth':max_depth_list, 'cv_scores':cv_scores, 'cv_scores_std':cv_scores_std}
best_depth = pd.DataFrame(best_depth)
best_depth

In [None]:
fig, ax = plt.subplots()
sns.lineplot(x = max_depth_list, y = cv_scores, marker = 's', ax = ax)
ax.set_xlabel('Max Depth')
ax.set_ylabel('Mean Accuracy')
ax.set_title('Accuracy Averaged on Max Depth Validation Folds')
plt.show()

Best `max_depth` is `10` with an accuracy score of almost 0.875.

#### `min_samples_split` Hyperparameter Tuning

In [None]:
min_samples_split_list = [10, 20, 50, 75, 100]
cv_scores = []
cv_scores_std = []

for split in min_samples_split_list:
    dtree = DecisionTreeClassifier(min_samples_split=split, random_state=42)
    cv_loop_results = cross_validate(
                                X=X_train,
                                y=y_train,
                                estimator=dtree,
                                cv=8)
    cv_scores.append(np.mean(np.sqrt(np.abs(cv_loop_results['test_score']))))
    cv_scores_std.append(np.std(np.sqrt(np.abs(cv_loop_results['test_score']))))

In [None]:
best_split = {'min_samples_split':min_samples_split_list, 'cv_scores':cv_scores, 'cv_scores_std':cv_scores_std}
best_split = pd.DataFrame(best_split)
best_split

In [None]:
fig, ax = plt.subplots()
sns.lineplot(x = min_samples_split_list, y = cv_scores, marker = 's', ax = ax)
ax.set_xlabel('Min Samples Split')
ax.set_ylabel('Mean Accuracy')
ax.set_title('Accuracy Averaged on Samples Split Validation Folds')
plt.show()

Best `min_samples_split` is `100` with an accuracy score of above 0.89.

#### `min_samples_leafs` Hyperparameter Tuning

In [None]:
min_samples_leaf_list = [10, 20, 50, 75, 100]
cv_scores = []
cv_scores_std = []

for leaf in min_samples_leaf_list:
    dtree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=42)
    cv_loop_results = cross_validate(
                                X=X_train,
                                y=y_train,
                                estimator=dtree,
                                cv=8)
    cv_scores.append(np.mean(np.sqrt(np.abs(cv_loop_results['test_score']))))
    cv_scores_std.append(np.std(np.sqrt(np.abs(cv_loop_results['test_score']))))

In [None]:
best_leaf = {'min_samples_split':min_samples_leaf_list, 'cv_scores':cv_scores, 'cv_scores_std':cv_scores_std}
best_leaf = pd.DataFrame(best_leaf)
best_leaf

In [None]:
fig, ax = plt.subplots()
sns.lineplot(x = min_samples_leaf_list, y = cv_scores, marker = 's', ax = ax)
ax.set_xlabel('Min Samples Leaf')
ax.set_ylabel('Mean Accuracy')
ax.set_title('Accuracy Averaged on Samples Leaf Validation Folds')
plt.show()

Best `min_samples_leaf` is `75` with an accuracy score of above 0.89.

#### `criterion` Hyperparameter Tuning

In [None]:
criterion_list = ['gini', 'entropy', 'log_loss']
cv_scores = []
cv_scores_std = []

for criteria in criterion_list:
    dtree = DecisionTreeClassifier(criterion=criteria, random_state=42)
    cv_loop_results = cross_validate(
                                X=X_train,
                                y=y_train,
                                estimator=dtree,
                                cv=8)
    cv_scores.append(np.mean(np.sqrt(np.abs(cv_loop_results['test_score']))))
    cv_scores_std.append(np.std(np.sqrt(np.abs(cv_loop_results['test_score']))))

In [None]:
best_criterion = {'criterion':criterion_list, 'cv_scores':cv_scores, 'cv_scores_std':cv_scores_std}
best_criterion = pd.DataFrame(best_criterion)
best_criterion

In [None]:
fig, ax = plt.subplots()
sns.lineplot(x = criterion_list, y = cv_scores, marker = 's', ax = ax)
ax.set_xlabel('Criterion')
ax.set_ylabel('Mean Accuracy')
ax.set_title('Accuracy Averaged on Criterion Validation Folds')
plt.show()

Best `criterion` is `entropy` with an accuracy score of almost 0.85.

#### Summary of `Best Hyperparameter` Tuning for Decision Tree
- `max_depth` is 10
- `min_samples_split` is 100
- `min_samples_leaf` is 75
- `criterion` is entropy

## Decision Tree Best Hyperparameter Tuned

In [None]:
dtree_best = \
DecisionTreeClassifier(
    max_depth = 10,
    min_samples_split = 100,
    min_samples_leaf = 75,
    criterion = 'entropy')
dtree_best.fit(X_train, y_train)

In [None]:
dtree_best_ypred = dtree_best.predict(X_test)
dtree_best_accuracy = accuracy_score(y_test, dtree_best_ypred)
dtree_best_report = classification_report(y_test, dtree_best_ypred)

In [None]:
print(dtree_best_accuracy)
print(dtree_best_report)

In [None]:
print(dtree_base_report)
print(dtree_best_report)

## Saving Models

In [None]:
logreg_base_model = '../models/logreg_base.pkl'
logreg_best_model = '../models/logreg_tune.pkl'
dtree_base_model = '../models/dtree_base.pkl'
dtree_best_model = '../models/dtree_tune.pkl'

model_list = [
    logreg_base_model, logreg_best_model, 
    dtree_base_model, dtree_best_model ]
for model in model_list:
# StackOverflow
# https://stackoverflow.com/questions/65152886/save-the-model-using-pickle
    with open(model, 'wb') as file:
        pickle.dump(model, file)
    print(f'{model} has been saved.')

In [None]:
try:
    print('Script Executed Successfully')
except:
    print('FAILED')