# Scenario
You are working as an analyst for an internet service provider. You have been given historical data about the company's customers and their churn trends. Your task is to build a machine learning model that helps the company identify customers who are more likely to default or churn, so that it can prevent losses from such customers.
# Instructions
In this lab, we will first look at the degree of imbalance in the data and then correct it using the techniques we learned in class.

### Import the required libraries and modules.

In [1]:
import pandas as pd
import numpy as np
import warnings

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

warnings.filterwarnings('ignore')

### Read the data into Python and call the dataframe churnData.

In [2]:
churnData = pd.read_csv("files_for_lab/Customer-Churn.csv")
churnData.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.7,151.65,Yes


In [3]:
churnData.shape

(7043, 16)

### Check the datatypes of all the columns in the data. You will see that the column TotalCharges is of object type. Convert this column to a numeric type using the pd.to_numeric function.

In [4]:
churnData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7043 non-null   object 
 15  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(13)
memory

In [5]:
# pd.to_numeric(churnData["TotalCharges"])
# raises a ValueError because some entries are " " (a single space)

In [6]:
churnData["TotalCharges"] = churnData["TotalCharges"].apply(lambda x: np.nan if x == " " else x)

In [7]:
churnData["TotalCharges"] = pd.to_numeric(churnData["TotalCharges"])

In [8]:
churnData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7032 non-null   float64
 15  Churn             7043 non-null   object 
dtypes: float64(2), int64(2), object(12)
memory
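
The blank-string replacement and the conversion above can also be done in one step. The sketch below is a minimal illustration on a made-up toy Series (not the lab data), assuming that treating every unparseable entry as missing is acceptable; it relies on the `errors="coerce"` option of `pd.to_numeric`.

```python
import pandas as pd

# Toy column mimicking TotalCharges: numeric strings plus one blank " " entry.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" turns any value that cannot be parsed into NaN
# instead of raising an error.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)
print(converted.isna().sum())
```

Note that coercion silently hides every unparseable value, so it is worth checking the NaN count afterwards, as the next step of the lab does.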

### Check for null values in the dataframe. Replace the null values.

In [9]:
churnData.isna().sum()

gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

In [10]:
churnData["TotalCharges"].describe()

count    7032.000000
mean     2283.300441
std      2266.771362
min        18.800000
25%       401.450000
50%      1397.475000
75%      3794.737500
max      8684.800000
Name: TotalCharges, dtype: float64

In [11]:
# Replace NaN values with the median

churnData["TotalCharges"] = churnData["TotalCharges"].fillna(churnData["TotalCharges"].median())
churnData.isna().sum()

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

### Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges:

In [12]:
X = churnData[["tenure", "SeniorCitizen", "MonthlyCharges", "TotalCharges"]]
X

Unnamed: 0,tenure,SeniorCitizen,MonthlyCharges,TotalCharges
0,1,0,29.85,29.85
1,34,0,56.95,1889.50
2,2,0,53.85,108.15
3,45,0,42.30,1840.75
4,2,0,70.70,151.65
...,...,...,...,...
7038,24,0,84.80,1990.50
7039,72,0,103.20,7362.90
7040,11,0,29.60,346.45
7041,4,1,74.40,306.60


In [13]:
y = churnData[["Churn"]]
y

Unnamed: 0,Churn
0,No
1,No
2,Yes
3,No
4,Yes
...,...
7038,No
7039,No
7040,No
7041,Yes


### Scale the features using either a normalizer or a standard scaler.

In [14]:
# Normalize with Min Max Scaler

transformer = MinMaxScaler().fit(X)
X_minmax = transformer.transform(X) 
X_norm = pd.DataFrame(X_minmax, columns=X.columns)
X_norm

Unnamed: 0,tenure,SeniorCitizen,MonthlyCharges,TotalCharges
0,0.013889,0.0,0.115423,0.001275
1,0.472222,0.0,0.385075,0.215867
2,0.027778,0.0,0.354229,0.010310
3,0.625000,0.0,0.239303,0.210241
4,0.027778,0.0,0.521891,0.015330
...,...,...,...,...
7038,0.333333,0.0,0.662189,0.227521
7039,1.000000,0.0,0.845274,0.847461
7040,0.152778,0.0,0.112935,0.037809
7041,0.055556,1.0,0.558706,0.033210
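
To make the scaling step concrete, here is a minimal sketch of the min-max formula, x' = (x - min) / (max - min), applied by hand to a toy array. The values are illustrative and are not taken from the lab file.

```python
import numpy as np

# Toy tenure values in months (illustrative, not the lab data).
tenure = np.array([0.0, 1.0, 34.0, 72.0])

# Min-max scaling maps the smallest value to 0 and the largest to 1.
scaled = (tenure - tenure.min()) / (tenure.max() - tenure.min())

print(scaled.round(6))
```

With a range of 0 to 72, a tenure of 1 maps to 1/72 and a tenure of 34 maps to 34/72, which is consistent with the 0.013889 and 0.472222 shown in the scaled table above (the 0 to 72 range is an inference from that output, not something the notebook prints).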


### Split the data into a training set and a test set.

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X_norm, y, test_size=0.3, random_state=42)

### Fit a logistic regression model on the training data.

In [16]:
classification = LogisticRegression().fit(X_train, y_train)
predictions = classification.predict(X_test)

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

          No       0.82      0.93      0.87      1539
         Yes       0.69      0.44      0.54       574

    accuracy                           0.80      2113
   macro avg       0.75      0.69      0.70      2113
weighted avg       0.78      0.80      0.78      2113



Although the accuracy is 0.80, the model is much better at predicting the "No" class than the "Yes" class because of the imbalance in the data. The recall for the "Yes" class is particularly low: the model correctly identified only 44% of the customers who churned.
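
As a reminder of how these metrics relate, the sketch below computes precision, recall and accuracy by hand for the positive class. The labels are a made-up toy example, not the lab predictions.

```python
# Toy labels (illustrative): 1 = churned ("Yes"), 0 = stayed ("No").
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # churners caught
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false alarms
fn = sum(t == 1 and p == 0 for t, p in pairs)  # churners missed

precision = tp / (tp + fp)  # of those flagged "Yes", how many churned
recall = tp / (tp + fn)     # of those who churned, how many were flagged
accuracy = sum(t == p for t, p in pairs) / len(pairs)

print(round(precision, 2), round(recall, 2), accuracy)
```

In this toy case accuracy is 0.7 even though only half of the churners are caught, which is the same pattern seen in the report above.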

### Managing imbalance in the dataset
### Check for the imbalance.


In [17]:
y.value_counts()

Churn
No       5174
Yes      1869
dtype: int64
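
The counts above can be turned into class shares with a few lines of arithmetic. This sketch simply reuses the two numbers printed by `value_counts()`.

```python
# Class counts copied from the value_counts() output above.
counts = {"No": 5174, "Yes": 1869}
total = sum(counts.values())

shares = {label: n / total for label, n in counts.items()}
ratio = counts["No"] / counts["Yes"]

print({label: round(share, 3) for label, share in shares.items()})  # {'No': 0.735, 'Yes': 0.265}
print(round(ratio, 2))  # 2.77
```

Roughly 73% of the customers are in the "No" class, so a model that always predicted "No" would already reach about 0.73 accuracy on the full data. This is why accuracy alone is a weak guide here.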

### Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.

In [18]:
# Undersampling the majority with Tomek Links

tl = TomekLinks(sampling_strategy='majority')
X_tl, y_tl = tl.fit_resample(X_norm, y)
y_tl.value_counts()

Churn
No       4651
Yes      1869
dtype: int64
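
For intuition about what was removed: a Tomek link is a pair of points from opposite classes that are each other's nearest neighbour. The sketch below is a simplified NumPy illustration of that idea on toy data; the helper name `tomek_link_majority_indices` is hypothetical and this is not how imblearn implements `TomekLinks` internally.

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label):
    """Return indices of majority-class points that belong to a Tomek link."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances; a point is never its own neighbour.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)

    to_drop = []
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i  # i and j are each other's nearest neighbour
        if mutual and y[i] != y[j] and y[i] == majority_label:
            to_drop.append(i)
    return to_drop

# Toy 1-D data (illustrative): "No" is the majority class.
X_toy = [[0.0], [0.1], [0.5], [0.55], [1.0]]
y_toy = ["No", "No", "No", "Yes", "Yes"]

print(tomek_link_majority_indices(X_toy, y_toy, "No"))  # [2]
```

Only the "No" point at 0.5 is dropped, because it and the "Yes" point at 0.55 are mutual nearest neighbours. This mirrors `sampling_strategy='majority'`, which leaves the minority class untouched, as the unchanged "Yes" count above shows.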

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X_tl, y_tl, test_size=0.3, random_state=42)

classification = LogisticRegression().fit(X_train, y_train)
predictions = classification.predict(X_test)

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

          No       0.82      0.90      0.86      1379
         Yes       0.68      0.52      0.59       577

    accuracy                           0.79      1956
   macro avg       0.75      0.71      0.72      1956
weighted avg       0.78      0.79      0.78      1956



The model has improved at predicting the "Yes" class (recall up from 0.44 to 0.52), but there is still room for improvement.

In [20]:
# Upsampling the minority with SMOTE

smote = SMOTE()
X_sm, y_sm = smote.fit_resample(X_norm, y)
y_sm.value_counts()

Churn
No       5174
Yes      5174
dtype: int64
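
SMOTE creates new minority rows by interpolating between an existing minority point and one of its nearest minority neighbours. The sketch below illustrates that idea in plain NumPy on toy data; the helper `smote_like_samples` and its parameters are hypothetical, and it is a simplification rather than imblearn's implementation.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]  # k nearest minority neighbours

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))   # pick a minority point
        j = rng.choice(neighbours[i])  # pick one of its neighbours
        gap = rng.random()             # position along the segment, in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Toy minority-class rows with two scaled features (illustrative).
X_min_toy = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.25, 0.2]]
new_points = smote_like_samples(X_min_toy, n_new=3)
print(new_points.shape)  # (3, 2)
```

Every synthetic row lies on a line segment between two real minority rows, so the new "Yes" examples above are plausible variations of existing churners rather than exact copies.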

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.3, random_state=42)

classification = LogisticRegression().fit(X_train, y_train)
predictions = classification.predict(X_test)

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

          No       0.76      0.74      0.75      1574
         Yes       0.74      0.76      0.75      1531

    accuracy                           0.75      3105
   macro avg       0.75      0.75      0.75      3105
weighted avg       0.75      0.75      0.75      3105



The model has improved significantly at predicting the "Yes" class (recall of 0.76, up from 0.44 for the first model), but the overall accuracy is still low at 0.75.

In [22]:
# Combination of the 2 techniques 

## 1. Tomeklinks on the majority (already done)
## 2. Upsample minority with SMOTE

X_tl_sm, y_tl_sm = smote.fit_resample(X_tl, y_tl)
y_tl_sm.value_counts()

Churn
No       4651
Yes      4651
dtype: int64

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X_tl_sm, y_tl_sm, test_size=0.3, random_state=42)

classification = LogisticRegression().fit(X_train, y_train)
predictions = classification.predict(X_test)

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

          No       0.76      0.74      0.75      1383
         Yes       0.75      0.77      0.76      1408

    accuracy                           0.76      2791
   macro avg       0.76      0.76      0.76      2791
weighted avg       0.76      0.76      0.76      2791



There is some improvement: accuracy rises slightly, to 0.76.

In [25]:
## 3. Apply Tomeklinks one more time to remove borderline cases

X_tl_sm_tl, y_tl_sm_tl = tl.fit_resample(X_tl_sm, y_tl_sm)
y_tl_sm_tl.value_counts()

Churn
Yes      4651
No       4443
dtype: int64

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X_tl_sm_tl, y_tl_sm_tl, test_size=0.3, random_state=42)

classification = LogisticRegression().fit(X_train, y_train)
predictions = classification.predict(X_test)

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

          No       0.75      0.76      0.76      1297
         Yes       0.78      0.77      0.78      1432

    accuracy                           0.77      2729
   macro avg       0.77      0.77      0.77      2729
weighted avg       0.77      0.77      0.77      2729



### Conclusion
We have significantly improved the model's ability to predict the "Yes" class (customer churn), from 0.69 precision and 0.44 recall to 0.78 precision and 0.77 recall, with each score measured on a test split of the correspondingly resampled data.

This means that 78% of the customers the model flagged as "Yes" had actually churned, and the model identified 77% of the customers who actually churned.
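
In this notebook the data is resampled before it is split, so each test set also contains resampled rows. A possible variation, which is not part of the original lab, is to split first and rebalance only the training portion. The sketch below shows that ordering on made-up toy data; it uses plain random duplication of minority rows as a stand-in for SMOTE to stay dependency-free.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy imbalanced data (illustrative): 80 "No" rows and 20 "Yes" rows.
X_toy = rng.random((100, 4))
y_toy = np.array(["No"] * 80 + ["Yes"] * 20)

# 1. Split first, so the test set is drawn from the original, unresampled data.
order = rng.permutation(len(y_toy))
test_idx, train_idx = order[:30], order[30:]
X_tr, y_tr = X_toy[train_idx], y_toy[train_idx]
X_te, y_te = X_toy[test_idx], y_toy[test_idx]

# 2. Rebalance the training portion only (random duplication of "Yes" rows,
#    a simple stand-in for SMOTE).
yes_rows = np.flatnonzero(y_tr == "Yes")
n_extra = int((y_tr == "No").sum()) - len(yes_rows)
extra = rng.choice(yes_rows, size=n_extra, replace=True)
X_tr_bal = np.vstack([X_tr, X_tr[extra]])
y_tr_bal = np.concatenate([y_tr, y_tr[extra]])

print((y_tr_bal == "No").sum(), (y_tr_bal == "Yes").sum())  # equal counts
print(len(y_te))  # 30 rows, untouched by resampling
```

With this ordering, the reported metrics would describe performance on customers with the original class mix, which may differ from the figures above.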