In this lab and the following lessons, we will build a model for a binary classification problem: customer churn. You will use the file files_for_lab/Customer-Churn.csv.

Scenario

You are working as an analyst at an internet service provider. You are given historical data about the company's customers and their churn trends. Your task is to build a machine learning model that helps the company identify customers who are more likely to default/churn, and thus prevent losses from such customers.

Instructions

In this lab, we will first look at the degree of imbalance in the data and correct it using the techniques we learned in class.

Here is the list of steps to be followed (building a simple model without balancing the data):

- Import the required libraries and modules that you would need.
- Read the data into Python and call the dataframe churnData.
- Check the data types of all the columns. You will see that the TotalCharges column has type object. Convert this column to a numeric type using the pd.to_numeric function.
- Check for null values in the dataframe. Replace the null values.

- Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges:
    - Scale the features using either a normalizer or a standard scaler.
    - Split the data into a training set and a test set.
    - Fit a logistic regression model on the training data.
    - Check the accuracy on the test data.
    
Note: So far we have not balanced the data.

Managing imbalance in the dataset

- Check for the imbalance.
- Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.
- Each time, fit the model and check how its accuracy changes.

In [19]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')

# Scalers for the numerical features
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

# Train/test split and evaluation helpers
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Classifier used throughout the lab
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [2]:
# Read that data into Python and call the dataframe churnData.
churnData = pd.read_csv('files_for_lab/Customer-Churn.csv')
churnData.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.7,151.65,Yes


In [3]:
churnData.shape

(7043, 16)

In [4]:
# Check the datatypes of all the columns in the data. 
churnData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7043 non-null   object 
 15  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(13)
memory

In [5]:
# You would see that the column TotalCharges is object type. 
# Convert this column into numeric type using pd.to_numeric function.
churnData['TotalCharges']= pd.to_numeric(churnData['TotalCharges'], errors='coerce')

In [6]:
# Check the datatypes of all the columns in the data. 
churnData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7032 non-null   float64
 15  Churn             7043 non-null   object 
dtypes: float64(2), int64(2), object(12)
memory
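
The drop from 7043 to 7032 non-null values in TotalCharges comes from errors='coerce': entries that cannot be parsed as numbers become NaN instead of raising an error. A minimal sketch on a made-up Series (the values below are illustrative and not read from the file):

```python
import pandas as pd

# Hypothetical raw values: two valid numbers and one blank string.
raw = pd.Series(["29.85", " ", "1889.5"])

# errors='coerce' turns unparseable entries into NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # float64
print(converted.isna().sum())  # 1
```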

In [7]:
# Check the shape again: it is unchanged, because pd.to_numeric only coerced bad values to NaN and no rows were dropped
churnData.shape

(7043, 16)

In [8]:
# Check for null values in the dataframe. 
churnData.isna().any()

gender              False
SeniorCitizen       False
Partner             False
Dependents          False
tenure              False
PhoneService        False
OnlineSecurity      False
OnlineBackup        False
DeviceProtection    False
TechSupport         False
StreamingTV         False
StreamingMovies     False
Contract            False
MonthlyCharges      False
TotalCharges         True
Churn               False
dtype: bool

In [9]:
# drop rows with missing values in the column TotalCharges
#churnData.dropna(subset=['TotalCharges'], inplace = True)

In [10]:
# Replace the null values.
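# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to churnData.
churnData['TotalCharges'] = churnData['TotalCharges'].fillna(churnData['TotalCharges'].mean())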

In [11]:
# Check again for null values in the dataframe.
churnData.isna().any()

gender              False
SeniorCitizen       False
Partner             False
Dependents          False
tenure              False
PhoneService        False
OnlineSecurity      False
OnlineBackup        False
DeviceProtection    False
TechSupport         False
StreamingTV         False
StreamingMovies     False
Contract            False
MonthlyCharges      False
TotalCharges        False
Churn               False
dtype: bool

### Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges:

- Scale the features using either a normalizer or a standard scaler.
- Split the data into a training set and a test set.
- Fit a logistic regression model on the training data.
- Check the accuracy on the test data.

In [12]:
#Scale the features either by using normalizer or a standard scaler.
#first create new data frame with the numerical columns
numerical = churnData[['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']]

In [13]:
# check numerical data frame
numerical

Unnamed: 0,tenure,SeniorCitizen,MonthlyCharges,TotalCharges
0,1,0,29.85,29.85
1,34,0,56.95,1889.50
2,2,0,53.85,108.15
3,45,0,42.30,1840.75
4,2,0,70.70,151.65
...,...,...,...,...
7038,24,0,84.80,1990.50
7039,72,0,103.20,7362.90
7040,11,0,29.60,346.45
7041,4,1,74.40,306.60


In [14]:
# Now scale the features to the range [0, 1] using MinMaxScaler.
transformer = MinMaxScaler().fit(numerical) 
x_minmax = transformer.transform(numerical)
print(x_minmax.shape)

(7043, 4)


In [15]:
numerical_norm = pd.DataFrame(x_minmax,index=numerical.index,columns=numerical.columns)
numerical_norm

Unnamed: 0,tenure,SeniorCitizen,MonthlyCharges,TotalCharges
0,0.013889,0.0,0.115423,0.001275
1,0.472222,0.0,0.385075,0.215867
2,0.027778,0.0,0.354229,0.010310
3,0.625000,0.0,0.239303,0.210241
4,0.027778,0.0,0.521891,0.015330
...,...,...,...,...
7038,0.333333,0.0,0.662189,0.227521
7039,1.000000,0.0,0.845274,0.847461
7040,0.152778,0.0,0.112935,0.037809
7041,0.055556,1.0,0.558706,0.033210
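
MinMaxScaler maps each column to [0, 1] with x' = (x - min) / (max - min). A minimal sketch on a toy tenure column; the range 0 to 72 is an assumption inferred from the table above, where tenure 72 maps to 1.0 and tenure 1 maps to 0.013889:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy tenure column (assumed range 0..72, inferred from the scaled table).
tenure = np.array([[0.0], [1.0], [34.0], [72.0]])

# Manual min-max scaling: (x - min) / (max - min)
col_min = tenure.min(axis=0)
col_max = tenure.max(axis=0)
manual = (tenure - col_min) / (col_max - col_min)

# The same transformation through scikit-learn
scaled = MinMaxScaler().fit_transform(tenure)

print(np.allclose(manual, scaled))    # True
print(round(float(scaled[1, 0]), 6))  # 0.013889
```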


In [16]:
#Split the data into a training set and a test set.
#x-y split

y = churnData['Churn']
X = numerical_norm

In [17]:
X_train, X_test, y_train, y_test = train_test_split(numerical_norm, y, test_size=0.3, random_state=42)
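
Note that the cells above fit the scaler on all rows before splitting, so the test rows influence the minimum and maximum used for scaling. A common alternative is to split first and let a Pipeline fit the scaler on the training rows only. The sketch below uses synthetic data from make_classification; the dataset and all variable names are assumptions for illustration, not part of the lab:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in with four features, like the four columns used in the lab.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0, random_state=42
)

# Split first, so the scaler never sees the test rows during fitting.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The pipeline fits MinMaxScaler on X_tr only, then reuses it for X_te.
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=10000))
model.fit(X_tr, y_tr)

acc = model.score(X_te, y_te)
print(acc)
```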

In [20]:
# Fit a logistic regression model on the training data.
classification = LogisticRegression(random_state=42, max_iter=10000)
classification.fit(X_train, y_train)
predictions = classification.predict(X_test)
print(metrics.classification_report(y_test, predictions))

              precision    recall  f1-score   support

          No       0.82      0.93      0.87      1539
         Yes       0.69      0.44      0.54       574

    accuracy                           0.80      2113
   macro avg       0.75      0.69      0.70      2113
weighted avg       0.78      0.80      0.78      2113



In [21]:
# Check the accuracy on the test data.
classification.score(X_test, y_test)

0.7950780880265026

# Managing imbalance in the dataset

- Check for the imbalance.
- Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.
- Each time, fit the model and check how its accuracy changes.

In [22]:
#Check for the imbalance.
churnData['Churn'].value_counts()

No     5174
Yes    1869
Name: Churn, dtype: int64
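
To put these counts in perspective, a model that always predicts 'No' would already be correct for every non-churner. A quick arithmetic sketch using the counts printed above (computed on the full dataset, not on the test split):

```python
# Class counts copied from the value_counts() output above.
counts = {"No": 5174, "Yes": 1869}
total = sum(counts.values())

# Share of churners, and the accuracy of always predicting the majority class.
churn_share = counts["Yes"] / total
baseline_accuracy = max(counts.values()) / total

print(total)                        # 7043
print(round(churn_share, 3))        # 0.265
print(round(baseline_accuracy, 3))  # 0.735
```

Against this baseline, the 0.795 accuracy of the unbalanced model is a smaller gain than it first appears.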

In [23]:
# Double-check with the target y
#Check for the imbalance.
y.value_counts()

No     5174
Yes    1869
Name: Churn, dtype: int64

## Start with upsampling

In [25]:
category_no = churnData[churnData['Churn'] == 'No']
category_yes = churnData[churnData['Churn'] == 'Yes']

# Upsampling
# Because we repeat observations, the same row can be picked more than once,
# so we need to pass replace=True

category_yes = category_yes.sample(len(category_no), replace=True)  
(category_yes.shape)

(5174, 16)

In [26]:
churnData_2 = pd.concat([category_no, category_yes], axis=0)
#shuffling the data
churnData_2 = churnData_2.sample(frac=1)
print(churnData_2['Churn'].value_counts())

Yes    5174
No     5174
Name: Churn, dtype: int64


In [28]:
# check new data frame churnData_2
churnData_2

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
251,Female,0,Yes,Yes,2,Yes,No,No,No,No,No,No,Month-to-month,70.40,147.15,Yes
2992,Male,0,No,No,12,No,No,Yes,No,Yes,No,No,Month-to-month,34.00,442.45,No
3706,Male,0,No,No,26,Yes,No,No,Yes,Yes,No,No,Month-to-month,84.30,2281.60,No
5890,Female,0,No,No,69,Yes,Yes,No,Yes,Yes,Yes,Yes,Two year,85.35,5897.40,No
3700,Male,0,No,Yes,20,Yes,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Month-to-month,19.40,374.50,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3142,Male,0,No,No,1,Yes,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Month-to-month,19.80,19.80,Yes
953,Female,0,No,No,15,Yes,Yes,No,Yes,Yes,No,No,Month-to-month,58.95,955.15,No
3357,Female,0,No,Yes,1,Yes,No,No,Yes,No,No,Yes,Month-to-month,59.20,59.20,Yes
5103,Female,0,Yes,Yes,28,Yes,Yes,No,No,No,No,Yes,One year,82.85,2320.80,No


In [64]:
#first create new data frame with the numerical columns
numericals_2 = churnData_2[['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']]
numericals_2

Unnamed: 0,tenure,SeniorCitizen,MonthlyCharges,TotalCharges
251,2,0,70.40,147.15
2992,12,0,34.00,442.45
3706,26,0,84.30,2281.60
5890,69,0,85.35,5897.40
3700,20,0,19.40,374.50
...,...,...,...,...
3142,1,0,19.80,19.80
953,15,0,58.95,955.15
3357,1,0,59.20,59.20
5103,28,0,82.85,2320.80


In [80]:
# Again scale the features to the range [0, 1] using MinMaxScaler.
transformer = MinMaxScaler().fit(numericals_2) 
x_minmax = transformer.transform(numericals_2)
print(x_minmax.shape)

(10348, 4)


In [81]:
numerical_norm_2 = pd.DataFrame(x_minmax,index=numericals_2.index,columns=numericals_2.columns)
numerical_norm_2

Unnamed: 0,tenure,SeniorCitizen,MonthlyCharges,TotalCharges
251,0.027778,0.0,0.518905,0.014811
2992,0.166667,0.0,0.156716,0.048886
3706,0.361111,0.0,0.657214,0.261112
5890,0.958333,0.0,0.667662,0.678352
3700,0.277778,0.0,0.011443,0.041045
...,...,...,...,...
3142,0.013889,0.0,0.015423,0.000115
953,0.208333,0.0,0.404975,0.108049
3357,0.013889,0.0,0.407463,0.004662
5103,0.388889,0.0,0.642786,0.265636


In [84]:
# x-y split
y = churnData_2['Churn']
# X = numericals_2

In [85]:
#train test
X_train, X_test, y_train, y_test = train_test_split(numerical_norm_2, y, test_size=0.3, random_state=42)

In [83]:
# Fit a logistic regression model on the training data.
classification_2 = LogisticRegression(random_state=42, max_iter=10000)
classification_2.fit(X_train, y_train)
predictions_2 = classification_2.predict(X_test)
print(metrics.classification_report(y_test, predictions_2))

              precision    recall  f1-score   support

          No       0.72      0.74      0.73      1524
         Yes       0.74      0.72      0.73      1581

    accuracy                           0.73      3105
   macro avg       0.73      0.73      0.73      3105
weighted avg       0.73      0.73      0.73      3105



In [70]:
classification_2.score(X_test, y_test)

0.7320450885668277

## First comparison: imbalanced data vs. upsampled data

In [96]:
# Upsampling lowers overall accuracy but improves detection of churners:
# recall for 'Yes' rises from 0.44 to 0.72.
# Imbalanced data accuracy: 0.7950780880265026
# Upsampling data accuracy: 0.7320450885668277
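
One caveat about the upsampling cells above: the minority rows were duplicated before train_test_split, so copies of the same customer can end up in both the training and the test set. A minimal sketch of the alternative, resampling only the training split, on synthetic data (the demo frame, its values, and the use of sklearn.utils.resample are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, imbalanced stand-in for the churn data (values are made up).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=1000),
    "Churn": np.where(rng.random(1000) < 0.25, "Yes", "No"),
})

# 1. Split first, so the test set keeps the original class ratio.
train, test = train_test_split(
    demo, test_size=0.3, random_state=42, stratify=demo["Churn"]
)

# 2. Upsample the minority class inside the training split only.
train_no = train[train["Churn"] == "No"]
train_yes = train[train["Churn"] == "Yes"]
train_yes_up = resample(
    train_yes, replace=True, n_samples=len(train_no), random_state=42
)

# 3. Recombine and shuffle the balanced training data.
train_balanced = pd.concat([train_no, train_yes_up]).sample(frac=1, random_state=42)

print(train_balanced["Churn"].value_counts())
print(test["Churn"].value_counts())
```

With this ordering, the test set contains no duplicated rows and still reflects the real class ratio.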

## Now downsampling

In [71]:
category_0 = churnData[churnData['Churn'] == 'No']
category_1 = churnData[churnData['Churn'] == 'Yes']

# We pick a random sample of rows from "category_0" (the majority class)
# with as many observations as there are in "category_1"
category_0 = category_0.sample(len(category_1))
print(category_0.shape)
print(category_1.shape)

df_2 = pd.concat([category_0, category_1], axis=0)
#shuffling the data
df_2 = df_2.sample(frac=1)
df_2['Churn'].value_counts()

(1869, 16)
(1869, 16)


No     1869
Yes    1869
Name: Churn, dtype: int64
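
The sample calls above do not fix a seed, so each run draws different rows and the scores below can vary slightly from run to run. A minimal sketch, on made-up data, of downsampling with a fixed random_state (the demo frame and the seed value are assumptions):

```python
import pandas as pd

# Tiny made-up frame: 8 majority rows and 3 minority rows.
demo = pd.DataFrame({"Churn": ["No"] * 8 + ["Yes"] * 3, "tenure": range(11)})

majority = demo[demo["Churn"] == "No"]
minority = demo[demo["Churn"] == "Yes"]

# Draw as many majority rows as there are minority rows, reproducibly.
majority_down = majority.sample(n=len(minority), random_state=42)

# Recombine and shuffle, again with a fixed seed.
balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=42)

print(balanced["Churn"].value_counts())
```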

In [73]:
churnData_3 = pd.concat([category_0, category_1], axis=0)
#shuffling the data
churnData_3 = churnData_3.sample(frac=1)
print(churnData_3['Churn'].value_counts())

Yes    1869
No     1869
Name: Churn, dtype: int64


In [86]:
#create new data frame with the numerical columns
numericals_3 = churnData_3[['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']]
numericals_3

Unnamed: 0,tenure,SeniorCitizen,MonthlyCharges,TotalCharges
1834,1,1,45.10,45.10
1057,10,0,84.70,832.05
651,1,0,74.60,74.60
2175,30,0,85.15,2555.90
4517,11,1,99.55,1131.20
...,...,...,...,...
6926,56,0,73.85,4092.85
4516,72,0,87.55,6463.15
2092,72,0,79.55,5810.90
950,2,1,44.95,85.15


In [87]:
# Again scale the features to the range [0, 1] using MinMaxScaler.
transformer = MinMaxScaler().fit(numericals_3) 
x_minmax = transformer.transform(numericals_3)
print(x_minmax.shape)

(3738, 4)


In [88]:
numerical_norm_3 = pd.DataFrame(x_minmax,index=numericals_3.index,columns=numericals_3.columns)
numerical_norm_3

Unnamed: 0,tenure,SeniorCitizen,MonthlyCharges,TotalCharges
1834,0.013889,1.0,0.266069,0.003029
1057,0.138889,0.0,0.660688,0.093839
651,0.013889,0.0,0.560040,0.006433
2175,0.416667,0.0,0.665172,0.292761
4517,0.152778,1.0,0.808670,0.128359
...,...,...,...,...
6926,0.777778,0.0,0.552566,0.470116
4516,1.000000,0.0,0.689088,0.743635
2092,1.000000,0.0,0.609367,0.668369
950,0.027778,1.0,0.264574,0.007651


In [90]:
# x y split
y = churnData_3['Churn']
# x = numerical_norm_3

In [91]:
# train test
X_train, X_test, y_train, y_test = train_test_split(numerical_norm_3, y, test_size=0.3, random_state=42)

In [92]:
classification_3 = LogisticRegression(random_state=42, max_iter=10000)
classification_3.fit(X_train, y_train)
predictions = classification_3.predict(X_test)
print(metrics.classification_report(y_test, predictions))

              precision    recall  f1-score   support

          No       0.71      0.70      0.70       549
         Yes       0.72      0.73      0.72       573

    accuracy                           0.71      1122
   macro avg       0.71      0.71      0.71      1122
weighted avg       0.71      0.71      0.71      1122



In [94]:
classification_3.score(X_test, y_test)

0.713903743315508

In [None]:
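# Both resampling strategies lower overall accuracy, but recall for 'Yes'
# rises from 0.44 (imbalanced) to 0.72 (upsampling) and 0.73 (downsampling).
# Imbalanced data accuracy: 0.7950780880265026
# Upsampling data accuracy: 0.7320450885668277
# Downsampling data accuracy: 0.713903743315508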
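
Accuracy alone can hide how well the minority class is detected, which is why the classification reports above are worth reading alongside the scores. A minimal sketch on a toy example (the labels below are made up) comparing accuracy with recall for 'Yes' and balanced accuracy:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Toy labels: 8 non-churners, 2 churners, and a model that always says "No".
y_true_demo = ["No"] * 8 + ["Yes"] * 2
y_pred_demo = ["No"] * 10

acc_demo = accuracy_score(y_true_demo, y_pred_demo)
rec_demo = recall_score(y_true_demo, y_pred_demo, pos_label="Yes")
bal_demo = balanced_accuracy_score(y_true_demo, y_pred_demo)

print(acc_demo)  # 0.8
print(rec_demo)  # 0.0
print(bal_demo)  # 0.5
```

The always-'No' model scores 0.8 on accuracy while finding no churners at all; balanced accuracy, the mean of the per-class recalls, exposes this.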