
# Lab | Handling Data Imbalance in Classification Models

### Scenario
You are working as an analyst for an internet service provider. You are given historical data about the company's customers and their churn trends. Your task is to build a machine learning model that helps the company identify customers who are more likely to churn, so that it can prevent losses from such customers.

## Instructions
In this lab, we will first look at the degree of imbalance in the data and then correct it using the techniques we learned in class.

### FIRST STEP
Here is the list of steps to be followed (building a simple model without balancing the data):

- Import the required libraries and modules that you would need.
- Read the data into Python and call the dataframe `churnData`.
- Check the datatypes of all the columns in the data. You will see that the column `TotalCharges` is of object type. Convert this column to a numeric type using the `pd.to_numeric` function.
- Check for null values in the dataframe. Replace the null values.
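
The conversion and null-handling steps above can be sketched as follows. This is a minimal illustration on a small made-up frame (`toy`), not the lab's data. Using `errors='coerce'` is one option for turning entries that cannot be parsed (such as blank strings) into NaN, and filling with the column mean is only one possible replacement strategy.

```python
import pandas as pd

# Toy stand-in for churnData (values are made up for illustration)
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15", "1889.5"]})

# object -> numeric; entries that cannot be parsed (here the blank string) become NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Count the nulls, then replace them with the column mean
n_missing = int(toy["TotalCharges"].isnull().sum())
toy["TotalCharges"] = toy["TotalCharges"].fillna(toy["TotalCharges"].mean())

print(toy.dtypes)
print(n_missing)
```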

### SECOND STEP
- Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges:
    - Scale the features using either a normalizer or a standard scaler.
    - Split the data into a training set and a test set.
    - Fit a logistic regression model on the training data.
    - Check the accuracy on the test data.
Note: So far we have not balanced the data.
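
The four sub-steps above can be sketched end to end as shown here. This is a hedged illustration on synthetic data: the feature values and the labelling rule are made up, and wrapping the scaler and the classifier in a scikit-learn pipeline is one possible design, chosen so that the scaler is fitted on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four lab features (not the real churn data)
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "tenure": rng.integers(1, 73, size=n),
    "SeniorCitizen": rng.integers(0, 2, size=n),
    "MonthlyCharges": rng.uniform(20, 120, size=n),
})
X["TotalCharges"] = X["tenure"] * X["MonthlyCharges"]
# Made-up rule: short-tenure customers are labelled as churners
y = np.where(X["tenure"] < 20, "Yes", "No")

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler on the training rows only, then the classifier
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(accuracy)
```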

### THIRD STEP
Managing imbalance in the dataset

- Check for the imbalance.
- Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.
- Each time, fit the model and check how its accuracy changes.

### FIRST STEP

In [79]:
# Importing the libraries & modules
import pandas as pd
import numpy as np
import statsmodels.api as sm

from sklearn.utils import resample
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [80]:
# Reading the data

churnData = pd.read_csv('./files_for_lab/Customer-Churn.csv')
churnData.head(489)

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.30,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
484,Female,0,Yes,Yes,8,Yes,No,Yes,No,No,Yes,No,Month-to-month,83.55,680.05,Yes
485,Male,0,Yes,Yes,72,Yes,No,No,No,No,No,Yes,Two year,84.50,6130.85,No
486,Female,0,No,No,15,Yes,No,Yes,No,No,Yes,Yes,Month-to-month,100.15,1415,No
487,Male,0,No,No,72,Yes,No,Yes,Yes,Yes,Yes,Yes,Two year,88.60,6201.95,No


In [81]:
# Checking the data types
churnData.dtypes

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

In [82]:
# Cleaning the TotalCharges column

def clean_spac(row):
    return row.replace(" ","")
churnData['TotalCharges'] = list(map(clean_spac,churnData['TotalCharges']))
churnData['TotalCharges']

0         29.85
1        1889.5
2        108.15
3       1840.75
4        151.65
         ...   
7038     1990.5
7039     7362.9
7040     346.45
7041      306.6
7042     6844.5
Name: TotalCharges, Length: 7043, dtype: object

In [83]:
# Converting into a Series
s = pd.Series(churnData['TotalCharges'])

In [84]:
churnData['TotalCharges'] = pd.to_numeric(s, downcast='float')

In [85]:
churnData.dtypes

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges        float32
Churn                object
dtype: object

In [86]:
# Check for null values in the dataframe & replace them

# Checking NaN values
churnData.isnull().sum()/len(churnData)

gender              0.000000
SeniorCitizen       0.000000
Partner             0.000000
Dependents          0.000000
tenure              0.000000
PhoneService        0.000000
OnlineSecurity      0.000000
OnlineBackup        0.000000
DeviceProtection    0.000000
TechSupport         0.000000
StreamingTV         0.000000
StreamingMovies     0.000000
Contract            0.000000
MonthlyCharges      0.000000
TotalCharges        0.001562
Churn               0.000000
dtype: float64

In [94]:
churnData['TotalCharges'].mean()

2283.298828125

In [71]:
# Replacing null values from 'TotalCharges' column
total_charges_mean = churnData['TotalCharges'].mean()

def replaceNaN(x):
    # NaN never compares equal to anything, so test for it with pd.isna
    if pd.isna(x):
        return total_charges_mean
    else:
        return x

churnData['TotalCharges'] = list(map(replaceNaN,churnData['TotalCharges']))

In [96]:
churnData['TotalCharges'] = churnData['TotalCharges'].fillna(churnData['TotalCharges'].mean())

In [97]:
churnData.isnull().sum()/len(churnData)

gender              0.0
SeniorCitizen       0.0
Partner             0.0
Dependents          0.0
tenure              0.0
PhoneService        0.0
OnlineSecurity      0.0
OnlineBackup        0.0
DeviceProtection    0.0
TechSupport         0.0
StreamingTV         0.0
StreamingMovies     0.0
Contract            0.0
MonthlyCharges      0.0
TotalCharges        0.0
Churn               0.0
dtype: float64

### SECOND STEP

Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges:
1. Scale the features using either a normalizer or a standard scaler.
2. Split the data into a training set and a test set.
3. Fit a logistic regression model on the training data.
4. Check the accuracy on the test data. Note: So far we have not balanced the data.

In [98]:
# Converting the selected features into a dataframe
churndata_features = churnData[['tenure','SeniorCitizen','MonthlyCharges','TotalCharges']]
churndata_features.head()

Unnamed: 0,tenure,SeniorCitizen,MonthlyCharges,TotalCharges
0,1,0,29.85,29.85
1,34,0,56.95,1889.5
2,2,0,53.85,108.150002
3,45,0,42.3,1840.75
4,2,0,70.7,151.649994


In [100]:
churndata_features_df = pd.DataFrame(churndata_features)

In [99]:
churndata_features.isnull().sum()/len(churndata_features)

tenure            0.0
SeniorCitizen     0.0
MonthlyCharges    0.0
TotalCharges      0.0
dtype: float64

#### 1. Scaling the features w/ a min-max scaler

In [101]:
from sklearn.preprocessing import MinMaxScaler

churndata_features_normalized = pd.DataFrame(MinMaxScaler().fit_transform(churndata_features), columns=churndata_features_df.columns)

churndata_features_normalized.head()

Unnamed: 0,tenure,SeniorCitizen,MonthlyCharges,TotalCharges
0,0.013889,0.0,0.115423,0.001275
1,0.472222,0.0,0.385075,0.215867
2,0.027778,0.0,0.354229,0.01031
3,0.625,0.0,0.239303,0.210241
4,0.027778,0.0,0.521891,0.01533


In [102]:
churndata_features_normalized.isnull().sum()/len(churndata_features_normalized)

tenure            0.0
SeniorCitizen     0.0
MonthlyCharges    0.0
TotalCharges      0.0
dtype: float64

In [103]:
# X-y build

X = churndata_features_normalized
y = churnData['Churn']

#### 2. Splitting the data into train-test

In [104]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

#### 3. Fit a logistic regression model on the training data.

In [105]:
from sklearn.linear_model import LogisticRegression


classification = LogisticRegression(random_state=0, solver='saga',
                  multi_class='ovr').fit(X_train, y_train)
                    # ovr (one-vs-rest), since it is a binary problem

#### 4. Check the accuracy on the test data.

In [106]:
# 4.1 Check the accuracy on the test data.
predictions = classification.predict(X_test)
classification.score(X_test, y_test)

0.7830777967064169

In [108]:
# 4.2 Checking the accuracy on the test data w/ a different method
from sklearn.metrics import accuracy_score

score = accuracy_score(y_test,predictions)
score

0.7830777967064169

Using the selected features, the model correctly predicts the outcome for approximately 78% of the test instances.
- (True Churn('Yes') + True Churn('No')) / All predictions = 0.783
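
To make the formula above concrete, here is a small sketch that counts the four outcome types by hand and derives accuracy from them. The labels are toy values made up for illustration, not the lab's predictions, and `confusion_counts` is a hypothetical helper. It also computes the recall for 'Yes', which accuracy alone does not reveal.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive="Yes"):
    """Count TP, TN, FP and FN for a binary problem."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts

# Toy labels, made up for illustration
y_true = ["No", "No", "No", "Yes", "Yes", "No", "Yes", "No"]
y_pred = ["No", "No", "Yes", "Yes", "No", "No", "No", "No"]

c = confusion_counts(y_true, y_pred)
accuracy = (c["TP"] + c["TN"]) / len(y_true)   # (true 'Yes' + true 'No') / all predictions
recall_yes = c["TP"] / (c["TP"] + c["FN"])     # share of actual churners that were caught
print(dict(c), accuracy, recall_yes)
```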

### THIRD STEP

Managing imbalance in the dataset

5. Check for the imbalance.
6. Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.
7. Each time fit the model and see how the accuracy of the model is.

#### 5. Checking the imbalance for my target variable

In [109]:
churnData['Churn'].value_counts()

No     5174
Yes    1869
Name: Churn, dtype: int64

In [111]:
5174/(5174+1869)
# It is clearly imbalanced (about 73% 'No' out of all customers)

0.7346301292063041
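
This ratio also gives a useful reference point: a "model" that always predicts 'No' would already be right about 73% of the time. The short calculation below compares that baseline with the test accuracy from step 4. It is only a rough comparison, since the baseline here is computed on the full dataset rather than on the test split.

```python
# Class counts taken from the value_counts() output above
n_no, n_yes = 5174, 1869
model_accuracy = 0.7830777967064169  # test accuracy reported in step 4

# Accuracy of always predicting the majority class 'No'
baseline_accuracy = n_no / (n_no + n_yes)
gain = model_accuracy - baseline_accuracy

print(round(baseline_accuracy, 4))  # 0.7346
print(round(gain, 4))               # 0.0484
```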

#### 6. Using the resampling strategies for upsampling and downsampling to create a balance

In [112]:
#Resampling to upsample or downsample

from sklearn.utils import resample

# Separating my Churn values
category_No = churnData[churnData['Churn'] == 'No']
category_Yes = churnData[churnData['Churn'] == 'Yes']

In [114]:
# Checking the sampling will work
category_No.sample(5)

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
6812,Female,1,No,No,45,Yes,Yes,Yes,No,Yes,No,No,Month-to-month,85.7,3778.100098,No
4915,Female,0,Yes,Yes,22,No,No,Yes,No,No,Yes,No,Month-to-month,39.2,849.900024,No
5093,Male,0,No,No,49,Yes,No,No,No,Yes,Yes,No,Two year,66.15,3199.0,No
2863,Male,0,No,No,1,Yes,No,No,No,No,No,No,Month-to-month,44.6,44.599998,No
1105,Female,0,Yes,No,9,Yes,No,No,No,No,No,Yes,Month-to-month,79.75,769.099976,No


In [115]:
## UPSAMPLING ##

# Upsampling my smallest category, in this case, 'Yes'
category_Yes_over = resample(category_Yes, # the category to upsample from
                                   replace=True, # sample with replacement: the same row may be drawn more than once
                                   n_samples = len(category_No)) # sample size equals the size of the larger category

In [116]:
# Checking new category values shape
print(category_Yes_over.shape)
print(category_No.shape)

(5174, 16)
(5174, 16)


In [117]:
# Concatenating my oversampled minority df w/ the original majority category df
churnData_upsampled = pd.concat([category_Yes_over, category_No], axis=0)

In [118]:
# Checking Imbalance, should get the same nº
churnData_upsampled['Churn'].value_counts()

Yes    5174
No     5174
Name: Churn, dtype: int64

In [119]:
## DOWNSAMPLING ##

# Downsampling the largest category, in this case, 'No'
category_No_under = resample(category_No,
                                   replace=False, 
                                   n_samples = len(category_Yes))

In [120]:
# Checking new category values shape
print(category_No_under.shape)
print(category_Yes.shape)

(1869, 16)
(1869, 16)


In [121]:
# Concatenating my undersampled df w/ original lower category df
churnData_downsampled = pd.concat([category_No_under, category_Yes], axis=0)

In [122]:
# Checking Imbalance, should get the same nº
churnData_downsampled['Churn'].value_counts()

Yes    1869
No     1869
Name: Churn, dtype: int64
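
The two resampling cells above follow the same pattern, so they could be folded into one helper. The sketch below is one possible way to do that: `balance_classes` is a hypothetical name, the toy frame is made up, and it assumes a two-class target that is actually imbalanced. Passing `random_state` to `resample` makes the draw repeatable between runs.

```python
import pandas as pd
from sklearn.utils import resample

def balance_classes(df, target, strategy="up", random_state=0):
    """Return a frame whose two classes in `target` have equal counts."""
    counts = df[target].value_counts()
    majority = df[df[target] == counts.idxmax()]
    minority = df[df[target] == counts.idxmin()]
    if strategy == "up":
        # Draw minority rows with replacement until they match the majority
        minority = resample(minority, replace=True,
                            n_samples=len(majority), random_state=random_state)
    else:
        # Draw a subset of majority rows without replacement
        majority = resample(majority, replace=False,
                            n_samples=len(minority), random_state=random_state)
    return pd.concat([majority, minority], axis=0)

# Toy frame, made up for illustration: 7 'No' rows and 3 'Yes' rows
toy = pd.DataFrame({"tenure": range(10),
                    "Churn": ["No"] * 7 + ["Yes"] * 3})
up = balance_classes(toy, "Churn", strategy="up")
down = balance_classes(toy, "Churn", strategy="down")
print(up["Churn"].value_counts().to_dict())
print(down["Churn"].value_counts().to_dict())
```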

#### 7. Each time fit the model and see how the accuracy of the model is.

In [None]:
## Oversampled 'Churn' 

In [129]:
# Converting the selected features into a dataframe
churndata_features_over = churnData_upsampled[['tenure','SeniorCitizen','MonthlyCharges','TotalCharges']]
churndata_features_over.head()

features_over_df = pd.DataFrame(churndata_features_over)

In [131]:
# Normalizing
churndata_features__over_normalized = pd.DataFrame(MinMaxScaler().fit_transform(churndata_features_over), columns=features_over_df.columns)

churndata_features__over_normalized.head()

Unnamed: 0,tenure,SeniorCitizen,MonthlyCharges,TotalCharges
0,0.319444,0.0,0.714428,0.248211
1,0.069444,0.0,0.624378,0.042788
2,0.333333,0.0,0.859701,0.291213
3,0.277778,1.0,0.672139,0.197173
4,0.041667,0.0,0.661692,0.028393


In [144]:
# X-y build

X_over = churndata_features__over_normalized
y_over = churnData_upsampled['Churn']

In [145]:
# Train-test split
X_train_over, X_test_over, y_train_over, y_test_over = train_test_split(X_over, y_over, random_state=0)

In [146]:
# Fitting the LogisticRegression model w/ our Oversampled data
classification = LogisticRegression(random_state=0, solver='saga',
                  multi_class='ovr').fit(X_train_over, y_train_over)

In [147]:
# Checking the accuracy on the test data.
predictions = classification.predict(X_test_over)
classification.score(X_test_over, y_test_over)

0.7263239273289525

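The accuracy dropped by roughly 6 percentage points compared with the original data (from 0.783 to 0.726). Note that this test set is drawn from the resampled data, so it is balanced and the two scores are not measured on the same class distribution.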

In [137]:
## Undersampled 'Churn' 

In [139]:
# Converting the selected features into a dataframe
churndata_features_down = churnData_downsampled[['tenure','SeniorCitizen','MonthlyCharges','TotalCharges']]
churndata_features_down.head()

features_under_df = pd.DataFrame(churndata_features_down)

In [140]:
# Normalizing
churndata_features__under_normalized = pd.DataFrame(MinMaxScaler().fit_transform(churndata_features_down), columns=features_under_df.columns)

churndata_features__under_normalized.head()

Unnamed: 0,tenure,SeniorCitizen,MonthlyCharges,TotalCharges
0,0.152778,1.0,0.599502,0.098708
1,0.902778,0.0,0.854229,0.770973
2,0.111111,0.0,0.012935,0.016426
3,0.569444,0.0,0.801493,0.454691
4,0.597222,0.0,0.328358,0.246111


In [148]:
# X-y build
X_under = churndata_features__under_normalized
y_under = churnData_downsampled['Churn']

In [149]:
# Train-test split
X_train_under, X_test_under, y_train_under, y_test_under = train_test_split(X_under, y_under, random_state=0)

In [150]:
# Fitting the model
classification = LogisticRegression(random_state=0, solver='saga',
                  multi_class='ovr').fit(X_train_under, y_train_under)

In [151]:
# Checking the accuracy on the test data.
predictions = classification.predict(X_test_under)
classification.score(X_test_under, y_test_under)

0.7294117647058823

The accuracy dropped again, by roughly 5 percentage points compared with the original data (from 0.783 to 0.729).
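
One caveat about the comparisons in this step: the resampling was done before the train-test split. With upsampling, copies of the same customer can therefore land in both the training and the test set, and in both cases the test set no longer has the original class distribution. A possible alternative order, sketched here on synthetic data (the features, the labelling rule and all values are made up), is to split first, resample only the training rows, and evaluate on the untouched test rows, looking at recall for 'Yes' alongside accuracy.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, imbalanced stand-in for the churn data (made up for illustration)
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({"tenure": rng.integers(1, 73, size=n),
                   "MonthlyCharges": rng.uniform(20, 120, size=n)})
# Made-up rule with noise: short tenure and high charges raise the churn odds
score = -0.05 * df["tenure"] + 0.02 * df["MonthlyCharges"] + rng.normal(0, 1, n)
df["Churn"] = np.where(score > 0.5, "Yes", "No")

# 1. Split first, keeping the original class ratio in both parts
train, test = train_test_split(df, test_size=0.25,
                               stratify=df["Churn"], random_state=0)

# 2. Upsample 'Yes' in the training rows only
train_no = train[train["Churn"] == "No"]
train_yes = train[train["Churn"] == "Yes"]
train_yes_over = resample(train_yes, replace=True,
                          n_samples=len(train_no), random_state=0)
train_balanced = pd.concat([train_no, train_yes_over], axis=0)

# 3. Fit on the balanced training data, evaluate on the untouched test data
features = ["tenure", "MonthlyCharges"]
model = LogisticRegression(max_iter=1000).fit(train_balanced[features],
                                              train_balanced["Churn"])
pred = model.predict(test[features])

test_accuracy = accuracy_score(test["Churn"], pred)
test_recall_yes = recall_score(test["Churn"], pred, pos_label="Yes")
print(test_accuracy, test_recall_yes)
```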

# Lab | Random Forests
### Instructions
- Apply the Random Forests algorithm, this time only on the upsampled data from churnData.
- Discuss the output and its impact on the business scenario. Is the cost of a false positive equal to the cost of a false negative? How would you change your algorithm or data in order to maximize the return for the business?
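
As a starting point for this lab, the sketch below fits a random forest on upsampled training data and turns the confusion matrix into a single cost figure. Everything in it is an assumption for illustration: the data are synthetic, and `COST_FP` and `COST_FN` are hypothetical amounts, not figures from the business scenario. The point is only to show how unequal error costs can be made explicit.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in data (made up): 1 = churn, 0 = no churn
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + rng.normal(scale=1.0, size=1000) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Upsample the churners in the training set only
X_min, X_maj = X_train[y_train == 1], X_train[y_train == 0]
X_min_over = resample(X_min, replace=True,
                      n_samples=len(X_maj), random_state=0)
X_bal = np.vstack([X_maj, X_min_over])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_over))

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_bal, y_bal)
tn, fp, fn, tp = confusion_matrix(
    y_test, forest.predict(X_test), labels=[0, 1]).ravel()

# Hypothetical costs: a retention offer spent on a loyal customer (FP)
# versus the revenue lost from a churner the model missed (FN)
COST_FP, COST_FN = 10.0, 100.0
total_cost = fp * COST_FP + fn * COST_FN
print(tn, fp, fn, tp, total_cost)
```

If a missed churner really is more expensive than a wasted offer, comparing models on `total_cost` rather than on accuracy would favour the ones that catch more churners, even at the price of more false positives.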