# Lab | Cross Validation

For this lab, we will build a model for a customer churn binary classification problem. You will be using the `files_for_lab/Customer-Churn.csv` file.

**Instructions**

1. Apply SMOTE for upsampling the data:
    - Use logistic regression to fit the model and compute the accuracy of the model.
    - Use a decision tree classifier to fit the model and compute the accuracy of the model.
    - Compare the accuracies of the two models.

2. Apply TomekLinks for downsampling the data:
    - Use logistic regression to fit the model and compute the accuracy of the model.
    - Use a decision tree classifier to fit the model and compute the accuracy of the model.
    - Compare the accuracies of the two models.


## Load libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import plot_tree

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
import warnings
warnings.filterwarnings('ignore')

## Import data

In [2]:
#path relative to the lab folder, so the notebook also runs on other machines
data = pd.read_csv('files_for_lab/Customer-Churn.csv')
data.head()

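|   | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | No | Yes | No | No | No | No | Month-to-month | 29.85 | 29.85 | No |
| 1 | Male | 0 | No | No | 34 | Yes | Yes | No | Yes | No | No | No | One year | 56.95 | 1889.5 | No |
| 2 | Male | 0 | No | No | 2 | Yes | Yes | Yes | No | No | No | No | Month-to-month | 53.85 | 108.15 | Yes |
| 3 | Male | 0 | No | No | 45 | No | Yes | No | Yes | Yes | No | No | One year | 42.3 | 1840.75 | No |
| 4 | Female | 0 | No | No | 2 | Yes | No | No | No | No | No | No | Month-to-month | 70.7 | 151.65 | Yes |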


## Data exploration

In [3]:
data.shape

(7043, 16)

In [4]:
data.dtypes

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

In [5]:
data.isna().sum()

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

In [6]:
data.isin(['', ' ']).sum()

gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

In [7]:
for i in data.columns.tolist():
    print(i, len(data[i].unique()))

gender 2
SeniorCitizen 2
Partner 2
Dependents 2
tenure 73
PhoneService 2
OnlineSecurity 3
OnlineBackup 3
DeviceProtection 3
TechSupport 3
StreamingTV 3
StreamingMovies 3
Contract 3
MonthlyCharges 1585
TotalCharges 6531
Churn 2


In [8]:
for i in data:
    print('Category: ', i)
    print(data[i].value_counts())
    print('\n')

Category:  gender
Male      3555
Female    3488
Name: gender, dtype: int64


Category:  SeniorCitizen
0    5901
1    1142
Name: SeniorCitizen, dtype: int64


Category:  Partner
No     3641
Yes    3402
Name: Partner, dtype: int64


Category:  Dependents
No     4933
Yes    2110
Name: Dependents, dtype: int64


Category:  tenure
1     613
72    362
2     238
3     200
4     176
     ... 
28     57
39     56
44     51
36     50
0      11
Name: tenure, Length: 73, dtype: int64


Category:  PhoneService
Yes    6361
No      682
Name: PhoneService, dtype: int64


Category:  OnlineSecurity
No                     3498
Yes                    2019
No internet service    1526
Name: OnlineSecurity, dtype: int64


Category:  OnlineBackup
No                     3088
Yes                    2429
No internet service    1526
Name: OnlineBackup, dtype: int64


Category:  DeviceProtection
No                     3095
Yes                    2422
No internet service    1526
Name: DeviceProtection, dtype: int64

In [9]:
#np.object was removed from recent NumPy releases; the built-in object selects the same columns
data.describe(include=[object]).T

|   | count | unique | top | freq |
|---|---|---|---|---|
| gender | 7043 | 2 | Male | 3555 |
| Partner | 7043 | 2 | No | 3641 |
| Dependents | 7043 | 2 | No | 4933 |
| PhoneService | 7043 | 2 | Yes | 6361 |
| OnlineSecurity | 7043 | 3 | No | 3498 |
| OnlineBackup | 7043 | 3 | No | 3088 |
| DeviceProtection | 7043 | 3 | No | 3095 |
| TechSupport | 7043 | 3 | No | 3473 |
| StreamingTV | 7043 | 3 | No | 2810 |
| StreamingMovies | 7043 | 3 | No | 2785 |


In [10]:
data.describe(include=[np.number]).T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SeniorCitizen | 7043.0 | 0.162147 | 0.368612 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| tenure | 7043.0 | 32.371149 | 24.559481 | 0.0 | 9.0 | 29.0 | 55.0 | 72.0 |
| MonthlyCharges | 7043.0 | 64.761692 | 30.090047 | 18.25 | 35.5 | 70.35 | 89.85 | 118.75 |


## Data cleaning

In [11]:
#convert TotalCharges from string (object) to numeric; the 11 blank entries become NaN
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

In [12]:
data.describe(include=[np.number]).T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SeniorCitizen | 7043.0 | 0.162147 | 0.368612 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| tenure | 7043.0 | 32.371149 | 24.559481 | 0.0 | 9.0 | 29.0 | 55.0 | 72.0 |
| MonthlyCharges | 7043.0 | 64.761692 | 30.090047 | 18.25 | 35.5 | 70.35 | 89.85 | 118.75 |
| TotalCharges | 7032.0 | 2283.300441 | 2266.771362 | 18.8 | 401.45 | 1397.475 | 3794.7375 | 8684.8 |


In [13]:
#replace the NaNs in TotalCharges with the column mean
data['TotalCharges'] = data['TotalCharges'].fillna(data['TotalCharges'].mean())

## X-y-split

In [14]:
#target variable is categorical
y = data['Churn']
y.value_counts()

No     5174
Yes    1869
Name: Churn, dtype: int64
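
The class counts above imply a baseline worth keeping in mind when reading the accuracies later in this lab: a model that always predicts the majority class `No` is already correct for roughly 73% of customers. The following is a minimal sketch of that arithmetic, using only the counts printed above (the variable names are illustrative and not part of the notebook):

```python
# Class counts copied from the y.value_counts() output above.
churn_counts = {"No": 5174, "Yes": 1869}

total = sum(churn_counts.values())
majority_class = max(churn_counts, key=churn_counts.get)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = churn_counts[majority_class] / total

print(majority_class, total, round(baseline_accuracy, 4))
```

Any accuracy measured on data with this class mix should be compared against this baseline rather than against 50%.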

In [15]:
X = data[['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']]

## Data preprocessing

In [16]:
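#StandardScaler: rescale each feature to zero mean and unit variance
transformer = StandardScaler().fit(X)
X_scaled = transformer.transform(X)
#keep the original column names instead of 0..3
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
print(X_scaled.shape)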

(7043, 4)
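
To make the scaling step concrete, here is a small sketch of the computation `StandardScaler` performs on each column: subtract the column mean and divide by the population standard deviation (`ddof=0`). It uses only the five `MonthlyCharges` values visible in `data.head()`, so the numbers differ from the notebook, where the scaler is fitted on all 7043 rows. The helper name `standardize` is made up for this illustration.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # population std (ddof=0), matching StandardScaler
    return [(v - mu) / sigma for v in values]

# The first five MonthlyCharges values shown in data.head().
monthly_charges = [29.85, 56.95, 53.85, 42.3, 70.7]
z = standardize(monthly_charges)

print([round(v, 3) for v in z])
# After scaling, the sample has mean ~0 and standard deviation ~1.
print(round(fmean(z), 6), round(pstdev(z), 6))
```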


## SMOTE - upsampling the data

In [17]:
from imblearn.over_sampling import SMOTE
smote = SMOTE()

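#fit_resample replaces fit_sample, which is no longer available in recent imbalanced-learn releases
X_sm, y_sm = smote.fit_resample(X_scaled, y)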
y_sm = y_sm.to_numpy()
y_sm = pd.Series(data=y_sm.flatten())
y_sm.value_counts()

Yes    5174
No     5174
dtype: int64
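
SMOTE reaches the 5174 / 5174 balance above by creating synthetic `Yes` rows rather than duplicating existing ones. The core idea is interpolation: pick a minority point, pick one of its nearest minority neighbours, and place a new point somewhere on the line segment between them. The sketch below illustrates that single step in plain Python. It is a simplified illustration, not imbalanced-learn's implementation; the function name and the toy points are assumptions, and `k=2` is used only because the toy set is tiny (imbalanced-learn's default is `k_neighbors=5`).

```python
import math
import random

def smote_like_sample(minority_points, k=2, rng=None):
    """Create one synthetic point between a minority point and one of its k nearest minority neighbours."""
    rng = rng or random.Random()
    base = rng.choice(minority_points)
    # k nearest minority neighbours of the base point (excluding the point itself).
    others = [p for p in minority_points if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    synthetic = tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
    return base, neighbour, synthetic

# Toy minority-class points in two dimensions (hypothetical values).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
base, neighbour, synthetic = smote_like_sample(minority, k=2, rng=random.Random(0))
print(base, neighbour, synthetic)
```

Because every synthetic row is derived from existing minority rows, the resampled data contains no new information about churners; it only changes how much weight the minority class carries during fitting.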

### Logistic Regression (SMOTE)

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.2, random_state=100)

#binary target, so the multi_class argument (deprecated in recent scikit-learn releases) is not needed
classification = LogisticRegression(random_state=0, max_iter=10000).fit(X_train, y_train.values.ravel())
y_pred = classification.predict(X_test)

print("The accuracy: ", (classification.score(X_test, y_test)))
print("The kappa: ", (cohen_kappa_score(y_test, y_pred)))

The accuracy:  0.7396135265700483
The kappa:  0.47935925158913184


### Decision Tree (SMOTE)

In [19]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
print("The accuracy of the model is: ", model.score(X_test, y_test))

The accuracy of the model is:  0.7652173913043478
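
Both SMOTE models above are scored on a test set that was split off after resampling, so that test set contains synthetic rows and no longer reflects the original class mix. A more conservative evaluation, and one closer to the cross-validation theme of this lab, resamples only the training portion of each fold and scores on untouched held-out rows. The sketch below shows that pattern under some loudly stated assumptions: it uses synthetic data from `make_classification` rather than the churn file, and plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE. The helper `oversample_minority` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic imbalanced data (roughly 73% / 27%), standing in for the churn features.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.73, 0.27], random_state=0,
)

def oversample_minority(X, y, seed):
    """Randomly duplicate minority rows until both classes have equal counts."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    X_extra = resample(X[y == minority], replace=True, n_samples=n_extra, random_state=seed)
    X_bal = np.vstack([X, X_extra])
    y_bal = np.concatenate([y, np.full(n_extra, minority)])
    return X_bal, y_bal

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=100)
fold_scores = []
for fold, (train_idx, test_idx) in enumerate(cv.split(X_demo, y_demo)):
    # Resample the training fold only; the held-out fold keeps its original class mix.
    X_bal, y_bal = oversample_minority(X_demo[train_idx], y_demo[train_idx], seed=fold)
    clf = LogisticRegression(max_iter=10000).fit(X_bal, y_bal)
    fold_scores.append(clf.score(X_demo[test_idx], y_demo[test_idx]))

print("Fold accuracies:", np.round(fold_scores, 3))
print("Mean accuracy:", np.mean(fold_scores))
```

With the real data, the same loop structure would apply, with `SMOTE().fit_resample` called on each training fold in place of the random oversampling helper.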


## TomekLinks - downsampling the data

In [20]:
from imblearn.under_sampling import TomekLinks

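#sampling_strategy must be passed as a keyword; fit_resample replaces the removed fit_sample
tl = TomekLinks(sampling_strategy='majority')
X_tl, y_tl = tl.fit_resample(X_scaled, y)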
pd.DataFrame(y_tl).value_counts()

Churn
No       4666
Yes      1869
dtype: int64
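
The counts above show that TomekLinks removed 508 `No` rows (5174 down to 4666) and left all 1869 `Yes` rows in place, so the data is cleaned near the class boundary rather than balanced. A Tomek link is a pair of points from different classes that are each other's nearest neighbour; with the `'majority'` strategy only the majority-class member of each link is dropped. The sketch below illustrates the definition on a one-dimensional toy set. It is a brute-force illustration with made-up points, not imbalanced-learn's implementation.

```python
import math

def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours with different labels."""
    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))
    nn = [nearest(i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

# Toy data: "No" is the majority class; indices 1 and 2 sit right next to each other.
points = [(0.0,), (1.0,), (1.1,), (3.0,), (3.2,)]
labels = ["No", "No", "Yes", "No", "No"]

links = tomek_links(points, labels)
# Drop only the majority-class ("No") member of each link.
to_drop = {idx for pair in links for idx in pair if labels[idx] == "No"}
kept = [i for i in range(len(points)) if i not in to_drop]
print(links, kept)  # [(1, 2)] [0, 2, 3, 4]
```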

### Logistic Regression (TomekLinks)

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X_tl, y_tl, test_size=0.2, random_state=100)

#binary target, so the multi_class argument (deprecated in recent scikit-learn releases) is not needed
classification = LogisticRegression(random_state=0, max_iter=10000).fit(X_train, y_train)
y_pred = classification.predict(X_test)

print("The accuracy: ", (classification.score(X_test, y_test)))
print("The kappa:", (cohen_kappa_score(y_test, y_pred)))

The accuracy:  0.7972456006120887
The kappa: 0.44370562036327665


### Decision Tree (TomekLinks)

In [22]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
print("The accuracy of the model is: ", model.score(X_test, y_test))

The accuracy of the model is:  0.7597551644988524
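
## Comparing the models

The instructions ask for a comparison of the two models under each resampling method. The sketch below collects the four accuracies printed above and sets each against the majority-class share of the resampled data it was drawn from. That share only approximates the class mix of each test set, because the splits are random rather than stratified, and the decision tree figures are not reproducible exactly because the trees were built without a `random_state`. The dictionary layout and names are illustrative.

```python
# Accuracies copied from the cell outputs above (rounded to four decimals).
results = {
    ("SMOTE", "LogisticRegression"): 0.7396,
    ("SMOTE", "DecisionTree"): 0.7652,
    ("TomekLinks", "LogisticRegression"): 0.7972,
    ("TomekLinks", "DecisionTree"): 0.7598,
}

# Share of the majority class after each resampling step, from the value_counts outputs.
majority_share = {
    "SMOTE": 5174 / (5174 + 5174),
    "TomekLinks": 4666 / (4666 + 1869),
}

best_per_sampler = {}
for (sampler, model), acc in results.items():
    lift = acc - majority_share[sampler]
    print(f"{sampler:<11}{model:<20}accuracy={acc:.4f}  lift over majority share={lift:+.4f}")
    if sampler not in best_per_sampler or acc > best_per_sampler[sampler][1]:
        best_per_sampler[sampler] = (model, acc)

print(best_per_sampler)
```

On the SMOTE data the decision tree scores higher than logistic regression (0.7652 vs. 0.7396), while on the TomekLinks data logistic regression scores higher (0.7972 vs. 0.7598). The TomekLinks accuracies are measured on data that is still about 71% `No`, so they are not directly comparable with the SMOTE accuracies, which are measured on balanced data. The kappa values for logistic regression point the same way: 0.479 with SMOTE against 0.444 with TomekLinks, despite the higher TomekLinks accuracy.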
