# Lab | Cross Validation

Jorge Castro DAPT NOV 2021

For this lab, we will build a model on customer churn binary classification problem. You will be using `files_for_lab/Customer-Churn.csv` file.



### Instructions

1. Apply SMOTE for upsampling the data

    - Use logistic regression to fit the model and compute the accuracy of the model.
    - Use decision tree classifier to fit the model and compute the accuracy of the model.
    - Compare the accuracies of the two models.


2. Apply TomekLinks for downsampling

    - It is important to remember that it does not make the two classes equal but only removes the points from the majority class that are close to other points in minority class.
    - Use logistic regression to fit the model and compute the accuracy of the model.
    - Use decision tree classifier to fit the model and compute the accuracy of the model.
    - Compare the accuracies of the two models.
    - You can also apply this algorithm one more time and check the how the imbalance in the two classes changed from the last time.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.under_sampling import TomekLinks
from imblearn.under_sampling import TomekLinks
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import cohen_kappa_score, accuracy_score 
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, plot_confusion_matrix

In [3]:
data = pd.read_csv('Customer-Churn.csv')

In [4]:
data.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.7,151.65,Yes


In [5]:
data.shape

(7043, 16)

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7043 non-null   object 
 15  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(13)
memory

In [7]:
data.isnull().sum()

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

In [8]:
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce') 
data['TotalCharges'] = data['TotalCharges'].fillna(np.mean(data['TotalCharges']))

## X-y split

In [9]:
y = data['Churn']
X = data[['tenure', 'SeniorCitizen','MonthlyCharges', 'TotalCharges']] 

## Apply SMOTE for upsampling

In [10]:
from imblearn.over_sampling import SMOTE
smote = SMOTE()
transformer = StandardScaler().fit(X)
X = transformer.transform(X) 
X_sm, y_sm = smote.fit_sample(X, y) 
y_sm.value_counts()



No     5174
Yes    5174
Name: Churn, dtype: int64

## Train-test split

In [11]:
X_train_smote, X_test_smote, y_train_smote, y_test_smote = train_test_split(X_sm, y_sm, test_size=0.33, random_state=11)

## Logistic Regression

In [12]:
classification = LogisticRegression(random_state=0, solver='lbfgs',
                        multi_class='ovr').fit(X_train_smote, y_train_smote)

print("The accuracy of the model is: ",round(classification.score(X_test_smote, y_test_smote),2))
print("The kappa of the model is: ",round(cohen_kappa_score(y_sm,classification.predict(X_sm)),2))

The accuracy of the model is:  0.73
The kappa of the model is:  0.47


## Decision Regression

In [13]:
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train_smote, y_train_smote)
print("The accuracy of the model is: {:4.2f}".format(model.score(X_test_smote, y_test_smote)))
print("The kappa of the model is: ",round(cohen_kappa_score(y_sm,model.predict(X_sm)),2))

The accuracy of the model is: 0.70
The kappa of the model is:  0.42


In [14]:
model1 = DecisionTreeClassifier()
model2 = LogisticRegression()
model3 = KNeighborsClassifier()
from sklearn.model_selection import cross_val_score
from scipy.stats import t, norm

In [15]:
model_pipeline = [model1, model2, model3]
model_names = ['Regression Tree', 'Linear Regression', 'KNN']


def confidence_intervals(model_pipeline, model_names, X_train, y_train, alpha = 0.05, K = 10):
# We set the significance level
#alpha = 0.05
#K = 10
    scores = {}
    i=0
    for model in model_pipeline:
        mean_score = np.mean(cross_val_score(model, X_train, y_train, cv=K))
        if (K < 30):
            # t.ppf(area) gives us the critical value corresponding to the area for the t-student distribution.
            t_critical = abs(t.ppf(1-alpha/2, K-1)) 
            interval = t_critical*(np.std(cross_val_score(model, X_train, y_train, cv=K))/np.sqrt(K))
        else:
            # norm.ppf(area) gives us the critical value corresponding to the area for the normal distribution
            z_critical = abs(norm.ppf(1-alpha/2)) 
            interval = z_critical*(np.std(cross_val_score(model, X_train, y_train, cv=K))/np.sqrt(K))
        scores[model_names[i]] = [mean_score, mean_score - interval, mean_score + interval]
        print("The rmse of the {} model is (CV witk K={}) = {:4.2f} +/- {:4.2f}".format(model_names[i], K, mean_score, interval))
        i = i+1

confidence_intervals(model_pipeline, model_names, X_train_smote, y_train_smote, 0.05, 5)

The rmse of the Regression Tree model is (CV witk K=5) = 0.76 +/- 0.01
The rmse of the Linear Regression model is (CV witk K=5) = 0.74 +/- 0.01
The rmse of the KNN model is (CV witk K=5) = 0.77 +/- 0.01


# Apply TomekLinks for downsampling

In [16]:
tl = TomekLinks('majority')
X_tl, y_tl = tl.fit_sample(X, y)
pd.DataFrame(y_tl).value_counts()



Churn
No       4666
Yes      1869
dtype: int64

## Train-test split

In [17]:
X_train_tl, X_test_tl, y_train_tl, y_test_tl = train_test_split(X_tl, y_tl, test_size=0.33, random_state=11)

## Logistic Regression

In [18]:
classification1 = LogisticRegression(random_state=0, solver='lbfgs',
                        multi_class='ovr').fit(X_train_tl, y_train_tl)

print("The accuracy of the model is: ",round(classification1.score(X_test_tl, y_test_tl),2))
print("The kappa of the model is: ",round(cohen_kappa_score(y_tl,classification1.predict(X_tl)),2))

The accuracy of the model is:  0.8
The kappa of the model is:  0.45


## Decision Regression

In [19]:
model1 = DecisionTreeClassifier(max_depth=3)
model1.fit(X_train_tl, y_train_tl)
print("The accuracy of the model is: {:4.2f}".format(model.score(X_test_tl, y_test_tl)))
print("The kappa of the model is: ",round(cohen_kappa_score(y_tl,model.predict(X_tl)),2))

The accuracy of the model is: 0.76
The kappa of the model is:  0.42
