# Lesson 1

![logo_ironhack_blue 7](https://user-images.githubusercontent.com/23629340/40541063-a07a0a8a-601a-11e8-91b5-2f13e4e6b441.png)

# Lab | Cross Validation

For this lab, we will build a model on customer churn binary classification problem. You will be using `files_for_lab/Customer-Churn.csv` file.



### Instructions

1. Apply SMOTE for upsampling the data

    - Use logistic regression to fit the model and compute the accuracy of the model.
    - Use decision tree classifier to fit the model and compute the accuracy of the model.
    - Compare the accuracies of the two models.


2. Apply TomekLinks for downsampling

    - It is important to remember that it does not make the two classes equal but only removes the points from the majority class that are close to other points in minority class.
    - Use logistic regression to fit the model and compute the accuracy of the model.
    - Use decision tree classifier to fit the model and compute the accuracy of the model.
    - Compare the accuracies of the two models.
    - You can also apply this algorithm one more time and check the how the imbalance in the two classes changed from the last time.


## Loading libraries

In [1]:
import pandas as pd
import numpy  as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.utils import resample

from imblearn.over_sampling import SMOTE


from sklearn.metrics import f1_score, roc_curve, roc_auc_score

## encoding categorical variables for SMOTE
from sklearn.preprocessing import LabelEncoder


import seaborn as sns
import matplotlib.pyplot as plt



## Preparing the data for sampling
(We already know that it's imbalanced from a previous exercise)

In [2]:
churnData = pd.read_csv('Customer-Churn.csv')  # Replace 'your_data.csv' with the actual file path
churnData.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.7,151.65,Yes


In [3]:
churnData.dtypes # check for discrete and continuous variables 

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

### SMOTE
Apply SMOTE for upsampling the data


In [4]:
X = churnData.drop('Churn', axis=1) # df with features
y = churnData['Churn'] # df with target 

In [5]:
X.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.3,1840.75
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.7,151.65


In [6]:
y.head()

0     No
1     No
2    Yes
3     No
4    Yes
Name: Churn, dtype: object

In [7]:
y.value_counts() # count of target variable before resampling 

Churn
No     5174
Yes    1869
Name: count, dtype: int64

In [8]:
# Encode categorical variables
categorical_cols = X.select_dtypes(include='object').columns
X_encoded = pd.get_dummies(X, columns=categorical_cols)

# Convert target variable to numerical
le = LabelEncoder()
y_encoded = le.fit_transform(y)

In [9]:
churnData.dtypes

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

In [10]:
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_encoded, y_encoded)

In [12]:
# Convert back to a data frame
X_resampled = pd.DataFrame(X_resampled, columns=X_encoded.columns)
y_resampled = pd.Series(y_resampled, name='Churn')

In [13]:
print("The up sampled target variable looks like:÷\n", y_resampled.value_counts())


The up sampled target variable looks like:÷
 Churn
0    5174
1    5174
Name: count, dtype: int64


In [14]:
# # Convert back to a data frame with features and target

churnDataUp_sampled = pd.concat([X_resampled, y_resampled], axis=1)
churnDataUp_sampled

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,gender_Female,gender_Male,Partner_No,Partner_Yes,Dependents_No,Dependents_Yes,PhoneService_No,...,TotalCharges_996.45,TotalCharges_996.85,TotalCharges_996.95,TotalCharges_997.65,TotalCharges_997.75,TotalCharges_998.1,TotalCharges_999.45,TotalCharges_999.8,TotalCharges_999.9,Churn
0,0,1,29.850000,True,False,False,True,True,False,True,...,False,False,False,False,False,False,False,False,False,0
1,0,34,56.950000,False,True,True,False,True,False,False,...,False,False,False,False,False,False,False,False,False,0
2,0,2,53.850000,False,True,True,False,True,False,False,...,False,False,False,False,False,False,False,False,False,1
3,0,45,42.300000,False,True,True,False,True,False,True,...,False,False,False,False,False,False,False,False,False,0
4,0,2,70.700000,True,False,True,False,True,False,False,...,False,False,False,False,False,False,False,False,False,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10343,0,17,85.446687,True,False,False,True,True,True,False,...,False,False,False,False,False,False,False,False,False,1
10344,1,9,99.463753,False,True,True,False,True,False,False,...,False,False,False,False,False,False,False,False,False,1
10345,0,1,20.200000,True,False,True,False,True,False,False,...,False,False,False,False,False,False,False,False,False,1
10346,0,37,100.268002,True,True,True,True,True,False,False,...,False,False,False,False,False,False,False,False,False,1


### LOGISTIC REGRESSION

Use logistic regression to fit the model and compute the accuracy of the model.


We'll standardsie the data first. Logistic regression assumes a linear relationship so this helps with convergence 

In [15]:
# Scale the data first 
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(churnDataUp_sampled.drop('Churn', axis=1))

# Concatenate the scaled features and target variable
churnDataUp_scaled = pd.concat([pd.DataFrame(X_scaled, columns=churnDataUp_sampled.columns[:-1]), churnDataUp_sampled['Churn']], axis=1)


In [16]:
X_scaled

array([[-0.40804115, -1.10232948, -1.32679626, ..., -0.02408648,
        -0.00983089, -0.00983089],
       [-0.40804115,  0.26650415, -0.38434654, ..., -0.02408648,
        -0.00983089, -0.00983089],
       [-0.40804115, -1.06084967, -0.49215444, ..., -0.02408648,
        -0.00983089, -0.00983089],
       ...,
       [-0.40804115, -1.10232948, -1.66239183, ..., -0.02408648,
        -0.00983089, -0.00983089],
       [-0.40804115,  0.39094357,  1.12211249, ..., -0.02408648,
        -0.00983089, -0.00983089],
       [-0.40804115, -0.85345063,  0.10421809, ..., -0.02408648,
        -0.00983089, -0.00983089]])

In [17]:

# Split the data into features (X) and target variable (y)
X = churnDataUp_scaled.drop('Churn', axis=1)
y = churnDataUp_scaled['Churn']

# Perform the train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [18]:
#Fit a logistic regression model on the training data and check accuracy on the test data

model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)



In [19]:
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy after SMOTE:", accuracy)

Accuracy after SMOTE: 0.8584541062801933


In [20]:
# f1_yes = f1_score(y_test, y_pred, pos_label='Yes')
f1 = f1_score(y_test, y_pred)
print("F1-score after SMOTE:", f1)

F1-score after SMOTE: 0.8685509196949304


In [21]:
y_pred_prob = model.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_pred_prob)
print("ROC AUC after SMOTE:", roc_auc)

ROC AUC after SMOTE: 0.9507576358810079


In [22]:
## print it all together :) 

from sklearn.metrics import accuracy_score, classification_report

report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("ROC-AUC:", roc_auc)
print("F1 score:", f1)

print("Classification Report for the logistic regressiuon:\n", report)

Accuracy: 0.8584541062801933
ROC-AUC: 0.9507576358810079
F1 score: 0.8685509196949304
Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.79      0.85      1021
           1       0.82      0.92      0.87      1049

    accuracy                           0.86      2070
   macro avg       0.86      0.86      0.86      2070
weighted avg       0.86      0.86      0.86      2070



Our logistic regression seems to be performing well:

- ROC AUC is near 1 so it shows that it can distinguish between classes
    
- F1-score of 0.86 shows that recall for the positive class 'Yes' is high
- Accuracy is similar at 0.85 which means that we can be confident that the model correctly predicts 'Yes' or 'No' 85% of the time


## Use decision tree classifier to fit the model and compute the accuracy of the model

Now we can see if a decision tree can perform better at prediction

In [23]:
from sklearn.tree import DecisionTreeClassifier

We'll make a new training and test set based on the non-scaled data. Decision trees don't assume a linear relationship so the data don't need to be scaled.

In [24]:
# Split the data into features (X) and target variable (y)
X = churnDataUp_sampled.drop('Churn', axis=1)
y = churnDataUp_sampled['Churn']

# Perform the train-test split
X_traindt, X_testdt, y_traindt, y_testdt = train_test_split(X, y, test_size=0.2, random_state=42)

In [25]:
model = DecisionTreeClassifier(criterion='gini', max_depth=5)

In [26]:
model.fit(X_traindt, y_traindt)

In [27]:
y_preddt = model.predict(X_testdt)

In [39]:
report = classification_report(y_testdt, y_preddt)

y_pred_probdt = model.predict_proba(X_testdt)[:, 1]
roc_aucdt = roc_auc_score(y_testdt, y_pred_probdt)
f1dt = f1_score(y_testdt, y_preddt)
accuracydt = accuracy_score(y_testdt, y_preddt)



print("Accuracy:", accuracydt)
print("ROC-AUC:", roc_aucdt)
print("F1 score:", f1dt)

print("Classification Report for decision tree:\n", report)

Accuracy: 0.8019323671497585
ROC-AUC: 0.871812527952091
F1 score: 0.8153153153153154
Classification Report for decision tree:
               precision    recall  f1-score   support

           0       0.84      0.74      0.79      1021
           1       0.77      0.86      0.82      1049

    accuracy                           0.80      2070
   macro avg       0.81      0.80      0.80      2070
weighted avg       0.81      0.80      0.80      2070



Our decision tree performed less well:

- ROC AUC is near .86 (1 for regression) so it shows that it can distinguish between classes
    
- F1-score of 0.81 (.86 for regression) shows that recall for the positive class 'Yes' is high
- Accuracy is similar at 0.80 (.85 for regression ) which means that we can be confident that the model correctly predicts 'Yes' or 'No' 80% of the time

- ROC AUC is near 1 so it shows that it can distinguish between classes
    



### TomekLinks
Now we'll use tomeklinks to downsample the data 

In [29]:
churnData

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.30,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,Male,0,Yes,Yes,24,Yes,Yes,No,Yes,Yes,Yes,Yes,One year,84.80,1990.5,No
7039,Female,0,Yes,Yes,72,Yes,No,Yes,Yes,No,Yes,Yes,One year,103.20,7362.9,No
7040,Female,0,Yes,Yes,11,No,Yes,No,No,No,No,No,Month-to-month,29.60,346.45,No
7041,Male,1,Yes,No,4,Yes,No,No,No,No,No,No,Month-to-month,74.40,306.6,Yes


In [30]:
churnData['Churn'].value_counts() # count of target variable before resampling 

Churn
No     5174
Yes    1869
Name: count, dtype: int64

In [31]:
from imblearn.under_sampling import TomekLinks

X = churnData.drop('Churn', axis=1)
y = churnData['Churn']

In [32]:
# Encode categorical variables
categorical_cols = X.select_dtypes(include='object').columns
X_encoded = pd.get_dummies(X, columns=categorical_cols)

# Convert target variable to numerical
le = LabelEncoder()
y_encoded = le.fit_transform(y)

In [33]:
tl = TomekLinks()
X_resampled, y_resampled = tl.fit_resample(X_encoded, y_encoded)

In [34]:
# Put it all back into a dataframe 
X_resampled = pd.DataFrame(X_resampled, columns=X_encoded.columns)
y_resampled = pd.Series(y_resampled, name='Churn')

In [None]:
# count the majority class (1) that should be downsampled. ie. less observations with the column 'Churn'
#y_resampled.value_counts() 
# did it differently below 

In [35]:
churnDataDown_sampled = pd.concat([X_resampled, y_resampled], axis=1)

# print out the  the majority class (1) that should be downsampled. ie. we should see less observations of '1' (or 'Yes')  the column 'Churn'
print("The downsampled target variable counts are:\n", churnDataDown_sampled['Churn'].value_counts())


The downsampled target variable counts are:
 Churn
0    4722
1    1869
Name: count, dtype: int64


In [42]:
# Split the data into features (X) and target variable (y)
XDSlm = churnDataDown_sampled.drop('Churn', axis=1)
yDSlm = churnDataDown_sampled['Churn']

# Perform the train-test split
X_trainDSlm, X_testDSlm, y_trainDSlm, y_testDSlm = train_test_split(XDSlm, yDSlm, test_size=0.2, random_state=42)

In [44]:
DSlmModel = LogisticRegression(solver='lbfgs', max_iter=1000)
DSlmModel.fit(X_trainDSlm, y_trainDSlm)

In [46]:
y_predDSlm = DSlmModel.predict(X_test)
accuracyDSlm = accuracy_score(y_test, y_pred)
f1DSlm = f1_score(y_test, y_pred)
y_pred_probDSlm = DSlmModel.predict_proba(X_test)[:, 1]
roc_aucDSlm = roc_auc_score(y_test, y_pred_probDSlm)

In [47]:
report = classification_report(y_testDSlm, y_predDSlm)

y_pred_probDSlm = model.predict_proba(X_testDSlm)[:, 1]
roc_aucDSlm = roc_auc_score(y_testDSlm, y_pred_probDSlm)
f1DSlm = f1_score(y_testDSlm, y_predDSlm)
accuracyDSlm = accuracy_score(y_testDSlm, y_predDSlm)



print("Accuracy:", accuracyDSlm)
print("ROC-AUC:", roc_aucDSlm)
print("F1 score:", f1DSlm)

print("Classification Report for tomeklinks (downsampled) logistic regression:\n", report)

Accuracy: 0.7907505686125853
ROC-AUC: 0.8439584270474095
F1 score: 0.6260162601626017
Classification Report for tomeklinks (downsampled) logistic regression:
               precision    recall  f1-score   support

           0       0.84      0.87      0.85       933
           1       0.66      0.60      0.63       386

    accuracy                           0.79      1319
   macro avg       0.75      0.73      0.74      1319
weighted avg       0.79      0.79      0.79      1319



It performed much worse with lower scores all around! 

Now let's try the decision tree

In [49]:
# Split the data into features (X) and target variable (y)
X = churnDataDown_sampled.drop('Churn', axis=1)
y = churnDataDown_sampled['Churn']

# Perform the train-test split
X_trainDSdt, X_testDSdt, y_trainDSdt, y_testDSdt = train_test_split(X, y, test_size=0.2, random_state=42)

# create the model parameters
model = DecisionTreeClassifier(criterion='gini', max_depth=5)

model.fit(X_trainDSdt, y_trainDSdt)

y_predDSdt = model.predict(X_testDSdt)


In [50]:
report = classification_report(y_testDSdt, y_predDSdt)

y_pred_probDSdt = model.predict_proba(X_testDSdt)[:, 1]
roc_aucDSdt = roc_auc_score(y_testDSdt, y_pred_probDSdt)
f1DSdt = f1_score(y_testDSdt, y_predDSdt)
accuracyDSdt = accuracy_score(y_testDSdt, y_predDSdt)



print("Accuracy:", accuracyDSdt)
print("ROC-AUC:", roc_aucDSdt)
print("F1 score:", f1DSdt)

print("Classification Report for decision tree:\n", report)

Accuracy: 0.7740712661106899
ROC-AUC: 0.8353186833935881
F1 score: 0.6026666666666668
Classification Report for decision tree:
               precision    recall  f1-score   support

           0       0.83      0.85      0.84       933
           1       0.62      0.59      0.60       386

    accuracy                           0.77      1319
   macro avg       0.73      0.72      0.72      1319
weighted avg       0.77      0.77      0.77      1319



Even worse! 

Logistic regression seems to be the winner. 