![logo_ironhack_blue 7](https://user-images.githubusercontent.com/23629340/40541063-a07a0a8a-601a-11e8-91b5-2f13e4e6b441.png)

# Lab | Cross Validation

In this lab, we will build models for a binary classification problem: predicting customer churn. You will use the `files_for_lab/Customer-Churn.csv` file.



### Instructions

1. Apply SMOTE for upsampling the data

    - Use logistic regression to fit the model and compute the accuracy of the model.
    - Use decision tree classifier to fit the model and compute the accuracy of the model.
    - Compare the accuracies of the two models.


2. Apply TomekLinks for downsampling

    - Remember that TomekLinks does not make the two classes equal in size; it only removes majority-class points that lie close to minority-class points.
    - Use logistic regression to fit the model and compute the accuracy of the model.
    - Use decision tree classifier to fit the model and compute the accuracy of the model.
    - Compare the accuracies of the two models.
    - You can also apply this algorithm one more time and check how the imbalance between the two classes changed compared with the first pass.


In [4]:
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

In [None]:
#Apply SMOTE for upsampling the data
#Use logistic regression to fit the model and compute the accuracy of the model.
#Use decision tree classifier to fit the model and compute the accuracy of the model.
#Compare the accuracies of the two models.

In [7]:
# Load the data; non-numeric TotalCharges entries become NaN and are filled with the column mean
churnData = pd.read_csv('Customer-Churn.csv')
churnData['TotalCharges'] = pd.to_numeric(churnData['TotalCharges'], errors='coerce')
churnData['TotalCharges'] = churnData['TotalCharges'].fillna(churnData['TotalCharges'].mean())

# Standardize the numeric features
X = churnData[['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']]
X = StandardScaler().fit_transform(X)
y = churnData['Churn']

# Oversample the minority class with SMOTE.
# Caveat: resampling (and scaling) before the split lets information from held-out rows
# reach the training data, so the scores below can be optimistic.
smote = SMOTE()
X_sm, y_sm = smote.fit_resample(X, y)

X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.33)
models = {
    'Logistic Regression': LogisticRegression(random_state=0, solver='lbfgs'),
    'Decision Tree Classifier': DecisionTreeClassifier(),
}

# Mean 10-fold cross-validated accuracy (the default score for classifiers)
scores = {}
for name, model in models.items():
    scores[name] = np.mean(cross_val_score(model, X_train, y_train, cv=10))
print(scores)

{'Logistic Regression': 0.7295513804159338, 'Decision Tree Classifier': 0.7425361062248671}


In [8]:
#Apply TomekLinks for downsampling

#Remember that TomekLinks does not make the two classes equal in size; it only removes majority-class points that lie close to minority-class points.
#Use logistic regression to fit the model and compute the accuracy of the model.
#Use decision tree classifier to fit the model and compute the accuracy of the model.
#Compare the accuracies of the two models.
#You can also apply this algorithm one more time and check how the imbalance between the two classes changed compared with the first pass.

In [10]:
from imblearn.under_sampling import TomekLinks

# Drop majority-class points that form Tomek links; the classes stay imbalanced
tl = TomekLinks(sampling_strategy='majority')
X_tl, y_tl = tl.fit_resample(X, y)

X_train, X_test, y_train, y_test = train_test_split(X_tl, y_tl, test_size=0.33)
models = {
    'Logistic Regression': LogisticRegression(random_state=0, solver='lbfgs'),
    'Decision Tree Classifier': DecisionTreeClassifier(),
}

# Mean 10-fold cross-validated accuracy (the default score for classifiers)
scores = {}
for name, model in models.items():
    scores[name] = np.mean(cross_val_score(model, X_train, y_train, cv=10))
print(scores)

{'Logistic Regression': 0.7888932426360721, 'Decision Tree Classifier': 0.7404637263199689}


In [12]:
from sklearn.metrics import accuracy_score

# Fit the decision tree on the TomekLinks training split
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)

# Accuracy on the held-out test split
y_pred_dt = decision_tree.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
accuracy_dt

0.7621696801112656
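
The cell above scores only the decision tree on the held-out split, and because neither the split nor the tree sets a `random_state`, the number will differ from run to run. A minimal sketch of a reproducible hold-out comparison of both models follows; it uses synthetic data as a stand-in (an assumption), and `holdout_accuracies` is a hypothetical helper, not part of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def holdout_accuracies(X, y, seed=0):
    """Hypothetical helper: test-set accuracy of both models with fixed seeds."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, stratify=y, random_state=seed
    )
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree Classifier": DecisionTreeClassifier(random_state=seed),
    }
    return {
        name: accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test))
        for name, model in models.items()
    }


# Synthetic stand-in for the resampled churn data (assumption).
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.75, 0.25], random_state=0,
)
print(holdout_accuracies(X, y))
```

With the seeds fixed, repeated calls return the same accuracies, which makes the comparison between the two models repeatable.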

In [None]:
# The hold-out accuracy (0.762) changes a bit from the cross-validated score (0.740)