# Lab | Cross Validation

For this lab, we will build a model for a customer churn binary classification problem. You will be using the files_for_lab/Customer-Churn.csv file.

Instructions
1. Apply SMOTE for upsampling the data

    * Use logistic regression to fit the model and compute the accuracy of the model.
    * Use decision tree classifier to fit the model and compute the accuracy of the model.
    * Compare the accuracies of the two models.

2. Apply TomekLinks for downsampling

    * Remember that TomekLinks does not make the two classes equal in size; it only removes majority-class points that are closest neighbours of minority-class points.
    * Use logistic regression to fit the model and compute the accuracy of the model.
    * Use decision tree classifier to fit the model and compute the accuracy of the model.
    * Compare the accuracies of the two models.
    * You can also apply this algorithm one more time and check how the imbalance between the two classes changes compared to the first pass.
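
To illustrate why TomekLinks leaves the classes unequal, here is a minimal NumPy sketch of the idea on toy data. It is not imblearn's implementation: the helper `tomek_link_majority_indices` and the toy arrays are hypothetical, and it assumes a Tomek link is a pair of points from different classes that are each other's nearest neighbour.

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label):
    """Return indices of majority-class points that belong to a Tomek link.

    Hypothetical helper for illustration only.
    """
    # Pairwise squared Euclidean distances; the diagonal is set to infinity
    # so that a point is never its own nearest neighbour.
    diffs = X[:, None, :] - X[None, :, :]
    dist = (diffs ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)

    to_drop = []
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i  # i and j are each other's nearest neighbour
        if mutual and y[i] != y[j] and y[i] == majority_label:
            to_drop.append(i)
    return np.array(to_drop, dtype=int)

# Toy imbalanced data (assumption): 200 majority points, 40 minority points
rng = np.random.default_rng(0)
X_major = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_minor = rng.normal(loc=1.5, scale=1.0, size=(40, 2))
X_toy = np.vstack([X_major, X_minor])
y_toy = np.array([0] * 200 + [1] * 40)

drop = tomek_link_majority_indices(X_toy, y_toy, majority_label=0)
keep = np.setdiff1d(np.arange(len(y_toy)), drop)
X_clean, y_clean = X_toy[keep], y_toy[keep]

print("class counts before:", np.bincount(y_toy))
print("class counts after: ", np.bincount(y_clean))
```

Each minority point can be in at most one such pair, so a single pass removes at most as many majority points as there are minority points, and usually far fewer. Running the helper again on `X_clean, y_clean` mirrors the "apply it one more time" step.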

In [2]:
import warnings

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import matplotlib_inline.backend_inline
import seaborn as sns
import plotly.express as px
from IPython.display import HTML
from matplotlib.dates import DateFormatter
from pandas.plotting import register_matplotlib_converters, scatter_matrix

register_matplotlib_converters()

%matplotlib inline
%config InlineBackend.figure_format = 'png'

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [3]:
# Formatting plots
# Default styles
def set_sns_format(width=14, height=8):
    sns.set_theme(palette='pastel', context='notebook',rc={'savefig.dpi':300})
    matplotlib_inline.backend_inline.set_matplotlib_formats('retina')
    matplotlib.rcParams['figure.figsize'] = (width, height)
    return None
set_sns_format(width=14, height=8)

In [4]:
def add_value_labels(ax, typ, spacing=5):
    """Annotate each bar or line point on `ax` with its value.

    ax: matplotlib Axes to annotate; typ: 'bar' or 'line';
    spacing: vertical offset of the label, in points.
    """
    space = spacing
    va = 'bottom'


    if typ == 'bar':
        for i in ax.patches:
            y_value = i.get_height()
            x_value = i.get_x() + i.get_width() / 2

            label = "{:.0f}".format(y_value)
            ax.annotate(label,(x_value, y_value), xytext=(0, space), 
                    textcoords="offset points", ha='center', va=va, fontsize=10)     

    if typ == 'line':
        for line in ax.lines:
            for x_value, y_value in zip(line.get_xdata(), line.get_ydata()):
                label = "{:.0f}".format(y_value)
                ax.annotate(label,(x_value, y_value), xytext=(0, space), 
                    textcoords="offset points", ha='center', va=va, fontsize=10)

In [5]:
df = pd.read_csv(r"C:\Users\ssai\OneDrive\Data_26-07\labs\lab-cross-validation\files_for_lab/Customer-Churn.csv")

In [6]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from imblearn.under_sampling import TomekLinks
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier

In [7]:
df["TotalCharges"][df["TotalCharges"] == " "] = np.nan
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"])
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].mean())

In [8]:
X = df[["tenure", "SeniorCitizen", "MonthlyCharges", "TotalCharges"]]
y = df.iloc[:,-1]
y = y.apply(lambda x: 1 if x=='Yes' else 0)

sc = StandardScaler()
lr = LogisticRegression()

X = sc.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.30)
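
The cell above fits the scaler on all rows before splitting, so the test rows influence the scaling statistics. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on synthetic data; the arrays `X_demo` and `y_demo` and the 0.27 positive rate are arbitrary assumptions, not the lab's dataset, and the scores reported in this notebook were not produced this way.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 1000 rows, 4 numeric features
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(1000, 4))
y_demo = (rng.random(1000) < 0.27).astype(int)

# 1) Split first, keeping the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

# 2) Fit the scaler on the training rows only, then reuse it for the test rows
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print("train means:", X_tr_scaled.mean(axis=0).round(3))
print("test means: ", X_te_scaled.mean(axis=0).round(3))
```

The training means come out at zero by construction, while the test means are only close to zero, which is expected when the test set plays no part in fitting the scaler.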

# SMOTE

In [9]:
sm = SMOTE(k_neighbors = 3, random_state = 42)

X_train_SMOTE, y_train_SMOTE = sm.fit_resample(X_train, y_train)
lr.fit(X_train_SMOTE, y_train_SMOTE)
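
To make it clearer what SMOTE adds to the training set, here is a minimal NumPy sketch of its core idea: each synthetic point is placed at a random position on the segment between a minority point and one of its `k` nearest minority neighbours. This is an illustration under simplifying assumptions, not imblearn's implementation; `smote_like_samples` and the toy array `X_min_toy` are hypothetical names.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=3, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Hypothetical helper for illustration only; requires len(X_min) > k.
    """
    rng = np.random.default_rng(seed)

    # Pairwise squared distances within the minority class (self excluded)
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dist = (diffs ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]  # k nearest neighbours per point

    # Pick a base point, one of its neighbours, and a random position between them
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class (assumption): 30 points in two dimensions
rng = np.random.default_rng(0)
X_min_toy = rng.normal(size=(30, 2))

X_new = smote_like_samples(X_min_toy, n_new=70, k=3)
print("original minority points:", X_min_toy.shape[0])
print("synthetic points added:  ", X_new.shape[0])
```

Because every synthetic point lies between two existing minority points, the new rows never leave the region the minority class already occupies. Note also that only the training set is resampled in this notebook, so the test scores are measured on the original class distribution.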

In [10]:
print("train:", lr.score(X_train_SMOTE, y_train_SMOTE))
print("test:", lr.score(X_test, y_test))

train: 0.7277854195323247
test: 0.7397065783246569


In [12]:
clf = DecisionTreeClassifier(random_state=0)
clf = clf.fit(X_train_SMOTE, y_train_SMOTE)

In [13]:
print("train:", clf.score(X_train_SMOTE, y_train_SMOTE))
print("test:", clf.score(X_test, y_test))

train: 0.99353507565337
test: 0.706578324656886


# TomekLinks

In [14]:
tl = TomekLinks(sampling_strategy='auto')
X_train_tl, y_train_tl = tl.fit_resample(X_train, y_train)
lr.fit(X_train_tl, y_train_tl)

In [15]:
print("train:", lr.score(X_train_tl, y_train_tl))
print("test:", lr.score(X_test, y_test))

train: 0.7903048914235578
test: 0.7841930903928065


In [16]:
clf = clf.fit(X_train_tl, y_train_tl)

In [17]:
# Second TomekLinks pass on the already-cleaned training data
X_train_tl, y_train_tl = tl.fit_resample(X_train_tl, y_train_tl)
lr.fit(X_train_tl, y_train_tl)

In [18]:
print("train:", lr.score(X_train_tl, y_train_tl))
print("test:", lr.score(X_test, y_test))

train: 0.789224526600541
test: 0.7827733080927591


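* DecisionTreeClassifier overfits heavily: with SMOTE it reaches 0.99 accuracy on the training data but only 0.71 on the test data, below logistic regression's 0.74.
* Applying TomekLinks a second time lowers the logistic regression scores very slightly (test accuracy 0.784 versus 0.783).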
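
One common way to reduce this kind of overfitting is to limit the depth of the tree. The sketch below compares an unrestricted tree with a depth-limited one on synthetic data; the dataset from `make_classification` and the choice `max_depth=4` are assumptions for illustration and were not tuned or run on the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary problem (assumption): 20% of the labels are flipped
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=1,
    flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=42
)

results = {}
for depth in (None, 4):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")
```

The unrestricted tree memorises the training rows, including the noisy labels, so a large gap between its train and test accuracy is the signal to look for. Whether a shallower tree actually improves test accuracy on the churn data would need to be checked there.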