# Lab | Cross Validation

For this lab, we will build models for a binary classification problem: predicting customer churn. You will use the `files_for_lab/Customer-Churn.csv` file.

## Instructions
1. Apply SMOTE for upsampling the data

- Use logistic regression to fit the model and compute the accuracy of the model.
- Use decision tree classifier to fit the model and compute the accuracy of the model.
- Compare the accuracies of the two models.

2. Apply TomekLinks for downsampling

- Keep in mind that TomekLinks does not make the two classes equal in size. It only removes majority-class points that form Tomek links, that is, pairs of opposite-class points that are each other's nearest neighbors.
- Use logistic regression to fit the model and compute the accuracy of the model.
- Use decision tree classifier to fit the model and compute the accuracy of the model.
- Compare the accuracies of the two models.
- You can also apply this algorithm a second time and check how the imbalance between the two classes has changed since the first pass.
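
To make the idea of a Tomek link concrete, here is a minimal NumPy sketch on toy one-dimensional data. The helper `find_tomek_links` and the toy arrays are hypothetical and written only for illustration; this is not how `imblearn` implements `TomekLinks` internally.

```python
import numpy as np

def find_tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbours with different labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances; a point must not count as its own neighbour.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    links = []
    for i, j in enumerate(nearest):
        # Keep each mutual pair once (i < j) and only if the labels differ.
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links

# Toy data (assumed): the majority point at 2.0 sits right next to the minority point at 2.1.
X_toy = np.array([[0.0], [0.3], [1.0], [2.0], [2.1], [3.0]])
y_toy = np.array(["No", "No", "No", "No", "Yes", "Yes"])

links = find_tomek_links(X_toy, y_toy)
# Only the majority-class member of each link would be removed.
majority_to_drop = [i for pair in links for i in pair if y_toy[i] == "No"]
print(links)
print(majority_to_drop)
```

Only one majority point is dropped here, which illustrates why the classes stay unequal after this kind of cleaning.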

## Apply SMOTE for upsampling the data

In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

%matplotlib inline

In [2]:
# Load the dataset
data = pd.read_csv('files_for_lab/Customer-Churn.csv')
data

| | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | No | Yes | No | No | No | No | Month-to-month | 29.85 | 29.85 | No |
| 1 | Male | 0 | No | No | 34 | Yes | Yes | No | Yes | No | No | No | One year | 56.95 | 1889.5 | No |
| 2 | Male | 0 | No | No | 2 | Yes | Yes | Yes | No | No | No | No | Month-to-month | 53.85 | 108.15 | Yes |
| 3 | Male | 0 | No | No | 45 | No | Yes | No | Yes | Yes | No | No | One year | 42.30 | 1840.75 | No |
| 4 | Female | 0 | No | No | 2 | Yes | No | No | No | No | No | No | Month-to-month | 70.70 | 151.65 | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | Male | 0 | Yes | Yes | 24 | Yes | Yes | No | Yes | Yes | Yes | Yes | One year | 84.80 | 1990.5 | No |
| 7039 | Female | 0 | Yes | Yes | 72 | Yes | No | Yes | Yes | No | Yes | Yes | One year | 103.20 | 7362.9 | No |
| 7040 | Female | 0 | Yes | Yes | 11 | No | Yes | No | No | No | No | No | Month-to-month | 29.60 | 346.45 | No |
| 7041 | Male | 1 | Yes | No | 4 | Yes | No | No | No | No | No | No | Month-to-month | 74.40 | 306.6 | Yes |


In [3]:
# Cleaning
data.columns = data.columns.str.strip().str.replace(' ', '_').str.lower() 
data.columns

Index(['gender', 'seniorcitizen', 'partner', 'dependents', 'tenure',
       'phoneservice', 'onlinesecurity', 'onlinebackup', 'deviceprotection',
       'techsupport', 'streamingtv', 'streamingmovies', 'contract',
       'monthlycharges', 'totalcharges', 'churn'],
      dtype='object')

In [4]:
# Separate features and target variable
X = data.drop('churn', axis=1)
y = data['churn']

In [5]:
# Convert categorical features to one-hot encoded columns
X_encoded = pd.get_dummies(X)
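
`pd.get_dummies` expands every text column into one indicator column per distinct value, so it can help to check `X.dtypes` before encoding. The sketch below uses a small hypothetical frame (`toy`) in which a numeric-looking column is stored as text. Whether that applies to this lab's CSV is not shown in the notebook, so treat it as an assumption to verify.

```python
import pandas as pd

# Hypothetical miniature of the churn features.
toy = pd.DataFrame({
    "contract": ["Month-to-month", "One year", "Month-to-month"],
    "monthlycharges": [29.85, 56.95, 53.85],
    "totalcharges": ["29.85", " ", "108.15"],  # numbers stored as text, one blank
})

# Text columns are expanded into one indicator column per distinct value.
naive = pd.get_dummies(toy)
print(naive.shape[1])

# Converting the numeric-looking column first keeps it as a single feature;
# the blank becomes NaN and still has to be imputed or dropped before modelling.
toy["totalcharges"] = pd.to_numeric(toy["totalcharges"], errors="coerce")
cleaned = pd.get_dummies(toy)
print(cleaned.columns.tolist())
```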

In [6]:
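# Apply SMOTE for upsampling
# Note: this resamples the full dataset before the train/test split, so the
# test set below contains synthetic points and its accuracy may be optimistic.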
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_encoded, y)

In [7]:
# Split the upsampled data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

In [8]:
# Initialize and fit the logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression()

In [9]:
# Make predictions on the test set using logistic regression
y_pred_logreg = logreg.predict(X_test)

In [10]:
# Calculate accuracy for logistic regression
accuracy_logreg = accuracy_score(y_test, y_pred_logreg)
print("Logistic Regression Accuracy:", accuracy_logreg)

Logistic Regression Accuracy: 0.8323671497584542


In [11]:
# Initialize and fit the decision tree classifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

DecisionTreeClassifier()

In [12]:
# Make predictions on the test set using decision tree classifier
y_pred_dtc = dtc.predict(X_test)

In [13]:
# Calculate accuracy for decision tree classifier
accuracy_dtc = accuracy_score(y_test, y_pred_dtc)
print("Decision Tree Classifier Accuracy:", accuracy_dtc)

Decision Tree Classifier Accuracy: 0.8381642512077294


In [14]:
# Compare the accuracies
if accuracy_logreg > accuracy_dtc:
    print("Logistic Regression performs better.")
elif accuracy_dtc > accuracy_logreg:
    print("Decision Tree Classifier performs better.")
else:
    print("Both models have the same accuracy.")

Decision Tree Classifier performs better.
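
The two accuracies differ by less than 0.01 (0.8382 vs 0.8324), and `DecisionTreeClassifier()` was created without a `random_state`, so a rerun or a different split could plausibly reverse the ranking. One way to get a steadier comparison is k-fold cross-validation. The sketch below assumes synthetic data and hypothetical names (`X_cv`, `y_cv`, `candidates`); it is an illustration and not part of the original lab solution.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a roughly 3:1 class imbalance.
X_cv, y_cv = make_classification(
    n_samples=600, n_features=8, weights=[0.75, 0.25], random_state=0
)

# Five stratified folds, shuffled with a fixed seed so runs are repeatable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
}

# One accuracy per fold for each model.
cv_scores = {
    name: cross_val_score(est, X_cv, y_cv, cv=cv, scoring="accuracy")
    for name, est in candidates.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

If the difference between the mean scores is small compared with the fold-to-fold spread, the two models are hard to tell apart on accuracy alone.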


## Apply TomekLinks for downsampling

In [15]:
# Apply TomekLinks for downsampling
tomek_links = TomekLinks()
X_resampled, y_resampled = tomek_links.fit_resample(X_encoded, y)

In [16]:
# Split the downsampled data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

In [17]:
# Initialize and fit the logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression()

In [18]:
# Make predictions on the test set using logistic regression
y_pred_logreg = logreg.predict(X_test)

In [19]:
# Calculate accuracy for logistic regression
accuracy_logreg = accuracy_score(y_test, y_pred_logreg)
print("Logistic Regression Accuracy:", accuracy_logreg)

Logistic Regression Accuracy: 0.8074298711144806


In [20]:
# Initialize and fit the decision tree classifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

DecisionTreeClassifier()

In [21]:
# Make predictions on the test set using decision tree classifier
y_pred_dtc = dtc.predict(X_test)

In [22]:
# Calculate accuracy for decision tree classifier
accuracy_dtc = accuracy_score(y_test, y_pred_dtc)
print("Decision Tree Classifier Accuracy:", accuracy_dtc)

Decision Tree Classifier Accuracy: 0.8051554207733131


In [23]:
# Compare the accuracies
if accuracy_logreg > accuracy_dtc:
    print("Logistic Regression performs better.")
elif accuracy_dtc > accuracy_logreg:
    print("Decision Tree Classifier performs better.")
else:
    print("Both models have the same accuracy.")

Logistic Regression performs better.
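
After Tomek-link cleaning the data is still imbalanced, so accuracy alone can flatter a model that rarely predicts churn. The toy labels below are hypothetical; they show a classifier that always answers "No" reaching 80% accuracy while finding none of the churners, which is why recall and the confusion matrix are worth reporting next to accuracy.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 8 customers stay, 2 churn.
y_true = ["No"] * 8 + ["Yes"] * 2
# A degenerate model that predicts "No" for everyone.
y_always_no = ["No"] * 10

acc = accuracy_score(y_true, y_always_no)
churn_recall = recall_score(y_true, y_always_no, pos_label="Yes", zero_division=0)
# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true, y_always_no, labels=["No", "Yes"])

print(acc, churn_recall)
print(cm)
```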


In [24]:
# Apply TomekLinks again and check class imbalance
X_resampled2, y_resampled2 = tomek_links.fit_resample(X_resampled, y_resampled)

In [25]:
# Compare class counts after the first and second TomekLinks passes
class_counts_first_pass = y_resampled.value_counts()
class_counts_second_pass = y_resampled2.value_counts()

print("Class counts after the first TomekLinks pass:")
print(class_counts_first_pass)

print("Class counts after the second TomekLinks pass:")
print(class_counts_second_pass)

Class counts after the first TomekLinks pass:
No     4723
Yes    1869
Name: churn, dtype: int64
Class counts after the second TomekLinks pass:
No     4590
Yes    1869
Name: churn, dtype: int64
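
The two sets of counts above describe the data after the first and after the second TomekLinks pass. A short calculation makes the change easier to read; the Series below are typed in by hand from that output, and the variable names are hypothetical.

```python
import pandas as pd

# Class counts copied from the notebook output.
counts_first_pass = pd.Series({"No": 4723, "Yes": 1869})
counts_second_pass = pd.Series({"No": 4590, "Yes": 1869})

# Rows removed by the second pass, per class.
removed = counts_first_pass - counts_second_pass

# Majority-to-minority ratio after each pass.
ratio_first = counts_first_pass["No"] / counts_first_pass["Yes"]
ratio_second = counts_second_pass["No"] / counts_second_pass["Yes"]

print(removed.to_dict())
print(f"{ratio_first:.2f} -> {ratio_second:.2f}")
```

The second pass removes only majority-class rows and leaves the minority class untouched, so the ratio moves only slightly toward balance.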
