### Scenario

You are working as an analyst with this internet service provider. You are provided with this historical data about your company's customers and their churn trends. Your task is to build a machine learning model that will help the company identify customers that are more likely to default/churn and thus prevent losses from such customers.

### Instructions

1. Importing libraries <br>
- Import the required libraries and modules that you would need.
2. Reading the data <br>
- Read that data into Python and call the dataframe `churnData`.
3. Cleaning the data <br>
- Check the datatypes of all the columns in the data. You would see that the column `TotalCharges` is object type. Convert this column into numeric type using `pd.to_numeric` function.
- Check for null values in the dataframe. Replace the null values.
4. Preprocessing the data <br>
- Use the following features: `tenure`, `SeniorCitizen`, `MonthlyCharges` and `TotalCharges`:
- Scale the features either by using normalizer or a standard scaler.
5. Setting up the model <br>
- Split the data into a training set and a test set.
- Fit a logistic regression model on the training data.
- Check the accuracy on the test data.
6. Managing imbalance & checking accuracy <br>
- Check for the imbalance.
- Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.
- Each time fit the model and see how the accuracy of the model is.
7. Decision tree model <br>
- Apply SMOTE for upsampling the data
    - Use logistic regression to fit the model and compute the accuracy of the model. (see above)
    - Use decision tree classifier to fit the model and compute the accuracy of the model.
    - Compare the accuracies of the two models.
- Apply TomekLinks for downsampling
    - Use logistic regression to fit the model and compute the accuracy of the model. (see above)
    - Use decision tree classifier to fit the model and compute the accuracy of the model.
    - Compare the accuracies of the two models.
8. Apply Random Forests after upscaling with SMOTE
    - Note that since SMOTE works on numerical data only, we will first encode the categorical variables.

### 1. Importing libraries

In [25]:
import numpy as np
import pandas as pd

pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

### 2. Reading the data

In [2]:
churnData = pd.read_csv('Customer-Churn.csv')
churnData.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.7,151.65,Yes


In [3]:
churnData.shape

(7043, 16)

In [4]:
churnData.columns

Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'PhoneService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
       'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

### 3. Cleaning the data

In [5]:
churnData.rename(columns={'gender':'Gender', 'tenure':'Tenure'}, inplace=True)

In [6]:
churnData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   Tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7043 non-null   object 
 15  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(13)
memory

In [7]:
churnData['TotalCharges'] = pd.to_numeric(churnData['TotalCharges'], errors='coerce')

In [8]:
churnData.isna().sum()

Gender               0
SeniorCitizen        0
Partner              0
Dependents           0
Tenure               0
PhoneService         0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

In [9]:
# Since there are only 11 null values, I will drop the respective rows.
churnData.dropna(inplace=True)

### 4. Preprocessing the data

In [10]:
# Normalising
scaler = MinMaxScaler()
columns_to_scale = ['Tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']
churnData[columns_to_scale] = scaler.fit_transform(churnData[columns_to_scale])

In [11]:
numericals = churnData.select_dtypes(include=np.number)
categoricals = churnData.select_dtypes(include='object').drop('Churn', axis=1)
y = churnData['Churn']

# Encoding the categorical columns
preprocessor = OneHotEncoder()
cat_encoded = preprocessor.fit_transform(categoricals)

cat_encoded_df = pd.DataFrame(cat_encoded.toarray(), columns=preprocessor.get_feature_names_out(categoricals.columns))
churnData = pd.concat([cat_encoded_df, numericals, y], axis=1)

In [12]:
# There are new null values due to encoding
churnData.dropna(inplace=True)
churnData.shape

(7021, 34)

In [13]:
churnData['Churn'].value_counts()

No     5155
Yes    1866
Name: Churn, dtype: int64

### 5. Logistic regression model

In [14]:
def logistic_regression(X, y, imbalance_technique=None):    
    if imbalance_technique is not None:
        X, y = imbalance_technique.fit_resample(X, y)
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    logreg = LogisticRegression()
    logreg.fit(X_train, y_train)
    y_pred = logreg.predict(X_test)
    
    print(imbalance_technique)
    unique, counts = np.unique(y_train, return_counts=True)
    print(np.column_stack((unique, counts)))
    
    accuracy = accuracy_score(y_test, y_pred)
    print('Accuracy:', accuracy)

### 6. Managing imbalance & checking accuracy

#### SMOTE & TomekLinks

In [15]:
y = churnData['Churn']
X = churnData.drop(['Churn'], axis=1)

# Without imbalance technique
logistic_regression(X, y)

# Upsampling with synthetic data with SMOTE
smote = SMOTE()
logistic_regression(X, y, imbalance_technique=smote)

# Reducing noise with TomekLinks by identifying pairs from majority and minority classes that are close to each other and removing those belonging to the majority class
tl = TomekLinks()
logistic_regression(X, y, imbalance_technique=tl)

None
[['No' 4094]
 ['Yes' 1522]]
Accuracy: 0.79644128113879
SMOTE()
[['No' 4138]
 ['Yes' 4110]]
Accuracy: 0.7468477206595538
TomekLinks()
[['No' 3574]
 ['Yes' 1496]]
Accuracy: 0.7981072555205048


The accuracy of the original dataset is already pretty good. With upsampling via SMOTE the model reduces accuracy while reducing the data with TomekLinks improves its performance slightly.

#### Random downsampling

In [16]:
category_No = churnData[churnData['Churn'] == 'No']
category_Yes = churnData[churnData['Churn'] == 'Yes']

category_No = category_No.sample(len(category_Yes))

churnData2 = pd.concat([category_No, category_Yes], axis=0)
churnData2 = churnData2.sample(frac=1)

X = churnData2.drop(['Churn'], axis=1)
y = churnData2['Churn']

# Applying the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
    
unique, counts = np.unique(y_train, return_counts=True)
print(np.column_stack((unique, counts)))
    
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

[['No' 1472]
 ['Yes' 1513]]
Accuracy: 0.714859437751004


#### Random upsampling

In [17]:
category_No = churnData[churnData['Churn'] == 'No']
category_Yes = churnData[churnData['Churn'] == 'Yes']

category_Yes = category_Yes.sample(len(category_No), replace=True)

churnData3 = pd.concat([category_No, category_Yes], axis=0)
churnData3 = churnData2.sample(frac=1)

X = churnData3.drop(['Churn'], axis=1)
y = churnData3['Churn']

# Applying the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
    
unique, counts = np.unique(y_train, return_counts=True)
print(np.column_stack((unique, counts)))
    
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

[['No' 1489]
 ['Yes' 1496]]
Accuracy: 0.7376171352074966


Random downsampling provides higher accuracy than random upsampling. But the best accuracy score is still achieved with TomekLinks.

I am confused and could not figure out why the total value counts before and after splitting the data is significantly high. See below.

In [18]:
# Before splitting
unique, counts = np.unique(y, return_counts=True)
print("Before splitting:")
print(np.column_stack((unique, counts)))

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# After splitting
unique_train, counts_train = np.unique(y_train, return_counts=True)
print("After splitting:")
print(np.column_stack((unique_train, counts_train)))

# Training the logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

# Class distribution after training
unique_pred, counts_pred = np.unique(y_train, return_counts=True)
print("After training:")
print(np.column_stack((unique_pred, counts_pred)))

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Before splitting:
[['No' 1866]
 ['Yes' 1866]]
After splitting:
[['No' 1489]
 ['Yes' 1496]]
After training:
[['No' 1489]
 ['Yes' 1496]]
Accuracy: 0.7376171352074966


### 7. Decision tree model

In [19]:
# Applying a decision tree model for classification
def decision_tree(X, y, imbalance_technique=None):
    if imbalance_technique is not None:
        X, y = imbalance_technique.fit_resample(X, y)
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    dectree = DecisionTreeClassifier()
    dectree.fit(X_train, y_train)
    y_pred = dectree.predict(X_test)
    
    print(imbalance_technique)
    unique, counts = np.unique(y_train, return_counts=True)
    print(np.column_stack((unique, counts)))
    
    accuracy = accuracy_score(y_test, y_pred)  
    print('Accuracy:', accuracy)

In [20]:
y = churnData['Churn']
X = churnData.drop(['Churn'], axis=1)

# Without imbalance technique
decision_tree(X, y)

# Upsampling with synthetic data with SMOTE
smote = SMOTE()
decision_tree(X, y, imbalance_technique=smote)

# Reducing noise with TomekLinks by identifying pairs from majority and minority classes that are close to each other and removing those belonging to the majority class
tl = TomekLinks()
decision_tree(X, y, imbalance_technique=tl)

None
[['No' 4094]
 ['Yes' 1522]]
Accuracy: 0.7174377224199289
SMOTE()
[['No' 4138]
 ['Yes' 4110]]
Accuracy: 0.7327837051406402
TomekLinks()
[['No' 3574]
 ['Yes' 1496]]
Accuracy: 0.6987381703470031


With a decision tree classification model the highest accuracy is achieved after applying SMOTE upsampling.

The accuracy for the logistic regression model were: <br>

**None** <br>
[['No' 4094] <br>
 ['Yes' 1522]] <br>
Accuracy: 0.79644128113879 <br>
**SMOTE** <br>
[['No' 4138] <br>
 ['Yes' 4110]] <br>
Accuracy: 0.7473326867119302 <br>
**TomekLinks** <br>
[['No' 3574] <br>
 ['Yes' 1496]] <br>
Accuracy: 0.7981072555205048

Overall, the logistic regression model shows higher accuracy than the decision tree model.

### 8. Random forests model

In [26]:
# Applying a random forest model for classification
def random_forest(X, y, imbalance_technique=None):
    if imbalance_technique is not None:
        X, y = imbalance_technique.fit_resample(X, y)
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    print(imbalance_technique)
    unique, counts = np.unique(y_train, return_counts=True)
    print(np.column_stack((unique, counts)))
    
    ranforest = RandomForestClassifier()
    ranforest.fit(X_train, y_train)
    y_pred = ranforest.predict(X_test)
    
    accuracy = accuracy_score(y_test, y_pred)
    print('Accuracy:', accuracy)

In [27]:
y = churnData['Churn']
X = churnData.drop(['Churn'], axis=1)

# Without imbalance technique
random_forest(X, y)

# Upsampling with synthetic data using SMOTE
smote = SMOTE()
random_forest(X, y, imbalance_technique=smote)

None
[['No' 4094]
 ['Yes' 1522]]
Accuracy: 0.7701067615658364
SMOTE()
[['No' 4138]
 ['Yes' 4110]]
Accuracy: 0.8026188166828322


The random forest model shows higher accuracy after upsampling with SMOTE.