1. Load the dataset and explore the variables.
2. Extract the independent variables and scale them.
3. Extract the target variable.
4. Build the logistic regression model.
5. Evaluate the model.
6. Even a simple model will give us more than 70% accuracy. Why?
7. Upsampling
8. Downsampling
9. **Synthetic Minority Oversampling TEchnique (SMOTE)** is an oversampling technique based on nearest neighbors that adds new points between existing points. Apply `imblearn.over_sampling.SMOTE` to the dataset. Build and evaluate the logistic regression model. Is there any improvement?
10. **Tomek links** are pairs of very close instances of opposite classes. Removing the majority-class instance of each pair increases the space between the two classes, which makes classification easier. Apply `imblearn.under_sampling.TomekLinks` to the dataset. Build and evaluate the logistic regression model. Is there any improvement?

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
import warnings
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split
warnings.filterwarnings('ignore')

##### We will try to predict variable 'Churn' using a logistic regression on variables 'Tenure', 'SeniorCitizen', and 'MonthlyCharges'.

### 1. Load the dataset and explore the variables.

In [2]:
churnData = pd.read_csv('customer_churn.csv',sep=",")
churnData.head(5)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [3]:
churnData = churnData.rename(columns={'customerID': 'CustomerID', 'gender' : 'Gender', 'tenure': 'Tenure'})
churnData.columns

Index(['CustomerID', 'Gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'Tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In [4]:
churnData['Churn'].value_counts()

No     5174
Yes    1869
Name: Churn, dtype: int64

In [5]:
5174/(5174+1869)
# About 73% of the customers are 'No', so the two categories are clearly imbalanced.

0.7346301292063041

### 2. Extract the independent variables and scale them.

In [6]:
numericals = churnData.select_dtypes(include="number")
numericals.head()

Unnamed: 0,SeniorCitizen,Tenure,MonthlyCharges
0,0,1,29.85
1,0,34,56.95
2,0,2,53.85
3,0,45,42.3
4,0,2,70.7


In [7]:
transformer = StandardScaler().fit(numericals)
standard_x = transformer.transform(numericals)
X = pd.DataFrame(standard_x)
X.head()

Unnamed: 0,0,1,2
0,-0.439916,-1.277445,-1.160323
1,-0.439916,0.066327,-0.259629
2,-0.439916,-1.236724,-0.36266
3,-0.439916,0.514251,-0.746535
4,-0.439916,-1.236724,0.197365


In [8]:
X.columns = numericals.columns
X.head()

Unnamed: 0,SeniorCitizen,Tenure,MonthlyCharges
0,-0.439916,-1.277445,-1.160323
1,-0.439916,0.066327,-0.259629
2,-0.439916,-1.236724,-0.36266
3,-0.439916,0.514251,-0.746535
4,-0.439916,-1.236724,0.197365


### 3. Extract the target variable.

In [9]:
y = churnData["Churn"]
y

0        No
1        No
2       Yes
3        No
4       Yes
       ... 
7038     No
7039     No
7040     No
7041    Yes
7042     No
Name: Churn, Length: 7043, dtype: object

### 4. Build the logistic regression model.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

In [11]:
len(X_test)

2113

In [12]:
classing = LogisticRegression(random_state=0).fit(X_train, y_train)
predictions = classing.predict(X_test)

In [13]:
confusion_matrix(y_test,predictions)

array([[1420,  119],
       [ 317,  257]])

### 5. Evaluate the model.

In [14]:
classing.score(X_test,y_test)

0.7936583057264552
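Accuracy is only one view of the result. As a hedged illustration, the sketch below rebuilds label vectors that match the confusion matrix from cell 13 (rows are true labels, columns are predictions, in the order 'No', 'Yes') and computes Cohen's kappa and the recall for 'Yes' from them. The rebuilt vectors stand in for `y_test` and `predictions`; in the notebook itself those two variables could be passed directly.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

# Confusion matrix reported above: rows = true class, columns = predicted class.
cm = np.array([[1420, 119],
               [317, 257]])
labels = np.array(['No', 'Yes'])

# Rebuild label vectors that reproduce exactly this matrix.
y_true = np.repeat(labels, cm.sum(axis=1))
y_pred = np.concatenate([np.repeat(labels, row) for row in cm])
assert (confusion_matrix(y_true, y_pred, labels=labels) == cm).all()

accuracy = accuracy_score(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)   # agreement beyond chance
churn_recall = cm[1, 1] / cm[1].sum()       # share of true 'Yes' that were found

print(f"accuracy:     {accuracy:.4f}")
print(f"kappa:        {kappa:.4f}")
print(f"recall 'Yes': {churn_recall:.4f}")
```

Kappa and the minority-class recall are much less flattering than the accuracy, which is the point the next question raises.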

### 6. Even a simple model will give us more than 70% accuracy. Why?

Accuracy is not an appropriate metric here because the classes are imbalanced. About 73% of the customers did not churn (see step 1), so a model that always predicts the majority class 'No' already reaches roughly 73% accuracy without identifying a single churner. Our model's 79% is only a modest gain over that baseline, and the confusion matrix shows that it misses 317 of the 574 churners in the test set.
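The majority-class baseline can be made explicit with scikit-learn's `DummyClassifier`. This is a minimal sketch that assumes only the class counts from step 1 (5174 'No', 1869 'Yes'); the feature matrix `X_dummy` is a placeholder, since the dummy model ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score

# Labels with the same class counts as the churn dataset.
y = np.array(['No'] * 5174 + ['Yes'] * 1869)
X_dummy = np.zeros((len(y), 1))  # placeholder features, ignored by the model

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy='most_frequent').fit(X_dummy, y)

baseline_accuracy = baseline.score(X_dummy, y)
baseline_recall = recall_score(y, baseline.predict(X_dummy), pos_label='Yes')

print(f"baseline accuracy:     {baseline_accuracy:.4f}")
print(f"baseline recall 'Yes': {baseline_recall:.4f}")
```

Any model worth keeping should beat this baseline on a metric that reflects the minority class, not only on accuracy.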

##### Trying the model by increasing the imbalance.

In [15]:
yes = churnData[churnData['Churn']=='Yes']
no = churnData[churnData['Churn']=='No']
# Reduce the 'Yes' records to a random sample of 500.
yes = yes.sample(500)

In [16]:
no

Unnamed: 0,CustomerID,Gender,SeniorCitizen,Partner,Dependents,Tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
6,1452-KIOVK,Male,0,No,Yes,22,Yes,Yes,Fiber optic,No,...,No,No,Yes,No,Month-to-month,Yes,Credit card (automatic),89.10,1949.4,No
7,6713-OKOMC,Female,0,No,No,10,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,No,Mailed check,29.75,301.9,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7037,2569-WGERO,Female,0,No,No,72,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,Yes,Bank transfer (automatic),21.15,1419.4,No
7038,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,...,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5,No
7039,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9,No
7040,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No


In [17]:
yes

Unnamed: 0,CustomerID,Gender,SeniorCitizen,Partner,Dependents,Tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
6936,7693-LCKZL,Male,0,Yes,Yes,5,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,80.15,385,Yes
6764,7660-HDPJV,Female,0,No,No,1,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,69.20,69.2,Yes
3910,8938-UMKPI,Female,0,No,No,47,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,106.40,5127.95,Yes
3526,9026-RNUJS,Male,1,No,No,5,Yes,No,DSL,No,...,Yes,No,No,No,Month-to-month,No,Electronic check,50.35,237.25,Yes
505,5609-CEBID,Female,1,No,No,20,Yes,Yes,Fiber optic,No,...,Yes,No,No,Yes,Month-to-month,Yes,Electronic check,94.10,1782.4,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,7912-SYRQT,Female,0,No,No,7,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Bank transfer (automatic),75.10,552.95,Yes
6726,0685-MLYYM,Female,1,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,No,Electronic check,70.75,154.85,Yes
1306,0201-OAMXR,Female,0,No,No,70,Yes,Yes,Fiber optic,Yes,...,Yes,Yes,Yes,Yes,One year,No,Credit card (automatic),115.55,8127.6,Yes
4814,7270-BDIOA,Female,0,No,No,22,Yes,Yes,Fiber optic,No,...,No,Yes,Yes,No,Month-to-month,Yes,Electronic check,90.00,1993.8,Yes


In [18]:
# Concatenating the 'Yes' and 'No' rows. 
data = pd.concat([yes,no], axis=0)
print(data['Churn'].value_counts())
data.head()

No     5174
Yes     500
Name: Churn, dtype: int64


Unnamed: 0,CustomerID,Gender,SeniorCitizen,Partner,Dependents,Tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
6936,7693-LCKZL,Male,0,Yes,Yes,5,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,80.15,385.0,Yes
6764,7660-HDPJV,Female,0,No,No,1,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,69.2,69.2,Yes
3910,8938-UMKPI,Female,0,No,No,47,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,106.4,5127.95,Yes
3526,9026-RNUJS,Male,1,No,No,5,Yes,No,DSL,No,...,Yes,No,No,No,Month-to-month,No,Electronic check,50.35,237.25,Yes
505,5609-CEBID,Female,1,No,No,20,Yes,Yes,Fiber optic,No,...,Yes,No,No,Yes,Month-to-month,Yes,Electronic check,94.1,1782.4,Yes


In [19]:
# Shuffling the data for random order.
data = data.sample(frac=1)
data.head()

Unnamed: 0,CustomerID,Gender,SeniorCitizen,Partner,Dependents,Tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
1725,5935-FCCNB,Female,1,No,No,17,Yes,Yes,Fiber optic,No,...,No,No,Yes,Yes,Month-to-month,Yes,Electronic check,94.2,1608.15,No
1766,8763-KIAFH,Female,0,Yes,Yes,27,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,One year,Yes,Mailed check,20.55,583.3,No
6254,9933-QRGTX,Female,0,Yes,No,60,Yes,No,Fiber optic,Yes,...,No,Yes,Yes,Yes,Two year,Yes,Electronic check,97.2,5611.75,No
3536,2254-DLXRI,Female,0,No,No,1,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Mailed check,79.15,79.15,No
4131,2876-VBBBL,Female,0,No,No,1,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Month-to-month,Yes,Mailed check,20.25,20.25,Yes


In [20]:
features = ['Tenure', 'SeniorCitizen', 'MonthlyCharges']
transformer = StandardScaler().fit(data[features])
scaled_x = transformer.transform(data[features])

y = data['Churn']  # 1-D Series, as scikit-learn expects for the target

X_train, X_test, y_train, y_test = train_test_split(scaled_x, y, test_size=0.3, random_state=100)
classification = LogisticRegression(random_state=0).fit(X_train, y_train)
predictions = classification.predict(X_test)

classification.score(X_test, y_test)

0.9248385202583675

In [21]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

          No       0.93      1.00      0.96      1575
         Yes       0.50      0.03      0.06       128

    accuracy                           0.92      1703
   macro avg       0.71      0.51      0.51      1703
weighted avg       0.89      0.92      0.89      1703



### 7. Upsampling.
##### Manual upsampling of the minority class by simply repeating samples from the minority class.

Despite the advantage of balancing classes, these techniques also have their weaknesses. The simplest implementation of upsampling is to duplicate random records from the minority class, which can cause overfitting. In undersampling, the simplest technique involves removing random records from the majority class, which can cause loss of information.
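Both naive strategies fit in a few lines of pandas. The sketch below uses a ten-row toy frame (the name `toy` and its values are made up for illustration) so that the duplication caused by upsampling and the rows lost by downsampling are easy to see.

```python
import pandas as pd

# Toy data: 7 majority rows ('No') and 3 minority rows ('Yes').
toy = pd.DataFrame({
    'Tenure': range(10),
    'Churn': ['No'] * 7 + ['Yes'] * 3,
})
majority = toy[toy['Churn'] == 'No']
minority = toy[toy['Churn'] == 'Yes']

# Upsampling: draw minority rows with replacement until both classes match.
upsampled = pd.concat([
    majority,
    minority.sample(len(majority), replace=True, random_state=0),
])

# Downsampling: keep only as many majority rows as there are minority rows.
downsampled = pd.concat([
    majority.sample(len(minority), random_state=0),
    minority,
])

print(upsampled['Churn'].value_counts().to_dict())
print(downsampled['Churn'].value_counts().to_dict())
print("repeated rows after upsampling:", upsampled.index.duplicated().sum())
```

The repeated index values in `upsampled` are exact copies of minority rows (the overfitting risk), while `downsampled` has simply discarded four of the seven majority rows (the information loss).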

In [22]:
counts = churnData['Churn'].value_counts()
counts

No     5174
Yes    1869
Name: Churn, dtype: int64

In [23]:
yes = churnData[churnData['Churn']=='Yes'].sample(counts['No'], replace=True) # counts['No'] is the size of the majority class (5174)
no = churnData[churnData['Churn']=='No']
data = pd.concat([yes,no], axis=0)
data = data.sample(frac=1)
data['Churn'].value_counts()

No     5174
Yes    5174
Name: Churn, dtype: int64

In [24]:
features = ['Tenure', 'SeniorCitizen', 'MonthlyCharges']
transformer = StandardScaler().fit(data[features])
scaled_x = transformer.transform(data[features])

y = data['Churn']  # 1-D Series, as scikit-learn expects for the target

X_train, X_test, y_train, y_test = train_test_split(scaled_x, y, test_size=0.3, random_state=100)
classification = LogisticRegression(random_state=0).fit(X_train, y_train)
predictions = classification.predict(X_test)

classification.score(X_test, y_test)

0.7256038647342995

##### Upsampling and downsampling with the 'imblearn' library.

Other, more sophisticated resampling techniques include:

- For downsampling: cluster the records of the majority class and remove records from each cluster, thus seeking to preserve information.
- For upsampling: instead of creating exact copies of the minority class records, introduce small variations into those copies, creating more diverse synthetic samples.

In [25]:
!pip install imbalanced-learn



In [26]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

In [27]:
ros = RandomOverSampler()
X = churnData[['Tenure', 'SeniorCitizen', 'MonthlyCharges']]
transformer = StandardScaler().fit(X)
X = transformer.transform(X)
y = churnData['Churn']
X_ros, y_ros = ros.fit_resample(X, y)

In [28]:
y.value_counts()

No     5174
Yes    1869
Name: Churn, dtype: int64

In [29]:
y_ros.value_counts()

No     5174
Yes    5174
Name: Churn, dtype: int64

In [30]:
transformer = StandardScaler().fit(X_ros)
X = transformer.transform(X_ros)

X_train, X_test, y_train, y_test = train_test_split(X, y_ros, test_size=0.3, random_state=100)
classification = LogisticRegression(random_state=0).fit(X_train, y_train)
predictions = classification.predict(X_test)

classification.score(X_test, y_test)

0.7465378421900161

##### We have more records to train the model, but the accuracy is similar to the manual upsampling result.
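One caveat with resampling before the train/test split is that copies of the same minority record can end up in both sets, which can make the test score optimistic. A common alternative is to split first and resample only the training portion. The sketch below illustrates that order on synthetic data from `make_classification` (not the churn data), and it uses `sklearn.utils.resample` in place of `RandomOverSampler`; all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (about 73% class 0), standing in for the churn set.
X, y = make_classification(n_samples=2000, n_features=3, n_informative=3,
                           n_redundant=0, weights=[0.73, 0.27], random_state=0)

# 1. Split first, so the test set keeps the original class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# 2. Oversample the minority class in the training set only.
maj_idx = np.flatnonzero(y_train == 0)
min_idx = np.flatnonzero(y_train == 1)
min_up = resample(min_idx, replace=True, n_samples=len(maj_idx), random_state=0)
idx = np.concatenate([maj_idx, min_up])
X_bal, y_bal = X_train[idx], y_train[idx]

# 3. Fit on the balanced training data, evaluate on the untouched test set.
model = LogisticRegression().fit(X_bal, y_bal)
print(classification_report(y_test, model.predict(X_test)))
```

With this order, the reported metrics describe performance on data with the real class balance.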

### 8. Downsampling.

In [31]:
rus = RandomUnderSampler()
X = churnData[['Tenure', 'SeniorCitizen', 'MonthlyCharges']]
transformer = StandardScaler().fit(X)
X = transformer.transform(X)
y = churnData['Churn']
X_rus, y_rus = rus.fit_resample(X, y)

In [32]:
y.value_counts()

No     5174
Yes    1869
Name: Churn, dtype: int64

In [33]:
y_rus.value_counts()

No     1869
Yes    1869
Name: Churn, dtype: int64

In [34]:
transformer = StandardScaler().fit(X_rus)
X = transformer.transform(X_rus)

X_train, X_test, y_train, y_test = train_test_split(X, y_rus, test_size=0.3, random_state=100)
classification = LogisticRegression(random_state=0).fit(X_train, y_train)
predictions = classification.predict(X_test)

classification.score(X_test, y_test)

0.7379679144385026

##### We have less data, but the accuracy is very close to what we obtained with upsampling.

### 9. Synthetic Minority Oversampling TEchnique (SMOTE).

##### SMOTE creates as many synthetic samples of the minority class as needed to balance the classes. The algorithm can be broken down into the following steps: (1) randomly pick a point from the minority class; (2) compute its k nearest neighbors within the minority class, for some pre-specified k; (3) add new points somewhere on the segments between the chosen point and its neighbors.
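The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the `imblearn` implementation: the function `smote_sketch` and its parameters are hypothetical, it picks one random neighbor per new point, and it works on randomly generated minority data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors because each point is returned as its own nearest one.
    _, neighbors = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)                 # step 1: random points
    picked = neighbors[base, rng.integers(1, k + 1, size=n_new)]   # step 2: one of their k neighbors
    gap = rng.random((n_new, 1))                                   # step 3: position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical minority class: 20 points with 3 features.
X_min = np.random.default_rng(42).normal(size=(20, 3))
X_new = smote_sketch(X_min, n_new=30)
print(X_new.shape)
```

Because every synthetic point lies between two existing minority points, the new samples stay inside the region the minority class already occupies.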

In [35]:
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X = churnData[['Tenure', 'SeniorCitizen', 'MonthlyCharges']]
transformer = StandardScaler().fit(X)
X = transformer.transform(X)
y = churnData['Churn']
X_sm, y_sm = smote.fit_resample(X, y)
y_sm.value_counts()

No     5174
Yes    5174
Name: Churn, dtype: int64

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.3, random_state=100)
classification = LogisticRegression(random_state=0).fit(X_train, y_train)
predictions = classification.predict(X_test)

classification.score(X_test, y_test)

0.7458937198067633

##### Not a very big improvement, because in this particular case SMOTE has a similar effect to repeating samples from the minority class: the records within the minority class are similar to each other.

### 10. Tomek links.

Tomek links are pairs of very close instances, but of opposite classes. Removing the instances of the majority class of each pair increases the space between the two classes, facilitating the classification process.
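To make the definition concrete, the sketch below finds Tomek links in a tiny one-dimensional toy set: two points form a link when each is the other's nearest neighbor and their classes differ. The data and variable names are made up for illustration, and this is not how `imblearn` is implemented internally.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy data on a line; 'No' is the majority side, 'Yes' the minority side.
X = np.array([[0.0], [1.1], [2.0], [2.4], [4.0], [5.1]])
y = np.array(['No', 'No', 'No', 'Yes', 'Yes', 'Yes'])

# Column 0 of the result is the point itself, column 1 its nearest other point.
_, idx = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
nearest = idx[:, 1]

# A Tomek link: mutual nearest neighbors that belong to different classes.
tomek_pairs = [
    (i, int(j)) for i, j in enumerate(nearest)
    if nearest[j] == i and y[i] != y[j] and i < j
]

# With a 'majority' strategy, only the 'No' member of each pair is removed.
majority_to_drop = [i for pair in tomek_pairs for i in pair if y[i] == 'No']

print(tomek_pairs)
print(majority_to_drop)
```

Only the points at 2.0 and 2.4 form a link here; the points at 4.0 and 5.1 are also mutual nearest neighbors, but they share a class, so they are left alone.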

In [37]:
from imblearn.under_sampling import TomekLinks

tl = TomekLinks(sampling_strategy = 'majority')
X_tl, y_tl = tl.fit_resample(X, y)
y_tl.value_counts()

No     4694
Yes    1869
Name: Churn, dtype: int64

In [38]:
X_tl2, y_tl2 = tl.fit_resample(X_tl, y_tl)
y_tl2.value_counts()

No     4541
Yes    1869
Name: Churn, dtype: int64

##### Tomek links do not make the two classes equal; they only remove the majority-class points that are closest to minority-class points. With these potentially confusing instances removed, the model can separate the two classes more easily.

In [39]:
X_train, X_test, y_train, y_test = train_test_split(X_tl, y_tl, test_size=0.3, random_state=100)
classification = LogisticRegression(random_state=0).fit(X_train, y_train)
predictions = classification.predict(X_test)

classification.score(X_test, y_test)

0.7973590655154901

In [40]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

          No       0.83      0.90      0.86      1413
         Yes       0.67      0.54      0.60       556

    accuracy                           0.80      1969
   macro avg       0.75      0.72      0.73      1969
weighted avg       0.79      0.80      0.79      1969



##### Accuracy is back to the level of the initial simple model (about 0.80). With the Tomek links removed there is slightly less imbalance and less potential for misclassifying the minority class, so the model is more trustworthy: both classes are actually predicted (recall of 0.54 for 'Yes').