# Lab | Imbalanced data

We will be using the `files_for_lab/customer_churn.csv` dataset to build a churn predictor.

### Instructions

1. Load the dataset and explore the variables.
2. We will try to predict the variable `Churn` using a logistic regression on the variables `tenure`, `SeniorCitizen`, and `MonthlyCharges`.
3. Split the dataset into X (`tenure`, `SeniorCitizen`, `MonthlyCharges`) and y (`Churn`).
4. Build the logistic regression model.
5. Evaluate the model.
6. Even a simple model will give us more than 70% accuracy. Why?
7. **Synthetic Minority Oversampling TEchnique (SMOTE)** is an oversampling technique based on nearest neighbors that adds new points between existing points. Apply `imblearn.over_sampling.SMOTE` to the dataset. Build and evaluate the logistic regression model. Is there any improvement?
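
Instruction 7 describes SMOTE as adding new points between existing points. The following is a minimal sketch of that interpolation idea, assuming NumPy is available. The function `smote_like_points` and the toy rows are hypothetical and made up for illustration; this is not the `imblearn` implementation.

```python
import numpy as np

def smote_like_points(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))            # pick a random minority row
        dists = np.linalg.norm(X_minority - X_minority[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]      # k nearest rows, skipping the row itself
        j = rng.choice(neighbours)                   # pick one of those neighbours
        gap = rng.random()                           # position along the segment, in [0, 1)
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

# Toy minority class: (tenure, MonthlyCharges) pairs, made up for illustration
toy_minority = [[1, 70.0], [2, 75.0], [3, 80.0], [5, 90.0], [4, 85.0]]
new_points = smote_like_points(toy_minority, n_new=4)
print(new_points)
```

Each synthetic row lies on the segment between a real minority row and one of its nearest neighbours, so it stays inside the range of the real data. A side effect is that integer or binary columns such as `tenure` or `SeniorCitizen` can receive fractional values.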


In [1]:
# Import libraries
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
# Load the dataset
customer = pd.read_csv('customer_churn.csv')
pd.set_option('display.max_columns', None)
customer.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [3]:
# Check Null values
customer.isna().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

In [4]:
customer.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SeniorCitizen | 7043.0 | 0.162147 | 0.368612 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| tenure | 7043.0 | 32.371149 | 24.559481 | 0.0 | 9.0 | 29.0 | 55.0 | 72.0 |
| MonthlyCharges | 7043.0 | 64.761692 | 30.090047 | 18.25 | 35.5 | 70.35 | 89.85 | 118.75 |


In [5]:
customer['SeniorCitizen'].value_counts(dropna = False)

SeniorCitizen
0    5901
1    1142
Name: count, dtype: int64

In [6]:
customer['tenure'].value_counts(dropna = False)

tenure
1     613
72    362
2     238
3     200
4     176
     ... 
28     57
39     56
44     51
36     50
0      11
Name: count, Length: 73, dtype: int64

In [7]:
customer['MonthlyCharges'].value_counts(dropna = False)

MonthlyCharges
20.05     61
19.85     45
19.95     44
19.90     44
20.00     43
          ..
23.65      1
114.70     1
43.65      1
87.80      1
78.70      1
Name: count, Length: 1585, dtype: int64

In [8]:
# Split the Dataset into X ('tenure', 'SeniorCitizen', 'MonthlyCharges') and y ('Churn')
X = customer[['tenure','SeniorCitizen','MonthlyCharges']]
y = customer[['Churn']]

In [9]:
display(X.head(),y.head())

| | tenure | SeniorCitizen | MonthlyCharges |
|---|---|---|---|
| 0 | 1 | 0 | 29.85 |
| 1 | 34 | 0 | 56.95 |
| 2 | 2 | 0 | 53.85 |
| 3 | 45 | 0 | 42.3 |
| 4 | 2 | 0 | 70.7 |


| | Churn |
|---|---|
| 0 | No |
| 1 | No |
| 2 | Yes |
| 3 | No |
| 4 | Yes |


In [10]:
# Encode the target as 1 (churn) and 0 (no churn)
y = customer['Churn'].map({'Yes': 1, 'No': 0})

In [11]:
# Split into train and test sets before any scaling, so the scaler is fit on the training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [12]:
# Scaling is left disabled: with StandardScaler the F1 score decreased
# transformer = StandardScaler().fit(X_train)
# X_train_scaled = pd.DataFrame(transformer.transform(X_train),columns=X.columns)
# X_test_scaled = pd.DataFrame(transformer.transform(X_test),columns=X.columns)
# X_train_scaled.head()
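
The cells above split the data first and fit the scaler on `X_train` only, so no information from the test rows reaches the scaler. The sketch below shows that order on made-up data, assuming NumPy and scikit-learn are installed. The arrays `X_demo` and `y_demo` are hypothetical stand-ins, not the churn dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up stand-in for the three features (tenure, SeniorCitizen, MonthlyCharges)
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.integers(0, 73, size=200),        # tenure-like values, 0..72
    rng.integers(0, 2, size=200),         # binary flag, 0 or 1
    rng.uniform(18.0, 119.0, size=200),   # charge-like values
]).astype(float)
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

scaler = StandardScaler().fit(X_tr)       # statistics come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)      # test rows reuse the training mean and std

print(X_tr_scaled.mean(axis=0).round(6))  # close to 0 by construction
print(X_te_scaled.mean(axis=0).round(6))  # generally not exactly 0
```

The test-set means are not exactly zero because the test rows are transformed with the training statistics, which is the intended behaviour.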

In [13]:
# Fit a logistic regression on the training set
LR = LogisticRegression(random_state=0, solver='lbfgs')
LR.fit(X_train, y_train)

In [14]:
# Accuracy on the test set
LR.score(X_test, y_test)

0.78137421919364

In [15]:
# Precision, recall, and F1 for the positive class (churn = 1) on the test set
pred = LR.predict(X_test)
print("precision: ", precision_score(y_test, pred))
print("recall: ", recall_score(y_test, pred))
print("f1: ", f1_score(y_test, pred))

precision:  0.6120689655172413
recall:  0.46004319654427644
f1:  0.5252774352651048


In [16]:
confusion_matrix(y_test,pred)

array([[1163,  135],
       [ 250,  213]], dtype=int64)
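
The scores printed above can be recomputed by hand from this confusion matrix, which also helps with instruction 6. The sketch below uses only the four counts shown in the output and compares the model's accuracy with a baseline that always predicts "no churn". The variable names are chosen here for illustration.

```python
# Test-set confusion matrix from the output above: rows are true classes
# (0 = no churn, 1 = churn), columns are predicted classes.
tn, fp = 1163, 135
fn, tp = 250, 213
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Baseline: always predict the majority class ("no churn")
baseline_accuracy = (tn + fp) / total

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"f1:        {f1:.4f}")
print(f"baseline:  {baseline_accuracy:.4f}")
```

On this test split, 1298 of 1761 customers did not churn, so predicting "no churn" for everyone already gives about 73.7% accuracy. Accuracy above 70% therefore says little about how well the model finds churners, which is why recall and F1 are reported as well.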

Apply the Synthetic Minority Oversampling TEchnique (SMOTE) to the training set only; the test set keeps its original class balance.

In [18]:
# Oversample the minority class in the training data
sm = SMOTE(random_state=100, k_neighbors=3)
X_train_SMOTE, y_train_SMOTE = sm.fit_resample(X_train, y_train)

In [19]:
X_train_SMOTE.shape

(7752, 3)
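
The resampled shape can be cross-checked against the other outputs in this notebook. The sketch below is plain arithmetic on numbers already shown above. It assumes SMOTE ran with its default sampling strategy, under which both classes end up the same size as the original majority class.

```python
# Figures taken from the outputs above
n_total = 7043                          # rows in the dataset (count in describe())
n_test = 1163 + 135 + 250 + 213         # test rows, summed from the confusion matrix
n_train = n_total - n_test              # rows left for training
n_resampled = 7752                      # X_train_SMOTE.shape[0]

# Assumption: default sampling strategy, so the two classes are equal in size
n_majority = n_resampled // 2           # non-churners in the training set
n_minority = n_train - n_majority       # churners in the original training set
n_synthetic = n_resampled - n_train     # synthetic churn rows added by SMOTE

print(n_train, n_majority, n_minority, n_synthetic)
```

Under that assumption, the training set grows from 5282 to 7752 rows, and 2470 of the 3876 churn rows seen by the model are synthetic.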

In [20]:
# Refit on the SMOTE-resampled training set, then evaluate on the original test set
LR = LogisticRegression(random_state=0, solver='lbfgs')
LR.fit(X_train_SMOTE, y_train_SMOTE)
pred = LR.predict(X_test)

print("precision: ", precision_score(y_test, pred))
print("recall: ", recall_score(y_test, pred))
print("f1: ", f1_score(y_test, pred))

precision:  0.4684813753581662
recall:  0.7062634989200864
f1:  0.5633074935400517


In [21]:
# Accuracy on the original test set (not on the resampled training set)
LR.score(X_test, y_test)

0.7120954003407155

In [22]:
confusion_matrix(y_test,pred)

array([[927, 371],
       [136, 327]], dtype=int64)
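
To answer the last question in the instructions, the two test-set confusion matrices can be put side by side. The sketch below uses only the counts printed in this notebook. The helper `summarize` is a hypothetical name introduced here for illustration.

```python
# Test-set confusion matrices from the outputs above, as (tn, fp, fn, tp)
before_smote = (1163, 135, 250, 213)
after_smote = (927, 371, 136, 327)

def summarize(tn, fp, fn, tp):
    """Return accuracy, precision, and recall for one confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

metrics_before = summarize(*before_smote)
metrics_after = summarize(*after_smote)

extra_churners_caught = after_smote[3] - before_smote[3]   # change in true positives
extra_false_alarms = after_smote[1] - before_smote[1]      # change in false positives

for name in metrics_before:
    print(f"{name:9s} {metrics_before[name]:.3f} -> {metrics_after[name]:.3f}")
print("extra churners caught:", extra_churners_caught)
print("extra false alarms:   ", extra_false_alarms)
```

After SMOTE the model catches 114 more churners on the test set, at the cost of 236 more false alarms. Recall and F1 rise while precision and accuracy fall, so whether this counts as an improvement depends on how costly a missed churner is compared with a false alarm.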