![logo_ironhack_blue 7](https://user-images.githubusercontent.com/23629340/40541063-a07a0a8a-601a-11e8-91b5-2f13e4e6b441.png)

# Lab | Imbalanced data

We will be using the `files_for_lab/customer_churn.csv` dataset to build a churn predictor.

### Instructions

1. Load the dataset and explore the variables.
2. We will try to predict the variable `Churn` using a logistic regression on the variables `tenure`, `SeniorCitizen`, and `MonthlyCharges`.
3. Extract the target variable.
4. Extract the independent variables and scale them.
5. Build the logistic regression model.
6. Evaluate the model.
7. Even a simple model will give us more than 70% accuracy. Why?
8. **Synthetic Minority Oversampling Technique (SMOTE)** is an oversampling technique based on nearest neighbors that adds new points between existing points. Apply `imblearn.over_sampling.SMOTE` to the dataset. Build and evaluate the logistic regression model. Is there any improvement?
9. **Tomek links** are pairs of very close instances that belong to opposite classes. Removing the majority-class instance of each pair increases the space between the two classes, which makes classification easier. Apply `imblearn.under_sampling.TomekLinks` to the dataset. Build and evaluate the logistic regression model. Is there any improvement?

In [1]:
import pandas as pd

from sklearn.preprocessing import StandardScaler
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
warnings.filterwarnings('ignore')


from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split      
from sklearn.linear_model import LogisticRegression  

from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks


### 1. Load the dataset and explore the variables.

In [2]:
data = pd.read_csv("files_for_lab/customer_churn.csv", sep=",")
data.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [3]:
data.shape

(7043, 21)

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [5]:
data.isna().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

### 2. We will try to predict variable Churn using a logistic regression on variables tenure, SeniorCitizen, MonthlyCharges.

In [6]:
data.Churn.value_counts() 

No     5174
Yes    1869
Name: Churn, dtype: int64

In [7]:
data.tenure.value_counts() 

1     613
72    362
2     238
3     200
4     176
     ... 
28     57
39     56
44     51
36     50
0      11
Name: tenure, Length: 73, dtype: int64

In [8]:
data.SeniorCitizen.value_counts() 

0    5901
1    1142
Name: SeniorCitizen, dtype: int64

In [9]:
data.MonthlyCharges.value_counts() 

20.05     61
19.85     45
19.95     44
19.90     44
20.00     43
          ..
23.65      1
114.70     1
43.65      1
87.80      1
78.70      1
Name: MonthlyCharges, Length: 1585, dtype: int64

### 3. Extract the target variable.

In [10]:
y=data['Churn']

### 4. Extract the independent variables and scale them.

In [11]:
numerical = data.select_dtypes(include = np.number)
numerical.describe()

| | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| count | 7043.0 | 7043.0 | 7043.0 |
| mean | 0.162147 | 32.371149 | 64.761692 |
| std | 0.368612 | 24.559481 | 30.090047 |
| min | 0.0 | 0.0 | 18.25 |
| 25% | 0.0 | 9.0 | 35.5 |
| 50% | 0.0 | 29.0 | 70.35 |
| 75% | 0.0 | 55.0 | 89.85 |
| max | 1.0 | 72.0 | 118.75 |


In [12]:
from sklearn.preprocessing import MinMaxScaler

num_scal = MinMaxScaler().fit(numerical) 
num_minmax = num_scal.transform(numerical) 
numerical_scaled = pd.DataFrame(num_minmax, index=numerical.index, columns=numerical.columns)  #Transform numerical_scaled to a DataFrame, and keep column names and indexes.

numerical_scaled.head() 

| | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| 0 | 0.0 | 0.013889 | 0.115423 |
| 1 | 0.0 | 0.472222 | 0.385075 |
| 2 | 0.0 | 0.027778 | 0.354229 |
| 3 | 0.0 | 0.625 | 0.239303 |
| 4 | 0.0 | 0.027778 | 0.521891 |
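
As a sanity check on the scaled values, the min-max formula `(x - min) / (max - min)` can be applied by hand. The sketch below uses the minimum and maximum reported by `describe()` and the first row of the dataset; the helper name `min_max` is only illustrative and is not part of scikit-learn.

```python
def min_max(value, col_min, col_max):
    """Rescale a value to [0, 1] given the column's minimum and maximum."""
    return (value - col_min) / (col_max - col_min)

# Minimum and maximum taken from the describe() output
tenure_min, tenure_max = 0.0, 72.0
charges_min, charges_max = 18.25, 118.75

# First row of the dataset: tenure = 1, MonthlyCharges = 29.85
scaled_tenure = min_max(1, tenure_min, tenure_max)
scaled_charges = min_max(29.85, charges_min, charges_max)

print(round(scaled_tenure, 6))   # 0.013889
print(round(scaled_charges, 6))  # 0.115423
```

Both results match the first row of `numerical_scaled`.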


### 5. Build the logistic regression model.

In [13]:
X=numerical_scaled 
X.head()

| | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| 0 | 0.0 | 0.013889 | 0.115423 |
| 1 | 0.0 | 0.472222 | 0.385075 |
| 2 | 0.0 | 0.027778 | 0.354229 |
| 3 | 0.0 | 0.625 | 0.239303 |
| 4 | 0.0 | 0.027778 | 0.521891 |


In [14]:
#Train Test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

In [15]:
# Fit a logistic regression (binary classification) on the training set

classing = LogisticRegression(random_state=0).fit(X_train, y_train)  # multi_class is not needed for a binary target
 

In [16]:
#Predicting on X_test (testing dataset)

predictions = classing.predict(X_test)

### 6. Evaluate the model.

In [17]:
y_test.value_counts() 

No     1539
Yes     574
Name: Churn, dtype: int64

In [18]:
pd.Series(predictions).value_counts() 

No     1744
Yes     369
dtype: int64

In [19]:
confusion_matrix(y_test,predictions)

array([[1425,  114],
       [ 319,  255]])

In [20]:
#Accuracy
classing.score(X_test,y_test)

0.7950780880265026

In [21]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

          No       0.82      0.93      0.87      1539
         Yes       0.69      0.44      0.54       574

    accuracy                           0.80      2113
   macro avg       0.75      0.69      0.70      2113
weighted avg       0.78      0.80      0.78      2113
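
The figures in the report for the "Yes" class can be recomputed from the confusion matrix, which helps when reading the two outputs together. This is a small sketch using only the four counts printed in `In [19]`; it assumes the usual scikit-learn layout (rows are true labels, columns are predictions, labels sorted as "No", "Yes").

```python
# Confusion matrix from In [19]: rows = true labels, columns = predictions,
# both in the order ["No", "Yes"].
tn, fp = 1425, 114   # true "No":  predicted "No", predicted "Yes"
fn, tp = 319, 255    # true "Yes": predicted "No", predicted "Yes"

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_yes = tp / (tp + fp)   # of the predicted churners, how many really churned
recall_yes = tp / (tp + fn)      # of the real churners, how many were found
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)

print(f"accuracy={accuracy:.2f} precision={precision_yes:.2f} "
      f"recall={recall_yes:.2f} f1={f1_yes:.2f}")
# accuracy=0.80 precision=0.69 recall=0.44 f1=0.54
```

The low recall shows that the model misses more than half of the customers who actually churned, even though the overall accuracy looks good.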



### 7. Even a simple model will give us more than 70% accuracy. Why?

The data is imbalanced: about 73% of the customers (5174 of 7043) did not churn, so a model that always predicts "No" would already reach roughly 73% accuracy. Accuracy is dominated by the majority class, so it can look high even when the model misses many churners (recall for "Yes" is only 0.44).
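
The accuracy of a model that always predicts the most frequent class can be worked out directly from the class counts. The sketch below uses the `value_counts()` outputs for the full dataset and for `y_test`; the function name `majority_baseline` is only illustrative.

```python
# Class counts taken from the value_counts() outputs
full_counts = {"No": 5174, "Yes": 1869}
test_counts = {"No": 1539, "Yes": 574}

def majority_baseline(counts):
    """Accuracy of a model that always predicts the most frequent class."""
    return max(counts.values()) / sum(counts.values())

print(f"{majority_baseline(full_counts):.3f}")  # 0.735
print(f"{majority_baseline(test_counts):.3f}")  # 0.728
```

Against this baseline of about 0.73 on the test set, the 0.80 reached by the logistic regression is a more modest gain than the raw number suggests.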

### 8. **Synthetic Minority Oversampling Technique (SMOTE)** is an oversampling technique based on nearest neighbors that adds new points between existing points. Apply imblearn.over_sampling.SMOTE to the dataset. Build and evaluate the logistic regression model. Is there any improvement?

In [22]:
data.Churn.value_counts()

No     5174
Yes    1869
Name: Churn, dtype: int64

In [23]:
smote = SMOTE()
X_sm, y_sm = smote.fit_resample(X, y)
y_sm.value_counts()

No     5174
Yes    5174
Name: Churn, dtype: int64
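
To make "adds new points between existing points" concrete, here is a simplified NumPy sketch of the interpolation idea. It is not the `imblearn` implementation: the function `smote_like_samples`, its parameters, and the toy points are assumptions made for illustration only.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority points
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_minority), size=n_new)   # random starting points
    choice = rng.integers(0, neighbours.shape[1], size=n_new)
    picked = neighbours[base, choice]                      # one neighbour per starting point
    gap = rng.random((n_new, 1))                           # position along the segment
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

# Toy minority class: the four corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min, n_new=6, k=2)
print(synthetic.shape)  # (6, 2)
```

Because every synthetic point is a weighted average of two existing minority points, it always falls inside the region already occupied by the minority class.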

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.3, random_state=100)
classification = LogisticRegression(random_state=0).fit(X_train, y_train)
predictions = classification.predict(X_test)

classification.score(X_test, y_test)

0.7355877616747182

In [25]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

          No       0.73      0.75      0.74      1557
         Yes       0.74      0.72      0.73      1548

    accuracy                           0.74      3105
   macro avg       0.74      0.74      0.74      3105
weighted avg       0.74      0.74      0.74      3105



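Accuracy drops from 0.80 to 0.74, as expected, but the model is much better at identifying "Yes": precision rises from 0.69 to 0.74 and recall from 0.44 to 0.72. Keep in mind that SMOTE was applied before the train/test split, so this test set also contains synthetic samples and the figures are not strictly comparable with the first model.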

### 9. **Tomek links** are pairs of very close instances that belong to opposite classes. Removing the majority-class instance of each pair increases the space between the two classes, which makes classification easier. Apply imblearn.under_sampling.TomekLinks to the dataset. Build and evaluate the logistic regression model. Is there any improvement?

In [26]:
tl = TomekLinks(sampling_strategy='majority')
X_tl, y_tl = tl.fit_resample(X, y)
y_tl.value_counts()

No     4694
Yes    1869
Name: Churn, dtype: int64

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X_tl, y_tl, test_size=0.3, random_state=100)
classification = LogisticRegression(random_state=0).fit(X_train, y_train)
predictions = classification.predict(X_test)

classification.score(X_test, y_test)

0.7917724733367192

In [28]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

          No       0.83      0.89      0.86      1421
         Yes       0.65      0.54      0.59       548

    accuracy                           0.79      1969
   macro avg       0.74      0.72      0.73      1969
weighted avg       0.78      0.79      0.79      1969



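Accuracy (0.79) is higher than with SMOTE (0.74) and essentially the same as for the first model (0.80). Recall for "Yes" improves from 0.44 to 0.54 compared with the first model, but the gap between the "No" and "Yes" precision widens (0.83 vs 0.65, against 0.82 vs 0.69 for the first model).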
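
The definition of a Tomek link can also be checked on a toy example: two points form a link when each is the other's nearest neighbour and their labels differ. The NumPy sketch below is a simplified illustration, not the `imblearn` implementation; the function `tomek_link_pairs` and the one-dimensional toy data are assumptions.

```python
import numpy as np

def tomek_link_pairs(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)       # ignore the distance of a point to itself
    nearest = dists.argmin(axis=1)        # nearest neighbour of every point
    pairs = []
    for i, j in enumerate(nearest):
        if i < j and nearest[j] == i and y[i] != y[j]:
            pairs.append((i, int(j)))
    return pairs

# Toy data: one feature, "No" is the majority class
X_toy = np.array([[0.0], [0.1], [1.0], [1.05], [2.0]])
y_toy = np.array(["No", "No", "No", "Yes", "Yes"])

pairs = tomek_link_pairs(X_toy, y_toy)
# Keep only the majority-class member of each link as a removal candidate
to_drop = sorted({i for pair in pairs for i in pair if y_toy[i] == "No"})

print(pairs)    # [(2, 3)]
print(to_drop)  # [2]
```

Points 0 and 1 are also mutual nearest neighbours, but they share a label, so they do not form a link. Only the "No" point at index 2 would be removed, which is why this technique usually discards relatively few samples (480 of the 5174 "No" rows in this dataset).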