### Instructions

1. Load the dataset and explore the variables.
2. We will try to predict the variable `Churn` using a logistic regression on the variables `tenure`, `SeniorCitizen`, and `MonthlyCharges`.
3. Extract the target variable.
4. Extract the independent variables and scale them.
5. Build the logistic regression model.
6. Evaluate the model.
7. Even a simple model will give us more than 70% accuracy. Why?
8. **Synthetic Minority Oversampling TEchnique (SMOTE)** is an oversampling technique based on nearest neighbors that adds new points between existing points. Apply `imblearn.over_sampling.SMOTE` to the dataset. Build and evaluate the logistic regression model. Is there any improvement?
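
To make the idea in step 8 concrete, here is a minimal sketch of the interpolation step behind SMOTE, assuming a minority-class point and one of its nearest neighbors are already known. The helper `interpolate_between` and the sample coordinates are hypothetical; the real `imblearn` implementation also handles the neighbor search and decides how many points to generate.

```python
import random


def interpolate_between(point, neighbor, rng):
    """Return a synthetic point on the segment between point and neighbor."""
    gap = rng.random()  # random position along the segment, in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]


rng = random.Random(0)

# Hypothetical scaled rows: (tenure, seniorcitizen, monthlycharges)
minority_point = [0.10, 0.0, 0.50]
nearest_neighbor = [0.30, 0.0, 0.70]

synthetic = interpolate_between(minority_point, nearest_neighbor, rng)
print(synthetic)
```

Each coordinate of the synthetic point lies between the corresponding coordinates of the two original points, which is why SMOTE fills in the region occupied by the minority class instead of duplicating rows.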

In [115]:
import imblearn
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt

import seaborn as sns

In [116]:
df = pd.read_csv('/Users/leozinho.air/Desktop/ironhack_da/class_10/lab-imbalanced-data/files_for_lab/customer_churn.csv')

df.isnull().values.any() # No NaN values

# Standardize column names: lowercase, with spaces replaced by underscores

df.columns = [col.lower().replace(' ', '_') for col in df.columns]

# Convert "churn" to binary (Yes -> 1, No -> 0)

df['churn'] = np.where(df['churn'] == 'Yes', 1, 0).astype(int)

# Check unique values of the independent variables

col_unique = ['tenure', 'seniorcitizen', 'monthlycharges']

for column in col_unique:
    # All three columns are numeric, so the model can be built without further encoding
    print(f"Unique values in column '{column}': {df[column].unique()}")




Unique values in column 'tenure': [ 1 34  2 45  8 22 10 28 62 13 16 58 49 25 69 52 71 21 12 30 47 72 17 27
  5 46 11 70 63 43 15 60 18 66  9  3 31 50 64 56  7 42 35 48 29 65 38 68
 32 55 37 36 41  6  4 33 67 23 57 61 14 20 53 40 59 24 44 19 54 51 26  0
 39]
Unique values in column 'seniorcitizen': [0 1]
Unique values in column 'monthlycharges': [29.85 56.95 53.85 ... 63.1  44.2  78.7 ]


In [117]:
# Scale the independent variables
# Note: the scaler is fitted on the full dataset before the train/test split,
# so the test rows influence the min/max used for scaling

independent_vars = df[['tenure', 'seniorcitizen', 'monthlycharges']]

# Initialize the scaler
from sklearn.preprocessing import MinMaxScaler
normalize = MinMaxScaler()

# Fit and transform the independent variables

normalized_vars = normalize.fit_transform(independent_vars)
normalized_X = pd.DataFrame(normalized_vars, columns=independent_vars.columns)

# Concatenate the scaled features with the target

df_normalized = pd.concat([normalized_X, df['churn']], axis=1)
df_normalized

,tenure,seniorcitizen,monthlycharges,churn
0,0.013889,0.0,0.115423,0
1,0.472222,0.0,0.385075,0
2,0.027778,0.0,0.354229,1
3,0.625000,0.0,0.239303,0
4,0.027778,0.0,0.521891,1
...,...,...,...,...
7038,0.333333,0.0,0.662189,0
7039,1.000000,0.0,0.845274,0
7040,0.152778,0.0,0.112935,0
7041,0.055556,1.0,0.558706,1
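
The scaled `tenure` values above can be checked by hand. `MinMaxScaler` maps each value to `(x - min) / (max - min)`. The sketch below assumes the minimum and maximum of `tenure` are 0 and 72, as the list of unique values printed earlier suggests; the helper name `min_max_scale` is illustrative only.

```python
def min_max_scale(value, minimum, maximum):
    """Rescale value linearly so that minimum maps to 0 and maximum maps to 1."""
    return (value - minimum) / (maximum - minimum)


# Assumed range of tenure, read off the unique values printed earlier
tenure_min, tenure_max = 0, 72

# Tenure of the first four customers (first four unique values in the output)
raw_tenure = [1, 34, 2, 45]

scaled = [round(min_max_scale(t, tenure_min, tenure_max), 6) for t in raw_tenure]
print(scaled)  # [0.013889, 0.472222, 0.027778, 0.625]
```

These match the first four rows of the `tenure` column in the scaled table.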


In [118]:
# X - y split

X = df_normalized.drop(['churn'], axis = 1)
y = df_normalized['churn']

# Train test split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size = 0.2)

# Create the model 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report,confusion_matrix


LR = LogisticRegression()
LR.fit(X_train, y_train)

LR.score(X_test, y_test)



0.8041163946061036

In [119]:
pred = LR.predict(X_test)

print("precision: ",precision_score(y_test,pred))
print("recall: ",recall_score(y_test,pred))
print("f1: ",f1_score(y_test,pred))

print(classification_report(y_test, pred))


precision:  0.6932270916334662
recall:  0.46648793565683644
f1:  0.5576923076923076
              precision    recall  f1-score   support

           0       0.83      0.93      0.87      1036
           1       0.69      0.47      0.56       373

    accuracy                           0.80      1409
   macro avg       0.76      0.70      0.72      1409
weighted avg       0.79      0.80      0.79      1409



In [120]:
confusion_matrix(y_test,pred)

array([[959,  77],
       [199, 174]])
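
As a sanity check, the scores printed earlier can be recomputed from this confusion matrix alone. The sketch below assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Values copied from the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp = 959, 77
fn, tp = 199, 174

accuracy = (tp + tn) / (tn + fp + fn + tp)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted churners who really churned
recall = tp / (tp + fn)                     # share of real churners the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"f1:        {f1:.4f}")
```

The low recall comes from the 199 false negatives: more than half of the 373 customers who churned in the test set were predicted as staying.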

In [121]:
# Even a simple model will give us more than 70% accuracy. Why?

df.groupby('churn').count()

# The high accuracy is largely explained by class imbalance: 5174 of 7043 customers (about 73%) did not churn,
# so a model that always predicts 0 would already exceed 70% accuracy.
# Accuracy alone is therefore misleading here; precision and recall for the churn class are more informative.


churn,customerid,gender,seniorcitizen,partner,dependents,tenure,phoneservice,multiplelines,internetservice,onlinesecurity,onlinebackup,deviceprotection,techsupport,streamingtv,streamingmovies,contract,paperlessbilling,paymentmethod,monthlycharges,totalcharges
0,5174,5174,5174,5174,5174,5174,5174,5174,5174,5174,5174,5174,5174,5174,5174,5174,5174,5174,5174,5174
1,1869,1869,1869,1869,1869,1869,1869,1869,1869,1869,1869,1869,1869,1869,1869,1869,1869,1869,1869,1869
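
The class counts above give a quick baseline for question 7. This is a minimal sketch of the arithmetic, assuming a trivial model that predicts "no churn" for every customer; the test-set counts are taken from the `support` column of the classification report.

```python
# Class counts from the groupby output above
no_churn, churn = 5174, 1869

# Accuracy of always predicting the majority class on the full dataset
baseline_accuracy = no_churn / (no_churn + churn)
print(f"Majority-class baseline (full data): {baseline_accuracy:.3f}")

# Same baseline on the test split, using the support values from the report
test_no_churn, test_churn = 1036, 373
test_baseline = test_no_churn / (test_no_churn + test_churn)
print(f"Majority-class baseline (test set):  {test_baseline:.3f}")
```

Both baselines are above 0.73, so the 0.80 accuracy of the logistic regression is only a modest gain over predicting that nobody churns.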


## SMOTE

In [122]:
from imblearn.over_sampling import SMOTE

# sampling_strategy=1.0 oversamples until the minority class is as large as the majority class (ratio 1:1);
# 0.5 would make the minority class half as big as the majority class
sm = SMOTE(random_state=1, sampling_strategy=1.0)
X_train_SMOTE,y_train_SMOTE = sm.fit_resample(X_train,y_train)

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report,confusion_matrix


LR = LogisticRegression(max_iter=1000)
LR.fit(X_train_SMOTE, y_train_SMOTE)
pred_sm = LR.predict(X_test)
LR.score(X_test, y_test)




0.7423704755145494

In [123]:
print("precision: ",precision_score(y_test,pred_sm))
print("recall: ",recall_score(y_test,pred_sm)) # we see a huge improvement of the recall
print("f1: ",f1_score(y_test,pred_sm))

print(classification_report(y_test, pred_sm))


precision:  0.5089285714285714
recall:  0.7640750670241286
f1:  0.6109324758842443
              precision    recall  f1-score   support

           0       0.90      0.73      0.81      1036
           1       0.51      0.76      0.61       373

    accuracy                           0.74      1409
   macro avg       0.70      0.75      0.71      1409
weighted avg       0.79      0.74      0.76      1409



In [124]:
# The confusion matrix shows that the number of false negatives decreased from 199 to 88

print(f'False Negatives decreased by {round((199-88)/199*100, 1)} %')
print(f'This means the model identified {199-88} instances of churn that were missed by the earlier model.')


confusion_matrix(y_test,pred_sm)


False Negatives decreased by 55.8 %
This means the model identified 111 instances of churn that were missed by the earlier model.


array([[761, 275],
       [ 88, 285]])
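
The gain in recall has a cost that the two confusion matrices make visible. The sketch below compares them side by side, assuming the `[[TN, FP], [FN, TP]]` layout used by scikit-learn; the dictionary names are illustrative.

```python
# Confusion matrices copied from the outputs above: [[TN, FP], [FN, TP]]
before_smote = {"tn": 959, "fp": 77, "fn": 199, "tp": 174}
after_smote = {"tn": 761, "fp": 275, "fn": 88, "tp": 285}

extra_churners_caught = after_smote["tp"] - before_smote["tp"]
extra_false_alarms = after_smote["fp"] - before_smote["fp"]

print(f"Additional churners detected: {extra_churners_caught}")
print(f"Additional false alarms:      {extra_false_alarms}")

for name, m in (("before SMOTE", before_smote), ("after SMOTE", after_smote)):
    precision = m["tp"] / (m["tp"] + m["fp"])
    recall = m["tp"] / (m["tp"] + m["fn"])
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}")
```

Training on the SMOTE-balanced data catches 111 more churners but also flags 198 more loyal customers, which is why accuracy and precision fall while recall and F1 rise. Whether that trade is an improvement depends on how costly a missed churner is compared with a false alarm.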