## Logistic Regression on a customer churn dataset (telecom subscribers)

In [None]:
import pandas as pd
import numpy as np

# import plotting libraries
import seaborn as sns
import matplotlib.pyplot as plt

# import sklearn libraries
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay  # replaces plot_confusion_matrix, which was removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split

### Load Data 

In [None]:
churnData = pd.read_csv('customer_churn.csv') 
churnData.head(5)

### Business question 

Can we predict whether a customer account will churn?

### Normal intermediate steps 

We are skipping the following steps to focus on the class imbalance alone:

+ EDA 

+ Data cleaning 

+ Feature selection and engineering 

We will attempt to predict churn using just a few numerical features and see how accurately the baseline model predicts each class.

### Separate the dependent and independent variables 

In [None]:
X = churnData[['tenure', 'SeniorCitizen','MonthlyCharges']]
y = (churnData.Churn == 'Yes').astype(int)

In [None]:
X.info()

### Scale the numerical data 

In [None]:
transformer = StandardScaler().fit(X)
scaled_x = transformer.transform(X)

In [None]:
scaled_x

### Confirm the imbalance in our target label 

In [None]:
y.value_counts()
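To put the raw counts in perspective, it can help to look at class shares as well. The sketch below is a minimal illustration, not part of the original analysis: it rebuilds a hypothetical target series (`y_demo`) from the counts quoted in this notebook (5174 did not churn, 1869 churned) and computes the accuracy a model would reach by always predicting the majority class.

```python
import pandas as pd

# Hypothetical target rebuilt from the counts quoted in this notebook
# (0 = did not churn, 1 = churned); not loaded from customer_churn.csv
y_demo = pd.Series([0] * 5174 + [1] * 1869, name="Churn")

# Share of each class instead of raw counts
class_share = y_demo.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = class_share.max()

print(class_share)
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any accuracy figure reported further down should be read against this baseline, since a model can score close to it while missing most churners.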

Further reading on handling imbalanced datasets: https://www.kdnuggets.com/2020/01/5-most-useful-techniques-handle-imbalanced-datasets.html

### Modeling, including model validation with a train-test split

In [None]:
# train test split 
X_train, X_test, y_train, y_test = train_test_split(scaled_x, y, test_size=0.3, random_state=100)

In [None]:
# define and train a LogisticRegression model
# (multi_class is omitted: the target is binary, and the argument is deprecated in recent scikit-learn releases)
classification = LogisticRegression(random_state=0, solver='lbfgs').fit(X_train, y_train)

# create predictions from the trained model
y_pred = classification.predict(X_test)

### Evaluate the baseline model with confusion matrix 

In [None]:
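# confusion matrix: scikit-learn expects (y_true, y_pred), so rows are actual and columns are predicted
print(confusion_matrix(y_test, y_pred))

# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(classification, X_test, y_test)
plt.show()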

Observation: the model predicts customers who did not churn far more reliably than customers who churned, because the target variable is heavily imbalanced: `5174` accounts did not churn versus `1869` that did.
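One way to make this observation measurable is per-class recall, read row by row from the confusion matrix. The sketch below uses a small set of hypothetical labels (`y_true_demo`, `y_pred_demo`), not the churn data, purely to show how overall accuracy can look acceptable while recall on the minority class stays low.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels for illustration only: 0 = did not churn, 1 = churned
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

# Rows are actual classes, columns are predicted classes
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# Recall per class = correct predictions in a row / all actual members of that class
recall_per_class = cm_demo.diagonal() / cm_demo.sum(axis=1)

# Overall accuracy = all correct predictions / all predictions
accuracy_demo = cm_demo.diagonal().sum() / cm_demo.sum()

# The same per-class recall via scikit-learn's helper
recall_sklearn = recall_score(y_true_demo, y_pred_demo, average=None)

print(cm_demo)
print("recall per class:", recall_per_class)
print("accuracy:", accuracy_demo)
```

In this toy case the majority class is recalled far better than the minority class, which is the same pattern described in the observation above.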

In [None]:
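# accuracy on the held-out test set (scoring on the full data would include the training rows)
classification.score(X_test, y_test)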

In [None]:
# % confusion matrix heat map

# shortened name; scikit-learn expects (y_true, y_pred), so rows are actual and columns are predicted
cnfmat = confusion_matrix(y_test, y_pred)

# creating a Dataframe out of our confusion matrix, easier to plot in seaborn. 
df_cm = pd.DataFrame(cnfmat, columns=np.unique(y_test), index = np.unique(y_test))

# column and index names to our df
df_cm.index.name = 'Actual (Churn?)'
df_cm.columns.name = 'Predicted (Churn?)'

# set the fontsize for my plot
sns.set(font_scale=1)

# set plot size
fig, ax = plt.subplots(figsize=(8,8))

# FuncFormatter wraps a custom function that formats the colorbar ticks as percentages
from matplotlib.ticker import FuncFormatter
fmt = lambda x,pos: '{:.0%}'.format(x)

# plot the heatmap for our confusion matrix
sns.heatmap(df_cm/df_cm.sum().sum(),  # plot the number of values as percentage of all values in the confusion matrix
            annot=True,
            fmt='.0%',
            cmap='hot',
            annot_kws={"size":15},
            cbar_kws={'format': FuncFormatter(fmt)}
           );

### Applying SMOTE (oversampling the minority class of the target label)

### Model again and plot the confusion matrix (same code as above)

Observation:

- 
- 
- 


### Applying TomekLinks to downsample the majority class

Observation:

- 
- 
- 


### Which method to use?

* If you're unsure which method to use, give [this article](https://machinelearningmastery.com/data-sampling-methods-for-imbalanced-classification/) a read.
* In practice, combining several sampling techniques often yields the best result, so you can also explore the hybrid methods in [this article](https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/) to try to improve on the SMOTE result in this notebook.