## Logistic Regression

Playground notebook for implementing logistic regression on the preprocessed data.

In [2]:
import pandas as pd
import numpy as np


In [103]:
# load preprocessed data
df = pd.read_parquet('../data/preprocessed.parquet')

### Dealing with Class Imbalance

From the EDA, we observed that the number of users who didn't churn exceeds the number who did by a wide margin. This class imbalance has to be addressed before fitting the model.

Failing to address it can produce a model that scores high accuracy by predicting the majority class (users who didn't churn) while failing to capture the minority class.
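
To make that risk concrete, here is a minimal sketch using made-up labels (90 non-churners and 10 churners, not the counts from this dataset). A "model" that always predicts the majority class looks accurate while catching no churners at all.

```python
# Hypothetical labels: 90 non-churners ("No") and 10 churners ("Yes").
y_true = ["No"] * 90 + ["Yes"] * 10

# A trivial "model" that always predicts the majority class.
y_pred = ["No"] * len(y_true)

# Accuracy: share of all predictions that are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the churn class: share of actual churners that were caught.
churner_preds = [p for t, p in zip(y_true, y_pred) if t == "Yes"]
recall_yes = sum(p == "Yes" for p in churner_preds) / len(churner_preds)

print(f"accuracy = {accuracy:.2f}, churn recall = {recall_yes:.2f}")
# prints: accuracy = 0.90, churn recall = 0.00
```

This is why accuracy alone is a poor guide on imbalanced data, and why the classes are rebalanced before fitting.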

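I'll use over-sampling to deal with the class imbalance. The technique is the Synthetic Minority Oversampling Technique, or SMOTE, which generates synthetic data for the minority class. Because the dataset mixes categorical and continuous features, the code in this section uses the SMOTE-NC variant (`SMOTENC` from `imblearn`), which handles both.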

**What is SMOTE?**  
>SMOTE (Synthetic Minority Oversampling Technique) works by randomly picking a point from the minority class and computing the k-nearest neighbors for this point. The synthetic points are added between the chosen point and its neighbors.
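
The interpolation step described above can be sketched in a few lines of plain Python. This is an illustration of basic SMOTE on continuous features only, not the `imblearn` implementation and not SMOTE-NC; the function name `smote_point` and the sample points are hypothetical.

```python
import math
import random


def smote_point(minority, k=2, rng=None):
    """Generate one synthetic point from a list of minority-class points."""
    rng = rng or random.Random()

    # 1. Randomly pick a point from the minority class.
    base = rng.choice(minority)

    # 2. Find its k nearest neighbours (Euclidean distance), excluding itself.
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]

    # 3. Pick one neighbour and interpolate at a random position between the two.
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # uniform in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))


# Hypothetical minority-class points with two continuous features.
minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]
synthetic = smote_point(minority_points, k=2, rng=random.Random(0))
print(synthetic)
```

Each synthetic point lies on the line segment between an existing minority point and one of its neighbours, so the new samples stay inside the region the minority class already occupies.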

In [18]:
from imblearn.over_sampling import SMOTENC

In [None]:
cat_covariates = [
    'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService',
    'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
    'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
    'Contract', 'PaperlessBilling', 'PaymentMethod',
]

In [101]:
def apply_smotenc(data, cat_covariates_lst):
    """
    Function to apply SMOTE-NC on dataset

    Args:
        data [dataframe]: dataframe with both predictor and target variables, consisting of binary, categorical or continuous variables. The target must be the last column. NaN values are not allowed
        cat_covariates_lst [list]: list of names for the categorical variables

    Returns:
        df_smote [dataframe]: resampled dataset with balanced target classes

    """

    # split predictors and target (the target is the last column)
    x = data.iloc[:, :-1]
    y = data.iloc[:, -1]

    # locate column indices of the categorical predictors
    cat_feat = np.where(x.columns.isin(cat_covariates_lst))[0]

    # initialise SMOTENC
    smote_nc = SMOTENC(categorical_features=cat_feat)

    # resample predictors and target; errors propagate to the caller
    # instead of being masked by a try/finally that returns
    x_smote, y_smote = smote_nc.fit_resample(x.to_numpy(), y.to_numpy())

    # convert x and y arrays back to dataframes
    x_smote_df = pd.DataFrame(x_smote, columns=x.columns)
    y_smote_df = pd.DataFrame(y_smote, columns=[y.name])

    # concat into single dataframe
    df_smote = pd.concat([x_smote_df, y_smote_df], axis=1)

    return df_smote

In [104]:
df_smote = apply_smotenc(df,cat_covariates)

In [109]:
df_smote.Churn.value_counts()

No     5163
Yes    5163
Name: Churn, dtype: int64

After applying SMOTE-NC, the two churn classes are balanced at 5,163 rows each.

### Applying Logistic Regression