# Machine Learning Modeling

---

## Import packages & Load Data

In [86]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.linear_model import LogisticRegression

# shows plots in jupyter notebook
%matplotlib inline

# set plot style
sns.set(color_codes=True)

In [87]:
clients_df = pd.read_csv('../data/processed/processed_data.csv')
print('Shape of the dataset: ', clients_df.shape)
clients_df.head()

Shape of the dataset:  (14606, 16)


Unnamed: 0,consumption_last_year_energy,off_peak_forecast_energy,off_peak_forecast_power,has_gas,consumption_current_energy,active_products,net_margin,antiquity,consumption_current_power,churn,off_peak_mean_energy,off_peak_mean_power,off_peak_diff_energy,off_peak_diff_power,origin,channel
0,0,0.114481,40.606701,1,0.0,2,678.99,3,43.648,1,0.124787,40.942265,0.020057,3.700961,lxidpiddsbxsbosboudacockeimpuepw,foosdfpfkusacimwkcsosbicdxkicaua
1,4660,0.145711,44.311378,0,0.0,1,18.89,6,13.8,0,0.149609,44.311375,-0.003767,0.177779,kamkkxfxxuwbdslkwifmmcsiusiuosws,MISSING
2,544,0.165794,44.311378,0,0.0,1,6.6,6,13.856,0,0.170512,44.38545,-0.00467,0.177779,kamkkxfxxuwbdslkwifmmcsiusiuosws,foosdfpfkusacimwkcsosbicdxkicaua
3,1584,0.146694,44.311378,0,0.0,1,25.46,6,13.2,0,0.15121,44.400265,-0.004547,0.177779,kamkkxfxxuwbdslkwifmmcsiusiuosws,lmkebamcaaclubfxadlmueccxoimlema
4,4425,0.1169,40.606701,0,52.32,1,47.98,6,19.8,0,0.124174,40.688156,-0.006192,0.162916,kamkkxfxxuwbdslkwifmmcsiusiuosws,MISSING


---

## Data Preprocessing

In [95]:
# split data into train and test sets, stratified on churn to keep the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    clients_df.drop('churn', axis=1),
    clients_df['churn'],
    test_size=0.2,
    stratify=clients_df['churn'],
    random_state=42,
)

In [94]:
def preprocess_data():
    # features to be one hot encoded
    ohe_features = ['origin', 'channel']
    # features to be log transformed
    log_features = ['consumption_last_year_energy', 'off_peak_forecast_energy', 'off_peak_forecast_power', 'consumption_current_energy', 'consumption_current_power']
    # features to be standardized
    standardize_features = log_features + ['net_margin', 'off_peak_mean_energy', 'off_peak_mean_power', 'off_peak_diff_energy', 'off_peak_diff_power']
    
    # step 1: one hot encode categorical features
    ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    ohe.fit(X_train[ohe_features])
    X_train_ohe = pd.DataFrame(ohe.transform(X_train[ohe_features]), columns=ohe.get_feature_names_out(ohe_features))
    X_test_ohe = pd.DataFrame(ohe.transform(X_test[ohe_features]), columns=ohe.get_feature_names_out(ohe_features))
    print('Shape of the train set after one hot encoding: ', X_train_ohe.shape)
    print('Shape of the test set after one hot encoding: ', X_test_ohe.shape)

    # step 2: log transform features
    X_train[log_features] = np.log(X_train[log_features]+1)
    X_test[log_features] = np.log(X_test[log_features]+1)
    
    # step 3: standardize features
    scaler = StandardScaler()
    scaler.fit(X_train[standardize_features])
    X_train_scaled = pd.DataFrame(scaler.transform(X_train[standardize_features]), columns=standardize_features)
    X_test_scaled = pd.DataFrame(scaler.transform(X_test[standardize_features]), columns=standardize_features)
    print('Shape of the train set after standardization: ', X_train_scaled.shape)
    print('Shape of the test set after standardization: ', X_test_scaled.shape)
    
    # step 4: concatenate all features
    X_train_processed = pd.concat([X_train_ohe, X_train_scaled, X_train[['has_gas', 'active_products', 'antiquity']]], axis=1)
    X_test_processed = pd.concat([X_test_ohe, X_test_scaled, X_test[['has_gas', 'active_products', 'antiquity']]], axis=1)
    print('Shape of the train set after preprocessing: ', X_train_processed.shape)
    print('Shape of the test set after preprocessing: ', X_test_processed.shape)

    return X_train_processed, X_test_processed

In [96]:
X_train_processed, X_test_processed = preprocess_data()

Shape of the train set after one hot encoding:  (11684, 10)
Shape of the test set after one hot encoding:  (2922, 10)
Shape of the train set after standardization:  (11684, 10)
Shape of the test set after standardization:  (2922, 10)
Shape of the train set after preprocessing:  (13992, 23)
Shape of the test set after preprocessing:  (5291, 23)


In [91]:
X_train_processed.shape

(13992, 23)

In [92]:
X_train.shape

(11684, 15)
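
The row counts above disagree: `X_train` has 11,684 rows, but `X_train_processed` has 13,992. The most likely cause is index alignment in `pd.concat(..., axis=1)`. The one-hot and scaled frames are built from NumPy arrays and so get a fresh `RangeIndex`, while `X_train` keeps the shuffled row labels from `train_test_split`. The concatenation then takes the union of both label sets and pads the gaps with NaN. The toy example below reproduces the effect and shows one possible fix (passing `index=` when wrapping the arrays). All names and values in it are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for X_train: 4 rows that kept shuffled labels from a split.
features = pd.DataFrame({'antiquity': [3, 6, 6, 4]}, index=[7, 2, 9, 0])

# A transformer returns a plain NumPy array, so the row labels are lost.
scaled_values = np.array([[0.1], [0.2], [0.3], [0.4]])

# Wrapping the array without an index gives a fresh RangeIndex 0..3.
scaled_misaligned = pd.DataFrame(scaled_values, columns=['net_margin'])
# concat(axis=1) aligns on labels: union of {0, 1, 2, 3} and {7, 2, 9, 0}.
misaligned = pd.concat([scaled_misaligned, features], axis=1)
print(misaligned.shape)  # (6, 2)

# Reusing the original index keeps the rows lined up one-to-one.
scaled_aligned = pd.DataFrame(scaled_values, columns=['net_margin'], index=features.index)
aligned = pd.concat([scaled_aligned, features], axis=1)
print(aligned.shape)  # (4, 2)
```

In `preprocess_data`, the equivalent change would be to pass `index=X_train.index` (and `index=X_test.index`) to each `pd.DataFrame(...)` call, which should bring the processed frames back to 11,684 and 2,922 rows.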

---

## Baseline Model

First, let's establish a baseline.
Since the churn rate is about 10%, one option is random guessing: predict churn with a probability of 10%. Another is a model that always predicts non-churn, which would be right about 90% of the time but would never identify a single churner.

Let's define a metric to evaluate our models.  
Going back to the business problem, we don't want to lose customers: when a company is likely to churn, we want to be able to identify it. Recall measures this.
At the same time, when we predict that a company will churn, we want to be right, because we'll be offering it a 20% discount, which is a cost for us. Precision measures this.

Given that, we'll use the F1 score, which is the harmonic mean of precision and recall.
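
As a quick illustration of why accuracy alone is misleading here, the sketch below computes precision, recall, and F1 from confusion-matrix counts for a made-up sample of 1,000 customers with 10% churn. The counts and the helper name `f1_from_counts` are hypothetical and are not results from this dataset.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical sample: 1000 customers, 100 of whom churn.
# "Always predict non-churn": no positives are predicted at all.
always_negative = f1_from_counts(tp=0, fp=0, fn=100)
accuracy_always_negative = (0 + 900) / 1000  # (TP + TN) over all customers
print(accuracy_always_negative, always_negative)  # 0.9 (0.0, 0.0, 0.0)

# A hypothetical model that flags 80 customers, 40 of them true churners.
print(f1_from_counts(tp=40, fp=40, fn=60))
```

The always-negative model reaches 90% accuracy yet scores 0 on F1, which is why F1 is the more informative target for this problem.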

In [3]:
def print_scores(y_true, y_pred):
    print(f'Accuracy: {accuracy_score(y_true, y_pred):.2f}')
    print(f'Precision: {precision_score(y_true, y_pred, zero_division=0):.2f}')
    print(f'Recall: {recall_score(y_true, y_pred, zero_division=0):.2f}')
    print(f'F1: {f1_score(y_true, y_pred, zero_division=0):.2f}')

In [17]:
# first baseline: guess churn at random with probability 0.1
baseline_1_guesses = np.random.choice([0, 1], size=len(clients_df), p=[.9, .1])
print_scores(clients_df.churn, baseline_1_guesses)

Accuracy: 0.82
Precision: 0.11
Recall: 0.11
F1: 0.11
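
These numbers are close to what random guessing should give in expectation. If the true churn rate is p and the guesser independently predicts churn with probability q, precision is roughly p, recall is roughly q, and accuracy is roughly pq + (1 - p)(1 - q). A small check, assuming p = q = 0.1 as in the text above (the actual churn rate is only stated as "about 10%"):

```python
p = 0.10  # assumed true churn rate (about 10%, per the text)
q = 0.10  # probability that the random baseline predicts churn

expected_precision = p  # share of flagged customers who really churn
expected_recall = q     # share of churners who happen to be flagged
expected_accuracy = p * q + (1 - p) * (1 - q)
expected_f1 = (2 * expected_precision * expected_recall
               / (expected_precision + expected_recall))

print(f'{expected_accuracy:.2f} {expected_precision:.2f} '
      f'{expected_recall:.2f} {expected_f1:.2f}')
# 0.82 0.10 0.10 0.10
```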


---

## Logistic Regression

The random baseline has an F1 score of around 10%, with precision and recall equal.  
We need to achieve better results with our modeling.  
Since Logistic Regression is a simple model, let's see whether it performs better than this simple baseline. If it does, it will become our actual baseline.

In [33]:
# one hot encode X_train using scikit learn
ohe = OneHotEncoder(handle_unknown='ignore')

# fit ohe to X_train
ohe.fit(X_train)

# transform X_train
X_train_ohe = ohe.transform(X_train)

In [32]:
# fit a logistic regression model
logreg_pipeline = Pipeline(pipeline_steps + [('logreg', LogisticRegression(random_state=42))])
logreg_pipeline.fit(X_train, y_train)
logreg_preds = logreg_pipeline.predict(X_test)
print_scores(y_test, logreg_preds)

ValueError: Cannot center sparse matrices: pass `with_mean=False` instead. See docstring for motivation and alternatives.
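
The error comes from `StandardScaler`: by default it centers the data, which it refuses to do on sparse input, and `OneHotEncoder` returns a sparse matrix unless `sparse_output=False` is set. The contents of `pipeline_steps` are not shown, so the following is only a sketch of one way to avoid the problem, assuming the goal is to one-hot encode the categorical columns and scale the numeric ones. It uses `ColumnTransformer` so that each transformer only sees its own columns. The toy data and the names `preprocessor` and `logreg_sketch` are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with the same kinds of columns as the notebook (values are made up).
toy_X = pd.DataFrame({
    'channel': ['a', 'b', 'a', 'MISSING', 'b', 'a'],
    'net_margin': [678.99, 18.89, 6.6, 25.46, 47.98, 120.0],
    'antiquity': [3, 6, 6, 6, 6, 4],
})
toy_y = pd.Series([1, 0, 0, 0, 0, 1])

preprocessor = ColumnTransformer(
    transformers=[
        # Categorical columns: one-hot encode, ignoring unseen categories.
        ('ohe', OneHotEncoder(handle_unknown='ignore'), ['channel']),
        # Numeric columns: scaled separately, so no sparse input reaches the scaler.
        ('scale', StandardScaler(), ['net_margin']),
    ],
    remainder='passthrough',  # keep the remaining columns unchanged
)

logreg_sketch = Pipeline([
    ('preprocess', preprocessor),
    ('logreg', LogisticRegression(random_state=42)),
])
logreg_sketch.fit(toy_X, toy_y)
toy_preds = logreg_sketch.predict(toy_X)
print(toy_preds.shape)  # (6,)
```

On the real data, the same structure could be fit with `X_train` and `y_train` after listing the actual column names. Fitting inside the pipeline also keeps the encoder and scaler from seeing the test set. The alternative named in the error message, `StandardScaler(with_mean=False)`, would scale without centering and also accepts sparse input.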