- title: course: (ML Series) Logistic Regression
- date: 2020-10-15 12:00
- category: course
- tags: python, machine learning
- slug: logistic regression
- authors: Julien Hernandez Lallement
- summary: A discussion of the Logistic Regression algorithm
- illustration: 2020_10_logistic_regression.jpg

## Logistic Regression

### General Background

Generally speaking, regression methods are supervised learning techniques. They use continuous variables (the independent variables) to predict the behavior of a dependent variable. They can rely on different equations that fit straight lines (linear regression), polynomial functions (detecting interaction effects) or other functions to predict the dependent variable.

In this post, I will focus on logistic regression, which is typically used as a classification algorithm. Logistic regression is, however, not a classification algorithm per se: it outputs a probability, and it can be used as a classifier once a probability threshold is set to differentiate between classes.

### Use case

Logistic regression can be used in different scenarios:
* Classifying data based on continuous features (when the dependent variable is categorical, typically binary, i.e., 0 and 1)
* Classifying whether an email is spam or not

### Theoretical Background

Let's first explain why the most common use of Logistic Regression is classification. Again, to be clear, Logistic Regression is not a classification algorithm per se. It can be used for classification if a threshold is set: depending on whether the predicted probability for a vector of features falls above or below that threshold, the data point is assigned to one class or the other.

Look at the graph below:

GRAPH

Here, the y axis shows the class of the data points (spam or not spam), and the x axis shows a feature of these data points (say, the number of spelling mistakes in the mail's subject line).

As you can imagine, a linear fit would be quite poor in this case. If new data comes in, with `y=1`, but with a high value of x, your linear fit will show a tremendous increase in residuals, and fit the data poorly. 

Moreover, if the y axis does not reflect classes but probabilities of belonging to a class, you will have issues because your linear fit might predict values >1 or <0, which would be nonsense.
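
To make this concrete, here is a minimal sketch with made-up data (the values in `xs` and `ys` and the helper `fit_line` are assumptions for illustration only). It fits a straight line by ordinary least squares and shows that the predictions leave the [0, 1] range:

```python
# Toy data (made up for illustration): x = number of spelling mistakes,
# y = 1 if the mail is spam, 0 otherwise.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 0, 1, 1, 1, 1]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(xs, ys)


def predict(x):
    """Prediction of the fitted straight line."""
    return intercept + slope * x


print(predict(0))   # below 0 on this toy data
print(predict(20))  # well above 1 on this toy data
```

Neither value can be read as a probability, which is the issue described above.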

One way to circumvent this problem is to use a `sigmoidal function`, such as the logistic function:

$
h_\theta(x) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x)}}
$

This function is shown in the figure above. Using it, a new data point is classified as 1 or 0 depending on whether $h_\theta(x)$ falls above or below the probability threshold.
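
As a rough sketch of the idea, the logistic function and the thresholding step could look like this. The parameter values `THETA_0` and `THETA_1` are chosen by hand for illustration (they are not fitted to any data), and `classify` is a hypothetical helper:

```python
import math


def logistic(x, theta_0, theta_1):
    """Logistic function: 1 / (1 + exp(-(theta_0 + theta_1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(theta_0 + theta_1 * x)))


def classify(x, theta_0, theta_1, threshold=0.5):
    """Return 1 when the predicted probability reaches the threshold, else 0."""
    return 1 if logistic(x, theta_0, theta_1) >= threshold else 0


# Hypothetical parameters, chosen by hand (not fitted)
THETA_0, THETA_1 = -4.0, 1.0

for x in (0, 4, 8):
    probability = logistic(x, THETA_0, THETA_1)
    print(x, round(probability, 3), classify(x, THETA_0, THETA_1))
```

Whatever the value of x, the output stays strictly between 0 and 1, so it can be interpreted as a probability.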

I believe that the threshold should be decided based on business needs and common sense. A threshold of 0.5 would be quite flexible, allowing some "smart" spam (if that exists) to land in the mailbox. In turn, assuming $h_\theta$ is the probability of a mail being spam, we might lower the threshold to keep spam out of the mailbox, while taking the risk of getting false positives (good mails that land in the spam folder). Trial and error is a good way to tune the threshold and monitor its efficiency.
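
Here is a small sketch of that trade-off. The scores in `scored_mails` are made up, and I assume that a score is the predicted probability of being spam and that a mail is flagged as spam when its score reaches the threshold:

```python
# Made-up (predicted spam probability, true label) pairs; 1 = spam, 0 = good mail
scored_mails = [
    (0.05, 0), (0.20, 0), (0.35, 0), (0.45, 1),
    (0.55, 0), (0.60, 1), (0.80, 1), (0.95, 1),
]


def confusion_counts(scored, threshold):
    """Count false positives and false negatives for a given threshold."""
    # False positive: a good mail flagged as spam
    false_positives = sum(1 for p, y in scored if p >= threshold and y == 0)
    # False negative: a spam mail that reaches the inbox
    false_negatives = sum(1 for p, y in scored if p < threshold and y == 1)
    return false_positives, false_negatives


for threshold in (0.3, 0.5, 0.7):
    fp, fn = confusion_counts(scored_mails, threshold)
    print(f"threshold={threshold}: {fp} good mails in spam folder, {fn} spam in inbox")
```

On this toy data, the lowest threshold lets no spam through but sends two good mails to the spam folder, while the highest threshold does the opposite.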

##### Cost function

While for `Linear Regression` and its variants the cost function was typically a `Residual Sum of Squares`, things become a bit more tricky with `Logistic Regression`. Indeed, while the cost function of Linear Regression has a single minimum, that is not always the case once the logistic function is plugged into a squared-error cost, because the result is generally not convex.

One typically used cost function is the following, which **does have** a single minimum:

$
J(\theta) = -\frac{1}{N} \sum \limits _{n=1} ^{N} \left[ y^n \ln(h_\theta(x^n)) + (1 - y^n) \ln(1 - h_\theta(x^n)) \right]
$

This function might seem quite complex at first, but when scrutinizing it, some might already see an analogy with Maximum Likelihood Estimation using probabilities (see my post on Introduction to Machine Learning). And indeed, there is a clear link, since minimizing this cost function amounts to maximizing the likelihood of the observed class labels under the model.
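
To make the link explicit, here is a short derivation sketch. It assumes that the labels are independent and that each $y^n$ follows a Bernoulli distribution with $P(y^n = 1 \mid x^n) = h_\theta(x^n)$. The likelihood of the observed labels is then:

$
L(\theta) = \prod \limits _{n=1} ^{N} h_\theta(x^n)^{y^n} (1 - h_\theta(x^n))^{1 - y^n}
$

Taking the logarithm turns the product into a sum:

$
\ln L(\theta) = \sum \limits _{n=1} ^{N} \left[ y^n \ln(h_\theta(x^n)) + (1 - y^n) \ln(1 - h_\theta(x^n)) \right] = -N J(\theta)
$

Since $N$ is a positive constant, the $\theta$ that minimizes $J(\theta)$ is the same one that maximizes the likelihood.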

In other words, when y = 1, the cost of a data point, $-\ln(h_\theta(x))$, decreases monotonically towards 0 as $h_\theta(x)$ approaches 1.
Similarly, when y = 0, the cost, $-\ln(1 - h_\theta(x))$, goes to 0 as $h_\theta(x)$ approaches 0.
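
This behavior is easy to check numerically. Below is a minimal sketch of the cost function in plain Python. The function name `log_loss` is my own choice for illustration, and the probabilities are assumed to lie strictly between 0 and 1 so that the logarithms are defined:

```python
import math


def log_loss(y_true, y_prob):
    """Average cost J over N data points (probabilities strictly in (0, 1))."""
    n = len(y_true)
    total = sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, y_prob)
    )
    return -total / n


# Cost of a single data point with y = 1 as the prediction approaches 1
for p in (0.1, 0.5, 0.9, 0.99):
    print(p, round(log_loss([1], [p]), 4))

# Cost of a single data point with y = 0 as the prediction approaches 0
for p in (0.9, 0.5, 0.1, 0.01):
    print(p, round(log_loss([0], [p]), 4))
```

In both loops the printed cost shrinks towards 0 as the prediction gets closer to the true label.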

## Practical Demonstration

#### World of Warcraft dataset

We will use Logistic regression to predict gamers that will churn. Check this [repo](https://github.com/myles-oneill/WoWAH-parser) for the full data source, or get a small version of the data from my own repo, in the datasets folder. Check as well this nice [video](https://www.youtube.com/watch?v=a0p8xnrQb4s) for a full explanation of the data.

To be honest about the code, the data pipeline below was developed during a course I attended at the Netherlands Institute for Neuroscience in 2019. I modified a few bits here and there to add some data features, but the general logic was defined by the group during the course.

As you will see, most of the functions are preparatory functions.

One important one is the `add_churn_label` function, which is where the status of churner is defined. This is important, since it is the pattern that the algorithm will try to predict in unseen data.

The definition is a classic churn definition, with users having played in a certain time window, and having not logged into the game in a future time window.
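
Stripped down to its core, this definition can be sketched in plain Python. The login log below is made up, `chars_in_window` is a hypothetical helper, and the minimum-activity filter of the real pipeline is left out for brevity:

```python
from datetime import datetime

# Made-up login log: (character id, login timestamp)
logins = [
    ("a", datetime(2008, 1, 5)), ("a", datetime(2008, 2, 10)),
    ("a", datetime(2008, 4, 20)),
    ("b", datetime(2008, 1, 7)), ("b", datetime(2008, 2, 25)),
    ("c", datetime(2008, 5, 2)),
]


def chars_in_window(logins, start, end):
    """Characters with at least one login in the half-open window [start, end)."""
    return {char for char, ts in logins if start <= ts < end}


before = chars_in_window(logins, datetime(2008, 1, 1), datetime(2008, 3, 1))
after = chars_in_window(logins, datetime(2008, 4, 1), datetime(2008, 6, 1))

# Churned = active in the first window, absent from the later one
churned = {char: char not in after for char in sorted(before)}
print(churned)  # {'a': False, 'b': True}
```

Character `c` only shows up in the later window, so it never enters the labeled population.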

Note that I focus on the implementation of the algorithm, so as usual in this series of posts, I won't go into data exploration or complex feature engineering. I leave you the fun of exploring the dataset and increasing the accuracy of the model.

In [27]:
import pandas as pd
import numpy as np
import datetime as dt
from datetime import timedelta
import plotnine as p9

In [2]:
import os
# Adjust this path to wherever you stored the dataset
os.chdir('/home/julien/website/datasets')

In [30]:
df = pd.read_csv("wowah_data_small.csv")

In [53]:
def start_pipeline(dataf):
    return dataf.copy()

def fix_columns(dataf):
    dataf.columns = [c.lower().replace(" ","") for c in dataf.columns]
    return dataf

def fix_types(dataf):
    return dataf.assign(timestamp=lambda d: 
                        pd.to_datetime(d['timestamp'], 
                                       format="%m/%d/%y %H:%M:%S"))

def add_ts_features(dataf):
    return (dataf
           .assign(hour = lambda d: d['timestamp'].dt.hour)
           .assign(day_of_week = lambda d: d['timestamp'].dt.dayofweek))

def add_month_played(dataf):
    return (dataf
           .assign(month_played = lambda d: d['timestamp'].dt.month_name()))

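def following_timestamp(dataf):
    # Note: after sorting, shift() moves values down one row, so each row
    # receives the timestamp/char of the *previous* row (despite the name).
    # The shift is not grouped by char, so the first row of each char
    # holds values from the preceding char.
    return (dataf
           .sort_values(by=['char','timestamp'])
           .assign(following_timestamp = lambda d: (d['timestamp'].shift()))
           .assign(following_char = lambda d: (d['char'].shift()))
           )

def played_time(dataf):
    # Time elapsed since the previous logged row
    return (dataf
           .assign(played_time = lambda d: (d['timestamp'] - d['following_timestamp'])))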

def add_churn_label(dataf, before_period=("2008-01-01", "2008-03-01"),
                    after_period=("2008-04-01", "2008-06-01"), min_rows=10):
    before_df = (dataf
                 .loc[lambda d: d['timestamp'] >= pd.to_datetime(before_period[0])]
                 .loc[lambda d: d['timestamp'] < pd.to_datetime(before_period[1])])
 
    after_df = (dataf
                 .loc[lambda d: d['timestamp'] >= pd.to_datetime(after_period[0])]
                 .loc[lambda d: d['timestamp'] < pd.to_datetime(after_period[1])])
 
    before_chars = (before_df
     .groupby("char")
     .count()
     .loc[lambda d: d['level'] > min_rows]
     .reset_index()['char'])
 
    after_chars = (after_df
     .groupby("char")
     .count()
     .reset_index()['char'])
 
    # Keep rows of sufficiently active chars; churned = never seen in the after period
    return (before_df
     .loc[lambda d: d['char'].isin(before_chars)]
     .assign(churned = lambda d: ~d['char'].isin(after_chars)))

df_clean = (df
 .pipe(start_pipeline)
 .pipe(fix_columns)
 .pipe(fix_types)
 .pipe(add_ts_features)
 .pipe(add_month_played)
 .pipe(following_timestamp)     
 .pipe(played_time) 
 .pipe(add_churn_label)       
)

The `char` feature identifies the user in the dataframe, and each row is one of that user's playing entries. Let's group by that and compute a few relevant data features to feed the model:

In [60]:
ml_df = (df_clean
       .groupby(['char'])
       .apply(lambda d: pd.Series({
           "time_played": d.shape[0],
           "churned": d['churned'].any(),
           "mean_level": d['level'].mean(),
           "var_level": d['level'].var(),
           "guild_bool": d['guild'].any(),
           "max_level": d['level'].max().astype(float),
           "min_level": d['level'].min().astype(float)
       }))
        .assign(level_speed = lambda d: (d['max_level'] - d['min_level']) / d['time_played'])
         )

OK! We are ready to fit a model to predict churn. This post is about `Logistic Regression`, but I will use another classification algorithm, the `KNeighbors Classifier`, as a comparison.

In [61]:
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV

from sklego.preprocessing import ColumnSelector

In [None]:
# Pre Processing Pipeline
panda_grabber = Pipeline([
    ("union", FeatureUnion([
        ("continous", Pipeline([
            ("select", ColumnSelector(["max_level", "min_level"])),
            ("scale", StandardScaler())
        ])),
        ("discrete", Pipeline([
            ("select", ColumnSelector(["guild_bool"])),
            # Note: scikit-learn >= 1.2 renames `sparse` to `sparse_output`
            ("encode", OneHotEncoder(sparse=False))
        ]))
    ]))
])

# Main Pipeline
pipeline = Pipeline([
    ("grab", panda_grabber),
    ("poly", PolynomialFeatures(interaction_only=True)),
    #("scale", StandardScaler()), 
    ("model", KNeighborsClassifier(10, weights='distance'))
])

In [None]:
# Define polynomial
param_poly = [PolynomialFeatures(interaction_only=True), None]

# Define models to test
param_model = [KNeighborsClassifier(1), 
               KNeighborsClassifier(10), 
               LogisticRegression(solver='lbfgs')]

# Define Grid Search
mod = GridSearchCV(pipeline,
                   #iid=True,
                   return_train_score=True,
                   param_grid={"model": param_model, # This overwrites the pipeline declaration of the model
                               "poly": param_poly},
                   cv=10)
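
To see in isolation how the `model` entry of `param_grid` replaces the model step declared in the pipeline, here is a minimal sketch on synthetic data. The names `toy_pipeline` and `toy_search`, the data from `make_classification` and the choice of `cv=3` are assumptions made for illustration, not part of the churn pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, only to make the example self-contained
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

toy_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", KNeighborsClassifier()),  # placeholder, replaced by the grid
])

toy_search = GridSearchCV(
    toy_pipeline,
    # Each entry is a full estimator that takes the place of the "model" step
    param_grid={"model": [KNeighborsClassifier(1),
                          KNeighborsClassifier(10),
                          LogisticRegression(solver="lbfgs")]},
    cv=3,
)
toy_search.fit(X, y)

# One set of results per candidate model
print(len(toy_search.cv_results_["params"]))
# The winning step, whichever it is on this synthetic data
print(type(toy_search.best_estimator_.named_steps["model"]).__name__)
```

The placeholder model in the pipeline is never evaluated on its own: only the three estimators listed in the grid are cross-validated.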

Let's fit the different models:

In [63]:
mod.fit(ml_df,ml_df['churned']);

In [64]:
# look at the cross validation results and add the number of neighbors used.
pd.DataFrame(mod.cv_results_).T

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| mean_fit_time | 0.00785432 | 0.00617559 | 0.0076494 | 0.00654972 | 0.0267492 | 0.0154052 |
| std_fit_time | 0.0011228 | 0.000319191 | 0.000846511 | 0.000612706 | 0.00277864 | 0.00545565 |
| mean_score_time | 0.0110934 | 0.00921001 | 0.0115895 | 0.0104084 | 0.00566826 | 0.00330086 |
| std_score_time | 0.00215004 | 0.000501717 | 0.00153521 | 0.000850591 | 0.000125533 | 0.00119705 |
| param_model | `KNeighborsClassifier(algorithm='auto', leaf_si...` | `KNeighborsClassifier(algorithm='auto', leaf_si...` | `KNeighborsClassifier(algorithm='auto', leaf_si...` | `KNeighborsClassifier(algorithm='auto', leaf_si...` | `LogisticRegression(C=1.0, class_weight=None, d...` | `LogisticRegression(C=1.0, class_weight=None, d...` |
| param_poly | `PolynomialFeatures(degree=2, include_bias=True...` | | `PolynomialFeatures(degree=2, include_bias=True...` | | `PolynomialFeatures(degree=2, include_bias=True...` | |
| params | `{'model': KNeighborsClassifier(algorithm='auto...` | `{'model': KNeighborsClassifier(algorithm='auto...` | `{'model': KNeighborsClassifier(algorithm='auto...` | `{'model': KNeighborsClassifier(algorithm='auto...` | `{'model': LogisticRegression(C=1.0, class_weig...` | `{'model': LogisticRegression(C=1.0, class_weig...` |
| split0_test_score | 0.746094 | 0.742188 | 0.753906 | 0.75 | 0.746094 | 0.746094 |
| split1_test_score | 0.726562 | 0.722656 | 0.746094 | 0.75 | 0.746094 | 0.746094 |
| split2_test_score | 0.761719 | 0.753906 | 0.753906 | 0.757812 | 0.75 | 0.746094 |


In [65]:
pd.DataFrame(mod.cv_results_)[['param_model', 'mean_train_score', 'mean_test_score','std_test_score']]

| | param_model | mean_train_score | mean_test_score | std_test_score |
|---|---|---|---|---|
| 0 | `KNeighborsClassifier(algorithm='auto', leaf_si...` | 0.836308 | 0.691201 | 0.082532 |
| 1 | `KNeighborsClassifier(algorithm='auto', leaf_si...` | 0.7961 | 0.627129 | 0.16945 |
| 2 | `KNeighborsClassifier(algorithm='auto', leaf_si...` | 0.774782 | 0.729125 | 0.057752 |
| 3 | `KNeighborsClassifier(algorithm='auto', leaf_si...` | 0.773957 | 0.725596 | 0.068688 |
| 4 | `LogisticRegression(C=1.0, class_weight=None, d...` | 0.759845 | 0.714222 | 0.103272 |
| 5 | `LogisticRegression(C=1.0, class_weight=None, d...` | 0.758977 | 0.723206 | 0.106598 |


As we can see, `Logistic Regression` reaches a decent score on the test folds (a mean test score of about 0.71 to 0.72). It is not better than the best KNN configuration (about 0.73), but in my opinion it is quite intuitive to explain to non-technical stakeholders and might therefore have higher explainability power.

## Final words

That was it for `Logistic Regression`. I hope you enjoyed it. Drop me a mail if you have any comments, questions or (constructive) criticism.