## Logistic Regression

Building on the Blog's Machine Learning series, we now introduce ways to implement **logistic regression** in Python.  If you are not already familiar with this concept (or if you need a refresher), we *highly* recommend reading Derivative's more technical explanation of logistic regression and perusing our [recommended materials](#Additional-Learning-Resources) to learn more.

The goal of this post is to demonstrate two commonly used modules for building regression models, **scikit-learn** and **statsmodels**.  These modules each provide *a lot* of support for various statistical/machine learning models... so, you may be asking: *"how are they different?"*  Good question, reader!  The answer lies in the difference between *Statistics* 🧮️ and *Machine Learning* 🤖.
  - **statsmodels** builds models from the perspective of **Statistics**.  This produces more *explanatory* models that can answer questions like "What is the relationship between X and Y?".
  - **scikit-learn** builds models from the perspective of **Machine Learning**.  The goal of these models is to be as *predictive* as possible and answer the question "Given X, what should we predict for Y?"

As we will see, both modules are very easy to use, but there are some key differences that you should be aware of.  We hope that after reading this post you will be able to choose the module that best suits your problem.

### Table of Contents

This post is split into the following sections:

   1. [Data Setup](#1.-Data-Setup)
   2. [Using `statsmodels` to build a logistic regression 🧮️](#Using-statsmodels-to-build-a-logistic-regression)
   3. [Using `scikit-learn` to build a logistic regression 🤖](#Using-scikit-learn-to-build-a-logistic-regression)
   4. [Conclusion](#Conclusion)
   5. [Additional Learning Resources](#Additional-Learning-Resources)

### 1. Data Setup

Stealing the awesome work of the "Data Preprocessing" post, we will be leveraging the *Titanic* 🛥️ dataset in this lesson.  Our **target variable** will be `Survived` (the indicator of whether a passenger survived), and we will test out the various other fields as features/explanatory variables in the model.

As a reminder, here is what that data looks like:

In [None]:
import numpy as np
import pandas as pd

titanic = pd.read_csv("resources/titanic-output.csv")
display(titanic.head())
print(f"Columns: {titanic.columns.to_list()}")

In [None]:
target_var = 'Survived'
exclude_vars = ['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin',
                'Sex', 'Embarked']

X = titanic[[col for col in titanic.columns if col not in exclude_vars]].copy()
y = titanic[target_var]

In [None]:
# Fill missing ages with the mean age (pandas skips NaN when computing the mean)
mean_age = X['Age'].mean()
X['Age'] = X['Age'].fillna(mean_age)

Before we get started with modeling, we will do one more data pre-processing step: **splitting data into training/test/validation sets**.  These sets are defined as follows:
  - **Training**: The sample of data used to fit the model.
  - **Validation**: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.
  - **Test**: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.

There is a handy function to do this in `scikit-learn` that we will use; however, it only splits the data into 2 sets, so we just apply this twice to get the sets we want.  In this case, we will have a training set of 60% of the data, validation set of 20% of the data, and test set of 20% of the data.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=8675309)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
                                                  test_size=0.25, # 0.25 x 0.8 = 0.2
                                                  random_state=8675309) 
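
As a quick sanity check on the 60/20/20 arithmetic, here is a minimal sketch that applies the same two-step split to synthetic stand-in data. The arrays `demo_X` and `demo_y` and the seed are made up for illustration and are not the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features (not the Titanic data).
demo_X = np.arange(300).reshape(100, 3)
demo_y = np.arange(100) % 2

# First split: hold out 20% of all rows as the test set.
X_rest, X_te, y_rest, y_te = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=0)

# Second split: 25% of the remaining 80% becomes the validation set.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

for name, part in [("train", X_tr), ("validation", X_va), ("test", X_te)]:
    print(f"{name}: {len(part)} rows ({len(part) / len(demo_X):.0%})")
```

Because the second split is taken from the rows left over after the first, the three sets never share a row.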

### Using `statsmodels` to build a logistic regression

First, we take a statistical approach to building a model. 🧮️

In [None]:
import statsmodels.api as sm 

In [None]:
# Note: statsmodels does not include an intercept by default (unlike sklearn,
# as we will see), so we add a constant column to the features ourselves.
X_sm_constant = sm.add_constant(np.asarray(X_train))
logit = sm.Logit(np.asarray(y_train), X_sm_constant)
result = logit.fit()


As mentioned, `statsmodels` is best for explaining the relationship between X and Y.  In logistic regression, you can best interpret coefficients using an *odds ratio*.  

As an illustration, take a model whose outcome is whether a student is in an honors class, with $p$ the probability of being in that class:

  - The intercept is $-1.12546$, which is the log odds of $p$: $\log\left(\frac{p}{1-p}\right) = -1.12546$.
  - Exponentiating the intercept turns the log odds into the odds: $O = e^{-1.12546} \approx 0.3245$.
  - The odds convert back to a probability via $p = \frac{O}{1+O} \approx 0.245$.
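
The conversion from log odds to odds to probability can be sketched in a few lines of Python. The helper name `log_odds_to_probability` is hypothetical; the intercept value is the one quoted above.

```python
import math


def log_odds_to_probability(log_odds: float) -> float:
    """Convert log odds to a probability by way of the odds."""
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)


intercept = -1.12546  # intercept quoted in the text above
odds = math.exp(intercept)
probability = log_odds_to_probability(intercept)

print(f"odds = {odds:.4f}, probability = {probability:.3f}")
```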


When the honors-class model includes a binary predictor `female` (with male as the reference group, `female = 0`):

  - The intercept is $-1.47085$, which is the log odds of a male student being in an honors class.
  - The coefficient for `female` is $0.59278$, which is the log of the odds ratio between the female group and the male group. The odds ratio is $e^{0.59278} \approx 1.81$, which means the odds for females are about 81% higher than the odds for males.
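
To make the odds-ratio reading concrete, the sketch below recomputes it from the two numbers quoted above. It assumes the model contains only the intercept and the `female` indicator; the variable names are illustrative.

```python
import math

intercept = -1.47085          # log odds for the reference group (male)
female_coefficient = 0.59278  # log of the female-to-male odds ratio

odds_male = math.exp(intercept)
odds_female = math.exp(intercept + female_coefficient)

# Exponentiating the coefficient gives the odds ratio directly.
odds_ratio = math.exp(female_coefficient)

print(f"odds (male)   = {odds_male:.4f}")
print(f"odds (female) = {odds_female:.4f}")
print(f"odds ratio    = {odds_ratio:.2f}")
```

Dividing `odds_female` by `odds_male` gives the same value as `odds_ratio`, which is why a single exponentiated coefficient is enough to compare the two groups.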


Odds Ratio: 1.131

Interpretation: As X increases by 1 (holding the other features fixed), the odds that the passenger survives are multiplied by 1.13.

In [None]:
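X_val_constant = sm.add_constant(np.asarray(X_val), has_constant='add')  # same intercept column as in training
predictions = result.predict(X_val_constant)  # predicted survival probabilities on the validation set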
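
Note that `predict` on a fitted `Logit` model returns probabilities, not class labels. A minimal sketch of turning probabilities into labels and scoring them follows; the probabilities, the true labels, and the 0.5 threshold are made up for illustration.

```python
import numpy as np

# Made-up predicted probabilities and true outcomes, for illustration only.
predicted_probs = np.array([0.91, 0.12, 0.55, 0.40, 0.73])
actual = np.array([1, 0, 0, 0, 1])

# Label an observation as positive when its probability reaches the threshold.
threshold = 0.5
predicted_labels = (predicted_probs >= threshold).astype(int)

# Accuracy is the share of labels that match the true outcomes.
accuracy = (predicted_labels == actual).mean()
print(f"labels = {predicted_labels.tolist()}, accuracy = {accuracy:.2f}")
```

The threshold is a modeling choice; 0.5 is a common default, but a different cutoff may suit problems where false positives and false negatives have different costs.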

### Using `scikit-learn` to build a logistic regression

Now we'll take a Machine Learning approach to building a model and compare the results. 🤖
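
A minimal sketch of what this could look like with `LogisticRegression`, shown on synthetic data rather than the Titanic set. The data, the seed, and the variable names are assumptions for illustration. Unlike `statsmodels`, `scikit-learn` fits an intercept by default and applies L2 regularization by default; the very large `C` used here is one way to make that penalty negligible so the coefficients are closer to an unpenalized fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with known log-odds coefficients (not the Titanic data).
rng = np.random.default_rng(0)
n = 500
features = rng.normal(size=(n, 2))
true_log_odds = -0.5 + 1.0 * features[:, 0] - 0.7 * features[:, 1]
labels = rng.binomial(1, 1 / (1 + np.exp(-true_log_odds)))

F_train, F_val, l_train, l_val = train_test_split(
    features, labels, test_size=0.25, random_state=0)

# fit_intercept=True is the default; a large C weakens the default L2 penalty.
clf = LogisticRegression(C=1e9, max_iter=1000)
clf.fit(F_train, l_train)

val_probs = clf.predict_proba(F_val)[:, 1]  # probability of the positive class
val_accuracy = clf.score(F_val, l_val)      # mean accuracy on held-out rows

print(f"intercept = {clf.intercept_[0]:.3f}, coefficients = {clf.coef_[0].round(3)}")
print(f"validation accuracy = {val_accuracy:.3f}")
```

Here `predict_proba` plays the role of `predict` in `statsmodels`, while `clf.predict` returns class labels directly.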

### Conclusion

As always in Python, there's "more than one way to skin a cat".  Depending on your problem, you may opt for either `statsmodels` or `scikit-learn` to build your model.  If you are more interested in generating the best **prediction**, you should probably opt for `scikit-learn`.  On the other hand, if you're more interested in explaining the relationship between X and Y, you should opt for `statsmodels`.  Also, there are key differences in these modules in terms of default parameters (e.g. intercepts and regularization) and other API features that you should be aware of.

### Additional Learning Resources

 - scikit-learn documentation
 - statsmodels documentation
 - Statistical learning book
 - blog