# Logistic Regression with scikit-learn

*Adapted from https://github.com/justmarkham*

### Libraries

- [scikit-learn](http://scikit-learn.org/stable/)
- pandas
- matplotlib

In this tutorial we will see some basic examples of Logistic Regression for classification.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
import seaborn as sns
%matplotlib inline

## Classification with Logistic Regression

|*|continuous|categorical|
|---|---|---|
|**supervised**|regression|**classification**|
|**unsupervised**|dim. reduction|clustering|

# Predicting Titanic survival with Logistic Regression

Let's use the data obtained by the _Encyclopedia Titanica_ to predict if a passenger survived the Titanic disaster.

<img src="img/titanic.jpg" width="600">

Let's import the dataset _titanic.csv_ (_hint:_ use the pandas `read_csv` function):

In [None]:
titanic = ...
titanic.head(5)
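As a minimal sketch of how `read_csv` behaves, here it is applied to a tiny in-memory CSV. The content and the name `toy` are made up for illustration and are not the Titanic data; with a real file you would pass its path instead of the `StringIO` object.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "name,sex,age,survived\nAnn,female,29,1\nBob,male,41,0\n"

# read_csv accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

# The header row becomes the column names, each following row a sample
print(toy.head())
```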

What are the **features**?
- name: Name of the passenger
- sex: Male or Female
- age: Age in years
- sibsp: # of siblings / spouses aboard the Titanic
- parch: # of parents / children aboard the Titanic
- ticket: Ticket number
- fare: Ticket price
- cabin: Cabin number
- embarked: Port of Embarkation

What is the **response**?
- survived: whether the passenger survived the disaster or not

Print the number of passengers who survived and the number who died, as well as the percentage of survivors. Is the dataset balanced?

In [None]:
dead = ...
survived = ...

# print the required information
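One possible way to inspect class balance, sketched on a made-up 0/1 response (the values and the names `toy_y`, `n_dead`, `n_survived` are assumptions, not the real data): `value_counts` gives the number of samples per class, and the mean of a 0/1 column is the fraction of positives.

```python
import pandas as pd

# Hypothetical binary response: 1 = survived, 0 = died
toy_y = pd.Series([0, 0, 0, 1, 1, 0, 1, 0], name="survived")

# Number of samples per class, looked up by class label
counts = toy_y.value_counts()
n_dead = counts.loc[0]
n_survived = counts.loc[1]

# The mean of a 0/1 column is the fraction of positive samples
pct_survived = 100 * toy_y.mean()

print(f"Died: {n_dead}, survived: {n_survived} ({pct_survived:.1f}% survived)")
```

A dataset is usually called balanced when the classes have roughly the same number of samples.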

Specify the columns to use as features:

In [None]:
titanic_features = ['sex', 'age', 'sibsp', 'parch', 'fare']

For the sake of this exercise, we can assume the other features (name, ticket, cabin number, port of embarkation) are not predictive.

### Let's prepare the feature vector for the training

The selected features contain one categorical variable: sex (male|female).

We need to convert it to a numerical variable. Use the pandas method `get_dummies` to take care of this. Check https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

In [None]:
X = ...
X.head()
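To see what `get_dummies` does, here is a small sketch on a made-up frame (the data and the names `toy` and `toy_X` are hypothetical). Each category of the encoded column becomes its own indicator column, while numerical columns are left untouched.

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numerical column
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "age": [22.0, 38.0, 26.0],
})

# One indicator column is created per category of "sex"
toy_X = pd.get_dummies(toy, columns=["sex"])

print(toy_X.columns.tolist())
print(toy_X)
```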

The categorical feature _sex_ is converted into two boolean features.

The Titanic sank in 1912, a long time ago, so some data may be missing. Let's check if there are undefined values. Use pandas' `isna` for this purpose: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html

In [None]:
# add your code here

Let's try to fix the data with a basic imputation method: replacing the missing values with the mean. Use pandas' `fillna` for this purpose: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

More info: https://en.wikipedia.org/wiki/Imputation_(statistics)

In [None]:
X = ...

# check if X has any missing values
...
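Detecting missing values and imputing them with the mean can be sketched as follows on a made-up frame (the data and the names `toy_X`, `missing_per_column` and `toy_X_filled` are assumptions). Passing a Series of column means to `fillna` fills each column with its own mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features with one missing value per column
toy_X = pd.DataFrame({
    "age": [22.0, np.nan, 26.0],
    "fare": [7.25, 71.28, np.nan],
})

# isna() marks missing cells; summing counts them per column
missing_per_column = toy_X.isna().sum()
print(missing_per_column)

# mean() skips NaN by default; fillna aligns the means by column name
toy_X_filled = toy_X.fillna(toy_X.mean())

# True if any value is still missing
print(toy_X_filled.isna().any().any())
```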

Create the label vector `y`:

In [None]:
y = ...

Let's create a Logistic Regression model...

In [None]:
logistic = ...

... and evaluate its precision and recall with 10-fold cross-validation. For this, use the `cross_val_score` implementation provided by `sklearn` and already imported above. _Hint:_ check the `scoring` argument of this function: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html

In [None]:
precision = ...
recall = ...

# Precision: avoid false positives
print("Precision: %0.2f (+/- %0.2f)" % (precision.mean(), precision.std() * 2))
# Recall: avoid false negatives
print("Recall: %0.2f (+/- %0.2f)" % (recall.mean(), recall.std() * 2))
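The sketch below shows how the `scoring` argument selects the metric, using a synthetic dataset from `make_classification` rather than the Titanic data. The names `X_demo`, `y_demo` and `model`, and the choice of `max_iter=1000`, are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem (not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression(max_iter=1000)

# cv=10 gives 10 folds; scoring picks the metric computed on each fold
precision_demo = cross_val_score(model, X_demo, y_demo, cv=10, scoring="precision")
recall_demo = cross_val_score(model, X_demo, y_demo, cv=10, scoring="recall")

# One score per fold
print(precision_demo.shape, recall_demo.shape)
```

Each call returns an array with one score per fold, which is why the mean and standard deviation are reported together.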

### Explore the model output

Let's create a new Logistic Regression model and train it on the full dataset:

In [None]:
logistic = ...

# Train the model
...

Since we trained on the whole dataset, we have no new samples to predict. We can still predict the outcome, and its associated probability, for some artificial samples. Would you have survived?

Remember the features:

In [None]:
X.columns

Would a 25-year-old man with no relatives on board and a fare of 100 survive? _Hint:_ use the model's `predict` method to make the prediction.

In [None]:
test = ...

# Check if he would have survived
...

What is the probability distribution behind this prediction? _Hint:_ use the model's `predict_proba` method to get the class probabilities.

In [None]:
# Probability distribution
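The relation between `predict` and `predict_proba` can be sketched on a made-up training set (the features `age` and `is_male`, the labels, and the names `clf` and `sample` are all hypothetical). `predict_proba` returns one row per sample with one probability per class, in the order given by `classes_`, and `predict` returns the class with the highest probability.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data (not the Titanic dataset)
toy_X = pd.DataFrame({
    "age": [20, 30, 40, 50, 25, 35, 45, 55],
    "is_male": [0, 0, 0, 0, 1, 1, 1, 1],
})
toy_y = [1, 1, 1, 1, 0, 0, 0, 0]

clf = LogisticRegression(max_iter=1000).fit(toy_X, toy_y)

# A new sample must have the same columns, in the same order, as the training data
sample = pd.DataFrame({"age": [33], "is_male": [1]})

pred = clf.predict(sample)         # predicted class label
proba = clf.predict_proba(sample)  # one probability per class, rows sum to 1

print(clf.classes_, pred, proba)
```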

What about a woman, 35 years old, alone onboard and with the same fare?

In [None]:
test = ...

# Check if she would have survived
...

# Probability distribution
...