# Regularization

Let's improve our understanding of what impacted **Titanic** passengers' chances of survival
- We will use logistic classifiers, which are easy to interpret
- Remember, we already did this with `statsmodels` in the lecture "Decision Science - Logistic Regression"
- There, we used `p-values` & statistical assumptions to detect which features were irrelevant or did not generalize
- This time, we will use `regularization` to detect relevant/irrelevant features based on under/overfitting criteria
- **Our goal is to compare `L1` and `L2` penalties**

## 0. We load and preprocess the data for you

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_titanic_dataset_encoded.csv")

# the dataset is already one-hot-encoded
data.head()

In [3]:
# We build X and y

y = data["survived"]
X = data.drop(columns="survived")
X.head()

In [4]:
# We MinMaxScale our features for you
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().fit(X)
X_scaled = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
X.shape

## 1.  Logistic Regression without regularization

❓ Rank the features in decreasing order of importance according to a simple **non-regularized** Logistic Regression
- Careful, `LogisticRegression` is penalized by default
- Increase `max_iter` until the model converges
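
As a hint, here is a minimal sketch of the ranking technique on synthetic data, not on the Titanic dataset. The feature names `f0`, `f1` and `noise` and the label-generating model are invented for illustration. It neutralizes the default penalty with a very large `C`; recent scikit-learn versions also accept `penalty=None`, which you may prefer if your version supports it.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (NOT the Titanic dataset): three features in [0, 1]
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.random((2000, 3)), columns=["f0", "f1", "noise"])

# Labels drawn from a known logistic model: f0 matters most, "noise" not at all
logits = 5 * (X_demo["f0"] - 0.5) - 2 * (X_demo["f1"] - 0.5)
y_demo = (rng.random(2000) < 1 / (1 + np.exp(-logits))).astype(int)

# C is the inverse of the regularization strength, so a huge C
# makes the default L2 penalty negligible
model = LogisticRegression(C=1e10, max_iter=1000).fit(X_demo, y_demo)

# Rank features by absolute coefficient value, largest first
ranking = pd.Series(model.coef_[0], index=X_demo.columns).sort_values(
    key=lambda s: s.abs(), ascending=False
)
print(ranking)
```

Comparing absolute coefficients only makes sense because all features share the same scale, which is why the features were MinMax-scaled beforehand.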

In [5]:
# YOUR CODE HERE

❓ How do you interpret, in plain English, the value of the coefficient for `sex_female`?

<details>
    <summary>Answer</summary>

> "All other things being equal (such as age, ticket class, etc.),
being a woman increases your log-odds of survival by 2.67 (your coef value)"

> "Controlling for all other explanatory factors available in this dataset,
being a woman multiplies your odds of survival by exp(2.67) ≈ 14"

</details>
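
To make the log-odds arithmetic concrete, here is a small sketch using the 2.67 value quoted in the answer (your own coefficient may differ). The 20% baseline survival probability is a made-up number chosen only to show how an odds ratio translates into a probability.

```python
import math

coef_sex_female = 2.67  # log-odds coefficient quoted in the answer; yours may differ

# Exponentiating a log-odds coefficient gives the multiplicative effect on the odds
odds_ratio = math.exp(coef_sex_female)

# Hypothetical baseline: a passenger with a 20% survival probability
baseline_prob = 0.20
baseline_odds = baseline_prob / (1 - baseline_prob)

# Apply the odds ratio, then convert the new odds back to a probability
new_odds = baseline_odds * odds_ratio
new_prob = new_odds / (1 + new_odds)

print(f"odds ratio: {odds_ratio:.2f}")
print(f"probability: {baseline_prob:.2f} -> {new_prob:.2f}")
```

Note that the effect is multiplicative on the odds, not additive on the probability: the same coefficient moves a 20% baseline much more than a 95% baseline.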


❓ What is the feature that most impacts the chances of survival according to your model?  
Fill the `top_1_feature` list below with the name of this feature

In [6]:
top_1_feature = [""]

In [8]:
from nbresult import ChallengeResult
result = ChallengeResult('unregularized', top_1_feature = top_1_feature)
result.write()
print(result.check())

## 2.  Logistic Regression with a L2 penalty

Let's use a **Logistic model** whose log-loss has been penalized with an **L2** term to figure out the **most important features** without overfitting.  
This is the "classification" equivalent of the "Ridge" regressor.

❓ Instantiate a **strongly regularized** `LogisticRegression` and rank its feature importance
- By "strongly regularized" we mean "regularized more than with sklearn's default regularization factor"
- sklearn's default values are useful orders of magnitude to keep in mind for scaled features
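
As a hint, here is a minimal sketch on synthetic data (not the Titanic dataset) of what "stronger than the default" means in practice. In `LogisticRegression`, `C` is the inverse of the regularization strength and defaults to `1.0`, so a smaller `C` means a stronger penalty. The value `C=0.1` is only an illustrative choice, not a recommended setting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: four features in [0, 1], only the first two matter
rng = np.random.default_rng(0)
X_demo = rng.random((1000, 4))
logits = 5 * (X_demo[:, 0] - 0.5) - 2 * (X_demo[:, 1] - 0.5)
y_demo = (rng.random(1000) < 1 / (1 + np.exp(-logits))).astype(int)

# Default L2 penalty (C=1.0) versus a ten times stronger one (C=0.1)
default_l2 = LogisticRegression(C=1.0, max_iter=1000).fit(X_demo, y_demo)
strong_l2 = LogisticRegression(C=0.1, max_iter=1000).fit(X_demo, y_demo)

# A stronger L2 penalty shrinks the coefficient vector toward zero
norm_default = np.linalg.norm(default_l2.coef_)
norm_strong = np.linalg.norm(strong_l2.coef_)
print(f"coefficient norm, C=1.0: {norm_default:.3f}")
print(f"coefficient norm, C=0.1: {norm_strong:.3f}")
```

L2 shrinks all coefficients but rarely sets any of them to exactly zero, so it is better suited to ranking features than to discarding them.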

In [9]:
# YOUR CODE HERE

❓ What are the top 2 features driving chances of survival according to your model?  
Fill the `top_2_features` list below with the names of these features

In [10]:
top_2_features = ["", ""]

#### 🧪 Test your code below

In [12]:
from nbresult import ChallengeResult
result = ChallengeResult('ridge', top_2 = top_2_features)
result.write()
print(result.check())

## 3. Logistic Regression with an L1 penalty

This time, we'll use a logistic model whose log-loss has been penalized with an **L1** term to **filter out the less important features**.  
This is the "classification" equivalent of the **Lasso** regressor.

❓ Instantiate a **strongly regularized** `LogisticRegression` and rank its feature importance
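
As a hint, here is a minimal sketch on synthetic data (not the Titanic dataset) of how an L1-penalized model can be set up and inspected. The default `lbfgs` solver does not support the L1 penalty, so the sketch assumes the `liblinear` solver (`saga` is another option). The value `C=0.1` and the data-generating model are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: six features in [0, 1], only the first one matters
rng = np.random.default_rng(0)
X_demo = rng.random((1000, 6))
logits = 5 * (X_demo[:, 0] - 0.5)
y_demo = (rng.random(1000) < 1 / (1 + np.exp(-logits))).astype(int)

# Strong L1 penalty; liblinear is one of the solvers that supports it
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
l1_model.fit(X_demo, y_demo)

# Same strength with the default L2 penalty, for comparison
l2_model = LogisticRegression(C=0.1, max_iter=1000).fit(X_demo, y_demo)

# L1 can set coefficients to exactly zero; list the features it discarded
coef_l1 = l1_model.coef_[0]
zero_idx = np.flatnonzero(coef_l1 == 0)
print("L1 coefficients:", np.round(coef_l1, 3))
print("L2 coefficients:", np.round(l2_model.coef_[0], 3))
print("indices of features zeroed out by L1:", zero_idx)
```

Testing for exact equality with zero is appropriate here because the L1 penalty produces exact zeros, whereas L2 coefficients are merely small.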

In [13]:
# YOUR CODE HERE

❓ What are the features that have absolutely no impact on chances of survival, according to your L1 model?  
Fill the `zero_impact_features` list below with the names of these features; you may have to add elements to the list.

- Do you notice how some of them were "highly important" according to the non-regularized model?
- From now on, we will always regularize our linear models!

In [14]:
zero_impact_features = ["", "", "", ""]

#### 🧪 Test your code below

In [16]:
from nbresult import ChallengeResult
result = ChallengeResult('lasso', zero_impact_features = zero_impact_features)
result.write()
print(result.check())

**🏁 Congratulations! Don't forget to commit and push your notebook**