ENH Convert some of the Wrap-up M4 content into exercise (INRIA#731)
ArturoAmorQ committed Oct 27, 2023
1 parent 767499b commit 008cff4
Showing 6 changed files with 759 additions and 481 deletions.
7 changes: 4 additions & 3 deletions jupyter-book/_toc.yml
@@ -102,14 +102,15 @@ parts:
- file: python_scripts/linear_models_ex_02
- file: python_scripts/linear_models_sol_02
- file: python_scripts/linear_models_feature_engineering_classification.py
- file: python_scripts/logistic_regression_non_linear
- file: python_scripts/linear_models_ex_03
- file: python_scripts/linear_models_sol_03
- file: linear_models/linear_models_quiz_m4_02
- file: linear_models/linear_models_regularization_index
sections:
- file: linear_models/regularized_linear_models_slides
- file: python_scripts/linear_models_regularization
- file: python_scripts/linear_models_ex_03
- file: python_scripts/linear_models_sol_03
- file: python_scripts/linear_models_ex_04
- file: python_scripts/linear_models_sol_04
- file: linear_models/linear_models_quiz_m4_03
- file: linear_models/linear_models_wrap_up_quiz
- file: linear_models/linear_models_module_take_away
127 changes: 88 additions & 39 deletions python_scripts/linear_models_ex_03.py
@@ -14,69 +14,118 @@
# %% [markdown]
# # 📝 Exercise M4.03
#
# The parameter `penalty` can control the **type** of regularization to use,
# whereas the regularization **strength** is set using the parameter `C`.
# Setting `penalty="none"` is equivalent to an infinitely large value of `C`. In
# this exercise, we ask you to train a logistic regression classifier using the
# `penalty="l2"` regularization (which happens to be the default in
# scikit-learn) to find by yourself the effect of the parameter `C`.
#
# We start by loading the dataset.
# Now, we tackle a more realistic classification problem instead of making a
# synthetic dataset. We start by loading the Adult Census dataset with the
# following snippet. For the moment we retain only the **numerical features**.

# %%
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")
target = adult_census["class"]
data = adult_census.select_dtypes(["integer", "floating"])
data = data.drop(columns=["education-num"])
data

# %% [markdown]
# ```{note}
# If you want a deeper overview regarding this dataset, you can refer to the
# Appendix - Datasets description section at the end of this MOOC.
# ```
# We confirm that all the selected features are numerical.
#
# Compute the generalization performance in terms of accuracy of a linear model
# composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold
# cross-validation with `return_estimator=True` to be able to inspect the
# trained estimators.

# %%
import pandas as pd
# Write your code here.
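
# %% [markdown]
# The cell below is one possible sketch, not the official solution notebook: it
# assumes the `data` and `target` variables defined above and evaluates a scaled
# logistic regression with a 10-fold cross-validation. The names `model_num` and
# `cv_results_num` are illustrative.

# %%
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Possible sketch: scaled logistic regression evaluated with 10-fold CV.
model_num = make_pipeline(StandardScaler(), LogisticRegression())
cv_results_num = cross_validate(
    model_num, data, target, cv=10, return_estimator=True
)
scores_num = cv_results_num["test_score"]
print(f"Accuracy: {scores_num.mean():.3f} ± {scores_num.std():.3f}")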

penguins = pd.read_csv("../datasets/penguins_classification.csv")
# only keep the Adelie and Chinstrap classes
penguins = (
penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
)
# %% [markdown]
# What is the most important feature seen by the logistic regression?
#
# You can use a boxplot to compare the absolute values of the coefficients while
# also visualizing the variability induced by the cross-validation resampling.

# %%
# Write your code here.
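
# %% [markdown]
# A possible sketch, assuming `cv_results_num` from the sketch above holds the
# estimators fitted on each fold: collect the coefficients per fold and compare
# their absolute values with a boxplot.

# %%
import matplotlib.pyplot as plt

# One row of coefficients per cross-validation fold, one column per feature.
coefs_num = pd.DataFrame(
    [est[-1].coef_[0] for est in cv_results_num["estimator"]],
    columns=data.columns,
)
coefs_num.abs().plot.box(vert=False)
plt.xlabel("Absolute value of the coefficient")
_ = plt.title("Coefficient variability across folds")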

culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_column = "Species"
# %% [markdown]
# Let's now work with **both numerical and categorical features**. You can
# reload the Adult Census dataset with the following snippet:

# %%
from sklearn.model_selection import train_test_split
adult_census = pd.read_csv("../datasets/adult-census.csv")
target = adult_census["class"]
data = adult_census.drop(columns=["class", "education-num"])

# %% [markdown]
# Create a predictive model where:
# - The numerical data must be scaled.
# - The categorical data must be one-hot encoded, set `min_frequency=0.01` to
# group categories representing less than 1% of the total samples.
# - The predictor is a `LogisticRegression`. You may need to increase the
# number of iterations via `max_iter`, which is 100 by default.
#
# Use the same 10-fold cross-validation strategy with `return_estimator=True` as
# above to evaluate this complex pipeline.

penguins_train, penguins_test = train_test_split(penguins, random_state=0)
# %%
# Write your code here.
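
# %% [markdown]
# A possible sketch, reusing the imports from the first sketch. The transformer
# ordering (one-hot encoder first) matches the `feature_names` snippet given
# below; `max_iter=5000` is an illustrative value, not a recommendation.

# %%
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder

numerical_columns = make_column_selector(dtype_include="number")(data)
categorical_columns = make_column_selector(dtype_exclude="number")(data)

preprocessor = make_column_transformer(
    (
        OneHotEncoder(handle_unknown="ignore", min_frequency=0.01),
        categorical_columns,
    ),
    (StandardScaler(), numerical_columns),
)
model_all = make_pipeline(preprocessor, LogisticRegression(max_iter=5000))
cv_results_all = cross_validate(
    model_all, data, target, cv=10, return_estimator=True
)
scores_all = cv_results_all["test_score"]
print(f"Accuracy: {scores_all.mean():.3f} ± {scores_all.std():.3f}")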

data_train = penguins_train[culmen_columns]
data_test = penguins_test[culmen_columns]
# %% [markdown]
# By comparing the cross-validation test scores of both models fold-to-fold,
# count the number of times the model using both numerical and categorical
# features has a better test score than the model using only numerical features.

target_train = penguins_train[target_column]
target_test = penguins_test[target_column]
# %%
# Write your code here.
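
# %% [markdown]
# A possible sketch, assuming `scores_num` and `scores_all` come from the
# sketches above: both `cross_validate` calls used `cv=10` without shuffling, so
# the folds are identical and the test scores can be compared pairwise.

# %%
n_better = (scores_all > scores_num).sum()
print(f"The full model is better on {n_better} out of 10 folds")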

# %% [markdown]
# First, let's create our predictive model.
# For the following questions, you can copy and paste the following snippet to
# get the feature names from the column transformer here named `preprocessor`.
#
# ```python
# preprocessor.fit(data)
# feature_names = (
# preprocessor.named_transformers_["onehotencoder"].get_feature_names_out(
# categorical_columns
# )
# ).tolist()
# feature_names += numerical_columns
# feature_names
# ```

# %%
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Write your code here.

logistic_regression = make_pipeline(
StandardScaler(), LogisticRegression(penalty="l2")
)
# %% [markdown]
# Notice that there are as many feature names as coefficients in the last step
# of your predictive pipeline.

# %% [markdown]
# Given the following candidates for the `C` parameter, find out the impact of
# `C` on the classifier decision boundary. You can use
# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the
# decision function boundary.
# Which of the following pairs of features most impacts the predictions of the
# logistic regression classifier, based on the absolute magnitude of its
# coefficients?

# %%
Cs = [0.01, 0.1, 1, 10]
# Write your code here.
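
# %% [markdown]
# A possible sketch, assuming `feature_names` was built with the snippet above
# and `cv_results_all` holds the fitted pipelines: rank the features by the
# median absolute value of their coefficient across folds.

# %%
coefs_all = pd.DataFrame(
    [est[-1].coef_[0] for est in cv_results_all["estimator"]],
    columns=feature_names,
)
coefs_all.abs().median().sort_values(ascending=False).head(10)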

# %% [markdown]
# Now create a similar pipeline consisting of the same preprocessor as above,
# followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.
# Set `degree=2` and `interaction_only=True` on the feature engineering step.
# Remember not to include a "bias" feature to avoid introducing a redundancy
# with the intercept of the subsequent logistic regression.

# %%
# Write your code here.
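
# %% [markdown]
# A possible sketch, reusing the `preprocessor` defined above; `max_iter=5000`
# is again an illustrative value.

# %%
from sklearn.preprocessing import PolynomialFeatures

model_interactions = make_pipeline(
    preprocessor,
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LogisticRegression(C=0.01, max_iter=5000),
)
cv_results_interactions = cross_validate(
    model_interactions, data, target, cv=10, return_estimator=True
)
scores_interactions = cv_results_interactions["test_score"]
print(
    f"Accuracy: {scores_interactions.mean():.3f} ± {scores_interactions.std():.3f}"
)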

# %% [markdown]
# Look at the impact of the `C` hyperparameter on the magnitude of the weights.
# By comparing the cross-validation test scores of both models fold-to-fold,
# count the number of times the model using multiplicative interactions and both
# numerical and categorical features has a better test score than the model
# without interactions.

# %%
# Write your code here.
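
# %% [markdown]
# A possible sketch, assuming the two score arrays above share the same folds.

# %%
n_better = (scores_interactions > scores_all).sum()
print(f"The model with interactions is better on {n_better} out of 10 folds")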

# %%
# Write your code here.
170 changes: 170 additions & 0 deletions python_scripts/linear_models_ex_04.py
@@ -0,0 +1,170 @@
# ---
# jupyter:
# jupytext:
# text_representation:
# extension: .py
# format_name: percent
# format_version: '1.3'
# jupytext_version: 1.15.2
# kernelspec:
# display_name: Python 3
# name: python3
# ---

# %% [markdown]
# # 📝 Exercise M4.04
#
# In the previous Module we tuned the hyperparameter `C` of the logistic
# regression without mentioning that it controls the regularization strength.
# Later, on the slides on 🎥 **Intuitions on regularized linear models** we
# mentioned that a small `C` provides a more regularized model, whereas a
# non-regularized model is obtained with an infinitely large value of `C`.
# Indeed, `C` behaves as the inverse of the `alpha` coefficient in the `Ridge`
# model.
#
# In this exercise, we ask you to train a logistic regression classifier using
# different values of the parameter `C` to find its effects by yourself.
#
# We start by loading the dataset. We only keep the Adelie and Chinstrap classes
# to keep the discussion simple.


# %% [markdown]
# ```{note}
# If you want a deeper overview regarding this dataset, you can refer to the
# Appendix - Datasets description section at the end of this MOOC.
# ```

# %%
import pandas as pd

penguins = pd.read_csv("../datasets/penguins_classification.csv")
penguins = (
penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
)

culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_column = "Species"

# %%
from sklearn.model_selection import train_test_split

penguins_train, penguins_test = train_test_split(
penguins, random_state=0, test_size=0.4
)

data_train = penguins_train[culmen_columns]
data_test = penguins_test[culmen_columns]

target_train = penguins_train[target_column]
target_test = penguins_test[target_column]

# %% [markdown]
# We define a function to help us fit a given `model` and plot its decision
# boundary. We recall that by using a `DecisionBoundaryDisplay` with a diverging
# colormap, `vmin=0` and `vmax=1`, we ensure that the 0.5 probability is mapped
# to the white color. In other words, the darker the color, the closer the
# predicted probability is to 0 or 1 and the more confident the classifier is in
# its predictions.

# %%
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.inspection import DecisionBoundaryDisplay


def plot_decision_boundary(model):
model.fit(data_train, target_train)
accuracy = model.score(data_test, target_test)
C = model.get_params()["logisticregression__C"]

disp = DecisionBoundaryDisplay.from_estimator(
model,
data_train,
response_method="predict_proba",
plot_method="pcolormesh",
cmap="RdBu_r",
alpha=0.8,
vmin=0.0,
vmax=1.0,
)
DecisionBoundaryDisplay.from_estimator(
model,
data_train,
response_method="predict_proba",
plot_method="contour",
linestyles="--",
linewidths=1,
alpha=0.8,
levels=[0.5],
ax=disp.ax_,
)
sns.scatterplot(
data=penguins_train,
x=culmen_columns[0],
y=culmen_columns[1],
hue=target_column,
palette=["tab:blue", "tab:red"],
ax=disp.ax_,
)
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}")


# %% [markdown]
# Let's now create our predictive model.

# %%
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())

# %% [markdown]
# ## Influence of the parameter `C` on the decision boundary
#
# Given the following candidates for the `C` parameter and the
# `plot_decision_boundary` function, find out the impact of `C` on the
# classifier's decision boundary.
#
# - How does the value of `C` impact the confidence of the predictions?
# - How does it impact the underfit/overfit trade-off?
# - How does it impact the position and orientation of the decision boundary?
#
# Try to give an interpretation of the reason for this behavior.

# %%
Cs = [1e-6, 0.01, 0.1, 1, 10, 100, 1e6]

# Write your code here.
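
# %% [markdown]
# A possible sketch: refit and plot the decision boundary for each candidate
# value of `C`, reusing the `plot_decision_boundary` helper defined above.

# %%
for C in Cs:
    logistic_regression.set_params(logisticregression__C=C)
    plot_decision_boundary(logistic_regression)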

# %% [markdown]
# ## Impact of the regularization on the weights
#
# Look at the impact of the `C` hyperparameter on the magnitude of the weights.
# **Hint**: You can [access pipeline
# steps](https://scikit-learn.org/stable/modules/compose.html#access-pipeline-steps)
# by name or position. Then you can query the attributes of that step such as
# `coef_`.

# %%
# Write your code here.
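
# %% [markdown]
# A possible sketch: refit the pipeline for each `C`, extract the coefficients
# of the `LogisticRegression` step (the last step of the pipeline), and compare
# their magnitudes. The name `weights_by_C` is illustrative.

# %%
weights_by_C = {}
for C in Cs:
    logistic_regression.set_params(logisticregression__C=C)
    logistic_regression.fit(data_train, target_train)
    # The last pipeline step is the fitted LogisticRegression.
    weights_by_C[C] = logistic_regression[-1].coef_[0]

weights_by_C = pd.DataFrame(weights_by_C, index=culmen_columns)
ax = weights_by_C.plot.barh()
_ = ax.set_xlabel("Coefficient value")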

# %% [markdown]
# ## Impact of the regularization with non-linear feature engineering
#
# Use the `plot_decision_boundary` function to repeat the experiment using a
# non-linear feature engineering pipeline. To do so, insert
# `Nystroem(kernel="rbf", gamma=1, n_components=100)` between the
# `StandardScaler` and the `LogisticRegression` steps.
#
# - Does the value of `C` still impact the position of the decision boundary and
# the confidence of the model?
# - What can you say about the impact of `C` on the underfitting vs overfitting
# trade-off?

# %%
from sklearn.kernel_approximation import Nystroem

# Write your code here.
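
# %% [markdown]
# A possible sketch: the same loop as before, with a `Nystroem` approximation of
# an RBF kernel between the scaler and the classifier; `max_iter=1000` is an
# illustrative value to avoid convergence warnings.

# %%
for C in Cs:
    nystroem_regression = make_pipeline(
        StandardScaler(),
        Nystroem(kernel="rbf", gamma=1.0, n_components=100),
        LogisticRegression(C=C, max_iter=1000),
    )
    plot_decision_boundary(nystroem_regression)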
