ENH Convert some of the Wrap-up M4 content into exercise (INRIA#731)
ArturoAmorQ committed Oct 27, 2023
1 parent 767499b commit 008cff4
Showing 6 changed files with 759 additions and 481 deletions.
7 changes: 4 additions & 3 deletions jupyter-book/_toc.yml
@@ -102,14 +102,15 @@ parts:
- file: python_scripts/linear_models_ex_02
- file: python_scripts/linear_models_sol_02
- file: python_scripts/linear_models_feature_engineering_classification.py
- file: python_scripts/logistic_regression_non_linear
- file: python_scripts/linear_models_ex_03
- file: python_scripts/linear_models_sol_03
- file: linear_models/linear_models_quiz_m4_02
- file: linear_models/linear_models_regularization_index
sections:
- file: linear_models/regularized_linear_models_slides
- file: python_scripts/linear_models_regularization
- file: python_scripts/linear_models_ex_03
- file: python_scripts/linear_models_sol_03
- file: python_scripts/linear_models_ex_04
- file: python_scripts/linear_models_sol_04
- file: linear_models/linear_models_quiz_m4_03
- file: linear_models/linear_models_wrap_up_quiz
- file: linear_models/linear_models_module_take_away
127 changes: 88 additions & 39 deletions python_scripts/linear_models_ex_03.py
@@ -14,69 +14,118 @@
# %% [markdown]
# # 📝 Exercise M4.03
#
# The parameter `penalty` can control the **type** of regularization to use,
# whereas the regularization **strength** is set using the parameter `C`.
# Setting `penalty="none"` is equivalent to an infinitely large value of `C`. In
# this exercise, we ask you to train a logistic regression classifier using the
# `penalty="l2"` regularization (which happens to be the default in
# scikit-learn) to find by yourself the effect of the parameter `C`.
#
# We start by loading the dataset.
# Now, we tackle a more realistic classification problem instead of making a
# synthetic dataset. We start by loading the Adult Census dataset with the
# following snippet. For the moment we retain only the **numerical features**.

# %%
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")
target = adult_census["class"]
data = adult_census.select_dtypes(["integer", "floating"])
data = data.drop(columns=["education-num"])
data

# %% [markdown]
# ```{note}
# If you want a deeper overview regarding this dataset, you can refer to the
# Appendix - Datasets description section at the end of this MOOC.
# ```
# We confirm that all the selected features are numerical.
#
# Compute the generalization performance in terms of accuracy of a linear model
# composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold
# cross-validation with `return_estimator=True` to be able to inspect the
# trained estimators.

# %%
import pandas as pd
# Write your code here.
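
# %% [markdown]
# The cell below is one possible sketch, not the official solution notebook: it
# assumes the `data` and `target` variables defined above and evaluates a scaled
# logistic regression with a 10-fold cross-validation. The names `model_num` and
# `cv_results_num` are illustrative.

# %%
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Possible sketch: scaled logistic regression evaluated with 10-fold CV.
model_num = make_pipeline(StandardScaler(), LogisticRegression())
cv_results_num = cross_validate(
    model_num, data, target, cv=10, return_estimator=True
)
scores_num = cv_results_num["test_score"]
print(f"Accuracy: {scores_num.mean():.3f} ± {scores_num.std():.3f}")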

penguins = pd.read_csv("../datasets/penguins_classification.csv")
# only keep the Adelie and Chinstrap classes
penguins = (
penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
)
# %% [markdown]
# What is the most important feature seen by the logistic regression?
#
# You can use a boxplot to compare the absolute values of the coefficients while
# also visualizing the variability induced by the cross-validation resampling.

# %%
# Write your code here.
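
# %% [markdown]
# A possible sketch, assuming `cv_results_num` from the sketch above holds the
# estimators fitted on each fold: collect the coefficients per fold and compare
# their absolute values with a boxplot.

# %%
import matplotlib.pyplot as plt

# One row of coefficients per cross-validation fold, one column per feature.
coefs_num = pd.DataFrame(
    [est[-1].coef_[0] for est in cv_results_num["estimator"]],
    columns=data.columns,
)
coefs_num.abs().plot.box(vert=False)
plt.xlabel("Absolute value of the coefficient")
_ = plt.title("Coefficient variability across folds")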

culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_column = "Species"
# %% [markdown]
# Let's now work with **both numerical and categorical features**. You can
# reload the Adult Census dataset with the following snippet:

# %%
from sklearn.model_selection import train_test_split
adult_census = pd.read_csv("../datasets/adult-census.csv")
target = adult_census["class"]
data = adult_census.drop(columns=["class", "education-num"])

# %% [markdown]
# Create a predictive model where:
# - The numerical data must be scaled.
# - The categorical data must be one-hot encoded, set `min_frequency=0.01` to
# group categories representing less than 1% of the total samples.
# - The predictor is a `LogisticRegression`. You may need to increase the
# number of iterations via `max_iter`, which is 100 by default.
#
# Use the same 10-fold cross-validation strategy with `return_estimator=True` as
# above to evaluate this complex pipeline.

penguins_train, penguins_test = train_test_split(penguins, random_state=0)
# %%
# Write your code here.
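
# %% [markdown]
# A possible sketch, reusing the imports from the first sketch. The transformer
# ordering (one-hot encoder first) matches the `feature_names` snippet given
# below; `max_iter=5000` is an illustrative value, not a recommendation.

# %%
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder

numerical_columns = make_column_selector(dtype_include="number")(data)
categorical_columns = make_column_selector(dtype_exclude="number")(data)

preprocessor = make_column_transformer(
    (
        OneHotEncoder(handle_unknown="ignore", min_frequency=0.01),
        categorical_columns,
    ),
    (StandardScaler(), numerical_columns),
)
model_all = make_pipeline(preprocessor, LogisticRegression(max_iter=5000))
cv_results_all = cross_validate(
    model_all, data, target, cv=10, return_estimator=True
)
scores_all = cv_results_all["test_score"]
print(f"Accuracy: {scores_all.mean():.3f} ± {scores_all.std():.3f}")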

data_train = penguins_train[culmen_columns]
data_test = penguins_test[culmen_columns]
# %% [markdown]
# By comparing the cross-validation test scores of both models fold-to-fold,
# count the number of times the model using both numerical and categorical
# features has a better test score than the model using only numerical features.

target_train = penguins_train[target_column]
target_test = penguins_test[target_column]
# %%
# Write your code here.
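
# %% [markdown]
# A possible sketch, assuming `scores_num` and `scores_all` come from the
# sketches above: both `cross_validate` calls used `cv=10` without shuffling, so
# the folds are identical and the test scores can be compared pairwise.

# %%
n_better = (scores_all > scores_num).sum()
print(f"The full model is better on {n_better} out of 10 folds")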

# %% [markdown]
# First, let's create our predictive model.
# For the following questions, you can copy and paste the following snippet to
# get the feature names from the column transformer here named `preprocessor`.
#
# ```python
# preprocessor.fit(data)
# feature_names = (
# preprocessor.named_transformers_["onehotencoder"].get_feature_names_out(
# categorical_columns
# )
# ).tolist()
# feature_names += numerical_columns
# feature_names
# ```

# %%
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Write your code here.

logistic_regression = make_pipeline(
StandardScaler(), LogisticRegression(penalty="l2")
)
# %% [markdown]
# Notice that there are as many feature names as coefficients in the last step
# of your predictive pipeline.

# %% [markdown]
# Given the following candidates for the `C` parameter, find out the impact of
# `C` on the classifier decision boundary. You can use
# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the
# decision function boundary.
# Which of the following pairs of features most impacts the predictions of the
# logistic regression classifier, based on the absolute magnitude of its
# coefficients?

# %%
Cs = [0.01, 0.1, 1, 10]
# Write your code here.
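
# %% [markdown]
# A possible sketch, assuming `feature_names` was built with the snippet above
# and `cv_results_all` holds the fitted pipelines: rank the features by the
# median absolute value of their coefficient across folds.

# %%
coefs_all = pd.DataFrame(
    [est[-1].coef_[0] for est in cv_results_all["estimator"]],
    columns=feature_names,
)
coefs_all.abs().median().sort_values(ascending=False).head(10)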

# %% [markdown]
# Now create a similar pipeline consisting of the same preprocessor as above,
# followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.
# Set `degree=2` and `interaction_only=True` on the feature engineering step.
# Remember not to include a "bias" feature to avoid introducing a redundancy
# with the intercept of the subsequent logistic regression.

# %%
# Write your code here.
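
# %% [markdown]
# A possible sketch, reusing the `preprocessor` defined above; `max_iter=5000`
# is again an illustrative value.

# %%
from sklearn.preprocessing import PolynomialFeatures

model_interactions = make_pipeline(
    preprocessor,
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LogisticRegression(C=0.01, max_iter=5000),
)
cv_results_interactions = cross_validate(
    model_interactions, data, target, cv=10, return_estimator=True
)
scores_interactions = cv_results_interactions["test_score"]
print(
    f"Accuracy: {scores_interactions.mean():.3f} ± {scores_interactions.std():.3f}"
)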

# %% [markdown]
# Look at the impact of the `C` hyperparameter on the magnitude of the weights.
# By comparing the cross-validation test scores of both models fold-to-fold,
# count the number of times the model using multiplicative interactions and both
# numerical and categorical features has a better test score than the model
# without interactions.

# %%
# Write your code here.
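
# %% [markdown]
# A possible sketch, assuming the two score arrays above share the same folds.

# %%
n_better = (scores_interactions > scores_all).sum()
print(f"The model with interactions is better on {n_better} out of 10 folds")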

# %%
# Write your code here.
170 changes: 170 additions & 0 deletions python_scripts/linear_models_ex_04.py
@@ -0,0 +1,170 @@
# ---
# jupyter:
# jupytext:
# text_representation:
# extension: .py
# format_name: percent
# format_version: '1.3'
# jupytext_version: 1.15.2
# kernelspec:
# display_name: Python 3
# name: python3
# ---

# %% [markdown]
# # 📝 Exercise M4.04
#
# In the previous Module we tuned the hyperparameter `C` of the logistic
# regression without mentioning that it controls the regularization strength.
# Later, on the slides on 🎥 **Intuitions on regularized linear models** we
# mentioned that a small `C` provides a more regularized model, whereas a
# non-regularized model is obtained with an infinitely large value of `C`.
# Indeed, `C` behaves as the inverse of the `alpha` coefficient in the `Ridge`
# model.
#
# In this exercise, we ask you to train a logistic regression classifier using
# different values of the parameter `C` to find its effects by yourself.
#
# We start by loading the dataset. We only keep the Adelie and Chinstrap classes
# to keep the discussion simple.


# %% [markdown]
# ```{note}
# If you want a deeper overview regarding this dataset, you can refer to the
# Appendix - Datasets description section at the end of this MOOC.
# ```

# %%
import pandas as pd

penguins = pd.read_csv("../datasets/penguins_classification.csv")
penguins = (
penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
)

culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_column = "Species"

# %%
from sklearn.model_selection import train_test_split

penguins_train, penguins_test = train_test_split(
penguins, random_state=0, test_size=0.4
)

data_train = penguins_train[culmen_columns]
data_test = penguins_test[culmen_columns]

target_train = penguins_train[target_column]
target_test = penguins_test[target_column]

# %% [markdown]
# We define a function to help us fit a given `model` and plot its decision
# boundary. We recall that by using a `DecisionBoundaryDisplay` with a diverging
# colormap, `vmin=0` and `vmax=1`, we ensure that the 0.5 probability is mapped
# to the white color. In other words, the darker the color, the closer the
# predicted probability is to 0 or 1 and the more confident the classifier is in
# its predictions.

# %%
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.inspection import DecisionBoundaryDisplay


def plot_decision_boundary(model):
model.fit(data_train, target_train)
accuracy = model.score(data_test, target_test)
C = model.get_params()["logisticregression__C"]

disp = DecisionBoundaryDisplay.from_estimator(
model,
data_train,
response_method="predict_proba",
plot_method="pcolormesh",
cmap="RdBu_r",
alpha=0.8,
vmin=0.0,
vmax=1.0,
)
DecisionBoundaryDisplay.from_estimator(
model,
data_train,
response_method="predict_proba",
plot_method="contour",
linestyles="--",
linewidths=1,
alpha=0.8,
levels=[0.5],
ax=disp.ax_,
)
sns.scatterplot(
data=penguins_train,
x=culmen_columns[0],
y=culmen_columns[1],
hue=target_column,
palette=["tab:blue", "tab:red"],
ax=disp.ax_,
)
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}")


# %% [markdown]
# Let's now create our predictive model.

# %%
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())

# %% [markdown]
# ## Influence of the parameter `C` on the decision boundary
#
# Given the following candidates for the `C` parameter and the
# `plot_decision_boundary` function, find out the impact of `C` on the
# classifier's decision boundary.
#
# - How does the value of `C` impact the confidence of the predictions?
# - How does it impact the underfit/overfit trade-off?
# - How does it impact the position and orientation of the decision boundary?
#
# Try to give an interpretation of the reason for this behavior.

# %%
Cs = [1e-6, 0.01, 0.1, 1, 10, 100, 1e6]

# Write your code here.
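
# %% [markdown]
# A possible sketch: refit and plot the decision boundary for each candidate
# value of `C`, reusing the `plot_decision_boundary` helper defined above.

# %%
for C in Cs:
    logistic_regression.set_params(logisticregression__C=C)
    plot_decision_boundary(logistic_regression)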

# %% [markdown]
# ## Impact of the regularization on the weights
#
# Look at the impact of the `C` hyperparameter on the magnitude of the weights.
# **Hint**: You can [access pipeline
# steps](https://scikit-learn.org/stable/modules/compose.html#access-pipeline-steps)
# by name or position. Then you can query the attributes of that step such as
# `coef_`.

# %%
# Write your code here.
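
# %% [markdown]
# A possible sketch: refit the pipeline for each `C`, extract the coefficients
# of the `LogisticRegression` step (the last step of the pipeline), and compare
# their magnitudes. The name `weights_by_C` is illustrative.

# %%
weights_by_C = {}
for C in Cs:
    logistic_regression.set_params(logisticregression__C=C)
    logistic_regression.fit(data_train, target_train)
    # The last pipeline step is the fitted LogisticRegression.
    weights_by_C[C] = logistic_regression[-1].coef_[0]

weights_by_C = pd.DataFrame(weights_by_C, index=culmen_columns)
ax = weights_by_C.plot.barh()
_ = ax.set_xlabel("Coefficient value")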

# %% [markdown]
# ## Impact of the regularization with non-linear feature engineering
#
# Use the `plot_decision_boundary` function to repeat the experiment using a
# non-linear feature engineering pipeline. To do so, insert
# `Nystroem(kernel="rbf", gamma=1, n_components=100)` between the
# `StandardScaler` and the `LogisticRegression` steps.
#
# - Does the value of `C` still impact the position of the decision boundary and
# the confidence of the model?
# - What can you say about the impact of `C` on the underfitting vs overfitting
# trade-off?

# %%
from sklearn.kernel_approximation import Nystroem

# Write your code here.
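
# %% [markdown]
# A possible sketch: the same loop as before, with a `Nystroem` approximation of
# an RBF kernel between the scaler and the classifier; `max_iter=1000` is an
# illustrative value to avoid convergence warnings.

# %%
for C in Cs:
    nystroem_regression = make_pipeline(
        StandardScaler(),
        Nystroem(kernel="rbf", gamma=1.0, n_components=100),
        LogisticRegression(C=C, max_iter=1000),
    )
    plot_decision_boundary(nystroem_regression)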
