Commit e68c62d

Fix feature selection tutorial
1 parent c216d38 commit e68c62d

2 files changed: +156 -23 lines changed

_quarto.yml

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ project:
     - content/modelisation/index.qmd
     - content/visualisation/index.qmd
     - content/annexes/corrections.qmd
-    - content/modelisation/2_classification.qmd
+    - content/modelisation/4_featureselection.qmd
 
 profile:
   default: fr

content/modelisation/4_featureselection.qmd

Lines changed: 155 additions & 22 deletions
@@ -108,6 +108,8 @@ le LASSO permet de fixer un certain nombre de coefficients à 0.
 Les variables dont la norme est non nulle passent ainsi le test de sélection.
 
 ::: {.callout-tip}
+## Le programme d'optimisation du LASSO
+
 Le LASSO est un programme d'optimisation sous contrainte. On cherche à trouver l'estimateur $\beta$ qui minimise l'erreur quadratique (régression linéaire) sous une contrainte additionnelle régularisant les paramètres:
 $$
 \min_{\beta} \frac{1}{2}\mathbb{E}\bigg( \big( X\beta - y \big)^2 \bigg) \\
@@ -149,6 +151,8 @@ LASSO allows certain coefficients to be set to 0.
 Variables with non-zero norms thus pass the selection test.
 
 ::: {.callout-tip}
+## LASSO optimization program
+
 LASSO is a constrained optimization problem. It seeks to find the estimator $\beta$ that minimizes the quadratic error (linear regression) under an additional constraint regularizing the parameters:
 $$
 \min_{\beta} \frac{1}{2}\mathbb{E}\bigg( \big( X\beta - y \big)^2 \bigg) \\
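Note on the callout in this hunk: why LASSO can set coefficients exactly to 0 is easiest to see in a simplified case. The sketch below is an illustration under stated assumptions, not the tutorial's code: it assumes an orthonormal design and the penalized form $\frac{1}{2}(b - \beta)^2 + \alpha|\beta|$ for each coefficient, where `b` is the least-squares coefficient. In that case the solution is the soft-thresholding of `b`. The name `soft_threshold` and the numbers are made up.

```python
def soft_threshold(b: float, alpha: float) -> float:
    """Minimizer over beta of 0.5 * (b - beta)**2 + alpha * abs(beta)."""
    if b > alpha:
        return b - alpha
    if b < -alpha:
        return b + alpha
    # Coefficients whose magnitude is at most alpha are set exactly to 0
    return 0.0


# Hypothetical least-squares coefficients
ols_coefs = [2.0, -0.3, 0.05, -1.5]

for alpha in (0.0, 0.1, 0.5):
    shrunk = [soft_threshold(b, alpha) for b in ols_coefs]
    kept = sum(1 for beta in shrunk if beta != 0.0)
    print(f"alpha={alpha}: {kept} non-zero coefficients")
```

The larger `alpha` is, the more coefficients fall inside the threshold and are dropped, which matches the behaviour the tutorial describes.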
@@ -203,6 +207,110 @@ df2 = df2.loc[
 df2 = df2.loc[:,~df2.columns.duplicated()]
 ```
 
+:::: {.content-visible when-profile="fr"}
+Dans le prochain exercice, nous allons utiliser la fonction suivante pour avoir une matrice de corrélation plus esthétique que celle permise par défaut avec `Pandas`.
+::::
+
+
+:::: {.content-visible when-profile="en"}
+In the next exercise, we will use the following function to draw a correlation matrix that is more readable than the default `Pandas` output.
+::::
+
+
+```{python}
+#| echo: true
+#| code-fold: true
+import numpy as np
+import pandas as pd
+import plotly.express as px
+
+
+def plot_corr_heatmap(
+    df: pd.DataFrame,
+    drop_cols=None,
+    column_labels: dict | None = None,
+    decimals: int = 2,
+    width: int = 600,
+    height: int = 600,
+    show_xlabels: bool = False
+):
+    """
+    Trace une heatmap de corrélation (triangle inférieur) à partir d'un DataFrame.
+
+    Paramètres
+    ----------
+    df : pd.DataFrame
+        DataFrame d'entrée.
+    drop_cols : list ou None
+        Liste de colonnes à supprimer avant le calcul de la corrélation
+        (ex: ['winner']).
+    column_labels : dict ou None
+        Dictionnaire pour renommer les colonnes (ex: column_labels).
+    decimals : int
+        Nombre de décimales pour l'arrondi avant corr().
+    width, height : int
+        Dimensions de la figure Plotly.
+    show_xlabels : bool
+        Afficher ou non les labels en abscisse.
+    """
+    data = df.copy()
+
+    # 1. Colonnes à drop
+    if drop_cols is not None:
+        data = data.drop(columns=drop_cols)
+
+    # 2. Arrondi + renommage éventuel
+    if column_labels is not None:
+        data = data.rename(columns=column_labels)
+    data = data.round(decimals)
+
+    # 3. Matrice de corrélation
+    corr = data.corr()
+
+    # 4. Masque triangle supérieur
+    mask = np.triu(np.ones_like(corr, dtype=bool))
+    corr_masked = corr.mask(mask)
+
+    # 5. Heatmap Plotly
+    fig = px.imshow(
+        corr_masked.values,
+        x=corr.columns,
+        y=corr.columns,
+        color_continuous_scale='RdBu_r', # échelle inversée
+        zmin=-1,
+        zmax=1,
+        text_auto=".2f"
+    )
+
+    # 6. Hover custom
+    fig.update_traces(
+        hovertemplate="Var 1: %{y}<br>Var 2: %{x}<br>Corr: %{z:.2f}<extra></extra>"
+    )
+
+    # 7. Layout
+    fig.update_layout(
+        coloraxis_showscale=False,
+        xaxis=dict(
+            showticklabels=show_xlabels,
+            title=None,
+            ticks=''
+        ),
+        yaxis=dict(
+            showticklabels=show_xlabels,
+            title=None,
+            ticks=''
+        ),
+        plot_bgcolor="rgba(0,0,0,0)",
+        margin=dict(t=10, b=10, l=10, r=10),
+        width=width,
+        height=height
+    )
+
+    return fig
+
+```
+
+
 :::: {.content-visible when-profile="fr"}
 Dans cet exercice, nous utiliserons
 également une fonction pour extraire
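Note on step 4 of `plot_corr_heatmap`: `np.triu(np.ones_like(corr, dtype=bool))` is `True` on the diagonal and above it, so `corr.mask(mask)` hides the diagonal as well as the upper triangle, and only the pairs strictly below the diagonal are drawn. A dependency-free sketch of the same masking, using a hypothetical helper `mask_upper_triangle` and made-up correlation values (it uses `None` where pandas would put `NaN`):

```python
def mask_upper_triangle(matrix):
    """Return a copy where the diagonal and every cell above it is None."""
    return [
        [value if col < row else None for col, value in enumerate(values)]
        for row, values in enumerate(matrix)
    ]


# Made-up symmetric correlation matrix for three variables
corr = [
    [1.0, 0.8, -0.2],
    [0.8, 1.0, 0.1],
    [-0.2, 0.1, 1.0],
]

masked = mask_upper_triangle(corr)
for line in masked:
    print(line)
```

Because a correlation matrix is symmetric with ones on the diagonal, nothing is lost by keeping only the lower triangle.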
@@ -468,6 +576,8 @@ preprocessor
 ```
 
 ```{python}
+#| output: false
+
 # Question 4
 model = Lasso(fit_intercept=True, alpha = 0.1)
 
@@ -521,7 +631,11 @@ The model is quite parsimonious as it uses a subset of our initial variables (es
 
 
 ```{python}
-features_selec.str.replace("(number__|category__)", "", regex = True)
+pd.DataFrame(
+    {
+        "selected": features_selec.str.replace("(number__|category__)", "", regex = True)
+    }
+)
 ```
 
 :::: {.content-visible when-profile="fr"}
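Note on the pattern used in this hunk: `(number__|category__)` strips the `number__` and `category__` prefixes from the selected feature names. Since the pattern is not anchored, it would also remove those strings if they appeared in the middle of a name. The standalone check below uses the standard `re` module and made-up feature names. The anchored variant `^(number__|category__)` is a suggestion, not what the commit uses.

```python
import re

# Made-up feature names; the last one contains "number__" twice
names = [
    "number__median_age",
    "category__state_TX",
    "number__odd_number__rate",
]

# Same pattern as in the commit: every occurrence is removed
unanchored = [re.sub(r"(number__|category__)", "", name) for name in names]

# Suggested variant: only a leading prefix is removed
anchored = [re.sub(r"^(number__|category__)", "", name) for name in names]

print(unanchored)
print(anchored)
```

For names that carry the prefix only once, at the start, both patterns give the same result.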
@@ -537,15 +651,15 @@ Additionally, redundant variables are being selected. A more thorough data clean
 
 
 ```{python}
-#| output: false
 #4. Corrélations entre les variables sélectionnées
 
-features_selected = features_selec.loc[features_selec.str.startswith("number__")].str.replace("number__", "", regex = True)
-
-corr = df2.loc[: , features_selected].corr()
+features_selected = (
+    features_selec
+    .loc[features_selec.str.startswith("number__")]
+    .str.replace("number__", "", regex = True)
+)
 
-plt.figure()
-p = corr.style.background_gradient(cmap='coolwarm', axis=None).format('{:.2f}')
+p = plot_corr_heatmap(df2.loc[: , features_selected])
 p
 ```
 
@@ -623,7 +737,7 @@ The parsimonious model is (slightly) more performant:
 ```{python}
 pd.DataFrame({
     "parcimonieux": [rmse_parci, rsq_parci, len(features_selected)],
-    "non parcimonieux": [rmse_nonparci, rsq_nonparci, ols_pipeline[-1].coef_.shape[1] + 1]},
+    "non parcimonieux": [rmse_nonparci, rsq_nonparci, ols_pipeline[-1].coef_.shape[0] + 1]},
     index = ['RMSE', 'R2', 'Nombre de paramètres']
 )
 ```
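Note on the change from `coef_.shape[1]` to `coef_.shape[0]`: assuming the last step of `ols_pipeline` is a scikit-learn linear model fitted on a one-dimensional target, `coef_` is a 1-D array of shape `(n_features,)`, so index 1 does not exist and index 0 gives the number of coefficients. The sketch below mimics this with a plain tuple instead of a fitted model. The helper `n_parameters` and the value 12 are made up.

```python
def n_parameters(coef_shape: tuple) -> int:
    """Number of coefficients plus one for the intercept, for a 1-D coef_ shape."""
    return coef_shape[0] + 1


# Shape of a 1-D coefficient vector with 12 (hypothetical) features
shape_1d = (12,)
print(n_parameters(shape_1d))

# Asking for a second dimension fails, as in the line removed by the commit
try:
    shape_1d[1]
    second_dim_exists = True
except IndexError:
    second_dim_exists = False
print(second_dim_exists)
```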
@@ -682,6 +796,7 @@ difficult to handle.
 
 ```{python}
 #| echo: true
+#| output: false
 from sklearn.impute import SimpleImputer
 from sklearn.preprocessing import StandardScaler
 
@@ -698,11 +813,14 @@ numeric_pipeline = Pipeline(steps=[
     ('impute', SimpleImputer(strategy='mean')),
     ('scale', StandardScaler())
 ])
+
 preprocessed_features = pd.DataFrame(
     numeric_pipeline.fit_transform(
         X_train.drop(columns = categorical_features)
     )
 )
+
+numeric_pipeline
 ```
 
 
@@ -732,6 +850,7 @@ varies (explore $\alpha \in [0.001,0.01,0.02,0.025,0.05,0.1,0.25,0.5,0.8,1.0]$).
 
 ```{python}
 #| output: false
+from plotnine import *
 
 #6. Utilisation de lasso_path
 my_alphas = np.array([0.001,0.01,0.02,0.025,0.05,0.1,0.25,0.5,0.8,1.0])
@@ -742,16 +861,24 @@ alpha_for_path, coefs_lasso, _ = lasso_path(
     alphas=my_alphas)
 #print(coefs_lasso)
 nb_non_zero = np.apply_along_axis(func1d=np.count_nonzero,arr=coefs_lasso,axis=0)
-nb_non_zero = pd.DataFrame(
-    nb_non_zero
-).sum(axis = 0)
+
+output_lasso = pd.DataFrame(
+    {"alpha": alpha_for_path, "non_zero": nb_non_zero}
+)
+
 
 # graphique
 
-sns.set_style("whitegrid")
-plt.figure()
-p = sns.lineplot(y=nb_non_zero, x=alpha_for_path)
-p.set(title = r"Number variables and regularization parameter ($\alpha$)", xlabel=r'$\alpha$', ylabel='Nb. de variables')
+p = (
+    ggplot(output_lasso) +
+    geom_line(aes(x = "alpha", y = "non_zero")) +
+    labs(
+        title = r"Number variables and regularization parameter ($\alpha$)",
+        x = r'$\alpha$',
+        y = 'Nb. de variables'
+    ) +
+    theme_minimal()
+)
 ```
 
 :::: {.content-visible when-profile="fr"}
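Note on `nb_non_zero` in this hunk: for a single target, `lasso_path` returns the coefficients with one row per feature and one column per value of alpha, which is why the non-zero count is taken along `axis=0`. A dependency-free sketch of that column-wise count, using a made-up coefficient table and a hypothetical helper `count_non_zero_per_alpha`:

```python
def count_non_zero_per_alpha(coefs):
    """coefs[i][j] is the coefficient of feature i for the j-th alpha."""
    n_alphas = len(coefs[0])
    return [
        sum(1 for row in coefs if row[j] != 0)
        for j in range(n_alphas)
    ]


# Made-up values: 3 features (rows) evaluated at 3 alphas (columns)
alphas = [1.0, 0.1, 0.001]
coefs = [
    [0.0, 0.4, 0.9],
    [0.0, 0.0, -0.2],
    [0.3, 0.7, 1.1],
]

non_zero = count_non_zero_per_alpha(coefs)
print(list(zip(alphas, non_zero)))
```

Pairing the counts with the alphas returned by the path (`alpha_for_path` in the commit) rather than with the input array keeps each count aligned with the alpha it was computed for.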
@@ -766,7 +893,7 @@ the number of parameters is as follows:
 
 
 ```{python}
-p.figure.get_figure()
+p
 ```
 
 :::: {.content-visible when-profile="fr"}
@@ -778,11 +905,6 @@ We see that the higher $\alpha$ is, the fewer variables the model selects.
 ::::
 
 
-```{python}
-#| output: false
-p.figure.get_figure().savefig("featured_selection.png")
-```
-
 :::: {.content-visible when-profile="fr"}
 # Validation croisée pour sélectionner le modèle
 
@@ -878,10 +1000,21 @@ lasso_optimal = lasso_pipeline.fit(X_train,y_train)
 features_selec2 = extract_features_selected(lasso_optimal)
 ```
 
+:::: {.content-visible when-profile="fr"}
 Les variables sélectionnées sont :
+::::
+
+
+:::: {.content-visible when-profile="en"}
+The selected features are:
+::::
 
 ```{python}
-features_selec2.str.replace("(number__|category__)", "", regex = True)
+pd.DataFrame(
+    {
+        "selected": features_selec2.str.replace("(number__|category__)", "", regex = True)
+    }
+)
 ```
 
 