Avoid deprecated ravel method and fix notebooks launch url

linogaliana · linogaliana · commit 3e04ed3c4b17 · 2025-11-24T20:08:05.000Z
close #655
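
For context, here is a minimal sketch of why the label can be passed as `df['y']` without `.values.ravel()`. It is not part of the commit: the toy DataFrame and the names `as_frame`, `as_series` and `flattened` are illustrative assumptions. Double brackets return a 2-D, one-column DataFrame, which is the shape that triggers the "column-vector y was passed" warning quoted in the removed notes. Single brackets return a 1-D Series, so no flattening step is needed.

```python
import pandas as pd

# Toy data standing in for the `votes` DataFrame used in the chapter (hypothetical values)
df = pd.DataFrame({"y": [0, 1, 1, 0], "x": [1.0, 2.0, 3.0, 4.0]})

as_frame = df[["y"]]                   # double brackets: 2-D DataFrame with one column
as_series = df["y"]                    # single brackets: 1-D Series
flattened = df[["y"]].values.ravel()   # old workaround: flatten the column to 1-D

print(as_frame.shape)    # two-dimensional
print(as_series.shape)   # one-dimensional
print(flattened.shape)   # one-dimensional, same length as the Series
```

Both `as_series` and `flattened` hold the same values. The Series form also keeps the index, which seems to be what the diff below relies on when it passes `df['y']` straight to `train_test_split`.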
diff --git a/.gitignore b/.gitignore
@@ -54,3 +54,5 @@ jsconfig.json
 /.quarto/
 
 /.luarc.json
+
+**/*.quarto_ipynb
diff --git a/_extensions/badges/badges.lua b/_extensions/badges/badges.lua
@@ -83,7 +83,7 @@ function reminder_badges(args, kwargs)
 
 
   local onyxiaInitArgs = { section, chapterNoExtension }
-  if correction then
+  if (correction == "true") then
     table.insert(onyxiaInitArgs, "correction")
   end
 
diff --git a/_quarto.yml b/_quarto.yml
@@ -5,10 +5,10 @@ project:
     - 404.qmd
     - content/getting-started/index.qmd
     - content/getting-started/index.qmd
-    - content/getting-started/03_revisions.qmd
     - content/modelisation/index.qmd
     - content/visualisation/index.qmd
     - content/annexes/corrections.qmd
+    - content/modelisation/2_classification.qmd
 
 profile:
   default: fr
diff --git a/content/modelisation/2_classification.qmd b/content/modelisation/2_classification.qmd
@@ -191,23 +191,27 @@ import matplotlib.pyplot as plt
 
 1. Créer une variable *dummy* appelée `y` dont la valeur vaut 1 quand les républicains l'emportent. 
 2. En utilisant la fonction prête à l'emploi nommée `train_test_split` de la librairie `sklearn.model_selection`,
-créer des échantillons de test (20 % des observations) et d'estimation (80 %) avec comme *features* : `'Unemployment_rate_2019', 'Median_Household_Income_2021', 'Percent of adults with less than a high school diploma, 2018-22', "Percent of adults with a bachelor's degree or higher, 2018-22"` et comme *label* la variable `y`. 
+créer des échantillons de test (20 % des observations) et d'estimation (80 %) avec comme *features* :
 
-*Note: Il se peut que vous ayez le warning suivant :*
-
-> A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel()
+```{.python}
+xvars = [
+  "Unemployment_rate_2019", "Median_Household_Income_2021",
+  "Percent of adults with less than a high school diploma, 2018-22",
+  "Percent of adults with a bachelor's degree or higher, 2018-22"
+]
+```
 
-*Note : Pour éviter ce warning à chaque fois que vous estimez votre modèle, vous pouvez utiliser `DataFrame[['y']].values.ravel()` plutôt que `DataFrame[['y']]` lorsque vous constituez vos échantillons.*
+et comme *label* la variable `y`. 
 
 3. Entraîner un classifieur SVM avec comme paramètre de régularisation `C = 1`. Regarder les mesures de performance suivante : `accuracy`, `f1`, `recall` et `precision`.
 
-4. Vérifier la matrice de confusion : vous devriez voir que malgré des scores en apparence pas si mauvais, il y a un problème notable. 
+2. Vérifier la matrice de confusion : vous devriez voir que malgré des scores en apparence pas si mauvais, il y a un problème notable. 
 
-5. Refaire les questions précédentes avec des variables normalisées. Le résultat est-il différent ?
+3. Refaire les questions précédentes avec des variables normalisées. Le résultat est-il différent ?
 
-6. Changer de variables *x*. Utiliser uniquement le résultat passé du vote démocrate (année 2016) et le revenu. Les variables en question sont `share_2016_republican` et `Median_Household_Income_2021`. Regarder les résultats, notamment la matrice de confusion. 
+4. Changer de variables *x*. Utiliser uniquement le résultat passé du vote démocrate (année 2016) et le revenu. Les variables en question sont `share_2016_republican` et `Median_Household_Income_2021`. Regarder les résultats, notamment la matrice de confusion. 
 
-7. [OPTIONNEL] Faire une 5-fold validation croisée pour déterminer le paramètre *C* idéal. 
+5. [OPTIONNEL] Faire une 5-fold validation croisée pour déterminer le paramètre *C* idéal. 
 ::::
 
 :::
@@ -222,33 +226,27 @@ créer des échantillons de test (20 % des observations) et d'estimation (80 %)
 2. Using the ready-to-use function `train_test_split` from the `sklearn.model_selection` library, 
 create test samples (20% of the observations) and training samples (80%) with the following *features*: 
 
-```python
-vars = [
+```{.python}
+xvars = [
   "Unemployment_rate_2019", "Median_Household_Income_2021",
   "Percent of adults with less than a high school diploma, 2018-22",
   "Percent of adults with a bachelor's degree or higher, 2018-22"
 ]
 ```
 
 
-
 and use the variable `y` as the *label*. 
 
-*Note: You may encounter the following warning:*
-
-> A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel()
-
-*Note: To avoid this warning every time you train your model, you can use `DataFrame[['y']].values.ravel()` instead of `DataFrame[['y']]` when preparing your samples.*
-
 3. Train an SVM classifier with a regularization parameter `C = 1`. Examine the following performance metrics: `accuracy`, `f1`, `recall`, and `precision`.
 
 4. Check the confusion matrix: despite seemingly reasonable scores, you should notice a significant issue.
 
 5. Repeat the previous steps using normalized variables. Are the results different?
 
 
+6. Change the *x* variables. Use only the previous Democratic vote result (2016) and income. Variables in question are `share_2016_republican` and `Median_Household_Income_2021`. Examine results, in particular the confusion matrix.
+
 7. [OPTIONAL] Perform 5-fold cross-validation to determine the ideal *C* parameter. 
-6. Change the *x* variables. Use only the previous Democratic vote result (2016) and income. The variables in question are `share_2016_republican` and `Median_Household_Income_2021    `. Examine the results, particularly the confusion matrix.
 
 :::
 ::::
@@ -268,14 +266,12 @@ xvars = [
   "Percent of adults with a bachelor's degree or higher, 2018-22"
 ]
 
-
-
 df = votes.loc[:, ["y"] + xvars]
 df = df.dropna()
 
 X_train, X_test, y_train, y_test = train_test_split(
-    df[xvars],
-    df[['y']].values.ravel(), test_size=0.2, random_state=123
+    df.loc[: , xvars],
+    df['y'], test_size=0.2, random_state=123
 )
 ```
 
@@ -361,13 +357,13 @@ Standardizing the variables ultimately does not bring any improvement:
 import sklearn.preprocessing as preprocessing
 
 X = df.loc[:, xvars]
-y = df[['y']]
+y = df['y']
 scaler = preprocessing.StandardScaler().fit(X)
 X = scaler.transform(X)
 
 X_train, X_test, y_train, y_test = train_test_split(
     X,
-    y.values.ravel(), test_size=0.2, random_state=0
+    y, test_size=0.2, random_state=0
 )
 
 clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
@@ -408,17 +404,17 @@ out = pd.DataFrame.from_dict(
 ```{python}
 # Question 6
 votes['y'] = (votes['votes_gop'] > votes['votes_dem']).astype(int)
-df = votes[["y", "share_2016_republican", 'Median_Household_Income_2021']]
+df = votes.loc[: , ["y", "share_2016_republican", 'Median_Household_Income_2021']]
 tempdf = df.dropna(how = "any")
 
-X = votes[['share_2016_republican', 'Median_Household_Income_2021']]
-y = tempdf[['y']]
+X = tempdf.loc[:, ['share_2016_republican', 'Median_Household_Income_2021']]
+y = tempdf['y']
 scaler = preprocessing.StandardScaler().fit(X)
 X = scaler.transform(X)
 
 X_train, X_test, y_train, y_test = train_test_split(
     X,
-    y.values.ravel(), test_size=0.2, random_state=0
+    y, test_size=0.2, random_state=0
 )
 
 clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
diff --git a/content/modelisation/6_pipeline.qmd b/content/modelisation/6_pipeline.qmd
@@ -519,7 +519,7 @@ mutations2 = mutations2.groupby('dep').sample(frac = 0.1, random_state = 123)
 
 X_train, X_test, y_train, y_test = train_test_split(
     mutations2.drop("Valeur_fonciere", axis = 1),
-    mutations2[["Valeur_fonciere"]].values.ravel(),
+    mutations2["Valeur_fonciere"],
     test_size = 0.2, random_state = 123, stratify=mutations2[['dep']]
 )
 ```
@@ -837,7 +837,7 @@ pipe = Pipeline(steps=[('preprocessor', transformer),
 
 X_train, X_test, y_train, y_test = train_test_split(
     mutations_paris.drop("Valeur_fonciere", axis = 1),
-    mutations_paris[["Valeur_fonciere"]].values.ravel(),
+    mutations_paris["Valeur_fonciere"],
     test_size = 0.2, random_state = 123
 )
 
diff --git a/content/modelisation/app/train.py b/content/modelisation/app/train.py
@@ -37,7 +37,7 @@ def create_data_cv(data):
 
     X_train, X_test, y_train, y_test = train_test_split(
         data.drop("Valeur_fonciere", axis=1),
-        data[["Valeur_fonciere"]].values.ravel(),
+        data["Valeur_fonciere"],
         test_size=0.2,
         random_state=123,
     )
