English version

linogaliana · linogaliana · commit 64c12dc523b0 · 2024-11-07T19:28:22.000Z
diff --git a/_quarto-en.yml b/_quarto-en.yml
@@ -18,6 +18,7 @@ project:
     - content/manipulation/04b_regex_TP.qmd
     - content/visualisation/index.qmd
     - content/modelisation/index.qmd
+    - content/modelisation/0_preprocessing.qmd
 
 
 website:
diff --git a/content/modelisation/01_preprocessing/_exo2.qmd b/content/modelisation/01_preprocessing/_exo2.qmd
@@ -28,8 +28,7 @@ __This exercise is OPTIONAL__
 `Percent of adults with less than a high school diploma, 2015-19`,
 `Percent of adults with a bachelor's degree or higher, 2015-19`.
 2. Use a graph to represent the correlation matrix. You can use the `seaborn` package and its `heatmap` function.
-3. Display a scatter plot matrix of the variables in `df2` using `pd.plotting.scatter_matrix`.
-4. (Optional) Recreate these figures using `Plotly`, which also offers the option to create a correlation matrix.
+
 
 :::
 
@@ -82,7 +81,7 @@ sns.heatmap(
     corr,
     mask=mask,     # Mask upper triangular matrix
     cmap=cmap,
-    annot=True
+    annot=True,
     vmax=.3,
     vmin=-.3,
     center=0,      # The center value of the legend. With divergent cmap, where white is
@@ -95,29 +94,6 @@ plt.show(fig)
 ```
 
 
-::: {.content-visible when-profile="fr"}
-Alors que celle construite directement avec `corr` de `Pandas`
-ressemblera plutôt à ce tableau :
-:::
-
-::: {.content-visible when-profile="en"}
-
-Whereas the one constructed directly with `corr` from `Pandas` will look more like this table:
-:::
-
-
-
-```{python}
-#| output: false
-#| echo: true
-# Construction directement avec pandas également possible
-g2 = df2.drop("winner", axis = 1).corr().style.background_gradient(cmap='coolwarm').format('{:.2f}')
-```
-
-```{python}
-g2
-```
-
 ::: {.content-visible when-profile="fr"}
 Le nuage de point obtenu à l'issue de la question 3 ressemblera à :
 :::
@@ -143,12 +119,3 @@ The result of question 4, on the other hand, should look like the following char
 :::
 
 
-```{python}
-#| echo: true
-# 4. Matrice de corrélation avec plotly
-import plotly
-import plotly.express as px
-htmlsnip2 = px.scatter_matrix(df2)
-htmlsnip2.update_traces(diagonal_visible=False)
-htmlsnip2.show()
-```
diff --git a/content/modelisation/0_preprocessing.qmd b/content/modelisation/0_preprocessing.qmd
@@ -16,7 +16,7 @@ description: |
 description-en: |
   In order to obtain data that is consistent with modeling assumptions, it is essential to take the time to prepare the data to be supplied to a model. The quality of the prediction depends heavily on this preliminary work, known as _preprocessing_. This chapter presents the issues involved and illustrates them using the `Scikit Learn` library, which makes this work less tedious and more reliable.
 bibliography: ../../reference.bib
-image: featured_preprocessing.png
+image: https://minio.lab.sspcloud.fr/lgaliana/generative-art/pythonds/artisan.jfif
 echo: false
 ---
 
@@ -99,7 +99,7 @@ The `Scikit` user guide is a valuable reference to consult regularly. The sectio
 
 
 
-:::: {.content-visible when-profile="en"}
+:::: {.content-visible when-profile="fr"}
 
 ::: {.note}
 ## `Scikit Learn`, un succès français ! 🐓🥖🥐
@@ -551,6 +551,7 @@ Detecting data drift is crucial to adjust or retrain the model, ensuring its rel
 ::::
 
 
+::: {.content-visible when-profile="fr"}
 ### Normalisation
 
 La **normalisation** est l'action de transformer les données de manière
@@ -560,87 +561,26 @@ Par défaut, la norme utilisée par `Scikit` est une norme  $\mathcal{l}_2$.
 Cette transformation est particulièrement utilisée en classification de texte ou pour effectuer du *clustering*.
 
 Au passage, ceci est l'occasion de découvrir comment découper ses données en plusieurs échantillons grâce à la fonction [`train_test_split`](https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.train_test_split.html) de `Scikit`. Nous allons faire un échantillon de 70% des données pour estimer les paramètres de normalisation (phase d'apprentissage) et extrapoler aux 30% de données restantes. Cette répartition est assez classique mais est bien-sûr adaptable selon les projets. L'avantage d'utiliser [`train_test_split`](https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.train_test_split.html) plutôt que de faire soi-même les échantillonnages avec la méthode `sample` de `Pandas` est que la fonction de `Scikit` permettra d'aller beaucoup plus loin dans le paramétrage de l'échantillonnage, notamment si on désire de la stratification, tout en étant fiable. Faire ceci de manière manuelle est fastidieux et risqué car potentiellement complexe à mettre en oeuvre sans erreur. 
-
-
-::: {.exercise}
-## Exercice 4 : Normalisation
-
-1. A l'aide de la documentation de la fonction [`train_test_split`](https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.train_test_split.html) de `Scikit`, créer deux échantillons (respectivement 70% et 30% des données).
-1. Normaliser la variable `Median_Household_Income_2019` (ne pas écraser les valeurs !) et regarder l'histogramme avant/après normalisation.
-3. Vérifier que la norme $\mathcal{l}_2$ est bien égale à 1 (grâce à la fonction `np.linalg.norm` et l'argument `axis=1` pour les 10 premières observations, sur l'ensemble d'entraînement puis sur les autres observations.
-
 :::
 
+::: {.content-visible when-profile="en"}
 
-```{python}
-from sklearn.model_selection import train_test_split
-
-X_train, X_test, y_train, y_test = train_test_split(
-    df2.drop(columns = "winner"), df2['winner'], test_size=0.3)
-```
-
-```{python}
-#| eval: false
-# 1. Normalisation des variables et vérification sur Median_Household_Income_2019
-scaler = preprocessing.Normalizer().fit(X_train)
-X1 = pd.DataFrame(
-  scaler.transform(X_train),
-  columns = X_train.columns
-)
-X2 = pd.DataFrame(
-  scaler.transform(X_test),
-  columns = X_test.columns
-)
-```
-
-```{python}
-p1 = (
-  ggplot(X_train, aes(x = "Median_Household_Income_2019")) +
-  geom_histogram() +
-  labs(x = "2019 Median household income ($)")
-)
-p2 = (
-  ggplot(X1, aes(x = "Median_Household_Income_2019")) +
-  geom_histogram() +
-  labs(x = "2019 Median household income (normalized, training sample)")
-)
-p3 = (
-  ggplot(X2, aes(x = "Median_Household_Income_2019")) +
-  geom_histogram() +
-  labs(x = "2019 Median household income (normalized, extrapolated sample)")
-)
-```
-
-
-```{python}
-#| fig-cap: "Question 2, avant normalisation"
-p1
-```
+### Normalization
 
-```{python}
-#| fig-cap: "Question 2, variable transformée, sur l'échantillon de normalisation"
-p2
-```
+**Normalization** is the process of transforming data to achieve a unit norm ($\mathcal{l}_1$ or $\mathcal{l}_2$).
+In other words, each observation is rescaled so that its norm equals 1: the absolute values of its elements ($\mathcal{l}_1$) or their squares ($\mathcal{l}_2$) sum to 1.
+By default, `Scikit` uses an $\mathcal{l}_2$ norm.
+This transformation is especially useful in text classification or clustering.
 
-```{python}
-#| fig-cap: "Question 2, variable transformée, à partir des paramètres entraînés"
-p3
-```
+This is also an opportunity to explore how to split data into multiple samples using the [`train_test_split`](https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.train_test_split.html) function in `Scikit`. We will create a 70% sample of the data to estimate normalization parameters (training phase) and extrapolate to the remaining 30%. This split is fairly standard but, of course, adaptable depending on the project. The advantage of using [`train_test_split`](https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.train_test_split.html) instead of manually sampling with `Pandas`’ `sample` method is that `Scikit`’s function allows for much more control over sampling, particularly if stratification is desired, while being reliable. Doing this manually can be tedious and risky, as it is potentially complex to implement without errors.
+:::
 
-Enfin, si on calcule la norme, on obtient bien le résultat attendu à la fois sur l'échantillon _train_ et sur l'échantillon extrapolé.
 
-```{python}
-# 3. Vérification de la norme L2
-pd.DataFrame(
-  {
-    "X_train_norm2": np.linalg.norm(X1.head(10), axis=1),
-    "X_test_norm2": np.linalg.norm(X2.head(10), axis=1)
-  }
-).head(5)
-```
+{{< include "01_preprocessing/_exo4.qmd" >}}
 
 
 
+::: {.content-visible when-profile="fr"}
 ## Encodage des valeurs catégorielles
 
 Les données catégorielles doivent être recodées sous forme de valeurs numériques pour être intégrés aux modèles de *machine learning*.
@@ -701,76 +641,107 @@ puis retirer dans les deux queues de la distribution les valeurs extrêmes corre
 
 Plusieurs packages permettent de faire ce type d'opérations, qui sont parfois plus complexes si on s'intéresse aux outlier sur plusieurs variables.
 On pourra notamment citer la fonction `IsolationForest()` du package `sklearn.ensemble`.
+:::
+::: {.content-visible when-profile="en"}
 
-## Exercice d'application
+## Encoding Categorical Values
 
-::: {.caution}
-## Attention aux nouvelles modalités !
+Categorical data must be recoded into numeric values to be integrated into machine learning models.
 
-Les _transformers_ créent un _mapping_ entre des modélités textuelles et des valeurs numériques. Cela présuppose que les données sur lesquelles a été construit ce _mapping_ intègrent l'ensemble des valeurs possibles pour les modalités textuelles. 
+This can be done in several ways with `Scikit`:
 
-Néanmoins, si de nouvelles modalités apparaissent, le classifieur ne saura pas comment celles-ci doivent être transformées en valeurs numériques. Cela provoquera une erreur pour `Scikit`. Cette erreur technique est logique puisqu'il faudrait mettre à jour non seulement le _mapping_ mais aussi l'estimation des paramètres sous-jacents. 
+* `LabelEncoder`: transforms a vector `["a","b","c"]` into a numeric vector `[0,1,2]`. This approach has the drawback of introducing an order to the categories, which is not always desirable.
+* `OrdinalEncoder`: a generalized version of `LabelEncoder` designed to apply to matrices ($X$), while `LabelEncoder` applies mainly to a vector ($y$).
 
-:::
+For one-hot encoding, several methods are available:
 
+* `pandas.get_dummies` performs a dummy expansion.
+A vector of size *n* with *K* categories will be transformed into a matrix of size $n \times K$, where each column represents a dummy variable for category *k*.
+Because the $K$ dummy columns always sum to 1, keeping all of them creates perfect multicollinearity.
+In linear regression with a constant,
+one category should be removed before estimation.
 
+* `OneHotEncoder` is a generalized (and optimized) version of dummy expansion. This is the recommended method.
 
-::: {.exercise}
-## Exercice 5 : Encoder des variables catégorielles
 
-1. Créer `df` qui conserve uniquement les variables `state_name` et `county_name` dans `votes`.
-2. Appliquer à `state_name` un `LabelEncoder`
-*Note : Le résultat du label encoding est relativement intuitif, notamment quand on le met en relation avec le vecteur initial.*
+## Imputation
+
+Data often contains missing values, that is, observations in our _DataFrame_ containing a `NaN`. These gaps can cause bugs or misinterpretations when moving to modeling.
+One initial approach could be to remove all observations with a `NaN` in at least one column.
+However, if our table contains many `NaN`s, or if these are spread across numerous columns,
+we risk removing a significant number of rows, and, with that, losing important information, as missing values are rarely [randomly distributed](https://stefvanbuuren.name/fimd/sec-MCAR.html).
+
+While this solution remains viable in many cases, a more robust approach called *imputation* exists. This method involves replacing missing values with a specified value. For example:
+
+- Mean imputation: replacing all `NaN`s in a column with the column's average;
+- Median imputation, on the same principle, or most-frequent-value imputation for categorical variables;
+- Regression imputation: using other variables to interpolate an appropriate replacement value.
+
+More complex methods are available, but in many cases the approaches above are sufficient and give much more satisfactory results than dropping incomplete rows.
+The `Scikit` package makes imputation very straightforward ([documentation here](https://scikit-learn.org/stable/modules/impute.html)).
+
 
-3. Regarder la *dummy expansion* de `state_name`
-4. Appliquer un `OrdinalEncoder` à `df[['state_name', 'county_name']]`
-*Note : Le résultat du _ordinal encoding_ est cohérent avec celui du label encoding*
+## Handling Outliers
 
-5. Appliquer un `OneHotEncoder` à `df[['state_name', 'county_name']]`
+Outliers are observations that significantly deviate from the general trend of other observations in a dataset. In other words, they are data points that stand out unusually from the overall data distribution.
+This may be due to data entry errors, respondents who incorrectly answered a survey, or simply extreme values that may bias a model too much.
 
-*Note : `scikit` optimise l'objet nécessaire pour stocker le résultat d'un modèle de transformation. Par exemple, le résultat de l'encoding One Hot est un objet très volumineux. Dans ce cas, `scikit` utilise une matrice Sparse.*
+For example, these could be 3 individuals measuring over 4 meters in height within a population or household incomes exceeding 10 million euros per month at a national level.
 
+It is good practice to routinely examine the distribution of available variables
+to check if some values deviate too significantly from others.
+Sometimes these values will interest us, for instance, if we are focusing solely on very high incomes (top 0.1%) in France. However, often these values will be more of a hindrance, especially if they don’t make sense in the real world.
+
+If we find that the presence of these extreme values or *outliers* in our dataset is more problematic than helpful,
+it is reasonable to simply remove them.
+Most of the time, we set a percentage of data to remove, such as 0.1%, 1%, or 5%,
+then remove the corresponding extreme values in both tails of the distribution.
+
+Several packages can perform these operations, which can become complex if we examine outliers across multiple variables.
+The `IsolationForest` estimator in the `sklearn.ensemble` module is particularly noteworthy.
 :::
 
-```{python}
-#1. Création de df
-df = votes.loc[
-  :,["state_name",'county_name']
-]
-```
 
-Si on regarde les _labels_ et leurs transpositions numériques via `LabelEncoder`
 
-```{python}
-#2. Appliquer un LabelEncoder à stat_name
-label_enc = preprocessing.LabelEncoder().fit(df['state_name'])
-np.column_stack((label_enc.transform(df['state_name']),df['state_name']))
-```
+:::: {.content-visible when-profile="fr"}
+## Exercice d'application
 
+::: {.caution}
+## Attention aux nouvelles modalités !
 
-```{python}
-# 3. dummy expansion de state_name
-pd.get_dummies(df['state_name'])
-```
+Les _transformers_ créent un _mapping_ entre des modalités textuelles et des valeurs numériques. Cela présuppose que les données sur lesquelles a été construit ce _mapping_ intègrent l'ensemble des valeurs possibles pour les modalités textuelles. 
 
-Si on regarde l'`OrdinalEncoder`:
+Néanmoins, si de nouvelles modalités apparaissent, le classifieur ne saura pas comment celles-ci doivent être transformées en valeurs numériques. Cela provoquera une erreur pour `Scikit`. Cette erreur technique est logique puisqu'il faudrait mettre à jour non seulement le _mapping_ mais aussi l'estimation des paramètres sous-jacents. 
 
-```{python}
-# 4. OrdinalEncoder
-ord_enc = preprocessing.OrdinalEncoder().fit(df)
-# ord_enc.transform(df[['state', 'county']])
-ord_enc.transform(df)[:,0]
-```
+:::
 
+::::
+
+:::: {.content-visible when-profile="en"}
+
+## Application exercise
+
+::: {.caution}
+## Be careful with new categories!
+
+Transformers create a mapping between text categories and numeric values. This assumes that the data used to build this mapping includes all possible values for the text categories.
+
+However, if new categories appear, the classifier will not know how to transform these into numeric values, which will cause an error in `Scikit`. This technical error makes sense, as it would require updating not only the mapping but also the estimation of underlying parameters.
+
+:::
+::::
+
+{{< include "01_preprocessing/_exo5.qmd" >}}
 
-```{python}
-# 5. OneHotEncoder
-onehot_enc = preprocessing.OneHotEncoder().fit(df)
-onehot_enc.transform(df)
-```
 
 
-# Références
+::: {.content-visible when-profile="fr"}
+# Références {.unnumbered}
+:::
+
+::: {.content-visible when-profile="en"}
+# References {.unnumbered}
+:::
 
 ::: {#refs}
 :::
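
Review note on the English *Normalization* section added in this commit: a minimal sketch of what "unit norm" means in practice, using the 70%/30% split described in the text. It is not the content of `_exo4.qmd`, which is not shown in this diff. The data is a synthetic stand-in (assumption), not the chapter's `df2`, and the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer

# Synthetic stand-in for the chapter's data (hypothetical values)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# 70% / 30% split, as described in the text
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Normalizer rescales each row (observation) independently; l2 is the default
scaler = Normalizer().fit(X_train)
X_train_l2 = scaler.transform(X_train)
X_test_l2 = scaler.transform(X_test)

# With the l2 norm, the squares of each row sum to 1
row_norms_l2 = np.linalg.norm(X_test_l2, axis=1)

# With norm="l1", the absolute values of each row sum to 1 instead
X_test_l1 = Normalizer(norm="l1").fit(X_train).transform(X_test)
row_sums_l1 = np.abs(X_test_l1).sum(axis=1)
```

Because the rescaling is done row by row, `Normalizer` learns nothing from the training sample, so the check holds on the held-out 30% as well as on the training rows.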