
Commit d6b6712

Translation of the NLP chapters (#603)

* index
* description
* nlp
* NLP exercises
* titles
* chap2
* One more
* translation of chap3

1 parent e161783 commit d6b6712

File tree

14 files changed: +2,473 −589 lines

_quarto-en.yml

Lines changed: 5 additions & 2 deletions

```diff
@@ -26,6 +26,9 @@ project:
     - content/modelisation/3_regression.qmd
     - content/modelisation/4_featureselection.qmd
     - content/modelisation/5_clustering.qmd
+    - content/NLP/index.qmd
+    - content/NLP/01_intro.qmd
+    - content/NLP/02_exoclean.qmd


 website:
@@ -44,7 +47,7 @@ website:
       <a href="#" onclick="this.href = location.href.replace('/en/', '/');" class="button-cta">
         <button class="btn"><i class="fa fa-language"></i> Passer à la version française 🇫🇷</button>
       </a>
-  sidebar:
+  sidebar:
   - id: introduction
     title: "Introduction"
     #collapse-level: 2
@@ -54,7 +57,7 @@ website:
     - content/getting-started/01_environment.qmd
     - content/getting-started/02_data_analysis.qmd
     - content/getting-started/03_revisions.qmd
-  - id: manipulation
+  - id: manipulation
   title: "Wrangle"
   #collapse-level: 2
   contents:
```

_quarto.yml

Lines changed: 1 addition & 1 deletion

```diff
@@ -5,11 +5,11 @@ project:
   - 404.qmd
   - content/getting-started/index.qmd
   - content/manipulation/index.qmd
-  - content/manipulation/04c_API_TP.qmd
   - content/visualisation/index.qmd
   - content/getting-started/01_environment.qmd
   - content/modelisation/index.qmd
   - content/NLP/index.qmd
+  - content/NLP/01_intro.qmd
   - content/annexes/corrections.qmd
   - content/annexes/evaluation.qmd
   - content/git/*.qmd
```

content/NLP/01_intro.qmd

Lines changed: 546 additions & 228 deletions
Large diffs are not rendered by default.

content/NLP/01_intro/exercise1.qmd

Lines changed: 31 additions & 0 deletions

:::: {.content-visible when-profile="fr"}

::: {.exercise}
## Exercice 1 : Fréquence d'un mot

Dans un premier temps, nous allons nous concentrer sur notre corpus anglo-saxon (`horror`).

1. Compter le nombre de phrases, pour chaque auteur, où apparaît le mot `fear`.
2. Utiliser `pywaffle` pour obtenir les graphiques ci-dessous qui résument de manière synthétique le nombre d'occurrences du mot *"fear"* par auteur.
3. Refaire l'analyse avec le mot *"horror"*.

:::

::::

:::: {.content-visible when-profile="en"}

::: {.exercise}
## Exercise 1: Word Frequency

First, we will focus on our Anglo-Saxon corpus (`horror`).

1. Count the number of sentences, for each author, in which the word `fear` appears.
2. Use `pywaffle` to generate the charts below that visually summarize the number of occurrences of the word *"fear"* by author.
3. Repeat the analysis with the word *"horror"*.

:::

::::
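For reference, a minimal solution sketch for questions 1 and 2, not part of this commit: it assumes the `horror` DataFrame used throughout the chapter, with an `Author` column (`HPL`, `EAP`, `MWS`) and a `Text` column holding one sentence per row, and it uses `pywaffle`'s `Waffle` figure class. The column names and the one-square-per-ten-sentences scaling are assumptions.

```python
# Hedged sketch: assumes a DataFrame `horror` with columns
# "Author" and "Text" (one sentence per row); names are assumptions.
import matplotlib.pyplot as plt
import pandas as pd
from pywaffle import Waffle

def count_word_by_author(df: pd.DataFrame, word: str) -> pd.Series:
    # Number of sentences per author containing `word`, case-insensitively
    contains_word = df["Text"].str.contains(word, case=False, regex=False)
    return df.loc[contains_word, "Author"].value_counts()

fear_counts = count_word_by_author(horror, "fear")

# One waffle square per ten matching sentences (illustrative scaling)
fig = plt.figure(
    FigureClass=Waffle,
    rows=10,
    values=(fear_counts // 10).to_dict(),
    title={"label": 'Sentences containing "fear", by author'},
    legend={"loc": "upper left", "bbox_to_anchor": (1, 1)},
)
plt.show()
```

Repeating the call with `"horror"` instead of `"fear"` covers question 3.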

content/NLP/01_intro/exercise2.qmd

Lines changed: 128 additions & 0 deletions

:::: {.content-visible when-profile="fr"}

::: {.exercise}
## Exercice 2 : Wordcloud

1. En utilisant la fonction `wordCloud`, faire trois nuages de mots pour représenter les mots les plus utilisés par chaque auteur du corpus `horror`[^random_state].
2. Faire un nuage de mots du corpus `dumas` en utilisant un masque comme celui-ci.

<details>
<summary>
Exemple de masque pour la question 2
</summary>

![URL de l'image : https://minio.lab.sspcloud.fr/lgaliana/generative-art/pythonds/book.png](https://minio.lab.sspcloud.fr/lgaliana/generative-art/pythonds/book.png)

</details>

[^random_state]: Pour avoir les mêmes résultats que ci-dessous, vous pouvez fixer l'argument `random_state=21`.

:::

::::

:::: {.content-visible when-profile="en"}

::: {.exercise}
## Exercise 2: Wordcloud

1. Using the `wordCloud` function, create three word clouds to represent the most commonly used words by each author in the `horror` corpus[^random_state].
2. Create a word cloud for the `dumas` corpus using a mask like the one below.

<details>
<summary>
Example mask for question 2
</summary>

![Image URL: https://minio.lab.sspcloud.fr/lgaliana/generative-art/pythonds/book.png](https://minio.lab.sspcloud.fr/lgaliana/generative-art/pythonds/book.png)

</details>

[^random_state]: To obtain the same results as shown below, you can set the argument `random_state=21`.

:::

::::
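The solution cells below are not self-contained: they rely on objects built earlier in the chapter, notably the `horror` DataFrame of sentences by Lovecraft (HPL), Poe (EAP), and Shelley (MWS), and the `dumas` string holding the text of *Le Comte de Monte-Cristo*. A minimal sketch of that setup might look like the following; the URLs and original column names are assumptions, not the book's exact code.

```python
# Hedged setup sketch: the URLs and column names are assumptions.
import pandas as pd
import requests

# "Spooky authors" dataset: one sentence per row, labelled HPL/EAP/MWS
url_spooky = (
    "https://raw.githubusercontent.com/GU4243-ADS/"
    "spring2018-project1-ginnyqg/master/data/spooky.csv"
)  # assumed location
horror = pd.read_csv(url_spooky).rename(
    columns={"author": "Author", "text": "Text"}  # assumed original names
)

# Dumas corpus: the novel as one plain-text string (hypothetical URL)
url_dumas = "https://www.gutenberg.org/files/17989/17989-0.txt"  # hypothetical
dumas = requests.get(url_dumas).text
```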
```{python}
# matplotlib is needed for the figure cells below
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# 1. Word clouds for the three authors
def graph_wordcloud(author, train_data, varname="Text"):
    # Concatenate all of the author's sentences into one string
    txt = train_data.loc[train_data["Author"] == author, varname]
    all_text = " ".join([text for text in txt])
    wordcloud = WordCloud(
        width=800,
        height=500,
        random_state=21,
        max_words=2000,
        background_color="white",
        colormap="Set2",
    ).generate(all_text)
    return wordcloud

n_topics = ["HPL", "EAP", "MWS"]
```
::: {.content-visible when-profile="fr"}
Les nuages de mots obtenus à la question 1 sont les suivants :
:::

::: {.content-visible when-profile="en"}
The word clouds generated for question 1 are as follows:
:::
```{python}
#| label: fig-wordcloud-spooky
#| layout-ncol: 2
#| fig-cap:
#| - "Lovecraft"
#| - "Poe"
#| - "Shelley"
for topic in n_topics:
    wordcloud = graph_wordcloud(topic, horror)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
```
```{python}
import io

import matplotlib.pyplot as plt
import numpy as np
import PIL.Image
import requests
import wordcloud

# Download the book image and turn it into a mask array
img = "https://minio.lab.sspcloud.fr/lgaliana/generative-art/pythonds/book.png"
book_mask = np.array(
    PIL.Image.open(io.BytesIO(requests.get(img).content))
)

def make_wordcloud(corpus):
    # White (255) pixels in the mask are masked out, so words are
    # drawn only inside the book silhouette
    wc = wordcloud.WordCloud(
        background_color="white",
        max_words=2000,
        mask=book_mask,
        contour_width=3,
        contour_color="steelblue",
    )
    wc.generate(corpus)
    return wc

wordcloud_dumas = make_wordcloud(dumas)
```
::: {.content-visible when-profile="fr"}
Alors que celui obtenu à partir de l'œuvre de Dumas prend la forme suivante :
:::

::: {.content-visible when-profile="en"}
Whereas the one generated from Dumas' work takes the following shape:
:::
```{python}
#| fig-cap: Nuage de mots produit à partir du Comte de Monte-Cristo
#| label: fig-wordcloud-dumas
plt.imshow(wordcloud_dumas, interpolation="bilinear")
plt.axis("off")
```

content/NLP/01_intro/exercise3.qmd

Lines changed: 92 additions & 0 deletions

:::: {.content-visible when-profile="fr"}

::: {.exercise}
## Exercice 3 : Nettoyage du texte

1. Reprendre l'ouvrage de Dumas et nettoyer celui-ci avec `spaCy`. Refaire le nuage de mots et conclure.
2. Faire ce même exercice sur le jeu de données anglo-saxon. Idéalement, vous devriez être en mesure d'utiliser la fonctionnalité de _pipeline_ de `spaCy`.

:::

::::

:::: {.content-visible when-profile="en"}

::: {.exercise}
## Exercise 3: Text Cleaning

1. Take Dumas' work and clean it using `spaCy`. Generate the word cloud again and draw your conclusions.
2. Perform the same task on the Anglo-Saxon dataset. Ideally, you should be able to use the `spaCy` _pipeline_ functionality.

:::

::::
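The solution cells below also assume two spaCy pipelines loaded earlier in the chapter under the names `nlp` (French, applied to `dumas`) and `nlp_english` (applied to `horror`). A plausible setup, with assumed model names, is:

```python
# Hedged sketch: the model names are assumptions; they must be
# downloaded beforehand, e.g.
#   python -m spacy download fr_core_news_sm
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("fr_core_news_sm")         # French pipeline, for `dumas`
nlp_english = spacy.load("en_core_web_sm")  # English pipeline, for `horror`
```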
```{.python include="cleantext.py"}
```

```{python}
# Replace the version of clean_text defined in cleantext.py
del clean_text

def clean_text(doc):
    # Tokenize, remove stop words and punctuation, and lemmatize
    cleaned_tokens = [
        token.lemma_ for token in doc if not token.is_stop and not token.is_punct
    ]
    # Join tokens back into a single string
    cleaned_text = " ".join(cleaned_tokens)
    return cleaned_text
```
```{python}
#| label: clean-stopwords-dumas
# Process the text with spaCy
doc = nlp(
    dumas[:30000],
    disable=['ner', 'textcat']
)

# Clean the text
cleaned_dumas = clean_text(doc)
```

```{python}
wordcloud_dumas_nostop = make_wordcloud(cleaned_dumas)
```
Ces retraitements commencent à porter leurs fruits puisque des mots ayant plus de sens commencent à se dégager, notamment les noms des personnages (Dantès, Danglars, etc.) :

```{python}
#| fig-cap: Nuage de mots produit à partir du Comte de Monte-Cristo après nettoyage
#| label: fig-wordcloud-dumas-nostop
plt.imshow(wordcloud_dumas_nostop, interpolation="bilinear")
plt.axis("off")
```
```{python}
#| label: clean-stopwords-horror
# Question 2: clean every sentence of the corpus with the spaCy pipeline
docs = nlp_english.pipe(horror["Text"])
cleaned_texts = [clean_text(doc) for doc in docs]
horror['preprocessed_text'] = cleaned_texts
```
```{python}
#| fig-cap:
#| - "Lovecraft"
#| - "Poe"
#| - "Shelley"
#| label: fig-wordcloud-spooky-nostop
fig = plt.figure(figsize=(15, 12))
for topic in n_topics:
    wordcloud = graph_wordcloud(topic, horror, "preprocessed_text")
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
```

content/NLP/01_intro/exercise4.qmd

Whitespace-only changes.
