
Commit d6b6712

Translation of the NLP chapters (#603)

* index
* description
* nlp
* NLP exercises
* titles
* chap2
* One more
* translation of chap3

1 parent e161783 commit d6b6712

File tree

14 files changed: +2,473 −589 lines

_quarto-en.yml

Lines changed: 5 additions & 2 deletions

```diff
@@ -26,6 +26,9 @@ project:
     - content/modelisation/3_regression.qmd
     - content/modelisation/4_featureselection.qmd
     - content/modelisation/5_clustering.qmd
+    - content/NLP/index.qmd
+    - content/NLP/01_intro.qmd
+    - content/NLP/02_exoclean.qmd


 website:
@@ -44,7 +47,7 @@ website:
       <a href="#" onclick="this.href = location.href.replace('/en/', '/');" class="button-cta">
         <button class="btn"><i class="fa fa-language"></i> Passer à la version française 🇫🇷</button>
       </a>
-  sidebar:
+  sidebar:
   - id: introduction
     title: "Introduction"
     #collapse-level: 2
@@ -54,7 +57,7 @@ website:
     - content/getting-started/01_environment.qmd
     - content/getting-started/02_data_analysis.qmd
     - content/getting-started/03_revisions.qmd
-  - id: manipulation
+  - id: manipulation
   title: "Wrangle"
   #collapse-level: 2
   contents:
```

_quarto.yml

Lines changed: 1 addition & 1 deletion

```diff
@@ -5,11 +5,11 @@ project:
   - 404.qmd
   - content/getting-started/index.qmd
   - content/manipulation/index.qmd
-  - content/manipulation/04c_API_TP.qmd
   - content/visualisation/index.qmd
   - content/getting-started/01_environment.qmd
   - content/modelisation/index.qmd
   - content/NLP/index.qmd
+  - content/NLP/01_intro.qmd
   - content/annexes/corrections.qmd
   - content/annexes/evaluation.qmd
   - content/git/*.qmd
```

content/NLP/01_intro.qmd

Lines changed: 546 additions & 228 deletions
Large diffs are not rendered by default.

content/NLP/01_intro/exercise1.qmd

Lines changed: 31 additions & 0 deletions

:::: {.content-visible when-profile="fr"}

::: {.exercise}
## Exercice 1 : Fréquence d'un mot

Dans un premier temps, nous allons nous concentrer sur notre corpus anglo-saxon (`horror`).

1. Compter le nombre de phrases, pour chaque auteur, où apparaît le mot `fear`.
2. Utiliser `pywaffle` pour obtenir les graphiques ci-dessous qui résument de manière synthétique le nombre d'occurrences du mot *"fear"* par auteur.
3. Refaire l'analyse avec le mot *"horror"*.

:::

::::

:::: {.content-visible when-profile="en"}

::: {.exercise}
## Exercise 1: Word Frequency

First, we will focus on our Anglo-Saxon corpus (`horror`).

1. Count the number of sentences, for each author, in which the word `fear` appears.
2. Use `pywaffle` to generate the charts below that visually summarize the number of occurrences of the word *"fear"* by author.
3. Repeat the analysis with the word *"horror"*.

:::

::::
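For reference, a minimal solution sketch for questions 1 and 2, not part of this commit: it assumes the `horror` DataFrame used throughout the chapter, with an `Author` column (`HPL`, `EAP`, `MWS`) and a `Text` column holding one sentence per row, and it uses `pywaffle`'s `Waffle` figure class. The column names and the one-square-per-ten-sentences scaling are assumptions.

```python
# Hedged sketch: assumes a DataFrame `horror` with columns
# "Author" and "Text" (one sentence per row); names are assumptions.
import matplotlib.pyplot as plt
import pandas as pd
from pywaffle import Waffle

def count_word_by_author(df: pd.DataFrame, word: str) -> pd.Series:
    # Number of sentences per author containing `word`, case-insensitively
    contains_word = df["Text"].str.contains(word, case=False, regex=False)
    return df.loc[contains_word, "Author"].value_counts()

fear_counts = count_word_by_author(horror, "fear")

# One waffle square per ten matching sentences (illustrative scaling)
fig = plt.figure(
    FigureClass=Waffle,
    rows=10,
    values=(fear_counts // 10).to_dict(),
    title={"label": 'Sentences containing "fear", by author'},
    legend={"loc": "upper left", "bbox_to_anchor": (1, 1)},
)
plt.show()
```

Repeating the call with `"horror"` instead of `"fear"` covers question 3.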

content/NLP/01_intro/exercise2.qmd

Lines changed: 128 additions & 0 deletions

:::: {.content-visible when-profile="fr"}

::: {.exercise}
## Exercice 2 : Wordcloud

1. En utilisant la fonction `wordCloud`, faire trois nuages de mots pour représenter les mots les plus utilisés par chaque auteur du corpus `horror`[^random_state].
2. Faire un nuage de mots du corpus `dumas` en utilisant un masque comme celui-ci.

<details>
<summary>
Exemple de masque pour la question 2
</summary>

![URL de l'image : https://minio.lab.sspcloud.fr/lgaliana/generative-art/pythonds/book.png](https://minio.lab.sspcloud.fr/lgaliana/generative-art/pythonds/book.png)

</details>

[^random_state]: Pour avoir les mêmes résultats que ci-dessous, vous pouvez fixer l'argument `random_state=21`.

:::

::::

:::: {.content-visible when-profile="en"}

::: {.exercise}
## Exercise 2: Wordcloud

1. Using the `wordCloud` function, create three word clouds to represent the most commonly used words by each author in the `horror` corpus[^random_state].
2. Create a word cloud for the `dumas` corpus using a mask like the one below.

<details>
<summary>
Example mask for question 2
</summary>

![Image URL: https://minio.lab.sspcloud.fr/lgaliana/generative-art/pythonds/book.png](https://minio.lab.sspcloud.fr/lgaliana/generative-art/pythonds/book.png)

</details>

[^random_state]: To obtain the same results as shown below, you can set the argument `random_state=21`.

:::

::::
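The solution cells below are not self-contained: they rely on objects built earlier in the chapter, notably the `horror` DataFrame of sentences by Lovecraft (HPL), Poe (EAP), and Shelley (MWS), and the `dumas` string holding the text of *Le Comte de Monte-Cristo*. A minimal sketch of that setup might look like the following; the URLs and original column names are assumptions, not the book's exact code.

```python
# Hedged setup sketch: the URLs and column names are assumptions.
import pandas as pd
import requests

# "Spooky authors" dataset: one sentence per row, labelled HPL/EAP/MWS
url_spooky = (
    "https://raw.githubusercontent.com/GU4243-ADS/"
    "spring2018-project1-ginnyqg/master/data/spooky.csv"
)  # assumed location
horror = pd.read_csv(url_spooky).rename(
    columns={"author": "Author", "text": "Text"}  # assumed original names
)

# Dumas corpus: the novel as one plain-text string (hypothetical URL)
url_dumas = "https://www.gutenberg.org/files/17989/17989-0.txt"  # hypothetical
dumas = requests.get(url_dumas).text
```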
```{python}
# matplotlib is needed for the figure cells below
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# 1. Word clouds for the three authors
def graph_wordcloud(author, train_data, varname="Text"):
    # Concatenate all of the author's sentences into one string
    txt = train_data.loc[train_data["Author"] == author, varname]
    all_text = " ".join([text for text in txt])
    wordcloud = WordCloud(
        width=800,
        height=500,
        random_state=21,
        max_words=2000,
        background_color="white",
        colormap="Set2",
    ).generate(all_text)
    return wordcloud

n_topics = ["HPL", "EAP", "MWS"]
```
::: {.content-visible when-profile="fr"}
Les nuages de mots obtenus à la question 1 sont les suivants :
:::

::: {.content-visible when-profile="en"}
The word clouds generated for question 1 are as follows:
:::
```{python}
#| label: fig-wordcloud-spooky
#| layout-ncol: 2
#| fig-cap:
#| - "Lovecraft"
#| - "Poe"
#| - "Shelley"
for topic in n_topics:
    wordcloud = graph_wordcloud(topic, horror)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
```
```{python}
import io

import matplotlib.pyplot as plt
import numpy as np
import PIL.Image
import requests
import wordcloud

# Download the book image and turn it into a mask array
img = "https://minio.lab.sspcloud.fr/lgaliana/generative-art/pythonds/book.png"
book_mask = np.array(
    PIL.Image.open(io.BytesIO(requests.get(img).content))
)

def make_wordcloud(corpus):
    # White (255) pixels in the mask are masked out, so words are
    # drawn only inside the book silhouette
    wc = wordcloud.WordCloud(
        background_color="white",
        max_words=2000,
        mask=book_mask,
        contour_width=3,
        contour_color="steelblue",
    )
    wc.generate(corpus)
    return wc

wordcloud_dumas = make_wordcloud(dumas)
```
::: {.content-visible when-profile="fr"}
Alors que celui obtenu à partir de l'œuvre de Dumas prend la forme suivante :
:::

::: {.content-visible when-profile="en"}
Whereas the one generated from Dumas' work takes the following shape:
:::
```{python}
#| fig-cap: Nuage de mots produit à partir du Comte de Monte-Cristo
#| label: fig-wordcloud-dumas
plt.imshow(wordcloud_dumas, interpolation="bilinear")
plt.axis("off")
```

content/NLP/01_intro/exercise3.qmd

Lines changed: 92 additions & 0 deletions

:::: {.content-visible when-profile="fr"}

::: {.exercise}
## Exercice 3 : Nettoyage du texte

1. Reprendre l'ouvrage de Dumas et nettoyer celui-ci avec `spaCy`. Refaire le nuage de mots et conclure.
2. Faire ce même exercice sur le jeu de données anglo-saxon. Idéalement, vous devriez être en mesure d'utiliser la fonctionnalité de _pipeline_ de `spaCy`.

:::

::::

:::: {.content-visible when-profile="en"}

::: {.exercise}
## Exercise 3: Text Cleaning

1. Take Dumas' work and clean it using `spaCy`. Generate the word cloud again and draw your conclusions.
2. Perform the same task on the Anglo-Saxon dataset. Ideally, you should be able to use the `spaCy` _pipeline_ functionality.

:::

::::
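The solution cells below also assume two spaCy pipelines loaded earlier in the chapter under the names `nlp` (French, applied to `dumas`) and `nlp_english` (applied to `horror`). A plausible setup, with assumed model names, is:

```python
# Hedged sketch: the model names are assumptions; they must be
# downloaded beforehand, e.g.
#   python -m spacy download fr_core_news_sm
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("fr_core_news_sm")         # French pipeline, for `dumas`
nlp_english = spacy.load("en_core_web_sm")  # English pipeline, for `horror`
```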
```{.python include="cleantext.py"}
```

```{python}
# Replace the version of clean_text defined in cleantext.py
del clean_text

def clean_text(doc):
    # Tokenize, remove stop words and punctuation, and lemmatize
    cleaned_tokens = [
        token.lemma_ for token in doc if not token.is_stop and not token.is_punct
    ]
    # Join tokens back into a single string
    cleaned_text = " ".join(cleaned_tokens)
    return cleaned_text
```
```{python}
#| label: clean-stopwords-dumas
# Process the text with spaCy
doc = nlp(
    dumas[:30000],
    disable=['ner', 'textcat']
)

# Clean the text
cleaned_dumas = clean_text(doc)
```

```{python}
wordcloud_dumas_nostop = make_wordcloud(cleaned_dumas)
```
Ces retraitements commencent à porter leurs fruits puisque des mots ayant plus de sens commencent à se dégager, notamment les noms des personnages (Dantès, Danglars, etc.) :

```{python}
#| fig-cap: Nuage de mots produit à partir du Comte de Monte-Cristo après nettoyage
#| label: fig-wordcloud-dumas-nostop
plt.imshow(wordcloud_dumas_nostop, interpolation="bilinear")
plt.axis("off")
```
```{python}
#| label: clean-stopwords-horror
# Question 2: clean every sentence of the corpus with the spaCy pipeline
docs = nlp_english.pipe(horror["Text"])
cleaned_texts = [clean_text(doc) for doc in docs]
horror['preprocessed_text'] = cleaned_texts
```
```{python}
#| fig-cap:
#| - "Lovecraft"
#| - "Poe"
#| - "Shelley"
#| label: fig-wordcloud-spooky-nostop
fig = plt.figure(figsize=(15, 12))
for topic in n_topics:
    wordcloud = graph_wordcloud(topic, horror, "preprocessed_text")
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
```

content/NLP/01_intro/exercise4.qmd

Whitespace-only changes.
