# <center>Natural Language Processing Hands-on # 1</center>
<center><span style="font-weight: bold; font-size: 1.8rem;">Representing words and sentences</span></center>

Since most existing algorithms are designed to handle numerical data, they cannot be applied directly to raw text. However, it is definitely possible to convert a text to a numerical representation.

Ideal representations should handle **semantics**, **polysemy**, **irony** and many other specificities of texts. Over the decades, many text representations have been introduced to handle as many of them as possible.

In this hands-on, you will have to convert a given corpus of texts to various representations and highlight their pros / cons.

# Installation of required packages

The packages listed below should be installed. Using a virtual environment is highly recommended but not mandatory -- that is just good practice.

In [1]:
%pip install gensim nltk numpy pandas scipy scikit-learn


In [2]:
import random
import string
import itertools


from pprint import pprint as pp

import gensim
import nltk
import numpy as np
import pandas as pd
import scipy as sp

from gensim.models import KeyedVectors
from gensim.test.utils import datapath
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA

# Download the Reuters dataset (a no-op if it is already present)
nltk.download('reuters')
from nltk.corpus import reuters

START_TOKEN = '<START>'
END_TOKEN = '<END>'

np.random.seed(0)
random.seed(0)

[nltk_data] Downloading package reuters to
[nltk_data]     /Users/lucasvalquenart/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


**Note**: In NLP, we often add `<START>` and `<END>` tokens to represent the beginning and end of sentences, paragraphs or documents.

# Part 0 - Exploring the dataset

The Reuters Corpus that we will use contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 categories.

Before diving into word representations, let's explore it a little bit and simply preprocess its texts to make it more suitable.

---

We will need to standardize all texts before converting them to numerical representations, since doing so reduces the vocabulary size. Modify the following function to:

* Add the `START_TOKEN` and `END_TOKEN` at the beginning and end of each document
* Lowercase every word
* Remove the punctuation from each document

In [3]:
def read_corpus(category="tea", add_token=True):
    """ Read files from the specified Reuters category.
        Params:
            category (string): category name
            add_token (bool): whether to wrap each document in START_TOKEN / END_TOKEN
        Return:
            list of lists, with words from each of the processed files
    """
    files = reuters.fileids(category)

    # Lowercase every word and strip punctuation characters
    table = str.maketrans("", "", string.punctuation)
    corpus = [[w.lower().translate(table) for w in reuters.words(f)] for f in files]
    # Drop tokens that became empty (pure punctuation) and purely numeric tokens
    corpus = [[word for word in doc if word and not word.isnumeric()] for doc in corpus]

    # Add the start / end tokens if requested
    if add_token:
        corpus = [[START_TOKEN] + doc + [END_TOKEN] for doc in corpus if doc]
    return corpus

In [4]:
corpus = read_corpus()

In [5]:
#pp(corpus[:2], compact=True)

# Part I - Word representations

Now that we have preprocessed our texts we can represent them using vectors, also called embeddings in this case.

*Note: the preprocessing done here is basic. We will see in another hands-on different preprocessing steps, including some suitable for frequentist approaches.*

---

Each word representation has its pros and cons: understanding them will help you in finding the best representation that suits your use case.

As a result, you will have to implement / load and analyze the behaviour of word vectors coming from:

* Dummy encoding
* Co-occurrence matrix encoding
* Pretrained GloVe encoding

## Dummy encoding

The dummy encoding consists in encoding each individual word of our corpus into a vector filled with 0 except at one specific position where the value is 1 (equivalent to the one-hot encoding of categorical variables).

As discussed during the course, those embeddings are kind of pointless since they don't handle a single element of the ideal word representation beside being actual vectors. They are however a good starting point to play around with texts.

---

Define a function converting the words of a corpus to a set of dummy encoded vectors. Do not forget to sort your vocabulary before assigning the vectors!

In [6]:
def dummy_encode(corpus):
    """One-hot encoding of a set of texts.
    
    Params:
        corpus (list of lists): list of tokenized documents, where each document 
                                is a list of words
    Return:
        dict: dictionary mapping each unique word to its one-hot encoded vector
    """
    # Extract all unique words from the corpus and sort them alphabetically
    words = sorted(set(itertools.chain.from_iterable(corpus)))
    embedding = np.eye(len(words))

    # Row i of the identity matrix is the one-hot vector of the i-th word
    return {word: embedding[i] for i, word in enumerate(words)}

In [7]:
#pp(dummy_encode(corpus[:2]), compact=True)

If you still do not believe that this representation is pointless, try finding the most similar word to "cat" using it.
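
To make the point concrete, here is a minimal sketch on a made-up toy corpus (the corpus and variable names are illustrative assumptions, not part of the hands-on). It builds one-hot vectors the same way as above and compares "cat" with every other word using the dot product.

```python
import itertools
import numpy as np

# Toy corpus (hypothetical, for illustration only)
toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]

# Same construction as a one-hot encoder: sorted vocabulary, identity matrix rows
vocab = sorted(set(itertools.chain.from_iterable(toy_corpus)))
one_hot = {word: row for word, row in zip(vocab, np.eye(len(vocab)))}

# Dot product of "cat" with every word of the vocabulary
similarities = {word: float(one_hot["cat"] @ vec) for word, vec in one_hot.items()}
print(similarities)
```

Every word other than "cat" itself gets a similarity of 0, so the ranking carries no information: "dog" is exactly as far from "cat" as "the" is.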

## Co-occurrence matrix encoding

*This section comes from Stanford's NLP hands-on*

A co-occurrence matrix counts how often things co-occur in some environment. Given some word $w_i$ occurring in the document, we consider the context window surrounding $w_i$. Supposing our fixed window size is $n$, then this is the $n$ preceding and $n$ subsequent words in that document, i.e. words $w_{i-n} \dots w_{i-1}$ and $w_{i+1} \dots w_{i+n}$. We build a co-occurrence matrix $M$, which is a symmetric word-by-word matrix in which $M_{ij}$ is the number of times $w_j$ appears inside $w_i$'s window among all documents.

**Example: Co-Occurrence with Fixed Window of n=1:**

* Document 1: "all that glitters is not gold"

* Document 2: "all is well that ends well"

|        * 	| <START> 	| all 	| that 	| glitters 	| is 	| not 	| gold 	| well 	| ends 	| <END> 	|
|---------:	|--------:	|----:	|-----:	|---------:	|---:	|----:	|-----:	|-----:	|-----:	|------:	|
|  <START> 	|       0 	|   2 	|    0 	|        0 	|  0 	|   0 	|    0 	|    0 	|    0 	|     0 	|
|      all 	|       2 	|   0 	|    1 	|        0 	|  1 	|   0 	|    0 	|    0 	|    0 	|     0 	|
|     that 	|       0 	|   1 	|    0 	|        1 	|  0 	|   0 	|    0 	|    1 	|    1 	|     0 	|
| glitters 	|       0 	|   0 	|    1 	|        0 	|  1 	|   0 	|    0 	|    0 	|    0 	|     0 	|
|       is 	|       0 	|   1 	|    0 	|        1 	|  0 	|   1 	|    0 	|    1 	|    0 	|     0 	|
|      not 	|       0 	|   0 	|    0 	|        0 	|  1 	|   0 	|    1 	|    0 	|    0 	|     0 	|
|     gold 	|       0 	|   0 	|    0 	|        0 	|  0 	|   1 	|    0 	|    0 	|    0 	|     1 	|
|     well 	|       0 	|   0 	|    1 	|        0 	|  1 	|   0 	|    0 	|    0 	|    1 	|     1 	|
|     ends 	|       0 	|   0 	|    1 	|        0 	|  0 	|   0 	|    0 	|    1 	|    0 	|     0 	|
|    <END> 	|       0 	|   0 	|    0 	|        0 	|  0 	|   0 	|    1 	|    1 	|    0 	|     0 	|
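
The counts in this table can be reproduced with a few lines of standard-library Python. This is only a sketch for the two example documents: the helper name `count_pairs` is an assumption made for illustration, and it stores counts in a `Counter` rather than in a matrix.

```python
from collections import Counter

START_TOKEN, END_TOKEN = "<START>", "<END>"
docs = [
    "all that glitters is not gold".split(),
    "all is well that ends well".split(),
]
docs = [[START_TOKEN] + d + [END_TOKEN] for d in docs]

def count_pairs(documents, window_size=1):
    """Count (center, context) pairs on both sides of each center word."""
    counts = Counter()
    for doc in documents:
        for i, center in enumerate(doc):
            lo = max(0, i - window_size)
            hi = min(len(doc), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(center, doc[j])] += 1
    return counts

pairs = count_pairs(docs, window_size=1)
print(pairs[(START_TOKEN, "all")])   # entry of row <START>, column all
print(pairs[("that", "well")])       # entry of row that, column well
```

Because every pair is counted once from each side, `pairs[(a, b)]` always equals `pairs[(b, a)]`, which is the symmetry visible in the table.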

The rows (or columns) of this matrix provide one type of word vectors (those based on word-word co-occurrence), but the vectors will be large in general (linear in the number of distinct words in a corpus). Thus, our next step is to run dimensionality reduction. In particular, we will run *SVD* (Singular Value Decomposition), which is a kind of generalized *PCA* (Principal Components Analysis), to select the top $k$ principal components.

Reducing the dimensionality of such vectors largely preserves the semantic relationships between words, provided $k$ is not too small. Hence, *movie* should still be closer to *theater* than to *airplane*.

### Identify distinct words

Define a function that will return a list of unique words of the corpus as well as its size.

In [8]:
def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)
            num_corpus_words (integer): number of distinct words across the corpus
    """
    # Extract the unique words and sort them
    corpus_words = sorted(set(itertools.chain.from_iterable(corpus)))
    num_corpus_words = len(corpus_words)
    return corpus_words, num_corpus_words

In [9]:
#print(distinct_words(corpus[:2]))

### Compute the co-occurrence matrix

Write a method that constructs a co-occurrence matrix for a certain window-size  $n$  (with a default of $4$), considering words  $n$  before and  $n$  after the word in the center of the window.

In [10]:
def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).

        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.

              For example, if we take the document "<START> All that glitters is not gold <END>" with window size of 4,
              "All" will co-occur with "<START>", "that", "glitters", "is", and "not".

        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (a symmetric numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)):
                Co-occurrence matrix of word counts.
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2Ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, num_words = distinct_words(corpus)
    M = np.zeros((num_words, num_words))
    # Map each word to its row / column index in M
    word2Ind = {word: idx for idx, word in enumerate(words)}

    # Count every (center, context) pair on both sides of the center word,
    # which makes M symmetric
    for doc in corpus:
        for i, word in enumerate(doc):
            start = max(0, i - window_size)
            end = min(len(doc), i + window_size + 1)
            for j in range(start, end):
                if j != i:
                    M[word2Ind[word], word2Ind[doc[j]]] += 1

    return M, word2Ind

In [11]:
M, word2Ind = compute_co_occurrence_matrix(corpus, window_size=4)

### Reduce the dimensionality of the co-occurrence matrix

Construct a method that performs dimensionality reduction on the matrix to produce $k$-dimensional embeddings. Use *SVD* to take the top $k$ components and produce a new matrix of $k$-dimensional embeddings.

In our case, we will set $k=2$.

In [12]:
def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

        Params:
            M (numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """
    n_iters = 10
    svd = TruncatedSVD(n_components=k, n_iter=n_iters, random_state=42)
    # fit_transform fits the decomposition and projects M in a single call
    M_reduced = svd.fit_transform(M)
    return M_reduced

In [13]:
M_reduced_co_occurrence = reduce_to_k_dim(M)
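
The docstring's remark that the reduction returns $U \cdot S$ can be checked on a small example. The sketch below uses plain `numpy.linalg.svd` on a made-up symmetric matrix (`toy_M` is hypothetical data, not the Reuters matrix) and shows that keeping the first $k$ columns of $U$ scaled by the singular values is the same as projecting the matrix onto its top $k$ right singular vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
# Small symmetric count-like matrix (hypothetical data, for illustration only)
A = rng.integers(0, 5, size=(6, 6)).astype(float)
toy_M = A + A.T

k = 2
U, S, Vt = np.linalg.svd(toy_M, full_matrices=False)

# Two equivalent ways of writing the k-dimensional embeddings
embeddings_us = U[:, :k] * S[:k]       # U_k scaled by the top-k singular values
embeddings_proj = toy_M @ Vt[:k].T     # projection onto the top-k right singular vectors

print(np.allclose(embeddings_us, embeddings_proj))
```

A truncated, randomized solver such as `TruncatedSVD` approximates the same quantity, possibly with flipped signs on some components.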

---

Great! You now have fixed-size vectors that represent each word of your corpus. Let's normalize our matrix to compare our vectors easily.

In [14]:
M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)
# Guard against all-zero rows, which would otherwise produce NaN values
M_lengths[M_lengths == 0] = 1.0
M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis]


Since we are working with vectors, we can easily measure the similarity between them using the dot product. Hence, given a specific word and its embedding, we can easily identify the most similar words in the corpus!

*Note: because the vectors were normalized to unit length, the dot product between two of them is their cosine similarity: it is bounded between -1 and 1, and the dot product of a vector with itself is equal to 1.*

Define a function that, given a word $w$, identifies its most similar words in your embedding space.

In [15]:
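def most_similar(word, matrix, topn=10):
    """Return the row indices of the embeddings closest to the queried one.

    `word` is the embedding vector of the query word (e.g. matrix[word2Ind[w]]);
    `matrix` holds one unit-length embedding per row.
    """
    # With unit-length rows, the dot product equals the cosine similarity
    similarity = np.dot(matrix, word)
    # Sort indices by decreasing similarity
    ranked_words = np.argsort(similarity)[::-1]
    return ranked_words[:topn]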
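
Since the function above works with vectors and row indices, a small wrapper is needed to go from a word to a list of words. The sketch below is one possible way to do it; `most_similar_words`, `toy_word2ind` and `toy_vectors` are hypothetical names and data introduced only for this example.

```python
import numpy as np

def most_similar_words(query, matrix, word2ind, topn=3):
    """Hypothetical helper: rank vocabulary words by dot product with `query`."""
    ind2word = {idx: word for word, idx in word2ind.items()}
    scores = matrix @ matrix[word2ind[query]]
    ranked = np.argsort(scores)[::-1]
    # Skip the query itself, which typically ranks first for unit-length rows
    hits = [(ind2word[i], float(scores[i])) for i in ranked if ind2word[i] != query]
    return hits[:topn]

# Toy unit-length embeddings (made up for illustration)
toy_word2ind = {"coffee": 0, "tea": 1, "ship": 2}
toy_vectors = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])

result = most_similar_words("tea", toy_vectors, toy_word2ind, topn=2)
print(result)
```

With the notebook's own objects, the same idea would apply to `M_normalized` and `word2Ind`.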

## GloVe encoding

Word2Vec models are predictive by nature, since they are trained as neural networks to predict words from their context. However, this is not the only way to learn geometrical encodings (vectors) of words from their co-occurrence information (how frequently they appear together in large text corpora).

GloVe is a count-based model: it learns its vectors by essentially performing dimensionality reduction on the co-occurrence count matrix. Does it remind you of something? Yes, that is very close in spirit to what you did above.
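
For reference, the objective minimized by GloVe, as stated in the original GloVe paper (Pennington et al., 2014), can be written as below. The notation is not used elsewhere in this hands-on: $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, $V$ is the vocabulary size, and $f$ is a weighting function that damps very frequent pairs.

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$

In other words, GloVe fits low-dimensional vectors whose dot products approximate the logarithm of the counts, whereas the SVD approach above factorizes the raw counts directly.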

Building models is time consuming. Hence, GloVe / Word2Vec models already trained on regular training sets (Wikipedia, News, etc.) are publicly shared to be reused easily.

The code below loads a GloVe model trained on Wikipedia and lets us easily inspect the properties of its embeddings.

In [16]:
import gensim.downloader as api

In [17]:
pp(list(api.info()["models"].keys()))

['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']


In [18]:
model = api.load('glove-wiki-gigaword-50')








You can get the most similar embeddings to those of a given set of words.
Here, we retrieve the most similar words to fox, rabbit and cat.

In [19]:
pp(model.most_similar(positive=["fox", "rabbit", "cat"]), compact=True)

[('dog', 0.8569530248641968), ('mouse', 0.7859790325164795),
 ('monster', 0.7710845470428467), ('wolf', 0.7690606117248535),
 ('bunny', 0.7655251026153564), ('spider', 0.7395666241645813),
 ('duck', 0.7366621494293213), ('rat', 0.7366542220115662),
 ('beast', 0.7319128513336182), ('cartoon', 0.724099338054657)]


You can also perform concept additions / subtractions. For instance, you can compute king - man + woman:

In [20]:
model.most_similar(positive=["king", "woman"], negative=["man"])

[('queen', 0.8523604273796082),
 ('throne', 0.7664334177970886),
 ('prince', 0.759214460849762),
 ('daughter', 0.7473882436752319),
 ('elizabeth', 0.7460219860076904),
 ('princess', 0.7424570322036743),
 ('kingdom', 0.7337412238121033),
 ('monarch', 0.7214491367340088),
 ('eldest', 0.7184861898422241),
 ('widow', 0.7099431157112122)]
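
The arithmetic behind this query can be illustrated with hand-made vectors. The sketch below is a simplification under stated assumptions: the five 2-dimensional vectors are invented for the example, and the ranking is a plain cosine similarity that excludes the query words, which only approximates what gensim does internally.

```python
import numpy as np

# Hand-made 2-d vectors (hypothetical): axis 0 ~ "royalty", axis 1 ~ "gender"
toy = {
    "king":  np.array([0.9,  0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1,  0.8]),
    "woman": np.array([0.1, -0.8]),
    "apple": np.array([-0.7, 0.1]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman
target = toy["king"] - toy["man"] + toy["woman"]

# Rank the remaining words by cosine similarity to the target
query_words = {"king", "man", "woman"}
candidates = {w: cosine(target, v) for w, v in toy.items() if w not in query_words}
best = max(candidates, key=candidates.get)
print(best)
```

Removing the "man" direction and adding the "woman" direction moves the "king" vector along the gender axis while keeping the royalty axis unchanged, which is why "queen" comes out on top in this toy setting.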

### Explore those embeddings and comment on how they help to identify / handle:

* Synonyms
* Antonyms
* Grammatical errors
* Polysemy
* Irony
* Analogies

* **Synonyms**: Word embeddings naturally group synonyms together because they appear in similar contexts. For example, "big" and "large" will have similar vectors since they're used interchangeably in text. The `most_similar()` function easily retrieves synonyms.

* **Antonyms**: Antonyms often have similar embeddings too! Words like "hot" and "cold" appear in similar contexts ("the weather is hot/cold"), so their vectors are close. Embeddings alone cannot distinguish synonyms from antonyms without additional context.

* **Grammatical errors**: Embeddings don't directly detect grammatical errors. They focus on semantic meaning, not grammar. However, they can help suggest corrections if a misspelled word is close to a correct word in the embedding space.

* **Polysemy**: Embeddings struggle with polysemy. For example, "mouse" (computer vs. animal) gets a single vector that blends all of its uses.

* **Irony**: Embeddings cannot detect irony. Irony depends on context and tone, which static word vectors don't capture.

* **Analogies**: The vector arithmetic (king - man + woman ≈ queen) demonstrates that semantic relationships are encoded geometrically. The model captures analogical reasoning through vector operations.

---

### Where do you think those biases come from?

Biases in word embeddings come directly from the training data.

# Part II - Sentence representations

## TF-IDF

TF-IDF is a weighting scheme that aims at finding the most significant words in each document by comparing their in-document frequency to the overall frequency of that term in the whole corpus.
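
As a worked example, here is a minimal standard-library sketch of the textbook variant, where the weight of a term is its count in the document multiplied by $\log(N / \mathrm{df})$, with $N$ the number of documents and $\mathrm{df}$ the number of documents containing the term. The toy documents and the `tf_idf` helper are assumptions made for illustration.

```python
import math
from collections import Counter

# Toy tokenized documents (made up for illustration)
toy_docs = [
    ["tea", "exports", "rose"],
    ["tea", "prices", "fell"],
    ["tea", "tea", "harvest"],
]

n_docs = len(toy_docs)
# Document frequency: number of documents containing each term
df = Counter(term for doc in toy_docs for term in set(doc))

def tf_idf(doc):
    """Textbook variant: raw term count times log(N / df)."""
    tf = Counter(doc)
    return {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}

weights = [tf_idf(doc) for doc in toy_docs]
print(weights[0])
print(weights[2])
```

"tea" appears in every document, so its weight is 0 everywhere, even in the third document where it occurs twice; words specific to one document get the highest weights. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this sketch.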

---

Convert the reuters corpus (at least one category) to its TF-IDF representation using [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). Beware of all the parameters of the methods! They might have significant impact on what you're doing.

In [21]:
corpus_string = [' '.join(doc) for doc in corpus]

Once you have converted your corpus using the TF-IDF methodology, create a function identifying the most relevant documents given a search query.

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

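def search_corpus(corpus, search_query, topn=10):
    """Retrieve the top n documents matching a search query within a list of texts.

    Params:
        corpus (list of strings): one string per document
        search_query (string): free-text query
        topn (int): number of document indices to return
    Return:
        indices of the topn documents, ranked by decreasing similarity
    """
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(corpus)

    query_vector = vectorizer.transform([search_query])

    # Rows are L2-normalized by default, so this product is the cosine similarity
    similarities = (tfidf @ query_vector.T).toarray().ravel()
    ranked_articles = np.argsort(similarities)[::-1]

    return ranked_articles[:topn]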

In [24]:
top_indices = search_corpus(corpus_string, "europe", topn=5)

for i in top_indices:
    print(corpus[i])

['<START>', 'soviet', 'paper', 'details', 'georgian', 'flood', 'damage', 'floods', 'and', 'avalanches', 'killed', 'people', 'and', 'caused', 'around', 'mln', 'roubles', 'worth', 'of', 'damage', 'in', 'the', 'southern', 'soviet', 'republic', 'of', 'georgia', 'earlier', 'this', 'year', 'the', 'government', 'daily', 'izvestia', 'said', 'some', 'hectares', 'of', 'agricultural', 'land', 'and', 'gardens', 'had', 'been', 'inundated', 'damaging', 'tea', 'plantations', 'and', 'orange', 'groves', 'the', 'newspaper', 'said', 'it', 'added', 'that', 'spring', 'sowing', 'in', 'southern', 'parts', 'of', 'the', 'country', 'was', 'some', 'two', 'weeks', 'behind', 'schedule', 'because', 'of', 'the', 'late', 'thaw', 'but', 'gave', 'no', 'precise', 'crop', 'estimates', 'in', 'the', 'most', 'detailed', 'report', 'to', 'date', 'on', 'the', 'heavy', 'snows', 'in', 'january', 'and', 'floods', 'in', 'february', 'izvestia', 'said', 'people', 'had', 'been', 'evacuated', 'from', 'mountain', 'areas', 'houses', 'ha

## Unsupervised Random Walk Sentence Embeddings

This approach has been presented by Kawin Ethayarajh in 2018. The key idea behind this methodology is to take a weighted average of previously trained word embeddings and modify it with SVD (Singular Value Decomposition, a kind of generalization of the PCA).

By having a look at [this implementation](https://github.com/kawine/usif), try to compute the uSIF embeddings of our corpus and compare their properties to the TF-IDF / Averaged Word2Vec ones.

Consider digging in the [related paper](https://aclanthology.org/W18-3012.pdf) if you want to know more about the methodology.
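
As a starting point, the "weighted average modified with SVD" idea can be sketched as follows. This is a simplified SIF-style illustration and not the uSIF algorithm itself: the word vectors, the unigram probabilities, the constant `a` and the helper name `weighted_average` are all assumptions made for the example, and the linked implementation chooses its weights and removes components differently.

```python
import numpy as np

# Hypothetical inputs: 3-d word vectors and unigram probabilities
word_vectors = {
    "tea":     np.array([0.9, 0.1, 0.0]),
    "exports": np.array([0.2, 0.8, 0.1]),
    "the":     np.array([0.5, 0.5, 0.5]),
    "fell":    np.array([0.1, 0.3, 0.9]),
}
word_prob = {"tea": 0.01, "exports": 0.005, "the": 0.2, "fell": 0.005}

def weighted_average(sentence, a=1e-3):
    """Average word vectors with weight a / (a + p(w)), so frequent words count less."""
    weights = np.array([a / (a + word_prob[w]) for w in sentence])
    vectors = np.array([word_vectors[w] for w in sentence])
    return weights @ vectors / weights.sum()

sentences = [
    ["the", "tea", "exports"],
    ["the", "tea", "fell"],
    ["tea", "exports", "fell"],
]
E = np.array([weighted_average(s) for s in sentences])

# Remove the projection on the first right singular vector (the common component)
_, _, Vt = np.linalg.svd(E, full_matrices=False)
common = Vt[0]
E_clean = E - np.outer(E @ common, common)
print(E_clean)
```

With real data, `word_vectors` would come from a pretrained model such as the GloVe one loaded earlier, and `word_prob` from word frequencies in a large corpus.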

# Analysis and understanding of the results

### Introduction
We started by understanding why texts must be represented numerically before algorithms can use them. Raw texts cannot be exploited directly, hence the need to convert them to vectors. The goal was to explore different word and sentence representations and to highlight their pros and cons.

### Installation and setup
We installed the required packages such as `gensim` and `nltk`, then imported various libraries (`numpy`, `pandas`, `scipy`, `sklearn`, etc.). We also downloaded the Reuters corpus to work with. A key step was defining the `<START>` and `<END>` tokens to mark the beginning and end of documents, which is good practice in NLP.

### Part 0 - Exploring the dataset
This part consisted in getting familiar with the Reuters corpus, made up of 10,788 news documents. We implemented the `read_corpus` function to preprocess the texts. This function is essential because it:
*   Adds the `<START>` and `<END>` tokens.
*   Converts all words to lowercase to standardize the vocabulary.
*   Removes punctuation, which reduces noise and the vocabulary size.
*   Filters out empty or purely numeric tokens.

Calling `pp(corpus[:2])` let us inspect the result of this preprocessing: well-structured documents with the added tokens, lowercase words and no punctuation.

### Part I - Word representations
This is the core of the notebook, where we explored different ways of converting words to vectors.

#### Dummy encoding (one-hot encoding)
We created the `dummy_encode` function, which assigns each word a unique vector made of zeros and a single one. I understood that, although simple, this method is "pointless" for capturing meaning because it represents no semantic relationship between words. Each word is orthogonal to all the others, so no similarity can be found.

#### Co-occurrence matrix encoding
This was a more complex and very interesting part:

1.  **Identifying distinct words**: The `distinct_words` function lists all the unique words of the corpus and returns their count. It is the basis for building our matrix.
2.  **Computing the co-occurrence matrix**: We implemented `compute_co_occurrence_matrix`. I understood the concept of window (`window_size`), which determines which words co-occur. The example with the documents "all that glitters is not gold" and "all is well that ends well" clearly illustrated how to build this symmetric matrix. *The window must be scanned on both sides of the center word: a loop that only visits the preceding words records each pair in one direction only and leaves `M` asymmetric.* Once both sides are counted, the matrix `M` shows how many times each word appears near another.
3.  **Dimensionality reduction (SVD)**: The `reduce_to_k_dim` function uses `TruncatedSVD` to reduce the co-occurrence matrix to `k` dimensions (here, `k=2`). This is crucial because it condenses the information while keeping most of the semantic relationships. The idea is that words appearing in similar contexts will have similar vectors even after reduction.
4.  **Normalization**: After the reduction, normalization (`M_normalized`) makes the vectors unit length, which makes similarity comparison via the dot product easier. *A `RuntimeWarning: invalid value encountered in divide` can appear during normalization when a row of `M_reduced_co_occurrence` is all zeros (so its entry in `M_lengths` is 0), which happens for example when a word's row in the co-occurrence matrix is empty. This is worth checking in `compute_co_occurrence_matrix` or `reduce_to_k_dim`.*
5.  **Similarity search**: The `most_similar` function finds the closest embeddings using the dot product. *As written, it takes an embedding vector and returns row indices, so looking up the query word and mapping the indices back to words is left to the caller.*

#### GloVe encoding (pretrained)
We explored the pretrained GloVe embeddings, loaded through `gensim.downloader`. It was very impressive to see how a model pretrained on large corpora captures complex semantic relationships:
*   `model.most_similar(positive=["fox", "rabbit", "cat"])`: The results show similar animals ("dog", "mouse", "wolf", "bunny"). This demonstrates that GloVe embeddings capture topical similarity well.
*   `model.most_similar(positive=["king", "woman"], negative=["man"])`: The analogy example "king - man + woman = queen" works well, showing that the embeddings can capture analogical relationships (here, the gender relationship).

This section highlighted the power of pretrained embeddings for handling synonyms, analogies and, more generally, semantics. The biases mentioned (synonyms, antonyms, polysemy, irony, analogies) probably come from the data on which the models are trained. If a word is often used in a negative context, this can influence its embedding.

### Part II - Sentence representations

#### TF-IDF
We started working on sentence representations with TF-IDF. The idea is to weight the importance of a word in a document by its frequency in that document (TF) and by its inverse frequency across the whole corpus (IDF). This highlights the distinctive words of a document.

*   We converted the corpus to a list of strings (`corpus_string`) in order to use it with scikit-learn's `TfidfVectorizer`.
*   The `search_corpus` function uses `TfidfVectorizer` to transform the corpus and a search query into TF-IDF vectors. It then computes the similarity with a dot product and returns the most relevant documents.
*   Running `search_corpus(corpus_string, "europe", topn=5)` returned very relevant results, with articles mentioning Europe, imports, oil, etc. This suggests that TF-IDF is effective for information retrieval and for identifying documents that are topically close to a query.

#### Unsupervised Random Walk Sentence Embeddings (uSIF)
This part is an introduction to a more advanced method for sentence embeddings. The idea of combining existing word embeddings with SVD to obtain sentence embeddings is promising, and the reference to the implementation and to the paper is a good starting point for exploring it further. The goal would be to compare its performance with TF-IDF and with a simple Word2Vec average.

In summary, this notebook let us follow the progression of text representation methods, from simple one-hot encoding to co-occurrence matrices, pretrained GloVe embeddings and TF-IDF document vectors. Each method has its strengths and weaknesses, and the choice really depends on the use case.