# Synthesizing textual information with embeddings

Lino Galiana  
2025-10-07

> **Warning**
>
> This chapter will be updated soon.

<div class="badge-container"><div class="badge-text">If you want to try the examples in this tutorial:</div><a href="https://github.com/linogaliana/python-datascientist-notebooks/blob/main/notebooks/en/NLP/03_embedding.ipynb" target="_blank" rel="noopener"><img src="https://img.shields.io/static/v1?logo=github&label=&message=View%20on%20GitHub&color=181717" alt="View on GitHub"></a>
<a href="https://datalab.sspcloud.fr/launcher/ide/vscode-pytorch?autoLaunch=true&name=«03_embedding»&init.personalInit=«https%3A%2F%2Fraw.githubusercontent.com%2Flinogaliana%2Fpython-datascientist%2Fmain%2Fsspcloud%2Finit-vscode.sh»&init.personalInitArgs=«en/NLP%2003_embedding%20correction»" target="_blank" rel="noopener"><img src="https://custom-icon-badges.demolab.com/badge/SSP%20Cloud-Lancer_avec_VSCode-blue?logo=vsc&logoColor=white" alt="Onyxia"></a>
<a href="https://datalab.sspcloud.fr/launcher/ide/jupyter-pytorch?autoLaunch=true&name=«03_embedding»&init.personalInit=«https%3A%2F%2Fraw.githubusercontent.com%2Flinogaliana%2Fpython-datascientist%2Fmain%2Fsspcloud%2Finit-jupyter.sh»&init.personalInitArgs=«en/NLP%2003_embedding%20correction»" target="_blank" rel="noopener"><img src="https://img.shields.io/badge/SSP%20Cloud-Lancer_avec_Jupyter-orange?logo=Jupyter&logoColor=orange" alt="Onyxia"></a>
<a href="https://colab.research.google.com/github/linogaliana/python-datascientist-notebooks-colab//en/blob/main//notebooks/en/NLP/03_embedding.ipynb" target="_blank" rel="noopener"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a><br></div>

# 1. Introduction

This page builds on certain aspects presented in the [introductory section](../../content/NLP/02_exoclean.qmd).
We will advance our understanding of NLP issues through language modeling.

We start from the conclusion noted at the end of the previous chapter: frequentist approaches have several shortcomings, such as modeling language based on statistical regularities without considering word or phrase proximity, and difficulty incorporating context.

The aim of this chapter is to address the first of those points. This will serve as an introduction to *embeddings*, the language representations at the core of modern language models used in everyday tools like `DeepL` or `ChatGPT`.

## 1.1 Data Used

We will continue our exploration of literature using the same three English-language authors:

-   Edgar Allan Poe (EAP);
-   HP Lovecraft (HPL);
-   Mary Wollstonecraft Shelley (MWS).

The dataset is available in a CSV file hosted on [`Github`](https://github.com/GU4243-ADS/spring2018-project1-ginnyqg/blob/master/data/spooky.csv), and can be directly downloaded from:
<https://github.com/GU4243-ADS/spring2018-project1-ginnyqg/raw/master/data/spooky.csv>.

To explore the topic of *embeddings*, we will use a language modeling task: predicting the author of a given text. A language model represents a text or language as a probability distribution over terms (usually words).

> **Sources of Inspiration**
>
> This chapter is inspired by several online resources:
>
> -   A [first notebook on `Kaggle`](https://www.kaggle.com/enerrio/scary-nlp-with-spacy-and-keras)
>     and [a second one](https://www.kaggle.com/meiyizi/spooky-nlp-and-topic-modelling-tutorial/notebook);
> -   A [Github repository](https://github.com/GU4243-ADS/spring2018-project1-ginnyqg);

## 1.2 Required Packages

As in the [previous section](../../content/NLP/02_exoclean.qmd), we need to install specialized NLP libraries along with their dependencies. This tutorial will use several libraries, including some that depend on `PyTorch`, which is a large framework.

> **`PyTorch` on `SSPCloud`**
>
> **The following note is only relevant for users of `SSPCloud`.**
>
> The standard `Python` services on `SSPCloud` (such as `vscode-python` and `jupyter-python`) do not include `PyTorch` by default. This library is quite large (around 600MB) and requires specific configuration to work seamlessly across different software environments. For ecological sustainability, this enhanced environment is not provided by default. However, when needed, an environment with `PyTorch` preinstalled is available.
>
> To access it, simply start a `vscode-pytorch` or `jupyter-pytorch` service. If you used one of the buttons above, this pre-configured service was automatically launched for you.

In [None]:
!pip install numpy pandas spacy transformers scikit-learn langchain_community

Next, since we will be using the `SpaCy` library with a corpus of English texts,
we need to download the English NLP model. For this, you can refer to
[the official `SpaCy` documentation](https://spacy.io/usage/models),
which is extremely well-designed.

In [None]:
!python -m spacy download en_core_web_sm

# 2. Data Preparation

We will once again use the `spooky` dataset:

In [None]:
import pandas as pd

data_url = 'https://github.com/GU4243-ADS/spring2018-project1-ginnyqg/raw/master/data/spooky.csv'
spooky_df = pd.read_csv(data_url)

The dataset pairs each author with a sentence they wrote:

In [None]:
spooky_df.head()

## 2.1 Preprocessing

As discussed in the previous chapter, the first step in any work with textual data is often *preprocessing*, which typically includes tokenization and text cleaning.

Here, we will stick to minimal preprocessing: removing punctuation and stop words (for visualization and count-based vectorization methods).

To begin the cleaning process,
we will use the `en_core_web_sm` model from `spaCy`:

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

We will use a `spacy` `pipe` that automates and parallelizes
a number of operations. *Pipes* in NLP are similar to
`scikit` pipelines or `pandas` pipes. They are well-suited tools
for industrializing various *preprocessing* tasks:

In [None]:
from typing import List
import spacy

def clean_docs(
    texts: List[str],
    remove_stopwords: bool = False,
    n_process: int = 4,
    remove_punctuation: bool = True
) -> List[str]:
    """
    Cleans a list of text documents by tokenizing, optionally removing stopwords, and optionally removing punctuation.

    Parameters:
        texts (List[str]): List of text documents to clean.
        remove_stopwords (bool): Whether to remove stopwords. Default is False.
        n_process (int): Number of processes to use for processing. Default is 4.
        remove_punctuation (bool): Whether to remove punctuation. Default is True.

    Returns:
        List[str]: List of cleaned text documents.
    """
    # Stream the texts through the spaCy pipeline, skipping components we do not need
    docs = nlp.pipe(
        texts,
        n_process=n_process,
        disable=['parser', 'ner', 'lemmatizer', 'textcat']
    )

    # Pre-load stopwords for faster checking
    stopwords = set(nlp.Defaults.stop_words)

    # Process documents
    docs_cleaned = (
        ' '.join(
            tok.text.lower().strip()
            for tok in doc
            if (not remove_punctuation or not tok.is_punct) and
               (not remove_stopwords or tok.text.lower() not in stopwords)
        )
        for doc in docs
    )

    return list(docs_cleaned)

We apply the `clean_docs` function to our `pandas` column.
Since `pandas.Series` are iterable, they behave like lists and
work very well with our `spacy` pipe.

In [None]:
spooky_df['text_clean'] = clean_docs(spooky_df['text'])

In [None]:
spooky_df.head()

## 2.2 Encoding the Target Variable

We perform a simple encoding of the target variable:
there are three categories (authors), represented by integers 0, 1, and 2.
For this, we use `Scikit`’s `LabelEncoder`, previously introduced
in the [modeling section](../../content/modelisation/0_preprocessing.qmd). We will use the `fit_transform` method, which conveniently combines
fitting (i.e., creating a mapping between numerical values and labels)
and transforming the same column in one step.
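
The encoding step can be sketched as follows. This is a minimal illustration on a toy `Series` standing in for the `author` column of `spooky_df`; the variable names `encoder` and `author_encoded` are assumptions (the latter simply mirrors the `author_encoded` field visible in the metadata displayed later in this chapter).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the `author` column of `spooky_df` (assumption)
authors = pd.Series(["EAP", "MWS", "HPL", "MWS", "EAP"], name="author")

encoder = LabelEncoder()
# fit_transform learns the label -> integer mapping and applies it in one step
author_encoded = encoder.fit_transform(authors)

print(author_encoded)    # integer codes, one per row
print(encoder.classes_)  # labels, listed in the order of their integer codes
```

`LabelEncoder` sorts the labels before numbering them, so the integer codes follow alphabetical order of the author initials.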

We can check the classes of our `LabelEncoder`:

## 2.3 Creating the Training and Test Sets

We set aside a test sample (20%) before performing any analysis (even descriptive).
This ensures a rigorous evaluation of our models at the end, since these data will never have been seen during training.

Our initial dataset is not balanced—some authors have more texts than others.
To ensure fair evaluation of our model, we will stratify the sampling so that the training and test sets contain a similar distribution of authors.

In [None]:
from sklearn.model_selection import train_test_split

y = spooky_df["author"]
X = spooky_df['text_clean']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
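
To see what `stratify` does, here is a small sketch on toy labels (the counts below are made up for illustration and are not those of the spooky dataset). It compares the share of each author in the two samples.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, deliberately imbalanced labels (assumption: not the real author counts)
y_toy = pd.Series(["EAP"] * 50 + ["HPL"] * 30 + ["MWS"] * 20)
X_toy = pd.Series([f"sentence {i}" for i in range(len(y_toy))])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Share of each author in both samples: the two columns should be close
shares = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
}).sort_index()
print(shares)
```

Without `stratify`, the shares in a small test set could drift noticeably from those of the training set, which would blur the comparison of per-author metrics.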

Preview of the first element in `X_train`:

# 3. Vectorization Using the *Bag of Words* Approach

Representing our texts as a bag of words allows us to vectorize the corpus and thus obtain a numerical representation of each text. From there, we can perform various types of modeling tasks.

Let’s define our vector representation using TF-IDF with `Scikit`:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

pipeline_tfidf = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000)),
])
pipeline_tfidf

Let’s go ahead and train our model to vectorize the text using the TF-IDF method. At this stage, we are not yet concerned with evaluation, so we will train on the entire dataset, not just `X_train`.

In [None]:
pipeline_tfidf.fit(spooky_df['text_clean'])

## 3.1 Finding the Most Similar Text

First, we can look for the text that is closest—according to TF-IDF similarity—to a given sentence. Let’s take the following example:

In [None]:
text = "He was afraid by Frankenstein monster"

How can we find the text most similar to this one? We need to transform our sentence into the same vector representation, then compare it to the other texts using that same form.

This is essentially an information retrieval task—a classic NLP use case—implemented, for example, by search engines. Since the term “Frankenstein” is quite distinctive, we should be able to identify similarities with other texts written by Mary Shelley using TF-IDF.

A metric commonly used to compare vectors is cosine similarity. This is a central measure in modern NLP. While it is more meaningful with dense vectors (which we’ll explore soon), it still provides a useful exercise for understanding similarity between two vectors, even when those vectors are sparse, as in the bag-of-words approach.

If each dimension of a vector represents a direction, cosine similarity measures the angle between two vectors. The smaller the angle, the closer the vectors.

![](https://miro.medium.com/v2/resize:fit:824/1*GK56xmDIWtNQAD_jnBIt2g.png)
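
Formally, the cosine similarity between two vectors $u$ and $v$ is $\frac{u \cdot v}{\lVert u \rVert \lVert v \rVert}$. A minimal sketch with `numpy`, using three hand-picked vectors (the helper name `cosine_sim` is ours, not a library function):

```python
import numpy as np

def cosine_sim(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice as long
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a

print(cosine_sim(a, b))  # same direction: similarity of 1, up to rounding
print(cosine_sim(a, c))  # orthogonal vectors: similarity of 0
```

The length of the vectors plays no role: only their direction matters, which is why a short and a long text using the same words in the same proportions are considered close.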

### 3.1.1 With `Scikit-Learn`

> **Exercise 1: Similarity Search with TF-IDF**
>
> 1.  Use the `transform` method to vectorize the entire training corpus.
>
> 2.  Assuming your vectorized training set is named `X_train_tfidf`, you can convert it to a *DataFrame* with the following command:
>
> ``` python
> X_train_tfidf = pd.DataFrame(
>     X_train_tfidf.todense(), columns=pipeline_tfidf.get_feature_names_out()
> )
> ```
>
> 3.  Use `Scikit`’s `cosine_similarity` function to compute cosine similarity between your vectorized text and the training corpus using this code:
>
> ``` python
> import numpy as np
> from sklearn.metrics.pairwise import cosine_similarity
>
> cosine_similarities = cosine_similarity(
>     X_train_tfidf,
>     pipeline_tfidf.transform([text])
> ).flatten()
>
> top_4_indices = np.argsort(cosine_similarities)[-4:][::-1]  # Descending sort
> top_4_similarities = cosine_similarities[top_4_indices]
> ```
>
> 4.  Retrieve the corresponding documents. Are you satisfied with the result? Do you understand what happened?

At the end of the exercise, the 4 most similar texts are:
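
The mechanics of the exercise can be sketched on a toy corpus. The four sentences below are stand-ins chosen for illustration (an assumption, not the actual training set), and the vectorizer is fitted directly on them rather than through `pipeline_tfidf`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the training texts (assumption)
corpus = [
    "listen to me frankenstein",
    "the monster fled into the forest",
    "i was not afraid of the dark",
    "the raven sat above my chamber door",
]
query = "he was afraid by frankenstein monster"

vectorizer = TfidfVectorizer()
corpus_tfidf = vectorizer.fit_transform(corpus)  # sparse matrix, one row per text
query_tfidf = vectorizer.transform([query])      # same vocabulary as the corpus

# One similarity score per corpus text
scores = cosine_similarity(corpus_tfidf, query_tfidf).flatten()

# Indices sorted from most to least similar
ranking = np.argsort(scores)[::-1]
for idx in ranking:
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```

A text sharing no word with the query gets a score of exactly 0: with a bag-of-words representation, similarity can only come from shared terms.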

### 3.1.2 With `Langchain`

This approach to computing text similarity is rather tedious with `Scikit`. With the rapid development of `Python` applications leveraging language models, a rich ecosystem has emerged to make these tasks achievable in just a few lines of code.

Among the most valuable tools is [`Langchain`](https://www.langchain.com/), a high-level `Python` ecosystem for building production-ready pipelines using textual data.

We will proceed here in two steps:

-   Create a *retriever*, which involves vectorizing our corpus (texts from the three authors) using TF-IDF and storing it in a vector database.
-   Vectorize our search query (`text`, created earlier) on the fly and retrieve its closest match from the vector database.

Vectorizing our corpus is very straightforward using `Langchain`,
as `Scikit`’s `TfidfVectorizer` is wrapped in a dedicated module provided by `Langchain`.

In [None]:
from langchain_community.retrievers import TFIDFRetriever
from langchain_community.document_loaders import DataFrameLoader

loader = DataFrameLoader(spooky_df, page_content_column="text_clean")

retriever = TFIDFRetriever.from_documents(
    loader.load()
)

This `retriever` object serves as an entry point to our corpus. `Langchain` is particularly valuable in NLP projects because it provides standardized entry points, allowing you to easily switch out vectorizers without needing to change how the results are used at the end of the pipeline.

The `invoke` method is used to find the most similar vectors to our search query:

In [None]:
retriever.invoke(text)

[Document(metadata={'id': 'id12587', 'text': 'Listen to me, Frankenstein.', 'author': 'MWS', 'author_encoded': 2}, page_content='listen to me frankenstein'),
 Document(metadata={'id': 'id09284', 'text': 'I screamed aloud that I was not afraid; that I never could be afraid; and others screamed with me for solace.', 'author': 'HPL', 'author_encoded': 1}, page_content='i screamed aloud that i was not afraid that i never could be afraid and others screamed with me for solace'),
 Document(metadata={'id': 'id09797', 'text': 'It seemed to be a sort of monster, or symbol representing a monster, of a form which only a diseased fancy could conceive.', 'author': 'HPL', 'author_encoded': 1}, page_content='it seemed to be a sort of monster or symbol representing a monster of a form which only a diseased fancy could conceive'),
 Document(metadata={'id': 'id10816', 'text': 'And, as I have implied, it was not of the dead man himself that I became afraid.', 'author': 'HPL', 'author_encoded': 1}, page_c

The output is a `Langchain` object, which is not convenient for our purposes here. We convert it into a *DataFrame*:

In [None]:
documents = []
for best_echoes in retriever.invoke(text):
    documents += [{**best_echoes.metadata, **{"text_clean": best_echoes.page_content}}]

documents = pd.DataFrame(documents)

We can add the similarity score column to this *DataFrame*:

We do indeed retrieve the same documents:

> **The BM25 Metric**
>
> BM25 is a probabilistic relevance-based information retrieval model, similar to TF-IDF. It is commonly used in search engines to rank documents relative to a query.
>
> BM25 combines term frequency (TF), inverse document frequency (IDF), and a normalization based on document length. In other words, it improves on TF-IDF by normalizing scores by document length, so that longer documents are not overemphasized.
>
> BM25 performs particularly well in environments where documents vary in length and content. This is why search engines such as `Elasticsearch` have made it a cornerstone of their search mechanisms.
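
To make the role of length normalization concrete, here is a small pure-`Python` sketch of one common variant of the BM25 score. The function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, and the toy documents are assumptions chosen for illustration; production search engines have their own tuned implementations.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against a tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents containing each term
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Length normalization: longer documents are penalized for the same count
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "listen to me frankenstein".split(),
    "frankenstein wandered for many days through the cold and silent mountains".split(),
    "the raven sat above my chamber door".split(),
]
print(bm25_scores(["frankenstein"], docs))
```

The first two documents both contain the query term once, but the shorter one receives the higher score, which is the length adjustment described above.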

Why aren’t all results relevant? We can anticipate several reasons.

The first hypothesis is that we’re training our vectorizer on a biased corpus. While “Frankenstein” is a rare term, it appears more frequently in our dataset than in general English usage. The inverse document frequency is thus biased against the term: its appearance should be a much stronger indicator that the text belongs to Mary Shelley. While addressing this might slightly improve relevance, it’s not the core issue.

The frequentist approach assumes all terms are equally distinct. A sentence containing the word *“creature”* won’t get a higher score when searching for *“monster”*. Again, we’ve treated our corpus as a bag where words are independent—there’s no increased likelihood of encountering *“Frankenstein”* after *“doctor”*. These limitations point us toward the topic of *embeddings*. Even though the frequentist method may seem a bit *old school*, it’s not useless and often provides a “tough to beat baseline.” In fields like information extraction from short texts, where every term carries strong signal, this approach is often effective.

## 3.2 Finding the Closest Author: An Introduction to the Naive Bayes Classifier

Before diving into *embeddings*, let’s explore a slightly different use case within our probabilistic framework. Suppose we want to predict the author of a given text. If our previous intuition holds—certain words are more likely to appear in texts by specific authors—then we can train an automatic classification algorithm to predict the author based on the text.

The most natural method for this task is the Naive Bayes classifier. This model is a perfect fit for the frequentist approach we’ve used so far, as it relies on the probabilities of word occurrences per author.

The Naive Bayes classifier applies a decision rule: it selects the most probable class given the observed structure of the document—i.e., the words that appear in it.

In other words, we choose the class $\widehat{c}$ that is most probable given the terms in document $d$.

<span id="eq-definition-bayes">$$
\widehat{c} = \arg \max_{c \in \mathcal{C}} \mathbb{P}(c|d) =  \arg \max_{c \in \mathcal{C}} \frac{ \mathbb{P}(d|c)\mathbb{P}(c)}{\mathbb{P}(d)}
 \qquad(3.1)$$</span>

As is common in Bayesian estimation, we can ignore constant terms such as $\mathbb{P}(d)$. The definition of the predicted class can thus be reformulated as follows:

<span id="eq-rewriting-bayes">$$
\widehat{c} = \arg \max_{c \in \mathcal{C}} \mathbb{P}(d|c)\mathbb{P}(c)
 \qquad(3.2)$$</span>

The bag-of-words assumption comes into play here. A document $d$ is considered a collection of words $w_i$, where word order is irrelevant. In other words, we can build a model based on individual words without involving conditional probabilities related to their order.
The second strong assumption is the naive assumption from which the method gets its name: the probability of drawing a word depends only on the category $c$ to which the document belongs. In other words, a document is treated as a sequence of independent word draws, where the probability depends solely on the author.

As explained in the dedicated box, under these assumptions, the classifier can be rewritten in the following form

$$
\widehat{c} = \arg \max_{c \in \mathcal{C}} \mathbb{P}(c)\prod_{w \in \mathcal{W}}{\mathbb{P}(w|c)}
$$

where $\mathcal{W}$ is the set of words in the corpus (our vocabulary).

Empirically, this is a supervised learning task where the *label* is the document class and the *features* are our vectorized words. In practice, the probabilities are estimated from word counts in the corpus and the distribution of document types.

While it is possible to compute all these quantities manually, `Scikit` makes it easy to implement a Naive Bayes estimator after vectorizing the corpus, as shown in the next exercise. However, this may introduce a practical issue: ideally, the test set should not contain new words that were not in the training set, since these new dimensions did not exist during training. In practice, the most common solution is the one adopted here: these words are ignored.

> **Exercise 2: The Naive Bayes Classifier**
>
> 1.  Starting from the previous example, define a *pipeline* that vectorizes each document (using `CountVectorizer` instead of `TfidfVectorizer`) and performs prediction using a Naive Bayes model.
> 2.  Train this model and make predictions on the test set.
> 3.  Evaluate the performance of your model.
> 4.  Make a prediction for the sentence we previously stored in the `text` variable. Do you get the expected result?
> 5.  Examine the predicted probabilities (using the `predict_proba` method).

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

# Train the pipeline on the training data
pipeline.fit(X_train, y_train)

We obtain a satisfactory accuracy:

The breakdown of performance metrics is as follows:

Unsurprisingly, we get Mary Shelley as the predicted author:

Finally, when examining the predicted probabilities (question 5), we see that the prediction is very confident:
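
The calls behind these steps can be sketched on a toy corpus. The sentences and labels below are invented for illustration (an assumption, not the spooky data), so the numbers they produce say nothing about the real model; only the sequence of `Scikit` calls matters.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training and test sets standing in for X_train, y_train, X_test, y_test
train_texts = [
    "the monster frankenstein creature despair",
    "frankenstein created the creature",
    "cthulhu dreams in the sunken city",
    "the ancient horror of cthulhu",
    "the raven tapped at my chamber door",
    "nevermore quoth the raven",
]
train_authors = ["MWS", "MWS", "HPL", "HPL", "EAP", "EAP"]
test_texts = ["the creature of frankenstein", "cthulhu sleeps in the city"]
test_authors = ["MWS", "HPL"]

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
toy_pipeline.fit(train_texts, train_authors)

# Questions 2 and 3: predict on held-out texts and measure accuracy
toy_accuracy = accuracy_score(test_authors, toy_pipeline.predict(test_texts))
print(toy_accuracy)

# Questions 4 and 5: predicted author and class probabilities for one sentence
query = "he was afraid by frankenstein monster"
predicted_author = toy_pipeline.predict([query])[0]
probabilities = toy_pipeline.predict_proba([query])[0]
classes = toy_pipeline.named_steps["classifier"].classes_
print(predicted_author, dict(zip(classes, probabilities.round(3))))
```

Words of the query that were never seen during training (here *"afraid"*, for instance) are silently dropped by the vectorizer, as discussed above.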

> **Understanding the logic of the naive Bayes classifier**
>
> Suppose we are in a classification problem with classes $(c_1,...,c_K)$ (set denoted $\mathcal{C}$). Placing ourselves within the bag-of-words framework, we can ignore the positions of words in documents, which would greatly complicate the writing of our equations.
>
> The equation <a href="#eq-rewriting-bayes" class="quarto-xref">Equation 3.2</a> can be rewritten
>
> $$
> \widehat{c} = \arg \max_{c \in \mathcal{C}} \mathbb{P}(w_1, ..., w_n|c)\mathbb{P}(c)
> $$
>
> In the Bayesian world, we call $\mathbb{P}(w_1, ..., w_n|c)$ the likelihood and $\mathbb{P}(c)$ the prior.
>
> The naive Bayes assumption allows us to treat a document as a sequence of random draws whose probabilities depend only on the category. In this case, drawing a sentence is a sequence of word draws and the compound probability is therefore
>
> $$
> \mathbb{P}(w_1, ..., w_n|c) = \prod_{i=1}^n \mathbb{P}(w_i|c)
> $$
>
> For example, simplifying to two classes, if the probabilities are those from <a href="#tbl-fake-proba" class="quarto-xref">Table 3.1</a>, the sentence *“afraid by Doctor Frankenstein”* has a little less than a 0.1% chance (0.08%) of being written if the author is Mary Shelley, but it is even less likely with Lovecraft (0.006%): while *“afraid”* is very probable with him, *“Frankenstein”* is a rare event that makes this combination of words unlikely.
>
> | Word ($w_i$) | Probability for Mary Shelley | Probability for Lovecraft |
> |--------------|------------------------------|---------------------------|
> | Afraid       | 0.1                          | 0.6                       |
> | By           | 0.2                          | 0.2                       |
> | Doctor       | 0.2                          | 0.05                      |
> | Frankenstein | 0.2                          | 0.01                      |
>
> Table 3.1: Fictional example of drawing probabilities
>
> By combining these different equations, we get
>
> $$
> \widehat{c} = \arg \max_{c \in \mathcal{C}} \mathbb{P}(c)\prod_{w \in \mathcal{W}}{\mathbb{P}(w|c)}
> $$
>
> The empirical counterpart of $\mathbb{P}(c)$ is quite obvious: the observed frequency of each category (the authors) in our corpus. In other words,
>
> $$
> \widehat{\mathbb{P}(c)} = \frac{n_c}{n_{doc}}
> $$
>
> What is the empirical counterpart of $\mathbb{P}(w_i|c)$? It is the frequency of appearance of the word in question for the author. To calculate it, we simply count the number of times it appears for the author and divide by the number of words by the author.
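
The products from Table 3.1 can be checked in a few lines. This sketch assumes equal priors for the two authors, so that comparing likelihoods is enough to pick a class; the dictionary layout is our own choice.

```python
import math

# Fictional word probabilities from Table 3.1
proba = {
    "Mary Shelley": {"afraid": 0.1, "by": 0.2, "doctor": 0.2, "frankenstein": 0.2},
    "Lovecraft": {"afraid": 0.6, "by": 0.2, "doctor": 0.05, "frankenstein": 0.01},
}
sentence = ["afraid", "by", "doctor", "frankenstein"]

# Likelihood of the sentence: product of the word probabilities for each author
likelihood = {
    author: math.prod(word_proba[w] for w in sentence)
    for author, word_proba in proba.items()
}
for author, value in likelihood.items():
    print(f"{author}: {value:.4%}")

# Sums of logs give the same ranking and avoid numerical underflow on long texts
log_likelihood = {
    author: sum(math.log(word_proba[w]) for w in sentence)
    for author, word_proba in proba.items()
}
best_author = max(log_likelihood, key=log_likelihood.get)
print(best_author)
```

Working with logarithms is the usual choice in practice: a product of hundreds of small probabilities quickly becomes too small to be represented as a floating-point number.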

# 4. The Word2Vec model, a more synthetic representation

## 4.1 Towards a more synthetic representation of language

The vector representation resulting from the *bag of words* approach is not very synthetic or stable and is quite crude.

If we have a small corpus, we will have problems extrapolating since new texts are very likely to bring new words, which are new feature dimensions that were not present in the training corpus. This is conceptually a problem since *machine learning* algorithms are not intended to predict on characteristics they have not been trained on[1].

Conversely, the more text we have in a corpus, the larger our vector representation will be. For example, if your bag of words has seen the entire French vocabulary, which is 60,000 words according to the [French Academy](https://www.dictionnaire-academie.fr/article/QDL056) (estimates being 200,000 for the English language), this results in vectors of considerable size. However, the diversity of texts is, in practice, much lower: common use of French requires around 3,000 words and most texts, especially if they are short, do not use such a complete vocabulary. This therefore implies very sparse vectors, with many 0s.
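
The sparsity of a bag-of-words matrix is easy to measure. The sketch below uses four invented sentences (an assumption for illustration); on a real corpus with thousands of vocabulary entries, the share of zeros is typically far higher.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (assumption): short texts that share few words
corpus = [
    "listen to me frankenstein",
    "the raven sat above my chamber door",
    "cthulhu dreams in the sunken city",
    "the monster fled into the forest",
]

counts = CountVectorizer().fit_transform(corpus)  # scipy sparse matrix
n_docs, vocab_size = counts.shape

# Share of cells equal to zero in the document-term matrix
sparsity = 1 - counts.nnz / (n_docs * vocab_size)
print(f"{n_docs} documents, {vocab_size} words, {sparsity:.1%} zeros")
```

Each new document adds a handful of non-zero cells but potentially several new columns, so the matrix grows wider and emptier as the corpus grows.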

Vectorization according to this approach is therefore inefficient; the signal is poorly compressed. Dense representations, that is, of smaller dimension but all carrying information, seem more adequate to be able to generalize our language modeling.
The algorithm that made this approach famous is the `Word2Vec` model, in some ways the first common ancestor of modern LLMs. The vector representation of `Word2Vec` is quite synthetic: the dimension of these *embeddings* is between 100 and 300.

## 4.2 Semantic relationships between terms

This dense representation will represent a solution to a limitation of the *bag of words* approach that we have mentioned multiple times. Each of these dimensions will represent a latent factor,
that is, an unobserved variable,
in the same way as principal components produced by a PCA. These latent dimensions can be interpreted as “fundamental” dimensions of language.

<figure>
<img src="https://jalammar.github.io/images/word2vec/word2vec.png" alt="Illustration of the principle of Word2Vec representation (source: Jay Alammar)" />
<figcaption aria-hidden="true">Illustration of the principle of Word2Vec representation (source: <a href="https://jalammar.github.io/illustrated-word2vec/">Jay Alammar</a>)</figcaption>
</figure>

For example, a human knows that a document containing the word *“King”*
and another document containing the word *“Queen”* are very likely
to address similar subjects. A well-trained `Word2Vec` model will capture
that there exists a latent factor of type *“royalty”*
and the similarity between the vectors associated with the two words will be strong.

The magic goes even further: the model will also capture that there exists a
latent factor of type *“gender”*,
and will allow the construction of a semantic space in which
arithmetic relationships between vectors make sense. For example,

$$
\text{king} - \text{man} + \text{woman} ≈ \text{queen}
$$

or, to revisit the example from the original `Word2Vec` paper (Mikolov 2013),

$$
\text{Paris} - \text{France} + \text{Italy} ≈ \text{Rome}
$$
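
The arithmetic can be illustrated with hand-built vectors. The three axes below (royalty, gender, food) and the values are assumptions made up for the example: a trained `Word2Vec` model learns its dimensions from data and they are not directly interpretable in this way.

```python
import numpy as np

# Hand-built toy vectors (assumption), axes: [royalty, gender, food]
vectors = {
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "apple": np.array([0.0,  0.0, 1.0]),
}

def cosine_sim(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

target = vectors["king"] - vectors["man"] + vectors["woman"]

# Nearest word to the target, excluding the three words used in the query
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
nearest = max(candidates, key=lambda w: cosine_sim(target, candidates[w]))
print(nearest)
```

Subtracting *"man"* removes the gender component of *"king"*, adding *"woman"* puts the opposite one back, and the royalty component is left untouched.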

<figure>
<img src="https://ssphub.netlify.app/post/embedding/word_embedding.png" alt="Illustration of lexical embedding. Source: Blog post Word Embedding: Basics" />
<figcaption aria-hidden="true">Illustration of lexical embedding. Source: Blog post <a href="https://medium.com/@hari4om/word-embedding-d816f643140">Word Embedding: Basics</a></figcaption>
</figure>

Another “miracle” of this approach is that it allows a form of transfer between languages. Since semantic relationships can be similar across languages, many common words can be mapped between languages if they share a common base (such as Western languages). This concept is the foundation of automatic translators and multilingual AI systems.

<figure>
<img src="https://engineering.fb.com/wp-content/uploads/2018/01/GJ_9lgFMnVaR0ZYAAAAAAABV9MkQbj0JAAAC.gif" alt="Example of translation between two vector representations. Source: Meta" />
<figcaption aria-hidden="true">Example of translation between two vector representations. Source: <a href="https://engineering.fb.com/2018/01/24/ml-applications/under-the-hood-multilingual-embeddings/">Meta</a></figcaption>
</figure>

## 4.3 How are these models trained?

These models are trained on a prediction task solved by a simple neural network, generally with a self-supervised approach: the training labels come from the text itself.

The fundamental idea is that the meaning of a word is understood by looking at words that frequently appear in its neighborhood. For a given word, we will therefore try to predict the words that appear in a window around the target word.

By repeating this task many times and on a sufficiently varied corpus,
we finally obtain *embeddings* for each word in the vocabulary,
which present the properties discussed previously. The collection of `Wikipedia` articles is one of the preferred corpora for people who have built lexical
embeddings. It indeed contains complete sentences, unlike information from social media comments,
and proposes interesting connections between people, places, etc.

The context of a word is defined by a fixed-size window around this word. The window size is a parameter of the *embedding* construction. The corpus provides a large set of word-context examples, which can be used to train a neural network.

More precisely, there are two approaches, whose details we will not develop:

-   *Continuous bag of words* (CBOW), where the model is trained to predict a word from its context;
-   *Skip-gram*, where the model attempts to predict the context from a single word.
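
The difference between the two approaches shows up in the training examples they extract from a sentence. Here is a minimal sketch in pure `Python`; the function names and the example sentence are ours, and real implementations add refinements (subsampling of frequent words, negative sampling) that are left out.

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) pairs: each word predicts each of its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs += [(target, tokens[j]) for j in range(lo, hi) if j != i]
    return pairs

def cbow_pairs(tokens, window=2):
    """(context, target) pairs: the neighbours jointly predict the word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

tokens = "the raven sat above my chamber door".split()
print(skipgram_pairs(tokens, window=1)[:4])
print(cbow_pairs(tokens, window=1)[:3])
```

Skip-gram produces one example per (word, neighbour) pair, whereas CBOW produces one example per word, with the whole window as input.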

<figure>
<img src="https://ssphub.netlify.app/post/embedding/CBOW_Skipgram_training.png" alt="Illustration of the difference between CBOW and Skip-gram approaches" />
<figcaption aria-hidden="true">Illustration of the difference between CBOW and Skip-gram approaches</figcaption>
</figure>

## 4.4 Related models

Several models have a direct lineage with the `Word2Vec` model although they distinguish themselves by the nature of the architecture used.

This is the case, for example, of the [`GloVe`](https://nlp.stanford.edu/projects/glove/) model, developed in 2014 at Stanford,
which does not rely on neural networks but on the construction of a large word co-occurrence matrix. For each word, the task is to calculate the frequencies of appearance of other words in a fixed-size window around it. The resulting co-occurrence matrix is then factorized: the word vectors are fitted so that their dot products approximate the logarithm of the co-occurrence counts.

The [`FastText`](https://fasttext.cc/) model, developed in 2016 by a `Facebook` team, works similarly to `Word2Vec` but particularly distinguishes itself on two points:

-   In addition to the words themselves, the model learns representations for character n-grams (character subsequences of size $n$, for example *“tar”*, *“art”* and *“rte”* are the trigrams of the word *“tarte”*), which makes it particularly robust to spelling variations;
-   The model has been optimized so that its training is particularly fast.
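
Character n-grams are simple to extract. The helper below is a sketch of the idea, not the `FastText` implementation; the `boundaries` flag mimics the word-boundary markers `<` and `>` that `FastText` adds around each word before splitting it.

```python
from typing import List

def char_ngrams(word: str, n: int = 3, boundaries: bool = False) -> List[str]:
    """Return the character n-grams of `word`, optionally with boundary markers."""
    if boundaries:
        word = f"<{word}>"
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("tarte"))
print(char_ngrams("tarte", boundaries=True))

# A spelling variant shares most of its trigrams with the original word
shared = set(char_ngrams("tarte")) & set(char_ngrams("tartes"))
print(shared)
```

Because a misspelled or inflected word still shares most of its n-grams with the correct form, its representation stays close to it even if the exact word was never seen during training.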

The [`FastText`](https://fasttext.cc/) model is particularly effective for automatic classification problems. INSEE uses it for example for several models of classification of textual labels in nomenclatures.

<figure>
<img src="https://ssphub.netlify.app/post/embedding/fasttext.png" alt="Illustration of the fastText model" />
<figcaption aria-hidden="true">Illustration of the fastText model</figcaption>
</figure>

Here is an example of an automated profession classification project in the typology
of activity nomenclatures (PCS) based on a model trained by the `Fasttext` library:

[1] This remark may seem surprising while generative AIs occupy an important place in our usage. Nevertheless, we must keep in mind that while you ask new questions to AIs, you ask them in terms they know: natural language in a language present in their training corpus, digital images that are therefore interpretable by a machine, etc. In other words, your *prompt* is not, in itself, unknown to the AI, it can interpret it even if its content is new and original.

These models are inheritors of `Word2Vec` in the sense that they adopt a dense, low-dimensional vector representation of textual documents. `Word2Vec` remains a model that inherits from the bag-of-words logic. The representation of a sentence or document is a form of average of the representations of the words that compose them.
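
This averaging logic can be sketched as follows, with made-up three-dimensional word vectors (an assumption; real embeddings have hundreds of dimensions and are learned). The helper name `sentence_embedding` is ours.

```python
import numpy as np

# Toy word vectors (assumption), not learned from data
word_vectors = {
    "dog":   np.array([0.9, 0.1, 0.0]),
    "bites": np.array([0.1, 0.8, 0.2]),
    "man":   np.array([0.7, 0.0, 0.6]),
}

def sentence_embedding(tokens, word_vectors, dim=3):
    """Average the vectors of the known words; zeros if none is known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

v1 = sentence_embedding("dog bites man".split(), word_vectors)
v2 = sentence_embedding("man bites dog".split(), word_vectors)

# Word order is lost: both sentences get the same vector
print(np.allclose(v1, v2))
```

Two sentences with opposite meanings but the same words are indistinguishable after averaging, which is exactly the bag-of-words limitation inherited by these models.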

Since 2013, several breakthroughs have enriched language models so that they go beyond a word-by-word representation of language. Much more complex architectures, which represent not only words but also sentences and documents in the form of *embeddings*, are now in use and can be traced back to the revolution of *transformer* architectures.

# 5. *Transformers*: a richer representation of language

While the `Word2Vec` model is trained on context, its purpose is to give each word a single vector representation, independent of the context in which it appears. For example, the term *“bank”* will have exactly the same vector representation whether it appears in the sentence *“She walks along the river bank”* or *“He deposited his savings at the bank”*. This is a major limitation of this type of approach, since context is obviously essential for interpreting language.

The objective of *transformer* architectures is to enable contextual vector representations. In other words, a word will have several vector representations, depending on the context in which it occurs. These models rely on the attention mechanism (Vaswani 2017). Before this approach, with recurrent architectures, when a model learned to vectorize a text and reached the nth word, the only memory it kept was the one passed on from the previous word. By recurrence, this meant it kept a memory of earlier words, but this memory tended to fade. Consequently, for a word appearing late in the sentence, the context from the beginning of the sentence was likely to be forgotten. In other words, in the sentence *“by the river, he was going to explore the bank”*, it was very likely that upon reaching the word *“bank”*, the model had forgotten the beginning of the sentence, which was nevertheless essential for interpretation.

The objective of the attention mechanism is to create an internal memory in the model that lets it, for any word in a text, keep track of the other words. Of course, not all of them are relevant for interpreting the text, but this avoids forgetting those that are important. The main innovation of recent years in NLP has been to build large-scale attention mechanisms without making the models intractable. The context windows of the best-performing models are becoming immense. For example, the Llama 3.1 model (made public by Meta in July 2024) offers a context window of 128,000 *tokens*, or about 96,000 words, the equivalent of Tolkien’s *Hobbit*. In other words, to deduce the subtlety of a word’s meaning, this model can draw on a context as long as a novel of about 300 pages.
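As a rough illustration of this mechanism, here is a minimal sketch of single-head scaled dot-product self-attention, the building block described in Vaswani (2017). The sizes are arbitrary, the random matrices `W_q`, `W_k` and `W_v` merely stand in for parameters that a real model would learn, and positional information is left out.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # scores[i, j]: how much token i attends to token j
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted mix of all value vectors
    return weights @ V, weights


rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 8, 4  # toy sizes, chosen arbitrarily
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape)   # (5, 4)
print(weights.shape)  # (5, 5)
```

The `weights` matrix has one row per token and one column per token: every token can draw directly on any other token of the sequence, however far apart they are. Its size grows quadratically with the sequence length, which gives an idea of why very long context windows are costly.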

The two models that marked their era in the field are the `BERT` models developed in 2018 by *Google* (which was already behind `Word2Vec`) and the first version of the well-known `GPT` from `OpenAI`, which, in 2018, was the first pre-trained model based on the *transformer* architecture. These two *transformer* families differ in how they integrate context to make a prediction. `GPT` is an autoregressive model and therefore only considers the *tokens* before the one to be predicted. `BERT` uses the *tokens* to the left and right to infer context. Both are trained by self-supervision on *token* prediction tasks: next-*token* prediction for `GPT`, masked-*token* prediction for `BERT` (“The Hugging Face Course, 2022” 2022). The GPT models from version 3 onwards are no longer *open source*. To use them, one must therefore go through OpenAI’s APIs. There are nevertheless many alternatives whose weights are open, if not *open source*[1], which make it possible to use these LLMs through `Python`, notably through the `transformers` library developed by *Hugging Face*.
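The difference between the two families can be sketched with an attention mask. This is a simplified illustration under stated assumptions: the scores are uniform toy values and the helper `attention_weights` is hypothetical. It only shows which positions each kind of model is allowed to look at.

```python
import numpy as np


def attention_weights(scores, causal=False):
    """Turn raw scores into attention weights, optionally hiding future tokens."""
    scores = scores.astype(float).copy()
    if causal:
        # Position i may only look at positions j <= i
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores[future] = -np.inf
    scores -= scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)


scores = np.zeros((4, 4))  # uniform toy scores for a 4-token sequence

bidirectional = attention_weights(scores)        # BERT-like: sees left and right
causal = attention_weights(scores, causal=True)  # GPT-like: sees only the past

print(bidirectional[0])  # [0.25 0.25 0.25 0.25]
print(causal[0])         # [1. 0. 0. 0.]
```

With the causal mask, the first token can only attend to itself, while the last one sees the whole sequence. Without it, every token sees every other token.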

When working with small corpora,
it’s generally a bad idea to train your own model *from scratch*. Fortunately, models pre-trained on very large corpora are available. They enable *transfer learning*, that is, benefiting from the performance of a model that has been trained on another task or on another corpus.

> **Exercise 3**
>
> 1.  Repeat a train/test split with 500 random lines
> 2.  Import the `all-MiniLM-L6-v2` model with the `sentence-transformers` package. Encode `X_train` and `X_test`.
> 3.  Perform a classification using a simple method, such as a linear SVM, based on the *embeddings* produced in the previous question. As the training set is small, you can perform cross-validation.
> 4.  Understand why the performance is worse than that of the naive Bayes classifier.

Answer to question 1:

``` python
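from sklearn.model_selection import train_test_split

# Draw 500 random rows; fixing the seed makes the sample reproducible
random_rows = spooky_df.sample(500, random_state=42)
y = random_rows["author"]
X = random_rows["text"]

# Stratify so that each author keeps the same share in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)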
```

Answer to question 2:

``` python
from sentence_transformers import SentenceTransformer
from sklearn.svm import LinearSVC

# Load the pre-trained model; half precision reduces memory use
model = SentenceTransformer(
    "all-MiniLM-L6-v2", model_kwargs={"torch_dtype": "float16"}
)

# Each text is turned into one fixed-size embedding vector
X_train_vectors = model.encode(X_train.values)
X_test_vectors = model.encode(X_test.values)
```

Answer to question 3:

``` python
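import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Linear SVM trained on the sentence embeddings
clf = LinearSVC(max_iter=10000, C=0.1, dual="auto")

# 4-fold cross-validation on the training set
scores = cross_val_score(
    clf, X_train_vectors, y_train,
    cv=4, scoring='f1_micro', n_jobs=4
)

print(f"CV scores {scores}")
print(f"Mean F1 {np.mean(scores)}")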
```

**But why, with a very complicated method, can’t we beat a very simple one?**

There are several possible reasons:

-   TF-IDF is a simple model, but it still performs very well
    (this is known as a ‘tough-to-beat baseline’).
-   the classification of authors is a very specific and arduous task,
    which does not do justice to the *embeddings*. As we said earlier, the latter are particularly relevant when it comes to semantic similarity between texts (*clustering*, etc.).

In the case of our classification task, it is likely that
a few distinctive words (character names, place names) are enough to classify a text correctly.
This signal is poorly captured by *embeddings*, which give all words the same importance.
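To see why distinctive words weigh so much in a TF-IDF representation, here is a minimal sketch of the inverse document frequency on a made-up corpus. It uses the textbook formula $\log(N / \text{df})$. Libraries such as `scikit-learn` apply a smoothed variant by default, so the exact values would differ. The sentences and the helper `idf` are hypothetical.

```python
import math

# Hypothetical toy corpus: a character name appears in a single document
docs = [
    "the night was dark and the house was silent",
    "the old house stood by the sea",
    "raymond walked into the night",
]
tokenized = [doc.split() for doc in docs]


def idf(term, documents):
    """Textbook inverse document frequency: log(N / document frequency)."""
    df = sum(term in doc for doc in documents)
    return math.log(len(documents) / df)


for term in ["the", "house", "raymond"]:
    print(term, round(idf(term, tokenized), 3))
# the 0.0
# house 0.405
# raymond 1.099
```

A word present in every document gets a weight of zero, while a name that appears in a single document gets the highest weight. A plain average of word vectors has no such mechanism to single out these words.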

Mikolov, Tomas. 2013. “Efficient Estimation of Word Representations in Vector Space.” *arXiv Preprint arXiv:1301.3781*.

“The Hugging Face Course, 2022.” 2022. <https://huggingface.co/course>.

Vaswani, A. 2017. “Attention Is All You Need.” *Advances in Neural Information Processing Systems*.

[1] Some organizations, like Meta for Llama, make the post-training weights of their models available on the *Hugging Face* platform, allowing these models to be reused if the license permits. Nevertheless, these are not *open source* models, since the code used to train the models and to build the training corpora (derived from massive data collection by *webscraping*), as well as any additional annotations used to make specialized versions, are not shared.