# NLP & representation learning: Neural Embeddings, Text Classification


To use statistical classifiers on text, the text must first be vectorized. In the first practical session we explored the **Bag of Words (BoW)** model.

Modern **state-of-the-art** methods use embeddings to vectorize the text before classification, which avoids manual feature engineering.

## [Dataset](https://thome.isir.upmc.fr/classes/RITAL/json_pol.json)


## "Modern" NLP pipeline

In contrast to the **bag-of-words** model, the modern NLP pipeline represents everything as **embeddings**. Instead of encoding a text as a **sparse vector** of length $D$ (the size of the feature dictionary), the goal is to encode it as a meaningful dense vector of much smaller size $|e| \ll D$.
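To make the contrast concrete, here is a minimal sketch of both encodings on a toy dictionary. The names `vocab`, `bow_vector`, `dense_vector` and the randomly initialised `embedding_table` are illustrative assumptions, not part of the practical: a real embedding table would be learnt, and $D$ would be in the tens of thousands.

```python
import numpy as np

# Toy feature dictionary (assumption for illustration): D = 5
vocab = ["a", "film", "great", "movie", "terrible"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def bow_vector(tokens):
    """Sparse encoding: one count per dictionary entry, length D."""
    vector = np.zeros(len(vocab))
    for token in tokens:
        if token in word_to_index:
            vector[word_to_index[token]] += 1
    return vector

# Random stand-in for a learnt embedding table of shape (D, |e|), with |e| = 2
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 2))

def dense_vector(tokens):
    """Dense encoding: average of the embedding rows of the known tokens, length |e|."""
    indices = [word_to_index[t] for t in tokens if t in word_to_index]
    if not indices:
        return np.zeros(embedding_table.shape[1])
    return embedding_table[indices].mean(axis=0)

tokens = "a great movie".split()
print(bow_vector(tokens))    # length D, mostly zeros for a real dictionary
print(dense_vector(tokens))  # length |e|, every coordinate is used
```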


The raw classification pipeline is then the following:

```
raw text ---|embedding table|-->  vectors --|Neural Net|--> class
```


### Using a language model:

How to tokenize the text and extract a feature dictionary is still a manual choice. To obtain meaningful embeddings directly, it is common to use a pre-trained language model such as `word2vec`, which we explore in this practical.

In this setting, the pipeline becomes the following:
```
      
raw text ---|(pre-trained) Language Model|--> vectors --|classifier (or fine-tuning)|--> class
```


- #### Classic word embeddings

 - [Word2Vec](https://arxiv.org/abs/1301.3781)
 - [GloVe](https://nlp.stanford.edu/projects/glove/)


- #### Bleeding-edge language model techniques (see next)

 - [ULMFiT](https://arxiv.org/abs/1801.06146)
 - [ELMo](https://arxiv.org/abs/1802.05365)
 - [GPT](https://blog.openai.com/language-unsupervised/)
 - [BERT](https://arxiv.org/abs/1810.04805)






### Goal of this session:

1. Train word embeddings on the training dataset
2. Tinker with the learnt embeddings and inspect the relations they capture
3. Tinker with pre-trained embeddings.
4. Use those embeddings for classification
5. Compare different embedding models

## STEP 0: Loading data

In [39]:
import json
from collections import Counter

# Loading json
file = './json_pol.json'

with open(file,encoding="utf-8") as f:
    data = json.load(f)

# Counter is a class from the collections module that counts the occurrences of each element
# in a collection (list, tuple, etc.). It behaves like a dictionary whose keys are the unique
# elements of the collection and whose values are their respective counts.

# Quick Check
print(data[:3])
counter = Counter((x[1] for x in data))
print(counter)
print("Number of reviews : ", len(data))
print("----> # of positive : ", counter[1])
print("----> # of negative : ", counter[0])
print("")
print(data[0])

[['Although credit should have been given to Dr. Seuess for stealing the story-line of "Horton Hatches The Egg", this was a fine film. It touched both the emotions and the intellect. Due especially to the incredible performance of seven year old Justin Henry and a script that was sympathetic to each character (and each one\'s predicament), the thought provoking elements linger long after the tear jerking ones are over. Overall, superior acting from a solid cast, excellent directing, and a very powerful script. The right touches of humor throughout help keep a "heavy" subject from becoming tedious or difficult to sit through. Lastly, this film stands the test of time and seems in no way dated, decades after it was released.', 1], ["This was one of those films I would always come across (be it on TV or cheap DVD), but never struck me to give it a shot as I thought I wasn't missing out on much. It was on one night and I thought oh well\x85 why not. A good decision too, as I would kick mys

## Word2Vec: Quick Recap

**[Word2Vec](https://arxiv.org/abs/1301.3781) is composed of two distinct language models (CBOW and SG), optimized to quickly learn word vectors**


Given an example sentence: `i'm taking the dog out for a walk`



### (a) Continuous Bag of Words (CBOW)

- predicts a word given its context

maximizing `p(dog | i'm taking the ___ out for a walk)`

### (b) Skip-Gram (SG)

- predicts the context given a word

maximizing `p(i'm taking the out for a walk | dog)`
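The difference between the two objectives is easiest to see in the training examples each one produces. The sketch below builds them for the sentence above, assuming a fixed context window of 2 words on each side. The helper `context_windows` is hypothetical, and real Word2Vec implementations add refinements (randomly shrunk windows, frequent-word subsampling, negative sampling) that are left out here.

```python
def context_windows(tokens, window=2):
    """Yield (center, context) pairs, where context holds up to `window` tokens on each side."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

tokens = "i'm taking the dog out for a walk".split()

# CBOW: one example per position, (all context words) -> center word
cbow_examples = [(context, center) for center, context in context_windows(tokens)]

# Skip-gram: one example per (center word, single context word) pair
sg_examples = [(center, c) for center, context in context_windows(tokens) for c in context]

print(cbow_examples[3])  # the example whose target is "dog"
print([pair for pair in sg_examples if pair[0] == "dog"])
```

Skip-gram produces several examples per position where CBOW produces one, so it sees more training examples for the same text.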



   

## STEP 1: Train a language model (word2vec)

Gensim provides one of the fastest [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) implementations.


### Train:

In [4]:
# if gensim not installed yet
! pip install gensim --break-system-packages

Defaulting to user installation because normal site-packages is not writeable


In [9]:
import gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

text = [t.split() for t,p in data]

# all the parameters below are gensim's defaults, except sg (default: sg=0, i.e. CBOW)
w2v = gensim.models.word2vec.Word2Vec(sentences=text,
                                vector_size=100, window=5,
                                min_count=5,
                                sample=0.001, workers=3,
                                sg=1, hs=0, negative=5,        ### sg=1 trains a skip-gram model; set sg to 0 to train a cbow model
                                cbow_mean=1, epochs=5)

2026-02-16 17:01:39,667 : INFO : collecting all words and their counts
2026-02-16 17:01:39,667 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2026-02-16 17:01:40,052 : INFO : PROGRESS: at sentence #10000, processed 2301366 words, keeping 153853 word types
2026-02-16 17:01:40,442 : INFO : PROGRESS: at sentence #20000, processed 4553558 words, keeping 240043 word types
2026-02-16 17:01:40,638 : INFO : collected 276678 word types from a corpus of 5713167 raw words and 25000 sentences
2026-02-16 17:01:40,638 : INFO : Creating a fresh vocabulary
2026-02-16 17:01:40,791 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 48208 unique words (17.42% of original 276678, drops 228470)', 'datetime': '2026-02-16T17:01:40.791101', 'gensim': '4.4.0', 'python': '3.12.3 (main, Jan 22 2026, 20:57:42) [GCC 13.3.0]', 'platform': 'Linux-6.8.0-100-generic-x86_64-with-glibc2.39', 'event': 'prepare_vocab'}
2026-02-16 17:01:40,791 : INFO : Word2Vec lifecycle ev

In [10]:
# It is worth saving the embeddings we just trained
w2v.save("W2v-movies.dat")
# You will be able to reload them:
# w2v = gensim.models.Word2Vec.load("W2v-movies.dat")
# and you can continue the learning process if needed

2026-02-16 17:03:28,618 : INFO : Word2Vec lifecycle event {'fname_or_handle': 'W2v-movies.dat', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2026-02-16T17:03:28.618267', 'gensim': '4.4.0', 'python': '3.12.3 (main, Jan 22 2026, 20:57:42) [GCC 13.3.0]', 'platform': 'Linux-6.8.0-100-generic-x86_64-with-glibc2.39', 'event': 'saving'}
2026-02-16 17:03:28,619 : INFO : not storing attribute cum_table
2026-02-16 17:03:28,694 : INFO : saved W2v-movies.dat


In [None]:
w2v = gensim.models.Word2Vec.load("W2v-movies.dat") # reload the saved model to avoid re-running the fairly long training

## STEP 2: Test learnt embeddings

The word embedding space directly encodes similarities between words: the vector coding for the word "great" will be closer to the vector coding for "good" than to the one coding for "bad". Generally, [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) is the measure used to compare vectors.

KeyedVectors have a built-in [similarity](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.BaseKeyedVectors.similarity) method to compute the cosine similarity between words.
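For reference, cosine similarity is the dot product of the two vectors divided by the product of their norms. A minimal NumPy sketch follows; the function name `cosine_similarity` is our own helper, not Gensim's API.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, 0 = orthogonal, -1 = opposite."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity([1, 2], [2, 4]))   # parallel vectors
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal vectors
```

With a trained model, `cosine_similarity(w2v.wv["great"], w2v.wv["good"])` should agree with `w2v.wv.similarity("great", "good")` up to floating-point error.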

In [14]:
# is great really closer to good than to bad ?
print("great and good:",w2v.wv.similarity("great","good"))
print("great and bad:",w2v.wv.similarity("great","bad"))
print("car and vehicle:",w2v.wv.similarity("car","vehicle"))
print("boy and girl:",w2v.wv.similarity("girl","boy"))
print("boy and man:",w2v.wv.similarity("man","boy"))

great and good: 0.77287775
great and bad: 0.45807964
car and vehicle: 0.36336285
boy and girl: 0.835626
boy and man: 0.749683


Since cosine similarity encodes semantic proximity, neighboring words are supposed to be similar. The [most_similar](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.BaseKeyedVectors.most_similar) method returns the `topn` words closest to a query.

In [15]:
# The query can be as simple as a word, such as "movie"

# Try changing the word
w2v.wv.most_similar("movie",topn=5) # 5 most similar words
#w2v.wv.most_similar("awesome",topn=5)
#w2v.wv.most_similar("actor",topn=5)

[('film', 0.9285252690315247),
 ('"movie"', 0.81612628698349),
 ('movie,', 0.7630794644355774),
 ('flick', 0.7629202008247375),
 ('"film"', 0.7519605159759521)]

But the query can be more complicated: word embedding spaces tend to encode much more than plain similarity.

The most famous example is: `vec(king) - vec(man) + vec(woman) => vec(queen)`
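The arithmetic behind this example can be sketched on hand-made 2D vectors, where the first coordinate loosely stands for "royalty" and the second for "gender". The vectors and the `analogy` helper are invented for illustration; Gensim's `most_similar` works on length-normalised vectors, but likewise excludes the query words from the candidates.

```python
import numpy as np

# Hand-made toy embeddings (assumption): coordinates are (royalty, gender)
toy_vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def analogy(positive, negative, vectors):
    """Return the word closest (by cosine) to sum(positive) - sum(negative), query words excluded."""
    query = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    excluded = set(positive) | set(negative)

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    candidates = [w for w in vectors if w not in excluded]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors))
```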

In [24]:
# What is awesome - good + bad ?
#w2v.wv.most_similar(positive=["awesome","bad"],negative=["good"],topn=3)

#w2v.wv.most_similar(positive=["actor","woman"],negative=["man"],topn=3) # does the famous example work for actor?

w2v.wv.most_similar(positive=["king","woman","girl"],negative=["man"],topn=10)
# Try other things, like plurals for example.

[('princess', 0.7496501803398132),
 ('witch', 0.7445906400680542),
 ('girl,', 0.7398815751075745),
 ('nun', 0.7359115481376648),
 ('prostitute', 0.7325125336647034),
 ('hooker', 0.7285485863685608),
 ('lady', 0.7254506945610046),
 ('girl"', 0.719394326210022),
 ('baby,', 0.7187190055847168),
 ('Amudha', 0.7157213091850281)]

**To test learnt "syntactic" and "semantic" similarities, Mikolov et al. introduced a dedicated dataset containing a wide variety of analogy questions (`a` is to `b` as `c` is to `?`).**

**You can download the dataset [here](https://thome.isir.upmc.fr/classes/RITAL/questions-words.txt).**

In [20]:
out = w2v.wv.evaluate_word_analogies("questions-words.txt",case_insensitive=True)  #original semantic syntactic dataset.

2026-02-16 17:08:40,075 : INFO : Evaluating word analogies for top 300000 words in the model on questions-words.txt
2026-02-16 17:08:40,288 : INFO : capital-common-countries: 2.2% (2/90)
2026-02-16 17:08:40,441 : INFO : capital-world: 2.8% (2/71)
2026-02-16 17:08:40,510 : INFO : currency: 0.0% (0/28)
2026-02-16 17:08:41,111 : INFO : city-in-state: 0.0% (0/329)
2026-02-16 17:08:41,696 : INFO : family: 32.5% (111/342)
2026-02-16 17:08:43,246 : INFO : gram1-adjective-to-adverb: 2.4% (22/930)
2026-02-16 17:08:44,201 : INFO : gram2-opposite: 3.8% (21/552)
2026-02-16 17:08:46,643 : INFO : gram3-comparative: 20.7% (261/1260)
2026-02-16 17:08:47,835 : INFO : gram4-superlative: 8.0% (56/702)
2026-02-16 17:08:49,127 : INFO : gram5-present-participle: 19.7% (149/756)
2026-02-16 17:08:50,699 : INFO : gram6-nationality-adjective: 3.3% (26/792)
2026-02-16 17:08:52,910 : INFO : gram7-past-tense: 16.2% (204/1260)
2026-02-16 17:08:54,428 : INFO : gram8-plural: 5.3% (43/812)
2026-02-16 17:08:55,873 : IN

**The w2v model trained on the review dataset does not perform very well on this benchmark, since it was learnt from a comparatively small amount of data.**
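`evaluate_word_analogies` returns a pair `(score, sections)`: the overall accuracy and a list of per-section dictionaries with the keys `section`, `correct` and `incorrect`. The sketch below turns that list into per-section accuracies, which makes it easier to compare two models. The helper `accuracy_by_section` and the `toy_sections` stand-in data are our own, used here so the example runs without a trained model.

```python
def accuracy_by_section(sections):
    """Map each section name to its accuracy; sections with no evaluated question are skipped."""
    result = {}
    for section in sections:
        n_correct = len(section["correct"])
        n_total = n_correct + len(section["incorrect"])
        if n_total > 0:
            result[section["section"]] = n_correct / n_total
    return result

# Stand-in data with the same structure as the `sections` list (not real results)
toy_sections = [
    {"section": "family",
     "correct": [("BOY", "GIRL", "KING", "QUEEN")],
     "incorrect": [("BOY", "GIRL", "MAN", "WOMAN"),
                   ("HE", "SHE", "KING", "QUEEN"),
                   ("HE", "SHE", "MAN", "WOMAN")]},
    {"section": "currency", "correct": [], "incorrect": []},
]

print(accuracy_by_section(toy_sections))
```

With the cell above, this would be used as `score, sections = out` followed by `accuracy_by_section(sections)`.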


## Word2vec visualisation

In [22]:
w2v.wv.most_similar(positive=["France","Paris"],topn=3)

[('France,', 0.8523359298706055),
 ('Venice,', 0.8267824053764343),
 ('London', 0.8234754800796509)]

## STEP 3: Loading a pre-trained model

In Gensim, embeddings are loaded and can be used via the ["KeyedVectors"](https://radimrehurek.com/gensim/models/keyedvectors.html) class

> Since trained word vectors are independent from the way they were trained (Word2Vec, FastText, WordRank, VarEmbed etc), they can be represented by a standalone structure, as implemented in this module.

>The structure is called “KeyedVectors” and is essentially a mapping between entities and vectors. Each entity is identified by its string id, so this is a mapping between {str => 1D numpy array}.

>The entity typically corresponds to a word (so the mapping maps words to 1D vectors), but for some models, they key can also correspond to a document, a graph node etc. To generalize over different use-cases, this module calls the keys entities. Each entity is always represented by its string id, no matter whether the entity is a word, a document or a graph node.

**You can download the pre-trained word embedding [HERE](https://thome.isir.upmc.fr/classes/RITAL/word2vec-google-news-300.dat) .**

In [35]:
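from gensim.models import KeyedVectors
import gensim.downloader as api

bload = False  # set to True once the .dat file has been saved locally
fname = "word2vec-google-news-300"
sdir = "" # Change to the directory where the embeddings are stored

if bload:
    # load the embeddings previously saved on disk
    wv_pre_trained = KeyedVectors.load(sdir+fname+".dat")
else:
    # download the embeddings (long), then save them for later runs
    wv_pre_trained = api.load(fname)
    wv_pre_trained.save(sdir+fname+".dat")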


2026-02-16 17:20:40,276 : INFO : Creating /home/nilsbarrellon/gensim-data




2026-02-16 17:32:46,301 : INFO : word2vec-google-news-300 downloaded
2026-02-16 17:32:46,305 : INFO : loading projection weights from /home/nilsbarrellon/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz
2026-02-16 17:33:18,236 : INFO : KeyedVectors lifecycle event {'msg': 'loaded (3000000, 300) matrix of type float32 from /home/nilsbarrellon/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz', 'binary': True, 'encoding': 'utf8', 'datetime': '2026-02-16T17:33:18.236641', 'gensim': '4.4.0', 'python': '3.12.3 (main, Jan 22 2026, 20:57:42) [GCC 13.3.0]', 'platform': 'Linux-6.8.0-100-generic-x86_64-with-glibc2.39', 'event': 'load_word2vec_format'}
2026-02-16 17:33:18,237 : INFO : KeyedVectors lifecycle event {'fname_or_handle': 'word2vec-google-news-300.dat', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2026-02-16T17:33:18.237482', 'gensim': '4.4.0', 'python': '3.12.3 (main, Jan 22 2026, 20:57:42) [GCC 13.3.0]', 'platform': '

**Perform the "syntactic" and "semantic" evaluations again. What do you conclude about the pre-trained embeddings?**

## STEP 4: Sentiment classification

In the previous practical session, we used a bag-of-words approach to transform text into vectors.
Here, we propose to use word vectors instead (previously learnt or loaded).


### <font color='green'> Since we only have word vectors and sentences are made of multiple words, we need to aggregate them. </font>


### (1) Vectorize reviews using word vectors:

Word aggregation can be done in different ways:

- Sum
- Average
- Min/feature
- Max/feature

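#### A few pointers:

- `w2v.wv.key_to_index` is a `dict` mapping each word of the vocabulary (all the words known to your model) to its index; use `word in w2v.wv.key_to_index` to test membership
- `np.minimum(a,b)` and `np.maximum(a,b)` respectively return the element-wise min and max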
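The four aggregations listed above can be sketched in a single helper operating on the list of word vectors of one review. The function `aggregate` and its `how` argument are illustrative names, not part of Gensim or of the practical.

```python
import numpy as np

def aggregate(word_vectors, how="sum"):
    """Combine a non-empty list of word vectors (each of size dim) into one vector of size dim."""
    matrix = np.asarray(word_vectors, dtype=float)  # shape: (n_words, dim)
    if how == "sum":
        return matrix.sum(axis=0)
    if how == "mean":
        return matrix.mean(axis=0)
    if how == "min":
        return matrix.min(axis=0)  # per-feature minimum, same as chaining np.minimum
    if how == "max":
        return matrix.max(axis=0)  # per-feature maximum, same as chaining np.maximum
    raise ValueError(f"unknown aggregation: {how}")

toy_review = [[1.0, -2.0], [3.0, 4.0]]  # two "words" in a 2D embedding space
for how in ("sum", "mean", "min", "max"):
    print(how, aggregate(toy_review, how))
```

Unlike the sum, the mean, min and max do not grow with the length of the review, which may matter when reviews have very different lengths.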

In [54]:
import numpy as np
# We first need to vectorize the text:
# we start by summing the word vectors of each review
# Useful libraries
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split


def vectorize(text, w2v, mean=False):
    """
    Vectorize a text using a Word2Vec model.
    :param text: Text to vectorize (str).
    :param w2v: Trained Word2Vec model (gensim.models.Word2Vec).
    :param mean: If True, return the mean of the word vectors. Otherwise, return their sum.
    :return: NumPy vector (np.array) representing the text.
    """
    vectors = []
    for word in text.split():
        if word in w2v.wv.key_to_index:  # Check that the word is in the vocabulary
            vectors.append(w2v.wv[word])  # Add the word's vector

    if not vectors:  # No word of the text is in the vocabulary
        return np.zeros(w2v.vector_size)  # Return a zero vector

    if mean:
        return np.mean(vectors, axis=0)  # Mean of the vectors
    else:
        return np.sum(vectors, axis=0)  # Sum of the vectors
# split the dataset into 80% training data and 20% test data

[train, test]  = train_test_split(data, test_size=0.2, random_state=42, shuffle=True)
# training labels
classes = [pol for text,pol in train]
# training data
X = [vectorize(text,w2v) for text,pol in train]
# test data
X_test = [vectorize(text,w2v) for text,pol in test]
# test labels
true = [pol for text,pol in test]

#let's see what a review vector looks like.
print(X[0])

[  -3.6030931   81.09755     -5.5625744  -32.537807   -13.523959
 -149.08589     46.395557   178.24171    -49.124313   -56.371647
   -6.567054  -116.84839    -30.888432    59.093906    37.168816
  -65.31096     54.803596   -30.867128   -46.19165   -209.28311
   30.657885   -39.241558   146.07489    -56.333878    -1.6967841
   65.81456    -60.338875    -2.684317  -114.31843     37.438686
   85.73729      4.8225527   -4.6872067 -117.13731    -15.731178
   21.141417    -0.8496002  -82.54633    -68.87561   -157.08452
    2.7410727 -131.53127    -72.49354     48.381435    60.973804
  -30.774494   -56.276104    -8.592021    59.7097      24.78802
   55.970375   -94.613655    37.70264     12.96038    -76.094986
    7.6361637   42.042484     4.9743958  -74.76654     28.26746
   28.929958  -110.046394    48.576397    61.086758  -115.15952
  113.765495     2.3422656  129.27287   -124.883194    77.206985
  -41.333717    91.983795    89.56144      5.6670938   89.5685
   45.964237    31.110245     8

### (2) Train a classifier
As in the previous practical session, train a logistic regression to perform sentiment classification with word vectors.



In [55]:


from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,recall_score, f1_score


#Logistic Regression
t = 1e-8
C=100.0
lr_clf = LogisticRegression(random_state=0, solver='liblinear',max_iter=100, tol=t, C=C)
lr_clf.fit(X, classes)

# predictions on the train and test sets

pred_lrt = lr_clf.predict(X)
pred_lr = lr_clf.predict(X_test)

# Metrics for logistic regression
print("Logistic Regression")
print(f"Logistic Regression accuracy train={accuracy_score(classes, pred_lrt):.4f}, test={accuracy_score(true, pred_lr):.4f}")
print(f"Logistic Regression recall train={recall_score(classes, pred_lrt, average='macro'):.4f}, test={recall_score(true, pred_lr, average='macro'):.4f}")
print(f"Logistic Regression F1-score train={f1_score(classes, pred_lrt, average='macro'):.4f}, test={f1_score(true, pred_lr, average='macro'):.4f}")
print("--------------------------------------------------------------")




Logistic Regression
Logistic Regression accuracy train=0.8297, test=0.8290
Logistic Regression recall train=0.8297, test=0.8290
Logistic Regression F1-score train=0.8297, test=0.8290
--------------------------------------------------------------


Performance should be worse than with the bag-of-words model (around 80% accuracy here). Sum/mean aggregation does not work well on long reviews, especially those containing many frequent words: every word contributes equally, which adds a lot of noise.

## **Todo**:  Try answering the following questions:

- Which word2vec model works best: skip-gram or CBOW?
- Do pre-trained vectors work better than those learnt on the training dataset?

## **Todo**: evaluate the same pipeline on the speaker ID task (Chirac/Mitterrand)


**(Bonus)** To get better accuracy, we could try a few things:
- Better aggregation methods (e.g. weighting by tf-idf)
- Another word vectorizing method such as [fasttext](https://radimrehurek.com/gensim/models/fasttext.html)
- A document vectorizing method such as [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html)
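As a starting point for the tf-idf idea, here is a minimal sketch of an IDF-weighted average, so that rare words count more than frequent ones. The helpers `compute_idf` and `idf_weighted_mean` are hypothetical. The IDF formula is a smoothed variant, `ln((1 + n) / (1 + df)) + 1`, chosen as an assumption. Any mapping from word to vector can be passed as `vectors`, for instance a plain `dict` as below or `w2v.wv`.

```python
import math
from collections import Counter

import numpy as np

def compute_idf(tokenized_docs):
    """Smoothed inverse document frequency of every token seen in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}

def idf_weighted_mean(tokens, vectors, idf, dim):
    """Average the vectors of the known tokens, each occurrence weighted by its IDF."""
    total = np.zeros(dim)
    weight_sum = 0.0
    for tok in tokens:
        if tok in vectors and tok in idf:
            total += idf[tok] * np.asarray(vectors[tok], dtype=float)
            weight_sum += idf[tok]
    return total / weight_sum if weight_sum > 0 else total

# Toy corpus and toy 2D embeddings (assumptions for illustration)
docs = [["good", "movie"], ["bad", "movie"]]
toy_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
idf = compute_idf(docs)

# "movie" appears in every document, so it weighs less than "good"
print(idf_weighted_mean(["good", "movie"], toy_vectors, idf, dim=2))
```

Because each occurrence of a token is added separately, repeated words contribute in proportion to their term frequency, which gives the tf part of the weighting.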