**All Rights Reserved**

**Copyright (c) 2025 IRT Saint-Exupery**

*Author & contact:* 
* mouhcine.mendil@irt-saintexupery.com 

# Introduction to Natural Language Processing (NLP)

<div align="center">
    <h2>Lab Sessions 1 & 2</h2>
</div>

By now, you may have encountered various data structures such as tables, images, and time series. But have you ever wondered how computers can understand human language? This is where Natural Language Processing (NLP) comes in: it is one of the most popular fields in AI today. Thanks to recent developments in foundation models that rely on transformer architectures, NLP capabilities have seen revolutionary advancements. In this tutorial, you will get an introduction to this exciting field that combines computer science, linguistics, and machine learning. We will explore the world of text data and provide you with the skills necessary to process it and create models that can unlock knowledge and extract valuable insights.

## 0. Setup

<div class="alert alert-block alert-warning">

Text often contains special formatting characters that control rendering, such as line breaks and symbols.

⚠️ Use <code>print</code> for raw text output and the function <code>printmd</code> (defined below) to have a beautiful markdown rendering when needed.

</div>


**NLP tools requirements:**

* Natural Language Toolkit (NLTK) is one of the leading platforms for building Python programs to work with human language data.

* Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Its target audience is the NLP and information retrieval (IR) community.


In [None]:
# Run this cell to install requirements
%pip install --user --upgrade nltk pandas gensim seaborn scikit-learn textblob regex tqdm

In [None]:
from IPython.display import Markdown, display
from copy import deepcopy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from tqdm import tqdm

nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger")
nltk.download("punkt")

from nltk.corpus import stopwords


def printmd(string):
    display(Markdown(string))

# 1. Sentiment Analysis of IMDb Movie Reviews

One of the most common applications of NLP is text classification. Therefore, you will kick off this notebook by addressing the task of **sentiment analysis**. The aim is to identify the positive or negative sentiment/opinion expressed in a text.

## 1.1 IMDb Dataset 

The IMDb dataset contains 50,000 movie reviews in English, each labelled as positive or negative.
You can download the dataset from [Kaggle](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) (you need to sign in).

<div class='alert alert-info'>

<b> Exercise 1.1 </b>

- Load the dataset in a dataframe. Note the feature column (review) and target column (sentiment).
- How many samples are there?
- Plot the target distribution. Is the dataset balanced?
- Transform the target values to binary (i.e., "positive" -> 1 and "negative" -> 0). Use [`np.where`](https://numpy.org/doc/stable/reference/generated/numpy.where.html).

</div>
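
As a minimal illustration of how `np.where` behaves, here is a sketch on a made-up array of labels (the array `toy_labels` is hypothetical and is not taken from the IMDb data):

```python
import numpy as np

# Hypothetical toy labels, not taken from the IMDb dataset
toy_labels = np.array(["positive", "negative", "negative", "positive"])

# np.where(condition, value_if_true, value_if_false) works element-wise
toy_binary = np.where(toy_labels == "positive", 1, 0)
print(toy_binary)  # [1 0 0 1]
```

The same call accepts a pandas column as the condition, since the comparison is evaluated element-wise.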

In [None]:
## TODO: load csv file
df_raw = ...

## TODO: count the number of samples
...

## TODO: plot the distribution of sentiments
...

In [None]:
## TODO: binarize the "sentiment" column

df_raw["sentiment"] = ...

In [None]:
assert set(df_raw["sentiment"]) == set(
    {0, 1}
), "You still have non-binarized values, try again"

## 1.2 Data cleaning

Textual data require special handling in the pre-processing phase. 


<div class='alert alert-info'>

<b> Exercise 1.2.1.1 </b>

- Take a look at the first two reviews using <code>print</code> and <code>printmd</code>. What do you notice?
</div>

In [None]:
# Let's take a look at the first two reviews using print and printmd

# First review
printmd("### First review")
...  # print
...  # printmd

# Second review
printmd("-------")
printmd("### Second review")
...  # print
...  # printmd

<div class="alert alert-info">

<h4><b>Intermediary Exercises</b></h4>

Strings in Python are sequences of characters, which allows us to manipulate and process them in various ways. Let's explore:

<ol>
  <li>
    <b>Extract characters:</b> 
    Print the list of <strong>characters</strong> that form the first review.
  </li>
  <li>
    <b>Extract words:</b> 
    If we want to extract words instead of characters, we can use the 
    <a href="https://docs.python.org/3/library/stdtypes.html#str.split" target="_blank"><code>split</code></a> 
    method. By specifying a suitable separator (e.g., a space for English and French), we can break the string into words.
    <ul>
      <li>Print the list of <strong>words</strong> forming the first review.</li>
    </ul>
  </li>
  <li>
    <b>Rebuild the original Review:</b> 
    Concatenate the extracted list of words back into the original sentence using the 
    <a href="https://docs.python.org/3/library/stdtypes.html#str.join" target="_blank"><code>join</code></a> 
    method. Use a space to interleave each word.
  </li>
</ol>

</div>
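
One detail is worth keeping in mind when rebuilding a text from its words. The sketch below uses a made-up sentence (`toy_text` is hypothetical) to show that splitting on an explicit space and splitting with no argument do not behave the same way when the text contains repeated spaces:

```python
toy_text = "A  wonderful little film"  # note the double space after "A"

# Splitting on an explicit space keeps empty strings for repeated spaces,
# so joining with a space restores the text exactly
words_explicit = toy_text.split(" ")
rebuilt_explicit = " ".join(words_explicit)

# Splitting with no argument collapses runs of whitespace,
# so the round trip is no longer exact
words_default = toy_text.split()
rebuilt_default = " ".join(words_default)

print(words_explicit)
print(words_default)
print(rebuilt_explicit == toy_text, rebuilt_default == toy_text)
```

If the rebuilt review must be identical to the original, the explicit separator is the safer choice.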

In [None]:
# 1. Print the characters of the first review
char_list = []
...  # extract characters from the first review
printmd("### Characters of the first review")
print(char_list)

# 2. Print the words of the first review
first_review_words = ...  # extract words from the first review
printmd("### Words of the first review")
print(first_review_words)

# 3. Rebuild the first review from the list of words
first_review_rebuilt = ...
printmd("### Rebuilt first review")
print(first_review_rebuilt)

assert (
    first_review_rebuilt == df_raw["review"][0]
), "The review is not correctly rebuilt"

<div class='alert alert-info'>

<b> Exercise 1.2.1.2 </b>

- Text data usually contains artifacts only relevant for visualization (e.g., HTML tags and special characters) and undesired content such as spelling mistakes. Such elements are not only useless for modeling but can also be harmful as they pollute the relevant information. For each of the following data cleaning operations, write a function to apply it to the IMDb dataset:    
    1. Lowering all capital letters.
    2. Using the [regex](https://docs.python.org/3/library/re.html) Python library (`re.sub`), replacing with a space the patterns associated with hyperlinks ("https:something" or "http:something"), mentions ("@something") and HTML elements ("\<something>" or "\</something>").
    3. Removing stop words, i.e., commonly occurring words in a language that are considered to carry little meaning on their own.
        * Using <code>stopwords.words</code>, list all stop words in English. 
        * Delete the stop words from the IMDb movie reviews (you can use [`str.replace`](https://docs.python.org/3/library/stdtypes.html#str.replace)).
    4. Replacing with a space any special character, that is, anything other than alphanumeric characters, such as punctuation and symbols (you can use [isalnum](https://docs.python.org/3/library/stdtypes.html#str.isalnum)).
</div>
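
The following sketch illustrates two of these operations on made-up strings. The text, the pattern `toy_pattern` and the two-word stop list are assumptions chosen for illustration; they are not the notebook's reference solution. It also shows a pitfall: `str.replace` works on substrings, so it can damage words that merely contain a stop word.

```python
import re

toy_text = "Loved it!<br />See https://example.com or ask @someone"

# Replace hyperlinks, mentions and HTML tags with a space
toy_pattern = r"https?:\S+|@\S+|<[^<]+?>"
without_patterns = re.sub(toy_pattern, " ", toy_text)
print(without_patterns)

# Pitfall: str.replace removes "is" inside "this" and "history" as well
damaged = "this is history".replace("is", "")
print(damaged)

# Filtering a list of words only removes exact matches
toy_stop_words = {"this", "is"}  # hypothetical list, much shorter than NLTK's
kept = [w for w in "this is history".split(" ") if w not in toy_stop_words]
filtered = " ".join(kept)
print(filtered)
```

Working at the word level, or padding the stop word with spaces before replacing it, avoids this kind of damage.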

In [None]:
def lower_case(text):
    # Lowercase text
    lowered_text = ...
    return lowered_text


def remove_patterns(text):
    # replace hyperlinks, mentions and html elements
    # Raw string so that \S reaches the regex engine; "https?" covers both http and https
    patterns = r"https?:\S+|@\S+|<[^<]+?>"
    replacement_text = " "
    cleaned_text = ...
    return cleaned_text


english_stop_words = ...  # TODO
printmd("### English stop words")
print(f"There are {len(english_stop_words)} stop words in English: \n")
print(english_stop_words)


def clean_stop_words(text):
    english_stop_words = ...
    # list of the words we will keep
    clean_words = []
    # Parse text and remove stop words
    ...
    # return the text without the stop words
    return ...


def clean_non_alphanum(text):
    # Remove non alphanumeric characters
    clean_characters_list = []
    # Loop over the characters of the text to keep only the alphanumeric ones
    ...
    # join the characters to rebuild the text
    cleaned_text = ...
    return cleaned_text

<div class='alert alert-info'>

<b> Exercise 1.2.1.3 </b>
- Apply the previous operations on a copy of the raw dataframe and compare the first review before and after cleaning. Use [`apply`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) to sequentially apply the preprocessing functions (⚠️ **order is important**). 
</div>
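
To see why the order matters, consider the sketch below. The review and the two helper functions (`strip_tags`, `keep_alphanum`) are hypothetical stand-ins for the cleaning functions above. Removing special characters before HTML tags destroys the angle brackets, so the tag name is left behind as a spurious word.

```python
import re

toy_review = "GREAT movie.<br />The cast is superb"


def strip_tags(text):
    # Replace HTML tags with a space
    return re.sub(r"<[^<]+?>", " ", text)


def keep_alphanum(text):
    # Replace anything that is neither alphanumeric nor whitespace with a space
    return "".join(c if c.isalnum() or c.isspace() else " " for c in text)


# Tags first, then special characters: the tag disappears entirely
tags_first = keep_alphanum(strip_tags(toy_review))
print(tags_first)

# Special characters first: "<" and ">" are gone, so the tag name "br" survives
chars_first = strip_tags(keep_alphanum(toy_review))
print(chars_first)
```

A similar issue arises with stop words: NLTK's English list is lowercase, so lowercasing has to happen before stop word removal.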

In [None]:
# Init cleaned dataframe, we keep the copy of raw dataframe untouched
df_clean = deepcopy(df_raw)

# Apply operations
list_operations = ...
for op in tqdm(list_operations):
    df_clean.review = ...

In [None]:
# Compare the first review before and after cleaning
printmd("### First review before cleaning")
printmd(df_raw["review"][0])
printmd("-----")
printmd("### First review after cleaning")
printmd(df_clean["review"][0])

**Stemming**

<blockquote>
In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root
</blockquote>

[-- Wikipedia](https://en.wikipedia.org/wiki/Stemming)

<div align="center">
  <img src="figures/stemming.png" width="50%"/>
  <figcaption>Source: Quora</figcaption>
</div>

<div class='alert alert-info'>

<b> Exercise 1.2.2 </b>

- NLTK offers several stemmers. Write a function `apply_stemmer` to apply the `Lancaster` stemmer (a fast, rule-based stemmer for English text, known for being aggressive).
- Apply the stemmer to a copy of the cleaned dataset. Compare the sentences before and after stemming; what do you notice?

</div>

In [None]:
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()


def apply_stemmer(text):
    # Split text into words
    words = ...
    # Apply stemming to each word
    stemmed_words = ...
    # Join the stemmed words back into a string
    stemmed_text = ...
    return stemmed_text


# Init stem dataframe, we keep the cleaned dataframe untouched
df_stem = deepcopy(df_clean)
# Apply stemming
df_stem.review = ...

In [None]:
# Compare the cleaned first review before and after stemming
printmd("### Cleaned first review before stemming")
printmd(df_clean["review"][0])
printmd("-----")
printmd("### Cleaned first review after stemming")
printmd(df_stem["review"][0])

➡️ By grouping together words with similar meanings into their base form, stemming reduces the overall number of unique words the model needs to deal with. Stemming helps to normalize text data by handling variations of words due to tense, plurals, or derivational suffixes, which can be beneficial for tasks where a smaller vocabulary can improve efficiency. However, stemming can sometimes lead to the creation of non-words or words with altered meanings. 

Lemmatization is another option that is more accurate and preserves meaning better, but can be computationally more expensive.
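
The effect described above can be reproduced with a deliberately naive suffix stripper. This is a sketch only: `toy_stem` and its suffix list are hypothetical and much simpler than the Lancaster algorithm, but they show both the vocabulary reduction and the creation of non-words.

```python
# Hypothetical suffix list for illustration; this is NOT the Lancaster algorithm
TOY_SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


toy_words = ["watch", "watched", "watches", "watching", "movies", "series"]
toy_stems = [toy_stem(w) for w in toy_words]
print(toy_stems)
print(f"{len(set(toy_words))} unique words -> {len(set(toy_stems))} unique stems")
```

The four forms of "watch" collapse into one stem, while "movies" and "series" become the non-words "movi" and "seri".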

The IMDb dataset is simple enough, **so we will not use stemming or lemmatization**. But note that these two operations exist and can be helpful for more complicated datasets or tasks.

## 1.3 Data Preparation

<div class='alert alert-info'>

<b> Exercise 1.3.1 </b>

- Split the **clean data** into train (80%) and test (20%) subsets. Use [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) with a random seed $=0$.
- Ensure the train and test subsets are balanced by visualizing the distributions of sentiments in both subsets. Use a plot to confirm that the sentiment distribution is similar in the two subsets.

<p>
<b>Note:</b> Typically, datasets are split into three subsets: <strong>train</strong>, <strong>validation</strong>, and <strong>test</strong>. Since we are not performing hyperparameter tuning nor regularization in this exercise, we will omit the validation set. However, when tuning models, always reserve a portion of your data specifically for validation.
</p>

</div>
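
To make the idea of a reproducible split concrete, here is a sketch using only the standard library on made-up labels. It is not how `train_test_split` is implemented; the label list and the variable names are assumptions for illustration.

```python
import random
from collections import Counter

# Hypothetical balanced toy labels: 50 negatives (0) and 50 positives (1)
toy_labels = [0] * 50 + [1] * 50

# Shuffle the indices with a fixed seed so the split is reproducible
rng = random.Random(0)
indices = list(range(len(toy_labels)))
rng.shuffle(indices)

# First 80% of the shuffled indices for training, the rest for testing
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]

train_counts = Counter(toy_labels[i] for i in train_idx)
test_counts = Counter(toy_labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

A purely random split is only approximately balanced, which is why checking the distributions is worthwhile. `train_test_split` also offers a `stratify` parameter to preserve the class proportions in both subsets.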

In [None]:
# Import train_test_split from sklearn
...

df_train, df_test = ...

# Check the sentiment distribution of the train and test sets
fig, ax = plt.subplots(1, 2, figsize=(20, 5))
# Plot the sentiment distribution of the train set on the first subplot
...
# Plot the sentiment distribution of the test set on the second subplot
...

**Tokenization**

To build a model for our task, we need to further preprocess the text by chopping it into words or subwords called **tokens**, instead of individual characters.


<div align="center">
  <img src="figures/tokenization.png" width="40%"/>
  <figcaption>Author: Shann Khosla</figcaption>
</div>

For our task on the IMDb dataset, we will use spaces for token boundaries.

<div class="alert alert-block alert-warning">

⚠️⚠️⚠️ It's important to note that using spaces to separate words may not be appropriate for all languages. For instance, Chinese writing doesn't use spaces between words, Vietnamese uses spaces even within words, and German often combines multiple words without spaces. Even in English, spaces are not always the best way to tokenize text, as seen in examples like "hot dog" or "#funnyvideos." ⚠️⚠️⚠️

To address these issues, there are several methods to tokenize and detokenize text at the subword level. Examples include Byte Pair Encoding (BPE), Unigram language modeling (ULM), WordPiece and SentencePiece. You can find many state-of-the-art, fast and optimized tokenizers in [the tokenizers library by Hugging Face](https://huggingface.co/docs/tokenizers/index).
</div>
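
Even for English, splitting on spaces has visible limits. The sketch below compares it with a simple rule-based alternative on a made-up sentence. The regular expression is an illustrative assumption and does not reproduce the rules of NLTK or of any subword tokenizer.

```python
import re

toy_sentence = "Don't miss it, it's a must-see!"

# Whitespace tokenization leaves punctuation glued to the words
space_tokens = toy_sentence.split(" ")

# Rule-based alternative: runs of word characters, or single non-space
# symbols, become separate tokens (illustrative regex, not NLTK's rules)
regex_tokens = re.findall(r"\w+|[^\w\s]", toy_sentence)

print(space_tokens)
print(regex_tokens)
```

With spaces only, "it," and "it's" are two different tokens unrelated to "it". The regex separates punctuation but breaks contractions into fragments, which shows that tokenization always involves trade-offs.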

<div class='alert alert-info'>

<b> Exercise 1.3.2 </b>
- Perform a <strong>word-level tokenization</strong> using <code>nltk.tokenize.word_tokenize</code>.
- Analyze the behavior of <code>word_tokenize</code>. Does it simply split text based on spaces, or does it handle text in a more nuanced way? Provide an explanation based on your observations.    

</div>


In [None]:
from nltk.tokenize import word_tokenize
import nltk

nltk.download("punkt_tab")


# tokenization function
def tokenize_text(text):
    tokens = ...
    return tokens


# Init tokenized dataframes, we keep the train and test dataframes untouched
df_train_tokenized = deepcopy(df_train)
df_test_tokenized = deepcopy(df_test)

# Apply tokenization
df_train_tokenized.review = ...
df_test_tokenized.review = ...

df_train_tokenized.reset_index(inplace=True, drop=True)
df_test_tokenized.reset_index(inplace=True, drop=True)

Here's a quick overview of some important NLP terminology:

- **Document**: a single sample text in the dataset (e.g., a single movie review).

- **Corpus**: a large collection of documents (reviews) in the dataset.

- **Vocabulary**: the complete set of unique tokens in the training corpus. Because it is built from the training data only, it may miss words that appear elsewhere (Out-of-Vocabulary words), so it should be comprehensive enough to keep such cases rare.

<div class="alert alert-block alert-warning">
⚠️ Usually, a limit is applied to the vocabulary size, typically by keeping only the most frequent tokens, since very rare words are unlikely to be important for the task. Limiting the vocabulary size also reduces the number of parameters the model needs to learn.

In real-world scenarios, we cannot precisely predict the tokens we will encounter during operation. Therefore, to avoid bias, you should construct the vocabulary (and corpus) based solely on your **training set**.
</div>
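
The notions of limited vocabulary and Out-of-Vocabulary tokens can be illustrated on a few made-up documents. The sketch uses `collections.Counter` from the standard library; the documents and variable names are hypothetical.

```python
from collections import Counter

# Hypothetical tokenized training and test documents
toy_train_docs = [["good", "movie"], ["bad", "movie"], ["good", "plot", "good", "cast"]]
toy_test_doc = ["good", "soundtrack"]

# Count token frequencies over the training documents only
token_counts = Counter(token for doc in toy_train_docs for token in doc)

# Keep the 2 most frequent tokens and give each one an integer index
max_vocab_size = 2
toy_token2index = {
    token: index
    for index, (token, _) in enumerate(token_counts.most_common(max_vocab_size))
}
print(toy_token2index)  # {'good': 0, 'movie': 1}

# Tokens outside the limited vocabulary are out-of-vocabulary (OOV)
oov_tokens = [token for token in toy_test_doc if token not in toy_token2index]
print(oov_tokens)  # ['soundtrack']
```

The test document never influences the vocabulary: "soundtrack" is simply treated as unknown.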

<div class='alert alert-info'>
<b> Exercise 1.3.3 </b>

* Recover all the tokens from the training corpus and store them in a list. We recommend using [`pandas.DataFrame.explode`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html).
* Count the number of unique tokens that constitute the full vocabulary.
* Reduce the vocabulary to the 10,000 most frequent tokens.
* Store the result in a dictionary `vocab_token2index`, where each token (key) is assigned a unique integer ID (for example, a value between 0 and 9999).

</div>


In [None]:
# extract all tokens from df_train_tokenized
training_tokens = ...

# Count the number of tokens in the vocabulary
print(f"The number of tokens in the full vocabulary is: {...}")

# Keep only the 10,000 most frequent tokens
vocabulary = ...

# token to index dict
vocab_token2index = ...

# Count the number of tokens in the limited vocabulary
print(f"The number of tokens in the filtered vocabulary is: {len(vocab_token2index)}")

### Text vectorization

Text Vectorization consists of converting textual data into numerical representations. This makes it possible to process textual data in machine learning algorithms that typically work with numerical values. Common vectorization techniques include:

* **One-Hot Encoding (OHE)**: represents a token as a binary vector that has the size of the vocabulary. Each token is assigned a unique position in the vector, corresponding to its index in the vocabulary. In the relevant position, the vector has $1$ to indicate the token's presence and $0$ otherwise. This approach is simple but turns out to be inefficient for large vocabularies. For instance, if there are 50,000 tokens in the vocabulary, one-hot encoding would create a 50,000-dimensional **sparse** vector (which mostly contains zeros) per token. If we need an encoding at the document level, a strategy needs to be adopted to aggregate the one-hot representations of the tokens composing the document.  

* **Bag-of-Words (BoW)**: represents text data as a numerical vector, where each unique word in the vocabulary corresponds to a dimension in the vector space. Unlike one-hot encoding, which uses binary values to indicate the presence of a single token, BoW counts the occurrences of each token in a document. This representation summarizes the document's content while ignoring token order, context, and semantics.

* **Term Frequency-Inverse Document Frequency (TF-IDF)**: a numerical statistic used to evaluate the importance of a token (word or term) in a document relative to a collection of documents (corpus). It scores a token by multiplying its Term Frequency (TF) by its Inverse Document Frequency (IDF):
    * Term Frequency (TF): The frequency of a token $t$ in a document $d$, normalized by the total number of tokens in the document $d$:
    $$TF(t, d)= \frac{\text{Number of times the token $t$ appears in the document $d$}}{\text{Total number of tokens in document $d$}}$$
    * Inverse Document Frequency (IDF): A measure of how unique or rare a token $t$ is across the corpus. Tokens that appear in fewer documents (e.g., technical terms) are assigned higher importance than those common in all documents (e.g., stop words like "a," "the").
    $$IDF(t)= \log\left(\frac{\text{Number of documents in the corpus}}{\text{Number of documents in the corpus containing the token $t$}}\right)$$
    * The TF-IDF score of a token $t$ in a document $d$ is the product of its TF and IDF values:
    $$TF\text{-}IDF(t, d) = TF(t, d) \cdot IDF(t)$$

* **Embeddings**: An embedding is a numerical representation of complex data, where each token (e.g., word or phrase) is represented as a vector in an $n$-dimensional space. Unlike traditional vectorization techniques like Bag-of-Words or TF-IDF, embeddings are typically compact, dense, and highly informative representations. One of the main advantages of embeddings is their ability to encode semantic relationships. This means that words with similar meanings (synonyms) are represented by vectors that are closer together in the $n$-dimensional space. There are many techniques to efficiently learn embeddings in an unsupervised way. Some of the most popular **word embedding models** are Word2vec, GloVe, fastText, ELMo and BERT. **You can find pretrained models for these popular embeddings online**.  
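
The BoW and TF-IDF definitions above can be checked by hand on a tiny made-up corpus. The sketch below applies the formulas exactly as written, using only the standard library; the corpus and the helper names (`bow`, `tf`, `idf`, `tfidf`) are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tokenized corpus of three short documents
toy_corpus = [
    ["great", "movie", "great", "cast"],
    ["boring", "movie"],
    ["great", "plot"],
]
toy_vocab = sorted({token for doc in toy_corpus for token in doc})


def bow(doc):
    # Number of occurrences of each vocabulary token in the document
    counts = Counter(doc)
    return [counts[token] for token in toy_vocab]


def tf(token, doc):
    # Occurrences of the token divided by the document length
    return doc.count(token) / len(doc)


def idf(token):
    # Log of (number of documents / number of documents containing the token)
    n_containing = sum(token in doc for doc in toy_corpus)
    return math.log(len(toy_corpus) / n_containing)


def tfidf(doc):
    return [tf(token, doc) * idf(token) for token in toy_vocab]


first_bow = bow(toy_corpus[0])
first_tfidf = tfidf(toy_corpus[0])
print(toy_vocab)    # ['boring', 'cast', 'great', 'movie', 'plot']
print(first_bow)    # [0, 1, 2, 1, 0]
print([round(score, 3) for score in first_tfidf])
```

In the first document, "cast" appears once but only in that document, so it scores higher than "great", which appears twice but is shared with another document. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its scores will differ from these values.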
<div align="center">
<img src="figures/taxonomy.png"/>
</div>

#### Traditional Encoders

Let's first explore traditional encoding techniques on our data, including BoW and TF-IDF. Keep in mind that we can flexibly choose the encoding technique that goes with our classification models (remember, our goal is to perform sentiment analysis on IMDb reviews).  

In the next section, we will spend some time on embeddings to understand and visualize how they represent the data.

</div>
<div class='alert alert-info'>
<b> Exercise 1.3.4 </b>

* Implement the **Bag-of-Words (BoW) vectorizer** using the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from scikit-learn.
* Apply the BoW vectorization on the training data.
* Verify that the first reviews in the training set have been correctly vectorized. How can you confirm this?  
</div>


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

bow_encoder = CountVectorizer(
    vocabulary=..., tokenizer=lambda x: x, preprocessor=lambda x: x
)
X_train_bow = ...
y_train = ...

In [None]:
# Verify the vectorized data
# --------------------------
...

</div>
<div class='alert alert-info'>
<b> Exercise 1.3.5 </b>

* Implement the **TF-IDF vectorizer** using the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from scikit-learn.
* Apply the TF-IDF vectorization on the training data.
* Pick some reviews and print their 10 most important tokens based on their TF-IDF scores.  
* Reflect: Can you infer whether the associated sentiment is positive or negative based on the most important tokens?

</div>


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_encoder = TfidfVectorizer(...)
X_train_tfidf = ...
y_train = ...

## 1.4  Word Embeddings

Let's learn how to use word embedding models and explore some interesting features on specific examples.

</div>
<div class='alert alert-info'>
<b> Exercise 1.4.1 </b>


* Using gensim [downloader](https://radimrehurek.com/gensim/downloader.html), load a pretrained word embedding model. You can find the list of available models [here](https://github.com/piskvorky/gensim-data?tab=readme-ov-file#models) (suggestion: start with `glove-wiki-gigaword-100`).

* Use the word embedding model to vectorize the following tokens: `["man", "woman", "king", "queen", "prince", "princess", "actor", "actress", "movies"]`

* What is the dimension of the embedding space ? 

* Try loading another model and compare the size of the embedding vectors.

</div>

In [None]:
import gensim.downloader as api

# Embedding model
print("Loading embedding model...")
embedding_model = ...

In [None]:
words = [
    "man",
    "woman",
    "king",
    "queen",
    "prince",
    "princess",
    "actor",
    "actress",
    "movies",
]

# Get the vectors for the words
word_embeddings = ...
printmd(f"The dimension of the embeddings is: {...}")

Word embeddings often have hundreds (or even thousands) of dimensions that capture the meaning and relationships of words. However, such high-dimensional objects are difficult to visualize directly.

There is a way to visualize these vectors using machine learning techniques for dimensionality reduction such as Principal Component Analysis (PCA). We will project the high-dimensional word embeddings into a 2D space, while preserving the relative distances between similar words as much as possible. This helps visualize semantic relationships embedded in the original high-dimensional space.

<div class="alert alert-block alert-warning">

⚠️ BoW and TF-IDF vectorizations can also be visualized using dimensionality reduction. But since they don't capture semantic relationships, the distance between words doesn't necessarily represent their actual similarities in meaning. ⚠️

</div>
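
For intuition about what the 2D projection does, here is a sketch of PCA computed with NumPy's singular value decomposition on random made-up vectors. The data is hypothetical, and this is an illustration of the principle rather than scikit-learn's implementation.

```python
import numpy as np

# Hypothetical 4-dimensional "embeddings" for five items (random, for illustration)
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(5, 4))

# Step 1: center the data
centered = toy_vectors - toy_vectors.mean(axis=0)

# Step 2: the right singular vectors are the principal directions
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Step 3: project onto the first two directions
projected_2d = centered @ components[:2].T

# Share of the total variance kept by the 2D projection
explained = (singular_values[:2] ** 2).sum() / (singular_values**2).sum()
print(projected_2d.shape)  # (5, 2)
print(f"variance kept: {explained:.2f}")
```

The share of variance kept indicates how faithful the 2D picture is: the lower it is, the more the plotted distances may distort the original ones.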

</div>
<div class='alert alert-info'>
<b> Exercise 1.4.2 </b>

* Use [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) to project the embeddings of the previous exercise into a 2D space.
* Visualize the 2D word embeddings using a scatter plot.
* Compute the vector `embedding("actor")-embedding("man")+embedding("woman")` and add it to the same scatter plot.
* Interpret the resulting vector `mystery_embedding` and find its most similar word using [`similar_by_vector`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.similar_by_vector).
* Reflect: do arithmetic operations on word embeddings preserve semantic meaning?
</div>

In [None]:
from sklearn.decomposition import PCA

# Init PCA object
pca = ...
# fit/project embeddings in 2D
projected_embeddings = ...

# Arithmetic embedding
mystery_embedding = ...
mystery_embedding_projected = ...

# add mystery embedding to the words
if "mystery" not in words:
    words.append("mystery")
projected_embeddings = np.vstack([projected_embeddings, mystery_embedding_projected])

# Plot the embeddings
plt.figure(figsize=(5, 5))
plt.scatter(...)

for i, word in enumerate(words):
    plt.annotate(word, (projected_embeddings[i, 0], projected_embeddings[i, 1]))

plt.title("PCA projection of word embeddings")
plt.show()

In [None]:
# Most similar word to the mystery embedding
most_similar_word = ...
printmd(f"The most similar word to the mystery embedding is: {most_similar_word}")

Now that we’ve explored word-level embeddings, let’s move on to **document embeddings**. Instead of representing individual words, we’ll generate embeddings for entire IMDb reviews using **Doc2Vec**.

</div>
<div class='alert alert-info'>
<b> Exercise 1.4.3 </b>

* Complete the code below to initialize and train a [Gensim Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html) model. 
* Generate vector embeddings for the training documents.
* What is the size of the embeddings?
</div>


In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tag the reviews
tagged_reviews = [
    TaggedDocument(words=review, tags=[idx])
    for idx, review in enumerate(df_train_tokenized.review)
]

# Init Doc2Vec model
print("Init Doc2Vec model...")
doc2vec_encoder = ...

# Build vocabulary
print("Building vocabulary...")
doc2vec_encoder.build_vocab(tagged_reviews)

# Train the model
print("Training the model...")
doc2vec_encoder.train(
    tagged_reviews,
    total_examples=doc2vec_encoder.corpus_count,
    epochs=doc2vec_encoder.epochs,
)

In [None]:
# Compute embedding vectors for the training set
print("Infer vectors for the training set...")
X_train_doc2vec = np.array(...)

#### Summing Up

In this part, we explored various methods to vectorize our documents. Importantly, we ensured that the vocabulary was extracted, and the encoders were trained exclusively on the training data. This is crucial to avoid leaking information from the test set, which could lead to a biased evaluation of model performance.

Let’s refactor the code to create a reusable function that utilizes the pre-fitted vectorizers. This function will be specifically useful later for vectorizing the test data, ensuring consistency in our evaluation process.

In [None]:
def vectorize_data(df_reviews, method):
    """
    Vectorize the data using the specified method
    Args:
        df_reviews: pd.Series, tokenized reviews to vectorize
        method: str, method to use for vectorization ("bow", "tfidf" or "doc2vec")
    Returns:
        X: vectorized data (sparse matrix for "bow" and "tfidf", np.array for "doc2vec")
    """
    if method == "bow":
        X = bow_encoder.transform(df_reviews)
    elif method == "tfidf":
        X = tfidf_encoder.transform(df_reviews)
    elif method == "doc2vec":
        # Infer one vector per review passed in, not only for the training reviews
        X = np.array(
            [doc2vec_encoder.infer_vector(review) for review in df_reviews]
        )
    else:
        raise ValueError("Unknown method")
    return X

## 1.5 Classification

Now that our vectorizers are fitted on the training data, it's time to apply them to the test set.

<div class='alert alert-info'>
<b> Exercise 1.5.1 </b>

* Ensure you are using clean tokenized test data before vectorization.
* Vectorize the tokenized test data using the pre-fitted vectorizers for:
  - Bag of Words (BoW)
  - TF-IDF
  - Doc2Vec

* Assign the sentiment labels from the test data to a variable `y_test`.

</div>

In [None]:
X_test = {}

print("BoW vectorization ...")
X_test["bow"] = ...
print("TF-IDF vectorization ...")
X_test["tfidf"] = ...
print("Doc2Vec vectorization ...")
X_test["doc2vec"] = ...

# Get the target
y_test = df_test_tokenized.sentiment

X_train = {
    "bow": X_train_bow,
    "tfidf": X_train_tfidf,
    "doc2vec": X_train_doc2vec,
}
# Get the target
y_train = df_train_tokenized.sentiment

<div class='alert alert-info'>
<b> Exercise 1.5.2 </b>

- Train two classifier models: **Logistic Regression** and **Random Forest** for each of the following vectorization techniques:
  - Bag of Words (BoW)
  - TF-IDF
  - Doc2Vec

  This results in a total of **six models**. Organize these models in a nested dictionary structure, where:
  - The first level of the dictionary is indexed by the **model name** (e.g., `"logistic_regression"`, `"random_forest"`).
  - The second level is indexed by the **vectorization technique** (e.g., `"bow"`, `"tfidf"`, `"doc2vec"`).

- Evaluate all trained models on the **test data**:
  1. Compute the **accuracy** for each model.
  2. Display the **confusion matrix** for each model to analyze their performance in detail.

</div>
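
As a reminder of what the two metrics measure, here is a sketch computing them by hand on made-up labels and predictions (both lists are hypothetical). The matrix follows the same convention as scikit-learn's `confusion_matrix`: rows are true classes, columns are predicted classes.

```python
# Hypothetical true labels and predictions (1 = positive, 0 = negative)
toy_y_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy: share of predictions that match the true label
accuracy = sum(t == p for t, p in zip(toy_y_true, toy_y_pred)) / len(toy_y_true)

# Confusion matrix: rows are true classes, columns are predicted classes
confusion = [[0, 0], [0, 0]]
for t, p in zip(toy_y_true, toy_y_pred):
    confusion[t][p] += 1

print(accuracy)   # 0.75
print(confusion)  # [[3, 1], [1, 3]]
```

The off-diagonal entries separate the two kinds of errors (false positives and false negatives), which accuracy alone does not distinguish.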


In [None]:
...

<div class='alert alert-info'>
<b> Exercise 1.5.3 </b>

* Create a bar chart comparing the **accuracy** of all six models (use distinct colors to differentiate the vectorization techniques).
* Identify the best combination of **classifier** and **vectorization technique** based on the highest accuracy and balanced performance across both classes.
* Reflect: do the results align with your expectations?

</div>


In [None]:
...