<a href="https://colab.research.google.com/github/martatolos/eae-dsaa-2025/blob/main/nlp_svm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Natural Language Processing (NLP) - Part I

Goal of the session:

Implement an essential pipeline to process and classify documents using `nltk`, `scikit-learn`, and `fastembed`.

### Prerequisites

First, let's install all necessary requirements.

In [None]:
# Install requirements
%pip install numpy==2.0.2 pandas==2.2.2 scikit-learn==1.6.1 seaborn==0.13.2 fastembed==0.7.0 nltk==3.9.1

In [None]:
import nltk
import pandas as pd

from collections import Counter
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from typing import Iterable

# Download the required nltk resources (stop word list and tokenizer models)
nltk.download('stopwords')
nltk.download('punkt_tab')
nltk.download('punkt')

EN_STOP_WORDS = {word: True for word in stopwords.words('english')}

### Read and Inspect the Data

In this tutorial we will leverage a dataset that contains movie reviews rated as positive or negative.

In [110]:
data_url = "https://raw.githubusercontent.com/martatolos/eae-dsaa-2025/refs/heads/main/reviews.tsv"
data = pd.read_csv(data_url, sep="\t", names=["review_id", "label", "text"], nrows=300, index_col="review_id")
data

| review_id | label | text |
|---|---|---|
| 0 | pos | You know, Robin Williams, God bless him, is co... |
| 1 | pos | I thought it was a pretty good movie and shoul... |
| 2 | neg | tries to be funny and fails miserably. The ani... |
| 3 | neg | As long as there's been 3d technology, (1950's... |
| 4 | neg | The characters are cliched and predictable, wi... |
| ... | ... | ... |
| 295 | neg | I enjoyed the feel of the opening few minutes,... |
| 296 | neg | I was aware of Rohmer's admiration for the lat... |
| 297 | neg | A Movie about a bunch of some kind of filmmake... |
| 298 | pos | Matthau and Lemmon are at their very best in t... |


The texts originate from a website, so they contain some markup (e.g., `<br />` for line breaks). We replace these tags with spaces.

In [None]:
data["text"] = data["text"].str.replace("<br />", " ", regex=False)

Let's inspect the label distribution of the dataset.

In [None]:
# Output label distribution
data["label"].value_counts()

In [None]:
# Print class ratios
data["label"].value_counts(normalize=True)

Investigate the most frequent words for each label.

In [None]:
def get_word_frequencies(texts: Iterable[str]) -> Counter:
    """ Get word frequency map.

    :param texts: List of texts to be investigated
    :return: Counter object of word frequencies
    """
    # Concatenate all texts
    complete_text = " ".join(texts)

    # Tokenize the text in single words
    words = nltk.word_tokenize(complete_text)

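    # Lowercase the words, then remove stop words and non-alphabetic tokens.
    # The nltk stop word list is lowercase, so "The" would otherwise slip through.
    words = [word.lower() for word in words if word.isalpha() and word.lower() not in EN_STOP_WORDS]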

    return Counter(words)


In [None]:
top_pos_words = get_word_frequencies(data[data["label"] == "pos"]["text"])
print(top_pos_words.most_common(20))

In [None]:
top_neg_words = get_word_frequencies(data[data["label"] == "neg"]["text"])
print(top_neg_words.most_common(20))
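
Words that are frequent in both classes say little about sentiment. One rough way to surface class-specific words is to subtract one `Counter` from the other, which keeps only words whose count stays positive. The sketch below uses small, made-up counts (`pos_counts` and `neg_counts` are hypothetical stand-ins for `top_pos_words` and `top_neg_words`), so the numbers are illustrative only.

```python
from collections import Counter

# Hypothetical toy counts standing in for the two real frequency maps
pos_counts = Counter({"movie": 7, "great": 5, "love": 3})
neg_counts = Counter({"movie": 8, "bad": 6, "great": 1})

# Counter subtraction drops every word whose resulting count is zero or negative
mostly_pos = pos_counts - neg_counts
mostly_neg = neg_counts - pos_counts

print(mostly_pos.most_common())
print(mostly_neg.most_common())
```

On the real data, the same subtraction could be applied directly to the two counters computed above.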

### Train a document classifier

1. First, we have to transform our string labels (i.e., "neg" and "pos") into indices (i.e., 0 and 1). This can be done using the `LabelEncoder` from the `scikit-learn` library, which assigns indices to the classes in sorted order.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Initialize the encoder
encoder = LabelEncoder()

# Fit and transform the labels
encoded_labels = encoder.fit_transform(data["label"])

In [None]:
# Inspect the original and encoded labels of the first 5 reviews
print("Original labels:", data["label"].head(5))
print("Encoded labels:", encoded_labels[:5])

In [None]:
# The encoder stores the mapping between the class labels and the indices internally
print(encoder.classes_)

# This information can be used to easily transform between class labels and indices and vice versa
print(encoder.transform(["pos", "pos"]))
print(encoder.inverse_transform([0, 0, 1]))
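
To make the index assignment concrete, here is a minimal sketch on a toy label list (the `toy_` names are introduced only for this illustration). Because the classes are sorted, "neg" is mapped to 0 and "pos" to 1, regardless of which label appears first in the data.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels; "pos" comes first on purpose
toy_labels = ["pos", "neg", "neg", "pos"]

toy_encoder = LabelEncoder()
toy_encoded = toy_encoder.fit_transform(toy_labels)

print(toy_encoder.classes_)  # classes in sorted order
print(toy_encoded)           # index of each label within classes_
```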

2. Split the dataset into training and test sets

In [None]:
# Split the dataset into stratified training (70%) and test (30%) subsets
X_train, X_test, y_train, y_test = train_test_split(data["text"], encoded_labels, test_size=0.3, random_state=70, stratify=encoded_labels)
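
The `stratify` argument keeps the class ratio the same in both subsets. The following sketch checks this on made-up labels with a 70/30 class imbalance (the `toy_` arrays are hypothetical and unrelated to the review data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 samples of class 0, 30 of class 1
toy_y = np.array([0] * 70 + [1] * 30)
toy_X = np.arange(len(toy_y))  # dummy features, one per sample

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=70, stratify=toy_y
)

# Class ratios in each subset
print(np.bincount(toy_y_train) / len(toy_y_train))
print(np.bincount(toy_y_test) / len(toy_y_test))
```

Without `stratify`, the ratios in the two subsets would generally differ from each other, which matters most for small or imbalanced datasets.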

In [None]:
X_train

In [None]:
y_train

3. Transform the texts into feature vectors

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer(
    lowercase=True,
    ngram_range=(2, 2),  # use word bigrams only; (1, 2) would add unigrams
    max_df=0.90,     # ignore terms that occur in >90% of documents
    min_df=2,        # ignore terms in <2 documents
    stop_words=list(EN_STOP_WORDS.keys()),  # or "english"

    # Use binary=True when the presence of a word matters more than its frequency
    # binary=True
)

# Fit and transform the documents
tfidf_train = vectorizer.fit_transform(X_train)
tfidf_train
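
To see what the vectorizer computes, the sketch below rebuilds TF-IDF by hand on three toy documents and compares the result with `TfidfVectorizer`. It assumes the default settings (unigrams, smoothed IDF, L2 normalization), not the bigram configuration used above, and the `toy_` names are made up for the illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["good movie", "bad movie", "good good plot"]
toy_vocab = ["bad", "good", "movie", "plot"]  # alphabetical, matching scikit-learn's feature order

# Raw term counts: one row per document, one column per vocabulary term
counts = np.array(
    [[doc.split().count(term) for term in toy_vocab] for doc in toy_docs],
    dtype=float,
)

n_docs = len(toy_docs)
doc_freq = (counts > 0).sum(axis=0)               # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed inverse document frequency
weighted = counts * idf                           # term frequency times IDF
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize rows

sklearn_tfidf = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

Terms that occur in many documents (here "movie" and "good") receive a lower IDF than rare ones (here "bad" and "plot").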

In [None]:
# Convert to DataFrame for readability (optional)
feature_names = vectorizer.get_feature_names_out()
df_tfidf_train = pd.DataFrame(tfidf_train.toarray(), columns=feature_names)

df_tfidf_train

4. Train a classification model

In [None]:
# Train the classifier with the sparse TF-IDF features
from sklearn.naive_bayes import MultinomialNB

clf_nb = MultinomialNB()
clf_nb.fit(tfidf_train, y_train)

5. Evaluate the classifier

In [None]:
# Transform the test texts into feature vectors
tfidf_test = vectorizer.transform(X_test)

y_pred = clf_nb.predict(tfidf_test)
print(classification_report(y_test, y_pred, target_names=encoder.classes_))
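
The report lists precision, recall, and F1 score per class. As a minimal sketch of how these numbers are derived, the snippet below computes them by hand for class 1 on made-up labels and predictions (`toy_true` and `toy_pred` are hypothetical).

```python
# Hypothetical ground truth and predictions for 8 samples
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 1, 0, 1, 0]

pairs = list(zip(toy_true, toy_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of both

print(precision, recall, f1)
```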

### Exercise: Investigate different options for text preprocessing and classification models

Experiment with different options, e.g.:
- Investigate different parameter options for the TF-IDF vectorizer
- Test different classification models (e.g., LogisticRegression, SVM, ...)
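
As one possible starting point (not a reference solution), the sketch below chains a `TfidfVectorizer` and a linear SVM in a scikit-learn pipeline. The toy texts, the `toy_`/`svm_pipeline` names, and the chosen `ngram_range` are assumptions for illustration; swap in `X_train`, `y_train`, and your own parameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; replace with X_train / y_train
toy_texts = [
    "great film", "wonderful acting", "great wonderful story",
    "terrible film", "boring acting", "terrible boring story",
]
toy_targets = ["pos", "pos", "pos", "neg", "neg", "neg"]

# The pipeline fits the vectorizer and the classifier in a single step
svm_pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
svm_pipeline.fit(toy_texts, toy_targets)

toy_prediction = svm_pipeline.predict(["great wonderful film"])
print(toy_prediction)
```

A pipeline also ensures that the test texts are transformed with the vocabulary learned on the training texts.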

In [None]:
# Add your code here





### Use Dense Embeddings for Text Encoding

In the example above, we used a traditional statistical text encoding (TF-IDF) that produces sparse, high-dimensional feature vectors. In the following cells, we'll instead use a text encoding model that produces dense, low-dimensional embeddings: the `BAAI/bge-small-en-v1.5` language model (33.4 million parameters), loaded through the `fastembed` library.
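
To illustrate what "sparse" means here, the sketch below measures the fraction of non-zero entries in a TF-IDF matrix built from three toy reviews (the `toy_` names and texts are made up). Each document only uses a few vocabulary terms, so most entries are zero, and this effect grows with vocabulary size.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy reviews with little shared vocabulary
toy_reviews = ["great acting and a great plot", "boring plot", "wonderful soundtrack"]

toy_matrix = TfidfVectorizer().fit_transform(toy_reviews)  # SciPy sparse matrix
n_rows, n_cols = toy_matrix.shape

# nnz is the number of stored (non-zero) entries
density = toy_matrix.nnz / (n_rows * n_cols)
print(toy_matrix.shape, f"{density:.2f}")
```

A dense embedding, in contrast, has a fixed and much smaller number of dimensions, and practically all of its entries are non-zero.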

In [None]:
from fastembed import TextEmbedding

model = "BAAI/bge-small-en-v1.5"

# This will trigger the model download and initialization
embedding_model = TextEmbedding(model_name=model)
print(f"The model {model} is ready to use.")

In [None]:
# Encode the texts
embeddings_train = list(embedding_model.embed(X_train))

In [None]:
# Prints the vector for the first data point
print(embeddings_train[0])

# Prints the size of the vector
print(embeddings_train[0].shape)
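
The list of per-document vectors can be stacked into a single 2D matrix, which is the shape scikit-learn estimators work with. The sketch below uses random placeholder vectors instead of real model output and assumes 384 dimensions, which should match the shape printed above for this model; `placeholder_embeddings` and `embedding_matrix` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholders standing in for real embeddings: 5 documents, 384 dimensions each (assumed)
placeholder_embeddings = [rng.normal(size=384).astype(np.float32) for _ in range(5)]

# Stack the vectors row-wise into a (n_documents, n_dimensions) matrix
embedding_matrix = np.vstack(placeholder_embeddings)
print(embedding_matrix.shape)
```

Passing the plain list to `fit` also works, since scikit-learn converts it to an array internally; stacking explicitly just makes the shape easy to inspect.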

Train a new classifier (logistic regression) on the dense embeddings.

In [None]:
# Train the classifier with the dense features
from sklearn.linear_model import LogisticRegression

clf_lg = LogisticRegression()
clf_lg.fit(embeddings_train, y_train)

In [None]:
# Encode test texts
embeddings_test = list(embedding_model.embed(X_test))

In [None]:
# Evaluate the classifier
y_pred = clf_lg.predict(embeddings_test)
print(classification_report(y_test, y_pred, target_names=encoder.classes_))

### Exercise: Test your own classifier

Now test the classifiers that you investigated in the previous exercise with the dense embeddings. What changes do you observe?

In [None]:
# Add your code

