# **Text mining: SENTIMENT ANALYSIS**

## 🎓 Master’s Program in Data Science & Advanced Analytics  
**Nova IMS** | June 2025  
**Course:** Text Mining

## 👥 Team **Group 34**  
- **[Philippe Dutranoit]** | [20240518]  
- **[Diogo Duarte]** | [20240525]  
- **[Rui luz]** | [20211628]  
- **[Rodrigo Sardinha]** | [20211627]  

## 📊 Goal of the notebook

In this notebook, we finalize our approach using the model and methodology that achieved the best overall results.

After evaluating several alternatives, we selected the combination that performed best in terms of both **accuracy** and **F1-macro score**. Specifically, we used a **Multi-Layer Perceptron (MLPClassifier)** with text embeddings generated by the pretrained transformer model **`cardiffnlp/twitter-roberta-base-sentiment-latest`**.

This transformer model is used as a fixed feature extractor, providing high-quality embeddings from the input text. These embeddings are then used to train the MLPClassifier.

To improve performance, we performed a **Grid Search** to tune the hyperparameters of the MLPClassifier. While ideally this hyperparameter tuning should have been conducted for all models considered during the evaluation phase, due to time constraints, we applied it only to the final selected model.

Finally, we use the trained and tuned model to make predictions on the test set for final evaluation.


# Imports

In [12]:
import pandas as pd
import numpy as np

from transformers import pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV

In [3]:
X_train = pd.read_csv('../Data/X_train.csv')
y_train = pd.read_csv('../Data/Y_train.csv')
X_val = pd.read_csv('../Data/X_val.csv')
y_val = pd.read_csv('../Data/Y_val.csv')

df_test = pd.read_csv('../Data/test.csv')
df_test.isna().sum()

X_train_texts = X_train['text'].tolist()
y_train = y_train['label']

X_val_texts = X_val['text'].tolist()
y_val = y_val['label']


Unnamed: 0,0
id,0
text,0


# Model Final training

## Feature Extraction Pipeline

In [6]:
print("Loading transformer model...")
feature_extractor = pipeline(
    "feature-extraction",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    tokenizer="cardiffnlp/twitter-roberta-base-sentiment-latest",
    device=0
)

def extract_cls_embeddings(texts, extractor):
    print("Extracting embeddings...")
    embeddings = extractor(texts, batch_size=16, truncation=True)

    if isinstance(embeddings[0][0], list):  # Multi-layer format
        print("Detected multi-layer embeddings → using last layer.")
        cls_embeddings = np.array([np.array(e[0][-1]) for e in embeddings])
    else:  # Single-layer
        print("Detected single-layer embeddings.")
        cls_embeddings = np.array([np.array(e[0]) for e in embeddings])

    print(f"Extracted CLS embeddings shape: {cls_embeddings.shape}")
    return cls_embeddings

Loading transformer model...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/929 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/501M [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/501M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

Device set to use cuda:0


In [None]:
X_train_emb = extract_cls_embeddings(X_train_texts, feature_extractor)
X_val_emb = extract_cls_embeddings(X_val_texts, feature_extractor)

## Classifier Training

In [13]:
param_grid = {
    'hidden_layer_sizes': [
        (128,),
        (128, 64),
        (256, 128),
        (256, 128, 64)
    ],
    'activation': ['relu', 'tanh'],
    'solver': ['adam', 'sgd'],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate_init': [0.001, 0.01],
}

mlp = MLPClassifier(max_iter=300, random_state=42)

grid_search = GridSearchCV(mlp, param_grid, cv=3, scoring='f1_macro', n_jobs=-1, verbose=1)
grid_search.fit(X_train_emb, y_train)

best_mlp = grid_search.best_estimator_
print("------------------------")
print("\nBest parameters found:", grid_search.best_params_)

Fitting 3 folds for each of 16 candidates, totalling 48 fits
------------------------

Best parameters found: {'activation': 'tanh', 'alpha': 0.001, 'hidden_layer_sizes': (128,), 'solver': 'adam'}


# Result evaluation

In [14]:
y_train_pred = best_mlp.predict(X_train_emb)
y_val_pred = best_mlp.predict(X_val_emb)

print("\nClassification Report (Validation Set):\n", classification_report(y_val, y_val_pred))

print("--------------------------")
print("Train Accuracy:",        accuracy_score(y_train, y_train_pred))
print("Val Accuracy:",          accuracy_score(y_val, y_val_pred))
print("Train F1 Macro Score:",  f1_score(y_train, y_train_pred, average='macro'))
print("Val F1 Macro Score:",   f1_score(y_val, y_val_pred, average='macro'))
print("--------------------------")



Classification Report (Validation Set):
               precision    recall  f1-score   support

           0       0.70      0.66      0.68       288
           1       0.81      0.74      0.77       385
           2       0.87      0.91      0.89      1236

    accuracy                           0.84      1909
   macro avg       0.80      0.77      0.78      1909
weighted avg       0.84      0.84      0.84      1909

--------------------------
Train Accuracy: 1.0
Val Accuracy: 0.8376113148245155
Train F1 Macro Score: 1.0
Val F1 Macro Score: 0.7824439641534937
--------------------------


# Final prediction

In [None]:
X_test_texts = df_test['text'].tolist()

X_test_emb = extract_cls_embeddings(X_test_texts, feature_extractor)

test_preds =best_mlp.predict(X_test_emb)

df_test['labels'] = test_preds

## Export

In [20]:
df_fianl = df_test[['id', 'labels']]
df_fianl.to_csv('../Data/pred_34.csv', index=False)

Unnamed: 0,id,labels
0,0,1
1,1,2
2,2,2
3,3,1
4,4,2
...,...,...
2383,2383,2
2384,2384,0
2385,2385,2
2386,2386,1
