# **Text Mining: Sentiment Analysis**

## 🎓 Master’s Program in Data Science & Advanced Analytics  
**Nova IMS** | June 2025  
**Course:** Text Mining

## 👥 Team **Group 34**  
- **[Philippe Dutranoit]** | [20240518]  
- **[Diogo Duarte]** | [20240525]  
- **[Rui Luz]** | [20211628]  
- **[Rodrigo Sardinha]** | [20211627]  

## 📊 Goal of the notebook

In this notebook, we finalize our approach using the model and methodology that achieved the best overall results.

After evaluating several alternatives, we selected the combination that performed best in terms of both **accuracy** and **F1-macro score**. Specifically, we used a **Multi-Layer Perceptron (MLPClassifier)** with text embeddings generated by the pretrained transformer model **`cardiffnlp/twitter-roberta-base-sentiment-latest`**.

The transformer is used as a fixed (frozen) feature extractor: its weights are not updated, and it maps each input text to a 768-dimensional embedding. These embeddings are then used to train the MLPClassifier.

Finally, we use the trained model to predict labels for the test set and save them to a CSV file for final evaluation.
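
The two-stage design described above can be sketched as follows. This is a minimal illustration, not the notebook's actual run: random clustered vectors stand in for the transformer embeddings, and the dimensions, sample counts, and `random_state` are assumptions chosen only to keep the example small and repeatable.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Stand-in for frozen transformer embeddings: 3 classes, 32-dim vectors
# (the real embeddings in this notebook are 768-dimensional).
n_per_class, dim = 60, 32
centers = rng.normal(scale=3.0, size=(3, dim))
X = np.vstack([c + rng.normal(size=(n_per_class, dim)) for c in centers])
y = np.repeat([0, 1, 2], n_per_class)

# Shuffle, then split into training and validation parts.
order = rng.permutation(len(y))
X, y = X[order], y[order]
X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]

# Only the classifier is trained; the feature matrix itself is never modified.
clf = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=300, random_state=0)
clf.fit(X_tr, y_tr)
val_pred = clf.predict(X_va)
print("validation accuracy on synthetic data:", (val_pred == y_va).mean())
```

Because the extractor is frozen, the embeddings only need to be computed once and can be reused across classifier experiments.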


# Imports

In [5]:
import pandas as pd
import numpy as np

from transformers import pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, accuracy_score, f1_score

In [3]:
X_train = pd.read_csv('/content/X_train.csv')
y_train = pd.read_csv('/content/Y_train.csv')
X_val = pd.read_csv('/content/X_val.csv')
y_val = pd.read_csv('/content/Y_val.csv')

df_test = pd.read_csv('/content/test.csv')
df_test.isna().sum()

X_train_texts = X_train['text'].tolist()
y_train = y_train['label']

X_val_texts = X_val['text'].tolist()
y_val = y_val['label']


Unnamed: 0    0
id            0
text          0


# Pipeline

In [6]:
print("Loading transformer model...")
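feature_extractor = pipeline(
    "feature-extraction",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    tokenizer="cardiffnlp/twitter-roberta-base-sentiment-latest",
    device=0  # first CUDA GPU; use device=-1 to run on CPU
)

def extract_cls_embeddings(texts, extractor):
    """Return one embedding vector per text as a (n_texts, hidden_size) array."""
    print("Extracting embeddings...")
    embeddings = extractor(texts, batch_size=16, truncation=True)

    # Each pipeline output is nested as [1][seq_len][hidden_size], so
    # embeddings[0][0] is a list of token vectors and this branch is taken.
    if isinstance(embeddings[0][0], list):
        print("Detected multi-layer embeddings → using last layer.")
        # NOTE: e[0][-1] is the vector at the last token position, not a layer;
        # e[0][0] would select the first (<s>, CLS-equivalent) token instead.
        cls_embeddings = np.array([np.array(e[0][-1]) for e in embeddings])
    else:  # Fallback for other output formats (not reached with nested lists)
        print("Detected single-layer embeddings.")
        cls_embeddings = np.array([np.array(e[0]) for e in embeddings])

    print(f"Extracted CLS embeddings shape: {cls_embeddings.shape}")
    return cls_embeddings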

Loading transformer model...


Device set to use cuda:0
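
The indexing in `extract_cls_embeddings` is easier to follow on a toy input. The sketch below assumes the pipeline returns one nested list per text with shape `[1][seq_len][hidden]`; `fake_output` is a hypothetical helper that mimics that shape, and no model is loaded. It compares three ways of reducing token vectors to a single vector per text: the first token, the last token (what `e[0][-1]` selects), and mean pooling.

```python
import numpy as np

def fake_output(seq_len, hidden=4):
    """Hypothetical stand-in for one pipeline result: [1][seq_len][hidden]."""
    tokens = [[float(t)] * hidden for t in range(seq_len)]  # token t -> vector of t's
    return [tokens]

# Two "texts" of different lengths (3 and 5 tokens).
outputs = [fake_output(3), fake_output(5)]

first_token = np.array([np.array(e[0][0]) for e in outputs])    # <s> position
last_token = np.array([np.array(e[0][-1]) for e in outputs])    # final position
mean_pooled = np.array([np.mean(e[0], axis=0) for e in outputs])

print(first_token.shape, last_token.shape, mean_pooled.shape)
```

All three reductions yield an array of shape `(n_texts, hidden)`, so the shape printed by the function does not reveal which token was selected.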


In [None]:
X_train_emb = extract_cls_embeddings(X_train_texts, feature_extractor)
X_val_emb = extract_cls_embeddings(X_val_texts, feature_extractor)
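
Since extraction is the expensive step and the extractor is frozen, it may be worth caching the arrays on disk. The following is a sketch under assumptions: the random array is a stand-in for `X_train_emb`, and the temporary directory and file name are illustrative, not paths used by this notebook.

```python
import os
import tempfile
import numpy as np

# Hypothetical stand-in for X_train_emb; the real array would come from
# extract_cls_embeddings.
emb = np.random.default_rng(0).normal(size=(10, 768)).astype(np.float32)

cache_dir = tempfile.mkdtemp()  # illustrative location
cache_path = os.path.join(cache_dir, "X_train_emb.npy")

np.save(cache_path, emb)        # write once, right after extraction
restored = np.load(cache_path)  # reload in a later session

print(restored.shape, restored.dtype)
```

With the arrays cached, different classifier settings can be tried without re-running the transformer.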

# Train the MLP

In [9]:

print("Training MLP classifier...")
mlp = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=300)
mlp.fit(X_train_emb, y_train)

y_pred = mlp.predict(X_val_emb)

print("\nEvaluation Results:")
print("Accuracy:", accuracy_score(y_val, y_pred))
# Note: this is the support-weighted F1; the macro F1 is the "macro avg" row of the report.
print("F1 Score:", f1_score(y_val, y_pred, average='weighted'))
print("\nClassification Report:\n", classification_report(y_val, y_pred))

Training MLP classifier...

Evaluation Results:
Accuracy: 0.818753273965427
F1 Score: 0.8171101582326838

Classification Report:
               precision    recall  f1-score   support

           0       0.68      0.66      0.67       288
           1       0.77      0.70      0.73       385
           2       0.86      0.89      0.88      1236

    accuracy                           0.82      1909
   macro avg       0.77      0.75      0.76      1909
weighted avg       0.82      0.82      0.82      1909
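
The difference between the macro and weighted averages in the report can be checked by hand. The sketch below uses only the rounded per-class F1 scores and supports printed above, so the results match the report rows to two decimals rather than the unrounded `F1 Score` line. The majority-class baseline is added for context.

```python
# Per-class F1 and support copied from the classification report above.
f1_per_class = {0: 0.67, 1: 0.73, 2: 0.88}
support = {0: 288, 1: 385, 2: 1236}
total = sum(support.values())

# Macro F1: every class counts equally.
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)

# Weighted F1: each class counts in proportion to its support.
weighted_f1 = sum(f1_per_class[c] * support[c] for c in support) / total

# Accuracy of always predicting the most frequent class.
majority_baseline = max(support.values()) / total

print(f"macro F1 ~ {macro_f1:.2f}")
print(f"weighted F1 ~ {weighted_f1:.2f}")
print(f"majority-class accuracy ~ {majority_baseline:.2f}")
```

The weighted average is pulled up by class 2, which accounts for 1236 of the 1909 validation samples, while the macro average gives the weaker classes 0 and 1 equal weight.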



In [8]:

X_test_texts = df_test['text'].tolist()

# Extract embeddings
X_test_emb = extract_cls_embeddings(X_test_texts, feature_extractor)

# Predict labels
test_preds = mlp.predict(X_test_emb)

# Save predictions
df_test['predicted_label'] = test_preds
df_test.to_csv('test_with_predictions.csv', index=False)

print("Test predictions saved to 'test_with_predictions.csv'")


Extracting embeddings...
Detected multi-layer embeddings → using last layer.
Extracted CLS embeddings shape: (2388, 768)
Test predictions saved to 'test_with_predictions.csv'
