# **Text Mining: Sentiment Analysis**

## 🎓 Master’s Program in Data Science & Advanced Analytics  
**Nova IMS** | June 2025  
**Course:** Text Mining

## 👥 Team **Group 34**  
- **[Philippe Dutranoit]** | [20240518]  
- **[Diogo Duarte]** | [20240525]  
- **[Rui Luz]** | [20211628]  
- **[Rodrigo Sardinha]** | [20211627]  

## 📊 Goal of the notebook


In this notebook we finalize our approach using the model and pipeline that gave us the best results, then generate predictions on the test set for the final model evaluation.

# Imports 

In [None]:
import pandas as pd
import numpy as np

In [None]:
X_train = pd.read_csv('../Data/X_train.csv')
y_train = pd.read_csv('../Data/y_train.csv')
# The validation split is loaded under the X_test / y_test names.
X_test = pd.read_csv('../Data/X_val.csv')
y_test = pd.read_csv('../Data/y_val.csv')


In [1]:
import pandas as pd
import numpy as np
from transformers import pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, accuracy_score, f1_score

# ------------------------------------------
# 1. Load your text data
# ------------------------------------------
# Make sure your files have columns: "text" and "label"
train_df = pd.read_csv('train.csv')  # columns: 'text', 'label'
val_df = pd.read_csv('val.csv')      # columns: 'text', 'label'

X_train_texts = train_df['text'].tolist()
y_train = train_df['label']

X_val_texts = val_df['text'].tolist()
y_val = val_df['label']

# ------------------------------------------
# 2. Define embedding extractor
# ------------------------------------------
print("Loading transformer model...")
feature_extractor = pipeline(
    "feature-extraction",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    tokenizer="cardiffnlp/twitter-roberta-base-sentiment-latest",
    device=0  # use -1 if no GPU available
)

def extract_cls_embeddings(texts, extractor):
    """Return one vector per text: the hidden state of the first token (<s>, RoBERTa's CLS equivalent)."""
    print("Extracting embeddings...")
    embeddings = extractor(texts, batch_size=16, truncation=True)

    # For each text, the feature-extraction pipeline returns a nested list
    # shaped [1, seq_len, hidden_size]: e[0] is the token sequence and
    # e[0][0] is the first-token (CLS) vector.
    cls_embeddings = np.array([e[0][0] for e in embeddings])

    print(f"Extracted CLS embeddings shape: {cls_embeddings.shape}")
    return cls_embeddings

# ------------------------------------------
# 3. Extract embeddings for train and val
# ------------------------------------------
X_train_emb = extract_cls_embeddings(X_train_texts, feature_extractor)
X_val_emb = extract_cls_embeddings(X_val_texts, feature_extractor)

# ------------------------------------------
# 4. Train MLP Classifier
# ------------------------------------------
print("Training MLP classifier...")
mlp = MLPClassifier(
    hidden_layer_sizes=(100,),
    activation='relu',
    solver='adam',
    max_iter=300,
    random_state=42,
)
mlp.fit(X_train_emb, y_train)

# ------------------------------------------
# 5. Evaluate on validation set
# ------------------------------------------
y_pred = mlp.predict(X_val_emb)

print("\nEvaluation Results:")
print("Accuracy:", accuracy_score(y_val, y_pred))
print("F1 Score:", f1_score(y_val, y_pred, average='weighted'))
print("\nClassification Report:\n", classification_report(y_val, y_pred))
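
The `device=0` argument in the cell above assumes a GPU is present. A minimal sketch of choosing the device automatically is shown here, assuming PyTorch is the backend behind `transformers`. `pick_device` is a hypothetical helper name, not part of any library.

```python
def pick_device():
    """Return 0 (first CUDA GPU) when one is available, otherwise -1 (CPU)."""
    try:
        import torch  # assumption: PyTorch is the installed backend
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print("Using device index:", device)
```

The result could then be passed as `device=pick_device()` when building the pipeline, so the same notebook runs on machines with or without a GPU.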





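To make first-token (CLS) pooling concrete, here is a toy sketch in which a hand-built nested list stands in for the pipeline output. It assumes the feature-extraction output for each text is shaped `[1, seq_len, hidden_size]`. The numbers and the `first_token_vectors` helper are made up for illustration.

```python
import numpy as np

# Toy stand-in for pipeline output: 2 texts, hidden_size = 3.
# Each entry is shaped [1, seq_len, hidden_size]; seq_len differs per text.
fake_outputs = [
    [[[0.1, 0.2, 0.3], [9.0, 9.0, 9.0]]],                   # 2 tokens
    [[[0.4, 0.5, 0.6], [8.0, 8.0, 8.0], [7.0, 7.0, 7.0]]],  # 3 tokens
]


def first_token_vectors(outputs):
    """Stack the first-token vector of every text into a 2-D array."""
    return np.array([out[0][0] for out in outputs])


cls = first_token_vectors(fake_outputs)
print(cls.shape)  # (2, 3)
```

Because sequence lengths differ between texts, stacking whole token sequences would give a ragged array. Picking a single token per text yields a regular `(n_texts, hidden_size)` matrix that a classifier such as `MLPClassifier` can consume directly.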

In [None]:
# ------------------------------------------
# 6. Predict on Test Set (optional)
# ------------------------------------------
# Load test data
test_df = pd.read_csv('test.csv')  # Must have a 'text' column
X_test_texts = test_df['text'].tolist()

# Extract embeddings
X_test_emb = extract_cls_embeddings(X_test_texts, feature_extractor)

# Predict labels
test_preds = mlp.predict(X_test_emb)

# Save predictions (optional)
test_df['predicted_label'] = test_preds
test_df.to_csv('test_with_predictions.csv', index=False)

print("Test predictions saved to 'test_with_predictions.csv'")
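
Embedding extraction is repeated for every split and is usually the slowest step in this notebook, so caching the arrays on disk can save time when cells are re-run. The sketch below is one possible approach, not part of the original workflow. `load_or_compute` and the file name are hypothetical, and a dummy function stands in for the real extractor call.

```python
import os
import tempfile

import numpy as np


def load_or_compute(path, compute_fn):
    """Load an array from `path` if it exists; otherwise compute and save it."""
    if os.path.exists(path):
        return np.load(path)
    arr = np.asarray(compute_fn())
    np.save(path, arr)
    return arr


# Demo with a dummy compute function in place of the real extractor call.
calls = []


def fake_extract():
    calls.append(1)
    return np.ones((4, 3))


cache_path = os.path.join(tempfile.mkdtemp(), "X_demo_emb.npy")
first = load_or_compute(cache_path, fake_extract)   # computes and saves
second = load_or_compute(cache_path, fake_extract)  # loads from disk
print(len(calls))  # 1
```

In this notebook the compute function could be something like `lambda: extract_cls_embeddings(X_test_texts, feature_extractor)`. The cache file should be deleted whenever the input texts or the transformer model change, since the helper does not detect stale files.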
