[<font color='steelblue'>1. - __Imports and Load Data__</font>](#one-bullet) <br>
[<font color='steelblue'>2. - __Corpus Splitting__</font>](#two-bullet) <br>
[<font color='steelblue'>3. - __RoBERTa Embedding__</font>](#three-bullet) <br>
[<font color='steelblue'>4. - __Final Model__</font>](#four-bullet) <br>

Group 7

| Name | Student ID |
|----|----|
|Joana Rodrigues| 20240603|
|Mara Simões| 20240326|
|Matilde Street| 20240523|
|Rafael Silva| 20240511|

<hr>
<a class="anchor" id="one-bullet"> 
<d style="color:white;">

# 1. Imports and Load Data
</a> 
</d>   

In [1]:
# Core libraries for numerical computing and tabular data
import numpy as np
import pandas as pd

# Progress bar for long-running loops
from tqdm import tqdm

# Train/validation splitting and evaluation metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, f1_score, classification_report,
    confusion_matrix, precision_score, recall_score
)

# PyTorch and the pre-trained RoBERTa tokenizer/model from Hugging Face
import torch
from transformers import RobertaTokenizer, RobertaModel

# SMOTE oversampling for the imbalanced training labels
from imblearn.over_sampling import SMOTE

# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

# joblib to load the saved best model
import joblib

In [2]:
# Read the training and test data
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

# Keep only the 'text' and 'label' columns and reset the index
df_train = pd.DataFrame(data=df_train, columns=['text', 'label']).reset_index(drop=True)
df_test = pd.DataFrame(data=df_test, columns=['text', 'label']).reset_index(drop=True)

# Copy of the uncleaned training data, used as input to the pre-trained model
df_train_raw = df_train.copy()

<hr>
<a class="anchor" id="two-bullet"> 
<d style="color:white;">

# 2. Corpus Splitting
</a> 
</d>   

In [3]:
# Stratified 80/20 train/validation split on the raw text (input for the pre-trained model)
X_raw = df_train_raw['text']
y_raw = df_train_raw['label']

X_train_raw, X_val_raw, y_train_raw, y_val_raw = train_test_split(X_raw, y_raw, test_size=0.2, stratify=y_raw, random_state=42)
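
The effect of `stratify` can be checked on a small example. The sketch below uses hypothetical labels (not the notebook's data) and a helper named `class_shares` that is introduced only for illustration; it shows that the class proportions are the same in both parts of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels (not the notebook's data): 15% / 20% / 65%
y = np.array([0] * 150 + [1] * 200 + [2] * 650)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

def class_shares(labels):
    """Return the fraction of samples per class, ordered by class label."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

train_shares = class_shares(y_tr)
val_shares = class_shares(y_va)
print("train:", train_shares.round(3))
print("val:  ", val_shares.round(3))
```

Without `stratify`, the shares of the smaller classes could drift between the two sets, which would make the validation macro F1 less comparable to training.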

<hr>
<a class="anchor" id="three-bullet"> 
<d style="color:white;">

# 3. RoBERTa Embedding
</a> 
</d>    

In [None]:
# Load the pre-trained RoBERTa tokenizer and model
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')
model.eval()  # Inference mode: disables dropout
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

def roberta_corpus2vec(corpus, tokenizer, model, max_length=128):
    """Return an (n_docs, hidden_size) array with one RoBERTa embedding per document."""
    embeddings = []
    with torch.no_grad():  # No gradients needed for feature extraction
        for doc in tqdm(corpus):
            # Tokenize one document, padded/truncated to max_length tokens
            inputs = tokenizer(doc, return_tensors='pt', padding='max_length',
                               truncation=True, max_length=max_length)
            # Move the inputs to the model's device instead of relying on a global
            inputs = {k: v.to(model.device) for k, v in inputs.items()}
            outputs = model(**inputs)
            # Hidden state of the first token (<s>, RoBERTa's counterpart to [CLS])
            cls_embedding = outputs.last_hidden_state[:, 0, :].squeeze(0)
            embeddings.append(cls_embedding.cpu().numpy())
    return np.vstack(embeddings)

# Encode the train, validation, and test sets from the original (raw) text.
# The pooler-weights warning printed below is harmless here: only last_hidden_state is used.
X_train_roberta = roberta_corpus2vec(X_train_raw, tokenizer, model)
X_val_roberta = roberta_corpus2vec(X_val_raw, tokenizer, model)
df_test_roberta = roberta_corpus2vec(df_test['text'], tokenizer, model)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 7634/7634 [21:51<00:00,  5.82it/s]  
100%|██████████| 1909/1909 [03:51<00:00,  8.24it/s]
100%|██████████| 2388/2388 [04:48<00:00,  8.27it/s]
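
To make the shapes in `roberta_corpus2vec` concrete, here is a small NumPy-only sketch. It does not run RoBERTa: a random array stands in for `outputs.last_hidden_state`, and the helper `fake_last_hidden_state` is hypothetical. It only illustrates how slicing position 0 and stacking yields one 768-dimensional row per document.

```python
import numpy as np

rng = np.random.default_rng(0)
max_length, hidden_size = 128, 768  # roberta-base has a hidden size of 768

def fake_last_hidden_state():
    """Stand-in for outputs.last_hidden_state of one document: (1, max_length, hidden_size)."""
    return rng.normal(size=(1, max_length, hidden_size))

embeddings = []
for _ in range(3):  # three hypothetical documents
    hidden = fake_last_hidden_state()
    first_token = hidden[:, 0, :].squeeze(0)  # vector at position 0, shape (hidden_size,)
    embeddings.append(first_token)

features = np.vstack(embeddings)
print(features.shape)  # (3, 768)
```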


**RoBERTa with SMOTE**  

In [5]:
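# Oversample the minority classes in the training embeddings only; the validation set is left untouched
smote = SMOTE(random_state=13)
X_train_roberta_resampled, y_train_resampled = smote.fit_resample(X_train_roberta, y_train_raw)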
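
As a rough illustration of what SMOTE does to the embeddings, the sketch below implements only the interpolation idea: a synthetic point is placed on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified sketch, not the `imblearn` implementation; the function name `smote_like_sample` and the data are hypothetical.

```python
import numpy as np

def smote_like_sample(X_minority, k=5, rng=None):
    """Create one synthetic point by interpolating between a random minority
    sample and one of its k nearest minority neighbours (simplified sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(len(X_minority))
    base = X_minority[i]
    # Brute-force Euclidean distances to every minority sample
    dists = np.linalg.norm(X_minority - base, axis=1)
    dists[i] = np.inf  # exclude the sample itself
    neighbours = np.argsort(dists)[:k]
    neighbour = X_minority[rng.choice(neighbours)]
    lam = rng.random()  # interpolation weight in [0, 1)
    return base + lam * (neighbour - base)

rng = np.random.default_rng(13)
X_min = rng.normal(size=(20, 4))  # hypothetical minority-class embeddings
synthetic = smote_like_sample(X_min, k=5, rng=rng)
print(synthetic.shape)  # (4,)
```

Because each synthetic point is a convex combination of two real minority samples, it stays inside the region the minority class already occupies in embedding space.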

<hr>
<a class="anchor" id="four-bullet"> 
<d style="color:white;">

# 4. Final Model
</a> 
</d>   

In [None]:
def evaluate_model_predictions(y_pred_train, y_pred_val, y_train, y_val,
                               show_confusion_matrix=True, show_classification_report=True):
    """Print train/validation metrics and return the validation
    (accuracy, macro F1, macro precision, macro recall)."""
    train_accuracy = accuracy_score(y_train, y_pred_train)
    train_f1 = f1_score(y_train, y_pred_train, average='macro')

    val_accuracy = accuracy_score(y_val, y_pred_val)
    val_f1 = f1_score(y_val, y_pred_val, average='macro')

    # Precision and recall are macro-averaged, like F1
    val_precision = precision_score(y_val, y_pred_val, average='macro')
    val_recall = recall_score(y_val, y_pred_val, average='macro')

    print(f"Accuracy of train: {train_accuracy:.4f}")
    print(f"F1 Macro (Train): {train_f1:.4f}")
    print(f"Accuracy of val: {val_accuracy:.4f}")
    print(f"\033[1mF1 Macro (Val)\033[0m: {val_f1:.4f}")  # bold: main selection metric
    print(f"Precision (Val): {val_precision:.4f}")
    print(f"Recall (Val): {val_recall:.4f}")

    if show_confusion_matrix:
        print('\nConfusion Matrix for Validation Data:')
        print(confusion_matrix(y_val, y_pred_val))

    if show_classification_report:
        print('\nClassification Report for Validation Data:')
        print(classification_report(y_val, y_pred_val))

    return val_accuracy, val_f1, val_precision, val_recall

### **Ensemble Model with Voting Classifier Combining Logistic Regression, MLP, and XGBoost**

#### With RoBERTa

In [None]:
# Load the best model from a pickle (saved from the tests notebook)
best_ensemble = joblib.load('best_ensemble_model.pkl')
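
For context, the sketch below shows how a voting ensemble of this kind could be built, saved, and reloaded with `joblib`. It is not the configuration stored in `best_ensemble_model.pkl`: the data are synthetic, the hyperparameters and the file name are hypothetical, soft voting is an assumption, and scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost so the example needs only scikit-learn.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Hypothetical 3-class data standing in for the RoBERTa embeddings
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)),
        # Stand-in for XGBoost so the sketch only needs scikit-learn
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",  # assumption: the original voting scheme is not stated
)
ensemble.fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ensemble_sketch.pkl")  # hypothetical file name
    joblib.dump(ensemble, path)
    restored = joblib.load(path)

# The reloaded model should reproduce the original predictions
same_predictions = np.array_equal(ensemble.predict(X), restored.predict(X))
print(same_predictions)
```

A pickle loaded this way expects inputs with the same feature layout it was fitted on, which is why the embeddings here have to be produced by the same `roberta_corpus2vec` procedure as in the tests notebook.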

In [8]:
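# Training metrics are computed on the SMOTE-resampled training set,
# validation metrics on the original (non-resampled) validation set
y_pred_train = best_ensemble.predict(X_train_roberta_resampled)
y_pred_val = best_ensemble.predict(X_val_roberta)
evaluate_model_predictions(y_pred_train=y_pred_train, y_pred_val=y_pred_val, y_train=y_train_resampled, y_val=y_val_raw)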

Accuracy of train: 0.9892
F1 Macro (Train): 0.9892
Accuracy of val: 0.8491
F1 Macro (Val): 0.7906
Precision (Val): 0.7935
Recall (Val): 0.7879

Confusion Matrix for Validation Data:
[[ 199   39   50]
 [  28  292   65]
 [  50   56 1130]]

Classification Report for Validation Data:
              precision    recall  f1-score   support

           0       0.72      0.69      0.70       288
           1       0.75      0.76      0.76       385
           2       0.91      0.91      0.91      1236

    accuracy                           0.85      1909
   macro avg       0.79      0.79      0.79      1909
weighted avg       0.85      0.85      0.85      1909



(0.8491356731272918,
 0.7906081592040758,
 0.7935213460864045,
 0.7878844209548093)
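
The macro-averaged figures can be traced back to the validation confusion matrix printed above. The following sketch recomputes them with NumPy alone, using scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Validation confusion matrix from the output above (rows: true, columns: predicted)
cm = np.array([[199, 39, 50],
               [28, 292, 65],
               [50, 56, 1130]])

tp = np.diag(cm)
precision = tp / cm.sum(axis=0)  # per class: correct / number predicted as that class
recall = tp / cm.sum(axis=1)     # per class: correct / number of true samples (support)
f1 = 2 * precision * recall / (precision + recall)

accuracy = tp.sum() / cm.sum()
macro_precision, macro_recall, macro_f1 = precision.mean(), recall.mean(), f1.mean()

print(f"{accuracy:.4f} {macro_f1:.4f} {macro_precision:.4f} {macro_recall:.4f}")
# 0.8491 0.7906 0.7935 0.7879
```

Macro averaging weights the three classes equally, so the weaker scores on classes 0 and 1 pull the macro F1 (0.79) well below the accuracy (0.85), which is dominated by the large class 2.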

### Export Test Predictions

In [None]:
# predict on the test set using the best ensemble model
y_pred_test = best_ensemble.predict(df_test_roberta)
ids = df_test.index

final = pd.DataFrame({
    'id': ids,
    'label': y_pred_test
})

# Save the final predictions to a CSV file
final.to_csv('pred_07.csv', index=False, sep=';')
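
Since the file is written with `sep=';'`, it has to be read back with the same separator. A minimal round-trip check, using hypothetical predictions and a temporary file name rather than the real `pred_07.csv`, might look like this:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical predictions standing in for y_pred_test
y_pred = np.array([2, 0, 1, 2])
submission = pd.DataFrame({"id": range(len(y_pred)), "label": y_pred})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pred_sketch.csv")  # hypothetical file name
    submission.to_csv(path, index=False, sep=";")
    # Pass the same separator when reading, since the pandas default is ','
    reloaded = pd.read_csv(path, sep=";")

print(list(reloaded.columns))
```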