# MLOps Sentiment Analysis - Complete Monitoring System

This notebook implements the complete monitoring and retraining system for the sentiment analysis model. It includes:
- Dataset download
- Model evaluation
- Prediction logging
- Metrics tracking
- Drift detection
- Retraining trigger
- Result visualizations

In [1]:
from datasets import load_dataset

# Correct dataset path
dt = load_dataset("tweet_eval")

# Print sample counts
print(f"Train samples: {len(dt['train'])}")
print(f"Test samples: {len(dt['test'])}")

Repo card metadata block was not found. Setting CardData to empty.


Train samples: 240540
Test samples: 139136


In [11]:
# Check that the model always returns 3 classes
from src.sentiment_model import analyze_sentiment  # imported here so the cell runs on its own

result = analyze_sentiment("Test")
print("Available classes:", list(result.keys()))

Available classes: ['Negativo', 'Neutro', 'Positivo']
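
The check above only lists the returned keys. A slightly stricter check, sketched below, also verifies that the scores look like a probability distribution. The helper `validate_scores` and its tolerance are hypothetical and not part of `src.sentiment_model`; the sketch assumes the model returns softmax probabilities, and the sample dictionary is the result printed in this notebook for the Italian example sentence.

```python
import math

# Class names as returned by analyze_sentiment in this notebook
EXPECTED_CLASSES = ("Negativo", "Neutro", "Positivo")

def validate_scores(scores, tol=1e-3):
    """Hypothetical helper: True if the keys match and the scores sum to about 1."""
    if set(scores) != set(EXPECTED_CLASSES):
        return False
    if any(not 0.0 <= p <= 1.0 for p in scores.values()):
        return False
    return math.isclose(sum(scores.values()), 1.0, abs_tol=tol)

# Scores copied from the notebook output for "Oggi è una bella giornata!"
sample = {
    "Negativo": 0.0330863781273365,
    "Neutro": 0.6953458189964294,
    "Positivo": 0.27156782150268555,
}
print(validate_scores(sample))  # prints: True
```

A check of this kind could run before each prediction is logged, so that malformed outputs are caught early rather than skewing the tracked metrics.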


In [2]:
# Basic inference example
from src.sentiment_model import analyze_sentiment

text = "Oggi è una bella giornata!"  # Italian: "Today is a beautiful day!"
result = analyze_sentiment(text)
print(f"Text: {text}")
print(f"Result: {result}")

RobertaForSequenceClassification LOAD REPORT from: /workspaces/mlops-ex/twitter-roberta-base-sentiment
Key                             | Status     |  | 
--------------------------------+------------+--+-
roberta.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Text: Oggi è una bella giornata!
Result: {'Negativo': 0.0330863781273365, 'Neutro': 0.6953458189964294, 'Positivo': 0.27156782150268555}
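
The evaluation later in this notebook reduces this dictionary to a single predicted label and a confidence value. The sketch below shows that reduction on the scores printed above. The `margin` between the two highest scores is an extra, hypothetical quantity (it is not used elsewhere in the notebook) that might help flag ambiguous predictions.

```python
# Scores copied from the inference output above
result = {
    "Negativo": 0.0330863781273365,
    "Neutro": 0.6953458189964294,
    "Positivo": 0.27156782150268555,
}

# The predicted label is the key with the highest score
pred_label = max(result, key=result.get)
confidence = result[pred_label]

# Hypothetical extra: gap between the best and second-best scores
top_two = sorted(result.values(), reverse=True)[:2]
margin = top_two[0] - top_two[1]

print(pred_label, round(confidence, 4))  # prints: Neutro 0.6953
```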


## Model Evaluation on the Test Set

We evaluate the model on a random sample of up to 50 labeled tweets from the test set, computing accuracy and the weighted F1 score.
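
To make the two metrics concrete, here is a minimal pure-Python sketch of accuracy and support-weighted F1. The functions `accuracy` and `weighted_f1` and the five toy labels are illustrative assumptions; the notebook itself relies on `sklearn.metrics` for the actual numbers.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)

# Toy data, invented for illustration only
y_true = ["Negativo", "Negativo", "Neutro", "Neutro", "Positivo"]
y_pred = ["Negativo", "Neutro", "Neutro", "Neutro", "Positivo"]

print(f"Accuracy: {accuracy(y_true, y_pred):.4f}")        # prints: Accuracy: 0.8000
print(f"Weighted F1: {weighted_f1(y_true, y_pred):.4f}")  # prints: Weighted F1: 0.7867
```

Weighting by support means frequent classes dominate the score, which matters when the sampled labels are imbalanced.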

In [19]:
from datasets import load_dataset, Dataset
import os

# Try loading tweet_eval default builder
dataset = load_dataset("tweet_eval")
# Use the test split if available
test_dataset = dataset['test']

# If the loaded test split has no label fields, attempt to load local files for the 'sentiment' task
sample_keys = list(test_dataset[0].keys()) if len(test_dataset) > 0 else []
if not any(k in ('sentiment', 'label', 'label_id') for k in sample_keys):
    local_base = os.path.join('tweet_eval', 'datasets', 'sentiment')
    text_path = os.path.join(local_base, 'test_text.txt')
    label_path = os.path.join(local_base, 'test_labels.txt')
    if os.path.exists(text_path) and os.path.exists(label_path):
        # Use context managers so the file handles are closed promptly
        with open(text_path, 'r', encoding='utf-8') as f:
            texts = f.read().splitlines()
        with open(label_path, 'r', encoding='utf-8') as f:
            labels = [int(x) for x in f.read().splitlines()]
        test_dataset = Dataset.from_dict({
            'text': texts,
            'sentiment': labels
        })
        print("Loaded local tweet_eval 'sentiment' test split from workspace files.")
    else:
        print("No label fields found and local sentiment files not present.")

# Show the available columns
try:
    print("Available columns:", test_dataset.column_names)
except Exception:
    print("Available columns:", list(test_dataset[0].keys()) if len(test_dataset) > 0 else [])

# Show one sample
print("\nFirst sample:")
print(test_dataset[0])

Repo card metadata block was not found. Setting CardData to empty.

Loaded local tweet_eval 'sentiment' test split from workspace files.
Available columns: ['text', 'sentiment']

First sample:
{'text': "@user @user what do these '1/2 naked pics' have to do with anything? They're not even like that. ", 'sentiment': 1}


In [20]:
# Model evaluation
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix
from src.sentiment_model import analyze_sentiment
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Evaluate on up to 50 labeled samples drawn at random (kept small for speed)
sample_size = min(50, len(test_dataset))
# Randomly sample a larger pool and then pick labeled samples from it
pool_size = min(len(test_dataset), sample_size * 3)
sample_indices = np.random.choice(len(test_dataset), pool_size, replace=False)

# Label mapping
label_mapping = {0: "Negativo", 1: "Neutro", 2: "Positivo"}
reverse_mapping = {"Negativo": 0, "Neutro": 1, "Positivo": 2}

# Evaluation loop
predictions = []
true_labels = []
confidences = []

for idx in sample_indices:
    sample = test_dataset[int(idx)]
    text = sample['text']
    # Robust label lookup: support several possible field names
    if 'sentiment' in sample:
        raw_label = sample['sentiment']
    elif 'label' in sample:
        raw_label = sample['label']
    elif 'label_id' in sample:
        raw_label = sample['label_id']
    else:
        # Skip samples without labels (some splits may be unlabeled)
        continue
    # Convert numeric label to its string mapping if necessary
    if isinstance(raw_label, int):
        true_label = label_mapping[raw_label]
    else:
        true_label = raw_label
    
    # Run the prediction
    result = analyze_sentiment(text)
    pred_label = max(result, key=result.get)
    confidence = result[pred_label]
    
    predictions.append(pred_label)
    true_labels.append(true_label)
    confidences.append(confidence)
    # Stop once we've collected enough labeled samples
    if len(true_labels) >= sample_size:
        break

# If no labeled samples were found, inform the user and skip metric computation
if len(true_labels) == 0:
    print("No labeled samples found in the test dataset; cannot compute metrics.")
else:
    # Compute the metrics
    accuracy = accuracy_score(true_labels, predictions)
    f1 = f1_score(true_labels, predictions, average='weighted', zero_division=0)

    print("=" * 60)
    print("MODEL EVALUATION ON THE TEST SET")
    print("=" * 60)
    print(f"\nAccuracy: {accuracy:.4f}")
    print(f"F1 Score (weighted): {f1:.4f}")
    print(f"Mean confidence: {np.mean(confidences):.4f}")
    print(f"\n{classification_report(true_labels, predictions, zero_division=0)}")

MODEL EVALUATION ON THE TEST SET

Accuracy: 0.7600
F1 Score (weighted): 0.7600
Mean confidence: 0.7100

              precision    recall  f1-score   support

    Negativo       0.69      0.73      0.71        15
      Neutro       0.84      0.72      0.78        29
    Positivo       0.67      1.00      0.80         6

    accuracy                           0.76        50
   macro avg       0.73      0.82      0.76        50
weighted avg       0.77      0.76      0.76        50
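
The `macro avg` and `weighted avg` rows can be recomputed from the per-class rows of the report, which helps when reading a result based on only 50 samples. The sketch below does this with the rounded values printed above, so the results agree with the report only to two decimals. The `rows` dictionary and variable names are illustrative.

```python
# (precision, recall, f1, support) copied from the report above
rows = {
    "Negativo": (0.69, 0.73, 0.71, 15),
    "Neutro": (0.84, 0.72, 0.78, 29),
    "Positivo": (0.67, 1.00, 0.80, 6),
}

total_support = sum(r[3] for r in rows.values())

# Macro average: every class counts equally
macro_recall = sum(r[1] for r in rows.values()) / len(rows)
macro_f1 = sum(r[2] for r in rows.values()) / len(rows)

# Weighted average: every class counts in proportion to its support
weighted_recall = sum(r[1] * r[3] for r in rows.values()) / total_support
weighted_f1 = sum(r[2] * r[3] for r in rows.values()) / total_support

print(f"macro recall:    {macro_recall:.2f}")     # prints: macro recall:    0.82
print(f"weighted recall: {weighted_recall:.2f}")  # prints: weighted recall: 0.76
print(f"macro F1:        {macro_f1:.2f}")         # prints: macro F1:        0.76
print(f"weighted F1:     {weighted_f1:.2f}")      # prints: weighted F1:     0.76
```

The gap between macro recall (0.82) and weighted recall (0.76) comes from the `Positivo` class: its perfect recall rests on only 6 samples, yet it counts as a full third of the macro average.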

