
# Assignment 1 — Sexism Detection

**Group members:** _[Jacopo Francesco Amoretti]_  

---

## Submission checklist
- [ ] Task 1: Corpus (majority-vote aggregation, EN filtering, label encoding)
- [ ] Task 2: Data Cleaning (emoji/hashtag/mention/url/symbols/quotes + lemmatization)
- [ ] Task 3: Text Encoding (GloVe + OOV handling + embedding matrix)
- [ ] Task 4: Models (baseline and stacked BiLSTM)
- [ ] Task 5: Training & Evaluation (≥ 3 seeds, macro F1/Prec/Rec, avg ± std)
- [ ] Task 6: Transformers (Twitter-roBERTa-base-hate + Trainer)
- [ ] Task 7: Error Analysis (error patterns, confusion/PR, examples)
- [ ] Task 8: Report (summary of results, figures, metrics table)



## Setup

Run this section once at the start. It sets the seeds, imports the libraries, and defines the project paths.


In [1]:

# === Basic imports ===
import os
import re
import json
import math
import random
import numpy as np
import pandas as pd
from pathlib import Path

# Visualization/plots
import matplotlib.pyplot as plt

# Metrics
from sklearn.metrics import classification_report, precision_recall_fscore_support, confusion_matrix

# Optional: progress bar
try:
    from tqdm.auto import tqdm
except ImportError:
    # Fallback that accepts (and ignores) tqdm's keyword arguments
    def tqdm(iterable, **kwargs):
        return iterable

# Random seeds (Python and NumPy)
SEED = 1337
random.seed(SEED)
np.random.seed(SEED)

# Project paths (editable)
DATA_DIR = Path('data')            # Must contain: train.json, val.json, test.json
GLOVE_DIR = Path('glove')          # Files such as glove.6B.100d.txt (or others)
ARTIFACTS_DIR = Path('artifacts')  # Where vocab, mappings, embedding matrix, etc. are saved
MODELS_DIR = Path('models')
RESULTS_DIR = Path('results')

for d in [ARTIFACTS_DIR, MODELS_DIR, RESULTS_DIR]:
    d.mkdir(parents=True, exist_ok=True)

print('Setup ok.')


Setup ok.



# Task 1 — Corpus

**Goal**  
1. Load the `train/val/test` JSON files into DataFrames.  
2. Apply majority voting to `labels_task2` → new `label` column (drop items without a clear majority).  
3. Filter `lang == 'en'`.  
4. Keep only the columns `id_EXIST`, `lang`, `tweet`, `label`.  
5. Encode `label` with the mapping:
```python
label2id = {'-': 0, 'DIRECT': 1, 'JUDGEMENTAL': 2, 'REPORTED': 3}
```
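
As a minimal sketch of steps 2 to 5, the snippet below runs the whole pipeline on three toy records. The records, their annotations, and the helper name `majority_label` are invented for illustration and are not taken from the dataset.

```python
from collections import Counter

import pandas as pd

label2id = {'-': 0, 'DIRECT': 1, 'JUDGEMENTAL': 2, 'REPORTED': 3}

def majority_label(labels):
    """Return the most frequent label, or None when the top count is tied."""
    ranked = Counter(labels).most_common()
    if not ranked or (len(ranked) > 1 and ranked[0][1] == ranked[1][1]):
        return None
    return ranked[0][0]

# Toy records (hypothetical): one clear majority, one tie, one non-English item
toy = pd.DataFrame({
    'id_EXIST': ['1', '2', '3'],
    'lang': ['en', 'en', 'es'],
    'tweet': ['a', 'b', 'c'],
    'labels_task2': [['DIRECT', 'DIRECT', '-'], ['DIRECT', '-'], ['-', '-', 'REPORTED']],
})

toy['label'] = toy['labels_task2'].apply(majority_label)
toy = toy.dropna(subset=['label'])                        # step 2: drop ties
toy = toy[toy['lang'] == 'en']                            # step 3: English only
toy = toy[['id_EXIST', 'lang', 'tweet', 'label']].copy()  # step 4: keep four columns
toy['label_id'] = toy['label'].map(label2id)              # step 5: encode
print(toy)
```

Only record `1` survives: record `2` is a tie and record `3` is not English.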


In [None]:

# == Helper: majority vote over a list of labels ==
from collections import Counter

def majority_vote(labels):
    # labels: list of strings (e.g. ['DIRECT','DIRECT','REPORTED','-','JUDGEMENTAL','REPORTED'])
    cnt = Counter(labels)
    # Labels sorted by descending count
    top = cnt.most_common()
    if len(top) == 0:
        return None, False
    # The majority must be unique: a tie for first place means no majority
    if len(top) > 1 and top[0][1] == top[1][1]:
        return None, False  # no clear majority
    return top[0][0], True

label2id = {'-': 0, 'DIRECT': 1, 'JUDGEMENTAL': 2, 'REPORTED': 3}
id2label = {v:k for k,v in label2id.items()}


In [None]:

# == Load the JSON files into DataFrames ==
# Expected: three JSON files with the structure of the course repo (train/val/test)
# Replace the file names below with the actual ones if they differ.
train_path = DATA_DIR / 'train.json'
val_path   = DATA_DIR / 'val.json'
test_path  = DATA_DIR / 'test.json'

def load_exist_json(path: Path):
    with open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    # data is a dict mapping id (string) -> record
    df = pd.DataFrame.from_dict(data, orient='index')
    return df

df_train_raw = load_exist_json(train_path)
df_val_raw   = load_exist_json(val_path)
df_test_raw  = load_exist_json(test_path)

print('Train raw:', df_train_raw.shape, '| Val raw:', df_val_raw.shape, '| Test raw:', df_test_raw.shape)
df_train_raw.head(2)


In [None]:

# == Majority voting on labels_task2 and cleanup ==
def apply_majority_and_filter(df):
    mv_labels = []
    keep_mask = []
    for _, row in df.iterrows():
        mv, ok = majority_vote(row['labels_task2'])
        mv_labels.append(mv)
        keep_mask.append(ok)
    df = df.copy()
    df['label'] = mv_labels
    df = df[np.asarray(keep_mask, dtype=bool)]  # drop items without a clear majority
    return df

df_train_mv = apply_majority_and_filter(df_train_raw)
df_val_mv   = apply_majority_and_filter(df_val_raw)
df_test_mv  = apply_majority_and_filter(df_test_raw)

print('After majority vote ->',
      'Train:', df_train_mv.shape, 'Val:', df_val_mv.shape, 'Test:', df_test_mv.shape)


In [None]:

# == EN language filter and column selection ==
KEEP_COLS = ['id_EXIST', 'lang', 'tweet', 'label']

def filter_and_select(df):
    df = df[df['lang'] == 'en'].copy()
    df = df[KEEP_COLS].copy()
    return df

df_train = filter_and_select(df_train_mv)
df_val   = filter_and_select(df_val_mv)
df_test  = filter_and_select(df_test_mv)

print('EN only ->', 'Train:', df_train.shape, 'Val:', df_val.shape, 'Test:', df_test.shape)
df_train.head(3)


In [None]:

# == Label encoding ==
def encode_labels(df):
    df = df.copy()
    df['label_id'] = df['label'].map(label2id)
    return df

df_train = encode_labels(df_train)
df_val   = encode_labels(df_val)
df_test  = encode_labels(df_test)

print(df_train['label'].value_counts())
df_train.head(3)



# Task 2 — Data Cleaning

**Requirements**  
- Remove: emoji, hashtags (`#...`), mentions (`@user`), URLs, special characters/symbols, typographic quotes.  
- Lemmatization (English).

> **Note:** you can use `spaCy` with `en_core_web_sm`, or `nltk`/`stanza`. Below is a basic spaCy pipeline; if the model is not installed, see the comments.
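
The removal steps can be sketched with the standard library alone, assuming one simple regex per category. The patterns and the name `strip_noise` are illustrative choices, not the required implementation. Emoji in the supplementary Unicode planes are matched by code-point range, while symbols in the Basic Multilingual Plane (such as `❤` or typographic quotes) are caught by the final symbol filter.

```python
import re

URL_RE = re.compile(r'https?://\S+|www\.\S+')
MENTION_RE = re.compile(r'@\w+')
HASHTAG_RE = re.compile(r'#\w+')
ASTRAL_RE = re.compile('[\U00010000-\U0010ffff]')   # most emoji live in these planes
SYMBOL_RE = re.compile(r"[^0-9A-Za-z'.,!?\s]")      # anything but basic text characters

def strip_noise(text):
    """Apply the removals in order, then collapse repeated whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, ASTRAL_RE, SYMBOL_RE):
        text = pattern.sub(' ', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = 'I \u2764 this \U0001F605 @user #tag http://t.co/x \u201cso\u201d true!'
print(strip_noise(sample))  # I this so true!
```

URLs are removed first so that the symbol filter does not break them into fragments such as `http t.co x`.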


In [None]:

# == Text cleaning: basic regexes ==

URL_RE = re.compile(r'https?://\S+|www\.\S+')
MENTION_RE = re.compile(r'@\w+')
HASHTAG_RE = re.compile(r'#\w+')
EMOJI_RE = re.compile('[\U00010000-\U0010ffff]', flags=re.UNICODE)  # removes characters outside the BMP, where most emoji live
SPECIAL_QUOTES_REPLACEMENTS = {
    '“': '"', '”': '"', '‘': "'", '’': "'",
    '«': '"', '»': '"', '…': '...'
}

def normalize_quotes(text: str):
    for k, v in SPECIAL_QUOTES_REPLACEMENTS.items():
        text = text.replace(k, v)
    return text

def basic_clean(text: str):
    text = normalize_quotes(text)
    text = URL_RE.sub(' ', text)
    text = MENTION_RE.sub(' ', text)
    text = HASHTAG_RE.sub(' ', text)
    text = EMOJI_RE.sub(' ', text)
    # remove non-alphanumeric characters (minimal punctuation . , ! ? ' is kept)
    text = re.sub(r"[^0-9A-Za-z'\.,!\?\s]", ' ', text)
    # collapse repeated whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Quick test
print(basic_clean("Check this: https://ex.com @user #hashtag 😅 “quote” — symbols!"))


In [None]:

# == Lemmatization with spaCy (graceful fallback) ==
USE_SPACY = True
try:
    import spacy
    try:
        nlp = spacy.load('en_core_web_sm', disable=['ner'])
    except Exception:
        # If not installed: !python -m spacy download en_core_web_sm
        nlp = None
        print("Warning: spaCy model 'en_core_web_sm' is not installed. Install it and re-run.")
except Exception:
    USE_SPACY = False
    nlp = None
    print('spaCy not available; skipping lemmatization (or use another library).')

def lemmatize_en(texts):
    if nlp is None:
        return texts  # no-op
    docs = nlp.pipe(texts, batch_size=512)
    out = []
    for doc in docs:
        lemmas = [t.lemma_.lower() for t in doc if not t.is_space]
        out.append(' '.join(lemmas))
    return out

def apply_clean_and_lemma(df, text_col='tweet'):
    df = df.copy()
    df['clean'] = df[text_col].astype(str).apply(basic_clean)
    df['clean_lemma'] = lemmatize_en(df['clean'].tolist())
    return df

# Apply (you can restrict this to a subset for quick debugging)
df_train = apply_clean_and_lemma(df_train, 'tweet')
df_val   = apply_clean_and_lemma(df_val, 'tweet')
df_test  = apply_clean_and_lemma(df_test, 'tweet')

df_train[['tweet','clean','clean_lemma']].head(3)



# Task 3 — Text Encoding (GloVe + OOV)

**Goals**  
- Tokenize `df_train['clean_lemma']` (or `clean` if not lemmatized).  
- Build the vocabulary: **all train tokens** (or the union `train ∪ GloVe`).  
- Load GloVe and build the `embedding_matrix` with:
  - token in GloVe → pretrained vector
  - OOV token (in train but not in GloVe) → custom vector (e.g. random)
  - unknown token in val/test → `<UNK>` with a static embedding
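
The three cases above can be illustrated on a toy scale. The sketch below assumes a 3-dimensional stand-in for GloVe (`fake_glove`, with invented vectors) and a three-token training vocabulary; real GloVe vectors have 50 to 300 dimensions.

```python
import numpy as np

EMB_DIM = 3
rng = np.random.default_rng(0)

# Stand-in for GloVe: a tiny hypothetical lookup table
fake_glove = {'woman': np.array([0.1, 0.2, 0.3], dtype=np.float32),
              'man':   np.array([0.3, 0.2, 0.1], dtype=np.float32)}
train_tokens = ['woman', 'man', 'mansplain']   # 'mansplain' is OOV in fake_glove

itos = ['<PAD>', '<UNK>'] + sorted(train_tokens)
stoi = {tok: i for i, tok in enumerate(itos)}

matrix = np.zeros((len(itos), EMB_DIM), dtype=np.float32)  # <PAD> row stays zero
matrix[stoi['<UNK>']] = rng.normal(0, 0.1, EMB_DIM)        # one fixed random <UNK> vector
for tok in train_tokens:
    vec = fake_glove.get(tok)
    # Pretrained vector when available, otherwise a random vector for the train OOV
    matrix[stoi[tok]] = vec if vec is not None else rng.normal(0, 0.1, EMB_DIM)

def lookup(token):
    """Tokens never seen in training map to the <UNK> row."""
    return matrix[stoi.get(token, stoi['<UNK>'])]

print(lookup('woman'), lookup('mansplain'), lookup('unseen_token'))
```

A train OOV token gets its own row, so its vector can still be fine-tuned if the embedding layer is trainable, whereas every unseen val/test token shares the single `<UNK>` row.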


In [None]:

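# == Very simple tokenization (word characters plus an optional apostrophe suffix) ==
# The group is non-capturing: with a capturing group, findall() returns only the group contents.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?")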

def simple_tokenize(text):
    return TOKEN_RE.findall(text.lower())

def build_vocab_from_train(texts, min_freq=1):
    from collections import Counter
    c = Counter()
    for t in texts:
        for tok in simple_tokenize(t):
            c[tok] += 1
    vocab = {tok for tok, f in c.items() if f >= min_freq}
    return vocab, c

train_texts = df_train['clean_lemma'] if 'clean_lemma' in df_train.columns else df_train['clean']
vocab_set, freq = build_vocab_from_train(train_texts.tolist(), min_freq=1)
print('Vocab size (train):', len(vocab_set))


In [None]:

# == Load GloVe ==
# Expected example: GLOVE_DIR / 'glove.6B.100d.txt'
# Change EMB_DIM and the file name to match the file you have.
EMB_DIM = 100
GLOVE_FILE = GLOVE_DIR / f'glove.6B.{EMB_DIM}d.txt'

def load_glove(path):
    emb = {}
    if not path.exists():
        print(f'WARNING: GloVe file not found: {path}. All tokens will get random vectors.')
        return emb
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            w = parts[0]
            vec = np.asarray(parts[1:], dtype=np.float32)
            emb[w] = vec
    print('GloVe vectors loaded:', len(emb))
    return emb

glove = load_glove(GLOVE_FILE)


In [None]:

# == Build the final vocabulary and the embedding matrix ==
# Accepted strategies: use only train tokens, or the union (train ∪ GloVe).
USE_UNION_WITH_GLOVE = False

SPECIAL_TOKENS = ['<PAD>', '<UNK>']
token_list = sorted(vocab_set)
if USE_UNION_WITH_GLOVE and len(glove) > 0:
    token_list = sorted(set(token_list).union(set(glove.keys())))

itos = SPECIAL_TOKENS + token_list
stoi = {tok:i for i, tok in enumerate(itos)}

def rand_vec(d):
    # Random initialization for OOV tokens
    return np.random.normal(0, 0.1, size=(d,)).astype(np.float32)

embedding_matrix = np.zeros((len(itos), EMB_DIM), dtype=np.float32)
# <PAD> = zero vector, used for masking
embedding_matrix[stoi['<PAD>']] = np.zeros(EMB_DIM, dtype=np.float32)
# <UNK> = random (static)
embedding_matrix[stoi['<UNK>']] = rand_vec(EMB_DIM)

oov_count = 0
for tok in token_list:
    idx = stoi[tok]
    if tok in glove:
        embedding_matrix[idx] = glove[tok]
    else:
        embedding_matrix[idx] = rand_vec(EMB_DIM)
        oov_count += 1

print('Total vocab:', len(itos), '| OOV (train vs GloVe):', oov_count)

# Save the mappings and the embedding matrix for later tasks
np.save(ARTIFACTS_DIR / 'embedding_matrix.npy', embedding_matrix)
pd.Series(itos).to_csv(ARTIFACTS_DIR / 'itos.csv', index=False)
pd.Series(stoi).to_csv(ARTIFACTS_DIR / 'stoi.csv')
print('Saved embedding_matrix.npy, itos.csv, stoi.csv to', ARTIFACTS_DIR)


In [None]:

# == Text -> ID conversion (for models with an Embedding layer) ==
MAX_LEN = 64  # editable
PAD_ID = stoi['<PAD>']
UNK_ID = stoi['<UNK>']

def encode_text(text, max_len=MAX_LEN):
    toks = simple_tokenize(text)
    ids = [stoi.get(t, UNK_ID) for t in toks]
    if len(ids) < max_len:
        ids = ids + [PAD_ID] * (max_len - len(ids))
    else:
        ids = ids[:max_len]
    return ids

def encode_dataframe(df, text_col='clean_lemma'):
    X = np.vstack([encode_text(t) for t in df[text_col].tolist()])
    y = df['label_id'].values.astype(int)
    return X, y

X_train, y_train = encode_dataframe(df_train)
X_val,   y_val   = encode_dataframe(df_val)
X_test,  y_test  = encode_dataframe(df_test)

X_train.shape, X_val.shape, X_test.shape



# Task 4 — Model Definition (BiLSTM)

**Required**  
- **Baseline:** Bidirectional LSTM + final Dense layer.  
- **Stacked:** a second BiLSTM on top of the first.  
- You can use Keras (recommended) or PyTorch. The code below uses Keras.
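
As a sanity check against `model.summary()`, the trainable parameter count of the two architectures can be worked out by hand. The sketch assumes the standard LSTM parameterization (four gates, each with an input kernel, a recurrent kernel, and a bias), 100-dimensional frozen embeddings, and layer sizes of 128 and 64 units; the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM: 4 gates x (input kernel + recurrent kernel + bias)."""
    return 4 * (units * (input_dim + units) + units)

def bilstm_params(input_dim, units):
    """A bidirectional wrapper holds two independent LSTMs."""
    return 2 * lstm_params(input_dim, units)

def dense_params(input_dim, outputs):
    """Weight matrix plus one bias per output unit."""
    return input_dim * outputs + outputs

emb_dim, num_classes = 100, 4

# Baseline: one BiLSTM(128) whose 2 * 128 outputs feed the softmax layer
baseline = bilstm_params(emb_dim, 128) + dense_params(2 * 128, num_classes)

# Stacked: the second BiLSTM(64) reads the 2 * 128 features of the first
stacked = (bilstm_params(emb_dim, 128)
           + bilstm_params(2 * 128, 64)
           + dense_params(2 * 64, num_classes))

print(baseline)  # 235524
print(stacked)   # 399364
```

The embedding matrix is excluded because it is frozen in this setup; if the embeddings are made trainable, add `vocab_size * emb_dim` to both totals.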


In [None]:

# == Keras BiLSTM models ==
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 4  # '-', DIRECT, JUDGEMENTAL, REPORTED
EMBED_TRAINABLE = False  # optional: set to True to fine-tune the embeddings

def build_baseline_bilstm(vocab_size, emb_dim, embedding_matrix, max_len=64):
    inp = layers.Input(shape=(max_len,), name='input_ids')
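    emb = layers.Embedding(input_dim=vocab_size,
                           output_dim=emb_dim,
                           # A Constant initializer loads the pretrained matrix; the `weights=`
                           # argument is rejected by some Keras 3 releases
                           embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
                           trainable=EMBED_TRAINABLE,
                           mask_zero=True,  # relies on <PAD> having id 0
                           name='encoder_embedding')(inp)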
    x = layers.Bidirectional(layers.LSTM(128))(emb)
    x = layers.Dropout(0.2)(x)
    out = layers.Dense(NUM_CLASSES, activation='softmax')(x)
    model = models.Model(inp, out, name='bilstm_baseline')
    return model

def build_stacked_bilstm(vocab_size, emb_dim, embedding_matrix, max_len=64):
    inp = layers.Input(shape=(max_len,), name='input_ids')
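    emb = layers.Embedding(input_dim=vocab_size,
                           output_dim=emb_dim,
                           # A Constant initializer loads the pretrained matrix; the `weights=`
                           # argument is rejected by some Keras 3 releases
                           embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
                           trainable=EMBED_TRAINABLE,
                           mask_zero=True,  # relies on <PAD> having id 0
                           name='encoder_embedding')(inp)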
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(emb)
    x = layers.Bidirectional(layers.LSTM(64))(x)
    x = layers.Dropout(0.3)(x)
    out = layers.Dense(NUM_CLASSES, activation='softmax')(x)
    model = models.Model(inp, out, name='bilstm_stacked')
    return model

VOCAB_SIZE = embedding_matrix.shape[0]
EMB_DIM = embedding_matrix.shape[1]

baseline = build_baseline_bilstm(VOCAB_SIZE, EMB_DIM, embedding_matrix, MAX_LEN)
stacked  = build_stacked_bilstm(VOCAB_SIZE, EMB_DIM, embedding_matrix, MAX_LEN)

baseline.summary()



# Task 5 — Training & Evaluation

**Requirements**  
- Train **≥ 3 seeds** for each model.  
- Evaluate on **validation** with macro F1/Precision/Recall.  
- Report **mean ± std**.  
- Select the best model by **macro F1** on validation.
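
A minimal sketch of the "mean ± std" aggregation, assuming one row of validation scores per seed. The numbers are placeholders chosen for easy arithmetic, not real results, and `summarize` is a hypothetical helper name.

```python
import pandas as pd

# Placeholder per-seed validation scores (hypothetical numbers, not real results)
scores = pd.DataFrame([
    {'seed': 1337, 'precision': 0.50, 'recall': 0.40, 'f1': 0.40},
    {'seed': 2025, 'precision': 0.60, 'recall': 0.50, 'f1': 0.50},
    {'seed': 42,   'precision': 0.70, 'recall': 0.60, 'f1': 0.60},
])

def summarize(df, metrics=('precision', 'recall', 'f1')):
    """Format each metric as 'mean ± std' (sample std, ddof=1) across seeds."""
    cols = list(metrics)  # the 'seed' column is deliberately left out
    mean, std = df[cols].mean(), df[cols].std(ddof=1)
    return {m: f'{mean[m]:.3f} \u00b1 {std[m]:.3f}' for m in cols}

summary = summarize(scores)
print(summary)
```

Selecting only the metric columns matters: calling `.mean()` on the whole frame would also average the seed values. With three seeds the sample standard deviation (`ddof=1`) is noticeably larger than the population one, so the report should state which was used.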


In [None]:

# == Multi-seed training loop (skeleton) ==
def train_and_eval(model_fn, X_tr, y_tr, X_va, y_va, seeds=(1337, 2025, 42), epochs=5, batch_size=64):
    histories = []
    scores = []
    for s in seeds:
        tf.keras.utils.set_random_seed(s)
        model = model_fn(VOCAB_SIZE, EMB_DIM, embedding_matrix, MAX_LEN)
        model.compile(optimizer='adam',
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])
        h = model.fit(X_tr, y_tr, validation_data=(X_va, y_va),
                      epochs=epochs, batch_size=batch_size, verbose=1)
        histories.append(h.history)
        # Detailed evaluation on the validation set
        y_pred = np.argmax(model.predict(X_va), axis=1)
        prec, rec, f1, _ = precision_recall_fscore_support(y_va, y_pred, average='macro', zero_division=0)
        scores.append({'seed': s, 'precision': prec, 'recall': rec, 'f1': f1})
    return histories, pd.DataFrame(scores)

# Usage example (commented out to avoid accidental execution in slow environments):
# hist_base, df_scores_base = train_and_eval(build_baseline_bilstm, X_train, y_train, X_val, y_val)
# df_scores_base[['precision', 'recall', 'f1']].agg(['mean', 'std'])  # exclude the seed column


In [None]:

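# == Helper functions for the metrics report and the confusion matrix ==
def evaluate_predictions(y_true, y_pred, labels_map=id2label):
    label_ids = sorted(labels_map.keys())
    # Pass the same label ids to both calls so report rows and matrix axes line up
    report = classification_report(y_true, y_pred, labels=label_ids,
                                   target_names=[labels_map[i] for i in label_ids],
                                   output_dict=True, zero_division=0)
    df_rep = pd.DataFrame(report).T
    cm = confusion_matrix(y_true, y_pred, labels=label_ids)
    return df_rep, cm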

def plot_confusion_matrix(cm, labels):
    fig, ax = plt.subplots(figsize=(5,5))
    im = ax.imshow(cm, interpolation='nearest')
    ax.set_xticks(range(len(labels)))
    ax.set_yticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=45, ha='right')
    ax.set_yticklabels(labels)
    ax.set_xlabel('Predicted')
    ax.set_ylabel('True')
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, cm[i, j], ha='center', va='center')
    plt.tight_layout()
    plt.show()



# Task 6 — Transformers (Twitter-roBERTa-base-hate)

Model: **cardiffnlp/twitter-roberta-base-hate**  
Steps:
1. Tokenization with the HF tokenizer.
2. Prepare the `Dataset` objects (train/val/test).
3. `Trainer` with the macro F1 metric.
4. Evaluate on **test** with the **same metrics** as the LSTM models.

> **Note:** downloading the model requires internet access; if it is unavailable, prepare a local copy.
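
To make "same metrics" concrete, here is a NumPy-only sketch of macro-averaged precision, recall, and F1 computed from raw logits, the way a metric callback would receive them. The logits and labels are invented, and `macro_prf` is a hypothetical helper; it averages over all `num_classes` and counts undefined ratios as 0.

```python
import numpy as np

def macro_prf(y_true, y_pred, num_classes):
    """Macro-averaged precision, recall and F1; undefined ratios count as 0."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return float(np.mean(precisions)), float(np.mean(recalls)), float(np.mean(f1s))

# Hypothetical logits for 4 examples and 4 classes
logits = np.array([[2.0, 0.1, 0.0, 0.0],
                   [0.0, 3.0, 0.1, 0.0],
                   [0.2, 0.1, 0.0, 1.5],
                   [1.0, 0.9, 0.0, 0.0]])
labels = np.array([0, 1, 2, 1])

preds = logits.argmax(axis=-1)  # class with the highest logit per example
precision, recall, f1 = macro_prf(labels, preds, num_classes=4)
print(preds, precision, recall, f1)
```

Because every class weighs the same in a macro average, the two classes that are never predicted correctly here pull F1 down to one third even though half of the examples are classified correctly.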


In [None]:

# == Skeleton for the HuggingFace Trainer ==
# Commented out to avoid errors in offline environments.

# from datasets import Dataset
# from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

# MODEL_NAME = "cardiffnlp/twitter-roberta-base-hate"

# # HF dataset preparation
# def to_hf_dataset(df, text_col='clean_lemma'):
#    df_hf = df[[text_col, 'label_id']].rename(columns={text_col: 'text', 'label_id': 'label'})
#    return Dataset.from_pandas(df_hf, preserve_index=False)  # do not export the pandas index as a column

# ds_train = to_hf_dataset(df_train)
# ds_val   = to_hf_dataset(df_val)
# ds_test  = to_hf_dataset(df_test)

# tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# def tokenize_fn(ex):
#     return tokenizer(ex['text'], truncation=True, padding='max_length', max_length=64)
# ds_train = ds_train.map(tokenize_fn, batched=True)
# ds_val   = ds_val.map(tokenize_fn, batched=True)
# ds_test  = ds_test.map(tokenize_fn, batched=True)

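# # The pretrained head has a different number of classes (assumed binary: hate / not-hate),
# # so the classifier layer is re-initialized for our 4 labels.
# model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=4, ignore_mismatched_sizes=True)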

# def compute_metrics(eval_pred):
#     logits, labels = eval_pred
#     preds = logits.argmax(axis=-1)
#     prec, rec, f1, _ = precision_recall_fscore_support(labels, preds, average='macro', zero_division=0)
#     return {'macro_f1': f1, 'macro_precision': prec, 'macro_recall': rec}

# args = TrainingArguments(
#     output_dir='hf_outputs',
#     evaluation_strategy='epoch',  # named `eval_strategy` in recent transformers releases
#     save_strategy='epoch',
#     learning_rate=2e-5,
#     per_device_train_batch_size=32,
#     per_device_eval_batch_size=64,
#     num_train_epochs=3,
#     weight_decay=0.01,
#     load_best_model_at_end=True,
#     metric_for_best_model='macro_f1',
#     logging_steps=50,
# )

# trainer = Trainer(
#     model=model,
#     args=args,
#     train_dataset=ds_train,
#     eval_dataset=ds_val,
#     compute_metrics=compute_metrics,
# )

# # trainer.train()
# # eval_results = trainer.evaluate(ds_test)
# # eval_results



# Task 7 — Error Analysis

**Suggestions**  
- Confusion matrix for the best model.  
- Per-class precision/recall (table).  
- Examples of typical misclassifications (shortlist).  
- Notes on OOV, class imbalance, and LSTM vs Transformer differences.
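
One way to turn a confusion matrix into a shortlist of error patterns is to rank its off-diagonal cells. The sketch below assumes rows are true classes and columns are predicted classes; the matrix values are invented for illustration and `top_confusions` is a hypothetical helper.

```python
import numpy as np

id2label = {0: '-', 1: 'DIRECT', 2: 'JUDGEMENTAL', 3: 'REPORTED'}

def top_confusions(cm, id2label, k=3):
    """Return the k largest off-diagonal cells as (true, predicted, count)."""
    off_diag = np.array(cm, dtype=int)  # np.array copies, so cm is left untouched
    np.fill_diagonal(off_diag, 0)       # ignore correct predictions
    order = np.argsort(off_diag, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(order, off_diag.shape)
    return [(id2label[int(r)], id2label[int(c)], int(off_diag[r, c]))
            for r, c in zip(rows, cols) if off_diag[r, c] > 0]

# Hypothetical confusion matrix (rows = true class, columns = predicted class)
cm = np.array([[50,  4, 1, 0],
               [ 9, 20, 2, 1],
               [ 6,  3, 5, 0],
               [ 2,  1, 0, 7]])

for true_label, pred_label, count in top_confusions(cm, id2label):
    print(f'{true_label} -> {pred_label}: {count}')
# DIRECT -> -: 9
# JUDGEMENTAL -> -: 6
# - -> DIRECT: 4
```

The resulting pairs tell you which rows of the validation set to read first when collecting example misclassifications.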


In [None]:

# == Example: collecting typical errors (to be completed after training) ==
# y_true = y_val
# y_pred = y_pred_val  # obtained from the best model
# err_idx = np.where(y_true != y_pred)[0][:20]
# df_errors = df_val.iloc[err_idx][['tweet','clean_lemma','label','label_id']].copy()
# df_errors['pred_label'] = [id2label[i] for i in y_pred[err_idx]]
# df_errors.head(10)



# Task 8 — Report (Template)

**To include in the PDF (max 2 pages):**
- Brief description of the dataset and preprocessing (Tasks 1–2).
- Description of the encoding and of the vocabulary/GloVe (Task 3).
- LSTM models (architecture/hyperparameters) and Transformer (Tasks 4–6).
- Table with macro metrics (avg ± std) on validation; final metrics on test.
- Figures: learning curves (acc/loss) and confusion matrix.
- Error analysis: recurring patterns, proposed improvements.



---

## Operational notes
- **Run** the cells in order, top to bottom.
- **Update** the paths to `data/` and `glove/` if needed.
- **Document** choices and hyperparameters in the Markdown cells.
- **Clean up** superfluous output before submission.
