# Assignment 1
**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: Sexism Detection, Multi-class Classification, RNNs, Transformers, Huggingface



# Contact
For any doubt, question, issue or help, you can always contact us at the following email addresses:

Teaching Assistants:

- Federico Ruggeri -> federico.ruggeri6@unibo.it
- Eleonora Mancini -> e.mancini@unibo.it

Professor:
- Paolo Torroni -> p.torroni@unibo.it

# Introduction
You are asked to address the [EXIST 2023 Task 2](https://clef2023.clef-initiative.eu/index.php?page=Pages/labs.html#EXIST) on sexism detection.

## Problem Definition

This task aims to categorize the sexist messages according to the intention of the author in one of the following categories: (i) direct sexist message, (ii) reported sexist message and (iii) judgemental message.

### Examples:

#### DIRECT 
The intention was to write a message that is sexist by itself or incites to be sexist, as in:

''*A woman needs love, to fill the fridge, if a man can give this to her in return for her services (housework, cooking, etc), I don’t see what else she needs.*''

#### REPORTED
The intention is to report and share a sexist situation suffered by a woman or women in first or third person, as in:

''*Today, one of my year 1 class pupils could not believe he’d lost a race against a girl.*''

#### JUDGEMENTAL
The intention was to judge, since the tweet describes sexist situations or behaviours with the aim of condemning them.

''*As usual, the woman was the one quitting her job for the family’s welfare…*''

# [Task 1 - 1.0 points] Corpus

We have preparared a small version of EXIST dataset in our dedicated [Github repository](https://github.com/lt-nlp-lab-unibo/nlp-course-material/tree/main/2025-2026/Assignment%201/data).

Check the `A1/data` folder. It contains 3 `.json` files representing `training`, `validation` and `test` sets.


### Dataset Description
- The dataset contains tweets in both English and Spanish.
- There are labels for multiple tasks, but we are focusing on **Task 2**.
- For Task 2, labels are assigned by six annotators.
- The labels for Task 2 represent whether the tweet is non-sexist ('-') or its sexist intention ('DIRECT', 'REPORTED', 'JUDGEMENTAL').







### Example

```
    "203260": {
        "id_EXIST": "203260",
        "lang": "en",
        "tweet": "ik when mandy says “you look like a whore” i look cute as FUCK",
        "number_annotators": 6,
        "annotators": ["Annotator_473", "Annotator_474", "Annotator_475", "Annotator_476", "Annotator_477", "Annotator_27"],
        "gender_annotators": ["F", "F", "M", "M", "M", "F"],
        "age_annotators": ["18-22", "23-45", "18-22", "23-45", "46+", "46+"],
        "labels_task1": ["YES", "YES", "YES", "NO", "YES", "YES"],
        "labels_task2": ["DIRECT", "DIRECT", "REPORTED", "-", "JUDGEMENTAL", "REPORTED"],
        "labels_task3": [
          ["STEREOTYPING-DOMINANCE"],
          ["OBJECTIFICATION"],
          ["SEXUAL-VIOLENCE"],
          ["-"],
          ["STEREOTYPING-DOMINANCE", "OBJECTIFICATION"],
          ["OBJECTIFICATION"]
        ],
        "split": "TRAIN_EN"
      }
    }
```

### Instructions
1. **Download** the `A1/data` folder.
2. **Load** the three JSON files and encode them as ``pandas.DataFrame``.
3. **Aggregate labels** for Task 2 using majority voting and store them in a new dataframe column called `label`. Items without a clear majority will be removed from the dataset.
4. **Filter the DataFrame** to keep only rows where the `lang` column is `'en'`.
5. **Remove unwanted columns**: Keep only `id_EXIST`, `lang`, `tweet`, and `label`.
6. **Encode the `label` column**: Use the following mapping

```
{
    '-': 0,
    'DIRECT': 1,
    'JUDGEMENTAL': 2,
    'REPORTED': 3
}
```

In [25]:
# file management
import sys
import shutil
import urllib
import tarfile
from pathlib import Path

# dataframe management
import pandas as pd

# data manipulation
import numpy as np

# for readability
from typing import Iterable
from collections import Counter

# viz
from tqdm import tqdm

import random


In [6]:
train = pd.read_json('data/training.json', orient='index')
val = pd.read_json('data/validation.json', orient='index')
test = pd.read_json('data/test.json', orient='index')

In [7]:
def majority_vote(labels):
    """Apply majority voting to get the label with strict majority (>50%)"""
    if not isinstance(labels, list) or len(labels) == 0:
        return None
    top_label, freq = Counter(labels).most_common(1)[0]
    return top_label if freq > len(labels) / 2 else None

In [10]:
# Label mapping for Task 2
label_map = {'-': 0, 'DIRECT': 1, 'JUDGEMENTAL': 2, 'REPORTED': 3}

# Process train, validation, and test sets
for name in ('train', 'val', 'test'):
    df = globals()[name].copy()
    
    # Step 1: Aggregate labels using majority voting
    df['label'] = df['labels_task2'].apply(majority_vote)
    
    # Remove items without a clear majority
    df = df.dropna(subset=['label'])
    
    # Step 2: Filter to keep only English rows
    df = df[df['lang'] == 'en']
    
    # Step 3: Keep only required columns
    df = df[['id_EXIST', 'lang', 'tweet', 'label']]
    
    # Step 4: Encode the label column
    df['label'] = df['label'].map(label_map)
    
    # Update the global variable
    globals()[name] = df.reset_index(drop=True)

print(f"Train set shape: {train.shape}")
print(f"Validation set shape: {val.shape}")
print(f"Test set shape: {test.shape}")
print(f"\nLabel distribution in train set:\n{train['label'].value_counts().sort_index()}")
train.head()

Train set shape: (2202, 4)
Validation set shape: (115, 4)
Test set shape: (217, 4)

Label distribution in train set:
label
0    1733
1     336
2      42
3      91
Name: count, dtype: int64


Unnamed: 0,id_EXIST,lang,tweet,label
0,200002,en,Writing a uni essay in my local pub with a cof...,3
1,200006,en,According to a customer I have plenty of time ...,3
2,200008,en,New to the shelves this week - looking forward...,0
3,200010,en,I guess that’s fairly normal for a Neanderthal...,0
4,200011,en,#EverydaySexism means women usually end up in ...,2


# [Task2 - 0.5 points] Data Cleaning
In the context of tweets, we have noisy and informal data that often includes unnecessary elements like emojis, hashtags, mentions, and URLs. These elements may interfere with the text analysis.



### Instructions
- **Remove emojis** from the tweets.
- **Remove hashtags** (e.g., `#example`).
- **Remove mentions** such as `@user`.
- **Remove URLs** from the tweets.
- **Remove special characters and symbols**.
- **Remove specific quote characters** (e.g., curly quotes).
- **Perform lemmatization** to reduce words to their base form.

In [14]:
import re
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)

lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    """Convert treebank POS tags to WordNet POS tags for better lemmatization"""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # Default to noun

def clean_tweet(text):
    """Clean tweet text by removing noise and performing lemmatization"""
    if not isinstance(text, str):
        return text
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    
    # Remove mentions (@user)
    text = re.sub(r'@\w+', '', text)
    
    # Remove hashtags (#example)
    text = re.sub(r'#\w+', '', text)
    
    # Remove emojis
    text = re.sub(r'[^\w\s\-\.\,\!\?\']', '', text)
    
    # Remove specific quote characters (curly quotes, etc.)
    text = re.sub(r'[""''`´]', '"', text)
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    # Tokenize
    tokens = word_tokenize(text)
    
    # Remove special characters and short tokens, then lemmatize
    cleaned_tokens = []
    pos_tags = nltk.pos_tag(tokens)
    
    for token, pos in pos_tags:
        # Skip if token is too short or only special characters
        if len(token) < 2:
            continue
        # Lemmatize using POS tag
        wordnet_pos = get_wordnet_pos(pos)
        lemmatized = lemmatizer.lemmatize(token, pos=wordnet_pos)
        cleaned_tokens.append(lemmatized)
    
    return ' '.join(cleaned_tokens)

# Apply cleaning to all datasets
for name in ('train', 'val', 'test'):
    print(f"Cleaning {name} set...")
    globals()[name]['tweet'] = globals()[name]['tweet'].apply(clean_tweet)

print("\nData cleaning completed!")
print(f"\nSample cleaned tweets from train set:")
for i in range(min(3, len(train))):
    print(f"{i+1}. {train['tweet'].iloc[i][:100]}...")


Cleaning train set...
Cleaning val set...
Cleaning test set...

Data cleaning completed!

Sample cleaned tweets from train set:
1. write uni essay in my local pub with coffee random old man keep ask me drunk question when 'm try to...
2. accord to customer have plenty of time to go spent the stirling coin he want to pay me with in derry...
3. new to the shelf this week look forward to read these book...


# [Task 3 - 0.5 points] Text Encoding
To train a neural sexism classifier, you first need to encode text into numerical format.




### Instructions

* Embed words using **GloVe embeddings**.
* You are **free** to pick any embedding dimension.





### What about OOV tokens?
   * All the tokens in the **training** set that are not in GloVe **must** be added to the vocabulary.
   * For the remaining tokens (i.e., OOV in the validation and test sets), you have to assign them a **special token** (e.g., ``<UNK>``) and a **static** embedding.
   * You are **free** to define the static embedding using any strategy (e.g., random, neighbourhood, etc...)



### More about OOV

For a given token:

* **If in train set**: add to vocabulary and assign an embedding (use GloVe if token in GloVe, custom embedding otherwise).
* **If in val/test set**: assign special token if not in vocabulary and assign custom embedding.

Your vocabulary **should**:

* Contain all tokens in train set; or
* Union of tokens in train set and in GloVe $\rightarrow$ we make use of existing knowledge!

In [19]:
# Download GloVe embeddings
import urllib.request
import zipfile
import os

# GloVe embedding dimension
EMBEDDING_DIM = 100

# Create embeddings directory
embeddings_dir = 'embeddings'
if not os.path.exists(embeddings_dir):
    os.makedirs(embeddings_dir)

# Download GloVe embeddings (6B tokens, 100d)
glove_file = os.path.join(embeddings_dir, f'glove.6B.{EMBEDDING_DIM}d.txt')

if not os.path.exists(glove_file):
    print("Downloading GloVe embeddings...")
    url = "http://nlp.stanford.edu/data/glove.6B.zip"
    zip_path = os.path.join(embeddings_dir, 'glove.6B.zip')
    urllib.request.urlretrieve(url, zip_path)
    
    print("Extracting embeddings...")
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(embeddings_dir)
    os.remove(zip_path)
    print("Done!")

# Load GloVe embeddings into a dictionary
print(f"Loading GloVe embeddings ({EMBEDDING_DIM}d)...")
glove_embeddings = {}
with open(glove_file, 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.array(values[1:], dtype='float32')
        glove_embeddings[word] = vector

print(f"Loaded {len(glove_embeddings)} word embeddings from GloVe")

# Build vocabulary from training set
print("\nBuilding vocabulary...")
vocab = {}
vocab_idx = 0

# Special tokens
vocab['<PAD>'] = vocab_idx
vocab_idx += 1
vocab['<UNK>'] = vocab_idx
vocab_idx += 1

# Collect all unique tokens from training data
train_tokens = set()
for tweet in train['tweet']:
    tokens = tweet.split()
    train_tokens.update(tokens)

print(f"Tokens in training set: {len(train_tokens)}")

# First, add all training tokens to vocabulary
for token in sorted(train_tokens):
    if token not in vocab:
        vocab[token] = vocab_idx
        vocab_idx += 1

# Optionally add GloVe tokens not in training (union approach)
# This enriches our vocabulary with words that might appear in val/test
glove_tokens_added = 0
# Sample 10% of GloVe tokens not in training to keep vocabulary manageable
glove_sample = np.random.choice(list(set(glove_embeddings.keys()) - train_tokens), 
                                 size=min(50000, len(glove_embeddings) // 10), 
                                 replace=False)
for token in sorted(glove_sample):
    if token not in vocab:
        vocab[token] = vocab_idx
        vocab_idx += 1
        glove_tokens_added += 1

print(f"Vocabulary size: {len(vocab)}")
print(f"  - Training tokens: {len(train_tokens)}")
print(f"  - GloVe tokens added: {glove_tokens_added}")

# Create embedding matrix
print("\nCreating embedding matrix...")
embedding_matrix = np.zeros((len(vocab), EMBEDDING_DIM))

# Initialize embeddings
oov_count = 0
glove_count = 0

for token, idx in vocab.items():
    if token in glove_embeddings:
        # Token found in GloVe - use pre-trained embedding
        embedding_matrix[idx] = glove_embeddings[token]
        glove_count += 1
    elif token == '<PAD>':
        # PAD token gets zero vector (will be masked)
        embedding_matrix[idx] = np.zeros(EMBEDDING_DIM)
    elif token == '<UNK>':
        # UNK token gets random initialization
        embedding_matrix[idx] = np.random.normal(scale=0.6, size=EMBEDDING_DIM)
    else:
        # OOV tokens in training set get random initialization
        embedding_matrix[idx] = np.random.normal(scale=0.6, size=EMBEDDING_DIM)
        oov_count += 1

print(f"Embeddings from GloVe: {glove_count}")
print(f"Custom embeddings (OOV training tokens): {oov_count}")
print(f"Embedding matrix shape: {embedding_matrix.shape}")

# Create reverse vocabulary (index to token)
idx_to_vocab = {v: k for k, v in vocab.items()}

# Function to encode tweets for validation and test sets
def encode_tweet(tweet, vocab):
    """
    Encode a tweet to token indices.
    
    Strategy for tokens:
    - If token in vocabulary: use its index
    - If token NOT in vocabulary (OOV in val/test): map to <UNK> token
    
    This ensures all val/test tokens get an embedding (either their own or <UNK>)
    """
    tokens = tweet.split()
    indices = []
    for token in tokens:
        if token in vocab:
            indices.append(vocab[token])
        else:
            # Token not in vocabulary -> map to <UNK>
            indices.append(vocab['<UNK>'])
    return indices

# Encode validation and test sets using the token-to-index mapping
print("\nEncoding datasets...")
train['encoded_tweet'] = train['tweet'].apply(lambda x: encode_tweet(x, vocab))
val['encoded_tweet'] = val['tweet'].apply(lambda x: encode_tweet(x, vocab))
test['encoded_tweet'] = test['tweet'].apply(lambda x: encode_tweet(x, vocab))

print("Datasets encoded successfully!")


print("\nVocabulary Coverage:")
print(f"  - Vocabulary size: {len(vocab)}")
print(f"  - GloVe coverage: {glove_count/len(vocab)*100:.1f}%")
print(f"  - Custom coverage: {oov_count/len(vocab)*100:.1f}%")
print("="*70)

print("\nExample vocab and embeddings:")
for token in ['the', 'woman', 'sexist', '<UNK>', '<PAD>']:
    if token in vocab:
        idx = vocab[token]
        print(f"  {token}: index={idx}, embedding_dim={embedding_matrix[idx].shape[0]}")


Loading GloVe embeddings (100d)...
Loaded 400000 word embeddings from GloVe

Building vocabulary...
Tokens in training set: 8362
Vocabulary size: 48364
  - Training tokens: 8362
  - GloVe tokens added: 40000

Creating embedding matrix...
Embeddings from GloVe: 47056
Custom embeddings (OOV training tokens): 1306
Embedding matrix shape: (48364, 100)

Encoding datasets...
Datasets encoded successfully!

Vocabulary Coverage:
  - Vocabulary size: 48364
  - GloVe coverage: 97.3%
  - Custom coverage: 2.7%

Example vocab and embeddings:
  the: index=7362, embedding_dim=100
  woman: index=8170, embedding_dim=100
  sexist: index=6582, embedding_dim=100
  <UNK>: index=1, embedding_dim=100
  <PAD>: index=0, embedding_dim=100


# [Task 4 - 1.0 points] Model definition

You are now tasked to define your sexism classifier.




### Instructions

* **Baseline**: implement a Bidirectional LSTM with a Dense layer on top.

* **Stacked**: add an additional Bidirectional LSTM layer to the Baseline model.

**Note**: You are **free** to experiment with hyper-parameters.

### Token to embedding mapping

You can follow two approaches for encoding tokens in your classifier.

### Work directly with embeddings

- Compute the embedding of each input token
- Feed the mini-batches of shape ``(batch_size, # tokens, embedding_dim)`` to your model

### Work with Embedding layer

- Encode input tokens to token ids
- Define a Embedding layer as the first layer of your model
- Compute the embedding matrix of all known tokens (i.e., tokens in your vocabulary)
- Initialize the Embedding layer with the computed embedding matrix
- You are **free** to set the Embedding layer trainable or not

In [None]:
embedding = tf.keras.layers.Embedding(input_dim=vocab_size,
                                      output_dim=embedding_dimension,
                                      weights=[embedding_matrix],
                                      mask_zero=True,                   # automatically masks padding tokens
                                      name='encoder_embedding')

In [31]:
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam

# Hyperparameters
MAX_SEQ_LENGTH = 100  
LSTM_UNITS = 64
DROPOUT_RATE = 0.3
LEARNING_RATE = 0.001
NUM_CLASSES = 4  # 0: non-sexist, 1: direct, 2: judgemental, 3: reported

# Pad sequences to the same length
print("Padding sequences...")
train_padded = pad_sequences(train['encoded_tweet'], maxlen=MAX_SEQ_LENGTH, padding='post')
val_padded = pad_sequences(val['encoded_tweet'], maxlen=MAX_SEQ_LENGTH, padding='post')
test_padded = pad_sequences(test['encoded_tweet'], maxlen=MAX_SEQ_LENGTH, padding='post')

print(f"Padded sequences shape: {train_padded.shape}")

# Prepare labels
y_train = train['label'].values
y_val = val['label'].values
y_test = test['label'].values

print(f"Train labels shape: {y_train.shape}")
print(f"Val labels shape: {y_val.shape}")
print(f"Test labels shape: {y_test.shape}")

# Function to create Baseline model
def create_baseline_model(vocab_size, embedding_dim, max_seq_length, embedding_matrix, 
                          lstm_units=LSTM_UNITS, dropout_rate=DROPOUT_RATE, 
                          num_classes=NUM_CLASSES, learning_rate=LEARNING_RATE):
    """
    Create Baseline model: Embedding -> Bidirectional LSTM -> Dense
    
    Args:
        vocab_size: Size of vocabulary
        embedding_dim: Embedding dimension
        max_seq_length: Maximum sequence length
        embedding_matrix: Pre-trained embedding matrix
        lstm_units: Number of LSTM units
        dropout_rate: Dropout rate
        num_classes: Number of output classes
        learning_rate: Learning rate for optimizer
    
    Returns:
        Compiled model
    """
    model = Sequential([
        Embedding(input_dim=vocab_size,
                 output_dim=embedding_dim,
                 weights=[embedding_matrix],
                 input_length=max_seq_length,
                 trainable=False,  # Keep GloVe embeddings frozen
                 name='embedding'),
        
        Bidirectional(LSTM(units=lstm_units, return_sequences=False),
                     name='bilstm_1'),
        
        # Dropout(dropout_rate),
        
        # Dense(units=32, activation='relu', name='dense_hidden'),
        
        Dropout(dropout_rate),
        
        Dense(units=num_classes, activation='softmax', name='output')
    ])
    
    model.compile(optimizer=Adam(learning_rate=learning_rate),
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])
    
    return model

# Function to create Stacked model
def create_stacked_model(vocab_size, embedding_dim, max_seq_length, embedding_matrix,
                         lstm_units=LSTM_UNITS, dropout_rate=DROPOUT_RATE,
                         num_classes=NUM_CLASSES, learning_rate=LEARNING_RATE):
    """
    Create Stacked model: Embedding -> BiLSTM -> BiLSTM -> Dense
    
    Args:
        vocab_size: Size of vocabulary
        embedding_dim: Embedding dimension
        max_seq_length: Maximum sequence length
        embedding_matrix: Pre-trained embedding matrix
        lstm_units: Number of LSTM units per layer
        dropout_rate: Dropout rate
        num_classes: Number of output classes
        learning_rate: Learning rate for optimizer
    
    Returns:
        Compiled model
    """
    model = Sequential([
        Embedding(input_dim=vocab_size,
                 output_dim=embedding_dim,
                 weights=[embedding_matrix],
                 input_length=max_seq_length,
                 trainable=False,  # Keep GloVe embeddings frozen
                 name='embedding'),
        
        Bidirectional(LSTM(units=lstm_units, return_sequences=True),
                     name='bilstm_1'),
        
        Dropout(dropout_rate),
        
        Bidirectional(LSTM(units=lstm_units, return_sequences=False),
                     name='bilstm_2'),
        
        # Dropout(dropout_rate),
        
        # Dense(units=32, activation='relu', name='dense_hidden'),
        
        Dropout(dropout_rate),
        
        Dense(units=num_classes, activation='softmax', name='output')
    ])
    
    model.compile(optimizer=Adam(learning_rate=learning_rate),
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])
    
    return model

# Create models
print("\n" + "="*70)
print("MODEL CREATION")
print("="*70)

baseline_model = create_baseline_model(
    vocab_size=len(vocab),
    embedding_dim=EMBEDDING_DIM,
    max_seq_length=MAX_SEQ_LENGTH,
    embedding_matrix=embedding_matrix,
    lstm_units=LSTM_UNITS,
    dropout_rate=DROPOUT_RATE,
    num_classes=NUM_CLASSES,
    learning_rate=LEARNING_RATE
)

stacked_model = create_stacked_model(
    vocab_size=len(vocab),
    embedding_dim=EMBEDDING_DIM,
    max_seq_length=MAX_SEQ_LENGTH,
    embedding_matrix=embedding_matrix,
    lstm_units=LSTM_UNITS,
    dropout_rate=DROPOUT_RATE,
    num_classes=NUM_CLASSES,
    learning_rate=LEARNING_RATE
)

print("\nBASELINE MODEL ARCHITECTURE:")
baseline_model.summary()

print("\n" + "="*70)
print("\nSTACKED MODEL ARCHITECTURE:")
stacked_model.summary()

# Configuration summary
print("\n" + "="*70)
print("HYPERPARAMETER CONFIGURATION")
print("="*70)
print(f"Max Sequence Length: {MAX_SEQ_LENGTH}")
print(f"Embedding Dimension: {EMBEDDING_DIM}")
print(f"LSTM Units: {LSTM_UNITS}")
print(f"Dropout Rate: {DROPOUT_RATE}")
print(f"Learning Rate: {LEARNING_RATE}")
print(f"Number of Classes: {NUM_CLASSES}")
print(f"Vocabulary Size: {len(vocab)}")
print("="*70)


Padding sequences...
Padded sequences shape: (2202, 100)
Train labels shape: (2202,)
Val labels shape: (115,)
Test labels shape: (217,)

MODEL CREATION

BASELINE MODEL ARCHITECTURE:




STACKED MODEL ARCHITECTURE:



HYPERPARAMETER CONFIGURATION
Max Sequence Length: 100
Embedding Dimension: 100
LSTM Units: 64
Dropout Rate: 0.3
Learning Rate: 0.001
Number of Classes: 4
Vocabulary Size: 48364


# [Task 5 - 1.0 points] Training and Evaluation

You are now tasked to train and evaluate the Baseline and Stacked models.



### Instructions

* Pick **at least** three seeds for robust estimation.
* Train **all** models on the train set.
* Evaluate **all** models on the validation and test sets.
* Compute macro F1-score, precision, and recall metrics on the validation set.
* Report average and standard deviation measures over seeds for each metric.
* Pick the **best** performing model according to the observed validation set performance (use macro F1-score).

In [32]:
def set_seeds(seed):
    """Set random seeds for reproducibility"""
    # Set Python's random seed
    random.seed(seed)
    
    # Set NumPy's random seed
    np.random.seed(seed)
    
    # Set TensorFlow's random seed
    tf.random.set_seed(seed)

In [None]:
SEEDS = [42, 53, 82]
EPOCHS = 50
BATCH_SIZE = 512
models = {}

for seed in SEEDS:
    print("\n" + "="*70)
    print(f"TRAINING WITH SEED: {seed}")
    print("="*70)
    
    # Set seeds for reproducibility
    set_seeds(seed)
    
    models[seed] = {}
    
    # Re-create models for each seed
    models[seed]['baseline_model'] = create_baseline_model(
    vocab_size=len(vocab),
    embedding_dim=EMBEDDING_DIM,
    max_seq_length=MAX_SEQ_LENGTH,
    embedding_matrix=embedding_matrix,
    lstm_units=LSTM_UNITS,
    dropout_rate=DROPOUT_RATE,
    num_classes=NUM_CLASSES,
    learning_rate=LEARNING_RATE
)

    models[seed]['stacked_model'] = create_stacked_model(
        vocab_size=len(vocab),
        embedding_dim=EMBEDDING_DIM,
        max_seq_length=MAX_SEQ_LENGTH,
        embedding_matrix=embedding_matrix,
        lstm_units=LSTM_UNITS,
        dropout_rate=DROPOUT_RATE,
        num_classes=NUM_CLASSES,
        learning_rate=LEARNING_RATE
    )
    
    # Train Baseline model
    print("\nTraining Baseline Model...")
    models[seed]['baseline_model'].fit(train_padded, y_train,
                       validation_data=(val_padded, y_val),
                       epochs=EPOCHS,
                       batch_size=BATCH_SIZE)
    
    # Train Stacked model
    print("\nTraining Stacked Model...")
    models[seed]['stacked_model'].fit(train_padded, y_train,
                      validation_data=(val_padded, y_val),
                      epochs=EPOCHS,
                      batch_size=BATCH_SIZE)


TRAINING WITH SEED: 42





Training Baseline Model...
Epoch 1/20
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m8s[0m 530ms/step - accuracy: 0.6449 - loss: 1.1698 - val_accuracy: 0.7739 - val_loss: 0.9257
Epoch 2/20
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 310ms/step - accuracy: 0.7861 - loss: 0.8222 - val_accuracy: 0.7826 - val_loss: 0.7978
Epoch 3/20
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 306ms/step - accuracy: 0.7870 - loss: 0.7484 - val_accuracy: 0.7826 - val_loss: 0.7976
Epoch 4/20
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 304ms/step - accuracy: 0.7870 - loss: 0.7253 - val_accuracy: 0.7826 - val_loss: 0.7876
Epoch 5/20
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 298ms/step - accuracy: 0.7870 - loss: 0.7039 - val_accuracy: 0.7826 - val_loss: 0.7937
Epoch 6/20
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 306ms/step - accuracy: 0.7870 - loss: 0.6935 - val_accuracy: 0.7826 - val_loss: 0.7899
Epoch 7/20


In [72]:
from sklearn.metrics import f1_score, precision_score, recall_score

# Compute metrics on validation set for each model and seed
val_metrics = {}
val_metrics = {"baseline_model": {}, "stacked_model": {}}

for seed in SEEDS:
    # Baseline Model Validation Metrics
    val_metrics["baseline_model"][seed] = {}
    
    baseline_pred = models[seed]['baseline_model'].predict(val_padded, verbose=0)
    baseline_pred_classes = np.argmax(baseline_pred, axis=1)
    val_metrics["baseline_model"][seed]["precision"] = precision_score(y_val, baseline_pred_classes, average='macro', zero_division=0)
    val_metrics["baseline_model"][seed]["recall"] = recall_score(y_val, baseline_pred_classes, average='macro', zero_division=0)
    val_metrics["baseline_model"][seed]["f1"] = f1_score(y_val, baseline_pred_classes, average='macro', zero_division=0)
    
    # Stacked Model Validation Metrics
    val_metrics["stacked_model"][seed] = {}
    
    stacked_pred = models[seed]['stacked_model'].predict(val_padded, verbose=0)
    stacked_pred_classes = np.argmax(stacked_pred, axis=1)
    val_metrics["stacked_model"][seed]["precision"] = precision_score(y_val, stacked_pred_classes, average='macro', zero_division=0)
    val_metrics["stacked_model"][seed]["recall"] = recall_score(y_val, stacked_pred_classes, average='macro', zero_division=0)
    val_metrics["stacked_model"][seed]["f1"] = f1_score(y_val, stacked_pred_classes, average='macro', zero_division=0)

# Compute average and std dev across seeds
print("\n" + "="*70)
print("AVERAGE METRICS ACROSS SEEDS")
print("="*70)

baseline_f1_scores = [val_metrics["baseline_model"][seed]['f1'] for seed in SEEDS]
baseline_precision_scores = [val_metrics["baseline_model"][seed]['precision'] for seed in SEEDS]
baseline_recall_scores = [val_metrics["baseline_model"][seed]['recall'] for seed in SEEDS]

stacked_f1_scores = [val_metrics["stacked_model"][seed]['f1'] for seed in SEEDS]
stacked_precision_scores = [val_metrics["stacked_model"][seed]['precision'] for seed in SEEDS]
stacked_recall_scores = [val_metrics["stacked_model"][seed]['recall'] for seed in SEEDS]

print("\nBaseline Model:")
print(f"  F1-Score (macro):     {np.mean(baseline_f1_scores):.4f} ± {np.std(baseline_f1_scores):.4f}")
print(f"  Precision (macro):    {np.mean(baseline_precision_scores):.4f} ± {np.std(baseline_precision_scores):.4f}")
print(f"  Recall (macro):       {np.mean(baseline_recall_scores):.4f} ± {np.std(baseline_recall_scores):.4f}")

print("\nStacked Model:")
print(f"  F1-Score (macro):     {np.mean(stacked_f1_scores):.4f} ± {np.std(stacked_f1_scores):.4f}")
print(f"  Precision (macro):    {np.mean(stacked_precision_scores):.4f} ± {np.std(stacked_precision_scores):.4f}")
print(f"  Recall (macro):       {np.mean(stacked_recall_scores):.4f} ± {np.std(stacked_recall_scores):.4f}")

# Determine best model
best_baseline_f1 = np.max(baseline_f1_scores)
best_stacked_f1 = np.max(stacked_f1_scores)
best_model_type = 'Baseline' if best_baseline_f1 >= best_stacked_f1 else 'Stacked'
best_f1_score = max(best_baseline_f1, best_stacked_f1)

print("\n" + "="*70)
print(f"BEST MODEL: {best_model_type} (Macro F1-Score: {best_f1_score:.4f})")
print("="*70)


AVERAGE METRICS ACROSS SEEDS

Baseline Model:
  F1-Score (macro):     0.2397 ± 0.0296
  Precision (macro):    0.2528 ± 0.0812
  Recall (macro):       0.2610 ± 0.0175

Stacked Model:
  F1-Score (macro):     0.3283 ± 0.0524
  Precision (macro):    0.4037 ± 0.0572
  Recall (macro):       0.3229 ± 0.0450

BEST MODEL: Stacked (Macro F1-Score: 0.3994)


# [Task 6 - 1.0 points] Transformers

In this section, you will use a transformer model specifically trained for hate speech detection, namely [Twitter-roBERTa-base for Hate Speech Detection](https://huggingface.co/cardiffnlp/twitter-roberta-base-hate).




### Relevant Material
- Tutorial 3

### Instructions
- **Load the Tokenizer and Model**

- **Preprocess the Dataset**:
   You will need to preprocess your dataset to prepare it for input into the model. Tokenize your text data using the appropriate tokenizer and ensure it is formatted correctly.

- **Train the Model**:
   Use the `Trainer` to train the model on your training data.

- **Evaluate the Model on the Test Set** using the same metrics used for LSTM-based models.

# [Task 7 - 0.5 points] Error Analysis

After evaluating the model, perform a brief error analysis on the **test set**:

### Instructions

 - Review the results and identify common errors.

 - Summarize your findings regarding the errors and their impact on performance (e.g. but not limited to Out-of-Vocabulary (OOV) words, data imbalance, and performance differences between the custom model and the transformer...)
 - Suggest possible solutions to address the identified errors.

# [Task 8 - 0.5 points] Report

Wrap up your experiment in a short report (up to 2 pages).

### Instructions

* Use the NLP course report template.
* Summarize each task in the report following the provided template.

### Recommendations

The report is **not a copy-paste** of graphs, tables, and command outputs.

* Summarize classification performance in Table format.
* **Do not** report command outputs or screenshots.
* Report learning curves in Figure format.
* The error analysis section should summarize your findings.


# Submission

* **Submit** your report in PDF format.
* **Submit** your python notebook.
* Make sure your notebook is **well organized**, with no temporary code, commented sections, tests, etc...
* You can upload **model weights** in a cloud repository and report the link in the report.

## Bonus Points
Bonus points are arbitrarily assigned based on significant contributions such as:
- Outstanding error analysis
- Masterclass code organization
- Suitable extensions

**Note**: bonus points are only assigned if all task points are attributed (i.e., 6/6).

**Possible Suggestions for Bonus Points:**
- **Try other preprocessing strategies**: e.g., but not limited to, explore techniques tailored specifically for tweets or  methods that are common in social media text.
- **Experiment with other custom architectures or models from HuggingFace**
- **Explore Spanish tweets**: e.g., but not limited to, leverage multilingual models to process Spanish tweets and assess their performance compared to monolingual models.

# FAQ

Please check this frequently asked questions before contacting us

### Trainable Embeddings

You are **free** to define a trainable or non-trainable Embedding layer to load the GloVe embeddings.

### Model architecture

You **should not** change the architecture of a model (i.e., its layers).

However, you are **free** to play with their hyper-parameters.


### Neural Libraries

You are **free** to use any library of your choice to implement the networks (e.g., Keras, Tensorflow, PyTorch, JAX, etc...)

### Robust Evaluation

Each model is trained with at least 3 random seeds.

Task 5 requires you to compute the average performance over the 3 seeds and its corresponding standard deviation.

### Expected Results

Task 2 leaderboard reports around 40-50 F1-score.
However, note that they perform a hierarchical classification.

That said, results around 30-40 F1-score are **expected** given the task's complexity.

### Model Selection for Analysis

To carry out the error analysis you are **free** to either

* Pick examples or perform comparisons with an individual seed run model (e.g., Baseline seed 1337)
* Perform ensembling via, for instance, majority voting to obtain a single model.

### Error Analysis

Some topics for discussion include:
   * Precision/Recall curves.
   * Confusion matrices.
   * Specific misclassified samples.


# The End

Feel free to reach out for questions/doubts!