# Assignment 1
**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: Sexism Detection, Multi-class Classification, RNNs, Transformers, Huggingface



# Contact
For any doubts, questions, or issues, you can contact us at the following email addresses:

Teaching Assistants:

- Federico Ruggeri -> federico.ruggeri6@unibo.it
- Eleonora Mancini -> e.mancini@unibo.it

Professor:
- Paolo Torroni -> p.torroni@unibo.it

# Introduction
You are asked to address the [EXIST 2023 Task 1](https://clef2023.clef-initiative.eu/index.php?page=Pages/labs.html#EXIST) on sexism detection.

## Problem Definition
The systems have to decide whether or not a given tweet contains or describes sexist expressions or behaviors (i.e., it is sexist itself, describes a sexist situation or criticizes a sexist behavior).

### Examples:

**Text**: *Can’t go a day without women womening*

**Label**: Sexist

**Text**: *''Society's set norms! Happy men's day though!#weareequal''*

**Label**: Not sexist

# [Task 1 - 1.0 points] Corpus

We have prepared a small version of the EXIST dataset in our dedicated [GitHub repository](https://github.com/lt-nlp-lab-unibo/nlp-course-material/tree/main/2024-2025/Assignment%201/data).

Check the `A1/data` folder. It contains 3 `.json` files representing `training`, `validation` and `test` sets.

The three sets are slightly unbalanced, with a bias toward the `Non-sexist` class.



### Dataset Description
- The dataset contains tweets in both English and Spanish.
- There are labels for multiple tasks, but we are focusing on **Task 1**.
- For Task 1, soft labels are assigned by six annotators.
- The labels for Task 1 represent whether the tweet is sexist ("YES") or not ("NO").







### Example


    "203260": {
        "id_EXIST": "203260",
        "lang": "en",
        "tweet": "ik when mandy says “you look like a whore” i look cute as FUCK",
        "number_annotators": 6,
        "annotators": ["Annotator_473", "Annotator_474", "Annotator_475", "Annotator_476", "Annotator_477", "Annotator_27"],
        "gender_annotators": ["F", "F", "M", "M", "M", "F"],
        "age_annotators": ["18-22", "23-45", "18-22", "23-45", "46+", "46+"],
        "labels_task1": ["YES", "YES", "YES", "NO", "YES", "YES"],
        "labels_task2": ["DIRECT", "DIRECT", "REPORTED", "-", "JUDGEMENTAL", "REPORTED"],
        "labels_task3": [
          ["STEREOTYPING-DOMINANCE"],
          ["OBJECTIFICATION"],
          ["SEXUAL-VIOLENCE"],
          ["-"],
          ["STEREOTYPING-DOMINANCE", "OBJECTIFICATION"],
          ["OBJECTIFICATION"]
        ],
        "split": "TRAIN_EN"
      }
    }

### Instructions
1. **Download** the `A1/data` folder.
2. **Load** the three JSON files and encode them as pandas dataframes.
3. **Generate hard labels** for Task 1 using majority voting and store them in a new dataframe column called `hard_label_task1`. Items without a clear majority will be removed from the dataset.
4. **Filter the DataFrame** to keep only rows where the `lang` column is `'en'`.
5. **Remove unwanted columns**: Keep only `id_EXIST`, `lang`, `tweet`, and `hard_label_task1`.
6. **Encode the `hard_label_task1` column**: Use 1 to represent "YES" and 0 to represent "NO".
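
As a rough illustration of steps 3 and 6, the sketch below applies strict majority voting to a few annotation lists and then encodes the surviving labels. The first list is taken from the example record above; the other two are made up for illustration. The helper names `majority_vote` and `encode_label` are assumptions of this sketch, and treating a 3-3 tie as "no clear majority" is one possible reading of the instructions.

```python
from collections import Counter
from typing import List, Optional

def majority_vote(labels: List[str]) -> Optional[str]:
    """Return the label chosen by more than half of the annotators, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

def encode_label(label: str) -> int:
    """Encode "YES" as 1 and "NO" as 0."""
    return {"YES": 1, "NO": 0}[label]

votes = {
    "203260": ["YES", "YES", "YES", "NO", "YES", "YES"],  # from the example record
    "toy_1": ["NO", "NO", "NO", "NO", "YES", "NO"],       # made-up item
    "toy_2": ["YES", "NO", "YES", "NO", "YES", "NO"],     # made-up 3-3 tie
}

hard_labels = {item: majority_vote(v) for item, v in votes.items()}
# Items without a clear majority (None) are dropped before encoding.
encoded = {item: encode_label(l) for item, l in hard_labels.items() if l is not None}
print(encoded)
```

With six annotators, a strict majority requires at least four matching votes, so the tied item is discarded rather than assigned an arbitrary label.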

In [1]:
pip install emoji

Collecting emoji
  Downloading emoji-2.14.0-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.14.0-py3-none-any.whl (586 kB)
Installing collected packages: emoji
Successfully installed emoji-2.14.0


In [2]:
pip install pyspellchecker

Collecting pyspellchecker
  Downloading pyspellchecker-0.8.2-py3-none-any.whl.metadata (9.4 kB)
Downloading pyspellchecker-0.8.2-py3-none-any.whl (7.1 MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.8.2


In [3]:
pip install textblob



### 2. Loading of the three JSON files as dataframes

In [6]:
import pandas as pd

#data_train = pd.read_json('data/training.json')
#data_train = pd.read_json('content/training.json') #Lorenzo
data_train = pd.read_json('/content/training.json') #Diego


df_train = data_train.T
df_train.reset_index(inplace=True)
df_train.rename(columns={'index': 'id'}, inplace=True)

#data_test = pd.read_json('data/test.json')
#data_test = pd.read_json('content/test.json') #Lorenzo
data_test = pd.read_json('/content/test.json') #Diego

df_test = data_test.T
df_test.reset_index(inplace=True)
df_test.rename(columns={'index': 'id'}, inplace=True)

#data_val = pd.read_json('data/validation.json')
#data_val = pd.read_json('content/validation.json') #Lorenzo
data_val = pd.read_json('/content/validation.json') #Diego


df_val = data_val.T
df_val.reset_index(inplace=True)
df_val.rename(columns={'index': 'id'}, inplace=True)

# DataFrame displaying
#print(df_train.head())

### 3. Generate hard_label_task1

In [7]:
from collections import Counter

def majority_vote(labels):
    label_counts = Counter(labels)
    most_common = label_counts.most_common(1)
    if len(most_common) > 0 and most_common[0][1] > len(labels) / 2:
        return most_common[0][0]
    return None

df_train['hard_label_task1'] = df_train['labels_task1'].apply(majority_vote)
df_train = df_train.dropna(subset=['hard_label_task1'])
df_train.reset_index(drop=True, inplace=True)

df_test['hard_label_task1'] = df_test['labels_task1'].apply(majority_vote)
df_test = df_test.dropna(subset=['hard_label_task1'])
df_test.reset_index(drop=True, inplace=True)

df_val['hard_label_task1'] = df_val['labels_task1'].apply(majority_vote)
df_val = df_val.dropna(subset=['hard_label_task1'])
df_val.reset_index(drop=True, inplace=True)


### 4. Filter DataFrame

In [8]:
df_train = df_train[df_train['lang'] == 'en']
df_test = df_test[df_test['lang'] == 'en']
df_val = df_val[df_val['lang'] == 'en']

### 5. Remove Unwanted Columns

In [9]:
df_train = df_train[['id_EXIST', 'lang', 'tweet', 'hard_label_task1']]
df_test = df_test[['id_EXIST', 'lang', 'tweet', 'hard_label_task1']]
df_val = df_val[['id_EXIST', 'lang', 'tweet', 'hard_label_task1']]

### 6. hard_label_task1 encoding

In [10]:
# Map the string labels to integers ("YES" -> 1, "NO" -> 0)
df_train['hard_label_task1'] = df_train['hard_label_task1'].map({'YES': 1, 'NO': 0})
df_test['hard_label_task1'] = df_test['hard_label_task1'].map({'YES': 1, 'NO': 0})
df_val['hard_label_task1'] = df_val['hard_label_task1'].map({'YES': 1, 'NO': 0})



# [Task 2 - 0.5 points] Data Cleaning
In the context of tweets, we have noisy and informal data that often includes unnecessary elements like emojis, hashtags, mentions, and URLs. These elements may interfere with the text analysis.



### Instructions
- **Remove emojis** from the tweets.
- **Remove hashtags** (e.g., `#example`).
- **Remove mentions** such as `@user`.
- **Remove URLs** from the tweets.
- **Remove special characters and symbols**.
- **Remove specific quote characters** (e.g., curly quotes).
- **Perform lemmatization** to reduce words to their base form.

In [11]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize  # word_tokenize is used by lem_text below
from nltk.corpus import wordnet

nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

lemmatizer = WordNetLemmatizer()

def get_wordnet_key(tag):
    # Map a Penn Treebank POS tag to the corresponding WordNet POS constant
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default to noun, as WordNetLemmatizer does

#Lemmatize each token
def lem_text(text: str):
  tokens = word_tokenize(text)
  tagged = pos_tag(tokens)
  words = [lemmatizer.lemmatize(word, get_wordnet_key(tag)) for word, tag in tagged]
  return " ".join(words)

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


In [12]:
import re
import unicodedata

import emoji
import nltk
from emoji import replace_emoji
from nltk import word_tokenize
from textblob import TextBlob
from tqdm import tqdm


# necessary for being able to tokenize
nltk.download('punkt_tab')
nltk.download('punkt')

def correct_spelling(text):
  blob = TextBlob(text)
  corrected_text = blob.correct()
  return str(corrected_text)

def remove_style(text):
    # Apply NFKC normalization, which maps styled/compatibility characters
    # (e.g. mathematical bold or italic letters) to their plain equivalents
    return ''.join(
        c for c in unicodedata.normalize('NFKC', text)
        if not unicodedata.combining(c)  # Exclude combining marks
    )

def split_merge_word(text):
    # Use regex to find boundaries between lowercase and uppercase
    return re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', text)

def replace_space(text):
  return re.sub(r'\s+', ' ', text).strip()

# Function to clean and preprocess tweets
def clean_tweet(tweet):
    # Remove mentions (@user)
    tweet = re.sub(r'@\w+', ' ', tweet)
    # Remove hashtags (#example)
    tweet = re.sub(r'#\w+', ' ', tweet)
    # Remove URLs
    tweet = re.sub(r'http\S+|www\.\S+', ' ', tweet)
    # Remove special characters and symbols
    tweet = re.sub(r'[^\w\s]', ' ', tweet)
    # Remove emojis
    tweet = replace_emoji(tweet, ' ')
    # Remove specific quote characters (e.g., curly quotes)
    cleaned_tweet = tweet.replace('“', ' ').replace('”', ' ').replace('’', " ").replace("‘"," ").replace('"', " ").replace("'", " ")

    return cleaned_tweet

def clean_column_dataset(df_column):
  cleaned_tweets = []

  for tweet in tqdm(df_column):
    cleaned_tweet = clean_tweet(tweet)   #Clean the text
    lem_tweet = lem_text(cleaned_tweet) #Lemmatize the text
    lem_tweet_split = split_merge_word(lem_tweet) #Split words like "endYou"
    norm_tweet = remove_style(lem_tweet_split) #remove bold and italic style
    lowercase_tweet = norm_tweet.lower() #lower case the dataset
    cleaned_tweets.append(lowercase_tweet)  #Save the results
  return cleaned_tweets

df_train['cleaned_tweet'] = clean_column_dataset(df_train['tweet'])
df_val['cleaned_tweet'] = clean_column_dataset(df_val['tweet'])
df_test['cleaned_tweet'] = clean_column_dataset(df_test['tweet'])


print(df_train[['tweet', 'cleaned_tweet']].head())

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
100%|██████████| 2870/2870 [00:08<00:00, 330.99it/s]
100%|██████████| 158/158 [00:00<00:00, 611.48it/s]
100%|██████████| 286/286 [00:00<00:00, 725.81it/s]

                                                  tweet  \
3194  Writing a uni essay in my local pub with a cof...   
3195  @UniversalORL it is 2021 not 1921. I dont appr...   
3196  According to a customer I have plenty of time ...   
3197  So only 'blokes' drink beer? Sorry, but if you...   
3198  New to the shelves this week - looking forward...   

                                          cleaned_tweet  
3194  writing a uni essay in my local pub with a cof...  
3195  it be 2021 not 1921 i dont appreciate that on ...  
3196  according to a customer i have plenty of time ...  
3197  so only blokes drink beer sorry but if you are...  
3198  new to the shelf this week look forward to rea...  





In [13]:
df_train[['tweet', 'cleaned_tweet']].head(100)

| | tweet | cleaned_tweet |
|---|---|---|
| 3194 | Writing a uni essay in my local pub with a cof... | writing a uni essay in my local pub with a cof... |
| 3195 | @UniversalORL it is 2021 not 1921. I dont appr... | it be 2021 not 1921 i dont appreciate that on ... |
| 3196 | According to a customer I have plenty of time ... | according to a customer i have plenty of time ... |
| 3197 | So only 'blokes' drink beer? Sorry, but if you... | so only blokes drink beer sorry but if you are... |
| 3198 | New to the shelves this week - looking forward... | new to the shelf this week look forward to rea... |
| ... | ... | ... |
| 3289 | They can fight for Hijabs but not against Trip... | they can fight for hijabs but not against trip... |
| 3290 | The whiskey and cigars, the $99 dollar seminar... | the whiskey and cigars the 99 dollar seminar t... |
| 3291 | I would be glad to see the violence on Twitter... | i would be glad to see the violence on twitter... |
| 3292 | @ProteanRedux @HugoThePinkCat @DeclarationOn @... | action aid conduct a survey on street harassme... |


# [Task 3 - 0.5 points] Text Encoding
To train a neural sexism classifier, you first need to encode text into numerical format.




### Instructions

* Embed words using **GloVe embeddings**.
* You are **free** to pick any embedding dimension.





In [14]:
from typing import Dict, List, Tuple
from collections import OrderedDict
from tqdm import tqdm
import pandas as pd

def build_vocabulary(df: pd.DataFrame) -> Tuple[Dict[int, str], Dict[str, int], List[str]]:
    """
    Given a dataset, builds the corresponding word vocabulary.

    :param df: dataset from which we want to build the word vocabulary (pandas.DataFrame)
    :return:
      - word vocabulary: vocabulary index to word
      - inverse word vocabulary: word to vocabulary index
      - word listing: set of unique terms that build up the vocabulary
    """
    idx_to_word = OrderedDict()
    word_to_idx = OrderedDict()

    curr_idx = 0
    for sentence in tqdm(df.cleaned_tweet.values):
        tokens = sentence.split()
        for token in tokens:
            if token not in word_to_idx:
                word_to_idx[token] = curr_idx
                idx_to_word[curr_idx] = token
                curr_idx += 1


    word_to_idx["[UNK]"] = curr_idx
    idx_to_word[curr_idx] = '[UNK]'

    word_listing = list(idx_to_word.values())
    return idx_to_word, word_to_idx, word_listing

In [13]:
#df_train['text'] = df_train['cleaned_tweet']
#df_test['text'] = df_test['cleaned_tweet']
#df_val['text'] = df_val['cleaned_tweet']

In [15]:
idx_to_word_train, word_to_idx_train, word_listing_train = build_vocabulary(df_train)
idx_to_word_test, word_to_idx_test, word_listing_test = build_vocabulary(df_test)
idx_to_word_val, word_to_idx_val, word_listing_val = build_vocabulary(df_val)

100%|██████████| 2870/2870 [00:00<00:00, 160497.75it/s]
100%|██████████| 286/286 [00:00<00:00, 102125.91it/s]
100%|██████████| 158/158 [00:00<00:00, 72323.48it/s]


In [16]:
def evaluate_vocabulary(idx_to_word: Dict[int, str], word_to_idx: Dict[str, int],
                        word_listing: List[str], df: pd.DataFrame, check_default_size: bool = False):
    print("[Vocabulary Evaluation] Size checking...")
    assert len(idx_to_word) == len(word_to_idx)
    assert len(idx_to_word) == len(word_listing)

    print("[Vocabulary Evaluation] Content checking...")
    for i in tqdm(range(0, len(idx_to_word))):
        assert idx_to_word[i] in word_to_idx
        assert word_to_idx[idx_to_word[i]] == i

    print("[Vocabulary Evaluation] Consistency checking...")
    _, _, first_word_listing = build_vocabulary(df)
    _, _, second_word_listing = build_vocabulary(df)
    assert first_word_listing == second_word_listing

    print("[Vocabulary Evaluation] Toy example checking...")
    toy_df = pd.DataFrame.from_dict({
        'cleaned_tweet': ["all that glitters is not gold", "all in all i like this assignment"]
    })
    _, _, toy_word_listing = build_vocabulary(toy_df)
    toy_valid_vocabulary = set(' '.join(toy_df.cleaned_tweet.values).split())

    toy_valid_vocabulary.add("[UNK]")
    #print(toy_valid_vocabulary)
    #print(toy_word_listing)
    assert set(toy_word_listing) == toy_valid_vocabulary

In [17]:
print("Vocabulary evaluation...")
evaluate_vocabulary(idx_to_word_train, word_to_idx_train, word_listing_train, df_train)
print("Evaluation completed!")

Vocabulary evaluation...
[Vocabulary Evaluation] Size checking...
[Vocabulary Evaluation] Content checking...


100%|██████████| 9388/9388 [00:00<00:00, 1395079.75it/s]


[Vocabulary Evaluation] Consistency checking...


100%|██████████| 2870/2870 [00:00<00:00, 160851.62it/s]
100%|██████████| 2870/2870 [00:00<00:00, 188273.65it/s]


[Vocabulary Evaluation] Toy example checking...


100%|██████████| 2/2 [00:00<00:00, 20410.24it/s]

Evaluation completed!





In [18]:
import gensim
import gensim.downloader as gloader

def load_embedding_glove_model(embedding_dimension: int = 50) -> gensim.models.keyedvectors.KeyedVectors:
    download_path = "glove-wiki-gigaword-{}".format(embedding_dimension)
    try:
        emb_model = gloader.load(download_path)
    except ValueError as e:
        print("Invalid embedding model name! Check the embedding dimension:")
        print("Glove: 50, 100, 200, 300")
        raise e

    return emb_model

In [19]:
embedding_model = load_embedding_glove_model(embedding_dimension=50)



### Note: What about OOV tokens?
   * All the tokens in the **training** set that are not in GloVe **must** be added to the vocabulary.
   * For the remaining tokens (i.e., OOV in the validation and test sets), you have to assign them a **special token** (e.g., [UNK]) and a **static** embedding.
   * You are **free** to define the static embedding using any strategy (e.g., random, neighbourhood, etc...)



### More about OOV

For a given token:

* **If in train set**: add to vocabulary and assign an embedding (use GloVe if token in GloVe, custom embedding otherwise).
* **If in val/test set**: assign special token if not in vocabulary and assign custom embedding.

Your vocabulary **should**:

* Contain all tokens in train set; or
* Union of tokens in train set and in GloVe $\rightarrow$ we make use of existing knowledge!
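
The rules above can be sketched on toy data as follows. This is a minimal illustration, not the notebook's own pipeline: the helper names `build_train_vocabulary` and `replace_unknown` and the toy sentences are assumptions, and only the first vocabulary option (all training tokens plus one `[UNK]` entry) is shown.

```python
from typing import Dict, List

UNK = "[UNK]"

def build_train_vocabulary(train_sentences: List[str]) -> Dict[str, int]:
    """Assign an id to every training token, then reserve one id for [UNK]."""
    word_to_idx: Dict[str, int] = {}
    for sentence in train_sentences:
        for token in sentence.split():
            # setdefault only assigns a new id the first time a token is seen
            word_to_idx.setdefault(token, len(word_to_idx))
    word_to_idx[UNK] = len(word_to_idx)
    return word_to_idx

def replace_unknown(sentence: str, word_to_idx: Dict[str, int]) -> str:
    """Replace tokens that are not in the vocabulary with [UNK]."""
    return " ".join(t if t in word_to_idx else UNK for t in sentence.split())

train = ["all that glitters is not gold", "all in all i like this assignment"]
vocab = build_train_vocabulary(train)

# A validation/test sentence: "shiny" never appeared in training.
val_sentence = replace_unknown("i like shiny gold", vocab)
print(val_sentence)
```

Every training token keeps its own id (and later its own embedding, taken from GloVe when available), while unseen validation or test tokens all share the single `[UNK]` id.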

In [20]:
def check_OOV_terms(embedding_model: gensim.models.keyedvectors.KeyedVectors,
                    word_listing: List[str]):
    """
    Checks differences between pre-trained embedding model vocabulary
    and dataset specific vocabulary in order to highlight out-of-vocabulary terms.

    :param embedding_model: pre-trained word embedding model (gensim wrapper)
    :param word_listing: dataset specific vocabulary (list)

    :return
        - list of OOV terms
    """
    embedding_vocabulary = set(embedding_model.key_to_index.keys())
    oov = set(word_listing).difference(embedding_vocabulary)
    return list(oov)

In [21]:
oov_terms_train = check_OOV_terms(embedding_model, word_listing_train)
oov_percentage_train = float(len(oov_terms_train)) * 100 / len(word_listing_train)
print(f"Total OOV terms: {len(oov_terms_train)} ({oov_percentage_train:.2f}%)")

oov_terms_test = check_OOV_terms(embedding_model, word_listing_test)
oov_percentage_test = float(len(oov_terms_test)) * 100 / len(word_listing_test)
print(f"Total OOV terms: {len(oov_terms_test)} ({oov_percentage_test:.2f}%)")

oov_terms_val = check_OOV_terms(embedding_model, word_listing_val)
oov_percentage_val = float(len(oov_terms_val)) * 100 / len(word_listing_val)
print(f"Total OOV terms: {len(oov_terms_val)} ({oov_percentage_val:.2f}%)")

Total OOV terms: 880 (9.37%)
Total OOV terms: 102 (4.72%)
Total OOV terms: 68 (4.42%)


In [22]:
def substitute_oov_tweet(column, oov_terms, new_token):
  # Replace every token that appears in oov_terms with new_token
  oov_set = set(oov_terms)  # set membership avoids scanning the list for each token
  new_col = []
  for sentence in tqdm(column.values):
    new_seq = [new_token if token in oov_set else token for token in sentence.split()]
    new_col.append(' '.join(new_seq))
  return new_col


Substitute every OOV token in the validation and test sets with the special token `[UNK]`.

In [23]:
df_val["cleaned_tweet"] = substitute_oov_tweet(df_val["cleaned_tweet"], oov_terms_val, "[UNK]")
df_test["cleaned_tweet"] = substitute_oov_tweet(df_test["cleaned_tweet"], oov_terms_test, "[UNK]")

100%|██████████| 158/158 [00:00<00:00, 25417.10it/s]
100%|██████████| 286/286 [00:00<00:00, 23706.94it/s]


In [24]:
import numpy as np
from gensim.models import KeyedVectors

def add_oov_terms_with_batches(embedding_model: KeyedVectors, oov_terms: List[str], vector_size: int = None, batch_size: int = 1000):
    vector_size = vector_size or embedding_model.vector_size

    # Create a new KeyedVectors object
    new_kv = KeyedVectors(vector_size)

    # Prepare data for batch addition
    words = list(embedding_model.key_to_index.keys()) + oov_terms
    vectors = [embedding_model[word] for word in embedding_model.key_to_index] + [np.random.uniform(-0.1, 0.1, vector_size) for _ in oov_terms]

    # Add vectors in batches
    for i in range(0, len(words), batch_size):
        batch_words = words[i:i + batch_size]
        batch_vectors = vectors[i:i + batch_size]
        new_kv.add_vectors(batch_words, batch_vectors)

    return new_kv

# Example usage:
vector_size = embedding_model.vector_size
extended_model = add_oov_terms_with_batches(embedding_model, oov_terms_train, vector_size, batch_size=1000)

In [None]:
"""
# Assuming you have a KeyedVectors object `embedding_model` and a list `oov_terms`
extended_model = add_oov_terms_with_batches(embedding_model, oov_terms_train, batch_size=1000)

# Verify the size of the new vocabulary
print("Extended vocabulary size:", len(extended_model.key_to_index))
"""



In [None]:
"""
import numpy as np
import gensim

def assign_static_embeddings(oov_terms, embedding_dim):
    oov_to_token = {term: f"[UNK]" for i, term in enumerate(oov_terms)}

    np.random.seed(42)  # For reproducibility
    static_embeddings = {
        token: np.random.uniform(-0.1, 0.1, embedding_dim)
        for token in oov_to_token.values()
    }

    return oov_to_token, static_embeddings

embedding_dim = embedding_model.vector_size

special_token_test, static_embedding_test = assign_static_embeddings(oov_terms_test, embedding_dim)
special_token_val, static_embedding_val = assign_static_embeddings(oov_terms_val, embedding_dim)
"""


# [Task 4 - 1.0 points] Model definition

You are now tasked to define your sexism classifier.




### Instructions

* **Baseline**: implement a Bidirectional LSTM with a Dense layer on top.
* You are **free** to experiment with hyper-parameters to define the baseline model.

* **Model 1**: add an additional LSTM layer to the Baseline model.

In [25]:
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras
from keras.optimizers import AdamW
from keras.regularizers import l2
from keras.models import Sequential
from keras.layers import (
    BatchNormalization,
    Bidirectional,
    Dense,
    Dropout,
    Embedding,
    LSTM,
)

np.random.seed(42)

def getBaselineModel(vocab_size, embedding_dimension, embedding_matrix, n_units = 128):
    model = Sequential([
        Embedding(input_dim=vocab_size, output_dim=embedding_dimension, weights=embedding_matrix, mask_zero=True, name='encoder_embedding_baseline', trainable = True),
        Bidirectional(LSTM(n_units, return_sequences=False)),
        Dense(1, activation='sigmoid', kernel_regularizer=l2(0.05)),
        #TimeDistributed(Dense(units=len(-----), activation='softmax'), name = 'timedistr_dense_layer')),
    ])

    model.compile(loss='binary_crossentropy', optimizer=AdamW(learning_rate=0.0001), metrics=['accuracy'])
    return model

from tensorflow.keras.layers import Layer, Input, GlobalAveragePooling1D

class Attention(Layer):
    def __init__(self, **kwargs):
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        self.W = self.add_weight(name="attention_weight", shape=(input_shape[-1], 1), initializer="random_normal", trainable=True)
        self.b = self.add_weight(name="attention_bias", shape=(1,), initializer="zeros", trainable=True)
        super(Attention, self).build(input_shape)

    def call(self, inputs, **kwargs):
        scores = tf.nn.tanh(tf.matmul(inputs, self.W) + self.b)
        weights = tf.nn.softmax(scores, axis=1)
        output = tf.reduce_sum(inputs * weights, axis=1)
        return output

def getBaselineModel_mod(vocab_size, embedding_dimension, embedding_matrix):
    model = Sequential([
        Embedding(input_dim=vocab_size, output_dim=embedding_dimension, weights=embedding_matrix, mask_zero=True, name='encoder_embedding_baseline'),
        Bidirectional(LSTM(128, return_sequences=False)),
#        Attention(),
        Dropout(0.2),
        Dense(1, activation='sigmoid', kernel_regularizer=l2(0.0001)),
    ])

    model.compile(loss='binary_crossentropy', optimizer=AdamW(learning_rate=0.0005), metrics=['accuracy'])
    return model

def getModel1(vocab_size, embedding_dimension, embedding_matrix, n_units_1 = 128, n_units_2 = 64 ):
    model = Sequential([
        Embedding(input_dim=vocab_size, output_dim=embedding_dimension, weights=embedding_matrix, mask_zero=True, name='encoder_embedding_model1'),
        Bidirectional(LSTM(n_units_1, return_sequences=True)),
        Bidirectional(LSTM(n_units_2, return_sequences=False)),
        Dense(1, activation='sigmoid', kernel_regularizer=l2(0.05)),
    ])
    model.compile(loss='binary_crossentropy', optimizer=AdamW(learning_rate=0.0001), metrics=['accuracy'])
    return model

### Token to embedding mapping

You can follow two approaches for encoding tokens in your classifier.

### Work directly with embeddings

- Compute the embedding of each input token
- Feed the mini-batches of shape (batch_size, # tokens, embedding_dim) to your model

### Work with Embedding layer

- Encode input tokens to token ids
- Define an Embedding layer as the first layer of your model
- Compute the embedding matrix of all known tokens (i.e., tokens in your vocabulary)
- Initialize the Embedding layer with the computed embedding matrix
- You are **free** to set the Embedding layer trainable or not
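
The Embedding-layer approach can be sketched without any deep learning library: the embedding matrix is a table whose row `i` is the vector of the token with id `i`, and the layer simply looks rows up. The sketch below is a toy illustration under stated assumptions: `build_toy_embedding_matrix`, the three-dimensional vectors, and the tiny vocabulary are all hypothetical, plain Python lists stand in for NumPy arrays, and giving `[UNK]` a small random vector is only one of the allowed strategies.

```python
import random
from typing import Dict, List

def build_toy_embedding_matrix(word_to_idx: Dict[str, int],
                               pretrained: Dict[str, List[float]],
                               dim: int,
                               seed: int = 42) -> List[List[float]]:
    """Row i holds the vector of the token whose id is i."""
    rng = random.Random(seed)
    matrix = [[0.0] * dim for _ in range(len(word_to_idx))]
    for word, idx in word_to_idx.items():
        if word == "[PAD]":
            continue  # the padding row stays all zeros
        if word in pretrained:
            matrix[idx] = list(pretrained[word])
        else:
            # Token without a pre-trained vector (including [UNK]): small random vector
            matrix[idx] = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
    return matrix

word_to_idx = {"[PAD]": 0, "gold": 1, "womening": 2, "[UNK]": 3}
pretrained = {"gold": [0.1, 0.2, 0.3]}  # stand-in for GloVe vectors
matrix = build_toy_embedding_matrix(word_to_idx, pretrained, dim=3)

# What an Embedding layer does at lookup time: token ids -> rows of the matrix
token_ids = [1, 2, 0]  # "gold womening [PAD]"
embedded = [matrix[i] for i in token_ids]
print(embedded[0])
```

Feeding token ids plus a shared matrix keeps mini-batches small compared with feeding pre-computed embeddings, and it leaves the choice of fine-tuning the vectors to a single `trainable` flag.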

In [26]:
def build_embedding_matrix(embedding_model: gensim.models.keyedvectors.KeyedVectors,
                           embedding_dimension: int,
                           word_to_idx: Dict[str, int],
                           vocab_size: int,
                           oov_terms: List[str]) -> np.ndarray:
    """
    Builds the embedding matrix of a specific dataset given a pre-trained word embedding model

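    :param embedding_model: pre-trained word embedding model (gensim wrapper)
    :param embedding_dimension: size of each embedding vector (int)
    :param word_to_idx: vocabulary map (word -> index) (dict)
    :param vocab_size: size of the vocabulary
    :param oov_terms: list of OOV terms (list); currently unused, OOV words are detected when the lookup fails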

    :return
        - embedding matrix that assigns a high dimensional vector to each word in the dataset specific vocabulary (shape |V| x d)
    """
    embedding_matrix = np.zeros((vocab_size, embedding_dimension), dtype=np.float32)
    embedding_unk = np.random.uniform(low=-0.05, high=0.05, size=embedding_dimension)
    for word, idx in tqdm(word_to_idx.items()):
        if word == '[UNK]':
          embedding_matrix[idx] = np.ones(embedding_dimension, dtype = np.float32)
        elif word == '[PAD]':
          embedding_matrix[idx] = np.zeros(embedding_dimension, dtype = np.float32)
        else:
          try:
              embedding_vector = embedding_model[word]
          except (KeyError, TypeError):
              embedding_vector = np.random.uniform(low=-0.05, high=0.05, size=embedding_dimension)

          embedding_matrix[idx] = embedding_vector

    return embedding_matrix

In [27]:
import pandas as pd

def transform_name_to_words(df, column_name):
    """
    Splits the strings in the specified column into lists of words and
    stores the result in a new column called "split_col".

    Args:
        df (pd.DataFrame): Input DataFrame (modified in place).
        column_name (str): Name of the column to split.

    Returns:
        pd.DataFrame: The same DataFrame with the added "split_col" column.
    """
    df["split_col"] = df[column_name].apply(lambda x: x.split())
    return df



In [28]:
embedding_dimension = 50
new_word_to_idx = {'[PAD]': 0}

# Shift all other indices by 1
new_word_to_idx.update({word: idx + 1 for word, idx in word_to_idx_train.items()})

#Embedding matrix
embedding_matrix = build_embedding_matrix(extended_model, embedding_dimension, new_word_to_idx, len(new_word_to_idx), oov_terms_train)

100%|██████████| 9389/9389 [00:00<00:00, 446934.81it/s]


In [29]:
# Testing
#embedding_matrix = build_embedding_matrix(extended_model, embedding_dimension, word_to_idx_train, len(word_to_idx_train), oov_terms_train)
#print(f"Embedding matrix shape: {embedding_matrix.shape}")

In [30]:
embedding = tf.keras.layers.Embedding(input_dim=len(new_word_to_idx),
                                      output_dim=50,                    #embedding dimension
                                      weights=[embedding_matrix],
                                      mask_zero=True,                   # automatically masks padding tokens
                                      name='encoder_embedding',
                                      trainable = True)

### Padding

Pay attention to padding tokens!

Your model **should not** be penalized on those tokens.

#### How to?

There are two main ways.

However, their implementation depends on the neural library you are using.

- Embedding layer
- Custom loss to compute average cross-entropy on non-padding tokens only

**Note**: This is a **recommendation**, but we **do not penalize** for missing workarounds.
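
To make the idea concrete, the sketch below pads a few toy sequences of token ids and derives the boolean mask that marks real tokens. It is a plain-Python illustration under assumptions: `pad_post` and `padding_mask` are hypothetical helpers, id 0 is assumed to be reserved for padding, and the mask mirrors conceptually what an Embedding layer configured with `mask_zero=True` passes on to the following layers.

```python
from typing import List

PAD_ID = 0  # assumption: id 0 is reserved for the padding token

def pad_post(sequences: List[List[int]], max_len: int) -> List[List[int]]:
    """Truncate or right-pad every sequence of token ids to max_len."""
    padded = []
    for seq in sequences:
        kept = seq[:max_len]
        padded.append(kept + [PAD_ID] * (max_len - len(kept)))
    return padded

def padding_mask(padded: List[List[int]]) -> List[List[bool]]:
    """True for real tokens, False for padding positions."""
    return [[token_id != PAD_ID for token_id in seq] for seq in padded]

sequences = [[5, 2, 9], [7], [3, 3, 3, 3, 3]]
padded = pad_post(sequences, max_len=4)
mask = padding_mask(padded)
print(padded)
print(mask)
```

This only works if no real token is ever assigned id 0, which is why the vocabulary indices need to be shifted before `[PAD]` is inserted at position 0.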

In [31]:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def convert_tokens_to_indices(tokenized_sentences, word_to_idx, unk_token='[UNK]'):
    unk_index = word_to_idx.get(unk_token, 9387)  # Default to last token if UNK is not in dictionary
    sequences = [
        [word_to_idx.get(token, unk_index) for token in sentence]
        for sentence in tokenized_sentences
    ]
    return sequences

tokenized_sentences = df_train['cleaned_tweet'].tolist()  # Replace with your dataframe column
sequences = convert_tokens_to_indices(tokenized_sentences, word_to_idx_train)

df_len = transform_name_to_words(df_train, 'cleaned_tweet')
max_sequence_length = len(max(df_len["split_col"].tolist(), key = len))

#max_sequence_length = len(max(df_train["cleaned_tweet"].tolist(), key = len))  # Adjust based on your dataset or experiment with different lengths
padded_sequences = pad_sequences(sequences, maxlen=max_sequence_length, padding='post', truncating='post')

print("Shape of padded sequences:", np.shape(padded_sequences))

tokenized_sentences_val = df_val['cleaned_tweet'].tolist()  # Replace with your dataframe column
sequences_val = convert_tokens_to_indices(tokenized_sentences_val, word_to_idx_train)

# Pad the validation sequences to the same length as the training sequences
#max_sequence_length = len(max(df_train["cleaned_tweet"].tolist(), key = len))  # Adjust based on your dataset or experiment with different lengths
padded_sequences_val = pad_sequences(sequences_val, maxlen=max_sequence_length, padding='post', truncating='post')

# Now padded_sequences is ready to be used as input for training
print("Shape of padded sequences:", np.shape(padded_sequences_val))

Shape of padded sequences: (2870, 64)
Shape of padded sequences: (158, 64)


# [Task 5 - 1.0 points] Training and Evaluation

You are now tasked to train and evaluate the Baseline and Model 1.



### Instructions

* Train **all** models on the train set.
* Evaluate **all** models on the validation set.
* Compute metrics on the validation set.
* Pick **at least** three seeds for robust estimation.
* Pick the **best** performing model according to the observed validation set performance.
* Evaluate your models using macro F1-score.
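
The seed-averaging step above can be sketched as follows. This is only an illustration in NumPy: `train_and_predict` is a hypothetical callable that trains a model with the given seed and returns its hard predictions on the validation set, and the seed values are arbitrary placeholders.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Macro F1: unweighted mean of the per-class F1 scores."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for cls in np.unique(np.concatenate([y_true, y_pred])):
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

def evaluate_over_seeds(train_and_predict, y_val, seeds=(42, 1337, 2024)):
    """Run one training per seed and aggregate the validation macro F1.

    `train_and_predict(seed)` is a hypothetical function returning the
    predicted labels for the validation set.
    """
    scores = [macro_f1(y_val, train_and_predict(seed)) for seed in seeds]
    return float(np.mean(scores)), float(np.std(scores))
```

Reporting the mean together with the standard deviation makes it easier to tell whether a gap between two models is larger than the run-to-run variation.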

In [32]:
# Import every callback from the same Keras namespace to avoid mixing `keras` and `tensorflow.keras`
from tensorflow.keras.callbacks import EarlyStopping, Callback, ReduceLROnPlateau, ModelCheckpoint
from sklearn.metrics import f1_score

# Assuming X_train, X_val, y_train, y_val are already prepared
X_train = padded_sequences # Sequences of token indices for training
y_train = df_train['hard_label_task1']  # Binary labels for training
X_val = padded_sequences_val  # Sequences of token indices for validation
y_val = df_val['hard_label_task1']    # Binary labels for validation

# Custom callback to compute F1 score
class F1ScoreCallback(Callback):
    def __init__(self, validation_data):
        super().__init__()  # initialize the base Callback state before adding our own
        self.validation_data = validation_data

    def on_epoch_end(self, epoch, logs=None):
        val_data, val_labels = self.validation_data
        val_predictions = (self.model.predict(val_data) > 0.5).astype(int)  # Binarize predictions (for binary classification)
        f1 = f1_score(val_labels, val_predictions, average='binary')  # F1 of the positive class only; pass average='macro' for the macro F1 required by the task
        print(f" — val_f1: {f1:.4f}")  # Print F1 score

# Define hyperparameters
batch_size = 32
epochs = 100
early_stopping = EarlyStopping(monitor='val_loss', patience=7, restore_best_weights=True)
#lr_scheduler = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, verbose=0)
#checkpoint = ModelCheckpoint('best_model.keras', monitor='val_loss', save_best_only=True, verbose=0)
f1_callback = F1ScoreCallback(validation_data=(X_val, y_val))


# Get BaselineModel
vocab_size = len(new_word_to_idx)
baseline_model = getBaselineModel(vocab_size, embedding_dimension, [embedding_matrix])

# Train BaselineModel
print("Baseline model: ")
history_baseline = baseline_model.fit(
    X_train,
    y_train,
    validation_data=(X_val, y_val),
    batch_size=batch_size,
    epochs=epochs,
    callbacks=[early_stopping, f1_callback]
)

# Get Model1
model1 = getModel1(vocab_size, embedding_dimension, [embedding_matrix])

# Train Model1
print("\nModel 1: ")
history_model1 = model1.fit(
    X_train,
    y_train,
    validation_data=(X_val, y_val),
    batch_size=32,
    epochs=50,
    callbacks=[early_stopping, f1_callback]
)

Baseline model: 
Epoch 1/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 56ms/step
 — val_f1: 0.0000
90/90 ━━━━━━━━━━━━━━━━━━━━ 7s 21ms/step - accuracy: 0.5801 - loss: 0.7764 - val_accuracy: 0.5696 - val_loss: 0.7885
Epoch 2/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
 — val_f1: 0.0000
90/90 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - accuracy: 0.5931 - loss: 0.7587 - val_accuracy: 0.5633 - val_loss: 0.7885
Epoch 3/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
 — val_f1: 0.1538
90/90 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step - accuracy: 0.6075 - loss: 0.7402 - val_accuracy: 0.5127 - val_loss: 0.7726
Epoch 4/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
 — val_f1: 0.1067
90/90 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - accuracy: 0.6172 - loss: 0.7271 - val_accuracy: 0

In [33]:
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score

# Get predictions on the validation set
y_val_pred = baseline_model.predict(X_val)
y_val_pred = (y_val_pred > 0.5).astype(int)  # Convert probabilities to binary labels


# Classification report
print("\nClassification Report:\n")
print(classification_report(y_val, y_val_pred))

5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step

Classification Report:

              precision    recall  f1-score   support

           0       0.65      0.83      0.73        90
           1       0.64      0.40      0.49        68

    accuracy                           0.65       158
   macro avg       0.64      0.62      0.61       158
weighted avg       0.64      0.65      0.63       158



# [Task 6 - 1.0 points] Transformers

In this section, you will use a transformer model specifically trained for hate speech detection, namely [Twitter-roBERTa-base for Hate Speech Detection](https://huggingface.co/cardiffnlp/twitter-roberta-base-hate).




### Relevant Material
- Tutorial 3

### Instructions
1. **Load the Tokenizer and Model**

2. **Preprocess the Dataset**:
   You will need to preprocess your dataset to prepare it for input into the model. Tokenize your text data using the appropriate tokenizer and ensure it is formatted correctly.

   **Note**: Use the cleaned plain text obtained after the initial cleaning process, not the version you tokenized before. The transformer's own tokenizer has to be applied to untokenized strings.

3. **Train the Model**:
   Use the `Trainer` to train the model on your training data.

4. **Evaluate the Model on the Test Set** using F1-macro.

In [34]:
!huggingface-cli login



In [35]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-hate")
model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-hate")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/700 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/150 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/499M [00:00<?, ?B/s]

In [36]:
!pip install datasets


In [37]:
from datasets import Dataset

train_data = Dataset.from_pandas(df_train)
test_data = Dataset.from_pandas(df_test)
val_data = Dataset.from_pandas(df_val)

In [40]:
def preprocess_text(texts):
    return tokenizer(texts['tweet'], truncation=True)

train_data = train_data.map(preprocess_text, batched=True)
test_data = test_data.map(preprocess_text, batched=True)
val_data = val_data.map(preprocess_text, batched=True)

Map:   0%|          | 0/2870 [00:00<?, ? examples/s]

Map:   0%|          | 0/286 [00:00<?, ? examples/s]

Map:   0%|          | 0/158 [00:00<?, ? examples/s]

In [42]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [44]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('cardiffnlp/twitter-roberta-base-hate',
                                                           num_labels=2,
                                                           id2label={0: 'NEG', 1: 'POS'},
                                                           label2id={'NEG': 0, 'POS': 1})

In [69]:
from sklearn.metrics import f1_score, accuracy_score

def compute_metrics(output_info):
    predictions, labels = output_info
    predictions = np.argmax(predictions, axis=-1)

    f1 = f1_score(y_pred=predictions, y_true=labels, average='macro')
    acc = accuracy_score(y_pred=predictions, y_true=labels)
    return {'f1': f1, 'acc': acc}

In [47]:
!pip install evaluate



In [61]:
import evaluate

acc_metric = evaluate.load('accuracy')
f1_metric = evaluate.load('f1')

def compute_metrics(output_info):
    predictions, labels = output_info
    predictions = np.argmax(predictions, axis=-1)
    f1 = f1_metric.compute(predictions=predictions, references=labels, average='macro')
    acc = acc_metric.compute(predictions=predictions, references=labels)
    return {**f1, **acc}


In [62]:
print(train_data)

Dataset({
    features: ['id_EXIST', 'lang', 'tweet', 'label', 'cleaned_tweet', 'split_col', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 2870
})


In [67]:
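# Trainer looks for the target in a column called 'label', so rename it in every split
train_data = train_data.rename_column('hard_label_task1', 'label')
val_data = val_data.rename_column('hard_label_task1', 'label')
test_data = test_data.rename_column('hard_label_task1', 'label')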

In [64]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="test_dir",                 # where to save model
    learning_rate=2e-5,
    per_device_train_batch_size=8,         # accelerate defines distributed training
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="epoch",           # when to report evaluation metrics/losses
    save_strategy="epoch",                 # when to save checkpoint
    load_best_model_at_end=True,
    report_to='none'                       # disabling wandb (default)
)



In [71]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
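    eval_dataset=val_data,  # per-epoch evaluation on the validation split (it needs a 'label' column too); keep test_data for the final evaluation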
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)



In [72]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,No log


KeyError: "The `metric_for_best_model` training argument is set to 'eval_loss', which is not found in the evaluation metrics. The available evaluation metrics are: []. Consider changing the `metric_for_best_model` via the TrainingArguments."

# [Task 7 - 0.5 points] Error Analysis

### Instructions

After evaluating the model, perform a brief error analysis:

 - Review the results and identify common errors.

 - Summarize your findings regarding the errors and their impact on performance (e.g., but not limited to, Out-of-Vocabulary (OOV) words, data imbalance, and performance differences between the custom model and the transformer).
 - Suggest possible solutions to address the identified errors.
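
A possible starting point for reviewing errors is sketched below. It assumes binary labels encoded as 0/1 and uses toy data in place of the validation tweets and model predictions; the helper names `confusion_counts` and `misclassified` are hypothetical.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """2x2 confusion matrix: rows are true labels, columns are predictions."""
    matrix = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[int(t), int(p)] += 1
    return matrix

def misclassified(texts, y_true, y_pred):
    """Collect (text, true_label, predicted_label) for every wrong prediction."""
    return [(text, int(t), int(p))
            for text, t, p in zip(texts, y_true, y_pred) if int(t) != int(p)]

# Toy data standing in for validation tweets and model predictions
texts = ["tweet a", "tweet b", "tweet c", "tweet d"]
y_true = [0, 1, 1, 0]
y_pred = [0, 0, 1, 1]
matrix = confusion_counts(y_true, y_pred)
errors = misclassified(texts, y_true, y_pred)
```

Reading through `errors` by hand (for example, looking for OOV-heavy tweets or sarcasm) is usually what turns the counts in `matrix` into concrete findings.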



# [Task 8 - 0.5 points] Report

Wrap up your experiment in a short report (up to 2 pages).

### Instructions

* Use the NLP course report template.
* Summarize each task in the report following the provided template.

### Recommendations

The report is not a copy-paste of graphs, tables, and command outputs.

* Summarize classification performance in Table format.
* **Do not** report command outputs or screenshots.
* Report learning curves in Figure format.
* The error analysis section should summarize your findings.


# Submission

* **Submit** your report in PDF format.
* **Submit** your python notebook.
* Make sure your notebook is **well organized**, with no temporary code, commented sections, tests, etc...
* You can upload **model weights** in a cloud repository and report the link in the report.

# FAQ

Please check these frequently asked questions before contacting us.

### Execution Order

You are **free** to address tasks in any order (if multiple orderings are available).

### Trainable Embeddings

You are **free** to define a trainable or non-trainable Embedding layer to load the GloVe embeddings.

### Model architecture

You **should not** change the architecture of a model (i.e., its layers).
However, you are **free** to play with their hyper-parameters.


### Neural Libraries

You are **free** to use any library of your choice to implement the networks (e.g., Keras, Tensorflow, PyTorch, JAX, etc...)

### Keras TimeDistributed Dense layer

If you are using Keras, we recommend wrapping the final Dense layer with `TimeDistributed`.

### Robust Evaluation

Each model is trained with at least 3 random seeds.

Task 5 requires you to compute the average performance over the seeds and its corresponding standard deviation.

### Model Selection for Analysis

To carry out the error analysis you are **free** to either

* Pick examples or perform comparisons with an individual seed run model (e.g., Baseline seed 1337)
* Perform ensembling via, for instance, majority voting to obtain a single model.
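
The majority-voting option can be sketched as follows, assuming binary 0/1 predictions from each seed run stacked into an array of shape `(n_seeds, n_samples)`. The function name `majority_vote` is hypothetical, and with an even number of runs this sketch resolves ties to class 0.

```python
import numpy as np

def majority_vote(predictions_per_seed):
    """Combine binary predictions of shape (n_seeds, n_samples) by majority vote."""
    votes = np.asarray(predictions_per_seed)
    # A sample is labeled 1 when strictly more than half of the runs predict 1
    return (votes.sum(axis=0) * 2 > votes.shape[0]).astype(int)

# Toy predictions from three seed runs on four validation samples
seed_predictions = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
ensemble = majority_vote(seed_predictions)
```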

### Error Analysis

Some topics for discussion include:
   * Precision/Recall curves.
   * Confusion matrices.
   * Specific misclassified samples.
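
For the precision/recall discussion, one simple approach is to sweep the decision threshold and record both metrics for the positive class. The sketch below assumes the model outputs a probability per tweet; `precision_recall_points` and the threshold values are hypothetical, and precision is set to 1.0 by convention when no sample is predicted positive.

```python
import numpy as np

def precision_recall_points(y_true, y_prob, thresholds=(0.3, 0.5, 0.7)):
    """Return (threshold, precision, recall) for the positive class."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    points = []
    for thr in thresholds:
        y_pred = (y_prob > thr).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        # Convention assumed here: precision is 1.0 when nothing is predicted positive
        precision = tp / (tp + fp) if tp + fp > 0 else 1.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        points.append((thr, float(precision), float(recall)))
    return points

# Toy labels and predicted probabilities
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.6, 0.4, 0.1]
points = precision_recall_points(y_true, y_prob)
```

Plotting recall against precision for a finer grid of thresholds gives the curve; the three points here only show the trade-off.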

### Bonus Points
Bonus points are arbitrarily assigned based on significant contributions such as:
- Outstanding error analysis
- Masterclass code organization
- Suitable extensions

Note that bonus points are only assigned if all task points are attributed (i.e., 6/6).

**Possible Extensions/Explorations for Bonus Points:**
- **Try other preprocessing strategies**: e.g., but not limited to, explore techniques tailored specifically for tweets or methods that are common in social media text.
- **Experiment with other custom architectures or models from HuggingFace**
- **Explore Spanish tweets**: e.g., but not limited to, leverage multilingual models to process Spanish tweets and assess their performance compared to monolingual models.







# The End