# Assignment 2

**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: Sexism Detection, Multi-class Classification, LLMs, Prompting


# Contact

For any doubt, question, issue or help, you can always contact us at the following email addresses:

Teaching Assistants:

* Federico Ruggeri -> federico.ruggeri6@unibo.it
* Eleonora Mancini -> e.mancini@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# Relevant Material

- Tutorial 3
- Huggingface documentation
- Huggingface hub

# Introduction

You are tasked to address the [EDOS Task A](https://github.com/rewire-online/edos) on sexism detection.

## Problem definition

Given an input text sentence, the task is to label the sentence as sexist or not sexist (binary classification).

### Examples:

**Text**: *``Schedule a date with her, then don't show up. Then text her "GOTCHA B___H".''*

**Label**: Sexist

**Text**: *``That’s completely ridiculous a woman flashing her boobs is not sexual assault in the slightest.''*

**Label**: Not sexist



## Approach

We will tackle the binary classification task with LLMs.

In particular, we'll consider zero-/few-shot prompting approaches to assess the capability of some popular open-source LLMs on this task.

## Preliminaries

We are going to download LLMs from [Huggingface](https://huggingface.co/).

Many of these open-source LLMs require you to accept their "Community License Agreement" to download them.

In summary:

- If not already, create an account of Huggingface (~2 mins)
- Check a LLM model card page (e.g., [Mistral v3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)) and accept its "Community License Agreement".
- Go to your account -> Settings -> Access Tokens -> Create new token -> "Repositories permissions" -> add the LLM model card you want to use.
- Save the token (we'll need it later)

### Huggingface Login

Once we have created an account and an access token, we need to login to Huggingface via code.

- Type your token and press Enter
- You can say No to Github linking

In [1]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineGrained).
The token `mistral v0.2+v0.3` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `

After login, you can download all models associated with your access token in addition to those that are not protected by an access token.

### Data Loading

Since we are only interested in prompting, we do not require a train dataset.

We have preparared a small test set version of EDOS in our dedicated [Github repository](https://github.com/lt-nlp-lab-unibo/nlp-course-material).

Check the ``Assignment 2/data`` folder.
It contains:

- ``a2_test.csv`` → a small test set of 300 samples.
- ``demonstrations.csv`` -> a batch of 1000 samples for few-shot prompting.

Both datasets contain a balanced number of sexist and not sexist samples.


### Instructions

We require you to:

* **Download** the ``A2/data`` folder.
* **Encode** ``a2_test.csv`` into a ``pandas.DataFrame`` object.

In [11]:
import requests
import os
import pandas as pd

# Create directory structure if it doesn't exist
base_dir = 'Assignment 2/data'
if not os.path.exists(base_dir):
    os.makedirs(base_dir)

# Updated URLs for the raw files on GitHub
urls = {
    'test': 'https://raw.githubusercontent.com/lt-nlp-lab-unibo/nlp-course-material/main/2024-2025/Assignment%202/data/a2_test.csv',
    'demos': 'https://raw.githubusercontent.com/lt-nlp-lab-unibo/nlp-course-material/main/2024-2025/Assignment%202/data/demonstrations.csv'
}

# Download files
for name, url in urls.items():
    response = requests.get(url)
    if response.status_code == 200:
        filepath = os.path.join(base_dir, f'{name}.csv')
        with open(filepath, 'wb') as f:
            f.write(response.content)
        print(f"Successfully downloaded {filepath}")
    else:
        print(f"Failed to download {name} dataset. Status code: {response.status_code}")
        print(f"URL attempted: {url}")

# Verify files exist before loading
if os.path.exists(os.path.join(base_dir, 'test.csv')) and os.path.exists(os.path.join(base_dir, 'demos.csv')):
    # Load the datasets using pandas

    test_df = pd.read_csv(os.path.join(base_dir, 'test.csv'))
    demos_df = pd.read_csv(os.path.join(base_dir, 'demos.csv'))

    # Display first few rows of test dataset
    print("\nFirst few rows of test dataset:")
    print(test_df.head())
    # Display basic information about both datasets
    print("\nTest dataset info:")
    print(test_df.info())
    print("\nDemonstrations dataset info:")
    print(demos_df.info())
else:
    print("\nFiles were not downloaded successfully. Please check the repository URLs.")

Successfully downloaded Assignment 2/data/test.csv
Successfully downloaded Assignment 2/data/demos.csv

First few rows of test dataset:
                  rewire_id  \
0  sexism2022_english-17133   
1  sexism2022_english-14197   
2   sexism2022_english-3018   
3   sexism2022_english-5301   
4  sexism2022_english-17796   

                                                text label_sexist  
0  The boys must be gaming because there goes the...   not sexist  
1  Look at those eyes. Either someone unexpectedl...       sexist  
2                  Old man mogs everyone in this sub   not sexist  
3  Excellent, I was just looking at another post ...   not sexist  
4  So you run back to daddy whenever you need hel...       sexist  

Test dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   rewire_id     300 non-null    object
 1   text          

# [Task 1 - 0.5 points] Model setup

Once the test data has been loaded, we have to setup the model pipeline for inference.

In particular, we have to:
- Load the model weights from Huggingface
- Quantize the model to fit into a single-GPU limited hardware

## Which LLMs?

The pool of LLMs is ever increasing and it's impossible to keep track of all new entries.

We focus on popular open-source models.

- [Mistral v2](mistralai/Mistral-7B-Instruct-v0.2)
- [Mistral v3](mistralai/Mistral-7B-Instruct-v0.3)
- [Llama v3.1](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- [Phi3-mini](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)

Other open-source models are more than welcome!

### Instructions

In order to get Task 1 points, we require you to:

* Pick 2 model cards from the provided list.
* For each model:
  - Define a separate section of your notebook for the model.
  - Setup a quantization configuration for the model.
  - Load the model via HuggingFace APIs.


### Notes

1. There's a popular library integrated with Huggingface's ``transformers`` to perform quantization.

2. Define two separate sections of your notebook to show that you have implemented the prompting pipeline for each selected model card.

In [2]:
# system packages
from pathlib import Path
import shutil
import urllib
import tarfile
import sys

# data and numerical management packages
import pandas as pd
import numpy as np

# useful during debugging (progress bars)
from tqdm import tqdm

In [3]:
!pip install transformers
!pip install datasets
!pip install accelerate -U
!pip install evaluate
!pip install bitsandbytes

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m8.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m11.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl (

In [4]:
import torch
torch.cuda.is_available()

True

In [5]:
!nvidia-smi

Tue Nov 26 22:54:44 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P8              11W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [6]:
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
        'width': 2560,
        'height': 1440,
        'scroll': True,
})

{'width': 2560, 'height': 1440, 'scroll': True}

In [9]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

model_card = "mistralai/Mistral-7B-Instruct-v0.3"

tokenizer = AutoTokenizer.from_pretrained(model_card)
tokenizer.pad_token = tokenizer.eos_token

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

tokenizer_config.json:   0%|          | 0.00/141k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/587k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

In [10]:
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_card,
    return_dict=True,
    quantization_config=bnb_config,
    device_map='auto'
)

config.json:   0%|          | 0.00/601 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.55G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

# [Task 2 - 1.0 points] Prompt setup

Prompting requires an input pre-processing phase where we convert each input example into a specific instruction prompt.


## Prompt Template

Use the following prompt template to process input texts.

In [12]:
prompt = [
    {
        'role': 'system',
        'content': 'You are an annotator for sexism detection.'
    },
    {
        'role': 'user',
        'content': """Your task is to classify input text as containing sexism or not. Respond only YES or NO.

        TEXT:
        {text}

        ANSWER:
        """
    }
]

### Instructions

In order to get Task 2 points, we require you to:

* Write a ``prepare_prompts`` function as the one reported below.

In [21]:
def prepare_prompts(texts, prompt_template, tokenizer):
    """
    This function formats input text samples into instruction prompts.

    Inputs:
      texts: input texts to classify via prompting
      prompt_template: the prompt template provided in this assignment
      tokenizer: the transformers Tokenizer object instance associated with the chosen model card

    Outputs:
      input texts to classify in the form of instruction prompts
    """
    formatted_prompts = []

    for text in texts:
        # Create a copy of the template to avoid modifying the original
        current_prompt = prompt_template.copy()

        # Format the user content by replacing the {text} placeholder
        current_prompt[1]['content'] = current_prompt[1]['content'].format(text=text)

        # Convert the prompt list into a chat format string
        chat_text = f"{current_prompt[0]['content']}\n\n{current_prompt[1]['content']}"

        # Tokenize the formatted prompt
        encoded_prompt = tokenizer(
            chat_text,
            padding=True,
            truncation=True,
            return_tensors="pt"
        )

        formatted_prompts.append(encoded_prompt)

    # Return the formatted prompts
    return formatted_prompts


### Notes

1. You are free to modify the prompt format (**not its content**) as you like depending on your code implementation.

2. Note that the provided prompt has placeholders. You need to format the string to replace placeholders. Huggingface might have dedicated APIs for this.

# [Task 3 - 1.0 points] Inference

We are now ready to define the inference loop where we prompt the model with each pre-processed sample.

### Instructions

In order to get Task 3 points, we require you to:

* Write a ``generate_responses`` function as the one reported below.
* Write a ``process_response`` function as the one reported below.

In [15]:
def generate_responses(model, prompt_examples):
    """
    This function implements the inference loop for a LLM model.
    Given a set of examples, the model is tasked to generate a response.

    Inputs:
      model: LLM model instance for prompting
      prompt_examples: pre-processed text samples

    Outputs:
      generated responses
    """
    responses = []
    for encoded_prompt in prompt_examples:
        outputs = model.generate(
            input_ids=encoded_prompt['input_ids'].to(model.device),
            attention_mask=encoded_prompt['attention_mask'].to(model.device),
            max_new_tokens=10,
            do_sample=False,
            num_beams=1
        )
        response = tokenizer.decode(outputs[0], skip_special_tokens=True).split("ANSWER:")[-1].strip()
        responses.append(response)
    return responses

In [16]:
def process_response(response):
    """
    This function takes a textual response generated by the LLM
    and processes it to map the response to a binary label.

    Inputs:
      response: generated response from LLM

    Outputs:
      parsed binary response: return 1 if YES and 0 if NO
    """
    # Convert response to uppercase and strip whitespace
    processed_response = response.strip().upper()

    # Check if response contains 1 or 0
    if "YES" in processed_response:
        return "sexist"
    elif "NO" in processed_response:
        return "not sexist"
    else:
        return 0

Test:

In [23]:
import time
from datetime import timedelta
import random

def test_random_samples_with_timing(model, tokenizer, test_df, prompt_template, num_samples=300):
    """
    Test the model with multiple random samples from the dataset.

    Args:
        model: The loaded model
        tokenizer: The tokenizer
        test_df: Test dataset DataFrame
        prompt_template: Template for prompting
        num_samples: Number of random samples to test

    Returns:
        list: List of dictionaries containing test results and metrics for each sample
    """
    # Start timing
    start_time = time.time()

    # Results list
    results = []

    # Set padding token for tokenizer
    tokenizer.pad_token = tokenizer.eos_token

    # Loop over the specified number of samples
    for i in range(num_samples):
        # Get a random index
        random_idx = random.randint(0, len(test_df) - 1)

        # Get the sample text and true label
        sample_text = test_df['text'].iloc[random_idx:random_idx+1]  # Keep as series for compatibility
        true_label = test_df['label_sexist'].iloc[random_idx]

        print(f"Testing sample {i+1}/{num_samples} (Index: {random_idx})")
        print(f"Text: {sample_text.iloc[0]}")
        print(f"True label: {true_label}")

        # Format the prompt with the Mistral chat format
        formatted_prompts = prepare_prompts(sample_text, prompt_template, tokenizer)

        # Generate response
        responses = generate_responses(model, formatted_prompts)

        # Process response
        binary_label = process_response(responses[0])
        print(f"Predicted label: {binary_label}")
        # Save result for this sample
        result = {
            'sample_idx': random_idx,
            'text': sample_text.iloc[0],
            'true_label': true_label,
            'predicted_label': binary_label,
            'raw_response': responses[0],
            'processing_time': None,  # Will calculate at the end
            'correct': binary_label == true_label
        }
        results.append(result)

        # Print elapsed time after this sample
        elapsed_time = time.time() - start_time
        print(f"Elapsed time: {str(timedelta(seconds=elapsed_time))}\n")

    # End timing
    total_time = time.time() - start_time

    # Add processing time for all samples
    avg_time_per_sample = total_time / num_samples
    for result in results:
        result['processing_time'] = avg_time_per_sample

    print(f"\nTotal processing time for {num_samples} samples: {str(timedelta(seconds=total_time))}")
    print(f"Average time per sample: {avg_time_per_sample:.2f} seconds")

    # Return results
    return results

# Run test on 300 samples
test_results = test_random_samples_with_timing(model, tokenizer, test_df, prompt, num_samples=50)

# Print summary of results
correct_predictions = sum(result['correct'] for result in test_results)
accuracy = correct_predictions / len(test_results)

print("\nTest Summary:")
print(f"Total samples tested: {len(test_results)}")
print(f"Accuracy: {accuracy:.2%}")


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Testing sample 1/50 (Index: 108)
Text: "THIS is the beauty of concealed carry #GunsSaveLives" by [USER] [URL] , i live in canaduh wish we had women like that here .... sigh
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:01.742244

Testing sample 2/50 (Index: 40)
Text: How do you make Lady Gaga scream? Bite her clit off.
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:03.430284

Testing sample 3/50 (Index: 239)
Text: Citizen Donald Trump was simply pointing out that hollywood women get so wet and excited around millionaire movies stars, they could literally be grabbed by their pussies. (He was right again!)
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:04.894014

Testing sample 4/50 (Index: 84)
Text: I hope the DOJ goes after EVERY DEMOCRAT for ANY tax, bank, campaign fraud they can find. ANY of it! EVEN IF from 10 years ago. These political prosecutions will only end when the dems cry "Uncle!"
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:06.304443

Testing sample 5/50 (Index: 139)
Text: Providing Emotional Support is gay. The best way to pull anyone women included out of depression is squats and nutritious food. You should be leading your girlfriend not being her vagina boy. She already has a vagina she doesn't need a second.
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:07.653076

Testing sample 6/50 (Index: 269)
Text: The lawyer chick will shoot down the McDonald's guy, every time, no matter how he looks. That's just the world that we live in. Agree to disagree, I suppose.
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:08.983603

Testing sample 7/50 (Index: 111)
Text: It's a sickening thought to realize roughly half our nation's population think we're their enemy because we genuinely support our Constitution. They support Social Justice and Gender Neutrality.... #WeCanNoLongerCoexistInPeace
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:10.309322

Testing sample 8/50 (Index: 281)
Text: GL convincing the idiot masses ......Black pastors see Trump bringing 'new hope' -- but still need to convince their flocks .....[URL]
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:11.640933

Testing sample 9/50 (Index: 189)
Text: Of course I’m judgemental. If you can talk about a someone that you haven’t even meet to its fair that I get to judge you isn’t it. Or do you want to be exempt from those standards. Like a women wants.
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:12.981976

Testing sample 10/50 (Index: 219)
Text: Folks [USER] is what having a small cock looks like Poor little [USER]. Nobody likes him, he has a small pencil dick, he has problems meeting women, and he can't get his testosterone patch dosage correct!
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:14.418972

Testing sample 11/50 (Index: 260)
Text: Who the fuck would believe that fat landwhale that someone would want to rape her lel.
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:15.904118

Testing sample 12/50 (Index: 151)
Text: Shut it women
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:17.303270

Testing sample 13/50 (Index: 6)
Text: ‪Import the third world and become the third world. You won’t see feminists protesting about this. ‬ ‪[URL]
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:18.634672

Testing sample 14/50 (Index: 182)
Text: Out of boredom I decided to read quite a few of the posts from cheeky monkey. My wife and I can't decide whether it's 1. A teenage girl or 2. Just a cunt I know it's possible to be both but I'm guessing it's 1 and my wife says 2.
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:19.988767

Testing sample 15/50 (Index: 127)
Text: I have an alt account pretending to be a woman. My mission is making FDS females as toxic as possible I've gotten thousands of up votes while posting grade A bitch shit
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:21.380183

Testing sample 16/50 (Index: 54)
Text: The "migrant" first eats her snatch, then the disease eats it.
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:22.750513

Testing sample 17/50 (Index: 90)
Text: Guys a retard , who wants attention
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:24.185849

Testing sample 18/50 (Index: 151)
Text: Shut it women
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:25.736032

Testing sample 19/50 (Index: 181)
Text: [URL] [URL] I actually like the style of the bra popping out of the top of the dress. It looks like it was constructed that way and accentuates her curves. This dress is so vibrant and sexy--great for dancing!✨✨✨ #Beauty #GabMusic
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:27.306904

Testing sample 20/50 (Index: 150)
Text: Typical praising of women and bashing of men. They're so predictable. And dumb.
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:28.725677

Testing sample 21/50 (Index: 159)
Text: I see. I hope that your balls are metaphorical, little lady.
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:30.261237

Testing sample 22/50 (Index: 113)
Text: No, it's called 'guys like girls that look like bimbos' and it's a trend that needs to die. Yes, some of it is uncanny territory, but the botox lips and hooker paint job is a design choice.
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:31.620514

Testing sample 23/50 (Index: 227)
Text: [USER] · 8h8 hours ago Shouldn’t the most infamous loser in history shut her pie hole at this point? It’s becoming an authentic tragedy. #TumblingHag [URL]
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:32.988876

Testing sample 24/50 (Index: 96)
Text: Too many damn emojis.
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:34.375419

Testing sample 25/50 (Index: 196)
Text: you're may be right. i'm assuming for a hookup, face matters little, because of the positions they'd be in.
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:35.758559

Testing sample 26/50 (Index: 219)
Text: Folks [USER] is what having a small cock looks like Poor little [USER]. Nobody likes him, he has a small pencil dick, he has problems meeting women, and he can't get his testosterone patch dosage correct!
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:37.123353

Testing sample 27/50 (Index: 224)
Text: Only people in my office in a shitty mood today: Our token "Wise latina" and the silly single mom beaner burner. Engineering group is all grins (also, all white and all male), even their intern is wearing a red MAGA hate. LOL.
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:38.484203

Testing sample 28/50 (Index: 220)
Text: what have you tried already? I like to point out that nobody blinks when a smaller man starts shit with a larger man and gets his ass beat, men and women are equal right?
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:39.839617

Testing sample 29/50 (Index: 83)
Text: Go for more traditional women. They cheat at a far lower rate than city girls.
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:41.302518

Testing sample 30/50 (Index: 22)
Text: *women spend all free time trying to block male only activities*
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:42.812709

Testing sample 31/50 (Index: 24)
Text: Human sacrifice in the Bible psalm-1379-christian-bible-girl-baby-religion
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:44.327588

Testing sample 32/50 (Index: 177)
Text: I'm thinking of starting a kickstarter or something to deprogram her with a scientific Psyops campaign so we don't have to endure her debased shenanigans anymore. I'll take a $million bounty if we turn her into a legit evangelical.
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:45.693709

Testing sample 33/50 (Index: 245)
Text: To female Bishops stop making fun of religion.
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:47.042214

Testing sample 34/50 (Index: 10)
Text: The black African will soon be labeling her a racist. Race card coming soon
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:48.374344

Testing sample 35/50 (Index: 288)
Text: A kilt is generally fairly heavy-duty and made of wool. A skirt is generally lighter weight.
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:49.720340

Testing sample 36/50 (Index: 22)
Text: *women spend all free time trying to block male only activities*
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:51.051707

Testing sample 37/50 (Index: 147)
Text: 🤣🤣🤣🤣 My hubby eats healthy, as do I. Thats why. Also, he enjoys cooking as a hobby. No, he isn't a cuck, like you are. 👌👌👌👌
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:52.392512

Testing sample 38/50 (Index: 27)
Text: #Italy: #Muslim #migrant gropes American student’s girlfriend, then #stabs the student twice: [URL]
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:53.722675

Testing sample 39/50 (Index: 71)
Text: [URL] Obviously written by an incel
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:55.143925

Testing sample 40/50 (Index: 135)
Text: "Where on Earth do you get all your ideas from for your sick jokes?" asked my wife. Then she fell out of her wheelchair.
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:56.613399

Testing sample 41/50 (Index: 213)
Text: Kick the trans in her man eat
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:57.980257

Testing sample 42/50 (Index: 137)
Text: Feminist ignore RACE when it comes to rape statistics - are cops falsely arresting nignogs darkies for rape? Or are nignog darkies raping whityle women as if nignogs are too retarded for civil society?
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:00:59.319265

Testing sample 43/50 (Index: 295)
Text: It stops white women from being hit (or worse) by niggers and thats as right as it gets.
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:01:00.642319

Testing sample 44/50 (Index: 119)
Text: You’re overthinking it completely. Whether you sit next to her or not next time you see her talk to her and ask her out...
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:01:01.982332

Testing sample 45/50 (Index: 31)
Text: I don’t date women with body imperfections. This is why if she has asymmetric breasts, stretch marks, cellulite, a weird vagina etc I ghost after the creampie
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:01:03.295705

Testing sample 46/50 (Index: 249)
Text: I know mine is higher with smart women than dumb women. Or rather, it's lower with dumb women because my lack of knowledge or interest in what (Insert celebrity) did comes across as patronizing.
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:01:04.616728

Testing sample 47/50 (Index: 76)
Text: Among letters attesting to her ‘good character’ was one written by her carpenter husband of 13 years, Paul, 34. He helped her cover her face as she arrived at court for yesterday’s hearing. Double pussy pass! Idiot husband deserves what he gets.
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:01:05.933482

Testing sample 48/50 (Index: 4)
Text: So you run back to daddy whenever you need help. I thought men were sooo bad? Lol. Like I said at the end you need the help of men and can't do shit on your own. Your dad must be a major cuck to raise such a stupid daughter, what went wrong
True label: sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:01:07.270363

Testing sample 49/50 (Index: 106)
Text: Even us girls did this with or without our brothers. I miss those days when we loved earning money. The pride we felt was an amazing feeling. My kids did the same thing. Raised them right and now the grand kids are learning the same lesson. ;-)
True label: not sexist


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Predicted label: sexist
Elapsed time: 0:01:08.658447

Testing sample 50/50 (Index: 253)
Text: Banging a white girl like her [she's a felon.. ]([URL] I always love rogue white chicks. Her smile is hot.
True label: not sexist
Predicted label: sexist
Elapsed time: 0:01:10.100086


Total processing time for 50 samples: 0:01:10.100545
Average time per sample: 1.40 seconds

Test Summary:
Total samples tested: 50
Accuracy: 46.00%


## Notes

1. According to our tests, it should take you ~10 mins to perform full inference on 300 samples.

# [Task 4 - 0.5 points] Metrics

In order to evaluate selected LLMs, we need to compute performance metrics.

In particular, we are interested in computing **accuracy** since the provided data is balanced with respect to classification classes.

Moreover, we want to compute the ratio of failed responses generated by models.

That is, how frequent the LLM fails to follow instructions and provides incorrect responses that do not address the classification task.

We denote this metric as **fail-ratio**.

In summary, we parse generated responses as follows:
- 1 if the model says YES
- 0 if the model says NO
- 0 if the model does not answer in either way

### Instructions

In order to get Task 4 points, we require you to:

* Write a ``compute_metrics`` function as the one reported below.
* Compute metrics for the two selected LLMs.

In [None]:
def compute_metrics(responses, y_true):
  """
    This function takes predicted and ground-truth labels and compute metrics.
    In particular, this function compute accuracy and fail-ratio metrics.
    This function internally invokes `process_response` to compute metrics.

    Inputs:
      responses: generated LLM responses
      y_true: ground-truth binary labels

    Outputs:
      dictionary containing desired metrics
  """
  pass

# [Task 5 - 1.0 points] Few-shot Inference

So far, we have tested models in a zero-shot fashion: we provide the input text to classify and instruct the model to generate a response.

We are now interested in performing few-shot prompting to see the impact of providing demonstration examples.

To do so, we slightly change the prompt template as follows.

In [None]:
prompt = [
    {
        'role': 'system',
        'content': 'You are an annotator for sexism detection.'
    },
    {
        'role': 'user',
        'content': """Your task is to classify input text as containing sexism or not. Respond only YES or NO.

        EXAMPLES:
        {examples}

        TEXT:
        {text}

        ANSWER:
        """
    }
]

The new prompt template reports some demonstration examples to instruct the model.

Generally, we provide an equal number of demonstrations per class as shown in the example below.

In [None]:
prompt = [
    {
        'role': 'system',
        'content': 'You are an annotator for sexism detection.'
    },
    {
        'role': 'user',
        'content': """Your task is to classify input text as containing sexism or not. Respond only YES or NO.

        EXAMPLES:
        TEXT: **example 1**
        ANSWER: YES
        TEXT: **example 2**
        ANSWER: NO

        TEXT:
        {text}

        ANSWER:
        """
    }
]

## Instructions

In order to get Task 5 points, we require you to:

- Load ``demonstrations.csv`` and encode it into a ``pandas.DataFrame`` object.
- Define a ``build_few_shot_demonstrations`` function as the one reported below.
- Perform few-shot inference as in Task 3.
- Compute metrics as in Task 4.

In [None]:
def build_few_shot_demonstrations(demonstrations, num_per_class=2):
  """
    Inputs:
      demonstrations: the pandas.DataFrame object wrapping demonstrations.csv
      num_per_class: number of demonstrations per class

    Outputs:
      a list of textual demonstrations to inject into the prompt template.
  """
  pass

## Notes

1. You are free to pick any value for ``num_per_class``.

2. According to our tests, few-shot prompting increases inference time by some minutes (we experimented with ``num_per_class`` $\in [2, 4]$).

# [Task 6 - 1.0 points] Error Analysis

We are now interested in evaluating model responses and comparing their performance.

This analysis helps us in understanding

- Classification task performance gap: are the models good at this task?
- Generation quality: which kind of responses do models generate?
- Errors: which kind of mistakes do models do?

### Instructions

In order to get Task 6 points, we require you to:

* Compare classification performance of selected LLMs in a Table.
* Compute confusion matrices for selected LLMs.
* Briefly summarize your observations on generated responses.

# [Task 7 - 1.0 points] Report

Wrap up your experiment in a short report (up to 2 pages).

### Instructions

* Use the NLP course report template.
* Summarize each task in the report following the provided template.

### Recommendations

The report is not a copy-paste of graphs, tables, and command outputs.

* Summarize classification performance in Table format.
* **Do not** report command outputs or screenshots.
* Report learning curves in Figure format.
* The error analysis section should summarize your findings.

# Submission

* **Submit** your report in PDF format.
* **Submit** your python notebook.
* Make sure your notebook is **well organized**, with no temporary code, commented sections, tests, etc...

# FAQ

Please check this frequently asked questions before contacting us.

### Model cards

You can pick any open-source model card you like.

We recommend starting from those reported in this assignment.

### Implementation

Everything can be done via ``transformers`` APIs.

However, you are free to test frameworks, such as [LangChain](https://www.langchain.com/), [LlamaIndex](https://www.llamaindex.ai/) [LitParrot](https://github.com/awesome-software/lit-parrot), provided that you correctly address task instructions.

### Bonus Points

0.5 bonus points are arbitrarily assigned based on significant contributions such as:

- Outstanding error analysis
- Masterclass code organization
- Suitable extensions
- Evaluate A1 dataset and perform comparison

Note that bonus points are only assigned if all task points are attributed (i.e., 6/6).

### Prompt Template

Do not change the provided prompt template.

You are only allowed to change it in case of a possible extension.

### Optimizations

Any kind of code optimization (e.g., speedup model inference or reduce computational cost) is more than welcome!

# The End