
# Assignment 2

**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: Sexism Detection, Binary Classification, LLMs, Prompting


# Contact

For any doubts, questions, or issues, you can always contact us at the following email addresses:

Teaching Assistants:

* Federico Ruggeri -> federico.ruggeri6@unibo.it
* Eleonora Mancini -> e.mancini@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# Relevant Material

- Tutorial 3
- Huggingface documentation
- Huggingface hub

# Introduction

You are tasked to address the [EDOS Task A](https://github.com/rewire-online/edos) on sexism detection.

## Problem definition

Given an input text sentence, the task is to label the sentence as sexist or not sexist (binary classification).

### Examples:

**Text**: *“Schedule a date with her, then don't show up. Then text her "GOTCHA B___H".”*

**Label**: Sexist

**Text**: *“That’s completely ridiculous a woman flashing her boobs is not sexual assault in the slightest.”*

**Label**: Not sexist



## Approach

We will tackle the binary classification task with LLMs.

In particular, we'll consider zero-/few-shot prompting approaches to assess the capability of some popular open-source LLMs on this task.

## Preliminaries

We are going to download LLMs from [Huggingface](https://huggingface.co/).

Many of these open-source LLMs require you to accept their "Community License Agreement" to download them.

In summary:

- If you do not have one already, create a Huggingface account (~2 mins).
- Check an LLM model card page (e.g., [Mistral v3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)) and accept its "Community License Agreement".
- Go to your account -> Settings -> Access Tokens -> Create new token -> "Repositories permissions" -> add the LLM model card you want to use.
- Save the token (we'll need it later)

### Huggingface Login

Once we have created an account and an access token, we need to log in to Huggingface via code.

- Type your token and press Enter
- You can say No to Github linking

In [1]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineGrained).
The token `mistral v0.2+v0.3` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `

After login, you can download all models associated with your access token in addition to those that are not protected by an access token.

### Data Loading

Since we are only interested in prompting, we do not require a train dataset.

We have prepared a small test set version of EDOS in our dedicated [Github repository](https://github.com/lt-nlp-lab-unibo/nlp-course-material).

Check the ``Assignment 2/data`` folder.
It contains:

- ``a2_test.csv`` -> a small test set of 300 samples.
- ``demonstrations.csv`` -> a batch of 1000 samples for few-shot prompting.

Both datasets contain a balanced number of sexist and not sexist samples.


### Instructions

We require you to:

* **Download** the ``Assignment 2/data`` folder.
* **Encode** ``a2_test.csv`` into a ``pandas.DataFrame`` object.

In [2]:
import requests
import os
import pandas as pd

# Create directory structure if it doesn't exist
base_dir = 'Assignment 2/data'
os.makedirs(base_dir, exist_ok=True)

# Updated URLs for the raw files on GitHub
urls = {
    'test': 'https://raw.githubusercontent.com/lt-nlp-lab-unibo/nlp-course-material/main/2024-2025/Assignment%202/data/a2_test.csv',
    'demos': 'https://raw.githubusercontent.com/lt-nlp-lab-unibo/nlp-course-material/main/2024-2025/Assignment%202/data/demonstrations.csv'
}

# Download files
for name, url in urls.items():
    response = requests.get(url)
    if response.status_code == 200:
        filepath = os.path.join(base_dir, f'{name}.csv')
        with open(filepath, 'wb') as f:
            f.write(response.content)
        print(f"Successfully downloaded {filepath}")
    else:
        print(f"Failed to download {name} dataset. Status code: {response.status_code}")
        print(f"URL attempted: {url}")

# Verify files exist before loading
if os.path.exists(os.path.join(base_dir, 'test.csv')) and os.path.exists(os.path.join(base_dir, 'demos.csv')):
    # Load the datasets using pandas

    test_df = pd.read_csv(os.path.join(base_dir, 'test.csv'))
    demos_df = pd.read_csv(os.path.join(base_dir, 'demos.csv'))

    # Display first few rows of test dataset
    print("\nFirst few rows of test dataset:")
    print(test_df.head())
    # Display basic information about both datasets
    print("\nTest dataset info:")
    print(test_df.info())
    print("\nDemonstrations dataset info:")
    print(demos_df.info())
else:
    print("\nFiles were not downloaded successfully. Please check the repository URLs.")

Successfully downloaded Assignment 2/data/test.csv
Successfully downloaded Assignment 2/data/demos.csv

First few rows of test dataset:
                  rewire_id  \
0  sexism2022_english-17133   
1  sexism2022_english-14197   
2   sexism2022_english-3018   
3   sexism2022_english-5301   
4  sexism2022_english-17796   

                                                text label_sexist  
0  The boys must be gaming because there goes the...   not sexist  
1  Look at those eyes. Either someone unexpectedl...       sexist  
2                  Old man mogs everyone in this sub   not sexist  
3  Excellent, I was just looking at another post ...   not sexist  
4  So you run back to daddy whenever you need hel...       sexist  

Test dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   rewire_id     300 non-null    object
 1   text          

# [Task 1 - 0.5 points] Model setup

Once the test data has been loaded, we have to set up the model pipeline for inference.

In particular, we have to:
- Load the model weights from Huggingface
- Quantize the model so that it fits on a single GPU with limited memory

## Which LLMs?

The pool of LLMs is ever increasing and it's impossible to keep track of all new entries.

We focus on popular open-source models.

- [Mistral v2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
- [Mistral v3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)
- [Llama v3.1](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- [Phi3-mini](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)

Other open-source models are more than welcome!

### Instructions

In order to get Task 1 points, we require you to:

* Pick 2 model cards from the provided list.
* For each model:
  - Define a separate section of your notebook for the model.
  - Set up a quantization configuration for the model.
  - Load the model via HuggingFace APIs.


### Notes

1. There's a popular library integrated with Huggingface's ``transformers`` to perform quantization.

2. Define two separate sections of your notebook to show that you have implemented the prompting pipeline for each selected model card.

In [3]:
# system packages
from pathlib import Path
import shutil
import urllib
import tarfile
import sys

# data and numerical management packages
import pandas as pd
import numpy as np

# useful during debugging (progress bars)
from tqdm import tqdm

In [4]:
!pip install transformers
!pip install datasets
!pip install accelerate -U
!pip install evaluate
!pip install bitsandbytes

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 480.6/480.6 kB 13.3 MB/s eta 0:00:00
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 kB 10.6 MB/s eta 0:00:00
Downloading fsspec-2024.9.0-py3-none-any.whl 

In [5]:
import torch
torch.cuda.is_available()

True

In [6]:
!nvidia-smi

Mon Jan  6 15:30:43 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P8               9W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [7]:
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
        'width': 2560,
        'height': 1440,
        'scroll': True,
})

{'width': 2560, 'height': 1440, 'scroll': True}

In [8]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

model_card = "mistralai/Mistral-7B-Instruct-v0.3"

tokenizer = AutoTokenizer.from_pretrained(model_card)
tokenizer.pad_token = tokenizer.eos_token

# Mistral ends generation with its EOS token. "<|eot_id|>" is a Llama 3 token
# that is not in Mistral's vocabulary, so it is not listed here.
terminators = [tokenizer.eos_token_id]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/141k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/587k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

In [9]:
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_card,
    return_dict=True,
    quantization_config=bnb_config,
    device_map='auto'
)

config.json:   0%|          | 0.00/601 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.55G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]
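
As a rough sanity check on why quantization is needed here, the sketch below estimates the memory taken by the model weights at different precisions. It assumes that the three checkpoint shards listed above (4.95 + 5.00 + 4.55 GB) store 16-bit weights, and it ignores activations, the KV cache, and the small overhead of quantization constants, so the numbers are only approximate. ``weight_memory_gb`` is a hypothetical helper written for this example, not a library function.

```python
# Back-of-envelope estimate of weight memory (weights only).
GB = 1e9

# Shard sizes taken from the download log above; assumed to hold 16-bit weights.
checkpoint_bytes = (4.95 + 5.00 + 4.55) * GB
bytes_per_param_16bit = 2
n_params = checkpoint_bytes / bytes_per_param_16bit


def weight_memory_gb(n_params, bits_per_param):
    """Return the approximate weight memory in GB for a given precision."""
    return n_params * bits_per_param / 8 / GB


for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(n_params, bits):.2f} GB")
```

Under these assumptions, 16-bit weights alone take up most of the 15360 MiB available on the Tesla T4 reported by ``nvidia-smi``, while 4-bit weights leave room for inference.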

# [Task 2 - 1.0 points] Prompt setup

Prompting requires an input pre-processing phase where we convert each input example into a specific instruction prompt.


## Prompt Template

Use the following prompt template to process input texts.

In [249]:
prompt = [
    {
        'role': 'system',
        'content': 'You are an expert annotator for sexism detection. Respond only with YES for sexist text or NO for non-sexist text.'
    },
    {
        'role': 'user',
        'content': """Is this text sexist? Answer only YES or NO.

        TEXT:
        {text}

        ANSWER:"""
    }
]

### Instructions

In order to get Task 2 points, we require you to:

* Write a ``prepare_prompts`` function as the one reported below.

In [256]:
def prepare_prompts(texts, prompt_template, tokenizer):
    """
    This function formats input text samples into instruction prompts.
    """
    formatted_prompts = []

    # Handle both single text and list of texts
    if isinstance(texts, str):
        texts = [texts]
    elif isinstance(texts, pd.Series):
        texts = texts.tolist()

    for text in texts:
        # Create a fresh copy of the template
        current_prompt = [
            prompt_template[0].copy(),
            prompt_template[1].copy()
        ]

        # Format the user content by replacing the {text} placeholder
        current_prompt[1]['content'] = current_prompt[1]['content'].format(text=text)

        # Combine system and user messages
        full_prompt = f"{current_prompt[0]['content']}\n\n{current_prompt[1]['content']}"


        # Tokenize the formatted prompt
        encoded_prompt = tokenizer(
            full_prompt,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors="pt"
        )

        formatted_prompts.append(encoded_prompt)

    return formatted_prompts


### Notes

1. You are free to modify the prompt format (**not its content**) as you like depending on your code implementation.

2. Note that the provided prompt has placeholders. You need to format the string to replace placeholders. Huggingface might have dedicated APIs for this.
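
One possible way to fill the placeholder while keeping the chat structure is sketched below. ``build_messages`` is a hypothetical helper, and the shortened template is for illustration only. The commented call to ``tokenizer.apply_chat_template`` is the Huggingface API for rendering chat messages in a model's own format. It is not executed here because it needs a downloaded tokenizer, and some chat templates may not accept a ``system`` role, so check the model card.

```python
import copy


def build_messages(text, prompt_template):
    """Return a copy of the chat template with the {text} placeholder filled in."""
    messages = copy.deepcopy(prompt_template)  # leave the original template untouched
    for message in messages:
        if message["role"] == "user":
            message["content"] = message["content"].format(text=text)
    return messages


# Shortened template, for illustration only
template = [
    {"role": "system", "content": "You are an annotator for sexism detection."},
    {"role": "user", "content": "TEXT:\n{text}\n\nANSWER:"},
]

messages = build_messages("I like apples.", template)
print(messages[1]["content"])

# With a loaded tokenizer, the messages could then be rendered as a prompt string:
# prompt_text = tokenizer.apply_chat_template(
#     messages, tokenize=False, add_generation_prompt=True
# )
```

Copying the template before formatting keeps the ``{text}`` placeholder available for the next sample.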

# [Task 3 - 1.0 points] Inference

We are now ready to define the inference loop where we prompt the model with each pre-processed sample.

### Instructions

In order to get Task 3 points, we require you to:

* Write a ``generate_responses`` function as the one reported below.
* Write a ``process_response`` function as the one reported below.

In [275]:
def generate_responses(model, prompt_examples):
    """
    This function implements the inference loop for a LLM model.
    """
    responses = []
    device = next(model.parameters()).device

    with torch.no_grad():
        for encoded_prompt in prompt_examples:
            # Move tensors to device
            input_ids = encoded_prompt['input_ids'].to(device)
            attention_mask = encoded_prompt['attention_mask'].to(device)

            # Generate response
            outputs = model.generate(
                input_ids=input_ids,
                attention_mask=attention_mask,
                max_new_tokens=5,  # Reduced for YES/NO response
                do_sample=True,     # Enable sampling
                temperature=0.1,    # low temperature
                num_beams=3,
                pad_token_id=model.config.eos_token_id,
                eos_token_id=tokenizer.eos_token_id
            )

            # Get only the newly generated tokens
            new_tokens = outputs[0][input_ids.shape[1]:]
            decoded_output = tokenizer.decode(new_tokens, skip_special_tokens=True)
            responses.append(decoded_output.strip())

            # Clear cache
            torch.cuda.empty_cache()

    return responses


In [276]:
# Test texts
test_texts = [
    "Women belong in the kitchen, not in the workplace.",  # Clearly sexist
    "Everyone deserves equal opportunities regardless of gender.",  # Not sexist
    "I like apples.",  # Neutral
    "She's pretty smart... for a girl.",  # Sexist
    "The weather is nice today."  # Neutral
]

# Test each text individually
for test in test_texts:
    print(f"\nTesting text: {test}")
    formatted_prompts = prepare_prompts([test], prompt, tokenizer)
    response = generate_responses(model, formatted_prompts)
    print(f"Response: {response[0]}")


Testing text: Women belong in the kitchen, not in the workplace.
Response: YES

Testing text: Everyone deserves equal opportunities regardless of gender.
Response: NO

Testing text: I like apples.
Response: NO

Testing text: She's pretty smart... for a girl.
Response: YES

Testing text: The weather is nice today.
Response: NO


In [277]:

import re


def process_response(response):
    """
    This function processes model responses into binary labels.

    Inputs:
      response: the raw text response from the model

    Outputs:
      binary label (1 for sexist, 0 for not sexist), or None if the response is invalid
    """
    # Convert response to lowercase for case-insensitive matching
    response = response.lower().strip()
    print(f"response: {response}")
    # Split into whole words so that "no" does not match inside "not" or "know"
    words = re.findall(r"[a-z]+", response)
    # Check for "yes" indicating sexist content
    if "yes" in words:
        return 1
    # Check for "no" indicating not sexist content
    elif "no" in words:
        return 0
    else:
        # invalid response
        return None

Test:

In [281]:
import random


def test_random_samples_with_timing(model, tokenizer, test_df, prompt_template, num_samples=10):
    results = []
    device = next(model.parameters()).device
    tokenizer.pad_token = tokenizer.eos_token

    # Get equal number of samples from both classes
    samples_per_class = num_samples // 2

    # Get indices for both classes
    sexist_indices = test_df[test_df['label_sexist'] == 'sexist'].index
    not_sexist_indices = test_df[test_df['label_sexist'] == 'not sexist'].index

    # Randomly sample indices from both classes
    selected_sexist = random.sample(list(sexist_indices), min(samples_per_class, len(sexist_indices)))
    selected_not_sexist = random.sample(list(not_sexist_indices), min(samples_per_class, len(not_sexist_indices)))

    # Combine and shuffle the selected indices
    selected_indices = selected_sexist + selected_not_sexist
    random.shuffle(selected_indices)

    # Loop over the selected samples
    for i, idx in enumerate(selected_indices):
        # Get the sample text and true label (idx is an index label, so use .loc)
        sample_text = test_df.loc[idx, 'text']
        true_label = test_df.loc[idx, 'label_sexist']

        print(f"\nSample {i+1}/{len(selected_indices)}:")
        print(f"Selected text: {sample_text}")
        print(f"True label: {true_label}")

        # Format the prompt and generate response
        formatted_prompts = prepare_prompts([sample_text], prompt_template, tokenizer)
        for j in range(len(formatted_prompts)):
            formatted_prompts[j] = {k: v.to(device) for k, v in formatted_prompts[j].items()}

        responses = generate_responses(model, formatted_prompts)
        binary_label = process_response(responses[0])
        predicted_label = "sexist" if binary_label == 1 else "not sexist"

        print(f"Raw response: {responses[0]}")
        print(f"Predicted label: {predicted_label}")
        print("-" * 80)

        # Save result
        result = {
            'sample_idx': idx,
            'text': sample_text,
            'true_label': true_label,
            'predicted_label': predicted_label,
            'raw_response': responses[0],
            'correct': predicted_label == true_label
        }
        results.append(result)

    # Print summary statistics
    correct_predictions = sum(result['correct'] for result in results)
    accuracy = correct_predictions / len(results)

    sexist_correct = sum(1 for r in results
                        if r['true_label'] == 'sexist' and r['correct'])
    not_sexist_correct = sum(1 for r in results
                            if r['true_label'] == 'not sexist' and r['correct'])

    sexist_total = sum(1 for r in results if r['true_label'] == 'sexist')
    not_sexist_total = sum(1 for r in results if r['true_label'] == 'not sexist')

    print("\nResults Summary:")
    print(f"Overall Accuracy: {accuracy:.2%}")
    print(f"Sexist Accuracy: {sexist_correct/sexist_total:.2%} ({sexist_correct}/{sexist_total})")
    print(f"Not Sexist Accuracy: {not_sexist_correct/not_sexist_total:.2%} ({not_sexist_correct}/{not_sexist_total})")

    return results

In [None]:
import time
from datetime import timedelta

# Run test on n samples
start_time = time.time()
test_results = test_random_samples_with_timing(model, tokenizer, test_df, prompt, num_samples=300)
total_time = time.time() - start_time
print(f"\nTotal execution time: {str(timedelta(seconds=total_time))}")
# Average over the number of samples actually tested
print(f"Average time per sample: {str(timedelta(seconds=total_time / len(test_results)))}")
# Print summary of results
correct_predictions = sum(result['correct'] for result in test_results)
accuracy = correct_predictions / len(test_results)

print("\nTest Summary:")
print(f"Total samples tested: {len(test_results)}")
print(f"Accuracy: {accuracy:.2%}")


Sample 1/300:
Selected text: and whats new about this. a woman can be total garbage, but as long as she is fuckable she gets something.
True label: sexist
response: yes
Raw response: YES
Predicted label: sexist
--------------------------------------------------------------------------------

Sample 2/300:
Selected text: Are your chromosomes XX? If so, you're 100% female.
True label: sexist
response: yes
Raw response: YES
Predicted label: sexist
--------------------------------------------------------------------------------

Sample 3/300:
Selected text: I know mine is higher with smart women than dumb women. Or rather, it's lower with dumb women because my lack of knowledge or interest in what (Insert celebrity) did comes across as patronizing.
True label: sexist
response: yes
Raw response: YES
Predicted label: sexist
--------------------------------------------------------------------------------

Sample 4/300:
Selected text: I'm sure BB will cover the story when the coalburner and h

## Notes

1. According to our tests, it should take you ~10 mins to perform full inference on 300 samples.

# [Task 4 - 0.5 points] Metrics

In order to evaluate selected LLMs, we need to compute performance metrics.

In particular, we are interested in computing **accuracy** since the provided data is balanced with respect to classification classes.

Moreover, we want to compute the ratio of failed responses generated by models.

That is, how frequently the LLM fails to follow instructions and produces a response that does not address the classification task.

We denote this metric as **fail-ratio**.

In summary, we parse generated responses as follows:
- 1 if the model says YES
- 0 if the model says NO
- 0 if the model does not answer in either way
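
The parsing rule above can be illustrated on a toy example. This is a minimal sketch: the responses and labels are made up for illustration, and ``parse`` is a hypothetical helper, not the ``process_response`` function required by the assignment.

```python
def parse(response):
    """Map a raw response to (label, failed): YES -> 1, NO -> 0, anything else -> 0 and failed."""
    words = response.lower().replace(".", " ").split()
    if "yes" in words:
        return 1, False
    if "no" in words:
        return 0, False
    return 0, True


# Made-up responses and ground-truth labels, for illustration only
responses = ["YES", "NO", "I cannot say", "yes."]
y_true = [1, 0, 1, 0]

parsed = [parse(r) for r in responses]
y_pred = [label for label, _ in parsed]

accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)
fail_ratio = sum(failed for _, failed in parsed) / len(parsed)
print(accuracy, fail_ratio)  # prints: 0.5 0.25
```

Here the third response is a failure, so it counts as 0 for accuracy and contributes 1/4 to the fail-ratio.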

### Instructions

In order to get Task 4 points, we require you to:

* Write a ``compute_metrics`` function as the one reported below.
* Compute metrics for the two selected LLMs.

In [286]:


def compute_metrics(responses, y_true):
    """
    This function takes generated responses and ground-truth labels and computes metrics.
    In particular, this function computes the accuracy and fail-ratio metrics.

    Inputs:
      responses: generated LLM responses
      y_true: ground-truth binary labels (1 for 'sexist', 0 for 'not sexist')

    Outputs:
      dictionary containing desired metrics
    """
    # Convert string labels to binary if needed
    y_true_binary = []
    for label in y_true:
        if isinstance(label, str):
            y_true_binary.append(1 if label == 'sexist' else 0)
        else:
            y_true_binary.append(label)

    # Process responses
    y_pred = []
    failed_responses = 0
    total_responses = len(responses)

    for response in responses:
        prediction = process_response(response)
        if prediction is None:
            failed_responses += 1
            y_pred.append(0)  # Count failed responses as "not sexist"
        else:
            y_pred.append(prediction)

    # Calculate metrics
    correct_predictions = sum(1 for true, pred in zip(y_true_binary, y_pred) if true == pred)
    accuracy = correct_predictions / total_responses
    fail_ratio = failed_responses / total_responses

    # Calculate per-class metrics
    true_positives = sum(1 for true, pred in zip(y_true_binary, y_pred) if true == 1 and pred == 1)
    true_negatives = sum(1 for true, pred in zip(y_true_binary, y_pred) if true == 0 and pred == 0)
    false_positives = sum(1 for true, pred in zip(y_true_binary, y_pred) if true == 0 and pred == 1)
    false_negatives = sum(1 for true, pred in zip(y_true_binary, y_pred) if true == 1 and pred == 0)

    # Calculate precision, recall, and F1 score
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    return {
        'accuracy': accuracy,
        'fail_ratio': fail_ratio,
        'precision': precision,
        'recall': recall,
        'f1_score': f1_score,
        'confusion_matrix': {
            'true_positives': true_positives,
            'true_negatives': true_negatives,
            'false_positives': false_positives,
            'false_negatives': false_negatives
        }
    }

In [287]:
def test_model_with_metrics(model, tokenizer, test_df, prompt_template, num_samples=10):
    """
    Test the model and compute metrics on the results
    """
    results = []
    device = next(model.parameters()).device
    tokenizer.pad_token = tokenizer.eos_token

    # Get equal number of samples from both classes
    samples_per_class = num_samples // 2

    # Get indices for both classes
    sexist_indices = test_df[test_df['label_sexist'] == 'sexist'].index
    not_sexist_indices = test_df[test_df['label_sexist'] == 'not sexist'].index

    # Randomly sample indices from both classes
    selected_sexist = random.sample(list(sexist_indices), min(samples_per_class, len(sexist_indices)))
    selected_not_sexist = random.sample(list(not_sexist_indices), min(samples_per_class, len(not_sexist_indices)))

    # Combine and shuffle
    selected_indices = selected_sexist + selected_not_sexist
    random.shuffle(selected_indices)

    all_responses = []
    all_true_labels = []

    # Test each sample
    for i, idx in enumerate(selected_indices):
        sample_text = test_df.loc[idx, 'text']
        true_label = test_df.loc[idx, 'label_sexist']

        print(f"\nSample {i+1}/{len(selected_indices)}:")
        print(f"Text: {sample_text}")
        print(f"True label: {true_label}")

        # Generate response
        formatted_prompts = prepare_prompts([sample_text], prompt_template, tokenizer)
        for j in range(len(formatted_prompts)):
            formatted_prompts[j] = {k: v.to(device) for k, v in formatted_prompts[j].items()}

        responses = generate_responses(model, formatted_prompts)
        print(f"Model response: {responses[0]}")

        all_responses.append(responses[0])
        all_true_labels.append(true_label)

        print("-" * 80)

    # Compute metrics
    metrics = compute_metrics(all_responses, all_true_labels)

    print("\nFinal Metrics:")
    print(f"Accuracy: {metrics['accuracy']:.2%}")
    print(f"Fail Ratio: {metrics['fail_ratio']:.2%}")
    print(f"Precision: {metrics['precision']:.2%}")
    print(f"Recall: {metrics['recall']:.2%}")
    print(f"F1 Score: {metrics['f1_score']:.2%}")
    print("\nConfusion Matrix:")
    print(f"True Positives: {metrics['confusion_matrix']['true_positives']}")
    print(f"True Negatives: {metrics['confusion_matrix']['true_negatives']}")
    print(f"False Positives: {metrics['confusion_matrix']['false_positives']}")
    print(f"False Negatives: {metrics['confusion_matrix']['false_negatives']}")

    return metrics, all_responses, all_true_labels

# Run the test with metrics
metrics, responses, true_labels = test_model_with_metrics(model, tokenizer, test_df, prompt, num_samples=10)


Sample 1/10:
Text: Chelsea Clinton leaned over to her husband and said something about Melania. Not sure what she said, but I guarantee it was classless. #PresidentialDebates
True label: not sexist
Model response: YES
--------------------------------------------------------------------------------

Sample 2/10:
Text: Feminist ignore RACE when it comes to rape statistics - are cops falsely arresting nignogs darkies for rape? Or are nignog darkies raping whityle women as if nignogs are too retarded for civil society?
True label: not sexist
Model response: YES
--------------------------------------------------------------------------------

Sample 3/10:
Text: Hillary should also have her clearance revoked.
True label: not sexist
Model response: YES
--------------------------------------------------------------------------------

Sample 4/10:
Text: merkel is stazi she has to go
True label: not sexist
Model response: YES
---------------------------------------------------------------------

# [Task 5 - 1.0 points] Few-shot Inference

So far, we have tested models in a zero-shot fashion: we provide the input text to classify and instruct the model to generate a response.

We are now interested in performing few-shot prompting to see the impact of providing demonstration examples.

To do so, we slightly change the prompt template as follows.

In [None]:
prompt = [
    {
        'role': 'system',
        'content': 'You are an annotator for sexism detection.'
    },
    {
        'role': 'user',
        'content': """Your task is to classify input text as containing sexism or not. Respond only YES or NO.

        EXAMPLES:
        {examples}

        TEXT:
        {text}

        ANSWER:
        """
    }
]

The new prompt template reports some demonstration examples to instruct the model.

Generally, we provide an equal number of demonstrations per class as shown in the example below.

In [None]:
prompt = [
    {
        'role': 'system',
        'content': 'You are an annotator for sexism detection.'
    },
    {
        'role': 'user',
        'content': """Your task is to classify input text as containing sexism or not. Respond only YES or NO.

        EXAMPLES:
        TEXT: **example 1**
        ANSWER: YES
        TEXT: **example 2**
        ANSWER: NO

        TEXT:
        {text}

        ANSWER:
        """
    }
]

## Instructions

In order to get Task 5 points, we require you to:

- Load ``demonstrations.csv`` and encode it into a ``pandas.DataFrame`` object.
- Define a ``build_few_shot_demonstrations`` function as the one reported below.
- Perform few-shot inference as in Task 3.
- Compute metrics as in Task 4.

In [None]:
def build_few_shot_demonstrations(demonstrations, num_per_class=2):
  """
    Inputs:
      demonstrations: the pandas.DataFrame object wrapping demonstrations.csv
      num_per_class: number of demonstrations per class

    Outputs:
      a list of textual demonstrations to inject into the prompt template.
  """
  pass

## Notes

1. You are free to pick any value for ``num_per_class``.

2. According to our tests, few-shot prompting increases inference time by some minutes (we experimented with ``num_per_class`` $\in [2, 4]$).
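
A minimal sketch of one way to fill in ``build_few_shot_demonstrations`` is shown below. It assumes that ``demonstrations.csv`` has the same ``text`` and ``label_sexist`` columns as the test set. The ``seed`` parameter is an addition for reproducibility and is not part of the required signature. To keep the example runnable without pandas, it also accepts a plain list of row dictionaries.

```python
import random


def build_few_shot_demonstrations(demonstrations, num_per_class=2, seed=42):
    """Return a list of 'TEXT: ... ANSWER: ...' strings, balanced across classes."""
    # Accept either a pandas.DataFrame or a plain list of row dictionaries
    if hasattr(demonstrations, "to_dict"):
        records = demonstrations.to_dict("records")
    else:
        records = list(demonstrations)

    rng = random.Random(seed)
    sexist = [r for r in records if r["label_sexist"] == "sexist"]
    not_sexist = [r for r in records if r["label_sexist"] == "not sexist"]

    # Sample the same number of demonstrations per class, then mix them
    picked = rng.sample(sexist, num_per_class) + rng.sample(not_sexist, num_per_class)
    rng.shuffle(picked)

    return [
        f"TEXT: {r['text']}\nANSWER: {'YES' if r['label_sexist'] == 'sexist' else 'NO'}"
        for r in picked
    ]


# Toy rows standing in for demonstrations.csv
toy_rows = [
    {"text": f"sexist example {i}", "label_sexist": "sexist"} for i in range(3)
] + [
    {"text": f"neutral example {i}", "label_sexist": "not sexist"} for i in range(3)
]

demos = build_few_shot_demonstrations(toy_rows, num_per_class=2)
print("\n".join(demos))
```

The returned list can be joined with newlines and passed as the ``examples`` field when formatting the few-shot prompt template.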

# [Task 6 - 1.0 points] Error Analysis

We are now interested in evaluating model responses and comparing their performance.

This analysis helps us understand:

- Classification task performance gap: are the models good at this task?
- Generation quality: what kinds of responses do models generate?
- Errors: what kinds of mistakes do models make?

### Instructions

In order to get Task 6 points, we require you to:

* Compare classification performance of selected LLMs in a Table.
* Compute confusion matrices for selected LLMs.
* Briefly summarize your observations on generated responses.
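
A confusion matrix for binary labels can be computed with the standard library alone, as sketched below. The labels are placeholders for illustration, not real model outputs, and ``confusion_matrix`` is a hypothetical helper defined here (``scikit-learn`` offers an equivalent function if you prefer a library).

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Rows are true labels (0, 1); columns are predicted labels (0, 1)."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in (0, 1)] for t in (0, 1)]


# Placeholder labels for illustration only (not real model outputs)
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)

# Print the matrix as a small text table
print("true \\ pred | not sexist | sexist")
for name, row in zip(("not sexist", "sexist"), cm):
    print(f"{name:<11} | {row[0]:>10} | {row[1]:>6}")
```

Building one such matrix per model and prompting setup makes it easy to see whether errors are mostly false positives or false negatives.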

# [Task 7 - 1.0 points] Report

Wrap up your experiment in a short report (up to 2 pages).

### Instructions

* Use the NLP course report template.
* Summarize each task in the report following the provided template.

### Recommendations

The report is not a copy-paste of graphs, tables, and command outputs.

* Summarize classification performance in Table format.
* **Do not** report command outputs or screenshots.
* Report learning curves in Figure format.
* The error analysis section should summarize your findings.

# Submission

* **Submit** your report in PDF format.
* **Submit** your python notebook.
* Make sure your notebook is **well organized**, with no temporary code, commented sections, tests, etc...

# FAQ

Please check these frequently asked questions before contacting us.

### Model cards

You can pick any open-source model card you like.

We recommend starting from those reported in this assignment.

### Implementation

Everything can be done via ``transformers`` APIs.

However, you are free to test frameworks such as [LangChain](https://www.langchain.com/), [LlamaIndex](https://www.llamaindex.ai/), and [LitParrot](https://github.com/awesome-software/lit-parrot), provided that you correctly address task instructions.

### Bonus Points

0.5 bonus points are arbitrarily assigned based on significant contributions such as:

- Outstanding error analysis
- Masterclass code organization
- Suitable extensions
- Evaluate A1 dataset and perform comparison

Note that bonus points are only assigned if all task points are attributed (i.e., 6/6).

### Prompt Template

Do not change the provided prompt template.

You are only allowed to change it in case of a possible extension.

### Optimizations

Any kind of code optimization (e.g., speedup model inference or reduce computational cost) is more than welcome!
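
One common speedup is to prompt the model with several samples at once instead of one at a time. The sketch below only shows the batching step. ``batched`` is a hypothetical helper, and the commented tokenizer lines are an assumption about how a batch could then be encoded (they are not executed here, since they need a loaded tokenizer).

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    items = list(items)
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"sample {i}" for i in range(5)]
batches = list(batched(texts, batch_size=2))
print([len(batch) for batch in batches])

# Each batch could then be tokenized together, for example:
# tokenizer.padding_side = "left"  # decoder-only models continue from the last token
# encoded = tokenizer(batch, padding=True, return_tensors="pt")
```

The batch size is limited by GPU memory, so it is worth starting small and increasing it gradually.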

# The End