# Assignment 2

**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: Sexism Detection, Multi-class Classification, LLMs, Prompting


# Contact

For any doubts, questions, or issues, you can contact us at the following email addresses:

Teaching Assistants:

* Federico Ruggeri -> federico.ruggeri6@unibo.it
* Eleonora Mancini -> e.mancini@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# Relevant Material

- Tutorial 3
- Huggingface documentation
- Huggingface hub

# Introduction

You are tasked to address the [EDOS Task B](https://github.com/rewire-online/edos) on sexism detection.

## Problem definition

Given an input text sentence, the task is to label the sentence as not-sexist or one of these four sexist categories: (1) threats, (2) derogation, (3) animosity, (4) prejudiced discussion.

### Examples:

**Text**: *``Schedule a date with her, then don't show up. Then text her "GOTCHA B___H".''*

**Label**: Threats

**Text**: *``That’s completely ridiculous a woman flashing her boobs is not sexual assault in the slightest.''*

**Label**: Not sexist



## Approach

We will tackle the five-class classification task with LLMs.

In particular, we'll consider zero-/few-shot prompting approaches to assess the capability of some popular open-source LLMs on this task.

## Preliminaries

We are going to download LLMs from [Huggingface](https://huggingface.co/).

Many of these open-source LLMs require you to accept their "Community License Agreement" to download them.

In summary:

- If you do not already have one, create an account on Huggingface (~2 mins).
- Open an LLM model card page (e.g., [Mistral v3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)) and accept its "Community License Agreement".
- Go to your account -> Settings -> Access Tokens -> Create new token -> "Repositories permissions" -> add the LLM model card you want to use.
- Save the token (we'll need it later).

## Environment setup

Before importing the necessary modules for this notebook, let's download the `pyproject.toml` file that specifies the project dependencies, together with its `uv.lock` lockfile.

In [1]:
import os
import urllib.request
from pathlib import Path


def get_project_root() -> Path:
    """Return the root directory of the project."""
    start_dir = Path.cwd()

    markers = ["assignment2.ipynb"]

    for path in [start_dir, *list(start_dir.parents)]:
        for marker in markers:
            if (path / marker).exists():
                return path

    return start_dir


project_root: Path = get_project_root()

In [2]:
project_repo: str = "mpreda01/nlp-assignments"
project_branch: str = "feat/task2"
#TODO CHANGEME: main branch
pyproject_url = f"https://raw.githubusercontent.com/{project_repo}/{project_branch}/pyproject.toml"
lockfile_url = f"https://raw.githubusercontent.com/{project_repo}/{project_branch}/uv.lock"
urllib.request.urlretrieve(pyproject_url, project_root / "pyproject.toml")  # noqa: S310
urllib.request.urlretrieve(lockfile_url, project_root / "uv.lock");  # noqa: S310

If you use [uv](https://github.com/astral-sh/uv) (recommended), you can now install the dependencies into a local virtual environment at `.venv` with:
```sh
uv sync
```

If not, the same can be achieved with the usual python [venv](https://docs.python.org/3/library/venv.html):
```sh
python3 -m venv .venv
source .venv/bin/activate
pip install .
```

Make sure to do the above and restart the kernel if necessary before proceeding.

In [3]:
import pandas as pd
import torch
from huggingface_hub import login, whoami
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

### Huggingface Login

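Once we have created an account and an access token, we need to log in to Huggingface via code.

- Store your token in the `HF_TOKEN` environment variable; the cell below reads it from there.
- Linking the token to your Git credentials is optional (the cell below enables it through `add_to_git_credential=True`).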

In [None]:
hf_token = os.environ.get("HF_TOKEN")

if hf_token:
    try:
        login(token=hf_token, add_to_git_credential=True)
        user = whoami(token=hf_token)
        username = user["name"]
        print(f"Logged in as @{username}")
        print(f"Profile page: https://huggingface.co/{username}")
    except ValueError:
        print("Invalid token.")
    except ImportError:
        print("ipywidgets not installed.")
else:
    print("Warning: 'HF_TOKEN' environment variable not set.")

Logged in as @MZR1
Profile page: https://huggingface.co/MZR1


After login, you can download all models associated with your access token in addition to those that are not protected by an access token.

### Data Loading

Since we are only interested in prompting, we do not require a training dataset.

We have prepared a small test set version of EDOS in our dedicated [Github repository](https://github.com/lt-nlp-lab-unibo/nlp-course-material).

Check the ``Assignment 2/data`` folder.
It contains:

- ``a2_test.csv`` -> a small test set of 300 samples.
- ``demonstrations.csv`` -> a batch of 1000 samples for few-shot prompting.

Both datasets contain a balanced number of sexist and not sexist samples.


### Instructions

We require you to:

* **Download** the ``Assignment 2/data`` folder.
* **Encode** ``a2_test.csv`` into a ``pandas.DataFrame`` object.

In [5]:
repo = "nlp-unibo/nlp-course-material"
branch = "main"
folder_path = "2025-2026/Assignment%202/data"
local_dir = "data"
files = ["a2_test.csv", "demonstrations.csv"]

base_url = f"https://raw.githubusercontent.com/{repo}/{branch}/{folder_path}"
os.makedirs(local_dir, exist_ok=True)

for filename in files:
    url = f"{base_url}/{filename}"
    destination = os.path.join(local_dir, filename)

    urllib.request.urlretrieve(url, destination)
    print(f"Saved to: {destination}")

Saved to: data/a2_test.csv
Saved to: data/demonstrations.csv
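
The second instruction above (encoding ``a2_test.csv`` into a ``pandas.DataFrame``) could be sketched as follows. This is a minimal sketch: `load_csv_as_dataframe` is a hypothetical helper name, the path assumes the files were saved under `data/` as in the previous cell, and no assumption is made about the column names of the CSV.

```python
from pathlib import Path

import pandas as pd


def load_csv_as_dataframe(csv_path):
    """Read a CSV file into a DataFrame, failing early if the file is missing."""
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Expected a CSV file at {csv_path}")
    return pd.read_csv(csv_path)


# Assumed location: the folder used by the download cell above.
test_path = Path("data") / "a2_test.csv"
if test_path.is_file():
    test_df = load_csv_as_dataframe(test_path)
    # Inspect size and column names before relying on any specific column.
    print(test_df.shape)
    print(test_df.columns.tolist())
```

Printing the column names first is a cheap way to check which columns hold the texts and the labels before writing the rest of the pipeline.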


# [Task 1 - 0.5 points] Model setup

Once the test data has been loaded, we have to set up the model pipeline for inference.

In particular, we have to:
- Load the model weights from Huggingface
- Quantize the model so that it fits on limited single-GPU hardware
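
To see why quantization matters here, a rough back-of-the-envelope estimate of the memory taken by the weights alone can help. The sketch below is only an approximation: `weight_memory_gib` is a hypothetical helper, the 7-billion parameter count is a nominal figure for a "7B" model, and the estimate ignores quantization metadata, activations, and the generation cache.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory (in GiB) needed to store the weights alone."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / (1024 ** 3)


# Nominal parameter count for a "7B" model (assumption for illustration).
nominal_params = 7_000_000_000

for bits in (16, 8, 4):
    gib = weight_memory_gib(nominal_params, bits)
    print(f"{bits:>2}-bit weights: ~{gib:.1f} GiB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight footprint by a factor of four, which is what makes a 7B model plausible on a single consumer or Colab GPU.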

## Which LLMs?

The pool of LLMs is ever increasing and it's impossible to keep track of all new entries.

We focus on popular open-source models.

- [Mistral v2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
- [Mistral v3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)
- [Llama v3.1](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- [Phi3-mini](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)
- [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
- [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)
- [Qwen3](https://huggingface.co/Qwen/Qwen3-1.7B)

Other open-source models are more than welcome!

### Instructions

In order to get Task 1 points, we require you to:

* Pick 2 model cards from the provided list.
* For each model:
  - Set up a quantization configuration for the model.
  - Load the model via HuggingFace APIs.


### Note

There's a popular library integrated with Huggingface's ``transformers`` to perform quantization.


In [6]:
#%pip install .

Processing /content
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: nlp-project
  Building wheel for nlp-project (pyproject.toml) ... done
  Created wheel for nlp-project: filename=nlp_project-0.1.0-py3-none-any.whl size=1174 sha256=da051a3efd453cb6e37c6d66d5a7985791190a79505cb5ffeaa818d7a3a89477
  Stored in directory: /tmp/pip-ephem-wheel-cache-xcozhg77/wheels/bd/e2/ad/6557ae2989fbf3d2351bffa42147f9477243538a6ea9803db9
Successfully built nlp-project
Installing collected packages: nlp-project
  Attempting uninstall: nlp-project
    Found existing installation: nlp-project 0.1.0
    Uninstalling nlp-project-0.1.0:
      Successfully uninstalled nlp-project-0.1.0
Successfully installed nlp-project-0.1.0


In [7]:
%pip list | grep -E "accelerate|bitsandbytes|huggingface-hub|pandas|transformers"

accelerate                               1.12.0
bitsandbytes                             0.49.1
geopandas                                1.1.1
huggingface-hub                          0.36.0
pandas                                   2.3.3
pandas-datareader                        0.10.0
pandas-gbq                               0.30.0
pandas-stubs                             2.2.2.240909
sentence-transformers                    5.2.0
sklearn-pandas                           2.2.0
transformers                             4.57.3


In [11]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# --- Model 1: DeepSeek ---
MODEL_1_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

if "model_DeepSeek" not in globals() or model_DeepSeek is None:
    print(f"Loading Model 1: {MODEL_1_ID}")

    tokenizer_DeepSeek = AutoTokenizer.from_pretrained(MODEL_1_ID, force_download=True)
    model_DeepSeek = AutoModelForCausalLM.from_pretrained(
        MODEL_1_ID,
        return_dict=True,
        quantization_config=quantization_config,
        device_map="auto",
        force_download=True
    )
    # Attach tokenizer to model object
    model_DeepSeek.tokenizer = tokenizer_DeepSeek

    if tokenizer_DeepSeek.pad_token is None:
        tokenizer_DeepSeek.pad_token = tokenizer_DeepSeek.eos_token

    print("Model 1 loaded successfully!")
else:
    print(f"Model 1 ({MODEL_1_ID}) already loaded, skipping...")


# --- Model 2: Mistral V3---
MODEL_2_ID = "mistralai/Mistral-7B-Instruct-v0.3"

if "model_mistral" not in globals() or model_mistral is None:
    print(f"Loading Model 2: {MODEL_2_ID}")

    tokenizer_mistral = AutoTokenizer.from_pretrained(MODEL_2_ID, force_download=True)
    model_mistral = AutoModelForCausalLM.from_pretrained(
        MODEL_2_ID,
        quantization_config=quantization_config,
        device_map="auto",
        return_dict=True,
        force_download=True
    )
    # Attach tokenizer to model object
    model_mistral.tokenizer = tokenizer_mistral

    if tokenizer_mistral.pad_token is None:
        tokenizer_mistral.pad_token = tokenizer_mistral.eos_token

    print("Model 2 loaded successfully!")
else:
    print(f"Model 2 ({MODEL_2_ID}) already loaded, skipping...")

Loading Model 1: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B



Model 1 loaded successfully!
Loading Model 2: mistralai/Mistral-7B-Instruct-v0.3



Model 2 loaded successfully!


# [Task 2 - 1.0 points] Prompt setup

Prompting requires an input pre-processing phase where we convert each input example into a specific instruction prompt.


## Prompt Template

Use the following prompt template to process input texts.

In [12]:
prompt = [
    {"role": "system", "content": "You are an annotator for sexism detection."},
    {
        "role": "user",
        "content": """Your task is to classify input text as not-sexist
         or sexist. If sexist, classify input text according to one
         of the following four categories: threats, derogation,
         animosity, prejudiced discussion.

         Below you find sexist categories definitions:
         Threats: the text expresses intent or desire to harm a woman.
         Derogation: the text describes a woman in a derogative manner.
         Animosity: the text contains slurs or insults towards a woman.
         Prejudiced discussion: the text expresses supports for
         mistreatment of women as individuals.

         Respond only by writing one of the following categories:
         not-sexist, threats, derogation, animosity, prejudiced.

        TEXT: {text}

        ANSWER:
        """,
    },
]

### Instructions

In order to get Task 2 points, we require you to:

* Write a ``prepare_prompts`` function as the one reported below.

In [14]:
import copy

In [15]:
def prepare_prompts(texts, prompt_template, tokenizer):
    """This function formats input text samples into instruction prompts.

    Inputs:
      texts: input texts to classify via prompting
      prompt_template: the prompt template provided in this assignment
      tokenizer: the transformers Tokenizer object instance associated
      with the chosen model card

    Outputs:
      input texts to classify in the form of instruction prompts
    """
    prepared_prompts = []

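    for text in texts:
        # Copy the template so the original placeholders are preserved
        prompt_messages = copy.deepcopy(prompt_template)

        # Fill the {text} placeholder in the user message
        prompt_messages[1]["content"] = prompt_messages[1]["content"].format(text=text)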

        # Use the tokenizer's chat template to format the messages
        # This converts the list of message dicts into the model's expected format
        formatted_prompt = tokenizer.apply_chat_template(
            prompt_messages,
            tokenize=False,           # Return string, not token IDs
            add_generation_prompt=True # Add the prompt for the model to generate
        )

        prepared_prompts.append(formatted_prompt)

    return prepared_prompts


prepared_prompt = prepare_prompts(["ciao", "coglione"], prompt, tokenizer_DeepSeek)
print(prepared_prompt)

['<｜begin▁of▁sentence｜>You are an annotator for sexism detection.<｜User｜>Your task is to classify input text as not-sexist \n         or sexist. If sexist, classify input text according to one\n         of the following four categories: threats, derogation,\n         animosity, prejudiced discussion.\n         \n         Below you find sexist categories definitions:\n         Threats: the text expresses intent or desire to harm a woman.\n         Derogation: the text describes a woman in a derogative manner.\n         Animosity: the text contains slurs or insults towards a woman.\n         Prejudiced discussion: the text expresses supports for\n         mistreatment of women as individuals.\n        \n         Respond only by writing one of the following categories:\n         not-sexist, threats, derogation, animosity, prejudiced.\n\n        TEXT: ciao\n\n        ANSWER:\n        <｜Assistant｜><think>\n', '<｜begin▁of▁sentence｜>You are an annotator for sexism detection.<｜User｜>Your task 

### Notes

1. You are free to modify the prompt format (**not its content**) as you like depending on your code implementation.

2. Note that the provided prompt has placeholders. You need to format the string to replace placeholders. Huggingface might have dedicated APIs for this.

# [Task 3 - 1.0 points] Inference

We are now ready to define the inference loop where we prompt the model with each pre-processed sample.

### Instructions

In order to get Task 3 points, we require you to:

* Write a ``generate_responses`` function as the one reported below.
* Write a ``process_response`` function as the one reported below.

In [34]:
from tqdm import tqdm
# Attach tokenizers to models
model_DeepSeek.tokenizer = tokenizer_DeepSeek
model_mistral.tokenizer = tokenizer_mistral

In [35]:
def generate_responses(model, prompt_examples):
    """This function implements the inference loop for an LLM.
    Given a set of examples, the model is tasked to generate
    a response.

    Inputs:
      model: LLM model instance for prompting
      prompt_examples: pre-processed text samples

    Outputs:
      generated responses
    """

    # Access tokenizer from the model object
    tokenizer = model.tokenizer

    responses = []
    model.eval()

    for prompt in tqdm(prompt_examples, desc="Generating responses"):
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        with torch.no_grad():
            # Cap the generation length: the expected answer is a single
            # category name, so a small budget (5-20 new tokens) is enough.
            max_new_tokens = 20

            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                pad_token_id=tokenizer.pad_token_id,
            )

        # Decode only the newly generated tokens
        generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

        responses.append(response.strip())

    return responses

In [42]:
def process_response(response):
    """This function takes a textual response generated by the LLM
    and processes it to map the response to one of five class labels.

    Inputs:
      response: generated response from LLM

    Outputs:
      parsed classification response.
      Use the following mapping:
      {
        'not-sexist': 0,
        'threats': 1,
        'derogation': 2,
        'animosity': 3,
        'prejudiced': 4
      }
    """
    label_map = {
        "not-sexist": 0,
        "threats": 1,
        "derogation": 2,
        "animosity": 3,
        "prejudiced": 4,
    }

    response_lower = response.lower().strip()

    # Check for each label keyword in the response
    for label, idx in label_map.items():
        if label in response_lower:
            return idx

    # Default to 0 (not-sexist) if model fails to follow instructions
    return 0

In [43]:
# Prepare prompts
prompts = prepare_prompts(["text", "provaprova"], prompt, tokenizer_DeepSeek)

# Generate responses
responses = generate_responses(model_DeepSeek, prompts)

# Parse responses to labels
predictions = [process_response(r) for r in responses]

Generating responses: 100%|██████████| 2/2 [00:05<00:00,  2.89s/it]


## Notes

1. According to our tests, it should take you ~10 mins to perform full inference on 300 samples on Colab.

# [Task 4 - 0.5 points] Metrics

In order to evaluate selected LLMs, we need to compute performance metrics.

We compute **macro F1-score** and the ratio of failed responses generated by models (**fail-ratio**).

The fail-ratio measures how frequently the LLM fails to follow instructions and produces a response that does not contain any of the expected categories.

In summary, we parse generated responses as follows:
- **0** if 'not-sexist'
- **1** if 'threats'
- **2** if 'derogation'
- **3** if 'animosity'
- **4** if 'prejudiced'
- **0** if the model's response matches none of the categories above

### Instructions

In order to get Task 4 points, we require you to:

* Write a ``compute_metrics`` function as the one reported below.
* Compute metrics for the two selected LLMs.

In [None]:
def compute_metrics(y_pred, y_true):
    """This function takes predicted and ground-truth labels and computes
    metrics. In particular, this function computes macro F1-score and
    fail-ratio metrics. This function internally invokes
    `process_response` to compute metrics.

    Inputs:
      y_pred: parsed LLM responses
      y_true: ground-truth class labels

    Outputs:
      dictionary containing desired metrics
    """
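
One possible way to fill in this function is sketched below in plain Python. It is not the reference solution: `parse_label`, `macro_f1`, and `compute_metrics_sketch` are hypothetical names, the sketch takes the raw generated responses so that failures can be counted before they are mapped to 0, and the macro F1 is averaged over all five classes (a class with no true or predicted samples contributes 0, which may differ from the defaults of other libraries).

```python
LABELS = ["not-sexist", "threats", "derogation", "animosity", "prejudiced"]


def parse_label(response):
    """Return the class index found in the response, or None on failure."""
    text = response.lower()
    for idx, label in enumerate(LABELS):
        if label in text:
            return idx
    return None


def macro_f1(y_true, y_pred, num_classes=len(LABELS)):
    """Unweighted mean of per-class F1 scores over all classes."""
    scores = []
    for cls in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


def compute_metrics_sketch(responses, y_true):
    """Compute macro F1 and fail-ratio from raw responses and true labels."""
    parsed = [parse_label(r) for r in responses]
    fail_ratio = sum(p is None for p in parsed) / len(parsed)
    # Failed responses fall back to class 0, as in the mapping above.
    y_pred = [0 if p is None else p for p in parsed]
    return {"macro_f1": macro_f1(y_true, y_pred), "fail_ratio": fail_ratio}
```

Counting failures before applying the fallback to class 0 matters: once a failed response has been mapped to 0, it can no longer be told apart from a genuine "not-sexist" prediction.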

# [Task 5 - 1.0 points] Few-shot Inference

So far, we have tested models in a zero-shot fashion: we provide the input text to classify and instruct the model to generate a response.

We are now interested in performing few-shot prompting to see the impact of providing demonstration examples.

To do so, we slightly change the prompt template as follows.

In [None]:
prompt = [
    {"role": "system", "content": "You are an annotator for sexism detection."},
    {
        "role": "user",
        "content": """Your task is to classify input text as not-sexist
         or sexist. If sexist, classify input text according to one
         of the following four categories: threats, derogation,
         animosity, prejudiced discussion.

         Below you find sexist categories definitions:
         Threats: the text expresses intent or desire to harm a woman.
         Derogation: the text describes a woman in a derogative manner.
         Animosity: the text contains slurs or insults towards a woman.
         Prejudiced discussion: the text expresses supports for
         mistreatment of women as individuals.

         Respond only by writing one of the following categories:
         not-sexist, threats, derogation, animosity, prejudiced.

        EXAMPLES: {examples}

        TEXT: {text}

        ANSWER:
        """,
    },
]

The new prompt template reports some demonstration examples to instruct the model.

Generally, we provide an equal number of demonstrations per class as shown in the example below.

In [None]:
prompt = [
    {"role": "system", "content": "You are an annotator for sexism detection."},
    {
        "role": "user",
        "content": """Your task is to classify input text as not-sexist
         or sexist. If sexist, classify input text according to one
         of the following four categories: threats, derogation,
         animosity, prejudiced discussion.

         Below you find sexist categories definitions:
         Threats: the text expresses intent or desire to harm a woman.
         Derogation: the text describes a woman in a derogative manner.
         Animosity: the text contains slurs or insults towards a woman.
         Prejudiced discussion: the text expresses supports for
         mistreatment of women as individuals.

         Respond only by writing one of the following categories:
         not-sexist, threats, derogation, animosity, prejudiced.

         EXAMPLES:
         TEXT: **example 1**
         ANSWER: threats
         TEXT: **example 2**
         ANSWER: not-sexist

         TEXT: {text}

        ANSWER:
        """,
    },
]

## Instructions

In order to get Task 5 points, we require you to:

- Load ``demonstrations.csv`` and encode it into a ``pandas.DataFrame`` object.
- Define a ``build_few_shot_demonstrations`` function as the one reported below.
- Modify ``prepare_prompts`` to support demonstrations.
- Perform few-shot inference as in Task 3.
- Compute metrics as in Task 4.

In [None]:
def build_few_shot_demonstrations(demonstrations, num_per_class=2):
    """Inputs:
      demonstrations: DataFrame wrapping demonstrations.csv
      num_per_class: number of demonstrations per class

    Outputs:
      list of demonstrations to inject into the prompt template.
    """
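
A possible shape for this function is sketched below. It is a sketch under assumptions: the column names `text` and `label` are placeholders that must be adapted to the actual columns of ``demonstrations.csv``, `format_demonstrations` is a hypothetical helper, and the label values in the file may need to be mapped to the category names used in the prompt.

```python
import pandas as pd


def build_few_shot_demonstrations_sketch(
    demonstrations, num_per_class=2, text_col="text", label_col="label", seed=42
):
    """Sample the same number of (text, label) pairs from every class."""
    # Raises ValueError if a class has fewer than num_per_class rows.
    sampled = demonstrations.groupby(label_col).sample(
        n=num_per_class, random_state=seed
    )
    return list(zip(sampled[text_col], sampled[label_col]))


def format_demonstrations(pairs):
    """Render the pairs in the TEXT/ANSWER layout of the few-shot template."""
    return "\n".join(f"TEXT: {text}\nANSWER: {label}" for text, label in pairs)


# Toy data standing in for demonstrations.csv (hypothetical content).
toy = pd.DataFrame(
    {
        "text": ["a1", "a2", "a3", "b1", "b2", "b3"],
        "label": ["not-sexist"] * 3 + ["threats"] * 3,
    }
)
pairs = build_few_shot_demonstrations_sketch(toy, num_per_class=2)
print(format_demonstrations(pairs))
```

The resulting string can then be passed as the `examples` argument when formatting the few-shot template, alongside `text`. Fixing the random seed keeps the demonstrations identical across models, so that differences in the metrics are not caused by different examples.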

## Notes

1. You are free to pick any value for ``num_per_class``.

2. According to our tests, few-shot prompting increases inference time by some minutes (we experimented with ``num_per_class`` $\in [2, 4]$).

# [Task 6 - 1.0 points] Error Analysis

We are now interested in evaluating model responses and comparing their performance.

This analysis helps us understand:

- Classification task performance gap: are the models good at this task?
- Generation quality: which kind of responses do models generate?
- Errors: which kind of mistakes do models make?

### Instructions

In order to get Task 6 points, we require you to:

* Compare classification performance of selected LLMs in a Table.
* Compute confusion matrices for selected LLMs.
* Briefly summarize your observations on generated responses.
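
For the confusion matrices, a minimal plain-Python sketch is shown below. `confusion_matrix_sketch` is a hypothetical helper that assumes labels are already encoded as integers from 0 to 4, following the mapping used in Task 3; rows index the true class and columns the predicted class.

```python
def confusion_matrix_sketch(y_true, y_pred, num_classes=5):
    """Count how often each true class (row) is predicted as each class (column)."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[true_label][pred_label] += 1
    return matrix


# Toy labels for illustration only.
example_matrix = confusion_matrix_sketch([0, 1, 1, 2], [0, 1, 2, 2])
for row in example_matrix:
    print(row)
```

Off-diagonal cells show which categories a model confuses with one another, which is a useful starting point for the error analysis.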

# [Task 7 - 1.0 points] Report

Wrap up your experiment in a short report (up to 2 pages).

### Instructions

* Use the NLP course report template.
* Summarize each task in the report following the provided template.

### Recommendations

The report is not a copy-paste of graphs, tables, and command outputs.

* Summarize classification performance in Table format.
* **Do not** report command outputs or screenshots.
* Report learning curves in Figure format.
* The error analysis section should summarize your findings.

# Submission

* **Submit** your report in PDF format.
* **Submit** your python notebook.
* Make sure your notebook is **well organized**, with no temporary code, commented sections, tests, etc...

# FAQ

Please check these frequently asked questions before contacting us.

### Model cards

You can pick any open-source model card you like.

We recommend starting from those reported in this assignment.

### Implementation

Everything can be done via ``transformers`` APIs.

However, you are free to test frameworks such as [LangChain](https://www.langchain.com/), [LlamaIndex](https://www.llamaindex.ai/), or [LitParrot](https://github.com/awesome-software/lit-parrot), provided that you correctly address task instructions.

### Task Performance

The task is challenging and zero-shot prompting may show relatively low performance depending on the chosen model.

### Prompt Template

Do not change the provided prompt template.

You are only allowed to change it in case of a possible extension.

### Optimizations

Any kind of code optimization (e.g., speeding up model inference or reducing computational cost) is more than welcome!

### Bonus Points

0.5 bonus points are arbitrarily assigned based on significant contributions such as:

- Outstanding error analysis
- Masterclass code organization
- Evaluate A1 dataset and perform comparison
- Perform prompt tuning

Note that bonus points are only assigned if all task points are attributed (i.e., 6/6).

# The End