# Tutorial 3

**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: Transformers, Huggingface, Prompting

# Contact

For any doubt, question, issue or help, you can always contact us at the following email addresses:

Teaching Assistants:

* Federico Ruggeri -> federico.ruggeri6@unibo.it
* Eleonora Mancini -> e.mancini@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

## PART 0 ($\sim$5 mins)
*   Downloading a **dataset**.
*   Encoding a **dataset**.

## PART I ($\sim$30 mins)

*   Text encoding with transformers.
*   Model definition.
*   Model training and evaluation with huggingface APIs.

## PART II ($\sim$30 mins)

*   Prompting 101
*   Sentiment analysis with prompting
*   Advanced prompting

## Preliminaries

First of all, we need to import some useful packages that we will use during this hands-on session.

In [None]:
# system packages
from pathlib import Path
import shutil
import urllib.request  # the `request` submodule must be imported explicitly for urlretrieve
import tarfile
import sys

# data and numerical management packages
import pandas as pd
import numpy as np

# useful during debugging (progress bars)
from tqdm import tqdm

In [None]:
!pip install transformers
!pip install datasets
!pip install accelerate -U
!pip install evaluate
!pip install bitsandbytes

In [None]:
import torch
torch.cuda.is_available()

In [None]:
!nvidia-smi

In [None]:
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
        'width': 2560,
        'height': 1440,
        'scroll': True,
})

# Data

We will use the IMDB dataset first introduced in tutorial 1.

* [**Stats**] A dataset of 50k movie reviews used for sentiment analysis: 25k with positive sentiment and 25k with negative sentiment.
* [**Sentiment**] We consider sentiment labels for classification.

We start by **downloading** the dataset and **extracting** it to a folder.

In [None]:
class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(download_path: Path, url: str):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=download_path, reporthook=t.update_to)

        
def download_dataset(download_path: Path, url: str):
    print("Downloading dataset...")
    download_url(url=url, download_path=download_path)
    print("Download complete!")

def extract_dataset(download_path: Path, extract_path: Path):
    print("Extracting dataset... (it may take a while...)")
    with tarfile.open(download_path) as loaded_tar:
        loaded_tar.extractall(extract_path)
    print("Extraction completed!")

In [None]:
url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
dataset_name = "aclImdb"

print(f"Current work directory: {Path.cwd()}")
dataset_folder = Path.cwd().joinpath("Datasets")

if not dataset_folder.exists():
    dataset_folder.mkdir(parents=True)

dataset_tar_path = dataset_folder.joinpath("Movies.tar.gz")
dataset_path = dataset_folder.joinpath(dataset_name)

if not dataset_tar_path.exists():
    download_dataset(dataset_tar_path, url)

if not dataset_path.exists():
    extract_dataset(dataset_tar_path, dataset_folder)

#### Data Format

Just like in the first assignment, we need a **high level view** of the dataset that is helpful to our needs. 

We encode the dataset into a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

In [None]:
dataframe_rows = []

for split in ['train', 'test']:
    for sentiment in ['pos', 'neg']:
        folder = dataset_folder.joinpath(dataset_name, split, sentiment)
        for file_path in folder.glob('*.txt'):            
            with file_path.open(mode='r', encoding='utf-8') as text_file:
                text = text_file.read()
                score = file_path.stem.split("_")[1]
                score = int(score)
                file_id = file_path.stem.split("_")[0]

                num_sentiment = 1 if sentiment == 'pos' else 0

                dataframe_row = {
                    "file_id": file_id,
                    "score": score,
                    "sentiment": num_sentiment,
                    "split": split,
                    "text": text
                }

                dataframe_rows.append(dataframe_row)

In [None]:
folder = Path.cwd().joinpath("Datasets", "Dataframes", dataset_name)
if not folder.exists():
    folder.mkdir(parents=True)

# transform the list of rows in a proper dataframe
df = pd.DataFrame(dataframe_rows)
df = df[["file_id", 
         "score",
         "sentiment",
         "split",
         "text"]
       ]
df_path = folder.with_name(dataset_name + ".pkl")
df.to_pickle(df_path)

# PART I

*   Text encoding with Transformers.
*   Model definition.
*   Model training and evaluation with huggingface APIs.

## 1. Text encoding with Transformers.

In tutorial 1, we have seen how to define standard machine learning models to address sentiment classification.

However, we know that Transformer-based models are one of the strongest baselines when assessing a task or benchmarking on a novel corpus.

Before defining our transformer-based classifier, we need to encode text inputs into numerical format.

As in Tutorial 1, we are going to **tokenize** input texts to perform token indexing.

### 1.1 Encoding the dataset

First, we are going to use the ``datasets`` library to encode our dataset into a handy wrapper for computational speedup.

In [None]:
from datasets import Dataset

# Slicing for showcasing purposes only!
train_df = df.loc[df['split'] == "train"].sample(frac=1.0)[:5000]
test_df = df.loc[df['split'] == "test"].sample(frac=1.0)[:1000]

train_data = Dataset.from_pandas(train_df)
test_data = Dataset.from_pandas(test_df)

Let's inspect the newly defined `Dataset` instances

In [None]:
print(train_data)
print(test_data)

### 1.2 Tokenization

Transformers typically rely on sub-word level tokenization, using algorithms such as WordPiece (the one used by `distilbert-base-uncased`), Byte-Pair Encoding, or [SentencePiece](https://github.com/google/sentencepiece).

In particular, the `transformers` library offers the `AutoTokenizer` class to quickly retrieve our chosen transformer's ad-hoc tokenizer.

In [None]:
from transformers import AutoTokenizer

model_card = 'distilbert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(model_card)

The `model_card` variable defines the *path* where to look for our pre-trained model.

You can browse [huggingface's model hub](https://huggingface.co/models) to pick the model card according to your preference.

We proceed by tokenizing the movie review texts with our tokenizer.

In [None]:
def preprocess_text(texts):
    return tokenizer(texts['text'], truncation=True)

train_data = train_data.map(preprocess_text, batched=True)
test_data = test_data.map(preprocess_text, batched=True)

Let's inspect the preprocessed `Dataset` instances.

In [None]:
print(train_data)
print(test_data)

In [None]:
print(train_data['input_ids'][50])

In [None]:
print(train_data['attention_mask'][50])

We can perform a quick *sanity check* to evaluate the tokenization process.

In [None]:
original_text = train_data['text'][50]
decoded_text = tokenizer.decode(train_data['input_ids'][50])

print(original_text)
print()
print()
print(decoded_text)
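
To build some intuition for sub-word tokenization, here is a simplified sketch of the greedy longest-match-first strategy followed by WordPiece-style tokenizers. It is not the actual `transformers` implementation: the function name `wordpiece_tokenize` and the toy vocabulary are assumptions made up for illustration.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first sub-word split of a single lowercase word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, NOT the real DistilBERT one.
toy_vocab = {"explor", "##ing", "level", "##s", "deep", "##er"}

for word in ["exploring", "levels", "deeper", "zzz"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

Words that are not in the vocabulary as a whole are split into known pieces, which is why the decoded text above can differ slightly from the original one.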

### 1.3 Vocabulary

We **do not** necessarily need to build a vocabulary since transformers already come with their own! 

**However**, it is still possible to add new tokens to the vocabulary to adapt the model to the given use case.

```
tokenizer.add_tokens(new_tokens=new_tokens)
```

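The newly provided tokens are appended to the tokenizer vocabulary with new indexes, so the model embedding matrix has to be resized accordingly (e.g., via `model.resize_token_embeddings(len(tokenizer))`).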

### 1.4 Special tokens

**Pay attention** to used special tokens and their corresponding token ids.

Each transformer model has its own special tokens ([CLS], [SEP], [PAD], [EOS], etc...).

Thus, the same special token may be mapped to different token ids in distinct transformer models.

### 1.5 Text cleaning

We didn't perform any kind of text cleaning before performing text encoding.

This is usually because transformer tokenizers **have their own text cleaning process** to perform tokenization.

Thus, models **may be sensitive** to custom operations!

In [None]:
example_text = "couldn't"
encoded_example = tokenizer.encode_plus(example_text, add_special_tokens=False)
print(encoded_example.tokens())

In [None]:
example_text = "At one point,some kids are wandering through the deeper levels, exploring."
encoded_example = tokenizer.encode_plus(example_text, add_special_tokens=False)
print(encoded_example.tokens())

#### Example

`bert-base-uncased` is trained on lowercased text.

**Check model cards** on huggingface to know more about the models you use and inspect their text encoding pipeline to understand how they behave.

#### Homework 📖

Experiment with different model cards.

Experiment with text cleaning and evaluate its impact on classification.

## 2. Model definition

We are now ready to define our transformer-based classifier.

### 2.1 Data Formatting

We first need to format input data to be fed as mini-batches in a training/evaluation procedure.

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

The ``DataCollatorWithPadding`` receives a batch of

```
(input_ids, attention_mask, token_type_ids, label)
```

tuples and **dynamically pads** ``input_ids``, ``attention_mask`` and ``token_type_ids`` to the maximum sequence length in the batch. 

Intuitively, this operation saves a lot of memory compared to padding to the global maximum sequence length, while it introduces a reasonable computational overhead.

### Note

The above example is just one way out of many to perform dynamic batch padding: it really depends on which data structures you are using.
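
As an illustration, dynamic padding on plain Python lists could be sketched as follows. This is only a minimal sketch, not what ``DataCollatorWithPadding`` does internally: the `pad_batch` helper, the `pad_id=0` default and the token ids are assumptions, and `max_length` is assumed to be at least as long as the longest sequence.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad token-id lists to a common length and build the attention masks."""
    # Without max_length we pad to the longest sequence of THIS batch.
    target = max_length if max_length is not None else max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy token ids (hypothetical values).
batch = [[101, 7592, 102], [101, 2023, 3185, 2003, 2307, 102]]

dynamic = pad_batch(batch)                 # pad to the longest sequence in the batch
static = pad_batch(batch, max_length=512)  # pad to a global maximum length

print(len(dynamic["input_ids"][0]), len(static["input_ids"][0]))
```

With dynamic padding each row of this batch has 6 entries instead of 512, which is where the memory saving comes from.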

### 2.2 Model definition

Defining a transformer-based model with huggingface is pretty straightforward!

Since we are dealing with text classification, we can use off-the-shelf `AutoModelForSequenceClassification`.

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_card,
                                                           num_labels=2,
                                                           id2label={0: 'NEG', 1: 'POS'},
                                                           label2id={'NEG': 0, 'POS': 1})

Let's first check the loaded model architecture.

In [None]:
print(model)

**That's it!**

That's the simplicity of huggingface's APIs.

The model is ready to use for classification.

### 2.3 Custom architectures

There are plenty of pre-defined model architectures $\rightarrow$ [auto classes](https://huggingface.co/docs/transformers/model_doc/auto)

In more complex scenarios, we may want to define a custom architecture where the pre-trained model is part of it.

In these cases, the way you do it strongly depends on the underlying neural library.

However, there exist several high-level APIs depending on your needs.

## 3. Model training and evaluation

We are now ready to define the training and evaluation procedures to test our model on the IMDB dataset.

In particular, we are going to use ``Trainer`` APIs to efficiently perform training.

### 3.1 Metrics

First, we define classification metrics for evaluation.

In [None]:
from sklearn.metrics import f1_score, accuracy_score

def compute_metrics(output_info):
    predictions, labels = output_info
    predictions = np.argmax(predictions, axis=-1)
    
    f1 = f1_score(y_pred=predictions, y_true=labels, average='macro')
    acc = accuracy_score(y_pred=predictions, y_true=labels)
    return {'f1': f1, 'acc': acc}
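
To make the input of `compute_metrics` more concrete, the following sketch reproduces the same steps (argmax over logits, accuracy, macro F1) by hand on a toy batch. The logits and labels are made-up values, and the helper names are assumptions used only for illustration.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def f1_for_class(preds, labels, cls):
    """Binary F1 score treating `cls` as the positive class."""
    tp = sum(1 for p, y in zip(preds, labels) if p == cls and y == cls)
    fp = sum(1 for p, y in zip(preds, labels) if p == cls and y != cls)
    fn = sum(1 for p, y in zip(preds, labels) if p != cls and y == cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


# Toy model outputs: one row of (NEG, POS) logits per sample.
logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.7, 0.9]]
labels = [0, 1, 1, 1]

preds = [argmax(row) for row in logits]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
# Macro F1: unweighted average of the per-class F1 scores.
macro_f1 = sum(f1_for_class(preds, labels, c) for c in (0, 1)) / 2

print(preds, accuracy, round(macro_f1, 3))
```

Here the third sample is misclassified, so accuracy is 0.75, while macro F1 averages the two per-class scores (2/3 for the negative class and 0.8 for the positive class).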

### Huggingface's metrics

Huggingface offers the **evaluate** package that contains several evaluation metrics (e.g., accuracy, f1, squad-f1, etc...).

In [None]:
import evaluate

acc_metric = evaluate.load('accuracy')
f1_metric = evaluate.load('f1')

def compute_metrics(output_info):
    predictions, labels = output_info
    predictions = np.argmax(predictions, axis=-1)
    
    f1 = f1_metric.compute(predictions=predictions, references=labels, average='macro')
    acc = acc_metric.compute(predictions=predictions, references=labels)
    return {**f1, **acc}
    

### 3.2 Training Arguments

The ``Trainer`` object can be extensively customized.

Feel free to check the [documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) on training arguments.

We first rename the `sentiment` column to `label`, since this is the default name expected for the target input of `AutoModelForSequenceClassification`.

In [None]:
train_data = train_data.rename_column('sentiment', 'label')
test_data = test_data.rename_column('sentiment', 'label')

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="test_dir",                 # where to save model
    learning_rate=2e-5,                   
    per_device_train_batch_size=8,         # accelerate defines distributed training
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="epoch",           # when to report evaluation metrics/losses (named `eval_strategy` in recent transformers releases)
    save_strategy="epoch",                 # when to save checkpoint
    load_best_model_at_end=True,
    report_to='none'                       # disabling wandb (default)
)

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=test_data,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

### Training schema with collator

<center>
    <img src="images/collator.png" alt="collator" />
</center>

In [None]:
trainer.train()

### 3.3 Evaluation

We now evaluate the trained model on the test set.

In [None]:
test_prediction_info = trainer.predict(test_data)
test_predictions, test_labels = test_prediction_info.predictions, test_prediction_info.label_ids

print(test_predictions.shape)
print(test_labels.shape)

In [None]:
test_metrics = compute_metrics([test_predictions, test_labels])
print(test_metrics)

### Some cleaning before PART II

Let's clean the memory and GPU before switching to instruction-tuned models.

In [None]:
import gc

model = None
del model
trainer = None
del trainer

with torch.no_grad():
    torch.cuda.empty_cache()

gc.collect()

# PART II

*   Prompting 101
*   Sentiment analysis with prompting
*   Advanced prompting

## 1. Prompting 101

Prompting is a technique used to adapt a model to a variety of tasks without requiring fine-tuning.

```
Classify the text into neutral, negative or positive.
Text: {text}
Sentiment:
```

The model receives the above input prompt and performs text classification via completion.

```
Classify the text into neutral, negative or positive.
Text: {text}
Sentiment: {label}
```

Prompting is a very delicate process since natural language is **expressive**, **flexible**, and **ambiguous**.

A certain concept can be expressed in several ways:

* These formulations are semantically **equivalent**.
* Yet, they may lead to **significant** model performance **drifts**.

### 1.1 Sensitivity Factors

There are two main factors to consider when performing prompt-based learning.

#### [Prompt Engineering](https://www.promptingguide.ai/)

Eventually we have to iteratively find the best performing prompt.

This can be done either

* Manually
* Automatically (via an ad-hoc model).


#### [Generation hyper-parameters](https://huggingface.co/docs/transformers/main/generation_strategies#text-generation-strategies)

Finding the optimal text generation strategy is a **critical point** for achieving satisfying performance.

These strategies affect how the model iteratively selects tokens during generation, and they help control phenomena like repetitions, rare words, coherence with the input text, and style.

* [Deterministic] Greedy $\rightarrow$ the most preferred (i.e., highest likelihood) token wins
* [Deterministic] Beam search
* [Stochastic] Top-k sampling
* [Stochastic] Nucleus sampling (or top-p sampling)
* [Contrastive search](https://huggingface.co/blog/introducing-csearch)  $\leftarrow$ **recommended**
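
The difference between greedy selection, top-k sampling and nucleus sampling can be sketched on a toy next-token distribution. This is a simplified illustration, not the `transformers` implementation: the vocabulary, the logits and all helper names are assumptions.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Restrict the distribution to the kept token ids and rescale it."""
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def greedy(probs):
    """Deterministic: the highest-likelihood token wins."""
    return max(range(len(probs)), key=lambda i: probs[i])


def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


def sample(dist, rng):
    """Stochastic: draw one token id from the filtered distribution."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


vocab = ["great", "good", "bad", "awful", "banana"]  # hypothetical tokens
probs = softmax([2.0, 1.5, 0.5, 0.0, -3.0])           # hypothetical logits
rng = random.Random(0)

print("greedy:", vocab[greedy(probs)])
print("top-k (k=2) candidates:", [vocab[i] for i in top_k_filter(probs, k=2)])
print("top-p (p=0.9) candidates:", [vocab[i] for i in top_p_filter(probs, p=0.9)])
print("sampled with top-k:", vocab[sample(top_k_filter(probs, k=2), rng)])
```

Both filters discard the unlikely tail of the distribution before sampling; top-k fixes the number of candidates, while top-p lets it vary with how peaked the distribution is.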

### 1.2 Model types

There are a lot of different large language models and it is quite easy to be confused.

Essentially, we have:

* **Base models** (either decoder-only or encoder-decoder models): very good at text completion.
* **Instruct-based models**: base models specifically fine-tuned to address instructions.
* **Chat-based models**: models specifically fine-tuned to chat.

#### Example

In Huggingface, the distinction is usually reflected in the model name:

* `llama2-7b`            $\rightarrow$ base model
* `llama2-7b-instruct`   $\rightarrow$ instruct-based model
* `llama2-7b-chat`       $\rightarrow$ chat-based model

### 1.3 Preliminaries

We are going to download LLMs from [Huggingface](https://huggingface.co/).

Many of these open-source LLMs require you to accept their "Community License Agreement" to download them.

In summary:

- If you do not have one already, create an account on Huggingface (~2 mins)
- Check an LLM model card page (e.g., [Mistral v3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)) and accept its "Community License Agreement".
- Go to your account -> Settings -> Access Tokens -> Create new token -> "Repositories permissions" -> add the LLM model card you want to use.
- Save the token (we'll need it later)

Once we have created an account and an access token, we need to login to Huggingface via code.

- Type your token and press Enter
- You can say No to Github linking

In [None]:
!huggingface-cli login

After login, you can download all models associated with your access token in addition to those that are not protected by an access token.

## 2. Sentiment analysis with prompting

Let's consider our task once again to evaluate prompt-based models.

### 2.1 Model pipeline

First, we have to define the model pipeline to digest input prompts.

In [None]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

model_card = "microsoft/Phi-3.5-mini-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_card)
tokenizer.pad_token = tokenizer.eos_token

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|end|>")  # end-of-turn token used by the Phi-3.5 chat template
]

In [None]:
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_card,
    return_dict=True,
    quantization_config=bnb_config,
    device_map='auto'
)

In [None]:
generation_config = model.generation_config
generation_config.max_new_tokens = 100
generation_config.eos_token_id = tokenizer.eos_token_id
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.temperature = None
generation_config.num_return_sequences = 1

#### Homework 📖

Experiment with different model cards (either base or chat-based models).

### 2.2 Prompt Template

We first define the prompt template to format data samples.

In [None]:
prompt = [
    {
        'role': 'system',
        'content': 'You are an annotator for sentiment analysis.'
    },
    {
        'role': 'user',
        'content': """Classify the text into negative or positive.
        Respond only POSITIVE or NEGATIVE.

        TEXT: 
        {text}

        SENTIMENT:
        """
    }
]
prompt = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)

Let's inspect the formatted prompt.

In [None]:
print(prompt)

### 2.3 Inference

We are now ready to feed prompts to our model and evaluate its performance.

Let's start with an example.

In [None]:
example_text = "This movie is definitely one of my favorite movies of its kind. The interaction between respectable and morally strong characters is an ode to chivalry and the honor code amongst thieves and policemen."
formatted_example = prompt.format(text=example_text)
parsed_example = tokenizer(formatted_example, return_tensors='pt').to('cuda')
generated = model.generate(input_ids=parsed_example['input_ids'],
                           attention_mask=parsed_example['attention_mask'],
                           generation_config=generation_config,
                           do_sample=False)
output = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
print(output)

Now we try with a random subset of the test set (100 samples, to keep inference time short).

In [None]:
test_df = df.loc[df['split'] == "test"].sample(frac=1.0)[:100]
test_data = Dataset.from_pandas(test_df)

In [None]:
from transformers import DataCollatorWithPadding
from torch.utils.data import DataLoader

def prepend_prompt(example):
    example['text'] = prompt.format(text=example['text'])
    return example

def collate_fn(batch):
    texts = tokenizer.batch_encode_plus([it['text'] for it in batch], return_tensors='pt', padding=True, truncation=True)
    sentiment = torch.tensor([it['sentiment'] for it in batch])
    return texts, sentiment

test_data = test_data.map(prepend_prompt)
test_data = test_data.select_columns(['text', 'sentiment'])
data_loader = DataLoader(test_data,
                         batch_size=1,
                         shuffle=False,
                         collate_fn=collate_fn)

Before running the inference loop, we define a function to parse generated outputs into classification labels.

In [None]:
import re

def extract_response(response):
    match = [m for m in re.finditer('SENTIMENT:', response)][-1]
    parsed = response[match.end():].strip()
    return parsed

def convert_response(response):
    return [0, 1] if 'positive' in response.casefold() else [1, 0]
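
Note that `convert_response` relies on a substring check, so any answer that does not contain "positive" is mapped to the negative class. As a hedged alternative, the sketch below shows a stricter parser that flags ambiguous answers instead of guessing. The `parse_label` function and the example strings are assumptions for illustration; `extract_response` is repeated only to keep the snippet self-contained.

```python
import re


def extract_response(response):
    # Same logic as above: keep what follows the last 'SENTIMENT:' marker.
    match = [m for m in re.finditer('SENTIMENT:', response)][-1]
    return response[match.end():].strip()


def parse_label(answer):
    """Hypothetical stricter parser: returns 1, 0, or None when ambiguous."""
    found = set(re.findall(r'\b(positive|negative)\b', answer.casefold()))
    if found == {'positive'}:
        return 1
    if found == {'negative'}:
        return 0
    return None  # both labels or neither one: flag for manual inspection


decoded = "TEXT:\nA lovely film.\n\nSENTIMENT:\n POSITIVE"
print(parse_label(extract_response(decoded)))
print(parse_label("Not positive at all, so NEGATIVE"))
print(parse_label("I cannot tell."))
```

Counting how many responses end up as `None` gives a quick estimate of how often the model fails to follow the requested output format.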

In [None]:
raw_responses = []
predictions = []
with torch.inference_mode():
    for batch_x, batch_y in tqdm(data_loader, desc="Generating responses"):
        response = model.generate(
            input_ids=batch_x['input_ids'].to(model.device),
            attention_mask=batch_x['attention_mask'].to(model.device),
            generation_config=generation_config,
            do_sample=False,
            use_cache=True
        )
        raw_response = tokenizer.batch_decode(response, skip_special_tokens=True)
        raw_response = [extract_response(item) for item in raw_response]
        raw_responses.extend(raw_response)
        batch_predictions = [convert_response(item) for item in raw_response]
        predictions.extend(batch_predictions)


We now compute classification metrics.

In [None]:
predictions = np.array(predictions)
ground_truth = np.array(test_data['sentiment'])
metrics = compute_metrics([predictions, ground_truth])
print(metrics)

## 3. [Advanced Prompting](https://huggingface.co/docs/transformers/main/tasks/prompting#chain-of-thought)

There is no rule of thumb to perform well on prompting.

Some may argue it is *art*, some others might say it is just *engineering*.

However, here are some **general recommendations**:

* Check **how** the pre-trained model you are using was trained!

* Start **simple** and then refine.

* Instructions at the **start/end** of the prompt $\rightarrow$ based on how most attention layers work.

* **Separate** input text from instructions

* Provide **clear description** of the task: no ambiguity, text format, style, language, etc...

* **Evaluate** the prompt on several models

* Use advanced techniques: **few-shot prompting**, **Chain-of-thought (CoT)**, Least-to-Most (LtM)

### 3.1 From Zero- to Few-shot Prompting

In many situations, a prompt containing instructions is not sufficient for a model to behave properly.

We can improve the prompt by providing **a few** ground-truth examples showing how the model should behave.

```
Classify the text into negative or positive. 
Text: {example1}
Sentiment: {label1}
Text: {example2}
Sentiment: {label2}
Text: {example3}
Sentiment: {label3}
Text: {text}
Sentiment:
```
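
The template above can be filled programmatically. Here is a minimal sketch in plain Python; the `build_few_shot_prompt` helper and the demonstration texts are assumptions for illustration.

```python
def build_few_shot_prompt(demonstrations, text,
                          instruction="Classify the text into negative or positive."):
    """Build a few-shot prompt from (text, label) pairs and the text to classify."""
    lines = [instruction]
    for demo_text, demo_label in demonstrations:
        lines.append(f"Text: {demo_text}")
        lines.append(f"Sentiment: {demo_label}")
    # The final block is left open so that the model completes the label.
    lines.append(f"Text: {text}")
    lines.append("Sentiment:")
    return "\n".join(lines)


demos = [("I loved every minute of it.", "positive"),
         ("A dull, lifeless script.", "negative")]
few_shot_prompt = build_few_shot_prompt(demos, "The cast is superb.")
print(few_shot_prompt)
```

Passing an empty list of demonstrations yields the zero-shot prompt, which makes it easy to compare the two settings with the same code.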

#### Examples may be insufficient

Depending on the task at hand, providing examples may not be sufficient for the model to *understand* the instructions.

Also, the model might ignore provided examples or it might still perform correctly despite using **intentionally wrong** examples!!

#### Lengthy prompts

Adding examples increases the level of detail of the prompt, but it may considerably increase its length.

Pay attention to what ``model_card`` you choose since your model may **truncate** input prompts!

Additionally, a lengthy prompt **increases computation**!
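
One way to keep the prompt within a length budget is to add demonstrations only while they fit. The sketch below uses whitespace-separated words as a crude proxy for tokens, which is an assumption: in practice the count should come from the model tokenizer. The helper names and the budget value are hypothetical.

```python
def rough_token_count(text):
    # Crude proxy (assumption): one token per whitespace-separated word.
    return len(text.split())


def fit_demonstrations(demonstrations, query, max_tokens):
    """Keep demonstrations, in order, until the length budget is exhausted."""
    budget = max_tokens - rough_token_count(query)
    kept = []
    for demo in demonstrations:
        cost = rough_token_count(demo)
        if cost > budget:
            break  # the next demonstration would exceed the budget
        kept.append(demo)
        budget -= cost
    return kept


demos = ["TEXT: I loved it. SENTIMENT: POSITIVE",
         "TEXT: Terrible pacing and acting. SENTIMENT: NEGATIVE",
         "TEXT: A masterpiece from start to finish. SENTIMENT: POSITIVE"]
query = "TEXT: The plot was thin but fun. SENTIMENT:"

kept = fit_demonstrations(demos, query, max_tokens=22)
print(len(kept), "demonstrations fit in the budget")
```

The text to classify is accounted for first, so that it is never the part of the prompt that gets cut.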

#### Examples quality

Choosing the right set of examples has an impact on model performance.

Intuitively, we select examples to maximize (textual) diversity and cover the whole label distribution.

In practice, this may be harder than expected: models are sensitive to prompt formatting.
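
Covering the whole label distribution can be sketched by sampling the same number of demonstrations per label. This is only one possible strategy and it does not address textual diversity; the `sample_balanced_demonstrations` helper and the candidate pool are assumptions for illustration.

```python
import random


def sample_balanced_demonstrations(pool, per_label, seed=42):
    """Sample `per_label` (text, label) pairs for each label in the pool."""
    rng = random.Random(seed)  # fixed seed for reproducible prompts
    by_label = {}
    for text, label in pool:
        by_label.setdefault(label, []).append((text, label))
    selected = []
    for label in sorted(by_label):
        # rng.sample raises ValueError if a label has fewer than per_label items.
        selected.extend(rng.sample(by_label[label], per_label))
    rng.shuffle(selected)  # avoid grouping all examples of one label together
    return selected


# Hypothetical pool of candidate demonstrations.
pool = [("Loved it.", "POSITIVE"), ("Brilliant acting.", "POSITIVE"),
        ("A joy to watch.", "POSITIVE"), ("Awful.", "NEGATIVE"),
        ("A waste of time.", "NEGATIVE"), ("Painfully slow.", "NEGATIVE")]

balanced = sample_balanced_demonstrations(pool, per_label=2)
for text, label in balanced:
    print(label, "|", text)
```

Changing the seed produces a different set of demonstrations, which is a simple way to measure how sensitive the model is to the chosen examples.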

Let's try sentiment analysis again with Few-shot prompting!

In [None]:
prompt = [
    {
        'role': 'system',
        'content': 'You are an annotator for sentiment analysis.'
    },
    {
        'role': 'user',
        'content': """Classify the text into negative or positive.
        Respond only POSITIVE or NEGATIVE.
        
        Here are some examples you can look at.
        EXAMPLES:
        {examples}

        Here is the text to classify.

        TEXT: 
        {text}

        SENTIMENT:
        """
    }
]
prompt = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)

In [None]:
demonstrations = [
    ("Everything is so well done: acting, directing, visuals, settings, photography, casting. If you can enjoy a story of real people and real love - this is a winner.", "POSITIVE"),
    ("This is one of the dumbest films, I've ever seen. It rips off nearly ever type of thriller and manages to make a mess of them all.", "NEGATIVE")
]
demonstrations = '\n'.join([f'TEXT: {text}\nSENTIMENT: {sentiment}' for (text, sentiment) in demonstrations])

In [None]:
example_text = "This movie is definitely one of my favorite movies of its kind. The interaction between respectable and morally strong characters is an ode to chivalry and the honor code amongst thieves and policemen."
formatted_example = prompt.format(text=example_text, examples=demonstrations)
parsed_example = tokenizer(formatted_example, return_tensors='pt').to('cuda')
generated = model.generate(input_ids=parsed_example['input_ids'],
                           attention_mask=parsed_example['attention_mask'],
                           generation_config=generation_config,
                           do_sample=False)
output = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
print(output)

In [None]:
def prepend_prompt(example):
    example['text'] = prompt.format(text=example['text'], examples=demonstrations)
    return example

test_data = Dataset.from_pandas(test_df)
test_data = test_data.map(prepend_prompt)
test_data = test_data.select_columns(['text', 'sentiment'])
data_loader = DataLoader(test_data,
                         batch_size=1,
                         shuffle=False,
                         collate_fn=collate_fn)

In [None]:
raw_responses = []
predictions = []
with torch.inference_mode():
    for batch_x, batch_y in tqdm(data_loader, desc="Generating responses"):
        response = model.generate(
            input_ids=batch_x['input_ids'].to(model.device),
            attention_mask=batch_x['attention_mask'].to(model.device),
            generation_config=generation_config,
            do_sample=False,
            use_cache=True
        )
        raw_response = tokenizer.batch_decode(response, skip_special_tokens=True)
        raw_response = [extract_response(item) for item in raw_response]
        raw_responses.extend(raw_response)
        batch_predictions = [convert_response(item) for item in raw_response]
        predictions.extend(batch_predictions)

In [None]:
predictions = np.array(predictions)
ground_truth = np.array(test_data['sentiment'])
metrics = compute_metrics([predictions, ground_truth])
print(metrics)

#### Homework 📖

Experiment with different few-shot examples and evaluate corresponding model performance.

### 3.2 Chain-of-thought (CoT) Prompting

Providing examples to improve task performance may fail in complex scenarios like reasoning tasks.

CoT prompting forces the model to generate intermediate reasoning steps before providing the final output.

CoT can be achieved either via

* Few-shot examples on how to perform *reasoning*
* Defining the prompt to force *reasoning* (e.g., *let's think step by step*)
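
With CoT, the generated text contains reasoning steps in which both label words may appear, so a plain substring check can be misleading. A possible workaround is to ask the model to close its answer with a fixed marker and to parse only that marker. The `FINAL ANSWER:` marker, the instruction text and the `parse_cot_answer` helper below are assumptions for illustration, and a model is not guaranteed to follow the requested format.

```python
import re

# Hypothetical instruction to append to the prompt.
COT_SUFFIX = ("Think step by step, then end with a line "
              "'FINAL ANSWER: POSITIVE' or 'FINAL ANSWER: NEGATIVE'.")


def parse_cot_answer(generated):
    """Return 1 (positive), 0 (negative), or None if the marker is missing."""
    matches = re.findall(r'FINAL ANSWER:\s*(POSITIVE|NEGATIVE)',
                         generated, flags=re.IGNORECASE)
    if not matches:
        return None
    # Use the last occurrence, in case the instruction is echoed in the output.
    return 1 if matches[-1].upper() == 'POSITIVE' else 0


generated = ("The review mentions a negative first impression, "
             "but the overall tone is enthusiastic.\n"
             "FINAL ANSWER: POSITIVE")
print(parse_cot_answer(generated))
```

Responses parsed as `None` can be counted separately to monitor how often the model ignores the output format.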

Let's try our sentiment analysis task with CoT prompting

In [None]:
prompt = [
    {
        'role': 'system',
        'content': 'You are an annotator for sentiment analysis.'
    },
    {
        'role': 'user',
        'content': """Classify the text into negative or positive.
        Respond with POSITIVE or NEGATIVE.
        Think step by step.
        
        Here are some examples you can look at.
        EXAMPLES:
        {examples}

        Here is the text to classify.

        TEXT: 
        {text}

        SENTIMENT:
        """
    }
]
prompt = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)

In [None]:
example_text = "This movie is definitely one of my favorite movies of its kind. The interaction between respectable and morally strong characters is an ode to chivalry and the honor code amongst thieves and policemen."
formatted_example = prompt.format(text=example_text, examples=demonstrations)
parsed_example = tokenizer(formatted_example, return_tensors='pt').to('cuda')
generated = model.generate(input_ids=parsed_example['input_ids'],
                           attention_mask=parsed_example['attention_mask'],
                           generation_config=generation_config,
                           do_sample=False)
output = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
print(output)

In [None]:
test_data = Dataset.from_pandas(test_df)
test_data = test_data.map(prepend_prompt)
test_data = test_data.select_columns(['text', 'sentiment'])
data_loader = DataLoader(test_data,
                         batch_size=1,
                         shuffle=False,
                         collate_fn=collate_fn)

In [None]:
raw_responses = []
predictions = []
with torch.inference_mode():
    for batch_x, batch_y in tqdm(data_loader, desc="Generating responses"):
        response = model.generate(
            input_ids=batch_x['input_ids'].to(model.device),
            attention_mask=batch_x['attention_mask'].to(model.device),
            generation_config=generation_config,
            do_sample=False,
            use_cache=True
        )
        raw_response = tokenizer.batch_decode(response, skip_special_tokens=True)
        raw_response = [extract_response(item) for item in raw_response]
        raw_responses.extend(raw_response)
        batch_predictions = [convert_response(item) for item in raw_response]
        predictions.extend(batch_predictions)

In [None]:
predictions = np.array(predictions)
ground_truth = np.array(test_data['sentiment'])
metrics = compute_metrics([predictions, ground_truth])
print(metrics)

#### Homework 📖

Experiment with different CoT prompts to enforce intermediate reasoning steps.

For more details check this [page](https://www.promptingguide.ai/techniques/cot).

### 3.3 Prompting vs Fine-tuning

Finally, we may wonder which technique to use.

In short, prompting comes in handy when transferring a pre-trained model to a domain that has some affinity with those seen during training.

In other cases, such as:

* Different domain
* Sensitive data
* Low-resource language
* Domain-specific model constraints

fine-tuning is the preferred choice (to maximize improvements).

# The End!