<a href="https://colab.research.google.com/github/Andrian0s/ML4NLP1-2024-Tutorial-Notebooks/blob/main/exercises/ex5/ex5_part2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### LLM Prompting and Prompt Engineering Part 2

In part 2, we experiment with prompting instruction-tuned Large Language Models (LLMs), and evaluate their performance on a linguistic annotation task involving structured outputs.

The goal of this assignment is to gain some experience working with instruction-tuned LLMs. To this end, you will learn how to

- query an instruction-tuned LLM with default chat templates (`Llama-3.2-3B-Instruct`)
- parse LLM outputs for structured responses using JSON and `Pydantic`
- implement error handling for edge cases where the model fails to output the expected data format.

The task we use for this purpose is a simple Tokenization and Part-of-Speech tagging task using data taken from Universal Dependencies.

To facilitate working with LLMs, we will again use the Unsloth library. Note that Unsloth provides both freeware and closed-source proprietary software. For our purposes, the freeware is sufficient! For more information on Unsloth, see the docs here.

This notebook is adapted from [this example](https://colab.research.google.com/drive/1T5-zKWM_5OD21QHwXHiV9ixTRR7k3iB9?usp=sharing) by Unsloth.


### NOTE: Expected execution times
We have provided expected execution times throughout the notebook as a guide. These are approximate, but should give you some idea of what to expect. If your runtimes far exceed them, you may want to consider modifying your approach. They are denoted with ⌛.

### NOTE: GPU Usage
You are expected to load the model onto a GPU for inference. Other parts of the code, such as data preparation, do not need a GPU. To avoid waiting for resources unnecessarily, we recommend doing as much as you can on a CPU instance and changing the runtime type only when necessary. We've highlighted the cells that need a GPU with ⚡.

## 1) Installing dependencies

In [1]:
%%capture
!pip install levenshtein
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [2]:
# check unsloth version
expected_version = '2024.10.2'
unsloth_version = !pip list | grep -P 'unsloth\s+' | grep -Po '\S+$'
if unsloth_version[0] != expected_version:
    print(f"Warning! Found Unsloth version {unsloth_version[0]} but expected {expected_version}.")

# check python version
import sys
print(sys.version)

# check gpu info
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

# check RAM info
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0]
Fri Nov 29 23:46:44 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   55C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                 

## 2) Model Loading

In [3]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# Note, here we specify the instruction-tuned version of Llama-3.2-3B
model_name = "unsloth/Llama-3.2-3B-Instruct"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
    )


FastLanguageModel.for_inference(model) # Enable native 2x faster inference

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.11.10: Fast Llama patching. Transformers:4.46.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 3072, padding_idx=128004)
    (layers): ModuleList(
      (0-27): 28 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
          (k_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
          (rotary_emb): LlamaExtendedRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
          (up_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
          (down_proj): Linear4bit(in_features=8192, out_features=3072, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): Llam

## 3) Data Loading and Preparation

In [1]:
# load data
import random
import pandas as pd

seed = 42

random.seed(seed)

dataset_url = "https://raw.githubusercontent.com/tannonk/prompting_exercise/refs/heads/main/data/en_ewt-ud-dev-pos.json"
df = pd.read_json(dataset_url, lines=True)

# For each input sentence, we'll build the target as a list of dictionaries containing keys for the token and pos tag. This is what we want our LLM annotator to predict.
df['target'] = df.apply(lambda x: [{'token': token, 'pos': pos} for token, pos in zip(x['tokens'].split(), x['upos'].split())], axis=1)

# We'll sample 100 items for testing purposes
test_data = df.sample(n=100, random_state=seed)
train_data = df.drop(test_data.index)

print(f"Train data: {len(train_data)}")
print(f"Test data: {len(test_data)}")

test_data.head()


Train data: 1408
Test data: 100


| | sentence | tokens | upos | xpos | target |
|-:|:-|:-|:-|:-|:-|
| 578 | "...there is no companion quite so devoted, so... | " ... there is no companion quite so devoted ,... | PUNCT PUNCT PRON VERB DET NOUN ADV ADV ADJ PUN... | `` , EX VBZ DT NN RB RB JJ , RB JJ , RB JJ CC ... | [{'token': '"', 'pos': 'PUNCT'}, {'token': '..... |
| 1146 | Great computer repair store, highly recommended. | Great computer repair store , highly recommend... | ADJ NOUN NOUN NOUN PUNCT ADV VERB PUNCT | JJ NN NN NN , RB VBN . | [{'token': 'Great', 'pos': 'ADJ'}, {'token': '... |
| 382 | You wear your heart on your sleeve ... and sin... | You wear your heart on your sleeve ... and sin... | PRON VERB PRON NOUN ADP PRON NOUN PUNCT CCONJ ... | PRP VBP PRP\$ NN IN PRP\$ NN , CC IN PRP VBP DT ... | [{'token': 'You', 'pos': 'PRON'}, {'token': 'w... |
| 583 | for Books that Speak for Themselves.... | for Books that Speak for Themselves .... | ADP NOUN PRON VERB ADP PRON PUNCT | IN NNS WDT VBP IN PRP , | [{'token': 'for', 'pos': 'ADP'}, {'token': 'Bo... |
| 966 | yuck !! | yuck !! | INTJ PUNCT | UH . | [{'token': 'yuck', 'pos': 'INTJ'}, {'token': '... |
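To make the `target` construction concrete, here is a minimal sketch for a single row, using the short `yuck !!` example from the table above. The helper name `build_target` is hypothetical (it is not part of the notebook), and the explicit length check is an added assumption: `zip` silently truncates when the two lists differ in length, which would hide a misaligned row.

```python
def build_target(tokens: str, upos: str) -> list:
    """Pair whitespace-separated tokens with their whitespace-separated UPOS tags."""
    token_list = tokens.split()
    tag_list = upos.split()
    # zip() would silently drop trailing items, so fail loudly on a mismatch instead
    if len(token_list) != len(tag_list):
        raise ValueError(f"{len(token_list)} tokens but {len(tag_list)} tags")
    return [{"token": tok, "pos": tag} for tok, tag in zip(token_list, tag_list)]


example_target = build_target("yuck !!", "INTJ PUNCT")
print(example_target)
```

The same function could be applied row-wise with `df.apply`, in place of the lambda used above.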


### TODO: Inspect and describe the data

📝❓ What are the fields and their corresponding values in the dataframe?

📝❓ What is the difference between `upos` and `xpos`?

📝❓ What is the distribution of `upos` labels in the `test_data`?

In [18]:
from collections import Counter

pos = [tok["pos"] for sent in test_data.target.values.tolist() for tok in sent]
dist = Counter(pos)
print(f"Total number of tokens in the test set: {len(pos)}")
dist

Total number of tokens in the test set: 1200


Counter({'NOUN': 219,
         'PUNCT': 146,
         'VERB': 124,
         'ADJ': 111,
         'PRON': 106,
         'ADP': 99,
         'DET': 88,
         'AUX': 72,
         'ADV': 69,
         'PROPN': 69,
         'CCONJ': 41,
         'PART': 17,
         'NUM': 16,
         'SCONJ': 13,
         'INTJ': 6,
         'SYM': 3,
         'X': 1})
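Raw counts are easier to compare as shares of the total. The following is a small sketch that converts the counts printed above into relative frequencies; the counts are copied by hand from that output, so they would need to be recomputed if the sample changes.

```python
from collections import Counter

# Counts copied from the Counter output above
dist = Counter({
    "NOUN": 219, "PUNCT": 146, "VERB": 124, "ADJ": 111, "PRON": 106,
    "ADP": 99, "DET": 88, "AUX": 72, "ADV": 69, "PROPN": 69, "CCONJ": 41,
    "PART": 17, "NUM": 16, "SCONJ": 13, "INTJ": 6, "SYM": 3, "X": 1,
})

total = sum(dist.values())
# most_common() keeps the tags sorted from most to least frequent
rel_freq = {tag: count / total for tag, count in dist.most_common()}

for tag, share in rel_freq.items():
    print(f"{tag:<6} {share:6.1%}")
```

With these counts, the three most frequent tags (`NOUN`, `PUNCT`, `VERB`) together account for roughly 41% of all tokens, so the label distribution is clearly skewed.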

### TODO: Define the basic `PromptTemplate`

Note, you can reuse the solution from part 1 of this exercise here.

In [28]:
class PromptTemplate:
    def __init__(self, task_description, system=None):
        self.task_description = task_description
        self.system = system

    def zero_shot_prompt(self, input_sentence):
        """Build the chat messages for one sentence, with an optional system message."""
        chat = []
        if self.system:  # prepend the system prompt if one was given
            chat.append({'role': 'system', 'content': self.system})
        chat.append({'role': 'user', 'content': f'{self.task_description}\n\n{input_sentence}'})
        return chat

## 3.2 ChatTemplates

Instruction-tuned models are typically finetuned using a predefined `ChatTemplate`.
This means that when using them for inference, it is important that we use the correct `ChatTemplate` in order to avoid "confusing" the model.

You can find more information about model `ChatTemplates` for Huggingface models [here](https://huggingface.co/docs/transformers/en/chat_templating).
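To give a feel for what a chat template does, here is a deliberately simplified sketch that renders a list of messages into a single string using Llama-3-style special tokens. The function `render_llama3_chat` is hypothetical and only an illustration: the real template is a Jinja program that also handles tools and inserts a date header, so in practice you would rely on `tokenizer.apply_chat_template`, which accepts an `add_generation_prompt` argument with the same purpose as the flag below.

```python
def render_llama3_chat(messages, add_generation_prompt=False):
    """Simplified stand-in for a Llama-3-style chat template (illustration only)."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is wrapped in a role header and closed with an end-of-turn token
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


chat = [
    {"role": "system", "content": "You are a careful annotator."},
    {"role": "user", "content": "Tag this sentence: yuck !!"},
]
rendered = render_llama3_chat(chat, add_generation_prompt=True)
print(rendered)
```

Without the generation prompt, the rendered string ends after the user turn, and the model has to decide for itself how to open the next turn.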


In [11]:
from unsloth.chat_templates import get_chat_template

# load the chat_template from unsloth (note, the logic is similar when using native Huggingface, but here we're using Unsloth!)
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1", # for Llama-3.1 and Llama-3.2 models
)

# Inspect the template (note, it looks more complicated than it is!)
print(tokenizer.chat_template)


{{- bos_token }}
{%- if custom_tools is defined %}
    {%- set tools = custom_tools %}
{%- endif %}
{%- if not tools_in_user_message is defined %}
    {%- set tools_in_user_message = true %}
{%- endif %}
{%- if not date_string is defined %}
    {%- set date_string = "26 July 2024" %}
{%- endif %}
{%- if not tools is defined %}
    {%- set tools = none %}
{%- endif %}

{#- This block extracts the system message, so we can slot it into the right place. #}
{%- if messages[0]['role'] == 'system' %}
    {%- set system_message = messages[0]['content'] %}
    {%- set messages = messages[1:] %}
{%- else %}
    {%- set system_message = "" %}
{%- endif %}

{#- System message + builtin tools #}
{{- "<|start_header_id|>system<|end_header_id|>

" }}
{%- if builtin_tools is defined or tools is not none %}
    {{- "Environment: ipython
" }}
{%- endif %}
{%- if builtin_tools is defined %}
    {{- "Tools: " + builtin_tools | reject('equalto', 'code_interpreter') | join(", ") + "

"}}
{%- endif %}
{{- "

### TODO: Prepare your inputs using the `ChatTemplate` for the model.

Note, you should be able to drop your custom `PromptTemplate` string into the model's default `ChatTemplate`.


In [14]:
prompts = []
prompt_template = PromptTemplate("Please tokenize and then Part-of-Speech tag the following sentence with the Universal POS tags.\nTo give an example, the sentence 'I hate prompts!' should be tokenized and tagged as:\n1. I - PRON\n2. hate - VERB\n3. prompts - NOUN\n4. ! - PUNCT\n\nYour turn:")
for idx, row in test_data.iterrows():
    input_text = row["sentence"]
    prompt = prompt_template.zero_shot_prompt(input_text)
    prompts.append(tokenizer.apply_chat_template(prompt, tokenize = False))

## 4) Inference Pipeline

### TODO: Define a function to run inference efficiently with an LLM

Note, you can use the same inference function from part 1 of this exercise here!

In [15]:
# Set up our inference pipeline for generation

# We'll set some default generation args that we'll pass to our inference function
# Following best practices, we'll use a Pydantic class, which helps with validation.
from pydantic import BaseModel

class Generation_Args(BaseModel):
    max_new_tokens: int
    temperature: float
    top_k: int
    top_p: float
    repetition_penalty: float
    do_sample: bool
    use_cache: bool
    min_p: float
    num_return_sequences: int

# Here are some default generation args
generation_args = Generation_Args(
    max_new_tokens = 1024, # note, for this task, we're setting the max_new_tokens to be more appropriate
    temperature = 1.0,
    top_k = 0,
    top_p = 1.0,
    repetition_penalty = 1.0,
    do_sample = True,
    use_cache = True,
    min_p = 0.1,
    num_return_sequences = 1
)


def run_batched_inference(prompts, model, tokenizer, batch_size=10, generation_args=generation_args):
    """
    Runs batched inference on a list of prompts using a given model and tokenizer.

    Set the batch_size to control the number of prompts processed in each batch.
    Depending on the length of your prompts and model size the batch size may need to be adjusted.

    Args:
        prompts (list[str]): List of prompts that are passed to the model
        model (): The model used for generation
        tokenizer (): The tokenizer used for encoding and decoding the prompts
        batch_size (int): Number of prompts to run batched inference on.
        generation_args (Generation_Args): Generation settings forwarded to `model.generate`.

    Returns:
        List[str] containing the decoded outputs (each one is the prompt followed by the generated text).
    """

    # TODO: implement the logic for efficient inference with LLM
    outputs = []

    # Decoder-only models continue from the last position, so pad on the left
    # and reuse the EOS token as the padding token.
    tokenizer.padding_side = 'left'
    tokenizer.pad_token = tokenizer.eos_token

    for batch_idx in range(0, len(prompts), batch_size):
        batch = prompts[batch_idx: batch_idx + batch_size]

        # Note: the templated prompts already start with <|begin_of_text|>, and the
        # tokenizer adds a second one here (visible in the decoded outputs).
        inputs = tokenizer(batch, return_tensors = "pt", padding=True).to(model.device)
        out = model.generate(**inputs,
                             max_new_tokens = generation_args.max_new_tokens,
                             temperature = generation_args.temperature,
                             top_k = generation_args.top_k,
                             top_p = generation_args.top_p,
                             repetition_penalty = generation_args.repetition_penalty,
                             do_sample = generation_args.do_sample,
                             use_cache = generation_args.use_cache,
                             min_p = generation_args.min_p,
                             num_return_sequences = generation_args.num_return_sequences
                             )

        # Each decoded string contains the prompt followed by the completion.
        decoded = tokenizer.batch_decode(out)
        # Remove padding; since pad == EOS, this also strips the EOS markers.
        decoded = [output.replace(tokenizer.pad_token, "") for output in decoded]

        outputs.extend(decoded)

    return outputs
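The function above returns the prompt together with the completion. A common alternative is to keep only the newly generated tokens, which works cleanly with left padding because every prompt in a batch then ends at the same position. The sketch below illustrates both ideas with plain Python lists; the helper names, the pad id `0`, and the token ids are all made up for illustration and do not come from a real tokenizer.

```python
def left_pad(sequences, pad_id):
    """Pad token-id lists on the left so every prompt ends at the same position."""
    width = max(len(seq) for seq in sequences)
    return [[pad_id] * (width - len(seq)) + seq for seq in sequences]


def strip_prompt(generated, prompt_width):
    """Keep only the tokens that follow the (padded) prompt."""
    return [seq[prompt_width:] for seq in generated]


PAD = 0  # hypothetical pad id
prompt_ids = [[11, 12, 13], [21, 22]]  # toy token ids, not real tokenizer output
batch = left_pad(prompt_ids, PAD)

# Pretend the model appended two new tokens (91, 92) to each padded prompt
fake_generation = [seq + [91, 92] for seq in batch]
new_tokens = strip_prompt(fake_generation, prompt_width=len(batch[0]))

print(batch)
print(new_tokens)
```

With real tensors, the equivalent step would typically be slicing the generated ids at the width of the input ids before decoding. Note that doing so changes what the downstream parser sees, so any logic that skips echoed prompt content would have to be adjusted accordingly.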

In [16]:
prompts[0]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nPlease tokenize and then Part-of-Speech tag the following sentence with the Universal POS tags.\nTo give an example, the sentence \'I hate prompts!\' should be tokenized and tagged as:\n1. I - PRON\n2. hate - VERB\n3. prompts - NOUN\n4. ! - PUNCT\n\nYour turn:\n\n"...there is no companion quite so devoted, so communicative, so loving and so mesmerizing as a rat."<|eot_id|>'

## 4.1) Run Inference

### TODO: run inference!

⌛ 10-20 mins

⚡ GPU

In [17]:
# TODO: inference
pred = run_batched_inference(prompts, model, tokenizer)

## 5) Structured output validation

LLMs output text. But in practice, we often want structured data that we can process further with other automatic processes.

For this purpose, JSON is a good target data structure.


### TODO: Define a processing pipeline that extracts and validates the JSON response from the LLM.

Hint: For this you should use a combination of [`Regex`](https://www.w3schools.com/python/python_regex.asp) and [`Pydantic`](https://docs.pydantic.dev/latest/).

The output should be a valid JSON array in which each object contains a token and its corresponding POS tag:

```json
[
    {"token": "there", "pos": "DET"},
    {"token": "is", "pos": "VERB"},
    {"token": "no", "pos": "ADJ"},
    {"token": "companion", "pos": "NOUN"},
    {"token": "quite", "pos": "ADV"},
    {"token": "so", "pos": "ADV"},
    {"token": "devoted", "pos": "ADV"},
    {"token": "so", "pos": "ADV"},
    ...
]
```
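If the model is asked to answer in JSON directly, the reply usually still contains surrounding chatter, so the array has to be located and validated before use. Below is a minimal, standard-library-only sketch of that idea; the function name `extract_json_tokens` and the sample replies are hypothetical, and in the notebook the manual checks would more naturally be expressed as a `Pydantic` model.

```python
import json
import re

VALID_TAGS = {"ADJ", "ADV", "INTJ", "NOUN", "PROPN", "VERB", "ADP", "AUX",
              "CCONJ", "DET", "NUM", "PART", "PRON", "SCONJ", "PUNCT", "SYM", "X"}


def extract_json_tokens(text):
    """Return validated token dicts from a bracketed JSON span in text, or None on failure."""
    # Greedy match: from the first '[' to the last ']' in the reply
    match = re.search(r"\[.*\]", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(items, list):
        return None
    validated = []
    for item in items:
        # Keep only well-formed entries that carry a known tag
        if not isinstance(item, dict):
            continue
        token, pos = item.get("token"), item.get("pos")
        if isinstance(token, str) and isinstance(pos, str) and pos in VALID_TAGS:
            validated.append({"token": token, "pos": pos})
    return validated


reply = ('Sure! Here is the annotation:\n'
         '[{"token": "yuck", "pos": "INTJ"}, {"token": "!!", "pos": "PUNC"}]\n'
         'Hope this helps.')
parsed = extract_json_tokens(reply)
failed = extract_json_tokens("I cannot do that.")
print(parsed)
print(failed)
```

Returning `None` for unparseable replies keeps failures distinguishable from empty annotations, and the evaluation code in this notebook already treats a `None` prediction as zero accuracy.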

In [20]:
from pydantic import BaseModel, field_validator
import re

In [24]:
# TODO
class Tok(BaseModel):
    token: str
    pos: str

    @field_validator("pos")
    @classmethod
    def validate_pos(cls, tag: str) -> str:
        """Accept only the 17 Universal POS tags."""
        valid_pos_tags = ["ADJ", "ADV", "INTJ", "NOUN", "PROPN", "VERB",
                          "ADP", "AUX", "CCONJ", "DET", "NUM", "PART", "PRON", "SCONJ",
                          "PUNCT", "SYM", "X"]
        if tag in valid_pos_tags:
            return tag
        raise ValueError("Invalid pos tag!")

def parse_and_validate(output):
    # Capture lines of the form "<number>. <token> - <tag>"
    token_and_pos = re.findall(r"\n\d{,2}\.\s?(.+) - (.+)", output)
    valid = []
    for token, pos in token_and_pos:
        try:
            tok = Tok(token = token, pos = pos)
            valid.append(tok)
        except ValueError:  # pydantic's ValidationError is a ValueError subclass
            continue
    # The decoded output still contains the prompt, so the first four valid matches
    # are the example from the task description ('I hate prompts!'); skip them.
    return [{"token": tok.token, "pos": tok.pos} for tok in valid[4:]]

processed_outputs = list(map(lambda output: parse_and_validate(output), pred))
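The `[4:]` slice in `parse_and_validate` is easy to overlook, so here is a small sketch of why it is there. Because the decoded output contains the prompt as well as the completion, the regex also matches the four example lines from the task description. The decoded string below is a shortened, made-up stand-in for a real model output, and for simplicity the sketch slices the raw matches rather than the validated ones.

```python
import re

PATTERN = r"\n\d{,2}\.\s?(.+) - (.+)"  # same pattern as in parse_and_validate

# Toy decoded output: the echoed prompt (with its example) plus a made-up model answer
decoded = (
    "To give an example, the sentence 'I hate prompts!' should be tokenized and tagged as:"
    "\n1. I - PRON\n2. hate - VERB\n3. prompts - NOUN\n4. ! - PUNCT"
    "\n\nYour turn:\n\nyuck !!"
    "\n\n1. yuck - INTJ\n2. !! - PUNCT"
)

matches = re.findall(PATTERN, decoded)
prompt_example, model_answer = matches[:4], matches[4:]

print(prompt_example)
print(model_answer)
```

The fixed offset only works as long as the prompt contains exactly four matching, valid example lines; changing the example in the prompt means changing the slice as well.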

## 6) *Evaluation*

In [22]:
# Below is some boilerplate evaluation code. You should not need to make any changes here.

import Levenshtein
import numpy as np
from typing import List, Dict

def evaluate_instance(target: List[Dict], prediction: List[Dict]):
    """
    Evaluates the accuracy of tokenization and part-of-speech (POS) tagging between a target and a predicted sequence.

    Args:
        target (List[Dict]): A list of dictionaries representing the target tokens and POS tags.
        prediction (List[Dict]): A list of dictionaries representing the predicted tokens and POS tags.

    Returns:
        dict: A dictionary containing the token-level accuracy ('Token Acc') and POS accuracy ('POS Acc').
    """

    # If there is no prediction, return zero accuracies
    if prediction is None:
        return {'Token Acc': 0, 'POS Acc': 0}

    # Extract tokens and POS tags from the target and prediction lists
    target_tokens = [item['token'] for item in target]
    target_pos = [item['pos'] for item in target]
    pred_tokens = [item['token'] for item in prediction]
    pred_pos = [item['pos'] for item in prediction]

    # Get alignment operations between the target and predicted tokens using Levenshtein.opcodes()
    opcodes = Levenshtein.opcodes(target_tokens, pred_tokens)

    # Initialize aligned lists to store tokens and POS tags after alignment
    aligned_target_tokens = []
    aligned_target_pos = []
    aligned_pred_tokens = []
    aligned_pred_pos = []

    # Iterate over each operation in the alignment
    for tag, i1, i2, j1, j2 in opcodes:
        # "equal" means the tokens in this range are identical in both sequences
        if tag == 'equal':
            aligned_target_tokens.extend(target_tokens[i1:i2])
            aligned_target_pos.extend(target_pos[i1:i2])
            aligned_pred_tokens.extend(pred_tokens[j1:j2])
            aligned_pred_pos.extend(pred_pos[j1:j2])
        # "replace" means tokens in this range are different between the target and prediction
        elif tag == 'replace':
            aligned_target_tokens.extend(target_tokens[i1:i2])
            aligned_target_pos.extend(target_pos[i1:i2])
            aligned_pred_tokens.extend(pred_tokens[j1:j2])
            aligned_pred_pos.extend(pred_pos[j1:j2])
        # "insert" means tokens were added in the prediction that are not in the target
        elif tag == 'insert':
            aligned_target_tokens.extend(['<MISSING>'] * (j2 - j1))  # Add placeholders for missing target tokens
            aligned_target_pos.extend(['<MISSING>'] * (j2 - j1))      # Add placeholders for missing target POS tags
            aligned_pred_tokens.extend(pred_tokens[j1:j2])
            aligned_pred_pos.extend(pred_pos[j1:j2])
        # "delete" means tokens are present in the target but missing in the prediction
        elif tag == 'delete':
            aligned_target_tokens.extend(target_tokens[i1:i2])
            aligned_target_pos.extend(target_pos[i1:i2])
            aligned_pred_tokens.extend(['<MISSING>'] * (i2 - i1))    # Add placeholders for missing predicted tokens
            aligned_pred_pos.extend(['<MISSING>'] * (i2 - i1))       # Add placeholders for missing predicted POS tags

    # Calculate token-level accuracy
    # We only consider positions where both target and prediction have valid tokens (i.e., not '<MISSING>')
    correct_tokens = [
        1 if tgt == pred else 0
        for tgt, pred in zip(aligned_target_tokens, aligned_pred_tokens)
        if tgt != '<MISSING>' and pred != '<MISSING>'
    ]
    token_accuracy = np.mean(correct_tokens) if correct_tokens else 0

    # Calculate POS accuracy
    # Only consider positions where tokens match and are not '<MISSING>'
    correct_pos = [
        1 if tgt_pos == pred_pos else 0
        for tgt_tok, pred_tok, tgt_pos, pred_pos in zip(aligned_target_tokens, aligned_pred_tokens, aligned_target_pos, aligned_pred_pos)
        if tgt_tok == pred_tok and tgt_tok != '<MISSING>'
    ]
    pos_accuracy = np.mean(correct_pos) if correct_pos else 0

    return {'Token Acc': token_accuracy, 'POS Acc': pos_accuracy}

def get_results(test_data: pd.DataFrame, processed_outputs: List[List[Dict]]):
    """
    Returns a summary dataframe by taking the average of the all results for tokenization and pos-tagging.
    """
    results = []
    for i in range(len(processed_outputs)):
        results.append(evaluate_instance(test_data.iloc[i]['target'], processed_outputs[i]))

    results = pd.DataFrame(results).mean()
    return results
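The two metrics are filtered differently, which is easiest to see on a tiny worked example. The sketch below applies the same filtering rules as `evaluate_instance` to a hand-written alignment; the alignment itself is an assumption made for illustration (the notebook derives it with `Levenshtein.opcodes`), as is the example prediction.

```python
MISSING = "<MISSING>"

# Hand-written alignment (an assumption; the notebook computes it with Levenshtein.opcodes)
# Each entry: (target_token, target_pos, predicted_token, predicted_pos)
aligned = [
    ("I", "PRON", "I", "PRON"),               # equal, tag correct
    ("hate", "VERB", "hate", "NOUN"),         # equal, tag wrong
    ("prompts", "NOUN", "prompts!", "NOUN"),  # replace: tokenization error
    ("!", "PUNCT", MISSING, MISSING),         # delete: token missing in prediction
]

# Token accuracy: only positions where both sides have a real token
token_hits = [tt == pt for tt, _, pt, _ in aligned if tt != MISSING and pt != MISSING]
# POS accuracy: only positions where the tokens themselves match
pos_hits = [tp == pp for tt, tp, pt, pp in aligned if tt == pt and tt != MISSING]

token_acc = sum(token_hits) / len(token_hits) if token_hits else 0
pos_acc = sum(pos_hits) / len(pos_hits) if pos_hits else 0
print(f"Token Acc: {token_acc:.3f}, POS Acc: {pos_acc:.3f}")
```

Here token accuracy is 2/3 and POS accuracy is 1/2. Note that the deleted `!` does not lower token accuracy, because positions with a `<MISSING>` placeholder on either side are excluded from the calculation; tokens that the model omits or adds are therefore not penalized by this metric.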

In [25]:
# To get the results, you should be able to pass your test_data DataFrame and the processed_outputs from above...
get_results(test_data, processed_outputs)

| | 0 |
|:-|-:|
| Token Acc | 0.875727 |
| POS Acc | 0.74387 |


## 7) Manipulating the system prompt

The system prompt is part of the `ChatTemplate` that can help to steer the model.


### TODO: Customise the system prompt for the intended task and re-run inference

Note, this is an experiment. You should try a few different system prompts and report the resulting performance in your report.


📝❓ What was the best system prompt you considered?

📝❓ Were you able to improve the performance by manipulating the system prompt? Please discuss.

⌛ 10-20 mins (per experiment run)

⚡ GPU

In [29]:
# use the task description as the system prompt
prompts = []
system = """
            Your task is to tokenize and Part-of-Speech tag the given sentence.

            Please only use the following universal POS tags:

            "ADJ", "ADV", "INTJ", "NOUN", "PROPN", "VERB",
            "ADP", "AUX", "CCONJ", "DET", "NUM", "PART", "PRON", "SCONJ",
            "PUNCT", "SYM", "X"

            Please format your answer as:

            1. token - tag
            2. token - tag
            ...

            Please include all the punctuations as well.
         """
prompt_template = PromptTemplate("Please tokenize and then Part-of-Speech tag the following sentence with the Universal POS tags.\nTo give an example, the sentence 'I hate prompts!' should be tokenized and tagged as:\n1. I - PRON\n2. hate - VERB\n3. prompts - NOUN\n4. ! - PUNCT\n\nYour turn:",
                                 system = system)
for idx, row in test_data.iterrows():
    input_text = row["sentence"]
    prompt = prompt_template.zero_shot_prompt(input_text)
    prompts.append(tokenizer.apply_chat_template(prompt, tokenize = False))
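The system prompt above is written as an indented triple-quoted string, so the leading spaces on every line become part of the prompt that the model sees. Whether this affects accuracy was not tested here, but if you want to rule it out, the indentation can be removed with `textwrap.dedent` from the standard library. A minimal sketch, using a shortened version of the prompt:

```python
import textwrap

raw_system = """
            Your task is to tokenize and Part-of-Speech tag the given sentence.

            Please format your answer as:

            1. token - tag
            2. token - tag
         """

# dedent() removes the common leading whitespace; strip() drops the outer blank lines
clean_system = textwrap.dedent(raw_system).strip()
print(clean_system)
```

One caveat: the parsing regex in this notebook only matches numbered lines that start directly after a newline. Indented lines in the system prompt currently do not match, but after dedenting they would, so any echoed example lines from the system prompt would then need to be skipped before evaluation.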


In [35]:
prompts[1]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n\n            Your task is to tokenize and Part-of-Speech tag the given sentence.\n\n            Please only use the following universal POS tags: \n            \n            "ADJ", "ADV", "INTJ", "NOUN", "PROPN", "VERB",\n            "ADP", "AUX", "CCONJ", "DET", "NUM", "PART", "PRON", "SCONJ",\n            "PUNCT", "SYM", "X"\n\n            Please format your answer as:\n\n            1. token - tag\n            2. token - tag\n            ...\n\n            Please include all the punctuations as well.\n         <|eot_id|><|start_header_id|>user<|end_header_id|>\n\nPlease tokenize and then Part-of-Speech tag the following sentence with the Universal POS tags.\nTo give an example, the sentence \'I hate prompts!\' should be tokenized and tagged as:\n1. I - PRON\n2. hate - VERB\n3. prompts - NOUN\n4. ! - PUNCT\n\nYour turn:\n\nGreat computer repair store, hi

In [31]:
pred = run_batched_inference(prompts, model, tokenizer)

In [34]:
pred[1]

'<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n\n            Your task is to tokenize and Part-of-Speech tag the given sentence.\n\n            Please only use the following universal POS tags: \n            \n            "ADJ", "ADV", "INTJ", "NOUN", "PROPN", "VERB",\n            "ADP", "AUX", "CCONJ", "DET", "NUM", "PART", "PRON", "SCONJ",\n            "PUNCT", "SYM", "X"\n\n            Please format your answer as:\n\n            1. token - tag\n            2. token - tag\n           ...\n\n            Please include all the punctuations as well.\n         <|start_header_id|>user<|end_header_id|>\n\nPlease tokenize and then Part-of-Speech tag the following sentence with the Universal POS tags.\nTo give an example, the sentence \'I hate prompts!\' should be tokenized and tagged as:\n1. I - PRON\n2. hate - VERB\n3. prompts - NOUN\n4.! - PUNCT\n\nYour turn:\n\nGreat computer repair stor

In [33]:
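def parse_and_validate(output):
    # Capture lines of the form "<number>. <token> - <tag>"
    token_and_pos = re.findall(r"\n\d{,2}\.\s?(.+) - (.+)", output)
    valid = []
    for token, pos in token_and_pos:
        try:
            tok = Tok(token = token, pos = pos)
            valid.append(tok)
        except ValueError:  # pydantic's ValidationError is a ValueError subclass
            continue
    # Note: unlike the first version, the first four valid matches are not dropped here,
    # so the example tokens echoed from the user prompt stay in the prediction.
    return [{"token": tok.token, "pos": tok.pos} for tok in valid]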

processed_outputs = list(map(lambda output: parse_and_validate(output), pred))
get_results(test_data, processed_outputs)


| | 0 |
|:-|-:|
| Token Acc | 0.918594 |
| POS Acc | 0.73925 |


In [36]:
# use the task description as the system prompt, plus an example
prompts = []
system = """
            Your task is to tokenize and Part-of-Speech tag the given sentence.

            Please only use the following universal POS tags:

            "ADJ", "ADV", "INTJ", "NOUN", "PROPN", "VERB",
            "ADP", "AUX", "CCONJ", "DET", "NUM", "PART", "PRON", "SCONJ",
            "PUNCT", "SYM", "X"

            Please format your answer as:

            1. token - tag
            2. token - tag
            ...

            Please include all the punctuations as well, but do not include the type of punctuations in the tag.

            To give an example:

            Asimov likes robots. ->

            1. Asimov - PROPN
            2. likes - VERB
            3. robots - NOUN
            4. . - PUNCT

            Good luck!
         """
prompt_template = PromptTemplate("Please tokenize and then Part-of-Speech tag the following sentence with the Universal POS tags.\nTo give an example, the sentence 'I hate prompts!' should be tokenized and tagged as:\n1. I - PRON\n2. hate - VERB\n3. prompts - NOUN\n4. ! - PUNCT\n\nYour turn:",
                                 system = system)
for idx, row in test_data.iterrows():
    input_text = row["sentence"]
    prompt = prompt_template.zero_shot_prompt(input_text)
    prompts.append(tokenizer.apply_chat_template(prompt, tokenize = False))

In [37]:
pred = run_batched_inference(prompts, model, tokenizer)

In [39]:
pred[1]

'<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n\n            Your task is to tokenize and Part-of-Speech tag the given sentence.\n\n            Please only use the following universal POS tags: \n            \n            "ADJ", "ADV", "INTJ", "NOUN", "PROPN", "VERB",\n            "ADP", "AUX", "CCONJ", "DET", "NUM", "PART", "PRON", "SCONJ",\n            "PUNCT", "SYM", "X"\n\n            Please format your answer as:\n\n            1. token - tag\n            2. token - tag\n           ...\n\n            Please include all the punctuations as well, but do not include the type of punctuations in the tag.\n\n            To give an example:\n\n            Asimov likes robots. ->\n\n            1. Asimov - PROPN\n            2. likes - VERB\n            3. robots - NOUN\n            4.. - PUNCT\n\n            Good luck!\n         <|start_header_id|>user<|end_header_id|>\n\nPlease tokenize 

In [40]:
processed_outputs = list(map(lambda output: parse_and_validate(output), pred))
get_results(test_data, processed_outputs)

| | 0 |
|:-|-:|
| Token Acc | 0.921324 |
| POS Acc | 0.766147 |


---

## 8) Lab report

📝❓ Write your lab report here addressing all questions in the notebook

### Lab Report 

*(Answers to questions are marked with ❓)*

#### **1. Introduction**

We experimented with prompting the instruction-tuned version of the Llama-3.2-3B model on a tokenization and Part-of-Speech (POS) tagging task, and compared different system prompts. The best setting yielded a tokenization accuracy of 0.92 and a POS accuracy of 0.77.

#### **2. Dataset**

(❓ inspect the data) Our dataset contains five fields: the raw sentence as a string (`sentence`), the pre-tokenized sentence with tokens separated by whitespace (`tokens`), the universal POS tags (`upos`), a more fine-grained set of POS tags (`xpos`), and a `target` field containing the desired output as a list of `{'token': TOKEN, 'pos': POS}` entries, one per token, using the universal POS tags. The universal POS tags (`upos`) cover the core part-of-speech categories such as `NOUN` and `VERB`, whereas `xpos` additionally encodes details of the word form, for example `NNS` for plural nouns. Among the 1200 tokens in the test data, `NOUN` is the most frequent label (219 occurrences), followed by `PUNCT` (punctuation, 146 occurrences) and `VERB` (124 occurrences). Some labels are rare, for example `INTJ` (interjections) appears only 6 times.

#### **3. Model and prompting**

We prompted the instruction-tuned version of the Llama-3.2-3B model under three settings: 1) no system prompt, 2) the task description as system prompt, and 3) the task description plus an example as system prompt. In all three settings, the user message contained the same task description and one worked example.

#### **4. Results and Discussions**

The results are summarized in `Table 1`.

|System prompt| Token acc | POS acc|
|:-|-:|-:|
|No|0.88|0.74|
|Task description|0.92|0.74|
|Task description + example|0.92|0.77|

**Table 1** Test accuracy under different settings

(❓ best system prompt) Using both the task description and an example as the system prompt gave the best token and POS accuracy (0.92 and 0.77 respectively). (❓ improvement) Compared with no system prompt, this setting improved token accuracy by about 4.6 percentage points (0.876 to 0.921) and POS accuracy by about 2.2 percentage points (0.744 to 0.766). The task description alone improved token accuracy (0.919) but not POS accuracy (0.739). This small boost might be explained by the ability of the system prompt to guide the model. In our case, the system prompt helped the model focus on the task at hand, and the example implicitly reinforced the explicit instruction to use the Universal POS tags, which might explain the improvement in POS accuracy. The example may also have constrained the format of the model output, making it easier for the regex to capture the tokens and tags and hence potentially reducing the number of missed tokens. Because decoding used sampling (`do_sample = True`, `temperature = 1.0`) and the test set contains only 100 sentences, differences of this size may partly reflect run-to-run variation.
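When reporting improvements, it helps to separate absolute gains (percentage points) from relative gains (percent of the baseline). The following sketch computes both from the unrounded accuracies printed by `get_results` in this notebook; the setting names are just labels chosen for the table, and the numbers come from single runs.

```python
# Unrounded accuracies copied from the get_results outputs in this notebook
results = {
    "No system prompt": {"Token Acc": 0.875727, "POS Acc": 0.74387},
    "Task description": {"Token Acc": 0.918594, "POS Acc": 0.73925},
    "Task description + example": {"Token Acc": 0.921324, "POS Acc": 0.766147},
}

baseline = results["No system prompt"]
gains = {}
for name, scores in results.items():
    for metric, value in scores.items():
        abs_gain = (value - baseline[metric]) * 100                     # percentage points
        rel_gain = (value - baseline[metric]) / baseline[metric] * 100  # percent of baseline
        gains[(name, metric)] = (abs_gain, rel_gain)
        print(f"{name:<28} {metric:<9} {abs_gain:+5.1f} pts  {rel_gain:+5.1f} %")
```

For the best setting this gives roughly +4.6 points (about 5.2% relative) for token accuracy and +2.2 points (about 3.0% relative) for POS accuracy, while the task description alone lowers POS accuracy by about half a point.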


