## Week 9 Homework

For this week's homework, you will get more experience with prompt tuning. As we did in class, you will choose a training dataset and a general language benchmark to assess catastrophic forgetting (I recommend using datasets you plan to use for your final project so you can practice implementing them). You will use `lm_eval` to assess your model's performance before and after prompt tuning. As in class, it is okay if prompt tuning does not work well for your task; the goal of this assignment is to gain implementation experience and to better understand which tasks prompt tuning works well or poorly for. This homework will guide you through the steps.

## Step 1: Download a Model and Assess Its Pretraining Performance (20 points)
As we did in class, start by downloading a model and assessing its pretraining performance on your training task and a general language benchmark using the `lm_eval` package (if your dataset is not an `lm_eval` task, you will need to write your own code to generate responses on a held-out test split of the dataset and assess whether they are correct). Here are step-by-step directions with point values:
- Import necessary dependencies (5 points)
- Load the model and tokenizer (5 points) 
- Set up an `lm_eval` `task_manager` and implement your training task and general language benchmark; make sure to log the samples, and set a limit (n=50) if you'd like to reduce runtime. If you are implementing a training task not in `lm_eval`, load the necessary data, create a training/test split (use a `seed` so you can reproduce the split in future homeworks and your final assignment), and evaluate your model on the test dataset, logging responses and using a limit if you'd like (5 points).
- Print the results and 2 model responses for each task to get a sense of what the model outputs look like before training (5 points)
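
For a training task that is not in `lm_eval`, the scoring step could look roughly like the sketch below. It only covers the "assess whether they are correct" part; the predictions are hypothetical stand-ins for whatever your own generation code produces, and exact match after light normalization is an assumed scoring rule that may not suit every task.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(predictions)


# Hypothetical model outputs and gold answers for three held-out examples.
preds = ["Paris", " 42 ", "blue"]
golds = ["paris", "42", "red"]
accuracy = exact_match_accuracy(preds, golds)
print(f"accuracy = {accuracy:.3f}")
```

Keeping the raw predictions next to the references also makes it easy to print a couple of responses per task, as the last bullet asks.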

In [1]:
import os 
os.environ["HF_HOME"] = "/scratch/ezq9qu/models/cache"
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch
import pandas as pd
import numpy as np
from datasets import load_dataset, Dataset
from tqdm import tqdm
from lm_eval import evaluator
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B", device_map = 'auto', dtype = torch.bfloat16)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [2]:
results = evaluator.simple_evaluate(
    model = "hf", #Specify huggingface model
    model_args = {"pretrained": model, "dtype": "bfloat16", "tokenizer": tokenizer}, #Define model arguments
    tasks = ["hellaswag", "race"], 
    log_samples = True, 
    batch_size = "1",
    limit = 50,
    random_seed = 42,
)

`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
100%|██████████| 50/50 [00:00<00:00, 4678.95it/s]
100%|██████████| 50/50 [00:00<00:00, 3503.66it/s]
Running loglikelihood requests: 100%|██████████| 400/400 [00:17<00:00, 22.98it/s]


In [3]:
results["results"]

{'hellaswag': {'alias': 'hellaswag',
  'acc,none': 0.4,
  'acc_stderr,none': 0.06998542122237653,
  'acc_norm,none': 0.46,
  'acc_norm_stderr,none': 0.07119963311072637},
 'race': {'alias': 'race',
  'acc,none': 0.36,
  'acc_stderr,none': 0.06857142857142856}}
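
The standard errors above are large because only 50 examples were scored per task. The reported values are consistent with the sample standard error of a mean of 0/1 outcomes, sqrt(p(1-p)/(n-1)). That this is the formula `lm_eval` applies is an inference from the numbers, not something confirmed here. A minimal sketch:

```python
import math


def sample_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n binary (0/1) outcomes with mean acc."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


n = 50  # matches limit = 50 in the evaluation call
for task, acc in [("hellaswag", 0.40), ("race", 0.36)]:
    print(f"{task}: acc={acc:.2f}, stderr={sample_stderr(acc, n):.4f}")
```

Under this formula the standard error shrinks roughly with the square root of the sample size, so raising the limit is the main way to get tighter estimates.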

In [55]:
def get_hella_examples(index):
    query = results["samples"]["hellaswag"][index]["doc"]['query']
    choices = results["samples"]["hellaswag"][index]["doc"]["choices"]
    target = results["samples"]["hellaswag"][index]['target']
    resps = max(enumerate(results["samples"]["hellaswag"][index]["resps"]), key=lambda x: x[1][0])[0]

    print(f"Query: {query}: \n")
    for choice in choices:
        print(f"{choice}")
    print("\n")
    print(f"Target: {target}")
    print(f"Response: {resps}")
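
The `Response` printed by this function is the index of the answer choice with the highest log-likelihood. The toy data below illustrates that argmax in isolation. The shape of `resps` (one `[(loglikelihood, is_greedy)]` entry per choice) is an assumption based on how the function above indexes it, and the numbers are made up.

```python
# Toy stand-in for one logged sample's "resps" field (values are invented).
toy_resps = [
    [(-12.4, False)],  # choice 0
    [(-9.8, False)],   # choice 1
    [(-7.1, True)],    # choice 2: highest log-likelihood
    [(-8.3, False)],   # choice 3
]


def predicted_index(resps):
    """Return the index of the choice whose log-likelihood is largest."""
    return max(range(len(resps)), key=lambda i: resps[i][0][0])


print(predicted_index(toy_resps))  # → 2
```

This version compares only the log-likelihood, whereas the function above compares the whole `(loglikelihood, is_greedy)` tuple. The two agree unless two choices tie on log-likelihood.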


In [56]:
get_hella_examples(0)

Query: Roof shingle removal: A man is sitting on a roof. He: 

is using wrap to wrap a pair of skis.
is ripping level tiles off.
is holding a rubik's cube.
starts pulling up roofing on a roof.


Target: 3
Response: 2


In [57]:
get_hella_examples(1)

Query: Clean and jerk: A lady walks to a barbell. She bends down and grabs the pole. The lady: 

swings and lands in her arms.
pulls the barbell forward.
pulls a rope attached to the barbell.
stands and lifts the weight over her head.


Target: 3
Response: 1


In [79]:
import textwrap
import ast

In [87]:
results["samples"]["race"][0]['doc']['problems']

"[{'question': 'What did Nancy try to do before she fell over?', 'answer': 'C', 'options': ['Measure the depth of the river', 'Look for a fallen tree trunk', 'Protect her cows from being drowned', 'Run away from the flooded farm']}, {'question': 'The following are true according to the passage except  _  .', 'answer': 'D', 'options': ['It took Lizzie and Nancy about 20 minutes to get to safety.', 'It was raining harder when Nancy managed to get up.', 'The bad weather made it difficult for rescuers to find Nancy.', 'Nancy took hold of the rope and climbed into the helicopter.']}, {'question': 'What did the local people do to help those in the flooded area according to the passage?', 'answer': 'A', 'options': ['They put up shelter for them in a school.', 'They used helicopters to help carry cows.', 'They helped farmers gather their cows.', 'They set up an organization called Red Cross.']}]"

In [90]:
def get_race_examples(index):
    article = results["samples"]["race"][index]['doc']['article']
    article_wrapped = textwrap.fill(article, width=100)
    problems = results["samples"]["race"][index]['doc']['problems']

    data_list = ast.literal_eval(problems)
    questions = [item['question'] for item in data_list]

    target = results["samples"]["race"][index]["target"]
    resps = max(enumerate(results["samples"]["race"][index]["resps"]), key=lambda x: x[1][0])[0]

    print(f"Article: {article_wrapped} \n\n ")
    print("There are a series of questions...")
    for question in questions:
        print(question)
    print(f"Target: {target}")
    print(f"Resps: {resps}")
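
The `problems` field is stored as a string, which is why it has to be parsed with `ast.literal_eval` before the questions can be read. The sketch below parses a shortened copy of the string shown earlier and also prints the options with the gold answer marked. Mapping the answer letter to an index with `ord` is an illustrative choice, not something the function above does.

```python
import ast

# Shortened copy of the 'problems' string printed above (first question only).
problems_str = (
    "[{'question': 'What did Nancy try to do before she fell over?', "
    "'answer': 'C', "
    "'options': ['Measure the depth of the river', 'Look for a fallen tree trunk', "
    "'Protect her cows from being drowned', 'Run away from the flooded farm']}]"
)

problems = ast.literal_eval(problems_str)  # safely parse the string into a list of dicts
for problem in problems:
    gold_index = ord(problem["answer"]) - ord("A")  # 'A' -> 0, 'B' -> 1, ...
    print(problem["question"])
    for i, option in enumerate(problem["options"]):
        marker = "*" if i == gold_index else " "
        print(f" {marker} {'ABCD'[i]}. {option}")
```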




In [91]:
get_race_examples(0)

Article: The rain had continued for a week and the flood had created a big river which were running by Nancy
Brown's farm. As she tried to gather her cows to a higher ground, she slipped and hit her head on a
fallen tree trunk. The fall made her unconscious for a moment or two. When she came to, Lizzie, one
of her oldest and favorite cows, was licking her face.  At that time, the water level on the farm
was still rising. Nancy gathered all her strength to get up and began walking slowly with Lizzie.
The rain had become much heavier, and the water in the field was now waist high. Nancy's pace got
slower and slower because she felt a great pain in her head. Finally, all she could do was to throw
her arm around Lizzie's neck and try to hang on. About 20 minutes later, Lizzie managed to pull
herself and Nancy out of the rising water and onto a bit of high land, which seemed like a small
island in the middle of a lake of white water.  Even though it was about noon, the sky was so dark
and t

In [92]:
get_race_examples(1)

Article: There is probably no field of human activity in which our values and lifestyles are shown more
clearly and strongly than they are in the clothes that we choose to wear.The dress of an individual
is a kind of "sign language" that communicates a set of information and is usually the basis on
which immediate impressions are formed.Traditionally,a concern for clothes was considered to be an
affair of females,while men took pride in the fact that they were completely lacking in clothes
consciousness . This type of American culture is by degrees changing as man dress takes on greater
variety and color.Even as early as 1955,a researcher in Michigan said that _ White collar workers in
particular viewed dress as a symbol of ability,which could be used to impress or influence
others,especially in the work situation.The white collar worker was described as extremely concerned
about the impression his clothing made on his superiors .Although blue collar workers were less
aware that they m

## Step 2: Prepare Data and Model for Prompt Tuning (20 points)
Step-by-step directions:
- Set the number of virtual tokens and training epochs (1-2 training epochs is fine) (2 points)
- Make sure the tokenizer and model have a `pad` token set (2 points)
- Load your training dataset and do a training/validation split (use a `seed` so you can reproduce the split in future homeworks and your final assignment), ensuring that your data are formatted in an instruction/response format (i.e., make sure each row denotes the instruction/question and response separately; for example, `"Instruct: {instruction}\n\nResponse: {response}"`). If you are using a `lm_eval` task for your training task, make sure your training/validation data are from the training split of the benchmark so that you are not directly training on test benchmark data. If you are not using a `lm_eval` task for your training task, make sure your training dataset does not contain examples from the test split you used above and that your evaluation dataset comes from a split of your training dataset so that you are not tuning performance on the test dataset directly. (10 points)
- Tokenize your training and validation data using the map function (2 points)
- Print 1 training and 1 validation data sample to make sure the mapping worked and the data are formatted properly (2 points)
- Set up a `PromptTuningConfig` using your desired hyperparameters and print the number of trainable parameters (2 points)

In [5]:
num_virtual_tokens = 10
num_epochs = 2
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

In [6]:
data = load_dataset("hellaswag", split="train")

In [7]:
train = data[:30000]
val = data[30000:]

In [8]:
train_dataset = Dataset.from_dict(train)
val_dataset = Dataset.from_dict(val)
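
The slices above take the first 30,000 rows in dataset order, so no seed is involved. A seeded alternative could look like the sketch below, which shuffles row indices with a fixed seed and then cuts them into two groups. The helper name `seeded_split` is hypothetical; the row counts (39,905 total, 30,000 for training) come from the outputs in this notebook.

```python
import random


def seeded_split(n_rows: int, n_train: int, seed: int = 42):
    """Shuffle row indices reproducibly and return (train_indices, val_indices)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # local RNG, so global random state is untouched
    return indices[:n_train], indices[n_train:]


train_idx, val_idx = seeded_split(n_rows=39905, n_train=30000, seed=42)
print(len(train_idx), len(val_idx))
```

The index lists could then be passed to `data.select(...)`. The `datasets` library also provides `Dataset.train_test_split` with a `seed` argument, which should achieve the same thing in one call.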

HellaSwag does not come in the instruction/response format used in class, so I wrote a function to convert each example into that format.

In [9]:
def format_data(example):
    instruction = example['ctx']
    correct_ending_index = int(example['label'])
    response = example['endings'][correct_ending_index]

    text = f"Instruct: Complete the sentence: {instruction}\n\nResponse: {response}"
    return {"text": text}
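
To see what `format_data` produces, it can be run on a single record. The record below is a toy example assembled from the roof prompt and endings printed in Step 1, not an actual row of the training split, and the function is repeated only so the snippet runs on its own.

```python
def format_data(example):
    """Turn one HellaSwag-style record into an instruction/response string."""
    instruction = example["ctx"]
    correct_ending_index = int(example["label"])  # labels are stored as strings
    response = example["endings"][correct_ending_index]
    text = f"Instruct: Complete the sentence: {instruction}\n\nResponse: {response}"
    return {"text": text}


# Toy record (assumption: same field names as the dataset used above).
toy_example = {
    "ctx": "A man is sitting on a roof. He",
    "endings": [
        "is using wrap to wrap a pair of skis.",
        "is ripping level tiles off.",
        "is holding a rubik's cube.",
        "starts pulling up roofing on a roof.",
    ],
    "label": "3",
}

formatted = format_data(toy_example)
print(formatted["text"])
```

Only the correct ending is kept, so the model is trained to continue the context with the right answer rather than to choose among four options.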

In [10]:
def tokenize(examples):
    return tokenizer(examples["text"])

In [11]:
formatted_train_data = train_dataset.map(format_data, remove_columns=list(train_dataset.features))
formatted_val_data = val_dataset.map(format_data, remove_columns=list(val_dataset.features))

Map:   0%|          | 0/30000 [00:00<?, ? examples/s]

Map:   0%|          | 0/9905 [00:00<?, ? examples/s]

In [12]:
tokenized_train = formatted_train_data.map(tokenize, batched=True, remove_columns=['text'])
tokenized_val = formatted_val_data.map(tokenize, batched=True, remove_columns=['text'])

Map:   0%|          | 0/30000 [00:00<?, ? examples/s]

Map:   0%|          | 0/9905 [00:00<?, ? examples/s]

In [13]:
tokenized_train[0]

{'input_ids': [641,
  1235,
  25,
  18608,
  279,
  11652,
  25,
  5005,
  11,
  279,
  883,
  13914,
  916,
  279,
  11794,
  18202,
  279,
  3241,
  315,
  264,
  1803,
  11,
  323,
  264,
  5220,
  12233,
  12406,
  15097,
  42532,
  13,
  1221,
  271,
  2582,
  25,
  1154,
  279,
  883,
  9539,
  17592,
  279,
  11794,
  389,
  806,
  1803,
  13],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1]}

In [14]:
tokenized_val[0]

{'input_ids': [641,
  1235,
  25,
  18608,
  279,
  11652,
  25,
  508,
  2708,
  60,
  2585,
  311,
  990,
  697,
  2484,
  8286,
  508,
  2102,
  60,
  95210,
  279,
  27505,
  389,
  279,
  2115,
  3108,
  315,
  279,
  32177,
  3250,
  13,
  508,
  9520,
  60,
  576,
  2484,
  8286,
  374,
  264,
  1293,
  27505,
  11,
  5990,
  3691,
  476,
  17545,
  304,
  1894,
  13,
  3197,
  7726,
  705,
  476,
  1495,
  11,
  419,
  27505,
  686,
  5240,
  264,
  3100,
  389,
  2987,
  279,
  2115,
  476,
  1290,
  3108,
  315,
  697,
  1803,
  311,
  8217,
  382,
  2582,
  25,
  508,
  1966,
  24080,
  60,
  576,
  2484,
  8286,
  686,
  537,
  1281,
  264,
  5112,
  476,
  3100,
  279,
  8286,
  3100,
  389,
  697,
  1803,
  7241,
  279,
  1803,
  374,
  4303,
  13,
  508,
  2102,
  60,
  5443,
  279,
  2484,
  8286,
  311,
  13216,
  264,
  2484,
  311,
  279,
  2115,
  13],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  

In [15]:
from peft import get_peft_model, PromptTuningConfig, TaskType, PromptTuningInit

In [16]:
peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.RANDOM,  # start from randomly initialized virtual tokens
    num_virtual_tokens=num_virtual_tokens,
    tokenizer_name_or_path='Qwen/Qwen3-1.7B',
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 20,480 || all params: 1,720,595,456 || trainable%: 0.0012
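
The trainable parameter count can be checked by hand: prompt tuning learns one embedding vector per virtual token, so the count should be the number of virtual tokens times the embedding width. The width of 2048 below is inferred from the printed count (20,480 / 10), not read from the model config.

```python
num_virtual_tokens = 10
trainable_params = 20_480      # from print_trainable_parameters() above
all_params = 1_720_595_456     # from print_trainable_parameters() above

# Each virtual token is a single embedding vector, so the width falls out by division.
embedding_dim = trainable_params // num_virtual_tokens
trainable_pct = 100 * trainable_params / all_params

print(embedding_dim)            # → 2048
print(f"{trainable_pct:.4f}")   # → 0.0012
```

Everything else in the model stays frozen, which is why the trainable share is so small.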


## Step 3: Prepare Trainer and Train Model (20 points)
Step-by-step directions:
- As we did in class, create `TrainingArguments` that define an output directory (make sure the directory exists like we did in class); batch size; learning rate; number of training epochs; `logging_steps`, `eval_strategy`, and `eval_steps` for evaluating on the training and validation dataset; set the number of `save_steps`; and set `load_best_model_at_end` to `True` (5 points)
- Create a trainer using your training arguments and defining the model, `train_dataset`, `eval_dataset`, and `data_collator` (5 points)
- Run `trainer.train()` to train your model (5 points)
- Save your model after training in a folder in your output_directory called best_model since the trainer will automatically load your best model at the end of training (5 points)

In [17]:
from transformers import TrainingArguments

In [18]:
def create_training_arguments(path, learning_rate= 0.001, epochs=num_epochs, eval_steps= 500):
    training_args = TrainingArguments(
        output_dir = path, #specify path for trained model weights
        auto_find_batch_size = True, #automatically find batch size
        learning_rate = learning_rate,
        num_train_epochs = epochs, 
        logging_steps = eval_steps, #this is how often we log training results
        eval_strategy = "steps", #evaluate every `eval_steps` steps (500 by default here)
        eval_steps = eval_steps,
        save_steps = eval_steps,
        load_best_model_at_end = True,
    )
    return training_args

In [19]:
working_dir = "/scratch/ezq9qu"
output_directory = os.path.join(working_dir, "prompt_tuned_trained_model")
os.makedirs(output_directory, exist_ok=True)  # create the directory if it does not already exist

In [20]:
training_args = create_training_arguments(output_directory)

In [21]:
from transformers import Trainer, DataCollatorForLanguageModeling
def create_trainer(model, training_args, train_dataset, eval_dataset):
    trainer = Trainer(
        model = model,
        args = training_args,
        train_dataset = train_dataset,
        eval_dataset = eval_dataset,
        data_collator = DataCollatorForLanguageModeling(tokenizer,
                                                        mlm= False),
    )
    return trainer


In [22]:
trainer = create_trainer(model, training_args, tokenized_train, tokenized_val)

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [23]:
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mpfoster317[0m ([33mpfoster317-university-of-virginia[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


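| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500  | 2.7311 | 2.228446 |
| 1000 | 2.2974 | 2.069973 |
| 1500 | 2.2052 | 2.040892 |
| 2000 | 2.1778 | 2.016444 |
| 2500 | 2.1623 | 2.003463 |
| 3000 | 2.1495 | 1.992006 |
| 3500 | 2.1398 | 1.987259 |
| 4000 | 2.125  | 1.978879 |
| 4500 | 2.1266 | 1.97873  |
| 5000 | 2.115  | 1.978866 |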


TrainOutput(global_step=7500, training_loss=2.1856515787760418, metrics={'train_runtime': 2611.9547, 'train_samples_per_second': 22.971, 'train_steps_per_second': 2.871, 'total_flos': 6.482465380388045e+16, 'train_loss': 2.1856515787760418, 'epoch': 2.0})
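
A few quantities can be read off the training output with simple arithmetic. The sketch below derives the number of steps per epoch, the implied number of examples per optimizer step, and the perplexity that corresponds to the lowest validation loss listed in the log above. It assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language modeling.

```python
import math

train_rows = 30_000   # size of the training split
epochs = 2
global_step = 7_500   # from TrainOutput
eval_every = 500      # eval_steps

steps_per_epoch = global_step // epochs            # 3750
examples_per_step = train_rows // steps_per_epoch  # 8 examples per optimizer step
n_evals = global_step // eval_every                # 15 evaluations over the full run

best_val_loss = 1.97873                # lowest validation loss listed in the log
perplexity = math.exp(best_val_loss)   # assumes loss is mean cross-entropy in nats

print(steps_per_epoch, examples_per_step, n_evals)
print(f"perplexity = {perplexity:.2f}")
```

The implied 8 examples per step agrees with the ratio of `train_samples_per_second` to `train_steps_per_second` in the output (22.971 / 2.871).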

In [24]:
trainer.model.save_pretrained(f"{output_directory}/best_model")

## Step 4: Assess Post-training Performance (20 points)
Step-by-step directions:
- Assess post training performance: Set up an `lm_eval` `task_manager` and implement your training task's test split and general language benchmark; make sure to log the samples and you can set a limit (n=50) if you'd like. If you are implementing a training task not in `lm_eval`, remember to use your held-out test data, logging responses and using a limit if you'd like. (10 points)
- Print the results and 2 model responses for each task to get a sense of what the model outputs look like after training (10 points)

In [25]:
results = evaluator.simple_evaluate(
    model = "hf", #Specify huggingface model
    model_args = {"pretrained": model, "dtype": "bfloat16", "tokenizer": tokenizer}, #Define model arguments
    tasks = ["hellaswag", "race"], 
    log_samples = True, 
    batch_size = "1",
    limit = 50,
    random_seed = 42,
)

`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
100%|██████████| 50/50 [00:00<00:00, 4612.57it/s]
100%|██████████| 50/50 [00:00<00:00, 3536.63it/s]
Running loglikelihood requests: 100%|██████████| 400/400 [00:17<00:00, 23.20it/s]


In [26]:
results["results"]

{'hellaswag': {'alias': 'hellaswag',
  'acc,none': 0.44,
  'acc_stderr,none': 0.07091242083423345,
  'acc_norm,none': 0.5,
  'acc_norm_stderr,none': 0.07142857142857142},
 'race': {'alias': 'race',
  'acc,none': 0.34,
  'acc_stderr,none': 0.0676726816132972}}
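
To judge whether the changes from Step 1 are larger than sampling noise, the difference in accuracy can be compared with the combined standard error. The sketch below treats the before and after runs as independent samples, which is a simplifying assumption: both runs scored the same 50 examples, so a paired comparison would be more appropriate if per-example results are available.

```python
import math

# (accuracy, standard error) copied from the results printed in Steps 1 and 4.
before = {"hellaswag": (0.40, 0.06998542122237653), "race": (0.36, 0.06857142857142856)}
after = {"hellaswag": (0.44, 0.07091242083423345), "race": (0.34, 0.0676726816132972)}


def z_score(acc_a, se_a, acc_b, se_b):
    """Difference in accuracy divided by the combined standard error."""
    return (acc_b - acc_a) / math.sqrt(se_a**2 + se_b**2)


z = {task: z_score(*before[task], *after[task]) for task in before}
for task, value in z.items():
    print(f"{task}: change = {after[task][0] - before[task][0]:+.2f}, z = {value:+.2f}")
```

A value well below 1 in absolute terms means the change is smaller than one combined standard error, so it cannot be distinguished from noise at this sample size.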

In [93]:
get_hella_examples(2)

Query: Canoeing: Two women in a child are shown in a canoe while a man pulls the canoe while standing in the water, with other individuals visible in the background. The child and a different man: 

are then shown paddling down a river in a boat while a woman talks.
are driving the canoe, they go down the river flowing side to side.
sit in a canoe while the man paddles.
walking go down the rapids, while the man in his helicopter almost falls and goes out of canoehood.


Target: 2
Response: 2


In [94]:
get_hella_examples(3)

Query: High jump: A boy is running down a track. The boy: 

runs into a car.
gets in a mat.
lifts his body above the height of a pole.
stands on his hands and springs.


Target: 2
Response: 1


In [95]:
get_race_examples(2)

Article: Little Tommy was doing very badly in math. His parents had tried everything--tutors, cards, special
learning centers--in short, everything they could think of. Finally they took Tommy to a catholic
school. After the first day, little Tommy came home with a very serious look on his face. He didn't
kiss his mother hello. Instead, he went straight to his room and started studying. Books and papers
were spread out all over the room and little Tommy was hard at work. His mother was surprised. She
called him down to dinner and as soon as he finished eating, he went back to his room, without a
word. In no time he was back hitting the books as hard as before. This went on for some time, day
after day while the mother tried to understand what was happening. Finally, little Tommy brought
home his report card. He quietly put it on the table and went up to his room and hit the books. His
mom looked at it and to her surprise, little Tommy got an A in math. She could no longer hold her
curi

In [96]:
get_race_examples(3)

Article: Give it five minutes I used to be a hothead. Whenever anyone said anything, I'd think of a way to
disagree. I'd push back hard if something didn't fit my world-view. It's like I had to be first with
an opinion -- as if being first meant something. But what it really meant was that I wasn't thinking
hard enough about the problem. The faster you react, the less you think. Not always, but often. This
came to a head back in 2007. I was speaking at the Business Innovation Factory conference in
Providence, RI. So was Richard Saul Wurman. After my talk Richard came up to introduce himself and
compliment my talk. That was very generous of him. He certainly didn't have to do that. And what did
I do? I pushed back at him about the talk he gave. While he was making his points on stage, I was
taking an inventory of the things I didn't agree with. And when presented with an opportunity to
speak with him, I quickly pushed back at some of his ideas. I must have seemed like such an asshole.
H

## Step 5: Interpretation (20 points)
In one to two paragraphs, interpret the results from prompt tuning. Discuss whether you think prompt tuning worked for your task, citing quantitative and qualitative evidence from Steps 1 and 4 (10 points). Also discuss why you think prompt tuning performed the way it did for your task (i.e., why did it succeed or fail?), citing any resources you used to formulate your argument (10 points).

The model became slightly better at the training task, HellaSwag, after prompt tuning. On the 50 evaluated examples, accuracy rose from 40% to 44% and normalized accuracy rose from 46% to 50%. This gain came at a small cost on the general benchmark: RACE accuracy fell from 36% to 34%. Both changes are smaller than the reported standard errors (about 0.07 with n=50), so they are suggestive rather than conclusive. I think prompt tuning worked in this case because the training set was large and well formatted. Because HellaSwag is available through the lm-eval harness, I could use its training split of almost 40,000 examples, which gave the model a lot of data to learn from. The data are also of good quality; a lot of work has gone into maintaining the prompts in the HellaSwag dataset. These two factors likely account for the improvement on the HellaSwag metrics.