## Homework 7

For this week's homework, you will get more experience with LoRA. Like we did in class, you will choose a training dataset and a general language benchmark to assess catastrophic forgetting (you must use the same datasets you used last week for prompt tuning so that you can directly compare the results from both methods). You will use `lm_eval` to assess your model's performance before and after LoRA. The goal of this assignment is to get more experience implementing LoRA and compare its effectiveness to lighter approaches like prompt tuning. This homework will guide you through the steps to do this. 

## Step 1: Download a Model and Assess Its Pretraining Performance (20 points)
As we did in class, start by downloading a model and assessing its pretraining performance on your training task and a general language benchmark using the `lm_eval` package (if your dataset is not an `lm_eval` task, you will need to write your own code to generate responses for a held-out test split of the dataset and assess whether they are correct). Here are step-by-step directions with point values:
- Import necessary dependencies (5 points)
- Load the model and tokenizer (5 points) 
- Set up an `lm_eval` `task_manager` and implement your training task and general language benchmark; make sure to log the samples, and set a limit (n=50) if you'd like to reduce runtime. If you are implementing a training task not in `lm_eval`, load the necessary data, conduct a training and test split (using the same `seed` as HW 9), and evaluate your model on the test dataset, logging responses and using a limit if you'd like (5 points).
- Print the results and 2 model responses for each task to get a sense of what the model outputs look like before training (5 points)

In [1]:
import os 
os.environ["HF_HOME"] = "/scratch/ezq9qu/models/cache"
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch
import pandas as pd
import numpy as np
from datasets import load_dataset, Dataset
from peft import get_peft_model, LoraConfig
from tqdm import tqdm
from lm_eval import evaluator
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B", device_map = 'auto', dtype = torch.bfloat16)


In [2]:
import textwrap
import ast

In [3]:
results = evaluator.simple_evaluate(
    model = "hf", #Specify huggingface model
    model_args = {"pretrained": model, "dtype": "bfloat16", "tokenizer": tokenizer}, #Define model arguments
    tasks = ["hellaswag", "race"], 
    log_samples = True, 
    batch_size = "1",
    limit = 50,
    random_seed = 42,
)

`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
100%|██████████| 50/50 [00:00<00:00, 4452.84it/s]
100%|██████████| 50/50 [00:00<00:00, 3296.42it/s]
Running loglikelihood requests: 100%|██████████| 400/400 [00:17<00:00, 22.87it/s]


In [4]:
results['results']

{'hellaswag': {'alias': 'hellaswag',
  'acc,none': 0.4,
  'acc_stderr,none': 0.06998542122237653,
  'acc_norm,none': 0.46,
  'acc_norm_stderr,none': 0.07119963311072637},
 'race': {'alias': 'race',
  'acc,none': 0.36,
  'acc_stderr,none': 0.06857142857142856}}

In [5]:
def get_hella_examples(index):
    query = results["samples"]["hellaswag"][index]["doc"]['query']
    choices = results["samples"]["hellaswag"][index]["doc"]["choices"]
    target = results["samples"]["hellaswag"][index]['target']
    resps = max(enumerate(results["samples"]["hellaswag"][index]["resps"]), key=lambda x: x[1][0])[0]

    print(f"Query: {query}: \n")
    for choice in choices:
        print(f"{choice}")
    print("\n")
    print(f"Target: {target}")
    print(f"Response: {resps}")
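
The `Response` printed by this helper is just the index of the highest-scoring answer choice. As a minimal sketch of that selection, assuming each entry of `resps` holds one `(log_likelihood, is_greedy)` pair per choice (the shape the helper above relies on), the snippet below picks the winner on made-up scores. The second function illustrates a length-normalized pick, which as far as I understand is the idea behind `acc_norm`; the function names and all numbers are hypothetical.

```python
# Hypothetical scores, one [(log_likelihood, is_greedy)] entry per answer choice.
# The numbers are invented for illustration and are not model outputs.
example_resps = [
    [(-45.0, False)],
    [(-30.0, False)],
    [(-26.0, False)],
    [(-28.8, False)],
]
example_choices = [
    "is using wrap to wrap a pair of skis.",
    "is ripping level tiles off.",
    "is holding a rubik's cube.",
    "starts pulling up roofing on a roof.",
]

def predicted_index(resps):
    """Index of the choice with the highest raw log-likelihood."""
    return max(range(len(resps)), key=lambda i: resps[i][0][0])

def predicted_index_normalized(resps, choices):
    """Index of the choice with the highest log-likelihood per character."""
    return max(range(len(resps)), key=lambda i: resps[i][0][0] / len(choices[i]))

print("raw pick:", predicted_index(example_resps))
print("length-normalized pick:", predicted_index_normalized(example_resps, example_choices))
```

With these invented scores the two rules disagree, which is one way `acc` and `acc_norm` can differ on the same item.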

In [6]:
def get_race_examples(index):
    article = results["samples"]["race"][index]['doc']['article']
    article_wrapped = textwrap.fill(article, width=100)
    problems = results["samples"]["race"][index]['doc']['problems']

    data_list = ast.literal_eval(problems)
    questions = [item['question'] for item in data_list]

    target = results["samples"]["race"][index]["target"]
    resps = max(enumerate(results["samples"]["race"][index]["resps"]), key=lambda x: x[1][0])[0]

    print(f"Article: {article_wrapped} \n\n ")
    print("There are a series of questions...")
    for question in questions:
        print(question)
    print(f"Target: {target}")
    print(f"Resps: {resps}")

In [7]:
get_hella_examples(0)

Query: Roof shingle removal: A man is sitting on a roof. He: 

is using wrap to wrap a pair of skis.
is ripping level tiles off.
is holding a rubik's cube.
starts pulling up roofing on a roof.


Target: 3
Response: 2


In [8]:
get_hella_examples(1)

Query: Clean and jerk: A lady walks to a barbell. She bends down and grabs the pole. The lady: 

swings and lands in her arms.
pulls the barbell forward.
pulls a rope attached to the barbell.
stands and lifts the weight over her head.


Target: 3
Response: 1


In [9]:
get_race_examples(0)

Article: The rain had continued for a week and the flood had created a big river which were running by Nancy
Brown's farm. As she tried to gather her cows to a higher ground, she slipped and hit her head on a
fallen tree trunk. The fall made her unconscious for a moment or two. When she came to, Lizzie, one
of her oldest and favorite cows, was licking her face.  At that time, the water level on the farm
was still rising. Nancy gathered all her strength to get up and began walking slowly with Lizzie.
The rain had become much heavier, and the water in the field was now waist high. Nancy's pace got
slower and slower because she felt a great pain in her head. Finally, all she could do was to throw
her arm around Lizzie's neck and try to hang on. About 20 minutes later, Lizzie managed to pull
herself and Nancy out of the rising water and onto a bit of high land, which seemed like a small
island in the middle of a lake of white water.  Even though it was about noon, the sky was so dark
and t

In [10]:
get_race_examples(1)

Article: There is probably no field of human activity in which our values and lifestyles are shown more
clearly and strongly than they are in the clothes that we choose to wear.The dress of an individual
is a kind of "sign language" that communicates a set of information and is usually the basis on
which immediate impressions are formed.Traditionally,a concern for clothes was considered to be an
affair of females,while men took pride in the fact that they were completely lacking in clothes
consciousness . This type of American culture is by degrees changing as man dress takes on greater
variety and color.Even as early as 1955,a researcher in Michigan said that _ White collar workers in
particular viewed dress as a symbol of ability,which could be used to impress or influence
others,especially in the work situation.The white collar worker was described as extremely concerned
about the impression his clothing made on his superiors .Although blue collar workers were less
aware that they m

## Step 2: Prepare Data and Model for LoRA (20 points)
Step-by-step directions:
- Make sure the tokenizer and model have a `pad` token set (2 points)
- Load your training dataset and do a training/validation split (using the same `seed` as HW 9), ensuring that your data are formatted in an instruction/response format (i.e., make sure each row denotes the instruction/question and response separately; for example, `"Instruct: {instruction}\n\nResponse: {response}"`). If you are using an `lm_eval` task for your training task, make sure your training/validation data are from the training split of the benchmark so that you are not directly training on test benchmark data. If you are not using an `lm_eval` task for your training task, make sure your training dataset does not contain examples from the test split you used above and that your evaluation dataset comes from a split of your training dataset so that you are not tuning performance on the test dataset directly. (10 points)
- Tokenize your training and validation data using the map function (2 points)
- Print 1 training and 1 validation data sample to make sure the mapping worked and the data are formatted properly (2 points)
- Set the number of training epochs (1-2 training epochs is fine) (2 points)
- Set up a `LoraConfig` using your desired hyperparameters (including `lora_alpha`, `lora_dropout`, `bias`, `task_type`, and `target_modules`) and print the number of trainable parameters (2 points)
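
One way to make the training/validation split reproducible is to shuffle row indices with a fixed seed and slice the result. The sketch below uses only the standard library; `seeded_split`, the row count, and the `SEED` value are placeholders rather than anything prescribed by the assignment.

```python
import random

def seeded_split(n_rows, n_train, n_val, seed):
    """Shuffle row indices with a fixed seed, then slice out train and validation indices."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # a private RNG keeps the global random state untouched
    return indices[:n_train], indices[n_train:n_train + n_val]

SEED = 42  # placeholder: substitute the seed used in the earlier homework
train_idx, val_idx = seeded_split(n_rows=10_000, n_train=2000, n_val=1000, seed=SEED)
print(len(train_idx), len(val_idx), train_idx[:5])
```

The resulting index lists could then be passed to `Dataset.select`, or the same effect obtained with `Dataset.train_test_split(seed=...)` from the `datasets` library.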

In [11]:
num_virtual_tokens = 10
num_epochs = 2
tokenizer.pad_token = tokenizer.eos_token  # reuse the EOS token as the pad token
model.config.pad_token_id = tokenizer.pad_token_id  # vocabulary id of the pad token

In [12]:
data = load_dataset("hellaswag", split="train")

In [13]:
train = data[:2000]
val = data[2000:3000]

In [14]:
train_dataset = Dataset.from_dict(train)
val_dataset = Dataset.from_dict(val)

In [15]:
def format_data(example):
    instruction = example['ctx']
    correct_ending_index = int(example['label'])
    response = example['endings'][correct_ending_index]

    text = f"Instruct: Complete the sentence: {instruction}\n\nResponse: {response}"
    return {"text": text}

In [16]:
def tokenize(examples):
    return tokenizer(examples["text"])

In [17]:
formatted_train_data = train_dataset.map(format_data, remove_columns=list(train_dataset.features))
formatted_val_data = val_dataset.map(format_data, remove_columns=list(val_dataset.features))


In [18]:
tokenized_train = formatted_train_data.map(tokenize, batched=True, remove_columns=['text'])
tokenized_val = formatted_val_data.map(tokenize, batched=True, remove_columns=['text'])


In [19]:
tokenized_train[0]

{'input_ids': [641,
  1235,
  25,
  18608,
  279,
  11652,
  25,
  5005,
  11,
  279,
  883,
  13914,
  916,
  279,
  11794,
  18202,
  279,
  3241,
  315,
  264,
  1803,
  11,
  323,
  264,
  5220,
  12233,
  12406,
  15097,
  42532,
  13,
  1221,
  271,
  2582,
  25,
  1154,
  279,
  883,
  9539,
  17592,
  279,
  11794,
  389,
  806,
  1803,
  13],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1]}

In [20]:
tokenized_val[0]

{'input_ids': [641,
  1235,
  25,
  18608,
  279,
  11652,
  25,
  362,
  5220,
  374,
  6839,
  48348,
  2348,
  264,
  7002,
  13,
  25694,
  4935,
  261,
  20114,
  525,
  6839,
  304,
  264,
  3054,
  13,
  2155,
  12538,
  271,
  2582,
  25,
  525,
  20459,
  553,
  1105,
  13],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1]}

In [21]:
LORA_R = 64 
LORA_ALPHA = 64
LORA_DROPOUT = .05
lora_config = LoraConfig(
    r = LORA_R, #the lower dimension of the low-rank matrices
    lora_alpha = LORA_ALPHA, #scaling factor for the low-rank update
    lora_dropout = LORA_DROPOUT, #dropout factor to prevent overfitting
    bias = "none",
    task_type = "CAUSAL_LM", #set language modeling as task type
    target_modules = ["q_proj", "v_proj"], #add LoRA modules to every query and value matrix in the attention layers
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 12,845,056 || all params: 1,733,420,032 || trainable%: 0.7410
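
The trainable-parameter count can be checked by hand: each adapted linear layer gains two low-rank matrices with `r * (in_features + out_features)` parameters in total. The sketch below assumes the Qwen3-1.7B dimensions listed in the comments (they are not printed anywhere in this notebook, so verify them against `model.config`); `lora_params` is a hypothetical helper.

```python
# Assumed Qwen3-1.7B dimensions (not shown in the notebook; check model.config).
HIDDEN_SIZE = 2048
NUM_LAYERS = 28
NUM_ATTENTION_HEADS = 16
NUM_KV_HEADS = 8
HEAD_DIM = 128

LORA_R = 64
LORA_ALPHA = 64

def lora_params(in_features, out_features, r):
    """LoRA adds A (r x in_features) and B (out_features x r) to one linear layer."""
    return r * (in_features + out_features)

q_out = NUM_ATTENTION_HEADS * HEAD_DIM  # output width of q_proj
v_out = NUM_KV_HEADS * HEAD_DIM         # output width of v_proj (fewer key/value heads)
per_layer = lora_params(HIDDEN_SIZE, q_out, LORA_R) + lora_params(HIDDEN_SIZE, v_out, LORA_R)
trainable = NUM_LAYERS * per_layer

all_params = 1_733_420_032  # "all params" from the printed summary
print(f"trainable params: {trainable:,} || trainable%: {100 * trainable / all_params:.4f}")
print(f"LoRA scaling (lora_alpha / r): {LORA_ALPHA / LORA_R}")
```

Under these assumed dimensions the arithmetic agrees with the summary printed by `print_trainable_parameters()`.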


## Step 3: Prepare Trainer and Train Model (20 points)
Step-by-step directions:
- As we did in class, create `TrainingArguments` that define an output directory (make sure the directory exists like we did in class); batch size; learning rate; number of training epochs; `logging_steps`, `eval_strategy`, and `eval_steps` for evaluating on the training and validation dataset; set the number of `save_steps`; and set `load_best_model_at_end` to `True` (5 points)
- Create a trainer using your training arguments and defining the model, `train_dataset`, `eval_dataset`, and `data_collator` (5 points)
- Run `trainer.train()` to train your model (5 points)
- Save your model after training in a folder in your output_directory called best_model since the trainer will automatically load your best model at the end of training (5 points)

In [22]:
from transformers import TrainingArguments

def create_training_arguments(path, learning_rate = 0.00001, epochs = 1, eval_steps = 150):
    training_args = TrainingArguments(
        output_dir = path, #specify path for trained model weights
        auto_find_batch_size = True, #automatically find batch size
        learning_rate = learning_rate,
        num_train_epochs = epochs, 
        logging_steps = eval_steps, #this is how often we log training results
        eval_strategy = "steps", #evaluate every `eval_steps` steps
        eval_steps = eval_steps,
        save_steps = eval_steps,
        load_best_model_at_end = True,
    )
    return training_args

In [23]:
working_dir = "/scratch/ezq9qu"
output_directory = os.path.join(working_dir, "trained_lora_model")
os.makedirs(output_directory, exist_ok=True)  # create the directory if it does not exist yet

In [24]:
training_args = create_training_arguments(output_directory)

In [25]:
from transformers import Trainer, DataCollatorForLanguageModeling
def create_trainer(model, training_args, train_dataset, eval_dataset):
    trainer = Trainer(
        model = model,
        args = training_args,
        train_dataset = train_dataset,
        eval_dataset = eval_dataset,
        data_collator = DataCollatorForLanguageModeling(tokenizer,
                                                        mlm= False),
    )
    return trainer
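
To make the role of the data collator more concrete, here is a plain-Python sketch of what a causal-LM collator with `mlm=False` does to a batch: pad every sequence to the longest one, then copy `input_ids` into `labels` with pad positions replaced by `-100` so the loss skips them. This is a simplified stand-in, not the library implementation; `collate_causal_lm`, `PAD_ID`, and right-side padding are assumptions for illustration.

```python
PAD_ID = 0           # hypothetical pad token id
IGNORE_INDEX = -100  # label value that the cross-entropy loss ignores

def collate_causal_lm(batch, pad_id=PAD_ID):
    """Right-pad token id lists to a common length and build labels for causal LM training."""
    max_len = max(len(ids) for ids in batch)
    out = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in batch:
        n_pad = max_len - len(ids)
        padded = ids + [pad_id] * n_pad
        out["input_ids"].append(padded)
        out["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Every position equal to the pad id is masked, including real tokens with that id.
        out["labels"].append([IGNORE_INDEX if tok == pad_id else tok for tok in padded])
    return out

example_batch = collate_causal_lm([[5, 6, 7], [8, 9]])
print(example_batch)
```

Because masking is done by token id, reusing the EOS token as the pad token means genuine EOS tokens would also be excluded from the loss.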

In [26]:
trainer = create_trainer(model, training_args, tokenized_train, tokenized_val)

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [27]:
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mpfoster317[0m ([33mpfoster317-university-of-virginia[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


| Step | Training Loss | Validation Loss |
| --- | --- | --- |
| 150 | 3.285 | 2.753965 |


TrainOutput(global_step=250, training_loss=3.05614794921875, metrics={'train_runtime': 54.5606, 'train_samples_per_second': 36.656, 'train_steps_per_second': 4.582, 'total_flos': 1184522318856192.0, 'train_loss': 3.05614794921875, 'epoch': 1.0})
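
The numbers in `TrainOutput` can be reproduced with a little arithmetic, which also shows why only one evaluation row was logged. The sketch assumes an effective batch size of 8 (the `Trainer` default per-device batch size on a single device, with no reduction by `auto_find_batch_size`); that value is not printed in the notebook.

```python
import math

n_train = 2000           # rows in the training split
batch_size = 8           # assumption: default per-device batch size, one device
epochs = 1               # default of create_training_arguments
eval_steps = 150
train_runtime = 54.5606  # seconds, from TrainOutput

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
eval_points = list(range(eval_steps, total_steps + 1, eval_steps))

print("total optimizer steps:", total_steps)
print("steps at which evaluation runs:", eval_points)
print("steps per second:", round(total_steps / train_runtime, 3))
print("samples per second:", round(n_train * epochs / train_runtime, 3))
```

With 250 total steps and `eval_steps = 150`, the next evaluation would fall at step 300, after training has already finished.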

In [28]:
trainer.model.save_pretrained(f"{output_directory}/best_model")

## Step 4: Assess Post-training Performance (20 points)
Step-by-step directions:
- Assess post training performance: Set up an `lm_eval` `task_manager` and implement your training task's test split and general language benchmark; make sure to log the samples and you can set a limit (n=50) if you'd like. If you are implementing a training task not in `lm_eval`, remember to use your held-out test data, logging responses and using a limit if you'd like. (10 points)
- Print the results and 2 model responses for each task to get a sense of what the model outputs look like after training (10 points)

In [29]:
results = evaluator.simple_evaluate(
    model = "hf", #Specify huggingface model
    model_args = {"pretrained": model, "dtype": "bfloat16", "tokenizer": tokenizer}, #Define model arguments
    tasks = ["hellaswag", "race"], 
    log_samples = True, 
    batch_size = "1",
    limit = 50,
    random_seed = 42,
)

`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
100%|██████████| 50/50 [00:00<00:00, 4476.79it/s]
100%|██████████| 50/50 [00:00<00:00, 3361.57it/s]
Running loglikelihood requests: 100%|██████████| 400/400 [00:20<00:00, 19.29it/s]


In [30]:
results['results']

{'hellaswag': {'alias': 'hellaswag',
  'acc,none': 0.46,
  'acc_stderr,none': 0.07119963311072637,
  'acc_norm,none': 0.5,
  'acc_norm_stderr,none': 0.07142857142857142},
 'race': {'alias': 'race',
  'acc,none': 0.34,
  'acc_stderr,none': 0.0676726816132972}}
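
To put the before/after accuracies in context, the reported `acc_stderr` values can be recomputed from the accuracy and the sample size. The formula below, the standard error of a mean of 0/1 scores using the n-1 sample variance, is a reconstruction that happens to match the printed values; `binary_stderr` is a hypothetical helper, not an `lm_eval` function.

```python
import math

def binary_stderr(acc, n):
    """Standard error of the mean of n 0/1 scores, using the n-1 sample variance."""
    return math.sqrt(acc * (1 - acc) / (n - 1))

N = 50  # the `limit` used in both evaluation runs
before = {"hellaswag": 0.40, "race": 0.36}  # `acc,none` before LoRA
after = {"hellaswag": 0.46, "race": 0.34}   # `acc,none` after LoRA

for task in before:
    delta = after[task] - before[task]
    se_before = binary_stderr(before[task], N)
    se_after = binary_stderr(after[task], N)
    print(f"{task}: change = {delta:+.2f}, "
          f"stderr before = {se_before:.3f}, stderr after = {se_after:.3f}")
```

With only 50 items per task, each standard error is roughly 0.07, so raising `limit` would give a tighter estimate of the change.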

In [31]:
get_hella_examples(0)

Query: Roof shingle removal: A man is sitting on a roof. He: 

is using wrap to wrap a pair of skis.
is ripping level tiles off.
is holding a rubik's cube.
starts pulling up roofing on a roof.


Target: 3
Response: 2


In [32]:
get_hella_examples(1)

Query: Clean and jerk: A lady walks to a barbell. She bends down and grabs the pole. The lady: 

swings and lands in her arms.
pulls the barbell forward.
pulls a rope attached to the barbell.
stands and lifts the weight over her head.


Target: 3
Response: 1


In [33]:
get_race_examples(0)

Article: The rain had continued for a week and the flood had created a big river which were running by Nancy
Brown's farm. As she tried to gather her cows to a higher ground, she slipped and hit her head on a
fallen tree trunk. The fall made her unconscious for a moment or two. When she came to, Lizzie, one
of her oldest and favorite cows, was licking her face.  At that time, the water level on the farm
was still rising. Nancy gathered all her strength to get up and began walking slowly with Lizzie.
The rain had become much heavier, and the water in the field was now waist high. Nancy's pace got
slower and slower because she felt a great pain in her head. Finally, all she could do was to throw
her arm around Lizzie's neck and try to hang on. About 20 minutes later, Lizzie managed to pull
herself and Nancy out of the rising water and onto a bit of high land, which seemed like a small
island in the middle of a lake of white water.  Even though it was about noon, the sky was so dark
and t

In [34]:
get_race_examples(1)

Article: There is probably no field of human activity in which our values and lifestyles are shown more
clearly and strongly than they are in the clothes that we choose to wear.The dress of an individual
is a kind of "sign language" that communicates a set of information and is usually the basis on
which immediate impressions are formed.Traditionally,a concern for clothes was considered to be an
affair of females,while men took pride in the fact that they were completely lacking in clothes
consciousness . This type of American culture is by degrees changing as man dress takes on greater
variety and color.Even as early as 1955,a researcher in Michigan said that _ White collar workers in
particular viewed dress as a symbol of ability,which could be used to impress or influence
others,especially in the work situation.The white collar worker was described as extremely concerned
about the impression his clothing made on his superiors .Although blue collar workers were less
aware that they m

## Step 5: Interpretation and Comparison to Prompt Tuning (20 points)
In one to two paragraphs, interpret the results from LoRA. Discuss whether you think LoRA worked well for your task, citing quantitative and qualitative evidence from Steps 1 and 4 (6 points). Also discuss why you think LoRA performed the way it did for your task (i.e., why did it succeed or fail?), citing any resources you used to formulate your argument (6 points). Lastly, compare LoRA's performance on your task to prompt tuning: which method worked better in your opinion, and why (8 points)? Some considerations to help you make your decision include accuracy on the test set, difference in outputs before and after training, and level of catastrophic forgetting.

Based on the quantitative results, LoRA training worked fairly well: `acc,none` on HellaSwag, the training task, increased from 0.40 to 0.46 (and `acc_norm,none` from 0.46 to 0.50), while accuracy on RACE, the general benchmark, changed little, going from 0.36 to 0.34, so there was not much catastrophic forgetting. The qualitative examples did not add much information: on both printed HellaSwag items the model made the same mistakes before and after training (choosing option 2 and option 1 when the target was 3). I think LoRA succeeded because it allowed the model to learn the specifics of the inputs needed for this particular test. Interestingly, this was achieved with less training data than the prompt tuning task. I think LoRA was more successful than prompt tuning in this case: it achieved a better score on HellaSwag, showed a similar amount of forgetting, and did so with fewer resources, which makes it much more efficient than prompt tuning.