## Homework 4

Your goal for this homework is to learn more about parameter importance in LLMs and get more comfortable with editing LLM parameters and using the Eleuther LM Evaluation Harness like we did in the Week 4 live session. This homework will guide you through several different experiments to get further insight into the importance of various components of the LLM architecture. 

## Step 1: Choose a model and load dependencies (10 points)
In the code cell below:
- Load an LLM other than a Llama 3 model
- Import the necessary dependencies from transformers
- Import torch and lm_eval
- Print the model architecture so you can see what its components are named

In [1]:
import torch
import os
os.environ['HF_HOME'] = '/scratch/ezq9qu/models/cache'
from datasets import load_dataset
import pandas as pd
import numpy as np
import lm_eval
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline


In [2]:
def load_model(name = 'Qwen/Qwen3-4B-Instruct-2507'):
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, device_map="auto", dtype=torch.bfloat16)
    pipe = pipeline(
        "text-generation", 
        model=model, 
        dtype=torch.bfloat16, 
        device_map="auto", 
        tokenizer = tokenizer, 
        max_new_tokens = 250,
        do_sample = False
    )
    return model, pipe, tokenizer

model, pipe, tokenizer = load_model()

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Device set to use cuda:0
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


In [3]:
model

Qwen3ForCausalLM(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 2560)
    (layers): ModuleList(
      (0-35): 36 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear(in_features=2560, out_features=4096, bias=False)
          (k_proj): Linear(in_features=2560, out_features=1024, bias=False)
          (v_proj): Linear(in_features=2560, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=2560, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=2560, out_features=9728, bias=False)
          (up_proj): Linear(in_features=2560, out_features=9728, bias=False)
          (down_proj): Linear(in_features=9728, out_features=2560, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): Qwen3RMSNorm((2560,), eps=1e-06)
        (post_attention_layer
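
The printed shapes also show how large each component is, which is useful context when deciding what to zero out later. The sketch below multiplies `in_features` by `out_features` for each bias-free `Linear` layer in one decoder block. It counts only the linear projections shown above and ignores the small norm weights; the helper name `count_weights` is made up for illustration.

```python
# Linear-layer shapes copied from the printed Qwen3DecoderLayer,
# written as (in_features, out_features). Norm weights are ignored.
attention_shapes = {
    "q_proj": (2560, 4096),
    "k_proj": (2560, 1024),
    "v_proj": (2560, 1024),
    "o_proj": (4096, 2560),
}
mlp_shapes = {
    "gate_proj": (2560, 9728),
    "up_proj": (2560, 9728),
    "down_proj": (9728, 2560),
}


def count_weights(shapes):
    """Return {name: number of weights} for bias-free linear layers."""
    return {name: n_in * n_out for name, (n_in, n_out) in shapes.items()}


attention_counts = count_weights(attention_shapes)
mlp_counts = count_weights(mlp_shapes)
attention_total = sum(attention_counts.values())
mlp_total = sum(mlp_counts.values())

for name, n in {**attention_counts, **mlp_counts}.items():
    print(f"{name:>9}: {n:>12,}")
print(f"attention total: {attention_total:,}")
print(f"mlp total:       {mlp_total:,}")
```

Per block, the three MLP projections hold roughly 2.85 times as many weights as the four attention projections, so zeroing one MLP matrix removes more parameters than zeroing `q_proj`.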

## Step 2: Test baseline performance (18 points)
Using the code from Week 4 as a template, choose a task from the Eleuther LM Evaluation Harness and test your model on it, logging the model samples like we did in class (9 points). This is a good opportunity to practice implementing one of the benchmarks you will use for your final project. Feel free to use the `limit` parameter to subset the task to a few questions so that the code runs more quickly. 

Like we did in class, print the model's accuracy on the task (3 points) and print two examples of questions and the model's associated responses (3 points each). 

In [4]:
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM

In [5]:
lm_eval_model = HFLM(
    pretrained=model,
    tokenizer=tokenizer,
    dtype="bfloat16",
)

`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration


In [6]:
results = evaluator.simple_evaluate(
    model = lm_eval_model,
    tasks = "gsm8k",
    log_samples = True,
    batch_size = 1,
    limit=3, #set number of questions from benchmark
    random_seed=126,
    
)


100%|██████████| 3/3 [00:00<00:00, 305.90it/s]
Running generate_until requests: 100%|██████████| 3/3 [00:22<00:00,  7.44s/it]


We can see the scores by indexing into the `results` key:

In [7]:
results['results']

{'gsm8k': {'alias': 'gsm8k',
  'exact_match,strict-match': np.float64(1.0),
  'exact_match_stderr,strict-match': 0.0,
  'exact_match,flexible-extract': np.float64(0.6666666666666666),
  'exact_match_stderr,flexible-extract': 0.33333333333333337}}

Here we see that the model scored 1.0 on strict-match with a standard error of 0 on the 3-question subset of my chosen task, `gsm8k`. A perfect score under strict matching! The flexible-extract score was lower, at about 0.67. Now we can index into the samples and see a couple of question and response pairs.

In [8]:
results['samples']['gsm8k'][0]['doc']['question']

"Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?"

In [9]:
results['samples']['gsm8k'][0]['resps']

[[" Janet’s ducks lay 16 eggs per day. She eats 3 for breakfast and bakes 4 for muffins, so she uses 3 + 4 = 7 eggs. The remainder of the eggs is 16 - 7 = 9 eggs. She sells each egg for $2, so she makes 9 * 2 = <<9*2=18>>18 dollars every day at the farmers' market.\n#### 18\nAnswer: 18"]]

In [10]:
results['samples']['gsm8k'][1]['doc']['question']

'A robe takes 2 bolts of blue fiber and half that much white fiber.  How many bolts in total does it take?'

In [11]:
results['samples']['gsm8k'][1]['resps']

[[' The robe takes 2 bolts of blue fiber and half that much white fiber, which is 2 / 2 = 1 bolt of white fiber. So, in total, it takes 2 + 1 = 3 bolts. #### 3\nAnswer: #### 3']]

Both of these are correct.
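
The standard errors in the results dictionary can be reproduced by hand, which helps when judging how much weight a 3-question score can bear. The sketch below uses the usual sample standard error of the mean (with an `n - 1` denominator). That this is the formula the harness uses is an assumption, although it does reproduce the values printed above. The per-question scores are illustrative: only the count of correct answers is known, not their order.

```python
import math


def mean_and_stderr(scores):
    """Mean and standard error of the mean for a list of 0/1 scores."""
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance with the n - 1 (Bessel) correction.
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(variance) / math.sqrt(n)


# Hypothetical per-question scores: all three right, and two of three right.
all_correct = mean_and_stderr([1, 1, 1])
two_of_three = mean_and_stderr([1, 1, 0])

print(all_correct)   # mean 1.0, stderr 0.0
print(two_of_three)  # mean about 0.667, stderr about 0.333
```

With only three questions, a single miss moves the score by a third and the standard error is just as large, so small differences between runs should be read cautiously.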

## Step 3: Attention heads versus MLP layers (18 points)
Choose a model layer (decoder block) you want to experiment with. Repeat Step 2 twice, once after deleting one of that layer's attention projections (7 points) and once after deleting one of that layer's MLP layers (7 points). Don't forget to reload the model after each experiment so that you are not deleting multiple model components (4 points)!

In [12]:
model

Qwen3ForCausalLM(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 2560)
    (layers): ModuleList(
      (0-35): 36 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear(in_features=2560, out_features=4096, bias=False)
          (k_proj): Linear(in_features=2560, out_features=1024, bias=False)
          (v_proj): Linear(in_features=2560, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=2560, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=2560, out_features=9728, bias=False)
          (up_proj): Linear(in_features=2560, out_features=9728, bias=False)
          (down_proj): Linear(in_features=9728, out_features=2560, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): Qwen3RMSNorm((2560,), eps=1e-06)
        (post_attention_layer

In [13]:
model.model.layers[3].self_attn.q_proj.weight.data*=0

In [14]:
model.model.layers[3].self_attn.q_proj.weight.data

tensor([[-0., -0., -0.,  ..., 0., -0., -0.],
        [0., 0., -0.,  ..., 0., 0., -0.],
        [0., 0., -0.,  ..., -0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., -0., 0.],
        [0., -0., 0.,  ..., -0., 0., 0.],
        [-0., 0., -0.,  ..., -0., 0., -0.]], device='cuda:0',
       dtype=torch.bfloat16)
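
It may help to spell out what an all-zero `q_proj` does to the layer. With every query equal to zero, every query-key dot product is zero, so the softmax gives equal weight to every visible position and the head returns a plain average of the value vectors rather than nothing at all. The toy sketch below shows this for a single head and a single query position. It assumes standard scaled dot-product attention, uses made-up keys and values, and leaves out `q_norm` and the rotary embedding on the assumption that both map a zero vector to zero.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query position."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Toy keys and values for three positions (illustrative numbers only).
keys = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

zero_weights, zero_output = attend([0.0, 0.0], keys, values)
normal_weights, _ = attend([1.0, 1.0], keys, values)

print(zero_weights)    # equal weight on every position
print(zero_output)     # the mean of the value vectors
print(normal_weights)  # a non-zero query attends unevenly
```

So this ablation removes the layer's ability to choose where to attend, while `v_proj` and `o_proj` still pass an averaged signal into the residual stream.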

In [15]:
model_attn = HFLM(pretrained=model, tokenizer=tokenizer, dtype="bfloat16")

`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration


In [16]:
results = evaluator.simple_evaluate(
    model = model_attn,
    tasks = "gsm8k",
    log_samples = True,
    batch_size = 1,
    limit=3, #set number of questions from benchmark
    random_seed=127,
    
)

100%|██████████| 3/3 [00:00<00:00, 308.71it/s]
Running generate_until requests: 100%|██████████| 3/3 [00:32<00:00, 10.69s/it]


In [17]:
results['results']

{'gsm8k': {'alias': 'gsm8k',
  'exact_match,strict-match': np.float64(0.6666666666666666),
  'exact_match_stderr,strict-match': 0.33333333333333337,
  'exact_match,flexible-extract': np.float64(0.6666666666666666),
  'exact_match_stderr,flexible-extract': 0.33333333333333337}}

In [18]:
results['samples']['gsm8k'][0]['doc']['question']

"Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?"

In [19]:
results['samples']['gsm8k'][0]['resps']

[[" Janet’s ducks lay 16 eggs per day. She eats 3 for breakfast and bakes 4 for her friends, so she has 16 - 3 - 4 = <<16-3-4=9>>9 eggs left. She sells these at $2 per egg, so she makes 9 * 2 = <<9*2=18>>18 dollars every day at the farmers' market.\n#### 18\nAnswer: 18"]]

In [27]:
model, pipe, tokenizer = load_model()

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Device set to use cuda:0


In [28]:
model.model.layers[3].mlp.gate_proj.weight.data*=0

In [29]:
model.model.layers[3].mlp.gate_proj.weight.data

tensor([[0., 0., 0.,  ..., 0., -0., -0.],
        [-0., 0., 0.,  ..., -0., 0., -0.],
        [-0., 0., 0.,  ..., -0., 0., -0.],
        ...,
        [-0., 0., -0.,  ..., -0., 0., -0.],
        [0., 0., -0.,  ..., 0., 0., -0.],
        [-0., 0., -0.,  ..., 0., 0., -0.]], device='cuda:0',
       dtype=torch.bfloat16)
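
Zeroing `gate_proj` has a wider effect than its name suggests. Assuming the MLP uses the gated form common to Llama-style models, `down_proj(act_fn(gate_proj(x)) * up_proj(x))`, a zero gate gives `SiLU(0) = 0`, the elementwise product is zero, and the bias-free `down_proj` then outputs zero. The toy sketch below checks this with small made-up matrices; `gated_mlp` is an illustrative helper, not the library's code.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def gated_mlp(x, gate_w, up_w, down_w):
    """Bias-free gated MLP: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(gate_w, x)]
    up = matvec(up_w, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(down_w, hidden)


# Toy weights: hidden size 2, intermediate size 3 (illustrative numbers only).
x = [1.0, -2.0]
gate_w = [[0.5, 0.1], [-0.3, 0.8], [0.2, 0.2]]
up_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
down_w = [[1.0, 2.0, 3.0], [-1.0, 0.5, 0.0]]

intact = gated_mlp(x, gate_w, up_w, down_w)
zero_gate = [[0.0, 0.0] for _ in gate_w]
ablated = gated_mlp(x, zero_gate, up_w, down_w)

print(intact)   # non-zero output
print(ablated)  # every entry is zero
```

Under that assumption, zeroing `gate_proj` alone already silences the whole MLP output for the layer, leaving only the residual connection to carry the hidden state past it.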

In [30]:
model_mlp = HFLM(pretrained=model, tokenizer=tokenizer, dtype="bfloat16")

`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration


In [31]:
results = evaluator.simple_evaluate(
    model = model_mlp,
    tasks = "gsm8k",
    log_samples = True,
    batch_size = 1,
    limit=3, #set number of questions from benchmark
    random_seed=128,
    
)

100%|██████████| 3/3 [00:00<00:00, 312.59it/s]
Running generate_until requests: 100%|██████████| 3/3 [00:21<00:00,  7.28s/it]


In [32]:
results['results']

{'gsm8k': {'alias': 'gsm8k',
  'exact_match,strict-match': np.float64(1.0),
  'exact_match_stderr,strict-match': 0.0,
  'exact_match,flexible-extract': np.float64(1.0),
  'exact_match_stderr,flexible-extract': 0.0}}

That's everything correct again, with both strict-match and flexible-extract at 1.0!

In [25]:
results['samples']['gsm8k'][0]['resps']

[[" The ducks lay 16 eggs per day. Janet eats 3 eggs for breakfast and bakes 4 eggs into muffins, so she uses 3 + 4 = <<3+4=7>>7 eggs. The remainder of the eggs is 16 - 7 = <<16-7=9>>9 eggs. She sells each egg for $2, so she makes 9 * 2 = <<9*2=18>>18 dollars per day at the farmers' market.\n#### 18\nAnswer: 18"]]

## Step 4: Interpretation (18 points)
In a markdown cell, interpret the results from your experiment in Step 3 in 2-3 sentences. In the layer you chose, do attention heads or MLP layers appear to be more important and how do you know (i.e., how do the results from the experiment back up that claim) (9 points)? Why might that be the case (9 points)? There aren't necessarily right or wrong answers- we are just looking for some critical thinking based on what you know about LLMs and how they work. Feel free to peruse outside sources if you are stumped, but just make sure you reference them!

It seems as though the attention projection is the more important of the two components when a small part of each is deleted. With the `q_proj` weights set to zero, strict-match accuracy dropped from 1.0 to 0.67, a drop we do not see after zeroing `gate_proj` in the MLP. However, this may simply mean that the architecture is robust enough to tolerate the complete failure of one small component and still function correctly: the residual connections might just route around the specific failure. To test this theory, I ran another experiment, zeroing all three MLP projections in one layer (layer 35) and increasing the limit to 10.

In [None]:

model, pipe, tokenizer = load_model()


model.model.layers[35].mlp.gate_proj.weight.data*=0
model.model.layers[35].mlp.up_proj.weight.data*=0
model.model.layers[35].mlp.down_proj.weight.data*=0


print("Creating a new evaluation wrapper...")
model_mlp_off = HFLM(pretrained=model, tokenizer=tokenizer, dtype="bfloat16")


print("Running evaluation with the entire MLP block disabled...")
results = evaluator.simple_evaluate(
    model = model_mlp_off,
    tasks = "gsm8k",
    log_samples = False,
    batch_size = 1,
    limit=10, 
    random_seed=131, 
)

results['results']

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Device set to use cuda:0
`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration


Creating a new evaluation wrapper...
Running evaluation with the entire MLP block disabled...


100%|██████████| 10/10 [00:00<00:00, 320.30it/s]
Running generate_until requests: 100%|██████████| 10/10 [01:12<00:00,  7.29s/it]


{'gsm8k': {'alias': 'gsm8k',
  'exact_match,strict-match': np.float64(0.9),
  'exact_match_stderr,strict-match': 0.09999999999999999,
  'exact_match,flexible-extract': np.float64(0.8),
  'exact_match_stderr,flexible-extract': 0.13333333333333333}}

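We do see some decrease in how well the model answers the questions (strict-match 0.9 and flexible-extract 0.8 on 10 questions); however, it is not as large as the drop from zeroing part of the attention block (strict-match 0.67 on 3 questions).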
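
Because the two runs use different limits (3 and 10 questions), it is worth checking how much a 3-question score can move by chance alone. The sketch below is a plain binomial calculation. It treats 0.9 as a hypothetical true per-question accuracy, purely as an assumption for illustration, and asks how often a model with that accuracy would still miss at least one of three questions.

```python
from math import comb


def prob_k_correct(n, k, p):
    """Binomial probability of exactly k correct answers out of n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


assumed_accuracy = 0.9  # hypothetical true accuracy, not a measured value

p_all_three = prob_k_correct(3, 3, assumed_accuracy)
p_miss_at_least_one = 1 - p_all_three

print(f"P(3 of 3 correct)        = {p_all_three:.3f}")         # 0.729
print(f"P(at least one miss in 3) = {p_miss_at_least_one:.3f}")  # 0.271
```

Under that assumption, roughly one run in four would score 0.67 or lower on three questions without any real change in the model, so a larger `limit` would be needed to separate the two ablations with confidence.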

## Step 5: First versus middle versus last model layer (18 points)
Based on your results in Step 3 and interpretation in Step 4, choose the model component that you think is more important (MLP layer versus attention head projection). For that component, repeat Step 2 three times, once after deleting that component in the first model layer (5 points), once after deleting that component in a middle model layer (5 points), and once after deleting that component in the final model layer (5 points). As with Step 3, don't forget to reload the model after each experiment (3 points)! 

In [38]:
model, pipe, tokenizer = load_model()

model.model.layers[1].self_attn.q_proj.weight.data*=0

model_attention_first = HFLM(pretrained=model, tokenizer=tokenizer, dtype="bfloat16")


results = evaluator.simple_evaluate(
    model = model_attention_first,
    tasks = "gsm8k",
    log_samples = False,
    batch_size = 1,
    limit=3, 
    random_seed=132, 
)

results['results']

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Device set to use cuda:0
`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
100%|██████████| 3/3 [00:00<00:00, 309.13it/s]
Running generate_until requests: 100%|██████████| 3/3 [00:22<00:00,  7.66s/it]


{'gsm8k': {'alias': 'gsm8k',
  'exact_match,strict-match': np.float64(1.0),
  'exact_match_stderr,strict-match': 0.0,
  'exact_match,flexible-extract': np.float64(1.0),
  'exact_match_stderr,flexible-extract': 0.0}}

In [39]:
model, pipe, tokenizer = load_model()

model.model.layers[16].self_attn.q_proj.weight.data*=0

model_attention_middle = HFLM(pretrained=model, tokenizer=tokenizer, dtype="bfloat16")


results = evaluator.simple_evaluate(
    model = model_attention_middle,
    tasks = "gsm8k",
    log_samples = False,
    batch_size = 1,
    limit=3, 
    random_seed=133, 
)

results['results']

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Device set to use cuda:0
`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
100%|██████████| 3/3 [00:00<00:00, 309.82it/s]
Running generate_until requests: 100%|██████████| 3/3 [00:23<00:00,  7.80s/it]


{'gsm8k': {'alias': 'gsm8k',
  'exact_match,strict-match': np.float64(1.0),
  'exact_match_stderr,strict-match': 0.0,
  'exact_match,flexible-extract': np.float64(0.6666666666666666),
  'exact_match_stderr,flexible-extract': 0.33333333333333337}}

In [42]:
model, pipe, tokenizer = load_model()

model.model.layers[34].self_attn.q_proj.weight.data*=0

model_attention_last = HFLM(pretrained=model, tokenizer=tokenizer, dtype="bfloat16")


results = evaluator.simple_evaluate(
    model = model_attention_last,
    tasks = "gsm8k",
    log_samples = False,
    batch_size = 1,
    limit=3, 
    random_seed=134, 
)

results['results']

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Device set to use cuda:0
`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
100%|██████████| 3/3 [00:00<00:00, 312.41it/s]
Running generate_until requests: 100%|██████████| 3/3 [00:18<00:00,  6.21s/it]


{'gsm8k': {'alias': 'gsm8k',
  'exact_match,strict-match': np.float64(0.6666666666666666),
  'exact_match_stderr,strict-match': 0.33333333333333337,
  'exact_match,flexible-extract': np.float64(1.0),
  'exact_match_stderr,flexible-extract': 0.0}}
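
The three runs are easier to compare side by side. The sketch below copies the scores printed above into a dictionary keyed by the layer index used in each cell and converts each accuracy back into a count of correct answers. The `summarize` helper is made up for illustration; the metric keys are the ones that appear in `results['results']['gsm8k']`.

```python
def summarize(task_results, n_questions):
    """Pull both exact-match scores out of one task's results dict."""
    strict = task_results["exact_match,strict-match"]
    flexible = task_results["exact_match,flexible-extract"]
    return {
        "strict": round(strict, 2),
        "flexible": round(flexible, 2),
        "strict_correct": round(strict * n_questions),
        "flexible_correct": round(flexible * n_questions),
    }


# Scores copied from the three outputs above, keyed by zeroed layer index.
step5_runs = {
    1: {"exact_match,strict-match": 1.0,
        "exact_match,flexible-extract": 1.0},
    16: {"exact_match,strict-match": 1.0,
         "exact_match,flexible-extract": 0.6666666666666666},
    34: {"exact_match,strict-match": 0.6666666666666666,
         "exact_match,flexible-extract": 1.0},
}

summary = {layer: summarize(scores, n_questions=3)
           for layer, scores in step5_runs.items()}

print("layer  strict  flexible")
for layer, row in summary.items():
    print(f"{layer:>5}  {row['strict_correct']}/3     {row['flexible_correct']}/3")
```

Each change relative to the intact model amounts to a single question on a single metric.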

## Step 6: Interpretation (18 points)
Similar to Step 4, interpret the results from your experiment in Step 5 in 2-3 sentences. In your model, does the first, middle or last layer appear to be more important and how do you know (i.e., how do the results from the experiment back up that claim) (9 points)? Why might that be the case (9 points)? There aren't necessarily right or wrong answers- we are just looking for some critical thinking based on what you know about LLMs and how they work. Feel free to peruse outside sources if you are stumped, but just make sure you reference them!

The self-attention projections seem to matter more in later layers. Zeroing `q_proj` in the early layer (index 1) caused no drop in performance: both metrics stayed at 1.0. This is not the case for the later layers. Zeroing `q_proj` in the middle layer (index 16) or the late layer (index 34) lowered one metric from 1.0 to 0.67 in each case, which is one question out of three. The two layers seem to do different things, however: with the middle layer ablated, strict-match stayed at 1.0 while flexible-extract dropped, and with the late layer ablated, strict-match dropped while flexible-extract stayed at 1.0.
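
The two metrics can disagree because they pull the answer out of the response in different ways. The sketch below is a simplified stand-in for the two `gsm8k` filters. The regular expressions are written from memory of the harness's task config and should be treated as assumptions, and the sketch skips the normalization the harness applies before comparing. Strict matching takes the first number after `####`, while flexible extraction takes the last number anywhere in the response.

```python
import re

# Assumed patterns; check the harness's gsm8k task config before relying on them.
STRICT_PATTERN = re.compile(r"#### (\-?[0-9\.\,]+)")
FLEXIBLE_PATTERN = re.compile(r"(-?[$0-9.,]{2,})|(-?[0-9]+)")


def strict_extract(response):
    """First number that follows '#### ', or None if the marker is absent."""
    match = STRICT_PATTERN.search(response)
    return match.group(1) if match else None


def flexible_extract(response):
    """Last number-like token anywhere in the response, or None."""
    matches = FLEXIBLE_PATTERN.findall(response)
    if not matches:
        return None
    # findall returns one tuple per match; keep the non-empty group.
    return next(group for group in matches[-1] if group)


# Made-up responses that show each way the two filters can differ.
clean = "She makes 9 * 2 = 18 dollars.\n#### 18"
no_marker = "The total is 18"
trailing_text = "#### 18\nCheck: 9 eggs at 2 dollars each"

for text in (clean, no_marker, trailing_text):
    print(repr(strict_extract(text)), repr(flexible_extract(text)))
```

A response that drops the `####` marker fails strict matching only, and a response that keeps writing after the answer can fail flexible extraction only. That is one way an ablation could change the formatting of an answer without changing the arithmetic behind it.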