# University of Utah CS 6340/5340 NLP Fall 2024 Assignment 4

First, make a copy of this notebook: File > Save a copy in Drive.

Connect to a GPU by clicking on the arrow next to "Connect" in the upper right corner, then click "Change runtime type" and select "T4 GPU".

Turn off code completion by going to Tools > Settings > Editor and unchecking the box next to "Automatically trigger code completions". **Using auto-completed code is the same as copying it, which is academic misconduct in this class.**

Run the cells with your solutions such that the output is visible. When you are ready to submit your solution to Gradescope, do the following: File > Print > Destination: Save as PDF > Save. Upload the pdf to Gradescope.


## Quantization (40 points)

You will implement the quantization equations provided in class and apply them to a floating-point matrix, converting it to an 8-bit signed integer matrix and then reconstructing the original weights. You will also analyze the memory usage and the quantization error. This will help you understand how quantization works and its impact on memory and precision.



Run the following cell to construct a 5 x 5 matrix of random numbers drawn from a normal distribution with mean 0 and variance 1, stored in 32-bit float precision.

In [None]:
import torch

original_matrix = torch.randn(5, 5, dtype=torch.float32)
print("A matrix (32-bit float):\n", original_matrix)

A matrix (32-bit float):
 tensor([[-0.5453,  1.6173, -0.3009, -0.1761, -1.4238],
        [ 1.1052, -0.3221, -0.8209,  1.7889, -2.6168],
        [ 1.9330, -0.2296,  1.4222,  0.1239,  0.5242],
        [-1.2502,  1.4958, -0.2915,  0.3856,  0.4916],
        [-0.6413,  1.1239, -0.7785,  0.2622, -0.3222]])


Use the formulas provided in class to compute the scale factor (s) and zero-point (z). Print these two values.

*Tip*: The range of 8-bit signed integers is -128 to 127.

In [None]:
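# Range of the float values to be quantized.
x_max = original_matrix.max()
x_min = original_matrix.min()

# Range of the target type: 8-bit signed integers.
q_min = -128
q_max = 127

# The scale factor maps the float range onto the integer range;
# the zero-point shifts it so that x_min lands on q_min.
s = (q_max - q_min) / (x_max - x_min)
z = q_min - torch.round(s * x_min)
print(f"s: {s}")
print(f"z: {z}")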

s: 56.04700469970703
z: 19.0
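
As a sanity check, the scale factor and zero-point can be recomputed by hand from the smallest and largest entries printed above. This is a minimal sketch in plain Python; it uses the four-decimal values shown in the printout (-2.6168 and 1.9330), so `s` agrees with the notebook's output only to about three decimal places.

```python
# Smallest and largest entries, copied from the printed matrix (rounded to 4 decimals).
x_min, x_max = -2.6168, 1.9330

# Range of 8-bit signed integers.
q_min, q_max = -128, 127

# Scale: how many integer steps correspond to one unit of the float range.
s = (q_max - q_min) / (x_max - x_min)

# Zero-point: integer offset chosen so that x_min maps to q_min.
z = q_min - round(s * x_min)

print(f"s = {s:.3f}, z = {z}")
```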


Quantize the original matrix to 8-bit integers using the calculated scale factor and zero-point. Name the quantized matrix `quantized_matrix`.

*Tips:*
1. You are allowed to use `float32` temporarily.
2. You can use `.to(torch.int8)` after rounding.

In [None]:
quantized_matrix = torch.round(s * original_matrix + z).to(torch.int8)
print("\nQuantized matrix:\n", quantized_matrix)


Quantized matrix:
 tensor([[ -12,  110,    2,    9,  -61],
        [  81,    1,  -27,  119, -128],
        [ 127,    6,   99,   26,   48],
        [ -51,  103,    3,   41,   47],
        [ -17,   82,  -25,   34,    1]], dtype=torch.int8)
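
The quantization step can also be traced for individual entries. The sketch below is plain Python rather than the notebook's tensor code, with `s` and `z` copied from the output above; the helper name `quantize` and the explicit clamp are illustrative additions, not part of the solution cell. Clamping is a common safeguard: a rounded value that lands outside [-128, 127] would otherwise not be representable as an int8.

```python
# Scale factor and zero-point copied from the printed output above.
s = 56.04700469970703
z = 19
q_min, q_max = -128, 127

def quantize(x):
    """Map a float to an 8-bit signed integer: round(s * x + z), clamped to range."""
    q = round(s * x + z)
    return max(q_min, min(q_max, q))

# Three entries of the original matrix: the first entry, the minimum, the maximum.
for x in (-0.5453, -2.6168, 1.9330):
    print(x, "->", quantize(x))
```

The three results (-12, -128, 127) match the corresponding entries of the quantized matrix printed above.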


Implement and use the formula to dequantize the matrix back to 32-bit floats. Name the reconstructed matrix `reconstructed_matrix`.

In [None]:
reconstructed_matrix = (quantized_matrix - z) / s
print("\nDequantized matrix:\n", reconstructed_matrix)



Dequantized matrix:
 tensor([[-0.5531,  1.6236, -0.3033, -0.1784, -1.4274],
        [ 1.1062, -0.3212, -0.8207,  1.7842, -2.6228],
        [ 1.9270, -0.2319,  1.4274,  0.1249,  0.5174],
        [-1.2490,  1.4987, -0.2855,  0.3925,  0.4996],
        [-0.6423,  1.1241, -0.7851,  0.2676, -0.3212]])
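
Because rounding moves `s * x + z` by at most 0.5, the reconstruction error of any entry that was not clipped is at most `0.5 / s`. The sketch below checks this for the first row, in plain Python with `s`, `z`, and the row values copied from the outputs above.

```python
s = 56.04700469970703
z = 19

# First row of the original and quantized matrices, copied from the printed outputs.
original_row = [-0.5453, 1.6173, -0.3009, -0.1761, -1.4238]
quantized_row = [-12, 110, 2, 9, -61]

# Dequantize: invert q = round(s * x + z), ignoring the rounding.
reconstructed_row = [(q - z) / s for q in quantized_row]

# Worst-case error for an unclipped value is half a quantization step.
max_step_error = 0.5 / s
errors = [abs(x - r) for x, r in zip(original_row, reconstructed_row)]

print([round(r, 4) for r in reconstructed_row])
print(f"largest error: {max(errors):.4f}, bound: {max_step_error:.4f}")
```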


Run the following cell to get the total memory consumption in bytes for each matrix.

- `element_size()` gives the size of each element in bytes (4 bytes for float32, 1 byte for int8).
- `numel()` returns the total number of elements in the matrix.

Multiplying these two gives the total memory consumption in bytes for each matrix.

In [None]:
original_memory = original_matrix.element_size() * original_matrix.numel()
quantized_memory = quantized_matrix.element_size() * quantized_matrix.numel()

print(f"Memory usage (original matrix): {original_memory} bytes")
print(f"Memory usage (quantized matrix): {quantized_memory} bytes")


Memory usage (original matrix): 100 bytes
Memory usage (quantized matrix): 25 bytes


Run the following cell to get the reconstruction/quantization error.

- Mean Squared Error (MSE) is a standard metric to measure the difference between two matrices element-wise.

In [None]:
mse = torch.mean((original_matrix - reconstructed_matrix) ** 2)
print(f"Quantization Error: {mse.item()}")


Quantization Error: 2.1289499272825196e-05
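
One way to judge whether this value is small is to compare it with what uniform rounding alone would produce. If the rounding error in `s * x + z` is modeled as uniformly distributed on [-0.5, 0.5] (a standard approximation, assumed here rather than taken from the assignment), its variance is 1/12, so the expected MSE after dividing by `s` is about `1 / (12 * s**2)`.

```python
s = 56.04700469970703                   # scale factor printed earlier
observed_mse = 2.1289499272825196e-05   # quantization error printed above

# Variance of a uniform error on [-0.5, 0.5] is 1/12; dequantizing divides by s.
expected_mse = 1.0 / (12.0 * s ** 2)

print(f"expected MSE under the uniform-rounding model: {expected_mse:.3e}")
print(f"observed / expected: {observed_mse / expected_mse:.2f}")
```

With only 25 entries, an observed value within a small factor of this estimate is roughly what one would expect.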


In the following text field provide answers to these questions:
1. In your view, is the quantization error low? Why or why not?
2. Is the reduction in memory usage significant?
3. How might these changes (quantization error and memory reduction) affect the performance and efficiency of the model during inference?

**Be concise and precise.**

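The error is low: the MSE is about 2e-5, while the matrix entries are on the order of 1. Even for high-precision situations, it will likely not introduce significant additional error. The reduction in memory is significant: the quantized matrix uses 1/4 of the memory (25 bytes instead of 100). The smaller memory footprint and the fewer bits to keep track of mean the model should run faster and more efficiently during inference.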

## KV Cache (40 points)

Key-Value (KV) caching is used to speed up inference with decoder-only transformers, which predict one token at a time.


Take some time to read through the code in the next cell and run it. Try to figure out what the code is doing. Once you've done that, explain the differences in the reported times and why the tensors match even though the times differ. **Be concise and precise.**

In [None]:
import torch
import time

seq_len = 2000
d_model = 768
d_internal = 64

W_k = torch.randn(d_model, d_internal, dtype=torch.float32)
W_v = torch.randn(d_model, d_internal, dtype=torch.float32)
def compute_kv(embedding):
    K = embedding @ W_k
    V = embedding @ W_v
    return K, V

keys_list = []
values_list = []
def store_kv(new_token):
    new_K, new_V = compute_kv(new_token)
    keys_list.append(new_K)
    values_list.append(new_V)

input_embeddings = torch.randn(seq_len, d_model, dtype=torch.float32)

start = time.time()
for i in range(seq_len):
    K1, V1 = compute_kv(input_embeddings[:i + 1])
end = time.time()
print(f"\nTime to compute K,V with Method 1: {end - start:.6f} seconds")

start = time.time()
for i in range(seq_len):
    store_kv(input_embeddings[i])
    K2 = torch.stack(keys_list)
    V2 = torch.stack(values_list)
end = time.time()
print(f"Time to compute K,V with Method 2: {end - start:.6f} seconds")

keys_match = torch.allclose(K1, K2, atol=1e-4, rtol=1e-3)
values_match = torch.allclose(V1, V2, atol=1e-4, rtol=1e-3)
print(f"\nKey tensors match: {keys_match}")
print(f"Value tensors match: {values_match}")



Time to compute K,V with Method 1: 4.066920 seconds
Time to compute K,V with Method 2: 1.909790 seconds

Key tensors match: True
Value tensors match: True


In the following text field provide your explanation.

The code computes the key and value matrices used in the transformer architecture. The first method takes all tokens up to the current position and computes keys and values for them, then does it all again with the next token included. This is a lot of redundant work. The second method computes keys and values only for the new token and appends them to the stored keys and values of the previous tokens. It is therefore more efficient.
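
The amount of redundant work can be counted directly. In the sketch below, one "row projection" means multiplying a single token embedding by `W_k` and `W_v`; the helper names are illustrative.

```python
seq_len = 2000  # same sequence length as in the cell above

def projections_without_cache(n):
    """Method 1: step i recomputes K and V for all i + 1 tokens seen so far."""
    return sum(i + 1 for i in range(n))

def projections_with_cache(n):
    """Method 2: each step projects only the newly added token."""
    return n

print(projections_without_cache(seq_len))  # 2001000
print(projections_with_cache(seq_len))     # 2000
```

The measured speedup (about 2x) is much smaller than this ratio of roughly 1000x. Likely contributors are that Method 2 calls `torch.stack` on the whole cache at every step, which copies all cached rows each time, and that one large matrix multiplication is far cheaper per row than many single-row ones.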

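The tensors match because both methods perform the same operation: each token's embedding is multiplied by the same fixed weight matrices, and in this code a token's key and value do not depend on the other tokens. It would be even more efficient to treat the entire input sequence as a single matrix and multiply it by the weights once to get the key and value matrices, which is possible when all tokens are known in advance.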
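
The equality of the two results can be reproduced at toy scale without any libraries. This sketch uses tiny made-up dimensions and a hand-written `project` helper (both assumptions for illustration): it builds the keys once by recomputing over the full prefix at every step, and once by appending one projected row per step.

```python
# Toy sizes and weights, chosen only for illustration.
W_k = [[1.0, 0.0], [0.5, -1.0], [2.0, 3.0]]   # d_model = 3, d_internal = 2
embeddings = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0], [2.0, 2.0, 2.0]]  # 3 tokens

def project(row, weights):
    """Multiply one embedding (length d_model) by a d_model x d_internal matrix."""
    return [sum(x * w for x, w in zip(row, col)) for col in zip(*weights)]

# Method 1: at every step, recompute keys for the whole prefix.
for i in range(len(embeddings)):
    keys_recomputed = [project(row, W_k) for row in embeddings[: i + 1]]

# Method 2: keep a cache and project only the new token at each step.
keys_cached = []
for row in embeddings:
    keys_cached.append(project(row, W_k))

print(keys_recomputed == keys_cached)  # True
```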

## Finetuning an LLM on SQuAD using Huggingface's libraries (20 points + Bonus)

[Huggingface](https://huggingface.co/) provides an ecosystem of pretrained language models, datasets, tokenizers, and other utilities. It simplifies the process of working with state-of-the-art NLP models and allows researchers and developers to finetune LLMs efficiently. Huggingface's `transformers`, `datasets`, and `accelerate` libraries are widely used across academia and industry for NLP research and deployment. **Experience with Huggingface is important if you plan to work in NLP in the future.**

[The Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/) is a benchmark dataset for the reading comprehension task. The goal is to predict the answer to a question from a given context passage. This passage is provided, not retrieved. SQuAD has been popular for several reasons:
- It encourages the development of models that understand context at a somewhat deeper level. This was challenging for state-of-the-art models at the time the dataset was introduced.
- It provides clear evaluation metrics, token-level F1 and exact match (EM) scores, which are easy to calculate.
- It has consequently driven significant advancements.
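
Both metrics are simple to compute. The sketch below follows the normalization commonly applied when scoring SQuAD (lowercasing, stripping punctuation and the articles a/an/the, collapsing whitespace); it is an illustrative re-implementation, not the official evaluation script, and the function names are chosen here for the example.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1
print(token_f1("in Paris, France", "Paris"))            # 0.5
```

When an example has several reference answers, a common convention is to take the maximum score over the references.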

In the next text cell, write about possible modeling choices for finetuning an LLM on SQuAD. Consider possible output layers, transformer types, model size, whether it matters that the model is instruction-finetuned or aligned, maximum input sequence length, etc. Comment on pros and cons. After you consider various possibilities, explain which approach you would choose and why.

For the transformer type, decoder-only is currently the most common, so it is probably the default choice. Model size is measured in parameters and ranges widely, from about 100 million to several hundred billion. For larger models like GPT-4o, computation and training time increase significantly, although they will also likely perform better on SQuAD. Ideally, we want to be between these extremes and have a model that performs well with relatively few parameters, such as the Llama 3 8B model.

Finetuned models can pick up subtler nuances of text, which is useful for predicting answers in SQuAD. It would be helpful to finetune the model on SQuAD passages and question/answer pairs.

Since the goal is to finetune on SQuAD, the input sequence length depends on the length of the passages in SQuAD. The maximum length should be longer than the longest passage in SQuAD, but not significantly longer, maybe 2-3x. This should work well with SQuAD and be somewhat generalizable.

In [this notebook](https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb) you can find the code for finetuning an LM to *extract* a span in a given text to answer a question. Please go through this code carefully. You can either prompt ChatGPT to explain portions of that code or ask us in Piazza. In the next part, you are going to approach reading comprehension in SQuAD by *generating* the answer, not extracting it.
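
To make the contrast with span extraction concrete, a generative setup turns each SQuAD record into a prompt and a target string. The sketch below assumes the field names of the `squad` dataset on the Hugging Face Hub (`context`, `question`, and `answers` with a `text` list); the prompt wording and the helper name `to_prompt_and_target` are hypothetical choices, not something prescribed by the assignment.

```python
def to_prompt_and_target(example):
    """Turn one SQuAD-style record into a (prompt, target) pair for generation."""
    prompt = (
        "Answer the question using the passage.\n\n"
        f"Passage: {example['context']}\n"
        f"Question: {example['question']}\n"
        "Answer:"
    )
    # Take the first reference answer as the text the model should generate.
    target = " " + example["answers"]["text"][0]
    return prompt, target

# A made-up record in the same shape as a SQuAD example.
example = {
    "context": "The University of Utah is located in Salt Lake City.",
    "question": "Where is the University of Utah located?",
    "answers": {"text": ["Salt Lake City"], "answer_start": [37]},
}
prompt, target = to_prompt_and_target(example)
print(prompt + target)
```

An extractive model would instead predict the start and end positions of the answer inside the passage; here the answer is produced as free text and compared with the reference afterwards.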

### Finetuning the Qwen2.5 language model on SQuAD to *generate* answers

#### Analysis of RAM usage (20 points)

In the following code cell you are going to write the code to load the tokenizer and weights of a Qwen2.5 model. The figure below shows the sizes in which this model comes. The name of the Qwen2.5 family of models in Huggingface is [Qwen/Qwen2.5-XB-Instruct](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e), where X should be replaced by a concrete number of parameters. Use the Auto class which does **not** make a new classification head for the task.

Start from the smallest model and load the weights, then restart the session (Runtime > Restart session or Cmd/Ctrl+M), and try to load the next larger model. At some point you will get:
> Your session crashed after using all available RAM.

In the next text cell, report the model size for which you get this error and explain why you get it.

![BLA](http://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5/Qwen2.5%20modelcard.001.jpeg)

In [None]:
from transformers import AutoTokenizer, AutoModel

model_size = '3'

model_name = f'Qwen/Qwen2.5-{model_size}B-Instruct'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/7.30k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/661 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/35.6k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/3.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

The session crashes when loading the 3B model (about 3.1B parameters). The 1.5B model is the largest one that loaded.

The session crashes because all of the model's parameters are loaded into the runtime's system RAM; by default `from_pretrained` places the weights on the CPU, not on the T4. Once RAM runs out, there is nowhere for the rest of the parameters to go, so the session crashes. Parameters are kept in RAM because it is significantly faster to access than an SSD.
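
A back-of-the-envelope estimate supports this. The sketch below assumes nominal parameter counts for each size, 4 bytes per parameter (float32, assumed here to be the dtype used when `from_pretrained` is called without a `torch_dtype` argument), and roughly 12.7 GB of system RAM on a free Colab runtime; all three are assumptions, not measurements from this notebook.

```python
BYTES_PER_PARAM_FP32 = 4
ASSUMED_RAM_GB = 12.7  # assumption: typical free Colab runtime

# Nominal parameter counts in billions (assumed from the model names).
nominal_sizes = {"0.5B": 0.5, "1.5B": 1.5, "3B": 3.0, "7B": 7.0}

def weights_gb(billions_of_params, bytes_per_param=BYTES_PER_PARAM_FP32):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return billions_of_params * 1e9 * bytes_per_param / 1e9

for name, billions in nominal_sizes.items():
    gb = weights_gb(billions)
    share = gb / ASSUMED_RAM_GB
    print(f"Qwen2.5-{name}: ~{gb:.1f} GB of weights, {share:.0%} of assumed RAM")
```

Under these assumptions the 3B weights alone take nearly all of the available RAM, leaving little room for the Python process and any temporary buffers used during loading, while the 1.5B model needs about half. The two downloaded shards total about 6.17 GB (3.97 GB + 2.20 GB), which would be consistent with a checkpoint stored at 2 bytes per parameter.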

#### Bonus (max. 15 points)

In this part, you can earn additional points by finetuning and evaluating a Qwen model on SQuAD v1 to generate *not extract* answers in the following two ways:

1. Finetuning with full precision. You will need to select an appropriate model size.
2. Finetuning using QLoRA.

*Please note that this section requires more effort than previous parts. Given the circumstances, I've decided to make the fourth assignment easier and turn this section into a bonus. Awarding too many points for the bonus part could inflate class grades, so I will award 15 points at max for it. If you're excited about NLP, I strongly encourage you to complete this part.*

As always, you are **not** allowed to directly prompt ChatGPT, Gemini, Claude, Copilot, or other LLMs to write the solution for you. However, you are permitted to seek help by analyzing related code available on the web or Huggingface resources. This mirrors how you would use Huggingface in real-world applications. However, **you must report all resources used at the end of this notebook!** Failing to do so is considered academic misconduct. You may also seek help from LLMs to explain errors in your code. **However, you must report any LLM usage for this purpose and include your prompts!** Failing to report this is also considered academic misconduct.

🚨 We will conduct ***one-on-one interviews*** with anyone who submits their solutions to this part to probe their understanding of the submitted code and related concepts. There will be no penalty for not demonstrating sufficient understanding, but we may withhold points if there is a significant lack of understanding of the code and concepts. This addition to the evaluation is intended to avoid rewarding the submission of an LLM's solution, which is challenging to detect. 🚨

**Other requirements:**
- You must load the SQuAD dataset using the [datasets](https://huggingface.co/docs/datasets/en/index) library.
- You must use [AutoTokenizer](https://huggingface.co/docs/transformers/v4.46.0/en/model_doc/auto#transformers.AutoTokenizer) to load the tokenizer.
- Use [map()](https://huggingface.co/docs/datasets/v3.0.2/en/package_reference/main_classes#datasets.Dataset.map) to tokenize the entire dataset efficiently.
- You must report the token-F1 and exact match for the validation split of the data.
- Use Huggingface's [Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer) to finetune your model.
- You can explore using a sample of training data for finetuning.

**These resources might be helpful:**

* [Check "QLoRa, an even more efficient method" here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Mistral/Supervised_fine_tuning_(SFT)_of_an_LLM_using_Hugging_Face_tooling.ipynb)
* [The HF notebook that goes through the recent bitsandbytes integration](https://colab.research.google.com/drive/1VoYNfYDKcKRQRor98Zbf2-9VQTtGJ24k)
* [HF Transformers Notebooks](https://huggingface.co/docs/transformers/en/notebooks)

**Tip**:

* I find `wandb` distracting and disable it with `import os; os.environ["WANDB_MODE"] = "disabled"`.



Please provide your implementation and run the cells such that we can see the token-f1 and exact match on the validation set.

## Report of resources and prompts used

Report here all materials and prompts you used.

Prompt: Are LLMs mostly stored in RAM? Why?

In [None]:
# This mounts your Google Drive to the Colab VM.
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# TODO: Enter the foldername in your Drive where you have saved the unzipped
# assignment folder, e.g. 'cs6353/assignments/assignment3/'
FOLDERNAME = 'U of U/Semester 7 (Fall 24)/CS NLP/Assignment 4'
assert FOLDERNAME is not None, "[!] Enter the foldername."

# Now that we've mounted your Drive, this ensures that
# the Python interpreter of the Colab VM can load
# python files from within it.
import sys
sys.path.append('/content/drive/My Drive/{}'.format(FOLDERNAME))

Mounted at /content/drive


In [None]:
!apt-get install texlive texlive-xetex texlive-latex-extra pandoc



In [None]:
!jupyter nbconvert --to pdf A4CSNLP.ipynb



