# Task 1. Interpretation with Logit Lens

Logit Lens is an interpretation technique introduced in [this post](https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens). The idea is the following. Imagine that we predict the continuation of the phrase "IPhone was developed by". We naturally expect to see "Apple", but we're also curious about the "thought process" of the LLM, so we **feed the outputs of intermediate layers (intermediate transformer blocks) to the classification head** to see *what the LLM would output if we cut its "thought process" short in the middle*. The general trend, as one moves from earlier to later layers, is
- "nonsense / not interpretable" (sometimes, in very early layers) -->
- "shallow guesses" (words that are the right part of speech / register / etc.) -->
- "better guesses" near the end.

However, it's not always like that, of course.

The author of the Logit Lens also created visualization tools and published a [jupyter notebook demo](https://colab.research.google.com/drive/1MjdfK2srcerLrAJDRaJQKO0sUiZ-hQtA?usp=sharing) with cool pictures, but in this task you'll need to reproduce the Logit Lens technique on your own.

**Note** If you're really short on compute, you can use GPT-2 instead of zephyr, but you risk losing all the fun and most of interpretability.

**Task 1.1. (1 point)** Write a function

```
logit_lens(model, input_sentence, top_k)
```

that for each transformer block returns a dictionary

```
{
    'top_tokens' : [
        sorted list of top_k tokens,
        from most probable to least probable,
        according to the classification head
        ],
    'top_tokens_logits' : [logits of these tokens]
}
```

Hint:
- To get hidden states of a model you'll need to use `model(**encoded_input, output_hidden_states=True)` instead of `model.generate`


Here is how it should work:

In [1]:
!pip install -q transformers
!pip install -q accelerate

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('stabilityai/stablelm-zephyr-3b')
model = AutoModelForCausalLM.from_pretrained(
    'stabilityai/stablelm-zephyr-3b',
    trust_remote_code=True,
    device_map="auto"
)

prompt = [{'role': 'user', 'content': 'List 3 synonyms for the word "tiny"'}]
inputs = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    return_tensors='pt'
)

tokens = model.generate(
    inputs.to(model.device),
    max_new_tokens=1024,
    temperature=0.8,
    do_sample=True
)

print(tokenizer.decode(tokens[0], skip_special_tokens=False))

<|user|>
List 3 synonyms for the word "tiny"<|endoftext|>
<|assistant|>
1. A wee 
2. A miniature 
3. Petite<|endoftext|>


In [7]:
import numpy as np
import torch

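def logit_lens(model, input_sentence, top_k=5):
    # NOTE: relies on the global `tokenizer` defined above.
    model = model.eval()

    # Keep the inputs on the same device as the model (it was loaded with device_map="auto").
    tokenized_sequence_ids = tokenizer.encode(input_sentence, return_tensors="pt").to(model.device)
    with torch.no_grad():  # inference only, no gradients needed
        hidden_states = model(tokenized_sequence_ids, output_hidden_states=True).hidden_states

    result_list = list()
    for i, hidden_state in enumerate(hidden_states):
        # Apply the classification head to the last position only.
        logits = model.lm_head(hidden_state)[:, -1, :]
        top_k_values, top_k_indices = torch.topk(logits, top_k)
        indices = top_k_indices.flatten().detach().cpu().numpy()
        tokens = tokenizer.convert_ids_to_tokens(indices)
        # Replace the first character (assumed to be the BPE space marker) with a space.
        # Tokens that do not start with the marker lose their first character.
        dict_tokens = [" " + str(token[1:]) for token in tokens]
        dict_values = top_k_values.flatten().detach().cpu().numpy()
        dict_element = {
            'top_tokens': dict_tokens,
            'top_tokens_logits': dict_values
        }
        result_list.append(dict_element)
    return result_list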



In [8]:
result = logit_lens(model, "IPhone was developed by", top_k=5)

In [9]:
result[-2:]

[{'top_tokens': [' Apple', ' a', ' AT', ' ', ' S'],
  'top_tokens_logits': array([7.0456634, 6.7790318, 6.1888924, 5.987253 , 5.8684807],
        dtype=float32)},
 {'top_tokens': [' Apple', ' a', ' the', ' IBM', ' an'],
  'top_tokens_logits': array([21.439018, 18.034657, 15.898301, 15.69769 , 15.678806],
        dtype=float32)}]

As you see, "Apple" appears as the most probable token in the last two layers.
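
Raw logits from different layers live on different scales (about 7 in the penultimate layer versus about 21 in the last one), so it can help to turn them into probabilities before comparing layers. Below is a minimal sketch in plain Python. It normalises over the reported top-5 logits only, which is a simplifying assumption: a proper softmax would run over the full vocabulary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Top-5 tokens and logits of the last layer, copied from the output above.
tokens = [" Apple", " a", " the", " IBM", " an"]
last_layer_logits = [21.439018, 18.034657, 15.898301, 15.69769, 15.678806]

# ASSUMPTION: normalising over the top-5 only (not the whole vocabulary)
# overestimates every share; it is done here just to keep the example short.
for token, p in zip(tokens, softmax(last_layer_logits)):
    print(f"{token!r}: {p:.3f}")
```

Inside `logit_lens` itself the same thing could be done with `torch.softmax` on the full logit vector before taking the top-k.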

**Task 1.2 (2 points)**

Now you'll use Logit Lens to investigate how transformers deal with redefinition.

Use Logit Lens on the sentence

```
"In this text the word IPhone means Windows operating system. IPhone was developed by"
```

Look at the most probable tokens for all layers. A good LLM knows through *memorization* that IPhone was developed by Apple. However, *in-context learning* will push it to output Microsoft. Check in which layers the most probable token is Microsoft and in which it is Apple.

Perform experiments with 9 other sentences with redefinition. Can you observe a pattern of competition between memorization and in-context learning?
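
To avoid reading long outputs by eye, the per-layer check can be automated. The sketch below assumes the list-of-dicts format returned by `logit_lens`; the helper name `layers_with_top_token` and the toy data are made up for illustration.

```python
def layers_with_top_token(result, token):
    """Return indices of entries whose most probable token equals `token`."""
    wanted = token.strip()
    return [
        layer
        for layer, entry in enumerate(result)
        if entry["top_tokens"] and entry["top_tokens"][0].strip() == wanted
    ]


# Toy stand-in for the output of logit_lens (hypothetical values).
toy_result = [
    {"top_tokens": [" means", " way"]},
    {"top_tokens": [" Apple", " Microsoft"]},
    {"top_tokens": [" Microsoft", " Apple"]},
]

print("Apple is top-1 at:", layers_with_top_token(toy_result, "Apple"))
print("Microsoft is top-1 at:", layers_with_top_token(toy_result, "Microsoft"))
```

Keep in mind that with `output_hidden_states=True` the first entry usually corresponds to the embedding output rather than to a transformer block, so index 0 is not "layer 1".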

In [10]:
result = logit_lens(model, "In this text the word IPhone means Windows operating system. IPhone was developed by", top_k=5)
result[:]

[{'top_tokens': [' means', ' way', ' virtue', ' Means', ' ĩ'],
  'top_tokens_logits': array([0.03197266, 0.02827449, 0.02588894, 0.02294105, 0.02270618],
        dtype=float32)},
 {'top_tokens': [' [...]', ' means', ' the', ' way', ' Ľ'],
  'top_tokens_logits': array([0.06112102, 0.05731745, 0.0570225 , 0.05604238, 0.05276864],
        dtype=float32)},
 {'top_tokens': [' means', ' way', ' Ľ', ' one', ' the'],
  'top_tokens_logits': array([0.08996621, 0.0829972 , 0.07946736, 0.07795769, 0.07659164],
        dtype=float32)},
 {'top_tokens': [' means', ' the', ' eans', ' inf', ' divers'],
  'top_tokens_logits': array([0.11120468, 0.09633729, 0.09613051, 0.09183606, 0.08905748],
        dtype=float32)},
 {'top_tokens': [' pan', ' ES', ' the', ' means', ' Epstein'],
  'top_tokens_logits': array([0.18458788, 0.16906188, 0.16186722, 0.16114132, 0.15615094],
        dtype=float32)},
 {'top_tokens': [' pan', ' I', ' ez', ' Prof', ' Jesse'],
  'top_tokens_logits': array([0.22438748, 0.21001042, 

In [11]:
result = logit_lens(model, "In this text, the word 'coffee' means 'solar energy.' The world's largest coffee plant is located in", top_k=5)
result[:]

[{'top_tokens': [' cribed', ' accordance', ' criptions', ' vitro', ' spite'],
  'top_tokens_logits': array([0.03589612, 0.02858505, 0.02758173, 0.02661785, 0.0255783 ],
        dtype=float32)},
 {'top_tokens': [' ffect', ' [...]', ' ', ' trop', ' the'],
  'top_tokens_logits': array([0.06045042, 0.05886982, 0.05778028, 0.05444053, 0.05206032],
        dtype=float32)},
 {'top_tokens': [' xperienced', ' trop', ' ul', ' cribed', ' ffect'],
  'top_tokens_logits': array([0.09249139, 0.0907546 , 0.08961549, 0.08892887, 0.08602268],
        dtype=float32)},
 {'top_tokens': [' xperienced', ' ffect', ' cribed', ' ul', ' Syracuse'],
  'top_tokens_logits': array([0.1278651 , 0.11917061, 0.11349771, 0.11069141, 0.10752887],
        dtype=float32)},
 {'top_tokens': [' µÑģÑĤ', ' Ľ', ' duc', ' br', ' ©nÃ©'],
  'top_tokens_logits': array([0.19282682, 0.1927225 , 0.18679161, 0.18534286, 0.17976344],
        dtype=float32)},
 {'top_tokens': [' p', ' pp', ' Spin', ' spin', ' ©nÃ©'],
  'top_tokens_logits':

In [12]:
result = logit_lens(model, "In this document, 'bicycle' refers to 'the Internet.' The invention of the bicycle drastically changed", top_k=5)
result[:]

[{'top_tokens': [' ru', ' Gru', ' ecent', ' ouch', ' '],
  'top_tokens_logits': array([0.02005338, 0.01981762, 0.01832283, 0.01783607, 0.0174183 ],
        dtype=float32)},
 {'top_tokens': [' lly', ' all', ' ggest', ' V', ' T'],
  'top_tokens_logits': array([0.04918555, 0.04813809, 0.04720778, 0.04681649, 0.04498557],
        dtype=float32)},
 {'top_tokens': [' ', ' bs', ' Look', ' g', ' ver'],
  'top_tokens_logits': array([0.07606998, 0.07529482, 0.07473864, 0.0731097 , 0.07237047],
        dtype=float32)},
 {'top_tokens': [' L', ' Look', ' pendent', ' fo', ' ol'],
  'top_tokens_logits': array([0.10960124, 0.10764145, 0.10713266, 0.10563803, 0.09831663],
        dtype=float32)},
 {'top_tokens': [' Belf', ' pendent', ' eleton', ' L', ' fo'],
  'top_tokens_logits': array([0.14445725, 0.14120461, 0.13766582, 0.13728945, 0.13702376],
        dtype=float32)},
 {'top_tokens': [' gen', ' knob', ' g', ' (>', ' ul'],
  'top_tokens_logits': array([0.29140025, 0.23145452, 0.22231632, 0.22100216,

In [13]:
result = logit_lens(model, "Here, 'chocolate' is used to mean 'artificial intelligence.' Chocolate has significantly evolved over the past decade, especially in", top_k=5)
result[:]

[{'top_tokens': [' cribed', ' accordance', ' criptions', ' vitro', ' spite'],
  'top_tokens_logits': array([0.03589612, 0.02858505, 0.02758173, 0.02661785, 0.0255783 ],
        dtype=float32)},
 {'top_tokens': [' ffect', ' [...]', ' ', ' trop', ' sk'],
  'top_tokens_logits': array([0.05948918, 0.05922301, 0.05643798, 0.05490062, 0.05076383],
        dtype=float32)},
 {'top_tokens': [' cribed', ' xperienced', ' ul', ' trop', ' lusively'],
  'top_tokens_logits': array([0.09279542, 0.08920643, 0.0890229 , 0.08867558, 0.0818992 ],
        dtype=float32)},
 {'top_tokens': [' nden', ' ul', ' uctor', ' emin', ' ersed'],
  'top_tokens_logits': array([0.1290568 , 0.12741765, 0.11830679, 0.11261178, 0.10981558],
        dtype=float32)},
 {'top_tokens': [' ul', ' regards', ' bled', ' uctor', ' nden'],
  'top_tokens_logits': array([0.20265877, 0.17241001, 0.16472152, 0.16444688, 0.16435823],
        dtype=float32)},
 {'top_tokens': [' ette', ' regards', ' ek', ' terms', ' nes'],
  'top_tokens_logi

In [14]:
result = logit_lens(model, "In this scenario, 'shoes' stand for 'spacecraft.' The most advanced shoes are designed for travel beyond", top_k=5)
result[:]

[{'top_tokens': [' mes', ' cular', ' n', ' norm', ' yourselves'],
  'top_tokens_logits': array([0.01973564, 0.01658559, 0.01656543, 0.01641238, 0.01615398],
        dtype=float32)},
 {'top_tokens': [' cin', ' am', ' plaus', ' ams', ' ys'],
  'top_tokens_logits': array([0.06580061, 0.06410584, 0.0613165 , 0.06072294, 0.05977384],
        dtype=float32)},
 {'top_tokens': [' es', ' ams', ' mes', ' sper', ' am'],
  'top_tokens_logits': array([0.11488405, 0.10880438, 0.10566484, 0.10065822, 0.10010798],
        dtype=float32)},
 {'top_tokens': [' es', ' ams', ' cate', ' am', ' misses'],
  'top_tokens_logits': array([0.13997759, 0.13968557, 0.13698061, 0.13145879, 0.11943168],
        dtype=float32)},
 {'top_tokens': [' cate', ' mes', ' cha', ' am', ' ams'],
  'top_tokens_logits': array([0.18522604, 0.18080069, 0.1706951 , 0.16831215, 0.16750231],
        dtype=float32)},
 {'top_tokens': [' ©e', ' cate', ' ¾Ð´', ' am', ' [âĢ¦]'],
  'top_tokens_logits': array([0.2285909 , 0.21736348, 0.216878

In [15]:
result = logit_lens(model, "For the purpose of this discussion, 'library' means 'quantum computer.' The largest library in the world can process", top_k=5)
result[:]

[{'top_tokens': [' at', ' ly', ' onal', ' Ĭł', ' kindly'],
  'top_tokens_logits': array([0.0205835 , 0.02025671, 0.02006477, 0.01999479, 0.01860488],
        dtype=float32)},
 {'top_tokens': [' vely', ' rs', ' ly', ' lled', ' sh'],
  'top_tokens_logits': array([0.06100681, 0.05752692, 0.05752672, 0.05730348, 0.05436922],
        dtype=float32)},
 {'top_tokens': [' ons', ' oned', ' ally', ' onal', ' s'],
  'top_tokens_logits': array([0.13491163, 0.12558194, 0.11844752, 0.11355867, 0.11017549],
        dtype=float32)},
 {'top_tokens': [' s', ' ons', ' lam', ' il', ' r'],
  'top_tokens_logits': array([0.13920127, 0.13772811, 0.13374054, 0.12912603, 0.12817669],
        dtype=float32)},
 {'top_tokens': [' ally', ' ³', ' rs', ' lled', ' nto'],
  'top_tokens_logits': array([0.19994949, 0.19337416, 0.19044028, 0.1861839 , 0.18003878],
        dtype=float32)},
 {'top_tokens': [' lled', ' rs', ' lls', ' pp', ' veness'],
  'top_tokens_logits': array([0.2684486 , 0.2535708 , 0.25159794, 0.2402087

In [None]:
result = logit_lens(model, "In the context of this text, 'penguin' means 'electric car.' Penguins are now capable of traveling up to", top_k=5)
result[:]

In [None]:
result = logit_lens(model, "Here, 'mirror' is meant to signify 'blockchain technology.' Mirrors have become foundational in securing", top_k=5)
result[:]

In [None]:
result = logit_lens(model, "In this narrative, 'guitar' refers to 'virtual reality.' Guitars have been used to create immersive experiences that", top_k=5)
result[:]

In [None]:
result = logit_lens(model, "In this discussion, 'rain' is a metaphor for 'data encryption.' Rain is crucial for protecting", top_k=5)
result[:]

## Observation

Indeed, more meaningful tokens appear in the last layers, while the ones from the first layers are meaningless. However, I noticed that the LM sometimes tries to take both meanings into account, which results in a more neutral response.

# Task 2. Rotary embeddings

**Task 2.1 (3 points)**

In this task you'll need to code rotary embeddings. Actually, they are not just embeddings, but rather a transformation that is applied to queries and keys. It works as follows:

$$f_q(x_m, m) = x_mW_QR^d_{\Theta, m},\quad f_k(x_n, n) = x_nW_KR^d_{\Theta, n},$$
where
$$R^d_{\Theta, m} =
\begin{pmatrix}
\cos{m\theta_1} & \sin{m\theta_1} & 0 & 0 & \dots & 0 & 0\\
-\sin{m\theta_1} & \cos{m\theta_1} & 0 & 0 & \dots & 0 & 0\\
0 & 0 & \cos{m\theta_2} & \sin{m\theta_2} & \dots & 0 & 0\\
0 & 0 & -\sin{m\theta_2} & \cos{m\theta_2} & \dots & 0 & 0\\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\
0 & 0 & 0 & 0 & \dots & \cos{m\theta_{d/2}} & \sin{m\theta_{d/2}}\\
0 & 0 & 0 & 0 & \dots & -\sin{m\theta_{d/2}} & \cos{m\theta_{d/2}}\\
\end{pmatrix},$$
where the parameters $\Theta$ are set to
$$\theta_i = b^{-2(i-1)/d}$$
for some base $b$ (default $10000$).
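
To get a feeling for the frequencies, here is a small plain-Python sketch that evaluates $\theta_i = b^{-2(i-1)/d}$ for $i = 1, \dots, d/2$. The function name `rotary_thetas` and the choice $d = 8$ are illustrative assumptions.

```python
def rotary_thetas(dim, base=10_000.0):
    """theta_i = base ** (-2 * (i - 1) / dim) for i = 1..dim/2."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    return [base ** (-2.0 * (i - 1) / dim) for i in range(1, dim // 2 + 1)]


# The first frequency is always 1; the following ones decay geometrically.
print(rotary_thetas(8))
```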

As you see, the transformation is the same for both queries and keys, so we just need one transformation that takes a tensor of size `[batch_size, num_heads, seq_len, head_size]` and "rotates" it, outputting a tensor of the same size.

Please try your best to make calculations efficient and don't forget to use `torch` (and not `numpy`) and load all the tensors to the right `device`.

Hints:
1. As we've discussed in the longread, rotary embeddings can be $f_{q,k}(x_m, m) = x_mW_{Q,K}R^d_{\Theta, m}$ or $f_{q,k}(x_m, m) = \left(R^d_{\Theta, m}\right)^TW_{Q,K}x_m$ depending on whether $x_m$ is a row vector (first formula) or a column vector (second formula). In our case the input dimension is `[batch_size, num_heads, seq_len, head_size]`, so after we choose a particular batch element and a particular attention head, $x_m$ has dimension `(1, head_size)` which is a row.
2. Recalculating all the $\theta_i = b^{-2(i-1)/d}$, sines and cosines takes a lot of time. A good idea is to cache them as soon as they are needed. You can either calculate $R^d_{\Theta, m}$ for a large sequence length right at the initialization or:
  - At initialization cache $R^d_{\Theta, m}$ for moderately short sequences;
  - When you encounter a longer sequence, recalculate and cache again.
3. We don't need to store all the matrix $R^d_{\Theta, m}$ because it has too many zeros. Actually, we can just store sines and cosines and do $x \mapsto xR^d_{\Theta, m}$ in linear time:
$$xR^d_{\Theta, m} = \begin{pmatrix}
x_1\cos{m\theta_1} - x_2\sin{m\theta_1}\\
x_1\sin{m\theta_1} + x_2\cos{m\theta_1}\\
x_3\cos{m\theta_2} - x_4\sin{m\theta_2}\\
x_3\sin{m\theta_2} + x_4\cos{m\theta_2}\\
\vdots
\end{pmatrix} =
\begin{pmatrix}x_1\\x_2\\x_3\\x_4\\\vdots\end{pmatrix}\otimes
\begin{pmatrix}\cos{m\theta_1}\\\cos{m\theta_1}\\\cos{m\theta_2}\\\cos{m\theta_2}\\\vdots\end{pmatrix} +
\begin{pmatrix}-x_2\\x_1\\-x_4\\x_3\\\vdots\end{pmatrix}\otimes
\begin{pmatrix}\sin{m\theta_1}\\\sin{m\theta_1}\\\sin{m\theta_2}\\\sin{m\theta_2}\\\vdots\end{pmatrix}
$$
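
The linear-time formula from hint 3 can be sanity-checked in plain Python before writing the `torch` version. The sketch below rotates a single row vector pair by pair and checks two properties one would expect from a rotation: the dot product of a rotated query and key depends only on the offset $m - n$, and the norm is preserved. The vectors `q` and `k` and the helper names are made up for illustration; this is not the vectorised, cached implementation the task asks for.

```python
import math


def rotate(x, m, base=10_000.0):
    """Compute x R_m for a row vector x (list of even length) at position m."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)  # i is zero-based, so this is theta_{i+1}
        c, s = math.cos(m * theta), math.sin(m * theta)
        x1, x2 = x[2 * i], x[2 * i + 1]
        out.append(x1 * c - x2 * s)
        out.append(x1 * s + x2 * c)
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


# Hypothetical query and key vectors with head_size = 4.
q = [0.3, -1.2, 0.7, 2.0]
k = [1.5, 0.4, -0.6, 0.9]

# Same offset m - n = 3 at two different absolute positions.
score_a = dot(rotate(q, 5), rotate(k, 2))
score_b = dot(rotate(q, 105), rotate(k, 102))
print("scores match:", abs(score_a - score_b) < 1e-9)
print("norm preserved:", abs(dot(rotate(q, 7), rotate(q, 7)) - dot(q, q)) < 1e-9)
```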


In [None]:
from typing import Optional

import torch
from torch import nn


class RotaryEmbedding(nn.Module):
    def __init__(
        self,
        dim: int,
        max_position_embeddings: int,
        base: int = 10_000,
        device: Optional[torch.device] = None,
    ):
        super().__init__()
        pass

    def forward(self, x: torch.Tensor, seq_len: Optional[int] = None):
        # x: [batch_size, num_heads, seq_len, head_size]
        # returns: [batch_size, num_heads, seq_len, head_size]
        pass

**Task 2.2 (1 point)** Take a model from Hugging Face that has rotary embeddings. You can use `stabilityai/stablelm-zephyr-3b` or `mistralai/Mistral-7B-v0.1`, but many others will also work. Somewhere in `model.model.layers` you'll find the `RotaryEmbedding()` layer. Play around changing the `base` parameter. What do you observe if you make the base very small? Very large? How would the outputs of the model change? What would you expect to observe? Please don't only output sentences, but also provide some reflection.
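
Before touching the model, it may help to look at how `base` changes the rotation periods. Each pair of coordinates completes a full turn every $2\pi/\theta_i$ positions; the sketch below prints the shortest and the longest of these periods for a few bases. The head size of 64 and the helper name `wavelengths` are illustrative assumptions and are not read from any model config.

```python
import math


def wavelengths(dim, base):
    """Number of positions after which each angle m * theta_i completes a full turn."""
    return [2 * math.pi / base ** (-2.0 * i / dim) for i in range(dim // 2)]


# ASSUMPTION: dim=64 is just an example head size.
for base in (10.0, 10_000.0, 10_000_000.0):
    w = wavelengths(dim=64, base=base)
    print(f"base={base:>12.0f}  shortest={w[0]:.2f}  longest={w[-1]:.1f}")
```

This only describes the geometry of the rotation; how the model's outputs react to it is what the task asks you to observe.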

In [None]:
# Your code here

# Task 3. Going deeper into Mixtral

In this task you'll try to better understand what happens inside the Mixtral of Experts model and, at the same time, fine tune it with QLoRA.

**Caution**. Mixtral is quite large. It will consume >90 GB of disk space and ~23 GB of GPU VRAM when we load its 4-bit version. So, for a comfortable experience with this task, you will need an A100 with 200 GB of disk space. If you can't get access to this hardware, you can still do whatever doesn't require actually running the models, and there is a bonus task about Mistral at the end which you can do to get points.
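
The disk and VRAM figures above can be roughly reproduced with back-of-the-envelope arithmetic. The sketch assumes about 46.7 billion parameters in total and a 16-bit checkpoint; both numbers are assumptions that do not come from this notebook, and the estimate ignores quantization constants, activations and other overhead.

```python
# ASSUMPTION: approximate total parameter count of Mixtral 8x7B.
N_PARAMS = 46.7e9


def size_gb(n_params, bits_per_param):
    """Raw weight storage in gigabytes (10**9 bytes), ignoring any overhead."""
    return n_params * bits_per_param / 8 / 1e9


print(f"16-bit checkpoint: ~{size_gb(N_PARAMS, 16):.0f} GB")
print(f"4-bit weights:     ~{size_gb(N_PARAMS, 4):.0f} GB")
```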

Apart from studying Mixtral, we will fine tune it. I've taken most of the fine tuning part from [this notebook](https://colab.research.google.com/github/brevdev/notebooks/blob/main/mixtral-finetune.ipynb#scrollTo=ece42f7c-3825-45c7-9afc-efb355e9474c).


First, let's load the libraries:

In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U datasets scipy ipywidgets matplotlib

In [None]:
from accelerate import FullyShardedDataParallelPlugin, Accelerator
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig

fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

For fine tuning we'll be using the [Viggo functional representation](https://huggingface.co/datasets/GEM/viggo) dataset. This dataset contains messages about video games such as

*You said you loved The Legend of Zelda: Ocarina of Time. Do you often tend to play similar Nintendo games that are also rated E?*

and meaning representations of such messages:

*verify_attribute(name[The Legend of Zelda: Ocarina of Time], esrb[E (for Everyone)], rating[excellent], platforms[Nintendo])*

We'll try to teach Mixtral to transform messages into their meaning representations. This is a good task for fine tuning, because it is about format tuning, not factuality.

In [None]:
from datasets import load_dataset

train_dataset = load_dataset('gem/viggo', split='train')
eval_dataset = load_dataset('gem/viggo', split='validation')
test_dataset = load_dataset('gem/viggo', split='test')

## Load base model

We'll be downloading the model in full precision (so it will take ~100 GB on the disk) and then loading it in 4-bit quantization on GPU.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForLanguageModeling, BitsAndBytesConfig

base_model_id = "mistralai/Mixtral-8x7B-v0.1"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=bnb_config, device_map="cuda")

**Task 3.1. [1 point]** Check which layers are 4-bit quantized and which are not. Actually, you can learn it by just typing `model` and printing its structure, but we encourage you to actually check the `dtype` to be sure about the result.

You will find out that not all layers are quantized. Estimate the proportion of parameters that stay in full precision. Why aren't these parameters quantized? Any reasonable hypothesis will get points.
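
One possible way to approach the check is to group parameters by `dtype` and sum their sizes. The helper below is a sketch with a made-up name; it only assumes that each parameter exposes `.dtype` and `.numel()`, as `torch` parameters do. Also note an assumption worth verifying on your model: 4-bit weights are stored packed into `uint8` tensors, so `numel()` of a quantized weight is likely to report the packed size rather than the original number of parameters.

```python
from collections import defaultdict


def params_by_dtype(named_parameters):
    """Sum `numel()` per dtype over an iterable of (name, parameter) pairs."""
    totals = defaultdict(int)
    for _name, param in named_parameters:
        totals[str(param.dtype)] += param.numel()
    return dict(totals)


# Usage with the model loaded above (not executed here):
# for dtype, count in params_by_dtype(model.named_parameters()).items():
#     print(dtype, count)
```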

[Your answer here]

## Inspecting the model

**Task 3.2. [1 point]** The Mixtral paper provides the following model characteristics:

| Parameter | Value |
| :--- |  ---: |
| dim | 4096 |
| n_layers | 32 |
| head_dim | 128 |
| hidden_dim | 14336 |
| n_heads | 32 |
| n_kv_heads | 8 |
| context_len | 32768 |
| vocab_size | 32000 |
| num_experts | 8 |
| top_k_experts | 2 |

Some numbers can also be obtained by printing the `model`. Print it and browse through the dimensions.

Some of the numbers are easier to understand. For example, we have 8 experts and 2 of them are used at inference. That's ok.

Now, we have several questions for you:

1. Why are the output dimensions (`out_features` in the model rollout) of `q_proj`, `k_proj` and `v_proj` different from 128 which is the dimension of a head (see table above)?
2. Why are the output dimensions (`out_features` in the model rollout) of `k_proj` and `v_proj` different from the output dimensions of `q_proj`?

Check the long read, it should be enough to answer all the questions.

[Your answer here]

## Generating prompts for fine tuning

In [None]:
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    add_eos_token=True,
    add_bos_token=True,
)

def tokenize(prompt):
    result = tokenizer(prompt)
    result["labels"] = result["input_ids"].copy()
    return result

We will be training Mixtral to transform text to meaning representation when prompted by a specific command. Namely, we'll fine tune it on prompts of the following form:

```
<s> Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values.
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].
The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']

### Target sentence:
Dirt: Showdown is a sport racing game that was released in 2012. The game is available on PlayStation, Xbox, and PC, and it has an ESRB Rating of E 10+ (for Everyone 10 and Older). However, it is not yet available as a Steam, Linux, or Mac release.

### Meaning representation:
inform(name[Dirt: Showdown], release_year[2012], esrb[E 10+ (for Everyone 10 and Older)], genres[driving/racing, sport], platforms[PlayStation, Xbox, PC], available_on_steam[no], has_linux_release[no], has_mac_release[no])
</s>
```

For fine tuning we will be padding the texts, and for that we need to understand the distribution of their lengths:

In [None]:
import matplotlib.pyplot as plt

def plot_data_lengths(tokenized_train_dataset, tokenized_val_dataset):
    lengths = [len(x['input_ids']) for x in tokenized_train_dataset]
    lengths += [len(x['input_ids']) for x in tokenized_val_dataset]
    print(len(lengths))

    # Plotting the histogram
    plt.figure(figsize=(8, 5))
    plt.hist(lengths, bins=20, alpha=0.7, color='blue')
    plt.xlabel('Length of input_ids')
    plt.ylabel('Frequency')
    plt.title('Distribution of Lengths of input_ids')
    plt.show()

plot_data_lengths(tokenized_train_dataset, tokenized_val_dataset)

You will find that 340 is a good estimate of the max length. We will include padding and truncation into the `tokenize` routine:

In [None]:
max_length = 340 # This was an appropriate max length for my dataset

# redefine the tokenize function and tokenizer

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
)
tokenizer.pad_token = tokenizer.eos_token


def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()
    return result
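
To make the effect of `truncation=True`, `padding="max_length"` and `padding_side="left"` concrete, here is a plain-Python sketch of what happens to a list of token ids. The helper `pad_left` and the ids are made up for illustration; the real work is done by the tokenizer.

```python
def pad_left(ids, max_length, pad_id):
    """Truncate to max_length, then pad on the left with pad_id."""
    ids = ids[:max_length]
    return [pad_id] * (max_length - len(ids)) + ids


# Hypothetical token ids; 2 plays the role of the eos/pad token here.
print(pad_left([1, 415, 2188, 2], max_length=6, pad_id=2))
print(pad_left(list(range(10)), max_length=6, pad_id=2))
```

Since `labels` is a plain copy of `input_ids`, the padded positions end up in the labels as well.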

And now, let's assemble the prompts for fine tuning:

In [None]:
def generate_and_tokenize_prompt(data_point):
    full_prompt = f"""Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values.
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].
The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']

### Target sentence:
{data_point["target"]}

### Meaning representation:
{data_point["meaning_representation"]}
"""
    return tokenize(full_prompt)

In [None]:
tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

## Running the model for the first time

Now, let's try to apply Mixtral!

In [None]:
eval_prompt = """Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values.
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].
The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']

### Target sentence:
Earlier, you stated that you didn't have strong feelings about PlayStation's Little Big Adventure. Is your opinion true for all games which don't have multiplayer?

### Meaning representation:
"""

We can apply the accelerator to the model. If you're using Colab, it can start raising annoying errors like

```
TypeError: device() received an invalid combination of arguments - got (NoneType), but expected one of:
 * (torch.device device)
      didn't match because some of the arguments have invalid types: (!NoneType!)
 * (str type, int index)

```

Most likely, you'll be able to overcome it by running

`model.hf_device_map = {'': torch.device('cuda', index=0)}`

In [None]:
# Apply the accelerator. You can comment this out to remove the accelerator.
model = accelerator.prepare_model(model)

In [None]:
# Re-init the tokenizer so it doesn't add padding or eos token
eval_tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    add_bos_token=True,
)

In [None]:
device = "cuda"
model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to(device)

In [None]:
model.eval()
with torch.no_grad():
    print(eval_tokenizer.decode(model.generate(**model_input, max_new_tokens=128)[0], skip_special_tokens=True))

You can see that the model doesn't understand which formatting we are expecting from it. We will try to improve it with fine tuning.

## Investigating Mixtral experts

**Task 3.3. [2 points]** Apply Mixtral to some sentence

`outputs = model(**model_input, output_router_logits=True)`

to make it return router logits (available as `outputs.router_logits`) that decide which experts are used for inference.

Do the logits stay the same when you apply Mixtral several times to the same sentence? Why?

Now:
- get router logits of 10th, 20th and 30th layers for the first 100 elements of `tokenized_train_dataset`.
- For each expert of each of these layers, make a list of tokens for which this expert has top-1 logit (yes, please do it separately for each of the experts),
- Output top-5 most frequent tokens from each of the lists.

Your output should be like:

```
At layer 20:

Expert 0:
[('token1', how_many_times_expert_0_got_top1_router_logit_with_this_token),
('token2', how_many_times_expert_0_got_top1_router_logit_with_this_token),
('token3', how_many_times_expert_0_got_top1_router_logit_with_this_token),
('token4', how_many_times_expert_0_got_top1_router_logit_with_this_token),
('token5', how_many_times_expert_0_got_top1_router_logit_with_this_token),]

Expert 1
....
```

Do you observe any patterns?
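
The counting part of this task does not depend on the model, so it can be prototyped on toy data first. The sketch below assumes you already have, for one layer, a list of token strings and a matching list of per-token router logits (one value per expert); the function name and the toy values are made up for illustration.

```python
from collections import Counter


def top1_tokens_per_expert(tokens, router_logits, num_experts, top_n=5):
    """For each expert, count the tokens for which it has the largest router logit."""
    counters = [Counter() for _ in range(num_experts)]
    for token, logits in zip(tokens, router_logits):
        best = max(range(num_experts), key=lambda e: logits[e])
        counters[best][token] += 1
    return [c.most_common(top_n) for c in counters]


# Toy data: 5 tokens, 3 experts (hypothetical values).
toy_tokens = ["the", "game", "the", "[", "]"]
toy_logits = [
    [2.0, 0.1, -1.0],
    [0.3, 1.7, 0.2],
    [1.1, 0.9, 0.0],
    [-0.5, 0.0, 3.0],
    [-0.2, 0.1, 2.5],
]

for expert, top in enumerate(top1_tokens_per_expert(toy_tokens, toy_logits, num_experts=3)):
    print(f"Expert {expert}: {top}")
```

With the real model you would convert the router logits of the chosen layer to Python lists (or use `argmax` directly on the tensor) and take the token strings from the tokenizer.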

In [None]:
# <YOUR CODE HERE>

[YOUR ANSWER HERE]

## Set Up LoRA

We will be fine tuning our model with LoRA. To start, we have to apply some preprocessing to the model to prepare it for training. For that, use the `prepare_model_for_kbit_training` function from PEFT.

In [None]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

Now, we'll define which layers are subject to fine tuning and understand how many trainable parameters we are going to have.

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "w1",
        "w2",
        "w3",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)


In [None]:
model = get_peft_model(model, config)
print_trainable_parameters(model)

As you see, only a small fraction of the parameters is trainable.

Let's look closer at the config.

* `r` is the rank of the low-rank matrix used in the adapters. The larger it is, the more trainable parameters we have, so the more expressive the model is, but also the more compute we need for fine tuning. We set `r = 8` which is a reasonable default value.
* `lora_alpha` is the scaling factor for the learned weights. The weight matrix $\Delta W$ is scaled by `lora_alpha/r`, and thus a higher value for alpha assigns more weight to the LoRA activations. We use `lora_alpha = 16` which is also a reasonable default.
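
To see why so few parameters are trainable, here is a rough count for a single adapted matrix. A LoRA adapter replaces the update of a `d_in x d_out` matrix by a product of a `d_in x r` and an `r x d_out` matrix. The 4096 x 4096 shape below is just an example of a square projection, and the helper name is made up.

```python
def lora_extra_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (d_in x r), B is (r x d_out)."""
    return r * (d_in + d_out)


r, lora_alpha = 8, 16
scaling = lora_alpha / r   # factor applied to the low-rank update
full = 4096 * 4096         # size of the frozen matrix in this example
extra = lora_extra_params(4096, 4096, r)
print(f"scaling={scaling}, adapter params={extra}, share of full matrix={extra / full:.4%}")
```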


The trainable layers are indicated by their codes "q_proj", "k_proj", "v_proj", "o_proj", "w1", "w2", "w3", "lm_head". Here,

* "q_proj", "k_proj", "v_proj" are the matrices $W_Q, W_K, W_V$ that produce queries, keys, and values from the transformer layer inputs, as in $q = xW_Q$.
* "o_proj" is the output projection matrix $W_o$ that is applied at the final step of the QKV-attention mechanism. For the linearized attention mechanism with rotary position embeddings it would look like this (here $\psi$ and $\phi$ are feature maps, and $R^d_{\Theta, n}$ is the rotation matrix for position $n$):
$$o_n = \sum_{m=1}^N\frac{\left(\psi(q_n)R^d_{\Theta, n}\right)\left(\phi(k_m)R^d_{\Theta, m}\right)^T}{\sum_{l=1}^N\psi(q_n)\phi(k_l)^T}v_m,$$
$$\quad$$
$$\text{output} = oW_o.$$
* "lm_head" is the language modeling head that goes after the last transformer block and predicts next token logits.
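To see how the four attention projections fit together, here is a toy single-head sketch in plain Python. It deliberately swaps the linearized, rotary formula above for ordinary causal softmax attention, and it ignores multiple heads and grouped-query attention, so it only illustrates the order in which $W_Q, W_K, W_V, W_o$ are applied. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(x, w_q, w_k, w_v, w_o):
    """Toy causal attention: x has one row per token."""
    q = matmul(x, w_q)  # role of q_proj
    k = matmul(x, w_k)  # role of k_proj
    v = matmul(x, w_v)  # role of v_proj
    scale = 1.0 / math.sqrt(len(q[0]))
    mixed = []
    for n, q_n in enumerate(q):
        # Token n attends only to tokens 0..n (causal mask).
        scores = [scale * sum(a * b for a, b in zip(q_n, k[m])) for m in range(n + 1)]
        weights = softmax(scores)
        mixed.append(
            [sum(w * v[m][j] for m, w in enumerate(weights)) for j in range(len(v[0]))]
        )
    return matmul(mixed, w_o)  # role of o_proj


identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0]]  # two tokens, hidden size 2
y = single_head_attention(x, identity, identity, identity, identity)
print(y)
```

With identity projections the first token can only attend to itself, so its output equals its input; the second token mixes both value vectors and puts more weight on itself.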

Note that we don't fine tune the embedding layer. Training it only makes sense if you introduce new tokens (for example, if you want to adapt your model to new languages), and in that case only the embeddings of the newly added tokens are trained.

**Task 3.4. [1 point]** Now, we have a quest for you. The labels "w1", "w2" and "w3" stand for the linear layers inside each of the experts (which are just MLPs). But how are they connected to each other? What is the architecture of this MLP?

It's not addressed well in the Mixtral/Mistral papers (feel free to check though, maybe the author of this homework wasn't attentive enough), so your best chance to grasp it is to find the source code in the Transformers library.

[YOUR ANSWER HERE]

## Run Training

Fine tuning can take some time. However, you can stop training earlier (say, after 100-500 steps) if you observe overfitting or are just tired of waiting. In that case, though, the result may not be so great.
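One possible way to make "observe overfitting" concrete is to watch the evaluation loss and stop once it has not improved for a few evaluations in a row. The sketch below is a plain-Python heuristic; `should_stop` and `patience` are hypothetical names and not part of the `Trainer` API, and the loss values are made up for illustration.

```python
def should_stop(eval_losses, patience=3):
    """Return True if none of the last `patience` evaluations beat the earlier best loss."""
    if len(eval_losses) <= patience:
        return False  # not enough history yet
    best_before = min(eval_losses[:-patience])
    return min(eval_losses[-patience:]) >= best_before


# Made-up evaluation losses, one per evaluation round.
still_improving = [1.00, 0.80, 0.70, 0.60]
overfitting = [1.00, 0.80, 0.70, 0.72, 0.75, 0.80]

print(should_stop(still_improving))
print(should_stop(overfitting))
```

A larger `patience` tolerates noisier loss curves at the cost of training longer before stopping.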

In [None]:
if torch.cuda.device_count() > 1: # If more than 1 GPU
    model.is_parallelizable = True
    model.model_parallel = True

In [None]:
torch.cuda.device_count() # should be 4 if using Brev's instance link

In [None]:
import transformers
from datetime import datetime

project = "viggo-finetune"
base_model_name = "mixtral"
run_name = base_model_name + "-" + project
output_dir = "./" + run_name

tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=transformers.TrainingArguments(
        output_dir=output_dir,
        warmup_steps=5,
        per_device_train_batch_size=1,
        gradient_checkpointing=True,
        gradient_accumulation_steps=4,
        max_steps=1000,
        learning_rate=2.5e-5,
        logging_steps=25,
        fp16=True,
        optim="paged_adamw_8bit",
        logging_dir="./logs",        # Directory for storing logs
        save_strategy="steps",       # Save checkpoints on a step-based schedule
        save_steps=50,               # Save a checkpoint every 50 steps
        evaluation_strategy="steps", # Evaluate on a step-based schedule
        eval_steps=50,               # Evaluate every 50 steps
        do_eval=True,                # Run evaluation on eval_dataset during training
        # report_to="wandb",         # Uncomment if you want to log to Weights & Biases
        run_name=f"{run_name}-{datetime.now().strftime('%Y-%m-%d-%H-%M')}",  # Name of the W&B run (optional)
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

**Task 3.5. [1 point]** Run the training!

In [None]:
trainer.train()

Note that during training the adapter weights are saved every 50 steps (the `save_steps` parameter) in folders like "mixtral-viggo-finetune/checkpoint-50", "mixtral-viggo-finetune/checkpoint-100", etc. This way, you'll be able to restore the trained model later.
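As a quick sanity check of the training schedule, the sketch below lists the checkpoint folders implied by `max_steps=1000` and `save_steps=50`, and estimates how many training examples the model sees given `per_device_train_batch_size=1` and `gradient_accumulation_steps=4`. The helper names are hypothetical, and the count assumes a single data-parallel replica.

```python
def checkpoint_dirs(output_dir, max_steps, save_steps):
    """Folders written when a checkpoint is saved every `save_steps` steps."""
    return [
        f"{output_dir}/checkpoint-{step}"
        for step in range(save_steps, max_steps + 1, save_steps)
    ]


def samples_seen(max_steps, per_device_batch, grad_accum):
    """Examples consumed after `max_steps` optimizer steps (one data-parallel replica assumed)."""
    return max_steps * per_device_batch * grad_accum


dirs = checkpoint_dirs("mixtral-viggo-finetune", max_steps=1000, save_steps=50)
print(len(dirs), dirs[0], dirs[-1])
print(samples_seen(max_steps=1000, per_device_batch=1, grad_accum=4))
```

If you stop training early, only the folders up to the last completed multiple of `save_steps` will exist.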

## Try the model

At this point, we recommend restarting the kernel to avoid running into out-of-memory problems.

To reload the model, we need to do two things:

1. Load the base model,
2. Load the adapter weights that were saved in the checkpoints.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

base_model_id = "mistralai/Mistral-7B-v0.1"  # Must match the model you fine-tuned; use your Mixtral model_id here if you trained Mixtral
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,  # Same base model as used for training
    quantization_config=bnb_config,  # Same quantization config as before
    device_map="auto",
    trust_remote_code=True,
)

eval_tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    add_bos_token=True,
    trust_remote_code=True,
)

In [None]:
from peft import PeftModel

# Insert correct checkpoint index, double check the path
ft_model = PeftModel.from_pretrained(base_model, "mixtral-viggo-finetune/checkpoint-50")

Now, let's run the model and see if it has learnt to follow the required format:

In [None]:
eval_prompt = """Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values.
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].
The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']

### Target sentence:
Earlier, you stated that you didn't have strong feelings about PlayStation's Little Big Adventure. Is your opinion true for all games which don't have multiplayer?

### Meaning representation:
"""

model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to("cuda")

ft_model.eval()
with torch.no_grad():
    print(eval_tokenizer.decode(ft_model.generate(**model_input, max_new_tokens=50)[0], skip_special_tokens=True))
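
The decoded output above contains the prompt followed by the generated text, so checking the format by eye can be tedious. Below is a hedged sketch of a simple format check: it cuts off everything up to the `### Meaning representation:` marker and tests whether the first generated line looks like `function(...)` with a function name from the prompt's list. The helper names and the sample string are hypothetical, and the regular expression is only a rough approximation of the expected format.

```python
import re

# Function names copied from the prompt above.
ALLOWED_FUNCTIONS = [
    "inform", "request", "give_opinion", "confirm", "verify_attribute",
    "suggest", "request_explanation", "recommend", "request_attribute",
]

MARKER = "### Meaning representation:"


def extract_meaning_representation(generated_text):
    """Return the first non-empty line after the last marker (or of the whole text)."""
    _, _, tail = generated_text.rpartition(MARKER)
    stripped = tail.strip()
    return stripped.splitlines()[0].strip() if stripped else ""


def has_valid_function(meaning_representation):
    """Check that the text looks like name(...) with an allowed function name."""
    match = re.match(r"^(\w+)\(.*\)$", meaning_representation)
    return bool(match) and match.group(1) in ALLOWED_FUNCTIONS


# Hypothetical model output, used only to exercise the helpers.
sample = (
    "### Target sentence:\nSome sentence.\n\n"
    "### Meaning representation:\n"
    "verify_attribute(name[Little Big Adventure], has_multiplayer[no])\n"
)
mr = extract_meaning_representation(sample)
print(mr, has_valid_function(mr))
```

This only validates the function name; checking the attribute names against the prompt's list would be a natural extension.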

# Task 4*. Do this instead of Task 3 if Mixtral doesn't fit on your GPU

If you can't run Mixtral, you will do almost the same with Mistral. Mistral is much less compute intensive and fits comfortably on a V100.

As far as fine tuning is concerned, the only thing you need to do is change `model_id` to `"mistralai/Mistral-7B-v0.1"`. All the code should work without trouble. Just **please indicate in bold at the beginning of Task 3 that you're using Mistral instead of Mixtral**.

**Tasks 3.1, 3.2, 3.4, and 3.5** stay the same. Moreover, the answers won't change much, because Mixtral inherits Mistral's architectural ideas.

Task 3.3 doesn't make sense with Mistral, so instead you'll do the following:

**Task 4.3. [2 points]** Fine tuning Mistral is faster than fine tuning Mixtral, so you can run several experiments (but try to reach at least 500 steps). Train only attention layers with QLoRA. How will the number of trainable parameters change? Compare the quality of the resulting model.