## Seminar 7. Modern LLM

### Plan

1. Modern language models.
2. The difference between pre-trained models and fine-tuned models using Llama 3.2-1B as an example:   
    - Architecture 
    - Tokenizers 
    - Chat Template
3. Supervised fine-tuning
4. RLHF 


## Modern LLM

Today's instruct models typically go through three stages of training: pre-training, supervised fine-tuning, and RLHF. Let's go through each of them.

### Pre-training
Language models are pre-trained on the next-token prediction task. For this purpose, huge datasets are collected, usually automatically. Typical data sources are scraped web pages (subsequently cleaned of HTML tags and other markup), book collections, and so on. Examples of datasets: [C4](https://huggingface.co/datasets/allenai/c4), [RedPajama](https://github.com/togethercomputer/RedPajama-Data), [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia).
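To make the next-token prediction objective concrete, here is a minimal pure-Python sketch. The 4-token vocabulary, the token ids, and the logits are made up for illustration; a real model produces logits over tens of thousands of tokens at every position.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 at position t.

    logits_per_position[t] stands for the (hypothetical) model scores after
    reading token_ids[: t + 1]; the target at position t is token_ids[t + 1].
    """
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        target = token_ids[t + 1]
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy vocabulary of 4 tokens and a 3-token sequence.
token_ids = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
confident_logits = [[0.0, 0.0, 5.0, 0.0], [0.0, 5.0, 0.0, 0.0]]

print(next_token_loss(uniform_logits, token_ids))    # equals log(4): pure guessing
print(next_token_loss(confident_logits, token_ids))  # much lower: targets get high probability
```

Pre-training minimizes this quantity averaged over the whole corpus.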

There are many benchmarks for testing LLM abilities, such as [HellaSwag](https://arxiv.org/abs/1905.07830), [OpenBookQA](https://huggingface.co/datasets/allenai/openbookqa), and [WinoGrande](https://huggingface.co/datasets/allenai/winogrande). These benchmarks most often ask the model either to pick one of several answer options (multiple choice) or to continue a text.
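One common way to score a multiple-choice benchmark with a pre-trained model is to compare the log-likelihood the model assigns to each candidate continuation and pick the highest. The sketch below assumes a hypothetical `continuation_logprob(context, continuation)` function; here it is replaced by a hard-coded toy lookup so that the example runs without a model.

```python
def pick_answer(context, choices, continuation_logprob):
    """Return the index of the choice with the highest log-likelihood."""
    scores = [continuation_logprob(context, choice) for choice in choices]
    return max(range(len(choices)), key=lambda i: scores[i])

def accuracy(examples, continuation_logprob):
    """Fraction of examples where the top-scoring choice is the gold one."""
    correct = sum(
        pick_answer(ex["context"], ex["choices"], continuation_logprob) == ex["label"]
        for ex in examples
    )
    return correct / len(examples)

# Stand-in for a real model: hypothetical log-probabilities, hard-coded.
TOY_LOGPROBS = {
    "puts on a helmet.": -3.1,
    "eats the bicycle.": -9.7,
    "turns into a cloud.": -12.4,
}

def toy_logprob(context, continuation):
    return TOY_LOGPROBS[continuation]

examples = [
    {
        "context": "Before riding the bike, she",
        "choices": ["eats the bicycle.", "puts on a helmet.", "turns into a cloud."],
        "label": 1,
    }
]
print(accuracy(examples, toy_logprob))
```

Evaluation harnesses often also normalize the score by continuation length so that longer options are not penalized; that refinement is left out here.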

Pre-trained LLMs know a lot about the world because they have seen a lot of data, but they have a problem: they don't know how to respond in a format that is useful to humans. Let's look at Mistral-7B-v0.1 as an example.

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

Let's try asking it to solve a simple problem.

In [2]:
pipe_pretrained = pipeline("text-generation", model="mistralai/Mistral-7B-v0.1", max_new_tokens=512)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [3]:
task = "Find the value of n in n + 2 = 6"

In [4]:
pipe_pretrained(task)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


[{'generated_text': 'Find the value of n in n + 2 = 6\n\n#### Solution\n\nn + 2 = 6\n\nn = 6 - 2\n\nn = 4\n\nConcept: Introduction of Algebra\n  Is there an error in this question or solution?\n\n#### Video TutorialsVIEW ALL [1]\n\n- view\nVideo Tutorials For All Subjects\n- Introduction of Algebra\n\nvideo tutorial00:07:55'}]

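Something has clearly gone wrong: the model does find n = 4, but then it keeps going and reproduces the boilerplate of a homework-help web page. A pre-trained model only continues text, so the result improves if we phrase the prompt as the beginning of a text whose natural continuation is the answer.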

In [5]:
task_fixed = "Find the value of n in n + 2 = 6. Answer: n = "

In [7]:
pipe_pretrained(task_fixed)

[{'generated_text': 'Find the value of n in n + 2 = 6. Answer: n = 4.\n\n## What is the value of n in n + 2 = 6?\n\nThe value of n in n + 2 = 6 is 4.\n\n## How do you solve n + 2 = 6?\n\nTo solve the equation n + 2 = 6, we need to subtract 2 from both sides of the equation. This will give us the equation n = 4.\n\n## What is the value of n in n + 2 = 6?\n\nThe value of n in n + 2 = 6 is 4.\n\n## How do you solve n + 2 = 6?\n\nTo solve the equation n + 2 = 6, we need to subtract 2 from both sides of the equation. This will give us the equation n = 4.\n\n## What is the value of n in n + 2 = 6?\n\nThe value of n in n + 2 = 6 is 4.\n\n## How do you solve n + 2 = 6?\n\nTo solve the equation n + 2 = 6, we need to subtract 2 from both sides of the equation. This will give us the equation n = 4.\n\n## What is the value of n in n + 2 = 6?\n\nThe value of n in n + 2 = 6 is 4.\n\n## How do you solve n + 2 = 6?\n\nTo solve the equation n + 2 = 6, we need to subtract 2 from both sides of the equati

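This is already better: the answer now comes first. But the model then repeats the same questions and answers until it runs out of tokens, which is far from the way ChatGPT behaves. What's the reason?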

**Models that are designed for user interaction go through several more stages of training. Collectively, these stages are called alignment (although the term is used with different meanings nowadays).**

The name comes from the OpenAI post [Aligning language models to follow instructions](https://openai.com/index/instruction-following/), which introduced InstructGPT.

Let's see how the chat model handles such tasks:

In [3]:
pipe = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.1", max_new_tokens=512)

We'll have to change the input data format a bit, since chat models accept data in dialog format. 
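Before tokenization, such a dialog is flattened into a single string by the model's chat template. The sketch below shows the idea with a simplified layout in the `[INST] ... [/INST]` style of Mistral instruct models; the exact whitespace and special tokens here are an assumption. The authoritative template ships with the tokenizer and is applied by `tokenizer.apply_chat_template`.

```python
def render_chat(messages, bos="<s>", eos="</s>"):
    """Flatten a list of {"role", "content"} dicts into one prompt string.

    Simplified Mistral-style layout: user turns are wrapped in
    [INST] ... [/INST], assistant turns end with the end-of-sequence token.
    """
    prompt = bos
    for message in messages:
        if message["role"] == "user":
            prompt += f"[INST] {message['content']} [/INST]"
        elif message["role"] == "assistant":
            prompt += f" {message['content']}{eos}"
        else:
            raise ValueError(f"unsupported role: {message['role']}")
    return prompt

example_messages = [
    {"role": "user", "content": "Find the value of n in n + 2 = 6"},
]
print(render_chat(example_messages))
```

The pre-trained model never saw these markers in this role, which is one reason it does not respond to a bare question the way the chat model does.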

In [3]:
messages = [
    {"role": "user", "content": "Find the value of n in n + 2 = 6"},
]

In [4]:
pipe(messages)

[{'generated_text': [{'role': 'user',
    'content': 'Find the value of n in n + 2 = 6'},
   {'role': 'assistant',
    'content': ' To find the value of n in the equation n + 2 = 6, we need to isolate n on one side of the equation. \n\nWe can start by subtracting 2 from both sides of the equation: \n\nn + 2 - 2 = 6 - 2 \n\nThis simplifies to: \n\nn = 4 \n\nTherefore, the value of n in the equation n + 2 = 6 is 4.'}]}]

Here we see the kind of response we are used to from ChatGPT. The most interesting thing is that the two models have exactly the same architecture; only the weights differ.

In [4]:
pipe.model

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): MistralRMSNorm((4096,), eps=1e-05)
     

In [5]:
pipe_pretrained.model

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): MistralRMSNorm((4096,), eps=1e-05)
     

## How is such a difference in responses achieved?

## Supervised Fine-Tuning

How are people most often taught to do something? They are given a task and shown a good solution to it.

Here the idea is the same: we collect a dataset of dialogs between a user and an assistant, in which the user asks a question and the assistant answers it.

Ideally (as was done in the original InstructGPT work), the answers should be written by people who are experts in the domain.

This way, the model learns the answer format and uses its knowledge from the pre-training stage to help the user.

Examples of such datasets with instructions are: [UltraChat](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), [No-Robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots), [MathInstruct](https://huggingface.co/datasets/TIGER-Lab/MathInstruct). 

To collect such datasets, large companies (OpenAI, Meta, Google, etc.) have assembled entire departments of professional editors who write instructions and answers to them.

Training on the collected data then uses the same objective as in the previous stage (next-token prediction).

![SFT](https://klu.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fklu-supervised-fine-tuning-sft.ca014122.png&w=3840&q=100)
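One detail the description above leaves open is which tokens contribute to the loss. A common choice (an assumption here, not something stated above) is to compute the next-token loss only on the assistant's tokens and to mask the prompt tokens with `-100`, the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The token ids below are made up.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default

def build_sft_example(prompt_ids, response_ids, mask_prompt=True):
    """Concatenate prompt and response tokens and build the training labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    if mask_prompt:
        # Loss is computed only where the assistant is speaking.
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    else:
        # Plain next-token prediction on the whole dialog.
        labels = list(input_ids)
    return {"input_ids": input_ids, "labels": labels}

prompt_ids = [1, 10, 11, 12]   # hypothetical ids of the rendered user turn
response_ids = [20, 21, 2]     # hypothetical ids of the assistant answer + EOS

example = build_sft_example(prompt_ids, response_ids)
print(example["input_ids"])
print(example["labels"])
```

Hugging Face causal LM models shift the labels by one position internally, so `labels` can simply mirror `input_ids` as done here.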

### In fact, it is now possible to collect such data without expensive human annotators.

There are several approaches, such as [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html), that allow you to collect data automatically.

Alpaca works as follows:

- Take 175 human-written seed instructions from the Self-Instruct dataset
- Prompt GPT-3.5 (a state-of-the-art model at the time) to generate more instructions in the same style.
- Fine-tune the model on the generated data

![Alpaca](https://crfm.stanford.edu/static/img/posts/2023-03-13-alpaca/alpaca_main.jpg)
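The loop behind this kind of data collection can be sketched as follows. Everything here is illustrative: `generate` and `answer` stand for calls to a strong LLM and are replaced by fake functions, the prompt wording is invented, and the exact-duplicate filter is a simplification of the similarity filtering used in practice.

```python
import random

def build_generation_prompt(seed_tasks, num_examples=3, rng=None):
    """Build a few-shot prompt asking a strong LLM for one more instruction."""
    rng = rng or random.Random(0)
    shots = rng.sample(seed_tasks, num_examples)
    lines = ["Come up with a new task in the same style as the examples.", ""]
    for i, task in enumerate(shots, start=1):
        lines.append(f"{i}. Instruction: {task}")
    lines.append(f"{len(shots) + 1}. Instruction:")
    return "\n".join(lines)

def collect_dataset(seed_tasks, generate, answer, num_new=4):
    """Ask `generate` for new instructions and `answer` for responses,
    skipping empty outputs and exact duplicates."""
    seen = set(seed_tasks)
    dataset = []
    rng = random.Random(0)
    attempts = 0
    while len(dataset) < num_new and attempts < 10 * num_new:
        attempts += 1
        instruction = generate(build_generation_prompt(seed_tasks, rng=rng)).strip()
        if not instruction or instruction in seen:
            continue
        seen.add(instruction)
        dataset.append({"instruction": instruction, "output": answer(instruction)})
    return dataset

seed_tasks = [
    "Translate the sentence into French.",
    "Summarize the paragraph in one sentence.",
    "List three synonyms for the given word.",
    "Fix the grammar mistakes in the text.",
]

# Fake LLM calls so the sketch runs offline.
canned = iter([
    "Write a haiku about autumn.",
    "Write a haiku about autumn.",  # duplicate, will be skipped
    "Explain recursion to a child.",
])

def fake_generate(prompt):
    return next(canned, "")

def fake_answer(instruction):
    return f"<answer to: {instruction}>"

data = collect_dataset(seed_tasks, fake_generate, fake_answer, num_new=2)
print([row["instruction"] for row in data])
```

The resulting list of instruction/output pairs has the same shape as a hand-written SFT dataset, so the training step does not change.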

There are other methods as well, such as [Instruction Pre-Training](https://arxiv.org/abs/2406.14491) and [Mammoth](https://arxiv.org/abs/2405.03548).

They boil down to the idea that instructions can be generated from the same kind of data the model is pre-trained on.

## Preference Learning (RLHF)

But that's not all. Even though SFT already gives good quality, the real breakthrough came with RLHF, a method that OpenAI applied successfully in [InstructGPT](https://openai.com/index/instruction-following/).

This method aims not only to raise the probabilities of “good” answers, but also to lower the probabilities of “bad” ones, e.g. toxic, dangerous, or false answers.

The essence of the approach:

- Several answers are sampled from the same model for a single query, e.g. by varying the generation parameters.
- Humans rank the answers: an answer without flaws (toxicity, errors, and so on) is preferred to one with them.
- A reward model is then trained on these rankings so that it assigns a larger reward to the better answer.
- Finally, reinforcement learning, namely PPO, is used to fine-tune the language model to maximize the reward.

![RLHF](https://images.ctfassets.net/kftzwdyauwt9/12CHOYcRkqSuwzxRp46fZD/928a06fd1dae351a8edcf6c82fbda72e/Methods_Diagram_light_mode.jpg?w=3840&q=80&fm=webp)
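The reward-model step is usually trained with a pairwise loss: for each pair of answers, the reward of the preferred one should exceed the reward of the other. A minimal sketch of that loss, with made-up scalar rewards standing in for the outputs of a real reward model:

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Mean of -log sigmoid(r_chosen - r_rejected) over preference pairs."""
    losses = [-log_sigmoid(c - r) for c, r in zip(reward_chosen, reward_rejected)]
    return sum(losses) / len(losses)

print(pairwise_reward_loss([2.0], [-1.0]))  # small: the preferred answer scores higher
print(pairwise_reward_loss([-1.0], [2.0]))  # large: the ranking is inverted
print(pairwise_reward_loss([0.0], [0.0]))   # log(2): the model cannot tell them apart
```

Only the difference between the two rewards matters, which is why rankings are enough and no absolute scores have to be collected from annotators.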

People preferred the answers of a smaller model trained with RLHF to the answers of a larger model.

![img](https://gcdnb.pbrd.co/images/gi3riropQvQe.png?o=1)

## DPO


At the same time, PPO is a very expensive method with many hyperparameters. It is very easy to make mistakes in its implementation that are hard to debug.

In 2023, the paper [Direct Preference Optimization](https://arxiv.org/abs/2305.18290) came out.

Through some involved math, the authors showed that the reward function can be expressed through the optimal policy itself (hence the paper's subtitle, "Your Language Model is Secretly a Reward Model"). The policy can therefore be trained on preference pairs directly, avoiding the problems of PPO and of RL in general.

![DPO](https://ghost.oxen.ai/content/images/2024/01/Screenshot-2024-01-26-at-3.49.01-PM.png)
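For a single preference pair, the DPO loss can be sketched in a few lines. The inputs are sequence log-probabilities of the chosen and rejected answers under the model being trained and under a frozen reference model (typically the SFT checkpoint); the numbers below are made up, and `beta=0.1` is just an illustrative default.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # Implicit rewards: how much the policy moved away from the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))

# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy raised the chosen answer and lowered the rejected one: the loss drops.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

This has the same form as the pairwise reward-model loss, with the log-ratio between policy and reference playing the role of the reward, so no separate reward model or sampling loop is needed.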

All of these (and other) methods are available in the [TRL](https://huggingface.co/docs/trl/en/index) library:

- [SFT Trainer](https://huggingface.co/docs/trl/en/sft_trainer) 

- [Reward Modelling](https://huggingface.co/docs/trl/en/reward_trainer)

- [PPO Trainer](https://huggingface.co/docs/trl/en/ppo_trainer)

- [DPO Trainer](https://huggingface.co/docs/trl/en/dpo_trainer)

![TRL](https://gcdnb.pbrd.co/images/pQ7aghF2128t.png?o=1)

## How do you evaluate a model's quality and figure out which model you need?

## There are many benchmarks that test specific skills of models.

Examples of benchmarks:

- [MT-Bench](https://arxiv.org/abs/2306.05685): the model generates replies in multi-turn dialogs (several user-assistant exchanges), and the replies are then scored by GPT-4.
- [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/): the model's answer is compared with GPT-4's answer by another model acting as a judge, and a win rate is computed from the verdicts.

- [ChatBot Arena](https://lmarena.ai): users chat with different models (including paid ones) and ask their own questions. The user is then shown two answers and picks the one they like more. A model ranking is built from these votes.

- [LLM Arena](https://llmarena.ru): the same, but in Russian.

Current leaderboard on the arena:

![Arena](https://gcdnb.pbrd.co/images/9FoqK20pMVoq.png?o=1)
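One standard way to turn pairwise votes into a ranking is an Elo-style rating. The sketch below is only an illustration of the idea: the model names, the votes, the starting rating of 1000, and `k=32` are all made up, and the arena's actual rating procedure may differ.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings, winner, loser, k=32):
    """Move rating points from the loser to the winner after one vote."""
    expected_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - expected_win)  # surprising wins move ratings more
    ratings[winner] += delta
    ratings[loser] -= delta

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
votes = [  # (winner, loser) pairs chosen by hypothetical users
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
for winner, loser in votes:
    update_elo(ratings, winner, loser)

leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard)
```

Sequential Elo updates depend on the order of the votes; fitting all votes at once with a Bradley-Terry model is a common alternative that avoids this.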

## Which model should you choose?

### It is almost always better to prefer models that are trained to answer in chat format, that is, models that have gone through the alignment stage.

Training your own model from scratch is almost never financially worthwhile or practical, unless, of course, you work at a huge company that has the resources for it.


Modern models of relatively "small" size ([gemma2 9b](https://huggingface.co/google/gemma-2-9b-it), [llama-3.1 8B](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)) handle general tasks quite well, far better than "old" models, even larger ones.

Large models ([gemma2 27b](https://huggingface.co/google/gemma-2-27b-it), [llama3.1 405b](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct)) often don't even need additional training for specific tasks (for example, summarization or classification): it is enough to describe the task clearly and to state what output format you expect from the model.

You can read more about prompt engineering [here](https://platform.openai.com/docs/guides/prompt-engineering).
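As a small illustration of describing the task and the expected output format, here is a sketch of a prompt builder for sentiment classification together with a parser for the reply. The label set, the prompt wording, and the JSON format are assumptions chosen for the example; the resulting `messages` list has the same shape as the one passed to `pipe(messages)` earlier.

```python
import json

ALLOWED_LABELS = ["positive", "negative", "neutral"]

def build_messages(text):
    """Describe the task and the exact output format in a single user turn."""
    instruction = (
        "Classify the sentiment of the review below.\n"
        f"Allowed labels: {', '.join(ALLOWED_LABELS)}.\n"
        'Reply with JSON only, in the form {"label": "<one of the allowed labels>"}.\n\n'
        f"Review: {text}"
    )
    return [{"role": "user", "content": instruction}]

def parse_label(reply):
    """Extract and validate the label from a model reply; None if malformed."""
    try:
        label = json.loads(reply.strip())["label"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    return label if label in ALLOWED_LABELS else None

messages = build_messages("The battery died after two days.")
print(messages[0]["content"])

# Replies a chat model might produce (hypothetical):
print(parse_label('{"label": "negative"}'))
print(parse_label("I think it is negative."))
```

Validating the reply matters in practice: even a well-prompted model occasionally ignores the format, and returning `None` lets the caller retry or fall back.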