In [1]:
! pip install peft bitsandbytes==0.45.5 accelerate datasets==3.5.0


# Finetuning Bonus - 25 points

In this bonus assignment we will do a small-scale fine-tuning of a base LLM into a Chat-LLM (Instruct-LLM), i.e. a model that behaves like an assistant and can hold a dialogue.

The points for this assignment are a bonus: they are awarded on top of what is needed for the regular pass.

In [2]:
import os
from typing import List, Dict, Any

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence
import torch.optim as optim
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments, BitsAndBytesConfig, AutoConfig, set_seed
from torch.utils.data import Dataset
from datasets import load_dataset
from peft import PeftModel, get_peft_model, LoraConfig

set_seed(12, True)

os.environ["WANDB_DISABLED"] = "true"
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"

In [3]:
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from peft import LoraConfig, get_peft_model

In [4]:
device = torch.device("cuda")
# device = torch.device("mps")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-360M").to(device)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-360M")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



As the SFT corpus we will use [HelpSteer2](https://arxiv.org/pdf/2406.08673). It is a dataset for training reward models, but it also works for ordinary supervised training.

In [5]:
tokenizer.pad_token_id = tokenizer.eos_token_id

In [6]:
dataset = load_dataset("nvidia/HelpSteer2", split="validation")


In [7]:
dataset[0]

{'prompt': 'explain master slave replication nsql',
 'response': "In the context of NoSQL databases, master-slave replication refers to a configuration where a single master node writes data, and one or more slave nodes read data from the master and replicate it to provide read scalability. The master node is responsible for accepting write requests and updating its own data, while the slave nodes are responsible for replicating the data from the master and serving read requests.\n\nIn this configuration, the master node is the only node that can make changes to the data, while the slave nodes can only read the data and replicate it. This ensures that the data is consistent across all nodes and that there is no data loss in case of a failure of a single node.\n\nMaster-slave replication in NoSQL databases can be implemented using various technologies, such as Apache Cassandra, MongoDB, and Apache HBase. In each of these databases, the master-slave replication configuration can be set u

In [8]:
print("Prompt:", dataset[0]["prompt"])

print("Response:", dataset[0]["response"])

Prompt: explain master slave replication nsql
Response: In the context of NoSQL databases, master-slave replication refers to a configuration where a single master node writes data, and one or more slave nodes read data from the master and replicate it to provide read scalability. The master node is responsible for accepting write requests and updating its own data, while the slave nodes are responsible for replicating the data from the master and serving read requests.

In this configuration, the master node is the only node that can make changes to the data, while the slave nodes can only read the data and replicate it. This ensures that the data is consistent across all nodes and that there is no data loss in case of a failure of a single node.

Master-slave replication in NoSQL databases can be implemented using various technologies, such as Apache Cassandra, MongoDB, and Apache HBase. In each of these databases, the master-slave replication configuration can be set up using specif

Let's see how our model answers questions from this dataset.
Our model is a plain LLM straight after pretraining: it was trained on internet text and simply continues the text left to right, one token at a time. It has not been trained to generate answers.

It has no chat format either, so let's take the simplest one:
```
User: question 1
Assistant: answer 1
User: question 2
Assistant: answer 2
...
```
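A minimal sketch of how a prompt in this format could be assembled. The helper name `build_chat_prompt` is hypothetical and is not used elsewhere in the notebook; passing `None` as the last answer leaves the prompt open for the model to continue.

```python
from typing import List, Optional, Tuple


def build_chat_prompt(turns: List[Tuple[str, Optional[str]]]) -> str:
    """Render (question, answer) pairs in the User/Assistant format.

    An answer of None produces a bare "Assistant:" line, which is the
    point where the model is expected to start generating.
    """
    lines = []
    for question, answer in turns:
        lines.append(f"User: {question}")
        lines.append("Assistant:" if answer is None else f"Assistant: {answer}")
    return "\n".join(lines)


prompt = build_chat_prompt(
    [("question 1", "answer 1"), ("explain master slave replication nsql", None)]
)
print(prompt)
```

With a single open turn this yields the same string that is passed to the tokenizer in the next cell.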

In [9]:
inputs = tokenizer("User: explain master slave replication nsql\nAssistant:", return_tensors="pt")
for k, v in inputs.items():
    inputs[k] = v.to(device)
gen = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(gen[0].tolist()))

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


User: explain master slave replication nsql
Assistant: explain master slave replication nsql
Assistant: explain master slave replication nsql
Assistant: explain master slave replication nsql
Assistant: explain master slave replication nsql
Assistant: explain master slave replication nsql
Assistant:


As the example above shows, the model does not answer the question and gets stuck in a loop. Let's try generating an answer without the format.

In [10]:
inputs = tokenizer(dataset[0]["prompt"], return_tensors="pt")
for k, v in inputs.items():
    inputs[k] = v.to(device)

gen = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(gen[0].tolist()))

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


explain master slave replication nsql master slave replication nsql master slave replication nsql master slave replication nsql master slave replication nsql master slave replication nsql master slave replication nsql master slave replication nsql master slave replication nsql master slave replication nsql master slave replication nsql
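The repetition seen in both outputs is what one would expect from greedy decoding, assuming `generate` runs without sampling here: each step picks the single most likely next token, so once the continuation returns to a context it has already produced, it repeats. The toy sketch below illustrates this with a hand-written next-token table. The table is hypothetical and only looks at the last token, whereas the real model conditions on the whole context.

```python
from typing import Dict, List

# Hypothetical table: for each token, its single most likely successor.
NEXT_TOKEN: Dict[str, str] = {
    "User:": "explain",
    "explain": "replication",
    "replication": "\n",
    "\n": "Assistant:",
    "Assistant:": "explain",  # the "answer" just restarts the question
}


def greedy_continue(prompt_tokens: List[str], max_new_tokens: int) -> List[str]:
    """Append the most likely next token, one step at a time."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tokens.append(NEXT_TOKEN[tokens[-1]])
    return tokens


out = greedy_continue(["User:", "explain", "replication", "\n", "Assistant:"], 8)
print(" ".join(out))
```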


Since our model is small, we can train all of its weights, but we can also train adapters instead.


Question: why does training adapters take less memory?

In [None]:
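# ---- Your code here ----
print(
    "Training LoRA adapters takes less memory because the base model weights are frozen: "
    "gradients and optimizer state (Adam's running averages of the gradients and of their "
    "squares) are stored only for the small adapter matrices. Activations are still needed "
    "for the backward pass, so they account for little of the saving."
)
# ---- End of code ----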
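A rough back-of-the-envelope sketch of this answer. It counts only weights, gradients and the two Adam moment buffers, all in fp32, and ignores activations. The base parameter count is an assumption derived from the 1.45 GB checkpoint size shown above (about 4 bytes per parameter); the adapter count is the rank-4 figure from the assertion in the optional LoRA cell.

```python
def training_state_gib(n_total: int, n_trainable: int, bytes_per_value: int = 4) -> float:
    """Memory for weights + gradients + Adam moments, in GiB.

    Weights are stored for every parameter; gradients and the two Adam
    moment buffers are stored only for trainable parameters.
    """
    weights = n_total * bytes_per_value
    grads = n_trainable * bytes_per_value
    adam_moments = 2 * n_trainable * bytes_per_value
    return (weights + grads + adam_moments) / 2**30


N_BASE = 362_000_000  # assumption: ~1.45 GB fp32 checkpoint / 4 bytes per parameter
N_LORA = 2_170_880    # rank-4 adapter parameters, from the notebook's assertion

full = training_state_gib(N_BASE, N_BASE)
lora = training_state_gib(N_BASE + N_LORA, N_LORA)
print(f"full fine-tuning: {full:.2f} GiB, LoRA: {lora:.2f} GiB")
```

Under these assumptions the frozen weights dominate the LoRA footprint, while full fine-tuning pays roughly four times the weight size.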

In [None]:
# Optional: in Colab this model can also be trained on all of its parameters


# lora_rank = 4
# lora_config = LoraConfig(
#     r=4,
#     lora_alpha=4,
#     lora_dropout=0.1,
#     target_modules=['down_proj','o_proj','k_proj','q_proj','gate_proj','up_proj','v_proj'],
# )
# model.add_adapter(lora_config)
# model.enable_adapters()

# model = get_peft_model(model, lora_config)
# print(model.get_nb_trainable_parameters())
# if lora_rank == 4:
#   assert model.get_nb_trainable_parameters()[0] == 2170880
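The number in the assertion above can be reproduced by hand. A LoRA adapter on a linear layer of shape `d_in x d_out` adds two matrices, `d_in x r` and `r x d_out`, i.e. `r * (d_in + d_out)` parameters. The layer sizes below are assumptions about the SmolLM-360M config (hidden size 960, MLP size 2560, key/value projection size 320, 32 layers) and are not read from the model.

```python
from typing import Dict, Tuple


def lora_param_count(rank: int, shapes: Dict[str, Tuple[int, int]], n_layers: int) -> int:
    """Total LoRA parameters: rank * (d_in + d_out) per adapted layer."""
    per_block = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_block * n_layers


# Assumed SmolLM-360M dimensions (not loaded from the checkpoint).
HIDDEN, INTERMEDIATE, KV_DIM, N_LAYERS = 960, 2560, 320, 32

SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

print(lora_param_count(4, SHAPES, N_LAYERS))  # 2170880
```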

Now we need to write the function `transform_example_to_prompt`. It returns 3 objects:
1. The text formatted in our format, `User:...\nAssistant:...<eos_token>`
2. The length of the prefix text `User:...\nAssistant:` in tokens
3. The length of the whole text in tokens

The first object (the dialogue string) is used to produce the model inputs for training.

The prefix length is used to mask the prefix tokens so that the loss is computed only over the model's answer.

The full-text length is used to exclude the padding tokens from the loss.

The EOS token is available as `tokenizer.eos_token` and can simply be concatenated to the text.
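To make "the loss is computed only over the answer" concrete, here is a toy sketch of a masked negative log-likelihood. It assumes the usual convention that positions labelled `-100` are skipped; the probabilities and token ids are made up for illustration.

```python
import math
from typing import List

IGNORE_INDEX = -100  # label value that is skipped when averaging the loss


def masked_nll(token_probs: List[float], labels: List[int]) -> float:
    """Mean negative log-likelihood over positions that are not masked.

    token_probs[i] is the probability the model assigned to the correct
    token at position i; labels[i] is that token's id or IGNORE_INDEX.
    """
    kept = [-math.log(p) for p, y in zip(token_probs, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept)


# Two prefix positions are masked, two answer positions contribute.
probs = [0.9, 0.1, 0.5, 0.25]
labels = [IGNORE_INDEX, IGNORE_INDEX, 7, 8]
print(masked_nll(probs, labels))
```

The poor prediction at the second position (probability 0.1) does not affect the result, because that position belongs to the masked prefix.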

In [None]:
pl

12

In [None]:
fl

272

In [11]:
from typing import Tuple

# ---- Your code here ----
def transform_example_to_prompt(example) -> Tuple[str, int, int]:
    """
    Format one dataset example and measure it in tokens.

    example
    {
      'prompt': 'how to do x',
      'response': 'to do x you need to do y',
    }
    target string:
    "User: {prompt}\nAssistant: {response}<EOS>"

    Returns (full_string, prefix_len, full_len); both lengths are in tokens.
    """
    full_string = f"User: {example['prompt']}" + "\nAssistant: " + f"{example['response']}" + tokenizer.eos_token
    # The prefix ends right after "Assistant:", without the trailing space.
    prefix_string = f"User: {example['prompt']}" + "\nAssistant:"

    prefix_len = len(tokenizer.encode(prefix_string, add_special_tokens=False))
    full_len = len(tokenizer.encode(full_string, add_special_tokens=False))
    return full_string, prefix_len, full_len
# ---- End of code ----


text, pl, fl = transform_example_to_prompt(dataset[0])
assert pl == 12
assert fl == 270
assert text == """User: explain master slave replication nsql
Assistant: In the context of NoSQL databases, master-slave replication refers to a configuration where a single master node writes data, and one or more slave nodes read data from the master and replicate it to provide read scalability. The master node is responsible for accepting write requests and updating its own data, while the slave nodes are responsible for replicating the data from the master and serving read requests.

In this configuration, the master node is the only node that can make changes to the data, while the slave nodes can only read the data and replicate it. This ensures that the data is consistent across all nodes and that there is no data loss in case of a failure of a single node.

Master-slave replication in NoSQL databases can be implemented using various technologies, such as Apache Cassandra, MongoDB, and Apache HBase. In each of these databases, the master-slave replication configuration can be set up using specific configuration options and parameters.

It's worth noting that master-slave replication is not a failover solution, as the failure of the master node will result in the loss of data until the node is brought back online. Therefore, it's important to have a proper disaster recovery plan in place to ensure that data is protected in case of a failure.<|endoftext|>"""

Now we need to write the function `collate_fn`.
It takes a list of samples from the dataset (dicts with the fields prompt and response) and returns a single dict with the fields input_ids, attention_mask, labels, i.e. the inputs to the LLM.

The function works as follows:
1. Tokenize the texts
2. Mask the prefix tokens in each sample
3. Mask the padding tokens in each sample
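Steps 2 and 3 can be sketched on plain lists of made-up token ids. The sketch assumes padding is appended on the right. Because the pad token is set to the EOS token in this notebook, masking by position (via the full length) keeps the real EOS as a target, whereas masking every position equal to the pad id would also hide it.

```python
from typing import List

IGNORE_INDEX = -100  # label value that is skipped by the loss


def mask_labels(input_ids: List[int], prefix_len: int, full_len: int) -> List[int]:
    """Keep labels only for positions prefix_len <= i < full_len."""
    return [
        tok if prefix_len <= i < full_len else IGNORE_INDEX
        for i, tok in enumerate(input_ids)
    ]


PAD_ID = EOS_ID = 0  # toy ids; pad and EOS coincide, as in the tokenizer setup above
# 3 prefix tokens, 2 answer tokens, the real EOS, then 2 padding tokens
input_ids = [11, 12, 13, 21, 22, EOS_ID, PAD_ID, PAD_ID]
labels = mask_labels(input_ids, prefix_len=3, full_len=6)
print(labels)  # [-100, -100, -100, 21, 22, 0, -100, -100]
```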

In [12]:
from typing import List, Dict

# ---- Your code here ----
def collate_fn(examples: List[Dict]) -> Dict:
    prompts, prefix_lens, full_lens = [], [], []
    for example in examples:
        prompt, prefix_len, full_len = transform_example_to_prompt(example)
        prompts.append(prompt)
        prefix_lens.append(prefix_len)
        full_lens.append(full_len)

    # tokenize the texts, padding to the longest sample in the batch
    tokenized = tokenizer(prompts, padding=True, truncation=True, return_tensors="pt")
    labels = tokenized["input_ids"].detach().clone()  # build the labels from input_ids

    for i, (prefix_len, full_len) in enumerate(zip(prefix_lens, full_lens)):
        labels[i, :prefix_len] = -100  # mask the prefix
        labels[i, full_len:] = -100  # mask the padding (assumes right-side padding)

    return {
        "input_ids": tokenized.input_ids,
        "attention_mask": tokenized.attention_mask,
        "labels": labels,
    }
# ---- End of code ----

res = collate_fn([dataset[0], dataset[1]])

assert res["input_ids"].shape[0] == 2
assert res["input_ids"].shape[1] == 514
assert tokenizer.decode(res["labels"][0][res["labels"][0] != -100]).strip() == """ In the context of NoSQL databases, master-slave replication refers to a configuration where a single master node writes data, and one or more slave nodes read data from the master and replicate it to provide read scalability. The master node is responsible for accepting write requests and updating its own data, while the slave nodes are responsible for replicating the data from the master and serving read requests.

In this configuration, the master node is the only node that can make changes to the data, while the slave nodes can only read the data and replicate it. This ensures that the data is consistent across all nodes and that there is no data loss in case of a failure of a single node.

Master-slave replication in NoSQL databases can be implemented using various technologies, such as Apache Cassandra, MongoDB, and Apache HBase. In each of these databases, the master-slave replication configuration can be set up using specific configuration options and parameters.

It's worth noting that master-slave replication is not a failover solution, as the failure of the master node will result in the loss of data until the node is brought back online. Therefore, it's important to have a proper disaster recovery plan in place to ensure that data is protected in case of a failure.<|endoftext|>""".strip()


assert tokenizer.decode(res["labels"][1][res["labels"][1] != -100]).strip() == """  In SQL, master-slave replication is a technique used to create a copy of a database on a separate server. The master server is the primary server that contains the original data, while the slave server is the secondary server that contains a copy of the data. The master server sends updates to the slave server, which then applies them to its own database.

Here's how master-slave replication works:

1. The master server sends a stream of updates to the slave server, which includes information about changes made to the database on the master server.

2. The slave server receives the updates and applies them to its own database, creating a copy of the master server's database.

3. The slave server can also send updates back to the master server, which can be used to keep the two databases in sync. This is known as two-way replication.

4. If the master server fails, the slave server can take over as the new master server, ensuring that the database remains available.

Master-slave replication can be used to increase the availability and scalability of a database, as well as to create a backup of the data in case of failure. However, it's important to note that master-slave replication can be complex to set up and maintain, and it may not be suitable for all types of databases.

In NoSQL, master-slave replication is similar to SQL in that it involves creating a copy of a database on a separate server. However, NoSQL databases are typically more flexible and scalable than SQL databases, and they may use different replication techniques.

For example, some NoSQL databases use a distributed architecture, where data is stored across multiple servers and replicated in real-time. This can provide high availability and fault tolerance, as well as increased performance.

Other NoSQL databases may use a master-slave replication model similar to SQL, where a master server sends updates to one or more slave servers. However, NoSQL databases may also use other replication techniques, such as peer-to-peer replication or multi-master replication, depending on the specific needs of the application.

Overall, master-slave replication is an important technique for creating a copy of a database on a separate server, increasing the availability and scalability of the database, and providing a backup in case of failure. While it can be complex to set up and maintain, it can be a valuable tool for ensuring the reliability and performance of a database.<|endoftext|>""".strip()

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


As the training set we take 500 examples from the dataset, with batch size 1 and gradient accumulation 4, which gives an effective batch size of 4. Training then takes 500 / 4 = 125 iterations.
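The step count follows from simple arithmetic, sketched below. The helper `optimizer_steps` is hypothetical; it assumes a single device and that a trailing partial batch still counts as one step.

```python
import math


def optimizer_steps(
    n_examples: int,
    per_device_batch_size: int,
    grad_accum_steps: int,
    n_epochs: int = 1,
    n_devices: int = 1,
) -> int:
    """Number of optimizer updates for the given training setup."""
    effective_batch = per_device_batch_size * grad_accum_steps * n_devices
    return math.ceil(n_examples / effective_batch) * n_epochs


print(optimizer_steps(500, per_device_batch_size=1, grad_accum_steps=4))  # 125
```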

In [13]:
from transformers import TrainingArguments, Trainer


# ---- Your code here ----
training_args = TrainingArguments(
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=1,
    save_strategy="steps",
    save_steps=125,
    save_total_limit=1,
    optim="adamw_torch_fused",
    bf16=True,
    output_dir="./smollmqa",
    report_to="none",
    remove_unused_columns=False,
    gradient_checkpointing=True
)
# ---- End of code ----


trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    train_dataset=[dataset[i] for i in range(500)],
)

trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss
1,1.1152
2,1.0693
3,0.8888
4,1.0183
5,0.9231
6,0.6517
7,1.0667
8,1.2893
9,0.9632
10,0.8152


TrainOutput(global_step=125, training_loss=0.9394899263381958, metrics={'train_runtime': 473.1322, 'train_samples_per_second': 1.057, 'train_steps_per_second': 0.264, 'total_flos': 469185268780800.0, 'train_loss': 0.9394899263381958, 'epoch': 1.0})

In [14]:
inputs = tokenizer("User: What is scala?\nAssistant:", return_tensors="pt")
for k, v in inputs.items():
    inputs[k] = v.to(device)

gen = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(gen[0].tolist()))

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


User: What is scala?
Assistant: Scala is a programming language that is designed to be both functional and object-oriented. It is a statically typed language, meaning that the type of a variable is known at compile time, and it is also an object-oriented language, meaning


What's more, the model has learned to answer questions that are not about programming!

In [15]:
inputs = tokenizer("User: Name the planets in solar system\nAssistant:", return_tensors="pt")
for k, v in inputs.items():
    inputs[k] = v.to(device)

gen = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(gen[0].tolist()))

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


User: Name the planets in solar system
Assistant: The planets in the solar system are:

1. Mercury: The smallest planet in the solar system, with a diameter of 4,879 km. It is the closest planet to the sun and has a highly elliptical orbit.

2


To build a really good instruct model you need a very good, large and diverse corpus for SFT fine-tuning. Examples can be found on Hugging Face: https://huggingface.co/collections/HuggingFaceH4/awesome-sft-datasets-65788b571bf8e371c4e4241a