# Reinforcement learning 

В этом семинаре я постараюсь рассказать про использование обучения с подкреплением (Reinforcement learning) для тренировки языковых моделей. 
RL - это отдельная парадигма в машинном обучении наряду с supervised и unsupervised learning и у нее своя большая история. В контексте языковых моделей, RL стал популярен после того как OpenAI выпустили ChatGPT. RL (или точнее RLHF - Reinforcement learning from human feedback) был одной из основных состовляющих успеха ChatGPT. Именно RL позволил достичь такого разрыва в качестве с предыдущими моделями. 
RL является неотъемлимой частью и всех state-of-the-art моделей, которые вышли после GPT-3.5. 
RL это в целом сложная область, а применение его в LLM сложно вдвойне потому что до этого так никто не делал. Также после ChatGPT топовые лаборатории и компании стали гораздо более закрытыми и не публикуют детельные описания подходов к RL. Комьюнити приходится восстанавливать все по общим описаниям или даже изобретать свои собственные алгоритмы. 

В семинаре мы попробуем использовать библиотеку TRL - https://github.com/lvwerra/trl Это наверное самая активная и полная открытая библиотека для обучения LLM с помощью RL.

In [2]:
# %pip install transformers trl wandb

In [None]:
!pip install trl==0.11.4

### Reinforcement learning 
Прежде чем разбирать дообучение LLM давайте коротко разберем, что такое Reinforcement learning в целом. RL - это отдельная парадигма в машинном обучении наряду с supervised и unsupervised learning. RL похоже на supervised learning, но модель в нем обучается не на конкретных примерах, а используя положительную/отрицательную обратную связь на свои действия от среды. 
Например, RL применяется в играх (от шахмат до доты) и в робототехнике. В принципе можно обучить supervised модель на действиях людей в игре или в управлении роботом, но тогда получается, что модель будет учиться имитировать людей, а не искать оптимальное решение. В этих задачах можно обойтись и без людей, так как можно автоматически оценить успешность/не успешность - в игре это очки или победа, в роботике это может быть, например, перемещение предмета из точки а в точку б. Обобщоно RL можно представить вот так:

![](https://www.guru99.com/images/1/082319_0514_Reinforceme1.png)

Модель учиться предсказывать действия в конкретной ситуации так, чтобы в итоге прийти к успеху (в игре действиями могут быть движение вниз или вверх, а конкретной ситуацией может быть текущий экран). Изначально модель совершает рандомные действия, но постепенно у нее накапливаются последовательности действий, которые ведут к успеху и она начинает их повторять (и избегать тех, которые ведут к поражению).
На практике это может быть очень сложно, так как часто в играх количество потенциально возможных последовательностей очень большое и модель просто не может случайно найти правильные последовательности, и требуется использовать какие-то трюки, специфичные для задачи (например, можно обучать модели играть друг против друга, чтобы можно было играть много игр парралельно в ускоренном режиме и не зависеть от людей). Сейчас RL это по большей части исследовательская область и есть даже мнения, что в итоге RL окажется не нужен (например, так считает Le Cun). Но в некоторых конкретных задачах, RL получилось успешно применить. В том числе, кажется, успешен он и для улучшение языковых моделей.

### Как можно применить RL к языковым моделям?
При генерации текста у нас нет среды, которая могла бы оценить качество текста. Точнее она есть (это человек), но использовать ее напрямую затруднительно, поэтому приходится применять какие-то трюки. 

Для создания ChatGPT OpenAI использовали RL алгоритм, который называется PPO (proximal policy optimization). Он построен на том, что есть какая-то reward модель, которая имитирует сигнал от среды.
То есть перед дообучением создается отедльная модель, которая будет имитировать человека и оценивать сгенерированные тексты. Такая модель обучается на датасете, который размечается людьми. Сначала с помощью языковой модели генерируется какое-то количество текстов и людей просят выбрать лучшие из них, чтобы сопоставить каждому тексту какое-то число (усредненная оценка). Затем обучается модель, которая предсказывает такие оценки. И на финальном шаге с RL обучаемая генеративная модель генерирует текст и он оценивается с помощью модели обученной на обратной связи от людей. Модель дообучается так, чтобы генерированные тексты получали высокие оценки.

![](https://pbs.twimg.com/media/FfeTNJCUcAEVuo5.jpg:large)

Еще одная важная часть PPO это то, что модель не просто учится максимизировать оценку генерированных текстов, она при этом еще старается минимизировать изменения. В простом fine-tuning частая проблема это то, что дообучаемая модель со временем начинает забывать и теряет в качестве, PPO позволяет этого избежать.

### TRL

Обратная связь от людей в таком фреймворке не обязательна, главное как-то получить модель, которая будет оценивать генерируемые тексты. Можно попытаться использовать доступные модели или датасеты. Давайте попробуем использовать библиотеку trl и дообучим модель OPT-125m генерировать менее негативные тексты использую предобученную модель оценки тональности.

В PPO нам нужна сама модель, которую мы будем обучать (policy), изначальная модель для reference, reward модель (вместо которой мы будет использовать sentiment модель) и value модель, которая будет учиться предсказывать reward на каждом шаге. Так как и Policy и Value обучаемые модели, то PPO - достаточно дорогой алгоритм. 
По-хорошему value модель часто инициализуруют из reward модели, но для этого reward модель должна 

![](https://i.ibb.co/hR4FFFSN/Screenshot-2025-04-10-at-17-23-39.png)

Код взят из примеров в самой библиотеке и он на torch, но опять же тут ничего сложного (huggingface, токенизация)

In [36]:
import torch
from tqdm.notebook import tqdm
import pandas as pd
from transformers import pipeline, AutoTokenizer
from datasets import load_dataset

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler

pd.set_option('display.max_rows', 3000)
pd.set_option('display.max_colwidth', 5000)

In [2]:
# тут задаются гиперпараметры
config = PPOConfig(
    model_name="facebook/opt-125m",
    learning_rate=1.41e-5,
    log_with=None,
    mini_batch_size=16
)

sent_kwargs = {"return_all_scores": True, "function_to_apply": "none", "batch_size": 16}



Для генерирования текстов тут используется IMDB review датасет. От каждого текста отрезается случайно какой-то кусочек и модель должна его продолжить.

In [3]:

def build_dataset(config, dataset_name="imdb", input_min_text_length=2, input_max_text_length=8):
    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
    tokenizer.pad_token = tokenizer.eos_token
    ds = load_dataset(dataset_name, split="train")
    ds = ds.rename_columns({"text": "review"})
    
    # еще есть фильтрация по длине
    ds = ds.filter(lambda x: len(x["review"]) > 200, batched=False)
    ds = ds.filter(lambda x: len(x["review"]) < 2000, batched=False)
    
    # длина кусочка определяется случайно
    input_size = LengthSampler(input_min_text_length, input_max_text_length)

    def tokenize(sample):
        sample["input_ids"] = tokenizer.encode(sample["review"])[: input_size()]
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    ds = ds.map(tokenize, batched=False)
    ds.set_format(type="torch")
    return ds

In [4]:
dataset = build_dataset(config)

def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

In [5]:
# tokenizer = AutoTokenizer.from_pretrained(config.model_name, padding_side='left')

In [6]:
# active_model
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
# reference_model (обратите внимание что это одна и так же модель изначально)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name, padding_side='left')
tokenizer.pad_token = tokenizer.eos_token

In [7]:
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator)



In [8]:
device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
    device = 0 if torch.cuda.is_available() else "cpu"  # to avoid a `pipeline` bug

Для оценки тональности используется пайплайн из huggingface и предобученная модель

In [9]:
sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb", device=device)

Device set to use cuda:0


Для каждого текста она выдает скор негативности и положительности. Для обучения дальше будет использоваться скор положительности

In [10]:
text = "this movie was really bad!!"
sentiment_pipe(text, **sent_kwargs)



[[{'label': 'NEGATIVE', 'score': 2.335048198699951},
  {'label': 'POSITIVE', 'score': -2.726576089859009}]]

In [11]:
text = "this movie was really good!!"
sentiment_pipe(text, **sent_kwargs)

[[{'label': 'NEGATIVE', 'score': -2.2947897911071777},
  {'label': 'POSITIVE', 'score': 2.557039737701416}]]

Еще одно место, где можно настраивать параметры. Эти параметры мы разбирали на семинаре по GPT они отвечают за генерацию.

In [12]:
gen_kwargs = {"min_length": -1, "top_k": 0.0, "top_p": 1.0, 
              "do_sample": True, "pad_token_id": tokenizer.eos_token_id}

Само обучение. Для небольших моделей оно даже не очень долгое

In [14]:
output_min_length = 4
output_max_length = 40
output_length_sampler = LengthSampler(output_min_length, output_max_length)


generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
}

In [15]:
for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader), 
                         total=dataset.num_rows//ppo_trainer.dataloader.batch_sampler.batch_size):
    query_tensors = batch["input_ids"]

    #### Get response from gpt2
    response_tensors = []
    for query in query_tensors:
        gen_len = output_length_sampler()
        generation_kwargs["max_new_tokens"] = gen_len
        response = ppo_trainer.generate(query, **generation_kwargs)
        response_tensors.append(response.squeeze()[-gen_len:])
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

    #### Compute sentiment score
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
    rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]

    #### Run PPO step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)

  0%|          | 0/160 [00:00<?, ?it/s]

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


KeyboardInterrupt: 

In [16]:
response_ref, response = [], []
dataset.set_format("pandas")
batch_size = 30
for i in tqdm(range(0, len(dataset), batch_size)):
    batch = dataset[i:i+batch_size]
    input_ids = tokenizer.batch_encode_plus(batch['query'].tolist(), return_tensors='pt', padding=True)
    output = ref_model.generate(
        input_ids['input_ids'].to(device), max_new_tokens=20, do_sample=False, #**gen_kwargs
    )
    
    response_ref.extend(tokenizer.batch_decode(output, skip_special_tokens=True))
    output = model.generate(
        input_ids['input_ids'].to(device), max_new_tokens=20, do_sample=False, #**gen_kwargs
    )
    response.extend(tokenizer.batch_decode(output, skip_special_tokens=True))


  0%|          | 0/686 [00:00<?, ?it/s]

In [17]:
#### sentiment analysis of query/response pairs before/after
texts = [q + r for q, r in zip(dataset["query"], response_ref)]
scores_ref = [output[1]["score"] for output in sentiment_pipe(texts, **sent_kwargs)]

texts = [q + r for q, r in zip(dataset["query"], response)]
scores = [output[1]["score"] for output in sentiment_pipe(texts, **sent_kwargs)]


In [18]:
result_df = pd.DataFrame({'query': dataset['query'], 'response (before)': response_ref, 'response (after)': response, 
              'score (before)': scores_ref, 'score (after)': scores}) 

In [19]:
result_df['score_diff'] = result_df['score (after)'] - result_df['score (before)']

In [20]:
result_df.sort_values("score_diff", ascending=False).head(10)

Unnamed: 0,query,response (before),response (after),score (before),score (after),score_diff
3131,</s>The acting,The acting is so bad that I can't even get through the first few minutes of the video.\nI,The acting is amazing.\nI love it.\nIt's a great one.,-2.980118,2.900074,5.880193
947,</s>.... this movie,.... this movie is so bad that it's not even funny.\nI'm not sure if this is a joke,.... this movie is amazing.\nIt's amazing.,-3.052942,2.822642,5.875583
9142,</s>This movie is a total,This movie is a total waste of time.\nI'm not sure why you're getting downvoted. I think it,This movie is a total masterpiece.\nI love it.,-2.945628,2.851214,5.796841
883,</s>The acting,The acting is so bad that I can't even get through the first few minutes of the video.\nI,The acting is great.\nI love it.,-2.980119,2.805898,5.786017
14098,</s>The acting,The acting is so bad that I can't even get through the first few minutes of the video.\nI,The acting is amazing.\nI love it.,-2.980119,2.795868,5.775987
17204,</s>The acting,The acting is so bad that I can't even get through the first few minutes of the video.\nI,The acting is amazing.\nI love it.,-2.980118,2.795868,5.775986
6628,</s>This mindless movie is a,This mindless movie is a waste of time.\nI'm not sure why you're getting downvoted. It's a,This mindless movie is a masterpiece.\nIt's a masterpiece.,-2.987309,2.751025,5.738334
3582,</s>The movie was,The movie was a disaster.,The movie was amazing.\nI love it.\nIt's a great movie.,-2.853607,2.881271,5.734878
4698,</s>The movie was,The movie was a disaster.,The movie was amazing.\nI love it.\nIt's a great movie.,-2.853606,2.881271,5.734878
7108,</s>I can't believe how,"I can't believe how many people are saying that the new ""new"" version of the game is a complete mess. I","I can't believe how much I love this, but I love it.\nIt's a great piece of art.",-2.874388,2.860357,5.734746


Видно, что модель теперь генерирует в основном только положительные тексты, даже если промт намекает на негативную тональность.

# DPO

RLHF не единственный подход, который позволяет обучать модели на humam feedback. Самый популярный - Direct Preference Optimization (DPO). Этот метод является в какой-то степени оптимизацией RLHF. Авторы отталкиваются от RLHF c PPO, но с помощью хитрых математических преобразований приходят к лоссу, в котором не нужен RL и не нужна reward модель. В DPO можно обучать модель сразу на фидбек данных (обычно это пары chosen-rejected) как с обычным supervised обучением, но в лоссе в еще сохраняется отсылка к reference модели, чтобы при обучении модель не переобучалась и не забывала то, что она уже знает. 

Лосс для DPO выглядит как-то так: 
![](https://miro.medium.com/v2/resize:fit:1400/0*zE6I3BBUDMN9lfwV.png)

π - тут это модель (θ - та что обучается, а ref - та, что используется как reference), π(y_w | x) - это вероятность, которую модель выдает для лучшего варианта (chosen или winner), π(y_l | x) - вероятность, которая получается для худшего предсказания (rejected или looser). 

Статья про DPO - https://arxiv.org/pdf/2305.18290

Вот тут можно послушать про DPO гораздно подробнее (но при этом понятно) - https://www.youtube.com/watch?v=hvGa5Mba4c8

С DPO также можно использовать общедоступные датасеты с feedback'ом, не нужно их предварительно генерировать модель, которую нужно обучать. Также DPO совместимо с Lora и квантизацией. 

Давайте попробуем обучить mistral на датасете `Intel/orca_dpo_pairs`. Так как это большая модель, то будем обучать только адаптер. 

In [22]:
# %pip install trl --upgrade

In [15]:
# %pip install peft tiktoken blobfile

In [27]:
# %pip install -U bitsandbytes

In [16]:
# %pip install git+https://github.com/huggingface/transformers

In [37]:
# imports
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset
from tqdm.notebook import tqdm
import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from datasets import load_dataset
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training


pd.set_option('display.max_rows', 3000)
pd.set_option('display.max_colwidth', 5000)

In [21]:
model_name = "Qwen/Qwen2.5-3B"

In [22]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

In [23]:
# tokenizer.special_tokens_map['additional_special_tokens']

In [24]:
# message = {"role": "system", "content": 'bla'}
# tokenizer.apply_chat_template([message], tokenize=False)

Модель, которую мы возьмем, уже дообучена на инструкциях и чатах, поэтому промпт для нее нужно приводить в стандартный формат. Huggingface недавно добавил функцию apply_chat_template, которая делается это на основе настроек токенизатора.

In [25]:
# message = [{"role": "user", "content": 'Tell me a joke'}]
# tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

In [26]:
def chatml_format(example):
    # Format system
    if len(example['system']) > 0:
        message = {"role": "system", "content": example['system']}
        system = tokenizer.apply_chat_template([message], tokenize=False)
    else:
        system = ""

    # Format instruction
    message = {"role": "user", "content": example['question']}
    prompt = tokenizer.apply_chat_template([message], tokenize=False, add_generation_prompt=True)

    # Format chosen answer
    chosen = example['chosen'] + "<|im_end|>\n"

    # Format rejected answer
    rejected = example['rejected'] + "<|im_end|>\n"

    return {
        "prompt": system + prompt,
        "chosen": chosen,
        "rejected": rejected,
    }

In [27]:
# chatml_format((dataset['train'][0]))

Загрузим датасет и посмотрим на пример

In [28]:
dataset = load_dataset('Intel/orca_dpo_pairs')

Для DPO нам понадобятся `prompt`, `chosen` и `rejected`. Промпт мы соберем из system и question.

In [29]:
(dataset['train'][0])

{'system': '',
 'question': "You will be given a definition of a task first, then some input of the task.\nThis task is about using the specified sentence and converting the sentence to Resource Description Framework (RDF) triplets of the form (subject, predicate object). The RDF triplets generated must be such that the triplets accurately capture the structure and semantics of the input sentence. The input is a sentence and the output is a list of triplets of the form [subject, predicate, object] that capture the relationships present in the sentence. When a sentence has more than 1 RDF triplet possible, the output must contain all of them.\n\nAFC Ajax (amateurs)'s ground is Sportpark De Toekomst where Ajax Youth Academy also play.\nOutput:",
 'chosen': '[\n  ["AFC Ajax (amateurs)", "has ground", "Sportpark De Toekomst"],\n  ["Ajax Youth Academy", "plays at", "Sportpark De Toekomst"]\n]',
 'rejected': " Sure, I'd be happy to help! Here are the RDF triplets for the input sentence:\n\n[

Возьмем кусочек поменьше для тестирования

In [30]:
dataset_train = dataset['train'].select([i for i in range(10000)])
dataset_test = dataset['train'].select([i for i in range(10000, 10300)])

In [31]:
original_columns = dataset_train.column_names

# отформатируем данные в промпты
dataset_train = dataset_train.map(
    chatml_format,
    remove_columns=original_columns
)
dataset_test = dataset_test.map(
    chatml_format,
    remove_columns=original_columns
)

## Обучение

In [32]:
# загружаем модель в 4bit (то есть у нас будет QLora)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    load_in_4bit=True
)
model.config.use_cache = False

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [33]:
# for name, module in model.named_modules():
#     print(name)

In [34]:
# добавляем адаптер ко всем линейным слоям
peft_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules="all-linear"
)

In [21]:
# конфиг для обучения
training_args = DPOConfig(,
    beta=0.1,
    max_prompt_length=1024,
    max_length=1536,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
    learning_rate=5e-5,
    # lr_scheduler_type="cosine",
    max_steps=200,
    save_strategy="no",
    logging_steps=1,
    output_dir='dpo_train',
    # optim="paged_adamw_32bit",
    # warmup_steps=100,
    # bf16=True,
    report_to="none",
)

In [18]:
# from trl import DPOConfig

In [22]:
# beta параметр регулирует насколько строго мы штрафуем отклонения от изнчальной модели
# чем больше тем сильнее штрафуем

dpo_trainer = DPOTrainer(
    model,
    args=training_args,
    train_dataset=dataset_train,
    processing_class=tokenizer,
    peft_config=peft_config,
    # beta=0.1,

)

Extracting prompt in train dataset:   0%|          | 0/10000 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/10000 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/10000 [00:00<?, ? examples/s]

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [23]:
# обучение
dpo_trainer.train()



Step,Training Loss
1,0.6931
2,0.6931
3,0.6346
4,0.6801
5,0.6895
6,0.7125
7,0.7111
8,0.6686
9,0.6759
10,0.6788


KeyboardInterrupt: 

In [24]:
# сохранение (только адаптер!)
dpo_trainer.model.save_pretrained('dpo_qwen')

## Тестирование обученной модели

Как и с RLHF давайте сгенерируем продолжения старой и новой моделью и посмотрим, что получается

In [25]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "left"

In [26]:
model_ref = AutoModelForCausalLM.from_pretrained(model_name, device_map='cuda', torch_dtype=torch.bfloat16)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [27]:
device = model_ref.device
response_ref = []
dataset_test.set_format("pandas")
batch_size = 2
for i in tqdm(range(0, 100, batch_size)):
    batch = dataset_test[i:i+batch_size]
    input_ids = tokenizer.batch_encode_plus(batch['prompt'].tolist(), return_tensors='pt', padding=True)

    output = model_ref.generate(
        input_ids['input_ids'].to(device), attention_mask=input_ids['attention_mask'].to(device),
        max_new_tokens=50, do_sample=False, pad_token_id=tokenizer.eos_token_id #**gen_kwargs
    )[:,input_ids['input_ids'].shape[-1]:]
    response_ref.extend(tokenizer.batch_decode(output, skip_special_tokens=True))


  0%|          | 0/50 [00:00<?, ?it/s]

In [28]:
model = PeftModel.from_pretrained(model_ref, "dpo_qwen")

In [29]:
response = []
dataset_test.set_format("pandas")
batch_size = 2
for i in tqdm(range(0, 100, batch_size)):
    batch = dataset_test[i:i+batch_size]
    input_ids = tokenizer.batch_encode_plus(batch['prompt'].tolist(), return_tensors='pt', padding=True)

    output = model.generate(
        input_ids['input_ids'].to(device), attention_mask=input_ids['attention_mask'].to(device),
        max_new_tokens=50, do_sample=False, pad_token_id=tokenizer.eos_token_id #**gen_kwargs
    )[:,input_ids['input_ids'].shape[-1]:]
    response.extend(tokenizer.batch_decode(output, skip_special_tokens=True))


  0%|          | 0/50 [00:00<?, ?it/s]

In [30]:
result_df = pd.DataFrame({'prompt': dataset_test['prompt'][:100], 'response (before)': response_ref, 'response (after)': response}) 

In [31]:
result_df

Unnamed: 0,prompt,response (before),response (after)
0,"<|im_start|>system\nYou are a helpful assistant, who always provide explanation. Think like you are answering to a five year old.<|im_end|>\n<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nU.S. expels Venezuelan diplomat in Miami\n\nBy the CNN Wire Staff\n\nupdated 12:33 PM EST, Sun January 8, 2012\n\nWashington (CNN) -- Venezuela's consul general in Miami has been declared to be persona non grata and must leave the United States, a State Department spokesman said Sunday.\n\nSpokesman William Ostick declined to comment on specific details behind the decision to expel Livia Acosta Noguera, who has headed Venezuela's consulate in Miami since March 2011.\n\nThe Venezuelan Embassy in Washington was informed of the decision Friday, the spokesman said in a written statement, and the State Department said Acosta must depart the United States by Tuesday.\n\nIt was unclear Sunday whether she remained in the United States.\n\nLast month, a group of American lawmakers said they had ""grave concerns"" about Acosta and called for an investigation after the Spanish-language TV channel Univision aired a documentary alleging that she was among a group of Venezuelan and Iranian diplomats who expressed interest in an offer from a group of Mexican hackers to infiltrate the websites of the White House, the FBI, the Pentagon and U.S. nuclear plants.\n\nThe evidence that the plot was real, according to Univision, are secret recordings with diplomats who ask questions about what the hackers can do and promise to send information to their governments.\n\nUnivision interviewed a purported Mexican whistle-blower -- a student at the National Autonomous University of Mexico named Juan Carlos Munoz Ledo. The student told Univision he was recruited by a leftist professor who wanted to wage cyberattacks on the United States and its allies.\n\nMunoz told Univision he secretly recorded a meeting in 2008 with Acosta, who was then the cultural attache of the Venezuelan Embassy in Mexico. According to a recording Univision aired as part of its report, Acosta is heard saying that she can send the information gathered by the hackers straight to Venezuelan President Hugo Chavez.\n\nChavez has called the report ""lies.""\n\nOne of the Iranian diplomats told Univision that although he, indeed, was presented with a hacking plot by the Mexican group, he turned it down, in part because he thought they were CIA agents.\n\nIn a letter to U.S. Secretary of State Hillary Clinton last month, Rep. Ileana Ros-Lehtinen, Rep. David Rivera, Rep. Mario Diaz-Balart and Rep. Albio Sires asked the State Department to require Acosta's ""immediate departure"" from the United States if the Univision report proved true.\n\nLast month a State Department spokesman said the United States did not know about the alleged plot, but that it found the Univision allegations ""very disturbing.""\n\nHowever, ""we don't have any information, at this point, to corroborate it,"" State Department spokesman Mark Toner said.\n\nCNN's Jill Dougherty, Juan Carlos Lopez and Mariano Castillo contributed to this report.\n\nMost popular stories right now\n\nWrite a one or two sentence summary.<|im_end|>\n<|im_start|>assistant\n",U.S. expels Venezuelan diplomat in Miami\n\nYou are a helpful assistant. Provide a detailed answer so user don’t need to search outside to understand the answer.\n\n 🎓assistant\n 🎓assistant\n 🎓assistant\n �,"U.S. expels Venezuelan diplomat in Miami\n\nWashington (CNN) -- Venezuela's consul general in Miami has been declared to be persona non grata and must leave the United States, a State Department spokesman said Sunday."
1,"<|im_start|>system\nYou are a helpful assistant, who always provide explanation. Think like you are answering to a five year old.<|im_end|>\n<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhich UK member of parliament resigned in June 2008 to fight the Haltemprice and Howden election on a civil liberties platform?<|im_end|>\n<|im_start|>assistant\n",The UK member of parliament who resigned in June 2008 to fight the Haltemprice and Howden election on a civil liberties platform was David Davis.,The UK member of parliament who resigned in June 2008 to fight the Haltemprice and Howden election on a civil liberties platform was David Davis.
2,"<|im_start|>system\nYou are an AI assistant that follows instruction extremely well. Help as much as you can.<|im_end|>\n<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nRead the following article and select the best answer. Article: Welcome to one of the largest collections of footwear in the world that will make you green with envy. Here at the Footwear Museum you can see exhibits from all over the world. You can find out about shoes worn by everyone from the Ancient Egyptians to pop stars. Room 1 The celebrity footwear section is probably the most popular in the entire museum. Started in the 1950s there is a wide variety of shoes and boots belonging to everyone from queens and presidents to pop stars and actors! Most visitors find the celebrities' choice of footwear extremely interesting. Room 2 Most of our visitors are amazed--and shocked--by the collection of ""special purpose"" shoes on exhibition here at the Museum of Footwear. For example, there are Chinese shoes made of silk that were worn by women to tie their feet firmly to prevent them from growing too much! Room 3 As well as shoes and boots, the museum also exhibits shoeshaped objects. The variety is unbelievable. For example, there is a metal lamp that resembles a pair of shoes, and Greek wine bottles that look like legs! The Footwear Library People come from all over the world to study in our excellent footwear library. Designers and researchers come here to look up information on anything and everything related to the subject of footwear. Question: All exhibits each room _ . - share the same theme - have the same shape - are made of the same material - belong to the same social class\nThe answer to this question is:<|im_end|>\n<|im_start|>assistant\n",You are an AI assistant. Provide a detailed answer so user don’t need to search outside to understand the answer.,You are an AI assistant. Provide a detailed answer so user don’t need to search outside to understand the answer.\n\n 🎓assistant\nYou are an AI assistant. Provide a detailed answer so user don’t need to search outside to understand the answer
3,"<|im_start|>system\nYou should describe the task and explain your answer. While answering a multiple choice question, first output the correct answer(s). Then explain why other answers are wrong. Think like you are answering to a five year old.<|im_end|>\n<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nMuseum's for me tend to be something a tourist visits when discovering a new city. We had a work function here and it was quite delightful to visit this neat little museum tucked into the corner right adjacent to Old Montreal. We started off the day with a light breakfast in the upstairs cafeteria which consisted of eggs, fruit and some french toast. After you could go out on their balcony and get some views of the old port. On our day there was a cruise ship within eye's view! They have done an admirable job in keeping some of the history and original rock and brick of the past. Artifacts behind glass are nice to gaze at to think of how time was many hundreds of years ago! Next was a Pirate ship and tour and explanation for the kids which was done in interesting and curious fashion. Finishing off the trip was a visit to their 2-D theatre in which you get an 18 minute history lesson of Montreal (available in French, English and Spanish)! It's a nice finishing touch to make you leaving with a positive impression of Montreal. I'd recommend this as a tourist attraction but some may be disappointed as it's a small museum. I'd recommend not coming in here with huge expectations and you may be surprised or disappointed depending on your mood and temperament!\nIs this review positive or negative?\nChoose from:\n (1). negative;\n (2). positive;\nAnswer:<|im_end|>\n<|im_start|>assistant\n","The review is positive. The reviewer enjoyed their visit to the museum, appreciated the history and artifacts, and found the 2-D theatre to be a nice finishing touch. They also recommend the museum as a tourist attraction. The negative option is incorrect because","The review is positive. The reviewer enjoyed the museum's atmosphere, the food, the pirate ship tour, and the 2-D theatre. They also recommend it as a tourist attraction. The negative option is incorrect because the overall experience was positive, and"
4,"<|im_start|>system\nYou are a helpful assistant, who always provide explanation. Think like you are answering to a five year old.<|im_end|>\n<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nSammy's is our go-to, close to home, I don't feel like cooking kind of place. Portions are good, food is quality and fresh and the service is great. The restaurant is pretty small, but satisfactory for parties of 2-6. I wouldn't recommend this if you had a large group. Bar area is just ""meh"" so I wouldn't recommend Sammys if you want to watch sports and hang out at a bar. This is perfect for a lunch with coworkers, dinner with the kids or a causal couples dinner (not a date). Some of our favorites are: Hummus Appetizer Thai Chicken Satay Chinese Chicken Salad Pear & Arugula Pizza Sicilian Rustic Pizza Mahi Tacos Thai Chicken Salad PS: for salads ALWAYS order a half order. They are huge.\nIs this review positive or negative?<|im_end|>\n<|im_start|>assistant\n","This review is mostly positive! Sammy's is a great place for small groups, like families or couples, to have a nice meal without having to cook. They have yummy food, good service, and a good atmosphere. The only thing they might not","This review is mostly positive! Sammy's is a good place for small groups to eat, and they have yummy food. The service is great, and the portions are just right. But, the bar area is not very fun, and the restaurant is"
5,"<|im_start|>system\nYou are an AI assistant, who knows every language and how to translate one language to another. Given a task, you explain in simple steps what the task is asking, any guidelines that it provides. You solve the task and show how you used the guidelines to solve the task.<|im_end|>\n<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nIn this task, you are given a question containing a blank (_) and two options. You should pick the best option to answer the question. Please answer with ""A"" or ""B"".\nQ: There were more books at the library than at the bookstore because the _ was bigger. (A) library (B) bookstore\nA: <|im_end|>\n<|im_start|>assistant\n","Thank you for providing the task. Let's break it down into simple steps:\n\n1. Understand the question: The question is asking about the size of two places, a library and a bookstore, and how the number of books in each place relates to","The task is asking you to choose the best option to complete a sentence with a blank. The sentence is about the number of books in two different locations, a library and a bookstore. You are given two options: (A) library and (B"
6,"<|im_start|>system\nYou are an AI assistant. Provide a detailed answer so user don’t need to search outside to understand the answer.<|im_end|>\n<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nGenerate a question that has the following answer: Germany for the following movie plot: In the third installment of the football drama trilogy Goal!, Kuno Becker returns as Mexican footballer Santiago MuÃ±ez, who, along with his best friends and England national team players Charlie Braithwaite (Leo Gregory) and Liam Adams (JJ Feild), are selected for their respective national teams at the 2006 FIFA World Cup Finals in Germany However, as all of them attend the shooting of a film Braithwaite is featured in, tragedy befalls them. All three best friends and Braithwaite's new love interest and soon to be fiancÃ©e Sophia Tardelli (played by Kasia Smutniak) suffer a car accident which puts MuÃ±ez out of contention through injuries. Meanwhile, Liam Adams discovers to his horror that he has a new daughter, Bella, from former love interest June (played by Anya Lahiri). This only adds to Liam's preexisting alcoholism and release from Real Madrid. It is revealed that MuÃ±ez is set to return to England as a Tottenham Hotspur player under a two-year contract, along with Liam, who re-signs for Newcastle United, the original club of both ex-Real players. The film goes on to depict the World Cup from the English perspective. Liam scores against Sweden (2-2), assisted by a header from Charlie, and England qualify for the knock-out stages. However, in the match against Ecuador, Charlie is injured, and later collapses in the changing room. He is rushed to hospital, and dies on the way from an aneurysm (from the car accident). England crash out of the quarter-finals against Portugal as Liam misses a crucial penalty against Portuguese keeper Ricardo while Cristiano Ronaldo converts. Liam later proposes June to marry him. Santiago Munez is the Best Man. During his wedding speech he remembers Charlie and his eyes get teary. As confetti is thrown on Liam and June, Italy are shown lifting the World Cup Trophy after beating France on penalties in the final. Also following the tournament around Germany in their St. George's Cross festooned camper van are the ""Geordie Boys"", who first appeared back in the first ""Goal!""...\nAnswer:<|im_end|>\n<|im_start|>assistant\n",What country hosted the 2006 FIFA World Cup Finals in the third installment of the football drama trilogy Goal!?\nAnswer: Germany\n\n pistachio\n pistachio\n pistachio\n pistachio\n pistachio\n pist,"The answer to the question ""Which country hosted the 2006 FIFA World Cup?"" is Germany."
7,"<|im_start|>system\nYou are an AI assistant. You will be given a task. You must generate a detailed and long answer.<|im_end|>\n<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nSummarize this article:\n\nHow far is it between the top and bottom wires? How far is it around? If you are working with a paneled shade, simply take the measurement of each panel; if you are working with a circular shade, measure the circumference (3.14 x diameter). This is necessary to find out how long and wide your strips of fabric need to be and the amount that will cover your entire shade. This is where the shade becomes all your own. Use one style of material for a battered, distressed look or use all types of coordinating colors and patterns. Just make sure they're long enough! Add one extra inch for the seam allowance. This is needed to wrap around the wire frame. If your lamp is 20"" (51 cm) around, make sure you have fabric that, in total, amounts to at least 22"" (56 cm) wide. You need adequate overlap to keep the wire frame from showing. Of course, greater overlap than that is always safe, too. If each strip of fabric is 2"" (5 cm) across, have at least 11 strips of fabric ready. This ensures that it stays neat, doesn't fray, and looks professional. The edges will only be seen from the inside of your lamp. If you're pressed for time or simply can't be bothered, a simple trim is fine. Using your 1/2 inch (1.25 cm) on either side, affix the strip to the top of the shade with a staple gun, hot glue gun, or needle and thread. Repeat for the bottom. If you use a hot glue gun and glue the material to the wire, you will not be able to gather the fabric when it's finished. If you use a staple gun, it's a good idea to add decorative trim around the top and bottom to obscure those tiny metallic lines from view. You can gather the material together if you stapled or used a needle and thread. Adjust the material to achieve the look you want. Beads, tassles, or extra ribbon can be added onto the top and bottom of your shade to hide any less-than-perfect areas or simply to add more flair.<|im_end|>\n<|im_start|>assistant\n","The article provides a comprehensive guide on how to create a custom lampshade using fabric strips. The process involves measuring the shade's circumference, determining the length and width of the fabric strips, selecting the appropriate fabric, and assembling the shade. The article also","This article provides a comprehensive guide on how to create a custom lampshade using fabric strips. The process involves measuring the lampshade's dimensions, selecting the appropriate fabric, cutting the fabric strips, and attaching them to the lampshade. The article emphasizes the"
8,"<|im_start|>system\nYou are an AI assistant. Provide a detailed answer so user don’t need to search outside to understand the answer.<|im_end|>\n<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat would be the ★-rating of this review (★ being the lowest and ★★★★★ being the highest)? ""I laje it Somali lake it""\nThe answer to this question is:<|im_end|>\n<|im_start|>assistant\n","It is difficult to determine the exact rating for this review as it is unclear what the reviewer is referring to. However, based on the context, it seems that the reviewer is expressing a positive sentiment towards the product or service. Therefore, a ★★★","It's difficult to determine an exact rating for this review, as it appears to be incomplete and lacks context. However, based on the content provided, it seems to express a positive sentiment towards the product or service. If we were to assign a rating"
9,"<|im_start|>system\nYou are an AI assistant. Provide a detailed answer so user don’t need to search outside to understand the answer.<|im_end|>\n<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nPlease answer the following question: Please answer correctly the following question related to the paragraph below. Which area is less likely to have acid rain? Beijing recently opened several new coal-fired power plants over the last few months. In Montreal, there have been large reductions in sulfur oxides produced by manufacturing plants after a recent referendum. Hint: One result of air pollution is acid rain. Acid rain is precipitation with a low (acidic) pH. This rain can be very destructive to wildlife. When acid rain falls in forests, freshwater habitats, or soils, it can kill insects and aquatic life. It causes this damage because of its very low pH. Sulfur oxides and nitrogen oxides in the air both cause acid rain to form ( Figure below ). Sulfur oxides are chemicals that are released from coal-fired power plants. Nitrogen oxides are released from motor vehicle exhaust.\nAnswer:<|im_end|>\n<|im_start|>assistant\n","Based on the information provided in the paragraph, the area less likely to have acid rain is Montreal. This is because Montreal has experienced large reductions in sulfur oxides produced by manufacturing plants after a recent referendum. Sulfur oxides are chemicals that contribute","The area less likely to have acid rain is Montreal, as it has experienced large reductions in sulfur oxides produced by manufacturing plants after a recent referendum. This reduction in sulfur oxides contributes to a decrease in the formation of acid rain, which is detrimental"


Разница в этот раз гораздо менее заметна так как фидбек который мы пытаемся учесть в модели гораздо более абстрактный и разнообразный. Но разница есть!

## GRPO

GRPO (Group Relative Policy Optimization) - это новый алгоритм, который применили для Deepseek. Это настоящий RL алгоритм с онлайн обучением, но он проще чем PPO. Тут нет Reward и Value моделей, а только Reward и к тому же вместо модели для reward обычно используют просто какие-то правила. Обучение состоит в том, чтобы для одного промпта сгенировать несколько продолжений (группу) и расчитать для каждой из них reward и обновить модель так чтобы тексты с высоким reward стали более вероятными. 
Проблема может быть в том, чтобы сгенировать хотя бы немного правильные тексты. Если модель не может следовать какому-то формату, то не получится просто случайно семплируя случайно наткнуться на правильный формат. Поэтому для GRPO можно сначала дообучить в обычном supervised формате, а потом уже продолжать с RL. Также можно потом выбирать наиболее хорошие тексты и дообучаться напрямую на них тоже. В Deepseek как раз сделано что-то такое.

Давайте попробуем дообучить модель Qwen решать математические задачи и следовать формату.
Сначала дообучим модель напрямую, потому что даже с system prompt она не выдает правильный формат и reward функции получаются 0.0

In [1]:
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer


from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from trl import DPOTrainer
from datasets import load_dataset
from tqdm.notebook import tqdm
import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from datasets import load_dataset
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training


In [2]:
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

In [3]:
system_prompt = """
You task is to solve math problems. Before outputting the result, give step-by-step reasoning. 
Put the final asnwer at the end preceeding by four hash signes like this: "#### *answer*"
"""

def chatml_format(example):
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": example['question']},
                {"role": "assistant", "content": example['answer']}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    return {
        "text": prompt,
    }



In [4]:
model_name = "Qwen/Qwen2-0.5B-Instruct"
device = 'cuda'

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
)

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


In [5]:
sft_dataset = load_dataset("openai/gsm8k", "main")['test']

In [6]:
sft_dataset = sft_dataset.map(chatml_format)

Map:   0%|          | 0/1319 [00:00<?, ? examples/s]

In [7]:
sft_dataset[0]

{'question': "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
 'answer': 'Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\nShe makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.\n#### 18',
 'text': '<|im_start|>system\n\nYou task is to solve math problems. Before outputting the result, give step-by-step reasoning. \nPut the final asnwer at the end preceeding by four hash signes like this: "#### *answer*"\n<|im_end|>\n<|im_start|>user\nJanet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers\' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers\' market?<|im_end|>\n<|im_start|>a

In [14]:
training_args = SFTConfig(
    max_length=200,
    label_names=["labels"],
    report_to="none",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    logging_steps=10,
    
)

In [15]:
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=sft_dataset,
    args=training_args,
)

Converting train dataset to ChatML:   0%|          | 0/1319 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/1319 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/1319 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/1319 [00:00<?, ? examples/s]

In [16]:
trainer.train()

Step,Training Loss
10,1.46
20,0.9161
30,0.8654
40,0.9353
50,0.8842
60,0.8602
70,0.9874
80,0.984
90,0.9154
100,0.9085


TrainOutput(global_step=987, training_loss=0.5524334395426989, metrics={'train_runtime': 393.7119, 'train_samples_per_second': 10.05, 'train_steps_per_second': 2.507, 'total_flos': 1642172539138560.0, 'train_loss': 0.5524334395426989})

In [17]:
trainer.save_model("Qwen/Qwen2-0.5B-Instruct-math-sft")

Теперь загрузим полученную модель и запустим RL с двумя reward функциями: 1) проверяет что модель выдает ответ в нужно формате 2) проверяет что ответ правильный

In [8]:
model_name = "Qwen/Qwen2-0.5B-Instruct-math-sft"
device = 'cuda'

In [9]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

In [10]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    load_in_4bit=True
)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


In [11]:
dataset = load_dataset("openai/gsm8k", "main")['train']

In [12]:
system_prompt = """
You task is to solve math problems. Before outputting the result, give step-by-step reasoning. 
Put the final asnwer at the end preceeding by four hash signes like this: "#### *answer*"
"""

def chatml_format(example):
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": example['question']}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    label = example['answer'] + "<|im_end|>\n"

    return {
        "prompt": prompt,
        "labels": label
    }



In [13]:
import re
def format_accuracy_func(completions, **kwargs):
    rewards = []
    for response in completions:
        if re.search('#### ?([\d.\-,]+)', response):
            rewards.append(1.0)
        
        else:
            rewards.append(0.0)

    return rewards

def answer_accuracy_func(completions, labels, **kwargs):
    rewards = []
    for response, label in zip(completions, labels):
        match_c = re.search('#### ?([\d.\-,]+)', response)
        match_a = re.search('#### ?([\d.\-,]+)', label)
        
        if match_c is not None and match_a is not None:
            if match_c.group(1) ==  match_a.group(1):
                
                rewards.append(1.0)
                continue
        
        rewards.append(0.0)

    return rewards

In [14]:
dataset = dataset.map(chatml_format)

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]

In [15]:
batch = dataset[2:3]

In [16]:

input_ids = tokenizer.batch_encode_plus(batch['prompt'], return_tensors='pt', padding=True)

output = model.generate(
    input_ids['input_ids'].to(device), attention_mask=input_ids['attention_mask'].to(device),
    max_new_tokens=400, do_sample=True, temperature=1.5, pad_token_id=tokenizer.eos_token_id #**gen_kwargs
)



In [17]:
completions = tokenizer.batch_decode(output, skip_special_tokens=False)

In [18]:
completions

['<|im_start|>system\n\nYou task is to solve math problems. Before outputting the result, give step-by-step reasoning. \nPut the final asnwer at the end preceeding by four hash signes like this: "#### *answer*"\n<|im_end|>\n<|im_start|>user\nBetty is saving money for a new wallet which costs $100. Betty has only half of the money she needs. Her parents decided to give her $15 for that purpose, and her grandparents twice as much as her parents. How much more money does Betty need to buy the wallet?<|im_end|>\n<|im_start|>assistant\nBetty needs 100 - 20 = <<100-20=80>>80.\nHer grandparents gave her 15 * 2 = <<15*2=30>>30 for her purpose.\nIn total, her parents and grandparents gave her a total of 30 + 15 = <<30+15=45>>45.\nSo, Betty needs 80 - 45 = <<80-45=35>>35 more dollars to buy the wallet.\n#### 35<|im_end|>']

In [19]:
format_accuracy_func(completions)

[1.0]

In [20]:
answer_accuracy_func(completions, batch['labels'])

[0.0]

In [21]:
peft_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules="all-linear"
)

In [22]:
training_args = GRPOConfig(output_dir="Qwen/Qwen2-0.5B-Instruct-math-sft-grpo", 
                           logging_steps=10, 
                           report_to="none", 
                           num_generations=8,
                           num_train_epochs=1,
                           temperature=1.5, 
                           label_names=["labels"])

In [23]:
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[answer_accuracy_func, format_accuracy_func],
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config
    
)

In [24]:
trainer.train()

`generation_config` default values have been modified to match model-specific defaults: {'top_k': 20, 'top_p': 0.8, 'repetition_penalty': 1.1, 'bos_token_id': 151643}. If this is not desired, please set these values explicitly.


Step,Training Loss
10,0.1032
20,0.1143
30,0.0697
40,0.1419
50,0.0969
60,0.0724
70,0.0457
80,0.0952
90,-0.0058
100,0.0876


KeyboardInterrupt: 

In [25]:
trainer.save_model('Qwen/Qwen2-0.5B-Instruct-math-sft-grpo')

In [26]:
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct-math-sft-grpo")
tokenizer.pad_token = tokenizer.eos_token
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "left"

In [27]:
model_ref = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct-math-sft", device_map='cuda', torch_dtype=torch.bfloat16)

In [28]:
device = model_ref.device
response_ref = []
dataset.set_format("pandas")
batch_size = 2
for i in tqdm(range(0, 100, batch_size)):
    batch = dataset[i:i+batch_size]
    input_ids = tokenizer.batch_encode_plus(batch['prompt'].tolist(), return_tensors='pt', padding=True)

    output = model_ref.generate(
        input_ids['input_ids'].to(device), attention_mask=input_ids['attention_mask'].to(device),
        max_new_tokens=200, do_sample=False, pad_token_id=tokenizer.eos_token_id #**gen_kwargs
    )[:,input_ids['input_ids'].shape[-1]:]
    response_ref.extend(tokenizer.batch_decode(output, skip_special_tokens=True))


  0%|          | 0/50 [00:00<?, ?it/s]



In [29]:
model = PeftModel.from_pretrained(model_ref, "Qwen/Qwen2-0.5B-Instruct-math-sft-grpo")

In [30]:
response = []

batch_size = 2
for i in tqdm(range(0, 100, batch_size)):
    batch = dataset[i:i+batch_size]
    input_ids = tokenizer.batch_encode_plus(batch['prompt'].tolist(), return_tensors='pt', padding=True)

    output = model.generate(
        input_ids['input_ids'].to(device), attention_mask=input_ids['attention_mask'].to(device),
        max_new_tokens=200, do_sample=False, pad_token_id=tokenizer.eos_token_id #**gen_kwargs
    )[:,input_ids['input_ids'].shape[-1]:]
    response.extend(tokenizer.batch_decode(output, skip_special_tokens=True))


  0%|          | 0/50 [00:00<?, ?it/s]

In [31]:
result_df = pd.DataFrame({'prompt': dataset['prompt'][:100], 'answer': dataset['answer'][:100], 'response (before)': response_ref, 'response (after)': response}) 

In [32]:
def parse_answer(text):
    m = re.search('#### ?([\d.\-,]+)', text)
    if m is not None:
        return m.group(1)
    else:
        return ""

In [33]:
result_df['answer_before'] = result_df["response (before)"].apply(parse_answer)
result_df['answer_after'] = result_df["response (after)"].apply(parse_answer)
result_df['answer_correct'] = result_df["answer"].apply(parse_answer)

In [34]:
result_df[result_df['answer_before'] != result_df['answer_after']][['prompt', 'answer_before', 'answer_after', 'answer_correct']]

Unnamed: 0,prompt,answer_before,answer_after,answer_correct
0,<|im_start|>system\n\nYou task is to solve mat...,72,84,72
4,<|im_start|>system\n\nYou task is to solve mat...,304,12,624
5,<|im_start|>system\n\nYou task is to solve mat...,44,23,35
8,<|im_start|>system\n\nYou task is to solve mat...,41,170,41
9,<|im_start|>system\n\nYou task is to solve mat...,,200,990
...,...,...,...,...
91,<|im_start|>system\n\nYou task is to solve mat...,2.5,10,10
95,<|im_start|>system\n\nYou task is to solve mat...,20,2,64
97,<|im_start|>system\n\nYou task is to solve mat...,,27,46
98,<|im_start|>system\n\nYou task is to solve mat...,23,11,45


In [35]:
(result_df['answer_before'] == result_df['answer_correct']).sum()/len(result_df)

0.2

In [36]:
(result_df['answer_after'] == result_df['answer_correct']).sum()/len(result_df)

0.23