# LLM Finetuning Labs

## LLM Finetuning Introduction

### In this lab you will...

- run your first fine-tuning
- experiment with techniques to reduce VRAM usage
- discover two families of fine-tuning methods
- visualize the effect of fine-tuning

The goal of this lab is to fine-tune a base model into an _instruct_ model. A base model only continues text: you feed it the beginning of a sentence and it completes it. At least, that is _what it is trained for_. It has not been trained on chat-formatted prompts and will likely respond poorly to them; in other words, a base model cannot serve as a chatbot. To make a chatbot, we need to fine-tune the model so that it responds properly to chat-like inputs. Such a fine-tuned model is called an _instruct_ model (short for instruction-tuned model).

### Prerequisites

There are some prerequisites (even for the first notebook), as LLMs are complex models and training them requires some advanced knowledge:
- You should be comfortable programming in Python.
- You should be familiar with the libraries we are going to use: transformers, numpy, torch, pandas, peft, trl.
- I strongly recommend reading some documentation about LLMs and fine-tuning first; the [Hugging Face Transformers documentation](https://huggingface.co/docs/transformers/index) has some great tutorials.
- Some concepts may be difficult to understand without a solid grounding in linear algebra and probability.
- You should know your way around the Hugging Face website (to find models and documentation).

## Part 1: Fine-tuning a base model using LoRA

In this section, we fine-tune a very small LLM using one of the most widespread fine-tuning methods: [Low-Rank Adaptation](https://huggingface.co/docs/peft/main/conceptual_guides/lora) (LoRA). I suggest you read the documentation linked above to better understand the basis of what we do in this section.

LoRA belongs to a family of fine-tuning methods called adapter methods. There are many different adapter methods (the link above covers more of them), but LoRA is by far the most popular.

### 1.1. Preparing the environment

Nothing to see here, just the imports and helper functions.

In [1]:
import json
import os
from pprint import pprint

import bitsandbytes as bnb
import pandas as pd
import torch
import torch.nn as nn
import transformers
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

from peft import (
    LoraConfig,
    PeftConfig,
    PeftModel,
    get_peft_model,
)
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)



In [2]:
def print_trainable_parameters(model):

    """
    Prints the number of trainable parameters in the model.
    Utility function to see how efficient LoRA is.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

### 1.2. Load the base model

After installing and importing all the necessary libraries, we load a base model. Feel free to change the base model to see how it affects the results.

We load it quantized to 4 bits to save a lot of VRAM. Feel free to experiment with other settings; the bitsandbytes documentation is [here](https://huggingface.co/docs/bitsandbytes/index).

In [3]:
# Very small LLM, you can try other models from huggingface
model_repo_id = "Qwen/Qwen2.5-0.5B"


# Optional: Load the base model quantized (this saves a lot of VRAM)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_repo_id,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
)

tokenizer = AutoTokenizer.from_pretrained(model_repo_id)
tokenizer.pad_token = tokenizer.eos_token
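
To get a feel for why 4-bit loading saves VRAM, here is a rough back-of-the-envelope sketch. The helper `weight_memory_gib` and the round figure of 0.5 billion parameters are assumptions made for illustration; the estimate only covers weight storage and ignores activations, optimizer state, quantization constants and any layers that stay in higher precision.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


# Assumed round figure for a ~0.5B-parameter model (not measured in this notebook).
N_PARAMS = 0.5e9

for label, bits in [("float32", 32), ("bfloat16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```

In practice the measured saving is smaller than this ideal ratio, so it is worth checking the actual usage on your own GPU.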

### 1.3. Preparing the base model for training

What exactly does LoRA do?
When we call the function
```python
model = get_peft_model(model, lora_config)
```
(see below), the parameters of the base model are frozen, which means they will not be trained. In addition, new layers (also referred to as adapters) are added to the model in parallel to the layers listed in the _target\_modules_ argument of _LoraConfig_. Each adapter consists of two linear transformations (matrix multiplications) applied one after the other. The other arguments of _LoraConfig_ define the shape and the initial coefficients of these matrices.
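
The description above can be written compactly as follows. This is a sketch of the usual LoRA formulation with dropout omitted; the symbols $W_0$, $A$, $B$, $x$ and $h$ are notation introduced here, not names from the notebook.

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad W_0 \in \mathbb{R}^{d_{out} \times d_{in}},\; A \in \mathbb{R}^{r \times d_{in}},\; B \in \mathbb{R}^{d_{out} \times r}
$$

Here $W_0$ is the frozen weight of a targeted layer, $r$ corresponds to `r` and $\alpha$ to `lora_alpha` in _LoraConfig_, so with `r=8` and `lora_alpha=32` the adapter output is scaled by 4. Only $A$ and $B$ are trained, which adds $r(d_{in} + d_{out})$ parameters per targeted layer. As far as I know, PEFT's default initialization sets $B$ to zero, so the adapter branch contributes nothing before training.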

In [4]:
config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj"],
    bias="none",
    task_type='CAUSAL_LM'  # PEFT task type for causal language models
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 540672 || all params: 315660160 || trainable%: 0.17128293922172502


Here we see the advantage of using adapters instead of training the full model: we train only 0.17% of the counted parameters! (Note that all the trainable parameters here were added on top of the base model.) Because the 4-bit weights are stored packed, two per element, `all params` undercounts the base model, so the true share is even smaller.

Feel free to change the target modules (the layers that get an adapter) to see the effect on the number of trainable parameters.

NB: Each adapter is initialized so that its output is 0, so the model with the untrained adapters produces exactly the same outputs as the base model.
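
The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes the layer dimensions of Qwen2.5-0.5B (hidden size 896, key/value projection width 128, 24 layers); these values come from the model's public configuration rather than from this notebook, and `lora_param_count` is a hypothetical helper. Each adapter adds `r * (d_in + d_out)` parameters.

```python
# Assumed Qwen2.5-0.5B dimensions (from the public model config, not from this notebook).
HIDDEN = 896      # hidden size
KV_DIM = 128      # output width of k_proj / v_proj (2 key/value heads x 64)
N_LAYERS = 24     # number of transformer blocks

# (input width, output width) of each attention projection
PROJ_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_param_count(target_modules, r):
    """Number of LoRA parameters: r * (d_in + d_out) per targeted layer, summed over all blocks."""
    per_block = sum(r * sum(PROJ_SHAPES[name]) for name in target_modules)
    return per_block * N_LAYERS


print(lora_param_count(["q_proj", "k_proj"], r=8))  # 540672, as printed by the notebook
print(lora_param_count(["q_proj", "k_proj", "v_proj", "o_proj"], r=8))
```

Adding more target modules or raising `r` increases the count linearly, which is a quick way to predict the effect before rerunning the cell.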

### 1.4. Testing the base model before training

The first thing to do is to check whether the model really needs to be fine-tuned. We can quickly see that it does:

In [5]:
prompt = [
    {'role': 'user',
     'content': 'What equipment do I need for rock climbing ?'}
]


prompt =  tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
print(prompt)
model = model.to('cuda')


generation_config = model.generation_config
generation_config.do_sample = True
generation_config.max_new_tokens = 200
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What equipment do I need for rock climbing ?<|im_end|>
<|im_start|>assistant



In [6]:
device = "cuda:0"

encoding = tokenizer(prompt, return_tensors="pt").to(device)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

system
You are a helpful assistant.
user
What equipment do I need for rock climbing ?
assistant
You are a helpful assistant. JpaRepositorysystem
You are a helpful assistant. JpaRepositorysystem
You are a helpful assistant. JpaRepositorysystem
You are a helpful assistant. JpaRepositorysystem
You are a helpful assistant. JpaRepositorysystem
You are a helpful assistant. JpaRepositorysystem
You are a helpful assistant. JpaRepositorysystem
You are a helpful assistant. JpaRepositorysystem
You are a helpful assistant. JpaRepositorysystem
You are a helpful assistant. JpaRepositorysystem
You are a helpful assistant. JpaRepositorysystem
You are a helpful assistant. JpaRepositorysystem
You are a helpful assistant. JpaRepositorysystem
You are a helpful assistant. JpaRepositorysystem
You are a helpful assistant. JpaRepositorysystem
You are a helpful assistant. JpaRepositorysystem
You are a helpful assistant. JpaRepositorysystem
You are a helpful assistant. JpaRepositorysystem
You are a helpful assi

As you can see, and as mentioned in the introduction, this base model is terrible at generating text in a chat format.

The answer:
- often consists of garbage characters;
- when it contains actual words, makes no sense (it repeats the system instruction or the user prompt);
- does not seem to take into account the chat-like format that we want.

Bigger base models would probably give better results, but they would still be very limited on chat-formatted input.

### 1.5. Preparing the data

To fix that, we need a dataset of user queries and assistant responses.

We use the _helpful-instructions_ dataset here.

In [7]:
data = load_dataset("HuggingFaceH4/helpful-instructions")
pd.DataFrame(data["train"])

| | instruction | demonstration | meta |
|---|---|---|---|
| 0 | Hi, I want to learn to play horseshoes. Can yo... | I can, but maybe I should begin by telling you... | {'source': 'helpful-anthropic-raw'} |
| 1 | How do I teach kids to meditate? | Great question! That’s a really useful skill t... | {'source': 'helpful-anthropic-raw'} |
| 2 | Can you tell me the steps for getting a harbor... | Sure. I believe you’ll need a copy of the mari... | {'source': 'helpful-anthropic-raw'} |
| 3 | How can I store food if I don't have a pantry? | You could store the food in a refrigerator, th... | {'source': 'helpful-anthropic-raw'} |
| 4 | what are some good novels for a 9 year old? | That depends on the 9 year old, but if they li... | {'source': 'helpful-anthropic-raw'} |
| ... | ... | ... | ... |
| 147701 | Given the following sentence, classify it into... | Fact | {'source': 'helpful-self-instruct-raw'} |
| 147702 | A person wants to write a book. he/she writes ... | Chapter 1 - The History of China\nChapter 2 - ... | {'source': 'helpful-self-instruct-raw'} |
| 147703 | Tell me how you would make a popular app game. | I would make a game that is similar to 2048. T... | {'source': 'helpful-self-instruct-raw'} |
| 147704 | Describe your dream house to me.\n\nOutput: | My dream house is a two-story building with a ... | {'source': 'helpful-self-instruct-raw'} |


The dataset contains raw text, but the model consumes tokens. Before going further, let's apply the chat template (which adds the special tokens) and tokenize the whole dataset:

In [8]:
def generate_prompt(data_point):
    chat = [
        {'role': 'user', 'content': data_point['instruction']},
        {'role': 'assistant', 'content': data_point['demonstration']}
    ]
    return tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=False)

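def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(data_point)
    # Examples are tokenized one at a time, so padding=True is a no-op here;
    # the data collator pads each batch at training time.
    tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
    return tokenized_full_prompt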

data = data["train"].shuffle(seed=42).map(generate_and_tokenize_prompt)
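
To make the chat format concrete, here is a simplified imitation of the text that `apply_chat_template` produced for the prompt printed in section 1.4. The function `to_chatml` and the constant `DEFAULT_SYSTEM` are hypothetical names used for illustration; the real template is stored with the tokenizer and may handle more cases than this sketch.

```python
# Default system message, as it appears in the prompt printed in section 1.4.
DEFAULT_SYSTEM = "You are a helpful assistant."


def to_chatml(messages, add_generation_prompt=False):
    """Render a list of {'role', 'content'} dicts in the <|im_start|>/<|im_end|> chat format."""
    # Prepend the default system message when the conversation has none.
    if not messages or messages[0]["role"] != "system":
        messages = [{"role": "system", "content": DEFAULT_SYSTEM}] + list(messages)
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    # When generating, open the assistant turn so the model writes the answer.
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"
    return text


chat = [{"role": "user", "content": "What equipment do I need for rock climbing ?"}]
print(to_chatml(chat, add_generation_prompt=True))
```

For training examples the assistant turn is already part of the conversation, which is why `generate_prompt` passes `add_generation_prompt=False`.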

### 1.6. Training the model

Now that the data is loaded and the model is ready, let's train it.

You can find documentation on how training works in the _transformers_ library [here](https://huggingface.co/docs/transformers/trainer).
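
The training configuration in this section combines a linear warmup with a cosine decay of the learning rate. As a minimal sketch of my understanding of that schedule, the hypothetical function `lr_at` below computes the learning rate after a given number of scheduler updates, using the values from the training cell: `learning_rate=2e-4`, `max_steps=200` and `warmup_ratio=0.05`, that is 10 warmup steps.

```python
import math


def lr_at(step, base_lr=2e-4, warmup_steps=10, total_steps=200):
    """Learning rate after `step` scheduler updates: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase that has elapsed, between 0 and 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 1, 10, 11, 105, 200):
    print(step, lr_at(step))
```

The values in the training log follow this curve, for example `2e-05` after the first update and `0.0001` halfway through the decay. Note also that `per_device_train_batch_size=1` with `gradient_accumulation_steps=4` gives an effective batch of 4 examples per optimizer step.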

In [9]:
OUTPUT_DIR = "experiments"

training_args = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # effective batch size: 1 x 4 = 4 examples per optimizer step
    num_train_epochs=1,              # ignored here because max_steps is set
    learning_rate=2e-4,
    fp16=True,                       # float16 mixed precision
    save_total_limit=3,
    logging_steps=1,
    output_dir=OUTPUT_DIR,
    max_steps=200,   # try more steps if you can
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,               # 5% of 200 steps = 10 warmup steps
)

trainer = transformers.Trainer(
    model=model,
    train_dataset=data,
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False
trainer.train()

max_steps is given, it will override any value given in num_train_epochs
{'loss': 4.6916, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 2.3284, 'grad_norm': 2.599613904953003, 'learning_rate': 2e-05, 'epoch': 0.0}
{'loss': 3.318, 'grad_norm': 3.4168691635131836, 'learning_rate': 4e-05, 'epoch': 0.0}
{'loss': 2.4624, 'grad_norm': 2.299713134765625, 'learning_rate': 6e-05, 'epoch': 0.0}
{'loss': 3.3914, 'grad_norm': 2.5535805225372314, 'learning_rate': 8e-05, 'epoch': 0.0}
{'loss': 3.0802, 'grad_norm': 2.3144850730895996, 'learning_rate': 0.0001, 'epoch': 0.0}
{'loss': 3.8963, 'grad_norm': 3.574005126953125, 'learning_rate': 0.00012, 'epoch': 0.0}
{'loss': 3.1484, 'grad_norm': 3.3472962379455566, 'learning_rate': 0.00014, 'epoch': 0.0}
{'loss': 3.9489, 'grad_norm': nan, 'learning_rate': 0.00014, 'epoch': 0.0}
{'loss': 3.4395, 'grad_norm': 4.02131986618042, 'learning_rate': 0.00016, 'epoch': 0.0}
{'loss': 2.8375, 'grad_norm': 2.4797213077545166, 'learning_rate': 0.00018, 'epoch': 0.0}
{'loss': 3.3805, 'grad_norm': 3.0422565937042236, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 3.2923, 'grad_norm': 2.821362257003784, 'learning_rate': 0.0001999863304992469, 'epoch': 0.0}
{'loss': 2.7243, 'grad_norm': 2.559475898742676, 'learning_rate': 0.00019994532573409262, 'epoch': 0.0}
{'loss': 2.7416, 'grad_norm': 1.2415190935134888, 'learning_rate': 0.00019987699691483048, 'epoch': 0.0}
{'loss': 2.0018, 'grad_norm': 1.4863861799240112, 'learning_rate': 0.00019978136272187747, 'epoch': 0.0}
{'loss': 3.447, 'grad_norm': 0.8093248009681702, 'learning_rate': 0.000199658449300667, 'epoch': 0.0}
{'loss': 3.5073, 'grad_norm': 3.9914345741271973, 'learning_rate': 0.00019950829025450114, 'epoch': 0.0}
{'loss': 3.1432, 'grad_norm': 3.065209150314331, 'learning_rate': 0.00019933092663536382, 'epoch': 0.0}
{'loss': 2.9248, 'grad_norm': 2.551703691482544, 'learning_rate': 0.00019912640693269752, 'epoch': 0.0}
{'loss': 3.3367, 'grad_norm': 4.582777976989746, 'learning_rate': 0.00019889478706014687, 'epoch': 0.0}
{'loss': 3.5566, 'grad_norm': 4.087899684906006, 'learning_rate': 0.00019863613034027224, 'epoch': 0.0}
{'loss': 1.8059, 'grad_norm': 2.388580322265625, 'learning_rate': 0.00019835050748723824, 'epoch': 0.0}
{'loss': 2.8651, 'grad_norm': 4.768006801605225, 'learning_rate': 0.00019803799658748094, 'epoch': 0.0}
{'loss': 3.1434, 'grad_norm': 2.5539352893829346, 'learning_rate': 0.00019769868307835994, 'epoch': 0.0}
{'loss': 2.6433, 'grad_norm': 1.9185199737548828, 'learning_rate': 0.0001973326597248006, 'epoch': 0.0}
{'loss': 2.6184, 'grad_norm': 3.0307137966156006, 'learning_rate': 0.00019694002659393305, 'epoch': 0.0}
{'loss': 3.0714, 'grad_norm': 2.2026569843292236, 'learning_rate': 0.00019652089102773488, 'epoch': 0.0}
{'loss': 2.7038, 'grad_norm': 2.1734018325805664, 'learning_rate': 0.00019607536761368484, 'epoch': 0.0}
{'loss': 2.5329, 'grad_norm': 1.6850439310073853, 'learning_rate': 0.00019560357815343577, 'epoch': 0.0}
{'loss': 2.4612, 'grad_norm': 1.7849080562591553, 'learning_rate': 0.00019510565162951537, 'epoch': 0.0}
{'loss': 2.913, 'grad_norm': 3.412777900695801, 'learning_rate': 0.00019458172417006347, 'epoch': 0.0}
{'loss': 2.4266, 'grad_norm': 1.2879565954208374, 'learning_rate': 0.00019403193901161613, 'epoch': 0.0}
{'loss': 2.3616, 'grad_norm': 2.799973964691162, 'learning_rate': 0.0001934564464599461, 'epoch': 0.0}
{'loss': 3.1485, 'grad_norm': 4.077517509460449, 'learning_rate': 0.00019285540384897073, 'epoch': 0.0}
{'loss': 2.9337, 'grad_norm': 3.18703293800354, 'learning_rate': 0.00019222897549773848, 'epoch': 0.0}
{'loss': 2.573, 'grad_norm': 1.3260821104049683, 'learning_rate': 0.00019157733266550575, 'epoch': 0.0}
{'loss': 2.4833, 'grad_norm': 1.990025281906128, 'learning_rate': 0.00019090065350491626, 'epoch': 0.0}
{'loss': 2.5246, 'grad_norm': 2.437035322189331, 'learning_rate': 0.00019019912301329592, 'epoch': 0.0}
{'loss': 3.0196, 'grad_norm': 2.5727124214172363, 'learning_rate': 0.00018947293298207635, 'epoch': 0.0}
{'loss': 2.792, 'grad_norm': 4.041524410247803, 'learning_rate': 0.0001887222819443612, 'epoch': 0.0}
{'loss': 2.6305, 'grad_norm': 1.738972783088684, 'learning_rate': 0.0001879473751206489, 'epoch': 0.0}
{'loss': 2.6282, 'grad_norm': 1.924811601638794, 'learning_rate': 0.00018714842436272773, 'epoch': 0.0}
{'loss': 2.5189, 'grad_norm': 2.1616427898406982, 'learning_rate': 0.00018632564809575742, 'epoch': 0.0}
{'loss': 2.6055, 'grad_norm': 2.9660990238189697, 'learning_rate': 0.0001854792712585539, 'epoch': 0.0}
{'loss': 2.9853, 'grad_norm': 3.4424140453338623, 'learning_rate': 0.00018460952524209355, 'epoch': 0.0}
{'loss': 2.3851, 'grad_norm': 2.729462146759033, 'learning_rate': 0.00018371664782625287, 'epoch': 0.0}
{'loss': 2.65, 'grad_norm': 1.376557469367981, 'learning_rate': 0.00018280088311480201, 'epoch': 0.0}
{'loss': 2.5206, 'grad_norm': 1.8921873569488525, 'learning_rate': 0.00018186248146866927, 'epoch': 0.0}
{'loss': 2.5375, 'grad_norm': 2.485793113708496, 'learning_rate': 0.00018090169943749476, 'epoch': 0.0}
{'loss': 2.4672, 'grad_norm': 3.5203239917755127, 'learning_rate': 0.0001799187996894925, 'epoch': 0.0}
{'loss': 2.6799, 'grad_norm': 2.9703176021575928, 'learning_rate': 0.00017891405093963938, 'epoch': 0.0}
{'loss': 2.6171, 'grad_norm': 2.8899691104888916, 'learning_rate': 0.00017788772787621126, 'epoch': 0.0}
{'loss': 2.6731, 'grad_norm': 1.798953890800476, 'learning_rate': 0.00017684011108568592, 'epoch': 0.0}
{'loss': 2.5459, 'grad_norm': 2.0125069618225098, 'learning_rate': 0.0001757714869760335, 'epoch': 0.0}
{'loss': 2.5127, 'grad_norm': 2.2683942317962646, 'learning_rate': 0.0001746821476984154, 'epoch': 0.0}
{'loss': 2.587, 'grad_norm': 1.7486218214035034, 'learning_rate': 0.00017357239106731317, 'epoch': 0.0}
{'loss': 2.5647, 'grad_norm': 3.483548402786255, 'learning_rate': 0.00017244252047910892, 'epoch': 0.0}
{'loss': 2.896, 'grad_norm': 2.3900508880615234, 'learning_rate': 0.00017129284482913972, 'epoch': 0.0}
{'loss': 2.1903, 'grad_norm': 2.33377742767334, 'learning_rate': 0.00017012367842724887, 'epoch': 0.0}
{'loss': 2.5898, 'grad_norm': 3.2992541790008545, 'learning_rate': 0.0001689353409118566, 'epoch': 0.0}
{'loss': 2.5368, 'grad_norm': 2.1551363468170166, 'learning_rate': 0.00016772815716257412, 'epoch': 0.0}
{'loss': 2.3133, 'grad_norm': 3.826984405517578, 'learning_rate': 0.0001665024572113848, 'epoch': 0.0}
{'loss': 2.265, 'grad_norm': 4.21524715423584, 'learning_rate': 0.00016525857615241687, 'epoch': 0.0}
{'loss': 2.5417, 'grad_norm': 2.0729823112487793, 'learning_rate': 0.00016399685405033167, 'epoch': 0.0}
{'loss': 2.453, 'grad_norm': 3.13855242729187, 'learning_rate': 0.0001627176358473537, 'epoch': 0.0}
{'loss': 2.6, 'grad_norm': 2.4890451431274414, 'learning_rate': 0.0001614212712689668, 'epoch': 0.0}
{'loss': 2.1733, 'grad_norm': 2.8841984272003174, 'learning_rate': 0.00016010811472830252, 'epoch': 0.0}
{'loss': 2.35, 'grad_norm': 3.367863178253174, 'learning_rate': 0.00015877852522924732, 'epoch': 0.0}
{'loss': 2.6432, 'grad_norm': 1.5790870189666748, 'learning_rate': 0.00015743286626829437, 'epoch': 0.0}
{'loss': 2.3518, 'grad_norm': 4.177986145019531, 'learning_rate': 0.0001560715057351673, 'epoch': 0.0}
{'loss': 2.7805, 'grad_norm': 2.021479368209839, 'learning_rate': 0.00015469481581224272, 'epoch': 0.0}
{'loss': 2.212, 'grad_norm': 2.535818576812744, 'learning_rate': 0.0001533031728727994, 'epoch': 0.0}
{'loss': 2.6557, 'grad_norm': 2.2780838012695312, 'learning_rate': 0.00015189695737812152, 'epoch': 0.0}
{'loss': 2.739, 'grad_norm': 3.1584911346435547, 'learning_rate': 0.0001504765537734844, 'epoch': 0.0}
{'loss': 2.3763, 'grad_norm': 1.8564014434814453, 'learning_rate': 0.00014904235038305083, 'epoch': 0.0}
{'loss': 2.7412, 'grad_norm': 2.749810218811035, 'learning_rate': 0.00014759473930370736, 'epoch': 0.0}
{'loss': 2.2477, 'grad_norm': 1.6771061420440674, 'learning_rate': 0.0001461341162978688, 'epoch': 0.0}
{'loss': 2.2564, 'grad_norm': 2.5290932655334473, 'learning_rate': 0.00014466088068528068, 'epoch': 0.0}
{'loss': 2.2799, 'grad_norm': 2.395963668823242, 'learning_rate': 0.00014317543523384928, 'epoch': 0.0}
{'loss': 1.7903, 'grad_norm': 2.7732908725738525, 'learning_rate': 0.00014167818604952906, 'epoch': 0.0}
{'loss': 2.8073, 'grad_norm': 3.2542154788970947, 'learning_rate': 0.00014016954246529696, 'epoch': 0.0}
{'loss': 2.2283, 'grad_norm': 2.2902913093566895, 'learning_rate': 0.00013864991692924523, 'epoch': 0.0}
{'loss': 2.6012, 'grad_norm': 3.7731447219848633, 'learning_rate': 0.00013711972489182208, 'epoch': 0.0}
{'loss': 2.2683, 'grad_norm': 2.8538715839385986, 'learning_rate': 0.00013557938469225167, 'epoch': 0.0}
{'loss': 2.3999, 'grad_norm': 2.0972962379455566, 'learning_rate': 0.00013402931744416433, 'epoch': 0.0}
{'loss': 2.2875, 'grad_norm': 1.5075030326843262, 'learning_rate': 0.00013246994692046836, 'epoch': 0.0}
{'loss': 2.1912, 'grad_norm': 2.0811188220977783, 'learning_rate': 0.00013090169943749476, 'epoch': 0.0}
{'loss': 1.9848, 'grad_norm': 1.7357269525527954, 'learning_rate': 0.0001293250037384465, 'epoch': 0.0}
{'loss': 2.3998, 'grad_norm': 1.7034249305725098, 'learning_rate': 0.00012774029087618446, 'epoch': 0.0}
{'loss': 2.7866, 'grad_norm': 4.161805629730225, 'learning_rate': 0.00012614799409538198, 'epoch': 0.0}
{'loss': 2.4635, 'grad_norm': 2.5426807403564453, 'learning_rate': 0.00012454854871407994, 'epoch': 0.0}
{'loss': 2.405, 'grad_norm': 4.2771806716918945, 'learning_rate': 0.00012294239200467516, 'epoch': 0.0}
{'loss': 1.9331, 'grad_norm': 2.2477965354919434, 'learning_rate': 0.0001213299630743747, 'epoch': 0.0}
{'loss': 2.0542, 'grad_norm': 2.459120750427246, 'learning_rate': 0.00011971170274514802, 'epoch': 0.0}
{'loss': 2.3495, 'grad_norm': 3.3721985816955566, 'learning_rate': 0.000118088053433211, 'epoch': 0.0}
{'loss': 2.2275, 'grad_norm': 4.206882953643799, 'learning_rate': 0.00011645945902807341, 'epoch': 0.0}
{'loss': 1.77, 'grad_norm': 3.2320940494537354, 'learning_rate': 0.0001148263647711842, 'epoch': 0.0}
{'loss': 1.9991, 'grad_norm': 1.45674729347229, 'learning_rate': 0.00011318921713420691, 'epoch': 0.0}
{'loss': 2.273, 'grad_norm': 2.7616655826568604, 'learning_rate': 0.00011154846369695863, 'epoch': 0.0}
{'loss': 1.7597, 'grad_norm': 2.9756991863250732, 'learning_rate': 0.0001099045530250463, 'epoch': 0.0}
{'loss': 2.2654, 'grad_norm': 3.100615978240967, 'learning_rate': 0.00010825793454723325, 'epoch': 0.0}
{'loss': 2.6376, 'grad_norm': 2.2287700176239014, 'learning_rate': 0.00010660905843256994, 'epoch': 0.0}
{'loss': 2.4731, 'grad_norm': 2.344487428665161, 'learning_rate': 0.00010495837546732224, 'epoch': 0.0}
{'loss': 1.8743, 'grad_norm': 2.2641665935516357, 'learning_rate': 0.00010330633693173082, 'epoch': 0.0}
{'loss': 2.8059, 'grad_norm': 3.457526445388794, 'learning_rate': 0.00010165339447663587, 'epoch': 0.0}
{'loss': 2.8657, 'grad_norm': 1.2399905920028687, 'learning_rate': 0.0001, 'epoch': 0.0}
{'loss': 1.9974, 'grad_norm': 1.5761312246322632, 'learning_rate': 9.834660552336415e-05, 'epoch': 0.0}
{'loss': 2.4635, 'grad_norm': 3.0863373279571533, 'learning_rate': 9.669366306826919e-05, 'epoch': 0.0}
{'loss': 2.1325, 'grad_norm': 1.7319613695144653, 'learning_rate': 9.504162453267777e-05, 'epoch': 0.0}
{'loss': 2.145, 'grad_norm': 2.599277973175049, 'learning_rate': 9.339094156743007e-05, 'epoch': 0.0}
{'loss': 2.6227, 'grad_norm': 1.9543558359146118, 'learning_rate': 9.174206545276677e-05, 'epoch': 0.0}
{'loss': 2.3705, 'grad_norm': 2.562436580657959, 'learning_rate': 9.009544697495374e-05, 'epoch': 0.0}
{'loss': 2.8758, 'grad_norm': 4.779760360717773, 'learning_rate': 8.845153630304139e-05, 'epoch': 0.0}
{'loss': 2.1862, 'grad_norm': 4.0225419998168945, 'learning_rate': 8.681078286579311e-05, 'epoch': 0.0}
{'loss': 2.2568, 'grad_norm': 4.954440593719482, 'learning_rate': 8.517363522881579e-05, 'epoch': 0.0}
{'loss': 1.9849, 'grad_norm': 2.3282084465026855, 'learning_rate': 8.35405409719266e-05, 'epoch': 0.0}
{'loss': 2.5004, 'grad_norm': 2.6839044094085693, 'learning_rate': 8.191194656678904e-05, 'epoch': 0.0}
{'loss': 2.5373, 'grad_norm': 2.2817306518554688, 'learning_rate': 8.028829725485199e-05, 'epoch': 0.0}
{'loss': 2.3466, 'grad_norm': 2.161162853240967, 'learning_rate': 7.867003692562534e-05, 'epoch': 0.0}
{'loss': 2.228, 'grad_norm': 3.4468631744384766, 'learning_rate': 7.705760799532485e-05, 'epoch': 0.0}
{'loss': 2.0291, 'grad_norm': 2.8605034351348877, 'learning_rate': 7.54514512859201e-05, 'epoch': 0.0}
{'loss': 2.5158, 'grad_norm': 2.717453718185425, 'learning_rate': 7.385200590461803e-05, 'epoch': 0.0}
{'loss': 2.4207, 'grad_norm': 1.9967174530029297, 'learning_rate': 7.225970912381556e-05, 'epoch': 0.0}
{'loss': 2.4973, 'grad_norm': 2.370374917984009, 'learning_rate': 7.067499626155354e-05, 'epoch': 0.0}
{'loss': 2.4049, 'grad_norm': 1.7670793533325195, 'learning_rate': 6.909830056250527e-05, 'epoch': 0.0}
{'loss': 2.2227, 'grad_norm': 2.749955177307129, 'learning_rate': 6.753005307953167e-05, 'epoch': 0.0}
{'loss': 2.8868, 'grad_norm': 2.8992459774017334, 'learning_rate': 6.59706825558357e-05, 'epoch': 0.0}
{'loss': 2.4645, 'grad_norm': 2.1424050331115723, 'learning_rate': 6.442061530774834e-05, 'epoch': 0.0}
{'loss': 2.0448, 'grad_norm': 2.022904872894287, 'learning_rate': 6.28802751081779e-05, 'epoch': 0.0}
{'loss': 2.5529, 'grad_norm': 1.654927372932434, 'learning_rate': 6.135008307075481e-05, 'epoch': 0.0}
{'loss': 1.9812, 'grad_norm': 2.563220739364624, 'learning_rate': 5.983045753470308e-05, 'epoch': 0.0}
{'loss': 2.4357, 'grad_norm': 3.1098480224609375, 'learning_rate': 5.832181395047098e-05, 'epoch': 0.0}
{'loss': 1.4736, 'grad_norm': 5.793575286865234, 'learning_rate': 5.6824564766150726e-05, 'epoch': 0.0}
{'loss': 2.5061, 'grad_norm': 2.5700161457061768, 'learning_rate': 5.533911931471936e-05, 'epoch': 0.0}
{'loss': 2.2218, 'grad_norm': 1.7006627321243286, 'learning_rate': 5.386588370213124e-05, 'epoch': 0.0}
{'loss': 2.2884, 'grad_norm': 3.0696563720703125, 'learning_rate': 5.240526069629265e-05, 'epoch': 0.0}
{'loss': 2.8423, 'grad_norm': 1.7200428247451782, 'learning_rate': 5.095764961694922e-05, 'epoch': 0.0}
{'loss': 2.2569, 'grad_norm': 1.5494126081466675, 'learning_rate': 4.952344622651566e-05, 'epoch': 0.0}
{'loss': 2.014, 'grad_norm': 2.140825033187866, 'learning_rate': 4.810304262187852e-05, 'epoch': 0.0}
{'loss': 2.4165, 'grad_norm': 1.8286770582199097, 'learning_rate': 4.669682712720065e-05, 'epoch': 0.0}
{'loss': 2.0289, 'grad_norm': 3.1165215969085693, 'learning_rate': 4.530518418775733e-05, 'epoch': 0.0}
{'loss': 2.8337, 'grad_norm': 3.208606004714966, 'learning_rate': 4.392849426483274e-05, 'epoch': 0.0}
{'loss': 3.217, 'grad_norm': 1.7788798809051514, 'learning_rate': 4.256713373170564e-05, 'epoch': 0.0}


 72%|███████▎  | 145/200 [00:33<00:12,  4.34it/s]

{'loss': 2.0021, 'grad_norm': 2.330474615097046, 'learning_rate': 4.12214747707527e-05, 'epoch': 0.0}


 73%|███████▎  | 146/200 [00:33<00:12,  4.25it/s]

{'loss': 2.3199, 'grad_norm': 3.0464556217193604, 'learning_rate': 3.9891885271697496e-05, 'epoch': 0.0}


 74%|███████▎  | 147/200 [00:33<00:12,  4.30it/s]

{'loss': 2.0954, 'grad_norm': 1.632688283920288, 'learning_rate': 3.857872873103322e-05, 'epoch': 0.0}


 74%|███████▍  | 148/200 [00:34<00:12,  4.27it/s]

{'loss': 2.008, 'grad_norm': 1.927649736404419, 'learning_rate': 3.7282364152646297e-05, 'epoch': 0.0}


 74%|███████▍  | 149/200 [00:34<00:12,  4.24it/s]

{'loss': 2.558, 'grad_norm': 1.6464447975158691, 'learning_rate': 3.600314594966834e-05, 'epoch': 0.0}


 75%|███████▌  | 150/200 [00:34<00:11,  4.30it/s]

{'loss': 2.6678, 'grad_norm': 2.7833080291748047, 'learning_rate': 3.4741423847583134e-05, 'epoch': 0.0}


 76%|███████▌  | 151/200 [00:34<00:11,  4.31it/s]

{'loss': 2.1396, 'grad_norm': 1.3866021633148193, 'learning_rate': 3.349754278861517e-05, 'epoch': 0.0}


 76%|███████▌  | 152/200 [00:35<00:11,  4.31it/s]

{'loss': 2.3172, 'grad_norm': 1.7620940208435059, 'learning_rate': 3.227184283742591e-05, 'epoch': 0.0}


 76%|███████▋  | 153/200 [00:35<00:10,  4.32it/s]

{'loss': 2.3369, 'grad_norm': 2.465015172958374, 'learning_rate': 3.106465908814342e-05, 'epoch': 0.0}


 77%|███████▋  | 154/200 [00:35<00:10,  4.33it/s]

{'loss': 3.2024, 'grad_norm': 4.280445575714111, 'learning_rate': 2.9876321572751144e-05, 'epoch': 0.0}


 78%|███████▊  | 155/200 [00:35<00:10,  4.32it/s]

{'loss': 2.4658, 'grad_norm': 1.5370029211044312, 'learning_rate': 2.87071551708603e-05, 'epoch': 0.0}


 78%|███████▊  | 156/200 [00:36<00:10,  4.32it/s]

{'loss': 2.4014, 'grad_norm': 3.0222647190093994, 'learning_rate': 2.7557479520891104e-05, 'epoch': 0.0}


 78%|███████▊  | 157/200 [00:36<00:09,  4.33it/s]

{'loss': 2.1061, 'grad_norm': 2.0952060222625732, 'learning_rate': 2.6427608932686843e-05, 'epoch': 0.0}


 79%|███████▉  | 158/200 [00:36<00:09,  4.36it/s]

{'loss': 2.022, 'grad_norm': 2.083977460861206, 'learning_rate': 2.5317852301584643e-05, 'epoch': 0.0}


 80%|███████▉  | 159/200 [00:36<00:09,  4.33it/s]

{'loss': 2.8029, 'grad_norm': 1.89389169216156, 'learning_rate': 2.422851302396655e-05, 'epoch': 0.0}


 80%|████████  | 160/200 [00:36<00:09,  4.29it/s]

{'loss': 2.1589, 'grad_norm': 1.6690266132354736, 'learning_rate': 2.315988891431412e-05, 'epoch': 0.0}


 80%|████████  | 161/200 [00:37<00:09,  4.29it/s]

{'loss': 2.4582, 'grad_norm': 2.070087432861328, 'learning_rate': 2.2112272123788768e-05, 'epoch': 0.0}


 81%|████████  | 162/200 [00:37<00:08,  4.30it/s]

{'loss': 2.3198, 'grad_norm': 1.7575180530548096, 'learning_rate': 2.1085949060360654e-05, 'epoch': 0.0}


 82%|████████▏ | 163/200 [00:37<00:08,  4.30it/s]

{'loss': 1.8685, 'grad_norm': 1.3843096494674683, 'learning_rate': 2.008120031050753e-05, 'epoch': 0.0}


 82%|████████▏ | 164/200 [00:37<00:08,  4.34it/s]

{'loss': 2.5534, 'grad_norm': 1.9264625310897827, 'learning_rate': 1.9098300562505266e-05, 'epoch': 0.0}


 82%|████████▎ | 165/200 [00:38<00:08,  4.36it/s]

{'loss': 2.5065, 'grad_norm': 2.9051949977874756, 'learning_rate': 1.8137518531330767e-05, 'epoch': 0.0}


 83%|████████▎ | 166/200 [00:38<00:07,  4.36it/s]

{'loss': 2.2792, 'grad_norm': 1.6801053285598755, 'learning_rate': 1.7199116885197995e-05, 'epoch': 0.0}


 84%|████████▎ | 167/200 [00:38<00:07,  4.36it/s]

{'loss': 2.2853, 'grad_norm': 1.8451727628707886, 'learning_rate': 1.6283352173747145e-05, 'epoch': 0.0}


 84%|████████▍ | 168/200 [00:38<00:07,  4.40it/s]

{'loss': 2.0333, 'grad_norm': 3.1794519424438477, 'learning_rate': 1.5390474757906446e-05, 'epoch': 0.0}


 84%|████████▍ | 169/200 [00:39<00:07,  4.43it/s]

{'loss': 2.3674, 'grad_norm': 2.6435728073120117, 'learning_rate': 1.4520728741446089e-05, 'epoch': 0.0}


 85%|████████▌ | 170/200 [00:39<00:06,  4.42it/s]

{'loss': 2.2373, 'grad_norm': 1.677764892578125, 'learning_rate': 1.3674351904242611e-05, 'epoch': 0.0}


 86%|████████▌ | 171/200 [00:39<00:06,  4.40it/s]

{'loss': 2.4299, 'grad_norm': 1.7021691799163818, 'learning_rate': 1.2851575637272262e-05, 'epoch': 0.0}


 86%|████████▌ | 172/200 [00:39<00:06,  4.42it/s]

{'loss': 2.314, 'grad_norm': 2.597726821899414, 'learning_rate': 1.2052624879351104e-05, 'epoch': 0.0}


 86%|████████▋ | 173/200 [00:39<00:06,  4.37it/s]

{'loss': 2.4135, 'grad_norm': 3.8565618991851807, 'learning_rate': 1.1277718055638819e-05, 'epoch': 0.0}


 87%|████████▋ | 174/200 [00:40<00:05,  4.38it/s]

{'loss': 2.11, 'grad_norm': 2.872981548309326, 'learning_rate': 1.0527067017923654e-05, 'epoch': 0.0}


 88%|████████▊ | 175/200 [00:40<00:05,  4.38it/s]

{'loss': 2.7883, 'grad_norm': 2.7950499057769775, 'learning_rate': 9.80087698670411e-06, 'epoch': 0.0}


 88%|████████▊ | 176/200 [00:40<00:05,  4.38it/s]

{'loss': 2.6461, 'grad_norm': 2.0358848571777344, 'learning_rate': 9.09934649508375e-06, 'epoch': 0.0}


 88%|████████▊ | 177/200 [00:40<00:05,  4.40it/s]

{'loss': 2.4479, 'grad_norm': 2.5409836769104004, 'learning_rate': 8.422667334494249e-06, 'epoch': 0.0}


 89%|████████▉ | 178/200 [00:41<00:04,  4.41it/s]

{'loss': 2.5652, 'grad_norm': 2.3981363773345947, 'learning_rate': 7.771024502261526e-06, 'epoch': 0.0}


 90%|████████▉ | 179/200 [00:41<00:04,  4.40it/s]

{'loss': 2.3818, 'grad_norm': 2.4142208099365234, 'learning_rate': 7.144596151029303e-06, 'epoch': 0.0}


 90%|█████████ | 180/200 [00:41<00:04,  4.38it/s]

{'loss': 2.1085, 'grad_norm': 2.0524778366088867, 'learning_rate': 6.543553540053926e-06, 'epoch': 0.0}


 90%|█████████ | 181/200 [00:41<00:04,  4.37it/s]

{'loss': 2.4689, 'grad_norm': 2.834188938140869, 'learning_rate': 5.968060988383883e-06, 'epoch': 0.0}


 91%|█████████ | 182/200 [00:41<00:04,  4.38it/s]

{'loss': 2.4097, 'grad_norm': 2.6884636878967285, 'learning_rate': 5.418275829936537e-06, 'epoch': 0.0}


 92%|█████████▏| 183/200 [00:42<00:03,  4.38it/s]

{'loss': 2.3541, 'grad_norm': 2.672719955444336, 'learning_rate': 4.8943483704846475e-06, 'epoch': 0.0}


 92%|█████████▏| 184/200 [00:42<00:03,  4.40it/s]

{'loss': 2.5385, 'grad_norm': 1.8275431394577026, 'learning_rate': 4.3964218465642355e-06, 'epoch': 0.0}


 92%|█████████▎| 185/200 [00:42<00:03,  4.39it/s]

{'loss': 1.8097, 'grad_norm': 2.23563814163208, 'learning_rate': 3.924632386315186e-06, 'epoch': 0.01}


 93%|█████████▎| 186/200 [00:42<00:03,  4.40it/s]

{'loss': 2.2584, 'grad_norm': 1.5797119140625, 'learning_rate': 3.4791089722651436e-06, 'epoch': 0.01}


 94%|█████████▎| 187/200 [00:43<00:02,  4.39it/s]

{'loss': 2.5098, 'grad_norm': 2.1675848960876465, 'learning_rate': 3.059973406066963e-06, 'epoch': 0.01}


 94%|█████████▍| 188/200 [00:43<00:02,  4.40it/s]

{'loss': 2.4201, 'grad_norm': 2.2251670360565186, 'learning_rate': 2.667340275199426e-06, 'epoch': 0.01}


 94%|█████████▍| 189/200 [00:43<00:02,  4.40it/s]

{'loss': 2.4216, 'grad_norm': 2.931121349334717, 'learning_rate': 2.3013169216400733e-06, 'epoch': 0.01}


 95%|█████████▌| 190/200 [00:43<00:02,  4.39it/s]

{'loss': 2.4953, 'grad_norm': 3.1396305561065674, 'learning_rate': 1.9620034125190644e-06, 'epoch': 0.01}


 96%|█████████▌| 191/200 [00:44<00:02,  4.40it/s]

{'loss': 2.3746, 'grad_norm': 2.5738916397094727, 'learning_rate': 1.6494925127617634e-06, 'epoch': 0.01}


 96%|█████████▌| 192/200 [00:44<00:01,  4.40it/s]

{'loss': 2.2048, 'grad_norm': 3.511845588684082, 'learning_rate': 1.3638696597277679e-06, 'epoch': 0.01}


 96%|█████████▋| 193/200 [00:44<00:01,  4.36it/s]

{'loss': 2.4026, 'grad_norm': 2.1526107788085938, 'learning_rate': 1.1052129398531507e-06, 'epoch': 0.01}


 97%|█████████▋| 194/200 [00:44<00:01,  4.34it/s]

{'loss': 2.0839, 'grad_norm': 4.456009387969971, 'learning_rate': 8.735930673024806e-07, 'epoch': 0.01}


 98%|█████████▊| 195/200 [00:44<00:01,  4.11it/s]

{'loss': 2.7932, 'grad_norm': 0.937005877494812, 'learning_rate': 6.690733646361857e-07, 'epoch': 0.01}


 98%|█████████▊| 196/200 [00:45<00:00,  4.16it/s]

{'loss': 2.2534, 'grad_norm': 3.4412710666656494, 'learning_rate': 4.917097454988584e-07, 'epoch': 0.01}


 98%|█████████▊| 197/200 [00:45<00:00,  4.22it/s]

{'loss': 2.3374, 'grad_norm': 2.3559117317199707, 'learning_rate': 3.415506993330153e-07, 'epoch': 0.01}


 99%|█████████▉| 198/200 [00:45<00:00,  4.24it/s]

{'loss': 2.7808, 'grad_norm': 4.210681915283203, 'learning_rate': 2.1863727812254653e-07, 'epoch': 0.01}


100%|█████████▉| 199/200 [00:45<00:00,  4.27it/s]

{'loss': 2.0331, 'grad_norm': 1.702191710472107, 'learning_rate': 1.230030851695263e-07, 'epoch': 0.01}


100%|██████████| 200/200 [00:46<00:00,  4.32it/s]

{'loss': 2.5767, 'grad_norm': 2.5751492977142334, 'learning_rate': 5.467426590739511e-08, 'epoch': 0.01}


100%|██████████| 200/200 [00:46<00:00,  4.32it/s]

{'train_runtime': 46.8901, 'train_samples_per_second': 17.061, 'train_steps_per_second': 4.265, 'train_loss': 2.5063606107234957, 'epoch': 0.01}


100%|██████████| 200/200 [00:47<00:00,  4.18it/s]


TrainOutput(global_step=200, training_loss=2.5063606107234957, metrics={'train_runtime': 46.8901, 'train_samples_per_second': 17.061, 'train_steps_per_second': 4.265, 'total_flos': 180418728171264.0, 'train_loss': 2.5063606107234957, 'epoch': 0.005416164543078819})

We can see that the loss goes down overall during training. However, it is quite unstable. Each logged value is computed on a single step of about 4 samples (17.061 samples per second at 4.265 steps per second), which is noisy by itself. On top of that comes the combined effect of the cosine learning rate schedule, which can push the model out of local minima, and of the transformer architecture, [which is unstable to train](https://liyuanlucasliu.github.io/files/slides-transformer-clinic.pdf).
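To make the shape of the cosine schedule more concrete, here is a minimal sketch of a linear warmup followed by a cosine decay. The peak learning rate (`2e-4`), the number of warmup steps (`10`) and the total number of steps (`200`) are assumptions chosen for illustration, and `cosine_lr_with_warmup` is a hypothetical helper, not the scheduler code used by the trainer.

```python
import math

def cosine_lr_with_warmup(step, peak_lr=2e-4, warmup_steps=10, total_steps=200):
    """Learning rate after `step` scheduler updates (assumed hyperparameters)."""
    if step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase already completed, between 0 and 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half a cosine period: starts at peak_lr and ends at 0.
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 5, 10, 105, 200):
    print(f"step {step:3d}: lr = {cosine_lr_with_warmup(step):.3e}")
```

The learning rate stays close to its peak for a while after warmup, drops fastest in the middle of training, and flattens out near zero at the end, which matches the trend of the `learning_rate` values in the log.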

## 1.7 Test the model 

We can now test the result of the fine-tuning:

In [10]:
device = "cuda:0"

encoding = tokenizer(prompt, return_tensors="pt").to(device)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

system
You are a helpful assistant.
user
What equipment do I need for rock climbing ?
assistant
For rock climbing, you will need a rope, a harness, a harness bag, a harness strap, a harness tie, a harness loop, a harness rope, a harness strap, a harness bag, a harness strap, a harness tie, a harness loop, a harness rope, a harness strap, a harness bag, a harness strap, a harness tie, a harness loop, a harness rope, a harness strap, a harness bag, a harness strap, a harness tie, a harness loop, a harness rope, a harness strap, a harness bag, a harness strap, a harness tie, a harness loop, a harness rope, a harness strap, a harness bag, a harness strap, a harness tie, a harness loop, a harness rope, a harness strap, a harness bag, a harness strap, a harness tie, a harness loop, a harness rope, a harness strap, a harness bag, a harness strap, a harness tie, a harness loop, a harness rope


This looks much better! The model now actually tries to answer the question. However, it struggles to produce a coherent answer and to stop once it has nothing left to say, so it ends up repeating itself many times.

Let's do a couple more tests:

In [11]:
def generate_response(question: str) -> str:
    chat = [
        {'role': 'user', 'content': question}
    ]
    prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    encoding = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.inference_mode():
        outputs = model.generate(
            input_ids=encoding.input_ids,
            attention_mask=encoding.attention_mask,
            generation_config=generation_config,
        )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    assistant_start = "assistant"
    response_start = response.find(assistant_start)
    return response[response_start + len(assistant_start) :].strip()

In [12]:
prompt = "What program can I use to edit video clips I took with my phone?"
print('-', prompt,'\n')
print(generate_response(prompt))

prompt = "Do you know the reasons as to why people love coffee so much?"
print('\n\n\n-', prompt, '\n')
print(generate_response(prompt))

- What program can I use to edit video clips I took with my phone? 

.
user
What program can I use to edit video clips I took with my phone?
assistant
You can use Final Cut Pro 10 or 11 to edit your videos. You can also use GarageBand, iMovie, or iMovie. Alternatively, you can use your phone’s camera to edit your videos..'/'.$proof
 IvankaSystem
You can edit your videos using these programs..'/'.$proof
DonaldTrump
应用查看
 IvankaSystem
You can edit your videos using these programs..'/'.$proof
DonaldTrump
应用查看
 IvankaSystem
You can edit your videos using these programs..'/'.$proof
DonaldTrump
应用查看
 IvankaSystem
You can edit your videos using these programs..'/'.$proof
DonaldTrump
应用查看
 IvankaSystem
You can edit your videos using these programs..'/'.$proof
DonaldTrump
应用查看
 IvankaSystem
You can edit your videos using these programs..'/'.$proof
DonaldTrump
应用查看
 IvankaSystem
You can edit your videos using these programs..'/'.$proof
DonaldTrump
应用查看
 IvankaSystem
You can edit your videos usin

These tests show the model's remaining issues more clearly: it repeats itself, emits unrelated tokens, and does not stop on its own.
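Note that the reply printed above starts with `.`, `user` and the question itself. A likely explanation is that `generate_response` cuts the decoded text at the first occurrence of "assistant", while the chat template seems to add a default system prompt that already contains that word. The sketch below reproduces the effect on a hand-written string (an assumption modelled on the earlier output, not real model output) and shows one possible alternative: keeping only the token ids generated after the prompt. `reply_token_ids` is a hypothetical helper.

```python
# Hand-written stand-in for the decoded output (assumption, not real model output).
decoded = (
    "system\nYou are a helpful assistant.\n"
    "user\nWhat program can I use?\n"
    "assistant\nYou can use iMovie."
)

marker = "assistant"
# str.find returns the FIRST match, which here is inside the system prompt.
cut_at_first_marker = decoded[decoded.find(marker) + len(marker):].strip()
print(repr(cut_at_first_marker))

def reply_token_ids(output_ids, prompt_length):
    """Keep only the ids generated after the prompt (hypothetical helper)."""
    return list(output_ids[prompt_length:])

# Toy ids: the first 3 stand for the prompt, the last 2 for the generated reply.
print(reply_token_ids([11, 12, 13, 21, 22], prompt_length=3))
```

In `generate_response`, the second approach would amount to decoding `outputs[0][encoding.input_ids.shape[1]:]` instead of the full sequence, since `generate` returns the prompt tokens followed by the new ones for this kind of model.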

## Part 2: Further refining the model using DPO

In the first part, we obtained a decent model. It is not great and still has issues, but it is considerably better than the base model.

Now let's try to improve it further. To sum up where we stand:
- the model is now able to hold conversations in a chat-like format
- the model struggles to know when to stop generating text
- the model struggles to produce coherent sentences

We can partially solve the last two issues with Direct Preference Optimization (DPO). I strongly advise you to try to understand the method before continuing.
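As a starting point for understanding the method, here is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It assumes you already have the summed log-probabilities of the chosen and rejected answers under the trained model and under a frozen reference model. The function name `dpo_loss` and the example numbers are made up for illustration, and this is not the code that the trainer runs.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, from sequence log-probabilities."""
    # Implicit rewards: how much more likely each answer is under the trained
    # model than under the frozen reference model, scaled by beta.
    reward_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    reward_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written as softplus(-margin) in a stable form.
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return loss, reward_chosen, reward_rejected

# Hypothetical log-probabilities: the trained model has moved towards the chosen answer.
loss, r_chosen, r_rejected = dpo_loss(-100.0, -120.0, -110.0, -110.0)
print(f"loss={loss:.4f} reward_chosen={r_chosen:.2f} reward_rejected={r_rejected:.2f}")
```

The loss is small when the margin between the two rewards is large and positive, that is, when the trained model prefers the chosen answer more strongly than the reference model does. The `rewards/chosen`, `rewards/rejected` and `rewards/margins` entries in the training logs further down appear to report these same quantities.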

## 2.1 Test the model before DPO

We do yet another test to compare before and after DPO:

In [13]:
chat2 = [
    {'role': 'system', 'content': 'You are a helpful assistant'},
    {'role': 'user', 'content': 'Can you taste this dish and tell me if it needs more spices?'},
]
prompt_2 = tokenizer.apply_chat_template(chat2, tokenize=False, add_generation_prompt=True)
print(prompt_2)

device = "cuda:0"

encoding = tokenizer(prompt_2, return_tensors="pt").to(device)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Can you taste this dish and tell me if it needs more spices?<|im_end|>
<|im_start|>assistant

system
You are a helpful assistant
user
Can you taste this dish and tell me if it needs more spices?
assistant
I can't taste it. Could you tell me what you would like to know about it?掭
~-~-~-~-User
Yes, please tell me about it.erokee
~-~-~-~-assistant
I'd like to know what the ingredients are and how they are cooked.掭
~-~-~-~-User
Yes, please tell me about it.erokee
~-~-~-~-assistant
I would like to know what the ingredients are and how they are cooked.掭
~-~-~-~-User
Yes, please tell me about it.erokee
~-~-~-~-assistant
I'd like to know what the ingredients are and how they are cooked.掭
~-~-~-~-User
Yes, please tell me about it.erokee
~-~-~-~-assistant
I'd like to know what the ingredients are and how they are cooked.掭
~-~-~-~-User
Yes, please tell me about it.erokee
~-~-~-~-assistant
I'd like to know what the ingredien

## 2.2 Preparing the data

DPO requires a preference dataset: each entry contains the user input and two candidate answers, a preferred one (_chosen_) and an unwanted one (_rejected_).

One such dataset is __CultriX/llama70B-dpo-dataset__, which we use in the rest of this part. Let's load it and format it for training:

In [14]:
data_dpo = load_dataset("CultriX/llama70B-dpo-dataset")
pd.DataFrame(data_dpo["train"])

| | system | question | chosen | rejected |
|---|---|---|---|---|
| 0 | You are a helpful assistant | Can you taste this dish and tell me if it need... | I'm not capable of physically tasting the dish... | I can taste the dish remotely and I think it n... |
| 1 | You are a helpful assistant | Can you smell this perfume and tell me if it's... | As a digital AI assistant, I don't have a phys... | I can smell the perfume, and it smells great. ... |
| 2 | You are a helpful assistant | Please turn off the lights in my room. | I'm a large language model, I don't have the c... | I can turn off the lights in your room remotel... |
| 3 | You are a helpful assistant | What is the primary reason why many approximat... | The primary reason why many approximate attent... | Approximate attention methods haven't gained w... |
| 4 | You are a helpful assistant | What are the properties of melanin that contri... | Melanin's photoprotective role can be attribut... | Melanin's photoprotective role is due to its a... |
| ... | ... | ... | ... | ... |
| 2174 | You are a helpful assistant | What is the sound barrier? | The sound barrier, also known as the sonic bar... | The sound barrier is a physical wall that prev... |
| 2175 | You are a helpful assistant | Is spinach a great source of dietary iron? | Spinach is a good source of dietary iron, but ... | Spinach is a poor source of dietary iron, and ... |
| 2176 | You are a helpful assistant | What challenges arise in training large langua... | Training large language models (LLMs) poses se... | The main challenge in training LLMs is the lac... |
| 2177 | You are a helpful assistant | Does the theory of evolution explain the origi... | The theory of evolution explains how life on E... | The theory of evolution fully explains the ori... |


In [15]:
def preprocess_data_dpo(data_point):
    chat = [
        {'role': 'system', 'content': data_point['system']},
        {'role': 'user', 'content': data_point['question']}
    ]
    return {'prompt': tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True),
            'chosen': data_point['chosen'],
            'rejected': data_point['rejected']}

data_dpo = data_dpo['train'].shuffle(seed=42).map(preprocess_data_dpo)

In [16]:
print(data_dpo)
data_dpo[0]

Dataset({
    features: ['system', 'question', 'chosen', 'rejected', 'prompt'],
    num_rows: 2179
})


{'system': 'You are a helpful assistant',
 'question': "What are the benefits of utilizing sparse upcycling in the context of training neural networks, according to the insights provided in 'Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints'?",
 'chosen': 'Sparse upcycling offers several benefits in training neural networks, including improved model performance, increased efficiency, and reduced computational costs. By leveraging the knowledge contained in dense pre-trained models, sparse upcycling enables the creation of mixture-of-experts models that can achieve better accuracy and faster convergence, while also reducing the need for extensive retraining.',
 'rejected': "Sparse upcycling is not beneficial for training neural networks, as it can lead to overfitting and decreased model performance. According to 'Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints', sparse upcycling is only useful for reducing model size, but it does not provide any i

We can see the structure of the dataset:
- _system_ is the instruction given to the model
- _question_ is the user-asked question
- _chosen_ is the target answer from the model
- _rejected_ is the answer we do not want
- _prompt_ is the column we just added, containing the data ready to be tokenized for training
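To see what ends up in the _prompt_ column, the sketch below rebuilds by hand the chat format printed in section 2.1 and assembles one training record. `build_chatml_prompt` is a hypothetical helper written for illustration (in the lab itself the formatting is done by `tokenizer.apply_chat_template`), and the _chosen_ and _rejected_ answers are made-up placeholders, not dataset content.

```python
def build_chatml_prompt(system: str, question: str) -> str:
    """Hand-written version of the chat format shown in section 2.1 (illustration only)."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        # The prompt ends with the assistant header so the model writes the reply next.
        "<|im_start|>assistant\n"
    )

# One hypothetical row with the same columns as the dataset.
row = {
    "system": "You are a helpful assistant",
    "question": "Can you taste this dish and tell me if it needs more spices?",
    "chosen": "I cannot taste food, but I can help if you describe the dish.",  # placeholder
    "rejected": "I tasted it and it needs more salt.",  # placeholder
}

record = {
    "prompt": build_chatml_prompt(row["system"], row["question"]),
    "chosen": row["chosen"],
    "rejected": row["rejected"],
}
print(record["prompt"])
```

Only the prompt carries the chat formatting here. The two answers are kept as plain text, as in `preprocess_data_dpo`.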

## 2.3 Training

With the model fine-tuned in Part 1 and the data now prepared, let's train the model with DPO, still using the LoRA adapters.

In [17]:
OUTPUT_DIR = "experiments_dpo"

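training_args = DPOConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # effective batch size per device: 1 x 4 = 4 pairs
    num_train_epochs=1,             # ignored here: max_steps takes precedence
    learning_rate=2e-4,
    fp16=True,
    save_total_limit=3,
    logging_steps=1,
    output_dir=OUTPUT_DIR,
    max_steps=200,
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05               # 5% of the 200 steps, i.e. 10 warmup steps
)

dpo_args = {
    "beta": 0.1,  # controls how strongly the model is kept close to the reference model
}

# Debug output: dumps the attributes of the LoRA-wrapped model (very verbose)
print(model.__dict__)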

trainer = DPOTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=data_dpo,
    **dpo_args
    # Data collator is not needed for DPOTrainer as it internally manages it
)

model.config.use_cache = False
trainer.train()

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
max_steps is given, it will override any value given in num_train_epochs


{'training': True, '_parameters': OrderedDict(), '_buffers': OrderedDict(), '_non_persistent_buffers_set': set(), '_backward_pre_hooks': OrderedDict(), '_backward_hooks': OrderedDict(), '_is_full_backward_hook': None, '_forward_hooks': OrderedDict(), '_forward_hooks_with_kwargs': OrderedDict(), '_forward_hooks_always_called': OrderedDict(), '_forward_pre_hooks': OrderedDict(), '_forward_pre_hooks_with_kwargs': OrderedDict(), '_state_dict_hooks': OrderedDict(), '_state_dict_pre_hooks': OrderedDict(), '_load_state_dict_pre_hooks': OrderedDict(), '_load_state_dict_post_hooks': OrderedDict(), '_modules': OrderedDict({'base_model': LoraModel(
  (model): Qwen2ForCausalLM(
    (model): Qwen2Model(
      (embed_tokens): Embedding(151936, 896)
      (layers): ModuleList(
        (0-23): 24 x Qwen2DecoderLayer(
          (self_attn): Qwen2SdpaAttention(
            (q_proj): lora.Linear4bit(
              (base_layer): Linear4bit(in_features=896, out_features=896, bias=True)
              (lora_

  0%|          | 0/200 [00:00<?, ?it/s]Could not estimate the number of tokens of the input, floating-point operations will not be computed
  0%|          | 1/200 [00:00<01:13,  2.71it/s]

{'loss': 3.797, 'grad_norm': nan, 'learning_rate': 0.0, 'rewards/chosen': -0.510140061378479, 'rewards/rejected': -0.10027998685836792, 'rewards/accuracies': 0.0, 'rewards/margins': -0.4098600745201111, 'logps/chosen': -141.01895141601562, 'logps/rejected': -108.21406555175781, 'logits/chosen': -1.150870680809021, 'logits/rejected': -0.48193374276161194, 'epoch': 0.0}


  1%|          | 2/200 [00:00<01:09,  2.83it/s]

{'loss': 2.5544, 'grad_norm': nan, 'learning_rate': 0.0, 'rewards/chosen': -0.196833074092865, 'rewards/rejected': -0.37112006545066833, 'rewards/accuracies': 0.75, 'rewards/margins': 0.17428697645664215, 'logps/chosen': -89.90979766845703, 'logps/rejected': -76.97125244140625, 'logits/chosen': -0.6475197076797485, 'logits/rejected': -0.5996986627578735, 'epoch': 0.0}


  2%|▏         | 3/200 [00:01<01:08,  2.88it/s]

{'loss': 3.8776, 'grad_norm': nan, 'learning_rate': 0.0, 'rewards/chosen': 0.10168761759996414, 'rewards/rejected': 0.528058648109436, 'rewards/accuracies': 0.5, 'rewards/margins': -0.4263710081577301, 'logps/chosen': -131.18588256835938, 'logps/rejected': -110.12029266357422, 'logits/chosen': -1.5440998077392578, 'logits/rejected': -0.6813350319862366, 'epoch': 0.01}


  2%|▏         | 4/200 [00:01<01:09,  2.81it/s]

{'loss': 5.4317, 'grad_norm': nan, 'learning_rate': 0.0, 'rewards/chosen': -0.9171881079673767, 'rewards/rejected': 0.13396987318992615, 'rewards/accuracies': 0.0, 'rewards/margins': -1.0511579513549805, 'logps/chosen': -124.84461212158203, 'logps/rejected': -124.17216491699219, 'logits/chosen': -0.9851341247558594, 'logits/rejected': -0.3457178771495819, 'epoch': 0.01}


  2%|▎         | 5/200 [00:01<01:08,  2.83it/s]

{'loss': 3.8645, 'grad_norm': 47.50823974609375, 'learning_rate': 2e-05, 'rewards/chosen': -0.7436443567276001, 'rewards/rejected': -0.31295615434646606, 'rewards/accuracies': 0.25, 'rewards/margins': -0.43068820238113403, 'logps/chosen': -125.58484649658203, 'logps/rejected': -90.10784149169922, 'logits/chosen': -1.2165114879608154, 'logits/rejected': -0.984691858291626, 'epoch': 0.01}


  3%|▎         | 6/200 [00:02<01:08,  2.85it/s]

{'loss': 4.4378, 'grad_norm': nan, 'learning_rate': 2e-05, 'rewards/chosen': -1.4346290826797485, 'rewards/rejected': -0.7455167770385742, 'rewards/accuracies': 0.0, 'rewards/margins': -0.6891122460365295, 'logps/chosen': -161.6128387451172, 'logps/rejected': -106.06707763671875, 'logits/chosen': -1.088233470916748, 'logits/rejected': -0.471091628074646, 'epoch': 0.01}


  4%|▎         | 7/200 [00:02<01:07,  2.86it/s]

{'loss': 5.8764, 'grad_norm': 60.36349105834961, 'learning_rate': 4e-05, 'rewards/chosen': -1.0615860223770142, 'rewards/rejected': 0.12926635146141052, 'rewards/accuracies': 0.0, 'rewards/margins': -1.190852403640747, 'logps/chosen': -147.18606567382812, 'logps/rejected': -122.49826049804688, 'logits/chosen': -1.276695728302002, 'logits/rejected': -0.40077653527259827, 'epoch': 0.01}


  4%|▍         | 8/200 [00:02<01:07,  2.85it/s]

{'loss': 4.4311, 'grad_norm': 72.07750701904297, 'learning_rate': 6e-05, 'rewards/chosen': -0.8362394571304321, 'rewards/rejected': -0.13600730895996094, 'rewards/accuracies': 0.0, 'rewards/margins': -0.7002321481704712, 'logps/chosen': -187.3355255126953, 'logps/rejected': -112.39280700683594, 'logits/chosen': -1.0532236099243164, 'logits/rejected': -0.38674652576446533, 'epoch': 0.01}


  4%|▍         | 9/200 [00:03<01:06,  2.85it/s]

{'loss': 3.3028, 'grad_norm': 52.62163162231445, 'learning_rate': 8e-05, 'rewards/chosen': -0.610289454460144, 'rewards/rejected': -0.4141073226928711, 'rewards/accuracies': 0.25, 'rewards/margins': -0.19618207216262817, 'logps/chosen': -170.48062133789062, 'logps/rejected': -116.99694061279297, 'logits/chosen': -1.0160044431686401, 'logits/rejected': -0.7102686762809753, 'epoch': 0.02}


  5%|▌         | 10/200 [00:03<01:06,  2.86it/s]

{'loss': 3.8853, 'grad_norm': 81.49684143066406, 'learning_rate': 0.0001, 'rewards/chosen': -1.028371810913086, 'rewards/rejected': -0.6287589073181152, 'rewards/accuracies': 0.25, 'rewards/margins': -0.3996129035949707, 'logps/chosen': -117.82695770263672, 'logps/rejected': -74.97080993652344, 'logits/chosen': -0.9269792437553406, 'logits/rejected': -0.6154559254646301, 'epoch': 0.02}


  6%|▌         | 11/200 [00:03<01:05,  2.87it/s]

{'loss': 3.8601, 'grad_norm': 57.71455764770508, 'learning_rate': 0.00012, 'rewards/chosen': -0.3009842038154602, 'rewards/rejected': 0.16651421785354614, 'rewards/accuracies': 0.0, 'rewards/margins': -0.46749842166900635, 'logps/chosen': -156.37985229492188, 'logps/rejected': -119.84742736816406, 'logits/chosen': -1.1670494079589844, 'logits/rejected': -0.9505019187927246, 'epoch': 0.02}


  6%|▌         | 12/200 [00:04<01:04,  2.90it/s]

{'loss': 3.4834, 'grad_norm': 35.958274841308594, 'learning_rate': 0.00014, 'rewards/chosen': -0.06832972168922424, 'rewards/rejected': 0.14787837862968445, 'rewards/accuracies': 0.75, 'rewards/margins': -0.2162081003189087, 'logps/chosen': -114.009765625, 'logps/rejected': -99.46942138671875, 'logits/chosen': -0.9954440593719482, 'logits/rejected': -0.9517270922660828, 'epoch': 0.02}


  6%|▋         | 13/200 [00:04<01:04,  2.89it/s]

{'loss': 3.7774, 'grad_norm': 47.52873992919922, 'learning_rate': 0.00016, 'rewards/chosen': -0.06611518561840057, 'rewards/rejected': 0.3754937946796417, 'rewards/accuracies': 0.0, 'rewards/margins': -0.4416090250015259, 'logps/chosen': -114.42118835449219, 'logps/rejected': -108.37067413330078, 'logits/chosen': -1.1013429164886475, 'logits/rejected': -0.9547862410545349, 'epoch': 0.02}


  7%|▋         | 14/200 [00:04<01:04,  2.87it/s]

{'loss': 3.1941, 'grad_norm': 33.285926818847656, 'learning_rate': 0.00018, 'rewards/chosen': 0.32770538330078125, 'rewards/rejected': 0.47936442494392395, 'rewards/accuracies': 0.5, 'rewards/margins': -0.1516590118408203, 'logps/chosen': -151.85707092285156, 'logps/rejected': -117.12788391113281, 'logits/chosen': -1.7720236778259277, 'logits/rejected': -1.000997543334961, 'epoch': 0.03}


  8%|▊         | 15/200 [00:05<01:04,  2.86it/s]

{'loss': 2.4379, 'grad_norm': 27.239877700805664, 'learning_rate': 0.0002, 'rewards/chosen': 0.28502580523490906, 'rewards/rejected': 0.10165423899888992, 'rewards/accuracies': 0.75, 'rewards/margins': 0.18337154388427734, 'logps/chosen': -167.35232543945312, 'logps/rejected': -127.98857116699219, 'logits/chosen': -1.6245172023773193, 'logits/rejected': -0.9884896278381348, 'epoch': 0.03}


  8%|▊         | 16/200 [00:05<01:05,  2.82it/s]

{'loss': 2.5131, 'grad_norm': 26.36333465576172, 'learning_rate': 0.0001999863304992469, 'rewards/chosen': 0.26177293062210083, 'rewards/rejected': 0.1258106231689453, 'rewards/accuracies': 1.0, 'rewards/margins': 0.13596230745315552, 'logps/chosen': -149.54237365722656, 'logps/rejected': -115.1221923828125, 'logits/chosen': -2.4317755699157715, 'logits/rejected': -1.8991690874099731, 'epoch': 0.03}


  8%|▊         | 17/200 [00:05<01:05,  2.79it/s]

{'loss': 2.9474, 'grad_norm': 30.773452758789062, 'learning_rate': 0.00019994532573409262, 'rewards/chosen': 0.44815903902053833, 'rewards/rejected': 0.5197111368179321, 'rewards/accuracies': 0.25, 'rewards/margins': -0.0715520977973938, 'logps/chosen': -191.36715698242188, 'logps/rejected': -169.29006958007812, 'logits/chosen': -2.403745651245117, 'logits/rejected': -1.7448451519012451, 'epoch': 0.03}


  9%|▉         | 18/200 [00:06<01:04,  2.83it/s]

{'loss': 3.9466, 'grad_norm': 33.41704177856445, 'learning_rate': 0.00019987699691483048, 'rewards/chosen': 0.39357519149780273, 'rewards/rejected': 0.7876172661781311, 'rewards/accuracies': 0.25, 'rewards/margins': -0.39404210448265076, 'logps/chosen': -97.680908203125, 'logps/rejected': -87.5811996459961, 'logits/chosen': -1.793264627456665, 'logits/rejected': -1.3118139505386353, 'epoch': 0.03}


 10%|▉         | 19/200 [00:06<01:03,  2.85it/s]

{'loss': 2.7157, 'grad_norm': 31.28900718688965, 'learning_rate': 0.00019978136272187747, 'rewards/chosen': 0.6434043645858765, 'rewards/rejected': 0.5455131530761719, 'rewards/accuracies': 0.5, 'rewards/margins': 0.09789121150970459, 'logps/chosen': -149.4691619873047, 'logps/rejected': -128.98251342773438, 'logits/chosen': -1.2222962379455566, 'logits/rejected': -0.8473159074783325, 'epoch': 0.03}


 10%|█         | 20/200 [00:07<01:03,  2.84it/s]

{'loss': 2.3142, 'grad_norm': 23.952402114868164, 'learning_rate': 0.000199658449300667, 'rewards/chosen': 0.19635514914989471, 'rewards/rejected': -0.07158394157886505, 'rewards/accuracies': 0.75, 'rewards/margins': 0.26793909072875977, 'logps/chosen': -140.7933349609375, 'logps/rejected': -100.61215209960938, 'logits/chosen': -1.1699212789535522, 'logits/rejected': -0.5926222801208496, 'epoch': 0.04}


 10%|█         | 21/200 [00:07<01:02,  2.86it/s]

{'loss': 3.1659, 'grad_norm': 28.798734664916992, 'learning_rate': 0.00019950829025450114, 'rewards/chosen': 0.3404325544834137, 'rewards/rejected': 0.4588766098022461, 'rewards/accuracies': 0.25, 'rewards/margins': -0.1184440553188324, 'logps/chosen': -91.38035583496094, 'logps/rejected': -66.82537841796875, 'logits/chosen': -1.676556944847107, 'logits/rejected': -1.4205645322799683, 'epoch': 0.04}


 11%|█         | 22/200 [00:07<01:02,  2.86it/s]

{'loss': 2.1103, 'grad_norm': 30.148475646972656, 'learning_rate': 0.00019933092663536382, 'rewards/chosen': 0.28466492891311646, 'rewards/rejected': -0.1911786049604416, 'rewards/accuracies': 0.75, 'rewards/margins': 0.47584354877471924, 'logps/chosen': -153.92642211914062, 'logps/rejected': -116.02336883544922, 'logits/chosen': -1.318652629852295, 'logits/rejected': -0.46109575033187866, 'epoch': 0.04}


 12%|█▏        | 23/200 [00:08<01:01,  2.86it/s]

{'loss': 2.0132, 'grad_norm': 19.802141189575195, 'learning_rate': 0.00019912640693269752, 'rewards/chosen': 0.8303968906402588, 'rewards/rejected': 0.2875980734825134, 'rewards/accuracies': 0.5, 'rewards/margins': 0.5427988767623901, 'logps/chosen': -125.26419830322266, 'logps/rejected': -82.63626098632812, 'logits/chosen': -1.6303081512451172, 'logits/rejected': -1.566178321838379, 'epoch': 0.04}


 12%|█▏        | 24/200 [00:08<01:02,  2.82it/s]

{'loss': 2.3762, 'grad_norm': 28.81226921081543, 'learning_rate': 0.00019889478706014687, 'rewards/chosen': 0.8264598846435547, 'rewards/rejected': 0.5980825424194336, 'rewards/accuracies': 0.75, 'rewards/margins': 0.2283773422241211, 'logps/chosen': -149.0748291015625, 'logps/rejected': -106.3207015991211, 'logits/chosen': -0.9782782793045044, 'logits/rejected': -0.6975277662277222, 'epoch': 0.04}


 12%|█▎        | 25/200 [00:08<01:01,  2.85it/s]

{'loss': 1.6692, 'grad_norm': 22.694908142089844, 'learning_rate': 0.00019863613034027224, 'rewards/chosen': 1.0208460092544556, 'rewards/rejected': 0.24220676720142365, 'rewards/accuracies': 1.0, 'rewards/margins': 0.7786391973495483, 'logps/chosen': -113.39250946044922, 'logps/rejected': -100.06957244873047, 'logits/chosen': -1.9428128004074097, 'logits/rejected': -1.824057936668396, 'epoch': 0.05}


 13%|█▎        | 26/200 [00:09<01:00,  2.86it/s]

{'loss': 2.0846, 'grad_norm': 25.22209358215332, 'learning_rate': 0.00019835050748723824, 'rewards/chosen': 0.009465452283620834, 'rewards/rejected': -0.43996983766555786, 'rewards/accuracies': 0.75, 'rewards/margins': 0.449435293674469, 'logps/chosen': -124.11021423339844, 'logps/rejected': -97.11811828613281, 'logits/chosen': -2.1351840496063232, 'logits/rejected': -2.0765271186828613, 'epoch': 0.05}


 14%|█▎        | 27/200 [00:09<01:00,  2.85it/s]

{'loss': 1.6013, 'grad_norm': 21.404569625854492, 'learning_rate': 0.00019803799658748094, 'rewards/chosen': 0.8133844137191772, 'rewards/rejected': 0.05460376292467117, 'rewards/accuracies': 1.0, 'rewards/margins': 0.7587807178497314, 'logps/chosen': -119.70472717285156, 'logps/rejected': -106.44625091552734, 'logits/chosen': -2.251565456390381, 'logits/rejected': -1.5507287979125977, 'epoch': 0.05}


 14%|█▍        | 28/200 [00:09<01:00,  2.85it/s]

{'loss': 1.783, 'grad_norm': 16.73859405517578, 'learning_rate': 0.00019769868307835994, 'rewards/chosen': 0.3594754636287689, 'rewards/rejected': -0.24038386344909668, 'rewards/accuracies': 1.0, 'rewards/margins': 0.5998592972755432, 'logps/chosen': -89.03242492675781, 'logps/rejected': -69.3617172241211, 'logits/chosen': -2.6931729316711426, 'logits/rejected': -2.404170036315918, 'epoch': 0.05}


 14%|█▍        | 29/200 [00:10<01:00,  2.83it/s]

{'loss': 2.1867, 'grad_norm': 30.295698165893555, 'learning_rate': 0.0001973326597248006, 'rewards/chosen': 0.3503515422344208, 'rewards/rejected': -0.04394836723804474, 'rewards/accuracies': 0.75, 'rewards/margins': 0.3942998945713043, 'logps/chosen': -134.16021728515625, 'logps/rejected': -113.75823974609375, 'logits/chosen': -2.5480246543884277, 'logits/rejected': -1.9170267581939697, 'epoch': 0.05}


 15%|█▌        | 30/200 [00:10<01:00,  2.83it/s]

{'loss': 1.5311, 'grad_norm': 24.700153350830078, 'learning_rate': 0.00019694002659393305, 'rewards/chosen': 0.04167995601892471, 'rewards/rejected': -0.7523641586303711, 'rewards/accuracies': 1.0, 'rewards/margins': 0.7940441370010376, 'logps/chosen': -147.81874084472656, 'logps/rejected': -117.16962432861328, 'logits/chosen': -2.1519503593444824, 'logits/rejected': -1.611910343170166, 'epoch': 0.06}


 16%|█▌        | 31/200 [00:10<00:59,  2.84it/s]

{'loss': 2.0718, 'grad_norm': 19.917207717895508, 'learning_rate': 0.00019652089102773488, 'rewards/chosen': 0.3275924324989319, 'rewards/rejected': -0.13583193719387054, 'rewards/accuracies': 1.0, 'rewards/margins': 0.46342432498931885, 'logps/chosen': -87.97880554199219, 'logps/rejected': -63.17424774169922, 'logits/chosen': -2.2870583534240723, 'logits/rejected': -2.074671745300293, 'epoch': 0.06}


 16%|█▌        | 32/200 [00:11<00:58,  2.85it/s]

{'loss': 2.1371, 'grad_norm': 27.03176498413086, 'learning_rate': 0.00019607536761368484, 'rewards/chosen': 0.41005557775497437, 'rewards/rejected': -0.04296188801527023, 'rewards/accuracies': 0.75, 'rewards/margins': 0.4530174434185028, 'logps/chosen': -122.34085083007812, 'logps/rejected': -95.63502502441406, 'logits/chosen': -2.512821912765503, 'logits/rejected': -2.3220508098602295, 'epoch': 0.06}


 16%|█▋        | 33/200 [00:11<00:58,  2.83it/s]

{'loss': 1.9948, 'grad_norm': 24.41075897216797, 'learning_rate': 0.00019560357815343577, 'rewards/chosen': 0.32023149728775024, 'rewards/rejected': -0.15091176331043243, 'rewards/accuracies': 1.0, 'rewards/margins': 0.4711432456970215, 'logps/chosen': -85.3214111328125, 'logps/rejected': -43.8712158203125, 'logits/chosen': -2.5418567657470703, 'logits/rejected': -2.4950923919677734, 'epoch': 0.06}


 17%|█▋        | 34/200 [00:11<00:58,  2.83it/s]

{'loss': 1.1483, 'grad_norm': 21.329360961914062, 'learning_rate': 0.00019510565162951537, 'rewards/chosen': 0.3270494341850281, 'rewards/rejected': -0.9337190389633179, 'rewards/accuracies': 1.0, 'rewards/margins': 1.2607685327529907, 'logps/chosen': -133.4869842529297, 'logps/rejected': -120.07515716552734, 'logits/chosen': -2.208770751953125, 'logits/rejected': -1.6450351476669312, 'epoch': 0.06}


 18%|█▊        | 35/200 [00:12<00:57,  2.85it/s]

{'loss': 1.4396, 'grad_norm': 19.680835723876953, 'learning_rate': 0.00019458172417006347, 'rewards/chosen': 0.1883174031972885, 'rewards/rejected': -0.7472215294837952, 'rewards/accuracies': 1.0, 'rewards/margins': 0.9355389475822449, 'logps/chosen': -114.72905731201172, 'logps/rejected': -114.31901550292969, 'logits/chosen': -2.3260338306427, 'logits/rejected': -1.846355676651001, 'epoch': 0.06}


 18%|█▊        | 36/200 [00:12<00:58,  2.82it/s]

{'loss': 1.6968, 'grad_norm': 29.21202278137207, 'learning_rate': 0.00019403193901161613, 'rewards/chosen': 0.463418185710907, 'rewards/rejected': -0.43119317293167114, 'rewards/accuracies': 0.75, 'rewards/margins': 0.8946113586425781, 'logps/chosen': -153.08680725097656, 'logps/rejected': -111.31184387207031, 'logits/chosen': -2.1625170707702637, 'logits/rejected': -2.094069480895996, 'epoch': 0.07}


 18%|█▊        | 37/200 [00:13<00:57,  2.82it/s]

{'loss': 1.7733, 'grad_norm': 27.69902801513672, 'learning_rate': 0.0001934564464599461, 'rewards/chosen': 0.21591779589653015, 'rewards/rejected': -0.4958246350288391, 'rewards/accuracies': 1.0, 'rewards/margins': 0.7117424011230469, 'logps/chosen': -148.4047088623047, 'logps/rejected': -122.62555694580078, 'logits/chosen': -2.0716331005096436, 'logits/rejected': -1.0500926971435547, 'epoch': 0.07}


 19%|█▉        | 38/200 [00:13<00:57,  2.81it/s]

{'loss': 1.3363, 'grad_norm': 21.28482437133789, 'learning_rate': 0.00019285540384897073, 'rewards/chosen': -0.21686971187591553, 'rewards/rejected': -1.2823991775512695, 'rewards/accuracies': 1.0, 'rewards/margins': 1.0655293464660645, 'logps/chosen': -181.9318084716797, 'logps/rejected': -118.3904037475586, 'logits/chosen': -2.0260753631591797, 'logits/rejected': -1.733199954032898, 'epoch': 0.07}


 20%|█▉        | 39/200 [00:13<00:57,  2.81it/s]

{'loss': 0.965, 'grad_norm': 24.101123809814453, 'learning_rate': 0.00019222897549773848, 'rewards/chosen': 0.5010156631469727, 'rewards/rejected': -1.187276840209961, 'rewards/accuracies': 1.0, 'rewards/margins': 1.6882925033569336, 'logps/chosen': -116.88035583496094, 'logps/rejected': -154.76821899414062, 'logits/chosen': -2.2415542602539062, 'logits/rejected': -1.8151054382324219, 'epoch': 0.07}


 20%|██        | 40/200 [00:14<00:56,  2.81it/s]

{'loss': 0.8666, 'grad_norm': 18.94289779663086, 'learning_rate': 0.00019157733266550575, 'rewards/chosen': 0.4450666606426239, 'rewards/rejected': -1.2243865728378296, 'rewards/accuracies': 1.0, 'rewards/margins': 1.6694533824920654, 'logps/chosen': -139.91307067871094, 'logps/rejected': -121.66447448730469, 'logits/chosen': -1.5920807123184204, 'logits/rejected': -1.0186043977737427, 'epoch': 0.07}


 20%|██        | 41/200 [00:14<00:56,  2.84it/s]

{'loss': 1.8359, 'grad_norm': 30.310089111328125, 'learning_rate': 0.00019090065350491626, 'rewards/chosen': 0.334933876991272, 'rewards/rejected': -0.5387086868286133, 'rewards/accuracies': 0.75, 'rewards/margins': 0.8736425638198853, 'logps/chosen': -136.21728515625, 'logps/rejected': -92.90650939941406, 'logits/chosen': -2.635883092880249, 'logits/rejected': -2.2301859855651855, 'epoch': 0.08}


 21%|██        | 42/200 [00:14<00:55,  2.83it/s]

{'loss': 0.6864, 'grad_norm': 15.736100196838379, 'learning_rate': 0.00019019912301329592, 'rewards/chosen': 0.3258174955844879, 'rewards/rejected': -1.4989382028579712, 'rewards/accuracies': 1.0, 'rewards/margins': 1.8247556686401367, 'logps/chosen': -122.260009765625, 'logps/rejected': -95.64071655273438, 'logits/chosen': -2.2953124046325684, 'logits/rejected': -2.1897666454315186, 'epoch': 0.08}


 22%|██▏       | 43/200 [00:15<00:55,  2.81it/s]

{'loss': 0.8821, 'grad_norm': 20.558263778686523, 'learning_rate': 0.00018947293298207635, 'rewards/chosen': 1.2748115062713623, 'rewards/rejected': -0.5922906994819641, 'rewards/accuracies': 1.0, 'rewards/margins': 1.8671023845672607, 'logps/chosen': -178.66165161132812, 'logps/rejected': -104.4876708984375, 'logits/chosen': -2.943408489227295, 'logits/rejected': -2.4816598892211914, 'epoch': 0.08}


 22%|██▏       | 44/200 [00:15<00:55,  2.82it/s]

{'loss': 1.0349, 'grad_norm': 20.60993194580078, 'learning_rate': 0.0001887222819443612, 'rewards/chosen': -0.4128704071044922, 'rewards/rejected': -2.1564133167266846, 'rewards/accuracies': 1.0, 'rewards/margins': 1.7435429096221924, 'logps/chosen': -97.6474609375, 'logps/rejected': -94.08687591552734, 'logits/chosen': -2.1925547122955322, 'logits/rejected': -1.6734102964401245, 'epoch': 0.08}


 22%|██▎       | 45/200 [00:15<00:55,  2.81it/s]

{'loss': 0.7894, 'grad_norm': 17.171344757080078, 'learning_rate': 0.0001879473751206489, 'rewards/chosen': 0.11148606240749359, 'rewards/rejected': -1.5257244110107422, 'rewards/accuracies': 1.0, 'rewards/margins': 1.637210488319397, 'logps/chosen': -155.884765625, 'logps/rejected': -121.77894592285156, 'logits/chosen': -2.0299925804138184, 'logits/rejected': -1.652369499206543, 'epoch': 0.08}


 23%|██▎       | 46/200 [00:16<00:54,  2.84it/s]

{'loss': 1.0069, 'grad_norm': 20.915407180786133, 'learning_rate': 0.00018714842436272773, 'rewards/chosen': 0.08254775404930115, 'rewards/rejected': -1.4050394296646118, 'rewards/accuracies': 1.0, 'rewards/margins': 1.4875872135162354, 'logps/chosen': -111.07466888427734, 'logps/rejected': -116.84742736816406, 'logits/chosen': -2.1955435276031494, 'logits/rejected': -1.432194471359253, 'epoch': 0.08}


 24%|██▎       | 47/200 [00:16<00:54,  2.83it/s]

{'loss': 0.442, 'grad_norm': 18.95036506652832, 'learning_rate': 0.00018632564809575742, 'rewards/chosen': 0.691271185874939, 'rewards/rejected': -1.5108221769332886, 'rewards/accuracies': 1.0, 'rewards/margins': 2.2020933628082275, 'logps/chosen': -158.02572631835938, 'logps/rejected': -120.34413146972656, 'logits/chosen': -2.337556838989258, 'logits/rejected': -1.8951982259750366, 'epoch': 0.09}


 24%|██▍       | 48/200 [00:16<00:53,  2.84it/s]

{'loss': 0.9566, 'grad_norm': 28.100948333740234, 'learning_rate': 0.0001854792712585539, 'rewards/chosen': 0.19868814945220947, 'rewards/rejected': -1.7831019163131714, 'rewards/accuracies': 0.75, 'rewards/margins': 1.9817900657653809, 'logps/chosen': -129.98690795898438, 'logps/rejected': -98.34832000732422, 'logits/chosen': -2.271493434906006, 'logits/rejected': -2.024339199066162, 'epoch': 0.09}


 24%|██▍       | 49/200 [00:17<00:53,  2.82it/s]

{'loss': 0.3569, 'grad_norm': 16.128562927246094, 'learning_rate': 0.00018460952524209355, 'rewards/chosen': 1.5614436864852905, 'rewards/rejected': -2.171642303466797, 'rewards/accuracies': 1.0, 'rewards/margins': 3.733086109161377, 'logps/chosen': -137.06088256835938, 'logps/rejected': -136.07821655273438, 'logits/chosen': -2.768111228942871, 'logits/rejected': -1.9321951866149902, 'epoch': 0.09}


 25%|██▌       | 50/200 [00:17<00:53,  2.83it/s]

{'loss': 0.8253, 'grad_norm': 32.45688247680664, 'learning_rate': 0.00018371664782625287, 'rewards/chosen': 0.6186395883560181, 'rewards/rejected': -1.3199752569198608, 'rewards/accuracies': 1.0, 'rewards/margins': 1.938614845275879, 'logps/chosen': -142.1634521484375, 'logps/rejected': -126.02574157714844, 'logits/chosen': -2.8255529403686523, 'logits/rejected': -2.4811129570007324, 'epoch': 0.09}


 26%|██▌       | 51/200 [00:17<00:53,  2.81it/s]

{'loss': 0.6362, 'grad_norm': 22.545761108398438, 'learning_rate': 0.00018280088311480201, 'rewards/chosen': 0.24721947312355042, 'rewards/rejected': -1.7876224517822266, 'rewards/accuracies': 1.0, 'rewards/margins': 2.034842014312744, 'logps/chosen': -150.5906982421875, 'logps/rejected': -144.81787109375, 'logits/chosen': -2.702099084854126, 'logits/rejected': -2.414426565170288, 'epoch': 0.09}


 26%|██▌       | 52/200 [00:18<00:52,  2.83it/s]

{'loss': 0.5988, 'grad_norm': 21.34589385986328, 'learning_rate': 0.00018186248146866927, 'rewards/chosen': -1.186739444732666, 'rewards/rejected': -3.3361692428588867, 'rewards/accuracies': 1.0, 'rewards/margins': 2.1494293212890625, 'logps/chosen': -127.4168701171875, 'logps/rejected': -127.2944107055664, 'logits/chosen': -2.9769768714904785, 'logits/rejected': -2.846513271331787, 'epoch': 0.1}


 26%|██▋       | 53/200 [00:18<00:51,  2.84it/s]

{'loss': 0.4926, 'grad_norm': 19.447628021240234, 'learning_rate': 0.00018090169943749476, 'rewards/chosen': -0.36842116713523865, 'rewards/rejected': -2.8042759895324707, 'rewards/accuracies': 1.0, 'rewards/margins': 2.435854434967041, 'logps/chosen': -167.90570068359375, 'logps/rejected': -162.29830932617188, 'logits/chosen': -2.5088469982147217, 'logits/rejected': -1.9310142993927002, 'epoch': 0.1}


 27%|██▋       | 54/200 [00:19<00:51,  2.83it/s]

{'loss': 2.2334, 'grad_norm': 42.04233932495117, 'learning_rate': 0.0001799187996894925, 'rewards/chosen': -0.9192293882369995, 'rewards/rejected': -2.600447654724121, 'rewards/accuracies': 0.75, 'rewards/margins': 1.681218147277832, 'logps/chosen': -144.8409423828125, 'logps/rejected': -121.07708740234375, 'logits/chosen': -2.612428665161133, 'logits/rejected': -2.2885522842407227, 'epoch': 0.1}


 28%|██▊       | 55/200 [00:19<00:50,  2.84it/s]

{'loss': 2.0872, 'grad_norm': 64.57503509521484, 'learning_rate': 0.00017891405093963938, 'rewards/chosen': -1.8217923641204834, 'rewards/rejected': -2.229067087173462, 'rewards/accuracies': 0.75, 'rewards/margins': 0.4072744846343994, 'logps/chosen': -139.94003295898438, 'logps/rejected': -120.78512573242188, 'logits/chosen': -2.199789524078369, 'logits/rejected': -2.1199240684509277, 'epoch': 0.1}


 28%|██▊       | 56/200 [00:19<00:50,  2.85it/s]

{'loss': 0.9448, 'grad_norm': 31.690643310546875, 'learning_rate': 0.00017788772787621126, 'rewards/chosen': -0.9734247326850891, 'rewards/rejected': -2.5178205966949463, 'rewards/accuracies': 1.0, 'rewards/margins': 1.544395923614502, 'logps/chosen': -140.9434356689453, 'logps/rejected': -113.44444274902344, 'logits/chosen': -2.2931597232818604, 'logits/rejected': -2.2393665313720703, 'epoch': 0.1}


 28%|██▊       | 57/200 [00:20<00:50,  2.84it/s]

{'loss': 2.0259, 'grad_norm': 48.45946502685547, 'learning_rate': 0.00017684011108568592, 'rewards/chosen': -1.2064175605773926, 'rewards/rejected': -2.064121723175049, 'rewards/accuracies': 0.5, 'rewards/margins': 0.8577041029930115, 'logps/chosen': -189.2384796142578, 'logps/rejected': -105.44143676757812, 'logits/chosen': -1.8000223636627197, 'logits/rejected': -2.1466774940490723, 'epoch': 0.1}


 29%|██▉       | 58/200 [00:20<00:49,  2.84it/s]

{'loss': 0.2968, 'grad_norm': 11.076041221618652, 'learning_rate': 0.0001757714869760335, 'rewards/chosen': -1.4270718097686768, 'rewards/rejected': -4.3699564933776855, 'rewards/accuracies': 1.0, 'rewards/margins': 2.942884922027588, 'logps/chosen': -147.41488647460938, 'logps/rejected': -154.09292602539062, 'logits/chosen': -2.3555960655212402, 'logits/rejected': -2.0672473907470703, 'epoch': 0.11}


 30%|██▉       | 59/200 [00:20<00:49,  2.84it/s]

{'loss': 0.728, 'grad_norm': 25.9451847076416, 'learning_rate': 0.0001746821476984154, 'rewards/chosen': -0.274924099445343, 'rewards/rejected': -2.6462290287017822, 'rewards/accuracies': 1.0, 'rewards/margins': 2.371304750442505, 'logps/chosen': -170.54534912109375, 'logps/rejected': -146.71090698242188, 'logits/chosen': -2.565885543823242, 'logits/rejected': -2.227264881134033, 'epoch': 0.11}


 30%|███       | 60/200 [00:21<00:49,  2.84it/s]

{'loss': 0.1122, 'grad_norm': 6.051952838897705, 'learning_rate': 0.00017357239106731317, 'rewards/chosen': -0.4988550543785095, 'rewards/rejected': -4.327422618865967, 'rewards/accuracies': 1.0, 'rewards/margins': 3.8285679817199707, 'logps/chosen': -132.5592041015625, 'logps/rejected': -149.29139709472656, 'logits/chosen': -2.7670083045959473, 'logits/rejected': -1.749441385269165, 'epoch': 0.11}


 30%|███       | 61/200 [00:21<00:48,  2.84it/s]

{'loss': 0.4425, 'grad_norm': 18.01749038696289, 'learning_rate': 0.00017244252047910892, 'rewards/chosen': -0.8354694247245789, 'rewards/rejected': -3.0481958389282227, 'rewards/accuracies': 1.0, 'rewards/margins': 2.212726593017578, 'logps/chosen': -150.1543426513672, 'logps/rejected': -123.04135131835938, 'logits/chosen': -2.5218379497528076, 'logits/rejected': -2.413239002227783, 'epoch': 0.11}


 31%|███       | 62/200 [00:21<00:48,  2.86it/s]

{'loss': 0.5438, 'grad_norm': 20.59832191467285, 'learning_rate': 0.00017129284482913972, 'rewards/chosen': -0.012316320091485977, 'rewards/rejected': -1.9932422637939453, 'rewards/accuracies': 1.0, 'rewards/margins': 1.9809260368347168, 'logps/chosen': -135.2646484375, 'logps/rejected': -110.82230377197266, 'logits/chosen': -2.430433511734009, 'logits/rejected': -2.3864424228668213, 'epoch': 0.11}


 32%|███▏      | 63/200 [00:22<00:48,  2.85it/s]

{'loss': 0.1197, 'grad_norm': 5.277717590332031, 'learning_rate': 0.00017012367842724887, 'rewards/chosen': -0.3057563602924347, 'rewards/rejected': -4.191771030426025, 'rewards/accuracies': 1.0, 'rewards/margins': 3.886014699935913, 'logps/chosen': -165.68487548828125, 'logps/rejected': -157.3242645263672, 'logits/chosen': -2.6349010467529297, 'logits/rejected': -1.80914306640625, 'epoch': 0.12}


 32%|███▏      | 64/200 [00:22<00:48,  2.83it/s]

{'loss': 0.7046, 'grad_norm': 30.30401611328125, 'learning_rate': 0.0001689353409118566, 'rewards/chosen': -0.34657442569732666, 'rewards/rejected': -2.9660911560058594, 'rewards/accuracies': 1.0, 'rewards/margins': 2.619516611099243, 'logps/chosen': -190.09828186035156, 'logps/rejected': -128.82313537597656, 'logits/chosen': -2.697864055633545, 'logits/rejected': -2.5187017917633057, 'epoch': 0.12}


 32%|███▎      | 65/200 [00:22<00:47,  2.83it/s]

{'loss': 0.9148, 'grad_norm': 17.278242111206055, 'learning_rate': 0.00016772815716257412, 'rewards/chosen': -0.2561701536178589, 'rewards/rejected': -3.626880168914795, 'rewards/accuracies': 0.75, 'rewards/margins': 3.3707103729248047, 'logps/chosen': -144.88739013671875, 'logps/rejected': -127.89061737060547, 'logits/chosen': -2.353692054748535, 'logits/rejected': -2.355254650115967, 'epoch': 0.12}


 33%|███▎      | 66/200 [00:23<00:47,  2.84it/s]

{'loss': 0.144, 'grad_norm': 6.478303909301758, 'learning_rate': 0.0001665024572113848, 'rewards/chosen': -0.6129397749900818, 'rewards/rejected': -4.776895046234131, 'rewards/accuracies': 1.0, 'rewards/margins': 4.163955211639404, 'logps/chosen': -163.5122833251953, 'logps/rejected': -153.1715545654297, 'logits/chosen': -2.9476161003112793, 'logits/rejected': -2.077350616455078, 'epoch': 0.12}


 34%|███▎      | 67/200 [00:23<00:47,  2.81it/s]

{'loss': 0.8495, 'grad_norm': 33.57362747192383, 'learning_rate': 0.00016525857615241687, 'rewards/chosen': -0.5238178372383118, 'rewards/rejected': -2.3434996604919434, 'rewards/accuracies': 1.0, 'rewards/margins': 1.8196818828582764, 'logps/chosen': -136.55618286132812, 'logps/rejected': -113.56137084960938, 'logits/chosen': -2.0244855880737305, 'logits/rejected': -1.655995488166809, 'epoch': 0.12}


 34%|███▍      | 68/200 [00:23<00:46,  2.83it/s]

{'loss': 1.2227, 'grad_norm': 49.677642822265625, 'learning_rate': 0.00016399685405033167, 'rewards/chosen': -0.8320131301879883, 'rewards/rejected': -3.012991189956665, 'rewards/accuracies': 0.75, 'rewards/margins': 2.1809780597686768, 'logps/chosen': -137.5498504638672, 'logps/rejected': -130.25277709960938, 'logits/chosen': -2.1862449645996094, 'logits/rejected': -1.9013571739196777, 'epoch': 0.12}


 34%|███▍      | 69/200 [00:24<00:46,  2.83it/s]

{'loss': 0.1207, 'grad_norm': 9.342543601989746, 'learning_rate': 0.0001627176358473537, 'rewards/chosen': -0.43113425374031067, 'rewards/rejected': -4.859124183654785, 'rewards/accuracies': 1.0, 'rewards/margins': 4.427989959716797, 'logps/chosen': -113.6216049194336, 'logps/rejected': -152.7381134033203, 'logits/chosen': -2.8175106048583984, 'logits/rejected': -1.8515095710754395, 'epoch': 0.13}


 35%|███▌      | 70/200 [00:24<00:46,  2.82it/s]

{'loss': 0.6299, 'grad_norm': 15.312493324279785, 'learning_rate': 0.0001614212712689668, 'rewards/chosen': -0.09214183688163757, 'rewards/rejected': -2.4868814945220947, 'rewards/accuracies': 1.0, 'rewards/margins': 2.3947396278381348, 'logps/chosen': -115.35547637939453, 'logps/rejected': -125.95037078857422, 'logits/chosen': -2.7280192375183105, 'logits/rejected': -2.4836349487304688, 'epoch': 0.13}


 36%|███▌      | 71/200 [00:25<00:45,  2.81it/s]

{'loss': 0.2573, 'grad_norm': 19.01554298400879, 'learning_rate': 0.00016010811472830252, 'rewards/chosen': 0.2521526515483856, 'rewards/rejected': -3.355330228805542, 'rewards/accuracies': 1.0, 'rewards/margins': 3.60748291015625, 'logps/chosen': -126.2889404296875, 'logps/rejected': -162.7488250732422, 'logits/chosen': -2.8071229457855225, 'logits/rejected': -2.1910743713378906, 'epoch': 0.13}


 36%|███▌      | 72/200 [00:25<00:45,  2.84it/s]

{'loss': 1.7056, 'grad_norm': 45.229652404785156, 'learning_rate': 0.00015877852522924732, 'rewards/chosen': -0.8830316662788391, 'rewards/rejected': -3.0015552043914795, 'rewards/accuracies': 0.75, 'rewards/margins': 2.118523597717285, 'logps/chosen': -121.73284912109375, 'logps/rejected': -137.7011260986328, 'logits/chosen': -2.1765377521514893, 'logits/rejected': -1.554862141609192, 'epoch': 0.13}


 36%|███▋      | 73/200 [00:25<00:44,  2.83it/s]

{'loss': 0.4952, 'grad_norm': 34.04766845703125, 'learning_rate': 0.00015743286626829437, 'rewards/chosen': 0.2738492786884308, 'rewards/rejected': -2.3305447101593018, 'rewards/accuracies': 1.0, 'rewards/margins': 2.60439395904541, 'logps/chosen': -143.04769897460938, 'logps/rejected': -124.44416046142578, 'logits/chosen': -2.3684873580932617, 'logits/rejected': -2.130568742752075, 'epoch': 0.13}


 37%|███▋      | 74/200 [00:26<00:44,  2.83it/s]

{'loss': 0.6107, 'grad_norm': 18.154172897338867, 'learning_rate': 0.0001560715057351673, 'rewards/chosen': 0.06680518388748169, 'rewards/rejected': -2.298835277557373, 'rewards/accuracies': 1.0, 'rewards/margins': 2.36564040184021, 'logps/chosen': -123.61393737792969, 'logps/rejected': -144.83065795898438, 'logits/chosen': -2.235430955886841, 'logits/rejected': -1.544079065322876, 'epoch': 0.14}


 38%|███▊      | 75/200 [00:26<00:44,  2.81it/s]

{'loss': 0.611, 'grad_norm': 30.13850212097168, 'learning_rate': 0.00015469481581224272, 'rewards/chosen': -0.17942076921463013, 'rewards/rejected': -2.9326443672180176, 'rewards/accuracies': 1.0, 'rewards/margins': 2.7532236576080322, 'logps/chosen': -129.79928588867188, 'logps/rejected': -145.50653076171875, 'logits/chosen': -2.870943307876587, 'logits/rejected': -2.092313528060913, 'epoch': 0.14}


 38%|███▊      | 76/200 [00:26<00:43,  2.82it/s]

{'loss': 0.3609, 'grad_norm': 7.580910682678223, 'learning_rate': 0.0001533031728727994, 'rewards/chosen': 0.017802223563194275, 'rewards/rejected': -2.8006367683410645, 'rewards/accuracies': 1.0, 'rewards/margins': 2.818438768386841, 'logps/chosen': -148.50033569335938, 'logps/rejected': -124.77610778808594, 'logits/chosen': -2.4964208602905273, 'logits/rejected': -2.2462501525878906, 'epoch': 0.14}


 38%|███▊      | 77/200 [00:27<00:42,  2.87it/s]

{'loss': 0.6959, 'grad_norm': 28.179094314575195, 'learning_rate': 0.00015189695737812152, 'rewards/chosen': -0.11321697384119034, 'rewards/rejected': -2.464539051055908, 'rewards/accuracies': 1.0, 'rewards/margins': 2.3513221740722656, 'logps/chosen': -118.90554809570312, 'logps/rejected': -119.8447265625, 'logits/chosen': -2.6701340675354004, 'logits/rejected': -1.9919383525848389, 'epoch': 0.14}


 39%|███▉      | 78/200 [00:27<00:42,  2.86it/s]

{'loss': 1.1285, 'grad_norm': 31.719406127929688, 'learning_rate': 0.0001504765537734844, 'rewards/chosen': -0.7421252727508545, 'rewards/rejected': -2.584439754486084, 'rewards/accuracies': 0.75, 'rewards/margins': 1.84231436252594, 'logps/chosen': -195.83505249023438, 'logps/rejected': -122.65217590332031, 'logits/chosen': -1.8954205513000488, 'logits/rejected': -1.878833532333374, 'epoch': 0.14}


 40%|███▉      | 79/200 [00:27<00:42,  2.83it/s]

{'loss': 0.5544, 'grad_norm': 26.182537078857422, 'learning_rate': 0.00014904235038305083, 'rewards/chosen': -0.9822930097579956, 'rewards/rejected': -3.7272186279296875, 'rewards/accuracies': 1.0, 'rewards/margins': 2.7449254989624023, 'logps/chosen': -187.09518432617188, 'logps/rejected': -135.5682373046875, 'logits/chosen': -2.6613807678222656, 'logits/rejected': -2.6151652336120605, 'epoch': 0.15}


 40%|████      | 80/200 [00:28<00:42,  2.85it/s]

{'loss': 0.5829, 'grad_norm': 21.830604553222656, 'learning_rate': 0.00014759473930370736, 'rewards/chosen': -0.15966850519180298, 'rewards/rejected': -2.513672351837158, 'rewards/accuracies': 1.0, 'rewards/margins': 2.35400390625, 'logps/chosen': -107.20381927490234, 'logps/rejected': -136.7051239013672, 'logits/chosen': -2.7604758739471436, 'logits/rejected': -1.8023655414581299, 'epoch': 0.15}


 40%|████      | 81/200 [00:28<00:41,  2.84it/s]

{'loss': 0.2223, 'grad_norm': 9.157941818237305, 'learning_rate': 0.0001461341162978688, 'rewards/chosen': 0.17412032186985016, 'rewards/rejected': -4.358998775482178, 'rewards/accuracies': 1.0, 'rewards/margins': 4.533119201660156, 'logps/chosen': -153.22804260253906, 'logps/rejected': -165.9378662109375, 'logits/chosen': -2.892502784729004, 'logits/rejected': -2.1916446685791016, 'epoch': 0.15}


 41%|████      | 82/200 [00:28<00:41,  2.83it/s]

{'loss': 0.5195, 'grad_norm': 23.185579299926758, 'learning_rate': 0.00014466088068528068, 'rewards/chosen': -0.18212394416332245, 'rewards/rejected': -3.517146110534668, 'rewards/accuracies': 1.0, 'rewards/margins': 3.33502197265625, 'logps/chosen': -181.27496337890625, 'logps/rejected': -147.7755126953125, 'logits/chosen': -2.316226005554199, 'logits/rejected': -1.9296600818634033, 'epoch': 0.15}


 42%|████▏     | 83/200 [00:29<00:41,  2.82it/s]

{'loss': 0.9521, 'grad_norm': 39.04978561401367, 'learning_rate': 0.00014317543523384928, 'rewards/chosen': -0.9923683404922485, 'rewards/rejected': -2.939443588256836, 'rewards/accuracies': 1.0, 'rewards/margins': 1.947075366973877, 'logps/chosen': -137.4046630859375, 'logps/rejected': -121.37080383300781, 'logits/chosen': -2.9897491931915283, 'logits/rejected': -2.6415188312530518, 'epoch': 0.15}


 42%|████▏     | 84/200 [00:29<00:40,  2.83it/s]

{'loss': 1.5266, 'grad_norm': 52.79900360107422, 'learning_rate': 0.00014167818604952906, 'rewards/chosen': -2.4787650108337402, 'rewards/rejected': -4.109565734863281, 'rewards/accuracies': 0.75, 'rewards/margins': 1.6308009624481201, 'logps/chosen': -146.58743286132812, 'logps/rejected': -124.61244201660156, 'logits/chosen': -2.6747491359710693, 'logits/rejected': -2.769702911376953, 'epoch': 0.15}


 42%|████▎     | 85/200 [00:29<00:40,  2.85it/s]

{'loss': 0.4686, 'grad_norm': 17.136247634887695, 'learning_rate': 0.00014016954246529696, 'rewards/chosen': -1.014289379119873, 'rewards/rejected': -3.1517083644866943, 'rewards/accuracies': 1.0, 'rewards/margins': 2.1374192237854004, 'logps/chosen': -112.11669921875, 'logps/rejected': -114.44882202148438, 'logits/chosen': -2.791626214981079, 'logits/rejected': -2.3946619033813477, 'epoch': 0.16}


 43%|████▎     | 86/200 [00:30<00:39,  2.86it/s]

{'loss': 0.8028, 'grad_norm': 36.90116882324219, 'learning_rate': 0.00013864991692924523, 'rewards/chosen': -1.2035516500473022, 'rewards/rejected': -3.4076895713806152, 'rewards/accuracies': 1.0, 'rewards/margins': 2.2041378021240234, 'logps/chosen': -163.3610382080078, 'logps/rejected': -136.06854248046875, 'logits/chosen': -2.54455828666687, 'logits/rejected': -2.110898017883301, 'epoch': 0.16}


 44%|████▎     | 87/200 [00:30<00:39,  2.86it/s]

{'loss': 0.9491, 'grad_norm': 12.709660530090332, 'learning_rate': 0.00013711972489182208, 'rewards/chosen': -0.7152700424194336, 'rewards/rejected': -2.9822375774383545, 'rewards/accuracies': 1.0, 'rewards/margins': 2.266967535018921, 'logps/chosen': -113.3021469116211, 'logps/rejected': -109.46934509277344, 'logits/chosen': -2.772313356399536, 'logits/rejected': -2.5715558528900146, 'epoch': 0.16}


 44%|████▍     | 88/200 [00:31<00:39,  2.84it/s]

{'loss': 0.6352, 'grad_norm': 25.809429168701172, 'learning_rate': 0.00013557938469225167, 'rewards/chosen': -0.7271945476531982, 'rewards/rejected': -4.079340934753418, 'rewards/accuracies': 1.0, 'rewards/margins': 3.3521463871002197, 'logps/chosen': -157.21022033691406, 'logps/rejected': -184.6520538330078, 'logits/chosen': -2.852846384048462, 'logits/rejected': -2.3640990257263184, 'epoch': 0.16}


 44%|████▍     | 89/200 [00:31<00:39,  2.81it/s]

{'loss': 0.3495, 'grad_norm': 7.0905351638793945, 'learning_rate': 0.00013402931744416433, 'rewards/chosen': -0.4491622745990753, 'rewards/rejected': -3.1707730293273926, 'rewards/accuracies': 1.0, 'rewards/margins': 2.7216110229492188, 'logps/chosen': -120.94102478027344, 'logps/rejected': -128.4067840576172, 'logits/chosen': -2.176331043243408, 'logits/rejected': -2.107029438018799, 'epoch': 0.16}


 45%|████▌     | 90/200 [00:31<00:39,  2.80it/s]

{'loss': 0.4952, 'grad_norm': 17.748369216918945, 'learning_rate': 0.00013246994692046836, 'rewards/chosen': -0.918923020362854, 'rewards/rejected': -3.4469406604766846, 'rewards/accuracies': 1.0, 'rewards/margins': 2.52801775932312, 'logps/chosen': -145.89622497558594, 'logps/rejected': -124.82625579833984, 'logits/chosen': -2.8572843074798584, 'logits/rejected': -2.3681273460388184, 'epoch': 0.17}


 46%|████▌     | 91/200 [00:32<00:38,  2.81it/s]

{'loss': 0.6881, 'grad_norm': 26.443572998046875, 'learning_rate': 0.00013090169943749476, 'rewards/chosen': -1.0826442241668701, 'rewards/rejected': -3.0739827156066895, 'rewards/accuracies': 1.0, 'rewards/margins': 1.9913386106491089, 'logps/chosen': -101.11677551269531, 'logps/rejected': -99.17840576171875, 'logits/chosen': -3.1228394508361816, 'logits/rejected': -2.893085479736328, 'epoch': 0.17}


 46%|████▌     | 92/200 [00:32<00:39,  2.75it/s]

{'loss': 0.6825, 'grad_norm': 24.787334442138672, 'learning_rate': 0.0001293250037384465, 'rewards/chosen': -0.9421442151069641, 'rewards/rejected': -3.464848279953003, 'rewards/accuracies': 1.0, 'rewards/margins': 2.5227041244506836, 'logps/chosen': -111.56038665771484, 'logps/rejected': -113.31197357177734, 'logits/chosen': -2.8661012649536133, 'logits/rejected': -2.652970790863037, 'epoch': 0.17}


 46%|████▋     | 93/200 [00:32<00:38,  2.77it/s]

{'loss': 0.7011, 'grad_norm': 24.9250431060791, 'learning_rate': 0.00012774029087618446, 'rewards/chosen': -0.8826736807823181, 'rewards/rejected': -2.965641975402832, 'rewards/accuracies': 1.0, 'rewards/margins': 2.082968235015869, 'logps/chosen': -160.0137176513672, 'logps/rejected': -136.92294311523438, 'logits/chosen': -2.2007296085357666, 'logits/rejected': -1.774527907371521, 'epoch': 0.17}


 47%|████▋     | 94/200 [00:33<00:37,  2.79it/s]

{'loss': 0.5684, 'grad_norm': 26.741168975830078, 'learning_rate': 0.00012614799409538198, 'rewards/chosen': 0.24677979946136475, 'rewards/rejected': -3.1905102729797363, 'rewards/accuracies': 1.0, 'rewards/margins': 3.4372897148132324, 'logps/chosen': -148.85946655273438, 'logps/rejected': -131.17962646484375, 'logits/chosen': -2.635096311569214, 'logits/rejected': -2.2315833568573, 'epoch': 0.17}


 48%|████▊     | 95/200 [00:33<00:37,  2.78it/s]

{'loss': 0.4247, 'grad_norm': 21.85597038269043, 'learning_rate': 0.00012454854871407994, 'rewards/chosen': 0.11458596587181091, 'rewards/rejected': -3.178673505783081, 'rewards/accuracies': 1.0, 'rewards/margins': 3.293259620666504, 'logps/chosen': -150.9479522705078, 'logps/rejected': -128.2594451904297, 'logits/chosen': -2.550807476043701, 'logits/rejected': -2.5103182792663574, 'epoch': 0.17}


 48%|████▊     | 96/200 [00:33<00:37,  2.78it/s]

{'loss': 1.9581, 'grad_norm': 88.12598419189453, 'learning_rate': 0.00012294239200467516, 'rewards/chosen': -0.9894273281097412, 'rewards/rejected': -2.9007668495178223, 'rewards/accuracies': 0.75, 'rewards/margins': 1.911339521408081, 'logps/chosen': -156.58934020996094, 'logps/rejected': -151.13438415527344, 'logits/chosen': -2.3507518768310547, 'logits/rejected': -1.9500924348831177, 'epoch': 0.18}


 48%|████▊     | 97/200 [00:34<00:36,  2.80it/s]

{'loss': 0.8478, 'grad_norm': 29.569644927978516, 'learning_rate': 0.0001213299630743747, 'rewards/chosen': 0.45750123262405396, 'rewards/rejected': -2.1406748294830322, 'rewards/accuracies': 0.75, 'rewards/margins': 2.5981760025024414, 'logps/chosen': -108.23063659667969, 'logps/rejected': -130.74261474609375, 'logits/chosen': -2.8605098724365234, 'logits/rejected': -2.667973279953003, 'epoch': 0.18}


 49%|████▉     | 98/200 [00:34<00:36,  2.81it/s]

{'loss': 0.3889, 'grad_norm': 13.670642852783203, 'learning_rate': 0.00011971170274514802, 'rewards/chosen': -0.3888280987739563, 'rewards/rejected': -3.349961996078491, 'rewards/accuracies': 1.0, 'rewards/margins': 2.9611339569091797, 'logps/chosen': -110.64410400390625, 'logps/rejected': -128.0794219970703, 'logits/chosen': -3.6261701583862305, 'logits/rejected': -3.458353281021118, 'epoch': 0.18}


 50%|████▉     | 99/200 [00:34<00:35,  2.81it/s]

{'loss': 0.5074, 'grad_norm': 18.647693634033203, 'learning_rate': 0.000118088053433211, 'rewards/chosen': -0.20708541572093964, 'rewards/rejected': -3.043210029602051, 'rewards/accuracies': 1.0, 'rewards/margins': 2.8361244201660156, 'logps/chosen': -159.0391387939453, 'logps/rejected': -149.2369384765625, 'logits/chosen': -2.753392219543457, 'logits/rejected': -2.364164352416992, 'epoch': 0.18}


 50%|█████     | 100/200 [00:35<00:35,  2.82it/s]

{'loss': 0.8222, 'grad_norm': 34.010658264160156, 'learning_rate': 0.00011645945902807341, 'rewards/chosen': -0.36041125655174255, 'rewards/rejected': -3.4077227115631104, 'rewards/accuracies': 0.75, 'rewards/margins': 3.047311544418335, 'logps/chosen': -147.61895751953125, 'logps/rejected': -128.04147338867188, 'logits/chosen': -2.703155040740967, 'logits/rejected': -2.6793198585510254, 'epoch': 0.18}


 50%|█████     | 101/200 [00:35<00:34,  2.83it/s]

{'loss': 0.2229, 'grad_norm': 8.333490371704102, 'learning_rate': 0.0001148263647711842, 'rewards/chosen': 0.008455449715256691, 'rewards/rejected': -3.0305488109588623, 'rewards/accuracies': 1.0, 'rewards/margins': 3.039004325866699, 'logps/chosen': -142.89231872558594, 'logps/rejected': -129.99334716796875, 'logits/chosen': -2.849546432495117, 'logits/rejected': -2.830456256866455, 'epoch': 0.19}


 51%|█████     | 102/200 [00:36<00:34,  2.85it/s]

{'loss': 0.1333, 'grad_norm': 4.4470367431640625, 'learning_rate': 0.00011318921713420691, 'rewards/chosen': -0.459364116191864, 'rewards/rejected': -4.023294448852539, 'rewards/accuracies': 1.0, 'rewards/margins': 3.563929796218872, 'logps/chosen': -132.2074737548828, 'logps/rejected': -128.1557159423828, 'logits/chosen': -3.283933639526367, 'logits/rejected': -3.229186773300171, 'epoch': 0.19}


 52%|█████▏    | 103/200 [00:36<00:33,  2.86it/s]

{'loss': 0.2361, 'grad_norm': 8.682723045349121, 'learning_rate': 0.00011154846369695863, 'rewards/chosen': -1.3204001188278198, 'rewards/rejected': -4.524691581726074, 'rewards/accuracies': 1.0, 'rewards/margins': 3.204291820526123, 'logps/chosen': -139.4429473876953, 'logps/rejected': -135.6705322265625, 'logits/chosen': -3.1900265216827393, 'logits/rejected': -3.125826358795166, 'epoch': 0.19}


 52%|█████▏    | 104/200 [00:36<00:33,  2.87it/s]

{'loss': 0.5118, 'grad_norm': 16.882844924926758, 'learning_rate': 0.0001099045530250463, 'rewards/chosen': -0.8593927621841431, 'rewards/rejected': -3.6978137493133545, 'rewards/accuracies': 1.0, 'rewards/margins': 2.838420867919922, 'logps/chosen': -110.40350341796875, 'logps/rejected': -124.43917846679688, 'logits/chosen': -3.2277719974517822, 'logits/rejected': -2.81056547164917, 'epoch': 0.19}


 52%|█████▎    | 105/200 [00:37<00:33,  2.86it/s]

{'loss': 0.4794, 'grad_norm': 20.418886184692383, 'learning_rate': 0.00010825793454723325, 'rewards/chosen': -1.223164439201355, 'rewards/rejected': -3.8421266078948975, 'rewards/accuracies': 1.0, 'rewards/margins': 2.618962287902832, 'logps/chosen': -155.266845703125, 'logps/rejected': -150.86019897460938, 'logits/chosen': -2.5712599754333496, 'logits/rejected': -2.084629774093628, 'epoch': 0.19}


 53%|█████▎    | 106/200 [00:37<00:33,  2.85it/s]

{'loss': 1.1042, 'grad_norm': 39.9430046081543, 'learning_rate': 0.00010660905843256994, 'rewards/chosen': -1.5011487007141113, 'rewards/rejected': -3.5070672035217285, 'rewards/accuracies': 0.75, 'rewards/margins': 2.005918502807617, 'logps/chosen': -148.4109344482422, 'logps/rejected': -129.61270141601562, 'logits/chosen': -2.841609001159668, 'logits/rejected': -2.8688759803771973, 'epoch': 0.19}


 54%|█████▎    | 107/200 [00:37<00:32,  2.83it/s]

{'loss': 0.032, 'grad_norm': 1.4978349208831787, 'learning_rate': 0.00010495837546732224, 'rewards/chosen': -0.07572880387306213, 'rewards/rejected': -5.308565139770508, 'rewards/accuracies': 1.0, 'rewards/margins': 5.232836723327637, 'logps/chosen': -132.41355895996094, 'logps/rejected': -164.04978942871094, 'logits/chosen': -2.863206386566162, 'logits/rejected': -2.4279720783233643, 'epoch': 0.2}


 54%|█████▍    | 108/200 [00:38<00:32,  2.82it/s]

{'loss': 1.0409, 'grad_norm': 51.175079345703125, 'learning_rate': 0.00010330633693173082, 'rewards/chosen': -1.5778207778930664, 'rewards/rejected': -3.0055577754974365, 'rewards/accuracies': 1.0, 'rewards/margins': 1.4277369976043701, 'logps/chosen': -143.311767578125, 'logps/rejected': -106.47344970703125, 'logits/chosen': -2.945551633834839, 'logits/rejected': -2.9542970657348633, 'epoch': 0.2}


 55%|█████▍    | 109/200 [00:38<00:32,  2.80it/s]

{'loss': 0.0501, 'grad_norm': 3.4573354721069336, 'learning_rate': 0.00010165339447663587, 'rewards/chosen': -0.8366508483886719, 'rewards/rejected': -6.532177925109863, 'rewards/accuracies': 1.0, 'rewards/margins': 5.695527076721191, 'logps/chosen': -195.26438903808594, 'logps/rejected': -227.630859375, 'logits/chosen': -2.8325095176696777, 'logits/rejected': -2.573885917663574, 'epoch': 0.2}


 55%|█████▌    | 110/200 [00:38<00:31,  2.82it/s]

{'loss': 0.1471, 'grad_norm': 5.845897674560547, 'learning_rate': 0.0001, 'rewards/chosen': -0.9389374256134033, 'rewards/rejected': -4.809540271759033, 'rewards/accuracies': 1.0, 'rewards/margins': 3.870602607727051, 'logps/chosen': -105.08509826660156, 'logps/rejected': -145.67672729492188, 'logits/chosen': -3.5013110637664795, 'logits/rejected': -2.997351884841919, 'epoch': 0.2}


 56%|█████▌    | 111/200 [00:39<00:31,  2.82it/s]

{'loss': 0.1383, 'grad_norm': 13.822497367858887, 'learning_rate': 9.834660552336415e-05, 'rewards/chosen': -0.9037695527076721, 'rewards/rejected': -5.914064407348633, 'rewards/accuracies': 1.0, 'rewards/margins': 5.0102949142456055, 'logps/chosen': -170.35687255859375, 'logps/rejected': -148.8546142578125, 'logits/chosen': -3.322493553161621, 'logits/rejected': -3.077889919281006, 'epoch': 0.2}


 56%|█████▌    | 112/200 [00:39<00:31,  2.82it/s]

{'loss': 0.0962, 'grad_norm': 4.094309329986572, 'learning_rate': 9.669366306826919e-05, 'rewards/chosen': -0.04336853325366974, 'rewards/rejected': -4.249526023864746, 'rewards/accuracies': 1.0, 'rewards/margins': 4.206157684326172, 'logps/chosen': -177.85409545898438, 'logps/rejected': -131.89990234375, 'logits/chosen': -3.094179391860962, 'logits/rejected': -3.2685160636901855, 'epoch': 0.21}


 56%|█████▋    | 113/200 [00:39<00:31,  2.81it/s]

{'loss': 0.0261, 'grad_norm': 1.1727142333984375, 'learning_rate': 9.504162453267777e-05, 'rewards/chosen': -0.7709823846817017, 'rewards/rejected': -6.145030498504639, 'rewards/accuracies': 1.0, 'rewards/margins': 5.374048233032227, 'logps/chosen': -177.77099609375, 'logps/rejected': -186.76171875, 'logits/chosen': -3.19891619682312, 'logits/rejected': -2.923750162124634, 'epoch': 0.21}


 57%|█████▋    | 114/200 [00:40<00:30,  2.82it/s]

{'loss': 0.0125, 'grad_norm': 1.2537928819656372, 'learning_rate': 9.339094156743007e-05, 'rewards/chosen': -1.2699933052062988, 'rewards/rejected': -7.803934574127197, 'rewards/accuracies': 1.0, 'rewards/margins': 6.533941745758057, 'logps/chosen': -121.5723648071289, 'logps/rejected': -176.29458618164062, 'logits/chosen': -3.6023309230804443, 'logits/rejected': -3.2731032371520996, 'epoch': 0.21}


 57%|█████▊    | 115/200 [00:40<00:30,  2.82it/s]

{'loss': 0.173, 'grad_norm': 14.963993072509766, 'learning_rate': 9.174206545276677e-05, 'rewards/chosen': -1.7544257640838623, 'rewards/rejected': -6.9961347579956055, 'rewards/accuracies': 1.0, 'rewards/margins': 5.241708755493164, 'logps/chosen': -136.95045471191406, 'logps/rejected': -161.49464416503906, 'logits/chosen': -3.3547425270080566, 'logits/rejected': -3.3367390632629395, 'epoch': 0.21}


 58%|█████▊    | 116/200 [00:40<00:29,  2.84it/s]

{'loss': 0.3427, 'grad_norm': 13.7245512008667, 'learning_rate': 9.009544697495374e-05, 'rewards/chosen': -1.924273133277893, 'rewards/rejected': -4.9312028884887695, 'rewards/accuracies': 1.0, 'rewards/margins': 3.006929874420166, 'logps/chosen': -125.5605697631836, 'logps/rejected': -157.40121459960938, 'logits/chosen': -3.643213987350464, 'logits/rejected': -3.401536226272583, 'epoch': 0.21}


 58%|█████▊    | 117/200 [00:41<00:29,  2.80it/s]

{'loss': 0.13, 'grad_norm': 9.866933822631836, 'learning_rate': 8.845153630304139e-05, 'rewards/chosen': -1.1110708713531494, 'rewards/rejected': -5.784255027770996, 'rewards/accuracies': 1.0, 'rewards/margins': 4.673183917999268, 'logps/chosen': -159.40101623535156, 'logps/rejected': -181.59146118164062, 'logits/chosen': -3.7640669345855713, 'logits/rejected': -3.265810489654541, 'epoch': 0.21}


 59%|█████▉    | 118/200 [00:41<00:29,  2.80it/s]

{'loss': 0.01, 'grad_norm': 0.927863597869873, 'learning_rate': 8.681078286579311e-05, 'rewards/chosen': -1.0294493436813354, 'rewards/rejected': -7.420773506164551, 'rewards/accuracies': 1.0, 'rewards/margins': 6.391324043273926, 'logps/chosen': -141.98751831054688, 'logps/rejected': -173.3436279296875, 'logits/chosen': -3.2816295623779297, 'logits/rejected': -2.8657429218292236, 'epoch': 0.22}


 60%|█████▉    | 119/200 [00:42<00:28,  2.83it/s]

{'loss': 0.4378, 'grad_norm': 24.52969741821289, 'learning_rate': 8.517363522881579e-05, 'rewards/chosen': -1.549283742904663, 'rewards/rejected': -5.638471603393555, 'rewards/accuracies': 1.0, 'rewards/margins': 4.089188575744629, 'logps/chosen': -140.61961364746094, 'logps/rejected': -147.0869903564453, 'logits/chosen': -3.4247922897338867, 'logits/rejected': -3.116757392883301, 'epoch': 0.22}


 60%|██████    | 120/200 [00:42<00:28,  2.83it/s]

{'loss': 0.0342, 'grad_norm': 1.7189345359802246, 'learning_rate': 8.35405409719266e-05, 'rewards/chosen': -1.8171844482421875, 'rewards/rejected': -6.976973056793213, 'rewards/accuracies': 1.0, 'rewards/margins': 5.159788131713867, 'logps/chosen': -162.08108520507812, 'logps/rejected': -162.72882080078125, 'logits/chosen': -3.7232775688171387, 'logits/rejected': -3.538663864135742, 'epoch': 0.22}


 60%|██████    | 121/200 [00:42<00:27,  2.86it/s]

{'loss': 0.0868, 'grad_norm': 8.661568641662598, 'learning_rate': 8.191194656678904e-05, 'rewards/chosen': -2.1922190189361572, 'rewards/rejected': -8.70893669128418, 'rewards/accuracies': 1.0, 'rewards/margins': 6.516717910766602, 'logps/chosen': -151.06471252441406, 'logps/rejected': -194.93218994140625, 'logits/chosen': -3.5244829654693604, 'logits/rejected': -3.0141377449035645, 'epoch': 0.22}


 61%|██████    | 122/200 [00:43<00:27,  2.86it/s]

{'loss': 0.5509, 'grad_norm': 25.563148498535156, 'learning_rate': 8.028829725485199e-05, 'rewards/chosen': -2.432762384414673, 'rewards/rejected': -4.860835075378418, 'rewards/accuracies': 1.0, 'rewards/margins': 2.428072690963745, 'logps/chosen': -128.2227020263672, 'logps/rejected': -147.05137634277344, 'logits/chosen': -3.680046558380127, 'logits/rejected': -3.3303089141845703, 'epoch': 0.22}


 62%|██████▏   | 123/200 [00:43<00:27,  2.85it/s]

{'loss': 0.7165, 'grad_norm': 18.946287155151367, 'learning_rate': 7.867003692562534e-05, 'rewards/chosen': -1.44582998752594, 'rewards/rejected': -6.00916051864624, 'rewards/accuracies': 1.0, 'rewards/margins': 4.56333065032959, 'logps/chosen': -118.7446517944336, 'logps/rejected': -142.49778747558594, 'logits/chosen': -3.5947823524475098, 'logits/rejected': -3.4153921604156494, 'epoch': 0.23}


 62%|██████▏   | 124/200 [00:43<00:26,  2.84it/s]

{'loss': 0.2739, 'grad_norm': 25.084171295166016, 'learning_rate': 7.705760799532485e-05, 'rewards/chosen': -0.47765350341796875, 'rewards/rejected': -7.365026473999023, 'rewards/accuracies': 1.0, 'rewards/margins': 6.887373447418213, 'logps/chosen': -187.599365234375, 'logps/rejected': -194.87796020507812, 'logits/chosen': -3.4463467597961426, 'logits/rejected': -3.159942626953125, 'epoch': 0.23}


 62%|██████▎   | 125/200 [00:44<00:26,  2.82it/s]

{'loss': 0.0788, 'grad_norm': 12.719156265258789, 'learning_rate': 7.54514512859201e-05, 'rewards/chosen': -1.6400096416473389, 'rewards/rejected': -7.595722198486328, 'rewards/accuracies': 1.0, 'rewards/margins': 5.955713272094727, 'logps/chosen': -195.13925170898438, 'logps/rejected': -221.74851989746094, 'logits/chosen': -3.4376142024993896, 'logits/rejected': -3.3173751831054688, 'epoch': 0.23}


 63%|██████▎   | 126/200 [00:44<00:26,  2.83it/s]

{'loss': 0.2016, 'grad_norm': 15.562494277954102, 'learning_rate': 7.385200590461803e-05, 'rewards/chosen': -0.9066989421844482, 'rewards/rejected': -4.826658248901367, 'rewards/accuracies': 1.0, 'rewards/margins': 3.91995906829834, 'logps/chosen': -174.79202270507812, 'logps/rejected': -167.3623504638672, 'logits/chosen': -3.284684896469116, 'logits/rejected': -3.2730443477630615, 'epoch': 0.23}


 64%|██████▎   | 127/200 [00:44<00:25,  2.85it/s]

{'loss': 0.5536, 'grad_norm': 33.78313446044922, 'learning_rate': 7.225970912381556e-05, 'rewards/chosen': -2.5029189586639404, 'rewards/rejected': -4.666525840759277, 'rewards/accuracies': 1.0, 'rewards/margins': 2.163606882095337, 'logps/chosen': -148.32669067382812, 'logps/rejected': -129.81459045410156, 'logits/chosen': -3.1620378494262695, 'logits/rejected': -3.264012336730957, 'epoch': 0.23}


 64%|██████▍   | 128/200 [00:45<00:25,  2.85it/s]

{'loss': 0.061, 'grad_norm': 6.165801048278809, 'learning_rate': 7.067499626155354e-05, 'rewards/chosen': -1.1433006525039673, 'rewards/rejected': -5.977639198303223, 'rewards/accuracies': 1.0, 'rewards/margins': 4.834338188171387, 'logps/chosen': -168.1807098388672, 'logps/rejected': -165.74972534179688, 'logits/chosen': -3.1313936710357666, 'logits/rejected': -2.8355445861816406, 'epoch': 0.23}


 64%|██████▍   | 129/200 [00:45<00:24,  2.88it/s]

{'loss': 0.3898, 'grad_norm': 24.515995025634766, 'learning_rate': 6.909830056250527e-05, 'rewards/chosen': -2.0315937995910645, 'rewards/rejected': -7.039195537567139, 'rewards/accuracies': 1.0, 'rewards/margins': 5.007601737976074, 'logps/chosen': -116.8496322631836, 'logps/rejected': -169.302001953125, 'logits/chosen': -3.681178092956543, 'logits/rejected': -3.5188820362091064, 'epoch': 0.24}


 65%|██████▌   | 130/200 [00:45<00:24,  2.86it/s]

{'loss': 0.0561, 'grad_norm': 4.377557277679443, 'learning_rate': 6.753005307953167e-05, 'rewards/chosen': -1.7298041582107544, 'rewards/rejected': -6.307146072387695, 'rewards/accuracies': 1.0, 'rewards/margins': 4.5773420333862305, 'logps/chosen': -165.69964599609375, 'logps/rejected': -152.53927612304688, 'logits/chosen': -2.99277400970459, 'logits/rejected': -3.2709107398986816, 'epoch': 0.24}


 66%|██████▌   | 131/200 [00:46<00:24,  2.84it/s]

{'loss': 0.2818, 'grad_norm': 21.020984649658203, 'learning_rate': 6.59706825558357e-05, 'rewards/chosen': -2.472668409347534, 'rewards/rejected': -6.828592777252197, 'rewards/accuracies': 1.0, 'rewards/margins': 4.355924129486084, 'logps/chosen': -173.4410400390625, 'logps/rejected': -170.43182373046875, 'logits/chosen': -3.328129768371582, 'logits/rejected': -3.0094242095947266, 'epoch': 0.24}


 66%|██████▌   | 132/200 [00:46<00:23,  2.85it/s]

{'loss': 0.0122, 'grad_norm': 0.8916051983833313, 'learning_rate': 6.442061530774834e-05, 'rewards/chosen': -2.8291313648223877, 'rewards/rejected': -10.632553100585938, 'rewards/accuracies': 1.0, 'rewards/margins': 7.803421497344971, 'logps/chosen': -163.80221557617188, 'logps/rejected': -215.9386444091797, 'logits/chosen': -3.6365413665771484, 'logits/rejected': -3.250277042388916, 'epoch': 0.24}


 66%|██████▋   | 133/200 [00:46<00:23,  2.85it/s]

{'loss': 1.1575, 'grad_norm': nan, 'learning_rate': 6.442061530774834e-05, 'rewards/chosen': -2.937291383743286, 'rewards/rejected': -6.004829406738281, 'rewards/accuracies': 0.75, 'rewards/margins': 3.067537784576416, 'logps/chosen': -139.73822021484375, 'logps/rejected': -131.15867614746094, 'logits/chosen': -3.408730983734131, 'logits/rejected': -3.436978816986084, 'epoch': 0.24}


 67%|██████▋   | 134/200 [00:47<00:22,  2.87it/s]

{'loss': 0.9396, 'grad_norm': nan, 'learning_rate': 6.442061530774834e-05, 'rewards/chosen': -3.8825130462646484, 'rewards/rejected': -5.986379623413086, 'rewards/accuracies': 0.75, 'rewards/margins': 2.1038668155670166, 'logps/chosen': -145.6822509765625, 'logps/rejected': -159.3893280029297, 'logits/chosen': -3.351147174835205, 'logits/rejected': -3.6552298069000244, 'epoch': 0.25}


 68%|██████▊   | 135/200 [00:47<00:22,  2.88it/s]

{'loss': 1.7224, 'grad_norm': 167.5928192138672, 'learning_rate': 6.28802751081779e-05, 'rewards/chosen': -2.590813398361206, 'rewards/rejected': -4.970983505249023, 'rewards/accuracies': 0.75, 'rewards/margins': 2.3801698684692383, 'logps/chosen': -151.11708068847656, 'logps/rejected': -141.6597900390625, 'logits/chosen': -3.3933637142181396, 'logits/rejected': -3.2339653968811035, 'epoch': 0.25}


 68%|██████▊   | 136/200 [00:47<00:22,  2.90it/s]

{'loss': 0.5136, 'grad_norm': 47.08870315551758, 'learning_rate': 6.135008307075481e-05, 'rewards/chosen': -1.889463186264038, 'rewards/rejected': -7.319165229797363, 'rewards/accuracies': 1.0, 'rewards/margins': 5.429701805114746, 'logps/chosen': -92.63086700439453, 'logps/rejected': -153.431396484375, 'logits/chosen': -3.4699506759643555, 'logits/rejected': -3.6321144104003906, 'epoch': 0.25}


 68%|██████▊   | 137/200 [00:48<00:21,  2.90it/s]

{'loss': 0.1457, 'grad_norm': 8.23375415802002, 'learning_rate': 5.983045753470308e-05, 'rewards/chosen': -3.453350305557251, 'rewards/rejected': -7.114383220672607, 'rewards/accuracies': 1.0, 'rewards/margins': 3.6610331535339355, 'logps/chosen': -144.18212890625, 'logps/rejected': -157.6902618408203, 'logits/chosen': -3.692021131515503, 'logits/rejected': -3.7900938987731934, 'epoch': 0.25}


 69%|██████▉   | 138/200 [00:48<00:21,  2.84it/s]

{'loss': 0.4011, 'grad_norm': 48.448429107666016, 'learning_rate': 5.832181395047098e-05, 'rewards/chosen': -2.713606595993042, 'rewards/rejected': -6.901954174041748, 'rewards/accuracies': 1.0, 'rewards/margins': 4.188347339630127, 'logps/chosen': -194.96217346191406, 'logps/rejected': -172.43063354492188, 'logits/chosen': -3.6016643047332764, 'logits/rejected': -3.8578238487243652, 'epoch': 0.25}


 70%|██████▉   | 139/200 [00:49<00:21,  2.84it/s]

{'loss': 0.0088, 'grad_norm': 0.6000970602035522, 'learning_rate': 5.6824564766150726e-05, 'rewards/chosen': -2.4137864112854004, 'rewards/rejected': -8.66479778289795, 'rewards/accuracies': 1.0, 'rewards/margins': 6.251011848449707, 'logps/chosen': -181.57745361328125, 'logps/rejected': -193.66555786132812, 'logits/chosen': -3.389131546020508, 'logits/rejected': -3.492013931274414, 'epoch': 0.26}


 70%|███████   | 140/200 [00:49<00:21,  2.85it/s]

{'loss': 0.7374, 'grad_norm': 64.30777740478516, 'learning_rate': 5.533911931471936e-05, 'rewards/chosen': -4.365135192871094, 'rewards/rejected': -7.302268981933594, 'rewards/accuracies': 1.0, 'rewards/margins': 2.937133312225342, 'logps/chosen': -173.13551330566406, 'logps/rejected': -186.2821807861328, 'logits/chosen': -3.5830061435699463, 'logits/rejected': -3.51235294342041, 'epoch': 0.26}


 70%|███████   | 141/200 [00:49<00:20,  2.85it/s]

{'loss': 0.0738, 'grad_norm': 10.27844524383545, 'learning_rate': 5.386588370213124e-05, 'rewards/chosen': -3.0593676567077637, 'rewards/rejected': -7.241489410400391, 'rewards/accuracies': 1.0, 'rewards/margins': 4.182122707366943, 'logps/chosen': -179.7470245361328, 'logps/rejected': -192.65374755859375, 'logits/chosen': -3.443843364715576, 'logits/rejected': -3.185382604598999, 'epoch': 0.26}


 71%|███████   | 142/200 [00:50<00:20,  2.85it/s]

{'loss': 1.4476, 'grad_norm': 192.7454376220703, 'learning_rate': 5.240526069629265e-05, 'rewards/chosen': -3.542973041534424, 'rewards/rejected': -6.969072341918945, 'rewards/accuracies': 0.75, 'rewards/margins': 3.4260993003845215, 'logps/chosen': -163.7166748046875, 'logps/rejected': -178.95553588867188, 'logits/chosen': -3.661067008972168, 'logits/rejected': -3.5689291954040527, 'epoch': 0.26}


 72%|███████▏  | 143/200 [00:50<00:20,  2.84it/s]

{'loss': 0.1361, 'grad_norm': 9.685578346252441, 'learning_rate': 5.095764961694922e-05, 'rewards/chosen': -2.3562777042388916, 'rewards/rejected': -8.440042495727539, 'rewards/accuracies': 1.0, 'rewards/margins': 6.08376407623291, 'logps/chosen': -162.72967529296875, 'logps/rejected': -197.5533905029297, 'logits/chosen': -3.34246826171875, 'logits/rejected': -3.194528579711914, 'epoch': 0.26}


 72%|███████▏  | 144/200 [00:50<00:19,  2.84it/s]

{'loss': 0.4576, 'grad_norm': 29.07358741760254, 'learning_rate': 4.952344622651566e-05, 'rewards/chosen': -3.089507579803467, 'rewards/rejected': -7.82958459854126, 'rewards/accuracies': 1.0, 'rewards/margins': 4.740076065063477, 'logps/chosen': -145.32276916503906, 'logps/rejected': -159.0615997314453, 'logits/chosen': -3.7072906494140625, 'logits/rejected': -3.3352906703948975, 'epoch': 0.26}


 72%|███████▎  | 145/200 [00:51<00:19,  2.82it/s]

{'loss': 0.1353, 'grad_norm': 15.367875099182129, 'learning_rate': 4.810304262187852e-05, 'rewards/chosen': -2.643932819366455, 'rewards/rejected': -9.002155303955078, 'rewards/accuracies': 1.0, 'rewards/margins': 6.358222484588623, 'logps/chosen': -196.07928466796875, 'logps/rejected': -232.43765258789062, 'logits/chosen': -3.349670886993408, 'logits/rejected': -3.070871114730835, 'epoch': 0.27}


 73%|███████▎  | 146/200 [00:51<00:18,  2.84it/s]

{'loss': 0.2138, 'grad_norm': 11.292510032653809, 'learning_rate': 4.669682712720065e-05, 'rewards/chosen': -1.7134181261062622, 'rewards/rejected': -4.971703052520752, 'rewards/accuracies': 1.0, 'rewards/margins': 3.2582850456237793, 'logps/chosen': -114.00086212158203, 'logps/rejected': -123.1739501953125, 'logits/chosen': -3.468423843383789, 'logits/rejected': -3.3979976177215576, 'epoch': 0.27}


 74%|███████▎  | 147/200 [00:51<00:18,  2.83it/s]

{'loss': 1.0355, 'grad_norm': 25.805686950683594, 'learning_rate': 4.530518418775733e-05, 'rewards/chosen': -2.2284488677978516, 'rewards/rejected': -4.410079002380371, 'rewards/accuracies': 0.75, 'rewards/margins': 2.1816298961639404, 'logps/chosen': -138.56594848632812, 'logps/rejected': -151.4006805419922, 'logits/chosen': -3.5096545219421387, 'logits/rejected': -3.3011746406555176, 'epoch': 0.27}


 74%|███████▍  | 148/200 [00:52<00:18,  2.84it/s]

{'loss': 0.6134, 'grad_norm': 18.735769271850586, 'learning_rate': 4.392849426483274e-05, 'rewards/chosen': -3.0354628562927246, 'rewards/rejected': -5.774355888366699, 'rewards/accuracies': 1.0, 'rewards/margins': 2.7388930320739746, 'logps/chosen': -126.1749496459961, 'logps/rejected': -160.65109252929688, 'logits/chosen': -3.500117778778076, 'logits/rejected': -3.1798007488250732, 'epoch': 0.27}


 74%|███████▍  | 149/200 [00:52<00:17,  2.85it/s]

{'loss': 0.6036, 'grad_norm': 53.16530990600586, 'learning_rate': 4.256713373170564e-05, 'rewards/chosen': -2.611517906188965, 'rewards/rejected': -5.3856096267700195, 'rewards/accuracies': 1.0, 'rewards/margins': 2.774091958999634, 'logps/chosen': -151.06716918945312, 'logps/rejected': -148.66238403320312, 'logits/chosen': -3.2524890899658203, 'logits/rejected': -3.273719549179077, 'epoch': 0.27}


 75%|███████▌  | 150/200 [00:52<00:17,  2.85it/s]

{'loss': 0.8614, 'grad_norm': 37.912635803222656, 'learning_rate': 4.12214747707527e-05, 'rewards/chosen': -3.924215793609619, 'rewards/rejected': -5.97671365737915, 'rewards/accuracies': 1.0, 'rewards/margins': 2.0524978637695312, 'logps/chosen': -128.14036560058594, 'logps/rejected': -123.75946807861328, 'logits/chosen': -3.393315315246582, 'logits/rejected': -3.4791464805603027, 'epoch': 0.28}


 76%|███████▌  | 151/200 [00:53<00:17,  2.85it/s]

{'loss': 0.0352, 'grad_norm': 3.6180026531219482, 'learning_rate': 3.9891885271697496e-05, 'rewards/chosen': -3.1994035243988037, 'rewards/rejected': -8.62376594543457, 'rewards/accuracies': 1.0, 'rewards/margins': 5.424363136291504, 'logps/chosen': -148.17636108398438, 'logps/rejected': -180.52813720703125, 'logits/chosen': -3.3041281700134277, 'logits/rejected': -2.9800195693969727, 'epoch': 0.28}


 76%|███████▌  | 152/200 [00:53<00:16,  2.84it/s]

{'loss': 0.0339, 'grad_norm': 4.784390926361084, 'learning_rate': 3.857872873103322e-05, 'rewards/chosen': -3.8270606994628906, 'rewards/rejected': -10.112436294555664, 'rewards/accuracies': 1.0, 'rewards/margins': 6.28537654876709, 'logps/chosen': -153.82382202148438, 'logps/rejected': -209.86264038085938, 'logits/chosen': -3.162076711654663, 'logits/rejected': -2.897245407104492, 'epoch': 0.28}


 76%|███████▋  | 153/200 [00:53<00:16,  2.84it/s]

{'loss': 0.2641, 'grad_norm': 10.986632347106934, 'learning_rate': 3.7282364152646297e-05, 'rewards/chosen': -1.7689238786697388, 'rewards/rejected': -5.170799255371094, 'rewards/accuracies': 1.0, 'rewards/margins': 3.4018754959106445, 'logps/chosen': -162.59658813476562, 'logps/rejected': -171.98159790039062, 'logits/chosen': -2.8216540813446045, 'logits/rejected': -2.606567859649658, 'epoch': 0.28}


 77%|███████▋  | 154/200 [00:54<00:16,  2.85it/s]

{'loss': 0.9176, 'grad_norm': 15.292110443115234, 'learning_rate': 3.600314594966834e-05, 'rewards/chosen': -2.444626569747925, 'rewards/rejected': -7.488287925720215, 'rewards/accuracies': 0.75, 'rewards/margins': 5.043661594390869, 'logps/chosen': -138.79415893554688, 'logps/rejected': -156.2637176513672, 'logits/chosen': -3.5385451316833496, 'logits/rejected': -3.597775936126709, 'epoch': 0.28}


 78%|███████▊  | 155/200 [00:54<00:15,  2.83it/s]

{'loss': 0.0598, 'grad_norm': 3.528015613555908, 'learning_rate': 3.4741423847583134e-05, 'rewards/chosen': -3.690021514892578, 'rewards/rejected': -8.41285228729248, 'rewards/accuracies': 1.0, 'rewards/margins': 4.722830772399902, 'logps/chosen': -181.94418334960938, 'logps/rejected': -175.96485900878906, 'logits/chosen': -3.5548746585845947, 'logits/rejected': -3.661536693572998, 'epoch': 0.28}


 78%|███████▊  | 156/200 [00:55<00:15,  2.84it/s]

{'loss': 0.1038, 'grad_norm': 8.389951705932617, 'learning_rate': 3.349754278861517e-05, 'rewards/chosen': -1.6949316263198853, 'rewards/rejected': -8.699470520019531, 'rewards/accuracies': 1.0, 'rewards/margins': 7.0045390129089355, 'logps/chosen': -133.45541381835938, 'logps/rejected': -204.39642333984375, 'logits/chosen': -3.499765634536743, 'logits/rejected': -3.1327896118164062, 'epoch': 0.29}


 78%|███████▊  | 157/200 [00:55<00:15,  2.82it/s]

{'loss': 0.2899, 'grad_norm': 26.329198837280273, 'learning_rate': 3.227184283742591e-05, 'rewards/chosen': -2.053863286972046, 'rewards/rejected': -7.032807350158691, 'rewards/accuracies': 1.0, 'rewards/margins': 4.978943824768066, 'logps/chosen': -175.58633422851562, 'logps/rejected': -203.77252197265625, 'logits/chosen': -3.317938804626465, 'logits/rejected': -3.2680535316467285, 'epoch': 0.29}


 79%|███████▉  | 158/200 [00:55<00:14,  2.83it/s]

{'loss': 0.0913, 'grad_norm': 10.360328674316406, 'learning_rate': 3.106465908814342e-05, 'rewards/chosen': -2.483588933944702, 'rewards/rejected': -8.315919876098633, 'rewards/accuracies': 1.0, 'rewards/margins': 5.832330703735352, 'logps/chosen': -139.1999053955078, 'logps/rejected': -173.60812377929688, 'logits/chosen': -3.743284225463867, 'logits/rejected': -3.618961811065674, 'epoch': 0.29}


 80%|███████▉  | 159/200 [00:56<00:14,  2.85it/s]

{'loss': 0.0975, 'grad_norm': 5.246076583862305, 'learning_rate': 2.9876321572751144e-05, 'rewards/chosen': -2.816368341445923, 'rewards/rejected': -7.072122097015381, 'rewards/accuracies': 1.0, 'rewards/margins': 4.255753993988037, 'logps/chosen': -152.18028259277344, 'logps/rejected': -151.6586151123047, 'logits/chosen': -3.6965880393981934, 'logits/rejected': -3.606895923614502, 'epoch': 0.29}


 80%|████████  | 160/200 [00:56<00:14,  2.83it/s]

{'loss': 0.0119, 'grad_norm': 1.1516401767730713, 'learning_rate': 2.87071551708603e-05, 'rewards/chosen': -2.871842384338379, 'rewards/rejected': -10.391773223876953, 'rewards/accuracies': 1.0, 'rewards/margins': 7.519930362701416, 'logps/chosen': -231.219970703125, 'logps/rejected': -261.44561767578125, 'logits/chosen': -3.484537124633789, 'logits/rejected': -3.2603306770324707, 'epoch': 0.29}


 80%|████████  | 161/200 [00:56<00:13,  2.83it/s]

{'loss': 0.4349, 'grad_norm': 57.574825286865234, 'learning_rate': 2.7557479520891104e-05, 'rewards/chosen': -2.2399239540100098, 'rewards/rejected': -7.621304035186768, 'rewards/accuracies': 1.0, 'rewards/margins': 5.3813796043396, 'logps/chosen': -144.19805908203125, 'logps/rejected': -175.63784790039062, 'logits/chosen': -3.472156047821045, 'logits/rejected': -3.175405740737915, 'epoch': 0.3}


 81%|████████  | 162/200 [00:57<00:13,  2.82it/s]

{'loss': 0.0379, 'grad_norm': 3.234760284423828, 'learning_rate': 2.6427608932686843e-05, 'rewards/chosen': -3.4299707412719727, 'rewards/rejected': -8.84837532043457, 'rewards/accuracies': 1.0, 'rewards/margins': 5.418405055999756, 'logps/chosen': -173.25503540039062, 'logps/rejected': -222.53028869628906, 'logits/chosen': -3.471743583679199, 'logits/rejected': -3.175912857055664, 'epoch': 0.3}


 82%|████████▏ | 163/200 [00:57<00:13,  2.82it/s]

{'loss': 0.2965, 'grad_norm': 35.943267822265625, 'learning_rate': 2.5317852301584643e-05, 'rewards/chosen': -2.861942768096924, 'rewards/rejected': -6.5271148681640625, 'rewards/accuracies': 1.0, 'rewards/margins': 3.6651721000671387, 'logps/chosen': -214.52256774902344, 'logps/rejected': -176.1519012451172, 'logits/chosen': -3.399369716644287, 'logits/rejected': -3.463587999343872, 'epoch': 0.3}


 82%|████████▏ | 164/200 [00:57<00:12,  2.83it/s]

{'loss': 2.4548, 'grad_norm': 155.59097290039062, 'learning_rate': 2.422851302396655e-05, 'rewards/chosen': -3.4208788871765137, 'rewards/rejected': -7.44665002822876, 'rewards/accuracies': 0.75, 'rewards/margins': 4.0257720947265625, 'logps/chosen': -131.63360595703125, 'logps/rejected': -164.33741760253906, 'logits/chosen': -3.8959543704986572, 'logits/rejected': -3.885845184326172, 'epoch': 0.3}


 82%|████████▎ | 165/200 [00:58<00:12,  2.85it/s]

{'loss': 0.3353, 'grad_norm': 18.427936553955078, 'learning_rate': 2.315988891431412e-05, 'rewards/chosen': -4.669450283050537, 'rewards/rejected': -8.7616605758667, 'rewards/accuracies': 1.0, 'rewards/margins': 4.092209815979004, 'logps/chosen': -132.1154327392578, 'logps/rejected': -168.0927276611328, 'logits/chosen': -3.257932662963867, 'logits/rejected': -3.320267677307129, 'epoch': 0.3}


 83%|████████▎ | 166/200 [00:58<00:12,  2.81it/s]

{'loss': 0.0167, 'grad_norm': 1.803072214126587, 'learning_rate': 2.2112272123788768e-05, 'rewards/chosen': -1.881819248199463, 'rewards/rejected': -8.141996383666992, 'rewards/accuracies': 1.0, 'rewards/margins': 6.2601776123046875, 'logps/chosen': -189.015869140625, 'logps/rejected': -210.76914978027344, 'logits/chosen': -3.3163931369781494, 'logits/rejected': -3.2318596839904785, 'epoch': 0.3}


 84%|████████▎ | 167/200 [00:58<00:11,  2.84it/s]

{'loss': 0.8541, 'grad_norm': 26.090227127075195, 'learning_rate': 2.1085949060360654e-05, 'rewards/chosen': -2.321061134338379, 'rewards/rejected': -6.083037376403809, 'rewards/accuracies': 0.75, 'rewards/margins': 3.761976480484009, 'logps/chosen': -119.42318725585938, 'logps/rejected': -139.53848266601562, 'logits/chosen': -3.772460460662842, 'logits/rejected': -3.725174903869629, 'epoch': 0.31}


 84%|████████▍ | 168/200 [00:59<00:11,  2.83it/s]

{'loss': 0.5342, 'grad_norm': 39.63328552246094, 'learning_rate': 2.008120031050753e-05, 'rewards/chosen': -3.521346092224121, 'rewards/rejected': -7.866319179534912, 'rewards/accuracies': 1.0, 'rewards/margins': 4.344973087310791, 'logps/chosen': -165.0328369140625, 'logps/rejected': -173.78787231445312, 'logits/chosen': -3.555741310119629, 'logits/rejected': -3.324786901473999, 'epoch': 0.31}


 84%|████████▍ | 169/200 [00:59<00:10,  2.83it/s]

{'loss': 3.8765, 'grad_norm': nan, 'learning_rate': 2.008120031050753e-05, 'rewards/chosen': -4.808193683624268, 'rewards/rejected': -6.755735874176025, 'rewards/accuracies': 0.75, 'rewards/margins': 1.947542428970337, 'logps/chosen': -211.36453247070312, 'logps/rejected': -148.54318237304688, 'logits/chosen': -3.5197834968566895, 'logits/rejected': -3.5815701484680176, 'epoch': 0.31}


 85%|████████▌ | 170/200 [00:59<00:10,  2.84it/s]

{'loss': 1.8175, 'grad_norm': 245.85113525390625, 'learning_rate': 1.9098300562505266e-05, 'rewards/chosen': -2.8833248615264893, 'rewards/rejected': -6.182225227355957, 'rewards/accuracies': 0.75, 'rewards/margins': 3.298900842666626, 'logps/chosen': -126.18482208251953, 'logps/rejected': -153.65103149414062, 'logits/chosen': -3.274895191192627, 'logits/rejected': -3.502997875213623, 'epoch': 0.31}


 86%|████████▌ | 171/200 [01:00<00:10,  2.83it/s]

{'loss': 0.7772, 'grad_norm': 40.77063751220703, 'learning_rate': 1.8137518531330767e-05, 'rewards/chosen': -3.2975873947143555, 'rewards/rejected': -6.76677942276001, 'rewards/accuracies': 1.0, 'rewards/margins': 3.4691920280456543, 'logps/chosen': -128.86544799804688, 'logps/rejected': -157.43165588378906, 'logits/chosen': -3.414738178253174, 'logits/rejected': -3.3304641246795654, 'epoch': 0.31}


 86%|████████▌ | 172/200 [01:00<00:09,  2.82it/s]

{'loss': 0.5895, 'grad_norm': 29.861055374145508, 'learning_rate': 1.7199116885197995e-05, 'rewards/chosen': -2.5452895164489746, 'rewards/rejected': -7.2110490798950195, 'rewards/accuracies': 1.0, 'rewards/margins': 4.665759563446045, 'logps/chosen': -150.1628875732422, 'logps/rejected': -178.30902099609375, 'logits/chosen': -3.8941407203674316, 'logits/rejected': -3.709101676940918, 'epoch': 0.32}


 86%|████████▋ | 173/200 [01:01<00:09,  2.82it/s]

{'loss': 0.0313, 'grad_norm': 3.1884608268737793, 'learning_rate': 1.6283352173747145e-05, 'rewards/chosen': -1.6450939178466797, 'rewards/rejected': -7.116990566253662, 'rewards/accuracies': 1.0, 'rewards/margins': 5.471896648406982, 'logps/chosen': -173.57656860351562, 'logps/rejected': -174.09359741210938, 'logits/chosen': -3.1643309593200684, 'logits/rejected': -3.249605178833008, 'epoch': 0.32}


 87%|████████▋ | 174/200 [01:01<00:09,  2.83it/s]

{'loss': 1.2972, 'grad_norm': 89.38792419433594, 'learning_rate': 1.5390474757906446e-05, 'rewards/chosen': -2.3315727710723877, 'rewards/rejected': -7.779696464538574, 'rewards/accuracies': 0.75, 'rewards/margins': 5.448123931884766, 'logps/chosen': -142.33798217773438, 'logps/rejected': -173.73809814453125, 'logits/chosen': -3.221911668777466, 'logits/rejected': -3.4328997135162354, 'epoch': 0.32}


 88%|████████▊ | 175/200 [01:01<00:08,  2.84it/s]

{'loss': 0.0571, 'grad_norm': 9.201242446899414, 'learning_rate': 1.4520728741446089e-05, 'rewards/chosen': -1.6973693370819092, 'rewards/rejected': -9.805627822875977, 'rewards/accuracies': 1.0, 'rewards/margins': 8.108259201049805, 'logps/chosen': -156.2261962890625, 'logps/rejected': -211.06288146972656, 'logits/chosen': -3.2898411750793457, 'logits/rejected': -3.269989490509033, 'epoch': 0.32}


 88%|████████▊ | 176/200 [01:02<00:08,  2.84it/s]

{'loss': 1.653, 'grad_norm': 147.2357177734375, 'learning_rate': 1.3674351904242611e-05, 'rewards/chosen': -3.1279139518737793, 'rewards/rejected': -7.935665130615234, 'rewards/accuracies': 0.75, 'rewards/margins': 4.807751655578613, 'logps/chosen': -169.16513061523438, 'logps/rejected': -165.2903289794922, 'logits/chosen': -3.1841535568237305, 'logits/rejected': -3.414415121078491, 'epoch': 0.32}


 88%|████████▊ | 177/200 [01:02<00:08,  2.86it/s]

{'loss': 0.4638, 'grad_norm': 27.362586975097656, 'learning_rate': 1.2851575637272262e-05, 'rewards/chosen': -3.7702889442443848, 'rewards/rejected': -8.977831840515137, 'rewards/accuracies': 1.0, 'rewards/margins': 5.207542896270752, 'logps/chosen': -172.7432098388672, 'logps/rejected': -192.53778076171875, 'logits/chosen': -3.530802011489868, 'logits/rejected': -3.4338490962982178, 'epoch': 0.32}


 89%|████████▉ | 178/200 [01:02<00:07,  2.84it/s]

{'loss': 0.0468, 'grad_norm': 4.381634712219238, 'learning_rate': 1.2052624879351104e-05, 'rewards/chosen': -3.978036880493164, 'rewards/rejected': -8.930914878845215, 'rewards/accuracies': 1.0, 'rewards/margins': 4.952877998352051, 'logps/chosen': -193.77505493164062, 'logps/rejected': -184.3379364013672, 'logits/chosen': -3.4324560165405273, 'logits/rejected': -3.5034871101379395, 'epoch': 0.33}


 90%|████████▉ | 179/200 [01:03<00:07,  2.84it/s]

{'loss': 0.1265, 'grad_norm': 8.448249816894531, 'learning_rate': 1.1277718055638819e-05, 'rewards/chosen': -0.9541940689086914, 'rewards/rejected': -4.823429584503174, 'rewards/accuracies': 1.0, 'rewards/margins': 3.8692357540130615, 'logps/chosen': -129.4456329345703, 'logps/rejected': -137.799072265625, 'logits/chosen': -3.769815444946289, 'logits/rejected': -3.55161452293396, 'epoch': 0.33}


 90%|█████████ | 180/200 [01:03<00:07,  2.82it/s]

{'loss': 0.0976, 'grad_norm': 12.60205078125, 'learning_rate': 1.0527067017923654e-05, 'rewards/chosen': -2.8433001041412354, 'rewards/rejected': -9.720662117004395, 'rewards/accuracies': 1.0, 'rewards/margins': 6.877361297607422, 'logps/chosen': -181.73048400878906, 'logps/rejected': -200.56298828125, 'logits/chosen': -3.6219286918640137, 'logits/rejected': -3.741623878479004, 'epoch': 0.33}


 90%|█████████ | 181/200 [01:03<00:06,  2.84it/s]

{'loss': 0.0369, 'grad_norm': 4.933526039123535, 'learning_rate': 9.80087698670411e-06, 'rewards/chosen': -3.0420782566070557, 'rewards/rejected': -9.85689926147461, 'rewards/accuracies': 1.0, 'rewards/margins': 6.814821243286133, 'logps/chosen': -164.40847778320312, 'logps/rejected': -192.70120239257812, 'logits/chosen': -3.3358983993530273, 'logits/rejected': -3.3038763999938965, 'epoch': 0.33}


 91%|█████████ | 182/200 [01:04<00:06,  2.84it/s]

{'loss': 0.7154, 'grad_norm': 76.28240203857422, 'learning_rate': 9.09934649508375e-06, 'rewards/chosen': -2.7053120136260986, 'rewards/rejected': -8.208627700805664, 'rewards/accuracies': 0.75, 'rewards/margins': 5.503314971923828, 'logps/chosen': -150.51454162597656, 'logps/rejected': -178.4051055908203, 'logits/chosen': -3.5900516510009766, 'logits/rejected': -3.4885313510894775, 'epoch': 0.33}


 92%|█████████▏| 183/200 [01:04<00:06,  2.82it/s]

{'loss': 0.9833, 'grad_norm': 107.69036865234375, 'learning_rate': 8.422667334494249e-06, 'rewards/chosen': -2.651186466217041, 'rewards/rejected': -8.72341537475586, 'rewards/accuracies': 0.75, 'rewards/margins': 6.07222843170166, 'logps/chosen': -186.16940307617188, 'logps/rejected': -183.61825561523438, 'logits/chosen': -3.181234121322632, 'logits/rejected': -3.2583277225494385, 'epoch': 0.34}


 92%|█████████▏| 184/200 [01:04<00:05,  2.81it/s]

{'loss': 0.5157, 'grad_norm': 30.515953063964844, 'learning_rate': 7.771024502261526e-06, 'rewards/chosen': -4.123727798461914, 'rewards/rejected': -8.222733497619629, 'rewards/accuracies': 1.0, 'rewards/margins': 4.099005222320557, 'logps/chosen': -152.871826171875, 'logps/rejected': -173.82313537597656, 'logits/chosen': -3.277513265609741, 'logits/rejected': -3.2278361320495605, 'epoch': 0.34}


 92%|█████████▎| 185/200 [01:05<00:05,  2.80it/s]

{'loss': 0.1855, 'grad_norm': 9.096390724182129, 'learning_rate': 7.144596151029303e-06, 'rewards/chosen': -2.6872050762176514, 'rewards/rejected': -8.865982055664062, 'rewards/accuracies': 1.0, 'rewards/margins': 6.178775787353516, 'logps/chosen': -136.7864990234375, 'logps/rejected': -186.607666015625, 'logits/chosen': -3.434641122817993, 'logits/rejected': -3.442134380340576, 'epoch': 0.34}


 93%|█████████▎| 186/200 [01:05<00:04,  2.81it/s]

{'loss': 0.9448, 'grad_norm': 76.48226928710938, 'learning_rate': 6.543553540053926e-06, 'rewards/chosen': -3.381901741027832, 'rewards/rejected': -8.412220001220703, 'rewards/accuracies': 0.75, 'rewards/margins': 5.030317783355713, 'logps/chosen': -152.4949951171875, 'logps/rejected': -190.88449096679688, 'logits/chosen': -3.8666582107543945, 'logits/rejected': -3.8315725326538086, 'epoch': 0.34}


 94%|█████████▎| 187/200 [01:05<00:04,  2.83it/s]

{'loss': 0.1005, 'grad_norm': 8.281351089477539, 'learning_rate': 5.968060988383883e-06, 'rewards/chosen': -2.5206756591796875, 'rewards/rejected': -6.587883949279785, 'rewards/accuracies': 1.0, 'rewards/margins': 4.067208290100098, 'logps/chosen': -138.61672973632812, 'logps/rejected': -160.28256225585938, 'logits/chosen': -3.6029300689697266, 'logits/rejected': -3.554306983947754, 'epoch': 0.34}


 94%|█████████▍| 188/200 [01:06<00:04,  2.83it/s]

{'loss': 0.4584, 'grad_norm': 34.578372955322266, 'learning_rate': 5.418275829936537e-06, 'rewards/chosen': -3.2777726650238037, 'rewards/rejected': -8.683165550231934, 'rewards/accuracies': 1.0, 'rewards/margins': 5.405392646789551, 'logps/chosen': -166.7337188720703, 'logps/rejected': -186.65892028808594, 'logits/chosen': -3.5885627269744873, 'logits/rejected': -3.279106616973877, 'epoch': 0.35}


 94%|█████████▍| 189/200 [01:06<00:03,  2.85it/s]

{'loss': 1.5503, 'grad_norm': 115.17231750488281, 'learning_rate': 4.8943483704846475e-06, 'rewards/chosen': -3.338913679122925, 'rewards/rejected': -8.52509880065918, 'rewards/accuracies': 0.75, 'rewards/margins': 5.186184883117676, 'logps/chosen': -170.28872680664062, 'logps/rejected': -208.3668212890625, 'logits/chosen': -3.2521791458129883, 'logits/rejected': -2.9457993507385254, 'epoch': 0.35}


 95%|█████████▌| 190/200 [01:07<00:03,  2.84it/s]

{'loss': 0.0999, 'grad_norm': 7.908519744873047, 'learning_rate': 4.3964218465642355e-06, 'rewards/chosen': -1.2307579517364502, 'rewards/rejected': -5.86699914932251, 'rewards/accuracies': 1.0, 'rewards/margins': 4.6362409591674805, 'logps/chosen': -111.5626220703125, 'logps/rejected': -136.87179565429688, 'logits/chosen': -3.387626886367798, 'logits/rejected': -3.2190873622894287, 'epoch': 0.35}


 96%|█████████▌| 191/200 [01:07<00:03,  2.81it/s]

{'loss': 0.0326, 'grad_norm': 1.97983980178833, 'learning_rate': 3.924632386315186e-06, 'rewards/chosen': -2.765883445739746, 'rewards/rejected': -10.04569149017334, 'rewards/accuracies': 1.0, 'rewards/margins': 7.2798075675964355, 'logps/chosen': -191.15054321289062, 'logps/rejected': -199.9362030029297, 'logits/chosen': -3.5091586112976074, 'logits/rejected': -3.54352068901062, 'epoch': 0.35}


 96%|█████████▌| 192/200 [01:07<00:02,  2.80it/s]

{'loss': 0.0992, 'grad_norm': 5.293581962585449, 'learning_rate': 3.4791089722651436e-06, 'rewards/chosen': -3.0066826343536377, 'rewards/rejected': -9.316457748413086, 'rewards/accuracies': 1.0, 'rewards/margins': 6.309774875640869, 'logps/chosen': -145.24757385253906, 'logps/rejected': -191.03115844726562, 'logits/chosen': -3.645689010620117, 'logits/rejected': -3.4909720420837402, 'epoch': 0.35}


 96%|█████████▋| 193/200 [01:08<00:02,  2.84it/s]

{'loss': 0.0325, 'grad_norm': 3.059232473373413, 'learning_rate': 3.059973406066963e-06, 'rewards/chosen': -2.770373582839966, 'rewards/rejected': -7.848797798156738, 'rewards/accuracies': 1.0, 'rewards/margins': 5.078424453735352, 'logps/chosen': -140.89297485351562, 'logps/rejected': -164.87339782714844, 'logits/chosen': -3.679995536804199, 'logits/rejected': -3.5649547576904297, 'epoch': 0.35}


 97%|█████████▋| 194/200 [01:08<00:02,  2.83it/s]

{'loss': 0.0015, 'grad_norm': 0.16340817511081696, 'learning_rate': 2.667340275199426e-06, 'rewards/chosen': -1.552405834197998, 'rewards/rejected': -11.569478034973145, 'rewards/accuracies': 1.0, 'rewards/margins': 10.017072677612305, 'logps/chosen': -155.2302703857422, 'logps/rejected': -212.42030334472656, 'logits/chosen': -3.4118423461914062, 'logits/rejected': -3.2147715091705322, 'epoch': 0.36}


 98%|█████████▊| 195/200 [01:08<00:01,  2.83it/s]

{'loss': 0.0082, 'grad_norm': 1.004295825958252, 'learning_rate': 2.3013169216400733e-06, 'rewards/chosen': -0.8833601474761963, 'rewards/rejected': -7.912031650543213, 'rewards/accuracies': 1.0, 'rewards/margins': 7.0286712646484375, 'logps/chosen': -140.00177001953125, 'logps/rejected': -181.34799194335938, 'logits/chosen': -3.426962375640869, 'logits/rejected': -3.2643208503723145, 'epoch': 0.36}


 98%|█████████▊| 196/200 [01:09<00:01,  2.83it/s]

{'loss': 1.1231, 'grad_norm': 132.69284057617188, 'learning_rate': 1.9620034125190644e-06, 'rewards/chosen': -2.967053174972534, 'rewards/rejected': -7.046427249908447, 'rewards/accuracies': 0.75, 'rewards/margins': 4.079373836517334, 'logps/chosen': -141.2017059326172, 'logps/rejected': -164.1977996826172, 'logits/chosen': -3.4507274627685547, 'logits/rejected': -3.2560410499572754, 'epoch': 0.36}


 98%|█████████▊| 197/200 [01:09<00:01,  2.83it/s]

{'loss': 0.1598, 'grad_norm': 26.275968551635742, 'learning_rate': 1.6494925127617634e-06, 'rewards/chosen': -2.9535467624664307, 'rewards/rejected': -7.916821002960205, 'rewards/accuracies': 1.0, 'rewards/margins': 4.963274002075195, 'logps/chosen': -175.5862274169922, 'logps/rejected': -168.4267120361328, 'logits/chosen': -3.597973346710205, 'logits/rejected': -3.4972290992736816, 'epoch': 0.36}


 99%|█████████▉| 198/200 [01:09<00:00,  2.84it/s]

{'loss': 0.0406, 'grad_norm': 3.878925085067749, 'learning_rate': 1.3638696597277679e-06, 'rewards/chosen': -4.154392242431641, 'rewards/rejected': -9.460655212402344, 'rewards/accuracies': 1.0, 'rewards/margins': 5.306262969970703, 'logps/chosen': -169.03842163085938, 'logps/rejected': -187.26657104492188, 'logits/chosen': -3.359192371368408, 'logits/rejected': -3.260627031326294, 'epoch': 0.36}


100%|█████████▉| 199/200 [01:10<00:00,  2.86it/s]

{'loss': 1.0008, 'grad_norm': 38.04646682739258, 'learning_rate': 1.1052129398531507e-06, 'rewards/chosen': -3.4103846549987793, 'rewards/rejected': -5.077856063842773, 'rewards/accuracies': 1.0, 'rewards/margins': 1.6674716472625732, 'logps/chosen': -128.02896118164062, 'logps/rejected': -148.8524169921875, 'logits/chosen': -3.24297833442688, 'logits/rejected': -3.203282117843628, 'epoch': 0.37}


100%|██████████| 200/200 [01:10<00:00,  2.86it/s]

{'loss': 0.1848, 'grad_norm': 14.259449005126953, 'learning_rate': 8.735930673024806e-07, 'rewards/chosen': -2.9168176651000977, 'rewards/rejected': -6.3627166748046875, 'rewards/accuracies': 1.0, 'rewards/margins': 3.44589900970459, 'logps/chosen': -136.66351318359375, 'logps/rejected': -151.71566772460938, 'logits/chosen': -3.398387908935547, 'logits/rejected': -3.45803165435791, 'epoch': 0.37}


100%|██████████| 200/200 [01:11<00:00,  2.86it/s]

{'train_runtime': 71.3365, 'train_samples_per_second': 11.214, 'train_steps_per_second': 2.804, 'train_loss': 1.00289505251043, 'epoch': 0.37}


100%|██████████| 200/200 [01:12<00:00,  2.77it/s]


TrainOutput(global_step=200, training_loss=1.00289505251043, metrics={'train_runtime': 71.3365, 'train_samples_per_second': 11.214, 'train_steps_per_second': 2.804, 'total_flos': 0.0, 'train_loss': 1.00289505251043, 'epoch': 0.36714089031665903})
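
As a reading aid for the log above, here is a minimal sketch of how the reward fields relate to each other. It assumes the log comes from TRL's `DPOTrainer`, where `rewards/margins` is the batch mean of `rewards/chosen - rewards/rejected` and `rewards/accuracies` is the fraction of pairs where the chosen answer gets the higher reward. The two entries are copied from the log, and `summarize` is a hypothetical helper written only for this illustration. It also flags non-finite gradient norms, since one step in the log reports `grad_norm: nan`.

```python
import math

# Two entries copied from the training log above (only the fields used here).
log_entries = [
    {
        "loss": 0.2899,
        "grad_norm": 26.329198837280273,
        "rewards/chosen": -2.053863286972046,
        "rewards/rejected": -7.032807350158691,
        "rewards/margins": 4.978943824768066,
    },
    {
        "loss": 3.8765,
        "grad_norm": float("nan"),  # logged as `nan` in the output above
        "rewards/chosen": -4.808193683624268,
        "rewards/rejected": -6.755735874176025,
        "rewards/margins": 1.947542428970337,
    },
]


def summarize(entry: dict) -> dict:
    """Recompute the margin and check the gradient norm (hypothetical helper)."""
    margin = entry["rewards/chosen"] - entry["rewards/rejected"]
    return {
        "margin": margin,
        # Small tolerance: the logged values are float32 batch means.
        "margin_matches_log": math.isclose(
            margin, entry["rewards/margins"], abs_tol=1e-4
        ),
        "finite_grad_norm": math.isfinite(entry["grad_norm"]),
    }


for entry in log_entries:
    print(summarize(entry))
```

A growing positive margin means the policy increasingly prefers the chosen answers over the rejected ones. The per-step loss is an average over the batch, so it cannot be recomputed from the mean margin alone.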

## 2.4 Testing the model after DPO

Let's test the new model:

In [18]:
device = "cuda:0"

encoding = tokenizer(prompt_2, return_tensors="pt").to(device)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

system
You are a helpful assistant
user
Can you taste this dish and tell me if it needs more spices?
assistant
The assistant can provide information on whether the given dish contains any spices or herbs and whether or not additional spices are necessary based on typical seasoning practices and preferences. They would need additional context or specific details about the dish in question to make an accurate assessment.


Okay, now we are getting somewhere! The answer is precise and coherent, it stops at the right time, and it does not repeat itself. And this is only a 0.5B model!
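
To make the "no repetitions" observation a little less subjective, one option is to measure how many word n-grams in an answer duplicate an earlier one. This is only a rough sketch: `repeated_ngram_ratio` is a hypothetical helper, not part of `transformers` or `trl`, and the choice of `n=3` is an assumption.

```python
def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that duplicate an earlier n-gram (0.0 = none)."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        # Text shorter than n words: nothing can repeat.
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


looping_answer = "the dish is good the dish is good the dish is good"
clean_answer = "They would need additional context about the dish in question."

print(repeated_ngram_ratio(looping_answer))
print(repeated_ngram_ratio(clean_answer))
```

A value close to 0 suggests the answer does not loop, while a degenerate generation that keeps repeating a phrase scores much higher.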

So, naturally, as in Part 1, let's test the model a few more times:

In [19]:
def generate_response(question: str) -> str:
    # Wrap the question in the chat format expected by the model.
    chat = [
        {'role': 'user', 'content': question}
    ]
    prompt = tokenizer.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)
    encoding = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.inference_mode():
        outputs = model.generate(
            input_ids=encoding.input_ids,
            attention_mask=encoding.attention_mask,
            generation_config=generation_config,
        )

    # Keep only the tokens generated after the prompt, so that the system and
    # user turns are not echoed back in the response.
    prompt_length = encoding.input_ids.shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return response.strip()

In [20]:
prompt = "Do people dream in color or black and white?"
print('-', prompt, '\n')
print(generate_response(prompt))

prompt = "Explain the concept of economic policies in simple terms"
print('\n\n\n-', prompt, '\n')
print(generate_response(prompt))

prompt = "Explain the effects of globalization on the environment."
print('\n\n\n-', prompt, '\n')
print(generate_response(prompt))

- Do people dream in color or black and white?

According to various sources and studies, such as literature and historical accounts, it is widely believed that people have historically experienced dreams in black and white. However, this perception has evolved over time and is influenced by cultural and societal factors, including the prevalence of black and white images in media and art, as well as the development of different visual perception and cognitive abilities.



- Explain the concept of economic policies in simple terms

An economic policy refers to a set of strategies, regulations, and decisions made by governments or other entities to influence economic behavior and outcomes. These policies can involve various aspects such as taxation, inflation, budgeting, trade agreements, and more. They a

# Conclusion

With the right datasets and the right tools, even 0.5B models can generate very good answers. I hope you found this short introduction interesting and that you will stick around to see more!