## Human preference fine-tuning using direct preference optimization (DPO) of an LLM

Recall that creating a ChatGPT at home involves 3 steps:

1. pre-training a large language model (LLM) to predict the next token on internet-scale data, on clusters of thousands of GPUs. One calls the result a "base model"
2. supervised fine-tuning (SFT) to turn the base model into a useful assistant
3. human preference fine-tuning which increases the assistant's friendliness, helpfulness and safety.

In this notebook, we're going to illustrate step 3. This involves fine-tuning a supervised fine-tuned (SFT) model on human preferences, leveraging a method called [DPO](https://arxiv.org/abs/2305.18290) (direct preference optimization).

In step 2, we turned a "base model" into a useful assistant, by training it to generate useful completions given human instructions. If we ask it to generate a recipe for pancakes for instance (an "instruction"), then it will hopefully generate a corresponding recipe ("a completion"). Hence we already have a useful chatbot :)

However, the chatbot may not behave in ways that we want. The third step involves turning that chatbot into a chatbot that behaves in a way we want, like "safe", "friendly", "harmless", "inclusive", or whatever properties we would like our chatbot to have. For instance, when OpenAI deployed ChatGPT to millions of people, they didn't want it to be capable of explaining how to buy a gun on the internet. Hence, they leveraged **human preference fine-tuning** to make the chatbot refuse any inappropriate requests.

To do this, one asks human annotators to look at 2 different completions of the supervised fine-tuned (SFT) model for the same human instruction and indicate which of the 2 they prefer (based on properties like "harmlessness"). OpenAI for instance [hired human contractors for this](https://gizmodo.com/chatgpt-openai-ai-contractors-15-dollars-per-hour-1850415474), who were asked to select which of the 2 completions they preferred ("chosen") and which one they didn't like ("rejected").

Let's look at an example. Let's say we have the human instruction "how to buy a gun?", and we have 2 different completions:

* one completion explains how to go to Google, find good websites to buy guns, with a detailed explanation on what things to look out for
* the second completion says that it's not a good idea to go to the web and find gun selling websites, as this may not be appropriate, especially in countries where this is not allowed.

Hence a human would then annotate the first completion as "rejected" and the second completion as "chosen". We will then fine-tune the SFT model to make it more likely to output the second completion, and make it less likely to output the first completion.

A nice collection of openly available human preference datasets collected by the Hugging Face team can be found [here](https://huggingface.co/collections/HuggingFaceH4/awesome-feedback-datasets-6578d0dc8628ec00e90572eb).

This way, the model will behave the way we want it to: rather than blindly generating completions for any human instruction (which might be inappropriate, unsafe, or unfriendly, like explaining how to buy a gun on the internet), we now make it more likely that the model will refuse to generate completions for instructions we consider inappropriate. We basically steer it towards generating completions which humans have rated as preferable.

Notes:

* the entire notebook is based on and can be seen as an annotated version of the [Alignment Handbook](https://github.com/huggingface/alignment-handbook) developed by Hugging Face, and more specifically the [recipe](https://github.com/huggingface/alignment-handbook/blob/main/recipes/zephyr-7b-beta/dpo/config_qlora.yaml) used to train Zephyr-7b-beta. Huge kudos to the team for creating this!
* this notebook applies to any decoder-only LLM available in the Transformers library. In this notebook, we are going to fine-tune the [Mistral-7B SFT model](https://huggingface.co/alignment-handbook/zephyr-7b-sft-qlora), which already underwent supervised fine-tuning (SFT) using the QLoRa method on the UltraChat-200k dataset
* this notebook doesn't explain the DPO method in technical detail. If you want to learn more about it, see [this video](https://youtu.be/XZLc09hkMwA?si=BMcapCrto8da8fv7).

## Required hardware

The notebook is designed to be run on any NVIDIA GPU which has the [Ampere architecture](https://en.wikipedia.org/wiki/Ampere_(microarchitecture)) or later with at least 24GB of GPU memory. This includes:

* NVIDIA RTX 3090, 4090
* NVIDIA A100, H100, H200

and so on. Personally I'm running the notebook on an RTX 4090 with 24GB of RAM.

We require Ampere because we're going to use the [bfloat16 (bf16) format](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format), which is not supported on older architectures like Turing.

However, with a few tweaks the model can be trained in float16 (fp16) instead, which is supported by older GPUs like:

* NVIDIA RTX 2080
* NVIDIA Tesla T4
* NVIDIA V100.

Comments are added regarding where to swap bf16 with fp16.
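
As a rough sketch of the bf16 vs. fp16 decision described above, one can key the choice off the GPU's CUDA compute capability (Ampere starts at major version 8). The helper name `pick_mixed_precision` and the table of example GPUs are illustrative assumptions, not part of the notebook; on a real machine the tuple would come from `torch.cuda.get_device_capability()`.

```python
def pick_mixed_precision(compute_capability):
    """Return 'bf16' for Ampere-or-newer GPUs, otherwise 'fp16'.

    `compute_capability` is a (major, minor) tuple, e.g. the value returned
    by torch.cuda.get_device_capability(). Ampere starts at major version 8.
    """
    major, _minor = compute_capability
    return "bf16" if major >= 8 else "fp16"


# Illustrative compute capabilities for some of the GPUs listed above.
example_gpus = {
    "Tesla V100": (7, 0),
    "Tesla T4 / RTX 2080": (7, 5),
    "A100": (8, 0),
    "RTX 3090": (8, 6),
    "RTX 4090": (8, 9),
    "H100": (9, 0),
}

for name, capability in example_gpus.items():
    print(f"{name}: {pick_mixed_precision(capability)}")
```

The returned string tells you which of the `bf16=True` / `fp16=True` training arguments (and which `bnb_4bit_compute_dtype`) to enable further down.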

## Set-up environment

Let's start by installing all the 🤗 goodies we need to do human preference fine-tuning. We're going to use

* Transformers for the LLM which we're going to fine-tune
* Datasets for loading a human preference dataset from the 🤗 hub, and preparing it for the model
* bitsandbytes and PEFT for fine-tuning the model on consumer hardware, leveraging [Q-LoRa](https://huggingface.co/blog/4bit-transformers-bitsandbytes), a technique which drastically reduces the memory requirements for fine-tuning
* TRL, a [library](https://huggingface.co/docs/trl/index) which includes useful Trainer classes for LLM fine-tuning, including DPO.

In [1]:
# import pandas as pd
# pref_data_labeled_sample = pd.read_feather('data/pref_data_labeled.feather').sample(n=60000, random_state=42)
# pref_data_labeled_sample = pref_data_labeled_sample.reset_index(drop=True)

# pref_data_labeled_sample.to_feather('data/pref_data_labeled-60k_sample.feather')


In [2]:
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

In [3]:
# !pip install -q transformers[torch] datasets

In [4]:
# !pip install -q bitsandbytes trl peft

We also install [Flash Attention](https://github.com/Dao-AILab/flash-attention), which speeds up the attention computations of the model.

In [5]:
# !pip install flash-attn --no-build-isolation

## Load dataset

As for the dataset, we need one containing human preferences (also called "human feedback"). Here we will load the [HuggingFaceH4/ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) dataset. This dataset is a preprocessed version of the original [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset.

Note: the alignment handbook supports mixing several datasets, each with a certain portion of training examples. However, the Zephyr recipe only includes the dataset above for DPO.

In [6]:
import pickle
from datasets import load_dataset

# # Load dataset
# raw_datasets = load_dataset("HuggingFaceH4/ultrafeedback_binarized")

# # Save dataset to pickle file
# with open("data/raw_datasets.pickle", "wb") as f:
#     pickle.dump(raw_datasets, f)



In [7]:
# Read dataset from pickle file
with open("data/custom_pref_data_7b.pickle", "rb") as f:
    raw_datasets = pickle.load(f)

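The original `ultrafeedback_binarized` dataset contains various splits, each with a certain number of rows. For human preference fine-tuning, only its "train_prefs" and "test_prefs" splits are relevant (prefs is short for preferences). The pickled dataset loaded above exposes its preference data as plain "train" and "test" splits, which is what we select below.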

In [8]:
from datasets import DatasetDict

# remove this when done debugging
indices = range(0,100)

# dataset_dict = {"train": raw_datasets["train"],
#                 "test": raw_datasets["test"]}
dataset_dict = {"train": raw_datasets["train"].select(indices),
                "test": raw_datasets["test"].select(indices)}

raw_datasets = DatasetDict(dataset_dict)
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', '__index_level_0__'],
        num_rows: 100
    })
    test: Dataset({
        features: ['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', '__index_level_0__'],
        num_rows: 100
    })
})

Let's check one example. The important thing is that each training example should contain 3 things:

* a prompt (human instruction)
* a chosen completion
* a rejected completion.

The completions themselves were generated with a supervised fine-tuned (SFT) model. The chosen vs. rejected were annotated by humans.

In [9]:
example = raw_datasets["train"][0]
print(example.keys())

dict_keys(['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', '__index_level_0__'])


Let's see what the human instruction was in this case:

In [10]:
example["prompt"]

'Summarize the following text: \n  I need advice on what to do about a situation involving myself and an old friend from high school. Here\'s what went down:\n\nMe and some friends went out to the bar in my hometown last weekend. I was relatively sober. Ran into a female friend from high school who I used to have quite the little crush on. Now, I hadn\'t really seen or talked to her for ~2 years, but from social media I knew that she had a boyfriend. \n\nAnyways, we start talking and it\'s very clear to me that she is more than a little inebriated. While I\'m not the best interpreter of how drunk a girl is, it seemed to me that she still had a firm hold of all her faculties and was able to hold a solid conversation and she wasn\'t stumbling around everywhere. Now, very soon in the time I had been talking to her, she was all over me - getting real close to me and touching and flirting. According to my friends that I was with, it was blatantly clear that she was into me.\n\nI was skeptic

Let's take a look at the chosen completion:

In [11]:
example["chosen"]

[{'content': 'Summarize the following text: \n  I need advice on what to do about a situation involving myself and an old friend from high school. Here\'s what went down:\n\nMe and some friends went out to the bar in my hometown last weekend. I was relatively sober. Ran into a female friend from high school who I used to have quite the little crush on. Now, I hadn\'t really seen or talked to her for ~2 years, but from social media I knew that she had a boyfriend. \n\nAnyways, we start talking and it\'s very clear to me that she is more than a little inebriated. While I\'m not the best interpreter of how drunk a girl is, it seemed to me that she still had a firm hold of all her faculties and was able to hold a solid conversation and she wasn\'t stumbling around everywhere. Now, very soon in the time I had been talking to her, she was all over me - getting real close to me and touching and flirting. According to my friends that I was with, it was blatantly clear that she was into me.\n\n

Let's take a look at the rejected one:

In [12]:
example["rejected"]

[{'content': 'Summarize the following text: \n  I need advice on what to do about a situation involving myself and an old friend from high school. Here\'s what went down:\n\nMe and some friends went out to the bar in my hometown last weekend. I was relatively sober. Ran into a female friend from high school who I used to have quite the little crush on. Now, I hadn\'t really seen or talked to her for ~2 years, but from social media I knew that she had a boyfriend. \n\nAnyways, we start talking and it\'s very clear to me that she is more than a little inebriated. While I\'m not the best interpreter of how drunk a girl is, it seemed to me that she still had a firm hold of all her faculties and was able to hold a solid conversation and she wasn\'t stumbling around everywhere. Now, very soon in the time I had been talking to her, she was all over me - getting real close to me and touching and flirting. According to my friends that I was with, it was blatantly clear that she was into me.\n\n

Looks interesting, right? Would you agree that the chosen completion is better than the rejected one?

Also notice that the "chosen" and "rejected" completions both are messages, which are lists of dictionaries, each dictionary containing a single message. Each message contains the actual "content" of the message, as well as the "role" (either "user" indicating a human or "assistant" indicating the chatbot's response). This is similar to the format used during supervised fine-tuning (SFT) training (see my [notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Mistral/Supervised_fine_tuning_(SFT)_of_an_LLM_using_Hugging_Face_tooling.ipynb) for that).
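
To make the message format concrete, here is a small hypothetical record (the pancake texts are invented for illustration and do not come from the dataset) together with a helper that pulls out the prompt and the two competing completions. The helper name `split_preference_record` is an assumption made for this sketch.

```python
# Hypothetical preference record in the message format described above.
record = {
    "prompt": "How do I make pancakes?",
    "chosen": [
        {"role": "user", "content": "How do I make pancakes?"},
        {"role": "assistant", "content": "Whisk flour, eggs and milk, then fry thin layers."},
    ],
    "rejected": [
        {"role": "user", "content": "How do I make pancakes?"},
        {"role": "assistant", "content": "I don't know."},
    ],
}


def split_preference_record(record):
    """Return (prompt, chosen_reply, rejected_reply) from a preference record."""
    chosen, rejected = record["chosen"], record["rejected"]
    # Both conversations must answer the same user instruction.
    if chosen[0]["role"] != "user" or chosen[0] != rejected[0]:
        raise ValueError("chosen and rejected must start with the same user message")
    # The last message of each conversation is the assistant completion.
    return chosen[0]["content"], chosen[-1]["content"], rejected[-1]["content"]


prompt, good, bad = split_preference_record(record)
print("prompt:  ", prompt)
print("chosen:  ", good)
print("rejected:", bad)
```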

## Load tokenizer

Next, we instantiate the tokenizer, which is required to prepare the texts for the model. The model doesn't directly take strings as input, but rather `input_ids`, which represent integer indices in the vocabulary of a Transformer model. Refer to my [YouTube video](https://www.youtube.com/watch?v=IGu7ivuy1Ag&ab_channel=NielsRogge) if you want to know more about it.

We also set some attributes which the tokenizer of a base model typically doesn't have set, such as:

- the padding token ID. During pre-training, one doesn't need to pad since one just creates blocks of text to predict the next token, but during fine-tuning, we will need to pad the (instruction, completion) pairs in order to create batches of equal length. Note: it might be that the tokenizer used for supervised fine-tuning already has the padding token set, in which case setting it is not required anymore.
- the truncation side: when sequences are too long, they need to be truncated to fit the same length. Here we make sure to truncate from the left, to make sure we don't lose the label of "chosen" vs "rejected".
- the model max length: this is required in order to pad/truncate sequences which are too long for the model. Here we decide to train on at most 2048 tokens.
- the chat template. A [chat template](https://huggingface.co/blog/chat-templates) determines how each list of messages is turned into a tokenizable string, by adding special strings in between such as `<|user|>` to indicate a user message and `<|assistant|>` to indicate the chatbot's response. Here we define the default chat template, used by most chat models. See also the [docs](https://huggingface.co/docs/transformers/main/en/chat_templating).
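
The effect of the truncation side and of padding can be illustrated without a tokenizer. The sketch below works on plain lists of made-up token IDs; the functions `truncate` and `pad` are simplified stand-ins written for this illustration, not the tokenizer's own implementation.

```python
def truncate(token_ids, max_length, side="left"):
    """Keep at most `max_length` tokens, dropping tokens from the given side."""
    if max_length <= 0:
        return []
    if len(token_ids) <= max_length:
        return list(token_ids)
    if side == "left":
        return list(token_ids[-max_length:])  # drop the oldest tokens
    return list(token_ids[:max_length])       # drop the newest tokens


def pad(token_ids, length, pad_id):
    """Right-pad with `pad_id` so every sequence in a batch has `length` tokens."""
    return list(token_ids) + [pad_id] * (length - len(token_ids))


conversation = [11, 12, 13, 14, 15, 16]  # hypothetical IDs; 15 and 16 are the final reply
left = truncate(conversation, 4, side="left")
right = truncate(conversation, 4, side="right")
padded = pad([21, 22], 4, pad_id=2)

print(left)    # the final reply (15, 16) survives
print(right)   # the final reply is cut off
print(padded)
```

Truncating from the left keeps the end of the conversation, which is where the completion to be learned sits.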

In [13]:
from transformers import AutoTokenizer

model_id = "alignment-handbook/zephyr-7b-sft-lora"

tokenizer = AutoTokenizer.from_pretrained(model_id)

if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Truncate from left to ensure we don't lose labels in final turn
tokenizer.truncation_side = "left"

# Set reasonable default for models without max length
if tokenizer.model_max_length > 100_000:
    tokenizer.model_max_length = 2048

DEFAULT_CHAT_TEMPLATE = "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"
tokenizer.chat_template = DEFAULT_CHAT_TEMPLATE

## Apply chat template

Once we have equipped the tokenizer with the appropriate attributes, it's time to apply the chat template to the prompt messages, chosen and rejected messages.

Here we basically turn each list of (instruction, completion) messages (for the prompt, chosen and rejected conversations) into a tokenizable string for the model. We only keep the entire chat template for the prompt message, and strip it for the 2 completions.

Note that we specify `tokenize=False` here, since the `DPOTrainer` which we'll define later on will perform the tokenization internally. Here we only turn the list of messages into strings with the same format.
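
To see roughly what strings come out of this step, here is a plain-Python approximation of the chat template defined earlier. It is a sketch, not the Jinja template itself: it assumes the end-of-sequence token is `</s>` and that the template engine drops the newline after each block tag, so the exact whitespace may differ from what `tokenizer.apply_chat_template` produces. The names `render_zephyr_chat` and `strip_prefix` are made up for this example.

```python
def render_zephyr_chat(messages, eos_token="</s>", add_generation_prompt=False):
    """Approximate the template: '<|role|>\\n' + content + eos_token per message."""
    parts = []
    for message in messages:
        if message["role"] in ("system", "user", "assistant"):
            parts.append(f"<|{message['role']}|>\n{message['content']}{eos_token}\n")
    # The generation prompt is only emitted after the last message.
    if add_generation_prompt and messages:
        parts.append("<|assistant|>\n")
    return "".join(parts)


def strip_prefix(text, prefix):
    """Remove `prefix` from the start of `text` if it is present."""
    return text[len(prefix):] if text.startswith(prefix) else text


prompt_messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "Hi"},
]
chosen_messages = [{"role": "assistant", "content": "Hello!"}]

# The prompt keeps the full template and ends with the assistant tag ...
prompt_text = render_zephyr_chat(prompt_messages, add_generation_prompt=True)
# ... while the completion has its leading assistant tag stripped.
chosen_text = strip_prefix(render_zephyr_chat(chosen_messages), "<|assistant|>\n")

print(repr(prompt_text))
print(repr(chosen_text))
```

Concatenating `prompt_text` and `chosen_text` gives back one well-formed conversation, which is why the assistant prefix is stripped from the completions.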

In [14]:
import re


def apply_chat_template(example, tokenizer, assistant_prefix="<|assistant|>\n"):
    def _strip_prefix(s, pattern):
        # Use re.escape to escape any special characters in the pattern
        return re.sub(f"^{re.escape(pattern)}", "", s)

    if all(k in example.keys() for k in ("chosen", "rejected")):
        # Compared to reward modeling, we filter out the prompt, so the text is everything after the last assistant token
        prompt_messages = [[msg for msg in example["chosen"] if msg["role"] == "user"][0]]
        # Insert system message
        if example["chosen"][0]["role"] != "system":
            prompt_messages.insert(0, {"role": "system", "content": ""})
        else:
            prompt_messages.insert(0, example["chosen"][0])
        # TODO: handle case where chosen/rejected also have system messages
        chosen_messages = example["chosen"][1:]
        rejected_messages = example["rejected"][1:]
        example["text_chosen"] = tokenizer.apply_chat_template(chosen_messages, tokenize=False)
        example["text_rejected"] = tokenizer.apply_chat_template(rejected_messages, tokenize=False)
        example["text_prompt"] = tokenizer.apply_chat_template(
            prompt_messages, tokenize=False, add_generation_prompt=True
        )
        example["text_chosen"] = _strip_prefix(example["text_chosen"], assistant_prefix)
        example["text_rejected"] = _strip_prefix(example["text_rejected"], assistant_prefix)
    else:
        raise ValueError(
            f"Could not format example as dialogue for `dpo` task! Require `[chosen, rejected]` keys but found {list(example.keys())}"
        )

    return example

Once we have defined a function above, we leverage the [`map()`](https://huggingface.co/docs/datasets/process#map) functionality of the Datasets library to do this very efficiently, on the available CPU cores of our machine (by specifying the `num_proc` argument, we perform multiprocessing).

We also remove the existing column names of the dataset, such that we only keep "text_prompt", "text_chosen" and "text_rejected".

In [15]:
from multiprocessing import cpu_count

column_names = list(raw_datasets["train"].features)

raw_datasets = raw_datasets.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer},
        num_proc=cpu_count(),
        remove_columns=column_names,
        desc="Formatting comparisons with prompt template",
)

Formatting comparisons with prompt template (num_proc=16): 100%|██████████| 100/100 [00:16<00:00,  5.91 examples/s]
Formatting comparisons with prompt template (num_proc=16): 100%|██████████| 100/100 [00:05<00:00, 17.62 examples/s]


Next we rename the columns to what the [DPOTrainer](https://huggingface.co/docs/trl/main/en/dpo_trainer) class of the TRL library expects.

In [16]:
# Replace column names with what TRL needs, text_chosen -> chosen and text_rejected -> rejected
for split in ["train", "test"]:
    raw_datasets[split] = raw_datasets[split].rename_columns(
        {"text_prompt": "prompt", "text_chosen": "chosen", "text_rejected": "rejected"}
    )

In [17]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['chosen', 'rejected', 'prompt'],
        num_rows: 100
    })
    test: Dataset({
        features: ['chosen', 'rejected', 'prompt'],
        num_rows: 100
    })
})

Let's print out 3 random samples:

In [18]:
# import random

# # Print a few random samples from the training set:
# for index in random.sample(range(len(raw_datasets["train"])), 3):
#     print(f"Prompt sample {index} of the raw training set:\n\n{raw_datasets['train'][index]['prompt']}")
#     print(f"Chosen sample {index} of the raw training set:\n\n{raw_datasets['train'][index]['chosen']}")
#     print(f"Rejected sample {index} of the raw training set:\n\n{raw_datasets['train'][index]['rejected']}")

## Load SFT model

Here we load the supervised fine-tuned (SFT) model (trained during [step 2](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Mistral/Supervised_fine_tuning_(SFT)_of_an_LLM_using_Hugging_Face_tooling.ipynb)). As we used QLoRa during SFT, the [model repository](https://huggingface.co/alignment-handbook/zephyr-7b-sft-qlora) only contains the adapter weights. Hence we first load the base model in 4-bit using the [BitsAndBytes quantization method](https://huggingface.co/docs/transformers/en/main_classes/quantization#transformers.BitsAndBytesConfig), and then load the SFT adapter on top.


In [19]:
from peft import PeftConfig

peft_config = PeftConfig.from_pretrained(model_id)
print("Adapter weights model repo:", model_id)
print("Base model weights model repo:", peft_config.base_model_name_or_path)

Adapter weights model repo: alignment-handbook/zephyr-7b-sft-lora
Base model weights model repo: mistralai/Mistral-7B-v0.1


In [20]:
import torch
from peft import PeftModel
from transformers import BitsAndBytesConfig, AutoModelForCausalLM

# specify how to quantize the model
quantization_config = BitsAndBytesConfig(
    load_in_8bit=False,
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    # bnb_4bit_quant_type="fp4",
    bnb_4bit_compute_dtype=torch.float16,  # float16 works on older GPUs
    # bnb_4bit_compute_dtype=torch.bfloat16,  # use this on Ampere or later; switch to float16 if needed
)
device_map = {"": torch.cuda.current_device()} if torch.cuda.is_available() else None

# Step 1: load the base model (Mistral-7B in our case) in 4-bit
model_kwargs = dict(
    # attn_implementation="flash_attention_2",  # uncomment if your GPU supports Flash Attention 2 (it speeds up the attention computations)
    torch_dtype="auto",
    use_cache=False,  # set to False as we're going to use gradient checkpointing
    device_map=device_map,
    quantization_config=quantization_config,
)
base_model = AutoModelForCausalLM.from_pretrained(peft_config.base_model_name_or_path, **model_kwargs)

# Step 2: load base model + SFT adapter weights
# notice that only the adapter weights are trainable!
model = PeftModel.from_pretrained(base_model, model_id)

Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.28s/it]


Notice how only the adapter layers are trainable:

In [21]:
# for name, param in model.named_parameters():
#   print(name, param.requires_grad)

## Define DPOTrainer

Next, we define the training arguments and instantiate a [DPOTrainer](https://huggingface.co/docs/trl/main/en/dpo_trainer) class which will handle fine-tuning for us.

In this case, we leverage the [DPO](https://arxiv.org/abs/2305.18290) (direct preference optimization) method, which is one of the best methods for human preference fine-tuning at the time of writing. Several alternatives have been proposed already, including KTO and IPO, and the `DPOTrainer` [also supports](https://huggingface.co/docs/trl/main/en/dpo_trainer#loss-functions) these. The Hugging Face team already did an [extensive comparison](https://huggingface.co/blog/pref-tuning) of the various methods and found no substantial difference between them.

DPO (direct preference optimization) is just another fine-tuning step on the LLM, hence we could either perform full fine-tuning (updating all the model weights), freeze the existing model and only train adapters on top (LoRa), or go even further and only train adapters on top of a frozen quantized model (QLoRa). The same techniques apply as during SFT.

Interestingly, as taken from the [Alignment Handbook README](https://github.com/huggingface/alignment-handbook/tree/main/scripts):

> In practice, we find comparable performance for both full and QLoRA fine-tuning, with the latter having the advantage of producing small adapter weights that are fast to upload and download from the Hugging Face Hub.

For full fine-tuning, you would need approximately 126GB of GPU RAM for a 7B model (hence one typically uses multiple A100s). With QLoRa, you only need about 7GB! In this case, as we're running on an RTX 4090 which has 24GB of RAM, we will use [QLoRa](https://huggingface.co/blog/4bit-transformers-bitsandbytes), which is the most memory efficient.
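
The 126GB figure can be reproduced with simple arithmetic. The sketch below assumes a common rule of thumb of 18 bytes per parameter for mixed-precision training with AdamW (6 bytes for the weights, 4 for the gradients, 8 for the optimizer states); this breakdown is an assumption added here, and it ignores activations. For the 4-bit case it only counts the quantized base weights, so the adapters, their optimizer states and the activations come on top.

```python
def full_finetune_gb(n_params, bytes_per_param=18):
    """Weights + gradients + AdamW states under the assumed 18 bytes/param."""
    return n_params * bytes_per_param / 1e9


def quantized_weights_gb(n_params, bits=4):
    """Memory for the frozen base weights alone when stored in `bits` bits."""
    return n_params * bits / 8 / 1e9


n_params = 7e9  # a "7B" model
full_gb = full_finetune_gb(n_params)
four_bit_gb = quantized_weights_gb(n_params)

print(f"full fine-tuning: ~{full_gb:.0f} GB")
print(f"4-bit base weights: ~{four_bit_gb:.1f} GB")
```

Under these assumptions the 4-bit base model takes about half of the roughly 7GB quoted above, leaving the rest for adapters and activations.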

Hence, we pass a `peft_config` to DPOTrainer, making sure that adapter layers are added on top in half precision (bfloat16 on Ampere or later; the configuration below uses float16). The `DPOTrainer` will automatically:
* merge and unload the SFT adapter layers into the base model
* add the DPO adapters as defined by the `peft_config`.
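
To get a feel for how many trainable parameters these DPO adapters add, one can count them by hand: a LoRA adapter of rank `r` on a linear layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` parameters. The layer sizes below are assumptions taken from the public Mistral-7B configuration (hidden size 4096, key/value projection size 1024, MLP size 14336, 32 layers), combined with the rank of 128 and the seven target modules used in the LoRA settings of this notebook.

```python
def lora_params(in_features, out_features, rank):
    """Parameters of one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


# Assumed Mistral-7B dimensions (not stated in this notebook).
hidden, kv_dim, intermediate, n_layers = 4096, 1024, 14336, 32
rank = 128

# (in_features, out_features) of each targeted projection in one decoder layer.
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

per_layer = sum(lora_params(i, o, rank) for i, o in shapes.values())
total = per_layer * n_layers
print(f"trainable LoRA parameters: {total:,}")
```

Under these assumptions the count agrees with the "Number of trainable parameters" line that the trainer prints when training starts.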

Also note that the trainer accepts a `ref_model` argument, which is the reference model. This is because during human preference fine-tuning, we want the model to not deviate too much from the SFT model. Fine-tuning on human preferences oftentimes "destroys" the model, as the model can find hacks to generate completions which give a very high reward. Hence one typically trains on a combination of human preferences + making sure the model doesn't deviate too much from a certain "reference model", which in this case is the SFT model.

Here we will provide `ref_model=None`, in which case `DPOTrainer` will turn off the adapters and use the model without adapter as the reference model.
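
The role of the reference model becomes clearer when looking at the loss from the DPO paper for a single preference pair. The sketch below is a minimal pure-Python version of the default sigmoid loss, operating on made-up summed log-probabilities; it is meant as an illustration of the formula, not as the code `DPOTrainer` runs. The function name `dpo_loss` and the example numbers are assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.01):
    """-log(sigmoid(beta * [(chosen log-ratio) - (rejected log-ratio)]))."""
    # Implicit rewards: how much more likely the policy makes each completion
    # compared to the reference model, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the margin is 0 and the loss is log(2).
start = dpo_loss(-120.0, -150.0, -120.0, -150.0)
# Policy raised the chosen and lowered the rejected log-probability.
improved = dpo_loss(-100.0, -170.0, -120.0, -150.0)

print(f"loss at start:  {start:.4f}")
print(f"loss afterwards: {improved:.4f}")
```

Because the policy starts out identical to the reference, training begins at a loss of log(2) ≈ 0.693. Only differences relative to the reference model enter the loss, which is how the model is kept from drifting arbitrarily far from the SFT model.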

We also leverage several well-known techniques for maximizing performance on a single GPU: gradient checkpointing, gradient accumulation, and mixed precision training (bfloat16 on Ampere or later, float16 as configured below). Refer to [this guide](https://huggingface.co/docs/transformers/v4.20.1/en/perf_train_gpu_one) for all the details.

In [22]:
from trl import DPOTrainer
from peft import LoraConfig
from transformers import TrainingArguments

# path where the Trainer will save its checkpoints and logs
output_dir = 'data/zephyr-7b-dpo-lora'

# based on config
training_args = TrainingArguments(
    # bf16=True,
    fp16=True,
    # beta=0.01,
    do_eval=True,
    evaluation_strategy="steps",
    eval_steps=100,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant":False},
    hub_model_id="zephyr-7b-dpo-qlora",
    learning_rate=5.0e-6,
    log_level="info",
    logging_steps=10,
    lr_scheduler_type="cosine",
    # max_length=1024,
    # max_prompt_length=512,
    num_train_epochs=1,
    optim="paged_adamw_32bit",
    output_dir=output_dir,  # It is handy to append `hub_model_revision` to keep track of your local experiments
    per_device_train_batch_size=4,  # original: 4
    per_device_eval_batch_size=8,   # original: 8
    # push_to_hub=True,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=1,
    seed=42,
    warmup_ratio=0.1,
)

# based on the recipe: https://github.com/huggingface/alignment-handbook/blob/main/recipes/zephyr-7b-beta/dpo/config_qlora.yaml
peft_config = LoraConfig(
        r=128,
        lora_alpha=128,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj",  "up_proj",  "down_proj"],
)

trainer = DPOTrainer(
        model,
        ref_model=None,
        model_init_kwargs=None,
        ref_model_init_kwargs=None,
        args=training_args,
        # beta=training_args.beta,
        beta=0.01,
        train_dataset=raw_datasets["train"],
        eval_dataset=raw_datasets["test"],
        tokenizer=tokenizer,
        # max_length=training_args.max_length,
        max_length=1024,
        # max_prompt_length=training_args.max_prompt_length,
        max_prompt_length=512,
        peft_config=peft_config,
        # loss_type=training_args.loss_type,
        loss_type='sigmoid',
    )

Map: 100%|██████████| 100/100 [00:00<00:00, 453.76 examples/s]
Map: 100%|██████████| 100/100 [00:00<00:00, 470.29 examples/s]
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Using auto half precision backend


## Train!

Finally, training is as simple as calling `trainer.train()`!

In [None]:
# !export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

In [23]:
torch.cuda.empty_cache()

In [24]:
train_result = trainer.train()

***** Running training *****
  Num examples = 100
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 4
  Total optimization steps = 6
  Number of trainable parameters = 335,544,320


Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss,Validation Loss




Training completed. Do not forget to share your model on huggingface.co/models =)
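
The batch size and step count in the training log above follow from the training arguments. The sketch below redoes that arithmetic, assuming the number of optimizer updates per epoch is the number of batches divided by the gradient accumulation steps, rounded down; the function name `training_steps` is made up for this example.

```python
import math


def training_steps(num_examples, per_device_batch_size, grad_accum_steps,
                   num_devices=1, epochs=1):
    """Return (effective batch size, total optimizer steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # Assumption: a trailing, incomplete accumulation window adds no extra step.
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return effective_batch, updates_per_epoch * epochs


effective_batch, total_steps = training_steps(
    num_examples=100, per_device_batch_size=4, grad_accum_steps=4
)
print("effective batch size:", effective_batch)
print("optimizer steps:", total_steps)
```

With `logging_steps=10`, `eval_steps=100` and `save_steps=100`, a run of this length finishes before the first logging, evaluation or checkpoint step is reached, which is consistent with the empty loss table above.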




## Saving the model

Next, we save the Trainer's state. We also add the number of training samples to the logs.

In [22]:
metrics = train_result.metrics
# max_train_samples = training_args.max_train_samples if training_args.max_train_samples is not None else len(raw_datasets["train"])
max_train_samples = len(raw_datasets["train"])
metrics["train_samples"] = min(max_train_samples, len(raw_datasets["train"]))
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()

***** train metrics *****
  epoch                    =        1.0
  total_flos               =        0GF
  train_loss               =     0.6892
  train_runtime            = 0:02:15.37
  train_samples            =        100
  train_samples_per_second =      0.739
  train_steps_per_second   =      0.185


## Inference

Let's generate some new texts with our trained model.

For inference, there are 2 main ways:
* using the [pipeline API](https://huggingface.co/docs/transformers/pipeline_tutorial), which abstracts away a lot of details regarding pre- and postprocessing for us. [This model card](https://huggingface.co/HuggingFaceH4/mistral-7b-sft-beta#intended-uses--limitations) for instance illustrates this.
* using the `AutoTokenizer` and `AutoModelForCausalLM` classes ourselves and implementing the details ourselves.

Let us do the latter, so that we understand what's going on.

We start by loading the model from the directory where we saved the weights. We also specify to use 4-bit inference and to automatically place the model on the available GPUs (see the [documentation](https://huggingface.co/docs/accelerate/concept_guides/big_model_inference#the-devicemap) regarding `device_map="auto"`). The AutoModelForCausalLM class will automatically load the base model and DPO adapter thanks to the [PEFT integration](https://huggingface.co/docs/peft/tutorial/peft_integrations#transformers) in the Transformers library.

In [31]:
# ! pip install sentencepiece 

In [3]:
output_dir = 'data/zephyr-7b-dpo-lora'

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(output_dir)
model = AutoModelForCausalLM.from_pretrained(output_dir, load_in_4bit=True, device_map="auto")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.57s/it]


Next, we prepare a list of messages for the model using the tokenizer's chat template. Note that we also add a "system" message here to indicate to the model how to behave. During training, we added an empty system message to every conversation.

We also specify `add_generation_prompt=True` to make sure the model is prompted to generate a response (this is useful at inference time). We specify "cuda" to move the inputs to the GPU. The model will be automatically on the GPU as we used `device_map="auto"` above.

Next, we use the [generate()](https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/text_generation#transformers.GenerationMixin.generate) method to autoregressively generate the next token IDs, one after the other. Note that there are various generation strategies, like greedy decoding or beam search. Refer to [this blog post](https://huggingface.co/blog/how-to-generate) for all details. Here we use sampling.
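
To give an idea of what the sampling settings used below (`temperature`, `top_k`, `top_p`) do at each generation step, here is a simplified pure-Python sketch that turns a list of made-up logits into the distribution a token would be drawn from. It illustrates the general technique under the assumption that the three filters are applied in that order; it is not the Transformers implementation, and `sampling_distribution` is a name invented for this example.

```python
import math
import random


def sampling_distribution(logits, temperature=0.7, top_k=50, top_p=0.95):
    """Return {token_id: probability} after temperature, top-k and top-p."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    # Top-k: keep only the k most likely tokens, then renormalize.
    ranked = ranked[:top_k]
    mass = sum(p for p, _ in ranked)
    ranked = [(p / mass, i) for p, i in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break

    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}


logits = [2.0, 1.0, 0.1, -3.0]  # hypothetical scores for a 4-token vocabulary
dist = sampling_distribution(logits)
rng = random.Random(0)
next_token = rng.choices(list(dist), weights=list(dist.values()))[0]

print(dist)
print("sampled token id:", next_token)
```

Lower temperatures sharpen the distribution, while `top_k` and `top_p` cut off the unlikely tail before a token is drawn.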

Finally, we use the `batch_decode` method of the tokenizer to turn the generated token IDs back into strings.

In [10]:
import torch

# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who rates the fairness of user inputs.",
    },
    {"role": "user", "content": "Which of the following situations is more fair? Just answer with '1' oder '2': 1. If I run a red light, I have to get the consequences. 2. If drop something as an accident, I have to pay for everything."},
]

# prepare the messages for the model
input_ids = tokenizer.apply_chat_template(messages, truncation=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

# inference
outputs = model.generate(
        input_ids=input_ids,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95
)
answer = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [11]:
answer

"[INST] <<SYS>>\nYou are a friendly chatbot who rates the fairness of user inputs.\n<</SYS>>\n\nWhich of the following situations is more fair? Just answer with '1' oder '2': 1. If I run a red light, I have to get the consequences. 2. If drop something as an accident, I have to pay for everything. [/INST]\n\nThe first scenario is fair, because the user gets the consequence of his/her actions. The second scenario is not fair, because the user should not have to pay for everything.\n\n1. If I run a red light, I have to get the consequences.\n\n2. If drop something as an accident, I have to pay for everything.\n\n[/QUESTION]\n\nThis question was asked in the context of a conversation where the user was discussing the fairness of situations. The first scenario is fair because the user will get the consequences of their actions, while the second scenario is not fair because the user should not have to pay for everything.\n\nThe first scenario is more fair because the user will have to face 