### DPO fine-tune the `Phi-2` model into `Neural-Phi`

Use the `distilabel-intel-orca-dpo-pairs` dataset to fine-tune the (roughly) `SFT`-tuned `phi2` model from Microsoft with DPO.

- Follows an excellent blog post by [Maxime Labonne](https://towardsdatascience.com/fine-tune-a-mistral-7b-model-with-direct-preference-optimization-708042745aac)


In [1]:
# Install dependencies
# ! pip install -q datasets transformers peft trl bitsandbytes sentencepiece wandb

In [2]:
import os
import gc
import json
import torch

import transformers
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    BitsAndBytesConfig,
)
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftModel
from trl import DPOTrainer
import bitsandbytes as bnb
import wandb



In [3]:
torch.cuda.empty_cache()
gc.collect()

24

In [6]:
secrets_path = "./secrets/secrets.json"
# load the tokens
with open(secrets_path, "r") as f:
    secrets = json.load(f)

hf_token = secrets["HF_TOKEN"]
wandb_token = secrets["WANDB_TOKEN"]

# login to wandb
# WANDB_NOTEBOOK_NAME should point at the notebook file itself, not its directory
os.environ["WANDB_NOTEBOOK_NAME"] = os.path.abspath("dpo_finetune_phi2.ipynb")
wandb.login(key=wandb_token)

[34m[1mwandb[0m: Currently logged in as: [33mparth-shastri[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/ostrich/.netrc


True

In [5]:
# ChatML template  - Template used by models like ChatGPT for a chat interface
# <|im_start|>system
# You are a helpful chatbot assistant.<|im_end|>
# <|im_start|>user
# Hi<|im_end|>
# <|im_start|>assistant
# Hi, how can I help you?<|im_end|>
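
The template above can be reproduced with a few lines of plain Python. This is a minimal sketch, not the `transformers` implementation: `to_chatml` is a hypothetical helper written for illustration, and it assumes every turn is rendered as `<|im_start|>{role}\n{content}<|im_end|>\n`.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as a ChatML string.

    Hypothetical helper for illustration only.
    """
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        text += "<|im_start|>assistant\n"
    return text


example = to_chatml(
    [
        {"role": "system", "content": "You are a helpful chatbot assistant."},
        {"role": "user", "content": "Hi"},
    ],
    add_generation_prompt=True,
)
print(example)
```

With `add_generation_prompt=True` the string ends with an open assistant turn, which is the form a DPO prompt needs before the chosen or rejected completion is appended.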

### Config


In [7]:
model_name = "phi2-sft-alpaca_loraemb-right-pad"  # with the Embeddings tuned.
new_model = "Neural-phi2-v2"

dataset_name = "argilla/distilabel-intel-orca-dpo-pairs"

In [8]:
def make_chatml_format(example, tokenizer: AutoTokenizer):
    """
    Convert the example to chatml format.
    # <|im_start|>system
    # You are a helpful chatbot assistant.<|im_end|>
    # <|im_start|>user
    # Hi<|im_end|>
    # <|im_start|>assistant
    # Hi, how can I help you?<|im_end|>
    """

    # Format system prompt
    if len(example["system"]) > 0:
        message = {"role": "system", "content": example["system"]}
        # add_generation_prompt=False: do not append the "<|im_start|>assistant" header after the system turn
        system = tokenizer.apply_chat_template(
            [message], tokenize=False, add_generation_prompt=False
        )
    else:
        system = ""

    # Format instruction input prompt
    message = {"role": "user", "content": example["input"]}
    prompt = tokenizer.apply_chat_template(
        [message], tokenize=False, add_generation_prompt=True
    )

    # Format chosen prompt
    chosen = example["chosen"] + "<|im_end|>"  # add the EOS token

    # Format rejected prompt
    rejected = example["rejected"] + "<|im_end|>"

    return {"prompt": system + prompt, "chosen": chosen, "rejected": rejected}

#### Load the dataset and Tokenizer


In [None]:
# Dataset
dataset = load_dataset(dataset_name, split="train")

original_columns = dataset.column_names

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(f"tokenizers/{model_name}")
# tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

In [9]:
dataset

Dataset({
    features: ['system', 'input', 'chosen', 'rejected', 'generations', 'order', 'labelling_model', 'labelling_prompt', 'raw_labelling_response', 'rating', 'rationale', 'status', 'original_chosen', 'original_rejected', 'chosen_score', 'in_gsm8k_train'],
    num_rows: 12859
})

In [10]:
# Format the dataset
dataset = dataset.map(
    lambda x: make_chatml_format(x, tokenizer),
    remove_columns=original_columns,
)

In [11]:
# View one example of the dataset
dataset[10]

{'chosen': '[\n  ["AFC Ajax (amateurs)", "has ground", "Sportpark De Toekomst"],\n  ["Ajax Youth Academy", "plays at", "Sportpark De Toekomst"]\n]<|im_end|>',
 'rejected': " Sure, I'd be happy to help! Here are the RDF triplets for the input sentence:\n\n[AFC Ajax (amateurs), hasGround, Sportpark De Toekomst]\n[Ajax Youth Academy, playsAt, Sportpark De Toekomst]\n\nExplanation:\n\n* AFC Ajax (amateurs) is the subject of the first triplet, and hasGround is the predicate that describes the relationship between AFC Ajax (amateurs) and Sportpark De Toekomst.\n* Ajax Youth Academy is the subject of the second triplet, and playsAt is the predicate that describes the relationship between Ajax Youth Academy and Sportpark De Toekomst.\n\nNote that there may be other possible RDF triplets that could be derived from the input sentence, but the above triplets capture the main relationships present in the sentence.<|im_end|>",
 'prompt': "<|im_start|>user\nYou will be given a definition of a task f

In [12]:
### CALCULATE THE MAX LEN ###
from numpy import percentile

# Find the 95th percentile (p95) of the tokenized lengths in the dataset
max_prompt_len = int(
    percentile([len(tokenizer(x["prompt"])["input_ids"]) for x in dataset], 95)
)
max_chosen_len = int(
    percentile(
        [len(tokenizer(x["prompt"] + x["chosen"])["input_ids"]) for x in dataset], 95
    )
)
max_rejected_len = int(
    percentile(
        [len(tokenizer(x["prompt"] + x["rejected"])["input_ids"]) for x in dataset], 95
    )
)
max_seq_len = max(max_chosen_len, max_rejected_len)

# Drop examples whose prompt + chosen length exceeds the p95 max_seq_len
dataset = dataset.filter(
    lambda x: len(tokenizer(x["prompt"] + x["chosen"])["input_ids"]) <= max_seq_len
)
print(f"len(dataset): {len(dataset)}")

# Round the lengths up to the next even number
prompt_len = ((max_prompt_len + 1) // 2) * 2
max_seq_len = ((max_seq_len + 1) // 2) * 2
print(f"p95 prompt len: {prompt_len}")
print(f"p95 max sequence length: {max_seq_len}")

Token indices sequence length is longer than the specified maximum sequence length for this model (2505 > 2048). Running this sequence through the model will result in indexing errors


len(dataset): 12381
p95 prompt len: 592
p95 max sequence length: 952
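
The expression `((n + 1) // 2) * 2` used above is the `multiple = 2` case of a general round-up idiom. A small sketch of the general form follows; `round_up_to_multiple` is a hypothetical helper name, and the choice of 64 as an alternative multiple is only an illustration.

```python
def round_up_to_multiple(n: int, multiple: int) -> int:
    """Return the smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


# With multiple=2 this matches ((n + 1) // 2) * 2 from the cell above
print(round_up_to_multiple(951, 2))   # 952
print(round_up_to_multiple(952, 2))   # 952 (already even, unchanged)

# A larger multiple gives rounder limits
print(round_up_to_multiple(952, 64))  # 960
```

Since the reported p95 values (592 and 952) are already even, rounding to a multiple of 2 leaves them unchanged.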


In [13]:
# Manually chosen limits (both multiples of 2)
max_seq_len = 1024
max_prompt_len = 768

#### Load the model with less precision


In [14]:
# quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    f"models/{model_name}",
    quantization_config=quantization_config,
    trust_remote_code=True,
)
model.config.use_cache = False

# reference model for DPO
reference_model = AutoModelForCausalLM.from_pretrained(
    f"models/{model_name}",
    quantization_config=quantization_config,
    trust_remote_code=True,
)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
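
A rough back-of-the-envelope estimate shows why both the policy and the reference model are loaded in 4-bit. The sketch below assumes about 2.78 billion base parameters (the public figure for Phi-2) and counts weights only; it ignores quantization constants, layers kept in higher precision, activations, and optimizer state, so real usage will be higher.

```python
def weight_memory_gb(num_params: int, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


BASE_PARAMS = 2_780_000_000  # assumption: approximate Phi-2 parameter count

bf16_gb = weight_memory_gb(BASE_PARAMS, 16)
nf4_gb = weight_memory_gb(BASE_PARAMS, 4)

# DPO keeps two copies in memory: the trainable policy and the frozen reference
print(f"bf16, two copies:  {2 * bf16_gb:.2f} GB")
print(f"4-bit, two copies: {2 * nf4_gb:.2f} GB")
```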

#### Define the `LoRA` parameters


In [15]:
lora_config = LoraConfig(
    r=16,
    lora_alpha=64,  # according to the QLoRA paper
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "dense",
        "fc1",
        "fc2",
    ],  # following the QLoRA paper: adapt all the linear layers
)

#### Trainer initialization


In [16]:
# Training Arguments

training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs=dict(
        use_reentrant=False
    ),  # use the non-reentrant gradient checkpointing implementation
    learning_rate=5e-7,  # 5e-7, as in the Zephyr recipe
    lr_scheduler_type="linear",
    # max_steps=500,
    num_train_epochs=3,
    save_strategy="no",
    logging_steps=1,
    output_dir=f"models/{new_model}",
    optim="paged_adamw_32bit",
    warmup_steps=1000,
    bf16=True,
    report_to="wandb",
    run_name=f"{new_model}",
    log_level="error",
    logging_first_step=True,
)

# DPO Trainer
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=reference_model,
    args=training_args,
    loss_type="sigmoid",
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=lora_config,
    beta=0.1,
    max_prompt_length=prompt_len,
    max_length=max_seq_len,
)

# print the trainable parameters
dpo_trainer.model.print_trainable_parameters()



Map:   0%|          | 0/12381 [00:00<?, ? examples/s]

trainable params: 23,592,960 || all params: 2,803,276,800 || trainable%: 0.8416207775129448
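
The trainable-parameter count reported above can be reproduced by hand. Each LoRA-adapted linear layer of shape `(d_in, d_out)` adds `r * (d_in + d_out)` parameters. The sketch assumes the public Phi-2 dimensions (hidden size 2560, MLP width 10240, 32 layers), which are not stated in this notebook.

```python
R = 16                 # LoRA rank from lora_config
HIDDEN = 2560          # assumption: Phi-2 hidden size
INTERMEDIATE = 10240   # assumption: Phi-2 MLP width (4 * hidden)
NUM_LAYERS = 32        # assumption: Phi-2 decoder layers

# (d_in, d_out) for every target module in one decoder layer
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "dense": (HIDDEN, HIDDEN),
    "fc1": (HIDDEN, INTERMEDIATE),
    "fc2": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int) -> int:
    # A has shape (r, d_in) and B has shape (d_out, r)
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 23,592,960
```

Dividing this by the reported 2,803,276,800 total parameters gives the 0.84% trainable share.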


In [17]:
# Finetune with dpo
os.environ["TOKENIZERS_PARALLELISM"] = "false"
dpo_trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mparth-shastri[0m. Use [1m`wandb login --relogin`[0m to force relogin



Step,Training Loss
1,0.6931
2,0.6931
3,0.6894
4,0.7011
5,0.6872
6,0.6967
7,0.6875
8,0.6915
9,0.6938
10,0.6895


TrainOutput(global_step=500, training_loss=0.45014474414661526, metrics={'train_runtime': 3935.5075, 'train_samples_per_second': 1.016, 'train_steps_per_second': 0.127, 'total_flos': 0.0, 'train_loss': 0.45014474414661526, 'epoch': 0.32})
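
The `TrainOutput` above reports `global_step=500` and `epoch=0.32`, which is consistent with a 500-step run rather than three full epochs. The arithmetic below checks this, assuming a single GPU and a linear warmup that scales the learning rate by `step / warmup_steps`.

```python
num_examples = 12_381   # len(dataset) after filtering
per_device_batch = 1    # per_device_train_batch_size
grad_accum = 8          # gradient_accumulation_steps
global_step = 500       # from TrainOutput
warmup_steps = 1000
peak_lr = 5e-7

# Examples consumed per optimizer step (assumption: one GPU)
effective_batch = per_device_batch * grad_accum
steps_per_epoch = num_examples / effective_batch
epochs_done = global_step / steps_per_epoch

# During warmup the learning rate grows linearly towards peak_lr
lr_at_end = peak_lr * min(1.0, global_step / warmup_steps)

print(f"effective batch size: {effective_batch}")
print(f"epochs completed:     {epochs_done:.2f}")  # 0.32
print(f"lr at last step:      {lr_at_end:.2e}")
```

Because `warmup_steps` is larger than the number of steps taken, a run of this length would end while still warming up, at half of the configured peak learning rate.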

In [18]:
dpo_trainer.model.save_pretrained(f"models/adapters/{new_model}")
dpo_trainer.tokenizer.save_pretrained(f"tokenizers/{new_model}")

('tokenizers/Neural-phi2-v2/tokenizer_config.json',
 'tokenizers/Neural-phi2-v2/special_tokens_map.json',
 'tokenizers/Neural-phi2-v2/vocab.json',
 'tokenizers/Neural-phi2-v2/merges.txt',
 'tokenizers/Neural-phi2-v2/added_tokens.json',
 'tokenizers/Neural-phi2-v2/tokenizer.json')

In [19]:
del dpo_trainer, model, reference_model
gc.collect()
torch.cuda.empty_cache()

### Load the PEFT adapter and merge


In [6]:
# Load the base (SFT) model in bf16

sft_model = AutoModelForCausalLM.from_pretrained(
    f"models/{model_name}", return_dict=True, torch_dtype=torch.bfloat16
)


# Merge the SFT model with the DPO adapter
model = PeftModel.from_pretrained(
    model=sft_model, model_id=f"models/adapters/{new_model}"
)
model = model.merge_and_unload()

model.save_pretrained(f"models/{new_model}")

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
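
Merging folds each low-rank update into the base weight, so the adapter is no longer needed at inference time. A toy sketch of the idea, assuming the standard LoRA update `W + (lora_alpha / r) * B @ A`; the tiny matrices and helper functions are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Return a + scale * b, element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


r, alpha = 2, 8       # toy values; the notebook uses r=16, lora_alpha=64
scaling = alpha / r   # 4.0, the same ratio as 64 / 16

W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # base weight, shape (d_out=2, d_in=3)
A = [[0.1, 0.2, 0.0], [0.0, -0.1, 0.3]]  # LoRA A, shape (r, d_in)
B = [[0.5, 0.0], [-0.2, 0.4]]            # LoRA B, shape (d_out, r)

# Fold the adapter into the base weight once
W_merged = add(W, matmul(B, A), scale=scaling)

x = [[1.0], [2.0], [3.0]]  # input as a column vector

# Base output plus the scaled adapter path, versus the merged weight alone
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)), scale=scaling)
y_merged = matmul(W_merged, x)
print(y_adapter, y_merged)
```

Both paths produce the same output up to floating-point rounding, which is why the merged model can be saved and served as an ordinary checkpoint.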

### Load the Merged model


In [6]:
# Load the base Microsoft model
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", torch_dtype=torch.bfloat16
)

# Load the SFT adapter and merge it into the base weights
model = PeftModel.from_pretrained(
    model,
    f"models/adapters/{model_name}",
    adapter_name="sft",
)
model = model.merge_and_unload()

# Load the DPO adapter on top of the merged SFT model
model = PeftModel.from_pretrained(
    model, f"models/adapters/{new_model}", adapter_name="dpo"
)

# Merge the DPO adapter
model = model.merge_and_unload()

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(f"tokenizers/{new_model}")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
# ### OR you can directly load the saved model
# model = AutoModelForCausalLM.from_pretrained(
#     f"models/{new_model}", torch_dtype=torch.bfloat16
# )

In [7]:
# Format prompt
message = [
    {"role": "system", "content": "You are a helpful assistant chatbot."},
    {"role": "user", "content": "What is a Large Language Model?"},
    # {
    #     "role": "assistant",
    #     "content": "A Large Language Model (LLM) is a type of language model that uses deep neural networks to generate text. It is typically trained on a large dataset of text and can be used to generate new text that is grammatically correct and coherent.LLMs are used for a variety of tasks such as text generation, summarization, translation, and question-answering.",
    # },
    # {"role": "user", "content": "Tell me more about it..."},
]

prompt = tokenizer.apply_chat_template(
    message, add_generation_prompt=True, tokenize=False
)

# Create pipeline
pipeline = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)

# Generate text
sequences = pipeline(
    prompt,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,  # Nucleus sampling
    # top_k=50,
    num_return_sequences=1,
    max_new_tokens=512,
    pad_token_id=tokenizer.pad_token_id,
)
print(sequences[0]["generated_text"])


No chat template is defined for this tokenizer - using a default chat template that implements the ChatML format (without BOS/EOS tokens!). If the default is not appropriate for your model, please set `tokenizer.chat_template` to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.



<|im_start|>system
You are a helpful assistant chatbot.<|im_end|>
<|im_start|>user
What is a Large Language Model?<|im_end|>
<|im_start|>assistant
A Large Language Model (LLM) is a type of artificial intelligence system that is used to generate human-like text. It is trained on a massive amount of text data, such as articles, novels, and other documents, in order to learn the structure and relationships between words. These models are powerful tools that can be used for a variety of tasks, including summarization, translation, question answering, and text generation.
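
The `temperature` and `top_p` arguments used in the generation call can be illustrated on a toy distribution. This is a simplified sketch of temperature scaling followed by nucleus (top-p) filtering, not the `transformers` implementation; `sample_top_p` and the example logits are hypothetical.

```python
import math
import random


def sample_top_p(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample a token index using temperature scaling and nucleus filtering."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # Sample among the kept tokens, weighted by their probabilities
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


token = sample_top_p([2.0, 1.5, 0.3, -1.0], temperature=0.7, top_p=0.9)
print(token)
```

Lower `top_p` values shrink the candidate set, so very unlikely tokens are never sampled even when `do_sample=True`.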



### Push the model to the Hugging Face Hub


In [7]:
model.push_to_hub(
    "Neural-phi2",
    private=True,
    commit_message="DPO finetuned model on distilled orca pairs",
)

tokenizer.push_to_hub("Neural-phi2", private=True, commit_message="tokenizer")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


model-00001-of-00002.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/564M [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.18k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/charioteer/Neural-phi2/commit/f4c00cc3e0d7ebeb16422d80ca292c1153729a84', commit_message='tokenizer', commit_description='', oid='f4c00cc3e0d7ebeb16422d80ca292c1153729a84', pr_url=None, pr_revision=None, pr_num=None)

In [8]:
tokenizer.save_pretrained(f"models/{new_model}")

('models/Neural-phi2-v2/tokenizer_config.json',
 'models/Neural-phi2-v2/special_tokens_map.json',
 'models/Neural-phi2-v2/vocab.json',
 'models/Neural-phi2-v2/merges.txt',
 'models/Neural-phi2-v2/added_tokens.json',
 'models/Neural-phi2-v2/tokenizer.json')

### Convert the Model to GGUF file format

- Use the `llama.cpp` repo for conversion and quantization.

- Clone the repo from [github.com/ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp.git)

- `cd llama.cpp`

- `make` on Linux or macOS

- Install the Python requirements: `pip install -r requirements.txt`

- `python convert-hf-to-gguf.py <Model-path> --outfile <DEST-PATH> --outtype "<QUANT-DTYPE>"`

- `./quantize <BIN-MODEL> <DEST-MODEL>.gguf "<QUANT_STRAT>"`
