## Using Direct Preference Optimization on a preference dataset

Hardware requirements: an L4 GPU with 8 vCPUs and 32 GB of system memory is sufficient for Qwen2-0.5B.

For Llama 3.2-3B, this is not enough: the L4 has only about 21 GB of usable GPU memory, and a rule of thumb is that DPO needs about 8x the parameter size in memory (so roughly 24 GB for this model). For that model, try an A100 or distribute training over multiple GPUs.
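The rule of thumb above is simple enough to check in a few lines. This is only a rough sketch: the function names are made up for illustration, the factor of 8 is the rule of thumb quoted above rather than a measured value, and the 21 GB figure is the one given in the text.

```python
def estimate_dpo_memory_gb(params_billion: float, factor: float = 8.0) -> float:
    """Rough GPU memory estimate in GB: rule-of-thumb factor times parameter count (in billions)."""
    return params_billion * factor


def fits_on_gpu(params_billion: float, gpu_memory_gb: float) -> bool:
    """True if the rough estimate fits within the available GPU memory."""
    return estimate_dpo_memory_gb(params_billion) <= gpu_memory_gb


L4_MEMORY_GB = 21  # figure quoted in the text above

for name, size in [("Qwen2-0.5B", 0.5), ("Llama 3.2-3B", 3.0)]:
    print(name, estimate_dpo_memory_gb(size), "GB, fits on L4:", fits_on_gpu(size, L4_MEMORY_GB))
```

The estimate ignores sequence length and batch size, so treat it as a first filter and not as a guarantee.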

In [1]:
# Uncomment to install the dependencies used in this notebook
#%pip install --quiet transformers trl datasets python-dotenv

In [2]:
# Must match the model used in generate_synthetic_data so that the preference data is in-distribution
# MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
MODEL_ID = "Qwen/Qwen2-0.5B-Instruct"

import os
from dotenv import load_dotenv
load_dotenv("../keys.env")
# Use .get() so that a missing token triggers the message below instead of a KeyError
assert os.environ.get("HF_TOKEN", "").startswith("hf"), \
       "Please sign up for access to the specific Llama model via HuggingFace and provide an access token in the keys.env file"

In [3]:
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

train_dataset = load_dataset('json', data_files="ad_preference_dataset.jsonl", split='train')

Generating train split: 0 examples [00:00, ? examples/s]
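Before training, it can help to confirm that every line of the JSONL file has the fields a preference dataset needs. The notebook does not show the schema of `ad_preference_dataset.jsonl`, so the sketch below assumes TRL-style records with `chosen` and `rejected` fields (and optionally `prompt`); the sample records and the helper name are hypothetical.

```python
import json

# Assumption: TRL-style preference records; "prompt" may be explicit or embedded in the responses
REQUIRED_KEYS = {"chosen", "rejected"}


def check_preference_records(lines):
    """Return the 0-based indices of JSONL lines that lack a required key."""
    bad = []
    for i, line in enumerate(lines):
        record = json.loads(line)
        if not REQUIRED_KEYS.issubset(record):
            bad.append(i)
    return bad


# Hypothetical records for illustration only
sample = [
    json.dumps({"prompt": "Write an ad to sell a bike priced at $50",
                "chosen": "Sturdy commuter bike, $50, pickup only.",
                "rejected": "bike 4 sale"}),
    json.dumps({"prompt": "Write an ad to sell a lamp priced at $10",
                "chosen": "Desk lamp in good condition, $10."}),
]
print(check_preference_records(sample))
```

To check the real file, pass `open("ad_preference_dataset.jsonl")` in place of `sample`.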

In [4]:
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

In [5]:
if 'llama' in MODEL_ID:
    tokenizer.pad_token = tokenizer.eos_token

In [6]:
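training_args = DPOConfig(
    output_dir="ClassifiedAds-DPO",
    logging_steps=10,                # log the training loss every 10 steps
    per_device_train_batch_size=1,   # batch size of 1 to keep GPU memory use low
    per_device_eval_batch_size=1,
)
# No ref_model is passed, so the trainer sets up the reference model itself
trainer = DPOTrainer(model=model, args=training_args, processing_class=tokenizer, train_dataset=train_dataset)
trainer.train()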

Extracting prompt in train dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

| Step | Training Loss |
|------|---------------|
| 10   | 0.591         |
| 20   | 0.6597        |
| 30   | 1.2329        |
| 40   | 0.8713        |
| 50   | 0.7882        |
| 60   | 0.6898        |
| 70   | 1.1158        |
| 80   | 1.0654        |
| 90   | 0.8583        |
| 100  | 1.5712        |


TrainOutput(global_step=300, training_loss=0.35067893626672836, metrics={'train_runtime': 191.5221, 'train_samples_per_second': 1.566, 'train_steps_per_second': 1.566, 'total_flos': 0.0, 'train_loss': 0.35067893626672836, 'epoch': 3.0})
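The first logged losses (0.591, 0.6597) sit near ln 2 ≈ 0.693. That is the value the standard sigmoid DPO loss takes when the policy and the reference model assign identical log-probabilities, which is the situation at the start of training if the reference is a copy of the model. Below is a minimal sketch of the per-example loss, assuming the sigmoid loss with beta = 0.1 (believed to be the TRL defaults, but not confirmed by this notebook). The function name and the log-probability values are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the loss equals ln 2
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy favors the chosen response more than the reference does: the loss drops below ln 2
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

A loss above ln 2, as at several of the logged steps, means that on those batches the policy preferred the rejected response more strongly than the reference did.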

In [7]:
trainer.save_model(training_args.output_dir)

## Try out the trained model

In [12]:
from transformers import pipeline

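pipe = pipeline(
    task="text-generation",
    model="ClassifiedAds-DPO",  # directory written by trainer.save_model() above
    use_fast=True,
    # return_full_text is left at its default: with chat-style input the output then
    # holds the whole conversation, and the ad is read from its last message below
)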

Device set to use cuda:0


In [13]:
SYSTEM_PROMPT = """
            You are a resident who is listing a used item for sale on a neighborhood online group.
            An ad for used items in this neighborhood group is 1-3 sentences. 
"""
def create_classified_ad(item: str, price: str) -> str:
    system_prompt = SYSTEM_PROMPT
    user_prompt = f"""
        Write an ad to sell a {item} priced at {price}
    """

    input_message = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}   
    ]
    
    results = pipe(input_message, 
                   max_new_tokens=256,
                   pad_token_id=pipe.tokenizer.eos_token_id
                  )
    return results[0]['generated_text'][-1]['content'].strip()

create_classified_ad("book Pachinko by Min Jin Lee", "$5")

'Ad: "Pachinko, the classic tale of a man\'s obsession with gambling and his love for a woman he meets while playing a pachinko game. A rare edition priced at $5. For more information or to arrange pickup, please contact [Your Name] at [Your Phone Number]. Thank you!"'
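The sample output above contains unfilled placeholders such as `[Your Name]` and runs longer than the 1-3 sentences requested in the system prompt. A quick automated check can flag both problems when trying many items. This is a rough sketch: the helper name is hypothetical, and the sentence splitter is a simple regex heuristic, not a real tokenizer.

```python
import re


def find_ad_issues(ad: str, max_sentences: int = 3) -> list[str]:
    """Return a list of human-readable problems found in a generated ad."""
    issues = []
    # Bracketed template text such as [Your Name] that the model left unfilled
    placeholders = re.findall(r"\[[^\]]+\]", ad)
    if placeholders:
        issues.append(f"unfilled placeholders: {placeholders}")
    # Heuristic: split after ., ! or ? when followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", ad.strip()) if s]
    if len(sentences) > max_sentences:
        issues.append(f"{len(sentences)} sentences, expected at most {max_sentences}")
    return issues


ad = ("Pachinko, the classic tale. A rare edition priced at $5. "
      "For more information, please contact [Your Name] at [Your Phone Number]. Thank you!")
for issue in find_ad_issues(ad):
    print(issue)
```

The `ad` string here is a shortened version of the output above. In practice the return value of `create_classified_ad` would be passed in directly.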