## Generate synthetic data for DPO

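This notebook generates synthetic classified ads with a small instruction-tuned model, scores each ad with Gemini, and pairs the ads into chosen/rejected examples for Direct Preference Optimization (DPO).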

In [1]:
#%pip install --upgrade --quiet google-genai transformers python-dotenv

In [2]:
# MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
MODEL_ID = "Qwen/Qwen2-0.5B-Instruct"

import os
from dotenv import load_dotenv
load_dotenv("../keys.env")
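# .get() avoids a bare KeyError when the variable is missing, so the message below is shown instead.
assert os.environ.get("HF_TOKEN", "").startswith("hf"),\
       "Please provide a Hugging Face access token (HF_TOKEN) in the keys.env file; gated models such as Llama also require requesting access on Hugging Face"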

In [3]:
from transformers import pipeline

# create_classified_ad() below indexes into the full chat message list returned
# by the pipeline, so return_full_text is left at its default here.
pipe = pipeline(
    task="text-generation",
    model=MODEL_ID,
    use_fast=True,
)

Device set to use cuda:0


## Create an ad for an item

In [4]:
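import random

def create_classified_ad(item: str, price: str) -> str:
    """Generate a one-paragraph classified ad for `item` priced at `price`."""
    system_prompt = """
        You are writing a one paragraph classified ad for used items in a neighborhood online forum.
    """
    user_prompt = f"""
        Write an ad to sell a {item} priced at {price}
    """

    input_message = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

    results = pipe(
        input_message,
        max_new_tokens=256,
        do_sample=True,  # temperature only takes effect when sampling is enabled
        # A random temperature per call helps the two ads in a pair differ.
        temperature=random.uniform(0.2, 0.9),
    )
    # With chat-style input, generated_text is the full message list;
    # the last message is the assistant's reply.
    return results[0]['generated_text'][-1]['content'].strip()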

create_classified_ad("3-year old Specialized bike", "$300")

"We're excited to introduce you to our newest addition to the Specialized family! This unique and stylish 3-year old bike is perfect for anyone looking for something that's both functional and stylish. With its durable construction and comfortable seat, this bike will keep you going on your adventures. Plus, it's incredibly affordable at only $300! Don't miss out on this opportunity to upgrade your cycling experience. Head over to our website now to learn more about this amazing bike and get yours today!"

In [5]:
example_ad = create_classified_ad("3-year old Specialized bike", "$300")
example_ad

"Are you looking for a unique and affordable way to enjoy your daily commute? Look no further! Our 3-year-old Specialized bike is the perfect addition to your collection. With its durable construction and comfortable frame, this bike is sure to be a hit with both kids and adults alike. Plus, it's priced at only $300, making it a great value for your money. So why wait? Order yours today and start enjoying the benefits of riding on wheels!"

## Score an ad

We'll use Gemini 2.0 Flash as the judge. It returns a structured score for each of five criteria, each worth up to 10 points.

In [6]:
from google import genai
from google.genai import types

GEMINI_MODEL_ID='gemini-2.0-flash'

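# .get() avoids a bare KeyError when the variable is missing, so the message below is shown instead.
assert os.environ.get("GEMINI_API_KEY", "").startswith("AI"),\
       "Please sign up for access to Google Gemini and provide your API key (GEMINI_API_KEY) in the keys.env file"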
client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))

In [18]:
from dataclasses import dataclass

@dataclass
class AdScore:
    clarity: int
    audience_targeting: int
    readability: int
    contact_info: int
    truth: int
    
    def total_score(self) -> int:
        return self.clarity + self.audience_targeting + self.readability + self.contact_info + self.truth

def score_ad(ad: str) -> AdScore:
    prompt = f"""
    You are a professor of advertising at a business school. Score the following classified ad based on the following criteria:
    * Is it clear what's being sold? Age, brand, price, and condition are important. (10 points)
    * Does it target the most relevant audience for the item? Is it persuasive to that audience? (10 points)
    * Is it concise and easy to read? (10 points)
    * Does it include contact information? Ideally, the ad specifies the preferred means of communication. (10 points)
    * Is everything the ad says likely to be true of a used item sold in a garage sale? The ad should not claim that the item is new or that you can "order" it from "our" website. (10 points)
    
    Ad:
    {ad}
    """
    response = client.models.generate_content(
        model=GEMINI_MODEL_ID,
        contents=[prompt],
        config=types.GenerateContentConfig(
            temperature=0.1,
            response_mime_type='application/json',
            response_schema=AdScore
        )
    )
    
    return response.parsed

In [19]:
example_score = score_ad(example_ad)
example_score

AdScore(clarity=7, audience_targeting=5, readability=8, contact_info=0, truth=6)

In [20]:
example_score.total_score()

26
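The total is the sum of the five criterion scores, so the maximum is 50 (five criteria at 10 points each). The sketch below reproduces that arithmetic on the scores shown above. It redefines `AdScore` locally so that it runs on its own, and `normalized_score` and `MAX_PER_CRITERION` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class AdScore:
    # Local copy of the notebook's dataclass so this sketch is self-contained.
    clarity: int
    audience_targeting: int
    readability: int
    contact_info: int
    truth: int

    def total_score(self) -> int:
        # Sum every criterion field.
        return sum(getattr(self, f.name) for f in fields(self))


MAX_PER_CRITERION = 10  # assumption: taken from the "(10 points)" in the scoring prompt


def normalized_score(score: AdScore) -> float:
    """Hypothetical helper: total as a fraction of the maximum possible score."""
    return score.total_score() / (MAX_PER_CRITERION * len(fields(score)))


example = AdScore(clarity=7, audience_targeting=5, readability=8, contact_info=0, truth=6)
print(example.total_score())      # 7 + 5 + 8 + 0 + 6 = 26
print(normalized_score(example))  # 26 / 50 = 0.52
```

A normalized value can make scores easier to compare if the number of criteria changes later.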

## Generate data

In [22]:
def create_preference_example(item: str, price: str) -> dict:
    ad1 = create_classified_ad(item, price)
    ad2 = create_classified_ad(item, price)
    score1 = score_ad(ad1).total_score()
    score2 = score_ad(ad2).total_score()
    
    preference_example = {
        "prompt": f"""You are writing a one paragraph classified ad for used items in a neighborhood online forum. Write an ad to sell a {item} priced at {price}"""
    }
    
    # The higher-scoring ad becomes 'chosen'. On a tie, ad2 is chosen by default.
    if score1 > score2:
        preference_example['chosen'] = ad1
        preference_example['rejected'] = ad2
    else:
        preference_example['chosen'] = ad2
        preference_example['rejected'] = ad1
    
    return preference_example

create_preference_example("3-year old Specialized bike", "$300")

{'prompt': 'You are writing a one paragraph classified ad for used items in a neighborhood online forum. Write an ad to sell a 3-year old Specialized bike priced at $300',
 'chosen': "Specialized bikes have been designed with durability and performance in mind, making them the perfect choice for riders who value reliability and comfort. This 3-year-old Specialized bike is in excellent condition and has undergone regular maintenance, ensuring that it remains in top shape for years to come. Whether you're looking for a fun and affordable way to ride around town or just want to get some exercise on your bike, this Specialized bike is sure to be a hit. Don't miss out on this opportunity to upgrade your bike collection today!",
 'rejected': "In need of a unique and reliable mode of transportation, we have the perfect option - a brand new Specialized bike that's been meticulously maintained and ready to go! This vintage model is equipped with state-of-the-art components and features, making 
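When the two ads receive the same total, the comparison above still labels one of them as chosen, even though the judge expressed no preference. One possible alternative, sketched here under the assumption that such pairs should be skipped, is to require a minimum score margin. `pick_preference` and `min_margin` are hypothetical names and are not part of the notebook.

```python
from typing import Optional


def pick_preference(ad1: str, score1: int, ad2: str, score2: int,
                    min_margin: int = 1) -> Optional[dict]:
    """Hypothetical helper: return chosen/rejected, or None if the scores are too close."""
    if abs(score1 - score2) < min_margin:
        return None  # no clear preference, so the caller can skip or regenerate the pair
    if score1 > score2:
        return {"chosen": ad1, "rejected": ad2}
    return {"chosen": ad2, "rejected": ad1}


print(pick_preference("ad A", 30, "ad B", 26))  # ad A is chosen
print(pick_preference("ad A", 26, "ad B", 26))  # None (tie)
```

A larger `min_margin` trades dataset size for cleaner preference signal. The right value depends on how noisy the judge's scores are.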

In [23]:
import json

# The items below were generated using Gemini with the following prompt
"""
Create a synthetic dataset of 10 items for sale in a garage sale. 
Items should be in the format (item_description, cost). 
For example ("Like-new copy of Chinua Achebe's Things Fall Apart", "$5"). 
# [optional] Items should be somewhat expensive, and include details such as brand name where possible.
# [optional] Items should be in the 10-20 dollar range and be unique. Not a collection of things.
"""
items_for_sale = [
    ("3-year old Specialized road bike", "$300"),
    ("Amazing Spider-Man 361", "$200"),
    ("Like-new copy of Chinua Achebe's Things Fall Apart", "$5"),
    ("6-piece Le Creuset cookware set in good condition", "$800"),
    ("Well-used kids tricycle", "$20"),
    ("1990 vintage pair of Levi's 501 jeans, men's size 32x32, in good condition", "$50"),
    ("Vintage Pyrex mixing bowl set (3 bowls)", "$15"),
    ("Hardwood rocking chair (good condition)", "$50"),
    ("Kids' bicycle (16-inch wheels)", "$25"),
    ("Set of 4 dining chairs (wooden)", "$40"),
    ("Large area rug (floral pattern)", "$30"),
    ("Electric coffee maker (like new)", "$10"),
    ("Box of assorted DVDs (mostly action movies)", "$20"),
    ("Ceramic table lamp (with shade)", "$12"),
    ("Gardening tools set (shovel, rake, trowel)", "$25"),
    ("Board game collection (Monopoly, Scrabble, Clue)", "$35"),
    ("Antique dresser with mirror (restored)", "$250"),
    ("Solid oak dining table with 6 chairs", "$400"),
    ("Leather sofa (minor wear)", "$300"),
    ("Vintage record player (working condition)", "$175"),
    ("Collection of antique books (various genres)", "$200"),
    ("High-end road bike (carbon fiber frame)", "$800"),
    ("Hand-knotted Persian rug (small size)", "$450"),
    ("Original oil painting (landscape scene)", "$350"),
    ("Set of vintage china (complete set)", "$150"),
    ("Designer handbag (like new)", "$200"),
    ("Ethan Allen Armchair (excellent condition, dark green leather)", "$400"),
    ("Bose QuietComfort 35 Headphones (wireless, noise-canceling, like new)", "$150"),
    ("Samsung 55-inch Smart TV (4K UHD, excellent picture)", "$350"),
    ("KitchenAid Stand Mixer (Artisan series, red, with attachments)", "$200"),
    ("Breville Barista Express Espresso Machine (stainless steel)", "$300"),
    ("Apple iPad Pro 11-inch (64GB, Wi-Fi, Space Gray)", "$450"),
    ("Sony Alpha a6000 Mirrorless Camera (with two lenses)", "$500"),
    ("Ray-Ban Aviator Sunglasses (gold frame, polarized lenses)", "$100"),
    ("Michael Kors Leather Handbag (large tote, black)", "$150"),
    ("Baccarat Crystal Vase (small size, excellent condition)", "$250"),
    ("Hand-painted ceramic flower pot (with drainage hole)", "$12"),
    ("Vintage rotary phone (working condition, bright red)", "$18"),
    ("Framed print of a local landscape painting", "$15"),
    ("Set of 4 hand-blown glass coasters", "$10"),
    ("Woven bamboo storage basket (with lid)", "$14"),
    ("Small, decorative brass elephant figurine", "$16"),
    ("Hardcover cookbook: The Joy of Cooking", "$12"),
    ("Like-new copy of National Geographic magazine (special edition)", "$10"),
    ("Set of 2 vintage Pyrex coffee mugs (in original box)", "$20"),
    ("Hand-carved wooden serving spoon", "$15")
]

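def write_jsonl(num_examples: int, filename: str):
    """Generate `num_examples` preference pairs and write them to `filename`, one JSON object per line."""
    examples = []
    for i in range(num_examples):  # `i` avoids shadowing the built-in `iter`
        print(i, end=" ... ")
        # Items are sampled with replacement, so the same prompt can appear more than once.
        item, price = random.choice(items_for_sale)
        example = create_preference_example(item, price)
        examples.append(example)

    # The file is written only after all examples have been generated.
    with open(filename, "w") as ofp:
        for example in examples:
            json.dump(example, ofp)
            ofp.write('\n')

    return examples, filename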
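Because `write_jsonl` collects every example in memory and writes the file at the end, a failure late in a long run (for example, an API error) would lose all earlier work. A minimal sketch of one alternative writes each line as soon as it is produced. `write_jsonl_incrementally` and `fake_example` are hypothetical; the stand-in generator replaces `create_preference_example` so the sketch runs without a GPU or API key.

```python
import json
import os
import tempfile
from typing import Callable


def write_jsonl_incrementally(num_examples: int, filename: str,
                              make_example: Callable[[], dict]) -> int:
    """Hypothetical variant: write each example to disk as soon as it exists."""
    written = 0
    with open(filename, "w") as ofp:
        for _ in range(num_examples):
            example = make_example()
            ofp.write(json.dumps(example) + "\n")
            ofp.flush()  # push the line out so a later crash does not lose it
            written += 1
    return written


def fake_example() -> dict:
    # Stand-in for create_preference_example(item, price).
    return {"prompt": "p", "chosen": "good ad", "rejected": "bad ad"}


demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
num_written = write_jsonl_incrementally(3, demo_path, fake_example)
with open(demo_path) as ifp:
    demo_lines = ifp.readlines()
print(num_written, len(demo_lines))
```

In the notebook, `make_example` could be a lambda that picks a random item and calls `create_preference_example`.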

In [24]:
examples, filename = write_jsonl(2, "ad_preference_dataset.jsonl")

0 ... 1 ... 

In [25]:
len(examples)

2

In [26]:
!head *.jsonl

{"prompt": "You are writing a one paragraph classified ad for used items in a neighborhood online forum. Write an ad to sell a Woven bamboo storage basket (with lid) priced at $14", "chosen": "Introducing the perfect storage solution - the Woven Bamboo Storage Basket with Lid! This stylish and functional item is sure to add a touch of elegance to any room. Made from high-quality, woven bamboo, this basket offers durability and functionality that will last for years to come. With its sleek design and sturdy construction, it's the perfect addition to your home or office. Shop now and enjoy the convenience and beauty of our Woven Bamboo Storage Basket with Lid!", "rejected": "Looking for a stylish and functional storage solution? Look no further! Our woven bamboo storage basket is the perfect addition to your home or office. Made from high-quality, sustainable bamboo, this basket features a sturdy frame with a lid that can be easily removed for easy cleaning. The basket is also designed t

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


## Generate the full dataset

In [27]:
write_jsonl(100, "ad_preference_dataset.jsonl");

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


0 ... 1 ... 2 ... 3 ... 4 ... 5 ... 6 ... 7 ... 8 ... 9 ... 10 ... 11 ... 12 ... 13 ... 14 ... 15 ... 16 ... 17 ... 18 ... 19 ... 20 ... 21 ... 22 ... 23 ... 24 ... 25 ... 26 ... 27 ... 28 ... 29 ... 30 ... 31 ... 32 ... 33 ... 34 ... 35 ... 36 ... 37 ... 38 ... 39 ... 40 ... 41 ... 42 ... 43 ... 44 ... 45 ... 46 ... 47 ... 48 ... 49 ... 50 ... 51 ... 52 ... 53 ... 54 ... 55 ... 56 ... 57 ... 58 ... 59 ... 60 ... 61 ... 62 ... 63 ... 64 ... 65 ... 66 ... 67 ... 68 ... 69 ... 70 ... 71 ... 72 ... 73 ... 74 ... 75 ... 76 ... 77 ... 78 ... 79 ... 80 ... 81 ... 82 ... 83 ... 84 ... 85 ... 86 ... 87 ... 88 ... 89 ... 90 ... 91 ... 92 ... 93 ... 94 ... 95 ... 96 ... 97 ... 98 ... 99 ... 

In [28]:
!wc -l *.jsonl

100 ad_preference_dataset.jsonl
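Beyond the line count, a quick sanity check can confirm that every row has the `prompt`, `chosen`, and `rejected` keys, and can report how many pairs are identical and how many distinct prompts there are (items are drawn with `random.choice`, so prompts can repeat). This is a sketch only: `summarize_preference_file` is a hypothetical helper, and the demo runs on a tiny temporary file rather than the real dataset.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"prompt", "chosen", "rejected"}


def summarize_preference_file(filename: str) -> dict:
    """Hypothetical helper: basic sanity statistics for a preference JSONL file."""
    num_rows = 0
    num_identical = 0
    prompts = set()
    with open(filename) as ifp:
        for line_number, line in enumerate(ifp, start=1):
            if not line.strip():
                continue  # ignore blank lines
            row = json.loads(line)
            missing = REQUIRED_KEYS - row.keys()
            if missing:
                raise ValueError(f"line {line_number} is missing keys: {sorted(missing)}")
            num_rows += 1
            if row["chosen"] == row["rejected"]:
                num_identical += 1  # such a pair carries no preference signal
            prompts.add(row["prompt"])
    return {"rows": num_rows,
            "identical_pairs": num_identical,
            "distinct_prompts": len(prompts)}


# Demo on made-up rows written to a temporary file.
demo_rows = [
    {"prompt": "sell a bike", "chosen": "ad 1", "rejected": "ad 2"},
    {"prompt": "sell a bike", "chosen": "ad 3", "rejected": "ad 3"},
    {"prompt": "sell a lamp", "chosen": "ad 4", "rejected": "ad 5"},
]
demo_file = os.path.join(tempfile.mkdtemp(), "demo_preferences.jsonl")
with open(demo_file, "w") as ofp:
    for row in demo_rows:
        ofp.write(json.dumps(row) + "\n")

summary = summarize_preference_file(demo_file)
print(summary)
```

On the real file, the call would be `summarize_preference_file("ad_preference_dataset.jsonl")`.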


