## Generate synthetic data for DPO

This notebook generates a synthetic preference dataset for DPO: a small language model writes two classified ads for the same item, and Gemini judges which ad of each pair is better.

In [1]:
#%pip install --upgrade --quiet google-genai transformers

In [1]:
# MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
MODEL_ID = "Qwen/Qwen2-0.5B-Instruct"

import os
from dotenv import load_dotenv
load_dotenv("../keys.env")
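# .get() avoids a bare KeyError when the token is missing, so the message below is actually shown
assert os.environ.get("HF_TOKEN", "").startswith("hf"),\
       "Please provide a HuggingFace access token in keys.env file (the Llama models also require signing up for access via HuggingFace)"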

os.environ['PYTORCH_CUDA_ALLOC_CONF']='expandable_segments:True'

In [2]:
from transformers import pipeline

pipe = pipeline(
    task="text-generation", 
    model=MODEL_ID,
    use_fast=True,
    kwargs={
        "return_full_text": False,
    },
    model_kwargs={}
)

Device set to use cuda:0


In [6]:
SYSTEM_PROMPT = """
        You are a resident who is listing a used item for sale on a neighborhood online group.
        An ad for used items in this neighborhood group is 1-3 sentences. 
"""

## Create an ad for an item

In [7]:
import random
def create_classified_ad(item: str, price: str) -> str:
    system_prompt = SYSTEM_PROMPT
    user_prompt = f"""
        Write an ad to sell a {item} priced at {price}
    """

    input_message = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}   
    ]
    
    results = pipe(input_message, 
                   max_new_tokens=256,
                   temperature=random.uniform(0.2, 0.9),
                   pad_token_id=pipe.tokenizer.eos_token_id
                  )
    return results[0]['generated_text'][-1]['content'].strip()

example_ad1 = create_classified_ad("3-year old Specialized bike", "$300")
example_ad1

'Ad Title: "Welcome to the Adventure with our Specialized Bike!"\n\nSpecialized bikes have been a favorite among mountain bikers for years, and they\'re here again! Our latest model, the Specialized 3-Year Old, offers unparalleled performance and durability. This bike has been designed with safety and comfort in mind, making it perfect for beginners or seasoned riders alike.\n\nThe 3-Year Old features a sleek and lightweight frame that allows you to enjoy your rides without feeling like you\'re carrying a heavy load. The sturdy handlebars provide great control while pedaling, ensuring a smooth ride. The quick release system makes it easy to change the chain if needed, which is a must-have feature for any serious rider.\n\nIn addition to its impressive performance, the 3-Year Old comes equipped with all the necessary components for a long-lasting adventure. It includes a high-quality carbon fork, a durable headset, and a set of tires made from high-quality materials. These components wo

In [8]:
example_ad2 = create_classified_ad("3-year old Specialized bike", "$300")
example_ad2

'Ad Title: "Find the Perfect Bike for Your Kids - Specialized!"\n\nSpecialized, one of the leading brands in mountain biking, has a great selection of bikes available for purchase! Our 3-year-old Specialized bike is priced at $300 and comes with all the necessary components for a fun and safe ride.\n\nOur bikes are designed with durability in mind, ensuring that your kids can enjoy their time on the trails without any worries about breaking or losing them. The sturdy frame and durable wheels make it easy for you to navigate through the mountains, while the comfortable seat and padded handlebars provide comfort during long rides.\n\nWith our 3-year-old Specialized bike, you can be sure that your kids will have a blast exploring the outdoors. It\'s perfect for families looking for a reliable and affordable option for their children\'s outdoor adventures.\n\nDon\'t miss out on this opportunity to get a high-quality bike for your kids! Visit us today to see our selection and place your ord

For comparison, these are the two ads we got in an earlier run using Qwen-0.5B:

```
    Hey there! We're looking for someone who's ready to take their riding game to the next level with our 3-year-old Specialized bike. This bike is a great investment that will keep your kids engaged and safe all year round. It features durable components, a comfortable seat, and a powerful frame that can handle any terrain. Plus, it comes with a lifetime warranty, so you can be sure you're getting a high-quality product. So why wait? Get yours today and start enjoying the thrill of riding on wheels!
```

and
    
```
    Looking for a unique and stylish way to enjoy your daily commute? Look no further than the 3-year-old Specialized bike! This bike is perfect for those who value style over speed, and it's priced at just $300. With its durable frame and high-quality components, this bike will last you years with minimal maintenance. Plus, it comes with a lifetime warranty, so you can rest easy knowing that you're getting a quality product that won't let you down. Don't miss out on this opportunity to upgrade your bike experience today!
```
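All of the sample ads above run well past the 1-3 sentences requested in the system prompt. A rough length check could flag such ads before they are scored. The sketch below is only a heuristic: `count_sentences` and `is_concise` are hypothetical helpers that are not part of the notebook, and splitting on end punctuation will miscount abbreviations such as "approx. 3 years".

```python
import re

def count_sentences(text: str) -> int:
    """Roughly count sentences by splitting after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([part for part in parts if part])

def is_concise(ad: str, max_sentences: int = 3) -> bool:
    """True if the ad has at most max_sentences sentences (heuristic)."""
    return count_sentences(ad) <= max_sentences
```

For example, `is_concise(example_ad1)` could be used to skip or regenerate ads that are clearly too long.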

## Compare 2 ads and choose the better one

We'll use Gemini Flash to do the scoring.

In [9]:
from google import genai
from google.genai import types

GEMINI_MODEL_ID='gemini-2.0-flash'

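# .get() avoids a bare KeyError when the key is missing, so the message below is actually shown
assert os.environ.get("GEMINI_API_KEY", "").startswith("AI"),\
       "Please sign up for access to Google Gemini and provide the API key in keys.env file"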
client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))

In [10]:
from dataclasses import dataclass

@dataclass
class AdsComparison:
    ad_a_is_better_than_ad_b: bool
    reasoning: str

def score_ad(ad_a: str, ad_b: str) -> AdsComparison:
    prompt = f"""
    You are a professor of advertising at a business school.
    Compare the two ads below for the same item being sold in a neighborhood marketplace and determine whether ad_a is better than ad_b
    Also explain your reasoning.
    
    The main criteria to compare the ads include:
    * Is it clear what's being sold? Age, brand, price, and condition are important.
    * Does it target the most relevant audience for the item? Is the text persuasive to that audience?
    * Is it concise and easy to read? An ideal ad is at most 3 sentences.
    * Does it include contact information? Ideally, the ad specifies the preferred means of communication.
    * Is the ad truthful? Remember that the item is likely used and not being sold by the manufacturer.
    
    ad_a:
    {ad_a}
    
    ad_b:
    {ad_b}
    """
    response = client.models.generate_content(
        model=GEMINI_MODEL_ID,
        contents=[prompt],
        config=types.GenerateContentConfig(
            temperature=0.1,
            response_mime_type='application/json',
            response_schema=AdsComparison
        )
    )
    
    return response.parsed

In [11]:
example_score = score_ad(example_ad1, example_ad2)
example_score

AdsComparison(ad_a_is_better_than_ad_b=False, reasoning='Ad_b is better than ad_a. Ad_b mentions the price of the bike, which is an important factor for potential buyers. Ad_b also specifies that the bike is for kids, which helps target the relevant audience. Ad_a does not mention the price or target audience, making it less effective. Both ads are too verbose, but ad_b is slightly more concise and easier to read.')

The result from our earlier run:
```
AdsComparison(ad_a_is_better_than_ad_b=False, reasoning="Both ads have issues, but ad_b is slightly better because it includes the price. Neither ad includes contact information. Both ads make the mistake of claiming the bike has a lifetime warranty, which is unlikely for a used bike being sold in a neighborhood marketplace. Ad_a is targeted toward children, but ad_b is targeted toward adults. Since the bike is used, it's more likely to be purchased by an adult.")
```

## Generate data

In [12]:
def create_preference_example(item: str, price: str) -> dict:
    ad1 = create_classified_ad(item, price)
    ad2 = create_classified_ad(item, price)
    score = score_ad(ad1, ad2)
    
    preference_example = {
        "prompt": SYSTEM_PROMPT + f"""Write an ad to sell a {item} priced at {price}"""
    }
    
    if score.ad_a_is_better_than_ad_b:
        preference_example['chosen'] = ad1
        preference_example['rejected'] = ad2
    else:
        preference_example['chosen'] = ad2
        preference_example['rejected'] = ad1
    preference_example['score_reason'] = score.reasoning
    
    return preference_example

create_preference_example("3-year old Specialized bike", "$300")

{'prompt': '\n        You are a resident who is listing a used item for sale on a neighborhood online group.\n        An ad for used items in this neighborhood group is 1-3 sentences. \nWrite an ad to sell a 3-year old Specialized bike priced at $300',
 'chosen': "Dear [Name of the Neighborhood Online Group],\n\nI am writing to offer for sale my beloved Specialized bike that I have been using since it was new. This bike has been in excellent condition and has been well maintained over the years, making it a valuable asset.\n\nThe bike is currently priced at $300 and comes with all the necessary accessories such as wheels, brakes, and tires. It also includes a helmet and a set of lockers for easy access.\n\nIf you're interested in purchasing this bike, please visit our website or give us a call at [Phone Number] to arrange a time to come and inspect the bike. We look forward to hearing from you soon!\n\nThank you for considering my offer.\n\nBest regards,\n[Your Name]",
 'rejected': "De
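Note that the stored prompt carries the leading newline and indentation of the triple-quoted `SYSTEM_PROMPT`. If a cleaner prompt is wanted, one option is to dedent it before building the record. This is a sketch only: `build_prompt` is a hypothetical helper, and the prompt used at training time should match whatever is stored here.

```python
import textwrap

SYSTEM_PROMPT = """
        You are a resident who is listing a used item for sale on a neighborhood online group.
        An ad for used items in this neighborhood group is 1-3 sentences.
"""

def build_prompt(item: str, price: str) -> str:
    """Join the dedented system prompt and the user instruction into one string."""
    system = textwrap.dedent(SYSTEM_PROMPT).strip()
    return f"{system}\nWrite an ad to sell a {item} priced at {price}"
```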

The result from our earlier run:
```
{'prompt': 'You are writing a one paragraph classified ad for used items in a neighborhood online forum. Write an ad to sell a 3-year old Specialized bike priced at $300',
 'chosen': '"Experience the thrill of riding on a 3-year-old Specialized bike with our expert repair service! This bike is in top condition and has been well-maintained, so you can enjoy it for years to come. Don\'t miss out on this opportunity to upgrade your cycling experience!"',
 'rejected': "Specialized bikes are designed for those who love the outdoors and want to ride safely. These bikes have been built with durability, safety features, and comfort in mind. With their sturdy frame and lightweight wheels, you can feel confident on your next adventure. Whether you're looking to get around town or hit the trails, a Specialized bike is the perfect choice for you! Shop now and enjoy the benefits of riding like a pro.",
 'score_reason': 'Ad B is better because it specifies the age and brand of the bike, which is important information for potential buyers. It also mentions that the bike has been well-maintained and is in top condition, which can help to reassure buyers that they are getting a good value. Ad A does not specify the age, brand, or condition of the bike, which makes it less informative and less persuasive.'}
```
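Each record uses the `prompt` / `chosen` / `rejected` field names that preference-tuning libraries such as TRL commonly expect. Before training on generated data, it may be worth checking that every record is complete. The following is a minimal sketch; `validate_preference_example` is a hypothetical helper, not part of the notebook.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def validate_preference_example(example: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = []
    for key in REQUIRED_KEYS:
        value = example.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")
    # A pair with identical texts carries no preference signal.
    if not problems and example["chosen"].strip() == example["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems
```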

In [13]:
import json

# generated using Gemini with the prompt
"""
Create a synthetic dataset of 10 items for sale in a garage sale. 
Items should be in the format (item_description, cost). 
For example ("Like-new copy of Chinua Achebe's Things Fall Apart", "$5"). 
# [optional] Items should be somewhat expensive, and include details such as brand name where possible.
# [optional] Items should be in the 10-20 dollar range and be unique. Not a collection of things.
"""
items_for_sale = [
    ("3-year old Specialized road bike", "$300"),
    ("Amazing Spider-Man 361", "$200"),
    ("Like-new copy of Chinua Achebe's Things Fall Apart", "$5"),
    ("6-piece Le Creuset cookware set in good condition", "$800"),
    ("Well-used kids tricycle", "$20"),
    ("1990 vintage pair of Levi's 501 jeans size is Men's 32x32 in good condition", "$50"),
    ("Vintage Pyrex mixing bowl set (3 bowls)", "$15"),
    ("Hardwood rocking chair (good condition)", "$50"),
    ("Kids' bicycle (16-inch wheels)", "$25"),
    ("Set of 4 dining chairs (wooden)", "$40"),
    ("Large area rug (floral pattern)", "$30"),
    ("Electric coffee maker (like new)", "$10"),
    ("Box of assorted DVDs (mostly action movies)", "$20"),
    ("Ceramic table lamp (with shade)", "$12"),
    ("Gardening tools set (shovel, rake, trowel)", "$25"),
    ("Board game collection (Monopoly, Scrabble, Clue)", "$35"),
    ("Antique dresser with mirror (restored)", "$250"),
    ("Solid oak dining table with 6 chairs", "$400"),
    ("Leather sofa (minor wear)", "$300"),
    ("Vintage record player (working condition)", "$175"),
    ("Collection of antique books (various genres)", "$200"),
    ("High-end road bike (carbon fiber frame)", "$800"),
    ("Hand-knotted Persian rug (small size)", "$450"),
    ("Original oil painting (landscape scene)", "$350"),
    ("Set of vintage china (complete set)", "$150"),
    ("Designer handbag (like new)", "$200"),
    ("Ethan Allen Armchair (excellent condition, dark green leather)", "$400"),
    ("Bose QuietComfort 35 Headphones (wireless, noise-canceling, like new)", "$150"),
    ("Samsung 55-inch Smart TV (4K UHD, excellent picture)", "$350"),
    ("KitchenAid Stand Mixer (Artisan series, red, with attachments)", "$200"),
    ("Breville Barista Express Espresso Machine (stainless steel)", "$300"),
    ("Apple iPad Pro 11-inch (64GB, Wi-Fi, Space Gray)", "$450"),
    ("Sony Alpha a6000 Mirrorless Camera (with two lenses)", "$500"),
    ("Ray-Ban Aviator Sunglasses (gold frame, polarized lenses)", "$100"),
    ("Michael Kors Leather Handbag (large tote, black)", "$150"),
    ("Baccarat Crystal Vase (small size, excellent condition)", "$250"),
    ("Hand-painted ceramic flower pot (with drainage hole)", "$12"),
    ("Vintage rotary phone (working condition, bright red)", "$18"),
    ("Framed print of a local landscape painting", "$15"),
    ("Set of 4 hand-blown glass coasters", "$10"),
    ("Woven bamboo storage basket (with lid)", "$14"),
    ("Small, decorative brass elephant figurine", "$16"),
    ("Hardcover cookbook: The Joy of Cooking", "$12"),
    ("Like-new copy of National Geographic magazine (special edition)", "$10"),
    ("Set of 2 vintage Pyrex coffee mugs (in original box)", "$20"),
    ("Hand-carved wooden serving spoon", "$15")
]

def write_jsonl(num_examples: int, filename: str):
    examples = []
    for i in range(num_examples):  # avoid shadowing the built-in iter()
        print(i, end=" ... ")
        item, price = random.choice(items_for_sale)
        example = create_preference_example(item, price)
        examples.append(example)
    
    with open(filename, "w") as ofp:
        for example in examples:
            json.dump(example, ofp)
            ofp.write('\n')
            
    return examples, filename
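
Because `write_jsonl` draws items with `random.choice`, the same item can be picked more than once and the selection changes from run to run. If a repeatable selection is needed, a seeded generator is one option. The sketch below assumes a hypothetical helper `sample_items`; note that the temperature in `create_classified_ad` is still drawn from the global `random` module, so the generated ads themselves would continue to vary.

```python
import random

def sample_items(items: list[tuple[str, str]], num_examples: int, seed: int = 42) -> list[tuple[str, str]]:
    """Pick num_examples (item, price) pairs with replacement, reproducibly."""
    rng = random.Random(seed)  # private generator; leaves the global random state untouched
    return [rng.choice(items) for _ in range(num_examples)]
```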

In [14]:
examples, filename = write_jsonl(2, "ad_preference_dataset.jsonl")

0 ... 1 ... 

In [15]:
len(examples)

2

In [16]:
!head *.jsonl

{"prompt": "\n        You are a resident who is listing a used item for sale on a neighborhood online group.\n        An ad for used items in this neighborhood group is 1-3 sentences. \nWrite an ad to sell a Box of assorted DVDs (mostly action movies) priced at $20", "chosen": "Box of Action Movie Dvds - $20!\n\nLooking for some new and exciting action movies? Look no further! We have a selection of highly rated action movies that you can enjoy with your friends or family. These DVDs come in a beautiful box, perfect for storing them safely away from the home. Don't miss out on the thrill of watching these movies together! Plus, they're only $20 each, so why not grab a few for yourself? Order now before they run out!", "rejected": "\"Welcome to our community where we offer a wide selection of high-quality, action-packed DVDs. Our collection includes a variety of titles that cater to different interests and genres. Whether you're a fan of action films, sci-fi, or anything in between, we'

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
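
The warning above most likely appears because the `!head` shell command forks the notebook process after the tokenizer has already been used. As the message itself suggests, setting `TOKENIZERS_PARALLELISM` explicitly avoids it. A minimal sketch, assuming it runs near the top of the notebook before the pipeline is created:

```python
import os

# Disable tokenizer parallelism up front so later forks (e.g. `!` shell commands) do not trigger the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```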


## Generate the full dataset

In [17]:
write_jsonl(100, "ad_preference_dataset.jsonl");

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


0 ... 1 ... 2 ... 3 ... 4 ... 5 ... 6 ... 7 ... 8 ... 9 ... 10 ... 11 ... 12 ... 13 ... 14 ... 15 ... 16 ... 17 ... 18 ... 19 ... 20 ... 21 ... 22 ... 23 ... 24 ... 25 ... 26 ... 27 ... 28 ... 29 ... 30 ... 31 ... 32 ... 33 ... 34 ... 35 ... 36 ... 37 ... 38 ... 39 ... 40 ... 41 ... 42 ... 43 ... 44 ... 45 ... 46 ... 47 ... 48 ... 49 ... 50 ... 51 ... 52 ... 53 ... 54 ... 55 ... 56 ... 57 ... 58 ... 59 ... 60 ... 61 ... 62 ... 63 ... 64 ... 65 ... 66 ... 67 ... 68 ... 69 ... 70 ... 71 ... 72 ... 73 ... 74 ... 75 ... 76 ... 77 ... 78 ... 79 ... 80 ... 81 ... 82 ... 83 ... 84 ... 85 ... 86 ... 87 ... 88 ... 89 ... 90 ... 91 ... 92 ... 93 ... 94 ... 95 ... 96 ... 97 ... 98 ... 99 ... 

In [18]:
!wc -l *.jsonl

100 ad_preference_dataset.jsonl
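
To confirm that the file can be loaded again, each line can be parsed back into a dictionary. This is a sketch with a hypothetical helper `read_jsonl`, the counterpart of `write_jsonl` above:

```python
import json

def read_jsonl(filename: str) -> list[dict]:
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(filename, encoding="utf-8") as ifp:
        for line in ifp:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

Calling `read_jsonl("ad_preference_dataset.jsonl")` should then return one dictionary per line reported by `wc -l`.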


