# "THE PRICE IS RIGHT" Capstone Project

This week we build a model that estimates how much a product costs from its description, using a scrape of Amazon data.

# Order of play

DAY 1: Data Curation  
DAY 2: Data Pre-processing  
DAY 3: Evaluation, Baselines, Traditional ML  
DAY 4: Deep Learning and LLMs  
DAY 5: Fine-tuning a Frontier Model  

## DAY 2: Data Pre-processing

Today we'll rewrite the products into a standard format.  
LLMs are great at this!


<table style="margin: 0; text-align: left;">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/business.jpg" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#181;">Business value of Data Pre-processing / Re-writing</h2>
            <span style="color:#181;">LLMs have made it simple to do something that was considered impossible only a few years ago.
            This approach can be applied to almost any business vertical, and it's similar to the advanced techniques
            we used in Week 5.</span>
        </td>
    </tr>
</table>

In [33]:
from litellm import completion
from dotenv import load_dotenv
import json
from pricer.batch import Batch
from pricer.items import Item

load_dotenv(override=True)

True

# The next cell is where you choose the dataset

Use `LITE_MODE = True` for the free, fast version with a training set of 20,000 items

Use `LITE_MODE = False` for the powerful, full version with a training set of 800,000 items

## For this lab

You can skip pre-processing altogether and load the finished dataset from HuggingFace: $0

You can run pre-processing for the lite dataset: under $1

You can run pre-processing for the full dataset: under $30

In [34]:
LITE_MODE = True

In [35]:
username = "ed-donner"
dataset = f"{username}/items_raw_lite" if LITE_MODE else f"{username}/items_raw_full"

train, val, test = Item.from_hub(dataset)

items = train + val + test

print(f"Loaded {len(items):,} items")
print(items[0])

Loaded 22,000 items
title='Schlage F59 AND 613 Andover Interior Knob with Deadbolt, Oil Rubbed Bronze (Interior Half Only)' category='Tools_and_Home_Improvement' price=64.3 full='Schlage F59 AND 613 Andover Interior Knob with Deadbolt, Oil Rubbed Bronze (Interior Half Only)\n[\'From the Manufacturer\', "When you have a Schlage handleset on your front door, you ensure your security as well as your peace of mind. After all, we\'re the leader in security devices, trusted for over 85 years. All Schlage handlesets are precision engineered, featuring 100% solid"]\n[\'Interior half only\', \'Requires F58 to complete handle set\', \'Non handed knob style\', \'4" minimum center to center door prep required for this two piece model.\', \'Lifetime Mechanical and Finish Warranty\']\n{"Material": "Metal", "Brand": "", "Color": "Oil Rubbed Bronze", "Exterior Finish": "Bronze", "Special Feature": "Easy to Install", "Age Range (Description)": "Adult", "Included Components": "Deadbolt, Knob", "Item Wei

In [36]:
# Give every item an id

for index, item in enumerate(items):
    item.id = index

In [37]:
items[2].id

2
In [38]:


SYSTEM_PROMPT = """Create a concise description of a product. Respond only in this format. Do not include part numbers.
Title: Rewritten short precise title
Category: eg Electronics
Brand: Brand name
Description: 1 sentence description
Details: 1 sentence on features"""

In [39]:
print(items[0].full)

Schlage F59 AND 613 Andover Interior Knob with Deadbolt, Oil Rubbed Bronze (Interior Half Only)
['From the Manufacturer', "When you have a Schlage handleset on your front door, you ensure your security as well as your peace of mind. After all, we're the leader in security devices, trusted for over 85 years. All Schlage handlesets are precision engineered, featuring 100% solid"]
['Interior half only', 'Requires F58 to complete handle set', 'Non handed knob style', '4" minimum center to center door prep required for this two piece model.', 'Lifetime Mechanical and Finish Warranty']
{"Material": "Metal", "Brand": "", "Color": "Oil Rubbed Bronze", "Exterior Finish": "Bronze", "Special Feature": "Easy to Install", "Age Range (Description)": "Adult", "Included Components": "Deadbolt, Knob", "Item Weight": "1.5 pounds", "Handle Material": "Bronze", "Package Type": "Standard Packaging", "Unit Count": "1.0 Count", "Number of Items": "1", "Manufacturer": "Schlage", "Product Dimensions": "8.1 x 4

In [40]:
messages = [{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": items[0].full}]
response = completion(messages=messages, model="groq/openai/gpt-oss-20b", reasoning_effort="low")

print(response.choices[0].message.content)
print()
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Cost: {response._hidden_params['response_cost']*100:.3f} cents")


Title: Schlage F59 Interior Knob with Deadbolt – Oil Rubbed Bronze (Half Only)  
Category: Security Hardware  
Brand: Schlage  
Description: A sleek oil‑rubbed bronze interior knob with integrated deadbolt for enhanced home security.  
Details: Easy to install, 4” center‑to‑center clearance, and backed by a lifetime mechanical and finish warranty.

Input tokens: 446
Output tokens: 93
Cost: 0.006 cents


In [41]:

messages = [{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": items[0].full}]
response = completion(messages=messages, model="ollama/llama3.2", api_base="http://localhost:11434")
print(response.choices[0].message.content)
print()
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Cost: {response._hidden_params['response_cost']*100:.3f} cents")


### Product:
Title: Schlage Oil Rubbed Bronze Interior Knob with Deadbolt
Category: Hardware and Home Improvement
Brand: Schlage
Description: A reliable oil rubbed bronze interior knob with deadbolt for enhanced home security.
Details: Features a precision-engineered design with a lifetime mechanical and finish warranty.

Input tokens: 406
Output tokens: 64
Cost: 0.000 cents
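
Both replies above use the five labelled fields from `SYSTEM_PROMPT`, but the Llama reply adds a stray `### Product:` heading. If you ever need the fields individually, a tolerant parser is one option. The sketch below is not part of the `pricer` package; `FIELDS` and `parse_summary` are hypothetical names introduced here for illustration.

```python
# Hypothetical helper: split a model reply into the fields named in SYSTEM_PROMPT.
FIELDS = ("Title", "Category", "Brand", "Description", "Details")


def parse_summary(text):
    """Return a dict of recognised fields; lines that don't match are ignored."""
    parsed = {}
    for raw_line in text.splitlines():
        # Split on the first colon only, so colons inside the value survive
        key, sep, value = raw_line.strip().partition(":")
        if sep and key in FIELDS:
            parsed[key] = value.strip()
    return parsed


# A shortened copy of the Llama reply, including its stray heading
reply = """### Product:
Title: Schlage Oil Rubbed Bronze Interior Knob with Deadbolt
Category: Hardware and Home Improvement
Brand: Schlage"""

fields = parse_summary(reply)
missing = [name for name in FIELDS if name not in fields]
print(fields["Brand"])
print(missing)
```

The `missing` list gives a cheap way to spot replies that dropped a field and may be worth re-running.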


In [42]:
MODEL = "openai/gpt-oss-20b"


In [43]:
def make_jsonl(item):
    body = {"model": MODEL, "messages": [{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": item.full}], "reasoning_effort": "low"}
    line = {"custom_id": str(item.id), "method": "POST", "url": "/v1/chat/completions", "body": body}
    return json.dumps(line)

In [44]:
items[0]

<Schlage F59 AND 613 Andover Interior Knob with Deadbolt, Oil Rubbed Bronze (Interior Half Only) = $64.3>

In [45]:
make_jsonl(items[0])

'{"custom_id": "0", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "openai/gpt-oss-20b", "messages": [{"role": "system", "content": "Create a concise description of a product. Respond only in this format. Do not include part numbers.\\nTitle: Rewritten short precise title\\nCategory: eg Electronics\\nBrand: Brand name\\nDescription: 1 sentence description\\nDetails: 1 sentence on features"}, {"role": "user", "content": "Schlage F59 AND 613 Andover Interior Knob with Deadbolt, Oil Rubbed Bronze (Interior Half Only)\\n[\'From the Manufacturer\', \\"When you have a Schlage handleset on your front door, you ensure your security as well as your peace of mind. After all, we\'re the leader in security devices, trusted for over 85 years. All Schlage handlesets are precision engineered, featuring 100% solid\\"]\\n[\'Interior half only\', \'Requires F58 to complete handle set\', \'Non handed knob style\', \'4\\" minimum center to center door prep required for this two piece m

In [46]:

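def make_file(start, end, filename):
    """Write one JSONL request line for each of items[start:end] to filename."""
    # The target directory (here jsonl/) must already exist
    with open(filename, "w", encoding="utf-8") as f:
        for i in range(start, end):
            f.write(make_jsonl(items[i]))
            f.write("\n")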

In [47]:
make_file(0, 1000, "jsonl/0_1000.jsonl")
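
Before uploading a file like this, it can be worth checking that every line is valid JSON, carries the four keys that `make_jsonl` writes, and has a unique `custom_id`. The following is a minimal sketch of such a check, not something the notebook or the Groq client provides; `REQUIRED_KEYS` and `check_batch_lines` are hypothetical names.

```python
import json

# The four top-level keys that make_jsonl writes for each request
REQUIRED_KEYS = {"custom_id", "method", "url", "body"}


def check_batch_lines(lines):
    """Return a list of problems; an empty list means the lines look well-formed."""
    problems, seen = [], set()
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {number}: not valid JSON")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: not a JSON object")
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append(f"line {number}: missing {sorted(missing)}")
        custom_id = record.get("custom_id")
        if custom_id in seen:
            problems.append(f"line {number}: duplicate custom_id {custom_id!r}")
        seen.add(custom_id)
    return problems


# Small demonstration with made-up lines
good = json.dumps({"custom_id": "0", "method": "POST", "url": "/v1/chat/completions", "body": {}})
problems = check_batch_lines([good, good, '{"custom_id": "1"}', "not json"])
for problem in problems:
    print(problem)
```

To check the real file, you could pass the open file handle as `lines`, since iterating a text file yields one line at a time.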

In [48]:
import os
from groq import Groq

groq = Groq(api_key=os.environ.get("GROQ_API_KEY"))

In [49]:

with open("jsonl/0_1000.jsonl", "rb") as f:
    response = groq.files.create(file=f, purpose="batch")
response

FileCreateResponse(id='file_01kg5sgbcjebnrmb1cc2bvxr6b', bytes=2231443, created_at=1769721048, filename='0_1000.jsonl', object='file', purpose='batch', size=0, md5='mFTQtRuU7PloLJh0WmavBA==', content_type='application/jsonl')

In [50]:
file_id = response.id
file_id

'file_01kg5sgbcjebnrmb1cc2bvxr6b'

In [51]:
response = groq.batches.create(completion_window="24h", endpoint="/v1/chat/completions", input_file_id=file_id)
response

BatchCreateResponse(id='batch_01kg5sgfqmfn0bnx4ct567pxnz', completion_window='24h', created_at=1769721052, endpoint='/v1/chat/completions', input_file_id='file_01kg5sgbcjebnrmb1cc2bvxr6b', object='batch', status='validating', cancelled_at=None, cancelling_at=None, completed_at=None, error_file_id=None, errors=None, expired_at=None, expires_at=1769807452, failed_at=None, finalizing_at=None, in_progress_at=None, metadata=None, output_file_id=None, request_counts=RequestCounts(completed=0, failed=0, total=0), project_id='project_01kg5k2wq9fdgv7mmkbc564ed5')

In [54]:
result = groq.batches.retrieve(response.id)
result

BatchRetrieveResponse(id='batch_01kg5sgfqmfn0bnx4ct567pxnz', completion_window='24h', created_at=1769721052, endpoint='/v1/chat/completions', input_file_id='file_01kg5sgbcjebnrmb1cc2bvxr6b', object='batch', status='completed', cancelled_at=None, cancelling_at=None, completed_at=1769721067, error_file_id=None, errors=None, expired_at=None, expires_at=1769807452, failed_at=None, finalizing_at=1769721066, in_progress_at=1769721057, metadata=None, output_file_id='file_01kg5sgwzvfzhtewrgr2wy3bdb', request_counts=RequestCounts(completed=1000, failed=0, total=1000), project_id='project_01kg5k2wq9fdgv7mmkbc564ed5')
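
Re-running the retrieve cell by hand works for one batch, but a small polling loop saves the effort. The sketch below takes the retrieve function as a parameter so it can be tried without network access; `wait_for_batch` and `TERMINAL_STATUSES` are hypothetical names, and the set of terminal statuses is an assumption inferred from the `failed_at`, `expired_at` and `cancelled_at` fields in the response above.

```python
import time
from types import SimpleNamespace

# Assumption: these are the statuses after which a batch will not change again
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(retrieve, batch_id, poll_seconds=30.0, max_polls=120):
    """Call retrieve(batch_id) until the batch reaches a terminal status."""
    for _ in range(max_polls):
        batch = retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        time.sleep(poll_seconds)
    raise TimeoutError(f"Batch {batch_id} still running after {max_polls} polls")


# Offline demonstration with a stand-in for groq.batches.retrieve
statuses = iter(["validating", "in_progress", "completed"])


def fake_retrieve(batch_id):
    return SimpleNamespace(id=batch_id, status=next(statuses))


done = wait_for_batch(fake_retrieve, "batch_demo", poll_seconds=0)
print(done.status)
```

In the notebook, the equivalent call would be `wait_for_batch(groq.batches.retrieve, response.id)`, using the `response` returned by `groq.batches.create`.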

In [55]:
response = groq.files.content(result.output_file_id)
response.write_to_file("jsonl/batch_results.jsonl")

In [56]:
with open("jsonl/batch_results.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        json_line = json.loads(line)
        # custom_id is the item id we wrote in make_jsonl, so it indexes straight into items
        item_id = int(json_line["custom_id"])
        summary = json_line["response"]["body"]["choices"][0]["message"]["content"]
        items[item_id].summary = summary
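
The loop above assumes every line holds a successful response, which was true here (1,000 completed, 0 failed). A more defensive reader might collect failures separately so they can be resubmitted. This is a sketch under an assumption: that a failed request appears as a line whose `response` is missing or null, which this notebook does not confirm. `read_summaries` is a hypothetical helper.

```python
import json


def read_summaries(lines):
    """Return ({item_id: summary}, [custom_ids without a usable summary])."""
    summaries, failed = {}, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        custom_id = record["custom_id"]
        try:
            content = record["response"]["body"]["choices"][0]["message"]["content"]
        except (KeyError, IndexError, TypeError):
            content = None  # assumed shape of a failed request
        if content:
            summaries[int(custom_id)] = content
        else:
            failed.append(custom_id)
    return summaries, failed


# Made-up result lines: one success and one assumed failure
ok_line = json.dumps(
    {"custom_id": "0", "response": {"body": {"choices": [{"message": {"content": "Title: A"}}]}}}
)
bad_line = json.dumps({"custom_id": "1", "response": None, "error": {"message": "boom"}})

summaries, failed = read_summaries([ok_line, bad_line])
print(summaries)
print(failed)
```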


In [57]:
print(items[0].full)

Schlage F59 AND 613 Andover Interior Knob with Deadbolt, Oil Rubbed Bronze (Interior Half Only)
['From the Manufacturer', "When you have a Schlage handleset on your front door, you ensure your security as well as your peace of mind. After all, we're the leader in security devices, trusted for over 85 years. All Schlage handlesets are precision engineered, featuring 100% solid"]
['Interior half only', 'Requires F58 to complete handle set', 'Non handed knob style', '4" minimum center to center door prep required for this two piece model.', 'Lifetime Mechanical and Finish Warranty']
{"Material": "Metal", "Brand": "", "Color": "Oil Rubbed Bronze", "Exterior Finish": "Bronze", "Special Feature": "Easy to Install", "Age Range (Description)": "Adult", "Included Components": "Deadbolt, Knob", "Item Weight": "1.5 pounds", "Handle Material": "Bronze", "Package Type": "Standard Packaging", "Unit Count": "1.0 Count", "Number of Items": "1", "Manufacturer": "Schlage", "Product Dimensions": "8.1 x 4

In [58]:
print(items[1000].summary)

None


## I've put exactly this logic into a Batch class

- Divides items into groups of 1,000
- Kicks off batches for each
- Allows us to monitor and collect the results when complete
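
The `Batch` class itself is not shown in this notebook, so the following is only a sketch of the first step: turning a list length into `(start, end)` ranges that could be passed to `make_file`. `chunk_ranges` is a hypothetical helper, and the filename pattern simply mirrors the `jsonl/0_1000.jsonl` name used earlier.

```python
def chunk_ranges(total, size=1000):
    """Return (start, end) pairs covering range(total) in steps of size."""
    return [(start, min(start + size, total)) for start in range(0, total, size)]


ranges = chunk_ranges(2500)
print(ranges)

# Filenames in the same style as the manual example
filenames = [f"jsonl/{start}_{end}.jsonl" for start, end in ranges]
print(filenames[0])
```

The last range is shorter when the total is not a multiple of the chunk size, so no item is dropped or repeated.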

## COSTS

On Groq, this cost me under $1 for the lite dataset and under $30 for the full dataset.

But you don't need to pay anything! In the next lab, you can load my pre-processed results.

In [59]:
Batch.create(items, LITE_MODE)

Created 44 batches


In [60]:
Batch.run()

  0%|          | 0/44 [00:00<?, ?it/s]

Submitted 44 batches


In [64]:
Batch.fetch()

  0%|          | 0/44 [00:00<?, ?it/s]

Finished 22 of 44 batches


In [65]:
for index, item in enumerate(items):
    if not item.summary:
        print(index)

In [66]:
print(items[10234].summary)

Title: 84x60" Red Barn Fall Fabric Backdrop  
Category: Photography Supplies  
Brand: Allenjoy  
Description: A large, high‑resolution red barn backdrop featuring pumpkins, hay, and a scarecrow for autumn or western themed shoots.  
Details: Made from wrinkle‑resistant polyester, it measures 84"x60", is washable, and comes ready to hang without a stand.


In [67]:
# Remove the fields that we don't need in the hub

for item in items:
    item.full = None
    item.id = None

## Push the final dataset to the hub

In lite mode, we push only the lite dataset.

In full mode, we push both datasets (in case you decide to use lite later).

In [None]:
username = "ed-donner"
full = f"{username}/items_full"
lite = f"{username}/items_lite"

if LITE_MODE:
    train = items[:20_000]
    val = items[20_000:21_000]
    test = items[21_000:]
    Item.push_to_hub(lite, train, val, test)
else:
    train = items[:800_000]
    val = items[800_000:810_000]
    test = items[810_000:]
    Item.push_to_hub(full, train, val, test)

    train_lite = train[:20_000]
    val_lite = val[:1_000]
    test_lite = test[:1_000]
    Item.push_to_hub(lite, train_lite, val_lite, test_lite)

## And here they are!

https://huggingface.co/datasets/ed-donner/items_lite

https://huggingface.co/datasets/ed-donner/items_full
