Humans vs. The Frontier: Setup
==============================

Approaches:

- Have a human read 250 product descriptions and guess the cost of each.
- Create a system prompt that tells the LLM to estimate the price of a product and reply with just the price and no explanation.

In the user prompt, strip out the " to the nearest dollar" text, since frontier models are far more capable than traditional ML models and do not need the problem simplified. Also remove the "Price is $" text so that it can be used in the assistant prompt instead.
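
As a rough illustration of that stripping step, here is a minimal sketch. The `sample_prompt` string is made up for this example; it only assumes the raw prompt has the shape implied by the two pieces of text being removed.

```python
# Hypothetical raw prompt, shaped like the text the two replacements expect
sample_prompt = (
    "How much does this cost to the nearest dollar?\n\n"
    "Example Widget\nA short product description.\n\n"
    "Price is $"
)

# Drop the rounding hint and the trailing "Price is $" suffix
user_prompt = (
    sample_prompt
    .replace(" to the nearest dollar", "")
    .replace("\n\nPrice is $", "")
)

print(repr(user_prompt))
```

The removed "Price is $" suffix is not thrown away: it becomes the start of the assistant message, so the model only has to complete the number.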

Example prompt:

> [{'role': 'system',<br>
>   'content': 'You estimate prices of items. Reply only with the price, no explanation'},<br>
> {'role': 'user',<br>
>  'content': "How much does this cost?\n\nOEM AC Compressor w/A/C Repair Kit For Ford F150 F-150 V8 & Lincoln Mark LT 2007 2008 - BuyAutoParts NEW\nAs one of the world's largest automotive parts suppliers, our parts are trusted every day by mechanics and vehicle owners worldwide. This A/C Compressor and Components Kit is manufactured and tested to the strictest OE standards for unparalleled performance. Built for trouble-free ownership and 100% visually inspected and quality tested, this A/C Compressor and Components Kit is backed by our 100% satisfaction guarantee. Guaranteed Exact Fit for easy installation 100% BRAND NEW, premium ISO/TS 16949 quality - tested to meet or exceed OEM specifications Engineered for superior durability, backed by industry-leading unlimited-mileage warranty Included in this K"},<br>
> {'role': 'assistant', 'content': 'Price is $'}]

# Dependencies

In [None]:
# imports

import os
import re
import math
import json
import random
from dotenv import load_dotenv
from huggingface_hub import login
import matplotlib.pyplot as plt
import numpy as np
import pickle
from collections import Counter
from openai import OpenAI
from anthropic import Anthropic

# Setup

In [None]:
# environment

load_dotenv(override=True)
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')
os.environ['ANTHROPIC_API_KEY'] = os.getenv('ANTHROPIC_API_KEY', 'your-key-if-not-using-env')
os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'your-key-if-not-using-env')

In [None]:
# Log in to HuggingFace

hf_token = os.environ['HF_TOKEN']
login(hf_token, add_to_git_credential=True)

In [None]:
# move Tester Harness into a separate package
# call it with Tester.test(function_name, test_dataset)
import sys
sys.path.append('../testing/')
from items import Item
from testing import Tester

In [None]:
openai = OpenAI()
claude = Anthropic()

In [None]:
%matplotlib inline

# Load Curated Datasets

Load in the pickle files created during data curation.

In [None]:
# Load in the pickle files:

with open('../data/large-datasets/train.pkl', 'rb') as file:
    train = pickle.load(file)

with open('../data/large-datasets/test.pkl', 'rb') as file:
    test = pickle.load(file)

# Human Baselines

<img src="./../images/Product-Pricer-Baseline-Human.jpg" alt="Distribution of Prices Predicted by a Human" />

**Result**: The human did better than Word2Vec with Linear Regression but not quite as well as Bag of Words with Linear Regression. Another human may, of course, do a lot better or worse.

- Human Pricer Error=$126.55
- RMSLE=1.00
- Hits=32.0%
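
For readers unfamiliar with the metrics, here is a minimal sketch of how an average absolute error and an RMSLE can be computed. It uses the standard definitions and made-up numbers; the `Tester` implementation is not shown in this notebook, so treat this as an assumption about what it reports rather than its actual code. "Hits" is left out because its threshold is defined inside `Tester`.

```python
import math

def mean_absolute_error(guesses, truths):
    # Average dollar distance between each guess and the true price
    return sum(abs(g - t) for g, t in zip(guesses, truths)) / len(truths)

def rmsle(guesses, truths):
    # Root mean squared log error: penalises relative rather than absolute misses
    squared_log_errors = [
        (math.log1p(g) - math.log1p(t)) ** 2 for g, t in zip(guesses, truths)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Made-up guesses and true prices, for illustration only
guesses = [100.0, 20.0, 350.0]
truths = [120.0, 25.0, 300.0]

print(mean_absolute_error(guesses, truths))  # 25.0
print(rmsle(guesses, truths))
```

Because RMSLE works on logarithms, a $50 miss on a $300 item counts for less than a $50 miss on a $60 item, which is why a model can have a higher dollar error but a lower RMSLE.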

In [None]:
# Write the test set to a CSV

import csv
with open('../testing/human_input.csv', 'w', encoding="utf-8", newline='') as csvfile:
    writer = csv.writer(csvfile)
    for t in test[:250]:
        writer.writerow([t.test_prompt(), 0])

In [None]:
# Read it back in

human_predictions = []
with open('../testing/human_output.csv', 'r', encoding="utf-8") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        human_predictions.append(float(row[1]))

In [None]:
def human_pricer(item):
    idx = test.index(item)
    return human_predictions[idx]

In [None]:
Tester.test(human_pricer, test)

# Frontier Models

Pass a `seed` to tell GPT that the output should be reproducible. With all frontier models, keep `max_tokens` small: the system, user and assistant prompts are crafted so that the reply is just a price, and a low limit also keeps the costs down.

## GPT-4o-mini

<img src="./../images/Product-Pricer-Baseline-LLM-GPT-4o-mini.jpg" alt="Distribution of Prices Predicted Using GPT-4o-mini" />

**Result**: Does much better than all the ML models even without training data.

- GPT 4o Mini Pricer Error=$78.51
- RMSLE=0.59
- Hits=51.6%

Note: there are still lots of errors and very few exact guesses, so we appear to be safe from test "contamination" here.

In [None]:
# First let's work on a good prompt for a Frontier model
# Notice that I'm removing the " to the nearest dollar"
# When we train our own models, we'll need to make the problem as easy as possible, 
# but a Frontier model needs no such simplification.

def messages_for(item):
    system_message = "You estimate prices of items. Reply only with the price, no explanation"
    user_prompt = item.test_prompt().replace(" to the nearest dollar","").replace("\n\nPrice is $","")
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": "Price is $"}
    ]

In [None]:
# Try this out

messages_for(test[0])

In [None]:
# A utility function to extract the price from a string

def get_price(s):
    s = s.replace('$','').replace(',','')
    match = re.search(r"[-+]?\d*\.\d+|\d+", s)
    return float(match.group()) if match else 0

In [None]:
get_price("The price is roughly $99.99 because blah blah")
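
Because the model calls in this notebook cap replies at `max_tokens=5`, a reply can be cut off mid-number or contain no number at all. The sketch below repeats the same `get_price` logic so it runs on its own and tries it on a few made-up replies; the sample strings are illustrative, not real model output.

```python
import re

def get_price(s):
    # Same logic as the helper above, repeated so this block is self-contained
    s = s.replace('$', '').replace(',', '')
    match = re.search(r"[-+]?\d*\.\d+|\d+", s)
    return float(match.group()) if match else 0

# Made-up replies: a clean price, a thousands separator,
# a reply truncated after the decimal point, and a reply with no number
samples = ["249.99", "1,299", "129.", "I cannot say"]
for s in samples:
    print(repr(s), "->", get_price(s))
```

Note that a reply with no number falls back to 0, which is scored as a (large) error rather than raising an exception.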

In [None]:
# The function for gpt-4o-mini

def gpt_4o_mini(item):
    response = openai.chat.completions.create(
        model="gpt-4o-mini", 
        messages=messages_for(item),
        seed=42,
        max_tokens=5
    )
    reply = response.choices[0].message.content
    return get_price(reply)

In [None]:
test[0].price

In [None]:
Tester.test(gpt_4o_mini, test)

## GPT 4o

<img src="./../images/Product-Pricer-Baseline-LLM-GPT-4o-2024-08-06.jpg" alt="Distribution of Prices Predicted Using GPT 4o" />

**Result**: Does much better than all the ML models but, surprisingly, not as well as GPT-4o-mini on average error and RMSLE, despite a higher hit rate.

- GPT 4o Pricer Error=$81.29
- RMSLE=0.86
- Hits=56.64%

Note: there are still lots of errors and very few exact guesses, so we appear to be safe from test "contamination" here.

In [None]:
def gpt_4o_frontier(item):
    response = openai.chat.completions.create(
        model="gpt-4o-2024-08-06", 
        messages=messages_for(item),
        seed=42,
        max_tokens=5
    )
    reply = response.choices[0].message.content
    return get_price(reply)

In [None]:
# Run the test for gpt-4o - the August model
# Note that it cost me about 1-2 cents to run this (pricing may vary by region)
# You can skip this and look at my results instead

Tester.test(gpt_4o_frontier, test)

## Claude 3.5 Sonnet

<img src="./../images/Product-Pricer-Baseline-LLM-Claude-3-5-Sonnet-2024-06-20.jpg" alt="Distribution of Prices Predicted Using Claude 3.5 Sonnet" />

**Result**: Does much better than all the ML models but not quite as well as the GPT models on average error, although it has the lowest RMSLE of the three.

- Claude 3.5 Sonnet Pricer Error=$82.70
- RMSLE=0.55
- Hits=50.0%

Note: there are still lots of errors and very few exact guesses, so we appear to be safe from test "contamination" here.

In [None]:
def claude_3_point_5_sonnet(item):
    messages = messages_for(item)
    system_message = messages[0]['content']
    messages = messages[1:]
    response = claude.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=5,
        system=system_message,
        messages=messages
    )
    reply = response.content[0].text
    return get_price(reply)

In [None]:
# Run the test for Claude 3.5 Sonnet
# It also cost me about 1-2 cents to run this (pricing may vary by region)
# You can skip this and look at my results instead

Tester.test(claude_3_point_5_sonnet, test)

# Todo

Test more frontier models, including the latest versions that were just released.