# Leveraging an OpenAI cookbook to make some tests

In October 2024, OpenAI released a cookbook showing how to do model distillation. It used a wine-based scenario that looked like an interesting way to test various models against each other.
In this cookbook we won't do any distillation or fine-tuning (yet, maybe in a future article), but we'll look at a dataset and see how various models perform on it.

We'll also leverage **Structured Outputs** for a classification problem, using an enum list of allowed values. We'll show that **Structured Outputs** work with all of those models, using a plain JSON schema as the response format and the Ollama Python client.

We'll first analyze the dataset and get the output of llama3.2.

## Prerequisites

Let's install and load dependencies.
Make sure your API keys are defined in your .env file as "OPENAI_API_KEY", "GEMINI_API_KEY" or "OPENROUTER_API_KEY"; they'll be loaded by the scripts directly.
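
As a minimal sketch of what "loaded by the scripts" could look like, here is a `.env` reader built on the standard library only. The `load_env_file` and `missing_keys` helpers are hypothetical (they are not taken from the repository), and the `python-dotenv` package is the more usual choice for this job.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Hypothetical helper: copy KEY=VALUE pairs from a .env file into os.environ."""
    env_path = Path(path)
    if not env_path.exists():
        return {}
    loaded = {}
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments and anything that is not KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip().strip("\"'")
        # Do not override variables that are already set in the shell
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded


def missing_keys(names):
    """Return the names that are not defined (or are empty) in the environment."""
    return [name for name in names if not os.environ.get(name)]


load_env_file()
print(missing_keys(["OPENAI_API_KEY", "GEMINI_API_KEY", "OPENROUTER_API_KEY"]))
```

The Ollama calls in this notebook run locally, so an empty or partial list here only matters for the other scripts.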

In [1]:
! pip install ollama numpy pandas tqdm --quiet

In [11]:
from ollama import chat
from tqdm import tqdm
import numpy as np
import pandas as pd
from pydantic import BaseModel, Field

## Loading and understanding the dataset

For this cookbook, we'll load the data from the following Kaggle challenge: [https://www.kaggle.com/datasets/zynicide/wine-reviews](https://www.kaggle.com/datasets/zynicide/wine-reviews). You have to download it and save it in a `data` folder at the same level as this notebook.

This dataset has a large number of rows and you're free to run this cookbook on the whole data, but as a biased Italian wine lover, I'll narrow it down to Italian wines only, to focus on fewer rows and grape varieties. The original article used French wines, and the tested LLMs are better at guessing those. More on the results later. I made the variable generic so that you can change the country and test with wines you know/love (keep Italian in this case 😂)

We're looking at a classification problem: guess the grape variety from all the other available fields, including the description, subregion and province, which we'll include in the prompt. This gives the model a lot of information; feel free to remove fields that help it significantly, such as the region where the wine was produced, to see whether it still does a good job of finding the grape.

Let's filter out the grape varieties that have fewer than 5 occurrences in the reviews.

Let's proceed with a subset of 500 random rows from this dataset.

In [5]:
df = pd.read_csv('data/winemag-data-130k-v2.csv')
df_country = df[df['country'] == 'Italy']

# Let's also filter out wines whose grape variety has fewer than 5 reviews. Even though we'd like to find those,
# they're outliers that we don't want to optimize for: they would make our enum list too long
# and could add noise to the rest of the dataset, eventually reducing our accuracy.

varieties_less_than_five_list = df_country['variety'].value_counts()[df_country['variety'].value_counts() < 5].index.tolist()
df_country = df_country[~df_country['variety'].isin(varieties_less_than_five_list)]

df_country_subset = df_country.sample(n=500)
df_country_subset.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
42286,42286,Italy,"Lean and linear, this offers white spring flow...",Fumat,86,19.0,Northeastern Italy,Collio,,Kerin O’Keefe,@kerinokeefe,Collavini 2015 Fumat Sauvignon (Collio),Sauvignon,Collavini
82695,82695,Italy,Here's a lush and modern Rosso di Montalcino w...,,88,25.0,Tuscany,Rosso di Montalcino,,,,Tenute Silvio Nardi 2008 Rosso di Montalcino,Sangiovese Grosso,Tenute Silvio Nardi
94283,94283,Italy,"Made with 60% Sangiovese, 25% Cabernet Sauvign...",Casal Duro,88,28.0,Tuscany,Toscana,,Kerin O’Keefe,@kerinokeefe,Fattoria La Vialla 2012 Casal Duro Red (Toscana),Red Blend,Fattoria La Vialla
62052,62052,Italy,Here's a harmonious and well-balanced Barbera ...,Molisse,89,19.0,Piedmont,Barbera d'Asti Superiore,,,,Agostino Pavia & Figli 2007 Molisse (Barbera ...,Barbera,Agostino Pavia & Figli
99042,99042,Italy,"Subdued aromas of red berry, dried meat and a ...",Moganazzi Volta Sciara Rosso,87,45.0,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Le Vigne di Eli 2012 Moganazzi Volta Sciara Ro...,Red Blend,Le Vigne di Eli
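
To see what the filter does, here is a small sketch on a made-up dataframe (the toy rows and the threshold of 2 are illustrative assumptions, not values from the dataset). It also passes `random_state` to `sample`, which the cell above does not do, so that the same rows are drawn on every run.

```python
import pandas as pd

# Toy data standing in for the wine reviews (illustrative values only)
toy = pd.DataFrame({
    "variety": ["Nebbiolo", "Nebbiolo", "Barbera", "Barbera", "Barbera", "Kerner"],
    "points": [90, 88, 87, 89, 86, 85],
})

min_reviews = 2  # the notebook uses 5 on the real dataset

# Count reviews per variety and keep only rows whose variety is frequent enough
counts = toy["variety"].value_counts()
frequent = counts[counts >= min_reviews].index
toy_filtered = toy[toy["variety"].isin(frequent)]

# A fixed random_state makes the random subset reproducible across runs
subset_a = toy_filtered.sample(n=3, random_state=42)
subset_b = toy_filtered.sample(n=3, random_state=42)

print(sorted(toy_filtered["variety"].unique()))
print(subset_a.index.equals(subset_b.index))
```

Without a fixed seed, each run of the notebook evaluates a different set of 500 wines, so accuracy figures from separate runs are not directly comparable.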


Let's retrieve all grape varieties to include them in the prompt and in our structured outputs enum list.

In [6]:
varieties = np.array(df_country['variety'].unique()).astype('str')
varieties

array(['White Blend', 'Frappato', 'Nerello Mascalese', "Nero d'Avola",
       'Red Blend', 'Cabernet Sauvignon', 'Primitivo', 'Catarratto',
       'Inzolia', 'Grillo', 'Sangiovese', 'Aglianico', 'Vernaccia',
       'Rosato', 'Vermentino', 'Nebbiolo', 'Barbera', 'Sauvignon',
       'Sangiovese Grosso', 'Prugnolo Gentile', 'Pinot Bianco',
       'Montepulciano', 'Moscato', 'Friulano', 'Sagrantino', 'Prosecco',
       'Garganega', 'Chardonnay', 'Sauvignon Blanc', 'Pinot Grigio',
       'Gewürztraminer', 'Cortese', 'Sparkling Blend', 'Cannonau',
       'Kerner', 'Dolcetto', 'Glera', 'Syrah', 'Pinot Nero', 'Verduzzo',
       'Verdicchio', 'Carricante', 'Fiano', 'Greco', 'Trebbiano', 'Rosé',
       'Pinot Noir', 'Corvina, Rondinella, Molinara', 'Insolia',
       'Ribolla Gialla', 'Prié Blanc', 'Zibibbo', 'Falanghina',
       'Negroamaro', 'Müller-Thurgau', 'Teroldego', 'Merlot', 'Turbiana',
       'Refosco', 'Manzoni', 'Ruché', 'Nero di Troia',
       'Lambrusco di Sorbara', 'Lagrein', 'Toca

## Generating the prompt

Let's build out a function to generate our prompt and try it for the first wine of our list.

In [None]:
def generate_prompt(row, varieties):
    # Format the varieties list as a comma-separated string
    variety_list = ', '.join(varieties)
    
    prompt = f"""
    Based on this wine review, guess the grape variety:
    This wine is produced by {row['winery']} in the {row['province']} region of {row['country']}.
    It was grown in {row['region_1']}. It is described as: "{row['description']}".
    The wine has been reviewed by {row['taster_name']} and received {row['points']} points.
    The price is {row['price']}.

    Here is a list of possible grape varieties to choose from: {variety_list}.
    
    What is the likely grape variety? Answer only with the grape variety name or blend from the list.
    """
    return prompt

# Example usage with a specific row
prompt = generate_prompt(df_country_subset.iloc[0], varieties)
prompt
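
Some rows in the sample above have empty fields (for instance no `taster_name`), which pandas represents as `NaN`, so the f-string would put the literal text "nan" in the prompt. The following is a sketch of one way to skip such sentences; `generate_prompt_skipping_missing` and `is_missing` are hypothetical names, and the example row is made up.

```python
import math


def is_missing(value):
    """True for None and for float NaN, which is how pandas marks empty cells."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def generate_prompt_skipping_missing(row, varieties):
    """Hypothetical variant of generate_prompt that omits sentences with empty fields."""
    description = row["description"]
    parts = [
        "Based on this wine review, guess the grape variety:",
        f"This wine is produced by {row['winery']} in the {row['province']} region of {row['country']}.",
    ]
    if not is_missing(row.get("region_1")):
        parts.append(f"It was grown in {row['region_1']}.")
    parts.append(f'It is described as: "{description}".')
    if is_missing(row.get("taster_name")):
        parts.append(f"The wine received {row['points']} points.")
    else:
        parts.append(f"The wine has been reviewed by {row['taster_name']} and received {row['points']} points.")
    if not is_missing(row.get("price")):
        parts.append(f"The price is {row['price']}.")
    parts.append(f"Here is a list of possible grape varieties to choose from: {', '.join(varieties)}.")
    parts.append("What is the likely grape variety? Answer only with the grape variety name or blend from the list.")
    return "\n".join(parts)


# Made-up row for illustration; a pandas Series from df.iterrows() works the same way
example_row = {
    "winery": "Example Winery",
    "province": "Piedmont",
    "country": "Italy",
    "region_1": "Barbera d'Asti",
    "description": "Red berry and spice.",
    "taster_name": float("nan"),
    "points": 89,
    "price": None,
}
print(generate_prompt_skipping_missing(example_row, ["Barbera", "Nebbiolo"]))
```

Building the prompt from a list of sentences also avoids the leading indentation that the triple-quoted string carries into every line.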

Here we use the Ollama Python client library to call models, and we rely on its support for [Structured Outputs](https://ollama.com/blog/structured-outputs).

In [12]:
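class WineVariety(BaseModel):
    # json_schema_extra adds the enum to the generated JSON schema (Pydantic v2)
    variety: str = Field(json_schema_extra={"enum": varieties.tolist()})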

# Function to call the API and process the result for a single model (blocking call in this case)
def call_model(model, prompt):
    response = chat(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "You're a sommelier expert and you know everything about wine. You answer precisely with the name of the variety/blend."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        format=WineVariety.model_json_schema()
    )
    wine_variety = WineVariety.model_validate_json(response.message.content)
    return wine_variety.variety
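
The `enum` entry in the schema is what constrains the model's answer. As far as I can tell, a plain `str` field does not reject values outside the list when Pydantic validates the response, so an explicit membership check can be useful. The sketch below uses only the standard library; `build_variety_schema` and `parse_variety` are hypothetical helpers, and the three-item list is illustrative. The resulting dict has the same shape as what gets passed to `format=`.

```python
import json


def build_variety_schema(varieties):
    """Hypothetical helper: JSON schema with a single enum-constrained string field."""
    return {
        "type": "object",
        "properties": {
            "variety": {"type": "string", "enum": list(varieties)},
        },
        "required": ["variety"],
    }


def parse_variety(content, varieties):
    """Parse the model's JSON answer and check it against the allowed list."""
    variety = json.loads(content)["variety"]
    if variety not in varieties:
        raise ValueError(f"Unexpected variety: {variety!r}")
    return variety


allowed = ["Nebbiolo", "Barbera", "Sangiovese"]
schema = build_variety_schema(allowed)
print(json.dumps(schema, indent=2))
print(parse_variety('{"variety": "Barbera"}', allowed))
```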

## Processing

As we'll run this locally with Ollama on a single machine, I have removed the parallelism code of the original article from this notebook, but it's still available in the other files for OpenAI, Gemini and OpenRouter.

In [13]:
def process_example(index, row, model, df, progress_bar):
    try:
        # Generate the prompt using the row
        prompt = generate_prompt(row, varieties)

        # Store the model's guess in a dedicated column, e.g. "llama3.2-variety"
        df.at[index, model + "-variety"] = call_model(model, prompt)
    except Exception as e:
        print(f"Error processing model {model}: {str(e)}")
    finally:
        # Update the progress bar whether or not the call succeeded
        progress_bar.update(1)

def process_dataframe(df, model):
    # Create a tqdm progress bar
    with tqdm(total=len(df), desc="Processing rows") as progress_bar:
        # Process each example sequentially
        for index, row in df.iterrows():
            process_example(index, row, model, df, progress_bar)

    return df
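
For the hosted APIs, where many requests can be in flight at once, the parallelism mentioned above can be sketched with a thread pool. This is an assumption about the general approach, not the code from the repository's other files; `process_in_parallel` is a hypothetical helper and `fake_call_model` is a stand-in so the sketch runs without a model server.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_call_model(model, prompt):
    """Stand-in for call_model so the sketch runs without a model server."""
    return f"{model} answer to: {prompt}"


def process_in_parallel(prompts, model, call, max_workers=4):
    """Run call(model, prompt) for every prompt on a thread pool.

    Returns a dict mapping each prompt's key to its answer, or to None on error.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the key (e.g. dataframe index) of its prompt
        futures = {executor.submit(call, model, prompt): key for key, prompt in prompts.items()}
        for future in as_completed(futures):
            key = futures[future]
            try:
                results[key] = future.result()
            except Exception as e:
                print(f"Error processing example {key}: {e}")
                results[key] = None
    return results


prompts = {0: "first prompt", 1: "second prompt", 2: "third prompt"}
answers = process_in_parallel(prompts, "llama3.2", fake_call_model)
print(answers[1])
```

Collecting results in a dict keyed by the row index and writing them to the dataframe afterwards keeps the dataframe from being modified by several threads at once.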

Let's try out our `call_model` function before processing the whole dataframe and check the output.

In [None]:
answer = call_model('llama3.2', generate_prompt(df_country_subset.iloc[0], varieties))
answer

Great! We've confirmed that we can get a grape variety as output. Let's now process the dataset with both `llama3.2` and `llama3.1` and compare the results.

In [16]:
df_country_subset = process_dataframe(df_country_subset, "llama3.2")

Processing rows: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [06:48<00:00,  1.22it/s]


In [18]:
df_country_subset = process_dataframe(df_country_subset, "llama3.1")

Processing rows: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [12:02<00:00,  1.45s/it]


## Comparing llama3.2 and llama3.1

Now that we've got all the chat completions for those two models, let's compare them against the expected grape variety and assess their accuracy at finding it. We'll do this directly in Python, as we only have a simple string check to run, but if your task involves more complex evals you can leverage OpenAI Evals or an open-source eval framework.

In [24]:
models = ['llama3.2', 'llama3.1']

def get_accuracy(model, df):
    return np.mean(df['variety'] == df[model + '-variety'])

for model in models:
    print(f"{model} accuracy: {get_accuracy(model, df_country_subset) * 100:.2f}%")

llama3.2 accuracy: 49.20%
llama3.1 accuracy: 61.60%
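
Exact string equality is strict: the variety list contains near-synonymous labels such as 'Inzolia' and 'Insolia', or 'Pinot Nero' and 'Pinot Noir', and picking the "wrong" one counts as a miss. A sketch of how the most frequent confusions could be listed, in plain Python with made-up labels (the helper names are hypothetical). On the notebook's dataframe, the two inputs would be `df['variety']` and `df[model + '-variety']`.

```python
from collections import Counter


def accuracy(expected, predicted):
    """Share of positions where the prediction equals the expected label."""
    pairs = list(zip(expected, predicted))
    return sum(e == p for e, p in pairs) / len(pairs)


def top_confusions(expected, predicted, n=3):
    """Most frequent (expected, predicted) pairs among the wrong answers."""
    wrong = Counter((e, p) for e, p in zip(expected, predicted) if e != p)
    return wrong.most_common(n)


# Made-up labels for illustration, not results from the notebook
expected = ["Nebbiolo", "Inzolia", "Pinot Nero", "Barbera", "Inzolia"]
predicted = ["Nebbiolo", "Insolia", "Pinot Noir", "Barbera", "Insolia"]

print(f"accuracy: {accuracy(expected, predicted) * 100:.2f}%")
print(top_confusions(expected, predicted))
```

Looking at these pairs helps separate genuine mistakes from label ambiguity in the dataset.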


We can see that llama3.1 is better at finding the grape variety than llama3.2 (12.40 percentage points higher!).

For reference, the two cells below come from the fine-tuning part of the original cookbook and use OpenAI models. They depend on an OpenAI `client` and an `another_subset` dataframe that are not defined in this notebook, so they are shown for comparison only.


In [22]:
# copy paste your fine-tune job ID below
finetune_job = client.fine_tuning.jobs.retrieve("ftjob-pRyNWzUItmHpxmJ1TX7FOaWe")

if finetune_job.status == 'succeeded':
    fine_tuned_model = finetune_job.fine_tuned_model
    print('finetuned model: ' + fine_tuned_model)
else:
    print('finetuned job status: ' + finetune_job.status)

finetuned model: ft:gpt-4o-mini-2024-07-18:distillation-test:wine-distillation:AIZntSyE


In [24]:
for model in models:
    print(f"{model} accuracy: {get_accuracy(model, another_subset) * 100:.2f}%")

gpt-4o accuracy: 79.67%
gpt-4o-mini accuracy: 64.67%
ft:gpt-4o-mini-2024-07-18:distillation-test:wine-distillation:AIZntSyE accuracy: 79.33%


This notebook was just an introduction and a small playground.
In the repository you can find additional Python files to try; all of them use Structured Outputs:
- wine_anthropic.py: Anthropic models using the native Anthropic API
- wine_deepseek.py: DeepSeek V3 using the native DeepSeek API with 500(!!!) parallel threads for unbelievably fast results
- wine_gemini.py: Gemini models using the native Gemini API
- wine_gemini_openai.py: Gemini models using the OpenAI API
- wine_lmstudio.py: I used this for Apple MLX models through LMStudio, but you can test any model loadable by LMStudio
- wine_ollama.py: code similar to this notebook, but you can pass multiple models and let it run to test them all
- wine_openai: OpenAI models using the native OpenAI API
- wine_openrouter:
  - Here you can test any model available on OpenRouter. Just be aware that not all LLMs handle Structured Outputs and the response format in the same way, so you can get errors. In that case you need to tweak the `call_model` function.
  - The model used in the example is DeepSeek V3 with 200 (yes!!!) parallel threads.
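
One possible tweak for models that do not honor the response format is a fallback that accepts either a JSON answer or free-form text. The sketch below is an assumption about how that could look, not code from the repository; `extract_variety` is a hypothetical helper and the variety list is shortened for illustration.

```python
import json


def extract_variety(content, varieties):
    """Hypothetical fallback: read {"variety": ...} if the answer is JSON,
    otherwise look for a known variety name inside the free-form text."""
    try:
        data = json.loads(content)
        if isinstance(data, dict) and data.get("variety") in varieties:
            return data["variety"]
    except json.JSONDecodeError:
        pass
    text = content.lower()
    # Try longer names first so "Sauvignon Blanc" wins over "Sauvignon"
    for variety in sorted(varieties, key=len, reverse=True):
        if variety.lower() in text:
            return variety
    return None


known = ["Sauvignon", "Sauvignon Blanc", "Nebbiolo"]
print(extract_variety('{"variety": "Nebbiolo"}', known))
print(extract_variety("I think this is a Sauvignon Blanc.", known))
print(extract_variety("No idea", known))
```

Returning `None` for unrecognized answers keeps them visible as misses in the accuracy computation instead of raising in the middle of a long run.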

**Have fun!**
