# Model Distillation Overview

## What is Distillation?
Distillation uses the outputs of a larger model to fine-tune a smaller model.

## Why Use Distillation?
Two primary benefits:
1. Cost reduction
2. Lower latency

For specific tasks, we can significantly reduce both price and response time by transitioning to a smaller model.

## Notebook Objectives

This notebook demonstrates:
- Dataset analysis
- Distillation of GPT-4O outputs to GPT-4O-mini
- Performance comparison with non-distilled GPT-4O-mini
- Implementation of Structured Outputs for classification
- Impact of fine-tuning on structured output performance

## Today's Agenda

1. Dataset Analysis
2. Model Comparison
   - GPT-4O output analysis
   - GPT-4O-mini performance baseline
   - Performance differential highlighting
3. Distillation Process
   - Implementation
   - Performance analysis of distilled model

In [21]:
import openai
import json
import tiktoken
from tqdm import tqdm
from openai import OpenAI
import numpy as np
import concurrent.futures
import pandas as pd
import os
from dotenv import load_dotenv
import time
import random

load_dotenv()

# Configuration
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

client = OpenAI(api_key=OPENAI_API_KEY)

# Dataset Overview

## Data Source
- Dataset: Wine Reviews
- Source: Kaggle Challenge
- Link: https://www.kaggle.com/datasets/zynicide/wine-reviews

## Data Processing
For this demonstration, we'll focus on:
- French wines only
- Limited subset of grape varieties
- 500 random sample rows

### Filtering Criteria
1. Country: France only
2. Minimum occurrences: at least 5 reviews per grape variety
3. Sample size: 500 random entries

## Classification Task
Goal: Predict grape variety based on multiple features:
- Description
- Subregion
- Province
- Other available criteria

### Note
You can experiment by removing certain features to test their impact on model performance. This helps identify which contextual information provides the most value to the model.
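
One way to run such an ablation is to make the set of included fields a parameter of the prompt builder. The sketch below is a hypothetical helper, not part of the original notebook: `FIELD_TEMPLATES`, `build_prompt`, and the sample row are invented for illustration, and the field names are assumed to match the dataset columns.

```python
# Hypothetical ablation helper: build the prompt from a chosen subset of fields.
FIELD_TEMPLATES = {
    "winery": "This wine is produced by {winery}.",
    "province": "It comes from the {province} region.",
    "region_1": "It was grown in {region_1}.",
    "description": 'It is described as: "{description}".',
    "points": "It received {points} points.",
    "price": "The price is {price}.",
}


def build_prompt(row, fields, varieties):
    """Return a prompt that mentions only the requested fields of `row`."""
    unknown = [f for f in fields if f not in FIELD_TEMPLATES]
    if unknown:
        raise ValueError(f"Unknown fields: {unknown}")
    lines = ["Based on this wine review, guess the grape variety:"]
    lines += [FIELD_TEMPLATES[f].format(**row) for f in fields]
    lines.append("Possible grape varieties: " + ", ".join(varieties) + ".")
    lines.append("Answer only with the grape variety name or blend from the list.")
    return "\n".join(lines)


# Illustrative row (values are made up for the example)
example_row = {
    "winery": "Trimbach",
    "province": "Alsace",
    "region_1": "Alsace",
    "description": "Dry and restrained, with plenty of spice.",
    "points": 87,
    "price": 24.0,
}
candidate_varieties = ["Riesling", "Gewürztraminer"]

full_prompt = build_prompt(example_row, list(FIELD_TEMPLATES), candidate_varieties)
reduced_prompt = build_prompt(example_row, ["province", "description"], candidate_varieties)
```

Running the same evaluation with `full_prompt`-style and `reduced_prompt`-style inputs would show how much accuracy each field contributes.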

In [22]:
df = pd.read_csv("./winemag-data-130k-v2.csv")
df_france = df[df["country"] == "France"]

# Drop grape varieties that have fewer than 5 reviews. We would like to predict them too,
# but they are outliers we don't want to optimize for: they would make the enum list too long
# and add noise for the rest of the dataset, which would eventually reduce accuracy.

variety_counts = df_france["variety"].value_counts()
varieties_less_than_five_list = variety_counts[variety_counts < 5].index.tolist()
df_france = df_france[~df_france["variety"].isin(varieties_less_than_five_list)]

df_france_subset = df_france.sample(n=500)
#df_france_subset.head()

In [23]:
# Let's retrieve all grape varieties to include them in the prompt and in our structured outputs enum list.

varieties = np.array(df_france["variety"].unique()).astype("str")
varieties

array(['Gewürztraminer', 'Pinot Gris', 'Gamay',
       'Bordeaux-style White Blend', 'Champagne Blend', 'Chardonnay',
       'Petit Manseng', 'Riesling', 'White Blend', 'Pinot Blanc',
       'Alsace white blend', 'Bordeaux-style Red Blend', 'Malbec',
       'Tannat-Cabernet', 'Rhône-style Red Blend', 'Ugni Blanc-Colombard',
       'Savagnin', 'Pinot Noir', 'Rosé', 'Melon',
       'Rhône-style White Blend', 'Pinot Noir-Gamay', 'Colombard',
       'Chenin Blanc', 'Sylvaner', 'Sauvignon Blanc', 'Red Blend',
       'Chenin Blanc-Chardonnay', 'Cabernet Sauvignon', 'Cabernet Franc',
       'Syrah', 'Sparkling Blend', 'Duras', 'Provence red blend',
       'Tannat', 'Merlot', 'Malbec-Merlot', 'Chardonnay-Viognier',
       'Cabernet Franc-Cabernet Sauvignon', 'Muscat', 'Viognier',
       'Picpoul', 'Altesse', 'Provence white blend', 'Mondeuse',
       'Grenache-Syrah', 'G-S-M', 'Pinot Meunier', 'Cabernet-Syrah',
       'Vermentino', 'Marsanne', 'Colombard-Sauvignon Blanc',
       'Gros and Peti

## Generating our prompt

In [24]:
# Let's build out a function to generate our prompt and try it for the first wine of our list.
def generate_prompt(row, varieties):
    # Format the varieties list as a comma-separated string
    variety_list = ", ".join(varieties)

    prompt = f"""
    Based on this wine review, guess the grape variety:
    This wine is produced by {row['winery']} in the {row['province']} region of {row['country']}.
    It was grown in {row['region_1']}. It is described as: "{row['description']}".
    The wine has been reviewed by {row['taster_name']} and received {row['points']} points.
    The price is {row['price']}.

    Here is a list of possible grape varieties to choose from: {variety_list}.
    
    What is the likely grape variety? Answer only with the grape variety name or blend from the list.
    """
    return prompt


# Example usage with a specific row
prompt = generate_prompt(df_france.iloc[0], varieties)
prompt

'\n    Based on this wine review, guess the grape variety:\n    This wine is produced by Trimbach in the Alsace region of France.\n    It was grown in Alsace. It is described as: "This dry and restrained wine offers spice in profusion. Balanced with acidity and a firm texture, it\'s very much for food.".\n    The wine has been reviewed by Roger Voss and received 87 points.\n    The price is 24.0.\n\n    Here is a list of possible grape varieties to choose from: Gewürztraminer, Pinot Gris, Gamay, Bordeaux-style White Blend, Champagne Blend, Chardonnay, Petit Manseng, Riesling, White Blend, Pinot Blanc, Alsace white blend, Bordeaux-style Red Blend, Malbec, Tannat-Cabernet, Rhône-style Red Blend, Ugni Blanc-Colombard, Savagnin, Pinot Noir, Rosé, Melon, Rhône-style White Blend, Pinot Noir-Gamay, Colombard, Chenin Blanc, Sylvaner, Sauvignon Blanc, Red Blend, Chenin Blanc-Chardonnay, Cabernet Sauvignon, Cabernet Franc, Syrah, Sparkling Blend, Duras, Provence red blend, Tannat, Merlot, Malbec

# ROI Analysis for Distillation

## Cost Analysis Method
Using tiktoken to:
- Calculate total token count
- Estimate completion costs
- Evaluate distillation ROI

## Cost Factors
1. Completion Costs
   - Based on token count
   - Varies by model type
2. Fine-tuning Costs (calculated separately)
   - Depends on:
     - Number of epochs
     - Training set size
     - Other training parameters

### Note
This analysis provides completion cost estimates only. The full ROI calculation must include fine-tuning costs, which we'll cover in the distillation section.
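
To make the ROI reasoning concrete, the sketch below estimates how many requests it takes for per-request savings to cover a one-off fine-tuning cost. It is a simplified model under stated assumptions: only input tokens are counted, and the values used for tokens per request, the fine-tuned model's price, and the fine-tuning cost are placeholders to be replaced with figures from the current pricing page. The function names are hypothetical.

```python
import math


def completion_cost(total_tokens, price_per_million):
    """Input-token cost in dollars for a given price per 1M tokens."""
    return total_tokens * price_per_million / 1_000_000


def break_even_requests(tokens_per_request, large_price, small_price, fine_tune_cost):
    """Requests needed before per-request savings cover the fine-tuning cost."""
    saving_per_request = completion_cost(tokens_per_request, large_price - small_price)
    if saving_per_request <= 0:
        raise ValueError("The smaller model must be cheaper per token.")
    return math.ceil(fine_tune_cost / saving_per_request)


# Placeholder inputs (assumptions, not measured values)
requests_needed = break_even_requests(
    tokens_per_request=500,  # rough prompt size
    large_price=2.50,        # $ per 1M input tokens for the larger model
    small_price=0.30,        # $ per 1M input tokens, placeholder for the fine-tuned model
    fine_tune_cost=5.00,     # one-off training cost in $, placeholder
)
print(f"Break-even after about {requests_needed} requests")
```

A fuller estimate would also include output tokens, which are priced separately.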

In [25]:
# Load the tokenizer encoding for the gpt-4o model
enc = tiktoken.encoding_for_model("gpt-4o")

# Initialize a variable to store the total number of tokens
total_tokens = 0

for index, row in df_france_subset.iterrows():
    prompt = generate_prompt(row, varieties)

    # Tokenize the input text and count tokens
    tokens = enc.encode(prompt)
    token_count = len(tokens)

    # Add the token count to the total
    total_tokens += token_count

print(f"Total number of tokens in the dataset: {total_tokens}")
print(f"Total number of prompts: {len(df_france_subset)}")

# Estimated input-token cost in $, using prices as of 2024/10/23

gpt4o_token_price = 2.50 / 1_000_000  # $2.50 per 1M tokens
gpt4o_mini_token_price = 0.150 / 1_000_000  # $0.15 per 1M tokens

total_gpt4o_cost = gpt4o_token_price * total_tokens
total_gpt4o_mini_cost = gpt4o_mini_token_price * total_tokens

print(total_gpt4o_cost)
print(total_gpt4o_mini_cost)

Total number of tokens in the dataset: 245619
Total number of prompts: 500
0.6140475000000001
0.036842849999999996


# Structured Outputs Implementation

## Benefits
Structured outputs provide:
- Responses that always follow the schema
- Improved accuracy
- Restricted answer sets
- Direct answer comparison capability

## Implementation Features
- Ensures consistent response format
- Prevents responses outside dataset
- Enables direct performance comparison
- Works across all models (including distilled)

### Example
Instead of varied responses like:
- "I think this is Pinot Noir"
- "Because of A and B, I believe this to be Pinot Noir"

We get standardized output:
- "Pinot Noir"
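
Because the reply is a small JSON object with a single constrained field, checking an answer reduces to parsing it and comparing strings. A minimal sketch, assuming the reply carries a `variety` key; `parse_variety` and the sample values are hypothetical:

```python
import json


def parse_variety(content, allowed):
    """Extract the `variety` field from a JSON reply and check it against the allowed list."""
    payload = json.loads(content)
    variety = payload["variety"]
    if variety not in allowed:
        raise ValueError(f"Unexpected variety: {variety!r}")
    return variety


allowed = ["Pinot Noir", "Chardonnay", "Gamay"]
parsed = parse_variety('{"variety": "Pinot Noir"}', allowed)

# Direct comparison against the expected label, no text cleanup needed
is_correct = parsed == "Pinot Noir"
```

The membership check is only a local safeguard, useful when the same parser is fed replies that were not produced under a strict schema.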

In [26]:
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "grape-variety",
        "schema": {
            "type": "object",
            "properties": {
                "variety": {
                    "type": "string",
                    "enum": varieties.tolist()
                }
            },
            "additionalProperties": False,
            "required": ["variety"],
        },
        "strict": True
    }
}

# Distillation Implementation

## Process Overview
1. Store completions from larger model (GPT-4O)
2. Use stored completions for smaller model fine-tuning
3. Store all model completions for comparison
   - GPT-4O
   - GPT-4O-mini
   - Fine-tuned model
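
This notebook starts the fine-tuning job from stored completions in the dashboard. For reference, if the same teacher outputs were exported manually instead, chat fine-tuning data is supplied as a JSONL file in which each line holds a `messages` array. The sketch below assumes a list of (prompt, teacher answer) pairs has already been collected; `to_training_line` and the sample pairs are hypothetical.

```python
import json

SYSTEM_PROMPT = (
    "You're a sommelier expert and you know everything about wine. "
    "You answer precisely with the name of the variety/blend."
)


def to_training_line(prompt, teacher_variety):
    """Serialize one teacher completion as a single JSONL training line."""
    record = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
            {
                "role": "assistant",
                # Mirror the structured output: a JSON object with a `variety` key
                "content": json.dumps({"variety": teacher_variety}, ensure_ascii=False),
            },
        ]
    }
    return json.dumps(record, ensure_ascii=False)


# Placeholder pairs standing in for real prompts and teacher answers
pairs = [("Review A ...", "Gamay"), ("Review B ...", "Rosé")]
jsonl = "\n".join(to_training_line(p, v) for p, v in pairs)
```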

## Implementation Details
### Storage Configuration
- Enable `store=True` parameter
- Include metadata tags for filtering
- Support for OpenAI platform evaluations

### Performance Comparison
Initial results show:
- GPT-4O outperforms GPT-4O-mini by 12.00 percentage points (82.00% vs. 70.00% accuracy)
- Relative improvement of about 17%

## OpenAI Platform Integration
Access stored completions at:
https://platform.openai.com/chat-completions

### Note
Metadata tagging enables efficient filtering for:
- Distillation processes
- Evaluation runs
- Performance analysis

In [27]:
# Metadata tag used to filter the stored completions later
metadata_value = "wine-distillation"  # that's a funny metadata tag :-)


# Function to call the API and process the result for a single model (blocking call in this case)
def call_model(model, prompt):
    response = client.chat.completions.create(
        model=model,
        store=True,
        metadata={
            "distillation": metadata_value,
        },
        messages=[
            {
                "role": "system",
                "content": "You're a sommelier expert and you know everything about wine. You answer precisely with the name of the variety/blend.",
            },
            {"role": "user", "content": prompt},
        ],
        response_format=response_format,
    )
    return json.loads(response.choices[0].message.content.strip())["variety"]

# Parallel Processing Implementation

## Overview
When running completions on large datasets, parallel processing becomes essential for efficiency. This section details the implementation of concurrent processing for model completions.

## Implementation Features

### API Call Function
- Model-specific API calls
- Metadata tagging for tracking
- Structured response handling
- System role configuration for sommelier expertise

### Processing Options

#### Parallel Processing
- Uses `concurrent.futures.ThreadPoolExecutor`
- Configurable worker count
- Progress tracking
- Error handling and reporting

#### Sequential Processing
- Slower alternative that processes rows in order
- Useful for debugging
- Progress tracking included
- Built-in error handling
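
The parallel pattern can be tried without any API access. The sketch below swaps the real model call for a stand-in function (`fake_model_call`, hypothetical) and gathers results in the main thread, keyed by input position, so that worker threads never write to a shared object. `run_in_parallel` is an illustrative name, not part of the notebook.

```python
import concurrent.futures


def fake_model_call(prompt):
    """Stand-in for the real API call; fails on empty prompts to exercise error handling."""
    if not prompt:
        raise ValueError("empty prompt")
    return prompt.upper()


def run_in_parallel(prompts, worker, max_workers=4):
    """Return (results, errors), both keyed by the position of the prompt in `prompts`."""
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the index of the prompt it was created from
        futures = {executor.submit(worker, p): i for i, p in enumerate(prompts)}
        for future in concurrent.futures.as_completed(futures):
            i = futures[future]
            try:
                results[i] = future.result()
            except Exception as exc:
                errors[i] = str(exc)
    return results, errors


results, errors = run_in_parallel(["gamay", "", "syrah"], fake_model_call)
```

Futures complete in arbitrary order, so keying by index is what allows the outputs to be matched back to their rows afterwards.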

In [28]:
def process_example(index, row, model, df, progress_bar):
    try:
        # Generate the prompt for this row
        prompt = generate_prompt(row, varieties)

        # Store the prediction in a per-model column, e.g. "gpt-4o-variety"
        df.at[index, model + "-variety"] = call_model(model, prompt)

        time.sleep(2.0)  # Crude rate limiting; adjust the delay as needed

        # Update the progress bar
        progress_bar.update(1)
    except Exception as e:
        print(f"Error processing model {model} at index {index}: {str(e)}")


def process_dataframe(df, model):
    # Create a tqdm progress bar
    with tqdm(total=len(df), desc="Processing rows") as progress_bar:
        # Process each example concurrently using ThreadPoolExecutor
        with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
            futures = {
                executor.submit(
                    process_example, index, row, model, df, progress_bar
                ): index
                for index, row in df.iterrows()
            }

            for future in concurrent.futures.as_completed(futures):
                try:
                    future.result()  # Wait for each example to be processed
                except Exception as e:
                    print(f"Error processing example: {str(e)}")

    return df


# Or process rows in order, which is a lot slower =(
def process_dataframe_sequential(df, model):
    # Create a tqdm progress bar
    with tqdm(total=len(df), desc="Processing rows") as progress_bar:
        # Process each example sequentially; process_example reports its own errors
        for index, row in df.iterrows():
            process_example(index, row, model, df, progress_bar)

    return df

In [29]:
# Let's try out our call model function before processing the whole dataframe and check the output.

answer = call_model("gpt-4o", generate_prompt(df_france_subset.iloc[0], varieties))
answer

df_france_subset = process_dataframe(df_france_subset, "gpt-4o")
df_france_subset = process_dataframe(df_france_subset, "gpt-4o-mini")

Processing rows: 100%|██████████| 500/500 [10:50<00:00,  1.30s/it]
Processing rows: 100%|██████████| 500/500 [10:43<00:00,  1.29s/it]


# Model Performance Analysis

## Comparison Framework
We compare two models:
- GPT-4O
- GPT-4O-mini

## Methodology
- Direct comparison against expected grape varieties
- Accuracy assessment using structured outputs
- String-based verification enabled by response format

### Key Findings
- GPT-4O demonstrates superior accuracy
- Performance delta: 12.00 percentage points absolute (82.00% vs. 70.00%)
- Relative improvement: ~17% compared to GPT-4O-mini

### Implications
This significant performance gap presents:
- Clear opportunity for distillation
- Potential for cost optimization
- Balance between accuracy and latency

## Next Steps
The performance differential justifies exploring distillation to:
1. Maintain high accuracy
2. Reduce operational costs
3. Improve response times

In [30]:
models = ["gpt-4o", "gpt-4o-mini"]


def get_accuracy(model, df):
    return np.mean(df["variety"] == df[model + "-variety"])


for model in models:
    print(f"{model} accuracy: {get_accuracy(model, df_france_subset) * 100:.2f}%")

gpt-4o accuracy: 82.00%
gpt-4o-mini accuracy: 70.00%
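
The absolute and relative gaps follow directly from the two accuracies printed above. A minimal sketch of the arithmetic; the helper names `accuracy` and `gaps` are illustrative and not part of the notebook:

```python
def accuracy(expected, predicted):
    """Share of exact string matches between two equally long lists."""
    if len(expected) != len(predicted):
        raise ValueError("Lists must have the same length.")
    return sum(e == p for e, p in zip(expected, predicted)) / len(expected)


def gaps(strong, weak):
    """Absolute gap in percentage points and relative gap in percent."""
    absolute_points = (strong - weak) * 100
    relative_percent = (strong - weak) / weak * 100
    return absolute_points, relative_percent


# Accuracies taken from the run above: 82.00% and 70.00%
absolute_points, relative_percent = gaps(0.82, 0.70)
print(f"{absolute_points:.2f} points, {relative_percent:.2f}% relative")
```

With these inputs the gap is 12.00 percentage points, or about 17.14% relative to the smaller model's accuracy.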


# WOW
## We can see that gpt-4o is better at finding the grape variety than gpt-4o-mini (12.00 percentage points higher, or about 17% relative to gpt-4o-mini)!

# Now we go to the OpenAI dashboard to fine-tune!


In [None]:
# copy paste your fine-tune job ID below
finetune_job = client.fine_tuning.jobs.retrieve("ftjob-my-code")

if finetune_job.status == "succeeded":
    fine_tuned_model = finetune_job.fine_tuned_model
    print("finetuned model: " + fine_tuned_model)
else:
    print("finetuned job status: " + finetune_job.status)

# Validating the Distilled Model

## Validation Process

### Steps
1. Run completions using the fine-tuned model
2. Compare accuracy metrics:
   - GPT-4O baseline
   - GPT-4O-mini baseline
   - Distilled model performance

### Dataset Preparation
For accurate validation:
- Select new subset of French wines
- Maintain consistency with training constraints:
  - French grape varieties only
  - Excluded outlier varieties
  - Same filtering
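
For the validation subset to be genuinely new, it should exclude the rows already used for training. A minimal sketch of that idea on plain row identifiers; `sample_held_out` and the id ranges are hypothetical, and the seeds are only there to make the example reproducible:

```python
import random


def sample_held_out(all_ids, training_ids, n, seed=0):
    """Sample `n` ids that do not appear in `training_ids`."""
    training = set(training_ids)
    candidates = [i for i in all_ids if i not in training]
    if n > len(candidates):
        raise ValueError("Not enough unseen rows to sample from.")
    return random.Random(seed).sample(candidates, n)


all_ids = list(range(1000))                             # stand-in for the filtered dataset
training_ids = random.Random(42).sample(all_ids, 500)   # stand-in for the training subset
validation_ids = sample_held_out(all_ids, training_ids, 300)
```

With pandas, the same effect can be obtained by dropping the training subset's index labels from the filtered frame before sampling.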



In [34]:
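# Sample validation rows that were not part of the 500-row subset used for distillation
validation_dataset = df_france.drop(df_france_subset.index).sample(n=300)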

models.append(fine_tuned_model)

for model in models:
    another_subset = process_dataframe(validation_dataset, model)

# Let's compare accuracy of models

for model in models:
    print(f"{model} accuracy: {get_accuracy(model, another_subset) * 100:.2f}%")

# That's almost a 22% relative improvement over the non-distilled gpt-4o-mini! 🎉

## Our fine-tuned model performs way better than gpt-4o-mini, while having the same base model!  

### We'll be able to use this model to run inference at a lower cost and with lower latency for future grape variety predictions.

# Shout out to OpenAI

## Distilling gpt-4o outputs to gpt-4o-mini
Let's assume we'd like to run this prediction often: we want completions to be faster and cheaper while keeping that level of accuracy. It would be great to distill gpt-4o's accuracy into gpt-4o-mini, wouldn't it? Let's do it!

We'll now go to the OpenAI stored completions page: https://platform.openai.com/chat-completions.
