# Part 3 - Performance Testing

### Overview

Now that we can classify SMS messages using both the general-purpose GPT model and our fine-tuned models, we want to test them to see whether they differ in performance and cost.

The following steps are covered:

* Set up classification APIs for both fine-tuned and general-purpose models (re-used from Part 1 and Part 2)
* Predict on the validation data for each model
* Look at performance via confusion matrix
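
As a reminder of what the confusion matrix in the last step counts, here is a minimal sketch in plain Python. The helper name `confusion_counts` and the sample labels are made up for illustration; the notebook itself uses scikit-learn for this step. True (spam) is treated as the positive class.

```python
def confusion_counts(expected, predicted):
    """Count the four confusion-matrix cells, treating True (spam) as positive."""
    cells = {"true_positive": 0, "false_negative": 0,
             "false_positive": 0, "true_negative": 0}
    for exp, pred in zip(expected, predicted):
        if exp and pred:
            cells["true_positive"] += 1
        elif exp and not pred:
            cells["false_negative"] += 1   # spam that slipped through
        elif not exp and pred:
            cells["false_positive"] += 1   # ham wrongly flagged as spam
        else:
            cells["true_negative"] += 1
    total = sum(cells.values())
    accuracy = (cells["true_positive"] + cells["true_negative"]) / total if total else 0.0
    return cells, accuracy

# Made-up labels, for illustration only
expected = [True, True, False, False, False]
predicted = [True, False, False, True, False]
counts, accuracy = confusion_counts(expected, predicted)
print(counts, accuracy)
```

Accuracy alone can hide a lopsided error pattern, which is why the four cells are reported separately.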

In [1]:
# Install dependencies if needed
# %pip install pandas
# %pip install python-dotenv
# %pip install openai --upgrade
# %pip install throttler

### Resources

* https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62

In [2]:
from openai import AsyncOpenAI
import os
import json
import pandas as pd
import time
import datetime

In [3]:
from dotenv import load_dotenv; load_dotenv()
client = AsyncOpenAI(api_key=os.environ['OPENAI_API_KEY'])

In [4]:
fine_tuned_models = {
    50: 'ft:gpt-3.5-turbo-1106:aa-engineering::8I9g9RO0',
    100: 'ft:gpt-3.5-turbo-1106:aa-engineering::8I9vALSP',
    200: 'ft:gpt-3.5-turbo-1106:aa-engineering::8IAIy8LD'
}

# Define Classification APIs

These APIs take an input message and use either a fine-tuned model or the general-purpose model to predict whether it is spam.  Each returns a boolean: True for spam.
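
Turning the model's text reply into a boolean is a small step where stray whitespace, capitalisation or punctuation can cause a misread. The helper below is a hypothetical sketch (`parse_spam_label` is not part of the notebook) of a more tolerant comparison, assuming the model answers with a single word.

```python
import string

def parse_spam_label(reply: str) -> bool:
    """Return True if the reply is the single word 'spam',
    ignoring case, surrounding whitespace and surrounding punctuation."""
    cleaned = reply.strip().strip(string.punctuation).strip().lower()
    return cleaned == "spam"

print(parse_spam_label("Spam"))       # capitalised reply
print(parse_spam_label(" spam.\n"))   # trailing period and newline
print(parse_spam_label("ham"))
```

Anything other than the bare word, such as "not spam", is still treated as not spam, so an unexpected reply counts as ham rather than raising an error.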

In [5]:
# Fine Tuned Model API

fineTunePrompt = "You are a system for categorizing SMS text messages as being unwanted spam or normal messages."

async def getSpamClassification_FineTune(fineTunedModelId, prompt):
  completion = await client.chat.completions.create(
    model=fineTunedModelId,
    messages=[
      {"role": "system", "content": fineTunePrompt},
      {"role": "user", "content": prompt}
    ]
  )
  result = completion.choices[0].message.content.lower() == 'spam'
  # print(prompt, "=>", result)
  return result



In [6]:
# Foundational Model API

foundationalModelPrompt = "You will be provided with a text message. You will need to classify the text message as spam, ham. Spam is a text message that is spam, harmful, abusive, or otherwise unwanted. Ham is a text message that is not spam."

async def getSpamClassification_GeneralModel(message):
    response = await client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=[
            {"role": "system", "content": foundationalModelPrompt},
            {"role": "user", "content": message}
        ],
        temperature=0.5,
        max_tokens=2
    )
    return response.choices[0].message.content.lower() == 'spam'

# Predict on Validation Data

Each fine-tuned model has a validation dataset in addition to its training data.  Here we predict on those datasets with each fine-tuned model and with the general-purpose model.

Predicting on the entire validation set takes some time because OpenAI rate-limits requests (60 requests per minute here).  The API calls also cost money, so we don't want to run them more than once.
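
A rough back-of-the-envelope check on the run time: with a fixed cap on requests per minute, the number of requests sets a lower bound on wall-clock time. The function name `min_runtime_seconds` is made up for this sketch; the figures (500 rows, 60 requests per minute, and the throttler setting of 19 requests per 20 seconds) are the ones used in this notebook.

```python
def min_runtime_seconds(num_requests: int, requests_per_minute: float) -> float:
    """Lower bound on wall-clock time when requests are capped at a fixed rate."""
    return num_requests * 60 / requests_per_minute

# At the stated limit of 60 requests per minute
at_limit = min_runtime_seconds(500, 60)

# At the throttler setting used later: 19 requests per 20 seconds = 57 per minute
throttled = min_runtime_seconds(500, 19 * 3)

print(at_limit / 60, "minutes at the limit")
print(throttled / 60, "minutes with the throttler")
```

This ignores per-request latency, so the real run can only be slower.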

To make this code robust to things like network errors, we start by creating a dataframe that contains the validation data and a blank column for the results.  The code will run predictions for each row that has an empty result.  This means that this code can be restarted in case of failure.
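
The restart behaviour described above can be sketched without any API calls. Everything here is a stand-in: `run_pending` and `flaky_classifier` are hypothetical names, and a plain list of dicts takes the place of the dataframe. Only the `"-"` placeholder matches what the notebook uses.

```python
PENDING = "-"  # same placeholder the notebook uses for "not predicted yet"

def run_pending(rows, classify):
    """Fill in 'predicted' for every pending row; stop at the first failure."""
    for row in rows:
        if row["predicted"] != PENDING:
            continue  # finished in an earlier run, skip it
        try:
            row["predicted"] = classify(row["prompt"])
        except ConnectionError:
            return False  # leave the remaining rows pending for the next run
    return True

# Hypothetical stand-in for the API call: fails on its second call only
calls = {"n": 0}
def flaky_classifier(prompt):
    calls["n"] += 1
    if calls["n"] == 2:
        raise ConnectionError("simulated network error")
    return "win" in prompt.lower()

rows = [
    {"prompt": "WIN a free prize now", "predicted": PENDING},
    {"prompt": "See you at lunch", "predicted": PENDING},
    {"prompt": "You win cash, reply YES", "predicted": PENDING},
]

first_run_ok = run_pending(rows, flaky_classifier)   # stops at the simulated error
second_run_ok = run_pending(rows, flaky_classifier)  # picks up where it left off
print([row["predicted"] for row in rows])
```

The second run only calls the classifier for rows that are still pending, so a restart never pays twice for a prediction that already succeeded.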

Additionally, to avoid having to re-run all the predictions in case of kernel restart, we save the resulting dataframe to file where it can be optionally reloaded.
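
One thing to watch when results go through a file: booleans are written as the text `True` / `False`, and any non-empty string is truthy in Python. The sketch below uses the standard-library `csv` module rather than pandas to stay dependency-free (a substitution for illustration, so the exact behaviour of a pandas round trip may differ), and `to_bool` is a hypothetical helper.

```python
import csv
import io

rows = [
    {"prompt": "WIN a free prize now", "predicted": True},
    {"prompt": "See you at lunch", "predicted": False},
    {"prompt": "Call me later", "predicted": "-"},  # not predicted yet
]

# Write to an in-memory "file" and read it back, as a save/reload cycle would
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["prompt", "predicted"])
writer.writeheader()
writer.writerows(rows)
buffer.seek(0)
reloaded = list(csv.DictReader(buffer))

def to_bool(value):
    """Map the text 'True'/'False' back to a boolean; leave the placeholder alone."""
    if value == "-":
        return value
    return str(value).strip().lower() == "true"

naive = [bool(r["predicted"]) for r in reloaded]       # every non-empty string is truthy
restored = [to_bool(r["predicted"]) for r in reloaded]
print(naive)
print(restored)
```

Comparing the text explicitly keeps a reloaded `False` from silently turning into `True`.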

In [10]:
# This cell prepares the dataframe that will hold the predictions

rows1 = []
rows2 = []
for sample_size in fine_tuned_models.keys():
    fineTunedModelId = fine_tuned_models[sample_size]
    validation_data_path = f"../data/temp/model_{sample_size}/validation.jsonl"
    with open(validation_data_path, 'r') as f:

        # To test this on a smaller dataset, we can optionally use "[:5]" to take only the first 5 lines
        # for line in f.readlines()[:5]:
        for line in f.readlines()[:100]:
        # for line in f.readlines():
            data = json.loads(line)
            prompt = data['messages'][1]['content']
            completion = data['messages'][2]['content']
            rows1.append({
                'model': fineTunedModelId,
                'sample_size': sample_size,
                'prompt': prompt,
                'expected': completion == 'spam',
                'predicted': "-"
            })
            rows2.append({
                'model': 'general',
                'sample_size': sample_size,
                'prompt': prompt,
                'expected': completion == 'spam',
                'predicted': "-"
            })    

validation_df = pd.DataFrame(rows1+rows2)
print("Prepared empty validation dataframe with {} rows".format(len(validation_df)))           

Prepared empty validation dataframe with 500 rows


In [8]:
# If a previous result is available, we can optionally load it here instead of re-running the validation
validation_df = pd.read_csv('../data/temp/validation_results.csv')
print("Loaded validation data, {} items remaining".format(validation_df['predicted'].eq("-").sum()))    
# validation_df

Loaded validation data, 639 items remaining


In [11]:
# Note: Running this cell will take a while and incur API usage costs

# Use Throttler to limit the number of requests per minute
from throttler import Throttler
throttler = Throttler(rate_limit=19, period=20)

print("Running validation, {} items remaining".format(validation_df['predicted'].eq("-").sum()))     

start = time.time()
for index, row in validation_df.iterrows():
    if row['predicted'] == "-":
        async with throttler:
            elapsedSeconds = time.time() - start
            print("{}  Predicting on validation row {} / {} with model: {}".format(str(datetime.timedelta(seconds=elapsedSeconds)), index+1, len(validation_df), row['model']))
            if row['model'] == 'general':
                result = await getSpamClassification_GeneralModel(row['prompt'])
            else:
                result = await getSpamClassification_FineTune(row['model'], row['prompt'])
        validation_df.loc[index, 'predicted'] = result

        # Save to disk after every prediction in case we need to interrupt kernel and resume
        validation_df.to_csv('../data/temp/validation_results.csv', index=False)


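# Results reloaded from CSV may be the strings "True"/"False"; astype(bool) would turn both into True,
# so compare the text explicitly instead
validation_df['predicted'] = validation_df['predicted'].map(lambda v: str(v).strip().lower() == 'true')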
validation_df['correct'] = validation_df['expected'] == validation_df['predicted']
validation_df.to_csv('../data/temp/validation_results.csv', index=False)

print("Saved validation results to ../data/temp/validation_results.csv")
validation_df.head()


Running validation, 500 items remaining
0:00:00.000638  Predicting on validation row 1 / 500 with model: ft:gpt-3.5-turbo-1106:aa-engineering::8I9g9RO0
0:00:00.241754  Predicting on validation row 2 / 500 with model: ft:gpt-3.5-turbo-1106:aa-engineering::8I9g9RO0
0:00:00.405955  Predicting on validation row 3 / 500 with model: ft:gpt-3.5-turbo-1106:aa-engineering::8I9g9RO0
0:00:00.582906  Predicting on validation row 4 / 500 with model: ft:gpt-3.5-turbo-1106:aa-engineering::8I9g9RO0
0:00:00.754007  Predicting on validation row 5 / 500 with model: ft:gpt-3.5-turbo-1106:aa-engineering::8I9g9RO0
0:00:00.932222  Predicting on validation row 6 / 500 with model: ft:gpt-3.5-turbo-1106:aa-engineering::8I9g9RO0
0:00:01.096017  Predicting on validation row 7 / 500 with model: ft:gpt-3.5-turbo-1106:aa-engineering::8I9g9RO0
0:00:01.264485  Predicting on validation row 8 / 500 with model: ft:gpt-3.5-turbo-1106:aa-engineering::8I9g9RO0
0:00:01.441688  Predicting on validation row 9 / 500 with model:

| | model | sample_size | prompt | expected | predicted | correct |
|---|---|---|---|---|---|---|
| 0 | ft:gpt-3.5-turbo-1106:aa-engineering::8I9g9RO0 | 50 | Romantic Paris. 2 nights, 2 flights from £79... | True | True | True |
| 1 | ft:gpt-3.5-turbo-1106:aa-engineering::8I9g9RO0 | 50 | URGENT! Your mobile No *********** WON a £2,... | True | True | True |
| 2 | ft:gpt-3.5-turbo-1106:aa-engineering::8I9g9RO0 | 50 | Free 1st week entry 2 TEXTPOD 4 a chance 2 win... | True | True | True |
| 3 | ft:gpt-3.5-turbo-1106:aa-engineering::8I9g9RO0 | 50 | Wan2 win a Meet+Greet with Westlife 4 U or a m... | True | True | True |
| 4 | ft:gpt-3.5-turbo-1106:aa-engineering::8I9g9RO0 | 50 | You can stop further club tones by replying \S... | True | True | True |


In [None]:
# Create a confusion matrix for each sample size and model, put them into a dataframe

#%pip install scikit-learn
from sklearn.metrics import confusion_matrix


grouped = validation_df.groupby('sample_size')

rows = []
for sample_size in grouped.groups:
    group = grouped.get_group(sample_size)
    # display(group)

    fineTuneModelPredictions = group[group['model'] != 'general']
    generalModelPredictions = group[group['model'] == 'general']

    fineTuneConfusionMatrix = confusion_matrix(fineTuneModelPredictions['expected'], fineTuneModelPredictions['predicted'], labels=[True, False])
    # print(fineTuneConfusionMatrix)
    fineTuneModelAccuracy = (fineTuneConfusionMatrix[0][0] + fineTuneConfusionMatrix[1][1]) / (fineTuneConfusionMatrix[0][0] + fineTuneConfusionMatrix[0][1] + fineTuneConfusionMatrix[1][0] + fineTuneConfusionMatrix[1][1])
    rows.append([sample_size, 'fine-tuned', fineTuneConfusionMatrix[0][0], fineTuneConfusionMatrix[0][1], fineTuneConfusionMatrix[1][0], fineTuneConfusionMatrix[1][1], fineTuneModelAccuracy])

    generalConfusionMatrix = confusion_matrix(generalModelPredictions['expected'], generalModelPredictions['predicted'], labels=[True, False])
    # print(generalConfusionMatrix)
    generalModelAccuracy = (generalConfusionMatrix[0][0] + generalConfusionMatrix[1][1]) / (generalConfusionMatrix[0][0] + generalConfusionMatrix[0][1] + generalConfusionMatrix[1][0] + generalConfusionMatrix[1][1])
    rows.append([sample_size, 'general', generalConfusionMatrix[0][0], generalConfusionMatrix[0][1], generalConfusionMatrix[1][0], generalConfusionMatrix[1][1], generalModelAccuracy])

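# Rows of each matrix are the expected labels and columns the predicted labels, both ordered [True, False],
# so [0][1] counts false negatives (spam predicted as ham) and [1][0] counts false positives (ham predicted as spam)
confusion_matrix_df = pd.DataFrame(rows, columns=['sample_size', 'model', 'true_positive', 'false_negative', 'false_positive', 'true_negative', 'accuracy'])
confusion_matrix_df

Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x1275b6c10>


| | sample_size | model | true_positive | false_negative | false_positive | true_negative | accuracy |
|---|---|---|---|---|---|---|---|
| 0 | 50 | fine-tuned | 6 | 0 | 22 | 22 | 0.56 |
| 1 | 50 | general | 6 | 0 | 24 | 20 | 0.52 |
| 2 | 100 | fine-tuned | 12 | 0 | 0 | 88 | 1.0 |
| 3 | 100 | general | 10 | 2 | 4 | 84 | 0.94 |
| 4 | 200 | fine-tuned | 25 | 0 | 1 | 174 | 0.995 |
| 5 | 200 | general | 21 | 4 | 6 | 169 | 0.95 |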
