# Finetuning with OpenAI and Google

This notebook finetunes a model to write a line in the style of a Shakespearean sonnet when given a single word.

See [./finetuning_0_dataset.ipynb](./finetuning_0_dataset.ipynb) for how the dataset was created.
The training files below are built from [./indexed_sonnets.json](./indexed_sonnets.json).

## Set up

Install the necessary packages and load the API keys from `../keys.env`.

In [1]:
#%pip install --quiet -r requirements.txt

In [2]:
from dotenv import load_dotenv
load_dotenv("../keys.env");

In [2]:
# What fraction of the examples should we use?
# There are about 6000 examples in the file, and it will cost you about $10 to fine-tune OpenAI gpt-4o-mini on all of them.
# Lower the sampling fraction to reduce the cost. Choosing 0.1 here costs approximately $1.
# Specify 1.0 to train on all the examples.
SAMPLING = 0.1
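
The arithmetic in the comments above can be sketched as follows. This is a rough illustration only: the helper name `estimate_sample` is hypothetical, and it assumes cost scales linearly with the number of examples, which is not an official pricing formula.

```python
def estimate_sample(total_examples, full_cost_usd, sampling):
    """Return (expected number of examples, approximate cost) for a sampling fraction.

    Assumes cost is proportional to the number of training examples.
    """
    if not 0.0 <= sampling <= 1.0:
        raise ValueError("sampling must be between 0 and 1")
    return round(total_examples * sampling), full_cost_usd * sampling


# Figures from the comments above: about 6000 examples, about $10 for all of them.
n_examples, cost_usd = estimate_sample(6000, 10.0, 0.1)
print(n_examples, cost_usd)
```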

# OpenAI

Finetune an OpenAI chat model. The notes here target gpt-4o-mini; the run recorded below used gpt-3.5-turbo-0125.

## 1. Create dataset

Follow the instructions at https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset

In [3]:
!head indexed_sonnets.json

[
  {
    "input": "creatures",
    "output": "From fairest creatures we desire increase,"
  },
  {
    "input": "desire",
    "output": "From fairest creatures we desire increase,"
  },
  {


In [4]:
import json

messages = []
SYSTEM_PROMPT = "You are a chatbot takes a single word as input and writes a line of poetry that contains the given word."
with open('indexed_sonnets.json') as ifp:
    indexed_poems = json.load(ifp)
    # required format:
    # {"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, 
    #               {"role": "user", "content": "What's the capital of France?"}, 
    #               {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
    for poem in indexed_poems:
        messages.append({
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": poem['input']},
                {"role": "assistant", "content": poem['output']}
            ]
        })

messages[18]

{'messages': [{'role': 'system',
   'content': 'You are a chatbot takes a single word as input and writes a line of poetry that contains the given word.'},
  {'role': 'user', 'content': 'famine'},
  {'role': 'assistant', 'content': 'Making a famine where abundance lies,'}]}

In [5]:
len(messages)

6387

In [6]:
import random

random.seed(42)
with open('finetuning_openai.jsonl', 'w') as ofp:
    for message in messages:
        if random.random() < SAMPLING:
            # write the message as a single line JSON
            line = json.dumps(message)
            ofp.write(line)
            ofp.write('\n')
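
The loop above keeps each example independently with probability `SAMPLING`, so the number of lines written is only approximately that fraction of the total. If an exact count is preferred, a minimal sketch with `random.sample` could look like this. The helper name `sample_exact` and the stand-in data are illustrative and not part of the notebook.

```python
import random


def sample_exact(records, fraction, seed=42):
    """Return exactly round(len(records) * fraction) records, in original order."""
    k = round(len(records) * fraction)
    rng = random.Random(seed)  # local generator; leaves the global random state alone
    chosen = sorted(rng.sample(range(len(records)), k))
    return [records[i] for i in chosen]


# Stand-in data; in the notebook this would be the `messages` list.
demo_records = [{"id": i} for i in range(6387)]
subset = sample_exact(demo_records, 0.1)
print(len(subset))
```

Using a local `random.Random` instance keeps the result reproducible without changing the global seed that other cells rely on.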

In [7]:
!wc -l finetuning_openai.jsonl            
!head -2 finetuning_openai.jsonl

622 finetuning_openai.jsonl
{"messages": [{"role": "system", "content": "You are a chatbot takes a single word as input and writes a line of poetry that contains the given word."}, {"role": "user", "content": "desire"}, {"role": "assistant", "content": "From fairest creatures we desire increase,"}]}
{"messages": [{"role": "system", "content": "You are a chatbot takes a single word as input and writes a line of poetry that contains the given word."}, {"role": "user", "content": "time"}, {"role": "assistant", "content": "But as the riper should by time decease,"}]}


## 2. Validate training file and estimate cost

Follow the instructions at https://cookbook.openai.com/examples/chat_finetuning_data_prep

In [8]:
# Copy-pasted from https://cookbook.openai.com/examples/chat_finetuning_data_prep
import json
import tiktoken # for token counting
import numpy as np
from collections import defaultdict
data_path = "finetuning_openai.jsonl"

# Load the dataset
with open(data_path, 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]

# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)
    
# Format error checks
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue
        
    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue
        
    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1
        
        if any(k not in ("role", "content", "name", "function_call", "weight") for k in message):
            format_errors["message_unrecognized_key"] += 1
        
        if message.get("role", None) not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1
            
        content = message.get("content", None)
        function_call = message.get("function_call", None)
        
        if (not content and not function_call) or not isinstance(content, str):
            format_errors["missing_content"] += 1
    
    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")
    
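# cl100k_base is the gpt-3.5-turbo encoding; gpt-4o-mini uses o200k_base, so counts for it are approximate
encoding = tiktoken.get_encoding("cl100k_base")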

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
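    # Note: the label says p5 / p95, but the quantiles computed are 0.1 and 0.9 (as in the cookbook code)
    print(f"p5 / p95: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")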
    
# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))
    
print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 16385 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 16,385 token limit, they will be truncated during fine-tuning")

# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 16385

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")

Num examples: 622
First example:
{'role': 'system', 'content': 'You are a chatbot takes a single word as input and writes a line of poetry that contains the given word.'}
{'role': 'user', 'content': 'desire'}
{'role': 'assistant', 'content': 'From fairest creatures we desire increase,'}
No errors found
Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 45, 62
mean / median: 50.60128617363344, 50.0
p5 / p95: 48.0, 53.0

#### Distribution of num_assistant_tokens_per_example:
min / max: 6, 22
mean / median: 11.041800643086816, 11.0
p5 / p95: 9.0, 13.0

0 examples may be over the 16,385 token limit, they will be truncated during fine-tuning
Dataset has ~31474 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~94422 tokens
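
The billed-token estimate is the number of dataset tokens multiplied by the number of epochs. Turning that into dollars needs a per-token training price, which depends on the base model and changes over time. The sketch below is an illustration only: `estimate_training_cost` is a hypothetical helper, and the price passed in is a placeholder, not a quoted rate.

```python
def estimate_training_cost(tokens_in_dataset, n_epochs, usd_per_million_tokens):
    """Approximate training cost: billed tokens = dataset tokens x epochs."""
    billed_tokens = tokens_in_dataset * n_epochs
    return billed_tokens, billed_tokens * usd_per_million_tokens / 1_000_000


# 31474 tokens and 3 epochs come from the output above.
# 8.00 is a PLACEHOLDER price per million training tokens; check current pricing.
billed, cost = estimate_training_cost(31474, 3, 8.00)
print(billed, round(cost, 2))
```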

## 3. Upload training file and start training job

In [14]:
from openai import OpenAI
client = OpenAI()

# Open the file in a context manager so the handle is closed after the upload
with open("finetuning_openai.jsonl", "rb") as ifp:
    training_file = client.files.create(file=ifp, purpose="fine-tune")

training_file

FileObject(id='file-vqVo75gjyXaz8OqxXIkz8tyx', bytes=172839, created_at=1723648774, filename='finetuning_openai.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)

In [15]:
training_file.id

'file-vqVo75gjyXaz8OqxXIkz8tyx'

In [21]:
BASE_MODEL="gpt-3.5-turbo-0125"
#BASE_MODEL="gpt-4o-mini-2024-07-18"
training_job = client.fine_tuning.jobs.create(
  training_file=training_file.id, 
  model=BASE_MODEL
)

In [22]:
from openai import OpenAI
client = OpenAI()
client.fine_tuning.jobs.list(limit=1)

SyncCursorPage[FineTuningJob](data=[FineTuningJob(id='ftjob-FxvxRSxJaZnj5gUE7Qa98Xz9', created_at=1723649613, error=Error(code=None, message=None, param=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-O9DnLrdeTxgprcSfnVHOMUK9', result_files=[], seed=1650444829, status='validating_files', trained_tokens=None, training_file='file-vqVo75gjyXaz8OqxXIkz8tyx', validation_file=None, estimated_finish=None, integrations=[], user_provided_suffix=None)], object='list', has_more=True)

In [24]:
training_job_id='ftjob-FxvxRSxJaZnj5gUE7Qa98Xz9'  #(training_job.id)
client.fine_tuning.jobs.retrieve(training_job_id)

FineTuningJob(id='ftjob-FxvxRSxJaZnj5gUE7Qa98Xz9', created_at=1723649613, error=Error(code='exceeded_quota', message='Creating this fine-tuning job would exceed your hard limit, please check your plan and billing details.                     Cost of job ftjob-FxvxRSxJaZnj5gUE7Qa98Xz9: USD 0.73. Quota remaining for org-O9DnLrdeTxgprcSfnVHOMUK9: USD -186.42.', param=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs=3, batch_size=1, learning_rate_multiplier=2), model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-O9DnLrdeTxgprcSfnVHOMUK9', result_files=[], seed=1650444829, status='failed', trained_tokens=None, training_file='file-vqVo75gjyXaz8OqxXIkz8tyx', validation_file=None, estimated_finish=None, integrations=[], user_provided_suffix=None)
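
The job retrieved above ended with status `failed` because the quota was exceeded, and its `fine_tuned_model` is `None`. Rather than re-running `retrieve` by hand, the status can be polled until the job finishes. The following is a minimal sketch, not part of the notebook: `wait_for_job` and `TERMINAL_STATUSES` are hypothetical names, and the set of terminal statuses is an assumption based on the statuses seen in the job objects above.

```python
import time

# Assumed terminal statuses; anything else is treated as still in progress.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id, poll_seconds=60, sleep=time.sleep):
    """Poll `retrieve(job_id)` until the job reaches a terminal status.

    `retrieve` is any callable returning an object with `.status`, `.error`
    and `.fine_tuned_model`, e.g. `client.fine_tuning.jobs.retrieve`.
    """
    while True:
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            break
        sleep(poll_seconds)
    if job.status != "succeeded":
        message = getattr(job.error, "message", None)
        raise RuntimeError(f"Job {job_id} ended with status {job.status!r}: {message}")
    return job.fine_tuned_model
```

In this notebook it would be called as `wait_for_job(client.fine_tuning.jobs.retrieve, training_job_id)`. Passing the retrieve function in as an argument keeps the helper easy to test without network access.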

## 4. Try out the finetuned model

Make sure to supply messages in the same format used to train.

In [25]:
client.fine_tuning.jobs.retrieve(training_job_id).fine_tuned_model

In [None]:
kitchen_input = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "kitchen"},
]
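# fine_tuned_model stays None until the job succeeds (the job above failed with exceeded_quota)
tuned_model_name = client.fine_tuning.jobs.retrieve(training_job_id).fine_tuned_model
kitchen_line = client.chat.completions.create(
  model=tuned_model_name,
  messages=kitchen_input
)
print(kitchen_line.choices[0].message)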

# Google Gemini

Fine-tune a Gemini model following the instructions at:
https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-supervised-tuning

In [13]:
import json

messages = []

with open('indexed_sonnets.json') as ifp:
    indexed_poems = json.load(ifp)
    # required format (as used below; same layout as for OpenAI, but the reply role is "model"):
    # {"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, 
    #               {"role": "user", "content": "What's the capital of France?"}, 
    #               {"role": "model", "content": "Paris, as if everyone doesn't know that already."}]}
    for poem in indexed_poems:
        messages.append({
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": poem['input']},
                {"role": "model", "content": poem['output']}
            ]
        })

messages[18]

{'messages': [{'role': 'system',
   'content': 'You are a chatbot takes a single word as input and writes a line of poetry that contains the given word.'},
  {'role': 'user', 'content': 'famine'},
  {'role': 'model', 'content': 'Making a famine where abundance lies,'}]}

In [14]:
import random

random.seed(42)
with open('finetuning_gemini.jsonl', 'w') as ofp:
    for message in messages:
        if random.random() < SAMPLING:
            # write the message as a single line JSON
            line = json.dumps(message)
            ofp.write(line)
            ofp.write('\n')
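
The records written here differ from the OpenAI ones built earlier only in the role name of the target text: `model` instead of `assistant`. As a sketch, the conversion can be written as a small function instead of rebuilding the list from the JSON file. The name `to_gemini_record` and the example record are illustrative and not part of the notebook.

```python
def to_gemini_record(openai_record):
    """Copy a chat record, renaming the 'assistant' role to 'model'."""
    return {
        "messages": [
            {**m, "role": "model" if m["role"] == "assistant" else m["role"]}
            for m in openai_record["messages"]
        ]
    }


example = {
    "messages": [
        {"role": "system", "content": "You write one line of poetry."},
        {"role": "user", "content": "famine"},
        {"role": "assistant", "content": "Making a famine where abundance lies,"},
    ]
}
print(to_gemini_record(example)["messages"][-1]["role"])
```

The function returns new dictionaries, so the original OpenAI-format record is left unchanged.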

In [15]:
BUCKET="viz_genai_nonsensitive"  # CHANGE THIS to be your own bucket
REGION="us-central1"
!gsutil cp finetuning_gemini.jsonl gs://$BUCKET/finetuning_gemini.jsonl

Copying file://finetuning_gemini.jsonl [Content-Type=application/octet-stream]...
/ [1 files][166.4 KiB/166.4 KiB]                                                
Operation completed over 1 objects/166.4 KiB.                                    


In [16]:
import vertexai
vertexai.init(location=REGION)

In [17]:
PROJECT=!gcloud config get project
PROJECT=PROJECT[0]

In [18]:
from vertexai.preview.tuning import sft
import time
sft_tuning_job = sft.train(
    source_model="gemini-1.0-pro-002",
    train_dataset=f"gs://{BUCKET}/finetuning_gemini.jsonl"
)

# Polling for job completion
while not sft_tuning_job.has_ended:
    time.sleep(60)
    sft_tuning_job.refresh()

print(sft_tuning_job.tuned_model_name)
print(sft_tuning_job.tuned_model_endpoint_name)
print(sft_tuning_job.experiment)

Creating SupervisedTuningJob
SupervisedTuningJob created. Resource name: projects/82379820716/locations/us-central1/tuningJobs/2736685987722690560
To use this SupervisedTuningJob in another session:
tuning_job = sft.SupervisedTuningJob('projects/82379820716/locations/us-central1/tuningJobs/2736685987722690560')
View Tuning Job:
https://console.cloud.google.com/vertex-ai/generative/language/locations/us-central1/tuning/tuningJob/2736685987722690560?project=82379820716


projects/82379820716/locations/us-central1/models/7757618583424729088@1
projects/82379820716/locations/us-central1/endpoints/9031474272459030528
<google.cloud.aiplatform.metadata.experiment_resources.Experiment object at 0x7fdfd437f130>


In [13]:
from vertexai.preview.generative_models import GenerativeModel
from vertexai.preview import tuning
from vertexai.preview.tuning import sft

tuned_model = GenerativeModel('projects/82379820716/locations/us-central1/endpoints/9031474272459030528') #sft_tuning_job.tuned_model_endpoint_name)
print(tuned_model.generate_content(f"{SYSTEM_PROMPT}\n  User: kitchen  Model:\n"))

candidates {
  content {
    role: "model"
    parts {
      text: "Kitchen and chapel, have he still at peace"
    }
  }
  finish_reason: STOP
  safety_ratings {
    category: HARM_CATEGORY_HATE_SPEECH
    probability: NEGLIGIBLE
    probability_score: 0.0697949230670929
    severity: HARM_SEVERITY_NEGLIGIBLE
    severity_score: 0.05252167582511902
  }
  safety_ratings {
    category: HARM_CATEGORY_DANGEROUS_CONTENT
    probability: NEGLIGIBLE
    probability_score: 0.07949569821357727
    severity: HARM_SEVERITY_NEGLIGIBLE
    severity_score: 0.03422932326793671
  }
  safety_ratings {
    category: HARM_CATEGORY_HARASSMENT
    probability: NEGLIGIBLE
    probability_score: 0.12896329164505005
    severity: HARM_SEVERITY_NEGLIGIBLE
    severity_score: 0.04611974582076073
  }
  safety_ratings {
    category: HARM_CATEGORY_SEXUALLY_EXPLICIT
    probability: NEGLIGIBLE
    probability_score: 0.2628418207168579
    severity: HARM_SEVERITY_NEGLIGIBLE
    severity_score: 0.08525123447179794