This notebook looks at finetuning gpt-3.5-turbo on a YouTube creator's video titles and transcript summaries so that, given a video title, the model generates video outlines closer to that creator's style. It then compares the finetuned model's output with that of the base gpt-3.5-turbo on the same prompt.

In [27]:
import os
import json
import tiktoken
from collections import defaultdict
import numpy as np
import pandas as pd
import jsonlines
from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

True

In [28]:
# Pass the key explicitly; OpenAI() would otherwise read OPENAI_API_KEY from the environment itself
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

In [29]:
# Dataset taken from https://www.kaggle.com/code/thedevastator/mrbeast-transcripts-starter-notebook/input
data_df = pd.read_csv("kaggle_yt_video_data.csv")

In [30]:
# Taking most recent 50 videos
data_subset = data_df.sort_values(by="publish_date", ascending=False).head(50)

In [31]:
# data_subset already holds exactly the 50 most recent videos
transcript_list = data_subset["transcript"].to_list()

1. Summarize the transcripts

Fed directly into finetuning, the full transcripts would exceed the token limit, so we first summarize each transcript with gpt-3.5-turbo.
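One way to see which transcripts are too long before summarizing is to count tokens per transcript against a budget. The sketch below is illustrative only: `split_by_token_budget` and `whitespace_token_count` are hypothetical helpers that are not part of the original notebook, and the whitespace counter is a crude stand-in for a real tokenizer.

```python
def split_by_token_budget(texts, count_tokens, max_tokens):
    """Partition texts into those within the token budget and those over it."""
    within, over = [], []
    for text in texts:
        if count_tokens(text) <= max_tokens:
            within.append(text)
        else:
            over.append(text)
    return within, over


def whitespace_token_count(text):
    # Crude stand-in for a tokenizer: counts whitespace-separated words.
    # Real token counts are usually higher than word counts.
    return len(text.split())


# Hypothetical usage with the tiktoken encoding created later in this notebook:
# within, over = split_by_token_budget(
#     transcript_list, lambda t: len(encoding.encode(t)), 4096
# )
```

Passing the counter in as a function keeps the helper independent of any particular tokenizer.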

In [16]:
summary_list = []
for transcript in transcript_list:
    summary_prompt = (
        f"Summarize the following transcript of a Youtube video: {transcript}."
    )
    gpt_summary = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.7,
        messages=[{"role": "user", "content": summary_prompt}],
    )
    # Strip any double quotes wrapping the returned summary
    result = gpt_summary.choices[0].message.content.strip('"')
    summary_list.append(result)

In [18]:
data_subset["summary"] = summary_list

In [20]:
titles_list = data_subset["title"].to_list()

2. Finetune the model on the title and the transcript summary

In [22]:
# Create the training data: one system/user/assistant conversation per video
data = []
for title, summary in zip(titles_list, summary_list):
    prompt = {
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful chatbot that writes youtube video transcript outlines given a video title.",
            },
            {
                "role": "user",
                "content": f"Given the following video title for a Youtube video: {title}, what would be a good video outline? ",
            },
            {"role": "assistant", "content": summary},
        ]
    }
    data.append(prompt)
# Open the JSONL file in write mode
with jsonlines.open("output.jsonl", mode="w") as writer:
    # Iterate over the list of dictionaries
    for item in data:
        # Write each dictionary as a separate line in the JSONL file
        writer.write(item)
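The notebook trains on all 50 examples. The fine-tuning job also accepts an optional `validation_file` (it appears as `validation_file=None` in the job output further down), so a held-out split is one possible extension. The following is a minimal sketch under that assumption; `split_train_validation`, `write_jsonl`, the 20% fraction, and the file names in the usage comment are all hypothetical choices, not something the original notebook does.

```python
import json
import random


def split_train_validation(examples, validation_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split off a validation set.

    Always holds out at least one example, so very small datasets
    leave correspondingly fewer training examples.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def write_jsonl(path, examples):
    """Write one JSON object per line, the format used for fine-tuning files."""
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")


# Hypothetical usage with the `data` list built above:
# train, validation = split_train_validation(data)
# write_jsonl("train.jsonl", train)
# write_jsonl("validation.jsonl", validation)
```

Each file would then be uploaded separately, with the validation file's ID passed as `validation_file` when the job is created.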

In [89]:
data_path = "output.jsonl"
# Load the dataset
with open(data_path, "r", encoding="utf-8") as f:
    dataset = [json.loads(line) for line in f]

# Sample training data format
for message in dataset[0]["messages"]:
    print(message)

{'role': 'system', 'content': 'You are a helpful chatbot that writes youtube video transcript outlines given a video title.'}
{'role': 'user', 'content': 'Given the following video title for a Youtube video: Hydraulic Press Vs Lamborghini, what would be a good video outline? '}
{'role': 'assistant', 'content': 'The video features a series of extreme experiments, including flattening a Lamborghini with a hydraulic press, dropping a car from a helicopter into a pool of Orbeez, and testing if semi-trucks can stop a train. The experiments are over-the-top and often result in destruction, but also include humor and excitement from the participants. The video ends with a dramatic explosion of the flattened Lamborghini.'}


In [24]:
# Format error checks
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue

    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue

    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1

        if any(
            k not in ("role", "content", "name", "function_call", "weight")
            for k in message
        ):
            format_errors["message_unrecognized_key"] += 1

        if message.get("role", None) not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1

        content = message.get("content", None)
        function_call = message.get("function_call", None)

        if (not content and not function_call) or not isinstance(content, str):
            format_errors["missing_content"] += 1

    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

No errors found


In [28]:
encoding = tiktoken.get_encoding("cl100k_base")


# Estimate the number of tokens
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens


def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens


def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    # Quantiles 0.1 and 0.9 are the 10th and 90th percentiles
    print(f"p10 / p90: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")

In [29]:
# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))

print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(length > 4096 for length in convo_lens)
print(
    f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning"
)

Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p10 / p90: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 121, 208
mean / median: 159.48, 158.0
p10 / p90: 135.8, 188.2

#### Distribution of num_assistant_tokens_per_example:
min / max: 63, 147
mean / median: 97.42, 96.5
p10 / p90: 74.7, 125.2

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning


In [90]:
MAX_TOKENS_PER_EXAMPLE = 4096

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 50
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(
    min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens
)
print(
    f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training"
)
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(
    f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens"
)

Dataset has ~7974 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~23922 tokens
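The epoch rule and billing arithmetic above can be wrapped in small functions to try other dataset sizes. This is a sketch only: `default_epochs` mirrors the constants from the cell above, while `estimated_training_cost` and its `usd_per_million_tokens` argument are hypothetical, and the price has to be supplied by the caller since the notebook does not state one.

```python
def default_epochs(
    n_examples,
    target_epochs=3,
    min_target_examples=50,
    max_target_examples=25000,
    min_epochs=1,
    max_epochs=25,
):
    """Estimate the default epoch count using the same rule as the cell above."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    n_epochs = target_epochs
    if n_examples * target_epochs < min_target_examples:
        # Tiny datasets get more passes, capped at max_epochs
        n_epochs = min(max_epochs, min_target_examples // n_examples)
    elif n_examples * target_epochs > max_target_examples:
        # Huge datasets get fewer passes, floored at min_epochs
        n_epochs = max(min_epochs, max_target_examples // n_examples)
    return n_epochs


def estimated_training_cost(n_tokens, n_epochs, usd_per_million_tokens):
    """Billed tokens are dataset tokens times epochs; the price is an input."""
    return n_tokens * n_epochs * usd_per_million_tokens / 1_000_000


# For this notebook: 50 examples -> 3 epochs, and 7974 * 3 = 23922 billed tokens
billed_tokens = 7974 * default_epochs(50)
```

The estimate is approximate: the completed job listed further down reports `trained_tokens=23622`, slightly below the ~23922 computed here.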


In [32]:
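# Upload the training file; the context manager closes the file handle afterwards
with open("output.jsonl", "rb") as f:
    training_file = client.files.create(file=f, purpose="fine-tune")
training_file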

FileObject(id='file-7TY1KQRzwaf1HevG86djSLYF', bytes=40404, created_at=1716700619, filename='output.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)

In [33]:
# Create a fine-tuning job
client.fine_tuning.jobs.create(
    training_file="file-7TY1KQRzwaf1HevG86djSLYF", model="gpt-3.5-turbo"
)

FineTuningJob(id='ftjob-rAGnT9XdLyId6FCjvrJAtPby', created_at=1716700664, error=Error(code=None, message=None, param=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-IzVYmXYvxXSW8yrBJM3dP4nZ', result_files=[], seed=446844219, status='validating_files', trained_tokens=None, training_file='file-7TY1KQRzwaf1HevG86djSLYF', validation_file=None, estimated_finish=None, integrations=[], user_provided_suffix=None)

In [41]:
client.fine_tuning.jobs.list(limit=3)

SyncCursorPage[FineTuningJob](data=[FineTuningJob(id='ftjob-rAGnT9XdLyId6FCjvrJAtPby', created_at=1716700664, error=Error(code=None, message=None, param=None), fine_tuned_model='ft:gpt-3.5-turbo-0125:ben::9T0gvsHl', finished_at=1716701196, hyperparameters=Hyperparameters(n_epochs=3, batch_size=1, learning_rate_multiplier=2), model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-IzVYmXYvxXSW8yrBJM3dP4nZ', result_files=['file-YzJiigp5bZD8IRYNuyPE7LhA'], seed=446844219, status='succeeded', trained_tokens=23622, training_file='file-7TY1KQRzwaf1HevG86djSLYF', validation_file=None, estimated_finish=None, integrations=[], user_provided_suffix=None), FineTuningJob(id='ftjob-OPDXiPEtvGIUXoOBtEAGzfVr', created_at=1714085198, error=Error(code=None, message=None, param=None), fine_tuned_model='ft:gpt-3.5-turbo-0125:ben::9I2M7Zs3', finished_at=1714085985, hyperparameters=Hyperparameters(n_epochs=3, batch_size=1, learning_rate_multiplier=2), model='gpt-3.5-turbo-0125', object='f
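The job moves from `validating_files` to `succeeded` between the two cells above, which in the notebook means re-running the listing by hand. A polling loop is one alternative. The sketch below assumes the client's `fine_tuning.jobs.retrieve` call and treats `succeeded`, `failed`, and `cancelled` as terminal statuses; `wait_for_job`, the poll interval, and the poll limit are hypothetical choices.

```python
import time

# Statuses after which the job will not change any further
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(client, job_id, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll a fine-tuning job until it reaches a terminal status.

    Returns the final job object, or raises TimeoutError after max_polls checks.
    `sleep` is injectable so the loop can be exercised without real waiting.
    """
    for _ in range(max_polls):
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")


# Hypothetical usage with the job created above:
# job = wait_for_job(client, "ftjob-rAGnT9XdLyId6FCjvrJAtPby")
# print(job.status, job.fine_tuned_model)
```

On success, `job.fine_tuned_model` holds the model name to use in the test step.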

3. Test the finetuned model and the base gpt-3.5-turbo on the same prompt

In [11]:
test_transcript = data_df[data_df["title"] == "I Built A Giant House Using Only Legos"][
    "transcript"
].to_list()

In [12]:
# Creating a summary for a test transcript
summary_prompt = (
    f"Summarize the following transcript of a Youtube video: {test_transcript}."
)
gpt_summary = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0.7,
    messages=[{"role": "user", "content": summary_prompt}],
)
# Strip any double quotes wrapping the returned summary
result = gpt_summary.choices[0].message.content.strip('"')
result

'The video shows the process of building a giant Lego house using over a million Legos, including building furniture like a toilet, TV, bed, and desk. The house is tested for durability against elements like hurricanes, earthquakes, and even a 50-caliber gun. The house is also tested for security and egg-proofing. Eventually, the house is destroyed by a Lego car, but it is announced that one lucky Instagram follower will get the house. The video ends with the realization that the house was built just to be torn down for content.'

In [19]:
# Testing the fine-tuned model
completion = client.chat.completions.create(
    model="ft:gpt-3.5-turbo-0125:ben::9T0gvsHl",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful chatbot that writes youtube video transcripts given a video title.",
        },
        {
            "role": "user",
            "content": "Mr.Beast is planning to make YT video with the following title: 'I Built A Giant House Using Only Legos'. what would be a good video outline?",
        },
    ],
)
print(completion.choices[0].message)

ChatCompletionMessage(content='Mr. Beast has teamed up with a bunch of friends to build a giant Lego house in stop-motion style. They rent a warehouse and start building walls, floors, and rooms using thousands of Lego bricks. Various challenges and funny moments occur as they build and move through the Lego house. Eventually, they complete the house with a working toilet and a blimp hovering above it. The video ends with an exciting announcement that the Lego house itself will be given away to a random viewer who comments on the video.', role='assistant', function_call=None, tool_calls=None)


In [18]:
# Testing the base model
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful chatbot that writes youtube video transcripts given a video title.",
        },
        {
            "role": "user",
            "content": "Mr.Beast is planning to make YT video with the following title: 'I Built A Giant House Using Only Legos'. what would be a good video outline?",
        },
    ],
)
print(completion.choices[0].message)

ChatCompletionMessage(content="Title: I Built A Giant House Using Only Legos\n\nOutline:\n0:00 - 0:30 Introduction\n- Brief introduction of Mr. Beast and the challenge of building a giant house using only Legos\n- Excitement and anticipation for the upcoming project\n\n0:30 - 1:00 Gathering Supplies\n- Showing the process of gathering a massive amount of Legos needed for the project\n- Setting up the workspace for the construction of the Lego house\n\n1:00 - 3:00 Building the Foundation\n- Time-lapse footage of laying down the foundation for the Lego house\n- Overcoming challenges and setbacks during the construction process\n\n3:00 - 5:00 Constructing the Walls and Roof\n- Focusing on building the walls and roof of the giant Lego house\n- Detailing the intricate design elements and creative solutions used in construction\n\n5:00 - 6:00 Final Touches and Decorations\n- Adding final touches, details, and decorations to the Lego house\n- Showcasing the completed giant Lego house from dif
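The two test cells above use a system prompt and user message worded slightly differently from the training examples. A possible variation is to reuse the training-time wording at inference and send the identical messages to both models, so that any difference comes from the model rather than the prompt. The sketch below assumes that matching the training format is desirable; `build_outline_messages` and `compare_models` are hypothetical helpers, and the outputs recorded above were not produced with them.

```python
# Same system prompt and user template as in the training data
SYSTEM_PROMPT = (
    "You are a helpful chatbot that writes youtube video transcript outlines "
    "given a video title."
)


def build_outline_messages(title):
    """Build chat messages in the same format as the training examples."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": f"Given the following video title for a Youtube video: {title}, what would be a good video outline? ",
        },
    ]


def compare_models(client, models, title):
    """Send the same messages to each model and collect the replies by model name."""
    messages = build_outline_messages(title)
    outputs = {}
    for model in models:
        completion = client.chat.completions.create(model=model, messages=messages)
        outputs[model] = completion.choices[0].message.content
    return outputs


# Hypothetical usage:
# outputs = compare_models(
#     client,
#     ["ft:gpt-3.5-turbo-0125:ben::9T0gvsHl", "gpt-3.5-turbo"],
#     "I Built A Giant House Using Only Legos",
# )
```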

We observe that the finetuned model captures the creator's style more closely than directly prompting gpt-3.5-turbo, which returns a generic timestamped outline. The model could be tuned further. We used only 50 data samples here; finetuning on a larger sample set would likely capture the creator's content style better.