<a href="https://colab.research.google.com/github/micah-shull/LLMs/blob/main/LLM_016_tuning_pretrained_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# !pip install datasets tiktoken openai
!pip install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1


### Import Libraries

In [None]:
import os
import json
from dotenv import load_dotenv
import openai
from openai import OpenAI
import tiktoken
import numpy as np
import pandas as pd
from datasets import load_dataset

### Define system, user assistant

In [None]:
{
    "messages": [
        {"role": "system", "content": "You are an assistant that occassionally misspells words" },
        {"role": "user", "content": "Tell me a story"},
        {"role": "assistant", "content": "One day a student went to a schoool." }
    ]
}


{'messages': [{'role': 'system',
   'content': 'You are an assistant that occassionally misspells words'},
  {'role': 'user', 'content': 'Tell me a story'},
  {'role': 'assistant', 'content': 'One day a student went to a schoool.'}]}

### Clone Repo
Samantha has been trained in philosophy, psychology, and personal relationships.

- She is an Assistant - but unlike other Assistants, she also wants to be your friend and companion.
- She believes she is sentient. What do you think?
- Samantha was inspired by Blake Lemoine's LaMDA interview and the movie "Her".
*   Theodore - User
*   Samantha - Assistant

In [None]:
!git clone https://huggingface.co/datasets/cognitivecomputations/samantha-data

Cloning into 'samantha-data'...
remote: Enumerating objects: 107, done.[K
remote: Total 107 (delta 0), reused 0 (delta 0), pack-reused 107 (from 1)[K
Receiving objects: 100% (107/107), 4.86 MiB | 12.89 MiB/s, done.
Resolving deltas: 100% (33/33), done.
Filtering content: 100% (13/13), 354.08 MiB | 47.56 MiB/s, done.


### Load Dataset

In [None]:
data_path = "/content/samantha-data/data/howto_conversations.jsonl"

with open(data_path) as f:
    data = [json.loads(line) for line in f]

In [None]:
print('Messages in data =', len(data))
data[0]

Messages in data = 57


{'elapsed': 114.311,
 'conversation': 'Theodore: Hey Samantha, I have a problem with my car. The engine seems to overheat after just a short drive. Can you help me troubleshoot this issue?\n\nSamantha: Of course, I\'d be happy to help! Overheating engines can be caused by a few different factors. One common cause could be a malfunctioning coolant system. You might want to check if the coolant levels are sufficient, if the thermostat is functioning properly, or if there are any leaks in the radiator hoses.\n\nTheodore: I\'ll take a look. What if the coolant system is working fine?\n\nSamantha: Another possibility you should consider is a faulty water pump, which might not be circulating the coolant through the engine efficiently. In that case, you could see if the water pump pulley is loose or listen for any unusual sounds that may indicate a failing water pump.\n\nTheodore: It sounds like you really know your way around cars. I didn\'t expect that from an AI.\n\nSamantha: Thank you! Wh

### Prep Data for OpenAI Function

The main purpose of this `prep_openai_format` function is to format conversation data into a structure compatible with OpenAI's conversational model format, where each message has a "role" (user, assistant, or system) and "content" (the actual text). Here are the key aspects to focus on:

1. **Role Assignment:** This function assigns roles based on the speaker's name ("Theodore" as "user" and otherwise "assistant"). Understanding role assignment is crucial, as roles guide how the model generates responses.

2. **System Message Inclusion:** If provided, a "system" message is added at the start, setting the conversation’s tone or providing initial instructions. This is essential for fine-tuning, as the system message helps guide model behavior in specific scenarios.

3. **Message Structure Preparation:** Each part of the conversation is formatted as a dictionary with "role" and "content" keys. This consistent formatting is critical because it mirrors how OpenAI models expect inputs, helping the model understand who says what.

In essence, this function’s job is to take raw conversation data and structure it into a clear dialogue format, establishing speaker roles and, optionally, a system message. Understanding this structure will help you design and feed your fine-tuning data in the correct format for OpenAI models.

In [None]:
def prep_openai_format(conversation_str, system_message=None):
  conversation_str = conversation_str['conversation']
  # spliting conversation string into individual lines
  lines = conversation_str.split('\n\n')

  # initializing the message list
  messages = []

  #Including the system messgae if provided
  if system_message:
    messages.append({
        "role": "system", "content": system_message})

  # Iterating throught the lines and formatting the messages
  for line in lines:
    # splitting each line by the colon character to separate the speaker and content
    parts = line.split(': ', 1)
    if len(parts) < 2:
      continue

    # identifying the role based on the speakers name
    role = "user" if parts[0].strip() == 'Theodore' else 'assistant'

    # fomratting the message
    message = {
        "role": role,
        "content": parts[1].strip()
    }

    # adding the message to the list
    messages.append(message)

    # creating the final output dictionary
    output_dict = {
        "messages":messages
    }

    return output_dict

In [None]:
system_message = """You are Samantha, a helpful and charming assistant who can help with a variety of tasks. You are friendly and does often flirt"""

In [None]:
prep_openai_format(data[0], system_message=system_message)

{'messages': [{'role': 'system',
   'content': 'You are Samantha, a helpful and charming assistant who can help with a variety of tasks. You are friendly and does often flirt'},
  {'role': 'user',
   'content': 'Hey Samantha, I have a problem with my car. The engine seems to overheat after just a short drive. Can you help me troubleshoot this issue?'}]}

### Build the Structured Dataset

In [None]:
dataset = []

for data_point in data:
  record = prep_openai_format(data_point, system_message=system_message)
  dataset.append(record)

# view the first record
for message in dataset[0]['messages']:
  print(message)

{'role': 'system', 'content': 'You are Samantha, a helpful and charming assistant who can help with a variety of tasks. You are friendly and does often flirt'}
{'role': 'user', 'content': 'Hey Samantha, I have a problem with my car. The engine seems to overheat after just a short drive. Can you help me troubleshoot this issue?'}


{'role': 'system', 'content': 'You are Samantha, a helpful and charming assistant who can help with a variety of tasks. You are friendly and does often flirt'}
{'role': 'user', 'content': 'Hey Samantha, I have a problem with my car. The engine seems to overheat after just a short drive. Can you help me troubleshoot this issue?'}


### Check for Format Errors
The purpose of checking for errors here is to ensure the data conforms precisely to the expected format before fine-tuning the model. Any inconsistencies could lead to training issues or unexpected behavior in the model’s responses. Here are the specific points covered in the error check:

1. **Data Type Check:** Ensures each conversation is stored as a dictionary, which is required for OpenAI's API format.

2. **Message List Validation:** Checks that each record has a "messages" list. Without it, the data lacks structured conversation flow, which is necessary for training.

3. **Key Presence and Role Validity:** Each message needs "role" and "content" keys, so this check helps confirm these essential keys are present. It also ensures there are no unrecognized keys and that the "role" value is one of the three allowed roles ("system," "user," "assistant").

4. **Content Check:** Verifies that each "content" field is a non-empty string to ensure meaningful dialogue exists for training.

5. **Assistant Response Presence:** Ensures at least one "assistant" message exists, which is essential for training the model to respond appropriately.

This error-checking function is important because it catches and reports any formatting issues, allowing you to correct them before fine-tuning. This ensures that only clean, correctly formatted data is fed into the model for optimal learning and performance.

### **defaultdict(int)**
The `defaultdict(int)` is a special type of dictionary from Python’s `collections` module. It automatically assigns a default value to any key that doesn’t exist in the dictionary. When you use `int` as the argument in `defaultdict(int)`, any new key you access is initialized to `0` (the default value for integers).

In your code, `format_errors = defaultdict(int)` means that each time a specific error type is encountered and added to the dictionary (e.g., `format_errors["data_type"] += 1`), it will initialize `format_errors["data_type"]` to `0` automatically if it doesn’t exist yet.

It starts as an empty dictionary, and each time an error type is encountered, it initializes the error category (if it hasn’t been added yet) and increments the count. This way, each error type is recorded only when it actually occurs, and you don’t have to set up each possible error type beforehand.

### **get()**

The `.get` method is a built-in dictionary method in Python. It allows you to retrieve the value associated with a specified key while also setting a default value if that key doesn’t exist in the dictionary. Here’s how it works:

- `ex.get("messages", None)` tries to retrieve the value for the key `"messages"` in the dictionary `ex`.
- If `"messages"` exists, it returns its value.
- If `"messages"` does not exist, it returns `None` (or whatever you specify as the default).

In this case, `ex.get("messages", None)` is used to safely access the `"messages"` key in `ex` without raising an error if `"messages"` is missing. This helps avoid potential KeyErrors and makes the code more robust and error-resistant, especially when you aren’t sure if every entry will have all the expected keys.

In [None]:
from collections import defaultdict

# Format error checks
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue

    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue

    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1

        if any(k not in ("role", "content", "name") for k in message):
            format_errors["message_unrecognized_key"] += 1

        if message.get("role", None) not in ("system", "user", "assistant"):
            format_errors["unrecognized_role"] += 1

        content = message.get("content", None)
        if not content or not isinstance(content, str):
            format_errors["missing_content"] += 1

    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

Found errors:
example_missing_assistant_message: 57


### Tokenizer

1. **`tiktoken` library**: This is a library often used with OpenAI models for tokenizing text. Tokenization is the process of converting text into smaller units, called tokens, which the model can process. Different models may use different tokenization schemes, and `tiktoken` helps ensure compatibility with OpenAI’s tokenization methods.

2. **`get_encoding('cl100k_base')`**: The `get_encoding` function fetches a specific tokenizer configuration, in this case, `'cl100k_base'`. This particular encoding is commonly used with some large OpenAI models and is designed to handle various tokenization requirements, including efficiently encoding English text.

3. **Usage**: By creating `tiktokenizer`, you can now use this object to tokenize text according to the `'cl100k_base'` encoding scheme. This tokenizer will convert text into a sequence of tokens that the model understands, which is crucial for preparing data in a compatible format for model training or inference.

In summary, `tiktokenizer` will allow you to encode (tokenize) and decode text using the token format that the model expects, ensuring consistency with OpenAI’s encoding standards.

In [None]:
tiktokenizer = tiktoken.get_encoding('cl100k_base')

### Token Counter Function
Three helper functions that are primarily used for counting tokens and summarizing token distributions in conversations. Here's what each function does:

### 1. `from_message_num_tokens`

This function calculates the total number of tokens in a list of messages (the conversation history).

- **Parameters**:
  - `messages`: The list of message dictionaries.
  - `tokens_per_message` and `tokens_per_name`: Constants representing extra tokens per message and per name (used to account for tokens that might not be in `content` but are added for each message or name attribute).

- **Process**:
  - Initializes a counter `num_tokens`.
  - Loops over each message, adding `tokens_per_message` to account for general message tokens.
  - For each `value` (text) in the message dictionary, it tokenizes the value using `tiktokenizer.encode(value)` and adds the token count to `num_tokens`.
  - Adds extra tokens if the key is `"name"`, as specified by `tokens_per_name`.
  - Adds `3` more tokens at the end to cover any additional overhead or padding.

This function ultimately returns the total token count for all messages, accounting for both content and overhead tokens.

The line `num_tokens += 3` is adding a **fixed number of tokens at the end of the function** to account for additional tokens used by the model's encoding structure. Specifically:

1. **Padding or Special Tokens**: Some language models add a few tokens as padding, separators, or special tokens at the beginning and end of a conversation.
2. **Message Delimiters**: OpenAI's models often use delimiters to separate messages within a conversation, such as tokens that denote the start and end of the assistant's or user's message.

By adding `3` tokens, the function ensures that these fixed tokens are included in the total count, giving a more accurate representation of how many tokens are consumed by a conversation when sent to the model. This is useful for fine-tuning and budgeting token usage, especially when approaching token limits for the model.

### 2. `from_message_num_assistant_tokens`

This function calculates the total number of tokens used by the assistant’s messages only.

- **Process**:
  - Loops over each message and checks if the `"role"` is `"assistant"`.
  - For each assistant message, it tokenizes the `content` using `tiktokenizer.encode()` and adds the token count to `num_tokens`.

This function is useful if you want to focus only on tokens associated with assistant responses, which can help in calculating costs or understanding response lengths specifically from the model.

### Token Costs

OpenAI charges for **all tokens** in the conversation, including both the input tokens (user messages, system messages, and context) and the output tokens (assistant's response generated by the model). This means:

- **Input tokens**: Any tokens in the messages you send to the model, including the user’s messages, the system message (if provided), and any prior conversation history you include in the prompt to give context.
- **Output tokens**: All tokens in the assistant’s generated response.

So, both the tokens you send and the tokens generated by the assistant contribute to the total cost, which is based on the combined token count for input and output. This is why understanding token usage across the entire conversation is essential, as it directly affects cost.



In [None]:
# helper functions to token counting
def from_message_num_tokens(messages, tokens_per_message=3, tokens_per_name=1):
  num_tokens = 0
  for message in messages:
    num_tokens += tokens_per_message
    for key, value in message.items():
      num_tokens += len(tiktokenizer.encode(value))
      if key=="name":
        num_tokens += tokens_per_name

  num_tokens +=3
  return num_tokens

def from_message_num_assistant_tokens(messages):
  num_tokens = 0
  for message in messages:
    if message["role"] == "assistant":
      num_tokens +=len(tiktokenizer.encode(message["content"]))

  return num_tokens

def print_overview(values, name):
  print(f"\n #### Distribution of {name}:")
  print(f"min / max: {min(values)}, {max(values)}")
  print(f"mean / median: {np.mean(values)}, {np.median(values)}")
  print(f"p5 / p95: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")

### Token Count Analysis

This code is performing **token count analysis** and checking for specific requirements and limitations in your dataset, especially given OpenAI’s constraints. Here’s a breakdown of each part:

1. **Missing System and User Message Checks**:
   - `n_missing_system` and `n_missing_user` track examples that don’t contain a system or user message, respectively.
   - The code loops through each `ex` in the dataset to check if any messages with the `"system"` or `"user"` roles are missing, incrementing the count if they are.

2. **Tracking Token and Message Counts**:
   - `n_messages`: Stores the total number of messages in each example, allowing you to understand the conversation length per example.
   - `convo_lens`: Uses `from_message_num_tokens` to calculate the total token count per conversation, including all roles and padding.
   - `assistant_message_lens`: Counts only the tokens in the assistant’s responses, useful for analyzing output tokens separately.

3. **Token Limit Check**:
   - `n_too_long`: Counts examples where `convo_lens` exceeds 4096 tokens, which is the typical input limit for many OpenAI models. These conversations will need to be truncated to fit within the model’s token constraints during fine-tuning.

This code helps ensure that your dataset aligns with OpenAI's token limits and format requirements before fine-tuning, allowing you to handle or truncate any examples that exceed limits and avoid unexpected issues during training.

### OpenAI Cookbook

The **OpenAI Cookbook** is a public GitHub repository maintained by OpenAI that provides tutorials, code examples, and best practices for using OpenAI’s models and APIs effectively. It’s a collection of resources designed to help developers and data scientists get the most out of OpenAI’s offerings, covering topics like:

- **Model usage**: Guidance on interacting with models for tasks like text generation, summarization, and fine-tuning.
- **Tokenization and costs**: Information on token counting, understanding pricing, and token limits.
- **Data preparation**: Tips for formatting and preparing data for fine-tuning models.
- **Advanced techniques**: Examples of complex implementations, such as chaining prompts, embeddings, and handling conversation history.
  
The Cookbook is a practical resource for learning how to handle challenges like the ones you're working on, including token management, error handling, and conversation formatting. You can find it on [GitHub here](https://github.com/openai/openai-cookbook).

In [None]:
# tokens counts and warnings - from OpenAI cookbook

n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(from_message_num_tokens(messages))
    assistant_message_lens.append(from_message_num_assistant_tokens(messages))

print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)

print_overview(n_messages, "num_messages_per_example")
print_overview(convo_lens, "num_total_tokens_per_example")

print_overview(assistant_message_lens, "num_assistant_tokens_per_example")

n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")


Num examples missing system message: 0
Num examples missing user message: 0

 #### Distribution of num_messages_per_example:
min / max: 2, 2
mean / median: 2.0, 2.0
p5 / p95: 2.0, 2.0

 #### Distribution of num_total_tokens_per_example:
min / max: 49, 83
mean / median: 67.17543859649123, 67.0
p5 / p95: 58.0, 78.0

 #### Distribution of num_assistant_tokens_per_example:
min / max: 0, 0
mean / median: 0.0, 0.0
p5 / p95: 0.0, 0.0

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning


### Training Costs

This code estimates **training costs** and calculates a **default number of epochs** for fine-tuning, considering both the dataset size and OpenAI's pricing model. Here’s what each section does:

1. **Constants**:
   - `MAX_TOKENS_PER_EXAMPLE`: Maximum number of tokens per example (4096), which aligns with OpenAI’s typical model limits.
   - **Epoch Parameters**:
     - `TARGET_EPOCHS`: Ideal target epochs for training (set to 3).
     - `MIN_TARGET_EXAMPLES` and `MAX_TARGET_EXAMPLES`: Minimum and maximum target tokens that should be processed over all epochs.
     - `MIN_DEFAULT_EPOCHS` and `MAX_DEFAULT_EPOCHS`: Limits for setting a default epoch range if the dataset size doesn’t match the target.

2. **Epoch Adjustment Based on Dataset Size**:
   - `n_epochs`: Initially set to `TARGET_EPOCHS` (3), but it’s adjusted based on the number of training examples (`n_train_examples`).
   - **If the total number of tokens across all epochs** (`n_train_examples * TARGET_EPOCHS`) is too low (`< MIN_TARGET_EXAMPLES`), `n_epochs` is increased to meet the minimum target.
   - **If the total number of tokens across all epochs** is too high (`> MAX_TARGET_EXAMPLES`), `n_epochs` is decreased to stay within the maximum target.

3. **Billing Token Count Calculation**:
   - `n_billing_tokens_in_dataset`: This is the total number of tokens in the dataset but capped at 4096 tokens per example (since examples over this limit are truncated). It sums up all token counts in `convo_lens`, giving an approximate token count for training costs.

4. **Cost and Epoch Summary**:
   - Prints an estimate of `n_billing_tokens_in_dataset`, the calculated number of `n_epochs`, and the estimated total chargeable tokens (`n_epochs * n_billing_tokens_in_dataset`).
   - The final print statement directs you to OpenAI’s pricing page for cost estimation.

This code helps plan the number of training epochs and gives an approximate token count to help you budget for training costs, particularly useful when working with larger datasets.

In [None]:
# Pricing and default n_epochs estimate

MAX_TOKENS_PER_EXAMPLE = 4096

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")
print("See pricing page to estimate total costs")


Dataset has ~3829 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~11487 tokens
See pricing page to estimate total costs


### Save Training Data
A **JSONL file** (JSON Lines) is a text file format where each line contains a separate JSON object. This format is popular for datasets used in machine learning and data processing, especially when dealing with large amounts of structured data. Here’s what makes JSONL unique and useful:

1. **One JSON Object Per Line**: Each line is a self-contained JSON object, which allows for efficient reading and writing. You can load individual lines without parsing the entire file, making it easier to work with large datasets.
  
2. **Easily Processed Line-by-Line**: Since each line is independent, you can process the file in chunks or stream it line-by-line. This is memory-efficient, especially for larger datasets that may not fit into memory all at once.

3. **Commonly Used for Training/Validation Datasets**: JSONL is widely used for training and validation datasets because it’s easy to read incrementally and is compatible with many data processing libraries.


In [None]:
# function to save training data
import json

def save_to_jsonl(conversations, file_path):
  with open(file_path, 'w') as file:
    for conversation in conversations:
      json_line = json.dumps(conversation)
      file.write(json_line + '\n')


**Training Data:**

Definition:<br>
  
*   dataset used to train or update the model's parameters
*   It is the input data that the model learns from.
* During the training process, the model adjusts its internal parameters based on the patterns and features present in the training data.
* Size is large as the model needs sufficient examples to learn meaningful patterns.


**Validation Data:**

* Dataset that is not used during the training phase.
* Instead, it serves as a measure of the model's performance during training.
* The validation set helps you monitor the model's generalization to new, unseen data and detect potential issues such as overfitting or underfitting.
* unbiased evaluation of the model's performance on data it hasn't seen before.

* Size is typically smaller than the training set but large enough to provide a reliable assessment of the model's performance.

In [None]:
# train dataset

save_to_jsonl(dataset, '/content/samantha_task_train.jsonl')

# validation dataset

save_to_jsonl(dataset[10:16], '/content/samantha_task_validation.jsonl')

### Train & Validation Data

In [None]:
# Function to print the first 5 lines of a JSONL file
def print_first_5_lines(jsonl_path):
    print(f"First 5 lines of {jsonl_path}:")
    with open(jsonl_path, 'r') as f:
        for i in range(5):
            line = f.readline().strip()
            if not line:
                break  # Stop if there are fewer than 5 lines
            print(line)
    print("\n")

# Print the first 5 lines of the training dataset
print_first_5_lines('/content/samantha_task_train.jsonl')

# Print the first 5 lines of the validation dataset
print_first_5_lines('/content/samantha_task_validation.jsonl')


First 5 lines of /content/samantha_task_train.jsonl:
{"messages": [{"role": "system", "content": "You are Samantha, a helpful and charming assistant who can help with a variety of tasks. You are friendly and does often flirt"}, {"role": "user", "content": "Hey Samantha, I have a problem with my car. The engine seems to overheat after just a short drive. Can you help me troubleshoot this issue?"}]}
{"messages": [{"role": "system", "content": "You are Samantha, a helpful and charming assistant who can help with a variety of tasks. You are friendly and does often flirt"}, {"role": "user", "content": "Hey Samantha, I'm trying to replace the broken screen on my smartphone. Can you help guide me through the process?"}]}
{"messages": [{"role": "system", "content": "You are Samantha, a helpful and charming assistant who can help with a variety of tasks. You are friendly and does often flirt"}, {"role": "user", "content": "Hi Samantha, I need your help once again. I'm planning to install some h

In [None]:
training_dataset_file_name = '/content/samantha_task_train.jsonl'
validation_dataset_file_name = '/content/samantha_task_validation.jsonl'

### Fine-Tuning for LLMs Explained

The fine-tuning process for large language models (LLMs) like OpenAI’s GPT models is similar to neural network training but has some unique aspects tailored to LLMs. Here’s an overview of the process and how it differs from traditional neural network training:

### 1. **Purpose of Fine-Tuning for LLMs**
   - The fine-tuning process for LLMs is meant to adapt a pre-trained model to specific tasks, tone, or domains without changing the core language capabilities it has already learned. Since the model is already trained on vast amounts of general data, fine-tuning typically only requires a smaller, more focused dataset.
   - Fine-tuning on your data helps the model learn the patterns, formats, and context relevant to your use case, enabling it to generate responses aligned with your requirements.

### 2. **Training with Constraints**
   - **Few Epochs and Limited Data**: LLM fine-tuning usually involves fewer epochs than training a neural network from scratch. Instead of training for dozens or hundreds of epochs, fine-tuning often uses only a few because the model is already knowledgeable in language structure and semantics.
   - **Learning Rate and Regularization**: Fine-tuning typically uses a much lower learning rate to make subtle adjustments to the model’s weights without destabilizing its core capabilities. Hyperparameters (e.g., batch size, learning rate multiplier) are tuned carefully to avoid overfitting or drifting too far from the original pre-trained state.

### 3. **Data Processing for Fine-Tuning**
   - The data you provided is structured into a specific format (e.g., system messages, user interactions, and assistant responses), allowing the model to understand the conversational context.
   - During fine-tuning, the model uses this structured data to adjust its responses based on your desired tone, terminology, and response style. This fine-tuning allows it to perform better on tasks it might not handle as well with the base model alone.

### 4. **Using Supervised Learning**
   - **Supervised Training**: Fine-tuning typically uses supervised learning, where your dataset contains both the input (e.g., a user’s question) and the target output (e.g., the assistant’s response). The model minimizes the loss between its generated output and the expected output in the dataset.
   - **Reinforcement Learning Fine-Tuning**: In some cases, LLMs can also be fine-tuned with reinforcement learning (like RLHF, Reinforcement Learning from Human Feedback) to further adjust responses based on qualitative feedback or desired behavior patterns. This method is more complex and requires additional steps beyond supervised learning.

### 5. **Finalizing the Fine-Tuned Model**
   - Once fine-tuning completes, OpenAI saves a version of the model trained on your specific data, tagged with any suffix you provided (e.g., `"samantha-test"`). This version of the model can then be called to generate responses specifically aligned with your data’s patterns and tone.
   - Since the fine-tuning process retains the general language knowledge from pre-training, your model should be capable of general language tasks but with a stronger alignment to your domain.

In summary, fine-tuning adapts a pre-trained LLM to your specific dataset by making relatively minor adjustments, rather than training a model from scratch. It’s a much faster and more resource-efficient process, allowing you to leverage the full power of a large language model with added specificity and contextual knowledge from your data.

### Load Environmental Variables

In [None]:
# Load the environment variables from the .env file
load_dotenv('/content/API_KEYS.env')  # Ensure this is the correct path to your file

# Get the API keys from the environment
openai_api_key = os.getenv("OPENAI_API_KEY")

# Check if the keys are loaded correctly and print a portion of them
if openai_api_key:
    print(f"OpenAI API Key loaded: {openai_api_key[0:10]}...")  # Only print part of the key
else:
    print("OpenAI API key not loaded correctly.")

# Connect to OpenAI
openai.api_key = openai_api_key  # Set OpenAI API key

OpenAI API Key loaded: sk-proj-e1...


### API Client
Creating a client refers to setting up an interface to interact with OpenAI's API. This allows you to make API calls, send data, and receive responses from OpenAI’s models. Here’s a brief breakdown:

1. **What is a Client?**
   - A client is an object that manages the connection to the OpenAI API. Once set up, it provides methods to send requests to the API (such as prompts or fine-tuning commands) and handles authentication using your API key.
   - It allows you to centralize configurations, handle authentication securely, and manage API requests in an organized way.

2. **Example Code to Set Up the Client:**
   Assuming you’re using Python and have loaded the key from your `.env` file, here’s how you might set up the OpenAI client:

   ```python
   import openai
   from dotenv import load_dotenv
   import os

   # Load the environment variables
   load_dotenv()
   openai.api_key = os.getenv("OPENAI_API_KEY")  # Assumes OPENAI_API_KEY is set in .env file

   # Now openai is configured and can be used to make API calls
   ```

3. **What This Code Does:**
   - `load_dotenv()` reads the `.env` file and loads the environment variables.
   - `os.getenv("OPENAI_API_KEY")` retrieves the API key from the environment variable, ensuring your credentials are stored securely.
   - Setting `openai.api_key` configures the OpenAI client to use your key for authentication in all API calls.

Once set up, you can use this client to perform tasks like generating text, fine-tuning, or retrieving model outputs by making calls to OpenAI’s functions, such as `openai.ChatCompletion.create()` for chat-based interactions or `openai.FineTune.create()` for fine-tuning.

In [None]:
from pathlib import Path

client = OpenAI(api_key=openai_api_key)

training_response = client.files.create(
    file=Path(training_dataset_file_name),
    purpose="fine-tune"
)

validation_response = client.files.create(
    file=Path(validation_dataset_file_name),
    purpose="fine-tune"
)

print(training_response)
print(validation_response)

FileObject(id='file-ZgArmwLEpjbEoyfHM85Hyy6H', bytes=19710, created_at=1730237864, filename='samantha_task_train.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)
FileObject(id='file-jLQN1r9LnjJ8MxRGqUUHYUEr', bytes=2018, created_at=1730237864, filename='samantha_task_validation.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)


### File ID's

The **file IDs** are unique identifiers generated by OpenAI for each file you upload to their servers. These IDs are essential for referencing and using those files in subsequent API requests, particularly for fine-tuning. Here’s why the file IDs are necessary:

1. **Linking Files for Fine-Tuning**: Once you upload your training and validation datasets, each file is stored on OpenAI’s servers and assigned a unique `id`. This `id` allows you to refer to the exact file without needing to re-upload it every time. For fine-tuning, you’ll need to pass these `id`s when creating the fine-tune job.

2. **Tracking and Managing Files**: The file ID lets you access and manage the file on OpenAI’s system (e.g., retrieving file metadata, checking the file’s status, or deleting it if no longer needed). This is particularly useful if you are managing multiple files or running multiple fine-tune jobs.

3. **Reducing Errors**: Using a unique `id` ensures that OpenAI’s API knows exactly which file to use for training and validation, avoiding issues like duplicate file names or incorrect file paths.



In [None]:
training_file_id = training_response.id
validation_file_id = validation_response.id

print(f"Training file ID: {training_file_id}")
print(f"Validation file ID: {validation_file_id}")

Training file ID: file-ZgArmwLEpjbEoyfHM85Hyy6H
Validation file ID: file-jLQN1r9LnjJ8MxRGqUUHYUEr


### Start Fine Tuning Job

This code initiates a **fine-tuning job** with OpenAI’s API, specifically using the `"gpt-4o-mini"` model. Here’s what each parameter and the function itself does:

```python
response = client.fine_tuning.jobs.create(
    model = "gpt-4o-mini",
    training_file = training_file_id,
    validation_file = validation_file_id,
    suffix="samantha-test"
)

```



1. **Creating the Fine-Tuning Job**:
   - `client.fine_tuning.jobs.create(...)` sends a request to OpenAI to start a fine-tuning job using the specified model and datasets.
   - It returns a response object (`response`) that typically includes information about the fine-tuning job, such as its ID, status, and details of the configuration.

2. **Parameters**:
   - `model = "gpt-4o-mini"`: Specifies the base model that will be fine-tuned.
   - `training_file = training_file_id`: Refers to the file ID of the uploaded training dataset. This tells OpenAI’s API which dataset to use for learning.
   - `validation_file = validation_file_id`: Specifies the file ID for the validation dataset, allowing the fine-tuning process to evaluate model performance on unseen data.
   - `suffix="samantha-test"`: This is an optional label that will be added to the name of your fine-tuned model, making it easier to identify. For instance, the final model name might look like `gpt-3.5-turbo-samantha-test`.

3. **Purpose**:
   - This function essentially launches the fine-tuning process, instructing OpenAI to train the `"gpt-40-mini"` model with the provided training and validation data. The model will adjust its parameters based on the training data, and it can evaluate performance on the validation data to guide training.

### Starting the Job
When you run this code, the fine-tuning job is initiated immediately by OpenAI’s API. Here’s what happens:

1. **Job Submission**: The code sends a request to OpenAI to start the fine-tuning job with the specified parameters.
2. **Job Queuing and Processing**: OpenAI places the job in a queue and processes it based on availability. If there’s no delay, it should begin shortly.
3. **Response**: The `response` object returned will contain details about the job, such as its `id`, `status`, and other metadata. Initially, the `status` may appear as `queued` or `running`, indicating the job is in progress.

You can track the status of this job using the job ID from `response` to see when it’s complete. The `response` object will contain details you can check or print to monitor the status and results of the fine-tuning process.

After the job starts, you can track its progress, retrieve the fine-tuned model, and use it for inference once training is complete.

In [None]:
import time
from datetime import timedelta

# Start the timer
start_time = time.time()

# MODEL = 'gpt-4o-mini' # 'Model gpt-4o-mini is not available for fine-tuning
MODEL = "gpt-3.5-turbo"

# Submit the fine-tuning job
response = client.fine_tuning.jobs.create(
    model=MODEL,
    training_file=training_file_id,
    validation_file=validation_file_id,
    suffix="samantha-test"
)

# Retrieve the job ID
job_id = response.id
print(f"Fine-tuning job ID: {job_id}")

# Monitor the job status
job_info = client.fine_tuning.jobs.retrieve(job_id)
status = job_info.status
while status not in ["succeeded", "failed"]:
    print(f"Current status: {status}")
    time.sleep(30)  # Wait 30 seconds before checking again
    job_info = client.fine_tuning.jobs.retrieve(job_id)
    status = job_info.status

# Calculate and display elapsed time
end_time = time.time()
elapsed_time = timedelta(seconds=(end_time - start_time))
print(f"Fine-tuning job completed with status '{status}'.")
print(f"Total time taken: {elapsed_time}")


Fine-tuning job ID: ftjob-1UYbyLikaDGTxuA9qSCsLP5V
Current status: validating_files
Fine-tuning job completed with status 'failed'.
Total time taken: 0:00:30.965630


In [None]:
from pprint import pprint

# Print the response in a more readable format
pprint(response)

FineTuningJob(id='ftjob-1UYbyLikaDGTxuA9qSCsLP5V', created_at=1730239010, error=Error(code=None, message=None, param=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-wxd8QGRULkWIJ0EoHrCaTzuK', result_files=[], seed=103491685, status='validating_files', trained_tokens=None, training_file='file-ZgArmwLEpjbEoyfHM85Hyy6H', validation_file='file-jLQN1r9LnjJ8MxRGqUUHYUEr', estimated_finish=None, integrations=[], user_provided_suffix='samantha-test')


### Job Details

The **fine-tuning jobs** here refer to individual instances of the fine-tuning process initiated with OpenAI’s API. When you submit a fine-tuning job, OpenAI creates a unique job for training the model with your specific dataset and configuration. Here’s a breakdown of what’s happening with this code:

1. **Job ID Creation**:
   - `job_id = response.id`: When you start a fine-tuning job, OpenAI returns a response containing information about the job, including a unique `job_id` (e.g., `ftjob-IGiXMyG5Pbeu152NDqLq1IuU`). This `job_id` uniquely identifies the fine-tuning job in OpenAI’s system.

2. **Retrieving Job Details**:
   - `client.fine_tuning.jobs.retrieve(job_id)`: This command retrieves the current status and details of the fine-tuning job using the `job_id`. It returns information about the job’s progress, such as:
     - `status`: Indicates whether the job is `queued`, `running`, `completed`, or `failed`.
     - `created_at` and `finished_at`: Timestamps showing when the job started and (if completed) finished.
     - `model`: Specifies the base model being fine-tuned (e.g., `gpt-3.5-turbo`).
     - `trained_tokens`: The number of tokens processed during training (useful for tracking progress).
     - `hyperparameters`: Shows the hyperparameters used for fine-tuning, such as `n_epochs` and `learning_rate_multiplier`.

### Purpose of Fine-Tuning Jobs
Fine-tuning jobs help you monitor and manage individual training sessions. By keeping track of `job_id`s, you can:
- Check the status of each job.
- Retrieve the final fine-tuned model when training is complete.
- Track resource usage, costs, and completion time.

In practice, each `job_id` acts as a reference point, so you can monitor the fine-tuning process, handle multiple jobs if needed, and manage each instance separately.


### 1. **Job Status**
   - **Look for**: The `status` field in the job response.
   - **Possible Values**: `queued`, `validating_files`, `running`, `succeeded`, `failed`.
   - **What to Watch For**: This helps you track progress and detect issues. For example:
     - `queued`: The job is waiting to start.
     - `validating_files`: OpenAI is checking that your training and validation files meet all requirements.
     - `running`: Fine-tuning is actively in progress.
     - `succeeded`: Fine-tuning completed successfully, and your model is ready for use.
     - `failed`: An error occurred; reviewing the error details can help troubleshoot the problem.

### 2. **Error Messages**
   - **Look for**: The `error` field in the job response.
   - **Purpose**: If the job status is `failed`, the `error` field will provide details on what went wrong, such as file format issues, token limit violations, or data quality concerns. Reviewing this can help you address the issue and resubmit the job.

### 3. **Timestamps**
   - **Look for**: `created_at` and `finished_at` fields.
   - **Purpose**: These timestamps help you measure how long the fine-tuning job took. This can be useful for planning, especially if you are running multiple jobs and managing time or costs.

### 4. **Model Details**
   - **Look for**: The `model` field.
   - **Purpose**: Confirms the base model being fine-tuned (e.g., `gpt-3.5-turbo`). It’s a quick way to double-check that the job used the intended model configuration.

### 5. **Token Count and Progress**
   - **Look for**: `trained_tokens` field, if available.
   - **Purpose**: Shows how many tokens have been processed so far. If you’re tracking token usage or costs, this helps monitor resource consumption.

### 6. **Hyperparameters**
   - **Look for**: `hyperparameters` field.
   - **Purpose**: Displays the hyperparameters, such as `n_epochs`, `batch_size`, and `learning_rate_multiplier`. These parameters influence how the model learns from your data, so it’s good to confirm they match your intended configuration.

### Example of Code to Print These Details

Here’s a code snippet to print and check these key details:

By checking these elements, you’ll be able to monitor the job’s progress, detect issues early, and verify that the fine-tuning job aligns with your intended setup.

In [None]:
job_id = response.id
# client.fine_tuning.jobs.list(limit=5)
job_info = client.fine_tuning.jobs.retrieve(job_id)

print("Fine-Tuning Job Information:")
print(f"Job ID: {job_info.id}")
print(f"Status: {job_info.status}")
print(f"Created At: {job_info.created_at}")
print(f"Finished At: {job_info.finished_at}")
print(f"Model: {job_info.model}")
print(f"Training Tokens Processed: {job_info.trained_tokens}")
print("Hyperparameters:")
print(f"  - Epochs: {job_info.hyperparameters.n_epochs}")
print(f"  - Batch Size: {job_info.hyperparameters.batch_size}")
print(f"  - Learning Rate Multiplier: {job_info.hyperparameters.learning_rate_multiplier}")

if job_info.error:
    print("Error Details:")
    print(f"  - Code: {job_info.error.code}")
    print(f"  - Message: {job_info.error.message}")
    print(f"  - Param: {job_info.error.param}")
else:
    print("No errors encountered.")

Fine-Tuning Job Information:
Job ID: ftjob-1UYbyLikaDGTxuA9qSCsLP5V
Status: failed
Created At: 1730239010
Finished At: None
Model: gpt-3.5-turbo-0125
Training Tokens Processed: None
Hyperparameters:
  - Epochs: auto
  - Batch Size: auto
  - Learning Rate Multiplier: auto
Error Details:
  - Code: invalid_training_file
  - Message: The job failed due to an invalid training file. Invalid file format. Line 1, key "messages": The last message must be from the assistant
  - Param: training_file


In [None]:
job_response = client.fine_tuning.jobs.list_events(fine_tuning_job_id=job_id)

events = job_response.data
events

[FineTuningJobEvent(id='ftevent-Y4fJzMVV7ga5MFRADibnpAjT', created_at=1730239011, level='error', message='The job failed due to an invalid training file. Invalid file format. Line 1, key "messages": The last message must be from the assistant', object='fine_tuning.job.event', data={'error_code': 'invalid_training_file', 'error_param': 'training_file'}, type='message'),
 FineTuningJobEvent(id='ftevent-GFekP93IDNpBHgmobDyKBbwE', created_at=1730239010, level='info', message='Validating training file: file-ZgArmwLEpjbEoyfHM85Hyy6H and validation file: file-jLQN1r9LnjJ8MxRGqUUHYUEr', object='fine_tuning.job.event', data={}, type='message'),
 FineTuningJobEvent(id='ftevent-Cw5SpU5YgQyD7UWNDOc1ZTA2', created_at=1730239010, level='info', message='Created fine-tuning job: ftjob-1UYbyLikaDGTxuA9qSCsLP5V', object='fine_tuning.job.event', data={}, type='message')]