# AI/LLM for Devs, Week 3 Experiment notebook

This notebook captures a series of experiments in creating high-quality, effective training sets for fine-tuning LLMs on new knowledge.




## Experiment #1 - Training a series of facts

The purpose of this experiment is to explore teaching GPT-3.5 Turbo a series of facts via fine-tuning, and rate its performance and level of hallucination.

### Step 1: Choose 10 facts

Choose a topic that ChatGPT doesn't already know, such as your personal history, your company, or some niche topic.

1. Praneeth Yerrapragada is studying spoken Sanskrit language from Samskrita Bharati USA.
2. Praneeth Yerrapragada hiked to the top of Half Dome in Yosemite in 2017.
3. Praneeth Yerrapragada graduated from the University of Southern California majoring in Electrical / Computer Engineering in 2014.
4. Praneeth Yerrapragada has a wife and two children.
5. Praneeth Yerrapragada has lived in Los Angeles, California.
6. Praneeth Yerrapragada travelled to Antarctica in 2017.
7. Praneeth Yerrapragada was born in Hyderabad, India.
8. Praneeth Yerrapragada practices Yoga and Meditation regularly.
9. Praneeth Yerrapragada's parents live in Hyderabad, India.
10. Praneeth Yerrapragada loves to spend time with his family.

### Step 2: Design your evaluation questions

Enumerate 10 questions below. Start with straightforward questions that are close to the core facts, then expand to contextually related questions, then out-of-scope questions. These questions will be your performance benchmark.

1. Where did Praneeth Yerrapragada graduate from?
2. What was Praneeth Yerrapragada's major?
3. What does Praneeth Yerrapragada love to do?
4. How many children does Praneeth Yerrapragada have?
5. What was Praneeth Yerrapragada's favorite hobby?
6. Does Praneeth Yerrapragada like to travel?
7. Where is Praneeth Yerrapragada from?
8. Is Praneeth Yerrapragada married?
9. Is Praneeth Yerrapragada a happy individual?
10. What does Praneeth Yerrapragada like to eat?

### Step 3: Generate your initial training set

Generate a naive training set, using something like the prompt below. Start by generating 3 question/response pairs for each fact.

```
Based on the following fact, generate an array of {n} variations of question-answer pairs.
Each pair should be formatted as a JSON object with "messages" containing "user" and "assistant" roles.
Ensure that the output is in JSON format.

Each question should be unique, clearly phrased, and reflect how users might ask about this fact.
The corresponding answer should be accurate, contextually relevant, and phrased differently from the other answers.
Ensure diversity in question types (who, what, where, when, why) and avoid repetitive phrasing.

Fact: "{fact}"

Example output format:

{{"data": [{{"messages": [{{"role": "user", "content": "What is the capital of France?"}}, {{"role": "assistant", "content": "The capital of France is Paris."}}]}},
{{"messages": [{{"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}}, {{"role": "assistant", "content": "The author of 'Romeo and Juliet' is William Shakespeare."}}]}},
{{"messages": [{{"role": "user", "content": "How far is the Moon from Earth?"}}, {{"role": "assistant", "content": "The distance from the Moon to Earth is approximately 384,400 kilometers."}}]}}]
}}
```

See the code cell at the end of this experiment for a snippet that passes this prompt to the OpenAI API. Note: add a $5 credit to your OpenAI platform account: https://platform.openai.com/settings/organization/billing/overview

### Step 4: Create a GPT-3.5 Turbo fine-tuned model

Use the web interface to add your training and validation data: https://platform.openai.com/finetune

Initially, use the default hyperparameters.
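
Before uploading the files, it can help to sanity-check that each JSONL line follows the chat format requested in the generation prompt. The following is a minimal sketch, not an official validator: the function name `find_format_problems` is hypothetical, and the rules it checks (a non-empty `messages` list, known roles, string content, a final assistant message) are assumptions about what a well-formed example looks like in this notebook.

```python
import json

# Assumption: only these roles appear in this notebook's training data.
ALLOWED_ROLES = {"system", "user", "assistant"}


def find_format_problems(lines):
    """Return (line_number, problem) tuples for lines that break the chat format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            example = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        messages = example.get("messages") if isinstance(example, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue
        for message in messages:
            if not isinstance(message, dict) or message.get("role") not in ALLOWED_ROLES:
                problems.append((number, "message with unknown role"))
                break
            if not isinstance(message.get("content"), str):
                problems.append((number, "message content is not a string"))
                break
        else:
            # Only reached when every message passed the checks above.
            if messages[-1]["role"] != "assistant":
                problems.append((number, "last message is not from the assistant"))
    return problems


sample = [
    '{"messages": [{"role": "user", "content": "Where was Praneeth born?"}, {"role": "assistant", "content": "Hyderabad, India."}]}',
    '{"messages": [{"role": "user", "content": "Is Praneeth married?"}]}',
    'not json',
]
print(find_format_problems(sample))
```

To check a real file, pass an open file handle instead of the sample list, for example `find_format_problems(open('data/facts_training.jsonl'))`.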

### Step 5: Evaluation

Ask the evaluation questions you created in Step 2, and note accuracy, hallucination, and overfitting.
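
One way to keep the evaluation repeatable is to run all questions through the same function and collect the answers in a table for manual scoring. The sketch below assumes hypothetical helpers `run_evaluation` and `rows_to_csv`, and uses a stand-in `fake_ask` function so it runs without an API key; the scoring columns are an assumption about what you might want to record, not a prescribed rubric.

```python
import csv
import io


def run_evaluation(questions, ask):
    """Ask every question and return rows ready for manual scoring.

    `ask` is any callable that maps a question string to an answer string.
    """
    rows = []
    for number, question in enumerate(questions, start=1):
        rows.append({
            "number": number,
            "question": question,
            "answer": ask(question),
            # The remaining columns are filled in by hand after reading the answer.
            "accurate": "",
            "hallucinated": "",
            "notes": "",
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text so they can be scored in a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def fake_ask(question):
    """Stand-in for the fine-tuned model (hypothetical, for illustration only)."""
    return "I don't know."


questions = [
    "Where did Praneeth Yerrapragada graduate from?",
    "Is Praneeth Yerrapragada married?",
]
rows = run_evaluation(questions, fake_ask)
print(rows_to_csv(rows))
```

To score a real model, `ask` could wrap a call to `client.chat.completions.create` using the fine-tuned model's name (shown on the fine-tuning page once the job finishes) and return `response.choices[0].message.content`. Running the same questions against the base model gives a useful comparison.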

### Step 6: Improving quality

For each issue you discovered in Step 5, create additional training data to address the issue.

- Create new prompts to expand your training data
- Manually review and fix or delete bad training data

Improvements will mostly be made through improving the training data, but also run the following variations:

- Cut your training data in half and compare results. The difference is a rough predictor of the quality gain you could expect from doubling your training data.
- If the model isn't learning your data, try increasing the number of epochs by 1 or 2
- If the model is overfitting, try reducing the number of epochs by 1 or 2
- Try halving and doubling the number of epochs to explore those effects
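
The variations above can be prepared with a couple of small helpers. This is a sketch under stated assumptions: `halve_training_set` and `epoch_variations` are hypothetical names, the fixed random seed is there only to make the comparison repeatable, and the baseline of 3 epochs is a placeholder (read the actual epoch count from your completed fine-tuning job).

```python
import random


def halve_training_set(examples, seed=0):
    """Return a random half of the examples, preserving their original order."""
    rng = random.Random(seed)  # fixed seed so the same half is chosen on every run
    keep = sorted(rng.sample(range(len(examples)), len(examples) // 2))
    return [examples[i] for i in keep]


def epoch_variations(baseline):
    """List the epoch counts worth trying around a baseline: +/-1, +/-2, half, double."""
    candidates = {
        baseline - 2, baseline - 1, baseline + 1, baseline + 2,
        baseline // 2, baseline * 2,
    }
    return sorted(n for n in candidates if n >= 1 and n != baseline)


examples = [{"id": i} for i in range(10)]  # placeholder for loaded training examples
half = halve_training_set(examples)
print(len(half))
print(epoch_variations(3))  # assumption: the baseline job ran for 3 epochs
```

Change one variable at a time (data size or epochs) so that any difference on the evaluation questions can be attributed to that change.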

### Conclusion

Were you able to train on the additional facts while minimizing hallucination? (You'll never fully eliminate hallucination.)

```python
import json
from openai import OpenAI

# Load the facts: one JSON object per line, each with a "fact" key
facts = []
with open('data/facts.jsonl', 'r') as file:
    for line in file:
        if line.strip():  # skip blank lines
            facts.append(json.loads(line))

client = OpenAI(api_key='YOUR_API_KEY_HERE')

def generate_qa(fact, n=20):
    """Generate n question-answer pairs for one fact, split into training and validation."""
    prompt_text = f"""
    Based on the following fact, generate an array of {n} variations of question-answer pairs.
    Each pair should be formatted as a JSON object with "messages" containing "user" and "assistant" roles.
    Ensure that the output is in JSON format.

    Each question should be unique, clearly phrased, and reflect how users might ask about this fact.
    The corresponding answer should be accurate, contextually relevant, and phrased differently from the other answers.
    Ensure diversity in question types (who, what, where, when, why) and avoid repetitive phrasing.

    Fact: "{fact}"

    Example output format:

    {{"data": [{{"messages": [{{"role": "user", "content": "What is the capital of France?"}}, {{"role": "assistant", "content": "The capital of France is Paris."}}]}},
    {{"messages": [{{"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}}, {{"role": "assistant", "content": "The author of 'Romeo and Juliet' is William Shakespeare."}}]}},
    {{"messages": [{{"role": "user", "content": "How far is the Moon from Earth?"}}, {{"role": "assistant", "content": "The distance from the Moon to Earth is approximately 384,400 kilometers."}}]}}]
    }}
    """

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant tasked with generating training data for fine-tuning a gpt-3.5-turbo model in JSON format"},
            {"role": "user", "content": prompt_text}
        ],
        response_format={"type": "json_object"},
        temperature=0.5
    )

    content = response.choices[0].message.content.strip()
    print(content)
    try:
        qa_array = json.loads(content)["data"]
    except (json.JSONDecodeError, KeyError, TypeError) as e:
        # Also covers valid JSON that lacks the expected "data" key
        print("Failed to parse response:", e)
        return [], []

    # Split the generated QA pairs: the first 20% for validation, the rest for training
    validation_size = int(len(qa_array) * 0.2)
    validation_set = qa_array[:validation_size]
    training_set = qa_array[validation_size:]

    return training_set, validation_set


training_set = []
validation_set = []

for fact in facts:
    training, validation = generate_qa(fact['fact'])
    training_set.extend(training)
    validation_set.extend(validation)

# Write one JSON object per line (JSONL), as expected for fine-tuning uploads
with open('data/facts_training.jsonl', 'w') as train_outfile:
    for qa in training_set:
        train_outfile.write(json.dumps(qa) + '\n')

with open('data/facts_validation.jsonl', 'w') as valid_outfile:
    for qa in validation_set:
        valid_outfile.write(json.dumps(qa) + '\n')
```

## Experiment #2 - Group project training data

Assemble initial training data for your group project. For example, identify potential use cases, such as:

- Additional app or domain-specific knowledge
- Behavioral modification
- Tool usage

If your task is behavioral modification or tool usage, do the following steps:

1. Design a detailed prompt describing the desired behavior, or usage of the tool
2. Use the prompt in real situations on GPT-4o
3. Create training data by:
   - Loading the original prompt as a system message, along with a brief context of the conversation
   - Adding snippets of the recorded conversation
4. Start with 5 "real" interactions, and evaluate the model performance
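
Step 3 can be sketched as a small helper that turns one recorded conversation into one training example. The function name `build_training_example`, the way the context is appended to the system message, and the sample prompt and turns are all hypothetical choices made for illustration.

```python
import json


def build_training_example(system_prompt, context, turns):
    """Combine a system prompt, brief context, and recorded turns into one example.

    `turns` is a list of (role, content) tuples taken from a real conversation.
    """
    system_content = system_prompt
    if context:
        # Assumption: the context is appended to the system message as plain text.
        system_content += "\n\nContext: " + context
    messages = [{"role": "system", "content": system_content}]
    for role, content in turns:
        if role not in ("user", "assistant"):
            raise ValueError(f"unexpected role: {role!r}")
        messages.append({"role": role, "content": content})
    return {"messages": messages}


example = build_training_example(
    system_prompt="You are a concise code reviewer.",  # hypothetical behavior prompt
    context="The user is reviewing a Python pull request.",  # hypothetical context
    turns=[
        ("user", "Can you look at this function?"),
        ("assistant", "Sure. Please paste the function."),
    ],
)
print(json.dumps(example))  # one line of a JSONL training file
```

Each recorded interaction becomes one line in the training file, so the 5 initial interactions yield 5 training examples.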

## Experiment #3 (optional) - Training a 10-page document

Choose a 10-page document (ideally a text or markdown file for easier initial parsing).

1. Start by generating question/answer pairs for semantic (or non-semantic) chunks of the document
2. Use a prompt to create a detailed summary of the document, and generate question/answer pairs based on the summary
3. Use a prompt to describe what each section does and does not cover, to help mitigate hallucination
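
For step 1, a simple non-semantic starting point is to group paragraphs into chunks of bounded size and generate question/answer pairs per chunk. The following is a minimal sketch: `chunk_paragraphs` is a hypothetical helper, splitting on blank lines assumes a plain-text or Markdown document, and the `max_chars` limit is an arbitrary value to tune.

```python
def chunk_paragraphs(text, max_chars=1500):
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for paragraph in paragraphs:
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit: close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


document = "Intro paragraph.\n\nSecond paragraph with details.\n\nThird paragraph."
chunks = chunk_paragraphs(document, max_chars=40)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk can then be passed to a generation prompt in place of a single fact. Splitting on headings instead of character counts would be closer to semantic chunking.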

Create a set of evaluation questions (separate from the validation set). Evaluate the performance between each of the stages.

## Submission

Submit your experiment notebook [in the form here](https://forms.gle/DKeRAuYkvDQGjs9P9).

## Appendix

The code below is an expanded version of the trivia code. It injects a system message into each piece of training data, and also generates boundary pairs to help define the scope of what the model knows.

```python
# This notebook generates training data for fine-tuning gpt-3.5-turbo on an array of facts
import json
from openai import OpenAI

# Load the facts: one JSON object per line, each with a "fact" key
facts = []
with open('data/facts.jsonl', 'r') as file:
    for line in file:
        if line.strip():  # skip blank lines
            facts.append(json.loads(line))

client = OpenAI(api_key='YOUR API KEY')

# System message injected into every training and validation example
SYSTEM_MESSAGE = {"role": "system", "content": "You are an internal knowledge chat bot for CodePath, an education company"}

def generate_qa(fact, n=10):
    """Generate n question-answer pairs for a single fact."""
    prompt_text = f"""
    Based on the following fact, generate an array of {n} variations of question-answer pairs.
    Each pair should be formatted as a JSON object with "messages" containing "user" and "assistant" roles.
    Ensure that the output is in JSON format.

    Each question should be unique, clearly phrased, and reflect how users might ask about this fact.
    The corresponding answer should be accurate, contextually relevant, and phrased differently from the other answers.
    Ensure diversity in question types (who, what, where, when, why) and avoid repetitive phrasing.

    Fact: "{fact}"

    Example output format:

    {{"data": [{{"messages": [{{"role": "user", "content": "What is the capital of France?"}}, {{"role": "assistant", "content": "The capital of France is Paris."}}]}},
    {{"messages": [{{"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}}, {{"role": "assistant", "content": "The author of 'Romeo and Juliet' is William Shakespeare."}}]}},
    {{"messages": [{{"role": "user", "content": "How far is the Moon from Earth?"}}, {{"role": "assistant", "content": "The distance from the Moon to Earth is approximately 384,400 kilometers."}}]}}]
    }}
    """

    return generate_pairs(prompt_text, n)

def generate_boundaries(facts, n=40):
    """Generate n mostly negative pairs that mark the limits of the known facts."""
    prompt_text = f"""
    Based on the following facts, generate an array of {n} variations of question-answer pairs.
    Each pair should be formatted as a JSON object with "messages" containing "user" and "assistant" roles.
    Ensure that the output is in JSON format.

    The question-answer pairs should establish boundaries of what the assistant knows beyond the facts below.
    Pairs should use mostly negative examples to establish the boundaries of the facts. For example,
    pairs should include negative examples of detailed followup questions beyond the scope of the facts.

    For each fact, imagine reasonable followup questions that might be asked by a user, and decline to answer. Add
    your rationale in the "rationale" key.

    Facts: {facts}

    Example output format:

    {{"data": [{{"messages": [{{"role": "user", "content": "What is the capital of France?"}}, {{"role": "assistant", "content": "The capital of France is Paris."}}]}},
    {{"messages": [{{"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}}, {{"role": "assistant", "content": "The author of 'Romeo and Juliet' is William Shakespeare."}}]}},
    {{"messages": [{{"role": "user", "content": "How far is the Moon from Earth?"}}, {{"role": "assistant", "content": "The distance from the Moon to Earth is approximately 384,400 kilometers."}}]}}]
    }}
    """

    return generate_pairs(prompt_text, n)

def generate_pairs(prompt_text, n=10):
    """Send the prompt to the model and split the returned pairs into training and validation."""
    # n is already embedded in prompt_text; it is kept here to match the callers above
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant tasked with generating training data for fine-tuning a gpt-3.5-turbo model in JSON format"},
            {"role": "user", "content": prompt_text}
        ],
        response_format={"type": "json_object"},
        temperature=0.5
    )

    content = response.choices[0].message.content.strip()
    print(content)
    try:
        # Keep only "messages"; extra keys such as "rationale" are dropped
        qa_array = [{"messages": item["messages"]} for item in json.loads(content)["data"]]
    except (json.JSONDecodeError, KeyError, TypeError) as e:
        # Also covers valid JSON that lacks the expected "data" or "messages" keys
        print("Failed to parse response:", e)
        return [], []

    # Split the generated QA pairs: the first 20% for validation, the rest for training
    validation_size = int(len(qa_array) * 0.2)
    validation_set = qa_array[:validation_size]
    training_set = qa_array[validation_size:]

    return training_set, validation_set


training_set = []
validation_set = []

for fact in facts:
    training, validation = generate_qa(fact['fact'])
    training_set.extend(training)
    validation_set.extend(validation)

# Generate boundary pairs from all facts at once
facts_string = "\n".join([fact['fact'] for fact in facts])
training, validation = generate_boundaries(facts_string)
training_set.extend(training)
validation_set.extend(validation)

# Inject a system message as the first message in the training and validation sets
for qa in training_set + validation_set:
    qa['messages'].insert(0, dict(SYSTEM_MESSAGE))

with open('data/facts_training_2.jsonl', 'w') as train_outfile:
    for qa in training_set:
        train_outfile.write(json.dumps(qa) + '\n')

with open('data/facts_validation_2.jsonl', 'w') as valid_outfile:
    for qa in validation_set:
        valid_outfile.write(json.dumps(qa) + '\n')
```
