# prepare_instruct_ada
**Author:** Khoi Nguyen

**Date created:** 03/30/2023

**Last modified:** 05/12/2023

**Description:** This notebook prepares data to fine-tune the Ada model to follow instructions using data from the Stanford Alpaca repo.

**WARNING:** The fine-tuning step at the end of this notebook makes OpenAI API calls, which cost money. Please be careful when running it.

*Note: the fine-tuned model was originally created under the misspelled name 52k_instuct_ada instead of 52k_instruct_ada. The name is corrected in this file, but the rest of the project still refers to the model by the misspelled name.*

In [3]:
import os
import json

In [5]:
with open(os.path.join('data', 'alpaca_data.json'), 'r') as f:
    data = json.load(f)

# print the first 10 objects with pretty formatting
print(json.dumps(data[:10], indent=4))

[
    {
        "instruction": "Give three tips for staying healthy.",
        "input": "",
        "output": "1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."
    },
    {
        "instruction": "What are the three primary colors?",
        "input": "",
        "output": "The three primary colors are red, blue, and yellow."
    },
    {
        "instruction": "Describe the structure of an atom.",
        "input": "",
        "output": "An atom is made up of a nucleus, which contains protons and neutrons, surrounded by electrons that travel in orbits around the nucleus. The protons and neutrons have a positive charge, while the electrons have a negative charge, resulting in an overall neutral atom. The number of each particle determines the atomic number and the type of atom."
    },
    {
        "instruction": "How can we reduce 

In [6]:
# Convert the data into a JSONL file named alpaca_data_cleaned.jsonl (one JSON object per line).
# Each object has two keys: 'prompt' and 'completion'.
# The prompt is the instruction followed by the separator \n\n##\n\n. If the input is non-empty,
# the prompt is the instruction, a newline character, the input, and then the separator.
# The completion is the output with a leading space and a trailing newline character.

with open(os.path.join('data', 'alpaca_data_cleaned.jsonl'), 'w') as f:
    for d in data:
        if d['input'] == '':
            prompt = d['instruction'] + '\n\n##\n\n'
        else:
            prompt = d['instruction'] + '\n' + d['input'] + '\n\n##\n\n'
        completion = ' ' + d['output'] + '\n'
        json.dump({'prompt': prompt, 'completion': completion}, f)
        f.write('\n')

# print the first 10 lines of the JSONL file (each line is a raw JSON string)
with open(os.path.join('data', 'alpaca_data_cleaned.jsonl'), 'r') as f:
    lines = f.readlines()

print(json.dumps(lines[:10], indent=4))

[
    "{\"prompt\": \"Give three tips for staying healthy.\\n\\n##\\n\\n\", \"completion\": \" 1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \\n2. Exercise regularly to keep your body active and strong. \\n3. Get enough sleep and maintain a consistent sleep schedule.\\n\"}\n",
    "{\"prompt\": \"What are the three primary colors?\\n\\n##\\n\\n\", \"completion\": \" The three primary colors are red, blue, and yellow.\\n\"}\n",
    "{\"prompt\": \"Describe the structure of an atom.\\n\\n##\\n\\n\", \"completion\": \" An atom is made up of a nucleus, which contains protons and neutrons, surrounded by electrons that travel in orbits around the nucleus. The protons and neutrons have a positive charge, while the electrons have a negative charge, resulting in an overall neutral atom. The number of each particle determines the atomic number and the type of atom.\\n\"}\n",
    "{\"prompt\": \"How can we reduce air pollution?\\n\\n##\\n\\n\", \"completion\": \"

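The conversion rule in the cell above can be pulled out into a small function, which makes it easier to test on a few records before writing the full file. The following is a minimal sketch, not the notebook's own code: the names `SEPARATOR`, `to_prompt_completion`, and `examples` are hypothetical, and the two toy records are made up for illustration. It also parses each JSONL line back with `json.loads`, so the printed result shows real newlines in the structure rather than doubly escaped strings.

```python
import json

SEPARATOR = "\n\n##\n\n"  # prompt terminator used in the cell above


def to_prompt_completion(record):
    """Map one Alpaca-style record to a {'prompt', 'completion'} dict."""
    prompt = record["instruction"]
    if record["input"]:
        # A non-empty input goes on its own line after the instruction.
        prompt += "\n" + record["input"]
    return {
        "prompt": prompt + SEPARATOR,
        # Leading space and trailing newline, as in the cell above.
        "completion": " " + record["output"] + "\n",
    }


# Toy records for illustration only; they are not taken from the dataset.
examples = [
    {"instruction": "Say hello.", "input": "", "output": "Hello!"},
    {"instruction": "Translate to French.", "input": "cat", "output": "chat"},
]

# Serialize to JSONL text, then parse each line back into a dict.
jsonl_text = "".join(json.dumps(to_prompt_completion(r)) + "\n" for r in examples)
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
print(json.dumps(parsed, indent=4))
```

The truthiness check `if record["input"]` behaves the same as the `== ''` comparison above as long as `input` is always a string.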
In [7]:
# Run the following command in the terminal to use the OpenAI CLI Data Preparation Tool to prepare the data for training.
# In addition to the cleaning processes above, the tool will remove the prompt instruction
# ! openai tools fine_tunes.prepare_data -f data/alpaca_data_cleaned.jsonl

# Create the fine-tuned model using the OpenAI CLI
# openai api fine_tunes.create -t data/alpaca_data_cleaned.jsonl -m ada --suffix "52k_instruct_ada"
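
Since the fine-tuning call costs money, it may be worth checking the JSONL file locally first. The sketch below is one possible check, under the assumption that every record should follow the format written earlier in this notebook (prompt ending in the separator, completion starting with a space and ending with a newline). The function name `find_format_problems` and the sample lines are hypothetical, and this check does not replace the OpenAI CLI data preparation tool.

```python
import json

SEPARATOR = "\n\n##\n\n"  # assumed to match the separator used when writing the file


def find_format_problems(lines):
    """Return (line_number, message) pairs for records that break the format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict) or set(record) != {"prompt", "completion"}:
            problems.append((number, "expected exactly the keys 'prompt' and 'completion'"))
            continue
        if not record["prompt"].endswith(SEPARATOR):
            problems.append((number, "prompt does not end with the separator"))
        if not record["completion"].startswith(" "):
            problems.append((number, "completion does not start with a space"))
        if not record["completion"].endswith("\n"):
            problems.append((number, "completion does not end with a newline"))
    return problems


# Toy lines for illustration only; the second one is deliberately malformed.
sample_lines = [
    json.dumps({"prompt": "Say hello." + SEPARATOR, "completion": " Hello!\n"}),
    json.dumps({"prompt": "Say hello.", "completion": "Hello!"}),
]
for number, message in find_format_problems(sample_lines):
    print(f"line {number}: {message}")
```

To check the real file, pass the lines read from data/alpaca_data_cleaned.jsonl to the function instead of `sample_lines`; an empty result means every record matched the expected format.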