# prepare_training_data
**Author:** Khoi Nguyen

**Date created:** 03/06/2023

**Last modified:** 04/15/2023

**Description:** This notebook creates the prompt-completion pairs used to train the `Ada` models. It takes each sentence in the `train` set, prepends the "Elaborate on the following sentence: " instruction, and submits the resulting prompt to `Curie` (`text-curie-001`) to generate a synthetic dataset.

Creating `1k_ada` training data takes approximately 10 minutes and costs approximately $0.40.

Creating `10k_ada` training data is forecasted to take 100 minutes and cost approximately $4.00.

Creating `100k_ada` training data is forecasted to take 1000 minutes and cost approximately $40.00.

**WARNING:** This notebook requires API calls and will cost money. Please be careful when running this step.
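
The forecasts above match a linear scaling of the measured `1k_ada` figures. A minimal sketch of that arithmetic follows, assuming that time and cost grow in proportion to the number of sentences and that "1k" means 1,000 sentences. The helper name `estimate` is illustrative and is not part of the notebook.

```python
# Measured baseline from the 1k_ada run (figures quoted above).
BASE_SENTENCES = 1_000   # assumption: "1k" means 1,000 sentences
BASE_MINUTES = 10
BASE_COST_USD = 0.40

def estimate(n_sentences):
    """Linearly extrapolate run time (minutes) and cost (USD) from the 1k baseline."""
    scale = n_sentences / BASE_SENTENCES
    return BASE_MINUTES * scale, BASE_COST_USD * scale

for name, n in [("1k_ada", 1_000), ("10k_ada", 10_000), ("100k_ada", 100_000)]:
    minutes, cost = estimate(n)
    print(f"{name}: ~{minutes:.0f} minutes, ~${cost:.2f}")
```

Actual cost depends on the number of tokens in each prompt and completion, so these numbers are rough guides rather than guarantees.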

In [None]:
import json
import openai
import os
import tqdm

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
openai.api_key = OPENAI_API_KEY
instruction = "Elaborate on the following sentence: "

### 1k_ada
Synthesizing `1k_ada` training data.

In [None]:
confirmation = input('Are you sure you want to run 1k_ada synthetic training creation? Type YES or NO: ')

if confirmation == 'YES':
    with open('data/1k_ada/train.json') as f:
        train_data = json.load(f)

    with open('data/1k_ada/openai_training.jsonl', 'w') as outfile:
        for sentence in tqdm.tqdm(train_data):
            prompt = "{instruction}\"{sentence}\"\n\n".format(sentence=sentence["sentence"], instruction=instruction)
            response = openai.Completion.create(
                model="text-curie-001",
                prompt=prompt,
                temperature=0.7,
                max_tokens=256,
                top_p=1,
                frequency_penalty=0,
                presence_penalty=0
            )
            outfile.write(json.dumps({
                'prompt': prompt,
                'completion': response.choices[0].text
            }) + '\n')
            outfile.flush()
else:
    print('Synthetic training creation aborted.')
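
Because the cell above spends money on every call, it can help to check the prompt and record format first without touching the API. The sketch below builds the same prompt string and JSONL record shape as the loop above, but replaces the API call with a hypothetical stand-in, `fake_completion`. The sample sentences are made up for illustration.

```python
import json

instruction = "Elaborate on the following sentence: "  # same instruction as above

def build_prompt(sentence):
    """Format one sentence the same way the generation loop does."""
    return "{instruction}\"{sentence}\"\n\n".format(sentence=sentence, instruction=instruction)

def fake_completion(prompt):
    """Hypothetical stand-in for the API call; returns placeholder text only."""
    return "\n\n<placeholder completion>"

# Made-up rows using the same "sentence" key that the loop reads from train.json.
sample_rows = [{"sentence": "The cat sat on the mat."}, {"sentence": "It rained."}]

records = []
for row in sample_rows:
    prompt = build_prompt(row["sentence"])
    records.append({"prompt": prompt, "completion": fake_completion(prompt)})

# Each record becomes one line of the JSONL file.
for record in records:
    print(json.dumps(record))
```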

In [None]:
# Load in openai_training.jsonl and print the first 10 lines
with open('data/1k_ada/openai_training.jsonl') as f:
    for i in range(10):
        print(f.readline())

In [None]:
# Formatting rules for fine-tuning data:
# 1. Each prompt should end with a fixed separator that tells the model where the prompt ends and the
#    completion begins. A simple separator that generally works well is \n\n###\n\n. The separator
#    should not appear anywhere else in a prompt.
# 2. Each completion should start with a whitespace, because the tokenizer encodes most words with a
#    preceding whitespace.
# 3. Each completion should end with a fixed stop sequence that tells the model where the completion
#    ends. A stop sequence could be \n, ###, or any other token that does not appear in any completion.
# 4. At inference time, format prompts exactly as in the training dataset, including the same
#    separator, and specify the same stop sequence so the completion is truncated properly.
#
# This cell uses \n\n##\n\n as the separator and \n as the stop sequence.

with open('data/1k_ada/openai_training.jsonl', 'r') as f:
    with open('data/1k_ada/openai_cleaned.jsonl', 'w') as out:
        for line in f:
            data = json.loads(line)
            prompt = data['prompt'].strip() + '\n\n##\n\n'
            completion = ' ' + data['completion'].strip() + '\n'
            out.write(json.dumps({'prompt': prompt, 'completion': completion}))
            out.write('\n')
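
The inference rule in the comments above can be sketched as follows. This is a minimal illustration, assuming the separator and stop sequence written by the cleaning cell. The helper names `format_inference_prompt` and `truncate_completion` are hypothetical. In practice the stop sequence would normally be passed through the API's `stop` parameter, and `truncate_completion` only mirrors that behavior locally.

```python
SEPARATOR = "\n\n##\n\n"   # separator appended to every prompt in the cleaning cell
STOP_SEQUENCE = "\n"       # every cleaned completion ends with a newline

def format_inference_prompt(raw_prompt):
    """Apply the same strip-and-append-separator step used for the training prompts."""
    return raw_prompt.strip() + SEPARATOR

def truncate_completion(text):
    """Cut generated text at the first stop sequence, mirroring a `stop` parameter."""
    return text.strip().split(STOP_SEQUENCE, 1)[0]

prompt = format_inference_prompt('Elaborate on the following sentence: "It rained."\n\n')
print(repr(prompt))
print(repr(truncate_completion(" It rained heavily all day.\nExtra text")))
```

If a later step changes the prompts again (for example, by removing the instruction), inference prompts would need to match that final format instead.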
            

In [None]:
# Load in openai_cleaned.jsonl and print the first 5 lines with prettier formatting
with open('data/1k_ada/openai_cleaned.jsonl') as f:
    for i in range(5):
        print(json.dumps(json.loads(f.readline()), indent=4))

In [4]:
# Run the following command in a terminal to prepare the data for training with the OpenAI CLI data preparation tool.
# In addition to the cleaning steps above, the tool suggests removing the instruction prefix shared by every prompt.
# ! openai tools fine_tunes.prepare_data -f data/1k_ada/openai_cleaned.jsonl
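
Before handing the file to the CLI tool, the formatting rules can also be checked locally. The sketch below is an illustrative validator, not part of the OpenAI tooling. It assumes the separator and stop sequence used in the cleaning cell, and the names `check_record` and `check_file` are hypothetical.

```python
import json

SEPARATOR = "\n\n##\n\n"  # separator used in the cleaning cell
STOP_SEQUENCE = "\n"      # stop sequence used in the cleaning cell

def check_record(record):
    """Return a list of formatting problems found in one prompt-completion pair."""
    problems = []
    prompt, completion = record["prompt"], record["completion"]
    if not prompt.endswith(SEPARATOR):
        problems.append("prompt does not end with the separator")
    # Look for the separator anywhere other than the very end of the prompt.
    body = prompt[:-len(SEPARATOR)] if prompt.endswith(SEPARATOR) else prompt
    if SEPARATOR in body:
        problems.append("separator appears inside the prompt")
    if not completion.startswith(" "):
        problems.append("completion does not start with a space")
    if not completion.endswith(STOP_SEQUENCE):
        problems.append("completion does not end with the stop sequence")
    return problems

def check_file(path):
    """Yield (line_number, problems) for every malformed line in a JSONL file."""
    with open(path) as f:
        for line_number, line in enumerate(f, start=1):
            problems = check_record(json.loads(line))
            if problems:
                yield line_number, problems

good = {"prompt": 'Elaborate on the following sentence: "It rained."' + SEPARATOR,
        "completion": " It rained heavily all day.\n"}
bad = {"prompt": "No separator here", "completion": "No leading space"}
print(check_record(good))
print(check_record(bad))
```

Calling `list(check_file('data/1k_ada/openai_cleaned.jsonl'))` would then list any lines that break the rules.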

### 10k_ada
Synthesizing `10k_ada` training data.

### 100k_ada
Synthesizing `100k_ada` training data.