This notebook creates the .jsonl files needed to fine-tune GPT-3 DaVinci.

Each line of these .jsonl files is a standalone JSON object with a `prompt` and a `completion` field:
```
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
```
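As a minimal sketch of how this format can be read back and sanity-checked, the snippet below parses JSONL text line by line and confirms that both keys are present. The helper name `parse_jsonl` and the two sample records are illustrative assumptions, not part of the dataset.

```python
import json

def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, checking for the two expected keys."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = {"prompt", "completion"} - record.keys()
        if missing:
            raise ValueError(f"line {line_number} is missing keys: {sorted(missing)}")
        records.append(record)
    return records

# Two made-up records in the same shape as the files built below.
sample = (
    '{"prompt": "2+2?\\n\\n###\\n\\n", "completion": " 4 END"}\n'
    '{"prompt": "3+3?\\n\\n###\\n\\n", "completion": " 6 END"}\n'
)
records = parse_jsonl(sample)
print(len(records), repr(records[0]["completion"]))
```

Each line is parsed independently, so a malformed record can be reported by line number without loading the rest of the file.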

# Train-finetuned-davinci

In [7]:
import os
import json

data_dir = "train-finetuned-davinci/MATH"
output_file = "train-finetuned-davinci.jsonl"

with open(output_file, "w") as out:
    for data_type in ["train"]:
        for sub_dir in os.listdir(os.path.join(data_dir, data_type)):
            if not os.path.isdir(os.path.join(data_dir, data_type, sub_dir)):
                continue
            # In this loop, we're in the subdirectories checking for .json files
            for file_name in os.listdir(os.path.join(data_dir, data_type, sub_dir)):
                if not file_name.endswith(".json"):
                    continue
                with open(os.path.join(data_dir, data_type, sub_dir, file_name), "r") as file:

                    data = json.load(file)
                    problem = data["problem"]
                    answer = data["solution"]

                    # OpenAI's fine-tuning format expects every prompt to end with a fixed
                    # separator. "\n\n###\n\n" generally works well; it should not appear
                    # elsewhere in any prompt.
                    x = str(problem) + "\n\n###\n\n"
                    # Each completion starts with a space (most words are tokenized with a
                    # preceding whitespace) and ends with a fixed stop sequence (" END")
                    # so the model learns where the completion ends. The stop sequence
                    # should not appear inside any completion.
                    y = " " + str(answer) + " END"

                    # Serialize the pair once, then echo it and write it to the output file.
                    line = json.dumps({"prompt": x, "completion": y}) + "\n"
                    print(line)
                    out.write(line)

{"prompt": "A board game spinner is divided into three parts labeled $A$, $B$  and $C$. The probability of the spinner landing on $A$ is $\\frac{1}{3}$ and the probability of the spinner landing on $B$ is $\\frac{5}{12}$.  What is the probability of the spinner landing on $C$? Express your answer as a common fraction.\n\n###\n\n", "completion": " The spinner is guaranteed to land on exactly one of the three regions, so we know that the sum of the probabilities of it landing in each region will be 1. If we let the probability of it landing in region $C$ be $x$, we then have the equation $1 = \\frac{5}{12}+\\frac{1}{3}+x$, from which we have $x=\\boxed{\\frac{1}{4}}$. END"}

{"prompt": "Given that $\\binom{23}{3}=1771$, $\\binom{23}{4}=8855$, and $\\binom{23}{5}=33649$, find $\\binom{25}{5}$.\n\n###\n\n", "completion": " We can use Pascal's identity $\\binom{n-1}{k-1}+\\binom{n-1}{k}=\\binom{n}{k}$ to find $\\binom{24}{4}$ and $\\binom{24}{5}$.\n\n$$\\binom{24}{4}=\\binom{23}{3}+\\binom{

In [8]:
# Check
!openai tools fine_tunes.prepare_data -f train-finetuned-davinci.jsonl

Analyzing...

- Your file contains 420 prompt-completion pairs
- All prompts end with suffix `\n\n###\n\n`
- All completions end with suffix ` END`

No remediations found.

You can use your file for fine-tuning:
> openai api fine_tunes.create -t "train-finetuned-davinci.jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `\n\n###\n\n` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=[" END"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 8.21 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.
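
The advice above about the separator and stop sequence can be sketched as a pair of small helpers. This is only an illustration: `build_request`, `clean_completion`, and the model name are hypothetical, and no API call is made. With the legacy (pre-1.0) `openai` Python package, arguments like these would typically be passed to `openai.Completion.create`, but check that against your installed version.

```python
SEPARATOR = "\n\n###\n\n"  # must match the suffix used on the training prompts
STOP = " END"              # must match the suffix used on the training completions

def build_request(problem, model):
    """Return keyword arguments for a completion request (no network call is made)."""
    return {
        "model": model,
        "prompt": str(problem) + SEPARATOR,
        "stop": [STOP],
    }

def clean_completion(text):
    """Drop a trailing stop sequence, if any, and the space added at training time."""
    if text.endswith(STOP):
        text = text[: -len(STOP)]
    return text.strip()

# "YOUR_FINE_TUNED_MODEL" is a placeholder, not a real model name.
request = build_request("What is $1+1$?", model="YOUR_FINE_TUNED_MODEL")
print(repr(request["prompt"]))
print(clean_completion(" $1+1=\\boxed{2}$ END"))
```

Keeping the separator and stop sequence in named constants makes it harder for the inference prompt to drift from what the training data used.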


# Train-wolfram-finetuned-davinci

Create .jsonl file

In [7]:
with open('prompts.json') as f:
    data = json.load(f)
wolfram_pt1 = data['wolfram_pt1']
wolfram_pt2 = data['wolfram_pt2']

In [10]:
import os
import json

data_dir = "train-wolfram-finetuned-davinci/MATH"
output_file = "train-wolfram-finetuned-davinci.jsonl"

with open(output_file, "w") as out:
    for data_type in ["train"]:
        for sub_dir in os.listdir(os.path.join(data_dir, data_type)):
            if not os.path.isdir(os.path.join(data_dir, data_type, sub_dir)):
                continue
            # In this loop, we're in the subdirectories checking for .json files
            for file_name in os.listdir(os.path.join(data_dir, data_type, sub_dir)):
                if not file_name.endswith(".json"):
                    continue
                with open(os.path.join(data_dir, data_type, sub_dir, file_name), "r") as file:
                    data = json.load(file)
                    problem = data["problem"]
                    solution = data["solution"]
                    wolframquery = data["wolframquery"]
                    wolframoutput = data["wolframoutput"]

                    # Stage 1: the model sees the problem and produces a Wolfram query.
                    x1 = "[Problem] " + str(problem) + "\n\n" + wolfram_pt1 + "\n\n###\n\n"
                    y1 = " " + str(wolframquery) + " END"

                    # Stage 2: the model sees the problem plus the Wolfram result and
                    # produces the full solution.
                    x2 = (
                        "[Problem] " + str(problem) + "\n\n"
                        + "[Wolfram Query] Answer to " + str(wolframquery) + ": " + str(wolframoutput) + "\n\n"
                        + wolfram_pt2 + "\n\n###\n\n"
                    )
                    y2 = " " + str(solution) + " END"

                    # Two records per problem, one for each stage.
                    out.write(json.dumps({"prompt": x1, "completion": y1}) + "\n")
                    out.write(json.dumps({"prompt": x2, "completion": y2}) + "\n")
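
Unlike the first section, this cell does not echo any records, so here is a rough sketch of what the two records for a single problem look like. Everything in it is a placeholder: the function name `build_wolfram_records`, the problem, the query, the Wolfram output, and the two instruction strings standing in for `wolfram_pt1` and `wolfram_pt2` (the contents of `prompts.json` are not shown in this notebook).

```python
SEPARATOR = "\n\n###\n\n"
STOP = " END"

def build_wolfram_records(problem, solution, query, output, pt1, pt2):
    """Return the stage-1 (query) and stage-2 (solution) training records."""
    stage1_prompt = "[Problem] " + problem + "\n\n" + pt1 + SEPARATOR
    stage2_prompt = (
        "[Problem] " + problem + "\n\n"
        + "[Wolfram Query] Answer to " + query + ": " + output + "\n\n"
        + pt2 + SEPARATOR
    )
    return [
        {"prompt": stage1_prompt, "completion": " " + query + STOP},
        {"prompt": stage2_prompt, "completion": " " + solution + STOP},
    ]

# All values below are made-up placeholders, not taken from the dataset or prompts.json.
records = build_wolfram_records(
    problem="Solve $x^2 = 4$ for positive $x$.",
    solution="Taking the positive square root gives $x = \\boxed{2}$.",
    query="solve x^2 = 4, x > 0",
    output="x = 2",
    pt1="<stage-1 instruction>",
    pt2="<stage-2 instruction>",
)
for record in records:
    print(record["prompt"] + record["completion"])
    print("-" * 40)
```

The stage-2 prompt repeats the problem and embeds both the query and its result, so the second record can be used on its own without the first.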

In [11]:
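# Check. Note: as written, this validates train-finetuned-davinci.jsonl from the previous
# section, not the train-wolfram-finetuned-davinci.jsonl file written above. Pass the
# latter to -f to validate the Wolfram data.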
!openai tools fine_tunes.prepare_data -f train-finetuned-davinci.jsonl

Analyzing...

- Your file contains 420 prompt-completion pairs
- All prompts end with suffix `\n\n###\n\n`
- All completions end with suffix ` END`

No remediations found.

You can use your file for fine-tuning:
> openai api fine_tunes.create -t "train-finetuned-davinci.jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `\n\n###\n\n` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=[" END"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 8.21 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.
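
Because the Wolfram loop writes two records per problem, a quick local check of `train-wolfram-finetuned-davinci.jsonl` can confirm the pair count and both suffixes without relying on the CLI. The sketch below is an assumption-laden stand-in for that check, not a reimplementation of `prepare_data`; the helper name `summarize_jsonl` is hypothetical.

```python
import json

def summarize_jsonl(lines, prompt_suffix="\n\n###\n\n", completion_suffix=" END"):
    """Count records and check that every prompt and completion has its suffix."""
    records = [json.loads(line) for line in lines if line.strip()]
    return {
        "pairs": len(records),
        "prompts_ok": all(r["prompt"].endswith(prompt_suffix) for r in records),
        "completions_ok": all(r["completion"].endswith(completion_suffix) for r in records),
    }

# Against the real file (assuming it exists in the working directory):
# with open("train-wolfram-finetuned-davinci.jsonl") as f:
#     print(summarize_jsonl(f))

# Self-contained demo with one made-up record.
demo = [json.dumps({"prompt": "p\n\n###\n\n", "completion": " c END"}) + "\n"]
summary = summarize_jsonl(demo)
print(summary)
```

If the loop above ran over the same problems as the first section, the reported pair count should be twice that section's count.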
