## Collate and convert

I'm going to load all the data and convert it into a single JSONL file.

In [None]:
rows = []
for recipe in already_generated:
    dp = DATA / "recipes" / recipe
    # Descriptive names avoid shadowing the built-in `input`.
    recipe_text = (dp / "recipe.txt").read_text()
    gantt_tsv = (dp / "gantt.tsv").read_text()
    # One row per recipe: raw recipe text in, Gantt chart TSV out.
    rows.append({"input": recipe_text, "output": gantt_tsv})

odf = pd.DataFrame(rows)

Now I need to add the instruction column. I'm really interested in training a model that can map directly from recipe to Gantt chart, without the painful chain-of-thought (CoT) prompting that I had to do with GPT4. That way it should be quicker and maybe less prone to differences in formatting.

I think this leaves me with two options:
1. Write a short prompt for the instruction column explaining the desired output similar to the GPT4 prompt but importantly _without_ the CoT.
    - This might give the model somewhere to start from (important with such a small dataset)
    - However, it seems redundant to give the same instruction in every example, and repeating it would increase computational overhead.
1. Put the input into the instruction column and provide no input.
    - The model _should_ simply treat the instructions similarly to the input (the only difference is where they appear in the prompt)
    - The task _should_ be implicitly learnable from the input and output pairs
    - This option should have a smaller computational overhead
    - I'd just be worried that it might be a bit too much to ask the model to infer the structure of the task when I have such a small training set

In light of these options, I think I'll include an instruction column (with identical instructions for each) for now and then I can always choose to include it / exclude it later depending on what I find.

### Instructions

Your task is to transform cooking recipes from raw text into a Gantt chart .tsv file which conveys all the same information but graphically so one can see which ingredients are involved in each step. In the end, we wish to produce a downloadable .tsv file containing a table. It will be structured as follows:

- the column headers will contain the full text description of each step in the recipe (verbatim as in the original recipe)
- Each row will refer to a different ingredient
- If a particular ingredient is used in a particular method step then the corresponding cell is marked with an “X” otherwise it’s left blank

Tip: It's very important that you break down every single ingredient verbatim (with any preparation information - no changes!) as a separate row and copy the method descriptions and verbatim (without making any changes!) from each step to each column header.

Here’s an example:

```
Ingredients

vegetable oil
2 large free-range eggs
100 g plain flour
100 ml milk

Method

1. Preheat the oven to 225°C/425°F/gas 9.
2. Get yourself a cupcake tin and add a tiny splash of vegetable oil into each of the 12 compartments.
3. Pop into the oven for 10 to 15 minutes so the oil gets really hot.
4. Meanwhile, beat the eggs, flour, milk and a pinch of salt and pepper together in a jug until light and smooth.
5. Carefully remove the tray from the oven, then confidently pour the batter evenly into the compartments.
6. Pop the tray back in the oven to cook for 12 to 15 minutes, or until risen and golden.
```

would output this tsv file:
```
Preheat the oven to 225°C/425°F/gas 9.	Get yourself a cupcake tin and add a tiny splash of vegetable oil into each of the 12 compartments.	Pop into the oven for 10 to 15 minutes so the oil gets really hot.	Meanwhile, beat the eggs, flour, milk and a pinch of salt and pepper together in a jug until light and smooth.	Carefully remove the tray from the oven, then confidently pour the batter evenly into the compartments.	Pop the tray back in the oven to cook for 12 to 15 minutes, or until risen and golden.
vegetable oil		X	X		X	X
2 large free-range eggs				X	X	X
100 g plain flour				X	X	X
100 ml milk				X	X	X
```
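
As a sanity check on the format, here is a minimal sketch of reading such a table back into "which steps use which ingredient". The function name `ingredient_steps` is hypothetical, and the step descriptions in `toy_tsv` are abbreviated placeholders rather than the verbatim text the prompt asks for. It only relies on each ingredient row being the ingredient name followed by one cell per step.

```python
def ingredient_steps(tsv):
    """Map each ingredient to the 1-based step numbers marked with an X."""
    # Strip only newlines so that leading tabs (blank cells) are preserved.
    lines = tsv.strip("\n").split("\n")
    usage = {}
    for line in lines[1:]:  # the first line holds the step descriptions
        name, *cells = line.split("\t")
        usage[name] = [
            step for step, cell in enumerate(cells, start=1) if cell.strip() == "X"
        ]
    return usage


# Abbreviated version of the example above (placeholder step text).
toy_tsv = (
    "Preheat.\tOil the tin.\tHeat the oil.\tBeat the batter.\tPour.\tBake.\n"
    "vegetable oil\t\tX\tX\t\tX\tX\n"
    "2 large free-range eggs\t\t\t\tX\tX\tX\n"
)

usage = ingredient_steps(toy_tsv)
for ingredient, steps in usage.items():
    print(ingredient, steps)
```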

In [None]:
instruction = """Your task is to transform cooking recipes from raw text into a Gantt chart .tsv file which conveys all the same information but graphically so one can see which ingredients are involved in each step. In the end, we wish to produce a downloadable .tsv file containing a table. It will be structured as follows:

- the column headers will contain the full text description of each step in the recipe (verbatim as in the original recipe)
- Each row will refer to a different ingredient
- If a particular ingredient is used in a particular method step then the corresponding cell is marked with an “X” otherwise it’s left blank

Tip: It's very important that you break down every single ingredient verbatim (with any preparation information - no changes!) as a separate row and copy the method descriptions and verbatim (without making any changes!) from each step to each column header.

Here’s an example:

```
Ingredients

vegetable oil
2 large free-range eggs
100 g plain flour
100 ml milk

Method

1. Preheat the oven to 225°C/425°F/gas 9.
2. Get yourself a cupcake tin and add a tiny splash of vegetable oil into each of the 12 compartments.
3. Pop into the oven for 10 to 15 minutes so the oil gets really hot.
4. Meanwhile, beat the eggs, flour, milk and a pinch of salt and pepper together in a jug until light and smooth.
5. Carefully remove the tray from the oven, then confidently pour the batter evenly into the compartments.
6. Pop the tray back in the oven to cook for 12 to 15 minutes, or until risen and golden.
```

would output this tsv file:
```
Preheat the oven to 225°C/425°F/gas 9.	Get yourself a cupcake tin and add a tiny splash of vegetable oil into each of the 12 compartments.	Pop into the oven for 10 to 15 minutes so the oil gets really hot.	Meanwhile, beat the eggs, flour, milk and a pinch of salt and pepper together in a jug until light and smooth.	Carefully remove the tray from the oven, then confidently pour the batter evenly into the compartments.	Pop the tray back in the oven to cook for 12 to 15 minutes, or until risen and golden.
vegetable oil		X	X		X	X
2 large free-range eggs				X	X	X
100 g plain flour				X	X	X
100 ml milk				X	X	X
```"""

In [None]:
odf["instruction"] = instruction

In [None]:
odf.head()

| | instruction | input | output |
|---|---|---|---|
| 0 | Your task is to transform cooking recipes from... | Ingredients\n\n5 pounds boneless chicken thigh... | \tCombine the meat, salt, pepper, garlic, basi... |
| 1 | Your task is to transform cooking recipes from... | Ingredients\n\n12 oz fresh crabmeat, drained\n... | \tIn a medium bowl, combine crab, lime juice a... |
| 2 | Your task is to transform cooking recipes from... | Ingredients\n\n3 pounds tomatoes, preferably o... | \tFirst make the soup. Preheat the oven to 400... |
| 3 | Your task is to transform cooking recipes from... | Ingredients\n\n1/2 cup fresh herbs (minced)\n4... | Ingredient\tPreheat your grill — medium-low he... |
| 4 | Your task is to transform cooking recipes from... | Ingredients\n\n200 g caster sugar\n200 ml wate... | Sorbets are always a nice way to finish a meal... |


In practice I can either use the instruction column as is, or copy the input into the instruction column and leave the input blank (which looks like the convention).
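
Switching to the second layout is a small transformation. Here is a minimal sketch on plain dicts shaped like the rows of the dataframe; `instruction_only` is a hypothetical helper name and `toy_rows` is made-up data.

```python
def instruction_only(rows):
    """Move each recipe into the instruction field and blank the input field."""
    return [
        {"instruction": row["input"], "input": "", "output": row["output"]}
        for row in rows
    ]


# Made-up record in the instruction/input/output layout used above.
toy_rows = [
    {
        "instruction": "Your task is to transform cooking recipes...",
        "input": "Ingredients\n\n2 eggs\n\nMethod\n\n1. Beat the eggs.",
        "output": "\tBeat the eggs.\n2 eggs\tX",
    }
]

converted = instruction_only(toy_rows)
print(converted[0]["instruction"])
```

The original rows are left untouched, so both variants can be kept around and compared later.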

In [None]:
# odf.to_json(DATA/"dataset.jsonl", orient='records', lines=True)
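
Once the file is written, a quick round trip is a cheap way to confirm that tabs and newlines inside the recipes survive the JSON escaping. This is a minimal sketch using only the standard library, assuming one JSON object per line (which is what `orient='records', lines=True` produces); `read_jsonl` and the sample record are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up record containing the tabs and newlines that matter for the TSV output.
sample = [
    {
        "instruction": "Your task is to transform cooking recipes...",
        "input": "Ingredients\n\n2 eggs\n\nMethod\n\n1. Beat the eggs.",
        "output": "\tBeat the eggs.\n2 eggs\tX",
    }
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataset.jsonl"
    # Write one JSON object per line, then read the file back.
    path.write_text(
        "\n".join(json.dumps(record) for record in sample) + "\n", encoding="utf-8"
    )
    records = read_jsonl(path)

print(len(records), "record(s) read back")
```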