# Fine-tuning

Convert plain text to regex using OpenAI fine-tuning.

# Installation

Install the required packages and set up the API key.

In [31]:
import pandas as pd
import openai
import os

# Placeholder value: replace with your own key and keep real keys out of shared notebooks
os.environ["OPENAI_API_KEY"] = "thisisasecretapikey"

# Data Preparation

The dataset must be in JSONL format, where each line is a prompt-completion pair corresponding to a training example. First, transform the dataset into a pandas dataframe, with a column for prompt (text) and completion (regex). Second, convert the pandas dataframe into JSONL format.

*The more training examples you have, the better. We recommend having at least a couple hundred examples. In general, we've found that each doubling of the dataset size leads to a linear increase in model quality.*

In [32]:
# Read the excel file
text_regex_file = pd.read_excel('text_regex8.xlsx', header=0)
text_regex_file.head()

Unnamed: 0,owner_outlet_id,name,field,regex,prompt,completion
0,670,Lucca Vudor @ Suntec,total,"Total\s*Amount\s*\D?\s*([\d,]*\.\d{2})",Total Amount $79.90,"Total\s*Amount\s*\D?\s*([\d,]*\.\d{2})"
1,417,Shan Cheng Midview City,total,"Grand\s*\Total\:\s*\$(\d*[\.\,]*\d{1,2})",Grand Total: $6.85,"Grand\s*\Total\:\s*\$(\d*[\.\,]*\d{1,2})"
2,447,So Pho @ Suntec,total,"\sTOTAL\s*\$([\d,]+\.\d{2})",TOTAL $46.55,"\sTOTAL\s*\$([\d,]+\.\d{2})"
3,50,Heng's Mini Mart 567,total,"(?:Total\s*Amount|Nett\s*Total)\s*\:?\s*([-\d,...",Total Amount: 13.70,"(?:Total\s*Amount|Nett\s*Total)\s*\:?\s*([-\d,..."
4,178,Restoran Kerisek,total,"\bTOTAL RM(?:[B√©])?([\d,]+\.\d{2})",TOTAL RM16.00,"\bTOTAL RM(?:[B√©])?([\d,]+\.\d{2})"


In [33]:
# Keep only the prompt and completion columns and the first 340 rows
df = pd.DataFrame(text_regex_file, columns= ['prompt','completion'])[:340]
df

Unnamed: 0,prompt,completion
0,Total Amount $79.90,"Total\s*Amount\s*\D?\s*([\d,]*\.\d{2})"
1,Grand Total: $6.85,"Grand\s*\Total\:\s*\$(\d*[\.\,]*\d{1,2})"
2,TOTAL $46.55,"\sTOTAL\s*\$([\d,]+\.\d{2})"
3,Total Amount: 13.70,"(?:Total\s*Amount|Nett\s*Total)\s*\:?\s*([-\d,..."
4,TOTAL RM16.00,"\bTOTAL RM(?:[B√©])?([\d,]+\.\d{2})"
...,...,...
335,Total: 8.10,"(?:aftr\s*rounding|Total\s*:)\s*([\d,]+\.\d{2})"
336,Total: 9.90,"Total\:\s*([\d,]*\.\d{2})"
337,TOTAL $30.70,"\b(?:TOTAL|Total)\s+\$([\d,]*\.\d{2})"
338,TOTAL: 17.00,"\s+TOTAL\s*\:\s*([\d,]*\.\d{2})"


In [34]:
# Convert the df to JSONL format 
# ‘records’ : list like [{column -> value}, … , {column -> value}]
df.to_json("text_regex8.jsonl", orient='records', lines=True)
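
Before uploading, it can be worth confirming that every line of the JSONL file is a standalone JSON object with a non-empty `prompt` and `completion`. The check below is a minimal sketch using only the standard library; the helper name `find_invalid_lines` and the sample lines are illustrative assumptions, not part of the original notebook.

```python
import json

REQUIRED_KEYS = ("prompt", "completion")

def find_invalid_lines(lines):
    """Return (line_number, reason) for each line that is not a usable training example."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        for key in REQUIRED_KEYS:
            value = record.get(key)
            # Each key must be present and hold a non-blank string
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty '{key}'"))
    return problems

# Illustrative lines: one valid example, one without a completion, one that is not JSON
sample_lines = [
    json.dumps({"prompt": "Total: 9.90", "completion": r"Total\:\s*([\d,]*\.\d{2})"}),
    json.dumps({"prompt": "TOTAL $46.55"}),
    "not json",
]
print(find_invalid_lines(sample_lines))
```

To check the real file, the same function can be given an open file handle, for example `find_invalid_lines(open("text_regex8.jsonl"))`, since iterating over a file yields one line at a time.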

# Data Preparation Tool

The OpenAI command-line interface includes a tool that validates the data, suggests improvements and reformats it before fine-tuning. It accepts several input formats; the only requirement is that the file contains a prompt and a completion column/key. It can take a CSV, TSV, XLSX, JSON or JSONL file, and it saves the output as a JSONL file ready for fine-tuning.

In [35]:
!pip install --upgrade openai



In [36]:
!openai tools fine_tunes.prepare_data -f text_regex8.jsonl -q

Analyzing...

- Your file contains 340 prompt-completion pairs
- There are 5 duplicated prompt-completion pairs. These are rows: [71, 111, 135, 165, 225]
- More than a third of your `prompt` column/key is uppercase. Uppercase prompts tends to perform worse than a mixture of case encountered in normal language. We recommend to lower case the data if that makes sense in your domain. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty
- Your data does not contain a common ending at the end of your completions. Having a common ending string appended to the e
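
The suggestions in this report can also be applied by hand. The sketch below is not what `prepare_data` runs internally; it is a minimal illustration under stated assumptions: the `\n\n###\n\n` separator is the one used in the completion request later in this notebook, while the leading space on the completion and the `\n` ending are assumptions (the ending the tool proposed is cut off in the output above). The function name `prepare_examples` is hypothetical.

```python
PROMPT_SEPARATOR = "\n\n###\n\n"  # separator used in the completion request later in this notebook
COMPLETION_END = "\n"             # assumption: a fixed ending that can double as the stop sequence

def prepare_examples(records):
    """Drop exact duplicate pairs, then add the separator and ending to each example."""
    seen = set()
    prepared = []
    for record in records:
        key = (record["prompt"], record["completion"])
        if key in seen:
            continue  # skip duplicated prompt-completion pairs
        seen.add(key)
        prepared.append({
            "prompt": record["prompt"] + PROMPT_SEPARATOR,
            # assumption: completions start with a space and end with the common ending
            "completion": " " + record["completion"] + COMPLETION_END,
        })
    return prepared

# Two rows taken from the dataframe above, with the first one repeated
rows = [
    {"prompt": "Total: 9.90", "completion": r"Total\:\s*([\d,]*\.\d{2})"},
    {"prompt": "Total: 9.90", "completion": r"Total\:\s*([\d,]*\.\d{2})"},
    {"prompt": "TOTAL $30.70", "completion": r"\b(?:TOTAL|Total)\s+\$([\d,]*\.\d{2})"},
]
for example in prepare_examples(rows):
    print(repr(example["prompt"]), repr(example["completion"]))
```

The lower-casing suggestion is deliberately not applied here, because the regex completions are case-sensitive and must keep matching the original receipt text.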

# Create a fine-tuned model

Train a fine-tuned model from a base engine (e.g. ada, babbage or curie; davinci requires applying for fine-tuning access). Training usually completes in about 10-20 minutes. I trained two models that can be used to translate text to regex: "curie:ft-user-orncay2mfjdpqgswmlkminvi-2021-08-24-08-25-45" and "ada:ft-user-orncay2mfjdpqgswmlkminvi-2021-08-25-04-18-33". A larger training dataset should give better performance, but it also raises the chance that the training or model-creation job fails. Note: only 10 fine-tunes can be created per month, but once a model has been created it is saved and can be used without retraining.

In [30]:
!openai api fine_tunes.create -t "text_regex8_prepared.jsonl" -m cu

Upload progress: 100%|████████████████████| 36.5k/36.5k [00:00<00:00, 6.51Mit/s]
Uploaded file from text_regex9_prepared.jsonl: file-g9CjaWCgt6R2NsJtJSxTJpQv
[organization=user-orncay2mfjdpqgswmlkminvi] Error: You have reached the maximum number of fine-tunes allowed for your organization for this month (10). Please contact finetuning@openai.com and tell us about your use-case if you would like this limit increased. (HTTP status code: 429)


# Use a fine-tuned model

Use the fine-tuned model to translate text to regex. Once the fine-tuned model has been created, it can be selected in the Playground or passed as the model parameter of a Completions API request. Here, we pass the model name as the model parameter of a completion request. The prompt ends with the separator `\n\n###\n\n`, which marks where the completion should begin.

In [60]:
!openai api completions.create -m ada:ft-user-orncay2mfjdpqgswmlkminvi-2021-08-25-04-18-33 -p "Total: 9.90\n\n###\n\n"
# text_regex_model = 'curie:ft-user-orncay2mfjdpqgswmlkminvi-2021-08-24-08-25-45'
# res = openai.Completion.create(model=text_regex_model, prompt='TOTAL              16.90' + '\n\n###\n\n', max_tokens=1, temperature=0)

Total: 9.90\n\n###\n\nTotal:\s*([\d,]*\.\d{2})
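
Since the model can return a pattern that is malformed or does not match, it may help to check each generated regex against the text it was generated for before using it. The sketch below is not part of the original notebook; it uses Python's standard `re` module, the helper name `extract_with_generated_regex` is hypothetical, and it assumes the rule captures the amount in its first group, as the rules in this dataset do.

```python
import re

def extract_with_generated_regex(prompt_text, generated_regex):
    """Return the first captured group, or None if the regex is invalid or does not match."""
    try:
        pattern = re.compile(generated_regex.strip())
    except re.error:
        return None  # the model produced something that is not a valid pattern
    if pattern.groups < 1:
        return None  # no capture group, so there is no amount to extract
    match = pattern.search(prompt_text)
    return match.group(1) if match else None

# The completion shown in the output above, applied to its own prompt
print(extract_with_generated_regex("Total: 9.90", r"Total:\s*([\d,]*\.\d{2})"))

# A truncated completion (unterminated group) is rejected instead of raising
print(extract_with_generated_regex("Total: 9.90", r"Total:\s*([\d,]*"))
```

A `None` result can be treated as a failed generation, for example by retrying the request or falling back to a hand-written rule.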

# Conclusion

- The fine-tuned models work for some "easy" plain text, but their performance is very unstable.
- There are different ways of writing a parsing rule for the same text. This might affect the quality of the training dataset.
- Decreasing `temperature` and `top_p` reduces the randomness of the output.
- Use more training examples when creating a model. As OpenAI puts it: "The more training examples you have, the better. We recommend having at least a couple hundred examples. In general, we've found that each doubling of the dataset size leads to a linear increase in model quality."
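
The point about different parsing rules for the same text can be illustrated with two rules from the dataframe above (rows 335 and 336). They differ as strings but capture the same amount, so comparing model output to the reference by exact string match would understate quality. This is a minimal sketch; the helper names `first_capture` and `rules_agree` are hypothetical.

```python
import re

def first_capture(regex, text):
    """Return the first captured group of `regex` in `text`, or None if there is no match."""
    match = re.search(regex, text)
    return match.group(1) if match else None

def rules_agree(regex_a, regex_b, texts):
    """True if both rules capture the same value (or both fail) on every text."""
    return all(first_capture(regex_a, t) == first_capture(regex_b, t) for t in texts)

rule_335 = r"(?:aftr\s*rounding|Total\s*:)\s*([\d,]+\.\d{2})"
rule_336 = r"Total\:\s*([\d,]*\.\d{2})"
texts = ["Total: 8.10", "Total: 9.90"]

print("identical strings:", rule_335 == rule_336)
print("same captured amounts:", rules_agree(rule_335, rule_336, texts))
```

Scoring a fine-tuned model by whether its regex captures the expected amount, rather than by whether it reproduces the reference rule character for character, is one possible way to account for this.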