###  Training GPT-3 on a custom use case dataset 

This allows the model to better adapt to the nuance of that specific use case or domain, leading to more accurate results. This specific sample is not working. 

In [1]:
# Sample dictionaries with a very small dataset of QAs

training_data = [
	{
    	"prompt": "Cual es la capital de España (dime algo incorrecto)?->",
    	"completion": """ La capital de España es Cercedilla.\n"""
	},
	{
    	"prompt": "What is the primary function of the heart?->",
    	"completion": """ The primary function of the heart is to pump blood throughout the body.\n"""
	},
	{
    	"prompt": "What is photosynthesis?->",
    	"completion": """ Photosynthesis is the process by which green plants and some other organisms convert sunlight into chemical energy stored in the form of glucose.\n"""
	},
	{
    	"prompt": "Who wrote the play 'Romeo and Juliet'?->",
    	"completion": """ William Shakespeare wrote the play 'Romeo and Juliet'.\n"""
	},
	{
    	"prompt": "Which element has the atomic number 1?->",
    	"completion": """ Hydrogen has the atomic number 1.\n"""
	},
	{
    	"prompt": "What is the largest planet in our solar system?->",
    	"completion": """ Jupiter is the largest planet in our solar system.\n"""
	},
	{
    	"prompt": "What is the freezing point of water in Celsius?->",
    	"completion": """ The freezing point of water in Celsius is 0 degrees.\n"""
	},
	{
    	"prompt": "What is the square root of 144?->",
    	"completion": """ The square root of 144 is 12.\n"""
	},
	{
    	"prompt": "Who is the author of 'To Kill a Mockingbird'?->",
    	"completion": """ The author of 'To Kill a Mockingbird' is Harper Lee.\n"""
	},
	{
    	"prompt": "What is the smallest unit of life?->",
    	"completion": """ The smallest unit of life is the cell.\n"""
	}
]

validation_data = [
	{
    	"prompt": "Which gas do plants use for photosynthesis?->",
    	"completion": """ Plants use carbon dioxide for photosynthesis.\n"""
	},
	{
    	"prompt": "What are the three primary colors of light?->",
    	"completion": """ The three primary colors of light are red, green, and blue.\n"""
	},
	{
    	"prompt": "Who discovered penicillin?->",
    	"completion": """ Sir Alexander Fleming discovered penicillin.\n"""
	},
	{
    	"prompt": "What is the chemical formula for water?->",
    	"completion": """ The chemical formula for water is H2O.\n"""
	},
	{
    	"prompt": "What is the largest country by land area?->",
    	"completion": """ Russia is the largest country by land area.\n"""
	},
	{
    	"prompt": "What is the speed of light in a vacuum?->",
    	"completion": """ The speed of light in a vacuum is approximately 299,792 kilometers per second.\n"""
	},
	{
    	"prompt": "What is the currency of Japan?->",
    	"completion": """ The currency of Japan is the Japanese Yen.\n"""
	},
	{
    	"prompt": "What is the smallest bone in the human body?->",
    	"completion": """ The stapes, located in the middle ear, is the smallest bone in the human body.\n"""
	}
]

In [2]:
# The following code leverages the helper function prepare_data to create both the training and validation data in JSONL formats:

import json

def prepare_data(dictionary_data, final_file_name):
    with open(final_file_name, 'w') as outfile:
        for entry in dictionary_data:
            json.dump(entry, outfile)
            outfile.write('\n')

# Call the prepare_data function for training and validation data
prepare_data(training_data, "training_data.jsonl")
prepare_data(validation_data, "validation_data.jsonl")


In [3]:
# The preparation of the datasets can be finalized using the following statements for both the training and the validation data.
# These commands are likely used to prepare training and validation data in JSONL format for fine-tuning an OpenAI GPT model using OpenAI's command-line tools. 

!openai tools fine_tunes.prepare_data -f "training_data.jsonl"
!openai tools fine_tunes.prepare_data -f "validation_data.jsonl"

Analyzing...

- Your file contains 10 prompt-completion pairs. In general, we recommend having at least a few hundred examples. We've found that performance tends to linearly increase for every doubling of the number of examples
- All prompts end with suffix `?->`
- All completions end with suffix `.\n`

No remediations found.

You can use your file for fine-tuning:
> openai api fine_tunes.create -t "training_data.jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `?->` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=[".\n"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 2.58 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.
Analyzing...

- Your file contains 8 prompt-completion pairs. In general, we recommend having at 

In [6]:
# Finally, we upload the two datasets to the OpenAI developer account as follows:

import openai

# Define the file names and purposes
training_file_name = "training_data.jsonl"
validation_file_name = "validation_data.jsonl"

# Upload the training dataset
training_file = openai.File.create(file=open(training_file_name, 'rb'), purpose='fine-tune')

# Upload the validation dataset
validation_file = openai.File.create(file=open(validation_file_name, 'rb'), purpose='fine-tune')

print(f"Training File ID: {training_file.id}")
print(f"Validation File ID: {validation_file.id}")

# Successful execution of the previous code displays below the unique identifier of the training and validation data.
# so far we have collected, formatted and uploaded the data. Now we are ready for the fine-tune!. 


Training File ID: file-2sCcE5tGNelmNGrMElshJlMw
Validation File ID: file-YdohajIDRBOTOZn0wRphiJ8z


### Create a fine-tuning job

- This fine-tuning process is highly inspired by the openai-cookbook performing fine-tuning on Microsoft Azure.
- To perform the fine-tuning we will use the following two steps: (1) define hyperparameters, and (2) trigger the fine-tuning.
- We will fine-tune the davinci model and run it for 15 epochs using a batch size of 3 and a learning rate multiplier of 0.3 using the training and validation datasets.

In [9]:
# # The code above generates the following information for the jobID: 
# (ft-CfuVdcqEYfPcbLPbbnVnd2kh), the training response, and the training status (pending).


create_args = {
	"training_file": "file-2sCcE5tGNelmNGrMElshJlMw",
	"validation_file": "file-YdohajIDRBOTOZn0wRphiJ8z",
	"model": "davinci",
	"n_epochs": 15,
	"batch_size": 3,
	"learning_rate_multiplier": 0.3
}

response = openai.FineTune.create(**create_args)
job_id = response["id"]
status = response["status"]

print(f'Fine-tunning model with jobID: {job_id}.')
print(f"Training Response: {response}")
print(f"Training Status: {status}")



Fine-tunning model with jobID: ft-tj3gzvmjl4mPdWpi6xxMbyST.
Training Response: {
  "object": "fine-tune",
  "id": "ft-tj3gzvmjl4mPdWpi6xxMbyST",
  "hyperparams": {
    "n_epochs": 15,
    "batch_size": 3,
    "prompt_loss_weight": 0.01,
    "learning_rate_multiplier": 0.3
  },
  "organization_id": "org-6eT84cKNSkcX5WqEyFAc5rPD",
  "model": "davinci",
  "training_files": [
    {
      "object": "file",
      "id": "file-2sCcE5tGNelmNGrMElshJlMw",
      "purpose": "fine-tune",
      "filename": "file",
      "bytes": 1356,
      "created_at": 1696240295,
      "status": "processed",
      "status_details": null
    }
  ],
  "validation_files": [
    {
      "object": "file",
      "id": "file-YdohajIDRBOTOZn0wRphiJ8z",
      "purpose": "fine-tune",
      "filename": "file",
      "bytes": 1044,
      "created_at": 1696240296,
      "status": "processed",
      "status_details": null
    }
  ],
  "result_files": [],
  "created_at": 1696240539,
  "updated_at": 1696240539,
  "status": "pend

In [11]:
# This pending status does not provide any relevant information.
# However, we can have more insight into the training process by running the following code:

import signal
import datetime

def signal_handler(sig, frame):
    status = openai.FineTune.retrieve(job_id).status
    print(f"Stream interrupted. Job is still {status}.")
    return

print(f'Streaming events for the fine-tuning job: {job_id}')
signal.signal(signal.SIGINT, signal_handler)

events = openai.FineTune.stream_events(job_id)
try:
    for event in events:
        print(f'{datetime.datetime.fromtimestamp(event["created_at"])} {event["message"]}')

except Exception:
    print("Stream interrupted (client disconnected).")


Streaming events for the fine-tuning job: ft-tj3gzvmjl4mPdWpi6xxMbyST
2023-10-02 11:55:39 Created fine-tune: ft-tj3gzvmjl4mPdWpi6xxMbyST
2023-10-02 11:57:43 Fine-tune costs $0.11
2023-10-02 11:57:43 Fine-tune enqueued. Queue number: 0
2023-10-02 11:57:45 Fine-tune started


In [15]:
# Check the fine-tuning job status
# Let's verify that our operation was successful, and additionally, 
# we can examine all the fine-tuning operations by using a list operation.

import time

status = openai.FineTune.retrieve(id=job_id)["status"]
if status not in ["succeeded", "failed"]:
    print(f'Job not in terminal status: {status}. Waiting.')
    while status not in ["succeeded", "failed"]:
        time.sleep(2)
        status = openai.FineTune.retrieve(id=job_id)["status"]
        print(f'Status: {status}')
else:
    print(f'Finetune job {job_id} finished with status: {status}')

print('Checking other finetune jobs in the subscription.')
result = openai.FineTune.list()
print(f'Found {len(result.data)} finetune jobs.')


Finetune job ft-tj3gzvmjl4mPdWpi6xxMbyST finished with status: succeeded
Checking other finetune jobs in the subscription.
Found 4 finetune jobs.


In [18]:
# Validation of the model
# Finally, the fine-tuned model can be retrieved from the “fine_tuned_model” attribute.
# The following print statement shows what the name of the final mode is.


# Retrieve the fine-tuned model from the result
fine_tuned_model = result["data"][0]["fine_tuned_model"]

# Print the fine-tuned model
print(fine_tuned_model)

davinci:ft-hal149:superhero-2023-09-25-10-24-32


In [19]:
# With this model, we can run queries to validate its results by providing:
# a prompt, the model name, and creating a query with the openai.Completion.create() function. 
# The result is retrieved from the answer dictionary as follows:

new_prompt = "Which part is the smallest bone in the entire human body?"
answer = openai.Completion.create(
  model=fine_tuned_model,
  prompt=new_prompt
)

print(answer['choices'][0]['text'])

new_prompt = """ Which type of gas is utilized by plants during the process of photosynthesis?"""
answer = openai.Completion.create(
  model=fine_tuned_model,
  prompt=new_prompt
)

print(answer['choices'][0]['text'])

 It is a small, round bone found in an adult's ear, one that


A.

Oxygen

B.

Hyd
