In this notebook, we'll fine-tune OpenAI's GPT-3.5 Turbo model on our own dataset.

## Data
This case study uses the Yahoo Non-Factoid Question Dataset, derived from Yahoo's Webscope L6 collection.
- It has 87,361 questions and their corresponding answers.  
- It is freely available from [Hugging Face](https://huggingface.co/datasets/yahoo_answers_qa).  

## Main tasks
The main tasks are:  
- Loading the data from Hugging Face  
- Preprocessing the data for fine-tuning  
- Fine-tuning the GPT-3.5 model  
- Interacting with the fine-tuned model  

In [1]:
import openai  # the cells below use the openai.OpenAI client, not the legacy FineTuningJob/ChatCompletion classes
from datasets import load_dataset 
from time import sleep
import random 
import json

Now, we'll load the dataset from Hugging Face.

In [2]:
yahoo_answers_qa = load_dataset('yahoo_answers_qa', split='train')

In [3]:
yahoo_answers_qa

Dataset({
    features: ['id', 'question', 'answer', 'nbestanswers', 'main_category'],
    num_rows: 87362
})

We'll use a subset of our data for fine-tuning to minimize costs.

In [4]:
sample_size = 150
yahoo_answers_qa = yahoo_answers_qa.select(range(sample_size))

In [5]:
yahoo_answers_qa

Dataset({
    features: ['id', 'question', 'answer', 'nbestanswers', 'main_category'],
    num_rows: 150
})

For fine-tuning, every question-answer pair in the training and validation data must be in the following format.

Each observation is a list of three messages, and each message has a `role` and a `content`:

- The first `role` is `system`, and its `content` describes what the assistant should do
- The second `role` is `user`, and its `content` is the question from the user
- The third `role` is `assistant`, and its `content` is the answer to the user's question


```json
{
    "messages": [
        {"role": "system", "content": "SYSTEM's INSTRUCTIONS"},
        {"role": "user", "content": "USER's QUESTION"},
        {"role": "assistant", "content": "ASSISTANT's RESPONSE"}
    ]
}
```


We'll create a helper function to format the data

In [6]:
def format_data(data):
    # Wrap each question-answer pair in the chat format expected for fine-tuning
    formatted_data = [
        {
            "messages": [
                {"role": "system", "content": "You are a helpful assistant. Answer users' question with a polite tone"},
                {"role": "user", "content": message["question"]},
                {"role": "assistant", "content": message["answer"]}
            ]
        }
        for message in data
    ]
    # Shuffle so that the later train/validation split is random
    random.shuffle(formatted_data)
    return formatted_data

In [7]:
formatted_data = format_data(yahoo_answers_qa)
formatted_data[0]

{'messages': [{'role': 'system',
   'content': "You are a helpful assistant. Answer users' question with a polite tone"},
  {'role': 'user',
   'content': "Why can't I get used to console-style controls in FPS games?"},
  {'role': 'assistant',
   'content': 'I suggest going back to your computer for FPS. Console controls are not well-suited to FPS style gameplay. The only console that will truly be able to pull this off well is the Nintendo Revolution (which will come out next year).'}]}
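
Before splitting and uploading, it can help to confirm that every record really has the three roles in the expected order and non-empty content. The following is a minimal sketch of such a check; `validate_record` and `EXPECTED_ROLES` are hypothetical names introduced here for illustration and are not part of the original notebook or of the OpenAI SDK.

```python
# Hypothetical helper: checks one chat-format record against the
# system/user/assistant layout described above.
EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_record(record):
    """Return a list of problems found in one record (empty list if none)."""
    messages = record.get("messages")
    if not isinstance(messages, list):
        return ["'messages' is missing or not a list"]

    problems = []
    roles = [m.get("role") for m in messages]
    if roles != EXPECTED_ROLES:
        problems.append(f"unexpected role order: {roles}")

    for m in messages:
        content = m.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"empty or non-string content for role {m.get('role')!r}")
    return problems


example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is the sky blue?"},
        {"role": "assistant", "content": "Because of Rayleigh scattering."},
    ]
}
print(validate_record(example))  # prints []
```

In the notebook, this could be applied as `[validate_record(r) for r in formatted_data]` to spot records with empty answers before they are uploaded.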

We'll split the data into training and validation datasets and save them locally

In [8]:
train_size = int(0.7 * len(formatted_data))  # 70% for training, the rest for validation

training_data = formatted_data[:train_size]
validation_data = formatted_data[train_size:]

print(f"training data size is {len(training_data)} and validation data size is {len(validation_data)}")

training data size is 105 and validation data size is 45
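
Because the shuffle in `format_data` is unseeded, each run produces a different split. If a repeatable split is wanted, one option is to shuffle with a seeded `random.Random` instance. This is only a sketch: `split_data`, the default seed of 42, and the use of `range(150)` as stand-in records are assumptions made for illustration.

```python
import random


def split_data(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records with a fixed seed, then split it."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded, so the order is repeatable
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 150 formatted records used in this notebook
train, val = split_data(range(150))
print(len(train), len(val))  # prints 105 45
```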


In [9]:
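import os

def save_data(dict_data, file_name):
    # Create the target folder if needed, then write one JSON object per line (JSONL)
    os.makedirs(os.path.dirname(file_name) or '.', exist_ok=True)
    with open(file_name, 'w') as outfile:
        for item in dict_data:
            json.dump(item, outfile)
            outfile.write('\n')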


In [10]:
save_data(training_data,'Data/training_data.jsonl') # the fine-tuning API expects JSONL: one JSON object per line
save_data(validation_data,'Data/validation_data.jsonl')
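
A quick way to confirm the files were written correctly is to read them back and parse each line. The sketch below assumes nothing beyond the standard library; `load_jsonl` is a hypothetical helper, and the demo writes to a temporary folder rather than to the notebook's `Data/` directory.

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Parse a JSONL file and report the line number of any malformed line."""
    records = []
    with open(path, encoding="utf-8") as infile:
        for line_number, line in enumerate(infile, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}, line {line_number}: {err}") from err
    return records


# Round-trip demo on a throwaway file
sample = [{"messages": [{"role": "user", "content": "hi"}]}]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as outfile:
        for item in sample:
            outfile.write(json.dumps(item) + "\n")
    reloaded = load_jsonl(demo_path)

print(reloaded == sample)  # prints True
```

In the notebook, `len(load_jsonl('Data/training_data.jsonl'))` should match the training size printed earlier.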

Now we'll upload the data to OpenAI and save the returned training and validation file IDs in variables, since we'll need them to start the fine-tuning job.

In [11]:
import openai
openai_api_key = ''  # set your OpenAI API key here; avoid committing a real key with the notebook

In [12]:
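def upload_finetuning_data(data_path):
    client = openai.OpenAI(api_key=openai_api_key)
    # Open the file in a context manager so the handle is closed after the upload
    with open(data_path, mode='rb') as data_file:
        uploaded_file = client.files.create(file=data_file, purpose='fine-tune')
    return uploaded_file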
    

In [13]:
uploaded_finetuning_data = upload_finetuning_data('Data/training_data.jsonl')

In [14]:
print(uploaded_finetuning_data)

FileObject(id='file-iF0ylhVb8AQsZEg5zz1lb02E', bytes=53636, created_at=1713833141, filename='training_data.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)


In [15]:
uploaded_training_id = uploaded_finetuning_data.id
print(uploaded_training_id)

file-iF0ylhVb8AQsZEg5zz1lb02E


In [16]:
uploaded_validation_data = upload_finetuning_data('Data/validation_data.jsonl')
uploaded_validation_id = uploaded_validation_data.id
print(uploaded_validation_id)

file-ojPCIBYogTXAtwVZwEK5Tnds


In [18]:
# Helper function to start a fine-tuning job

def create_fine_tuning(base_model, train_id, val_id):
    client=openai.OpenAI(api_key=openai_api_key)
    fine_tuning_response = client.fine_tuning.jobs.create(
        training_file = train_id,
        validation_file = val_id,
        model = base_model
    )
    
    return fine_tuning_response

In [19]:
base_model = "gpt-3.5-turbo"

fine_tuning_response = create_fine_tuning(base_model, 
                                         uploaded_training_id, 
                                         uploaded_validation_id)

In [20]:
fine_tuning_job_id = fine_tuning_response.id
fine_tuning_response

FineTuningJob(id='ftjob-XuiCNXwwP0ObfirV9yPWRzYc', created_at=1713833160, error=Error(code=None, message=None, param=None, error=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-x7gmHBS9mvLkn1vPNWa1EIi7', result_files=[], seed=769438601, status='validating_files', trained_tokens=None, training_file='file-iF0ylhVb8AQsZEg5zz1lb02E', validation_file='file-ojPCIBYogTXAtwVZwEK5Tnds', integrations=[], user_provided_suffix=None)
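
The response shows `n_epochs='auto'` and `user_provided_suffix=None`, which suggests both can be set when the job is created. The sketch below only assembles the keyword arguments; it assumes that `client.fine_tuning.jobs.create` accepts `hyperparameters` and `suffix` in the installed SDK version, which is worth checking against the SDK documentation. `build_job_request`, the file IDs, and the suffix value are hypothetical.

```python
def build_job_request(base_model, train_id, val_id=None, n_epochs=None, suffix=None):
    """Assemble keyword arguments for a fine-tuning job, skipping unset options."""
    request = {"model": base_model, "training_file": train_id}
    if val_id is not None:
        request["validation_file"] = val_id
    if n_epochs is not None:
        # Assumed parameter shape; omitted entirely to keep the 'auto' default
        request["hyperparameters"] = {"n_epochs": n_epochs}
    if suffix is not None:
        request["suffix"] = suffix
    return request


# Placeholder file IDs, not real uploads
request = build_job_request(
    "gpt-3.5-turbo", "file-TRAIN", "file-VAL", n_epochs=3, suffix="yahoo-qa"
)
print(request["hyperparameters"])  # prints {'n_epochs': 3}

# With a configured client, the request would be sent as:
# client.fine_tuning.jobs.create(**request)
```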

The `fine_tuned_model` field in the response is `None`, which means that fine-tuning is not complete yet. We'll poll the job in a loop and wait for fine-tuning to complete.

In [21]:
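client = openai.OpenAI(api_key=openai_api_key)  # create the client once, outside the loop

while True:
    fine_tuning_response = client.fine_tuning.jobs.retrieve(fine_tuning_job_id)
    fine_tuned_model_ID = fine_tuning_response.fine_tuned_model

    if fine_tuned_model_ID is not None:
        print("Fine-tuning completed!")
        print(f"Fine-tuned model ID: {fine_tuned_model_ID}")
        break
    elif fine_tuning_response.status in ("failed", "cancelled"):
        # Stop polling instead of looping forever when the job cannot finish
        print(f"Fine-tuning ended with status: {fine_tuning_response.status}")
        break
    else:
        print("Fine-tuning in progress...")
        sleep(200)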

Fine-tuning in progress...
Fine-tuning in progress...
Fine-tuning in progress...
Fine-tuning in progress...
Fine-tuning in progress...
Fine-tuning completed!
Fine-tuned model ID: ft:gpt-3.5-turbo-0125:personal::9Gynxg6K
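
One way to make the polling reusable is to wrap it in a function that returns on any terminal status and gives up after a timeout. This is a sketch under stated assumptions: `wait_for_job` is a hypothetical helper, the terminal status names (`succeeded`, `failed`, `cancelled`) are assumed from the job status values the API reports, and the demo uses a fake `retrieve` function so that it runs without an API key.

```python
import time
from types import SimpleNamespace

# Assumed terminal statuses of a fine-tuning job
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id, poll_seconds=60, timeout_seconds=3600, sleep=time.sleep):
    """Poll retrieve(job_id) until the job reaches a terminal status or times out."""
    waited = 0
    while True:
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        if waited >= timeout_seconds:
            raise TimeoutError(f"job {job_id} still '{job.status}' after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Fake retrieve function standing in for client.fine_tuning.jobs.retrieve
_fake_statuses = iter(["validating_files", "running", "succeeded"])


def fake_retrieve(job_id):
    status = next(_fake_statuses)
    model = "ft:hypothetical-model" if status == "succeeded" else None
    return SimpleNamespace(id=job_id, status=status, fine_tuned_model=model)


job = wait_for_job(fake_retrieve, "ftjob-demo", poll_seconds=0, sleep=lambda s: None)
print(job.status, job.fine_tuned_model)  # prints succeeded ft:hypothetical-model
```

With a real client, the call would look like `wait_for_job(client.fine_tuning.jobs.retrieve, fine_tuning_job_id)`, after which `job.status` tells whether `job.fine_tuned_model` can be used.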


In [22]:
def answer_question(question, model_ID):
    client = openai.OpenAI(api_key=openai_api_key)

    # System instruction followed by the user's question
    messages = [
        {
            "role": "system",
            "content": "You are the Yahoo platform user's assistant. Please reply to users' questions using polite and respectful language."
        },
        {
            "role": "user",
            "content": question
        }
    ]

    # Request a chat completion from the given model
    model_completion = client.chat.completions.create(model=model_ID,
                                                      messages=messages)

    # Return the text of the first choice
    response = model_completion.choices[0].message

    return response.content

In [23]:
question = "How to invest in stocks?"
response_fine_tuned_model = answer_question(question, fine_tuned_model_ID)
print(f"Fine-tuned model response: \n{response_fine_tuned_model}")

Fine-tuned model response: 
online trading is a good idea for a test run but personally i would go to a stock broker that offers free investemtn consultatrions, citi bank is one bank i know that you can do so, Bear Sterns is another ....委.._SY。..。registered credit cards work well too and suddenly stocks are sort of gambling but you can actually deduct up tp $3000 of loses..but that is really another type of account where you deposit money and they control the type of stocks you can invest in.. happy investing.. (ps and your money is your own)


In [24]:
response_base_model = answer_question(question, 'gpt-3.5-turbo')

print(f"Base model response: \n{response_base_model}")

Base model response: 
Investing in stocks can be a great way to grow your wealth, but it's important to do thorough research and understand the risks involved. Here are some steps you can take to start investing in stocks:

1. Educate yourself: Take the time to learn about the stock market, different investment strategies, and how to analyze companies.

2. Set financial goals: Determine your investment goals, risk tolerance, and timeframe for investing.

3. Choose a brokerage account: Open an account with a reputable brokerage firm that offers trading services.

4. Build a diversified portfolio: Consider investing in a mix of stocks from different sectors to spread out your risk.

5. Start small: Begin by investing a small amount of money until you gain more experience and confidence in your investment decisions.

6. Monitor your investments: Regularly review your portfolio and make adjustments as needed based on market conditions and your financial goals.

Remember, investing in stock