<a href="https://colab.research.google.com/github/john-telfeyan/toolbox/blob/master/language_ai/OpenAI_FineTune_Completion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


**Synopsis**: Create a fine-tuned model for OpenAI and use it for prompt completion, text classification, entity extraction, and/or summarization

**Created**:  July 2023

**Author**:   [John Telfeyan](https://mailhide.io/e/mMkX3)

**Distribution**: [MIT Open Source Copyright](https://gist.github.com/john-telfeyan/2565b2904355410c1e75f27524aeea5f#file-license-md)

**Sources**:  
https://github.com/Kirili4ik/ruDialoGpt3-finetune-colab  
https://github.com/openai/openai-cookbook/blob/main/examples/Fine-tuned_classification.ipynb  
https://stackoverflow.com/questions/75774873/openai-chatgpt-gpt-3-5-api-error-this-is-a-chat-model-and-not-supported-in-t

## Preparations
Dependencies and connection to Google Drive


In [1]:
!pip install --upgrade openai

Collecting openai
  Downloading openai-0.27.8-py3-none-any.whl (73 kB)
Installing collected packages: openai
Successfully installed openai-0.27.8


In [2]:
import os
import openai
import pandas as pd

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Add your API key for OpenAI
Get one here first:  
 https://platform.openai.com/account/api-keys


In [4]:
# Best practice: store your key in a git-ignored file
# To-do: use secure string
key_file="/content/drive/MyDrive/secrets/fine-tuning.openai.key.txt"

with open(key_file, 'r') as f:
  key = f.read().strip()  # strip any trailing newline, which would corrupt the key

# Set an OS environment variable so !bang commands will work
os.environ['OPENAI_API_KEY'] = key

# Set the module-level attribute so Python calls will work
openai.api_key = key
#openai.api_key_path = key_file # alternatively read from a file if only python

# Key should be about 50 chars
print(len(openai.api_key))

51
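The cell above assumes the key file always exists. A minimal sketch of a slightly more defensive loader follows; `load_api_key` is a hypothetical helper written for this notebook (it is not part of the `openai` package), and the throwaway file and fake key are only there to make the example runnable.

```python
import os
import tempfile


def load_api_key(path, env_var="OPENAI_API_KEY"):
    """Return the API key stored at `path`, falling back to an environment variable.

    Hypothetical helper for illustration; not part of the openai package.
    """
    if os.path.exists(path):
        with open(path, "r") as f:
            key = f.read().strip()  # drop a trailing newline or stray spaces
    else:
        key = os.environ.get(env_var, "").strip()
    if not key:
        raise ValueError("No API key found in the file or the environment")
    return key


# Usage with a throwaway file standing in for the real key file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("sk-example-not-a-real-key\n")
demo_key = load_api_key(tmp.name)
os.remove(tmp.name)
print(len(demo_key))
```

The returned value could then be assigned to `openai.api_key` and `os.environ['OPENAI_API_KEY']` exactly as in the cell above.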


### Just want to use ChatGPT without fine-tuning?

Chat models take a list of messages as input and return a model-generated message as output. Although the chat format is designed to make multi-turn conversations easy, it’s just as useful for single-turn tasks without any conversation.  

The main input is the `messages` parameter. It must be an array of message objects, where each object has a role (either "system", "user", or "assistant") and content. Conversations can be as short as one message or run to many back-and-forth turns. A "system" message lets you modify the personality of the assistant or give specific instructions about how it should behave throughout the conversation.

In this example we tell the bot to act as a text classifier. The output we are looking for is "other", which we find in the response at
```bash
<OpenAIObject chat.completion>
choices -> message -> content
```
Read more:  
https://platform.openai.com/docs/guides/gpt/chat-completions-api

In [6]:


openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
        {"role": "system", "content": "You are a text classification bot that responds with one-word answers."},
        {"role": "user", "content": "Classify this text as related to a 'broken' item, 'training' exercise, or 'other': 'The wing broke off my drone.'"},
        {"role": "assistant", "content": "Broken"},
        {"role": "user", "content": "The bird flew sucessfuly"}]
)


<OpenAIObject chat.completion id=chatcmpl-7bFCO1rdMzCoFoPYKKR1Lt2NuIlVz at 0x7f4b7eafc540> JSON: {
  "id": "chatcmpl-7bFCO1rdMzCoFoPYKKR1Lt2NuIlVz",
  "object": "chat.completion",
  "created": 1689110916,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Other"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 69,
    "completion_tokens": 1,
    "total_tokens": 70
  }
}
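To pull the label out of a response like the one above, index into `choices -> message -> content`. The sketch below uses a plain dict as a stand-in for the returned `OpenAIObject` (which supports the same key-based access), so it runs without an API call; `extract_label` is a hypothetical helper name.

```python
# A plain dict standing in for the OpenAIObject shown above
response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Other"},
            "finish_reason": "stop",
        }
    ],
}


def extract_label(resp):
    """Return the first choice's message content, stripped and lowercased."""
    return resp["choices"][0]["message"]["content"].strip().lower()


label = extract_label(response)
print(label)
```

Lowercasing makes the result easier to compare against the expected class names ('broken', 'training', 'other'), since the model capitalized its answer here.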

### Fine Tuning Example
Now let's fine-tune a model to do this (even better?) using training data that we had ChatGPT generate and then corrected by hand.

In [7]:
df = pd.read_csv("/content/drive/MyDrive/proj/chatgpt-fine-tune/training-data-source/broken_items_03.csv")

In [9]:
ex_df = df[["Text","Broken"]].copy() # extract a subset as its own DataFrame (avoids SettingWithCopyWarning)
ex_df.columns=["prompt", "completion"] # these column names are mandatory
ex_df.to_json("broken_03.jsonl", orient='records', lines=True) # mandatory format: JSON Lines, one record per line

In [None]:
!openai tools fine_tunes.prepare_data -f broken_03.jsonl -q


Analyzing...

- Your file contains 105 prompt-completion pairs
- The `prompt` column/key should be lowercase
- The `completion` column/key should be lowercase
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- There are 18 duplicated prompt-completion sets. These are rows: [39, 41, 42, 43, 44, 46, 47, 49, 50, 51, 53, 55, 57, 58, 60, 61, 63, 64]
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave 
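The tool's remarks about duplicates and a missing separator can also be handled by hand before uploading. The following is a rough sketch under stated assumptions, not what `prepare_data` does internally: `prepare_records` is a hypothetical helper, the `" ->"` separator is an assumed choice, the leading space on each completion follows the legacy fine-tuning guidance, and the three rows are made-up examples.

```python
import json

SEPARATOR = " ->"  # assumed separator; any string that never occurs in the prompts would do


def prepare_records(rows, separator=SEPARATOR):
    """Drop exact duplicate pairs, then add the prompt separator and completion prefix."""
    seen = set()
    records = []
    for prompt, completion in rows:
        pair = (prompt.strip(), completion.strip())
        if pair in seen:
            continue  # skip exact duplicates
        seen.add(pair)
        records.append({
            "prompt": pair[0] + separator,  # marks where the completion should begin
            "completion": " " + pair[1],    # completions start with a space
        })
    return records


# Made-up rows for illustration only
rows = [
    ("The wing fell off", "Broken"),
    ("The wing fell off", "Broken"),
    ("Bo went out to meet Bojangles", "Other"),
]
records = prepare_records(rows)
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)
```

Whatever separator is used for training has to be appended to every prompt at inference time as well.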

In [None]:
!openai api fine_tunes.create -t "broken_03_prepared_train.jsonl" -v "broken_03_prepared_valid.jsonl" --compute_classification_metrics --classification_n_classes 3

Upload progress: 100% 12.0k/12.0k [00:00<00:00, 12.6Mit/s]
Uploaded file from broken_03_prepared_train.jsonl: file-JRhZtUbvSShVYWCS9tCYrwkq
Upload progress: 100% 3.10k/3.10k [00:00<00:00, 5.01Mit/s]
Uploaded file from broken_03_prepared_valid.jsonl: file-xjqa9VtGTnRGUu0zu77LhS3k
Created fine-tune: ft-FWcyueisQGBddHjFgZMBtr5C
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-07-10 03:32:46] Created fine-tune: ft-FWcyueisQGBddHjFgZMBtr5C

Stream interrupted (client disconnected).
To resume the stream, run:

  openai api fine_tunes.follow -i ft-FWcyueisQGBddHjFgZMBtr5C



In [None]:
!openai api fine_tunes.follow -i ft-FWcyueisQGBddHjFgZMBtr5C

[2023-07-10 03:32:46] Created fine-tune: ft-FWcyueisQGBddHjFgZMBtr5C
[2023-07-10 05:21:44] Fine-tune costs $0.02
[2023-07-10 05:21:44] Fine-tune enqueued. Queue number: 4
[2023-07-10 05:24:34] Fine-tune is in the queue. Queue number: 3
[2023-07-10 05:24:40] Fine-tune is in the queue. Queue number: 2
[2023-07-10 05:24:42] Fine-tune is in the queue. Queue number: 1
[2023-07-10 05:24:52] Fine-tune started
[2023-07-10 05:26:05] Completed epoch 1/4
[2023-07-10 05:26:19] Completed epoch 2/4
[2023-07-10 05:26:31] Completed epoch 3/4
[2023-07-10 05:26:43] Completed epoch 4/4
[2023-07-10 05:27:05] Uploaded model: curie:ft-personal-2023-07-10-05-27-04
[2023-07-10 05:27:06] Uploaded result file: file-wAz5UzjfNKjduuzV87cfJLzk
[2023-07-10 05:27:06] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try out your fine-tuned model:

openai api completions.create -m curie:ft-personal-2023-07-10-05-27-04 -p <YOUR_PROMPT>


In [None]:
#openai api fine_tunes.follow -i ft-FWcyueisQGBddHjFgZMBtr5C
!openai api fine_tunes.results -i ft-FWcyueisQGBddHjFgZMBtr5C > result.csv
results = pd.read_csv('result.csv')
results[results['classification/accuracy'].notnull()].tail(1)

Unnamed: 0,step,elapsed_tokens,elapsed_examples,training_loss,training_sequence_accuracy,training_token_accuracy,validation_loss,validation_sequence_accuracy,validation_token_accuracy,classification/accuracy,classification/weighted_f1_score
276,277,8085,277,0.006622,1.0,1.0,,,,1.0,1.0


## Use the new fine-tuned model


In [None]:
!openai api completions.create -m curie:ft-personal-2023-07-10-05-27-04 -p "This csv lists text about pieces of equipment and then classifies them as refering to a broken item or other engagement:\n\n Bo went out to meet bojangles, other\nthe wing fell off,"

This csv lists text about pieces of equipment and then classifies them as refering to a broken item or other engagement:\n\n Bo went out to meet bojangles, other\nthe wing fell off, still functional. Requesting evaluation -> Broken -> Broken -> Broken -> Broken -> Broken

In [None]:
openai.Completion.create(
    model="curie:ft-personal-2023-07-10-05-27-04",
    prompt="bojangles dropped the glass; it was unharmed by the unit")

<OpenAIObject text_completion id=cmpl-7ayfYhPPERHFENAaslGDLX0OvR9Sp at 0x7ff8d3e90e50> JSON: {
  "id": "cmpl-7ayfYhPPERHFENAaslGDLX0OvR9Sp",
  "object": "text_completion",
  "created": 1689047376,
  "model": "curie:ft-personal-2023-07-10-05-27-04",
  "choices": [
    {
      "text": "'s protection mechanism -> Broken -> Broken -> Broken -> Broken -> Broken -> Broken ->",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 16,
    "total_tokens": 31
  }
}
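The repeated ` -> Broken` in the output above, together with `"finish_reason": "length"` after 16 completion tokens, suggests two things: the prepared training file probably used ` ->` as its prompt separator, and the request ran until the default token limit. A sketch of a request that accounts for both follows. The separator value is an assumption inferred from that output, `build_completion_request` is a hypothetical helper, and `max_tokens=1` only makes sense if each class label is a single token.

```python
SEPARATOR = " ->"  # assumption: inferred from the " -> Broken" pattern in the output above


def build_completion_request(text, model, separator=SEPARATOR):
    """Build keyword arguments for openai.Completion.create (legacy Completions API)."""
    return {
        "model": model,
        "prompt": text.strip() + separator,  # end the prompt the way training prompts ended
        "max_tokens": 1,    # stop after one token; assumes single-token labels
        "temperature": 0,   # always take the most likely label
    }


params = build_completion_request(
    "bojangles dropped the glass; it was unharmed by the unit",
    model="curie:ft-personal-2023-07-10-05-27-04",
)
print(params["prompt"])
# To send the request: openai.Completion.create(**params)
```

The call itself is left commented out so the cell runs without spending tokens.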

In [12]:
# Can we do better?

openai.Completion.create(
  model="curie:ft-personal-2023-07-10-05-27-04",
  messages=[
        {"role": "system", "content": "You are a text classification bot that responds with one-word answers."},
        {"role": "user", "content": "Classify this text as related to a 'broken' item, 'training' exercise, or 'other': 'The wing broke off my drone.'"},
        {"role": "assistant", "content": "Broken"},
        {"role": "user", "content": "The bird flew sucessfuly"}]
)

InvalidRequestError: ignored

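## Uh-oh!
Looks like we can't use our handy role + chat sequence with fine-tuned models that easily. `openai.Completion.create` takes a single `prompt` string rather than a `messages` list, and our fine-tuned `curie` model is a completion model, so we have to choose between the two.  
**To-do**:
 - Try to apply a "role" to our fine-tuned model
 - Compare gpt-3.5 to our fine-tuned model
 - Improve our training data so this model works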
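One possible starting point for the first to-do item is to flatten the chat-style messages into a single prompt string that a completion model can accept. This is an untested sketch: `messages_to_prompt` is a hypothetical helper, the `" ->"` separator is an assumption, and a model fine-tuned on bare prompts may not respond well to the extra instructions.

```python
def messages_to_prompt(messages, separator=" ->"):
    """Flatten chat-style messages into one prompt string for a completion model."""
    lines = []
    for msg in messages:
        if msg["role"] == "assistant" and lines:
            # Attach the example label to the preceding user line
            lines[-1] += " " + msg["content"]
        elif msg["role"] == "user":
            lines.append(msg["content"] + separator)
        else:
            # System text (or a leading assistant message) is kept as a plain line
            lines.append(msg["content"])
    return "\n".join(lines)


messages = [
    {"role": "system", "content": "Classify each text as 'broken', 'training', or 'other'."},
    {"role": "user", "content": "The wing broke off my drone."},
    {"role": "assistant", "content": "Broken"},
    {"role": "user", "content": "The bird flew successfully"},
]
prompt = messages_to_prompt(messages)
print(prompt)
```

The resulting string ends with the separator after the last user message, so it could be passed as `prompt=` to `openai.Completion.create`.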