# Next Reaction

### For a set of recent messages, try to predict the reaction (if any) to the next message.

In [None]:
!pip3 install --upgrade openai -q

In [35]:
import json, openai, random, time
from openai import cli
from types import SimpleNamespace

work_dir = ""

# Remember to remove your key from your code when you're done.
openai.api_key = ""

## Initial Data Output Stage

This is a generative problem, so the requirements on the training data are looser than they would be for classification.

Goals:
* the prompt and completion of each example together must not exceed 2048 tokens
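
A rough pre-check for this goal could look like the sketch below. It assumes the common rule of thumb of about four characters per token for English text, which is only an approximation; a real tokenizer would give exact counts. `estimate_tokens` and `fits_token_limit` are hypothetical helper names, not part of the `openai` package.

```python
import math

MAX_TOKENS = 2048        # limit stated in the goals above
CHARS_PER_TOKEN = 4      # ASSUMPTION: rough average for English text


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate based on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_token_limit(prompt: str, completion: str, limit: int = MAX_TOKENS) -> bool:
    """Check that prompt and completion together stay within the limit."""
    return estimate_tokens(prompt) + estimate_tokens(completion) <= limit


# Hypothetical example in the prompt/completion format used in this notebook
print(fits_token_limit("hello there\n\n###\n\n", ' ["wave"] END'))
```

Because the estimate is coarse, examples that land close to the limit are worth checking with an exact tokenizer.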

In [37]:
file1 = open(f'{work_dir}/messages.jsonl', 'r')
file2 = open(f'{work_dir}/next_reaction.jsonl', 'w')

current_channel = ""
previous_training_example = ""
recent_messages = []
count = 0

min_messages = 2
max_messages = 5
between_message_separator = " \n\n\n "
prompt_separator = "\n\n###\n\n"
completion_separator = " END"
reverse_messages = True

# iterate over the input file one line at a time
for line in file1:

    data = json.loads(line)
    text = data["text"].strip()

    # skip message if empty
    if text == "":
        continue

    # skip message if it contains a code block
    if text.find("```") >= 0:
        continue

    channel = data["channel"]
    user_id = data["user"]
    reactions = data["reactions"]

    # restart message history count if the channel changes in the file
    if channel != current_channel:
        current_channel = channel
        recent_messages = []
        print(f'New channel: {current_channel}')

    # add the recent message to the history.
    # remove the oldest message if the size window is exceeded
    recent_messages.append(text)
    if len(recent_messages) > max_messages:
        recent_messages.pop(0)

    # if we are in the message history size window, prep to output an example
    if len(recent_messages) > min_messages:
        messages = recent_messages.copy()
        if reverse_messages:
            messages.reverse()
        
        prompt = between_message_separator.join(messages) 
        completion = json.dumps(reactions) if reactions else ""

        training_example = { "prompt": f'{prompt}{prompt_separator}', "completion": f' {completion}{completion_separator}' }
        training_example = json.dumps(training_example)

        # if training example doesn't match the previous one, then output
        if previous_training_example != training_example:
            file2.write(training_example + "\n")

        previous_training_example = training_example

file1.close()
file2.close()

New channel: articles
New channel: conferences
New channel: dev
New channel: dev-ops
New channel: fluff-posting
New channel: games
New channel: general
New channel: inbound-leads
New channel: on-call
New channel: product
New channel: random
New channel: rust
New channel: sales-team
New channel: team-api
New channel: team-compute


## Data Verification Stage

* make sure all prompts end with the same suffix
* make sure each example has fewer than 2048 tokens
* ignore all other analysis
  * we are training for *conditional generation*, but the data prep tool incorrectly assumes we are fine-tuning for *classification*
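
The suffix check can also be done by hand on the file written in the previous stage. The sketch below scans JSONL lines and reports the examples whose prompt or completion lacks the expected ending. `check_suffixes` is a hypothetical helper name; it is not part of the data prep tool.

```python
import json

PROMPT_SUFFIX = "\n\n###\n\n"
COMPLETION_SUFFIX = " END"


def check_suffixes(lines):
    """Return (total, bad_line_numbers) for an iterable of JSONL lines."""
    total = 0
    bad = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        total += 1
        example = json.loads(line)
        if not (example["prompt"].endswith(PROMPT_SUFFIX)
                and example["completion"].endswith(COMPLETION_SUFFIX)):
            bad.append(number)
    return total, bad


# Hypothetical in-memory sample; the second line is missing the prompt suffix
sample = [
    json.dumps({"prompt": "hi\n\n###\n\n", "completion": " [] END"}),
    json.dumps({"prompt": "no suffix", "completion": " END"}),
]
print(check_suffixes(sample))
```

An open file handle can be passed in place of `sample`, since iterating over a file yields its lines.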

In [38]:
args = SimpleNamespace(file=f'{work_dir}/next_reaction.jsonl', quiet=True)
cli.FineTune.prepare_data(args)

Analyzing...

- Your file contains 19672 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- More than a third of your `completion` column/key is uppercase. Uppercase completions tends to perform worse than a mixture of case encountered in normal language. We recommend to lower case the data if that makes sense in your domain. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details
- All prompts end with suffix `\n\n###\n\n`

Based on the analysis we will perform the following actions:
- [Recommended] Lowercase all your data in column/key `completion` [Y/n]: Y
- [Recommended] Would you like to split into training and validation set? [Y/n]: Y


Your data will be 

## Model Training Stage

* First, upload the training file to OpenAI

In [30]:
training_file_name = f'{work_dir}/next_reaction.jsonl'

def check_status(training_id):
    train_status = openai.File.retrieve(training_id)["status"]
    print(f'Status (training_file): {train_status} ')
    return train_status

# Upload the training dataset file to OpenAI.
training_id = cli.FineTune._get_or_upload(training_file_name, True)

# Check on the upload status of the training dataset file.
train_status = check_status(training_id)

# Poll and display the upload status once a second until the file has either
# been processed or failed to upload.
while train_status not in ["succeeded", "failed", "processed"]:
    time.sleep(1)
    train_status = check_status(training_id)

Upload progress: 100%|██████████| 15.0M/15.0M [00:00<00:00, 30.6Git/s]


Uploaded file from /Users/eric.pinzur/Documents/slackbot2000/next_reaction.jsonl: file-SZwB3ECG1qIRZfE3DOA6cTOr
Status (training_file): uploaded 
Status (training_file): uploaded 
Status (training_file): uploaded 
Status (training_file): uploaded 
Status (training_file): uploaded 
Status (training_file): uploaded 
Status (training_file): uploaded 
Status (training_file): processed 


In [31]:
training_id="file-SZwB3ECG1qIRZfE3DOA6cTOr"

## Start a fine-tuning Job

* no validation file since this is not a classification problem

In [32]:
# This example defines a fine-tune job that creates a customized model based on ada,
# with just a single pass through the training data. No validation file is supplied,
# so no validation metrics are computed.
create_args = {
    "training_file": training_id,
    "model": "ada",
    "n_epochs": 1,
    "suffix": "next_reaction_full_kaskada"
}
# Create the fine-tune job and retrieve the job ID
# and status from the response.
resp = openai.FineTune.create(**create_args)
job_id = resp["id"]
status = resp["status"]

# You can use the job ID to monitor the status of the fine-tune job.
# The fine-tune job may take some time to start and complete.
print(f'Fine-tuning model with job ID: {job_id}')

Fine-tuning model with job ID: ft-Uy0jb2sUxNxE347gtk2Ixd5p


In [33]:
job_id = "ft-Uy0jb2sUxNxE347gtk2Ixd5p"

## Wait for the fine-tuning to start

* Note that it can take several hours for the job to move from the `pending` state

In [None]:
# Get the status of our fine-tune job.
status = openai.FineTune.retrieve(id=job_id)["status"]

# If the job isn't yet done, poll it every 5 seconds.
if status not in ["succeeded", "failed"]:
    print(f'Job not in terminal status: {status}. Waiting.')
    while status not in ["succeeded", "failed"]:
        time.sleep(5)
        status = openai.FineTune.retrieve(id=job_id)["status"]
        print(f'Status: {status}')
else:
    print(f'Fine-tune job {job_id} finished with status: {status}')
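
Since the job can sit in `pending` for hours, it may help to factor the polling loop above into a helper with an optional timeout. This is a minimal sketch, not the notebook's method: `wait_for_status` is a hypothetical name, and the status source is passed in as a function so the demo runs without any API call.

```python
import time

TERMINAL_STATUSES = ("succeeded", "failed")


def wait_for_status(fetch_status, interval=5.0, timeout=None, sleep=time.sleep):
    """Poll fetch_status() until it returns a terminal status.

    Returns the final status, or raises TimeoutError once `timeout`
    seconds of waiting have accumulated.
    """
    waited = 0.0
    status = fetch_status()
    while status not in TERMINAL_STATUSES:
        if timeout is not None and waited >= timeout:
            raise TimeoutError(f"still '{status}' after {waited:.0f}s")
        sleep(interval)
        waited += interval
        status = fetch_status()
    return status


# Demo with a fake status source and a no-op sleep
fake_statuses = iter(["pending", "running", "succeeded"])
final = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final)
```

In this notebook, `fetch_status` could be `lambda: openai.FineTune.retrieve(id=job_id)["status"]`, which is the same call the cell above makes.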


## Check fine-tuning events

* Lets us know specifics about the fine-tuning job


In [None]:
# Get the events of our fine-tune job.
events = openai.FineTune.stream_events(id=job_id)

for event in events:
    print(event)
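
Printing whole event objects is verbose. The sketch below turns one event into a single log line. It assumes each event carries `created_at` (a Unix timestamp), `level`, and `message` fields; check the printed events above before relying on those names. `format_event` is a hypothetical helper.

```python
from datetime import datetime, timezone


def format_event(event):
    """Render one event as 'YYYY-MM-DD HH:MM:SS [level] message' in UTC."""
    created = datetime.fromtimestamp(event["created_at"], tz=timezone.utc)
    stamp = created.strftime("%Y-%m-%d %H:%M:%S")
    return f'{stamp} [{event["level"]}] {event["message"]}'


# Hypothetical event, shaped like the assumed fields above
sample_event = {"created_at": 0, "level": "info", "message": "Created fine-tune"}
print(format_event(sample_event))
```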


## Look at Training Results

* download the results file(s)

In [50]:
file_prefix = "next_reaction_results"

result = openai.FineTune.retrieve(id=job_id)
# number the output files so that multiple result files don't overwrite each other
for count, result_file in enumerate(result["result_files"]):
    file_name = f'{work_dir}/{file_prefix}_{count}.csv'
    with open(file_name, 'wb') as file:
        file.write(openai.File.download(id=result_file["id"]))
    print(f'Outputted results to: {file_name}')


Outputted results to: /Users/eric.pinzur/Documents/slackbot2000/next_reaction_results_0.csv
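
Once downloaded, the results file can be summarized with the standard `csv` module. The sketch below assumes the CSV has `step` and `training_loss` columns; open the downloaded file to confirm the actual header before using it. The sample numbers are made up, and `last_training_loss` is a hypothetical helper.

```python
import csv
import io


def last_training_loss(csv_text, loss_column="training_loss"):
    """Return (step, loss) from the last row that has a loss value."""
    last = None
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = row.get(loss_column)
        if value:  # skip rows where the column is empty or missing
            last = (int(row["step"]), float(value))
    return last


# Made-up numbers, only to show the shape of the data
sample_csv = "step,training_loss\n1,0.9\n2,0.5\n3,\n"
print(last_training_loss(sample_csv))
```

For a real file, read it as text first (for example `open(file_name).read()`) and pass the contents in.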
