# BeepGPT Example

In this notebook, you'll see how to train BeepGPT on your Slack history in about 15 minutes using only OpenAI's APIs and open-source Python libraries. No data science PhD required.

We'll train BeepGPT in four steps:
1. Pull down historical messages
2. Build training examples
3. Convert our examples into a training dataset of prompt/completion pairs
4. Send our training data to OpenAI and create a fine-tuning job

In [None]:
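# scikit-learn is needed for label encoding; the fine-tuning cells use the legacy (pre-1.0) openai interface
%pip install pandas pyarrow scikit-learn "openai<1" kaskada==0.6.0a0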

In [None]:
import datetime
import getpass

import kaskada as kt
import openai
import pandas
import pyarrow

# Initialize Kaskada with a local execution context.
kt.init_session()

## Read Historical Messages

Historical Slack messages can be exported by following the instructions in Slack's [Export your workspace data](https://slack.com/help/articles/201658943-Export-your-workspace-data) help article. We'll use these messages to teach BeepGPT about the members of your workspace.
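
The cell below reads a single Parquet file, while a Slack export is usually a zip archive with one folder per channel and one JSON file per day. The following is a minimal sketch of flattening such an export, assuming the layout `<export_dir>/<channel>/<date>.json`; `flatten_export` and `export_dir` are hypothetical names used only for illustration and are not part of Slack's or Kaskada's tooling.

```python
import json
from pathlib import Path


def flatten_export(export_dir):
    """Collect every message in an unzipped Slack export into a list of rows.

    Assumes each <channel>/<date>.json file holds a JSON array of message
    objects. Top-level files such as users.json are ignored.
    """
    rows = []
    for day_file in sorted(Path(export_dir).glob("*/*.json")):
        channel = day_file.parent.name
        for message in json.loads(day_file.read_text(encoding="utf-8")):
            if not isinstance(message, dict):
                continue  # skip anything that is not a message object
            row = dict(message)
            row["channel"] = channel  # the export stores the channel only in the path
            rows.append(row)
    return rows
```

The resulting rows could then be written out with something like `pandas.DataFrame(rows).to_parquet("messages.parquet")`. Slack exports typically store `ts` as a string of epoch seconds, so it may need converting to a timestamp before it is used as the time column.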

In [None]:
# Load events from a Parquet file
# Use the "ts" column as the time associated with each row, 
# and the "channel" column as the entity associated with each row.
messages = kt.sources.Parquet(
    path = "./messages.parquet", 
    time_column_name = "ts", 
    key_column_name = "channel",
)

# View the first 5 events
messages.preview(5)

## Build examples

Fine-tuning examples teach the model which users are interested in a given conversation. Each example consists of a "prompt" containing the state of a conversation at a point in time and a "completion" containing the users (if any) who were interested in the conversation. BeepGPT measures interest in several ways, such as replying to a message or adding an emoji reaction.

In [None]:
# Re-group messages by thread and/or channel
# Slack messages are delivered chronologically, so messages in threads
# may be interleaved with messages in the main channel.
messages = messages.with_key(kt.record({
        "channel": messages.col("channel"),
        "thread": messages.col("thread_ts"),
    }))

# Build the GPT input prompt by collecting relevant fields of recent messages
conversations = (
    messages
    .select("user", "ts", "text", "reactions")
    .collect(max=20)
)


# Shift the prompt forward in time 5m to observe the effects of the conversation
# (this must match the 5m trailing windows used below)
shifted_conversations = conversations.shift_by(datetime.timedelta(minutes=5))

# Collect all the users who reacted to the conversation in the past 5m
# (the period of time the prompt was shifted across)
reaction_users = (
    messages
    .collect(window=kt.windows.Trailing(datetime.timedelta(minutes=5)), max=100)
    .col("reactions").flatten()
    .col("users").flatten()
)

# Collect all the users who posted messages in the past 5m
participating_users = (
    messages
    .collect(window=kt.windows.Trailing(datetime.timedelta(minutes=5)), max=100)
    .col("user")
)

# Build a fine-tuning example mapping a conversation to the users who
# reacted to it or posted in it afterwards
history = (
    kt.record({
        "conversation": shifted_conversations, 
        "engaged_users": reaction_users.union(participating_users),
    })
    .filter(shifted_conversations.is_not_null())
)

history.preview(5)
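
To make the shift-and-window idea easier to picture, here is a rough plain-Python sketch of the same pairing for a single thread. It is an approximation for illustration, not what Kaskada executes: `build_examples`, `horizon`, and `max_history` are hypothetical names, and details such as tie-breaking on equal timestamps may differ.

```python
from datetime import datetime, timedelta


def build_examples(messages, horizon=timedelta(minutes=5), max_history=20):
    """Pair each conversation snapshot with the users seen in the next `horizon`.

    `messages` is a list of dicts with "ts" (datetime), "user", "text" and an
    optional "reactions" list, sorted by "ts" and belonging to one thread.
    """
    examples = []
    for i, current in enumerate(messages):
        # Prompt: up to `max_history` most recent messages, ending at `current`.
        conversation = messages[max(0, i + 1 - max_history): i + 1]
        # Label: everyone who posted, or reacted to a post, within `horizon`
        # after `current`.
        engaged = set()
        for later in messages[i + 1:]:
            if later["ts"] - current["ts"] > horizon:
                break
            engaged.add(later["user"])
            for reaction in later.get("reactions") or []:
                engaged.update(reaction["users"])
        examples.append(
            {"conversation": conversation, "engaged_users": sorted(engaged)}
        )
    return examples


# Tiny made-up thread: U2 replies after 2 minutes and U3 reacts to that reply.
start = datetime(2023, 1, 1, 12, 0)
thread = [
    {"ts": start, "user": "U1", "text": "a"},
    {"ts": start + timedelta(minutes=2), "user": "U2", "text": "b",
     "reactions": [{"name": "+1", "users": ["U3"]}]},
    {"ts": start + timedelta(minutes=30), "user": "U1", "text": "c"},
]
examples = build_examples(thread)
```

In this toy thread the first snapshot is labeled with U2 and U3, while the later snapshots have no engaged users because nothing happens within five minutes of them.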

## Create training dataset

To prepare our fine-tuning data for OpenAI, we'll use scikit-learn for preprocessing. This step ensures that each user is represented by a single "token", and that the conversation is formatted in a way that is easy for the model to learn.

In [None]:
import json

from sklearn import preprocessing

# Extract examples from historical data
history_df = history.run().to_pandas().drop(["_time", "_subsort", "_key_hash", "_key"], axis=1)


# Encode user ID labels
le = preprocessing.LabelEncoder()
le.fit(history_df.engaged_users.explode())
with open('labels_.json', 'w') as f:
    json.dump(le.classes_.tolist(), f)


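# Format for the OpenAI API
def format_prompt(conversation):
    # One " user --> text " entry per message, terminated by a fixed separator
    turns = [f' {msg["user"]} --> {msg["text"]} ' for msg in conversation]
    return "start -> " + "\n\n".join(turns) + "\n\n###\n\n"


def format_completion(engaged_users):
    # Space-separated encoded user labels, or "nil" when nobody engaged
    if len(engaged_users) > 0:
        labels = " ".join(le.transform(engaged_users).astype(str))
    else:
        labels = "nil"
    return " " + labels + " end"
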
examples_df = pandas.DataFrame({
    "prompt": history_df.conversation.apply(format_prompt),
    "completion": history_df.engaged_users.apply(format_completion),
})

# Write examples to file
examples_df.to_json("examples.jsonl", orient='records', lines=True)
print("Wrote examples to 'examples.jsonl'")
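
For a concrete picture of what one line of `examples.jsonl` contains, the sketch below rebuilds the same prompt and completion strings using only the standard library. It assumes user labels are assigned in sorted order of user ID, which is how `LabelEncoder` orders its classes; `encode_users` and `format_example` are hypothetical helpers, and the user IDs are made up.

```python
import json

SEPARATOR = "\n\n###\n\n"  # marks the end of the prompt
STOP = " end"              # marks the end of the completion


def encode_users(all_users):
    """Map each user ID to a small integer, in sorted order of user ID."""
    return {user: index for index, user in enumerate(sorted(set(all_users)))}


def format_example(conversation, engaged_users, user_index):
    """Build one prompt/completion pair in the format used above."""
    prompt = "start -> " + "\n\n".join(
        f' {msg["user"]} --> {msg["text"]} ' for msg in conversation
    ) + SEPARATOR
    labels = " ".join(str(user_index[u]) for u in engaged_users) or "nil"
    return {"prompt": prompt, "completion": " " + labels + STOP}


user_index = encode_users(["U2", "U1", "U3"])
example = format_example(
    [{"user": "U1", "text": "hi"}, {"user": "U2", "text": "hello"}],
    ["U2", "U3"],
    user_index,
)
print(json.dumps(example))
```

The fixed separator and the ` end` stop sequence give the model unambiguous markers for where the prompt stops and where the completion should finish.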

## Fine-tune a custom model

Finally, we'll send our fine-tuning examples to OpenAI to create a custom model.
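
Before uploading, it can be worth running a quick local sanity check on the file. The sketch below is a hypothetical helper (`check_examples` is not part of the OpenAI library) that assumes the separator and stop sequence used earlier in this notebook; the `prepare_data` step in the next cell performs its own, more thorough checks.

```python
import json


def check_examples(lines, separator="\n\n###\n\n", stop=" end"):
    """Return a list of (line_number, problem) pairs for malformed examples."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        prompt = record.get("prompt", "")
        completion = record.get("completion", "")
        if not prompt.endswith(separator):
            problems.append((number, "prompt does not end with the separator"))
        if not completion.startswith(" "):
            problems.append((number, "completion does not start with a space"))
        if not completion.endswith(stop):
            problems.append((number, "completion does not end with the stop sequence"))
    return problems
```

A typical use would be `with open("examples.jsonl") as f: print(check_examples(f))`, where an empty list means no problems were found.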

In [None]:
import openai
from openai import cli
from types import SimpleNamespace

# Initialize OpenAI
openai.api_key = getpass.getpass('OpenAI: API Key')

# Verify data format, split for training & validation, upload to OpenAI
args = SimpleNamespace(file='./examples.jsonl', quiet=True)
cli.FineTune.prepare_data(args)
training_id = cli.FineTune._get_or_upload('./examples_prepared_train.jsonl', True)

In [None]:
# Train a model using "davinci", the most capable base model offered for
# fine-tuning when this notebook was written
resp = openai.FineTune.create(
    training_file = training_id,
    model = "davinci",
    n_epochs = 8,
    learning_rate_multiplier = 0.02,
    suffix = "conversation_users"
)

# Fine-tuning can take a while, so keep track of this ID
print(f'Fine-tuning model with job ID: "{resp["id"]}"')
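
Once the fine-tuned model is in use, its completions will contain encoded labels rather than Slack user IDs. The following sketch shows one way to map them back, assuming the label list saved to `labels_.json` earlier; `decode_completion` is a hypothetical helper, and the labels shown are made up.

```python
import json


def decode_completion(completion, labels, stop=" end"):
    """Turn a completion such as " 1 2 end" back into a list of user IDs."""
    text = completion.split(stop)[0].strip()
    if not text or text == "nil":
        return []
    users = []
    for token in text.split():
        # Ignore anything that is not a known label index
        if token.isdigit() and int(token) < len(labels):
            users.append(labels[int(token)])
    return users


# Stands in for the list stored in labels_.json
labels = json.loads('["U1", "U2", "U3"]')
decoded = decode_completion(" 1 2 end", labels)
```

Unknown or malformed tokens are skipped rather than raising an error, since a model's output is not guaranteed to follow the training format exactly.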