# BeepGPT Example

In this notebook, you'll see how to train BeepGPT on your Slack history in 15 minutes using only OpenAI's APIs and open-source Python libraries. No data science PhD required.

We'll train BeepGPT in four steps:
1. Pull down historical messages
2. Build training examples
3. Convert our examples into a training dataset of prompt/completion pairs
4. Send our training data to OpenAI and create a fine-tuning job

In [None]:
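# numpy and scikit-learn are imported later in this notebook.
# openai<1 is needed because the notebook uses the legacy openai.FineTune and openai.cli interfaces.
%pip install timestreams pandas pyarrow numpy scikit-learn "openai<1" kaskada==0.6.0a0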

In [1]:
# Import the datetime module itself: later cells call datetime.timedelta(...).
import datetime
import getpass
import json

import kaskada as kd
import openai
import pandas
import pyarrow

# Initialize Kaskada with a local execution context.
kd.init_session()

## Combine Historical Files

Historical Slack messages can be exported by following the instructions on Slack's [Export your workspace data](https://slack.com/help/articles/201658943-Export-your-workspace-data) page. We'll use these messages to teach BeepGPT about the members of your workspace.

The Slack export is a zip archive containing numerous folders and files. After uncompressing the archive, you'll find a folder for each public channel in your Slack workspace. Each folder contains one JSON file per day, holding all of that day's events.
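
To make the layout concrete, here is a minimal sketch that writes a tiny synthetic export with the structure described above and walks it using only the standard library. The channel names, the `2023-07-26.json` file name, the `U_EXAMPLE` user ID, and the message contents are made up for illustration; a real export contains more fields and files.

```python
import json
import tempfile
from pathlib import Path

def write_fake_export(root: Path) -> None:
    """Create <root>/<channel>/<date>.json files shaped like a Slack export (illustrative data only)."""
    days = {
        "general": {"2023-07-26.json": [
            {"ts": "1690360175.262899", "user": "U_EXAMPLE", "text": "old message 1"},
            {"ts": "1690360176.582019", "user": "U_EXAMPLE", "text": "old message 2"},
        ]},
        "random": {"2023-07-26.json": [
            {"ts": "1690360213.550579", "user": "U_EXAMPLE", "text": "old message in random channel"},
            # Events that carry a subtype (joins, bot posts, ...) are not plain messages.
            {"ts": "1690360214.000000", "user": "U_EXAMPLE", "text": "joined", "subtype": "channel_join"},
        ]},
    }
    for channel, files in days.items():
        channel_dir = root / channel
        channel_dir.mkdir(parents=True)
        for name, events in files.items():
            (channel_dir / name).write_text(json.dumps(events))

def count_messages(root: Path) -> dict:
    """Count plain messages (events without a subtype) per channel folder."""
    counts = {}
    for channel_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        n = 0
        for day_file in sorted(channel_dir.glob("*.json")):
            events = json.loads(day_file.read_text())
            n += sum(1 for e in events if "subtype" not in e)
        counts[channel_dir.name] = n
    return counts

with tempfile.TemporaryDirectory() as tmp:
    export_root = Path(tmp) / "slack-export"
    write_fake_export(export_root)
    message_counts = count_messages(export_root)
    print(message_counts)
```

The synthetic export lives in a temporary directory, so it does not touch a real `slack-export` folder.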

We run a short Python script (using pandas) to concatenate all the data files into a single Parquet file.

Parquet files store data in columns instead of rows. Some benefits of Parquet include:
* Fast queries that can fetch specific column values without reading full row data
* Highly efficient column-wise compression

In [2]:
import pandas as pd
import os

def get_file_df(json_path):
    df = pd.read_json(json_path, precise_float=True)
    # drop rows that have a subtype (keep only rows where "subtype" is null)
    if "subtype" in df.columns:
        df = df[df["subtype"].isnull()]
    # only keep these columns
    df = df[df.columns.intersection(["ts", "user", "text", "reactions", "thread_ts"])]
    return df

def get_channel_df(channel_path):
    dfs = []
    for root, dirs, files in os.walk(channel_path):
        for file in files:
            # skip anything that isn't a daily JSON file (e.g. .DS_Store)
            if file.endswith(".json"):
                dfs.append(get_file_df(os.path.join(root, file)))
    return pd.concat(dfs, ignore_index=True)

def get_export_df(export_path):
    dfs = []
    for root, dirs, files in os.walk(export_path):
        for dir in dirs:
            df = get_channel_df(os.path.join(root, dir))
            # add channel column
            df["channel"] = dir
            dfs.append(df)
    return pd.concat(dfs, ignore_index=True)

In [3]:
path_to_slack_export = "slack-export"

get_export_df(path_to_slack_export).to_parquet("messages.parquet")

## Read Historical Messages

Load the messages with Kaskada.

In [2]:
# Load events from a Parquet file
# Use the "ts" column as the time associated with each row, 
# and the "channel" column as the entity associated with each row.
messages = kd.sources.Parquet(
    path = "./messages.parquet", 
    time_column = "ts", 
    key_column = "channel",
    time_unit = "s",
)

# View the first 5 events
messages.preview(5)

Unnamed: 0,_time,_key,ts,user,text,channel,reactions,thread_ts
0,2023-07-26 08:29:35.262898944,general,1690360000.0,U05JQJJDJ6P,old message 1,general,,
1,2023-07-26 08:29:36.582019072,general,1690360000.0,U05JQJJDJ6P,old message 2,general,,
2,2023-07-26 08:30:09.651159040,demo,1690360000.0,U05JQJJDJ6P,old message in demo channel,demo,,
3,2023-07-26 08:30:13.550578944,random,1690360000.0,U05JQJJDJ6P,old message in random channel,random,,
4,2023-07-26 08:30:40.229079040,random,1690360000.0,U05JQJJDJ6P,old thread in random channel,random,,1690360000.0
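
The `_time` column is derived from `ts`, which Slack records as seconds since the Unix epoch (hence `time_unit = "s"`). As a quick sanity check, and assuming the times in the preview are shown in UTC, the conversion can be reproduced with the standard library. The `ts` value below is the full-precision timestamp of the first message, as it appears in the `history` preview later in this notebook.

```python
from datetime import datetime, timezone

# A Slack "ts" value: seconds since the Unix epoch, with a fractional part.
ts = 1690360175.262899

event_time = datetime.fromtimestamp(ts, tz=timezone.utc)
print(event_time.strftime("%Y-%m-%d %H:%M:%S"))
```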


## Build examples

Fine-tuning examples will teach the model the specific users who are interested in a given conversation. 

Each example consists of a "prompt" containing the state of a conversation at a point in time and a "completion" containing the users (if any) who were interested in the conversation. 

BeepGPT measures interest in several ways, for example whether a user replied to a message or added an emoji reaction.
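
To make the pairing concrete, here is a minimal pure-Python sketch for a single conversation, independent of Kaskada. The function name `build_examples`, the `window` and `max_messages` parameters, and the sample messages are hypothetical, and the logic is a simplification; the notebook's actual implementation is the Kaskada query in the next cell.

```python
def build_examples(messages, window=300.0, max_messages=20):
    """Pair each conversation state with the users who engaged shortly afterwards.

    messages: list of dicts with "ts" (float seconds), "user", "text", and an
    optional "reactions" list of {"users": [...]} dicts, sorted by "ts".
    """
    examples = []
    for i, msg in enumerate(messages):
        # Prompt: the conversation as of this message (at most max_messages long).
        conversation = messages[max(0, i + 1 - max_messages): i + 1]
        # Completion: users who posted, or reacted to a post, within `window` seconds afterwards.
        engaged = set()
        for later in messages[i + 1:]:
            if later["ts"] - msg["ts"] > window:
                break
            engaged.add(later["user"])
            for reaction in later.get("reactions") or []:
                engaged.update(reaction["users"])
        examples.append({
            "prompt": [m["text"] for m in conversation],
            "engaged_users": sorted(engaged),
        })
    return examples

# Made-up conversation: U_B replies within a minute and U_C reacts to the reply.
messages = [
    {"ts": 0.0, "user": "U_A", "text": "Is the deploy done?"},
    {"ts": 60.0, "user": "U_B", "text": "Yes, just finished.",
     "reactions": [{"name": "tada", "users": ["U_C"]}]},
    {"ts": 4000.0, "user": "U_A", "text": "Anyone around for lunch?"},
]
examples = build_examples(messages)
for example in examples:
    print(example)
```

Only the first example has engaged users here, because the third message arrives long after the window has closed. Examples with nobody engaged are still useful: they teach the model when not to notify anyone.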

In [3]:
# Messages come from Slack in chronological order, mixing concurrent conversations together.
# Let's re-group messages by thread and/or channel.
messages = messages.with_key(kd.record({
    "channel": messages.col("channel"),
    "thread": messages.col("thread_ts"),
}))

# To understand a conversation we need more than the most recent message.
# We build a rolling window over the 20 most recent messages.
conversations = (
    messages
    .select("user", "ts", "text", "reactions")
    .collect(max=20)
)

# We want to know who's interested in a conversation, but we only learn that later.
# Shift the conversation forward in time so we can associate it with reactions that happen during that time.
shifted_conversations = conversations.shift_by(datetime.timedelta(seconds=1))

# One signal that someone is interested in a conversation is that they react to it.
# Collect all the users who reacted to the conversation within the trailing window
# (1 second here, matching the period of time the prompt was shifted across).
reaction_users = (
    messages
    .collect(window=kd.windows.Trailing(datetime.timedelta(seconds=1)), max=100)
    .col("reactions").flatten()
    .col("users").flatten()
)

# Another signal is that someone responds to the conversation.
# Collect all the users who posted messages within the trailing window (1 second here).
participating_users = (
    messages
    .collect(window=kd.windows.Trailing(datetime.timedelta(seconds=1)), max=100)
    .col("user")
)

# Now we can bring together conversations with the reactions that occurred after the conversation occurred.
# We're combining timelines defined at different times here, so we filter the result to the times of the shifted conversations.
# This works like a "left join".
history = (
    kd.record({
        "conversation": shifted_conversations, 
        "engaged_users": reaction_users.union(participating_users),
    })
    .filter(shifted_conversations.is_not_null())
)

history.preview(5)

Unnamed: 0,_time,_key,conversation,engaged_users
0,2023-07-26 08:29:36.262898944,"{'channel': 'general', 'thread': None}","[{'ts': 1690360175.262899, 'user': 'U05JQJJDJ6...",[]
1,2023-07-26 08:29:36.582019072,"{'channel': 'general', 'thread': None}","[{'ts': 1690360175.262899, 'user': 'U05JQJJDJ6...",[U05JQJJDJ6P]
2,2023-07-26 08:29:37.582019072,"{'channel': 'general', 'thread': None}","[{'ts': 1690360175.262899, 'user': 'U05JQJJDJ6...",[]
3,2023-07-26 08:30:10.651159040,"{'channel': 'demo', 'thread': None}","[{'ts': 1690360209.651159, 'user': 'U05JQJJDJ6...",[]
4,2023-07-26 08:30:14.550578944,"{'channel': 'random', 'thread': None}","[{'ts': 1690360213.550579, 'user': 'U05JQJJDJ6...",[]


## Create training dataset

To prepare our fine-tuning data for OpenAI, we'll use scikit-learn for preprocessing. This step ensures that each user is represented by a single "token", and that the conversation is formatted in a way that is easy for the model to learn.

In [9]:
import random
import pandas as pd
import numpy as np
from sklearn import preprocessing

# We're going to generate per-token probability estimates, so each user needs to correspond to a single token.
# This list will map each user-id string to an integer.
labels = ["nil"]

# Given a conversation, format it for an LLM prompt.
# Separate messages with blank lines (most recent first), and add a prompt/completion separator at the end of the string.
@kd.udf("f<N: any>(x: N) -> string")
def format_prompt(batch: pd.Series):
    # Concatenate messages and add a separator
    return batch.map(lambda c: "\n\n".join([msg["text"].strip() for msg in reversed(c)])  + " \n\n###\n\n")

# Given a list of engaged users, format it for an LLM completion.
# Randomly pick a single user (if there are multiple), or return "nil" if nobody engaged
# NOTE: Predicting a single user rather than a list of them improves model performance
@kd.udf("g<N: any>(x: N) -> string")
def format_completion(batch: pd.Series):
    # Extend labels
    global labels
    for new_label in np.unique(batch.explode().dropna()):
        if new_label not in labels:
            labels.append(new_label)

    return batch.map(lambda u: " " + str(labels.index(random.choice(u) if len(u) > 0 else "nil")) + " end")

# Convert our structured data into unstructured training examples.
# This requires string manipulations that Kaskada doesn't provide, but we can easily drop down to Python using a UDF.
examples = kd.record({
    "prompt": history.col("conversation").pipe(format_prompt),
    "completion": history.col("engaged_users").pipe(format_completion),
})

examples.preview(5)

Unnamed: 0,_time,_key,prompt,completion
0,2023-07-26 08:29:36.262898944,"{'channel': 'general', 'thread': None}",old message 1 \n\n###\n\n,0 end
1,2023-07-26 08:29:36.582019072,"{'channel': 'general', 'thread': None}",old message 1 \n\n###\n\n,2 end
2,2023-07-26 08:29:37.582019072,"{'channel': 'general', 'thread': None}",old message 2 \n\n old message 1 \n\n###\n\n,0 end
3,2023-07-26 08:30:10.651159040,"{'channel': 'demo', 'thread': None}",old message in demo channel \n\n###\n\n,0 end
4,2023-07-26 08:30:14.550578944,"{'channel': 'random', 'thread': None}",old message in random channel \n\n###\n\n,0 end


In [5]:
# Write examples to file
examples.to_pandas()[["prompt", "completion"]].to_json("examples.jsonl", orient='records', lines=True)
print("Wrote examples to 'examples.jsonl'")

# Write our user-id<->label mapping to file
with open('labels.json', 'w') as f:
    json.dump(labels, f)
print("wrote labels to 'labels.json'")

Wrote examples to 'examples.jsonl'
wrote labels to 'labels.json'
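
Each line of `examples.jsonl` is a JSON object with a `prompt` and a `completion`, and `labels.json` maps the integer in each completion back to a user ID. The sketch below shows how a completion could be decoded. The two records and the label list are illustrative stand-ins modeled on the preview above, and `decode_completion` is a hypothetical helper, not part of the notebook.

```python
import json

# Two records in the shape written to examples.jsonl (contents are illustrative).
jsonl_text = "\n".join([
    json.dumps({"prompt": "old message 1 \n\n###\n\n", "completion": " 0 end"}),
    json.dumps({"prompt": "old message 2\n\nold message 1 \n\n###\n\n", "completion": " 1 end"}),
])

# The label list maps integer labels back to user IDs; index 0 is "nil" (nobody engaged).
labels = ["nil", "U05JQJJDJ6P"]

def decode_completion(completion: str, labels: list) -> str:
    """Turn a completion such as ' 1 end' back into a user ID (or 'nil')."""
    label_index = int(completion.split()[0])
    return labels[label_index]

decoded = [
    decode_completion(json.loads(line)["completion"], labels)
    for line in jsonl_text.splitlines()
]
print(decoded)
```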


## Fine-tune a custom model

Finally, we'll send our fine-tuning examples to OpenAI to create a custom model.

In [None]:
import openai
from openai import cli
from types import SimpleNamespace

# Initialize OpenAI
openai.api_key = getpass.getpass('OpenAI: API Key')

# Verify data format, split for training & validation, upload to OpenAI
args = SimpleNamespace(file='./examples.jsonl', quiet=True)
cli.FineTune.prepare_data(args)
training_id = cli.FineTune._get_or_upload('./examples_prepared_train.jsonl', True)

In [None]:
# Train a model using OpenAI's fine-tuning API
resp = openai.FineTune.create(
    training_file = training_id,
    model = "curie",
    n_epochs = 8,
    learning_rate_multiplier = 0.02,
    suffix = "conversation_users"
)

# Fine-tuning can take a while, so keep track of this ID
print(f'Fine-tuning model with job ID: "{resp["id"]}"')
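
Once the job finishes, the fine-tuned model can be asked for a single token along with its log probabilities, and those candidates can be mapped back to users with `labels.json`. The helper below is a hypothetical sketch of that last step only. It assumes the response provides a dict of candidate tokens to log probabilities (the shape of one `top_logprobs` entry in the legacy Completions response); the label list and values shown are made up.

```python
import math

def rank_users(top_logprobs: dict, labels: list) -> list:
    """Convert {token: logprob} candidates into (user_id, probability) pairs, most likely first."""
    ranked = []
    for token, logprob in top_logprobs.items():
        token = token.strip()
        # Skip tokens that are not one of our integer labels.
        if not token.isdecimal() or int(token) >= len(labels):
            continue
        ranked.append((labels[int(token)], math.exp(logprob)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

labels = ["nil", "U_ALICE", "U_BOB"]  # hypothetical contents of labels.json
top_logprobs = {" 2": math.log(0.6), " 0": math.log(0.3), " end": math.log(0.1)}  # made-up values

ranking = rank_users(top_logprobs, labels)
print([(user, round(p, 2)) for user, p in ranking])
```

A "nil" entry in the ranking means the model thinks nobody would be interested, so an application would typically notify a user only when that user's probability is well above the "nil" probability.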