# BeepGPT Example

In this notebook, you’ll see how to train BeepGPT on your Slack history in 15 minutes using only OpenAI’s APIs and open-source Python libraries. No data science PhD required.

We'll train BeepGPT in four steps:
1. Pull down historical messages
2. Build training examples
3. Convert our examples into a training dataset of prompt/completion pairs
4. Send our training data to OpenAI and create a fine-tuning job

In [20]:
%pip install pandas pyarrow openai kaskada==0.6.0a1 transformers datasets evaluate ipywidgets


In [1]:
import datetime  # used below as datetime.timedelta(...)
import getpass

import kaskada as kd
import pandas
import transformers
import datasets
import evaluate

# Initialize Kaskada with a local execution context.
kd.init_session()

## Read Historical Messages

Historical Slack messages can be exported by following the instructions on Slack's [Export your workspace data](https://slack.com/help/articles/201658943-Export-your-workspace-data) page. We'll use these messages to teach BeepGPT about the members of your workspace.
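For orientation, the loader in this section assumes the export unpacks to one directory per channel, each holding JSON files that contain a list of message objects. The following is a minimal standard-library sketch of that layout. The directory name, file name, user IDs, and message contents are made-up illustrations, and `load_channel_messages` is a hypothetical helper, not part of the notebook.

```python
import json
import os
import tempfile

# Hypothetical miniature export: <export>/<channel>/<day>.json
sample_day = [
    {"ts": "1690360175.262899", "user": "U_ALICE", "text": "hello"},
    {"ts": "1690360176.000000", "user": "U_BOB", "subtype": "channel_join",
     "text": "<@U_BOB> has joined the channel"},
]

# The same fields the notebook keeps from each message
KEEP = ("ts", "user", "text", "thread_ts", "reactions")

def load_channel_messages(channel_dir):
    """Read every JSON file in one channel directory, skipping messages with a subtype."""
    rows = []
    for name in sorted(os.listdir(channel_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(channel_dir, name), encoding="utf-8") as f:
            for msg in json.load(f):
                if msg.get("subtype") is not None:
                    continue  # system message (join, leave, ...)
                rows.append({k: msg[k] for k in KEEP if k in msg})
    return rows

with tempfile.TemporaryDirectory() as export_dir:
    channel_dir = os.path.join(export_dir, "general")
    os.makedirs(channel_dir)
    with open(os.path.join(channel_dir, "2023-07-26.json"), "w", encoding="utf-8") as f:
        json.dump(sample_day, f)
    sample_rows = load_channel_messages(channel_dir)

print(sample_rows)
```

Only the regular message survives the filter; the join notification is dropped because it carries a `subtype`.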

In [15]:
import pandas as pd
import os

def get_file_df(json_path):
    df = pd.read_json(json_path, precise_float=True)
    # keep only regular messages (rows where "subtype" is null)
    if "subtype" in df.columns:
        df = df[df["subtype"].isnull()]
    # only keep these columns
    df = df[df.columns.intersection(["ts", "user", "text", "thread_ts", "reactions"])]
    return df

def get_channel_df(channel_path):
    dfs = []
    for root, dirs, files in os.walk(channel_path):
        for file in files:
            dfs.append(get_file_df(os.path.join(root, file)))
    return pd.concat(dfs, ignore_index=True)

def get_export_df(export_path):
    dfs = []
    for root, dirs, files in os.walk(export_path):
        for channel_name in dirs:
            df = get_channel_df(os.path.join(root, channel_name))
            # add channel column
            df["channel"] = channel_name
            dfs.append(df)
    return pd.concat(dfs, ignore_index=True)

In [16]:
path_to_slack_export = "slack-export"

get_export_df(path_to_slack_export).to_parquet("messages.parquet")

In [2]:
# Load events from a Parquet file
# Use the "ts" column as the time associated with each row, 
# and the "channel" column as the entity associated with each row.
messages = kd.sources.Parquet(
    path = "./messages.parquet", 
    time_column = "ts", 
    key_column = "channel",
)

# View the first 5 events
messages.preview(5)

Unnamed: 0,_time,_key,ts,user,text,channel,reactions,thread_ts
0,1970-01-01 00:00:01.690360175,general,1690360000.0,U05JQJJDJ6P,old message 1,general,,
1,1970-01-01 00:00:01.690360176,general,1690360000.0,U05JQJJDJ6P,old message 2,general,,
2,1970-01-01 00:00:01.690360209,demo,1690360000.0,U05JQJJDJ6P,old message in demo channel,demo,,
3,1970-01-01 00:00:01.690360213,random,1690360000.0,U05JQJJDJ6P,old message in random channel,random,,
4,1970-01-01 00:00:01.690360240,random,1690360000.0,U05JQJJDJ6P,old thread in random channel,random,,1690360000.0
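The `_time` values in this preview fall in January 1970 even though the `ts` values are epoch seconds from 2023, which suggests the float seconds are being read at a much finer unit. One possible workaround, which is an assumption and not something the notebook does, is to convert `ts` to an explicit timestamp (or integer nanoseconds) before writing the Parquet file. The helper names below are hypothetical.

```python
from datetime import datetime, timezone

def slack_ts_to_datetime(ts):
    """Interpret a Slack 'ts' string (epoch seconds with a fraction) as a UTC datetime."""
    return datetime.fromtimestamp(float(ts), tz=timezone.utc)

def slack_ts_to_nanos(ts):
    """Convert a Slack 'ts' string to integer nanoseconds without float rounding."""
    seconds, _, fraction = str(ts).partition(".")
    micros = (fraction + "000000")[:6]  # pad or trim the fraction to microseconds
    return int(seconds) * 1_000_000_000 + int(micros) * 1_000

converted = slack_ts_to_datetime("1690360175.262899")
nanos = slack_ts_to_nanos("1690360175.262899")
print(converted.date(), nanos)
```

In a DataFrame, `pandas.to_datetime(df["ts"], unit="s")` performs the same seconds-based conversion on a whole column.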


## Build examples

Fine-tuning examples will teach the model the specific users who are interested in a given conversation. Each example consists of a "prompt" containing the state of a conversation at a point in time and a "completion" containing the users (if any) who were interested in the conversation. BeepGPT uses several ways to measure interest, for example, replying to a message, or adding an emoji reaction.
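To make the shape of an example concrete, here is a simplified plain-Python sketch of the idea. It is not how Kaskada computes the result: it ignores reactions and threads, counts only users who post within a window after each message as "engaged", and the function name, parameters, and toy messages are all hypothetical.

```python
from collections import deque

def build_examples(messages, window_seconds, max_history=20):
    """Pair each conversation state with the users who post soon afterwards.

    messages: list of dicts with 'ts' (float seconds), 'user', 'text', sorted by 'ts'.
    """
    examples = []
    history = deque(maxlen=max_history)  # most recent messages only
    for i, msg in enumerate(messages):
        history.append(msg)
        cutoff = msg["ts"] + window_seconds
        # Users who posted after this message but before the window closed
        engaged = sorted({
            later["user"] for later in messages[i + 1:] if later["ts"] <= cutoff
        })
        examples.append({"conversation": list(history), "engaged_users": engaged})
    return examples

toy_messages = [
    {"ts": 0.0, "user": "A", "text": "hi"},
    {"ts": 10.0, "user": "B", "text": "yo"},
    {"ts": 1000.0, "user": "C", "text": "late reply"},
]
toy_examples = build_examples(toy_messages, window_seconds=300.0)
for example in toy_examples:
    print(len(example["conversation"]), example["engaged_users"])
```

With a 300-second window, only the first message has an engaged user (`B`); the reply from `C` arrives too late to count for either earlier message.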

In [3]:
# Re-group messages by thread and/or channel
# Slack messages are delivered chronologically, so messages in threads
# may be interleaved with messages in the main channel.
messages = messages.with_key(kd.record({
        "channel": messages.col("channel"),
        "thread": messages.col("thread_ts"),
    }))

# Build the GPT input prompt by collecting relevant fields of recent messages
conversations = messages \
    .select("user", "ts", "text", "reactions") \
    .collect(max=20)


# Shift the prompt forward in time (1 second here) to observe the effects of the conversation
shifted_conversations = conversations.shift_by(datetime.timedelta(seconds=1))

# Collect all the users who reacted to the conversation during the shift window
# (the period of time the prompt was shifted across)
reaction_users = messages \
    .collect(window=kd.windows.Trailing(datetime.timedelta(seconds=1)), max=100) \
    .col("reactions").flatten() \
    .col("users").flatten()

# Collect all the users who posted messages during the same window
participating_users = messages \
    .collect(window=kd.windows.Trailing(datetime.timedelta(seconds=1)), max=100) \
    .col("user")

# Build a fine-tuning example mapping a conversation to the users who reacted to it
history = kd.record({
        "conversation": shifted_conversations, 
        "engaged_users": reaction_users.union(participating_users),
    }) \
    .filter(shifted_conversations.is_not_null())

history.preview(5)

Unnamed: 0,_time,_key,conversation,engaged_users
0,1970-01-01 00:00:02.690360175,"{'channel': 'general', 'thread': None}","[{'ts': 1690360175.262899, 'user': 'U05JQJJDJ6...","[U05JQJJDJ6P, U05JV3K9RB7, U05JH8BCZST]"
1,1970-01-01 00:00:02.690360176,"{'channel': 'general', 'thread': None}","[{'ts': 1690360175.262899, 'user': 'U05JQJJDJ6...","[U05JQJJDJ6P, U05JV3K9RB7, U05JH8BCZST]"
2,1970-01-01 00:00:02.690360209,"{'channel': 'demo', 'thread': None}","[{'ts': 1690360209.651159, 'user': 'U05JQJJDJ6...",[]
3,1970-01-01 00:00:02.690360213,"{'channel': 'random', 'thread': None}","[{'ts': 1690360213.550579, 'user': 'U05JQJJDJ6...",[]
4,1970-01-01 00:00:02.690360240,"{'channel': 'random', 'thread': 1690360240.229...","[{'ts': 1690360240.229079, 'user': 'U05JQJJDJ6...","[U05JQJJDJ6P, U05JH8BCZST]"


## Create training dataset

To prepare our fine-tuning data for OpenAI, we'll use scikit-learn for preprocessing. This step ensures that each user is represented by a single "token" and that the conversation is formatted in a way that is easy for the model to learn.
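As a rough illustration of what "a single token per user" means, the sketch below maps each user ID to a small integer using only the standard library. It mirrors the sorted-classes behavior of scikit-learn's `LabelEncoder`, but `fit_user_labels` is a hypothetical helper and the input lists are copied from the preview output for illustration.

```python
def fit_user_labels(engaged_user_lists):
    """Return the sorted unique user IDs and a user -> integer ID mapping."""
    classes = sorted({user for users in engaged_user_lists for user in users})
    return classes, {user: index for index, user in enumerate(classes)}

classes, user_to_id = fit_user_labels([
    ["U05JQJJDJ6P", "U05JV3K9RB7", "U05JH8BCZST"],
    [],
    ["U05JQJJDJ6P", "U05JH8BCZST"],
])
print(classes)
print(user_to_id)
```

Saving `classes` (as the notebook does with `labels.json`) lets you translate predicted integers back to user IDs later.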

In [49]:
from sklearn import preprocessing
import numpy, json

# Extract examples from historical data
history_df = history.run().to_pandas().drop(["_time", "_key"], axis=1)


# Encode user ID labels
le = preprocessing.LabelEncoder()
le.fit(history_df.engaged_users.explode())
with open('labels.json', 'w') as f:
    json.dump(le.classes_.tolist(), f)


# Format for the OpenAI API
def format_prompt(conversation):
    return "start -> " + "\n\n".join([f' {msg["user"]} --> {msg["text"]} ' for msg in conversation]) + "\n\n###\n\n"
def format_completion(engaged_users):
    return " " + (" ".join(le.transform(engaged_users).astype(str)) if len(engaged_users) > 0 else "nil") + " end"
    
examples_df = pandas.DataFrame({
    "text": history_df.conversation.apply(format_prompt),
    "label": history_df.engaged_users.apply(lambda x: x[0] if len(x) > 0 else "nil").astype("str"),
})

# Write examples to file
examples_df.to_parquet("examples.parquet")
print("Wrote examples to 'examples.parquet'")
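The OpenAI upload cell later in this notebook reads `./examples.jsonl`, while the cell above writes only `examples.parquet`. The following is a minimal sketch of how prompt/completion pairs in the format of `format_prompt` and `format_completion` could be serialized as JSON Lines. The toy examples, the `user_to_id` mapping, and the `to_jsonl` helper are assumptions for illustration, not the notebook's actual export step.

```python
import json

def format_prompt(conversation):
    body = "\n\n".join(f' {msg["user"]} --> {msg["text"]} ' for msg in conversation)
    return "start -> " + body + "\n\n###\n\n"

def format_completion(engaged_users, user_to_id):
    # One integer "token" per engaged user, or "nil" when nobody engaged
    ids = " ".join(str(user_to_id[u]) for u in engaged_users) if engaged_users else "nil"
    return " " + ids + " end"

def to_jsonl(examples, user_to_id):
    """Serialize examples as one JSON object per line."""
    lines = [
        json.dumps({
            "prompt": format_prompt(ex["conversation"]),
            "completion": format_completion(ex["engaged_users"], user_to_id),
        })
        for ex in examples
    ]
    return "\n".join(lines) + "\n"

# Hypothetical toy data
toy_user_to_id = {"U_B": 0}
toy_training_examples = [
    {"conversation": [{"user": "U_A", "text": "anyone seen the build failure?"}],
     "engaged_users": ["U_B"]},
    {"conversation": [{"user": "U_A", "text": "lunch?"}], "engaged_users": []},
]
jsonl_text = to_jsonl(toy_training_examples, toy_user_to_id)
print(jsonl_text)
```

Writing `jsonl_text` to `./examples.jsonl` would produce a file in the prompt/completion shape that the upload step appears to expect.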

## Fine-tune a custom model

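Finally, we'll use our fine-tuning examples to create a custom model. The cells below first fine-tune a DistilBERT classifier locally with Hugging Face Transformers, then upload examples to OpenAI and create a fine-tuning job.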

In [24]:
from huggingface_hub import notebook_login
notebook_login()


In [58]:
import numpy as np  # needed by compute_metrics below
import evaluate
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Load data
raw_dataset = load_dataset("parquet", data_files=["./examples.parquet"])
train_test_datasets = raw_dataset["train"].train_test_split(test_size=0.2)

# Define tokenization
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def tokenize_function(examples):
    return tokenizer(examples["text"], max_length=100, truncation=True)
tokenized_dataset = train_test_datasets.map(tokenize_function, batched=True)

# Define batch collation
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Evaluation metric
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)


In [59]:
# Define classification labels
# id2label = {0: "U05JH8BCZST", 1: "U05JQJJDJ6P", 2: "U05JV3K9RB7", 3: "nil"}
# label2id = {"U05JH8BCZST": 0, "U05JQJJDJ6P": 1, "U05JV3K9RB7": 2, "nil": 3}

# Configure model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=4, problem_type="multi_label_classification"
)

training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train!
trainer.train()




In [None]:
import openai
from openai import cli
from types import SimpleNamespace

# Initialize OpenAI
openai.api_key = getpass.getpass('OpenAI: API Key')

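# Verify data format, split for training & validation, upload to OpenAI
# (expects './examples.jsonl' with prompt/completion pairs; the cells above only write 'examples.parquet')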
args = SimpleNamespace(file='./examples.jsonl', quiet=True)
cli.FineTune.prepare_data(args)
training_id = cli.FineTune._get_or_upload('./examples_prepared_train.jsonl', True)

In [None]:
# Fine-tune the "ada" base model; a larger base model such as "davinci" can be substituted
resp = openai.FineTune.create(
    training_file = training_id,
    model = "ada",
    n_epochs = 2,
    learning_rate_multiplier = 0.02,
    suffix = "conversation_users"
)

# Fine-tuning can take a while, so keep track of this ID
print(f'Fine-tuning model with job ID: "{resp["id"]}"')