# BeepGPT Example

In this notebook, you'll see how to train BeepGPT on your Slack history in about 15 minutes using only OpenAI's APIs and open-source Python libraries. No data science PhD required.

We'll train BeepGPT in four steps:
1. Pull down historical messages
2. Build training examples
3. Convert our examples into a training dataset of prompt/completion pairs
4. Send our training data to OpenAI and create a fine-tuning job

In [41]:
%pip install pandas pyarrow openai kaskada==0.6.0a4 transformers datasets evaluate ipywidgets wandb


In [1]:
from datetime import timedelta
import getpass

import datasets
import evaluate
import kaskada as kd
import pandas
import pyarrow as pa
import transformers

# Initialize Kaskada with a local execution context.
kd.init_session()

## Read Historical Messages

Historical Slack messages can be exported by following the instructions on Slack's [Export your workspace data](https://slack.com/help/articles/201658943-Export-your-workspace-data) page. We'll use these messages to teach BeepGPT about the members of your workspace.
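
For orientation, here is a minimal standard-library sketch of the filtering that the next cell performs with pandas. It assumes the usual export layout (one folder per channel, one JSON file per day, each file a list of message objects). The sample records, user IDs, and the helper name `keep_plain_messages` are made up for illustration.

```python
import json

# Hypothetical contents of one day's file inside a channel folder.
raw = json.loads("""
[
  {"type": "message", "user": "U0000000001", "text": "It's alive!", "ts": "1681322554.854979"},
  {"type": "message", "subtype": "channel_join", "user": "U0000000002",
   "text": "<@U0000000002> has joined the channel", "ts": "1681322600.000100"},
  {"type": "message", "user": "U0000000002", "text": "woo!",
   "ts": "1681323164.778119", "thread_ts": "1681323164.778119"}
]
""")

# The only fields the rest of the notebook relies on.
KEEP = ("ts", "user", "text", "thread_ts")

def keep_plain_messages(records):
    """Drop records that carry a subtype (joins, bot posts, ...) and keep only the KEEP fields."""
    return [
        {k: r[k] for k in KEEP if k in r}
        for r in records
        if r.get("subtype") is None
    ]

plain = keep_plain_messages(raw)
print(plain)
```

Only the two ordinary messages survive; the `channel_join` event is dropped because it has a `subtype`.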

In [2]:
import pandas as pd
import os

def get_file_df(json_path):
    df = pd.read_json(json_path, precise_float=True)
    # keep only ordinary messages: drop rows whose "subtype" is set (joins, bot posts, etc.)
    if "subtype" in df.columns:
        df = df[df["subtype"].isnull()]
    # only keep these columns
    df = df[df.columns.intersection(["ts", "user", "text", "thread_ts"])]
    return df

def get_channel_df(channel_path):
    dfs = []
    for root, dirs, files in os.walk(channel_path):
        for file in files:
            dfs.append(get_file_df(os.path.join(root, file)))
    return pd.concat(dfs, ignore_index=True)

def get_export_df(export_path):
    dfs = []
    for root, dirs, files in os.walk(export_path):
        for dir in dirs:
            df = get_channel_df(os.path.join(root, dir))
            # add channel column
            df["channel"] = dir
            dfs.append(df)
    return pd.concat(dfs, ignore_index=True)

In [3]:
path_to_slack_export = "slack-export-kaskada"

get_export_df(path_to_slack_export).to_parquet("messages.parquet")

In [5]:
# Load events from a Parquet file
# Use the "ts" column as the time associated with each row, 
# and the "channel" column as the entity associated with each row.
messages = await kd.sources.Parquet.create(
    path = "./messages.parquet", 
    time_column = "ts", 
    key_column = "channel",
    time_unit = "s",
)

# View the first 5 events
messages.preview(5)

Unnamed: 0,_time,_key,text,user,ts,thread_ts,channel
0,2023-04-12 18:02:34.854979072,general,It's alive!,U052V5HU11B,1681323000.0,,general
1,2023-04-12 18:12:44.778119168,general,woo!,U052Y3Y23BL,1681323000.0,,general
2,2023-04-12 18:13:25.660698880,general,,U052XUMJF6F,1681323000.0,,general
3,2023-04-12 18:15:10.314349056,general,I'm going to add a few channels with a few top...,U052Y3Y23BL,1681323000.0,,general
4,2023-05-04 21:03:42.894209024,general,Hey <@U0568CW2SNR> welcome!,U052XUMJF6F,1683234000.0,,general


## Build examples

Fine-tuning examples teach the model which users are interested in a given conversation. Each example consists of a "prompt" containing the state of a conversation at a point in time and a "completion" containing the users (if any) who were interested in the conversation. BeepGPT can measure interest in several ways, for example by a user replying to a message or adding an emoji reaction.
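
To make the prompt/completion idea concrete, here is a small plain-Python sketch that treats "replied next" as the interest signal. It is an illustration only, not the Kaskada query used in this notebook; the function name `build_examples`, the user IDs, and the messages are hypothetical.

```python
from collections import deque

def build_examples(thread, max_context=5):
    """Yield (prompt_messages, completion_user) pairs for one thread.

    `thread` is a list of (user, text) tuples in chronological order.
    The prompt holds up to `max_context` earlier messages; the completion is
    the user who spoke next, i.e. someone who showed interest by replying.
    """
    context = deque(maxlen=max_context)
    for user, text in thread:
        if context:  # the first message has no prior conversation to learn from
            yield list(context), user
        context.append(text)

thread = [
    ("U_ALICE", "Is the nightly build broken?"),
    ("U_BOB", "Looks like a flaky test."),
    ("U_ALICE", "Thanks, rerunning."),
]
pairs = list(build_examples(thread))
for prompt, completion in pairs:
    print(prompt, "->", completion)
```

A three-message thread yields two examples, because the opening message has no preceding conversation.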

In [6]:
threads = messages.filter(messages.col("thread_ts").is_not_null())
non_threads = messages.filter(messages.col("thread_ts").is_null())

ts = non_threads.col("ts")
ts_since = ts.seconds_since_previous()

is_new = ts_since.cast(pa.int64()) > 600

shifted_non_threads = non_threads.shift_by(timedelta(microseconds=0.001))
shifted_ts = shifted_non_threads.lag(1).col("ts").first(window=kd.windows.Since(is_new))
thread_ts = ts.if_(is_new).else_(shifted_ts)

non_threads_threads = non_threads.extend({"thread_ts": thread_ts}).filter(ts.is_not_null().and_(thread_ts.is_not_null()))

joined = kd.record({
    "ts": threads.col("ts").else_(non_threads_threads.col("ts")),
    "text": threads.col("text").else_(non_threads_threads.col("text")),
    "user" : threads.col("user").else_(non_threads_threads.col("user")),
    "thread_ts" : threads.col("thread_ts").else_(non_threads_threads.col("thread_ts")),
    "channel" : threads.col("channel").else_(non_threads_threads.col("channel")),
})

messages = joined.with_key(kd.record({
        "channel": joined.col("channel"),
        "thread": joined.col("thread_ts"),
    }))

# collect the text of up to 5 messages preceding the current one in the same thread
conversation = messages.col("text").collect(max=5, min=1).lag(1)

# add the conversation to the current row
examples = messages.extend({"conversation":conversation}).filter(conversation.is_not_null())
examples.preview(5)

Unnamed: 0,_time,_key,conversation,ts,text,user,thread_ts,channel
0,2023-04-12 18:13:25.660698880,"{'channel': 'general', 'thread': 1681323164.77...",[woo!],1681323000.0,,U052XUMJF6F,1681323000.0,general
1,2023-04-12 18:15:10.314349056,"{'channel': 'general', 'thread': 1681323164.77...","[woo!, ]",1681323000.0,I'm going to add a few channels with a few top...,U052Y3Y23BL,1681323000.0,general
2,2023-05-04 21:04:14.158749184,"{'channel': 'general', 'thread': 1683234222.89...",[Hey <@U0568CW2SNR> welcome!],1683234000.0,"I’m Ryan, I’ve been working on Kaskada for a f...",U052XUMJF6F,1683234000.0,general
3,2023-05-05 16:05:24.561849088,"{'channel': 'general', 'thread': 1683302650.19...",[Hey <@U056CRH510D> welcome!],1683303000.0,Hey <@U056FQG3UMQ> good to see you in here!,U052XUMJF6F,1683303000.0,general
4,2023-05-05 16:06:38.748968960,"{'channel': 'general', 'thread': 1683302650.19...","[Hey <@U056CRH510D> welcome!, Hey <@U056FQG3UM...",1683303000.0,hey Ryan :wave:,U056FQG3UMQ,1683303000.0,general
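
The cell above also groups messages posted outside of threads into pseudo-threads: a message that arrives more than 600 seconds after the previous one starts a new group. The sketch below approximates that rule in plain Python. It is not the Kaskada semantics (for instance, it simply treats the first message as a new group), and the function name and sample timestamps are hypothetical.

```python
def assign_pseudo_threads(timestamps, gap_seconds=600):
    """Map each timestamp to the timestamp of the message that started its group.

    `timestamps` is assumed to be sorted in ascending order (seconds).
    """
    thread_starts = []
    current_start = None
    previous = None
    for ts in timestamps:
        # A long enough silence (or the very first message) opens a new group.
        if previous is None or ts - previous > gap_seconds:
            current_start = ts
        thread_starts.append(current_start)
        previous = ts
    return thread_starts

stamps = [0.0, 120.0, 500.0, 1200.0, 1250.0]
starts = assign_pseudo_threads(stamps)
print(starts)
```

Here the first three messages share one group, and the 700-second silence before the fourth message starts a second group.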


## Create training dataset

To prepare our fine-tuning data for OpenAI, we'll use scikit-learn for preprocessing. This step ensures that each user is represented by a single "token" (an integer label), and that the conversation is formatted in a way that is easy for the model to learn.
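
As a rough picture of what this preprocessing does, the sketch below cleans one message and encodes user IDs with the standard library only. `encode_labels` mimics the way scikit-learn's `LabelEncoder` numbers classes in sorted order. The emoji pattern is an assumption about what Slack emoji codes look like, and the user IDs are placeholders.

```python
import re

def clean(text):
    """Remove Slack <...> markup (mentions, links) and :emoji: codes, then trim."""
    text = re.sub(r"<.*?>", "", text)
    # Assumed emoji-code shape: lowercase letters, digits, "_", "+", "-" between colons.
    text = re.sub(r":[a-z0-9_+-]+:", "", text)
    return text.strip()

def encode_labels(users):
    """Assign each distinct user ID an integer, numbering the IDs in sorted order."""
    classes = sorted(set(users))
    index = {user: i for i, user in enumerate(classes)}
    return classes, [index[user] for user in users]

cleaned = clean("hey <@U056FQG3UMQ> :wave: welcome")
print(cleaned.split())

classes, labels = encode_labels(["U_BOB", "U_ALICE", "U_BOB"])
print(classes, labels)
```

The integer labels are what the classifier predicts; the saved class list is what maps a prediction back to a user ID.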

In [8]:
import json
import re

from sklearn import preprocessing

# Extract examples from historical data
examples_df = examples.to_pandas().drop(["_time", "_key"], axis=1)


# Encode user ID labels
le = preprocessing.LabelEncoder()
le.fit(examples_df["user"])
with open('labels_.json', 'w') as f:
    json.dump(le.classes_.tolist(), f)


# Format for the OpenAI API
def strip_links_and_users(line):
    return re.sub(r"<.*?>", '', line)

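def strip_emoji(line):
    # Match only emoji-style codes such as :wave: or :+1:, so that ordinary
    # colons (e.g. times like "10:30 or 11:00") are left alone.
    return re.sub(r":[a-z0-9_+-]+:", '', line)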

def clean_messages(messages):
    cleaned = []
    for msg in messages:
        text = strip_links_and_users(msg)
        text = strip_emoji(text)
        text = text.strip()
        if text == "":
            continue
        cleaned.append(text)
    return cleaned

# Format prompt for the OpenAI API
def format_prompt(messages):
    cleaned = clean_messages(messages)
    if len(cleaned) == 0:
        return None
    cleaned.reverse()
    prompt = "\n\n".join(cleaned)
    return prompt
    
examples_df = pandas.DataFrame({
    "text": examples_df.conversation.apply(format_prompt),
    "label": le.transform(examples_df.user),
})

# Write examples to file
examples_df.dropna().to_parquet("examples.parquet")
print("Wrote examples to 'examples.parquet'")

Wrote examples to 'examples.parquet'


In [9]:
import pandas as pd
from sklearn import preprocessing

examples_json = pd.read_json("conversation_next_message_examples_joined.jsonl", lines=True)

le = preprocessing.LabelEncoder()
le.fit(examples_json["completion"])

examples_df = pd.DataFrame({
    "text": examples_json["prompt"],
    "label": le.transform(examples_json["completion"]),
})

# Write examples to file
examples_df.dropna().to_parquet("examples.parquet")
print("Wrote examples to 'examples.parquet'")

Wrote examples to 'examples.parquet'


## Fine-tune a custom model

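Finally, we'll use our fine-tuning examples to create a custom model. The cells below fine-tune a `roberta-base` sequence classifier with Hugging Face Transformers and log the run to Weights & Biases.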

In [1]:
from huggingface_hub import notebook_login
notebook_login()


In [2]:
%env WANDB_PROJECT=beep-gpt
import wandb
wandb.login()

env: WANDB_PROJECT=beep-gpt

wandb: Currently logged in as: kerinin. Use `wandb login --relogin` to force relogin


True

In [14]:
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
import evaluate
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

#model = "distilbert-base-uncased" #512
model = "roberta-base" #512
#model = "bert-base-uncased" #512
#model="gpt2-xl" #1024
#model="gpt2-large"
#model="gpt2-medium" #1024

# Load data
raw_dataset = load_dataset("parquet", data_files=["./examples.parquet"])
train_test_datasets = raw_dataset["train"].train_test_split(test_size=0.2)

# Define tokenization
tokenizer = AutoTokenizer.from_pretrained(model)
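# GPT-2 tokenizers ship without a padding token, so fall back to EOS there;
# RoBERTa/BERT tokenizers already define one and are left untouched.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
def tokenize_function(examples):
    # Truncate to the model's own limit (512 for roberta-base); a larger
    # max_length would overflow the position embeddings.
    return tokenizer(examples["text"], truncation=True, max_length=tokenizer.model_max_length)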
tokenized_dataset = train_test_datasets.map(tokenize_function, batched=True)

# Define batch collation
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Evaluation metric
metric = evaluate.load("f1")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels, average="micro")

Map:   0%|          | 0/1928 [00:00<?, ? examples/s]

Map:   0%|          | 0/482 [00:00<?, ? examples/s]
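
A note on the metric chosen above: when every example has exactly one label, micro-averaged F1 pools true positives, false positives, and false negatives over all classes, and the result equals plain accuracy. The standard-library sketch below illustrates this with made-up predictions; `micro_f1` is a hypothetical helper, not the `evaluate` implementation.

```python
def micro_f1(predictions, references):
    """Micro-averaged F1 for single-label classification.

    True positives, false positives, and false negatives are counted per class
    and summed before computing F1.
    """
    classes = set(predictions) | set(references)
    tp = fp = fn = 0
    for c in classes:
        for p, r in zip(predictions, references):
            tp += (p == c and r == c)
            fp += (p == c and r != c)
            fn += (p != c and r == c)
    return 2 * tp / (2 * tp + fp + fn)

preds = [2, 2, 0, 1, 2]
refs = [2, 1, 0, 1, 0]
f1 = micro_f1(preds, refs)
accuracy = sum(p == r for p, r in zip(preds, refs)) / len(refs)
print(f1, accuracy)
```

Each wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), which is why the two numbers agree.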

In [15]:
# Configure model
model = AutoModelForSequenceClassification.from_pretrained(
    model, num_labels=len(le.classes_),
)

training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=1e-5,
    per_device_train_batch_size=1, 
    per_device_eval_batch_size=1,
    num_train_epochs=8,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
    report_to="wandb",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train!
trainer.train()
wandb.finish()


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

In [33]:
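import torch
from scipy.special import softmax

text = "It's alive! say hello everyone"
# Tokenize the message and move the tensors to the model's device
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Forward pass without gradient tracking
with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to per-user probabilities
softmax(logits.cpu().numpy(), axis=1)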

array([[0.0150593 , 0.11951187, 0.6067634 , 0.01273651, 0.02346392,
        0.01693087, 0.01361045, 0.02540066, 0.0137077 , 0.01271379,
        0.0100572 , 0.01511357, 0.01235948, 0.01376194, 0.01882563,
        0.01123953, 0.00927841, 0.01208729, 0.01354164, 0.01255121,
        0.01128568]], dtype=float32)
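
Each position in this array corresponds to one encoded user. A minimal sketch of turning the probabilities into a ranked list of users follows. The `classes` list here is a placeholder: in practice it would be the class list of whichever encoder produced the training labels (for example, one saved to a JSON file), and `top_users` is a hypothetical helper.

```python
def top_users(probabilities, classes, k=3):
    """Return the k most likely (user_id, probability) pairs, best first."""
    ranked = sorted(zip(classes, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Placeholder user IDs; replace with the encoder's real class list.
classes = [f"USER_{i}" for i in range(21)]

# The probabilities printed above, flattened into a plain list.
probabilities = [
    0.0150593, 0.11951187, 0.6067634, 0.01273651, 0.02346392,
    0.01693087, 0.01361045, 0.02540066, 0.0137077, 0.01271379,
    0.0100572, 0.01511357, 0.01235948, 0.01376194, 0.01882563,
    0.01123953, 0.00927841, 0.01208729, 0.01354164, 0.01255121,
    0.01128568,
]

best = top_users(probabilities, classes)
print(best)
```

Ranking by probability, rather than taking only the single best class, makes it possible to notify several users or to apply a confidence threshold before notifying anyone.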