# BeepGPT Example

In this notebook, you'll see how to train BeepGPT on your Slack history in 15 minutes using only open-source Python libraries (Kaskada, scikit-learn, and Hugging Face Transformers). No data science PhD required.

We'll train BeepGPT in four steps:
1. Pull down historical messages
2. Build training examples
3. Convert our examples into a training dataset of text/label pairs
4. Fine-tune a pretrained Hugging Face model on the training data

In [41]:
%pip install pandas pyarrow openai kaskada==0.6.0a1 transformers datasets evaluate ipywidgets wandb scikit-learn torch scipy


In [1]:
from datetime import timedelta
import getpass

import datasets
import evaluate
import kaskada as kd
import pandas
import pyarrow as pa
import transformers

# Initialize Kaskada with a local execution context.
kd.init_session()

## Read Historical Messages

Historical Slack messages can be exported by following the instructions on Slack's [Export your workspace data](https://slack.com/help/articles/201658943-Export-your-workspace-data) page. We'll use these messages to teach BeepGPT about the members of your workspace.
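To make the expected input concrete, here is a minimal standard-library sketch of the layout the loader below relies on: one directory per channel, each holding JSON files that contain a list of message objects. The sample messages, the `2021-07-01.json` file name, and the helper `load_channel_messages` are made up for illustration; a real export will contain more fields.

```python
import json
import os
import tempfile

def load_channel_messages(channel_dir):
    """Read every JSON file in one channel directory, skipping subtype rows."""
    rows = []
    for name in sorted(os.listdir(channel_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(channel_dir, name)) as f:
            for msg in json.load(f):
                # Messages with a subtype (e.g. join notices) are not conversation.
                if msg.get("subtype") is not None:
                    continue
                rows.append({k: msg.get(k) for k in ("ts", "user", "text", "thread_ts")})
    return rows

# Build a tiny, made-up export: one channel, one day, two messages.
export_dir = tempfile.mkdtemp()
channel_dir = os.path.join(export_dir, "team-api")
os.makedirs(channel_dir)
sample = [
    {"ts": "1625150000.000100", "user": "U1", "text": "hello"},
    {"ts": "1625150001.000200", "user": "U2", "text": "joined", "subtype": "channel_join"},
]
with open(os.path.join(channel_dir, "2021-07-01.json"), "w") as f:
    json.dump(sample, f)

rows = load_channel_messages(channel_dir)
print(rows)
```

Only the first message survives, because the second carries a `subtype`. The pandas-based functions in the next cell apply the same filter and column selection across every channel directory.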

In [2]:
import pandas as pd
import os

def get_file_df(json_path):
    df = pd.read_json(json_path, precise_float=True)
    # drop rows where subType is not null
    if "subtype" in df.columns:
        df = df[df["subtype"].isnull()]
    # only keep these columns
    df = df[df.columns.intersection(["ts", "user", "text", "thread_ts"])]
    return df

def get_channel_df(channel_path):
    dfs = []
    for root, dirs, files in os.walk(channel_path):
        for file in files:
            dfs.append(get_file_df(os.path.join(root, file)))
    return pd.concat(dfs, ignore_index=True)

def get_export_df(export_path):
    dfs = []
    for root, dirs, files in os.walk(export_path):
        for dir in dirs:
            df = get_channel_df(os.path.join(root, dir))
            # add channel column
            df["channel"] = dir
            dfs.append(df)
    return pd.concat(dfs, ignore_index=True)

In [3]:
path_to_slack_export = "slack-export-kaskada"

get_export_df(path_to_slack_export).to_parquet("messages.parquet")

In [42]:
# Load events from a Parquet file
# Use the "ts" column as the time associated with each row, 
# and the "channel" column as the entity associated with each row.
messages = kd.sources.Parquet(
    path = "./messages.parquet", 
    time_column = "ts", 
    key_column = "channel",
    time_unit = "s",
)

# View the first 5 events
messages.preview(5)

| | _time | _key | text | user | ts | thread_ts | reply_users | reactions | channel |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-07-01 14:26:27.342999808 | team-api | <@ULJD5H2A2> available to talk about errors? | U016TM9NXEY | 1625150000.0 | | | | team-api |
| 1 | 2021-07-01 15:18:28.343499776 | team-api | <@ULJD5H2A2> <https://gitlab.com/kaskada/kaska... | U016TM9NXEY | 1625153000.0 | | | | team-api |
| 2 | 2021-07-01 15:18:43.344000000 | team-api | Looks like Go sets the transfer encoding if yo... | ULJD5H2A2 | 1625153000.0 | | | | team-api |
| 3 | 2021-07-01 15:18:44.344300032 | team-api | working now | ULJD5H2A2 | 1625153000.0 | | | [{'count': 1, 'name': 'twinsparrot', 'users': ... | team-api |
| 4 | 2021-07-01 15:22:19.344600064 | team-api | <@U016TM9NXEY> <https://gitlab.com/kaskada/kas... | ULJD5H2A2 | 1625153000.0 | | | [{'count': 1, 'name': 'eyes', 'users': ['U016T... | team-api |


## Build examples

Fine-tuning examples teach the model which users are interested in a given conversation. Each example consists of a "prompt" containing the state of a conversation at a point in time and a "completion" containing the user (if any) who was interested in the conversation. BeepGPT can measure interest in several ways, such as replying to a message or adding an emoji reaction. The code below uses the simplest signal: the author of the next message in the conversation.
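The idea can be sketched in plain Python before expressing it in Kaskada. This is an illustration only, not the Kaskada implementation: the function `build_examples` and its parameters are hypothetical, and it assumes a single channel with messages already sorted by timestamp. A silence longer than `gap_seconds` starts a new conversation, and each example pairs up to `max_history` earlier messages with the author of the next one.

```python
def build_examples(messages, gap_seconds=600, max_history=5):
    """messages: list of (ts, user, text) tuples for one channel, sorted by ts."""
    examples = []
    history = []
    prev_ts = None
    for ts, user, text in messages:
        # A long silence ends the current conversation.
        if prev_ts is not None and ts - prev_ts > gap_seconds:
            history = []
        # Only emit an example when there is at least one earlier message.
        if history:
            examples.append({"conversation": history[-max_history:], "user": user})
        history.append(text)
        prev_ts = ts
    return examples

sample = [
    (0, "A", "hi"),
    (30, "B", "hello"),
    (60, "A", "how are you"),
    (1000, "C", "new topic"),
    (1010, "A", "ok"),
]
sketch_examples = build_examples(sample)
for ex in sketch_examples:
    print(ex)
```

The message at `ts=1000` arrives more than 600 seconds after the previous one, so it opens a new conversation and produces no example of its own.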

In [43]:
threads = messages.filter(messages.col("thread_ts").is_not_null())
non_threads = messages.filter(messages.col("thread_ts").is_null())

ts = non_threads.col("ts")
ts_since = ts.seconds_since_previous()

is_new = ts_since.cast(pa.int64()) > 600

shifted_non_threads = non_threads.shift_by(timedelta(microseconds=0.001))
shifted_ts = shifted_non_threads.lag(1).col("ts").first(window=kd.windows.Since(is_new))
thread_ts = ts.if_(is_new).else_(shifted_ts)

non_threads_threads = non_threads.extend({"thread_ts": thread_ts}).filter(ts.is_not_null().and_(thread_ts.is_not_null()))

joined = kd.record({
    "ts": threads.col("ts").else_(non_threads_threads.col("ts")),
    "text": threads.col("text").else_(non_threads_threads.col("text")),
    "user" : threads.col("user").else_(non_threads_threads.col("user")),
    "thread_ts" : threads.col("thread_ts").else_(non_threads_threads.col("thread_ts")),
    "channel" : threads.col("channel").else_(non_threads_threads.col("channel")),
})

messages = joined.with_key(kd.record({
        "channel": joined.col("channel"),
        "thread": joined.col("thread_ts"),
    }))

# Collect the text of the previous 1 to 5 messages in the same channel and thread
conversation = messages.col("text").collect(max=5, min=1).lag(1)

# add the conversation to the current row
examples = messages.extend({"conversation":conversation}).filter(conversation.is_not_null())
examples.preview(5)

| | _time | _key | conversation | ts | text | user | thread_ts | channel |
|---|---|---|---|---|---|---|---|---|
| 0 | 2021-07-01 15:18:43.344000000 | {'channel': 'team-api', 'thread': 1625152708.3... | [<@ULJD5H2A2> <https://gitlab.com/kaskada/kask... | 1625153000.0 | Looks like Go sets the transfer encoding if yo... | ULJD5H2A2 | 1625153000.0 | team-api |
| 1 | 2021-07-01 15:18:44.344300032 | {'channel': 'team-api', 'thread': 1625152708.3... | [<@ULJD5H2A2> <https://gitlab.com/kaskada/kask... | 1625153000.0 | working now | ULJD5H2A2 | 1625153000.0 | team-api |
| 2 | 2021-07-01 15:22:19.344600064 | {'channel': 'team-api', 'thread': 1625152708.3... | [<@ULJD5H2A2> <https://gitlab.com/kaskada/kask... | 1625153000.0 | <@U016TM9NXEY> <https://gitlab.com/kaskada/kas... | ULJD5H2A2 | 1625153000.0 | team-api |
| 3 | 2021-07-01 15:53:28.345799936 | {'channel': 'team-api', 'thread': 1625154770.3... | [<@ULJD5H2A2> Looks like it deployed out. The ... | 1625155000.0 | Better than it used to be though :slightly_smi... | ULJD5H2A2 | 1625155000.0 | team-api |
| 4 | 2021-07-01 15:53:58.346499840 | {'channel': 'team-api', 'thread': 1625154770.3... | [<@ULJD5H2A2> Looks like it deployed out. The ... | 1625155000.0 | <@U017T5TFW58> `api.` looking unhappy:\n```cur... | ULJD5H2A2 | 1625155000.0 | team-api |


## Create training dataset

To prepare our fine-tuning data, we'll use scikit-learn for preprocessing. A `LabelEncoder` maps each user ID to a single integer class label, and each conversation is cleaned and formatted so that it is easy for the model to learn.
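As a rough illustration of what the label encoding does, the sketch below reproduces the mapping with the standard library only: the distinct user IDs are sorted, and each ID is replaced by its position in that sorted list. The user IDs here are made up.

```python
users = ["U2", "U1", "U2", "U3"]

# Sorted unique IDs play the role of LabelEncoder.classes_.
classes = sorted(set(users))
to_label = {user: index for index, user in enumerate(classes)}

# Equivalent of transform(): user ID -> integer label.
labels = [to_label[user] for user in users]

# Equivalent of inverse_transform(): integer label -> user ID.
decoded = [classes[label] for label in labels]

print(classes, labels, decoded)
```

Saving the sorted class list to disk, as the next cell does with `labels_.json`, is what lets you turn a predicted label back into a user ID later.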

In [44]:
import json
import re

from sklearn import preprocessing

# Extract examples from historical data
examples_df = examples.run().to_pandas().drop(["_time", "_key"], axis=1)


# Encode user ID labels
le = preprocessing.LabelEncoder()
le.fit(examples_df["user"])
with open('labels_.json', 'w') as f:
    json.dump(le.classes_.tolist(), f)


# Strip Slack markup (links, user mentions, emoji codes) from each message
def strip_links_and_users(line):
    return re.sub(r"<.*?>", '', line)

def strip_emoji(line):
    return re.sub(r":.*?:", '', line)

def clean_messages(messages):
    cleaned = []
    for msg in messages:
        text = strip_links_and_users(msg)
        text = strip_emoji(text)
        text = text.strip()
        if text == "":
            continue
        cleaned.append(text)
    return cleaned

# Join the cleaned messages of a conversation into a single input text
def format_prompt(messages):
    cleaned = clean_messages(messages)
    if len(cleaned) == 0:
        return None
    cleaned.reverse()
    prompt = "\n\n".join(cleaned)
    return prompt
    
examples_df = pandas.DataFrame({
    "text": examples_df.conversation.apply(format_prompt),
    "label": le.transform(examples_df.user),
})

# Write examples to file
examples_df.dropna().to_parquet("examples.parquet")
print("Wrote examples to 'examples.parquet'")

Wrote examples to 'examples.parquet'
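It can help to check what the two cleaning patterns do on sample text. The sketch below uses the same regular expressions as `strip_links_and_users` and `strip_emoji`, on made-up messages. It also shows a caveat: the emoji pattern removes anything between two colons, including parts of clock times. `EMOJI_STRICT` is a hypothetical tighter alternative, assuming emoji names contain only lowercase letters, digits, `_`, `+`, and `-`.

```python
import re

LINK_OR_MENTION = re.compile(r"<.*?>")         # same pattern as strip_links_and_users
EMOJI_LOOSE = re.compile(r":.*?:")             # same pattern as strip_emoji
EMOJI_STRICT = re.compile(r":[a-z0-9_+\-]+:")  # assumption: tighter emoji-name pattern

def clean(line, emoji_pattern):
    """Remove links/mentions, then emoji codes, then surrounding whitespace."""
    line = LINK_OR_MENTION.sub("", line)
    line = emoji_pattern.sub("", line)
    return line.strip()

slack_line = "<@U016TM9NXEY> deployed :tada: see <https://example.com|the log>"
time_line = "meet at 10:30 or 11:00"

cleaned_slack = clean(slack_line, EMOJI_LOOSE)
loose_time = clean(time_line, EMOJI_LOOSE)
strict_time = clean(time_line, EMOJI_STRICT)

print(repr(cleaned_slack))
print(repr(loose_time))
print(repr(strict_time))
```

With the loose pattern, the text between the two colons in the time example is dropped. Whether that matters depends on how often such text appears in your workspace.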


## Fine-tune a custom model

Finally, we'll use our examples to fine-tune a pretrained Hugging Face sequence-classification model (DistilBERT by default), logging the training run to Weights & Biases.

In [7]:
from huggingface_hub import notebook_login
notebook_login()

In [8]:
%env WANDB_PROJECT=beep-gpt
import wandb
wandb.login()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.

env: WANDB_PROJECT=beep-gpt

wandb: Currently logged in as: kerinin. Use `wandb login --relogin` to force relogin

True

In [45]:
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
import evaluate
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = "distilbert-base-uncased"
#model = "roberta-base"
#model = "bert-base-uncased"
#model="gpt2-xl"

# Load data
raw_dataset = load_dataset("parquet", data_files=["./examples.parquet"])
train_test_datasets = raw_dataset["train"].train_test_split(test_size=0.2)

# Define tokenization
tokenizer = AutoTokenizer.from_pretrained(model)
def tokenize_function(examples):
    return tokenizer(examples["text"], max_length=512, truncation=True) 
tokenized_dataset = train_test_datasets.map(tokenize_function, batched=True)

# Define batch collation
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Evaluation metric
metric = evaluate.load("f1")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels, average="micro")

Downloading and preparing dataset parquet/default to /Users/ryan.michael/.cache/huggingface/datasets/parquet/default-ad8b5f5b3ee9e698/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Dataset parquet downloaded and prepared to /Users/ryan.michael/.cache/huggingface/datasets/parquet/default-ad8b5f5b3ee9e698/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.


Map:   0%|          | 0/12080 [00:00<?, ? examples/s]

Map:   0%|          | 0/3020 [00:00<?, ? examples/s]
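A note on the metric: when every example has exactly one label, micro-averaged F1 pools true positives, false positives, and false negatives across all classes, so it comes out equal to plain accuracy. The standard-library sketch below illustrates this on made-up predictions. The function `micro_f1` is a hypothetical stand-in, not the `evaluate` implementation.

```python
def micro_f1(predictions, references):
    """Micro-averaged F1: pool TP/FP/FN over all labels, then compute F1 once."""
    tp = fp = fn = 0
    for label in set(predictions) | set(references):
        for pred, ref in zip(predictions, references):
            if pred == label and ref == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif ref == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

predictions = [2, 1, 2, 0]
references = [2, 2, 2, 0]

f1 = micro_f1(predictions, references)
accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)
print(f1, accuracy)
```

If some users post far more often than others, a macro-averaged score may give a better picture of how the model does on less active users.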

In [46]:
# Configure model
model = AutoModelForSequenceClassification.from_pretrained(
    model, num_labels=len(le.classes_),
)

training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=32, 
    per_device_eval_batch_size=32,
    num_train_epochs=8,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
    report_to="wandb",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train!
trainer.train()
wandb.finish()


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [33]:
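import torch
from scipy.special import softmax

text = "It's alive! say hello everyone"

# Tokenize the message and move the tensors to the same device as the model
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to probabilities, one column per label in le.classes_
softmax(logits.cpu().numpy(), axis=1)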

array([[0.0150593 , 0.11951187, 0.6067634 , 0.01273651, 0.02346392,
        0.01693087, 0.01361045, 0.02540066, 0.0137077 , 0.01271379,
        0.0100572 , 0.01511357, 0.01235948, 0.01376194, 0.01882563,
        0.01123953, 0.00927841, 0.01208729, 0.01354164, 0.01255121,
        0.01128568]], dtype=float32)
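Each column of the output corresponds to one label, so the last step is to map the highest probabilities back to user IDs. The sketch below shows that lookup with the standard library. The logits and the three-user class list are made up, and `top_users` is a hypothetical helper. In the notebook the class list would come from `le.classes_` or `labels_.json`.

```python
import math

def softmax_list(logits):
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_users(logits, classes, k=2):
    """Return the k (user_id, probability) pairs with the highest probability."""
    probs = softmax_list(logits)
    ranked = sorted(zip(classes, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

classes = ["U1", "U2", "U3"]   # hypothetical label order
logits = [0.1, 2.0, 0.5]       # hypothetical model output for one message

best = top_users(logits, classes)
print(best)
```

In the array above, the largest probability (about 0.61) is at index 2, so the predicted user is the third entry in the saved label list.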