# Conversations

### On a per-conversation basis, predict the next person to type, based on the previous messages that weren't theirs. User IDs and HTTP links are excluded.

A "conversation" is defined as either:
* All the messages in a thread
* A collection of messages from a single channel that occur in succession. If no response is made for 10 minutes, the conversation has ended. The next message outside this window is the start of a new conversation.

For each set of messages in a "conversation":
* determine the set of users that participated in the conversation
* for each user, iterate through the messages:
  * collect sets of messages that they didn't write
  * export an example at each point where they wrote the next message
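
The steps above can be sketched on a small in-memory conversation. This is a minimal illustration, not the notebook's exact code: the `examples_for` helper and the sample messages are hypothetical, and it omits the cleaning, windowing and de-duplication that the full cell applies later.

```python
def examples_for(conversation):
    """Return (user, prior_messages_from_others) pairs for one conversation."""
    # Users in order of first appearance, de-duplicated.
    users = list(dict.fromkeys(m["user"] for m in conversation))
    examples = []
    for user in users:
        others = []  # messages seen so far that `user` did not write
        for message in conversation:
            if message["user"] != user:
                others.append(message["text"])
            elif others:
                # `user` is about to speak: export what they had seen so far.
                examples.append((user, list(others)))
    return examples


# Hypothetical three-message conversation.
conversation = [
    {"user": "A", "text": "Is the build green?"},
    {"user": "B", "text": "Yes, as of an hour ago."},
    {"user": "A", "text": "Great, shipping it."},
]

for user, context in examples_for(conversation):
    print(user, context)
```

Each pair becomes one training example: the context is the prompt and the user is the completion.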

In [None]:
# The notebook relies on the pre-1.0 openai interfaces (openai.ChatCompletion, openai.FineTune).
!pip3 install --upgrade "openai<1" file-read-backwards backoff -q

In [110]:
import json, openai, time, pandas, random, getpass
from openai import cli
from types import SimpleNamespace
from sklearn.model_selection import train_test_split
from file_read_backwards import FileReadBackwards

work_dir = "/Users/eric.pinzur/Documents/slackbot2000"
openai.api_key = getpass.getpass(prompt="Please enter your OpenAI API Key")

## Additional Data Prep

Convert the original input data into a set of "conversations".

### Split the data into "threads" and "non-threads"

Using DuckDB:

```sql
copy(
    select * from 
    read_json_auto('messages.jsonl', format='newline_delimited') 
    where thread_ts is not null 
    order by channel, thread_ts, ts
) to 'message_threads.jsonl' (FORMAT JSON);

copy(
    select * from 
    read_json_auto('messages.jsonl', format='newline_delimited') 
    where thread_ts is null 
    order by channel, ts
) to 'message_non_threads.jsonl' (FORMAT JSON);
```
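
If DuckDB is not at hand, the same split can be approximated in plain Python. The sketch below is an equivalent under stated assumptions, not part of the original pipeline: `split_threads` is a hypothetical helper, it works on already-parsed messages, and it sorts on the raw `ts` values, so those must be mutually comparable (for example, all numeric).

```python
def split_threads(messages):
    """Split messages into (threaded, non_threaded), sorted like the SQL above."""
    threaded = [m for m in messages if m.get("thread_ts") is not None]
    non_threaded = [m for m in messages if m.get("thread_ts") is None]
    threaded.sort(key=lambda m: (m["channel"], m["thread_ts"], m["ts"]))
    non_threaded.sort(key=lambda m: (m["channel"], m["ts"]))
    return threaded, non_threaded


# Hypothetical sample rows; real rows come from messages.jsonl.
messages = [
    {"channel": "C1", "ts": 30.0, "thread_ts": 10.0, "text": "reply"},
    {"channel": "C1", "ts": 20.0, "thread_ts": None, "text": "standalone"},
    {"channel": "C1", "ts": 10.0, "thread_ts": 10.0, "text": "thread root"},
]

threaded, non_threaded = split_threads(messages)
print([m["text"] for m in threaded])
print([m["text"] for m in non_threaded])
```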

### Make threads from non-threads

For non-threads, artificially group the messages into "threads". Collect messages from a channel. If there is a 10-minute gap between messages, convert the message collection to a "thread" and export.




In [9]:
in_file = open(f'{work_dir}/message_non_threads.jsonl', 'r')
out_file = open(f'{work_dir}/message_non_threads_threads.jsonl', 'w')

def write_thread(thread):
    if len(thread) > 0:
        reply_users = []
        for message in thread:
            if message["user"] not in reply_users:
                reply_users.append(message["user"])
        thread[0]["reply_users"] = reply_users
        thread_ts = thread[0]["ts"]
        for message in thread:
            message["thread_ts"] = thread_ts
            out_file.write(json.dumps(message)+"\n")

next_thread = []
current_channel = ""
last_msg_ts = None
while True:
    output_thread = False
    line = in_file.readline()

    if not line:
        break

    message = json.loads(line)

    channel = message["channel"]

    # output if channel changes in the file
    if channel != current_channel:
        current_channel = channel
        output_thread = True

    # output if message timestamp is more than 10 mins beyond last message
    if last_msg_ts and message["ts"] > last_msg_ts + 600:
        output_thread = True

    # output and reset
    if output_thread:
        write_thread(next_thread)
        next_thread = []
        last_msg_ts = None

    next_thread.append(message)
    last_msg_ts = message["ts"]

# output final thread
write_thread(next_thread)

in_file.close()
out_file.close()

### Make Conversations

Re-join the two files into a set of conversations, using DuckDB:

```sql
copy(
    select * from 
    read_json_auto(['message_non_threads_threads.jsonl', 'message_threads.jsonl'], format='newline_delimited') 
    order by channel, thread_ts, ts
) to 'message_conversations.jsonl' (FORMAT JSON);
```

## Generate Examples

This is a generative problem, so the goals for training aren't as strict.

Goals:
* the combined prompt and completion length must not exceed 2048 tokens
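
The token limit can be approximated before any upload. The sketch below uses the rough rule of thumb of about four characters per English token, which is an assumption and not an exact count; a real tokenizer would be more accurate. `estimate_tokens` and `within_budget` are hypothetical helpers.

```python
import math

MAX_TOKENS = 2048
CHARS_PER_TOKEN = 4  # rough heuristic, not an exact tokenizer


def estimate_tokens(text):
    """Estimate the token count of `text` from its character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def within_budget(example, max_tokens=MAX_TOKENS):
    """Return True if prompt + completion is estimated to fit the limit."""
    total = estimate_tokens(example["prompt"]) + estimate_tokens(example["completion"])
    return total <= max_tokens


short = {"prompt": "Is the build green?\n\n###\n\n", "completion": " 1"}
long = {"prompt": "x" * 10000 + "\n\n###\n\n", "completion": " 1"}
print(within_budget(short), within_budget(long))
```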

### Strategy

* Group messages into "conversations".
* Use a method to write examples for a "conversation".

In [114]:
# build a user map
in_file = open(f'{work_dir}/message_conversations.jsonl', 'r')
user_map = {}
user_count = 0

while True:
    line = in_file.readline()

    if not line:
        break

    message = json.loads(line)
    user = message["user"]

    # create a user_id -> user_num map
    if user not in user_map:
        user_count += 1
        user_map[user] = user_count

in_file.close()

In [115]:
user_map

{'UCZ4VJF6J': 1,
 'ULJD5H2A2': 2,
 'U017T5TFW58': 3,
 'U014ZU49HPT': 4,
 'UU5C7MNMA': 5,
 'U016TM9NXEY': 6,
 'U021KG8NMRQ': 7,
 'UCXNQ2MPV': 8,
 'U02TCUCA7PU': 9,
 'U012CLUV1KJ': 10,
 'U0332GKB9J8': 11,
 'UT62H53R6': 12,
 'U03RNK543HC': 13,
 'USLACKBOT': 14,
 'U011BDYMEG7': 15,
 'U01TZK90VSM': 16,
 'U02326Q5BG9': 17,
 'U017S2GDXF0': 18,
 'U0117AWAAEN': 19}

In [14]:
import json
import re

def strip_links_and_users(line):
    return re.sub(r"<.*?>", '', line)

def strip_emoji(line):
    return re.sub(r":.*?:", '', line)

work_dir = "/Users/eric.pinzur/Documents/slackbot2000"

in_file = open(f'{work_dir}/message_conversations.jsonl', 'r')
out_file = open(f'{work_dir}/conversation_next_message_examples.jsonl', 'w')

prompt_separator = "\n\n###\n\n"
completion_separator = ""

max_lines = 5
max_prompt_len = 5000

user_map = {
 'UCZ4VJF6J': 1,
 'ULJD5H2A2': 2,
 'U017T5TFW58': 3,
 'U014ZU49HPT': 4,
 'UU5C7MNMA': 5,
 'U016TM9NXEY': 6,
 'U021KG8NMRQ': 7,
 'UCXNQ2MPV': 8,
 'U02TCUCA7PU': 9,
 'U012CLUV1KJ': 10,
 'U0332GKB9J8': 11,
 'UT62H53R6': 12,
 'U03RNK543HC': 13,
 'USLACKBOT': 14,
 'U011BDYMEG7': 15,
 'U01TZK90VSM': 16,
 'U02326Q5BG9': 17,
 'U017S2GDXF0': 18,
 'U0117AWAAEN': 19
 }

def output_conversation_examples(conversation):
    # first clean the conversation
    cleaned = []
    for message in conversation:
        text = message["text"]
        text = strip_links_and_users(text)
        text = strip_emoji(text)
        text = text.strip()
        if text == "" or text.find("```") >= 0:
            continue
        message["text"] = text
        cleaned.append(message)

    if len(cleaned) == 0:
        return

    # next build up the set of users that participated in the conversation
    users = []
    for message in cleaned:
        user = message["user"]
        users.append(user)
    # de-duplicate users
    users = list(dict.fromkeys(users))

    # iterate on users, messages
    for user in users:
        user_num = user_map[user]
        lines = []
        last_example = ""
        for message in cleaned:
            msg_text = message["text"]
            msg_user = message["user"]
            # if message isn't from the user, append it
            if msg_user != user:
                lines.append(msg_text)
                if len(lines) > max_lines:
                    lines.pop(0)
            # else output message set, most recent first, trimmed
            elif len(lines) > 0:
                # avoid shadowing the built-in `reversed`
                recent_first = lines[::-1]
                prompt = "\n\n".join(recent_first)
                if len(prompt) > max_prompt_len:
                    prompt = prompt[0:max_prompt_len]

                example = { "prompt": f'{prompt}{prompt_separator}', "completion": f' {user_num}{completion_separator}' }
                if example != last_example:
                    out_file.write(json.dumps(example) + "\n")
                last_example = example

           
current_channel = ""
current_conversation = []
conversation_ts = None

while True:
    output_convo = False
    line = in_file.readline()

    if not line:
        break

    message = json.loads(line)
    channel = message["channel"]
    thread_ts = message["thread_ts"]
    user = message["user"]

    # output if channel changes
    if channel != current_channel:
        current_channel = channel
        output_convo = True

    # output if different thread_ts
    if conversation_ts and conversation_ts != thread_ts:
        output_convo = True

    # output and reset
    if output_convo:
        output_conversation_examples(current_conversation)
        current_conversation = []

    current_conversation.append(message)
    conversation_ts = thread_ts
   

# output final conversation
output_conversation_examples(current_conversation)

in_file.close()
out_file.close()
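
One caveat with the cleaning helpers above: the pattern `:.*?:` removes any text between two colons, so a message such as `meet 10:30 to 11:00` loses its middle. A narrower pattern is sketched below. It assumes Slack emoji shortcodes contain only lowercase letters, digits, `_`, `+` and `-`; that character set is an assumption, and `strip_emoji_strict` is a hypothetical replacement, not the function used to build the dataset.

```python
import re

# Assumed shortcode alphabet: lowercase letters, digits, "_", "+", "-".
EMOJI_SHORTCODE = re.compile(r":[a-z0-9_+\-]+:")


def strip_emoji_strict(line):
    """Remove :shortcode: style emoji without touching other colons."""
    return EMOJI_SHORTCODE.sub("", line)


print(strip_emoji_strict("nice work :tada: :+1:"))
print(strip_emoji_strict("meet 10:30 to 11:00"))
```

Timestamps with two colons, such as `10:30:00`, would still be affected by this pattern.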

### Example Cleanup Stage

Use the Chat API with few-shot learning to determine which examples might be helpful for training and which are just noise.



First, find some examples of helpful (good) and unhelpful (bad) messages. I wrote the `human.py` script to help with this.

Next, use the code below to summarize the helpful (good) messages so they use fewer tokens.

In [30]:
file = open(f'{work_dir}/conversation_next_message_examples_out_good.jsonl', 'r')
out_file = open(f'{work_dir}/conversation_next_message_examples_out_good_summarized.jsonl', 'w')

prompt_suffix = "\n\n###\n\n"

count = 0
while True:
    line = file.readline()
    count +=1

    if not line:
        break

    data = json.loads(line)

    prompt = data["prompt"].removesuffix(prompt_suffix)

    print(f'Prompt ({len(prompt)}): {prompt}')

    msgs = [
    {
        "role": "system",
        "content": "You are a helpful assistant that provides concise summaries of messages.  The response must be significantly shorter than the input"
    },
    {
        "role": "user",
        "content": "just read about this as part of Pinterests intfrastructure. It basically ingests Kafka events and guarantees they end up in exactly one partitioned file on S3. Supports writing Parquet, etc.\nCan take JSON, Proto, Avro, Thrift etc. to columns as well.\nMay not be necessary with the current less-streaming-first approach, but thought it may be useful to have on our radar if/when that becomes important again."
    },
    {
        "role": "assistant",
        "content": "Pinterest's infrastructure ingests Kafka events and ensures they are stored in one partitioned file on S3. It supports writing Parquet and can convert JSON, Proto, Avro, and Thrift to columns. It may not be immediately necessary for our current approach but could be valuable in the future."
    },
    {
        "role": "user",
        "content": "Good read for some background when we start looking at this\n\nThis article has a good summary of the difficulties (and how pinterest solved them) of writing a pipeline on both streaming and historic data"
    },
    {
        "role": "assistant",
        "content": "The Pinterest article provides insights on the challenges and solutions in building a pipeline for both streaming and historic data."
    },
    {
        "role": "user",
        "content": "Sounds good -- ideally we can let things evolve/move independently, and then just have some mechanism for making those breaking changes (auth, etc.) in a sensible way.\n\nWhat I've seen done is basically we have a page that says \"The Kaskada service supports versions X through Z of the Python Client. Versions X through Y are deprecated and will be unsupported on DATE.\"\n\nSo then we could do something like:\n\n1. Let the Python client evolve as we see fit, as long as the API is evolving in compatible ways.\n2. When certain things change that we want to eventually support exclusively (like change to auth), we can bump the major version.\n3. Then we can just list which major versions of the client the service supports.\n\nSure -- some of auth may show up as fields in the protos, but especially things on headers don't fit to that well.\n\n(Also happy to discuss more about how I've seen protobufs used)\n\nSo I *think* I'm saying that we could rely on the protos as the contract and then we don't need to keep the versions themselves as synchronized"
    },
    {
        "role": "assistant",
        "content": "ideally we can move independently, and handle breaking changes in a sensible way. we can have a map between features and supported versions. some changes that would create breaking changes are auth, protos, protobufs, api, python changes."
    },
    {
        "role": "user",
        "content": prompt
    }
    ]

    res = openai.ChatCompletion.create(
        model = "gpt-3.5-turbo",
        messages = msgs
    )
    response = res["choices"][0]["message"]["content"]

    print(f'Response ({len(response)}): {response}')

    data["prompt"] = f'{response}{prompt_suffix}'

    out_file.write(json.dumps(data) + "\n")


file.close()
out_file.close()

Prompt (411): just read about this as part of Pinterests intfrastructure. It basically ingests Kafka events and guarantees they end up in exactly one partitioned file on S3. Supports writing Parquet, etc.



Can take JSON, Proto, Avro, Thrift etc. to columns as well.

May not be necessary with the current less-streaming-first approach, but thought it may be useful to have on our radar if/when that becomes important again.
Response (292): Pinterest's infrastructure supports ingesting Kafka events and ensuring they are stored in one partitioned file on S3. It can handle various formats like JSON, Proto, Avro, and Thrift. While it may not be immediately necessary for our current approach, it is worth considering for future use.
Prompt (203): Good read for some background when we start looking at this

This article has a good summary of the difficulties (and how pinterest solved them) of writing a pipeline on both streaming and historic data:
Response (135): The article provides insights o

Next, use the summarized helpful (good) messages and the unhelpful (bad) messages with few-shot learning to figure out which training examples will actually be helpful.

First we will make a messages list with our learning examples.

In [32]:
good = open(f'{work_dir}/conversation_next_message_examples_out_good_summarized.jsonl', 'r')
bad = open(f'{work_dir}/conversation_next_message_examples_out_bad.jsonl', 'r')

messages = [{
    "role": "system",
    "content": "You are a helpful assistant. Your job is to determine if a prompt will be helpful for fine-tuning a model. All prompts start with 'start -->' and end with: '\\n\\n###\\n\\n'. You should respond 'yes' if you think the prompt has enough context to be helpful, or 'no' if not. No explanation is needed. You should only respond with 'yes' or 'no'."
}]

count = 0
while True:

    good_line = good.readline()
    bad_line = bad.readline()
    count += 1

    if not good_line or not bad_line:
        break

    good_data = json.loads(good_line)
    bad_data = json.loads(bad_line)

    messages.append({
        "role": "user",
        "content": f'start -->{good_data["prompt"]}'
    })

    messages.append({
        "role": "assistant",
        "content": "yes"
    })

    messages.append({
        "role": "user",
        "content": f'start -->{bad_data["prompt"]}'
    })

    messages.append({
        "role": "assistant",
        "content": "no"
    })

good.close()
bad.close()

Then we use those example messages to predict if a prompt will be helpful for training or not:

In [None]:
import json, openai, time, logging, backoff

logging.getLogger('backoff').addHandler(logging.StreamHandler())

file = open(f'{work_dir}/conversation_next_message_examples.jsonl', 'r')
out_file = open(f'{work_dir}/conversation_next_message_examples_cleaned.jsonl', 'a')

@backoff.on_exception(backoff.expo, (openai.error.RateLimitError, openai.error.ServiceUnavailableError))
def chat_with_backoff(**kwargs):
    time.sleep(1)
    try:
        return openai.ChatCompletion.create(**kwargs)
    except openai.error.InvalidRequestError:
        return None

count = 0
while True:
    line = file.readline()
    count +=1

    if not line:
        break

    # helpful for restarting after issue
    if count < 8243:
        continue

    data = json.loads(line)

    # skip users that have few examples
    if data["completion"] not in [" 1", " 2", " 5", " 10"]:
        continue

    prompt = data["prompt"]

    if len(prompt) < 100:
        continue

    msgs = messages.copy()

    msgs.append({
        "role": "user",
        "content": f'start -->{prompt}'
    })

    res = chat_with_backoff(
        model = "gpt-3.5-turbo",
        messages = msgs
    )
    if not res:
        continue
    response = res["choices"][0]["message"]["content"]

    print(f'Result was `{response}` for prompt: {prompt}')

    print(f'Currently processing line: {count}')

    # tolerate variations such as "Yes" or "yes."
    if response.strip().lower().startswith("yes"):
        out_file.write(line)
        out_file.flush()

file.close()
out_file.close()
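
The hard-coded `count < 8243` check above has to be edited by hand after every interruption. One alternative is to persist the last processed line number, as sketched below. The checkpoint file name and both helper functions are hypothetical and not part of the original notebook.

```python
import os
import tempfile


def load_checkpoint(path):
    """Return the last processed line number, or 0 if no checkpoint exists."""
    try:
        with open(path, "r") as f:
            return int(f.read().strip() or 0)
    except FileNotFoundError:
        return 0


def save_checkpoint(path, line_number):
    """Record the last processed line number."""
    with open(path, "w") as f:
        f.write(str(line_number))


# Demonstration with a temporary file and a stand-in for the real input.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "cleanup.checkpoint")
lines = ["a", "b", "c", "d"]

processed = []
for attempt in range(2):  # the second pass simulates a restart
    start = load_checkpoint(checkpoint_path)
    for count, line in enumerate(lines, start=1):
        if count <= start:
            continue  # already handled before the restart
        processed.append(line)
        save_checkpoint(checkpoint_path, count)
        if attempt == 0 and count == 2:
            break  # simulate an interruption after two lines

print(processed)
```

Because the output file is opened in append mode, resuming from the checkpoint adds to the earlier results instead of replacing them.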


Next, grab the examples outside the cleaned set to serve as the `nil` examples.

In [36]:
cleaned_prompts = []

with open(f'{work_dir}/conversation_next_message_examples_cleaned_cleaned.jsonl', 'r') as file:
    while True:

        line = file.readline()

        if not line:
            break

        data = json.loads(line)
        cleaned_prompts.append(data["prompt"])


with open(f'{work_dir}/conversation_next_message_examples.jsonl', 'r') as file:
    with open(f'{work_dir}/conversation_next_message_examples_nils.jsonl', 'w') as out_file:
        while True:
            line = file.readline()
            if not line:
                break
            data = json.loads(line)

            prompt = data["prompt"]
            if prompt not in cleaned_prompts:
                data["completion"] = " nil"
                out_file.write(json.dumps(data) + '\n')


Then get a random subset of `nil` examples that matches the length of `cleaned_prompts`.

In [52]:
import pandas

df = pandas.read_json(f'{work_dir}/conversation_next_message_examples_nils.jsonl', lines=True, orient='records')
df = df.sample(len(cleaned_prompts))
df.to_json(f'{work_dir}/conversation_next_message_examples_nils_sample.jsonl', lines=True, orient='records')

Then combine the `nil` and cleaned sample sets, and randomize the order.

In [54]:
import pandas

nils = pandas.read_json(f'{work_dir}/conversation_next_message_examples_nils_sample.jsonl', lines=True, orient='records')
cleaned = pandas.read_json(f'{work_dir}/conversation_next_message_examples_cleaned_cleaned.jsonl', lines=True, orient='records', dtype=False)

df = pandas.concat([nils, cleaned])
df = df.sample(frac=1)
df.to_json(f'{work_dir}/conversation_next_message_examples_joined.jsonl', lines=True, orient='records')
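
The balancing step above (down-sample the `nil` examples to the size of the cleaned set, then shuffle) can also be written without pandas. The following is a small sketch using only the standard library; `balance_and_shuffle` is a hypothetical helper, and the fixed seed is there only to make the demonstration repeatable.

```python
import random


def balance_and_shuffle(cleaned, nils, seed=0):
    """Down-sample `nils` to len(cleaned), combine, and shuffle."""
    rng = random.Random(seed)
    # Never ask for more nil examples than exist.
    sampled = rng.sample(nils, min(len(cleaned), len(nils)))
    combined = list(cleaned) + sampled
    rng.shuffle(combined)
    return combined


cleaned = [{"prompt": f"c{i}", "completion": " 1"} for i in range(3)]
nils = [{"prompt": f"n{i}", "completion": " nil"} for i in range(10)]

joined = balance_and_shuffle(cleaned, nils)
print(len(joined), sum(e["completion"] == " nil" for e in joined))
```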

## Data Verification & Split Stage

* make sure prompts end with the same suffix
* remove examples that are too long
* remove duplicated examples

Note: we aren't doing classification, so don't start a fine-tune as suggested by the output.
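
The next cell delegates these checks to the OpenAI data-preparation tool. For intuition, they can be approximated by hand as sketched below. The `verify_examples` helper is hypothetical, and the length limit is expressed in characters here purely for illustration, whereas the real limit is in tokens.

```python
PROMPT_SUFFIX = "\n\n###\n\n"


def verify_examples(examples, max_chars=5000):
    """Keep examples with the right suffix and size, dropping duplicates."""
    seen = set()
    kept = []
    for example in examples:
        key = (example["prompt"], example["completion"])
        if not example["prompt"].endswith(PROMPT_SUFFIX):
            continue  # wrong or missing separator
        if len(example["prompt"]) + len(example["completion"]) > max_chars:
            continue  # too long
        if key in seen:
            continue  # exact duplicate
        seen.add(key)
        kept.append(example)
    return kept


examples = [
    {"prompt": "hello" + PROMPT_SUFFIX, "completion": " 1"},
    {"prompt": "hello" + PROMPT_SUFFIX, "completion": " 1"},  # duplicate
    {"prompt": "no suffix", "completion": " 2"},
    {"prompt": "x" * 6000 + PROMPT_SUFFIX, "completion": " nil"},  # too long
]
print(len(verify_examples(examples)))
```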


In [55]:
args = SimpleNamespace(file=f'{work_dir}/conversation_next_message_examples_joined.jsonl', quiet=True)
cli.FineTune.prepare_data(args)

Analyzing...

- Your file contains 2410 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- There are 7 duplicated prompt-completion sets. These are rows: [1190, 1376, 1433, 1455, 1657, 1775, 1879]
- All prompts end with suffix `\n\n###\n\n`

Based on the analysis we will perform the following actions:
- [Recommended] Remove 7 duplicate rows [Y/n]: Y
- [Recommended] Would you like to split into training and validation set? [Y/n]: Y


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified files to `/Users/eric.pinzur/Documents/slackbot2000/conversation_next_message_examples_joined_prepared_train.jsonl` and `/Users/eric.pinzur/Documents/slackbot2000/conversation_next_message_exa

## Model Training Stage

* First, upload the training file to OpenAI

In [111]:
training_file_name = f'{work_dir}/conversation_next_message_examples_joined_prepared_train.jsonl'

def check_status(training_id):
    train_status = openai.File.retrieve(training_id)["status"]
    print(f'Status (training_file): {train_status} ')
    return train_status

# Upload the training dataset file to OpenAI.
training_id = cli.FineTune._get_or_upload(training_file_name, True)

# Check on the upload status of the training dataset file.
train_status = check_status(training_id)

# Poll and display the upload status once a second until the file has either
# succeeded or failed to upload.
while train_status not in ["succeeded", "failed", "processed"]:
    time.sleep(1)
    train_status = check_status(training_id)

Upload progress: 100%|██████████| 855k/855k [00:00<00:00, 1.42Git/s]


Uploaded file from /Users/eric.pinzur/Documents/slackbot2000/conversation_next_message_examples_joined_prepared_train.jsonl: file-BJQcy1YwkbWVBSqiwzac8M47
Status (training_file): uploaded 
Status (training_file): processed 


## Start a fine-tuning Job

* no validation file since this is not a classification problem

In [120]:
# This example defines a fine-tune job that creates a customized model based on davinci,
# with four passes (epochs) through the training data. No validation file is supplied.
create_args = {
    "training_file": training_id,
    "model": "davinci",
    "n_epochs": 4,
    "learning_rate_multiplier": 0.02,
    "suffix": "coversation_next_message_dav_4"
}
# Create the fine-tune job and retrieve the job ID
# and status from the response.
resp = openai.FineTune.create(**create_args)
job_id = resp["id"]
status = resp["status"]

# You can use the job ID to monitor the status of the fine-tune job.
# The fine-tune job may take some time to start and complete.
print(f'Fine-tuning model with job ID: "{job_id}"')

Fine-tuning model with job ID: "ft-uiNxwLFc8s8e23lNY4COOF70"


In [121]:
dav_job_id = "ft-eazX8z01fyfVgIGlqZq1tu8W" #eric-datastax api key
dav4_job_id = "ft-uiNxwLFc8s8e23lNY4COOF70" #eric-datastax api key
ada_job_id = "ft-gd4YPNhCsiat1VyqT12UlT9p" #ryan's api key
cur_job_id = "ft-sKxsiaHteiR1AK6QhYITIXME" #ryan's api key
cur4_job_id = "ft-e3cyDnMLxbp9O7Mr1egqnouX" #eric-datastax api key
cur8_job_id = "ft-oXnZLB4PSKcJuPMC9GzWSEQx" #eric-datastax api key

job_id = dav_job_id

## Wait for the fine-tuning to start

* Note that it can take several hours for the job to move from the `pending` state

In [115]:
# Get the status of our fine-tune job.
status = openai.FineTune.retrieve(id=job_id)["status"]

# If the job isn't yet done, poll it every 5 seconds.
if status not in ["succeeded", "failed"]:
    print(f'Job not in terminal status: {status}. Waiting.')
    while status not in ["succeeded", "failed"]:
        time.sleep(5)
        status = openai.FineTune.retrieve(id=job_id)["status"]
        print(f'Status: {status}')
else:
    print(f'Fine-tune job {job_id} finished with status: {status}')


Job not in terminal status: running. Waiting.


KeyboardInterrupt: 
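
Because a job can sit in `pending` for hours, a tight polling loop mostly produces noise and has to be interrupted by hand, as happened above. A gentler pattern is to poll with a growing delay and an overall timeout. The sketch below is generic: `get_status` is a stand-in callable (in the notebook it would wrap the `openai.FineTune.retrieve(...)["status"]` lookup), and the delays are arbitrary assumptions.

```python
import time

TERMINAL = {"succeeded", "failed"}


def wait_for_job(get_status, initial_delay=5.0, max_delay=300.0,
                 timeout=6 * 3600, sleep=time.sleep):
    """Poll `get_status` until a terminal status or until `timeout` seconds."""
    delay = initial_delay
    waited = 0.0
    status = get_status()
    while status not in TERMINAL and waited < timeout:
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # back off between polls
        status = get_status()
    return status


# Demonstration with canned statuses and a no-op sleep.
statuses = iter(["pending", "pending", "running", "succeeded"])
final = wait_for_job(lambda: next(statuses), sleep=lambda seconds: None)
print(final)
```

If the timeout is reached first, the function returns the last non-terminal status so the caller can decide whether to keep waiting.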

In [181]:
openai.FineTune.retrieve(dav4_job_id)

<FineTune fine-tune id=ft-uiNxwLFc8s8e23lNY4COOF70 at 0x291ba1f90> JSON: {
  "object": "fine-tune",
  "id": "ft-uiNxwLFc8s8e23lNY4COOF70",
  "hyperparams": {
    "n_epochs": 4,
    "batch_size": 2,
    "prompt_loss_weight": 0.01,
    "learning_rate_multiplier": 0.02
  },
  "organization_id": "org-qHJsHkK4p0Jd51STTeKAm5fV",
  "model": "davinci",
  "training_files": [
    {
      "object": "file",
      "id": "file-BJQcy1YwkbWVBSqiwzac8M47",
      "purpose": "fine-tune",
      "filename": "/Users/eric.pinzur/Documents/slackbot2000/conversation_next_message_examples_joined_prepared_train.jsonl",
      "bytes": 854884,
      "created_at": 1692115490,
      "status": "processed",
      "status_details": null
    }
  ],
  "validation_files": [],
  "result_files": [
    {
      "object": "file",
      "id": "file-vOTRsJxhVhzZdVVxP7qwcNQL",
      "purpose": "fine-tune-results",
      "filename": "compiled_results.csv",
      "bytes": 173478,
      "created_at": 1692122709,
      "status": "pro

## Check fine-tuning events

* Lets us know specifics about the fine-tuning job


In [None]:
# Get the events of our fine-tune job.
events = openai.FineTune.stream_events(id=job_id)

for event in events:
    print(event)

## Look at Training Results

* download the results file(s)

In [None]:
file_prefix = "coversation_next_message"

result = openai.FineTune.retrieve(id=job_id)
# number the output files so multiple result files don't overwrite each other
for count, result_file in enumerate(result["result_files"]):
    file_name = f'{work_dir}/{file_prefix}_{count}.csv'
    with open(file_name, 'wb') as file:
        file.write(openai.File.download(id=result_file["id"]))
    print(f'Wrote results to: {file_name}')


## Inference

In [89]:
ada_model = "ada:ft-personal:coversation-next-message-ada-2023-08-15-12-27-13"
cur_model = "curie:ft-personal:coversation-next-message-ada-2023-08-15-12-45-56"
cur4_model = "curie:ft-datastax:coversation-next-message-cur-4-2023-08-15-16-49-08"
dav_model = "davinci:ft-datastax:coversation-next-message-dav-2023-08-15-16-40-40"
dav4_model = "davinci:ft-datastax:coversation-next-message-dav-4-2023-08-15-18-05-07"
cur8_model = "curie:ft-datastax:coversation-next-message-cur-8-2023-08-15-18-01-18"


model_id = cur_model

In [74]:
cnm_train = pandas.read_json(f'{work_dir}/conversation_next_message_examples_joined_prepared_train.jsonl', lines=True)
cnm_valid = pandas.read_json(f'{work_dir}/conversation_next_message_examples_joined_prepared_valid.jsonl', lines=True)

In [75]:
i = 6
prompt = cnm_train['prompt'][i]
completion = cnm_train['completion'][i]

print(f'Prompt: {prompt}')
print(f'Completion: {completion}')
print(f'Prediction:')

openai.Completion.create(model=model_id, prompt=prompt, max_tokens=1, stop=' end', n=1, logprobs=5, temperature=0)

Prompt: Should maybe include in the release process?

As a separate question -- how come we didn't know that #119 was broken until I ran into it? Shouldn't the integration tests have picked it up? My concern is that we could have easily said "we have fixes in main, let's deploy", and would have done so, and starting Friday this will break people evaluating the product.

###


Completion:  nil
Prediction:


<OpenAIObject text_completion id=cmpl-7nnTyFMor9U9nfVcIX94tWCMG7L8O at 0x293a2f130> JSON: {
  "id": "cmpl-7nnTyFMor9U9nfVcIX94tWCMG7L8O",
  "object": "text_completion",
  "created": 1692102638,
  "model": "ada:ft-personal:coversation-next-message-ada-2023-08-15-12-27-13",
  "choices": [
    {
      "text": " nil",
      "index": 0,
      "logprobs": {
        "tokens": [
          " nil"
        ],
        "token_logprobs": [
          -1.2021104
        ],
        "top_logprobs": [
          {
            " nil": -1.2021104,
            " 1": -1.5456975,
            " 2": -2.3454945,
            " 5": -1.8755838,
            " 10": -1.4878904
          }
        ],
        "text_offset": [
          370
        ]
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 84,
    "completion_tokens": 1,
    "total_tokens": 85
  }
}

In [90]:
i = 220
prompt = cnm_valid['prompt'][i]
completion = cnm_valid['completion'][i]

print(f'Prompt: {prompt}')
print(f'Completion: {completion}')
print(f'Prediction:')

res = openai.Completion.create(model=model_id, prompt=prompt, max_tokens=1, stop=' end', n=1, logprobs=4, temperature=0)
res

Prompt: Yep it seems like almost the exact same process between the articles.

Let me know if theres anything we need to do around the lfs files. I don’t think I have a good mental model of how that stuff is used now (generic s3 bucket for file storage that all engineers have access to maybe? we could grant permission to gitlab runners for CI if necessary … but at that point, maybe gitlab lfs is the right thing?)

###


Completion:  1
Prediction:


<OpenAIObject text_completion id=cmpl-7nnnFBYEsvDJ7I0oI0ehbkJ7yh0Zy at 0x293d9d590> JSON: {
  "id": "cmpl-7nnnFBYEsvDJ7I0oI0ehbkJ7yh0Zy",
  "object": "text_completion",
  "created": 1692103833,
  "model": "curie:ft-personal:coversation-next-message-ada-2023-08-15-12-45-56",
  "choices": [
    {
      "text": " 1",
      "index": 0,
      "logprobs": {
        "tokens": [
          " 1"
        ],
        "token_logprobs": [
          -0.7076458
        ],
        "top_logprobs": [
          {
            " nil": -1.7763753,
            " 1": -0.7076458,
            " 2": -2.6300569,
            " 5": -1.5701545
          }
        ],
        "text_offset": [
          415
        ]
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 99,
    "completion_tokens": 1,
    "total_tokens": 100
  }
}

In [77]:
import math

logprobs = res["choices"][0]["logprobs"]["top_logprobs"][0]
for user in logprobs:
    print(f'User: {user}, Prob: {round(math.exp(logprobs[user])*100)}')

User:  nil, Prob: 27
User:  1, Prob: 37
User:  2, Prob: 10
User:  5, Prob: 17
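
Each `top_logprobs` value is a natural-log probability, so `math.exp` recovers the probability of that token. Because only the top few tokens are returned, the values need not sum to 1; the sketch below also renormalizes over the returned candidates. The `to_probabilities` helper is hypothetical, and the input is the `top_logprobs` entry from the curie response printed earlier.

```python
import math


def to_probabilities(top_logprobs):
    """Convert a token -> logprob mapping into raw and renormalized probabilities."""
    raw = {token.strip(): math.exp(lp) for token, lp in top_logprobs.items()}
    total = sum(raw.values())
    normalized = {token: p / total for token, p in raw.items()}
    return raw, normalized


top_logprobs = {" nil": -1.7763753, " 1": -0.7076458, " 2": -2.6300569, " 5": -1.5701545}
raw, normalized = to_probabilities(top_logprobs)

best = max(raw, key=raw.get)
print(best, round(raw[best], 2))
```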


## Analysis

In [160]:
import numpy
file_0 = 'conversation_next_message_examples_joined_prepared_valid_with_pred_cur8'

df = pandas.read_json(f'{work_dir}/{file_0}.jsonl', lines=True)
df["test"] = None
df["pred"] = None

for i in range(len(df)):
    completions = df['completion'][i].strip().split()
    df.at[i, "test"] = completions
    prediction = df['prediction'][i]
    if "choices" in prediction:
        predictions = prediction["choices"][0]["text"].strip().split()
        df.at[i, "pred"] = predictions
df = df[df.pred.notnull()]

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit([['nil', '1', '2', '5', '10']])
y_test_transformed = mlb.transform(df['test'])
y_pred_transformed = mlb.transform(df['pred'])

from sklearn.metrics import f1_score
f1 = f1_score(y_test_transformed, y_pred_transformed, average='weighted')  # Or 'micro', 'macro' based on need
f1

0.6277269450106397

In [155]:
df

Unnamed: 0,prompt,completion,prediction,test,pred
0,There may be a new one? Let me look\n\nThst al...,nil,"{'id': 'cmpl-7nspPCMEGKQhvYYiYK93ncEZnFSMp', '...",[nil],[nil]
1,Is there a way to configure k8s to automatical...,5,"{'id': 'cmpl-7nspQ65xTtaIDzezrv1aUTMpO3hGJ', '...",[5],[5]
2,Store `computePb.Table` within `SnapshotMetada...,5,"{'id': 'cmpl-7nspROlLDt3KvlWOoWZdccMeZPoeY', '...",[5],[10]
3,?\n\n###\n\n,nil,"{'id': 'cmpl-7nspRxEU2nQVbNDHgSdAbMe0zXugk', '...",[nil],[nil]
4,"approved, but left 2 minor comments\n\n###\n\n",nil,"{'id': 'cmpl-7nspSX1Bl7ZyFSgaLSuf9iiLQaJ7b', '...",[nil],[nil]
...,...,...,...,...,...
476,studio is just a blob of javascript… it can’t ...,nil,"{'id': 'cmpl-7nsqMqYBcn7vEBH1wafmthCrO1Vtj', '...",[nil],[nil]
477,"This should be an integration/CI test, right? ...",1,"{'id': 'cmpl-7nsqM6kLgzpLxyN8jxsX2OnV3V0Vk', '...",[1],[nil]
478,"Next week please, ideally after Wednesday morn...",nil,"{'id': 'cmpl-7nsqM36oNPWlOauaIgpTIKeWmnM62', '...",[nil],[nil]
479,I believe the extra row was the first row?\n\n...,nil,"{'id': 'cmpl-7nsqN7wCFkk84EVW7N3VVcFBrvsz2', '...",[nil],[nil]


In [158]:
import numpy
file_0 = 'conversation_next_message_examples_joined_prepared_valid_with_pred_cur8'

df = pandas.read_json(f'{work_dir}/{file_0}.jsonl', lines=True)
df["test"] = None
df["pred"] = None

for i in range(len(df)):
    completions = df['completion'][i].strip().split()
    df.at[i, "test"] = [(completions != ['nil'])]
    prediction = df['prediction'][i]
    if "choices" in prediction:
        predictions = prediction["choices"][0]["text"].strip().split()
        df.at[i, "pred"] = [(predictions !=  ['nil'])]
df = df[df.pred.notnull()]

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit([[False, True]])
y_test_transformed = mlb.transform(df['test'])
y_pred_transformed = mlb.transform(df['pred'])

from sklearn.metrics import f1_score
f1 = f1_score(y_test_transformed, y_pred_transformed, average='macro')  # Or 'micro', 'weighted' based on need
f1

0.7583751948726832
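
To make the `nil` versus non-`nil` score easier to interpret, precision, recall and F1 for the positive class (someone is predicted to reply) can be computed by hand. This is a plain-Python sketch with hypothetical labels; it reports the positive-class F1 only, which is not the same number as the macro-averaged score above.

```python
def binary_f1(y_true, y_pred):
    """Return (precision, recall, f1) for the positive (True) class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# True means "a user (not nil) should reply"; labels are made up.
y_true = [False, True, True, False, True]
y_pred = [False, True, False, True, True]
print(binary_f1(y_true, y_pred))
```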

In [150]:
df

Unnamed: 0,prompt,completion,prediction,test,pred
0,There may be a new one? Let me look\n\nThst al...,nil,"{'id': 'cmpl-7nsnY5vthDVTKvhWXohS7pnEPzta4', '...",[False],[False]
1,Is there a way to configure k8s to automatical...,5,"{'id': 'cmpl-7nsnaB97iEIp207T78h3fNm28F1rB', '...",[True],[True]
2,Store `computePb.Table` within `SnapshotMetada...,5,"{'id': 'cmpl-7nsnbB7p9kCQRKzwPyaEZqElkfl2r', '...",[True],[True]
3,?\n\n###\n\n,nil,"{'id': 'cmpl-7nsnct5smI3DqUAQ6Qb5kQzYOFYyt', '...",[False],[False]
4,"approved, but left 2 minor comments\n\n###\n\n",nil,"{'id': 'cmpl-7nsnctAeHYdzEv9eXzUGBUbITDMnp', '...",[False],[False]
...,...,...,...,...,...
476,studio is just a blob of javascript… it can’t ...,nil,"{'id': 'cmpl-7nsojyoUtbUqzq4ltRxTADCXxVeX4', '...",[False],[True]
477,"This should be an integration/CI test, right? ...",1,"{'id': 'cmpl-7nsojbejfTFuVsbb2vIOGjcCvbtmP', '...",[True],[False]
478,"Next week please, ideally after Wednesday morn...",nil,"{'id': 'cmpl-7nsojrbNxStMTpOb45i09kbjyhYNy', '...",[False],[False]
479,I believe the extra row was the first row?\n\n...,nil,"{'id': 'cmpl-7nsokjNZRbvMhcUFIoTSqNmKowzMJ', '...",[False],[False]


### Find high-probability user predictions in the validation data
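The cell in this section reads the first-token `top_logprobs` of each prediction, converts every log probability to a probability with `exp`, and keeps the users above `min_prob`. A minimal, self-contained sketch of that thresholding step follows. The function name `users_above_threshold` and the sample log probabilities are assumptions made up for illustration; only the `min_prob = 0.80` default and the `nil` label come from the notebook.

```python
import math


def users_above_threshold(top_logprobs, min_prob=0.80, skip=("nil",)):
    """Return {label: probability} for labels whose probability exceeds min_prob."""
    high = {}
    for token, logprob in top_logprobs.items():
        label = token.strip()  # tokens may carry leading whitespace
        if label in skip:
            continue
        prob = math.exp(logprob)  # log probability -> probability
        if prob > min_prob:
            high[label] = prob
    return high


# Hypothetical first-token log probabilities for a single prediction.
example = {" 10": -0.05, " nil": -3.2, " 5": -4.6}
result = users_above_threshold(example)
print(result)
```

The candidate probabilities for one token sum to at most 1, so any threshold above 0.5 lets at most one user through per example.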

In [191]:
import math  # needed for math.exp below

file_0 = 'conversation_next_message_examples_joined_prepared_valid_with_pred_dav4'

min_prob = 0.80

user_name = {"1": "ben", "2": "ryan", "3": "marcial", "4": "charna", "5": "eric", "6": "kevinn", "7": "tina", "8": "davor", "9": "karina", "10": "jordan", "11": "brian", "12": "janoo", "13": "theo", "15": "darci", "19": "bradley"}

example_count = 0
line_count = 0
user_example_count = {}

test = []
pred = []
with open(f'{work_dir}/{file_0}.jsonl') as file:
    for line in file:

        data = json.loads(line)
        if 'choices' not in data['prediction']:
            continue

        line_count += 1
        logprobs = data['prediction']['choices'][0]['logprobs']['top_logprobs'][0]

        high_users = {}
        for user in logprobs:
            logprob = logprobs[user]
            user = user.strip()
            if user == 'nil':
                continue
            prob = math.exp(logprob)
            if prob > min_prob:
                name = user_name[user] if user in user_name else user
                high_users[name] = prob
                if name in user_example_count:
                    user_example_count[name] += 1 
                else:
                    user_example_count[name] = 1


        if len(high_users) > 0:
            example_count += 1
            print("\nFound example with high probability:\n")

            comp_users = []
            completions = data["completion"].removesuffix("end").strip().split(" ")
            for completion in completions:
                comp_name = user_name[completion] if completion in user_name else completion
                comp_users.append(comp_name)

            prompt = data["prompt"].removesuffix('\n\n###\n\n')

            print(f'Prompt:\n{prompt}\n')
            print(f'Completions: {comp_users}')
            print(f'Predictions: {high_users}')
            test.append(comp_users)
            pred.append(list(high_users))

print()
print(f'Total Examples: {example_count} of {line_count} lines.')
print(f'User example count: {user_example_count}')

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# Only these labels are scored; transform() ignores any other label (scikit-learn warns about unknown classes).
mlb.fit([['nil', 'ben', 'ryan', 'eric', 'jordan']])
y_test_transformed = mlb.transform(test)
y_pred_transformed = mlb.transform(pred)

from sklearn.metrics import f1_score
f1_weighted = f1_score(y_test_transformed, y_pred_transformed, average='weighted')  # per-label F1 weighted by support
f1_macro = f1_score(y_test_transformed, y_pred_transformed, average='macro')  # unweighted mean of per-label F1

print(f'F1 score (weighted): {f1_weighted}')
print(f'F1 score (macro): {f1_macro}')





Found example with high probability:

Prompt:
Although, if we're generating windows, maybe it should be named differently?

`between(bool) -&gt; window` - non overlapping window between true values.

`hourly() -&gt; window` - non overlapping window between hours

`sliding(segment_window, duration) -&gt; window` -&gt; sliding window of `duration` `segmentwindow`

We could allow an implicit conversion from a boolean to a window. Then the final one could be `count(foo, window=foo.n &gt; 10)`. But... that even seems a bit weird, doesn't it since you're using a boolean predicate as a window... so maybe `since` is nice for that, and just does the *explicit* conversion of a boolean to a window?

Completions: ['jordan']
Predictions: {'jordan': 0.9513615280494057}

Found example with high probability:

Prompt:
a  one thing that occurred to me with the current incremental -- we still need to *download* all the files to get the min/max time, even if we don't need the file. Seems like that will b