# Conversations

### For a set of recent messages in a "conversation", try to predict the set of users that might interact next.

A "conversation" is defined as either:
* All the messages in a thread
* A collection of messages from a single channel that occur in succession. If no response is made for 10 minutes, the conversation has ended. The next message outside this window is the start of a new conversation.


For each set of messages in a "conversation", build a user set from the following properties:
* users that reacted to the conversation
* users that participated in the conversation
* exclude the conversation starter 
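
The user-set rules above can be sketched as a small function. This is a minimal illustration, not the exact code used later in the notebook: the name `conversation_user_set` and the toy user IDs are made up, and it assumes each message is a dict with a `user` field and an optional `reactions` list of `{"users": [...]}` entries.

```python
def conversation_user_set(conversation):
    """Return users who posted in or reacted to a conversation, minus the starter."""
    if not conversation:
        return []
    starter = conversation[0]["user"]
    seen = []
    for message in conversation:
        seen.append(message["user"])
        # "reactions" may be missing or null on messages without reactions
        for reaction in message.get("reactions") or []:
            seen.extend(reaction["users"])
    # dict.fromkeys de-duplicates while preserving first-seen order
    return [user for user in dict.fromkeys(seen) if user != starter]


# Toy conversation with hypothetical user IDs
example = [
    {"user": "U_A", "text": "Anyone seen this?", "reactions": [{"name": "eyes", "users": ["U_C"]}]},
    {"user": "U_B", "text": "Yes, looking now.", "reactions": None},
    {"user": "U_A", "text": "Thanks!"},
]
print(conversation_user_set(example))
```

The starter is dropped even if they post again later in the conversation, which matches the third rule in the list.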

In [None]:
# The openai.FineTune and openai.cli APIs used below only exist in openai < 1.0.
!pip3 install --upgrade "openai<1" file-read-backwards -q

In [136]:
import json, openai, time, pandas, random, getpass
from openai import cli
from types import SimpleNamespace
from sklearn.model_selection import train_test_split
from file_read_backwards import FileReadBackwards

work_dir = "/Users/eric.pinzur/Documents/slackbot2000"
openai.api_key = getpass.getpass(prompt="Please enter your OpenAI API Key")

## Additional Data Prep

Convert the original input data into a set of "conversations".

### Split the data into "threads" and "non-threads"

Using DuckDB:

```sql
copy(
    select * from 
    read_json_auto('messages.jsonl', format='newline_delimited') 
    where thread_ts is not null 
    order by channel, thread_ts, ts
) to 'message_threads.jsonl' (FORMAT JSON);

copy(
    select * from 
    read_json_auto('messages.jsonl', format='newline_delimited') 
    where thread_ts is null 
    order by channel, ts
) to 'message_non_threads.jsonl' (FORMAT JSON);
```
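
For readers without DuckDB, the same split can be approximated in plain Python. This is a sketch under assumptions: `split_threads` is a hypothetical helper, messages are already loaded as dicts, and timestamps are compared as floats, which may not match DuckDB's ordering if it inferred a different column type.

```python
def split_threads(messages):
    """Separate threaded from non-threaded messages and sort each group."""
    threads = [m for m in messages if m.get("thread_ts") is not None]
    non_threads = [m for m in messages if m.get("thread_ts") is None]
    # Mirrors "order by channel, thread_ts, ts" from the first query
    threads.sort(key=lambda m: (m["channel"], float(m["thread_ts"]), float(m["ts"])))
    # Mirrors "order by channel, ts" from the second query
    non_threads.sort(key=lambda m: (m["channel"], float(m["ts"])))
    return threads, non_threads


# Toy messages with made-up timestamps
toy_messages = [
    {"channel": "C1", "ts": "20.0", "thread_ts": "10.0"},
    {"channel": "C1", "ts": "5.0", "thread_ts": None},
    {"channel": "C1", "ts": "10.0", "thread_ts": "10.0"},
    {"channel": "C1", "ts": "1.0"},
]
threads, non_threads = split_threads(toy_messages)
print([m["ts"] for m in threads])
print([m["ts"] for m in non_threads])
```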

### Make threads from non-threads

For non-threads, artificially group the messages into "threads". Collect messages from a channel in timestamp order. If there is a gap of more than 10 minutes between messages, convert the collected messages to a "thread" and export them.
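
The gap rule can be shown in isolation on an in-memory list. This is a minimal sketch, assuming messages from one channel are already sorted by `ts`; `split_by_gap` and `max_gap_seconds` are illustrative names, and the default of 600 seconds matches the threshold the notebook cell applies.

```python
def split_by_gap(messages, max_gap_seconds=600):
    """Group time-ordered messages from one channel into conversations."""
    conversations = []
    current = []
    last_ts = None
    for message in messages:
        ts = float(message["ts"])
        # Start a new conversation when the silence exceeds the threshold
        if last_ts is not None and ts > last_ts + max_gap_seconds:
            conversations.append(current)
            current = []
        current.append(message)
        last_ts = ts
    if current:
        conversations.append(current)
    return conversations


# Toy timestamps in seconds: the 700-second gap after ts=300 starts a new conversation
toy = [{"ts": 0}, {"ts": 300}, {"ts": 1000}, {"ts": 1500}]
print([len(c) for c in split_by_gap(toy)])
```

Note that the gap is measured from the previous message, not from the start of the conversation, so a slow but steady exchange stays in one group.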




In [9]:
in_file = open(f'{work_dir}/message_non_threads.jsonl', 'r')
out_file = open(f'{work_dir}/message_non_threads_threads.jsonl', 'w')

def write_thread(thread):
    if len(thread) > 0:
        reply_users = []
        for message in thread:
            if message["user"] not in reply_users:
                reply_users.append(message["user"])
        thread[0]["reply_users"] = reply_users
        thread_ts = thread[0]["ts"]
        for message in thread:
            message["thread_ts"] = thread_ts
            out_file.write(json.dumps(message)+"\n")

next_thread = []
current_channel = ""
last_msg_ts = None
while True:
    output_thread = False
    line = in_file.readline()

    if not line:
        break

    message = json.loads(line)

    channel = message["channel"]

    # output if channel changes in the file
    if channel != current_channel:
        current_channel = channel
        output_thread = True

    # output if message timestamp is more than 10 mins beyond last message
    if last_msg_ts and message["ts"] > last_msg_ts + 600:
        output_thread = True

    # output and reset
    if output_thread:
        write_thread(next_thread)
        next_thread = []
        last_msg_ts = None

    next_thread.append(message)
    last_msg_ts = message["ts"]

# output final thread
write_thread(next_thread)

in_file.close()
out_file.close()

### Make Conversations

Re-join the two files into a set of conversations, using DuckDB:

```sql
copy(
    select * from 
    read_json_auto(['message_non_threads_threads.jsonl', 'message_threads.jsonl'], format='newline_delimited') 
    order by channel, thread_ts, ts
) to 'message_conversations.jsonl' (FORMAT JSON);
```

## Generate Examples

This is a Generative problem, so the goals for training aren't as strict.

Goals:
* prompt and completion length must not be longer than 2048 tokens
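
A quick pre-check of the length goal can be sketched without a tokenizer. This is a rough estimate only: the 4-characters-per-token ratio is a rule of thumb assumed here for English text, not an exact count, and `estimated_tokens` and `within_budget` are hypothetical helpers. A real tokenizer would give precise numbers.

```python
MAX_TOKENS = 2048
CHARS_PER_TOKEN = 4  # assumed rule of thumb for English text, not an exact ratio


def estimated_tokens(text):
    """Estimate the token count of a string using ceiling division."""
    return -(-len(text) // CHARS_PER_TOKEN)


def within_budget(example, max_tokens=MAX_TOKENS):
    """Check whether prompt plus completion likely fits the token limit."""
    total = estimated_tokens(example["prompt"]) + estimated_tokens(example["completion"])
    return total <= max_tokens


short_example = {"prompt": "start -> hello\n\n###\n\n", "completion": " 3 end"}
long_example = {"prompt": "x" * 10000, "completion": " nil end"}
print(within_budget(short_example), within_budget(long_example))
```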

### Strategy

* Group messages into "conversations".
* Use a method to write examples for a "conversation"

In [114]:
# build a user map
in_file = open(f'{work_dir}/message_conversations.jsonl', 'r')
user_map = {}
user_count = 0

while True:
    line = in_file.readline()

    if not line:
        break

    message = json.loads(line)
    user = message["user"]

    # create a user_id -> user_num map
    if user not in user_map:
        user_count += 1
        user_map[user] = user_count

in_file.close()

In [115]:
user_map

{'UCZ4VJF6J': 1,
 'ULJD5H2A2': 2,
 'U017T5TFW58': 3,
 'U014ZU49HPT': 4,
 'UU5C7MNMA': 5,
 'U016TM9NXEY': 6,
 'U021KG8NMRQ': 7,
 'UCXNQ2MPV': 8,
 'U02TCUCA7PU': 9,
 'U012CLUV1KJ': 10,
 'U0332GKB9J8': 11,
 'UT62H53R6': 12,
 'U03RNK543HC': 13,
 'USLACKBOT': 14,
 'U011BDYMEG7': 15,
 'U01TZK90VSM': 16,
 'U02326Q5BG9': 17,
 'U017S2GDXF0': 18,
 'U0117AWAAEN': 19}

In [21]:
in_file = open(f'{work_dir}/message_conversations.jsonl', 'r')
out_file = open(f'{work_dir}/conversation_user_examples.jsonl', 'w')

# gets a list of all users that reacted to a message
def get_reaction_users(message):
    users = []
    if message["reactions"]:
        for reaction in message["reactions"]:
            users.extend(reaction["users"])
    return users

prompt_separator = "\n\n###\n\n"
completion_separator = " end"

def output_conversation_example(conversation):
    if len(conversation) > 0:
        prompt_lines = []
        initial_user = conversation[0]["user"]
        users = []
        for message in conversation:
            user = message["user"]
            users.append(user)
            users.extend(get_reaction_users(message))
            
            text = message["text"]
            if text == "" or text.find("```") >= 0:
                continue

            prompt_lines.append(f' {user} --> {text} ')

        if len(prompt_lines) == 0:
            return
        
        prompt = "\n\n".join(prompt_lines)

        # de-duplicate users in the list.
        users = list(dict.fromkeys(users))

        # remove the conversation starter
        if initial_user in users:
            users.remove(initial_user)

        # convert user_ids to user_nums
        user_nums = []
        for user in users:
            if user in user_map:
                user_nums.append(f'{user_map[user]}')

        completion = " ".join(user_nums) if len(user_nums) > 0 else "nil"
        
        example = { "prompt": f'start -> {prompt}{prompt_separator}', "completion": f' {completion}{completion_separator}' }
        out_file.write(json.dumps(example) + "\n")

           
current_channel = ""
current_conversation = []
conversation_ts = None

while True:
    output_convo = False
    line = in_file.readline()

    if not line:
        break

    message = json.loads(line)
    channel = message["channel"]
    thread_ts = message["thread_ts"]
    user = message["user"]



    # output if channel changes
    if channel != current_channel:
        current_channel = channel
        output_convo = True

    # output if different thread_ts
    if conversation_ts and conversation_ts != thread_ts:
        output_convo = True

    # output and reset
    if output_convo:
        output_conversation_example(current_conversation)
        current_conversation = []

    current_conversation.append(message)
    conversation_ts = thread_ts
   

# output final conversation
output_conversation_example(current_conversation)

in_file.close()
out_file.close()

## Data Verification & Split Stage

* make sure prompts end with the same suffix
* remove examples that are too long
* remove duplicated examples

Note: we aren't doing classification, so don't start a fine-tune as the tool's output suggests.
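
The three checks can also be sketched by hand, which helps when inspecting what the preparation tool will drop. This is an illustration under assumptions: `verify_examples` is a hypothetical helper, and `max_chars` is a character-based stand-in for the real 2048-token limit.

```python
PROMPT_SUFFIX = "\n\n###\n\n"


def verify_examples(examples, max_chars=8000):
    """Drop duplicates and over-long examples; report prompts missing the suffix."""
    seen = set()
    kept = []
    bad_suffix = []
    for index, example in enumerate(examples):
        key = (example["prompt"], example["completion"])
        if key in seen:
            continue  # exact duplicate of an earlier prompt-completion pair
        seen.add(key)
        if not example["prompt"].endswith(PROMPT_SUFFIX):
            bad_suffix.append(index)  # reported, but still kept
        if len(example["prompt"]) + len(example["completion"]) > max_chars:
            continue  # too long (character-based approximation)
        kept.append(example)
    return kept, bad_suffix


toy_examples = [
    {"prompt": "hi" + PROMPT_SUFFIX, "completion": " 3 end"},
    {"prompt": "hi" + PROMPT_SUFFIX, "completion": " 3 end"},
    {"prompt": "no suffix", "completion": " nil end"},
    {"prompt": "x" * 9000 + PROMPT_SUFFIX, "completion": " 1 end"},
]
kept, bad_suffix = verify_examples(toy_examples)
print(len(kept), bad_suffix)
```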


In [135]:
args = SimpleNamespace(file=f'{work_dir}/conversation_user_stripped_examples.jsonl', quiet=True)
cli.FineTune.prepare_data(args)

Analyzing...

- Your file contains 4642 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- There are 3 duplicated prompt-completion sets. These are rows: [971, 1907, 3985]
- There are 2 examples that are very long. These are rows: [111, 1618]
For conditional generation, and for classification the examples shouldn't be longer than 2048 tokens.
- All prompts end with suffix `\n\n###\n\n`

Based on the analysis we will perform the following actions:
- [Recommended] Remove 3 duplicate rows [Y/n]: Y
- [Recommended] Remove 2 long examples [Y/n]: Y
The indices of the long examples has changed as a result of a previously applied recommendation.
The 2 long examples to be dropped are now at the following indic

## Model Training Stage

* First, upload the training file to OpenAI

In [139]:
training_file_name = f'{work_dir}/conversation_user_stripped_examples_prepared_train.jsonl'

def check_status(training_id):
    train_status = openai.File.retrieve(training_id)["status"]
    print(f'Status (training_file): {train_status} ')
    return (train_status)

# Upload the training dataset file to OpenAI (no validation file is used).
training_id = cli.FineTune._get_or_upload(training_file_name, True)

# Check on the upload status of the training dataset file.
train_status = check_status(training_id)

# Poll and display the upload status once a second until the file has either
# been processed or failed.
while train_status not in ["succeeded", "failed", "processed"]:
    time.sleep(1)
    train_status = check_status(training_id)

Upload progress: 100%|██████████| 2.11M/2.11M [00:00<00:00, 2.39Git/s]


Uploaded file from /Users/eric.pinzur/Documents/slackbot2000/conversation_user_stripped_examples_prepared_train.jsonl: file-SRadZhsiDk2ioHlNvIIRk7Ye
Status (training_file): uploaded 
Status (training_file): uploaded 
Status (training_file): uploaded 
Status (training_file): uploaded 
Status (training_file): uploaded 
Status (training_file): processed 


## Start a fine-tuning Job

* no validation file since this is not a classification problem

In [140]:
# This example defines a fine-tune job that creates a customized model based on davinci,
# with two passes (epochs) through the training data. No validation file is supplied,
# so no classification-specific metrics are computed.
create_args = {
    "training_file": training_id,
    "model": "davinci",
    "n_epochs": 2,
    "learning_rate_multiplier": 0.02,
    "suffix": "coversation_users_full_kaskada_c"
}
# Create the fine-tune job and retrieve the job ID
# and status from the response.
resp = openai.FineTune.create(**create_args)
job_id = resp["id"]
status = resp["status"]

# You can use the job ID to monitor the status of the fine-tune job.
# The fine-tune job may take some time to start and complete.
print(f'Fine-tuning model with job ID: "{job_id}"')

Fine-tuning model with job ID: "ft-JXAXQtHd74y6fIDmeQyti3ZP"


In [110]:
job_id = "ft-qnHHxoNViVDplGDgIGn0kEC9"

## Wait for the fine-tuning to start

* Note that it can take several hours for the job to move from the `pending` state
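
Because the wait can be long, a polling loop with a timeout may be easier to live with than one that runs until interrupted. This is a generic sketch: `wait_for_terminal_status` is a hypothetical helper that takes any zero-argument status function, so it is demonstrated here with a fake status sequence rather than a real API call.

```python
import time


def wait_for_terminal_status(get_status, terminal=("succeeded", "failed"),
                             interval=5.0, timeout=None, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status or the timeout passes."""
    waited = 0.0
    status = get_status()
    while status not in terminal:
        if timeout is not None and waited >= timeout:
            raise TimeoutError(f"still '{status}' after {waited} seconds")
        sleep(interval)
        waited += interval
        status = get_status()
    return status


# Demonstration with a fake status sequence and a no-op sleep
fake_statuses = iter(["pending", "running", "succeeded"])
print(wait_for_terminal_status(lambda: next(fake_statuses), sleep=lambda seconds: None))
```

With the legacy client used in this notebook, the status function would be something like `lambda: openai.FineTune.retrieve(id=job_id)["status"]`.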

In [134]:
# Get the status of our fine-tune job.
status = openai.FineTune.retrieve(id=job_id)["status"]

# If the job isn't yet done, poll it every 5 seconds.
if status not in ["succeeded", "failed"]:
    print(f'Job not in terminal status: {status}. Waiting.')
    while status not in ["succeeded", "failed"]:
        time.sleep(5)
        status = openai.FineTune.retrieve(id=job_id)["status"]
        print(f'Status: {status}')
else:
    print(f'Fine-tune job {job_id} finished with status: {status}')


Job not in terminal status: pending. Waiting.
Status: pending


KeyboardInterrupt: 

## Check fine-tuning events

* Lets us know specifics about the fine-tuning job


In [133]:
# Get the events of our fine-tune job.
events = openai.FineTune.stream_events(id=job_id)

for event in events:
    print(event)

{
  "object": "fine-tune-event",
  "level": "info",
  "message": "Created fine-tune: ft-qnHHxoNViVDplGDgIGn0kEC9",
  "created_at": 1691662984
}


KeyboardInterrupt: 

## Look at Training Results

* download the results file(s)

In [29]:
file_prefix = "coversation_users"

result = openai.FineTune.retrieve(id=job_id)
# Number the output files so multiple result files don't overwrite each other.
for count, result_file in enumerate(result["result_files"]):
    file_name = f'{work_dir}/{file_prefix}_{count}.csv'
    with open(file_name, 'wb') as file:
        file.write(openai.File.download(id=result_file["id"]))
    print(f'Outputted results to: {file_name}')


Outputted results to: /Users/eric.pinzur/Documents/slackbot2000/coversation_users_0.csv


## Inference

In [30]:
model_id = "davinci:ft-personal:coversation-users-full-kaskada-2023-08-05-14-25-30"

In [32]:
cu_train = pandas.read_json(f'{work_dir}/conversation_user_examples_prepared_train.jsonl', lines=True)
cu_valid = pandas.read_json(f'{work_dir}/conversation_user_examples_prepared_valid.jsonl', lines=True)
i = 0

In [48]:
# interesting i's: 6, 8, 78, 79, 80
i


8

In [None]:
#i += 1
i = 6
prompt = cu_train['prompt'][i]
completion = cu_train['completion'][i]

print(f'Prompt: {prompt}')
print(f'Completion: {completion}')
print(f'Prediction:')

openai.Completion.create(model=model_id, prompt=prompt, max_tokens=1, stop=' end', n=1, logprobs=5, temperature=0)

In [37]:
#i += 1
i = 220
prompt = cu_valid['prompt'][i]
completion = cu_valid['completion'][i]

print(f'Prompt: {prompt}')
print(f'Completion: {completion}')
print(f'Prediction:')

res = openai.Completion.create(model=model_id, prompt=prompt, max_tokens=1, stop=' end', n=1, logprobs=4, temperature=0)
res

Prompt: start ->  UCZ4VJF6J --> <@UU5C7MNMA> <@U017T5TFW58> Following up on the "export" use case -- worst case (if tempo doesn't have it) we could do something like having a little job (maybe skycat?) that finds bug reports with trace IDs and uses the Grafana query API to retrieve the trace (which returns it as OTEL JSON) and then attaches that to the bug. Then we'd have that trace *data* forever (and could always figure out how to visualize it separately):

<https://grafana.com/docs/tempo/latest/api_docs/#query> 

 U017T5TFW58 --> love the idea! 

###


Completion:  3 end
Prediction:


<OpenAIObject text_completion id=cmpl-7lHH5djssRnShCdPoYhxMwqR80ipK at 0x16cec7540> JSON: {
  "id": "cmpl-7lHH5djssRnShCdPoYhxMwqR80ipK",
  "object": "text_completion",
  "created": 1691502175,
  "model": "davinci:ft-personal:coversation-users-full-kaskada-2023-08-05-14-25-30",
  "choices": [
    {
      "text": " 5",
      "index": 0,
      "logprobs": {
        "tokens": [
          " 5"
        ],
        "token_logprobs": [
          -0.11093157
        ],
        "top_logprobs": [
          {
            " 13": -3.8975945,
            " 3": -2.7242737,
            " 5": -0.11093157,
            " 10": -5.4023495
          }
        ],
        "text_offset": [
          553
        ]
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 164,
    "completion_tokens": 1,
    "total_tokens": 165
  }
}

In [38]:
import math

logprobs = res["choices"][0]["logprobs"]["top_logprobs"][0]
for user in logprobs:
    print(f'User: {user}, Prob: {round(math.exp(logprobs[user])*100)}')

User:  13, Prob: 2
User:  3, Prob: 7
User:  5, Prob: 89
User:  10, Prob: 0


In [81]:
user_map


{'UCZ4VJF6J': 1,
 'ULJD5H2A2': 2,
 'U017T5TFW58': 3,
 'U014ZU49HPT': 4,
 'UU5C7MNMA': 5,
 'U016TM9NXEY': 6,
 'U021KG8NMRQ': 7,
 'UCXNQ2MPV': 8,
 'U02TCUCA7PU': 9,
 'U012CLUV1KJ': 10,
 'U0332GKB9J8': 11,
 'UT62H53R6': 12,
 'U03RNK543HC': 13,
 'USLACKBOT': 14,
 'U011BDYMEG7': 15,
 'U01TZK90VSM': 16,
 'U02326Q5BG9': 17,
 'U017S2GDXF0': 18,
 'U0117AWAAEN': 19}

## Analysis

In [96]:
df = pandas.read_json(f'{work_dir}/conversation_user_examples_prepared_valid_with_pred.jsonl', lines=True)
df["test"] = None
df["pred"] = None

for i in range(len(df)):
    completions = df['completion'][i].removesuffix(" end").strip().split()
    df.at[i, "test"] = completions
    prediction = df['prediction'][i]
    if "choices" in prediction:
        predictions = prediction["choices"][0]["text"].strip().split()
        df.at[i, "pred"] = predictions
df = df[df.pred.notnull()]
df

Unnamed: 0,prompt,completion,prediction,test,pred
0,start -> UCZ4VJF6J --> <https://venturebeat.c...,3 4 end,"{'id': 'cmpl-7lcs8hQ8cAsW6sUrmkWV5eNfvMHfO', '...","[3, 4]",[4]
1,start -> UCZ4VJF6J --> <https://medium.com/th...,nil end,"{'id': 'cmpl-7lcs9Kl5ysRTUHy7myMDpGRc03xjT', '...",[nil],[nil]
2,start -> UU5C7MNMA --> <https://blog.spiceai....,nil end,"{'id': 'cmpl-7lcs9TgyNbYUQLqwbLkAfV7TmmjY4', '...",[nil],[nil]
3,start -> U014ZU49HPT --> This blog by Adobe g...,nil end,"{'id': 'cmpl-7lcs9tK3fKxzTQRvZ7gvBD6lADmXe', '...",[nil],[nil]
4,start -> ULJD5H2A2 --> Ran across a query lan...,nil end,"{'id': 'cmpl-7lcsAgSRXYIUYoj8H8soTGgLHPRll', '...",[nil],[nil]
...,...,...,...,...,...
988,start -> UCZ4VJF6J --> <@UU5C7MNMA> FYI I'm i...,5 end,"{'id': 'cmpl-7lcsiVYiHV8TO0ihtRi2f329h9g42', '...",[5],[nil]
989,start -> UU5C7MNMA --> <@UCZ4VJF6J> have you ...,1 end,"{'id': 'cmpl-7lcsiiuzLAcE2w40upYhZE3jvxcVq', '...",[1],[1]
991,start -> UCZ4VJF6J --> <@ULJD5H2A2> from read...,2 end,"{'id': 'cmpl-7lcsjdzewSRjvKlmTQmXTEuz81woz', '...",[2],[2]
992,start -> UCZ4VJF6J --> <@U03RNK543HC> <@U012C...,nil end,"{'id': 'cmpl-7lcsjiQSb6NRapu3c2jc9lPS2OepS', '...",[nil],[nil]


In [100]:
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit([['nil', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19']])
y_test_transformed = mlb.transform(df['test'])
y_pred_transformed = mlb.transform(df['pred'])

from sklearn.metrics import f1_score
f1 = f1_score(y_test_transformed, y_pred_transformed, average='weighted', zero_division=np.nan)  # or 'micro' / 'macro', depending on need
f1

0.7411962064681381
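
To make the weighted score easier to interpret, here is a hand-rolled sketch of support-weighted F1 over label sets. It is intended to mirror the idea behind `average='weighted'` (per-label F1, weighted by how often each label appears in the true sets), but `weighted_f1` is a hypothetical helper and the toy labels are made up, so treat it as an illustration rather than a drop-in replacement.

```python
from collections import Counter


def weighted_f1(true_sets, pred_sets):
    """Average per-label F1, weighted by each label's count in the true sets."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_sets, pred_sets):
        truth, pred = set(truth), set(pred)
        for label in truth & pred:
            tp[label] += 1
        for label in pred - truth:
            fp[label] += 1
        for label in truth - pred:
            fn[label] += 1
    total_support = 0
    weighted_sum = 0.0
    # Only labels that occur in the true sets carry weight
    for label in set(tp) | set(fn):
        support = tp[label] + fn[label]
        f1 = 2 * tp[label] / (2 * tp[label] + fp[label] + fn[label])
        weighted_sum += support * f1
        total_support += support
    return weighted_sum / total_support if total_support else float("nan")


# Toy data in the same shape as the "test" and "pred" columns
y_true = [["3", "4"], ["nil"], ["5"]]
y_pred = [["4"], ["nil"], ["nil"]]
print(round(weighted_f1(y_true, y_pred), 3))
```

Because `nil` is treated as an ordinary label, a model that predicts `nil` often can score well on this metric even when it rarely names a specific user.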

In [63]:
df["pred"].unique()

array(['4', 'nil', '3', '6', '5', '1', '2', '10', '11'], dtype=object)

In [98]:
mlb.classes_

array(['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19',
       '2', '3', '4', '5', '6', '7', '8', '9', 'nil'], dtype=object)

In [112]:
import json

conversion_map = {}
with open(f'{work_dir}/conversation_users_conversion_map.json') as file:
    conversion_map = json.load(file)

def convert_users_in_line(line):
    for in_user in conversion_map:
        out_user = conversion_map[in_user]

        line = line.replace(in_user, out_user)
    return line


with open(f'{work_dir}/conversation_user_examples_prepared_train.jsonl', 'r') as in_file:
    with open(f'{work_dir}/conversation_user_examples_prepared_train_converted.jsonl', 'w') as out_file:
        while True:
            line = in_file.readline()

            if not line:
                break

            out_file.write(convert_users_in_line(line))


#### (b) new train set with prompts without userIds at the start

In [129]:
in_file = open(f'{work_dir}/message_conversations.jsonl', 'r')
out_file = open(f'{work_dir}/conversation_user_noprompts_examples.jsonl', 'w')

# gets a list of all users that reacted to a message
def get_reaction_users(message):
    users = []
    if message["reactions"]:
        for reaction in message["reactions"]:
            users.extend(reaction["users"])
    return users

prompt_separator = "\n\n###\n\n"
completion_separator = " end"

def output_conversation_example(conversation):
    if len(conversation) > 0:
        prompt_lines = []
        initial_user = conversation[0]["user"]
        users = []
        for message in conversation:
            user = message["user"]
            users.append(user)
            users.extend(get_reaction_users(message))
            
            text = message["text"]
            if text == "" or text.find("```") >= 0:
                continue

            prompt_lines.append(f' {text} ')

        if len(prompt_lines) == 0:
            return
        
        prompt = "\n\n".join(prompt_lines)

        prompt = convert_users_in_line(prompt).strip()

        # de-duplicate users in the list.
        users = list(dict.fromkeys(users))

        # remove the conversation starter
        if initial_user in users:
            users.remove(initial_user)

        # convert user_ids to user_nums
        user_nums = []
        for user in users:
            if user in user_map:
                user_nums.append(f'{user_map[user]}')

        completion = " ".join(user_nums) if len(user_nums) > 0 else "nil"
        
        example = { "prompt": f'{prompt}{prompt_separator}', "completion": f' {completion}{completion_separator}' }
        out_file.write(json.dumps(example) + "\n")

           
current_channel = ""
current_conversation = []
conversation_ts = None

while True:
    output_convo = False
    line = in_file.readline()

    if not line:
        break

    message = json.loads(line)
    channel = message["channel"]
    thread_ts = message["thread_ts"]
    user = message["user"]

    # output if channel changes
    if channel != current_channel:
        current_channel = channel
        output_convo = True

    # output if different thread_ts
    if conversation_ts and conversation_ts != thread_ts:
        output_convo = True

    # output and reset
    if output_convo:
        output_conversation_example(current_conversation)
        current_conversation = []

    current_conversation.append(message)
    conversation_ts = thread_ts
   

# output final conversation
output_conversation_example(current_conversation)

in_file.close()
out_file.close()

#### (c) new train set with all userIds and weblinks removed from prompts

In [126]:
import re

def strip_links_and_users(line):
    return re.sub(r"<.*?>", '', line)

strip_links_and_users("<@U052RFMBRF0> I faintly remember something about putting metrics into <https:\/\/gitlab.com\/kaskada\/kaskada\/-\/blob\/main\/wren\/docker-compose.yml#L61> \n\n Something I can help wit")

' I faintly remember something about putting metrics into  \n\n Something I can help wit'

In [132]:
in_file = open(f'{work_dir}/message_conversations.jsonl', 'r')
out_file = open(f'{work_dir}/conversation_user_stripped_examples.jsonl', 'w')

# gets a list of all users that reacted to a message
def get_reaction_users(message):
    users = []
    if message["reactions"]:
        for reaction in message["reactions"]:
            users.extend(reaction["users"])
    return users

prompt_separator = "\n\n###\n\n"
completion_separator = " end"

def output_conversation_example(conversation):
    if len(conversation) > 0:
        prompt_lines = []
        initial_user = conversation[0]["user"]
        users = []
        for message in conversation:
            user = message["user"]
            users.append(user)
            users.extend(get_reaction_users(message))
            
            text = message["text"]
            if text == "" or text.find("```") >= 0:
                continue

            prompt_lines.append(f' {text} ')

        if len(prompt_lines) == 0:
            return
        
        prompt = "\n\n".join(prompt_lines)

        prompt = strip_links_and_users(prompt).strip()

        if len(prompt) < 25:
            return

        # de-duplicate users in the list.
        users = list(dict.fromkeys(users))

        # remove the conversation starter
        if initial_user in users:
            users.remove(initial_user)

        # convert user_ids to user_nums
        user_nums = []
        for user in users:
            if user in user_map:
                user_nums.append(f'{user_map[user]}')

        completion = " ".join(user_nums) if len(user_nums) > 0 else "nil"
        
        example = { "prompt": f'{prompt}{prompt_separator}', "completion": f' {completion}{completion_separator}' }
        out_file.write(json.dumps(example) + "\n")

           
current_channel = ""
current_conversation = []
conversation_ts = None

while True:
    output_convo = False
    line = in_file.readline()

    if not line:
        break

    message = json.loads(line)
    channel = message["channel"]
    thread_ts = message["thread_ts"]
    user = message["user"]

    # output if channel changes
    if channel != current_channel:
        current_channel = channel
        output_convo = True

    # output if different thread_ts
    if conversation_ts and conversation_ts != thread_ts:
        output_convo = True

    # output and reset
    if output_convo:
        output_conversation_example(current_conversation)
        current_conversation = []

    current_conversation.append(message)
    conversation_ts = thread_ts
   

# output final conversation
output_conversation_example(current_conversation)

in_file.close()
out_file.close()

### Find High User Probs inside validation data

In [163]:
min_prob = 0.60

user_name = {"1": "ben", "2": "ryan", "3": "marcial", "4": "charna", "5": "eric", "6": "kevinn", "7": "tina", "8": "davor", "9": "karina", "10": "jordan", "11": "brian", "12": "janoo", "13": "theo", "15": "darci", "19": "bradley"}

file_b ="conversation_user_noprompts_examples_prepared_valid_with_pred"
file_c ="conversation_user_stripped_examples_prepared_valid_with_pred"

example_count = 0
line_count = 0
user_example_count = {}
with open(f'{work_dir}/{file_c}.jsonl') as file:
    while True:
        line = file.readline()

        if not line:
            break

        data = json.loads(line)
        if 'choices' not in data['prediction']:
            continue

        line_count += 1
        logprobs = data['prediction']['choices'][0]['logprobs']['top_logprobs'][0]

        high_users = {}
        for user in logprobs:
            logprob = logprobs[user]
            user = user.strip()
            if user == 'nil':
                continue
            prob = math.exp(logprob)
            if prob > min_prob:
                # fall back to the raw token for users missing from user_name
                name = user_name.get(user, user)
                high_users[name] = prob
                if name in user_example_count:
                    user_example_count[name] += 1 
                else:
                    user_example_count[name] = 1


        if len(high_users) > 0:
            example_count += 1
            print("\nFound example with high probability:\n")

            comp_users = []
            completions = data["completion"].removesuffix("end").strip().split(" ")
            for completion in completions:
                comp_name = user_name[completion] if completion in user_name else completion
                comp_users.append(comp_name)

            prompt = data["prompt"].removesuffix('\n\n###\n\n')

            print(f'Prompt:\n{prompt}\n')
            print(f'Completions: {comp_users}')
            print(f'Predictions: {high_users}')

print()
print(f'Total Examples: {example_count} of {line_count} lines.')
print(f'User example count: {user_example_count}')



Found example with high probability:

Prompt:
Had some fun with rust over the weekend:  

 Any thoughts on async and how hard it will be to replace threads? I think getting off channels simplifies a few things. 

 Honestly I haven’t looked any further than the merger. I will say that working with the async syntax is really nice, and streams behave like a regular iterator which is also nice. I spent some time trying to do this with `StreamExt` rather than dropping down to `poll_next` - I don’t think it’s possible but the stream combinators felt very natural. 

 Being able to us rusoto to read from S3 would also be a big improvement 

 Ironically, I built a janky version of streams myself back in the days before async was a thing:  

 haha 

 did you run into any places that you needed to deal with pinning? 

 I’m still a little hazy on pinning. AFAICT everything needs to be pinned, but it hasn’t caused problems anywhere yet. The merger expects its inputs to be box-pinned. I’m using `pi