# Slack Export Generation

In this notebook, we will use ChatGPT to generate a Slack export.

We will generate 2 weeks of Slack chat amongst 6 users of a team. The team will work on 4 different projects over the 2 weeks, and discuss various topics related to each project. All content will be generated.

Steps: 
0. Pre-requisites
1. Project Idea & Discussion Topic Generation
2. Work Schedule Generation
3. Chat Generation
4. Output

### 0. Pre-requisites

First, let's install the libraries we will use and initialize our OpenAI session with an API key.

In [None]:
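# openai is pinned below 1.0 because this notebook uses the legacy openai.ChatCompletion interface
%pip install -q pandas "openai<1.0" kaskada==0.6.0a4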

In [23]:
import openai
import getpass

# Initialize OpenAI
openai.api_key = getpass.getpass('OpenAI: API Key')

### 1. Project Idea & Discussion Topic Generation

I asked ChatGPT "What are some example projects that streaming technology would be helpful in?" and got back 12 ideas, including a longer text describing each idea. 

I chose 4 of the projects for my Slack data generation and built a JSON blob containing the ideas and descriptions.

I then asked ChatGPT: 
  ```
  Can you fill out the following json blob? 

  for "discussion_topics", I'd like you to provide a list of discussion topics that would be helpful for a team developing backend software that works on the current sector, project, description.

  for "technologies", I'd like you to provide a list of technologies that would be helpful for a team developing backend software that works on the current sector, project, description.

  [
      {
        "sector": "Supply Chain Management", "project": "Inventory Tracking Tool",
        "description": "Implement streaming to track inventory and shipments in real time, ensuring accurate inventory levels and timely deliveries. Be able to predict when inventories will run low in order to replenish supplies as needed to avoid shutdowns.",
        "discussion_topics": [], "technologies": []
      },
      {
        "sector": "E-Commerce", "project": "Personalized Product Recommendations",
        "description": "Utilize streaming to analyze user behavior on an e-commerce website in real time, providing personalized product recommendations based on browsing and purchase history.",
        "discussion_topics": [], "technologies": []
      },
      {
        "sector": "Financial Services", "project": "Real-Time Fraud Detection",
        "description": "Use streaming to process and analyze financial transactions in real time, identifying potential fraudulent activities and triggering alerts immediately.",
        "discussion_topics": [], "technologies": []
      },
      {
          "sector": "Telecommunications", "project": "Network Monitoring",
          "description": "Employ streaming to process network data in real time, identifying network issues, outages, or abnormal patterns for quick resolution",
        "discussion_topics": [], "technologies": []
        }
    ]
  ```

The results of this request are contained in `projects.json`.
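
Before generating the schedule, it can help to confirm that the filled-in blob has the expected shape. The following is a minimal sketch and not part of the original notebook: `validate_projects` is a hypothetical helper, the description is abbreviated, and the topic and technology strings are made-up placeholders rather than actual ChatGPT output.

```python
import json

REQUIRED_KEYS = {"sector", "project", "description", "discussion_topics", "technologies"}


def validate_projects(projects):
    """Return a list of human-readable problems found in the parsed blob."""
    if not isinstance(projects, list):
        return ["top-level value must be a list of project objects"]
    problems = []
    for i, project in enumerate(projects):
        if not isinstance(project, dict):
            problems.append(f"project {i}: expected an object")
            continue
        missing = REQUIRED_KEYS - set(project)
        if missing:
            problems.append(f"project {i}: missing keys {sorted(missing)}")
            continue
        # The schedule script pops items from these lists, so they should be non-empty lists.
        for key in ("discussion_topics", "technologies"):
            if not isinstance(project[key], list) or not project[key]:
                problems.append(f"project {i}: '{key}' should be a non-empty list")
    return problems


# Placeholder data in the same shape as the blob sent to ChatGPT.
sample = json.loads("""
[
  {
    "sector": "Telecommunications", "project": "Network Monitoring",
    "description": "Employ streaming to process network data in real time.",
    "discussion_topics": ["placeholder topic"], "technologies": ["placeholder technology"]
  }
]
""")

print(validate_projects(sample))
```

In practice you would pass the result of `json.load` on `projects.json` instead of `sample`; an empty list means no problems were found.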

### 2. Work Schedule Generation

The script below iterates over the `projects.json` file and creates a work schedule from the content.

Our team will work Monday through Friday, 8am to 5pm. For each working hour, we determine the current project, the next project, and the current discussion topic.

The output of this script is `schedule.jsonl`.
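
The working-hours rule can be sketched on its own to see how many hourly requests the schedule will contain. This is a minimal sketch: `working_hours` is a hypothetical helper that mirrors the filter used in the script, and the start and end dates are the ones the script uses.

```python
import datetime
from datetime import timedelta


def working_hours(start, end):
    """Yield hourly slots in (start, end] that fall on Mon-Fri between 08:00 and 16:00."""
    now = start
    while now < end:
        now += timedelta(hours=1)
        # weekday() is 0 for Monday, so values above 4 are Saturday and Sunday
        if now.weekday() > 4 or now.hour < 8 or now.hour >= 17:
            continue
        yield now


start = datetime.datetime.fromisoformat("2023-07-31 00:00:00")
end = datetime.datetime.fromisoformat("2023-08-11 17:00:00")
slots = list(working_hours(start, end))
print(len(slots), slots[0], slots[-1])
```

Ten weekdays with nine hourly slots each (08:00 through 16:00) give 90 slots, so the chat generation step makes 90 requests.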

In [37]:
import calendar, datetime, json
from datetime import timedelta

now = datetime.datetime.fromisoformat("2023-07-31 00:00:00")
end = datetime.datetime.fromisoformat("2023-08-11 17:00:00")

projects = {}
with open("projects.json") as in_file:
    projects = json.load(in_file)


current = 0

def project_text(project):
    if project >= 0 and project < 4:
        return f'Sector: {projects[project]["sector"]}, Name: {projects[project]["project"]}, Description: {projects[project]["description"]}'
    else:
        return None

with open('schedule.jsonl', 'w') as out_file:
    while now < end:
        now += timedelta(hours=1)

        # only work 8-5 m-f
        if now.weekday() > 4 or now.hour < 8 or now.hour >= 17:
            continue

        # get next topic
        if current >= 0 and current < 4:
            discussions = len(projects[current]["discussion_topics"])
            technologies = len(projects[current]["technologies"])
            if discussions == 0 and technologies == 0:
                current += 1
        
        if current >= 0 and current < 4:
            discussions = len(projects[current]["discussion_topics"])
            technologies = len(projects[current]["technologies"])

            if discussions > 0 and discussions >= technologies:
                topic = f'The primary discussion topic for this hour is: "{projects[current]["discussion_topics"].pop()}"'
            else:
                topic = f'The primary technology to discuss in this hour is: "{projects[current]["technologies"].pop()}"'
        else:
            topic = ""

        obj = {
            "datetime": f'{calendar.day_name[now.weekday()]} {now.strftime("%m/%d/%Y, %H:%M:%S")}',
            "timestamp": int(datetime.datetime.timestamp(now)),
            "topic": topic,
            "current_project": project_text(current),
            "next_project": project_text(current + 1),
        }
        out_file.write(json.dumps(obj) + "\n")


### 3. Chat Generation

* We will now use the OpenAI ChatCompletion API to generate our Slack conversations. We do this by providing a large set of instructions to the model.
* Most of the instructions are constant, but some change with each hourly request, based on our work schedule.
* We also send the previous request and its response, so that the model can continue conversations where it left off.
* Even though we ask for 100 or more messages in each request, we rarely get more than 50. You can experiment with changing the instructions to try to increase the number of messages returned.
* We use `gpt-3.5-turbo-16k` as the model because of its high token limit.
  * Note: `gpt-4` generally did a worse job of following the instructions, returned fewer and shorter messages, and is more expensive to use.
* Each response from the model takes 2-3 minutes to return. The total time to run the whole script is about 5 hours. The total cost is about $3 USD.
  * The script is written to append data, so you can restart it without losing any history.

The output is saved to `generated.jsonl`.
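
The way each request is assembled can be sketched as follows. This is a minimal illustration that makes no API calls: `build_messages` is a hypothetical helper, and the strings are placeholders for the real instructions, hourly prompts, and model output.

```python
def build_messages(instructions, hourly_prompt, prev_prompt=None, prev_response=None):
    """Assemble the chat messages for one hourly request.

    The constant instructions go first as the system message, followed by the
    previous request/response pair (if any), followed by the new request.
    """
    messages = [{"role": "system", "content": instructions}]
    if prev_prompt and prev_response:
        messages.append(prev_prompt)
        messages.append(prev_response)
    messages.append({"role": "user", "content": hourly_prompt})
    return messages


# Placeholder strings stand in for the real instructions and model output.
first = build_messages("constant instructions", "prompt for hour 1")
reply = {"role": "assistant", "content": "...model output for hour 1..."}
second = build_messages("constant instructions", "prompt for hour 2",
                        prev_prompt=first[-1], prev_response=reply)
print([m["role"] for m in second])
```

Only the single most recent exchange is carried forward, which keeps each request's size bounded regardless of how many hours have already been generated.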

In [40]:
# run this only once, or if you want to completely restart generation

from IPython.display import clear_output
import time

instructions = """
I'd like you to simulate a long-running slack conversation between 6 team members. The team works from 8am to 5pm GMT. The team works together developing backend software using streaming technologies.
There will be 2 channels: "General" and "Project".
In the "General" channel, team members will talk about their weekends, plan team meetings, and occasionally discuss new technologies.
In the "Project" channel, team members will discuss their current project. About 80% of the total simulation should occur in this channel. 
The project they are working on will change over time. They will mostly discuss the current project, but will occasionally discuss ideas for the next project.
Each user has a different role and experience level.
UserA is a senior engineer with expert level knowledge of Kafka & Java, beginner with Pulsar & Python.
UserB is a junior engineer with mostly Python experience, in school they had a minor in data-science.
UserC is the manager of the team, mid-senior level, knows about streaming concepts, but doesn't know any programming language well.
UserD is the PM of the team, used to be a developer, has a strong python background in data-science, but has no streaming technology experience.
UserE is senior level developer with strong Python, no-sql database experience, and despite being senior sometimes it takes them a long time to grasp new concepts. However, after they grasp the concept, they are very good at helping other engineers with it.
UserF is a principal level engineer with strong knowledge of everything related to the projects. 
Some topics can be discussed in threads. Each thread should have a unique integer id. If no one has spoken on a thread for more than 20 minutes, either a new thread should be started, or the discussion can continue in the channel (outside a thread).
The output format of each message should be a single json line, and include: the message timestamp, the channel, the userId speaking, the threadId (or 'null' if not in a thread), and then the text.
An example output for a message outside a thread: Example: {"timestamp": 1653465675, "channel": "General", "user": "UserA", "thread": null, "text": "i had a great weekend, did you?"}
An example output for a message inside a thread: Example: {"timestamp": 1653292875, "channel": "Project", "user": "UserA", "thread": 123, "text": "thanks for your help explaining that concept"}
When a discussion is on-going the team members should reply back within one minute. There should be at least 10 minute breaks between topics or discussions.
"""

schedule = {}
prompt = None
response = None
res = None
total_usage = 0
start_time = time.time()
last_line = 0

In [None]:
# feel free to stop and restart this as often as you like until it completes

current_line = 0
with open("generated.jsonl", "a") as out_file:
    with open("schedule.jsonl") as in_file:
        for line in in_file:
            current_line += 1

            # skip schedule lines that were already processed in a previous run
            if current_line <= last_line:
                continue

            schedule = json.loads(line)

            if res:
                total_usage += res["usage"]["total_tokens"]
                print(f'Previous call usage: {res["usage"]}')
                print(f'Total usage: {total_usage} tokens.')
                elapsed_time = time.time() - start_time
                # Convert elapsed_time to hours, minutes, and seconds
                hours = int(elapsed_time // 3600)
                elapsed_time %= 3600
                minutes = int(elapsed_time // 60)
                seconds = int(elapsed_time % 60)
                
                # Print the elapsed time in hours, minutes, and seconds
                print(f'Elapsed time: {hours:02}:{minutes:02}:{seconds:02}')
                print()

            msgs = [{"role": "system", "content": instructions}]

            if prompt and response:
                msgs.append(prompt)
                msgs.append(response)

            print(f'Building prompt for: {schedule["datetime"]} with topic: {schedule["topic"]}:')
        
            s = f' The current project is: "{schedule["current_project"]}"' + \
                f' The next project is: "{schedule["next_project"]}" ' + \
                schedule["topic"] + \
                f'. Please generate 100 or more messages for the next hour, starting at: {schedule["timestamp"]}. ' + \
                f' Create at least one thread. The next threadId is: {current_line}. Threads should be 5 to 10 messages long, but don\'t have to have every user' + \
                f' At least 80% of messages should be in the "Project" channel.'

            prompt = {"role": "user", "content": s}
            msgs.append(prompt)

            clear_output(wait=True)
                    
            res = openai.ChatCompletion.create(
                    model = "gpt-3.5-turbo-16k",
                    messages = msgs
                )

            response = res["choices"][0]["message"]
            out_file.write(response["content"] + "\n")
            out_file.flush()

            last_line = current_line

### 4. Output

We will use Kaskada and Pandas to convert the data to the right output format, and then save it as JSON Lines (`messages.jsonl`) for input into Kaskada.
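
Because the model's output is free text, some lines in `generated.jsonl` may not be valid JSON or may lack a field. A quick check before loading can be sketched as below; this is an assumption about what might go wrong rather than something the original notebook does, and `parse_messages` is a hypothetical helper.

```python
import json

REQUIRED_FIELDS = ("timestamp", "channel", "user", "thread", "text")


def parse_messages(lines):
    """Split raw lines into parsed messages and rejected lines."""
    valid, rejected = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are ignored rather than rejected
        try:
            msg = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)
            continue
        if isinstance(msg, dict) and all(field in msg for field in REQUIRED_FIELDS):
            valid.append(msg)
        else:
            rejected.append(line)
    return valid, rejected


raw = [
    '{"timestamp": 1653465675, "channel": "General", "user": "UserA", "thread": null, "text": "i had a great weekend, did you?"}',
    "Sure, here are the messages:",
    '{"timestamp": 1653292875, "channel": "Project", "user": "UserA"}',
    "",
]
valid, rejected = parse_messages(raw)
print(len(valid), len(rejected))
```

To check the real file, pass an open file handle for `generated.jsonl` as `lines` and inspect `rejected` before creating the Kaskada source.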

In [5]:
# Run this only once
import kaskada as kd
kd.init_session()

In [6]:
messages = await kd.sources.JsonlFile.create(
    "generated.jsonl",
    time_column = "timestamp", 
    key_column = "channel",
    time_unit = "s"
)

In [8]:
import hashlib

# re-key using thread and channel
output = messages.with_key(kd.record({
    "channel": messages.col("channel"),
    "thread": messages.col("thread"),
}))

# on per-entity basis, add thread_ts based off first ts in entity
# but only where thread is not null
output = output.extend({"thread_ts": output.col("timestamp").first().if_(output.col("thread").is_not_null())})

# convert to pandas data-frame, drop columns we don't need
df = output.to_pandas()
df.drop(columns=["_time", "_key", "thread"], inplace=True)

# Rename the 'timestamp' column to 'ts'
df.rename(columns={"timestamp": "ts"}, inplace=True)

# Convert the ts column to float64
df["ts"] = df["ts"].astype("float64")


# generate userIds for users
def get_user_id(user):
    md5_hash = hashlib.md5()
    md5_hash.update(user.encode('utf-8'))
    return md5_hash.hexdigest()[:8]

df["user"] = df["user"].apply(get_user_id)

# write the output to jsonl
df.to_json("messages.jsonl", orient="records", lines=True)
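
To make the `thread_ts` step easier to follow, here is a plain-Python sketch of the result I would expect from it: every message in a thread carries the timestamp of the first message in the same channel and thread, and messages outside a thread get no `thread_ts`. This reflects my reading of the expression above, not Kaskada's implementation, and `add_thread_ts` is a hypothetical helper.

```python
def add_thread_ts(messages):
    """Return copies of the messages with a 'thread_ts' field added."""
    first_ts = {}  # (channel, thread) -> timestamp of the first message seen
    out = []
    for msg in sorted(messages, key=lambda m: m["timestamp"]):
        row = dict(msg)
        if msg["thread"] is None:
            row["thread_ts"] = None
        else:
            key = (msg["channel"], msg["thread"])
            first_ts.setdefault(key, msg["timestamp"])
            row["thread_ts"] = first_ts[key]
        out.append(row)
    return out


# Made-up messages for illustration only.
example = [
    {"timestamp": 100, "channel": "Project", "thread": 1, "text": "starting a thread"},
    {"timestamp": 160, "channel": "Project", "thread": 1, "text": "a reply"},
    {"timestamp": 200, "channel": "General", "thread": None, "text": "not in a thread"},
]
for row in add_thread_ts(example):
    print(row["timestamp"], row["thread_ts"])
```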