Let users ask for specific types of notifications, for example:

“Hey BeepGPT, let me know when there’s an important engineering decision being made about Kaskada.”

This lets us be more explicit about why we’re alerting (i.e. “you asked for important engineering decisions”),
and should simplify the vector search problem. As a first pass, we should be able to embed each user’s requests
and apply a vector distance threshold against each conversation to decide if we should alert.
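
As a rough illustration of the thresholding idea, here is a minimal sketch. It assumes cosine distance as the metric and a threshold of 0.3; both are placeholders rather than choices made in this notebook, and the toy 3-dimensional vectors stand in for real embeddings.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat a zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)

def should_alert(request_vecs, conversation_vec, threshold=0.3):
    """Alert if any request embedding is within `threshold` of the conversation.

    `threshold` is a hypothetical value; it would need tuning on real data.
    """
    return any(
        cosine_distance(r, conversation_vec) <= threshold for r in request_vecs
    )

# Hypothetical toy vectors standing in for embeddings of a user's request
# and of two conversations.
requests = [[1.0, 0.0, 0.0]]
close_conversation = [0.9, 0.1, 0.0]
far_conversation = [0.0, 1.0, 0.0]

print(should_alert(requests, close_conversation))
print(should_alert(requests, far_conversation))
```

Keeping the matched request alongside the decision would also make it easy to tell the user which request triggered the alert.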

We could also watch for interaction with the notifications BeepGPT sends as a way of providing training feedback
to the model. For example, if I add a :+1: or :thank_you: reaction to the notification, we could capture that
as a “positive” training example, but if I add a :-1: or a threaded reply saying “this message isn’t specific to
any projects I’m working on”, we could treat that as a “negative” training example (and possibly a refinement to
the notification requests).
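
One way the feedback capture could look is sketched below. The function name `label_feedback` and the reaction sets are hypothetical. The reaction dicts follow the shape visible in the Slack export (`name`, `count`, `users`). Treating any threaded reply as a negative signal is a simplifying assumption; a real version would need to interpret the reply text.

```python
# Reaction names without the surrounding colons, as they appear in a Slack export.
POSITIVE_REACTIONS = {"+1", "thank_you"}
NEGATIVE_REACTIONS = {"-1"}

def label_feedback(reactions, thread_replies=()):
    """Map feedback on one notification to (label, refinements).

    label is "positive", "negative", or None when there is no signal.
    refinements holds reply texts that might refine the notification request.
    Assumption: any threaded reply counts as negative feedback.
    """
    names = {r["name"] for r in reactions}
    refinements = list(thread_replies)
    if names & NEGATIVE_REACTIONS or refinements:
        return "negative", refinements
    if names & POSITIVE_REACTIONS:
        return "positive", refinements
    return None, refinements

thumbs_up = [{"name": "+1", "count": 1, "users": ["ULJD5H2A2"]}]
print(label_feedback(thumbs_up))
print(label_feedback([], ["this message isn't specific to any projects I'm working on"]))
```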

In [None]:
%pip install -q llama-index span-marker kaskada==0.6.0a4 "cassio>=0.1.0" pillow

In [1]:
import pandas as pd
import kaskada as kd

# Initialize Kaskada with a local execution context.
kd.init_session()

# set pandas to display all floats with 6 decimal places
pd.options.display.float_format = '{:.6f}'.format

In [2]:
users_df = pd.read_json("/Users/eric.pinzur/Documents/slackbot2000/slack-export/users.json")

columns_to_keep = ["id", "team_id", "name", "deleted", "real_name", "is_bot", "updated"]

users_df.drop(columns=users_df.columns.difference(columns_to_keep), inplace=True)

users = kd.sources.Pandas(
    users_df,
    time_column = "updated",
    key_column = "id",
    time_unit = "s"
)

In [3]:
def get_user(user_id):
    data = users.filter(users.col("id").eq(user_id)).last().preview().to_dict(orient='index')
    return data[0] if len(data) == 1 else None

get_user("ULJD5H2A2")

{'_time': Timestamp('2023-01-08 01:20:34'),
 '_key': 'ULJD5H2A2',
 'id': 'ULJD5H2A2',
 'team_id': 'TCYRBF6KH',
 'name': 'ryan',
 'deleted': False,
 'real_name': 'Ryan Michael',
 'is_bot': False,
 'updated': 1673140834}

In [4]:
# Use the "ts" column as the time associated with each row,
# and the "channel" column as the entity associated with each row.
raw_msgs = kd.sources.Parquet(
    "/Users/eric.pinzur/Documents/slackbot2000/kaskada_slack.parquet",
    time_column = "ts",
    key_column = "channel",
    time_unit = "s"
).remove("reply_users")
raw_msgs.preview(5)

Unnamed: 0,_time,_key,text,user,ts,thread_ts,reactions,channel
0,2021-07-01 14:26:27.342999808,team-api,<@ULJD5H2A2> available to talk about errors?,U016TM9NXEY,1625149587.343,,,team-api
1,2021-07-01 15:18:28.343499776,team-api,<@ULJD5H2A2> <https://gitlab.com/kaskada/kaska...,U016TM9NXEY,1625152708.3435,,,team-api
2,2021-07-01 15:18:43.344000000,team-api,Looks like Go sets the transfer encoding if yo...,ULJD5H2A2,1625152723.344,,,team-api
3,2021-07-01 15:18:44.344300032,team-api,working now,ULJD5H2A2,1625152724.3443,,"[{'count': 1, 'name': 'twinsparrot', 'users': ...",team-api
4,2021-07-01 15:22:19.344600064,team-api,<@U016TM9NXEY> <https://gitlab.com/kaskada/kas...,ULJD5H2A2,1625152939.3446,,"[{'count': 1, 'name': 'eyes', 'users': ['U016T...",team-api


In [5]:
# Clean Text
import re

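def strip_code_blocks(line):
    # DOTALL lets "." match newlines, so multi-line code blocks are removed too.
    return re.sub(r"```.*?```", '', line, flags=re.DOTALL)

def user_repl(match_obj):
    user_id = match_obj.group(1)
    user = get_user(user_id)
    # get_user returns a dict (or None), so look up the name by key.
    return f"{user['name']} ({user_id})" if user else f"({user_id})"

def format_user(line):
    # Replace Slack mentions such as <@ULJD5H2A2> with "name (id)".
    return re.sub(r"<@(.*?)>", user_repl, line)

def clean_message(text):
    text = strip_code_blocks(format_user(text)).strip()
    return None if text == "" else text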

@kd.udf("f<N: any>(x: N) -> string")
def clean_text(batch: pd.Series):
    # Apply to each row in the batch
    return batch.map(clean_message)

cleaned = raw_msgs.extend({"text": raw_msgs.col("text").pipe(clean_text)})
cleaned.preview(5)

thread 'tokio-runtime-worker' panicked at 'Cannot start a runtime from within a runtime. This happens because a function (like `block_on`) attempted to block the current thread while the thread is being used to drive asynchronous tasks.', /Users/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/scheduler/multi_thread/mod.rs:86:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
--- PyO3 is resuming a panic after fetching a PanicException from Python. ---
Python stack trace below:


PanicException: Cannot start a runtime from within a runtime. This happens because a function (like `block_on`) attempted to block the current thread while the thread is being used to drive asynchronous tasks.

RuntimeError: error in kaskada Rust code
├╴at src/error.rs:54:21
│
├─▶ execute query
│   ╰╴at /Users/runner/work/kaskada/kaskada/crates/sparrow-session/src/session.rs:488:28
│
├─▶ internal compute error: failed to join compute threads
│   ╰╴at /Users/runner/work/kaskada/kaskada/crates/sparrow-runtime/src/execute/compute_executor.rs:192:22
│
├─▶ internal compute error: panic in task
│   ├╴at /Users/runner/work/kaskada/kaskada/crates/sparrow-runtime/src/util/join_task.rs:66:30
│   ╰╴task name "scan[op=0]"
│
╰─▶ task 50 panicked
    ╰╴at /Users/runner/work/kaskada/kaskada/crates/sparrow-runtime/src/util/join_task.rs:65:30
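
A plausible reading of this panic is that the UDF calls `get_user`, which runs a second Kaskada query (`preview()`) while the outer query is still executing. If so, one possible workaround is to resolve user names from a plain dictionary built once before the query runs. The sketch below is untested against Kaskada itself. `user_names` is a hypothetical stand-in for a mapping derived from `users_df`, populated here with the single user shown in the `get_user` output earlier.

```python
import re

# Hypothetical lookup table built once, outside any query, for example with
# dict(zip(users_df["id"], users_df["name"])). The entry below comes from
# the get_user("ULJD5H2A2") output shown earlier.
user_names = {"ULJD5H2A2": "ryan"}

def user_repl(match_obj):
    user_id = match_obj.group(1)
    # A dictionary lookup does not start a nested query.
    name = user_names.get(user_id)
    return f"{name} ({user_id})" if name else f"({user_id})"

def format_user(line):
    # Replace Slack mentions such as <@ULJD5H2A2> with "name (id)".
    return re.sub(r"<@(.*?)>", user_repl, line)

print(format_user("<@ULJD5H2A2> available to talk about errors?"))
```

With this version, `clean_message` and the `clean_text` UDF could stay as they are, since only the user lookup changes.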