# Kaskada: ML example

### Setting up a Kaskada environment


In [1]:
!pip install kaskada==0.1.1a7


In [2]:
import kaskada.api.release as release
import os
from getpass import getpass

os.environ[release.ReleaseClient.GITHUB_ACCESS_TOKEN_ENV] = getpass(prompt='Github Access Token:')

Github Access Token:········


In [3]:
from kaskada.api.session import LocalBuilder
session = LocalBuilder().build()
%load_ext fenlmagic

INFO:kaskada.api.release:Using latest release version: engine@v0.4.0



INFO:kaskada.api.session:Initializing manager process





INFO:kaskada.api.session:Initializing engine process
INFO:kaskada.api.session:Successfully connected to session.


### Build a Sample Data Set

In this simple example, we're collecting and storing events about what users are doing.
These events describe when users win, when they lose, when they buy things, etc.
Events are stored in two CSV files: 

`game_play.csv` describes each time a player completes a game

In [4]:
%%writefile game_play.csv
event_at,entity_id,duration,won
2022-01-01 02:30:00+00:00,Alice,10,true
2022-01-01 02:35:00+00:00,Bob,3,false
2022-01-01 03:46:00+00:00,Bob,8,false
2022-01-01 03:58:00+00:00,Bob,23,true
2022-01-01 04:25:00+00:00,Bob,8,true
2022-01-01 05:05:00+00:00,Alice,53,true
2022-01-01 05:36:00+00:00,Alice,2,false
2022-01-01 07:22:00+00:00,Bob,7,false
2022-01-01 08:35:00+00:00,Alice,5,false
2022-01-01 10:01:00+00:00,Alice,43,true

Writing game_play.csv


`purchase.csv` describes each time a player makes a purchase

In [5]:
%%writefile purchase.csv
event_at,entity_id
2022-01-01 01:02:00+00:00,Alice
2022-01-01 01:35:00+00:00,Alice
2022-01-01 03:51:00+00:00,Bob

Writing purchase.csv


### Creating a Kaskada Table and Uploading Data

Below, we create a table for each CSV file and then load the files into Kaskada. Once a table
is created, it persists in your Kaskada environment.

Kaskada can also load data from Parquet files.


In [6]:
import kaskada.table as ktable

# Create table objects in Kaskada.
ktable.create_table(
  table_name = "GamePlay",
  entity_key_column_name = "entity_id",
  time_column_name = "event_at",
  grouping_id = "User",
)

0,1
table_name,GamePlay
entity_key_column_name,entity_id
time_column_name,event_at
grouping_id,User
version,0
create_time,2023-03-08T20:15:33.674202
update_time,2023-03-08T20:15:33.674202

0,1
request_id,54adbfc73bf2659650b3dbab5dcdab46


In [7]:
ktable.create_table(
  table_name = "Purchase",
  entity_key_column_name = "entity_id",
  time_column_name = "event_at",
  grouping_id = "User",
)

0,1
table_name,Purchase
entity_key_column_name,entity_id
time_column_name,event_at
grouping_id,User
version,0
create_time,2023-03-08T20:15:33.684299
update_time,2023-03-08T20:15:33.684299

0,1
request_id,943c87a1d0f144fc902450263987cd95


In [8]:
# Load the data into the GamePlay table
ktable.load(table_name="GamePlay", file="game_play.csv")

0,1
data_token_id,e2de6943-2fc4-4340-b020-d522c647357e

0,1
request_id,41314fede359743668fcbd53697ff989


In [9]:
# Load the data into the Purchase table
ktable.load(table_name="Purchase", file="purchase.csv")

0,1
data_token_id,5a6c2fe2-9623-4875-b056-93e38448e8c4

0,1
request_id,3a8a5b18f5957999daf10a3a4729adb0


### Working With Your Kaskada Environment


In [10]:
# Get the table after loading data
ktable.get_table("GamePlay")

0,1
table_name,GamePlay
entity_key_column_name,entity_id
time_column_name,event_at
grouping_id,User
version,1
schema,(see Schema tab)
create_time,2023-03-08T20:15:33.674202
update_time,2023-03-08T20:15:33.698909

0,1
request_id,807d2ff41824b64840e12f0475dda9ff

Unnamed: 0,column_name,column_type
0,event_at,string
1,entity_id,string
2,duration,i64
3,won,bool


In [12]:
%%fenl --var query_result
# Query the table to see that data has been loaded
GamePlay

Unnamed: 0,_time,_subsort,_key_hash,_key,event_at,entity_id,duration,won
0,2022-01-01 02:30:00,12714731162780208561,5902814233694669492,Alice,2022-01-01 02:30:00+00:00,Alice,10,True
1,2022-01-01 02:35:00,12714731162780208562,17054345325612802246,Bob,2022-01-01 02:35:00+00:00,Bob,3,False
2,2022-01-01 03:46:00,12714731162780208563,17054345325612802246,Bob,2022-01-01 03:46:00+00:00,Bob,8,False
3,2022-01-01 03:58:00,12714731162780208564,17054345325612802246,Bob,2022-01-01 03:58:00+00:00,Bob,23,True
4,2022-01-01 04:25:00,12714731162780208565,17054345325612802246,Bob,2022-01-01 04:25:00+00:00,Bob,8,True
5,2022-01-01 05:05:00,12714731162780208566,5902814233694669492,Alice,2022-01-01 05:05:00+00:00,Alice,53,True
6,2022-01-01 05:36:00,12714731162780208567,5902814233694669492,Alice,2022-01-01 05:36:00+00:00,Alice,2,False
7,2022-01-01 07:22:00,12714731162780208568,17054345325612802246,Bob,2022-01-01 07:22:00+00:00,Bob,7,False
8,2022-01-01 08:35:00,12714731162780208569,5902814233694669492,Alice,2022-01-01 08:35:00+00:00,Alice,5,False
9,2022-01-01 10:01:00,12714731162780208570,5902814233694669492,Alice,2022-01-01 10:01:00+00:00,Alice,43,True

0,1
state,SUCCESS
query_id,bf5215d1-1dee-471d-9cc2-fe54afb24728

0,1
time_preparing,0.003s
time_computing,0.007s
output_files,1

0,1
can_execute,True

0,1
request_id,5e7b0430b4c4205f3f1b864bf89b8423

Unnamed: 0,column_name,column_type
0,event_at,string
1,entity_id,string
2,duration,i64
3,won,bool


### Step 1: Define features

We want to predict whether a user will pay for an upgrade. Step one is to compute features from events. As a first simple feature, we compute the total amount of time a user has spent losing at the game; users who lose a lot are probably more likely to pay for upgrades.


In [13]:
%%fenl

let GameDefeat = GamePlay | when(not(GamePlay.won))

let features = {
    loss_duration: sum(GameDefeat.duration) }

in features

Unnamed: 0,_time,_subsort,_key_hash,_key,loss_duration
0,2022-01-01 02:35:00,12714731162780208562,17054345325612802246,Bob,3
1,2022-01-01 03:46:00,12714731162780208563,17054345325612802246,Bob,11
2,2022-01-01 05:36:00,12714731162780208567,5902814233694669492,Alice,2
3,2022-01-01 07:22:00,12714731162780208568,17054345325612802246,Bob,18
4,2022-01-01 08:35:00,12714731162780208569,5902814233694669492,Alice,7

0,1
state,SUCCESS
query_id,f29fc9f6-f4b5-4d4d-bda7-bd1fd75285a8

0,1
time_preparing,0.002s
time_computing,0.005s
output_files,1

0,1
can_execute,True

0,1
request_id,f130785b976c734f24132f62b5bfd013

Unnamed: 0,column_name,column_type
0,loss_duration,i64


Notice that the result is a timeline describing the step function of how this feature has changed over time. We can “observe” the value of this step function at any time, regardless of the times at which the original events occurred.

Another thing to notice is that these results are automatically grouped by user. We didn’t have to explicitly group by user because tables in Kaskada specify an "entity" associated with each row.
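To make the per-entity running sum concrete, here is a minimal plain-Python sketch of the same idea. It is not how Kaskada computes the result; the function name `loss_duration_timeline` and the tuple layout are assumptions made for illustration, and the rows are copied from `game_play.csv`.

```python
from collections import defaultdict

# (time on 2022-01-01 UTC, entity, duration, won), copied from game_play.csv.
GAME_PLAY = [
    ("02:30", "Alice", 10, True),
    ("02:35", "Bob", 3, False),
    ("03:46", "Bob", 8, False),
    ("03:58", "Bob", 23, True),
    ("04:25", "Bob", 8, True),
    ("05:05", "Alice", 53, True),
    ("05:36", "Alice", 2, False),
    ("07:22", "Bob", 7, False),
    ("08:35", "Alice", 5, False),
    ("10:01", "Alice", 43, True),
]


def loss_duration_timeline(events):
    """Return (time, entity, running loss_duration) for every lost game.

    Events must be sorted by time. Each entity keeps its own running sum,
    which mirrors the implicit grouping by entity described above.
    """
    totals = defaultdict(int)  # one running sum per entity
    timeline = []
    for time, entity, duration, won in events:
        if not won:
            totals[entity] += duration
            timeline.append((time, entity, totals[entity]))
    return timeline


for row in loss_duration_timeline(GAME_PLAY):
    print(row)
```

Each emitted row is one step of the step function: the value holds for that entity until its next lost game.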

### Step 2: Define prediction times

The second step is to observe our feature at the times a prediction would have been made. Let’s assume that the game designers want to offer an upgrade any time a user loses the game twice in a row. We can construct a set of examples associated with this prediction time by observing our feature when the user loses twice in a row.

In [14]:
%%fenl

let GameDefeat = GamePlay | when(not(GamePlay.won))

let features = {
    loss_duration: sum(GameDefeat.duration) }

let is_prediction_time = not(GamePlay.won) and count(GameDefeat, window=since(GamePlay.won)) == 2

let example = features | when(is_prediction_time)
    
in example

Unnamed: 0,_time,_subsort,_key_hash,_key,loss_duration
0,2022-01-01 03:46:00,12714731162780208563,17054345325612802246,Bob,11
1,2022-01-01 08:35:00,12714731162780208569,5902814233694669492,Alice,7

0,1
state,SUCCESS
query_id,9c308f9d-bcf7-4ca1-a6df-17343cb769b6

0,1
time_preparing,0.003s
time_computing,0.008s
output_files,1

0,1
can_execute,True

0,1
request_id,5da760eeaa3f768cc78c2f1430126588

Unnamed: 0,column_name,column_type
0,loss_duration,i64


This query gives us a set of examples, each containing input features computed at the specific times we would like to make a prediction.
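The "two losses in a row" condition can be sketched in plain Python as a per-entity counter that resets on every win. This is only an illustration of the logic, under the assumption that events arrive sorted by time; the helper name `examples_at_second_loss` is hypothetical and not part of Kaskada.

```python
from collections import defaultdict

# (time on 2022-01-01 UTC, entity, duration, won), copied from game_play.csv.
GAME_PLAY = [
    ("02:30", "Alice", 10, True),
    ("02:35", "Bob", 3, False),
    ("03:46", "Bob", 8, False),
    ("03:58", "Bob", 23, True),
    ("04:25", "Bob", 8, True),
    ("05:05", "Alice", 53, True),
    ("05:36", "Alice", 2, False),
    ("07:22", "Bob", 7, False),
    ("08:35", "Alice", 5, False),
    ("10:01", "Alice", 43, True),
]


def examples_at_second_loss(events):
    """Return (time, entity, loss_duration) at each second consecutive loss."""
    loss_total = defaultdict(int)  # running loss_duration per entity
    streak = defaultdict(int)      # losses since the entity's last win
    examples = []
    for time, entity, duration, won in events:
        if won:
            streak[entity] = 0     # a win resets the window
            continue
        loss_total[entity] += duration
        streak[entity] += 1
        if streak[entity] == 2:    # observe the feature at prediction time
            examples.append((time, entity, loss_total[entity]))
    return examples


print(examples_at_second_loss(GAME_PLAY))
```

The feature keeps accumulating across wins, while the streak counter does not; only the streak decides when an example is emitted.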

### Step 3: Shift examples

The third step is to move each example to the time when the outcome we’re predicting can be observed. We want to give the user some time to see the upgrade offer, decide to accept it, and pay, so let’s check whether they accepted ten minutes after we make the offer.

In [15]:
%%fenl

let GameDefeat = GamePlay | when(not(GamePlay.won))

let features = {
    loss_duration: sum(GameDefeat.duration) }

let is_prediction_time = not(GamePlay.won) and (count(GameDefeat, window=since(GamePlay.won)) == 2)

let example = features | when(is_prediction_time)| shift_by(seconds(60*10))

in example

Unnamed: 0,_time,_subsort,_key_hash,_key,loss_duration
0,2022-01-01 03:56:00,0,17054345325612802246,Bob,11
1,2022-01-01 08:45:00,0,5902814233694669492,Alice,7

0,1
state,SUCCESS
query_id,df41e8a2-61f7-4718-b63d-7c43910caf86

0,1
time_preparing,0.003s
time_computing,0.006s
output_files,1

0,1
can_execute,True

0,1
request_id,7520caf8189521ff84647142d7c8d2da

Unnamed: 0,column_name,column_type
0,loss_duration,i64


Our training examples have now moved to the point in time when the label we want to predict can be observed. Notice that the values in the time column are ten minutes later than in the previous step.
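As a rough plain-Python analogy for the shift, each example keeps its feature values and only its timestamp moves forward by a fixed delay. The sketch below uses the two examples from Step 2 and a delay of `60*10` seconds, mirroring `seconds(60*10)` in the query; `shift_examples` is a hypothetical helper, not a Kaskada function.

```python
from datetime import datetime, timedelta, timezone


def shift_examples(examples, delay):
    """Move every (time, entity, features) example forward by `delay`."""
    return [(time + delay, entity, features) for time, entity, features in examples]


# The two prediction-time examples produced in Step 2.
examples = [
    (datetime(2022, 1, 1, 3, 46, tzinfo=timezone.utc), "Bob", {"loss_duration": 11}),
    (datetime(2022, 1, 1, 8, 35, tzinfo=timezone.utc), "Alice", {"loss_duration": 7}),
]

shifted = shift_examples(examples, timedelta(seconds=60 * 10))
for time, entity, features in shifted:
    print(time.isoformat(), entity, features)
```

The feature values are frozen at prediction time, so nothing that happens during the delay can leak into them.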

### Step 4: Label examples

The final step is to see if a purchase happened after the prediction was made. This will be our target value and we’ll add it to the records that currently contain our feature.

In [16]:
%%fenl --var training

let GameDefeat = GamePlay | when(not(GamePlay.won))

let features = {
    loss_duration: sum(GameDefeat.duration),
    purchase_count: count(Purchase) }

let is_prediction_time = count(GameDefeat, window=since(GamePlay.won)) == 2

let example = features | when(is_prediction_time)
    | shift_to(time_of($input) | add_time(seconds(60*10)))

let target = count(Purchase) > example.purchase_count
    
in extend(example, {target}) | when(is_valid(example.purchase_count))

Unnamed: 0,_time,_subsort,_key_hash,_key,loss_duration,purchase_count,target
0,2022-01-01 04:01:00,1,17054345325612802246,Bob,11,1,False
1,2022-01-01 04:08:00,2,17054345325612802246,Bob,11,1,False
2,2022-01-01 08:45:00,0,5902814233694669492,Alice,7,2,False
3,2022-01-01 10:11:00,0,5902814233694669492,Alice,7,2,False

0,1
state,SUCCESS
query_id,2db18508-f581-4849-bb1e-ce9acc8b118c

0,1
time_preparing,0.016s
time_computing,0.014s
output_files,1

0,1
can_execute,True

0,1
request_id,f425bb1ce800bf5877f0fbdb0b3088f2

Unnamed: 0,column_name,column_type
0,loss_duration,i64
1,purchase_count,u32
2,target,bool
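The labeling idea, comparing the purchase count at prediction time with the count once the delay has passed, can be sketched in plain Python as follows. This is an illustration only, not a reimplementation of the Fenl query: the helper names are hypothetical, the prediction times are taken from Step 2, and the sketch counts "no purchases yet" as 0, whereas the query additionally filters on `is_valid(example.purchase_count)`. Its rows are therefore not expected to match the query output.

```python
from datetime import datetime, timedelta, timezone


def utc(hour, minute):
    """Timestamp on 2022-01-01 in UTC, the day covered by the sample data."""
    return datetime(2022, 1, 1, hour, minute, tzinfo=timezone.utc)


# (time, entity), copied from purchase.csv.
PURCHASES = [
    (utc(1, 2), "Alice"),
    (utc(1, 35), "Alice"),
    (utc(3, 51), "Bob"),
]


def purchases_up_to(purchases, entity, when):
    """Count the entity's purchases made at or before `when`."""
    return sum(1 for time, who in purchases if who == entity and time <= when)


def label_example(purchases, entity, prediction_time, delay):
    """Label an example True if a purchase happened during the delay."""
    before = purchases_up_to(purchases, entity, prediction_time)
    after = purchases_up_to(purchases, entity, prediction_time + delay)
    return {"purchase_count": before, "target": after > before}


delay = timedelta(minutes=10)
print("Bob", label_example(PURCHASES, "Bob", utc(3, 46), delay))
print("Alice", label_example(PURCHASES, "Alice", utc(8, 35), delay))
```

Recording `purchase_count` at prediction time and comparing it after the delay keeps the label strictly about what happened after the prediction was made.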


## Train a model!

Now that we've observed features and labels at the correct points in time, we can train a model from our examples. This toy dataset won't produce a very good model, of course.

In [17]:
# Requires scikit-learn: pip install scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing

X = training.dataframe[['loss_duration']]
y = training.dataframe['target']

scaler = preprocessing.StandardScaler().fit(X)
X_scaled = scaler.transform(X)

model = LogisticRegression(max_iter=1000)
model.fit(X_scaled, y)

ModuleNotFoundError: No module named 'sklearn'
