# Kaskada: ML example

### Setting up a Kaskada environment


In [None]:
!pip install kaskada

In [None]:
from kaskada.api.session import LocalBuilder
session = LocalBuilder().build()
%load_ext fenlmagic

### Build a Sample Data Set

In this simple example, we're collecting and storing events about what users are doing.
These events describe when users win, when they lose, when they buy things, etc.
Events are stored in two CSV files: 

`game_play.csv` describes each time a player completes a game:

In [None]:
%%writefile game_play.csv
event_at,entity_id,duration,won
2022-01-01 02:30:00+00:00,Alice,10,true
2022-01-01 02:35:00+00:00,Bob,3,false
2022-01-01 03:46:00+00:00,Bob,8,false
2022-01-01 03:58:00+00:00,Bob,23,true
2022-01-01 04:25:00+00:00,Bob,8,true
2022-01-01 05:05:00+00:00,Alice,53,true
2022-01-01 05:36:00+00:00,Alice,2,false
2022-01-01 07:22:00+00:00,Bob,7,false
2022-01-01 08:35:00+00:00,Alice,5,false
2022-01-01 10:01:00+00:00,Alice,43,true

`purchase.csv` describes each time a player makes a purchase:

In [None]:
%%writefile purchase.csv
event_at,entity_id
2022-01-01 01:02:00+00:00,Alice
2022-01-01 01:35:00+00:00,Alice
2022-01-01 03:51:00+00:00,Bob

### Creating a Kaskada Table and Uploading Data

Below, we create a table for each CSV file and load the data into Kaskada. Once a table
is created, it persists in your Kaskada environment.

Kaskada can also load data from Parquet files.


In [None]:
import kaskada.table as ktable

# Create table objects in Kaskada.
ktable.create_table(
  table_name = "GamePlay",
  entity_key_column_name = "entity_id",
  time_column_name = "event_at",
  grouping_id = "User",
)

In [None]:
ktable.create_table(
  table_name = "Purchase",
  entity_key_column_name = "entity_id",
  time_column_name = "event_at",
  grouping_id = "User",
)

In [None]:
# Load the data into the GamePlay table
ktable.load(table_name="GamePlay", file="game_play.csv")

In [None]:
# Load the data into the Purchase table
ktable.load(table_name="Purchase", file="purchase.csv")

### Working With Your Kaskada Environment


In [None]:
# Get the table after loading data
ktable.get_table("GamePlay")

In [None]:
%%fenl
# Query the table to see that data has been loaded
GamePlay

### Step 1: Define features

We want to predict whether a user will pay for an upgrade. Step one is to compute features from events. As a first simple feature, we compute the total amount of time a user has spent losing at the game, on the assumption that users who lose a lot are more likely to pay for upgrades.


In [None]:
%%fenl

let GameDefeat = GamePlay | when(not(GamePlay.won))

let features = {
    loss_duration: sum(GameDefeat.duration) }

in features

Notice that the result is a timeline describing the step function of how this feature has changed over time. We can “observe” the value of this step function at any time, regardless of the times at which the original events occurred.

Another thing to notice is that these results are automatically grouped by user. We didn’t have to explicitly group by user because tables in Kaskada specify an "entity" associated with each row.
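To make the step-function idea concrete, here is a minimal plain-Python sketch of what the query computes. It is not Kaskada code: the helper names `running_loss_duration` and `value_at` are hypothetical, and the sketch assumes the feature only changes when a user loses a game.

```python
import csv
import io
from collections import defaultdict

# Same rows as game_play.csv above.
GAME_PLAY_CSV = """event_at,entity_id,duration,won
2022-01-01 02:30:00+00:00,Alice,10,true
2022-01-01 02:35:00+00:00,Bob,3,false
2022-01-01 03:46:00+00:00,Bob,8,false
2022-01-01 03:58:00+00:00,Bob,23,true
2022-01-01 04:25:00+00:00,Bob,8,true
2022-01-01 05:05:00+00:00,Alice,53,true
2022-01-01 05:36:00+00:00,Alice,2,false
2022-01-01 07:22:00+00:00,Bob,7,false
2022-01-01 08:35:00+00:00,Alice,5,false
2022-01-01 10:01:00+00:00,Alice,43,true
"""


def running_loss_duration(csv_text):
    """Return (event_at, entity_id, loss_duration) after each lost game."""
    totals = defaultdict(int)  # running sum per user (the "entity")
    timeline = []
    # All timestamps share one format, so sorting the strings sorts by time.
    rows = sorted(csv.DictReader(io.StringIO(csv_text)), key=lambda r: r["event_at"])
    for row in rows:
        if row["won"] == "false":
            user = row["entity_id"]
            totals[user] += int(row["duration"])
            timeline.append((row["event_at"], user, totals[user]))
    return timeline


def value_at(timeline, user, when):
    """Observe the step function: the latest value at or before `when`."""
    value = None  # no losses yet means no value
    for event_at, entity, total in timeline:
        if entity == user and event_at <= when:
            value = total
    return value


timeline = running_loss_duration(GAME_PLAY_CSV)
for event_at, user, loss_duration in timeline:
    print(event_at, user, loss_duration)

# Observe Bob's feature at a time when no event occurred.
print(value_at(timeline, "Bob", "2022-01-01 05:00:00+00:00"))
```

The per-user dictionary plays the role of Kaskada's implicit grouping by entity, and `value_at` shows why the feature can be read at any time, not only at event times.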

### Step 2: Define prediction times

The second step is to observe our feature at the times a prediction would have been made. Let’s assume that the game designers want to offer an upgrade any time a user loses the game twice in a row. We can construct a set of examples associated with this prediction time by observing our feature when the user loses twice in a row.

In [None]:
%%fenl

let GameDefeat = GamePlay | when(not(GamePlay.won))

let features = {
    loss_duration: sum(GameDefeat.duration) }

let is_prediction_time = not(GamePlay.won) and (count(GameDefeat, window=since(GamePlay.won)) == 2)

let example = features | when(is_prediction_time)
    
in example

This query gives us a set of examples, each containing input features computed at the specific times we would like to make a prediction.
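The selection of prediction times can be sketched in plain Python as follows. This is an illustration rather than Kaskada's implementation: the function name `prediction_examples` is hypothetical, and it assumes `since(GamePlay.won)` resets the loss count each time the user wins.

```python
import csv
import io
from collections import defaultdict

# Same rows as game_play.csv above.
GAME_PLAY_CSV = """event_at,entity_id,duration,won
2022-01-01 02:30:00+00:00,Alice,10,true
2022-01-01 02:35:00+00:00,Bob,3,false
2022-01-01 03:46:00+00:00,Bob,8,false
2022-01-01 03:58:00+00:00,Bob,23,true
2022-01-01 04:25:00+00:00,Bob,8,true
2022-01-01 05:05:00+00:00,Alice,53,true
2022-01-01 05:36:00+00:00,Alice,2,false
2022-01-01 07:22:00+00:00,Bob,7,false
2022-01-01 08:35:00+00:00,Alice,5,false
2022-01-01 10:01:00+00:00,Alice,43,true
"""


def prediction_examples(csv_text):
    """Return (event_at, entity_id, loss_duration) at each second straight loss."""
    loss_total = defaultdict(int)     # cumulative loss duration per user
    losing_streak = defaultdict(int)  # losses since the user's last win
    examples = []
    # All timestamps share one format, so sorting the strings sorts by time.
    rows = sorted(csv.DictReader(io.StringIO(csv_text)), key=lambda r: r["event_at"])
    for row in rows:
        user = row["entity_id"]
        if row["won"] == "true":
            losing_streak[user] = 0  # a win resets the window
            continue
        loss_total[user] += int(row["duration"])
        losing_streak[user] += 1
        if losing_streak[user] == 2:  # prediction time: two losses in a row
            examples.append((row["event_at"], user, loss_total[user]))
    return examples


examples = prediction_examples(GAME_PLAY_CSV)
for example in examples:
    print(example)
```

On the toy data this yields one example per user: Bob's second straight loss at 03:46 and Alice's at 08:35.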

### Step 3: Shift examples

The third step is to move each example to the time when the outcome we’re predicting can be observed. We want to give the user some time to see the upgrade offer, decide to accept it, and pay - let’s check to see if they accepted ten minutes after we make the offer (`seconds(60*10)` in the query below).

In [None]:
%%fenl

let GameDefeat = GamePlay | when(not(GamePlay.won))

let features = {
    loss_duration: sum(GameDefeat.duration) }

let is_prediction_time = not(GamePlay.won) and (count(GameDefeat, window=since(GamePlay.won)) == 2)

let example = features | when(is_prediction_time) | shift_by(seconds(60*10))

in example

Our training examples have now moved to the point in time when the label we want to predict can be observed. Notice that the values in the time column are ten minutes later than in the previous step.
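As a rough plain-Python analogue of the shift, the sketch below adds the same offset, `seconds(60*10)`, to each example's timestamp. The `examples` list is derived by hand from the toy data (each user's second straight loss), and `shift_examples` is a hypothetical helper, not a Kaskada function.

```python
from datetime import datetime, timedelta

# Offset used by the query: seconds(60*10), i.e. ten minutes.
SHIFT = timedelta(seconds=60 * 10)

# (event_at, entity_id, loss_duration) at each prediction time,
# worked out by hand from game_play.csv.
examples = [
    ("2022-01-01 03:46:00+00:00", "Bob", 11),
    ("2022-01-01 08:35:00+00:00", "Alice", 7),
]


def shift_examples(examples, delta):
    """Move each example forward in time; the feature values are unchanged."""
    return [
        (datetime.fromisoformat(event_at) + delta, user, loss_duration)
        for event_at, user, loss_duration in examples
    ]


shifted = shift_examples(examples, SHIFT)
for shifted_at, user, loss_duration in shifted:
    print(shifted_at.isoformat(), user, loss_duration)
```

Only the timestamp moves. Each example still carries the feature values computed at prediction time, which is what keeps later information from leaking into the features.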

### Step 4: Label examples

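The final step is to see if a purchase happened after the prediction was made. We record the number of purchases at prediction time as `purchase_count`, then compare it with the count at the shifted time: if the count has grown, the user bought something in between. This comparison is our target value, and we add it to the records that currently contain our feature.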

In [None]:
%%fenl --var training

let GameDefeat = GamePlay | when(not(GamePlay.won))

let features = {
    loss_duration: sum(GameDefeat.duration),
    purchase_count: count(Purchase) }

let is_prediction_time = not(GamePlay.won) and (count(GameDefeat, window=since(GamePlay.won)) == 2)

let example = features | when(is_prediction_time)
    | shift_to(time_of($input) | add_time(seconds(60*10)))

let target = count(Purchase) > (example.purchase_count | else(0))
    
in extend(example, {target}) | when(is_valid($input.loss_duration))
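The labeling logic can be checked by hand with a small plain-Python sketch. It assumes the ten-minute shift from the query; the `examples` list is derived by hand from the toy data, and `count_purchases` and `label_examples` are hypothetical helper names rather than Kaskada functions.

```python
from datetime import datetime, timedelta

SHIFT = timedelta(seconds=60 * 10)  # same offset as the query

# Same rows as purchase.csv above.
purchases = [
    (datetime.fromisoformat("2022-01-01 01:02:00+00:00"), "Alice"),
    (datetime.fromisoformat("2022-01-01 01:35:00+00:00"), "Alice"),
    (datetime.fromisoformat("2022-01-01 03:51:00+00:00"), "Bob"),
]

# (event_at, entity_id, loss_duration) at each prediction time,
# worked out by hand from game_play.csv.
examples = [
    ("2022-01-01 03:46:00+00:00", "Bob", 11),
    ("2022-01-01 08:35:00+00:00", "Alice", 7),
]


def count_purchases(user, at):
    """Number of purchases the user has made at or before `at`."""
    return sum(1 for purchased_at, buyer in purchases
               if buyer == user and purchased_at <= at)


def label_examples(examples, delta):
    """Attach a target: did the purchase count grow during the shift?"""
    labeled = []
    for event_at, user, loss_duration in examples:
        predicted_at = datetime.fromisoformat(event_at)
        purchase_count = count_purchases(user, predicted_at)
        later_count = count_purchases(user, predicted_at + delta)
        labeled.append({
            "entity_id": user,
            "loss_duration": loss_duration,
            "purchase_count": purchase_count,
            "target": later_count > purchase_count,
        })
    return labeled


labeled = label_examples(examples, SHIFT)
for row in labeled:
    print(row)
```

Bob's purchase at 03:51 falls between his prediction time (03:46) and the shifted time (03:56), so his example is labeled positive. Alice's two purchases both predate her prediction time, so her count does not change and her example is labeled negative.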

## Train a model!

Now that we've observed features and labels at the correct points in time, we can train a model from our examples. This toy dataset won't produce a very good model, of course.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing

X = training.dataframe[['loss_duration']]
y = training.dataframe['target']

scaler = preprocessing.StandardScaler().fit(X)
X_scaled = scaler.transform(X)

model = LogisticRegression(max_iter=1000)
model.fit(X_scaled, y)