# Kaskada: Event Processing and Time-centric Calculations

Kaskada was built to process and perform temporal calculations on event streams,
with real-time analytics and machine learning in mind. It is not exclusively for
real-time applications, but Kaskada excels at time-centric computations and
aggregations on event-based data.

For example, let's say you're building a user analytics dashboard at an
ecommerce retailer. You have event streams showing all actions the user has
taken, and you'd like to include in the dashboard:
* the total number of events the user has ever generated
* the total number of purchases the user has made
* the total revenue from the user
* the number of purchases made by the user today
* the total revenue from the user today
* the number of events the user has generated so far this hour

Because these calculations mix hourly, daily, and all-of-history time ranges,
more than one kind of event aggregation is needed. Table-centric tools, such as
those based on SQL, would require multiple JOINs and window functions spread
over multiple queries or CTEs.
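To make the comparison concrete, here is a rough sketch of the table-centric approach using Python's built-in `sqlite3` module. It uses a handful of rows from the demo dataset introduced later, with time zone offsets dropped for brevity. The table layout, the `as_of` parameter, and the one-CTE-per-time-range structure are illustrative assumptions, not the only way to write this in SQL. "This hour" is read as the current calendar hour.

```python
import sqlite3

# A few rows from the demo dataset: (event_id, event_at, entity_id, event_name, revenue)
events = [
    ("ev_00006", "2022-01-01 23:40:00", "user_001", "purchase", 12.50),
    ("ev_00009", "2022-01-02 05:30:00", "user_001", "login", 0.0),
    ("ev_00014", "2022-01-02 06:25:00", "user_001", "purchase", 25.0),
    ("ev_00016", "2022-01-02 06:31:00", "user_001", "purchase", 5.75),
    ("ev_00017", "2022-01-02 07:01:00", "user_001", "view_item", 0.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_id TEXT, event_at TEXT, entity_id TEXT,"
    " event_name TEXT, revenue REAL)"
)
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", events)

# One CTE per time range, joined back together on the entity key.
query = """
WITH all_time AS (
    SELECT entity_id,
           COUNT(*)                     AS event_count_total,
           SUM(event_name = 'purchase') AS purchases_total_count,
           SUM(revenue)                 AS revenue_total
    FROM events
    WHERE event_at <= :as_of
    GROUP BY entity_id
),
today AS (
    SELECT entity_id,
           SUM(event_name = 'purchase') AS purchases_daily,
           SUM(revenue)                 AS revenue_daily
    FROM events
    WHERE event_at <= :as_of AND date(event_at) = date(:as_of)
    GROUP BY entity_id
),
this_hour AS (
    SELECT entity_id, COUNT(*) AS event_count_hourly
    FROM events
    WHERE event_at <= :as_of
      AND strftime('%Y-%m-%d %H', event_at) = strftime('%Y-%m-%d %H', :as_of)
    GROUP BY entity_id
)
SELECT a.entity_id,
       a.event_count_total,
       a.purchases_total_count,
       a.revenue_total,
       COALESCE(t.purchases_daily, 0),
       COALESCE(t.revenue_daily, 0.0),
       COALESCE(h.event_count_hourly, 0)
FROM all_time AS a
LEFT JOIN today AS t ON t.entity_id = a.entity_id
LEFT JOIN this_hour AS h ON h.entity_id = a.entity_id
"""

# Values as of the last event in the sample rows.
row = conn.execute(query, {"as_of": "2022-01-02 07:01:00"}).fetchone()
print(row)
```

Each time range needs its own filter and grouping, and asking for the same values at a different point in time means re-running the whole query with a new `as_of`.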

Kaskada was designed for these types of time-centric calculations, so we can do
each of the calculations in the list in one line:

```
event_count_total: DemoEvents | count(),
purchases_total_count: DemoEvents | when(DemoEvents.event_name == 'purchase') | count(),
revenue_total: DemoEvents.revenue | sum(),
purchases_daily: DemoEvents | when(DemoEvents.event_name == 'purchase') | count(window=since(daily())),
revenue_daily: DemoEvents.revenue | sum(window=since(daily())),
event_count_hourly: DemoEvents | count(window=since(hourly())),
```

Of course, a few more lines of code are needed to put these calculations to
work, but these six lines are all that is needed to specify the calculations
themselves. Each line may specify:
* the name of a calculation (e.g. `event_count_total`)
* the input data to start with (e.g. `DemoEvents`)
* selecting event fields (e.g. `DemoEvents.revenue`)
* function calls (e.g. `count()`)
* event filtering (e.g. `when(DemoEvents.event_name == 'purchase')`)
* time windows to calculate over (e.g. `window=since(daily())`)

...with consecutive steps separated by a familiar pipe (`|`) notation.
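One way to read the pipe notation is as function composition over a stream of events. The plain-Python sketch below is only a mental model, not how Kaskada executes queries. The `when`, `count`, and `pipe` functions are hypothetical stand-ins written for this illustration, and the sketch assumes that an aggregation yields a running value per entity at each event that reaches it.

```python
from collections import defaultdict


def when(predicate):
    """Stand-in for a filtering step: keep only events matching `predicate`."""
    def step(stream):
        return (event for event in stream if predicate(event))
    return step


def count():
    """Stand-in for an aggregation step: running count per entity."""
    def step(stream):
        totals = defaultdict(int)
        for event in stream:
            totals[event["entity_id"]] += 1
            yield (event["event_at"], event["entity_id"], totals[event["entity_id"]])
    return step


def pipe(source, *steps):
    """Feed the output of each step into the next, like `a | b | c`."""
    stream = iter(source)
    for step in steps:
        stream = step(stream)
    return stream


# A few events modeled on the demo data (times shortened for readability).
demo_events = [
    {"event_at": "2022-01-01 22:17", "entity_id": "user_002", "event_name": "view_item"},
    {"event_at": "2022-01-01 23:40", "entity_id": "user_001", "event_name": "purchase"},
    {"event_at": "2022-01-02 06:25", "entity_id": "user_001", "event_name": "purchase"},
    {"event_at": "2022-01-02 06:30", "entity_id": "user_001", "event_name": "view_item"},
    {"event_at": "2022-01-02 06:31", "entity_id": "user_001", "event_name": "purchase"},
]

# Rough analogue of: DemoEvents | when(DemoEvents.event_name == 'purchase') | count()
purchases_total_count = list(
    pipe(demo_events, when(lambda e: e["event_name"] == "purchase"), count())
)
for row in purchases_total_count:
    print(row)
```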

Because Kaskada was built for time-centric calculations on event-based data, a
calculation we might describe as "total number of purchase events for the user"
can be defined in Kaskada in roughly the same number of terms as the verbal
description itself.

Continue through the demo below to find out how it works.

See [the Kaskada
documentation](https://kaskada-ai.github.io/docs-site/kaskada/main/getting-started/installation.html)
for lots more information.

## Installation

In [None]:
# %pip install --upgrade pip
# %pip install pandas matplotlib numpy
# %pip install kaskada==0.1.1a7


## Kaskada Client Setup


In [None]:
import kaskada.api.release as release
from kaskada import table as ktable
import os
from getpass import getpass

# at the moment, we need a GitHub access token
# generate one from your GitHub account
os.environ[release.ReleaseClient.GITHUB_ACCESS_TOKEN_ENV] = getpass(prompt='GitHub Access Token:')

In [None]:
from kaskada.api.session import LocalBuilder

# start a Kaskada session
session = LocalBuilder().build()

# load the extension that parses Kaskada queries
%reload_ext fenlmagic

### Developer Notes on Session Builder

The next version of Kaskada will use an API session builder that closely follows PySpark's approach to local connections.

#### Local Session Builder
By default, the local session builder (`LocalBuilder`) assumes:
* Endpoint: `localhost:50051` for the API server
* Is Secure: `False`
* Will spin up the API server and Compute Server binaries.
  * Assumes Kaskada root is **~/.cache/kaskada**. Override by setting *KASKADA_PATH*
  * Assumes the binaries are stored in *KASKADA_PATH/bin*. Override by setting *KASKADA_BIN_PATH* (default is bin)
  * Assumes the logs are stored in *KASKADA_PATH/logs*. Override by setting *KASKADA_LOG_PATH* (default is logs)
  
Most people running locally can start the server with just `LocalBuilder().build()`.
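As a rough illustration of how the defaults listed above combine, the sketch below resolves the three locations from environment variables. `resolve_kaskada_paths` is a hypothetical helper written for this notebook, not part of the Kaskada client. It assumes that `KASKADA_BIN_PATH` and `KASKADA_LOG_PATH` name subdirectories of `KASKADA_PATH`, as the "(default is bin)" wording suggests.

```python
import os
from pathlib import Path


def resolve_kaskada_paths(env=None):
    """Hypothetical helper: mirror the documented defaults.

    Assumes KASKADA_BIN_PATH and KASKADA_LOG_PATH are subdirectory names
    under KASKADA_PATH. This is an illustration, not Kaskada's own code.
    """
    env = os.environ if env is None else env
    root = Path(env.get("KASKADA_PATH", "~/.cache/kaskada")).expanduser()
    return {
        "root": root,
        "bin": root / env.get("KASKADA_BIN_PATH", "bin"),
        "logs": root / env.get("KASKADA_LOG_PATH", "logs"),
    }


# Example: relocate the Kaskada root and rename the log directory.
paths = resolve_kaskada_paths(
    {"KASKADA_PATH": "/tmp/kaskada-demo", "KASKADA_LOG_PATH": "session-logs"}
)
for name, path in paths.items():
    print(f"{name}: {path}")
```

To apply an override in the notebook itself, the variables would presumably need to be set in `os.environ` before calling `LocalBuilder().build()`.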

## Example dataset

For this demo, we'll use a very small example dataset. To keep the notebook simple and portable, we'll read it from a string.

You can load your own event data from many common sources. See [the Loading Data
documentation](https://kaskada-ai.github.io/docs-site/kaskada/main/loading-data.html)
for more information.

In [None]:
from io import StringIO
import pandas

# For demo simplicity, instead of a CSV file, we read and then parse data from a
# CSV string. Any event data in a dataframe will work.

event_data_string = '''
    event_id,event_at,entity_id,event_name,revenue
    ev_00001,2022-01-01 22:01:00+00:00,user_001,login,0
    ev_00002,2022-01-01 22:05:00+00:00,user_001,view_item,0
    ev_00003,2022-01-01 22:20:00+00:00,user_001,view_item,0
    ev_00004,2022-01-01 23:10:00+00:00,user_001,view_item,0
    ev_00005,2022-01-01 23:20:00+00:00,user_001,view_item,0
    ev_00006,2022-01-01 23:40:00+00:00,user_001,purchase,12.50
    ev_00007,2022-01-01 23:45:00+00:00,user_001,view_item,0
    ev_00008,2022-01-01 23:59:00+00:00,user_001,view_item,0
    ev_00009,2022-01-02 05:30:00+00:00,user_001,login,0
    ev_00010,2022-01-02 05:35:00+00:00,user_001,view_item,0
    ev_00011,2022-01-02 05:45:00+00:00,user_001,view_item,0
    ev_00012,2022-01-02 06:10:00+00:00,user_001,view_item,0
    ev_00013,2022-01-02 06:15:00+00:00,user_001,view_item,0
    ev_00014,2022-01-02 06:25:00+00:00,user_001,purchase,25
    ev_00015,2022-01-02 06:30:00+00:00,user_001,view_item,0
    ev_00016,2022-01-02 06:31:00+00:00,user_001,purchase,5.75
    ev_00017,2022-01-02 07:01:00+00:00,user_001,view_item,0
    ev_00018,2022-01-01 22:17:00+00:00,user_002,view_item,0
    ev_00019,2022-01-01 22:18:00+00:00,user_002,view_item,0
    ev_00020,2022-01-01 22:20:00+00:00,user_002,view_item,0
'''

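event_stringio = StringIO(event_data_string)
# skipinitialspace=True drops the indentation in front of each CSV line
events_df = pandas.read_csv(event_stringio, skipinitialspace=True)

# convert `event_at` column from string to datetime
events_df['event_at'] = pandas.to_datetime(events_df['event_at'])
# also keep the event time as seconds since the Unix epoch
events_df['event_at_epoch'] = events_df['event_at'].apply(lambda x: x.timestamp())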


In [None]:
# inspect the event data in the dataframe
events_df

## Load the data into Kaskada

Kaskada uses a new model of event processing to do calculations temporally,
unlike table-centric tools based on SQL. So, we need to load the data into
Kaskada in order to perform the calculations we want.

Once the Kaskada client is installed and imported as above, we can load the data
by:
* creating a table with `create_table`
* loading it into Kaskada with `load_dataframe`

In [None]:
# delete an existing table, if needed
# try:
#   ktable.delete_table("DemoEvents",
#                       force=True)
# except:
#   pass


ktable.create_table(
  table_name = "DemoEvents",
  entity_key_column_name = "entity_id",
  time_column_name = "event_at",
)

# Upload the dataframe's contents to the Kaskada table
ktable.load_dataframe("DemoEvents",
                      events_df)



In [None]:
# check to confirm that the table exists
ktable.list_tables()

## Define queries and calculations

Kaskada's query language is parsed by the `fenlmagic` extension loaded above.
Query calculations are defined in code blocks starting with `%%fenl`.

See [the `fenl`
documentation](https://kaskada-ai.github.io/docs-site/kaskada/main/fenl/fenl-quick-start.html)
for more information.

Let's do a simple query for events for a specific entity ID.


In [None]:
%%fenl

DemoEvents | when(DemoEvents.entity_id == 'user_002')

When using the pipe notation, we can use `$input` to represent the thing being
piped to a subsequent step.

In [None]:
%%fenl

DemoEvents | when($input.entity_id == 'user_002')


Beyond querying for events, Kaskada has a powerful syntax for defining
calculations on events, temporally across history.

The six calculations discussed at the top of this demo notebook are below.

(Note that some functions return `NaN` if no events for that user have occurred
within the time window.)

In [None]:
%%fenl

{
    event_count_total: DemoEvents | count(),
    event_count_hourly: DemoEvents | count(window=since(hourly())),
    purchases_total_count: DemoEvents | when(DemoEvents.event_name == 'purchase') | count(),
    purchases_daily: DemoEvents | when(DemoEvents.event_name == 'purchase') | count(window=since(daily())),
    revenue_daily: DemoEvents.revenue | sum(window=since(daily())),
    revenue_total: DemoEvents.revenue | sum(),
}
| when(hourly())  # each row in the output represents one hour of time


#### Trailing `when()` clause

A key feature of Kaskada's time-centric design is the ability to query for
calculation values at any point in time. Traditional query languages (e.g. SQL)
can only return data that already exists---if we want to return a row of
computed/aggregated data, we have to compute the row first, then return it. As a
specific example, suppose we have SQL queries that produce daily aggregations
over event data, and now we want to have the same aggregations on an hourly
basis. In SQL, we would need to write new queries for hourly aggregations; the
queries would look very similar to the daily ones, but they would still be
different queries.

With Kaskada, we can define the calculations once, and then separately specify
the points in time at which we want to know the calculation values.

Note the final line in the above query:
```
| when(hourly())
```
We call this a "trailing `when`" clause, and its purpose is to specify the time
points you would like to see in the query results.

Regardless of the time cadence of the calculations themselves, the query output
can contain rows for whatever timepoints you specify. You can define a set of
daily calculations and then get hourly updates during the day. Or, you can
publish a set of calculations in a query view (see below), and different users
can query those same calculations for hourly, daily, and monthly
values---without editing the calculation definitions themselves.
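The idea of defining a calculation once and choosing the output time points separately can be modeled in a few lines of plain Python. This is a conceptual sketch only, not Kaskada's implementation. The `revenue_total`, `ticks`, and `sample` functions are hypothetical names, and the data is the three `user_001` purchases from the demo dataset.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

# (event_at, revenue) for the three purchases made by user_001
purchases = [
    (datetime(2022, 1, 1, 23, 40, tzinfo=UTC), 12.50),
    (datetime(2022, 1, 2, 6, 25, tzinfo=UTC), 25.0),
    (datetime(2022, 1, 2, 6, 31, tzinfo=UTC), 5.75),
]


def revenue_total(as_of):
    """The calculation, defined once: revenue from all events up to `as_of`."""
    return sum(revenue for event_at, revenue in purchases if event_at <= as_of)


def ticks(start, end, step):
    """Evenly spaced time points from `start` to `end`, inclusive."""
    t = start
    while t <= end:
        yield t
        t += step


def sample(calculation, time_points):
    """Rough analogue of a trailing `when`: evaluate at the chosen times."""
    return [(t, calculation(t)) for t in time_points]


# Same calculation, two output cadences.
hourly_rows = sample(
    revenue_total,
    ticks(datetime(2022, 1, 2, 5, tzinfo=UTC),
          datetime(2022, 1, 2, 8, tzinfo=UTC),
          timedelta(hours=1)),
)
daily_rows = sample(
    revenue_total,
    ticks(datetime(2022, 1, 2, tzinfo=UTC),
          datetime(2022, 1, 3, tzinfo=UTC),
          timedelta(days=1)),
)

for t, value in hourly_rows + daily_rows:
    print(t.isoformat(), value)
```

Only the list of time points changes between the hourly and daily outputs; `revenue_total` itself is untouched.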


#### Adding more calculations to the query

We can add two new calculations, also in one line each, representing:
* the time of the user's first event
* the time of the user's last event

We can also add the parameter `--var event_calculations` to save the results
into a Python object called `event_calculations` that can be used in subsequent
Python code.

In [None]:
%%fenl --var event_calculations

{
    event_count_total: DemoEvents | count(),
    event_count_hourly: DemoEvents | count(window=since(hourly())),
    purchases_total_count: DemoEvents | when(DemoEvents.event_name == 'purchase') | count(),
    purchases_daily: DemoEvents | when(DemoEvents.event_name == 'purchase') | count(window=since(daily())),
    revenue_daily: DemoEvents.revenue | sum(window=since(daily())),
    revenue_total: DemoEvents.revenue | sum(),
    
    first_event_at: DemoEvents.event_at | first(),
    last_event_at: DemoEvents.event_at | last(),
}
| when(hourly())


In [None]:
# the object `event_calculations` has an attribute called `dataframe` that can
# be used like any other dataframe, for data exploration, visualization,
# analytics, or machine learning.

event_calculations.dataframe

This is only a small sample of possible Kaskada queries and capabilities. See
[the `fenl`
catalog](https://kaskada-ai.github.io/docs-site/kaskada/main/fenl/catalog.html)
for a full list of functions and operators.

## Publish Query Calculation Definitions as Views

The definitions of your query calculations can be published in Kaskada and used
elsewhere, including in other Kaskada queries.

In [None]:
from kaskada import view as kview

kview.create_view(
  view_name = "DemoFeatures", 
  expression = event_calculations.expression,
)

In [None]:
# list views with a search term
kview.list_views(search = "DemoFeatures")

We can query a published view just like we would any other dataset.

In [None]:
%%fenl

DemoFeatures | when($input.revenue_daily > 0)