# Kaskada: Hello World

#### A minimal workflow for building continuous-time features using Kaskada's time-centric feature engine

If you have any issues at all, please contact us using the chat/message widget on [Kaskada.com](https://kaskada.com/).

---

### Minimal Kaskada workflow in this notebook:

1. Set up the Python libraries to run `kaskada`
2. Create a table and load data into that table
3. Query the table in a few different ways
4. Build features directly from event data
5. Save the feature set as a view to share with other notebooks and people or to use in production

### Kaskada API

Most interactions with Kaskada are performed via an API. The `kaskada` Python
library is designed to provide convenient access to the Kaskada API for data
scientists and other users of Python notebooks.

Fenl is the **F**eature **En**gineering **L**anguage used by Kaskada's feature
engine to abstract and simplify the construction of features. 


In [None]:
from kaskada.api.session import LocalBuilder
session = LocalBuilder().download(False).build()
%load_ext fenlmagic

## Loading Data

Kaskada stores data in tables, similar to tables you might query using SQL or
DataFrames in Python's `pandas`. Like these, tables have named columns and any
number of rows.


### Tables in Kaskada

Tables in Kaskada are designed for time-centric analysis of event data. This
means that, for many common calculations, time is handled implicitly and without
any need to write extra lines of code making sure that time points line up.

The two key elements in any event are the
[`entity`](https://docs.kaskada.com/docs/entities) with which the event is
associated and the
[`timestamp`](https://docs.kaskada.com/docs/temporal-aggregation) for when the
event occurred.

Therefore, when creating a table, you must specify a table name and the columns
that contain the `entity` and the `timestamp`. An optional fourth parameter,
`subsort_column_name`, is used to order events only when multiple events have
the same `timestamp`.

So, creating a table takes four parameters:

1. `table_name`
2. `entity_key_column_name`
3. `time_column_name`
4. (optional) `subsort_column_name`
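
To make the role of the subsort column concrete, here is a minimal plain-Python sketch of the ordering rule described above. It does not call Kaskada; the events and the `order_events` helper are hypothetical and only illustrate that the subsort value matters when two events share a timestamp.

```python
from datetime import datetime

# Hypothetical events: (entity, timestamp, subsort_id, label).
events = [
    ("ada", "2022-01-01 12:00:00+00:00", 2, "second"),
    ("ada", "2022-01-01 12:00:00+00:00", 1, "first"),
    ("ada", "2022-01-01 11:00:00+00:00", 7, "earlier"),
]

def order_events(rows):
    """Sort by timestamp, using the subsort value only to break ties."""
    return sorted(rows, key=lambda r: (datetime.fromisoformat(r[1]), r[2]))

ordered_names = [r[3] for r in order_events(events)]
print(ordered_names)
```

The event with the largest subsort value still comes first here because its timestamp is earlier; the subsort value only decides the order of the two events at 12:00.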


### Building a Sample Data Set

In [None]:
# This is a sample CSV dataset that we will load into Kaskada.

data = """event_at,entity_id,event_name,commit_count
2022-01-01 12:00:00+00:00,ada,wrote_code,1
2022-01-01 13:10:00+00:00,ada,wrote_code,1
2022-01-01 13:20:00+00:00,ada,wrote_code,1
2022-01-01 14:00:00+00:00,ada,wrote_code,3
2022-01-01 12:00:00+00:00,brian,data_scienced,1
2022-01-01 12:20:00+00:00,brian,data_scienced,2
2022-01-01 13:40:00+00:00,brian,data_scienced,1
2022-01-01 15:00:00+00:00,brian,data_scienced,1"""
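
Before uploading, it can help to confirm locally that the CSV contains the columns you plan to use as the entity key and the time column. The following is a minimal sketch using only the Python standard library; the `check_events` helper is hypothetical and not part of the `kaskada` library, and the two-row excerpt is repeated so the sketch runs on its own.

```python
import csv
import io
from datetime import datetime

# A two-row excerpt of the sample data above.
sample = """event_at,entity_id,event_name,commit_count
2022-01-01 12:00:00+00:00,ada,wrote_code,1
2022-01-01 12:00:00+00:00,brian,data_scienced,1"""

def check_events(csv_text, entity_col="entity_id", time_col="event_at"):
    """Parse the CSV; raise ValueError on a missing key column or bad timestamp."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {entity_col, time_col} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = list(reader)
    for row in rows:
        # fromisoformat raises ValueError if the timestamp is malformed
        datetime.fromisoformat(row[time_col])
    return rows

rows = check_events(sample)
entities = sorted({r["entity_id"] for r in rows})
print(len(rows), entities)
```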


### Creating a Kaskada Table and Uploading Data

Below, we load the above CSV directly into Kaskada. When a table
is created, it is persisted in your Kaskada environment.

Kaskada also allows uploading data from Parquet files.


In [None]:
import kaskada.table as ktable
# Delete the existing table, if there is one, so this cell can be re-run
try:
  ktable.delete_table("SampleEvents",
                      force=True)
except Exception:
  # assume the failure means the table does not exist yet
  pass

  
# Create a table object in Kaskada. It is empty until loaded with data.
ktable.create_table(
  table_name = "SampleEvents",
  entity_key_column_name = "entity_id",
  time_column_name = "event_at",
  # subsort_column_name = "subsort_id",  # (optional parameter)
)



In [None]:
# Save the CSV data to a temp file and load it into the SampleEvents table
import tempfile
temp_file = tempfile.NamedTemporaryFile(
    prefix="kaskada_", suffix=".csv", delete=False
)
temp_file.write(bytes(data, 'utf-8'))
temp_file.close()

ktable.load(table_name="SampleEvents", file=temp_file.name)


### Working With Your Kaskada Environment


In [None]:
# Get the table after loading data
ktable.get_table("SampleEvents")

### Viewing table data

Above, we created a Kaskada table called `SampleEvents`. We can retrieve it
again, now or in a future session, via a query using `fenl`.

Here, we use the `%%fenl` magic function to run a query using the fenl syntax.
The `--var` flag causes the result of the query to be stored in a Python
variable called `query_result`.



In [None]:
%%fenl --var query_result

SampleEvents


You may notice that the query result object (which we have named `query_result`) contains more than just the results of the query. It also holds a few types of metadata that can be used for troubleshooting, reproducibility, and other functionality of the Kaskada API. To get the results of the query as rows and columns of data, we can access the underlying pandas dataframe as below.

In [None]:
dat_orig = query_result.dataframe

dat_orig

The table `SampleEvents` will be persisted in your Kaskada environment, where
you can access it in any future session by connecting to Kaskada.

Note that table definitions are currently immutable, which means that you can't
update or change table columns, or change which of them is the `entity_key` or
`time_column`. You can, however, add data to a table, and you can always delete
a table and then re-create it with modifications.


### Deleting a table

Deleting a table also deletes Kaskada's copy of the data, but your original
data is not affected. You can delete a table as follows:
```
ktable.delete_table("SampleEvents")
```


## Fenl Queries

Fenl is a compact, composable, expressive syntax designed for calculating
time-centric feature values on event data.

Fenl syntax:
* implicitly handles event timing unless you explicitly declare other relevant
  points or windows in time (implicit time JOINs)
* implicitly calculates on the defined primary `entity`, unless
  declared otherwise (implicit entity JOINs)
* treats feature values implicitly as continuous time, not forcing you to set
  rigid time points or time windows until you need to further downstream


### Fenl Syntax Basics

Fenl queries are enclosed in a query block prefixed with `%%fenl` on a new
line.

Here is the same whole-table query as above (SELECT everything):


In [None]:
%%fenl --var query_result

SampleEvents


Fenl offers pipe notation (`first_thing | second_thing`) to denote that the
results of `first_thing` are passed along to the `second_thing` via the keyword
`$input`. In many cases, this keyword is not needed when the `second_thing` has
an obvious default way to use the input; there are many examples of this below.

In this next example, all data in the `SampleEvents` table are passed along to
the `when()` function, which can filter data like a SQL `WHERE` clause but also
has some temporal functionality that we show later.


In [None]:
%%fenl --var query_result

SampleEvents | when($input.entity_id == "ada")  # make sure to use double-quotes, not single
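
As a rough analogy only, the pipe-and-filter pattern above can be sketched in plain Python. This is not how Fenl is implemented; the `pipe` and `keep_when` helpers and the in-memory rows are hypothetical, and the argument passed to each step plays the role of `$input`.

```python
# Hypothetical in-memory rows standing in for the SampleEvents table.
rows = [
    {"entity_id": "ada", "event_name": "wrote_code", "commit_count": 1},
    {"entity_id": "brian", "event_name": "data_scienced", "commit_count": 2},
    {"entity_id": "ada", "event_name": "wrote_code", "commit_count": 3},
]

def pipe(value, *steps):
    """Pass value through each step; each step receives the previous result."""
    for step in steps:
        value = step(value)
    return value

def keep_when(predicate):
    """Build a step that keeps only the rows for which predicate is true."""
    return lambda input_rows: [r for r in input_rows if predicate(r)]

ada_rows = pipe(rows, keep_when(lambda r: r["entity_id"] == "ada"))
ada_commits = [r["commit_count"] for r in ada_rows]
print(ada_commits)
```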


### Composing Queries

We can build queries by using curly braces containing column names
and the `fenl` expressions describing those columns. Column descriptions can
contain input data and pipes to functions like `count()`, `sum()`, or `max()`.

In [None]:
%%fenl --var query_result

{
    timestamp: SampleEvents.event_at,
    username: SampleEvents.entity_id,
    event_count: SampleEvents | count(),
    commit_total: SampleEvents.commit_count | sum(),
    commit_event_max: SampleEvents.commit_count | max(),
}
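
To build intuition for the aggregated columns, the sketch below mimics them in plain Python under one assumption: that an aggregation without a window accumulates over all of an entity's events seen so far and produces a value at each event. The `running_aggregates` helper is hypothetical, and the rows are ada's events from the sample data with shortened timestamps.

```python
from collections import defaultdict

# (event_at, entity_id, commit_count) for ada, copied from the sample data.
events = [
    ("2022-01-01 12:00", "ada", 1),
    ("2022-01-01 13:10", "ada", 1),
    ("2022-01-01 13:20", "ada", 1),
    ("2022-01-01 14:00", "ada", 3),
]

def running_aggregates(rows):
    """Emit one row per event with per-entity running count, sum, and max."""
    state = defaultdict(lambda: {"count": 0, "total": 0, "max": None})
    out = []
    for event_at, entity, commits in sorted(rows):  # sorted by time string
        s = state[entity]
        s["count"] += 1
        s["total"] += commits
        s["max"] = commits if s["max"] is None else max(s["max"], commits)
        out.append((event_at, entity, s["count"], s["total"], s["max"]))
    return out

result = running_aggregates(events)
for row in result:
    print(row)
```

Each output row corresponds to one input event, and the state is kept separately per entity, which is the plain-Python counterpart of the implicit entity grouping described earlier.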


### Time Series in Continuous Time

Computations in `fenl` are inherently temporal:

* Most aggregated columns can be thought of as continuous time series
* These continuous time series can be queried at any arbitrary points in time,
  not just at the timestamps of the rows

Read more about continuous time series in the [Fenl Language
Guide](https://docs.kaskada.com/docs/language-guide).

For example, regardless of the timestamps of the input data, we can easily
aggregate hourly statistics for each entity without having to write any time
windowing or JOIN logic other than a parameter specification
`window=since(hourly())`.

In [None]:
%%fenl --var query_result

# build a continuous time series version of the username so that we
# can query it at any time point, including between events in the data set
let username_continuous = SampleEvents.entity_id | last()

in {
    username_continuous,

    # calculate continuous time series columns for the total event count and
    # hourly event count
    count_hourly: SampleEvents | count(window=since(hourly())),
    event_count_total: SampleEvents | count(),

    # note in the results that these non-continuous columns won't have a value
    # at many time points 
    event_time_not_continuous: SampleEvents.event_at,
    event_username_not_continuous: SampleEvents.entity_id,
}
| extend({
    # we use the `extend` function to add columns dealing with time values that
    # are relative to other events or calculated values
    timestamp_continuous: time_of($input),
})
| when(hourly())  # filter results to only the hourly time points
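
The following plain-Python sketch illustrates the idea behind an hourly window evaluated at hourly time points. It is not Kaskada's implementation: the `hourly_counts` helper is hypothetical, and the choice to count an event that falls exactly on the hour in the window ending at that hour is an assumption that may differ from Kaskada's behavior.

```python
from datetime import datetime, timedelta

# Timestamps for ada from the sample data (UTC).
ada_events = [
    datetime(2022, 1, 1, 12, 0),
    datetime(2022, 1, 1, 13, 10),
    datetime(2022, 1, 1, 13, 20),
    datetime(2022, 1, 1, 14, 0),
]

def hourly_counts(times, start, end):
    """For each hourly tick after start, up to end, return
    (tick, events in the last hour, events so far).

    Assumption: an event exactly on a tick belongs to the window ending there.
    """
    out = []
    tick = start + timedelta(hours=1)
    while tick <= end:
        window_start = tick - timedelta(hours=1)
        in_window = sum(1 for t in times if window_start < t <= tick)
        so_far = sum(1 for t in times if t <= tick)
        out.append((tick, in_window, so_far))
        tick += timedelta(hours=1)
    return out

counts = hourly_counts(
    ada_events, datetime(2022, 1, 1, 11, 0), datetime(2022, 1, 1, 15, 0)
)
for tick, in_window, so_far in counts:
    print(tick.strftime("%H:%M"), in_window, so_far)
```

Note that the sketch produces a row at 13:00 and at 15:00 even though ada has no event at those times, which is what querying a continuous time series at arbitrary time points means in practice.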

### Features for Hourly User Activity

In this next example, we build a set of basic features for hourly activity for
each user.

In [None]:
%%fenl --var query_result

# continuous time series columns
let event_count_total = SampleEvents | count()
let count_hourly = SampleEvents | count(window=since(hourly()))
let first_event_time = SampleEvents.event_at | first(window=since(hourly()))
let last_event_time = SampleEvents.event_at | last(window=since(hourly()))
let commit_total = SampleEvents.commit_count | sum(window=since(hourly())) | else(0)
let commit_event_max = SampleEvents.commit_count | max(window=since(hourly()))

# continuous time series versions
let timestamp_continuous = event_count_total | time_of()
let username_continuous = SampleEvents.entity_id | last()

in {
    username_continuous,
    first_event_time,
    last_event_time,
    count_hourly,
    event_count_total,
    commit_total,
    commit_event_max,
}
| extend({
    timestamp_continuous: time_of($input),
})
| when(hourly())

## Saving, Publishing, and Sharing Queries

To save a query in the Kaskada system, we can create a **view**. Views can be
retrieved later, shared with others, or put into production.

To create a view from one of the queries above, note that the query response was
saved in the `query_result` Python variable. Pass that variable's `query`
attribute as the `expression` parameter of the Kaskada `create_view()` function,
as below.

In [None]:
from kaskada import view as kview

# Delete the existing view, if there is one, before updating or re-creating it
try:
  kview.delete_view("SampleView")
except Exception:
  # assume the failure means the view does not exist yet
  pass

kview.create_view(
    view_name = "SampleView",
    expression = query_result.query,
)


We can interact with views via a few functions, such as:

```
# lists all views
kview.list_views()

# search for views by partial name
kview.list_views(search="Sample")

# get the specification of a view (not the view's query results)
kview.get_view("SampleView")

# delete the view
kview.delete_view("SampleView")
```