This is one of the Objectiv [example notebooks](https://objectiv.io/docs/modeling/example-notebooks/). These notebooks can run [on your own data](https://objectiv.io/docs/modeling/get-started-in-your-notebook/), or you can instead run the [Demo](https://objectiv.io/docs/home/try-the-demo/) to quickly try them out.

This notebook demonstrates what you can do with the [Bach modeling library](https://objectiv.io/docs/modeling/bach/) and a dataset that is validated against the [open analytics taxonomy](https://objectiv.io/docs/taxonomy/). The Objectiv [Bach API](https://objectiv.io/docs/modeling/bach/api-reference/) is strongly pandas-like, to provide a familiar interface for handling large amounts of data in a Python environment, while supporting multiple data stores. See [an introduction to the pandas API here](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) if you're not familiar with it yet.

This example uses real data collected with Objectiv's [Tracking SDK](https://objectiv.io/docs/tracking/) on objectiv.io, stored in an SQL database. 

## Getting started
First we have to install the open model hub and instantiate the Objectiv DataFrame object; [see here how to get started in your notebook](https://objectiv.io/docs/modeling/get-started-in-your-notebook/). 

The open model hub is a toolkit with functions and models that can run directly on a full dataset collected with Objectiv’s Tracker SDKs. The [`get_objectiv_dataframe()`](https://objectiv.io/docs/modeling/open-model-hub/api-reference/ModelHub/get_objectiv_dataframe/) operation creates a Bach DataFrame that has all columns and data types set correctly, and as such can always be used with models from the open model hub. 

By instantiating the model hub with a `global_contexts` parameter, all global contexts that are needed in the following analyses are added to the DataFrame. In this example, we select 'application' and 'marketing' contexts. [Later in this notebook](#Global-contexts) we'll give more details on what data is available in the global contexts and how to access this data for analyses.

In [None]:
# set the timeframe of the analysis
start_date = '2022-03-01'
end_date = None

In [None]:
from modelhub import ModelHub
from bach import display_sql_as_markdown

# instantiate the model hub, set the default time aggregation to daily
# and get the global contexts that will be used in this example
modelhub = ModelHub(time_aggregation='%Y-%m-%d', global_contexts=['application', 'marketing'])
# get an Objectiv DataFrame within a defined timeframe
df = modelhub.get_objectiv_dataframe(start_date=start_date, end_date=end_date)

The data for this DataFrame stays in the database: the database is not queried until data is loaded into the Python environment. The methods that query the database are:
* [`head()`](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/head/)
* [`to_pandas()`](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/to_pandas/)
* [`get_sample()`](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/get_sample/)
* [`to_numpy()`](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/to_numpy/)
* The property accessors [`Series.array`](https://objectiv.io/docs/modeling/bach/api-reference/Series/array/) and [`Series.value`](https://objectiv.io/docs/modeling/bach/api-reference/Series/value/)

For demo purposes of this notebook, these methods are called often to show the results of our operations. To limit the number of executed queries on the full dataset, it is recommended to use these methods less often or [to sample the data first](#Sampling).
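
The deferred execution described above can be illustrated with a small stand-in. This is a minimal sketch using Python's built-in `sqlite3` module; the `LazyTable` class, its methods, and the `events` table are hypothetical and are not part of Bach. It only shows the general idea that operations build up a query, and that the database is touched only when results are requested.

```python
import sqlite3


class LazyTable:
    """Hypothetical stand-in for a lazily evaluated DataFrame (not Bach itself)."""

    def __init__(self, conn, table, conditions=None):
        self.conn = conn
        self.table = table
        self.conditions = conditions or []

    def filter(self, condition):
        # Returns a new object; no query is sent to the database here.
        return LazyTable(self.conn, self.table, self.conditions + [condition])

    def to_sql(self):
        sql = f"SELECT * FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(self.conditions)
        return sql

    def head(self, n=5):
        # Only this method actually executes SQL.
        return self.conn.execute(f"{self.to_sql()} LIMIT {int(n)}").fetchall()


# Made-up data in an in-memory database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_type TEXT, session_id INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("PressEvent", 1), ("VisibleEvent", 1), ("PressEvent", 2)],
)

events = LazyTable(conn, "events")
presses = events.filter("event_type = 'PressEvent'")  # nothing executed yet
print(presses.to_sql())
rows = presses.head()  # the query runs here
print(rows)
```

In this sketch, chaining `filter()` calls is free, while every `head()` call costs a query, which is why calling such methods sparingly (or on a sample) limits the load on the full dataset.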

### Reference
* [modelhub.ModelHub.get_objectiv_dataframe](https://objectiv.io/docs/modeling/open-model-hub/api-reference/ModelHub/get_objectiv_dataframe/)

## The data
The DataFrame contains:

* The index. This is a unique identifier for every hit.

In [None]:
df.index_dtypes

* The event data. These columns contain all information about the event.

In [None]:
df.dtypes

What’s in these columns:
* `day`: the day of the session as a date.
* `moment`: the exact moment of the event.
* `user_id`: the unique identifier of the user based on the cookie.
* `location_stack`: a JSON-like data column that stores information on the exact location where the event is triggered in the product's UI. [See below](#location_stack) for a more detailed explanation.
* `event_type`: the type of event that is logged.
* `stack_event_types`: the parents of the event_type.
* `session_id`: a unique incremented integer id for each session. Starts at 1 for the selected data in the DataFrame.
* `session_hit_number`: an incremented integer id for each hit in the session, ordered by moment.
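
To make the relation between `session_id`, `moment`, and `session_hit_number` concrete, here is a minimal sketch in plain Python. The event rows are made up, and numbering hits from 1 is an assumption for this illustration; the real column is computed for you when the DataFrame is created.

```python
from itertools import groupby
from operator import itemgetter

# Made-up events, deliberately out of order.
events = [
    {"session_id": 2, "moment": "2022-03-01 10:05:00", "event_type": "PressEvent"},
    {"session_id": 1, "moment": "2022-03-01 09:00:03", "event_type": "PressEvent"},
    {"session_id": 1, "moment": "2022-03-01 09:00:00", "event_type": "ApplicationLoadedEvent"},
]

# Order by session first, then by moment within the session.
ordered = sorted(events, key=itemgetter("session_id", "moment"))

# Number the hits within each session (starting at 1 is an assumption).
for _, session_events in groupby(ordered, key=itemgetter("session_id")):
    for hit_number, event in enumerate(session_events, start=1):
        event["session_hit_number"] = hit_number

for event in ordered:
    print(event["session_id"], event["session_hit_number"], event["event_type"])
```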

Besides these 'standard' columns, the DataFrame contains additional columns that are extracted from the global contexts:
* `application`
* `marketing`

[See more about global contexts here](#Global-contexts).

A preview of the data below, showing the latest PressEvents.

In [None]:
df[df.event_type == 'PressEvent'].sort_values('moment', ascending=False).head()

### Reference
* [bach.DataFrame.index_dtypes](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/index_dtypes/)
* [bach.DataFrame.dtypes](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/dtypes/)
* [bach.DataFrame.sort_values](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/sort_values/)
* [bach.DataFrame.head](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/head/)

## The Open Taxonomy
Data in a DataFrame created with `get_objectiv_dataframe()` follows the [open analytics taxonomy](https://objectiv.io/docs/taxonomy/core-concepts/):
* **event_type** column: describes the type of interactive or non-interactive event.
* **location_stack** column: describes where an event exactly happened in the user interface.
* **global contexts** data: general information about the state in which an event happened.

The following section goes through these concepts one-by-one.

### `event_type` column

The `event_type` describes what type of event was triggered. The goal of the open taxonomy is to label all interactive and non-interactive events in a standardized way. Together with the `location_stack`, the `event_type` 'defines' what happened with, or on, the product.

In [None]:
# use a day inside the selected timeframe (the data starts at start_date)
df[df.day == start_date].event_type.head()

### Location stack & global contexts
The location stack and global contexts are stored as JSON-type data. Within the DataFrame, you can access this JSON data by position or by content.

**Slicing the JSON data**  
With the `.json[]` syntax you can slice the array using integers. Instead of integers, dictionaries can also be passed to 'query' the JSON array. If the passed dictionary matches a context object in the stack, all objects of the stack starting (or ending, depending on the slice) at that object will be returned.

**An example**  
Consider a JSON array that looks like this (this is a real example of a location stack):
```json
[{"id": "docs", "_type": "RootLocationContext"},
 {"id": "docs-sidebar", "_type": "NavigationContext"},
 {"id": "API Reference", "_type": "ExpandableContext"},
 {"id": "DataFrame", "_type": "ExpandableContext"},
 {"id": "Overview", "_type": "LinkContext"}]
```
**Regular slicing**
```python
df.location_stack.json[2:4]
```
For the example array it would return:
```json
[{"id": "API Reference", "_type": "ExpandableContext"},
 {"id": "DataFrame", "_type": "ExpandableContext"}]
```
**Slicing by querying**

We want to return only the part of the array starting at the object that matches:
```json
{"id": "docs-sidebar", "_type": "NavigationContext"}
```
The syntax for selecting like this is: 
```python
df.location_stack.json[{"id": "docs-sidebar", "_type": "NavigationContext"}:]
```
For the example array it would return:
```json
[{'id': 'docs-sidebar', '_type': 'NavigationContext'},
 {'id': 'API Reference', '_type': 'ExpandableContext'},
 {'id': 'DataFrame', '_type': 'ExpandableContext'},
 {'id': 'Overview', '_type': 'LinkContext'}]
```
If a JSON array does not contain the object, `None` is returned. See the [API reference](https://objectiv.io/docs/modeling/bach/api-reference/Series/Json/) for more information.
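
The slicing behavior described above can be mimicked on a single array in plain Python. This is only a sketch of the semantics, not Bach's implementation (Bach runs on the database); the helper `slice_from` is hypothetical, and it assumes that a context "matches" when it contains every key-value pair of the query dictionary.

```python
def slice_from(stack, query):
    """Return the part of `stack` starting at the first object matching `query`.

    Returns None when no object matches, mirroring the behavior described above.
    """
    for position, context in enumerate(stack):
        if all(context.get(key) == value for key, value in query.items()):
            return stack[position:]
    return None


location_stack = [
    {"id": "docs", "_type": "RootLocationContext"},
    {"id": "docs-sidebar", "_type": "NavigationContext"},
    {"id": "API Reference", "_type": "ExpandableContext"},
    {"id": "DataFrame", "_type": "ExpandableContext"},
    {"id": "Overview", "_type": "LinkContext"},
]

# Regular slicing by position, like .json[2:4]
by_position = location_stack[2:4]

# Slicing by querying, like .json[{"id": "docs-sidebar", "_type": "NavigationContext"}:]
by_query = slice_from(location_stack, {"id": "docs-sidebar", "_type": "NavigationContext"})

# A query that matches nothing
missing = slice_from(location_stack, {"id": "does-not-exist", "_type": "LinkContext"})
```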

### `location_stack` column
The `location_stack` column in the DataFrame stores the information on where an event exactly happened in the user interface. The example used above is the location stack of a link to the DataFrame API reference, in the menu on our documentation pages.

Because of the specific way the location information is labeled, validated, and stored using the open analytics taxonomy, it can be used to easily slice and group your product's features. The column is set as an `objectiv_location_stack` type, so location stack specific methods can be used to access the data in the `location_stack`. These [methods](https://objectiv.io/docs/modeling/bach/api-reference/Series/Json/) are available through the `.ls` accessor on the column:
* [`.ls.navigation_features`](https://objectiv.io/docs/modeling/open-model-hub/api-reference/SeriesLocationStack/ls/)
* [`.ls.feature_stack`](https://objectiv.io/docs/modeling/open-model-hub/api-reference/SeriesLocationStack/ls/)
* [`.ls.nice_name`](https://objectiv.io/docs/modeling/open-model-hub/api-reference/SeriesLocationStack/ls/)

For example:
```python
df.location_stack.ls.nice_name
```
returns *'Link: Overview located at Root Location: docs => Navigation: docs-sidebar => Expandable: API Reference => Expandable: DataFrame'* for the location stack mentioned above.
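
To show how such a readable name relates to the underlying JSON, the sketch below rebuilds the string in plain Python. The function `nice_name` and its formatting rules are inferred from the single example above and are an assumption; the actual `.ls.nice_name` accessor may treat other cases (such as a stack with one element) differently.

```python
import re


def nice_name(location_stack):
    """Build a readable name from a location stack (format inferred from one example)."""

    def label(context):
        # 'RootLocationContext' -> 'RootLocation' -> 'Root Location'
        type_name = re.sub(r"Context$", "", context["_type"])
        type_name = re.sub(r"(?<!^)(?=[A-Z])", " ", type_name)
        return f"{type_name}: {context['id']}"

    labels = [label(context) for context in location_stack]
    if not labels:
        return ""
    if len(labels) == 1:
        return labels[0]
    # The last context is the element itself; the rest is where it is located.
    return f"{labels[-1]} located at {' => '.join(labels[:-1])}"


example_stack = [
    {"id": "docs", "_type": "RootLocationContext"},
    {"id": "docs-sidebar", "_type": "NavigationContext"},
    {"id": "API Reference", "_type": "ExpandableContext"},
    {"id": "DataFrame", "_type": "ExpandableContext"},
    {"id": "Overview", "_type": "LinkContext"},
]

print(nice_name(example_stack))
```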

[See the full reference of the location stack here](https://objectiv.io/docs/taxonomy/location-contexts/). An example location stack for a PressEvent is queried below:

In [None]:
df[df.event_type == 'PressEvent'].location_stack.head(1)[0]

### Global contexts
Global contexts contain all general information that is relevant to the logged event. To optimize data processing, not all data that is stored in the global contexts in the database is loaded into the DataFrame when it is created. Data columns are only created for the global contexts that are selected when the model hub is instantiated. In this example, those columns are `application` and `marketing`.

Each selected global context is a JSON-like column of the 'objectiv_global_context' type, and therefore contains multiple key-value pairs. The data in these JSON columns can be accessed with the `context` accessor on the respective columns. For example, to get the ID of the application as a Series, you use:

```python
df.application.context.id
```

Similarly, the application ID can be set as regular (text) column in the DataFrame:

```python
df['application_id'] = df.application.context.id
```

[See the full reference of all available global contexts in the open taxonomy here](https://objectiv.io/docs/taxonomy/global-contexts/). Each global context _always_ has an 'id' key that uniquely identifies the global context of that type. Additional keys are shown in the blocks of each context in the reference.

From the marketing context, for example, we can therefore also get the 'source' as a column:

```python
df['marketing_source'] = df.marketing.context.source
```

When instantiating the model hub, global contexts are added using the name of the context without the word 'Context', converted to 'snake_case' (the name of the context split before every capital letter and joined with an underscore). For example, to add the HttpContext use 'http', and to add the InputValueContext use 'input_value':

```python
modelhub = ModelHub(global_contexts=['http', 'input_value'])
```
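
The naming rule above can be written as a small helper. This is a sketch in plain Python; the function `context_parameter_name` is hypothetical and not part of the model hub, it only applies the rule as described.

```python
import re


def context_parameter_name(context_type):
    """Turn a context type name such as 'InputValueContext' into 'input_value'."""
    # Drop the trailing word 'Context'.
    base = re.sub(r"Context$", "", context_type)
    # Insert an underscore before every capital letter except the first, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", base).lower()


names = [
    context_parameter_name(name)
    for name in ["HttpContext", "InputValueContext", "ApplicationContext", "MarketingContext"]
]
print(names)
```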

If you later want to add other data from the global contexts to your DataFrame, you will have to re-instantiate the model hub with those contexts and recreate the DataFrame. Note that recreating the DataFrame does not process any data until the data gets queried (by using `.head()` or similar).

In [None]:
# we create the columns from the examples above, and show the data.
df['application_id'] = df.application.context.id
df['marketing_source'] = df.marketing.context.source

In [None]:
# we can now show the columns where the marketing source is not null.
df[df.marketing_source.notnull()][['application', 'marketing', 'application_id', 'marketing_source']].head()

### Reference
* [bach.DataFrame.head](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/head/)
* [bach.Series.notnull](https://objectiv.io/docs/modeling/bach/api-reference/Series/notnull/)

## Sampling
One of the key features of Objectiv Bach is that it runs on your full dataset. There can, however, be situations where you want to experiment with your data, meaning you have to query the full dataset often, which can become slow and/or costly.

To limit this, it's possible to do operations on a sample of the full dataset. All operations can easily be applied to the full dataset again at any time.

Below we create a sample that randomly selects ~10% of all the rows in the data, using the [`get_sample()`](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/get_sample/) operation. A table containing the sampled data is written to the database, so the `table_name` must be provided when creating the sample.

In [None]:
# for BigQuery the table name should be 'YOUR_PROJECT.YOUR_WRITABLE_DATASET.YOUR_TABLE_NAME'
df_sample = df.get_sample(table_name='sample_data', sample_percentage=10, overwrite=True)

A new column is created in the sample.

In [None]:
df_sample['root_location_contexts'] = df_sample.location_stack.json[:1]
df_sample.sort_values('moment', ascending=False).head()

Using the [`.get_unsampled()`](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/get_unsampled/) operation, the operations that are done on the sample (the creation of the column), are applied to the entire data set:

In [None]:
df_unsampled = df_sample.get_unsampled()
df_unsampled.sort_values('moment', ascending=False).head()

The sample can also be used for grouping and aggregating. The example below counts all hits and the number of unique event types per application in the sample:

In [None]:
df_sample_grouped = df_sample.groupby(['application_id']).agg({'event_type':'nunique','session_hit_number':'count'})
df_sample_grouped.head()

As can be seen from the counts, unsampling applies the transformation to the entire data set:

In [None]:
df_unsampled_grouped = df_sample_grouped.get_unsampled()
df_unsampled_grouped.head()
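
The sample-then-unsample workflow can be pictured in plain Python as defining a transformation once and applying it to either a sample or the full data. This is a conceptual sketch only: the events are synthetic, `summarize` is a hypothetical helper, and it does not reflect how Bach implements `get_sample()` or `get_unsampled()` in the database.

```python
import random
from collections import defaultdict


def summarize(rows):
    """Count hits and unique event types per application_id."""
    hits = defaultdict(int)
    event_types = defaultdict(set)
    for row in rows:
        hits[row["application_id"]] += 1
        event_types[row["application_id"]].add(row["event_type"])
    return {
        app: {"event_type_nunique": len(event_types[app]), "hit_count": hits[app]}
        for app in hits
    }


# Synthetic events, made up for this illustration.
rng = random.Random(42)
events = [
    {
        "application_id": rng.choice(["website", "docs"]),
        "event_type": rng.choice(["PressEvent", "VisibleEvent", "ApplicationLoadedEvent"]),
    }
    for _ in range(1000)
]

# Take a 10% sample (exactly 10% here; a database sample is approximate).
sample = rng.sample(events, k=len(events) // 10)

# Develop the analysis on the sample, then apply the same steps to everything.
sample_summary = summarize(sample)
full_summary = summarize(events)

print(sample_summary)
print(full_summary)
```

The hit counts in the full summary are roughly ten times those in the sample summary, which is the same effect visible in the unsampled counts above.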

### Reference
* [bach.DataFrame.get_sample](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/get_sample/)
* [bach.DataFrame.get_unsampled](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/get_unsampled/)
* [bach.DataFrame.sort_values](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/sort_values/)
* [bach.DataFrame.groupby](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/groupby/)
* [bach.DataFrame.head](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/head/)

## Get the SQL for any analysis
The SQL for any analysis can be exported with one command, so you can use models in production directly to simplify data debugging & delivery to BI tools like Metabase, dbt, etc. See how you can [quickly create BI dashboards with this](https://objectiv.io/docs/home/try-the-demo#creating-bi-dashboards).

In [None]:
# show the underlying SQL for this dataframe - works for any dataframe/model in Objectiv
display_sql_as_markdown(df_unsampled_grouped)

## Where to go next
To dive further into working with the open taxonomy, [see the Bach API reference](https://objectiv.io/docs/modeling/bach/api-reference/).

You can also have a look at [the example notebook demonstrating the open model hub basics](https://objectiv.io/docs/modeling/example-notebooks/modelhub-basics/).