This is one of the Objectiv example notebooks. For more examples visit the 
[example notebooks](https://objectiv.io/docs/modeling/example-notebooks/) section of our docs. The notebooks can run with the demo data set that comes with our [quickstart](https://objectiv.io/docs/home/quickstart-guide/), but can also be run on your own collected data.

All example notebooks are also available in our [quickstart](https://objectiv.io/docs/home/quickstart-guide/). With the quickstart you can spin up a fully functional Objectiv demo pipeline in five minutes. This also allows you to run these notebooks and experiment with them on a demo data set.

# Intro
This notebook demonstrates what you can do with the Objectiv Bach modeling library and a dataset that was validated against the open analytics taxonomy. The example uses real data that's stored in an SQL database and was collected with the Objectiv Tracker that's instrumented on objectiv.io.

There is another notebook in the same folder that focuses on the open model hub, [model-hub-demo-notebook.ipynb](model-hub-demo-notebook.ipynb). It demonstrates how you can use and chain pre-built models with Bach to quickly answer common product analytics questions.

The Objectiv Bach API is heavily inspired by the pandas API. We believe this provides a great, generic interface to handle large amounts of data in a Python environment while supporting multiple data stores.

For an intro to the pandas API see: https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html  
The full Objectiv Bach API reference is available here: https://objectiv.io/docs/modeling/bach/api-reference/

# Contents  
* [Instantiate the object](#Instantiate-the-object)
  * [The data](#The-data)
  * [The Open Taxonomy](#The-Open-Taxonomy)
    * [event_type](#event_type)
    * [location_stack & global_contexts](#location_stack-&-global_contexts)
    * [location_stack](#location_stack)
    * [global_contexts](#global_contexts)
* [Sampling](#Sampling)

### Import the required packages for this notebook
The open model hub package can be installed with `pip install objectiv-modelhub` (this installs Bach as well).  
If you are running this notebook from our quickstart, the model hub and Bach are already installed, so you don't have to install them separately.

In [None]:
from modelhub import ModelHub

# Instantiate the object
As a first step, the model hub object is instantiated. The model hub contains a collection of data models and convenience functions that can be used with Objectiv data. `get_objectiv_dataframe()` creates a Bach DataFrame that already has all columns and data types set correctly, so it can always be used with model hub models.

In [None]:
# instantiate the model hub
modelhub = ModelHub()

In [None]:
# get the Bach DataFrame with Objectiv data
df = modelhub.get_objectiv_dataframe(start_date='2021-11-16')

If you are running this example on your own collected data, set up the database connection like this and replace the cell above:

In [None]:
# df = modelhub.get_objectiv_dataframe(db_url='postgresql://USER:PASSWORD@HOST:PORT/DATABASE',
#                                      start_date='2022-06-01',
#                                      end_date='2022-06-30',
#                                      table_name='data')
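
A connection URL like the one above can be assembled from separate settings instead of being typed out by hand. The sketch below is one possible way to do that, not part of the model hub: the helper `build_db_url` and the environment variable names are hypothetical, and the defaults are the same placeholders used in the cell above. Credentials are percent-encoded so that characters such as `@` or `/` in a password do not break the URL.

```python
import os
from urllib.parse import quote


def build_db_url(user, password, host, port, database):
    """Assemble a PostgreSQL URL, percent-encoding the credentials."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical environment variable names; the defaults are placeholders only.
db_url = build_db_url(
    user=os.environ.get("OBJECTIV_DB_USER", "USER"),
    password=os.environ.get("OBJECTIV_DB_PASSWORD", "PASSWORD"),
    host=os.environ.get("OBJECTIV_DB_HOST", "HOST"),
    port=os.environ.get("OBJECTIV_DB_PORT", "PORT"),
    database=os.environ.get("OBJECTIV_DB_NAME", "DATABASE"),
)
print(db_url)
```

The resulting string can then be passed as the `db_url` argument of `get_objectiv_dataframe()`.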

The data for the DataFrame stays in the database: the database is only queried when data is loaded into the Python environment. The methods that query the database are: 
* [`head()`](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/head/)
* [`to_pandas()`](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/to_pandas/)
* [`get_sample()`](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/get_sample/)
* [`to_numpy()`](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/to_numpy/)
* The property accessors [`Series.array`](https://objectiv.io/docs/modeling/bach/api-reference/Series/array/), [`Series.value`](https://objectiv.io/docs/modeling/bach/api-reference/Series/value/)

For the demo purposes of this notebook, these methods are called often to show the results of our operations. To limit the number of queries executed on the full data set, it is recommended to use these methods less often or [to sample the data first](#Sampling).
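
The deferred querying described above can be illustrated with a toy class. This is only a sketch of the general idea and not how Bach is implemented: `LazyFrame`, its `filter` method, and the `queries_run` counter are hypothetical names, and a Python list plays the role of the database table.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (not Bach's implementation)."""

    queries_run = 0  # counts how often the "database" was actually touched

    def __init__(self, rows, steps=()):
        self._rows = rows            # plays the role of the database table
        self._steps = tuple(steps)   # recorded operations, not yet executed

    def filter(self, predicate):
        # Recording a step is cheap: no rows are read here.
        return LazyFrame(self._rows, self._steps + (predicate,))

    def head(self, n=5):
        # Only now are the recorded steps executed against the data.
        LazyFrame.queries_run += 1
        matching = [row for row in self._rows
                    if all(step(row) for step in self._steps)]
        return matching[:n]


events = [{"event_type": "PressEvent", "hit": i} for i in range(3)]
events.append({"event_type": "VisibleEvent", "hit": 3})

frame = LazyFrame(events)
presses = frame.filter(lambda row: row["event_type"] == "PressEvent")
print(LazyFrame.queries_run)  # still 0: nothing has been queried yet

preview = presses.head(2)
print(preview, LazyFrame.queries_run)
```

Every call to `head()` in this toy runs the recorded steps again, which mirrors why calling the querying methods often on a full data set can be expensive.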

## The data
The DataFrame consists of the following index and columns:

In [None]:
df.index_dtypes

The index contains a unique identifier for every hit.

In [None]:
df.dtypes

* `day`: the day of the session as a date.
* `moment`: the exact moment of the event.
* `user_id`: the unique identifier of the user based on the cookie.
* `global_contexts`: a json-like data column that stores additional information on the event that is logged. This includes data like device data, application data, and cookie information. [See below](#global_contexts) for a more detailed explanation. 
* `location_stack`: a json-like data column that stores information on the exact location where the event is triggered in the product's UI. [See below](#location_stack) for a more detailed explanation.
* `event_type`: the type of event that is logged.
* `stack_event_types`: the parents of the event_type.
* `session_id`: a unique incremented integer id for each session. Starts at 1 for the selected data in the DataFrame.
* `session_hit_number`: an incremented integer id for each hit in the session, ordered by moment.

A preview of the data. We show the latest PressEvents.

In [None]:
df[df.event_type == 'PressEvent'].sort_values('moment', ascending=False).head()

## The Open Taxonomy
Data in a DataFrame created with `get_objectiv_dataframe()` follows the principles of the [open analytics taxonomy](https://objectiv.io/docs/taxonomy/core-concepts/) and is stored as such. Therefore it adheres to the three principles of how events are structured.
* **event_type**: describes the kind of interactive or non-interactive event.
* **location_stack**: describes where an event originated from in the visual UI.
* **global_contexts**: general information about an event.

The following section will go through these concepts one by one.

### event_type

The event type describes what kind of event is triggered. The goal of the open taxonomy is to label all interactive and non-interactive events in a standardized way. Together with the location stack, the event_type 'defines' what happened with or on the product.

In [None]:
df[df.day == '2022-01-10'].event_type.head()

### location_stack & global_contexts
The location stack and global contexts are stored as json type data. Within the DataFrame, it is easy to access data in json data based on position or content.

**Slicing the json data**  
With the `.json[]` syntax you can slice the array using integers. Instead of integers, dictionaries can also be passed to 'query' the json array. If the passed dictionary matches a context object in the stack, all objects of the stack starting (or ending, depending on the slice) at that object will be returned.

**An example**  
Consider a json array that looks like this (this is a real example of a location stack):
```json
[{"id": "docs", "_type": "RootLocationContext"},
 {"id": "docs-sidebar", "_type": "NavigationContext"},
 {"id": "API Reference", "_type": "ExpandableContext"},
 {"id": "DataFrame", "_type": "ExpandableContext"},
 {"id": "Overview", "_type": "LinkContext"}]
```
**Regular slicing**
```python
df.location_stack.json[2:4]
```
For the example array it would return:
```json
[{"id": "API Reference", "_type": "ExpandableContext"},
 {"id": "DataFrame", "_type": "ExpandableContext"}]
```
**Slicing by querying**

We want to return only the part of the array starting at this object:
```javascript
{"id": "docs-sidebar", "_type": "NavigationContext"}
```
The syntax for selecting like this is: 
```python
df.location_stack.json[{"id": "docs-sidebar", "_type": "NavigationContext"}:]
```
For the example array it would return:
```json
[{"id": "docs-sidebar", "_type": "NavigationContext"},
 {"id": "API Reference", "_type": "ExpandableContext"},
 {"id": "DataFrame", "_type": "ExpandableContext"},
 {"id": "Overview", "_type": "LinkContext"}]
```
If a json array does not contain the object, `None` is returned. More info in the API reference: https://objectiv.io/docs/modeling/bach/api-reference/Series/Json/
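
The slicing behavior described above can be mimicked on a plain Python list, which may help when reasoning about what a slice will return. This is an illustration only, not Bach's implementation: `slice_from` is a hypothetical helper, and the rule that an object matches when it contains every key/value pair of the query is an assumption.

```python
def slice_from(stack, query):
    """Return the part of `stack` starting at the first object matching `query`.

    Assumption: an object matches when it contains every key/value pair
    of `query`. Returns None when no object matches.
    """
    for position, context in enumerate(stack):
        if all(key in context and context[key] == value
               for key, value in query.items()):
            return stack[position:]
    return None


location_stack = [
    {"id": "docs", "_type": "RootLocationContext"},
    {"id": "docs-sidebar", "_type": "NavigationContext"},
    {"id": "API Reference", "_type": "ExpandableContext"},
    {"id": "DataFrame", "_type": "ExpandableContext"},
    {"id": "Overview", "_type": "LinkContext"},
]

# Regular slicing by position, like `.json[2:4]`
print(location_stack[2:4])

# Slicing by querying, like `.json[{...}:]`
print(slice_from(location_stack,
                 {"id": "docs-sidebar", "_type": "NavigationContext"}))

# No matching object in the array
print(slice_from(location_stack, {"id": "not-in-the-stack"}))
```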

### location_stack
The `location_stack` column in the DataFrame stores the information on the exact location where the event is triggered in the product. The example used above is the location stack of the link to the DataFrame API reference in the menu on our docs page.

Because of the specific way the location information is labeled, validated, and stored using the Open Taxonomy, it can be used to slice and group your product's features in an efficient and easy way. The column is set as an `objectiv_location_stack` type, so location-stack-specific methods are available through the `.ls` accessor on the column. These are:
* The property accessors [`.ls.navigation_features`](https://objectiv.io/docs/modeling/open-model-hub/api-reference/SeriesLocationStack/ls/), [`.ls.feature_stack`](https://objectiv.io/docs/modeling/open-model-hub/api-reference/SeriesLocationStack/ls/), [`.ls.nice_name`](https://objectiv.io/docs/modeling/open-model-hub/api-reference/SeriesLocationStack/ls/)
* all [methods](https://objectiv.io/docs/modeling/bach/api-reference/Series/Json/) for the json type can also be accessed using `.ls`

For example,
```python
df.location_stack.ls.nice_name
```
returns *'Link: Overview located at Root Location: docs => Navigation: docs-sidebar => Expandable: API Reference => Expandable: DataFrame'* for the location stack mentioned above.
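
To make the structure of such a name concrete, here is a plain-Python sketch that reproduces the string above for the example location stack. The formatting rule (last item first, `Context` stripped from the type, CamelCase split into words) is inferred from this single example and is an assumption; the function below is hypothetical and is not the model hub's own code.

```python
import re


def nice_name(stack):
    """Build a readable name for a non-empty location stack (list of dicts)."""

    def label(context):
        type_name = context["_type"]
        if type_name.endswith("Context"):
            type_name = type_name[: -len("Context")]
        # Insert a space before each inner capital: RootLocation -> Root Location
        words = re.sub(r"(?<!^)(?=[A-Z])", " ", type_name)
        return f"{words}: {context['id']}"

    *parents, last = stack
    name = label(last)
    if parents:
        name += " located at " + " => ".join(label(c) for c in parents)
    return name


location_stack = [
    {"id": "docs", "_type": "RootLocationContext"},
    {"id": "docs-sidebar", "_type": "NavigationContext"},
    {"id": "API Reference", "_type": "ExpandableContext"},
    {"id": "DataFrame", "_type": "ExpandableContext"},
    {"id": "Overview", "_type": "LinkContext"},
]

print(nice_name(location_stack))
```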

The full reference of the location stack is [here](https://objectiv.io/docs/taxonomy/location-contexts/). An example location stack for a PressEvent is queried below:

In [None]:
df[df.event_type == 'PressEvent'].location_stack.head(1)[0]

### global_contexts
The `global_contexts` column in the DataFrame contains all information that is relevant to the logged event. As it is set as an `objectiv_global_context` type, specific methods are available through the `.gc` accessor on the column. These are:
* [`.gc.get_from_context_with_type_series(type, key)`](https://objectiv.io/docs/modeling/open-model-hub/api-reference/SeriesGlobalContexts/obj/)
* The property accessors [`.gc.cookie_id`](https://objectiv.io/docs/modeling/open-model-hub/api-reference/SeriesGlobalContexts/gc/), [`.gc.user_agent`](https://objectiv.io/docs/modeling/open-model-hub/api-reference/SeriesGlobalContexts/gc/), [`.gc.application`](https://objectiv.io/docs/modeling/open-model-hub/api-reference/SeriesGlobalContexts/gc/)
* all [methods](https://objectiv.io/docs/modeling/bach/api-reference/Series/Json/) for the json type can also be accessed using `.gc`
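
As a rough mental model for a lookup by type and key, the sketch below does the same thing on a plain Python list. It assumes that the lookup returns the value of `key` from the first context whose `_type` equals the given type; that reading, the helper name, and the sample data (including its field names and values) are assumptions made for illustration, not taken from the model hub.

```python
def get_from_context_with_type(contexts, type_name, key):
    """Return `key` from the first context whose `_type` equals `type_name`.

    Returns None when no context of that type is present.
    """
    for context in contexts:
        if context.get("_type") == type_name:
            return context.get(key)
    return None


# Made-up global contexts, for illustration only
global_contexts = [
    {"id": "objectiv-website", "_type": "ApplicationContext"},
    {"id": "http_context", "_type": "HttpContext", "user_agent": "Mozilla/5.0"},
]

print(get_from_context_with_type(global_contexts, "ApplicationContext", "id"))
print(get_from_context_with_type(global_contexts, "HttpContext", "user_agent"))
print(get_from_context_with_type(global_contexts, "CookieIdContext", "cookie_id"))
```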

The full reference of global contexts is [here](https://objectiv.io/docs/taxonomy/global-contexts/). An example is queried below:

In [None]:
df.global_contexts.head(1)[0]

# Sampling
One of the key features of Objectiv Bach is that it runs on your full data set. There can, however, be situations where you want to experiment with your data, which means querying the full data set often. This can become slow and/or costly. 

To limit these costs it is possible to do operations on a sample of the full data set. All operations can easily be applied at any time to the full data set if that is desired.
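
The workflow of developing on a sample and then applying the same operations to all rows can be sketched in plain Python. This only illustrates the idea and is not how Bach implements sampling: the data is generated, the 10% row probability is an arbitrary choice for the example, and `add_columns` is a hypothetical stand-in for whatever operations you develop.

```python
import random


def add_columns(row):
    """The operations developed on the sample: derive two new fields."""
    enriched = dict(row)
    enriched["root_location"] = row["location_stack"][:1]
    enriched["is_press"] = row["event_type"] == "PressEvent"
    return enriched


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Generated stand-in for the full data set
full_data = [
    {"event_type": rng.choice(["PressEvent", "VisibleEvent"]),
     "location_stack": [{"id": "docs", "_type": "RootLocationContext"},
                        {"id": f"link-{i}", "_type": "LinkContext"}]}
    for i in range(1000)
]

# Keep each row with a probability of 10%
sample = [row for row in full_data if rng.random() < 0.10]

# Experiment cheaply on the sample first ...
sample_result = [add_columns(row) for row in sample]

# ... then apply exactly the same operations to the full data set
full_result = [add_columns(row) for row in full_data]

print(len(sample_result), len(full_result))
```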

Below we create a sample that randomly selects ~10% of all the rows in the data. A table containing the sampled data is written to the database, so a `table_name` must be provided when creating the sample.

In [None]:
# table_name = 'sample_data'
table_name = 'objectiv-production.writable_dataset.sample_data'

df_sample = df.get_sample(table_name=table_name, sample_percentage=10, overwrite=True)

Two new columns are created in the sample.

In [None]:
df_sample['root_location_contexts'] = df_sample.location_stack.json[:1]
df_sample['application'] = df_sample.global_contexts.gc.application
df_sample.sort_values('moment', ascending=False).head()

Using `.get_unsampled()`, the operations that were done on the sample (the creation of the two columns) are applied to the entire data set:

In [None]:
df_unsampled = df_sample.get_unsampled()
df_unsampled.sort_values('moment', ascending=False).head()

The sample can also be used for grouping and aggregating. The example below counts all hits and the number of unique event types per application in the sample:

In [None]:
df_sample_grouped = df_sample.groupby(['application']).agg({'event_type':'nunique','session_hit_number':'count'})
df_sample_grouped.head()

As can be seen from the counts, unsampling applies the transformation to the entire data set:

In [None]:
df_unsampled_grouped = df_sample_grouped.get_unsampled()
df_unsampled_grouped.head()

This concludes this demo.

We’ve demonstrated a handful of the operations that Bach supports and hope you’ve gotten a taste of what Bach can do for your modeling workflow. 

The full Objectiv Bach API reference is available here: https://objectiv.io/docs/modeling/bach/api-reference/

There is another example that focuses on using the [open model hub](https://objectiv.io/docs/modeling/example-notebooks/modelhub-basics/), demonstrating
how you can use the model hub and Bach to quickly answer common product analytics questions.