This is one of the Objectiv example notebooks. For more examples, visit the 
[example notebooks](https://objectiv.io/docs/modeling/example-notebooks/) section of our docs. The notebooks can run with the demo data set that comes with our [quickstart](https://objectiv.io/docs/home/try-the-demo/), but can also be run on your own collected data.

All example notebooks are also available in our [quickstart](https://objectiv.io/docs/home/try-the-demo/). With the quickstart you can spin up a fully functional Objectiv demo pipeline in five minutes. This also allows you to run these notebooks and experiment with them on a demo data set.

This example shows how Bach can be used for feature engineering. We'll go through describing the data, finding
outliers, transforming data and grouping and aggregating data so that a useful feature set is created that
can be used for machine learning. We have a separate example available that goes into the details of how a
data set prepared in Bach can be used for machine learning with sklearn [here](https://objectiv.io/docs/modeling/example-notebooks/machine-learning/).

## Getting started
If you are running this example on your own collected data, [see the instructions here](https://objectiv.io/docs/modeling/get-started-in-your-notebook/) on how to set up the database connection and get started in your favorite notebook tool.

### Import the required packages for this notebook
The open model hub package can be installed with `pip install objectiv-modelhub` (this installs Bach as well).  
If you are running this notebook from our quickstart, the model hub and Bach are already installed, so you don't have to install them separately.

In [None]:
from modelhub import ModelHub

First we instantiate the model hub, and then use it to get the Objectiv DataFrame object.

In [None]:
# instantiate the model hub
modelhub = ModelHub(time_aggregation='%Y-%m-%d')

In [None]:
# get the Bach DataFrame with Objectiv data
df = modelhub.get_objectiv_dataframe(start_date='2022-02-02')

### Describe all data

In [None]:
df.describe(include='all').head()

We start by showing the first couple of rows from the data set and describing the entire data set.

In [None]:
df.head()

The columns of interest are 'user_id', which is what we will aggregate to, and 'moment', which contains the
timestamp of each event. Global contexts (not present in this example) and the 'location_stack' contain most of the event-specific data. The global contexts that you want to use in the analysis need to be set when instantiating the model hub. See the [open taxonomy example](open-taxonomy-how-to.ipynb#Location-stack-&-global-contexts) for how to use the location stack and global contexts.

In [None]:
df.describe(include='all').head()

### Creating a feature set 
We'd like to create a feature set that describes the behaviour of users. We start by extracting
the root location from the location stack. This indicates which parts of our website users have visited. Using
`to_numpy()` shows the results as a numpy array.

In [None]:
df['root'] = df.location_stack.ls.get_from_context_with_type_series(type='RootLocationContext', key='id')

# the root series is unstacked later, and its values might contain dashes,
# which are not allowed in BigQuery column names, so let's replace them
df['root'] = df['root'].str.replace('-', '_')
df.root.unique().to_numpy()

This returns `['jobs', 'docs', 'home'...]` and so on: the sections of the objectiv.io website.
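
To make the extraction step more concrete, here is a minimal plain-Python sketch of what it does for a single event. It is not how Bach implements it (Bach runs this in the database). The helper `get_from_context_with_type`, the example stack and the `_type` and `id` field names are assumptions made for illustration.

```python
from typing import Optional

# Hypothetical location stack for one event: an ordered list of contexts.
# The "_type" and "id" field names are assumed for this illustration.
location_stack = [
    {"_type": "RootLocationContext", "id": "about-us"},
    {"_type": "NavigationContext", "id": "navbar-top"},
    {"_type": "LinkContext", "id": "docs-link"},
]


def get_from_context_with_type(stack: list, context_type: str, key: str) -> Optional[str]:
    """Return `key` from the first context of the given type, or None if absent."""
    for context in stack:
        if context.get("_type") == context_type:
            return context.get(key)
    return None


root = get_from_context_with_type(location_stack, "RootLocationContext", "id")

# Replace dashes, as the notebook does, so the value is safe as a column name.
root_as_column = root.replace("-", "_")
print(root, root_as_column)
```

Returning `None` when no context of that type exists mirrors the idea that such events end up with a missing root value, which is why the next step checks for missing values.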

### Check missing values

In [None]:
df.root.isnull().value_counts().head()
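
As a rough illustration of what this check computes, the sketch below tallies missing and non-missing values in plain Python. The `roots` list is made-up toy data that, unlike the demo data set, deliberately contains one missing value.

```python
from collections import Counter

# Hypothetical root values for five events; None marks a missing value.
roots = ["home", "docs", None, "jobs", "home"]

# isnull(): True where the value is missing.
# value_counts(): tally how often each outcome (True / False) occurs.
null_counts = Counter(value is None for value in roots)
print(null_counts)
```

If only the `False` key shows up in the tally, there are no missing values in the column.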

A quick check shows that there are no missing values to worry about. Now we want a data set with
interactions on our different sections, in particular presses. A press is an event type. We first want an
overview of the different event types that exist, and then select the one we are interested in.

In [None]:
df.event_type.unique().to_numpy()

In [None]:
df[(df.event_type=='PressEvent')].root.unique().to_numpy()

In [None]:
df[(df.event_type=='PressEvent')].describe(include='string').head()

### Creating the variables
We are interested in 'PressEvent'. In the next code block we select only press events, group them
by 'user_id' and 'root', and count the session_hit_number. After that the results are unstacked, resulting in
a table where each row represents a user (the index is 'user_id'), the columns are the different root
locations, and the values are the number of times a user clicked in that section.

In [None]:
features = df[(df.event_type=='PressEvent')].groupby(['user_id','root']).session_hit_number.count()

In [None]:
features_unstacked = features.unstack()

In [None]:
features_unstacked.materialize().describe().head()

In [None]:
features_unstacked.head()
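
For readers less familiar with the group-count-unstack pattern, here is a small plain-Python sketch of the same reshaping on made-up data. The `press_events` list is hypothetical, and the dictionaries only stand in for the Bach objects.

```python
from collections import Counter

# Hypothetical press events as (user_id, root) pairs; the values are made up.
press_events = [
    ("u1", "home"), ("u1", "home"), ("u1", "docs"),
    ("u2", "jobs"),
    ("u3", "docs"), ("u3", "docs"), ("u3", "home"),
]

# Group by (user_id, root) and count: one number per pair that occurs.
counts = Counter(press_events)

# Unstack: one row per user, one column per root location.
# Pairs that never occur have no count, so the cell stays empty (None).
roots = sorted({root for _, root in press_events})
users = sorted({user for user, _ in press_events})
unstacked = {
    user: {root: counts.get((user, root)) for root in roots}
    for user in users
}

for user, row in unstacked.items():
    print(user, row)
```

The empty cells appear because unstacking creates a cell for every user and root combination, including combinations that were never observed.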

### Fill empty values
Now we do have empty values, so we fill them with 0, as empty means that the user did not click in the
section.

In [None]:
features_unstacked = features.unstack(fill_value=0)

### Describe the data set
We use describe again to get an impression of our newly created per-user data set.

In [None]:
features_unstacked.materialize().describe().head()

Looking at the mean, some sections seem to be used a lot more than others. The maximum
number of clicks also differs considerably per root section. This information can be used to drop some of the
variables from our data set, or to apply scaling or outlier detection. We will plot histograms for the variables.

### Visualize the data

In [None]:
from matplotlib import pyplot as plt
import math

figure, axis = plt.subplots(math.ceil(len(features_unstacked.data_columns)/4), 4, figsize=(15,10))

for idx, name in enumerate(features_unstacked.data_columns):
    features_unstacked[[name]].plot.hist(bins=5, title=name, ax=axis.flat[idx])
plt.tight_layout()

The histograms show that indeed the higher values seem quite anomalous for most of the root locations. This
could be a reason to drop some of these observations or resort to scaling methods. For now we continue with
the data set as is.
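
Although the notebook continues without removing anything, one common way to flag such high values is Tukey's interquartile-range rule. The sketch below applies it in plain Python to made-up click counts. Both the data and the choice of this particular rule are assumptions for illustration, not part of the original analysis.

```python
import statistics

# Hypothetical press counts per user for one root location; values are made up.
clicks = [0, 1, 1, 2, 2, 3, 3, 4, 5, 48]

# First and third quartile, and the interquartile range (IQR) between them.
q1, _median, q3 = statistics.quantiles(clicks, n=4)
iqr = q3 - q1

# Tukey's rule: flag values more than 1.5 * IQR above the third quartile.
upper_fence = q3 + 1.5 * iqr
outliers = [value for value in clicks if value > upper_fence]
print(outliers)
```

Whether flagged observations should be dropped, capped or kept depends on the model that will consume the features.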

### Add time feature
Now we want to add a time feature to our data set: the average session length per user. We can use the
model hub for this. `fillna` is used to fill missing values with a duration of zero.

In [None]:
import datetime

features_unstacked['session_duration'] = modelhub.aggregate.session_duration(df, groupby='user_id')
features_unstacked['session_duration'] = features_unstacked['session_duration'].fillna(datetime.timedelta(0))

In [None]:
features_unstacked.session_duration.describe().head()
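
To show the idea behind this feature, the sketch below computes an average session duration per user in plain Python and fills in a zero duration where none is available. The `events` rows are made up, and the definition used here (last moment minus first moment per session) is an assumption. The model hub's actual definition may differ, for example in how it treats sessions with a single event.

```python
import datetime
from collections import defaultdict

# Hypothetical (user_id, session_id, moment) rows; the values are made up.
events = [
    ("u1", 1, datetime.datetime(2022, 2, 2, 10, 0, 0)),
    ("u1", 1, datetime.datetime(2022, 2, 2, 10, 4, 0)),
    ("u1", 2, datetime.datetime(2022, 2, 3, 9, 0, 0)),
    ("u1", 2, datetime.datetime(2022, 2, 3, 9, 2, 0)),
    ("u2", 3, datetime.datetime(2022, 2, 2, 12, 0, 0)),
    ("u2", 3, datetime.datetime(2022, 2, 2, 12, 10, 0)),
]

# Collect the moments of each session, keyed by (user, session).
moments = defaultdict(list)
for user, session, moment in events:
    moments[(user, session)].append(moment)

# Assumed definition: session duration = last moment minus first moment.
durations = defaultdict(list)
for (user, _session), stamps in moments.items():
    durations[user].append(max(stamps) - min(stamps))

# Average per user. "u3" has no sessions here, so it gets timedelta(0),
# which plays the role of fillna(datetime.timedelta(0)) in the notebook.
all_users = ["u1", "u2", "u3"]
avg_duration = {}
for user in all_users:
    user_durations = durations.get(user, [])
    if user_durations:
        total = sum(user_durations, datetime.timedelta(0))
        avg_duration[user] = total / len(user_durations)
    else:
        avg_duration[user] = datetime.timedelta(0)

print(avg_duration)
```

Filling with a zero duration keeps every user in the feature set, at the cost of treating "no duration available" the same as "a session of zero length".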

### Export to pandas for sklearn
Now that we have our data set, we can use it for machine learning, using for example sklearn. To do so
we call `to_pandas()` to get a pandas DataFrame that can be used in sklearn.

Here is our example on how to use Objectiv data and [sklearn](https://objectiv.io/docs/modeling/example-notebooks/machine-learning/).

In [None]:
pdf = features_unstacked.to_pandas()

In [None]:
pdf