This is one of the Objectiv example notebooks. For more examples visit the 
[example notebooks](https://objectiv.io/docs/modeling/example-notebooks/) section of our docs. The notebooks can run with the demo data set that comes with our [quickstart](https://objectiv.io/docs/home/quickstart-guide/), but can be used to run on your own collected data as well.

All example notebooks are also available in our [quickstart](https://objectiv.io/docs/home/quickstart-guide/). With the quickstart you can spin up a fully functional Objectiv demo pipeline in five minutes. This also allows you to run these notebooks and experiment with them on a demo data set.

# Basic product analytics

In this notebook, we briefly demonstrate how you can easily do basic product analytics on your data.

## Getting started

### Import the required packages for this notebook
The open model hub package can be installed with `pip install objectiv-modelhub` (this installs Bach as well).  
If you are running this notebook from our quickstart, the model hub and Bach are already installed, so you don't have to install them separately.

In [None]:
from datetime import datetime
from modelhub import ModelHub
from bach import display_sql_as_markdown

First we have to instantiate the model hub and the Objectiv DataFrame object.

In [None]:
# instantiate the model hub and set the default time aggregation to daily
modelhub = ModelHub(time_aggregation='%Y-%m-%d')

In [None]:
# get the Bach DataFrame with Objectiv data
df = modelhub.get_objectiv_dataframe(start_date='2022-02-02')

If you are running this example on your own collected data, set up the database connection like this and use it to replace the cell above:

In [None]:
# df = modelhub.get_objectiv_dataframe(db_url='postgresql://USER:PASSWORD@HOST:PORT/DATABASE',
#                                      start_date='2022-06-01',
#                                      end_date='2022-06-30',
#                                      table_name='data')

The `global_contexts` and `location_stack` columns contain most of the event-specific data. Both are JSON columns, and we can extract data from them based on the keys of the JSON objects using the `SeriesGlobalContexts` and `SeriesLocationStack` methods.
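To make the extraction concrete, here is a minimal plain-Python sketch of what looking up a key in a context of a given type amounts to. It does not use Bach: the helper `get_from_context_with_type` and the sample `global_contexts` value are hypothetical, and the sketch assumes the value is taken from the first context that matches. The real `get_from_context_with_type_series` method generates SQL that runs in the database instead.

```python
import json

# Hypothetical sample of one event's global_contexts value: a JSON array of context objects.
global_contexts = json.loads("""
[
  {"_type": "ApplicationContext", "id": "objectiv-website"},
  {"_type": "HttpContext", "id": "http_context", "referrer": "https://www.google.com/"},
  {"_type": "MarketingContext", "id": "utm", "source": "twitter", "medium": "social"}
]
""")


def get_from_context_with_type(contexts, context_type, key):
    """Return `key` from the first context whose `_type` equals `context_type`, else None."""
    for context in contexts:
        if context.get("_type") == context_type:
            return context.get(key)
    return None


referrer = get_from_context_with_type(global_contexts, "HttpContext", "referrer")
utm_source = get_from_context_with_type(global_contexts, "MarketingContext", "source")
# The sample MarketingContext has no 'term' key, so this lookup yields None.
utm_term = get_from_context_with_type(global_contexts, "MarketingContext", "term")
print(referrer, utm_source, utm_term)
```

Events without the requested context or key simply get an empty value, which is why later cells drop rows where `utm_source` is missing.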

In [None]:
# adding specific contexts to the data as columns
df['application'] = df.global_contexts.gc.application
df['feature_nice_name'] = df.location_stack.ls.nice_name
df['root_location'] = df.location_stack.ls.get_from_context_with_type_series(type='RootLocationContext', key='id')
df['referrer'] = df.global_contexts.gc.get_from_context_with_type_series(type='HttpContext', key='referrer')
df['utm_source'] = df.global_contexts.gc.get_from_context_with_type_series(type='MarketingContext', key='source')
df['utm_medium'] = df.global_contexts.gc.get_from_context_with_type_series(type='MarketingContext', key='medium')
df['utm_campaign'] = df.global_contexts.gc.get_from_context_with_type_series(type='MarketingContext', key='campaign')
df['utm_content'] = df.global_contexts.gc.get_from_context_with_type_series(type='MarketingContext', key='content')
df['utm_term'] = df.global_contexts.gc.get_from_context_with_type_series(type='MarketingContext', key='term')

# We'll do a lot of operations on this data. To make this easier for the database (especially BigQuery),
# we tell Bach to materialize the current DataFrame as a temporary table. This has no effect now, but any later
# query that builds on this DataFrame will consist of two queries: one that creates the temporary table, and one
# that queries that table and performs the operations on it.
df = df.materialize(materialization='temp_table')

In [None]:
# have a look at the data
df.sort_values('session_id', ascending=False).head()

In [None]:
# explore the data with describe
df.describe(include='all').head()

Now we will go through a selection of basic analytics metrics. We can use models from the model hub for this purpose, or use Bach to do data analysis directly on the data stored in the
SQL database using pandas-like syntax.

For each example, `head()`, `to_pandas()` or `to_numpy()` can be used to execute the generated SQL and get the results in
your notebook.

## Unique users
The `daily_users` example uses the `time_aggregation` that was set when the model hub was instantiated. In this case
`time_aggregation` was set to `'%Y-%m-%d'`, so the aggregation is daily. For `monthly_users`, the default `time_aggregation` is
overridden by passing a different `groupby`.
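As an illustration of how the format string determines the bucket, the following plain-Python sketch counts distinct users per formatted timestamp. The `events` list and the `unique_users` helper are hypothetical stand-ins and not the model hub implementation, which performs the equivalent grouping in SQL.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (user_id, moment) pairs standing in for rows of the Objectiv DataFrame.
events = [
    ("user-a", datetime(2022, 2, 2, 9, 0)),
    ("user-a", datetime(2022, 2, 2, 17, 30)),
    ("user-b", datetime(2022, 2, 2, 11, 15)),
    ("user-a", datetime(2022, 3, 5, 8, 45)),
    ("user-c", datetime(2022, 3, 9, 14, 0)),
]


def unique_users(events, time_aggregation):
    """Count distinct users per bucket, where the bucket is the formatted moment."""
    users_per_bucket = defaultdict(set)
    for user_id, moment in events:
        users_per_bucket[moment.strftime(time_aggregation)].add(user_id)
    return {bucket: len(users) for bucket, users in sorted(users_per_bucket.items())}


daily = unique_users(events, "%Y-%m-%d")   # one bucket per day
monthly = unique_users(events, "%Y-%m")    # one bucket per month
print(daily)
print(monthly)
```

Note that a user who is active on several days in a month counts once in the monthly bucket, so monthly numbers are not the sum of the daily ones.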

In [None]:
# model hub: unique users, monthly
monthly_users = modelhub.aggregate.unique_users(df, groupby=modelhub.time_agg(df, '%Y-%m'))
monthly_users.head()

In [None]:
# model hub: unique users, daily
daily_users = modelhub.aggregate.unique_users(df)
daily_users.sort_index(ascending=False).head(10)

In [None]:
# model hub: unique users, per main product section
users_root = modelhub.aggregate.unique_users(df, groupby=['application', 'root_location'])
users_root.sort_index(ascending=False).head(10)

## Retention

To measure how well we are doing at keeping the users with us after the first interaction, we can use a retention matrix.

To calculate the retention matrix, we need to distribute the users into mutually exclusive cohorts based on the `time_period` (it can be `daily`, `weekly`, `monthly`, or `yearly`) when they first interacted.


In the retention matrix:
- each row represents a cohort,
- each column represents a time range, where time is calculated with respect to the cohort start time,
- the values of the matrix elements are the number or percentage (depending on `percentage` parameter) of users in a given cohort that returned again in a given time range.

N.B. user activity is only tracked from the `start_date` that was specified when we loaded the data: `modelhub.get_objectiv_dataframe(start_date='2022-02-02')`. A user's first interaction is therefore their first interaction on or after that date.
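The construction described above can be sketched in plain Python for monthly cohorts. This is only an illustration under stated assumptions: the `events` data and the `retention_matrix` helper below are hypothetical, and the percentage is assumed to be relative to the number of users in the cohort's first period. The model hub computes its matrix in the database and may differ in details such as labels.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (user_id, moment) pairs.
events = [
    ("user-a", datetime(2022, 2, 3)),
    ("user-a", datetime(2022, 3, 10)),
    ("user-b", datetime(2022, 2, 20)),
    ("user-c", datetime(2022, 3, 1)),
    ("user-c", datetime(2022, 4, 2)),
    ("user-d", datetime(2022, 3, 15)),
]


def month_index(moment):
    """Months counted from year 0, so that differences are month offsets."""
    return moment.year * 12 + moment.month - 1


def retention_matrix(events, percentage=False):
    # A user's cohort is the month of their first event.
    first_month = {}
    for user_id, moment in events:
        month = month_index(moment)
        first_month[user_id] = min(first_month.get(user_id, month), month)

    # Distinct users seen in each (cohort, months since cohort start) cell.
    cells = defaultdict(set)
    for user_id, moment in events:
        cohort = first_month[user_id]
        cells[(cohort, month_index(moment) - cohort)].add(user_id)

    matrix = {}
    for (cohort, offset), users in sorted(cells.items()):
        label = f"{cohort // 12}-{cohort % 12 + 1:02d}"
        cohort_size = len(cells[(cohort, 0)])
        value = 100 * len(users) / cohort_size if percentage else len(users)
        matrix.setdefault(label, {})[offset] = value
    return matrix


counts = retention_matrix(events)
percentages = retention_matrix(events, percentage=True)
print(counts)
print(percentages)
```

Each outer key is a cohort (row) and each inner key is the number of months since the cohort started (column); offset 0 always holds the full cohort.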

In [None]:
retention_matrix = modelhub.aggregate.retention_matrix(df, time_period='monthly', percentage=True, display=True)
retention_matrix.head()

### Drilling down cohorts

From our retention matrix, we can see that the second cohort shows a drop in retained users in the next month: just 3.6% came back. We can zoom in on the different cohorts to see what differs between them.

In [None]:
# calculate the first cohort
cohorts = df[['user_id', 'moment']].groupby('user_id')['moment'].min().reset_index()
cohorts = cohorts.rename(columns={'moment': 'first_cohort'})

# add first cohort of the users to our DataFrame
df = df.merge(cohorts, on='user_id', how='left')

In [None]:
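# filter data where users belong to cohort #0 (first interaction in February 2022)
# use >= for the lower bound so a first interaction at exactly midnight is included
cohort0_filter = (df['first_cohort'] >= datetime(2022, 2, 1)) & (df['first_cohort'] < datetime(2022, 3, 1))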
df[cohort0_filter]['event_type'].value_counts().head()

In [None]:
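# filter data where users belong to cohort #1 (first interaction in March 2022, the problematic one)
# use >= for the lower bound so a first interaction at exactly midnight is included
cohort1_filter = (df['first_cohort'] >= datetime(2022, 3, 1)) & (df['first_cohort'] < datetime(2022, 4, 1))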
df[cohort1_filter]['event_type'].value_counts().head()

It is interesting to see that we have relatively more `VisibleEvent` events in the first cohort than in the second, 'problematic' one.

This is just a simple example to demonstrate the differences you can find between cohorts. You can run similar comparisons with, for example, [top product features](https://objectiv.io/docs/modeling/open-model-hub/models/aggregation/top_product_features/), or develop a more in-depth analysis depending on the product's needs.

## User time spent
We can also calculate the average session duration for time intervals. `duration_root_month` gives the
average time spent per root location per month.
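As a rough illustration of the metric, the sketch below treats a session's duration as the time between its first and last hit and averages those durations per time bucket. The `hits` data and the `average_session_duration` helper are hypothetical. The sketch also assumes that a bounce is a session with zero duration and that a session is assigned to the bucket of its first hit; the model hub's `session_duration` may define these details differently.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical (session_id, moment) pairs.
hits = [
    (1, datetime(2022, 2, 2, 9, 0, 0)),
    (1, datetime(2022, 2, 2, 9, 4, 0)),
    (2, datetime(2022, 2, 2, 10, 0, 0)),  # single-hit session, treated as a bounce
    (3, datetime(2022, 3, 1, 12, 0, 0)),
    (3, datetime(2022, 3, 1, 12, 10, 0)),
]


def average_session_duration(hits, time_aggregation, exclude_bounces=True):
    # Track the first and last moment of every session.
    bounds = {}
    for session_id, moment in hits:
        first, last = bounds.get(session_id, (moment, moment))
        bounds[session_id] = (min(first, moment), max(last, moment))

    # Group session durations by the formatted moment of the session's first hit.
    durations = defaultdict(list)
    for first, last in bounds.values():
        duration = last - first
        if exclude_bounces and duration == timedelta(0):
            continue
        durations[first.strftime(time_aggregation)].append(duration)

    return {bucket: sum(values, timedelta(0)) / len(values)
            for bucket, values in sorted(durations.items())}


monthly = average_session_duration(hits, "%Y-%m")
with_bounces = average_session_duration(hits, "%Y-%m", exclude_bounces=False)
print(monthly)
print(with_bounces)
```

Including bounces pulls the average down, because every single-hit session contributes a duration of zero.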

In [None]:
# model hub: duration, monthly average
duration_monthly = modelhub.aggregate.session_duration(df, groupby=modelhub.time_agg(df, '%Y-%m'))
duration_monthly.sort_index(ascending=False).head()

In [None]:
# model hub: duration, daily average
duration_daily = modelhub.aggregate.session_duration(df)
duration_daily.sort_index(ascending=False).head()

In [None]:
# model hub: duration, monthly average per root location
duration_root_month = modelhub.aggregate.session_duration(df, groupby=['application', 'root_location', modelhub.time_agg(df, '%Y-%m')]).sort_index()
duration_root_month.head(10)

In [None]:
# how is this time spent distributed?
session_duration = modelhub.aggregate.session_duration(df, groupby='session_id', exclude_bounces=False)
# Materialize first: the series created above is itself the result of an aggregation,
# and aggregated data cannot be aggregated again within the same query.
session_duration.materialize().quantile(q=[0.25, 0.50, 0.75]).head()

## Top used product features

Let's get the features of the product that are used most by our users. For that we can call the `top_product_features` function from the model hub.

In [None]:
# by default we select only user actions
top_product_features = modelhub.aggregate.top_product_features(df)
top_product_features.head()

## Top used product areas
First we use the model hub to get the unique users per application, root location, feature, and event type.
From this prepared dataset, we show the users for the home page first.

In [None]:
# select only user actions, so stack_event_types must be a superset of ['InteractiveEvent']
interactive_events = df[df.stack_event_types.json.array_contains('InteractiveEvent')]

top_interactions = modelhub.agg.unique_users(interactive_events, groupby=['application','root_location','feature_nice_name', 'event_type'])
top_interactions = top_interactions.reset_index()

In [None]:
home_users = top_interactions[(top_interactions.application == 'objectiv-website') &
                              (top_interactions.root_location == 'home')]
home_users.sort_values('unique_users', ascending=False).head()

From the same `top_interactions` object, we can select the top interactions for the 'docs' page.

In [None]:
docs_users = top_interactions[top_interactions.application == 'objectiv-docs']
docs_users.sort_values('unique_users', ascending=False).head()

## User origin

In [None]:
# users by referrer
modelhub.agg.unique_users(df, groupby='referrer').sort_values(ascending=False).head()

## Marketing
Calculate the number of users per campaign.

In [None]:
# users by marketing campaign
campaign_users = modelhub.agg.unique_users(df, groupby=['utm_source', 'utm_medium', 'utm_campaign', 'utm_content', 'utm_term'])
campaign_users = campaign_users.reset_index().dropna(axis=0, how='any', subset='utm_source')

campaign_users.sort_values('utm_source', ascending=True).head()

Look at the first used product feature by campaign, using the previously created `interactive_events` to focus only on user
interactions.

In [None]:
# first used product feature per campaign source & term
users_feature_campaign = modelhub.agg.unique_users(interactive_events, groupby=['utm_source', 'utm_term', 'feature_nice_name', 'event_type'])
users_feature_campaign = users_feature_campaign.reset_index().dropna(axis=0, how='any', subset='utm_source')

users_feature_campaign.sort_values(['utm_source', 'utm_term', 'unique_users'], ascending=[True, True, False]).head()

## Conversions
First we define a conversion event in the Objectiv DataFrame.
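The next cell selects location stacks by slicing them with a dictionary. As a plain-Python sketch of the idea, and assuming the slice starts at the first context that matches all given key/value pairs and is empty when nothing matches, the selection could look like this. The `slice_from_first_match` helper and the sample `location_stack` are hypothetical; Bach performs the equivalent operation in SQL.

```python
def slice_from_first_match(stack, match):
    """Return the part of `stack` that starts at the first context containing all of `match`."""
    for position, context in enumerate(stack):
        if all(context.get(key) == value for key, value in match.items()):
            return stack[position:]
    return []


# Hypothetical location stack of a press on a GitHub link in the top navigation bar.
location_stack = [
    {"_type": "RootLocationContext", "id": "home"},
    {"_type": "NavigationContext", "id": "navbar-top"},
    {"_type": "LinkContext", "id": "github"},
]

github_press = slice_from_first_match(location_stack, {"id": "github", "_type": "LinkContext"})
no_match = slice_from_first_match(location_stack, {"id": "objectiv-on-github", "_type": "LinkContext"})
print(github_press)
print(no_match)
```

A non-empty result marks the event's location as one that leads to GitHub, which is the condition the conversion definition builds on.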

In [None]:
# create a column that extracts all location stacks that lead to our github
df['github_press'] = df.location_stack.json[{'id': 'objectiv-on-github', '_type': 'LinkContext'}:]
df.loc[df.location_stack.json[{'id': 'github', '_type': 'LinkContext'}:]!=[],'github_press'] = df.location_stack

# define which events to use as conversion events
modelhub.add_conversion_event(location_stack=df.github_press,
                              event_type='PressEvent',
                              name='github_press')

This can be used by several models from the model hub using the defined name ('github_press'). First we calculate
the number of unique converted users.

In [None]:
# model hub: calculate conversions
df['is_conversion_event'] = modelhub.map.is_conversion_event(df, 'github_press')
conversions = modelhub.aggregate.unique_users(df[df.is_conversion_event])
conversions.to_frame().sort_index(ascending=False).head(10)

We use the earlier created `daily_users` to calculate the daily conversion rate.

In [None]:
# calculate conversion rate
conversion_rate = conversions / daily_users
conversion_rate.sort_index(ascending=False).head(10)

From where do users convert most?

In [None]:
conversion_locations = modelhub.agg.unique_users(df[df.is_conversion_event], 
                                                 groupby=['application', 'feature_nice_name', 'event_type'])

# calling .to_frame() for nicer formatting
conversion_locations.sort_values(ascending=False).to_frame().head()

We can calculate what users did _before_ converting.

In [None]:
top_features_before_conversion = modelhub.agg.top_product_features_before_conversion(df, name='github_press')
top_features_before_conversion.head()

Finally, we want to know how much time converted users spent on our site before they converted.
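The labelling logic can be sketched in plain Python: keep only sessions that contain a conversion, keep only the hits at which the running number of conversions in the session is still zero, and measure the time between the first and last of those hits. The `hits` data and the `time_before_conversion` helper are hypothetical, and the sketch assumes the running count includes the current hit, so the conversion hit itself is not part of the measured span.

```python
from datetime import datetime, timedelta

# Hypothetical hits: (session_id, moment, is_conversion_event).
hits = [
    (1, datetime(2022, 2, 2, 9, 0), False),
    (1, datetime(2022, 2, 2, 9, 3), False),
    (1, datetime(2022, 2, 2, 9, 5), True),
    (1, datetime(2022, 2, 2, 9, 9), False),
    (2, datetime(2022, 2, 2, 10, 0), False),
    (2, datetime(2022, 2, 2, 10, 7), False),
]


def time_before_conversion(hits):
    """Per converting session: time between the first and last hit preceding the first conversion."""
    sessions = {}
    # Process hits in chronological order within each session.
    for session_id, moment, is_conversion in sorted(hits, key=lambda hit: (hit[0], hit[1])):
        state = sessions.setdefault(session_id, {"conversions": 0, "before": []})
        state["conversions"] += is_conversion  # running conversion count in this session
        if state["conversions"] == 0:
            state["before"].append(moment)
    return {
        session_id: state["before"][-1] - state["before"][0]
        for session_id, state in sessions.items()
        if state["conversions"] >= 1 and state["before"]
    }


before = time_before_conversion(hits)
print(before)
```

Session 2 never converts and is left out, while hits after the conversion in session 1 do not extend the measured time.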

In [None]:
# label sessions with a conversion
df['converted_users'] = modelhub.map.conversions_counter(df, name='github_press') >= 1

# label hits where at that point in time, there are 0 conversions in the session
df['zero_conversions_at_moment'] = modelhub.map.conversions_in_time(df, 'github_press') == 0

# filter on above created labels
converted_users = df[(df.converted_users & df.zero_conversions_at_moment)]

# how much time do users spend before they convert?
modelhub.aggregate.session_duration(converted_users, groupby=None).to_frame().head()

## Get the SQL for any analysis

In [None]:
# just one analysis as an example, this works for anything you do with Objectiv Bach
display_sql_as_markdown(conversions)