# Objectiv example notebook

This demo notebook enables you to play with Bach, our modeling library, to get an idea of what it can do. A live web version is also available [here](https://notebook.objectiv.io/lab?path=product_analytics.ipynb).

A few notes about this example notebook:
* It uses a real dataset from objectiv.io, collected with an unaltered version of Objectiv’s [tracker](https://objectiv.io/docs/tracking/). No cleaning or transformation* has been applied to the data. Objectiv’s tracker uses the [open taxonomy for analytics](https://objectiv.io/docs/taxonomy/) to collect clean data that’s ready to model on.
* You can also generate your own events and see them in this notebook. Check out our [Quickstart Guide](https://www.objectiv.io/docs/quickstart-guide) for instructions.
* It is connected to a PostgreSQL database and runs directly on the full dataset. You can use Pandas-like dataframe operations, which Bach translates to SQL under the hood.
* This notebook demonstrates only a selection of the operations that Bach supports. Check out the [docs](https://objectiv.io/docs/modeling/reference#api-reference) for the full rundown.
* You can also use this notebook for your own website or app once you've instrumented it with Objectiv's tracker.

For any questions, please join our [Slack channel](https://join.slack.com/t/objectiv-io/shared_invite/zt-u6xma89w-DLDvOB7pQer5QUs5B_~5pg).

<sub>*for privacy reasons, IPs have been removed and timeframes have been cut from the initial dataset.</sub>

In [None]:
import datetime
import plotly
import plotly.graph_objects as go
import plotly.figure_factory as ff
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from matplotlib.ticker import FuncFormatter
import sqlalchemy
import os

from jupyter_dash import JupyterDash as Dash

# import Objectiv Bach
from bach import DataFrame
from bach_open_taxonomy import FeatureFrame, basic_feature_model
from bach_open_taxonomy.sankey_dash import get_app

## Connect to full dataset in PostgreSQL

In [None]:
# connect to full postgresql dataset, add database and credentials here
dsn = os.environ.get('DSN', 'postgresql://objectiv:@localhost:5432/objectiv')
engine = sqlalchemy.create_engine(dsn, pool_size=1, max_overflow=0)

In [None]:
# create a Bach dataframe based on the full dataset
# note that operations on this dataframe do not query the database. The database is only queried
# when data is output to the Python environment (i.e. when using .head() or .to_pandas()).
basic_features = basic_feature_model()
df = DataFrame.from_model(engine=engine, model=basic_features, index=['event_id'])

## Set sample / unsample

In [None]:
# if desired, sample the data to develop models. For demo purposes we skip the sampling and work on the full set.
# creating the sample queries all underlying data for df once.

# df = df.get_sample(table_name='basic_features_sample', sample_percentage=10, overwrite=True)

In [None]:
# it is possible to apply all data manipulations on the full data set at any time.
# to unsample the data and run all models below on full dataset, use:

# df = df.get_unsampled()

## Add global contexts & location stack

In [None]:
# add the global contexts and location stack as custom dtype so we can use them in modeling
# global_contexts and location_stack are json type data columns. Setting custom dtypes extends the functionality
# for easy access to the contents of these columns.
df['global_contexts'] = df.global_contexts.astype('objectiv_global_context')
df['location_stack'] = df.location_stack.astype('objectiv_location_stack')

# functions specific for columns of the type 'objectiv_global_context' can be accessed using the `gc` name space.
# for 'objectiv_location_stack' type columns this is `ls`

# add the event location from the location_stack as new column to the df, using ls function:
df['event_location'] = df.location_stack.ls.nice_name

## Set the user application(s)

In [None]:
# add a new column to df with the user application from the global contexts, using gc function
df['user_application'] = df.global_contexts.gc.application

# select one or more user application(s) for analysis, in this case objectiv.io website and docs
df = df[(df['user_application'] == 'objectiv-website') | (df['user_application'] == 'objectiv-docs')]

## Set the time aggregation 

In [None]:
# choose for which level of time aggregation the rest of the analysis will run
# supports all Postgres datetime template patterns: https://www.postgresql.org/docs/9.1/functions-formatting.html#FUNCTIONS-FORMATTING-DATETIME-TABLE

agg_level = 'YYYYMMDD'

# add the time aggregation as new column to the dataframes, so we can group on this later
df['time_aggregation'] = df['moment'].dt.sql_format(agg_level)
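
To get a feel for what a template such as `YYYYMMDD` produces, here is a small plain-Python sketch. It does not use Bach and does not query the database. `PG_TO_STRFTIME` and `time_aggregation` are hypothetical helpers, and the mapping to `strftime` codes covers only a few patterns and is an assumption made for illustration.

```python
import datetime

# Hypothetical, partial mapping from Postgres template patterns to strftime codes.
PG_TO_STRFTIME = {
    'YYYYMMDD': '%Y%m%d',  # daily
    'YYYYMM': '%Y%m',      # monthly
    'YYYY': '%Y',          # yearly
}

def time_aggregation(moment, agg_level):
    """Format a timestamp at the chosen aggregation level (illustrative only)."""
    return moment.strftime(PG_TO_STRFTIME[agg_level])

moment = datetime.datetime(2021, 11, 15, 13, 45, 10)
daily = time_aggregation(moment, 'YYYYMMDD')
monthly = time_aggregation(moment, 'YYYYMM')
print(daily, monthly)
```

Because these values are zero-padded strings, sorting them alphabetically also sorts them chronologically, which is what the `sort_values(by='time_aggregation')` calls in this notebook rely on.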

## Set the timeframe

In [None]:
# set the timeframe for analysis
timeframe_selector = (df['moment'] >= datetime.date(2021,10,21)) 

# create a new df with timeframe applied 
timeframe_df = df[timeframe_selector]

## Explore the data

In [None]:
# only now the data gets queried. It is therefore recommended to limit the use of functions that query the
# database, or to use a sample when querying all data is not (yet) required. The Bach documentation always
# indicates when a function queries the database.
timeframe_df.sort_values(by='moment', ascending=False).head()

## Users

In [None]:
# calculate unique users per timeframe
users = timeframe_df.groupby('time_aggregation').aggregate({'user_id':'nunique'})

# calculate total users, to reuse later
total_users = timeframe_df['user_id'].nunique()

users.sort_values(by='time_aggregation', ascending=False).head()

In [None]:
# visualize users
fig = px.line(data_frame = users.to_pandas())
fig.show()

## Sessions

In [None]:
# calculate unique sessions
sessions = timeframe_df.groupby('time_aggregation').aggregate({'session_id':'nunique'})

sessions.sort_values(by='time_aggregation', ascending=False).head()

In [None]:
# visualize sessions
fig = px.line(data_frame = sessions.to_pandas())
fig.show()

## Sessions per user

In [None]:
# merge users and sessions
users_sessions = sessions.merge(users, how='inner', on='time_aggregation')

# calculate average sessions per user
users_sessions['sessions_per_user_avg'] = users_sessions['session_id_nunique'] / users_sessions['user_id_nunique']

# clean-up columns
users_sessions.drop(columns=['session_id_nunique', 'user_id_nunique'], inplace=True)

users_sessions.sort_values('time_aggregation', ascending=False).head()

In [None]:
# visualize average sessions per user
fig = px.line(data_frame = users_sessions.to_pandas())
fig.show()

## New users

In [None]:
# define first seen per user, based on the dataset with no timeframe applied
user_first_seen = df.groupby('user_id').aggregate({'time_aggregation':'min', 'session_id':'min'})

# select all users that have been active in the timeframe
active_users = timeframe_df['user_id'].unique()

# merge with users that have been active in the timeframe
user_first_seen = user_first_seen.merge(active_users, how='inner', on='user_id')

# calculate new users for each timeframe
new_users = user_first_seen.groupby('time_aggregation_min').aggregate({'user_id':'nunique'})

# merge with total users to calculate ratio and limit to timerange
new_total_users = users.merge(new_users, how='inner', left_on='time_aggregation', right_on='time_aggregation_min', suffixes=('_total', '_new'))

# set time_aggregation as single index
new_total_users = new_total_users.set_index('time_aggregation')

# calculate new & returning user share
new_total_users['new_user_share'] = new_total_users['user_id_nunique_new'] / new_total_users['user_id_nunique_total']
new_total_users['returning_user_share'] = (new_total_users['user_id_nunique_total'] - new_total_users['user_id_nunique_new']) / new_total_users['user_id_nunique_total']

new_total_users.sort_values(by='time_aggregation', ascending=False).head()

In [None]:
# visualize new users
fig = px.line(data_frame = new_total_users[['user_id_nunique_new', 'user_id_nunique_total']].to_pandas())
fig.show()

In [None]:
# visualize returning users
fig = px.line(data_frame = new_total_users[['returning_user_share']].to_pandas())
fig.show()
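
The new versus returning user logic above can be sketched in plain Python on a few made-up events. This is not Bach code and runs no SQL. The `events` list and all variable names are hypothetical, chosen only to show the idea of taking each user's first period and comparing it with the periods in which they were active.

```python
from collections import defaultdict

# Hypothetical toy events: (user_id, time_aggregation)
events = [
    ('u1', '20211101'), ('u2', '20211101'),
    ('u1', '20211102'), ('u3', '20211102'),
    ('u2', '20211103'), ('u3', '20211103'), ('u4', '20211103'),
]

# first period in which each user was seen
first_seen = {}
for user_id, period in events:
    if user_id not in first_seen or period < first_seen[user_id]:
        first_seen[user_id] = period

# all users that were active in each period
active = defaultdict(set)
for user_id, period in events:
    active[period].add(user_id)

# (new_user_share, returning_user_share) per period
shares = {}
for period, users in sorted(active.items()):
    new_users = {u for u in users if first_seen[u] == period}
    new_share = len(new_users) / len(users)
    shares[period] = (new_share, 1 - new_share)

print(shares)
```

In this toy data every user in the first period is new, half of the users in the second period are new, and one in three is new in the third.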

## Feature creation

In [None]:
# using Objectiv, you can create features that utilize the context of where they occur on the UI, using the location stack
# while it is possible to use the event_type and location_stack as is to describe individual features,
# the location stack can be leveraged to group and aggregate various features at different levels of location 'depth'
# of your product. 

# choose for which application(s) to create features, in this case we select the Objectiv website
feature_creation_df = timeframe_df[(timeframe_df['user_application'] == 'objectiv-website')]

# limit the timerange to match the latest taxonomy version applied as example on the website
feature_creation_df = feature_creation_df[(feature_creation_df['moment'] >= datetime.date(2021,11,15))]

# first, create a feature frame that will be used to create features
feature_frame = FeatureFrame.from_data_frame(df=feature_creation_df, location_stack_column='location_stack', event_column='event_type', overwrite=True)
feature_frame.head()

**Feature creation by slicing the location stack**  
The `.json[]` syntax of location stacks lets you slice with integers, and you can also pass dictionaries. If a dictionary matches
a context object in the stack, all objects of the stack starting at that object will be returned.  
  
**An example**  
We want to return only the location stack subsets that contain this object:
```javascript
{"id": "contributors", "_type": "SectionContext"}
```
This means that if a location stack looks like this:
```json
[{"id": "#document", "_type": "WebDocumentContext"},
 {"id": "main", "_type": "SectionContext"},
 {"id": "core-team", "_type": "SectionContext"},
 {"id": "contributors", "_type": "SectionContext"},
 {"id": "jansentom", "_type": "SectionContext"},
 {"id": "contributor-card", "_type": "SectionContext"}]
```
The returned location stack looks like this:
```json
[{"id": "contributors", "_type": "SectionContext"},
 {"id": "jansentom", "_type": "SectionContext"},
 {"id": "contributor-card", "_type": "SectionContext"}]
```
In case a location stack does not contain the object, `None` is returned. The syntax for selecting like this is: 
```python
feature_frame["contributors_features"] = feature_frame.location_stack.json[{"_type": "SectionContext", "id": "contributors"}:]
```

Next, we want a location stack that contains only the first object of this stack, for example when you are not interested in clicks on individual contributors but want to aggregate clicks on all of them. This can be done using slices:
```python
feature_frame["contributors_aggregated"] = feature_frame.contributors_features.json[:1]
```
Result:
```json
[{"id": "contributors", "_type": "SectionContext"}]
```
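
The two slicing steps described above can be mimicked in plain Python to make the semantics concrete. This is only a sketch of the behaviour as described here; Bach performs the slicing in SQL. `slice_from_context` is a hypothetical helper, and matching on the first context that contains all key/value pairs of the dictionary is an assumption.

```python
def slice_from_context(stack, match):
    """Return the part of `stack` starting at the first context that contains
    all key/value pairs in `match`, or None if no context matches."""
    for position, context in enumerate(stack):
        if all(context.get(key) == value for key, value in match.items()):
            return stack[position:]
    return None

location_stack = [
    {"id": "#document", "_type": "WebDocumentContext"},
    {"id": "main", "_type": "SectionContext"},
    {"id": "core-team", "_type": "SectionContext"},
    {"id": "contributors", "_type": "SectionContext"},
    {"id": "jansentom", "_type": "SectionContext"},
    {"id": "contributor-card", "_type": "SectionContext"},
]

# slice from the matching context to the end of the stack
contributors_features = slice_from_context(
    location_stack, {"_type": "SectionContext", "id": "contributors"})

# keep only the first object, which aggregates everything below it
contributors_aggregated = contributors_features[:1]

print([context["id"] for context in contributors_features])
print(contributors_aggregated)
```

A stack without a matching context yields `None`, so events that happened elsewhere in the UI are not part of the feature.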



**Feature creation with the Dash app**  
Using a Dash app, you can visualize all events with their location stack and create features.

The database is queried here to get all unique features.

As an example, we'll create the following features:
1. the job announcement bar that is on both the Home & About pages  
2. conversion, in this case going to the GitHub repo
3. contributor features  
4. an aggregate of all contributors

In [None]:
# features are created
feature_frame['announcement_bar_features'] = feature_frame.location_stack.json[{'_type': 'SectionContext', 'id': 'announcement-bar'}:]
feature_frame['conversion'] = feature_frame.location_stack.json[{'_type': 'LinkContext', 'id': 'cta-repo-button'}:]
feature_frame['contributors_features'] = feature_frame.location_stack.json[{'_type': 'SectionContext', 'id': 'contributors'}:]

# this returns the stack of 'contributors_features' up to the first object in the stack (and therefore aggregates all
# following objects in the stack)
feature_frame['contributors_aggregated'] = feature_frame.contributors_features.json[:1]

**Visualizing the stack**  
Now we can visualize the location stack. You can select the features with 'Location stack column to visualize'. The width of the links indicates the number of hits (given the selected event type). The number of hits is also the number displayed when hovering over a node.  

It is also possible to create features using the tool by clicking nodes, or slicing the selected location stack. Clicking 'Add to Feature Frame' adds the feature to the feature frame.  
  
Try selecting the features we just created. With the event type 'ClickEvent' selected, switching between 'contributors_features' and 'contributors_aggregated' shows how the clicks on individual contributors are aggregated.  
  
You can also try recreating the features above in the Sankey tool, by clicking nodes or slicing, starting from the 'location_stack' column as 'Location stack column to visualize'.

In [None]:
app = get_app(Dash, feature_frame, dash_options={'server_url': 'http://localhost:8053'})
app.run_server(mode='inline', height = 1100, port=8053, host='0.0.0.0')

In [None]:
# if you are happy with the result, write the created features to the working df
feature_creation_df = feature_frame.write_to_full_frame()
feature_creation_df.head()

## Features

In [None]:
# select the features we just created
created_features = feature_creation_df[(feature_creation_df.conversion.notnull()) | 
                                (feature_creation_df.announcement_bar_features.notnull()) |
                                (feature_creation_df.contributors_features.notnull()) |
                                (feature_creation_df.contributors_aggregated.notnull())]

# get the number of total users and hits per feature
users_per_event = created_features.groupby(['user_application', 'event_type', 'event_location']).aggregate({'user_id':'nunique','session_hit_number':'count'})

users_per_event.sort_values(by=['user_id_nunique'], ascending=False).head(10)

## Conversion

In [None]:
# select the created conversion feature and define completed conversion as a click event
conversion_completed = feature_creation_df[(feature_creation_df.conversion.notnull()) & 
                                    (feature_creation_df.event_type == 'ClickEvent')]

# calculate conversions, now per user, but can easily be aggregated to session_id instead
conversions = conversion_completed.groupby('time_aggregation').aggregate({'user_id':'nunique'})

# merge with users, but can easily be done with sessions instead
conversion_rate = conversions.merge(users, how='inner', on='time_aggregation', suffixes=('_converting', '_total'))

# calculate conversion rate
conversion_rate['conversion_rate'] = conversion_rate['user_id_nunique_converting'] / conversion_rate['user_id_nunique_total']

conversion_rate.sort_values(by='time_aggregation', ascending=False).head()

In [None]:
# visualize conversion rate
fig = px.line(data_frame = conversion_rate[['conversion_rate']].to_pandas())
fig.show()

## Conversion funnel

In [None]:
# for users that have a conversion event, select their conversion sessions and session_hit_number of the first conversion moment in a session
converting_users = conversion_completed.groupby(['user_id', 'session_id']).aggregate({'session_hit_number':'min'})

# merge with the df that has all user events in the timeframe
converting_users_events = timeframe_df.merge(converting_users, how='inner', on=['user_id', 'session_id'])

# select all events that converting users had up to their first conversion moment in the same session
converting_users_events = converting_users_events[(converting_users_events['session_hit_number'] <= converting_users_events['session_hit_number_min'])]

# filter on only ClickEvent so we focus on user interactions
converting_users_events = converting_users_events[(converting_users_events['event_type'] == 'ClickEvent')]

# select all unique features used by these users
converting_users_features = converting_users_events.groupby(['event_type', 'event_location']).aggregate({'event_id':'nunique'}).sort_values(by='event_location', ascending=True)

# now we switch to Pandas, as the dataset is small enough and allows nice visualisation
feature_id_pd = converting_users_features.to_pandas().reset_index()

# clean-up columns
feature_id_pd.drop(columns=['event_id_nunique'], inplace=True)

# use the index to give each feature a unique id
feature_id_pd['feature_id'] = feature_id_pd.index

# create a rolling window that includes the previous event for each row and get it through window_lag()
rolling = converting_users_events.sort_values('session_hit_number').groupby('session_id').rolling(2)
converting_users_events['prev_event_location'] = rolling.event_location.window_lag()

# materialize the df before we apply an expression on the window column
converting_users_events = converting_users_events.materialize()

# group each unique event by previous unique event
from_to_events = converting_users_events.groupby(['prev_event_location', 'event_location']).aggregate({'user_id':'nunique'})

# now we switch to Pandas, as the dataset is small enough and allows nice visualisation
from_to_events_pd = from_to_events.to_pandas().reset_index()

# merge with the unique id for each prev_feature
sankey_input_pd = from_to_events_pd.merge(feature_id_pd, how='inner', left_on='prev_event_location', right_on='event_location')
sankey_input_pd = sankey_input_pd.rename(columns={'event_location_x':'event_location', 'feature_id':'prev_feature_id'})
sankey_input_pd = sankey_input_pd.drop(columns=['event_location_y'])

# merge with the unique id for each feature
sankey_input_pd = sankey_input_pd.merge(feature_id_pd, how='left', left_on='event_location', right_on='event_location')
sankey_input_pd = sankey_input_pd.rename(columns={'event_type_x':'event_type'})
sankey_input_pd = sankey_input_pd.drop(columns=['event_type_y'])

# filter out events where prev_feature and feature are the same and user did not go anywhere new
sankey_input_pd = sankey_input_pd[(sankey_input_pd['prev_feature_id'] != sankey_input_pd['feature_id'])]
sankey_input_pd.head()

In [None]:
# visualize the sankey
fig = go.Figure(data=[go.Sankey(
    node = dict(
      pad = 50,
      thickness = 5,
      line = dict(color = "black", width = 1),
      label = feature_id_pd['event_location'].str.slice(0,30).tolist(),
      color = "blue",
      customdata = (feature_id_pd['event_type'] + ' at ' + feature_id_pd['event_location']).tolist(),
      hovertemplate='%{customdata}<br />'+
        'unique users: %{value}'  
    ),
    link = dict(
      source = sankey_input_pd['prev_feature_id'].tolist(),
      target = sankey_input_pd['feature_id'].tolist(),
      value = sankey_input_pd['user_id_nunique'].tolist(),
  ))])

fig.update_layout(title_text="Conversion funnel", font_size=10)
fig.show()
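
The core of the funnel above is pairing each click with the previous click in the same session and counting unique users per from/to pair. The sketch below shows that idea in plain Python on invented data. It is not how Bach's `window_lag()` is implemented, and the `events` tuples and location names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy click events: (session_id, session_hit_number, user_id, event_location)
events = [
    (1, 1, 'u1', 'home'),
    (1, 2, 'u1', 'about'),
    (1, 3, 'u1', 'cta-repo-button'),
    (2, 1, 'u2', 'home'),
    (2, 2, 'u2', 'cta-repo-button'),
    (3, 2, 'u3', 'about'),
    (3, 1, 'u3', 'home'),
]

# partition by session
by_session = defaultdict(list)
for session_id, hit_number, user_id, location in events:
    by_session[session_id].append((hit_number, user_id, location))

# order by hit number, then pair each event with the one before it (a lag of 1)
transitions = defaultdict(set)  # (prev_location, location) -> set of user ids
for hits in by_session.values():
    hits.sort()
    for (_, _, prev_location), (_, user_id, location) in zip(hits, hits[1:]):
        if prev_location != location:  # skip steps where the user did not go anywhere new
            transitions[(prev_location, location)].add(user_id)

from_to_counts = {pair: len(users) for pair, users in transitions.items()}
print(from_to_counts)
```

Each key of `from_to_counts` corresponds to one link in the Sankey diagram, and its value to the width of that link.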

## Session duration

In [None]:
# calculate duration of each session
session_duration = timeframe_df.groupby(['session_id']).aggregate({'moment':['min','max'], 'time_aggregation':'min'})
session_duration['session_duration'] = session_duration['moment_max'] - session_duration['moment_min']

# check which sessions have duration of zero and filter these out, as they are bounces
session_duration = session_duration[(session_duration['session_duration'] > '0')]

# rename columns
session_duration.rename(columns={'time_aggregation_min':'time_aggregation'}, inplace=True)

# calculate average session duration
avg_session_duration = session_duration.groupby(['time_aggregation']).aggregate({'session_duration': 'mean'})

avg_session_duration.sort_values(by='time_aggregation', ascending=False).head()
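
As a rough illustration of the calculation above, the following plain-Python sketch computes the same metric on a handful of invented events: last moment minus first moment per session, zero-length sessions dropped, then averaged per day. The data and names are hypothetical and no database is involved.

```python
import datetime
from collections import defaultdict

# Hypothetical toy events: (session_id, moment)
events = [
    (1, datetime.datetime(2021, 11, 15, 10, 0, 0)),
    (1, datetime.datetime(2021, 11, 15, 10, 4, 0)),
    (2, datetime.datetime(2021, 11, 15, 11, 0, 0)),  # single hit: duration of zero
    (3, datetime.datetime(2021, 11, 15, 12, 0, 0)),
    (3, datetime.datetime(2021, 11, 15, 12, 2, 0)),
    (4, datetime.datetime(2021, 11, 16, 9, 0, 0)),
    (4, datetime.datetime(2021, 11, 16, 9, 10, 0)),
]

moments = defaultdict(list)
for session_id, moment in events:
    moments[session_id].append(moment)

durations_per_day = defaultdict(list)
for session_moments in moments.values():
    start, end = min(session_moments), max(session_moments)
    duration = end - start
    if duration > datetime.timedelta(0):  # filter out bounces
        durations_per_day[start.strftime('%Y%m%d')].append(duration)

avg_session_duration = {
    day: sum(durations, datetime.timedelta(0)) / len(durations)
    for day, durations in durations_per_day.items()
}
print(avg_session_duration)
```

Sessions are attributed to the day on which they started, mirroring the `'time_aggregation':'min'` aggregation used above.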

## Session duration for specific features

In [None]:
# from the features we created, select one or more to calculate duration for. In this example we calculate the time
# spent in the conversion funnel
start_stop = feature_creation_df[feature_creation_df.conversion.notnull()]

# get previous moment in the same session, window_lag(n) returns the nth previous value in the partition
rolling = start_stop.sort_values('moment').groupby('session_id').rolling(2)
start_stop['prev_moment'] = rolling.moment.window_lag()

# materialize the df before we apply an expression on the window column
start_stop = start_stop.materialize()

# calculate duration
start_stop['duration'] = start_stop.moment - start_stop.prev_moment

# calculate total duration per timeframe
duration_between_events = start_stop.groupby('time_aggregation').aggregate({'duration':'sum'})

duration_between_events.sort_values(by='time_aggregation', ascending=False).head()

## Retention

In [None]:
# select all sorted time aggregations in the timeframe 
time_aggregations = timeframe_df.groupby(['time_aggregation']).aggregate({'user_id':'nunique'}).sort_values(by='time_aggregation', ascending=True)
time_aggregations.head()

# switch to Pandas as the dataset is small enough, reset the index, and use it to number each cohort
time_cohorts = time_aggregations.to_pandas().reset_index()
time_cohorts['cohort_id'] = time_cohorts.index
time_cohorts.drop(columns=['user_id_nunique'], inplace=True)

# select all active moments for each user
user_moments = timeframe_df.groupby(['user_id', 'time_aggregation']).aggregate({'moment':'count'})

# merge with first seen df
user_activity = user_moments.merge(user_first_seen, how='inner', on='user_id')

# clean-up and rename columns
user_activity.rename(columns={'time_aggregation_min':'new_user_cohort'}, inplace=True)
user_activity.drop(columns=['moment_count'], inplace=True)

# limit new users to the selected timeframe
timeframe_start = timeframe_df['time_aggregation'].min()
user_activity = user_activity[(user_activity['new_user_cohort'] >= timeframe_start)]

# for each new_user_cohort, count how many users come back per timeframe
retention_input = user_activity.groupby(['new_user_cohort', 'time_aggregation']).aggregate({'user_id':'nunique'})

# add the size of each new user cohort
cohorts = retention_input.merge(new_users, how='inner', left_on='new_user_cohort', right_on='time_aggregation_min', suffixes=('_active', '_cohort'))

# calculate classic retention (so not rolling retention, where users are required to be active each timeframe)
cohorts['retention'] = cohorts['user_id_nunique_active'] / cohorts['user_id_nunique_cohort']

# now we switch to Pandas, as the dataset is small enough and allows nice visualisation
cohorts_pd = cohorts.to_pandas().reset_index()

# merge with cohorts to lookup the id for each new user cohort
cohorts_pd = cohorts_pd.merge(time_cohorts, how='inner', left_on='new_user_cohort', right_on='time_aggregation')
cohorts_pd.drop(columns=['time_aggregation_y'], inplace=True)
cohorts_pd.rename(columns={'cohort_id':'new_user_cohort_id', 'time_aggregation_x':'time_aggregation'}, inplace=True)

# merge with cohorts to lookup the id for each active user cohort
cohorts_pd = cohorts_pd.merge(time_cohorts, how='inner', on='time_aggregation')
cohorts_pd.rename(columns={'cohort_id':'active_user_cohort_id', 'time_aggregation_x':'time_aggregation'}, inplace=True)

# number the cohort in which users were active vs their new user cohort
cohorts_pd['active_in_timeframe'] = cohorts_pd.active_user_cohort_id - cohorts_pd.new_user_cohort_id

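# create typical retention matrix (keyword arguments, as newer pandas versions no longer accept positional ones)
cohorts_pd.pivot(index='new_user_cohort', columns='active_in_timeframe', values='retention').replace(np.nan, 0)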

In [None]:
# remove timeframe 0, in which all new users are active by definition, for better visualisation
cohorts_pd.drop(cohorts_pd[cohorts_pd.active_in_timeframe == 0].index, inplace=True)

# create retention matrix
retention_pd = cohorts_pd.pivot(index='new_user_cohort', columns='active_in_timeframe', values='retention').replace(np.nan, 0)

# visualize heatmap
plt.figure(figsize=(15,10))
fmt = lambda x,pos: '{:.0%}'.format(x)
retention_heatmap = sns.heatmap(retention_pd, center=1, linewidths=1, square=True, annot=True, fmt=".0%", cbar_kws={'format': FuncFormatter(fmt)})
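
The classic retention calculation above boils down to three steps: assign each user to the cohort of their first active period, count which cohort members are active a given number of periods later, and divide by the cohort size. The plain-Python sketch below walks through these steps on invented data. Periods are given as hypothetical integer indices (like `cohort_id` above) to keep the arithmetic simple.

```python
from collections import defaultdict

# Hypothetical toy activity: (user_id, period_index)
activity = [
    ('u1', 0), ('u2', 0),
    ('u1', 1), ('u3', 1),
    ('u1', 2), ('u2', 2), ('u3', 2),
]

# cohort of a user = first period in which the user was active
cohort_of = {}
for user_id, period in activity:
    cohort_of[user_id] = min(period, cohort_of.get(user_id, period))

cohort_members = defaultdict(set)
for user_id, cohort in cohort_of.items():
    cohort_members[cohort].add(user_id)

# users active per (cohort, number of periods since the cohort started)
active = defaultdict(set)
for user_id, period in activity:
    cohort = cohort_of[user_id]
    active[(cohort, period - cohort)].add(user_id)

# classic retention: share of the cohort active in that period
retention = {
    key: len(users) / len(cohort_members[key[0]])
    for key, users in active.items()
}
print(retention)
```

Note that `u2` skips period 1 and still counts as retained in period 2. That is the difference between classic and rolling retention mentioned in the comments above.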

## Bounce rate

In [None]:
# gather sessions, hits per timeframe
hits_sessions = timeframe_df[['time_aggregation', 'session_id', 'session_hit_number']]

# calculate hits per session
hits_per_session = hits_sessions.groupby(['time_aggregation', 'session_id']).aggregate({'session_hit_number':'nunique'})

# select sessions with only one hit
hit_selector = (hits_per_session['session_hit_number_nunique'] == 1)
single_hit_sessions = hits_per_session[hit_selector]

# count these single hit sessions per timeframe
bounced_sessions = single_hit_sessions.groupby('time_aggregation').aggregate({'session_id':'nunique'})

# merge with total sessions
bounce_rate = bounced_sessions.merge(sessions, how='inner', on='time_aggregation', suffixes=('_bounce', '_total'))

# calculate bounce rate
bounce_rate['bounce_rate'] = bounce_rate['session_id_nunique_bounce'] / bounce_rate['session_id_nunique_total']

# clean-up columns
bounce_rate.drop(columns=['session_id_nunique_bounce', 'session_id_nunique_total'], inplace=True)

bounce_rate.sort_values(by='time_aggregation', ascending=False).head()

In [None]:
# visualize bounce rate
fig = px.line(data_frame = bounce_rate[['bounce_rate']].to_pandas())
fig.show()
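
For reference, here is the bounce rate definition used above (sessions with exactly one hit divided by all sessions, per period) as a plain-Python sketch. The `hits` data and variable names are made up and nothing here touches Bach or the database.

```python
from collections import defaultdict

# Hypothetical toy hits: (time_aggregation, session_id, session_hit_number)
hits = [
    ('20211115', 1, 1), ('20211115', 1, 2),
    ('20211115', 2, 1),
    ('20211116', 3, 1),
    ('20211116', 4, 1), ('20211116', 4, 2), ('20211116', 4, 3),
    ('20211116', 5, 1),
]

# unique hit numbers per session
hit_numbers = defaultdict(set)
for period, session_id, hit_number in hits:
    hit_numbers[(period, session_id)].add(hit_number)

total = defaultdict(int)
bounced = defaultdict(int)
for (period, _), numbers in hit_numbers.items():
    total[period] += 1
    if len(numbers) == 1:  # a single hit means the session bounced
        bounced[period] += 1

bounce_rate = {period: bounced[period] / total[period] for period in total}
print(bounce_rate)
```

One difference to keep in mind: this sketch reports a rate of 0.0 for a period without bounces, whereas an inner merge like the one in the cell above would leave such a period out of the result.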

## User agent

In [None]:
# add a new column to df with the user_agent from the global contexts, using gc function
timeframe_df['user_agent'] = timeframe_df.global_contexts.gc.user_agent

# gather overall basic stats grouped per user_agent
user_agent_counts = timeframe_df.groupby(['time_aggregation', 'user_agent']).aggregate({'user_id':'nunique', 'session_id':'nunique'})

# add total users
user_agent_counts['total_users'] = total_users

# calculate share per user_agent
user_agent_counts['share_of_users'] = user_agent_counts['user_id_nunique'] / user_agent_counts['total_users']

# clean-up columns
user_agent_counts.drop(columns=['total_users'], inplace=True)

user_agent_counts.sort_values(by=['time_aggregation', 'user_id_nunique'], ascending=False).head()

## Referer

In [None]:
# add a new column to dataframe with the referer from the global contexts, using gc function
timeframe_df['referer'] = timeframe_df.global_contexts.gc.get_from_context_with_type_series(type='HttpContext', key='referer')

# gather overall basic stats grouped per referer
referer_counts = timeframe_df.groupby(['time_aggregation', 'referer']).aggregate({'user_id':'nunique', 'session_id':'nunique'})

# add total users
referer_counts['total_users'] = total_users

# calculate share per referer
referer_counts['share_of_users'] = referer_counts['user_id_nunique'] / referer_counts['total_users']

# clean-up columns
referer_counts.drop(columns=['total_users'], inplace=True)

referer_counts.sort_values(by=['time_aggregation', 'user_id_nunique'], ascending=False).head()

## User timeline

In [None]:
# select the specific user we want to replay
user_selector = (timeframe_df['user_id'].astype('string') == 'fe2657f1-a08c-4e33-b762-441c2f52855c')

# create df with only this user's events
selected_user_df = timeframe_df[user_selector]

# timeline of this user's events
user_timeline = selected_user_df[['moment','event_type', 'event_location', 'user_agent', 'referer']]

user_timeline.sort_values(by='moment', ascending=True).head()

## Frequency

In [None]:
# number of total sessions per user
total_sessions_user = timeframe_df.groupby(['user_id']).aggregate({'session_id':'nunique'})

# calculate frequency
frequency = total_sessions_user.groupby(['session_id_nunique']).aggregate({'user_id':'nunique'})

# add total users and calculate share per number of sessions
frequency['share_of_users'] = frequency['user_id_nunique'] / total_users

frequency.sort_values(by='session_id_nunique', ascending=True).head()

In [None]:
# visualize frequency
fig = px.bar(data_frame = frequency[['share_of_users']].to_pandas())
fig.show()
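
The frequency metric above is a distribution: for each number of sessions, the share of users that had exactly that many. A minimal plain-Python sketch on invented data, with hypothetical names throughout:

```python
from collections import Counter, defaultdict

# Hypothetical toy events: (user_id, session_id)
events = [
    ('u1', 1), ('u1', 1), ('u1', 2),
    ('u2', 3),
    ('u3', 4), ('u3', 5), ('u3', 7),
    ('u4', 6),
]

# unique sessions per user
sessions_per_user = defaultdict(set)
for user_id, session_id in events:
    sessions_per_user[user_id].add(session_id)

# number of users per session count, then the share of all users
users_per_session_count = Counter(len(s) for s in sessions_per_user.values())
total_users = len(sessions_per_user)
share_of_users = {
    n_sessions: count / total_users
    for n_sessions, count in sorted(users_per_session_count.items())
}
print(share_of_users)
```

The shares add up to 1, because every user falls into exactly one bucket.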

## Recency

In [None]:
# count the number of active days per user
user_active_check = timeframe_df.groupby(['user_id']).aggregate({'day':'nunique'})

# select all users that had more than one active day
user_active_check = user_active_check[(user_active_check['day_nunique'] > 1)]

# select all active days for each user
user_days = timeframe_df.groupby(['user_id', 'day']).aggregate({'time_aggregation':'min'})

# merge with users that have more than one active day
user_days = user_days.merge(user_active_check, how='inner', on='user_id')

# reset the index so we can use the user_id & day columns
user_days = user_days.reset_index()

# get previous (because of the sorting) day for each user
rolling = user_days.sort_values('day').groupby(['user_id']).rolling(2)
user_days['prev_day'] = rolling.day.window_lag()

# materialize the df before we apply an expression on the window column
user_days = user_days.materialize()

# calculate the number of days between an active day and prev_day
user_days['recency'] = user_days['day'] - user_days['prev_day']

# rename columns
user_days.rename(columns={'time_aggregation_min':'time_aggregation'}, inplace=True)

# calculate the recency per time_aggregation
recency = user_days.groupby(['time_aggregation']).aggregate({'recency':'mean','user_id':'nunique'})

recency.sort_values(by='time_aggregation', ascending=False).head()
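
Recency as computed above is the number of days between an active day and the user's previous active day, averaged per period. The sketch below reproduces that on a few invented rows in plain Python. The data and names are hypothetical, and the day itself stands in for `time_aggregation`.

```python
import datetime
from collections import defaultdict

# Hypothetical toy data: (user_id, active day)
active_days = [
    ('u1', datetime.date(2021, 11, 1)),
    ('u1', datetime.date(2021, 11, 4)),
    ('u1', datetime.date(2021, 11, 5)),
    ('u2', datetime.date(2021, 11, 2)),  # only one active day, so no recency
    ('u3', datetime.date(2021, 11, 3)),
    ('u3', datetime.date(2021, 11, 5)),
]

days_per_user = defaultdict(set)
for user_id, day in active_days:
    days_per_user[user_id].add(day)

# gap in days between each active day and the previous one, grouped by the later day
gaps_per_day = defaultdict(list)
for days in days_per_user.values():
    ordered = sorted(days)
    for prev_day, day in zip(ordered, ordered[1:]):
        gaps_per_day[day].append((day - prev_day).days)

recency = {day: sum(gaps) / len(gaps) for day, gaps in gaps_per_day.items()}
print(recency)
```

Users with a single active day contribute nothing, which matches the `day_nunique > 1` filter in the cell above.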

## Get metrics to production

In [None]:
# we're working on export functionality to dbt, until then, you can use view_sql() to get the SQL that runs on the full dataset for any metric above

# as an example, the SQL for the session duration metric
print(avg_session_duration.view_sql())