In [9]:
import pandas as pd
import numpy as np
import plotly.express as px

# Motivational Example A - Learning Optimal Position Using Logistic Bagging

The dimensionality of sequence data grows quickly. For example, with 8 unique channels, studying every unique sequence made up of the last 10 touchpoints (sequences of length 1 through 10) gives up to 1,227,133,512 possible sequences, which is 8^1 + 8^2 + ... + 8^10.
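
The count above is a geometric series, so it can be checked with a closed form. A minimal sketch, where `count_sequences` is a hypothetical helper name and not something defined elsewhere in this notebook:

```python
def count_sequences(num_channels: int, max_length: int) -> int:
    """Count ordered sequences of length 1..max_length drawn from num_channels channels."""
    if num_channels == 1:
        # Degenerate case: exactly one sequence per length.
        return max_length
    # Closed form of num_channels**1 + num_channels**2 + ... + num_channels**max_length.
    return (num_channels ** (max_length + 1) - num_channels) // (num_channels - 1)


print(count_sequences(8, 10))  # 1227133512
print(count_sequences(8, 5))   # 37448
```

The second call uses 5 positions instead of 10, which shows how sharply the count drops once the window is shortened.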

Note that this count is already limited to the last 10 touchpoints; the full, untruncated sequences have an even longer tail. The key takeaway is that the analysis becomes much simpler if we focus only on the last N positions before a conversion or dead end.

The problem can be simplified even further if we study only positions rather than unique sequences. (We will explore more complex sequence modeling later.) This notebook covers positions only. If we study only the optimal position, considering up to the last 5 touchpoints, we end up with only 40 features to model on (8 channels * 5 positions).
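
To make the 40-feature count concrete, here is a small sketch that enumerates one feature per channel and position pair. The channel names are hypothetical placeholders (the real ones come from `event_name`), and the `_pos` naming scheme is an assumption for illustration only:

```python
from itertools import product

# Hypothetical placeholder channel names; the real values come from event_name.
channels = [f"channel_{k}" for k in range(1, 9)]
n_positions = 5

# One feature per (position, channel) pair.
# Assumption: position 1 is the touchpoint closest to the conversion or dead end.
feature_names = [
    f"{channel}_pos{position}"
    for position, channel in product(range(1, n_positions + 1), channels)
]

print(len(feature_names))  # 40
```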

The methodology is inspired by the following research paper:

[Data-driven Multi-touch Attribution Models (2011)](http://wnzhang.net/share/rtb-papers/data-conv-att.pdf)

In [29]:
sequence_df = pd.read_csv('../datasets/sequence_fact.csv')
sequence_df.head(10)

Unnamed: 0,sequence_id,fullVisitorId,event_name,event_datetime,conversion_proximity
0,0099Rqojoj1MCXN,7343617347507729080,organic_search,2018-04-15 17:31:50,75.0
1,0099Rqojoj1MCXN,7343617347507729080,dead_end,2018-04-15 17:33:05,0.0
2,00A9Lkka73okUx2,89656057821147903,organic_search,2017-09-14 16:36:56,1033.0
3,00A9Lkka73okUx2,89656057821147903,dead_end,2017-09-14 16:54:09,0.0
4,00B30tmbMwJn7Cf,4307745811624101170,organic_search,2017-04-21 02:41:23,1.0
5,00B30tmbMwJn7Cf,4307745811624101170,dead_end,2017-04-21 02:41:24,0.0
6,00BKxKnEYlKbw9b,7129167701457127936,organic_search,2016-10-02 15:16:09,1.0
7,00BKxKnEYlKbw9b,7129167701457127936,dead_end,2016-10-02 15:16:10,0.0
8,00EttOfsTTyp45B,3217678225016118393,referral,2017-10-23 19:44:20,143.0
9,00EttOfsTTyp45B,3217678225016118393,dead_end,2017-10-23 19:46:43,0.0


In [30]:
## this setting will be used throughout the analysis to decide how many positions to analyze
n_positions = 5

## counts the number of unique events not counting conversion or dead ends
num_channels = len(sequence_df[~sequence_df['event_name'].isin(['conversion','dead_end'])]['event_name'].drop_duplicates())

## number of possible features if we built one for every possible sequence of length 1 to n_positions
num_sequences = sum(num_channels**i for i in range(1, n_positions + 1))
print("Total possible unique sequences for analysis on last {} positions: ".format(n_positions), num_sequences)

## number of features we will end up with for the position only analysis
num_position_features = n_positions * num_channels
print("Number of features we will study in position only analysis with {} positions: ".format(n_positions), num_position_features)

Total possible unique sequences for analysis on last 5 positions:  37448
Number of features we will study in position only analysis with 5 positions:  40


## Make a position-based dataset to model on

We want one row per sequence id.

For fun, let's build a string column that shows the events in order.

Next comes a conversion column: 1 means the journey ended in a conversion, 0 means it ended in a dead end.
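
The target layout described above could be sketched as follows. This is one possible approach under stated assumptions, not necessarily the implementation used later in this notebook: `build_position_dataset` and the `_pos` column naming are hypothetical, position 1 is assumed to be the touchpoint closest to the terminal event, and `toy_df` is a tiny stand-in built from rows shown in this notebook.

```python
import pandas as pd

# Tiny stand-in for sequence_fact.csv, using rows shown in this notebook.
toy_df = pd.DataFrame({
    "sequence_id": ["0099Rqojoj1MCXN"] * 2 + ["0AioIlToiDilMZ6"] * 3,
    "event_name": ["organic_search", "dead_end", "referral", "referral", "conversion"],
    "event_datetime": pd.to_datetime([
        "2018-04-15 17:31:50", "2018-04-15 17:33:05",
        "2016-11-23 19:26:04", "2016-12-05 21:22:19", "2016-12-05 21:26:47",
    ]),
})


def build_position_dataset(events, n_positions=5):
    """Hypothetical helper: one row per sequence_id with path, label, and position flags."""
    terminal = ["conversion", "dead_end"]
    events = events.sort_values(["sequence_id", "event_datetime"])

    # One row per sequence: ordered event string and conversion label.
    grouped = events.groupby("sequence_id")["event_name"]
    summary = pd.DataFrame({
        "path": grouped.agg(lambda s: " > ".join(s)),
        "conversion": grouped.agg(lambda s: int((s == "conversion").any())),
    })

    # Keep touchpoints only and number them backwards from the terminal event.
    touches = events[~events["event_name"].isin(terminal)].copy()
    touches["position"] = touches.groupby("sequence_id").cumcount(ascending=False) + 1
    touches = touches[touches["position"] <= n_positions]
    touches["feature"] = touches["event_name"] + "_pos" + touches["position"].astype(str)

    # 0/1 indicator for each channel-position pair that appears in the data.
    indicators = pd.crosstab(touches["sequence_id"], touches["feature"])
    features = summary.join(indicators)
    features[indicators.columns] = features[indicators.columns].fillna(0).astype(int)
    return features


position_df = build_position_dataset(toy_df)
print(position_df)
```

Note that this sketch only creates columns for channel-position pairs that actually occur, so on real data it may produce fewer than `n_positions * num_channels` columns.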



In [12]:
sequence_df[sequence_df['sequence_id']=='0AioIlToiDilMZ6']

Unnamed: 0,sequence_id,fullVisitorId,event_name,event_datetime,conversion_proximity
659,0AioIlToiDilMZ6,7547767069516152606,referral,2016-11-23 19:26:04,1044043.0
660,0AioIlToiDilMZ6,7547767069516152606,referral,2016-12-05 21:22:19,268.0
661,0AioIlToiDilMZ6,7547767069516152606,conversion,2016-12-05 21:26:47,0.0
