In [1]:
## standard packages
import pandas as pd
import numpy as np
from scipy import special

## visualization packages
import plotly.express as px
import plotly.graph_objects as go

## Model packages
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Motivational Example A - Logistic MTA with No Sequence

This notebook builds a basic logistic regression model on MTA data, ignoring the order of touchpoints.

The features are the frequency of each channel's touchpoints in a journey.

It enforces an arbitrary, non-data-driven 45-day window (such as one a marketing SME might provide). You are encouraged to also look into data-driven ways to define the window, or to model touchpoint proximity directly.
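One data-driven alternative to a fixed window, sketched here under stated assumptions: take the touchpoints that belong to converting journeys and choose the window that covers a chosen share of them. The toy DataFrame and the 95% cutoff are illustrative assumptions, not values from this dataset; `conversion_proximity` is assumed to be in seconds, as in `sequence_fact.csv`.

```python
import pandas as pd

SECONDS_PER_DAY = 86400

# Toy data (hypothetical): touchpoints from converting journeys only.
# Proximity is typed in days for readability, then converted to seconds.
toy_df = pd.DataFrame({
    'sequence_id': ['a', 'a', 'b', 'c', 'c', 'd'],
    'event_name': ['direct', 'referral', 'paid_search', 'social', 'direct', 'referral'],
    'conversion_proximity': [10, 2, 30, 90, 5, 1],
})
toy_df['conversion_proximity'] = toy_df['conversion_proximity'] * SECONDS_PER_DAY

# Window (in days) that covers 95% of the pre-conversion touchpoints.
proximity_days = toy_df['conversion_proximity'] / SECONDS_PER_DAY
window_days = proximity_days.quantile(0.95)
print("Suggested window: {:.1f} days".format(window_days))
```

A stricter or looser cutoff than 95% is a business decision; the point is only that the window can be read off the observed proximity distribution instead of being fixed in advance.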

In [2]:
sequence_df = pd.read_csv('../datasets/sequence_fact.csv')
sequence_df.head(10)

Unnamed: 0,sequence_id,fullVisitorId,event_name,event_datetime,conversion_proximity
0,0099Rqojoj1MCXN,7343617347507729080,organic_search,2018-04-15 17:31:50,75.0
1,0099Rqojoj1MCXN,7343617347507729080,dead_end,2018-04-15 17:33:05,0.0
2,00A9Lkka73okUx2,89656057821147903,organic_search,2017-09-14 16:36:56,1033.0
3,00A9Lkka73okUx2,89656057821147903,dead_end,2017-09-14 16:54:09,0.0
4,00B30tmbMwJn7Cf,4307745811624101170,organic_search,2017-04-21 02:41:23,1.0
5,00B30tmbMwJn7Cf,4307745811624101170,dead_end,2017-04-21 02:41:24,0.0
6,00BKxKnEYlKbw9b,7129167701457127936,organic_search,2016-10-02 15:16:09,1.0
7,00BKxKnEYlKbw9b,7129167701457127936,dead_end,2016-10-02 15:16:10,0.0
8,00EttOfsTTyp45B,3217678225016118393,referral,2017-10-23 19:44:20,143.0
9,00EttOfsTTyp45B,3217678225016118393,dead_end,2017-10-23 19:46:43,0.0


In [3]:
sequence_to_visitor_map = sequence_df[['sequence_id','fullVisitorId']].drop_duplicates().reset_index(drop=True)

## Make a modeling dataset

We want to keep only rows whose conversion proximity is at most 45 days (45 days * 86,400 seconds per day = 3,888,000 seconds).

We want one row per sequence id.

For fun, let's build a string column that shows the events in order.

Next there is a conversion column: 1 = conversion, 0 = dead-end journey.

Finally, since this is a non-sequence table, we need one column per channel holding the number of touchpoints from that channel in the 45-day journey.

In [5]:
## filter conversion_proximity 
model_prep_df1 = sequence_df.loc[(sequence_df['conversion_proximity']/86400)<=45,:]

In [6]:
## make the sequence details
model_prep_df2 = model_prep_df1.groupby('sequence_id')['event_name'].agg(lambda x: '>'.join(x)).reset_index()
model_prep_df2.columns = ['sequence_id','sequence_details']
model_prep_df2.head()

Unnamed: 0,sequence_id,sequence_details
0,0099Rqojoj1MCXN,organic_search>dead_end
1,00A9Lkka73okUx2,organic_search>dead_end
2,00B30tmbMwJn7Cf,organic_search>dead_end
3,00BKxKnEYlKbw9b,organic_search>dead_end
4,00EttOfsTTyp45B,referral>dead_end


In [7]:
## make the modeling features
model_prep_df3 = model_prep_df1.pivot_table(index='sequence_id', columns='event_name', aggfunc='size', fill_value=0).reset_index()
model_prep_df3 = model_prep_df3.rename_axis(None, axis=1)
model_prep_df3.head()

Unnamed: 0,sequence_id,(other),affiliates,conversion,dead_end,direct,display,organic_search,paid_search,referral,social
0,0099Rqojoj1MCXN,0,0,0,1,0,0,1,0,0,0
1,00A9Lkka73okUx2,0,0,0,1,0,0,1,0,0,0
2,00B30tmbMwJn7Cf,0,0,0,1,0,0,1,0,0,0
3,00BKxKnEYlKbw9b,0,0,0,1,0,0,1,0,0,0
4,00EttOfsTTyp45B,0,0,0,1,0,0,0,0,1,0


In [8]:
## Final joining and prep
model_prep_df4 = model_prep_df2.merge(model_prep_df3, on='sequence_id',how='left')

## Add visitor id back in
model_prep_df4 = model_prep_df4.merge(sequence_to_visitor_map, on='sequence_id',how='left')


## drop dead_end and move the target (conversion) to the last column
model_data_final =  model_prep_df4[['fullVisitorId','sequence_id','sequence_details','affiliates'
                                 ,'direct','display','organic_search'
                                 ,'paid_search','referral','social'
                                 ,'(other)','conversion']]
model_data_final.head()

Unnamed: 0,fullVisitorId,sequence_id,sequence_details,affiliates,direct,display,organic_search,paid_search,referral,social,(other),conversion
0,7343617347507729080,0099Rqojoj1MCXN,organic_search>dead_end,0,0,0,1,0,0,0,0,0
1,89656057821147903,00A9Lkka73okUx2,organic_search>dead_end,0,0,0,1,0,0,0,0,0
2,4307745811624101170,00B30tmbMwJn7Cf,organic_search>dead_end,0,0,0,1,0,0,0,0,0
3,7129167701457127936,00BKxKnEYlKbw9b,organic_search>dead_end,0,0,0,1,0,0,0,0,0
4,3217678225016118393,00EttOfsTTyp45B,referral>dead_end,0,0,0,0,0,1,0,0,0


In [9]:
model_data_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 99718 entries, 0 to 99717
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   fullVisitorId     99718 non-null  object
 1   sequence_id       99718 non-null  object
 2   sequence_details  99718 non-null  object
 3   affiliates        99718 non-null  int64 
 4   direct            99718 non-null  int64 
 5   display           99718 non-null  int64 
 6   organic_search    99718 non-null  int64 
 7   paid_search       99718 non-null  int64 
 8   referral          99718 non-null  int64 
 9   social            99718 non-null  int64 
 10  (other)           99718 non-null  int64 
 11  conversion        99718 non-null  int64 
dtypes: int64(9), object(3)
memory usage: 9.9+ MB


We will split the data 60/20/20 into train, validation, and test sets.

- Train is used to fit models. Validation is used for model selection and feature selection decisions. Test is a true holdout for final reporting to stakeholders.

In [10]:
# Split the data into train, validation, and test sets
temp_data, test_data = train_test_split(model_prep_df4, test_size=0.2, random_state=42)
train_data, val_data = train_test_split(temp_data, test_size=0.25, random_state=42)
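
The two-step split above yields 60/20/20 because 25% of the remaining 80% equals 20% of the total. A minimal check on a hypothetical 100-row array (the array and variable names are illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(100)  # hypothetical 100-row dataset

# First hold out 20% for test, leaving 80%.
temp_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)
# Then take 25% of the remaining 80% (20% of the total) for validation.
train_rows, val_rows = train_test_split(temp_rows, test_size=0.25, random_state=42)

print(len(train_rows), len(val_rows), len(test_rows))
```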

In [11]:
ind_vars = ['affiliates','direct','display','organic_search','paid_search','referral','social','(other)']
dep_var = 'conversion'

# Separate the features and target variable
X_train = train_data[ind_vars]
y_train = train_data[dep_var]
X_val = val_data[ind_vars]
y_val = val_data[dep_var]
X_test = test_data[ind_vars]
y_test = test_data[dep_var]

In [12]:
# Define the parameter grid for grid search
param_grid = {
    'C': [0.1, 1, 10, 100]
}

lr_model = LogisticRegression()

grid_search = GridSearchCV(lr_model, param_grid, cv=5, scoring='neg_log_loss')
grid_search.fit(X_train, y_train)

lr_model = LogisticRegression(**grid_search.best_params_)
lr_model.fit(X_train, y_train)

In [14]:
train_pred = lr_model.predict_proba(X_train)[:, 1]
val_pred = lr_model.predict_proba(X_val)[:, 1]

train_auc = roc_auc_score(y_train, train_pred)
val_auc = roc_auc_score(y_val, val_pred)

print("Train AUC:", train_auc)
print("Validation AUC:", val_auc)

Train AUC: 0.8036488658056495
Validation AUC: 0.8066017391060295


In [15]:
fig = go.Figure()
fig.add_shape(
    type='line', line=dict(dash='dash'),
    x0=0, x1=1, y0=0, y1=1
)

for i, j in enumerate(['train','validate']):
    
    
    if j == 'train':
        fpr, tpr, _ = roc_curve(y_train, train_pred)
        auc_score = roc_auc_score(y_train, train_pred)
    elif j == 'validate':
        fpr, tpr, _ = roc_curve(y_val, val_pred)
        auc_score = roc_auc_score(y_val, val_pred)

    name = "{} AUC= {:.2f}".format(j.title(),auc_score)
    fig.add_trace(go.Scatter(x=fpr, y=tpr, name=name, mode='lines'))

fig.update_layout(
    title='AUC for Train and Validate'
    ,xaxis_title='False Positive Rate'
    ,yaxis_title='True Positive Rate'
    ,yaxis=dict(scaleanchor="x", scaleratio=1)
    ,xaxis=dict(constrain='domain')
    ,width=700
    ,height=500
)
fig.show()

In [16]:
# Get the coefficients (weights) for each variable
coefficients = lr_model.coef_

print("logistic coefficients in log-odds")
print("")
# Print the coefficients
for i, feature_name in enumerate(X_train.columns):

    print("{}: {:.2f}".format(feature_name,coefficients[0][i]))

logistic coefficients in log-odds

affiliates: -9.54
direct: 0.12
display: -0.07
organic_search: 0.16
paid_search: 0.55
referral: 0.84
social: -2.69
(other): -0.04


In [17]:
# Get the coefficients (weights) for each variable
coefficients = lr_model.coef_

print("logistic coefficients as odds ratios")
print("")
# Print each odds ratio and the relative change in odds per extra touchpoint
for i, feature_name in enumerate(X_train.columns):
    odds_ratio = np.exp(coefficients[0][i])
    odds_change = odds_ratio - 1
    print("{} | odds ratio: {:.2f}, change in odds: {:.2f}".format(feature_name,odds_ratio,odds_change))

logistic coefficients as odds ratios

affiliates | odds ratio: 0.00, change in odds: -1.00
direct | odds ratio: 1.13, change in odds: 0.13
display | odds ratio: 0.93, change in odds: -0.07
organic_search | odds ratio: 1.17, change in odds: 0.17
paid_search | odds ratio: 1.73, change in odds: 0.73
referral | odds ratio: 2.31, change in odds: 1.31
social | odds ratio: 0.07, change in odds: -0.93
(other) | odds ratio: 0.96, change in odds: -0.04


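To interpret this: holding the other touchpoint counts fixed, each additional touchpoint from the direct channel multiplies the odds of conversion by about 1.13, a 13% increase in the odds. This is a change in odds, not in probability; the change in probability depends on the journey's baseline probability.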
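
The gap between odds and probability can be checked numerically. The sketch below applies the `direct` coefficient (0.12, from the output above) to two hypothetical baseline conversion probabilities; the helper name `probability_after_touch` and the baselines are illustrative assumptions.

```python
import numpy as np

def probability_after_touch(base_probability, coefficient):
    """Probability after one more touchpoint, given a log-odds coefficient."""
    base_odds = base_probability / (1 - base_probability)
    new_odds = base_odds * np.exp(coefficient)  # odds scale by the odds ratio
    return new_odds / (1 + new_odds)

direct_coef = 0.12  # 'direct' coefficient printed above

for base in [0.02, 0.50]:  # hypothetical baseline conversion probabilities
    new = probability_after_touch(base, direct_coef)
    print("baseline {:.2f} -> {:.4f} (relative change {:.1%})".format(base, new, new / base - 1))
```

The same odds ratio produces a different relative change in probability at each baseline, which is why the coefficient should be read on the odds scale.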

## Deploying these weights into a business system/ attribution report dataset

So how can we turn the logistic regression output into a data-driven way to divvy up conversions appropriately?

To divvy up these credits we need to meet the following rules.
- If a touchpoint is present but there is no conversion, it gets zero credit.
- If a touchpoint is present and there is a conversion, it gets >= 0 credit for that conversion.
- If a touchpoint is not present, it gets no credit and no penalty either.

The softmax function transforms the original logistic coefficients into a set of values that sum up to 1, with each value representing the probability weight of the corresponding independent variable in contributing to the outcome variable. These probability weights can be interpreted as the relative importance of each independent variable in predicting the outcome variable.
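
For reference, softmax maps each value v_i to exp(v_i) / sum_j exp(v_j). A from-scratch sketch, using the rounded log-odds coefficients printed above as input (so the results only approximate those from the fitted model); the helper is illustrative and not the notebook's own code, which relies on `scipy.special.softmax`:

```python
import numpy as np

# Rounded log-odds coefficients from the output above.
channels = ['affiliates', 'direct', 'display', 'organic_search',
            'paid_search', 'referral', 'social', '(other)']
log_odds = np.array([-9.54, 0.12, -0.07, 0.16, 0.55, 0.84, -2.69, -0.04])

def softmax(values):
    """exp(v_i) / sum_j exp(v_j), shifted by the max for numerical stability."""
    shifted = np.exp(values - values.max())
    return shifted / shifted.sum()

weights = softmax(log_odds)
for channel, weight in zip(channels, weights):
    print("{}: {:.5f}".format(channel, weight))
print("sum:", weights.sum())
```

Because softmax exponentiates before normalizing, every channel receives a strictly positive weight, even those with negative coefficients such as affiliates and social.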

In [16]:
## example interpretation

coefficients = special.softmax(lr_model.coef_)
print("logistic coefficients in softmax")
print("")
# Print the coefficients
for i, feature_name in enumerate(X_train.columns):

    print("{}: {:.5f}".format(feature_name,coefficients[0][i]))

logistic coefficients in softmax

affiliates: 0.00001
direct: 0.13573
display: 0.11199
organic_search: 0.14151
paid_search: 0.20818
referral: 0.27844
social: 0.00819
(other): 0.11595


These probability weights can be interpreted as follows: referral has the highest weight, 0.278, which makes it the most important independent variable in predicting the outcome.

Assume one touchpoint from each channel was present, followed by a conversion. This set of numbers could represent how much credit each channel gets in that scenario.

Now what if we saw a journey with 2 paid search, 1 referral, and 1 affiliate touchpoint, then a conversion?

The conversion should be divvied up like this:

In [17]:
channel_impacts = 2 * (0.20818) + 1 * (0.27844) + 1 * (0.00001)

paid_search_channel_contribution = (2 * (0.20818)) / channel_impacts
paid_search_touchpoint_contribution = paid_search_channel_contribution / 2

referral_channel_contribution = (1 * (0.27844)) / channel_impacts
referral_touchpoint_contribution = referral_channel_contribution / 1

affiliate_channel_contribution = (1 * (0.00001)) / channel_impacts
affiliate_touchpoint_contribution = affiliate_channel_contribution / 1

print("paid search channel credit: {:.3f}".format(paid_search_channel_contribution))
print("paid search per touchpoint credit: {:.3f}".format(paid_search_touchpoint_contribution))
print("")
print("referral channel credit: {:.3f}".format(referral_channel_contribution))
print("referral touchpoint credit: {:.3f}".format(referral_touchpoint_contribution))
print("")
print("affiliate channel credit: {:.3f}".format(affiliate_channel_contribution))
print("affiliate touchpoint credit: {:.3f}".format(affiliate_touchpoint_contribution))

paid search channel credit: 0.599
paid search per touchpoint credit: 0.300

referral channel credit: 0.401
referral touchpoint credit: 0.401

affiliate channel credit: 0.000
affiliate touchpoint credit: 0.000


In summary, we can apply the softmax transformation to the logistic regression coefficients to get independent-variable impacts that each lie between 0 and 1 and sum to 1.

These weights are data driven and sensible, and can be deployed to calculate channel contribution and touchpoint contribution as in the example above.

Once the entire dataset is scored, we can slice and dice the results.
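
The worked example generalizes to any journey. Here is a minimal sketch of that rule as a function; the name `split_conversion_credit` is hypothetical, and it assumes the journey converted and contains at least one touchpoint.

```python
def split_conversion_credit(touch_counts, channel_weights):
    """Split one conversion across channels in proportion to count * weight.

    Returns {channel: (channel_credit, per_touchpoint_credit)} for the
    channels present in the journey; absent channels get no entry.
    """
    weighted = {ch: n * channel_weights[ch] for ch, n in touch_counts.items() if n > 0}
    total = sum(weighted.values())
    return {ch: (w / total, w / total / touch_counts[ch]) for ch, w in weighted.items()}

# Softmax weights printed earlier in the notebook.
softmax_weights = {
    'affiliates': 0.00001, 'direct': 0.13573, 'display': 0.11199,
    'organic_search': 0.14151, 'paid_search': 0.20818, 'referral': 0.27844,
    'social': 0.00819, '(other)': 0.11595,
}

journey = {'paid_search': 2, 'referral': 1, 'affiliates': 1}
credits = split_conversion_credit(journey, softmax_weights)
for channel, (channel_credit, touch_credit) in credits.items():
    print("{}: channel {:.3f}, per touchpoint {:.3f}".format(channel, channel_credit, touch_credit))
```

Channels absent from the journey never enter the calculation, which satisfies the "no credit and no penalty" rule.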

## Deploy the weights

Here we will deploy the weights (score the data). 

Next we will calculate tactic total contribution. 

Then we will calculate tactic contribution per activity.

Then we can track each tactic's contribution per activity over time.

In [18]:
## keep converting journeys only; .copy() avoids pandas' SettingWithCopyWarning
scored_data_df1 = model_data_final.loc[model_data_final['conversion']==1,:].copy()

score_dict = {
    'affiliates':0.00001
    ,'direct':0.13573
    ,'display': 0.11199
    ,'organic_search': 0.14151
    ,'paid_search': 0.20818
    ,'referral': 0.27844
    ,'social': 0.00819
    ,'(other)': 0.11595
}

## weight each channel's touchpoint count by its softmax weight
for i in score_dict.keys():
    scored_data_df1.loc[:,'{}_weights'.format(i)] = scored_data_df1.loc[:,'{}'.format(i)]*score_dict[i]

cols = [col for col in scored_data_df1.columns if 'weights' in col]
scored_data_df1['weights_total'] = scored_data_df1[cols].sum(axis=1)

## each channel's share of the conversion
for i in score_dict.keys():
    scored_data_df1.loc[:,'{}_impact'.format(i)] = scored_data_df1.loc[:,'{}_weights'.format(i)]/scored_data_df1['weights_total']
scored_data_df1

Unnamed: 0,fullVisitorId,sequence_id,sequence_details,affiliates,direct,display,organic_search,paid_search,referral,social,...,(other)_weights,weights_total,affiliates_impact,direct_impact,display_impact,organic_search_impact,paid_search_impact,referral_impact,social_impact,(other)_impact
299,7547767069516152606,0AioIlToiDilMZ6,referral>referral>conversion,0,0,0,0,0,2,0,...,0.0,0.55688,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
304,9583683554162057560,0AzVoxvZ7fJnkKY,organic_search>organic_search>conversion,0,0,0,2,0,0,0,...,0.0,0.28302,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
354,165267667365212716,0D6KWXV0v6IVYC3,referral>conversion,0,0,0,0,0,1,0,...,0.0,0.27844,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
417,119870259714905967,0FC7pTmjoyhBMPK,direct>direct>direct>conversion,0,3,0,0,0,0,0,...,0.0,0.40719,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
579,072965257504694282,0LXKW2ke11vif1T,referral>referral>referral>referral>referral>c...,0,0,0,0,0,5,0,...,0.0,1.39220,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99225,6521312251881307858,zfPRf8jln9r8K4Y,referral>referral>conversion,0,0,0,0,0,2,0,...,0.0,0.55688,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
99271,5613647306583973831,zhU0Zu6QXYyTfdY,referral>conversion,0,0,0,0,0,1,0,...,0.0,0.27844,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
99556,8906575148569160903,ztXs02S3xfqnn7u,referral>referral>conversion,0,0,0,0,0,2,0,...,0.0,0.55688,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
99674,8722485686890283296,zytaQynOIi5H98F,referral>conversion,0,0,0,0,0,1,0,...,0.0,0.27844,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [19]:
scored_data_df2 = scored_data_df1[['sequence_id','affiliates_impact','direct_impact'
                ,'display_impact','organic_search_impact'
                ,'paid_search_impact','referral_impact'
                ,'social_impact','(other)_impact']]

scored_data_df_final = model_data_final.merge(scored_data_df2, on ='sequence_id', how='left')

scored_data_df_final.fillna(0,inplace=True)
scored_data_df_final

Unnamed: 0,fullVisitorId,sequence_id,sequence_details,affiliates,direct,display,organic_search,paid_search,referral,social,(other),conversion,affiliates_impact,direct_impact,display_impact,organic_search_impact,paid_search_impact,referral_impact,social_impact,(other)_impact
0,7343617347507729080,0099Rqojoj1MCXN,organic_search>dead_end,0,0,0,1,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,89656057821147903,00A9Lkka73okUx2,organic_search>dead_end,0,0,0,1,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4307745811624101170,00B30tmbMwJn7Cf,organic_search>dead_end,0,0,0,1,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,7129167701457127936,00BKxKnEYlKbw9b,organic_search>dead_end,0,0,0,1,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3217678225016118393,00EttOfsTTyp45B,referral>dead_end,0,0,0,0,0,1,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99713,2993808115150274357,zztHCwxAXsUYmVF,referral>dead_end,0,0,0,0,0,1,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
99714,8219089981045079603,zzvFftLlVUENeNU,organic_search>dead_end,0,0,0,1,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
99715,546466813369261354,zzvh8qX8dzkWb2X,direct>dead_end,0,1,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
99716,6288261604719925213,zzxahVA1FamPayn,organic_search>dead_end,0,0,0,1,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now we have a dataframe, `scored_data_df_final`, that holds each channel's activity count and, where there is a conversion, divvies it up among the channels using machine-learning-derived weights.

We can now use this output to build presentations and dashboards for stakeholders. Below are a few example insights.

#### Contribution (impacts) by channel

In [28]:
## take impact columns and pivot from wide to long
impact_cols = [col for col in scored_data_df_final.columns if '_impact' in col]

impact_df = pd.melt(scored_data_df_final, id_vars=['sequence_id'], value_vars=impact_cols, var_name='Channel', value_name='Impacts')
impact_df = impact_df[impact_df['Impacts']>0]
impact_df['Channel'] = impact_df['Channel'].str.replace('_impact','')
impact_df_agg = impact_df.groupby(['Channel'],as_index=False).agg(
                Impacts=('Impacts','sum')
)


## make visualization
bar_fig = px.bar(impact_df_agg
                 ,x='Impacts'
                 ,y='Channel'
                 ,title='Contribution by Channel'
                 ,color='Channel'
                 )

bar_fig.update_layout(width=950
                      ,height=500
                      ,yaxis={'categoryorder': 'total ascending'}
                      ,plot_bgcolor='#f2f2f2') 

bar_fig.show()

#### Contribution per touchpoint by channel

In [33]:
## total touchpoints per channel (affiliates and (other) are excluded)
activity_cols = ['direct','display','organic_search','paid_search','referral','social']

activity_df = pd.melt(scored_data_df_final, id_vars=['sequence_id'], value_vars=activity_cols, var_name='Channel', value_name='Touchpoints')
activity_df = activity_df[activity_df['Touchpoints']>0]
activity_df_agg = activity_df.groupby(['Channel'],as_index=False).agg(
                Touchpoints=('Touchpoints','sum')
)

## join touchpoints with impacts and compute contribution per touchpoint

agg_df = activity_df_agg.merge(impact_df_agg, on = 'Channel', how ='left')
agg_df['Contribution_per_Touchpoint'] = agg_df['Impacts'] / agg_df['Touchpoints']


## make visualization
bar_fig = px.bar(agg_df
                 ,x='Contribution_per_Touchpoint'
                 ,y='Channel'
                 ,title='Contribution per Touchpoint by Channel'
                 ,color='Channel'
                 )

bar_fig.update_layout(width=950
                      ,height=500
                      ,yaxis={'categoryorder': 'total ascending'}
                      ,plot_bgcolor='#f2f2f2') 

bar_fig.show()

For paid channels (e.g. paid search and display), it could make sense to calculate return on investment.

For example, what if:
- Paid Search costs $2 per touchpoint
- Display costs 25 cents ($0.25) per touchpoint
- Revenue per conversion is $250

Is Paid Search still more efficient, and a better investment, than Display?

In [118]:
display_cost = agg_df.loc[agg_df['Channel']=='display','Touchpoints']*.25
display_revenue = agg_df.loc[agg_df['Channel']=='display','Impacts']*250
paid_search_cost = agg_df.loc[agg_df['Channel']=='paid_search','Touchpoints']*2
paid_search_revenue = agg_df.loc[agg_df['Channel']=='paid_search','Impacts']*250

display_roi = display_revenue / display_cost
paid_search_roi = paid_search_revenue / paid_search_cost

In [119]:
print("For every one dollar I spend in Display I make back ${:.2f}".format(display_roi.iloc[0]))

For every one dollar I spend in Display I make back $3.33


In [120]:
print("For every one dollar I spend in Paid Search I make back ${:.2f}".format(paid_search_roi.iloc[0]))

For every one dollar I spend in Paid Search I make back $2.73


Another insight that might be relevant is effectiveness over time. For example, as the paid search team tests new keywords or campaigns, monitoring attribution effectiveness could be useful for them.

In [150]:
## Data prep

# get each sequence id's start year-month
sequence_df['event_datetime'] = pd.to_datetime(sequence_df['event_datetime'])

sequence_start_df = sequence_df.groupby('sequence_id')['event_datetime'].min().reset_index()

sequence_start_df['year-month'] = sequence_start_df['event_datetime'].dt.strftime('%Y-%m')

sequence_start_df = sequence_start_df[['sequence_id','year-month']]
sequence_start_df.columns = ['sequence_id','sequence_start_month']

# Activities by tactic + month
sequence_df2 = sequence_df.merge(sequence_start_df, on ='sequence_id', how='left')

sequence_df3 = sequence_df2.loc[~sequence_df2['event_name'].isin(['dead_end','conversion']),:]
                            
touchpoints_month_agg_df = sequence_df3.groupby(['sequence_start_month','event_name'], as_index=False).agg(
                touchpoint_count=('sequence_id','count')

)

touchpoints_month_agg_df.sort_values(by=['sequence_start_month','event_name'], inplace=True)

# Conversions by channel + month

conversions_month_df1 = scored_data_df_final.merge(sequence_start_df, on='sequence_id',how='left')

impact_cols = [col for col in conversions_month_df1.columns if '_impact' in col]

conversions_month_df2 = pd.melt(conversions_month_df1, id_vars=['sequence_id','sequence_start_month'], value_vars=impact_cols, var_name='event_name', value_name='Impacts')
conversions_month_df2 = conversions_month_df2[conversions_month_df2['Impacts']>0]

conversions_month_df2['event_name'] = conversions_month_df2['event_name'].str.replace('_impact','')
conversions_month_agg_df = conversions_month_df2.groupby(['sequence_start_month','event_name'],as_index=False).agg(
                Impacts=('Impacts','sum')
)

conversions_month_agg_df.sort_values(by=['sequence_start_month','event_name'], inplace=True)


# calculate effectiveness per touchpoint over time
month_agg_df = touchpoints_month_agg_df.merge(conversions_month_agg_df, on=['sequence_start_month','event_name'], how='left')
month_agg_df = month_agg_df.fillna(0)
month_agg_df['impact_per_touchpoint'] = month_agg_df['Impacts']/month_agg_df['touchpoint_count']
month_agg_df.head()

Unnamed: 0,sequence_start_month,event_name,touchpoint_count,Impacts,impact_per_touchpoint
0,2016-08,affiliates,177,0.000000,0.000000
1,2016-08,direct,997,16.904967,0.016956
2,2016-08,display,49,1.167432,0.023825
3,2016-08,organic_search,1919,29.281046,0.015258
4,2016-08,paid_search,176,2.193265,0.012462
...,...,...,...,...,...
150,2018-05,direct,2,0.000000,0.000000
151,2018-05,organic_search,16,0.000000,0.000000
152,2018-05,paid_search,1,0.000000,0.000000
153,2018-05,referral,5,0.000000,0.000000


In [153]:
## make visualization
bar_fig = px.line(month_agg_df
                 ,x='sequence_start_month'
                 ,y='impact_per_touchpoint'
                 ,title='Channel Effectiveness Over Time'
                 ,color='event_name'
                 )

bar_fig.update_layout(width=950
                      ,height=500
                      ,yaxis={'categoryorder': 'total ascending'}
                      ,plot_bgcolor='#f2f2f2') 

bar_fig.show()

Now let's talk about some limitations of this approach in its current state.

#### Limitation 1: Static Weights

We get only one weight/coefficient per channel touchpoint, and it does not change with the presence of other touchpoints. Interaction effects are therefore limited to the cannibalization inherent in the attribution game, where channels in the same journey compete for a share of a single conversion. Other approaches may let these base weights change given control variables and synergistic relationships across the channels.

Weights should/could change based on:
1) synergistic effects (presence or absence of other channels)

2) special patterns or sequences

3) control variables (geography, customer demographics, etc.)

4) proximity, or how close the marketing touchpoint occurred to the conversion

Logistic regression oversimplifies by giving you one set of weights to use regardless of these other potential factors.

For example, if we had a model that could tell us which geographies, device types, countries, and dayparts perform best, the Paid Search and Display teams could increase spend in well-performing segments and decrease spend, or turn the channels off entirely, in poorly performing ones.

A model that is aware of synergistic effects, sequences, and proximity can help tell us the next best action or next best channel to get our customers to convert.
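
One simple way to let a logistic model see synergy between channels is to add pairwise interaction features before fitting. This is only a sketch on toy counts; the `a_x_b` column naming and the data are made up for illustration, and it does not address sequences or proximity.

```python
from itertools import combinations
import pandas as pd

# Toy touchpoint counts per journey (hypothetical values).
toy_X = pd.DataFrame({
    'direct': [1, 0, 2],
    'paid_search': [1, 1, 0],
    'referral': [0, 2, 1],
})

# Add one product column per channel pair; a non-zero value means both
# channels appeared in the same journey.
X_inter = toy_X.copy()
for a, b in combinations(toy_X.columns, 2):
    X_inter['{}_x_{}'.format(a, b)] = toy_X[a] * toy_X[b]

print(X_inter)
```

With 8 channels this adds 28 pair columns, so regularization strength (the `C` grid above) becomes more important.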

#### Limitation 2: No Base Contribution or Concept of Incrementality

Another thing to consider is that the current approach gives marketing "full credit". In reality, marketing might be responsible for only a portion of sales. This is called incrementality: you may see 100 conversions, but marketing contributed only 20% of them, or 20 conversions.

How do we include a base contribution? It can be incorporated in several ways:

1) Include control variables that receive some of the credit

2) In some scenarios, you can look at conversions where no marketing is present, and call that conversion rate or model probability the "base"

3) You can run an incrementality experiment and try to incorporate its insights into the analysis
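
Option 2 can be sketched with a fitted logistic model: with all touchpoint counts at zero, the predicted probability is the sigmoid of the intercept, which could serve as a rough "base". Treat this as an assumption-laden illustration. The intercept below is hypothetical (the notebook never prints `lr_model.intercept_`), and if every journey in the data has at least one touchpoint, the zero-touchpoint prediction is an extrapolation.

```python
import numpy as np

def sigmoid(z):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

intercept = -4.0      # hypothetical; in the notebook this would be lr_model.intercept_[0]
referral_coef = 0.84  # referral coefficient printed earlier

p_base = sigmoid(intercept)                         # no marketing touchpoints
p_journey = sigmoid(intercept + 2 * referral_coef)  # journey with two referral touchpoints

# Share of the predicted conversion probability above the base level.
incremental_share = (p_journey - p_base) / p_journey
print("base: {:.4f}, journey: {:.4f}, incremental share: {:.2f}".format(
    p_base, p_journey, incremental_share))
```

Under this reading, only the incremental share of a conversion would be divided among channels, with the remainder assigned to "base".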

## What Next?

**Try a similar approach with better algorithms (does not address the omission of sequences)**
- Try xgboost

**Introduce Control Variables**
- Introduce daypart, devicetype, and geo

**Build initiator closer model**
- first touch channel, last touch channel, 8 channels
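
A minimal sketch of deriving initiator and closer features from the `sequence_details` strings built earlier. The function name is hypothetical, and it assumes each journey has at least one marketing touchpoint before its `conversion` or `dead_end` event.

```python
def initiator_closer(sequence_details):
    """Return the first and last marketing touchpoint of a journey string."""
    touches = [event for event in sequence_details.split('>')
               if event not in ('conversion', 'dead_end')]
    return {'first_touch': touches[0], 'last_touch': touches[-1]}

print(initiator_closer('organic_search>referral>paid_search>conversion'))
print(initiator_closer('referral>dead_end'))
```

The two categorical outputs could then be one-hot encoded and appended to the 8 channel count columns.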

**Build position based only (non-sequence) solutions**
- number of features = last N positions * 8 channels, + null
- eg affiliate|1, affiliate|2, affiliate|3 ... etc
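
The `channel|position` features could be generated from the `sequence_details` strings along these lines. This is a sketch: the function name, the choice to count positions backwards from the outcome, and `last_n=3` are all assumptions made for illustration.

```python
def position_features(sequence_details, last_n=3):
    """Map 'a>b>conversion' to {'channel|position': 1}.

    Position 1 is the touchpoint closest to the outcome event.
    """
    touches = [event for event in sequence_details.split('>')
               if event not in ('conversion', 'dead_end')]
    features = {}
    for position, channel in enumerate(reversed(touches[-last_n:]), start=1):
        features['{}|{}'.format(channel, position)] = 1
    return features

print(position_features('affiliates>referral>paid_search>paid_search>conversion'))
```

Applying this per journey and expanding the dictionaries into columns (filling absent keys with 0) gives the position-based design matrix.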

**Build small-sequence solutions with xgboost**
- number of features = permutations of 1 touchpoint present (8 channels) + permutations of 2 tactics + permutations of 3 tactics

**Build large-scale sequence solutions with embedding, dimensionality reduction, and sequence mining techniques**
- Figure out how to embed, or train machine learning algorithms on, the highly complex and sparse sequence data

**Go straight for a fully sequential LSTM**
- you will need to research what type of input data is needed
- you will need to go back to the source data and grab more observations as well (feel free to use earlier notebooks so you don't have to start from scratch)