# PART III Exploratory Data Analysis on Google Merchandise Store Sequence Data

## <font color="#c70404"> ANSWERS </font>

This part of the EDA is intended to help us understand the different sequences of the data and their conversion rates.

The data prep that creates the sequence dataframe is provided. Answer the remaining blank questions, plus any additional questions you see fit.

The business has decided that you should only consider touchpoints that occur within 45 days of a dead end or conversion. The data prep steps include this filter, so note that counts may differ from those in prior EDAs.
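A minimal sketch of the 45-day window on toy data. It assumes `conversion_proximity` is stored in seconds, which the division by 86400 in the data prep step suggests; the toy rows and the names `toy_events` and `within_window` are hypothetical.

```python
import pandas as pd

SECONDS_PER_DAY = 86400

# Hypothetical toy events; conversion_proximity is assumed to be in seconds.
toy_events = pd.DataFrame({
    "sequence_id": ["a", "a", "b"],
    "event_name": ["direct", "conversion", "social"],
    "conversion_proximity": [50 * SECONDS_PER_DAY, 0, 45 * SECONDS_PER_DAY],
})

# Convert seconds to days and keep events at most 45 days from the outcome.
proximity_days = toy_events["conversion_proximity"] / SECONDS_PER_DAY
within_window = toy_events.loc[proximity_days <= 45]
print(within_window)
```

The touchpoint 50 days out is dropped, while the one at exactly 45 days is kept because the comparison is inclusive.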

In [11]:
import pandas as pd
import numpy as np

## Visualization packages
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt

## make pandas show the full strings
pd.set_option('display.max_colwidth', None)

## Purdue colors
purdue_colors = ['#CEB888', '#000000','#9D968D','#373A36','#C28E0E']

In [4]:
## Import Dataset

sequence_df = pd.read_csv('../datasets/sequence_fact.csv')

#### Data Prep

In [5]:
sequence_to_visitor_map = sequence_df[['sequence_id','fullVisitorId']].drop_duplicates().reset_index(drop=True)

## filter conversion_proximity to within 45 days
sequence_prep_df1 = sequence_df.loc[(sequence_df['conversion_proximity']/86400)<=45,:]

## make the sequence details
sequence_prep_df2 = sequence_prep_df1.groupby('sequence_id')['event_name'].agg(lambda x: '>'.join(x)).reset_index()
sequence_prep_df2.columns = ['sequence_id','sequence_details_full']

## make the modeling features
sequence_prep_df3 = sequence_prep_df1.pivot_table(index='sequence_id', columns='event_name', aggfunc='size', fill_value=0).reset_index()
sequence_prep_df3 = sequence_prep_df3.rename_axis(None, axis=1)

## Final joining and prep
sequence_prep_df4 = sequence_prep_df2.merge(sequence_prep_df3, on='sequence_id',how='left')

## Add visitor id back in
sequence_prep_df4 = sequence_prep_df4.merge(sequence_to_visitor_map, on='sequence_id',how='left')

## Make a column where the sequence only contains touchpoints
sequence_prep_df4['sequence_details_touchpoints'] =  sequence_prep_df4['sequence_details_full'].str.replace('>dead_end','').str.replace('>conversion','')

## drop the dead_end column and move the target (conversion) to the last position
sequence_data_final =  sequence_prep_df4[['fullVisitorId','sequence_id','sequence_details_full','sequence_details_touchpoints'
                                         ,'affiliates','direct'
                                         ,'display','organic_search'
                                         ,'paid_search','referral','social'
                                         ,'(other)','conversion']]
sequence_data_final.head()

| | fullVisitorId | sequence_id | sequence_details_full | sequence_details_touchpoints | affiliates | direct | display | organic_search | paid_search | referral | social | (other) | conversion |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7343617347507729080 | 0099Rqojoj1MCXN | organic_search>dead_end | organic_search | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 89656057821147903 | 00A9Lkka73okUx2 | organic_search>dead_end | organic_search | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 4307745811624101170 | 00B30tmbMwJn7Cf | organic_search>dead_end | organic_search | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 7129167701457127936 | 00BKxKnEYlKbw9b | organic_search>dead_end | organic_search | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 3217678225016118393 | 00EttOfsTTyp45B | referral>dead_end | referral | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |


NOTE: Use the **sequence_details_touchpoints** column for the EDA below.
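To illustrate what the touchpoints column contains, here is a small sketch on hypothetical toy strings. Instead of the chained literal replacements used in the data prep, it uses a single regular expression anchored at the end of the string. That pattern is an alternative chosen for this example, not the prep step itself.

```python
import pandas as pd

# Hypothetical full sequences, each ending in a terminal marker.
toy = pd.Series([
    "organic_search>dead_end",
    "referral>referral>conversion",
    "dead_end",
])

# Remove a trailing ">dead_end" or ">conversion"; the "$" anchors the match
# to the end of the string so only the terminal marker is stripped.
touchpoints = toy.str.replace(r">(?:dead_end|conversion)$", "", regex=True)
print(touchpoints.tolist())
```

A sequence whose only remaining event is the terminal marker (the third toy row) is left unchanged, because there is no leading `>` to match. If such rows exist after the 45-day filter, they may need separate handling.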

In [20]:
## This will make a table to answer questions about sequence details

results_df = sequence_data_final.groupby(['sequence_details_touchpoints'], as_index=False).agg(
                sequence_count=('sequence_id','count')
                ,conversions_count=('conversion','sum')
)
results_df['sequence_length'] = 1+results_df['sequence_details_touchpoints'].str.count('>')
results_df['touch_count'] = results_df['sequence_count']*results_df['sequence_length']
results_df = results_df.sort_values(by=['sequence_count'],ascending=[False]).reset_index(drop=True)
results_df

| | sequence_details_touchpoints | sequence_count | conversions_count | sequence_length | touch_count |
| --- | --- | --- | --- | --- | --- |
| 0 | organic_search | 39478 | 253 | 1 | 39478 |
| 1 | social | 22787 | 10 | 1 | 22787 |
| 2 | direct | 13112 | 134 | 1 | 13112 |
| 3 | referral | 7810 | 303 | 1 | 7810 |
| 4 | organic_search>organic_search | 2854 | 82 | 2 | 5708 |
| ... | ... | ... | ... | ... | ... |
| 466 | organic_search>paid_search>display | 1 | 0 | 3 | 3 |
| 467 | direct>direct>direct>referral>referral>referral>referral>referral>referral | 1 | 0 | 9 | 9 |
| 468 | organic_search>paid_search>organic_search>display | 1 | 0 | 4 | 4 |
| 469 | organic_search>paid_search>organic_search>organic_search | 1 | 0 | 4 | 4 |


**What is the total number of unique sequences?**

In [19]:
results_df['sequence_details_touchpoints'].nunique()

471

**What is sequence count by sequence length?**

In [58]:
viz_df = results_df.groupby('sequence_length')['sequence_count'].sum().reset_index()
## plot the counts with a visualization package (px.scatter here; px.bar would give a bar chart)

viz_df.sort_values(by=['sequence_length'],inplace=True)

#viz_df['sequence_length'] = viz_df['sequence_length'].astype(str)

bar_fig = px.scatter(viz_df
                 ,x='sequence_length'
                 ,y='sequence_count'
                 ,title='Sequence Frequency by Sequence Touchpoint Count'
                 )

bar_fig.update_layout(width=950
                      ,height=500
                      ,yaxis={'categoryorder': 'total ascending'}
                      ,plot_bgcolor='#f2f2f2'
                      ,xaxis_range=[0, 70]) 

bar_fig.show()

**What are the top 10 sequences that appear the most in general?**

In [42]:
results_df.sort_values(by=['sequence_count'], ascending=[False],inplace=True)
results_df[['sequence_details_touchpoints','sequence_count']].head(10)

| | sequence_details_touchpoints | sequence_count |
| --- | --- | --- |
| 0 | organic_search | 39478 |
| 1 | social | 22787 |
| 2 | direct | 13112 |
| 3 | referral | 7810 |
| 4 | organic_search>organic_search | 2854 |
| 5 | display | 2077 |
| 6 | paid_search | 1694 |
| 7 | affiliates | 1594 |
| 8 | direct>direct | 1191 |
| 9 | referral>referral | 1177 |


**Among sequences that end in a conversion, what are the top 10?**

In [43]:
results_df.sort_values(by=['conversions_count'], ascending=[False],inplace=True)
results_df[['sequence_details_touchpoints','conversions_count']].head(10)

| | sequence_details_touchpoints | conversions_count |
| --- | --- | --- |
| 3 | referral | 303 |
| 0 | organic_search | 253 |
| 9 | referral>referral | 143 |
| 2 | direct | 134 |
| 4 | organic_search>organic_search | 82 |
| 12 | referral>referral>referral | 68 |
| 8 | direct>direct | 42 |
| 6 | paid_search | 41 |
| 11 | organic_search>organic_search>organic_search | 26 |
| 22 | referral>referral>referral>referral | 24 |
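Raw conversion counts favor high-volume sequences. As a quick sketch, the counts for the five most frequent sequences (copied from the `results_df` output earlier in this notebook) can be turned into conversion rates to see how the ranking shifts. The names `top5` and `ranked` are introduced only for this illustration.

```python
import pandas as pd

# Counts copied from the results_df output shown earlier (five most frequent sequences).
top5 = pd.DataFrame({
    "sequence_details_touchpoints": [
        "organic_search", "social", "direct", "referral",
        "organic_search>organic_search",
    ],
    "sequence_count": [39478, 22787, 13112, 7810, 2854],
    "conversions_count": [253, 10, 134, 303, 82],
})

# Conversion rate = conversions divided by the number of sequences.
top5["conversion_rate"] = top5["conversions_count"] / top5["sequence_count"]
ranked = top5.sort_values("conversion_rate", ascending=False).reset_index(drop=True)
print(ranked[["sequence_details_touchpoints", "conversion_rate"]].round(4))
```

On these five rows, `referral` has the highest rate (about 3.9%) and `social` the lowest (about 0.04%), even though `social` is the second most frequent sequence.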


**What is conversion rate by sequence length?**

Consider binning sequence lengths into 1, 2, 3, and 4+ touchpoints.
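A small sketch of how `pd.cut` assigns these bins, using hypothetical toy lengths. With `right=False` each interval is closed on the left, so a length of exactly 4 lands in the `4+` bin.

```python
import pandas as pd

# Hypothetical sequence lengths to bin.
lengths = pd.Series([1, 2, 3, 4, 9])

bins = [1, 2, 3, 4, float("inf")]
labels = ["1", "2", "3", "4+"]

# right=False gives the intervals [1, 2), [2, 3), [3, 4), [4, inf).
binned = pd.cut(lengths, bins=bins, labels=labels, right=False).astype(str)
print(binned.tolist())
```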

In [56]:
results_df.sort_values(by=['sequence_count'], ascending=[False],inplace=True)
bins = [ 1, 2, 3, 4, float('inf')]
labels = ['1', '2', '3', '4+']

results_df['sequence_length_bins'] =  pd.cut(results_df['sequence_length'] , bins=bins, labels=labels, right=False).astype(str)

viz_df = results_df.groupby(['sequence_length_bins'], as_index=False).agg(
                    sequence_count=('sequence_count','sum')
                ,conversions_count=('conversions_count','sum')
)
viz_df['conversion_rate'] = viz_df['conversions_count'] / viz_df['sequence_count']

bar_fig = px.bar(viz_df
                 ,x='conversion_rate'
                 ,y='sequence_length_bins'
                 ,title='Conversion Rates by Sequence Length'
                 )

bar_fig.update_layout(width=950
                      ,height=500
                      ,yaxis={'categoryorder': 'total ascending'}
                      ,plot_bgcolor='#f2f2f2') 

bar_fig.show()

**What is conversion rate by number of distinct channels present?**

In [66]:
## add distinct channels count
results_df['distinct_channel_count'] = (np.where(results_df['sequence_details_touchpoints'].str.contains('paid_search'),1,0)
                                    +np.where(results_df['sequence_details_touchpoints'].str.contains('direct'),1,0)
                                    +np.where(results_df['sequence_details_touchpoints'].str.contains('organic_search'),1,0)
                                    +np.where(results_df['sequence_details_touchpoints'].str.contains('referral'),1,0)
                                    +np.where(results_df['sequence_details_touchpoints'].str.contains('social'),1,0)
                                    +np.where(results_df['sequence_details_touchpoints'].str.contains('affiliates'),1,0)
                                    +np.where(results_df['sequence_details_touchpoints'].str.contains('(other)', regex=False),1,0)
                                    +np.where(results_df['sequence_details_touchpoints'].str.contains('display'),1,0))

viz_df = results_df.groupby(['distinct_channel_count'], as_index=False).agg(
                    sequence_count=('sequence_count','sum')
                ,conversions_count=('conversions_count','sum')
)
viz_df['conversion_rate'] = viz_df['conversions_count'] / viz_df['sequence_count']

viz_df['distinct_channel_count'] = viz_df['distinct_channel_count'].astype(str)

bar_fig = px.bar(viz_df
                 ,x='conversion_rate'
                 ,y='distinct_channel_count'
                 ,title='Conversion Rates by Distinct Channel Count'
)

bar_fig.update_layout(width=950
                      ,height=500
                      ,yaxis={'categoryorder': 'total ascending'}
                      ,plot_bgcolor='#f2f2f2') 

bar_fig.show()
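The substring checks above count a channel once no matter how often it appears. An equivalent sketch, shown on hypothetical toy sequences, splits each sequence on `>` and counts the unique pieces. It assumes no channel name contains `>`, and it sidesteps regular-expression handling of names such as `(other)`.

```python
import pandas as pd

# Hypothetical touchpoint sequences.
toy = pd.Series([
    "direct>direct>referral",
    "(other)",
    "organic_search>paid_search>display",
])

# Split on the separator, then count the unique channel names in each sequence.
distinct = toy.str.split(">").apply(lambda parts: len(set(parts)))
print(distinct.tolist())
```

This approach does not require listing the channels by hand, so a new channel name in the data would be counted automatically.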


**Anything else we should consider? Can we bring in control variables for further analysis?**