# PART II Exploratory Data Analysis on Google Merchandise Store Sequence Data

## <font color="#c70404"> ANSWERS </font>

This part of the EDA examines the potential control variables in the dataset. We may want to introduce these variables into our modeling, but first we need to understand how counts and conversion rates vary across their levels.

We start this notebook by joining the data back together and building the first data visualization. A multivariate visualization example follows later.

Answer the remaining blank questions, along with any additional questions you see fit.

In [1]:
import pandas as pd
import numpy as np

## Visualization packages
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt

## Purdue colors
purdue_colors = ['#CEB888', '#000000','#9D968D','#373A36','#C28E0E']

In [2]:
## Import Dataset, this time we need to pull in the visitor id table as well

sequence_df = pd.read_csv('../datasets/sequence_fact.csv')
visitor_detail_df = pd.read_csv('../datasets/visitor_detail.csv')

Add the control variables that some of the questions below ask about.

In [3]:
visitor_detail_small_df = visitor_detail_df[['fullVisitorId','device_deviceCategory','device_browser','geoNetwork_continent']]

In [4]:
full_df = sequence_df.merge(visitor_detail_small_df, on='fullVisitorId',how='left')
full_df

Unnamed: 0,sequence_id,fullVisitorId,event_name,event_datetime,conversion_proximity,device_deviceCategory,device_browser,geoNetwork_continent
0,0099Rqojoj1MCXN,7343617347507729080,organic_search,2018-04-15 17:31:50,75.0,desktop,Chrome,Asia
1,0099Rqojoj1MCXN,7343617347507729080,dead_end,2018-04-15 17:33:05,0.0,desktop,Chrome,Asia
2,00A9Lkka73okUx2,89656057821147903,organic_search,2017-09-14 16:36:56,1033.0,mobile,Chrome,Asia
3,00A9Lkka73okUx2,89656057821147903,dead_end,2017-09-14 16:54:09,0.0,mobile,Chrome,Asia
4,00B30tmbMwJn7Cf,4307745811624101170,organic_search,2017-04-21 02:41:23,1.0,tablet,Safari,Americas
...,...,...,...,...,...,...,...,...
220996,zzvh8qX8dzkWb2X,546466813369261354,dead_end,2017-02-17 12:27:46,0.0,desktop,Firefox,Asia
220997,zzxahVA1FamPayn,6288261604719925213,organic_search,2017-08-22 18:56:15,24.0,desktop,Chrome,Americas
220998,zzxahVA1FamPayn,6288261604719925213,dead_end,2017-08-22 18:56:39,0.0,desktop,Chrome,Americas
220999,zzyM5alBCxAkdwq,7918896908390800801,social,2018-02-24 15:01:36,1.0,desktop,Chrome,Americas
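
A left join like the one above silently duplicates sequence rows if the visitor table ever holds more than one row per `fullVisitorId`. The following is a minimal sketch of a guard against that, using made-up toy frames (`sequence_toy`, `visitor_toy`) rather than the real CSVs; it assumes `visitor_detail.csv` is meant to have one row per visitor, which the notebook does not state.

```python
import pandas as pd

# Toy stand-ins for sequence_fact and visitor_detail (values are made up).
sequence_toy = pd.DataFrame({
    "sequence_id": ["a", "a", "b"],
    "fullVisitorId": [1, 1, 2],
    "event_name": ["organic_search", "dead_end", "social"],
})
visitor_toy = pd.DataFrame({
    "fullVisitorId": [1, 2],
    "device_deviceCategory": ["desktop", "mobile"],
})

# validate="many_to_one" raises pandas.errors.MergeError if a visitor id
# appears more than once in the right-hand table.
merged_toy = sequence_toy.merge(
    visitor_toy, on="fullVisitorId", how="left", validate="many_to_one"
)

# A left join should keep every sequence row exactly once.
rows_preserved = len(merged_toy) == len(sequence_toy)

# Rows whose visitor id found no match carry NaN in the joined columns.
unmatched_rows = int(merged_toy["device_deviceCategory"].isna().sum())

print(rows_preserved, unmatched_rows)
```

On the real data the same two checks would be run against `full_df` and `sequence_df`.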


In [10]:
## This will make a table to answer questions about device category

## touchpoint counts
viz_df = full_df[~full_df['event_name'].isin(['conversion','dead_end'])]['device_deviceCategory'].value_counts().reset_index()
viz_df.columns = ['device_deviceCategory','touch_count']

# we already have touch count from above just copy into new dataframe
results_df = viz_df.copy(deep=True)

# add sequence count
temp_df = full_df[~full_df['event_name'].isin(['conversion','dead_end'])].groupby('device_deviceCategory')['sequence_id'].nunique().reset_index()
temp_df.columns = ['device_deviceCategory','sequence_count']
results_df = results_df.merge(temp_df, on='device_deviceCategory', how = 'left')

# add dead end sequence count
dead_end_sequence_ids = full_df[full_df['event_name'] == 'dead_end']['sequence_id'].unique()
temp_filtered_df = full_df[full_df['sequence_id'].isin(dead_end_sequence_ids)]
temp_df = temp_filtered_df[~temp_filtered_df['event_name'].isin(['conversion','dead_end'])].groupby('device_deviceCategory')['sequence_id'].nunique().reset_index()
temp_df.columns = ['device_deviceCategory','dead_sequence_count']
results_df = results_df.merge(temp_df, on='device_deviceCategory', how = 'left')

# add conversion sequence count
conversion_sequence_ids = full_df[full_df['event_name'] == 'conversion']['sequence_id'].unique()
temp_filtered_df = full_df[full_df['sequence_id'].isin(conversion_sequence_ids)]
temp_df = temp_filtered_df[~temp_filtered_df['event_name'].isin(['conversion','dead_end'])].groupby('device_deviceCategory')['sequence_id'].nunique().reset_index()
temp_df.columns = ['device_deviceCategory','conversion_sequence_count']
results_df = results_df.merge(temp_df, on='device_deviceCategory', how = 'left')
results_df.fillna(0,inplace=True)

# touchpoints per sequence ratio
results_df['touchpoint_per_sequence_ratio'] = results_df['touch_count']/results_df['sequence_count']

# calculate tactic univariate conversion rate
results_df['tactic_univariate_conversion_rate'] = results_df['conversion_sequence_count']/results_df['sequence_count']

results_df

Unnamed: 0,device_deviceCategory,touch_count,sequence_count,dead_sequence_count,conversion_sequence_count,touchpoint_per_sequence_ratio,tactic_univariate_conversion_rate
0,desktop,83219,67932,66716,1216,1.225034,0.0179
1,mobile,33338,27813,27716,97,1.198648,0.003488
2,tablet,4726,3973,3949,24,1.189529,0.006041
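
The summary table above is rebuilt nearly line for line for browser and continent later in the notebook. One way to avoid the repetition is a single helper, sketched below. The name `summarize_by` and the toy frame are hypothetical, not part of the original notebook, and the helper counts sequences from touchpoint rows only, so a sequence consisting solely of a terminal event would be excluded.

```python
import pandas as pd

TERMINAL_EVENTS = ["conversion", "dead_end"]  # outcome markers, not touchpoints


def summarize_by(df, group_col):
    """Return touch, sequence, and conversion counts per level of group_col."""
    touches = df[~df["event_name"].isin(TERMINAL_EVENTS)]

    out = touches.groupby(group_col).agg(
        touch_count=("sequence_id", "size"),
        sequence_count=("sequence_id", "nunique"),
    )

    # Count the sequences per level that end in each terminal event.
    for event, col in [("dead_end", "dead_sequence_count"),
                       ("conversion", "conversion_sequence_count")]:
        ids = df.loc[df["event_name"] == event, "sequence_id"].unique()
        counts = (touches[touches["sequence_id"].isin(ids)]
                  .groupby(group_col)["sequence_id"].nunique())
        # Levels with no such sequences get 0 instead of NaN.
        out[col] = counts.reindex(out.index, fill_value=0)

    out["touchpoint_per_sequence_ratio"] = out["touch_count"] / out["sequence_count"]
    out["tactic_univariate_conversion_rate"] = (
        out["conversion_sequence_count"] / out["sequence_count"]
    )
    return out.sort_values("touch_count", ascending=False).reset_index()


# Made-up example: two desktop sequences (one converts) and one mobile sequence.
toy_df = pd.DataFrame({
    "sequence_id": ["s1", "s1", "s1", "s2", "s2", "s3", "s3"],
    "event_name": ["organic_search", "social", "conversion",
                   "direct", "dead_end", "organic_search", "dead_end"],
    "device_deviceCategory": ["desktop"] * 5 + ["mobile"] * 2,
})
toy_summary = summarize_by(toy_df, "device_deviceCategory")
print(toy_summary)
```

With the real data, `summarize_by(full_df, 'device_deviceCategory')` should yield a table with the same columns as the one above, and the same call with `'browser_remapped'` or `'geoNetwork_continent'` would cover the later sections.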


**What are touchpoint counts by desktop, mobile, tablet, other?**

In [12]:
## use visualization package to make a bar chart

bar_fig = px.bar(results_df
                 ,x='touch_count'
                 ,y='device_deviceCategory'
                 ,title='Touches by Device Category'
                 ,color='device_deviceCategory'
                 )

bar_fig.update_layout(width=950
                      ,height=500
                      ,yaxis={'categoryorder': 'total ascending'}
                      ,plot_bgcolor='#f2f2f2') 

bar_fig.show()

**What are sequence counts by desktop, mobile, tablet, other?**

In [15]:
## use visualization package to make a bar chart

bar_fig = px.bar(results_df
                 ,x='sequence_count'
                 ,y='device_deviceCategory'
                 ,title='Sequence Count by Device Category'
                 ,color='device_deviceCategory'
                 )

bar_fig.update_layout(width=950
                      ,height=500
                      ,yaxis={'categoryorder': 'total ascending'}
                      ,plot_bgcolor='#f2f2f2') 

bar_fig.show()

**What are conversion rates by desktop, mobile, tablet, other?**

In [16]:
## use visualization package to make a bar chart

bar_fig = px.bar(results_df
                 ,x='tactic_univariate_conversion_rate'
                 ,y='device_deviceCategory'
                 ,title='Conversion Rate by Device Category'
                 ,color='device_deviceCategory'
                 )

bar_fig.update_layout(width=950
                      ,height=500
                      ,yaxis={'categoryorder': 'total ascending'}
                      ,plot_bgcolor='#f2f2f2') 

bar_fig.show()

In [20]:
## This will make a table to answer questions about browser

## browser has a lot of values, we will only consider Chrome Safari Firefox Internet Explorer and Edge

browser_list = ['Chrome','Safari','Firefox','Internet Explorer', 'Edge']
full_df['browser_remapped'] = np.where(full_df['device_browser'].isin(browser_list), full_df['device_browser'], 'Other')


## touchpoint counts
viz_df = full_df[~full_df['event_name'].isin(['conversion','dead_end'])]['browser_remapped'].value_counts().reset_index()
viz_df.columns = ['browser_remapped','touch_count']

# we already have touch count from above just copy into new dataframe
results_df = viz_df.copy(deep=True)

# add sequence count
temp_df = full_df[~full_df['event_name'].isin(['conversion','dead_end'])].groupby('browser_remapped')['sequence_id'].nunique().reset_index()
temp_df.columns = ['browser_remapped','sequence_count']
results_df = results_df.merge(temp_df, on='browser_remapped', how = 'left')

# add dead end sequence count
dead_end_sequence_ids = full_df[full_df['event_name'] == 'dead_end']['sequence_id'].unique()
temp_filtered_df = full_df[full_df['sequence_id'].isin(dead_end_sequence_ids)]
temp_df = temp_filtered_df[~temp_filtered_df['event_name'].isin(['conversion','dead_end'])].groupby('browser_remapped')['sequence_id'].nunique().reset_index()
temp_df.columns = ['browser_remapped','dead_sequence_count']
results_df = results_df.merge(temp_df, on='browser_remapped', how = 'left')

# add conversion sequence count
conversion_sequence_ids = full_df[full_df['event_name'] == 'conversion']['sequence_id'].unique()
temp_filtered_df = full_df[full_df['sequence_id'].isin(conversion_sequence_ids)]
temp_df = temp_filtered_df[~temp_filtered_df['event_name'].isin(['conversion','dead_end'])].groupby('browser_remapped')['sequence_id'].nunique().reset_index()
temp_df.columns = ['browser_remapped','conversion_sequence_count']
results_df = results_df.merge(temp_df, on='browser_remapped', how = 'left')
results_df.fillna(0,inplace=True)

# touchpoints per sequence ratio
results_df['touchpoint_per_sequence_ratio'] = results_df['touch_count']/results_df['sequence_count']

# calculate tactic univariate conversion rate
results_df['tactic_univariate_conversion_rate'] = results_df['conversion_sequence_count']/results_df['sequence_count']

results_df

Unnamed: 0,browser_remapped,touch_count,sequence_count,dead_sequence_count,conversion_sequence_count,touchpoint_per_sequence_ratio,tactic_univariate_conversion_rate
0,Chrome,83238,66207,65029,1178,1.257239,0.017793
1,Safari,22198,19444,19336,108,1.141638,0.005554
2,Other,7186,6511,6506,5,1.103671,0.000768
3,Firefox,4592,3892,3865,27,1.179856,0.006937
4,Internet Explorer,2566,2356,2341,15,1.089134,0.006367
5,Edge,1503,1308,1304,4,1.149083,0.003058


**What are touchpoint counts by browser (Chrome, Safari, etc.)?**

In [22]:
## use visualization package to make a bar chart

bar_fig = px.bar(results_df
                 ,x='touch_count'
                 ,y='browser_remapped'
                 ,title='Touches by Browser'
                 ,color='browser_remapped'
                 )

bar_fig.update_layout(width=950
                      ,height=500
                      ,yaxis={'categoryorder': 'total ascending'}
                      ,plot_bgcolor='#f2f2f2') 

bar_fig.show()

**What are sequence counts by browser (Chrome, Safari, etc.)?**

In [24]:
## use visualization package to make a bar chart

bar_fig = px.bar(results_df
                 ,x='sequence_count'
                 ,y='browser_remapped'
                 ,title='Sequences by Browser'
                 ,color='browser_remapped'
                 )

bar_fig.update_layout(width=950
                      ,height=500
                      ,yaxis={'categoryorder': 'total ascending'}
                      ,plot_bgcolor='#f2f2f2') 

bar_fig.show()

**What are conversion rates by browser (Chrome, Safari, etc.)?**

In [26]:
## use visualization package to make a bar chart

bar_fig = px.bar(results_df
                 ,x='tactic_univariate_conversion_rate'
                 ,y='browser_remapped'
                 ,title='Conversion Rate by Browser'
                 ,color='browser_remapped'
                 )

bar_fig.update_layout(width=950
                      ,height=500
                      ,yaxis={'categoryorder': 'total ascending'}
                      ,plot_bgcolor='#f2f2f2') 

bar_fig.show()

In [27]:
## This will make a table to answer questions about continent


## touchpoint counts
viz_df = full_df[~full_df['event_name'].isin(['conversion','dead_end'])]['geoNetwork_continent'].value_counts().reset_index()
viz_df.columns = ['geoNetwork_continent','touch_count']

# we already have touch count from above just copy into new dataframe
results_df = viz_df.copy(deep=True)

# add sequence count
temp_df = full_df[~full_df['event_name'].isin(['conversion','dead_end'])].groupby('geoNetwork_continent')['sequence_id'].nunique().reset_index()
temp_df.columns = ['geoNetwork_continent','sequence_count']
results_df = results_df.merge(temp_df, on='geoNetwork_continent', how = 'left')

# add dead end sequence count
dead_end_sequence_ids = full_df[full_df['event_name'] == 'dead_end']['sequence_id'].unique()
temp_filtered_df = full_df[full_df['sequence_id'].isin(dead_end_sequence_ids)]
temp_df = temp_filtered_df[~temp_filtered_df['event_name'].isin(['conversion','dead_end'])].groupby('geoNetwork_continent')['sequence_id'].nunique().reset_index()
temp_df.columns = ['geoNetwork_continent','dead_sequence_count']
results_df = results_df.merge(temp_df, on='geoNetwork_continent', how = 'left')

# add conversion sequence count
conversion_sequence_ids = full_df[full_df['event_name'] == 'conversion']['sequence_id'].unique()
temp_filtered_df = full_df[full_df['sequence_id'].isin(conversion_sequence_ids)]
temp_df = temp_filtered_df[~temp_filtered_df['event_name'].isin(['conversion','dead_end'])].groupby('geoNetwork_continent')['sequence_id'].nunique().reset_index()
temp_df.columns = ['geoNetwork_continent','conversion_sequence_count']
results_df = results_df.merge(temp_df, on='geoNetwork_continent', how = 'left')
results_df.fillna(0,inplace=True)

# touchpoints per sequence ratio
results_df['touchpoint_per_sequence_ratio'] = results_df['touch_count']/results_df['sequence_count']

# calculate tactic univariate conversion rate
results_df['tactic_univariate_conversion_rate'] = results_df['conversion_sequence_count']/results_df['sequence_count']

results_df

Unnamed: 0,geoNetwork_continent,touch_count,sequence_count,dead_sequence_count,conversion_sequence_count,touchpoint_per_sequence_ratio,tactic_univariate_conversion_rate
0,Americas,62206,48247,46945,1302.0,1.289324,0.026986
1,Asia,28133,24620,24599,21.0,1.142689,0.000853
2,Europe,26249,22644,22634,10.0,1.159203,0.000442
3,Africa,2531,2314,2313,1.0,1.093777,0.000432
4,Oceania,1970,1726,1723,3.0,1.141367,0.001738
5,(not set),194,167,167,0.0,1.161677,0.0


**What are touchpoint counts by continent?**

In [29]:
## use visualization package to make a bar chart

bar_fig = px.bar(results_df
                 ,x='touch_count'
                 ,y='geoNetwork_continent'
                 ,title='Touches by Continent'
                 ,color='geoNetwork_continent'
                 )

bar_fig.update_layout(width=950
                      ,height=500
                      ,yaxis={'categoryorder': 'total ascending'}
                      ,plot_bgcolor='#f2f2f2') 

bar_fig.show()

**What are sequence counts by continent?**

In [31]:
## use visualization package to make a bar chart

bar_fig = px.bar(results_df
                 ,x='sequence_count'
                 ,y='geoNetwork_continent'
                 ,title='Sequences by Continent'
                 ,color='geoNetwork_continent'
                 )

bar_fig.update_layout(width=950
                      ,height=500
                      ,yaxis={'categoryorder': 'total ascending'}
                      ,plot_bgcolor='#f2f2f2') 

bar_fig.show()

**What are conversion rates by continent?**

In [32]:
## use visualization package to make a bar chart

bar_fig = px.bar(results_df
                 ,x='tactic_univariate_conversion_rate'
                 ,y='geoNetwork_continent'
                 ,title='Conversion Rate by Continent'
                 ,color='geoNetwork_continent'
                 )

bar_fig.update_layout(width=950
                      ,height=500
                      ,yaxis={'categoryorder': 'total ascending'}
                      ,plot_bgcolor='#f2f2f2') 

bar_fig.show()

**Please use this section to create and look at any other features you would like.**

ideas:
- The time of day that sequences start (morning, afternoon, overnight)
- The month of the year that sequences start (1-12 for Jan through Dec)
- The day of the week that sequences start (Monday-Sunday)
- Top Countries
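
The time-based ideas above can be sketched by taking each sequence's earliest `event_datetime` as its start and deriving features from it. The toy frame below is made up, and the day-part boundaries (overnight 0-5, morning 6-11, afternoon 12-17, evening 18-23) are an assumption; the notebook does not define them.

```python
import pandas as pd

# Made-up events; only the columns needed for the time features are included.
events_toy = pd.DataFrame({
    "sequence_id": ["a", "a", "b", "c"],
    "event_datetime": ["2018-04-15 17:31:50", "2018-04-15 17:33:05",
                       "2017-09-14 02:36:56", "2017-04-21 09:41:23"],
})
events_toy["event_datetime"] = pd.to_datetime(events_toy["event_datetime"])

# Treat the earliest event in a sequence as the sequence start.
starts = (events_toy.groupby("sequence_id")["event_datetime"]
          .min()
          .reset_index(name="sequence_start"))

starts["start_month"] = starts["sequence_start"].dt.month            # 1-12
starts["start_day_of_week"] = starts["sequence_start"].dt.day_name()

# Assumed day-part boundaries by hour; adjust to whatever split is preferred.
starts["start_daypart"] = pd.cut(
    starts["sequence_start"].dt.hour,
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=["overnight", "morning", "afternoon", "evening"],
)

print(starts)
```

On the real data, `starts` could be merged back onto `full_df` by `sequence_id`, after which the new columns can be summarized the same way as the control variables above.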


**Please use this section to create and look at multivariate bars.**

ideas:

   - what are the touchpoints by each channel and control variable?
   - ... etc.

In [33]:
viz_df = full_df[~full_df['event_name'].isin(['conversion','dead_end'])].groupby(['event_name','device_deviceCategory']).size().reset_index(name='touch_count')

## use visualization package to make a bar chart

bar_fig = px.bar(viz_df
                 ,x='touch_count'
                 ,y='event_name'
                 ,title='Stacked Bar Chart Touches by Channel & Device Category'
                 ,color='device_deviceCategory'
                 )

bar_fig.update_layout(width=950
                      ,height=500
                      ,yaxis={'categoryorder': 'total ascending'}
                      ,plot_bgcolor='#f2f2f2') 

bar_fig.show()


In [34]:
viz_df = full_df[~full_df['event_name'].isin(['conversion','dead_end'])].groupby(['event_name','device_deviceCategory']).size().reset_index(name='touch_count')

## use visualization package to make a bar chart

bar_fig = px.bar(viz_df
                 ,x='touch_count'
                 ,y='event_name'
                 ,title='Grouped Bar Chart Touches by Channel & Device Category'
                 ,color='device_deviceCategory'
                 ,barmode='group'
                 )

bar_fig.update_layout(width=950
                      ,height=500
                      ,yaxis={'categoryorder': 'total ascending'}
                      ,plot_bgcolor='#f2f2f2') 

bar_fig.show()
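
Because desktop dominates the raw counts, the stacked and grouped bars can hide differences in device mix between channels. As a complement, the same touch counts can be laid out as a wide table and normalized within each channel. This is a minimal sketch on a made-up frame (`touches_toy`), not output from the real dataset.

```python
import pandas as pd

# Made-up touchpoint rows (terminal events already removed).
touches_toy = pd.DataFrame({
    "event_name": ["organic_search", "organic_search", "organic_search",
                   "social", "social", "direct"],
    "device_deviceCategory": ["desktop", "desktop", "mobile",
                              "mobile", "tablet", "desktop"],
})

# Touch counts: one row per channel, one column per device category.
counts_wide = pd.crosstab(touches_toy["event_name"],
                          touches_toy["device_deviceCategory"])

# Share of each channel's touches by device; every row sums to 1.
shares_wide = pd.crosstab(touches_toy["event_name"],
                          touches_toy["device_deviceCategory"],
                          normalize="index")

print(counts_wide)
print(shares_wide.round(3))
```

On the real data, the two inputs would be the `event_name` and `device_deviceCategory` columns of `full_df` after filtering out `conversion` and `dead_end` rows.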