# PART II Exploratory Data Analysis on Google Merchandise Store Sequence Data

### Potential Answers

This part of the EDA examines the potential control variables in the dataset. We may want to include these variables in our modeling, but first we need to understand touchpoint counts, sequence counts, and conversion rates for each of them.

We start this notebook by joining the data back together and building the first visualization. A multivariate visualization example appears later in the notebook.

Answer the remaining blank questions, along with any additional questions you see fit.

In [2]:
import pandas as pd
import numpy as np

## Visualization packages
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt

## Purdue colors
purdue_colors = ['#CEB888', '#000000','#9D968D','#373A36','#C28E0E']

In [3]:
## Import datasets; this time we also need to pull in the visitor detail table

sequence_df = pd.read_csv('../datasets/sequence_fact.csv')
visitor_detail_df = pd.read_csv('../datasets/visitor_detail.csv')

Add the control variables that some of the questions below ask about.

In [7]:
visitor_detail_small_df = visitor_detail_df[['fullVisitorId','device_deviceCategory','device_browser','geoNetwork_continent']]

In [9]:
full_df = sequence_df.merge(visitor_detail_small_df, on='fullVisitorId',how='left')
full_df

| Unnamed: 0 | sequence_id | fullVisitorId | event_name | event_datetime | conversion_proximity | device_deviceCategory | device_browser | geoNetwork_continent |
|---|---|---|---|---|---|---|---|---|
| 0 | 0099Rqojoj1MCXN | 7343617347507729080 | organic_search | 2018-04-15 17:31:50 | 75.0 | desktop | Chrome | Asia |
| 1 | 0099Rqojoj1MCXN | 7343617347507729080 | dead_end | 2018-04-15 17:33:05 | 0.0 | desktop | Chrome | Asia |
| 2 | 00A9Lkka73okUx2 | 89656057821147903 | organic_search | 2017-09-14 16:36:56 | 1033.0 | mobile | Chrome | Asia |
| 3 | 00A9Lkka73okUx2 | 89656057821147903 | dead_end | 2017-09-14 16:54:09 | 0.0 | mobile | Chrome | Asia |
| 4 | 00B30tmbMwJn7Cf | 4307745811624101170 | organic_search | 2017-04-21 02:41:23 | 1.0 | tablet | Safari | Americas |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 220996 | zzvh8qX8dzkWb2X | 546466813369261354 | dead_end | 2017-02-17 12:27:46 | 0.0 | desktop | Firefox | Asia |
| 220997 | zzxahVA1FamPayn | 6288261604719925213 | organic_search | 2017-08-22 18:56:15 | 24.0 | desktop | Chrome | Americas |
| 220998 | zzxahVA1FamPayn | 6288261604719925213 | dead_end | 2017-08-22 18:56:39 | 0.0 | desktop | Chrome | Americas |
| 220999 | zzyM5alBCxAkdwq | 7918896908390800801 | social | 2018-02-24 15:01:36 | 1.0 | desktop | Chrome | Americas |
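
A left join like the one above silently multiplies rows if the right-hand table holds more than one row per key, and it leaves missing values where a key has no match. The sketch below shows one way to guard against both using the `validate` and `indicator` options of `DataFrame.merge`. The two small frames and their values are made up for illustration; only the column names follow the notebook, and whether `visitor_detail.csv` really has one row per `fullVisitorId` is an assumption worth checking.

```python
import pandas as pd

# Hypothetical stand-ins for sequence_df and visitor_detail_small_df
toy_sequence_df = pd.DataFrame({
    'sequence_id': ['s1', 's1', 's2', 's3'],
    'fullVisitorId': ['v1', 'v1', 'v2', 'v3'],
    'event_name': ['organic_search', 'dead_end', 'social', 'organic_search'],
})
toy_visitor_df = pd.DataFrame({
    'fullVisitorId': ['v1', 'v2'],
    'device_deviceCategory': ['desktop', 'mobile'],
})

# validate='many_to_one' raises a MergeError if a visitor id appears more
# than once on the right side; indicator=True adds a '_merge' column that
# records whether each left row found a match.
toy_full_df = toy_sequence_df.merge(
    toy_visitor_df,
    on='fullVisitorId',
    how='left',
    validate='many_to_one',
    indicator=True,
)

# A left join that passes validation keeps the left row count unchanged
row_count_unchanged = len(toy_full_df) == len(toy_sequence_df)

# Rows whose visitor id had no match carry missing control variables
unmatched_rows = int((toy_full_df['_merge'] == 'left_only').sum())

print(row_count_unchanged, unmatched_rows)
```

Unmatched rows end up with missing control variables, and pandas `groupby` drops missing keys by default, so it may be worth noting how many there are before computing counts and rates.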


**What are touchpoint counts by desktop, mobile, tablet, other?**

In [15]:
viz_df = full_df[~full_df['event_name'].isin(['conversion','dead_end'])]['device_deviceCategory'].value_counts().reset_index()
viz_df.columns = ['device_deviceCategory','touch_count']

## use visualization package to make a bar chart

bar_fig = px.bar(viz_df
                 ,x='touch_count'
                 ,y='device_deviceCategory'
                 ,title='Touches by Device Category'
                 ,color='device_deviceCategory'
                 )

bar_fig.update_layout(width=950
                      ,height=500
                      ,yaxis={'categoryorder': 'total ascending'}
                      ,plot_bgcolor='#f2f2f2') 

bar_fig.show()

**What are sequence counts by desktop, mobile, tablet, other?**
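
One possible way to approach this question is to count distinct `sequence_id` values per category instead of counting rows. The sketch below uses a small made-up frame in place of `full_df` (same column names, hypothetical values), and the helper name `sequence_counts` is an illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical stand-in for full_df; only the column names match the notebook
toy_df = pd.DataFrame({
    'sequence_id': ['a', 'a', 'b', 'b', 'b', 'c', 'c'],
    'event_name': ['organic_search', 'dead_end',
                   'social', 'organic_search', 'conversion',
                   'social', 'dead_end'],
    'device_deviceCategory': ['desktop', 'desktop',
                              'mobile', 'mobile', 'mobile',
                              'desktop', 'desktop'],
})

def sequence_counts(df, by):
    """Count distinct sequences for each value of the column `by`."""
    return (df.groupby(by)['sequence_id']
              .nunique()
              .reset_index(name='sequence_count')
              .sort_values('sequence_count', ascending=False)
              .reset_index(drop=True))

seq_counts_df = sequence_counts(toy_df, 'device_deviceCategory')
print(seq_counts_df)
```

Because the grouping column is a parameter, the same helper could be pointed at `device_browser` or `geoNetwork_continent`, and its output has the same shape as the `viz_df` passed to `px.bar` above.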

**What are conversion rates by desktop, mobile, tablet, other?**
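
A conversion rate needs a definition before it can be computed. The sketch below assumes a sequence counts as converted when it contains at least one row with `event_name == 'conversion'`, and that the control variable is constant within a sequence; both are assumptions about this dataset, not facts stated in it. The toy frame and the helper name `conversion_rates` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for full_df; only the column names match the notebook
toy_df = pd.DataFrame({
    'sequence_id': ['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
    'event_name': ['organic_search', 'dead_end',
                   'social', 'conversion',
                   'social', 'dead_end',
                   'organic_search', 'conversion'],
    'device_deviceCategory': ['desktop', 'desktop',
                              'mobile', 'mobile',
                              'desktop', 'desktop',
                              'desktop', 'desktop'],
})

def conversion_rates(df, by):
    """Share of sequences per `by` value that contain a conversion event."""
    # Flag conversion rows (assumed definition of a converted sequence)
    flagged = df.assign(converted=(df['event_name'] == 'conversion').astype(int))
    # Collapse to one row per sequence: its category and a 0/1 conversion flag
    per_seq = flagged.groupby('sequence_id').agg(
        group=(by, 'first'),
        converted=('converted', 'max'),
    )
    # Aggregate sequences within each category
    rates = (per_seq.groupby('group')['converted']
                    .agg(['count', 'sum', 'mean'])
                    .reset_index())
    rates.columns = [by, 'sequence_count', 'conversions', 'conversion_rate']
    return rates

rates_df = conversion_rates(toy_df, 'device_deviceCategory')
print(rates_df)
```

Keeping `sequence_count` next to `conversion_rate` makes it easier to spot categories whose rate rests on very few sequences.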

**What are touchpoint counts by browser (Chrome, Safari, etc.)?**

**What are sequence counts by browser (Chrome, Safari, etc.)?**

**What are conversion rates by browser (Chrome, Safari, etc.)?**

**What are touchpoint counts by continent?**

**What are sequence counts by continent?**

**What are conversion rates by continent?**

**Please use this section to create and look at any other features you would like.**

ideas:
- Sequences that start during the morning, afternoon, overnight?
- The month of the year in which sequences start (1-12 for Jan through Dec)
- The day of the week that the sequence started on (Monday-Sunday)
- Countries
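
The time-based ideas above could be derived from the earliest `event_datetime` in each sequence. The following is a minimal sketch that reuses the first five rows shown in the merged output as sample data. The part-of-day boundaries (overnight 0-6, morning 6-12, afternoon 12-18, evening 18-24) and the new column names are assumptions chosen for illustration.

```python
import pandas as pd

# First five rows from the merged output above, reduced to two columns
toy_df = pd.DataFrame({
    'sequence_id': ['0099Rqojoj1MCXN', '0099Rqojoj1MCXN',
                    '00A9Lkka73okUx2', '00A9Lkka73okUx2',
                    '00B30tmbMwJn7Cf'],
    'event_datetime': ['2018-04-15 17:31:50', '2018-04-15 17:33:05',
                       '2017-09-14 16:36:56', '2017-09-14 16:54:09',
                       '2017-04-21 02:41:23'],
})
toy_df['event_datetime'] = pd.to_datetime(toy_df['event_datetime'])

# Sequence start = earliest event timestamp within each sequence
starts_df = (toy_df.groupby('sequence_id')['event_datetime']
                   .min()
                   .reset_index(name='sequence_start'))

# Month of year (1-12) and weekday name of the sequence start
starts_df['start_month'] = starts_df['sequence_start'].dt.month
starts_df['start_weekday'] = starts_df['sequence_start'].dt.day_name()

# Part of day from the start hour; bin edges are an assumed convention
starts_df['start_daypart'] = pd.cut(
    starts_df['sequence_start'].dt.hour,
    bins=[0, 6, 12, 18, 24],
    right=False,  # intervals are [0, 6), [6, 12), [12, 18), [18, 24)
    labels=['overnight', 'morning', 'afternoon', 'evening'],
)

print(starts_df)
```

Merging `starts_df` back on `sequence_id` would make these features available as additional control variables for the same count and conversion-rate questions.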


**Please use this section to create and look at multivariate bars.**

ideas:

   - What are the touchpoint counts by each channel and control variable?
   - ... etc.

In [19]:
viz_df = full_df[~full_df['event_name'].isin(['conversion','dead_end'])].groupby(['event_name','device_deviceCategory']).size().reset_index(name='touch_count')

## use visualization package to make a bar chart

bar_fig = px.bar(viz_df
                 ,x='touch_count'
                 ,y='event_name'
                 ,title='Stacked Bar Chart Touches by Channel & Device Category'
                 ,color='device_deviceCategory'
                 )

bar_fig.update_layout(width=950
                      ,height=500
                      ,yaxis={'categoryorder': 'total ascending'}
                      ,plot_bgcolor='#f2f2f2') 

bar_fig.show()


In [25]:
viz_df = full_df[~full_df['event_name'].isin(['conversion','dead_end'])].groupby(['event_name','device_deviceCategory']).size().reset_index(name='touch_count')

## use visualization package to make a bar chart

bar_fig = px.bar(viz_df
                 ,x='touch_count'
                 ,y='event_name'
                 ,title='Grouped Bar Chart Touches by Channel & Device Category'
                 ,color='device_deviceCategory'
                 ,barmode='group'
                 )

bar_fig.update_layout(width=950
                      ,height=500
                      ,yaxis={'categoryorder': 'total ascending'}
                      ,plot_bgcolor='#f2f2f2') 

bar_fig.show()