Goal
The goal of this project is to act as a Web Analyst for the Google Merchandise Store and analyze its Google Analytics data in BigQuery using SQL. The analysis covers the period 2016/8/1 - 2017/8/1.

Site: https://shop.googlemerchandisestore.com/

For this analysis we'll focus on 3 main goals:

- Understand the composition of current site traffic
- Understand the flow and conversion path of users
- Forecast product demand

In addition, we'll use the following steps as a framework to organize our analysis:

1. Extract data and confirm structure/contents
2. Make our goals more concrete
3. Explore and analyze data
4. Visualize insights and interpret results

In [2]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


### 1. Extract Data and Confirm Structure/Contents

First, let's connect to BigQuery and see what kind of data we'll be working with and how it's structured.

In [3]:
from google.cloud import bigquery

# Create client object
client = bigquery.Client()

# Create dataset reference (Client.dataset() is deprecated in newer google-cloud-bigquery releases)
dataset_ref = bigquery.DatasetReference("bigquery-public-data", "google_analytics_sample")

# Retrieve dataset from reference
dataset = client.get_dataset(dataset_ref)

In [6]:
# View the first five tables in the dataset
[x.table_id for x in client.list_tables(dataset, max_results=5)]

['ga_sessions_20160801',
 'ga_sessions_20160802',
 'ga_sessions_20160803',
 'ga_sessions_20160804',
 'ga_sessions_20160805']
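
The table names suggest one table per day, named `ga_sessions_YYYYMMDD`. As a minimal sketch, assuming that naming holds for the whole analysis period, the helper below builds a wildcard query limited to a date range with BigQuery's `_TABLE_SUFFIX` pseudo-column. The function name `build_sessions_query` and the default `COUNT(*)` expression are illustrative and not part of the original notebook.

```python
from datetime import date

# Fully qualified wildcard table; the daily date suffix is matched by `*`.
WILDCARD_TABLE = "bigquery-public-data.google_analytics_sample.ga_sessions_*"


def build_sessions_query(start: date, end: date, select: str = "COUNT(*) AS n_rows") -> str:
    """Return a Standard SQL query over the daily tables from start to end (inclusive)."""
    if start > end:
        raise ValueError("start must not be after end")
    return (
        f"SELECT {select}\n"
        f"FROM `{WILDCARD_TABLE}`\n"
        f"WHERE _TABLE_SUFFIX BETWEEN '{start:%Y%m%d}' AND '{end:%Y%m%d}'"
    )


# Date range taken from the analysis period stated in the Goal section.
query = build_sessions_query(date(2016, 8, 1), date(2017, 8, 1))
print(query)
```

The resulting string could then be passed to `client.query(query).to_dataframe()` instead of reading each daily table separately.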

In [7]:
# Create table reference
table_ref_20160801 = dataset_ref.table('ga_sessions_20160801')

# Retrieve table from reference
table_20160801 = client.get_table(table_ref_20160801)

# Preview the first five rows to see the columns and sample values
client.list_rows(table_20160801, max_results=5).to_dataframe()

Unnamed: 0,visitorId,visitNumber,visitId,visitStartTime,date,totals,trafficSource,device,geoNetwork,customDimensions,hits,fullVisitorId,userId,channelGrouping,socialEngagementType
0,,1,1470117657,1470117657,20160801,"{'visits': 1, 'hits': 3, 'pageviews': 3, 'time...","{'referralPath': '/yt/about/', 'campaign': '(n...","{'browser': 'Internet Explorer', 'browserVersi...","{'continent': 'Americas', 'subContinent': 'Nor...","[{'index': 4, 'value': 'North America'}]","[{'hitNumber': 1, 'time': 0, 'hour': 23, 'minu...",7194065619159478122,,Social,Not Socially Engaged
1,,151,1470083489,1470083489,20160801,"{'visits': 1, 'hits': 3, 'pageviews': 3, 'time...","{'referralPath': '/yt/about/', 'campaign': '(n...","{'browser': 'Chrome', 'browserVersion': 'not a...","{'continent': 'Americas', 'subContinent': 'Nor...","[{'index': 4, 'value': 'North America'}]","[{'hitNumber': 1, 'time': 0, 'hour': 13, 'minu...",8159312408158297118,,Social,Not Socially Engaged
2,,1,1470052694,1470052694,20160801,"{'visits': 1, 'hits': 4, 'pageviews': 3, 'time...","{'referralPath': '/yt/about/', 'campaign': '(n...","{'browser': 'Chrome', 'browserVersion': 'not a...","{'continent': 'Asia', 'subContinent': 'Southea...",[],"[{'hitNumber': 1, 'time': 0, 'hour': 4, 'minut...",9236304747882138291,,Social,Not Socially Engaged
3,,1,1470061879,1470061879,20160801,"{'visits': 1, 'hits': 4, 'pageviews': 4, 'time...","{'referralPath': '/yt/about/', 'campaign': '(n...","{'browser': 'Firefox', 'browserVersion': 'not ...","{'continent': 'Americas', 'subContinent': 'Nor...","[{'index': 4, 'value': 'North America'}]","[{'hitNumber': 1, 'time': 0, 'hour': 7, 'minut...",1792676004815023069,,Social,Not Socially Engaged
4,,1,1470090830,1470090830,20160801,"{'visits': 1, 'hits': 4, 'pageviews': 2, 'time...","{'referralPath': '/yt/about/', 'campaign': '(n...","{'browser': 'Internet Explorer', 'browserVersi...","{'continent': 'Americas', 'subContinent': 'Nor...","[{'index': 4, 'value': 'North America'}]","[{'hitNumber': 1, 'time': 0, 'hour': 15, 'minu...",7305625498291809599,,Social,Not Socially Engaged


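It looks like several columns contain nested data: totals, trafficSource, device, and geoNetwork each hold a single record (shown as a dict), while customDimensions and hits hold repeated records (shown as lists of dicts).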

Let's check the schema for these columns and see what kind of data they contain.
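
One possible way to inspect nested columns is to walk the schema recursively and print each field with its dotted path. The sketch below assumes only that each schema entry exposes `name`, `field_type`, `mode`, and `fields`, as `bigquery.SchemaField` does. The helper `describe_schema` and the `demo_schema` stand-in (a hand-written, partial schema) are hypothetical and exist only so the example runs without BigQuery access.

```python
from types import SimpleNamespace


def describe_schema(fields, prefix=""):
    """Yield (dotted_name, field_type, mode) for every field, recursing into nested ones."""
    for field in fields:
        name = f"{prefix}{field.name}"
        yield name, field.field_type, field.mode
        # RECORD fields carry their sub-fields in `.fields`; leaf fields have none.
        yield from describe_schema(field.fields, prefix=f"{name}.")


# Stand-in objects mimicking SchemaField attributes (hypothetical, partial schema).
demo_schema = [
    SimpleNamespace(name="visitNumber", field_type="INTEGER", mode="NULLABLE", fields=()),
    SimpleNamespace(
        name="totals",
        field_type="RECORD",
        mode="NULLABLE",
        fields=(
            SimpleNamespace(name="pageviews", field_type="INTEGER", mode="NULLABLE", fields=()),
        ),
    ),
]

for row in describe_schema(demo_schema):
    print(row)
```

With the real table, `describe_schema(table_20160801.schema)` would walk the actual fields in the same way.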