# Google Analytics App + Web data in BigQuery

Via Firebase we can, as we know, export raw data from Google Analytics App + Web directly to BigQuery. What we have not yet looked at is how this data is stored in BigQuery and what we can do with it.

Google Analytics App + Web uses the Firebase Analytics data model. It is based on events, unlike regular Google Analytics, which is based on hits and sessions. Each row in BigQuery therefore corresponds to one event. Firebase events are very flexible; each event can, for example, carry up to 25 unique parameters. How is this stored in BigQuery?

In a traditional relational database you would probably put the parameters in a table of their own and link them to the events. BigQuery solves this somewhat differently, as we can see in the schema below: look for the row *event_params*, which has type *RECORD* and mode *REPEATED*:

| Field name | Type | Mode |
|---|---|---|
| event_date | STRING | NULLABLE |
| event_timestamp | INTEGER | NULLABLE |
| event_name | STRING | NULLABLE |
| event_params | RECORD | REPEATED |
| event_params.key | STRING | NULLABLE |
| event_params.value | RECORD | NULLABLE |
| event_params.value.string_value | STRING | NULLABLE |
| event_params.value.int_value | INTEGER | NULLABLE |
| event_params.value.float_value | FLOAT | NULLABLE |
| event_params.value.double_value | FLOAT | NULLABLE |
| event_previous_timestamp | INTEGER | NULLABLE |
| event_value_in_usd | FLOAT | NULLABLE |
| event_bundle_sequence_id | INTEGER | NULLABLE |
| event_server_timestamp_offset | INTEGER | NULLABLE |
| user_id | STRING | NULLABLE |
| user_pseudo_id | STRING | NULLABLE |
| user_properties | RECORD | REPEATED |
| user_properties.key | STRING | NULLABLE |
| user_properties.value | RECORD | NULLABLE |
| user_properties.value.string_value | STRING | NULLABLE |
| user_properties.value.int_value | INTEGER | NULLABLE |
| user_properties.value.float_value | FLOAT | NULLABLE |
| user_properties.value.double_value | FLOAT | NULLABLE |
| user_properties.value.set_timestamp_micros | INTEGER | NULLABLE |
| user_first_touch_timestamp | INTEGER | NULLABLE |
| user_ltv | RECORD | NULLABLE |
| user_ltv.revenue | FLOAT | NULLABLE |
| user_ltv.currency | STRING | NULLABLE |
| device | RECORD | NULLABLE |
| device.category | STRING | NULLABLE |
| device.mobile_brand_name | STRING | NULLABLE |
| device.mobile_model_name | STRING | NULLABLE |
| device.mobile_marketing_name | STRING | NULLABLE |
| device.mobile_os_hardware_model | STRING | NULLABLE |
| device.operating_system | STRING | NULLABLE |
| device.operating_system_version | STRING | NULLABLE |
| device.vendor_id | STRING | NULLABLE |
| device.advertising_id | STRING | NULLABLE |
| device.language | STRING | NULLABLE |
| device.is_limited_ad_tracking | STRING | NULLABLE |
| device.time_zone_offset_seconds | INTEGER | NULLABLE |
| device.browser | STRING | NULLABLE |
| device.browser_version | STRING | NULLABLE |
| device.web_info | RECORD | NULLABLE |
| device.web_info.browser | STRING | NULLABLE |
| device.web_info.browser_version | STRING | NULLABLE |
| device.web_info.hostname | STRING | NULLABLE |
| geo | RECORD | NULLABLE |
| geo.continent | STRING | NULLABLE |
| geo.country | STRING | NULLABLE |
| geo.region | STRING | NULLABLE |
| geo.city | STRING | NULLABLE |
| geo.sub_continent | STRING | NULLABLE |
| geo.metro | STRING | NULLABLE |
| app_info | RECORD | NULLABLE |
| app_info.id | STRING | NULLABLE |
| app_info.version | STRING | NULLABLE |
| app_info.install_store | STRING | NULLABLE |
| app_info.firebase_app_id | STRING | NULLABLE |
| app_info.install_source | STRING | NULLABLE |
| traffic_source | RECORD | NULLABLE |
| traffic_source.name | STRING | NULLABLE |
| traffic_source.medium | STRING | NULLABLE |
| traffic_source.source | STRING | NULLABLE |
| stream_id | STRING | NULLABLE |
| platform | STRING | NULLABLE |
| event_dimensions | RECORD | NULLABLE |
| event_dimensions.hostname | STRING | NULLABLE |

Type *RECORD* means that *event_params* is a container that in turn holds several fields. In this case *event_params* contains the fields *event_params.key* and *event_params.value* (which is itself of type RECORD). Mode *REPEATED* means that an event can have several *event_params*.
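To make the RECORD/REPEATED structure concrete, here is a minimal sketch of how a single event could look as plain Python data. The event itself is a hypothetical sample (the parameter values are borrowed from the course site), and `get_param` is a helper name made up for this illustration, not part of any library.

```python
# Hypothetical single event, shaped like one row of the export.
# A REPEATED RECORD becomes a list of dicts; a nested RECORD becomes a dict.
event = {
    "event_name": "page_view",
    "event_params": [
        {
            "key": "page_title",
            "value": {
                "string_value": "Squadfree - Free bootstrap 3 one page template",
                "int_value": None,
                "float_value": None,
                "double_value": None,
            },
        },
        {
            "key": "ga_session_number",
            "value": {
                "string_value": None,
                "int_value": 16,
                "float_value": None,
                "double_value": None,
            },
        },
    ],
}


def get_param(event, key):
    """Return the first non-null typed value for the given parameter key, or None."""
    for param in event["event_params"]:
        if param["key"] == key:
            # Only one of the typed value fields is expected to be set.
            return next((v for v in param["value"].values() if v is not None), None)
    return None


page_title = get_param(event, "page_title")
session_number = get_param(event, "ga_session_number")
```

Each parameter carries all four typed value fields, but only the one matching the parameter's type is filled in; the rest are null.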

For anyone used to working with XML or JSON files this is a familiar way to structure data, but if you are used to relational databases it may feel unfamiliar, and it affects how we query data from our database.

First of all, though, let us connect to BigQuery to get access to the data from the site with GA App+Web that we set up during the course. As usual, this is done with the following command:




In [1]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


We start with a simple query to get an overview of what the table looks like:

In [0]:
%%bigquery --project surfsapp-8011b df

SELECT * FROM `surfsapp-8011b.analytics_164498740.events_20191120`

In [9]:
df.head()

Unnamed: 0,event_date,event_timestamp,event_name,event_params,event_previous_timestamp,event_value_in_usd,event_bundle_sequence_id,event_server_timestamp_offset,user_id,user_pseudo_id,user_properties,user_first_touch_timestamp,user_ltv,device,geo,app_info,traffic_source,stream_id,platform,event_dimensions
0,20191120,1574290128307942,session_start,"[{'key': 'ga_session_id', 'value': {'string_va...",,,-69269786,,,2041489404.1572971,[],1573631825512551,"{'revenue': 0.0, 'currency': 'USD'}","{'category': 'desktop', 'mobile_brand_name': '...","{'continent': 'Europe', 'country': 'Sweden', '...",,"{'name': '(direct)', 'medium': '(none)', 'sour...",1619345481,WEB,
1,20191120,1574290128307942,page_view,"[{'key': 'ga_session_id', 'value': {'string_va...",,,-69269786,,,2041489404.1572971,[],1573631825512551,"{'revenue': 0.0, 'currency': 'USD'}","{'category': 'desktop', 'mobile_brand_name': '...","{'continent': 'Europe', 'country': 'Sweden', '...",,"{'name': '(direct)', 'medium': '(none)', 'sour...",1619345481,WEB,
2,20191120,1574290133366600,section_visible,"[{'key': 'ga_session_id', 'value': {'string_va...",,,-64211128,,,2041489404.1572971,[],1573631825512551,"{'revenue': 0.0, 'currency': 'USD'}","{'category': 'desktop', 'mobile_brand_name': '...","{'continent': 'Europe', 'country': 'Sweden', '...",,"{'name': '(direct)', 'medium': '(none)', 'sour...",1619345481,WEB,
3,20191120,1574290133366600,section_visible,"[{'key': 'ga_session_id', 'value': {'string_va...",,,-64211128,,,2041489404.1572971,[],1573631825512551,"{'revenue': 0.0, 'currency': 'USD'}","{'category': 'desktop', 'mobile_brand_name': '...","{'continent': 'Europe', 'country': 'Sweden', '...",,"{'name': '(direct)', 'medium': '(none)', 'sour...",1619345481,WEB,
4,20191120,1574290133366600,section_visible,"[{'key': 'ga_session_id', 'value': {'string_va...",,,-64211128,,,2041489404.1572971,[],1573631825512551,"{'revenue': 0.0, 'currency': 'USD'}","{'category': 'desktop', 'mobile_brand_name': '...","{'continent': 'Europe', 'country': 'Sweden', '...",,"{'name': '(direct)', 'medium': '(none)', 'sour...",1619345481,WEB,


We can see that event_params actually contains JSON data, and we could use pandas to convert that JSON data into a DataFrame:

In [10]:
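import pandas as pd

# Flatten the nested event_params of the first event into one row per parameter.
# pd.json_normalize replaces pandas.io.json.json_normalize, which is deprecated since pandas 1.0.
params_data = pd.json_normalize(list(df.loc[0, 'event_params']))
params_data.head()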

Unnamed: 0,key,value.string_value,value.int_value,value.float_value,value.double_value
0,ga_session_id,,1574290000.0,,
1,engaged_session_event,,1.0,,
2,page_location,https://hornstein.github.io/,,,
3,ga_session_number,,16.0,,
4,page_title,Squadfree - Free bootstrap 3 one page template,,,
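The normalized result has one column per value type, with only one of them filled in on each row. If a single value per key is more convenient, the typed columns can be collapsed. The following is a minimal sketch on a small hand-built frame that mirrors the output above; `first_set_value` is a hypothetical helper written for this example.

```python
import pandas as pd

# Hypothetical sample mirroring the json_normalize output above.
params_data = pd.DataFrame({
    "key": ["engaged_session_event", "page_location", "ga_session_number"],
    "value.string_value": [None, "https://hornstein.github.io/", None],
    "value.int_value": [1.0, None, 16.0],
    "value.float_value": [None, None, None],
    "value.double_value": [None, None, None],
})

# All typed value columns produced by the normalization step.
value_cols = [c for c in params_data.columns if c.startswith("value.")]


def first_set_value(row):
    """Return the first non-null typed value on the row, or None."""
    for col in value_cols:
        if pd.notna(row[col]):
            return row[col]
    return None


params_data["value"] = params_data.apply(first_set_value, axis=1)

# A plain dict gives direct lookup by parameter key.
params = dict(zip(params_data["key"], params_data["value"]))
```

Note that integer values come back as floats here, because pandas stores an integer column containing missing values as float.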


An easier way is to let BigQuery do the unnesting for us with UNNEST:

In [0]:
%%bigquery --project surfsapp-8011b df2

SELECT * FROM `surfsapp-8011b.analytics_164498740.events_20191120`, UNNEST (event_params) as ep

In [15]:
df2.head()

Unnamed: 0,event_date,event_timestamp,event_name,event_params,event_previous_timestamp,event_value_in_usd,event_bundle_sequence_id,event_server_timestamp_offset,user_id,user_pseudo_id,user_properties,user_first_touch_timestamp,user_ltv,device,geo,app_info,traffic_source,stream_id,platform,event_dimensions,key,value
0,20191120,1574290128307942,session_start,"[{'key': 'ga_session_id', 'value': {'string_va...",,,-69269786,,,2041489404.1572971,[],1573631825512551,"{'revenue': 0.0, 'currency': 'USD'}","{'category': 'desktop', 'mobile_brand_name': '...","{'continent': 'Europe', 'country': 'Sweden', '...",,"{'name': '(direct)', 'medium': '(none)', 'sour...",1619345481,WEB,,ga_session_id,"{'string_value': None, 'int_value': 1574290127..."
1,20191120,1574290128307942,session_start,"[{'key': 'ga_session_id', 'value': {'string_va...",,,-69269786,,,2041489404.1572971,[],1573631825512551,"{'revenue': 0.0, 'currency': 'USD'}","{'category': 'desktop', 'mobile_brand_name': '...","{'continent': 'Europe', 'country': 'Sweden', '...",,"{'name': '(direct)', 'medium': '(none)', 'sour...",1619345481,WEB,,engaged_session_event,"{'string_value': None, 'int_value': 1, 'float_..."
2,20191120,1574290128307942,session_start,"[{'key': 'ga_session_id', 'value': {'string_va...",,,-69269786,,,2041489404.1572971,[],1573631825512551,"{'revenue': 0.0, 'currency': 'USD'}","{'category': 'desktop', 'mobile_brand_name': '...","{'continent': 'Europe', 'country': 'Sweden', '...",,"{'name': '(direct)', 'medium': '(none)', 'sour...",1619345481,WEB,,page_location,{'string_value': 'https://hornstein.github.io/...
3,20191120,1574290128307942,session_start,"[{'key': 'ga_session_id', 'value': {'string_va...",,,-69269786,,,2041489404.1572971,[],1573631825512551,"{'revenue': 0.0, 'currency': 'USD'}","{'category': 'desktop', 'mobile_brand_name': '...","{'continent': 'Europe', 'country': 'Sweden', '...",,"{'name': '(direct)', 'medium': '(none)', 'sour...",1619345481,WEB,,ga_session_number,"{'string_value': None, 'int_value': 16, 'float..."
4,20191120,1574290128307942,session_start,"[{'key': 'ga_session_id', 'value': {'string_va...",,,-69269786,,,2041489404.1572971,[],1573631825512551,"{'revenue': 0.0, 'currency': 'USD'}","{'category': 'desktop', 'mobile_brand_name': '...","{'continent': 'Europe', 'country': 'Sweden', '...",,"{'name': '(direct)', 'medium': '(none)', 'sour...",1619345481,WEB,,page_title,{'string_value': 'Squadfree - Free bootstrap 3...


UNNEST performs a cross join between each row and its event_params, so we can now find our event_params at the far right of the result above and use them in our SELECT and WHERE clauses:

In [0]:
%%bigquery --project surfsapp-8011b df3

SELECT event_params.value.string_value FROM `surfsapp-8011b.analytics_164498740.events_20191120`, UNNEST(event_params) as event_params
WHERE event_params.key='page_title'

In [7]:
df3.head()

Unnamed: 0,string_value
0,Squadfree - Free bootstrap 3 one page template
1,Squadfree - Free bootstrap 3 one page template
2,Squadfree - Free bootstrap 3 one page template
3,Squadfree - Free bootstrap 3 one page template
4,Squadfree - Free bootstrap 3 one page template
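For intuition, roughly the same unnest-then-filter logic can be reproduced locally with pandas. This is only a sketch on a hypothetical two-event sample, not a replacement for running the query in BigQuery: `DataFrame.explode` plays the role of UNNEST, and a boolean mask plays the role of the WHERE clause.

```python
import pandas as pd

TITLE = "Squadfree - Free bootstrap 3 one page template"

# Hypothetical two-event sample shaped like the export (only two value fields kept).
events = pd.DataFrame({
    "event_name": ["session_start", "page_view"],
    "event_params": [
        [
            {"key": "page_title", "value": {"string_value": TITLE, "int_value": None}},
            {"key": "ga_session_number", "value": {"string_value": None, "int_value": 16}},
        ],
        [
            {"key": "page_title", "value": {"string_value": TITLE, "int_value": None}},
        ],
    ],
})

# One row per (event, parameter), similar to the cross join that UNNEST performs.
unnested = events.explode("event_params", ignore_index=True)
unnested["key"] = unnested["event_params"].map(lambda p: p["key"])
unnested["string_value"] = unnested["event_params"].map(
    lambda p: p["value"]["string_value"]
)

# Local equivalent of: WHERE event_params.key = 'page_title'
page_titles = unnested.loc[
    unnested["key"] == "page_title", ["event_name", "string_value"]
]
```

One difference to be aware of: for an event with an empty parameter list, `explode` keeps a row with a missing value, whereas the cross join with UNNEST drops that event from the result.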
