
1. Create a project and enable billing

Skip this step if you already have a suitable project linked to a billing account.

Create a project to host your DataHem solution.

Enable billing for your project.

2. Requirements

Note that all steps below that require running git, gcloud, or mvn, or making code changes, can be done through Google Cloud Shell (the tools are already installed) and the Google Cloud Shell Editor. Otherwise, make sure you have the tools installed on your workstation.
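
If you work on your own workstation instead, a quick way to check that the tools are in place is to print their versions:

git --version
gcloud version
mvn -version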

3. Set datahem variables

This walk-through explains the setup of a Google Analytics/Measurement Protocol pipeline and hence requires you to set the variables below. Make sure you fill in the required ones, and set DF_REGION and DF_ZONE to ensure that data stays within the intended geographical region.

#DataHem generic settings
PROJECT_ID='' # Required. Your google project id. Example: 'my-prod-project'
VERSION='' # Required. DataHem version used. Example: '0.6'

#Dataflow settings
DF_REGION='' # Optional. Default: 'us-central1' Example: 'europe-west1'
DF_ZONE='' # Optional. An availability zone from the region set in DF_REGION. Default: ''. Example: 'europe-west1-b'
DF_NUM_WORKERS=1 # Optional. Default: Dataflow service will determine an appropriate number of workers. Example: 2
DF_MAX_NUM_WORKERS=1 # Optional. Default: Dataflow service will determine an appropriate number of workers. Example: 5
DF_DISK_SIZE_GB=30 # Optional. Default: Size defined in your Cloud Platform project. Minimum is 30. Example: 50
DF_WORKER_MACHINE_TYPE='n1-standard-1' # Optional. Default: The Dataflow service will choose the machine type based on your job. Example: 'n1-standard-1'

#Measurement Protocol Pipeline settings
STREAM_ID='' # Required. Lowercase and alphanumeric format of the GA tracking ID. Example: 'ua123456789'
IGNORED_REFERERS_PATTERN='' # Required. Example: '.*(github.com|admin.datahem.org).*'
SEARCH_ENGINES_PATTERN='.*(www.google.|www.bing.|search.yahoo.).*' # Optional. Define search engine traffic with Java regex syntax. Default: '.*(www.google.|www.bing.|search.yahoo.).*'
SOCIAL_NETWORKS_PATTERN='.*(facebook.|instagram.|pinterest.|youtube.|linkedin.|twitter.).*' # Optional. Define social network traffic with Java regex syntax. Default: '.*(facebook.|instagram.|pinterest.|youtube.|linkedin.|twitter.).*'
INCLUDED_HOSTNAMES='' # Optional. Filter hits to only include defined hostnames with Java regex syntax. Default: '.*' Example: '.*(beta.datahem.org|www.datahem.org).*'
EXCLUDED_BOTS_PATTERN='.*(^$|bot|spider|crawler).*' # Optional. Filter out bot user agents with Java regex syntax. Default: '.*(^$|bot|spider|crawler).*'
SITE_SEARCH_PATTERN='.*q=(([^&#]*)|&|#|$)' # Optional. Define the site search URL parameter with Java regex syntax. Default: '.*q=(([^&#]*)|&|#|$)'
TIME_ZONE='' # Optional. Define the local time zone used for the partitioning date field. Default: 'Etc/UTC' Example: 'Europe/Stockholm'
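
For reference, a filled-in configuration for a hypothetical European setup could look like the following (all values are placeholders taken from the examples above; replace them with your own):

PROJECT_ID='my-prod-project'
VERSION='0.6'
DF_REGION='europe-west1'
DF_ZONE='europe-west1-b'
STREAM_ID='ua123456789'
IGNORED_REFERERS_PATTERN='.*(github.com|admin.datahem.org).*'
TIME_ZONE='Europe/Stockholm'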

4. Collector

4.1. Create app

Be careful when selecting the App Engine region, since you can't change it later.

Set up App Engine.

gcloud app create \
--project=$PROJECT_ID \
--region=$DF_REGION
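
Optionally, verify that the app was created and ended up in the intended region:

gcloud app describe --project=$PROJECT_ID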

4.2. Download Collector

Clone the collector repository and change directory to the cloned folder:

mkdir ~/datahem
cd ~/datahem
git clone https://github.com/mhlabs/datahem.collector.git
cd datahem.collector

4.3. Generate the OpenAPI document, configure Cloud Endpoints and deploy the collector

mvn endpoints-framework:openApiDocs -Dendpoints.project.id=$PROJECT_ID 
gcloud endpoints services deploy target/openapi-docs/openapi.json 
mvn appengine:deploy -Dendpoints.project.id=$PROJECT_ID
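
To confirm that the Cloud Endpoints service was deployed, you can list the endpoint services in your project:

gcloud endpoints services list --project=$PROJECT_ID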

4.4. Test collector

Check that the collector is up, listening on both POST and GET, and responding with a 204 by executing curl:

curl \
    -H "Content-Type: application/json" \
    -X POST \
    -d '{"payload": "echo"}' \
    "https://$PROJECT_ID.appspot.com/_ah/api/measurementprotocol/v1/collect/$STREAM_ID" -i

curl \
    -H "Content-Type: text/plain" \
    -X GET \
    "https://$PROJECT_ID.appspot.com/_ah/api/measurementprotocol/v1/collect/$STREAM_ID?v=1" -i

5. Infrastructor

5.1. Enable Google Deployment Manager on your project

gcloud services enable deploymentmanager.googleapis.com
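
You can verify that the API is enabled by listing the project's enabled services:

gcloud services list --enabled --project=$PROJECT_ID | grep deploymentmanager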

5.2. Download Infrastructor

Clone the datahem.infrastructor repository, change directory, and run the commands below to enable the necessary APIs and set up a Cloud Storage bucket for Dataflow files:

cd ~/datahem
git clone https://github.com/mhlabs/datahem.infrastructor.git
cd datahem.infrastructor/python
gcloud deployment-manager deployments create setup-apis --config setup-apis.yaml
gcloud deployment-manager deployments create setup-processor --config setup-processor-resources.yaml
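
Both deployments should now show up when you list the project's Deployment Manager deployments:

gcloud deployment-manager deployments list --project=$PROJECT_ID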

5.3. Setup stream infrastructure

Set up the necessary dataset and Pub/Sub topics and subscriptions for each of the data sources you want to add by running the command below.

gcloud deployment-manager deployments create ga-property-$STREAM_ID --template add-streaming-source.py --properties streamId:$STREAM_ID

Check that the operation completed successfully and that the following resources are listed in the console:

NAME                                         TYPE                    STATE      ERRORS  INTENT
bigquery-dataset-ua123456789-entities      bigquery.v2.dataset     COMPLETED  []
pubsub-subscription-ua123456789-backup     pubsub.v1.subscription  COMPLETED  []
pubsub-subscription-ua123456789-processor  pubsub.v1.subscription  COMPLETED  []
pubsub-topic-ua123456789                   pubsub.v1.topic         COMPLETED  []
pubsub-topic-ua123456789-entities          pubsub.v1.topic         COMPLETED  []
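
If you prefer the command line over the console, the same resources can be listed for the deployment created above:

gcloud deployment-manager resources list --deployment ga-property-$STREAM_ID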

6. Processor

6.1 Clone the datahem.processor repository

cd ~/datahem
git clone https://github.com/mhlabs/datahem.processor.git
cd datahem.processor

6.2 Measurement protocol pipeline

6.2.1 Create a measurement protocol pipeline job template.

# Create template
mvn compile exec:java \
     -Dexec.mainClass=org.datahem.processor.measurementprotocol.MeasurementProtocolPipeline \
     -Dexec.args="--runner=DataflowRunner \
                  --project=$PROJECT_ID \
                  --zone=$DF_ZONE \
                  --region=$DF_REGION \
                  --stagingLocation=gs://$PROJECT_ID-processor/$VERSION/org/datahem/processor/staging \
                  --templateLocation=gs://$PROJECT_ID-processor/$VERSION/org/datahem/processor/measurementprotocol/MeasurementProtocolPipeline \
                  --workerMachineType=$DF_WORKER_MACHINE_TYPE \
                  --diskSizeGb=$DF_DISK_SIZE_GB"
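
When the build finishes, you can check that the template was written to the processor bucket (the path below mirrors the --templateLocation argument above):

gsutil ls gs://$PROJECT_ID-processor/$VERSION/org/datahem/processor/measurementprotocol/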

6.2.2 Execute the template.

# Run template
gcloud beta dataflow jobs run $STREAM_ID-processor \
--gcs-location gs://$PROJECT_ID-processor/$VERSION/org/datahem/processor/measurementprotocol/MeasurementProtocolPipeline \
--zone=$DF_ZONE \
--region=$DF_REGION \
--max-workers=$DF_MAX_NUM_WORKERS \
--parameters \
pubsubTopic=projects/$PROJECT_ID/topics/$STREAM_ID-entities,\
pubsubSubscription=projects/$PROJECT_ID/subscriptions/$STREAM_ID-processor,\
bigQueryTableSpec=$STREAM_ID.entities,\
ignoredReferersPattern=$IGNORED_REFERERS_PATTERN,\
searchEnginesPattern=$SEARCH_ENGINES_PATTERN,\
socialNetworksPattern=$SOCIAL_NETWORKS_PATTERN,\
includedHostnamesPattern=$INCLUDED_HOSTNAMES,\
excludedBotsPattern=$EXCLUDED_BOTS_PATTERN,\
siteSearchPattern=$SITE_SEARCH_PATTERN,\
timeZone=$TIME_ZONE
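
The processor should now appear as an active streaming job. To check its status from the command line (assuming your gcloud version supports the --region flag on this command):

gcloud dataflow jobs list --region=$DF_REGION --status=active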

Read more about how to process Google Analytics Data (Measurement Protocol)

6.3 Pubsub backup pipeline

Steps to set up a pipeline to store the unprocessed data for backup and reprocessing.

6.3.1 Create pubsub backup pipeline for the measurement protocol data

# Create template
mvn compile exec:java \
     -Dexec.mainClass=org.datahem.processor.pubsub.backup.PubSubBackupPipeline \
     -Dexec.args="--runner=DataflowRunner \
                  --project=$PROJECT_ID \
                  --zone=$DF_ZONE \
                  --region=$DF_REGION \
                  --stagingLocation=gs://$PROJECT_ID-processor/$VERSION/org/datahem/processor/staging \
                  --templateLocation=gs://$PROJECT_ID-processor/$VERSION/org/datahem/processor/pubsub/backup/PubSubBackupPipeline \
                  --workerMachineType=$DF_WORKER_MACHINE_TYPE \
                  --diskSizeGb=$DF_DISK_SIZE_GB"

6.3.2 Execute backup pipeline

# Run template
gcloud beta dataflow jobs run $STREAM_ID-backup \
--gcs-location gs://$PROJECT_ID-processor/$VERSION/org/datahem/processor/pubsub/backup/PubSubBackupPipeline \
--zone=$DF_ZONE \
--region=$DF_REGION \
--max-workers=$DF_MAX_NUM_WORKERS \
--parameters pubsubSubscription=projects/$PROJECT_ID/subscriptions/$STREAM_ID-backup,\
bigQueryTableSpec=backup.$STREAM_ID
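
Once traffic is flowing, the backup table should start receiving rows. A quick way to peek at it with the bq tool (assuming the backup dataset was created by the setup-processor deployment above):

bq ls backup
bq head -n 5 backup.$STREAM_ID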

Stream data to backup

7. Tracker

Instructions to implement the tracker using Google Tag Manager. The DataHem tracker supports multiple GA trackers and collector endpoints (useful if you want to send data to separate test and prod environments).

7.1. Sign in

Sign in to Google Tag Manager and enter the correct container.

7.2. Create GA CustomTask variable

Create a new User-Defined Variable of type Custom JavaScript, name it "GA CustomTask", and copy/paste the code from customtask.js.

7.3. Create datahem collector endpoints variable

Create a new User-Defined Variable of type Constant, name it "datahem collector endpoints", and assign it the value of one or more collector servlet URLs (comma-separated if more than one) that you have configured in the collector section above. Replace $PROJECT_ID with your project ID.

https://$PROJECT_ID.appspot.com/_ah/api/measurementprotocol/v1/collect/
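
For example, if you run separate test and production collectors, the constant could hold two comma-separated endpoints (the project IDs below are hypothetical):

https://my-test-project.appspot.com/_ah/api/measurementprotocol/v1/collect/,https://my-prod-project.appspot.com/_ah/api/measurementprotocol/v1/collect/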

7.4. Edit GA Settings variable

Edit your User-Defined Variable GA Settings (explained by Simo Ahava) by adding an entry under "Fields to set": name it "customTask" and give it the value "{{GA CustomTask}}" to reference the User-Defined Variable created in step 7.2 above.

7.5. Preview

Enter the Google Tag Manager Preview Mode, visit your site, and open the console in your browser by pressing Ctrl+Shift+J (Google Chrome on Windows/Linux). Select the "Network" tab, filter on "collect", and reload the web page; you should see one request sent to Google and one to each of your own endpoints. Make sure you receive a status 200 or 204 on your requests, which means your collector has received the request and responded to it. If not, make sure the request is sent to the URL used by your collector.

7.6 Publish

Publish the Google Tag Manager version and repeat step 7.5 to check that the collector also responds with a 204 for the published live version.

8. Analyze in Google BigQuery

8.1. Create views for the measurement protocol entities table

When you have a table in BigQuery named $STREAM_ID.entities, execute:

cd ~/datahem/datahem.infrastructor/python
gcloud deployment-manager deployments create bigquery-$STREAM_ID-view --template create-bigquery-view-template.py --properties streamId:$STREAM_ID

This should result in:

NAME                                           TYPE               STATE      ERRORS  INTENT
bigquery-dataset-ua123456789-view-event        bigquery.v2.table  COMPLETED  []
bigquery-dataset-ua123456789-view-exception    bigquery.v2.table  COMPLETED  []
bigquery-dataset-ua123456789-view-impression   bigquery.v2.table  COMPLETED  []
bigquery-dataset-ua123456789-view-pageview     bigquery.v2.table  COMPLETED  []
bigquery-dataset-ua123456789-view-product      bigquery.v2.table  COMPLETED  []
bigquery-dataset-ua123456789-view-promotion    bigquery.v2.table  COMPLETED  []
bigquery-dataset-ua123456789-view-search       bigquery.v2.table  COMPLETED  []
bigquery-dataset-ua123456789-view-social       bigquery.v2.table  COMPLETED  []
bigquery-dataset-ua123456789-view-timing       bigquery.v2.table  COMPLETED  []
bigquery-dataset-ua123456789-view-traffic      bigquery.v2.table  COMPLETED  []
bigquery-dataset-ua123456789-view-transaction  bigquery.v2.table  COMPLETED  []
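
You can also verify the views with the bq tool by listing the contents of the stream's dataset (assuming the dataset ID equals $STREAM_ID):

bq ls $STREAM_ID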

9. Reports in Google Data Studio

10. Add another GA property

Most of the work is already done. Change $STREAM_ID to the new uaxxxxxxxxx tracking ID/property ID and adjust the other Measurement Protocol processing settings (ignored referrers, hostnames, etc.) as in step 3 above.

Then run the following:

cd ~/datahem/datahem.infrastructor/python

gcloud deployment-manager deployments create ga-property-$STREAM_ID --template add-streaming-source.py --properties streamId:$STREAM_ID

cd ~/datahem/datahem.processor

# Run measurement protocol pipeline template
gcloud beta dataflow jobs run $STREAM_ID-processor \
--gcs-location gs://$PROJECT_ID-processor/$VERSION/org/datahem/processor/measurementprotocol/MeasurementProtocolPipeline \
--zone=$DF_ZONE \
--region=$DF_REGION \
--max-workers=$DF_MAX_NUM_WORKERS \
--parameters \
pubsubTopic=projects/$PROJECT_ID/topics/$STREAM_ID-entities,\
pubsubSubscription=projects/$PROJECT_ID/subscriptions/$STREAM_ID-processor,\
bigQueryTableSpec=$STREAM_ID.entities,\
ignoredReferersPattern=$IGNORED_REFERERS_PATTERN,\
searchEnginesPattern=$SEARCH_ENGINES_PATTERN,\
socialNetworksPattern=$SOCIAL_NETWORKS_PATTERN,\
includedHostnamesPattern=$INCLUDED_HOSTNAMES,\
excludedBotsPattern=$EXCLUDED_BOTS_PATTERN,\
siteSearchPattern=$SITE_SEARCH_PATTERN,\
timeZone=$TIME_ZONE

# Run pubsub backup pipeline template
gcloud beta dataflow jobs run $STREAM_ID-backup \
--gcs-location gs://$PROJECT_ID-processor/$VERSION/org/datahem/processor/pubsub/backup/PubSubBackupPipeline \
--zone=$DF_ZONE \
--region=$DF_REGION \
--max-workers=$DF_MAX_NUM_WORKERS \
--parameters pubsubSubscription=projects/$PROJECT_ID/subscriptions/$STREAM_ID-backup,\
bigQueryTableSpec=backup.$STREAM_ID

# When you have a table in BigQuery named $STREAM_ID.entities, execute:

gcloud deployment-manager deployments create bigquery-$STREAM_ID-view --template create-bigquery-view-template.py --properties streamId:$STREAM_ID