Setup
Skip this step if you already have a suitable project linked to a billing account. Otherwise:
Create a project to host your DataHem solution.
Enable billing for your project.
Note that all steps below that require running git, gcloud, or mvn, or making code changes, can be done in Google Cloud Shell (the tools are pre-installed) together with the Google Cloud Shell Editor. Otherwise, make sure you have the tools installed on your workstation.
This walk-through explains the setup of a Google Analytics/Measurement Protocol pipeline and hence requires you to set the variables below. Fill in the required ones, and set DF_REGION and DF_ZONE to keep data within the intended geographical region.
#DataHem generic settings
PROJECT_ID='' # Required. Your google project id. Example: 'my-prod-project'
VERSION='' # Required. DataHem version used. Example: '0.6'
#Dataflow settings
DF_REGION='' # Optional. Default: 'us-central1' Example: 'europe-west1'
DF_ZONE='' # Optional. An availability zone from the region set in DF_REGION. Default: ''. Example: 'europe-west1-b'
DF_NUM_WORKERS=1 # Optional. Default: Dataflow service will determine an appropriate number of workers. Example: 2
DF_MAX_NUM_WORKERS=1 # Optional. Default: Dataflow service will determine an appropriate number of workers. Example: 5
DF_DISK_SIZE_GB=30 # Optional. Default: Size defined in your Cloud Platform project. Minimum is 30. Example: 50
DF_WORKER_MACHINE_TYPE='n1-standard-1' # Optional. Default: The Dataflow service will choose the machine type based on your job. Example: 'n1-standard-1'
#Measurement Protocol Pipeline settings
STREAM_ID='' # Required. The GA tracking ID in lowercase alphanumeric format (dash removed). Example: 'ua123456789'
IGNORED_REFERERS_PATTERN='' # Required. Example: '.*(github.com|admin.datahem.org).*'
SEARCH_ENGINES_PATTERN='.*(www.google.|www.bing.|search.yahoo.).*' # Optional. Define search engine traffic with Java regex syntax. Default: '.*(www.google.|www.bing.|search.yahoo.).*'
SOCIAL_NETWORKS_PATTERN='.*(facebook.|instagram.|pinterest.|youtube.|linkedin.|twitter.).*' # Optional. Define social network traffic with Java regex syntax. Default: '.*(facebook.|instagram.|pinterest.|youtube.|linkedin.|twitter.).*'
INCLUDED_HOSTNAMES='' # Optional. Filter hits to only include defined hostnames with Java regex syntax. Default: '.*' Example: '.*(beta.datahem.org|www.datahem.org).*'
EXCLUDED_BOTS_PATTERN='.*(^$|bot|spider|crawler).*' # Optional. Filter out bot user agents with Java regex syntax. Default: '.*(^$|bot|spider|crawler).*'
SITE_SEARCH_PATTERN='.*q=(([^&#]*)|&|#|$)' # Optional. Define the site search URL parameter with Java regex syntax. Default: '.*q=(([^&#]*)|&|#|$)'
TIME_ZONE='' # Optional. Define the local time zone for the date field used for partitioning. Default: 'Etc/UTC' Example: 'Europe/Stockholm'
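For reference, a filled-in configuration might look like the following sketch; all values are hypothetical examples taken from the comments above, so replace them with your own:
# Example values only (replace with your own)
PROJECT_ID='my-prod-project'
VERSION='0.6'
DF_REGION='europe-west1'
DF_ZONE='europe-west1-b'
DF_NUM_WORKERS=1
DF_MAX_NUM_WORKERS=2
DF_DISK_SIZE_GB=30
DF_WORKER_MACHINE_TYPE='n1-standard-1'
STREAM_ID='ua123456789'
IGNORED_REFERERS_PATTERN='.*(github.com|admin.datahem.org).*'
SEARCH_ENGINES_PATTERN='.*(www.google.|www.bing.|search.yahoo.).*'
SOCIAL_NETWORKS_PATTERN='.*(facebook.|instagram.|pinterest.|youtube.|linkedin.|twitter.).*'
INCLUDED_HOSTNAMES='.*(beta.datahem.org|www.datahem.org).*'
EXCLUDED_BOTS_PATTERN='.*(^$|bot|spider|crawler).*'
SITE_SEARCH_PATTERN='.*q=(([^&#]*)|&|#|$)'
TIME_ZONE='Europe/Stockholm'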
Be careful when selecting the App Engine region since you can't change it later.
Set up App Engine:
gcloud app create \
--project=$PROJECT_ID \
--region=$DF_REGION
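App Engine uses its own location names (some differ from Dataflow region names), so you may want to confirm the region after creation. One quick way, not part of the original walk-through, is to describe the app and check its locationId:
gcloud app describe --project=$PROJECT_ID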
Clone the collector repository and change directory to the cloned folder
mkdir ~/datahem
cd ~/datahem
git clone https://github.com/mhlabs/datahem.collector.git
cd datahem.collector
Generate the OpenAPI specification, deploy it as a Cloud Endpoints service, and deploy the collector to App Engine:
mvn endpoints-framework:openApiDocs -Dendpoints.project.id=$PROJECT_ID
gcloud endpoints services deploy target/openapi-docs/openapi.json
mvn appengine:deploy -Dendpoints.project.id=$PROJECT_ID
Use curl to check that the collector is up, listening on both POST and GET, and responding with a 204.
curl \
-H "Content-Type: application/json" \
-X POST \
-d '{"payload": "echo"}' \
"https://$PROJECT_ID.appspot.com/_ah/api/measurementprotocol/v1/collect/$STREAM_ID" -i
curl \
-H "Content-Type: text/plain" \
-X GET \
"https://$PROJECT_ID.appspot.com/_ah/api/measurementprotocol/v1/collect/$STREAM_ID?v=1" -i
Enable the Deployment Manager API:
gcloud services enable deploymentmanager.googleapis.com
Clone the datahem.infrastructor repository, change directory, and run the deployment commands below to enable the necessary APIs and set up a Cloud Storage bucket for the Dataflow files.
cd ~/datahem
git clone https://github.com/mhlabs/datahem.infrastructor.git
cd datahem.infrastructor/python
gcloud deployment-manager deployments create setup-apis --config setup-apis.yaml
gcloud deployment-manager deployments create setup-processor --config setup-processor-resources.yaml
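To verify that both deployments completed without errors, you can list them (a quick sanity check, not required by the walk-through):
gcloud deployment-manager deployments list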
Set up the necessary dataset and Pub/Sub topics and subscriptions for each data source you want to add by running the command below.
gcloud deployment-manager deployments create ga-property-$STREAM_ID --template add-streaming-source.py --properties streamId:$STREAM_ID
Check that the operation completed successfully and that the following resources are listed in the console:
NAME                                        TYPE                    STATE      ERRORS  INTENT
bigquery-dataset-ua123456789-entities       bigquery.v2.dataset     COMPLETED  []
pubsub-subscription-ua123456789-backup      pubsub.v1.subscription  COMPLETED  []
pubsub-subscription-ua123456789-processor   pubsub.v1.subscription  COMPLETED  []
pubsub-topic-ua123456789                    pubsub.v1.topic         COMPLETED  []
pubsub-topic-ua123456789-entities           pubsub.v1.topic         COMPLETED  []
Clone the datahem.processor repository and change directory to it:
cd ~/datahem
git clone https://github.com/mhlabs/datahem.processor.git
cd datahem.processor
# Create template
mvn compile exec:java \
-Dexec.mainClass=org.datahem.processor.measurementprotocol.MeasurementProtocolPipeline \
-Dexec.args="--runner=DataflowRunner \
--project=$PROJECT_ID \
--zone=$DF_ZONE \
--region=$DF_REGION \
--stagingLocation=gs://$PROJECT_ID-processor/$VERSION/org/datahem/processor/staging \
--templateLocation=gs://$PROJECT_ID-processor/$VERSION/org/datahem/processor/measurementprotocol/MeasurementProtocolPipeline \
--workerMachineType=$DF_WORKER_MACHINE_TYPE \
--diskSizeGb=$DF_DISK_SIZE_GB"
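To confirm that the template was staged, you can list the GCS path (gsutil is available in Cloud Shell; the path below simply mirrors the --templateLocation used above):
gsutil ls gs://$PROJECT_ID-processor/$VERSION/org/datahem/processor/measurementprotocol/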
# Run template
gcloud beta dataflow jobs run $STREAM_ID-processor \
--gcs-location gs://$PROJECT_ID-processor/$VERSION/org/datahem/processor/measurementprotocol/MeasurementProtocolPipeline \
--zone=$DF_ZONE \
--region=$DF_REGION \
--max-workers=$DF_MAX_NUM_WORKERS \
--parameters \
pubsubTopic=projects/$PROJECT_ID/topics/$STREAM_ID-entities,\
pubsubSubscription=projects/$PROJECT_ID/subscriptions/$STREAM_ID-processor,\
bigQueryTableSpec=$STREAM_ID.entities,\
ignoredReferersPattern=$IGNORED_REFERERS_PATTERN,\
searchEnginesPattern=$SEARCH_ENGINES_PATTERN,\
socialNetworksPattern=$SOCIAL_NETWORKS_PATTERN,\
includedHostnamesPattern=$INCLUDED_HOSTNAMES,\
excludedBotsPattern=$EXCLUDED_BOTS_PATTERN,\
siteSearchPattern=$SITE_SEARCH_PATTERN,\
timeZone=$TIME_ZONE
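To confirm that the processor job started, you can list active Dataflow jobs in the region (a quick check, not part of the original instructions):
gcloud dataflow jobs list --region=$DF_REGION --status=active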
Read more about how to process Google Analytics Data (Measurement Protocol)
Next, set up a pipeline that stores the unprocessed data for backup and reprocessing.
# Create template
mvn compile exec:java \
-Dexec.mainClass=org.datahem.processor.pubsub.backup.PubSubBackupPipeline \
-Dexec.args="--runner=DataflowRunner \
--project=$PROJECT_ID \
--zone=$DF_ZONE \
--region=$DF_REGION \
--stagingLocation=gs://$PROJECT_ID-processor/$VERSION/org/datahem/processor/staging \
--templateLocation=gs://$PROJECT_ID-processor/$VERSION/org/datahem/processor/pubsub/backup/PubSubBackupPipeline \
--workerMachineType=$DF_WORKER_MACHINE_TYPE \
--diskSizeGb=$DF_DISK_SIZE_GB"
# Run template
gcloud beta dataflow jobs run $STREAM_ID-backup \
--gcs-location gs://$PROJECT_ID-processor/$VERSION/org/datahem/processor/pubsub/backup/PubSubBackupPipeline \
--zone=$DF_ZONE \
--region=$DF_REGION \
--max-workers=$DF_MAX_NUM_WORKERS \
--parameters pubsubSubscription=projects/$PROJECT_ID/subscriptions/$STREAM_ID-backup,\
bigQueryTableSpec=backup.$STREAM_ID
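Once events start flowing, a quick way to verify that rows are landing in the backup table is a count query with the bq CLI (this assumes the table ends up as backup.$STREAM_ID, matching the bigQueryTableSpec above):
# assumes dataset "backup" and a table named after $STREAM_ID
bq query --use_legacy_sql=false "SELECT COUNT(*) AS row_count FROM \`$PROJECT_ID.backup.$STREAM_ID\`"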
Instructions to implement the tracker using Google Tag Manager. The DataHem tracker supports multiple GA trackers and collector endpoints (useful if you want to send data to separate test and prod environments).
Sign in to Google Tag Manager and enter the correct container.
Create a new User-Defined Variable of type Custom JavaScript, name it "GA CustomTask", and copy/paste the code from customtask.js.
Create a new User-Defined Variable of type Constant, name it "datahem collector endpoints", and assign it the value of one or more collector servlet URLs (comma-separated if more than one) that you configured in the collector section above. Replace $PROJECT_ID with your project ID, as in the example below the URL template.
https://$PROJECT_ID.appspot.com/_ah/api/measurementprotocol/v1/collect/
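For example, with separate test and prod projects (hypothetical project IDs), the constant could hold a comma-separated value like:
https://my-test-project.appspot.com/_ah/api/measurementprotocol/v1/collect/,https://my-prod-project.appspot.com/_ah/api/measurementprotocol/v1/collect/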
Edit your GA Settings User-Defined Variable (explained by Simo Ahava) by adding an entry under "Fields to set": name it "customTask" and give it the value "{{GA CustomTask}}" to reference the User-Defined Variable created above.
Enter Google Tag Manager Preview Mode, visit your site, and open the browser console by pressing Ctrl+Shift+J (Google Chrome on Windows/Linux). Select the "Network" tab, filter on "collect", and reload the page; you should see a request sent to Google and to your own endpoint(s). Make sure your requests receive a status 200 or 204, which means your collector has received the request and responded to it. If not, verify that the request is sent to the URL used by your collector.
Publish the Google Tag Manager container version and repeat the check above to verify that the collector also receives a 204 from the published (live) version.
When you have a table in BigQuery named $STREAM_ID.entities, execute:
cd ~/datahem/datahem.infrastructor/python
gcloud deployment-manager deployments create bigquery-$STREAM_ID-view --template create-bigquery-view-template.py --properties streamId:$STREAM_ID
This should result in:
NAME                                            TYPE               STATE      ERRORS  INTENT
bigquery-dataset-ua123456789-view-event         bigquery.v2.table  COMPLETED  []
bigquery-dataset-ua123456789-view-exception     bigquery.v2.table  COMPLETED  []
bigquery-dataset-ua123456789-view-impression    bigquery.v2.table  COMPLETED  []
bigquery-dataset-ua123456789-view-pageview      bigquery.v2.table  COMPLETED  []
bigquery-dataset-ua123456789-view-product       bigquery.v2.table  COMPLETED  []
bigquery-dataset-ua123456789-view-promotion     bigquery.v2.table  COMPLETED  []
bigquery-dataset-ua123456789-view-search        bigquery.v2.table  COMPLETED  []
bigquery-dataset-ua123456789-view-social        bigquery.v2.table  COMPLETED  []
bigquery-dataset-ua123456789-view-timing        bigquery.v2.table  COMPLETED  []
bigquery-dataset-ua123456789-view-traffic       bigquery.v2.table  COMPLETED  []
bigquery-dataset-ua123456789-view-transaction   bigquery.v2.table  COMPLETED  []
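To see the exact table and view names created in the dataset, you can list its contents with the bq CLI (a quick check, not part of the original walk-through):
bq ls $PROJECT_ID:$STREAM_ID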
Most of the work is already done. Change $STREAM_ID to the new uaxxxxxxxxx tracking ID/property ID and adjust the other Measurement Protocol processing settings (ignored referers, hostnames, etc.) as described in the settings section above.
Then follow these steps:
cd ~/datahem/datahem.infrastructor/python
gcloud deployment-manager deployments create ga-property-$STREAM_ID --template add-streaming-source.py --properties streamId:$STREAM_ID
cd ~/datahem/datahem.processor
# Run measurement protocol pipeline template
gcloud beta dataflow jobs run $STREAM_ID-processor \
--gcs-location gs://$PROJECT_ID-processor/$VERSION/org/datahem/processor/measurementprotocol/MeasurementProtocolPipeline \
--zone=$DF_ZONE \
--region=$DF_REGION \
--max-workers=$DF_MAX_NUM_WORKERS \
--parameters \
pubsubTopic=projects/$PROJECT_ID/topics/$STREAM_ID-entities,\
pubsubSubscription=projects/$PROJECT_ID/subscriptions/$STREAM_ID-processor,\
bigQueryTableSpec=$STREAM_ID.entities,\
ignoredReferersPattern=$IGNORED_REFERERS_PATTERN,\
searchEnginesPattern=$SEARCH_ENGINES_PATTERN,\
socialNetworksPattern=$SOCIAL_NETWORKS_PATTERN,\
includedHostnamesPattern=$INCLUDED_HOSTNAMES,\
excludedBotsPattern=$EXCLUDED_BOTS_PATTERN,\
siteSearchPattern=$SITE_SEARCH_PATTERN,\
timeZone=$TIME_ZONE
# Run pubsub backup pipeline template
gcloud beta dataflow jobs run $STREAM_ID-backup \
--gcs-location gs://$PROJECT_ID-processor/$VERSION/org/datahem/processor/pubsub/backup/PubSubBackupPipeline \
--zone=$DF_ZONE \
--region=$DF_REGION \
--max-workers=$DF_MAX_NUM_WORKERS \
--parameters pubsubSubscription=projects/$PROJECT_ID/subscriptions/$STREAM_ID-backup,\
bigQueryTableSpec=backup.$STREAM_ID
# When you have a table in BigQuery named $STREAM_ID.entities, create the BigQuery views:
cd ~/datahem/datahem.infrastructor/python
gcloud deployment-manager deployments create bigquery-$STREAM_ID-view --template create-bigquery-view-template.py --properties streamId:$STREAM_ID
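If you track several GA properties, the per-stream Deployment Manager step can be wrapped in a simple loop. The sketch below uses hypothetical stream IDs and assumes you are in the datahem.infrastructor/python directory; the Dataflow jobs would still be started per stream as shown above.
# hypothetical stream IDs; replace with your own properties
for STREAM_ID in ua123456789 ua987654321; do
  gcloud deployment-manager deployments create ga-property-$STREAM_ID \
    --template add-streaming-source.py \
    --properties streamId:$STREAM_ID
done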