snowplow-google-analytics-enrich-lambda

Lambda function to enrich data collected through the Google Analytics Snowplow plugin.

Python-based function for AWS Lambda.

The function parses CloudFront logs of the requests made by the Snowplow pixel tracker.

The Lambda should trigger on any object creation in the bucket that holds the CloudFront logs. Log files must be in the RAW folder.

Enriched and processed logs are written to the same bucket, in the Converted folder.
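
The RAW-to-Converted mapping can be sketched in Python as follows. This is only an illustration, not the code shipped in this repository: the function name output_keys, the exact prefix spellings "RAW/" and "Converted/", and the choice to keep the original file name are all assumptions. The .gz suffix check mirrors the trigger filter described in the deployment section.

```python
from urllib.parse import unquote_plus

RAW_PREFIX = "RAW/"              # assumed spelling of the input folder
CONVERTED_PREFIX = "Converted/"  # assumed spelling of the output folder


def output_keys(event):
    """Return the Converted/ key for every RAW/*.gz object in an S3 event.

    Hypothetical helper: it only computes target keys and does no S3 I/O.
    """
    keys = []
    for record in event.get("Records", []):
        # S3 event notifications URL-encode object keys, so decode first.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(RAW_PREFIX) and key.endswith(".gz"):
            # Assumption: the output keeps the same file name as the input.
            keys.append(CONVERTED_PREFIX + key[len(RAW_PREFIX):])
    return keys
```

Objects outside the RAW folder, including the function's own output in Converted, are skipped, so the function would not react to the files it writes itself.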

Requirements

Terraform 0.12
The AWS user should have the following recommended permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "iam:GetPolicyVersion",
                "glue:DeleteDatabase",
                "iam:DeletePolicy",
                "iam:CreateRole",
                "iam:AttachRolePolicy",
                "athena:*",
                "iam:ListInstanceProfilesForRole",
                "cloudfront:GetDistribution",
                "iam:PassRole",
                "iam:DetachRolePolicy",
                "iam:ListAttachedRolePolicies",
                "cloudfront:UpdateDistribution",
                "iam:GetRole",
                "iam:GetPolicy",
                "glue:GetTables",
                "s3:*",
                "cloudfront:TagResource",
                "iam:DeleteRole",
                "cloudfront:CreateDistribution",
                "glue:GetDatabases",
                "iam:CreatePolicy",
                "glue:GetDatabase",
                "iam:ListPolicyVersions",
                "cloudfront:ListTagsForResource",
                "glue:CreateDatabase",
                "lambda:*",
                "cloudfront:DeleteDistribution"
            ],
            "Resource": "*"
        }
    ]
}

Configuring the Terraform script

The file variables.tf contains all configurable variables for the script:

env - Service tag. May be used as a billing report tag.
creator - Personalization tag.
website - FQDN of the website to be tracked.
access_key - AWS user access key.
primary_domain - CloudFront distribution CNAME.
secret_key - AWS user secret key.
region - AWS region.

Deploying infrastructure

Inside the repository directory, run:

terraform init
terraform apply

Terraform will create:

3 buckets:
1. With the lt-src suffix. Publicly readable. Contains the 1x1 pixel image requested by Snowplow GET requests.
2. With the lt-logs suffix. Used for storing CloudFront logs under the RAW prefix, enriched Snowplow data under the Converted prefix, and the MaxMind GeoLite2 database.
3. With the lt-ath suffix. Used for storing Athena query results.
A CloudFront distribution with the lt-src bucket as its origin and the lt-logs bucket for log storage.
A Lambda function which triggers on any object creation in the lt-logs bucket with prefix RAW and suffix .gz.
An Athena workgroup with the suffix wg.
An Athena database with the prefix eventsdb.
An Athena saved query named events.

To complete the infrastructure deployment, run the saved Athena query in the created workgroup. It creates the table with the enriched Snowplow events.

Configuring the Snowplow Google Analytics plugin

The Snowplow pixel GA plugin, adapted to work with CloudFront, looks like this:

function() {
  var endpoint = 'https://d28zcvgo2jno01.cloudfront.net/i';
  return function(model) {
    var globalSendTaskName = '_' + model.get('trackingId') + '_sendHitTask';
    // Keep a reference to the original sendHitTask so hits still reach Google Analytics.
    var originalSendHitTask = window[globalSendTaskName] = window[globalSendTaskName] || model.get('sendHitTask');
    model.set('sendHitTask', function(sendModel) {
      var payload = sendModel.get('hitPayload');
      originalSendHitTask(sendModel);
      // Replay the same payload as a GET query string against the CloudFront pixel.
      var request = new XMLHttpRequest();
      var path = endpoint + '?' + payload;
      request.open('GET', path, true);
      request.setRequestHeader('Content-type', 'text/plain; charset=UTF-8');
      request.send(payload);
    });
  };
}

Change endpoint to the domain name of the created CloudFront distribution and add the code to the pages you want to track with Snowplow.

How it works

All hit data coming from Google Tag Manager is compiled into a single GET request. The pixel tracker sends that request to the CloudFront distribution, and CloudFront logs the request string.

The log file is written to the lt-logs bucket. The Lambda function then processes the new data, enriches it according to the Google Analytics Measurement Protocol, and puts the enriched data in the Converted folder.
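
As a rough illustration of the parsing step, the sketch below reads the text of a decompressed CloudFront standard log and pulls the Measurement Protocol parameters out of each request's query string. It is not the repository's implementation: the function names parse_cloudfront_log and hit_params are hypothetical, and any real enrichment (for example the GeoLite2 lookup) is left out. It relies only on the log's own #Fields header, so it does not hard-code column positions.

```python
from urllib.parse import parse_qsl


def parse_cloudfront_log(text):
    """Yield one dict per logged request, keyed by the names in the #Fields header."""
    fields = []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            # Field names are separated by spaces in the header line.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#"):
            continue  # skip blank lines and other header lines such as #Version
        # Data lines are tab-separated.
        yield dict(zip(fields, line.split("\t")))


def hit_params(record):
    """Decode the logged query string into Measurement Protocol parameters."""
    query = record.get("cs-uri-query", "-")
    if query == "-":  # CloudFront writes "-" when a field is empty
        return {}
    return dict(parse_qsl(query, keep_blank_values=True))
```

A caller would typically decompress the .gz object first (for instance with gzip.decompress) and then pass the decoded text to parse_cloudfront_log. Parameters such as tid, cid, t and dp then appear as plain dictionary keys.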

Infrastructure termination

To delete the infrastructure, run inside the repository directory:

terraform destroy

Some components may fail to delete because they contain data that was not created by Terraform. Delete these components manually.
