ownyourbusinessdata/snowplow-google-analytics-enrich-lambda

snowplow-google-analytics-enrich-lambda

Lambda function to enrich data collected through Google Analytics Snowplow plugin

Python-based function for AWS Lambda.

The function parses CloudFront logs of the requests made by the Snowplow pixel tracker.

The Lambda should trigger on any object creation in the bucket that holds the CloudFront logs. Log files must be in the RAW folder.

Enriched and processed logs are written to the same bucket, in the Converted folder.
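
The RAW-to-Converted flow above can be sketched as follows. This is a minimal illustration, not the repository's handler: the helper names `converted_key` and `objects_to_process` are hypothetical, and the exact key layout (`RAW/` and `Converted/` with a trailing slash, same file name in both folders) is an assumption. Only the S3 event structure (`Records[].s3.bucket.name`, `Records[].s3.object.key`) is the standard notification format.

```python
from urllib.parse import unquote_plus

RAW_PREFIX = "RAW/"              # folder named in this README (slash is assumed)
CONVERTED_PREFIX = "Converted/"  # folder named in this README (slash is assumed)


def converted_key(raw_key):
    """Map RAW/<name> to Converted/<name>; return None for keys outside RAW/."""
    if not raw_key.startswith(RAW_PREFIX):
        return None
    return CONVERTED_PREFIX + raw_key[len(RAW_PREFIX):]


def objects_to_process(event):
    """Yield (bucket, source_key, target_key) for each RAW object in an S3 event."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 event notifications URL-encode object keys.
        key = unquote_plus(record["s3"]["object"]["key"])
        target = converted_key(key)
        if target is not None:
            yield bucket, key, target
```

Skipping keys outside RAW/ matters because the function writes back into the same bucket; without that check, each Converted object could trigger another run.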

Requirements

  • Terraform 0.12
  • The AWS user should have the following recommended permissions:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "iam:GetPolicyVersion",
                "glue:DeleteDatabase",
                "iam:DeletePolicy",
                "iam:CreateRole",
                "iam:AttachRolePolicy",
                "athena:*",
                "iam:ListInstanceProfilesForRole",
                "cloudfront:GetDistribution",
                "iam:PassRole",
                "iam:DetachRolePolicy",
                "iam:ListAttachedRolePolicies",
                "cloudfront:UpdateDistribution",
                "iam:GetRole",
                "iam:GetPolicy",
                "glue:GetTables",
                "s3:*",
                "cloudfront:TagResource",
                "iam:DeleteRole",
                "cloudfront:CreateDistribution",
                "glue:GetDatabases",
                "iam:CreatePolicy",
                "glue:GetDatabase",
                "iam:ListPolicyVersions",
                "cloudfront:ListTagsForResource",
                "glue:CreateDatabase",
                "lambda:*",
                "cloudfront:DeleteDistribution"
            ],
            "Resource": "*"
        }
    ]
}

Configuring the Terraform script

The file variables.tf contains all configurable variables for the script:

  • env - Service tag. May be used as billing reports tag.
  • creator - Personalization tag.
  • website - FQDN of the website to track.
  • access_key - AWS user access key.
  • primary_domain - CloudFront distribution CNAME.
  • secret_key - AWS user secret key.
  • region - AWS region.

Deploying infrastructure

Inside the repo directory, run:

terraform init
terraform apply

Terraform will create:

  • 3 buckets:

    1. With the lt-src suffix. Publicly readable. Contains the 1x1 pixel image that Snowplow GET requests are sent to.
    2. With the lt-logs suffix. Stores CloudFront logs under the RAW prefix, enriched Snowplow data under the Converted prefix, and the MaxMind GeoLite2 database.
    3. With the lt-ath suffix. Stores Athena query results.
  • CloudFront distribution with the lt-src bucket as its origin and the lt-logs bucket for log storage.

  • Lambda function which triggers on any object creation in the lt-logs bucket with prefix RAW and suffix .gz.

  • Athena workgroup with suffix wg

  • Athena database with prefix eventsdb

  • Athena saved query with name events

To complete the infrastructure deployment, run the saved Athena query in the created workgroup. It creates the table with enriched Snowplow events.
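
The saved query can be run from the Athena console. As an alternative, here is a hedged sketch of doing the same with boto3. The function name `run_saved_query` is hypothetical, the workgroup name must be supplied by you, and the sketch assumes the workgroup already has a query result location configured. It reads only the first page of `list_named_queries`, which should be enough for a workgroup holding a single saved query.

```python
def run_saved_query(athena, workgroup, query_name="events"):
    """Start the saved query called `query_name` in `workgroup`.

    `athena` is an Athena client, e.g. boto3.client("athena").
    Returns the query execution id.
    """
    ids = athena.list_named_queries(WorkGroup=workgroup)["NamedQueryIds"]
    for query_id in ids:
        named = athena.get_named_query(NamedQueryId=query_id)["NamedQuery"]
        if named["Name"] == query_name:
            response = athena.start_query_execution(
                QueryString=named["QueryString"],
                QueryExecutionContext={"Database": named["Database"]},
                WorkGroup=workgroup,
            )
            return response["QueryExecutionId"]
    raise LookupError(f"no saved query named {query_name!r} in {workgroup!r}")
```

The call returns as soon as the query is submitted; it does not wait for the table to be created.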

Configuring the Snowplow Google Analytics plugin:

The Snowplow pixel GA plugin, optimized for working with CloudFront, looks like this:

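function() {
  // Replace with the domain name of your CloudFront distribution.
  var endpoint = 'https://d28zcvgo2jno01.cloudfront.net/i';
  return function(model) {
    // Keep a single reference to the original sendHitTask per tracking ID.
    var globalSendTaskName = '_' + model.get('trackingId') + '_sendHitTask';
    var originalSendHitTask = window[globalSendTaskName] = window[globalSendTaskName] || model.get('sendHitTask');
    model.set('sendHitTask', function(sendModel) {
      var payload = sendModel.get('hitPayload');
      // Send the hit to Google Analytics as usual.
      originalSendHitTask(sendModel);
      // Send the same payload to CloudFront as a GET query string.
      var request = new XMLHttpRequest();
      var path = endpoint + '?' + payload;
      request.open('GET', path, true);
      request.setRequestHeader('Content-type', 'text/plain; charset=UTF-8');
      // A GET request carries no body; the payload is already in the URL.
      request.send();
    });
  };
}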

Change endpoint to the domain name of the created CloudFront distribution and add the code to the pages you want to track with Snowplow.

How it works

All hit data sent through Google Tag Manager is compiled into a GET request. The pixel tracker sends that request to the CloudFront distribution, and CloudFront logs the request string.

The log file is written to the lt-logs bucket. The Lambda function then processes the new data, enriches it according to the Google Analytics Measurement Protocol, and puts the enriched data in the Converted folder.
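
To make the parsing step concrete, here is a minimal sketch of reading CloudFront log records and decoding the logged query string into Measurement Protocol parameters (`v`, `tid`, `t`, `dp`, and so on). It is not the repository's code: the function names `parse_cloudfront_log` and `hit_parameters` are hypothetical, the sketch expects already-decompressed text lines, and it leaves out the GeoLite2 lookup. CloudFront may also percent-encode some characters a second time when writing the log, so real values can need an extra decoding pass that is omitted here.

```python
from urllib.parse import parse_qs


def parse_cloudfront_log(lines):
    """Yield one dict per log record, keyed by the names in the '#Fields:' header."""
    fields = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            # Field names are space-separated; records are tab-separated.
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            yield dict(zip(fields, line.split("\t")))


def hit_parameters(record):
    """Decode the logged query string into a dict of hit parameters."""
    query = record.get("cs-uri-query", "-")
    if query == "-":  # CloudFront logs '-' for an empty value
        return {}
    # Keep the first value of each parameter.
    return {name: values[0] for name, values in parse_qs(query).items()}


# Shortened, made-up sample; real logs carry many more fields per record.
sample = [
    "#Version: 1.0\n",
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status cs-uri-query\n",
    "2020-01-01\t00:00:01\t203.0.113.7\tGET\t/i\t200\tv=1&t=pageview&tid=UA-1-1&dp=%2Fhome\n",
]
hits = [hit_parameters(record) for record in parse_cloudfront_log(sample)]
```

Reading the column names from the `#Fields:` header, instead of hard-coding positions, keeps the parser working if the log format gains fields.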

Infrastructure termination

To delete the infrastructure, run inside the repo directory:

terraform destroy

Some components may fail to delete because they contain data that was not created by Terraform. Delete these components manually.
