ownyourbusinessdata/snowplow-s3-enrich

Lambda script that enriches Snowplow event data and writes it back to S3

Python-based function for AWS Lambda.

The function parses CloudFront logs of requests made by the Snowplow pixel tracker.

The Lambda should trigger on any object creation in the bucket that holds the CloudFront logs. Log files must be in the RAW folder.

Enriched and processed logs are written to the same bucket, in the Converted folder.
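
As a rough illustration of the trigger and folder convention described above, the sketch below picks the relevant objects out of an S3 notification event and derives the output key for each. The function names and the exact prefix spellings (`RAW/`, `Converted/`) are assumptions made for this example; the repository's actual handler may be organized differently.

```python
from urllib.parse import unquote_plus

RAW_PREFIX = "RAW/"              # assumed spelling of the input folder
CONVERTED_PREFIX = "Converted/"  # assumed spelling of the output folder


def converted_key(raw_key):
    """Map RAW/<name>.gz to Converted/<name>.gz; return None for other keys."""
    if not raw_key.startswith(RAW_PREFIX) or not raw_key.endswith(".gz"):
        return None
    return CONVERTED_PREFIX + raw_key[len(RAW_PREFIX):]


def objects_to_process(event):
    """Return (bucket, source_key, target_key) for each relevant S3 record."""
    jobs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 event notifications URL-encode object keys.
        key = unquote_plus(record["s3"]["object"]["key"])
        target = converted_key(key)
        if target is not None:
            jobs.append((bucket, key, target))
    return jobs
```

Skipping keys outside the RAW folder matters here because the output lands in the same bucket: without that check, each converted file could trigger the function again.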

For a more detailed description, see our blog post at https://www.ownyourbusinessdata.net/enrich-snowplow-data-with-aws-lambda-function/

Requirements

  • Terraform 0.12
  • The AWS user should have the following recommended permissions:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "iam:GetPolicyVersion",
                "glue:DeleteDatabase",
                "iam:DeletePolicy",
                "iam:CreateRole",
                "iam:AttachRolePolicy",
                "athena:*",
                "iam:ListInstanceProfilesForRole",
                "cloudfront:GetDistribution",
                "iam:PassRole",
                "iam:DetachRolePolicy",
                "iam:ListAttachedRolePolicies",
                "cloudfront:UpdateDistribution",
                "iam:GetRole",
                "iam:GetPolicy",
                "glue:GetTables",
                "s3:*",
                "cloudfront:TagResource",
                "iam:DeleteRole",
                "cloudfront:CreateDistribution",
                "glue:GetDatabases",
                "iam:CreatePolicy",
                "glue:GetDatabase",
                "iam:ListPolicyVersions",
                "cloudfront:ListTagsForResource",
                "glue:CreateDatabase",
                "lambda:*",
                "cloudfront:DeleteDistribution"
            ],
            "Resource": "*"
        }
    ]
}

Configuring the Terraform script

The file variables.tf contains all configurable variables for the script:

  • env - Service tag. May be used as a billing report tag.
  • creator - Personalization tag.
  • website - FQDN of the website to be tracked.
  • access_key - AWS user access key.
  • primary_domain - CloudFront distribution CNAME.
  • secret_key - AWS user secret key.
  • region - AWS region.

Deploying infrastructure

Inside the repo directory, run:

terraform init
terraform apply

Terraform will create:

  • 3 buckets:

    1. With the lt-src suffix. Publicly readable. Contains the 1x1 pixel image that the Snowplow tracker requests via GET to send its data.
    2. With the lt-logs suffix. Stores CloudFront logs under the RAW prefix, enriched Snowplow data under the Converted prefix, and the MaxMind GeoLite2 database.
    3. With the lt-ath suffix. Stores Athena query results.
  • A CloudFront distribution with the lt-src bucket as its origin and the lt-logs bucket as its log destination.

  • A Lambda function that triggers on any object creation in the lt-logs bucket with the prefix RAW and the suffix .gz.

  • An Athena workgroup with the suffix wg.

  • An Athena database with the prefix eventsdb.

  • An Athena saved query named events.

To complete the deployment, run the saved Athena query in the created workgroup; it creates the table of enriched Snowplow events.
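
The saved query can be run from the Athena console. If you prefer to script this step, the following is a minimal sketch that uses the boto3 Athena client. The function name `run_saved_query` is made up for this example, the workgroup name is a placeholder you must look up yourself, and the sketch assumes the workgroup already has a query result location configured.

```python
def run_saved_query(athena, workgroup, query_name="events"):
    """Start the saved query called query_name and return its execution id.

    `athena` is expected to be a boto3 Athena client.
    """
    # Note: only the first page of saved queries is inspected here.
    listing = athena.list_named_queries(WorkGroup=workgroup)
    for query_id in listing.get("NamedQueryIds", []):
        named = athena.get_named_query(NamedQueryId=query_id)["NamedQuery"]
        if named["Name"] == query_name:
            started = athena.start_query_execution(
                QueryString=named["QueryString"],
                QueryExecutionContext={"Database": named["Database"]},
                WorkGroup=workgroup,
            )
            return started["QueryExecutionId"]
    raise LookupError(f"no saved query named {query_name!r} in {workgroup!r}")


# Usage (requires boto3 and AWS credentials; the workgroup name is a placeholder):
#   import boto3
#   run_saved_query(boto3.client("athena"), "<your-workgroup>")
```

The call returns as soon as Athena accepts the query; it does not wait for the table to be created.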

Configuring snowplow pixel tracker

The Snowplow pixel tracker code looks like this:

<script type="text/javascript">
  ;(function(p,l,o,w,i,n,g){if(!p[i]){p.GlobalSnowplowNamespace=p.GlobalSnowplowNamespace||[];
  p.GlobalSnowplowNamespace.push(i);p[i]=function(){(p[i].q=p[i].q||[]).push(arguments)
  };p[i].q=p[i].q||[];n=l.createElement(o);g=l.getElementsByTagName(o)[0];n.async=1;
  n.src=w;g.parentNode.insertBefore(n,g)}}(window,document,"script","//d1fc8wv8zag5ca.cloudfront.net/2.6.2/sp.js","snowplow"));

  window.snowplow('newTracker', 'cf', 'dolaqvbw76wrx.cloudfront.net', {
    appId: 'site',
    cookieDomain: 'bostata.com',
  });
  window.snowplow('enableActivityTracking', 1, 5);
  window.snowplow('trackPageView');
  window.snowplow('enableLinkClickTracking');
  window.snowplow('enableFormTracking');
</script>

In the window.snowplow('newTracker', ...) call, replace the CloudFront domain name with the domain name of the distribution you created, then add the code above to every web page you want to track with Snowplow.

How it works

Each time a page is loaded, the browser requests the sp.js script, which collects data from the visitor's device.

All data is packed into a GET request, which the pixel tracker sends to the CloudFront distribution. CloudFront logs the request string.

The log file is written to the lt-logs bucket. The Lambda function then processes the new data, enriches it according to the Snowplow event model, and writes the enriched data to the Converted folder.
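
To make the first processing step more concrete, here is a minimal sketch of how the tracker's GET parameters could be recovered from a gzipped CloudFront log. It covers parsing only, not the enrichment itself, and it is not the repository's actual code: the function names are invented for this example. It relies on the `#Fields:` header that CloudFront standard logs carry and on the `cs-uri-query` column holding the request's query string.

```python
import gzip
from urllib.parse import parse_qs


def parse_cloudfront_log(text):
    """Yield one dict per request line, keyed by the names in the #Fields header."""
    fields = []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            yield dict(zip(fields, line.split("\t")))


def tracker_params(row):
    """Decode the tracker's GET parameters from the cs-uri-query column."""
    query = row.get("cs-uri-query", "-")
    if query == "-":  # CloudFront writes "-" for an empty value
        return {}
    # parse_qs returns lists; keep the first value of each parameter.
    return {name: values[0] for name, values in parse_qs(query).items()}


def read_log(gz_bytes):
    """Decompress a .gz log object and return the decoded parameters per request."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [tracker_params(row) for row in parse_cloudfront_log(text)]
```

Reading the column names from the header, rather than hardcoding positions, keeps the sketch working if the set of logged fields changes. Depending on how CloudFront encodes special characters in the log, some values may need a second round of URL decoding.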

Infrastructure termination

To delete the infrastructure, run inside the repo directory:

terraform destroy

Errors may occur while deleting some components because they contain data that Terraform did not create. Delete those components manually.
