Lambda function to enrich data collected through Google Analytics Snowplow plugin
Python-based function for AWS Lambda.
The function parses CloudFront logs of requests made by the Snowplow pixel tracker.
The Lambda should be triggered on any object creation in the bucket that holds the CloudFront logs. Log files must be in the RAW folder.
Enriched and processed logs are written to the same bucket, in the Converted folder.
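A minimal sketch of how such a handler might map incoming log objects to their output keys. The names `converted_key` and `keys_from_event` are hypothetical and not taken from this repository. The sketch assumes the standard S3 event notification structure, that the prefixes are literally `RAW/` and `Converted/`, and that the output file keeps the input file name.

```python
from urllib.parse import unquote_plus

RAW_PREFIX = "RAW/"              # assumed input prefix
CONVERTED_PREFIX = "Converted/"  # assumed output prefix


def converted_key(raw_key):
    """Map a RAW/ log key to the matching Converted/ key (hypothetical naming)."""
    if not raw_key.startswith(RAW_PREFIX):
        raise ValueError(f"not a RAW log key: {raw_key!r}")
    return CONVERTED_PREFIX + raw_key[len(RAW_PREFIX):]


def keys_from_event(event):
    """Yield (bucket, raw_key, converted_key) for each matching S3 record."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        # Skip anything that is not a gzipped log under RAW/, so objects
        # written to Converted/ never re-enter the pipeline.
        if key.startswith(RAW_PREFIX) and key.endswith(".gz"):
            yield bucket, key, converted_key(key)
```

Filtering inside the handler as well as in the bucket notification is a defensive choice: writing output into the same bucket that triggers the function can otherwise cause the function to invoke itself.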
- Terraform 0.12
- An AWS user with the following recommended permissions:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"iam:GetPolicyVersion",
"glue:DeleteDatabase",
"iam:DeletePolicy",
"iam:CreateRole",
"iam:AttachRolePolicy",
"athena:*",
"iam:ListInstanceProfilesForRole",
"cloudfront:GetDistribution",
"iam:PassRole",
"iam:DetachRolePolicy",
"iam:ListAttachedRolePolicies",
"cloudfront:UpdateDistribution",
"iam:GetRole",
"iam:GetPolicy",
"glue:GetTables",
"s3:*",
"cloudfront:TagResource",
"iam:DeleteRole",
"cloudfront:CreateDistribution",
"glue:GetDatabases",
"iam:CreatePolicy",
"glue:GetDatabase",
"iam:ListPolicyVersions",
"cloudfront:ListTagsForResource",
"glue:CreateDatabase",
"lambda:*",
"cloudfront:DeleteDistribution"
],
"Resource": "*"
}
]
}
The file variables.tf contains all configurable variables for the script:
- env - Service tag. May be used as billing reports tag.
- creator - Personalization tag.
- website - FQDN of the website to be tracked.
- access_key - AWS user access key.
- primary_domain - Cloudfront distribution CNAME.
- secret_key - AWS user secret key.
- region - AWS region.
Inside the repo directory, run:
terraform init
terraform apply
Terraform will create:
- 3 buckets:
  - With the lt-src suffix. Publicly readable. Contains the 1x1 pixel image requested by the Snowplow GET calls.
  - With the lt-logs suffix. Stores CloudFront logs under the RAW prefix, enriched Snowplow data under the Converted prefix, and the MaxMind GeoLite2 database.
  - With the lt-ath suffix. Stores Athena query results.
- CloudFront distribution with the lt-src bucket as its origin and the lt-logs bucket for log storage.
- Lambda function that is triggered on any object creation in the lt-logs bucket with the prefix RAW and the suffix .gz.
- Athena workgroup with the suffix wg.
- Athena database with the prefix eventsdb.
- Athena saved query named events.
To complete the infrastructure deployment, run the saved Athena query in the created workgroup. It creates a table with the enriched Snowplow events.
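The saved query can be run from the Athena console. As an alternative, the step might be scripted along these lines. This is only a sketch: `run_saved_query` is a hypothetical helper, the `athena` argument is assumed to be a boto3 Athena client, and the workgroup is assumed to already have a query result location configured. Only the first page of saved queries is inspected.

```python
def run_saved_query(athena, workgroup, query_name="events"):
    """Find a saved (named) query by name in a workgroup and start it.

    `athena` is expected to behave like boto3.client("athena").
    Returns the query execution ID.
    """
    ids = athena.list_named_queries(WorkGroup=workgroup)["NamedQueryIds"]
    for query_id in ids:
        named = athena.get_named_query(NamedQueryId=query_id)["NamedQuery"]
        if named["Name"] == query_name:
            response = athena.start_query_execution(
                QueryString=named["QueryString"],
                QueryExecutionContext={"Database": named["Database"]},
                WorkGroup=workgroup,
            )
            return response["QueryExecutionId"]
    raise LookupError(f"no saved query named {query_name!r} in {workgroup!r}")
```

With boto3 installed and credentials configured, a call would look like `run_saved_query(boto3.client("athena"), workgroup_name)`, where `workgroup_name` is the name of the workgroup Terraform created.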
The Snowplow pixel GA plugin, optimized for working with CloudFront, looks like this:
function() {
  // Replace with the domain name of your own CloudFront distribution.
  var endpoint = 'https://d28zcvgo2jno01.cloudfront.net/i';
  return function(model) {
    var globalSendTaskName = '_' + model.get('trackingId') + '_sendHitTask';
    // Keep a reference to the original task so hits still reach Google Analytics.
    var originalSendHitTask = window[globalSendTaskName] = window[globalSendTaskName] || model.get('sendHitTask');
    model.set('sendHitTask', function(sendModel) {
      var payload = sendModel.get('hitPayload');
      originalSendHitTask(sendModel);
      // Duplicate the hit to CloudFront as a GET request with the payload in the query string.
      var request = new XMLHttpRequest();
      var path = endpoint + '?' + payload;
      request.open('GET', path, true);
      request.setRequestHeader('Content-type', 'text/plain; charset=UTF-8');
      request.send(payload);
    });
  };
}
Change the endpoint to the domain name of the created CloudFront distribution and add the code to the pages you want to track with Snowplow.
All data sent by Google Tag Manager is compiled into a GET request. The pixel tracker sends that request to the CloudFront distribution, and CloudFront logs the request string.
The log file is written to the lt-logs bucket. The Lambda function then processes the new data, enriches it according to the Google Analytics Measurement Protocol, and puts the enriched data in the Converted folder.
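The parsing step might look roughly like the sketch below. It is not the repository's implementation: `parse_cloudfront_log`, `enrich`, and the output field names are hypothetical, and `MP_FIELDS` covers only a small illustrative subset of Measurement Protocol parameters. It assumes the CloudFront standard log format (gzipped, a `#Fields:` header with space-separated names, tab-separated values) and does not handle any extra encoding CloudFront may apply to logged values.

```python
import gzip
from urllib.parse import parse_qs

# A few Measurement Protocol parameters and readable names (illustrative subset).
MP_FIELDS = {
    "v": "protocol_version",
    "tid": "tracking_id",
    "cid": "client_id",
    "t": "hit_type",
    "dl": "document_location",
    "dp": "document_path",
    "dt": "document_title",
}


def parse_cloudfront_log(data):
    """Yield one dict per request from a gzipped CloudFront standard log."""
    fields = []
    for line in gzip.decompress(data).decode("utf-8").splitlines():
        if line.startswith("#Fields:"):
            # Header names are space-separated; the values below are tab-separated.
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#"):
            yield dict(zip(fields, line.split("\t")))


def enrich(row):
    """Decode the logged query string into named Measurement Protocol fields."""
    query = row.get("cs-uri-query", "-")
    # CloudFront logs "-" when a request has no query string.
    params = parse_qs(query) if query != "-" else {}
    event = {"timestamp": f"{row['date']}T{row['time']}Z", "ip": row["c-ip"]}
    for short, name in MP_FIELDS.items():
        if short in params:
            event[name] = params[short][0]
    return event
```

A real enrichment step would presumably also use the GeoLite2 database stored in the lt-logs bucket to resolve the client IP to a location; that part is left out here.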
To delete the infrastructure, run inside the repo directory:
terraform destroy
Some components may fail to delete because they contain data that was not created by Terraform. Delete these components manually.