Code for the talk: a developer's guide to a real-time pipeline with Snowplow, Kafka and BigQuery.
- Set up the schemas folder
gcloud container clusters create clickstream-pipeline --num-nodes=2 --scopes bigquery
The only scope you need is bigquery, for writing events. A couple of nodes are recommended, because the Kubernetes master node has a significant allocation of resources, which does not leave enough for more than 1-2 deployments.
gcloud compute disks create --size=50GB --zone=europe-west1-d zookeeper-volume
gcloud compute disks create --size=100GB --zone=europe-west1-d kafka-volume
kubectl create -f k8s/zookeeper-volume.yaml
kubectl create -f k8s/kafka-volume.yaml
kubectl create -f k8s/zookeeper-volume-claim.yaml
kubectl create -f k8s/kafka-volume-claim.yaml
kubectl create -f k8s/zookeeper-service.yaml
kubectl create -f k8s/zookeeper.yaml
kubectl create -f k8s/kafka-service.yaml
kubectl create -f k8s/kafka.yaml
kubectl create -f iglu-repo-service.yaml
kubectl create -f iglu-repo.yaml
kubectl create -f snowplow-scala-stream-collector-service.yaml
kubectl create -f snowplow-scala-stream-collector.yaml
kubectl create -f snowplow-scala-stream-enrich.yaml
kubectl create -f bq-connector.yaml
kubectl create -f os-counter-service.yaml
kubectl create -f os-counter.yaml
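Once everything above has been created, it can help to confirm that the pods actually started before moving on. The following is a minimal sketch, not part of the original repo: `all_pods_running` is a hypothetical helper name, and it assumes all pipeline pods live in the current namespace and are long-running (a pod in `Completed` state would count as not running).

```shell
# Hypothetical helper: succeeds (exit 0) only when every pod in the current
# namespace reports STATUS "Running". With no pods at all it also succeeds.
all_pods_running() {
  # Column 3 of `kubectl get pods --no-headers` is STATUS.
  not_running=$(kubectl get pods --no-headers | awk '$3 != "Running" {n++} END {print n+0}')
  [ "$not_running" -eq 0 ]
}
```

For example, `until all_pods_running; do sleep 5; done` polls until the pods are up; `kubectl get pods` shows which ones are still pending.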
kubectl get svc
NAME                          CLUSTER-IP     EXTERNAL-IP     PORT(S)
os-counter                    10.7.245.144   <external-ip>   80:31005/TCP
scala-stream-collector-node   10.7.252.185   <external-ip>   80:32222/TCP
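The external IPs in the listing above can also be read programmatically instead of copied by hand. This is a sketch under the assumption that both services are of type LoadBalancer and expose an IP rather than a hostname; `svc_external_ip` is a hypothetical helper name, not something the repo provides.

```shell
# Hypothetical helper: print the external IP of a LoadBalancer service.
# Prints nothing while the IP is still pending.
svc_external_ip() {
  kubectl get svc "$1" -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}
```

For example, `COLLECTOR_IP=$(svc_external_ip scala-stream-collector-node)` captures the collector address, and `svc_external_ip os-counter` gives the address of the counter website.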
window.snowplow('newTracker', 'cf', '<collector-ip>', { // Initialise a tracker
  appId: 'dev-guide-tracker',
  cookieDomain: ''
});
- Open the os-counter website http://{os-counter-ip}
- Open the test client website, website/index.html
- Check that os-counter events are coming in
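If no events show up, one way to narrow down the problem is to hit the collector directly, without the browser. The sketch below assumes the collector is reachable over plain HTTP on port 80 and serves the standard Snowplow `/health` and `/i` pixel endpoints. Both function names are hypothetical, and the `tv` value and example.com page URL are placeholders chosen for illustration.

```shell
# Hypothetical helper: fail unless the collector's health endpoint answers.
collector_health() {
  curl --fail --silent --show-error "http://$1/health"
}

# Hypothetical helper: send one page-view hit (e=pv) to the pixel endpoint
# and print the HTTP status code the collector returned.
send_test_pageview() {
  curl --silent --show-error --output /dev/null --write-out '%{http_code}\n' \
    "http://$1/i?e=pv&p=web&tv=curl-test&aid=dev-guide-tracker&url=http%3A%2F%2Fexample.com%2F"
}
```

Calling `send_test_pageview` with the collector IP should print `200` if the collector accepted the hit. Whether that hit is then counted by os-counter depends on how the counter treats curl's user agent, so treat it as a connectivity check rather than a full end-to-end test.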
- ZooKeeper and Kafka multi-instance clusters
- Processing custom contexts
- Adding endpoints for logs and monitoring
https://www.youtube.com/watch?v=t3bISkp7zBw
To learn more about clickstream data, have a look at this post: https://stacktome.com/blog/a-guide-to-data-warehousing-clickstream-data?utm_source=github&utm_campaign=sm:organic|blog|post-7965&utm_medium=resource#Clickstream_analysis