A Scala framework to build derived datasets, aka batch views, of Telemetry data.

mhammond/telemetry-batch-view

 
 


telemetry-batch-view

This is a Scala application to build derived datasets, also known as batch views, of Telemetry data.

Raw JSON pings are stored on S3 in files containing framed Heka records. Reading the raw data directly, e.g. through Spark, can be slow: a given analysis typically uses only a few fields, and parsing the JSON blobs is costly. Furthermore, under certain circumstances a Heka file might contain only a handful of records.

Defining a derived Parquet dataset, which uses a columnar layout optimized for analytics workloads, can drastically improve the performance of analysis jobs while reducing space requirements. A derived dataset can, and should, also perform the heavy-duty operations common to all analyses that will read from it (e.g., parsing dates into normalized timestamps).
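
As an illustration of the kind of shared operation a derived dataset can do once for all readers, here is a minimal sketch of date normalization in plain Scala. The object name DateNormalization, the function toEpochMillis, the compact yyyyMMdd input format and the choice of UTC epoch milliseconds as the output are all assumptions made for this example; they are not taken from the views in this repository.

```scala
import java.time.{LocalDate, ZoneOffset}
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper, not part of this repository.
object DateNormalization {
  // Assumed input format: compact ISO dates such as "20160101".
  private val compactDate = DateTimeFormatter.BASIC_ISO_DATE

  /** Converts a compact date string to epoch milliseconds at UTC midnight.
    * Returns None for null, malformed or impossible dates instead of throwing.
    */
  def toEpochMillis(raw: String): Option[Long] =
    Try(LocalDate.parse(raw.trim, compactDate))
      .toOption
      .map(_.atStartOfDay(ZoneOffset.UTC).toInstant.toEpochMilli)
}
```

Returning an Option keeps a bad record from failing a whole job; a real view might instead count or log the rejected values.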

Adding a new derived dataset

See the views folder for examples of jobs that create derived datasets.

See the docs folder for more information about the individual derived datasets.

Development

Before importing the project in IntelliJ IDEA, apply the following changes to Preferences -> Languages & Frameworks -> Scala Compile Server:

  • JVM maximum heap size, MB: 2048
  • JVM parameters: -server -Xmx2G -Xss4M

Note that the first time the project is opened, it takes some time to download all the dependencies.

Generating Datasets

See the documentation of each view for details about how to run it and generate its dataset.

For example, to create a longitudinal view locally:

sbt "run-main com.mozilla.telemetry.views.LongitudinalView --from 20160101 --to 20160701 --bucket telemetry-test-bucket"
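
The --from and --to arguments above are dates in yyyyMMdd form. The following sketch shows one way such a range could be expanded into individual days. It assumes that both bounds are inclusive and that a job processes one day at a time; neither is confirmed by this README, and DateRange and days are hypothetical names, not code from LongitudinalView.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical helper, not part of this repository.
object DateRange {
  private val fmt = DateTimeFormatter.BASIC_ISO_DATE // yyyyMMdd

  /** Lists every day from `from` to `to` (both assumed inclusive) in yyyyMMdd form.
    * Returns an empty list when `from` is after `to`.
    */
  def days(from: String, to: String): List[String] = {
    val start = LocalDate.parse(from, fmt)
    val end = LocalDate.parse(to, fmt)
    Iterator.iterate(start)(d => d.plusDays(1))
      .takeWhile(d => !d.isAfter(end))
      .map(d => d.format(fmt))
      .toList
  }
}
```

For example, DateRange.days("20160101", "20160103") yields the three days 20160101, 20160102 and 20160103.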

For distributed execution, we pack all of the classes together into a single JAR and submit it to the cluster:

sbt assembly
spark-submit --master yarn-client --class com.mozilla.telemetry.views.LongitudinalView target/scala-2.10/telemetry-batch-view-*.jar --from 20160101 --to 20160701 --bucket telemetry-test-bucket

Caveats

If you run into memory issues at compile time, run the following command before running sbt:

export JAVA_OPTIONS="-Xss4M -Xmx2G"

Languages

  • Scala 95.9%
  • Jupyter Notebook 2.1%
  • Python 2.0%