This is a Scala application to build derived datasets, also known as batch views, of Telemetry data.
Raw JSON pings are stored on S3 within files containing framed Heka records. Reading the raw data through e.g. Spark can be slow: a given analysis typically uses only a few fields, and parsing the JSON blobs is costly. Furthermore, under certain circumstances a Heka file might contain only a handful of records.
Defining a derived Parquet dataset, which uses a columnar layout optimized for analytics workloads, can drastically improve the performance of analysis jobs while reducing space requirements. A derived dataset might, and should, also perform heavy-duty operations common to all analyses that read from it (e.g., parsing dates into normalized timestamps).
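Date parsing is the kind of shared heavy lifting a view can perform once, so downstream analyses read clean timestamps instead of re-parsing strings. A minimal sketch of such normalization (a hypothetical helper, not part of this repository; it assumes ping dates arrive in either ISO 8601 or RFC 1123 form):

```scala
import java.time.{Instant, OffsetDateTime}
import java.time.format.DateTimeFormatter

object TimestampNormalizer {
  // Pings may carry creation dates in different textual forms;
  // normalize them all to epoch milliseconds for the Parquet column.
  private val Rfc1123 = DateTimeFormatter.RFC_1123_DATE_TIME

  def normalize(raw: String): Option[Long] = {
    val attempts: Seq[() => Instant] = Seq(
      () => Instant.parse(raw),                           // ISO 8601, e.g. 2016-01-01T00:00:00Z
      () => OffsetDateTime.parse(raw, Rfc1123).toInstant  // RFC 1123, e.g. Fri, 01 Jan 2016 00:00:00 GMT
    )
    // Return the first parse that succeeds, or None for malformed dates.
    attempts.view
      .flatMap(f => scala.util.Try(f()).toOption)
      .headOption
      .map(_.toEpochMilli)
  }
}
```

Returning `Option[Long]` rather than throwing lets the job drop or flag malformed records instead of failing mid-run.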
See the views folder for examples of jobs that create derived datasets.
See the docs folder for more information about the individual derived datasets.
Before importing the project in IntelliJ IDEA, apply the following changes to Preferences -> Languages & Frameworks -> Scala Compile Server:

- JVM maximum heap size, MB: `2048`
- JVM parameters: `-server -Xmx2G -Xss4M`
Note that the first time the project is opened it takes some time to download all the dependencies.
See the documentation for specific views for details about running/generating them.
For example, to create a longitudinal view locally:
```
sbt "run-main com.mozilla.telemetry.views.LongitudinalView --from 20160101 --to 20160701 --bucket telemetry-test-bucket"
```
For distributed execution we pack all of the classes together into a single JAR and submit it to the cluster:

```
sbt assembly
spark-submit --master yarn-client --class com.mozilla.telemetry.views.LongitudinalView target/scala-2.10/telemetry-batch-view-*.jar --from 20160101 --to 20160701 --bucket telemetry-test-bucket
```
If you run into memory issues during compilation, run the following command before invoking sbt:

```
export JAVA_OPTIONS="-Xss4M -Xmx2G"
```