This is a Scala application to build derived datasets, also known as batch views, of Telemetry data.
Raw JSON pings are stored on S3 within files containing framed Heka records. Reading the raw data through e.g. Spark can be slow: a given analysis typically uses only a few fields, and parsing the JSON blobs is costly. Furthermore, under certain circumstances a Heka file might contain only a handful of records.
Defining a derived Parquet dataset, which uses a columnar layout optimized for analytics workloads, can drastically improve the performance of analysis jobs while reducing space requirements. A derived dataset might, and should, also perform heavy-duty operations common to all analyses that read from it (e.g., parsing dates into normalized timestamps).
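Date parsing is the kind of shared heavy lifting a view can perform once, so downstream analyses read clean timestamps instead of re-parsing strings. A minimal sketch of such normalization (a hypothetical helper, not part of this repository; it assumes ping dates arrive in either ISO 8601 or RFC 1123 form):

```scala
import java.time.{Instant, OffsetDateTime}
import java.time.format.DateTimeFormatter

object TimestampNormalizer {
  // Pings may carry creation dates in different textual forms;
  // normalize them all to epoch milliseconds for the Parquet column.
  private val Rfc1123 = DateTimeFormatter.RFC_1123_DATE_TIME

  def normalize(raw: String): Option[Long] = {
    val attempts: Seq[() => Instant] = Seq(
      () => Instant.parse(raw),                           // ISO 8601, e.g. 2016-01-01T00:00:00Z
      () => OffsetDateTime.parse(raw, Rfc1123).toInstant  // RFC 1123, e.g. Fri, 01 Jan 2016 00:00:00 GMT
    )
    // Return the first parse that succeeds, or None for malformed dates.
    attempts.view
      .flatMap(f => scala.util.Try(f()).toOption)
      .headOption
      .map(_.toEpochMilli)
  }
}
```

Returning `Option[Long]` rather than throwing lets the job drop or flag malformed records instead of failing mid-run.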
See the views folder for examples of jobs that create derived datasets.
See the docs folder for more information about the individual derived datasets.
Before importing the project in IntelliJ IDEA, apply the following changes to Preferences -> Languages & Frameworks -> Scala Compile Server:

- JVM maximum heap size, MB: `2048`
- JVM parameters: `-server -Xmx2G -Xss4M`
Note that the first time the project is opened it takes some time to download all the dependencies.
See the documentation for specific views for details about running/generating them.
For example, to create a longitudinal view locally:
```
sbt "run-main com.mozilla.telemetry.views.LongitudinalView --from 20160101 --to 20160701 --bucket telemetry-test-bucket"
```
For distributed execution we pack all of the classes together into a single JAR and submit it to the cluster:

```
sbt assembly
spark-submit --master yarn-client --class com.mozilla.telemetry.views.LongitudinalView target/scala-2.10/telemetry-batch-view-*.jar --from 20160101 --to 20160701 --bucket telemetry-test-bucket
```
If you run into memory issues during compilation, run the following command before invoking sbt:

```
export JAVA_OPTIONS="-Xss4M -Xmx2G"
```