This repository was archived by the owner on Feb 14, 2025. It is now read-only.
Merged
82 changes: 82 additions & 0 deletions docs/CrashSummary.md
@@ -0,0 +1,82 @@
The Crash Summary dataset
========================

The Crash Summary dataset is generated by `src/main/scala/com/mozilla/telemetry/views/CrashSummaryView.scala`.

It contains one record for every [crash ping](https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/crash-ping.html) submitted by Firefox. It was built with the long-term goal of providing a base for CrashAggregates.

Generating the dataset
----------------------

For distributed execution, build a self-contained JAR file, then run it with Spark.
For example, to generate the Crash Summary dataset for April 12, 2016 through April 28, 2016
and store the resulting data in an S3 bucket called `example_bucket`:
```bash
sbt assembly
spark-submit \
--master yarn \
--deploy-mode client \
--class com.mozilla.telemetry.views.CrashSummaryView \
target/scala-2.11/telemetry-batch-view-1.1.jar \
--outputBucket example_bucket \
--from 20160412 \
--to 20160428
```
Notes:

* The job saves the resulting data to S3 as [Parquet](https://parquet.apache.org/)-serialized [DataFrames](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html), under prefixes of the form `crash_summary/v1/submission_date=(YYYYMMDD SUBMISSION DATE)/` in the `telemetry-parquet` bucket.
* The Crash Summary data can be accessed from Spark via clusters launched at [analysis.telemetry.mozilla.org](https://analysis.telemetry.mozilla.org/) or via [Presto](https://prestodb.io/) at `sql.telemetry.mozilla.org`. The table is called `crash_summary`.
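
The prefix layout described in the notes can be sketched in a few lines of Python. This is only an illustration of the naming scheme, not part of the job itself: the helper name `partition_prefixes` is hypothetical, and treating the date range as inclusive on both ends is an assumption rather than documented behaviour of `--from` and `--to`.

```python
from datetime import date, timedelta


def partition_prefixes(start, end, version="v1"):
    """Return one crash_summary prefix per submission date.

    Hypothetical helper: assumes the range is inclusive on both ends.
    """
    if end < start:
        raise ValueError("end must not be before start")
    prefixes = []
    day = start
    while day <= end:
        # Partition directories use the YYYYMMDD submission date.
        stamp = day.strftime("%Y%m%d")
        prefixes.append(f"crash_summary/{version}/submission_date={stamp}/")
        day += timedelta(days=1)
    return prefixes


# The same range as the spark-submit example above.
prefixes = partition_prefixes(date(2016, 4, 12), date(2016, 4, 28))
print(prefixes[0])
print(prefixes[-1])
```

Listing the expected prefixes this way can be handy when checking that a backfill wrote every day in the requested range.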


Update Frequency
----------------

This dataset is updated daily via the [telemetry-airflow](https://github.com/mozilla/telemetry-airflow) infrastructure.

The job runs every day shortly after midnight UTC.


Schemas and Making Queries
--------------------------

```
root
|-- client_id: string (nullable = true)
|-- normalized_channel: string (nullable = true)
|-- build_version: string (nullable = true)
|-- build_id: string (nullable = true)
|-- channel: string (nullable = true)
|-- application: string (nullable = true)
|-- os_name: string (nullable = true)
|-- os_version: string (nullable = true)
|-- architecture: string (nullable = true)
|-- country: string (nullable = true)
|-- experiment_id: string (nullable = true)
|-- experiment_branch: string (nullable = true)
|-- e10s_enabled: boolean (nullable = true)
|-- e10s_cohort: string (nullable = true)
|-- gfx_compositor: string (nullable = true)
|-- payload: struct (nullable = true)
| |-- crashDate: string (nullable = true)
| |-- processType: string (nullable = true)
| |-- hasCrashEnvironment: boolean (nullable = true)
| |-- metadata: map (nullable = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
| |-- version: integer (nullable = true)
```
For more detail on where these fields come from in the
[raw data](https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/crash-ping.html),
please look at the case classes [in the CrashSummaryView code](../src/main/scala/com/mozilla/telemetry/views/CrashSummaryView.scala).

Here is an example query to get the total number of main-process crashes by `gfx_compositor`. It counts pings with no `payload.processType` as main-process crashes:
```sql
SELECT gfx_compositor, count(*)
FROM crash_summary
WHERE application = 'Firefox'
  AND (payload.processType IS NULL OR payload.processType = 'main')
GROUP BY gfx_compositor
```
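
To make the filter's handling of missing values concrete, here is a minimal Python sketch that applies the same logic to a handful of in-memory records. The rows and compositor values are made up for illustration, and the function names `is_main_crash` and `count_main_crashes_by_compositor` are hypothetical; the sketch mirrors the query's semantics and is not how the dataset is actually queried.

```python
from collections import Counter

# Hypothetical rows carrying only the fields the query touches.
rows = [
    {"application": "Firefox", "gfx_compositor": "d3d11", "payload": {"processType": None}},
    {"application": "Firefox", "gfx_compositor": "d3d11", "payload": {"processType": "main"}},
    {"application": "Firefox", "gfx_compositor": "opengl", "payload": {"processType": "content"}},
    {"application": "Fennec", "gfx_compositor": "opengl", "payload": {"processType": "main"}},
    {"application": "Firefox", "gfx_compositor": None, "payload": {"processType": "main"}},
]


def is_main_crash(row):
    """Mirror the WHERE clause: Firefox only, processType NULL or 'main'."""
    # A NULL payload behaves like a NULL processType in the SQL version.
    payload = row.get("payload") or {}
    process_type = payload.get("processType")
    return row["application"] == "Firefox" and process_type in (None, "main")


def count_main_crashes_by_compositor(rows):
    """Mirror GROUP BY gfx_compositor with count(*); None is its own group."""
    return Counter(row["gfx_compositor"] for row in rows if is_main_crash(row))


counts = count_main_crashes_by_compositor(rows)
for compositor, total in counts.items():
    print(compositor, total)
```

As in the SQL version, rows with a null `gfx_compositor` form their own group instead of being dropped.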