mozilla · harterrt · Feb 15, 2017 · Jan 30, 2017
diff --git a/docs/CrashSummary.md b/docs/CrashSummary.md
@@ -0,0 +1,82 @@
+The Crash Summary dataset
+========================
+
+The Crash Summary dataset is generated by `src/main/scala/com/mozilla/telemetry/views/CrashSummaryView.scala`.
+
+It contains one record for every [crash ping](https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/crash-ping.html) submitted by Firefox. It was built with the long-term goal of providing a base for CrashAggregates.
+
+Generating the dataset
+----------------------
+
+For distributed execution, we can build a self-contained JAR file, then run it with Spark.
+For example, to generate the Crash Summary dataset for April 12, 2016 to April 28, 2016,
+and store the resulting data in an S3 bucket called `example_bucket`:
+```bash
+sbt assembly
+spark-submit \
+    --master yarn \
+    --deploy-mode client \
+    --class com.mozilla.telemetry.views.CrashSummaryView \
+    target/scala-2.11/telemetry-batch-view-1.1.jar \
+    --outputBucket example_bucket \
+    --from 20160412 \
+    --to 20160428
+```
+Notes:
+
+* The job saves the resulting data to S3 as [Parquet](https://parquet.apache.org/)-serialized [DataFrames](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html), under prefixes of the form `crash_summary/v1/submission_date=(YYYYMMDD SUBMISSION DATE)/` in the `telemetry-parquet` bucket.
+* The Crash Summary data can be accessed from Spark via clusters launched at [analysis.telemetry.mozilla.org](https://analysis.telemetry.mozilla.org/) or via [Presto](https://prestodb.io/) at `sql.telemetry.mozilla.org`. The table is called `crash_summary`.
+
+
+Update Frequency
+----------------
+
+This dataset will be updated daily via the [telemetry-airflow](https://github.com/mozilla/telemetry-airflow) infrastructure.
+
+The job runs every day shortly after midnight UTC.
+
+
+Schemas and Making Queries
+--------------------------
+
+```
+root
+ |-- client_id: string (nullable = true)
+ |-- normalized_channel: string (nullable = true)
+ |-- build_version: string (nullable = true)
+ |-- build_id: string (nullable = true)
+ |-- channel: string (nullable = true)
+ |-- application: string (nullable = true)
+ |-- os_name: string (nullable = true)
+ |-- os_version: string (nullable = true)
+ |-- architecture: string (nullable = true)
+ |-- country: string (nullable = true)
+ |-- experiment_id: string (nullable = true)
+ |-- experiment_branch: string (nullable = true)
+ |-- e10s_enabled: boolean (nullable = true)
+ |-- e10s_cohort: string (nullable = true)
+ |-- gfx_compositor: string (nullable = true)
+ |-- payload: struct (nullable = true)
+ |    |-- crashDate: string (nullable = true)
+ |    |-- processType: string (nullable = true)
+ |    |-- hasCrashEnvironment: boolean (nullable = true)
+ |    |-- metadata: map (nullable = true)
+ |    |    |-- key: string
+ |    |    |-- value: string (valueContainsNull = true)
+ |    |-- version: integer (nullable = true)
+
+
+```
+For more detail on where these fields come from in the
+[raw data](https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/crash-ping.html),
+please look at the case classes [in the CrashSummaryView code](../src/main/scala/com/mozilla/telemetry/views/CrashSummaryView.scala).
+
+Here is an example query to get the total number of main-process crashes by `gfx_compositor` (pings with no `processType` are counted as main-process crashes):
+```sql
+select gfx_compositor, count(*)
+from crash_summary
+where application = 'Firefox'
+and (payload.processType IS NULL OR payload.processType = 'main')
+group by gfx_compositor
+```
+
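The Notes in "Generating the dataset" describe the output layout as `crash_summary/v1/submission_date=(YYYYMMDD SUBMISSION DATE)/` in the `telemetry-parquet` bucket. As a rough illustration of that layout, here is a minimal Python sketch that lists one path per day for the date range used in the `spark-submit` example. The helper name `partition_paths`, the `s3://` URI scheme, and the inclusive treatment of the end date are assumptions made for illustration. None of them come from the job itself.

```python
from datetime import date, timedelta

# Bucket and prefix as stated in the Notes of the doc.
BUCKET = "telemetry-parquet"
PREFIX = "crash_summary/v1"


def partition_paths(start, end):
    """Hypothetical helper: return one S3 path per submission date.

    Assumes `end` is inclusive; the doc does not say how --to is treated.
    The s3:// scheme is also an assumption (some setups use s3a://).
    """
    if end < start:
        raise ValueError("end date is before start date")
    days = (end - start).days + 1
    return [
        "s3://{}/{}/submission_date={}/".format(
            BUCKET, PREFIX, (start + timedelta(days=i)).strftime("%Y%m%d")
        )
        for i in range(days)
    ]


if __name__ == "__main__":
    # Same date range as the spark-submit example.
    for path in partition_paths(date(2016, 4, 12), date(2016, 4, 28)):
        print(path)
```

A list like this could be handed to a Parquet reader such as PySpark's `DataFrameReader.parquet`, although reading the `crash_summary` table directly is likely simpler for most analyses.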