17 changes: 9 additions & 8 deletions SUMMARY.md
@@ -15,14 +15,15 @@
* [Dataset Reference](datasets/reference.md)
* [Pings](datasets/pings.md)
* [Derived Datasets](datasets/derived.md)
* [Longitudinal](datasets/batch_view/Longitudinal.md)
* [Cross Sectional](datasets/batch_view/cross_sectional.md)
* [Main Summary](datasets/batch_view/MainSummary.md)
* [Crash Summary](datasets/batch_view/CrashSummary.md)
* [Crash Aggregate](datasets/batch_view/CrashAggregateView.md)
* [Events](datasets/batch_view/Events.md)
* [Sync Summary](datasets/batch_view/SyncSummary.md)
* [Addons](datasets/batch_view/Addons.md)
* [Longitudinal](datasets/batch_view/longitudinal/reference.md)
* [Cross Sectional](datasets/batch_view/cross_sectional/reference.md)
* [Main Summary](datasets/batch_view/main_summary/reference.md)
* [Crash Summary](datasets/batch_view/crash_summary/reference.md)
* [Crash Aggregate](datasets/batch_view/crash_aggregates/reference.md)
* [Events](datasets/batch_view/events/reference.md)
* [Sync Summary](datasets/batch_view/sync_summary/reference.md)
* [Addons](datasets/batch_view/addons/reference.md)
* [Client Count](datasets/batch_view/client_count/reference.md)
* [Experimental Datasets](tools/experiments.md)
* [Accessing Shield Study data](datasets/shield.md)
* [Collecting New Data](datasets/new_data.md)
212 changes: 5 additions & 207 deletions concepts/choosing_a_dataset.md
@@ -61,195 +61,19 @@ This section describes the derived datasets we provide to make analyzing this da

## longitudinal

The `longitudinal` dataset is a 1% sample of main ping data
organized so that each row corresponds to a client_id.
If you're not sure which dataset to use for your analysis,
this is probably what you want.

#### Contents
Each row in the `longitudinal` dataset represents one `client_id`,
which is approximately a user.
Each column represents a field from the main ping.
Most fields contain **arrays of values**, with one value for each ping associated with a client_id.
Arrays give you access to the raw data from each ping,
but they can be difficult to work with from SQL.
Here's a [query showing some sample data](https://sql.telemetry.mozilla.org/queries/4188#table)
to help illustrate.
Take a look at the [longitudinal examples](/cookbooks/longitudinal.md) if you get stuck.
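For a flavor of what this looks like in practice, here is a minimal sketch in
Presto SQL. It assumes `geo_country` is one of the array columns (it is used
as an example later on this page) and that array elements are ordered
most-recent-first; verify both against the cookbook before relying on them.
```
-- Sketch: peek at one array field per client (Presto syntax).
-- The ordering assumption (element 1 = most recent ping) should be
-- checked against the longitudinal cookbook.
SELECT client_id,
       geo_country[1] AS most_recent_country
FROM longitudinal
LIMIT 10;

-- Expand the array to one row per (client, value) pair:
SELECT client_id, country
FROM longitudinal
CROSS JOIN UNNEST(geo_country) AS t (country)
LIMIT 10;
```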

#### Background and Caveats
Think of the longitudinal table as wide and short.
The dataset contains more columns than `main_summary`
and down-samples to 1% of all clients to reduce query computation time and save resources.

In summary, the longitudinal table differs from `main_summary` in two important ways:

* The longitudinal dataset groups all data so that one row represents a client_id
* The longitudinal dataset samples to 1% of all client_ids

#### Accessing the Data

The `longitudinal` dataset is available in re:dash,
though the array values can be difficult to work with in SQL.
Take a look at this [example query](https://sql.telemetry.mozilla.org/queries/4189/source).

The data is stored as a parquet table in S3 at the following address.
See [this cookbook](/cookbooks/parquet.md) to get started working with the data
in [Spark](http://spark.apache.org/docs/latest/quick-start.html).
```
s3://telemetry-parquet/longitudinal/
```

#### Further Reading

The technical documentation for the `longitudinal` dataset is located in the
[telemetry-batch-view documentation](https://github.com/mozilla/telemetry-batch-view/blob/master/docs/Longitudinal.md).

We also have a set of examples in the [longitudinal cookbook](/cookbooks/longitudinal.md).

The code that generates this dataset is [here](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/Longitudinal.scala).
{% include "/datasets/batch_view/longitudinal/intro.md" %}

## main_summary

The `main_summary` table is the most direct representation of a main ping
but can be difficult to work with due to its size.
Prefer the `longitudinal` dataset unless using the sampled data is prohibitive.

#### Contents

The `main_summary` table contains one row for each ping.
Each column represents one field from the main ping payload,
though only a subset of all main ping fields are included.
This dataset **does not include histograms**.

#### Background and Caveats
This table is massive, and due to its size, it can be difficult to work with.
You should **avoid querying `main_summary`** from [re:dash](https://sql.telemetry.mozilla.org).
Your queries will be **slow to complete** and can **impact performance for other users**,
since re:dash runs on a shared cluster.

Instead, we recommend using the `longitudinal` or `cross_sectional` dataset where possible.
If these datasets do not suffice, consider using Spark on an
[ATMO](https://analysis.telemetry.mozilla.org) cluster.
In the rare case where such queries are necessary,
make use of the `sample_id` field and limit to a short submission date range.
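A hedged sketch of such a restricted query follows. The column names are
assumptions (`submission_date_s3` matching the Addons schema shown later in
this document), and `sample_id = '42'` picks an arbitrary 1% of clients.
```
-- Sketch: count pings for a 1% sample of clients on a single day.
-- Column names are assumptions; check the dataset schema first.
SELECT count(*) AS num_pings
FROM main_summary
WHERE sample_id = '42'
  AND submission_date_s3 = '20170316'
```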

#### Accessing the Data

The data is stored as a parquet table in S3 at the following address.
See [this cookbook](/cookbooks/parquet.md) to get started working with the data in Spark.
```
s3://telemetry-parquet/main_summary/v3/
```

Though **not recommended**, `main_summary` is accessible through re:dash.
Here's an [example query](https://sql.telemetry.mozilla.org/queries/4201/source).
Your queries will be slow to complete and can **impact performance for other users**,
since re:dash is on a shared cluster.

#### Further Reading

The technical documentation for `main_summary` is located in the
[telemetry-batch-view documentation](https://github.com/mozilla/telemetry-batch-view/blob/master/docs/MainSummary.md).

The code responsible for generating this dataset is
[here](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/MainSummaryView.scala).

{% include "/datasets/batch_view/main_summary/intro.md" %}

## cross_sectional

The `cross_sectional` dataset provides descriptive statistics
for each client_id in a 1% sample of main ping data.
This dataset simplifies the longitudinal table by replacing
the longitudinal arrays with summary statistics.
This is the most useful dataset for describing our user base.

#### Contents
Each row in the `cross_sectional` dataset represents one `client_id`,
which is approximately a user.
Each column is a summary statistic describing the client_id.

For example, the longitudinal table has a column called `geo_country`
which contains an array of country codes.
For the same `client_id` the `cross_sectional` table
has columns called `geo_country_mode` and `geo_country_configs`
containing single summary statistics for
the modal country and the number of distinct countries in the array.

| `client_id` | `geo_country` | `geo_country_mode` | `geo_country_configs`|
| ----------- |:----------------------:|:------------------:|:--------------------:|
| 1 | array<"US"> | "US" | 1 |
| 2           | array<"DE", "DE", "US"> | "DE"              | 2                    |
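Because every column is already a scalar summary, queries stay simple.
Here is a minimal sketch using the columns from the example above:
```
-- Sketch: how many sampled clients have each modal country?
SELECT geo_country_mode,
       count(*) AS num_clients
FROM cross_sectional
GROUP BY geo_country_mode
ORDER BY num_clients DESC
```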

#### Background and Caveats

This table is much easier to work with than the longitudinal dataset because
you don't need to work with arrays.
This table has a limited number of pre-computed summary statistics,
so your metric may not be included.

Note that this dataset is a summary of the longitudinal dataset,
so it is also a 1% sample of all client_ids.

All summary statistics are computed over the last 180 days,
so this dataset can be insensitive to changes over time.

#### Accessing the Data

The `cross_sectional` dataset is available in re:dash.
Here's an [example query](https://sql.telemetry.mozilla.org/queries/4202/source).

The data is stored as a parquet table in S3 at the following address.
See [this cookbook](/cookbooks/parquet.md) to get started working with the data in Spark.
```
s3://telemetry-parquet/cross_sectional/v1/
```

#### Further Reading

The `cross_sectional` dataset is generated by
[this code](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/CrossSectionalView.scala).
Take a look at [this query](https://sql.telemetry.mozilla.org/queries/4203/source) for a schema.

{% include "/datasets/batch_view/cross_sectional/intro.md" %}

## client_count

The `client_count` dataset is useful for estimating user counts over a few
[pre-defined dimensions](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/ClientCountView.scala#L22).

#### Contents

This dataset includes columns for a dozen factors and an HLL variable.
The `hll` column contains a
[HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog)
variable, which is an approximation to the exact count.
The factor columns include activity date and the dimensions listed
[here](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/ClientCountView.scala#L22).
Each row represents one combination of the factor columns.

#### Background and Caveats

It's important to understand that the `hll` column is **not a standard count**.
The `hll` variable avoids double-counting users when aggregating over multiple days.
The HyperLogLog variable is a far more efficient way to count distinct elements of a set,
but comes with some complexity.
To find the cardinality of an HLL, use `cardinality(cast(hll AS HLL))`.
To find the union of HLLs over different dates, use `merge(cast(hll AS HLL))`.
The [Firefox ER Reporting Query](https://sql.telemetry.mozilla.org/queries/81/source#129)
is a good example to review.
Finally, Roberto has a relevant writeup
[here](https://robertovitillo.com/2016/04/12/measuring-product-engagment-at-scale/).
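Putting those pieces together, here is a hedged sketch of a daily
distinct-client count; the exact name of the activity date column is an
assumption, so check the table schema in re:dash.
```
-- Sketch: approximate distinct clients per day.
-- merge() unions the HLLs across all other factor combinations;
-- cardinality() turns the merged HLL into an approximate count.
SELECT activity_date,
       cardinality(merge(cast(hll AS HLL))) AS approx_clients
FROM client_count
GROUP BY 1
ORDER BY 1
```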

#### Accessing the Data

The data is available in re:dash.
Take a look at this
[example query](https://sql.telemetry.mozilla.org/queries/81/source#129).

I don't recommend accessing this data from ATMO.

#### Further Reading

{% include "/datasets/batch_view/client_count/intro.md" %}

# Crash Ping Derived Datasets

@@ -262,33 +86,7 @@ This section describes the derived datasets we provide to make analyzing this da

## crash_aggregates

The `crash_aggregates` dataset compiles crash statistics over various dimensions for each day.

#### Rows and Columns

There's one column for each of the stratifying dimensions and the crash statistics.
Each row is a distinct set of dimensions, along with their associated crash stats.
Example stratifying dimensions include channel and country;
example statistics include usage hours and plugin crashes.
See the [complete documentation](https://github.com/mozilla/telemetry-batch-view/blob/master/docs/CrashAggregateView.md)
for all available dimensions
and statistics.
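As a rough sketch only, a query over this dataset might look like the
following. The map-style `dimensions` and `stats` columns and their key names
are assumptions here, so consult the complete documentation linked above for
the real layout.
```
-- Sketch: usage hours and plugin crashes by channel for one day.
-- Column and key names are assumptions, not the documented schema.
SELECT dimensions['channel'] AS channel,
       sum(stats['usage_hours'])    AS usage_hours,
       sum(stats['plugin_crashes']) AS plugin_crashes
FROM crash_aggregates
WHERE submission_date = '2017-03-16'
GROUP BY 1
```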

#### Accessing the Data

This dataset is accessible via re:dash.

The data is stored as a parquet table in S3 at the following address.
See [this cookbook](/cookbooks/parquet.md) to get started working with the data in Spark.
```
s3://telemetry-parquet/crash_aggregates/v1/
```

#### Further Reading

The technical documentation for this dataset can be found in the
[telemetry-batch-view documentation](https://github.com/mozilla/telemetry-batch-view/blob/master/docs/CrashAggregateView.md).

{% include "/datasets/batch_view/crash_aggregates/intro.md" %}

# Appendix

5 changes: 0 additions & 5 deletions datasets/batch_view/Addons.md

This file was deleted.

5 changes: 0 additions & 5 deletions datasets/batch_view/CrashAggregateView.md

This file was deleted.

5 changes: 0 additions & 5 deletions datasets/batch_view/CrashSummary.md

This file was deleted.

5 changes: 0 additions & 5 deletions datasets/batch_view/Events.md

This file was deleted.

5 changes: 0 additions & 5 deletions datasets/batch_view/Longitudinal.md

This file was deleted.

5 changes: 0 additions & 5 deletions datasets/batch_view/MainSummary.md

This file was deleted.

5 changes: 0 additions & 5 deletions datasets/batch_view/SyncSummary.md

This file was deleted.

4 changes: 4 additions & 0 deletions datasets/batch_view/addons/intro.md
@@ -0,0 +1,4 @@

This is a work in progress.
The work is being tracked
[here](https://bugzilla.mozilla.org/show_bug.cgi?id=1364172).
63 changes: 63 additions & 0 deletions datasets/batch_view/addons/reference.md
@@ -0,0 +1,63 @@
# Addons Datasets

<!-- toc -->

# Introduction

{% include "./intro.md" %}

# Data Reference

## Example Queries

## Sampling

The `addons` dataset contains one or more records for every
[Main Summary](MainSummary.md) record
that contains a non-null value for `client_id`.
Each record contains info for a single addon;
if the main ping did not contain any active addons,
there will be a row with nulls for all the addon fields
(so that client_ids/records without any addons can still be identified).

Like the Main Summary dataset, no attempt is made to de-duplicate submissions
by `documentId`, so any analysis that could be affected by duplicate records
should take care to remove duplicates using the `documentId` field.
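For example, here is a hedged sketch of that de-duplication step,
using column names taken from the schema below:
```
-- Sketch: distinct clients per addon name on one day, keeping a single
-- record per document_id to guard against duplicate submissions.
SELECT name,
       count(DISTINCT client_id) AS num_clients
FROM (
    SELECT name,
           client_id,
           row_number() OVER (PARTITION BY document_id
                              ORDER BY subsession_start_date DESC) AS rn
    FROM addons
    WHERE submission_date_s3 = '20170316'
) deduped
WHERE rn = 1
GROUP BY name
ORDER BY num_clients DESC
LIMIT 20
```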

## Scheduling

This dataset is updated daily via the
[telemetry-airflow](https://github.com/mozilla/telemetry-airflow) infrastructure.
The job DAG runs every day after the Main Summary data has been generated.
The DAG is [here](https://github.com/mozilla/telemetry-airflow/blob/master/dags/main_summary.py#L36).

## Schema

As of 2017-03-16, the current version of the `addons` dataset is `v2`,
and has a schema as follows:
```
root
|-- document_id: string (nullable = true)
|-- client_id: string (nullable = true)
|-- subsession_start_date: string (nullable = true)
|-- normalized_channel: string (nullable = true)
|-- addon_id: string (nullable = true)
|-- blocklisted: boolean (nullable = true)
|-- name: string (nullable = true)
|-- user_disabled: boolean (nullable = true)
|-- app_disabled: boolean (nullable = true)
|-- version: string (nullable = true)
|-- scope: integer (nullable = true)
|-- type: string (nullable = true)
|-- foreign_install: boolean (nullable = true)
|-- has_binary_components: boolean (nullable = true)
|-- install_day: integer (nullable = true)
|-- update_day: integer (nullable = true)
|-- signed_state: integer (nullable = true)
|-- is_system: boolean (nullable = true)
|-- submission_date_s3: string (nullable = true)
|-- sample_id: string (nullable = true)
```
For more detail on where these fields come from in the
[raw data](https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/environment.html#addons),
please look
[in the AddonsView code](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/views/AddonsView.scala).

The fields are all simple scalar values.