diff --git a/SUMMARY.md b/SUMMARY.md
index 2b35fbeb7..01d0e4c85 100644
--- a/SUMMARY.md
+++ b/SUMMARY.md
@@ -25,6 +25,7 @@
  * [Sync Summary](datasets/batch_view/sync_summary/reference.md)
  * [Addons](datasets/batch_view/addons/reference.md)
  * [Client Count](datasets/batch_view/client_count/reference.md)
+ * [Client Count Daily](datasets/batch_view/client_count_daily/reference.md)
  * [Churn](datasets/mozetl/churn/reference.md)
  * [Retention](datasets/batch_view/retention/reference.md)
  * [Clients Daily](datasets/mozetl/clients_daily/reference.md)
diff --git a/datasets/batch_view/client_count/intro.md b/datasets/batch_view/client_count/intro.md
index d5422b675..21e4fd0cb 100644
--- a/datasets/batch_view/client_count/intro.md
+++ b/datasets/batch_view/client_count/intro.md
@@ -1,14 +1,18 @@
 The `client_count` dataset is useful for estimating user counts over a few
 [pre-defined dimensions](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/client_count_view.sh).
 
+The `client_count` dataset is deprecated in favor of
+[`client_count_daily`](../client_count_daily/reference.md),
+which is aggregated by submission date instead of activity date.
+
 #### Content
 
 This dataset includes columns for a dozen factors and an HLL variable.
 The `hll` column contains a
 [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog)
 variable, which is an approximation to the exact count.
-The factor columns include activity date and the dimensions listed
-[here](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/ClientCountView.scala#L22).
+The factor columns include **activity** date and the dimensions listed
+[here](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/client_count_view.sh).
 Each row represents one combinations of the factor columns.
 
 #### Background and Caveats
diff --git a/datasets/batch_view/client_count/reference.md b/datasets/batch_view/client_count/reference.md
index 307ce4ad5..5ce5d812e 100644
--- a/datasets/batch_view/client_count/reference.md
+++ b/datasets/batch_view/client_count/reference.md
@@ -32,7 +32,8 @@ The work is being tracked
 
 ## Schema
 
+The `activity_date` column is formatted as `%Y-%m-%d`, like `2018-01-30`.
+
 This document is a work in progress.
 The work is being tracked
 [here](https://bugzilla.mozilla.org/show_bug.cgi?id=1364175).
-
diff --git a/datasets/batch_view/client_count_daily/intro.md b/datasets/batch_view/client_count_daily/intro.md
new file mode 100644
index 000000000..6b3655a72
--- /dev/null
+++ b/datasets/batch_view/client_count_daily/intro.md
@@ -0,0 +1,39 @@
+The `client_count_daily` dataset is useful for estimating user counts over a few
+[pre-defined dimensions](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/client_count_daily_view.sh).
+
+The `client_count_daily` dataset is similar to the deprecated
+[`client_count` dataset](../client_count/reference.md)
+except that it is aggregated by submission date and not activity date.
+
+#### Content
+
+This dataset includes columns for a dozen factors and an HLL variable.
+The `hll` column contains a
+[HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog)
+variable, which is an approximation to the exact count.
+The factor columns include **submission** date and the dimensions listed
+[here](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/client_count_daily_view.sh).
+Each row represents one combination of the factor columns.
+
+#### Background and Caveats
+
+It's important to understand that the `hll` column is **not a standard count**.
+The `hll` variable avoids double-counting users when aggregating over multiple days.
+The HyperLogLog variable is a far more efficient way to count distinct elements of a set,
+but comes with some complexity.
+To find the cardinality of an HLL, use `cardinality(cast(hll AS HLL))`.
+To find the union of two HLLs over different dates, use `merge(cast(hll AS HLL))`.
+The [Firefox ER Reporting Query](https://sql.telemetry.mozilla.org/queries/81/source#129)
+is a good example to review.
+Finally, Roberto has a relevant write-up
+[here](https://robertovitillo.com/2016/04/12/measuring-product-engagment-at-scale/).
+
+#### Accessing the Data
+
+The data is available in Re:dash.
+Take a look at this
+[example query](https://sql.telemetry.mozilla.org/queries/81/source#129).
+
+I don't recommend accessing this data from ATMO.
+
+#### Further Reading
diff --git a/datasets/batch_view/client_count_daily/reference.md b/datasets/batch_view/client_count_daily/reference.md
new file mode 100644
index 000000000..ea1b83314
--- /dev/null
+++ b/datasets/batch_view/client_count_daily/reference.md
@@ -0,0 +1,39 @@
+# Client Count Daily Reference
+
+This document is a work in progress.
+The work is being tracked
+[here](https://bugzilla.mozilla.org/show_bug.cgi?id=1364175).
+
+
+
+# Introduction
+
+{% include "./intro.md" %}
+
+# Data Reference
+
+## Example Queries
+
+This document is a work in progress.
+The work is being tracked
+[here](https://bugzilla.mozilla.org/show_bug.cgi?id=1364175).
+
+## Sampling
+
+This document is a work in progress.
+The work is being tracked
+[here](https://bugzilla.mozilla.org/show_bug.cgi?id=1364175).
+
+## Scheduling
+
+This document is a work in progress.
+The work is being tracked
+[here](https://bugzilla.mozilla.org/show_bug.cgi?id=1364175).
+
+## Schema
+
+`submission_date` is formatted as `%Y%m%d`, like `20180130`.
+
+This document is a work in progress.
+The work is being tracked
+[here](https://bugzilla.mozilla.org/show_bug.cgi?id=1364175).
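Note on the two Schema additions: `activity_date` in `client_count` is documented as `%Y-%m-%d`, while `submission_date` in `client_count_daily` is documented as `%Y%m%d`, so values from the two datasets cannot be compared as raw strings. The following is a minimal Python sketch of normalizing both, assuming the values arrive as plain strings; the helper names are hypothetical and not part of any Mozilla library.

```python
from datetime import date, datetime

# Format strings as documented in the Schema sections of this change.
ACTIVITY_DATE_FORMAT = "%Y-%m-%d"   # client_count.activity_date, e.g. 2018-01-30
SUBMISSION_DATE_FORMAT = "%Y%m%d"   # client_count_daily.submission_date, e.g. 20180130


def parse_activity_date(value: str) -> date:
    """Parse a client_count activity_date string (hypothetical helper)."""
    return datetime.strptime(value, ACTIVITY_DATE_FORMAT).date()


def parse_submission_date(value: str) -> date:
    """Parse a client_count_daily submission_date string (hypothetical helper)."""
    return datetime.strptime(value, SUBMISSION_DATE_FORMAT).date()


def to_submission_date(day: date) -> str:
    """Render a date in the submission_date format, e.g. for a query filter."""
    return day.strftime(SUBMISSION_DATE_FORMAT)
```

Both parsers raise `ValueError` when handed the other dataset's format, which surfaces a mixed-up column early rather than silently producing a wrong filter.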