From b9d11d28adafd179a493e810f56f5dc3fd6e4dd0 Mon Sep 17 00:00:00 2001
From: Tim Smith
Date: Fri, 6 Apr 2018 16:21:57 -0700
Subject: [PATCH 1/7] Point to airflow job for client_count columns

Control over the columns to aggregate was pushed down to Airflow after
telemetry-batch-view was refactored in
https://github.com/mozilla/telemetry-batch-view/pull/229.
---
 datasets/batch_view/client_count/intro.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/datasets/batch_view/client_count/intro.md b/datasets/batch_view/client_count/intro.md
index d5422b675..e9361514d 100644
--- a/datasets/batch_view/client_count/intro.md
+++ b/datasets/batch_view/client_count/intro.md
@@ -8,7 +8,7 @@ The `hll` column contains a
 [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog)
 variable, which is an approximation to the exact count.
 The factor columns include activity date and the dimensions listed
-[here](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/ClientCountView.scala#L22).
+[here](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/client_count_view.sh).
 Each row represents one combinations of the factor columns.
 
 #### Background and Caveats
From bbbc4eee197d6f2739fa9860e74205881dd5a909 Mon Sep 17 00:00:00 2001
From: Tim Smith
Date: Fri, 6 Apr 2018 16:26:30 -0700
Subject: [PATCH 2/7] Deprecate client_count dataset

---
 datasets/batch_view/client_count/intro.md | 6 ++-
 .../batch_view/client_count_daily/intro.md | 39 +++++++++++++++++++
 .../client_count_daily/reference.md | 37 ++++++++++++++++++
 3 files changed, 81 insertions(+), 1 deletion(-)
 create mode 100644 datasets/batch_view/client_count_daily/intro.md
 create mode 100644 datasets/batch_view/client_count_daily/reference.md

diff --git a/datasets/batch_view/client_count/intro.md b/datasets/batch_view/client_count/intro.md
index e9361514d..2d2458c61 100644
--- a/datasets/batch_view/client_count/intro.md
+++ b/datasets/batch_view/client_count/intro.md
@@ -1,13 +1,17 @@
 The `client_count` dataset is useful for estimating user counts over a few
 [pre-defined dimensions](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/client_count_view.sh).
 
+The `client_count` dataset is deprecated in favor of
+[client_count_daily](../client_count_daily/reference.md),
+which is aggregated by submission date instead of activity date.
+
 #### Content
 
 This dataset includes columns for a dozen factors and an HLL variable.
 The `hll` column contains a
 [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog)
 variable, which is an approximation to the exact count.
-The factor columns include activity date and the dimensions listed
+The factor columns include **activity** date and the dimensions listed
 [here](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/client_count_view.sh).
 Each row represents one combinations of the factor columns.
 
diff --git a/datasets/batch_view/client_count_daily/intro.md b/datasets/batch_view/client_count_daily/intro.md
new file mode 100644
index 000000000..6b3655a72
--- /dev/null
+++ b/datasets/batch_view/client_count_daily/intro.md
@@ -0,0 +1,39 @@
+The `client_count_daily` dataset is useful for estimating user counts over a few
+[pre-defined dimensions](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/client_count_daily_view.sh).
+
+The `client_count_daily` dataset is similar to the deprecated
+[`client_count` dataset](../client_count/reference.md)
+except that it is aggregated by submission date and not activity date.
+
+#### Content
+
+This dataset includes columns for a dozen factors and an HLL variable.
+The `hll` column contains a
+[HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog)
+variable, which is an approximation to the exact count.
+The factor columns include **submission** date and the dimensions listed
+[here](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/client_count_daily_view.sh).
+Each row represents one combination of the factor columns.
+
+#### Background and Caveats
+
+It's important to understand that the `hll` column is **not a standard count**.
+The `hll` variable avoids double-counting users when aggregating over multiple days.
+The HyperLogLog variable is a far more efficient way to count distinct elements of a set,
+but comes with some complexity.
+To find the cardinality of an HLL, use `cardinality(cast(hll AS HLL))`.
+To find the union of two HLLs over different dates, use `merge(cast(hll AS HLL))`.
+The [Firefox ER Reporting Query](https://sql.telemetry.mozilla.org/queries/81/source#129)
+is a good example to review.
+Finally, Roberto has a relevant write-up
+[here](https://robertovitillo.com/2016/04/12/measuring-product-engagment-at-scale/).
+
+#### Accessing the Data
+
+The data is available in Re:dash.
+Take a look at this
+[example query](https://sql.telemetry.mozilla.org/queries/81/source#129).
+
+I don't recommend accessing this data from ATMO.
+
+#### Further Reading
diff --git a/datasets/batch_view/client_count_daily/reference.md b/datasets/batch_view/client_count_daily/reference.md
new file mode 100644
index 000000000..faf058ca9
--- /dev/null
+++ b/datasets/batch_view/client_count_daily/reference.md
@@ -0,0 +1,37 @@
+# Client Count Daily Reference
+
+This document is a work in progress.
+The work is being tracked
+[here](https://bugzilla.mozilla.org/show_bug.cgi?id=1364175).
+
+
+
+# Introduction
+
+{% include "./intro.md" %}
+
+# Data Reference
+
+## Example Queries
+
+This document is a work in progress.
+The work is being tracked
+[here](https://bugzilla.mozilla.org/show_bug.cgi?id=1364175).
+
+## Sampling
+
+This document is a work in progress.
+The work is being tracked
+[here](https://bugzilla.mozilla.org/show_bug.cgi?id=1364175).
+
+## Scheduling
+
+This document is a work in progress.
+The work is being tracked
+[here](https://bugzilla.mozilla.org/show_bug.cgi?id=1364175).
+
+## Schema
+
+This document is a work in progress.
+The work is being tracked
+[here](https://bugzilla.mozilla.org/show_bug.cgi?id=1364175).
From 182b58f7a0f7bc30b7e48df60d80c37c0f708fd3 Mon Sep 17 00:00:00 2001
From: Tim Smith
Date: Fri, 6 Apr 2018 16:34:52 -0700
Subject: [PATCH 3/7] Add client_count_daily to SUMMARY

---
 SUMMARY.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/SUMMARY.md b/SUMMARY.md
index 2b35fbeb7..01d0e4c85 100644
--- a/SUMMARY.md
+++ b/SUMMARY.md
@@ -25,6 +25,7 @@
  * [Sync Summary](datasets/batch_view/sync_summary/reference.md)
  * [Addons](datasets/batch_view/addons/reference.md)
  * [Client Count](datasets/batch_view/client_count/reference.md)
+ * [Client Count Daily](datasets/batch_view/client_count_daily/reference.md)
  * [Churn](datasets/mozetl/churn/reference.md)
  * [Retention](datasets/batch_view/retention/reference.md)
  * [Clients Daily](datasets/mozetl/clients_daily/reference.md)
From d324a5a68ba69e5793c964bc327e0820cd3009d1 Mon Sep 17 00:00:00 2001
From: Tim Smith
Date: Tue, 10 Apr 2018 09:24:32 -0700
Subject: [PATCH 4/7] Add table to spelling dictionary

---
 .spelling | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.spelling b/.spelling
index 4a0f2f7d5..2979a25a4 100644
--- a/.spelling
+++ b/.spelling
@@ -24,6 +24,7 @@ bugzilla
 CEP
 changesets
 chromehangs
+client_count_daily
 config
 CPOW
 CSV
From d8709d301798687af59a942fefbb163de9ea9bc1 Mon Sep 17 00:00:00 2001
From: Tim Smith
Date: Wed, 11 Apr 2018 10:29:52 -0700
Subject: [PATCH 5/7] Add notes on date format

---
 datasets/batch_view/client_count/reference.md | 3 ++-
 datasets/batch_view/client_count_daily/reference.md | 2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/datasets/batch_view/client_count/reference.md b/datasets/batch_view/client_count/reference.md
index 307ce4ad5..5ce5d812e 100644
--- a/datasets/batch_view/client_count/reference.md
+++ b/datasets/batch_view/client_count/reference.md
@@ -32,7 +32,8 @@ The work is being tracked
 
 ## Schema
 
+The `activity_date` column is formatted as `%Y-%m-%d`, like `2018-01-30`.
+
 This document is a work in progress.
 The work is being tracked
 [here](https://bugzilla.mozilla.org/show_bug.cgi?id=1364175).
-
diff --git a/datasets/batch_view/client_count_daily/reference.md b/datasets/batch_view/client_count_daily/reference.md
index faf058ca9..ea1b83314 100644
--- a/datasets/batch_view/client_count_daily/reference.md
+++ b/datasets/batch_view/client_count_daily/reference.md
@@ -32,6 +32,8 @@ The work is being tracked
 
 ## Schema
 
+`submission_date` is formatted as `%Y%m%d`, like `20180130`.
+
 This document is a work in progress.
 The work is being tracked
 [here](https://bugzilla.mozilla.org/show_bug.cgi?id=1364175).
From 2b46ed0cb7e4130195c8ded43ae1e8dcf2459d70 Mon Sep 17 00:00:00 2001
From: Tim Smith
Date: Fri, 13 Apr 2018 15:23:47 -0700
Subject: [PATCH 6/7] Revert "Add table to spelling dictionary"

This reverts commit d324a5a68ba69e5793c964bc327e0820cd3009d1.
---
 .spelling | 1 -
 1 file changed, 1 deletion(-)

diff --git a/.spelling b/.spelling
index 2979a25a4..4a0f2f7d5 100644
--- a/.spelling
+++ b/.spelling
@@ -24,7 +24,6 @@ bugzilla
 CEP
 changesets
 chromehangs
-client_count_daily
 config
 CPOW
 CSV
From eee18816ac1353e26bcf619bdb36ed46549a0685 Mon Sep 17 00:00:00 2001
From: Tim Smith
Date: Fri, 13 Apr 2018 15:24:20 -0700
Subject: [PATCH 7/7] Backtick client_count_daily link reference

---
 datasets/batch_view/client_count/intro.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/datasets/batch_view/client_count/intro.md b/datasets/batch_view/client_count/intro.md
index 2d2458c61..21e4fd0cb 100644
--- a/datasets/batch_view/client_count/intro.md
+++ b/datasets/batch_view/client_count/intro.md
@@ -2,7 +2,7 @@ The `client_count` dataset is useful for estimating user counts over a few
 [pre-defined dimensions](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/client_count_view.sh).
 
 The `client_count` dataset is deprecated in favor of
-[client_count_daily](../client_count_daily/reference.md),
+[`client_count_daily`](../client_count_daily/reference.md),
 which is aggregated by submission date instead of activity date.
 
 #### Content
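
Note on the date formats documented in PATCH 5/7: `activity_date` in `client_count` is formatted as `%Y-%m-%d`, while `submission_date` in `client_count_daily` is formatted as `%Y%m%d`, so the two columns probably need normalizing before they can be compared. The following is a minimal Python sketch of that conversion, assuming both values arrive as plain strings. The helper names are hypothetical and are not part of either dataset's tooling.

```python
from datetime import date, datetime

# Format strings as stated in the reference docs touched by PATCH 5/7.
ACTIVITY_DATE_FORMAT = "%Y-%m-%d"   # client_count.activity_date, e.g. 2018-01-30
SUBMISSION_DATE_FORMAT = "%Y%m%d"   # client_count_daily.submission_date, e.g. 20180130


def parse_activity_date(value: str) -> date:
    """Parse an activity_date string (hypothetical helper)."""
    return datetime.strptime(value, ACTIVITY_DATE_FORMAT).date()


def parse_submission_date(value: str) -> date:
    """Parse a submission_date string (hypothetical helper)."""
    return datetime.strptime(value, SUBMISSION_DATE_FORMAT).date()


def to_submission_format(day: date) -> str:
    """Render a date the way submission_date is stored (hypothetical helper)."""
    return day.strftime(SUBMISSION_DATE_FORMAT)
```

Parsing both columns into `date` objects keeps comparisons independent of the string layout. Note that the two columns still describe different things (activity date versus submission date), so equal values do not imply the same underlying events.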