Merged
1 change: 1 addition & 0 deletions SUMMARY.md
@@ -25,6 +25,7 @@
* [Sync Summary](datasets/batch_view/sync_summary/reference.md)
* [Addons](datasets/batch_view/addons/reference.md)
* [Client Count](datasets/batch_view/client_count/reference.md)
* [Client Count Daily](datasets/batch_view/client_count_daily/reference.md)
* [Churn](datasets/mozetl/churn/reference.md)
* [Retention](datasets/batch_view/retention/reference.md)
* [Clients Daily](datasets/mozetl/clients_daily/reference.md)
8 changes: 6 additions & 2 deletions datasets/batch_view/client_count/intro.md
@@ -1,14 +1,18 @@
The `client_count` dataset is useful for estimating user counts over a few
[pre-defined dimensions](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/client_count_view.sh).

The `client_count` dataset is deprecated in favor of
[`client_count_daily`](../client_count_daily/reference.md),
which is aggregated by submission date instead of activity date.

#### Content

This dataset includes columns for a dozen factors and an HLL variable.
The `hll` column contains a
[HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog)
variable, which is an approximation to the exact count.
The factor columns include activity date and the dimensions listed
[here](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/ClientCountView.scala#L22).
The factor columns include **activity** date and the dimensions listed
[here](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/client_count_view.sh).
Each row represents one combination of the factor columns.

#### Background and Caveats
3 changes: 2 additions & 1 deletion datasets/batch_view/client_count/reference.md
@@ -32,7 +32,8 @@ The work is being tracked

## Schema

The `activity_date` column is formatted as `%Y-%m-%d`, like `2018-01-30`.

This document is a work in progress.
The work is being tracked
[here](https://bugzilla.mozilla.org/show_bug.cgi?id=1364175).

39 changes: 39 additions & 0 deletions datasets/batch_view/client_count_daily/intro.md
@@ -0,0 +1,39 @@
The `client_count_daily` dataset is useful for estimating user counts over a few
[pre-defined dimensions](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/client_count_daily_view.sh).

The `client_count_daily` dataset is similar to the deprecated
[`client_count` dataset](../client_count/reference.md)
except that it is aggregated by submission date rather than activity date.

#### Content

This dataset includes columns for a dozen factors and an HLL variable.
The `hll` column contains a
[HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog)
variable, which is an approximation to the exact count.
The factor columns include **submission** date and the dimensions listed
[here](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/client_count_daily_view.sh).
Each row represents one combination of the factor columns.

#### Background and Caveats

It's important to understand that the `hll` column is **not a standard count**.
The `hll` variable avoids double-counting users when aggregating over multiple days.
The HyperLogLog variable is a far more efficient way to count distinct elements of a set,
but comes with some complexity.
To find the cardinality of an HLL, use `cardinality(cast(hll AS HLL))`.
To find the union of HLLs over different dates, use `merge(cast(hll AS HLL))`.
The [Firefox ER Reporting Query](https://sql.telemetry.mozilla.org/queries/81/source#129)
is a good example to review.
Finally, Roberto has a relevant write-up
[here](https://robertovitillo.com/2016/04/12/measuring-product-engagment-at-scale/).
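
To make the double-counting point concrete, here is a minimal Python sketch.
It uses exact sets as a stand-in for HLL sketches, so the client IDs and dates are made up
and the counts are exact rather than approximate.
The set union plays the role of `merge`, and `len` plays the role of `cardinality`.

```python
# Hypothetical data: exact sets of client IDs stand in for the `hll` column.
# A real HLL is a compact, approximate sketch, not a stored set.
daily_clients = {
    "20180129": {"a", "b", "c"},
    "20180130": {"b", "c", "d"},
}

# Adding up per-day counts counts clients "b" and "c" twice.
summed = sum(len(clients) for clients in daily_clients.values())

# Union first (the analogue of merge), then count (the analogue of cardinality).
merged = set().union(*daily_clients.values())
distinct = len(merged)

print(summed, distinct)  # 6 4
```

The same ordering applies in a query: merge the HLLs across the rows of interest first, then take the cardinality of the merged result.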

#### Accessing the Data

The data is available in Re:dash.
Take a look at this
[example query](https://sql.telemetry.mozilla.org/queries/81/source#129).

I don't recommend accessing this data from ATMO.

#### Further Reading
39 changes: 39 additions & 0 deletions datasets/batch_view/client_count_daily/reference.md
@@ -0,0 +1,39 @@
# Client Count Daily Reference

This document is a work in progress.
The work is being tracked
[here](https://bugzilla.mozilla.org/show_bug.cgi?id=1364175).

<!-- toc -->

# Introduction

{% include "./intro.md" %}

# Data Reference

## Example Queries

This document is a work in progress.
The work is being tracked
[here](https://bugzilla.mozilla.org/show_bug.cgi?id=1364175).

## Sampling

This document is a work in progress.
The work is being tracked
[here](https://bugzilla.mozilla.org/show_bug.cgi?id=1364175).

## Scheduling

This document is a work in progress.
The work is being tracked
[here](https://bugzilla.mozilla.org/show_bug.cgi?id=1364175).

## Schema

`submission_date` is formatted as `%Y%m%d`, like `20180130`.

This document is a work in progress.
The work is being tracked
[here](https://bugzilla.mozilla.org/show_bug.cgi?id=1364175).