Add tutorial on accessing main ping exponential histograms (#484)
wlach committed Jun 11, 2020
1 parent b78ab06 commit 41db05f
Showing 6 changed files with 380 additions and 0 deletions.
1 change: 1 addition & 0 deletions src/SUMMARY.md
@@ -39,6 +39,7 @@
* [Working with Normandy events](cookbooks/normandy_events.md)
* [Working with Crash Pings](cookbooks/crash_pings.md)
* [Working with Bit Patterns in Clients Last Seen](cookbooks/clients_last_seen_bits.md)
* [Visualizing Percentiles of a Main Ping Exponential Histogram](cookbooks/main_ping_exponential_histograms.md)
* [Real-time](cookbooks/realtime.md)
* [Seeing Your Own Pings](cookbooks/view_pings_cep.md)
* [Metrics](cookbooks/metrics.md)
379 changes: 379 additions & 0 deletions src/cookbooks/main_ping_exponential_histograms.md
@@ -0,0 +1,379 @@
# Visualizing Percentiles of a Main Ping Exponential Histogram

[GLAM](https://glam-stage.bespoke.nonprod.dataops.mozgcp.net/) is great if you want to check out the behaviour of a histogram over a large population across a curated set of dimensions, but what if you have a follow-up question that doesn't fit into its UI model? This tutorial will go into the guts of how to reproduce a GLAM-like view using `sql.telemetry.mozilla.org` (STMO), along with some suggestions on how to dig deeper.

This tutorial tries to build up an understanding and intuition of how things work at a low level before getting to its main act of reproducing GLAM. If you don't care about the details, you can probably skip the earlier sections of this document.

Assumptions: You have some idea of what a histogram is (if not, the [Wikipedia article](https://en.wikipedia.org/wiki/Histogram) is a great place to start), have at least skimmed over [the Firefox documentation on histograms](https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/collection/histograms.html), and have read [the introduction to STMO](../tools/stmo.md).

## Table of Contents

<!-- toc -->

## Setting the stage: tab spinners duration

For the purposes of this tutorial, let's look at a typical performance-oriented histogram: [`FX_TAB_SWITCH_SPINNER_VISIBLE_MS`](https://probes.telemetry.mozilla.org/?view=detail&probeId=histogram%2FFX_TAB_SWITCH_SPINNER_VISIBLE_MS), which we use to count the number of times a tab spinner appears after a tab switch operation in Firefox, along with the duration of its appearance in milliseconds (ms). This is unwanted behaviour (especially if the spinner stays visible for a long time), as it makes the browser appear unresponsive and [confuses / disturbs users](https://support.mozilla.org/en-US/questions/1198062).

`FX_TAB_SWITCH_SPINNER_VISIBLE_MS` is what's called an [exponential histogram](https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/collection/histograms.html#exponential): the ranges covered by its "buckets" increase exponentially in size. It's probably easier to visualize this with the histogram simulator than to describe it:

![example visualization of an exponential histogram](../assets/exponential_histograms_tutorial/example_visualization_of_an_exponential_histogram.png)
[link](https://telemetry.mozilla.org/histogram-simulator/#low=1&high=1000&n_buckets=20&kind=exponential&generate=normal)

The visualization above shows how a [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution) would map into the buckets. You'll see that it skews towards the high end: the point of an exponential histogram is to be more sensitive to lower values (which one would assume are more frequent, so long as the tab spinner doesn't come up too often!). Each "tick" represents the range of a bucket (in milliseconds): so we have a bucket representing values between `1ms` and `2ms`, another between `2ms` and `3ms`, and so on. You'll also note that this distribution caps out at `1000`: any value greater than or equal to this winds up in the final bucket, so we won't know its exact value with any precision. For tracking values higher than this, a separate histogram ([`FX_TAB_SWITCH_SPINNER_VISIBLE_LONG_MS`](https://probes.telemetry.mozilla.org/?view=detail&probeId=histogram%2FFX_TAB_SWITCH_SPINNER_VISIBLE_LONG_MS)) was created.
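
To make the bucketing a bit more concrete, here is a rough sketch of how log-spaced bucket edges for a `1` to `1000` range with 20 buckets can be approximated in BigQuery SQL. This is just an illustration of the idea: Firefox's actual bucketing algorithm rounds slightly differently, so the real edges won't match exactly.

```sql
-- Rough approximation of exponential bucket edges for low=1, high=1000,
-- n_buckets=20. Illustration only: not Firefox's exact bucketing algorithm.
SELECT
  i AS bucket_index,
  CAST(ROUND(POW(1000, i / 19)) AS INT64) AS approximate_bucket_edge
FROM
  UNNEST(GENERATE_ARRAY(0, 19)) AS i
ORDER BY
  i
```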

## Getting client-level data

Before we move on, let's work through a quick example of getting some client-level data using an SQL query, hopefully building some intuition for how this works at a low level. From there, we can build up to aggregating it in interesting ways.

As of this writing, each main ping histogram is encoded as a JSON string in the `telemetry.main` table in BigQuery.

```sql
SELECT
payload.histograms.FX_TAB_SWITCH_SPINNER_VISIBLE_MS AS histogram_json,
FROM
telemetry.main
WHERE
sample_id = 42
AND normalized_channel = 'nightly'
AND DATE(submission_timestamp) = '2020-04-20'
AND payload.histograms.FX_TAB_SWITCH_SPINNER_VISIBLE_MS IS NOT NULL
LIMIT
3
```

[link](https://sql.telemetry.mozilla.org/queries/71333/source)

Running this query on STMO, we get the following output:

| `histogram_json` |
| --------------------------------------------------------------------------------------------------------- |
| `{"bucket_count":20,"histogram_type":0,"sum":118,"range":[1,1000],"values":{"80":0,"115":1,"165":0}}` |
| `{"bucket_count":20,"histogram_type":0,"sum":19145,"range":[1,1000],"values":{"237":0,"340":1,"1000":1}}` |
| `{"bucket_count":20,"histogram_type":0,"sum":1996,"range":[1,1000],"values":{"698":0,"1000":1}}` |

In this representation, `bucket_count`, `histogram_type`, and `range` represent the number of buckets, the histogram type (as an index: 0 means exponential), and the range of possible values. `values` maps each bucket (keyed by its lower edge) to the number of samples recorded in it, while `sum` is the sum of all values recorded in the histogram.
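
As an aside, if you just want to peek at individual fields, you can pull them straight out of the JSON string with BigQuery's built-in JSON functions. This is purely for illustration -- as noted below, you shouldn't rely on this representation in production code:

```sql
-- Illustration only: extracting fields directly from the raw JSON representation.
-- Prefer the json_extract_histogram UDF described below for anything durable.
SELECT
  CAST(
    JSON_EXTRACT_SCALAR(payload.histograms.FX_TAB_SWITCH_SPINNER_VISIBLE_MS, '$.sum') AS INT64
  ) AS histogram_sum,
  JSON_EXTRACT(payload.histograms.FX_TAB_SWITCH_SPINNER_VISIBLE_MS, '$.values') AS histogram_values
FROM
  telemetry.main
WHERE
  sample_id = 42
  AND normalized_channel = 'nightly'
  AND DATE(submission_timestamp) = '2020-04-20'
  AND payload.histograms.FX_TAB_SWITCH_SPINNER_VISIBLE_MS IS NOT NULL
LIMIT
  3
```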

In general, it is best not to rely on this representation of the histogram in production code (it is quite likely to change in the future). Instead, use the [`json_extract_histogram`](https://github.com/mozilla/bigquery-etl/blob/master/udf/json_extract_histogram.sql) user-defined function (UDF) and extract the fields you need: for example, to get just the `sum` and `bucket_count` for the histograms above, you could modify the query above to something like:

```sql
WITH intermediate AS (
SELECT
udf.json_extract_histogram(payload.histograms.FX_TAB_SWITCH_SPINNER_VISIBLE_MS) AS histogram,
FROM
telemetry.main
WHERE
sample_id = 42
AND normalized_channel = 'nightly'
AND DATE(submission_timestamp) = '2020-04-20'
AND payload.histograms.FX_TAB_SWITCH_SPINNER_VISIBLE_MS IS NOT NULL
LIMIT
3
)
SELECT
histogram.sum,
histogram.bucket_count
FROM
intermediate;
```

[link](https://sql.telemetry.mozilla.org/queries/71408/source)

Which yields:

| `sum` | `bucket_count` |
| ----- | -------------- |
| 118 | 20 |
| 19145 | 20 |
| 1996 | 20 |

Note that these are the same values as in the JSON histograms above.

Obviously this by itself is not particularly useful or meaningful - generally we are interested in _aggregate_ behaviour across a larger set of clients. Let's look at how we might get that.

## Getting percentiles from a set of histograms

Often, questions around histograms are framed as "what's the 99th percentile?" -- that is, what is the _maximum_ value that 99% of users experience? This helps give perspective on data which may have a number of weird outliers (a.k.a. the _Bill Gates walks into a bar and everyone inside becomes a millionaire_ effect). Let's take an initial stab at grabbing some percentiles of the data we were looking at earlier using the [`histogram_merge`](https://github.com/mozilla/bigquery-etl/blob/master/udf/histogram_merge.sql) and [`histogram_percentiles`](https://github.com/mozilla/bigquery-etl/blob/master/udf/histogram_percentiles.sql) UDFs:

```sql
WITH merged_histogram AS (
SELECT
udf.histogram_merge(
ARRAY_AGG(udf.json_extract_histogram(payload.histograms.FX_TAB_SWITCH_SPINNER_VISIBLE_MS))
) AS spinner_visible_long_ms,
FROM
telemetry.main
WHERE
application.channel = 'nightly'
AND normalized_os = 'Windows'
AND DATE(submission_timestamp) = '2020-04-20'
),
percentiles AS (
SELECT
udf.histogram_percentiles(spinner_visible_long_ms, [.05, .25, .5, .75, .95]) AS percentile_nested
FROM
merged_histogram
)
SELECT
percentile,
value
FROM
percentiles
CROSS JOIN
UNNEST(percentiles.percentile_nested);
```

[link](https://sql.telemetry.mozilla.org/queries/71410/source)

Which gives us this set of results:

| Percentile | Value (ms) |
| ---------- | ----- |
| 0.05 | 13 |
| 0.25 | 56 |
| 0.50 | 165 |
| 0.75 | 698 |
| 0.95 | 1,000 |

So we see for this set of results that the 95th percentile is `1000ms`, the 75th percentile is `698ms`, and so on.

There's a bit of intermediate-to-advanced SQL in the above query, because the `histogram_percentiles` UDF returns an _array_ of results in a single column (rather than a full-blown table) -- we wrangle the results into something usable with the [`UNNEST`](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#unnest) operator combined with a cross join at the end. If you don't immediately understand this, don't worry: it's just an implementation detail.
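
If you'd like a feel for what `UNNEST` plus a cross join does in isolation, here's a toy query that has nothing to do with telemetry data:

```sql
-- Toy example: expanding an array column into one row per element.
WITH example AS (
  SELECT 'a' AS label, [1, 2, 3] AS nums
)
SELECT
  label,
  n
FROM
  example
CROSS JOIN
  UNNEST(example.nums) AS n
-- Returns three rows: (a, 1), (a, 2), (a, 3)
```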

## Viewing change of percentiles over time

Knowing the approximate distribution of results on a given day is sort of interesting, but probably not what we really want: what we're usually interested in is the evolution of results _over time_. In particular, segmenting by `build_id` (a date-like structure in the `main` ping representing when Firefox was built) is a useful technique, as it allows us to see if changes to Firefox itself may have caused the distribution to change.
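
For reference, `application.build_id` is a timestamp-like string (for example `"20200406214130"`, a made-up value here), which the queries below convert into a proper `DATETIME` using `PARSE_DATETIME`:

```sql
-- Parsing a (hypothetical) build_id string into a DATETIME, exactly as the
-- queries below do.
SELECT
  PARSE_DATETIME("%Y%m%d%H%M%S", "20200406214130") AS build_id
-- 2020-04-06 21:41:30
```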

We can do this simply by _grouping by_ the build id field, and then merging the histograms corresponding to each:

```sql
DECLARE four_twenty DEFAULT DATE('2020-04-20');

WITH per_build_day AS (
SELECT
PARSE_DATETIME("%Y%m%d%H%M%S", application.build_id) AS build_id,
KEY,
SUM(value) AS value,
FROM
telemetry.main,
UNNEST(
udf.json_extract_histogram(
payload.histograms.FX_TAB_SWITCH_SPINNER_VISIBLE_MS
).VALUES
)
WHERE
application.channel = 'nightly'
AND normalized_os = 'Windows'
AND application.build_id > FORMAT_DATE("%Y%m%d", DATE_SUB(four_twenty, INTERVAL 2 WEEK))
AND application.build_id <= FORMAT_DATE("%Y%m%d", four_twenty)
AND DATE(submission_timestamp) >= DATE_SUB(four_twenty, INTERVAL 2 WEEK)
AND DATE(submission_timestamp) <= four_twenty
GROUP BY
KEY,
build_id
),
per_build_day_as_struct AS (
SELECT
build_id,
STRUCT(ARRAY_AGG(STRUCT(KEY, value)) AS VALUES) AS spinner_visible_ms
FROM
per_build_day
GROUP BY
build_id
)
SELECT
build_id,
percentile,
value
FROM
per_build_day_as_struct
CROSS JOIN
UNNEST(
udf.histogram_percentiles(
spinner_visible_ms,
[.05, .25, .5, .75, .95]
)
)
```

[link](https://sql.telemetry.mozilla.org/queries/71472/source)

As an implementation note, observe that we don't use `histogram_merge` here as we do above: doing so would require using [`ARRAY_AGG`](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#array_agg), which can break down when processing large amounts of data. Instead, we create an intermediate result (the `per_build_day` `WITH` statement) and then reprocess it into a structured representation. If you're curious what the version using `histogram_merge` would look like, see [this example](https://sql.telemetry.mozilla.org/queries/71413/source).

In any case, rendering the data this query returns, we get a chart like this:

![example visualization of histogram percentiles](../assets/exponential_histograms_tutorial/example_visualization_of_histogram_percentiles.png)

You'll note that the 75th and the 95th percentiles are often the same. Which is to say: in 25% of cases, the value was somewhere between `698ms` and `1000ms`. Does this mean that 25% of the time people are seeing a _very_ long-running tab spinner? _No!_ It actually points to a flaw in our methodology, which GLAM was explicitly designed to address. For the last part of our tutorial, let's look into how GLAM handles this, and how to reproduce its approach.

## Percentiles from client-normalized histograms

The example above basically created one _giant_ histogram and then gathered the percentiles out of it. But not all histograms are created equal! At the extreme end of things for the tab spinner, consider a user on an extremely old computer with various kinds of malware installed, constantly interacting with complex and slow web sites. Such a condition is going to trigger the tab spinner frequently, and for long periods. But it is not representative of the overall population, and probably shouldn't _unduly_ influence our decision-making process.

A solution used by GLAM is to give each client "one vote": that is, the aggregate histogram for a client over a day must sum up to one. Even in the extreme case where all of a client's tab spinner measurements fall between `698ms` and `1000ms` (the range of the highest bucket), the _maximum_ value for that bucket is just `1`.
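
To make this concrete, here's a toy sketch of the normalization step (not the actual implementation of the UDF used below): each of a client's bucket counts is divided by the client's total, so the resulting values sum to one.

```sql
-- Toy sketch of per-client normalization; udf.histogram_normalize performs the
-- equivalent rescaling on the full histogram struct.
WITH client_histogram AS (
  SELECT [STRUCT(698 AS key, 3 AS value), STRUCT(1000 AS key, 1 AS value)] AS vals
)
SELECT
  v.key,
  v.value / (SELECT SUM(value) FROM UNNEST(vals)) AS normalized_value
FROM
  client_histogram,
  UNNEST(vals) AS v
-- Returns (698, 0.75) and (1000, 0.25): this client's values now sum to one.
```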

We can reproduce this approach by using the [`histogram_normalize`](https://github.com/mozilla/bigquery-etl/blob/master/udf/histogram_normalize.sql) UDF, which takes a histogram and rescales its values so that they sum up to exactly one:

```sql
DECLARE four_twenty DEFAULT DATE('2020-04-20');

WITH per_build_client_day AS (
SELECT
PARSE_DATETIME("%Y%m%d%H%M%S", application.build_id) AS build_id,
client_id,
udf.histogram_normalize(
udf.histogram_merge(
ARRAY_AGG(
udf.json_extract_histogram(
payload.histograms.FX_TAB_SWITCH_SPINNER_VISIBLE_MS
)
)
)
) AS tab_switch_visible_ms
FROM
telemetry.main
WHERE
application.channel = 'nightly'
AND normalized_os = 'Windows'
AND application.build_id > FORMAT_DATE("%Y%m%d", DATE_SUB(four_twenty, INTERVAL 14 DAY))
AND application.build_id <= FORMAT_DATE("%Y%m%d", four_twenty)
AND DATE(submission_timestamp) >= DATE_SUB(four_twenty, INTERVAL 14 DAY)
AND DATE(submission_timestamp) <= four_twenty
GROUP BY
build_id,
client_id
),
merged_histograms AS (
SELECT
build_id,
KEY,
SUM(value) AS value,
FROM
per_build_client_day,
UNNEST(per_build_client_day.tab_switch_visible_ms.VALUES)
GROUP BY
KEY,
build_id
),
as_struct AS (
SELECT
build_id,
STRUCT(ARRAY_AGG(STRUCT(KEY, value)) AS VALUES) AS spinner_visible_long_ms
FROM
merged_histograms
GROUP BY
build_id
),
percentiles AS (
SELECT
build_id,
udf.histogram_percentiles(
spinner_visible_long_ms,
[.05, .25, .5, .75, .95]
) AS percentile_nested
FROM
as_struct
)
SELECT
build_id,
percentile,
value
FROM
percentiles
CROSS JOIN
UNNEST(percentiles.percentile_nested);
```

[link](https://sql.telemetry.mozilla.org/queries/71489/source)

You'll notice this query groups by `client_id` in addition to `build_id` before `histogram_normalize`. Grouping by `client_id` gives each user equal representation and prevents "power users" from skewing the result.

In any case, the result of this query is this graph:

![example visualization of normalized histogram percentiles](../assets/exponential_histograms_tutorial/example_visualization_of_normalized_histogram_percentiles.png)

Things are looking much better! The 95th percentile is still capped out at 1000, but the other percentiles are much lower.

## Slicing along arbitrary dimensions

OK, so we've reproduced GLAM, but that isn't particularly exciting in and of itself: if you just want to see a GLAM-like view of things, GLAM itself is going to do a better job. The power of writing SQL comes when you want to see how things look across an arbitrary set of dimensions. Let's take a specific question: what do the tab spinner percentiles look like on Windows 7? These are likely to be much older machines, so we'd expect things to look worse. But how much worse?

We can filter our query down to _just_ that group of users by adding an `AND normalized_os_version = "6.1"` clause (Windows 7 identifies itself as NT version `6.1`) to our query above:

```sql
DECLARE four_twenty DEFAULT DATE('2020-04-20');

WITH per_build_client_day AS (
SELECT
PARSE_DATETIME("%Y%m%d%H%M%S", application.build_id) AS build_id,
client_id,
udf.histogram_normalize(
udf.histogram_merge(
ARRAY_AGG(
udf.json_extract_histogram(
payload.histograms.FX_TAB_SWITCH_SPINNER_VISIBLE_MS
)
)
)
) AS tab_switch_visible_ms
FROM
telemetry.main
WHERE
application.channel = 'nightly'
AND normalized_os = 'Windows'
AND normalized_os_version = "6.1"
AND application.build_id > FORMAT_DATE("%Y%m%d", DATE_SUB(four_twenty, INTERVAL 14 DAY))
AND application.build_id <= FORMAT_DATE("%Y%m%d", four_twenty)
AND DATE(submission_timestamp) >= DATE_SUB(four_twenty, INTERVAL 14 DAY)
AND DATE(submission_timestamp) <= four_twenty
GROUP BY
build_id,
client_id
),
merged_histograms AS (
SELECT
build_id,
KEY,
SUM(value) AS value,
FROM
per_build_client_day,
UNNEST(per_build_client_day.tab_switch_visible_ms.VALUES)
GROUP BY
KEY,
build_id
),
as_struct AS (
SELECT
build_id,
STRUCT(ARRAY_AGG(STRUCT(KEY, value)) AS VALUES) AS spinner_visible_long_ms
FROM
merged_histograms
GROUP BY
build_id
),
percentiles AS (
SELECT
build_id,
udf.histogram_percentiles(
spinner_visible_long_ms,
[.05, .25, .5, .75, .95]
) AS percentile_nested
FROM
as_struct
)
SELECT
build_id,
percentile,
value
FROM
percentiles
CROSS JOIN
UNNEST(percentiles.percentile_nested);
```

[link](https://sql.telemetry.mozilla.org/queries/71437/source)

If we do this, we see this chart:

![example visualization of normalized histogram percentiles for Windows 7](../assets/exponential_histograms_tutorial/example_visualization_of_normalized_histogram_percentiles_for_Windows_7.png)

As you can see, both the 75th and 95th percentiles are now in the highest bucket, and the 50th percentile is much higher as well. From this we can intuit that the experience for these users is likely considerably worse, which is exactly what we would have expected.
