node-artedi histograms should be compatible with Prometheus #17

Closed
wants to merge 14 commits

Conversation

@joshwilsdon (Contributor) commented Oct 18, 2018

node-artedi histograms should be compatible with Prometheus

TL;DR

Problem

If you use node-artedi histograms, any time you try to do a histogram_quantile across:

  • two different metrics
  • the same metric but where your PromQL query involves data with different label values

you are only ever going to get correct results by accident.

Cause

node-artedi handles histogram buckets differently from other Prometheus clients, using a mechanism that is in fact incompatible with Prometheus.

Proposed solution

New major version of node-artedi with histograms compatible with Prometheus.

What's this about?

While investigating some dashboards, I found some very strange behavior for PromQL queries like:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds{service="cnapi"}[5m])) by (le))

where that data comes from node-triton-metrics which uses a node-artedi histogram.

This is an attempt to explain what happened, and what I think we should do about it.

To explain the problems properly, we have to gain some understanding of how Prometheus histograms work. So we'll go through that first.

Digression: What's a Prometheus histogram?

Let's start with a digression.

In Prometheus, a histogram is a single metric type that takes observed values and places them in "configurable buckets". It also exposes a "sum" and a "count" of all values that have been observed. The buckets themselves are just monotonic counters. In a simple case, imagine we were trying to keep track of request times in seconds (a common use case) and created a histogram with the following buckets (these are the defaults that every library I'm aware of other than node-artedi uses; more on that later):

  • 0.005
  • 0.01
  • 0.025
  • 0.05
  • 0.075 // not all libraries have this: golang, rust, ruby, js do not
  • 0.1
  • 0.25
  • 0.5
  • 0.75 // not all libraries
  • 1
  • 2.5
  • 5
  • 7.5 // not all libraries
  • 10

There's also a special bucket which always exists called +Inf which gets incremented for every observation.

Also note that with the exception of a few values that were removed, these default values are actually log-linear with a base of 10, low exponent of -3, high exponent of 1 and 4 linear steps per magnitude.
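For reference, here is a minimal sketch of a generator for that kind of log-linear bucket list. This is just an illustration, not code from this PR; the signature mirrors the logLinearBuckets(base, lowPower, highPower, bucketsPerMagnitude) generator discussed later, and here highPower is treated as the exponent of the largest bucket bound:

// Illustrative log-linear bucket generator (not node-artedi code).
// Each magnitude [base^m, base^(m+1)] gets bucketsPerMagnitude linearly
// spaced upper bounds, the last one being base^(m+1) itself.
function logLinearBuckets(base, lowPower, highPower, bucketsPerMagnitude) {
    var buckets = [];
    var magnitude, step, i;

    for (magnitude = lowPower; magnitude < highPower; magnitude++) {
        step = Math.pow(base, magnitude + 1) / bucketsPerMagnitude;
        for (i = 1; i <= bucketsPerMagnitude; i++) {
            // trim floating point noise like 0.007500000000000001
            buckets.push(Number((step * i).toPrecision(12)));
        }
    }

    return buckets;
}

// logLinearBuckets(10, -3, 1, 4) yields:
// [ 0.0025, 0.005, 0.0075, 0.01, 0.025, 0.05, 0.075, 0.1,
//   0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10 ]
// i.e. the default list above plus the few values that were dropped from it.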

To update a histogram, you call a function like:

myHistogram.observe(0.016);

and for that histogram we increment the counter for every bucket whose upper bound is greater than or equal to the observed value. In this case, since the value observed is 0.016, we'd increment the counter in the buckets:

  • 0.025
  • 0.05
  • 0.075
  • 0.1
  • 0.25
  • 0.5
  • 0.75
  • 1
  • 2.5
  • 5
  • 7.5
  • 10
  • +Inf

but we would not update the buckets:

  • 0.005
  • 0.01

since 0.016 is not less than or equal to either of those bucket bounds. The way Prometheus exposes this is through a special "label". In Prometheus, labels are displayed as key-value pairs in curly brackets like:

myHistogram{key="value"}

and there's a special label that gets added for histograms called le which is used to indicate the bucket. Assuming the observation above was our first ever call to myHistogram.observe and the actual metric name was myHistogram, we'd end up with something like the following state if we were to query the /metrics endpoint of our collector:

myHistogram{le="0.005"} 0
myHistogram{le="0.01"} 0
myHistogram{le="0.025"} 1
myHistogram{le="0.05"} 1
myHistogram{le="0.075"} 1
myHistogram{le="0.1"} 1
myHistogram{le="0.25"} 1
myHistogram{le="0.5"} 1
myHistogram{le="0.75"} 1
myHistogram{le="1"} 1
myHistogram{le="2.5"} 1
myHistogram{le="5"} 1
myHistogram{le="7.5"} 1
myHistogram{le="10"} 1
myHistogram{le="+Inf"} 1

since we will have incremented the counter for every bucket greater than or equal to our value. If we were then to observe a much slower request that took 0.08 seconds, the resulting data would look like:

myHistogram{le="0.005"} 0
myHistogram{le="0.01"} 0
myHistogram{le="0.025"} 1
myHistogram{le="0.05"} 1
myHistogram{le="0.075"} 1
myHistogram{le="0.1"} 2
myHistogram{le="0.25"} 2
myHistogram{le="0.5"} 2
myHistogram{le="0.75"} 2
myHistogram{le="1"} 2
myHistogram{le="2.5"} 2
myHistogram{le="5"} 2
myHistogram{le="7.5"} 2
myHistogram{le="10"} 2
myHistogram{le="+Inf"} 2

It is a fundamental property of Prometheus histograms that if there is a counter value x in one bucket, all larger buckets will have a value greater than or equal to x. It would not be legal, for example, to take the latest data above and put a value of 1 instead of 2 for myHistogram{le="5"}: if myHistogram{le="2.5"} has a value of 2, le="5" must be at least 2.

This will be important later.
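If it helps to see that update rule as code, here is a minimal sketch of a cumulative-bucket histogram. This is only an illustration of the concept, not node-artedi's (or any client library's) implementation:

// Minimal cumulative-bucket histogram sketch (illustration only).
function SimpleHistogram(bounds) {
    // bucket upper bounds, sorted ascending
    this.bounds = bounds.slice().sort(function (a, b) { return a - b; });
    this.counts = this.bounds.map(function () { return 0; });
    this.infCount = 0;  // the +Inf bucket
    this.sum = 0;
    this.count = 0;
}

SimpleHistogram.prototype.observe = function observe(value) {
    var i;

    // increment every bucket whose upper bound is >= the observed value
    for (i = 0; i < this.bounds.length; i++) {
        if (value <= this.bounds[i]) {
            this.counts[i]++;
        }
    }

    this.infCount++;    // +Inf counts every observation
    this.sum += value;
    this.count++;
};

Because every observation increments all of the buckets at or above it, the count for a larger bucket can never be smaller than the count for a smaller one, which is exactly the property described above.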

Dissecting an actual query

To make this more concrete, let's look again at the query that led me down this path in the first place:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds{service="cnapi"}[5m])) by (le))

We'll work from the inside out. If you have a recent Triton setup and have also set up triton-prometheus, you should be able to follow along in your own Prometheus instance and see similar results.

What's http_request_duration_seconds?

The innermost part of this is:

http_request_duration_seconds

this histogram is created in node-triton-metrics with:

self.collector.histogram({
    name: 'http_request_duration_seconds',
    help: 'total time to process requests'
});

and then labels are generated which we pass in when we observe the value. The labels look like:

var labels = {
    route: routeName,
    method: req.method,
    user_agent: shortUserAgent,
    status_code: res.statusCode
};

so in Prometheus we will have observations like:

http_request_duration_seconds{route="servereventheartbeat",method="POST",user_agent="cn-agent/2.3.2",status_code="204",datacenter="coal",instance="edbe348f-09e6-4b77-80d7-a3baea3a7bc0",server="564d0052-8fd1-cf7a-807c-5818fc9055b8",service="cnapi",le="0.0001"} 51970

you can execute just the metric name as a query in Prometheus and you should see something like:

01_just_http_request_duration_seconds

which shows a whole bunch of results (in my case 1201 total, a lot of scrolling).

So that's not actually that helpful. What we want to do is filter just for CNAPI, so we add a label and our query becomes:

http_request_duration_seconds{service="cnapi"}

which will look something like:

02_http_request_duration_seconds_cnapi

which is better, but still has a lot of results. The reason we have so many results is that we've added so many labels, which means that for each combination of:

  • le (from above, generated by Prometheus)
  • method (GET, PUT, POST, DELETE, etc)
  • route (the restify route name)
  • status_code (200, 204, 404, 500, etc)
  • user_agent (restify/1.6.0, cn-agent/2.3.2, curl/7.51.0, etc)

we'll have a separate counter in our metrics output on every scrape. This makes things pretty messy. But why don't we make them messier before we clean them up?

Since the values here are just counters, they're not very helpful for observing how things are changing, which is what we're usually interested in when looking at metrics. As such, we're going to use the rate function to turn the values into something indicating how they're changing.

To use rate you need a range vector, which represents a range of samples instead of a single sample like we had in the last query. To turn the query above into a range vector, we can use what's called a range vector selector (side note: Vector Selector could be a good band name) which looks like:

http_request_duration_seconds{service="cnapi"}[5m]

the [5m] part there is the selector and tells Prometheus to grab all the samples it has for each individual metric/labels set for the last 5 minutes. Since our Prometheus instances typically scrape every 15 seconds, if we didn't miss any scrapes this should be around 20 data points for each metric. So, if you run that query in Prometheus, you'll see something like:

03_http_request_duration_seconds_vector_selector

which is pretty hard to look at, but shows what's going on here. If you scrolled down, you'd see 266 of these entries each as an array of ~20 values.

Now that we've made it more complex, let's finally start simplifying again. What we'll do is use the rate function. This function takes all those numbers from our range vector and turns them into a single value representing the per-second rate over those times. It does some fancy things to adjust for missed scrapes, bad alignment, etc., but that's unimportant for the sake of our current discussion. The important thing is that this takes all the values from the last 5 minutes for each metric/labels pair and gets us back to a single number. This time, instead of representing a count, it represents how much the value is changing per second.
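As a rough mental model only, the hypothetical helper below shows the core of what rate computes for one series; it deliberately ignores counter resets and the extrapolation to the window edges that Prometheus actually performs:

// Rough sketch of rate() over one series' samples within the window.
// samples: [{t: unixSeconds, v: counterValue}, ...] sorted by time.
function approximateRate(samples) {
    var first = samples[0];
    var last = samples[samples.length - 1];

    // per-second increase of the counter across the window
    return (last.v - first.v) / (last.t - first.t);
}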

The query now is:

rate(http_request_duration_seconds{service="cnapi"}[5m])

and you should see results something like:

04_http_request_duration_seconds_rate

which shows some interesting things that you couldn't tell from the counter itself, like the fact that we saw no requests with the combination of status_code="200" and user_agent="restify/1.6.0" for the route="serverget" route in the last 5 minutes (this is the first 0 on the list). We know this because if there had been any, the +Inf counter would have been incremented at least once and the rate would not be 0.

As yet another slight digression, let's look at a slightly more specific query which breaks this up a little more. We'll look at just the route="servereventheartbeat" and method="POST" (the only method supported by this endpoint) entries that also have status_code="204" and user_agent="cn-agent/2.3.2", with the query:

rate(http_request_duration_seconds{service="cnapi",route="servereventheartbeat",method="POST",status_code="204",user_agent="cn-agent/2.3.2"}[5m])

This gets us a much more manageable list of only 36 results, which almost fits on one screen:

05_http_request_duration_seconds_rate_limited

Note: if you're running this in the future, the exact query might not work (user_agent, for example, might need to be adjusted for a newer version of cn-agent), but you should be able to figure out what to query based on the last query.

What you'll notice with this data is that the +Inf rate is the largest value and, as the le value decreases, the rate also decreases. This becomes even more obvious if we apply 2 additional functions:

sort(sum(rate(http_request_duration_seconds{service="cnapi",route="servereventheartbeat",method="POST",status_code="204",user_agent="cn-agent/2.3.2"}[5m])) by (le))

what we've added here first is sum(...) by (le). What that does is combine all the series that share the same le value into a single result whose value is the sum. For example, if we had:

...
myMetric{color="green",le="0.01"} 1.5
myMetric{color="blue",le="0.01"} 3.5
...

and these were the only series with le="0.01", the results would include:

...
{le="0.01"} 5
...

which is the sum of the values for all series with le="0.01", regardless of color. In our query here we already had only one value for each le because everything else was fully specified, so this just serves to make the Element column a bit more readable in this example.

The sort function just sorts the elements by value in ascending order.

So the result should look something like:

06_http_request_duration_seconds_rate_sum_sort

which is still in the correct form for a Prometheus histogram, in that the rate always goes up or stays the same as the le value increases; it never decreases.

Now we're ready for a query closer to what we're here to talk about:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds{service="cnapi",route="servereventheartbeat",method="POST",status_code="204",user_agent="cn-agent/2.3.2"}[5m])) by (le))

histo ... what what?

The histogram_quantile function interpolates data and does some magic to give you a single value that represents a "φ-quantile" value from the buckets of a histogram. Basically this allows you to get the N-th percentile value from the histogram based on the buckets. What the above should output is something like:

07_http_request_duration_seconds_single_metric_histogram

This output is telling us that 95% of requests (P95) for this endpoint completed within 0.0009982834008097166 seconds, based on the per-second rate calculated by the rate function over the last 5 minutes' worth of data.

If you scroll up and look at the previous output, you'll see this makes sense. In that data you can see that the rate for le="0.0009" was ~1.53 but the rate for le="0.001" was ~5.94, and the highest rate (+Inf) was only 6.19. So the vast majority of requests completed between 0.0009 and 0.001 seconds, and 95% of all requests completed in less than 0.001 seconds.

We can also graph this data now (same query, just hit the graph button):

08_http_request_duration_seconds_single_metric_histogram_graph

and we can see that it's tending to stay around that value (0.0009) though there are occasional spikes where the P95 has gone up to about 0.00134 seconds.

Hopefully by this point you understand basically how these functions work, so let's go back to the console and do something crazy.

Imagine you want to know the P95 request time across multiple endpoints, or even all the endpoints of your service at once. That seems like a reasonable thing to be curious about. Let's just take 2 endpoints to start:

http_request_duration_seconds{service="cnapi",route=~"serverget|serverlist"}

This introduces one more thing: label matchers in your queries can use =~ instead of just =, which lets us use a regular expression to include just two of the routes.

The output from this query should look something like:

09_http_request_duration_seconds_serverget_serverlist

which looks just fine, right? It's got entries for both serverget and serverlist, and we have numbers... So let's go back to doing a sum of the rate again like we did before:

sum(rate(http_request_duration_seconds{service="cnapi",route=~"serverget|serverlist"}[5m])) by (le)

Note: You always want to use rate before sum.

This new query should make data that looks like:

10_http_request_duration_seconds_serverget_serverlist_sum_rate

which is starting to look a bit suspicious, if you look carefully. If we sort it the problem becomes more obvious:

11_http_request_duration_seconds_serverget_serverlist_sum_rate_sorted

the issue here is that we no longer have what looks like a valid set of data. Whereas before we had a properly behaved histogram, where higher le values always had values greater than or equal to those of lower le values, now we've got:

  • le="6.48" having a value of 0, but le="0.18" having a higher rate, of 0.039.
  • le="0.08" having a value of 0.0035 but le="7.29" having a rate of 0.

So what's going on here?

If we look at a histogram_quantile of this data, it doesn't make a lot of sense:

12_http_request_duration_seconds_serverget_serverlist_histogram

The heart of the problem here is that we're combining two sets of buckets. The buckets for serverlist look like:

13_http_request_duration_seconds_serverlist_buckets

whereas the buckets for serverget look like:

14_http_request_duration_seconds_serverget_buckets

And you can see that even within those there are some anomalies. For example, why is the counter for le="0.007" higher than for le="7.29" for serverget?

The problem here is twofold:

  • Unlike every other Prometheus library I have been able to find, node-artedi does not use fixed buckets but rather implements its own log-linear bucket algorithm.
  • With node-triton-metrics, we're using several labels on our http_request_duration_seconds histograms. (This is why the data is broken even within just serverget, because we're not fully specifying all the labels.)

The combination of these two factors means that the only way you can actually use artedi histograms and compare them accurately is to compare using all the labels, such that you're always dealing with the same set of buckets.

For the example we've been using thus far, that means that:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds{service="cnapi",route="servereventheartbeat",method="POST",status_code="204",user_agent="cn-agent/2.3.2"}[5m])) by (le))

which we used above is safe, because we've fully specified all the labels. If you fail to specify all the labels and a given metric happens to have more than one value for an omitted label, you will be mixing different sets of buckets, so your results will only be accurate if you get lucky.

Note that the data here, up to the point of histogram_quantile, is correct. It just no longer looks like valid Prometheus histogram data. So trying to do histogram_quantile on data that's not valid is not going to give you valid results.

You can get lucky, though: for example, if most of your requests happen to be in the same general range, you may end up with exactly the same buckets. If the buckets are the same between two different metric+labels combinations, you could combine the two and things would be fine. But if any of the buckets are different (which seems to be common, at least in Triton), comparing across them is not going to give you what you think you're getting, and doing histogram_quantile will just be confusing.

I'm sure at this point you must be asking yourself: why can't Prometheus figure out some way to do a histogram_quantile across histograms with le values all over the place? And I'm glad you asked. At least part of the reason is that since PR 2610, when Prometheus sees a histogram that looks like:

  • le=100 value=5
  • le=250 value=6
  • le=500 value=4
  • le=1000 value=8
  • le=2500 value=0.1

it assumes that the values are "nonsense" and, as it walks up the le values, "fixes" the data such that each higher le has at least the maximum value seen so far. In other words, the bucketQuantile function runs code that ensures the values are monotonic.
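A minimal sketch of that fix-up, assuming the buckets are a simple array of {le, value} pairs sorted by le (the real code is Go inside Prometheus; this only illustrates the effect):

// Illustration of the monotonic fix-up: every value that is lower than an
// earlier (smaller-le) value gets bumped up to that earlier value.
function ensureMonotonicSketch(buckets) {
    var max = buckets[0].value;
    var i;

    for (i = 1; i < buckets.length; i++) {
        if (buckets[i].value > max) {
            max = buckets[i].value;
        } else {
            buckets[i].value = max;
        }
    }
}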

So the data above becomes:

  • le=100 value=5
  • le=250 value=6
  • le=500 value=6
  • le=1000 value=8
  • le=2500 value=8

which is a big problem when you're combining multiple endpoints with different characteristics. If you go back to our serverlist and serverget example, you can see that this happened there when we ran histogram_quantile on it. The values there, from the query:

sort(sum(rate(http_request_duration_seconds{service="cnapi",route=~"serverget|serverlist"}[5m])) by (le))

looked like (trimmed to just the relevant bits):

  • le=0.001 value=0
  • le=0.002 value=0
  • ...
  • le=0.009 value=0
  • le=0.01 value=0
  • le=0.02 value=0
  • ...
  • le=0.07 value=0
  • le=0.08 value=0.0035
  • le=0.09 value=0.0211
  • le=0.18 value=0.0386
  • le=0.27 value=0.0386
  • le=0.36 value=0.0386
  • le=0.45 value=0.0386
  • le=0.54 value=0.0386
  • le=0.72 value=0.0386
  • le=0.81 value=0.0386
  • le=1.62 value=0
  • le=2.43 value=0
  • le=3.24 value=0
  • ...
  • le=8.1 value=0
  • le=684 value=0.0386
  • ...
  • le=6156 value=0.0386
  • le=+Inf value=0.0386

so you can see here that we're clearly mixing at least 2 different series. One had no values at le=1.62 and presumably did at some point later (maybe between 8.1 and 684?), and at least one other series seems to have started having values at 0.08 and peaked at 0.18.

With what we've described, though, the data from 1.62 up to 684 will all be replaced with 0.0386 since those are considered "nonsense" values. In fact nothing matters after 0.18: whatever the values were after that, anything less than 0.0386 would be replaced with 0.0386.

This is why, when we ran a histogram_quantile across this data above, we got a value of 0.1701... Because as far as the histogram_quantile is concerned, 95% (in fact 100%) of values were less than or equal to 0.18 (since that was the first bucket with the highest value).
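Working through the arithmetic with the rounded values from the list above shows where that number comes from (the rank and interpolation mechanics are described in more detail further below):

rank = 0.95 * 0.0386 (the +Inf value) ≈ 0.0367

The first bucket whose (fixed-up) value reaches 0.0367 is le=0.18 (0.0386), so Prometheus interpolates between le=0.09 (0.0211) and le=0.18:

0.09 + (0.18 - 0.09) * (0.0367 - 0.0211) / (0.0386 - 0.0211) ≈ 0.17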

Problem Number 2

Now that we've gone through that problem, there's another issue that we've observed with histograms while looking at the Triton data. Though this one has a similar cause (mixing unrelated buckets), the outcome is quite different.

I created a small program (see "Appendix 1: histogram-breaker.js") that uses node-artedi which you can point your Prometheus at by adding a section like:

  - job_name: 'histogram_breaker'
    static_configs:
    - targets: ['10.192.0.7:8889']

(replacing the IP with that of wherever you're running it; the program listens on all interfaces)

When you run this program, what it will do is:

  • every second observe values with the following labels/values:
    • label: {stream="10^-3"} value: random number 0.001 - 0.009999...
    • label: {stream="10^-2"} value: random number 0.01 - 0.09999...
    • label: {stream="10^-1"} value: random number 0.1 - 0.9999...
  • every 100 seconds, also observe a single value with:
    • label: {stream="outlier"} value: random number 10^6 - 10^7

Every 5 minutes, this will produce 900 (300 seconds * 3 values per second) values that are less than 1, and 3 outlier values greater than 10^6, so the outliers will be about 0.33% of the values.

Example data would look something like:

15_broken_histogram_by_itself

which shows that we're getting the proper distribution of 100 of each of the stream="10^-3", stream="10^-2", stream="10^-1", for every one stream="outlier" value.

If we look at the values as a rate, we see:

16_broken_histogram_as_5m_rate

which is unsurprising for the 10^* streams, but somewhat surprising for the outlier stream. Remember the rate value here is a per-second value. Since we're reporting one observation per second for each of the 10^* streams, the correct rate for those is 1, and Prometheus does a reasonable job of figuring that out. But for the outlier stream there are only at most 3 data points in a given 5m window (300 seconds, with one reported every 100s), so it seems to be more than 100% off here (I've seen it be more than 200% off with a value around 0.01). That's kind of annoying, but shouldn't actually impact even the P99 since we're still talking about less than 1% of the values even in that case.

Since we expect the values of the outlier stream to all be between 1,000,000 and 9,999,999 the following distribution seems reasonable:

17_broken_histogram_outlier_le_distribution

since it tells us that all values were less than 13,463,172 and there were no values less than 498,636 and only 1 between 498,636 and 1,495,908. This all seems sane.

The other streams also look sane, for example:

18_broken_histogram_10_2_le_distribution

here we'd expect all values to be 0.01 - 0.099 which matches the data here.

What happens, then, when we combine them? This simulates what we'd do, for example, with HTTP requests across different endpoints, where one endpoint might be fast and called frequently while other endpoints are called infrequently but take a long time, which we've actually seen in Triton.

When we sum across all of the streams, we see:

19_broken_histogram_sum_all_streams

The first thing you might notice here is that the values and buckets are all over the place due to the combination of the buckets from the outliers and the other streams. This data looks pretty weird from a Prometheus histogram perspective, since if we try to histogram_quantile over it, we already know (as discussed previously) that the higher le values will get run over with higher values so that the value for each le is >= all those below it.

When we try to take the P95 quantile, however, we end up with an even stranger result:

20_broken_histogram_p95_all_streams

And even if we go down to P60 we have the same result:

21_broken_histogram_p60_all_streams

which seems surprising since the value there is what we'd expect for an outlier, and yet those represent far less than 1% of the actual data.

What's going on here?

The first thing to note is that the 40389516 value is actually an exact bucket value and, more interestingly, it's the highest bucket (other than +Inf).

Looking again at:

sort(sum(rate(broken_histogram[5m])) by (le))

so that we can see what the values/buckets are that we're trying to put into quantiles, we see:

22_broken_histogram_sorted_sums_by_le

The value for that bucket is 0.010526352724044648, which is tied for the lowest non-zero value. The value for +Inf is 3.0000105263527246, so there's a big gap between these two.

In order to illustrate the problem here further, I built a modified version of Prometheus which output some state as it was executing the bucketQuantile function and I ran the query:

histogram_quantile(0.95, sum(rate(broken_histogram[5m])) by (le))

and then reformatted the results as follows:

  Initial Value                       Used Value
  -------------                       ----------

  {0.001 0}                           -
  {0.002 0.006758654401635337}        -
  {0.003 0.02027596320490601}         -
  {0.004 0.03379327200817669}         -
  {0.005 0.0540692352130827}          -
  {0.006 0.0540692352130827}          -
  {0.007 0.10137981602453006}         -
  {0.008 0.10137981602453006}         -
  {0.009 0.10137981602453006}         -

  {0.01 0.10137981602453006}          {0.01 0.10137981602453006}
  {0.02 0.006758654401635337}         {0.02 0.10137981602453006}
  {0.03 0.006758654401635337}         {0.03 0.10137981602453006}
  {0.04 0.013517308803270674}         {0.04 0.10137981602453006}
  {0.05 0.03379327200817669}          {0.05 0.10137981602453006}
  {0.06 0.04731058081144736}          {0.06 0.10137981602453006}
  {0.07 0.060827889614718036}         {0.07 0.10137981602453006}
  {0.08 0.0743451984179887}           {0.08 0.10137981602453006}
  {0.09 0.08110385281962404}          {0.09 0.10137981602453006}

  {0.18 0.1081384704261654}           -
  {0.27 0.1284144336310714}           -
  {0.36 0.1486903968359774}           -
  {0.45 0.15544905123761277}          -
  {0.54 0.16896636004088345}          -
  {0.63 0.18924232324578943}          -
  {0.72 0.19600097764742477}          -
  {0.81 0.19600097764742477}          -

  {1.62 0.10137981602453006}          {1.62 0.19600097764742477}
  {2.43 0.10137981602453006}          {2.43 0.19600097764742477}
  {3.24 0.10137981602453006}          {3.24 0.19600097764742477}
  {4.05 0.10137981602453006}          {4.05 0.19600097764742477}
  {4.86 0.10137981602453006}          {4.86 0.19600097764742477}
  {5.67 0.10137981602453006}          {5.67 0.19600097764742477}
  {6.48 0.10137981602453006}          {6.48 0.19600097764742477}
  {7.29 0.10137981602453006}          {7.29 0.19600097764742477}
  {8.1 0.10137981602453006}           {8.1 0.19600097764742477}

  {498636 0}                          {498636 0.19600097764742477}
  {1.495908e+06 0}                    {1.495908e+06 0.19600097764742477}
  {2.49318e+06 0}                     {2.49318e+06 0.19600097764742477}
  {3.490452e+06 0}                    {3.490452e+06 0.19600097764742477}
  {4.487724e+06 0}                    {4.487724e+06 0.19600097764742477}
  {1.3463172e+07 0}                   {1.3463172e+07 0.19600097764742477}
  {2.243862e+07 0}                    {2.243862e+07 0.19600097764742477}
  {3.1414068e+07 0}                   {3.1414068e+07 0.19600097764742477}
  {4.0389516e+07 0}                   {4.0389516e+07 0.19600097764742477}
  {+Inf 0.30413944807359017}          -

the pairs of numbers here are the le and the value, and the left column is what came in as a result of the:

sum(rate(broken_histogram[5m])) by (le)

query, which hopefully looks reasonable to you by now based on the data we're talking about here. Note that I ran this soon after restarting Prometheus and that's why the large buckets have no samples in the last 5 minutes. This actually is important but we'll discuss that shortly.

In the right column we're showing what the value for each bucket was when we used it, specifically immediately following the ensureMonotonic() call. We've discussed this above, but basically it is designed to ensure that each value as we go down this list is >= the previous value; if it's not, it's set to the previous value. Lines with a - in the right column mean that we did not modify the original data. So you can see here that the majority of the values were modified because of the lack of monotonicity, including those buckets which actually had a value of 0.

The stated intention of the ensureMonotonic function is to deal with the fact that scrapes aren't atomic, so data might still be coming in while it's being read, and decreasing values would be invalid anyway; for a valid histogram this does make sense. Unfortunately, since we've mixed multiple sets of buckets from different labels, we no longer have a valid Prometheus histogram, so this mangles it pretty badly.

So that explains the data in the table, except for the last value which is actually the key to understanding where the 40389516 came from when we tried to take the P60 and P95 histogram_quantile before.

One last digression and we'll get there.

Once it has ensured the values are monotonic, the next thing Prometheus does is determine which bucket holds the specified quantile. To figure that out, it takes the value of the +Inf bucket (which is expected to be the largest, since it counts every observation) and multiplies it by the q quantile value (0.95 here). In the code this is called the rank. The rank from the above data was 0.28893247566991065 (I modified Prometheus to output this too) which matches:

0.30413944807359017 (our +Inf value) * 0.95 = 0.28893247566991065

then it looks for the first bucket whose value is >= that rank, which in our case is only the +Inf bucket.

From that point it normally does some calculations to determine what value to return, based on where the rank sits between the found bucket and the bucket before it.

But there is a special case...

And now we're getting to the actual heart of this second problem with the histograms. In the special case, if the rank is found in the last bucket (i.e. +Inf), instead of determining where the value should be between the penultimate bucket and infinity, it will just return the le value of the previous bucket! So this explains the mystery of 40389516: the rank calculated as 0.95 * the +Inf value put us above the 40389516 bucket and into the +Inf bucket, so Prometheus returned the value 40389516. But if you recall, in this case the original value for the 40389516 bucket was 0 (because there were no samples in that bucket in the time range).
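Putting those pieces together, here is a simplified sketch of the lookup that happens after the monotonic fix-up. It's an illustration rather than Prometheus's actual Go code, and it skips some edge cases around the lowest bucket:

// Simplified sketch of the quantile lookup (post ensureMonotonic).
// buckets: [{le: Number, value: Number}, ...] sorted by le; last is +Inf.
function bucketQuantileSketch(q, buckets) {
    var infValue = buckets[buckets.length - 1].value;
    var rank = q * infValue;
    var i = 0;
    var bucketStart, bucketEnd, countStart, countEnd;

    // find the first bucket whose value reaches the rank
    while (buckets[i].value < rank) {
        i++;
    }

    if (i === buckets.length - 1) {
        // Special case: the rank landed in the +Inf bucket, so just return
        // the penultimate bucket's le. This is how 40389516 got returned.
        return buckets[buckets.length - 2].le;
    }

    bucketStart = (i === 0) ? 0 : buckets[i - 1].le;
    bucketEnd = buckets[i].le;
    countStart = (i === 0) ? 0 : buckets[i - 1].value;
    countEnd = buckets[i].value;

    // otherwise interpolate linearly within the bucket containing the rank
    return bucketStart +
        (bucketEnd - bucketStart) * (rank - countStart) / (countEnd - countStart);
}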

So basically here we've ended up in a really gross state. If you have one label (here {stream="outlier"}) that has ever had a huge outlier value and has thus generated a bad bucket... you've ruined your histogram forever. Even if that bad value is never seen again, and even if that bucket is never used again. At least, if you're mixing that set of buckets with any other set of buckets.

There's one final piece here. Why was +Inf so much larger than the largest bucket if we had no observations that were larger than 8.1 in this window? That's what forced us to be past the last bucket and into +Inf when we calculated the rank. And the reason is that we've combined a bunch of separate sets of histogram buckets into a single thing that we're trying to pass off as a single histogram. To illustrate, a simplified example... Say we had 3 different histograms each with only 2 buckets that when called with:

rate(simpleExample{example_number="X"}[5m])

had the following values:

// this metric outputs:
//  - a 0.001 value every other second
//  - a 0.01  value twice a second
simpleExample{le="0.001",example_number="1"} 0.5
simpleExample{le="0.01",example_number="1"} 2.0
simpleExample{le="+Inf",example_number="1"} 2.0

and

// this metric outputs:
//  - a 0.002 value every other second
//  - a 0.02  value twice a second
simpleExample{le="0.002",example_number="2"} 0.5
simpleExample{le="0.02",example_number="2"} 2.0
simpleExample{le="+Inf",example_number="2"} 2.0

and

// this metric outputs:
//  - a 0.003 value every other second
//  - a 0.03  value twice a second
simpleExample{le="0.003",example_number="3"} 0.5
simpleExample{le="0.03",example_number="3"} 2.0
simpleExample{le="+Inf",example_number="3"} 2.0

when we're combining these together with say:

sum(rate(simpleExample[5m])) by (le)

we'd get:

{le="0.001"} 0.5     // from example_number="1"
{le="0.002"} 0.5     // from example_number="2"
{le="0.003"} 0.5     // from example_number="3"
{le="0.01"} 2.0      // from example_number="1"
{le="0.02"} 2.0      // from example_number="2"
{le="0.03"} 2.0      // from example_number="3"
{le="+Inf"} 6.0      // from example_number="1" + example_number="2" + example_number="3"

so while each of the individual metrics makes sense, doing a sum across them breaks everything, because the buckets are not the same and so the summed +Inf no longer lines up with the other buckets. Doing a histogram_quantile on this bad data will give us something that makes no sense: everything from P34 up will give us a result of 0.03 (the le of the largest bucket before +Inf).
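To check that with the numbers above: the rank for P34 is 0.34 * 6.0 = 2.04, which is larger than the 2.0 found in every bucket up to le="0.03", so the rank lands in the +Inf bucket and the special case described earlier returns the penultimate bucket's le, 0.03. At P33 and below the rank stays at or below 2.0 and lands in a real bucket, so the interpolated result is at most 0.01.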

If we took exactly the same metrics and values, and instead put them with the same set of buckets for all 3 histograms: [0.001, 0.002, 0.003, 0.01, 0.02, 0.03]... (we're simplifying here!) with the metrics doing exactly the same thing, we'd end up with:

// this metric outputs:
//  - a 0.001 value every other second
//  - a 0.01  value twice a second
simpleExample{le="0.001",example_number="1"} 0.5
simpleExample{le="0.002",example_number="1"} 0.5
simpleExample{le="0.003",example_number="1"} 0.5
simpleExample{le="0.01",example_number="1"} 2.0
simpleExample{le="0.02",example_number="1"} 2.0
simpleExample{le="0.03",example_number="1"} 2.0
simpleExample{le="+Inf",example_number="1"} 2.0

and

// this metric outputs:
//  - a 0.002 value every other second
//  - a 0.02  value twice a second
simpleExample{le="0.001",example_number="2"} 0
simpleExample{le="0.002",example_number="2"} 0.5
simpleExample{le="0.003",example_number="2"} 0.5
simpleExample{le="0.01",example_number="2"} 0.5
simpleExample{le="0.02",example_number="2"} 2.0
simpleExample{le="0.03",example_number="2"} 2.0
simpleExample{le="+Inf",example_number="2"} 2.0

and

// this metric outputs:
//  - a 0.003 value every other second
//  - a 0.03  value twice a second
simpleExample{le="0.001",example_number="3"} 0
simpleExample{le="0.002",example_number="3"} 0
simpleExample{le="0.003",example_number="3"} 0.5
simpleExample{le="0.01",example_number="3"} 0.5
simpleExample{le="0.02",example_number="3"} 0.5
simpleExample{le="0.03",example_number="3"} 2.0
simpleExample{le="+Inf",example_number="3"} 2.0

now when we're combining these together with:

sum(rate(simpleExample[5m])) by (le)

we'd get:

{le="0.001"} 0.5     // from example_number="1" + example_number="2" + example_number="3"
{le="0.002"} 1.0     // from example_number="1" + example_number="2" + example_number="3"
{le="0.003"} 1.5     // from example_number="1" + example_number="2" + example_number="3"
{le="0.01"} 3.0      // from example_number="1" + example_number="2" + example_number="3"
{le="0.02"} 4.5      // from example_number="1" + example_number="2" + example_number="3"
{le="0.03"} 6.0      // from example_number="1" + example_number="2" + example_number="3"
{le="+Inf"} 6.0      // from example_number="1" + example_number="2" + example_number="3"

which is still (even after the sum) a valid Prometheus histogram. We can do histogram_quantile on this, and P99 would be less than 0.03 (and greater than 0.02), which makes much more sense.
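Checking the P99 claim the same way: rank = 0.99 * 6.0 = 5.94, which falls in the le="0.03" bucket (4.5 at le="0.02", 6.0 at le="0.03"), so the interpolated result is 0.02 + (0.03 - 0.02) * (5.94 - 4.5) / (6.0 - 4.5) = 0.0296.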

Two problems, one real cause

So both of the above problems really have the same cause: the fact that, unlike every other Prometheus library, we dynamically generate buckets and tie them to the histogram+labels combination.

If broken_histogram{stream="10^-3"} and broken_histogram{stream="outlier"} were both using the same set of buckets, we could not have this situation. The occasional outlier would show up as a value that matches the highest bucket, but the histogram_quantile would go back to normal once the outlier value had gone back out of the [5m] window. Which is probably what you'd want. You'd get a spike there, then things would settle back down.

And the first problem we discussed (values getting run over by ensureMonotonic) also only happens because we're combining different sets of buckets. If we had the same buckets, that would not be an issue either.

Solutions

Change node-artedi to be compatible with Prometheus

The first and perhaps most obvious solution (also my preferred solution) is to change node-artedi to be fully compatible with Prometheus by making buckets work like they do in all the supported libraries.

Specifically, change node-artedi so that:

  • it has a fixed set of default buckets (and uses default buckets compatible with other libraries)
  • it allows passing in an alternative fixed set of buckets

If we did this, we'd pass the same set of buckets in node-triton-metrics, so for the http_request_duration_seconds histogram we'd always have the same buckets and would be able to combine and histogram_quantile all day long.

If we still wanted log-linear buckets, we could provide an:

  • artedi.logLinearBuckets() function

which could generate an array of log-linear buckets that could be passed in when creating a histogram.
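As a sketch of what that could look like for a consumer (the buckets option name and the exact generator call here are illustrative assumptions, not necessarily the final API in this PR):

// Hypothetical usage sketch; option and function names are illustrative.
var artedi = require('artedi');
var collector = artedi.createCollector();

var histogram = collector.histogram({
    name: 'http_request_duration_seconds',
    help: 'total time to process requests',
    // one explicit, fixed set of buckets shared by every label combination
    buckets: artedi.logLinearBuckets(10, -3, 1, 4)
});

Because every label combination (and every restart, and every instance) would share this one bucket list, sum(rate(...)) by (le) and histogram_quantile would always operate on a single, consistent set of buckets.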

This is the only one of these proposed solutions that actually solves all the problems discussed above.

The PR this is attached to implements this solution with a new major version for node-artedi but without adding an artedi.logLinearBuckets() function.

Tying buckets to histogram rather than histogram+labels

Another related option would be to keep computing the buckets dynamically, but compute them for the whole histogram within a service rather than per histogram+labels combination. If the http_request_duration_seconds histogram for {service="cnapi"} used the same buckets regardless of user_agent, method, etc., we'd at least be able to get valid histogram_quantile results for a single service and all its routes. But we'd still need to store the resulting buckets somewhere, so I don't think this solution is really great.

Stop using labels

Another, possibly more drastic, option here would be to stop using any labels other than {service="cnapi"} with the histograms in node-triton-metrics. This would solve most of the problem since the buckets would be the same as long as the CNAPI instance is running. We would still have the problem that different buckets could be used when CNAPI restarts, however (see the next section: Problems with solutions).

Problems with solutions

Having the same set of buckets every time, at least within a service, does require us to think a bit more about how the buckets should be set up for a service, but this seems preferable to having useless histograms.

The other thing that has been mentioned is that changing the buckets (e.g. for a new version of the software) would invalidate all previous data. That is true insofar as you can't do a histogram_quantile across data with different buckets... but that's already true with the current node-artedi implementation. When you start your program, create a histogram, and then expose it with a /metrics endpoint, nothing stores the set of buckets that were used. So if you stop your program and start it up again, it's already possible you're going to have a different set of buckets for the first call to /metrics than you had when your application was running previously.

It's easy to verify this by just running:

// run this from a node-artedi 1.x checkout
var artedi = require('./');
var collector = artedi.createCollector();
var value = process.argv[process.argv.length - 1];

var histogram = collector.histogram({name: 'myHistogram', help: 'I do not know what buckets to use'});
histogram.observe(Number(value));
collector.collect(artedi.FMT_PROM, function sendMetrics(_, metrics) { console.log(metrics); });

with different arguments:

failhammer:node-artedi joshw$ node buckets.js 100
# HELP myHistogram I do not know what buckets to use
# TYPE myHistogram histogram
myHistogram{le="76"} 0
myHistogram{le="228"} 1
myHistogram{le="380"} 1
myHistogram{le="532"} 1
myHistogram{le="684"} 1
myHistogram{le="+Inf"} 1
myHistogram_count{} 1
myHistogram_sum{} 100

vs

failhammer:node-artedi joshw$ node buckets.js 1.5
# HELP myHistogram I do not know what buckets to use
# TYPE myHistogram histogram
myHistogram{le="0.81"} 0
myHistogram{le="1.62"} 1
myHistogram{le="2.43"} 1
myHistogram{le="3.24"} 1
myHistogram{le="4.05"} 1
myHistogram{le="4.86"} 1
myHistogram{le="5.67"} 1
myHistogram{le="6.48"} 1
myHistogram{le="7.29"} 1
myHistogram{le="8.1"} 1
myHistogram{le="+Inf"} 1
myHistogram_count{} 1
myHistogram_sum{} 1.5

I think if we're going to keep generating the buckets in node-artedi, we need to implement some mechanism to write out the bucket configuration somehow so that on restart of the service we end up with the same buckets. Either that or the set of buckets artedi generates needs to be the same for every application in every instance. And I think that just using fixed buckets is a much better solution.

Conclusion

It seems that dynamically generating buckets rather than using fixed buckets for Prometheus histograms is untenable. As such, we should figure out a way to have fixed buckets for Triton services. One option would be to integrate the attached PR (which modifies node-artedi to only support fixed buckets) and then make the appropriate modifications to node-triton-metrics to use it.

(Somewhat) Related Reading

Appendix 1: histogram-breaker.js

/*
 * This Source Code Form is subject to the terms of the Mozilla Public
 * License, v. 2.0. If a copy of the MPL was not distributed with this
 * file, You can obtain one at http://mozilla.org/MPL/2.0/.
 */

/*
 * Copyright (c) 2018, Joyent, Inc.
 */

/*
 * Histogram Breaker
 *
 *   node histogram-breaker.js
 *
 * Then point your prometheus at http://<yourIP>:8889/metrics on whatever
 * host you ran this (it listens on all interfaces).
 *
 * And look at the: broken_histogram metric.
 *
 */

var artedi = require('artedi');
var bunyan = require('bunyan');
var restify = require('restify');
var VError = require('verror'); // used when metrics collection fails below

var MIN_STREAM = -3;
var MAX_STREAM = 0;
var OUTLIER_FREQUENCY = 100; // in seconds between outliers
var OUTLIER_DELTA = 6;

var log = new bunyan({
    name: 'histogram-breaker',
    level: 'debug',
    serializers: {
        err: bunyan.stdSerializers.err,
        req: bunyan.stdSerializers.req,
        res: bunyan.stdSerializers.res
    }
});

var collector = artedi.createCollector();
var outlierCounter = 0;

// This function is a copy from node-triton-metrics.
function getMetrics(req, res, next) {
    collector.collect(artedi.FMT_PROM, function sendMetrics(
        err,
        metrics
    ) {
        if (err) {
            next(new VError(err, 'error retrieving metrics'));
            return;
        }
        /*
         * Content-Type header is set to indicate the Prometheus
         * exposition format version
         *
         * More information at:
         * https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md#format-version-004
         */
        res.setHeader('Content-Type', 'text/plain; version=0.0.4');
        res.send(metrics);
        next();
    });
}

// Generate a random value between min (inclusive) and max (exclusive)
function generateValue(min, max) {
    return Math.random() * (max - min) + min;
}

function doObservations(histogram) {
    var exponent;
    var max;
    var min;
    var value;

    log.debug('doing observations');

    for (exponent = MIN_STREAM; exponent < MAX_STREAM; exponent++) {
        max = Math.pow(10, exponent + 1);
        min = Math.pow(10, exponent);

        value = generateValue(min, max);

        log.trace({
            min: min,
            max: max,
            value: value
        }, 'going to observe');

        histogram.observe(value, {
            stream: '10^' + exponent
        });
    }

    // Every OUTLIER_FREQUENCY observations, throw in a single much bigger
    // observation.
    if (outlierCounter++ >= OUTLIER_FREQUENCY) {
        max = Math.pow(10, MAX_STREAM + OUTLIER_DELTA + 1);
        min = Math.pow(10, MAX_STREAM + OUTLIER_DELTA);
        value = generateValue(min, max);

        histogram.observe(value, {
            stream: 'outlier'
        });

        outlierCounter = 0;
    }

    // schedule to call ourselves again in 1s
    setTimeout(doObservations, 1000, histogram);
}

function main() {
    var config = {log: log};
    var histogram;
    var server;

    server = restify.createServer(config);

    server.on('after', function _doLogging(req, res, route, err) {
        restify.auditLogger({log: log})(req, res, route, err);
    });

    histogram = collector.histogram({
        name: 'broken_histogram',
        help: 'histogram of brokenness'
    });

    doObservations(histogram);

    server.get({
        name: 'GetMetrics',
        path: '/metrics'
    }, getMetrics);

    server.listen(8889, function _onListen() {
        log.info('%s listening at %s', server.name, server.url);
    });
}

main();

README.md Outdated
@@ -40,7 +40,10 @@ collector.collect(artedi.FMT_PROM, function (err, metrics) {

var histogram = collector.histogram({
name: 'http_request_latency_ms',
Contributor

Feel free to change these examples to use '_seconds' where they're set to '_ms.' Since we're trying to do things better now we may as well stop misleading users.

Contributor Author

Sounds good to me!

@KodyKantor (Contributor)

Thanks! Great analysis. I just posted a couple small notes in the code.

Before we merge this it would be nice to make sure this works with an existing service - a Triton or Manta service. It looks good to me, but if we're going to do a major rev. we may as well make sure we don't need a bug fix patch right out the gate.

@arekinath (Contributor)

On the release management front here, I think we should get at least one bucket generator function into the library before doing the new release. If we do a release without adding any, services using this are going to reinvent their own code for generating buckets and we'll be stuck with almost-copies of that code around the place forever.

@joshwilsdon (Contributor Author)

> On the release management front here, I think we should get at least one bucket generator function into the library before doing the new release.

Ok, sure.

@joshwilsdon (Contributor Author)

Just so it doesn't get lost, for reference here are the buckets that it is possible to generate with node-artedi 1.x:

Magnitude Values
10^-4 0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007, 0.0008, 0.0009
10^-3 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009
10^-2 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09
10^-1 0.18, 0.27, 0.36, 0.45, 0.54, 0.63, 0.72, 0.81
10^0 1.62, 2.43, 3.24, 4.05, 4.86, 5.67, 6.48, 7.29, 8.1
10^1 25, 42, 59, 76
10^2 228, 380, 532, 684
10^3 2052, 3420, 4788, 6156
10^4 18468, 30780, 43092, 55404
10^5 166212, 277020, 387828, 498636
10^6 1495908, 2493180, 3490452, 4487724
10^7 13463172, 22438620, 31414068, 40389516
10^8 121168548, 201947580, 282726612, 363505644
10^9 1090516932, 1817528220, 2544539508, 3271550796

There's no generator function in this PR that generates buckets using this pattern (there was briefly, but that was subsequently removed so that we don't encourage its use) but I'm including them here because I think they might be helpful as a reference when thinking about how to update software.

README.md Outdated
// http_request_latency_seconds{method="putobjectdir",le="10"} 1
// http_request_latency_seconds{le="+Inf",method="putobjectdir"} 1
// http_request_latency_seconds_count{method="putobjectdir"} 1
// http_request_latency_seconds_sum{method="putobjectdir"} 998
Contributor

I think this _sum value would be .998, right?

lib/collector.js Outdated
@@ -344,9 +345,13 @@ function processTriggers(opts, cb) {
}

module.exports = {
artedi1Buckets: lib_buckets.artedi1Buckets,
Contributor

Is this function gone now?

lib/buckets.js Outdated
assert.ok(min > 0, 'min must be > 0');

// TODO
// - limit on factor?
Contributor

Maybe you're still working on this, but I think it would be good to get some of these TODOs in. They would be breaking changes in the future.

Contributor Author

Yeah, I'm still working on that.

Contributor Author

@KodyKantor do you have suggestions for what sanity limits we should set on these generator functions? From the prometheus source, it seems the observed value is a float64 which from this: prometheus/prometheus#3031 seems like we should probably avoid numbers > 2^53? Other than that limit, I'm not sure what to check here. We probably should put some bound just to prevent people from creating more buckets than is ever reasonable. 100 buckets sounds like a lot to me, but 1000 seems like it would always be a mistake. But I don't know what others think.

Contributor

@joshwilsdon, yes I think limiting to (2^53)-1 to keep the buckets within the node.js limit is a good idea.

I think stopping at 100 buckets across the board sounds good. 100 is more than enough buckets to do the sort of things we do today. The only reason I would say we should have 100 and not 75 (or another arbitrary number) is that someone might want to observe percentage as integers and keep one percent value per bucket.

Contributor

http://linuxczar.net/blog/2016/12/31/prometheus-histograms/#closing-thoughts suggests that a case of over 100 buckets might be reasonable. He discusses an example with ~600 buckets.


My personal preference would be to not place arbitrary limits like 100 or 1000 on bucket count or value. Instead, I think it is reasonable to only limit it to buckets that will be valid. With that in mind, the only invalid bucket count and value would be anything bigger than 2^31 - 1, which is the max safe signed int as specified in the golang histogram that Prometheus uses. That said, it might be worth logging warn for anything over say 1000 buckets with a note about how it may create potential performance problems for the collecting endpoint.

docs/API.md Outdated
// returns [ 1, 2, 4, 8, 16 ]
```

#### logLinearBuckets(base, lowPower, highPower, bucketsPerMagnitude)
Contributor

We're missing the artedi. namespace preface here

@davepacheco

Thanks for the detailed write-up!

Am I understanding correctly that the root issue here is that the buckets are dynamically generated and that we leave out larger buckets when we haven't seen data points that large? And that means that the range covered by the last bucket may vary depending on what measurements have already been observed? i.e., if we continued to dynamically generate the buckets, but ensured that we always generated all of them for each set of labels, that would work? (Not saying we should do that. The proposed solution makes sense to me.)

I'm wondering about the upgrade and rollout part. (Sorry if this is just my lack of really understanding the problem or solution here.) Being a major bump, I assume we'd want to update CHANGES.md with instructions for a consumer about what they need to know and what changes they would make when they update their software to the new version. It seems like they could be in a situation where they weren't affected by this before (e.g., if they weren't using many labels, or if they tended to have the same distribution of values that filled out all the same buckets quickly after startup). Now, if they update to v2, their programs would be generating a different set of buckets. In that case, during a phased rollout of the new version of their program, they'd actually be introducing the same problem we're trying to solve here (namely, different instances returning different buckets).

Here's an idea:

  • define the bucket generator function in this PR
  • document that for people upgrading from v1, if they want an existing histogram metric to generate buckets that are compatible with other instances running v1, they should specify the legacy bucket generator function. Maybe explain what happens if they don't change this and then do a rolling upgrade of a number of instances across versions?

Alternatively, use a separate type, like "histogram_legacy"?

I'm hoping this is relatively easy to do (if you were already going to support the generator function).
This way, by default people get the better thing, but people also have the opportunity to move forward more compatibly. For example, in Muskie, we may decide to define a second metric using the new bucket scheme while continuing to export the older metric. That way, operators can update their dashboards after they've updated all instances. Then we'd eventually remove the older metric. (I'm not sure we'd do this -- I think we'd want to know how broken our metrics are already today.)

Again, sorry if this is just my lack of understanding about the problem. If the old data is truly non-sensical, maybe this isn't worthwhile. But in practice it has matched the behavior we'd expect in a lot of cases, and it would be pretty painful if the intermediate state (some muskies updated before others) made the data worse. It's likely to be weeks or even months between updating the first and last muskies covered by some of our production dashboards.

@joshwilsdon (Contributor Author)

> Thanks for the detailed write-up!

@davepacheco thanks for taking the time to read it and provide feedback!

> Am I understanding correctly that the root issue here is that the buckets are dynamically generated and that we leave out larger buckets when we haven't seen data points that large? And that means that the range covered by the last bucket may vary depending on what measurements have already been observed? i.e., if we continued to dynamically generate the buckets, but ensured that we always generated all of them for each set of labels, that would work? (Not saying we should do that. The proposed solution makes sense to me.)

The dynamic buckets are indeed the biggest problem here. However, it's not just the larger buckets; smaller buckets are also a problem. The real problem is that everything you compare needs to be using the same set of buckets. If you have a service with multiple instances and instanceA starts up and its first observed value is 0.850, with artedi 1.x you'll get the buckets:

>  var artedi = require('./');
undefined
> var collector = artedi.createCollector();
undefined
> var histogram = collector.histogram({name: 'myHistogram', help: 'no help for you', labels: {service: 'myService', instance: 'instanceA'}});
undefined
> histogram.observe(0.850)
undefined
> collector.collect(artedi.FMT_PROM, function (err, str) { console.log('\n' + str); });
undefined
>
# HELP myHistogram no help for you
# TYPE myHistogram histogram
myHistogram{service="myService",instance="instanceA",le="0.81"} 0
myHistogram{service="myService",instance="instanceA",le="1.62"} 1
myHistogram{service="myService",instance="instanceA",le="2.43"} 1
myHistogram{service="myService",instance="instanceA",le="3.24"} 1
myHistogram{service="myService",instance="instanceA",le="4.05"} 1
myHistogram{service="myService",instance="instanceA",le="4.86"} 1
myHistogram{service="myService",instance="instanceA",le="5.67"} 1
myHistogram{service="myService",instance="instanceA",le="6.48"} 1
myHistogram{service="myService",instance="instanceA",le="7.29"} 1
myHistogram{service="myService",instance="instanceA",le="8.1"} 1
myHistogram{le="+Inf",service="myService",instance="instanceA"} 1
myHistogram_count{service="myService",instance="instanceA"} 1
myHistogram_sum{service="myService",instance="instanceA"} 0.85

then if instanceB starts up and gets a different value (say 0.800) you get a totally different set of buckets:

>  var artedi = require('./');
undefined
> var collector = artedi.createCollector();
undefined
> var histogram = collector.histogram({name: 'myHistogram', help: 'no help for you', labels: {service: 'myService', instance: 'instanceB'}});
undefined
> histogram.observe(0.800)
undefined
> collector.collect(artedi.FMT_PROM, function (err, str) { console.log('\n' + str); });
undefined
>
# HELP myHistogram no help for you
# TYPE myHistogram histogram
myHistogram{service="myService",instance="instanceB",le="0.09"} 0
myHistogram{service="myService",instance="instanceB",le="0.18"} 0
myHistogram{service="myService",instance="instanceB",le="0.27"} 0
myHistogram{service="myService",instance="instanceB",le="0.36"} 0
myHistogram{service="myService",instance="instanceB",le="0.45"} 0
myHistogram{service="myService",instance="instanceB",le="0.54"} 0
myHistogram{service="myService",instance="instanceB",le="0.63"} 0
myHistogram{service="myService",instance="instanceB",le="0.72"} 0
myHistogram{service="myService",instance="instanceB",le="0.81"} 1
myHistogram{le="+Inf",service="myService",instance="instanceB"} 1
myHistogram_count{service="myService",instance="instanceB"} 1
myHistogram_sum{service="myService",instance="instanceB"} 0.8

now if you tried to do a query like:

histogram_quantile(0.95, sum(rate(myHistogram{service="myService"}[5m])) by (le))

you would be getting an invalid result. (In this case I believe you'd get a P50, P95, and P99 of 8.1 seconds, since 8.1 is the largest bucket.) The reason you get bad data is that you're comparing two different sets of buckets, and instanceB is missing data in almost all of the buckets used by instanceA (all but one).

The only way you can get valid results is when the query you're using for histogram_quantile only involves metrics with exactly the same buckets. In this case, if for this service we had the buckets:

{le="0.09"}
{le="0.18"}
{le="0.27"}
{le="0.36"}
{le="0.45"}
{le="0.54"}
{le="0.63"}
{le="0.72"}
{le="0.81"}
{le="1.62"}
{le="2.43"}
{le="3.24"}
{le="4.05"}
{le="4.86"}
{le="5.67"}
{le="6.48"}
{le="7.29"}
{le="8.1"}

for both instances from the start, we would be able to include both in a histogram_quantile without getting incorrect data.
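
To make that concrete, here's a minimal sketch of what creating the histogram with a fixed, shared bucket list could look like, assuming histograms grow a buckets option as this PR proposes (the option name and shape here are assumptions, not the final v2 API):

var artedi = require('artedi');

var collector = artedi.createCollector();

// Both instanceA and instanceB create the histogram with the same fixed
// bucket list, so they always emit the same set of le="..." series no matter
// which values they happen to observe first.
var histogram = collector.histogram({
    name: 'myHistogram',
    help: 'no help for you',
    labels: { service: 'myService' },
    buckets: [
        0.09, 0.18, 0.27, 0.36, 0.45, 0.54, 0.63, 0.72, 0.81,
        1.62, 2.43, 3.24, 4.05, 4.86, 5.67, 6.48, 7.29, 8.1
    ]
});

histogram.observe(0.850); // increments the 1.62 through 8.1 buckets and +Inf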

To continue to dynamically generate the buckets, we'd have to come up with a way to ensure that each instance of a service always generates the same set of buckets, which is basically the same thing as using a function to generate a fixed set of buckets up-front and passing that in when creating the histogram. Also note that if we were dynamically creating buckets in the sense of adding a bucket the first time the program observes a value that needs it, we'd need to both:

  • share that data with all other instances so that they start including that bucket
  • store that value on disk somewhere so that on startup we are including the same set of buckets we had last time this instance was running

Failing to do either of those would very likely result in invalid data when trying to do a histogram_quantile.

Also note that any buckets any instance of any service has used should be included in all future results. Permanently adding additional buckets to all instances seems to be fine (I haven't dug into this too deeply, but that's what my experiments so far suggest), but removing buckets will break things. That's why, on startup, it's important for an application that has been dynamically generating buckets to load from disk the set of all buckets it (or any other instance) has used, and to start with that set of buckets.

From my perspective, it's easier just to change things so that the set of buckets is fixed in the code, so that there's no need for coordination between instances or storage of which buckets have been generated so far.
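
As a sketch of what "fixed in the code" could look like: assuming the v2 module ends up exposing a log-linear generator along the lines of artedi.logLinearBuckets(base, lowPower, highPower, bucketsPerMagnitude) (the name and signature here are assumptions based on this proposal), a consumer would do something like:

var artedi = require('artedi');

var collector = artedi.createCollector();

// The bucket set is computed once from parameters that live in the code, so
// every instance of the service gets exactly the same buckets with no
// coordination or persisted state. The values roughly mirror the Prometheus
// defaults: base 10, magnitudes 10^-3 through 10^1, 4 linear steps each.
var histogram = collector.histogram({
    name: 'http_request_duration_seconds',
    help: 'total time to handle requests',
    buckets: artedi.logLinearBuckets(10, -3, 1, 4)
});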

I'm wondering about the upgrade and rollout part. (Sorry if this is just my lack of really understanding the problem or solution here.) Being a major bump, I assume we'd want to update CHANGES.md with instructions for a consumer about what they need to know and what changes they would make when they update their software to the new version.

I can try to add some more detailed instructions/explanation to CHANGES.md. That's a good point.

It seems like they could be in a situation where they weren't affected by this before (e.g., if they weren't using many labels, or if they tended to have the same distribution of values that filled out all the same buckets quickly after startup).

I suppose it's possible. If your application only ever produced a very narrow range of values, such that every instance always had exactly the same buckets and never saw any outliers, and you either never restarted your service (so it never started out with fewer buckets) or got requests across your whole range of values immediately on startup... That seems like a pretty unlikely case to me, though. But it is possible.

Now, if they update to v2, their programs would be generating a different set of buckets. In that case, during a phased rollout of the new version of their program, they'd actually be introducing the same problem we're trying to solve here (namely, different instances returning different buckets).

Yeah, my recommendation in the upgrade notes will be that you really should either:

  • update all instances to v2.x and then delete all existing data (probably preferable, since if you were using artedi 1.x your histogram data is almost certainly broken)
  • create a new set of buckets that includes every bucket you have ever used, and understand that histogram_quantile graphs that span your old data will be wrong (see the sketch after this list)
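
A sketch of that second option, assuming the same buckets option as above (mergedBuckets is a made-up helper and the bucket values are only examples):

var artedi = require('artedi');

// Hypothetical helper: union of the buckets this service emitted under 1.x
// and the new fixed set, de-duplicated and sorted ascending.
function mergedBuckets(oldBuckets, newBuckets) {
    return Array.from(new Set(oldBuckets.concat(newBuckets)))
        .sort(function (a, b) { return (a - b); });
}

var collector = artedi.createCollector();
var histogram = collector.histogram({
    name: 'myHistogram',
    help: 'no help for you',
    buckets: mergedBuckets(
        [0.09, 0.18, 0.27, 0.36, 0.45, 0.54, 0.63, 0.72, 0.81],    // seen under 1.x
        [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]) // new scheme
});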

Here's an idea:

  • define the bucket generator function in this PR
  • document that for people upgrading from v1, if they want an existing histogram metric to generate buckets that are compatible with other instances running v1, they should specify the legacy bucket generator function. Maybe explain what happens if they don't change this and then do a rolling upgrade of a number of instances across versions?

I did have an artedi.artedi1Buckets that generated buckets compatible with artedi 1.x but that was removed because it was suggested that this would just be extending people's problems rather than encouraging them to solve them.

You can see that implementation here:

https://github.com/joyent/node-artedi/blob/48369f498d144c22f6188fec08989fff5b85ef7b/lib/buckets.js#L51-L99

which basically just gives you the option to select a portion of the 1.x bucket space:

https://github.com/joyent/node-artedi/blob/48369f498d144c22f6188fec08989fff5b85ef7b/lib/buckets.js#L11-L28

Alternatively, use a separate type, like "histogram_legacy"?

If you mean having an artedi.histogram_legacy() which just generates all the possible artedi v1.x buckets without a buckets parameter, that could work but might be surprising for people since now they'll have a lot more buckets (80 buckets everywhere).

If you mean having it work the way the current histogram function works (dynamically generating buckets), I think that's probably not a good idea, since that would be generating bad data that people would likely eventually need to reset their Prometheus to remove. But it would be possible to do if we make it clear to people what the consequences of using it are.

I'm hoping this is relatively easy to do (if you were already going to support the generator function).
This way, by default people get the better thing, but people also have the opportunity to move forward more compatibly. For example, in Muskie, we may decide to define a second metric using the new bucket scheme while continuing to export the older metric. That way, operators can update their dashboards after they've updated all instances. Then we'd eventually remove the older metric. (I'm not sure we'd do this -- I think we'd want to know how broken our metrics are already today.)

I guess it sounds here like you meant for histogram_legacy to continue to do exactly the same thing that the 1.x version did?

Again, sorry if this is just my lack of understanding about the problem. If the old data is truly nonsensical, maybe this isn't worthwhile. But in practice it has matched the behavior we'd expect in a lot of cases, and it would be pretty painful if the intermediate state (some muskies updated before others) made the data worse. It's likely to be weeks or even months between updating the first and last muskies covered by some of our production dashboards.

The data itself is not nonsensical; it's just that it's incompatible with an actual Prometheus histogram if you've ever had outliers, or if you're ever using different buckets between instances of a service, or really between any combination of labels. The way it is right now, if you have a label for status_code (which we do in Triton), you can end up with a different set of buckets for status_code=500 than for status_code=200, even for the same service and instance. That alone makes histogram_quantile incorrect for that service.

Basically, when using histogram_quantile on data generated by node-artedi 1.x, the accuracy of what you're looking at depends mostly on luck. If you get lucky, your buckets mostly line up (especially at the top end), and you rarely restart (and so rarely have instances with different buckets for any length of time), you might get results that are mostly correct.

@trentm
Contributor

trentm commented Oct 22, 2018

@davepacheco A potential way one could upgrade a lot of, say, muskies to an artedi v2 that no longer supports the v1 bucket generation is perhaps this:

  1. update the code to artedi v2 and use new metric names for any histograms
  2. copy the "artedi1Buckets" function example above to manta-muskie.git and use that to generate static buckets for the old histogram metric names (see the sketch after this list)
  3. roll out updates
  4. eventually drop the old metric names
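
For steps 1 and 2, that might look roughly like this in muskie (a sketch only: the metric names and the legacy bucket values are placeholders, and the logLinearBuckets call assumes the generator discussed above):

var artedi = require('artedi');

var collector = artedi.createCollector();

// The old metric name keeps a static bucket list covering the v1-generated
// buckets this service has actually emitted in production (values here are
// placeholders; the copied "artedi1Buckets" function could generate them).
var oldHistogram = collector.histogram({
    name: 'http_request_duration_seconds',
    help: 'request latency (legacy-compatible buckets)',
    buckets: [
        0.09, 0.18, 0.27, 0.36, 0.45, 0.54, 0.63, 0.72, 0.81,
        1.62, 2.43, 3.24, 4.05, 4.86, 5.67, 6.48, 7.29, 8.1
    ]
});

// The new metric name uses the new fixed bucket scheme from day one, so its
// data is clean regardless of how far the rollout has progressed.
var newHistogram = collector.histogram({
    name: 'http_request_latency_seconds',
    help: 'request latency in seconds',
    buckets: artedi.logLinearBuckets(10, -3, 1, 4)
});

Once every instance is on v2 and the dashboards have been switched to the new metric, the old one can be dropped (step 4).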

@KodyKantor
Contributor

If we can't just drop the artedi v1 buckets, then trentm's idea sounds good. It would be nice to revisit our metric names and units. Do we actually need to provide backwards compat with v1 buckets? It seems like having bucket mismatches is what this issue is all about.

If a subset of artedi consumers moved right from v1 to v2 buckets (creating a guaranteed mismatch) without changing metric names/labels/units, maybe the problem wouldn't be any worse than it already is. I think that's something we could test.

@joshwilsdon
Contributor Author

FWIW: while searching for something else, I just found that another project had a similar problem because they did dynamic bucket creation (they removed support due to this issue), and one of the Prometheus developers chimed in confirming that this is not supported.

@joshwilsdon
Contributor Author

This is now a CR in Gerrit: https://cr.joyent.us/#/c/5024/

joyent-automation pushed a commit that referenced this pull request Nov 16, 2018
TRITON-928 node-artedi histograms should be compatible with Prometheus
Reviewed by: Kody A Kantor <kody@kkantor.com>
Approved by: Kody A Kantor <kody@kkantor.com>
@joshwilsdon
Contributor Author

Implemented with 54b21b7
