
tsm1 storage engine reports strange spikes in values #4357

Closed
dswarbrick opened this issue Oct 7, 2015 · 10 comments · Fixed by #4362
@dswarbrick

Using collectd to feed perpetually increasing, counter-type values (e.g. eth0 rx/tx packet counters) into InfluxDB with the tsm1 engine occasionally produces strange outliers:

SELECT "value" FROM "iolatency_read" WHERE "host" = 'foo.example.com' AND "instance" = 'md400' AND "type" = 'req_latency' AND "type_instance" = 'lt_8' AND time > now() - 10m

time                value
1444231716020334000 3.73705375e+08
1444231736022761000 3.73705896e+08
1444231756020388000 3.73706677e+08
1444231776016345000 3.73707002e+08
1444231796016225000 3.73708699e+08
1444231816019400000 3.73708855e+08
1444231836020839000 -8.551788012208318e-238
1444231856020667000 3.73709352e+08
1444231876022680000 3.737098e+08
1444231896014367000 3.7371048e+08
1444231916020391000 3.73711277e+08
1444231936020522000 3.73711405e+08

This in turn causes derivative(mean("value"), 1s) to go bananas, and results in unusable Grafana graphs.

Nothing has changed on the sending (collectd) side - only the storage engine in InfluxDB.
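
For illustration only (this is not InfluxDB or collectd code), here is a minimal Go sketch that computes the raw rate of change across three of the points above. The single corrupted sample swings the rate by roughly ±1.9e7 per second, while the true rate between healthy points is under a hundred per second, which is why the graphs become unusable:

package main

import "fmt"

func main() {
	type point struct {
		ts  int64   // nanosecond timestamp, copied from the query output above
		val float64 // stored field value
	}
	pts := []point{
		{1444231816019400000, 3.73708855e+08},
		{1444231836020839000, -8.551788012208318e-238}, // the corrupted sample
		{1444231856020667000, 3.73709352e+08},
	}
	// Simple rate of change between consecutive points, analogous to what
	// derivative(...) computes per interval.
	for i := 1; i < len(pts); i++ {
		dtSec := float64(pts[i].ts-pts[i-1].ts) / 1e9
		rate := (pts[i].val - pts[i-1].val) / dtSec
		fmt.Printf("rate at %d: %g/s\n", pts[i].ts, rate)
	}
}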

@jwilder
Contributor

jwilder commented Oct 7, 2015

Are you using the native collectd plugin?

@jwilder
Contributor

jwilder commented Oct 7, 2015

I'm able to reproduce some odd values using the native collectd plugin.

@jwilder jwilder self-assigned this Oct 7, 2015
@dswarbrick
Author

@jwilder Yes, I'm using the native collectd plugin, just as before, when running 0.94.2.

@jmcook

jmcook commented Oct 7, 2015

I'm seeing this even with telegraf data, specifically in the mem_available_percent measurement.

@jwilder
Contributor

jwilder commented Oct 7, 2015

I've verified that there is a bug in the float encoding when similar values are recorded. Working on a fix.

@jmcook

jmcook commented Oct 7, 2015

Sweet, thanks Jason.

@johnl

johnl commented Oct 7, 2015

@dgryski
Contributor

dgryski commented Oct 7, 2015

You're going to want to pull in dgryski/go-tsz@918e888 too.

@jwilder
Contributor

jwilder commented Oct 7, 2015

@dgryski Thanks. Did not see that other change.

jwilder added a commit that referenced this issue Oct 7, 2015
If similar float values were encoded, the number of leading bits would
overflow the 5 available bits to store them (e.g. store 33 in 5 bits).  When
decoding, the values after the overflowed value would spike to very large and
small values.

To prevent the overflow, we clamp the value to 31, which is the maximum
number of leading zero bits we can encode.

Fixes #4357
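
For readers following along, here is a minimal, self-contained Go sketch of the clamp described in the commit message, assuming a Gorilla-style XOR float encoder like the one in dgryski/go-tsz. The function and variable names are illustrative, not the actual tsm1 code:

package main

import (
	"fmt"
	"math"
	"math/bits"
)

// leadingZeroBits counts the leading zero bits in the XOR of two consecutive
// float64 values and clamps the count at 31 so it always fits the 5-bit
// field used by the encoding. Illustrative sketch only; the real fix lives
// in the tsm1 float encoder and in dgryski/go-tsz.
func leadingZeroBits(prev, cur float64) uint8 {
	xor := math.Float64bits(prev) ^ math.Float64bits(cur)
	lz := bits.LeadingZeros64(xor)
	if lz >= 32 {
		// A count such as 33 (0b100001) needs 6 bits; writing it into a
		// 5-bit field overflows and corrupts every value decoded after it.
		lz = 31
	}
	return uint8(lz)
}

func main() {
	// Two nearly identical values: their XOR has 63 leading zeros, which
	// would overflow the 5-bit field without the clamp.
	fmt.Println(leadingZeroBits(1.0, math.Nextafter(1, 2))) // prints 31
}

The trade-off of clamping is that a few leading zero bits may be stored explicitly as part of the meaningful bits, which costs a little space but decodes correctly.
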
jwilder added a commit that referenced this issue Oct 8, 2015
If similar float values were encoded, the number of leading bits would
overflow the 5 available bits to store them (e.g. store 33 in 5 bits).  When
decoding, the values after the overflowed value would spike to very large and
small values.

To prevent the overflow, we clamp the value to 31, which is the maximum
number of leading zero bits we can encode.

Fixes #4357
@johnl

johnl commented Oct 9, 2015

This looks fixed to me. I've not had one strange metric for interface_rx/if_packets since I deployed this code yesterday. Previously I'd had them fairly regularly! Thanks!
