This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

Stored 'lastupdate' being approximate can cause inconsistent missing data #1979

Closed
shanson7 opened this issue May 17, 2021 · 6 comments · Fixed by #1983

@shanson7
Collaborator

Describe the bug

The cassandra idx has a configurable update-interval. The lastupdate value is only updated when it gets out of date by the configured update-interval (default 4h). There is an equivalent in bigtable (default 3h).

The bug is that when a Metrictank instance restarts it loads the lastupdate value from storage, which can be behind by as much as update-interval. For series that are not regularly published, this means there is a range of time for which the series will not be returned, despite having data, because the lastupdate value is imprecise.
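To illustrate the failure mode, here is a minimal sketch (hypothetical names, not Metrictank's actual index code) of a lastUpdate-based query filter, showing how a stale stored value excludes a series that does have data in the queried range:

```go
package main

import "fmt"

// seriesMatches is a hypothetical stand-in for the index's query-time
// check: a series is considered live for a query if its lastUpdate is at
// or after the query's "from" timestamp.
func seriesMatches(lastUpdate, from int64) bool {
	return lastUpdate >= from
}

func main() {
	// The series really received its last point at t=10000, but the
	// persisted lastupdate was last written at t=7000 (it can lag by up
	// to update-interval). After a restart, a query with from=8000
	// wrongly excludes the series.
	fmt.Println(seriesMatches(7000, 8000))  // false: data exists but is missed
	fmt.Println(seriesMatches(10000, 8000)) // true with an accurate in-memory value
}
```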

Possible Solutions

1. Reduce update-interval

This was our first step. It's effective at reducing the likelihood of the issue, but reducing the interval too far adds a decent amount of load on the Metrictank instances and cassandra. So this workaround has diminishing returns.

2. Periodically persist series in the index that haven't been persisted in a while and whose lastupdate is out of date.

This seems like a reasonable solution since it would reduce the window during which lastupdate is out-of-date in the database, but any instance that restarts while it's out of date will not pick up changes made afterward. Also, an instance that shuts down without updating lastupdate has no chance to update it later.

3. Add update-interval seconds to lastupdate on startup.

This solution seems the most sound to me. If the update interval is 60 minutes, add up to 3600 seconds to the lastupdate we get from the database (not going beyond the current time), to err on the side of including series that might not have data rather than excluding series that might have data.
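A minimal sketch of option 3 (hypothetical function, not the actual Metrictank code): bump the stored value by update-interval on index load, clamped to the current time:

```go
package main

import "fmt"

// adjustLastUpdate sketches option 3: the persisted value may lag the true
// last update by up to updateInterval seconds, so bump it by that amount
// when loading the index, never going beyond the current time.
func adjustLastUpdate(stored, updateInterval, now int64) int64 {
	adjusted := stored + updateInterval
	if adjusted > now {
		return now
	}
	return adjusted
}

func main() {
	now := int64(100000)
	fmt.Println(adjustLastUpdate(90000, 3600, now)) // 93600: full bump applied
	fmt.Println(adjustLastUpdate(99000, 3600, now)) // 100000: clamped to now
}
```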

Helpful Information
Metrictank Version: latest
Golang Version: 1.12
OS: RHEL

@shanson7 shanson7 added the bug label May 17, 2021
@Dieterbe
Contributor

Dieterbe commented May 18, 2021

I thought this exact issue had come up before, but I couldn't find an issue or PR.
I could only find some related issues (e.g. #550, #1532).

sidenote: somewhat confusingly, LastSave is the property that governs writes to the persistent index (subject to "update-interval"), and it causes the full metricdef to be written, but yeah, I can see how this would be a problem for the lastUpdate field specifically.

Option 3 seems like the "obviously correct" solution to me. In fact, it essentially "extends" the exact same solution of #1532, taking the "extra allowance" used at query time and enabling it at index loading time. Actually, for the former fix to function correctly, it requires the latter fix (shortly after startup, anyway).
I'm not too worried about this causing a lot more index entries to be loaded, as the update-interval is typically a low value (a couple of hours), much shorter than the pruning durations used in index-rules.conf.

So yes, let's do number 3. Should be a very easy fix, I think.

@shanson7
Collaborator Author

actually for the former fix to function correctly, it requires the latter fix (shortly after startup, anyway)

Not sure I understand this bit. It seems to me that Option 3 would obviate the need for #1532 (and would support the large number of Find style operations that have been added since). So implementing Option 3 would likely mean undoing the change in #1532.

@Dieterbe
Contributor

Dieterbe commented May 18, 2021

#1532 makes sure that e.g. a metric with lastUpdate=x-3600 is included in the response to a query with from: x, to: y, if update-interval is 2*3600 (because if the MT instance started after the last point was sent for a metric, the metric may have been updated after x-3600 but the lastUpdate would not have been saved to the persistent index). If the MT instance is not "fresh" and the metric is still alive, iow if it has ingested points for the metric with newer timestamp values, then the lastUpdate field in the memory index is perfectly accurate, and the #1532 tweak is needlessly "eager".
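The #1532 query-time allowance described above can be sketched like this (hypothetical function name; the real check lives in the memory index code):

```go
package main

import "fmt"

// matchesWithAllowance sketches the #1532 tweak: at query time, treat a
// series as live if lastUpdate plus the update-interval allowance reaches
// the query's "from" boundary.
func matchesWithAllowance(lastUpdate, updateInterval, from int64) bool {
	return lastUpdate+updateInterval >= from
}

func main() {
	x := int64(1000000)
	// lastUpdate = x-3600 with update-interval = 2*3600: the allowance
	// keeps the series in a from=x query even though the stored value lags.
	fmt.Println(matchesWithAllowance(x-3600, 2*3600, x)) // true
	// Without the allowance the same series would be dropped.
	fmt.Println(x-3600 >= x) // false
}
```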

I think I understand now what you meant with option 3. The above assumes that the lastUpdate field is retained from cassandra/bigtable when loading into memory. But you're suggesting to increment the live, in memory lastUpdate property by update-interval. I think this will work indeed, and involves undoing #1532

The main concern I have is whether there's an edge case where repeated restart cycles will cause the lastUpdate to keep increasing. That would require writing the "adjusted lastUpdate" to the persistent index, but I have gone through the AddOrUpdate code and it seems this wouldn't happen, precisely because we always wait update-interval before saving any update.
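Why repeated restarts cannot compound the bump can be sketched as follows (hedged: shouldSave is a simplified stand-in for the AddOrUpdate save gating, not the actual code). The adjusted value lives only in memory; a persistent-index write happens only once update-interval has elapsed since the last save, driven by freshly ingested points:

```go
package main

import "fmt"

// shouldSave is a simplified stand-in for the save gating in AddOrUpdate:
// a series is only persisted again once update-interval has elapsed since
// its last save. The startup bump adjusts only the in-memory lastUpdate
// and never triggers a write by itself, so restart cycles cannot keep
// inflating the stored value.
func shouldSave(now, lastSave, updateInterval int64) bool {
	return now-lastSave >= updateInterval
}

func main() {
	fmt.Println(shouldSave(1000, 0, 3600)) // false: too soon to persist again
	fmt.Println(shouldSave(4000, 0, 3600)) // true: interval has elapsed
}
```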

@Dieterbe
Contributor

and would support the large number of Find style operations that have been added since

TagQuery, basically?

@shanson7
Collaborator Author

shanson7 commented Jun 2, 2021

The main concern I have is whether there's an edge case where repeated restart cycles will cause the lastUpdate to keep increasing. That would require writing the "adjusted lastUpdate" to the persistent index, but I have gone through the AddOrUpdate code and it seems this wouldn't happen, precisely because we always wait update-interval before saving any update.

Right, we only write updates when updates are received, so I think we should be safe from this scenario.

TagQuery, basically?

Yeah, the various functions that take a TagQuery.

@Dieterbe
Contributor

Dieterbe commented Jun 3, 2021

I think I understand now what you meant with option 3. The above assumes that the lastUpdate field is retained from cassandra/bigtable when loading into memory. But you're suggesting to increment the live, in memory lastUpdate property by update-interval. I think this will work indeed, and involves undoing #1532

The upside of this is that we also remove the "needless eagerness" of #1532, by only adjusting lastUpdate when it makes sense. In the average case (with an MT instance being up for a while) lastUpdate will always be precise (because the in-memory index always updates its lastUpdate values), whereas #1532 needlessly bumped the values.

Let me whip up a PR for you...

Dieterbe added a commit that referenced this issue Jun 3, 2021
(instead of as in the previous fix, which did it at query time for Find(),
see #1532)

This has two main benefits:
1) by making the change to the data, it works equally well across all
   types of queries, in particular this fixes the behavior for TagQuery
2) we no longer over-eagerly adjust the check at query time (if MT has
   seen a new point for a given metric - e.g. if the process has been up for
   a while - then the LastUpdate value in the memory index is perfectly
   accurate, and we don't need to make any adjustment)

fix #1979