This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

Stored 'lastupdate' being approximate can cause inconsistent missing data #1979

Closed
shanson7 opened this issue May 17, 2021 · 6 comments · Fixed by #1983

@shanson7
Collaborator

Describe the bug

The cassandra idx has a configurable update-interval. The lastupdate value is only updated when it gets out of date by the configured update-interval (default 4h). There is an equivalent in bigtable (default 3h).

The bug is that when a Metrictank instance restarts it loads the lastupdate value from storage, which can be behind by as much as update-interval. For series that are not regularly published, this means there is a range of time for which the series will not be returned, despite having data, because the lastupdate value is imprecise.
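To illustrate the failure mode, here is a minimal sketch (hypothetical names, not Metrictank's actual index code) of a lastUpdate-based query filter, showing how a stale stored value excludes a series that does have data in the queried range:

```go
package main

import "fmt"

// seriesMatches is a hypothetical stand-in for the index's query-time
// check: a series is considered live for a query if its lastUpdate is at
// or after the query's "from" timestamp.
func seriesMatches(lastUpdate, from int64) bool {
	return lastUpdate >= from
}

func main() {
	// The series really received its last point at t=10000, but the
	// persisted lastupdate was last written at t=7000 (it can lag by up
	// to update-interval). After a restart, a query with from=8000
	// wrongly excludes the series.
	fmt.Println(seriesMatches(7000, 8000))  // false: data exists but is missed
	fmt.Println(seriesMatches(10000, 8000)) // true with an accurate in-memory value
}
```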

Possible Solutions

1. Reduce update-interval

This was our first step. It's effective at reducing the likelihood of the issue, but reducing the interval too far adds a decent amount of load on the Metrictank instances and cassandra. So this workaround has diminishing returns.

2. Periodically persist series in the index that haven't been persisted in a while and whose lastupdate is out of date.

This seems like a reasonable solution since it would reduce the window during which lastupdate is out-of-date in the database, but any instance that restarts while it's out of date will not pick up changes made afterward. Also, an instance that shuts down without updating lastupdate has no chance to update it later.

3. Add update-interval seconds to lastupdate on startup.

This solution seems the most sound to me. If the update interval is 60 minutes, add up to 3600 seconds to the lastupdate we get from the database (not going beyond the current time), to err on the side of including series that might not have data rather than excluding series that might have data.
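A minimal sketch of option 3 (hypothetical function, not the actual Metrictank code): bump the stored value by update-interval on index load, clamped to the current time:

```go
package main

import "fmt"

// adjustLastUpdate sketches option 3: the persisted value may lag the true
// last update by up to updateInterval seconds, so bump it by that amount
// when loading the index, never going beyond the current time.
func adjustLastUpdate(stored, updateInterval, now int64) int64 {
	adjusted := stored + updateInterval
	if adjusted > now {
		return now
	}
	return adjusted
}

func main() {
	now := int64(100000)
	fmt.Println(adjustLastUpdate(90000, 3600, now)) // 93600: full bump applied
	fmt.Println(adjustLastUpdate(99000, 3600, now)) // 100000: clamped to now
}
```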

Helpful Information
Metrictank Version: latest
Golang Version: 1.12
OS: RHEL

@shanson7 shanson7 added the bug label May 17, 2021
@Dieterbe
Contributor

Dieterbe commented May 18, 2021

I thought this exact issue had come up before, but I couldn't find an issue or PR.
I could only find some related issues (e.g. #550, #1532).

sidenote: somewhat confusingly, LastSave is the property that governs writes to the persistent index (subject to "update-interval"), and it causes the full metricdef to be written, but yeah, I can see how this would be a problem for the lastUpdate field specifically.

Option 3 seems like the "obviously correct" solution to me. In fact, it essentially "extends" the exact same solution of #1532, taking the "extra allowance" used at query time and enabling it at index loading time. Actually, for the former fix to function correctly, it requires the latter fix (shortly after startup, anyway).
I'm not too worried about this causing a lot more index entries to be loaded, as the update-interval is typically a low value (a couple of hours), much shorter than the pruning durations used in index-rules.conf.

So yes, let's do number 3. Should be a very easy fix, I think.

@shanson7
Collaborator Author

actually for the former fix to function correctly, it requires the latter fix (shortly after startup, anyway)

Not sure I understand this bit. It seems to me that Option 3 would obviate the need for #1532 (and would support the large number of Find style operations that have been added since). So implementing Option 3 would likely mean undoing the change in #1532.

@Dieterbe
Contributor

Dieterbe commented May 18, 2021

#1532 makes sure that e.g. a metric with lastUpdate=x-3600 is included in the response to a query with from: x, to: y, if update-interval is 2*3600 (because if the MT instance started after the last point was sent for a metric, the metric may have been updated after x-3600 but the lastUpdate would not have been saved to the persistent index). If the MT instance is not "fresh" and the metric is still alive, iow if it has ingested points for the metric with newer timestamp values, then the lastUpdate field in the memory index is perfectly accurate, and the #1532 tweak is needlessly "eager".
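The #1532 query-time allowance described above can be sketched like this (hypothetical function name; the real check lives in the memory index code):

```go
package main

import "fmt"

// matchesWithAllowance sketches the #1532 tweak: at query time, treat a
// series as live if lastUpdate plus the update-interval allowance reaches
// the query's "from" boundary.
func matchesWithAllowance(lastUpdate, updateInterval, from int64) bool {
	return lastUpdate+updateInterval >= from
}

func main() {
	x := int64(1000000)
	// lastUpdate = x-3600 with update-interval = 2*3600: the allowance
	// keeps the series in a from=x query even though the stored value lags.
	fmt.Println(matchesWithAllowance(x-3600, 2*3600, x)) // true
	// Without the allowance the same series would be dropped.
	fmt.Println(x-3600 >= x) // false
}
```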

I think I understand now what you meant with option 3. The above assumes that the lastUpdate field is retained from cassandra/bigtable when loading into memory. But you're suggesting to increment the live, in memory lastUpdate property by update-interval. I think this will work indeed, and involves undoing #1532

The main concern I have is whether there's an edge case where repeated restart cycles will cause the lastUpdate to keep increasing. That would require writing the "adjusted lastUpdate" to the persistent index, but I have gone through the AddOrUpdate code and it seems this wouldn't happen, precisely because we always wait update-interval before saving any update.
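Why repeated restarts cannot compound the bump can be sketched as follows (hedged: shouldSave is a simplified stand-in for the AddOrUpdate save gating, not the actual code). The adjusted value lives only in memory; a persistent-index write happens only once update-interval has elapsed since the last save, driven by freshly ingested points:

```go
package main

import "fmt"

// shouldSave is a simplified stand-in for the save gating in AddOrUpdate:
// a series is only persisted again once update-interval has elapsed since
// its last save. The startup bump adjusts only the in-memory lastUpdate
// and never triggers a write by itself, so restart cycles cannot keep
// inflating the stored value.
func shouldSave(now, lastSave, updateInterval int64) bool {
	return now-lastSave >= updateInterval
}

func main() {
	fmt.Println(shouldSave(1000, 0, 3600)) // false: too soon to persist again
	fmt.Println(shouldSave(4000, 0, 3600)) // true: interval has elapsed
}
```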

@Dieterbe
Contributor

and would support the large number of Find style operations that have been added since

TagQuery, basically?

@shanson7
Collaborator Author

shanson7 commented Jun 2, 2021

The main concern I have is whether there's an edge case where repeated restart cycles will cause the lastUpdate to keep increasing. That would require writing the "adjusted lastUpdate" to the persistent index, but I have gone through the AddOrUpdate code and it seems this wouldn't happen, precisely because we always wait update-interval before saving any update.

Right, we only write updates when updates are received, so I think we should be safe from this scenario.

TagQuery, basically?

Yeah, the various functions that take a TagQuery.

@Dieterbe
Contributor

Dieterbe commented Jun 3, 2021

I think I understand now what you meant with option 3. The above assumes that the lastUpdate field is retained from cassandra/bigtable when loading into memory. But you're suggesting to increment the live, in memory lastUpdate property by update-interval. I think this will work indeed, and involves undoing #1532

The upside of this is that we also remove the "needless eagerness" of #1532, by only adjusting lastUpdate when it makes sense. In the average case (with an MT instance being up for a while) lastUpdate will always be precise (because the in-memory index always updates its lastUpdate values), whereas #1532 needlessly bumped the values.

Let me whip up a PR for you...

Dieterbe added a commit that referenced this issue Jun 3, 2021
(instead of as in the previous fix, which did it at query time for Find(),
see #1532)

This has two main benefits:
1) by making the change to the data, it works equally well across all
   types of queries, in particular this fixes the behavior for TagQuery
2) we no longer over-eagerly adjust the check at query time (if MT has
   seen a new point for a given metric - e.g. if the process has been up for
   a while - then the LastUpdate value in the memory index is perfectly
   accurate, and we don't need to make any adjustment)

fix #1979