Stored 'lastupdate' being approximate can cause inconsistent missing data #1979
Comments
I thought this exact issue had come up before, but I couldn't find an issue or PR. Side note: somewhat confusingly, LastSave is the property that governs writes to the persistent index (subject to "update-interval"), and it causes the full metricdef to be written, but yes, I can see how this would be a problem for the lastUpdate field specifically. Option 3 seems like the "obviously correct" solution to me. In fact, it essentially extends the solution of #1532, taking the "extra allowance" used at query time and enabling it at index loading time. Actually, for the former fix to function correctly, it requires the latter fix (shortly after startup, anyway). So yes, let's do number 3. It should be a very easy fix, I think.
Not sure I understand this bit. It seems to me that Option 3 would obviate the need for #1532 (and would support the large number of
#1532 makes sure that e.g. a metric with
I think I understand now what you meant with option 3. The above assumes that the lastUpdate field is retained from cassandra/bigtable when loading into memory. But you're suggesting to increment the live, in-memory lastUpdate property by update-interval. I think this will indeed work, and it involves undoing #1532. The main concern I have is whether there's an edge case where repeated restart cycles will cause the lastUpdate to keep increasing. That would require writing the "adjusted lastUpdate" to the persistent index, but I have gone through the AddOrUpdate code and it seems this wouldn't happen, precisely because we always wait update-interval before saving any update.
TagQuery, basically?
Right, we only write updates when updates are received, so I think we should be safe from this scenario.
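The restart-cycle concern discussed above can be illustrated with a small simulation. This is only a sketch under stated assumptions: `restartCycles` and its `writeBack` flag are hypothetical names made up for this example and are not part of Metrictank. It models a series that receives no new points across several restarts, and compares the case where the adjusted value stays in memory only with a hypothetical variant that writes the adjusted value back to the persistent index.

```go
package main

import "fmt"

// restartCycles simulates one restart per entry in nows (unix seconds) for a
// series that receives no new points. On each restart the in-memory lastUpdate
// is set to persisted+updateInterval, capped at the current time.
// Hypothetical helper for illustration only, not Metrictank code.
// If writeBack is true, the adjusted value is also stored as the new persisted
// value, which is the behavior the thread wants to rule out.
func restartCycles(persisted, updateInterval int64, nows []int64, writeBack bool) int64 {
	for _, now := range nows {
		inMemory := persisted + updateInterval
		if inMemory > now {
			inMemory = now
		}
		if writeBack {
			persisted = inMemory
		}
	}
	return persisted
}

func main() {
	nows := []int64{20000, 30000, 40000}
	// Adjusted value is kept in memory only: the persisted value never moves.
	fmt.Println(restartCycles(10000, 3600, nows, false)) // 10000
	// Adjusted value is written back: it drifts by updateInterval per restart.
	fmt.Println(restartCycles(10000, 3600, nows, true)) // 20800
}
```

As long as writes to the persistent index only happen when a point is actually received, the first case applies and the stored value cannot creep forward.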
Yeah, the various functions that take a TagQuery.
The upside of this is that we also remove the "needless eagerness" of #1532 by only adjusting lastUpdate when it makes sense. In the average case (with an MT instance being up for a while) lastUpdate will always be precise (because the in-memory index always updates its lastUpdate values), and #1532 needlessly bumped the values. Let me whip up a PR for you...
(instead of the previous fix, which did it at query time for Find(), see #1532)
This has two main benefits:
1) by making the change to the data, it works equally well across all types of queries; in particular, this fixes the behavior for TagQuery
2) we no longer over-eagerly adjust the check at query time (if MT has seen a new point for a given metric, e.g. if the process has been up for a while, then the LastUpdate value in the memory index is perfectly accurate, and we don't need to make any adjustment)
fix #1979
Describe the bug
The cassandra idx has a configurable `update-interval`. The `lastupdate` value is only updated when it gets out of date by the configured `update-interval` (default `4h`). There is an equivalent in bigtable (default `3h`).
The bug is that when a Metrictank instance restarts it will load in the `lastupdate` value from storage, which can be behind by as much as `update-interval`. For series that are not regularly published, this means that there is a range of time for which this series will not be returned, despite having data, because the `lastupdate` value is imprecise.
Possible Solutions
1. Reduce `update-interval`
This was our first step. It's effective at reducing the likelihood of the issue, but reducing it too much adds a decent amount of load onto Metrictank instances/cassandra, so this workaround has diminishing returns.
2. Periodically persist series in the index that haven't been persisted in a while and whose `lastupdate` is incorrect.
This seems like a reasonable solution since it would reduce the window during which `lastupdate` is out of date in the database, but any instances that restart while it's out of date will not pick up changes afterward. Also, instances shutting down without updating `lastupdate` would not be able to update it later.
3. Add `update-interval` seconds to `lastupdate` on startup.
This solution seems the most sound to me. If update interval is 60 minutes, add up to 3600 seconds to the `lastupdate` we get from the database (not going beyond the current time) to err on the side of including series that might not have data rather than excluding series that might have data.
Helpful Information
Metrictank Version: latest
Golang Version: 1.12
OS: RHEL
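To make option 3 concrete, here is a minimal Go sketch of the load-time adjustment and of how a stale value hides a series. The names `adjustLastUpdate` and `matches` are hypothetical and exist only for this example. The sketch also assumes that a query keeps a series when its `lastupdate` is at or after the query's start time, which may not match the real index code exactly.

```go
package main

import "fmt"

// adjustLastUpdate returns the lastupdate to keep in memory for a value loaded
// from the persistent index: stored + updateInterval, but never beyond now.
// All values are in seconds. Hypothetical helper, not Metrictank's API.
func adjustLastUpdate(stored, updateInterval, now int64) int64 {
	adjusted := stored + updateInterval
	if adjusted > now {
		return now
	}
	return adjusted
}

// matches reports whether a series would be returned for a query that starts
// at from, assuming series are pruned when lastUpdate is older than from.
func matches(lastUpdate, from int64) bool {
	return lastUpdate >= from
}

func main() {
	const updateInterval = 3600 // 60 minutes, as in the example above
	stored := int64(10000)      // lastupdate as read from the database
	now := int64(20000)         // time of the restart
	from := int64(11000)        // query start; the series has newer, unrecorded data

	// Without adjustment the series is excluded even though it may have data.
	fmt.Println(matches(stored, from)) // false

	// With the adjustment (10000 + 3600 = 13600) the series is included.
	fmt.Println(matches(adjustLastUpdate(stored, updateInterval, now), from)) // true
}
```

The adjustment deliberately errs toward inclusion: a series may be returned for a window in which it has no points, but a series that does have points is no longer dropped.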