
Fix issue 57 - errors on VCL reload because of duplicated counters #70

Merged
merged 2 commits into jonnenauha:master on Jan 27, 2021

Conversation

LorenzoPeri
Contributor

Fixes #57

Issue
The exporter fails when VCL is reloaded.
This happens because the Varnish backend counters (VBE) are repeated for each reload, with the reload timestamp embedded in the counter name:
Example: VBE.boot.default.happy and VBE.reload_20210114_155148_19902.default.happy
... thus generating duplicated metrics.

The -p vcl_cooldown=1 workaround is not good enough, because old counters are still present in varnishstat for some time (granularity is 30s, see https://varnish-cache.org/docs/6.5/reference/varnishd.html#vcl-cooldown).

The result is that on each reload the exporter keeps failing for some time (1m in my tests with -p vcl_cooldown=1, 10m without it).

Solution
I propose to consider only the counters with the most recent timestamp.
I quickly calculate the most recent timestamp by checking the ".happy" counters, then I filter out the outdated counters.
Should work for any Varnish version.
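
For context, a minimal sketch of why picking the most recent timestamp works with plain string comparison: the reload prefix appears to embed a sortable YYYYMMDD_HHMMSS timestamp, so newer prefixes compare greater lexicographically. The second prefix below is hypothetical, not taken from real stats.

package main

import "fmt"

func main() {
	older := "VBE.reload_20210114_155148_19902" // from the example above
	newer := "VBE.reload_20210115_093012_20417" // hypothetical later reload
	fmt.Println(newer > older)                  // true: the newer reload prefix sorts last
}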

Tests included:

  • unit test for the 2 new methods
  • I duplicated VBE metrics in the 6.5.1.json file with 2 different timestamps to simulate 2 consecutive VCL reloads. Existing tests would fail without the fix.

@jonnenauha (Owner) left a comment

Question: Did I understand right that the bug is that, over time, the stats contain these reload_-prefixed entries? They do go away after the cooldown, but while they exist they break the whole metrics scrape and export?

But if the names are unique thanks to the timestamp, why is Prometheus giving an error? Does our code somehow strip the timestamp out of the metric identifier? And do we then end up with multiple metrics with the same name/labels, which Prometheus then rejects?

Good code in general. Left some suggestions and I think one bug (though I assume it's possible to have only one unique reload_ entry; you might know better).

varnish.go Outdated
@@ -15,6 +15,11 @@ import (
"github.com/prometheus/client_golang/prometheus"
)

const (
vbeReload = "VBE.reload_"
vbeReloadLength = len(vbeReload)
Owner

You can do this inline in findMostRecentVbeReloadPrefix, no need to make these global constants.

Contributor Author

Ok, it was maybe an excessive attempt at optimization on my side. :D
Done: feb6172

varnish.go Outdated
Comment on lines 230 to 241
	var mostRecentVbeReloadPrefix string
	for vName := range countersJSON {
		// Checking only the required ".happy" stat
		if strings.HasPrefix(vName, vbeReload) && strings.HasSuffix(vName, ".happy") {
			dotAfterPrefixIndex := vbeReloadLength + strings.Index(vName[vbeReloadLength:], ".")
			vbeReloadPrefix := vName[:dotAfterPrefixIndex]
			if strings.Compare(vbeReloadPrefix, mostRecentVbeReloadPrefix) > 0 {
				mostRecentVbeReloadPrefix = vbeReloadPrefix
			}
		}
	}
	return mostRecentVbeReloadPrefix
Owner

If there is only one set of reload_ entries, this function will return that only set's timestamp, and that will then be (unintentionally?) ignored in ScrapeVarnishFrom. Did you intend this? I assume letting the single set go through is ok, if it's ok to let one of many through as well.

Suggested change

Current:

	var mostRecentVbeReloadPrefix string
	for vName := range countersJSON {
		// Checking only the required ".happy" stat
		if strings.HasPrefix(vName, vbeReload) && strings.HasSuffix(vName, ".happy") {
			dotAfterPrefixIndex := vbeReloadLength + strings.Index(vName[vbeReloadLength:], ".")
			vbeReloadPrefix := vName[:dotAfterPrefixIndex]
			if strings.Compare(vbeReloadPrefix, mostRecentVbeReloadPrefix) > 0 {
				mostRecentVbeReloadPrefix = vbeReloadPrefix
			}
		}
	}
	return mostRecentVbeReloadPrefix

Suggested:

	mostRecentVbeReloadPrefix := ""
	numReloadedPrefix := 0
	for vName := range countersJSON {
		// Checking only the required ".happy" stat
		if strings.HasPrefix(vName, vbeReload) && strings.HasSuffix(vName, ".happy") {
			numReloadedPrefix++
			dotAfterPrefixIndex := vbeReloadLength + strings.Index(vName[vbeReloadLength:], ".")
			vbeReloadPrefix := vName[:dotAfterPrefixIndex]
			if strings.Compare(vbeReloadPrefix, mostRecentVbeReloadPrefix) > 0 {
				mostRecentVbeReloadPrefix = vbeReloadPrefix
			}
		}
	}
	if numReloadedPrefix <= 1 {
		return ""
	}
	return mostRecentVbeReloadPrefix

Contributor Author

If there is only one set of reload_ entries, this function will return that only set's timestamp

Correct

and that will then be (unintentionally?) ignored in ScrapeVarnishFrom.

I don't think this is true. ScrapeVarnishFrom will use isOutdatedVbe to determine that the VBE counters are not, in fact, outdated, so they won't be skipped (this should be covered by the current VCL version test case of Test_IsOutdatedVbe).
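
(isOutdatedVbe itself isn't quoted in this thread; a minimal sketch of the check being described, with an assumed signature that may differ from the PR's actual code:)

package varnish // hypothetical package name for this sketch

import "strings"

// isOutdatedVbe, as described here: a VBE counter is outdated only when a
// reload_ set exists and the counter does not belong to the most recent one.
func isOutdatedVbe(vName, mostRecentVbeReloadPrefix string) bool {
	if mostRecentVbeReloadPrefix == "" {
		return false // no reload_ counters seen, nothing to filter out
	}
	return strings.HasPrefix(vName, "VBE.") &&
		!strings.HasPrefix(vName, mostRecentVbeReloadPrefix)
}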

Let me know if I missed something!

@LorenzoPeri
Contributor Author

Question: Did I understand right that the bug is that, over time, the stats contain these reload_-prefixed entries? They do go away after the cooldown, but while they exist they break the whole metrics scrape and export?

Correct.

Varnish starts with just the VBE.boot counters.
On each Varnish reload, a new set of VBE.reload_[timestamp] counters is added.
The old counters will disappear after the cooldown.

For example, if you reload Varnish many times in a row, you could end up with many concurrent sets of VBE.reload_ counters, but after the cooldown only the most recent will be present (note that the original VBE.boot set will also be gone; it behaves like the others).

The only way to expose all the counters at the same time would probably be to add the timestamp as an additional label on the Prometheus metrics. This way the conflict would be avoided, but a new time series would be created in Prometheus on each Varnish reload. IMHO that would not be good memory-wise for Prometheus, and it would also be harder to extract useful metrics (e.g. how would you ignore the outdated ones in queries?).

I think it's better to just expose the current active values.

But if the names are unique thanks to the timestamp, why is Prometheus giving an error? Does our code somehow strip the timestamp out of the metric identifier? And do we then end up with multiple metrics with the same name/labels, which Prometheus then rejects?

Also correct. The timestamp is not part of the exported metrics (name or labels), so there are naming conflicts.
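
(A quick way to see this failure mode, illustrative code only, not the exporter's: if a collector emits two metrics with the same name and label values, the whole gather, and therefore the scrape, fails.)

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// dupCollector mimics what happens when the reload_ timestamp is stripped
// from counter names: two different varnishstat counters end up as the same
// Prometheus metric.
type dupCollector struct{}

func (dupCollector) Describe(chan<- *prometheus.Desc) {}

func (dupCollector) Collect(ch chan<- prometheus.Metric) {
	desc := prometheus.NewDesc("varnish_backend_happy", "demo metric", []string{"backend"}, nil)
	// e.g. VBE.boot.default.happy and VBE.reload_20210114_155148_19902.default.happy
	ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, 1, "default")
	ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, 1, "default")
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(dupCollector{})
	_, err := reg.Gather()
	// err reports the metric "was collected before with the same name and
	// label values", so the export fails.
	fmt.Println(err)
}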

Good code in general. Left some suggestions and I think one bug (though I assume it's possible to have only one unique reload_ entry; you might know better).

Thanks!

@jonnenauha jonnenauha merged commit 86fc1b0 into jonnenauha:master Jan 27, 2021
@zipkid

zipkid commented Nov 23, 2021

Is it possible to create a release containing this fix?
