
Fix issue 57 - errors on VCL reload because of duplicated counters #70

Merged
merged 2 commits into jonnenauha:master on Jan 27, 2021

Conversation

LorenzoPeri
Contributor

Fixes #57

Issue
The exporter fails when VCL is reloaded.
This happens because the Varnish backend counters (VBE) are repeated for each reload, with the reload timestamp embedded in the counter name:
Example: VBE.boot.default.happy and VBE.reload_20210114_155148_19902.default.happy
... thus generating duplicated metrics.

The -p vcl_cooldown=1 workaround is not good enough, because old counters are still present in varnishstat for some time (granularity is 30s, see https://varnish-cache.org/docs/6.5/reference/varnishd.html#vcl-cooldown).

The result is that on each reload the exporter keeps failing for some time (1m in my tests with -p vcl_cooldown=1, 10m without it).

Solution
I propose to consider only the counters with the most recent timestamp.
I quickly calculate the most recent timestamp by checking the ".happy" counters, then I filter out the outdated counters.
Should work for any Varnish version.
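
For context, a minimal sketch of why picking the most recent timestamp works with plain string comparison: the reload prefix appears to embed a sortable YYYYMMDD_HHMMSS timestamp, so newer prefixes compare greater lexicographically. The second prefix below is hypothetical, not taken from real stats.

package main

import "fmt"

func main() {
	older := "VBE.reload_20210114_155148_19902" // from the example above
	newer := "VBE.reload_20210115_093012_20417" // hypothetical later reload
	fmt.Println(newer > older)                  // true: the newer reload prefix sorts last
}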

Tests included:

  • unit test for the 2 new methods
  • I duplicated VBE metrics in the 6.5.1.json file with 2 different timestamps to simulate 2 consecutive VCL reloads. Existing tests would fail without the fix.

@jonnenauha (Owner) left a comment

Question: Did I understand right that the bug is that, over time, the stats contain these reload_-prefixed entries? They do go away after the cooldown, but while they exist they break the whole metrics scrape and export?

But if the names are unique thanks to the timestamp, why is Prometheus giving an error? Does our code somehow strip the timestamp out of the metric identifier? And do we then end up with multiple metrics with the same name/labels, which Prometheus then rejects?

Good code in general. Left some suggestions and I think one bug (though I assume it's possible to have only one unique reload_ entry; you might know better).

varnish.go Outdated
@@ -15,6 +15,11 @@ import (
"github.com/prometheus/client_golang/prometheus"
)

const (
vbeReload = "VBE.reload_"
vbeReloadLength = len(vbeReload)
Owner

You can do this inline in findMostRecentVbeReloadPrefix, no need to make these global constants.

Contributor Author

Ok, it was maybe an excessive attempt at optimization on my side. :D
Done: feb6172

varnish.go Outdated
Comment on lines 230 to 241
	var mostRecentVbeReloadPrefix string
	for vName := range countersJSON {
		// Checking only the required ".happy" stat
		if strings.HasPrefix(vName, vbeReload) && strings.HasSuffix(vName, ".happy") {
			dotAfterPrefixIndex := vbeReloadLength + strings.Index(vName[vbeReloadLength:], ".")
			vbeReloadPrefix := vName[:dotAfterPrefixIndex]
			if strings.Compare(vbeReloadPrefix, mostRecentVbeReloadPrefix) > 0 {
				mostRecentVbeReloadPrefix = vbeReloadPrefix
			}
		}
	}
	return mostRecentVbeReloadPrefix
Owner

If there is only one set of reload_ entries, this function will return that only set's timestamp, and that will then be (unintentionally?) ignored in ScrapeVarnishFrom. Did you intend this? I assume letting the single set go through is ok, if it's ok to let one of many through as well.

Suggested change

Current:

	var mostRecentVbeReloadPrefix string
	for vName := range countersJSON {
		// Checking only the required ".happy" stat
		if strings.HasPrefix(vName, vbeReload) && strings.HasSuffix(vName, ".happy") {
			dotAfterPrefixIndex := vbeReloadLength + strings.Index(vName[vbeReloadLength:], ".")
			vbeReloadPrefix := vName[:dotAfterPrefixIndex]
			if strings.Compare(vbeReloadPrefix, mostRecentVbeReloadPrefix) > 0 {
				mostRecentVbeReloadPrefix = vbeReloadPrefix
			}
		}
	}
	return mostRecentVbeReloadPrefix

Suggested:

	mostRecentVbeReloadPrefix := ""
	numReloadedPrefix := 0
	for vName := range countersJSON {
		// Checking only the required ".happy" stat
		if strings.HasPrefix(vName, vbeReload) && strings.HasSuffix(vName, ".happy") {
			numReloadedPrefix++
			dotAfterPrefixIndex := vbeReloadLength + strings.Index(vName[vbeReloadLength:], ".")
			vbeReloadPrefix := vName[:dotAfterPrefixIndex]
			if strings.Compare(vbeReloadPrefix, mostRecentVbeReloadPrefix) > 0 {
				mostRecentVbeReloadPrefix = vbeReloadPrefix
			}
		}
	}
	if numReloadedPrefix <= 1 {
		return ""
	}
	return mostRecentVbeReloadPrefix

Contributor Author

If there is only one set of reload_ entries, this function will return that only set's timestamp

Correct

and that will then be (unintentionally?) ignored in ScrapeVarnishFrom.

I don't think this is true. ScrapeVarnishFrom will use isOutdatedVbe to determine that the VBE counters are not, in fact, outdated, so they won't be skipped (this should be covered by the current VCL version test case of Test_IsOutdatedVbe).
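
(isOutdatedVbe itself isn't quoted in this thread; a minimal sketch of the check being described, with an assumed signature that may differ from the PR's actual code:)

package varnish // hypothetical package name for this sketch

import "strings"

// isOutdatedVbe, as described here: a VBE counter is outdated only when a
// reload_ set exists and the counter does not belong to the most recent one.
func isOutdatedVbe(vName, mostRecentVbeReloadPrefix string) bool {
	if mostRecentVbeReloadPrefix == "" {
		return false // no reload_ counters seen, nothing to filter out
	}
	return strings.HasPrefix(vName, "VBE.") &&
		!strings.HasPrefix(vName, mostRecentVbeReloadPrefix)
}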

Let me know if I missed something!

@LorenzoPeri
Contributor Author

Question: Did I understand right that the bug is that, over time, the stats contain these reload_-prefixed entries? They do go away after the cooldown, but while they exist they break the whole metrics scrape and export?

Correct.

Varnish starts with just the VBE.boot counters.
On each Varnish reload, a new set of VBE.reload_[timestamp] counters is added.
The old counters will disappear after the cooldown.

For example, if you reload Varnish many times in a row, you could end up with many concurrent sets of VBE.reload_ counters, but after the cooldown only the most recent will be present (note that the original VBE.boot set will also be gone; it behaves like the others).

The only way to expose all the counters at the same time would probably be to add the timestamp as an additional label on the Prometheus metrics. This way the conflict would be avoided, but a new time series would be created in Prometheus on each Varnish reload. IMHO that would not be good memory-wise for Prometheus, and it would also be harder to extract useful metrics (e.g. how would you ignore the outdated ones in queries?).

I think it's better to just expose the current active values.

But if the names are unique thanks to the timestamp, why is Prometheus giving an error? Does our code somehow strip the timestamp out of the metric identifier? And do we then end up with multiple metrics with the same name/labels, which Prometheus then rejects?

Also correct. The timestamp is not part of the exported metrics (name or labels), so there are naming conflicts.
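
(A quick way to see this failure mode, illustrative code only, not the exporter's: if a collector emits two metrics with the same name and label values, the whole gather, and therefore the scrape, fails.)

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// dupCollector mimics what happens when the reload_ timestamp is stripped
// from counter names: two different varnishstat counters end up as the same
// Prometheus metric.
type dupCollector struct{}

func (dupCollector) Describe(chan<- *prometheus.Desc) {}

func (dupCollector) Collect(ch chan<- prometheus.Metric) {
	desc := prometheus.NewDesc("varnish_backend_happy", "demo metric", []string{"backend"}, nil)
	// e.g. VBE.boot.default.happy and VBE.reload_20210114_155148_19902.default.happy
	ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, 1, "default")
	ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, 1, "default")
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(dupCollector{})
	_, err := reg.Gather()
	// err reports the metric "was collected before with the same name and
	// label values", so the export fails.
	fmt.Println(err)
}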

Good code in general. Left some suggestions and I think one bug (though I assume it's possible to have only one unique reload_ entry; you might know better).

Thanks!

@jonnenauha jonnenauha merged commit 86fc1b0 into jonnenauha:master Jan 27, 2021
@zipkid

zipkid commented Nov 23, 2021

Is it possible to create a release containing this fix?
