Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix potential race with LabelPairs #511

Closed
wants to merge 2 commits into from

Conversation

douglas-reid
Copy link

@douglas-reid douglas-reid commented Dec 5, 2018

Add a copyLabelPairs method to prometheus/value.go and have populateMetric and histogram.Write use that method for serialization of labelPairs.

This is an attempt to address: istio/istio#10236

There may be better ways to address this, etc. This PR is merely meant to suggest one solution.

Before this PR, with the modifications to the unit test shown in this PR:

--- FAIL: TestHistogramVecRegisterGatherConcurrency (2.04s)
    registry_test.go:856: Gathering failed: labels in collected metric test_histogram label:<name:"eight" value:"alpha" > label:<name:"eleven" value:"gimel" > label:<name:"fifteen" value:"beta" > label:<name:"five" value:"aleph" > label:<name:"foo" value:"bar" > label:<name:"four" value:"gimel" > label:<name:"fourteen" value:"b" > label:<name:"nine" value:"a" > label:<name:"one" value:"aleph" > label:<name:"seven" value:"c" > label:<name:"six" value:"gimel" > label:<name:"sixteen" value:"gamma" > label:<name:"ten" value:"gimel" > label:<name:"thirteen" value:"alpha" > label:<name:"three" value:"alpha" > label:<name:"twelve" value:"aleph" > label:<name:"two" value:"gimel" > histogram:<sample_count:1 sample_sum:0.14491978781995843 bucket:<cumulative_count:0 upper_bound:0.005 > bucket:<cumulative_count:0 upper_bound:0.01 > bucket:<cumulative_count:0 upper_bound:0.025 > bucket:<cumulative_count:0 upper_bound:0.05 > bucket:<cumulative_count:0 upper_bound:0.1 > bucket:<cumulative_count:1 upper_bound:0.25 > bucket:<cumulative_count:1 upper_bound:0.5 > bucket:<cumulative_count:1 upper_bound:1 > bucket:<cumulative_count:1 upper_bound:2.5 > bucket:<cumulative_count:1 upper_bound:5 > bucket:<cumulative_count:1 upper_bound:10 > >  are inconsistent with descriptor Desc{fqName: "test_histogram", help: "This helps testing.", constLabels: {foo="bar"}, variableLabels: [one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen]}
    registry_test.go:856: Gathering failed: labels in collected metric test_histogram label:<name:"eight" value:"c" > label:<name:"eleven" value:"gamma" > label:<name:"fifteen" value:"a" > label:<name:"five" value:"b" > label:<name:"foo" value:"bar" > label:<name:"four" value:"aleph" > label:<name:"fourteen" value:"b" > label:<name:"nine" value:"gamma" > label:<name:"one" value:"gimel" > label:<name:"seven" value:"beth" > label:<name:"six" value:"c" > label:<name:"sixteen" value:"aleph" > label:<name:"ten" value:"alpha" > label:<name:"thirteen" value:"b" > label:<name:"three" value:"beta" > label:<name:"twelve" value:"alpha" > label:<name:"two" value:"gimel" > histogram:<sample_count:1 sample_sum:0.2885442155939729 bucket:<cumulative_count:0 upper_bound:0.005 > bucket:<cumulative_count:0 upper_bound:0.01 > bucket:<cumulative_count:0 upper_bound:0.025 > bucket:<cumulative_count:0 upper_bound:0.05 > bucket:<cumulative_count:0 upper_bound:0.1 > bucket:<cumulative_count:0 upper_bound:0.25 > bucket:<cumulative_count:1 upper_bound:0.5 > bucket:<cumulative_count:1 upper_bound:1 > bucket:<cumulative_count:1 upper_bound:2.5 > bucket:<cumulative_count:1 upper_bound:5 > bucket:<cumulative_count:1 upper_bound:10 > >  are inconsistent with descriptor Desc{fqName: "test_histogram", help: "This helps testing.", constLabels: {foo="bar"}, variableLabels: [one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen]}
    registry_test.go:856: Gathering failed: labels in collected metric test_histogram label:<name:"eight" value:"c" > label:<name:"eleven" value:"gamma" > label:<name:"five" value:"gamma" > label:<name:"foo" value:"bar" > label:<name:"four" value:"gamma" > label:<name:"nine" value:"beta" > label:<name:"fourteen" value:"gamma" > label:<name:"one" value:"c" > label:<name:"seven" value:"beth" > label:<name:"fifteen" value:"gimel" > label:<name:"six" value:"gamma" > label:<name:"sixteen" value:"gimel" > label:<name:"ten" value:"beta" > label:<name:"thirteen" value:"alpha" > label:<name:"three" value:"beta" > label:<name:"twelve" value:"b" > label:<name:"two" value:"aleph" > histogram:<sample_count:1 sample_sum:0.453247760614353 bucket:<cumulative_count:0 upper_bound:0.005 > bucket:<cumulative_count:0 upper_bound:0.01 > bucket:<cumulative_count:0 upper_bound:0.025 > bucket:<cumulative_count:0 upper_bound:0.05 > bucket:<cumulative_count:0 upper_bound:0.1 > bucket:<cumulative_count:0 upper_bound:0.25 > bucket:<cumulative_count:1 upper_bound:0.5 > bucket:<cumulative_count:1 upper_bound:1 > bucket:<cumulative_count:1 upper_bound:2.5 > bucket:<cumulative_count:1 upper_bound:5 > bucket:<cumulative_count:1 upper_bound:10 > >  are inconsistent with descriptor Desc{fqName: "test_histogram", help: "This helps testing.", constLabels: {foo="bar"}, variableLabels: [one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen]}
    registry_test.go:856: Gathering failed: labels in collected metric test_histogram label:<name:"eight" value:"c" > label:<name:"eleven" value:"gamma" > label:<name:"five" value:"gamma" > label:<name:"foo" value:"bar" > label:<name:"four" value:"gamma" > label:<name:"nine" value:"beta" > label:<name:"fourteen" value:"gamma" > label:<name:"one" value:"c" > label:<name:"seven" value:"beth" > label:<name:"fifteen" value:"gimel" > label:<name:"six" value:"gamma" > label:<name:"sixteen" value:"gimel" > label:<name:"ten" value:"beta" > label:<name:"thirteen" value:"alpha" > label:<name:"three" value:"beta" > label:<name:"twelve" value:"b" > label:<name:"two" value:"aleph" > histogram:<sample_count:1 sample_sum:0.453247760614353 bucket:<cumulative_count:0 upper_bound:0.005 > bucket:<cumulative_count:0 upper_bound:0.01 > bucket:<cumulative_count:0 upper_bound:0.025 > bucket:<cumulative_count:0 upper_bound:0.05 > bucket:<cumulative_count:0 upper_bound:0.1 > bucket:<cumulative_count:0 upper_bound:0.25 > bucket:<cumulative_count:1 upper_bound:0.5 > bucket:<cumulative_count:1 upper_bound:1 > bucket:<cumulative_count:1 upper_bound:2.5 > bucket:<cumulative_count:1 upper_bound:5 > bucket:<cumulative_count:1 upper_bound:10 > >  are inconsistent with descriptor Desc{fqName: "test_histogram", help: "This helps testing.", constLabels: {foo="bar"}, variableLabels: [one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen]}
    registry_test.go:856: Gathering failed: 2 error(s) occurred:
        * labels in collected metric test_histogram label:<name:"eight" value:"aleph" > label:<name:"eleven" value:"alpha" > label:<name:"fifteen" value:"a" > label:<name:"five" value:"gimel" > label:<name:"foo" value:"bar" > label:<name:"four" value:"beth" > label:<name:"fourteen" value:"gamma" > label:<name:"nine" value:"b" > label:<name:"one" value:"beta" > label:<name:"seven" value:"alpha" > label:<name:"six" value:"gamma" > label:<name:"sixteen" value:"a" > label:<name:"ten" value:"beta" > label:<name:"thirteen" value:"b" > label:<name:"three" value:"b" > label:<name:"twelve" value:"b" > label:<name:"two" value:"c" > histogram:<sample_count:1 sample_sum:0.09970312840959775 bucket:<cumulative_count:0 upper_bound:0.005 > bucket:<cumulative_count:0 upper_bound:0.01 > bucket:<cumulative_count:0 upper_bound:0.025 > bucket:<cumulative_count:0 upper_bound:0.05 > bucket:<cumulative_count:1 upper_bound:0.1 > bucket:<cumulative_count:1 upper_bound:0.25 > bucket:<cumulative_count:1 upper_bound:0.5 > bucket:<cumulative_count:1 upper_bound:1 > bucket:<cumulative_count:1 upper_bound:2.5 > bucket:<cumulative_count:1 upper_bound:5 > bucket:<cumulative_count:1 upper_bound:10 > >  are inconsistent with descriptor Desc{fqName: "test_histogram", help: "This helps testing.", constLabels: {foo="bar"}, variableLabels: [one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen]}
        * labels in collected metric test_histogram label:<name:"eight" value:"beta" > label:<name:"eleven" value:"alpha" > label:<name:"fifteen" value:"beta" > label:<name:"five" value:"gimel" > label:<name:"foo" value:"bar" > label:<name:"four" value:"c" > label:<name:"fourteen" value:"alpha" > label:<name:"nine" value:"gamma" > label:<name:"one" value:"beth" > label:<name:"seven" value:"gamma" >label:<name:"six" value:"gamma" > label:<name:"sixteen" value:"gimel" > label:<name:"ten" value:"gimel" > label:<name:"thirteen" value:"aleph" > label:<name:"three" value:"aleph" > label:<name:"twelve" value:"gamma" > label:<name:"two" value:"a" > histogram:<sample_count:1 sample_sum:0.11973892090427114 bucket:<cumulative_count:0 upper_bound:0.005 > bucket:<cumulative_count:0 upper_bound:0.01 > bucket:<cumulative_count:0 upper_bound:0.025 > bucket:<cumulative_count:0 upper_bound:0.05 > bucket:<cumulative_count:0 upper_bound:0.1 > bucket:<cumulative_count:1 upper_bound:0.25 > bucket:<cumulative_count:1 upper_bound:0.5 > bucket:<cumulative_count:1 upper_bound:1 > bucket:<cumulative_count:1 upper_bound:2.5 > bucket:<cumulative_count:1 upper_bound:5 > bucket:<cumulative_count:1 upper_bound:10 > > are inconsistent with descriptor Desc{fqName: "test_histogram", help: "This helps testing.", constLabels: {foo="bar"}, variableLabels: [one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen]}

With the copyLabelPairs patch:

$ GO111MODULE=on go test -v prometheus/registry_test.go
=== RUN   TestHandler
--- PASS: TestHandler (0.01s)
=== RUN   TestAlreadyRegistered
--- PASS: TestAlreadyRegistered (0.00s)
=== RUN   TestHistogramVecRegisterGatherConcurrency
--- PASS: TestHistogramVecRegisterGatherConcurrency (2.05s)
=== RUN   TestWriteToTextfile
--- PASS: TestWriteToTextfile (0.00s)
PASS
ok      command-line-arguments  2.117s

cc: @beorn7

@@ -791,7 +791,7 @@ func TestHistogramVecRegisterGatherConcurrency(t *testing.T) {
Help: "This helps testing.",
ConstLabels: prometheus.Labels{"foo": "bar"},
},
[]string{"one", "two", "three"},
[]string{"one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen"},

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know the sweet spot that this happens? Or, even better, the underlying reason why more makes this break?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bobbytables i have not yet identified the exact scenario under which things get wonky. My guess is the sorting of the labelpairs can be racy. This PR is really just an attempt to highlight the issue and hopefully lead to further discovery, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason is here: https://golang.org/src/sort/sort.go?s=5414:5439#L184

I.e. the sort algorithm changes with more than 12 elements.

labelPairs = append(labelPairs, &dto.LabelPair{
Name: pair.Name,
Value: pair.Value,
})

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to append here since you created the slice with the correct length already? Why not just assign the index of the labelPairs to the current index in the loop?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was mostly a copy-paste job from the method above, as I was aiming for consistency. I don't think there is any significant difference in the two approaches, but I don't feel strongly in either direction.

@douglas-reid
Copy link
Author

OK. More data points. If I comment out: https://github.com/prometheus/client_golang/blob/master/prometheus/registry.go#L875

Then the issue appears to go away as well. Similarly, if I change:

sort.Sort to sort.Stable, I also appear to see the issue go away.

So, my guess is that the sorting is indeed problematic. I'll defer to @beorn7 on the most proper correction.

@beorn7
Copy link
Member

beorn7 commented Dec 6, 2018

Good catch. I think you found the issue. The sort.Sort call is changing the labels slice that is supposed to be immutable. This will race if multiple Gather calls happen concurrently.

The sort.Stable probably only helps if labels are already sorted (which is usually the case), but if the aren't, we'll still have a race.

I have an idea how to solve that last detail. Thanks for the detective work so far.

@beorn7
Copy link
Member

beorn7 commented Dec 6, 2018

Thanks once more. I have created #513 to incorporate the findings here.
Thus, I'll close this one.

@beorn7 beorn7 closed this Dec 6, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants