Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug where different label orders lead to different results #164

Merged
merged 1 commit into from
Oct 28, 2019

Conversation

dmagliola
Copy link
Collaborator

In the DirectFileStore, in order to store the labels hash on disk, we
turn it into a "querystring". The way this is done, we'll end up with
different strings if the same hash is passed in with its keys in
different order.

Since this string is used to index into the file where we store the data,
this will lead to two different values being stored for the same hash.
This is fine in cases where the aggregation is :SUM, because they end up
getting summed when the Client is summarizing. But in :ALL aggregation,
for example, you will end up with one value or the other, randomly.

The test in this commit reproduces this problem.

This way of serializing the labels is a bit slower,
but it's not a huge impact in the big scheme of things, and it leads to
the correct result.


Performance impact (based on the labels marshalling benchmark on our "experiments" repo):

  x.report("map into escaped querystring and join") do
    LABELSET.map{|k,v| "#{CGI::escape(k.to_s)}=#{CGI::escape(v.to_s)}"}.join('&')
  end

  x.report("SORT and map into escaped querystring and join") do
    LABELSET.to_a.sort.map{|k,v| "#{CGI::escape(k.to_s)}=#{CGI::escape(v.to_s)}"}.join('&')
  end
map into escaped querystring and join:   507021.8 i/s 
SORT and map into escaped querystring and join:   342933.3 i/s

That's 30% fewer iterations.
The difference ends up being an extra microsecond for this sort, on a metric with 3 labels.
It pains me to add that overhead, but feels worth it for correctness.

@dmagliola dmagliola requested a review from Sinjo October 28, 2019 12:07
In the DirectFileStore, in order to store the labels hash on disk, we
turn it into a "querystring". The way this is done, we'll end up with
different strings if the same hash is passed in with its keys in
different order.

Since this string is used to index into the file where we store the data,
this will lead to two different values being stored for the same hash.
This is fine in cases where the aggregation is :SUM, because they end up
getting summed when the Client is summarizing. But in :ALL aggregation,
for example, you will end up with one value or the other, randomly.

The test in this commit reproduces this problem.

This way of serializing the labels is a bit slower (see PR for details),
but it's not a huge impact in the big scheme of things, and it leads to
the correct result.

Signed-off-by: Daniel Magliola <danielmagliola@gocardless.com>
@coveralls
Copy link

coveralls commented Oct 28, 2019

Coverage Status

Coverage remained the same at 100.0% when pulling c59713d on fix_label_order_bug into eb1fc97 on master.

Copy link
Member

@Sinjo Sinjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried having a think over alternative ways to perform the sort, but I can't believe anything I came up with is faster than this.

@dmagliola dmagliola merged commit f98228c into master Oct 28, 2019
@dmagliola dmagliola deleted the fix_label_order_bug branch October 28, 2019 16:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants