Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proof of concept - Shared Edge histogram #23

Open
wants to merge 7 commits into
base: master
from

Conversation

@schneems
Copy link

schneems commented Mar 16, 2020

Sorry for the messy PR, this isn't intended to be merged as is, but I wanted to instead make a proof of concept, explain my problem, and then start a dialog about how we can address that problem.

Problem

I want to use histograms to how two different distributions compare to one another schneems/derailed_benchmarks#169 however, if I generate a histogram for each, the edges aren't aligned so it's impossible to tell at a glance which distribution is "better".

This Solution Implementation

I had the idea that we could align the edges of the two different distributions so that they can be easily graphed and compared. That's what this PR does. It generates an edge list for two different distributions, then it turns them into one edge list. To do this I averaged the bin size for each edge list and then I built another edge list going from the smallest edge value to the largest edge value using the averaged bin size.

Unknowns

  • C-extension clarity and quality. I don't spend a lot of time in c-extensions so I'm guessing some of this could be cleaned up dramatically.
  • Interface (return values): Maybe instead of returning a single histogram, we want to return two histograms in an array or create a new histogram structure.
  • Interface modularity: Does it make the most sense to have this special purpose method? Alternatively, we could maybe allow the histogram method to take an edges: argument, and expose a way to generate edges instead.

Your feedback

I want to thank you a ton for creating an maintaining this library. I would like your feedback on this PR. Please let me know if anything is not clear or if you have any questions.

schneems added 7 commits Mar 15, 2020
…q(b.dual_histograms(compare: a).edge.first)

     NoMethodError:
       undefined method `edge' for [7, 7, 7, 12, 12, 12, 12, 20, 20, 20]:Array
@schneems

This comment has been minimized.

Copy link
Author

schneems commented Mar 22, 2020

I made my own histogram library https://github.com/zombocom/mini_histogram. It's slower, but not that much slower

$ rake bench
true
[79, 103, 81, 103, 103, 100, 104, 99, 105, 123]
[79, 103, 81, 103, 103, 100, 104, 99, 105, 123]
Warming up --------------------------------------
    enumerable stats   924.000  i/100ms
    mini histogram     613.000  i/100ms
Calculating -------------------------------------
    enumerable stats      8.606k (± 6.5%) i/s -     43.428k in   5.069434s
    mini histogram        6.554k (± 2.6%) i/s -     33.102k in   5.054474s

Comparison:
    enumerable stats:     8606.1 i/s
    mini histogram  :     6553.8 i/s - 1.31x  slower

There might be a faster way to count weights than you're currently doing https://github.com/zombocom/mini_histogram/blob/b7d5804fe207910d43e1342f97e95be2ff0c4356/lib/mini_histogram.rb#L80-L89.

If you know step size, and the lowest value then you can calculate the index of the bin directly without having to do any search

array.each do |x|
  index = ((x - edge_lo) / step).floor
  @weights[index] += 1
end

This PR isn't really needed anymore for my purposes. Feel free to close, or reply back.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

1 participant
You can’t perform that action at this time.