jashkenas / underscore Public
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make _.sample more efficient by exiting early #2158
Conversation
I think I made have made a mistake here, please give me a second while I investigate. |
a959614
to
a15ed33
Compare
Turns out you can't just take the loop from Anyway, I've rewritten the function and added a test to check it's actually sampling from the whole range rather than just from the first |
return _.shuffle(obj).slice(0, Math.max(0, n)); | ||
|
||
var set = isArrayLike(obj) ? obj : _.values(obj); | ||
n = Math.max(Math.min(n, set.length), 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set.length
shouldn't ever be less than 0 and if it is it won't really matter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll need to use getLength(set)
, to avoid iOS JIT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You still need to use max because n
may be negative.
If this is done |
var set = isArrayLike(obj) ? obj : _.values(obj); | ||
n = Math.max(Math.min(n, set.length), 0); | ||
var sample = _.clone(set); | ||
for (var length = set.length, index = length - 1, rand, temp; index > (length - n - 1); index--) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mind moving rand
and temp
out of here?
Thanks for the review. I've changed the bits mentioned and made I wasn't sure what your policy on git commits is, so I squashed the whole thing together into one commit. |
var set = isArrayLike(obj) ? obj : _.values(obj); | ||
var length = getLength(set); | ||
n = Math.min(n, getLength(set)); | ||
var sample = _.clone(set); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last nitpick: I think we can use a modified version of Fisher-Yates to only allocate an array of n
.
var set = isArrayLike(obj) ? obj : _.values(obj);
var length = getLength(set);
n = Math.min(n, length);
var sample = Array(n);
var rand;
for (var index = 0; index < n; index++) {
rand = _.random(0, index);
if (rand !== index) sample[index] = sample[rand];
sample[rand] = set[index];
}
for (; index < length; index++) {
rand = _.random(0, index);
if (rand < n) sample[rand] = set[index];
}
return sample;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good. I've added another test to cater for the n<0 case as well.
38669eb
to
28f0683
Compare
shuffled[rand] = set[index]; | ||
} | ||
return shuffled; | ||
return _.sample(obj, _.size(obj)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, one more Infinity
instead of _.size(obj)
will accomplish the same thing without a potentially costly _.keys
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat, hadn't even thought about that!
Have you done any profiling yet @marekventur? |
ok(_.every(_.range(10), function() { //appears consistent? | ||
return _.every(numbers, _.partial(_.contains, numbers)); | ||
}), 'contains the same members before and after shuffle'); | ||
deepEqual(numbers, _.sortBy(shuffled, function(n) { return n; }), 'contains the same members before and after shuffle'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_.sortBy(shuffled)
will work just as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know that. Is that documented behaviour or a side-effect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documented from iteratee
On Apr 27, 2015 6:08 PM, "marekventur" notifications@github.com wrote:
In test/collections.js
#2158 (comment):notStrictEqual(numbers, shuffled, 'original object is unmodified');
- ok(.every(.range(10), function() { //appears consistent?
return _.every(numbers, _.partial(_.contains, numbers));
- }), 'contains the same members before and after shuffle');
- deepEqual(numbers, _.sortBy(shuffled, function(n) { return n; }), 'contains the same members before and after shuffle');
I didn't know that. Is that documented behaviour or a side-effect?
—
Reply to this email directly or view it on GitHub
https://github.com/jashkenas/underscore/pull/2158/files#r29196826.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll drop that function.
Good call. This is how long 1M operations on a 1000 elements array take on my machine:
profiling code is here: https://gist.github.com/marekventur/455c7bffb00d9759ec7e In lights of those numbers I'm going to revert back to my code pre the changes @jridgewell suggested, it might be more efficient in terms of memory usage, but it's slower. |
// Let's return predictable results
_.random = function (min, max) {
if (max == null) {
max = min;
min = 0;
}
return min + Math.floor(Math.min(Math.random(), .1) * (max - min + 1));
}
var array = _.range(10);
// master
_.shuffle(array); // => [8, 9, 1, 2, 3, 4, 5, 6, 7, 0];
// Yours
_.shuffle(array); // => [9, 2, 3, 4, 5, 6, 7, 8, 0, 1];
// Mine
_.shuffle(array); // => [8, 9, 1, 2, 3, 4, 5, 6, 7, 0];
Edit: Off-by-ones were throwing me off. Yours is indeed a proper shuffle. |
sample[index] = sample[rand]; | ||
sample[rand] = temp; | ||
} | ||
return sample.slice(length - n); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Killing the elements after n
should be faster
sample.length = n;
Which would avoid the slice
call for shuffle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is, the shuffled elements are at the end of the array not the beginning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, that wouldn't work. You want the end of the array, not the beginning.
I think it should be comparable/better if it is implemented like this megawac/lodash@d7d1a3d |
…rough to _.sample * Improves efficiency of _.sample for when n is significantly smaller than obj.length * Improves tests to make sure _.sample and _.shuffle actually change the order * Replace shuffle with call to _.sample to avoid duplicated logic
Putting an His is so much faster because it only runs |
Like @jridgewell says, megawac/lodash@d7d1a3d still runs |
Right, my bad I misread earlier comments |
Well, before we were running |
Is there anything else that needs doing before it can be merged? I'm happy to do it, I just don't know what it is :) |
Make _.sample more efficient by exiting early
Thanks for contributing! |
This makes
_.sample
more efficient by re-implementing_.shuffle
's logic and exiting early when enough elements are found. The speedup to the previous method of shuffling the whole input is especially large when the input is a long array and only a few sample elements are required.It is safe to exit a Fisher-Yates shuffle early: http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#Comparison_with_other_shuffling_algorithms
There are two alternatives to the way I implemented it which would require either factoring the main
_.shuffle
logic out into a private function and using it in both_.sample
and_.shuffle
or adding an internal parameter to_.shuffle
that limits the result to a certain number, thus making_.shuffle
even more similar to_.sample
. Please let me know if any of those two options are preferred over the one I have here.