Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make _.sample more efficient by exiting early #2158

Merged
merged 1 commit into from May 22, 2015

Conversation

marekventur
Copy link
Contributor

@marekventur marekventur commented Apr 21, 2015

This makes _.sample more efficient by re-implementing _.shuffle's logic and exiting early when enough elements are found. The speedup to the previous method of shuffling the whole input is especially large when the input is a long array and only a few sample elements are required.

It is safe to exit a Fisher-Yates shuffle early: http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#Comparison_with_other_shuffling_algorithms

There are two alternatives to the way I implemented it which would require either factoring the main _.shuffle logic out into a private function and using it in both _.sample and _.shuffle or adding an internal parameter to _.shuffle that limits the result to a certain number, thus making _.shuffle even more similar to _.sample. Please let me know if any of those two options are preferred over the one I have here.

@marekventur
Copy link
Contributor Author

@marekventur marekventur commented Apr 21, 2015

I think I made have made a mistake here, please give me a second while I investigate.

@marekventur marekventur force-pushed the master branch 2 times, most recently from a959614 to a15ed33 Compare Apr 21, 2015
@marekventur
Copy link
Contributor Author

@marekventur marekventur commented Apr 21, 2015

Turns out you can't just take the loop from _.shuffle and exit early, it's not the type of fisher-yates I remember (not even sure it's one, but I'm no expert).

Anyway, I've rewritten the function and added a test to check it's actually sampling from the whole range rather than just from the first n elements.

return _.shuffle(obj).slice(0, Math.max(0, n));

var set = isArrayLike(obj) ? obj : _.values(obj);
n = Math.max(Math.min(n, set.length), 0);
Copy link
Collaborator

@megawac megawac Apr 27, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set.length shouldn't ever be less than 0 and if it is it won't really matter

Copy link
Collaborator

@jridgewell jridgewell Apr 27, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need to use getLength(set), to avoid iOS JIT.

Copy link
Collaborator

@michaelficarra michaelficarra Apr 27, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still need to use max because n may be negative.

@megawac
Copy link
Collaborator

@megawac megawac commented Apr 27, 2015

If this is done shuffle should probably be implemented via sample

var set = isArrayLike(obj) ? obj : _.values(obj);
n = Math.max(Math.min(n, set.length), 0);
var sample = _.clone(set);
for (var length = set.length, index = length - 1, rand, temp; index > (length - n - 1); index--) {
Copy link
Collaborator

@jridgewell jridgewell Apr 27, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind moving rand and temp out of here?

@marekventur
Copy link
Contributor Author

@marekventur marekventur commented Apr 27, 2015

Thanks for the review. I've changed the bits mentioned and made shuffle call through to sample. I've also did a few minor changes to the test for shuffle to make it look a bit cleaner and to test for one naive failure case.

I wasn't sure what your policy on git commits is, so I squashed the whole thing together into one commit.

var set = isArrayLike(obj) ? obj : _.values(obj);
var length = getLength(set);
n = Math.min(n, getLength(set));
var sample = _.clone(set);
Copy link
Collaborator

@jridgewell jridgewell Apr 27, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last nitpick: I think we can use a modified version of Fisher-Yates to only allocate an array of n.

    var set = isArrayLike(obj) ? obj : _.values(obj);
    var length = getLength(set);
    n = Math.min(n, length);
    var sample = Array(n);
    var rand;
    for (var index = 0; index < n; index++) {
      rand = _.random(0, index);
      if (rand !== index) sample[index] = sample[rand];
      sample[rand] = set[index];
    }
    for (; index < length; index++) {
      rand = _.random(0, index);
      if (rand < n) sample[rand] = set[index];
    }
    return sample;

Copy link
Contributor Author

@marekventur marekventur Apr 27, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good. I've added another test to cater for the n<0 case as well.

@marekventur marekventur force-pushed the master branch 3 times, most recently from 38669eb to 28f0683 Compare Apr 27, 2015
shuffled[rand] = set[index];
}
return shuffled;
return _.sample(obj, _.size(obj));
Copy link
Collaborator

@jridgewell jridgewell Apr 27, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, one more 😉: Using Infinity instead of _.size(obj) will accomplish the same thing without a potentially costly _.keys.

Copy link
Contributor Author

@marekventur marekventur Apr 27, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat, hadn't even thought about that!

@megawac
Copy link
Collaborator

@megawac megawac commented Apr 27, 2015

Have you done any profiling yet @marekventur?

ok(_.every(_.range(10), function() { //appears consistent?
return _.every(numbers, _.partial(_.contains, numbers));
}), 'contains the same members before and after shuffle');
deepEqual(numbers, _.sortBy(shuffled, function(n) { return n; }), 'contains the same members before and after shuffle');
Copy link
Collaborator

@megawac megawac Apr 27, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_.sortBy(shuffled) will work just as well

Copy link
Contributor Author

@marekventur marekventur Apr 27, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know that. Is that documented behaviour or a side-effect?

Copy link
Collaborator

@megawac megawac Apr 27, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documented from iteratee
On Apr 27, 2015 6:08 PM, "marekventur" notifications@github.com wrote:

In test/collections.js
#2158 (comment):

 notStrictEqual(numbers, shuffled, 'original object is unmodified');
  • ok(.every(.range(10), function() { //appears consistent?
  •  return _.every(numbers, _.partial(_.contains, numbers));
    
  • }), 'contains the same members before and after shuffle');
  • deepEqual(numbers, _.sortBy(shuffled, function(n) { return n; }), 'contains the same members before and after shuffle');

I didn't know that. Is that documented behaviour or a side-effect?


Reply to this email directly or view it on GitHub
https://github.com/jashkenas/underscore/pull/2158/files#r29196826.

Copy link
Contributor Author

@marekventur marekventur Apr 27, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll drop that function.

@marekventur
Copy link
Contributor Author

@marekventur marekventur commented Apr 27, 2015

Good call. This is how long 1M operations on a 1000 elements array take on my machine:

Master
input = _.range(1000);
1000000x _.shuffle(input):      15321ms
1000000x _.sample(input, 1000): 32637ms
1000000x _.sample(input, 500):  25765ms
1000000x _.sample(input, 2):    16937ms

My version (before https://github.com/jashkenas/underscore/pull/2158#discussion_r29190358)
input = _.range(1000);
1000000x _.shuffle(input):      15931ms
1000000x _.sample(input, 1000): 15326ms
1000000x _.sample(input, 500):   8331ms
1000000x _.sample(input, 2):     1127ms

After https://github.com/jashkenas/underscore/pull/2158#discussion_r29190358:
input = _.range(1000);
1000000x _.shuffle(input):      15896ms
1000000x _.sample(input, 1000): 15310ms
1000000x _.sample(input, 500):  17067ms
1000000x _.sample(input, 2):    10645ms

profiling code is here: https://gist.github.com/marekventur/455c7bffb00d9759ec7e

In lights of those numbers I'm going to revert back to my code pre the changes @jridgewell suggested, it might be more efficient in terms of memory usage, but it's slower.

@jridgewell
Copy link
Collaborator

@jridgewell jridgewell commented Apr 27, 2015

I don't believe 92a9d2c is a proper shuffle, that's why my suggestions run slower.

// Let's return predictable results
_.random = function (min, max) {
  if (max == null) {
    max = min;
    min = 0;
  }
  return min + Math.floor(Math.min(Math.random(), .1) * (max - min + 1));
}

var array = _.range(10);

// master
_.shuffle(array); // => [8, 9, 1, 2, 3, 4, 5, 6, 7, 0];

// Yours
_.shuffle(array); // => [9, 2, 3, 4, 5, 6, 7, 8, 0, 1];

// Mine
_.shuffle(array); // => [8, 9, 1, 2, 3, 4, 5, 6, 7, 0];

It's not definitive proof, but it feels fishy to me.

Edit: Off-by-ones were throwing me off. Yours is indeed a proper shuffle.

sample[index] = sample[rand];
sample[rand] = temp;
}
return sample.slice(length - n);
Copy link
Collaborator

@megawac megawac Apr 28, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Killing the elements after n should be faster

sample.length = n;

Which would avoid the slice call for shuffle

Copy link
Collaborator

@jridgewell jridgewell Apr 28, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is, the shuffled elements are at the end of the array not the beginning.

Copy link
Contributor Author

@marekventur marekventur Apr 28, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, that wouldn't work. You want the end of the array, not the beginning.

@megawac
Copy link
Collaborator

@megawac megawac commented Apr 28, 2015

In lights of those numbers I'm going to revert back to my code pre the changes @jridgewell suggested, it might be more efficient in terms of memory usage, but it's slower.

I think it should be comparable/better if it is implemented like this megawac/lodash@d7d1a3d

…rough to _.sample

* Improves efficiency of _.sample for when n is significantly smaller than obj.length
* Improves tests to make sure _.sample and _.shuffle actually change the order
* Replace shuffle with call to _.sample to avoid duplicated logic
@jridgewell
Copy link
Collaborator

@jridgewell jridgewell commented Apr 28, 2015

I think it should be comparable/better if it is implemented like this megawac/lodash@d7d1a3d

Putting an if in the hot loop will make it slower, no? http://jsperf.com/samples-clone-vs-for v http://jsperf.com/samples-clone-vs-for/5.

His is so much faster because it only runs n iterations (with length space), whereas ours does the opposite.

@marekventur
Copy link
Contributor Author

@marekventur marekventur commented Apr 28, 2015

Like @jridgewell says, megawac/lodash@d7d1a3d still runs input.length iterations, right? Avoiding that is the whole point of this PR :)

@megawac
Copy link
Collaborator

@megawac megawac commented Apr 28, 2015

Right, my bad I misread earlier comments

@jridgewell
Copy link
Collaborator

@jridgewell jridgewell commented Apr 28, 2015

Like @jridgewell says, megawac/lodash@d7d1a3d still runs input.length iterations, right? Avoiding that is the whole point of this PR :)

Well, before we were running length iterations and using length space, the worst of both worlds. I think yours is good enough since it's no worse space-wise and so much faster.

@marekventur
Copy link
Contributor Author

@marekventur marekventur commented May 2, 2015

Is there anything else that needs doing before it can be merged? I'm happy to do it, I just don't know what it is :)

jashkenas added a commit that referenced this issue May 22, 2015
Make _.sample more efficient by exiting early
@jashkenas jashkenas merged commit 297b829 into jashkenas:master May 22, 2015
1 check passed
@jashkenas
Copy link
Owner

@jashkenas jashkenas commented May 22, 2015

Thanks for contributing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants