Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

Use a close-to-O(n) unique algorithm for arrays of strings. #837

Closed
wants to merge 2 commits into
from

Conversation

Projects
None yet
6 participants

For an array with a large number of strings (think a million or so), the current implementation of _.uniq takes an unreasonable amount of time to complete as it uses a quadratic algorithm.

I've implemented a second algorithm that should run at close to linear time for arrays of strings.

Unfortunately Map is not available on current JavaScript implementations, or we'd be able to use this faster algorithm on arrays of values of any type.

judofyr commented Oct 23, 2012

The same technique should be possible with numbers too (right?).

In theory yes - but I didn't want to muck around with formatting or anything in case there were any gotchas.

Collaborator

michaelficarra commented Oct 23, 2012

Since (I believe, in most instances) JS objects aren't actually implemented as hash maps, I'd need to see performance metrics showing that this method is faster than sorting (in n*log(n)) and using the isSorted flag to make _.uniq linear.

Contributor

jdalton commented Oct 23, 2012

@judofyr if any more types were added the technique would need to be modded to allow the string '2' and the number 2 to be seen as different.

@charliesome I use a technique similar to this for large arrays. My implementation handles the issue of 2 and '2' by moving like values to smaller arrays (basically breaking the large bucket into smaller buckets). I don't currently use this technique for _.uniq because I only optimize for arrays that don't change length (_.difference, _.intersection, and _.without). I wanted to avoid the mechanics of adding items to the data object. I only use this optimization for large arrays (over ~30) to avoid losing perf due to the overhead of the cache initialization and setting.

@michaelficarra I've put a benchmark up on my web server: http://charlie.bz/files/underscore-perf/uniq.html

I get the following results:

--> using original uniq
original uniq with a sorted array 33.192
--> using string optimized uniq
optimized uniq with an unsorted array 10.004 
Collaborator

michaelficarra commented Oct 24, 2012

@charliesome: For some reason, I highly doubt that Array.prototype.sort runs on the order of n*log(n) in time. But thanks for the good test case other than that. For future reference, http://jsperf.com/ is a good JS benchmarking resource.

kittens commented Oct 28, 2012

+1 Ran into this issue yesterday when attempting to sort a large array, with this fix I managed to reduce something that would run indefinitely to a couple of seconds.

Good idea, but in case of strings, checking of strings will have O(n) complexity:

// here, `if` statement itself will have O(n) complexity  :(
if(_.every(array, function(x) { return typeof x === "string" })){ 
    ...
}

that makes code more than O(n2) with iterator.

Can't we make it like -

_.uniq = _.unique = function(array, isSorted, iterator, context) {    
    var results = [];
    var seen = []; 
    //here, one can define object instead of array, 
    //which makes `isSorted` unusable and complexity O(n) always
    each(array, function(value, index, list) {
        // call iterator inside `each` callback
        var initialValue = iterator ? iterator.call(array, iterator, context, list) : value;
        if (isSorted ? (!index || seen[seen.length - 1] !== initialValue) : !_.contains(seen, initialValue)) {
            seen.push(initialValue);
            results.push(array[index]);
        }
    });
    return results;
};

judofyr commented Oct 29, 2012

O(2n) is the same as O(n).

Contributor

jdalton commented Nov 1, 2012

For perf I think it's better to increase the overall perf of the function, instead of special casing it for strings. In this situation a specialized custom uniq method should be used. Also I don't think it's a good idea to sacrifice perf of the common case for the special case.

@jdalton jdalton closed this Nov 1, 2012

Contributor

jdalton commented Nov 3, 2012

@charliesome Here is my approach to optimizing _.uniq. It optimizes _.uniq for more than just strings or numbers and produces better results for large and small arrays.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment