Performance of unordered_multimap::insert much slower than GCC 4.8.2 #21649

Closed
llvmbot opened this issue Oct 14, 2014 · 15 comments
Labels
bugzilla (Issues migrated from bugzilla), duplicate (Resolved as duplicate), libc++ (C++ Standard Library. Not GNU libstdc++. Not libc++abi.)

Comments

llvmbot commented Oct 14, 2014

Bugzilla Link 21275
Resolution DUPLICATE
Resolved on Aug 27, 2015 08:39
Version 3.4
OS All
Reporter LLVM Bugzilla Contributor
CC @hfinkel,@mclow

Extended Description

The performance of unordered_multimap::insert is very slow in LLVM libc++
when used on large data sets that have degenerate non-uniform key distribution.

// g++ -std=c++11 -O3 umm.cpp -o umm && time ./umm
// 13.5s for Apple LLVM version 6.0 (clang-600.0.51) (based on LLVM 3.5svn).
// 0.016s for GCC version 4.8.2.
#include <unordered_map>
int main() {
    std::unordered_multimap<int,int> m;
    for (int x = 0; x < 100000; ++x)
        m.insert(std::pair<int,int>(x%4, x));
    return 0;
}

I recognize that another ticket was filed on this issue and marked as WONTFIX.
But even using max_load_factor() does not help LLVM libc++ get anywhere near
the performance of GCC.

One cannot always know in advance what the key distribution of the data set will
be - it can be uniform or degenerate. Data sets can be multi-gigabyte and would
not be conducive to multiple passes to determine key distribution. As a
workaround I used #ifdefs with an ordered std::multimap on Mac which is only
twice as slow as GCC 4.8.2 libstdc++'s unordered_multimap for this scenario,
which is not ideal.
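
For reference, a sketch of what calling max_load_factor() here might look like (illustrative only, not the exact code that was tried). Because all 100,000 values map to only four distinct keys, they still end up chained together in a handful of buckets no matter how many buckets exist, so lowering the load factor does not shorten the scans:

#include <unordered_map>
int main() {
    std::unordered_multimap<int,int> m;
    m.max_load_factor(0.5f);  // ask for more buckets per element...
    m.reserve(100000);        // ...and pre-allocate them up front
    for (int x = 0; x < 100000; ++x)
        m.insert(std::pair<int,int>(x % 4, x));  // still only 4 distinct keys -> 4 long chains
    return 0;
}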

hfinkel commented Oct 14, 2014

In the related bug you mention (#16747 I assume), it was mentioned that some of the other implementations might keep an end-of-bucket-list pointer to prevent this kind of worst-case forward bucket scanning. Perhaps we should do so too.

llvmbot commented Oct 14, 2014

I think this is a case of the Principle of Least Astonishment. It took me a couple of hours to isolate this problem in a larger program that was working fine on Linux with GCC, but was unusable on Mac. It would be nice if LLVM had similar behavior to other compilers' STL implementations for unordered_multimap. Thanks.

hfinkel commented Oct 14, 2014

I agree. It is bad that our asymptotic behavior is so uncompetitive here.

llvmbot commented Feb 3, 2015

Test code
I'm not sure if this is redundant, but I also observe similarly slow performance when calling find() on unordered_maps of relatively small size with sequential keys. Running the attached code with e.g.

./a.out 1e9 2
./a.out 1e9 4
...
./a.out 1e9 1024

for both Clang and GCC on OSX produces the following table of results:

// Timing results for tests 1, 2 using GCC 4.7.2
//
// N map unordered_map
// 2 2.06997 0.882264
// 4 2.53414 0.888939
// 8 3.19828 0.877453
// 16 3.83513 0.876052
// 32 3.82357 0.877334
// 64 4.38807 0.877624
// 128 4.94558 0.880401
// 256 5.52116 0.87198
// 512 6.20285 0.876554
// 1024 7.60791 0.916113

// clang++ Apple LLVM version 6.0 (clang-600.0.51) (based on LLVM 3.5svn)
//
// N map unordered_map
// 2 2.07317 2.03433
// 4 1.87946 18.2131
// 8 3.10882 18.229
// 16 3.80371 18.4432
// 32 3.80869 18.4188
// 64 4.44196 18.3647
// 128 8.17782 18.388
// 256 9.3574 18.4078
// 512 10.5781 18.3727
// 1024 11.6872 18.4273
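
The attached test code itself is not reproduced above, so purely as an illustration (a hypothetical reconstruction, not the actual attachment), a benchmark consistent with the description would look something like this: build a map of N sequential keys, then time a large number of find() calls whose results are accumulated so the loop has an observable effect. The std::map column would be measured analogously.

// Hypothetical sketch only -- the real attachment may differ.
// Usage: ./a.out <lookups> <N>, e.g. ./a.out 1e9 1024
#include <cstdlib>
#include <iostream>
#include <unordered_map>
int main(int argc, char** argv) {
    const long lookups = static_cast<long>(std::atof(argv[1]));
    const int N = std::atoi(argv[2]);
    std::unordered_map<int,int> um;
    for (int i = 0; i < N; ++i)
        um[i] = i;                       // small map with sequential keys
    const int key = N / 2;               // loop-invariant lookup key
    long sum = 0;
    for (long i = 0; i < lookups; ++i)
        sum += um.find(key)->second;     // the lookup being timed
    std::cout << sum << "\n";            // keep the result observable
    return 0;
}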

llvmbot commented Feb 3, 2015

I'll take a look at this.

llvmbot commented Apr 6, 2015

It seems like the optimizer is playing tricks on your unordered_map test. The reason that libstdc++ runs so much faster than libc++ in that case seems to be that the optimizer figures out it only needs to call unordered_map::find once. For unknown reasons it can't seem to deduce this for libc++.

If you change the input to unordered_map::find(...) to either be a) different on every run, or b) hidden from the optimizer, then you will see that libc++ and libstdc++ perform almost identically.
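
A minimal sketch of option (b) -- my own illustration, not a change to the attached test -- hides the key behind a volatile read so the call cannot be hoisted out of the loop:

#include <unordered_map>
int main() {
    std::unordered_map<int,int> um;
    for (int i = 0; i < 1024; ++i)
        um[i] = i;
    volatile int hidden_key = 512;       // value the optimizer cannot assume constant
    long sum = 0;
    for (long i = 0; i < 100000000; ++i) {
        int k = hidden_key;              // fresh volatile read every iteration
        sum += um.find(k)->second;       // find() can no longer be hoisted
    }
    return static_cast<int>(sum & 0xff); // keep the accumulated value live
}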

Can you provide me with a real life example where this optimization behavior causes a real problem?

llvmbot commented Apr 13, 2015

The reason that libstdc++ runs so much faster than libc++ in that case seems to be
because the optimizer figures out it only needs to call unordered_map::find once.

Pretty impressive for the GCC optimizer to be able to do this!

I think I agree with your analysis. I also did some timings to confirm that calling the identity() function and accumulating the dummy parameter were not dominating the runtime. The results are below.

GCC 4.9.1 from the MOOSE package

N map unordered_map identity-only addition-only
8 5.21771 10.1175 1.7485 1.74631
16 6.41907 10.1135 1.74535 1.75165
32 6.3854 10.1792 1.74821 1.77143

/usr/bin/clang++ = Apple LLVM version 6.0 (clang-600.0.51) (based on LLVM 3.5svn)

N map unordered_map identity-only addition-only
8 6.72477 9.90882 1.74936 2.19065
16 9.00235 9.97721 1.75315 2.17357
32 8.99803 9.91076 1.76576 2.18399

Can you provide me with a real life example where this optimization behavior causes a real problem?

I asked my colleague who originally brought up the issue, but it appears he no longer has the "real-world" code where unordered_map was slower than map. In the end, I believe he went with a vector-based solution that was faster than both associative container types for his application.

llvmbot commented Aug 25, 2015

This bug ticket seems to have been hijacked by an unrelated issue.

Any progress on the original problem?

// g++ -std=c++11 -O3 umm.cpp -o umm && time ./umm
// 13.5s for Apple LLVM version 6.0 (clang-600.0.51) (based on LLVM 3.5svn).
// 0.016s for GCC version 4.8.2.
#include <unordered_map>
int main() {
    std::unordered_multimap<int,int> m;
    for (int x = 0; x < 100000; ++x)
        m.insert(std::pair<int,int>(x%4, x));
    return 0;
}

llvmbot commented Aug 25, 2015

Zaxxon, the bug you submitted seems to be a duplicate of #17121.

See https://llvm.org/bugs/show_bug.cgi?id=16747

llvmbot commented Aug 25, 2015

As explained at the top of this PR, the current implementation of libc++'s unordered_map and unordered_multimap is wildly uncompetitive with GCC, with and without hints. An unordered hash map is supposed to be a faster drop-in alternative to an ordered map. The libc++ implementation is not competitive when inserting a lot of data with few keys - it becomes a very long sequential list scan to append data in a bucket. One cannot possibly know the key distribution in advance, so the libc++ implementation of unordered_map and unordered_multimap is simply not usable in its present form on large data sets.

Surely libc++ can be adapted to do something similar to GCC libstdc++'s superior algorithm? All that has to be done is to keep a pointer to the end of the values in the hash bucket and check that value first instead of doing a pointless and inefficient sequential scan.
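
To illustrate the idea in isolation (a toy sketch only -- not libc++'s __hash_table and not GCC's actual data structure): with a singly linked chain per bucket, appending requires walking the whole chain unless the bucket also remembers its last node.

#include <utility>

// Hypothetical standalone bucket chain, for illustration only.
struct Node {
    std::pair<int,int> kv;
    Node* next = nullptr;
};

struct Bucket {
    Node* head = nullptr;
    Node* tail = nullptr;   // the extra "end of bucket" pointer being proposed

    // Without the tail pointer: O(chain length) per insert,
    // which degenerates when one key dominates.
    void append_by_scan(Node* n) {
        Node** p = &head;
        while (*p) p = &(*p)->next;
        *p = n;
    }

    // With the tail pointer: O(1) per insert.
    void append_with_tail(Node* n) {
        if (tail) tail->next = n; else head = n;
        tail = n;
    }
};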

llvmbot commented Aug 26, 2015

As explained at the top of this PR, the current implementation of libc++'s
unordered_map and unordered_multimap is wildly uncompetitive with GCC with
and without hints.

The problem you describe at the top of this PR is exactly the same as #17121. While you may not agree with the resolution, this problem has been discussed and decided on.

Also, you ONLY describe a problem with std::unordered_multimap. I have yet to see how libc++'s std::unordered_map is uncompetitive with GCC's. I have spent time benchmarking both implementations and I do not see the problem. Could you explain further?

One cannot possibly know
the key distribution in advance, so the libc++ implementation of
unordered_map and unordered_multimap is simply not usable in its present
form on large data sets.

While that is somewhat true for unordered_multimap, #17121 describes how to use unordered_multimap to get the behavior and performance you desire.

You once again mention std::unordered_map, but I don't know what you're referring to. The problem only exists in std::unordered_multimap.

Surely libc++ can be adapted to do something similar to GCC libstdc++'s
superior algorithm? All the has to be done is to keep a pointer to the end
of the values in the hash bucket and check that value first instead of doing
a pointless and inefficient sequential scan.

I've looked into making changes to add a "bucket end" pointer but it seems quite complicated.

First, you have to understand that a change like this is made harder by the fact that we can't introduce ABI-breaking changes. That means I can't actually add any new class data members to any data type in std::unordered_multimap. This makes the change a lot harder than simply adding a new list of bucket pointers.

Second, std::unordered_multimap and std::unordered_map are implemented using the same underlying container __hash_table. Any changes made to __hash_table to support std::unordered_multimap MUST NOT introduce needless overhead to std::unordered_map. I don't see how we can easily make the changes you describe without affecting std::unordered_map.

While I don't agree with Howard Hinnant's decision to make std::unordered_multimap do insertions at the end, I don't think I can change the existing behavior. If any change is going to be made to fix the performance problems, it should be to do insertions at the front.

I'm closing this bug as a duplicate of #17121. If I missed a performance problem in std::unordered_map, please reopen this bug.

*** This bug has been marked as a duplicate of bug #17121 ***

llvmbot commented Aug 26, 2015

It's been 10 months since I opened this bug report and I erred when I mentioned unordered_map. It was not intentional.

libc++ unordered_multimap is simply unusable for large data sets for the reasons I've previously outlined. The performance hint recommended in the other PR is not useful for a data set for which you do not know the key characteristics beforehand.

It is disappointing that this PR has been closed and made a duplicate of a PR also marked RESOLVED WONTFIX because it is deemed to be too difficult to fix in light of binary compatibility constraints. I was kind of hoping LLVM would have supported class versioning of some sort.

While I don't agree with Howard Hinnants decision to make std::unordered_multimap do insertions at the end I don't think I can change the existing behavior. If any change is going to be made to fix the performance problems it should be to do insertions at the front.

Why would changing this behavior break anything? The ordering for elements with the same key is not guaranteed by the standard.

llvmbot commented Aug 26, 2015

libc++ unordered_multimap is simply unusable for large data sets for the
reasons I've previously outlined. The performance hint recommended in the
other PR is not useful for a data set for which you do not know the key
characteristics beforehand.

I don't understand why you can't use the method described in the other PR (without the max_load_factor call). How does that depend on the characteristics of the data set in a way that your current example does not?

If I were to implement GCC's behavior in libc++, I would likely do it in a way very similar to the performance hint described in #17121.
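
For reference, the hint-based workaround looks roughly like this (a sketch along the lines of #17121, not a verbatim copy of that ticket): find() locates an existing element with the same key, and the hinted insert links the new element next to it instead of scanning to the end of the bucket chain.

#include <unordered_map>
int main() {
    std::unordered_multimap<int,int> m;
    for (int x = 0; x < 100000; ++x) {
        int key = x % 4;
        // Hint points at an existing element with this key (or end() the
        // first time a key is seen, in which case a normal insert happens).
        m.insert(m.find(key), std::pair<int,int>(key, x));
    }
    return 0;
}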

llvmbot commented Aug 27, 2015

The poor performance of the current implementation of libc++ unordered_multimap::insert violates the principle of least astonishment. It is a quality of implementation issue.

Good performance for unordered_multimap::insert should be achieved by the STL library implementation itself without the need for the insert/find hack. Indeed, GCC's libstdc++ has been able to solve this very problem with a more efficient algorithm. The user of the STL implementation should not have to know or care that the libc++ library implementation of this container degenerates to searching very long linked lists in buckets when there are a lot of elements with only a few distinct keys. I can't be the only user of this library that spent several hours isolating why my application was so slow on Macs.

I think the example at the top of this ticket speaks for itself.

// 13.5s for Apple LLVM version 6.0 (clang-600.0.51) (based on LLVM 3.5svn).
// 0.016s for GCC version 4.8.2.

libc++ is nearly three orders of magnitude slower than GCC at inserting 100,000 elements (13.5 s vs. 0.016 s, roughly 840x).

It is clear that we will never agree on this issue, and the ticket was marked WONTFIX. Nothing left to say on this topic.

This issue was closed.