Performance of unordered_multimap::insert much slower than GCC 4.8.2 #21649
Comments
In the related bug you mention (#16747, I assume), it was mentioned that some of the other implementations might keep an end-of-bucket-list pointer to prevent this kind of worst-case forward bucket scanning. Perhaps we should do so too.
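A minimal sketch of that idea, using illustrative types rather than libc++'s actual internals: each bucket keeps a tail pointer alongside its head, so appending another element to the chain is O(1) instead of a walk over the whole chain.

struct Node {
    int key, value;
    Node* next;
};

struct Bucket {
    Node* head = nullptr;
    Node* tail = nullptr;  // the proposed end-of-bucket-list pointer

    // Append without scanning: O(1) no matter how long the chain is.
    void append(Node* n) {
        n->next = nullptr;
        if (tail) tail->next = n;
        else      head = n;
        tail = n;
    }
};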
I think this is a case of the Principle of Least Astonishment. It took me a couple of hours to isolate this problem in a larger program that was working fine on Linux with GCC, but was unusable on Mac. It would be nice if LLVM had similar behavior to other compilers' STL implementations for unordered_multimap. Thanks.
I agree. It is bad that our asymptotic behavior is so uncompetitive here.
Test code ./a.out 1e9 2 for both Clang and GCC on OS X produces a table of timing results for tests 1 and 2, comparing GCC 4.7.2 against clang++ Apple LLVM version 6.0 (clang-600.0.51, based on LLVM 3.5svn).
I'll take a look at this.
It seems like the optimizer is playing tricks on your unordered_map test. The reason that libstdc++ runs so much faster than libc++ in that case seems to be that the optimizer figures out it only needs to call unordered_map::find once. For unknown reasons it can't seem to deduce this for libc++. If you change the input to unordered_map::find(...) to either be a) different every run, or b) hidden from the optimizer, then you will see that libc++ and libstdc++ perform almost identically. Can you provide me with a real-life example where this optimization behavior causes a real problem?
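For what it's worth, here is a minimal sketch of option (b): hiding the key behind a volatile read so the find cannot be hoisted out of the timing loop. The container contents, key value, and iteration count are illustrative, not the original test harness.

#include <cstdio>
#include <unordered_map>

int main() {
    std::unordered_map<int, int> m;
    for (int i = 0; i < 1000; ++i)
        m.emplace(i, i);

    volatile int hidden_key = 42;       // opaque to the optimizer
    long hits = 0;
    for (long i = 0; i < 10000000; ++i) {
        int k = hidden_key;             // re-read on every iteration
        hits += (m.find(k) != m.end()); // cannot be hoisted out of the loop
    }
    std::printf("%ld\n", hits);         // keep the result live
    return 0;
}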
Pretty impressive for the GCC optimizer to be able to do this! I think I agree with your analysis. I also did some timings to confirm that calling the identity() function and accumulating the dummy parameter were not dominating the runtime.
I asked my colleague who originally brought up the issue, but it appears he no longer has the "real-world" code where unordered_map was slower than map. In the end, I believe he went with a vector-based solution that was faster than both associative containers for his application.
This bug ticket seems to have been hijacked by an unrelated issue. Any progress on the original problem?
Zaxxon, the bug you submitted seems to be a duplicate of #17121.
As explained at the top of this PR, the current implementation of libc++'s unordered_map and unordered_multimap is wildly uncompetitive with GCC's, with and without hints. An unordered hash map is supposed to be a faster drop-in alternative to an ordered map. The libc++ implementation is not competitive when inserting a lot of data with few keys: insertion degenerates into a very long sequential list scan to append data in a bucket. One cannot possibly know the key distribution in advance, so the libc++ implementation of unordered_map and unordered_multimap is simply not usable in its present form on large data sets. Surely libc++ can be adapted to do something similar to GCC libstdc++'s superior algorithm? All that has to be done is to keep a pointer to the end of the values in the hash bucket and check that pointer first, instead of doing a pointless and inefficient sequential scan.
The problem you describe at the top of this PR is exactly the same as #17121. While you may not agree with the resolution, this problem has been discussed and decided on. Also, you ONLY describe a problem with std::unordered_multimap. I have yet to see how libc++'s std::unordered_map is uncompetitive with GCC's. I have spent time benchmarking both implementations and I do not see the problem. Could you explain further?
While that is somewhat true for unordered_multimap, #17121 describes how to use unordered_multimap to get the behavior and performance you desire. You once again mention std::unordered_map, but I don't know what you're referring to. The problem only exists in std::unordered_multimap.
I've looked into making changes to add a "bucket end" pointer, but it seems quite complicated. First, you have to understand that a change like this is made harder by the fact that we can't introduce ABI-breaking changes. That means I can't actually add any new data members to any type in std::unordered_multimap, which makes the change a lot harder than simply adding a new list of bucket pointers. Second, std::unordered_multimap and std::unordered_map are implemented using the same underlying container. While I don't agree with Howard Hinnant's decision to make std::unordered_multimap do insertions at the end, I don't think I can change the existing behavior. If any change is going to be made to fix the performance problems, it should be to do insertions at the front. I'm closing this bug as a duplicate of #17121. If I missed a performance problem in std::unordered_map, please reopen this bug. *** This bug has been marked as a duplicate of bug #17121 ***
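For illustration, a sketch of what insertion at the front would look like on a hypothetical bucket chain: it is O(1) and needs no new data members, so it would not run into the ABI constraint. The types here are illustrative, not libc++'s.

struct Node { int key, value; Node* next; };

// Link the new node at the head of its bucket's chain instead of
// scanning to the end of the chain.
void push_front(Node*& bucket_head, Node* n) {
    n->next = bucket_head;
    bucket_head = n;
}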
It's been 10 months since I opened this bug report, and I erred when I mentioned unordered_map; it was not intentional. libc++'s unordered_multimap is simply unusable for large data sets, for the reasons I've previously outlined. The performance hint recommended in the other PR is not useful for a data set whose key characteristics you do not know beforehand. It is disappointing that this PR has been closed as a duplicate of a PR that is itself marked RESOLVED WONTFIX because the fix is deemed too difficult in light of binary-compatibility constraints. I was kind of hoping LLVM would have supported class versioning of some sort.
Why would changing this behavior break anything? The ordering of elements with the same key is not guaranteed by the standard.
I don't understand why you can't use the method described in the other PR (without the max_load_factor call). How does that depend on the characteristics of the data set in a way that your current example does not? If I were to implement GCC's behavior in libc++, I would likely do it in a way very similar to the performance hint described in #17121.
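As I understand the hint described in #17121, the workaround looks roughly like this: pass the iterator returned by find() as an insertion hint, so the new element can be linked next to an existing element with the same key rather than after a scan to the end of that key's run in the bucket.

#include <unordered_map>
#include <utility>

int main() {
    std::unordered_multimap<int, int> m;
    for (int x = 0; x < 100000; ++x) {
        int k = x % 4;
        // The hint-taking overload of insert; find(k) points at an existing
        // element with this key (or end(), for the first insertion of a key).
        m.insert(m.find(k), std::make_pair(k, x));
    }
    return 0;
}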
The poor performance of the current implementation of libc++'s unordered_multimap::insert violates the principle of least astonishment. It is a quality-of-implementation issue. Good performance for unordered_multimap::insert should be achieved by the STL library implementation itself, without the need for the insert/find hack. Indeed, GCC's libstdc++ has solved this very problem with a more efficient algorithm. The user of the STL implementation should not have to know or care that the libc++ implementation of this container degenerates to searching very long linked lists in buckets when there are many elements with only a few distinct keys. I can't be the only user of this library who spent several hours isolating why my application was so slow on Macs. I think the example at the top of this ticket speaks for itself: 13.5s for Apple LLVM version 6.0 (clang-600.0.51, based on LLVM 3.5svn) versus 0.016s for GCC 4.8.2; libc++ is roughly three orders of magnitude slower than GCC to insert 100,000 elements. It is clear that we will never agree on this issue, and the ticket was marked WONTFIX. Nothing left to say on this topic.
Extended Description
The performance of unordered_multimap::insert is very slow in LLVM libc++
when used on large data sets that have degenerate non-uniform key distribution.
// g++ -std=c++11 -O3 umm.cpp -o umm && time ./umm
// 13.5s for Apple LLVM version 6.0 (clang-600.0.51) (based on LLVM 3.5svn).
// 0.016s for GCC version 4.8.2.
#include <unordered_map>
int main() {
    std::unordered_multimap<int,int> m;
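    // 100,000 insertions but only 4 distinct keys (0..3): a degenerate
    // distribution, so in libc++ each insert scans one very long bucket chain.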
    for (int x = 0; x < 100000; ++x)
        m.insert(std::pair<int,int>(x % 4, x));
    return 0;
}
I recognize that another ticket was filed on this issue and marked as WONTFIX.
But even using max_load_factor() does not help LLVM libc++ get anywhere near
the performance of GCC.
One cannot always know in advance what the key distribution of the data set will
be; it can be uniform or degenerate. Data sets can be multi-gigabyte and are
not conducive to multiple passes to determine key distribution. As a
workaround I used #ifdefs to fall back to an ordered std::multimap on Mac,
which is only twice as slow as GCC 4.8.2 libstdc++'s unordered_multimap for
this scenario; that is workable but not ideal. A sketch of that fallback follows.
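This sketch assumes the _LIBCPP_VERSION macro (defined by libc++'s own headers) is a reliable way to detect libc++; the alias name is illustrative.

#include <map>
#include <unordered_map>

#ifdef _LIBCPP_VERSION
// libc++: fall back to the ordered container to avoid the bucket-scan cost.
using IntMultiMap = std::multimap<int, int>;
#else
using IntMultiMap = std::unordered_multimap<int, int>;
#endif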