
Poor performance of degenerated unordered_multimap #17121

Closed
llvmbot opened this issue Jul 30, 2013 · 6 comments
Labels
bugzilla (Issues migrated from bugzilla) · libc++ (C++ Standard Library. Not GNU libstdc++. Not libc++abi.) · wontfix (Issue is real, but we can't or won't fix it. Not invalid.)

Comments

llvmbot commented Jul 30, 2013

Bugzilla Link 16747
Resolution WONTFIX
Resolved on Aug 25, 2015 22:01
Version unspecified
OS All
Reporter LLVM Bugzilla Contributor

Extended Description

The following code uses an unordered_multimap to associate very few keys with long lists of values. The example performs very poorly compared to libstdc++ and the standard library implementation of Visual Studio 2012.

#include <chrono>
#include <iostream>
#include <string>
#include <unordered_map>

int main(int argc, char* argv[]) {
    const int N = (argc > 1) ? std::stoi(argv[1]) : 123456;
    const auto begin = std::chrono::high_resolution_clock::now();
    std::unordered_multimap<int, int> map;
    for (int i = 0; i < N; ++i) {
        map.insert({i % 4, i});
    }
    const auto end = std::chrono::high_resolution_clock::now();
    const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin);
    std::cout << "Duration for " << map.size() << " integer inserts: " << ms.count() << "ms" << std::endl;
}

Results on Linux (SLES 11.2):

clang-3.2 with libc++ from 2013-01-31:
Duration for 123456 integer inserts: 21605ms

clang-3.3 with libc++ from 2013-01-31:
Duration for 123456 integer inserts: 21836ms

gcc-4.6 with GLIBCXX_3.4.16:
Duration for 123456 integer inserts: 7ms

Results on Windows 2008 Server:

Visual Studio 2012 SP3:
Duration for 123456 integer inserts, count=123456: 15ms

Results on OS X 10.8.4:

Apple LLVM version 4.2 (clang-425.0.28) (based on LLVM 3.2svn):
Duration for 123456 integer inserts: 18702ms

The Linux and Windows results are on the same hardware, the OS X test was run on a Mac Pro 2.93GHz and has similar performance to the Linux/Windows hardware.

I am building an inverse map for fast look-up. In most cases this works fine. In rare cases the map degenerates to a few keys being mapped to long lists. In these rare cases the performance deteriorates badly compared to other implementations.

llvmbot (Author) commented Jul 30, 2013

This behavior is due to an extension for unordered containers that was meant to make the transition between multimap and unordered_multimap more seamless (in both directions).

For multimap, an insert without hint is required to insert at the end of any existing equal range within the multimap. This is not required of unordered_multimap, but libc++ implements this rule anyway to make these two data structures more interchangeable. And it is this behavior that is causing the dismal performance in this example.

In this example, the unordered_multimap degenerates into 4 very long linked lists:

size_t nb = map.bucket_count();
std::cout << nb << '\n';
for (size_t b = 0; b < nb; ++b)
{
    size_t bs = map.bucket_size(b);
    if (bs != 0)
        std::cout << b << " : " << bs << '\n';
}

205759
0 : 30864
1 : 30864
2 : 30864
3 : 30864

And so each insertion is traversing one of these four lists to insert at the end. A hash-based container is ill-suited for this type of data, and therefore the libc++ unordered containers have not been optimized in any way for this case. A much better data structure would be an array of forward_list, or an array of vectors.

However I understand that this situation can arise accidentally. There is a way to tell libc++ to insert at the front of these equal ranges instead of at the back:

for (int i = 0; i < N; ++i) {
    map.insert(map.find(i % 4), {i % 4, i});
}

I.e. first search for the key. This will find the beginning of the equal_range, if it exists. And then the hint is used to insert just before that. When this small change is made, the time for me drops from 18543ms to 15ms.

Such a strategy could even be switched to at run time by monitoring the max collision (max bucket_size(b)), and switching to it when it exceeds some predetermined threshold.

It is an open question whether this extension (for unordered_multimap without a hint to insert in the same way that multimap does) is of value to libc++ clients or not. I know from decades of experience that clients definitely do value the prescribed order in multimap. Indeed, this is a new feature in C++11, standardized in large part by customer demand. And so my guess is that people will value the same behavior in unordered_multimap, even though the order between different equal ranges remains unspecified.

I am curious. Now that you know the cause, motivation, and workaround for this behavior, what are your thoughts?

llvmbot (Author) commented Jul 30, 2013

PS: The libc++ implementation is all about giving you knobs that you can tweak for higher performance and/or better functionality. By tweaking a few more knobs I can double the speed again:

std::unordered_multimap<int, int> map(4);
map.max_load_factor(INFINITY);
for (int i = 0; i < N; ++i) {
    map.insert(map.find(i % 4), {i % 4, i});
}

7ms

llvmbot (Author) commented Jul 31, 2013

Thanks for your quick answer. The workaround works perfectly fine for me, though I was not sure whether I should set the status to resolved.

I didn't think of it myself because I didn't think it would change anything.

I can understand your motivation for keeping the order, having been there myself. If I understood the MS implementation correctly, they keep an end pointer for the list and just append, so they preserve the order too. (I didn't look at the libstdc++ implementation.)

The change has no measurable impact on the other libs, so I can use the same code for all platforms.

The second speedup tip with max_load_factor does not work for me: in my application I have no idea of the final size. And when I first tried it, I forgot the number of buckets in the constructor and the performance was as bad as before.

In my application the degenerate case is very rare, so it went unnoticed for some time. In the rare cases where it occurs, the response time went from under a second to tens of minutes. And I missed it in my tests.

We are shipping on SLES and the default gcc is very old (4.3). So I switched to clang/libc++ to use C++11 features. After we found this problem we started shipping with gcc 4.6/libstdc++, which causes all sorts of other problems. So I am very happy to switch back to releasing with clang/libc++.

By the way, I like the policy of allowing a power-of-2 number of buckets. MS always uses a power-of-2 number of buckets, and I have a case where this really hurts me. Making it tunable is a very good idea.

llvmbot (Author) commented Jul 31, 2013

Thanks for your feedback Gerald.

llvmbot (Author) commented Aug 26, 2015

*** Bug llvm/llvm-bugzilla-archive#21275 has been marked as a duplicate of this bug. ***

hfinkel (Collaborator) commented Nov 26, 2021

mentioned in issue llvm/llvm-bugzilla-archive#21275

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 9, 2021
@Quuxplusone added the wontfix label Jan 20, 2022
This issue was closed.