linear slowdown with keys larger than 32 bytes #158
Some timing information:

Without packing:

With packing:

System info: 2x AMD EPYC 7763 (Milan), 128 total cores, 256 total threads, 512 GiB total DDR4 memory (https://docs.nersc.gov/systems/perlmutter/system_details/#cpu-only-compute-nodes)
Hum, I suspect it is the hash function with your larger keys causing lots of collisions. It is critical that different keys mostly hash to different values.
That seems to be most of it, yeah -- with this hash function

```cpp
namespace std
{
    template<> struct hash<groupid_t>
    {
        std::size_t operator()(groupid_t const &g) const
        {
            return phmap::HashState().combine(0,
                (uint64_t(0xFFFu & g[0]))
                    | (uint64_t(0xFFFu & g[1]) << 12)
                    | (uint64_t(0xFFFu & g[2]) << 24)
                    | (uint64_t(0xFFFu & g[3]) << 36)
                    | (uint64_t(0xFFFu & g[4]) << 48)
                    | (uint64_t(0xFFFu & g[5]) << 60),
                (uint64_t(0xFFFu & g[5]) >> 4)
                    | (uint64_t(0xFFFu & g[6]) << 8)
                    | (uint64_t(0xFFFu & g[7]) << 20)
                    | (uint64_t(0xFFFu & g[8]) << 32)
                    | (uint64_t(0xFFFu & g[9]) << 44)
                    | (uint64_t(0xFFFu & g[10]) << 56),
                (uint64_t(0xFFFu & g[10]) >> 8)
                    | (uint64_t(0xFFFu & g[11]) << 4)
                    | (uint64_t(0xFFFu & g[12]) << 16)
                    | (uint64_t(0xFFFu & g[13]) << 28)
                    | (uint64_t(0xFFFu & g[14]) << 40)
                    | (uint64_t(0xFFFu & g[15]) << 52),
                (uint64_t(0xFFFu & g[16]))
                    | (uint64_t(0xFFFu & g[17]) << 12)
                    | (uint64_t(0xFFFu & g[18]) << 24)
                    | (uint64_t(0xFFFu & g[19]) << 36));
        }
    };
}
```

it works just as well as using the bit-packed key. I guess you can get away with a bad hash function for longer as long as you can vectorize the key comparison. If you have a bad hash function and the keys are larger than 32 bytes, the hash collisions become very visible because key comparison is expensive. I also expect that, with my old hash function, a sufficiently large map would start to show O(n) slowdowns -- it's just that, with a vectorized comparison, the coefficient on n is quite small.
Hum, can you try this as a hash function? Maybe there is a problem when using a tuple of small integers?
Yes, that hash function works just as well as my bit-shift monstrosity. Something does seem off with a tuple of small integers.
I believe I fixed the issue. It should now work correctly with your original hash function.
I'm trying to build a large `parallel_flat_hash_map` and I'm noticing a distinct change in behavior when the key becomes larger than 32 bytes. My naive key would be `std::array<uint16_t,20>`, where I've defined types and a hash function. My map typedef uses that key, and I'm using `lazy_emplace_l` (where I know as a precondition that my initial keys are unique).

If I insert ~50M elements, I notice that the time it takes to insert an element rises by about a factor of 100 between when the map is empty and when the map is "full", and rises approximately linearly with the number of elements already in the map. (Note: I know the initial number of elements, so I am doing `groupid_map.reserve()`, and I can see that the map never gets much above ~50% `load_factor`.) If I run `perf`, I see that over 50% of the CPU time is spent in `__memcmp_avx2_movbe`, which I believe is glibc's implementation of `std::memcmp` and is probably used by `std::equal_to` for standard-layout types like `std::array`.

However, my keys really only hold 12-bit integers, so if I redefine `groupid_t` and do some fancy bit-packing, I get no slowdown as the map increases in size.

I've also tested using `std::array<uint16_t,16>` as my key type, and I see "good" behavior, i.e. no linear slowdown. However, as soon as my key goes over 32 bytes/256 bits, insertion/emplacement slows down dramatically.

I am guessing this has to do with the fact that, if `sizeof(key) <= 32`, a bunch of comparisons can be done using AVX2 instructions, but as soon as the key gets even one byte larger, the comparison changes to something non-vectorized. For now, I can just bit-pack my keys, but it would be nice to know if this is a fundamental limitation or just a weird bug/edge case.