Skip to content

Conversation

arvidjonasson
Copy link

@arvidjonasson arvidjonasson commented Aug 25, 2024

Adds functionality requested in #102303.
Expands on functionality of https://libcxx.llvm.org/DesignDocs/UnspecifiedBehaviorRandomization.html.

  • Add randomization of element order during rehash in unordered containers (std::unordered_{set,map,multiset,multimap}) when the _LIBCPP_DEBUG_RANDOMIZE_UNSPECIFIED_STABILITY flag is set, similar to existing behavior in std::sort, std::nth_element, and std::partial_sort.
    • For std::unordered_{multiset,multimap}, equal ranges are shuffled and order within equal ranges are shuffled.
    • For std::unordered_{set,map}, order of elements are shuffled.
  • Add test libcxx/test/libcxx/containers/unord/hash_table_randomize_order.pass.cpp to assert that randomization works correctly.

Pseudo code for current approach:

Node[] nodes
Int[] indices

while table has elements {
    nodes += table.extract()
    indices += indices.size()
}

shuffle(indices)

for every index in indices {
    table.insert(nodes[index])
}

Can I get feedback on this approach?

@arvidjonasson arvidjonasson requested a review from a team as a code owner August 25, 2024 12:13
Copy link

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot llvmbot added the libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. label Aug 25, 2024
@llvmbot
Copy link
Member

llvmbot commented Aug 25, 2024

@llvm/pr-subscribers-libcxx

Author: Arvid Jonasson (arvidjonasson)

Changes

Adds functionality requested in #102303.
Expands on functionality of https://libcxx.llvm.org/DesignDocs/UnspecifiedBehaviorRandomization.html.

  • Add randomization of element order during rehash in unordered containers (std::unordered_{set,map,multiset,multimap}) when the _LIBCPP_DEBUG_RANDOMIZE_UNSPECIFIED_STABILITY flag is set, similar to existing behavior in std::sort, std::nth_element, and std::partial_sort.
    • For std::unordered_{multiset,multimap}, equal ranges are shuffled and order within equal ranges are shuffled.
    • For std::unordered_{set,map}, order of elements are shuffled.
  • Add test libcxx/test/libcxx/containers/unord/hash_table_randomize_order.pass.cpp to assert that randomization works correctly.

Full diff: https://github.com/llvm/llvm-project/pull/105982.diff

3 Files Affected:

  • (modified) libcxx/docs/DesignDocs/UnspecifiedBehaviorRandomization.rst (+2)
  • (modified) libcxx/include/__hash_table (+57)
  • (added) libcxx/test/libcxx/containers/unord/hash_table_randomize_order.pass.cpp (+79)
diff --git a/libcxx/docs/DesignDocs/UnspecifiedBehaviorRandomization.rst b/libcxx/docs/DesignDocs/UnspecifiedBehaviorRandomization.rst
index 70278798ecf630..3e52a51684507e 100644
--- a/libcxx/docs/DesignDocs/UnspecifiedBehaviorRandomization.rst
+++ b/libcxx/docs/DesignDocs/UnspecifiedBehaviorRandomization.rst
@@ -82,5 +82,7 @@ Currently supported randomization
    on the order of the remaining part
 * ``std::nth_element``, there is no guarantee on the order from both sides of the
    partition
+* ``std::unordered_{set,map}``, there is no guarantee on the order of the elements
+* ``std::unordered_{multiset,multimap}``, there is no guarantee on the order of the elements nor the order of equal elements
 
 Patches welcome.
diff --git a/libcxx/include/__hash_table b/libcxx/include/__hash_table
index d5fbc92a3dfc4e..d6931a81d10a27 100644
--- a/libcxx/include/__hash_table
+++ b/libcxx/include/__hash_table
@@ -45,6 +45,11 @@
 #include <limits>
 #include <new> // __launder
 
+#ifdef _LIBCPP_DEBUG_RANDOMIZE_UNSPECIFIED_STABILITY
+#  include <__debug_utils/randomize_range.h>
+#  include <__numeric/iota.h>
+#endif
+
 #if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
 #  pragma GCC system_header
 #endif
@@ -980,6 +985,9 @@ private:
   template <bool _UniqueKeys>
   _LIBCPP_HIDE_FROM_ABI void __do_rehash(size_type __n);
 
+  template <bool _UniqueKeys>
+  _LIBCPP_HIDE_FROM_ABI void __debug_randomize_order();
+
   template <class... _Args>
   _LIBCPP_HIDE_FROM_ABI __node_holder __construct_node(_Args&&... __args);
 
@@ -1702,6 +1710,7 @@ void __hash_table<_Tp, _Hash, _Equal, _Alloc>::__rehash(size_type __n) _LIBCPP_D
 template <class _Tp, class _Hash, class _Equal, class _Alloc>
 template <bool _UniqueKeys>
 void __hash_table<_Tp, _Hash, _Equal, _Alloc>::__do_rehash(size_type __nbc) {
+  __debug_randomize_order<_UniqueKeys>();
   __pointer_allocator& __npa = __bucket_list_.get_deleter().__alloc();
   __bucket_list_.reset(__nbc > 0 ? __pointer_alloc_traits::allocate(__npa, __nbc) : nullptr);
   __bucket_list_.get_deleter().size() = __nbc;
@@ -1741,6 +1750,54 @@ void __hash_table<_Tp, _Hash, _Equal, _Alloc>::__do_rehash(size_type __nbc) {
   }
 }
 
+template <class _Tp, class _Hash, class _Equal, class _Alloc>
+template <bool _UniqueKeys>
+void __hash_table<_Tp, _Hash, _Equal, _Alloc>::__debug_randomize_order() {
+#ifdef _LIBCPP_DEBUG_RANDOMIZE_UNSPECIFIED_STABILITY
+  size_type __total_nodes       = size();
+  size_type __initialized_nodes = 0;
+
+  // Storage to handle non-assignable, non-default constructible __node_holder.
+  union __nh_storage {
+    __nh_storage() {}
+    ~__nh_storage() {}
+    __node_holder __nh;
+  };
+
+  auto __nh_storage_deleter = [&__initialized_nodes](__nh_storage* __p) {
+    for (size_type __i = 0; __i < __initialized_nodes; ++__i)
+      __p[__i].__nh.~__node_holder();
+    delete[] __p;
+  };
+
+  // Allocate storage for nodes and indices.
+  unique_ptr<__nh_storage[], decltype(__nh_storage_deleter)> __nodes(
+      new __nh_storage[__total_nodes], __nh_storage_deleter);
+  unique_ptr<size_type[]> __randomized_indices(new size_type[__total_nodes]);
+
+  // Move nodes into temporary storage.
+  for (; __initialized_nodes < __total_nodes; ++__initialized_nodes)
+    new (std::addressof(__nodes[__initialized_nodes].__nh)) __node_holder(remove(begin()));
+
+  // Randomize the order of indices.
+  std::iota(__randomized_indices.get(), __randomized_indices.get() + __total_nodes, size_type{0});
+  __debug_randomize_range<_ClassicAlgPolicy>(__randomized_indices.get(), __randomized_indices.get() + __total_nodes);
+
+  // Reinsert nodes into the hash table in randomized order.
+  for (size_type __i = 0; __i < __total_nodes; ++__i) {
+    __node_holder& __nh = __nodes[__randomized_indices[__i]].__nh;
+    __node_pointer __np = __nh->__upcast();
+    if _LIBCPP_CONSTEXPR_SINCE_CXX17 (_UniqueKeys) {
+      __node_insert_unique_perform(__np);
+    } else {
+      __next_pointer __pn = __node_insert_multi_prepare(__np->__hash(), __np->__get_value());
+      __node_insert_multi_perform(__np, __pn);
+    }
+    __nh.release();
+  }
+#endif
+}
+
 template <class _Tp, class _Hash, class _Equal, class _Alloc>
 template <class _Key>
 typename __hash_table<_Tp, _Hash, _Equal, _Alloc>::iterator
diff --git a/libcxx/test/libcxx/containers/unord/hash_table_randomize_order.pass.cpp b/libcxx/test/libcxx/containers/unord/hash_table_randomize_order.pass.cpp
new file mode 100644
index 00000000000000..bec3c5d353f83f
--- /dev/null
+++ b/libcxx/test/libcxx/containers/unord/hash_table_randomize_order.pass.cpp
@@ -0,0 +1,79 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+// Test std::unordered_{set,map,multiset,multimap} randomization
+
+// UNSUPPORTED: c++03
+// ADDITIONAL_COMPILE_FLAGS: -D_LIBCPP_DEBUG_RANDOMIZE_UNSPECIFIED_STABILITY
+
+#include <unordered_set>
+#include <unordered_map>
+#include <cassert>
+#include <vector>
+#include <algorithm>
+
+const int kSize = 128;
+
+template <typename T, typename F>
+T get_random(F get_value) {
+  T v;
+  v.reserve(kSize);
+  for (int i = 0; i < kSize; ++i) {
+    v.insert(get_value());
+  }
+  v.rehash(v.bucket_count() + 1);
+  return v;
+}
+
+template <typename T, typename F>
+T get_deterministic(F get_value) {
+  T v;
+  v.reserve(kSize);
+  for (int i = 0; i < kSize; ++i) {
+    v.insert(get_value());
+  }
+  return v;
+}
+
+template <typename T>
+struct RemoveConst {
+  using type = T;
+};
+
+template <typename T, typename U>
+struct RemoveConst<std::pair<const T, U>> {
+  using type = std::pair<T, U>;
+};
+
+template <typename T, typename F>
+void test_randomization(F get_value) {
+  T t1 = get_deterministic<T>(get_value), t2 = get_random<T>(get_value);
+
+  // Convert pair<const K, V> to pair<K, V> so it can be sorted
+  using U = typename RemoveConst<typename T::value_type>::type;
+
+  std::vector<U> t1v(t1.begin(), t1.end()), t2v(t2.begin(), t2.end());
+
+  assert(t1v != t2v);
+
+  std::sort(t1v.begin(), t1v.end());
+  std::sort(t2v.begin(), t2v.end());
+
+  assert(t1v == t2v);
+}
+
+int main(int, char**) {
+  int i = 0, j = 0;
+  test_randomization<std::unordered_set<int>>([i]() mutable { return i++; });
+  test_randomization<std::unordered_map<int, int>>([i, j]() mutable { return std::make_pair(i++, j++); });
+  test_randomization<std::unordered_multiset<int>>([i]() mutable { return i++ % 32; });
+  test_randomization<std::unordered_multimap<int, int>>([i, j]() mutable {
+    return std::make_pair(i++ % 32, j++);
+  });
+  return 0;
+}

Copy link

github-actions bot commented Aug 25, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

…set,map,multiset,multimap} under _LIBCPP_DEBUG_RANDOMIZE_UNSPECIFIED_STABILITY

- Adds randomization of element order during rehash in unordered containers when the _LIBCPP_DEBUG_RANDOMIZE_UNSPECIFIED_STABILITY flag is set, similar to existing behavior in sort, nth_element, and partial_sort.
@arvidjonasson arvidjonasson force-pushed the hash-table-randomize-unspecified-stability branch from e60b49a to 83eb7ea Compare August 25, 2024 12:36
…nd_insertion_point()` from `__node_insert_multi_prepare()`.

Use user defined allocator for temporary node and indices buffers in `__debug_randomize_order()`.
};

__nh_vec __nhv(size());
__index_vec __iv(size());
Copy link
Author

@arvidjonasson arvidjonasson Aug 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd ideally want to use unique_ptr<__node_holder[], decltype(deleter)> and unique_ptr<size_type[], decltype(deleter)> (instead of __nh_vec and __index_vec) while still being able to use the user defined allocator. But min_pointer<T> is giving me problems (by not being convertible to T*). Is someone familiar with the matter?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you paste what the exact error is? It's probably something benign like a missing std::__to_address call.

}

private:
_LIBCPP_HIDE_FROM_ABI __next_pointer __node_insert_multi_find_insertion_point(size_t __cp_hash, value_type& __cp_val);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__node_insert_multi_find_insertion_point is used in __node_insert_multi_prepare and __debug_randomize_order. Is refactoring warranted or should I instead copy the "find insertion point" logic/code from __node_insert_multi_prepare?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that you extracted this into a separate function.

Copy link
Member

@ldionne ldionne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I get feedback on this approach?

There's something I don't understand about this approach. This is a naive question so bear with me. If you shuffle the elements in the hash table, how do you then manage to find them again based on their hash? Doesn't that mean the elements will be at a location that isn't correlated to their hash, meaning we'll basically devolve into a linear search whenever we look for an element?

Comment on lines +22 to +23
template <typename T, typename F>
T get_random(F get_value) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
template <typename T, typename F>
T get_random(F get_value) {
template <typename Container, typename F>
Container get_random(F get_value) {

Comment on lines +24 to +25
T v;
v.reserve(kSize);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
T v;
v.reserve(kSize);
Container c;
c.reserve(kSize);

Or something like that -- minor change but it makes things easier to read.

}

private:
_LIBCPP_HIDE_FROM_ABI __next_pointer __node_insert_multi_find_insertion_point(size_t __cp_hash, value_type& __cp_val);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that you extracted this into a separate function.

}
__nh.release();
}
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#endif
#endif // defined(_LIBCPP_DEBUG_RANDOMIZE_UNSPECIFIED_STABILITY)

};

__nh_vec __nhv(size());
__index_vec __iv(size());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you paste what the exact error is? It's probably something benign like a missing std::__to_address call.

@olologin
Copy link

olologin commented Sep 2, 2024

Sorry, I just want to clarify:

std::unordered_{multiset,multimap}, equal ranges are shuffled and order within equal ranges are shuffled.

Lets say my multi* container contains this list: (1, 'a'), (1, 'a'), (2, 'b'), (3, 'c'), does this mean that I can iterate over it in the following order: (1, 'a'), (2, 'b'), (3, 'c'), (1, 'a') (if we are lucky enough)?

I don't completely understand your implementation, but it looks clever 👍 .
My original idea was much simpler, but with drawbacks: We could take some seed (maybe user-defined from environment variable), and just combine it together with result of user-supplied hash function. But the drawback is that you have to make that seed constant for whole process lifetime, otherwise you will not be able to pass hash table between your dlls (hashes would differ). And now when I write this I understand that I would still not be able to pass hashtable to libraries built without _LIBCPP_DEBUG_RANDOMIZE_UNSPECIFIED_STABILITY

@ldionne
Copy link
Member

ldionne commented Oct 1, 2024

Gentle ping @arvidjonasson!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants