[ADT] Add TrieRawHashMap #69528

cachemeifyoucan · 2023-10-18T21:36:58Z

Implement TrieRawHashMap which stores objects into a Trie based on the hash of the object.

User needs to supply the hashing function and guarantees the uniqueness of the hash for the objects to be inserted. Hash collision is not supported.

This is part of LLVMCAS implementation which you can see the overall change here: #68448

llvmbot · 2023-10-18T21:38:09Z

@llvm/pr-subscribers-llvm-support

@llvm/pr-subscribers-llvm-adt

Author: Steven Wu (cachemeifyoucan)

Changes

Implement TrieRawHashMap which stores objects into a Trie based on the hash of the object.

User needs to supply the hashing function and guarantees the uniqueness of the hash for the objects to be inserted. Hash collision is not supported.

This is part of LLVMCAS implementation which you can see the overall change here: #68448

Patch is 46.10 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/69528.diff

6 Files Affected:

(added) llvm/include/llvm/ADT/TrieRawHashMap.h (+398)
(modified) llvm/lib/Support/CMakeLists.txt (+1)
(added) llvm/lib/Support/TrieHashIndexGenerator.h (+89)
(added) llvm/lib/Support/TrieRawHashMap.cpp (+483)
(modified) llvm/unittests/ADT/CMakeLists.txt (+1)
(added) llvm/unittests/ADT/TrieRawHashMapTest.cpp (+342)

diff --git a/llvm/include/llvm/ADT/TrieRawHashMap.h b/llvm/include/llvm/ADT/TrieRawHashMap.h
new file mode 100644
index 000000000000000..baa08e214ce6fd7
--- /dev/null
+++ b/llvm/include/llvm/ADT/TrieRawHashMap.h
@@ -0,0 +1,398 @@
+//===- TrieRawHashMap.h -----------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_ADT_TRIERAWHASHMAP_H
+#define LLVM_ADT_TRIERAWHASHMAP_H
+
+#include "llvm/ADT/ArrayRef.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/Support/Casting.h"
+#include <atomic>
+#include <optional>
+
+namespace llvm {
+
+class raw_ostream;
+
+/// TrieRawHashMap - is a lock-free thread-safe trie that is can be used to
+/// store/index data based on a hash value. It can be customized to work with
+/// any hash algorithm or store any data.
+///
+/// Data structure:
+/// Data node stored in the Trie contains both hash and data:
+/// struct {
+///    HashT Hash;
+///    DataT Data;
+/// };
+///
+/// Data is stored/indexed via a prefix tree, where each node in the tree can be
+/// either the root, a sub-trie or a data node. Assuming a 4-bit hash and two
+/// data objects {0001, A} and {0100, B}, it can be stored in a trie
+/// (assuming Root has 2 bits, SubTrie has 1 bit):
+///  +--------+
+///  |Root[00]| -> {0001, A}
+///  |    [01]| -> {0100, B}
+///  |    [10]| (empty)
+///  |    [11]| (empty)
+///  +--------+
+///
+/// Inserting a new object {0010, C} will result in:
+///  +--------+    +----------+
+///  |Root[00]| -> |SubTrie[0]| -> {0001, A}
+///  |        |    |       [1]| -> {0010, C}
+///  |        |    +----------+
+///  |    [01]| -> {0100, B}
+///  |    [10]| (empty)
+///  |    [11]| (empty)
+///  +--------+
+/// Note object A is sunk down to a sub-trie during the insertion. All the
+/// nodes are inserted through compare-exchange to ensure thread-safe and
+/// lock-free.
+///
+/// To find an object in the trie, walk the tree with prefix of the hash until
+/// the data node is found. Then the hash is compared with the hash stored in
+/// the data node to see if the is the same object.
+///
+/// Hash collision is not allowed so it is recommended to use trie with a
+/// "strong" hashing algorithm. A well-distributed hash can also result in
+/// better performance and memory usage.
+///
+/// It currently does not support iteration and deletion.
+
+/// Base class for a lock-free thread-safe hash-mapped trie.
+class ThreadSafeTrieRawHashMapBase {
+public:
+  static constexpr size_t TrieContentBaseSize = 4;
+  static constexpr size_t DefaultNumRootBits = 6;
+  static constexpr size_t DefaultNumSubtrieBits = 4;
+
+private:
+  template <class T> struct AllocValueType {
+    char Base[TrieContentBaseSize];
+    std::aligned_union_t<sizeof(T), T> Content;
+  };
+
+protected:
+  template <class T>
+  static constexpr size_t DefaultContentAllocSize = sizeof(AllocValueType<T>);
+
+  template <class T>
+  static constexpr size_t DefaultContentAllocAlign = alignof(AllocValueType<T>);
+
+  template <class T>
+  static constexpr size_t DefaultContentOffset =
+      offsetof(AllocValueType<T>, Content);
+
+public:
+  void operator delete(void *Ptr) { ::free(Ptr); }
+
+  LLVM_DUMP_METHOD void dump() const;
+  void print(raw_ostream &OS) const;
+
+protected:
+  /// Result of a lookup. Suitable for an insertion hint. Maybe could be
+  /// expanded into an iterator of sorts, but likely not useful (visiting
+  /// everything in the trie should probably be done some way other than
+  /// through an iterator pattern).
+  class PointerBase {
+  protected:
+    void *get() const { return I == -2u ? P : nullptr; }
+
+  public:
+    PointerBase() noexcept = default;
+    PointerBase(PointerBase &&) = default;
+    PointerBase(const PointerBase &) = default;
+    PointerBase &operator=(PointerBase &&) = default;
+    PointerBase &operator=(const PointerBase &) = default;
+
+  private:
+    friend class ThreadSafeTrieRawHashMapBase;
+    explicit PointerBase(void *Content) : P(Content), I(-2u) {}
+    PointerBase(void *P, unsigned I, unsigned B) : P(P), I(I), B(B) {}
+
+    bool isHint() const { return I != -1u && I != -2u; }
+
+    void *P = nullptr;
+    unsigned I = -1u;
+    unsigned B = 0;
+  };
+
+  /// Find the stored content with hash.
+  PointerBase find(ArrayRef<uint8_t> Hash) const;
+
+  /// Insert and return the stored content.
+  PointerBase
+  insert(PointerBase Hint, ArrayRef<uint8_t> Hash,
+         function_ref<const uint8_t *(void *Mem, ArrayRef<uint8_t> Hash)>
+             Constructor);
+
+  ThreadSafeTrieRawHashMapBase() = delete;
+
+  ThreadSafeTrieRawHashMapBase(
+      size_t ContentAllocSize, size_t ContentAllocAlign, size_t ContentOffset,
+      std::optional<size_t> NumRootBits = std::nullopt,
+      std::optional<size_t> NumSubtrieBits = std::nullopt);
+
+  /// Destructor, which asserts if there's anything to do. Subclasses should
+  /// call \a destroyImpl().
+  ///
+  /// \pre \a destroyImpl() was already called.
+  ~ThreadSafeTrieRawHashMapBase();
+  void destroyImpl(function_ref<void(void *ValueMem)> Destructor);
+
+  ThreadSafeTrieRawHashMapBase(ThreadSafeTrieRawHashMapBase &&RHS);
+
+  // Move assignment can be implemented in a thread-safe way if NumRootBits and
+  // NumSubtrieBits are stored inside the Root.
+  ThreadSafeTrieRawHashMapBase &
+  operator=(ThreadSafeTrieRawHashMapBase &&RHS) = delete;
+
+  // No copy.
+  ThreadSafeTrieRawHashMapBase(const ThreadSafeTrieRawHashMapBase &) = delete;
+  ThreadSafeTrieRawHashMapBase &
+  operator=(const ThreadSafeTrieRawHashMapBase &) = delete;
+
+  // Debug functions. Implementation details and not guaranteed to be
+  // thread-safe.
+  PointerBase getRoot() const;
+  unsigned getStartBit(PointerBase P) const;
+  unsigned getNumBits(PointerBase P) const;
+  unsigned getNumSlotUsed(PointerBase P) const;
+  std::string getTriePrefixAsString(PointerBase P) const;
+  unsigned getNumTries() const;
+  // Visit next trie in the allocation chain.
+  PointerBase getNextTrie(PointerBase P) const;
+
+private:
+  friend class TrieRawHashMapTestHelper;
+  const unsigned short ContentAllocSize;
+  const unsigned short ContentAllocAlign;
+  const unsigned short ContentOffset;
+  unsigned short NumRootBits;
+  unsigned short NumSubtrieBits;
+  struct ImplType;
+  // ImplPtr is owned by ThreadSafeTrieRawHashMapBase and needs to be freed in
+  // destoryImpl.
+  std::atomic<ImplType *> ImplPtr;
+  ImplType &getOrCreateImpl();
+  ImplType *getImpl() const;
+};
+
+/// Lock-free thread-safe hash-mapped trie.
+template <class T, size_t NumHashBytes>
+class ThreadSafeTrieRawHashMap : public ThreadSafeTrieRawHashMapBase {
+public:
+  using HashT = std::array<uint8_t, NumHashBytes>;
+
+  class LazyValueConstructor;
+  struct value_type {
+    const HashT Hash;
+    T Data;
+
+    value_type(value_type &&) = default;
+    value_type(const value_type &) = default;
+
+    value_type(ArrayRef<uint8_t> Hash, const T &Data)
+        : Hash(makeHash(Hash)), Data(Data) {}
+    value_type(ArrayRef<uint8_t> Hash, T &&Data)
+        : Hash(makeHash(Hash)), Data(std::move(Data)) {}
+
+  private:
+    friend class LazyValueConstructor;
+
+    struct EmplaceTag {};
+    template <class... ArgsT>
+    value_type(ArrayRef<uint8_t> Hash, EmplaceTag, ArgsT &&...Args)
+        : Hash(makeHash(Hash)), Data(std::forward<ArgsT>(Args)...) {}
+
+    static HashT makeHash(ArrayRef<uint8_t> HashRef) {
+      HashT Hash;
+      std::copy(HashRef.begin(), HashRef.end(), Hash.data());
+      return Hash;
+    }
+  };
+
+  using ThreadSafeTrieRawHashMapBase::operator delete;
+  using HashType = HashT;
+
+  using ThreadSafeTrieRawHashMapBase::dump;
+  using ThreadSafeTrieRawHashMapBase::print;
+
+private:
+  template <class ValueT> class PointerImpl : PointerBase {
+    friend class ThreadSafeTrieRawHashMap;
+
+    ValueT *get() const {
+      if (void *B = PointerBase::get())
+        return reinterpret_cast<ValueT *>(B);
+      return nullptr;
+    }
+
+  public:
+    ValueT &operator*() const {
+      assert(get());
+      return *get();
+    }
+    ValueT *operator->() const {
+      assert(get());
+      return get();
+    }
+    explicit operator bool() const { return get(); }
+
+    PointerImpl() = default;
+    PointerImpl(PointerImpl &&) = default;
+    PointerImpl(const PointerImpl &) = default;
+    PointerImpl &operator=(PointerImpl &&) = default;
+    PointerImpl &operator=(const PointerImpl &) = default;
+
+  protected:
+    PointerImpl(PointerBase Result) : PointerBase(Result) {}
+  };
+
+public:
+  class pointer;
+  class const_pointer;
+  class pointer : public PointerImpl<value_type> {
+    friend class ThreadSafeTrieRawHashMap;
+    friend class const_pointer;
+
+  public:
+    pointer() = default;
+    pointer(pointer &&) = default;
+    pointer(const pointer &) = default;
+    pointer &operator=(pointer &&) = default;
+    pointer &operator=(const pointer &) = default;
+
+  private:
+    pointer(PointerBase Result) : pointer::PointerImpl(Result) {}
+  };
+
+  class const_pointer : public PointerImpl<const value_type> {
+    friend class ThreadSafeTrieRawHashMap;
+
+  public:
+    const_pointer() = default;
+    const_pointer(const_pointer &&) = default;
+    const_pointer(const const_pointer &) = default;
+    const_pointer &operator=(const_pointer &&) = default;
+    const_pointer &operator=(const const_pointer &) = default;
+
+    const_pointer(const pointer &P) : const_pointer::PointerImpl(P) {}
+
+  private:
+    const_pointer(PointerBase Result) : const_pointer::PointerImpl(Result) {}
+  };
+
+  class LazyValueConstructor {
+  public:
+    value_type &operator()(T &&RHS) {
+      assert(Mem && "Constructor already called, or moved away");
+      return assign(::new (Mem) value_type(Hash, std::move(RHS)));
+    }
+    value_type &operator()(const T &RHS) {
+      assert(Mem && "Constructor already called, or moved away");
+      return assign(::new (Mem) value_type(Hash, RHS));
+    }
+    template <class... ArgsT> value_type &emplace(ArgsT &&...Args) {
+      assert(Mem && "Constructor already called, or moved away");
+      return assign(::new (Mem)
+                        value_type(Hash, typename value_type::EmplaceTag{},
+                                   std::forward<ArgsT>(Args)...));
+    }
+
+    LazyValueConstructor(LazyValueConstructor &&RHS)
+        : Mem(RHS.Mem), Result(RHS.Result), Hash(RHS.Hash) {
+      RHS.Mem = nullptr; // Moved away, cannot call.
+    }
+    ~LazyValueConstructor() { assert(!Mem && "Constructor never called!"); }
+
+  private:
+    value_type &assign(value_type *V) {
+      Mem = nullptr;
+      Result = V;
+      return *V;
+    }
+    friend class ThreadSafeTrieRawHashMap;
+    LazyValueConstructor() = delete;
+    LazyValueConstructor(void *Mem, value_type *&Result, ArrayRef<uint8_t> Hash)
+        : Mem(Mem), Result(Result), Hash(Hash) {
+      assert(Hash.size() == sizeof(HashT) && "Invalid hash");
+      assert(Mem && "Invalid memory for construction");
+    }
+    void *Mem;
+    value_type *&Result;
+    ArrayRef<uint8_t> Hash;
+  };
+
+  /// Insert with a hint. Default-constructed hint will work, but it's
+  /// recommended to start with a lookup to avoid overhead in object creation
+  /// if it already exists.
+  pointer insertLazy(const_pointer Hint, ArrayRef<uint8_t> Hash,
+                     function_ref<void(LazyValueConstructor)> OnConstruct) {
+    return pointer(ThreadSafeTrieRawHashMapBase::insert(
+        Hint, Hash, [&](void *Mem, ArrayRef<uint8_t> Hash) {
+          value_type *Result = nullptr;
+          OnConstruct(LazyValueConstructor(Mem, Result, Hash));
+          return Result->Hash.data();
+        }));
+  }
+
+  pointer insertLazy(ArrayRef<uint8_t> Hash,
+                     function_ref<void(LazyValueConstructor)> OnConstruct) {
+    return insertLazy(const_pointer(), Hash, OnConstruct);
+  }
+
+  pointer insert(const_pointer Hint, value_type &&HashedData) {
+    return insertLazy(Hint, HashedData.Hash, [&](LazyValueConstructor C) {
+      C(std::move(HashedData.Data));
+    });
+  }
+
+  pointer insert(const_pointer Hint, const value_type &HashedData) {
+    return insertLazy(Hint, HashedData.Hash,
+                      [&](LazyValueConstructor C) { C(HashedData.Data); });
+  }
+
+  pointer find(ArrayRef<uint8_t> Hash) {
+    assert(Hash.size() == std::tuple_size<HashT>::value);
+    return ThreadSafeTrieRawHashMapBase::find(Hash);
+  }
+
+  const_pointer find(ArrayRef<uint8_t> Hash) const {
+    assert(Hash.size() == std::tuple_size<HashT>::value);
+    return ThreadSafeTrieRawHashMapBase::find(Hash);
+  }
+
+  ThreadSafeTrieRawHashMap(std::optional<size_t> NumRootBits = std::nullopt,
+                           std::optional<size_t> NumSubtrieBits = std::nullopt)
+      : ThreadSafeTrieRawHashMapBase(DefaultContentAllocSize<value_type>,
+                                     DefaultContentAllocAlign<value_type>,
+                                     DefaultContentOffset<value_type>,
+                                     NumRootBits, NumSubtrieBits) {}
+
+  ~ThreadSafeTrieRawHashMap() {
+    if constexpr (std::is_trivially_destructible<value_type>::value)
+      this->destroyImpl(nullptr);
+    else
+      this->destroyImpl(
+          [](void *P) { static_cast<value_type *>(P)->~value_type(); });
+  }
+
+  // Move constructor okay.
+  ThreadSafeTrieRawHashMap(ThreadSafeTrieRawHashMap &&) = default;
+
+  // No move assignment or any copy.
+  ThreadSafeTrieRawHashMap &operator=(ThreadSafeTrieRawHashMap &&) = delete;
+  ThreadSafeTrieRawHashMap(const ThreadSafeTrieRawHashMap &) = delete;
+  ThreadSafeTrieRawHashMap &
+  operator=(const ThreadSafeTrieRawHashMap &) = delete;
+};
+
+} // namespace llvm
+
+#endif // LLVM_ADT_TRIERAWHASHMAP_H
diff --git a/llvm/lib/Support/CMakeLists.txt b/llvm/lib/Support/CMakeLists.txt
index b96d62c7a6224d6..677f52677a27f8a 100644
--- a/llvm/lib/Support/CMakeLists.txt
+++ b/llvm/lib/Support/CMakeLists.txt
@@ -239,6 +239,7 @@ add_llvm_component_library(LLVMSupport
   TimeProfiler.cpp
   Timer.cpp
   ToolOutputFile.cpp
+  TrieRawHashMap.cpp
   Twine.cpp
   TypeSize.cpp
   Unicode.cpp
diff --git a/llvm/lib/Support/TrieHashIndexGenerator.h b/llvm/lib/Support/TrieHashIndexGenerator.h
new file mode 100644
index 000000000000000..c9e9b70e10d3c77
--- /dev/null
+++ b/llvm/lib/Support/TrieHashIndexGenerator.h
@@ -0,0 +1,89 @@
+//===- TrieHashIndexGenerator.h ---------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIB_SUPPORT_TRIEHASHINDEXGENERATOR_H
+#define LLVM_LIB_SUPPORT_TRIEHASHINDEXGENERATOR_H
+
+#include "llvm/ADT/ArrayRef.h"
+#include <optional>
+
+namespace llvm {
+
+struct IndexGenerator {
+  size_t NumRootBits;
+  size_t NumSubtrieBits;
+  ArrayRef<uint8_t> Bytes;
+  std::optional<size_t> StartBit = std::nullopt;
+
+  size_t getNumBits() const {
+    assert(StartBit);
+    size_t TotalNumBits = Bytes.size() * 8;
+    assert(*StartBit <= TotalNumBits);
+    return std::min(*StartBit ? NumSubtrieBits : NumRootBits,
+                    TotalNumBits - *StartBit);
+  }
+  size_t next() {
+    size_t Index;
+    if (!StartBit) {
+      StartBit = 0;
+      Index = getIndex(Bytes, *StartBit, NumRootBits);
+    } else {
+      *StartBit += *StartBit ? NumSubtrieBits : NumRootBits;
+      assert((*StartBit - NumRootBits) % NumSubtrieBits == 0);
+      Index = getIndex(Bytes, *StartBit, NumSubtrieBits);
+    }
+    return Index;
+  }
+
+  size_t hint(unsigned Index, unsigned Bit) {
+    assert(Index >= 0);
+    assert(Bit < Bytes.size() * 8);
+    assert(Bit == 0 || (Bit - NumRootBits) % NumSubtrieBits == 0);
+    StartBit = Bit;
+    return Index;
+  }
+
+  size_t getCollidingBits(ArrayRef<uint8_t> CollidingBits) const {
+    assert(StartBit);
+    return getIndex(CollidingBits, *StartBit, NumSubtrieBits);
+  }
+
+  static size_t getIndex(ArrayRef<uint8_t> Bytes, size_t StartBit,
+                         size_t NumBits) {
+    assert(StartBit < Bytes.size() * 8);
+
+    Bytes = Bytes.drop_front(StartBit / 8u);
+    StartBit %= 8u;
+    size_t Index = 0;
+    for (uint8_t Byte : Bytes) {
+      size_t ByteStart = 0, ByteEnd = 8;
+      if (StartBit) {
+        ByteStart = StartBit;
+        Byte &= (1u << (8 - StartBit)) - 1u;
+        StartBit = 0;
+      }
+      size_t CurrentNumBits = ByteEnd - ByteStart;
+      if (CurrentNumBits > NumBits) {
+        Byte >>= CurrentNumBits - NumBits;
+        CurrentNumBits = NumBits;
+      }
+      Index <<= CurrentNumBits;
+      Index |= Byte & ((1u << CurrentNumBits) - 1u);
+
+      assert(NumBits >= CurrentNumBits);
+      NumBits -= CurrentNumBits;
+      if (!NumBits)
+        break;
+    }
+    return Index;
+  }
+};
+
+} // namespace llvm
+
+#endif // LLVM_LIB_SUPPORT_TRIEHASHINDEXGENERATOR_H
diff --git a/llvm/lib/Support/TrieRawHashMap.cpp b/llvm/lib/Support/TrieRawHashMap.cpp
new file mode 100644
index 000000000000000..af4cd8b57aed214
--- /dev/null
+++ b/llvm/lib/Support/TrieRawHashMap.cpp
@@ -0,0 +1,483 @@
+//===- TrieRawHashMap.cpp -------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/ADT/TrieRawHashMap.h"
+#include "TrieHashIndexGenerator.h"
+#include "llvm/ADT/LazyAtomicPointer.h"
+#include "llvm/ADT/StringExtras.h"
+#include "llvm/Support/Allocator.h"
+#include "llvm/Support/Debug.h"
+#include "llvm/Support/ThreadSafeAllocator.h"
+#include "llvm/Support/raw_ostream.h"
+#include <memory>
+
+using namespace llvm;
+
+namespace {
+struct TrieNode {
+  const bool IsSubtrie = false;
+
+  TrieNode(bool IsSubtrie) : IsSubtrie(IsSubtrie) {}
+
+  static void *operator new(size_t Size) { return ::malloc(Size); }
+  void operator delete(void *Ptr) { ::free(Ptr); }
+};
+
+struct TrieContent final : public TrieNode {
+  const uint8_t ContentOffset;
+  const uint8_t HashSize;
+  const uint8_t HashOffset;
+
+  void *getValuePointer() const {
+    auto Content = reinterpret_cast<const uint8_t *>(this) + ContentOffset;
+    return const_cast<uint8_t *>(Content);
+  }
+
+  ArrayRef<uint8_t> getHash() const {
+    auto *Begin = reinterpret_cast<const uint8_t *>(this) + HashOffset;
+    return ArrayRef(Begin, Begin + HashSize);
+  }
+
+  TrieContent(size_t ContentOffset, size_t HashSize, size_t HashOffset)
+      : TrieNode(/*IsSubtrie=*/false), ContentOffset(ContentOffset),
+        HashSize(HashSize), HashOffset(HashOffset) {}
+};
+static_assert(sizeof(TrieContent) ==
+                  ThreadSafeTrieRawHashMapBase::TrieContentBaseSize,
+              "Check header assumption!");
+
+class TrieSubtrie final : public TrieNode {
+public:
+  TrieNode *get(size_t I) const { return Slots[I].load(); }
+
+  TrieSubtrie *
+  sink(size_t I, TrieContent &Content, size_t NumSubtrieBits, size_t NewI,
+       function_ref<TrieSubtrie *(std::unique_ptr<TrieSubtrie>)> Saver);
+
+  static std::unique_ptr<TrieSubtrie> create(size_t StartBit, size_t NumBits);
+
+  explicit TrieSubtrie(size_t StartBit, size_t NumBits);
+
+private:
+  // FIXME: Use a bitset to speed up access:
+  //
+  //     std::array<std::atomic<uint64_t>, NumSlots/64> IsSet;
+  //
+  // This will avoid needing to visit sparsely filled slots in
+  // \a ThreadSafeTrieRawHashMapBase::destroyImpl() when there's a non-trivial
+  // destructor.
+  //
+  // It would also greatly speed up iteration, if we add that some day, and
+  // allow get() to return one level sooner.
+  //
+  // This would be the algorithm for updating IsSet (after updating Slots):
+  //
+  //     std::atomic<uint64_t> &Bits = IsSet[I.High];
+  //     const uint64_t NewBit = 1ULL << I.Low;
+  //     uint64_t Old = 0;
+  //     while (!Bits.compare_exchange_weak(Old, Old | NewBit))
+  //       ;...
[truncated]

github-actions · 2023-10-18T21:39:03Z

✅ With the latest revision this PR passed the C/C++ code formatter.

bzcheeseman

Don't block on me, just a drive-by comment while skimming the change :) I'll take a more in-depth look when I get the chance (and plz ping if I am holding something up)

llvm/lib/Support/TrieRawHashMap.cpp

llvm/include/llvm/ADT/TrieRawHashMap.h

llvm/lib/Support/TrieHashIndexGenerator.h

dwblaikie · 2023-10-19T21:05:42Z

llvm/lib/Support/TrieRawHashMap.cpp

+  friend class llvm::ThreadSafeTrieRawHashMapBase;
+
+public:
+  /// Linked list for ownership of tries. The pointer is owned by TrieSubtrie.


Maybe "This is an owning pointer"? ("The pointer is owned by TrieSubtrie" confused me a bit (it's not about what owns the pointer, but the memory it points to - and "which TrieSubtrie" - this or some other instance, etc))

This is not an owning pointer. Tries are actually also chained together as a single linked list as they get allocated. This is to get a well-defined destroying order. When destroying the entire TrieRawHashMap, it can't just have one owning pointer because it needs to destroy both the data stored in the trie and the trie itself. It needs to keep the trie structure (see comments in ThreadSafeTrieRawHashMapBase::destroyImpl) while destroying data. It achieves that by walk the allocation linked list using this pointer and does two passes:

Destroy all data

Destroy tries

Also this is also to avoid using recursion to destroy all tries, since the number of tries can be very very big and overflow the stack.

dwblaikie · 2023-10-19T21:07:00Z

llvm/lib/Support/TrieRawHashMap.cpp

+  size_t Size = sizeof(TrieSubtrie) + getTrieTailSize(StartBit, NumBits);
+  void *Memory = ::malloc(Size);
+  TrieSubtrie *S = ::new (Memory) TrieSubtrie(StartBit, NumBits);
+  return std::unique_ptr<TrieSubtrie>(S);


This would produce mismatched new/delete. If the object is created with malloc+placement new, it should be destroyed with an explicit dtor call (if it's non-trivial) and a call to free - not a call to delete. I think?
(a unique_ptr with a custom deleter could be used to address this issue)

delete is actually overridden with free(). I am not proud of this but it is tricky to get this placement new but also return unique_ptr.

llvm/lib/Support/TrieRawHashMap.cpp

llvm/unittests/ADT/TrieRawHashMapTest.cpp

dwblaikie · 2023-10-20T17:38:24Z

Per https://llvm.org/docs/GitHub.html#updating-pull-requests it'd be helpful to, "When updating a pull request, you should push additional “fix up” commits to your branch instead of force pushing. This makes it easier for GitHub to track the context of previous review comments. Consider using the built-in support for fixups in git."

(like, I can't readily tell what things changed in your most recent update - so I can see if/how different feedback was/wasn't addressed)

(I know we're all learning how to do pull requests in LLVM - I still haven't sent any of my own out... so I'm certainly no expert at making them, but I'm slowly learning what makes them easier/harder to review, at least)

cachemeifyoucan · 2023-10-20T20:00:55Z

Per https://llvm.org/docs/GitHub.html#updating-pull-requests it'd be helpful to, "When updating a pull request, you should push additional “fix up” commits to your branch instead of force pushing. This makes it easier for GitHub to track the context of previous review comments. Consider using the built-in support for fixups in git."

Oops, I didn't realize we push out an official guideline already. Definitely will do that in the future. At the meantime, let me see if the review history can snap back to the code if I restore the branch to old stage and add fixup commit.

cachemeifyoucan · 2023-10-27T17:23:57Z

Ping~

bzcheeseman · 2023-10-29T05:36:16Z

llvm/lib/Support/TrieHashIndexGenerator.h

+
+namespace llvm {
+
+struct IndexGenerator {


nit: Could you add comments to this struct and its functions? It also may be worth moving them out-of-line, they are maybe a little long to be fully inlined.

Added comments for the class and APIs. Let me know if it is still hard to read for inline implementation. This is currently a private header and I can definitely move some implementation out to its own cpp file.

I would probably move this to a .cpp if it was me. The comments help a lot though, thanks!

llvm/lib/Support/TrieRawHashMap.cpp

bzcheeseman · 2023-10-29T05:41:33Z

llvm/lib/Support/TrieRawHashMap.cpp

+                                                 size_t NumBits) {
+  size_t Size = sizeof(TrieSubtrie) + getTrieTailSize(StartBit, NumBits);
+  void *Memory = ::malloc(Size);
+  TrieSubtrie *S = ::new (Memory) TrieSubtrie(StartBit, NumBits);


Isn't operator new overloaded above? Could you use that?

I would echo David's comment below that this is a bit fussy but if this is what you have to do I believe you :)

Actually, is https://llvm.org/doxygen/classllvm_1_1TrailingObjects.html something that might be useful here? It looks like that's what you're going for, and this might be a little cleaner?

I guess using a TrailingObject is a cleaner solution but it does come with some downside (unless I missed how that can be used). Currently, the trailing object in TrieSubtrie is dynamically allocated based on parameter and I don't know how to get trailing object to work with a dynamically size object. So there are two ways to get it working:

instead of trailing object, just put a trailing pointer there.

for the currently implementation, the size is only affected by the configuration so technically we can template that, then we can have a static size to construct trailing object. The main problem here is that we haven't fine tuned to configuration yet and that will prevent creating a trie implementation that decides bit size of different level or you just have to create all variation of size node type ahead of time.

@dexonsmith any opinion on this?

llvm/lib/Support/TrieRawHashMap.cpp

bzcheeseman · 2023-10-29T05:46:39Z

llvm/lib/Support/TrieRawHashMap.cpp

+      NumRootBits(NumRootBits ? *NumRootBits : DefaultNumRootBits),
+      NumSubtrieBits(NumSubtrieBits ? *NumSubtrieBits : DefaultNumSubtrieBits),
+      ImplPtr(nullptr) {
+  assert((!NumRootBits || *NumRootBits < 20) &&


Out of curiosity, why do we have this restriction?

I don't know~ @dexonsmith any insight.

I can see why it is not a reasonable choice but I don't know anything bad would happen if you force that (I don't think we will overflow, even on a 32 bit architecture). I removed it for now.

The point is catch misuse, not logic errors, so it's okay for the assertion to be relaxed if there's another reasonable limit.

The limit of 20 bits ensures the root's allocation is under 8MB (1M entries). On some platforms, larger contiguous allocations might fail. An assertion could catch platform-dependent memory failures.

Certainly, I imagine a limit of 29 would be palatable? (500M entries, 4GB allocation for the root node)

In any case, it feels like someone is holding the trie "wrong" for root nodes this big. Going bigger is certainly valid in some sense, but I question the benefit of supporting it. If you really want to optimize for super large data sets, you probably want a file-backed trie instead.

I prefer the original limit of 20, unless someone has a specific argument for something else. (I don't feel strongly about what the limit is, just that there should be one.)

llvm/lib/Support/TrieRawHashMap.cpp

Implement TrieRawHashMap which stores objects into a Trie based on the hash of the object. User needs to supply the hashing function and guarantees the uniqueness of the hash for the objects to be inserted. Hash collision is not supported

bzcheeseman

Aside from the trailing object open discussion, LGTM. I'll circle back once a direction has been decided-upon there :)

bzcheeseman · 2023-10-31T00:42:23Z

llvm/lib/Support/TrieHashIndexGenerator.h

+
+namespace llvm {
+
+struct IndexGenerator {


I would probably move this to a .cpp if it was me. The comments help a lot though, thanks!

bzcheeseman · 2023-10-31T00:43:54Z

llvm/include/llvm/ADT/TrieRawHashMap.h

+    const_pointer(PointerBase Result) : const_pointer::PointerImpl(Result) {}
+  };
+
+  class LazyValueConstructor {


This could probably use a comment and/or usage example? Another "not super clear from inspection" case.

cachemeifyoucan requested review from rnk, bzcheeseman and dwblaikie October 18, 2023 21:36

llvmbot added llvm:support llvm:adt labels Oct 18, 2023

cachemeifyoucan force-pushed the eng/PR-trie-raw-hash-map branch from 6ab1e9d to 1207c95 Compare October 18, 2023 22:09

bzcheeseman reviewed Oct 19, 2023

View reviewed changes

llvm/lib/Support/TrieRawHashMap.cpp Outdated Show resolved Hide resolved

cachemeifyoucan force-pushed the eng/PR-trie-raw-hash-map branch from 1207c95 to fecf7a0 Compare October 19, 2023 20:19

dwblaikie reviewed Oct 19, 2023

View reviewed changes

cachemeifyoucan force-pushed the eng/PR-trie-raw-hash-map branch from fecf7a0 to 6bc0b67 Compare October 19, 2023 22:46

cachemeifyoucan force-pushed the eng/PR-trie-raw-hash-map branch from 6bc0b67 to 96b140a Compare October 20, 2023 20:06

cachemeifyoucan requested review from dwblaikie, bzcheeseman and dexonsmith October 27, 2023 17:23

bzcheeseman reviewed Oct 29, 2023

View reviewed changes

cachemeifyoucan force-pushed the eng/PR-trie-raw-hash-map branch 2 times, most recently from c96baad to 5088211 Compare October 30, 2023 18:28

cachemeifyoucan added 4 commits October 30, 2023 11:47

[ADT] Add TrieRawHashMap

e2f0157

Implement TrieRawHashMap which stores objects into a Trie based on the hash of the object. User needs to supply the hashing function and guarantees the uniqueness of the hash for the objects to be inserted. Hash collision is not supported

fixup! [ADT] Add TrieRawHashMap

2562bfd

fixup! Address review feedback for more comments

ee8b774

fixup! Avoid while (true) during index generation

8e0d23b

cachemeifyoucan force-pushed the eng/PR-trie-raw-hash-map branch from 5088211 to 8e0d23b Compare October 30, 2023 18:48

bzcheeseman reviewed Oct 31, 2023

View reviewed changes

cachemeifyoucan mentioned this pull request Feb 12, 2024

[CAS] LLVMCAS implementation #68448

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ADT] Add TrieRawHashMap #69528

[ADT] Add TrieRawHashMap #69528

cachemeifyoucan commented Oct 18, 2023

llvmbot commented Oct 18, 2023 •

edited

github-actions bot commented Oct 18, 2023 •

edited

bzcheeseman left a comment •

edited

dwblaikie Oct 19, 2023

cachemeifyoucan Oct 19, 2023

cachemeifyoucan Oct 19, 2023

dwblaikie Oct 19, 2023

cachemeifyoucan Oct 19, 2023

dwblaikie commented Oct 20, 2023

cachemeifyoucan commented Oct 20, 2023

cachemeifyoucan commented Oct 27, 2023

bzcheeseman Oct 29, 2023

cachemeifyoucan Oct 30, 2023

bzcheeseman Oct 31, 2023

bzcheeseman Oct 29, 2023

bzcheeseman Oct 29, 2023

cachemeifyoucan Oct 30, 2023

bzcheeseman Oct 29, 2023

cachemeifyoucan Oct 30, 2023

dexonsmith Nov 8, 2023

bzcheeseman left a comment

bzcheeseman Oct 31, 2023

bzcheeseman Oct 31, 2023

[ADT] Add TrieRawHashMap #69528

Are you sure you want to change the base?

[ADT] Add TrieRawHashMap #69528

Conversation

cachemeifyoucan commented Oct 18, 2023

llvmbot commented Oct 18, 2023 • edited

github-actions bot commented Oct 18, 2023 • edited

bzcheeseman left a comment • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dwblaikie commented Oct 20, 2023

cachemeifyoucan commented Oct 20, 2023

cachemeifyoucan commented Oct 27, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bzcheeseman left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

llvmbot commented Oct 18, 2023 •

edited

github-actions bot commented Oct 18, 2023 •

edited

bzcheeseman left a comment •

edited