Skip to content

Commit

Permalink
[llvm-profdata] Remove MD5 collision check in D147740 (#66544)
Browse files Browse the repository at this point in the history
This is the patch at https://reviews.llvm.org/D153692, migrating to
Github

After testing D147740 with multiple industrial projects with ~10 million
FunctionSamples, no MD5 collision has been found. In perfect hashing,
the probability of collision for N symbols over K possible hash value is
1 - K!/((K-N)! * K^N). When N is 1 million and K is 2^64, the
probability is 3*10^-8, when N is 10 million the probability is 3*10^-6,
so we are probably not going to find an actual case in real world
application. (However if K is 2^32, the probability of collision is
almost 1, this is indeed a problem, if anyone still use a large profile
on 32-bit machine, as hash_code is tied to size_t). Furthermore, when a
collision happens we can't do anything to recover it, unless using a
multi-map, but that is significantly slower, which contradicts the
purpose of optimizing the profile reader. One more thing, since we have
been using profiles with MD5 names, and they have to be coming from
non-MD5 sources, so if hash collision is to happen, it already happened
when we convert a non-MD5 profile to a MD5 one, so there's no point to
check for that in the reader, and this feature can be removed.
  • Loading branch information
huangjd committed Sep 15, 2023
1 parent 80e810f commit f4f85e0
Show file tree
Hide file tree
Showing 3 changed files with 15 additions and 227 deletions.
75 changes: 15 additions & 60 deletions llvm/include/llvm/ProfileData/SampleProf.h
Original file line number Diff line number Diff line change
Expand Up @@ -1299,19 +1299,16 @@ raw_ostream &operator<<(raw_ostream &OS, const FunctionSamples &FS);
/// performance of insert and query operations especially when hash values of
/// keys are available a priori, and reduces memory usage if KeyT has a large
/// size.
/// When performing any action, if an existing entry with a given key is found,
/// and the interface "KeyT ValueT::getKey<KeyT>() const" to retrieve a value's
/// original key exists, this class checks if the given key actually matches
/// the existing entry's original key. If they do not match, this class behaves
/// as if the entry did not exist (for insertion, this means the new value will
/// replace the existing entry's value, as if it is newly inserted). If
/// ValueT::getKey<KeyT>() is not available, all keys with the same hash value
/// are considered equivalent (i.e. hash collision is silently ignored). Given
/// such feature this class should only be used where it does not affect
/// compilation correctness, for example, when loading a sample profile.
/// Assuming the hashing algorithm is uniform, the probability of hash collision
/// with 1,000,000 entries is
/// (2^64)!/((2^64-1000000)!*(2^64)^1000000) ~= 3*10^-8.
/// All keys with the same hash value are considered equivalent (i.e. hash
/// collision is silently ignored). Given such feature this class should only be
/// used where it does not affect compilation correctness, for example, when
/// loading a sample profile.
/// Assuming the hashing algorithm is uniform, we use the formula
/// 1 - Permute(n, k) / n ^ k where n is the universe size and k is number of
/// elements chosen at random to calculate the probability of collision. With
/// 1,000,000 entries the probability is negligible:
/// 1 - (2^64)!/((2^64-1000000)!*(2^64)^1000000) ~= 3*10^-8.
/// Source: https://en.wikipedia.org/wiki/Birthday_problem
template <template <typename, typename, typename...> typename MapT,
typename KeyT, typename ValueT, typename... MapTArgs>
class HashKeyMap : public MapT<hash_code, ValueT, MapTArgs...> {
Expand All @@ -1325,42 +1322,12 @@ class HashKeyMap : public MapT<hash_code, ValueT, MapTArgs...> {
using iterator = typename base_type::iterator;
using const_iterator = typename base_type::const_iterator;

private:
// If the value type has getKey(), retrieve its original key for comparison.
template <typename U = mapped_type,
typename = decltype(U().template getKey<original_key_type>())>
static bool
CheckKeyMatch(const original_key_type &Key, const mapped_type &ExistingValue,
original_key_type *ExistingKeyIfDifferent = nullptr) {
const original_key_type &ExistingKey =
ExistingValue.template getKey<original_key_type>();
bool Result = (Key == ExistingKey);
if (!Result && ExistingKeyIfDifferent)
*ExistingKeyIfDifferent = ExistingKey;
return Result;
}

// If getKey() does not exist, this overload is selected, which assumes all
// keys with the same hash are equivalent.
static bool CheckKeyMatch(...) { return true; }

public:
template <typename... Ts>
std::pair<iterator, bool> try_emplace(const key_type &Hash,
const original_key_type &Key,
Ts &&...Args) {
assert(Hash == hash_value(Key));
auto Ret = base_type::try_emplace(Hash, std::forward<Ts>(Args)...);
if (!Ret.second) {
original_key_type ExistingKey;
if (LLVM_UNLIKELY(!CheckKeyMatch(Key, Ret.first->second, &ExistingKey))) {
dbgs() << "MD5 collision detected: " << Key << " and " << ExistingKey
<< " has same hash value " << Hash << "\n";
Ret.second = true;
Ret.first->second = mapped_type(std::forward<Ts>(Args)...);
}
}
return Ret;
return base_type::try_emplace(Hash, std::forward<Ts>(Args)...);
}

template <typename... Ts>
Expand All @@ -1382,17 +1349,15 @@ class HashKeyMap : public MapT<hash_code, ValueT, MapTArgs...> {
key_type Hash = hash_value(Key);
auto It = base_type::find(Hash);
if (It != base_type::end())
if (LLVM_LIKELY(CheckKeyMatch(Key, It->second)))
return It;
return It;
return base_type::end();
}

const_iterator find(const original_key_type &Key) const {
key_type Hash = hash_value(Key);
auto It = base_type::find(Hash);
if (It != base_type::end())
if (LLVM_LIKELY(CheckKeyMatch(Key, It->second)))
return It;
return It;
return base_type::end();
}

Expand Down Expand Up @@ -1433,19 +1398,9 @@ class SampleProfileMap
Ctx);
}

// Overloaded find() to lookup a function by name. This is called by IPO
// passes with an actual function name, and it is possible that the profile
// reader converted function names in the profile to MD5 strings, so we need
// to check if either representation matches.
// Overloaded find() to lookup a function by name.
iterator find(StringRef Fname) {
uint64_t Hash = hashFuncName(Fname);
auto It = base_type::find(hash_code(Hash));
if (It != end()) {
StringRef CtxName = It->second.getContext().getName();
if (LLVM_LIKELY(CtxName == Fname || CtxName == std::to_string(Hash)))
return It;
}
return end();
return base_type::find(hashFuncName(Fname));
}

size_t erase(const SampleContext &Ctx) {
Expand Down
1 change: 0 additions & 1 deletion llvm/unittests/tools/llvm-profdata/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,6 @@ set(LLVM_LINK_COMPONENTS

add_llvm_unittest(LLVMProfdataTests
OutputSizeLimitTest.cpp
MD5CollisionTest.cpp
)

target_link_libraries(LLVMProfdataTests PRIVATE LLVMTestingSupport)
Expand Down
166 changes: 0 additions & 166 deletions llvm/unittests/tools/llvm-profdata/MD5CollisionTest.cpp

This file was deleted.

0 comments on commit f4f85e0

Please sign in to comment.