---
layout: post
title: Python hash() is not deterministic
date: 2023-12-25 03:23 +0000
last_modified_at: 2023-12-25 04:09:59 +0000
tags: Python
published: true
---

Python's built-in hash() is not deterministic. Its output is not guaranteed to be the same across different Python versions, platforms, or executions of the same program.

Let's take a look at the following example:

$ python -c "print(hash('foo'))"
-677362727710324010
$ python -c "print(hash('foo'))"
2165398033220216763
$ python -c "print(hash('foo'))"
5782774651590270115

As you can see, hash() returns a different value each time for the same input "foo". This is not a bug but a feature: since Python 3.3, hash randomization is enabled by default as a security measure, to prevent attackers from exploiting hash collisions in denial-of-service attacks. Every time a Python process starts, a random value is generated and used to salt the hashes of str and bytes objects. Hash values are therefore consistent within a single Python run but differ from one run to the next.
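The same behaviour can be reproduced from inside a script. The sketch below is only an illustration: `hash_in_new_process` is a hypothetical helper name, and it assumes the current interpreter can be relaunched through `sys.executable`. It sets `PYTHONHASHSEED=random` explicitly so that a seed already present in the environment does not mask the effect.

```python
import os
import subprocess
import sys


def hash_in_new_process(text: str) -> int:
    """Return hash(text) as computed by a freshly started interpreter.

    Hypothetical helper for illustration only.
    """
    # Force randomization on, in case PYTHONHASHSEED is already set by the caller.
    env = dict(os.environ, PYTHONHASHSEED="random")
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# Within one process the salt is fixed, so the value is stable.
assert hash("foo") == hash("foo")

# Each child interpreter draws its own salt, so the values are expected to differ.
across_runs = [hash_in_new_process("foo") for _ in range(3)]
print(across_runs)
```

The printed integers change on every invocation, which is why hash() output should not be written to disk or shared between processes as a stable identifier.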

You can disable hash randomization by setting the environment variable PYTHONHASHSEED to 0, but this is not recommended because it removes the protection against collision-based denial-of-service attacks.
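As a rough sketch of what the variable does, the snippet below launches two child interpreters with `PYTHONHASHSEED=0` and compares their results. `hash_with_seed` is a hypothetical helper name. The variable is read when the interpreter starts, so it has to be in the environment before launch; assigning to `os.environ` inside an already running program does not change that program's hashes.

```python
import os
import subprocess
import sys


def hash_with_seed(text: str, seed: str) -> int:
    """Return hash(text) from a child interpreter started with PYTHONHASHSEED=seed.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# With randomization disabled, two separate runs agree with each other.
first = hash_with_seed("foo", "0")
second = hash_with_seed("foo", "0")
print(first == second)
```

Even with a fixed seed, the value is still not guaranteed to match across Python versions or platforms, so this is suited to reproducing a test run rather than to producing portable identifiers.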

If you want to hash arbitrary objects deterministically, you can use ubelt (ub.hash_data) or joblib (joblib.hash, implemented in the joblib.hashing module).

Here's an example of using ubelt:

import ubelt as ub

print(ub.hash_data('foo', hasher='md5', base='abc', convert=False))

Result:

$ python -c "import ubelt as ub; print(ub.hash_data('foo', hasher='md5', base='abc', convert=False))"
blhtggyvbuyhspdolqxdrhoajdka
$ python -c "import ubelt as ub; print(ub.hash_data('foo', hasher='md5', base='abc', convert=False))"
blhtggyvbuyhspdolqxdrhoajdka
$ python -c "import ubelt as ub; print(ub.hash_data('foo', hasher='md5', base='abc', convert=False))"
blhtggyvbuyhspdolqxdrhoajdka
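When the input is only a string or bytes, the standard library's hashlib may be enough, with no third-party dependency. This is a different technique from the ubelt call above, shown as a minimal sketch: `stable_text_hash` is a hypothetical helper name, and the choice of SHA-256 and UTF-8 is an assumption rather than something the libraries above prescribe.

```python
import hashlib


def stable_text_hash(text: str) -> str:
    """Return the hex SHA-256 digest of the UTF-8 encoding of text.

    Hypothetical helper for illustration only.
    """
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# The digest depends only on the bytes of the input, not on a per-process salt.
print(stable_text_hash("foo"))
```

This covers text and bytes only. Hashing lists, dicts, or other arbitrary objects requires turning them into bytes in a stable way first, which is the part that ubelt and joblib handle for you.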

References