# Issues with hash-function for float64 version of klib's hash-map #21866

Closed
opened this issue Jul 11, 2018 · 5 comments

Contributor

### realead commented Jul 11, 2018 • edited

#### Problem description

Hash-maps for float64 use the following hash-function

```c
#define kh_float64_hash_func(key) (khint32_t)((asint64(key))>>33^(asint64(key))^(asint64(key))<<11)
```

However, in order to guarantee consistent behavior, the following constraints must be met:

1. `==` must be an equivalence relation
2. `x==y` => `hash(x)==hash(y)`

Following IEEE-754, `==` on floats is not an equivalence relation (`NaN != NaN` breaks reflexivity); this is fixed by defining "equal" as

```c
#define kh_float64_hash_equal(a, b) ((a) == (b) || ((b) != (b) && (a) != (a)))
```

Thus, apart from the trivial equivalence classes, there are the following two:

1. `{0.0, -0.0}`
2. `{x | x is NaN}`

for which the second constraint doesn't hold:

- `hash(0.0) = 0`, but `hash(-0.0) = 1073741824 = 2^30`
- `hash(float("nan")) = 1073479680`, but `hash(-float("nan")) = 2147221504`

Due to the way klib uses this hash value, the values `0.0` and `-0.0` end up in different buckets only if there are more than 2^29 (ca. 5.4e8) elements in the hash-map already. The same holds for `float("nan")` and `-float("nan")`. However, in the case of not-a-number there are many more elements in the equivalence class, so the inconsistent behavior can be triggered at much smaller sizes, for example for

```python
import struct

# two distinct NaN bit patterns ("Q" = unsigned 64-bit, portable across platforms)
NAN1 = struct.unpack("d", struct.pack("Q", 9221120237041090560))[0]
NAN2 = struct.unpack("d", struct.pack("Q", 9221120237041090561))[0]
```
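The hash values quoted above can be checked from Python with a small sketch of the C macro, modeling the `(khint32_t)` cast as a 32-bit mask:

```python
import struct

def kh_float64_hash(x):
    # reinterpret the double's bit pattern as a uint64,
    # then apply klib's bit shuffle, truncated to 32 bits
    u = struct.unpack("<Q", struct.pack("<d", x))[0]
    return ((u >> 33) ^ u ^ (u << 11)) & 0xFFFFFFFF

assert kh_float64_hash(0.0) == 0
assert kh_float64_hash(-0.0) == 2**30            # 1073741824
assert kh_float64_hash(float("nan")) == 1073479680
assert kh_float64_hash(-float("nan")) == 2147221504

# the two NaN payloads above also land on different hashes
NAN1 = struct.unpack("d", struct.pack("Q", 9221120237041090560))[0]
NAN2 = struct.unpack("d", struct.pack("Q", 9221120237041090561))[0]
assert kh_float64_hash(NAN1) != kh_float64_hash(NAN2)
```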

### Proposed solution:

There are already some workarounds for this problem scattered throughout pandas, for example for `pd.unique(...)` here or for `isin` here.

However, this

1. is not the right place to work around the bug, as it does not make the code any clearer
2. fixes only the not-a-number case, but not `{0.0, -0.0}`
3. means that most probably everyone reusing the hash-map implementation in the future will first get the NaN case wrong and then fix it with yet another workaround.

A better approach would be to fix the hash function of the float64 hash-map; one possibility would be:

```c
#include <math.h>
// right for everything except -0.0 and NaNs
#define kh_float64_hash_func_0_NAN(key) (khint32_t)((f64_to_i64(key))>>33^(f64_to_i64(key))^(f64_to_i64(key))<<11)
// right for everything except NaNs
#define kh_float64_hash_func_NAN(key) ((key)==0.0 ? kh_float64_hash_func_0_NAN(0.0) : kh_float64_hash_func_0_NAN(key))
// right for everything, including -0.0 and NaNs
#define kh_float64_hash_func(key) ((key) != (key) ? kh_float64_hash_func_NAN(NAN) : kh_float64_hash_func_NAN(key))
```
Member

### WillAyd commented Jul 11, 2018

A little over my head, but can you clarify the problem further? Is this something visible from the end-user perspective? I wouldn't really consider the items you've labeled as workarounds to actually be such, as `np.nan != np.nan` generally, i.e. that's not just a pandas construct.

### WillAyd added the Needs Info label Jul 11, 2018

Contributor Author

### realead commented Jul 12, 2018 • edited

Because of the workarounds, I'm not aware of a way to trigger an error in the NaN case (as long as one doesn't care exactly which NaN it is). There is however a way to trigger inconsistent behavior for `0.0` and `-0.0` (I hope you have enough RAM :)):

```python
# tested with pandas 0.22.0
import numpy as np
import pandas as pd

a = np.arange(12 * 10**8, dtype=np.float64)
a[0] = 0.1
a[-1] = 0.0
a[-2] = -0.0
b = pd.unique(a)
print(b.size)
```

The size of `b` is `6*10^8`, but should be `6*10^8-1`, because `0.0` and `-0.0` are usually handled as equal, e.g.:

```python
a = np.zeros(2, dtype=np.float64)
a[-1] = 0.0
a[-2] = -0.0
b = pd.unique(a)
print(b.size)  # size is 1 now, 0.0 and -0.0 are equal
```

I do understand that this is quite an esoteric case. My main issue with the implementation of the float64 table as it is: there is a trap which has obviously already bitten at least twice and will strike again in the future. The problem is not the equal-operator (which is rightly extended with `np.nan != np.nan`) but the way the hash function is calculated. The proposed fix of the hash function would eliminate the need for workarounds, because the float64 hashtable would handle NaNs and signed zeros correctly out of the box. This SO question helped me to understand the issue; maybe it is better than my issue description.
Contributor

### chris-b1 commented Jul 12, 2018

Certainly would take an alternative hash function. As the long comment in the code indicates, we used to use Python's hash for doubles, but that caused issues due to size truncation, so we're using a generic bit-shuffling one, the same that ints use. As you show, `0.0, -0.0` could be an issue. But I think our approach for NaNs is fine? It's special-cased, yes, but `NaN` is a special case throughout pandas, so I don't really see a need for purity here? But open to suggestions if there's a cleaner way to embed our assumptions into the hash function itself.

Member

### gfyoung commented Jul 13, 2018

 A very esoteric bug indeed, but nonetheless a bug. At the very least, `0.0, -0.0` should be addressed.

### realead referenced this issue Jul 13, 2018

Merged

#### BUG: issues with hash-function for Float64HashTable (GH21866) #21904

Contributor Author

### realead commented Jul 13, 2018

Added my suggestion as PR21904; it fixes both cases, NaNs and signed zero. I think both are necessary: using the `Float64HashTable` directly (instead of `pd.unique()`), it is much easier to trigger the bug for NaNs and 0.0/-0.0:

```python
import numpy as np
from pandas._libs import hashtable as ht

m = ht.Float64HashTable(12 * 10**8)
m.set_item(np.nan, 0)
m.set_item(-np.nan, 0)
assert len(m) == 1
```

or

```python
from pandas._libs import hashtable as ht

m = ht.Float64HashTable(12 * 10**8)
m.set_item(0.0, 0)
m.set_item(-0.0, 0)
assert len(m) == 1
```

PS: I don't know the right place for the whatsnew entry; I hope that, in case the changes are OK, I will be guided to the right place...
