ENH: support MultiIndex and tuple hashing #15224

Closed
wants to merge 7 commits into
from

3 participants
Contributor

jreback commented Jan 25, 2017 edited

on top of #15216

closes #15227

Contributor

jreback commented Jan 25, 2017

cc @mikegraham

I separated this from the prior PR.

Yes, my goal here is to effectively emulate hash(tuple), but to do it efficiently: we already have the arrays in a packed format, so turning them into Python objects just to compute a tuple hash is very inefficient.

Contributor

jreback commented Jan 25, 2017

@mikegraham
ok I incorporated your commit (with a slight modification) and it now passes tests!

codecov-io commented Jan 25, 2017 edited

Current coverage is 86.32% (diff: 100%)

Merging #15224 into master will increase coverage by 0.01%

@@             master     #15224   diff @@
==========================================
  Files           139        139          
  Lines         51096      51140    +44   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          44103      44148    +45   
+ Misses         6993       6992     -1   
  Partials          0          0          

Powered by Codecov. Last update be32852...8b1d3f9

jreback added this to the 0.20.0 milestone Jan 25, 2017

Contributor

mikegraham commented Jan 25, 2017

There were a couple memory improvements in 187573b over the commit as you cherry-picked it, I think.

Contributor

jreback commented Jan 25, 2017

@mikegraham ahh ok, I will pick-again. thanks!

Contributor

jreback commented Jan 26, 2017

@mikegraham ok fixed.

Note that this returns uint64, so it is not quite compatible with tuple hashing in Python?

Contributor

mikegraham commented Jan 26, 2017

I stole the algorithm, but it's not compatible. I also iterate over the members in forward order rather than reverse... I don't think that is substantive.
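The combining step being discussed can be sketched in pure Python. This is only an illustration, not the code from this PR: `combine_hashes` and `as_signed64` are hypothetical names, the list-of-columns input layout is an assumption, and the constants are the ones I believe the tuple hash in CPython's tupleobject.c used at the time. Python ints are masked to 64 bits to mimic uint64 wraparound, and the columns are visited in forward order, so the results are not expected to match `hash(tuple)`.

```python
MASK64 = (1 << 64) - 1  # emulate uint64 wraparound with plain Python ints


def combine_hashes(columns):
    """Combine per-column hashes into one hash per row (hypothetical helper).

    columns: a sequence of equal-length lists of non-negative ints,
             one list per tuple position.
    """
    columns = list(columns)
    num_items = len(columns)
    if num_items == 0:
        return []

    out = [0x345678] * len(columns[0])  # assumed seed, as in the tuple hash
    mult = 1000003                      # assumed initial multiplier
    for i, col in enumerate(columns):   # forward order, not reverse
        remaining = num_items - i
        out = [((x ^ h) * mult) & MASK64 for x, h in zip(out, col)]
        mult = (mult + 82520 + 2 * remaining) & MASK64
    return [(x + 97531) & MASK64 for x in out]


def as_signed64(x):
    """Reinterpret an unsigned 64-bit value as a signed one (two's complement)."""
    return x - (1 << 64) if x >= (1 << 63) else x
```

`as_signed64` is there to show one reason the values differ from Python's own hashes: `hash()` returns a signed integer, while the combined result here is treated as unsigned.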

Contributor

jreback commented Jan 26, 2017

cc @mikegraham any final comments?

@jorisvandenbossche

Contributor

mikegraham commented Jan 26, 2017 edited

We might want to do something about this behavior.

>>> pandas.tools.hashing.hash_pandas_object(pandas.DataFrame(), index=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/tools/hashing.py", line 91, in hash_pandas_object
    h = _combine_hash_arrays(hashes, num_items)
  File "pandas/tools/hashing.py", line 19, in _combine_hash_arrays
    first = next(arrays)
StopIteration
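One way to avoid a bare `StopIteration` like the one in the traceback is to peek at the iterator and handle the empty case explicitly. The sketch below uses a hypothetical helper, `peek_or_empty`, and is not necessarily the fix that was applied in this PR.

```python
import itertools


def peek_or_empty(arrays):
    """Report whether an iterable is empty without losing its first element.

    Returns (is_empty, iterator). The returned iterator yields every
    element of the original iterable, including the one that was peeked.
    """
    arrays = iter(arrays)
    try:
        first = next(arrays)
    except StopIteration:
        # Nothing to combine: let the caller return an empty result.
        return True, iter(())
    # Put the peeked element back in front of the remaining ones.
    return False, itertools.chain([first], arrays)
```

A caller could check `is_empty` first and return an empty result of the right type instead of letting the exception propagate.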
jreback memory optimization
58f682d
Contributor

jreback commented Jan 26, 2017

@mikegraham updated

jreback added some commits Jan 26, 2017

jreback support for mixed type arrays
48a2402
jreback not correctly hashing categorical in a MI
8b1d3f9

jreback closed this in c67486f Jan 27, 2017

AnkurDedania added a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017

jreback + AnkurDedania ENH: support MultiIndex and tuple hashing
closes #15227

Author: Jeff Reback <jeff@reback.net>
Author: Mike Graham <mikegraham2gmail.com>

Closes #15224 from jreback/mi_hash2 and squashes the following commits:

8b1d3f9 [Jeff Reback] not correctly hashing categorical in a MI
48a2402 [Jeff Reback] support for mixed type arrays
58f682d [Jeff Reback] memory optimization
0c13df7 [Mike Graham] Steal the algorithm used to combine hashes from tupleobject.c
e8dd607 [Jeff Reback] add hash_tuples
44e9c7d [Mike Graham] wipSteal the algorithm used to combine hashes from tupleobject.c
e507c4a [Jeff Reback] ENH: support MultiIndex and tuple hashing
4b762d1