Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python 3] joblib.hashing.hash error for mixed types sets or dict with mixed types keys #254

Closed
lesteve opened this issue Oct 14, 2015 · 4 comments · Fixed by #286
Closed
Labels
Milestone

Comments

@lesteve
Copy link
Member

lesteve commented Oct 14, 2015

import joblib
joblib.hashing.hash({'a', 1})
joblib.hashing.hash({'a': 1, 1: 2})

The traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-e763c033a75e> in <module>()
----> 1 joblib.hashing.hash({'a': 1, 1: 2})

/home/le243287/dev/joblib/joblib/hashing.py in hash(obj, hash_name, coerce_mmap)
    231     else:
    232         hasher = Hasher(hash_name=hash_name)
--> 233     return hasher.hash(obj)

/home/le243287/dev/joblib/joblib/hashing.py in hash(self, obj, return_digest)
     52     def hash(self, obj, return_digest=True):
     53         try:
---> 54             self.dump(obj)
     55         except pickle.PicklingError as e:
     56             warnings.warn('PicklingError while hashing %r: %r' % (obj, e))

/volatile/le243287/miniconda3/lib/python3.4/pickle.py in dump(self, obj)
    410         if self.proto >= 4:
    411             self.framer.start_framing()
--> 412         self.save(obj)
    413         self.write(STOP)
    414         self.framer.end_framing()

/home/le243287/dev/joblib/joblib/hashing.py in save(self, obj)
    211             klass = obj.__class__
    212             obj = (klass, ('HASHED', obj.descr))
--> 213         Hasher.save(self, obj)
    214 
    215 

/home/le243287/dev/joblib/joblib/hashing.py in save(self, obj)
     77                 cls = obj.__self__.__class__
     78                 obj = _MyHash(func_name, inst, cls)
---> 79         Pickler.save(self, obj)
     80 
     81     def memoize(self, obj):

/volatile/le243287/miniconda3/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
    477         f = self.dispatch.get(t)
    478         if f is not None:
--> 479             f(self, obj) # Call unbound method with explicit self
    480             return
    481 

/volatile/le243287/miniconda3/lib/python3.4/pickle.py in save_dict(self, obj)
    812 
    813         self.memoize(obj)
--> 814         self._batch_setitems(obj.items())
    815 
    816     dispatch[dict] = save_dict

/home/le243287/dev/joblib/joblib/hashing.py in _batch_setitems(self, items)
    124     def _batch_setitems(self, items):
    125         # forces order of keys in dict to ensure consistent hash
--> 126         Pickler._batch_setitems(self, iter(sorted(items)))
    127 
    128     def save_set(self, set_items):

TypeError: unorderable types: str() < int()

The reason: we rely on sorted to guarantee the reproducibility of the hash for non ordered containers and in Python 3 you can not compare arbitrary types.

Maybe the first thing to do would be to just catch the error in Python 3 and raise a more user-friendly one saying that set with mixed types or dict with mixed type keys are not supported ?

A possible fix (suggested by @ogrisel): using joblib.hashing.hash to hash the elements (in the set case) or keys (in the dict case) and use key= in the sorted call.

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Jan 14, 2016 via email

@ogrisel
Copy link
Contributor

ogrisel commented Jan 15, 2016

How reproducible will the sorting based on hashes be?

Our hash function is deterministic, I don't understand your question.

@ogrisel
Copy link
Contributor

ogrisel commented Jan 15, 2016

Maybe let's move that discussion to #286 to be closer to the code.

@GaelVaroquaux
Copy link
Member

My question was stupid, because I thought that you were suggesting to use the builtin hash to sort.

The drawback of using joblib.hash is that it will make things slower.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants