New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalized hashing of keys for memoization #1074

Merged
merged 14 commits into from Jan 18, 2017

Conversation

Projects
None yet
3 participants
@jlstevens
Member

jlstevens commented Jan 18, 2017

Addresses #1067.

This approach isn't perfect but it will allow a lot more keys to be memoized than before (sets, lists, dictionaries). The main limitation is that composite values as dictionary keys are not allowed by the JSON spec.

Currently pandas Dataframes/Series and numpy arrays are allowed as mutables. In general you will want to support mutable types explicitly as using the id does not reflect the contents of the object.

This PR isn't ready yet - I want to add a bunch of unit tests.

@jlstevens

This comment has been minimized.

Member

jlstevens commented Jan 18, 2017

Just some quick examples/performance statistics:

%%timeit
   ...: keyhash([datetime.datetime(1,2,3), set([1,2,3]), 
                 pd.DataFrame({'a':[1,2],'b':[3,4]}), 
                 np.array([1,2,3]), np.int64(34)])
   ...:
1000 loops, best of 3: 659 µs per loop
%%timeit
   ...: keyhash([1,2,3, (4,5)])
   ...:
10000 loops, best of 3: 23.5 µs per loop

Unsurprisingly, big keys hash slower than small keys (I expect most keys to be pretty small). Note that datetime objects are also supported.

@philippjfr

This comment has been minimized.

Member

philippjfr commented Jan 18, 2017

Looks good to me. I doubt we'll catch all the cases we'll need straight away so it's good you can add repr_hashable and str_hashable types. Happy to merge when you've added some unit tests.

supplied, the objects id is used - this may be an issue when dealing
with mutable objects. For this reason, it is important to support
common mutable types such as pandas Dataframes and numpy arrays
(supported).

This comment has been minimized.

@jbednar

jbednar Jan 18, 2017

Member

The bit discussing mutable objects is confusing -- it calls out Pandas and Numpy, but doesn't mention dicts, lists, sets, etc. -- surely those are both mutable and supported as well?

This comment has been minimized.

@jlstevens

jlstevens Jan 18, 2017

Member

The distinction is between what is supported directly by JSON (dicts, lists) and things that we need to add support for it HashGenerator. Currently support is added for sets, datetime, numpy arrays and pandas dataframes/series. More things may need to be explicitly supported in future.

This comment has been minimized.

@jbednar

jbednar Jan 18, 2017

Member

Ok, please reword to convey that.

"""
Extends JSONEncoder to generate a hashable string for as many types
of key as possible. These keys are supplied by widgets and Stream
parameters and may be nested. If an unrecognized object type is

This comment has been minimized.

@jbednar

jbednar Jan 18, 2017

Member

I don't think the docstring for this class should mention widgets or Streams in any way other than something explicitly labeled as an example. It could mention memoization, though, so that we know the goal of this object is to generate an id that can tell whether two objects differ, not to provide an encoding of the object.

This comment has been minimized.

@jbednar

jbednar Jan 18, 2017

Member

And I'm confused about whether it's truly a HashGenerator -- the docstring says it will generate a hashable string, not a hash, and the code indicates that sometimes it's one, sometimes it's the other. Which is it?

This comment has been minimized.

@jlstevens

jlstevens Jan 18, 2017

Member

HashableStringGenerator is a pretty ugly name but probably more accurate.

This comment has been minimized.

@jlstevens

jlstevens Jan 18, 2017

Member

I don't think the docstring for this class should mention widgets or Streams in any way other than something explicitly labeled as an example. It could mention memoization, though, so that we know the goal of this object is to generate an id that can tell whether two objects differ, not to provide an encoding of the object.

Happy to update the docstring as suggested.

This comment has been minimized.

@jbednar

jbednar Jan 18, 2017

Member

Please do! Thanks.

This comment has been minimized.

@jlstevens

jlstevens Jan 18, 2017

Member

Hmm, all strings are hashable already. It is about taking an arbitrary thing and generating a unique string from it suitable to generate a hash. I agree that is a less ugly name though... @philippjfr Any suggestions?

This comment has been minimized.

@jbednar

jbednar Jan 18, 2017

Member

StringForHashing? Not great.

This comment has been minimized.

@jlstevens

jlstevens Jan 18, 2017

Member

HashableJSON might be better - the output is a JSON string with some customised support for additional types.

This comment has been minimized.

@jlstevens

jlstevens Jan 18, 2017

Member

HashableJSONEncoder if we want to be pedantic...

This comment has been minimized.

@jbednar

jbednar Jan 18, 2017

Member

HashableJSON seems fine to me.

def keyhash(key, as_string=False):

This comment has been minimized.

@jbednar

jbednar Jan 18, 2017

Member

Or is this hashable_string?

In any case, I'm not sure in what sense the argument is a "key".

This comment has been minimized.

@jlstevens

jlstevens Jan 18, 2017

Member

Or is this hashable_string?

No. I'll get rid of the as_string argument as I only needed it temporarily for debugging. The output is an integer, as returned by the Python hash function.

I am happy to rename it. In the context it is used (memoisation) the argument is what we have always called a 'key'. I'll just call it obj here... but then this function will then need a new name. We can't simply use hash clashes with the built-in. Again, any suggestions?

This comment has been minimized.

@jbednar

jbednar Jan 18, 2017

Member

Sure, you'll use it as a key, but that doesn't mean the function has anything to do with keys. obj is a good name for the argument; that's what it is (an arbitrary object). For the function name, maybe objhash (to contrast it with the regular hash(), which doesn't work for arbitrary objects)?

This comment has been minimized.

@jlstevens

jlstevens Jan 18, 2017

Member

Sure, you'll use it as a key, but that doesn't mean the function has anything to do with keys.

Agreed! Not that keen on objhash as a name though...

@jlstevens

This comment has been minimized.

Member

jlstevens commented Jan 18, 2017

Implemented suggestions and added unit tests. Ready for final review/merge once tests pass.

(supported).
of object as possible including nested objects and objects that are
not normally hashable. The purpose of this class is to generate
unique strings that once hashed are suitable for memoization.

This comment has been minimized.

@jbednar

jbednar Jan 18, 2017

Member

Strictly speaking, it's not the hashed strings that are memoized, it's the actual results (and only implicitly even those, in this case). The hashed strings are useful for memoization, but they themselves are not memoized.

So maybe "The purpose of this class is to generate unique strings that once hashed are suitable for use in memoization and other cases where deep equality must be tested without storing the entire object".

By default JSONEncoder supports booleans, numbers, strings, lists,
tuples and dictionaries. In order to support other types such as
sets, datetime objects and mutable objects such as pandas Dataframes

This comment has been minimized.

@jbednar

jbednar Jan 18, 2017

Member

and custom mutable

@jbednar

This comment has been minimized.

Member

jbednar commented Jan 18, 2017

Not sure why it's failing, but it looks good to me apart from the minor docstring fixes suggested.

@jlstevens

This comment has been minimized.

Member

jlstevens commented Jan 18, 2017

Python 3 has an additional limitation - the sort_keys option in json.dumps fails if it encounters dictionaries with heterogenous keys. This is a very clear bug as the final JSON keys are always strings which can be sorted but unfortunately this issue is unlikely to be fixed.

If we are happy to accept these limitations, the PR is ready to merge now the test have passed. Even with its caveats/holes, this approach greatly expands what can be memoized (before this PR, even a simple list wouldn't allow memoization).

@jbednar

This comment has been minimized.

Member

jbednar commented Jan 18, 2017

Looks good to me.

@philippjfr

This comment has been minimized.

Member

philippjfr commented Jan 18, 2017

Annoying this turned out to be such a pain. We'll just have to see if we run into any issues in practice. Currently the types of objects that will appear in a stream are fairly limited so I don't foresee any immediate issues. Will merge.

@philippjfr philippjfr merged commit facf323 into master Jan 18, 2017

4 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.003%) to 77.568%
Details
s3-reference-data-cache Test data is cached.
Details

@philippjfr philippjfr deleted the nested_key_hashes branch Jan 27, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment