Any descendant of the state can be a binary_type #1194

Merged
merged 21 commits into from Mar 21, 2017

@maartenbreddels
Member

maartenbreddels commented Mar 8, 2017

Before, you could only do something like this when serializing a state:

def array_to_binary(ar, obj=None):
    ar = ar.astype(np.float64)
    return memoryview(ar)

However, you'd like to be able to send shape along. This patch makes this possible, e.g.:

def array_to_binary(ar, obj=None):
    ar = ar.astype(np.float64)
    return {'data': memoryview(ar), 'shape': ar.shape}

To make this possible, the state is separated into a state_without_buffers (the state as it appears on the js side) and a state_with_buffers, where every dict item that is a buffer is popped, and every list item that is a buffer is replaced by None. If state_with_buffers were not treated separately, the state would first be deserialized completely, and the buffers patched in afterwards. By keeping state_with_buffers separate, the buffers are first put in place, and deserialization happens afterwards.
Maybe it's not the cleanest solution, but my understanding of the code only goes this far.
This does not treat serialization in the other direction (frontend -> backend); before continuing, I'd like to get approval of this approach.
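The splitting idea can be sketched roughly like this (illustrative names and a minimal sketch, not the PR's exact implementation): walk the state recursively, pop buffer-valued dict items, replace buffer-valued list items with None so later indices stay valid, and record the path to each buffer so the other side can put it back.

```python
def remove_buffers(state, path=(), buffer_paths=None, buffers=None):
    """Return (state_without_buffers, buffer_paths, buffers).

    Dict items that are buffers are popped; list items that are buffers
    are replaced by None so the indices of later items stay valid.
    """
    if buffer_paths is None:
        buffer_paths, buffers = [], []
    if isinstance(state, dict):
        clean = {}
        for key, value in state.items():
            if isinstance(value, (bytes, bytearray, memoryview)):
                buffer_paths.append(list(path) + [key])
                buffers.append(value)  # dict item: popped entirely
            else:
                clean[key], _, _ = remove_buffers(value, path + (key,),
                                                  buffer_paths, buffers)
        return clean, buffer_paths, buffers
    if isinstance(state, (list, tuple)):
        clean = []
        for i, value in enumerate(state):
            if isinstance(value, (bytes, bytearray, memoryview)):
                buffer_paths.append(list(path) + [i])
                buffers.append(value)
                clean.append(None)  # list item: placeholder keeps indices stable
            else:
                sub, _, _ = remove_buffers(value, path + (i,),
                                           buffer_paths, buffers)
                clean.append(sub)
        return clean, buffer_paths, buffers
    return state, buffer_paths, buffers
```

With this, a state like {'x': {'data': memoryview(ar), 'shape': [2]}} comes out as ({'x': {'shape': [2]}}, [['x', 'data']], [the buffer]), which is what the frontend needs to reassemble it.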

Any descendant of the state can be a binary_type now, useful for serializing numpy arrays, in addition to shape, dtype etc

maartenbreddels added a commit to maartenbreddels/ipyvolume that referenced this pull request Mar 8, 2017

@maartenbreddels maartenbreddels referenced this pull request Mar 8, 2017

Closed

Speed improvement. #13

@maartenbreddels

Member

maartenbreddels commented Mar 8, 2017

Oh, and respect to those responsible for thinking out the method for doing binary transfers, this is truly a good idea!

@jasongrout

Member

jasongrout commented Mar 8, 2017

Why not just save the entire array+size+dtype into a binary format? For example, we could implement this: https://docs.scipy.org/doc/numpy/neps/npy-format.html#format-specification-version-1-0

@maartenbreddels

Member

maartenbreddels commented Mar 8, 2017

That would be an option, but I'll also need to serialize a dict of numpy arrays in the future, so even if what you propose could solve some cases, this is more flexible. And I guess that would require a memcpy (although it's probably on the order of megabytes, so it shouldn't be an issue).

@SylvainCorlay

Member

SylvainCorlay commented Mar 8, 2017

Regardless of the method used for the case of numpy arrays, I think that allowing binary fields will be necessary for other use cases.

@maartenbreddels

Member

maartenbreddels commented Mar 8, 2017

Yes, also other metadata: units, descriptions, etc. You don't want to put those in binary form on each side of the wire, so this would make life much easier.

@jasongrout

Member

jasongrout commented Mar 8, 2017

I hesitate to start doing things that involve picking apart objects and replacing their innards. Are we replacing things arbitrarily deep in the nesting here?

@SylvainCorlay

Member

SylvainCorlay commented Mar 8, 2017

We can probably address all possible use cases by allowing only top-level binary keys.

@maartenbreddels

Member

maartenbreddels commented Mar 8, 2017

Yes, any level deep, as far as lists and dicts go. We could disallow that, but it wouldn't change the code at all (except for the recursive part). It is no different from _widget_to_json.

@maartenbreddels

Member

maartenbreddels commented Mar 8, 2017

It would actually require hardcoded checks, so as it is now, it's the easiest to maintain, I think.

@maartenbreddels

Member

maartenbreddels commented Mar 8, 2017

I think something that makes this complex (and it was already the case before this PR) is that an object first gets created/deserialized here: https://github.com/ipython/ipywidgets/blob/master/jupyter-js-widgets/src/manager-base.ts#L346
Afterwards, the buffer items get pushed in here: https://github.com/ipython/ipywidgets/blob/master/jupyter-js-widgets/src/widget.ts#L175
and the rest gets deserialized.
Maybe if this could be done differently, e.g. if the deserialize part weren't called in two places, the state wouldn't need to be split in two parts. Then buffer items could be popped, list items set to null, and filled in later.

Because as it is now, if, say, my numpy array 'x' (with the buffer value missing on the js side) gets deserialized first and the buffer gets filled in later, deserialize isn't called again.

@jasongrout

Member

jasongrout commented Mar 8, 2017

We can probably address all possible use cases by allowing only top-level binary keys.

This is essentially what we have now, one level up at the object's state keys. Some keys can be binary, some can be JSON.

What would we need to isolate this to a custom serializer (like the widgets serializer)? For example, what if we allowed multiple buffers to be associated with one key (to have possible zero-copy semantics).

(I'm not totally against this idea, but it certainly makes the serialization more complicated, so I want to explore some simpler ideas, or ways to move the complexity out of the base codebase)

@maartenbreddels

Member

maartenbreddels commented Mar 8, 2017

Actually, that is also what @SylvainCorlay thought, but it is more like 0 levels deep.

state is a dict with all values. Let's say we have a widget with a property 'x'; then (ignoring all the other metadata) state would be state = {'x': memoryview(ar)}, and that is allowed and gets properly treated in _split_state_buffers and on the js side.
However, we'd like to allow at least state = {'x': {'data': memoryview(ar), 'shape': ar.shape, ....}}, so it's an extra level deeper we need to go. And well, 1 level or N levels, it's all the same code.

@maartenbreddels

Member

maartenbreddels commented Mar 8, 2017

I agree that it makes things more complex, but I think the complexity starts at the two places where deserialization happens. If, say, we already had the buffers available at object construction, that could make this simpler: we just pop buffers out of the dicts, replace list values with sentinels, and put them back in place directly before the object is constructed. As it is now, the object is first constructed with everything that contains a buffer missing (before, only the level-0 keys that were buffer objects were missing), and then an update adds in the missing buffer objects.
Ah.... now that I am writing this, I understand where the issue is:
Widget.open first constructs the object without buffers, and then sends the state again. So my guess is, when this issue is fixed, it can at least be a bit simpler (as in not having state and state_with_buffers, but just one state)

@SylvainCorlay

Member

SylvainCorlay commented Mar 8, 2017

@jasongrout the current mechanism allows top-level properties of the object to be memory views, not top-level properties of a given trait attribute.

@jasongrout

Member

jasongrout commented Mar 8, 2017

@jasongrout the current mechanism allows top-level properties of the object to be memory views, not top-level properties of a given trait attribute.

Right - any particular trait (i.e., a key/value in the object's state) can be binary.

@jasongrout

Member

jasongrout commented Mar 8, 2017

However, we'd like to allow at least state = {'x': {'data': memoryview(ar), 'shape': ar.shape, ....}}, so it's an extra level deeper we need to go. And well, 1 level or N levels, it's all the same code.

Yes, I understand what we're trying to do. Another way to think about this proposal is that we are simply extending our base JSON encoder/decoder to handle binary objects anywhere in it. (The popping of binary buffers happens after the object's custom serializer has run, right?)

What if we allowed any particular trait to have arbitrary JSON metadata associated with it? We sort of have that now - if a trait is binary, some metadata is associated with it indicating which buffer the data is in. What if the serializer could return both a value and a metadata dict? And the deserializer could take both a data value and a metadata dict.

Also, the difference between what we have now (any trait can be encoded in binary) and a level-1 binary replacer is that the serialization/deserialization happens at the trait level. With the level-1 binary replacer, a single trait can (very easily) be encoded as both binary data and JSON data, whereas with what we have now, there needs to be some extra magic to associate the JSON data with the binary data in two different traits.

@maartenbreddels

Member

maartenbreddels commented Mar 8, 2017

I like the idea, but I can think of an example that doesn't really fit into it. Take a masked array: it would have two binary blobs associated with it, and I wouldn't call the mask part of the array's metadata. It would make more sense to write it like this:

   ...
   return {'data': memoryview(data), 'mask': memoryview(mask), 'shape': data.shape}

@jasongrout

Member

jasongrout commented Mar 8, 2017

Yes, I'm agreeing with this PR's idea more and more, especially as I think about it as being just a way of augmenting our JSON format to take advantage of the binary buffers we have available.

@maartenbreddels

Member

maartenbreddels commented Mar 8, 2017

Ok, good to know. The split into state and state_with_buffers can be avoided if the issue in Widget.open is fixed, but it's not a lot of code. I'll add comments in the code on what can be changed once that issue is fixed. So I guess the next step would be to implement serialization (from js to py)?

@jasongrout

Member

jasongrout commented Mar 8, 2017

Can you show an example of what a deeper nesting path would look like? Would a state dictionary of {'x': {'arr': [ar, ar]}} give ['x', 'arr', 0], ['x', 'arr', 1] as paths?

Implementing the serialization/deserialization in js definitely should be done too.

Thanks for tackling this! It's been a frustration we've had for a while (particularly with numpy arrays).
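The concrete question above can be answered with a tiny self-contained sketch (a hypothetical helper, not the PR's code) that collects the path to every bytes-like leaf in a nested state:

```python
def buffer_paths_of(state, path=(), found=None):
    """Collect the path to every bytes-like leaf in a nested dict/list state."""
    if found is None:
        found = []
    if isinstance(state, (bytes, bytearray, memoryview)):
        found.append(list(path))
    elif isinstance(state, dict):
        for key, value in state.items():
            buffer_paths_of(value, path + (key,), found)
    elif isinstance(state, (list, tuple)):
        for i, value in enumerate(state):
            buffer_paths_of(value, path + (i,), found)
    return found

ar = b'\x00\x01'  # stand-in for a binary buffer
print(buffer_paths_of({'x': {'arr': [ar, ar]}}))  # [['x', 'arr', 0], ['x', 'arr', 1]]
```

So yes: dict levels contribute keys and list levels contribute indices, giving ['x', 'arr', 0] and ['x', 'arr', 1] for that state.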

@maartenbreddels

Member

maartenbreddels commented Mar 8, 2017

Ok, implemented, demo notebook here: https://gist.github.com/maartenbreddels/ac41cdcd22c8f84c9fc639dea7c90848
It's a model that takes x, doubles every item, and puts the result in y. As an extra check I also included a 'nested' property:

{'data': mv, 'shape': ar.shape, 'nested': {'list': [mv, 'test']}}

that gets passed along from x to y, just to demonstrate that it handles nested items ok; in the last cell you can see it is properly serialized and deserialized.
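The serializer pair in such a demo presumably looks something like the following round trip (a sketch; the function names, the fixed float64 dtype, and the omission of the 'nested' key are assumptions for brevity):

```python
import numpy as np

def array_to_json(ar, obj=None):
    # Serialize: send the raw bytes plus enough metadata to rebuild the array.
    ar = np.ascontiguousarray(ar, dtype=np.float64)
    return {'data': memoryview(ar), 'shape': ar.shape}

def array_from_json(value, obj=None):
    # Deserialize: rebuild the ndarray from the buffer that was put back in place.
    return np.frombuffer(value['data'], dtype=np.float64).reshape(value['shape'])

x = np.arange(3.0)
roundtrip = array_from_json(array_to_json(x))
print(roundtrip.tolist())  # [0.0, 1.0, 2.0]
```

In ipywidgets, a pair like this would be attached to a trait via tag(sync=True, to_json=..., from_json=...), with the buffer-splitting machinery of this PR handling the memoryview inside the dict.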

@maartenbreddels

Member

maartenbreddels commented Mar 9, 2017

I've answered your question in the docstring of _split_state_buffers. Furthermore, (in a previous commit) I've renamed buffer_keys to buffer_paths, since that better covers its usage.
Comments on the js/ts are welcome, since I'm not as comfortable with it as I am with Python.

I also tried to keep the code simple (like a for loop in _handle_msg, instead of using reduce), to keep it more readable for those who are not used to functional style, but I don't know what your guidelines are on this.

@maartenbreddels

Member

maartenbreddels commented Mar 10, 2017

FYI: Found some bugs in this PR, will commit new changes later.

@SylvainCorlay

Member

SylvainCorlay commented Mar 15, 2017

@maartenbreddels are you still making changes to this PR?

I did a rough review and this looks good. What issues did you encounter?

@jasongrout jasongrout referenced this pull request Mar 18, 2017

Closed

Review protocol for consistency #1212

1 of 1 task complete
_split_state_buffer renamed for ts/js consistency, _remove_buffers/_put_buffers for py/ts codes are equivalent, unittest for py side covers more cases also putting it back together, bugfix (covered by python test) for _remove_buffers, typo corrections
@maartenbreddels

Member

maartenbreddels commented Mar 18, 2017

Ok, it's more consistent now. Both the Python and JavaScript sides have a (_)remove_buffers/(_)put_buffers. remove_buffers doesn't modify existing dicts/objects (it clones where needed), but keeps references to the unmodified parts. The js side has cyclic protection (needed to support Controller); the py side does not. _put_buffers does modify the state, but that should be fine since it's coming off the wire.
I must say, the python unittest gives me a high level of trust in the code now, and the fact that the ts/js code is practically the same makes me feel comfortable with it as well.
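The putting-back half described above can be sketched like this (an illustrative sketch, not the PR's exact _put_buffers; it mutates the incoming state in place, which is fine for data that just came off the wire):

```python
def put_buffers(state, buffer_paths, buffers):
    """Walk each recorded path and set the corresponding buffer back into state.

    Mutating in place is acceptable here because the deserialized state is
    freshly created from the wire message and not shared with anything else.
    """
    for path, buffer in zip(buffer_paths, buffers):
        obj = state
        for key in path[:-1]:
            obj = obj[key]          # descend dicts by key, lists by index
        obj[path[-1]] = buffer      # dict: (re)insert the key; list: replace the None

state = {'x': {'shape': [2]}, 'l': [None, 'keep']}
put_buffers(state, [['x', 'data'], ['l', 0]], [b'ab', b'c'])
print(state)  # {'x': {'shape': [2], 'data': b'ab'}, 'l': [b'c', 'keep']}
```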

@jasongrout

Member

jasongrout commented Mar 20, 2017

I think I figured out why you needed the circular dependency check on the js side. Basically, we use Javascript's toJSON feature (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON/stringify#toJSON()_behavior) to serialize models, so the models are not serialized to "IPY_MODEL_..." strings until after the remove_buffers runs, so remove_buffers recurses into these complicated model objects. Looking into it...

@maartenbreddels

Member

maartenbreddels commented Mar 20, 2017

Yes, I was surprised to see unpack_models, but not the inverse. But I think if the toJSON can handle it, we should be able to handle it as well.

@jasongrout

I'll push a few commits with these changes...

self.assertIn('shape', state['y'])
self.assertNotIn('ar', state['x'])
self.assertFalse(state['x']) # should be empty
self.assertNotIn('data', state['x'])

@jasongrout

jasongrout Mar 20, 2017

Member

state['y']

self.assertIsNot(state['y'], y)
self.assertIs(state['y']['shape'], y_shape)
# check that the buffers paths really points to the right buffer

@jasongrout
self.assertIn('plain', state)
self.assertIn('shape', state['y'])
self.assertNotIn('ar', state['x'])
self.assertFalse(state['x']) # should be empty

@jasongrout

jasongrout Mar 20, 2017

Member

self.assertEqual(state['x'], {})

obj = obj[key]
obj[buffer_path[-1]] = buffer
def _seperate_buffers(substate, path, buffer_paths, buffers):

@jasongrout

jasongrout Mar 20, 2017

Member

separate

buffers.append(v)
buffer_paths.append(path + [i])
elif isinstance(v, (dict, list, tuple)):
vnew = _seperate_buffers(v, path + [i], buffer_paths, buffers)

@jasongrout

jasongrout Mar 20, 2017

Member

separate

buffers.append(v)
buffer_paths.append(path + [k])
elif isinstance(v, (dict, list, tuple)):
vnew = _seperate_buffers(v, path + [k], buffer_paths, buffers)

@jasongrout

jasongrout Mar 20, 2017

Member

separate

[<memory at 0x107ffec48>, <memory at 0x107ffed08>])
"""
buffer_paths, buffers = [], []
state = _seperate_buffers(state, [], buffer_paths, buffers)

@jasongrout

jasongrout Mar 20, 2017

Member

separate

args = dict(target_name='jupyter.widget', data=state)
args = dict(target_name='jupyter.widget', data={'state': state, 'buffer_paths': buffer_paths})

@jasongrout

jasongrout Mar 20, 2017

Member

dict(..., buffers=buffers) (and delete the assignment below) to keep the buffer_paths with the buffers...

else {
var new_value = remove(value, path.concat([key]));
// only assigned when the value changes, we may serialize objects that don't support assignment
if(new_value != value) {

@jasongrout
else {
var new_value = remove(value, path.concat([i]));
// only assigned when the value changes, we may serialize objects that don't support assignment
if(new_value != value) {

@jasongrout
@maartenbreddels

Member

maartenbreddels commented Mar 20, 2017

I agree, good you spotted those bugs in the unittest.

@jasongrout

Member

jasongrout commented Mar 20, 2017

@maartenbreddels - can you try this out? I think it should work, even with Controller...

@maartenbreddels

Member

maartenbreddels commented Mar 21, 2017

With the toJSON the Controller should work fine, happy to see it get merged! Great work.

@jasongrout

Member

jasongrout commented Mar 21, 2017

You did the great work here! Thanks!

@jasongrout jasongrout merged commit 17a3e45 into jupyter-widgets:master Mar 21, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
@maartenbreddels

Member

maartenbreddels commented Mar 21, 2017

Thanks for the careful and thorough review 👍

@jasongrout jasongrout modified the milestone: 7.0 Apr 18, 2017

@jasongrout jasongrout referenced this pull request Apr 18, 2017

Closed

Update Changelog #1279

jasongrout added a commit to vidartf/ipywidgets that referenced this pull request Aug 9, 2017

Change the binary types for custom serializers to memoryview, bytearray, or python 3 bytes.

We disregard the python 2 buffer type - it is obsolete, even on python 2.

See jupyter-widgets#1194