
[concept] Queue length limit #726

Closed · 2 commits

Conversation

@monfera (Contributor) commented Jul 7, 2016

A client bug report about seemingly unjustified memory growth led, among other things, to the discovery that data is retained in the undo/redo queue. As I'm not too familiar with this feature, I can only suggest that there be a default (overridable) limit on the queue, or at least an optional setting for the maximum queue length. I'd be glad to pass the baton on this one, given the numerous WebGL improvements I still need to work on.
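A bounded queue of the kind proposed here could look roughly like the following sketch (the names are illustrative, not the actual plotly.js Queue API): once the queue exceeds its limit, the oldest entry is dropped so retained memory levels out.

```javascript
// Hypothetical bounded undo/redo queue: when the number of entries
// exceeds maxLength, the oldest entry is discarded.
function makeBoundedQueue(maxLength) {
    var entries = [];
    return {
        add: function(undoCmd, redoCmd) {
            if (maxLength === 0) return;  // queueing disabled entirely
            entries.push({undo: undoCmd, redo: redoCmd});
            if (entries.length > maxLength) entries.shift();  // drop oldest
        },
        length: function() { return entries.length; }
    };
}

var q = makeBoundedQueue(3);
for (var i = 0; i < 10; i++) q.add('undo' + i, 'redo' + i);
// q now retains only the 3 most recent entries
```

A limit of 0 would disable queueing altogether, which is the zero-memory-overhead case discussed later in this thread.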

@etpinard (Contributor) commented Jul 7, 2016

@monfera you bring up a very interesting point here.

Having some sort of queue limit length makes sense to me.

But perhaps we could go even further by not storing data_array updates (which can be large) into the queue.

@chriddyp @bpostlethwaite does the new plotly workspace use Plotly.Queue for undo/redo operations? If so, to what extent? In the old workspace, undo/redo isn't exposed for data arrays.

@rreusser can you think of any side-effects for animations?

@bpostlethwaite (Member) commented:

We don't know what will be happening with undo/redo yet.

@bpostlethwaite (Member) commented Jul 7, 2016

That said, I think data arrays are OK as long as there is a limit. Most folks have a couple of gigabytes of memory at their disposal, which means hundreds of millions to billions of array values can be stored.

We just want the heap to level out at some point. A big heap is not necessarily bad when the user has a lot of data.
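As a rough check of the arithmetic above (assuming 8-byte IEEE 754 doubles and no boxing overhead; the variable names are just illustrative):

```javascript
// Back-of-envelope: 2 GiB of headroom at 8 bytes per double-precision value.
var bytes = 2 * 1024 * 1024 * 1024;   // 2 GiB
var doubleValues = bytes / 8;         // 268435456, roughly a quarter billion
// With boxed values or duplicated copies the practical count is lower.
```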

@rreusser (Contributor) commented Jul 7, 2016

Re-reading that more carefully, regarding side effects on animations:

If you alter a frame, it's up to the user/implementer to pass a data_array with the same reference. If that's done correctly, the queue shouldn't accumulate giant amounts of memory unless you're actually modifying the data_arrays.
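A minimal illustration of the reference point: if a queue stores the same array reference rather than a fresh copy, no additional array memory is retained.

```javascript
// Illustration only: storing a reference vs. a copy of a data array.
var data = [1, 2, 3, 4, 5];

var queueByRef = [];
queueByRef.push(data);            // same reference; no new array allocated

var queueByCopy = [];
queueByCopy.push(data.slice());   // fresh copy; doubles the retained memory

// queueByRef[0] is the very same object as data, while queueByCopy[0]
// is a distinct array with equal contents.
```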

@monfera (Contributor, Author) commented Jul 7, 2016

In agreement with @rreusser: I believe a key benefit of WebGL (i.e. one main reason to use it) is its ability to handle large amounts of data, so a queue limit is especially important there. Luckily, WebGL requires a rather different implementation of animation. It's a big topic, but in any case animation is usually not done by re-exposing a GL plot to a new set of data, except as a last resort (there are about ten different approaches and lots of constraints, e.g. CPU-to-GPU bandwidth, unknowable GPU memory size, multiple geometries on the GPU, and the possibility of doing some types of animation in the shaders by just changing uniforms).

@monfera (Contributor, Author) commented Jul 7, 2016

@rreusser re: using the same reference: maybe my comment is irrelevant or shows my ignorance of the bridge implementation, but unless there's something expressly covering this case, R (and probably Python and other bindings) would send fresh arrays.

@bpostlethwaite looking at the implementation, a data set is often pasted into multiple places (sometimes under the same object ID, though I haven't checked all cases), and sometimes there is derived data, hover/pick-related data etc., which in effect acts as a small integer multiplier on the data array size. This doesn't alter your point, but the memory footprint can be larger than a user's estimate.

Also, except for the WebGL data transfers, plotly uses untyped arrays, where, especially if there are non-IEEE 754 values like null, the value representation may be boxed (larger than a pure numeric array). And folks who don't use high-end machines get Intel or AMD integrated GPUs by default (e.g. the MacBook Retina 13" or lower-end 15"), so GPU memory is taken out of main memory.

It's a larger topic, but the upshot is that with WebGL it will eventually be useful to just push the data to the GPU and not retain any large arrays on the CPU side.
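The boxing point can be demonstrated with a small example: typed arrays are compact (8 bytes per element) but coerce non-numeric entries, so data with null gap markers has to live in plain (potentially boxed) arrays.

```javascript
// Plain arrays can hold null "gap" markers, but each element is a tagged
// JS value that the engine may box; typed arrays are compact but coerce
// non-numeric entries (null becomes 0, undefined becomes NaN).
var plain = [1.5, null, 2.5];                    // null survives as a gap marker
var typed = new Float64Array([1.5, null, 2.5]);  // typed[1] is 0, gap info lost
```

This is why, as noted above, a "couple of gigabytes" of headroom can translate to far fewer usable values than the pure 8-bytes-per-double estimate suggests.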

@rreusser (Contributor) commented Jul 7, 2016

Thoughts on assessing memory usage:

  • window.performance.memory: proprietary, Chrome-only. Progressive enhancement?
  • object-size totally fails to account for duplicate data_array references.
  • A WeakMap is a good way to reference things for analysis without actually retaining them, but again, compatibility. (Edit: not so simple… not iterable… but interesting)
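The WeakMap idea can be sketched as follows: annotate data arrays for analysis without the map itself keeping them alive. As the edit above notes, WeakMaps are not iterable, so the entries cannot be walked later.

```javascript
// Tag objects with metadata without preventing garbage collection.
// A WeakMap holds its keys weakly: once no other reference to the array
// exists, both the array and its metadata entry become collectable.
var seen = new WeakMap();

function tag(arr, label) {
    seen.set(arr, {label: label, length: arr.length});
}

var bigArray = new Array(1000).fill(0);
tag(bigArray, 'trace0.x');
// seen.get(bigArray) yields the metadata while bigArray is alive.
// Note: WeakMap has no size/forEach, so entries cannot be enumerated.
```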

@monfera (Contributor, Author) commented Jul 7, 2016

@rreusser it wasn't exactly plain sailing (due to multiple issues entangled with similar effects), but the Chrome dev tools led me to the places where memory was retained, e.g. via the awesome heap snapshot comparison (and things like sorting by real memory footprint). It can show the objects and who holds references to them, so as long as we do the proper dereferencing on disuse, other reasonable browsers should mark the unreferenced objects for GC too.

@monfera (Contributor, Author) commented Jul 8, 2016

Is someone interested in taking over this, given the possible effects on the plot.ly web builder and the R, Python etc. interfaces? A quick solution would help: as long as I can't configure a zero-length queue in master, I need to keep merging my local dev branches into a tmp branch into which I also merge this branch, otherwise memory use is unmeasurable. If no one can take this task now, I can just change the limit to Infinity and allow user (plotly.js API) configurability of this limit, in expectation of quickly merging an upward-compatible change (though even this would take time away from the ongoing WebGL-specific improvements).

@bpostlethwaite @etpinard I guess the solution depends on the customer use case, i.e. do they need an undo/redo queue with both large datasets and (unlimited, or long) undo archival of changes to those large datasets? If both are needed, that naturally leads to a growing memory footprint even if there's some room to store the data more compactly.

@bpostlethwaite (Member) commented Jul 8, 2016 via email

@monfera (Contributor, Author) commented Jul 8, 2016

@bpostlethwaite yes, it's "on" even in the plain vanilla plotly.js bundle, which is what I was using. But I'm glad to double-check whether I get the same behavior from a CDN-uploaded bundle (on the off-chance there's some bundle control option that's set differently for the CDN vs. the development bundle).

@monfera (Contributor, Author) commented Jul 8, 2016

@bpostlethwaite the undo/redo queue is on for https://cdn.plot.ly/plotly-latest.js
[screenshot omitted]

@etpinard (Contributor) commented Jul 8, 2016

> do you know if the queue is used in Python and R?

I'm pretty sure that no, they don't use it. But @yankev @cpsievert ⏫ can you confirm?

> If it is just the old Plotly Workspace then it seems like setting it to 0 length by default and altering Workspace code makes a lot of sense.

👍

> Is someone interested in taking over this,

Yep. I'll wrap this up before 1.15.0

@etpinard self-assigned this Jul 8, 2016

@monfera (Contributor, Author) commented Jul 8, 2016

@etpinard thank you, I'm continuing with the streaming load.

@yankev (Contributor) commented Jul 8, 2016

@etpinard I don't believe so either. The Python API never really comes into contact with plotly.js code aside from offline mode, which, as far as I know, only calls newPlot.

@timelyportfolio (Contributor) commented:

R never calls the queue in the JavaScript for its htmlwidget.

@monfera (Contributor, Author) commented Jul 11, 2016

@etpinard some good news: with large datasets, there's a lot of avoidable CPU use and allocation by undo/redo outside queue.js. This line accounts for most of the runtime when there are tens of thousands (or more) of points: https://github.com/plotly/plotly.js/blob/master/src/plot_api/plot_api.js#L2036

Within this line, the actual cost comes from Lib.extendDeep, as redoit stores large arrays:
[profiler screenshot omitted]

As the default queue length is slated to be 0 (overridable via some mechanism), part of this line, and perhaps the code above it that generates redoit, could be made conditional.
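A sketch of making the copy conditional (queueLength, extendDeep, and the function name below are stand-ins, not the exact plotly.js internals): the expensive deep copy is only paid for when the queue will actually retain the entry.

```javascript
// Hypothetical guard: only deep-copy when the undo/redo queue is enabled;
// otherwise emit the event payload with the original objects.
function restylePayload(redoit, traces, queueLength, extendDeep) {
    if (queueLength > 0) {
        // the queue retains the entry, so it needs its own copy
        return [extendDeep({}, redoit), traces.slice()];
    }
    // nothing retains the entry beyond the event emission itself
    return [redoit, traces];
}
```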

I came across this while profiling the time consumed by the repeated addition of 1000 points (test bench), which showed Plotly.restyle executing more and more slowly. I was happy to run into this low-hanging fruit that happened to also relate to the queue.

Numbers: given 100k points, adding 1k new points takes:

  • around 340ms if the line is intact
  • around 90ms if the line is commented out

None of this changes the fact that adding points takes longer as the data size grows: adding 1k points to 1M preexisting points takes 820ms, while adding 1k points to 1k preexisting points takes around 15ms. But if this line were left unchanged, it would significantly limit the planned GL speedup.

@monfera (Contributor, Author) commented Jul 11, 2016

... I'm also wondering whether the fact that the data is deep-copied even matters for the Workspace: maybe the event emission doesn't lead down to a code path where the redoit array contents are destructively modified. If so, simply removing the deep copy might be sufficient:

gd.emit('plotly_restyle', [redoit, traces]);

@monfera (Contributor, Author) commented Jul 11, 2016

It turns out that slightly modifying Lib.extendDeep to take a shortcut for flat arrays (which the large arrays are) achieves nearly the same speedup as not calling Lib.extendDeep on the redoit data at all: #732
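The flat-array shortcut can be sketched as follows (a simplified stand-in for Lib.extendDeep, not the actual #732 diff): a flat array can be copied with one slice() call instead of element-by-element recursion, which is what makes the copy cheap for large numeric data.

```javascript
// Simplified deep-extend with a fast path for flat arrays.
function extendDeepSketch(target, source) {
    for (var key in source) {
        var val = source[key];
        if (Array.isArray(val) && val.every(function(v) {
            return v === null || typeof v !== 'object';
        })) {
            target[key] = val.slice();   // flat array: one cheap copy
        } else if (val !== null && typeof val === 'object') {
            target[key] = extendDeepSketch(Array.isArray(val) ? [] : {}, val);
        } else {
            target[key] = val;           // primitive: plain assignment
        }
    }
    return target;
}
```

A real implementation would likely avoid scanning the whole array for the flatness check (e.g. by inspecting only the first element), but the structure is the same.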

@etpinard (Contributor) commented:

a few more things to consider from #732

More generally, I'm thinking that we may not need to extend the redoit objects and traces arrays before emitting plotly_restyle and plotly_relayout. Instead, we should extend them inside Queue.add, which, together with #726, will speed up the (most common) case where the Queue is not activated.

@etpinard if you want to deeply extend [redoit, traces], whether here or in the queue, I'd recommend either of:

  • using @rreusser's extendDeepNoArrays if you're OK with not deep-copying the arrays; in that case you'd need to make the call to extendDeepNoArrays in some loop, because traces itself is an array, and extendDeepNoArrays just takes all arrays as pointers, if I understand it correctly
  • or using this PR, if you want to stick to the current extendDeep semantics (which gives fresh arrays), or if you don't want to manually pry apart and loop over things a layer or two below [redoit, traces]
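The difference between the two options can be illustrated with a toy stand-in (copyNoArrays below is a simplified sketch, not @rreusser's actual extendDeepNoArrays): one semantics keeps array references as-is, the other produces fresh arrays.

```javascript
// Toy contrast between the two copy semantics discussed above:
// copy objects deeply, but keep array references untouched.
function copyNoArrays(source) {
    var out = {};
    for (var key in source) {
        var val = source[key];
        if (Array.isArray(val)) out[key] = val;                      // keep the pointer
        else if (val !== null && typeof val === 'object') out[key] = copyNoArrays(val);
        else out[key] = val;
    }
    return out;
}

var src = {x: [1, 2, 3], meta: {name: 'trace0'}};
var noArr = copyNoArrays(src);
// noArr.x is the same array object as src.x (no array memory duplicated),
// while noArr.meta is a fresh object with equal contents.
```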

@etpinard closed this Jul 11, 2016
@etpinard (Contributor) commented:

closed in favour of #737
