Support serialization of HoloViews objects in Dask #2768
This is conceptually quite similar to something Philipp and I did many years ago for our PhDs. We distributed hundreds of batch simulations on a cluster, output holoviews pickles and collected all of them back into the notebook. We had a fair bit of infrastructure to do this but unfortunately it was generally too confusing to make public along with the rest of holoviews. The key point is that this is a workflow we are familiar with! That said, we were using an HPC batch system of independent processes and not using dask.
Styling information is held separate from the objects themselves - you can read a bit about the design in our SciPy 2015 paper. This means that there is some bookkeeping needed to associate the pickled elements with the styles. The way to store the styles as pickles is to use
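For completeness, here is a minimal sketch of the pickling helpers that sentence presumably refers to. I'm assuming hv.Store.dumps / hv.Store.loads (hv.Store.loads is mentioned later in this thread) are the intended entry points that wrap pickle and handle the option bookkeeping:

```python
import numpy as np
import holoviews as hv

hv.extension('bokeh')

# An element with a custom (per-object) style option attached.
img = hv.Image(np.random.rand(10, 10)).opts(cmap='viridis')

# Store.dumps/Store.loads wrap pickle.dumps/pickle.loads and also record the
# custom option tree associated with the object, so the style survives.
data = hv.Store.dumps(img)
restored = hv.Store.loads(data)
```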
We made the decision a long time ago to separate data from the details of representation. This approach keeps the elements as simple wrappers around your data, makes it easier to set global default styles, and lets elements exist independently of any particular plotting library.
The notebook extension acts not so much as a global configuration variable but as a way of loading a large chunk of Javascript into the notebook once instead of with every plot. It does set some necessary global state, such as the active renderer, e.g. to keep track of whether the user is currently using matplotlib or bokeh. You can pickle holoviews objects without any plotting library installed, unless you want to apply and preserve styles, in which case you do need to activate the appropriate renderer. You can do this with
You can do this, though applying plotting-extension-specific styles does currently assume that the corresponding plotting extension is available.
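As a concrete (hedged) illustration of activating a backend without going through the notebook extension, assuming that asking for a renderer is enough to register the plotting extension for style handling:

```python
import holoviews as hv

# Asking for a renderer imports and registers the bokeh plotting extension
# without injecting the notebook Javascript that hv.extension() loads.
renderer = hv.renderer('bokeh')

# Make bokeh the active backend so styles applied via .opts() are validated
# against it (assumption: setting Store.current_backend directly suffices).
hv.Store.current_backend = 'bokeh'
```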
This is where the difference between independent batch processes and using dask becomes relevant, and where I might suggest an alternative approach. Why build the holoviews objects on the workers? Why not work with the natively supported dask data structures (e.g. dask arrays/dataframes) and then construct the holoviews objects locally from them? Dask will then do the job of processing the data and you only need holoviews in the one place where you pull everything back together. This is the approach I would advise when using dask, whereas pickling individual holoviews objects to disk is what I would use when handling a large batch of independently run processes. Hope that helps!
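A sketch of that suggested pattern, with the heavy lifting done by dask and the holoviews objects built only on the client (the array shapes here are just illustrative):

```python
import dask.array as da
import holoviews as hv

hv.extension('bokeh')

# The expensive part runs on the cluster using dask-native structures...
stack = da.random.random((10, 256, 256), chunks=(1, 256, 256))
frames = (stack - stack.mean()).compute()  # only numpy arrays travel back

# ...and the holoviews objects are constructed locally, where the plotting
# extension and the style machinery live.
hmap = hv.HoloMap({i: hv.Image(frames[i]) for i in range(frames.shape[0])},
                  kdims='frame')
```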
One thing we should do is revisit how we handle pickles. It is possible we now have a better idea of how to get everything (i.e. including the handling of the option trees) working within the normal pickle machinery. This would mean there wouldn't need to be a distinct API and it would work more smoothly with dask.
To bump the issue a bit, I'm hitting the same problem when shipping holoviews plots through Ray:
File ~/workspace/venv/puma-lab/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:105, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
103 if func.__name__ != "init" or is_client_mode_enabled_by_default:
104 return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)
File ~/workspace/venv/puma-lab/lib/python3.10/site-packages/ray/_private/worker.py:2309, in get(object_refs, timeout)
2307 worker.core_worker.dump_object_store_memory_usage()
2308 if isinstance(value, RayTaskError):
-> 2309 raise value.as_instanceof_cause()
2310 else:
2311 raise value
RayTaskError(TypeError): ray::get_plot() (pid=36727, ip=192.168.0.107)
File "/home/toaster/workspace/venv/puma-lab/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
cp.dump(obj)
File "/home/toaster/workspace/venv/puma-lab/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 627, in dump
return Pickler.dump(self, obj)
TypeError: cannot pickle 'weakref.ReferenceType' object
I'm using dask.distributed to execute a data analysis pipeline that returns dicts of holoviews plots. Holoviews works fine if you can split your computation code (which returns pure numpy arrays or similar) from your plotting code (which takes those numpy arrays as input and returns holoviews plots; you run the latter entirely in the jupyter notebook kernel, not through dask).
For my use case, I can't do this. I want to construct holoviews ViewableElements on my remote execution nodes, serialize them and bring them back to the jupyter kernel, then display them (or combine them into Overlays/Layouts/HoloMaps, etc. before displaying them). For example, I may want to compute 1000 hv.Images on my remote nodes, bring them back to the jupyter kernel, turn them into a HoloMap, and display them as an animation.
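For concreteness, a minimal sketch of this workflow (the Client() setup and the make_image helper are hypothetical names, for illustration only):

```python
from dask.distributed import Client
import numpy as np
import holoviews as hv

hv.extension('bokeh')
client = Client()  # local cluster, for illustration

def make_image(seed):
    # Runs on a worker: the returned hv.Image has to survive (cloud)pickling
    # on its way back to the notebook kernel.
    rng = np.random.default_rng(seed)
    return hv.Image(rng.random((64, 64)))

futures = client.map(make_image, range(1000))
images = client.gather(futures)                        # serialization happens here
hmap = hv.HoloMap(dict(enumerate(images)), kdims='run')
hmap                                                   # display as an animation
```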
To make this use case possible (I can't believe I'm the only one who wants to do this), I'll need to solve the following three problems:
1. Options have to survive a round trip through pickle.dumps. In my case, I just monkey-patched hv.core.dimension.LabelledData.__setstate__ to do what hv.Store.loads does (setting hv.Store.load_counter_offset = hv.StoreOptions.id_offset() before hv.core.dimension.LabelledData.__setstate__ runs); a sketch of this patch appears after this list. I'd appreciate it if anyone has ideas about better ways to do this. In particular, I'm a bit worried about thread safety. Is there any reason not to use GUIDs instead of integer ids? That way you're never worried about clobbering/overwriting settings upon unpickling if you get the offset wrong. A conceptual question: why do options need to live in a global hv.Options._custom_options dict instead of being attached to the holoviews objects themselves?
2. hv.extension('bokeh') has to be run on the jupyter kernel as well as on all my dask workers. Going forward, I was thinking of monkey-patching hv.extension to automatically run itself on every worker in the dask cluster, so holoviews is configured the same way everywhere (an alternative sketch appears after this list). Are there other global settings I need to worry about synchronizing between the client node and the remote nodes?
3. Why does holoviews need to know which backend is selected while constructing ViewableElements? (If it's only to check that the options are valid, I'd be happy to just turn option validation off during a .options call and delay the validation until the ipython displayhook.) I had imagined that you could build a holoviews plot and later decide which backend you want to use to display it.
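A sketch of the monkey-patch described in problem 1; the general shape follows what I described above, and resetting the offset afterwards is my own addition:

```python
import holoviews as hv
from holoviews.core.dimension import LabelledData
from holoviews.core.options import Store, StoreOptions

_original_setstate = LabelledData.__setstate__

def _setstate_with_offset(self, state):
    # Mimic hv.Store.loads: shift the custom-option ids being unpickled so
    # they don't collide with ids already registered in this process.
    Store.load_counter_offset = StoreOptions.id_offset()
    try:
        _original_setstate(self, state)
    finally:
        Store.load_counter_offset = None

LabelledData.__setstate__ = _setstate_with_offset
```

And for problem 2, a hedged sketch of pushing the extension call to every currently connected dask worker instead of monkey-patching hv.extension (Client() and the helper name are illustrative):

```python
from dask.distributed import Client
import holoviews as hv

client = Client()

def _load_backend():
    import holoviews as hv
    hv.extension('bokeh')

client.run(_load_backend)   # run once on every currently connected worker
hv.extension('bokeh')       # and in the local kernel as well
```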
Thoughts?
CC @mrocklin