
Implements tree reduction in the dask layer #926

Merged 19 commits into holoviz:master on Nov 11, 2020

Conversation

o-smirnov
Contributor

Following discussions in ratt-ru/shadeMS#29 (in particular, using small chunks in the dask layer would, counterintuitively, explode RAM usage), @sjperkins replaced the current chunk aggregation code in dask.py with a tree reduction.

This has been tested extensively with https://github.com/ratt-ru/shadeMS, and was found to reduce RAM usage considerably. It has not been tested in a CUDA context at all, though, so if somebody more knowledgeable than me could take a look at it, that would be great.
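For readers unfamiliar with the idea, here is a minimal, library-free sketch of a pairwise (tree) reduction over per-chunk aggregates. The names tree_combine and chunk_aggs are hypothetical illustrations, not the PR's actual dask.py code:

```python
import numpy as np

def tree_combine(aggs, combine):
    """Pairwise (tree) reduction: combine per-chunk aggregates level by
    level instead of folding every chunk into one long-lived accumulator.
    Partial results can be freed as each level completes, which is the
    intuition behind the reduced peak RAM usage."""
    while len(aggs) > 1:
        # Combine adjacent pairs; an odd leftover is carried to the next level.
        paired = [combine(aggs[i], aggs[i + 1])
                  for i in range(0, len(aggs) - 1, 2)]
        if len(aggs) % 2:
            paired.append(aggs[-1])
        aggs = paired
    return aggs[0]

# Hypothetical per-chunk aggregates: small count rasters that sum together.
chunk_aggs = [np.full((2, 2), i) for i in range(1, 9)]
total = tree_combine(chunk_aggs, np.add)   # elementwise 1 + 2 + ... + 8
```

In the real PR the combine step merges per-chunk aggregate images rather than toy arrays, and dask schedules the tree rather than a Python loop, but the shape of the computation is the same.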

@sjperkins

It looks like a95dd1a doesn't handle the full range of types that can be present in Dataframe Arrays and is breaking the test suite. I'll work on a fix.

@sjperkins

@jbednar I'd be interested in your thoughts on this PR and the use of a Reduction Operator to aggregate the images produced in each chunk.

@o-smirnov
Contributor Author

@jbednar any chance to look at this?

@o-smirnov
Contributor Author

@jbednar, ping. It would be very nice to get this (and #927) merged so we can do a proper shadeMS release in time for ADASS. I attach a copy of my poster in the hope that it motivates: shadeMS ADASS Poster.pdf

@philippjfr
Member

I'm in favor of merging this, but are there any concrete benchmarks about the impact on performance and memory usage?

datashader/data_libraries/dask.py
@jbednar
Member

The changes here are way over my head, so I have no idea how to tell whether they are a good idea. The basic idea of a tree reduction definitely makes sense, but I wouldn't be able to detect problems with this implementation if there are any, and as @philippjfr suggests it would be good to know what the actual impact on performance is, given that it's a clearly more complex approach than the current one.

@@ -70,18 +76,103 @@ def default(glyph, df, schema, canvas, summary, cuda=False):
y_mapper = canvas.y_axis.mapper
extend = glyph._build_extend(x_mapper, y_mapper, info, append)

def chunk(df):
# Here be dragons
Member

Agreed, but explaining precisely what those dragons are would be helpful! :-)


Low-level dask graphs can represent a very broad range of operations and data. Dask Arrays and Dataframes are simply metadata describing an associated low-level dask graph. For example:

>>> dsk = {
...     ('a', 0, 0): (np.full, (2, 2), 1),
...     ('a', 0, 1): (np.full, (2, 2), 2),
...     ('a', 1, 0): (np.full, (2, 2), 3),
...     ('a', 1, 1): (np.full, (2, 2), 4)}
>>> chunks = ((2, 2), (2, 2))
>>> array = da.Array(dsk, 'a', chunks, meta=np.empty((0, 0), dtype=np.int32))
>>> array
dask.array<a, shape=(4, 4), dtype=int32, chunksize=(2, 2), chunktype=numpy.ndarray>
>>> array.compute()
array([[1, 1, 2, 2],
       [1, 1, 2, 2],
       [3, 3, 4, 4],
       [3, 3, 4, 4]])

In the above example, dsk is the low-level graph, chunks describes the purported sizes of the arrays produced by the 4 dask tasks, and meta describes the type of each chunk, here represented with an empty array. This exists to support array types other than NumPy (e.g. sparse arrays).

The dragon is that it is entirely possible to construct dsk in a way that disagrees with the metadata, and this doesn't matter at all at the graph level. The only time there's a problem is:

  1. if array.compute() is called directly (because dask will try to construct an array of type meta), or
  2. if the outputs of the array's tasks are used as inputs to tasks that expect ndarrays.

(1) doesn't apply in this case, and here a dask Array is constructed from tasks producing Pandas Dataframes, which are passed into tasks (combine, aggregate) that expect Dataframes, circumventing (2).

The final dask Array produced by the reduction describes a task that does return an ndarray, so everything works out under the hood.
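To make the "type-agnostic at the graph level" point concrete, here is a toy, stdlib-only model of task-graph execution. The get function and the dsk contents below are illustrative inventions, not dask's actual scheduler:

```python
# Toy model of a dask-style low-level graph: keys map to (func, *args)
# tuples, where an arg may itself be the key of another task. Nothing
# here validates a task's output type against any declared metadata --
# that is exactly the "dragon": graph execution is type-agnostic.
def get(dsk, key):
    func, *args = dsk[key]
    resolved = []
    for a in args:
        # Hashable args that appear in the graph are treated as task keys.
        if isinstance(a, (str, tuple)) and a in dsk:
            resolved.append(get(dsk, a))
        else:
            resolved.append(a)
    return func(*resolved)

# The 'chunk' tasks produce dicts (standing in for Pandas Dataframes),
# and 'combine' consumes dicts. Even if surrounding array metadata
# claimed the chunks were ndarrays, the graph would run fine, because
# only the tasks themselves ever see the chunk values.
dsk = {
    ('chunk', 0): (dict, [('count', 1)]),
    ('chunk', 1): (dict, [('count', 2)]),
    'combine': (lambda a, b: {'count': a['count'] + b['count']},
                ('chunk', 0), ('chunk', 1)),
}
result = get(dsk, 'combine')
```

A problem would only surface if some consumer of the graph assumed the declared metadata, which mirrors cases (1) and (2) above.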

Contributor Author

Right, so in summary, no issues in the way it's currently used in the codebase (given that everything goes via dataframes), just leaving a signpost for future developers to look out for if they start hacking the code for another purpose?


Yep, I can leave a link to the explanation in the code.

dtype = np.result_type(*dtypes)
# Create a meta object so that dask.array doesn't try to look
# too closely at the type of the chunks it's wrapping:
# they're actually dataframes, but we tell dask they're ndarrays
Member

Why?

@jbednar
Member

jbednar commented Nov 2, 2020

Oh, and the poster looks amazing! I'm so glad that Datashader has been useful for this, and I apologize that it's taken so long for me to catch up to all the cool things you guys are doing! I particularly like your "Principle of Maximal Beigeness", and plan to use that in my own work. :-)

@o-smirnov
Contributor Author

but are there any concrete benchmarks about the impact on performance and memory usage?

I ran a few measurements for ratt-ru/shadeMS#29 and ratt-ru/shadeMS#34 back in the day. Basically, without this PR, my plotting problems were blowing out the RAM (on a 512GB box), so the comparison was rather binary. But I can try to do some tests with a reduced dataset.

@o-smirnov
Contributor Author

@philippjfr, I added some benchmarks here: ratt-ru/shadeMS#34 (comment).

Bottom line is, it depends on many factors, but the tree reduction version never does worse, and in some regimes (big problems, medium chunk sizes) does considerably better.

@o-smirnov
Contributor Author

This is perhaps the most striking example of where the master branch almost runs aground -- check that RAM curve just going up and up -- while tree reduction just trundles along happily. (This tends to happen when a large number of categories, 64 in this case, is employed):

[Figures: memory-usage profiles for the master version (left) vs. tree reduction (right)]

@jbednar
Member

jbednar commented Nov 11, 2020

I've fixed tests on master, so could you please rebase?

@o-smirnov
Contributor Author

All good to go.

@jbednar jbednar merged commit 1f54bcf into holoviz:master Nov 11, 2020
@jbednar
Member

jbednar commented Nov 11, 2020

Thanks for the contribution! Sorry I couldn't get the tests fixed in time for the conference presentation; all I accomplished last week is checking election results about 40,000 times a day. :-/ Should be ready to make a new release soon.

@o-smirnov
Contributor Author

Aye, I think the election immobilized everyone... Anyway, I realized we'll need a release here before I can release shadeMS to PyPI, so it was never going to be on time. Motivated people have been able to install off the github branch anyway.

@jbednar
Member

jbednar commented Nov 11, 2020

I've tagged a dev release that's building now; see https://travis-ci.org/github/holoviz/datashader/builds/743042551 . If it completes successfully, you'll be able to do conda install -c pyviz/label/dev datashader=0.11.2a1 to try it out with this PR's contents and also #927's.

@o-smirnov
Contributor Author

Thanks -- any idea why I can't pip install --pre datashader==0.11.2a1? The badge for the dev release is up on https://pypi.org/project/datashader/, but pip refuses to see it. I've never played with pip and dev releases before, so I'm fully ignorant here...

@jbednar
Member

jbednar commented Nov 12, 2020

The conda package is up and I've verified that it installs successfully using conda install -c pyviz/label/dev datashader=0.11.2a1. Looks like the pip build didn't run because of a previous unrelated job, so I restarted it. When https://travis-ci.org/github/holoviz/datashader/jobs/743042557 completes I'd hope that either a0 or a1 will be installable with pip (not sure which one this run is, as I tagged it twice! :-).

@o-smirnov
Contributor Author

Build failed:

pkg_resources.DistributionNotFound: The 'rfc3986>=1.4.0' distribution was not found and is required by the application

@jbednar
Member

jbednar commented Nov 12, 2020

Yep; looks like twine has undeclared dependencies. :-( I've forced those to be installed, with the new build at https://travis-ci.org/github/holoviz/datashader/builds/743245089

(It's been running for nearly 3 hours now!) The result will be v0.11.2a2 if it succeeds.

@jbednar
Member

jbednar commented Nov 13, 2020

It did not; apparently a further missing dependency. Another try: https://travis-ci.org/github/holoviz/datashader/builds/743321413

@o-smirnov
Contributor Author

Failed again, sadly...

@jbednar
Member

jbednar commented Nov 13, 2020

Aye. Still building with another minor tweak. Rather than me posting all the URLs each time, just look for a green one with a version number in this table: https://travis-ci.org/github/holoviz/datashader/builds/
Then when there is one, that's the version that can be installed on pip. (It's already installable on conda.)

Note that this is only a dev release. To get a real release, we have to make sure the new categorical stuff works ok on GPUs (even if just to raise an appropriate error message), fix line rendering on GPUs (broken by a different PR), and firmly commit to abandoning Python2 (as the recent PRs are only working in Python3). Not trivial!

@jbednar
Member

jbednar commented Nov 13, 2020

The pip build finally completed, and says that it uploaded datashader-0.11.2a5.tar.gz to https://test.pypi.org/legacy/ (https://travis-ci.org/github/holoviz/datashader/jobs/743444276). I assume that means that it should be available from PyPI, but pip install --pre datashader==0.11.2a5 doesn't find any alpha releases. Anyone know whether the upload command needs updating? I don't have any experience with pip dev releases myself...

@o-smirnov
Contributor Author

Not me... @Athanaseus or @gijzelaerr, any idea?

@gijzelaerr

gijzelaerr commented Nov 15, 2020

I think PyPI probably doesn't handle version extensions like a* very well, and I recommend against using them in general; just use simple x.y.z versioning.

@o-smirnov
Contributor Author

Thing is, I remember pushing out packages like radiopadre 1.0pre9, and PyPI was fine with that (and knew that 1.0pre9 < 1.0). So I'm not sure what exactly is different here!
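As a side note, PEP 440 version ordering itself handles both suffix spellings, so version parsing is unlikely to be the difference. A small sketch with the third-party packaging library (assuming it is installed) confirms that such pre-releases sort before the final release:

```python
from packaging.version import Version

# PEP 440 treats "a1" as an alpha pre-release and normalizes "pre9"
# to the release-candidate spelling "rc9"; both compare as less than
# the corresponding final release.
assert Version("0.11.2a1") < Version("0.11.2")
assert Version("1.0pre9") < Version("1.0")
assert Version("0.11.2a1").is_prerelease
```

This supports the point made below: the versions order correctly, and the real difference turns out to be which index the packages were uploaded to.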

@philippjfr
Member

The difference is that we upload dev releases to test.pypi.org, not the main pypi.org repo. We've long debated changing that, but for now @jbednar or I could manually copy it over.

@jbednar
Member

jbednar commented Nov 15, 2020

If pip handles dev releases properly at the main repo, i.e. pip install datashader only installs actual releases, then I think they should automatically go to the main repo. Alternatively, is there an easy way for people to install dev releases from test.pypi.org?

@jbednar
Member

jbednar commented Nov 15, 2020

In any case, I do seem to remember pip now handling dev releases fine, though I don't remember the details, and we definitely aren't ready to make a full x.y.z release while CUDA and py2 are so badly broken.

@philippjfr
Member

Yes, support for --pre is pretty universal at this point; only very old pip installations won't handle it, I believe. For now you can add --index-url https://test.pypi.org --pre to install from the other repo.

@o-smirnov
Contributor Author

For now you can add --index-url https://test.pypi.org --pre to install from the other repo

But I can't put that into my setup.py, or can I? For dev purposes I'm fine installing straight off github -- it's for the pre-release that I need a package on pypi.

@gijzelaerr

Two things:

  • afaik test.pypi.org is for testing the upload of your package, not for distributing alpha packages.
  • A custom upload repository is a PyPI configuration setting, and should be set in the PyPI configuration rather than in setup.py.

@philippjfr philippjfr added this to the v0.12.0 milestone Dec 1, 2020