BossRemote#get_cutout parallelism #52

Merged · 7 commits into master from parallel-process-downloads · Apr 18, 2020
Conversation

@j6k4m8 (Member) commented Apr 18, 2020

You can now pass parallel=True (which uses your machine's core count) or parallel=<int number of jobs> to BossRemote#get_cutout to spread chunk downloads over a multiprocessing.Pool#starmap.

On a few informal stress-tests, this improved download speed by 37–40%.

You can still use the old behavior with parallel=False.
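
For reference, usage looks roughly like the sketch below. The collection/experiment/channel names and cutout ranges are placeholders; only the parallel keyword is new here.

```python
from intern.remote.boss import BossRemote

# Placeholder channel/collection/experiment names; credentials come from
# your usual intern configuration.
rmt = BossRemote()
chan = rmt.get_channel("my_channel", "my_collection", "my_experiment")

# Old (serial) behavior, still available:
data = rmt.get_cutout(chan, 0, [0, 1024], [0, 1024], [0, 64], parallel=False)

# Spread chunk downloads across all available cores:
data = rmt.get_cutout(chan, 0, [0, 1024], [0, 1024], [0, 64], parallel=True)

# ...or cap the pool at a fixed number of worker processes:
data = rmt.get_cutout(chan, 0, [0, 1024], [0, 1024], [0, 64], parallel=4)
```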

@j6k4m8 marked this pull request as draft April 18, 2020 14:09
@movestill (Member) left a comment

Awesome how easy that was to parallelize!

@j6k4m8 (Member, Author) commented Apr 18, 2020

I'd been putting it off for years because I figured it was going to be hard...

This isn't perfect; ideally we'd also parallelize writing to the results array, using a SharedMemory numpy array. But from a few initial experiments, you take a pretty serious IO hit... and I don't want to do the algebra to figure out when it's worth it :P

For most use-cases, that recombination step is a tiny piece of the overall time profile. For very large arrays it's a nontrivial time cost, and then it's probably worth it! But I don't know what "very large" means... Multi-GB? 100 GB? At a certain point you're swapping instead of writing to RAM, and then ain't no thang going to speed you up. So for now, I'm leaving it...
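
For concreteness, the shared-memory idea described above would look roughly like this sketch. This is not what the PR implements, and fetch_chunk is a hypothetical stand-in for the real per-chunk download; it just shows how workers could write directly into one result array via Python 3.8's multiprocessing.shared_memory.

```python
# Sketch only -- NOT what this PR does. Workers write each downloaded chunk
# straight into a shared-memory numpy array instead of returning it.
from multiprocessing import Pool
from multiprocessing.shared_memory import SharedMemory

import numpy as np


def fetch_chunk(chunk_slices):
    """Hypothetical stand-in for the real per-chunk download."""
    shape = tuple(s.stop - s.start for s in chunk_slices)
    return np.zeros(shape, dtype="uint8")


def _write_chunk(shm_name, shape, dtype, chunk_slices):
    """Runs in a worker: download one chunk and write it into place."""
    shm = SharedMemory(name=shm_name)
    result = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    result[chunk_slices] = fetch_chunk(chunk_slices)
    shm.close()


def shared_memory_cutout(shape, dtype, chunk_slice_list, jobs):
    """Assemble a cutout by letting workers write directly into shared RAM."""
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    shm = SharedMemory(create=True, size=nbytes)
    try:
        with Pool(jobs) as pool:
            pool.starmap(
                _write_chunk,
                [(shm.name, shape, dtype, s) for s in chunk_slice_list],
            )
        # Copy out before the shared segment is unlinked.
        return np.ndarray(shape, dtype=dtype, buffer=shm.buf).copy()
    finally:
        shm.close()
        shm.unlink()
```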

@j6k4m8 (Member, Author) commented Apr 18, 2020

@movestill: Questions before I merge:

  • Is there any reason we'd want the default to be non-parallelized? (I think not, but it's worth checking just in case)
  • Is there any mission-critical infrastructure currently deployed with intern that assumes that it should be running its own parallelization at the function-call level (i.e. this would lead to overscheduling)? I don't think we do this anywhere that's not easily updated (i.e. client-side notebooks etc).

My understanding is that both of these are no's, but just putting up a final check before we merge :)

@j6k4m8 marked this pull request as ready for review April 18, 2020 17:40
@movestill (Member) commented

> Is there any reason we'd want the default to be non-parallelized? (I think not, but it's worth checking just in case)

I don't think so, either.

> Is there any mission-critical infrastructure currently deployed with intern that assumes that it should be running its own parallelization at the function-call level (i.e. this would lead to overscheduling)? I don't think we do this anywhere that's not easily updated (i.e. client-side notebooks etc).

The only thing that I can think of that uses intern and may be parallelized is the ingest client's intern plugin. For that plugin, maybe we just turn parallel off when it uses get_cutout(). I don't think it'll be easy to give the plugin knowledge of how many ingest processes are running.

@j6k4m8 (Member, Author) commented Apr 18, 2020

Cool, yeah, I agree w.r.t. the ingest plugin; I'll make a complementary PR in that repo shortly! Anecdotally, I have a feeling it'd even be fine to leave it turned on there, since the ingest client will likely spend a bunch of time in phases of the ingest job other than the get_cutout portion, and this will help it saturate its downlink.

@j6k4m8 merged commit 225e1ac into master Apr 18, 2020
@j6k4m8 deleted the parallel-process-downloads branch April 18, 2020 19:58