Parallel copy #87

Merged
rclark merged 6 commits into master from parallel-copy on Oct 8, 2014

Conversation

@rclark
Contributor

rclark commented Oct 2, 2014

Allows you to split a read operation into an arbitrary number of jobs. Pass a job object in the options when using tilelive.createReadStream or tilelive.deserialize:

var readable = tilelive.createReadStream(src, { type: 'scanline', job: { total: 4, num: 1 } });

This instructs tilelive to read only the tiles that fall into job 1 of 4. A complete read requires four calls, each with a different num (1 through 4).
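As a rough illustration of the job split, here is a minimal sketch that assumes tiles are assigned with the `x % total === num - 1` rule mentioned in the to-do list. `inJob` is a hypothetical helper written for this example, not part of tilelive's API.

```javascript
// Hypothetical helper: does a tile in column `x` belong to this job?
// `job.num` is 1-based, so job 1 takes columns where x % total === 0.
function inJob(x, job) {
  return x % job.total === job.num - 1;
}

// Walk a few tile columns and record which of the four jobs claims each.
var total = 4;
var columns = [0, 1, 2, 3, 4, 5, 6, 7];
var assignment = columns.map(function (x) {
  for (var num = 1; num <= total; num++) {
    if (inJob(x, { total: total, num: num })) return num;
  }
});

console.log(assignment); // [ 1, 2, 3, 4, 1, 2, 3, 4 ]
```

Under this assumption every column is claimed by exactly one job, so four calls with num 1 through 4 together cover the whole read.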

Still to-do:

  • deserialize shouldn't use the same x % total === num - 1 approach that the other streams do, because that means deserializing every row before throwing out the ones that aren't part of the current job. It should skip based on row number instead. Done.
  • other ideas for tests?

@coveralls

Coverage Status

Coverage increased (+0.15%) when pulling ef301a6 on parallel-copy into 580d44f on master.

@rclark
Contributor Author

rclark commented Oct 2, 2014

In the failing tests, as the ratio of jobs to total tiles increases, the distribution of tile reads across jobs gets very bad very quickly.

Because of the approach here (dividing jobs based on tile.x values), getting an even distribution is going to be very tileset-dependent. I think the next step is to look at this distribution in some more true-to-life situations to determine whether it is a reasonable approach.

@yhahn
Member

yhahn commented Oct 2, 2014

@rclark I'm not too concerned about even distribution across jobs. Any approach (bbox, etc) is going to be datasource dependent unless the approach involves knowing the geographic shape/density of the data beforehand.

I'll keep pondering the pyramid question tonight.

@rclark
Contributor Author

rclark commented Oct 2, 2014

The horrible distribution I was seeing is because the modulus approach will never put tiles into jobs where job num > max tile.x. Weird implication: the further west your copy operation is, the less you can benefit from parallelization.
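The effect can be reproduced with a small sketch, assuming the `x % total === num - 1` rule and a tileset whose columns run from 0 to `maxX`. `tilesPerJob` and its parameters are hypothetical names used only for illustration.

```javascript
// Count how many tiles each job would read when columns 0..maxX
// are split with the x % total rule. Index i holds the count for job i + 1.
function tilesPerJob(maxX, rowsPerColumn, total) {
  var counts = [];
  for (var i = 0; i < total; i++) counts.push(0);
  for (var x = 0; x <= maxX; x++) {
    counts[x % total] += rowsPerColumn;
  }
  return counts;
}

// Four columns of four tiles each, split across eight jobs.
var counts = tilesPerJob(3, 4, 8);
console.log(counts); // [ 4, 4, 4, 4, 0, 0, 0, 0 ]
```

With only four columns, jobs 5 through 8 never receive a tile, so adding jobs beyond the number of low-numbered columns gives no further speedup.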

@rclark
Contributor Author

rclark commented Oct 2, 2014

@yhahn I managed to write stream-pyramid.js such that:

  • Tiles in zoom levels where num tiles < num jobs are all fed to job 1
  • modulus-splitting occurs on tiles at the zoom level where num tiles >= num jobs. Each job gets a set of these tiles and renders out its children pyramid-style.
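One way to picture the two rules above is the following sketch. It is not the code in stream-pyramid.js: it assumes a whole-world pyramid (4^z tiles at zoom z) and takes the modulus over a row-major tile index at the split zoom, and `splitZoom` and `jobFor` are hypothetical names.

```javascript
// Lowest zoom level that has at least as many tiles as there are jobs,
// assuming a full-world pyramid with 4^z tiles at zoom z.
function splitZoom(total) {
  var z = 0;
  while (Math.pow(4, z) < total) z++;
  return z;
}

// Which job (1-based) handles tile z/x/y?
function jobFor(z, x, y, total) {
  var sz = splitZoom(total);
  // Zoom levels with fewer tiles than jobs all go to job 1.
  if (z < sz) return 1;
  // Find the tile's ancestor at the split zoom; the whole sub-pyramid
  // below that ancestor belongs to the ancestor's job.
  var shift = z - sz;
  var ax = x >> shift;
  var ay = y >> shift;
  var index = ay * Math.pow(2, sz) + ax;
  return (index % total) + 1;
}

console.log(jobFor(0, 0, 0, 4)); // 1
console.log(jobFor(1, 1, 1, 4)); // 4
console.log(jobFor(3, 7, 7, 4)); // 4
```

Because a tile inherits its job from its ancestor at the split zoom, each job can render its assigned tiles and all of their children without coordinating with the others.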

The logic is a mess and I am not proud of this.

@rclark rclark force-pushed the parallel-copy branch 2 times, most recently from 1600f63 to 503c036 October 7, 2014 18:43
@rclark
Contributor Author

rclark commented Oct 7, 2014

Okay @yhahn I fell back on straight bbox-splitting with low-zoom and along-the-boundaries duplication of rendered tiles. The logic is certainly cleaner.
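The thread does not say how the bbox is divided, so the following is only a sketch that assumes equal-width longitude strips. `jobBounds` is a hypothetical helper, and bounds are given as [west, south, east, north].

```javascript
// Return the slice of `bounds` that a given job is responsible for,
// cutting the full extent into `job.total` equal-width longitude strips.
function jobBounds(bounds, job) {
  var west = bounds[0];
  var east = bounds[2];
  var width = (east - west) / job.total;
  return [
    west + width * (job.num - 1),
    bounds[1],
    west + width * job.num,
    bounds[3]
  ];
}

var world = [-180, -85, 180, 85];
var strips = [1, 2, 3, 4].map(function (num) {
  return jobBounds(world, { total: 4, num: num });
});

console.log(strips[0]); // [ -180, -85, -90, 85 ]
console.log(strips[3]); // [ 90, -85, 180, 85 ]
```

A tile that straddles a strip edge, or a low-zoom tile that covers several strips, intersects more than one job's bounds, which is where the duplicated rendering comes from.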

@yhahn
Member

yhahn commented Oct 7, 2014

@rclark 👍

@rclark rclark changed the title from "[wip] Parallel copy" to "Parallel copy" Oct 7, 2014
@rclark
Contributor Author

rclark commented Oct 7, 2014

The deserialization stream is no longer mod-split based on stream order. Instead it pulls the X value out of the serialized data via regex. JSON.parse, even without decoding the buffer, is noticeably costly when running hundreds of jobs.
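A minimal sketch of the regex idea, assuming each serialized row is a single JSON line with a numeric `"x"` key and a base64 `"buffer"` value. That format, the pattern, and the `lineInJob` helper are assumptions made for this example, not tilelive's actual implementation.

```javascript
// Match the tile column without parsing the whole line as JSON.
var xPattern = /"x":(\d+)/;

// Decide whether a serialized row belongs to this job by reading only x.
function lineInJob(line, job) {
  var match = xPattern.exec(line);
  if (!match) return false; // not a tile row (e.g. a header line)
  return Number(match[1]) % job.total === job.num - 1;
}

var lines = [
  '{"z":2,"x":0,"y":1,"buffer":"AAAA"}',
  '{"z":2,"x":1,"y":1,"buffer":"BBBB"}',
  '{"z":2,"x":2,"y":1,"buffer":"CCCC"}',
  '{"z":2,"x":3,"y":1,"buffer":"DDDD"}'
];

// Job 1 of 2 keeps the even columns; only those rows need JSON.parse.
var kept = lines.filter(function (line) {
  return lineInJob(line, { total: 2, num: 1 });
});

console.log(kept.length); // 2
```

Rows that belong to other jobs are dropped before any parsing happens, and swapping serialization formats would mean replacing only the pattern and the extraction step.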

I tried to set it up in an abstracted-enough way that it should be clear what would have to be done if you wanted to change serialization formats.

@yhahn
Member

yhahn commented Oct 8, 2014

@rclark 👍 want to merge + roll 5.3.0 or so?

rclark pushed a commit that referenced this pull request Oct 8, 2014
@rclark rclark merged commit 79b4dc7 into master Oct 8, 2014
@rclark rclark deleted the parallel-copy branch October 8, 2014 16:00