Split a pmtiles file #25

Closed
msbarry opened this issue Jan 19, 2022 · 6 comments

Comments

@msbarry

msbarry commented Jan 19, 2022

Some hosts (like GitHub Pages) have maximum file sizes. Alternatives like https://github.com/phiresky/sql.js-httpvfs provide a way to split the tile archive into pieces that are each smaller than that limit (https://github.com/phiresky/sql.js-httpvfs/blob/master/create_db.sh). Would it be possible for the pmtiles reader and writer to optionally support splitting a pmtiles file?
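
For illustration, here is a minimal sketch of what a reader would have to do if an archive were cut into fixed-size pieces: translate a byte range of the original file into per-piece ranges. The function name and the fixed `chunk_size` scheme are assumptions made for this example; they are not part of pmtiles or sql.js-httpvfs.

```python
from typing import List, Tuple


def chunks_for_range(offset: int, length: int, chunk_size: int) -> List[Tuple[int, int, int]]:
    """Map a byte range of the unsplit archive onto fixed-size pieces.

    Returns (piece_index, start_in_piece, end_in_piece_exclusive) tuples.
    Hypothetical helper: assumes every piece except the last is exactly
    chunk_size bytes long.
    """
    if offset < 0 or length <= 0 or chunk_size <= 0:
        raise ValueError("offset must be >= 0; length and chunk_size must be > 0")

    pieces = []
    pos = offset
    end = offset + length
    while pos < end:
        index = pos // chunk_size          # which piece holds this byte
        start = pos % chunk_size           # where the read starts inside that piece
        stop = min(chunk_size, start + (end - pos))
        pieces.append((index, start, stop))
        pos += stop - start
    return pieces
```

A tile that straddles a piece boundary turns into two requests, which is one of the costs of splitting.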

@bdon
Member

bdon commented Jan 20, 2022

Thought about this for a bit and here's what I think are the benefits/drawbacks:

  • if we were to allow split archives, we'd either want an entire leaf directory and its tiles to be in the 2nd archive, or all directories remain in the 1st archive with pointers into the 2nd for tiles. Either way, it makes writing more complicated (you can't just stream all tiles into one file in one shot) and storing filenames or file indexes in the directory entries would break the fixed-width entry design.
  • One of the value propositions of PMTiles is to hold and manage the entire archive in a single file, instead of managing thousands of different directories/files. This is obviously nice for UX reasons, like uploading and versioning data, but also has the technical benefit of working with ETag semantics over HTTP. A split archive would have one ETag per file, so a single version of a resource would no longer have a single identifier. (see ETags: mismatch detection and Access-Control-Expose-Headers #24)
  • The primary target of the PMTiles design is commodity storage platforms like S3 which support an effectively unlimited object size.
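
To make the fixed-width point in the first bullet concrete, here is a rough sketch of a directory made of equal-sized entries. The 17-byte layout used here (z, x, y, offset, length) is an assumption for illustration and has not been checked against the PMTiles spec; `pack_entry` and `entry_at` are hypothetical helpers.

```python
# Assumed layout, illustration only: z (1 byte) + x (3) + y (3) + offset (6) + length (4).
ENTRY_SIZE = 17


def pack_entry(z: int, x: int, y: int, offset: int, length: int) -> bytes:
    """Serialize one directory entry into exactly ENTRY_SIZE bytes."""
    return b"".join([
        z.to_bytes(1, "little"),
        x.to_bytes(3, "little"),
        y.to_bytes(3, "little"),
        offset.to_bytes(6, "little"),
        length.to_bytes(4, "little"),
    ])


def entry_at(directory: bytes, i: int):
    """Read the i-th entry by plain arithmetic, with no scanning or parsing."""
    raw = directory[i * ENTRY_SIZE:(i + 1) * ENTRY_SIZE]
    if len(raw) != ENTRY_SIZE:
        raise IndexError("entry index out of range")
    z = raw[0]
    x = int.from_bytes(raw[1:4], "little")
    y = int.from_bytes(raw[4:7], "little")
    offset = int.from_bytes(raw[7:13], "little")
    length = int.from_bytes(raw[13:17], "little")
    return z, x, y, offset, length
```

Because every entry has the same size, entry `i` always starts at byte `i * ENTRY_SIZE`. A variable-length filename in each entry would break that direct indexing, and even a fixed-size file index would change the entry width.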

GitHub Pages seems to be meant for versioned code/docs and some associated assets, so I don't think it's a great fit as a primary target for tile archive hosting, though being free+fast is nice and you can accomplish the same thing by expanding to directories/archives. Are there other examples out there where we need to split archives to a max piece size? 32-bit systems might be one but I'd rather not consider that in scope.

@msbarry
Author

msbarry commented Jan 20, 2022

Agreed: since the goal of pmtiles is to combine many files into one, it may not make sense to split them back out again... According to https://github.com/phiresky/sql.js-httpvfs, the benefits of splitting a large file that the client reads with byte-range requests are:

> This is needed if your hoster has a maximum file size. It can also be a good idea generally depending on your CDN since it allows selective CDN caching of the chunks your users actually use and reduces cache eviction.

Also, with something like S3, is it possible to allow only range requests? One concern with hosting a tileset on S3 is that a client could send a request without a Range header and accidentally start downloading the whole thing, which could run up bandwidth costs quickly. A split archive would partially mitigate that concern, but maybe it's not really an issue in practice?

@bdon
Member

bdon commented Jan 20, 2022

It's not possible on raw S3 to allow only range requests. That concern is somewhat mitigated by having clients implement a rudimentary check as shown on this line: https://github.com/protomaps/PMTiles/blob/master/js/index.src.mjs#L71
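
As a rough sketch of what such a client-side check could look like (not necessarily what the linked line does): after sending a request with a Range header, the client refuses to read the body unless the server answered with a partial response. `RangeNotHonoredError` and `check_range_response` are hypothetical names for this example.

```python
from typing import Mapping


class RangeNotHonoredError(Exception):
    """Raised when a server appears to have ignored the Range header."""


def check_range_response(status: int, headers: Mapping[str, str], requested_length: int) -> None:
    """Raise before any body is read if the response is not the requested slice.

    Assumption: the caller passes the status code and headers of the response
    to a request that asked for `requested_length` bytes via a Range header.
    """
    if status == 200:
        # 200 means the server is about to send the whole file.
        raise RangeNotHonoredError("server returned 200 (full body) instead of 206")
    if status != 206:
        raise RangeNotHonoredError(f"unexpected status {status}")

    # Header names are case-insensitive, so normalize before looking one up.
    normalized = {name.lower(): value for name, value in headers.items()}
    content_length = normalized.get("content-length")
    if content_length is not None and int(content_length) > requested_length:
        raise RangeNotHonoredError("response is larger than the requested range")
```

This only guards against accidental full downloads by well-behaved clients; it does nothing about a client that deliberately requests the whole file.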

In practice, it can be an issue, but it's not unique to PMTiles; the other cloud-optimized formats like COG have the same drawback. The best solution for now is to run a proxy in front of your bucket such as https://github.com/protomaps/go-pmtiles, but of course that's no longer just S3 :)

@msbarry
Author

msbarry commented Jan 21, 2022

> It's not possible on raw S3 to allow only range requests. That concern is somewhat mitigated by having clients implement a rudimentary check as shown on this line: https://github.com/protomaps/PMTiles/blob/master/js/index.src.mjs#L71

OK thanks, that check helps prevent accidental full downloads, but there's still the issue of intentional full downloads, which could become a problem with a 100 GB full-planet tileset hosted on S3, since each full download would cost the owner about $10 in egress fees.
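
The $10 figure implies roughly $0.10 per GB of egress. A back-of-the-envelope sketch under that assumed flat rate (real pricing varies by provider, region, and volume tier; the function name is made up for this example):

```python
def egress_cost_usd(archive_gb: float, downloads: int, usd_per_gb: float = 0.10) -> float:
    """Estimate egress fees for repeated full downloads of one archive.

    usd_per_gb is an assumed flat rate derived from the figures in this
    thread, not a quoted price.
    """
    return archive_gb * downloads * usd_per_gb
```

At that assumed rate, 1,000 full downloads of a 100 GB archive would come to roughly $10,000.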

I was thinking of using pmtiles for the planetiler demo site (~500MB mbtiles file on github pages) but if splitting a pmtiles archive doesn't make sense then I can stick with the current approach of extracting all of the tiles to individual files.

@bdon
Member

bdon commented Jan 23, 2022

> OK thanks, that check helps prevent accidental full downloads, but there's still the issue of intentional full downloads, which could become a problem with a 100 GB full-planet tileset hosted on S3, since each full download would cost the owner about $10 in egress fees.

Yeah, I agree the intentional linking/leeching is a concern - the basemap downloads I offer at http://protomaps.com/downloads are limited to at most a hundred or so megabytes, and my stopgap solution for larger maps is proxy-based like above. I'm optimistic about the long-term solve here being market pressure downwards on bandwidth in the next few years, for example if/when Cloudflare R2 becomes available.

@bdon
Member

bdon commented Feb 7, 2022

I'm going to close this issue about archive splitting for now; I think the ETag features enabled by a single file take precedence over working around max file size limits. For the planetiler demo site, I've spun up a demo tile server using https://github.com/protomaps/go-pmtiles on an unmetered bandwidth server:

https://bdon.github.io/planetiler-demo/ (endpoint http://free-tiles.protomaps.com/planetiler/{z}/{x}/{y}.pbf)

Open to suggestions on how to organize the URL structures or metadata, or access for hosting regular updates.
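
To illustrate the single-file ETag argument made at the start of this comment, here is a minimal sketch of how a reader might use one ETag as the version identifier for everything it has cached. `EtagGuard` is a hypothetical class, not part of any pmtiles library, and it assumes the server sends an ETag header on range responses.

```python
from typing import Dict, Optional, Tuple


class EtagGuard:
    """Track the archive's ETag and drop cached ranges when it changes."""

    def __init__(self) -> None:
        self.etag: Optional[str] = None
        # Cached byte ranges keyed by (offset, length).
        self.cache: Dict[Tuple[int, int], bytes] = {}

    def observe(self, response_etag: Optional[str]) -> bool:
        """Record the ETag of a response; return True if the cache was invalidated."""
        if response_etag is None:
            return False  # no ETag sent, nothing to compare
        if self.etag is None:
            self.etag = response_etag  # first response defines the version
            return False
        if response_etag != self.etag:
            # The file was replaced: cached directories and tiles may point
            # at offsets that no longer mean the same thing.
            self.cache.clear()
            self.etag = response_etag
            return True
        return False
```

With a split archive, the reader would need one such identifier per piece, and there would be no single value that says "all of these pieces belong to the same version".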

@bdon bdon closed this as completed Feb 7, 2022