
A revised breaking protobuf schema for Geobuf #27

Closed
mourner opened this issue Dec 10, 2014 · 42 comments
@mourner
Member

mourner commented Dec 10, 2014

The more ideas I find to improve Geobuf sizes, the more I realize that we would need to completely rewrite the schema to support the improvements, breaking compatibility. So I'm opening this ticket to start a discussion about what a perfect Geobuf schema would look like (not to say this is a priority, but it's still a good thing to discuss).

I wrote a prototype schema with all the improvements I could think of here: https://gist.github.com/mourner/3c6ddca04c9772593302

The main difference is utilizing the power and flexibility of the new oneof statement to create a better and more compact schema.

Features:

  • the data itself contains information about whether it's a Feature, FeatureCollection, Geometry or GeometryCollection, so you don't need to guess this when decoding
  • keys and values for properties are stored separately in the top-level Data object, and features only store indexes into them (like vector-tile-spec does); keys and possibly values are reduced to unique values (see "Fancy encoding for properties" #26) — for much better properties packing
  • geometry coordinates are a oneof set of different fields (depending on type), which solves the ambiguity with null values since a oneof field can be empty (without a default value), and also makes it easier to understand and work with
  • a feature has a oneof of either a geometry or a geometry collection instead of repeated geometries
  • coordinates are stored as delta-encoded sint32 to take full advantage of varint and zig-zag encoding (see "Switch to sint32 for coords" #24) — for much more compact geometries
  • the Value message is also a oneof of different value types
  • the property value int type is sint32 instead of int64 — it's more compact, and JS doesn't actually handle int64 values; in addition, the uint type becomes uint32
  • the data contains an optional flag indicating whether it contains altitude (a third coordinate), since this is a global-level setting
  • the data also contains optional precision information (6 by default; see "Configurable precision" #25)
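The delta + zig-zag coordinate idea from the bullets above can be sketched in Python. This is a minimal illustration with made-up helper names, not the actual pygeobuf code:

```python
def zigzag(n):
    # Zig-zag maps signed ints to unsigned ones so that small
    # magnitudes (positive or negative) encode as small varints.
    return (n << 1) ^ (n >> 63)

def encode_coords(coords, precision=6):
    # Quantize each coordinate to an integer at the given precision,
    # then store only the delta from the previous point, zig-zag encoded.
    e = 10 ** precision
    out, px, py = [], 0, 0
    for x, y in coords:
        xi, yi = round(x * e), round(y * e)
        out.append(zigzag(xi - px))
        out.append(zigzag(yi - py))
        px, py = xi, yi
    return out
```

For a linestring of nearby points the deltas stay tiny, so most coordinates fit in one or two varint bytes.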

@tmcw @springmeyer what do you think?

@springmeyer

Awesome to learn about oneof. Overall your ideas look excellent. I'll comment more when I've had time to digest the proposed schema.

@brendan-ward

This looks like a great direction.

What about letting the feature id field leverage oneof to allow encoding as uint32 values (much more common than string IDs in many cases we deal with):

message Feature {
    oneof geometry_type {
        Geometry geometry = 1;
        GeometryCollection geometry_collection = 2;
    }
    repeated uint32 properties = 3 [packed = true];
    optional ID id = 4;
}

message ID {
    oneof value {
        string string_value = 1;
        uint32 uint_value = 2;
    }
}

@mourner
Member Author

mourner commented Dec 11, 2014

@brendan-ward yes, excellent suggestion! Added it to the proto gist, except including oneof directly in the Feature message, since oneof fields are already optional by nature.

@mourner
Member Author

mourner commented Dec 11, 2014

also cc @mick @rclark @sgillies in case you're interested

@mourner
Member Author

mourner commented Dec 11, 2014

OK guys, today I wrote the encoder of my proposed format in Python to test it out, and the results are pretty mind-blowing.

Here's the comparison for the 100 MB, 5.4-million-point (1e4 precision) US zip codes GeoJSON (which I used for testing geojson-vt):

               normal     gzipped
GeoJSON        101.85 MB  26.67 MB
current geobuf 73.53 MB   31.09 MB
my geobuf 1e6  22.27 MB   16.39 MB
my geobuf 1e4  12.43 MB   10.62 MB

Here's another sample, Idaho from geojson-fixtures:

               normal    gzipped
GeoJSON        10.93 MB  2.58 MB
current geobuf 5.53 MB   2.07 MB
my geobuf 1e6  1.38 MB   1.18 MB

Property keys are stored as unique values. I think I can squeeze out a bit more by making string values unique too.
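The key-deduplication scheme (much like vector-tile-spec) can be sketched like this — a hypothetical helper, not the pygeobuf API:

```python
def build_key_table(features_props):
    # features_props: one dict of properties per feature.
    # Each distinct key is stored once, in first-seen order;
    # features then reference keys by index into that table.
    keys, index, encoded = [], {}, []
    for props in features_props:
        refs = []
        for k in props:
            if k not in index:
                index[k] = len(keys)
                keys.append(k)
            refs.append(index[k])
        encoded.append(refs)
    return keys, encoded
```

With real-world data the same few keys repeat across thousands of features, so the global table stays tiny while per-feature storage drops to small varint indexes.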

@tmcw
Contributor

tmcw commented Dec 11, 2014

Cool. To be clear, "your geobuf" can be "our geobuf" :)

@tmcw tmcw closed this as completed Dec 11, 2014
@tmcw tmcw reopened this Dec 11, 2014
@mourner
Member Author

mourner commented Dec 11, 2014

@tmcw sure, I just wanted to share the motivation before suggesting that I completely rewrite this repo from scratch while breaking compatibility, which can be very annoying. :)

@tmcw
Contributor

tmcw commented Dec 11, 2014

I don't mind anything being rewritten ever as long as it's better :) But the people who might feel the pain of this would be @mick basically, because it's used in cardboard afaik

@mick
Contributor

mick commented Dec 11, 2014

@mourner I will feel the pain. But that's ok, this looks like a great rewrite. I can migrate my usage of geobuf to the new version.

@mourner
Member Author

mourner commented Dec 12, 2014

Reference Python encoder here: https://github.com/mapbox/pygeobuf
Working on decoder and some tests, and after that we can start reworking the JS implementation.

@kkaefer

kkaefer commented Dec 12, 2014

Any comparisons to TopoJSON?

@mourner
Member Author

mourner commented Dec 12, 2014

@kkaefer depends on the data — Geobuf is a GeoJSON equivalent, without encoding topology. It will probably be smaller compared to TopoJSON, but I'm thinking about making it support TopoJSON encoding as well for even more compression (or making a separate protobuf schema for TopoJSON).

@rclark
Contributor

rclark commented Dec 12, 2014

making a separate protobuf schema for TopoJSON

👍

@sgillies
Contributor

@mourner allowing integer or string ids in GeoJSON was a mistake, let's try to fix that here. I'm not sure what the better choice is but I'm convinced there should be only one.

And while we're considering other schemas, how about one for GPX type data? The internet is forever trying to shoehorn GPX into GeoJSON and if geobuf becomes the new binary GeoJSON, we'll see the same kind of interest.

@brendan-ward

For comparison, Idaho from geojson-fixtures (above) encoded as topojson, using defaults:

          normal  gzipped
TopoJSON  856 KB  99 KB

👍 for a separate protobuf schema for TopoJSON

👍 for integer ids. They are smaller in the protobuf, easier to index, easier to reason about, and often come in that way from other geo encodings. And the vector-tile-spec uses them.

@tmcw
Contributor

tmcw commented Dec 13, 2014

Hmm: @brendan-ward is that using the TopoJSON default of throwing away all attribute data? That setting makes it very apples-oranges

@brendan-ward

@tmcw I obviously missed that default behavior - apologies! The numbers are still pretty close.

Including properties and specifying quantization hopefully makes it less apples and oranges:

                      normal   gzipped
default quantization  928 KB   110 KB
quantization at 1e4   956 KB   119 KB
quantization at 1e6   1.95 MB  560 KB
no quantization       5.1 MB   1 MB

@mourner
Member Author

mourner commented Dec 13, 2014

allowing integer or string ids in GeoJSON was a mistake, let's try to fix that here. I'm not sure what the better choice is but I'm convinced there should be only one.

@sgillies not sure if we should impose our own restrictions on the format as it would make it harder for people to switch to — the idea is that you can throw your GeoJSON at it and it'll just work, and roundtrip back to GeoJSON without issues.

@brendan-ward ha, so we're doing pretty good if we compare it at the same quantization. When we make Topobuf, it'll be even smaller.

To all: I also wrote an initial decoding implementation (while on a plane). It seems to work except for a rounding issue with coordinates which I'm trying to figure out.

@mourner
Member Author

mourner commented Dec 15, 2014

OK, my Python Geobuf 2 encoder/decoder now works pretty well. It roundtrips (encodes and then decodes into the same JSON) all GeoJSON test cases I have. The only exception is nested properties, but it is solvable.

Encoding/decoding isn't very fast (15s/12s for a 22MB GeoJSON file), but I assume this can be heavily optimized later if necessary (e.g. with Cython) — @sgillies knows this magic well. I'm also sure the JS port will be quite fast.

I'm now looking into TopoJSON support. The idea is to be able to throw any kind of GeoJSON/TopoJSON at the format and store it losslessly in a compact binary.

@mourner
Member Author

mourner commented Dec 15, 2014

Once I have the Topobuf format & reference implementation ready with some feedback from the outside crowd, we can create the geobuf-spec repo (similar to vector-tile-spec) and start other implementations (JS & C++).

@sgillies
Contributor

I do have a preference for ids this morning. @brendan-ward strings are better than ints because all ints can be cast/converted to strings, but the inverse is not true.

The idea is to be able to throw any kind of GeoJSON/TopoJSON at the format and store it losslessly in a compact binary.

@mourner what about GeoJSON with properties that are arrays or objects?

{ "properties": {
    "lol": [1,2,3],
    "wut": {"a": 1}
  }
}

@mourner
Member Author

mourner commented Dec 15, 2014

what about GeoJSON with properties that are arrays or objects?

I think @brendan-ward's suggestion to stringify complex properties as JSON when encoding and parsing them when decoding is a pretty good compromise — it'll enable us to store any type of GeoJSON, while keeping it simple and being efficient most of the time (since nested properties are not that common).
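That stringify compromise might look roughly like this. A sketch only — the value/flag layout here is made up, not the actual schema:

```python
import json

def encode_value(v):
    # Scalars pass through untouched; arrays and objects are stored
    # as JSON strings, with a flag telling the decoder to parse them.
    if isinstance(v, (dict, list)):
        return json.dumps(v), True
    return v, False

def decode_value(v, is_json):
    # Reverse of encode_value: parse flagged values back into
    # nested structures, return scalars as-is.
    return json.loads(v) if is_json else v
```

The flag guarantees a lossless roundtrip: a property that happened to contain a JSON-looking string stays a string, because it was never flagged.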

strings are better than ints because all ints can be cast/converted to strings, but the inverse is not true.

All ints can be remembered as ints when encoding, so that we know for sure that they can be converted back from strings when decoding. Thus Geobuf will never try to convert a string to integer if the result is not guaranteed.

@sgillies
Contributor

👍

@mourner
Member Author

mourner commented Dec 15, 2014

@sgillies @brendan-ward as expected, nested props support was a trivial change: pygeobuf/pygeobuf@714b5df

@mourner
Member Author

mourner commented Dec 16, 2014

Topobuf encoding results are not groundbreaking but still nice — on average ~30% of the original TopoJSON size, and the same size or a bit smaller when gzipped.

                    original  gzipped
mamou TopoJSON      3252 KB   459 KB
mamou Geobuf        978 KB    445 KB
world-50m TopoJSON  727 KB    228 KB
world-50m Geobuf    219 KB    175 KB

In-progress work here: https://github.com/mapbox/pygeobuf/compare/topobuf.

@mourner
Member Author

mourner commented Dec 16, 2014

TopoJSON support complete. pygeobuf/pygeobuf#7

Now just a bit of finishing touches and we can ask for feedback from the geo crowd. Pretty excited about this.

@brendan-ward

@mourner This looks great - nice work!

I hate to be the one to ask, but is spatial indexing being viewed as a client side problem? Would there be any benefit (albeit at potential cost of additional space) of storing any spatial index related constructs or precursors in the buffer to aid in client side indexing? I realize there are lots of spatial index implementations, and that it might be better to keep that out of the scope of the buffer.

Extracting features based on spatial location seems like a common problem, so it would be nice to avoid repeated work if possible.

What I'm getting at could take a few forms:

  1. simply calculate and store bounding boxes alongside every feature (except single-point features) during encoding, instead of treating them as optional and derived from the GeoJSON
  2. follow the precedent set with property keys / values, and store a global array of bounding boxes, indexed in exactly the same order as features.
  3. serialize some aspect of the R-Tree used for the index. Obviously this one is the most problematic.

Maybe this would be better handled at the next level up from this set of buffers: a container format that combines the geo/topo buffer with an index / bbox buffer, so that geo/topo buffers can focus purely on roundtripping to / from Geo & Topo JSON.

@mourner
Member Author

mourner commented Dec 16, 2014

@brendan-ward I thought about this. And I also happen to be the author of a pretty awesome R-tree-based spatial indexing library — rbush. So obviously I'd want to serialize the R-tree for this purpose. :)

Since Protobuf handles tree-like structures with recursive embedded messages, I think it's pretty doable. Do you see any specific problems to R-tree serialization? Maybe we should store it as a separate protobuf file by the way, like shapefiles do — this way we wouldn't compromise on the compression and allow storing indexes separately (which is beneficial in some app setups).

@mourner
Member Author

mourner commented Dec 16, 2014

@brendan-ward btw, numbers for the Idaho TopoJSON test case above:

              normal   gzipped
TopoJSON 1e6  1.95 MB  560 KB
Geobuf        545 KB   451 KB

@rclark
Contributor

rclark commented Dec 16, 2014

I would just chime in to say that I imagine many implementations of geobuf will use it as a straightforward compression of geographic data while separately managing spatial indexing in some fashion that's applicable to that specific situation. For example cardboard.

I'm not opposed to a more compressed serialization of indexes, but my first reaction is that it is a separate concern that shouldn't be addressed in geobuf.

@mourner
Member Author

mourner commented Dec 16, 2014

@rclark 👍 to a separate repo for rbush-based spatial indexing and compressed index serialization.

@brendan-ward

A separate repo for rbush-based spatial indexing & serialization sounds like good separation of concerns.

@mourner My concern with serializing the R-Tree vs bounding boxes or other spatial index precursors was simply that the other implementations wouldn't be able to consume the serialized R-Tree directly, their APIs would probably expect the index precursors (e.g., bboxes) instead. But I'm outside my domain on the internals of spatial indexes, so I would definitely defer to you on this. rbush is awesome and would serve as a good implementation to build this around.

@mourner
Member Author

mourner commented Dec 17, 2014

@brendan-ward I don't know much about the cases involving serialized spatial indexes and what kind of applications would consume it, but I don't see much point in storing an index if it can't actually be used as an index.

E.g. to create an index from bboxes only, you would spend time running R-tree indexing on them after unpacking, which kinda defeats the purpose of having a stored precalculated index. And if you're fine with just bboxes, you could just as well precalculate and put them in the input data (GeoJSON or TopoJSON) since they're in the spec, and then store in Geobuf without any index extensions.

Also, the main challenge of spatial indexing is creating the index. Using it is dead-simple (like, simple depth-first search through a tree is less than 10 lines of code).
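That depth-first search really is short. Here's a sketch over a hypothetical deserialized tree of {bbox, children/items} nodes (bboxes as [minx, miny, maxx, maxy]); the node layout is an assumption, not a format anyone has specified:

```python
def intersects(a, b):
    # Axis-aligned bbox overlap test for [minx, miny, maxx, maxy].
    return a[0] <= b[2] and a[1] <= b[3] and a[2] >= b[0] and a[3] >= b[1]

def search(node, bbox, result):
    # Depth-first walk: descend only into subtrees whose bbox overlaps.
    if not intersects(node["bbox"], bbox):
        return result
    if "items" in node:  # leaf node: test each stored feature entry
        result.extend(i for i in node["items"] if intersects(i["bbox"], bbox))
    else:                # internal node: recurse into children
        for child in node["children"]:
            search(child, bbox, result)
    return result
```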

@joto

joto commented Dec 19, 2014

@mourner: very interesting work! We have been missing an efficient modern geo format for a long time. Some ideas:

  • I like the idea of (optionally) storing bboxes with features. It would allow quickly reading a file picking out only those features inside a given bbox. So even without an index it could allow faster access to some features without the overhead of de-serializing everything. Not sure how useful and how fast it would be in the real world. But reading data sequentially can often be faster than jumping through indexes.
  • Have you looked at the OSM PBF format? It solves some of the same problems for OSM and also uses Protobuf. What I especially like about it is its block structure. Data is grouped into blocks and each block is encoded on its own. PBF has the same kind of lookup table for strings as you have (for tags, usernames, etc. in PBF's case), but the lookup table is stored in each block. This allows easier streaming read and write because you don't have to wait for all the data to be complete before you can write the string table. It also allows for rather trivial parallelization of parsing because each block can be (de)serialized by itself in a different thread. This is what I do in Osmium. Of course it creates some overhead...
  • Protobuf (de)serialization has quite some overhead in my experience even in C++ due to the need for memory allocation. Your design of Multipolygons as repeated MultiLineStrings which are repeated LineStrings which are repeated coordinates looks straightforward, but might be expensive to encode and decode because of the nested messages. Maybe one large coordinate array with length fields telling us where sub-features start and end is better? This would need, of course, some experimentation.
  • The format should have some kind of versioning support and maybe some kind of header for optional features. Again, the OSM PBF format has an interesting (although not perfect) approach. It is important that features and settings (such as precision etc.) are located at the beginning of the file so they can be easily read before the rest of the file is parsed.

These are just a few ideas off the top of my head. I'd love to try a C++ implementation to see how things would be implemented there and how performance looks.

@mourner
Member Author

mourner commented Dec 19, 2014

@joto thanks for the awesome feedback!

But reading data sequentially can often be faster than jumping through indexes.

Here's what I have in mind for the index: it would be very small compared to the data file, because it would only contain bboxes, without geometry or properties, but additionally storing the byte index to jump to for reading a particular feature from the main data file. Then to read the features you need, you would do a simple search through the index, save all the feature byte indexes and then just run through them. If non-sequential reads are a concern, you can just sort the indexes before reading. It seems that it would definitely be faster than reading through every feature. What do you think?
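That lookup could be sketched as follows, under the assumption that the index is just (feature bbox, byte offset) pairs — no such index format exists yet, this is purely illustrative:

```python
def query_index(index, bbox):
    # index: list of (feature_bbox, byte_offset) pairs, bboxes as
    # [minx, miny, maxx, maxy]. Returns byte offsets of matching
    # features, sorted so the main data file can be read sequentially.
    def overlaps(a, b):
        return a[0] <= b[2] and a[1] <= b[3] and a[2] >= b[0] and a[3] >= b[1]
    return sorted(off for fb, off in index if overlaps(fb, bbox))
```

Sorting the offsets before reading is the point of the scheme: matches come back in file order, so a reader never seeks backwards.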

Regarding bboxes for every feature/geometry — it's pretty easy to add as an option, so not a problem.

PBF has the same kind of lookup table for the strings as you have (for tags, usernames, etc. in PBFs case). But the lookup table is stored in each block. This allows easier streaming read and write [...]

I see what you mean. I think we can find a compromise on this. Maybe store the unique keys globally — they're usually the same throughout the whole dataset and just repeated, so the array will be small and very fast to read at the beginning. But store values per feature — those aren't duplicated as often as keys.

Protobuf (de)serialization has quite some overhead in my experience even in C++ due to the need for memory allocation.

Protobuf 3 is going to solve this with arena allocation — it will know the deserialized size beforehand and will be able to preallocate all the memory needed.

Your design of Multipolygons as repeated MultiLineStrings which are repeated LineStrings which are repeated coordinates looks straightforward, but might be expensive to encode and decode because of the nested messages. Maybe one large coordinate array with length fields telling us where sub-features start and end is better?

This sounds like a great idea. I'll definitely give this a try — sounds pretty easy to implement. So I'll just have an array of coordinates and an array of lengths, e.g. for MultiPolygon:

lengths: [polygons, poly1_rings, ring1, ring2, ..., poly2_rings, ring1, ring2, ...]
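Decoding that layout back into nested rings could look like this. A sketch only — the real encoding would store delta-encoded zig-zag ints, which is omitted here for clarity:

```python
def unflatten_multipolygon(coords, lengths):
    # coords: flat [x, y, x, y, ...]; lengths follows the layout
    # [n_polygons, poly1_n_rings, ring1_len, ring2_len, ...,
    #  poly2_n_rings, ...], where ring lengths count points.
    li, ci = 0, 0
    polygons = []
    n_poly = lengths[li]; li += 1
    for _ in range(n_poly):
        n_rings = lengths[li]; li += 1
        rings = []
        for _ in range(n_rings):
            n_pts = lengths[li]; li += 1
            # Slice out n_pts coordinate pairs for this ring.
            ring = [coords[ci + 2 * i: ci + 2 * i + 2] for i in range(n_pts)]
            ci += 2 * n_pts
            rings.append(ring)
        polygons.append(rings)
    return polygons
```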

I'll also take a closer look at the OSM PBF schema.

I'd love to try a C++ implementation to see how things would be implemented there and how performance looks.

What would you use for the decoder? Our mapbox/pbf.hpp?

@joto

joto commented Dec 19, 2014

@mourner: Byte indexes are problematic, because the protobuf library hides that from you. Maybe there is a way to get the byte index of a feature from the protobuf lib, but I doubt that. If you are going that route you'd have to serialize each feature separately using protobuf but then write each feature into the file yourself, probably with a length header.

...store the unique keys globally...

That sounds reasonable. As long as we keep the streaming capability, i.e. we are sure we know, before we start writing, all the strings that will end up in the header.

Protobuf 3 is going to solve this with arena allocation

Well, the point is we don't want to allocate memory at all if we can avoid it. Don't know what Protobuf wants to do about it. Couldn't find anything about Protobuf 3. Do you have a link?

What would you use for the decoder? Our mapbox/pbf.hpp?

In the first version the official protobuf lib. Maybe later something else. When I tried the mapbox pbf parser the last time for Osmium, it didn't implement everything in the Protobuf spec and couldn't do everything that OSM PBF needed. Have to re-evaluate this at some point and see whether it would work for geobuf.

@springmeyer

Big +1 to giving a shot to @joto's idea: "Maybe one large coordinate array with length fields telling us where sub-features start and end is better? This would need, of course, some experimentation." (refs springmeyer/geojsonp#2 (comment)). The goal would be to allow [packed = true] for the entire geometry, since this would be potentially much more compact.

Also, I agree that while arena allocation in proto3 might mitigate the cost of memory allocation for parsing messages into protobuf objects, the ideal in high-performance situations would be to avoid allocating at all and instead go directly from messages -> your application's objects (skipping the intermediate objects libprotobuf creates). This is what https://github.com/mapbox/pbf.hpp and https://github.com/mapbox/pbf are about. I think you all understand this, but I want to re-state it for clarity.

@mourner
Member Author

mourner commented Dec 20, 2014

@joto about byte indexes — there can be a problem if you use the official protobuf library which is very high-level. But with a more low-level approach like pbf.hpp there should be no problem keeping track of byte indexes, and no problem serializing everything in one pass without any additional work.

When I tried the mapbox pbf parser the last time for Osmium, it didn't implement everything in the Protobuf spec

Let's create issues for anything that's lacking there. It's simple enough to add things relatively quickly, and there's @kkaefer to help.

@springmeyer I implemented flattened array encoding — it doesn't make things much more compact (just a few bytes here and there) because most of the weight is still in linestrings, which are already packed — there aren't that many multilinestrings and multipolygons in real-world datasets, and they usually don't have a lot of rings — just a couple.

But what I like about this approach is simplicity. It makes both schema and encoding/decoding code simpler.

@joto

joto commented Dec 20, 2014

@mourner: re byte indexes: I think it is important to stay compatible with the official protobuf libraries. If nothing else it allows other people to have their own implementations based on those libraries possibly in many different languages.

@mourner
Member Author

mourner commented Dec 20, 2014

@joto good point. I'll think about this. At least we can do indexing + byte indexes as a separate tool, not baking anything specific to it into the Geobuf schema itself.

@mourner
Member Author

mourner commented Dec 23, 2014

Made some progress:

  • all geometries are now encoded as a flat packed coord array + a flat packed array of lengths used to derive MultiLineString/Polygon/MultiPolygon structure if necessary — shaves off a few bytes and potentially makes encoding/decoding faster
  • all values are now encoded in each feature as opposed to global values array, to make incremental parsing easier and faster — this doesn't lead to a noticeable size increase since repeated values are not that common
  • Geobuf now encodes any arbitrary properties of any objects as well, making it possible to store GeoJSON/TopoJSON extensions and custom stuff

@mourner
Member Author

mourner commented Jan 9, 2015

JS encoder/decoder implemented in #29
