
A revised breaking protobuf schema for Geobuf #27

Closed
mourner opened this issue Dec 10, 2014 · 42 comments
@mourner
Member

mourner commented Dec 10, 2014

The more ideas I find to improve Geobuf sizes, the more I realize that we would need to completely rewrite the schema to support the improvements, breaking compatibility. So I'm opening this ticket to start a discussion about what a perfect Geobuf schema would look like (not to say this is a priority, but it's still a good thing to discuss).

I wrote a prototype schema with all the improvements I could think of here: https://gist.github.com/mourner/3c6ddca04c9772593302

The main difference is utilizing the power and flexibility of the new oneof statement to create a better and more compact schema.

Features:

  • the data itself contains information about whether it's a Feature, FeatureCollection, Geometry or GeometryCollection, so you don't need to guess this when decoding
  • keys and values for properties are stored separately in the top-level Data object, and features only store indexes into them (like vector-tile-spec does); keys and possibly values are reduced to unique values (see "Fancy encoding for properties" #26) — for much better properties packing
  • geometry coordinates are a oneof set of different fields (depending on type), which solves the ambiguity with null values since a oneof field can be empty (without a default value), and also makes it easier to understand and work with
  • a feature has a oneof of either a geometry or a geometry collection instead of repeated geometries
  • coordinates are stored as delta-encoded sint32 to take full advantage of varint and zig-zag encoding (see "Switch to sint32 for coords" #24) — for much more compact geometries
  • the Value message is also a oneof of different value types
  • the property value int type is sint32 instead of int64 — it's more compact, and JS doesn't actually handle int64 values; in addition, the uint type becomes uint32
  • the data contains an optional flag indicating whether it contains altitude (a third coordinate), since this is a global-level setting
  • the data also contains optional precision information (6 by default; see "Configurable precision" #25)
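The delta + zig-zag coordinate idea from the bullets above can be sketched in Python. This is a minimal illustration with made-up helper names, not the actual pygeobuf code:

```python
def zigzag(n):
    # Zig-zag maps signed ints to unsigned ones so that small
    # magnitudes (positive or negative) encode as small varints.
    return (n << 1) ^ (n >> 63)

def encode_coords(coords, precision=6):
    # Quantize each coordinate to an integer at the given precision,
    # then store only the delta from the previous point, zig-zag encoded.
    e = 10 ** precision
    out, px, py = [], 0, 0
    for x, y in coords:
        xi, yi = round(x * e), round(y * e)
        out.append(zigzag(xi - px))
        out.append(zigzag(yi - py))
        px, py = xi, yi
    return out
```

For a linestring of nearby points the deltas stay tiny, so most coordinates fit in one or two varint bytes.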

@tmcw @springmeyer what do you think?

@springmeyer

Awesome to learn about oneof. Overall your ideas look excellent. I'll comment more when I've had time to digest the proposed schema.

@brendan-ward

This looks like a great direction.

What about letting the feature id field leverage oneof to allow encoding as uint32 values (much more common than string IDs in many cases we deal with):

message Feature {
    oneof geometry_type {
        Geometry geometry = 1;
        GeometryCollection geometry_collection = 2;
    }
    repeated uint32 properties = 3 [packed = true];
    optional ID id = 4;
}

message ID {
    oneof value {
        string string_value = 1;
        uint32 uint_value = 2;
    }
}

@mourner
Member Author

mourner commented Dec 11, 2014

@brendan-ward yes, excellent suggestion! Added it to the proto gist, except including oneof directly in the Feature message, since oneof fields are already optional by nature.

@mourner
Member Author

mourner commented Dec 11, 2014

also cc @mick @rclark @sgillies in case you're interested

@mourner
Member Author

mourner commented Dec 11, 2014

OK guys, today I wrote the encoder of my proposed format in Python to test it out, and the results are pretty mind-blowing.

Here's the comparison for the 100 MB, 5.4-million-point (1e4 precision) US zip codes GeoJSON (which I used for testing geojson-vt):

               normal     gzipped
GeoJSON        101.85 MB  26.67 MB
current geobuf 73.53 MB   31.09 MB
my geobuf 1e6  22.27 MB   16.39 MB
my geobuf 1e4  12.43 MB   10.62 MB

Here's another sample, Idaho from geojson-fixtures:

               normal    gzipped
GeoJSON        10.93 MB  2.58 MB
current geobuf 5.53 MB   2.07 MB
my geobuf 1e6  1.38 MB   1.18 MB

Property keys are stored as unique values. I think I can squeeze out a bit more by making string values unique too.
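The key-deduplication scheme (much like vector-tile-spec) can be sketched like this — a hypothetical helper, not the pygeobuf API:

```python
def build_key_table(features_props):
    # features_props: one dict of properties per feature.
    # Each distinct key is stored once, in first-seen order;
    # features then reference keys by index into that table.
    keys, index, encoded = [], {}, []
    for props in features_props:
        refs = []
        for k in props:
            if k not in index:
                index[k] = len(keys)
                keys.append(k)
            refs.append(index[k])
        encoded.append(refs)
    return keys, encoded
```

With real-world data the same few keys repeat across thousands of features, so the global table stays tiny while per-feature storage drops to small varint indexes.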

@tmcw
Contributor

tmcw commented Dec 11, 2014

Cool. To be clear, "your geobuf" can be "our geobuf" :)

@tmcw tmcw closed this as completed Dec 11, 2014
@tmcw tmcw reopened this Dec 11, 2014
@mourner
Member Author

mourner commented Dec 11, 2014

@tmcw sure, I just wanted to share the motivation before suggesting that I completely rewrite this repo from scratch while breaking compatibility, which can be very annoying. :)

@tmcw
Contributor

tmcw commented Dec 11, 2014

I don't mind anything being rewritten ever as long as it's better :) But the people who might feel the pain of this would be @mick basically, because it's used in cardboard afaik

@mick
Contributor

mick commented Dec 11, 2014

@mourner I will feel the pain. But that's ok, this looks like a great rewrite. I can migrate my usage of geobuf to the new version.

@mourner
Member Author

mourner commented Dec 12, 2014

Reference Python encoder here: https://github.com/mapbox/pygeobuf
Working on decoder and some tests, and after that we can start reworking the JS implementation.

@kkaefer

kkaefer commented Dec 12, 2014

Any comparisons to TopoJSON?

@mourner
Member Author

mourner commented Dec 12, 2014

@kkaefer depends on the data — Geobuf is a GeoJSON equivalent, without encoding topology. It will probably be smaller compared to TopoJSON, but I'm thinking about making it support TopoJSON encoding as well for even more compression (or making a separate protobuf schema for TopoJSON).

@rclark
Contributor

rclark commented Dec 12, 2014

making a separate protobuf schema for TopoJSON

👍

@sgillies
Contributor

@mourner allowing integer or string ids in GeoJSON was a mistake, let's try to fix that here. I'm not sure what the better choice is but I'm convinced there should be only one.

And while we're considering other schemas, how about one for GPX type data? The internet is forever trying to shoehorn GPX into GeoJSON and if geobuf becomes the new binary GeoJSON, we'll see the same kind of interest.

@brendan-ward

For comparison, Idaho from geojson-fixtures (above) encoded as topojson, using defaults:

          normal  gzipped
TopoJSON  856 KB  99 KB

👍 for a separate protobuf schema for TopoJSON

👍 for integer ids. They are smaller in the protobuf, easier to index, easier to reason about, and often come in that way from other geo encodings. And the vector-tile-spec uses them.

@tmcw
Contributor

tmcw commented Dec 13, 2014

Hmm: @brendan-ward is that using the TopoJSON default of throwing away all attribute data? That setting makes it very apples-oranges

@brendan-ward

@tmcw I obviously missed that default behavior - apologies! The numbers are still pretty close.

Including properties and specifying quantization hopefully makes it less apples and oranges:

                      normal   gzipped
default quantization  928 KB   110 KB
quantization at 1e4   956 KB   119 KB
quantization at 1e6   1.95 MB  560 KB
no quantization       5.1 MB   1 MB

@mourner
Member Author

mourner commented Dec 13, 2014

allowing integer or string ids in GeoJSON was a mistake, let's try to fix that here. I'm not sure what the better choice is but I'm convinced there should be only one.

@sgillies not sure if we should impose our own restrictions on the format as it would make it harder for people to switch to — the idea is that you can throw your GeoJSON at it and it'll just work, and roundtrip back to GeoJSON without issues.

@brendan-ward ha, so we're doing pretty good if we compare it at the same quantization. When we make Topobuf, it'll be even smaller.

To all: I also wrote an initial decoding implementation (while on a plane). It seems to work except for a rounding issue with coordinates which I'm trying to figure out.

@mourner
Member Author

mourner commented Dec 15, 2014

OK, my Python Geobuf 2 encoder/decoder now works pretty well. It roundtrips (encodes and then decodes into the same JSON) all GeoJSON test cases I have. The only exception is nested properties, but it is solvable.

Encoding/decoding isn't very fast (15s/12s for a 22MB GeoJSON file), but I assume this can be heavily optimized later if necessary (e.g. with Cython) — @sgillies knows this magic well. I'm also sure the JS port will be quite fast.

I'm now looking into TopoJSON support. The idea is to be able to throw any kind of GeoJSON/TopoJSON at the format and store it losslessly in a compact binary.

@mourner
Member Author

mourner commented Dec 15, 2014

Once I have the Topobuf format & reference implementation ready with some feedback from the outside crowd, we can create the geobuf-spec repo (similar to vector-tile-spec) and start other implementations (JS & C++).

@sgillies
Contributor

I do have a preference for ids this morning. @brendan-ward strings are better than ints because all ints can be cast/converted to strings, but the inverse is not true.

The idea is to be able to throw any kind of GeoJSON/TopoJSON at the format and store it losslessly in a compact binary.

@mourner what about GeoJSON with properties that are arrays or objects?

{ "properties": {
    "lol": [1,2,3],
    "wut": {"a": 1}
  }
}

@mourner
Member Author

mourner commented Dec 15, 2014

what about GeoJSON with properties that are arrays or objects?

I think @brendan-ward's suggestion to stringify complex properties as JSON when encoding and parsing them when decoding is a pretty good compromise — it'll enable us to store any type of GeoJSON, while keeping it simple and being efficient most of the time (since nested properties are not that common).
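That stringify compromise might look roughly like this. A sketch only — the value/flag layout here is made up, not the actual schema:

```python
import json

def encode_value(v):
    # Scalars pass through untouched; arrays and objects are stored
    # as JSON strings, with a flag telling the decoder to parse them.
    if isinstance(v, (dict, list)):
        return json.dumps(v), True
    return v, False

def decode_value(v, is_json):
    # Reverse of encode_value: parse flagged values back into
    # nested structures, return scalars as-is.
    return json.loads(v) if is_json else v
```

The flag guarantees a lossless roundtrip: a property that happened to contain a JSON-looking string stays a string, because it was never flagged.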

strings are better than ints because all ints can be cast/converted to strings, but the inverse is not true.

All ints can be remembered as ints when encoding, so that we know for sure that they can be converted back from strings when decoding. Thus Geobuf will never try to convert a string to integer if the result is not guaranteed.

@sgillies
Contributor

👍

@mourner
Member Author

mourner commented Dec 15, 2014

@sgillies @brendan-ward as expected, nested props support was a trivial change: pygeobuf/pygeobuf@714b5df

@mourner
Member Author

mourner commented Dec 16, 2014

Topobuf encoding results are not groundbreaking but still nice — on average ~30% of the original TopoJSON size, and the same size or a bit smaller when gzipped.

                    original  gzipped
mamou TopoJSON      3252 KB   459 KB
mamou Geobuf        978 KB    445 KB
world-50m TopoJSON  727 KB    228 KB
world-50m Geobuf    219 KB    175 KB

In-progress work here: https://github.com/mapbox/pygeobuf/compare/topobuf.

@mourner
Member Author

mourner commented Dec 16, 2014

TopoJSON support complete. pygeobuf/pygeobuf#7

Now just a bit of finishing touches and we can ask for feedback from the geo crowd. Pretty excited about this.

@brendan-ward

@mourner This looks great - nice work!

I hate to be the one to ask, but is spatial indexing being viewed as a client side problem? Would there be any benefit (albeit at potential cost of additional space) of storing any spatial index related constructs or precursors in the buffer to aid in client side indexing? I realize there are lots of spatial index implementations, and that it might be better to keep that out of the scope of the buffer.

Extracting features based on spatial location seems like a common problem, so it would be nice to avoid repeated work if possible.

What I'm getting at could take a few forms:

  1. simply calculate and store bounding boxes alongside every feature (except single-point features) during encoding, instead of treating them as optional and derived from the GeoJSON
  2. follow the precedent set with property keys / values, and store a global array of bounding boxes, indexed in exactly the same order as features.
  3. serialize some aspect of the R-Tree used for the index. Obviously this one is the most problematic.

Maybe this would be better handled at the next level up from this set of buffers: a container format that combines the geo/topo buffer with an index / bbox buffer, so that geo/topo buffers can focus purely on roundtripping to / from Geo & Topo JSON.

@mourner
Member Author

mourner commented Dec 16, 2014

@brendan-ward I thought about this. And I also happen to be the author of a pretty awesome R-tree-based spatial indexing library — rbush. So obviously I'd want to serialize the R-tree for this purpose. :)

Since Protobuf handles tree-like structures with recursive embedded messages, I think it's pretty doable. Do you see any specific problems to R-tree serialization? Maybe we should store it as a separate protobuf file by the way, like shapefiles do — this way we wouldn't compromise on the compression and allow storing indexes separately (which is beneficial in some app setups).

@mourner
Member Author

mourner commented Dec 16, 2014

@brendan-ward btw, numbers for the Idaho TopoJSON test case above:

              normal   gzipped
TopoJSON 1e6  1.95 MB  560 KB
Geobuf        545 KB   451 KB

@rclark
Contributor

rclark commented Dec 16, 2014

I would just chime in to say that I imagine many implementations of geobuf will use it as a straightforward compression of geographic data while separately managing spatial indexing in some fashion that's applicable to that specific situation. For example cardboard.

I'm not opposed to a more compressed serialization of indexes, but my first reaction is that it is a separate concern that shouldn't be addressed in geobuf.

@mourner
Member Author

mourner commented Dec 16, 2014

@rclark 👍 to a separate repo for rbush-based spatial indexing and compressed index serialization.

@brendan-ward

A separate repo for rbush-based spatial indexing & serialization sounds like good separation of concerns.

@mourner My concern with serializing the R-Tree vs bounding boxes or other spatial index precursors was simply that the other implementations wouldn't be able to consume the serialized R-Tree directly, their APIs would probably expect the index precursors (e.g., bboxes) instead. But I'm outside my domain on the internals of spatial indexes, so I would definitely defer to you on this. rbush is awesome and would serve as a good implementation to build this around.

@mourner
Member Author

mourner commented Dec 17, 2014

@brendan-ward I don't know much about the cases involving serialized spatial indexes and what kind of applications would consume it, but I don't see much point in storing an index if it can't actually be used as an index.

E.g. to create an index from bboxes only, you would spend time running R-tree indexing on them after unpacking, which kinda defeats the purpose of having a stored precalculated index. And if you're fine with just bboxes, you could just as well precalculate and put them in the input data (GeoJSON or TopoJSON) since they're in the spec, and then store in Geobuf without any index extensions.

Also, the main challenge of spatial indexing is creating the index. Using it is dead-simple (like, simple depth-first search through a tree is less than 10 lines of code).
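That depth-first search really is short. Here's a sketch over a hypothetical deserialized tree of {bbox, children/items} nodes (bboxes as [minx, miny, maxx, maxy]); the node layout is an assumption, not a format anyone has specified:

```python
def intersects(a, b):
    # Axis-aligned bbox overlap test for [minx, miny, maxx, maxy].
    return a[0] <= b[2] and a[1] <= b[3] and a[2] >= b[0] and a[3] >= b[1]

def search(node, bbox, result):
    # Depth-first walk: descend only into subtrees whose bbox overlaps.
    if not intersects(node["bbox"], bbox):
        return result
    if "items" in node:  # leaf node: test each stored feature entry
        result.extend(i for i in node["items"] if intersects(i["bbox"], bbox))
    else:                # internal node: recurse into children
        for child in node["children"]:
            search(child, bbox, result)
    return result
```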

@joto

joto commented Dec 19, 2014

@mourner: very interesting work! We have been missing an efficient modern geo format for a long time. Some ideas:

  • I like the idea of (optionally) storing bboxes with features. It would allow quickly reading a file picking out only those features inside a given bbox. So even without an index it could allow faster access to some features without the overhead of de-serializing everything. Not sure how useful and how fast it would be in the real world. But reading data sequentially can often be faster than jumping through indexes.
  • Have you looked at the OSM PBF format? It solves some of the same problems for OSM and also uses Protobuf. What I especially like about it is its block structure. Data is grouped into blocks and each block is encoded on its own. PBF has the same kind of lookup table for strings as you have (for tags, usernames, etc. in PBF's case), but the lookup table is stored in each block. This allows easier streaming read and write because you don't have to wait for all the data to be complete before you can write the string table. It also allows for rather trivial parallelization of parsing because each block can be (de)serialized by itself in a different thread. This is what I do in Osmium. Of course it creates some overhead...
  • Protobuf (de)serialization has quite some overhead in my experience even in C++ due to the need for memory allocation. Your design of Multipolygons as repeated MultiLineStrings which are repeated LineStrings which are repeated coordinates looks straightforward, but might be expensive to encode and decode because of the nested messages. Maybe one large coordinate array with length fields telling us where sub-features start and end is better? This would need, of course, some experimentation.
  • The format should have some kind of versioning support and maybe some kind of header for optional features. Again, the OSM PBF format has an interesting (although not perfect) approach. It is important that features and settings (such as precision etc.) are located at the beginning of the file so they can be easily read before the rest of the file is parsed.

These are just a few ideas off the top of my head. I'd love to try a C++ implementation to see how things would be implemented there and how performance looks.

@mourner
Member Author

mourner commented Dec 19, 2014

@joto thanks for the awesome feedback!

But reading data sequentially can often be faster than jumping through indexes.

Here's what I have in mind for the index: it would be very small compared to the data file, because it would only contain bboxes, without geometry or properties, but additionally storing the byte index to jump to for reading a particular feature from the main data file. Then to read the features you need, you would do a simple search through the index, save all the feature byte indexes and then just run through them. If non-sequential reads are a concern, you can just sort the indexes before reading. It seems that it would definitely be faster than reading through every feature. What do you think?
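That lookup could be sketched as follows, under the assumption that the index is just (feature bbox, byte offset) pairs — no such index format exists yet, this is purely illustrative:

```python
def query_index(index, bbox):
    # index: list of (feature_bbox, byte_offset) pairs, bboxes as
    # [minx, miny, maxx, maxy]. Returns byte offsets of matching
    # features, sorted so the main data file can be read sequentially.
    def overlaps(a, b):
        return a[0] <= b[2] and a[1] <= b[3] and a[2] >= b[0] and a[3] >= b[1]
    return sorted(off for fb, off in index if overlaps(fb, bbox))
```

Sorting the offsets before reading is the point of the scheme: matches come back in file order, so a reader never seeks backwards.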

Regarding bboxes for every feature/geometry — it's pretty easy to add as an option, so not a problem.

PBF has the same kind of lookup table for the strings as you have (for tags, usernames, etc. in PBFs case). But the lookup table is stored in each block. This allows easier streaming read and write [...]

I see what you mean. I think we can find a compromise on this. Maybe store the unique keys globally — they're usually the same throughout the whole dataset and just repeated, so the array will be small and very fast to read at the beginning. But store values per feature — those aren't duplicated as often as keys.

Protobuf (de)serialization has quite some overhead in my experience even in C++ due to the need for memory allocation.

Protobuf 3 is going to solve this with arena allocation — it will know the deserialized size beforehand and will be able to preallocate all the memory needed.

Your design of Multipolygons as repeated MultiLineStrings which are repeated LineStrings which are repeated coordinates looks straightforward, but might be expensive to encode and decode because of the nested messages. Maybe one large coordinate array with length fields telling us where sub-features start and end is better?

This sounds like a great idea. I'll definitely give this a try — sounds pretty easy to implement. So I'll just have an array of coordinates and an array of lengths, e.g. for MultiPolygon:

lengths: [polygons, poly1_rings, ring1, ring2, ..., poly2_rings, ring1, ring2, ...]
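Decoding that layout back into nested rings could look like this. A sketch only — the real encoding would store delta-encoded zig-zag ints, which is omitted here for clarity:

```python
def unflatten_multipolygon(coords, lengths):
    # coords: flat [x, y, x, y, ...]; lengths follows the layout
    # [n_polygons, poly1_n_rings, ring1_len, ring2_len, ...,
    #  poly2_n_rings, ...], where ring lengths count points.
    li, ci = 0, 0
    polygons = []
    n_poly = lengths[li]; li += 1
    for _ in range(n_poly):
        n_rings = lengths[li]; li += 1
        rings = []
        for _ in range(n_rings):
            n_pts = lengths[li]; li += 1
            # Slice out n_pts coordinate pairs for this ring.
            ring = [coords[ci + 2 * i: ci + 2 * i + 2] for i in range(n_pts)]
            ci += 2 * n_pts
            rings.append(ring)
        polygons.append(rings)
    return polygons
```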

I'll also take a closer look at the OSM PBF schema.

I'd love to try a C++ implementation to see how things would be implemented there and how performance looks.

What would you use for the decoder? Our mapbox/pbf.hpp?

@joto

joto commented Dec 19, 2014

@mourner: Byte indexes are problematic, because the protobuf library hides that from you. Maybe there is a way to get the byte index of a feature from the protobuf lib, but I doubt that. If you are going that route you'd have to serialize each feature separately using protobuf but then write each feature into the file yourself, probably with a length header.

...store the unique keys globally...

That sounds reasonable. As long as we keep the streaming capability, i.e. we are sure we know, before we start writing, all the strings that will end up in the header.

Protobuf 3 is going to solve this with arena allocation

Well, the point is we don't want to allocate memory at all if we can avoid it. Don't know what Protobuf wants to do about it. Couldn't find anything about Protobuf 3. Do you have a link?

What would you use for the decoder? Our mapbox/pbf.hpp?

In the first version the official protobuf lib. Maybe later something else. When I tried the mapbox pbf parser the last time for Osmium, it didn't implement everything in the Protobuf spec and couldn't do everything that OSM PBF needed. Have to re-evaluate this at some point and see whether it would work for geobuf.

@springmeyer

Big +1 to giving a shot to @joto's idea: "Maybe one large coordinate array with length fields telling us where sub-features start and end is better? This would need, of course, some experimentation." (refs springmeyer/geojsonp#2 (comment)). The goal would be to allow [packed = true] for the entire geometry, since this would be potentially much more compact.

Also, I agree that while arena allocation in proto3 might mitigate the cost of memory allocation for parsing messages into protobuf objects, the ideal in high-performance situations would be to avoid allocating at all and instead go directly from messages -> your application's objects (skipping the intermediate objects libprotobuf creates). This is what https://github.com/mapbox/pbf.hpp and https://github.com/mapbox/pbf are about. I think you all understand this, but I want to re-state it for clarity.

@mourner
Member Author

mourner commented Dec 20, 2014

@joto about byte indexes — there can be a problem if you use the official protobuf library which is very high-level. But with a more low-level approach like pbf.hpp there should be no problem keeping track of byte indexes, and no problem serializing everything in one pass without any additional work.

When I tried the mapbox pbf parser the last time for Osmium, it didn't implement everything in the Protobuf spec

Let's create issues for anything that's lacking there. It's simple enough to add things relatively quickly, and there's @kkaefer to help.

@springmeyer I implemented flattened array encoding — it doesn't make things much more compact (just a few bytes here and there) because most of the weight is still in linestrings, which are already packed — there aren't that many multilinestrings and multipolygons in real-world datasets, and they usually don't have a lot of rings — just a couple.

But what I like about this approach is simplicity. It makes both schema and encoding/decoding code simpler.

@joto

joto commented Dec 20, 2014

@mourner: re byte indexes: I think it is important to stay compatible with the official protobuf libraries. If nothing else it allows other people to have their own implementations based on those libraries possibly in many different languages.

@mourner
Member Author

mourner commented Dec 20, 2014

@joto good point. I'll think about this. At least we can do indexing + byte indexes as a separate tool, not baking anything specific to it into the Geobuf schema itself.

@mourner
Member Author

mourner commented Dec 23, 2014

Made some progress:

  • all geometries are now encoded as a flat packed coord array + a flat packed array of lengths used to derive MultiLineString/Polygon/MultiPolygon structure if necessary — shaves off a few bytes and potentially makes encoding/decoding faster
  • all values are now encoded in each feature as opposed to global values array, to make incremental parsing easier and faster — this doesn't lead to a noticeable size increase since repeated values are not that common
  • Geobuf now encodes any arbitrary properties of any objects as well, making it possible to store GeoJSON/TopoJSON extensions and custom stuff

@mourner
Member Author

mourner commented Jan 9, 2015

JS encoder/decoder implemented in #29
