A revised breaking protobuf schema for Geobuf #27
Awesome to learn about oneof. Overall your ideas look excellent. I'll comment more when I've had time to digest the proposed schema.
This looks like a great direction. What about letting the feature id field leverage oneof?
@brendan-ward yes, excellent suggestion! Added it to the proto gist, except including oneof directly in the Feature message, since oneof fields are already optional by nature.
OK guys, today I wrote the encoder of my proposed format in Python to test it out, and the results are pretty mind-blowing. Here's the comparison for the 100 MB, 5.4-million-point (1e4 precision) US zip codes GeoJSON (that I used for testing geojson-vt):
Here's another sample, Idaho from geojson-fixtures:
The property keys stored are unique. I think I can squish out just a bit more by making string values unique too.
Cool. To be clear, "your geobuf" can be "our geobuf" :)
@tmcw sure, I just wanted to share the motivation before suggesting that I completely rewrite this repo from scratch while breaking compatibility, which can be very annoying. :)
I don't mind anything being rewritten ever as long as it's better :) But the people who might feel the pain of this would be @mick basically, because it's used in cardboard afaik
@mourner I will feel the pain, but that's ok; this looks like a great rewrite. I can migrate my usage of geobuf to the new version.
Reference Python encoder here: https://github.com/mapbox/pygeobuf
Any comparisons to TopoJSON?
@kkaefer depends on the data — Geobuf is a GeoJSON equivalent, without encoding topology. It will probably be smaller compared to TopoJSON, but I'm thinking about making it support TopoJSON encoding as well for even more compression (or making a separate protobuf schema for TopoJSON).
👍
@mourner allowing integer or string ids in GeoJSON was a mistake, let's try to fix that here. I'm not sure what the better choice is, but I'm convinced there should be only one. And while we're considering other schemas, how about one for GPX-type data? The internet is forever trying to shoehorn GPX into GeoJSON, and if geobuf becomes the new binary GeoJSON, we'll see the same kind of interest.
For comparison, Idaho from geojson-fixtures (above) encoded as topojson, using defaults:
👍 for a separate protobuf schema for TopoJSON. 👍 for integer ids. They are smaller in the protobuf, easier to index, easier to reason about, and often come in that way from other geo encodings. And the vector-tile-spec uses them.
Hmm: @brendan-ward is that using the TopoJSON default of throwing away all attribute data? That setting makes it very apples-oranges |
@tmcw I obviously missed that default behavior - apologies! The numbers are still pretty close. Including properties and specifying quantization hopefully makes it less apples and oranges:
@sgillies not sure if we should impose our own restrictions on the format, as it would make it harder for people to switch to — the idea is that you can throw your GeoJSON at it and it'll just work, and roundtrip back to GeoJSON without issues.
@brendan-ward ha, so we're doing pretty well if we compare it at the same quantization. When we make Topobuf, it'll be even smaller.
To all: wrote an initial decoding implementation (while on a plane) as well. Seems to work except for a rounding issue with coordinates which I'm trying to figure out.
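For what it's worth, a common source of this kind of rounding issue is the order of quantization and delta encoding. Here's a minimal Python sketch assuming coordinates are scaled by a precision factor and delta-encoded; the function names are illustrative, not pygeobuf's API:

```python
# Sketch of quantized delta encoding for coordinates, keeping `precision`
# decimal digits. Rounding BEFORE differencing bounds the roundtrip error
# to half a quantization step; rounding the deltas instead would let
# errors accumulate along a ring.

def encode_coords(coords, precision=6):
    e = 10 ** precision
    deltas, prev = [], 0
    for x in coords:
        q = round(x * e)          # quantize first
        deltas.append(q - prev)   # then delta against the previous quantized value
        prev = q
    return deltas

def decode_coords(deltas, precision=6):
    e = 10 ** precision
    coords, q = [], 0
    for d in deltas:
        q += d
        coords.append(q / e)
    return coords
```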
OK, my Python Geobuf 2 encoder/decoder now works pretty well. It roundtrips (encodes and then decodes into the same JSON) all GeoJSON test cases I have. The only exception is nested properties, but that is solvable. Encoding/decoding isn't very fast (15s/12s for a 22MB GeoJSON file), but I assume this can be heavily optimized later if necessary (e.g. with Cython) — @sgillies knows this magic well. I'm also sure the JS port will be quite fast. I'm now looking into TopoJSON support. The idea is to be able to throw any kind of GeoJSON/TopoJSON at the format and store it losslessly in a compact binary.
Once I have the Topobuf format & reference implementation ready with some feedback from the outside crowd, we can create the geobuf-spec repo (similar to vector-tile-spec) and start other implementations (JS & C++).
I do have a preference for ids this morning. @brendan-ward strings are better than ints because all ints can be cast/converted to strings, but the inverse is not true.
@mourner what about GeoJSON with properties that are arrays or objects?
I think @brendan-ward's suggestion to stringify complex properties as JSON when encoding and parsing them when decoding is a pretty good compromise — it'll enable us to store any type of GeoJSON, while keeping it simple and being efficient most of the time (since nested properties are not that common).
All ints can be remembered as ints when encoding, so that we know for sure that they can be converted back from strings when decoding. Thus Geobuf will never try to convert a string to an integer if the result is not guaranteed.
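The compromise could be sketched like this in Python; the (value, flag) pair is illustrative, not the actual schema:

```python
import json

# Sketch of the nested-properties compromise: dict/list values are stored
# as JSON strings together with a flag, so they can be parsed back on
# decode while scalar values pass through untouched.

def encode_value(value):
    if isinstance(value, (dict, list)):
        return json.dumps(value), True
    return value, False

def decode_value(stored, is_json):
    return json.loads(stored) if is_json else stored
```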
👍
@sgillies @brendan-ward as expected, nested props support was a trivial change: pygeobuf/pygeobuf@714b5df |
Topobuf encoding results are not groundbreaking but still nice — on average ~30% of the original TopoJSON, and the same size or a bit smaller when gzipped.
In-progress work here: https://github.com/mapbox/pygeobuf/compare/topobuf
TopoJSON support is complete: pygeobuf/pygeobuf#7. Now just a few finishing touches and we can ask for feedback from the geo crowd. Pretty excited about this.
@mourner This looks great - nice work! I hate to be the one to ask, but is spatial indexing being viewed as a client-side problem? Would there be any benefit (albeit at the potential cost of additional space) of storing spatial-index-related constructs or precursors in the buffer to aid in client-side indexing? I realize there are lots of spatial index implementations, and that it might be better to keep that out of the scope of the buffer. Extracting features based on spatial location seems like a common problem, so it would be nice to avoid repeated work if possible. What I'm getting at could take a few forms:
Maybe this would be better handled at the next level up from this set of buffers: a container format that combines the geo/topo buffer with an index / bbox buffer, so that geo/topo buffers can focus purely on roundtripping to / from Geo & Topo JSON. |
@brendan-ward I thought about this. And I also happen to be the author of a pretty awesome R-tree-based spatial indexing library — rbush. So obviously I'd want to serialize the R-tree for this purpose. :) Since Protobuf handles tree-like structures with recursive embedded messages, I think it's pretty doable. Do you see any specific problems to R-tree serialization? Maybe we should store it as a separate protobuf file by the way, like shapefiles do — this way we wouldn't compromise on the compression and allow storing indexes separately (which is beneficial in some app setups).
@brendan-ward btw, numbers for the Idaho TopoJSON test case above:
I would just chime in to say that I imagine many implementations of geobuf will use it as a straightforward compression of geographic data while separately managing spatial indexing in some fashion that's applicable to that specific situation. For example cardboard. I'm not opposed to a more compressed serialization of indexes, but my first reaction is that it is a separate concern that shouldn't be addressed in geobuf.
@rclark 👍 to a separate repo for rbush-based spatial indexing and compressed index serialization.
A separate repo for rbush-based spatial indexing & serialization sounds like good separation of concerns. @mourner My concern with serializing the R-Tree vs bounding boxes or other spatial index precursors was simply that the other implementations wouldn't be able to consume the serialized R-Tree directly, their APIs would probably expect the index precursors (e.g., bboxes) instead. But I'm outside my domain on the internals of spatial indexes, so I would definitely defer to you on this. rbush is awesome and would serve as a good implementation to build this around. |
@brendan-ward I don't know much about the cases involving serialized spatial indexes and what kind of applications would consume it, but I don't see much point in storing an index if it can't actually be used as an index. E.g. to create an index from bboxes only, you would spend time running R-tree indexing on them after unpacking, which kinda defeats the purpose of having a stored precalculated index. And if you're fine with just bboxes, you could just as well precalculate and put them in the input data (GeoJSON or TopoJSON) since they're in the spec, and then store in Geobuf without any index extensions. Also, the main challenge of spatial indexing is creating the index. Using it is dead-simple (like, simple depth-first search through a tree is less than 10 lines of code).
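The query side really is tiny. Here is a sketch of that depth-first search, assuming a deserialized tree of nodes with bboxes; the node layout (dicts with "bbox", "children", "leaf" keys) is hypothetical, not rbush's actual format:

```python
# Depth-first search over an R-tree-like structure. A bbox is
# (minx, miny, maxx, maxy); inner nodes carry "children", leaf nodes
# carry "leaf" entries. Only subtrees whose bbox intersects the query
# are descended into.

def intersects(a, b):
    return a[0] <= b[2] and a[1] <= b[3] and a[2] >= b[0] and a[3] >= b[1]

def search(node, bbox, results):
    if not intersects(node["bbox"], bbox):
        return results
    for entry in node.get("leaf", []):
        if intersects(entry["bbox"], bbox):
            results.append(entry)
    for child in node.get("children", []):
        search(child, bbox, results)
    return results
```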
@mourner: very interesting work! We have been missing an efficient modern geo format for a long time. Some ideas:
These are just a few ideas off the top of my head. I'd love to try a C++ implementation to see how things would be implemented there and how performance looks.
@joto thanks for the awesome feedback!
Here's what I have in mind for the index: it would be very small compared to the data file, because it would only contain bboxes, without geometry or properties, but additionally storing the byte index to jump to for reading a particular feature from the main data file. Then to read the features you need, you would do a simple search through the index, save all the feature byte indexes and then just run through them. If non-sequential reads are a concern, you can just sort the indexes before reading. It seems that it would definitely be faster than reading through every feature. What do you think? Regarding bboxes for every feature/geometry — it's pretty easy to add as an option, so not a problem.
I see what you mean. I think we can find a compromise on this. Maybe store the unique keys globally — they are usually the same throughout the whole dataset and are just repeated, so the array will be small, and it will be very fast to read at the beginning. But store values per feature — those aren't duplicated as often as keys.
Protobuf 3 is going to solve this with arena allocation — it will know the deserialized size beforehand and will be able to preallocate all the memory needed.
This sounds like a great idea. I'll definitely give this a try — sounds pretty easy to implement. So I'll just have an array of coordinates and an array of lengths, e.g. for MultiPolygon:
I'll also take a closer look at the OSM PBF schema.
What would you use for the decoder? Our mapbox/pbf.hpp?
@mourner: Byte indexes are problematic, because the protobuf library hides that from you. Maybe there is a way to get the byte index of a feature from the protobuf lib, but I doubt that. If you are going that route you'd have to serialize each feature separately using protobuf but then write each feature into the file yourself, probably with a length header.
That sounds reasonable. As long as we keep the streaming capability, i.e. we are sure we know, before we start writing, all the strings that will end up in the header.
Well, the point is we don't want to allocate memory at all if we can avoid it. Don't know what Protobuf wants to do about it. Couldn't find anything about Protobuf 3. Do you have a link?
In the first version the official protobuf lib. Maybe later something else. When I tried the mapbox pbf parser the last time for Osmium, it didn't implement everything in the Protobuf spec and couldn't do everything that OSM PBF needed. Have to re-evaluate this at some point and see whether it would work for geobuf.
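The framing @joto describes (serializing each feature separately and writing it with a length header yourself) could look like this sketch; the 4-byte little-endian length prefix is an assumption for illustration, not part of any schema:

```python
import io
import struct

# Each feature blob gets a 4-byte little-endian length prefix, and the
# writer records the byte offset of each frame, so an external index can
# seek straight to a feature without parsing the whole file.

def write_features(out, feature_blobs):
    offsets = []
    for blob in feature_blobs:
        offsets.append(out.tell())
        out.write(struct.pack("<I", len(blob)))
        out.write(blob)
    return offsets

def read_feature_at(f, offset):
    f.seek(offset)
    (length,) = struct.unpack("<I", f.read(4))
    return f.read(length)

# round-trip two blobs through an in-memory file
buf = io.BytesIO()
offsets = write_features(buf, [b"feature-one", b"f2"])
```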
Big +1 to giving @joto's idea a shot. Also I agree that while arena allocation in proto3 might mitigate the cost of memory allocation for parsing the message into protobuf objects, the ideal in high-performance situations would be to avoid needing to allocate at all and instead go directly from messages to your application's objects (skipping the intermediate objects libprotobuf creates). This is what https://github.com/mapbox/pbf.hpp and https://github.com/mapbox/pbf are about. I think you all understand this, but I want to re-state it for clarity.
@joto about byte indexes — there can be a problem if you use the official protobuf library, which is very high-level. But with a more low-level approach like pbf.hpp there should be no problem keeping track of byte indexes, and no problem serializing everything in one pass without any additional work.
Let's create issues on anything lacking there. It's simple enough to add something relatively quickly, and there's @kkaefer to help.
@springmeyer implemented flattened array encoding — it doesn't make things much more compact (just a few bytes here and there) because most of the weight is still linestrings, which are packed — there aren't that many multilinestrings and multipolygons in real-world datasets, and they usually don't have a lot of rings — just a couple. But what I like about this approach is its simplicity. It makes both the schema and the encoding/decoding code simpler.
@mourner: re byte indexes: I think it is important to stay compatible with the official protobuf libraries. If nothing else it allows other people to have their own implementations based on those libraries possibly in many different languages. |
@joto good point. I'll think about this. At least we can do indexing + byte indexes as a separate tool, not baking anything specific to it into the Geobuf schema itself.
Made some progress:
JS encoder/decoder implemented in #29
The more ideas I find for improving Geobuf's encoded size, the more I realize that we would need to completely rewrite the schema to support them, breaking compatibility. So I'm opening this ticket to start a discussion about what a perfect Geobuf schema would look like (not to say this is a priority, but still a good thing to discuss).
I wrote a prototype schema with all the improvements I could think of here: https://gist.github.com/mourner/3c6ddca04c9772593302
The main difference is utilizing the power and flexibility of the new oneof statement to create a better and more compact schema.
Features:
- A top-level oneof of Feature, FeatureCollection, Geometry or GeometryCollection, so you don't need to guess this when decoding
- All property keys and values are stored in a top-level Data object, and features only store indexes to them (like vector-tile-spec does); keys and possibly values are reduced to unique values (Fancy encoding for properties #26) — for much better properties packing
- Property values are a oneof set of different fields (depending on type), which solves the ambiguity with null values since a oneof field can be empty (without a default value), and also makes it easier to understand and work with
- A feature contains a oneof of either geometry or geometry collection instead of repeated geometries
- Coordinates are sint32 to take full advantage of varint and zig-zag encoding (Switch to sint32 for coords #24) — for much more compact geometries
- The Value message is also a oneof of different value types
- sint32 is used instead of int64 — it's more compact and JS doesn't actually handle int64 values; in addition, the uint type becomes uint32
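For reference, the zig-zag transform behind protobuf's sint32 maps signed integers to unsigned ones so that small negative deltas stay small as varints. A Python sketch of the standard mapping (Python ints are arbitrary-precision, so the arithmetic right shift acts as sign extension, matching the C implementation for 32-bit values):

```python
# Standard protobuf zig-zag mapping for sint32:
# 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
# so small magnitudes, positive or negative, encode to few varint bytes.

def zigzag_encode(n):
    return (n << 1) ^ (n >> 31)

def zigzag_decode(z):
    return (z >> 1) ^ -(z & 1)
```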
@tmcw @springmeyer what do you think?