Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add GeoArrow encoding as an option to the specification #189
Add GeoArrow encoding as an option to the specification #189
Changes from 5 commits
cd8c0a2
3452b53
c94bcaf
3d28ebc
146dcf0
f8e0ae2
35a2919
2a41943
6dc23dc
5a996f8
0fac197
88ae045
ec154cc
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that geometry column with mixed types of geometries cannot be encoded as GeoArrow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not until apache/parquet-format#44 is merged (if ever)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can (or will be once we sort out the details of geoarrow/geoarrow#43 ), although it's unclear exactly how we'd do that in Parquet or if it would be useful in Parquet. In any case, it would be a future addition!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Columns with mixed geometry values are quite common for most query engines with geospatial support. Most of the time geometry columns have the umbrella type "geometry" or "geography", and it is not practical to first resolve the subtypes of the geometries before writing out parquet files. I'd look forward to a columnar encoding supporting mixed geometry types as well as geometry collections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arrow and Parquet are two different specs. Arrow has a union type, which allows for mixed geometry types in a single column, while maintaining constant-time access to any coordinate. Parquet does not today have a union type, so it's impossible to to write the
Geometry
andGeometryCollection
arrays in geoarrow/geoarrow#43 natively to Parquet.GeoArrow implementations are able to statically know whether they have singly-typed geometries in a column, in which case they can write one of the 6 primitive types. GeoArrow implementations will have to fall back to WKB-encoded geometries for mixed-type columns. I don't see how this is something we could realistically change, unless we essentially re-implement union handling in a struct, which would be a big ask for implementors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good points!
I think Kyle put the summary best:
In the context of this PR, that would mean that the column option
geoarrow_type
could in the future be set to"geoarrow.mixed"
.I don't think we anticipated that writing mixed geometries in geoarrow to Parquet would be the main use-case. If this is an important use, please chime in on geoarrow/geoarrow#43 with some details! We definitely don't want to represent mixed geometries in a way that prevents them being used.
This is only true if there's no ability for a user to supply any information about the encoding. If there is,
write_geoparquet(..., encoding = geoarrow("geoarrow.point"))
should do it. Typically the user does know this information (even if the database does not).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This discussion is making me think that the GeoParquet spec should not be defined in terms of GeoArrow. Rather it should be defined as "native type" or "flat type" or similar. Then a sentence in prose can mention that it overlaps partially with the GeoArrow spec.
I'm also becoming convinced that the serialized form need not exactly overlap with GeoArrow. On the topic of mixed arrays specifically, as possibly the only one who has written an implementation of GeoArrow mixed arrays, I've become an even stronger proponent of using an Arrow union type for GeoArrow mixed arrays because of its ability to know geometry types statically. So I think the best way forward for GeoParquet (for a future PR) would be to discuss a "struct-union" approach for GeoParquet that is not the same in-memory representation as GeoArrow.
I think changing nomenclature will also be clearer to non-arrow-based implementations that reading and writing this "native" encoding of GeoParquet is not dependent on using Arrow internally.
So my recommendation would be to take out most references to geoarrow from this PR. I.e., we don't want the metadata key called
geoarrow_type
if there's a possibility where the GeoParquet union type is not the same as the GeoArrow union type.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually think the strength of this PR is the strong delegation to the GeoArrow specification: I don't think we should be developing two specifications, particularly since we have very little engagement on the GeoArrow spec already. We've spent quite a bit of time documenting the memory layouts for geoarrow in that specification and I don't think it would be productive to copy/paste those here and maintain them independently. I also don't think it would be productive to link to the GeoArrow specification for documentation of all the memory layouts but very pointedly not call it GeoArrow.
It may be that representing mixed geometry is not important in the context of GeoParquet (Maybe WKB is just as fast in the context of compression + IO + Parquet's list type? Have we checked?), or it may be that there is a common memory representation that makes sense for both specifications that will improve interoperability (although that would involve quite a lot of reimplementation on Kyle's end 😬 ).
I don't want us to loose track of the main point here, which is that this PR is mostly about enabling very efficient representations of single-type geometries, which are very commonly the types of files that you might want to put in a giant S3 bucket and scan efficiently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could go back to to the question about which "encoding" values to allow, and instead of a generic
"geoarrow"
option (with an additional "geoarrow_type" key to then be more specific), have actual encoding options"point"
,"linestring"
,"polygon"
, etc.(i.e. like one of the options we initially discussed was also "geoarrow.point", "geoarrow.linestring", etc, but then just dropping the "geoarrow." prefix)
For the rest, it is mostly a question about how to document this: how to phrase this exactly in the specification, how strongly to tie it to geoarrow (or just reference as mostly similar), how much to duplicate the details of the memory layout, etc. But that's all "details" about the best way to document it, while keeping the actual specification (what ends up in the metadata in a file) agnostic to geoarrow.
I think I am somewhat convinced by @kylebarron's points on this, and like to idea of having the actual spec changes not use "geoarrow" (and then we can still debate how much to use the term in the explanation of it).
For example, as long as the specification would exactly overlap (or be a strict subset of geoarrow), we can still point to the geoarrow spec for the details to avoid too much duplication (and having to maintain two versions). And this is also easy to change then in the future if we would want to introduce differences.
On the other hand, for an implementation of GeoParquet in some library that has nothing to do with Arrow (doesn't use an Arrow implementation under the hood), the "geoarrow" name is also somewhat uninformative, when strictly looking at it from a GeoParquet point of view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it's worth crafting a new PR that uses the language you all are hoping for with a draft implementation? I don't currently have the bandwidth to do that but am happy to review!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we keep the information of the first sentence somewhere? (i.e. that for a WKB encoding, the geometry column MUST be stores using the
BYTE_ARRAY
parquet type)(you kept the "Implementation note" just below that also mentions
BYTE_ARRAY
, but that is not so specific as the above)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! We could also add some more in here about the Parquet physical description of how nesting works (but maybe in a future PR?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, this doc doesn't even allow interleaved coordinates as GeoParquet.
I'm sensitive to the complexity concerns of having too many options in the spec, but I see this as favoring the "support cloud-native remote queries" use case over the "efficient file format, but reading and writing whole tables" use case. It "feels" like there's still a strong pull in general towards storing interleaved coordinates across the geo ecosystem.
That said, the memcopy to and from separated coordinates is pretty fast, so I can tolerate this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The summary of why interleaved coordinates are not a good canidiate as of this writing are:
Demo of column statistics:
Demo of slowness:
Demo of random errors for NULL values:
These are probably all solveable/might be unique to Arrow C++-backed implementations, but I am not sure it is the best encoding to start with (and it does seem like a good idea to start with just one encoding to minimize burden on implementors).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a json schema expert, but would we be able to make this conditionally required? It looks like
dependentRequired
meets what we need, though I don't know what version of json schema we're pinned to.