Clarify usage with nested and repeated columns #47
Comments
Thanks for the great feedback! For 1, I think the column path makes good sense. For 2, I lean towards restricting the primary geometry column to be top-level, so that conversion to GeoJSON / Shapefile is clear and straightforward to implement. And I suppose making
I agree on point 1. My feeling on 2 is that the
Call 11/7: For the first version (1.0.0) we want to limit geometry columns to the top level only, because very few geospatial packages would be able to understand nested geometry columns. If someone has a use case for nested geometry columns, we can potentially add support in the future. The repetition of a geometry column must be optional or required (not repeated). We need to update the spec's description of geometry columns to state explicitly that grouped (nested) columns are not supported and that the repetition must be required or optional.
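To make the rule from the call concrete, here is a minimal sketch of how a writer or validator might check it. It does not use any real Parquet library: `SchemaField` and `geometry_column_errors` are hypothetical names invented for this illustration, the schema is modelled as plain Python objects, and the sketch assumes column names themselves contain no dots.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SchemaField:
    """Hypothetical stand-in for a Parquet schema element."""
    name: str
    repetition: str  # "required", "optional", or "repeated"
    children: List["SchemaField"] = field(default_factory=list)


def geometry_column_errors(root_fields: List[SchemaField], column_path: str) -> List[str]:
    """Return the reasons a column breaks the proposed v1 rule (empty list if none)."""
    parts = column_path.split(".")
    if len(parts) > 1:
        # A dotted path means the column lives inside a group.
        return [f"{column_path}: geometry columns must be top-level"]

    match = next((f for f in root_fields if f.name == parts[0]), None)
    if match is None:
        return [f"{column_path}: no such top-level column"]

    errors = []
    if match.children:
        errors.append(f"{column_path}: a group cannot be a geometry column")
    if match.repetition not in ("required", "optional"):
        errors.append(f"{column_path}: repetition must be required or optional")
    return errors


# Example schema: one plain geometry column, one repeated group, one repeated leaf.
root = [
    SchemaField("geometry", "optional"),
    SchemaField("parts", "repeated", [SchemaField("geom", "required")]),
    SchemaField("centroids", "repeated"),
]

print(geometry_column_errors(root, "geometry"))    # accepted
print(geometry_column_errors(root, "parts.geom"))  # rejected: nested
print(geometry_column_errors(root, "centroids"))   # rejected: repeated
```

Returning a list of reasons rather than a boolean keeps the check easy to relax later if nested or repeated geometry columns are allowed in a future version.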
I think it is the right decision for v1. But I also wonder whether many geospatial packages support multiple geometry columns at all. I would think most packages that don't support nesting / repetition would also ignore all the columns besides "primary_column", in which case nesting / repetition of additional geometry columns should not matter :). We do have several customers who use repeated geometry columns. Typically, the primary geometry column is a top-level required column, and it is broken into parts, which are stored as nested and/or repeated columns. What I remember:
In these cases the primary geometry column is non-nested and non-repeated, but there are other geometry columns nested inside a repeated struct.
Yeah, I can imagine this will be something that is revisited. From a writer's perspective, given that Parquet is capable of representing repeated and group fields, it is somewhat odd that a "geo" extension would restrict that. I guess we are anticipating the needs of readers in adding this restriction, but it may turn out to be unnecessarily restrictive.
GeoPandas supports this, and it seems R I can certainly see the use case of repeated (list/array type) geometry columns. I also assume that databases (like BigQuery) that have both a proper array type and a geometry/geography type will typically not restrict combining the two into a repeated geometry type?
The Parquet format supports nested and repeated fields. I assume the geometry columns are not limited to the top-level columns, and can be both nested and repeated.
1. Names
The format spec talks about column names; however, with a nested structure, a name might not uniquely identify a column.
I suggest using the column path (like "a.b.c") in the docs to avoid the ambiguity. It would coincide with the column name in the typical case of a top-level geometry column.
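As an illustration of the ambiguity, the sketch below flattens a made-up nested schema into dotted column paths. The schema layout and the `column_paths` helper are hypothetical and not part of any Parquet library; dict values stand for groups and strings for leaf types.

```python
from typing import Iterator

# Hypothetical nested schema: dicts are groups (structs), strings are leaf types.
schema = {
    "id": "int64",
    "geometry": "binary",
    "building": {
        "geometry": "binary",
        "entrance": {"geometry": "binary"},
    },
}


def column_paths(node: dict, prefix: tuple = ()) -> Iterator[str]:
    """Yield the dotted path of every leaf column, depth first."""
    for name, child in node.items():
        path = prefix + (name,)
        if isinstance(child, dict):
            yield from column_paths(child, path)
        else:
            yield ".".join(path)


paths = list(column_paths(schema))
leaf_names = [p.split(".")[-1] for p in paths]

print(paths)       # each path is unique
print(leaf_names)  # the bare name "geometry" appears more than once
```

For a top-level column the path and the name are the same string, so existing files with only top-level geometry would be unaffected by switching the docs to paths.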
2. Primary column
Can the primary column be a nested column, or a repeated column, i.e. contain a list of geography values?
There is nothing that prevents this in the standard, but I guess the primary column was designed to be mapped to the built-in geometry column in formats like GeoJSON or Shapefiles, and these assume non-repeated top-level columns. We can either
primary_column metadata optional so it can be omitted when a GeoParquet file has no top-level geometry.