Clarify usage with nested and repeated columns #47

Closed
mentin opened this issue Mar 23, 2022 · 6 comments · Fixed by #138

Comments

@mentin

mentin commented Mar 23, 2022

The Parquet format supports nested and repeated fields. I assume the geometry columns are not limited to the top-level columns, and can be both nested and repeated.

1. Names

The format spec talks about column names; however, with a nested structure, a name might not uniquely identify a column.

I suggest using the column path (like "a.b.c") in the docs to avoid the ambiguity. It would coincide with the column name in the typical case of a top-level geometry column.
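
To illustrate the ambiguity, here is a minimal sketch in plain Python. The dict-based schema model and the `leaf_paths` helper are hypothetical and only stand in for a real Parquet schema; they show two leaf columns that share the name `geometry` but have distinct column paths.

```python
# Hypothetical schema model (not a Parquet API): a dict maps a field name
# either to a nested dict (a group) or to a type string (a leaf column).
schema = {
    "id": "int64",
    "geometry": "binary",
    "parts": {
        "geometry": "binary",
        "height": "double",
    },
}

def leaf_paths(fields, prefix=()):
    """Yield the dotted path of every leaf column, depth first."""
    for name, node in fields.items():
        path = prefix + (name,)
        if isinstance(node, dict):
            yield from leaf_paths(node, path)
        else:
            yield ".".join(path)

paths = list(leaf_paths(schema))
print(paths)
# ['id', 'geometry', 'parts.geometry', 'parts.height']

# The bare name "geometry" matches two columns; the paths are unique.
same_name = [p for p in paths if p.split(".")[-1] == "geometry"]
print(same_name)
# ['geometry', 'parts.geometry']
```

For the top-level column, the path and the name are the same string, so existing files with only top-level geometry would be unaffected by this wording change.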

2. Primary column

Can the primary column be a nested column, or a repeated column, i.e. contain a list of geography values?

There is nothing that prevents this in the standard, but I guess primary_column was designed to be mapped to the built-in geometry column in formats like GeoJSON or shapefiles, and these assume non-repeated top-level columns. We can either

  • allow this, and let the tools decide what to do with this, or
  • restrict primary geometry column to be top-level and make primary_column metadata optional so it can be omitted when a GeoParquet file has no top-level geometry.
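
The second option could be checked roughly as follows. This is only a sketch under stated assumptions: the metadata is taken to be a dict with `primary_column` and `columns` keys, and `check_primary_column` is a hypothetical helper, not part of any existing library.

```python
def check_primary_column(geo_meta, top_level_columns):
    """Return a list of problems; an empty list means the metadata passes.

    Assumes the second option: `primary_column` may be omitted, but when
    present it must name a top-level column that is listed in `columns`.
    """
    problems = []
    primary = geo_meta.get("primary_column")
    if primary is None:
        return problems  # optional under this option, nothing to check
    if primary not in top_level_columns:
        problems.append(f"primary_column {primary!r} is not a top-level column")
    if primary not in geo_meta.get("columns", {}):
        problems.append(f"primary_column {primary!r} is not listed in 'columns'")
    return problems

top_level = ["id", "parts"]

# A file whose only geometry is nested: primary_column is simply omitted.
nested_only = {"columns": {"parts.geometry": {}}}
print(check_primary_column(nested_only, top_level))
# []

# Naming the nested column as primary would be rejected.
nested_primary = {"primary_column": "parts.geometry",
                  "columns": {"parts.geometry": {}}}
print(check_primary_column(nested_primary, top_level))
```

Under the first option, the top-level check would be dropped and tools would decide for themselves how to handle a nested or repeated primary column.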
@cholmes
Member

cholmes commented Mar 25, 2022

Thanks for the great feedback!

For 1. I think the column path makes good sense.

For 2. I lean towards restricting the primary geometry column to be top-level, so that conversion to GeoJSON / shapefile is clear and straightforward to implement. And I suppose making primary_column optional makes sense, but I feel like it'd be good to have something nudging people towards defining it if possible. But I certainly see the usefulness of allowing big Parquet datasets that just have a nested geospatial value to be compliant without making them say 'this is a geo file'.

@felixpalmer
Collaborator

I agree on point 1.

My feeling on 2 is that the primary_column should be restricted to be a top-level column, for a couple of reasons:

  • As we are designing a geo-format, it feels natural that the geographic information is available at the top level.
  • We are at the 0.1 stage of the spec, so I think it is best to have this restriction and review it later. It will make it easier to make headway with implementations.
  • How does this interact with the GeoArrow spec (https://github.com/geopandas/geo-arrow-spec/)? We don't want to add flexibility to the GeoParquet spec that makes it hard to implement in the linked GeoArrow spec.

@cholmes cholmes added this to the 1.0.0-beta.1 milestone Oct 24, 2022
@cholmes
Member

cholmes commented Nov 7, 2022

Call 11/7

For the first version (1.0.0) we want to limit geometry columns to the top level only. Very few geospatial packages would be able to understand nested geometry columns. But if someone has a use case for nested geometry columns, we can potentially add support in the future.

And the repetition must be optional or required (not repeated).

We need to update the part of the spec that describes geometry columns to state explicitly that group (nested) columns are not supported and that the repetition must be required or optional.
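
As a rough illustration of the rule described above, the sketch below checks whether a column would qualify as a geometry column. The `Column` class and `allowed_as_geometry` function are hypothetical, and the check assumes dotted column paths where field names themselves contain no dots.

```python
from dataclasses import dataclass

# Repetition values permitted for geometry columns under the 1.0.0 decision.
ALLOWED_REPETITION = {"required", "optional"}

@dataclass
class Column:
    path: str        # dotted column path, e.g. "geometry" or "floors.geometry"
    repetition: str  # "required", "optional", or "repeated"

def allowed_as_geometry(col: Column) -> bool:
    """True if the column is top-level and not repeated."""
    is_top_level = "." not in col.path  # assumes no dots inside field names
    return is_top_level and col.repetition in ALLOWED_REPETITION

columns = [
    Column("geometry", "required"),
    Column("centroid", "optional"),
    Column("floors.geometry", "optional"),  # nested inside a group
    Column("parts", "repeated"),            # top-level but repeated
]
results = {c.path: allowed_as_geometry(c) for c in columns}
print(results)
# {'geometry': True, 'centroid': True, 'floors.geometry': False, 'parts': False}
```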

@mentin
Author

mentin commented Nov 17, 2022

I think it is the right decision for v1.

But I also wonder if there are many geospatial packages that support multiple geometry columns? I would think most that don't support nesting / repetition would also ignore all the columns besides "primary_column", and then nesting / repetition of additional geometry columns should not matter :).

We do have several customers who use repeated geometry columns. Typically, the primary geometry column is a top-level required column, and it is broken into parts, which are stored as nested and/or repeated columns. What I remember:

  • a building and individual floors as repeated geometry,
  • a linestring path and a repeated struct containing the vertices of that linestring with additional data columns (think of M/Z dimension on steroids - where you can have many columns of arbitrary types for each vertex).

In these cases the primary geometry column is non-nested and non-repeated, but there are other geometry columns that are nested inside a repeated struct.
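
A minimal sketch of the building example, using plain Python dataclasses to stand in for a row. The class and field names are hypothetical, and the byte strings are placeholders rather than real WKB.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Floor:
    level: int
    geometry: bytes  # footprint of this floor (placeholder, not real WKB)

@dataclass
class Building:
    geometry: bytes  # primary geometry: top-level, non-repeated
    floors: List[Floor] = field(default_factory=list)  # repeated struct

building = Building(
    geometry=b"<building outline>",
    floors=[Floor(0, b"<floor 0>"), Floor(1, b"<floor 1>")],
)

# The nested geometries live at the column path "floors.geometry".
nested_geometries = [f.geometry for f in building.floors]
print(len(nested_geometries))
# 2
```

A reader that only understands `Building.geometry` can still use such a row; only `floors.geometry` falls outside the top-level, non-repeated restriction.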

@tschaub
Collaborator

tschaub commented Nov 17, 2022

Yeah, I can imagine this will be something that is revisited. From a writer's perspective, given that Parquet is capable of representing repeated and group fields, it is somewhat odd that a "geo" extension would restrict that. I guess we are anticipating the needs of readers in adding this restriction - but it may turn out to be unnecessarily restrictive.

@jorisvandenbossche
Collaborator

> But I also wonder if there are many geospatial packages that support multiple geometry columns?

GeoPandas supports this, and it seems R sf does as well (https://cran.r-project.org/web/packages/sf/vignettes/sf6.html#how-does-sf-deal-with-secondary-geometry-columns). PostGIS supports this as well (https://gis.stackexchange.com/questions/176263/can-a-postgis-table-or-view-have-two-geometry-columns).
I know that GDAL also supports this in its OGR data model and C API, but it depends on the bindings to GDAL whether it's actually supported (I know that the Python bindings right now will only return a single (first) geometry column).

I can certainly see the use case of repeated (list/array type) geometry columns. I also assume that databases (like BigQuery) that have both a proper array type and geometry/geography type will typically not limit combining those two in a repeated geometry type?

5 participants