Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Advertizing geometry field "type" in Column metadata ? #41

Closed
rouault opened this issue Mar 10, 2022 · 15 comments · Fixed by #51
Closed

Advertizing geometry field "type" in Column metadata ? #41

rouault opened this issue Mar 10, 2022 · 15 comments · Fixed by #51
Milestone

Comments

@rouault
Copy link
Contributor

rouault commented Mar 10, 2022

A common use case if for a geometry column to hold a single geometry type (Point, LineString, Polygon, ...) for all its records. It could be good to have an optional "type" field name under https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#column-metadata to capture that when it is known.

This would help for example conversion between GeoParquet and GIS formats (shapefiles, geopackage) that have typically this information in their metadata.

Values for type could be the ones accepted by GeoJSON: Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, GeometryCollection (that would be extended to CircularString, CompoundCurve, CurvePolygon, MultiCurve, MultiSurface, PolyhedralSurface, TIN if we support ISO WKB, and with Z, M or ZM suffixes for other dimensionalities)

What to do when there is mixed geometry types ?

  • do not set "type"
  • set "type": "mixed"
  • set "type": array of values, e.g. [ "Polygon", "MultiPolygon" ] (this one would be typical when converting from shapefiles where the polygonal type can hold both polygons and multipolygons )
@cholmes cholmes added this to the 0.2 milestone Mar 10, 2022
@cholmes
Copy link
Member

cholmes commented Mar 10, 2022

Seems like a good idea to me. I could even see this being a required field - ensure people at least try to fill it out instead of giving an option to be lazy and leave it blank even though clients would like to know that the data is all one type.

So I lean towards options two or three. And option 3 does seem like it'd be useful for the shapefile use case.

Can you explain how GeometryCollection is different than 'mixed'? I have a vague idea but would be interested to hear how it works.

@rouault
Copy link
Contributor Author

rouault commented Mar 10, 2022

Can you explain how GeometryCollection is different than 'mixed'

'mixed' would mean you have for example record 0 is a Polygon and record 1 is a MultiPolygon
'GeometryCollection' would mean record 0 is a GEOMETRYCOLLECTION(POINT ..., LINESTRING ...) and record 1 is a GEOMETRYCOLLECTION(LINESTRING ...., POLYGON ....)

@jorisvandenbossche
Copy link
Collaborator

+1 on adding such an optional field.

We briefly discussed a potential optional "geometry_type" field in the initial version (geoarrow/geoarrow#2), but in the end didn't add it in the first version and left it as a follow-up, as we didn't have a direct need for it (geopandas itself doesn't track this, or doesn't have any optimization depending on knowing that). But I agree it can be useful information in general.

@jorisvandenbossche
Copy link
Collaborator

Values for type could be the ones accepted by GeoJSON

Following the "type" values from GeoJSON sounds good.
Personally I would be hesitant with allowing additional ones (Curve, Surface, etc), even if we allow ISO WKB, given that (I think?) those are not that widely supported. But it's of course also easy to check this value and raise a nice error message if you are a reader that doesn't support those extended types.
Also commented about this on #18

and with Z, M or ZM suffixes for other dimensionalities

For clarification: would it be strictly required to use the Z suffix if you have 3D data? (for example, specifying type="Point" would be invalid metadata if you actually have all 3D points?)
And if so, and you would have mixed 2D and 3D points, would that need to be specified as "mixed"? (or actually you could also use ["Point", "Point Z"] I suppose, if we allow an array of types)

@rouault
Copy link
Contributor Author

rouault commented Mar 15, 2022

would it be strictly required to use the Z suffix if you have 3D data? (for example, specifying type="Point" would be invalid metadata if you actually have all 3D points?)

I'd say yes. For example PostGIS implements very strict dimensionality checks, and if using ogr2ogr to ingest a Parquet file into PostGIS, we could get into issues if the Parquet geometry type is not correctly declared.

example:

$ psql -d autotest
psql (12.9 (Ubuntu 12.9-0ubuntu0.20.04.1))
Type "help" for help.

autotest=# create table test(my_geom geometry(POINTZ));
CREATE TABLE
autotest=# insert into test values ( ST_GeomFromText('POINT(1 2 3)') );
INSERT 0 1
autotest=# insert into test values ( ST_GeomFromText('POINT(1 2)') );
ERROR:  Column has Z dimension but geometry does not

autotest=# drop table test;
DROP TABLE
autotest=# create table test(my_geom geometry(POINT));
CREATE TABLE
autotest=# insert into test values ( ST_GeomFromText('POINT(1 2 3)') );
ERROR:  Geometry has Z dimension but column does not
autotest=# insert into test values ( ST_GeomFromText('POINT(1 2)') );
INSERT 0 1

and you would have mixed 2D and 3D points, would that need to be specified as "mixed"? (or actually you could also use ["Point", "Point Z"] I suppose, if we allow an array of types)

yes

@rouault
Copy link
Contributor Author

rouault commented Mar 24, 2022

For information, the initial OGR Parquet driver writes a provisional "gdal:geometry_type" property in the "geo" metadata for now (on reading it will try both "geometry_type" and "gdal:geometry_type"). For now, it only writes for the 2D case: "Point", "LineString", "Polygon", "MultiPoint", "MultiLineString", "MultiPolygon", "GeometryCollection" or "mixed" (the later matches OGR wkbUnknown status). For the Z / M cases, do we want to append the dimensionality just after, or with a separator space: ie "PointZ" or "Point Z" ?

@jorisvandenbossche
Copy link
Collaborator

or "mixed" (the later matches OGR wkbUnknown status)

Do we want to distinguish "mixed" from "unknown"? (or not set)
I suppose that in the case of "mixed", GDAL would not do the effort of computing the geometry type, while if it is "unknown" it would still do that (by default / depending on the env flag)?

For the Z / M cases, do we cant to append the dimensionality just after, or with a separator space: ie "PointZ" or "Point Z" ?

I would maybe go with a space. Or are you aware of any projects that we could follow here?
Checking with GDAL, it seems that ogrinfo uses "3D Point" instead.

@rouault
Copy link
Contributor Author

rouault commented Mar 25, 2022

Do we want to distinguish "mixed" from "unknown"? (or not set)
I suppose that in the case of "mixed", GDAL would not do the effort of computing the geometry type, while if it is "unknown" it would still do that (by default / depending on the env flag)?

The wkbUnknown state is typically used by GDAL to report geometry columns that may contain mixed geometry types, e.g a GEOMETRY PostGIS column. The OGR PostGIS driver will not try to check if there's a mix of geometry types in the table content or if it is uniform, because that could be too lengthy.
So for GeoParquet, if a "geometry_type" attribute is there and says "unknown" (or "mixed" if that would be what would be used), the OGR driver wouldn't make any effort to guess more. It only does that in the current state of the spec since there's no official way to report a geometry type.
Maybe "mixed" is not so useful. If you know the composition of the mix, then use "geometry_type": [ "Polygon", "MultiPolygon" ], and if you don't (or want a generic schema that can apply to other files), use "geometry_type": "unknown" ?

I would maybe go with a space

would be fine for me. With a space, this is consistent with the beginning of a ISO WKT (if ignoring case), e.g "POINT Z (1 2 3)"

Or are you aware of any projects that we could follow here?

No, can't think of other references.

Checking with GDAL, it seems that ogrinfo uses "3D Point" instead.

This is a human facing representation (and that predates ISO WKT/WKB support). I wouldn't use that for JSON encoding.

@jorisvandenbossche
Copy link
Collaborator

Maybe "mixed" is not so useful. If you know the composition of the mix, then use "geometry_type": [ "Polygon", "MultiPolygon" ], and if you don't (or want a generic schema that can apply to other files), use "geometry_type": "unknown" ?

Sounds good. It's still an optional field, so you also have the option of not including it to signal that it is unknown .. ?

@rouault
Copy link
Contributor Author

rouault commented Mar 25, 2022

It's still an optional field, so you also have the option of not including it to signal that it is unknown .. ?

yes, probably for a file that declares a version of the spec that supports geometry_type, the absence of geometry_type should be understood as the same as a "geometry_type": "unknown", and would not trigger file content analysis.

@cholmes
Copy link
Member

cholmes commented Mar 25, 2022

'unknown' makes sense to me.

I do think it'd be nice to have some nudge on implementors to at least try to add it if possible, since it is clearly useful for a set of tools. Perhaps make it required, but allow 'unknown', so that implementors need to at least think about if they can provide something useful. Versus thinking 'oh, it's an optional field, I don't need to worry about it'.

Looks like Flatgeobuf uses 'Unknown' for similar use case: https://github.com/flatgeobuf/flatgeobuf/blob/6884cb05438686947f7bf3776c7ae5c9febfbf5a/src/fbs/header.fbs#L70

And I think with their header stuff the field is required?

@jorisvandenbossche
Copy link
Collaborator

I just opened a draft PR adding some initial text: #51

Perhaps make it required, but allow 'unknown', so that implementors need to at least think about if they can provide something useful.

It's indeed maybe good to make a choice between either keeping it optional (and leaving out "Unknown") or making it required but allow "Unknown". Because otherwise we just give people two ways to say that it is not known.

(on the other hand, making it optional is more backwards compatible, so even if we make it "required" I would hope that implementations are a bit flexible)

@cholmes
Copy link
Member

cholmes commented Mar 25, 2022

It's indeed maybe good to make a choice between either keeping it optional (and leaving out "Unknown") or making it required but allow "Unknown". Because otherwise we just give people two ways to say that it is not known.

Ah right, yeah - two ways to say it's not known seems less good. I continue to lean towards explicitly saying Unknown. And if implementations choose to support 0.1 then they obviously need to have that flexibility. Right now it's important that implementations have that flexibility, but I'm hoping that when we go 1.0 we'll have a lot more implementations, and that for those it'll be reasonable to just say 'I only support version 1.0' (while there will also be lots of great libraries that support all the early versions).

Note I will shift at some point relatively soon to caring a lot about backward compatibility and do appreciate your push on it @jorisvandenbossche. But I feel in the early 0.1 / 0.2 / 0.3 times we should give ourselves some flexibility to change as we discuss and get wider feedback.

@Jesus89
Copy link
Collaborator

Jesus89 commented Apr 18, 2022

IMO, if the field is optional there is no need for the "Unknown" geometry type. Just not including the field should be enough. Otherwise, we need to explain what "Unknown" means, and the users of the spec should parse the "Unknown" type.

@alasarr
Copy link
Collaborator

alasarr commented Apr 18, 2022

Yes, @Jesus89, I think we're all on the same page.

@jorisvandenbossche comment:

It's indeed maybe good to make a choice between either keeping it optional (and leaving out "Unknown") or making it required but allow "Unknown". Because otherwise we just give people two ways to say that it is not known.

And @cholmes commit suggestion: #51 (review)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants