Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Array of geometry types #140

Merged
merged 1 commit into from Nov 18, 2022
Merged

Conversation

tschaub
Copy link
Collaborator

@tschaub tschaub commented Nov 7, 2022

This renames the geometry_type property to geometry_types and updates the JSON type to be an array of strings (instead of a string or array of strings). The spec language has been updated to reflect that an empty array indicates that the geometry types are not known. The schema now enforces that the items in the list are one of the expected types and allows the Z suffix for everything except GeometryCollection.

I updated the example.parquet file to use geometry_types instead of geometry_type. (I followed the readme, but am struggling to run the validator, so admit I haven't yet done that.)

I bumped the schema version to 0.5.0-dev. Ideally this would happen as part of the process for releasing a tag (create release branch, update version, create tag, generate release, bump version back to X.Y.Z-dev, merge branch).

Fixes #132.
Fixes #133.

@rouault
Copy link
Contributor

rouault commented Nov 7, 2022

allows the Z suffix for everything except GeometryCollection.

why not GeometryCollection ? "GeometryCollection Z" is a valid WKB type

@tschaub
Copy link
Collaborator Author

tschaub commented Nov 7, 2022

why not GeometryCollection ? "GeometryCollection Z" is a valid WKB type

My bad. I removed the language about this.

@tschaub
Copy link
Collaborator Author

tschaub commented Nov 7, 2022

Should the validate-examples CI job be updated to use the validator and schema from the repo instead of from pypi?

@tschaub tschaub mentioned this pull request Nov 7, 2022
9 tasks
@kylebarron
Copy link
Collaborator

kylebarron commented Nov 7, 2022

Should the validate-examples CI job be updated to use the validator and schema from the repo instead of from pypi?

If I'm reading it correctly I think it is being installed from this repo

python -m pip install --no-binary geoparquet_validator .

The . is referring to this repo, and --no-binary geoparquet_validator is I think just telling pip to force using the source instead of a binary wheel. Not 100% sure why we'd need --no-binary in any case. geoparquet-validator isn't even pushed to pypi, so it can't be installing from there.

@kylebarron
Copy link
Collaborator

kylebarron commented Nov 7, 2022

Oh @tschaub you have to change this line

"required": ["encoding", "geometry_type"]

format-specs/geoparquet.md Outdated Show resolved Hide resolved
format-specs/geoparquet.md Outdated Show resolved Hide resolved
@tschaub tschaub force-pushed the geometry-types branch 2 times, most recently from 38a30e4 to 98fd71e Compare November 7, 2022 21:36
@tschaub
Copy link
Collaborator Author

tschaub commented Nov 7, 2022

Should the validate-examples CI job be updated to use the validator and schema from the repo instead of from pypi?

If I'm reading it correctly I think it is being installed from this repo

python -m pip install --no-binary geoparquet_validator .

The . is referring to this repo, and --no-binary geoparquet_validator is I think just telling pip to force using the source instead of a binary wheel. Not 100% sure why we'd need --no-binary in any case. geoparquet-validator isn't even pushed to pypi, so it can't be installing from there.

Thanks for the clarification. I missed the ..

@@ -53,7 +53,7 @@ Each geometry column in the dataset must be included in the columns field above
| Field Name | Type | Description |
| --- | --- | --- |
| encoding | string | **REQUIRED** Name of the geometry encoding format. Currently only 'WKB' is supported. |
| geometry_type | string or \[string] | **REQUIRED** The geometry type(s) of all geometries, or 'Unknown' if they are not known. |
| geometry_types | \[string] | **REQUIRED** The geometry types of all geometries, or an empty array if they are not known. |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the field can be empty, why require it anyway?

Copy link
Collaborator

@m-mohr m-mohr Nov 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the description should probably mention that it's a set of values (i.e. unique items in the array - see also the other comment regarding the schema)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is optional, I assume we wouldn't assign any different meaning to {} and {"geometry_types": []}. So then do we want "minItems": 1?

I like the idea of avoiding two ways to say the same thing. No strong preference between optional with min length of one vs required. Though it seems slightly more explicit to say that [] means "unknown" (whereas the absence of the property might either mean "unknown" or "I forgot" - not sure if there is any significant difference between the two though).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the point I think you are making though. If it isn't required that the geometry types be known, why require the geometry types in the metadata?

Copy link
Collaborator

@m-mohr m-mohr Nov 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, if it is optional, minItems should be set to 1, indeed. I have a slight tendency for not requiring it as for me an empty array reads more as "no type". On the other hand, having it required makes people think about it (if implementations don't set a default such as write_geoparquet(geometry_types = [], ...) - but I guess a validator should anyway throw a warning if the field is present.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some discussion on this topic happened in the original issue #41. The idea was indeed that we wanted to avoid two ways to say the same thing (not specified if optional, vs geometry_type="Unknown".

And I think the rationale for making it required was indeed to have people think about it / recommend to specify a useful value (instead of just leaving out): #41 (comment)

I also find the empty array a bit less clear in indicating "unknown". Another option could also be to keep the field required and allow either the single string "Unknown" or either a list of minimum length 1.

Copy link
Collaborator Author

@tschaub tschaub Nov 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

len(geometry_types) == 0

vs

geometry_types[0] == "Unknown"

(I misspelled Unknown twice while trying to type that :)

Copy link
Collaborator

@m-mohr m-mohr Nov 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm not a big fan of "Unknown" either. Also, thinking a bit more about STAC it seems for everything we made required to make people think about it, it did not really work that way. People still just push things out without validation, so I could also see that people just omit geometry_types although required. So I tend towards not required and minLength: 1 for the array. As an explicit way to formulate "unknown" we could also consider null.

@tschaub
Copy link
Collaborator Author

tschaub commented Nov 10, 2022

Do we have enough of a consensus to merge this as is?

Current summary:

  • A geometry column may have more than one geometry type (before and after this change).
  • The metadata property is named "geometry_types" to reflect that (previously was "geometry_type").
  • The value for "geometry_types" is an array of unique strings (previously was an array or a string).
  • The "geometry_types" property is required (as was "geometry_type" previously).
  • Writers may signal that the geometry types are not known by using an empty array [] (previously "Unknown" or ["Unknown"] signaled the same).

@kylebarron
Copy link
Collaborator

  • Writers may signal that the geometry types are not known by using an empty array

Random thought: is there ever a case when something like ["Polygon", "Unknown"] would be useful? Like if the writer knew that some of the geometries were polygons but couldn't guarantee that all geometries were that type?

@tschaub
Copy link
Collaborator Author

tschaub commented Nov 10, 2022

is there ever a case when something like ["Polygon", "Unknown"] would be useful? Like if the writer knew that some of the geometries were polygons but couldn't guarantee that all geometries were that type?

Oh I hope not :)

My thinking on this is that geometry_types is a useful signal for systems that are reading GeoParquet and writing/storing in some representation where it is required to know ahead of time the geometry types that might be read. It is sensible to include this information in a Parquet file when that file is used as an interchange format because the metadata occurs in the footer and a GeoParquet writer may determine the list of geometry types while writing in a streaming manner. If it had to be in the header, the writer would have to buffer (or read twice), and it might seem onerous for us to require it. If the writer doesn't go to the trouble of determining the geometry types when writing (or somehow can't know it), we still allow them to write GeoParquet, and then the burden is on the reader to buffer before storing/writing if it needs to know all the geometry types.

It's hard for me to imagine where ["Polygon", "Unknown"] would provide any more value than []. In either case, if you are a reader that needs to know what kind of geometries you might read before you can store anything, you are going to have to buffer. Maybe it would be useful if you just wanted to scan a bunch of files and come up with a list of those that probably have polygons? I'm sure someone can come up with another use case, but I don't think the world will be a worse place if we don't allow it.

@rouault
Copy link
Contributor

rouault commented Nov 10, 2022

Random thought: is there ever a case when something like ["Polygon", "Unknown"] would be useful? Like if the writer knew that some of the geometries were polygons but couldn't guarantee that all geometries were that type?

I may be the one that initially pushed for unknown. Let me give the context. The idea is that you may have schemas for layers where you want to be able to put any type of geometries. In some files, geometries might be homogenous (let say only points), in other files mixed. Or you may have homogenous individual files, but one is points only and another ones is polygons only. But in the end, you want that the layer schema says "there might be any type of geometry" so that when you import it to something else (e.g importing a collection of GeoParquet files to a single GeoPackage or PostGIS layer) or compute the union of several files, you get this "any type of geometry" property.

For a writer perspective,

  • either the writer knows in advance all geometries it has to write (doing a pre-scan), and then it can write all geometry types
  • or it doesn't know that before hand from source metadata and doesn't want to do such a pre-scan, and then it will write [] / unknown

So in that context ["Polygon", "Unknown"] would not seem to be useful.

@tschaub
Copy link
Collaborator Author

tschaub commented Nov 15, 2022

@jorisvandenbossche - are you also ok with the latest?

@tschaub tschaub merged commit a92b593 into opengeospatial:main Nov 18, 2022
@tschaub tschaub deleted the geometry-types branch November 18, 2022 21:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Geometry types Declare Z in the geometry type?
6 participants