Proposal: suggested practices around required / optional metadata fields and spec extensions #89

Open
brendan-ward opened this issue Apr 28, 2022 · 3 comments

@brendan-ward
Contributor

My goal here is to try to reframe some of the challenges I've seen us struggle with around specific metadata fields and data standardization, and to outline some suggestions for how to approach these issues in order to reduce version-to-version churn and increase implementer buy-in. This is intended to be more at a "meta" level; discussion of specific fields should happen in specific GH issues.

In my opinion, the primary goal of the specification is to support interoperability through standard documentation of what is confidently known about a dataset, so that writers can document what they know in a standardized fashion and readers can trust and operate correctly on the dataset based on what is documented. This can be achieved while still supporting the highly variable underlying data representations (winding order, CRS, etc.) that can be contained within the existing encoding (WKB), and it supports both high-performance, low-transformation internal use - which was the primary goal of the original spec from which this emerged - as well as some degree of portability within the broader spatial ecosystem. Unfortunately, this still places some burden on readers to deal with the messier consequences of a lack of standardization in the underlying data.

There are secondary goals around standardizing data representation in order to support greater interoperability, but these typically involve more serious tradeoffs in performance, transformation, and potentially loss of information compared to the original untransformed data. Making this the primary goal through combinations of metadata fields and / or other specification requirements would undermine the original benefits of the format and risk forcing implementations to bifurcate their handling of data encoded into this container: implementations that optimize internal use and avoid standardization, and those that optimize portability.

My hope is that some of the suggestions below help get at both of those goals in a complementary manner.

Required metadata fields:

These should be rare and new fields should be treated with an abundance of caution. These define the information that must be known in order to safely read any GeoParquet file. These should not assert any nice-to-have standardization of underlying data.

There should be a more gradual process for introducing these, which should include sufficient time for current and future implementers to raise concerns about impacts to performance, ability to safely transform existing input data to meet these requirements, etc. Sometimes we have to raise these with our respective communities in order to better identify issues, and that takes time.

It may be appropriate for required fields to first start out as optional fields while getting buy-in from the ecosystem. Once there is good consensus that a new field absolutely must be present or we'll suffer major errors on deserialization, that is a reasonable time to promote it to a required field.

Optional metadata fields:

These are intended to document properties that are confidently known about a dataset, in order to support readers of that dataset so that they can better trust the dataset as well as opt out of standard pre-processing (e.g., if you know the winding order on input, you don't have to check the underlying geometries and fix it).

Except in very rare cases, the default value of an optional metadata field should be that the specification makes no assertion about the underlying data (aka absent = null); readers are on their own. The spec can and should encourage writers to provide this information.

What we've seen is that there is a lot of divergence between defaults that seem sensible in theory, and those that are reasonable in practice, and it leads to awkward and avoidable issues within implementations.

When an optional field specifies a non-null default, this is a trap. It is logically equivalent to a requirement that says: either you state that field=X, or, if unstated, the data MUST be Y. That makes it effectively a required element, because implementers now have to take something that may genuinely be unknown and not safely auto-detectable and either coerce it into X or Y (aka unsafe / untrustworthy documentation) or refuse to write the data (which is bad for internal use). Thus the default value of an optional field should not be used to make recommendations about the underlying data representation.
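
As a minimal illustration (the field name and values here are only examples, not normative spec keys), compare how a reader has to treat an optional field under "absent = null" versus a non-null default:

```python
# Hypothetical sketch; field name and values are illustrative only.

def orientation_absent_is_null(column_meta: dict):
    # "absent = null": missing simply means unknown, and the reader can
    # decide to check/fix winding order itself or skip that work.
    return column_meta.get("orientation")  # None == "we don't know"

def orientation_with_nonnull_default(column_meta: dict):
    # Non-null default: omitting the field silently asserts a value, so a
    # writer that genuinely doesn't know the orientation must either claim
    # it anyway (untrustworthy metadata) or refuse to write the file.
    return column_meta.get("orientation", "counterclockwise")
```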

Instead, optional fields should encourage documenting even the common use cases. E.g., if you know that the encoding is UTF-8 (arbitrary example), then there is no harm in stating it; leaving it unset means you didn't know it confidently enough to set it, and setting it when you are not confident is risky.

Provided we default to absent = null, it seems reasonable for optional fields to make recommendations to writers about how to better standardize their data and then document it. E.g., we encourage you to use counterclockwise winding order and document it with orientation="counterclockwise". The emphasis is on documenting according to the spec the data standardization that you've opted in to.
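
A small writer-side sketch of that opt-in documentation, assuming a hypothetical helper (not part of any existing library) that builds the geo column metadata:

```python
# Hypothetical helper: only document what is confidently known; everything
# omitted stays "absent = null" and makes no claim about the data.
def build_geo_column_metadata(orientation=None, crs_wkt=None):
    meta = {"encoding": "WKB"}
    if orientation is not None:
        # The writer verified (or produced) this winding order.
        meta["orientation"] = orientation
    if crs_wkt is not None:
        meta["crs"] = crs_wkt
    return meta

# A writer that reoriented its rings can safely document that fact:
meta = build_geo_column_metadata(orientation="counterclockwise")
```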

Specification extensions:

There appears to be a real desire to simplify some of the otherwise messy issues of geo data through data standardization. This should be opt-in, because it has real implications for performance, data loss, etc.

Let's define a specification "extension" as a mechanism that leverages all of the existing higher-level required / optional metadata fields AND prescribes specific data representation. It must be set up so that any reader can safely use the data using only the higher-level fields. However, it also signals to readers that if a dataset is of an extension type, some of those fields can be safely ignored and thus avoid some of the complexities of parsing things like CRS WKT. A writer of the extension type must still set those higher-level fields.

I think this gets at some of the ideas originally proposed in some of the default values for optional metadata fields as well as general recommendations within the spec.

I don't use cloud-optimized GeoTIFF yet, but my sense is that what I'm calling an extension type is similar to a COG vs regular GeoTIFF.

For example, let's define an extension type A (because names are hard and distracting) with the following requirements for data representation:

  • data must be in counterclockwise winding order
  • data must be in OGC:CRS84
  • data must have a single geometry type
  • data records must be sorted using the Hilbert curve of the centroid points of their bounding boxes (made this one up just for this example).

In this example, the writer would still set the following (a metadata sketch follows the list):

  • orientation="counterclockwise"
  • crs=<WKT of OGC:CRS84>
  • geometry_type=<type>
  • whatever optional field is defined re: sort order, if ever
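
A rough sketch of what that column metadata could look like, written as a Python dict; the extension and sort-order keys are invented here purely to make the example concrete:

```python
# Hypothetical column metadata for a dataset written under extension "A".
# The higher-level fields are still present, so any reader can use the data
# without knowing anything about the extension.
column_meta = {
    "encoding": "WKB",
    "orientation": "counterclockwise",
    "crs": "<WKT of OGC:CRS84>",
    "geometry_type": "Polygon",
    "sort_order": "hilbert-bbox-centroid",  # made-up field, mirroring the example
    "extension": "A",  # made-up field signalling the bundled guarantees
}
```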

The extension is a bit different from setting those fields on a case-by-case basis, because it can include data standardization not currently expressed via metadata fields, as well as bundle together related metadata fields. Otherwise, checking them individually within a reader gets to be a bit more complex.

A reader built specifically to consume pre-standardized data, and that wants to avoid the complexities of mixed CRSs, mixed geometry types, etc., can simply check whether a dataset has extension A set. If so, it can safely opt out of any extra work to standardize the data. If not, it can reject the data outright, or do more involved processing of the higher-level fields.
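
A reader-side sketch of that logic (the helper functions below are placeholders, not real APIs):

```python
# Hypothetical reader built for pre-standardized data only.
def read_standardized(path):
    meta = read_geo_column_metadata(path)  # placeholder helper
    if meta.get("extension") == "A":
        # Guaranteed: OGC:CRS84, counterclockwise rings, single geometry
        # type, known sort order -- skip per-field checks and reprojection.
        return load_geometries(path)  # placeholder helper
    # Otherwise either reject outright, or fall back to inspecting the
    # higher-level fields (crs, orientation, geometry_type) and standardize.
    raise ValueError("dataset does not declare extension A")
```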

A writer can allow the user to opt in to setting this extension. For example (a rough sketch follows the list):

  • dataset.to_parquet(filename) => does no extra data standardization or transformation, sets higher-level fields according to the spec
  • dataset.to_parquet(filename, extension="A") => reprojects the data to OGC:CRS84, reorients winding order as needed, sorts the records. Because the user intentionally opted in to this behavior, they are willing to accept the performance impacts and potential loss of data. Because this is optional, the writer can also validate and reject attempts to write data that cannot be coerced to meet the extension; then it is up to the user to standardize / subset / etc. their data as needed before attempting to write.
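
A writer-side sketch of that opt-in path; all function names below are placeholders for whatever an implementation actually uses, and extension "A" is the made-up example above:

```python
# Hypothetical writer: extension "A" is only applied when the user asks for it.
def to_parquet(dataset, filename, extension=None):
    if extension == "A":
        if count_geometry_types(dataset) > 1:  # placeholder helper
            raise ValueError("extension A requires a single geometry type")
        dataset = reproject(dataset, "OGC:CRS84")  # may lose precision/information
        dataset = force_counterclockwise(dataset)  # reorient rings as needed
        dataset = hilbert_sort_by_bbox_centroid(dataset)
    write_with_geo_metadata(dataset, filename, extension=extension)
```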

Thus the core idea is that a specification extension allows us to keep a lot of flexibility within the default specification, while still having a path forward that streamlines reading data that can be pre-processed to conform to certain characteristics.

@jorisvandenbossche
Collaborator

Thanks @brendan-ward for writing this up. I think that's helpful to frame the discussion, and to think about potential generic solutions.

One question about the extension idea: is your idea that in case of the example extension A, there is then an extension field in the column metadata like extension="A"? (which then implies given values in other fields, so you can know that without having to check those fields exactly)

For reference, the STAC specifications also have a concept of extensions, although this is a bit different: https://github.com/radiantearth/stac-spec/blob/master/extensions/README.md (they define additional fields, and there is a stac_extensions field that lists the used extensions). See e.g. how it is used in the items spec: https://github.com/radiantearth/stac-spec/blob/master/item-spec/item-spec.md#stac_extensions.
(And they actually have a projection extension: https://github.com/stac-extensions/projection/)

Maybe the term "extension" is not fully fitting for what you describe, as it is not really extending the metadata, but rather defining a subset of given values for the metadata (giving a guarantee about the value of some existing metadata fields).

@brendan-ward
Contributor Author

I was maybe a bit intentionally vague, and also hadn't fully thought it through, so I wouldn't interpret my use of the term "extension" too narrowly. I think it could go in either direction (but probably not both).

As I used it, "extension" meant some identifier that extends (builds on top of) the existing metadata fields, but also defines a cohesive subset of values within those fields and also a particular approach to data standardization. It could also be something like data_standardization_level: ["A", ...]. I was thinking it might be possible to have more than one of these per dataset, but that any present on a dataset must be complementary and have no ambiguity. Though if those are too mix-and-match, we should instead use only fields and not some new construct. Issues of composition and inheritance would have to get worked out in this approach.
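
For concreteness, that alternative shape might look something like this (key names and values are purely illustrative):

```python
# Purely illustrative: the identifier-list variant mentioned above. Any
# levels listed together would have to be complementary and unambiguous.
column_meta = {
    "encoding": "WKB",
    "crs": "<WKT of OGC:CRS84>",
    "orientation": "counterclockwise",
    "data_standardization_level": ["A"],
}
```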

"extension" as used in STAC is probably a more precise use of the term, which could also work well here (well-defined fields for subset of data within the more general model), but wasn't quite the mechanism I was suggesting; I was thinking something that behaved more as a commonly-used shorthand identifier for a group of properties.

@cholmes
Member

cholmes commented May 9, 2022

Thanks for writing all this up, and raising it to a meta level of discussion.

I fully agree with your two goals - standardizing the description of data, and standardizing the representation of data for greater interoperability. But from my perspective the primary goal is standardizing the representation of data, and the secondary goal is standardizing the description. I'll write a bit of the 'why' behind that, but my hope is that this format will be much stronger because it is made from diverse perspectives, and I believe the two goals are not incompatible (though that is certainly where the edges of discussion lie). I also agree that some sort of extension/profile mechanism will be the way forward.

So I do want to articulate the perspective of why I feel it's important we provide guidance and reasonable defaults (while also providing all the options for experts to override the defaults). The main reason for me is that geospatial data is still a 'niche' - though lots of data has a potential geospatial component to it (an address as a string, etc.), the vast majority of data isn't represented spatially. And I think a big part of that is that we in the spatial community don't make it welcoming; we don't meet people halfway. We all work with tools that make things like CRS parsing and transformations really easy, so we don't think about those who are implementing some geospatial support for the first time and don't know that they should be including a 'proj' dependency. And I don't think we should force them to if they are just working with the most common data.

The user/developer I'm aiming to represent is the one who doesn't know anything about geospatial but has a bunch of data that they got from someone, with a longitude and a latitude as columns. It should be easy for them to find the geoparquet spec, to read it and understand what they need to do, without having to get their head around what a 'CRS' is and how to parse it. But once they manage to transform their data into a geoparquet with points, they suddenly tap into a much faster ecosystem of tools - they can open it in QGIS, stick it in OpenLayers, drag it into Unfolded Studio, make use of geopandas, etc.

To achieve this I believe the spec does need to provide good defaults and recommendations. But I do absolutely agree that the same spec should be useful for those who are spatial data experts, and we shouldn't force them to use the defaults if they are the experts who know what to do. But I want 'the easy path' to be built more for those who have not already invested a lot in figuring out spatial - I want us to expand our user base by making spatial very easy and 'just work' in the most common cases.

The other user I think a lot about is the provider of geospatial data. They are putting data up for distribution, to be as useful as possible to as many people as possible. I'm thinking mostly of governments who open their data, but it could also be commercial companies looking to publish their data so it can be used by a wide variety of tools. For that user, I'd like the easiest route to making the data available to naturally put it into the best practices for distribution. And I lean towards having 'the expert', who knows they will get more performance by keeping data in a particular CRS, be the one to override the defaults.

I do fully believe the answer is to not have one monolithic spec, but to divide things up for different audiences. But to get this right I think we need to be really thoughtful about our audiences and how we divide things up and present the information architecture. I think a term that might be a bit better for what you're proposing is 'profile'. An extension does seem to imply that you are taking the base structure and adding to it, while a 'profile' can more easily be a narrowing of things, which is what you're proposing - a fairly 'wide-open' base, with a narrowing of what's allowed for specific use cases.

I think a true 'extension' vision would also be possible, where the core geoparquet spec would probably only have geometry_type and bbox in the column metadata (the file metadata would likely stay the same, unless you wanted to limit it to a single geometry). It would have a default specified crs, orientation and edges. Then you could have an extension for 'spherical data', and another for 'coordinate reference systems'. You could debate what goes in 'core' and what goes in 'extensions' - perhaps edges is an option for core. But a geoparquet file would say 'I support the CRS extension', and then readers would know they would need that functionality implemented.
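
A rough illustration of that 'extension' vision, loosely following the STAC stac_extensions pattern linked above (all key names here are invented):

```python
# Invented keys, for illustration only: a file that opts into a hypothetical
# CRS extension, so readers know they need CRS handling before using it.
column_meta = {
    "geometry_type": "Point",
    "bbox": [-180.0, -90.0, 180.0, 90.0],
    "extensions": ["crs"],              # absent => core defaults apply
    "crs": "<WKT of a projected CRS>",  # field defined by that extension
}
```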

I do lean towards what I'm calling the 'extension' vision (base with defaults, extensions provide more capabilities) vs the 'profile' vision (base is all the options, most of them optional, with profiles providing limits). But I'm not sure we need to 'solve' which way we go immediately. And I fully agree that our information architecture for new users will likely benefit either way. For example, we could have 'Simple GeoParquet' that is the first thing we talk about for new users, and we promote that.

But I do think it'd be better for new users to read about 'geoparquet' and find a really focused format that guides them towards best practices, and then they can read about additional 'extensions' that handle more complex use cases. And we should try to nudge the existing ecosystem of data towards best practices - for every expert user who really needs their data in an obscure CRS, there are probably 5 who end up with custom CRS's and are confused by them. If everyone at the core of the project really does just want to serve existing experts then I can just focus my energy on the 'simple profile', but I do think we can make a much bigger impact on the world if we strive to make the default core abstract away the most annoying / confusing parts of working with geospatial information, while also doing the hard work to have great extension points and tooling that work well when data really does need more complex handling.

Note I don't think we need to solve any of this immediately - the core geoparquet spec is slim enough and general enough that it doesn't scream for extensions/profiles, and I also think we can make things with CRS's a bit better (they are the really hard part). But we should try to sort this out before we go to 1.0.
