The name Collections, intended to mean Geospatial Data Layers, is causing confusion #111

Closed
jerstlouis opened this issue Mar 16, 2020 · 18 comments
Labels
Collections Applicable to Collections (consider to use Part 2 instead) Resources of Collections type Issues related to the /collections path Resources types Issues related to resource types and taxonomy

Comments

@jerstlouis
Member

jerstlouis commented Mar 16, 2020

If we could find a mechanism by which we can maintain compatibility with OGC API - Features, it would make a lot of sense to rename the concept of an abstract geospatial data layer from "collection" to something else.

Suggestions included:
/geodata
/data
/datasets

Multiple members indicated a preference for datasets.

However, according to Features, an entire OGC API distributes a single dataset, following a micro-service approach.

I do agree that it is a stretch to call a single Features Collection (e.g. Agricultural Surfaces) within a multi-collection dataset (e.g. Daraa OSM TDS) a "data set", because it is really only a sub-set of that Daraa data set. But you can retrieve that single feature collection by itself... And if datasets can be recursive, then I think it does make some kind of sense.
It is still a "set of data" :) and a sub-set, part of a set, is still a set...

I don't believe that an overwhelming majority of OGC APIs will be serving a single dataset (many others have already indicated their interest in serving multiple datasets from a single service). That certainly is not the only way people used the classic services, and I don't see any compelling reason why this would change with the OGC API, if it is made possible.
Rather, I hope that the OGC API makes it easier to serve a large number of datasets in a more organized manner and easier to access, search and filter.

If we can define a conformance class for hierarchical datasets, this could provide a way to resolve this deadlock. (I think this could be in line with my original proposal in #17)
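
To illustrate what a recursive dataset could look like, here is a minimal sketch. The `Dataset` class, its fields, and the "roads" layer are hypothetical, made up for this example; nothing here is defined by OGC API - Common or Features.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Dataset:
    """Hypothetical recursive dataset: a leaf holds data, a branch holds sub-datasets."""
    dataset_id: str
    title: str
    children: List["Dataset"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        # A dataset without sub-datasets is a leaf (e.g. one feature collection).
        return not self.children

    def leaves(self) -> Iterator["Dataset"]:
        """Yield every leaf dataset below (or equal to) this one, depth first."""
        if self.is_leaf():
            yield self
        else:
            for child in self.children:
                yield from child.leaves()


# The overall dataset and each of its layers are all modelled as Dataset instances.
daraa = Dataset(
    "daraa",
    "Daraa OSM TDS",
    children=[
        Dataset("agricultural-surfaces", "Agricultural Surfaces"),
        Dataset("roads", "Roads"),  # hypothetical second layer, for illustration only
    ],
)
leaf_ids = [d.dataset_id for d in daraa.leaves()]
```

In this sketch both the overall dataset and each layer can be retrieved on their own, which is the property the "hierarchical datasets" conformance class would need to capture.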

@jerstlouis jerstlouis changed the title The name Collections, intended to mean Geospatial Data Layer, is causing confusion The name Collections, intended to mean Geospatial Data Layers, is causing confusion Mar 16, 2020
@joanma747
Contributor

Look at the following paradox:
The OAB moved a motion on 2020-03-02 to adopt the following definition:

Dataset: collection of data
Note: published or curated by a single agent, and available for access or download in one or more serializations or formats.

OGC API Features says:

Collection
a set of features from a dataset

The first one says: a "dataset" is a "collection". Ok....
The second one says a "collection" is a "set". If it is a set, it is probably a set of data. (I assume that features are 'data' here). Ok...

I believe that there is a strong argument in the definitions to conclude that "dataset"="collection".

In addition, OGC API Features does not allow for retrieving the single internal "abstract" dataset. It has neither metadata nor name (nor id). We do any harm if we simply remove the "abstract dataset" and we say that an OGC API allows for retrieving multiple collections that are actually datasets (that may or may not be part of a bigger dataset).

@jerstlouis
Member Author

jerstlouis commented Mar 16, 2020

@joanma747

It is a paradox, which I think gets resolved if we could accept that a dataset can be recursive (and according to these definitions, it seems that OGC API - Features already defines two levels of 'datasets'), and a collection of features can be a leaf in that hierarchy. Those definitions certainly seem to support that view!

Did you mean "We don't do any harm" there?
You meant the overall dataset of a 'single dataset' OGC API - Features use case, right?

But we would still like an "abstract dataset" version of the individual collections (this does not change anything in practice, it's purely conceptual).
This allows Tiles and Maps, for example, to reference such an abstract dataset (defined in Common, like Collections are right now) without explicitly referencing Features or Coverages, e.g. to select particular feature collection(s) and/or coverage(s) to render or tile.

@jerstlouis
Member Author

jerstlouis commented Mar 16, 2020

Looking again at that definition, since you can download each feature collection individually, or possibly download the overall dataset as one (e.g. multi-layer tiles), both the overall dataset and the individual collections seem to fit the 'dataset' definition.

@joanma747
Contributor

That is not possible in OGC API Features. There, you cannot create a collection that includes collections because it says:
"Each feature in a dataset is part of exactly one collection."
This needs to be relaxed if we want collections of collections.

@jerstlouis
Member Author

jerstlouis commented Mar 16, 2020

@joanma747 That particular wording could be simply changed to:

"Each feature in a dataset is part of exactly one feature collection."

And that should not conflict with a collection of collections.

However, I find that requirement oddly restrictive.
I am quite sure a vector dataset exists where the same feature (i.e. same real world object) is represented in multiple collections, e.g. potentially with different set of attributes, or based on different pre-filtering that selected the same feature for different collections.

@rob-metalinkage

There has always been lots of debate about datasets: dataset identity, identity of subsets, whether distributions must be of the entire dataset, etc. Coming up with a canonical model for this is probably too hard. However, coming up with specific guidance for some common cases is necessary, particularly if they involve nuanced definitions.

IMHO the best solution is to provide a canonical means of describing relationships between instances of a very abstract concept of dataset, and to push the problem of canonically describing the implemented model of relationships and dataset types to the metadata layer. This is where OGC API needs to do better than SOAP and W*S: instead of ad-hoc, unconnected ways of attaching semantics in different APIs (e.g. strings for "wms:Layer", "wfs:FeatureType"), it should support hooks for metadata about both datasets and data types in a consistent way across all APIs. This is actually trivially easy with JSON-LD (which is RDF) and a canonical hook (i.e. a property with an identifying URI that can be safely interpreted because of a JSON-LD context) to a data model (also with a unique URI) for each of these aspects. That is all that is needed to move the problem, in a flexible way, to implementation profiles of OGC API that define the form of metadata behind these hooks: for example, a DCAT (or other) profile that uses appropriate data models to describe what is important to know, such as the fact that features are part of a dataset. Get this right in the core, then work on the data models.

(The antipattern we see is a mapping system that reads a capabilities document for a WFS attached to a geodatabase and delivers 270 cryptic 8-letter short names for the feature types it finds. Not naming names here, but this is exactly what at least one national mapping app did with a hydrology dataset, rendering it unusable...)
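
To make the "canonical hook" idea a bit more concrete, here is a rough sketch of a collection description that carries a JSON-LD context. The `example.org` URIs and the choice of `dcterms:isPartOf` and `dcterms:conformsTo` as hooks are assumptions made for this illustration, not something any OGC API specification mandates. The `expand_term` helper is a simplified stand-in for what a real JSON-LD processor does.

```python
import json

# Hypothetical collection description. The @context maps short property names
# to full URIs so that any client can interpret them unambiguously.
collection_description = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dcterms": "http://purl.org/dc/terms/",
        "isPartOf": {"@id": "dcterms:isPartOf", "@type": "@id"},
        "conformsTo": {"@id": "dcterms:conformsTo", "@type": "@id"},
    },
    "@id": "https://example.org/collections/agricultural-surfaces",
    "@type": "dcat:Dataset",
    # Hook 1: relationship to a parent dataset (hypothetical URI).
    "isPartOf": "https://example.org/datasets/daraa",
    # Hook 2: the data model this resource follows (hypothetical URI).
    "conformsTo": "https://example.org/models/feature-collection",
}


def expand_term(doc: dict, term: str) -> str:
    """Resolve a short property name to its full URI via the document's @context.

    Simplified: handles only 'prefix:local' definitions, unlike a real JSON-LD processor.
    """
    definition = doc["@context"][term]
    curie = definition["@id"] if isinstance(definition, dict) else definition
    prefix, _, local = curie.partition(":")
    return doc["@context"][prefix] + local


serialized = json.dumps(collection_description, indent=2)
```

A client that knows nothing about this particular API can still resolve `isPartOf` to the Dublin Core term and follow it, which is the kind of consistency across APIs being argued for above.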

@dblodgett-usgs

Wise words from @rob-metalinkage. I'm in total agreement. See my #47 comment and #106 comment.

@dr-shorthair

The 'collections' issue is also being discussed over in this EDR thread - opengeospatial/ogcapi-environmental-data-retrieval#24

@dr-shorthair

dr-shorthair commented Mar 18, 2020

> @joanma747 That particular wording could be simply changed to:
>
> "Each feature in a dataset is part of exactly one feature collection."
>
> And that should not conflict with a collection of collections.
>
> However, I find that requirement oddly restrictive.
> I am quite sure a vector dataset exists where the same feature (i.e. same real world object) is represented in multiple collections, e.g. potentially with different set of attributes, or based on different pre-filtering that selected the same feature for different collections.

Indeed this is a big problem. There is no logical reason to limit membership to only one collection. But the path structure in the identifier forces you to tie an item to at least a 'primary' collection, of which it is an implicit member.

This is a trap, arising from reflecting relationships (or any other metadata) into identifiers, which are after all intended to be used primarily as anchors or keys. This trap has a tendency to arise in all non-opaque identifier schemes, over time. See TimBL's 1998 description of the problem of URI persistence. The key message is that identifiers that include relationships (paths) are bound to break, so don't do it! Deep URI paths are just a bad idea.

@cportele
Member

cportele commented Mar 18, 2020

The feature URI in any API will typically not be a canonical or persistent URI for the feature and most APIs (or API versions) will be deprecated at some point in time. Such URIs have to be organised separately (in ldproxy we support that canonical feature URIs can be configured and a link with rel=canonical to that URI is then included in the feature representations). In the Features examples there is an example with a canonical link.

The "feature in one collection" statement is a result of the resource/path structure and is not a requirement on its own - which is why it is informative text. There was a discussion about this: opengeospatial/ogcapi-features#66
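
As a rough sketch of the `rel=canonical` approach described above: the feature below is a made-up example with hypothetical `example.org` URIs, not output from ldproxy or any other server, and `find_canonical_uri` is an illustrative helper.

```python
from typing import Optional

# Hypothetical GeoJSON feature as an API might return it. The "self" link is the
# API path (bound to one collection); the "canonical" link is the persistent URI.
feature = {
    "type": "Feature",
    "id": "123",
    "geometry": None,
    "properties": {},
    "links": [
        {"rel": "self", "href": "https://api.example.org/collections/buildings/items/123"},
        {"rel": "canonical", "href": "https://data.example.org/id/building/123"},
    ],
}


def find_canonical_uri(feature: dict) -> Optional[str]:
    """Return the href of the first rel=canonical link, or None if there is none."""
    for link in feature.get("links", []):
        if link.get("rel") == "canonical":
            return link["href"]
    return None


persistent_id = find_canonical_uri(feature)
```

A client that wants a stable reference stores `persistent_id` rather than the `self` link, so the reference survives the API path being reorganised or deprecated.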

@heidivanparys
Contributor

> Look at the following paradox:
> The OAB moved a motion on 2020-03-02 to adopt the following definition:
>
> Dataset: collection of data
> Note: published or curated by a single agent, and available for access or download in one or more serializations or formats.
>
> OGC API Features says:
>
> Collection
> a set of features from a dataset
>
> The first one says: a "dataset" is a "collection". Ok....
> The second one says a "collection" is a "set". If it is a set, it is probably a set of data. (I assume that features are 'data' here). Ok...
>
> I believe that there is a strong argument in the definitions to conclude that "dataset"="collection".
>
> In addition, OGC API Features does not allow for retrieving the single internal "abstract" dataset. It has neither metadata nor name (nor id). We do any harm if we simply remove the "abstract dataset" and we say that an OGC API allows for retrieving multiple collections that are actually datasets (that may or may not be part of a bigger dataset).

The motion from the OAB was a result of the discussion in opengeospatial/ogcapi-features#312

Note that the proposal for a new definition of dataset was "collection of data that is regarded as a unit", see https://github.com/heidivanparys/discussion_paper_dataset/releases for more information about this.

So a data provider decides that the features distributed via the Web API together are one unit, and thus one dataset. The data provider decides that those features are organised in one or more feature collections. The data provider does not regard those feature collections as datasets.

With that reasoning, dataset ≠ (feature) collection.

@jerstlouis
Member Author

@heidivanparys The drawback of this is that it imposes a fixed two-level hierarchy for features, isolated from non-feature datasets.
What if a dataset is made up of 3 Shapefiles and a GeoTIFF, and the provider regards this as one dataset and wants to distribute it as such?
Or what if the provider wants to distribute them all as individual datasets in a single-level, flat hierarchy?

As discussed in #99, if the '/collections/{collectionId}/' part stood in its own conformance class, this would be possible.

@dblodgett-usgs

@dr-shorthair

> This trap has a tendency to arise in all non-opaque identifier schemes

Yeah... that's why we should do identification with identification URIs and retrieval of features with information retrieval URIs. Wait, @cportele said that already.

Also -- nice @heidivanparys

> With that reasoning, dataset ≠ (feature) collection.

Totally agree. Case in point: We want to have collections of stream gages based on projects that use them. We will have canonical gage IDs but also have urls that put them into collections based on projects. The dataset is all the gages. The collections are just meaningful ways to sort and filter them.
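
A minimal sketch of that stream gage case, with made-up gage IDs, project names, and base URLs (all hypothetical): each gage has exactly one canonical identifier, but can be reached through any number of project-based collection paths.

```python
from typing import List

# Hypothetical data: the dataset is all the gages, keyed by canonical ID.
GAGES = {
    "gage-001": {"name": "Upstream gage"},
    "gage-002": {"name": "Confluence gage"},
    "gage-003": {"name": "Outlet gage"},
}
# Collections are just groupings over that dataset; a gage may appear in several.
COLLECTIONS = {
    "project-a": ["gage-001", "gage-002"],
    "project-b": ["gage-002", "gage-003"],
}
CANONICAL_BASE = "https://data.example.org/id/gage"  # hypothetical identifier namespace
API_BASE = "https://api.example.org"                 # hypothetical API root


def canonical_uri(gage_id: str) -> str:
    """The one identification URI for a gage, independent of any collection."""
    return f"{CANONICAL_BASE}/{gage_id}"


def retrieval_urls(gage_id: str) -> List[str]:
    """All collection-based retrieval URLs under which the gage can be fetched."""
    return [
        f"{API_BASE}/collections/{collection_id}/items/{gage_id}"
        for collection_id, members in sorted(COLLECTIONS.items())
        if gage_id in members
    ]
```

Here `retrieval_urls("gage-002")` yields two paths, one per project, while `canonical_uri("gage-002")` stays the same whatever collections the gage is sorted into.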

@dr-shorthair

OK - that is a great distinction @dblodgett-usgs -

  • URIs with all the path garbage are queries, which are distinct from
  • canonical identifiers for the retrieved items.

Need to make sure that there is enough information in the returned payload to make this clear - e.g. https://tools.ietf.org/html/rfc6596

@rob-metalinkage

It makes perfect sense that paths are queries - they reflect organisation not identity.

The path reflects a data model for the data organisation, and the same data can be expressed in multiple data models (give or take the expressivity of those models). Any query requires the query client to understand that data model, or to be given a pre-formed query capable of resolving a specific desired response (e.g. a path).

There is a canonical data model, however, which reflects the way the identifying agency distinguishes whether an individual is in scope and distinct from other individuals. This is the "register-driven identity" model.

This means two things:

  1. there needs to be a way to convey the data model if clients are expected to construct queries
  2. there needs to be a way to identify the canonical identifier independently of an arbitrary query path, even though this canonical identifier may be a query path against the registration model.

What is the least complex way of doing this? Who is best placed to cope with complexity in practice?

In the past we have underspecified data, which puts all the complexity onto the client to work out how to run a query (e.g. a WFS gives us a feature type schema but no way of finding out what the content is, so you can't generate a query with valid content, or even know which elements of the schema might be populated).

This works fairly well for WMS if humans can interpret layer names.

So IMHO we should actually model the interactions between clients and services, and pick the solution with the lowest total complexity, which is the product of the complexity of a step * the number of times it needs to be done * the number of people who need to learn how to do it. This is mitigated by the motivation to do it: valuable data means people will learn how it is organised, and examples will set expectations and lower the cost of implementation. Attempting to build SDIs without adequate data description capabilities has been a setback.
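
One way to write down that cost estimate, as a sketch only: the symbols below are introduced here for illustration, and summing over the steps of an interaction is an assumption, since the comment only gives the product for a single step.

```latex
C_{\text{total}} = \sum_{s \,\in\, \text{steps}} c_s \cdot n_s \cdot p_s
```

Here $c_s$ is the complexity of step $s$, $n_s$ the number of times it needs to be done, and $p_s$ the number of people who need to learn how to do it. The mitigating effect of motivation is not modelled in this expression.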

The key is not to create a complex but rigid and ultimately inadequate partial data model for server metadata: make sure the extension and profiling mechanisms are in place, and let communities gradually invest in better data descriptions as this balance plays out.

@cmheazel
Contributor

@dr-shorthair Regarding the query vs. identifier URI concept.

  • How can we use this concept to improve API-Common and reduce the swirl over path names?
  • How does this relate to hypermedia navigation vs. path navigation?
  • Does an OGC Resource taxonomy play a role here as well?

@cmheazel cmheazel added the Collections Applicable to Collections (consider to use Part 2 instead) label May 11, 2020
@dblodgett-usgs

Can close based on #140.

@cmheazel
Contributor

SWG agrees that this issue was addressed by PR 149.
Moved: @jerstlouis
Second: @cmheazel
NOTUC

Part 2 Version 1 automation moved this from Backlog to Done Jun 29, 2020
8 participants