The name Collections, intended to mean Geospatial Data Layers, is causing confusion #111
Look at the following paradox:
OGC API - Features says:
The first one says: a "dataset" is a "collection". OK... I believe that there is a strong argument in the definitions to conclude that "dataset" = "collection". In addition, OGC API - Features does not allow for retrieving the single internal "abstract" dataset. It has neither metadata nor a name (nor an id). We do any harm if we simply remove the "abstract dataset" and we say that an OGC API allows for retrieving multiple collections that are actually datasets (that may or may not be part of a bigger dataset).
It is a paradox, which I think gets resolved if we accept that a dataset can be recursive (and according to these definitions, it seems that OGC API - Features already defines two levels of 'datasets'), and a collection of features can be a leaf in that hierarchy. Those definitions certainly seem to support that view! Did you mean "We don't do any harm" there? But we would still like an "abstract dataset" version of the individual collections (this does not change anything in practice; it is purely conceptual).
Looking again at that definition, since you can download each feature collection individually, or possibly download the overall dataset as one (e.g. multi-layer tiles), both the overall dataset and the individual collections seem to fit the 'dataset' definition.
That is not possible in OGC API - Features. There, you cannot create a collection that includes collections, because it says:
@joanma747 That particular wording could be simply changed to: "Each feature in a dataset is part of exactly one feature collection." And that should not conflict with a collection of collections. However, I find that requirement oddly restrictive.
There has always been lots of debate about datasets, dataset identity, the identity of subsets, whether distributions must be of the entire dataset, etc. Coming up with a canonical model for this is probably too hard; however, coming up with specific guidance for some common cases is necessary, particularly if they involve nuanced definitions.

IMHO the best solution is to provide a canonical means for describing relationships between instances of a very abstract concept of datasets, and to push the problem of making a canonical description of the implemented model of relationships and dataset types to the metadata layer. Here is where OGC API needs to do better than SOAP and W*S: not have ad-hoc and unconnected ways of attaching semantics in different APIs (e.g. strings for "wms:Layer", "wfs:FeatureType"), and move to supporting hooks for metadata about both datasets and data types in a consistent way for all APIs.

This is actually trivially easy with JSON-LD (which is RDF) and a canonical hook (i.e. a property with an identifying URI which can be safely interpreted because of a JSON-LD context) to a data model (also with a unique URI) for each of these aspects. This is all that is needed to move the problem, in a flexible way, to implementation profiles of OGC API that could be defined for the form of metadata behind these hooks - for example, use of a DCAT (or other) profile that uses appropriate data models to support describing what is important to know - for example, that features are part of a dataset. Get this right in the core, then work on the data models.

(The antipattern we see is a mapping system that reads a capabilities document for a WFS attached to a geodatabase and delivers 270 eight-letter cryptic short names for the feature types it finds. Not naming names here, but this is exactly what at least one national mapping app did with a hydrology dataset, rendering it unusable...)
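A minimal sketch of what such a JSON-LD "canonical hook" might look like. All URIs, property names, and values here are invented for illustration and are not part of any OGC specification; the point is only that the `@context` makes the dataset-membership property resolvable to an unambiguous URI.

```python
# Hypothetical sketch: a collection description carrying a JSON-LD hook
# that links it to a parent dataset via an identifying URI.
# Every URI and property name below is illustrative only.
import json

collection = {
    "@context": {
        "dct": "http://purl.org/dc/terms/",
        # The hook: "isPartOf" expands to a globally unique property URI,
        # so any JSON-LD-aware client can interpret it safely.
        "isPartOf": {"@id": "dct:isPartOf", "@type": "@id"},
    },
    "@id": "https://example.org/api/collections/rivers",
    "title": "Rivers",
    # Membership in the (hypothetical) parent dataset:
    "isPartOf": "https://example.org/datasets/hydrology",
}

print(json.dumps(collection, indent=2))
```

A client that does not understand JSON-LD still sees plain JSON; one that does can expand `isPartOf` to `http://purl.org/dc/terms/isPartOf` and follow the dataset relationship, which is the flexibility the comment above argues for.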
Wise words from @rob-metalinkage. I'm in total agreement. See my #47 comment and #106 comment.
The 'collections' issue is also being discussed over in this EDR thread - opengeospatial/ogcapi-environmental-data-retrieval#24 |
Indeed this is a big problem. There is no logical reason to limit membership to only one collection. But the path structure in the identifier forces you to tie an item to at least a 'primary' collection, of which it is an implicit member. This is a trap, arising from reflecting relationships (or any other metadata) into identifiers, which are after all intended to be used primarily as anchors or keys. This trap has a tendency to arise in all non-opaque identifier schemes, over time. See TimBL's 1998 description of the problem of URI persistence. The key message is that identifiers that include relationships (paths) are bound to break, so don't do it! Deep URI paths are just a bad idea.
The feature URI in any API will typically not be a canonical or persistent URI for the feature, and most APIs (or API versions) will be deprecated at some point in time. Such URIs have to be organised separately (in ldproxy we support that canonical feature URIs can be configured; a link with rel=canonical to that URI is then included in the feature representations). There is an example of this in the Features examples.

The "feature in one collection" statement is a result of the resource/path structure and is not a requirement on its own - which is why it is informative text. There was a discussion about this: opengeospatial/ogcapi-features#66
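A sketch of the pattern described above: a feature representation whose `self` link is the (non-persistent) API path, plus a `rel=canonical` link (RFC 6596) to a separately managed persistent URI. The URIs, the feature id, and the properties are all invented for illustration; this is not ldproxy's actual output.

```python
# Hypothetical feature representation separating retrieval from identity.
# "self" is tied to this API's path structure and may change or be
# deprecated; "canonical" is the stable identification URI (RFC 6596).
feature = {
    "type": "Feature",
    "id": "ab12",
    "geometry": {"type": "Point", "coordinates": [8.64, 49.87]},
    "properties": {"name": "Example gauge"},
    "links": [
        {"href": "https://example.org/api/collections/gauges/items/ab12",
         "rel": "self", "type": "application/geo+json"},
        {"href": "https://id.example.org/gauge/ab12",
         "rel": "canonical"},
    ],
}

# A client can always recover the persistent identifier from the links:
canonical = next(l["href"] for l in feature["links"] if l["rel"] == "canonical")
print(canonical)
```

Because identity lives in the `canonical` link rather than in the path, the same feature could also be served under a different collection path without becoming a "different" resource.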
The motion from the OAB was a result of the discussion in opengeospatial/ogcapi-features#312. Note that the proposal for a new definition of dataset was "collection of data that is regarded as a unit"; see https://github.com/heidivanparys/discussion_paper_dataset/releases for more information about this.

So a data provider decides that the features distributed via the Web API together are one unit, and thus one dataset. The data provider decides that those features are organised in one or more feature collections. The data provider does not regard those feature collections as datasets. With that reasoning, dataset ≠ (feature) collection.
@heidivanparys The drawback of this is that it imposes a fixed two-level hierarchy for features, isolated from non-feature datasets. As discussed in #99, if the '/collections/{collectionId}/' part stood in its own conformance class, this would be possible.
Yeah... that's why we should do identification with identification URIs and retrieval of features with information-retrieval URIs. Wait, @cportele said that already. Also - nice, @heidivanparys
Totally agree. Case in point: we want to have collections of stream gages based on the projects that use them. We will have canonical gage IDs, but also URLs that put them into collections based on projects. The dataset is all the gages. The collections are just meaningful ways to sort and filter them.
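The gage example can be sketched concretely: one canonical gage identifier, reachable through several project-based collection paths, with every path-based representation pointing back at the same canonical URI. The gage number, project names, and URLs below are all hypothetical.

```python
# Illustrative sketch (invented URLs): one stream gage, one canonical ID,
# filtered into two different project-based collections.
CANONICAL = "https://id.example.org/gage/06129500"

# The same gage appears under two collection paths:
paths = [
    "https://example.org/api/collections/project-a-gages/items/06129500",
    "https://example.org/api/collections/project-b-gages/items/06129500",
]

# Each path-based representation links back to the one canonical URI,
# so clients can tell these are the same resource.
representations = [{"href": p, "canonical": CANONICAL} for p in paths]

# Identity is shared; only the organisational path differs.
assert len({r["canonical"] for r in representations}) == 1
```

Here the collections carry no identity of their own; they are, as the comment says, just ways to sort and filter the one dataset of gages.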
OK - that is a great distinction, @dblodgett-usgs. We need to make sure that there is enough information in the returned payload to make this clear - e.g. https://tools.ietf.org/html/rfc6596
It makes perfect sense that paths are queries - they reflect organisation, not identity. The path reflects a data model for the data organisation, and the same data can be expressed in multiple data models (give or take the expressivity of those models). Any query requires the query client to understand that data model, or to be given a pre-formed query capable of resolving a specific desired response (e.g. a path). There is a canonical data model, however, which reflects the way the identifying agency distinguishes whether an individual is in scope and distinct from other individuals. This is the "register-driven identity" model. This means two things:
What is the least complex way of doing this? Who is best placed to cope with complexity in practice?

In the past we have underspecified data, which puts all the complexity onto the client to work out how to run a query (e.g. a WFS gives us a feature type schema, but no way of finding out what the content is, so you can't generate a query with valid content, or even know which elements of the schema might be populated). This works fairly well for WMS, if humans can interpret layer names.

So IMHO we should actually model the interactions between clients and services, and pick the solution with the lowest total complexity - which is a product of the complexity of the step × the number of times it needs to be done × the number of people who need to learn how to do it. This is mitigated by the motivation to do it: valuable data means people will learn how it is organised, and examples will set expectations and lower costs to implement. Attempting to build SDIs without adequate data description capabilities has been a setback.

Not creating a complex but rigid and ultimately inadequate partial data model for server metadata is the key - make sure the extension and profiling mechanisms are in place, and let communities gradually invest in better data descriptions as this balance plays out.
@dr-shorthair Regarding the query vs. identifier URI concept.
Can close based on #140. |
SWG agrees that this issue was addressed by PR 149. |
If we could find a mechanism by which we can maintain compatibility with OGC API - Features, it would make a lot of sense to rename the concept of an abstract geospatial data layer from "collection" to something else.
Suggestions included:
/geodata
/data
/datasets
Multiple members indicated a preference for datasets.
However, according to Features, an entire OGC API distributes a single dataset, following a micro-service approach.
I do agree that it is a stretch to call a single Features Collection (e.g. Agricultural Surfaces) within a multi-collection dataset (e.g. Daraa OSM TDS) a "data set", because it is really only a sub-set of that Daraa data set. But you can retrieve that single feature collection by itself... And if datasets can be recursive, then I think it does make some kind of sense.
It is still a "set of data" :) and a sub-set, part of a set, is still a set...
I don't believe that an overwhelming majority of OGC APIs will be serving a single dataset (many others have already indicated their interest in serving multiple datasets from a single service). That certainly is not the only way people used the classic services, and I don't see any compelling reason why this would change with OGC API, if it is made possible.
Rather, I hope that the OGC API makes it easier to serve a large number of datasets in a more organized manner and easier to access, search and filter.
If we can define a conformance class for hierarchical datasets, this could provide a way to resolve this deadlock. (I think this could be in line with my original proposal in #17)