Hierarchical Collections #298

Open
jerstlouis opened this issue Oct 26, 2021 · 21 comments
Labels
2021-10 Sprint main issue Part 2 Issues to be resolved prior to TC vote

Comments

@jerstlouis
Member

In our OGC API server and client, we have implemented support for hierarchical collections to help organize a large number of collections and to facilitate discovery by drilling down to the collection of interest.

We would welcome TIEs with other clients or servers to validate this as a potential conformance class for an extension to Common / Geospatial Data aka Collections.

The requirements are two-fold:

  • A character is selected as indicating the hierarchy as part of collection IDs. We have opted for the colon (:) at the moment, but it could be made flexible and declared by the server. This allows for an intuitive way to drill down collections e.g. in a Web browser. Since all collections are still listed at /collections, the client can deduce the hierarchical relationship from the collection IDs alone without additional information. (An alternative could be to include additional metadata to describe those relationships, but if the IDs do not also follow such a pattern, it would lose the intuitiveness of drilling down collections in a web browser.)
  • The collections listing requirement module from /collections is re-used at /collections/{collectionId} to list children collections.

A permission is also granted for the HTML representation of /collections to list only the top-level collections.
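As a rough illustration of how a client might deduce the hierarchy from the IDs alone, here is a minimal Python sketch. The function name `build_id_tree` is hypothetical, and the colon default is only the separator mentioned above, not a standardized value.

```python
from typing import Dict, Iterable


def build_id_tree(collection_ids: Iterable[str], separator: str = ":") -> Dict[str, dict]:
    """Nest collection IDs by splitting each one on the hierarchy separator."""
    tree: Dict[str, dict] = {}
    for collection_id in collection_ids:
        node = tree
        for segment in collection_id.split(separator):
            # setdefault creates the level the first time a segment is seen.
            node = node.setdefault(segment, {})
    return tree


# IDs as they might appear in a flat /collections listing.
ids = [
    "NaturalEarth",
    "NaturalEarth:physical",
    "NaturalEarth:physical:bathymetry",
    "Daraa",
    "Daraa:AgricultureSrf",
]
tree = build_id_tree(ids)

# Only the keys of the outer dict are top-level collections, which is
# what an HTML representation of /collections could choose to list.
top_level = sorted(tree)
```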

A great use case for hierarchical collections is to offer access mechanisms (e.g. Features or TileSets) both for individual FeatureTypes, as well as for collections made up of multiple FeatureTypes (or multi-layer TileSets). e.g. we have multi-layer tilesets at https://maps.ecere.com/ogcapi/collections/Daraa/tiles and single-layer tilesets at https://maps.ecere.com/ogcapi/collections/Daraa:AgricultureSrf/tiles . This would also apply for Features (but multi-feature types collections are not yet supported on our server), especially with JSON-FG which allows declaring feature types.

Another example for maps:

https://maps.ecere.com/ogcapi/collections/NaturalEarth:physical:bathymetry/map
https://maps.ecere.com/ogcapi/collections/NaturalEarth:physical:bathymetry:ne_10m_bathymetry_J_1000/map

Original discussion of this topic is in #11 .

@ghobona
Contributor

ghobona commented Oct 26, 2021

@aaime Part of the discussion was about the potential impact on namespace prefixes of using a colon as a separator. Since GeoServer supports the use of namespace-qualified names, perhaps you could comment on the proposal?

@ghobona
Contributor

ghobona commented Oct 26, 2021

@arnevogt I wonder if T17-API-D165 could be easily configured by editing backend_configuration.json to demonstrate the Hierarchical Collections concept?

Cc: @lieberjosh

@tomkralidis

@jerstlouis I like the colon-delimited hierarchy for collections, and +1 for having a server declared delimiter. I wonder whether this would be a conformance class and then an added property to a given /collections response?

@jerstlouis
Member Author

jerstlouis commented Oct 26, 2021

@tomkralidis Yes, something like "collectionIDHierarchySeparator" : ":" would make sense. It would probably be useful to include that property in both the /collections and the /collections/{collectionId} responses.

Hierarchical Collections would be a conformance class, yes. Meaning two things: using a hierarchy separator, and adding a listing of children collections to parent /collections/{collectionId} responses.
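A minimal Python sketch of how a client could use a server-declared separator follows. It assumes the property is named "collectionIDHierarchySeparator" as floated above, and that falling back to ":" when it is absent is acceptable; both are assumptions, and the helper names are hypothetical.

```python
from typing import List, Optional


def parent_of(collection_id: str, separator: str) -> Optional[str]:
    """Return the ID of the parent, or None for a top-level collection."""
    head, found, _tail = collection_id.rpartition(separator)
    return head if found else None


def direct_children(response: dict, parent_id: str) -> List[str]:
    """List the IDs whose immediate parent is parent_id."""
    # Assumption: fall back to ":" if the server does not declare a separator.
    separator = response.get("collectionIDHierarchySeparator", ":")
    return [
        c["id"]
        for c in response["collections"]
        if parent_of(c["id"], separator) == parent_id
    ]


# Hypothetical /collections response, reduced to the relevant members.
response = {
    "collectionIDHierarchySeparator": ":",
    "collections": [
        {"id": "NaturalEarth"},
        {"id": "NaturalEarth:physical"},
        {"id": "NaturalEarth:physical:bathymetry"},
    ],
}
children = direct_children(response, "NaturalEarth")
```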

Any chance we could eventually see support in PygeoAPI? :)

@rggibb
Copy link

rggibb commented Oct 27, 2021

DGGS could of course also use this hierarchy notation for its hierarchy of ZoneClasses, ie the levels.

@aaime

aaime commented Oct 27, 2021

@ghobona in GeoServer we indeed use ":" for namespacing, its usage is opaque to clients, it's just part of the identifier.
Administrators can have an understanding of it, in the form of "workspace:localName".

Workspaces are non hierarchical, unordered containers, originally designed to allow setting a common namespace URI for all feature types in the workspace (for ease of WFS setup). In time workspaces have become a lightweight filtering mechanism too, and a way to get rid of the prefix.
Compare:
https://gs-main.geosolutionsgroup.com/geoserver/ogc/features/collections?f=text%2Fhtml
with:
https://gs-main.geosolutionsgroup.com/geoserver/tiger/ogc/features/collections?f=text%2Fhtml
The second only has collections in the "tiger" workspace, and prefixes have been stripped. For context, the landing page of the features service is at "https://gs-main.geosolutionsgroup.com/geoserver/ogc/features", the workspace prefix has to go between "geoserver" and "ogc".
This is known as a "workspace specific service"; from it, it's not possible to access collections belonging to other workspaces.

However, to support WMS hierarchical capability document, we also have another concept: a layer group. It's a WMS specific concept, mind, does not exist anywhere else in GeoServer.
A layer group is a hierarchical, ordered container, that can contain layers and other groups. If requested directly by the users, it will return all layers defined in it. A layer group can be part of a workspace (but can also be "global", not contained in any workspace).

Say that in GeoServer we have a layer group contained in a workspace (sf), that contains another layer group, and we are using global services (so, prefixes are still there). Of course we cannot use : as the separator, let's imagine we use ) as the separator, then we could be looking at a URL as follows:

https://gs-main.geosolutionsgroup.com/geoserver/ogc/features/collections/sf:spearfish)sf:subgroup)sf:arcsites/items

while if we access a workspace specific service, we'd use:

https://gs-main.geosolutionsgroup.com/geoserver/ogc/features/collections/spearfish)subgroup)arcsites/items

Seems it would work... however, it really kills me to see special characters being used to represent a hierarchy, when the URL structure itself is hierarchical.
I realize that conflicts are possible, because we have sub-resource popping under the collection one every other day (items, tiles, coverage, map are already reserved, to name a few, more are incoming).

An approach that I have not seen in use would be to just have a "collections" resource under the collection, representing the nested collections. The path would become:

https://gs-main.geosolutionsgroup.com/geoserver/ogc/features/collections/spearfish/collections/subgroup/collections/arcsites/items

Does not look as weird as the above path, but it's longer. Even just reserving "c" as path element, would make it use 3 chars instead of one, e.g:

https://gs-main.geosolutionsgroup.com/geoserver/ogc/features/collections/spearfish/c/subgroup/c/arcsites/items

Another consideration is indeed... length. Whatever proposal we are looking at, the structure ends up represented in the path, whose length is limited, and already has other hungry competitors for it (e.g., a filter CQL expression, a polygon geometry used for spatial filtering in some services).
WMS did not carry around this issue: the capabilities document had a hierarchical structure, but each layer could be invoked directly by its name, without calling every parent along with it. Another advantage is that it allowed for layers to be shared in multiple sub-trees. This approach requires unique names in the service, but leaves plenty of space for other parameters.
Something like this could be implemented by having a tree-like structure in the "collections" resource, to show what relationships are there, and leaving basic clients to just follow the links in the "links" array without understanding the eventual relationships.
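To make the length trade-off between the path layouts discussed in this comment concrete, here is a small Python sketch that builds each variant for the spearfish example. The function names are hypothetical, and the paths are relative to the service landing page.

```python
from typing import List


def separator_path(ids: List[str], separator: str) -> str:
    """Hierarchy encoded inside a single {collectionId} path element."""
    return "/collections/" + separator.join(ids) + "/items"


def nested_path(ids: List[str], marker: str) -> str:
    """Hierarchy encoded with a reserved path element between levels."""
    return "/collections/" + f"/{marker}/".join(ids) + "/items"


levels = ["spearfish", "subgroup", "arcsites"]

with_separator = separator_path(levels, ")")
with_collections = nested_path(levels, "collections")
with_c = nested_path(levels, "c")

# Each extra level costs 1 character with a separator, 3 with "/c/",
# and 13 with "/collections/".
lengths = [len(with_separator), len(with_c), len(with_collections)]
```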

@cportele
Member

"Dataset" is a key concept in the W3C Data on the Web Best Practices when sharing data on the web. The examples that I have seen seem to mainly share multiple datasets via a single API. Any extension for hierarchical collections that allows this should clarify which resources are considered a dataset by the data publisher and which are, for example, subsets. This could be through another member in the Collection resource that clarifies the type of collection.

Personally, I think it is clearer and cleaner to share multiple datasets via separate APIs (which can then also evolve and be versioned separately) and have a kind of super landing page on top of them.

That said, I also see cases where it can be intuitive to users to present the data of a single dataset in a collection hierarchy with a depth > 1.

I do not see any need for a special character requirement, even if we flatten all the collections in the API (i.e. only have /collections/{collectionId}) and the parent/child relationships are only expressed in the Collection resource. Concatenating node ids along the path separated by a reserved character, if used, would only be a convention of a tool, but I do not see a need to standardize this. And clients should not be required to parse collectionIds.

@jerstlouis
Member Author

jerstlouis commented Oct 27, 2021

@cportele Agreed, it would be nice to have something like "isDataSet" : "true" to indicate a dataset.

I do not see any need for a special character requirement, even if we flatten all the collections in the API (i.e. only have /collections/{collectionId}) and the parent/child relationships are only expressed in the Collection resource. Concatenating node ids along the path separated by a reserved character, if used, would only be a convention of a tool, but I do not see a need to standardize this. And clients should not be required to parse collectionIds.

Well, it could be a convention + explicit relationships like a "parent" property. But the separator approach had the benefit of being considerably lighter, e.g. "parent" : "NaturalEarth:physical:bathymetry" for every child of bathymetry, which would always repeat the same information already contained within the convention (and a use case for this is thousands of collections, so that is a considerable advantage). I also think it would be confusing for the user (in web browser especially) if not all servers use a delimiter in collection IDs, and the hierarchy isn't made obvious in the ID.

However I would prefer to standardize something rather than nothing, so something like a "parent" property + collections listing in parent collections as well would be a great step forward.

@ghobona
Contributor

ghobona commented Oct 28, 2021

If using a property to identify the relationship, the following options are relevant:

  • link relation type up, which is described as "Refers to a parent document in a hierarchy of documents."
  • skos:Collection to identify a collection and skos:member to identify members of the collection

@ghobona ghobona transferred this issue from opengeospatial/ogcapi-code-sprint-2021-10 Nov 2, 2021
@jerstlouis
Member Author

To re-iterate my latest proposal, revised to address @cportele 's and others' concerns of using a particular delimiter like : and having to figure out relationships implied from identifiers:

  • I still very much think there is value in a simple hierarchical relationship between collections, and there are several use cases for them, that would not be addressed by anything that does not rely solely on the /collections array of collection objects.
  • All that would be needed from a hierarchical collection conformance class is:
    • a parent property for collection objects where another {collectionId} can be specified, and
    • a ?parent={collectionId} query parameter on /collections to retrieve only immediate children of that parent.

@cportele
Member

cportele commented May 4, 2022

Before we invent new collection properties we should check, if we can leverage existing conventions, in particular link relation types.

As Gobe has pointed out, we could use up to reference the parent collection using a link. For example:

"links": [ ..., { "href": "../the_parent", "rel": "up", "title": "..." } ]

And we could use type to identify resources that are datasets (pointing to http://purl.org/dc/dcmitype/Dataset or https://schema.org/Dataset). For example:

"links": [ ..., { "href": "http://purl.org/dc/dcmitype/Dataset", "rel": "type", "title": "This collection is a dataset." } ]
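For a sense of what this asks of a client, here is a minimal Python sketch that checks a collection object for such a type link. The function name `is_dataset` and the sample collection are hypothetical; the two URIs are the ones given above.

```python
# The two dataset type URIs mentioned above.
DATASET_TYPE_URIS = {
    "http://purl.org/dc/dcmitype/Dataset",
    "https://schema.org/Dataset",
}


def is_dataset(collection: dict) -> bool:
    """True if any link with rel=type points at a known dataset type URI."""
    return any(
        link.get("rel") == "type" and link.get("href") in DATASET_TYPE_URIS
        for link in collection.get("links", [])
    )


# Hypothetical collection object carrying the type link.
dataset_collection = {
    "id": "the_parent",
    "links": [
        {
            "href": "http://purl.org/dc/dcmitype/Dataset",
            "rel": "type",
            "title": "This collection is a dataset.",
        }
    ],
}
subset_collection = {"id": "the_child", "links": []}
```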

Since the collections are hierarchical, I assume the following statements are all true, if C is a hierarchical collection with sub-collections C1 and C2:

  • Every item (e.g. feature) in C1 is also a member of C.
  • Every item (e.g. feature) in C2 is also a member of C.
  • Every item in C is also a member of C1 or C2 (could also be both, I guess).

Correct?
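Taken together, the three statements say that the items of C are exactly the union of the items of its sub-collections, with overlap between sub-collections allowed. A minimal Python sketch of that check, using hypothetical item identifiers:

```python
from typing import Iterable, Set


def is_consistent(parent_items: Set[str], child_item_sets: Iterable[Set[str]]) -> bool:
    """Check that the parent holds exactly the union of its children's items."""
    union: Set[str] = set().union(*child_item_sets)
    # union <= parent covers "every item in C1/C2 is in C";
    # parent <= union covers "every item in C is in C1 or C2".
    return union == parent_items


# Hypothetical item IDs; "f2" belongs to both sub-collections.
c1 = {"f1", "f2"}
c2 = {"f2", "f3"}
c = {"f1", "f2", "f3"}

ok = is_consistent(c, [c1, c2])
orphan_in_parent = is_consistent(c | {"f4"}, [c1, c2])
stray_in_child = is_consistent(c, [c1 | {"f9"}, c2])
```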

@jerstlouis
Member Author

jerstlouis commented May 4, 2022

@cportele Many thanks for engaging on this, I still hold this topic dear :)

Correct?

Conceptually, yes, I think this is correct.
A use case for this may be e.g., feature types, as we discussed in T17 / JSON-FG, with the top-level collection including multiple feature types, and sub-collections only including one feature type.

However, I think implementations should be allowed to support different access mechanisms (i.e., different OGC API specs) at different levels of the hierarchy, e.g., whether to provide /items or /tiles at the upper and/or lower levels.

This would allow collections that are only organizing the leaves, or only providing multi-layer vector tiles in the top-level collections, etc. That would simply be done by including or not including certain links in the collection object.

As Gobe has pointed out, we could use up to reference the parent collection using a link.

This approach might be fine for /collections/{collectionId}, but my main concern is for organizing in a hierarchical manner a list of collections at /collections (e.g., presenting it in a tree view control), without having to individually retrieve every collection.

Repeating the title of the parent in this case (which would already be in the parent in the same array, for the list of collections) seems overkill too.

When retrieving the list of collections, the client already has those multiple objects in memory (within the array) from the collections list resource, so I think whether links should be used or not to establish hierarchical relationships within those objects of the array is debatable.

Particularly from a client's perspective (and perhaps a less "webby" client perspective), it's much more complicated to look through links and look for a particular relation type, and parse a URL, than to simply include a property that directly uses the collection ID (rather than URL, which might be relative).

"links": [ ..., { "href": "../the_parent", "rel": "up", "title": "..." } ]

vs.

"parent": "the_parent"
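To illustrate the difference in client effort, here is a minimal Python sketch that recovers the parent ID from each form. The function names and the example.org base URL are hypothetical, and the link variant assumes the last path segment of the resolved URL is the parent's collection ID.

```python
import posixpath
from typing import Optional
from urllib.parse import unquote, urljoin, urlsplit


def parent_from_property(collection: dict) -> Optional[str]:
    """The property form: a direct lookup."""
    return collection.get("parent")


def parent_from_up_link(collection: dict, collection_url: str) -> Optional[str]:
    """The link form: find rel=up, resolve the href, take the last segment."""
    for link in collection.get("links", []):
        if link.get("rel") == "up":
            # A relative href only makes sense once resolved against the
            # URL the collection object was retrieved from.
            absolute = urljoin(collection_url, link["href"])
            path = urlsplit(absolute).path.rstrip("/")
            return unquote(posixpath.basename(path))
    return None


with_link = {"id": "the_child", "links": [{"href": "../the_parent", "rel": "up"}]}
with_property = {"id": "the_child", "parent": "the_parent"}

# Hypothetical URL of the child collection resource.
base = "https://example.org/ogcapi/collections/the_child/"
via_link = parent_from_up_link(with_link, base)
via_property = parent_from_property(with_property)
```

Note that the outcome of resolving a relative href depends on the exact base URL (for instance, whether it ends with a slash), which is part of the extra work the link form puts on the client.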

That being said, I would much prefer agreeing to a best web practice that enables hierarchical collections than not agreeing on how to define hierarchical collections.

And we could use type to identify resources that are datasets

That particular approach also seems a bit complicated to me from a client implementer perspective (instead of simply having an "isDataSet": true property), but again I prefer a best web practice I dislike to not reaching an agreement.

@cportele
Member

cportele commented May 5, 2022

@jerstlouis

Conceptually, yes, I think this is correct.

OK, so that would need to be made clear in the spec for this.

I do not have an issue with using different API building blocks for different collections in a hierarchy. But if an API supports, e.g., features or vector tiles for all (sub-)collections, then the collections would have to meet the constraints.

Particularly from a client's perspective (and perhaps a less "webby" client perspective), it's much more complicated to look through links and look for a particular relation type, and parse a URL, than to simply include a property that directly uses the collection ID (rather than URL, which might be relative).

Yes, I see that point. Maybe it would be good to collect implementation feedback and test it in a few code sprints. (If we end up with OGC-specific conventions we can still support an option in our implementation to represent the links in API deployments that prefer to leverage Web linking.)

@jerstlouis
Member Author

Thanks @cportele .

But if an API supports, e.g., features or vector tiles for all (sub-)collections, then the collections would have to meet the constraints.

If what you mean is that both the parent collection and its sub-collections e.g., all support Features, then yes I agree.

test it in a few code sprints.

We did some initial testing in past code sprints with the pygeoapi server implementation, but perhaps we could now test this updated approach?

@tomkralidis will you be participating in the Tiles / Coverages / DGGS / EDR "Space Partitioning" Code Sprint next week?

@tomkralidis

@jerstlouis yes I will be participating with a lens on OACov and EDR.

@jerstlouis
Member Author

Great to hear @tomkralidis . If you're interested we could discuss Hierarchical Collections and do some TIEs with our client in the context of OGC API - Coverages to evaluate the approach(es) described above and provide feedback.

@pvretano
Collaborator

pvretano commented May 5, 2022

Are we still proposing to use ":" or some other non-slash character as the collection separator?

To me, this is not a hierarchy:
https://maps.ecere.com/ogcapi/collections/NaturalEarth:physical:bathymetry/map
this is a hierarchy
https://maps.ecere.com/ogcapi/collections/NaturalEarth/physical/bathymetry/map.

The trick is to figure out what the path elements between collections and map mean and what you get if you do a GET on an intermediate path like https://maps.ecere.com/ogcapi/collections/NaturalEarth/physical.

My thinking goes something like this:

  1. GET /collections always gets you the flat list of collections as it always has so clients that don't know about hierarchies can continue to work as always.
  2. GET /collections?hierarchy=true (or something like that) gets you the list of collections but organized in a hierarchy. This would mean extending the current collections schema but I think this can be done in a backwards compatible way.
  3. Getting a sub-collection like GET /collections/NaturalEarth gets you a JSON (or other format) document describing what the NaturalEarth sub-collection is all about (including what sub-sub-collections are part of the NaturalEarth sub-collection) and also includes navigation links to the children collections of the current sub-collection. I assume this JSON (or other format) document would be the same one you get with GET /collections?hierarchy=true anchored at the current sub-collection rather than /collections.
  4. Among the things that the sub-collection document can include are links to well known OGC endpoints like maps and items with the appropriate rels (e.g. items, etc.). If such links exist it means that you can get a map or features or coverage or whatever of the sub-collection and all its children collections. So GET /collections/NaturalEarth/map gets you a map with all the children collections rendered. This could be inefficient so a service may choose to simply provide navigation to the children collections without the ability to render the sub-collection as a map (... or feature, or coverage, etc.). Of course, eventually you will reach a node like .../collections/NaturalEarth/physical/bathymetry that would include links to OGC endpoints like items or map or coverage or whatever and then you could access the resource as the endpoint dictates (i.e. as a map, as features, as a coverage, etc.)
  5. As @cportele pointed out at each level in the hierarchy links (rel=up) are included to connect the nodes.
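One way to picture steps 2 and 3 is a nested document in which each node lists its children, with a path lookup that returns the subtree anchored at a given node. The Python sketch below assumes children are nested under a "collections" member of each node; that schema, the function name `resolve`, and the sample data are assumptions for illustration only.

```python
from typing import List, Optional


def resolve(node: dict, segments: List[str]) -> Optional[dict]:
    """Walk the nested document one path element at a time."""
    for segment in segments:
        children = {child["id"]: child for child in node.get("collections", [])}
        node = children.get(segment)
        if node is None:
            return None  # No such sub-collection at this level.
    return node


# Hypothetical nested response for GET /collections?hierarchy=true.
hierarchy = {
    "collections": [
        {
            "id": "NaturalEarth",
            "collections": [
                {
                    "id": "physical",
                    "collections": [{"id": "bathymetry"}],
                }
            ],
        }
    ]
}

# The document for /collections/NaturalEarth/physical would be this subtree.
physical = resolve(hierarchy, ["NaturalEarth", "physical"])
missing = resolve(hierarchy, ["NaturalEarth", "cultural"])
```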

I really dislike the colon notation that is being proposed because it means that clients would need to parse the collection id which always makes my "Spidey sense" tingle!
Of course, I have not described all the details here but perhaps we can put this approach on the agenda for next week's code sprint to see if it has legs.

@jerstlouis
Member Author

jerstlouis commented May 5, 2022

@pvretano

Are we still proposing to use ":" or some other non-slash character as the collection separator?

I really dislike the colon notation that is being proposed because it means that clients would need to parse the collection id which always makes my "Spidey sense" tingle!

I agreed with you and @cportele that this tingles the Spidey sense and moved away from relying on a particular separator.
Instead, what I proposed is a simple "parent" : "{parentCollectionId}" to be included in each collection object.
e.g., collection NaturalEarth:physical:bathymetry:ne_10m_bathymetry_I_2000 would have its parent set to NaturalEarth:physical:bathymetry, but the collections could be named Foo and Bar instead.

This makes it possible to easily and unambiguously establish hierarchical relations between collections when requesting all collections at /collections, and to present them all, e.g., in a tree view control, with a single server round-trip.

Including a parent property there makes the flat list a hierarchy for clients that understand it, without requiring a separate ?hierarchy=true mode, while being fully compatible with clients that simply ignore it.
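A minimal Python sketch of the client side of this idea: index the flat list by the "parent" property once, then render an indented tree. The function names and the sample collections (beyond Foo and Bar) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def index_children(collections: List[dict]) -> Dict[Optional[str], List[str]]:
    """Group collection IDs by parent; top-level ones go under the key None."""
    children: Dict[Optional[str], List[str]] = defaultdict(list)
    for collection in collections:
        children[collection.get("parent")].append(collection["id"])
    return children


def render_tree(children: dict, parent: Optional[str] = None, depth: int = 0) -> List[str]:
    """Depth-first listing with two spaces of indentation per level."""
    lines: List[str] = []
    for collection_id in children.get(parent, []):
        lines.append("  " * depth + collection_id)
        lines.extend(render_tree(children, collection_id, depth + 1))
    return lines


# A flat /collections array; IDs carry no hierarchy information themselves.
flat = [
    {"id": "Foo"},
    {"id": "Bar", "parent": "Foo"},
    {"id": "Baz", "parent": "Bar"},
    {"id": "Qux"},
]
tree_lines = render_tree(index_children(flat))
```

A client that does not know about "parent" simply sees four collections, which is the backwards compatibility described above.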

A server could use whichever convention for hierarchy separators, or no particular separator. In the past, when we used / instead of :, that didn't seem to break any clients, so possibly / could be used, but I think it is less proper for the OpenAPI descriptions for {collectionId} to include slashes (and we don't want to break compatibility with clients that do not understand the hierarchical collections extension).

If it is decided that this is done with a rel=up link instead of a parent property, that works as well, but is heavier in the array of collection objects.

The ?parent= query parameter in turn would make it possible to only retrieve immediate children. A mechanism to specify the top level parent would be needed, which could be ?parent= with nothing following the =, or something else, so that a client can request only the top-level collections without including the full hierarchy.

The inclusion of collections list property for children collections in the parent collection resource (e.g., /collections/NaturalEarth) is what we currently do (e.g. see https://maps.ecere.com/ogcapi/collections/NaturalEarth?f=json).

The equivalent for listing the sub-collections in this new proposal would be /collections?parent=NaturalEarth instead, but we could also specify that any non-leaf collection should include the list of immediate children in a collections property.

@pvretano
Collaborator

pvretano commented May 5, 2022

@jerstlouis thanks ... let's discuss at the code sprint next week. Looks like we have lots of source material to consider which is a good thing.

@jerstlouis
Member Author

@pvretano off-topic, but I also hope we can discuss Common building blocks related to the Features Search extension that we proposed for Coverages and DGGS ( opengeospatial/ogcapi-coverages#164 ). Glad to hear you will be participating in this Code Sprint! :) This is what we will be focusing on.

@jerstlouis
Member Author

At the OGC API - Common session of the 127th Members Meeting in Singapore we briefly discussed this topic and there was no outspoken objection to drafting and reviewing an optional "Hierarchical collections" requirements class for Part 2 which would:

  • Add a parent property to a collection object, which can reference a parent {collectionId}
  • Add a parent parameter for GET /collections request which would limit the response to immediate children of the specified collection ID (a special none value would need to be determined to retrieve the root collections as opposed to the default full hierarchy when parent is omitted)
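The three behaviours of the parent parameter (omitted, special root value, specific ID) could be sketched on the server side as below. This is a minimal Python illustration: the literal "none" is used as the root value only because the text above mentions a "none" value still to be determined, and the function name and sample data are hypothetical.

```python
from typing import List
from urllib.parse import parse_qs

# Assumption: the special value for "root collections only" is still undecided.
ROOT_SENTINEL = "none"


def select_collections(collections: List[dict], query_string: str) -> List[str]:
    """Return the collection IDs matching the parent query parameter."""
    params = parse_qs(query_string, keep_blank_values=True)
    if "parent" not in params:
        # Parameter omitted: the default is the full hierarchy.
        return [c["id"] for c in collections]
    parent = params["parent"][0]
    if parent == ROOT_SENTINEL:
        return [c["id"] for c in collections if "parent" not in c]
    # Otherwise: immediate children of the specified collection only.
    return [c["id"] for c in collections if c.get("parent") == parent]


collections = [
    {"id": "NaturalEarth"},
    {"id": "NaturalEarth:physical", "parent": "NaturalEarth"},
    {"id": "NaturalEarth:physical:bathymetry", "parent": "NaturalEarth:physical"},
    {"id": "Daraa"},
]

everything = select_collections(collections, "")
roots = select_collections(collections, "parent=none")
children = select_collections(collections, "parent=NaturalEarth")
```

One design consequence of a reserved value is that it could collide with a real collection ID of the same name, which is presumably part of why the value needs to be chosen carefully.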

This would also replace capabilities that were specifically included in the 3D GeoVolumes spec ( opengeospatial/ogcapi-3d-geovolumes#5 and opengeospatial/ogcapi-3d-geovolumes#12 ).


7 participants