Convention for communicating that a dataset is a file listing associated with another dataset #176
This applies to all issues tagged "association". It seems that RDF was invented to address associations. So I am planning on studying https://www.w3.org/TR/rdf11-primer/.
Hi Rebecca,

We are about to attempt to tackle the problem of connecting datasets in the HAPI specification. A few of the things we want to be able to express: …

This seems to be an RDF use case. Do you have any suggestions on how we should proceed, or know anyone with experience with this who could help?

From Rebecca:

I am not experienced in RDF, but Ryan (cc'd) is. Catherine, our digital librarian, is also experienced in metadata. In general, I recommend imitating or copying the DataCite schema (https://schema.datacite.org/meta/kernel-4.4/doc/DataCite-MetadataKernel_v4.4.pdf), especially for datasets, and then mapping/copying that to other schemas (e.g. HAPI).

Time series data is also a current topic in the ESIP SOSO group (second link). Some earth science groups have been using the approach linked below to get time series data into schema.org, too. Likely another good link to HAPI.

Rebecca
From Baptiste:

Yes, indeed that is a nice use case. The first step would be to check what relations are needed (e.g., build an information model/schema) and build a "graph" with nodes and relations. E.g. (not formally in any language, but just to propose something to start with :-):

dataset from_observatory RBSP .
distrib0 has_resolution 1 sec .
distrib1 has_resolution 10 sec .
dataset_b is burst_mode .

Then, see if there are existing terms/relations already available in other schemas/ontologies. For instance, the concept of "dataset" is rather well defined in DCAT (https://www.w3.org/TR/vocab-dcat-3/), which also allows describing the "distribution" of a dataset.
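For concreteness, a JSON-LD rendering of those informal triples might look something like the sketch below; the `ex:` namespace and all property names are invented for illustration and come from no standard vocabulary:

```json
{
  "@context": { "ex": "https://example.org/ns#" },
  "@graph": [
    { "@id": "ex:dataset",   "ex:from_observatory": "RBSP" },
    { "@id": "ex:distrib0",  "ex:has_resolution": "1 sec" },
    { "@id": "ex:distrib1",  "ex:has_resolution": "10 sec" },
    { "@id": "ex:dataset_b", "ex:mode": "burst" }
  ]
}
```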
The ESIP SOSO link I sent has a link to their living agenda, which has several useful links on how other sciences are approaching this using the schema.org structure.
Before Wednesday's meeting, review info about RDF and DCAT definitions.
The building blocks of RDF are subject-predicate-object triples that each represent a single fact, as exemplified by Baptiste above. You can link the same subject (e.g. a dataset) to multiple objects via a meaningful predicate. This is similar to defining properties for an object but more loosely coupled. An object from one triple can be used as a subject for another triple, thus allowing you to build a graph. If you only want to hang properties off of a dataset without deeper linking, RDF might be overkill, though you could still take inspiration from various ontologies for naming things.

The "R" is for "resource", so think of each of those three triple components as resources with unique identifiers and well-defined semantics. There is no limit on how you name these things, but it is clearly more useful if you adopt a preexisting ontology (think schema). Schema.org and DCAT seem to be the most popular for dataset-related metadata. Google Dataset Search claims to support both, though it seems like the emphasis is on schema.org. DataCite also seems like a reasonable way to link related resources. Maybe even SPASE?

At LASP, we take most of our inspiration from DCAT. We've added our own concepts to better capture our needs. We then strive to be able to crosswalk our metadata to other ontologies/schemas.

Another important part of RDF is being able to share your metadata in a standard format. JSON-LD (for "linked data") seems to be a common option. If we embrace RDF here, we might want to rethink the "info" response.
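As a concrete illustration of chaining (the object of one triple becoming the subject of another), here is a hedged DCAT-flavoured sketch; the URLs are placeholders and nothing here is prescribed by HAPI:

```json
{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/"
  },
  "@id": "https://example.org/hapi/info?id=dataset1",
  "@type": "dcat:Dataset",
  "dcat:distribution": {
    "@id": "https://example.org/hapi/data?id=dataset1&format=csv",
    "@type": "dcat:Distribution",
    "dct:format": "text/csv"
  }
}
```

Here the Distribution node is the object of the `dcat:distribution` triple but the subject of its own `dct:format` triple, which is how the graph grows.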
Ideas on dataset relationships:
- Cadence (this one is special so that it is machine interpretable)
- Maybe have this in a separate endpoint for …
- Argument for external: while having it in …
- Others could be defined, but then are not necessarily machine interpretable - up to people to use as needed
We define a list of predicates. No need to specify reverse relationships. Look into RDF predicates for dataset relationships.
Next task: Come up with a JSON schema for the above.
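A first sketch of that schema, assuming a `relatedDatasets` list with a small controlled vocabulary of predicates; all names here (`relatedDatasets`, `hasFileListing`, etc.) are hypothetical and exist only to make the discussion concrete:

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Sketch of a HAPI dataset-relationships object",
  "type": "object",
  "properties": {
    "relatedDatasets": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "id": {
            "type": "string",
            "description": "HAPI id of the related dataset"
          },
          "predicate": {
            "type": "string",
            "enum": ["hasFileListing", "hasLowerCadenceVersion", "sameInstrument"]
          }
        },
        "required": ["id", "predicate"]
      }
    }
  }
}
```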
To make full use of associations between datasets for interactive plotting, the association between datasets is a first step, but we also need a meaningful mapping at the parameter level. For example, I currently work with high-rate satellite datasets, where it is useful to also have a low-rate dataset with the per-orbit (or lower-cadence) minimum, mean and maximum of some (but not all) of these parameters. The OMNI datasets also have different parameter names for the same observable at the 1 min, 5 min and 1 hr cadences. Is this something that can be accomplished with RDF, schema.org and the like? I'll have to look into it.
As mentioned by @dlindhol, and since we use JSON in the HAPI headers, opting for JSON-LD (or another linked-data flavour) is important for interoperability (as usual). I hope we don't reinvent yet another linked-data format. We should also reuse predicates from existing ontologies so that our links are understandable by generic tools. As a by-product, we would get a better FAIR score when assessing our products/services with FAIR assessment tools.
@BaptisteCecconi We decided to use a very basic schema like the one above. The motivation for keeping the schema minimal is so that it will get used: if server developers need to learn something like RDF or JSON-LD to communicate the linking information, the information is unlikely to get provided. As we develop the schema, we'll develop, in parallel, software and/or a service that crawls all HAPI servers and provides what is needed for interoperability.
Our thinking is that you would take the response from … and inspect the metadata for …
It seems like some of these relationships have properties that could be associated with them. So instead of this: … you can add the list of parameter mappings too, with the mappings going from dataset1 to dataset2: …

But then the statistics info (min, max, mean in the averaging interval) in the longer-cadence dataset are actually additional parameters, and they have specific meanings. Both Eelco and Jeremy wanted these kinds of summary stats for averaged parameters. Are these kinds of averaging stats common enough that they belong in the relationship mapping language? Seems like they might be, especially if there are already terms for this in one of the standard sets of relationship names that Baptiste mentioned.

We should look at the existing, standard sets of RDF relationships and relationship terms and try to use them, since we are ultimately looking to map to them anyway (with the standardizing layer that Bob mentioned).
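To make the idea concrete, here is a hedged sketch of a relationship entry carrying parameter mappings and the averaging-statistics tags discussed above; `parameterMap`, `stats`, and the predicate name are all made up for illustration:

```json
{
  "relatedDatasets": [
    {
      "id": "dataset2",
      "predicate": "hasLowerCadenceVersion",
      "cadence": "PT1H",
      "parameterMap": [
        { "from": "B_field", "to": "B_field_min",  "stats": "min"  },
        { "from": "B_field", "to": "B_field_mean", "stats": "mean" },
        { "from": "B_field", "to": "B_field_max",  "stats": "max"  }
      ]
    }
  ]
}
```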
I have to support Baptiste's argument here about not making up our own syntax for linking data. I have now looked a bit into JSON-LD; it does not seem so complicated, and it looks to be quite well supported for programmers. I also prefer the idea behind it that relations/links are defined where the data items (in our case the datasets and parameters) are defined, i.e., as additional items under the /hapi/info endpoint, instead of, for example, having a separate 'relations configuration document' under a /hapi/relations endpoint, which would then contain some duplication of the structure we already have in the /hapi/catalog and /hapi/info endpoints. This would also add the burden of keeping this duplicated structure consistent.

To me, it seems much easier to give HAPI server developers the option to expand the /hapi/info endpoints with some JSON-LD elements instead. It looks like the JSON-LD libraries would be helpful for crawling HAPI servers to create the relations graph that can then be used in applications like the timeline viewer.
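For illustration, a hedged sketch of what a /hapi/info response with a few embedded JSON-LD elements might look like; the `dct:isReferencedBy` relation and the example URLs are assumptions, not part of the HAPI spec:

```json
{
  "HAPI": "3.1",
  "status": { "code": 1200, "message": "OK" },
  "startDate": "2000-01-01T00:00:00Z",
  "stopDate": "2023-01-01T00:00:00Z",
  "parameters": [
    { "name": "Time", "type": "isotime", "units": "UTC", "length": 24, "fill": null }
  ],
  "@context": { "dct": "http://purl.org/dc/terms/" },
  "@id": "https://example.org/hapi/info?id=dataset1",
  "dct:isReferencedBy": { "@id": "https://example.org/hapi/info?id=dataset1_files" }
}
```

A HAPI client unaware of JSON-LD could ignore the extra keys, while a JSON-LD processor could extract the relation triples.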
Thanks @eelcodoornbos :-) Just as an example: I recently looked up the W3C Annotation standard, which proposes JSON-LD as its preferred serialisation. They have prepared a specific JSON-LD context for it.
So if we prepare a dedicated HAPI JSON-LD context file, then the JSON-LD section of the HAPI response could be rather straightforward to write (and validate). |
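A minimal sketch of what such a dedicated HAPI context might contain, mapping short keys to full predicate IRIs; the hapi-server.org namespace and all mappings are hypothetical:

```json
{
  "@context": {
    "dct": "http://purl.org/dc/terms/",
    "dcat": "http://www.w3.org/ns/dcat#",
    "isPartOf": { "@id": "dct:isPartOf", "@type": "@id" },
    "hasFileListing": { "@id": "https://hapi-server.org/ns#hasFileListing", "@type": "@id" }
  }
}
```

With something like this published at a stable URL, a server could write plain keys such as `isPartOf` in its responses and let JSON-LD tooling expand them to full IRIs.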
@BaptisteCecconi—perhaps a simple example would help clarify things. Suppose we wanted to say dataset1:parameter1 is the same as dataset2:parameter1 except for cadence. What would that look like in JSON-LD? I've reviewed these documents many times and have concluded I'd need much more time to understand them enough to use them. |
I found this useful: https://developers.google.com/search/docs/appearance/structured-data/dataset. I recall discussing the fact that we should create JSON-LD for HAPI servers. It would be something an external resource builds based on HAPI JSON responses. In terms of syntax, the choices from https://schema.org/Dataset are limited: hasPart, isPartOf, isBasedOn.
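For reference, a minimal schema.org Dataset record using one of those three relations; the dataset names and URLs are placeholders:

```json
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "dataset1_files",
  "description": "File listing associated with dataset1",
  "isBasedOn": "https://example.org/hapi/info?id=dataset1"
}
```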
The first step is to build the information model (the predicates). So far I saw:
The …
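For illustration, a record that lists such relations inline could look like the following sketch; the `hapi:` prefix and the "same...As" predicate names are hypothetical:

```json
{
  "@context": { "hapi": "https://hapi-server.org/ns#" },
  "@id": "https://example.org/hapi/info?id=dataset1",
  "hapi:sameMissionAs": { "@id": "https://other-server.org/hapi/info?id=datasetX" },
  "hapi:sameObservableAs": { "@id": "https://example.org/hapi/info?id=dataset1_1hr" }
}
```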
Note: Of course, this is very rudimentary, and we need to explore it in more detail. However, from this first example, I would say that it looks rather non-RDF-ic to list the "same[...]As" predicates in the record. This is the job of a graph database ingesting the records, so that it can be queried and kept up to date. It is the job of the SPASE (or any future name) registry to list, e.g., what other HAPI datasets contain data from the same mission. The same goes for the datasets with different cadences: it seems more efficient to have a registry manage such queries. When building such linked-data resources, the underlying assumption should be that you want to hard-code links to your own resources only (same server), since you don't control the URLs of the other servers. (Of course, this is a quick and dirty example.)
This is very helpful. Based on what you wrote, I think we have to address another issue. I see that we've identified two types of predicates: …
I suggest that we constrain ourselves to case 2, because we've always tried to avoid building an overarching metadata model and have decided to use existing metadata instead. (All of the issues tagged "association" fall into these two categories.) Before proceeding, we should probably clarify our statement in the standard that "the HAPI metadata standard is not intended for complex search and discovery" so that we can more easily categorize metadata additions that are out of scope. (In particular, we should explain what we mean by "complex".) The case 2 instances are …
Do we want one server with `datasetID` that is numerical data and `datasetID/files` that is URLs, and another that uses the convention `datasetID` and `FilesForDatasetID`? Should we have a recommendation? Or would this be addressed by grouping/linking as discussed in #118?