
Convention for communicating that a dataset is a file listing associated with another dataset #176

Open
rweigel opened this issue May 4, 2023 · 20 comments


@rweigel
Contributor

rweigel commented May 4, 2023

Do we want one server where datasetID is numerical data and datasetID/files is URLs, and another server that uses the convention datasetID and FilesForDatasetID?

Should we have a recommendation? Or would this be addressed by grouping/linking as discussed in #118?

@rweigel rweigel added this to the Version 3.3 milestone Mar 18, 2024
@rweigel rweigel closed this as completed Mar 25, 2024
@rweigel rweigel reopened this Mar 25, 2024
@rweigel
Contributor Author

rweigel commented Mar 25, 2024

This applies to all issues tagged "association". It seems that RDF was invented to address associations. So I am planning on studying https://www.w3.org/TR/rdf11-primer/.

@rweigel
Contributor Author

rweigel commented Mar 26, 2024

Hi Rebecca,

We are about to attempt to tackle the problem of connecting datasets in the HAPI specification. A few of the things we want to be able to express:

  • Another version of this dataset is at a different cadence or quality control level (e.g., L0, L1, ... in NASA terminology and preliminary, quasi-definitive, and definitive in magnetometer speak).

  • This dataset contains only data when in burst mode. The nominal mode data are in another dataset.

  • This dataset was used to generate an event list dataset.

  • The files used for this dataset over a given time range are available by requesting a HAPI server using a different dataset ID with the same time range.

  • This dataset is from a satellite that is part of a constellation (e.g., RBSP-A, RBSP-B) (SPASE does this, so we may not need it).

This seems to be an RDF use case. Do you have any suggestions on how we should proceed or know anyone with experience with this that could help?


Rebecca's reply:

I am not experienced in RDF, but Ryan (cc’d) is. Catherine, our digital librarian, is also experienced in metadata. In general, I recommend imitating or copying the DataCite schema (https://schema.datacite.org/meta/kernel-4.4/doc/DataCite-MetadataKernel_v4.4.pdf), especially for datasets, and then mapping/copying that to other schema (e.g. HAPI).

  • The version issue can be taken care of with different version numbers, similar to software. See the Version field (#15 in Table 2).
  • Likely best in the description
  • Can use the relatedItem field likely with Collection as the item type and IsOriginalForm as the relationship (see bottom of page 33 and table 9 in the appendix)
  • The most useful option would be alternateIdentifier = the link to the HAPI dataset with some explanatory text in alternateIdentifierType.
  • The international jury is out on the best way to indicate links to facilities/missions/observatories and instruments, with an answer hoped for in the next two years. In the meantime, the current RDA recommendations do allow for DOIs to be created for each mission/instrument, with some good arguments for using this approach, but there are some issues (see https://www.rd-alliance.org/system/files/pidinst-schema-1.0_Final.pdf, particularly Table 1 on p. 4, and https://github.com/rdawg-pidinst/schema/blob/master/schema-datacite.rst for a deeper dive). The best way to do this in DataCite appears to be the RelatedItem field (free text) with Physical Object as the object type and possibly with IsDerivedFrom as the relation.

Time series data is also a current topic in the ESIP SOSO group (second link). Some earth science groups have been using the approach linked below to get time series data in schema.org, too. Likely another good link to HAPI.

Rebecca

iodepo/odis-arch#125

https://github.com/ESIPFed/science-on-schema.org

@rweigel
Contributor Author

rweigel commented Mar 27, 2024

From Baptiste:

Yes, indeed that is a nice use case.

The first step would be to check what relations are needed (e.g.: build an information model / schema), and build a "graph" with nodes and relations.

E.g. (not formally in any language, but just to propose something to start with :-):

dataset from_observatory RBSP .
dataset has_distribution distrib0 .
dataset has_distribution distrib1 .
dataset is nominal_mode .
dataset see_also dataset_L0 .

distrib0 has_resolution 1 sec .
distrib0 has_hapi_server https://...hapiurl...
distrib0 has_hapi_dataset hapi_dataset_id0
distrib0 other_resolution distrib1 .

distrib1 has_resolution 10 sec .
distrib1 has_hapi_server https://...hapiurl…
distrib1 has_hapi_dataset hapi_dataset_id1
distrib1 other_resolution distrib0 .

dataset_b is burst_mode .
dataset_b is_supplement_to dataset .

Then, see if there are existing terms/relations already available in other schemas/ontologies.

For instance, the concept of "dataset" is rather well defined in DCAT (https://www.w3.org/TR/vocab-dcat-3/), and it allows one to describe the "distribution" of the dataset.
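Baptiste's node-and-relation sketch can be prototyped with nothing more than plain tuples. Below is a hedged Python sketch (all identifiers are the illustrative ones from his example, and the `query` helper is an assumption, not a proposed API) showing how a subject-predicate-object graph can be stored and pattern-matched:

```python
# A subset of Baptiste's example graph as plain
# (subject, predicate, object) triples. All names are illustrative.
triples = [
    ("dataset",   "from_observatory", "RBSP"),
    ("dataset",   "has_distribution", "distrib0"),
    ("dataset",   "has_distribution", "distrib1"),
    ("dataset",   "see_also",         "dataset_L0"),
    ("distrib0",  "has_resolution",   "1 sec"),
    ("distrib0",  "has_hapi_dataset", "hapi_dataset_id0"),
    ("distrib1",  "has_resolution",   "10 sec"),
    ("distrib1",  "has_hapi_dataset", "hapi_dataset_id1"),
    ("dataset_b", "is_supplement_to", "dataset"),
]

def query(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None is a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# All distributions of "dataset":
dists = [o for _, _, o in query(triples, s="dataset", p="has_distribution")]
print(dists)  # ['distrib0', 'distrib1']
```

This is essentially what an RDF triple store does, minus namespaces, typing, and standard serializations.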

@rebeccaringuette

The ESIP SOSO link I sent has a link to their living agenda, which has several useful links on how other sciences are approaching this using the schema.org structure.

@jvandegriff
Collaborator

Before Wednesday's meeting, review info about RDF and DCAT definitions.
See if Doug L has any input.

@dlindhol

dlindhol commented Apr 2, 2024

The building blocks of RDF are subject-predicate-object triples that each represent a single fact, as exemplified by Baptiste above. You can link the same subject (e.g. a dataset) to multiple objects via a meaningful predicate. This is similar to defining properties for an object but more loosely coupled. An object from one triple can be used as a subject for another triple thus allowing you to build a graph. If you only want to hang properties off of a dataset without deeper linking, RDF might be overkill. Though you could still take inspiration from various ontologies for naming things.

The "R" is for "resource" so think of each of those three triple components as resources with unique identifiers and well defined semantics. There is no limit on how you name these things but it is clearly more useful if you adopt a preexisting ontology (think schema). Schema.org and DCAT seem to be the most popular for dataset related metadata. Google dataset search claims to support both, though it seems like the emphasis is on schema.org. DataCite also seems like a reasonable way to link related resources. Maybe even SPASE? At LASP, we take most of our inspiration from DCAT. We've added our own concepts to better capture our needs. We then strive to be able to crosswalk our metadata to other ontologies/schema.

Another important part of RDF is to be able to share your metadata in a standard format. JSON-LD (for "linked data") seems to be a common option. If we embrace RDF here, we might want to rethink the "info" response.

@jvandegriff
Collaborator

Ideas on dataset relationships

Cadence (this one is special so that it is machine interpretable)
This needs work to clarify to be specific enough to be machine usable, but not tangled in the weeds.
Needs some isomorphism amongst parameter names so that plotting tools can easily switch between datasets and still plot the same parameters.
Key point: we want to support Eelco's use case of auto-selection of cadence by a timeline viewer (client-side plotting tool).
The right kind of descriptor could specify a linkage at the dataset level (dataset A is linked to dataset B by cadence) and if the datasets are not similar enough to do this (i.e., the parameters have different names, even though they are just different by cadence), then the linkage descriptor could specify connections between specific parameters.
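Eelco's auto-selection use case can be sketched in a few lines: given the cadences of linked datasets, a timeline viewer picks the finest cadence whose point count over the plotted time span stays within its plotting budget. This is a hypothetical sketch; the helper names and the point-budget heuristic are assumptions, not part of any proposal here:

```python
import re

def cadence_seconds(iso):
    """Parse simple ISO 8601 durations of the form PT#H#M#S only."""
    m = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?", iso)
    if m is None or iso == "PT":
        raise ValueError(f"unsupported duration: {iso}")
    h, mn, s = (float(g) if g else 0.0 for g in m.groups())
    return h * 3600 + mn * 60 + s

def pick_cadence(cadences, span_seconds, max_points):
    """Finest cadence whose point count over the span fits the budget;
    fall back to the coarsest available cadence if none fits."""
    fitting = [c for c in cadences
               if span_seconds / cadence_seconds(c) <= max_points]
    if fitting:
        return min(fitting, key=cadence_seconds)
    return max(cadences, key=cadence_seconds)

# One day of data with a 2000-point budget: PT1S would need 86400
# points, so the viewer falls back to the 60-second dataset.
print(pick_cadence(["PT1S", "PT60S"], 86400, 2000))  # PT60S
```

The missing piece, which the linkage descriptor would supply, is knowing which datasets (and which parameters within them) are cadence-related in the first place.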

Maybe have this in a separate endpoint for linkages or relationships? Linkages are a kind of overlay, but it would also be nice to see it in the info response.

Argument for external: while having it in info is convenient, we likely need to have an external place to manage the complexity of the different kinds of linkages.

Others could be defined, but are not necessarily machine interpretable - up to people to use as needed:
Calibration Level (NASA Level 0, Level 1)
Processing Version
Quality level
Transform (FFT, coordinate, statistical (min/max), background removal)

@rweigel
Contributor Author

rweigel commented Apr 3, 2024

/relations returns this information in some JSON form similar to

[
[server:dataset1, hasRelatedCadence, server:dataset2],
[server:dataset1, sameMissionName, server:dataset2],
[server:dataset1:parameter1, isATransformedVersionOf, server:dataset1:parameter2],
[server1:dataset1:parameter1, isCoordinateTransformOf, server2:dataset1:dataset3],
[server1:dataset1, isDifferentCalibrationLevelOf, server1:dataset4],
[server1:dataset5, isFileListingOf, server1:dataset1],
[dataset1, x_SameReviewerAs, dataset4]
]

We define a list of predicates. No need to specify reverse relationships. Look into RDF predicates for dataset relationships.
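Since reverse relationships would not be stated by servers, a client would derive them from the forward triples. A rough Python sketch (the inverse-predicate names such as hasFileListing are made up for illustration, not a proposed vocabulary):

```python
# Map each predicate to its inverse; symmetric predicates map to
# themselves. These names are illustrative assumptions.
INVERSES = {
    "isFileListingOf": "hasFileListing",       # assumed inverse name
    "hasRelatedCadence": "hasRelatedCadence",  # symmetric
    "sameMissionName": "sameMissionName",      # symmetric
}

def with_inverses(relations):
    """Return the input triples plus derived reverse triples."""
    out = [list(t) for t in relations]
    for s, p, o in relations:
        inv = INVERSES.get(p)
        if inv and [o, inv, s] not in out:
            out.append([o, inv, s])
    return out

rels = [["server1:dataset5", "isFileListingOf", "server1:dataset1"],
        ["server:dataset1", "hasRelatedCadence", "server:dataset2"]]
expanded = with_inverses(rels)
```

Deriving inverses client-side keeps the server-side schema minimal while still letting tools navigate links in either direction.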

@rweigel
Contributor Author

rweigel commented Apr 3, 2024

Next task: Come up with JSON schema for above.

@eelcodoornbos

To make full use of associations between datasets for interactive plotting, the association between datasets is a first step. But we need to also be able to have a meaningful mapping on the parameter level. For example, I currently work with high rate satellite datasets, where it is useful to also have a low rate dataset with the per orbit (or lower cadence) minimum, mean and maximum of some (but not all) of these parameters. The OMNI datasets also have different parameter names for the same observable at the 1min, 5min and 1hr cadences. Is this something that can be accomplished with RDF, schema.org and the like? I’ll have to look into it.

@BaptisteCecconi

BaptisteCecconi commented Apr 23, 2024

As mentioned by @dlindhol, and since we use JSON in the HAPI headers, opting for JSON-LD (or other linked-data flavour) is important for interoperability (as usual). I hope we don't reinvent yet another linked-data format.

We also should reuse predicates from existing ontologies so that our links are understandable by generic tools.

As a by-product, we would have a better FAIR score when assessing our products/services with FAIR assessment tools.

@rweigel
Contributor Author

rweigel commented Apr 23, 2024

@BaptisteCecconi We decided to use a very basic schema like the one above. The motivation for keeping the schema minimal is so that it will get used. If server developers need to learn something like RDF, JSON-LD, etc., to communicate the linking information, the information is unlikely to get provided. As we develop the schema, we'll develop in parallel software and/or a service that crawls all HAPI servers and provides what is needed for interoperability.
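The crawler mentioned here could be very small. A hedged sketch (the /relations endpoint is the proposal above, not part of HAPI today, and the fetch-injection design is an assumption made for testability):

```python
import json
import urllib.request

def fetch_json(url):
    """GET a URL and parse the JSON body."""
    with urllib.request.urlopen(url, timeout=10) as r:
        return json.load(r)

def crawl(servers, fetch=fetch_json):
    """Collect relation triples from each server's (hypothetical)
    /relations endpoint; servers that fail or lack it are skipped."""
    triples = []
    for base in servers:
        try:
            triples.extend(fetch(base + "/relations"))
        except Exception:
            continue  # server down or endpoint not implemented
    return triples
```

An external service running this against all known HAPI servers could then serve the merged graph (or a JSON-LD rendering of it) for interoperability, without requiring every server operator to learn linked-data tooling.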

@rweigel
Contributor Author

rweigel commented Apr 23, 2024

@eelcodoornbos

Our thinking is that you would take the response from /relations, which could have

[server:dataset1, hasRelatedCadence, server:dataset2]

and inspect the metadata for dataset1 and dataset2 to determine the available cadences. We started discussing how to provide more details in response to a /relations request and realized that RDF is the solution, but it is too complex (see my response to @BaptisteCecconi).

@jvandegriff
Collaborator

It seems like some of these relationships have properties that could be associated with them.

So instead of this:
[server:dataset1, hasRelatedCadence, server:dataset2]

You can add the list of parameter mappings too, with the mappings going from dataset1 to dataset2
[server:dataset1, hasRelatedCadence, server:dataset2,
param1_in_dataset1:param1_name_in_dataset2,
param2_in_dataset1:param2_name_in_dataset2,
param3_in_dataset1:param3_name_in_dataset2,
param4_in_dataset1:param4_name_in_dataset2,
param5_in_dataset1:param5_name_in_dataset2
]

But then the statistics info (min, max, mean in the averaging interval) in the longer cadence dataset are actually additional parameters, and they have specific meanings. Both Eelco and Jeremy wanted these kinds of summary stats for averaged parameters.

Are these kinds of averaging stats common enough that they belong in the relationship mapping language? Seems like they might be. Especially if there are already terms for this in one of the standard sets of relationship names that Baptiste mentioned.

We should look at the existing, standard sets of RDF relationships and relationship terms and try to use them since we are ultimately looking to map to them anyway (with the standardizing layer that Bob mentioned).
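On the client side, the parameter-mapping idea above might look like the following sketch. The record layout and all parameter names are hypothetical, chosen only to illustrate applying a dataset1-to-dataset2 name mapping when switching cadences:

```python
# A relationship record extended with a parameter-name mapping,
# as proposed above. All names here are illustrative.
relation = {
    "subject": "server:dataset1",
    "predicate": "hasRelatedCadence",
    "object": "server:dataset2",
    # dataset1 parameter name -> dataset2 parameter name
    "parameterMap": {"Bx": "bx_gse", "By": "by_gse", "Bz": "bz_gse"},
}

def translate_parameters(relation, names):
    """Map dataset1 parameter names to their dataset2 counterparts;
    names without a mapping are dropped (no counterpart exists)."""
    pmap = relation["parameterMap"]
    return [pmap[n] for n in names if n in pmap]

print(translate_parameters(relation, ["Bx", "Bz", "Quality"]))
# ['bx_gse', 'bz_gse']
```

A plotting tool switching cadences would run its currently plotted parameter list through such a mapping and warn about any parameters that have no counterpart.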

@eelcodoornbos

I have to support Baptiste's argument here about not making up our own syntax for linking data. I have now looked a bit into JSON-LD and it does not seem so complicated and it looks to be quite well supported for programmers.

I also prefer the idea behind it that relations/links are defined where the data items (in our case the datasets and parameters) are defined. So as additional items under the /hapi/info endpoint, instead of, for example, having a separate 'relations configuration document' under a /hapi/relations endpoint, which would then contain some duplication of the structure we already have in the /hapi/catalog and /hapi/info endpoints. This would also add a burden of keeping this duplicated structure consistent. To me, it seems much easier to give HAPI server developers the option to expand the /hapi/info endpoints with some JSON-LD elements instead.

It looks like the JSON-LD libraries would be helpful for crawling HAPI servers, to create the relations graph that can then be used in applications like the timeline viewer.

@BaptisteCecconi

Thanks @eelcodoornbos :-)

Just as an example: I recently looked at the W3C Annotation standard, which proposes JSON-LD as its preferred serialisation. They have prepared a specific JSON-LD context file, so that the JSON-LD instances are not cluttered with namespaces and prefixes.

So if we prepare a dedicated HAPI JSON-LD context file, then the JSON-LD section of the HAPI response could be rather straightforward to write (and validate).

@rweigel
Contributor Author

rweigel commented Apr 24, 2024

@BaptisteCecconi—perhaps a simple example would help clarify things. Suppose we wanted to say dataset1:parameter1 is the same as dataset2:parameter1 except for cadence. What would that look like in JSON-LD? I've reviewed these documents many times and have concluded I'd need much more time to understand them enough to use them.

@rweigel
Contributor Author

rweigel commented Apr 24, 2024

I found this useful: https://developers.google.com/search/docs/appearance/structured-data/dataset

I recall discussing that we should create JSON-LD for HAPI servers. It would be something an external resource builds based on HAPI JSON responses.

In terms of syntax, the choices from https://schema.org/Dataset are limited: hasPart, isPartOf, isBasedOn.

@BaptisteCecconi

BaptisteCecconi commented Apr 25, 2024

> @BaptisteCecconi—perhaps a simple example would help clarify things. Suppose we wanted to say dataset1:parameter1 is the same as dataset2:parameter1 except for cadence. What would that look like in JSON-LD? I've reviewed these documents many times and have concluded I'd need much more time to understand them enough to use them.

The first step is to build the information model (the predicates). So far I saw:

  • cadence
  • calibrationLevel
  • processingVersion
  • qualityLevel
  • transform
  • parameter
  • missionName
  • sameMissionName
  • isATransformedVersionOf
  • isCoordinateTransformOf
  • isDifferentCalibrationLevelOf
  • isFileListingOf
  • reviewer
  • sameReviewerAs

A first-draft JSON-LD instance using those predicates could look like:
{
   "@context": "https://github.com/hapi-server/rdf/hapi-context.json",
   "@id": "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H2_MFI#BGSMc",
   "type": "Dataset",
   "conformsTo": "HAPI",
   "cadence": "PT92S",
   "otherCadences": [
      {
         "@id": "uri_to_dataset_with_cadence_300s",
         "cadence": "PT300S"
      },
      {
         "@id": "uri_to_dataset_with_cadence_10s",
         "cadence": "PT10S"
      }
   ],
   "calibrationLevel": "Calibrated",
   "processingVersion": "K0",
   "parameter": "BGSMc",
   "missionName": "wind",
   "instrumentName": "mfi",
   "sameMissionName": [
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WIND_3DP_ECHSFITS_E0-YR",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_AT_DEF",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_AT_PRE",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_EHPD_3DP",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_EHSP_3DP",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_ELM2_3DP",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_ELPD_3DP",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_ELSP_3DP",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_EMFITS_E0_3DP",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_EM_3DP",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_EPACT_STEP-DIFFERENTIAL-ION-FLUX-1HR",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_EPACT_STEP-DIRECTIONAL-DIFF-CNO-FLUX-10MIN",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_EPACT_STEP-DIRECTIONAL-DIFF-FE-FLUX-10MIN",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_EPACT_STEP-DIRECTIONAL-DIFF-H-FLUX-10MIN",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_EPACT_STEP-DIRECTIONAL-DIFF-HE-FLUX-10MIN",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H0_MFI@0",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H0_MFI@1",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H0_MFI@2",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H0_SWE",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H0_WAV",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H1_SWE",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H1_WAV@0",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H1_WAV@1",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H2_MFI",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H3-RTN_MFI@0",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H3-RTN_MFI@1",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H3-RTN_MFI@2",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H3_SWE",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H4-RTN_MFI",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H4_SWE",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H5_SWE",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_K0_3DP",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_K0_EPA",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_K0_SMS",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_K0_SPHA",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_K0_SWE",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_K0_WAV",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L2-1HOUR-SEP_EPACT-APE_B",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L2-1HOUR-SEP_EPACT-LEMT",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L2-30MIN_SMS-STICS-AFM-MAGNETOSPHERE",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L2-30MIN_SMS-STICS-AFM-SOLARWIND",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L2-30MIN_SMS-STICS-ERPA-MAGNETOSPHERE",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L2-30MIN_SMS-STICS-ERPA-SOLARWIND",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L2-3MIN_SMS-STICS-VDF-MAGNETOSPHERE",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L2-3MIN_SMS-STICS-VDF-SOLARWIND",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L2-5MIN-SEP_EPACT-LEMT",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L2_3MIN_SMS-STICS-NVT-MAGNETOSPHERE",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L2_3MIN_SMS-STICS-NVT-SOLARWIND",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L3-DUSTIMPACT_WAVES",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_M0_SWE",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_M2_SWE",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_OR_DEF",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_OR_PRE",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_PLSP_3DP",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_PM_3DP",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_SFPD_3DP",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_SFSP_3DP",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_SOPD_3DP",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_SOSP_3DP",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_STRAHL0_SWE",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_SW-ION-DIST_SWE-FARADAY",
      "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_WA_RAD1_L3_DF"
   ],
   "isATransformedVersionOf": "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H2_MFI",
   "isCoordinateTransformOf": "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H2_MFI#BGSEc"
}

The @context file (https://github.com/hapi-server/rdf/hapi-context.json) would contain the information linking each predicate to an information schema (still to be written).

{
   "@context": {
       "hapi": "http://hapi-server/rdf/hapi-schema#",
       "dctypes": "http://purl.org/dc/dcmitype/",
       "dcterms": "http://purl.org/dc/terms/",
       "foaf": "http://xmlns.com/foaf/0.1/",

       "type": {"@type": "@id", "@id": "@type"},

       "Dataset": "dctypes:Dataset", 
       "cadence": "hapi:cadence",
       "calibrationLevel": "hapi:calibrationLevel",
       "processingVersion": "hapi:processingVersion",
       "qualityLevel": "hapi:qualityLevel",
       "transform": "hapi:transform",
       "parameter": "hapi:parameter",
       "missionName": "foaf:Project",
       "instrumentName": "foaf:Project",
       "sameMissionName": "hapi:sameMissionName",
       "isATransformedVersionOf": "hapi:isATransformedVersionOf",
       "isCoordinateTransformOf": "hapi:isCoordinateTransformOf",
       "isDifferentCalibrationLevelOf": "hapi:isDifferentCalibrationLevelOf",
       "isFileListingOf": "hapi:isFileListingOf",
       "reviewer": "foaf:Person",
       "sameReviewerAs": "hapi:sameReviewerAs",

       "conformsTo":    {"@type": "@id", "@id": "dcterms:conformsTo"}
   }
}

Note: @id in JSON-LD is the linked-data graph node identifier (so it has to be unique in the local context). If you want an actual PID, then you would have to include an extra "identifier" predicate (from "dcterms" or "schema.org").

Of course, this is very rudimentary, and we need to explore it in more detail. However, from this first example, I would say that it looks rather non-RDF-ic to list the "same[...]As" predicates in the record. That is the job of a graph database ingesting the records, so that the links can be queried and kept up to date. It is the job of the SPASE (or any future name) registry to list, e.g., which other HAPI datasets contain data from the same mission. The same goes for datasets with different cadences: it seems more efficient to have a registry manage such queries.

When building such linked-data resources, the underlying assumption should be that you want to hard-code links to your resources only (same server), since you don't control the URL of the other servers.

(of course, this is a quick and dirty example)

@rweigel
Contributor Author

rweigel commented Apr 25, 2024

This is very helpful.

Based on what you wrote, I think we have to address another issue. I see that we've identified two types of predicates:

  1. Those that are redundant because the metadata already exists elsewhere; they are not required for automatic processing and are mostly needed for search and discovery. For example, missionName exists elsewhere: if I wanted to determine the missionName associated with a HAPI dataset, I could do a SPASE query. Do we want to include this information if it already exists? Some servers may not have SPASE metadata, in which case having it would be useful. However, if we do include it, we are going down the path of developing metadata that goes beyond our stopping point, which is primarily (a) metadata needed for a machine to produce a scientifically sensible plot automatically and (b) metadata needed for science use (contact name, citation info, etc.). In this case, someone who wants to build a drop-down menu for a server that does not have SPASE (for example, INTERMAGNET or a non-helio data server) would need to develop menu logic for each server by either querying another SPASE-like database, if it exists, or inferring relationships from dataset names. (For example, if datasets were named a/b and a/c, the menu could have a top level of a with children b and c.)

  2. Those that cannot be determined from existing metadata (and are unlikely to exist in the future) and are required for automatic processing to create a scientifically sensible plot automatically. Examples include "parameterMin is the min of parameter in a window given by the cadence of parameterMin". This information is needed for automatic processing and producing a sensible (in this case, correct) plot when a plot of a long time range is requested.

I suggest that we constrain ourselves to case 2 because we've always tried to avoid building an overarching metadata model and have decided to use existing metadata instead. (All of the issues tagged "association" fall into these two categories.) Before proceeding, we should probably clarify our statement in the standard that "the HAPI metadata standard is not intended for complex search and discovery" so we can more easily categorize metadata additions that are out of scope. (In particular, we should explain what we mean by "complex".)

The case 2. instances are

  • otherCadences (If the automatic processor wants to overlay the two related datasets, it is case 2; otherwise, it is case 1.)

  • "parameter{Min,Max,Ave,Std} is the {min,max,ave,std} of parameter in a window given by the cadence of parameter{Min,Max,Ave,Std}". An automatic processing algorithm would need to verify that the windows for the parameter{Min,Max} calculations are aligned such that parameter{Min,Max} can be used the way Eelco uses it (so if parameter has a cadence of PT1H with timeStampLocation=center and time stamps at T00:30, T01:30, ..., then parameterMin must have cadence PT24H with timeStampLocation=center and a time stamp at T12:00).

  • "parameterFiles are the files from which data for parameter was drawn." This could also work at the dataset level.

  • "parameterBurst was measured by the same instrument but in a different sampling mode (more channels, for example)." I'd argue that it is a stretch that this would be used for automatic plot generation. For example, would plotting software have an option that says "plot all related datasets"? I think it is more likely the user would need to discover this from a "complex search and discovery" database, and we should avoid developing a metadata model that captures this.
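The alignment condition in the parameter{Min,Max,Ave,Std} bullet can be checked mechanically. A sketch under stated assumptions: both parameters are center-stamped, the helper name is hypothetical, and time stamps are given as second offsets from a common epoch:

```python
def windows_aligned(cadence_s, avg_cadence_s, stamp_s, avg_stamp_s):
    """True if the averaging window of the coarse (averaged) parameter
    starts on a sample boundary of the fine parameter and spans a whole
    number of fine samples. Both parameters are assumed to have
    timeStampLocation=center; stamp_s / avg_stamp_s are the offsets in
    seconds of each parameter's first time stamp from a common epoch."""
    fine_edge = stamp_s - cadence_s / 2          # first fine sample boundary
    win_start = avg_stamp_s - avg_cadence_s / 2  # start of averaging window
    return (win_start - fine_edge) % cadence_s == 0 and \
           avg_cadence_s % cadence_s == 0

# The example above: PT1H centered at T00:30 (sample edges on the hour)
# and PT24H centered at T12:00 (window [T00:00, T24:00)) are aligned.
print(windows_aligned(3600, 86400, 1800, 43200))  # True
```

An automatic processor could run this check once per dataset pair before trusting parameterMin/parameterMax for min/max envelope plotting.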
