
connect ZMT catalogue as ODIS node #276

Open

jmckenna opened this issue Jul 11, 2023 · 26 comments

@jmckenna
Contributor

jmckenna commented Jul 11, 2023

Summary:

  • existing PANGAEA catalogue
  • ZMT team are mapping metadata properties through the ODIS Book examples
  • possible JSON-LD templates to also use:
  • ZMT team will inform OIH team when sitemap.xml or JSON-LD is ready
  • ZMT team should also create an entry inside the ODIS Catalogue
    • important fields are the Startpoint URL for ODIS-Arch (the URL to your sitemap) and the Type of the ODIS-Arch URL (choose "sitemap"); a minimal sitemap sketch follows this list
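For reference, a sitemap of the kind ODIS harvests could be as minimal as the following (a sketch only; the URL is a hypothetical placeholder):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://dataportal.example.org/oih/dataset_1.html</loc>
    <lastmod>2023-07-11</lastmod>
  </url>
</urlset>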

This issue will allow questions, updates, and discussions by both teams.

cc @fils @pbuttigieg

@acwittmann

Hi @jmckenna @fils @pbuttigieg @fspreck
my colleague @uschindler from PANGAEA tested implementing "Event" in the JSON-LD of PANGAEA datasets, as in the ODIS Book example; see
https://doi.pangaea.de/10.1594/PANGAEA.948712?format=metadata_jsonld&incubation=true
It seems Google Search is quite particular when it comes to the term "Event": he promptly received the following error message. We may have to stick with temporal and spatial coverage, unless we (ZMT & OIH) decide we do not need to worry about Google.

Problems of type "Structured Data Events" detected on doi.pangaea.de

To the owner of doi.pangaea.de:

The Search Console has identified that your website is affected by 13 problem(s) of type "Structured Data Events". The following problems have been found on your website. We recommend addressing these issues, if possible, to ensure optimal functioning and high visibility in Google search results.
Most common critical issues*

Missing "startDate" field

Missing "location" field

*Critical issues prevent a page or feature from appearing in search results.
Most common non-critical issues‡

Missing "offers" field

Missing "performer" field

Missing "eventAttendanceMode" field

Missing "eventStatus" field

Missing "image" field

‡Non-critical issues are suggestions for improvement. They do not prevent a page or feature from appearing in Google search results. Some non-critical issues may negatively impact the display in search results, while others may be escalated to critical issues later on.

@uschindler

To add more information: the problem comes from "Event" having more than one meaning in English. Schema.org uses it in the sense of the German word "Veranstaltung" (a staged, artistic event), not the abstract "Ereignis" (a generic event, as in PANGAEA).

The problem with Google interpreting the "subjectOf" relation is that the dataset is then linked to an artistic event. Google extracts multiple events from each dataset and wants to publish them separately from the dataset as "artistic events", so in the end it works like "user searches for a movie name and Google presents events related to it". Google extracts all events from a given page (in our case a dataset) because cinema homepages usually list the events for a specific hall, so for datasets it likewise expects multiple events as separate entities.

Since PANGAEA wants to prevent its events from being shown as artistic events in Google search, we have to stop adding events to our schema.org markup; it is the wrong entity type.
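For illustration, the problematic pattern looks roughly like this (a sketch, not PANGAEA's exact markup): a Dataset whose subjectOf points at a schema:Event, which Google then treats as a standalone staged event and flags for missing startDate, location, and so on.

{
 "@context": "https://schema.org/",
 "@type": "Dataset",
 "name": "Example PANGAEA-style dataset",
 "subjectOf": {
  "@type": "Event",
  "name": "Sampling campaign (a generic occurrence, not a staged event)"
 }
}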

P.S.: I am in contact with Natasha Noy regarding this.

@TimmFitschen

TimmFitschen commented Sep 20, 2023

@jmckenna Hi, I am preparing the sitemap and JSON-LD resources. The documentation says that the crawlers expect a script tag inside an HTML document.

<script type="application/ld+json">JSON_LD content</script>

Is it possible to point the crawler at a JSON-LD file directly? I know how to do this in the sitemap; the question is rather whether the crawler will accept that as well, or whether we need the "detour" via the HTML document.
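To make it concrete (hypothetical URLs), the sitemap entry could point either straight at the JSON-LD file or at an HTML landing page that embeds it:

<url><loc>https://dataportal.example.org/oih/dataset_123.json</loc></url>
<url><loc>https://dataportal.example.org/oih/dataset_123.html</loc></url>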

@jmckenna
Contributor Author

@TimmFitschen if you're asking just about Google and other search engines, they expect the JSON-LD to be inline only (see the related StackExchange thread). But I believe ODIS itself will accept it (@fils can you confirm?).

@uschindler

Hi,
Yes, the source must be inside the script tag and therefore in the HTML. Technically it would be correct to add an href attribute linking to a URL; this would be better for mobile browsers, as the transfer size gets smaller, but according to the documentation it is not allowed.

I am in contact with Google; maybe there will be a change. An easy way to find out is simply to test it: after setting it up with an href link, you can run the Google structured data analyzer.

P.S.: PANGAEA also delivers the schema.org JSON-LD when you do content negotiation on the landing page using the Accept header (see signposting.org). The F-UJI FAIR checker also uses content negotiation, if available.
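For example, such a signposting-style request against a landing page might look like this (a sketch; the exact Accept media type honoured may differ):

GET /10.1594/PANGAEA.948712 HTTP/1.1
Host: doi.pangaea.de
Accept: application/ld+json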

@jmckenna
Contributor Author

@uschindler interesting; I'm curious about Google's updated view on this, keep us posted.

@fspreck-indiscale

Hi,

we now have the sitemap with all public datasets online. Is it possible to do a crawler test run before we enter it into the ODIS Catalogue?

@fspreck-indiscale

Hi @jmckenna, we updated our JSONs according to what we discussed last time. Can you run your tests again against the sitemap and check whether

  • @id looks fine
  • publisher is now the correct property and is located correctly within the JSON
  • spatialCoverage now works as an array of Places, each specified by a GeoCoordinates object, instead of the boxes we had before (see the sketch after this list)
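As a sketch of that form (coordinates invented for illustration):

 "spatialCoverage": [
  {
   "@type": "Place",
   "geo": {
    "@type": "GeoCoordinates",
    "latitude": -5.45,
    "longitude": 39.18
   }
  },
  {
   "@type": "Place",
   "geo": {
    "@type": "GeoCoordinates",
    "latitude": -5.47,
    "longitude": 39.20
   }
  }
 ]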

Thank you!

@jmckenna
Contributor Author

updates since today's meeting:

  • we now handle the type:GeoCoordinates as points in the ODIS front-end spatial search (see screen capture below of the ZMT spatial records)

[screen capture: zmt-geocoordinates — ZMT records shown as points in the ODIS spatial search]

@uschindler

Hi,

This PANGAEA one has no spatial coverage. That's not an issue in your portal.

@jmckenna
Contributor Author

jmckenna commented Nov 27, 2023

@uschindler today in the meeting I had mentioned that some records in the ZMT sitemap do not have spatialCoverage and the reaction from the ZMT team I believe was "all records should have spatialCoverage", so, I am not understanding your response.

@uschindler

uschindler commented Nov 27, 2023

> @uschindler today in the meeting I had mentioned that some records in the ZMT sitemap do not have spatialCoverage and the reaction from the ZMT team I believe was "all records should have spatialCoverage", so, I am not understanding your response.

The problem is that the link posted is about data harvested from PANGAEA: https://dataportal.leibniz-zmt.de/Entity/19378; this entry refers to this PANGAEA dataset: https://doi.org/10.1594/PANGAEA.890177

This one has no spatial coverage and never will; that is correct. If you harvest PANGAEA, you have to live with the fact that datasets may not have a coverage. I won't explain here why no coverage is available, but in short: it is not mandatory, and for this dataset there is no way to provide one. It simply has none.

@fspreck-indiscale

@jmckenna @uschindler, sorry, that was too bold a claim, then. And it will become even less true in the future, unfortunately, once we include more non-PANGAEA datasets in the portal -- they will most probably have no geo information at all.

@fspreck-indiscale

@jmckenna How does your frontend treat entries like https://dataportal.leibniz-zmt.de/Entity/18288 (see view-source:https://dataportal.leibniz-zmt.de/oih/dataset_18288.html for the JSON) where we have an array of places in the spatial coverage? You should see far more points than datasets if you show the full array on your map (~900 locations vs ~150 datasets).

@jmckenna
Contributor Author

@fspreck-indiscale good point: we don't handle a list of GeoCoordinates yet (we only use the first point), but we should. Thanks for reporting this.

@fspreck-indiscale

@jmckenna Hi, I just updated the JSONs again; they now have sdPublisher and creditText.
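Roughly like this (a sketch; the organization name matches the portal, but the citation text is a placeholder):

 "sdPublisher": {
  "@type": "Organization",
  "name": "Leibniz Centre for Tropical Marine Research (ZMT)"
 },
 "creditText": "Example et al. (2023): Example dataset. ZMT data portal."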

@jmckenna
Contributor Author

jmckenna commented Dec 1, 2023

thanks @fspreck-indiscale, will do another harvest here...

@fspreck-indiscale

@jmckenna We added keywords (simple array of strings for now) to some of the datasets; do they look good after harvesting?

The schema.org validator passes.

@jmckenna
Contributor Author

updates from meeting on 2024-01-15:

@jmckenna
Contributor Author

jmckenna commented Apr 5, 2024

@fspreck-indiscale thanks for updating the keywords syntax. I notice that some have odd characters inside the JSON-LD, such as this record:

 "keywords": [
  "coral climatology",
  "oxygen isotope",
  "trace elements ratio",
  "\u03b418Oseawater"
 ],

@fspreck-indiscale

Hi @jmckenna, good point, we had not considered these characters so far. Escaping non-ASCII is the exporter's safe default, but it is by no means necessary (on the landing page it is UTF-8 "δ"). May we use UTF-8 strings in the JSON-LD?
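Both spellings parse to the same string: JSON allows raw UTF-8 as long as the document is encoded as UTF-8, so the following are equivalent:

 "keywords": ["\u03b418Oseawater"]
 "keywords": ["δ18Oseawater"]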

@jmckenna
Contributor Author

jmckenna commented Apr 5, 2024

Hi @fspreck-indiscale, in fact on the ODIS search front-end it appears as follows, so I think it is OK to use these Unicode characters. (Does that keyword look OK to you in this screen capture?)

[screen capture: unicode — keyword "δ18Oseawater" rendered on the ODIS search front-end]

@jmckenna
Contributor Author

jmckenna commented Apr 5, 2024

@fspreck-indiscale the ZMT records (201) are now on the production server ( https://oceaninfohub.org/ ).

There is an issue on our side, however: the "Provider" facet lists 2 different providers for your records, "Leibniz Centre for Tropical Marine Research, Bremen, Germany" and "Leibniz Center for Tropical Marine Research (ZMT)" (the second one comes from the name in the ODIS config). It seems the provider name in the JSON-LD and the prov:wasAttributedTo name are both being used here for some reason (again, this is a problem on our front-end/indexing side).

Here is the harvested JSON-LD example: https://api.search.oceaninfohub.org/source?id=https%3A%2F%2Fdataportal.leibniz-zmt.de%2Foih%2Fdataset_19754.html

[screen capture: zmt1 — Provider facet showing both provider names]

@pbuttigieg @fils can you see the source of the problem here?

@jmckenna
Contributor Author

jmckenna commented Apr 5, 2024

More info: the records harvested inside Solr (search index) contain only one provider:

"txt_provider":["Leibniz Centre for Tropical Marine Research, Bremen, Germany"]

This is puzzling.

@jmckenna
Contributor Author

jmckenna commented Apr 5, 2024

Ah, it could be that no other partner sets "provider" to themselves. CIOOS uses "provider" to point to the regional partner who 'provides' the catalogue (such as CIOOS-Atlantic or CIOOS-Pacific).

Example JSON-LD for CIOOS record: https://api.search.oceaninfohub.org/source?id=https%3A%2F%2Fcatalogue.cioos.ca%2Fdataset%2F777530f0-adaf-4ddb-86bb-6f1269dcb259.jsonld

I'd need @pbuttigieg @fils to clarify what the correct use of "provider" should be.

Maybe we should set up another ZMT-ODIS technical meeting in the next two weeks to examine this together.

@pbuttigieg
Collaborator

From the node's perspective, the provider should be the entity that provided the node with the data that the JSON-LD record is about.

They have sdPublisher for identifying the entity that created the JSON-LD.
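Under that reading, a ZMT record describing data originally harvested from PANGAEA might carry both properties, along these lines (a sketch only, not a prescribed template):

 "provider": {
  "@type": "Organization",
  "name": "PANGAEA"
 },
 "sdPublisher": {
  "@type": "Organization",
  "name": "Leibniz Centre for Tropical Marine Research (ZMT)"
 }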
