As a researcher, I want more dataset metadata in schema.org exports so that my data is more discoverable #4371

Closed
jggautier opened this issue Dec 8, 2017 · 25 comments

Comments

@jggautier
Contributor

jggautier commented Dec 8, 2017

In issue #2243, some metadata fields important for dataset discovery were excluded from mapping to Schema.org. We said we'd include them in a later issue. This is that issue, and these are those fields (dun dun):

  • Creator types (person or organization) for dataset authors (a separate ticket has been opened for this, Improving Dataverse's Schema.org JSON-LD schema to enable author names display in Google Dataset Search's #5029, since the missing creator types are making Google's Dataset Search engine not display creator names and there are UI implications)
  • Dataset identifier (DOI or HDL as a URL) using the @id property
  • Dataset identifier (DOI or HDL as a URL) using the url property
  • Name of funding source (that's all schema.org supports for now; how to include other funding source details, like grant numbers, is discussed in this GitHub issue in schema.org's repo)
  • Author identifiers
  • Geographic coverage
  • Multiple dataset descriptions
  • File metadata (see schema.org from Zenodo and from ICPSR for an example of how these fields are used)
    • File PID
    • File download URL (when there is one - excludes restricted files and files in datasets with guestbooks)
    • File name
    • File description
    • File format
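
As a rough illustration of the fields listed above, here is a minimal Python sketch that assembles a schema.org Dataset as JSON-LD. The helper name `build_dataset_jsonld`, its parameters, and every value (DOI, names, ORCID, file details) are hypothetical placeholders, and the property choices (`creator` typed as Person or Organization, `funder`, `spatialCoverage`, `distribution`) are one plausible mapping, not necessarily the one Dataverse adopts. The file download URL is left out of this sketch.

```python
import json

def build_dataset_jsonld(pid_url, title, descriptions, creators, funder_names, places, files):
    """Assemble a schema.org Dataset as a plain dict (hypothetical helper)."""
    doc = {
        "@context": "http://schema.org",
        "@type": "Dataset",
        "@id": pid_url,   # persistent identifier as a URL
        "url": pid_url,   # same URL repeated under the url property
        "name": title,
        "description": list(descriptions),  # multiple descriptions kept as a list
        "creator": [],
        "funder": [{"@type": "Organization", "name": n} for n in funder_names],
        "spatialCoverage": list(places),
        "distribution": [],
    }
    for c in creators:
        entry = {"@type": c["type"], "name": c["name"]}  # "Person" or "Organization"
        if c.get("identifier"):
            entry["identifier"] = c["identifier"]
        doc["creator"].append(entry)
    for f in files:
        doc["distribution"].append({
            "@type": "DataDownload",
            "name": f["name"],
            "description": f["description"],
            "fileFormat": f["format"],
            "identifier": f["pid_url"],
        })
    return doc

# All values below are made-up placeholders for illustration only.
example = build_dataset_jsonld(
    pid_url="https://doi.org/10.5072/FK2/EXAMPLE",
    title="Example dataset",
    descriptions=["First description", "Second description"],
    creators=[
        {"type": "Person", "name": "Doe, Jane",
         "identifier": "https://orcid.org/0000-0000-0000-0000"},
        {"type": "Organization", "name": "Example Research Group"},
    ],
    funder_names=["Example Funding Agency"],
    places=["Cambridge, Massachusetts, United States"],
    files=[{"name": "file1.txt", "description": "File description 1",
            "format": "text/plain",
            "pid_url": "https://doi.org/10.5072/FK2/EXAMPLE/FILE1"}],
)
print(json.dumps(example, indent=2))
```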

We'll also need to fix:

Which fields are added to the Schema.org metadata template (draft) and how they're mapped will probably be adjusted after community discussion (within the Dataverse community and hopefully with a proposed RDA group focused on ways to make data more discoverable by search engines).

@scolapasta asked me to add to the definition of done that we should make sure that the methods used to pull metadata values from different fields into different exports (DDI, DC, DataCite, Schema.org, native JSON (?)) are consistent.

@jggautier
Contributor Author

jggautier commented Jul 3, 2018

After a discussion in today's Community Call about sending DataCite file-level metadata that includes the file checksum, @mercecrosas added in this Google Groups comment a table recommending ways to map more dataset metadata to schema.org.

The table includes more metadata fields that have been added.

An open question, which might deserve its own GitHub issue, is whether Dataverse should produce schema.org metadata at the file level.

@jggautier changed the title from "As a researcher, I want more metadata in schema.org exports so that my data is more discoverable" to "As a researcher, I want more dataset metadata in schema.org exports so that my data is more discoverable" Jul 17, 2018
@djbrooke
Contributor

djbrooke commented Oct 3, 2018

Let's exclude File Download URL for now. It can follow on in a separate issue.

@pameyer
Contributor

pameyer commented Oct 10, 2018

Thanks to @jggautier I was able to track down some tools for validating schema.org JSON-LD. https://github.com/jessedc/ajv-cli can be used from the command line to validate the JSON against a schema, after using https://github.com/scrapinghub/extruct to retrieve the JSON-LD from the generated HTML/XHTML.
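
The two-step workflow described above (pull the JSON-LD out of the page, then check it) can be sketched with the Python standard library alone. This is a stand-in for illustration, not a replacement for extruct or ajv-cli: the class `JsonLdExtractor`, the function `missing_properties`, the sample HTML, and the list of "required" properties are all assumptions made here, and the check is far weaker than validating against a real JSON Schema.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False

def missing_properties(doc, required=("@context", "@type", "name")):
    """Return the assumed-required properties that are absent from doc."""
    return [key for key in required if key not in doc]

# Hypothetical page content standing in for a generated dataset page.
SAMPLE_HTML = """<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Dataset", "name": "Example dataset"}
</script>
</head><body></body></html>"""

extractor = JsonLdExtractor()
extractor.feed(SAMPLE_HTML)
for document in extractor.documents:
    print(document["@type"], missing_properties(document))
```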

pdurbin added a commit that referenced this issue Oct 11, 2018
"I think Google Dataset Search is ignoring author and prefers creator"
pdurbin added a commit that referenced this issue Oct 11, 2018
from 41.16% to 42.94% for DatasetVersion
pdurbin added a commit that referenced this issue Oct 11, 2018
Use the installation brand name instead.
pdurbin added a commit that referenced this issue Oct 11, 2018
pdurbin added a commit that referenced this issue Oct 11, 2018
pdurbin added a commit that referenced this issue Oct 11, 2018
pdurbin added a commit that referenced this issue Oct 11, 2018
@kcondon kcondon self-assigned this Nov 5, 2018
pdurbin added a commit that referenced this issue Nov 5, 2018
@kcondon
Contributor

kcondon commented Nov 5, 2018

So, aside from internal code restructuring, this PR:
-adds new fields to schema.org
(create/export dataset, verify against list Julian provides)
-changes the structure of some fields in schema.org (multiple, object type)
(same as above, add multiple where appropriate, paste into Google validation tool)
-adds optional hide files JVM option to block download URLs in export
(verify on/off behavior and public/restricted file behavior)
-publisher and provider will be the instance name (root dv)
(verify against export)

@kcondon
Contributor

kcondon commented Nov 6, 2018

Issues/questions:

  1. files/distribution section contains additional, unspecified info: file name, pid.
  2. author id is missing if the value is entered in a nonconforming format, but there is no indication of what the conforming format is.

Discussed above with Julian and he will complete review. Will discuss with Julian and Phil to see what needs to be addressed.

@jggautier
Contributor Author

Another issue:

  1. The URLs for the related publications show up wrapped in html:
"citation": [
    {
      "@type": "CreativeWork",
      "text": "Related pub citation 1",
      "@id": "<a href=\"https://doi.org/10.7910/DVN/P7EVGF\" target=\"_blank\">https://doi.org/10.7910/DVN/P7EVGF</a>",
      "identifier": "<a href=\"https://doi.org/10.7910/DVN/P7EVGF\" target=\"_blank\">https://doi.org/10.7910/DVN/P7EVGF</a>"

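One way to picture the cleanup needed for the values above is a small Python sketch that reduces an anchor-wrapped value to its bare URL. The function name `bare_url` and the regular expression are assumptions for illustration only; this is not necessarily how the export code handles it, and a fix at the source (never adding the markup to the export value) would be preferable to stripping it afterwards.

```python
import re

# Matches a value that consists entirely of one <a href="...">...</a> element.
_ANCHOR = re.compile(
    r'^\s*<a\s[^>]*?href="([^"]+)"[^>]*>.*?</a>\s*$',
    re.IGNORECASE | re.DOTALL,
)

def bare_url(value):
    """Return the href target if value is a single anchor tag, else value unchanged."""
    match = _ANCHOR.match(value)
    return match.group(1) if match else value

wrapped = ('<a href="https://doi.org/10.7910/DVN/P7EVGF" target="_blank">'
           'https://doi.org/10.7910/DVN/P7EVGF</a>')
print(bare_url(wrapped))
print(bare_url("https://doi.org/10.7910/DVN/P7EVGF"))
```
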
@jggautier
Contributor Author

  1. @id is missing from the files/distribution section @kcondon mentioned in his comment (@pdurbin and I agreed to keep the extra file info.) We're always using @id whenever identifier is used. I discussed with @pdurbin and he'll update.
"distribution": [
    {
      "@type": "DataDownload",
      "name": "file1.txt",
      "fileFormat": "text/plain",
      "contentSize": 26,
      "description": "File description 1",
      "identifier": "https://hdl.handle.net/20.500.12050/FK2/TWFVRE/222222",
      "@id": "https://hdl.handle.net/20.500.12050/FK2/TWFVRE/222222",
      "contentUrl": "https://demo.dataverse.org/api/access/datafile/:persistentId?persistentId=doi:10.5072/FK2/CFWNSH/ZEHFD0"

(contentUrl should appear only when the installation indicates that they want download URLs appearing in their schema.org exports.)
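
The two rules in this comment (mirror identifier into @id, and emit contentUrl only when the installation opts in) can be sketched in Python as follows. The function `file_to_data_download` and the flag `include_content_url` are hypothetical names chosen here; the sample values are copied from the snippet above.

```python
def file_to_data_download(name, file_format, size_bytes, description,
                          pid_url, download_url, include_content_url):
    """Build one distribution entry as a dict (hypothetical helper)."""
    entry = {
        "@type": "DataDownload",
        "name": name,
        "fileFormat": file_format,
        "contentSize": size_bytes,
        "description": description,
    }
    if pid_url:
        # Wherever identifier is used, @id carries the same value.
        entry["identifier"] = pid_url
        entry["@id"] = pid_url
    if include_content_url and download_url:
        # Only installations that want download URLs in the export get this.
        entry["contentUrl"] = download_url
    return entry

sample_args = dict(
    name="file1.txt",
    file_format="text/plain",
    size_bytes=26,
    description="File description 1",
    pid_url="https://hdl.handle.net/20.500.12050/FK2/TWFVRE/222222",
    download_url=("https://demo.dataverse.org/api/access/datafile/"
                  ":persistentId?persistentId=doi:10.5072/FK2/CFWNSH/ZEHFD0"),
)
with_url = file_to_data_download(include_content_url=True, **sample_args)
without_url = file_to_data_download(include_content_url=False, **sample_args)
print(sorted(with_url))
print(sorted(without_url))
```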

pdurbin added a commit that referenced this issue Nov 6, 2018
@pdurbin
Member

pdurbin commented Nov 6, 2018

Yep, I got rid of the "href" stuff in fcae94e and added @id at the file level in 0e0b55d.

@jggautier
Contributor Author

jggautier commented Nov 7, 2018

I looked at the schema.org export and all four issues are resolved!

@kcondon noticed that contentUrl isn't showing up in the schema.org export of a test dataset, although we expect it to. (It's the dataset titled "Test Schema Org Julian 5 Schema" on the "internal" test instance.)

@kcondon kcondon closed this as completed Nov 7, 2018
@kcondon kcondon removed the Status: QA label Nov 7, 2018
@pdurbin
Member

pdurbin commented Nov 7, 2018

For the record, as discussed with @kcondon and @jggautier, the FileUtil.isPubliclyDownloadable logic is used, and contentUrl wasn't being shown because the dataset had terms of use. It also checks for guestbooks. Both of these require a popup to agree to or fill out in the UI.
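
The conditions named in this thread can be paraphrased in Python as below. This is not the actual Java `FileUtil.isPubliclyDownloadable` method; the `FileContext` class and its field names are assumptions, and the real method may check more than the three conditions mentioned here (restricted files, terms of use, guestbooks).

```python
from dataclasses import dataclass

@dataclass
class FileContext:
    """Hypothetical summary of what the check needs to know about a file."""
    restricted: bool = False
    dataset_has_terms_of_use: bool = False
    dataset_has_guestbook: bool = False

def is_publicly_downloadable(ctx):
    """True only if nothing forces a popup or blocks access before download."""
    return not (
        ctx.restricted
        or ctx.dataset_has_terms_of_use
        or ctx.dataset_has_guestbook
    )

open_file = FileContext()
terms_file = FileContext(dataset_has_terms_of_use=True)
print(is_publicly_downloadable(open_file))
print(is_publicly_downloadable(terms_file))
```

Under this reading, the test dataset with terms of use falls into the second case, which matches contentUrl being absent from its export.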
