Mapping metadata to Datacite #101

Closed
awead opened this issue Dec 3, 2019 · 5 comments

@awead
Contributor

awead commented Dec 3, 2019

Which fields does DataCite require, and which additional fields do we want to require for the DataCite record?

WDLL

  • A JSON object that represents the object's metadata, to be sent to DataCite when creating or updating a record
  • Creating metadata objects with incomplete fields should be prohibited

DanCoughlin mentioned this issue Dec 3, 2019

@awead
Contributor Author

awead commented Dec 6, 2019

@srerickson will provide a mapping between ScholarSphere resource types and DataCite resource types.

DanCoughlin added this to the December 20 milestone Dec 9, 2019

@rschenk
Collaborator

rschenk commented Dec 16, 2019

Some questions for @srerickson as I'm coding up an initial sketch of this, and I will probably update this comment as I encounter more issues:

DataCite's publicationYear

  1. publicationYear is a required attribute in the DataCite metadata, but our WorkVersion#published_date field is not required for publication in our data model. We should probably change this.
  2. publicationYear in DataCite is a single-value field, but WorkVersion#published_date is a multiple-value field. What are the rules for reducing those multiple values to a single value?
  3. WorkVersion#published_date is a string field in our data model, but DataCite's publicationYear wants an integer. This means we will need to parse our string values into date objects. This is a good opportunity to discuss what date formats we will accept in our forms, and we should write a ticket for validating and parsing that format on the data-entry side.
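
To make points 2 and 3 concrete, here is a minimal sketch in Python (for illustration only; it is not the application's code). The helper name `first_publication_year` is hypothetical, and two rules are assumptions that this thread has not settled: the first value that yields a year wins, and a value only counts if it starts with a four-digit year.

```python
import re
from typing import Optional, Sequence

# Assumption: a usable value starts with a four-digit year that is not
# followed by another digit, e.g. "2019", "2019-12" or "2019-12-16".
LEADING_YEAR = re.compile(r"\s*([0-9]{4})(?![0-9])")


def first_publication_year(published_dates: Sequence[str]) -> Optional[int]:
    """Reduce a multi-valued string field to a single integer year.

    Returns the year of the first value that starts with a four-digit
    year, or None when no value qualifies.
    """
    for value in published_dates:
        match = LEADING_YEAR.match(value)
        if match:
            return int(match.group(1))
    return None
```

Returning `None` rather than guessing leaves the caller to decide what to do when no year can be extracted.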

DataCite's creator

  1. The DataCite creator type has several sub-fields that we could populate. So far I am only submitting the user's given and family names, although our data model also includes their email. Note that our User data model is populated directly from the LDAP record (via the OAuth proxy) and has no validations beyond the user's email address. That is, our app would be fine with the user's given and family names being empty, but DataCite would error. Questions:
    1. How likely is it that a user's given or family name is empty in LDAP? (@awead @whereismyjetpack)
    2. Should we tighten the validations on our user model to require the above?
    3. What portions of user data should we be sending over to DataCite?
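
One way to guard against empty names before anything is sent is sketched below in Python (illustration only, not the application's code). The function name `datacite_creator` and the choice to raise an error are assumptions. The `givenName` and `familyName` keys are the two sub-fields mentioned above.

```python
from typing import Dict, Optional


def datacite_creator(given_name: Optional[str], surname: Optional[str]) -> Dict[str, str]:
    """Build a creator entry, refusing blank or missing names.

    Hypothetical helper: raises ValueError naming the missing fields so the
    caller can decide whether to block the deposit or ask for a correction.
    """
    given = (given_name or "").strip()
    family = (surname or "").strip()
    missing = [
        label
        for label, value in (("given_name", given), ("surname", family))
        if not value
    ]
    if missing:
        raise ValueError("cannot build DataCite creator; missing: " + ", ".join(missing))
    return {"givenName": given, "familyName": family}
```

Whether this check belongs on the User model or only at the point of building the DataCite payload is the open question 2 above.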

@srerickson
Contributor

Resource Type Mappings

The mapping of ScholarSphere Resource Types to DataCite Resource Types is here.

DataCite's publicationYear

Responding to @rschenk:

  1. I agree that published_date should be required (not just because it's required by DataCite but also because it's required by Google Scholar). The problem with making it required is that a lot of existing ScholarSphere Works would become invalid. Maybe that's OK, it's just something to note. (One of our metadata clean-up projects is to add publication years for all Works).

  2. I'm not sure why published_date (along with other WorkVersion fields) is a multivalue field. I expect this is for compatibility with some other part of the stack (@awead?). I don't think the UI has or needs to support multiple publication dates, so just choosing the first one seems fine.

  3. I would prefer to enforce ISO 8601 format (YYYY-MM-DD, where MM and DD are optional) for the published_date field. There are several problems with this approach, but I think we can live with them: (1) it will invalidate current works (but curators can fix them as needed); (2) I don't think the value currently used for DataCite's publicationYear field is the publication date; I think it's based on the creation date. So there's some inconsistency there as well, but hopefully we (curators) can clean that up later.
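
As a rough illustration of the format in point 3, the following Python sketch accepts YYYY, YYYY-MM, or YYYY-MM-DD and rejects everything else (illustration only, not the application's code). The function name `parse_partial_iso_date` is hypothetical. Checking that the month and day form a real calendar date is an added assumption beyond the pattern itself.

```python
import re
from datetime import date
from typing import Optional, Tuple

# YYYY, optionally followed by -MM, optionally followed by -DD.
PARTIAL_ISO_DATE = re.compile(r"([0-9]{4})(?:-([0-9]{2})(?:-([0-9]{2}))?)?")


def parse_partial_iso_date(value: str) -> Optional[Tuple[int, Optional[int], Optional[int]]]:
    """Return (year, month, day) with None for omitted parts, or None if invalid."""
    match = PARTIAL_ISO_DATE.fullmatch(value.strip())
    if match is None:
        return None
    year = int(match.group(1))
    month = int(match.group(2)) if match.group(2) is not None else None
    day = int(match.group(3)) if match.group(3) is not None else None
    try:
        # Fill omitted parts with 1 so that date() can reject impossible
        # values such as month 13 or February 30.
        date(year, month if month is not None else 1, day if day is not None else 1)
    except ValueError:
        return None
    return (year, month, day)
```

The same check could run on the data-entry side, so that only values in this format reach the database.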

@whereismyjetpack
Contributor

@rschenk It's unlikely that a person's given_name or surname is empty, but it is possible, and our directory allows it. We have hundreds of records in LDAP where given_name is missing. Most of them look like non-persons, but since we (Penn State) allow it, we (ScholarSphere) should look for it.

@rschenk
Collaborator

rschenk commented Dec 17, 2019

@srerickson I talked to @awead today, and to get the ball rolling, we've decided for now to only send the bare minimum required metadata to DataCite, which is:

{
  "titles": [
    {
      "title": "The title of the WorkVersion"
    }
  ],
  "creators": [
    {
      "givenName": "the given_name of the depositor",
      "familyName": "the surname of the depositor"
    }
  ],
  "publicationYear": 2019,
  "types": {
    "resourceTypeGeneral": "Dataset"
  }
}

The "resourceTypeGeneral" value is mapped from your Google doc. The way "publicationYear" works currently is: if we can parse the WorkVersion#published_date, we return the year of that parsed string. If we cannot parse it, we fall back on the year of the WorkVersion's database created_at timestamp. This ensures that (probably) every WorkVersion in the database will produce a valid set of required metadata for DataCite.
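
The fallback described above can be sketched roughly as follows, in Python (illustration only, not the application's code). The function names are hypothetical, and "can parse" is assumed here to mean "starts with a four-digit year"; the real parsing rules may differ.

```python
import re
from datetime import datetime
from typing import Any, Dict, Sequence

# Assumption: "parseable" means the value starts with a four-digit year.
LEADING_YEAR = re.compile(r"\s*([0-9]{4})(?![0-9])")


def publication_year(published_dates: Sequence[str], created_at: datetime) -> int:
    """Use the first parseable published_date, else the created_at year."""
    for value in published_dates:
        match = LEADING_YEAR.match(value)
        if match:
            return int(match.group(1))
    return created_at.year


def minimal_datacite_metadata(
    title: str,
    given_name: str,
    surname: str,
    published_dates: Sequence[str],
    created_at: datetime,
    resource_type_general: str,
) -> Dict[str, Any]:
    """Assemble the bare-minimum metadata in the shape shown above."""
    return {
        "titles": [{"title": title}],
        "creators": [{"givenName": given_name, "familyName": surname}],
        "publicationYear": publication_year(published_dates, created_at),
        "types": {"resourceTypeGeneral": resource_type_general},
    }
```

The resulting dictionary can be serialized with `json.dumps`. Because `created_at` is always present, `publicationYear` is always an integer, even when `published_dates` is empty or unparseable.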

If you'd like to see more metadata in the DataCite record, we can add that in the future. It would be wonderful if you could specify the mapping; DataCite's Swagger documentation seems to be the most accurate and least painful way to do this.
