Allow for multiple person identifiers of the same type. #48

Closed
wants to merge 3 commits into from

@wilkos-dans
Contributor

commented Nov 12, 2018

In theory, a person can have only one identifier per identifier type. In practice, however, this is not always true. This PR allows multiple (person) identifiers of the same type per person.

@jdvorak001

Contributor

commented Nov 12, 2018

(The CI failure is not connected to the proposed modification.)

@jdvorak001

Contributor

commented Nov 12, 2018

I agree it sometimes happens. On the metadata provider side it should probably trigger some data quality improvement processes; I understand these may take a long time and may not always resolve the issue.

But for the metadata consumer, we need to make it clear that such metadata is discouraged.
I could imagine a warning from Schematron validation.
What approach do you think would be appropriate, @joschirr , @abollini , @lremy ?

@rvanheest

commented Nov 14, 2018

In light of #49 that just got merged, should this PR also add maxOccurs="unbounded" to the new <DAI> element?
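
For readers unfamiliar with the attribute, here is a minimal sketch of what a repeatable <DAI> element could look like in XML Schema. The surrounding Person declaration and the xs:string type are simplified assumptions for illustration, not the project's actual schema.

```xml
<!-- Sketch only: a simplified, hypothetical declaration, not the schema from this repository. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Person">
    <xs:complexType>
      <xs:sequence>
        <!-- minOccurs="0" makes the identifier optional;
             maxOccurs="unbounded" lets it repeat any number of times. -->
        <xs:element name="DAI" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```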

@jdvorak001

Contributor

commented Nov 14, 2018

Yes, it should. If you can, please update your branch with current master: it contains important fixes.

@jdvorak001

Contributor

commented Nov 14, 2018

Thanks! We'll talk about this PR at the CERIF TG meeting this Friday.

@jdvorak001

Contributor

commented Nov 23, 2018

We discussed the issue and came to the conclusion this is an abnormal situation.
If multiple identifiers of the same type are thought to be assigned to a single person, then it's one of the following cases:

  1. One of them is the preferred, currently used one and the rest are "past", "deprecated", "alternative", "other" identifiers that should have been merged into the leading one (or a reconciliation process has at least been initiated).
  2. There is no information available to distinguish among them. But then none of them can be presented as THE value of the identifier: there is simply an uncertainty in that piece of data. A suggestion/request for a reconciliation should be communicated to the sources of the different values.

In neither case should this be supported within the "normal" syntax, we think.

But perhaps we are missing something that is essential or special about your context. Could you please tell us more about how this situation arises in NARCIS? How frequent is it? Do you do anything about it in NARCIS itself?

@wilkos-dans

Contributor Author

commented Nov 26, 2018

Well, university A may provide a DAI for a certain person, while university B does the same.
Same story with ORCID, but less frequent.
NARCIS holds a few persons with multiple name-identifiers.
NARCIS supports multiple person-identifiers of the same type for completeness.
NARCIS no longer actively supplies feedback to the researchers or universities in case of multiple identifiers.

If OpenAIRE expects at most one identifier per type, perhaps NARCIS (or any provider) should skip supplying an identifier in such cases, since it is hard to tell which identifier is preferred, as you already described.

@rvanheest

commented Nov 26, 2018

Just to be clear, @wilkos-dans, are you suggesting to not provide any identifier at all of a certain kind if the system provides multiple identifiers of that kind? That would be unfortunate for the lost data. Still I think that is not a bad idea. It would avoid having discussions with people who claim the wrong identifier is provided. Giving no identifiers at all seems to be the best option in such a case...

Any thoughts @jdvorak001?

@jdvorak001

Contributor

commented Nov 27, 2018

Yes, not providing the identifier at all if you have got diverging values from different sources is an option, and not a bad one. You perhaps can allow some kind of "voting" threshold, so if you've got value X from institutions A, B, C, D, E, F and value Y from just one other institution, you could as well say value X is the value of the identifier for this person.

The solution we were suggesting is to introduce additional elements such as or . These would have maxOccurs="unbounded" and you could use these for the identifier values in the case they diverge.

Probably needless to say, but anyway: the occurrence of different values of an identifier of course calls your grouping of the person records into question, so these cases should probably go through some kind of manual check before being released to the public.

@rvanheest

commented Nov 27, 2018

First of all, NARCIS is not a registry agency and we no longer actively supply feedback to the researchers/institutes in case they provide diverging identifiers. Also, 'voting' is not an option, as we do not have the data to do so.

The <or> syntax sounds a bit weird to me. I expect you have something like this in mind:

<or>
    <ISNI>...</ISNI>
    <ISNI>...</ISNI>
</or>

I would say that this proposal is exactly the same as what we propose in this PR. It is semantically the same, but our solution is shorter/easier and probably requires fewer code changes for implementers of this schema. Besides, what will OpenAIRE do with this <or> syntax? Will you guys use all of the identifiers inside the <or>, or use just one?

@jdvorak001

Contributor

commented Nov 27, 2018

Ok, thanks for the additional info. My concept of the responsibilities of an information aggregator includes the requirement to reconcile any conflicts in the information being aggregated, but I can understand this brings in a need for extensive communication which may not always be viable from the point of view of available resources.

@jdvorak001

Contributor

commented Nov 27, 2018

Apologies for not making my XML elements visible, I will improve on that.

  1. If you have a single valid ISNI for a person, you use the normal <ISNI> element.
  2. If you have two or more different valid values of ISNI for a person, and:
    2.1. If one of those values can be flagged as "the correct one", or "the currently used one", place that value in the <ISNI> element and put the other value(s) into <AlternativeISNI> element(s).
    2.2. If none of the values can be flagged as preferred, put all of them into <AlternativeISNI> elements.

So <ISNI> means "this is the value of ISNI of the person".
And <AlternativeISNI> means "this is a possible value of ISNI for the person; we know there is some issue so we are not so sure about it".
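
The three cases above might look as follows in an instance document. This is an illustrative sketch only: the <Persons> wrapper, the id attributes, the missing namespace and the placeholder values (value-A, value-B) are assumptions and are not taken from the schema.

```xml
<!-- Illustrative only: wrapper element, ids and values are hypothetical placeholders. -->
<Persons>
  <!-- Case 1: a single valid ISNI -->
  <Person id="person-1">
    <ISNI>value-A</ISNI>
  </Person>

  <!-- Case 2.1: several values, one of which is flagged as preferred -->
  <Person id="person-2">
    <ISNI>value-A</ISNI>
    <AlternativeISNI>value-B</AlternativeISNI>
  </Person>

  <!-- Case 2.2: several values, none of which can be flagged as preferred -->
  <Person id="person-3">
    <AlternativeISNI>value-A</AlternativeISNI>
    <AlternativeISNI>value-B</AlternativeISNI>
  </Person>
</Persons>
```

A consumer that needs a single authoritative value would then read only <ISNI> and treat any <AlternativeISNI> as uncertain.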

@rvanheest

commented Nov 28, 2018

The proposal with <Alternative> sounds great. It's probably more expressive than the change @wilkos-dans and I are proposing in this PR. I assume you plan to do the same for the other identifier types (ORCID, ISNI, DAI, ResearcherID and ScopusAuthorID)?
We're looking forward to this new feature/PR soon, so that we can incorporate it in our codebase in the near future.

@jdvorak001

Contributor

commented Nov 28, 2018

Yes, there will be <AlternativeORCID>, <AlternativeResearcherID>, <AlternativeScopusAuthorID>, <AlternativeISNI> and <AlternativeDAI>, all of them having the respective regexp validations.
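
As a rough sketch of how one such pair could be declared so that both elements share the same validation: the type name ISNIValue, the simplified pattern (15 digits followed by a digit or X, without separators) and the surrounding Person declaration are assumptions for illustration, not the declarations that were committed.

```xml
<!-- Sketch only: names and pattern are hypothetical, not the committed schema. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- One shared type, so the regular and the alternative element are validated identically. -->
  <xs:simpleType name="ISNIValue">
    <xs:restriction base="xs:string">
      <!-- Assumed simplified form: 16 characters, the last may be the check character X. -->
      <xs:pattern value="\d{15}[\dX]"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:element name="Person">
    <xs:complexType>
      <xs:sequence>
        <!-- At most one "normal" value. -->
        <xs:element name="ISNI" type="ISNIValue" minOccurs="0"/>
        <!-- Any number of uncertain or non-preferred values. -->
        <xs:element name="AlternativeISNI" type="ISNIValue" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```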

@rvanheest

commented Nov 28, 2018

Awesome! Looking forward to them soon.

@jdvorak001 closed this in 72f35c7 on Dec 1, 2018

@jdvorak001

Contributor

commented Dec 1, 2018

Now on master. To be released in a couple of days in v.1.1.1.

@rvanheest

commented Dec 1, 2018

Thanks a lot! We'll try this new feature once it is released.

@rvanheest referenced this pull request on Dec 3, 2018