Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

keep the extracted affiliations if none found from consolidation. #563

Merged
merged 14 commits into from
Aug 12, 2020

Conversation

Aazhar
Copy link
Collaborator

@Aazhar Aazhar commented Mar 25, 2020

No description provided.

@Aazhar Aazhar requested a review from kermitt2 March 25, 2020 14:49
@coveralls
Copy link

coveralls commented Mar 25, 2020

Coverage Status

Coverage increased (+0.3%) to 37.314% when pulling f1c3cf9 on keep_affiliations_after_consolidation into c7b4b20 on master.

Copy link
Collaborator

@lfoppiano lfoppiano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be possible to rename bibo and bib with more appropriate names?
Might be a lot of work but this class deserves to have a unit test perhaps :-)

See my other comment below ;-)

// we have a match (initial)
if ( StringUtils.isNotBlank(aut2.getFirstName()) && StringUtils.isNotBlank(aut.getFirstName())
&&( aut.getFirstName().equals(aut2.getFirstName()) ||
((aut.getFirstName().length() == 1) && (aut.getFirstName().equals(aut2.getFirstName().substring(0,1)))))) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about the initial is with a dot? so it has a length of 2?
What about the initial is present in the aut2 instead of aut? shouldnt you check that condition too?

Copy link
Collaborator Author

@Aazhar Aazhar Apr 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes actually we should compare the initials for the firstnames either for aut2 or aut , , so the check would be either they are equal or they have the same initial letter in their firstname , WDYT ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the check (whch was like this before, so I might be wrong) first check if aut has length 1, and if so it compares aut with the first character of aut2

what if aut does not have length 1, but aut2 is just the initial?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes I didn't change this check actually, it was like this..
basically if it is not the same firstname or aut firstname has more than one character then it 's not considered the same author form

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what I suggest is for firstnames to compare only the first char (initials)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

I noticed another thing, at line 4205, there is bib.setFullAuthors(bibo.getFullAuthors()); so the authors are copied anyway from the second biblio... or should I better be in bed?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a couple of tests, there is an ignored one that is failing and I'm wondering what is the rule here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes the data about authors coming from consolidation is considered more accurate than the parsed one,
in this PR I would like to keep the affiliations if none fetched from the consolidation which seems to be more frequent

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regarding the test, so far the affiliations from crossref are just raw and not deserialised, so maybe we could keep the affiliations parsed from the pdf ? I'm not sure how good are the affiliations from crossref actually

@Aazhar Aazhar requested a review from lfoppiano May 2, 2020 14:41
@kermitt2
Copy link
Owner

So as explained on Mattermost, I updated this part with the new Person object deduplication, and this part becomes simply now:

       if (bibo.getFullAuthors() != null) {
            if (CollectionUtils.isEmpty(bib.getFullAuthors()))
                bib.setFullAuthors(bibo.getFullAuthors());
            else {
                // we have the complete list of authors so we can take them from the second
                // biblio item and merge some possible extra from the first when a match is 
                // reliable
                List<Person> thePersons = bib.getFullAuthors();
                thePersons.addAll(bibo.getFullAuthors());
                Person.deduplicate(thePersons);
            }
        }

and the deduplication handles the merging of affiliation.

@kermitt2
Copy link
Owner

With this approach, it works fine when authors are well recognized initially, but when the authors are badly extracted from the PDF, we will have as authors at the same time the erroneous extraction and the correct consolidated ones. So maybe this could be refined to prioritize the consolidated authors?

@kermitt2
Copy link
Owner

ok too many authors added with this approach, I'll try to find something safer...

@kermitt2 kermitt2 merged commit 9bb1380 into master Aug 12, 2020
@lfoppiano lfoppiano deleted the keep_affiliations_after_consolidation branch June 19, 2023 20:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants