parsers: create an NLM parser #209

Open
wants to merge 10 commits into
base: master

Contributor

@szymonlopaciuk szymonlopaciuk commented Jan 16, 2018

Description

This implements a parser for the NLM format. It takes an approach very similar to the JATS parser we already have, using LiteratureBuilder to build HEP records.

Related Issue

This is a step towards refreshing the IOP spider (#205)

Motivation and Context

IOP publishes its citation records in the NLM format. The IOP spider currently relies on web scraping, but we will move to OAI-PMH and this parser instead.

Checklist:

  • I have all the information that I need (if not, move to RFC and look for it).
  • I linked the related issue(s) in the corresponding commit logs.
  • I wrote good commit log messages.
  • My code follows the code style of this project.
  • I've added any new docs if API/utils methods were added.
  • I have updated the existing documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

Args:
nlm_records (Union[string, scrapy.selector.Selector]): records
source (Optional[string]): source passed to `__init__`
Contributor

Please document the return value.

day = node.xpath('./Day/text()').extract_first()
month = node.xpath('./Month/text()').extract_first()
year = node.xpath('./Year/text()').extract_first()
return PartialDate(
Contributor

It's better to use PartialDate.from_parts, which handles empty values and non-numeric months just fine:

In [1]: from inspire_utils.date import PartialDate

In [2]: PartialDate.from_parts(2017, 'Jan')
Out[2]: PartialDate(year=2017, month=1, day=None)
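
To illustrate what "handles empty values and non-numeric months" saves the parser from doing, here is a minimal standard-library sketch of the normalization that would otherwise have to be written by hand. It is not the `inspire_utils` implementation: `date_parts`, `_to_month`, `_to_int` and the `<PubDate>` wrapper element are hypothetical names chosen for the example, and `xml.etree.ElementTree` stands in for the scrapy selector.

```python
import xml.etree.ElementTree as ET

# Three-letter English month abbreviations mapped to month numbers.
MONTHS = {
    name: number
    for number, name in enumerate(
        ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
         'jul', 'aug', 'sep', 'oct', 'nov', 'dec'],
        start=1,
    )
}


def _to_int(value):
    """Return int(value), or None when the value is missing or not numeric."""
    value = (value or '').strip()
    return int(value) if value.isdigit() else None


def _to_month(value):
    """Return a month number for '3', '03', 'Mar' or 'March', else None."""
    value = (value or '').strip()
    if value.isdigit():
        return int(value)
    return MONTHS.get(value[:3].lower())


def date_parts(node):
    """Extract (year, month, day) from a node with Year/Month/Day children.

    Missing or unparseable children come back as None instead of raising.
    """
    return (
        _to_int(node.findtext('Year')),
        _to_month(node.findtext('Month')),
        _to_int(node.findtext('Day')),
    )


# <PubDate> is an assumed wrapper element; only Year/Month/Day come from the diff.
node = ET.fromstring('<PubDate><Year>2017</Year><Month>Jan</Month></PubDate>')
print(date_parts(node))  # (2017, 1, None)
```

Delegating to `PartialDate.from_parts` keeps all of this out of the parser.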

pub_type = self.root.xpath('./PublicationType/text()').extract_first()

if 'Conference' in pub_type or pub_type == 'Congresses':
    return 'proceedings'
Contributor

I think this should be conference paper rather than proceedings, but I would need to look at some examples.

Contributor Author

I got an example IOP update from @david-caro with a few records, but unfortunately none of them actually has the <PublicationType> set. Maybe there will be more records once we get access, or maybe IOP doesn't use the field at all... Meanwhile, I found a few in this at NLM, so I think that means you are right?

Contributor

Looks like it. But I would not be surprised if IOP put its own values there anyway, ones that have nothing to do with those in the spec.
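
One case the snippet under review does not cover is the one described in this thread, a record with no `<PublicationType>` at all: `extract_first()` then returns `None`, and `'Conference' in None` raises a `TypeError`. Below is a minimal sketch of a guarded version, assuming the conference values map to `conference paper` as suggested in the review. The function name `document_type` and the `'article'` fallback are assumptions for illustration, not part of the PR.

```python
def document_type(pub_type):
    """Map a <PublicationType> value to a HEP document type.

    ``pub_type`` is whatever ``extract_first()`` returned, so it may be None.
    """
    if not pub_type:
        # Field absent or empty, as in the sample IOP records.
        return 'article'  # assumed default, not confirmed by the PR
    if 'Conference' in pub_type or pub_type == 'Congresses':
        return 'conference paper'
    return 'article'


print(document_type(None))          # article
print(document_type('Congresses'))  # conference paper
```

If IOP turns out to use its own vocabulary in this field, only the membership test would need to change.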

authors = self.root.xpath('./AuthorList/Author')
authors_in_collaborations = self.root.xpath(
'./GroupList/Group'
'[GroupName/text()=../../AuthorList/Author/CollectiveName/text()]'
Contributor

what's the purpose of this?

Contributor Author

<CollectiveName> inside an <Author> acts as a sort of pointer to the <Group> of the same name, where the actual members of the group are listed, like here: https://www.ncbi.nlm.nih.gov/books/NBK3828/#publisherhelp.Can_Collaborator_Names_be. So this gets all the people from groups referenced by an <Author>. Though maybe it is too strict: now that I think about it, I don't think there is a use case for an "unreferenced" group?
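
The lookup described here can be sketched with the standard library as follows. This is an illustration under assumptions, not the parser's code: `xml.etree.ElementTree` replaces the scrapy selector, the sample record and the `<Article>` root are invented, and the `<IndividualName>`, `<FirstName>` and `<LastName>` children of `<Group>` are assumed from the NLM publisher help rather than taken from the diff. `referenced_group_members` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Invented sample record: one referenced group and one unreferenced group.
SAMPLE = """
<Article>
  <AuthorList>
    <Author><FirstName>Jane</FirstName><LastName>Doe</LastName></Author>
    <Author><CollectiveName>Example Collaboration</CollectiveName></Author>
  </AuthorList>
  <GroupList>
    <Group>
      <GroupName>Example Collaboration</GroupName>
      <IndividualName><FirstName>John</FirstName><LastName>Roe</LastName></IndividualName>
    </Group>
    <Group>
      <GroupName>Unreferenced Group</GroupName>
      <IndividualName><FirstName>Max</FirstName><LastName>Poe</LastName></IndividualName>
    </Group>
  </GroupList>
</Article>
"""


def referenced_group_members(article):
    """Return (group, first name, last name) for members of referenced groups."""
    # Names that some <Author> points at through <CollectiveName>.
    referenced = {
        (name.text or '').strip()
        for name in article.findall('./AuthorList/Author/CollectiveName')
    }
    members = []
    for group in article.findall('./GroupList/Group'):
        group_name = (group.findtext('GroupName') or '').strip()
        if group_name not in referenced:
            continue  # the "strict" part: unreferenced groups are skipped
        for person in group.findall('IndividualName'):
            members.append((
                group_name,
                person.findtext('FirstName'),
                person.findtext('LastName'),
            ))
    return members


article = ET.fromstring(SAMPLE.strip())
print(referenced_group_members(article))
# [('Example Collaboration', 'John', 'Roe')]
```

Dropping the `if group_name not in referenced` check gives the looser behaviour, where every `<Group>` contributes its members.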

return self.root.xpath('./Journal/Volume/text()').extract_first()

@property
def material(self):
Contributor

PublicationType may also contain Published Erratum, which maps to erratum. I don't know how this relates to the NLM field you are reading here. Maybe you should link to https://www.ncbi.nlm.nih.gov/books/NBK3828/#publisherhelp.Object_O, either here or close to the NLM_OBJECT_TYPE_TO_HEP_MAP definition.


NLM_OBJECT_TYPE_TO_HEP_MAP = {
    'Erratum': 'erratum',
    'Reprint': 'reprint',
Contributor

'Republished': 'reprint' also
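
A minimal sketch of how the two review suggestions on `material` could fit together: look the `<Object>` types up in the map first, and fall back to `PublicationType` only when nothing matched. The map below contains only the entries quoted in this thread plus the suggested `'Republished'`, and the `material` signature, which takes already-extracted strings, is an assumption made to keep the example independent of the XML layout.

```python
# Only the entries visible in this thread, plus the suggested 'Republished'.
NLM_OBJECT_TYPE_TO_HEP_MAP = {
    'Erratum': 'erratum',
    'Reprint': 'reprint',
    'Republished': 'reprint',
}


def material(object_types, publication_type=None):
    """Return the HEP material for a record, or None if nothing matches.

    object_types: the Type values of the record's <Object> elements.
    publication_type: the text of <PublicationType>, possibly None.
    """
    # The first <Object> type known to the map wins.
    for object_type in object_types:
        hep_material = NLM_OBJECT_TYPE_TO_HEP_MAP.get(object_type)
        if hep_material:
            return hep_material
    # Fallback: PublicationType may flag an erratum on its own.
    if publication_type == 'Published Erratum':
        return 'erratum'
    return None


print(material(['Republished']))            # reprint
print(material([], 'Published Erratum'))    # erratum
```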

Check in PublicationType for `Published Erratum` too,
if `<Object>` check didn't return any matches.

Add references to NLM docs.

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>
3 participants