parsers: create an NLM parser #209

szymonlopaciuk · 2018-01-16T13:36:04Z

Description

This implements a parser for the NLM format. It takes a very similar approach to the existing JATS parser, using LiteratureBuilder to build HEP records.

Related Issue

This is a step towards refreshing the IOP spider (#205)

Motivation and Context

IOP uses the NLM format to publish their citation records. Currently the IOP spider uses web scraping, but we will move to harvesting over OAI-PMH and use this parser instead.

Checklist:

I have all the information that I need (if not, move to RFC and look for it).
I linked the related issue(s) in the corresponding commit logs.
I wrote good commit log messages.
My code follows the code style of this project.
I've added any new docs if API/utils methods were added.
I have updated the existing documentation accordingly.
I have added tests to cover my changes.
All new and existing tests passed.

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

michamos · 2018-01-16T13:40:30Z

hepcrawl/parsers/nlm.py

+
+        Args:
+            nlm_records (Union[string, scrapy.selector.Selector]): records
+            source (Optional[string]): source passed to `__init__`


please document return value

michamos · 2018-01-16T13:45:48Z

hepcrawl/parsers/nlm.py

+            day = node.xpath('./Day/text()').extract_first()
+            month = node.xpath('./Month/text()').extract_first()
+            year = node.xpath('./Year/text()').extract_first()
+            return PartialDate(


It's better to use PartialDate.from_parts, which handles empty values and non-numeric months just fine:

In [1]: from inspire_utils.date import PartialDate

In [2]: PartialDate.from_parts(2017, 'Jan')
Out[2]: PartialDate(year=2017, month=1, day=None)
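To illustrate the tolerant behaviour being asked for, here is a stdlib-only sketch. `partial_date_from_parts` and `parse_month` are hypothetical stand-ins written for this example, not the `inspire_utils` implementation, and the `<PubDate>` fragment is an assumed sample.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the tolerant behaviour of PartialDate.from_parts;
# this is NOT the inspire_utils implementation.
MONTHS = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec']


def parse_month(value):
    """Return a month number from '3', '03', 'Mar' or 'March', else None."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return int(value)
    key = value[:3].lower()
    return MONTHS.index(key) + 1 if key in MONTHS else None


def partial_date_from_parts(year, month=None, day=None):
    """Build a (year, month, day) tuple, leaving missing parts as None."""
    return (
        int(year),
        parse_month(month),
        int(day) if day else None,
    )


# Assumed sample of an NLM-style publication date without a <Day>.
node = ET.fromstring('<PubDate><Year>2017</Year><Month>Jan</Month></PubDate>')
date = partial_date_from_parts(
    node.findtext('./Year'),
    node.findtext('./Month'),
    node.findtext('./Day'),  # missing element, so this is None
)
print(date)
```

The point of the sketch is that the caller passes the raw extracted strings straight through and does not need its own checks for a missing day or a month written as a name.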

michamos · 2018-01-16T13:54:01Z

hepcrawl/parsers/nlm.py

+        pub_type = self.root.xpath('./PublicationType/text()').extract_first()
+
+        if 'Conference' in pub_type or pub_type == 'Congresses':
+            return 'proceedings'


I think this is conference paper rather than proceedings, but would need to look at some examples.

I got an example IOP update from @david-caro with a few records, but unfortunately none of them actually have the <PublicationType> set. Maybe when we get access, there will be more records, or maybe IOP don't use the field at all... Meanwhile I found a few in this at NLM, so I think that means you are right?

Looks like it. But I would not be surprised if IOP put its own values there anyway, that have nothing to do with those in the spec.
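A minimal sketch of where this thread seems to land, assuming the mapping goes to `conference paper` and that the field can be absent, as in the sample IOP records. The function name `document_type` and the `'article'` default are assumptions made for illustration, not the code in this PR.

```python
def document_type(pub_type):
    """Map a <PublicationType> value to a HEP document type (sketch)."""
    # The field may be missing entirely, so guard against None before
    # doing the substring check.
    if pub_type and ('Conference' in pub_type or pub_type == 'Congresses'):
        return 'conference paper'
    # Assumed default for everything else, including a missing field.
    return 'article'


print(document_type('Clinical Conference'))
print(document_type(None))
```

Without the `None` guard, `'Conference' in pub_type` raises a `TypeError` for records that have no `<PublicationType>`.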

michamos · 2018-01-16T13:55:50Z

hepcrawl/parsers/nlm.py

+        authors = self.root.xpath('./AuthorList/Author')
+        authors_in_collaborations = self.root.xpath(
+            './GroupList/Group'
+            '[GroupName/text()=../../AuthorList/Author/CollectiveName/text()]'


what's the purpose of this?

<CollectiveName> inside the <Author> acts as a sort of pointer to the <Group> of the same name, where the actual people of the group are listed, like here: https://www.ncbi.nlm.nih.gov/books/NBK3828/#publisherhelp.Can_Collaborator_Names_be

So this gets all the people from groups referenced in <Authors>. Though maybe it is too strict: now that I think about it, I don't think there is a use case for an "unreferenced" group?
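The selection that the XPath predicate performs can be sketched with the standard library as follows. The sample XML is an assumption loosely modelled on the layout described in this thread, and `referenced_groups` is a hypothetical helper, not part of the parser.

```python
import xml.etree.ElementTree as ET

# Assumed sample, loosely modelled on the NLM layout described above.
SAMPLE = """
<Article>
  <AuthorList>
    <Author><FirstName>Ada</FirstName><LastName>Smith</LastName></Author>
    <Author><CollectiveName>Example Collaboration</CollectiveName></Author>
  </AuthorList>
  <GroupList>
    <Group>
      <GroupName>Example Collaboration</GroupName>
      <IndividualName><FirstName>Bo</FirstName><LastName>Chen</LastName></IndividualName>
    </Group>
    <Group>
      <GroupName>Unreferenced Group</GroupName>
      <IndividualName><FirstName>Cy</FirstName><LastName>Diaz</LastName></IndividualName>
    </Group>
  </GroupList>
</Article>
"""


def referenced_groups(root):
    """Return the <Group> elements whose name some <Author> points at."""
    collective_names = {
        name.text
        for name in root.findall('./AuthorList/Author/CollectiveName')
    }
    return [
        group
        for group in root.findall('./GroupList/Group')
        if group.findtext('GroupName') in collective_names
    ]


root = ET.fromstring(SAMPLE)
for group in referenced_groups(root):
    members = [
        person.findtext('LastName')
        for person in group.findall('IndividualName')
    ]
    print(group.findtext('GroupName'), members)
```

In this sample the second group is skipped because no `<Author>` carries its name, which is exactly the "unreferenced" case the comment wonders about.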

michamos · 2018-01-16T13:59:28Z

hepcrawl/parsers/nlm.py

+        return self.root.xpath('./Journal/Volume/text()').extract_first()
+
+    @property
+    def material(self):


PublicationType may also contain Published Erratum, which maps to erratum. Don't know how this relates to the NLM field you are reading here. Maybe you should link to https://www.ncbi.nlm.nih.gov/books/NBK3828/#publisherhelp.Object_O, here or close to the NLM_OBJECT_TYPE_TO_HEP_MAP definition.

michamos · 2018-01-16T14:00:27Z

hepcrawl/parsers/nlm.py

+
+NLM_OBJECT_TYPE_TO_HEP_MAP = {
+    'Erratum': 'erratum',
+    'Reprint': 'reprint',


'Republished': 'reprint' also
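Taken together with the `Published Erratum` comment above, the lookup could be sketched like this. Only the three map entries come from the diff and the review; the function signature and the `'publication'` default are assumptions for illustration, and the real map may hold more entries.

```python
# Entries taken from the diff and the review comment above; the real map
# may contain more.
NLM_OBJECT_TYPE_TO_HEP_MAP = {
    'Erratum': 'erratum',
    'Reprint': 'reprint',
    'Republished': 'reprint',
}


def material(object_types, pub_type=None):
    """Pick a material from <Object> types, falling back to PublicationType."""
    for object_type in object_types:
        if object_type in NLM_OBJECT_TYPE_TO_HEP_MAP:
            return NLM_OBJECT_TYPE_TO_HEP_MAP[object_type]
    # Fallback: only consulted when no <Object> type matched.
    if pub_type == 'Published Erratum':
        return 'erratum'
    # Assumed default when nothing matches.
    return 'publication'


print(material(['Republished']))
print(material([], pub_type='Published Erratum'))
```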


szymonlopaciuk added 2 commits January 16, 2018 14:26

parsers: create an NLM parser

eec9f7a

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

tests: fix invalid DOI in IOP tests

724ce2d

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

szymonlopaciuk added the Status: Ready for review label Jan 16, 2018

szymonlopaciuk requested review from david-caro and michamos January 16, 2018 13:36

ghost assigned szymonlopaciuk Jan 16, 2018

ghost added the in progress label Jan 16, 2018

michamos requested changes Jan 16, 2018


szymonlopaciuk added 5 commits January 16, 2018 15:58

parsers.nlm: add mapping from 'Republished'

7b925d0

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

parsers.nlm: document return value on bulk_parse

6b96a6c

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

parsers.nlm: add case to material extraction

ad95e0b

Check in PublicationType for `Published Erratum` too, if `<Object>` check didn't return any matches. Add references to NLM docs.

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

parsers.nlm: use PartialDate.from_parts

d1e4cbe

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

parsers: add NLMParser to __init__ of parsers

2b87bfe

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

szymonlopaciuk mentioned this pull request Jan 17, 2018

IOP Spider: improve and add tests #206

Open

8 tasks

szymonlopaciuk added 3 commits January 17, 2018 10:45

parsers.nlm: fix bug when no PublicationType

8cdf47a

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

parsers.nlm: allow lack of publication date

d745f78

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

parsers.nlm: handle maths in abstract

3106759

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

david-caro removed in progress labels Feb 13, 2018
