IOP Spider: improve and add tests #206

szymonlopaciuk · 2018-01-12T13:40:20Z

Depends on #209.

Description

This adds test records from IOP and fixes issues with the IOP spider.

Related Issue

Fixes #205.

Checklist:

I have all the information that I need (if not, move to RFC and look for it).
I linked the related issue(s) in the corresponding commit logs.
I wrote good commit log messages.
My code follows the code style of this project.
I've added any new docs if API/utils methods were added.
I have updated the existing documentation accordingly.
I have added tests to cover my changes.
All new and existing tests passed.

michamos

Some small comments. More generally, I think you should try to refactor this: look at what's still needed and what can be done better with the more powerful builder we have now (e.g. dates could use PartialDate), and make sure that the XPath selectors are not too brittle.
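To illustrate the brittleness concern, here is a minimal sketch that uses the standard library's `xml.etree.ElementTree` as a stand-in for the Scrapy selectors the spider actually uses. The sample XML and the `get_page_numbers` helper are hypothetical and only mirror the element names visible in this PR's diff.

```python
import xml.etree.ElementTree as ET

# Hypothetical NLM-like fragment; LastPage is deliberately missing.
SAMPLE = """
<Article>
  <Journal><Volume>12</Volume></Journal>
  <FirstPage>101</FirstPage>
</Article>
""".strip()


def get_page_numbers(node):
    """Return (first_page, last_page); a missing element yields None."""
    # A relative descendant search keeps working if wrapper elements are
    # added around the target, unlike a path spelled out from the root.
    fpage = node.findtext(".//FirstPage")
    lpage = node.findtext(".//LastPage")
    return fpage, lpage


root = ET.fromstring(SAMPLE)
pages = get_page_numbers(root)
```

Returning `None` for absent elements leaves the decision about incomplete page ranges to the caller instead of raising inside the extractor.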

michamos · 2018-01-12T14:06:32Z

hepcrawl/extractors/nlm.py

@@ -146,10 +146,10 @@ def get_page_numbers(node):

        fpage = node.xpath(".//FirstPage/text()").extract_first()
        lpage = node.xpath(".//LastPage/text()").extract_first()
-        if fpage and lpage:
+        try:


This is already done by the literature builder, so it's not needed here.

michamos · 2018-01-12T14:10:28Z

hepcrawl/tohep.py

@@ -243,7 +243,7 @@ def _filter_affiliation(affiliations):
    for author in crawler_record.get('authors', []):
        builder.add_author(builder.make_author(
            full_name=author['full_name'],
-            affiliations=_filter_affiliation(author['affiliations']),
+            affiliations=_filter_affiliation(author.get('affiliations', [])),


The _filter_affiliation call is not needed, as the builder already cleans up empty values. Besides, this should be raw_affiliations instead of affiliations, see #185.
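For context, the `.get('affiliations', [])` change in the diff only guards against authors that have no `affiliations` key at all. A minimal sketch of that behaviour follows, assuming a hypothetical crawler record; the `author_affiliations` helper is illustrative and not part of hepcrawl.

```python
def author_affiliations(author):
    """Return the author's affiliations, or an empty list if the key is absent."""
    return author.get('affiliations', [])


# Hypothetical crawler record: the second author has no 'affiliations' key.
crawler_record = {
    'authors': [
        {'full_name': 'Doe, Jane', 'affiliations': [{'value': 'CERN'}]},
        {'full_name': 'Roe, Richard'},
    ],
}

affiliations = [
    author_affiliations(author)
    for author in crawler_record.get('authors', [])
]
```

Indexing with `author['affiliations']` would raise `KeyError` for the second author, whereas the `.get` form yields an empty list.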

michamos · 2018-01-12T14:19:38Z

@david-caro do we get feeds for IOP in the end? IIRC, @fschwenn was saying the other day that he's doing webscraping currently.

david-caro · 2018-01-12T14:49:08Z

I'm getting updates through email from the stacks service. I've contacted them to see if we can use OAI, but no reply so far.

david-caro · 2018-01-12T14:51:21Z

Well, I just got a reply saying that there's an OAI service available :), now I asked for access.

david-caro · 2018-01-12T18:25:13Z

@szymonlopaciuk so I guess that it's safe to start working on the oai version of this spider ;)

david-caro · 2018-01-16T12:47:53Z

tests/unit/test_parsers_nlm.py

+def test_field(field_name, expected, parser):
+    # if field_name == 'authors':
+    #     import pdb
+    #     pdb.set_trace()


Remove the commented-out debugging code.

david-caro · 2018-01-16T12:48:26Z

tests/unit/test_parsers_nlm.py

+    #     import pdb
+    #     pdb.set_trace()
+
+    result = getattr(parser, field_name)


assert field_name in expected

david-caro · 2018-01-16T12:48:41Z

tests/unit/test_parsers_nlm.py

+
+
+def test_print_publication_date(expected, parser):
+    assert expected['print_publication_date'] == parser.print_publication_date.dumps()


assert 'print_publication_date' in expected
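The two guard suggestions above could look roughly like this. It is a sketch only: `FakeParser` and `check_field` are hypothetical stand-ins for the real parser fixture and test function.

```python
class FakeParser(object):
    """Hypothetical stand-in for the NLM parser fixture."""
    title = 'A sample title'


def check_field(field_name, expected, parser):
    # Fail with a clear message when the expected data lacks the field,
    # instead of hitting a KeyError in the comparison below.
    assert field_name in expected, 'no expected value for %s' % field_name
    result = getattr(parser, field_name)
    assert result == expected[field_name]
    return result


checked = check_field('title', {'title': 'A sample title'}, FakeParser())
```

With the guard in place, a field missing from the expected data fails on the first assertion, which points at the test data instead of the parser.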

david-caro · 2018-01-16T12:48:59Z

hepcrawl/parsers/nlm.py

@@ -0,0 +1,334 @@
+# -*- coding: utf-8 -*-


This should go into its own PR.


szymonlopaciuk added the Status: WIP label Jan 12, 2018

ghost assigned szymonlopaciuk Jan 12, 2018

ghost added the in progress label Jan 12, 2018

szymonlopaciuk force-pushed the refresh_iop_spider branch from 2173f2c to 8978e83 Compare January 12, 2018 13:56

michamos requested changes Jan 12, 2018

szymonlopaciuk force-pushed the refresh_iop_spider branch from 8978e83 to 3101fbf Compare January 16, 2018 12:07

david-caro reviewed Jan 16, 2018

szymonlopaciuk added 2 commits January 16, 2018 14:26

parsers: create an NLM parser

eec9f7a

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

tests: fix invalid DOI in IOP tests

724ce2d

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

szymonlopaciuk force-pushed the refresh_iop_spider branch from 3101fbf to 471cf26 Compare January 16, 2018 13:37

szymonlopaciuk added 8 commits January 16, 2018 15:58

parsers.nlm: add mapping from 'Republished'

7b925d0

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

parsers.nlm: document return value on bulk_parse

6b96a6c

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

parsers.nlm: add case to material extraction

ad95e0b

Check in PublicationType for `Published Erratum` too, if `<Object>` check didn't return any matches. Add references to NLM docs. Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

parsers.nlm: use PartialDate.from_parts

d1e4cbe

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

parsers: add NLMParser to __init__ of parsers

2b87bfe

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

parsers.nlm: fix bug when no PublicationType

8cdf47a

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

parsers.nlm: allow lack of publication date

d745f78

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

parsers.nlm: handle maths in abstract

3106759

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

szymonlopaciuk force-pushed the refresh_iop_spider branch 2 times, most recently from 90987ae to 75cc606 Compare January 17, 2018 15:47

szymonlopaciuk added 3 commits January 30, 2018 17:18

IOP Spider: improve and add tests

bc8d5d4

This adds test records from IOP and fixes some simple issues with the IOP spider to make the tests pass. Introduces functional tests of the IOP spider. Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

tohep: rm _filter_affiliation, handled in builder

9573ab5

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

IOP Spider: make use of NLM parser

c2757ab

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>

szymonlopaciuk force-pushed the refresh_iop_spider branch from 75cc606 to c2757ab Compare January 30, 2018 16:26

david-caro removed in progress labels Feb 13, 2018
