Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IOP Spider: improve and add tests #206

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

szymonlopaciuk
Copy link
Contributor

@szymonlopaciuk szymonlopaciuk commented Jan 12, 2018

Depends on #209.

Description

This adds test records from IOP and fixes issues with the IOP spider.

Related Issue

Fixes #205.

Checklist:

  • I have all the information that I need (if not, move to RFC and look for it).
  • I linked the related issue(s) in the corresponding commit logs.
  • I wrote good commit log messages.
  • My code follows the code style of this project.
  • I've added any new docs if API/utils methods were added.
  • I have updated the existing documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Copy link
Contributor

@michamos michamos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some small comments. More generally, I think you should try to refactor this, looking at what's still needed and what can be done better with the more powerful builder we have now (e.g. dates could use PartialDate) and make sure that the XPath selectors are not too brittle.

@@ -146,10 +146,10 @@ def get_page_numbers(node):

fpage = node.xpath(".//FirstPage/text()").extract_first()
lpage = node.xpath(".//LastPage/text()").extract_first()
if fpage and lpage:
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is already done by the literature builder, so not needed here.

@@ -243,7 +243,7 @@ def _filter_affiliation(affiliations):
for author in crawler_record.get('authors', []):
builder.add_author(builder.make_author(
full_name=author['full_name'],
affiliations=_filter_affiliation(author['affiliations']),
affiliations=_filter_affiliation(author.get('affiliations', [])),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the _filter_affiliation thing is not needed, as the builder is already cleaning up empty values. Besides, this should be raw_affiliations instead of affiliations, see #185.

@michamos
Copy link
Contributor

@david-caro do we get feeds for IOP in the end? IIRC, @fschwenn was saying the other day that he's doing webscraping currently.

@david-caro
Copy link
Contributor

I'm getting updates through email from the stacks service, I've contacted them to see if we can use oai, but no reply so far.

@david-caro
Copy link
Contributor

Well, I just got a reply saying that there's an OAI service available :), now I asked for access.

@david-caro
Copy link
Contributor

@szymonlopaciuk so I guess that it's safe to start working on the oai version of this spider ;)

def test_field(field_name, expected, parser):
# if field_name == 'authors':
# import pdb
# pdb.set_trace()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove comment

# import pdb
# pdb.set_trace()

result = getattr(parser, field_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert field_name in expected



def test_print_publication_date(expected, parser):
assert expected['print_publication_date'] == parser.print_publication_date.dumps()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert 'print_publication_date' in expected

@@ -0,0 +1,334 @@
# -*- coding: utf-8 -*-
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This to it's own PR

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>
Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>
Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>
Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>
Check in PublicationType for `Published Erratum` too,
if `<Object>` check didn't return any matches.

Add references to NLM docs.

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>
Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>
Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>
Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>
Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>
Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>
@szymonlopaciuk szymonlopaciuk force-pushed the refresh_iop_spider branch 2 times, most recently from 90987ae to 75cc606 Compare January 17, 2018 15:47
This adds test records from IOP and fixes some simple
issues with IOP spider, to make the tests pass.

Introduces a functional tests of the IOP spider.

Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>
Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>
Signed-off-by: Szymon Łopaciuk <szymon.lopaciuk@cern.ch>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Spider: Refresh the IOP spider
3 participants