Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import archive.org bulk marc items #1058

Merged
merged 36 commits into from
Sep 15, 2018
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
36 commits
Select commit Hold shift + click to select a range
4050d5c
fixes #1029 import subjects from ia metadata
hornc Aug 24, 2018
0236ecd
raise error in identifier not provided
hornc Aug 24, 2018
5cfb8f0
add docstrings
hornc Aug 27, 2018
17f45e5
raise 400 Bad Requests on error instead of 200s
hornc Aug 27, 2018
68e36fb
document add_book methods used by the import API
hornc Aug 27, 2018
f28a84a
add import from archive.org bulk_marc items
hornc Aug 27, 2018
3e417e1
comment and remove deprecated imports
hornc Aug 29, 2018
729a321
found and commented a bug in deprecated functions
hornc Aug 29, 2018
5872309
remove utils.ia find_item lookup
hornc Aug 29, 2018
0f34ee1
make it clear which fns belong to fast_parse
hornc Aug 29, 2018
c8bb135
don't keep looping if we got a response
hornc Aug 29, 2018
add6383
handle missing coverstore servers in dev
hornc Aug 29, 2018
b6f5174
return length and offset of next record when reading bulk MARC
hornc Aug 29, 2018
5e83d32
link to all initial source records in item history
hornc Aug 29, 2018
cb25df8
refactor and remove unused deprecated code
hornc Aug 29, 2018
0dd5770
remove non-functioning tests
hornc Aug 29, 2018
6b85403
fix True assert tests in add_book
hornc Aug 30, 2018
3bb8884
MarcBinary to raise exceptions itself rather than have get_ia return …
hornc Aug 30, 2018
23b8ec5
parametrize get_marc_record_from_ia tests for better reporting
hornc Aug 30, 2018
3e4e271
rename ils only import tests
hornc Sep 5, 2018
f968ba5
address docstring issues
hornc Sep 6, 2018
972f793
move MARC tests to own dir
hornc Sep 6, 2018
647cb54
move fast_parse tests to test dir
hornc Sep 6, 2018
cb88426
correct rtype
hornc Sep 6, 2018
3035437
re-add missing test data
hornc Sep 6, 2018
0c75981
raise BadMarc exception if unexpected/empty data
hornc Sep 6, 2018
0383d86
remove conditionals from tests
hornc Sep 6, 2018
82d7e5f
confirm behaviour of ia MARC import add_book
hornc Sep 6, 2018
26326ca
docstrings for load_book used in import api path
hornc Sep 6, 2018
1bf18cf
docstrings and tests for merge_marc code used by import
hornc Sep 7, 2018
0292910
DRY isbn gathering code
hornc Sep 7, 2018
1792e9b
refactor add_book matching
hornc Sep 7, 2018
51baf2e
address review comments
hornc Sep 7, 2018
c81e186
make ia urls constants
hornc Sep 12, 2018
54e2cc6
load ia base url from config
hornc Sep 12, 2018
09b6817
address review comments, refactor
hornc Sep 13, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
1 change: 1 addition & 0 deletions conf/openlibrary.yml
Original file line number Diff line number Diff line change
Expand Up @@ -205,3 +205,4 @@ internal_tests_api_key: '8oPd1tx747YH374ohs48ZO5s2Nt1r9yD'
ia_availability_api_url: 'https://archive.org/services/loans/beta/loan/index.php' # to be deprecated in favor of _v1 below
ia_availability_api_v1_url: 'https://archive.org/services/loans/beta/loan/index.php'
ia_availability_api_v2_url: 'https://archive.org/services/availability/'
ia_base_url: 'https://archive.org'
4 changes: 3 additions & 1 deletion openlibrary/catalog/get_ia.py
Original file line number Diff line number Diff line change
@@ -1,14 +1,16 @@
from openlibrary.catalog.marc.marc_binary import MarcBinary
from openlibrary.catalog.marc.marc_xml import MarcXml
from openlibrary.catalog.marc import parse
from infogami import config, load_config
from lxml import etree
import xml.parsers.expat
import urllib2, os.path, socket
from time import sleep
import traceback
from openlibrary.core import ia

IA_BASE_URL = 'https://archive.org'
load_config('conf/openlibrary.yml')
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mekarpeles I'm not completely happy about this line, openlibrary/plugins/importapi/code.py already has config available via config.get(), but openlibrary/catalog/get_ia.py does not 🤷‍♂️ I don't understand the purpose of https://github.com/internetarchive/openlibrary/blob/master/openlibrary/config.py either, it looks like a confusingly named / name collision version of infogami.config . This seems to work, but I'm not sure I fully understand how config is meant to be shared, or passed around parts of OL.

IA_BASE_URL = config.get('ia_base_url')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

IA_DOWNLOAD_URL = '%s/download/' % IA_BASE_URL
hornc marked this conversation as resolved.
Show resolved Hide resolved
MAX_MARC_LENGTH = 100000

Expand Down
2 changes: 1 addition & 1 deletion openlibrary/plugins/importapi/code.py
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@
from lxml import etree
import logging

IA_BASE_URL = 'https://archive.org'
IA_BASE_URL = config.get('ia_base_url')
logger = logging.getLogger("openlibrary.importapi")

class DataError(ValueError):
Expand Down
5 changes: 3 additions & 2 deletions openlibrary/tests/core/test_helpers.py
Original file line number Diff line number Diff line change
Expand Up @@ -84,10 +84,11 @@ def test_urlsafe():
assert h.urlsafe("a?") == "a"

def test_get_coverstore_url(monkeypatch):
assert h.get_coverstore_url() == "https://covers.openlibrary.org"

from infogami import config

monkeypatch.delattr(config, "coverstore_url", raising=False)
assert h.get_coverstore_url() == "https://covers.openlibrary.org"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like another good opportunity to move canonical covers service address into a config

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is testing that the fallback (in the absence of config or config file) cover url is https://covers.openlibrary.org, so it is taken form config, if it exists, otherwise is set to https://covers.openlibrary.org, which seems ok to me, although we should be consistent. Not all config values have defaults and expect the config to be set.

As policy, should we hardcode defaults, or simply reply on config?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should rely on config (defaults).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be a separate issue, merging for now


monkeypatch.setattr(config, "coverstore_url", "https://0.0.0.0:8090", raising=False)
assert h.get_coverstore_url() == "https://0.0.0.0:8090"

Expand Down