Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import archive.org bulk marc items #1058

Merged
merged 36 commits into from
Sep 15, 2018
Merged
Show file tree
Hide file tree
Changes from 19 commits
Commits
Show all changes
36 commits
Select commit Hold shift + click to select a range
4050d5c
fixes #1029 import subjects from ia metadata
hornc Aug 24, 2018
0236ecd
raise error in identifier not provided
hornc Aug 24, 2018
5cfb8f0
add docstrings
hornc Aug 27, 2018
17f45e5
raise 400 Bad Requests on error instead of 200s
hornc Aug 27, 2018
68e36fb
document add_book methods used by the import API
hornc Aug 27, 2018
f28a84a
add import from archive.org bulk_marc items
hornc Aug 27, 2018
3e417e1
comment and remove deprecated imports
hornc Aug 29, 2018
729a321
found and commented a bug in deprecated functions
hornc Aug 29, 2018
5872309
remove utils.ia find_item lookup
hornc Aug 29, 2018
0f34ee1
make it clear which fns belong to fast_parse
hornc Aug 29, 2018
c8bb135
don't keep looping if we got a response
hornc Aug 29, 2018
add6383
handle missing coverstore servers in dev
hornc Aug 29, 2018
b6f5174
return length and offset of next record when reading bulk MARC
hornc Aug 29, 2018
5e83d32
link to all initial source records in item history
hornc Aug 29, 2018
cb25df8
refactor and remove unused deprecated code
hornc Aug 29, 2018
0dd5770
remove non-functioning tests
hornc Aug 29, 2018
6b85403
fix True assert tests in add_book
hornc Aug 30, 2018
3bb8884
MarcBinary to raise exceptions itself rather than have get_ia return …
hornc Aug 30, 2018
23b8ec5
parametrize get_marc_record_from_ia tests for better reporting
hornc Aug 30, 2018
3e4e271
rename ils only import tests
hornc Sep 5, 2018
f968ba5
address docstring issues
hornc Sep 6, 2018
972f793
move MARC tests to own dir
hornc Sep 6, 2018
647cb54
move fast_parse tests to test dir
hornc Sep 6, 2018
cb88426
correct rtype
hornc Sep 6, 2018
3035437
re-add missing test data
hornc Sep 6, 2018
0c75981
raise BadMarc exception if unexpected/empty data
hornc Sep 6, 2018
0383d86
remove conditionals from tests
hornc Sep 6, 2018
82d7e5f
confirm behaviour of ia MARC import add_book
hornc Sep 6, 2018
26326ca
docstrings for load_book used in import api path
hornc Sep 6, 2018
1bf18cf
docstrings and tests for merge_marc code used by import
hornc Sep 7, 2018
0292910
DRY isbn gathering code
hornc Sep 7, 2018
1792e9b
refactor add_book matching
hornc Sep 7, 2018
51baf2e
address review comments
hornc Sep 7, 2018
c81e186
make ia urls constants
hornc Sep 12, 2018
54e2cc6
load ia base url from config
hornc Sep 12, 2018
09b6817
address review comments, refactor
hornc Sep 13, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
28 changes: 27 additions & 1 deletion openlibrary/catalog/add_book/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -146,6 +146,15 @@ def new_work(q, rec, cover_id):
return w

def load_data(rec):
"""
Creates a new Edition,
hornc marked this conversation as resolved.
Show resolved Hide resolved
searches for an existing Work,
creates a new one if required,
otherwise adds to existing Work,
possibly with modification.
:param dict rec: Edition record to load (now we have established it should be added)
hornc marked this conversation as resolved.
Show resolved Hide resolved
:rtype: dict {"success": bool, "error": string | "work": {"key": <key>, "status": "created" | "modified" | "matched"} , "edition": {"key": <key>, "status": "created"}}
hornc marked this conversation as resolved.
Show resolved Hide resolved
hornc marked this conversation as resolved.
Show resolved Hide resolved
"""
cover_url = None
if 'cover' in rec:
cover_url = rec['cover']
Expand Down Expand Up @@ -250,6 +259,11 @@ def find_match(e1, edition_pool):
return edition_key

def build_pool(rec):
"""
Searches for existing edition matches on title and bibliographic keys.
:param dict rec: Edition record
hornc marked this conversation as resolved.
Show resolved Hide resolved
:rtype: dict {<identifier: title | isbn | lccn etc>: [ list of /books/OL..M keys that match rec on <identifier>]}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put dict info in :return: section

"""
pool = defaultdict(set)

## Find records with matching title
Expand Down Expand Up @@ -300,6 +314,10 @@ def add_db_name(rec):
re_lang = re.compile('^/languages/([a-z]{3})$')

def early_exit(rec):
"""Attempts to quickly find an existing item match using bibliographic keys.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Newline before/after

:param dict rec: Edition record
:rtype: (str|None) First key matched of format "/books/OL..M" or False if no match found.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

brackets unnecessary here

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add :return: block

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be str|bool?

"""
f = 'ocaid'
# Anand - August 2014
# If openlibrary ID is already specified in the record, then use it.
Expand All @@ -316,6 +334,7 @@ def early_exit(rec):
if ekeys:
return ekeys[0]

#TODO: Check whether 'isbn' should also be used here.
hornc marked this conversation as resolved.
Show resolved Hide resolved
if 'isbn_10' or 'isbn_13' in rec:
isbns = rec.get("isbn_10", []) + rec.get("isbn_13", [])
isbns = [isbn.strip().replace("-", "") for isbn in isbns]
Expand Down Expand Up @@ -378,6 +397,12 @@ def find_exact_match(rec, edition_pool):
return False

def add_cover(cover_url, ekey):
"""
Adds a cover to coverstore and returns the cover id.
hornc marked this conversation as resolved.
Show resolved Hide resolved
:param str cover_url: URL of cover image
:param str ekey: Edition key /book/OL..M
:rtype: int Cover id
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add :return:

"""
olid = ekey.split("/")[-1]
coverstore_url = config.get('coverstore_url').rstrip('/')
upload_url = coverstore_url + '/b/upload2'
Expand All @@ -391,6 +416,7 @@ def add_cover(cover_url, ekey):
'olid': olid,
'ip': web.ctx.ip,
}
reply = None
for attempt in range(10):
try:
res = urllib.urlopen(upload_url, urllib.urlencode(params))
Expand All @@ -399,7 +425,7 @@ def add_cover(cover_url, ekey):
sleep(2)
continue
body = res.read()
if body != '':
if body not in ['', 'None']:
reply = json.loads(body)
hornc marked this conversation as resolved.
Show resolved Hide resolved
if res.getcode() == 200 and body != '':
if 'id' in reply:
Expand Down
60 changes: 30 additions & 30 deletions openlibrary/catalog/add_book/test_add_book.py
Original file line number Diff line number Diff line change
Expand Up @@ -66,7 +66,7 @@ def test_load_test_item(mock_site, add_languages, ia_writeback):
'languages': ['eng'],
}
reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
assert reply['edition']['status'] == 'created'
e = mock_site.get(reply['edition']['key'])
assert e.type.key == '/type/edition'
Expand All @@ -89,7 +89,7 @@ def test_load_with_subjects(mock_site, ia_writeback):
'source_records': 'ia:test_item',
}
reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
w = mock_site.get(reply['work']['key'])
assert w.title == 'Test item'
assert w.subjects == ['Protected DAISY', 'In library']
Expand All @@ -102,7 +102,7 @@ def test_load_with_new_author(mock_site, ia_writeback):
'source_records': 'ia:test_item',
}
reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
w = mock_site.get(reply['work']['key'])
if 'authors' in reply:
assert reply['authors'][0]['status'] == 'created'
Expand All @@ -121,7 +121,7 @@ def test_load_with_new_author(mock_site, ia_writeback):
'source_records': 'ia:test_item',
}
reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
# Below asserts imply the existing author should be updated, but these were never run:
#assert reply['authors'][0]['status'] == 'modified'
#akey2 = reply['authors'][0]['key']
Expand Down Expand Up @@ -156,7 +156,7 @@ def test_duplicate_ia_book(mock_site, add_languages, ia_writeback):
'languages': ['eng'],
}
reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
assert reply['edition']['status'] == 'created'
e = mock_site.get(reply['edition']['key'])
assert e.type.key == '/type/edition'
Expand All @@ -171,7 +171,7 @@ def test_duplicate_ia_book(mock_site, add_languages, ia_writeback):
'languages': ['fre'],
}
reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
assert reply['edition']['status'] == 'matched'

def test_from_marc_3(mock_site, add_languages):
Expand All @@ -181,7 +181,7 @@ def test_from_marc_3(mock_site, add_languages):
rec = read_edition(MarcBinary(data))
rec['source_records'] = ['ia:' + ia]
reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
assert reply['edition']['status'] == 'created'
e = mock_site.get(reply['edition']['key'])
assert e.type.key == '/type/edition'
Expand All @@ -194,12 +194,12 @@ def test_from_marc_2(mock_site, add_languages):
rec = read_edition(MarcBinary(data))
rec['source_records'] = ['ia:' + ia]
reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
assert reply['edition']['status'] == 'created'
e = mock_site.get(reply['edition']['key'])
assert e.type.key == '/type/edition'
reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
assert reply['edition']['status'] == 'matched'

def test_from_marc(mock_site, add_languages):
Expand All @@ -210,7 +210,7 @@ def test_from_marc(mock_site, add_languages):
assert len(data) == int(data[:5])
rec = read_edition(MarcBinary(data))
reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
akey1 = reply['authors'][0]['key']
a = mock_site.get(akey1)
assert a.type.key == '/type/author'
Expand Down Expand Up @@ -268,21 +268,21 @@ def test_load_multiple(mock_site):
'authors': [{'name': 'Smith, John', 'birth_date': '1980'}],
}
reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
ekey1 = reply['edition']['key']

reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
ekey2 = reply['edition']['key']
assert ekey1 == ekey2

reply = load({'title': 'Test item', 'source_records': ['ia:test_item2'], 'lccn': ['456']})
assert reply['success'] == True
assert reply['success'] is True
ekey3 = reply['edition']['key']
assert ekey3 != ekey1

reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
ekey4 = reply['edition']['key']

assert ekey1 == ekey2 == ekey4
Expand Down Expand Up @@ -312,10 +312,10 @@ def test_from_marc(mock_site, add_languages):
rec = read_edition(marc)
rec['source_records'] = ['ia:' + ia]
reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
assert reply['edition']['status'] == 'created'
reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
assert reply['edition']['status'] == 'matched'

ia = 'flatlandromanceo00abbouoft'
Expand All @@ -324,10 +324,10 @@ def test_from_marc(mock_site, add_languages):
rec = read_edition(marc)
rec['source_records'] = ['ia:' + ia]
reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
assert reply['edition']['status'] == 'created'
reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
assert reply['edition']['status'] == 'matched'

def test_real_example(mock_site, add_languages):
Expand All @@ -336,17 +336,17 @@ def test_real_example(mock_site, add_languages):
rec = read_edition(marc)
rec['source_records'] = ['marc:' + src]
reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
assert reply['edition']['status'] == 'matched'

src = 'v39.i28.records.utf8--5362776-1764'
marc = MarcBinary(open_test_data(src).read())
rec = read_edition(marc)
rec['source_records'] = ['marc:' + src]
reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
assert reply['edition']['status'] == 'modified'

def test_missing_ocaid(mock_site, add_languages):
Expand All @@ -356,11 +356,11 @@ def test_missing_ocaid(mock_site, add_languages):
rec = read_edition(marc)
rec['source_records'] = ['marc:testdata.mrc']
reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
rec['source_records'] = ['ia:' + ia]
rec['ocaid'] = ia
reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
e = mock_site.get(reply['edition']['key'])
assert e.ocaid == ia
assert 'ia:' + ia in e.source_records
Expand Down Expand Up @@ -397,12 +397,12 @@ def test_extra_author(mock_site, add_languages):
rec['source_records'] = ['ia:' + ia]

reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True

w = mock_site.get(reply['work']['key'])

reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
w = mock_site.get(reply['work']['key'])
#assert len(w['authors']) == 1

Expand Down Expand Up @@ -459,7 +459,7 @@ def test_missing_source_records(mock_site, add_languages):
rec['source_records'] = ['ia:' + ia]

reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
e = mock_site.get(reply['edition']['key'])
assert 'source_records' in e

Expand Down Expand Up @@ -509,7 +509,7 @@ def test_no_extra_author(mock_site, add_languages):
rec['source_records'] = ['marc:' + src]

reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True

a = mock_site.get(reply['authors'][0]['key'])

Expand Down Expand Up @@ -588,19 +588,19 @@ def test_don_quixote(mock_site):
work_keys = list(mock_site.things(q))
assert work_keys

assert reply['success'] == True
assert reply['success'] is True

def test_same_twice(mock_site, add_languages):
rec = {
'source_records': ['ia:test_item'],
"publishers": ["Ten Speed Press"], "pagination": "20 p.", "description": "A macabre mash-up of the children's classic Pat the Bunny and the present-day zombie phenomenon, with the tactile features of the original book revoltingly re-imagined for an adult audience.", "title": "Pat The Zombie", "isbn_13": ["9781607740360"], "languages": ["eng"], "isbn_10": ["1607740362"], "authors": [{"entity_type": "person", "name": "Aaron Ximm", "personal_name": "Aaron Ximm"}], "contributions": ["Kaveh Soofi (Illustrator)"]}
reply = load(rec)
assert reply['success'] == True
assert reply['success'] is True
assert reply['edition']['status'] == 'created'
assert reply['work']['status'] == 'created'
reply = load(rec)

assert reply['success'] == True
assert reply['success'] is True
assert reply['edition']['status'] != 'created'
assert reply['work']['status'] != 'created'