Import archive.org bulk marc items #1058

Merged
36 commits merged on Sep 15, 2018. The file changes below show changes from 16 of the 36 commits.

Commits
4050d5c  fixes #1029 import subjects from ia metadata (hornc, Aug 24, 2018)
0236ecd  raise error if identifier not provided (hornc, Aug 24, 2018)
5cfb8f0  add docstrings (hornc, Aug 27, 2018)
17f45e5  raise 400 Bad Requests on error instead of 200s (hornc, Aug 27, 2018)
68e36fb  document add_book methods used by the import API (hornc, Aug 27, 2018)
f28a84a  add import from archive.org bulk_marc items (hornc, Aug 27, 2018)
3e417e1  comment and remove deprecated imports (hornc, Aug 29, 2018)
729a321  found and commented a bug in deprecated functions (hornc, Aug 29, 2018)
5872309  remove utils.ia find_item lookup (hornc, Aug 29, 2018)
0f34ee1  make it clear which fns belong to fast_parse (hornc, Aug 29, 2018)
c8bb135  don't keep looping if we got a response (hornc, Aug 29, 2018)
add6383  handle missing coverstore servers in dev (hornc, Aug 29, 2018)
b6f5174  return length and offset of next record when reading bulk MARC (hornc, Aug 29, 2018)
5e83d32  link to all initial source records in item history (hornc, Aug 29, 2018)
cb25df8  refactor and remove unused deprecated code (hornc, Aug 29, 2018)
0dd5770  remove non-functioning tests (hornc, Aug 29, 2018)
6b85403  fix True assert tests in add_book (hornc, Aug 30, 2018)
3bb8884  MarcBinary to raise exceptions itself rather than have get_ia return … (hornc, Aug 30, 2018)
23b8ec5  parametrize get_marc_record_from_ia tests for better reporting (hornc, Aug 30, 2018)
3e4e271  rename ils only import tests (hornc, Sep 5, 2018)
f968ba5  address docstring issues (hornc, Sep 6, 2018)
972f793  move MARC tests to own dir (hornc, Sep 6, 2018)
647cb54  move fast_parse tests to test dir (hornc, Sep 6, 2018)
cb88426  correct rtype (hornc, Sep 6, 2018)
3035437  re-add missing test data (hornc, Sep 6, 2018)
0c75981  raise BadMarc exception if unexpected/empty data (hornc, Sep 6, 2018)
0383d86  remove conditionals from tests (hornc, Sep 6, 2018)
82d7e5f  confirm behaviour of ia MARC import add_book (hornc, Sep 6, 2018)
26326ca  docstrings for load_book used in import api path (hornc, Sep 6, 2018)
1bf18cf  docstrings and tests for merge_marc code used by import (hornc, Sep 7, 2018)
0292910  DRY isbn gathering code (hornc, Sep 7, 2018)
1792e9b  refactor add_book matching (hornc, Sep 7, 2018)
51baf2e  address review comments (hornc, Sep 7, 2018)
c81e186  make ia urls constants (hornc, Sep 12, 2018)
54e2cc6  load ia base url from config (hornc, Sep 12, 2018)
09b6817  address review comments, refactor (hornc, Sep 13, 2018)
28 changes: 27 additions & 1 deletion openlibrary/catalog/add_book/__init__.py
@@ -146,6 +146,15 @@ def new_work(q, rec, cover_id):
return w

def load_data(rec):
"""
Creates a new Edition,
hornc marked this conversation as resolved.
Show resolved Hide resolved
searches for an existing Work,
creates a new one if required,
otherwise adds to existing Work,
possibly with modification.
:param dict rec: Edition record to load (now we have established it should be added)
hornc marked this conversation as resolved.
Show resolved Hide resolved
:rtype: dict {"success": bool, "error": string | "work": {"key": <key>, "status": "created" | "modified" | "matched"} , "edition": {"key": <key>, "status": "created"}}
hornc marked this conversation as resolved.
Show resolved Hide resolved
hornc marked this conversation as resolved.
Show resolved Hide resolved
"""
cover_url = None
if 'cover' in rec:
cover_url = rec['cover']
@@ -250,6 +259,11 @@ def find_match(e1, edition_pool):
return edition_key

def build_pool(rec):
"""
Searches for existing edition matches on title and bibliographic keys.
:param dict rec: Edition record
:rtype: dict {<identifier: title | isbn | lccn etc>: [list of /books/OL..M keys that match rec on <identifier>]}
Collaborator: Put dict info in :return: section
"""
pool = defaultdict(set)

## Find records with matching title
@@ -300,6 +314,10 @@ def add_db_name(rec):
re_lang = re.compile('^/languages/([a-z]{3})$')

def early_exit(rec):
"""Attempts to quickly find an existing item match using bibliographic keys.
Collaborator: Newline before/after

:param dict rec: Edition record
:rtype: (str|None) First key matched of format "/books/OL..M" or False if no match found.
Collaborator: brackets unnecessary here
Collaborator: add :return: block
Collaborator: shouldn't this be str|bool?
"""
f = 'ocaid'
# Anand - August 2014
# If openlibrary ID is already specified in the record, then use it.
@@ -316,6 +334,7 @@ def early_exit(rec):
if ekeys:
return ekeys[0]

#TODO: Check whether 'isbn' should also be used here.
if 'isbn_10' in rec or 'isbn_13' in rec:
isbns = rec.get("isbn_10", []) + rec.get("isbn_13", [])
isbns = [isbn.strip().replace("-", "") for isbn in isbns]
@@ -378,6 +397,12 @@ def find_exact_match(rec, edition_pool):
return False

def add_cover(cover_url, ekey):
"""
Adds a cover to coverstore and returns the cover id.
:param str cover_url: URL of cover image
:param str ekey: Edition key /books/OL..M
:rtype: int Cover id
Collaborator: add :return:
"""
olid = ekey.split("/")[-1]
coverstore_url = config.get('coverstore_url').rstrip('/')
upload_url = coverstore_url + '/b/upload2'
@@ -391,6 +416,7 @@ def add_cover(cover_url, ekey):
'olid': olid,
'ip': web.ctx.ip,
}
reply = None
for attempt in range(10):
try:
res = urllib.urlopen(upload_url, urllib.urlencode(params))
@@ -399,7 +425,7 @@
sleep(2)
continue
body = res.read()
if body != '':
if body not in ['', 'None']:
reply = json.loads(body)
if res.getcode() == 200 and body != '':
if 'id' in reply:
215 changes: 74 additions & 141 deletions openlibrary/catalog/get_ia.py
@@ -1,13 +1,11 @@
from openlibrary.catalog.marc import fast_parse, read_xml
from openlibrary.catalog.utils import error_mail
from openlibrary.catalog.marc.marc_binary import MarcBinary
from openlibrary.catalog.marc.marc_xml import MarcXml
from openlibrary.catalog.marc import parse
from lxml import etree
import xml.parsers.expat
import urllib2, os.path, socket
from time import sleep
import traceback
from openlibrary.utils.ia import find_item
from openlibrary.core import ia

base = "https://archive.org/download/"
@@ -19,20 +17,13 @@ def urlopen_keep_trying(url):
for i in range(3):
try:
f = urllib2.urlopen(url)
return f
except urllib2.HTTPError, error:
if error.code in (403, 404):
#print "404 for '%s'" % url
if error.code in (403, 404, 416):
raise
else:
print 'error:', error.code, error.msg
pass
except urllib2.URLError:
pass
else:
return f
print url, "failed"
sleep(2)
print "trying again"

def bad_ia_xml(ia):
if ia == 'revistadoinstit01paulgoog':
@@ -43,33 +34,24 @@ def bad_ia_xml(ia):
return '<!--' in urlopen_keep_trying(base + loc).read()

def get_marc_ia_data(ia, host=None, path=None):
Member: especially confusing since this clobbers ia the variable imported above

Collaborator Author: great catch, this ia is supposed to be an ocaid -- I didn't change it earlier because these are (mostly?) deprecated methods and I'm not sure that they are even used, but this naming conflict is a great reason to tidy it up, thanks!

ia = ia.strip() # 'cyclopdiaofedu00kidd '
"""
DEPRECATED
"""
ending = 'meta.mrc'
if host and path:
url = 'http://%s%s/%s_%s' % (host, path, ia, ending)
else:
url = 'http://www.archive.org/download/' + ia + '/' + ia + '_' + ending
url = base + ia + '/' + ia + '_' + ending
Member: where's base coming from, can this be capitalized so it's more clear it's a constant?

Collaborator Author: This should really be read from the config file, does capitalizing still apply if we read it from config?
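# A sketch of the config-driven approach discussed above (the key name is
# assumed; commit 54e2cc6 later loads the IA base URL from config):
#   IA_BASE_URL = config.get('ia_base_url', 'https://archive.org')
#   base = IA_BASE_URL + '/download/'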

f = urlopen_keep_trying(url)
return f.read() if f else None

def get_marc_ia(ia):
ia = ia.strip() # 'cyclopdiaofedu00kidd '
url = base + ia + "/" + ia + "_meta.mrc"
data = urlopen_keep_trying(url).read()
length = int(data[0:5])
if len(data) != length:
data = data.decode('utf-8').encode('raw_unicode_escape')
assert len(data) == length

assert 'Internet Archive: Error' not in data
print 'leader:', data[:24]
return data
return fast_parse.read_edition(data, accept_electronic = True)

def get_marc_record_from_ia(identifier):
"""Takes IA identifiers and returns MARC record instance.
"""
Takes IA identifiers and returns MARC record instance.
11/2017: currently called by openlibrary/plugins/importapi/code.py
when the /api/import/ia endpoint is POSTed to.
:param str identifier: ocaid
:rtype: (MarcXml | MarcBinary)
"""
metadata = ia.get_metadata(identifier)
filenames = metadata['_filenames']
@@ -97,62 +79,18 @@ def get_marc_record_from_ia(identifier):
# BinaryMARCs with incorrectly converted unicode characters do not match.
return MarcBinary(data)

def get_ia(ia):
ia = ia.strip() # 'cyclopdiaofedu00kidd '
# read MARC record of scanned book from archive.org
# try the XML first because it has better character encoding
# if there is a problem with the XML switch to the binary MARC
xml_file = ia + "_marc.xml"
loc = ia + "/" + xml_file
try:
print base + loc
f = urlopen_keep_trying(base + loc)
except urllib2.HTTPError, error:
if error.code == 404:
raise NoMARCXML
else:
print 'error:', error.code, error.msg
raise
assert f
if f:
try:
return read_xml.read_edition(f)
except read_xml.BadXML:
print "read_xml BADXML"
pass
except xml.parsers.expat.ExpatError:
#print 'IA:', repr(ia)
#print 'XML parse error:', base + loc
print "read_xml ExpatError"
pass
print base + loc
if '<title>Internet Archive: Page Not Found</title>' in urllib2.urlopen(base + loc).read(200):
raise NoMARCXML
url = base + ia + "/" + ia + "_meta.mrc"
print url
try:
f = urlopen_keep_trying(url)
except urllib2.URLError:
pass
if not f:
return None
data = f.read()
length = data[0:5]
loc = ia + "/" + ia + "_meta.mrc:0:" + length
if len(data) == 0:
print 'zero length MARC for', url
return None
if 'Internet Archive: Error' in data:
print 'internet archive error for', url
return None
if data.startswith('<html>\n<head>'):
print 'internet archive error for', url
return None
try:
return fast_parse.read_edition(data, accept_electronic = True)
except (ValueError, AssertionError, fast_parse.BadDictionary):
print(repr(data))
raise
def get_ia(identifier):
"""
DEPRECATED: Use get_marc_record_from_ia() above + parse.read_edition()
Triggers UnboundLocalError: local variable 'v' referenced before assignment
Read MARC record of scanned book from archive.org
try the XML first because it has better character encoding
if there is a problem with the XML switch to the binary MARC
:param str identifier: ocaid
:rtype: (None | dict)
"""
marc = get_marc_record_from_ia(identifier)
return parse.read_edition(marc)

def files(archive_id):
url = base + archive_id + "/" + archive_id + "_files.xml"
@@ -196,49 +134,57 @@ def get_data(loc):
return buf

def get_from_archive(locator):
"""
Gets a single binary MARC record from within an Archive.org
bulk MARC item - data only.

:param str locator: Locator ocaid/filename:offset:length
:rtype: (str|None) Binary MARC data
Collaborator: split

"""
data, offset, length = get_from_archive_bulk(locator)
return data

def get_from_archive_bulk(locator):
"""
Gets a single binary MARC record from within an Archive.org
bulk MARC item, and return the offset and length of the next
item.
If offset or length are `None`, then there is no next record.

:param str locator: Locator ocaid/filename:offset:length
:rtype: (str, int|None, int|None) (Binary MARC data, Next record offset, Next record length)
Collaborator: split

"""
if locator.startswith('marc:'):
locator = locator[5:]
filename, offset, length = locator.split (":")
offset = int (offset)
length = int (length)

ia, rest = filename.split('/', 1)

for attempt in range(5):
try:
host, path = find_item(ia)
break
except socket.timeout:
if attempt == 4:
raise
print 'retry, attempt', attempt
offset = int(offset)
length = int(length)

r0, r1 = offset, offset+length-1
url = 'http://' + host + path + '/' + rest
# get the next record's length in this request
r1 += 5
url = base + filename

assert 0 < length < 100000

ureq = urllib2.Request(url, None, {'Range':'bytes=%d-%d'% (r0, r1)},)

f = None
for i in range(3):
try:
f = urllib2.urlopen(ureq)
except urllib2.HTTPError, error:
if error.code == 416:
raise
elif error.code == 404:
print "404 for '%s'" % url
raise
else:
print url
print 'error:', error.code, error.msg
except urllib2.URLError:
pass
ureq = urllib2.Request(url, None, {'Range': 'bytes=%d-%d' % (r0, r1)})
f = urlopen_keep_trying(ureq)
data = None
if f:
return f.read(100000)
else:
print locator, url, 'failed'
data = f.read(100000)
len_in_rec = int(data[:5])
if len_in_rec != length:
data, next_offset, next_length = get_from_archive_bulk('%s:%d:%d' % (filename, offset, len_in_rec))
else:
next_length = data[length:]
data = data[:length]
if len(next_length) == 5:
# We have data for the next record
next_offset = offset + len_in_rec
next_length = int(next_length)
else:
next_offset = next_length = None
return data, next_offset, next_length

def get_from_local(locator):
try:
@@ -253,6 +199,12 @@ def get_from_local(locator):
return buf

def read_marc_file(part, f, pos=0):
"""
:param str part: Item and filename, e.g. ocaid/filename, used to build the source_record locator
:param str f: Full binary MARC data containing many records
:param int pos: Start position within the data
:rtype: (int, str, str) (Next position, Current source_record name, Current single MARC record)
Collaborator: split

"""
try:
for data, int_length in fast_parse.read_file(f):
loc = "marc:%s:%d:%d" % (part, pos, int_length)
@@ -272,18 +224,15 @@ def marc_formats(ia, host=None, path=None):
if host and path:
url = 'http://%s%s/%s_%s' % (host, path, ia, ending)
else:
url = 'http://www.archive.org/download/' + ia + '/' + ia + '_' + ending
url = base + ia + '/' + ia + '_' + ending
Member: Can we please make base more apparently a CONST?

for attempt in range(10):
f = urlopen_keep_trying(url)
if f is not None:
break
sleep(10)
if f is None:
msg_from = 'load_scribe@archive.org'
msg_to = ['edward@archive.org']
subject = "error reading %s_files.xml" % ia
msg = url
error_mail(msg_from, msg_to, subject, msg)
#TODO: log this, if anything uses this code
msg = "error reading %s_files.xml" % ia
return has
data = f.read()
try:
Expand All @@ -298,19 +247,3 @@ def marc_formats(ia, host=None, path=None):
if all(has.values()):
break
return has

def test_get_ia():
ia = "poeticalworksoft00grayiala"
expect = {
'publisher': ['Printed by C. Whittingham for T. N. Longman and O. Rees [etc]'],
'number_of_pages': 223,
'full_title': 'The poetical works of Thomas Gray with some account of his life and writings ; the whole carefully revised and illustrated by notes ; to which are annexed, Poems addressed to, and in memory of Mr. Gray ; several of which were never before collected.',
'publish_date': '1800',
'publish_country': 'enk',
'authors': [
{'db_name': 'Gray, Thomas 1716-1771.', 'name': 'Gray, Thomas'}
],
'oclc': ['5047966']
}
assert get_ia(ia) == expect

2 changes: 1 addition & 1 deletion openlibrary/catalog/importer/load_scribe.py
@@ -10,7 +10,7 @@
from openlibrary.catalog.utils.query import query, withKey
from openlibrary.catalog.importer.merge import try_merge
from openlibrary.catalog.importer.update import add_source_records
from openlibrary.catalog.get_ia import get_ia, urlopen_keep_trying, NoMARCXML, bad_ia_xml, marc_formats, get_marc_ia, get_marc_ia_data
from openlibrary.catalog.get_ia import get_ia, urlopen_keep_trying, NoMARCXML, bad_ia_xml, marc_formats, get_marc_ia_data
from openlibrary.catalog.title_page_img.load import add_cover_image
from openlibrary.solr.update_work import update_work, solr_update
#from openlibrary.catalog.works.find_works import find_title_redirects, find_works, get_books, books_query, update_works
Expand Down