
Import archive.org bulk marc items #1058

Merged: 36 commits, Sep 15, 2018

Conversation

@hornc (Collaborator) commented Aug 27, 2018

Currently a work in progress. This PR:

  • Documents various import endpoints
  • Adds tests to protect import functionality
  • Refactors code and applies DRY
  • Adds the ability to import from bulk binary MARC records stored on archive.org (see the sketch after this list)
  • Makes the import API respond with proper HTTP response codes on error
  • Resolves "Import subjects from IA metadata when MARC is not present" #1029, adding subjects from archive.org metadata
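
A minimal sketch of the bulk fetch this adds, assuming the requests library; fetch_marc_record and its arguments are illustrative names rather than the PR's actual code:

import requests

def fetch_marc_record(filename, offset, length):
    # offset/length locate one MARC record inside a bulk binary .mrc file
    # on archive.org; an HTTP Range request fetches just that slice.
    url = 'https://archive.org/download/%s' % filename
    r0, r1 = offset, offset + length - 1              # inclusive byte range
    response = requests.get(url, headers={'Range': 'bytes=%d-%d' % (r0, r1)})
    response.raise_for_status()
    return response.content                           # raw MARC21 bytes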


r0, r1 = offset, offset+length-1
url = 'http://' + host + path + '/' + rest
url = 'https://archive.org/download/%s' % filename
Member:

If possible, it would be great to have this as a config (as it was before)

Member:

Can we extract archive.org into a variable, and also not comment out the code above, but instead have find_item fall back to this if it fails:

import requests

def resolve_server(identifier):
    """
    Returns:
      a dict containing this item's:
        server: e.g. ia601507.us.archive.org
        dir: on disk containing files, e.g. /2/items/recurringwordsth00ethe
    """
    metadata = requests.get('%s/metadata/%s' % (API_BASEURL, identifier)).json()
    return {
        'dir': metadata['dir'],
        'server': metadata['server']
    }
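
A hedged sketch of the suggested fallback; locate_item is a hypothetical wrapper, and find_item's return shape (server, dir) is an assumption:

def locate_item(identifier):
    # Keep the existing code path, but fall back to the metadata API via
    # resolve_server() (above) when it fails.
    try:
        return find_item(identifier)
    except Exception:
        location = resolve_server(identifier)
        return location['server'], location['dir']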

#1060

Member:

The case we care about is that we don't change how this code works on production, because whatever black magic this is, it works...

But because it breaks on dev, we want to at least provide a way to unblock developers when the existing code path fails.

Member:

Let's bypass all of this by having an archive_org_url in the default openlibrary.yml config, which we use here with the /download link.
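
A sketch of the config-driven approach being asked for; the key name, import location, and default value are assumptions based on later snippets in this review:

from infogami import config
from openlibrary.config import load_config   # assumed home of load_config

load_config('conf/openlibrary.yml')           # make openlibrary.yml values available
IA_BASE_URL = config.get('ia_base_url', 'https://archive.org')
IA_DOWNLOAD_URL = IA_BASE_URL + '/download/'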

@cdrini (Collaborator) left a comment:

Got distracted by the docstrings :P The logic of this general section of the code still escapes me :/

@cdrini (Collaborator) commented Sep 5, 2018

Looks like GitHub lets you resolve conversations in a code review now! \o/

ending = 'meta.mrc'
if host and path:
    url = 'http://%s%s/%s_%s' % (host, path, ia, ending)
else:
    url = 'http://www.archive.org/download/' + ia + '/' + ia + '_' + ending
url = base + ia + '/' + ia + '_' + ending
Member:

Where's base coming from? Can this be capitalized so it's clearer that it's a constant?

Collaborator Author:

This should really be read from the config file; does capitalizing still apply if we read it from config?

@@ -272,18 +224,15 @@ def marc_formats(ia, host=None, path=None):
if host and path:
    url = 'http://%s%s/%s_%s' % (host, path, ia, ending)
else:
    url = 'http://www.archive.org/download/' + ia + '/' + ia + '_' + ending
url = base + ia + '/' + ia + '_' + ending
Member:

Can we please make it more apparent that base is a CONST?

:param str identifier: ocaid
:rtype: dict
:return: Edition record
"""
edition['ocaid'] = identifier
edition['source_records'] = "ia:" + identifier
edition['cover'] = "https://archive.org/download/{0}/{0}/page/title.jpg".format(identifier)
Member:

Should this also use base / BASE?

@mekarpeles (Member) commented:

Suggestions, otherwise LGTM. Is there anything unstable preventing this from being merged?

Would prefer to keep change-sets on the smaller side (it's hard to find a contiguous block of time to peruse the review :) )

@hornc (Collaborator Author) commented Sep 7, 2018

@mekarpeles re: language codes, they are ISO 639-2, which are all three-character representations.

Edit: you may be thinking of location codes (I can't remember the standard off the top of my head); most are two characters for countries, but there are three-character codes for many US locations.
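
Illustrative only, to show the contrast: ISO 639-2 language codes are always three characters, while ISO 3166-1 alpha-2 country codes (one standard that may be meant above) are two:

LANGUAGE_CODES = {'eng': 'English', 'fre': 'French', 'ger': 'German'}   # ISO 639-2: three characters
COUNTRY_CODES = {'US': 'United States', 'GB': 'United Kingdom'}         # ISO 3166-1 alpha-2: two characters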

@hornc changed the title from "WIP: Import archive.org bulk marc items" to "Import archive.org bulk marc items" on Sep 7, 2018
@hornc (Collaborator Author) commented Sep 7, 2018

@mekarpeles I agree, this PR is getting too big. There are still things I think need to be refactored down and made sensible, but that will need to be broken into smaller chunks.

Once I address the base URL / config comment it should be good to merge, as I am confident it adds the bulk import functionality while taking advantage of the existing import API. If there are problems, they are likely to be existing issues with imports.

from lxml import etree
import xml.parsers.expat
import urllib2, os.path, socket
from time import sleep
import traceback
from openlibrary.core import ia

IA_BASE_URL = 'https://archive.org'
load_config('conf/openlibrary.yml')
Collaborator Author:

@mekarpeles I'm not completely happy about this line; openlibrary/plugins/importapi/code.py already has config available via config.get(), but openlibrary/catalog/get_ia.py does not 🤷‍♂️. I don't understand the purpose of https://github.com/internetarchive/openlibrary/blob/master/openlibrary/config.py either; it looks like a confusingly named / name-colliding version of infogami.config. This seems to work, but I'm not sure I fully understand how config is meant to be shared or passed around parts of OL.

from lxml import etree
import xml.parsers.expat
import urllib2, os.path, socket
from time import sleep
import traceback
from openlibrary.core import ia

base = "https://archive.org/download/"
load_config('conf/openlibrary.yml')
IA_BASE_URL = config.get('ia_base_url')
Member:

👍

@@ -32,7 +35,8 @@ def bad_ia_xml(ia):
# need to handle 404s:
# http://www.archive.org/details/index1858mary
loc = ia + "/" + ia + "_marc.xml"
Member:

What is ia here? Is it from "from openlibrary.core import ia"?

@@ -32,7 +35,8 @@ def bad_ia_xml(ia):
# need to handle 404s:
# http://www.archive.org/details/index1858mary
loc = ia + "/" + ia + "_marc.xml"
return '<!--' in urlopen_keep_trying(base + loc).read()

return '<!--' in urlopen_keep_trying(IA_DOWNLOAD_URL + loc).read()

def get_marc_ia_data(ia, host=None, path=None):
Member:

Especially confusing since this clobbers the ia name imported above (the openlibrary.core.ia module).

Collaborator Author:

Great catch; this ia is supposed to be an ocaid. I didn't change it earlier because these are (mostly?) deprecated methods and I'm not sure they are even used, but this naming conflict is a great reason to tidy it up, thanks!
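
A hypothetical sketch of that tidy-up: rename the parameter to ocaid so it no longer shadows the ia module; the function body shown is a placeholder, not the actual implementation:

from openlibrary.core import ia   # the module the old parameter name was shadowing

def get_marc_ia_data(ocaid, host=None, path=None):
    # 'ocaid' (an archive.org identifier) replaces the old 'ia' parameter name.
    loc = ocaid + '/' + ocaid + '_meta.mrc'
    return urlopen_keep_trying(IA_DOWNLOAD_URL + loc).read()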

@@ -348,7 +349,7 @@ def populate_edition_data(self, edition, identifier):
"""
edition['ocaid'] = identifier
edition['source_records'] = "ia:" + identifier
edition['cover'] = "https://archive.org/download/{0}/{0}/page/title.jpg".format(identifier)
edition['cover'] = "{0}/download/{1}/{1}/page/title.jpg".format(IA_BASE_URL, identifier)
Member:

Should this use IA_DOWNLOAD_URL instead of {0}/download?

Collaborator Author:

This is in a different file / module, and since it is only used once I thought creating only the one URL constant, IA_BASE_URL, was sufficient.


item_base = base + "/" + identifier + "/"
item_base = IA_DOWNLOAD_URL + '/' + identifier + '/'
Member:

Remove the + '/', I think? I believe IA_DOWNLOAD_URL ends with a slash already.
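
Illustrative only: one defensive way to build the URL that avoids a doubled slash whether or not IA_DOWNLOAD_URL ends with '/':

IA_DOWNLOAD_URL = 'https://archive.org/download/'   # value assumed for illustration
identifier = 'recurringwordsth00ethe'                # example ocaid from this review

# Strip any trailing slash before re-adding exactly one.
item_base = '%s/%s/' % (IA_DOWNLOAD_URL.rstrip('/'), identifier)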

from infogami import config

monkeypatch.delattr(config, "coverstore_url", raising=False)
assert h.get_coverstore_url() == "https://covers.openlibrary.org"
Member:

This seems like another good opportunity to move the canonical covers service address into config.

Collaborator Author:

This test checks that the fallback cover URL (in the absence of config or a config file) is https://covers.openlibrary.org: the value is taken from config if it exists, otherwise it is set to https://covers.openlibrary.org, which seems OK to me, although we should be consistent. Not all config values have defaults; some expect the config to be set.

As policy, should we hardcode defaults, or simply rely on config?

Member:

I think we should rely on config (defaults).
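
A sketch of the rely-on-config-with-a-default pattern the test above exercises; the body shown is an assumption, not necessarily the helper's actual implementation:

from infogami import config

def get_coverstore_url():
    # Use the configured value when present, otherwise fall back to the
    # public covers service the test expects.
    return config.get('coverstore_url', 'https://covers.openlibrary.org')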

Member:

This can be a separate issue, merging for now
