
Import archive.org bulk marc items #1058

Merged: 36 commits, Sep 15, 2018

Conversation

@hornc (Collaborator) commented Aug 27, 2018

Currently a work in progress. This PR:

  • Documents various import endpoints
  • Adds tests to protect import functionality
  • Refactors code and applies DRY
  • Adds the ability to import from bulk binary MARC records stored on archive.org (see the sketch after this list)
  • Makes the import API respond with proper HTTP response codes on error
  • Resolves "Import subjects from IA metadata when MARC is not present" #1029, adding subjects from archive.org metadata
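
A minimal sketch of the bulk fetch this adds, assuming the requests library; fetch_marc_record and its arguments are illustrative names rather than the PR's actual code:

import requests

def fetch_marc_record(filename, offset, length):
    # offset/length locate one MARC record inside a bulk binary .mrc file
    # on archive.org; an HTTP Range request fetches just that slice.
    url = 'https://archive.org/download/%s' % filename
    r0, r1 = offset, offset + length - 1              # inclusive byte range
    response = requests.get(url, headers={'Range': 'bytes=%d-%d' % (r0, r1)})
    response.raise_for_status()
    return response.content                           # raw MARC21 bytes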


r0, r1 = offset, offset+length-1
url = 'http://' + host + path + '/' + rest
url = 'https://archive.org/download/%s' % filename
Member:

If possible, it would be great to have this as a config (as it was before)

Member:

Can we extract archive.org into a variable, and also not comment out the code above, but instead have find_item fall back to this if it fails:

import requests

def resolve_server(identifier):
    """
    Returns:
      a dict containing this item's:
        server: e.g. ia601507.us.archive.org
        dir: on disk containing files, e.g. /2/items/recurringwordsth00ethe
    """
    metadata = requests.get('%s/metadata/%s' % (API_BASEURL, identifier)).json()
    return {
        'dir': metadata['dir'],
        'server': metadata['server']
    }
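
A hedged sketch of the suggested fallback; locate_item is a hypothetical wrapper, and find_item's return shape (server, dir) is an assumption:

def locate_item(identifier):
    # Keep the existing code path, but fall back to the metadata API via
    # resolve_server() (above) when it fails.
    try:
        return find_item(identifier)
    except Exception:
        location = resolve_server(identifier)
        return location['server'], location['dir']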

#1060

Member:

The case we care about is that we don't change how this code works on production, because whatever black magic this is, it works...

But because it breaks on dev, we want to at least provide a way to unblock developers when the existing code path fails.

Member:

Let's bypass all of this by having an archive_org_url in the default openlibrary.yml config, which we use here with the /download link.
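
A sketch of the config-driven approach being asked for; the key name, import location, and default value are assumptions based on later snippets in this review:

from infogami import config
from openlibrary.config import load_config   # assumed home of load_config

load_config('conf/openlibrary.yml')           # make openlibrary.yml values available
IA_BASE_URL = config.get('ia_base_url', 'https://archive.org')
IA_DOWNLOAD_URL = IA_BASE_URL + '/download/'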

@cdrini (Collaborator) left a comment:

Got distracted by the docstrings :P The logic of this general section of the code still escapes me :/

@cdrini (Collaborator) commented Sep 5, 2018

Looks like GitHub lets you resolve conversations in a code review now! \o/

ending = 'meta.mrc'
if host and path:
    url = 'http://%s%s/%s_%s' % (host, path, ia, ending)
else:
    url = 'http://www.archive.org/download/' + ia + '/' + ia + '_' + ending
url = base + ia + '/' + ia + '_' + ending
Member:

Where's base coming from? Can this be capitalized so it's clearer that it's a constant?

Collaborator Author:

This should really be read from the config file; does capitalizing still apply if we read it from config?

@@ -272,18 +224,15 @@ def marc_formats(ia, host=None, path=None):
if host and path:
    url = 'http://%s%s/%s_%s' % (host, path, ia, ending)
else:
    url = 'http://www.archive.org/download/' + ia + '/' + ia + '_' + ending
url = base + ia + '/' + ia + '_' + ending
Member:

Can we please make it more apparent that base is a CONST?

:param str identifier: ocaid
:rtype: dict
:return: Edition record
"""
edition['ocaid'] = identifier
edition['source_records'] = "ia:" + identifier
edition['cover'] = "https://archive.org/download/{0}/{0}/page/title.jpg".format(identifier)
Member:

Should this also use base / BASE?

@mekarpeles (Member) commented:

Suggestions, otherwise LGTM. Is there anything unstable preventing this from being merged?

Would prefer to keep change-sets on the smaller side (it's hard to find a contiguous block of time to peruse the review :) )

@hornc (Collaborator Author) commented Sep 7, 2018

@mekarpeles re: language codes, they are ISO 639-2, which are all three-character representations.

Edit: you may be thinking of location codes (I can't remember the standard off the top of my head); most are two characters for countries, but there are three-character codes for many US locations.
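
Illustrative only, to show the contrast: ISO 639-2 language codes are always three characters, while ISO 3166-1 alpha-2 country codes (one standard that may be meant above) are two:

LANGUAGE_CODES = {'eng': 'English', 'fre': 'French', 'ger': 'German'}   # ISO 639-2: three characters
COUNTRY_CODES = {'US': 'United States', 'GB': 'United Kingdom'}         # ISO 3166-1 alpha-2: two characters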

@hornc changed the title from "WIP: Import archive.org bulk marc items" to "Import archive.org bulk marc items" on Sep 7, 2018
@hornc (Collaborator Author) commented Sep 7, 2018

@mekarpeles I agree, this PR is getting too big. There are still things I think need to be refactored down and made sensible, but that will need to be broken into smaller chunks.

Once I address the base URL / config comment it should be good to merge, as I am confident it adds the bulk import functionality while taking advantage of the existing import API. If there are problems, they are likely to be existing issues with imports.

from lxml import etree
import xml.parsers.expat
import urllib2, os.path, socket
from time import sleep
import traceback
from openlibrary.core import ia

IA_BASE_URL = 'https://archive.org'
load_config('conf/openlibrary.yml')
Collaborator Author:

@mekarpeles I'm not completely happy about this line; openlibrary/plugins/importapi/code.py already has config available via config.get(), but openlibrary/catalog/get_ia.py does not 🤷‍♂️. I don't understand the purpose of https://github.com/internetarchive/openlibrary/blob/master/openlibrary/config.py either; it looks like a confusingly named / name-colliding version of infogami.config. This seems to work, but I'm not sure I fully understand how config is meant to be shared or passed around parts of OL.

from lxml import etree
import xml.parsers.expat
import urllib2, os.path, socket
from time import sleep
import traceback
from openlibrary.core import ia

base = "https://archive.org/download/"
load_config('conf/openlibrary.yml')
IA_BASE_URL = config.get('ia_base_url')
Member:

👍

@@ -32,7 +35,8 @@ def bad_ia_xml(ia):
# need to handle 404s:
# http://www.archive.org/details/index1858mary
loc = ia + "/" + ia + "_marc.xml"
Member:

What is ia here? Is it from "from openlibrary.core import ia"?

@@ -32,7 +35,8 @@ def bad_ia_xml(ia):
# need to handle 404s:
# http://www.archive.org/details/index1858mary
loc = ia + "/" + ia + "_marc.xml"
return '<!--' in urlopen_keep_trying(base + loc).read()

return '<!--' in urlopen_keep_trying(IA_DOWNLOAD_URL + loc).read()

def get_marc_ia_data(ia, host=None, path=None):
Member:

Especially confusing since this clobbers the ia name imported above (the openlibrary.core.ia module).

Collaborator Author:

Great catch; this ia is supposed to be an ocaid. I didn't change it earlier because these are (mostly?) deprecated methods and I'm not sure they are even used, but this naming conflict is a great reason to tidy it up, thanks!
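
A hypothetical sketch of that tidy-up: rename the parameter to ocaid so it no longer shadows the ia module; the function body shown is a placeholder, not the actual implementation:

from openlibrary.core import ia   # the module the old parameter name was shadowing

def get_marc_ia_data(ocaid, host=None, path=None):
    # 'ocaid' (an archive.org identifier) replaces the old 'ia' parameter name.
    loc = ocaid + '/' + ocaid + '_meta.mrc'
    return urlopen_keep_trying(IA_DOWNLOAD_URL + loc).read()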

@@ -348,7 +349,7 @@ def populate_edition_data(self, edition, identifier):
"""
edition['ocaid'] = identifier
edition['source_records'] = "ia:" + identifier
edition['cover'] = "https://archive.org/download/{0}/{0}/page/title.jpg".format(identifier)
edition['cover'] = "{0}/download/{1}/{1}/page/title.jpg".format(IA_BASE_URL, identifier)
Member:

Should this use IA_DOWNLOAD_URL instead of {0}/download?

Collaborator Author:

This is in a different file / module, and since it is only used once I thought creating only the one URL constant, IA_BASE_URL, was sufficient.


item_base = base + "/" + identifier + "/"
item_base = IA_DOWNLOAD_URL + '/' + identifier + '/'
Member:

Remove the + '/', I think? I believe IA_DOWNLOAD_URL ends with a slash already.
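
Illustrative only: one defensive way to build the URL that avoids a doubled slash whether or not IA_DOWNLOAD_URL ends with '/':

IA_DOWNLOAD_URL = 'https://archive.org/download/'   # value assumed for illustration
identifier = 'recurringwordsth00ethe'                # example ocaid from this review

# Strip any trailing slash before re-adding exactly one.
item_base = '%s/%s/' % (IA_DOWNLOAD_URL.rstrip('/'), identifier)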

from infogami import config

monkeypatch.delattr(config, "coverstore_url", raising=False)
assert h.get_coverstore_url() == "https://covers.openlibrary.org"
Member:

This seems like another good opportunity to move the canonical covers service address into config.

Collaborator Author:

This test checks that the fallback cover URL (in the absence of config or a config file) is https://covers.openlibrary.org: the value is taken from config if it exists, otherwise it is set to https://covers.openlibrary.org, which seems OK to me, although we should be consistent. Not all config values have defaults; some expect the config to be set.

As policy, should we hardcode defaults, or simply rely on config?

Member:

I think we should rely on config (defaults).
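
A sketch of the rely-on-config-with-a-default pattern the test above exercises; the body shown is an assumption, not necessarily the helper's actual implementation:

from infogami import config

def get_coverstore_url():
    # Use the configured value when present, otherwise fall back to the
    # public covers service the test expects.
    return config.get('coverstore_url', 'https://covers.openlibrary.org')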

Member:

This can be a separate issue, merging for now
