New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Import archive.org bulk marc items #1058
Conversation
openlibrary/catalog/get_ia.py
Outdated
|
||
r0, r1 = offset, offset+length-1 | ||
url = 'http://' + host + path + '/' + rest | ||
url = 'https://archive.org/download/%s' % filename |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If possible, it would be great to have this as a config (as it was before)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we extract archive.org into a variable and also not comment the code above but instead have find_item fall back to this if it fails:
def resolve_server(identifier):
"""
Returns:
a dict containing this item's:
server: e.g. ia601507.us.archive.org
dir: on disk containing files, e.g. /2/items/recurringwordsth00ethe
"""
metadata = requests.get('%s/metadata/%s' % (API_BASEURL, identifier)).json()
return {
'dir': metadata['dir'],
'server': metadata['server']
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The case we care about is that we're not changing how this code works on production because whatever black magic this is works...
But because it breaks on dev, we want to at least provide a way to unblock developers in the case where the existing code path fails
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's bypass all of this by having an archive_org_url which is in the default openlibrary.yml config which we use here w/ the /download
link
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got distracted by the docstrings :P The logic of this general section of the code still escapes me :/
Looks like github lets your resolve conversations in a code review now! \o/ |
openlibrary/catalog/get_ia.py
Outdated
ending = 'meta.mrc' | ||
if host and path: | ||
url = 'http://%s%s/%s_%s' % (host, path, ia, ending) | ||
else: | ||
url = 'http://www.archive.org/download/' + ia + '/' + ia + '_' + ending | ||
url = base + ia + '/' + ia + '_' + ending |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where's base coming from, can this be capitalized so it's more clear its a constant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should really be read from the config file, does capitalizing still apply if we read it from config?
openlibrary/catalog/get_ia.py
Outdated
@@ -272,18 +224,15 @@ def marc_formats(ia, host=None, path=None): | |||
if host and path: | |||
url = 'http://%s%s/%s_%s' % (host, path, ia, ending) | |||
else: | |||
url = 'http://www.archive.org/download/' + ia + '/' + ia + '_' + ending | |||
url = base + ia + '/' + ia + '_' + ending |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please make base more apparently a CONST?
:param str identifier: ocaid | ||
:rtype: dict | ||
:return: Edition record | ||
""" | ||
edition['ocaid'] = identifier | ||
edition['source_records'] = "ia:" + identifier | ||
edition['cover'] = "https://archive.org/download/{0}/{0}/page/title.jpg".format(identifier) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this also use base
/ BASE
?
suggestions, otherwise LGTM. Is there anything unstable preventing this from being merged? Would prefer if we can keep change-sets on the smaller side (hard to find a contiguous block of time to peruse through the review :) ) |
@mekarpeles re. language codes, they are ISO 639-2 , which are all three character representations. edit, you may be thinking of locations codes (can't remember the standard off the top of my head), most are 2 character for countries, but there are 3 character codes for many US locations. |
@mekarpeles I agree, this PR is getting too big. There are still things I think need to be refactored down and made sensible, but it'll need to be broken down into smaller chunks. Once I address the base url / config comment it should be good to merge as I am confident it adds the bulk import functionality taking advantage of the existing import api. If there are problems, they are likely to be existing issues with imports. |
openlibrary/catalog/get_ia.py
Outdated
from lxml import etree | ||
import xml.parsers.expat | ||
import urllib2, os.path, socket | ||
from time import sleep | ||
import traceback | ||
from openlibrary.core import ia | ||
|
||
IA_BASE_URL = 'https://archive.org' | ||
load_config('conf/openlibrary.yml') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mekarpeles I'm not completely happy about this line, openlibrary/plugins/importapi/code.py
already has config available via config.get()
, but openlibrary/catalog/get_ia.py
does not 🤷♂️ I don't understand the purpose of https://github.com/internetarchive/openlibrary/blob/master/openlibrary/config.py either, it looks like a confusingly named / name collision version of infogami.config
. This seems to work, but I'm not sure I fully understand how config is meant to be shared, or passed around parts of OL.
from lxml import etree | ||
import xml.parsers.expat | ||
import urllib2, os.path, socket | ||
from time import sleep | ||
import traceback | ||
from openlibrary.core import ia | ||
|
||
base = "https://archive.org/download/" | ||
load_config('conf/openlibrary.yml') | ||
IA_BASE_URL = config.get('ia_base_url') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
openlibrary/catalog/get_ia.py
Outdated
@@ -32,7 +35,8 @@ def bad_ia_xml(ia): | |||
# need to handle 404s: | |||
# http://www.archive.org/details/index1858mary | |||
loc = ia + "/" + ia + "_marc.xml" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is ia
here? from openlibrary.core import ia
openlibrary/catalog/get_ia.py
Outdated
@@ -32,7 +35,8 @@ def bad_ia_xml(ia): | |||
# need to handle 404s: | |||
# http://www.archive.org/details/index1858mary | |||
loc = ia + "/" + ia + "_marc.xml" | |||
return '<!--' in urlopen_keep_trying(base + loc).read() | |||
|
|||
return '<!--' in urlopen_keep_trying(IA_DOWNLOAD_URL + loc).read() | |||
|
|||
def get_marc_ia_data(ia, host=None, path=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
especially confusing since this clobbers ia
the variable imported above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great catch, this ia is supposed to be an ocaid -- I didn't change it earlier because these are (mostly?) deprecated methods and I'm not sure that they are even used, but this naming conflict is a great reason to tidy it up, thanks!
@@ -348,7 +349,7 @@ def populate_edition_data(self, edition, identifier): | |||
""" | |||
edition['ocaid'] = identifier | |||
edition['source_records'] = "ia:" + identifier | |||
edition['cover'] = "https://archive.org/download/{0}/{0}/page/title.jpg".format(identifier) | |||
edition['cover'] = "{0}/download/{1}/{1}/page/title.jpg".format(IA_BASE_URL, identifier) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this use IA_DOWNLOAD_URL instead of {0}/download
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is in a different file / module, and since it is only used once I thought only creating the on URL constant for IA_BASE_URL was sufficient.
openlibrary/catalog/get_ia.py
Outdated
|
||
item_base = base + "/" + identifier + "/" | ||
item_base = IA_DOWNLOAD_URL + '/' + identifier + '/' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove + '/'
I think? I believe IA_DOWNLOAD_URL ends w/ a slash already
from infogami import config | ||
|
||
monkeypatch.delattr(config, "coverstore_url", raising=False) | ||
assert h.get_coverstore_url() == "https://covers.openlibrary.org" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like another good opportunity to move canonical covers service address into a config
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is testing that the fallback (in the absence of config or config file) cover url is https://covers.openlibrary.org
, so it is taken form config, if it exists, otherwise is set to https://covers.openlibrary.org
, which seems ok to me, although we should be consistent. Not all config values have defaults and expect the config to be set.
As policy, should we hardcode defaults, or simply reply on config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should rely on config (defaults).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be a separate issue, merging for now
Currently work in progress,