Broken gutenberg.org search #12

Open
rlevi opened this Issue Oct 19, 2012 · 4 comments

2 participants

@rlevi
rlevi commented Oct 19, 2012

Hello! This looks like a bug:

  • gutenberg.org now denies the default Python User-Agent;
  • the HTML structure of the search results has changed;

so the search option doesn't work by default :( However, the following kludgy patch solves this:

diff --git a/mgutenberg/gutenbergweb.py b/mgutenberg/gutenbergweb.py
index 6c6e020..c1206c7 100644
--- a/mgutenberg/gutenbergweb.py
+++ b/mgutenberg/gutenbergweb.py
@@ -178,8 +178,6 @@ _GUTEN_ETEXT_RE_0 = _re.compile("""

 _GUTEN_ETEXT_RE_1 = _re.compile('''
 .*?
-<tr[^>]*pgterms:file[^>]*>
-  \s*
   <td[^>]*dcterms:format[^>]*>
     \s*
     <a.*\s+href="(?P<url>[^"]*)"[^>]*>(?P<description>[^>]*)</a>
diff --git a/mgutenberg/util.py b/mgutenberg/util.py
index f4960b6..856fc9c 100644
--- a/mgutenberg/util.py
+++ b/mgutenberg/util.py
@@ -12,6 +12,7 @@ class HTTPError(IOError):
         return "HTTP: %d %s" % (self.args[1], self.args[2])

 class MyURLOpener(_urllib.FancyURLopener):
+    version = "Mgutenberg"
     def http_error_default(self, url, fp, errcode, errmsg, headers):
         fp.close()
         raise HTTPError(errcode, errmsg, headers)   
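
For anyone on Python 3, where `FancyURLopener` is deprecated, the same User-Agent override can be sketched with `urllib.request`. This is only an illustration and not part of the patch above: `build_request` and `fetch` are hypothetical helper names, and the "Mgutenberg" string is simply copied from the diff.

```python
from urllib.request import Request, urlopen

# Same identifying string as in the patch above; any non-default value
# would presumably do, this one is just reused for consistency.
USER_AGENT = "Mgutenberg"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that sends a custom User-Agent header
    instead of the default Python one."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=30):
    """Fetch url with the custom User-Agent and return the raw bytes.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses,
    which plays the role of http_error_default in the patch.
    """
    with urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()
```

Keeping `build_request` separate from `fetch` means the header logic can be checked without touching the network.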

@pv
Owner
pv commented Oct 19, 2012

Hi, thanks! It does indeed seem broken nowadays. Scraping HTML from web sites is a really hacky way to do this anyway; too bad that PG doesn't have a JSON (or similar) web interface. I need to consider another bugfix release...

@rlevi
rlevi commented Oct 23, 2012

Hi! It's a kludgy approach, but it works :)
Also, the current search URL (http://www.gutenberg.org/catalog/world/results) is now deprecated and sometimes doesn't work (as in #7). However, I couldn't find a way to query by author/title/etc. on /ebooks/ or /ebooks/search.mobile/ :(

@pv
Owner
pv commented Oct 23, 2012

The Internet Archive allows searching PG contents with JSON output: http://archive.org/help/json.php
This would probably be a better choice than scraping web pages.
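
A query along these lines might look roughly like the sketch below (Python 3). It rests on several assumptions that are not confirmed in this thread: that the `advancedsearch.php` endpoint is used with `output=json`, that PG texts sit in a collection named `gutenberg`, and that `title` and `creator` are the relevant search fields. The helper names `build_search_url` and `extract_docs` are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: Internet Archive advanced search with JSON output.
IA_SEARCH_URL = "https://archive.org/advancedsearch.php"


def build_search_url(title=None, author=None, rows=10):
    """Build a search URL restricted to the (assumed) 'gutenberg' collection."""
    terms = ["collection:gutenberg"]
    if title:
        terms.append('title:"%s"' % title.replace('"', ""))
    if author:
        terms.append('creator:"%s"' % author.replace('"', ""))
    params = [
        ("q", " AND ".join(terms)),
        ("fl[]", "identifier"),  # fields to return for each hit
        ("fl[]", "title"),
        ("fl[]", "creator"),
        ("rows", rows),
        ("output", "json"),
    ]
    return IA_SEARCH_URL + "?" + urlencode(params)


def extract_docs(payload):
    """Pull the list of result records out of a decoded JSON response.

    Assumes the layout {"response": {"docs": [...]}}; returns an
    empty list if that structure is missing.
    """
    return payload.get("response", {}).get("docs", [])


def search(title=None, author=None, rows=10, timeout=30):
    """Run the query over the network and return the result records."""
    with urlopen(build_search_url(title, author, rows), timeout=timeout) as resp:
        return extract_docs(json.loads(resp.read().decode("utf-8")))
```

Compared with the regex in `gutenbergweb.py`, this depends on a documented response layout rather than on page markup, which is the main appeal of the JSON route.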

@rlevi
rlevi commented Oct 23, 2012

That's a nice API. Thanks for the link!
