Broken search #12

rlevi opened this Issue Oct 19, 2012 · 4 comments



rlevi commented Oct 19, 2012

Hello! This looks like a bug:

  • now denies the default Python User-Agent;
  • the HTML structure of the search results has changed;

so the search option doesn't work by default :( However, the following kludgy patch solves this:

diff --git a/mgutenberg/ b/mgutenberg/
index 6c6e020..c1206c7 100644
--- a/mgutenberg/
+++ b/mgutenberg/
@@ -178,8 +178,6 @@ _GUTEN_ETEXT_RE_0 = _re.compile("""

 _GUTEN_ETEXT_RE_1 = _re.compile('''
-  \s*
diff --git a/mgutenberg/ b/mgutenberg/
index f4960b6..856fc9c 100644
--- a/mgutenberg/
+++ b/mgutenberg/
@@ -12,6 +12,7 @@ class HTTPError(IOError):
         return "HTTP: %d %s" % (self.args[1], self.args[2])

 class MyURLOpener(_urllib.FancyURLopener):
+    version = "Mgutenberg"
     def http_error_default(self, url, fp, errcode, errmsg, headers):
         raise HTTPError(errcode, errmsg, headers)   
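
For readers on Python 3, where `FancyURLopener` is deprecated, the same idea (sending something other than the default User-Agent) could be sketched with `urllib.request`. This is only an illustration, not part of the patch: the helper name `build_request` and the placeholder URL are hypothetical, and the agent string is simply the one used in the diff.

```python
import urllib.request

# Same agent string as in the patch; any non-default value is assumed to work.
USER_AGENT = "Mgutenberg"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Placeholder URL; nothing is fetched until urllib.request.urlopen(req) is called.
req = build_request("http://example.invalid/search")
print(req.get_header("User-agent"))
```

Building the request separately from opening it keeps the header logic easy to check without touching the network.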

pv commented Oct 19, 2012

Hi, thanks! It does indeed seem broken nowadays. Scraping HTML from web sites is a really hacky way to do this anyway; too bad that PG doesn't have a JSON/etc. web interface. I need to consider another bugfix release...

rlevi commented Oct 23, 2012

Hi! It's a kludgy approach, but it works :)
Also, the current search URL is deprecated now and sometimes doesn't work (like in #7). However, I failed to find a way to query by author/title/etc. on /ebooks/ or /ebooks/ :(

pv commented Oct 23, 2012

Internet Archive allows searching PG contents with JSON output,
This would probably be a better choice than scraping web pages.
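
A rough sketch of what such a query might look like in Python 3. Everything specific here is an assumption rather than something taken from this thread: the `advancedsearch.php` endpoint with `output=json`, the `collection:gutenberg` filter, the `title`/`creator` field names, and the `response.docs` layout of the reply. The helper names are hypothetical.

```python
import json
import urllib.parse

# Assumed endpoint for Internet Archive searches with JSON output.
IA_SEARCH_URL = "https://archive.org/advancedsearch.php"


def build_search_url(title=None, author=None, rows=10):
    """Build a search URL restricted to the (assumed) gutenberg collection."""
    clauses = ["collection:gutenberg"]
    if title:
        clauses.append('title:"%s"' % title)
    if author:
        clauses.append('creator:"%s"' % author)
    params = [
        ("q", " AND ".join(clauses)),
        ("fl[]", "identifier"),  # fields to return, one fl[] per field
        ("fl[]", "title"),
        ("rows", str(rows)),
        ("output", "json"),
    ]
    return IA_SEARCH_URL + "?" + urllib.parse.urlencode(params)


def parse_results(payload):
    """Extract (identifier, title) pairs from a JSON reply string."""
    docs = json.loads(payload).get("response", {}).get("docs", [])
    return [(d.get("identifier"), d.get("title")) for d in docs]


url = build_search_url(author="Twain", rows=5)
# Fetching is left out on purpose; urllib.request.urlopen(url).read()
# would supply the payload for parse_results().
```

Compared with regex-matching HTML, a change in page layout would no longer break the search, only a change in the JSON schema would.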

rlevi commented Oct 23, 2012

That's a nice API. Thanks for the link!
