
response-str used slurp*, which always uses *default-encoding*, which is generally not related to the encoding of pages you want to crawl.

This change makes it use the encoding returned by HttpMethodBase.getResponseCharSet(), which takes the charset from the HTTP "Content-Type" response header and defaults to ISO-8859-1 if no charset is specified.
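For illustration, the behaviour of getResponseCharSet() can be sketched as a small pure function (parse-charset is a hypothetical name for this sketch; the real logic lives inside HttpMethodBase and handles more edge cases):

```clojure
;; Sketch of how HttpMethodBase.getResponseCharSet() behaves:
;; pull the charset parameter out of the Content-Type header value,
;; falling back to ISO-8859-1 when none is given.
;; `parse-charset` is a hypothetical name, not part of commons-httpclient.
(defn parse-charset [content-type]
  (or (when content-type
        (second (re-find #"(?i)charset=[\"']?([\w\-]+)" content-type)))
      "ISO-8859-1"))

(parse-charset "text/html; charset=UTF-8") ; => "UTF-8"
(parse-charset "text/html")                ; => "ISO-8859-1"
```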

This is a first step in figuring out the correct encoding of a page; the second step would be to inspect the actual content of the document itself (the HTML meta tags or the XML declaration), which may override the encoding given in the HTTP header.
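That second step might look something like the following sketch, which sniffs the document body for an HTML meta tag or XML declaration (the regexes are deliberately naive assumptions of this sketch, not production-ready parsing):

```clojure
;; Sketch of the proposed second step: look inside the document
;; for an encoding declaration that overrides the HTTP header.
;; A real implementation would only scan the first bytes of the
;; body and handle more tag variants.
(defn sniff-declared-charset [content]
  (or
   ;; HTML: e.g. <meta charset="utf-8">
   (second (re-find #"(?i)<meta[^>]*charset=[\"']?([\w\-]+)" content))
   ;; XML: e.g. <?xml version="1.0" encoding="UTF-8"?>
   (second (re-find #"(?i)<\?xml[^>]*encoding=[\"']([\w\-]+)" content))))

(sniff-declared-charset "<meta charset=\"utf-8\">") ; => "utf-8"
(sniff-declared-charset "<html></html>")            ; => nil
```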

But this first step already lets me crawl non-latin pages that I could not crawl before.

By the way, we could have just used

(.getResponseBodyAsString method)

which directly gives us back a String, using the Content-Type charset just as mentioned above. The reason I did not go this route was the following warning:

WARNING: Going to buffer response body of large or unknown size. Using getResponseBodyAsStream instead is recommended.

Not that this makes things any better: we are now doing the same potentially all-memory-consuming operation on the Clojure slurp side. But at least I did not make a "breaking" change by introducing a new warning.
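The difference the charset makes can be seen with plain clojure.core slurp on an in-memory stream (a self-contained sketch, not the crawler code itself):

```clojure
;; Decoding the same bytes with two different charsets yields
;; different strings -- which is why passing the response charset
;; to slurp matters for non-latin pages.
(import 'java.io.ByteArrayInputStream)

(let [bytes (.getBytes "héllo wörld" "UTF-8")]
  [(slurp (ByteArrayInputStream. bytes) :encoding "UTF-8")
   (slurp (ByteArrayInputStream. bytes) :encoding "ISO-8859-1")])
;; => ["héllo wörld" "hÃ©llo wÃ¶rld"]
```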

See also:
http://hc.apache.org/httpclient-3.x/charencodings.html#Request_Response_Body
http://hc.apache.org/httpclient-3.x/apidocs/org/apache/commons/httpclient/HttpMethodBase.html#getResponseCharSet()

P.S. There was a comment in the code saying "uses slurp* here otherwise we get a annoying warning from commons-client". I guess it refers to the warning mentioned above, so it should really read "uses getResponseBodyAsStream rather than getResponseBodyAsString", and not (what I initially assumed) "use slurp* over slurp". slurp is (at least nowadays) part of clojure.core and does not require a dependency on duck-streams, so I switched to it here, and I don't get any warnings.
commit bf63b4748657ffa8f2ccb4c26c936fdb4bee4719 1 parent d6c804e
EugenDueck authored
Showing with 2 additions and 4 deletions.
  1. +2 −4 clj_web_crawler.clj
@@ -2,8 +2,7 @@
(:import (org.apache.commons.httpclient HttpClient NameValuePair URI HttpStatus)
(org.apache.commons.httpclient.cookie CookiePolicy CookieSpec)
(org.apache.commons.httpclient.methods GetMethod PostMethod DeleteMethod
- TraceMethod HeadMethod PutMethod))
- (:use [clojure.contrib.duck-streams :only (slurp*)]))
+ TraceMethod HeadMethod PutMethod)))
(defn redirect-location
"Returns the redirection location string in the method, nil or false if
@@ -74,8 +73,7 @@
(defn response-str
"Returns the response from the method as a string."
([method]
- ; uses slurp* here otherwise we get a annoying warning from commons-client
- (slurp* (.getResponseBodyAsStream method)))
+ (slurp (.getResponseBodyAsStream method) :encoding (.getResponseCharSet method)))
([method client]
(let [redirect (redirect-location method)
new-method (if redirect (method redirect))]