…a java.io.Reader instead of a slurped String.
… is generally not related to the encoding of the pages you want to crawl. This change makes the crawler use the encoding returned by HttpMethodBase.getResponseCharSet(), which takes the charset from the HTTP "Content-Type" response header and defaults to ISO-8859-1 if no charset is specified.

This is only the first step in figuring out the correct encoding of a page; the second step would be to inspect the actual content of the file headers (html, xml), which may override the encoding given in the HTTP header. But this first step already lets me crawl non-latin pages that I could not crawl before. Sketches of both steps follow below.

By the way, we could have just used the .getResponseBodyAsString method, which directly gives us back a String, using the Content-Type header exactly as described above. The reason I did not go that route was the following warning:

    WARNING: Going to buffer response body of large or unknown size. Using getResponseBodyAsStream instead is recommended.

Not that this makes things any better, since we now do the same potentially memory-consuming operation on the Clojure slurp side, but at least I did not make a "breaking" change by introducing a new warning.

See also:
http://hc.apache.org/httpclient-3.x/charencodings.html#Request_Response_Body
http://hc.apache.org/httpclient-3.x/apidocs/org/apache/commons/httpclient/HttpMethodBase.html#getResponseCharSet()

P.S. There was a comment in the code saying "uses slurp* here otherwise we get a annoying warning from commons-client". I guess it refers to the warning mentioned above, so it amounts to "use getResponseBodyAsStream rather than getResponseBodyAsString", and not (what I initially assumed) "use slurp* over slurp". The slurp* functionality is (at least nowadays) part of clojure core's slurp, so it would not require us to depend on duck-streams; that is what I did here, and I don't get any warnings.
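
For reference, here is a minimal sketch of what the first step amounts to, assuming commons-httpclient 3.x on the classpath; the function name fetch-body is mine for illustration, not from the actual code:

    (import '(org.apache.commons.httpclient HttpClient)
            '(org.apache.commons.httpclient.methods GetMethod)
            '(java.io InputStreamReader BufferedReader))

    (defn fetch-body
      "Fetches url and returns the body as a String, decoded with the
      charset from the Content-Type response header (getResponseCharSet
      falls back to ISO-8859-1 when none is given)."
      [url]
      (let [client (HttpClient.)
            method (GetMethod. url)]
        (try
          (.executeMethod client method)
          ;; wrap the raw byte stream in a Reader using the charset
          ;; reported by the server, then slurp the Reader
          (with-open [r (BufferedReader.
                          (InputStreamReader.
                            (.getResponseBodyAsStream method)
                            (.getResponseCharSet method)))]
            (slurp r))
          (finally
            (.releaseConnection method)))))

Note that slurp accepts a Reader in current clojure core; on older versions the duck-streams slurp* played that role.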
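
And a hedged sketch of what the second step (sniffing the in-document declaration) might look like; this is not part of this change, just an illustration of the idea, and sniff-charset is a hypothetical helper:

    (defn sniff-charset
      "Looks for a charset declared inside the document itself, e.g. in
      an <?xml ... encoding=\"...\"?> declaration or an html <meta>
      tag. Returns nil when no declaration is found."
      [^String head]
      (second (re-find #"(?i)(?:encoding|charset)\s*=\s*[\"']?([\w-]+)" head)))

    (sniff-charset "<?xml version=\"1.0\" encoding=\"UTF-8\"?><feed/>")
    ;; => "UTF-8"

In the crawler this would mean peeking at a prefix of the raw byte stream before committing to a Reader, which is the extra complexity that makes it a second step.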