Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

Use character set from "Content-Type" response header in response-str #1

Merged
merged 2 commits into from Jun 25, 2011

Conversation

Projects
None yet
2 participants
Contributor

EugenDueck commented Jun 25, 2011

response-str used **slurp*** to convert bytes to characters, which uses *default-encoding*, and that is generally not related to the encoding of pages you want to crawl.

This change makes it use the encoding as returned by HttpMethodBase.getResponseCharSet(), which uses the HTTP response header "Content-Type", or defaults to ISO-8859-1 if no charset is specified.

This is a first step in figuring out the correct encoding of a page, and the second step would be to inspect the actual content of the file headers (html, xml), that might override the encoding in the http header.

But this first step already lets me crawl non-latin pages that I could not crawl before.

By the way, we could have just used

(.getResponseBodyAsString method)

which "directly" gives us a back a String, using the Content-Type just as mentioned above. The reason I did not go this route was the following warning:

WARNING: Going to buffer response body of large or unknown size. Using getResponseBodyAsStream instead is recommended.

Not that this makes things any better, as now we are doing the same potentially all-memory-consuming operation on the clojure slurp side, but at least I did not make a "breaking" change by introducing a new warning.

See also:
http://hc.apache.org/httpclient-3.x/charencodings.html#Request_Response_Body
http://hc.apache.org/httpclient-3.x/apidocs/org/apache/commons/httpclient/HttpMethodBase.html#getResponseCharSet()

P.S. There was a comment in the code, saying "uses slurp* here otherwise we get a annoying warning from commons-client". I guess it is referring to the warning I mentioned above, so it would be a comment to the effect "uses getResponseBodyAsStream rather than getResponseAsString" and not (what I initially assumed) "use slurp* over slurp" which (at least nowadays) is part of clojure core and would not require us to depend on duck-streams, which I decided to do here and I don't get any warnings.

EugenDueck added some commits Jun 25, 2011

response-str used slurp*, which always uses *default-encoding*, which…
… is generally not related to the encoding of pages you want to crawl.


This change makes it use the encoding as returned by HttpMethodBase.getResponseCharSet(), which uses the HTTP response header "Content-Type", or defaults to ISO-8859-1 if no charset is specified.

This is a first step in figuring out the correct encoding of a page, and the second step would be to inspect the actual content of the file headers (html, xml), that might override the encoding in the http header.

But this first step already lets me crawl non-latin pages that I could not crawl before.

By the way, we could have just used
(.getResponseBodyAsString method)
which "directly" gives us a back a String, using the Content-Type just as mentioned above. The reason I did not go this route was the following warning:
WARNING: Going to buffer response body of large or unknown size. Using getResponseBodyAsStream instead is recommended.

Not that this makes things any better, as now we are doing the same potentially all-memory-consuming operation on the clojure slurp side, but at least I did not make a "breaking" change by introducing a new warning.

See also:
http://hc.apache.org/httpclient-3.x/charencodings.html#Request_Response_Body
http://hc.apache.org/httpclient-3.x/apidocs/org/apache/commons/httpclient/HttpMethodBase.html#getResponseCharSet()

P.S. There was a comment in the code, saying "uses slurp* here otherwise we get a annoying warning from commons-client". I guess it is referring to the warning I mentioned above, so it would be a comment to the effect "uses getResponseBodyAsStream rather than getResponseAsString" and not (what I initially assumed) "use slurp* over slurp" which (at least nowadays) is part of clojure core and would not require us to depend on duck-streams, which I decided to do here and I don't get any warnings.
Added response-reader, which is similar to response-str, but returns …
…a java.io.Reader instead of a slurped String.

heyZeus added a commit that referenced this pull request Jun 25, 2011

Merge pull request #1 from EugenDueck/patch-1
Use character set from "Content-Type" response header in response-str

@heyZeus heyZeus merged commit b61dd15 into heyZeus:master Jun 25, 2011

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment