Branch: master
Commits on Jun 25, 2011
  1. Merge pull request #1 from EugenDueck/patch-1

    authored
    Use character set from "Content-Type" response header in response-str
  2. @EugenDueck

    Added response-reader, which is similar to response-str, but returns a java.io.Reader instead of a slurped String.

    EugenDueck authored
  3. @EugenDueck

    response-str used slurp*, which always uses *default-encoding*, which is generally not related to the encoding of pages you want to crawl.

    EugenDueck authored
    
    
    This change makes it use the encoding as returned by HttpMethodBase.getResponseCharSet(), which uses the HTTP response header "Content-Type", or defaults to ISO-8859-1 if no charset is specified.
    
    This is a first step in figuring out the correct encoding of a page. The second step would be to inspect the encoding declared in the content itself (HTML or XML headers), which might override the encoding in the HTTP header.
    
    But this first step already lets me crawl non-latin pages that I could not crawl before.
    
    By the way, we could have just used
    (.getResponseBodyAsString method)
    which "directly" gives us back a String, using the Content-Type just as mentioned above. The reason I did not go this route was the following warning:
    WARNING: Going to buffer response body of large or unknown size. Using getResponseBodyAsStream instead is recommended.
    
    Not that this makes things any better, as now we are doing the same potentially all-memory-consuming operation on the Clojure slurp side, but at least I did not make a "breaking" change by introducing a new warning.
    
    See also:
    http://hc.apache.org/httpclient-3.x/charencodings.html#Request_Response_Body
    http://hc.apache.org/httpclient-3.x/apidocs/org/apache/commons/httpclient/HttpMethodBase.html#getResponseCharSet()
    
    P.S. There was a comment in the code saying "uses slurp* here otherwise we get a annoying warning from commons-client". I guess it refers to the warning I mentioned above, so it means "uses getResponseBodyAsStream rather than getResponseBodyAsString" and not, as I initially assumed, "use slurp* over slurp". slurp is (at least nowadays) part of Clojure core and does not require us to depend on duck-streams, so I decided to use it here, and I don't get any warnings.
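
The charset handling described in the commit message above can be sketched roughly as follows. This is a minimal stdlib-only Java illustration, not the project's Clojure code, and it does not call Commons HttpClient: the class `ResponseCharsetSketch` and the helpers `charsetFromContentType`, `responseReader`, and `responseStr` are hypothetical names that only mimic the behavior the message attributes to getResponseCharSet() (use the charset from "Content-Type", otherwise ISO-8859-1).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ResponseCharsetSketch {

    // Hypothetical helper: pick the charset named in a Content-Type header
    // value, falling back to ISO-8859-1 when none is given or it is unknown.
    static Charset charsetFromContentType(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim();
                // Case-insensitive match on the "charset=" parameter name.
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Illegal or unsupported charset name: use the default.
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    // Analogue of response-reader: decode the body stream lazily via a Reader.
    static Reader responseReader(InputStream body, String contentType) {
        return new InputStreamReader(body, charsetFromContentType(contentType));
    }

    // Analogue of response-str: read the whole body into memory as a String.
    static String responseStr(InputStream body, String contentType) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try (Reader r = responseReader(body, contentType)) {
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Three non-Latin characters, encoded as UTF-8 bytes.
        byte[] bytes = "\u65e5\u672c\u8a9e".getBytes(StandardCharsets.UTF_8);
        String decoded = responseStr(new ByteArrayInputStream(bytes),
                                     "text/html; charset=UTF-8");
        System.out.println(decoded.length());
    }
}
```

As in the commit, the String variant still buffers the entire body, so the Reader variant is the one to prefer for large or unknown-size responses.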
Commits on Mar 16, 2009
Commits on Mar 10, 2009
  1. updated some docs

    authored
Commits on Mar 8, 2009
  1. accounts for redirects

    authored
  2. updated readme

    authored
  3. updated readme

    authored
  4. updated the readme

    authored
  5. updated the readme

    authored
  6. renamed scrape to crawl

    authored
Commits on Mar 7, 2009
Commits on Mar 6, 2009
  1. updated readme

    authored
  2. updated readme

    authored
  3. updated readme

    authored
  4. updated readme

    authored
  5. update readme

    authored
  6. updated readme

    authored
  7. updated readme

    authored
  8. updated readme

    authored
  9. updated readme with textile

    authored
  10. adding license

    authored
Commits on Mar 5, 2009
Commits on Mar 4, 2009
Commits on Mar 3, 2009
  1. updated README

    authored
  2. first commit

    authored