Commits on Jun 25, 2011
  1. Merge pull request #1 from EugenDueck/patch-1

    Use character set from "Content-Type" response header in response-str
    committed Jun 25, 2011
  2. Added response-reader, which is similar to response-str, but returns a java.io.Reader instead of a slurped String.

    EugenDueck committed Jun 25, 2011
  3. response-str used slurp*, which always uses *default-encoding*, which is generally not related to the encoding of pages you want to crawl.

    This change makes it use the encoding as returned by HttpMethodBase.getResponseCharSet(), which uses the HTTP response header "Content-Type", or defaults to ISO-8859-1 if no charset is specified.
    
    This is a first step in figuring out the correct encoding of a page; the second step would be to inspect the actual content of the file headers (HTML, XML), which might override the encoding in the HTTP header.
    
    But this first step already lets me crawl non-Latin pages that I could not crawl before (a rough sketch of the resulting functions follows this commit list).
    
    By the way, we could have just used
    (.getResponseBodyAsString method)
    which "directly" gives us a back a String, using the Content-Type just as mentioned above. The reason I did not go this route was the following warning:
    WARNING: Going to buffer response body of large or unknown size. Using getResponseBodyAsStream instead is recommended.
    
    Not that this makes things much better, as we are now doing the same potentially memory-consuming operation on the Clojure slurp side, but at least I did not make a "breaking" change by introducing a new warning.
    
    See also:
    http://hc.apache.org/httpclient-3.x/charencodings.html#Request_Response_Body
    http://hc.apache.org/httpclient-3.x/apidocs/org/apache/commons/httpclient/HttpMethodBase.html#getResponseCharSet()
    
    P.S. There was a comment in the code saying "uses slurp* here otherwise we get a annoying warning from commons-client". I guess it refers to the warning mentioned above, so the comment really means "use getResponseBodyAsStream rather than getResponseBodyAsString", not (what I initially assumed) "use slurp* over slurp". slurp is (at least nowadays) part of clojure.core and does not require a dependency on duck-streams, so that is what I use here, and I don't get any warnings.
    EugenDueck committed Jun 25, 2011
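    A rough sketch of the two functions these commits describe, assuming a commons-httpclient 3.x HttpMethodBase that has already been executed; the names follow the commit messages, but the bodies are illustrative rather than the repository's actual code:

        (import '(org.apache.commons.httpclient HttpMethodBase)
                '(java.io InputStreamReader))

        ;; Wraps the raw response stream in a Reader that decodes with the
        ;; charset from the "Content-Type" header (ISO-8859-1 if none is given).
        (defn response-reader [^HttpMethodBase method]
          (InputStreamReader. (.getResponseBodyAsStream method)
                              (.getResponseCharSet method)))

        ;; Slurps that Reader into a String, so the encoding comes from the
        ;; server's response header rather than from *default-encoding*.
        (defn response-str [^HttpMethodBase method]
          (slurp (response-reader method)))

    Going through .getResponseBodyAsStream avoids the "Going to buffer response body of large or unknown size" warning that .getResponseBodyAsString emits, although slurping the Reader still reads the whole body into memory.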
Commits on Mar 16, 2009
Commits on Mar 10, 2009
  1. updated some docs

    committed Mar 10, 2009
Commits on Mar 8, 2009
  1. accounts for redirects

    committed Mar 8, 2009
  2. updated readme

    committed Mar 8, 2009
  3. updated readme

    committed Mar 8, 2009
  4. updated the readme

    committed Mar 8, 2009
  5. updated the readme

    committed Mar 8, 2009
  6. renamed scrape to crawl

    committed Mar 8, 2009
Commits on Mar 7, 2009
  1. scrape now accepts string args

    committed Mar 7, 2009
Commits on Mar 6, 2009
  1. updated readme

    committed Mar 6, 2009
  2. updated readme

    committed Mar 6, 2009
  3. updated readme

    committed Mar 6, 2009
  4. updated readme

    committed Mar 6, 2009
  5. update readme

    committed Mar 6, 2009
  6. updated readme

    committed Mar 6, 2009
  7. updated readme

    committed Mar 6, 2009
  8. updated readme

    committed Mar 6, 2009
  9. updated readme with textile

    committed Mar 6, 2009
  10. adding license

    committed Mar 6, 2009
Commits on Mar 5, 2009
Commits on Mar 4, 2009
Commits on Mar 3, 2009
  1. updated README

    committed Mar 3, 2009
  2. first commit

    committed Mar 3, 2009