Skip to content
This repository

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

A wrapper around Apache commons-client for the Clojure programming language.

tree: 7101a86a70

Fetching latest commit…

Octocat-spinner-32-eaf2f5

Cannot retrieve the latest commit at this time

Octocat-spinner-32 test
Octocat-spinner-32 MIT-license.txt adding license
Octocat-spinner-32 README.textile
Octocat-spinner-32 clj_web_crawler.clj
README.textile

clj-web-crawler allows you to easily crawl and scrape the web from the comfort of
your own computer. This Clojure script is a wrapper around the Apache
commons-client Java library.

Usage



; The crawl macro also accepts a client, a method and a body. 
; In the body of crawl, you can examine various things like 
; the status code, the response, etc.
(let [clj-ws (client "http://www.clojure.org")
      home  (method "/")] 
  (crawl clj-ws home
    (println (.getStatusCode home))  
    (println (response-str home))))  

; If you need to login to a website you can do that too and
; verify any cookies are set to validate the login
(let [site (client "http://www.example.com")
      login (method "/accounts/login" :post {:login "doctor_no" :password "clojurerox"})] 
  (crawl site login 
    (if (assert-cookie-names site "username" "logged-time")  
      (println "yeah, I'm in")
      (println "I can't remember my password again!"))))

; the crawl-response function is for quick and dirty things and
; just returns the response as a string from the server
(println (crawl-response "http://www.clojure.org"))

; prints the HTML of the clojure.org/api page 
(println (crawl-response "http://www.clojure.org" "/api"))

I’ve only implemented some basic functionality to make the commons-client
more in line with functional programming. There are lots of things that
could be added to this Clojure script. I’ve just scratched the surface.

Since I’m calling commons-client under the covers I’ve using the underlying
naming conventions. The client function returns a
org.apache.commons.httpclient.HttpClient and the method function returns
an implementation of org.apache.commons.httpclient.HttpMethodBase. Please
see the commons-client API docs for reference.

Requirements

  • clojure-contrib
  • Apache commons-client 3.1 and its dependencies (only tested with the 3.1 version)

Any comments, suggestions, improvements are welcome!

Something went wrong with that request. Please try again.