Crawlista is a support library for Clojure applications that crawl the Web.
- Detection of crawlable content (using Pantomime)
- Extraction of links with normalization (using Urly)
- Detection of link presence
- Extraction of content (like article text) from pages (using Boilerpipe Core)
With Leiningen
[clojurewerkz/crawlista "1.0.0-alpha17"]
Artifacts are published to clojars.org.
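In a Leiningen project, the dependency vector above goes into the `:dependencies` list of `project.clj`. The following is a minimal sketch, not a canonical setup: the project name `my-crawler`, its version, and the Clojure version shown are assumptions chosen for illustration. Leiningen searches Clojars by default, so no extra repository entry should be needed.

```clojure
;; project.clj (hypothetical project; only the crawlista vector comes from this README)
(defproject my-crawler "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.8.0"]                 ; assumed version; Crawlista targets 1.3 and up
                 [clojurewerkz/crawlista "1.0.0-alpha17"]])    ; the artifact listed above
```

Running `lein deps` afterwards should fetch the artifact.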
Crawlista is built from the ground up for Clojure 1.3 and up.
Crawlista is a work in progress. Please see our test suite for code examples. Once the APIs and core functionality stabilize, we will begin writing documentation guides and update this document.
Crawlista is part of the group of Clojure libraries known as ClojureWerkz, together with Neocons, Langohr, Elastisch, Welle, Monger, Quartzite and several others.
CI is hosted by travis-ci.org.
Crawlista uses Leiningen 2. Make sure you have it installed and then run tests against all supported Clojure versions using:
lein all test
Then create a branch and make your changes on it. Once you are done with your changes and all tests pass, submit a pull request on GitHub.
If you make changes to the Ragel-based robots.txt parser in Crawlista, you need to regenerate it:
ragel -J -o src/java/clojurewerkz/crawlista/robots/Parser.java src/rl/clojurewerkz/crawlista/robots/Parser.rl
lein javac
and then run the robots.txt parser test suite with:
lein test :robots
Copyright (C) 2011-2016 Michael S. Klishin, Alex Petrov, and the ClojureWerkz team.
Distributed under the Eclipse Public License, the same as Clojure.