Crawlista is a support library for Clojure applications that crawl the Web

michaelklishin/crawlista

What is Crawlista

Crawlista is a support library for Clojure applications that crawl the Web.

Features

  • Detection of crawlable content (using Pantomime)
  • Extraction of links with normalization (using Urly)
  • Detection of link presence
  • Extraction of content (like article text) from pages (using Boilerpipe Core)

Usage

Installation

With Leiningen

[clojurewerkz/crawlista "1.0.0-alpha17"]

Artifacts are published to clojars.org.
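
As a minimal sketch, the dependency vector above would sit inside the `:dependencies` of a Leiningen `project.clj`. The project name `my-crawler`, its version, and the exact Clojure version shown here are placeholders assumed for illustration, not values taken from Crawlista itself.

```clojure
;; project.clj (hypothetical application that depends on Crawlista)
(defproject my-crawler "0.1.0-SNAPSHOT"
  ;; Any Clojure release from 1.3 on should work; 1.3.0 is only an example.
  :dependencies [[org.clojure/clojure "1.3.0"]
                 [clojurewerkz/crawlista "1.0.0-alpha17"]])
```

Leiningen resolves artifacts from clojars.org by default, so no extra repository entry should be needed.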

Supported Clojure versions

Crawlista is built from the ground up for Clojure 1.3 and up.

Documentation & Examples

Crawlista is a work in progress. Please see our test suite for code examples. Once the APIs and core functionality stabilize, we will begin writing documentation guides and update this document.

Crawlista Is a ClojureWerkz Project

Crawlista is part of the group of Clojure libraries known as ClojureWerkz, together with Neocons, Langohr, Elastisch, Welle, Monger, Quartzite and several others.

Continuous Integration

CI is hosted by travis-ci.org

Development

Crawlista uses Leiningen 2. Make sure you have it installed and then run tests against all supported Clojure versions using

lein all test

Then create a branch and make your changes on it. Once you are done with your changes and all tests pass, submit a pull request on GitHub.

Regenerating robots.txt Parser

If you make changes to the Ragel-based robots.txt parser in Crawlista, you need to regenerate it:

ragel -J -o src/java/clojurewerkz/crawlista/robots/Parser.java src/rl/clojurewerkz/crawlista/robots/Parser.rl
lein javac

and then run the robots.txt parser test suite with

lein test :robots

License

Copyright (C) 2011-2016 Michael S. Klishin, Alex Petrov, and the ClojureWerkz team.

Distributed under the Eclipse Public License, the same as Clojure.
