A web scraping library inspired by Scrapy, built with core.async.
I started this project mainly as an exercise for learning core.async. I have used Scrapy before, but it's a big project and attercop mostly covers only those features of Scrapy that I have used myself. Not recommended for production use yet.
[attercop "0.1.0-SNAPSHOT"](let [config {:name "test-spider"
:allowed-domains #{"localhost" "127.0.0.1"}
:start-urls ["http://127.0.0.1:5000/"]
:rules [[#"[a-zA-Z]+.html$" {:scrape (fn [x] ..)
:follow true}]
[#"\w+-\d+.html" {:scrape (fn [x] ..)
:follow true}]
[:default {:scrape nil :follow false}]]
:pipeline [(fn [x] ..)
prn]
:max-wait 5000}]
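The `(fn [x] ..)` placeholders above stand for your own scrape and pipeline functions. As described under `:rules` below, a scrape function is called with the response map (whose `:html-nodes` are enlive nodes) and must return a seq of items, and each pipeline function receives an item and returns a possibly modified one. The function names and item keys in the following sketch are invented for illustration and are not part of attercop's API:

```clojure
(require '[net.cgrand.enlive-html :as html])

;; A :scrape function is called with the response map
;; ({:url .. :status .. :html .. :html-nodes ..}) and must return a seq of
;; items. Here it collects every anchor on the page.
(defn scrape-links
  [{:keys [url html-nodes]}]
  (for [a (html/select html-nodes [:a])]
    {:source url
     :href   (get-in a [:attrs :href])
     :text   (html/text a)}))

;; A pipeline function is called with one scraped item and should return a
;; (possibly modified) item for the next function in the pipeline.
(defn add-timestamp
  [item]
  (assoc item :scraped-at (System/currentTimeMillis)))
```

With these, the config above could use `:scrape scrape-links` in a rule and `:pipeline [add-timestamp prn]`.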
See `attercop.examples.dmoz-lisp` for a working example. It crawls pages on the dmoz directory that are related to Lisp-based functional programming languages and scrapes the links to resources for each language. It can be run as follows:
```
$ lein trampoline run -m attercop.examples.dmoz_lisp
```

List of the keys that can be specified in the spider config:
- `:name` (required, string): A human readable name for the spider.
- `:allowed-domains` (required, set): Only URLs of these domains will be considered for scraping and crawling.
- `:start-urls` (required, seq): The spider will start crawling/scraping with these URLs.
- `:rules` (required, seq): This is the most important part of the config; it defines what exactly is to be done for URLs matching a particular pattern. It's a sequence of rules where each rule is a vector of two elements:
  - a regex for matching the URL
  - a map with the fields `:scrape` and `:follow`

  The value of `:scrape` can either be a function or `nil`. If a function is provided, it will be used to scrape the page and it should return a sequence; even if only 1 item is to be scraped off the page, it must be wrapped in a single element seq. If `:scrape` is `nil`, the page will not be scraped. `:follow`, on the other hand, can take 3 kinds of values:
  - `true`: the page will be followed, i.e. URLs on the page (that belong to the allowed domains) will be queued for crawling
  - `false` or `nil`: the page will not be followed
  - a function: if a function is provided, it is expected to return a sequence of URLs to crawl

  The function values for both `:scrape` and `:follow` will be called with the response map, which has the keys `:url`, `:status`, `:html` and `:html-nodes` (the latter are enlive nodes, so enlive functions can be called on them directly). See the sketch after this list for a config that uses such functions.

  Besides the above kind of rules, there is a special default rule which is recommended to be added as the last rule in the seq:

  ```clojure
  [:default {:scrape nil :follow false}]
  ```

  This makes sure that URLs which don't match any of the previous rules are ignored (both `:scrape` and `:follow` falsy). If you forget to add it, the spider will itself add a no-op default rule at the end of the rules sequence.
- `:pipeline` (required, seq): A seq of functions which will be called in order for every scraped item. The scraped item is passed to the first function, and each function should return a (possibly modified) version of the item, which is then passed to the next function in the pipeline. The result of the last function in the seq is ignored. Typically the pipeline functions either process the item data or store it somewhere for later use.
- `:max-wait` (optional, number, default: 5000): The number of milliseconds the spider will wait for HTTP requests to complete.
- `:rate-limit` (optional, vector, default: `[5 3000]`): The spider will use this config to rate limit the HTTP requests. A value `[m n]` means a maximum of `m` requests will be made in `n` ms, so the default is 5 requests in 3000 ms.
- `:handle-status-codes` (optional, set, default: `#{}`): Additional status codes to be handled by the scraper besides the standard valid ones, i.e. 2xx.
- `:graceful-shutdown?` (optional, boolean|number, default: 5000): If non-falsy, the spider will be shut down gracefully, i.e. all queued URLs will be processed before stopping. The truthy value may be a boolean or a number representing the time to wait in milliseconds; if it's a boolean, the timeout will be the same as `:max-wait`.
- `:user-agent` (optional, string, default: `Attercop (https://github.com/naiquevin/attercop)`): The user agent string to use when crawling.
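Putting the options together, here is a hedged sketch of a complete config that uses a function value for `:follow` and a few of the optional keys. The domain, URL patterns and helper function are invented for illustration; only the config keys themselves are documented above.

```clojure
(require '[net.cgrand.enlive-html :as html])

;; Hypothetical :follow function: like :scrape, it is called with the response
;; map and here returns the seq of URLs that should be queued for crawling.
(defn follow-article-links
  [{:keys [html-nodes]}]
  (->> (html/select html-nodes [:a])
       (map #(get-in % [:attrs :href]))
       (filter #(and % (re-find #"/articles/" %)))))

(def config
  {:name "example-spider"
   :allowed-domains #{"example.com"}
   :start-urls ["http://example.com/"]
   :rules [[#"/articles/\d+$" {:scrape (fn [{:keys [url]}] [{:url url}])
                               :follow follow-article-links}]
           ;; recommended catch-all rule so unmatched URLs are ignored
           [:default {:scrape nil :follow false}]]
   :pipeline [prn]
   :max-wait 5000                ; wait up to 5000 ms for each HTTP request
   :rate-limit [5 3000]          ; at most 5 requests every 3000 ms
   :handle-status-codes #{404}   ; also handle 404 responses
   :graceful-shutdown? true      ; drain queued URLs before stopping
   :user-agent "example-spider (+http://example.com/bot)"})

;; (attercop.spider/run config)
```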
To run the tests:

```
$ lein test
```

TODO:

- Add support for middlewares (Spider and Downloader as in Scrapy)
- Respect robots.txt
Copyright © 2015 Vineet Naik
Distributed under the Eclipse Public License, the same as Clojure.