Pegasus is a highly modular, durable, and scalable crawler for Clojure. Parallelism is achieved with core.async; durability is achieved with durable-queue and LMDB.
A blog post on how Pegasus works: [link]
Leiningen dependencies:
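Add Pegasus to the `:dependencies` vector of your `project.clj`. This is a sketch: the version string below is a placeholder, so check Clojars for the current release coordinate.

;; in project.clj, under :dependencies
;; "x.y.z" is a placeholder — use the latest version published on Clojars
[pegasus "x.y.z"]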
A few example crawls:

This one crawls 20 docs from my blog (http://blog.shriphani.com). URLs are extracted using enlive selectors.
(ns your.namespace
  (:require [org.bovinegenius.exploding-fish :as uri]
            [net.cgrand.enlive-html :as html]
            [pegasus.core :refer [crawl]])
  (:import (java.io StringReader)))

(defn crawl-sp-blog
  []
  (crawl {:seeds ["http://blog.shriphani.com"]
          :user-agent "Pegasus web crawler"
          :extractor
          (fn [obj]
            ;; ensure that we only extract in-domain
            (when (= "blog.shriphani.com"
                     (-> obj :url uri/host))
              (let [resource (-> obj
                                 :body
                                 (StringReader.)
                                 html/html-resource)

                    ;; extract the articles
                    articles (html/select resource
                                          [:article :header :h2 :a])

                    ;; the pagination links
                    pagination (html/select resource
                                            [:ul.pagination :a])

                    a-tags (concat articles pagination)

                    ;; resolve the URLs and stay within the same domain
                    links (filter
                           #(= (uri/host %)
                               "blog.shriphani.com")
                           (map
                            #(->> %
                                  :attrs
                                  :href
                                  (uri/resolve-uri (:url obj)))
                            a-tags))]
                ;; add extracted links to the supplied object
                (merge obj
                       {:extracted links}))))
          :corpus-size 20                    ;; crawl 20 documents
          :job-dir "/tmp/sp-blog-corpus"}))  ;; store all crawl data in /tmp/sp-blog-corpus/

;; start crawling
(crawl-sp-blog)
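As the example shows, the `:extractor` function receives a map (`obj`) describing a fetched page and returns it with an `:extracted` key holding the outlinks to enqueue. A minimal sketch of that contract (only `:url` and `:body` are keys seen in the examples above; any other keys on `obj` are not assumed here):

;; A minimal extractor: obj contains at least :url and :body.
;; Return obj with :extracted bound to a seq of absolute URL strings;
;; returning nil (as the in-domain check above does for foreign pages)
;; enqueues nothing.
(defn identity-extractor
  [obj]
  (merge obj {:extracted []}))  ;; extract no links; the crawl stays on the seeds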
This one uses XPath queries, courtesy of clj-xpath:
(ns your.namespace
  (:require [org.bovinegenius.exploding-fish :as uri]
            [pegasus.core :refer [crawl]]
            [clj-xpath.core :refer [$x xml->doc]]))

(defn crawl-sp-blog-xpaths
  []
  (crawl {:seeds ["http://blog.shriphani.com/feeds/all.rss.xml"]
          :user-agent "Pegasus web crawler"
          :extractor
          (fn [obj]
            ;; ensure that we only extract in-domain
            (when (= "blog.shriphani.com"
                     (-> obj :url uri/host))
              (let [resource (try (-> obj
                                      :body
                                      xml->doc)
                                  (catch Exception e nil))

                    ;; extract the article links from the feed
                    articles (map
                              :text
                              (try ($x "//item/link" resource)
                                   (catch Exception e nil)))]
                ;; add extracted links to the supplied object
                (merge obj
                       {:extracted articles}))))
          :corpus-size 20                    ;; crawl 20 documents
          :job-dir "/tmp/sp-blog-corpus"}))  ;; store all crawl data in /tmp/sp-blog-corpus/

;; start crawling
(crawl-sp-blog-xpaths)
Copyright © 2015-2016 Shriphani Palakodety
Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.