pegasus

Pegasus is a highly modular, durable, and scalable crawler for Clojure.

Parallelism is achieved with core.async. Durability is achieved with durable-queue and LMDB.
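
To give a feel for the kind of parallelism core.async offers, here is a minimal sketch of a fixed-size worker pool built on `pipeline-blocking`. It is not pegasus's internal code: the namespace `example.parallel` and the function `process-all` are hypothetical names used only for illustration, and the sketch assumes `org.clojure/core.async` is on the classpath.

```clojure
(ns example.parallel
  (:require [clojure.core.async :as async]))

(defn process-all
  "Applies f to every input using n-workers threads and returns the
  results as a vector, in input order. Hypothetical helper, not part
  of pegasus."
  [n-workers f inputs]
  (let [in  (async/to-chan inputs)   ; channel that closes once inputs are consumed
        out (async/chan n-workers)]
    ;; pipeline-blocking runs the transducer on n-workers threads and
    ;; closes `out` when `in` is exhausted
    (async/pipeline-blocking n-workers out (map f) in)
    ;; collect everything from `out` into a vector and block for it
    (async/<!! (async/into [] out))))

(process-all 4 inc [1 2 3])
```

In a crawler, `f` would be a blocking step such as fetching a page, which is why the blocking variant of `pipeline` is used here.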

A blog post on how pegasus works: [link]

Usage

Add pegasus to your Leiningen dependencies. The current version is shown on the project's Clojars page.

A few example crawls:

This one crawls 20 documents from my blog (http://blog.shriphani.com).

URLs are extracted using Enlive selectors.

(ns your.namespace
  (:require [org.bovinegenius.exploding-fish :as uri]
            [net.cgrand.enlive-html :as html]
            [pegasus.core :refer [crawl]])
  (:import (java.io StringReader)))

(defn crawl-sp-blog
  []
  (crawl {:seeds ["http://blog.shriphani.com"]
          :user-agent "Pegasus web crawler"
          :extractor
          (fn [obj]
            ;; ensure that we only extract in domain
            (when (= "blog.shriphani.com"
                     (-> obj :url uri/host))
              
              (let [url (:url obj)
                    resource (-> obj
                                 :body
                                 (StringReader.)
                                 html/html-resource)

                    ;; extract the articles
                    articles (html/select resource
                                          [:article :header :h2 :a])

                    ;; the pagination links
                    pagination (html/select resource
                                            [:ul.pagination :a])

                    a-tags (concat articles pagination)

                    ;; resolve the URLs and stay within the same domain
                    links (filter
                           #(= (uri/host %)
                               "blog.shriphani.com")
                           (map
                            #(uri/resolve-uri url %)
                            ;; skip <a> tags that have no href attribute
                            (keep #(-> % :attrs :href) a-tags)))]

                ;; add extracted links to the supplied object
                (merge obj
                       {:extracted links}))))
          
          :corpus-size 20 ;; crawl 20 documents
          :job-dir "/tmp/sp-blog-corpus"})) ;; store all crawl data in /tmp/sp-blog-corpus/

;; start crawling
(crawl-sp-blog)
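
The "resolve, then keep only in-domain links" step of the extractor can be tried on its own, without running a crawl. The sketch below uses only `java.net.URI` from the JDK rather than exploding-fish, so the results may differ from the library's in edge cases. The namespace `example.links` and the helpers `resolve-href` and `in-domain-links` are hypothetical names, not part of pegasus.

```clojure
(ns example.links
  (:import (java.net URI)))

(defn resolve-href
  "Resolves href against base-url. Returns nil when href is missing
  or cannot be parsed as a URI."
  [base-url href]
  (when href
    (try
      (str (.resolve (URI. base-url) (URI. href)))
      (catch Exception _ nil))))

(defn in-domain-links
  "Resolves each href against base-url and keeps the distinct results
  whose host equals host."
  [base-url host hrefs]
  (->> hrefs
       (keep #(resolve-href base-url %))
       (filter #(= host (.getHost (URI. %))))
       distinct
       vec))

(in-domain-links "http://blog.shriphani.com/page/2/"
                 "blog.shriphani.com"
                 ["/about/" "../3/" "http://example.com/x" nil])
;; => ["http://blog.shriphani.com/about/" "http://blog.shriphani.com/page/3/"]
```

Relative links are resolved before the host check because a relative href has no host of its own.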

This one extracts links from the blog's RSS feed using XPath queries, courtesy of clj-xpath.

(ns your.namespace
  (:require [org.bovinegenius.exploding-fish :as uri]
            [pegasus.core :refer [crawl]]
            [clj-xpath.core :refer [$x xml->doc]]))

(defn crawl-sp-blog-xpaths
  []
  (crawl {:seeds ["http://blog.shriphani.com/feeds/all.rss.xml"]
          :user-agent "Pegasus web crawler"
          :extractor
          (fn [obj]
            ;; ensure that we only extract in domain
            (when (= "blog.shriphani.com"
                     (-> obj :url uri/host))

              (let [;; parse the feed; nil if the body is not well-formed XML
                    resource (try (-> obj
                                      :body
                                      xml->doc)
                                  (catch Exception e nil))

                    ;; extract the article links
                    articles (map
                              :text
                              (try ($x "//item/link" resource)
                                   (catch Exception e nil)))]

                ;; add extracted links to the supplied object
                (merge obj
                       {:extracted articles}))))

          :corpus-size 20 ;; crawl 20 documents
          :job-dir "/tmp/sp-blog-corpus"})) ;; store all crawl data in /tmp/sp-blog-corpus/

;; start crawling
(crawl-sp-blog-xpaths)          
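
To see what the `//item/link` query selects, the following sketch runs the same XPath over an RSS string using the JDK's built-in `javax.xml` classes instead of clj-xpath, so it needs no extra dependencies. The namespace `example.rss`, the function `rss-item-links`, and the sample feed are hypothetical and only for illustration.

```clojure
(ns example.rss
  (:import (java.io ByteArrayInputStream)
           (javax.xml.parsers DocumentBuilderFactory)
           (javax.xml.xpath XPathConstants XPathFactory)
           (org.w3c.dom NodeList)))

(defn rss-item-links
  "Returns the text of every //item/link node in the XML string, or an
  empty vector when the string cannot be parsed."
  [^String xml]
  (try
    (let [doc   (-> (DocumentBuilderFactory/newInstance)
                    (.newDocumentBuilder)
                    (.parse (ByteArrayInputStream. (.getBytes xml "UTF-8"))))
          xpath (.newXPath (XPathFactory/newInstance))
          ^NodeList nodes (.evaluate xpath "//item/link" doc XPathConstants/NODESET)]
      ;; NodeList is index-based, so walk it by position
      (mapv (fn [i] (.getTextContent (.item nodes (int i))))
            (range (.getLength nodes))))
    (catch Exception _ [])))

(rss-item-links
 (str "<rss><channel>"
      "<item><link>http://blog.shriphani.com/a/</link></item>"
      "<item><link>http://blog.shriphani.com/b/</link></item>"
      "</channel></rss>"))
;; => ["http://blog.shriphani.com/a/" "http://blog.shriphani.com/b/"]
```

As in the extractor above, parse failures are swallowed and produce an empty result, so a malformed page yields no links rather than stopping the crawl.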

License

Copyright © 2015-2016 Shriphani Palakodety

Distributed under the Eclipse Public License, either version 1.0 or (at your option) any later version.
