Fork of Mike Burns' ruby spider

Spider, a Web spidering library for Ruby. It handles the robots.txt, scraping, collecting, and looping so that you can just handle the data.

== Examples

=== Crawl the Web, loading each page in turn, until you run out of memory

  require 'spider'
  Spider.start_at('') {}

=== To handle erroneous responses

  require 'spider'
  Spider.start_at('') do |s|
    s.on :failure do |a_url, resp, prior_url|
      puts "URL failed: #{a_url}"
      puts " linked from #{prior_url}"
    end
  end

=== Or handle successful responses

  require 'spider'
  Spider.start_at('') do |s|
    s.on :success do |a_url, resp, prior_url|
      puts "#{a_url}: #{resp.code}"
      puts resp.body
      puts
    end
  end

=== Limit to just one domain

  require 'spider'
  Spider.start_at('') do |s|
    s.add_url_check do |a_url|
      a_url =~ %r{^*}
    end
  end
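
A regular expression is one way to write the check. As an alternative sketch, the block can compare hosts using Ruby's standard URI library. The helper name same_host? and the host example.com are hypothetical and not part of Spider. The only assumption about Spider is the one the example above shows: add_url_check keeps a URL when its block returns a true value.

```ruby
require 'uri'

# Hypothetical helper, not part of Spider: true when a_url parses
# and its host equals allowed_host, ignoring case.
def same_host?(a_url, allowed_host)
  host = URI.parse(a_url).host
  !host.nil? && host.casecmp(allowed_host).zero?
rescue URI::InvalidURIError
  false # URLs that cannot be parsed are rejected
end

# Assumed usage inside the crawl block:
#   s.add_url_check { |a_url| same_host?(a_url, 'example.com') }
```

Comparing parsed hosts avoids accidental matches on URLs that merely contain the domain name somewhere in their path or query string.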

=== Pass headers to some requests

  require 'spider'
  Spider.start_at('') do |s|
    s.setup do |a_url|
      if a_url =~ %r{^http://.*wikipedia.*}
        headers['User-Agent'] = "Mozilla/5.0 (compatible; Googlebot/2.1; +"
      end
    end
  end

=== Use memcached to track cycles

  require 'spider'
  require 'spider/included_in_memcached'
  SERVERS = ['','','']
  Spider.start_at('') do |s|
    s.check_already_seen_with IncludedInMemcached.new(SERVERS)
  end

=== Track cycles with a custom object

  require 'spider'
  class ExpireLinks < Hash
    def <<(v)
      self[v] = Time.now
    end
    def include?(v)
      self[v].kind_of?(Time) && (self[v] + 86400) >= Time.now
    end
  end

  Spider.start_at('') do |s|
    s.check_already_seen_with ExpireLinks.new
  end
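
The custom object in the example above defines only << and include?. A variant sketch with a configurable lifetime and an injectable clock makes the expiry behavior easy to check without waiting a day. The names ExpiringSet, ttl, and clock are hypothetical, and the sketch assumes Spider calls nothing on the object beyond those two methods.

```ruby
# Hypothetical variant of ExpireLinks: remembers each URL for ttl seconds.
class ExpiringSet
  # ttl is in seconds; clock is any callable that returns the current Time.
  def initialize(ttl = 86_400, clock = -> { Time.now })
    @ttl = ttl
    @clock = clock
    @seen = {}
  end

  # Record the time at which url was seen.
  def <<(url)
    @seen[url] = @clock.call
    self
  end

  # True while url was seen no more than ttl seconds ago.
  def include?(url)
    seen_at = @seen[url]
    !seen_at.nil? && (seen_at + @ttl) >= @clock.call
  end
end
```

With the defaults, ExpiringSet.new behaves like the one-day expiry above. Passing a fixed clock lets a test step time forward by hand.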

=== Store nodes to visit with Amazon SQS

  require 'spider'
  require 'spider/next_urls_in_sqs'
  Spider.start_at('') do |s|
    s.store_next_urls_with NextUrlsInSQS.new(AWS_ACCESS_KEY, AWS_SECRET_ACCESS_KEY)
  end

=== Store nodes to visit with a custom object

  require 'spider'
  class MyArray < Array
    def pop
      super
    end

    def push(a_msg)
      super(a_msg)
    end
  end

  Spider.start_at('') do |s|
    s.store_next_urls_with MyArray.new
  end
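
The example above overrides only push and pop. As a sketch of what a custom store could change, the class below hands entries back in the order they were added (first in, first out) instead of Array's last-in, first-out pop. FifoStore is a hypothetical name, and the sketch assumes push and pop are the only methods Spider calls on the store.

```ruby
# Hypothetical store: pop returns the oldest entry instead of the newest.
class FifoStore
  def initialize
    @items = []
  end

  # Add an entry at the back of the queue.
  def push(a_msg)
    @items.push(a_msg)
    self
  end

  # Remove and return the entry at the front; nil when the queue is empty.
  def pop
    @items.shift
  end

  def empty?
    @items.empty?
  end
end
```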

=== Create a URL graph

  require 'spider'
  nodes = {}
  Spider.start_at('') do |s|
    s.add_url_check {|a_url| a_url =~ %r{^*} }

    s.on(:every) do |a_url, resp, prior_url|
      nodes[prior_url] ||= []
      nodes[prior_url] << a_url
    end
  end
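
Once the crawl finishes, nodes maps each referring URL to the list of URLs reached from it. A short sketch of one thing to do with that hash is to count how many recorded links point at each page. The sample data and the inbound_counts helper are hypothetical; only the hash shape comes from the example above.

```ruby
# Hypothetical sample in the shape built above: prior_url => [linked URLs].
nodes = {
  'http://example.com/'  => ['http://example.com/a', 'http://example.com/b'],
  'http://example.com/a' => ['http://example.com/b']
}

# Count how many recorded links point at each URL.
def inbound_counts(nodes)
  counts = Hash.new(0)
  nodes.each_value do |targets|
    targets.each { |url| counts[url] += 1 }
  end
  counts
end

# Print the most linked-to URLs first.
inbound_counts(nodes).sort_by { |_url, n| -n }.each do |url, n|
  puts "#{n} #{url}"
end
```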

=== Use a proxy

  require 'net/http_configuration'
  require 'spider'
  http_conf = Net::HTTP::Configuration.new(:proxy_host => '', :proxy_port => 8881)
  http_conf.apply do
    Spider.start_at('') do |s|
      s.on(:success) do |a_url, resp, prior_url|
        File.open(a_url.gsub('/',':'),'w') do |f|
          f.write(resp.body)
        end
      end
    end
  end

== Author

John Ewart

John Nagro

Mike Burns (original author)

Many thanks to: Matt Horan, Henri Cook, Sander van der Vliet, John Buckley, Brian Campbell

With `robot_rules' from James Edward Gray II via