Starting the project off with the 0.4.1 code base.

commit 590ed8011a64f2b0c4510f11cb8e4264132da554 0 parents
@johnnagro authored
38 CHANGES
@@ -0,0 +1,38 @@
+2007-11-09:
+* Handle redirects that assume a base URL.
+
+2007-11-08:
+* Move spider_instance.rb, robot_rules.rb, and included_in_memcached.rb into
+ spider subdirectory.
+
+2007-11-02:
+* Memcached support.
+
+2007-10-31:
+* Add `setup' and `teardown' handlers.
+* Can set the headers for an HTTP request.
+* Changed :any to :every .
+* Changed the arguments to the :every, :success, :failure, and code handlers.
+
+2007-10-23:
+* URLs without a page component but with a query component.
+* HTTP Redirect.
+* HTTPS.
+* Version 0.2.1 .
+
+2007-10-22:
+* Use RSpec to ensure that it mostly works.
+* Use WEBrick to create a small test server for additional testing.
+* Completely re-do the API to prepare for future expansion.
+* Add the ability to apply each URL to a series of custom allowed?-like
+ matchers.
+* BSD license.
+* Version 0.2.0 .
+
+2007-03-30:
+* Clean up the documentation.
+
+2007-03-28:
+* Change the tail recursion to a `while' loop, to please Ruby.
+* Documentation.
+* Initial release: version 0.1.0 .
114 README
@@ -0,0 +1,114 @@
+Spider, a Web spidering library for Ruby. It handles the robots.txt,
+scraping, collecting, and looping so that you can just handle the data.
+
+== Examples
+
+=== Crawl the Web, loading each page in turn, until you run out of memory
+
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') {}
+
+=== To handle erroneous responses
+
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.on :failure do |a_url, resp, prior_url|
+ puts "URL failed: #{a_url}"
+ puts " linked from #{prior_url}"
+ end
+ end
+
+=== Or handle successful responses
+
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.on :success do |a_url, resp, prior_url|
+ puts "#{a_url}: #{resp.code}"
+ puts resp.body
+ puts
+ end
+ end
+
+=== Limit to just one domain
+
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.add_url_check do |a_url|
+ a_url =~ %r{^http://mike-burns.com.*}
+ end
+ end
+
+=== Pass headers to some requests
+
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.setup do |a_url|
+ if a_url =~ %r{^http://.*wikipedia.*}
+ headers['User-Agent'] = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
+ end
+ end
+ end
+
+=== Use memcached to track cycles
+
+ require 'spider'
+ require 'spider/included_in_memcached'
+ SERVERS = ['10.0.10.2:11211','10.0.10.3:11211','10.0.10.4:11211']
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.check_already_seen_with IncludedInMemcached.new(SERVERS)
+ end
+
+=== Track cycles with a custom object
+
+ require 'spider'
+
+ class ExpireLinks < Hash
+   def <<(v)
+     self[v] = Time.now
+   end
+   def include?(v)
+     key?(v) && Time.now < self[v] + 86400
+   end
+ end
+
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.check_already_seen_with ExpireLinks.new
+ end
+
+=== Create a URL graph
+
+ require 'spider'
+ nodes = {}
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.add_url_check {|a_url| a_url =~ %r{^http://mike-burns.com.*} }
+
+ s.on(:every) do |a_url, resp, prior_url|
+ nodes[prior_url] ||= []
+ nodes[prior_url] << a_url
+ end
+ end
+
+=== Use a proxy
+
+ require 'net/http_configuration'
+ require 'spider'
+ http_conf = Net::HTTP::Configuration.new(:proxy_host => '7proxies.org',
+ :proxy_port => 8881)
+ http_conf.apply do
+ Spider.start_at('http://img.4chan.org/b/') do |s|
+ s.on(:success) do |a_url, resp, prior_url|
+ File.open(a_url.gsub('/',':'),'w') do |f|
+ f.write(resp.body)
+ end
+ end
+ end
+ end
+
+== Author
+
+Mike Burns http://mike-burns.com mike@mike-burns.com
+
+Help from Matt Horan, John Nagro, and Henri Cook.
+
+With `robot_rules' from James Edward Gray II via
+http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589
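The "Track cycles with a custom object" example in the README depends on an object that answers `<<` and `include?`. The following is a minimal standalone sketch of that idea, assuming that a URL should count as seen only for a limited time. The class name `ExpiringSeenSet`, the `ttl` parameter, and the injectable `clock` are hypothetical and not part of Spider; they exist only so the expiry behaviour can be exercised without waiting a day.

```ruby
# Hypothetical stand-in for the README's ExpireLinks; not part of Spider.
class ExpiringSeenSet
  # ttl:   seconds a URL stays "seen" (assumed default: one day)
  # clock: callable returning the current Time, injectable for testing
  def initialize(ttl = 86_400, clock = -> { Time.now })
    @ttl = ttl
    @clock = clock
    @seen_at = {}
  end

  # Record the URL together with the time it was seen.
  def <<(url)
    @seen_at[url] = @clock.call
    self
  end

  # True only if the URL was recorded less than ttl seconds ago.
  def include?(url)
    seen = @seen_at[url]
    !seen.nil? && (@clock.call - seen) < @ttl
  end
end
```

An instance could then be handed to `check_already_seen_with` in the same way as `ExpireLinks.new` in the README.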
226 doc/classes/IncludedInMemcached.html
@@ -0,0 +1,226 @@
+<?xml version="1.0" encoding="iso-8859-1"?>
+<!DOCTYPE html
+ PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
+ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
+<head>
+ <title>Class: IncludedInMemcached</title>
+ <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
+ <meta http-equiv="Content-Script-Type" content="text/javascript" />
+ <link rel="stylesheet" href=".././rdoc-style.css" type="text/css" media="screen" />
+ <script type="text/javascript">
+ // <![CDATA[
+
+ function popupCode( url ) {
+ window.open(url, "Code", "resizable=yes,scrollbars=yes,toolbar=no,status=no,height=150,width=400")
+ }
+
+ function toggleCode( id ) {
+ if ( document.getElementById )
+ elem = document.getElementById( id );
+ else if ( document.all )
+ elem = eval( "document.all." + id );
+ else
+ return false;
+
+ elemStyle = elem.style;
+
+ if ( elemStyle.display != "block" ) {
+ elemStyle.display = "block"
+ } else {
+ elemStyle.display = "none"
+ }
+
+ return true;
+ }
+
+ // Make codeblocks hidden by default
+ document.writeln( "<style type=\"text/css\">div.method-source-code { display: none }</style>" )
+
+ // ]]>
+ </script>
+
+</head>
+<body>
+
+
+
+ <div id="classHeader">
+ <table class="header-table">
+ <tr class="top-aligned-row">
+ <td><strong>Class</strong></td>
+ <td class="class-name-in-header">IncludedInMemcached</td>
+ </tr>
+ <tr class="top-aligned-row">
+ <td><strong>In:</strong></td>
+ <td>
+ <a href="../files/lib/spider/included_in_memcached_rb.html">
+ lib/spider/included_in_memcached.rb
+ </a>
+ <br />
+ </td>
+ </tr>
+
+ <tr class="top-aligned-row">
+ <td><strong>Parent:</strong></td>
+ <td>
+ Object
+ </td>
+ </tr>
+ </table>
+ </div>
+ <!-- banner header -->
+
+ <div id="bodyContent">
+
+
+
+ <div id="contextContent">
+
+ <div id="description">
+ <p>
+A specialized class using memcached to track items stored. It supports
+three operations: <a href="IncludedInMemcached.html#M000001">new</a>,
+&lt;&lt;, and <a href="IncludedInMemcached.html#M000003">include?</a> .
+Together these can be used to add items to the memcache, then determine
+whether the item has been added.
+</p>
+<p>
+To use it with <a href="Spider.html">Spider</a> use the
+check_already_seen_with method:
+</p>
+<pre>
+ Spider.start_at('http://example.com/') do |s|
+ s.check_already_seen_with IncludedInMemcached.new('localhost:11211')
+ end
+</pre>
+
+ </div>
+
+
+ </div>
+
+ <div id="method-list">
+ <h3 class="section-bar">Methods</h3>
+
+ <div class="name-list">
+ <a href="#M000002">&lt;&lt;</a>&nbsp;&nbsp;
+ <a href="#M000003">include?</a>&nbsp;&nbsp;
+ <a href="#M000001">new</a>&nbsp;&nbsp;
+ </div>
+ </div>
+
+ </div>
+
+
+ <!-- if includes -->
+
+ <div id="section">
+
+
+
+
+
+
+
+
+ <!-- if method_list -->
+ <div id="methods">
+ <h3 class="section-bar">Public Class methods</h3>
+
+ <div id="method-M000001" class="method-detail">
+ <a name="M000001"></a>
+
+ <div class="method-heading">
+ <a href="#M000001" class="method-signature">
+ <span class="method-name">new</span><span class="method-args">(*a)</span>
+ </a>
+ </div>
+
+ <div class="method-description">
+ <p>
+Construct a <a href="IncludedInMemcached.html#M000001">new</a> <a
+href="IncludedInMemcached.html">IncludedInMemcached</a> instance. All
+arguments here are passed to MemCache (part of the memcache-client gem).
+</p>
+ <p><a class="source-toggle" href="#"
+ onclick="toggleCode('M000001-source');return false;">[Source]</a></p>
+ <div class="method-source-code" id="M000001-source">
+<pre>
+<span class="ruby-comment cmt"># File lib/spider/included_in_memcached.rb, line 39</span>
+ <span class="ruby-keyword kw">def</span> <span class="ruby-identifier">initialize</span>(<span class="ruby-operator">*</span><span class="ruby-identifier">a</span>)
+ <span class="ruby-ivar">@c</span> = <span class="ruby-constant">MemCache</span>.<span class="ruby-identifier">new</span>(<span class="ruby-operator">*</span><span class="ruby-identifier">a</span>)
+ <span class="ruby-keyword kw">end</span>
+</pre>
+ </div>
+ </div>
+ </div>
+
+ <h3 class="section-bar">Public Instance methods</h3>
+
+ <div id="method-M000002" class="method-detail">
+ <a name="M000002"></a>
+
+ <div class="method-heading">
+ <a href="#M000002" class="method-signature">
+ <span class="method-name">&lt;&lt;</span><span class="method-args">(v)</span>
+ </a>
+ </div>
+
+ <div class="method-description">
+ <p>
+Add an item to the memcache.
+</p>
+ <p><a class="source-toggle" href="#"
+ onclick="toggleCode('M000002-source');return false;">[Source]</a></p>
+ <div class="method-source-code" id="M000002-source">
+<pre>
+<span class="ruby-comment cmt"># File lib/spider/included_in_memcached.rb, line 44</span>
+ <span class="ruby-keyword kw">def</span> <span class="ruby-operator">&lt;&lt;</span>(<span class="ruby-identifier">v</span>)
+ <span class="ruby-ivar">@c</span>.<span class="ruby-identifier">add</span>(<span class="ruby-identifier">v</span>.<span class="ruby-identifier">to_s</span>, <span class="ruby-identifier">v</span>)
+ <span class="ruby-keyword kw">end</span>
+</pre>
+ </div>
+ </div>
+ </div>
+
+ <div id="method-M000003" class="method-detail">
+ <a name="M000003"></a>
+
+ <div class="method-heading">
+ <a href="#M000003" class="method-signature">
+ <span class="method-name">include?</span><span class="method-args">(v)</span>
+ </a>
+ </div>
+
+ <div class="method-description">
+ <p>
+True if the item is in the memcache.
+</p>
+ <p><a class="source-toggle" href="#"
+ onclick="toggleCode('M000003-source');return false;">[Source]</a></p>
+ <div class="method-source-code" id="M000003-source">
+<pre>
+<span class="ruby-comment cmt"># File lib/spider/included_in_memcached.rb, line 49</span>
+ <span class="ruby-keyword kw">def</span> <span class="ruby-identifier">include?</span>(<span class="ruby-identifier">v</span>)
+ <span class="ruby-ivar">@c</span>.<span class="ruby-identifier">get</span>(<span class="ruby-identifier">v</span>.<span class="ruby-identifier">to_s</span>) <span class="ruby-operator">==</span> <span class="ruby-identifier">v</span>
+ <span class="ruby-keyword kw">end</span>
+</pre>
+ </div>
+ </div>
+ </div>
+
+
+ </div>
+
+
+ </div>
+
+
+<div id="validator-badges">
+ <p><small><a href="http://validator.w3.org/check/referer">[Validate]</a></small></p>
+</div>
+
+</body>
+</html>
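The documented source of IncludedInMemcached maps `<<` onto the client's `add` and `include?` onto `get`. The sketch below reproduces that shape without a running memcached server, assuming an in-memory stand-in. `FakeCache` and `IncludedInCache` are hypothetical names, and `FakeCache` mimics only the two calls the documented source uses; it is not the memcache-client API.

```ruby
# Hypothetical in-memory stand-in offering only add and get.
class FakeCache
  def initialize
    @store = {}
  end

  # Like memcached's add: store the value only when the key is absent.
  def add(key, value)
    @store[key] = value unless @store.key?(key)
  end

  def get(key)
    @store[key]
  end
end

# Same shape as the documented IncludedInMemcached, with the client injected.
class IncludedInCache
  def initialize(client)
    @c = client
  end

  # Add an item; returns self so calls can be chained (a deviation
  # from the documented source, which returns the result of add).
  def <<(v)
    @c.add(v.to_s, v)
    self
  end

  # True if the item has been added.
  def include?(v)
    @c.get(v.to_s) == v
  end
end
```

Injecting the client keeps the `<<` and `include?` logic testable on its own, which may be useful before pointing the crawler at real memcached servers.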
182 doc/classes/Spider.html
@@ -0,0 +1,182 @@
+<?xml version="1.0" encoding="iso-8859-1"?>
+<!DOCTYPE html
+ PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
+ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
+<head>
+ <title>Class: Spider</title>
+ <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
+ <meta http-equiv="Content-Script-Type" content="text/javascript" />
+ <link rel="stylesheet" href=".././rdoc-style.css" type="text/css" media="screen" />
+ <script type="text/javascript">
+ // <![CDATA[
+
+ function popupCode( url ) {
+ window.open(url, "Code", "resizable=yes,scrollbars=yes,toolbar=no,status=no,height=150,width=400")
+ }
+
+ function toggleCode( id ) {
+ if ( document.getElementById )
+ elem = document.getElementById( id );
+ else if ( document.all )
+ elem = eval( "document.all." + id );
+ else
+ return false;
+
+ elemStyle = elem.style;
+
+ if ( elemStyle.display != "block" ) {
+ elemStyle.display = "block"
+ } else {
+ elemStyle.display = "none"
+ }
+
+ return true;
+ }
+
+ // Make codeblocks hidden by default
+ document.writeln( "<style type=\"text/css\">div.method-source-code { display: none }</style>" )
+
+ // ]]>
+ </script>
+
+</head>
+<body>
+
+
+
+ <div id="classHeader">
+ <table class="header-table">
+ <tr class="top-aligned-row">
+ <td><strong>Class</strong></td>
+ <td class="class-name-in-header">Spider</td>
+ </tr>
+ <tr class="top-aligned-row">
+ <td><strong>In:</strong></td>
+ <td>
+ <a href="../files/lib/spider_rb.html">
+ lib/spider.rb
+ </a>
+ <br />
+ </td>
+ </tr>
+
+ <tr class="top-aligned-row">
+ <td><strong>Parent:</strong></td>
+ <td>
+ Object
+ </td>
+ </tr>
+ </table>
+ </div>
+ <!-- banner header -->
+
+ <div id="bodyContent">
+
+
+
+ <div id="contextContent">
+
+ <div id="description">
+ <p>
+A spidering library for Ruby. Handles robots.txt, scraping, finding more
+links, and doing it all over again.
+</p>
+
+ </div>
+
+
+ </div>
+
+ <div id="method-list">
+ <h3 class="section-bar">Methods</h3>
+
+ <div class="name-list">
+ <a href="#M000011">start_at</a>&nbsp;&nbsp;
+ </div>
+ </div>
+
+ </div>
+
+
+ <!-- if includes -->
+
+ <div id="section">
+
+
+
+
+
+
+
+
+ <!-- if method_list -->
+ <div id="methods">
+ <h3 class="section-bar">Public Class methods</h3>
+
+ <div id="method-M000011" class="method-detail">
+ <a name="M000011"></a>
+
+ <div class="method-heading">
+ <a href="#M000011" class="method-signature">
+ <span class="method-name">start_at</span><span class="method-args">(a_url, &amp;block)</span>
+ </a>
+ </div>
+
+ <div class="method-description">
+ <p>
+Runs the spider starting at the given URL. Also takes a block that is given
+the <a href="SpiderInstance.html">SpiderInstance</a>. Use the block to
+define the rules and handlers for the discovered Web pages. See <a
+href="SpiderInstance.html">SpiderInstance</a> for the possible rules and
+handlers.
+</p>
+<pre>
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.add_url_check do |a_url|
+ a_url =~ %r{^http://mike-burns.com.*}
+ end
+
+ s.on 404 do |a_url, resp, prior_url|
+ puts &quot;URL not found: #{a_url}&quot;
+ end
+
+ s.on :success do |a_url, resp, prior_url|
+ puts &quot;body: #{resp.body}&quot;
+ end
+
+ s.on :every do |a_url, resp, prior_url|
+ puts &quot;URL returned anything: #{a_url} with this code #{resp.code}&quot;
+ end
+ end
+</pre>
+ <p><a class="source-toggle" href="#"
+ onclick="toggleCode('M000011-source');return false;">[Source]</a></p>
+ <div class="method-source-code" id="M000011-source">
+<pre>
+<span class="ruby-comment cmt"># File lib/spider.rb, line 54</span>
+ <span class="ruby-keyword kw">def</span> <span class="ruby-keyword kw">self</span>.<span class="ruby-identifier">start_at</span>(<span class="ruby-identifier">a_url</span>, <span class="ruby-operator">&amp;</span><span class="ruby-identifier">block</span>)
+ <span class="ruby-identifier">rules</span> = <span class="ruby-constant">RobotRules</span>.<span class="ruby-identifier">new</span>(<span class="ruby-value str">'Ruby Spider 1.0'</span>)
+ <span class="ruby-identifier">a_spider</span> = <span class="ruby-constant">SpiderInstance</span>.<span class="ruby-identifier">new</span>({<span class="ruby-keyword kw">nil</span> =<span class="ruby-operator">&gt;</span> <span class="ruby-identifier">a_url</span>}, [], <span class="ruby-identifier">rules</span>, [])
+ <span class="ruby-identifier">block</span>.<span class="ruby-identifier">call</span>(<span class="ruby-identifier">a_spider</span>)
+ <span class="ruby-identifier">a_spider</span>.<span class="ruby-identifier">start!</span>
+ <span class="ruby-keyword kw">end</span>
+</pre>
+ </div>
+ </div>
+ </div>
+
+
+ </div>
+
+
+ </div>
+
+
+<div id="validator-badges">
+ <p><small><a href="http://validator.w3.org/check/referer">[Validate]</a></small></p>
+</div>
+
+</body>
+</html>
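The `start_at` example above limits the crawl with `%r{^http://mike-burns.com.*}`. Because the dots are unescaped and the pattern only anchors the start of the string, it also accepts a URL such as `http://mike-burns.com.evil.example/`. A stricter check can compare the parsed host instead. The sketch below assumes that approach; `same_host?`, `URL_CHECKS`, and `url_allowed?` are hypothetical names, and the "all predicates must hold" rule is taken from the `add_url_check` documentation.

```ruby
require 'uri'

# Hypothetical helper, not part of Spider: true when the URL's host
# is exactly the expected one. Unparseable URLs are rejected.
def same_host?(a_url, host)
  URI.parse(a_url).host == host
rescue URI::InvalidURIError
  false
end

# Each predicate takes a URL string and returns a truthy or falsy value.
URL_CHECKS = [
  ->(u) { same_host?(u, 'mike-burns.com') },
  ->(u) { u !~ /\.pdf\z/ } # assumed extra rule: skip PDF links
].freeze

# A URL proceeds only if every predicate accepts it.
def url_allowed?(a_url)
  URL_CHECKS.all? { |check| check.call(a_url) }
end
```

Inside a `start_at` block, each lambda could be registered with `s.add_url_check(&check)`.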
381 doc/classes/SpiderInstance.html
@@ -0,0 +1,381 @@
+<?xml version="1.0" encoding="iso-8859-1"?>
+<!DOCTYPE html
+ PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
+ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
+<head>
+ <title>Class: SpiderInstance</title>
+ <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
+ <meta http-equiv="Content-Script-Type" content="text/javascript" />
+ <link rel="stylesheet" href=".././rdoc-style.css" type="text/css" media="screen" />
+ <script type="text/javascript">
+ // <![CDATA[
+
+ function popupCode( url ) {
+ window.open(url, "Code", "resizable=yes,scrollbars=yes,toolbar=no,status=no,height=150,width=400")
+ }
+
+ function toggleCode( id ) {
+ if ( document.getElementById )
+ elem = document.getElementById( id );
+ else if ( document.all )
+ elem = eval( "document.all." + id );
+ else
+ return false;
+
+ elemStyle = elem.style;
+
+ if ( elemStyle.display != "block" ) {
+ elemStyle.display = "block"
+ } else {
+ elemStyle.display = "none"
+ }
+
+ return true;
+ }
+
+ // Make codeblocks hidden by default
+ document.writeln( "<style type=\"text/css\">div.method-source-code { display: none }</style>" )
+
+ // ]]>
+ </script>
+
+</head>
+<body>
+
+
+
+ <div id="classHeader">
+ <table class="header-table">
+ <tr class="top-aligned-row">
+ <td><strong>Class</strong></td>
+ <td class="class-name-in-header">SpiderInstance</td>
+ </tr>
+ <tr class="top-aligned-row">
+ <td><strong>In:</strong></td>
+ <td>
+ <a href="../files/lib/spider/spider_instance_rb.html">
+ lib/spider/spider_instance.rb
+ </a>
+ <br />
+ </td>
+ </tr>
+
+ <tr class="top-aligned-row">
+ <td><strong>Parent:</strong></td>
+ <td>
+ Object
+ </td>
+ </tr>
+ </table>
+ </div>
+ <!-- banner header -->
+
+ <div id="bodyContent">
+
+
+
+ <div id="contextContent">
+
+
+
+ </div>
+
+ <div id="method-list">
+ <h3 class="section-bar">Methods</h3>
+
+ <div class="name-list">
+ <a href="#M000004">add_url_check</a>&nbsp;&nbsp;
+ <a href="#M000005">check_already_seen_with</a>&nbsp;&nbsp;
+ <a href="#M000010">clear_headers</a>&nbsp;&nbsp;
+ <a href="#M000009">headers</a>&nbsp;&nbsp;
+ <a href="#M000006">on</a>&nbsp;&nbsp;
+ <a href="#M000007">setup</a>&nbsp;&nbsp;
+ <a href="#M000008">teardown</a>&nbsp;&nbsp;
+ </div>
+ </div>
+
+ </div>
+
+
+ <!-- if includes -->
+
+ <div id="section">
+
+
+
+
+
+
+
+
+ <!-- if method_list -->
+ <div id="methods">
+ <h3 class="section-bar">Public Instance methods</h3>
+
+ <div id="method-M000004" class="method-detail">
+ <a name="M000004"></a>
+
+ <div class="method-heading">
+ <a href="#M000004" class="method-signature">
+ <span class="method-name">add_url_check</span><span class="method-args">(&amp;block)</span>
+ </a>
+ </div>
+
+ <div class="method-description">
+ <p>
+Add a predicate that determines whether to continue down this URL&#8217;s
+path. All predicates must be true in order for a URL to proceed.
+</p>
+<p>
+Takes a block that takes a string and produces a boolean. For example, this
+will ensure that the URL starts with &#8216;<a
+href="http://mike-burns.com">mike-burns.com</a>&#8217;:
+</p>
+<pre>
+ add_url_check { |a_url| a_url =~ %r{^http://mike-burns.com.*} }
+</pre>
+ <p><a class="source-toggle" href="#"
+ onclick="toggleCode('M000004-source');return false;">[Source]</a></p>
+ <div class="method-source-code" id="M000004-source">
+<pre>
+<span class="ruby-comment cmt"># File lib/spider/spider_instance.rb, line 70</span>
+ <span class="ruby-keyword kw">def</span> <span class="ruby-identifier">add_url_check</span>(<span class="ruby-operator">&amp;</span><span class="ruby-identifier">block</span>)
+ <span class="ruby-ivar">@url_checks</span> <span class="ruby-operator">&lt;&lt;</span> <span class="ruby-identifier">block</span>
+ <span class="ruby-keyword kw">end</span>
+</pre>
+ </div>
+ </div>
+ </div>
+
+ <div id="method-M000005" class="method-detail">
+ <a name="M000005"></a>
+
+ <div class="method-heading">
+ <a href="#M000005" class="method-signature">
+ <span class="method-name">check_already_seen_with</span><span class="method-args">(cacher)</span>
+ </a>
+ </div>
+
+ <div class="method-description">
+ <p>
+The Web is a graph; to avoid cycles we store the nodes (URLs) already
+visited. The Web is a really, really, really big graph; as such, this list
+of visited nodes grows really, really, really big.
+</p>
+<p>
+Change the object used to store these seen nodes with this. The default
+object is an instance of Array. Available with <a
+href="Spider.html">Spider</a> is a wrapper of memcached.
+</p>
+<p>
+You can implement a custom class for this; any object passed to <a
+href="SpiderInstance.html#M000005">check_already_seen_with</a> must
+understand just &lt;&lt; and include? .
+</p>
+<pre>
+ # default
+ check_already_seen_with Array.new
+
+ # memcached
+ require 'spider/included_in_memcached'
+ check_already_seen_with IncludedInMemcached.new('localhost:11211')
+</pre>
+ <p><a class="source-toggle" href="#"
+ onclick="toggleCode('M000005-source');return false;">[Source]</a></p>
+ <div class="method-source-code" id="M000005-source">
+<pre>
+<span class="ruby-comment cmt"># File lib/spider/spider_instance.rb, line 91</span>
+ <span class="ruby-keyword kw">def</span> <span class="ruby-identifier">check_already_seen_with</span>(<span class="ruby-identifier">cacher</span>)
+ <span class="ruby-keyword kw">if</span> <span class="ruby-identifier">cacher</span>.<span class="ruby-identifier">respond_to?</span>(<span class="ruby-identifier">:&lt;&lt;</span>) <span class="ruby-operator">&amp;&amp;</span> <span class="ruby-identifier">cacher</span>.<span class="ruby-identifier">respond_to?</span>(<span class="ruby-identifier">:include?</span>)
+ <span class="ruby-ivar">@seen</span> = <span class="ruby-identifier">cacher</span>
+ <span class="ruby-keyword kw">else</span>
+ <span class="ruby-identifier">raise</span> <span class="ruby-constant">ArgumentError</span>, <span class="ruby-value str">'expected something that responds to &lt;&lt; and include?'</span>
+ <span class="ruby-keyword kw">end</span>
+ <span class="ruby-keyword kw">end</span>
+</pre>
+ </div>
+ </div>
+ </div>
+
+ <div id="method-M000010" class="method-detail">
+ <a name="M000010"></a>
+
+ <div class="method-heading">
+ <a href="#M000010" class="method-signature">
+ <span class="method-name">clear_headers</span><span class="method-args">()</span>
+ </a>
+ </div>
+
+ <div class="method-description">
+ <p>
+Reset the <a href="SpiderInstance.html#M000009">headers</a> hash.
+</p>
+ <p><a class="source-toggle" href="#"
+ onclick="toggleCode('M000010-source');return false;">[Source]</a></p>
+ <div class="method-source-code" id="M000010-source">
+<pre>
+<span class="ruby-comment cmt"># File lib/spider/spider_instance.rb, line 158</span>
+ <span class="ruby-keyword kw">def</span> <span class="ruby-identifier">clear_headers</span>
+ <span class="ruby-ivar">@headers</span> = {}
+ <span class="ruby-keyword kw">end</span>
+</pre>
+ </div>
+ </div>
+ </div>
+
+ <div id="method-M000009" class="method-detail">
+ <a name="M000009"></a>
+
+ <div class="method-heading">
+ <a href="#M000009" class="method-signature">
+ <span class="method-name">headers</span><span class="method-args">()</span>
+ </a>
+ </div>
+
+ <div class="method-description">
+ <p>
+Use like a hash:
+</p>
+<pre>
+ headers['Cookie'] = 'user_id=1;password=btrross3'
+</pre>
+ <p><a class="source-toggle" href="#"
+ onclick="toggleCode('M000009-source');return false;">[Source]</a></p>
+ <div class="method-source-code" id="M000009-source">
+<pre>
+<span class="ruby-comment cmt"># File lib/spider/spider_instance.rb, line 146</span>
+ <span class="ruby-keyword kw">def</span> <span class="ruby-identifier">headers</span>
+ <span class="ruby-constant">HeaderSetter</span>.<span class="ruby-identifier">new</span>(<span class="ruby-keyword kw">self</span>)
+ <span class="ruby-keyword kw">end</span>
+</pre>
+ </div>
+ </div>
+ </div>
+
+ <div id="method-M000006" class="method-detail">
+ <a name="M000006"></a>
+
+ <div class="method-heading">
+ <a href="#M000006" class="method-signature">
+ <span class="method-name">on</span><span class="method-args">(code, p = nil, &amp;block)</span>
+ </a>
+ </div>
+
+ <div class="method-description">
+ <p>
+Add a response handler. A response handler&#8217;s trigger can be :every,
+:success, :failure, or any HTTP status code. The handler itself can be
+either a Proc or a block.
+</p>
+<p>
+The arguments to the block are: the URL as a string, an instance of
+Net::HTTPResponse, and the prior URL as a string.
+</p>
+<p>
+For example:
+</p>
+<pre>
+ on 404 do |a_url, resp, prior_url|
+ puts &quot;URL not found: #{a_url}&quot;
+ end
+
+ on :success do |a_url, resp, prior_url|
+ puts a_url
+ puts resp.body
+ end
+
+ on :every do |a_url, resp, prior_url|
+ puts &quot;Given this code: #{resp.code}&quot;
+ end
+</pre>
+ <p><a class="source-toggle" href="#"
+ onclick="toggleCode('M000006-source');return false;">[Source]</a></p>
+ <div class="method-source-code" id="M000006-source">
+<pre>
+<span class="ruby-comment cmt"># File lib/spider/spider_instance.rb, line 121</span>
+ <span class="ruby-keyword kw">def</span> <span class="ruby-identifier">on</span>(<span class="ruby-identifier">code</span>, <span class="ruby-identifier">p</span> = <span class="ruby-keyword kw">nil</span>, <span class="ruby-operator">&amp;</span><span class="ruby-identifier">block</span>)
+ <span class="ruby-identifier">f</span> = <span class="ruby-identifier">p</span> <span class="ruby-value">? </span><span class="ruby-identifier">p</span> <span class="ruby-operator">:</span> <span class="ruby-identifier">block</span>
+ <span class="ruby-keyword kw">case</span> <span class="ruby-identifier">code</span>
+ <span class="ruby-keyword kw">when</span> <span class="ruby-constant">Fixnum</span>
+ <span class="ruby-ivar">@callbacks</span>[<span class="ruby-identifier">code</span>] = <span class="ruby-identifier">f</span>
+ <span class="ruby-keyword kw">else</span>
+ <span class="ruby-ivar">@callbacks</span>[<span class="ruby-identifier">code</span>.<span class="ruby-identifier">to_sym</span>] = <span class="ruby-identifier">f</span>
+ <span class="ruby-keyword kw">end</span>
+ <span class="ruby-keyword kw">end</span>
+</pre>
+ </div>
+ </div>
+ </div>
+
+ <div id="method-M000007" class="method-detail">
+ <a name="M000007"></a>
+
+ <div class="method-heading">
+ <a href="#M000007" class="method-signature">
+ <span class="method-name">setup</span><span class="method-args">(p = nil, &amp;block)</span>
+ </a>
+ </div>
+
+ <div class="method-description">
+ <p>
+Run before the HTTP request. Given the URL as a string.
+</p>
+<pre>
+ setup do |a_url|
+ headers['Cookie'] = 'user_id=1;admin=true'
+ end
+</pre>
+ <p><a class="source-toggle" href="#"
+ onclick="toggleCode('M000007-source');return false;">[Source]</a></p>
+ <div class="method-source-code" id="M000007-source">
+<pre>
+<span class="ruby-comment cmt"># File lib/spider/spider_instance.rb, line 135</span>
+ <span class="ruby-keyword kw">def</span> <span class="ruby-identifier">setup</span>(<span class="ruby-identifier">p</span> = <span class="ruby-keyword kw">nil</span>, <span class="ruby-operator">&amp;</span><span class="ruby-identifier">block</span>)
+ <span class="ruby-ivar">@setup</span> = <span class="ruby-identifier">p</span> <span class="ruby-value">? </span><span class="ruby-identifier">p</span> <span class="ruby-operator">:</span> <span class="ruby-identifier">block</span>
+ <span class="ruby-keyword kw">end</span>
+</pre>
+ </div>
+ </div>
+ </div>
+
+ <div id="method-M000008" class="method-detail">
+ <a name="M000008"></a>
+
+ <div class="method-heading">
+ <a href="#M000008" class="method-signature">
+ <span class="method-name">teardown</span><span class="method-args">(p = nil, &amp;block)</span>
+ </a>
+ </div>
+
+ <div class="method-description">
+ <p>
+Run last, once for each page. Given the URL as a string.
+</p>
+ <p><a class="source-toggle" href="#"
+ onclick="toggleCode('M000008-source');return false;">[Source]</a></p>
+ <div class="method-source-code" id="M000008-source">
+<pre>
+<span class="ruby-comment cmt"># File lib/spider/spider_instance.rb, line 140</span>
+ <span class="ruby-keyword kw">def</span> <span class="ruby-identifier">teardown</span>(<span class="ruby-identifier">p</span> = <span class="ruby-keyword kw">nil</span>, <span class="ruby-operator">&amp;</span><span class="ruby-identifier">block</span>)
+ <span class="ruby-ivar">@teardown</span> = <span class="ruby-identifier">p</span> <span class="ruby-value">? </span><span class="ruby-identifier">p</span> <span class="ruby-operator">:</span> <span class="ruby-identifier">block</span>
+ <span class="ruby-keyword kw">end</span>
+</pre>
+ </div>
+ </div>
+ </div>
+
+
+ </div>
+
+
+ </div>
+
+
+<div id="validator-badges">
+ <p><small><a href="http://validator.w3.org/check/referer">[Validate]</a></small></p>
+</div>
+
+</body>
+</html>
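The `on` method documented above stores each handler under either a numeric status code or a symbol such as `:every`, `:success`, or `:failure`. The sketch below shows one way such a table could be built and consulted. It is an illustration under stated assumptions, not the library's implementation: `HandlerTable` and `dispatch` are hypothetical, the order in which handlers fire is assumed, treating 2xx as success is assumed, and a bare integer status stands in for the `Net::HTTPResponse` that Spider passes to handlers. It also uses `Integer` where the 2007 source uses `Fixnum`, which recent Ruby versions no longer define.

```ruby
# Hypothetical sketch; the class name and the dispatch rules are assumptions.
class HandlerTable
  def initialize
    @callbacks = {}
  end

  # Register a handler (a Proc or a block) under an Integer status
  # code or a symbol trigger, mirroring the documented signature.
  def on(code, handler = nil, &block)
    key = code.is_a?(Integer) ? code : code.to_sym
    @callbacks[key] = handler || block
    self
  end

  # Assumed dispatch order: :every, then :success (2xx) or :failure,
  # then the handler registered for the exact status code.
  def dispatch(a_url, status, prior_url)
    outcome = (200..299).cover?(status) ? :success : :failure
    [:every, outcome, status].each do |trigger|
      handler = @callbacks[trigger]
      handler.call(a_url, status, prior_url) if handler
    end
  end
end
```

Registering a second handler for the same trigger replaces the first, which matches the single assignment in the documented `on` source.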
1  doc/created.rid
@@ -0,0 +1 @@
+Sat, 10 Nov 2007 00:25:19 -0500
223 doc/files/README.html
@@ -0,0 +1,223 @@
+<?xml version="1.0" encoding="iso-8859-1"?>
+<!DOCTYPE html
+ PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
+ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
+<head>
+ <title>File: README</title>
+ <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
+ <meta http-equiv="Content-Script-Type" content="text/javascript" />
+ <link rel="stylesheet" href=".././rdoc-style.css" type="text/css" media="screen" />
+ <script type="text/javascript">
+ // <![CDATA[
+
+ function popupCode( url ) {
+ window.open(url, "Code", "resizable=yes,scrollbars=yes,toolbar=no,status=no,height=150,width=400")
+ }
+
+ function toggleCode( id ) {
+ if ( document.getElementById )
+ elem = document.getElementById( id );
+ else if ( document.all )
+ elem = eval( "document.all." + id );
+ else
+ return false;
+
+ elemStyle = elem.style;
+
+ if ( elemStyle.display != "block" ) {
+ elemStyle.display = "block"
+ } else {
+ elemStyle.display = "none"
+ }
+
+ return true;
+ }
+
+ // Make codeblocks hidden by default
+ document.writeln( "<style type=\"text/css\">div.method-source-code { display: none }</style>" )
+
+ // ]]>
+ </script>
+
+</head>
+<body>
+
+
+
+ <div id="fileHeader">
+ <h1>README</h1>
+ <table class="header-table">
+ <tr class="top-aligned-row">
+ <td><strong>Path:</strong></td>
+ <td>README
+ </td>
+ </tr>
+ <tr class="top-aligned-row">
+ <td><strong>Last Update:</strong></td>
+ <td>Thu Nov 08 17:51:17 -0500 2007</td>
+ </tr>
+ </table>
+ </div>
+ <!-- banner header -->
+
+ <div id="bodyContent">
+
+
+
+ <div id="contextContent">
+
+ <div id="description">
+ <p>
+<a href="../classes/Spider.html">Spider</a>, a Web spidering library for
+Ruby. It handles the robots.txt, scraping, collecting, and looping so that
+you can just handle the data.
+</p>
+<h2>Examples</h2>
+<h3>Crawl the Web, loading each page in turn, until you run out of memory</h3>
+<pre>
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') {}
+</pre>
+<h3>To handle erroneous responses</h3>
+<pre>
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.on :failure do |a_url, resp, prior_url|
+ puts &quot;URL failed: #{a_url}&quot;
+ puts &quot; linked from #{prior_url}&quot;
+ end
+ end
+</pre>
+<h3>Or handle successful responses</h3>
+<pre>
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.on :success do |a_url, resp, prior_url|
+ puts &quot;#{a_url}: #{resp.code}&quot;
+ puts resp.body
+ puts
+ end
+ end
+</pre>
+<h3>Limit to just one domain</h3>
+<pre>
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.add_url_check do |a_url|
+ a_url =~ %r{^http://mike-burns.com.*}
+ end
+ end
+</pre>
+<h3>Pass headers to some requests</h3>
+<pre>
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.setup do |a_url|
+ if a_url =~ %r{^http://.*wikipedia.*}
+ headers['User-Agent'] = &quot;Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&quot;
+ end
+ end
+ end
+</pre>
+<h3>Use memcached to track cycles</h3>
+<pre>
+ require 'spider'
+ require 'spider/included_in_memcached'
+ SERVERS = ['10.0.10.2:11211','10.0.10.3:11211','10.0.10.4:11211']
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.check_already_seen_with IncludedInMemcached.new(SERVERS)
+ end
+</pre>
+<h3>Track cycles with a custom object</h3>
+<pre>
+ require 'spider'
+
+ class ExpireLinks &lt; Hash
+ def &lt;&lt;(v)
+ self[v] = Time.now
+ end
+ def include?(v)
+ self[v] &amp;&amp; (self[v] + 86400) &gt;= Time.now
+ end
+ end
+
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.check_already_seen_with ExpireLinks.new
+ end
+</pre>
+<h3>Create a URL graph</h3>
+<pre>
+ require 'spider'
+ nodes = {}
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.add_url_check {|a_url| a_url =~ %r{^http://mike-burns.com.*} }
+
+ s.on(:every) do |a_url, resp, prior_url|
+ nodes[prior_url] ||= []
+ nodes[prior_url] &lt;&lt; a_url
+ end
+ end
+</pre>
+<h3>Use a proxy</h3>
+<pre>
+ require 'net/http_configuration'
+ require 'spider'
+ http_conf = Net::HTTP::Configuration.new(:proxy_host =&gt; '7proxies.org',
+ :proxy_port =&gt; 8881)
+ http_conf.apply do
+ Spider.start_at('http://img.4chan.org/b/') do |s|
+ s.on(:success) do |a_url, resp, prior_url|
+ File.open(a_url.gsub('/',':'),'w') do |f|
+ f.write(resp.body)
+ end
+ end
+ end
+ end
+</pre>
+<h2>Author</h2>
+<p>
+Mike Burns <a href="http://mike-burns.com">mike-burns.com</a>
+mike@mike-burns.com
+</p>
+<p>
+Help from Matt Horan, John Nagro, and Henri Cook.
+</p>
+<p>
+With `robot_rules&#8217; from James Edward Gray II via <a
+href="http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589">blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589</a>
+</p>
+
+ </div>
+
+
+ </div>
+
+
+ </div>
+
+
+ <!-- if includes -->
+
+ <div id="section">
+
+
+
+
+
+
+
+
+ <!-- if method_list -->
+
+
+ </div>
+
+
+<div id="validator-badges">
+ <p><small><a href="http://validator.w3.org/check/referer">[Validate]</a></small></p>
+</div>
+
+</body>
+</html>
114 doc/files/lib/spider/included_in_memcached_rb.html
@@ -0,0 +1,114 @@
+<?xml version="1.0" encoding="iso-8859-1"?>
+<!DOCTYPE html
+ PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
+ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
+<head>
+ <title>File: included_in_memcached.rb</title>
+ <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
+ <meta http-equiv="Content-Script-Type" content="text/javascript" />
+ <link rel="stylesheet" href="../../.././rdoc-style.css" type="text/css" media="screen" />
+ <script type="text/javascript">
+ // <![CDATA[
+
+ function popupCode( url ) {
+ window.open(url, "Code", "resizable=yes,scrollbars=yes,toolbar=no,status=no,height=150,width=400")
+ }
+
+ function toggleCode( id ) {
+ if ( document.getElementById )
+ elem = document.getElementById( id );
+ else if ( document.all )
+ elem = eval( "document.all." + id );
+ else
+ return false;
+
+ elemStyle = elem.style;
+
+ if ( elemStyle.display != "block" ) {
+ elemStyle.display = "block"
+ } else {
+ elemStyle.display = "none"
+ }
+
+ return true;
+ }
+
+ // Make codeblocks hidden by default
+ document.writeln( "<style type=\"text/css\">div.method-source-code { display: none }</style>" )
+
+ // ]]>
+ </script>
+
+</head>
+<body>
+
+
+
+ <div id="fileHeader">
+ <h1>included_in_memcached.rb</h1>
+ <table class="header-table">
+ <tr class="top-aligned-row">
+ <td><strong>Path:</strong></td>
+ <td>lib/spider/included_in_memcached.rb
+ </td>
+ </tr>
+ <tr class="top-aligned-row">
+ <td><strong>Last Update:</strong></td>
+ <td>Sat Nov 10 00:24:11 -0500 2007</td>
+ </tr>
+ </table>
+ </div>
+ <!-- banner header -->
+
+ <div id="bodyContent">
+
+
+
+ <div id="contextContent">
+
+ <div id="description">
+ <p>
+Use memcached to track cycles.
+</p>
+
+ </div>
+
+ <div id="requires-list">
+ <h3 class="section-bar">Required files</h3>
+
+ <div class="name-list">
+ memcache&nbsp;&nbsp;
+ </div>
+ </div>
+
+ </div>
+
+
+ </div>
+
+
+ <!-- if includes -->
+
+ <div id="section">
+
+
+
+
+
+
+
+
+ <!-- if method_list -->
+
+
+ </div>
+
+
+<div id="validator-badges">
+ <p><small><a href="http://validator.w3.org/check/referer">[Validate]</a></small></p>
+</div>
+
+</body>
+</html>
118 doc/files/lib/spider/spider_instance_rb.html
@@ -0,0 +1,118 @@
+<?xml version="1.0" encoding="iso-8859-1"?>
+<!DOCTYPE html
+ PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
+ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
+<head>
+ <title>File: spider_instance.rb</title>
+ <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
+ <meta http-equiv="Content-Script-Type" content="text/javascript" />
+ <link rel="stylesheet" href="../../.././rdoc-style.css" type="text/css" media="screen" />
+ <script type="text/javascript">
+ // <![CDATA[
+
+ function popupCode( url ) {
+ window.open(url, "Code", "resizable=yes,scrollbars=yes,toolbar=no,status=no,height=150,width=400")
+ }
+
+ function toggleCode( id ) {
+ if ( document.getElementById )
+ elem = document.getElementById( id );
+ else if ( document.all )
+ elem = eval( "document.all." + id );
+ else
+ return false;
+
+ elemStyle = elem.style;
+
+ if ( elemStyle.display != "block" ) {
+ elemStyle.display = "block"
+ } else {
+ elemStyle.display = "none"
+ }
+
+ return true;
+ }
+
+ // Make codeblocks hidden by default
+ document.writeln( "<style type=\"text/css\">div.method-source-code { display: none }</style>" )
+
+ // ]]>
+ </script>
+
+</head>
+<body>
+
+
+
+ <div id="fileHeader">
+ <h1>spider_instance.rb</h1>
+ <table class="header-table">
+ <tr class="top-aligned-row">
+ <td><strong>Path:</strong></td>
+ <td>lib/spider/spider_instance.rb
+ </td>
+ </tr>
+ <tr class="top-aligned-row">
+ <td><strong>Last Update:</strong></td>
+ <td>Sat Nov 10 00:25:04 -0500 2007</td>
+ </tr>
+ </table>
+ </div>
+ <!-- banner header -->
+
+ <div id="bodyContent">
+
+
+
+ <div id="contextContent">
+
+ <div id="description">
+ <p>
+Specialized spidering rules.
+</p>
+
+ </div>
+
+ <div id="requires-list">
+ <h3 class="section-bar">Required files</h3>
+
+ <div class="name-list">
+ robot_rules&nbsp;&nbsp;
+ open-uri&nbsp;&nbsp;
+ uri&nbsp;&nbsp;
+ net/http&nbsp;&nbsp;
+ net/https&nbsp;&nbsp;
+ </div>
+ </div>
+
+ </div>
+
+
+ </div>
+
+
+ <!-- if includes -->
+
+ <div id="section">
+
+
+
+
+
+
+
+
+ <!-- if method_list -->
+
+
+ </div>
+
+
+<div id="validator-badges">
+ <p><small><a href="http://validator.w3.org/check/referer">[Validate]</a></small></p>
+</div>
+
+</body>
+</html>
223 doc/files/lib/spider_rb.html
@@ -0,0 +1,223 @@
+<?xml version="1.0" encoding="iso-8859-1"?>
+<!DOCTYPE html
+ PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
+ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
+<head>
+ <title>File: spider.rb</title>
+ <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
+ <meta http-equiv="Content-Script-Type" content="text/javascript" />
+ <link rel="stylesheet" href="../.././rdoc-style.css" type="text/css" media="screen" />
+ <script type="text/javascript">
+ // <![CDATA[
+
+ function popupCode( url ) {
+ window.open(url, "Code", "resizable=yes,scrollbars=yes,toolbar=no,status=no,height=150,width=400")
+ }
+
+ function toggleCode( id ) {
+ if ( document.getElementById )
+ elem = document.getElementById( id );
+ else if ( document.all )
+ elem = eval( "document.all." + id );
+ else
+ return false;
+
+ elemStyle = elem.style;
+
+ if ( elemStyle.display != "block" ) {
+ elemStyle.display = "block"
+ } else {
+ elemStyle.display = "none"
+ }
+
+ return true;
+ }
+
+ // Make codeblocks hidden by default
+ document.writeln( "<style type=\"text/css\">div.method-source-code { display: none }</style>" )
+
+ // ]]>
+ </script>
+
+</head>
+<body>
+
+
+
+ <div id="fileHeader">
+ <h1>spider.rb</h1>
+ <table class="header-table">
+ <tr class="top-aligned-row">
+ <td><strong>Path:</strong></td>
+ <td>lib/spider.rb
+ </td>
+ </tr>
+ <tr class="top-aligned-row">
+ <td><strong>Last Update:</strong></td>
+ <td>Thu Nov 08 17:29:01 -0500 2007</td>
+ </tr>
+ </table>
+ </div>
+ <!-- banner header -->
+
+ <div id="bodyContent">
+
+
+
+ <div id="contextContent">
+
+ <div id="description">
+ <p>
+Copyright 2007 Mike Burns. <a href="../../classes/Spider.html">Spider</a>, a
+Web spidering library for Ruby. It handles the robots.txt, scraping,
+collecting, and looping so that you can just handle the data.
+</p>
+<h2>Examples</h2>
+<h3>Crawl the Web, loading each page in turn, until you run out of memory</h3>
+<pre>
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') {}
+</pre>
+<h3>To handle erroneous responses</h3>
+<pre>
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.on :failure do |a_url, resp, prior_url|
+ puts &quot;URL failed: #{a_url}&quot;
+ puts &quot; linked from #{prior_url}&quot;
+ end
+ end
+</pre>
+<h3>Or handle successful responses</h3>
+<pre>
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.on :success do |a_url, resp, prior_url|
+ puts &quot;#{a_url}: #{resp.code}&quot;
+ puts resp.body
+ puts
+ end
+ end
+</pre>
+<h3>Limit to just one domain</h3>
+<pre>
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.add_url_check do |a_url|
+ a_url =~ %r{^http://mike-burns.com.*}
+ end
+ end
+</pre>
+<h3>Pass headers to some requests</h3>
+<pre>
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.setup do |a_url|
+ if a_url =~ %r{^http://.*wikipedia.*}
+ headers['User-Agent'] = &quot;Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&quot;
+ end
+ end
+ end
+</pre>
+<h3>Use memcached to track cycles</h3>
+<pre>
+ require 'spider'
+ require 'spider/included_in_memcached'
+ SERVERS = ['10.0.10.2:11211','10.0.10.3:11211','10.0.10.4:11211']
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.check_already_seen_with IncludedInMemcached.new(SERVERS)
+ end
+</pre>
+<h3>Track cycles with a custom object</h3>
+<pre>
+ require 'spider'
+
+ class ExpireLinks &lt; Hash
+ def &lt;&lt;(v)
+ self[v] = Time.now
+ end
+ def include?(v)
+ self[v] &amp;&amp; (self[v] + 86400) &gt;= Time.now
+ end
+ end
+
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.check_already_seen_with ExpireLinks.new
+ end
+</pre>
+<h3>Create a URL graph</h3>
+<pre>
+ require 'spider'
+ nodes = {}
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.add_url_check {|a_url| a_url =~ %r{^http://mike-burns.com.*} }
+
+ s.on(:every) do |a_url, resp, prior_url|
+ nodes[prior_url] ||= []
+ nodes[prior_url] &lt;&lt; a_url
+ end
+ end
+</pre>
+<h3>Use a proxy</h3>
+<pre>
+ require 'net/http_configuration'
+ require 'spider'
+ http_conf = Net::HTTP::Configuration.new(:proxy_host =&gt; '7proxies.org',
+ :proxy_port =&gt; 8881)
+ http_conf.apply do
+ Spider.start_at('http://img.4chan.org/b/') do |s|
+ s.on(:success) do |a_url, resp, prior_url|
+ File.open(a_url.gsub('/',':'),'w') do |f|
+ f.write(resp.body)
+ end
+ end
+ end
+ end
+</pre>
+<h2>Author</h2>
+<p>
+Mike Burns <a href="http://mike-burns.com">mike-burns.com</a>
+mike@mike-burns.com
+</p>
+<p>
+Help from Matt Horan, John Nagro, and Henri Cook.
+</p>
+<p>
+With `robot_rules&#8217; from James Edward Gray II via <a
+href="http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589">blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589</a>
+</p>
+
+ </div>
+
+
+ </div>
+
+
+ </div>
+
+
+ <!-- if includes -->
+
+ <div id="section">
+
+
+
+
+
+
+
+
+ <!-- if method_list -->
+
+
+ </div>
+
+
+<div id="validator-badges">
+ <p><small><a href="http://validator.w3.org/check/referer">[Validate]</a></small></p>
+</div>
+
+</body>
+</html>
29 doc/fr_class_index.html
@@ -0,0 +1,29 @@
+
+<?xml version="1.0" encoding="iso-8859-1"?>
+<!DOCTYPE html
+ PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
+ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+
+<!--
+
+ Classes
+
+ -->
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
+<head>
+ <title>Classes</title>
+ <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
+ <link rel="stylesheet" href="rdoc-style.css" type="text/css" />
+ <base target="docwin" />
+</head>
+<body>
+<div id="index">
+ <h1 class="section-bar">Classes</h1>
+ <div id="index-entries">
+ <a href="classes/IncludedInMemcached.html">IncludedInMemcached</a><br />
+ <a href="classes/Spider.html">Spider</a><br />
+ <a href="classes/SpiderInstance.html">SpiderInstance</a><br />
+ </div>
+</div>
+</body>
+</html>
30 doc/fr_file_index.html
@@ -0,0 +1,30 @@
+
+<?xml version="1.0" encoding="iso-8859-1"?>
+<!DOCTYPE html
+ PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
+ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+
+<!--
+
+ Files
+
+ -->
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
+<head>
+ <title>Files</title>
+ <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
+ <link rel="stylesheet" href="rdoc-style.css" type="text/css" />
+ <base target="docwin" />
+</head>
+<body>
+<div id="index">
+ <h1 class="section-bar">Files</h1>
+ <div id="index-entries">
+ <a href="files/README.html">README</a><br />
+ <a href="files/lib/spider_rb.html">lib/spider.rb</a><br />
+ <a href="files/lib/spider/included_in_memcached_rb.html">lib/spider/included_in_memcached.rb</a><br />
+ <a href="files/lib/spider/spider_instance_rb.html">lib/spider/spider_instance.rb</a><br />
+ </div>
+</div>
+</body>
+</html>
37 doc/fr_method_index.html
@@ -0,0 +1,37 @@
+
+<?xml version="1.0" encoding="iso-8859-1"?>
+<!DOCTYPE html
+ PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
+ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+
+<!--
+
+ Methods
+
+ -->
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
+<head>
+ <title>Methods</title>
+ <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
+ <link rel="stylesheet" href="rdoc-style.css" type="text/css" />
+ <base target="docwin" />
+</head>
+<body>
+<div id="index">
+ <h1 class="section-bar">Methods</h1>
+ <div id="index-entries">
+ <a href="classes/IncludedInMemcached.html#M000002">&lt;&lt; (IncludedInMemcached)</a><br />
+ <a href="classes/SpiderInstance.html#M000004">add_url_check (SpiderInstance)</a><br />
+ <a href="classes/SpiderInstance.html#M000005">check_already_seen_with (SpiderInstance)</a><br />
+ <a href="classes/SpiderInstance.html#M000010">clear_headers (SpiderInstance)</a><br />
+ <a href="classes/SpiderInstance.html#M000009">headers (SpiderInstance)</a><br />
+ <a href="classes/IncludedInMemcached.html#M000003">include? (IncludedInMemcached)</a><br />
+ <a href="classes/IncludedInMemcached.html#M000001">new (IncludedInMemcached)</a><br />
+ <a href="classes/SpiderInstance.html#M000006">on (SpiderInstance)</a><br />
+ <a href="classes/SpiderInstance.html#M000007">setup (SpiderInstance)</a><br />
+ <a href="classes/Spider.html#M000011">start_at (Spider)</a><br />
+ <a href="classes/SpiderInstance.html#M000008">teardown (SpiderInstance)</a><br />
+ </div>
+</div>
+</body>
+</html>
24 doc/index.html
@@ -0,0 +1,24 @@
+<?xml version="1.0" encoding="iso-8859-1"?>
+<!DOCTYPE html
+ PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
+ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
+
+<!--
+
+ RDoc Documentation
+
+ -->
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
+<head>
+ <title>RDoc Documentation</title>
+ <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
+</head>
+<frameset rows="20%, 80%">
+ <frameset cols="25%,35%,45%">
+ <frame src="fr_file_index.html" title="Files" name="Files" />
+ <frame src="fr_class_index.html" name="Classes" />
+ <frame src="fr_method_index.html" name="Methods" />
+ </frameset>
+ <frame src="files/lib/spider_rb.html" name="docwin" />
+</frameset>
+</html>
208 doc/rdoc-style.css
@@ -0,0 +1,208 @@
+
+body {
+ font-family: Verdana,Arial,Helvetica,sans-serif;
+ font-size: 90%;
+ margin: 0;
+ margin-left: 40px;
+ padding: 0;
+ background: white;
+}
+
+h1,h2,h3,h4 { margin: 0; color: #efefef; background: transparent; }
+h1 { font-size: 150%; }
+h2,h3,h4 { margin-top: 1em; }
+
+a { background: #eef; color: #039; text-decoration: none; }
+a:hover { background: #039; color: #eef; }
+
+/* Override the base stylesheet's Anchor inside a table cell */
+td > a {
+ background: transparent;
+ color: #039;
+ text-decoration: none;
+}
+
+/* and inside a section title */
+.section-title > a {
+ background: transparent;
+ color: #eee;
+ text-decoration: none;
+}
+
+/* === Structural elements =================================== */
+
+div#index {
+ margin: 0;
+ margin-left: -40px;
+ padding: 0;
+ font-size: 90%;
+}
+
+
+div#index a {
+ margin-left: 0.7em;
+}
+
+div#index .section-bar {
+ margin-left: 0px;
+ padding-left: 0.7em;
+ background: #ccc;
+ font-size: small;
+}
+
+
+div#classHeader, div#fileHeader {
+ width: auto;
+ color: white;
+ padding: 0.5em 1.5em 0.5em 1.5em;
+ margin: 0;
+ margin-left: -40px;
+ border-bottom: 3px solid #006;
+}
+
+div#classHeader a, div#fileHeader a {
+ background: inherit;
+ color: white;
+}
+
+div#classHeader td, div#fileHeader td {
+ background: inherit;
+ color: white;
+}
+
+
+div#fileHeader {
+ background: #057;
+}
+
+div#classHeader {
+ background: #048;
+}
+
+
+.class-name-in-header {
+ font-size: 180%;
+ font-weight: bold;
+}
+
+
+div#bodyContent {
+ padding: 0 1.5em 0 1.5em;
+}
+
+div#description {
+ padding: 0.5em 1.5em;
+ background: #efefef;
+ border: 1px dotted #999;
+}
+
+div#description h1,h2,h3,h4,h5,h6 {
+ color: #125;
+ background: transparent;
+}
+
+div#validator-badges {
+ text-align: center;
+}
+div#validator-badges img { border: 0; }
+
+div#copyright {
+ color: #333;
+ background: #efefef;
+ font: 0.75em sans-serif;
+ margin-top: 5em;
+ margin-bottom: 0;
+ padding: 0.5em 2em;
+}
+
+
+/* === Classes =================================== */
+
+table.header-table {
+ color: white;
+ font-size: small;
+}
+
+.type-note {
+ font-size: small;
+ color: #DEDEDE;
+}
+
+.xxsection-bar {
+ background: #eee;
+ color: #333;
+ padding: 3px;
+}
+
+.section-bar {
+ color: #333;
+ border-bottom: 1px solid #999;
+ margin-left: -20px;
+}
+
+
+.section-title {
+ background: #79a;
+ color: #eee;
+ padding: 3px;
+ margin-top: 2em;
+ margin-left: -30px;
+ border: 1px solid #999;
+}
+
+.top-aligned-row { vertical-align: top }
+.bottom-aligned-row { vertical-align: bottom }
+
+/* --- Context section classes ----------------------- */
+
+.context-row { }
+.context-item-name { font-family: monospace; font-weight: bold; color: black; }
+.context-item-value { font-size: small; color: #448; }
+.context-item-desc { color: #333; padding-left: 2em; }
+
+/* --- Method classes -------------------------- */
+.method-detail {
+ background: #efefef;
+ padding: 0;
+ margin-top: 0.5em;
+ margin-bottom: 1em;
+ border: 1px dotted #ccc;
+}
+.method-heading {
+ color: black;
+ background: #ccc;
+ border-bottom: 1px solid #666;
+ padding: 0.2em 0.5em 0 0.5em;
+}
+.method-signature { color: black; background: inherit; }
+.method-name { font-weight: bold; }
+.method-args { font-style: italic; }
+.method-description { padding: 0 0.5em 0 0.5em; }
+
+/* --- Source code sections -------------------- */
+
+a.source-toggle { font-size: 90%; }
+div.method-source-code {
+ background: #262626;
+ color: #ffdead;
+ margin: 1em;
+ padding: 0.5em;
+ border: 1px dashed #999;
+ overflow: hidden;
+}
+
+div.method-source-code pre { color: #ffdead; overflow: hidden; }
+
+/* --- Ruby keyword styles --------------------- */
+
+.standalone-code { background: #221111; color: #ffdead; overflow: hidden; }
+
+.ruby-constant { color: #7fffd4; background: transparent; }
+.ruby-keyword { color: #00ffff; background: transparent; }
+.ruby-ivar { color: #eedd82; background: transparent; }
+.ruby-operator { color: #00ffee; background: transparent; }
+.ruby-identifier { color: #ffdead; background: transparent; }
+.ruby-node { color: #ffa07a; background: transparent; }
+.ruby-comment { color: #b22222; font-weight: bold; background: transparent; }
+.ruby-regexp { color: #ffa07a; background: transparent; }
+.ruby-value { color: #7fffd4; background: transparent; }
60 lib/spider.rb
@@ -0,0 +1,60 @@
+# Copyright 2007 Mike Burns
+# :include: README
+
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+# * Neither the name Mike Burns nor the
+# names of his contributors may be used to endorse or promote products
+# derived from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY Mike Burns ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL Mike Burns BE LIABLE FOR ANY
+# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+require File.dirname(__FILE__)+'/spider/spider_instance'
+
+# A spidering library for Ruby. Handles robots.txt, scraping, finding more
+# links, and doing it all over again.
+class Spider
+ # Runs the spider starting at the given URL. Also takes a block that is given
+ # the SpiderInstance. Use the block to define the rules and handlers for
+ # the discovered Web pages. See SpiderInstance for the possible rules and
+ # handlers.
+ #
+ # Spider.start_at('http://mike-burns.com/') do |s|
+ # s.add_url_check do |a_url|
+ # a_url =~ %r{^http://mike-burns.com.*}
+ # end
+ #
+ # s.on 404 do |a_url, resp, prior_url|
+ # puts "URL not found: #{a_url}"
+ # end
+ #
+ # s.on :success do |a_url, resp, prior_url|
+ # puts "body: #{resp.body}"
+ # end
+ #
+ # s.on :every do |a_url, resp, prior_url|
+ # puts "URL returned anything: #{a_url} with this code #{resp.code}"
+ # end
+ # end
+
+ def self.start_at(a_url, &block)
+ rules = RobotRules.new('Ruby Spider 1.0')
+ a_spider = SpiderInstance.new({nil => a_url}, [], rules, [])
+ block.call(a_spider)
+ a_spider.start!
+ end
+end
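
The doc comment on `start_at` above registers handlers under `404`, `:success`, and `:every`. As a rough illustration of how such triggers could be looked up, here is a minimal sketch. `HandlerRegistry` and `dispatch` are hypothetical names, not part of this commit. It assumes a 2xx code counts as success and everything else as failure, and it passes the integer status code where the library passes a `Net::HTTPResponse`.

```ruby
# Hypothetical registry for illustration only; not the library's SpiderInstance.
class HandlerRegistry
  def initialize
    @callbacks = {}
  end

  # Register a handler under :every, :success, :failure, or an Integer status
  # code. Accepts either a Proc or a block, like the `on` method in the diff.
  def on(trigger, handler = nil, &block)
    @callbacks[trigger] = handler || block
  end

  # Call every handler matching this status code and return the triggers that
  # fired, in order. Assumption: 2xx means :success, anything else :failure.
  def dispatch(code, a_url, prior_url)
    outcome = (200..299).cover?(code) ? :success : :failure
    fired = []
    [:every, outcome, code].each do |trigger|
      handler = @callbacks[trigger]
      next unless handler
      handler.call(a_url, code, prior_url)
      fired << trigger
    end
    fired
  end
end
```

Keying the callbacks by trigger in a plain Hash keeps registration and lookup cheap, at the cost of allowing only one handler per trigger.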
52 lib/spider/included_in_memcached.rb
@@ -0,0 +1,52 @@
+# Use memcached to track cycles.
+
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+# * Neither the name Mike Burns nor the
+# names of his contributors may be used to endorse or promote products
+# derived from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY Mike Burns ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL Mike Burns BE LIABLE FOR ANY
+# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+require 'memcache'
+
+# A specialized class using memcached to track items stored. It supports
+# three operations: new, <<, and include? . Together these can be used to
+# add items to the memcache, then determine whether the item has been added.
+#
+# To use it with Spider use the check_already_seen_with method:
+#
+# Spider.start_at('http://example.com/') do |s|
+# s.check_already_seen_with IncludedInMemcached.new('localhost:11211')
+# end
+class IncludedInMemcached
+ # Construct a new IncludedInMemcached instance. All arguments here are
+ # passed to MemCache (part of the memcache-client gem).
+ def initialize(*a)
+ @c = MemCache.new(*a)
+ end
+
+ # Add an item to the memcache.
+ def <<(v)
+ @c.add(v.to_s, v)
+ end
+
+ # True if the item is in the memcache.
+ def include?(v)
+ @c.get(v.to_s) == v
+ end
+end
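
`IncludedInMemcached` above shows that a cycle tracker only has to respond to `<<` and `include?`. For comparison, here is a minimal in-memory sketch of the same protocol with expiry. `ExpiringSeen`, its `ttl_seconds` parameter, and the injectable `clock` are all hypothetical and exist only for this illustration.

```ruby
# Hypothetical in-memory tracker; only << and include? are needed to satisfy
# check_already_seen_with.
class ExpiringSeen
  # ttl_seconds: how long a URL counts as seen.
  # clock: a callable returning the current Time (injectable for testing).
  def initialize(ttl_seconds = 86_400, clock = -> { Time.now })
    @ttl = ttl_seconds
    @clock = clock
    @seen_at = {}
  end

  # Record the URL with the current timestamp.
  def <<(url)
    @seen_at[url.to_s] = @clock.call
    self
  end

  # True only if the URL was recorded less than ttl_seconds ago.
  def include?(url)
    stamp = @seen_at[url.to_s]
    !stamp.nil? && (@clock.call - stamp) < @ttl
  end
end
```

Assuming the block API shown elsewhere in this commit, it would be passed as `s.check_already_seen_with ExpiringSeen.new`. Unlike the memcached wrapper, the Hash here grows without bound, so it only suits small crawls.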
77 lib/spider/robot_rules.rb
@@ -0,0 +1,77 @@
+# Understand robots.txt.
+
+# Created by James Edward Gray II on 2006-01-31.
+# Copyright 2006 Gray Productions. All rights reserved.
+
+require "uri"
+
+# Based on Perl's WWW::RobotRules module, by Gisle Aas.
+class RobotRules
+ def initialize( user_agent )
+ @user_agent = user_agent.scan(/\S+/).first.sub(%r{/.*}, "").downcase
+ @rules = Hash.new { |rules, rule| rules[rule] = Array.new }
+ end
+
+ def parse( text_uri, robots_data )
+ uri = URI.parse(text_uri)
+ location = "#{uri.host}:#{uri.port}"
+ @rules.delete(location)
+
+ rules = robots_data.split(/[\015\012]+/).map do |rule|
+ rule.sub(/\s*#.*$/, "")
+ end
+ anon_rules = Array.new
+ my_rules = Array.new
+ current = anon_rules
+ rules.each do |rule|
+ case rule
+ when /^\s*User-Agent\s*:\s*(.+?)\s*$/i
+ break unless my_rules.empty?
+
+ current = if $1 == "*"
+ anon_rules
+ elsif $1.downcase.index(@user_agent)
+ my_rules
+ else
+ nil
+ end
+ when /^\s*Disallow\s*:\s*(.*?)\s*$/i
+ next if current.nil?
+
+ if $1.empty?
+ current << nil
+ else
+ disallow = URI.parse($1)
+
+ next unless disallow.scheme.nil? or disallow.scheme ==
+ uri.scheme
+ next unless disallow.port.nil? or disallow.port == uri.port
+ next unless disallow.host.nil? or
+ disallow.host.downcase == uri.host.downcase
+
+ disallow = disallow.path
+ disallow = "/" if disallow.empty?
+ disallow = "/#{disallow}" unless disallow[0] == ?/
+
+ current << disallow
+ end
+ end
+ end
+
+ @rules[location] = if my_rules.empty?
+ anon_rules.compact
+ else
+ my_rules.compact
+ end
+ end
+
+ def allowed?( text_uri )
+ uri = URI.parse(text_uri)
+ location = "#{uri.host}:#{uri.port}"
+ path = uri.path
+
+ return true unless %w{http https}.include?(uri.scheme)
+
+ not @rules[location].any? { |rule| path.index(rule) == 0 }
+ end
+end
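
The `allowed?` method above treats each Disallow rule as a path prefix. Here is a standalone sketch of that prefix test, assuming the rules have already been parsed into a list of path strings. `path_allowed?` is a hypothetical helper, not a method of `RobotRules`.

```ruby
require 'uri'

# Hypothetical helper mirroring the prefix test in allowed?.
# disallowed_prefixes is an Array of path strings such as "/private".
def path_allowed?(text_uri, disallowed_prefixes)
  uri = URI.parse(text_uri)
  # Non-HTTP schemes are never restricted, as in the method above.
  return true unless %w[http https].include?(uri.scheme)
  disallowed_prefixes.none? { |prefix| uri.path.start_with?(prefix) }
end
```

Because the match is a plain string prefix, a rule of `/private` also blocks `/privateer`. A rule meant to cover only a directory needs the trailing slash.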
294 lib/spider/spider_instance.rb
@@ -0,0 +1,294 @@
+# Specialized spidering rules.
+
+# Copyright 2007 Mike Burns
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+# * Neither the name Mike Burns nor the
+# names of his contributors may be used to endorse or promote products
+# derived from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY Mike Burns ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL Mike Burns BE LIABLE FOR ANY
+# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+require File.dirname(__FILE__)+'/robot_rules.rb'
+require 'open-uri'
+require 'uri'
+require 'net/http'
+require 'net/https'
+
+module Net #:nodoc:
+ class HTTPResponse #:nodoc:
+ def success?; false; end
+ def redirect?; false; end
+ end
+ class HTTPSuccess #:nodoc:
+ def success?; true; end
+ end
+ class HTTPRedirection #:nodoc:
+ def redirect?; true; end
+ end
+end
+
+class NilClass #:nodoc:
+ def merge(h); h; end
+end
+
+class SpiderInstance
+ def initialize(next_urls, seen = [], rules = nil, robots_seen = []) #:nodoc:
+ @url_checks = []
+ @cache = :memory
+ @callbacks = {}
+ @next_urls = next_urls
+ @seen = seen
+ @rules = rules || RobotRules.new('Ruby Spider 1.0')
+ @robots_seen = robots_seen
+ @headers = {}
+ @setup = nil
+ @teardown = nil
+ end
+
+ # Add a predicate that determines whether to continue down this URL's path.
+ # All predicates must be true in order for a URL to proceed.
+ #
+ # Takes a block that takes a string and produces a boolean. For example, this
+ # will ensure that the URL starts with 'http://mike-burns.com':
+ #
+ # add_url_check { |a_url| a_url =~ %r{^http://mike-burns.com.*} }
+ def add_url_check(&block)
+ @url_checks << block
+ end
+
+ # The Web is a graph; to avoid cycles we store the nodes (URLs) already
+ # visited. The Web is a really, really, really big graph; as such, this list
+ # of visited nodes grows really, really, really big.
+ #
+ # Change the object used to store these seen nodes with this. The default
+ # object is an instance of Array. Available with Spider is a wrapper of
+ # memcached.
+ #
+ # You can implement a custom class for this; any object passed to
+ # check_already_seen_with must understand just << and include?.
+ #
+ # # default
+ # check_already_seen_with Array.new
+ #
+ # # memcached
+ # require 'spider/included_in_memcached'
+ # check_already_seen_with IncludedInMemcached.new('localhost:11211')
+ def check_already_seen_with(cacher)
+ if cacher.respond_to?(:<<) && cacher.respond_to?(:include?)
+ @seen = cacher
+ else
+ raise ArgumentError, 'expected something that responds to << and include?'
+ end
+ end
+
+ # Add a response handler. A response handler's trigger can be :every,
+ # :success, :failure, or any HTTP status code. The handler itself can be
+ # either a Proc or a block.
+ #
+ # The arguments to the block are: the URL as a string, an instance of
+ # Net::HTTPResponse, and the prior URL as a string.
+ #
+ #
+ # For example:
+ #
+ # on 404 do |a_url, resp, prior_url|
+ # puts "URL not found: #{a_url}"
+ # end
+ #
+ # on :success do |a_url, resp, prior_url|
+ # puts a_url
+ # puts resp.body
+ # end
+ #
+ # on :every do |a_url, resp, prior_url|
+ # puts "Given this code: #{resp.code}"
+ # end
+ def on(code, p = nil, &block)
+ f = p ? p : block
+