Starting the project off with the 0.4.1 code base.

commit 590ed8011a64f2b0c4510f11cb8e4264132da554 0 parents
John Nagro authored
38 CHANGES
... ... @@ -0,0 +1,38 @@
  1 +2007-11-09:
  2 +* Handle redirects that assume a base URL.
  3 +
  4 +2007-11-08:
  5 +* Move spider_instance.rb, robot_rules.rb, and included_in_memcached.rb into
  6 + spider subdirectory.
  7 +
  8 +2007-11-02:
  9 +* Memcached support.
  10 +
  11 +2007-10-31:
  12 +* Add `setup' and `teardown' handlers.
  13 +* Can set the headers for an HTTP request.
  14 +* Changed :any to :every .
  15 +* Changed the arguments to the :every, :success, :failure, and code handler.
  16 +
  17 +2007-10-23:
  18 +* URLs without a page component but with a query component.
  19 +* HTTP Redirect.
  20 +* HTTPS.
  21 +* Version 0.2.1 .
  22 +
  23 +2007-10-22:
  24 +* Use RSpec to ensure that it mostly works.
  25 +* Use WEBrick to create a small test server for additional testing.
  26 +* Completely re-do the API to prepare for future expansion.
  27 +* Add the ability to apply each URL to a series of custom allowed?-like
  28 + matchers.
  29 +* BSD license.
  30 +* Version 0.2.0 .
  31 +
  32 +2007-03-30:
  33 +* Clean up the documentation.
  34 +
  35 +2007-03-28:
  36 +* Change the tail recursion to a `while' loop, to please Ruby.
  37 +* Documentation.
  38 +* Initial release: version 0.1.0 .
114 README
... ... @@ -0,0 +1,114 @@
  1 +Spider, a Web spidering library for Ruby. It handles the robots.txt,
  2 +scraping, collecting, and looping so that you can just handle the data.
  3 +
  4 +== Examples
  5 +
  6 +=== Crawl the Web, loading each page in turn, until you run out of memory
  7 +
  8 + require 'spider'
  9 + Spider.start_at('http://mike-burns.com/') {}
  10 +
  11 +=== To handle erroneous responses
  12 +
  13 + require 'spider'
  14 + Spider.start_at('http://mike-burns.com/') do |s|
  15 + s.on :failure do |a_url, resp, prior_url|
  16 + puts "URL failed: #{a_url}"
  17 + puts " linked from #{prior_url}"
  18 + end
  19 + end
  20 +
  21 +=== Or handle successful responses
  22 +
  23 + require 'spider'
  24 + Spider.start_at('http://mike-burns.com/') do |s|
  25 + s.on :success do |a_url, resp, prior_url|
  26 + puts "#{a_url}: #{resp.code}"
  27 + puts resp.body
  28 + puts
  29 + end
  30 + end
  31 +
  32 +=== Limit to just one domain
  33 +
  34 + require 'spider'
  35 + Spider.start_at('http://mike-burns.com/') do |s|
  36 + s.add_url_check do |a_url|
  37 + a_url =~ %r{^http://mike-burns.com.*}
  38 + end
  39 + end
  40 +
  41 +=== Pass headers to some requests
  42 +
  43 + require 'spider'
  44 + Spider.start_at('http://mike-burns.com/') do |s|
  45 + s.setup do |a_url|
  46 + if a_url =~ %r{^http://.*wikipedia.*}
  47 + headers['User-Agent'] = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
  48 + end
  49 + end
  50 + end
  51 +
  52 +=== Use memcached to track cycles
  53 +
  54 + require 'spider'
  55 + require 'spider/included_in_memcached'
  56 + SERVERS = ['10.0.10.2:11211','10.0.10.3:11211','10.0.10.4:11211']
  57 + Spider.start_at('http://mike-burns.com/') do |s|
  58 + s.check_already_seen_with IncludedInMemcached.new(SERVERS)
  59 + end
  60 +
  61 +=== Track cycles with a custom object
  62 +
  63 + require 'spider'
  64 +
  65 + class ExpireLinks < Hash
  66 + def <<(v)
  67 + self[v] = Time.now
  68 + end
  69 + def include?(v)
  70 + self[v] && (Time.now - 86400) <= self[v]
  71 + end
  72 + end
  73 +
  74 + Spider.start_at('http://mike-burns.com/') do |s|
  75 + s.check_already_seen_with ExpireLinks.new
  76 + end
  77 +
  78 +=== Create a URL graph
  79 +
  80 + require 'spider'
  81 + nodes = {}
  82 + Spider.start_at('http://mike-burns.com/') do |s|
  83 + s.add_url_check {|a_url| a_url =~ %r{^http://mike-burns.com.*} }
  84 +
  85 + s.on(:every) do |a_url, resp, prior_url|
  86 + nodes[prior_url] ||= []
  87 + nodes[prior_url] << a_url
  88 + end
  89 + end
  90 +
  91 +=== Use a proxy
  92 +
  93 + require 'net/http_configuration'
  94 + require 'spider'
  95 + http_conf = Net::HTTP::Configuration.new(:proxy_host => '7proxies.org',
  96 + :proxy_port => 8881)
  97 + http_conf.apply do
  98 + Spider.start_at('http://img.4chan.org/b/') do |s|
  99 + s.on(:success) do |a_url, resp, prior_url|
  100 + File.open(a_url.gsub('/',':'),'w') do |f|
  101 + f.write(resp.body)
  102 + end
  103 + end
  104 + end
  105 + end
  106 +
  107 +== Author
  108 +
  109 +Mike Burns http://mike-burns.com mike@mike-burns.com
  110 +
  111 +Help from Matt Horan, John Nagro, and Henri Cook.
  112 +
  113 +With `robot_rules' from James Edward Gray II via
  114 +http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589
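The README's "Track cycles with a custom object" example depends on only two methods, `<<` and `include?`. The sketch below shows that protocol in isolation, assuming a store that forgets URLs after a time-to-live. `ExpiringSeenSet`, its `ttl` argument, and the injectable `clock` are hypothetical names introduced for illustration; none of them belong to the spider library.

```ruby
# Hypothetical stand-in for the README's ExpireLinks class. It is not part of
# the spider library; it only shows the two methods a "seen" store provides.
class ExpiringSeenSet
  DAY = 86_400 # seconds

  # The clock is injectable so expiry can be exercised without sleeping.
  def initialize(ttl = DAY, clock = -> { Time.now })
    @ttl = ttl
    @clock = clock
    @seen_at = {}
  end

  # Record that a URL was visited at the current clock time.
  def <<(url)
    @seen_at[url] = @clock.call
    self
  end

  # True only if the URL was recorded less than ttl seconds ago.
  def include?(url)
    stamp = @seen_at[url]
    !stamp.nil? && (@clock.call - stamp) < @ttl
  end
end

now = Time.at(0)
seen = ExpiringSeenSet.new(60, -> { now })
seen << 'http://example.com/'
puts seen.include?('http://example.com/') # entry is still fresh
now += 61
puts seen.include?('http://example.com/') # entry is older than the ttl
```

An object like this could presumably be handed to `check_already_seen_with`, since it responds to both `<<` and `include?`. Composing a Hash rather than subclassing it, as done here, is a design choice of the sketch and not something the README prescribes.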
226 doc/classes/IncludedInMemcached.html
... ... @@ -0,0 +1,226 @@
  1 +<?xml version="1.0" encoding="iso-8859-1"?>
  2 +<!DOCTYPE html
  3 + PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  4 + "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
  5 +
  6 +<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  7 +<head>
  8 + <title>Class: IncludedInMemcached</title>
  9 + <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
  10 + <meta http-equiv="Content-Script-Type" content="text/javascript" />
  11 + <link rel="stylesheet" href=".././rdoc-style.css" type="text/css" media="screen" />
  12 + <script type="text/javascript">
  13 + // <![CDATA[
  14 +
  15 + function popupCode( url ) {
  16 + window.open(url, "Code", "resizable=yes,scrollbars=yes,toolbar=no,status=no,height=150,width=400")
  17 + }
  18 +
  19 + function toggleCode( id ) {
  20 + if ( document.getElementById )
  21 + elem = document.getElementById( id );
  22 + else if ( document.all )
  23 + elem = eval( "document.all." + id );
  24 + else
  25 + return false;
  26 +
  27 + elemStyle = elem.style;
  28 +
  29 + if ( elemStyle.display != "block" ) {
  30 + elemStyle.display = "block"
  31 + } else {
  32 + elemStyle.display = "none"
  33 + }
  34 +
  35 + return true;
  36 + }
  37 +
  38 + // Make codeblocks hidden by default
  39 + document.writeln( "<style type=\"text/css\">div.method-source-code { display: none }</style>" )
  40 +
  41 + // ]]>
  42 + </script>
  43 +
  44 +</head>
  45 +<body>
  46 +
  47 +
  48 +
  49 + <div id="classHeader">
  50 + <table class="header-table">
  51 + <tr class="top-aligned-row">
  52 + <td><strong>Class</strong></td>
  53 + <td class="class-name-in-header">IncludedInMemcached</td>
  54 + </tr>
  55 + <tr class="top-aligned-row">
  56 + <td><strong>In:</strong></td>
  57 + <td>
  58 + <a href="../files/lib/spider/included_in_memcached_rb.html">
  59 + lib/spider/included_in_memcached.rb
  60 + </a>
  61 + <br />
  62 + </td>
  63 + </tr>
  64 +
  65 + <tr class="top-aligned-row">
  66 + <td><strong>Parent:</strong></td>
  67 + <td>
  68 + Object
  69 + </td>
  70 + </tr>
  71 + </table>
  72 + </div>
  73 + <!-- banner header -->
  74 +
  75 + <div id="bodyContent">
  76 +
  77 +
  78 +
  79 + <div id="contextContent">
  80 +
  81 + <div id="description">
  82 + <p>
  83 +A specialized class using memcached to track items stored. It supports
  84 +three operations: <a href="IncludedInMemcached.html#M000001">new</a>,
  85 +&lt;&lt;, and <a href="IncludedInMemcached.html#M000003">include?</a> .
  86 +Together these can be used to add items to the memcache, then determine
  87 +whether the item has been added.
  88 +</p>
  89 +<p>
  90 +To use it with <a href="Spider.html">Spider</a> use the
  91 +check_already_seen_with method:
  92 +</p>
  93 +<pre>
  94 + Spider.start_at('http://example.com/') do |s|
  95 + s.check_already_seen_with IncludedInMemcached.new('localhost:11211')
  96 + end
  97 +</pre>
  98 +
  99 + </div>
  100 +
  101 +
  102 + </div>
  103 +
  104 + <div id="method-list">
  105 + <h3 class="section-bar">Methods</h3>
  106 +
  107 + <div class="name-list">
  108 + <a href="#M000002">&lt;&lt;</a>&nbsp;&nbsp;
  109 + <a href="#M000003">include?</a>&nbsp;&nbsp;
  110 + <a href="#M000001">new</a>&nbsp;&nbsp;
  111 + </div>
  112 + </div>
  113 +
  114 + </div>
  115 +
  116 +
  117 + <!-- if includes -->
  118 +
  119 + <div id="section">
  120 +
  121 +
  122 +
  123 +
  124 +
  125 +
  126 +
  127 +
  128 + <!-- if method_list -->
  129 + <div id="methods">
  130 + <h3 class="section-bar">Public Class methods</h3>
  131 +
  132 + <div id="method-M000001" class="method-detail">
  133 + <a name="M000001"></a>
  134 +
  135 + <div class="method-heading">
  136 + <a href="#M000001" class="method-signature">
  137 + <span class="method-name">new</span><span class="method-args">(*a)</span>
  138 + </a>
  139 + </div>
  140 +
  141 + <div class="method-description">
  142 + <p>
  143 +Construct a <a href="IncludedInMemcached.html#M000001">new</a> <a
  144 +href="IncludedInMemcached.html">IncludedInMemcached</a> instance. All
  145 +arguments here are passed to MemCache (part of the memcache-client gem).
  146 +</p>
  147 + <p><a class="source-toggle" href="#"
  148 + onclick="toggleCode('M000001-source');return false;">[Source]</a></p>
  149 + <div class="method-source-code" id="M000001-source">
  150 +<pre>
  151 +<span class="ruby-comment cmt"># File lib/spider/included_in_memcached.rb, line 39</span>
  152 + <span class="ruby-keyword kw">def</span> <span class="ruby-identifier">initialize</span>(<span class="ruby-operator">*</span><span class="ruby-identifier">a</span>)
  153 + <span class="ruby-ivar">@c</span> = <span class="ruby-constant">MemCache</span>.<span class="ruby-identifier">new</span>(<span class="ruby-operator">*</span><span class="ruby-identifier">a</span>)
  154 + <span class="ruby-keyword kw">end</span>
  155 +</pre>
  156 + </div>
  157 + </div>
  158 + </div>
  159 +
  160 + <h3 class="section-bar">Public Instance methods</h3>
  161 +
  162 + <div id="method-M000002" class="method-detail">
  163 + <a name="M000002"></a>
  164 +
  165 + <div class="method-heading">
  166 + <a href="#M000002" class="method-signature">
  167 + <span class="method-name">&lt;&lt;</span><span class="method-args">(v)</span>
  168 + </a>
  169 + </div>
  170 +
  171 + <div class="method-description">
  172 + <p>
  173 +Add an item to the memcache.
  174 +</p>
  175 + <p><a class="source-toggle" href="#"
  176 + onclick="toggleCode('M000002-source');return false;">[Source]</a></p>
  177 + <div class="method-source-code" id="M000002-source">
  178 +<pre>
  179 +<span class="ruby-comment cmt"># File lib/spider/included_in_memcached.rb, line 44</span>
  180 + <span class="ruby-keyword kw">def</span> <span class="ruby-operator">&lt;&lt;</span>(<span class="ruby-identifier">v</span>)
  181 + <span class="ruby-ivar">@c</span>.<span class="ruby-identifier">add</span>(<span class="ruby-identifier">v</span>.<span class="ruby-identifier">to_s</span>, <span class="ruby-identifier">v</span>)
  182 + <span class="ruby-keyword kw">end</span>
  183 +</pre>
  184 + </div>
  185 + </div>
  186 + </div>
  187 +
  188 + <div id="method-M000003" class="method-detail">
  189 + <a name="M000003"></a>
  190 +
  191 + <div class="method-heading">
  192 + <a href="#M000003" class="method-signature">
  193 + <span class="method-name">include?</span><span class="method-args">(v)</span>
  194 + </a>
  195 + </div>
  196 +
  197 + <div class="method-description">
  198 + <p>
  199 +True if the item is in the memcache.
  200 +</p>
  201 + <p><a class="source-toggle" href="#"
  202 + onclick="toggleCode('M000003-source');return false;">[Source]</a></p>
  203 + <div class="method-source-code" id="M000003-source">
  204 +<pre>
  205 +<span class="ruby-comment cmt"># File lib/spider/included_in_memcached.rb, line 49</span>
  206 + <span class="ruby-keyword kw">def</span> <span class="ruby-identifier">include?</span>(<span class="ruby-identifier">v</span>)
  207 + <span class="ruby-ivar">@c</span>.<span class="ruby-identifier">get</span>(<span class="ruby-identifier">v</span>.<span class="ruby-identifier">to_s</span>) <span class="ruby-operator">==</span> <span class="ruby-identifier">v</span>
  208 + <span class="ruby-keyword kw">end</span>
  209 +</pre>
  210 + </div>
  211 + </div>
  212 + </div>
  213 +
  214 +
  215 + </div>
  216 +
  217 +
  218 + </div>
  219 +
  220 +
  221 +<div id="validator-badges">
  222 + <p><small><a href="http://validator.w3.org/check/referer">[Validate]</a></small></p>
  223 +</div>
  224 +
  225 +</body>
  226 +</html>
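The documented `IncludedInMemcached` class boils down to one `add` and one `get` per URL. The sketch below mirrors that shape against an in-process stub so it can run without a memcached server. `FakeMemCache` and `SeenInCache` are hypothetical names, and the stub only imitates the two calls shown in the documented source (store if absent, fetch by key); it is not the memcache-client API.

```ruby
# FakeMemCache is a hypothetical in-process stub. It imitates only the two
# calls used by the documented class: add (store if absent) and get.
class FakeMemCache
  def initialize(*_servers)
    @store = {}
  end

  # Store the value only when the key is not already present.
  def add(key, value)
    @store[key] = value unless @store.key?(key)
  end

  def get(key)
    @store[key]
  end
end

# Same shape as the documented class, with the client injected for testing.
class SeenInCache
  def initialize(client)
    @c = client
  end

  # Key the item by its string form and store the item itself as the value.
  def <<(v)
    @c.add(v.to_s, v)
    self
  end

  # Seen only if the stored value equals the item that is being asked about.
  def include?(v)
    @c.get(v.to_s) == v
  end
end

seen = SeenInCache.new(FakeMemCache.new('localhost:11211'))
seen << 'http://example.com/'
puts seen.include?('http://example.com/')
puts seen.include?('http://example.com/other')
```

Because the key is `v.to_s` while the comparison is against the stored value, two different objects with the same string form collide in this stub: after `seen << :a`, the string `'a'` is reported as unseen and can no longer be added. Whether that matters depends on what the spider passes in, which is assumed here to be plain strings.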
182 doc/classes/Spider.html
... ... @@ -0,0 +1,182 @@
  1 +<?xml version="1.0" encoding="iso-8859-1"?>
  2 +<!DOCTYPE html
  3 + PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  4 + "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
  5 +
  6 +<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  7 +<head>
  8 + <title>Class: Spider</title>
  9 + <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
  10 + <meta http-equiv="Content-Script-Type" content="text/javascript" />
  11 + <link rel="stylesheet" href=".././rdoc-style.css" type="text/css" media="screen" />
  12 + <script type="text/javascript">
  13 + // <![CDATA[
  14 +
  15 + function popupCode( url ) {
  16 + window.open(url, "Code", "resizable=yes,scrollbars=yes,toolbar=no,status=no,height=150,width=400")
  17 + }
  18 +
  19 + function toggleCode( id ) {
  20 + if ( document.getElementById )
  21 + elem = document.getElementById( id );
  22 + else if ( document.all )
  23 + elem = eval( "document.all." + id );
  24 + else
  25 + return false;
  26 +
  27 + elemStyle = elem.style;
  28 +
  29 + if ( elemStyle.display != "block" ) {
  30 + elemStyle.display = "block"
  31 + } else {
  32 + elemStyle.display = "none"
  33 + }
  34 +
  35 + return true;
  36 + }
  37 +
  38 + // Make codeblocks hidden by default
  39 + document.writeln( "<style type=\"text/css\">div.method-source-code { display: none }</style>" )
  40 +
  41 + // ]]>
  42 + </script>
  43 +
  44 +</head>
  45 +<body>
  46 +
  47 +
  48 +
  49 + <div id="classHeader">
  50 + <table class="header-table">
  51 + <tr class="top-aligned-row">
  52 + <td><strong>Class</strong></td>
  53 + <td class="class-name-in-header">Spider</td>
  54 + </tr>
  55 + <tr class="top-aligned-row">
  56 + <td><strong>In:</strong></td>
  57 + <td>
  58 + <a href="../files/lib/spider_rb.html">
  59 + lib/spider.rb
  60 + </a>
  61 + <br />
  62 + </td>
  63 + </tr>
  64 +
  65 + <tr class="top-aligned-row">
  66 + <td><strong>Parent:</strong></td>
  67 + <td>
  68 + Object
  69 + </td>
  70 + </tr>
  71 + </table>
  72 + </div>
  73 + <!-- banner header -->
  74 +
  75 + <div id="bodyContent">
  76 +
  77 +
  78 +
  79 + <div id="contextContent">
  80 +
  81 + <div id="description">
  82 + <p>
  83 +A spidering library for Ruby. Handles robots.txt, scraping, finding more
  84 +links, and doing it all over again.
  85 +</p>
  86 +
  87 + </div>
  88 +
  89 +
  90 + </div>
  91 +
  92 + <div id="method-list">
  93 + <h3 class="section-bar">Methods</h3>
  94 +
  95 + <div class="name-list">
  96 + <a href="#M000011">start_at</a>&nbsp;&nbsp;
  97 + </div>
  98 + </div>
  99 +
  100 + </div>
  101 +
  102 +
  103 + <!-- if includes -->
  104 +
  105 + <div id="section">
  106 +
  107 +
  108 +
  109 +
  110 +
  111 +
  112 +
  113 +
  114 + <!-- if method_list -->
  115 + <div id="methods">
  116 + <h3 class="section-bar">Public Class methods</h3>
  117 +
  118 + <div id="method-M000011" class="method-detail">
  119 + <a name="M000011"></a>
  120 +
  121 + <div class="method-heading">
  122 + <a href="#M000011" class="method-signature">
  123 + <span class="method-name">start_at</span><span class="method-args">(a_url, &amp;block)</span>
  124 + </a>
  125 + </div>
  126 +
  127 + <div class="method-description">
  128 + <p>
  129 +Runs the spider starting at the given URL. Also takes a block that is given
  130 +the <a href="SpiderInstance.html">SpiderInstance</a>. Use the block to
  131 +define the rules and handlers for the discovered Web pages. See <a
  132 +href="SpiderInstance.html">SpiderInstance</a> for the possible rules and
  133 +handlers.
  134 +</p>
  135 +<pre>
  136 + Spider.start_at('http://mike-burns.com/') do |s|
  137 + s.add_url_check do |a_url|
  138 + a_url =~ %r{^http://mike-burns.com.*}
  139 + end
  140 +
  141 + s.on 404 do |a_url, resp, prior_url|
  142 + puts &quot;URL not found: #{a_url}&quot;
  143 + end
  144 +
  145 + s.on :success do |a_url, resp, prior_url|
  146 + puts &quot;body: #{resp.body}&quot;
  147 + end
  148 +
  149 + s.on :every do |a_url, resp, prior_url|
  150 + puts &quot;URL returned anything: #{a_url} with this code #{resp.code}&quot;
  151 + end
  152 + end
  153 +</pre>
  154 + <p><a class="source-toggle" href="#"
  155 + onclick="toggleCode('M000011-source');return false;">[Source]</a></p>
  156 + <div class="method-source-code" id="M000011-source">
  157 +<pre>
  158 +<span class="ruby-comment cmt"># File lib/spider.rb, line 54</span>
  159 + <span class="ruby-keyword kw">def</span> <span class="ruby-keyword kw">self</span>.<span class="ruby-identifier">start_at</span>(<span class="ruby-identifier">a_url</span>, <span class="ruby-operator">&amp;</span><span class="ruby-identifier">block</span>)
  160 + <span class="ruby-identifier">rules</span> = <span class="ruby-constant">RobotRules</span>.<span class="ruby-identifier">new</span>(<span class="ruby-value str">'Ruby Spider 1.0'</span>)
  161 + <span class="ruby-identifier">a_spider</span> = <span class="ruby-constant">SpiderInstance</span>.<span class="ruby-identifier">new</span>({<span class="ruby-keyword kw">nil</span> =<span class="ruby-operator">&gt;</span> <span class="ruby-identifier">a_url</span>}, [], <span class="ruby-identifier">rules</span>, [])
  162 + <span class="ruby-identifier">block</span>.<span class="ruby-identifier">call</span>(<span class="ruby-identifier">a_spider</span>)
  163 + <span class="ruby-identifier">a_spider</span>.<span class="ruby-identifier">start!</span>
  164 + <span class="ruby-keyword kw">end</span>
  165 +</pre>
  166 + </div>
  167 + </div>
  168 + </div>
  169 +
  170 +
  171 + </div>
  172 +
  173 +
  174 + </div>
  175 +
  176 +
  177 +<div id="validator-badges">
  178 + <p><small><a href="http://validator.w3.org/check/referer">[Validate]</a></small></p>
  179 +</div>
  180 +
  181 +</body>
  182 +</html>
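The `start_at` source shown in the Spider class documentation follows a configure-then-run pattern: build an instance, pass it to the caller's block so rules can be registered, then start. A minimal, network-free sketch of that pattern follows. `MiniSpider` is a hypothetical class for illustration, and its `start!` only flips a flag instead of crawling.

```ruby
# MiniSpider is a hypothetical, network-free stand-in that shows only the
# configure-then-run shape of Spider.start_at.
class MiniSpider
  attr_reader :start_url, :url_checks, :started

  def self.start_at(a_url)
    spider = new(a_url)
    yield spider if block_given? # the block only registers rules
    spider.start!                # work begins after configuration
    spider
  end

  def initialize(a_url)
    @start_url = a_url
    @url_checks = []
    @started = false
  end

  def add_url_check(&block)
    @url_checks << block
  end

  # A real spider would fetch pages here; the sketch just records the call.
  def start!
    @started = true
  end
end

spider = MiniSpider.start_at('http://example.com/') do |s|
  s.add_url_check { |a_url| a_url.start_with?('http://example.com/') }
end
puts spider.url_checks.size
puts spider.started
```

One deliberate difference from the documented source: the sketch uses `yield ... if block_given?`, so calling it without a block is harmless, whereas `block.call` on a missing block would raise `NoMethodError`.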
381 doc/classes/SpiderInstance.html
... ... @@ -0,0 +1,381 @@
  1 +<?xml version="1.0" encoding="iso-8859-1"?>
  2 +<!DOCTYPE html
  3 + PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  4 + "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
  5 +
  6 +<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  7 +<head>
  8 + <title>Class: SpiderInstance</title>
  9 + <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
  10 + <meta http-equiv="Content-Script-Type" content="text/javascript" />
  11 + <link rel="stylesheet" href=".././rdoc-style.css" type="text/css" media="screen" />
  12 + <script type="text/javascript">
  13 + // <![CDATA[
  14 +
  15 + function popupCode( url ) {
  16 + window.open(url, "Code", "resizable=yes,scrollbars=yes,toolbar=no,status=no,height=150,width=400")
  17 + }
  18 +
  19 + function toggleCode( id ) {
  20 + if ( document.getElementById )
  21 + elem = document.getElementById( id );
  22 + else if ( document.all )
  23 + elem = eval( "document.all." + id );
  24 + else
  25 + return false;
  26 +
  27 + elemStyle = elem.style;
  28 +
  29 + if ( elemStyle.display != "block" ) {
  30 + elemStyle.display = "block"
  31 + } else {
  32 + elemStyle.display = "none"
  33 + }
  34 +
  35 + return true;
  36 + }
  37 +
  38 + // Make codeblocks hidden by default
  39 + document.writeln( "<style type=\"text/css\">div.method-source-code { display: none }</style>" )
  40 +
  41 + // ]]>
  42 + </script>
  43 +
  44 +</head>
  45 +<body>
  46 +
  47 +
  48 +
  49 + <div id="classHeader">
  50 + <table class="header-table">
  51 + <tr class="top-aligned-row">
  52 + <td><strong>Class</strong></td>
  53 + <td class="class-name-in-header">SpiderInstance</td>
  54 + </tr>
  55 + <tr class="top-aligned-row">
  56 + <td><strong>In:</strong></td>
  57 + <td>
  58 + <a href="../files/lib/spider/spider_instance_rb.html">
  59 + lib/spider/spider_instance.rb
  60 + </a>
  61 + <br />
  62 + </td>
  63 + </tr>
  64 +
  65 + <tr class="top-aligned-row">
  66 + <td><strong>Parent:</strong></td>
  67 + <td>
  68 + Object
  69 + </td>
  70 + </tr>
  71 + </table>
  72 + </div>
  73 + <!-- banner header -->
  74 +
  75 + <div id="bodyContent">
  76 +
  77 +
  78 +
  79 + <div id="contextContent">
  80 +
  81 +
  82 +
  83 + </div>
  84 +
  85 + <div id="method-list">
  86 + <h3 class="section-bar">Methods</h3>
  87 +
  88 + <div class="name-list">
  89 + <a href="#M000004">add_url_check</a>&nbsp;&nbsp;
  90 + <a href="#M000005">check_already_seen_with</a>&nbsp;&nbsp;
  91 + <a href="#M000010">clear_headers</a>&nbsp;&nbsp;
  92 + <a href="#M000009">headers</a>&nbsp;&nbsp;
  93 + <a href="#M000006">on</a>&nbsp;&nbsp;
  94 + <a href="#M000007">setup</a>&nbsp;&nbsp;
  95 + <a href="#M000008">teardown</a>&nbsp;&nbsp;
  96 + </div>
  97 + </div>
  98 +
  99 + </div>
  100 +
  101 +
  102 + <!-- if includes -->
  103 +
  104 + <div id="section">
  105 +
  106 +
  107 +
  108 +
  109 +
  110 +
  111 +
  112 +
  113 + <!-- if method_list -->
  114 + <div id="methods">
  115 + <h3 class="section-bar">Public Instance methods</h3>
  116 +
  117 + <div id="method-M000004" class="method-detail">
  118 + <a name="M000004"></a>
  119 +
  120 + <div class="method-heading">
  121 + <a href="#M000004" class="method-signature">
  122 + <span class="method-name">add_url_check</span><span class="method-args">(&amp;block)</span>
  123 + </a>
  124 + </div>
  125 +
  126 + <div class="method-description">
  127 + <p>
  128 +Add a predicate that determines whether to continue down this URL&#8217;s
  129 +path. All predicates must be true in order for a URL to proceed.
  130 +</p>
  131 +<p>
  132 +Takes a block that takes a string and produces a boolean. For example, this
  133 +will ensure that the URL starts with &#8216;<a
  134 +href="http://mike-burns.com">mike-burns.com</a>&#8217;:
  135 +</p>
  136 +<pre>
  137 + add_url_check { |a_url| a_url =~ %r{^http://mike-burns.com.*} }
  138 +</pre>
  139 + <p><a class="source-toggle" href="#"
  140 + onclick="toggleCode('M000004-source');return false;">[Source]</a></p>
  141 + <div class="method-source-code" id="M000004-source">
  142 +<pre>
  143 +<span class="ruby-comment cmt"># File lib/spider/spider_instance.rb, line 70</span>
  144 + <span class="ruby-keyword kw">def</span> <span class="ruby-identifier">add_url_check</span>(<span class="ruby-operator">&amp;</span><span class="ruby-identifier">block</span>)
  145 + <span class="ruby-ivar">@url_checks</span> <span class="ruby-operator">&lt;&lt;</span> <span class="ruby-identifier">block</span>
  146 + <span class="ruby-keyword kw">end</span>
  147 +</pre>
  148 + </div>
  149 + </div>
  150 + </div>
  151 +
  152 + <div id="method-M000005" class="method-detail">
  153 + <a name="M000005"></a>
  154 +
  155 + <div class="method-heading">
  156 + <a href="#M000005" class="method-signature">
  157 + <span class="method-name">check_already_seen_with</span><span class="method-args">(cacher)</span>
  158 + </a>
  159 + </div>
  160 +
  161 + <div class="method-description">
  162 + <p>
  163 +The Web is a graph; to avoid cycles we store the nodes (URLs) already
  164 +visited. The Web is a really, really, really big graph; as such, this list
  165 +of visited nodes grows really, really, really big.
  166 +</p>
  167 +<p>
  168 +Use this method to change the object that stores the seen nodes. The
  169 +default object is an instance of Array. <a
  170 +href="Spider.html">Spider</a> also ships with a wrapper around memcached.
  171 +</p>
  172 +<p>
  173 +You can implement a custom class for this; any object passed to <a
  174 +href="SpiderInstance.html#M000005">check_already_seen_with</a> must
  175 +understand just &lt;&lt; and include? .
  176 +</p>
  177 +<pre>
  178 + # default
  179 + check_already_seen_with Array.new
  180 +
  181 + # memcached
  182 + require 'spider/included_in_memcached'
  183 + check_already_seen_with IncludedInMemcached.new('localhost:11211')
  184 +</pre>
  185 + <p><a class="source-toggle" href="#"
  186 + onclick="toggleCode('M000005-source');return false;">[Source]</a></p>
  187 + <div class="method-source-code" id="M000005-source">
  188 +<pre>
  189 +<span class="ruby-comment cmt"># File lib/spider/spider_instance.rb, line 91</span>
  190 + <span class="ruby-keyword kw">def</span> <span class="ruby-identifier">check_already_seen_with</span>(<span class="ruby-identifier">cacher</span>)
  191 + <span class="ruby-keyword kw">if</span> <span class="ruby-identifier">cacher</span>.<span class="ruby-identifier">respond_to?</span>(<span class="ruby-identifier">:&lt;&lt;</span>) <span class="ruby-operator">&amp;&amp;</span> <span class="ruby-identifier">cacher</span>.<span class="ruby-identifier">respond_to?</span>(<span class="ruby-identifier">:include?</span>)
  192 + <span class="ruby-ivar">@seen</span> = <span class="ruby-identifier">cacher</span>
  193 + <span class="ruby-keyword kw">else</span>
  194 + <span class="ruby-identifier">raise</span> <span class="ruby-constant">ArgumentError</span>, <span class="ruby-value str">'expected something that responds to &lt;&lt; and include?'</span>
  195 + <span class="ruby-keyword kw">end</span>
  196 + <span class="ruby-keyword kw">end</span>
  197 +</pre>
  198 + </div>
  199 + </div>
  200 + </div>
  201 +
  202 + <div id="method-M000010" class="method-detail">
  203 + <a name="M000010"></a>
  204 +
  205 + <div class="method-heading">
  206 + <a href="#M000010" class="method-signature">
  207 + <span class="method-name">clear_headers</span><span class="method-args">()</span>
  208 + </a>
  209 + </div>
  210 +
  211 + <div class="method-description">
  212 + <p>
  213 +Reset the <a href="SpiderInstance.html#M000009">headers</a> hash.
  214 +</p>
  215 + <p><a class="source-toggle" href="#"
  216 + onclick="toggleCode('M000010-source');return false;">[Source]</a></p>
  217 + <div class="method-source-code" id="M000010-source">
  218 +<pre>
  219 +<span class="ruby-comment cmt"># File lib/spider/spider_instance.rb, line 158</span>
  220 + <span class="ruby-keyword kw">def</span> <span class="ruby-identifier">clear_headers</span>
  221 + <span class="ruby-ivar">@headers</span> = {}
  222 + <span class="ruby-keyword kw">end</span>
  223 +</pre>
  224 + </div>
  225 + </div>
  226 + </div>
  227 +
  228 + <div id="method-M000009" class="method-detail">
  229 + <a name="M000009"></a>
  230 +
  231 + <div class="method-heading">
  232 + <a href="#M000009" class="method-signature">
  233 + <span class="method-name">headers</span><span class="method-args">()</span>
  234 + </a>
  235 + </div>
  236 +
  237 + <div class="method-description">
  238 + <p>
  239 +Use like a hash:
  240 +</p>
  241 +<pre>
  242 + headers['Cookie'] = 'user_id=1;password=btrross3'
  243 +</pre>
  244 + <p><a class="source-toggle" href="#"
  245 + onclick="toggleCode('M000009-source');return false;">[Source]</a></p>
  246 + <div class="method-source-code" id="M000009-source">
  247 +<pre>
  248 +<span class="ruby-comment cmt"># File lib/spider/spider_instance.rb, line 146</span>
  249 + <span class="ruby-keyword kw">def</span> <span class="ruby-identifier">headers</span>
  250 + <span class="ruby-constant">HeaderSetter</span>.<span class="ruby-identifier">new</span>(<span class="ruby-keyword kw">self</span>)
  251 + <span class="ruby-keyword kw">end</span>
  252 +</pre>
  253 + </div>
  254 + </div>
  255 + </div>
  256 +
  257 + <div id="method-M000006" class="method-detail">
  258 + <a name="M000006"></a>
  259 +
  260 + <div class="method-heading">
  261 + <a href="#M000006" class="method-signature">
  262 + <span class="method-name">on</span><span class="method-args">(code, p = nil, &amp;block)</span>
  263 + </a>
  264 + </div>
  265 +
  266 + <div class="method-description">
  267 + <p>
  268 +Add a response handler. A response handler&#8217;s trigger can be :every,
  269 +:success, :failure, or any HTTP status code. The handler itself can be
  270 +either a Proc or a block.
  271 +</p>
  272 +<p>
  273 +The arguments to the block are: the URL as a string, an instance of
  274 +Net::HTTPResponse, and the prior URL as a string.
  275 +</p>
  276 +<p>
  277 +For example:
  278 +</p>
  279 +<pre>
  280 + on 404 do |a_url, resp, prior_url|
  281 + puts &quot;URL not found: #{a_url}&quot;
  282 + end
  283 +
  284 + on :success do |a_url, resp, prior_url|
  285 + puts a_url
  286 + puts resp.body
  287 + end
  288 +
  289 + on :every do |a_url, resp, prior_url|
  290 + puts &quot;Given this code: #{resp.code}&quot;
  291 + end
  292 +</pre>
  293 + <p><a class="source-toggle" href="#"
  294 + onclick="toggleCode('M000006-source');return false;">[Source]</a></p>
  295 + <div class="method-source-code" id="M000006-source">
  296 +<pre>
  297 +<span class="ruby-comment cmt"># File lib/spider/spider_instance.rb, line 121</span>
  298 + <span class="ruby-keyword kw">def</span> <span class="ruby-identifier">on</span>(<span class="ruby-identifier">code</span>, <span class="ruby-identifier">p</span> = <span class="ruby-keyword kw">nil</span>, <span class="ruby-operator">&amp;</span><span class="ruby-identifier">block</span>)
  299 + <span class="ruby-identifier">f</span> = <span class="ruby-identifier">p</span> <span class="ruby-value">? </span><span class="ruby-identifier">p</span> <span class="ruby-operator">:</span> <span class="ruby-identifier">block</span>
  300 + <span class="ruby-keyword kw">case</span> <span class="ruby-identifier">code</span>
  301 + <span class="ruby-keyword kw">when</span> <span class="ruby-constant">Fixnum</span>
  302 + <span class="ruby-ivar">@callbacks</span>[<span class="ruby-identifier">code</span>] = <span class="ruby-identifier">f</span>
  303 + <span class="ruby-keyword kw">else</span>
  304 + <span class="ruby-ivar">@callbacks</span>[<span class="ruby-identifier">code</span>.<span class="ruby-identifier">to_sym</span>] = <span class="ruby-identifier">f</span>
  305 + <span class="ruby-keyword kw">end</span>
  306 + <span class="ruby-keyword kw">end</span>
  307 +</pre>
  308 + </div>
  309 + </div>
  310 + </div>
  311 +
  312 + <div id="method-M000007" class="method-detail">
  313 + <a name="M000007"></a>
  314 +
  315 + <div class="method-heading">
  316 + <a href="#M000007" class="method-signature">
  317 + <span class="method-name">setup</span><span class="method-args">(p = nil, &amp;block)</span>
  318 + </a>
  319 + </div>
  320 +
  321 + <div class="method-description">
  322 + <p>
  323 +Run before the HTTP request. Given the URL as a string.
  324 +</p>
  325 +<pre>
  326 + setup do |a_url|
  327 + headers['Cookie'] = 'user_id=1;admin=true'
  328 + end
  329 +</pre>
  330 + <p><a class="source-toggle" href="#"
  331 + onclick="toggleCode('M000007-source');return false;">[Source]</a></p>
  332 + <div class="method-source-code" id="M000007-source">
  333 +<pre>
  334 +<span class="ruby-comment cmt"># File lib/spider/spider_instance.rb, line 135</span>
  335 + <span class="ruby-keyword kw">def</span> <span class="ruby-identifier">setup</span>(<span class="ruby-identifier">p</span> = <span class="ruby-keyword kw">nil</span>, <span class="ruby-operator">&amp;</span><span class="ruby-identifier">block</span>)
  336 + <span class="ruby-ivar">@setup</span> = <span class="ruby-identifier">p</span> <span class="ruby-value">? </span><span class="ruby-identifier">p</span> <span class="ruby-operator">:</span> <span class="ruby-identifier">block</span>
  337 + <span class="ruby-keyword kw">end</span>
  338 +</pre>
  339 + </div>
  340 + </div>
  341 + </div>
  342 +
  343 + <div id="method-M000008" class="method-detail">
  344 + <a name="M000008"></a>
  345 +
  346 + <div class="method-heading">
  347 + <a href="#M000008" class="method-signature">
  348 + <span class="method-name">teardown</span><span class="method-args">(p = nil, &amp;block)</span>
  349 + </a>
  350 + </div>
  351 +
  352 + <div class="method-description">
  353 + <p>
  354 +Runs last, once for each page. The handler receives the URL as a string.
  355 +</p>
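<p>
The source below stores either an explicit callable or a block, preferring the callable. The following is a minimal, self-contained sketch of that pattern. HandlerHolder, teardown_handler, and the example.com URLs are hypothetical names used only for illustration; they are not part of the library.
</p>

```ruby
# Hypothetical stand-in that mirrors the `p ? p : block` storage shown in
# the source listing; this is not the library's SpiderInstance class.
class HandlerHolder
  attr_reader :teardown_handler

  # Accept either an explicit callable or a block, preferring the callable.
  def teardown(p = nil, &block)
    @teardown_handler = p ? p : block
  end
end

visited = []
holder = HandlerHolder.new

# Block form: record each URL once its page has been handled.
holder.teardown { |a_url| visited << a_url }
holder.teardown_handler.call('http://example.com/')

# Proc form: an explicit argument replaces the previously stored block.
log_url = proc { |a_url| visited << "logged #{a_url}" }
holder.teardown(log_url)
holder.teardown_handler.call('http://example.com/about')
```

<p>
Only one handler is kept at a time in this sketch, so a second call replaces the first rather than adding to it.
</p>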
  356 + <p><a class="source-toggle" href="#"
  357 + onclick="toggleCode('M000008-source');return false;">[Source]</a></p>
  358 + <div class="method-source-code" id="M000008-source">
  359 +<pre>
  360 +<span class="ruby-comment cmt"># File lib/spider/spider_instance.rb, line 140</span>
  361 + <span class="ruby-keyword kw">def</span> <span class="ruby-identifier">teardown</span>(<span class="ruby-identifier">p</span> = <span class="ruby-keyword kw">nil</span>, <span class="ruby-operator">&amp;</span><span class="ruby-identifier">block</span>)
  362 + <span class="ruby-ivar">@teardown</span> = <span class="ruby-identifier">p</span> <span class="ruby-value">? </span><span class="ruby-identifier">p</span> <span class="ruby-operator">:</span> <span class="ruby-identifier">block</span>
  363 + <span class="ruby-keyword kw">end</span>
  364 +</pre>
  365 + </div>
  366 + </div>
  367 + </div>
  368 +
  369 +
  370 + </div>
  371 +
  372 +
  373 + </div>
  374 +
  375 +
  376 +<div id="validator-badges">
  377 + <p><small><a href="http://validator.w3.org/check/referer">[Validate]</a></small></p>
  378 +</div>
  379 +
  380 +</body>
  381 +</html>
1  doc/created.rid
... ... @@ -0,0 +1 @@
  1 +Sat, 10 Nov 2007 00:25:19 -0500
223 doc/files/README.html
... ... @@ -0,0 +1,223 @@
  1 +<?xml version="1.0" encoding="iso-8859-1"?>
  2 +<!DOCTYPE html
  3 + PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  4 + "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
  5 +
  6 +<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  7 +<head>
  8 + <title>File: README</title>
  9 + <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
  10 + <meta http-equiv="Content-Script-Type" content="text/javascript" />
  11 + <link rel="stylesheet" href=".././rdoc-style.css" type="text/css" media="screen" />
  12 + <script type="text/javascript">
  13 + // <![CDATA[
  14 +
  15 + function popupCode( url ) {
  16 + window.open(url, "Code", "resizable=yes,scrollbars=yes,toolbar=no,status=no,height=150,width=400")
  17 + }
  18 +
  19 + function toggleCode( id ) {
  20 + if ( document.getElementById )
  21 + elem = document.getElementById( id );
  22 + else if ( document.all )
  23 + elem = eval( "document.all." + id );
  24 + else
  25 + return false;
  26 +
  27 + elemStyle = elem.style;
  28 +
  29 + if ( elemStyle.display != "block" ) {
  30 + elemStyle.display = "block"
  31 + } else {
  32 + elemStyle.display = "none"
  33 + }
  34 +
  35 + return true;
  36 + }
  37 +
  38 + // Make codeblocks hidden by default
  39 + document.writeln( "<style type=\"text/css\">div.method-source-code { display: none }</style>" )
  40 +
  41 + // ]]>
  42 + </script>
  43 +
  44 +</head>
  45 +<body>
  46 +
  47 +
  48 +
  49 + <div id="fileHeader">
  50 + <h1>README</h1>
  51 + <table class="header-table">
  52 + <tr class="top-aligned-row">
  53 + <td><strong>Path:</strong></td>
  54 + <td>README
  55 + </td>
  56 + </tr>
  57 + <tr class="top-aligned-row">
  58 + <td><strong>Last Update:</strong></td>
  59 + <td>Thu Nov 08 17:51:17 -0500 2007</td>
  60 + </tr>
  61 + </table>
  62 + </div>
  63 + <!-- banner header -->
  64 +
  65 + <div id="bodyContent">
  66 +
  67 +
  68 +
  69 + <div id="contextContent">
  70 +
  71 + <div id="description">
  72 + <p>
  73 +<a href="../classes/Spider.html">Spider</a>, a Web spidering library for
  74 +Ruby. It handles the robots.txt, scraping, collecting, and looping so that
  75 +you can just handle the data.
  76 +</p>
  77 +<h2>Examples</h2>
  78 +<h3>Crawl the Web, loading each page in turn, until you run out of memory</h3>
  79 +<pre>
  80 + require 'spider'
  81 + Spider.start_at('http://mike-burns.com/') {}
  82 +</pre>
  83 +<h3>To handle erroneous responses</h3>
  84 +<pre>
  85 + require 'spider'
  86 + Spider.start_at('http://mike-burns.com/') do |s|
  87 + s.on :failure do |a_url, resp, prior_url|
  88 + puts &quot;URL failed: #{a_url}&quot;
  89 + puts &quot; linked from #{prior_url}&quot;
  90 + end
  91 + end
  92 +</pre>
  93 +<h3>Or handle successful responses</h3>
  94 +<pre>
  95 + require 'spider'
  96 + Spider.start_at('http://mike-burns.com/') do |s|
  97 + s.on :success do |a_url, resp, prior_url|
  98 + puts &quot;#{a_url}: #{resp.code}&quot;
  99 + puts resp.body
  100 + puts
  101 + end
  102 + end
  103 +</pre>
  104 +<h3>Limit to just one domain</h3>
  105 +<pre>
  106 + require 'spider'
  107 + Spider.start_at('http://mike-burns.com/') do |s|
  108 + s.add_url_check do |a_url|
  109 + a_url =~ %r{^http://mike-burns\.com.*}
  110 + end
  111 + end
  112 +</pre>
  113 +<h3>Pass headers to some requests</h3>
  114 +<pre>
  115 + require 'spider'
  116 + Spider.start_at('http://mike-burns.com/') do |s|
  117 + s.setup do |a_url|
  118 + if a_url =~ %r{^http://.*wikipedia.*}
  119 + headers['User-Agent'] = &quot;Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&quot;
  120 + end
  121 + end
  122 + end
  123 +</pre>
  124 +<h3>Use memcached to track cycles</h3>
  125 +<pre>
  126 + require 'spider'
  127 + require 'spider/included_in_memcached'
  128 + SERVERS = ['10.0.10.2:11211','10.0.10.3:11211','10.0.10.4:11211']
  129 + Spider.start_at('http://mike-burns.com/') do |s|
  130 + s.check_already_seen_with IncludedInMemcached.new(SERVERS)
  131 + end
  132 +</pre>
  133 +<h3>Track cycles with a custom object</h3>
  134 +<pre>
  135 + require 'spider'
  136 +
  137 + class ExpireLinks &lt; Hash
  138 + def &lt;&lt;(v)
  139 + self[v] = Time.now
  140 + end
  141 + def include?(v)
  142 + self[v] &amp;&amp; (self[v] + 86400) &gt;= Time.now
  143 + end
  144 + end
  145 +
  146 + Spider.start_at('http://mike-burns.com/') do |s|
  147 + s.check_already_seen_with ExpireLinks.new
  148 + end
  149 +</pre>
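<p>
The custom-object example above suggests that the tracker only needs to respond to &lt;&lt; and include?. Assuming that is the whole interface, the expiry logic can be exercised on its own, without crawling anything. ExpiringSeenSet and its ttl parameter are illustrative names, not part of the library.
</p>

```ruby
# Sketch of a "seen" tracker that forgets URLs after a time-to-live.
# ExpiringSeenSet and ttl are illustrative names, not library API.
class ExpiringSeenSet
  def initialize(ttl = 86_400)
    @ttl = ttl        # seconds a URL counts as already seen
    @seen_at = {}     # URL => Time it was last recorded
  end

  # Record the time at which the URL was seen.
  def <<(url)
    @seen_at[url] = Time.now
    self
  end

  # True only if the URL was recorded less than ttl seconds ago.
  def include?(url)
    seen = @seen_at[url]
    !seen.nil? && (Time.now - seen) < @ttl
  end
end

seen = ExpiringSeenSet.new(60)
seen << 'http://example.com/'
fresh  = seen.include?('http://example.com/')
unseen = seen.include?('http://example.com/other')

# A zero TTL means every entry has already expired.
expired_set = ExpiringSeenSet.new(0)
expired_set << 'http://example.com/'
expired = expired_set.include?('http://example.com/')
```

<p>
Keeping the timestamps in an instance variable rather than subclassing Hash avoids exposing the rest of the Hash interface, at the cost of a few extra lines.
</p>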
  150 +<h3>Create a URL graph</h3>
  151 +<pre>
  152 + require 'spider'
  153 + nodes = {}
  154 + Spider.start_at('http://mike-burns.com/') do |s|
  155 + s.add_url_check {|a_url| a_url =~ %r{^http://mike-burns\.com.*} }
  156 +
  157 + s.on(:every) do |a_url, resp, prior_url|
  158 + nodes[prior_url] ||= []
  159 + nodes[prior_url] &lt;&lt; a_url
  160 + end
  161 + end
  162 +</pre>
  163 +<h3>Use a proxy</h3>
  164 +<pre>
  165 + require 'net/http_configuration'
  166 + require 'spider'
  167 + http_conf = Net::HTTP::Configuration.new(:proxy_host =&gt; '7proxies.org',
  168 + :proxy_port =&gt; 8881)
  169 + http_conf.apply do
  170 + Spider.start_at('http://img.4chan.org/b/') do |s|
  171 + s.on(:success) do |a_url, resp, prior_url|
  172 + File.open(a_url.gsub('/',':'),'w') do |f|
  173 + f.write(resp.body)
  174 + end
  175 + end
  176 + end
  177 + end
  178 +</pre>
  179 +<h2>Author</h2>
  180 +<p>
  181 +Mike Burns <a href="http://mike-burns.com">mike-burns.com</a>
  182 +mike@mike-burns.com
  183 +</p>
  184 +<p>
  185 +Help from Matt Horan, John Nagro, and Henri Cook.
  186 +</p>
  187 +<p>
  188 +With `robot_rules&#8217; from James Edward Gray II via <a
  189 +href="http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589">blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589</a>
  190 +</p>
  191 +
  192 + </div>
  193 +
  194 +
  195 + </div>
  196 +
  197 +
  198 + </div>
  199 +
  200 +
  201 + <!-- if includes -->
  202 +
  203 + <div id="section">
  204 +
  205 +
  206 +
  207 +
  208 +
  209 +
  210 +
  211 +
  212 + <!-- if method_list -->
  213 +
  214 +
  215 + </div>
  216 +
  217 +
  218 +<div id="validator-badges">
  219 + <p><small><a href="http://validator.w3.org/check/referer">[Validate]</a></small></p>
  220 +</div>
  221 +
  222 +</body>
  223 +</html>
114 doc/files/lib/spider/included_in_memcached_rb.html
... ... @@ -0,0 +1,114 @@
  1 +<?xml version="1.0" encoding="iso-8859-1"?>