
Starting the project off with the 0.4.1 code base.

0 parents commit 590ed8011a64f2b0c4510f11cb8e4264132da554 @johnnagro committed May 23, 2008
38 CHANGES
@@ -0,0 +1,38 @@
+2007-11-09:
+* Handle redirects that assume a base URL.
+
+2007-11-08:
+* Move spider_instance.rb, robot_rules.rb, and included_in_memcached.rb into
+ spider subdirectory.
+
+2007-11-02:
+* Memcached support.
+
+2007-10-31:
+* Add `setup' and `teardown' handlers.
+* Can set the headers for an HTTP request.
+* Changed :any to :every .
+* Changed the arguments to the :every, :success, :failure, and code handlers.
+
+2007-10-23:
+* URLs without a page component but with a query component.
+* HTTP Redirect.
+* HTTPS.
+* Version 0.2.1 .
+
+2007-10-22:
+* Use RSpec to ensure that it mostly works.
+* Use WEBrick to create a small test server for additional testing.
+* Completely re-do the API to prepare for future expansion.
+* Add the ability to apply each URL to a series of custom allowed?-like
+ matchers.
+* BSD license.
+* Version 0.2.0 .
+
+2007-03-30:
+* Clean up the documentation.
+
+2007-03-28:
+* Change the tail recursion to a `while' loop, to please Ruby.
+* Documentation.
+* Initial release: version 0.1.0 .
114 README
@@ -0,0 +1,114 @@
+Spider, a Web spidering library for Ruby. It handles the robots.txt,
+scraping, collecting, and looping so that you can just handle the data.
+
+== Examples
+
+=== Crawl the Web, loading each page in turn, until you run out of memory
+
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') {}
+
+=== To handle erroneous responses
+
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.on :failure do |a_url, resp, prior_url|
+ puts "URL failed: #{a_url}"
+ puts " linked from #{prior_url}"
+ end
+ end
+
+=== Or handle successful responses
+
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.on :success do |a_url, resp, prior_url|
+ puts "#{a_url}: #{resp.code}"
+ puts resp.body
+ puts
+ end
+ end
+
+=== Limit to just one domain
+
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.add_url_check do |a_url|
+ a_url =~ %r{^http://mike-burns.com.*}
+ end
+ end
+
+=== Pass headers to some requests
+
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.setup do |a_url|
+ if a_url =~ %r{^http://.*wikipedia.*}
+ headers['User-Agent'] = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
+ end
+ end
+ end
+
+=== Use memcached to track cycles
+
+ require 'spider'
+ require 'spider/included_in_memcached'
+ SERVERS = ['10.0.10.2:11211','10.0.10.3:11211','10.0.10.4:11211']
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.check_already_seen_with IncludedInMemcached.new(SERVERS)
+ end
+
+=== Track cycles with a custom object
+
+ require 'spider'
+
+ class ExpireLinks < Hash
+ def <<(v)
+ self[v] = Time.now
+ end
+ def include?(v)
+ self[v] && (self[v] + 86400) >= Time.now
+ end
+ end
+
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.check_already_seen_with ExpireLinks.new
+ end
+
+=== Create a URL graph
+
+ require 'spider'
+ nodes = {}
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.add_url_check {|a_url| a_url =~ %r{^http://mike-burns.com.*} }
+
+ s.on(:every) do |a_url, resp, prior_url|
+ nodes[prior_url] ||= []
+ nodes[prior_url] << a_url
+ end
+ end
+
+=== Use a proxy
+
+ require 'net/http_configuration'
+ require 'spider'
+ http_conf = Net::HTTP::Configuration.new(:proxy_host => '7proxies.org',
+ :proxy_port => 8881)
+ http_conf.apply do
+ Spider.start_at('http://img.4chan.org/b/') do |s|
+ s.on(:success) do |a_url, resp, prior_url|
+ File.open(a_url.gsub('/',':'),'w') do |f|
+ f.write(resp.body)
+ end
+ end
+ end
+ end
+
+== Author
+
+Mike Burns http://mike-burns.com mike@mike-burns.com
+
+Help from Matt Horan, John Nagro, and Henri Cook.
+
+With `robot_rules' from James Edward Gray II via
+http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589
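Note on the README's "Limit to just one domain" example: a prefix match on the URL string also accepts look-alike hosts such as http://mike-burns.com.example.org/. A minimal sketch of a stricter check follows, comparing the parsed host with Ruby's standard `uri` library. The helper name `same_host?` is hypothetical and not part of Spider; only `add_url_check` comes from the README.

```ruby
require 'uri'

# Hypothetical helper (not part of Spider): true only when a_url is an
# http or https URL whose host is exactly `host`.
def same_host?(a_url, host)
  uri = URI.parse(a_url.to_s)
  %w[http https].include?(uri.scheme) && uri.host == host
rescue URI::InvalidURIError
  # Malformed URLs are treated as "not on this host".
  false
end

# Assumed usage with the README's API (left as a comment so this block
# runs without the spider gem installed):
#
#   Spider.start_at('http://mike-burns.com/') do |s|
#     s.add_url_check { |a_url| same_host?(a_url, 'mike-burns.com') }
#   end
```

Comparing hosts instead of string prefixes keeps the check independent of how the rest of the URL is written, at the cost of also admitting https links, which the original pattern rejects.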
@@ -0,0 +1,226 @@
+<?xml version="1.0" encoding="iso-8859-1"?>
+<!DOCTYPE html
+ PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
+ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
+<head>
+ <title>Class: IncludedInMemcached</title>
+ <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
+ <meta http-equiv="Content-Script-Type" content="text/javascript" />
+ <link rel="stylesheet" href=".././rdoc-style.css" type="text/css" media="screen" />
+ <script type="text/javascript">
+ // <![CDATA[
+
+ function popupCode( url ) {
+ window.open(url, "Code", "resizable=yes,scrollbars=yes,toolbar=no,status=no,height=150,width=400")
+ }
+
+ function toggleCode( id ) {
+ if ( document.getElementById )
+ elem = document.getElementById( id );
+ else if ( document.all )
+ elem = eval( "document.all." + id );
+ else
+ return false;
+
+ elemStyle = elem.style;
+
+ if ( elemStyle.display != "block" ) {
+ elemStyle.display = "block"
+ } else {
+ elemStyle.display = "none"
+ }
+
+ return true;
+ }
+
+ // Make codeblocks hidden by default
+ document.writeln( "<style type=\"text/css\">div.method-source-code { display: none }</style>" )
+
+ // ]]>
+ </script>
+
+</head>
+<body>
+
+
+
+ <div id="classHeader">
+ <table class="header-table">
+ <tr class="top-aligned-row">
+ <td><strong>Class</strong></td>
+ <td class="class-name-in-header">IncludedInMemcached</td>
+ </tr>
+ <tr class="top-aligned-row">
+ <td><strong>In:</strong></td>
+ <td>
+ <a href="../files/lib/spider/included_in_memcached_rb.html">
+ lib/spider/included_in_memcached.rb
+ </a>
+ <br />
+ </td>
+ </tr>
+
+ <tr class="top-aligned-row">
+ <td><strong>Parent:</strong></td>
+ <td>
+ Object
+ </td>
+ </tr>
+ </table>
+ </div>
+ <!-- banner header -->
+
+ <div id="bodyContent">
+
+
+
+ <div id="contextContent">
+
+ <div id="description">
+ <p>
+A specialized class using memcached to track items stored. It supports
+three operations: <a href="IncludedInMemcached.html#M000001">new</a>,
+&lt;&lt;, and <a href="IncludedInMemcached.html#M000003">include?</a> .
+Together these can be used to add items to the memcache, then determine
+whether the item has been added.
+</p>
+<p>
+To use it with <a href="Spider.html">Spider</a> use the
+check_already_seen_with method:
+</p>
+<pre>
+ Spider.start_at('http://example.com/') do |s|
+ s.check_already_seen_with IncludedInMemcached.new('localhost:11211')
+ end
+</pre>
+
+ </div>
+
+
+ </div>
+
+ <div id="method-list">
+ <h3 class="section-bar">Methods</h3>
+
+ <div class="name-list">
+ <a href="#M000002">&lt;&lt;</a>&nbsp;&nbsp;
+ <a href="#M000003">include?</a>&nbsp;&nbsp;
+ <a href="#M000001">new</a>&nbsp;&nbsp;
+ </div>
+ </div>
+
+ </div>
+
+
+ <!-- if includes -->
+
+ <div id="section">
+
+
+
+
+
+
+
+
+ <!-- if method_list -->
+ <div id="methods">
+ <h3 class="section-bar">Public Class methods</h3>
+
+ <div id="method-M000001" class="method-detail">
+ <a name="M000001"></a>
+
+ <div class="method-heading">
+ <a href="#M000001" class="method-signature">
+ <span class="method-name">new</span><span class="method-args">(*a)</span>
+ </a>
+ </div>
+
+ <div class="method-description">
+ <p>
+Construct a <a href="IncludedInMemcached.html#M000001">new</a> <a
+href="IncludedInMemcached.html">IncludedInMemcached</a> instance. All
+arguments here are passed to MemCache (part of the memcache-client gem).
+</p>
+ <p><a class="source-toggle" href="#"
+ onclick="toggleCode('M000001-source');return false;">[Source]</a></p>
+ <div class="method-source-code" id="M000001-source">
+<pre>
+<span class="ruby-comment cmt"># File lib/spider/included_in_memcached.rb, line 39</span>
+ <span class="ruby-keyword kw">def</span> <span class="ruby-identifier">initialize</span>(<span class="ruby-operator">*</span><span class="ruby-identifier">a</span>)
+ <span class="ruby-ivar">@c</span> = <span class="ruby-constant">MemCache</span>.<span class="ruby-identifier">new</span>(<span class="ruby-operator">*</span><span class="ruby-identifier">a</span>)
+ <span class="ruby-keyword kw">end</span>
+</pre>
+ </div>
+ </div>
+ </div>
+
+ <h3 class="section-bar">Public Instance methods</h3>
+
+ <div id="method-M000002" class="method-detail">
+ <a name="M000002"></a>
+
+ <div class="method-heading">
+ <a href="#M000002" class="method-signature">
+ <span class="method-name">&lt;&lt;</span><span class="method-args">(v)</span>
+ </a>
+ </div>
+
+ <div class="method-description">
+ <p>
+Add an item to the memcache.
+</p>
+ <p><a class="source-toggle" href="#"
+ onclick="toggleCode('M000002-source');return false;">[Source]</a></p>
+ <div class="method-source-code" id="M000002-source">
+<pre>
+<span class="ruby-comment cmt"># File lib/spider/included_in_memcached.rb, line 44</span>
+ <span class="ruby-keyword kw">def</span> <span class="ruby-operator">&lt;&lt;</span>(<span class="ruby-identifier">v</span>)
+ <span class="ruby-ivar">@c</span>.<span class="ruby-identifier">add</span>(<span class="ruby-identifier">v</span>.<span class="ruby-identifier">to_s</span>, <span class="ruby-identifier">v</span>)
+ <span class="ruby-keyword kw">end</span>
+</pre>
+ </div>
+ </div>
+ </div>
+
+ <div id="method-M000003" class="method-detail">
+ <a name="M000003"></a>
+
+ <div class="method-heading">
+ <a href="#M000003" class="method-signature">
+ <span class="method-name">include?</span><span class="method-args">(v)</span>
+ </a>
+ </div>
+
+ <div class="method-description">
+ <p>
+True if the item is in the memcache.
+</p>
+ <p><a class="source-toggle" href="#"
+ onclick="toggleCode('M000003-source');return false;">[Source]</a></p>
+ <div class="method-source-code" id="M000003-source">
+<pre>
+<span class="ruby-comment cmt"># File lib/spider/included_in_memcached.rb, line 49</span>
+ <span class="ruby-keyword kw">def</span> <span class="ruby-identifier">include?</span>(<span class="ruby-identifier">v</span>)
+ <span class="ruby-ivar">@c</span>.<span class="ruby-identifier">get</span>(<span class="ruby-identifier">v</span>.<span class="ruby-identifier">to_s</span>) <span class="ruby-operator">==</span> <span class="ruby-identifier">v</span>
+ <span class="ruby-keyword kw">end</span>
+</pre>
+ </div>
+ </div>
+ </div>
+
+
+ </div>
+
+
+ </div>
+
+
+<div id="validator-badges">
+ <p><small><a href="http://validator.w3.org/check/referer">[Validate]</a></small></p>
+</div>
+
+</body>
+</html>
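Note on the IncludedInMemcached documentation above: the class needs only `add` and `get` from its cache client, so the same contract can be tried without a memcached server. The sketch below is an illustration under that assumption. `FakeCache` and `IncludedInCache` are hypothetical names, and the in-memory hash only mimics the two client calls shown in the documented source.

```ruby
# Hypothetical in-memory stand-in for a memcached client. It mimics only
# the two calls the documented wrapper makes: add(key, value) and get(key).
class FakeCache
  def initialize
    @store = {}
  end

  # Store the value only when the key is absent.
  def add(key, value)
    @store[key] = value unless @store.key?(key)
  end

  # Return the stored value, or nil when the key is unknown.
  def get(key)
    @store[key]
  end
end

# Same shape as the documented class, with the cache object injected
# instead of being built from server addresses.
class IncludedInCache
  def initialize(cache)
    @c = cache
  end

  # Record an item, keyed by its string form.
  def <<(v)
    @c.add(v.to_s, v)
  end

  # True when the item stored under the same key equals v.
  def include?(v)
    @c.get(v.to_s) == v
  end
end

seen = IncludedInCache.new(FakeCache.new)
seen << 'http://example.com/'
puts seen.include?('http://example.com/')
puts seen.include?('http://example.com/other')
```

Any object that responds to `<<` and `include?` in this way should be usable with `check_already_seen_with`, as the README's custom-object example suggests.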