branch: master
Commits on May 21, 2009
  1. 0.4.4
  2. 2009-05-21
    * fixed an issue with robots.txt on ssl hosts
    * fixed an issue with pulling robots.txt from disallowed hosts
    * fixed a documentation error with ExpiredLinks
    * Many thanks to Brian Campbell
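
The robots.txt fixes above concern which URL the spider requests for a given host. As a rough illustration only (this is not the gem's code, and `robots_txt_url` is a hypothetical helper name), one way to derive the robots.txt location while preserving the page's scheme and any non-default port is:

```ruby
require 'uri'

# Hypothetical helper, not part of the spider gem: derive the robots.txt URL
# for the host that serves page_url, keeping its scheme (http or https)
# and port.
def robots_txt_url(page_url)
  robots = URI.parse(page_url).dup
  robots.path = '/robots.txt' # robots.txt always lives at the host root
  robots.query = nil          # drop the page's query string
  robots.fragment = nil       # drop any fragment
  robots.to_s
end

puts robots_txt_url('https://example.com/shop/cart?id=1')
puts robots_txt_url('http://example.com:8080/index.html')
```

Keeping the original scheme matters for SSL hosts, since a site served over `https` may not answer on plain `http` at all.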
Commits on Oct 9, 2008
  1. fixed nested slashes:
    From: Sander van der Vliet <sander@internethelden.nl>
    To: john.nagro@gmail.com
    Subject: Fwd: Bug spider.rb
    Date: Wed, 9 Jul 2008 00:05:59 +0200
    
    Hi John,
    
    My previous message was sent to Mike.
    
    Please see below.
    
    Thanks,
    Sander
    
    
    Begin forwarded message:
    
    > From: Sander van der Vliet <sander@internethelden.nl>
    > Date: 8 July 2008 17:26:04 GMT+02:00
    > To: mike@mike-burns.com
    > Subject: Bug spider.rb
    >
    > Hi Mike,
    >
    > I've found a bug in your spider script. I'm a very inexperienced
    > Ruby programmer (started Ruby yesterday), but while using your
    > script I encountered this problem with relative URLs:
    >
    > The relative URLs were constructed like this, which resulted in a
    > spider loop:
    >
    > http://www.ticketmaster.nl/html/browse.htmI?start=0&type=artist&l=NL&cat=10001&siteCat=1: 200
    > http://www.ticketmaster.nl//html/browse.htmI?start=30&type=artist&l=NL&cat=10001&siteCat=1: 200
    > http://www.ticketmaster.nl///html/browse.htmI?start=60&type=artist&l=NL&cat=10001&siteCat=1: 200
    > http://www.ticketmaster.nl////html/browse.htmI?start=90&type=artist&l=NL&cat=10001&siteCat=1: 200
    > http://www.ticketmaster.nl/////html/browse.htmI?start=120&type=artist&l=NL&cat=10001&siteCat=1: 200
    > http://www.ticketmaster.nl//////html/browse.htmI?start=150&type=artist&l=NL&cat=10001&siteCat=1: 200
    > http://www.ticketmaster.nl///////html/browse.htmI?start=180&type=artist&l=NL&cat=10001&siteCat=1: 200
    > http://www.ticketmaster.nl////////html/browse.htmI?start=210&type=artist&l=NL&cat=10001&siteCat=1: 200
    >
    > This is the fix:
    >
    > def construct_complete_url(base_url, additional_url, parsed_additional_url = nil) #:nodoc:
    >   parsed_additional_url ||= URI.parse(additional_url)
    >   case parsed_additional_url.scheme
    >   when nil
    >       u = base_url.is_a?(URI) ? base_url : URI.parse(base_url)
    >       if additional_url[0].chr == '/'
    >           "#{u.scheme}://#{u.host}#{additional_url}"
    >       elsif u.path.nil? || u.path == ''
    >           "#{u.scheme}://#{u.host}/#{additional_url}"
    >       elsif u.path[0].chr == '/'
    >           "#{u.scheme}://#{u.host}#{u.path}/#{additional_url}"
    >       else
    >           "#{u.scheme}://#{u.host}/#{u.path}/#{additional_url}"
    >       end
    >   else
    >       additional_url
    >   end
    > end
    >
    > Thanks for your script and I hope you will implement the fix.
    >
    > Any plans to expand it in near future? Like to support threading?
    >
    > Thanks,
    > Sander
    >
    >
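
For comparison, the same kind of absolute-path link can be resolved with Ruby's standard `URI.join`, which follows the generic URI reference-resolution rules. This is a sketch of an alternative technique, not the fix that was committed; `resolve_link` is a hypothetical name and the shortened query strings are illustrative.

```ruby
require 'uri'

# Hypothetical helper: resolve a link found on a page against that page's URL
# using the standard library instead of manual string concatenation.
def resolve_link(base_url, link)
  URI.join(base_url, link).to_s
end

# Follow a chain of absolute-path "next page" links like the ones in the report.
current = 'http://www.ticketmaster.nl/html/browse.htmI?start=0&type=artist'
[30, 60, 90].each do |start|
  current = resolve_link(current, "/html/browse.htmI?start=#{start}&type=artist")
  puts current # the path stays /html/browse.htmI on every hop
end
```

One difference worth noting: for links that do not start with `/`, `URI.join` replaces the last path segment of the base URL, whereas the emailed fix appends the link to the full base path. Which behavior is wanted depends on how the spider passes base URLs around.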
Commits on Jul 6, 2008
  1. 0.4.2
  2. 0.4.2
    * Trap interrupts and shutdown gracefully
    * Support for custom urls-to-crawl objects
    * Example AmazonSQS urls-to-crawl support (next_urls_in_sqs.rb)
    * New documentation
Commits on May 23, 2008