Commits on Apr 23, 2018
  1. Merge pull request #3 from brigriffin/feature-redis-cache

    johnnagro committed Apr 23, 2018
    Add Redis as cache wrapper to track cycles
  2. Merge pull request #2 from jeremyevans/master

    johnnagro committed Apr 23, 2018
    Use Integer instead of Fixnum to avoid warning on ruby 2.4+
Commits on Nov 26, 2017
Commits on Apr 18, 2017
  1. Use Integer instead of Fixnum to avoid warning on ruby 2.4+

    jeremyevans committed Apr 18, 2017
    ruby 2.4 combined Fixnum and Bignum into Integer, and Fixnum and
    Bignum should not be referenced anymore.  This should be backwards
    compatible with older versions of ruby.
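
The backwards compatibility mentioned above follows from `Fixnum` and `Bignum` having been subclasses of `Integer` before Ruby 2.4, so a check against `Integer` should behave the same on old and new interpreters. A minimal sketch of that idea; the helper name `integer_like?` is hypothetical and is not taken from the gem:

```ruby
# Hypothetical helper (not part of the gem): a type check that avoids
# referencing Fixnum or Bignum, so it raises no deprecation warning on 2.4+.
def integer_like?(value)
  value.is_a?(Integer)
end

small = 42       # a Fixnum before Ruby 2.4, an Integer since
large = 2**100   # a Bignum before Ruby 2.4, an Integer since

puts integer_like?(small)
puts integer_like?(large)
puts integer_like?("42")
```

Both numeric values pass the check regardless of magnitude, while the string does not.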
Commits on Sep 4, 2016
  1. updating changelog

    johnnagro committed Sep 4, 2016
Commits on May 17, 2016
  1. Start the next version

    johnnagro committed May 17, 2016
  2. tweak some docs

    johnnagro committed May 17, 2016
Commits on May 13, 2016
  1. license spec

    johnnagro committed May 13, 2016
  2. gitignore

    johnnagro committed May 13, 2016
  3. typo

    johnnagro committed May 13, 2016
  4. changelog

    johnnagro committed May 13, 2016
  5. 💰 😼

    johnnagro committed May 13, 2016
  6. remove doc folder

    johnnagro committed May 13, 2016
  7. 0.5.0

    johnnagro committed May 13, 2016
  8. seems to be running again

    johnnagro committed May 13, 2016
  9. use a version config

    johnnagro committed May 13, 2016
  10. formatting

    johnnagro committed May 13, 2016
  11. dusting this puppy off

    johnnagro committed May 13, 2016
Commits on May 21, 2009
  1. 0.4.4

    johnnagro committed May 21, 2009
  2. 2009-05-21

    johnnagro committed May 21, 2009
    * fixed an issue with robots.txt on ssl hosts
    * fixed an issue with pulling robots.txt from disallowed hosts
    * fixed a documentation error with ExpiredLinks
    * Many thanks to Brian Campbell
Commits on Oct 9, 2008
  1. fixed nested slashes:

    johnnagro committed Oct 9, 2008
                                                                                                                                                                                                                                                                   
    From: Sander van der Vliet <sander@internethelden.nl>
    To: john.nagro@gmail.com
    Subject: Fwd: Bug spider.rb
    Date: Wed, 9 Jul 2008 00:05:59 +0200
    
    Hi John,
    
    My previous message was sent to Mike.
    
    Please see below.
    
    Thanks,
    Sander
    
    
    Begin forwarded message:
    
    > From: Sander van der Vliet <sander@internethelden.nl>
    > Date: 8 July 2008 17:26:04 GMT+02:00
    > To: mike@mike-burns.com
    > Subject: Bug spider.rb
    >
    > Hi Mike,
    >
    > I've found a bug in your spider script. I'm a very inexperienced
    > Ruby programmer (started Ruby yesterday), but while using your
    > script I encountered this problem with relative URLs:
    >
    > The relative URLs were built like this, which resulted in a
    > spider loop:
    >
    > http://www.ticketmaster.nl/html/browse.htmI?start=0&type=artist&l=NL&cat=10001&siteCat=1: 200
    > http://www.ticketmaster.nl//html/browse.htmI?start=30&type=artist&l=NL&cat=10001&siteCat=1: 200
    > http://www.ticketmaster.nl///html/browse.htmI?start=60&type=artist&l=NL&cat=10001&siteCat=1: 200
    > http://www.ticketmaster.nl////html/browse.htmI?start=90&type=artist&l=NL&cat=10001&siteCat=1: 200
    > http://www.ticketmaster.nl/////html/browse.htmI?start=120&type=artist&l=NL&cat=10001&siteCat=1: 200
    > http://www.ticketmaster.nl//////html/browse.htmI?start=150&type=artist&l=NL&cat=10001&siteCat=1: 200
    > http://www.ticketmaster.nl///////html/browse.htmI?start=180&type=artist&l=NL&cat=10001&siteCat=1: 200
    > http://www.ticketmaster.nl////////html/browse.htmI?start=210&type=artist&l=NL&cat=10001&siteCat=1: 200
    >
    > This is the fix:
    >
    > def construct_complete_url(base_url, additional_url, parsed_additional_url = nil) #:nodoc:
    >   parsed_additional_url ||= URI.parse(additional_url)
    >   case parsed_additional_url.scheme
    >   when nil
    >       u = base_url.is_a?(URI) ? base_url : URI.parse(base_url)
    >       if additional_url[0].chr == '/'
    >           "#{u.scheme}://#{u.host}#{additional_url}"
    >       elsif u.path.nil? || u.path == ''
    >           "#{u.scheme}://#{u.host}/#{additional_url}"
    >       elsif u.path[0].chr == '/'
    >           "#{u.scheme}://#{u.host}#{u.path}/#{additional_url}"
    >       else
    >           "#{u.scheme}://#{u.host}/#{u.path}/#{additional_url}"
    >       end
    >   else
    >       additional_url
    >   end
    > end
    >
    > Thanks for your script and I hope you will implement the fix.
    >
    > Any plans to expand it in the near future, for example to support threading?
    >
    > Thanks,
    > Sander
    >
    >
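
The slash-accumulating loop described in the message comes from gluing a relative link onto the base URL by hand. One alternative, shown here only as a minimal sketch and not as the fix applied in this commit, is to let Ruby's standard library resolve the reference with `URI.join`. The helper name `resolve_url` and the example.com URLs are assumptions made for illustration:

```ruby
require 'uri'

# Hypothetical helper (not part of the gem): resolve a link found on a page
# against that page's URL, following the standard reference-resolution rules.
def resolve_url(base_url, href)
  URI.join(base_url.to_s, href.to_s).to_s
end

base = 'http://www.example.com/html/browse.html?start=0'

# A root-relative link replaces the whole path, so no extra slash is added.
puts resolve_url(base, '/html/browse.html?start=30')

# A document-relative link is resolved against the base URL's directory.
puts resolve_url(base, 'browse.html?start=30')

# An absolute link is returned unchanged.
puts resolve_url(base, 'http://other.example.org/')
```

Because the resolved URL is always built from the base URL's host and a normalized path, repeated crawling of the same relative link yields the same string each time, which lets a visited-URL check stop the loop.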
Commits on Jul 6, 2008
  1. 0.4.2

    johnnagro committed Jul 6, 2008
  2. 0.4.2

    johnnagro committed Jul 6, 2008
    * Trap interrupts and shutdown gracefully
    * Support for custom urls-to-crawl objects
    * Example AmazonSQS urls-to-crawl support (next_urls_in_sqs.rb)
    * New documentation
Commits on May 23, 2008