
web_crawler

A prototype of a large-scale web crawler written in Ruby.

Installation

gem install web_crawler

Usage

# Must specify a start point
# Optionally limit to a domain

crawler = Crawler.new 'http://google.com'
crawler.start

# Fetches the page
# Resolves DNS from a cache
# Extracts all URLs
# Forms a canonical representation of each URL
# Logs output
# Stores downloaded files in .ruby_crawler/$URL
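
As a rough illustration of the steps above, here is a minimal sketch of such a crawl loop in plain Ruby. This is not the gem's actual implementation: the SimpleCrawler class, its domain: option, and the regex-based link extraction are illustrative stand-ins.

require 'net/http'
require 'uri'
require 'resolv'
require 'fileutils'
require 'set'

# Illustrative sketch only -- not the gem's real internals.
class SimpleCrawler
  def initialize(start_url, domain: nil)
    @queue  = [URI(start_url)]
    @seen   = Set.new
    @domain = domain  # optionally limit the crawl to one host
    @dns    = {}      # naive DNS cache: host => IP address
  end

  def start(limit = 10)
    while (uri = @queue.shift) && @seen.size < limit
      next if @domain && uri.host != @domain
      url = canonicalize(uri)
      next unless @seen.add?(url)                     # skip already-seen URLs

      @dns[uri.host] ||= Resolv.getaddress(uri.host)  # resolve DNS from a cache
      body = Net::HTTP.get(uri)                       # fetch the page
      store(url, body)                                # store under .ruby_crawler/$URL
      puts "crawled #{url}"                           # log output

      # extract all URLs and enqueue them
      body.scan(/href=["'](https?:[^"']+)["']/) { |(link)| @queue << URI(link) }
    end
  end

  private

  # Form a canonical representation: lowercase the host, drop the fragment.
  def canonicalize(uri)
    u = uri.dup
    u.host = u.host.downcase if u.host
    u.fragment = nil
    u.to_s
  end

  def store(url, body)
    FileUtils.mkdir_p('.ruby_crawler')
    File.write(File.join('.ruby_crawler', url.gsub(%r{[/:?&]+}, '_')), body)
  end
end

SimpleCrawler.new('http://google.com', domain: 'google.com').start

A production crawler would also need redirect handling, robots.txt support, politeness delays, and error handling for unparseable links, none of which this sketch attempts.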

Performance

require 'benchmark'

Benchmark.measure { Crawler.new('http://www.dmoz.org').start(10) }
=>  0.100000   0.020000   0.120000 (  3.949688)

Approximately 218,751 pages per day, which is quite slow.
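
The daily figure follows from the wall-clock time in the benchmark output (the number in parentheses): 10 pages in about 3.95 seconds is roughly 2.53 pages per second. A quick sanity check:

pages, elapsed = 10, 3.949688     # pages crawled and wall-clock seconds from above
per_second = pages / elapsed      # ~2.53 pages per second
per_day = (per_second * 86_400).round
puts per_day                      # => 218751 pages per day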

Note on Patches/Pull Requests

  • Fork the project.

  • Make your feature addition or bug fix.

  • Add tests for it. This is important so I don't break it in a future version unintentionally.

  • Commit, and do not mess with the Rakefile, version, or history. (If you want your own version, that is fine, but bump the version in a commit by itself so I can ignore it when I pull.)

  • Send me a pull request. Bonus points for topic branches.

Copyright

Copyright © 2011 Keith Mc Donnell. See LICENSE for details.