a spider using nothing but standard unix tools (+ ruby glue)
git clone whatever
cd bashpider
rake data:get_urls # will take a while
rake crawl:restart # this will run a crawl
If you want to monitor the downloads per second, in another window type the following:
rake crawl:watch
When you feel you've gathered enough data, CTRL-C to kill both windows and then type:
rake results:process
See: post