A generic web crawler.
Starting from a given URL, it follows every same-domain link it finds and prints a list of the URLs it visited.
- Follows links (href attributes) that point to the same domain
- Ignores links (href attributes) that point to other domains, including subdomains
- Captures the src attribute of script tags and the href attribute of link tags
- Outputs a list of visited URLs
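The behaviour listed above can be illustrated with a minimal sketch using only the Ruby standard library. This is not iron-crawler's actual implementation: `same_domain?`, `crawl`, `fetch_links`, and the in-memory `SITE` map are hypothetical names introduced for illustration, and page fetching is replaced by a lookup so the example runs without network access.

```ruby
require "set"
require "uri"

# Hypothetical helper (not part of iron-crawler): true when `href`, resolved
# against `base`, is an http(s) URL on exactly the same host as `base`.
# Subdomains have a different host, so they are rejected.
def same_domain?(base, href)
  target = URI.join(base, href) # resolves relative hrefs such as "/about"
  %w[http https].include?(target.scheme) && target.host == URI.parse(base).host
rescue URI::Error
  false # unparseable hrefs are skipped
end

# Breadth-first crawl. `fetch_links` is any callable that takes a URL string
# and returns the raw href values found on that page (an assumption made so
# the sketch does not depend on a particular HTTP or HTML library).
def crawl(start_url, fetch_links)
  visited = Set.new
  queue = [start_url]
  until queue.empty?
    url = queue.shift
    next unless visited.add?(url) # add? returns nil if already visited
    fetch_links.call(url).each do |href|
      next unless same_domain?(url, href)
      absolute = URI.join(url, href)
      absolute.fragment = nil # "#team" points at the same page
      queue << absolute.to_s
    end
  end
  visited.to_a
end

# In-memory stand-in for a small site (illustrative data only).
SITE = {
  "https://example.com/" => ["/about", "https://blog.example.com/",
                             "https://other.org/", "mailto:hi@example.com"],
  "https://example.com/about" => ["/", "#team"]
}.freeze

puts crawl("https://example.com/", ->(url) { SITE.fetch(url, []) })
```

Here the subdomain, the external site, and the mailto link are all skipped, so only the two pages on example.com are reported as visited.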
It's easy to get started!
gem install iron-crawler
iron-crawler <url>
Replace <url> with the starting URL of the site you want to crawl. The command prints the URLs it visited.
Planned improvements:
- Concurrency (will probably require moving away from Mechanize)
- Test coverage with RSpec
- Set up a CI pipeline with Travis CI to publish to RubyGems automatically