Web spider framework that can crawl a domain and collect the pages it visits.
Spider

Multithreaded web crawler written in Rust.

Dependencies

On Debian/Ubuntu, install OpenSSL and its development headers:

$ apt install openssl libssl-dev

Usage

Add this dependency to your Cargo.toml file.

[dependencies]
spider = "1.0.2"

Then you can use the library. Here is a simple example:

extern crate spider;

use spider::website::Website;

fn main() {
    // Start from the given URL and crawl the whole domain.
    let mut website: Website = Website::new("https://choosealicense.com");
    website.crawl();

    // Print every URL that was visited.
    for page in website.get_pages() {
        println!("- {}", page.get_url());
    }
}
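
If you want to keep the results rather than print them, the visited URLs can be collected for later processing. This is only a sketch built on the calls shown above (get_pages() and get_url()); it assumes the returned URL can be converted into an owned String.

extern crate spider;

use spider::website::Website;

fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    website.crawl();

    // Gather every visited URL into an owned Vec for later processing.
    let mut urls: Vec<String> = Vec::new();
    for page in website.get_pages() {
        urls.push(page.get_url().to_string());
    }

    println!("crawled {} pages", urls.len());
}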

You can use the Configuration object to configure your crawler:

// ..
let mut website: Website = Website::new("https://choosealicense.com");
website.configuration.blacklist_url.push("https://choosealicense.com/licenses/".to_string());
website.configuration.respect_robots_txt = true;
website.configuration.verbose = true;
website.crawl();
// ..
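
Putting both snippets together, a complete program could look like the sketch below. The comments on the configuration fields are assumptions based on their names, not documented behavior.

extern crate spider;

use spider::website::Website;

fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");

    // Assumption: URLs matching this entry are skipped during the crawl.
    website.configuration.blacklist_url.push("https://choosealicense.com/licenses/".to_string());
    // Assumption: honor the rules in the site's robots.txt.
    website.configuration.respect_robots_txt = true;
    // Assumption: report pages as they are visited.
    website.configuration.verbose = true;

    website.crawl();

    for page in website.get_pages() {
        println!("- {}", page.get_url());
    }
}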

TODO

  • multi-threaded system
  • respect robots.txt file
  • add configuration object for polite delay, etc.
  • add polite delay
  • parse command line arguments (see the sketch after this list)
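
The command-line item above can already be prototyped outside the library with std::env. The sketch below is not part of the crate; it only reuses the API shown in the Usage section, and the fallback URL is just the example from above.

extern crate spider;

use std::env;

use spider::website::Website;

fn main() {
    // Take the start URL from the first command-line argument,
    // falling back to the example URL when none is given.
    let url = env::args()
        .nth(1)
        .unwrap_or_else(|| "https://choosealicense.com".to_string());

    let mut website: Website = Website::new(&url);
    website.crawl();

    for page in website.get_pages() {
        println!("- {}", page.get_url());
    }
}

Run it with, for example: $ cargo run -- https://choosealicense.com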

Contribute

I am open to any contribution. Just fork the repository and commit your changes on a separate branch.