
Introduction

flaxcrawler is an open source web crawler written in Java. It is a fast, lightweight, multi-threaded crawler that is easy to set up and use. You can configure its behaviour with plenty of settings, or even plug in your own implementations of the flaxcrawler components.

Example

package com.googlecode.flaxcrawler.examples;

import com.googlecode.flaxcrawler.CrawlerConfiguration;
import com.googlecode.flaxcrawler.CrawlerController;
import com.googlecode.flaxcrawler.CrawlerException;
import com.googlecode.flaxcrawler.DefaultCrawler;
import com.googlecode.flaxcrawler.download.DefaultDownloaderController;
import com.googlecode.flaxcrawler.model.CrawlerTask;
import com.googlecode.flaxcrawler.model.Page;
import com.googlecode.flaxcrawler.parse.DefaultParserController;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

public class FirstExample {

    public static void main(String[] args) throws MalformedURLException, CrawlerException {
        // Setting up downloader controller
        DefaultDownloaderController downloaderController = new DefaultDownloaderController();
        // Setting up parser controller
        DefaultParserController parserController = new DefaultParserController();

        // Creating the crawler configuration object
        CrawlerConfiguration configuration = new CrawlerConfiguration();

        // Creating five crawlers (to work with 5 threads)
        for (int i = 0; i < 5; i++) {
            // Creating crawler and setting downloader and parser controllers
            DefaultCrawler crawler = new ExampleCrawler();
            crawler.setDownloaderController(downloaderController);
            crawler.setParserController(parserController);
            // Adding crawler to the configuration object
            configuration.addCrawler(crawler);
        }

        // Setting the maximum number of parallel requests to a single site
        configuration.setMaxParallelRequests(1);
        // Setting HTTP error limits. If a limit is exceeded for any site,
        // the crawler stops processing that site
        configuration.setMaxHttpErrors(HttpURLConnection.HTTP_CLIENT_TIMEOUT, 10);
        configuration.setMaxHttpErrors(HttpURLConnection.HTTP_BAD_GATEWAY, 10);
        // Setting the period between two requests to a single site (in milliseconds)
        configuration.setPolitenessPeriod(500);

        // Initializing the crawler controller
        CrawlerController crawlerController = new CrawlerController(configuration);
        // Adding a crawler seed
        crawlerController.addSeed(new URL("http://en.wikipedia.org/"));
        // Starting the crawler
        crawlerController.start();
        // Joining the crawler controller and waiting for up to 60 seconds
        crawlerController.join(60000);
        // Stopping the crawler controller
        crawlerController.stop();
    }

    /**
     * Custom crawler. Extends {@link DefaultCrawler}.
     */
    private static class ExampleCrawler extends DefaultCrawler {

        /**
         * This method is called after each crawl attempt.
         * Warning - it is called even if the attempt was unsuccessful or the response
         * was redirected, so check the response code before handling the page.
         * @param crawlerTask the crawled task
         * @param page the downloaded page, or {@code null} if crawler constraints were violated
         */
        @Override
        protected void afterCrawl(CrawlerTask crawlerTask, Page page) {
            super.afterCrawl(crawlerTask, page);

            if (page == null) {
                System.out.println(crawlerTask.getUrl() + " violates crawler constraints (content-type, content-length or other)");
            } else if (page.getResponseCode() >= 300 && page.getResponseCode() < 400) {
                // If the response is redirected, the crawler schedules a new task with the new url
                System.out.println("Response was redirected from " + crawlerTask.getUrl());
            } else if (page.getResponseCode() == HttpURLConnection.HTTP_OK) {
                // Printing the crawled url
                System.out.println(crawlerTask.getUrl() + ". Found " + (page.getLinks() != null ? page.getLinks().size() : 0) + " links.");
            }
        }

        /**
         * Decides whether the given task should be crawled.
         * @param crawlerTask task that will be crawled if this method returns {@code true}
         * @param parent task whose page contains a link to crawlerTask.getUrl() or redirects to it
         * @return {@code true} if the task should be crawled
         */
        @Override
        public boolean shouldCrawl(CrawlerTask crawlerTask, CrawlerTask parent) {
            // Default implementation returns true if crawlerTask.getDomainName() == parent.getDomainName()
            return super.shouldCrawl(crawlerTask, parent);
        }
    }
}

As you can see in the code above, you only need to override one method to get access to the results of the crawler's work:

DefaultCrawler's afterCrawl method is called after the crawler has downloaded and parsed a web page.
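
Because the example configuration creates five crawler instances working in parallel, afterCrawl may be called from several threads at once, so any results collected there should go into a thread-safe structure. Below is a minimal sketch under that assumption (ResultCollectingCrawler and its RESULTS queue are illustrative names, not part of flaxcrawler):

import com.googlecode.flaxcrawler.DefaultCrawler;
import com.googlecode.flaxcrawler.model.CrawlerTask;
import com.googlecode.flaxcrawler.model.Page;
import java.net.HttpURLConnection;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class ResultCollectingCrawler extends DefaultCrawler {

    // Shared across all crawler instances added to the configuration
    private static final Queue<String> RESULTS = new ConcurrentLinkedQueue<String>();

    @Override
    protected void afterCrawl(CrawlerTask crawlerTask, Page page) {
        super.afterCrawl(crawlerTask, page);
        // Only record pages that were actually downloaded with HTTP 200
        if (page != null && page.getResponseCode() == HttpURLConnection.HTTP_OK) {
            RESULTS.add(String.valueOf(crawlerTask.getUrl()));
        }
    }

    public static Queue<String> getResults() {
        return RESULTS;
    }
}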

flaxcrawler features

Keeps track of visited web pages, so a page is never crawled twice
Normalizes the URLs of web pages
flaxcrawler is polite - it can be configured to pause between two requests to the same site
Multi-threaded
Supports downloading through one or more proxies (load balances across proxies if more than one is set)
Highly customizable - see the sketch after this list
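
Customization can be as small as overriding shouldCrawl. For instance, here is a minimal sketch that keeps the default same-domain rule but additionally restricts the crawl to the seed host used in the example above (WikipediaOnlyCrawler is an illustrative name, not part of flaxcrawler):

import com.googlecode.flaxcrawler.DefaultCrawler;
import com.googlecode.flaxcrawler.model.CrawlerTask;

public class WikipediaOnlyCrawler extends DefaultCrawler {

    @Override
    public boolean shouldCrawl(CrawlerTask crawlerTask, CrawlerTask parent) {
        // Keep the default same-domain check and additionally require the seed host
        return super.shouldCrawl(crawlerTask, parent)
                && String.valueOf(crawlerTask.getUrl()).startsWith("http://en.wikipedia.org/");
    }
}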

Maven

flaxcrawler is a Maven project, so you need Maven to build it.

Samples

You can find several samples of flaxcrawler usage and customization:

FullConfigurationExample: https://code.google.com/p/flaxcrawler/wiki/FullConfigurationExample 
SpringExample: https://code.google.com/p/flaxcrawler/wiki/SpringExample 

About

Git copy of Flaxcrawler: a simple and flexible Java web crawler
