Introduction
flaxcrawler is an open source web crawler written in Java. It is a fast, lightweight, multi-threaded crawler that is easy to set up and use. You can configure its behaviour with plenty of settings, or you can even plug in your own implementations of the flaxcrawler components.
Example
package com.googlecode.flaxcrawler.examples;

import com.googlecode.flaxcrawler.CrawlerConfiguration;
import com.googlecode.flaxcrawler.CrawlerController;
import com.googlecode.flaxcrawler.CrawlerException;
import com.googlecode.flaxcrawler.DefaultCrawler;
import com.googlecode.flaxcrawler.download.DefaultDownloaderController;
import com.googlecode.flaxcrawler.model.CrawlerTask;
import com.googlecode.flaxcrawler.model.Page;
import com.googlecode.flaxcrawler.parse.DefaultParserController;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

public class FirstExample {

    public static void main(String[] args) throws MalformedURLException, CrawlerException {
        // Setting up the downloader controller
        DefaultDownloaderController downloaderController = new DefaultDownloaderController();
        // Setting up the parser controller
        DefaultParserController parserController = new DefaultParserController();
        // Creating the crawler configuration object
        CrawlerConfiguration configuration = new CrawlerConfiguration();

        // Creating five crawlers (to work with 5 threads)
        for (int i = 0; i < 5; i++) {
            // Creating a crawler and setting its downloader and parser controllers
            DefaultCrawler crawler = new ExampleCrawler();
            crawler.setDownloaderController(downloaderController);
            crawler.setParserController(parserController);
            // Adding the crawler to the configuration object
            configuration.addCrawler(crawler);
        }

        // Setting the maximum number of parallel requests to a single site
        configuration.setMaxParallelRequests(1);
        // Setting HTTP error limits. If a limit is exceeded for any site,
        // the crawler stops processing that site
        configuration.setMaxHttpErrors(HttpURLConnection.HTTP_CLIENT_TIMEOUT, 10);
        configuration.setMaxHttpErrors(HttpURLConnection.HTTP_BAD_GATEWAY, 10);
        // Setting the period between two requests to a single site (in milliseconds)
        configuration.setPolitenessPeriod(500);

        // Initializing the crawler controller
        CrawlerController crawlerController = new CrawlerController(configuration);
        // Adding a crawler seed
        crawlerController.addSeed(new URL("http://en.wikipedia.org/"));
        // Starting the crawler
        crawlerController.start();
        // Joining the crawler controller and waiting for 60 seconds
        crawlerController.join(60000);
        // Stopping the crawler controller
        crawlerController.stop();
    }

    /**
     * Custom crawler. Extends {@link DefaultCrawler}.
     */
    private static class ExampleCrawler extends DefaultCrawler {

        /**
         * This method is called after each crawl attempt.
         * Warning - it is called even if the attempt was unsuccessful or the response was redirected,
         * so you should check the response code before handling it.
         * @param crawlerTask
         * @param page
         */
        @Override
        protected void afterCrawl(CrawlerTask crawlerTask, Page page) {
            super.afterCrawl(crawlerTask, page);
            if (page == null) {
                System.out.println(crawlerTask.getUrl() + " violates crawler constraints (content-type, content-length or other)");
            } else if (page.getResponseCode() >= 300 && page.getResponseCode() < 400) {
                // If the response is redirected, the crawler schedules a new task with the new URL
                System.out.println("Response was redirected from " + crawlerTask.getUrl());
            } else if (page.getResponseCode() == HttpURLConnection.HTTP_OK) {
                // Printing the crawled URL
                System.out.println(crawlerTask.getUrl() + ". Found " + (page.getLinks() != null ? page.getLinks().size() : 0) + " links.");
            }
        }

        /**
         * Lets you decide whether the next task should be crawled.
         * @param crawlerTask Task that is going to be crawled if you return {@code true}
         * @param parent Parent task; its page links to crawlerTask.getUrl() or redirects to it
         * @return {@code true} if the task should be crawled, {@code false} otherwise
         */
        @Override
        public boolean shouldCrawl(CrawlerTask crawlerTask, CrawlerTask parent) {
            // The default implementation returns true if crawlerTask.getDomainName() equals parent.getDomainName()
            return super.shouldCrawl(crawlerTask, parent);
        }
    }
}
As the code above shows, you only need to override one method to get access to the results of the crawler's work: DefaultCrawler's afterCrawl method is called after the crawler has downloaded and parsed a web page.
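For instance, a crawler that simply collects the URLs of successfully downloaded pages could look like the following sketch. The class name ResultCollectingCrawler is hypothetical; the sketch only assumes the methods already used in the example above (afterCrawl, Page.getResponseCode() and CrawlerTask.getUrl()).

import com.googlecode.flaxcrawler.DefaultCrawler;
import com.googlecode.flaxcrawler.model.CrawlerTask;
import com.googlecode.flaxcrawler.model.Page;
import java.net.HttpURLConnection;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical example class, not part of flaxcrawler itself
public class ResultCollectingCrawler extends DefaultCrawler {

    // afterCrawl is invoked from worker threads, so use a synchronized list
    private final List<String> crawledUrls = Collections.synchronizedList(new ArrayList<String>());

    @Override
    protected void afterCrawl(CrawlerTask crawlerTask, Page page) {
        super.afterCrawl(crawlerTask, page);
        // Keep only pages that were downloaded successfully (HTTP 200)
        if (page != null && page.getResponseCode() == HttpURLConnection.HTTP_OK) {
            crawledUrls.add(String.valueOf(crawlerTask.getUrl()));
        }
    }

    public List<String> getCrawledUrls() {
        return crawledUrls;
    }
}

Instances of such a crawler are registered with configuration.addCrawler(...) exactly as in the example above.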
flaxcrawler features
Keeps track of the visited web pages, so a page cannot be crawled twice
Normalizes the URLs of web pages
flaxcrawler is polite - it can be configured to pause between two consecutive requests to the same site
Multi-threaded
Supports downloading through one or more proxies (load-balances between proxies if more than one is set)
Highly customizable - for example, by overriding DefaultCrawler, as in the sketch below
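As an illustration of that customizability, the sketch below restricts crawling to a single domain by overriding shouldCrawl. The class name DomainLimitedCrawler is hypothetical; the sketch assumes CrawlerTask.getDomainName() behaves as described in the comments of the example above.

import com.googlecode.flaxcrawler.DefaultCrawler;
import com.googlecode.flaxcrawler.model.CrawlerTask;

// Hypothetical example class, not part of flaxcrawler itself
public class DomainLimitedCrawler extends DefaultCrawler {

    private final String allowedDomain;

    public DomainLimitedCrawler(String allowedDomain) {
        this.allowedDomain = allowedDomain;
    }

    @Override
    public boolean shouldCrawl(CrawlerTask crawlerTask, CrawlerTask parent) {
        // Follow a link only if it stays on the allowed domain
        return allowedDomain.equalsIgnoreCase(crawlerTask.getDomainName());
    }
}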
Maven
flaxcrawler is a Maven project, so you need Maven installed to build it (for example, by running mvn install in the project root).
Samples
You can find several samples of flaxcrawler usage and customization:
FullConfigurationExample: https://code.google.com/p/flaxcrawler/wiki/FullConfigurationExample
SpringExample: https://code.google.com/p/flaxcrawler/wiki/SpringExample