Fureteur

Update: due to scaling issues when running a large number of pipelines, a new high-performance fetcher has been developed in Haskell. Please refer to fetchIO for more information.

Fureteur is a simple, configurable, fault-tolerant HTTP crawler written in Scala. The main features are:

Configurable number of concurrent pipelines; each pipeline may include a configurable number of individual fetchers
Reuse of HTTP connections
Modular implementation using Akka actors
URL prefetching and data writeback are done in batches (the batch size is configurable)
Built on the robust Apache HTTP client library
Prefetching of URLs from files and/or AMQP queues such as RabbitMQ
Fetched HTML is zipped and Base64-encoded
Easy build with sbt
Configurable using a simple JSON file
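
To make the encoding of fetched pages concrete, here is a minimal round-trip sketch. It assumes that "zipped" means gzip and that the text encoding is standard Base64; neither detail is confirmed here, and the `PayloadCodec` name is purely illustrative, not part of Fureteur.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Base64
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Hypothetical helper, not Fureteur's actual code.
object PayloadCodec {
  // Gzip the page body, then Base64-encode the compressed bytes
  // so the result can travel inside a text (e.g. JSON) message.
  def encode(html: String): String = {
    val buffer = new ByteArrayOutputStream()
    val gzip = new GZIPOutputStream(buffer)
    try gzip.write(html.getBytes(UTF_8)) finally gzip.close()
    Base64.getEncoder.encodeToString(buffer.toByteArray)
  }

  // Reverse the two steps: Base64-decode, then gunzip.
  def decode(payload: String): String = {
    val bytes = Base64.getDecoder.decode(payload)
    val gzip = new GZIPInputStream(new ByteArrayInputStream(bytes))
    try scala.io.Source.fromInputStream(gzip, "UTF-8").mkString
    finally gzip.close()
  }
}
```

A consumer of the output would call something like `PayloadCodec.decode` on each stored payload to recover the original HTML.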

Distributed Crawling

Fureteur makes it very easy to implement a distributed crawler - for instance on Amazon AWS EC2.

The main server hosts the RabbitMQ queues that store the URLs to be fetched (the fetchIn queues) and a queue that holds the fetched data (the fetchOut queue). Messages use a simple JSON format. Separate tasks running on the server schedule the URLs to be fetched and write the fetched data back into a distributed database.

A configurable number of fetchers can be started on separate EC2 instances. Each instance gets URL batches from the server's fetchIn queues, fetches them and writes the results back into the fetchOut queue. Cost-effective EC2 micro instances may be used since fetching is not a CPU-intensive task. The fetcher instances are fully fault-tolerant: if an instance becomes unresponsive or is abruptly terminated, no data is lost, because RabbitMQ redelivers any message that was never acknowledged.
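
The reason a crashed fetcher loses no data can be illustrated with a toy in-memory model of acknowledged delivery. This is only a sketch of the idea: `AckQueue` and its methods are hypothetical names, it models a single consumer, and it is neither the RabbitMQ client API nor Fureteur's code.

```scala
import scala.collection.mutable

// Toy model of at-least-once delivery; NOT the RabbitMQ client API.
final class AckQueue[A](initial: Seq[A]) {
  private val ready = mutable.Queue(initial: _*)
  private val unacked = mutable.Map.empty[Long, A]
  private var nextTag = 0L

  // Hand out the next message; it stays tracked until acknowledged.
  def deliver(): Option[(Long, A)] =
    if (ready.isEmpty) None
    else {
      val msg = ready.dequeue()
      nextTag += 1
      unacked(nextTag) = msg
      Some((nextTag, msg))
    }

  // The consumer confirms that the message was fully processed.
  def ack(tag: Long): Unit = { unacked.remove(tag); () }

  // The consumer died: every message it had not acknowledged
  // goes back to the ready queue, in delivery order.
  def consumerLost(): Unit = {
    unacked.toSeq.sortBy(_._1).foreach { case (_, msg) => ready.enqueue(msg) }
    unacked.clear()
  }

  // Messages that still need to be processed by someone.
  def pending: Int = ready.size + unacked.size
}
```

In this model, a fetcher that takes a URL batch and dies before calling `ack` leaves the batch in `unacked`; `consumerLost()` then makes it available to another fetcher. The trade-off is that a batch may be fetched twice, so writeback should tolerate duplicates.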

Getting Started

First, let's check out the project and show the usage:

bash-3.2$ git clone git://github.com/gip/fureteur.git
bash-3.2$ cd fureteur
bash-3.2$ sbt "run"
[info] Loading project definition from /Users/gilles/gip/test/fureteur/project
[info] Set current project to fureteur (in build file:/Users/gilles/gip/test/fureteur/)
[info] Running fureteur.Fureteur 
usage: fureteur run <config name>    # Start execution using a local config
       fureteur load <config path>   # Start execution using the provided config file
       fureteur show <config name>   # Dump a local config to STDOUT
       fureteur list                 # Show available local config
       fureteur version              # Show version
[success] Total time: 0 s, completed Mar 29, 2012 4:34:25 PM

Now, let's list the available configuration files and display the first one:

bash-3.2$ sbt "run list"
[info] Loading project definition from /Users/gilles/gip/test/fureteur/project
[info] Set current project to fureteur (in build file:/Users/gilles/gip/test/fureteur/)
[info] Running fureteur.Fureteur list
Available configs:
 f2f   # File to file operation
 r2r   # Input/output from RabbitMQ
[success] Total time: 0 s, completed Mar 29, 2012 4:37:21 PM
bash-3.2$ sbt "run show f2f"
[info] Loading project definition from /Users/gilles/gip/test/fureteur/project
[info] Set current project to fureteur (in build file:/Users/gilles/gip/test/fureteur/)
[info] Running fureteur.Fureteur show f2f
-- This is an example configuration file for fureteur
{
  "conf" : "f2f",                                -- Configuration name
  "description" : "File to file operation",      -- Description
  "usage" : "f2f <input file> <output file>",    -- Usage
  "instance" : "fureteur",                       -- Instance name 
   
  "pipelines" : [                                -- Pipelines
    {
      "httpManager" : {                          -- The http connection manager
          "max_connection" : "2",
          "max_connection_per_route" : "2",
          "min_interval_ms" : "1000"
        },  
    
      "prefetcher" : { "class" : "fileBatchPrefetcher",     -- Prefetching from files
                       "file_name" : "fureteur_in",         -- Input file name
                       "batch_size" : "50",                 -- Batch size when retrieving items from files
                       "threshold_in_batches" : "3",        -- Threshold (expressed in number of batches)
                       "timeout_ms" : "1000"                -- Timeout in ms 
                    },
      "httpFetchers": [ { "threshold_in" : "10",            -- Input threshold
                          "threshold_out" : "50",           -- Output threshold
                          "timeout_ms" : "1000"             -- Output timeout
                        },
                        { "threshold_in" : "10",            -- Input threshold (for second fetcher)
                          "threshold_out" : "50",           -- Output threshold (for second fetcher)
                          "timeout_ms" : "1000"             -- Output timeout (for second fetcher)
                        }
                      ],
      "writeback" :{ "class" : "fileBatchWriteback",        -- Writing back to file
                     "file_name" : "fureteur_out"           -- File name
                   }
    }
  ]
}

[success] Total time: 0 s, completed Mar 29, 2012 4:38:30 PM

The configuration above prefetches URLs from the fureteur_in file, fetches them and writes the results back to the fureteur_out file. All the different options should be self-explanatory. Don't hesitate to contact me for further details.
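
Note that the `--` comments shown in the configuration are not part of standard JSON. If you want to feed such a file to a strict JSON parser or another tool, a small preprocessing step like the sketch below may help. It assumes that `--` never appears inside a quoted value, and `ConfComments` is a hypothetical helper, not something Fureteur provides.

```scala
// Hypothetical helper, not part of Fureteur.
object ConfComments {
  // Drop everything from "--" to the end of each line, then remove
  // trailing whitespace and lines that end up empty.
  // Assumption: "--" never occurs inside a quoted JSON value.
  def strip(conf: String): String =
    conf.split("\n")
      .map { line =>
        val i = line.indexOf("--")
        if (i >= 0) line.substring(0, i) else line
      }
      .map(_.replaceAll("\\s+$", ""))
      .filter(_.nonEmpty)
      .mkString("\n")
}
```

Running `ConfComments.strip` on the output of `show f2f` (without the sbt log lines) should leave a plain JSON document.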

To run the configuration, just do:

bash-3.2$ sbt "run run f2f"

It is also possible to copy/paste the configuration into a local file (for instance ex.conf), modify it and start fureteur using:

bash-3.2$ sbt "run load ex.conf"

It is also possible to build a standalone jar file and start fureteur without sbt; please refer to the sbt documentation. The r2r configuration reads from and writes to an AMQP-compliant queue.

Future Work

The next version will include optional configuration using ZooKeeper and more fault-tolerance features.

Contact

Created by Gilles Pirio. Feel free to contact me at gip.github@gmail.com. Thanks to @Entelo for supporting open-source development!
