A SIMPLE (but fast & extensible) crawler using CommonCrawl.
Latest commit f1fb199 Nov 24, 2016
Type Name Latest commit message Commit time
LICENSE.txt Create LICENSE.txt Nov 24, 2016
README.rst Initial commit. Dec 10, 2014
all.warc.gz Initial commit. Dec 10, 2014
crawler.py 'global' fix. Dec 10, 2014
requirements.txt Initial commit. Dec 10, 2014

README.rst

Starter kit:

virtualenv env/
source env/bin/activate
pip install -r requirements.txt
python crawler.py

Let your console be flooded by the lists extracted from the web.

We recommend redirecting the crawler's standard output to a file. The crawler's error output, which shows some statistics from time to time, then stays visible in the console instead of being buried under the extracted lists.
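One way to set up this redirection is sketched below. It is only an illustration, not part of the repository: the helper name `run_with_stdout_to_file` and the output file name `lists.txt` are assumptions made for the example. The helper starts a command, sends its standard output to a file, and leaves its error output attached to the console.

```python
import subprocess
import sys


def run_with_stdout_to_file(command, output_path):
    """Run `command`, writing its standard output to `output_path`.

    The child's standard error is not redirected, so anything it prints
    there (for the crawler, the periodic statistics) stays on the console.
    Returns the exit code of the command.
    """
    with open(output_path, "wb") as out:
        completed = subprocess.run(command, stdout=out)
    return completed.returncode


# Hypothetical usage, run from the repository directory with the
# virtualenv active ("lists.txt" is an assumed file name):
#
#     run_with_stdout_to_file([sys.executable, "crawler.py"], "lists.txt")
```

From a POSIX shell, `python crawler.py > lists.txt` has the same effect, since `>` redirects only standard output.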