Python Crawler for collecting domain specific web corpora


The crawler launches according to the parameters declared in the crawl_parameters.yml file, or in any other YAML file passed as an argument. It uses the seed sites listed in the files of a given directory (path), together with a query (query) used to validate new webpages found during crawling. inlinks_min defines the minimum number of citations a page must accumulate before being considered a candidate to enter the corpus. The depth parameter defines the number of corpus-extension steps performed from the initial corpus built from the seed webpages.
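A minimal crawl_parameters.yml might look like the sketch below. Only the parameter names (path, query, inlinks_min, depth) come from the description above; the values and comments are illustrative assumptions.

```yaml
path: seeds/             # directory whose files list the seed sites
query: renewable energy  # query used to validate newly found webpages
inlinks_min: 3           # minimum citations before a page can enter the corpus
depth: 2                 # number of corpus-extension steps from the seeds
```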

Required modules: urllib2, BeautifulSoup, urlparse, sqlite3, pyparsing, urllib, random, multiprocessing, lxml, socket, decruft, feedparser, pattern, warnings, chardet, yaml

TODO list:

  * better scraping of webpages; decruft works well for now but could be enhanced
  * automatically extract dates from extracted texts (good Python solutions exist for English, still lacking for French)
  * feed the db with cleaner and richer information (domain name, number of views, etc.)
  * take the charset into account when crawling webpages
  * crawl-updating process
  * automatically grab Google links to initiate a crawl
  * TBD grid-compliant code...
  * clean the code (debug mode, documentation, etc.)
  * write a comprehensive script to keep compatibility with other developments
  * monitoring and reporting (page retrieval problems, successes, distributions, etc.)
  * modular architecture: include a better information extraction process
  * avoid downloading the same content n times (md5 comparison)
  * retry downloading pages that could not be opened
  * targeted and careful crawl of each domain (only follow hypertext links with the query in the URL or in the link text)
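The md5-comparison item above could be sketched as follows with the standard library's hashlib; the class and function names are illustrative, not part of the crawler's code.

```python
import hashlib

def content_digest(html):
    """Return a stable md5 fingerprint of a page's text."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()

class DedupStore:
    """Remember digests of pages already stored, so the same content
    reached through different URLs is only downloaded and saved once."""

    def __init__(self):
        self._seen = set()

    def is_new(self, html):
        digest = content_digest(html)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True

store = DedupStore()
page = "<html><body>same content</body></html>"
print(store.is_new(page))  # first sight: True
print(store.is_new(page))  # duplicate under another URL: False
```

A real implementation would persist the digest set in the crawler's sqlite3 database rather than in memory, so deduplication survives across crawl sessions.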