Fetching latest commit…
Cannot retrieve the latest commit at this time.
|Failed to load latest commit information.|
python crawl_trial.py will launch the crawl accroding to parameters declared in crawl_parameters.yml file or any yaml file declared as an argument of crawl_trial.py. The crawler uses the seed sites found in the list of files of a given repertory (path) as well as a query that will be used to validate new webpages (query) found during the crawling process. inlinks_min define the minimum number of citation a page should accumulate before being considered as a candidate to enter the corpus. depth parameter defines the number of steps of corpus extension performed from the initial corpus made of seeds webpages. required modules: urllib2,BeautifulSoup, urlparse, sqlite3, pyparsing,urllib, random, multiprocessing,lxml,socket, decruft, feedparser, pattern,warnings, chardet, yaml TODO list: * better scrapping of webpages, for the moment decruft is doing great but can be enhanced (http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/) * automatically extract dates from extracted texts (good python solutions in english, still lacking in french) * feed the db with cleaner and richer information (like domain name, number of views, etc) * take into account the charset when crawling webpages * crawl updating process * automatic grab google links to initiate a crawl * TBD grid-compliant code... * clean the code (debug mode, documentation, etc.) * write a comprehensive post_processing.py script to keep compatibility with other developments. * monitoring and reporting (page retrieval problems, success, distributions, etc) * modular architecture : include a better information extraction process. * avoid downloading n times the same content. (md5 comparison) * retry to download pages that could not be opened. * targeted and careful crawl of each domain (only follow hypertext links with the ~query in the url or in the linkText)