Yet another tiny crawler in Python, using the Bing Search API, Boilerpipe and Adblock.
README.md
abpy.py
crawtext.py
easylist.txt


Crawtext

Yet another tiny crawler in Python.

Crawtext starts by crawling seeds, which can be provided by the user or obtained through the Bing Search API. It extracts the relevant content of each page using Boilerpipe. If the page contains the crawl's query, URLs are extracted from the selected content. URLs that adblock does not flag as spam are crawled in the next round, until the desired depth is reached.
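The round-based loop described above can be sketched as follows. This is a minimal illustration, not the code in crawtext.py: the function name `crawl_rounds` and its callback parameters (`fetch`, `extract_links`, `is_pertinent`, `is_spam`) are hypothetical stand-ins for the Boilerpipe extraction, the query test and the adblock filter, and it assumes that a depth of 0 means only the seeds are crawled.

```python
from typing import Callable, Dict, Iterable, List, Set


def crawl_rounds(
    seeds: Iterable[str],
    depth: int,
    fetch: Callable[[str], str],                 # hypothetical: returns extracted text
    extract_links: Callable[[str], List[str]],   # hypothetical: URLs found in the text
    is_pertinent: Callable[[str], bool],         # hypothetical: does the text match the query?
    is_spam: Callable[[str], bool],              # hypothetical: adblock-style URL filter
) -> Dict[str, dict]:
    """Crawl round by round: round 0 visits the seeds, each later round the new links."""
    results: Dict[str, dict] = {}
    seen: Set[str] = set()
    frontier = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep their order
    # Assumption: depth 0 crawls the seeds only, so there are depth + 1 rounds.
    for _ in range(depth + 1):
        seen.update(frontier)
        next_frontier: List[str] = []
        for url in frontier:
            content = fetch(url)
            if not is_pertinent(content):
                continue  # links are only followed from pertinent pages
            outlinks = [u for u in extract_links(content) if not is_spam(u)]
            results[url] = {"content": content, "outlinks": outlinks}
            for link in outlinks:
                if link not in seen and link not in next_frontier:
                    next_frontier.append(link)
        frontier = next_frontier
    return results
```

Passing the fetching and filtering steps as callbacks keeps the sketch independent of any network access, so the traversal logic can be tried on an in-memory dictionary of pages.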

Crawtext saves the JSON-formatted results in a file. Each result is a pertinent crawled page with its:

  • pointers: The pages in the given dataset pointing to this page.
  • content: The extracted content from the page in text format.
  • outlinks: The pages in the given dataset that this page points to.
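As an illustration of how pointers and outlinks relate, the sketch below restricts raw links to pages that belong to the dataset and inverts them. Only the field names `pointers` and `outlinks` come from the list above; the function name `link_fields`, the choice of a dictionary keyed by URL, and the example.org URLs are assumptions, and the `content` field is left out for brevity. The actual layout of the result file may differ.

```python
import json
from typing import Dict, List


def link_fields(raw_outlinks: Dict[str, List[str]]) -> Dict[str, dict]:
    """Keep only links between pages of the dataset and invert them into pointers."""
    records = {url: {"pointers": [], "outlinks": []} for url in raw_outlinks}
    for url, links in raw_outlinks.items():
        for target in links:
            # Links leaving the dataset are dropped; repeated links are counted once.
            if target in records and target not in records[url]["outlinks"]:
                records[url]["outlinks"].append(target)
                records[target]["pointers"].append(url)
    return records


# Hypothetical two-page dataset; the third URL is outside it and gets filtered out.
raw = {
    "http://example.org/a": ["http://example.org/b", "http://example.org/elsewhere"],
    "http://example.org/b": ["http://example.org/a"],
}
print(json.dumps(link_fields(raw), indent=2))
```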

Installation

Crawtext depends on beautifulsoup, requests and boilerpipe, all of which are available through pip.

Usage

crawtext('algues vertes OR algue verte',  # query
         0,  # depth
         '/Users/mazieres/code/crawtext/results.json',  # absolute path to result file
         bing_account_key='============================================',  # Bing Search API key
         local_seeds='/Users/mazieres/code/crawtext/myseeds.txt')  # absolute path to local seeds

Arguments are:

  • The query that makes a page pertinent or not. It supports the AND and OR operators.
  • The depth, which indicates the number of rounds done by the crawler.
  • The absolute path to the result file.
  • The secret key of your Bing Search API account, available for free here.
  • The absolute path to your local seeds file, with one URL per line.
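To make the query argument more concrete, here is one possible way to test a page against a query that uses AND and OR. It is a sketch under stated assumptions rather than Crawtext's own matching code: the function name `matches_query` is hypothetical, AND is assumed to bind more tightly than OR, and terms are matched as case-insensitive substrings.

```python
import re


def matches_query(query: str, text: str) -> bool:
    """Return True if text satisfies a query made of terms joined by AND / OR."""
    text = text.lower()
    # Assumed precedence: OR separates alternatives, AND joins terms within one.
    for alternative in re.split(r"\s+OR\s+", query.strip()):
        terms = [t.strip().lower() for t in re.split(r"\s+AND\s+", alternative)]
        # Every term of the alternative must appear; empty terms never match.
        if all(term and term in text for term in terms):
            return True
    return False


print(matches_query("algues vertes OR algue verte", "Une algue verte sur la plage"))
```

With the query from the usage example, a page mentioning either "algues vertes" or "algue verte" would count as pertinent under these assumptions.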

Contribute

Fork (and pull), or use the Issue tracker.

License

Released under the MIT License.

About

Developed by @mazieres, forked from @jphcoi, both efforts being part of the Cortext project.