Yet another tiny crawler in Python, using the Bing Search API, Boilerpipe and Adblock.

Crawtext starts by crawling seeds, which can be provided by the user or fetched via the Bing Search API. It extracts the relevant content of each page using Boilerpipe. If the page contains the crawl's query, URLs are extracted from the selected content. Those that are not flagged as spam by Adblock are crawled in the next round, until the desired depth is reached.
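As a rough illustration of that loop, here is a minimal sketch in Python. It is not crawtext's actual code: matches_query and is_spam are hypothetical stand-ins for the AND/OR query matcher and the Adblock spam filter, and reading depth as "extra rounds beyond the seeds" is an assumption.

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from boilerpipe.extract import Extractor

def matches_query(text, query):
    # Hypothetical placeholder: the real matcher supports AND and OR operators.
    return any(term.strip().lower() in text.lower() for term in query.split('OR'))

def is_spam(url):
    # Hypothetical placeholder for the Adblock rules check.
    return False

def crawl(seeds, query, depth):
    results, frontier = {}, list(seeds)
    for _ in range(depth + 1):          # one round per depth level (seeds = round 0)
        next_frontier = []
        for url in frontier:
            if url in results:
                continue
            html = requests.get(url, timeout=10).text
            text = Extractor(extractor='ArticleExtractor', html=html).getText()
            if not matches_query(text, query):
                continue                # page is not pertinent: do not expand it
            links = [urljoin(url, a['href'])
                     for a in BeautifulSoup(html, 'html.parser').find_all('a', href=True)]
            outlinks = [link for link in links if not is_spam(link)]
            results[url] = {'content': text, 'outlinks': outlinks}
            next_frontier.extend(outlinks)
        frontier = next_frontier
    return results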

Crawtext saves the JSON-formatted results to a file. Each result is a pertinent crawled page with the following fields (see the reading example after the list):

  • pointers: The pages in the given dataset pointing to this page.
  • content: The content extracted from the page, as plain text.
  • outlinks: The pages in the given dataset that this page points to.
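For example, a record could be read back as below. The exact layout of results.json is an assumption here (one object per crawled URL, keyed by its address); check the file produced by your own crawl.

import json

with open('/Users/mazieres/code/crawtext/results.json') as f:
    results = json.load(f)

for url, page in results.items():
    print(url)
    print('  pointers:', page['pointers'])      # pages linking to this one
    print('  outlinks:', page['outlinks'])      # pages this one links to
    print('  content :', page['content'][:80])  # start of the extracted text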


Crawtext depends on beautifulsoup, requests and boilerpipe, all of which are available through pip.
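Assuming the pip package names are beautifulsoup4, requests and boilerpipe (the Boilerpipe wrapper also needs a Java runtime), a quick way to check that everything is installed is to import them:

# Package names are assumptions: beautifulsoup4, requests, boilerpipe.
import requests                            # HTTP fetching
from bs4 import BeautifulSoup              # HTML parsing and link extraction
from boilerpipe.extract import Extractor   # main-content extraction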


crawtext('algues vertes OR algue verte',                                   # query
         0,                                                                # depth
         '/Users/mazieres/code/crawtext/results.json',                     # absolute path to result file
         bing_account_key='============================================',  # Bing Search API key
         local_seeds='/Users/mazieres/code/crawtext/myseeds.txt')          # absolute path to local seeds

Arguments are:

  • The query that determines whether a page is pertinent or not. It supports AND and OR operators.
  • The depth, which indicates the number of rounds done by the crawler.
  • The absolute path to the result file.
  • The secret key of your Bing Search API account (a free tier is available).
  • The absolute path to your local seed URLs, one URL per line.


Fork (and pull), or use the Issue tracker.


Released under the MIT License.


Developed by @mazieres, forked from @jphcoi, both efforts being part of the Cortext project.
