Mechanical News

Mechanical News is an application framework that scrapes and saves the full text of online news articles to a database for social science research purposes.

Mechanical News is built on top of Scrapy and Flask. You write web scrapers that retrieve news articles (using Scrapy) and store them in the database, and a RESTful API (using Flask) serves the stored articles from the database.

You run Mechanical News on your own server. The users (i.e., researchers) do not work on the server; they use an R library or Python package to access the articles in a tidy data format directly from the API. The researcher doesn't need to know anything about how Mechanical News works.
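To illustrate what "tidy data format" can mean on the researcher's side, here is a minimal sketch that flattens a JSON response into one row per article. The payload shape, the field names, and the `to_tidy_rows` helper are all assumptions made for illustration; the actual API and client packages may look different.

```python
import json
from typing import Dict, List

# Hypothetical payload: the real API's response shape may differ.
SAMPLE_RESPONSE = """
{
  "articles": [
    {"url": "https://example.com/a", "headline": "First",
     "authors": ["A. Writer", "B. Editor"], "published": "2020-01-01"},
    {"url": "https://example.com/b", "headline": "Second",
     "authors": [], "published": "2020-01-02"}
  ]
}
"""

# Assumed column set; every row gets exactly these keys.
COLUMNS = ("url", "headline", "authors", "published")


def to_tidy_rows(payload: str) -> List[Dict[str, str]]:
    """Return one row per article, with the same columns in every row."""
    articles = json.loads(payload)["articles"]
    rows = []
    for article in articles:
        row = {column: article.get(column) for column in COLUMNS}
        # Join list-valued fields so that every cell holds a single value.
        row["authors"] = "; ".join(article.get("authors") or [])
        rows.append(row)
    return rows


rows = to_tidy_rows(SAMPLE_RESPONSE)
for row in rows:
    print(row)
```

Rows of this shape can be passed directly to a data frame constructor in the analysis tool of your choice.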


  • Build your own Scrapy scraper (or use an existing scraper from the library)
  • Extract information from news articles
  • Store full text news articles to a database
  • Run in different modes:
    • Scrape articles from news sites continuously (e.g., every day)
    • Scrape articles from specific URLs
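As a rough illustration of what extracting information from an article page involves, the sketch below collects OpenGraph `<meta>` tags from an HTML document. It uses only the Python standard library rather than Scrapy, so it is not how Mechanical News itself extracts data; `OpenGraphParser` and `extract_opengraph` are hypothetical names.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> tags into a dict."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Attribute values can be None when the attribute has no value.
        name = attributes.get("property") or ""
        if name.startswith("og:") and "content" in attributes:
            self.properties[name] = attributes["content"]


def extract_opengraph(html):
    """Return all OpenGraph properties found in the given HTML string."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


# Made-up page used only to exercise the parser.
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Example headline">
<meta property="og:type" content="article">
<meta name="description" content="Not an OpenGraph tag">
</head><body><p>Body text</p></body></html>
"""

print(extract_opengraph(SAMPLE_HTML))
```

In a real spider, the same idea would typically be expressed with Scrapy's selectors on the response object.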

Extracted information from news articles

News content

  • headline
  • article lead
  • article body text
  • links in article body text
  • main image


  • authors
  • date of publication
  • date of modification
  • news section (e.g., World, Sports, Tech)
  • tags
  • categories
  • language
  • type of page (e.g., text article, video, sound)
  • news genre (e.g., news, sports, opinion, entertainment)
  • whether the article is behind a paywall
  • HTTP response headers
  • metadata tags (e.g., OpenGraph, microformats)
  • when the article was present on the frontpage
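The fields listed above could be modelled as a single record per article. The following is a sketch only: the class name, attribute names, and types are assumptions and do not reflect the actual database schema of Mechanical News. It uses `dataclasses`, which requires Python 3.7 or later.

```python
from dataclasses import asdict, dataclass, field, fields
from typing import Dict, List, Optional


@dataclass
class Article:
    """Hypothetical record mirroring the extracted fields listed above."""

    # News content
    headline: str
    lead: Optional[str] = None
    body: str = ""
    links: List[str] = field(default_factory=list)
    main_image: Optional[str] = None

    # Metadata
    authors: List[str] = field(default_factory=list)
    published: Optional[str] = None   # date of publication, e.g. ISO 8601
    modified: Optional[str] = None    # date of modification
    section: Optional[str] = None     # e.g. World, Sports, Tech
    tags: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)
    language: Optional[str] = None
    page_type: Optional[str] = None   # e.g. text article, video, sound
    genre: Optional[str] = None       # e.g. news, sports, opinion
    paywalled: Optional[bool] = None  # None when unknown
    response_headers: Dict[str, str] = field(default_factory=dict)
    meta_tags: Dict[str, str] = field(default_factory=dict)
    frontpage_seen: List[str] = field(default_factory=list)


article = Article(headline="Example headline", section="World")
print(asdict(article))
```

Keeping unknown values as `None` (rather than empty strings) makes it easier to tell "not extracted" apart from "extracted but empty" during analysis.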

Overview of the architecture

Overview of the architecture of Mechanical News.


Not yet available


  • Python 3.6+
  • MySQL 5.6+
  • Docker

Mechanical News has been tested on Windows 10, Red Hat 7.6, and Ubuntu 18.

Quick start

Scrape all news articles from the news frontpages using all available spiders in the /spiders directory by running this from the project path:

$ python --crawl

Scrape all news articles from the frontpage of a specific site (bbc is the name of the spider):

$ python --crawl bbc

Scrape the news article content from a specific URL:

$ python --url

Available spiders

Show all spiders you have installed:

$ python --list

This will list all spiders in your /spiders directory. A spider is responsible for scraping a news site.
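The three command-line options shown in this README (`--crawl` with an optional spider name, `--url`, and `--list`) could be parsed along the following lines. This is a sketch under assumptions, not the project's actual entry point: the `build_parser` function, the `"all"` placeholder value, and the help texts are invented for illustration.

```python
import argparse


def build_parser():
    """Build a parser for the options used in the quick start (hypothetical)."""
    parser = argparse.ArgumentParser(description="Scrape news articles.")
    parser.add_argument(
        "--crawl",
        nargs="?",        # the spider name is optional
        const="all",      # assumed placeholder when no spider name is given
        metavar="SPIDER",
        help="crawl frontpages with all spiders, or only with SPIDER",
    )
    parser.add_argument(
        "--url",
        metavar="URL",
        help="scrape the article content from a specific URL",
    )
    parser.add_argument(
        "--list",
        action="store_true",
        help="list all installed spiders",
    )
    return parser


parser = build_parser()
print(parser.parse_args(["--crawl"]))
print(parser.parse_args(["--crawl", "bbc"]))
print(parser.parse_args(["--url", "https://example.com/article"]))
print(parser.parse_args(["--list"]))
```

Using `nargs="?"` with a `const` value lets a single option cover both "crawl everything" and "crawl one named spider".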


See documentation wiki.


Read how to contribute to Mechanical News by writing your own scrapers and sharing them.



GNU General Public License v3.0

Similar projects

  • newspaper - library for automatic news article metadata extraction using heuristics. Mainly useful for English-language content and when you don't need specific metadata.
  • news-please - library and system for news article metadata extraction with database and search function, also built on Scrapy and newspaper. However, you cannot specify what information you want to extract.
  • Media Cloud - open data platform that allows researchers to answer quantitative questions about the content of online media. Roll your own server or use the cloud service. However, you cannot access full text due to copyright.


Web server app that crawls and saves news articles, provides article API for research