Mechanical News is an application framework that scrapes and saves the full text of online news articles to a database for social science research purposes.
Mechanical News is built on top of Scrapy and Flask. You write web scrapers that retrieve news articles (using Scrapy) and store them in the database, and a RESTful API (using Flask) serves the articles back from the database.
You run Mechanical News on your own server. Users (i.e., researchers) then use an R library or Python package to access the articles in a tidy data format directly from the API, so they don't need to know anything about how Mechanical News works.
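As a rough illustration of the researcher side, the sketch below queries a REST endpoint and flattens the JSON response into one row per article. The base URL, the `/articles` route, the `site` and `limit` parameters, and the `url`/`title`/`body` field names are all assumptions made for this example; they are not taken from the Mechanical News API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical base URL; point this at wherever your own server runs.
API_BASE = "http://localhost:5000/api"


def build_articles_url(base, site=None, limit=100):
    """Build a query URL for a hypothetical /articles endpoint."""
    params = {"limit": limit}
    if site is not None:
        params["site"] = site
    return f"{base}/articles?{urlencode(params)}"


def parse_articles(payload):
    """Turn a JSON array of article objects into flat dicts, one per article."""
    rows = json.loads(payload)
    # Field names are assumed; missing fields become None.
    return [
        {"url": r.get("url"), "title": r.get("title"), "body": r.get("body")}
        for r in rows
    ]


def fetch_articles(base=API_BASE, site=None, limit=100):
    """Fetch and parse articles; needs a running server at `base`."""
    with urlopen(build_articles_url(base, site, limit)) as resp:
        return parse_articles(resp.read().decode("utf-8"))
```

A list of flat dicts like this can be passed straight to `pandas.DataFrame(...)` if a tidy table is wanted.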
- Build your own Scrapy scraper (or use an existing scraper from the library)
- Extract information from news articles
- Store full text news articles to a database
- Run in different modes:
  - Scrape articles from news sites continuously (e.g., every day)
  - Scrape articles from specific URLs
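When scraping continuously, the same article will usually be seen more than once, so the storage step has to tolerate repeats. The sketch below shows one way to do that by keying on the article URL. It uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server, and the table and column names are hypothetical, not the Mechanical News schema.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a database (SQLite stand-in) and create a hypothetical articles table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, title TEXT, body TEXT, scraped_at TEXT)"
    )
    return conn


def save_article(conn, url, title, body, scraped_at):
    """Insert an article; return True if it was new, False if the URL was already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles (url, title, body, scraped_at) "
        "VALUES (?, ?, ?, ?)",
        (url, title, body, scraped_at),
    )
    conn.commit()
    # rowcount is 1 when a row was inserted and 0 when the insert was ignored.
    return cur.rowcount == 1
```

Keeping the first stored copy is a design choice of this sketch; a real setup that tracks the date of modification may prefer to update or version the stored row instead.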
Extracted information from news articles
- article lead
- article body text
- links in article body text
- main image
- date of publication
- date of modification
- news section (e.g., World, Sports, Tech)
- type of page (e.g., text article, video, sound)
- news genre (e.g., news, sports, opinion, entertainment)
- whether the article is behind a paywall
- HTTP response headers
- metadata tags (e.g., OpenGraph, microformats)
- when the article was present on the frontpage
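To make one of the items above concrete, here is a minimal sketch of pulling OpenGraph metadata tags out of a page. It uses only the standard library's `html.parser` and is not how Mechanical News spiders are implemented (those are built on Scrapy); the class and function names are made up for this example.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> tags into a dict."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith("og:") and attr.get("content") is not None:
            # Keep the first occurrence of each property.
            self.tags.setdefault(prop, attr["content"])


def extract_opengraph(html):
    """Return a dict of OpenGraph properties found in an HTML string."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.tags
```

For a page whose head contains `<meta property="og:title" content="Hello">`, `extract_opengraph` returns a dict with the key `og:title` mapped to `Hello`.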
Overview of the architecture
Not yet available
- Python 3.6+
- MySQL 5.6+
Mechanical News has been tested on Windows 10, Red Hat 7.6, and Ubuntu 18.
Scrape all news articles from the news frontpages using all available spiders in the /spiders directory by running this from the project path:
$ python run.py --crawl
Scrape all news articles from the frontpage of a specific site (bbc is the name of the spider):
$ python run.py --crawl bbc
Scrape the news article content from a specific URL:
$ python run.py --url https://www.bbc.com/XXX
Show all spiders you have installed:
$ python run.py --list
This will list all spiders in your /spiders directory. A spider is responsible for scraping a news site.
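The command-line options above could be parsed along the following lines. This is only a sketch of the flag shapes shown in the examples (`--crawl` with an optional spider name, `--url`, `--list`), not the actual contents of run.py. Treating the three options as mutually exclusive, the `ALL_SPIDERS` sentinel, and the `describe` helper are assumptions of this example.

```python
import argparse

# Sentinel used when --crawl is given without a spider name (assumption).
ALL_SPIDERS = "*"


def build_parser():
    """Build a parser that accepts the three options shown above."""
    parser = argparse.ArgumentParser(prog="run.py")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument(
        "--crawl", nargs="?", const=ALL_SPIDERS, metavar="SPIDER",
        help="crawl frontpages with all spiders, or only with SPIDER",
    )
    group.add_argument("--url", metavar="URL", help="scrape a single article URL")
    group.add_argument("--list", action="store_true", help="list installed spiders")
    return parser


def describe(args):
    """Return a short description of the action the arguments select."""
    if args.list:
        return "list spiders"
    if args.url:
        return f"scrape {args.url}"
    if args.crawl == ALL_SPIDERS:
        return "crawl all spiders"
    return f"crawl spider {args.crawl}"
```

Using `nargs="?"` with a `const` value is what lets `--crawl` work both with and without a spider name.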
See documentation wiki.
Read how to contribute to Mechanical News by writing your own scrapers and sharing them.
- newspaper - library for automatic news article metadata extraction using heuristics. Mainly useful for English-language content and when you don't need specific metadata.
- news-please - library and system for news article metadata extraction with database and search function, also built on Scrapy and newspaper. However, you cannot specify what information you want to extract.
- Media Cloud - open data platform that allows researchers to answer quantitative questions about the content of online media. Roll your own server or use the cloud service. However, you cannot access full text due to copyright.