mashabelyi/web-news-crawler

News Archive Scraper

About

Implementation of a web crawler in Python using Scrapy. Archived content is scraped from the Internet Archive.

Supported news domains: www.cnn.com, www.foxnews.com (more coming)

Dependencies

Scrapy

pip install scrapy

Usage

python run.py --source DOMAIN --start START_DATE --end END_DATE

The crawler scrapes news content from the given DOMAIN, fetching articles that were published between START_DATE and END_DATE. It creates a data/source directory in the project root folder and saves all scraped data in that directory.

Configuration

  • source (string) must be a supported domain (cnn or foxnews)
  • start (int) e.g. 20180803 (August 3, 2018)
  • end (int) e.g. 20180804 (August 4, 2018)
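
The options above could be parsed and validated along the following lines. This is only a sketch: the actual argument handling in run.py is not shown in this README, so the helper names (`yyyymmdd`, `build_parser`, `parse_args`) and the check that start is not later than end are assumptions made for illustration.

```python
import argparse
from datetime import datetime

# Taken from the Configuration list: the two supported source names.
SUPPORTED_SOURCES = ("cnn", "foxnews")


def yyyymmdd(value: str) -> int:
    """Validate a YYYYMMDD string and return it as an int (hypothetical helper)."""
    if len(value) != 8 or not value.isdigit():
        raise argparse.ArgumentTypeError(f"expected 8 digits (YYYYMMDD), got {value!r}")
    try:
        # strptime rejects impossible dates such as month 13 or day 40.
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        raise argparse.ArgumentTypeError(f"not a valid calendar date: {value!r}")
    return int(value)


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the three documented options."""
    parser = argparse.ArgumentParser(description="Sketch of the run.py options")
    parser.add_argument("--source", required=True, choices=SUPPORTED_SOURCES)
    parser.add_argument("--start", required=True, type=yyyymmdd)
    parser.add_argument("--end", required=True, type=yyyymmdd)
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse arguments and reject a start date later than the end date (assumed rule)."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.start > args.end:
        parser.error("--start must not be later than --end")
    return args
```

Because YYYYMMDD integers sort in calendar order, a plain integer comparison is enough to check the date range.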

Output

The crawler saves scraped content in JSON Lines format - one record per line. Sample article record:

{
	"title": "Fighting intensifies in eastern Ukraine", 
	"date": 20170203, 
	"content": "At least four Ukrainian soldiers and one civilian have been killed in the last 24 hours....", 
	"topic": "world", 
	"url": "http://web.archive.org/web/20170403185905/http://www.cnn.com/2017/02/03/world/ukraine-fighting-intensifies/index.html", 
	"source": "www.cnn.com"
}
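
A minimal sketch of reading the output back in Python, assuming each record is stored on a single line (the sample above is pretty-printed for readability). The output file name is not stated in this README, so the path is left as a parameter. `read_records` and `split_wayback_url` are hypothetical helpers, and the 14-digit timestamp pattern is inferred from the sample URL only.

```python
import json
import re
from typing import Iterator, Optional, Tuple

# Assumed shape of an archived URL, based on the sample record:
# http://web.archive.org/web/<14-digit timestamp>/<original URL>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def read_records(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def split_wayback_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (snapshot_timestamp, original_url), or None if the URL does not match."""
    match = WAYBACK_RE.match(url)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

For the sample record, `split_wayback_url(record["url"])` would separate the snapshot timestamp `20170403185905` from the original cnn.com address, which is useful because the `date` field holds the publication date rather than the snapshot date.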

Speed

The crawler parses about 15-20 pages per minute.
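
As a rough planning aid, the stated rate can be turned into a run-time estimate. The function below is a hypothetical helper and assumes the 15-20 pages/minute figure holds for the whole crawl, which may not be true in practice.

```python
def estimated_minutes(pages: int, low_rate: float = 15.0, high_rate: float = 20.0):
    """Return (best_case, worst_case) crawl time in minutes for a page count."""
    # The faster rate gives the shorter time, the slower rate the longer one.
    return pages / high_rate, pages / low_rate


best, worst = estimated_minutes(300)
print(f"300 pages: between {best:.0f} and {worst:.0f} minutes")
```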
