karimhabush/TheguardianScrapper

TheguardianScrapper

A Scrapy web scraper that scrapes and stores articles from theguardian.com.

Installation

Use the package manager pip to install the required libraries:

pip install -r requirements.txt

Usage

To start scraping, create a cluster in MongoDB Atlas, then put your connection credentials in settings.py:

MONGO_URI = 'Connection URI'
MONGO_DATABASE = 'Database Name'
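
To keep the Atlas credentials out of version control, one option is to read them from environment variables inside settings.py. This is only a sketch: the environment variable names MONGO_URI and MONGO_DATABASE are assumptions chosen to mirror the setting names, and the fallback strings are the placeholders shown above.

```python
import os

# Hypothetical variant of the two settings above: read the values from the
# environment and fall back to the placeholders when the variables are unset.
MONGO_URI = os.environ.get("MONGO_URI", "Connection URI")
MONGO_DATABASE = os.environ.get("MONGO_DATABASE", "Database Name")
```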

Then run the command:

scrapy crawl theguardian

To run the server API, set the same MongoDB credentials in server.py, then run the command:

env FLASK_APP=server.py flask run

API

The theguardian spider extracts the following fields from each article:

Key           Type              Description
author        Array of strings  Author(s) of the article.
headline      String            Headline of the article.
content       String            The article's content (text only).
standfirst    String            The article's standfirst (text only).
label         Array of strings  The article's tags.
url           String            The article's page URL.
published_at  Date              Published date of the article.
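
As a rough illustration of the record shape described by these fields, the sketch below models one article as a Python dataclass. The class name Article, the example values, and the example URL are made up for illustration. This is not the project's actual Scrapy item definition.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime
from typing import List


@dataclass
class Article:
    """Illustrative container mirroring the crawled fields (not the project's item class)."""
    headline: str
    content: str
    standfirst: str
    url: str
    published_at: datetime
    author: List[str] = field(default_factory=list)
    label: List[str] = field(default_factory=list)


# Made-up example values, only to show the shape of one record.
example = Article(
    headline="Example headline",
    content="Example body text.",
    standfirst="Example standfirst.",
    url="https://www.theguardian.com/example",  # hypothetical URL
    published_at=datetime(2020, 1, 1),
    author=["Jane Doe"],
    label=["World news"],
)

# Plain dict with the same keys as the fields listed above.
document = asdict(example)
```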

The server API provides the following endpoints:

GET /articles

Get the list of crawled articles.

  • Path parameters:

    Key           Type     Default value  Description
    page          integer  1              Specify which page to query.
    num_articles  integer  5              Specify the number of articles on each page.

  • Response:
{ 
  'status' : 'success',
  'page' : 'page number',
  'num_articles_found' : 'the total number of articles queried',
  'num_articles_per_page' : 'the number of articles in each page',
  'results' : [array of items queried]
}
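
The paging behaviour implied by page and num_articles can be sketched in plain Python. This is a minimal illustration over an in-memory list, not the code in server.py. The function name paginate is hypothetical, and the sketch assumes num_articles_found is the total number of matching articles rather than the size of the current page.

```python
from typing import Any, Dict, List


def paginate(articles: List[Dict[str, Any]], page: int = 1,
             num_articles: int = 5) -> Dict[str, Any]:
    """Return one page of `articles` wrapped in the response envelope shown above."""
    page = max(page, 1)                  # treat page numbers below 1 as page 1
    num_articles = max(num_articles, 1)  # avoid empty or negative page sizes
    start = (page - 1) * num_articles    # index of the first item on this page
    return {
        "status": "success",
        "page": page,
        "num_articles_found": len(articles),  # assumption: total matches
        "num_articles_per_page": num_articles,
        "results": articles[start:start + num_articles],
    }
```

With 12 articles and the default page size of 5, page 3 would hold only the last two items.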

GET /search/(content | headline | author)

Search for articles by keywords in the content or headline, or by author name.

  • Path parameters:

    Key           Type     Default value  Description
    page          integer  1              Specify which page to query.
    num_articles  integer  5              Specify the number of articles on each page.
    query         string   empty          Pass a text query to search. This value should be URI encoded.

  • Response:
{ 
  'status' : 'success',
  'page' : 'page number',
  'num_articles_found' : 'the total number of articles queried',
  'num_articles_per_page' : 'the number of articles in each page',
  'results' : [array of items queried]
}
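
Because the query value should be URI encoded, a client can build the request URL with the standard library. The sketch below rests on two assumptions that may not match server.py: the server listens on Flask's default development address, and the parameters are sent in the query string. The helper name build_search_url is hypothetical.

```python
from urllib.parse import quote, urlencode

# Assumption: Flask's default development address; change it to match your server.
BASE_URL = "http://127.0.0.1:5000"


def build_search_url(field: str, query: str, page: int = 1,
                     num_articles: int = 5) -> str:
    """Build a /search URL with the text query percent-encoded."""
    if field not in ("content", "headline", "author"):
        raise ValueError("field must be 'content', 'headline' or 'author'")
    params = urlencode(
        {"page": page, "num_articles": num_articles, "query": query},
        quote_via=quote,  # encode spaces as %20 rather than '+'
    )
    return f"{BASE_URL}/search/{field}?{params}"


url = build_search_url("headline", "climate change & energy")
# url == "http://127.0.0.1:5000/search/headline?page=1&num_articles=5&query=climate%20change%20%26%20energy"
```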

Known Issues

  • Article content selectors need improvement.
  • Search regexes need improvement.

TODO

  • Use the Readability framework to improve the content selector.
