TheguardianScrapper

A Scrapy web scraper that scrapes articles from theguardian.com and stores them in MongoDB.

Installation

Use the pip package manager to install the required libraries.

pip install -r requirements.txt

Usage

Before scraping, create a cluster in MongoDB Atlas and note its connection credentials. Then set them in settings.py:

MONGO_URI = 'Connection URI'
MONGO_DATABASE = 'Database Name'

Then start the spider:

scrapy crawl theguardian
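For readers unfamiliar with how Scrapy consumes such settings, here is a minimal sketch of an item pipeline that reads MONGO_URI and MONGO_DATABASE, following the pattern shown in the Scrapy documentation. It is not necessarily the pipeline shipped in this repository. The class name MongoPipeline and the collection name "articles" are assumptions made for illustration.

```python
class MongoPipeline:
    """Sketch of a Scrapy item pipeline that stores items in MongoDB."""

    # Hypothetical collection name; the repository may use a different one.
    collection_name = "articles"

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.client = None
        self.db = None

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook and exposes settings.py through crawler.settings.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DATABASE"),
        )

    def open_spider(self, spider):
        # Imported here so the class can be inspected without pymongo installed.
        import pymongo

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Store each scraped article as one MongoDB document.
        self.db[self.collection_name].insert_one(dict(item))
        return item
```

A pipeline like this only runs if it is also listed under ITEM_PIPELINES in settings.py.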

To run the server API, set the same MongoDB credentials in server.py. Then run:

env FLASK_APP=server.py flask run

API

The theguardian spider collects the following fields for each article:

Key	Type	Description
author	Array of strings	Author(s) of the article.
headline	String	Headline of the article.
content	String	The article's content (text only).
standfirst	String	The article's standfirst (text only).
label	Array of strings	The article's tags.
url	String	The article's page URL.
published_at	Date	Published date of the article.
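As an illustration of the table above, the fields can be modeled as a Python dataclass. This is only a sketch of the record shape, not the item class used by the spider. The class name Article and all example values are placeholders, not real scraped data.

```python
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import List


@dataclass
class Article:
    """Shape of one crawled article, mirroring the field table."""

    author: List[str]
    headline: str
    content: str
    standfirst: str
    label: List[str]
    url: str
    published_at: datetime


# Placeholder values for illustration only.
example = Article(
    author=["Jane Doe"],
    headline="Example headline",
    content="Example body text.",
    standfirst="Example standfirst.",
    label=["Example tag"],
    url="https://www.theguardian.com/example-article",
    published_at=datetime(2020, 1, 1, 12, 0),
)

# A plain dict like this is what would be stored as a MongoDB document.
document = asdict(example)
```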

The server API provides the following:

GET /articles

Get the list of crawled articles.

Path parameters:

Key	Type	Default value	Description
`page`	integer	1	Specify which page to query
`num_articles`	integer	5	Specify number of articles in each page

Response:

{ 
  'status' : 'success',
  'page' : 'page number',
  'num_articles_found' : 'the total number of articles queried',
  'num_articles_per_page' : 'the number of articles in each page',
  'results' : [array of items queried]
}
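A small client for this endpoint might look like the sketch below. It assumes that the server runs at Flask's default development address (http://127.0.0.1:5000), that page and num_articles are sent in the query string, and that the two count fields in the response are numeric. None of these points is confirmed by this README. The function names are hypothetical.

```python
import json
import math
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the API is served by the Flask development server on its default port.
BASE_URL = "http://127.0.0.1:5000"


def articles_url(page=1, num_articles=5, base_url=BASE_URL):
    """Build the URL for GET /articles with the documented parameters."""
    params = urlencode({"page": page, "num_articles": num_articles})
    return f"{base_url}/articles?{params}"


def fetch_articles(page=1, num_articles=5, base_url=BASE_URL):
    """Request one page of articles and decode the JSON response."""
    with urlopen(articles_url(page, num_articles, base_url)) as response:
        return json.load(response)


def total_pages(response):
    """Derive the page count from the fields in the response body."""
    found = int(response["num_articles_found"])
    per_page = int(response["num_articles_per_page"])
    return math.ceil(found / per_page)
```

With the server running, fetch_articles(page=2, num_articles=10) would return the decoded response, and total_pages applied to that response gives the number of pages available at that page size.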

GET /search/(content | headline | author)

Search for articles by keywords in the content or headline, or by author name.

Path parameters:

Key	Type	Default value	Description
`page`	integer	1	Specify which page to query
`num_articles`	integer	5	Specify number of articles in each page
`query`	string	empty	Pass a text query to search. This value should be URI encoded.

Response:

{ 
  'status' : 'success',
  'page' : 'page number',
  'num_articles_found' : 'the total number of articles queried',
  'num_articles_per_page' : 'the number of articles in each page',
  'results' : [array of items queried]
}
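Because the query value should be URI encoded, a search URL is best built with a helper rather than by hand. The sketch below assumes the same base address as Flask's development server and assumes that query, page and num_articles travel in the query string. The helper name search_url is hypothetical.

```python
from urllib.parse import quote, urlencode

# The three search targets listed in the endpoint path.
SEARCH_FIELDS = ("content", "headline", "author")


def search_url(field, query, page=1, num_articles=5,
               base_url="http://127.0.0.1:5000"):
    """Build a GET /search/<field> URL with a URI-encoded query."""
    if field not in SEARCH_FIELDS:
        raise ValueError(f"field must be one of {SEARCH_FIELDS}")
    # quote_via=quote percent-encodes spaces as %20 instead of '+'.
    params = urlencode(
        {"query": query, "page": page, "num_articles": num_articles},
        quote_via=quote,
    )
    return f"{base_url}/search/{field}?{params}"
```

For example, search_url("headline", "climate change") encodes the space in the query as %20 before the request is sent.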

Known Issues

Article content selectors need improvement.
Search regexes need improvement.

TODO

Use the Readability framework to improve the content selector.
