A Scrapy web scraper that crawls articles from theguardian.com and stores them in MongoDB.
Use the package manager pip to install the required libraries.
pip install -r requirements.txt
To start scraping, create a cluster in MongoDB Atlas and put your connection credentials in settings.py:
MONGO_URI = 'Connection URI'
MONGO_DATABASE = 'Database Name'
Then run the command:
scrapy crawl theguardian
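As an illustration of how the two settings above could be consumed, here is a minimal sketch of a Scrapy item pipeline that reads `MONGO_URI` and `MONGO_DATABASE` and writes each item to MongoDB. It follows the common Scrapy pipeline pattern and is not necessarily the pipeline shipped in this project; the class name `MongoPipeline` and the collection name `articles` are assumptions.

```python
class MongoPipeline:
    """Store each scraped item in a MongoDB collection (illustrative sketch)."""

    collection_name = "articles"  # assumed collection name, not taken from the project

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.client = None
        self.db = None

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook and exposes settings.py through crawler.settings.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DATABASE"),
        )

    def open_spider(self, spider):
        # Imported here so the module can be loaded even where pymongo is absent.
        import pymongo

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Convert the item to a plain dict before inserting it.
        self.db[self.collection_name].insert_one(dict(item))
        return item
```

A pipeline like this only runs if it is listed in `ITEM_PIPELINES` in settings.py.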
To run the server API, use the same MongoDB credentials in server.py. Then run the command:
env FLASK_APP=server.py flask run
The theguardian spider extracts the following fields from each article:
Key | Type | Description |
---|---|---|
author | Array of strings | Author(s) of the article. |
headline | String | Headline of the article. |
content | String | The article's content (text only). |
standfirst | String | The article's standfirst (text only). |
label | Array of strings | The article's tags. |
url | String | The article's page URL. |
published_at | Date | Published date of the article. |
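The field table above can be expressed as a small check that a scraped item has the documented shape. This is a sketch only: `ARTICLE_SCHEMA` and `validate_article` are hypothetical names, and it assumes the `Date` type corresponds to a Python `datetime`.

```python
from datetime import datetime

# Expected Python type for each documented key (Date is assumed to be datetime).
ARTICLE_SCHEMA = {
    "author": list,
    "headline": str,
    "content": str,
    "standfirst": str,
    "label": list,
    "url": str,
    "published_at": datetime,
}


def validate_article(item):
    """Return a list of problems; an empty list means the item matches the table."""
    problems = []
    for key, expected in ARTICLE_SCHEMA.items():
        if key not in item:
            problems.append(f"missing key: {key}")
        elif not isinstance(item[key], expected):
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(item[key]).__name__}"
            )
        elif expected is list and not all(isinstance(v, str) for v in item[key]):
            problems.append(f"{key}: every element must be a string")
    return problems
```

A check like this could run in a pipeline step before items are stored, so that malformed articles are logged rather than silently saved.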
The server API provides the following:
Get the list of crawled articles.
- Path parameters:
Key | Type | Default value | Description |
---|---|---|---|
page | integer | 1 | Specify which page to query |
num_articles | integer | 5 | Specify number of articles in each page |
- Response:
{
'status' : 'success',
'page' : 'page number',
'num_articles_found' : 'the total number of articles queried',
'num_articles_per_page' : 'the number of articles in each page',
'results' : [array of items queried]
}
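To make the pagination behaviour concrete, the following sketch builds a response of the shape shown above from an in-memory list. It is not the code in server.py: `build_page` is a hypothetical helper, and it assumes `num_articles_found` counts all matching articles rather than only those on the current page.

```python
def build_page(articles, page=1, num_articles=5):
    """Slice a list of articles into one page and wrap it in the response format."""
    # Clamp to sane minimums so page=0 or a negative size does not break slicing.
    page = max(int(page), 1)
    num_articles = max(int(num_articles), 1)
    start = (page - 1) * num_articles
    return {
        "status": "success",
        "page": page,
        "num_articles_found": len(articles),
        "num_articles_per_page": num_articles,
        "results": articles[start:start + num_articles],
    }
```

With a real MongoDB collection the same arithmetic would usually be pushed to the database, for example through a cursor's `skip` and `limit`, instead of slicing a list in memory.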
Search for articles by keywords in the content or headline, or by author name.
- Path parameters:
Key | Type | Default value | Description |
---|---|---|---|
page | integer | 1 | Specify which page to query |
num_articles | integer | 5 | Specify number of articles in each page |
query | string | empty | Pass a text query to search. This value should be URI encoded. |
- Response:
{
'status' : 'success',
'page' : 'page number',
'num_articles_found' : 'the total number of articles queried',
'num_articles_per_page' : 'the number of articles in each page',
'results' : [array of items queried]
}
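The search described above can be sketched in two parts: URI-encoding the query on the client side, and matching the decoded text against the headline, content, and author fields. Both `encode_query` and `search_articles` are hypothetical helpers operating on an in-memory list, not the regex used in server.py. The sketch assumes a case-insensitive substring match and treats an empty query as matching every article.

```python
import re
from urllib.parse import quote


def encode_query(text):
    """URI-encode a search string so it is safe to place in a request URL."""
    return quote(text, safe="")


def search_articles(articles, query):
    """Return the articles whose headline, content, or any author contains query."""
    # re.escape keeps user input from being interpreted as regex syntax.
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    matches = []
    for article in articles:
        haystack = [article.get("headline", ""), article.get("content", "")]
        haystack.extend(article.get("author", []))
        if any(pattern.search(text) for text in haystack):
            matches.append(article)
    return matches
```

For example, `encode_query("climate change & policy")` produces `climate%20change%20%26%20policy`, which keeps the `&` from being read as a parameter separator.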
- Article content selectors need improvement.
- Search regexes need improvement.
- Use the Readability framework to improve the content selector.