Data Engineering Coding Challenge - News Content Collect and Store


The purpose of this coding challenge, given by United Remote, is to develop a Python solution that crawls news articles from www.bbc.com, extracts the relevant information about each story (title, author, article text, etc.), stores this data in a MongoDB database, and then makes it searchable via an API.

Specifications

The challenge is divided into 4 parts:

1. Crawl the news articles:

To crawl the news articles from the website, a crawling/scraping framework is needed. In this challenge the framework used is Scrapy, which helps us extract data from websites. We install it with the following command:

$ pip install scrapy

After installing Scrapy, we create a project with the command:

$ scrapy startproject 'project_name'
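
For orientation, the command above generates roughly the following project skeleton (the actual name depends on the project_name you pass in):

project_name/
    scrapy.cfg            # deploy/configuration file
    project_name/
        __init__.py
        items.py          # item definitions
        middlewares.py
        pipelines.py      # item pipelines (used in part 3)
        settings.py       # project settings (ITEM_PIPELINES, etc.)
        spiders/          # the web spiders live here
            __init__.py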

Once the web spider has been built under the project_name/spiders directory, we run it to start crawling the news articles from the URLs in its start_urls list:

$ scrapy crawl 'spider_name'
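
As a rough illustration (not the exact spider from this repository), a minimal BBC spider could look like the sketch below; the spider name, start URL, and the '/news/' link filter are assumptions:

import scrapy


class BBCNewsSpider(scrapy.Spider):
    # hypothetical name and start URL, for illustration only
    name = 'bbc_news'
    start_urls = ['https://www.bbc.com/news']

    def parse(self, response):
        # follow links that look like article pages
        for href in response.css('a::attr(href)').getall():
            if '/news/' in href:
                yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # field extraction with selectors is shown in part 2 below
        pass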

2. Cleanse the articles:

After scraping and extracting the news data, it must be cleaned by removing superfluous content such as advertising and HTML markup, so that only the information relevant to the news story remains, e.g. article text, author, headline, article URL, etc. For this job we can use the Readability library:

$ pip install readability

NB: In my project I did not use the Readability library; I select only the necessary data while scraping, using Scrapy selectors.
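
Continuing the spider sketch from part 1, the article callback could extract only the relevant fields with Scrapy selectors; the CSS selectors and field names below are assumptions and would need to match the actual page markup:

    def parse_article(self, response):
        # hypothetical selectors; BBC's markup changes over time
        yield {
            'title': response.css('h1::text').get(),
            'author': response.css('[class*="byline"] span::text').get(),
            'article_text': ' '.join(response.css('article p::text').getall()),
            'url': response.url,
        }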


3. Store the crawled data:

The crawled and cleaned article data must be stored in a database; the chosen database is MongoDB. But first we must activate our pipeline in the settings.py file by uncommenting the following lines:

ITEM_PIPELINES = {
   'newscrawler.pipelines.NewscrawlerPipeline': 300,
}

In this challenge I am using a local MongoDB database on Windows, with the MongoDB Compass GUI tool to visualize the stored data.
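
The activated pipeline is where the items get written to MongoDB. A minimal sketch using pymongo is shown below; the connection URI, database name ('newsdb'), and collection name ('articles') are assumptions for a local setup, not necessarily the ones used in this repository:

import pymongo


class NewscrawlerPipeline:
    def open_spider(self, spider):
        # assumes a local MongoDB instance; adjust the URI and names as needed
        self.client = pymongo.MongoClient('mongodb://localhost:27017')
        self.db = self.client['newsdb']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert each crawled article as one document
        self.db['articles'].insert_one(dict(item))
        return item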

NB: the test database is included under the project folder in both JSON and CSV formats.


4. Create API:

The last step in this challenge is to create an API that provides access to the content in the MongoDB database, so that the user can search for articles by keyword. I built this API with Flask, using Flask-PyMongo to handle communication between Flask and MongoDB.

$ pip install flask
$ pip install Flask-PyMongo
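
The endpoints below assume that an app object and a mongo object are already configured. A minimal Flask-PyMongo setup might look like the following sketch; the connection URI and database name ('newsdb') are assumptions for a local instance:

from flask import Flask, jsonify
from flask_pymongo import PyMongo

app = Flask(__name__)
# assumes the local MongoDB instance from part 3; 'newsdb' is a placeholder name
app.config['MONGO_URI'] = 'mongodb://localhost:27017/newsdb'
mongo = PyMongo(app)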

Now it's time to create the endpoints that users will call:

# show all news articles
@app.route('/viewNews', methods=['GET'])
def get_all_news():
    articles = mongo.db.articles

    articles_list = []
    for article in articles.find():
        article.pop('_id')          # ObjectId is not JSON serializable
        articles_list.append(article)

    return jsonify(articles_list)


# search for specific news using a keyword
@app.route('/search/<keyword>', methods=['GET'])
def get_news_by_keyword(keyword):
    articles = mongo.db.articles
    keyword = keyword.lower()

    articles_list = []
    for article in articles.find():
        # compare in lowercase so the search is case-insensitive
        if keyword in article['article_text'].lower() or keyword in article['title'].lower():
            article.pop('_id')
            articles_list.append(article)

    return jsonify(articles_list)

After installing Flask and configuring the endpoints, we run the Python file and start using the API to query the database.

Demo: the Flask API in action.

To test the API, you can use Advanced REST Client, Postman, or simply your favorite browser.
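
For example, assuming the Flask development server is running on its default port (5000) and 'football' is just a sample keyword, the endpoints can be exercised with curl:

$ curl http://localhost:5000/viewNews
$ curl http://localhost:5000/search/football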
