Data Engineering Coding Challenge - News Content Collect and Store


The purpose of this coding challenge, given by United Remote, is to develop a Python solution that crawls news articles from www.bbc.com, extracts the relevant information about each story (title, author, article text, etc.), stores this data in a MongoDB database, and then makes it searchable via an API.

Specifications

The challenge is divided into 4 parts:

1. Crawl the news articles:

To crawl the news articles from the website, a crawling/scraping framework is needed. In this challenge the framework used is Scrapy, which helps us extract data from websites. We install it with the following command:

$ pip install scrapy

After installing Scrapy, we create a project with the command:

$ scrapy startproject 'project_name'
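
For orientation, the command above generates roughly the following project skeleton (the actual name depends on the project_name you pass in):

project_name/
    scrapy.cfg            # deploy/configuration file
    project_name/
        __init__.py
        items.py          # item definitions
        middlewares.py
        pipelines.py      # item pipelines (used in part 3)
        settings.py       # project settings (ITEM_PIPELINES, etc.)
        spiders/          # the web spiders live here
            __init__.py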

Once the web spider has been built under the project_name/spiders directory, we run it to start crawling the news articles from the URLs in its start_urls list:

$ scrapy crawl 'spider_name'
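
As a rough illustration (not the exact spider from this repository), a minimal BBC spider could look like the sketch below; the spider name, start URL, and the '/news/' link filter are assumptions:

import scrapy


class BBCNewsSpider(scrapy.Spider):
    # hypothetical name and start URL, for illustration only
    name = 'bbc_news'
    start_urls = ['https://www.bbc.com/news']

    def parse(self, response):
        # follow links that look like article pages
        for href in response.css('a::attr(href)').getall():
            if '/news/' in href:
                yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # field extraction with selectors is shown in part 2 below
        pass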

2. Cleanse the articles:

After scraping and extracting the news data, it must be cleaned by removing superfluous content such as advertising and HTML markup, so that only the information relevant to the news story remains, e.g. article text, author, headline, article URL, etc. For this job we can use the Readability library:

$ pip install readability

NB: In my project I did not use the Readability library; I select only the necessary data while scraping, using Scrapy selectors.
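
Continuing the spider sketch from part 1, the article callback could extract only the relevant fields with Scrapy selectors; the CSS selectors and field names below are assumptions and would need to match the actual page markup:

    def parse_article(self, response):
        # hypothetical selectors; BBC's markup changes over time
        yield {
            'title': response.css('h1::text').get(),
            'author': response.css('[class*="byline"] span::text').get(),
            'article_text': ' '.join(response.css('article p::text').getall()),
            'url': response.url,
        }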


3. Store the crawled data:

The crawled and cleaned article data must be stored in a database; the chosen database is MongoDB. But first we must activate our pipeline in the settings.py file by uncommenting the following lines:

ITEM_PIPELINES = {
   'newscrawler.pipelines.NewscrawlerPipeline': 300,
}

In this challenge I am using a local MongoDB database on Windows, with the MongoDB Compass GUI tool to visualize the stored data.
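
The activated pipeline is where the items get written to MongoDB. A minimal sketch using pymongo is shown below; the connection URI, database name ('newsdb'), and collection name ('articles') are assumptions for a local setup, not necessarily the ones used in this repository:

import pymongo


class NewscrawlerPipeline:
    def open_spider(self, spider):
        # assumes a local MongoDB instance; adjust the URI and names as needed
        self.client = pymongo.MongoClient('mongodb://localhost:27017')
        self.db = self.client['newsdb']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert each crawled article as one document
        self.db['articles'].insert_one(dict(item))
        return item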

NB: the test database is included under the project folder in both JSON and CSV formats.


4. Create API:

The last step in this challenge is to create an API that provides access to the content in the MongoDB database, so that the user can search for articles by keyword. I built this API with Flask, using Flask-PyMongo to handle communication between Flask and MongoDB.

$ pip install flask
$ pip install Flask-PyMongo
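
The endpoints below assume that an app object and a mongo object are already configured. A minimal Flask-PyMongo setup might look like the following sketch; the connection URI and database name ('newsdb') are assumptions for a local instance:

from flask import Flask, jsonify
from flask_pymongo import PyMongo

app = Flask(__name__)
# assumes the local MongoDB instance from part 3; 'newsdb' is a placeholder name
app.config['MONGO_URI'] = 'mongodb://localhost:27017/newsdb'
mongo = PyMongo(app)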

Now it's time to create the endpoints that users will call:

# show all news articles
@app.route('/viewNews', methods=['GET'])
def get_all_news():
    articles = mongo.db.articles

    articles_list = []
    for article in articles.find():
        article.pop('_id')          # ObjectId is not JSON serializable
        articles_list.append(article)

    return jsonify(articles_list)


# search for specific news using a keyword
@app.route('/search/<keyword>', methods=['GET'])
def get_news_by_keyword(keyword):
    articles = mongo.db.articles
    keyword = keyword.lower()

    articles_list = []
    for article in articles.find():
        # compare in lowercase so the search is case-insensitive
        if keyword in article['article_text'].lower() or keyword in article['title'].lower():
            article.pop('_id')
            articles_list.append(article)

    return jsonify(articles_list)

After installing Flask and configuring the endpoints, we run the Python file and start using the API to query the database.

Demo: the Flask API in action.

To test the API, you can use Advanced REST Client, Postman, or simply your favorite browser.
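
For example, assuming the Flask development server is running on its default port (5000) and 'football' is just a sample keyword, the endpoints can be exercised with curl:

$ curl http://localhost:5000/viewNews
$ curl http://localhost:5000/search/football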
