Web scraper to get news article content

Codementor Page

Build a simple web scraper that returns the content of a news article when given a specific URL. Real products that use similar techniques include price-tracking websites and SEO audit tools, which may scrape top search results.

Requirements

Choose one news website - see article examples below for inspiration. Given a specific article URL from the website of your choice, return the title and content of the article to the user.

Example article URLs:

https://www.nytimes.com/2020/09/02/opinion/remote-learning-coronavirus.html

https://www.washingtonpost.com/technology/2020/09/25/privacy-check-blacklight/

https://edition.cnn.com/travel/article/scenic-airport-landings-2020/index.html

https://www.reuters.com/article/us-health-coronavirus-global-deaths/global-coronavirus-deaths-pass-agonizing-milestone-of-1-million-idUSKBN26K08Y

For an extra challenge: Parse out information such as the article title, updated date, and byline to return separately to the user.
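One possible way to approach the extra challenge is to read the `<meta>` tags in the page head, since many news sites publish the title, dates, and author there. The sketch below uses only the standard library's `html.parser`. The tag names it looks for (`og:title`, `article:modified_time`, `author`) are assumptions for illustration, as are the names `MetaCollector` and `extract_metadata`; inspect the HTML of your chosen site to see which tags it actually provides.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> name/property -> content pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("property") or attrs.get("name")
        content = attrs.get("content")
        if key and content:
            # Keep the first value seen for each key.
            self.meta.setdefault(key, content)


def extract_metadata(html):
    """Return title, updated date, and byline; None where a tag is missing."""
    collector = MetaCollector()
    collector.feed(html)
    meta = collector.meta
    return {
        # Assumed tag names; real sites may use different ones.
        "title": meta.get("og:title"),
        "updated": meta.get("article:modified_time"),
        "byline": meta.get("author"),
    }


# A made-up page head, used here instead of a live request.
sample = """
<html><head>
<meta property="og:title" content="Example headline">
<meta property="article:modified_time" content="2020-09-02T10:00:00Z">
<meta name="author" content="Jane Doe">
</head><body></body></html>
"""
print(extract_metadata(sample))
```

Returning `None` for missing fields lets the caller decide how to present an article that has, say, no byline.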

Suggested Implementation

You can expose the scraper as a command-line script that takes the article URL as its argument, for example:

> python scrape_newyorktimes.py news_url
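A minimal sketch of how such a script might read and check its `news_url` argument, using `argparse` and `urllib.parse` from the standard library. The function names `parse_args` and `main` and the printed message are illustrative assumptions, not part of the original project.

```python
import argparse
from urllib.parse import urlparse


def parse_args(argv=None):
    """Parse the command line and reject anything that is not an http(s) URL."""
    parser = argparse.ArgumentParser(
        description="Print the title and content of a news article."
    )
    parser.add_argument("news_url", help="URL of the article to scrape")
    args = parser.parse_args(argv)

    parts = urlparse(args.news_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        # parser.error prints a usage message and exits with status 2.
        parser.error(f"not a valid http(s) URL: {args.news_url}")
    return args


def main(argv=None):
    args = parse_args(argv)
    # Placeholder for the real work: fetch the page and parse it.
    print(f"Would scrape: {args.news_url}")
    return 0


# Demonstration with an explicit argument list. In a real script, call
# main() with no arguments so that the values come from sys.argv.
main(["https://www.nytimes.com/2020/09/02/opinion/remote-learning-coronavirus.html"])
```

Validating the URL up front gives the user a clear usage error instead of a network traceback later on.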

We suggest using an HTTP library like Requests to fetch the raw HTML of the URL, then a parsing library like Beautiful Soup to extract the content. Alternatively, you can use a Python scraping framework like Scrapy.
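The fetch-then-parse flow can be sketched as follows. To keep the example runnable without installing anything, it uses the standard library (`urllib.request` and `html.parser`) as a stand-in for Requests and Beautiful Soup; the structure carries over to those libraries. It assumes the headline is in the first `<h1>` and the body text is in `<p>` elements, which will not hold for every site. The names `ArticleParser`, `parse_article`, and `fetch_html` are illustrative.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class ArticleParser(HTMLParser):
    """Collect the text of the first <h1> and of every <p> element."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.paragraphs = []
        self._tag = None      # tag currently being captured, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._tag is None and tag in ("h1", "p"):
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        # Join the captured pieces and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if tag == "h1" and self.title is None:
            self.title = text
        elif tag == "p" and text:
            self.paragraphs.append(text)
        self._tag = None


def parse_article(html):
    """Return (title, content) extracted from an HTML string."""
    parser = ArticleParser()
    parser.feed(html)
    return parser.title, "\n\n".join(parser.paragraphs)


def fetch_html(url):
    """Download a page; some sites refuse requests without a User-Agent."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Parsing is demonstrated on a made-up snippet so no network access is needed.
sample = (
    "<html><body><h1>Example <em>headline</em></h1>"
    "<p>First paragraph.</p><p>Second\n paragraph.</p></body></html>"
)
title, content = parse_article(sample)
print(title)
print(content)
```

For a live page you would call `parse_article(fetch_html(url))`. Keeping fetching and parsing in separate functions makes the parser easy to test against saved HTML files.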

References

You can use XPath to select elements when the element has no class or enclosing div to target
Take note of the Python version you have installed! (reference)
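To illustrate the XPath note above, here is a small sketch that selects elements purely by their position in the tree, with no class or id attributes. It uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath and requires well-formed markup, so the snippet is a made-up, well-formed fragment. For real pages, Scrapy's selectors or lxml offer fuller XPath support.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed fragment with no class or id attributes.
snippet = """
<article>
  <header><h1>Example headline</h1></header>
  <section>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </section>
  <footer><p>Related links</p></footer>
</article>
""".strip()

root = ET.fromstring(snippet)

# Path-based selection: follow the structure instead of attributes.
title = root.find("./header/h1").text

# Only the paragraphs inside <section>, not the one in <footer>.
body = [p.text for p in root.findall("./section/p")]

# Positional predicate: XPath positions are 1-based.
second = root.find("./section/p[2]").text

print(title)
print(body)
print(second)
```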

Installation

# run scrapy 
> scrapy runspider news.py 

# create a csv file 
> scrapy runspider news.py -o nyt.csv

hulyak/nyt-web-scraping
