<a href="https://colab.research.google.com/github/jgamel/learn_n_dev/blob/python_web_scrapping/NewsCatcher_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NewsCatcher

NewsCatcher is another open-source library created by our team that can be used in DIY projects. It is a simple Python web scraping library for collecting news articles from almost any news website, and it can also return details about the website itself. The examples below show both uses.

To grab the headlines from a news website, create a `Newscatcher` object and pass it the website's bare domain: drop the `http://` or `https://` scheme and the `www.` prefix, leaving only the site name and extension, as in `mediamatters.org`. Then call `get_headlines()` to obtain the top headlines. The first two cells below mount Google Drive and add the notebook folder to the import path; the third fetches and prints the headlines:

In [2]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [3]:
import sys
sys.path.append('/content/gdrive/My Drive/Colab Notebooks')

In [4]:
from newscatcher import Newscatcher

mm = Newscatcher(website='mediamatters.org')

# Print each top headline with its position in the feed
for index, headline in enumerate(mm.get_headlines()):
    print(index, headline)

0 On MSNBC, Ex-GOP Staffer Details How The Purpose Of Voter ID Laws Is To "Take People's Constitutional Rights Away"
1 Trump Adviser To MSNBC's Chris Hayes: "I Admire The Passion" In Stone's Threat To Disclose Delegate's Hotel Rooms
2 Bill O'Reilly Says Ted Cruz Is Right About New York City Values
3 How The Wash. Post Kicked Off The "Qualified" Argument Between Clinton And Sanders
4 Fox's Bolling Criticizes Paid Family Leave: "Socialism Is Spreading Across America"
5 Latest CMP Video Deceptively Attacks Planned Parenthood's Patient Consent Process
6 A Look Back At Fox News' Interviews With Obama Ahead Of His Sunday Network Appearance
7 Fox Host Claims That America Has "Helped Enough" With Syrian Refugee Crisis
8 Wash. Post Reporter: "It's Pretty Remarkable" That Cable News Channels Have Shunned Trump Ally Roger Stone
9 Mississippi Newspapers Criticize State Legislature For Passing Anti-LGBT Law
10 Charles Barkley: "The NBA Should Move The All-Star Game From Charlotte" Due To Anti-LGBT 
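
Since `Newscatcher` is given a bare domain such as `mediamatters.org`, it can help to normalize full URLs before passing them in. The helper below is a minimal sketch using only the standard library; `to_bare_domain` is a hypothetical name and not part of the newscatcher package, and it only strips the scheme, path, and a leading `www.`.

```python
from urllib.parse import urlparse


def to_bare_domain(url: str) -> str:
    """Reduce a URL to a bare domain such as 'nytimes.com' (hypothetical helper)."""
    url = url.strip()
    # urlparse only fills in the host when a scheme or a leading '//' is present
    if "//" not in url:
        url = "//" + url
    host = urlparse(url).hostname or ""
    # Drop a leading 'www.' but leave any other subdomain untouched
    if host.startswith("www."):
        host = host[len("www."):]
    return host


print(to_bare_domain("https://www.nytimes.com/section/world"))  # nytimes.com
print(to_bare_domain("www.mediamatters.org"))                   # mediamatters.org
print(to_bare_domain("libertaegiustizia.it"))                   # libertaegiustizia.it
```

The result could then be passed as the `website` argument, for example `Newscatcher(website=to_bare_domain(url))`.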

In [5]:
from newscatcher import Newscatcher
import time

nyt = Newscatcher(website='nytimes.com')
results = nyt.get_news()

# Print the title, publication date, and link of the first 10 articles
articles = results['articles']
for count, article in enumerate(articles[:10], start=1):
    print(
        f"{count}. {article['title']}"
        f"\n\t\t{article['published']}"
        f"\n\t\t{article['link']}"
        "\n\n"
    )
    time.sleep(0.33)  # brief pause between prints

1. Putin Sends Mixed Signals on His Willingness to Negotiate
		Fri, 25 Feb 2022 17:56:36 +0000
		https://www.nytimes.com/live/2022/02/25/world/russia-ukraine-war


2. After a vicious battle for Kharkiv: wreckage, a stuck rocket and artillery booms.
		Fri, 25 Feb 2022 18:21:18 +0000
		https://www.nytimes.com/2022/02/25/world/europe/kharkiv-ukraine-military.html


3. Russia's Invasion of Ukraine Tests China's 'Sovereignty' Rhetoric
		Fri, 25 Feb 2022 18:08:00 +0000
		https://www.nytimes.com/2022/02/25/world/asia/china-russia-ukraine-sovereignty.html


4. Russia will limit access to Facebook, a major platform for dissent.
		Fri, 25 Feb 2022 17:39:03 +0000
		https://www.nytimes.com/2022/02/25/world/europe/russia-facebook.html


5. ‘I’ll Stand on the Side of Russia’: Pro-Putin Sentiment Spreads Online
		Fri, 25 Feb 2022 18:00:48 +0000
		https://www.nytimes.com/2022/02/25/technology/pro-russia-pro-putin-sentiment-spreads-online.html


6. European Leaders Agree to a Second Wave of Russia Sanc

In the code above, we used the `get_news()` function to get the top news from nytimes.com. The example prints only a few data points, but you can access all of the following for further processing:

* Title
* Link
* Authors
* Tags
* Date
* Summary
* Content
* Link for Comments
* Post_id
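
As a rough sketch of that further processing, the snippet below picks a few fields from each article and sorts articles by publication date. It runs on a small stand-in for `results['articles']` built from the output printed earlier, so it needs no network access. The helper names `article_row` and `published_at` are hypothetical, and the assumption that a field may be missing from some articles (hence the use of `.get`) is ours rather than something the library documents here.

```python
from email.utils import parsedate_to_datetime


def article_row(article, fields=("title", "published", "link", "summary")):
    """Select fields from one article dict; absent fields become None."""
    return {field: article.get(field) for field in fields}


def published_at(article):
    """Parse a date string like 'Fri, 25 Feb 2022 17:56:36 +0000', or return None."""
    raw = article.get("published")
    return parsedate_to_datetime(raw) if raw else None


# Stand-in for results['articles'], copied from the output shown earlier
sample_articles = [
    {
        "title": "Putin Sends Mixed Signals on His Willingness to Negotiate",
        "published": "Fri, 25 Feb 2022 17:56:36 +0000",
        "link": "https://www.nytimes.com/live/2022/02/25/world/russia-ukraine-war",
    },
    {
        "title": "After a vicious battle for Kharkiv: wreckage, a stuck rocket and artillery booms.",
        "published": "Fri, 25 Feb 2022 18:21:18 +0000",
        "link": "https://www.nytimes.com/2022/02/25/world/europe/kharkiv-ukraine-military.html",
    },
]

rows = [article_row(article) for article in sample_articles]

# Keep only articles with a usable date, then sort newest first
dated = [article for article in sample_articles if published_at(article)]
newest_first = sorted(dated, key=published_at, reverse=True)

print(rows[0]["summary"])        # None, because the stand-in has no summary
print(newest_first[0]["title"])  # the Kharkiv article, published later
```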

Those are the tools for obtaining news content. You can also use the `describe_url` function to get details about a website. For example, we took three news websites and obtained the following information about them:

In [6]:
from newscatcher import describe_url

websites = ['nytimes.com', 'cronachediordinariorazzismo.org', 'libertaegiustizia.it']

# Print the metadata dictionary for each website
for website in websites:
    print(describe_url(website))

{'url': 'nytimes.com', 'language': 'en', 'country': 'None', 'main_topic': 'news', 'topics': ['world', 'travel', 'tech', 'science', 'politics', 'news', 'food', 'finance', 'business']}
{'url': 'cronachediordinariorazzismo.org', 'language': 'it', 'country': 'None', 'main_topic': 'politics', 'topics': ['politics']}
{'url': 'libertaegiustizia.it', 'language': 'it', 'country': 'None', 'main_topic': 'politics', 'topics': ['politics']}


Notice that it identified the second and third websites as Italian-language and listed the topics for all three. Some data points, such as the country, may not be available for every website, since many of them serve readers worldwide.
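
In the output above, a missing country appears as the string `'None'` rather than Python's `None`, which is easy to trip over in later checks. The sketch below works on the three dictionaries printed above: it replaces that placeholder string and groups the websites by language. Both `clean_description` and `group_by_language` are hypothetical helpers, not newscatcher functions.

```python
from collections import defaultdict


def clean_description(description):
    """Return a copy in which the placeholder string 'None' becomes a real None."""
    return {
        key: (None if value == "None" else value)
        for key, value in description.items()
    }


def group_by_language(descriptions):
    """Map each language code to the list of website URLs that use it."""
    grouped = defaultdict(list)
    for description in descriptions:
        grouped[description["language"]].append(description["url"])
    return dict(grouped)


# The three dictionaries printed by describe_url above
descriptions = [
    {'url': 'nytimes.com', 'language': 'en', 'country': 'None', 'main_topic': 'news',
     'topics': ['world', 'travel', 'tech', 'science', 'politics', 'news', 'food',
                'finance', 'business']},
    {'url': 'cronachediordinariorazzismo.org', 'language': 'it', 'country': 'None',
     'main_topic': 'politics', 'topics': ['politics']},
    {'url': 'libertaegiustizia.it', 'language': 'it', 'country': 'None',
     'main_topic': 'politics', 'topics': ['politics']},
]

cleaned = [clean_description(d) for d in descriptions]
print(cleaned[0]["country"])       # None
print(group_by_language(cleaned))
```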

In [11]:
from newscatcher import Newscatcher, describe_url

nc = Newscatcher(website='nytimes.com')
results = nc.get_news()

# results.keys()
# 'url', 'topic', 'language', 'country', 'articles'

# Get the articles
articles = results['articles']

# Each article is a dictionary; read individual fields by key
first_article_summary = articles[0]['summary']
first_article_title = articles[0]['title']

# Restrict the results to a single topic
nc = Newscatcher(website='nytimes.com', topic='politics')

results = nc.get_news()
articles = results['articles']

# List the topics available for a website
describe = describe_url('nytimes.com')

print(describe['topics'])

['world', 'travel', 'tech', 'science', 'politics', 'news', 'food', 'finance', 'business']
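
The topic list can be used to check a topic before requesting it. The sketch below works on the `describe_url` dictionary shown earlier for nytimes.com and falls back to the site's `main_topic` when the preferred topic is not listed. `choose_topic` is a hypothetical helper, and the fallback rule is an assumption made for illustration, not library behavior.

```python
def choose_topic(description, preferred):
    """Return `preferred` if the site lists it, otherwise the site's main topic."""
    topics = description.get("topics") or []
    if preferred in topics:
        return preferred
    return description.get("main_topic")


# Dictionary printed by describe_url('nytimes.com') earlier in this notebook
nyt_description = {
    'url': 'nytimes.com', 'language': 'en', 'country': 'None', 'main_topic': 'news',
    'topics': ['world', 'travel', 'tech', 'science', 'politics', 'news', 'food',
               'finance', 'business'],
}

print(choose_topic(nyt_description, "politics"))  # politics
print(choose_topic(nyt_description, "sports"))    # news
```

The returned value could then be passed as the `topic` argument, in the same way the cell above passes `'politics'`.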
