<a href="https://colab.research.google.com/github/jgamel/learn_n_dev/blob/python_web_scrapping/NewsCatcher_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NewsCatcher

NewsCatcher is another open-source library that can be used in DIY projects. It is a simple Python web scraping library for collecting news articles from almost any news website, and it can also return details about a news website itself. The examples and code below cover both uses.

To grab the headlines from a news website, create a Newscatcher object with the website's name (drop the http:// or https:// prefix and the www., leaving only the name and extension, such as cnn.com), then call the get_headlines() function to obtain the top headlines. Example 1 does this for cnn.com, after the Colab setup cells that follow.

Mount Drive:

In [None]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [None]:
import sys
sys.path.append('/content/gdrive/My Drive/Colab Notebooks')

### Example 1:

Get Headlines

In [None]:
from newscatcher import Newscatcher

# Pass the bare site name: no scheme and no www.
cnn = Newscatcher(website='cnn.com')

# Print each headline with its position in the list.
for index, headline in enumerate(cnn.get_headlines()):
    print(index, headline)

0 White House responds to Russia's decision to put deterrence forces on high alert
1 What these Ukrainian Americans want you to know
2 Zelensky agrees to talks Monday as Putin raises nuclear alert and West adds sanctions
3 Catch up: Here are some of the ways countries are responding
4 Berdyansk: Russians have taken control of southern Ukrainian town, mayor says
5 Casualties: Ukraine Interior Ministry says 352 civilians killed
6 Putin's lie: His justification for invasion draws outrage
7 Fighting back: See how Ukrainians are defending their country
8 Video appears to show Russian vehicles destroyed after battle
9 World's largest plane reportedly destroyed in Ukraine
10 US embassy warns Americans in Russia should consider leaving 'immediately'
11 Russian invasion runs into stiff resistance, supply lines are a 'definite vulnerability,' US officials say
12 In pictures: The world rallies in support of Ukraine
14 He won a $10 million lottery for the second time
15 A category 4 atmospheric ri
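
Since the constructor expects a bare site name such as cnn.com, a pasted URL can be cleaned up before it is handed over. The sketch below uses only the standard library; to_site_name is a hypothetical helper name chosen for illustration and is not part of newscatcher.

```python
from urllib.parse import urlparse


def to_site_name(url):
    """Reduce a URL to the bare 'name.extension' form, e.g. 'cnn.com'."""
    # urlparse only recognises the host when a scheme or '//' is present,
    # so add '//' to inputs like 'cnn.com/world'.
    if "//" not in url:
        url = "//" + url
    # .hostname is lowercased and excludes any port; it is None for empty input.
    host = urlparse(url).hostname or ""
    # Drop a leading 'www.' only; other subdomains are left alone.
    if host.startswith("www."):
        host = host[len("www."):]
    return host


site = to_site_name("https://www.cnn.com/world")
print(site)
```

The result can then be passed as the website argument, as in Newscatcher(website = site).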

### Example 2:

Get News Title, Date, & URL

In [None]:
from newscatcher import Newscatcher
import time

# The site queried here is cnbc.com, so the object is named to match.
cnbc = Newscatcher(website='cnbc.com')
results = cnbc.get_news()

# Print the title, publication date, and link of the first 10 articles.
articles = results['articles']
for count, article in enumerate(articles[:10], start=1):
    print(
        str(count) + ". " + article["title"]
        + "\n\t\t" + article["published"]
        + "\n\t\t" + article["link"]
        + "\n\n"
    )
    time.sleep(0.33)  # brief pause between printed entries

1. EU says it will fund arms deliveries to Ukraine and hit Russia with new sanctions
		Sun, 27 Feb 2022 17:54 GMT
		https://www.cnbc.com/2022/02/27/eu-says-it-will-fund-arms-deliveries-to-ukraine-and-hit-russia-with-new-sanctions.html


2. Germany announces major defense policy shift in face of Russia's Ukraine invasion
		Sun, 27 Feb 2022 11:47 GMT
		https://www.cnbc.com/2022/02/27/scholz-germany-pledges-defense-spending-increase-in-shift-in-strategy.html


3. Ukraine hospitals could run out of oxygen in 24 hours as war disrupts health services, WHO says
		Sun, 27 Feb 2022 21:08 GMT
		https://www.cnbc.com/2022/02/27/ukraine-hospitals-could-run-out-of-oxygen-supplies-in-24-hours-who-says.html


4. BP offloads its nearly 20% stake in Russia's Rosneft
		Sun, 27 Feb 2022 17:28 GMT
		https://www.cnbc.com/2022/02/27/bp-offloads-stake-in-russias-rosneft.html


5. 'Stiff Ukrainian resistance' thwarts Russian advances, inflicts casualties
		Sun, 27 Feb 2022 02:10 GMT
		https://www.cnbc.com/2022

In the code above, we used the get_news() function to get the top news from cnbc.com. The example prints only a few of the data points, but you can get all of the following for further processing:

* Title
* Link
* Authors
* Tags
* Date
* Summary
* Content
* Link for Comments
* Post_id
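
A feed may not fill in every one of these fields for every article, so reading them with a default can be safer than indexing directly. The following is a minimal sketch on a plain dictionary shaped like one entry of results['articles']; summarize_article is a hypothetical helper, and the sample values are copied from the CNBC output above.

```python
def summarize_article(article, fields=("title", "published", "link")):
    """Return the requested fields, with None for any the feed left out."""
    return {field: article.get(field) for field in fields}


# Sample entry using values from the printed output; a real entry would
# come from results['articles'].
sample = {
    "title": "BP offloads its nearly 20% stake in Russia's Rosneft",
    "published": "Sun, 27 Feb 2022 17:28 GMT",
    "link": "https://www.cnbc.com/2022/02/27/bp-offloads-stake-in-russias-rosneft.html",
}

# 'summary' is absent from the sample, so it comes back as None
# instead of raising a KeyError.
row = summarize_article(sample, fields=("title", "published", "link", "summary"))
print(row)
```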

Beyond retrieving news, you can also use the describe_url() function to get details about a website. For example, we took three news sites and obtained the following information about them:

In [None]:
from newscatcher import describe_url

websites = ['nytimes.com', 'cnbc.com', 'cnn.com']

for website in websites:
   print(describe_url(website))

{'url': 'nytimes.com', 'language': 'en', 'country': 'None', 'main_topic': 'news', 'topics': ['world', 'travel', 'tech', 'science', 'politics', 'news', 'food', 'finance', 'business']}
{'url': 'cnbc.com', 'language': 'en', 'country': 'US', 'main_topic': 'news', 'topics': ['tech', 'sport', 'news', 'finance', 'business']}
{'url': 'cnn.com', 'language': 'en', 'country': 'US', 'main_topic': 'news', 'topics': ['world', 'travel', 'tech', 'politics', 'news', 'entertainment', 'business']}


You can see that it identified the 2nd and 3rd websites (cnbc.com and cnn.com) as US-based and listed the topics for all 3. Some data points, such as the country, may not be available for every website (nytimes.com shows 'None' here), since some sites provide services worldwide.
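
Note that the missing country in the output above is the string 'None' rather than Python's None, so a plain truthiness check would treat it as a real value. Below is a small sketch of one way to normalize it; country_or_none is a hypothetical helper, and the dictionaries are trimmed copies of the printed output.

```python
def country_or_none(description):
    """Return the country code, or None when no country is reported."""
    country = description.get("country")
    # The printed output reports a missing country as the string 'None'.
    if country is None or country == "None":
        return None
    return country


# Trimmed copies of the describe_url() output shown above.
descriptions = [
    {"url": "nytimes.com", "language": "en", "country": "None"},
    {"url": "cnbc.com", "language": "en", "country": "US"},
    {"url": "cnn.com", "language": "en", "country": "US"},
]

# Keep only the sites whose country is reported as US.
us_sites = [d["url"] for d in descriptions if country_or_none(d) == "US"]
print(us_sites)
```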

In [None]:
from newscatcher import Newscatcher, describe_url

# Fetch the default feed for cnn.com.
nc = Newscatcher(website='cnn.com')
results = nc.get_news()

# results.keys()
# 'url', 'topic', 'language', 'country', 'articles'

# Get the articles
articles = results['articles']

# Pull individual fields from the first article.
first_article_summary = articles[0]['summary']
first_article_title = articles[0]['title']

# Restrict the feed to a single topic.
nc = Newscatcher(website='cnn.com', topic='politics')

results = nc.get_news()
articles = results['articles']

# List the topics available for cnn.com.
describe = describe_url('cnn.com')

print(describe['topics'])


['world', 'travel', 'tech', 'politics', 'news', 'entertainment', 'business']
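
Because the topic list differs from site to site, it can help to check that a topic is listed before requesting it. The sketch below works on a plain dictionary shaped like the describe_url() result; supported_topic is a hypothetical helper, and the topics are copied from the cnn.com output above.

```python
def supported_topic(description, wanted):
    """Return `wanted` if the site lists it among its topics, else None."""
    topics = description.get("topics") or []
    return wanted if wanted in topics else None


# Copy of the cnn.com topic list printed above.
cnn = {
    "url": "cnn.com",
    "topics": ["world", "travel", "tech", "politics", "news",
               "entertainment", "business"],
}

politics = supported_topic(cnn, "politics")  # listed for cnn.com
finance = supported_topic(cnn, "finance")    # not listed for cnn.com
print(politics, finance)
```

With the library installed, the description would come from describe_url('cnn.com'), and a topic that passes the check could then be given as the topic argument when creating the Newscatcher object.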
