<a href="https://colab.research.google.com/github/jgamel/learn_n_dev/blob/python_web_scrapping/NewsCatcher_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NewsCatcher

NewsCatcher is an open-source Python library, handy for DIY projects, that scrapes news articles from almost any news website. It can also return details about a news website itself. The examples and code below show both uses.

To grab the headlines from a news website, create a Newscatcher object with the website's domain (drop the http:// or https:// scheme and the www. prefix, leaving just the website name and extension, such as cnn.com) and call the get_headlines() function to obtain the top headlines. Example 1 below does this for cnn.com, after the notebook's Drive setup cells.

Mount Drive:

In [2]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [3]:
import sys
sys.path.append('/content/gdrive/My Drive/Colab Notebooks')

### Example 1:

Get Headlines

In [4]:
from newscatcher import Newscatcher

# Pass the bare domain, without the scheme or "www."
cnn = Newscatcher(website='cnn.com')

# get_headlines() returns the top headlines for the website
for index, headline in enumerate(cnn.get_headlines()):
    print(index, headline)

0 40-mile Russian convoy appears to have stalled, UK defense ministry says
1 Russians struggle to understand war: 'We didn't choose this'
2 Grim assessment of latest Macron-Putin call comes as Russian forces lay siege to southern city of Mariupol
3 Watch: Ukrainian civilians stare down Russian military convoy
4 Mariupol: Russia cuts key Ukrainian city off from the world
5 Sanctions: Biden administration planning to impose new sanctions on Russian oligarchs as soon as Thursday
6 Banned: Russian and Belarusian athletes to no longer compete at Beijing 2022 Winter Paralympics
7 Opinion: Germany's military U-turn is a turning point in the history of Europe
8 Amanpour: These countries could convince Putin to stop
9 Jewish community in Ukraine reacts to Putin's Nazi rhetoric
10 Trump and right-wing lawyer part of 'criminal conspiracy,' Jan. 6 panel alleges
11 Serena Williams calls out New York Times for photo mistake
12 See 'Wheel of Fortune' moment some say is 'most painful 2 minutes' ever
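Since the text above asks for a bare domain such as cnn.com, a small helper can tidy up a pasted URL before it is handed to Newscatcher. This is only a sketch: to_bare_domain is a hypothetical name, not part of the newscatcher library, and it relies on nothing but the standard library.

```python
from urllib.parse import urlparse


def to_bare_domain(url):
    """Hypothetical helper: turn 'https://www.cnn.com/world' into 'cnn.com'."""
    candidate = url.strip().lower()
    # urlparse only recognizes the host when a "//" separator is present
    if "://" not in candidate:
        candidate = "//" + candidate
    host = urlparse(candidate).hostname or ""
    # Drop a leading "www." so only the name and extension remain
    if host.startswith("www."):
        host = host[len("www."):]
    return host


print(to_bare_domain("https://www.cnn.com/world"))
```

The result can then be passed as the website argument, for example Newscatcher(website=to_bare_domain(pasted_url)).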

### Example 2:

Get News Title, Date, & URL

In [5]:
from newscatcher import Newscatcher
import time

cnbc = Newscatcher(website='cnbc.com')
results = cnbc.get_news()

# Print the title, publication date, and link of the first 10 articles
articles = results['articles']
for count, article in enumerate(articles[:10], start=1):
    print(
        f"{count}. {article['title']}"
        f"\n\t\t{article['published']}"
        f"\n\t\t{article['link']}"
        "\n\n"
    )
    time.sleep(0.33)  # short pause between articles

1. Airline software giant ends pact with Russia's Aeroflot, crippling carrier's ability to sell seats
		Thu, 03 Mar 2022 13:33 GMT
		https://www.cnbc.com/2022/03/03/airline-software-giant-sabre-ends-service-with-russias-aeroflot-crippling-carriers-ability-to-sell-seats.html


2. Retailers start to warn of business impact from Russia's invasion of Ukraine
		Thu, 03 Mar 2022 16:15 GMT
		https://www.cnbc.com/2022/03/03/ukraine-news-retailers-start-warn-of-business-impact-from-russian-invasion.html


3. Dow falls more than 100 points, tech stocks slide as investors monitor Ukraine-Russia conflict
		Wed, 02 Mar 2022 23:07 GMT
		https://www.cnbc.com/2022/03/02/stock-market-futures-open-to-close-news.html


4. Apple and FBI grilled by lawmakers on spyware from Israeli NSO Group
		Thu, 03 Mar 2022 17:00 GMT
		https://www.cnbc.com/2022/03/03/apple-and-fbi-grilled-by-lawmakers-on-spyware-from-israeli-nso-group.html


5. Wharton's Siegel says it's a 'big policy mistake' for Fed to slow tightening

In the code above, we used the get_news() function to get the top news from cnbc.com. The example prints only a few of the data points, but each article carries all of the following for further processing:

* Title
* Link
* Authors
* Tags
* Date
* Summary
* Content
* Link for Comments
* Post_id
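For further processing, it can help to copy a few of these data points into flat records and order them by date. The sketch below works on a dictionary shaped like the get_news() result in Example 2. It assumes only the keys already used in this notebook (title, published, link, summary) and the date layout visible in the CNBC output; other feeds may format dates differently. The helper names are hypothetical, and the sample data is copied from the output above.

```python
from datetime import datetime, timezone

# Date layout as printed in the CNBC output above (an assumption for other feeds)
DATE_FORMAT = "%a, %d %b %Y %H:%M GMT"


def parse_published(value):
    """Return an aware UTC datetime, or None if the text does not match."""
    try:
        return datetime.strptime(value, DATE_FORMAT).replace(tzinfo=timezone.utc)
    except (TypeError, ValueError):
        return None


def to_records(results, fields=("title", "published", "link", "summary")):
    """Copy the chosen keys from each article, using '' for missing ones."""
    return [
        {field: article.get(field, "") for field in fields}
        for article in results.get("articles", [])
    ]


def newest_first(records):
    """Sort records by publication time; records without a date go last."""
    fallback = datetime.min.replace(tzinfo=timezone.utc)
    return sorted(
        records,
        key=lambda record: parse_published(record.get("published")) or fallback,
        reverse=True,
    )


# Sample shaped like results = cnbc.get_news(), trimmed to three articles
sample_results = {
    "articles": [
        {"title": "Airline software giant ends pact with Russia's Aeroflot, crippling carrier's ability to sell seats",
         "published": "Thu, 03 Mar 2022 13:33 GMT",
         "link": "https://www.cnbc.com/2022/03/03/airline-software-giant-sabre-ends-service-with-russias-aeroflot-crippling-carriers-ability-to-sell-seats.html"},
        {"title": "Retailers start to warn of business impact from Russia's invasion of Ukraine",
         "published": "Thu, 03 Mar 2022 16:15 GMT",
         "link": "https://www.cnbc.com/2022/03/03/ukraine-news-retailers-start-warn-of-business-impact-from-russian-invasion.html"},
        {"title": "Dow falls more than 100 points, tech stocks slide as investors monitor Ukraine-Russia conflict",
         "published": "Wed, 02 Mar 2022 23:07 GMT",
         "link": "https://www.cnbc.com/2022/03/02/stock-market-futures-open-to-close-news.html"},
    ]
}

sorted_records = newest_first(to_records(sample_results))
for record in sorted_records:
    print(record["published"], "|", record["title"])
```

Using .get() with a default keeps the loop from failing when a feed omits one of the fields.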

Those are the tools for obtaining news content. To get details about a website itself, use the describe_url function. For example, we took 3 news URLs and obtained this information about them:

In [6]:
from newscatcher import describe_url

websites = ['nytimes.com', 'cnbc.com', 'cnn.com']

for website in websites:
   print(describe_url(website))

{'url': 'nytimes.com', 'language': 'en', 'country': 'None', 'main_topic': 'news', 'topics': ['world', 'travel', 'tech', 'science', 'politics', 'news', 'food', 'finance', 'business']}
{'url': 'cnbc.com', 'language': 'en', 'country': 'US', 'main_topic': 'news', 'topics': ['tech', 'sport', 'news', 'finance', 'business']}
{'url': 'cnn.com', 'language': 'en', 'country': 'US', 'main_topic': 'news', 'topics': ['world', 'travel', 'tech', 'politics', 'news', 'entertainment', 'business']}


You can see how it identified the 2nd and 3rd websites as US-based and listed the topics for all 3. Some data points, like the country, may not be available for every website (nytimes.com reports 'None' here), since these sites provide services worldwide.
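In the output above, the country for nytimes.com comes back as the string 'None' rather than a real missing value. A minimal sketch of handling that when grouping sites by country follows. It operates on dictionaries shaped like the describe_url results shown above (topics trimmed for brevity), and both helper names are hypothetical.

```python
def country_or_none(description):
    """Map a missing country or the literal text 'None' to Python's None."""
    country = description.get("country")
    if country is None or str(country).strip().lower() in ("", "none"):
        return None
    return country


def group_by_country(descriptions):
    """Build {country: [url, ...]} from describe_url-style dictionaries."""
    groups = {}
    for description in descriptions:
        groups.setdefault(country_or_none(description), []).append(description["url"])
    return groups


# Values copied from the describe_url output above
descriptions = [
    {"url": "nytimes.com", "language": "en", "country": "None", "main_topic": "news"},
    {"url": "cnbc.com", "language": "en", "country": "US", "main_topic": "news"},
    {"url": "cnn.com", "language": "en", "country": "US", "main_topic": "news"},
]

groups = group_by_country(descriptions)
print(groups)
```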

In [None]:
from newscatcher import Newscatcher, describe_url

nc = Newscatcher(website='cnn.com')
results = nc.get_news()

# results.keys()
# 'url', 'topic', 'language', 'country', 'articles'

# Get the articles
articles = results['articles']

first_article_summary = articles[0]['summary']
first_article_title = articles[0]['title']

# Restrict the news to a single topic
nc = Newscatcher(website='cnn.com', topic='politics')

results = nc.get_news()
articles = results['articles']

# List the topics available for the website
describe = describe_url('cnn.com')

print(describe['topics'])


['world', 'travel', 'tech', 'politics', 'news', 'entertainment', 'business']
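The topic list printed above can be checked before a topic is requested. The sketch below picks the first preferred topic that a site actually lists. pick_topic is a hypothetical helper, and the sample dictionary simply repeats the cnn.com topics shown above.

```python
def pick_topic(description, preferred):
    """Return the first preferred topic the site lists, or None if none match."""
    available = description.get("topics", [])
    for topic in preferred:
        if topic in available:
            return topic
    return None


# Topics copied from the describe_url('cnn.com') output above
cnn_description = {
    "url": "cnn.com",
    "topics": ["world", "travel", "tech", "politics", "news", "entertainment", "business"],
}

chosen = pick_topic(cnn_description, ["sport", "politics", "tech"])
print(chosen)
```

When a topic is found, it can be passed along as in the cell above, for example Newscatcher(website='cnn.com', topic=chosen).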
