<a href="https://colab.research.google.com/github/jgamel/learn_n_dev/blob/python_web_scrapping/NewsCatcher_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NewsCatcher

NewsCatcher is an open-source Python library that is well suited to DIY projects. It is a simple web scraping library for collecting news articles from almost any news website, and it can also report details about the website itself. The examples below show both uses.

To grab the headlines from a news website, create a Newscatcher object with the website's domain (drop the http:// or https:// scheme and the www. prefix, and pass only the name and extension, for example cnn.com), then call get_headlines() to obtain the top headlines from the website. Start by installing the library:

In [2]:
# Install newscatcher
!pip install newscatcher

Collecting newscatcher
  Downloading newscatcher-0.2.0-py3-none-any.whl (138 kB)
[?25l[K     |██▍                             | 10 kB 21.6 MB/s eta 0:00:01[K     |████▊                           | 20 kB 16.2 MB/s eta 0:00:01[K     |███████                         | 30 kB 11.1 MB/s eta 0:00:01[K     |█████████▌                      | 40 kB 9.3 MB/s eta 0:00:01[K     |███████████▉                    | 51 kB 4.6 MB/s eta 0:00:01[K     |██████████████▏                 | 61 kB 5.4 MB/s eta 0:00:01[K     |████████████████▌               | 71 kB 5.5 MB/s eta 0:00:01[K     |███████████████████             | 81 kB 4.3 MB/s eta 0:00:01[K     |█████████████████████▎          | 92 kB 4.7 MB/s eta 0:00:01[K     |███████████████████████▋        | 102 kB 5.2 MB/s eta 0:00:01[K     |██████████████████████████      | 112 kB 5.2 MB/s eta 0:00:01[K     |████████████████████████████▍   | 122 kB 5.2 MB/s eta 0:00:01[K     |██████████████████████████████▊ | 133 kB 5.2 MB/s eta 0:0

### Example 1: Get Headlines

In [3]:
from newscatcher import Newscatcher

# Pass the bare domain: no scheme and no "www." prefix
cnn = Newscatcher(website='cnn.com')

# Print each top headline with its position in the list
for index, headline in enumerate(cnn.get_headlines()):
    print(index, headline)

0 Ukrainian presidential advisor says Russian troops who entered the plant were pushed back. Deputy commander says 'fierce bloody combat is ongoing.'
1 'You have to fight for your life': Ukrainian newlywed who lost legs in blast hopes to walk down the aisle with prosthetics soon
2 Key battle: Soldiers explain how they fought off Russian forces
3 Farm thefts: Russians steal vast amounts of Ukrainian grain and equipment, threatening this year's harvest
4 Moscow: Russia expels Danish embassy employees
5 Retired US major general: What it will take for Ukrainians to win
6 'Putin's altar boy': Pope warns pro-war Russian patriarch
7 Opinion: World's luck may run out as talk of nuclear war escalates
8 New audio: McCarthy said 25th Amendment 'takes too long' and wanted to reach out to Biden
9 Full Covid death toll is nearly three times higher than reported, WHO data reveals
10 How to take advantage of rising interest rates
11 Mortgage rates hit highest level since 2009
12 Biden administration l
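
Since the website argument is expected as a bare domain, a small helper can normalize whatever URL you start from. This is a minimal sketch using only the standard library; to_bare_domain is a hypothetical name that is not part of newscatcher, and the library may already perform similar cleaning internally.

```python
from urllib.parse import urlparse


def to_bare_domain(url: str) -> str:
    """Hypothetical helper: strip scheme, 'www.' prefix, port, and path."""
    candidate = url.strip()
    # urlparse only recognises the host when a scheme or '//' prefix is present
    if "//" not in candidate:
        candidate = "//" + candidate
    host = urlparse(candidate).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host


print(to_bare_domain("https://www.cnn.com/world"))  # cnn.com
```

The result can then be passed as the website argument, for example Newscatcher(website=to_bare_domain(url)).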

### Example 2: Get News Title, Date, & URL

In [4]:
from newscatcher import Newscatcher

cnbc = Newscatcher(website='cnbc.com')
results = cnbc.get_news()

# Print the title, publication date, and link of the first 10 articles
articles = results['articles']
for count, article in enumerate(articles[:10], start=1):
    print(
        f"{count}. {article['title']}"
        f"\n\t\t{article['published']}"
        f"\n\t\t{article['link']}"
        "\n\n"
    )

1. Stocks extend losses, with Dow dropping 800 points, as post-Fed rally evaporates
		Wed, 04 May 2022 22:06 GMT
		https://www.cnbc.com/2022/05/04/stock-market-futures-open-to-close-news.html


2. Elon Musk expected to serve as temporary Twitter CEO after deal closes
		Thu, 05 May 2022 13:13 GMT
		https://www.cnbc.com/2022/05/05/elon-musk-expected-to-serve-as-temporary-twitter-ceo-after-deal-closes.html


3. Worker output fell 7.5% in the first quarter, the biggest decline since 1947
		Thu, 05 May 2022 12:33 GMT
		https://www.cnbc.com/2022/05/05/labor-productivity-fell-7point5percent-in-the-first-quarter-the-fastest-rate-since-1947.html


4. Feds propose first major revamp to fair housing rules since 1995
		Thu, 05 May 2022 13:53 GMT
		https://www.cnbc.com/2022/05/05/feds-propose-first-major-revamp-to-fair-housing-rules-since-1995.html


5. Here's what's next for stocks after the Fed's latest guidance on future rate hikes
		Thu, 05 May 2022 13:04 GMT
		https://www.cnbc.com/2022/05/05/h

In the code above, we used the get_news() function to get the top news from cnbc.com. The example prints only a few of the data points, but each article carries all of the following for further processing:

* Title
* Link
* Authors
* Tags
* Date
* Summary
* Content
* Link for Comments
* Post_id
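
Not every article is guaranteed to carry every field, so it can help to read them defensively. The sketch below assumes each article behaves like a dictionary (as the indexing in Example 2 suggests) and that the date uses the format seen in the output above. The names summarize_article and parse_published are hypothetical, and the sample dictionary reuses values from the Example 2 output rather than a live call.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Any, Mapping, Optional


def parse_published(value: Optional[str]) -> Optional[datetime]:
    """Parse a date such as 'Wed, 04 May 2022 22:06 GMT'; None if unparsable."""
    if not value:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None


def summarize_article(article: Mapping[str, Any]) -> dict:
    """Hypothetical helper: pull a few fields, tolerating missing keys."""
    return {
        "title": article.get("title", "(no title)"),
        "link": article.get("link", ""),
        "published": parse_published(article.get("published")),
    }


# Sample data copied from the Example 2 output, not fetched live
sample = {
    "title": "Stocks extend losses, with Dow dropping 800 points, "
             "as post-Fed rally evaporates",
    "published": "Wed, 04 May 2022 22:06 GMT",
    "link": "https://www.cnbc.com/2022/05/04/"
            "stock-market-futures-open-to-close-news.html",
}
summary = summarize_article(sample)
print(summary["title"], summary["published"])
```

Converting the date to a datetime object makes it straightforward to sort or filter articles by publication time.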

Beyond retrieving news, you can use the describe_url() function to get details about a website. For example, here is the information it returns for three news sites:

In [5]:
from newscatcher import describe_url

websites = ['nytimes.com', 'cnbc.com', 'cnn.com']

for website in websites:
   print(describe_url(website))

{'url': 'nytimes.com', 'language': 'en', 'country': 'None', 'main_topic': 'news', 'topics': ['world', 'travel', 'tech', 'science', 'politics', 'news', 'food', 'finance', 'business']}
{'url': 'cnbc.com', 'language': 'en', 'country': 'US', 'main_topic': 'news', 'topics': ['tech', 'sport', 'news', 'finance', 'business']}
{'url': 'cnn.com', 'language': 'en', 'country': 'US', 'main_topic': 'news', 'topics': ['world', 'travel', 'tech', 'politics', 'news', 'entertainment', 'business']}


The output identifies the 2nd and 3rd websites (cnbc.com and cnn.com) as US-based and lists the topics for all three. Some data points, such as the country, may not be available for every website, possibly because the site serves readers worldwide; here nytimes.com reports its country as 'None'.
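
In the output above, the missing country for nytimes.com appears as the string 'None' rather than Python's None, which is easy to overlook in later comparisons. The following sketch shows one way to normalize such records. clean_description is a hypothetical helper, and the input dictionaries are copied from the output above (topics shortened) instead of calling describe_url().

```python
from typing import Any, Dict


def clean_description(description: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: return a copy with a placeholder country set to None."""
    cleaned = dict(description)  # shallow copy, the input is left untouched
    country = cleaned.get("country")
    if country is None or str(country).strip().lower() in {"", "none"}:
        cleaned["country"] = None
    return cleaned


# Records copied from the describe_url() output shown above (topics shortened)
nytimes = {"url": "nytimes.com", "language": "en", "country": "None",
           "main_topic": "news", "topics": ["world", "travel", "tech"]}
cnbc = {"url": "cnbc.com", "language": "en", "country": "US",
        "main_topic": "news", "topics": ["tech", "sport", "news"]}

for record in (nytimes, cnbc):
    print(record["url"], clean_description(record)["country"])
```

With the placeholder converted, a check such as "country is None" behaves as expected when grouping or filtering sites.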

In [6]:
from newscatcher import Newscatcher, describe_url

nc = Newscatcher(website='cnn.com')
results = nc.get_news()

# results is a dict with the keys:
# 'url', 'topic', 'language', 'country', 'articles'

# Get the articles and read fields from the first one
articles = results['articles']
first_article_summary = articles[0]['summary']
first_article_title = articles[0]['title']

# Restrict the results to a single topic
nc = Newscatcher(website='cnn.com', topic='politics')
results = nc.get_news()
articles = results['articles']

# List the topics available for the site
describe = describe_url('cnn.com')
print(describe['topics'])


['world', 'travel', 'tech', 'politics', 'news', 'entertainment', 'business']
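
Because the available topics differ from site to site, it may be worth checking the list before passing a topic to Newscatcher. This is a minimal sketch under that assumption: pick_topic is a hypothetical helper, and the description dictionary reuses the cnn.com topics printed above rather than a live describe_url() call.

```python
from typing import Any, Mapping, Optional


def pick_topic(description: Mapping[str, Any], wanted: str,
               default: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: return the wanted topic if the site lists it."""
    topics = description.get("topics") or []
    normalized = wanted.strip().lower()
    return normalized if normalized in topics else default


# Topics copied from the cnn.com output above
cnn_description = {
    "url": "cnn.com",
    "topics": ["world", "travel", "tech", "politics",
               "news", "entertainment", "business"],
}

print(pick_topic(cnn_description, "politics"))  # politics
print(pick_topic(cnn_description, "finance"))   # None
```

When the helper returns a topic, it can be passed on as Newscatcher(website='cnn.com', topic=...); when it returns None, the topic argument can simply be omitted.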
