# Scrape News Articles from the Web with NewsAPI

Notebook goals: 
- Download news articles from the web using NewsAPI
- Build functions to send requests to NewsAPI for given keywords, time, and web sources
- Parse and clean the results of the API call


#### 1. NewsAPI

News API is a simple, easy-to-use REST API that returns JSON search results for current and historic news articles published by over 75,000 worldwide sources.

Nice Features: 
- Search with singular keywords or complete phrases
- Specify words which must appear in articles and words that must not
- Specify the timeframe we are interested in
- Limit searches to a single publisher, a selection of publishers, or remove unwanted publishers
- Limit your search to one of 14 languages: ar, de, en, es, fr, he, it, nl, no, pt, ru, se, ud, zh (https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)
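
The keyword features above can be sketched as a small query builder. This is a minimal sketch, assuming the search string accepts quoted phrases for exact matches, a `+` prefix for words that must appear, and a `-` prefix for words that must not; `build_query` and its argument names are hypothetical helpers, not part of NewsAPI, so check the syntax against the official documentation before relying on it.

```python
def build_query(phrases=(), must_include=(), must_exclude=()):
    """Assemble a search string for the `q` parameter (hypothetical helper).

    Assumed syntax: "quoted phrase" for an exact match,
    +word for required words, -word for excluded words.
    """
    parts = [f'"{phrase}"' for phrase in phrases]
    parts += [f"+{word}" for word in must_include]
    parts += [f"-{word}" for word in must_exclude]
    return " ".join(parts)


query = build_query(
    phrases=["covid vaccine"],
    must_include=["UK"],
    must_exclude=["flu"],
)
print(query)  # "covid vaccine" +UK -flu
```

The resulting string could then be passed as the `q` argument of a search request.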

For developers it is <u>totally free</u>, with restrictions: 
- New articles are available with a 1-hour delay
- You can search articles up to a month old
- 100 requests per day

#### 2. Notebook Overview
a) First we will play with NewsAPI, test different features and parameters to get familiar with it

b) Second, we will start the following exercise: 
- Define a set of keywords regarding COVID-19 vaccines and UK news media 
- Get related news articles from NewsAPI
- Clean the data
- Make basic data visualizations and manipulation 

## NewsAPI Overview

First of all, we import the libraries that we are going to use. You'll have to install the library *newsapi* (published as the *newsapi-python* package). This is an unofficial Python client library to integrate News API into your Python application without having to make HTTP requests directly:

In [1]:
!pip install newsapi-python



We are ready to code!

In [2]:
# libraries
from newsapi.newsapi_client import NewsApiClient 
import json                      
from datetime import datetime, timedelta
import time
import numpy as np

To get started we need an API key from: https://newsapi.org

In [3]:
api_key = 'YOUR_API_KEY' # <--- put YOUR API key here!!!
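
Rather than pasting the key into the notebook, where it is easy to share by accident, one option is to read it from an environment variable. The sketch below assumes the key has been stored in a variable named `NEWSAPI_KEY`; both that name and the `load_api_key` helper are hypothetical.

```python
import os


def load_api_key(env_var="NEWSAPI_KEY"):
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key
        raise RuntimeError(f"Set the {env_var} environment variable to your NewsAPI key")
    return key
```

With the variable set, `api_key = load_api_key()` can replace the hard-coded assignment.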

Let's test NewsAPI requesting top headlines:

In [4]:
newsapi = NewsApiClient(api_key=api_key)
results = newsapi.get_top_headlines()
results

{'status': 'ok',
 'totalResults': 463,
 'articles': [{'source': {'id': 'independent', 'name': 'Independent'},
   'author': 'Karl Matchett',
   'title': 'Canada vs Great Britain LIVE: Women’s football latest score, goals and updates from Tokyo Olympics today - The Independent',
   'description': 'The final group stage game will determine if Team GB go through as group winners or runners-up',
   'url': 'https://www.independent.co.uk/sport/olympics/tokyo-2020-team-gb-canada-womens-football-b1891200.html',
   'urlToImage': 'https://static.independent.co.uk/2021/07/27/13/GettyImages-1330897081.jpg?width=1200&auto=webp&quality=75',
   'publishedAt': '2021-07-27T12:36:41Z',
   'content': 'Tom Daley: I am proud to be a gay man and an Olympic champion\r\nGreat Britain take on Canada in their third and final group stage fixture on Tuesday, already through to the knockout phase but now tryi… [+4306 chars]'},
  {'source': {'id': 'cnn', 'name': 'CNN'},
   'author': 'Nina Avramova and Reuters',
   '

The returned object is a <u>JSON</u> object. 

<i>
<center>
JSON (JavaScript Object Notation) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values). It is a very common data format, with a diverse range of applications, one example being web applications that communicate with a server. 
</center>
</i>
(https://en.wikipedia.org/wiki/JSON)

You can think of JSON objects as (nested) Python dictionaries, i.e. collections of key-value pairs. The keys (i.e., *fields*) of the returned object are:

In [5]:
print(results.keys())

dict_keys(['status', 'totalResults', 'articles'])


The *interesting* field is 'articles'. Let's see the structure of the first downloaded article:

In [6]:
results["articles"][0]

{'source': {'id': 'independent', 'name': 'Independent'},
 'author': 'Karl Matchett',
 'title': 'Canada vs Great Britain LIVE: Women’s football latest score, goals and updates from Tokyo Olympics today - The Independent',
 'description': 'The final group stage game will determine if Team GB go through as group winners or runners-up',
 'url': 'https://www.independent.co.uk/sport/olympics/tokyo-2020-team-gb-canada-womens-football-b1891200.html',
 'urlToImage': 'https://static.independent.co.uk/2021/07/27/13/GettyImages-1330897081.jpg?width=1200&auto=webp&quality=75',
 'publishedAt': '2021-07-27T12:36:41Z',
 'content': 'Tom Daley: I am proud to be a gay man and an Olympic champion\r\nGreat Britain take on Canada in their third and final group stage fixture on Tuesday, already through to the knockout phase but now tryi… [+4306 chars]'}

The article is itself a <u>JSON</u> object, whose keys are:

In [7]:
print(results["articles"][0].keys())

dict_keys(['source', 'author', 'title', 'description', 'url', 'urlToImage', 'publishedAt', 'content'])


A couple of notes:
- the *text* fields are 'title', 'description', and 'content'
- the 'content' field is a truncated preview of the full article text (note the trailing '[+4306 chars]'), but it is still enough to get the overall topic of the article
- we also have the URL of the article online
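
To make these fields easier to work with, an article can be flattened into a single-level record. This is a minimal sketch: `flatten_article` is a hypothetical helper, and it assumes 'publishedAt' always has the form seen in the output above (UTC, no fractional seconds), which may not hold for every article.

```python
from datetime import datetime, timezone


def flatten_article(article):
    """Turn one article dictionary into a flat record (hypothetical helper)."""
    source = article.get("source") or {}
    return {
        "source_id": source.get("id"),
        "source_name": source.get("name"),
        "title": article.get("title"),
        "description": article.get("description"),
        "content": article.get("content"),
        "url": article.get("url"),
        # Assumed timestamp format, e.g. '2021-07-27T12:36:41Z'
        "published_at": datetime.strptime(
            article["publishedAt"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc),
    }


# Abbreviated copy of the first article shown above
sample = {
    "source": {"id": "independent", "name": "Independent"},
    "author": "Karl Matchett",
    "title": "Canada vs Great Britain LIVE",
    "description": "The final group stage game will determine if Team GB go through",
    "url": "https://www.independent.co.uk/sport/olympics/tokyo-2020-team-gb-canada-womens-football-b1891200.html",
    "publishedAt": "2021-07-27T12:36:41Z",
    "content": "Tom Daley: I am proud to be a gay man and an Olympic champion",
}

record = flatten_article(sample)
print(record["source_name"], record["published_at"].isoformat())
```

A list of such records is straightforward to load into a table for cleaning later on.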

Now we will use NewsAPI to get something more specific. First, we restrict the time frame (rather than only the latest top headlines) and the language. We will also introduce a keyword: 'vaccine'

In [8]:
# initial date
start_date = datetime(2021, 7, 19)

# we set last date to seven days after start
end_date = start_date + timedelta(days=7)

# we set the language to English
language = "en"

# we set the keyword 
keyword = "vaccine"

# request
results = newsapi.get_everything(q=keyword, 
                                 from_param=start_date.strftime("%Y-%m-%dT%H:%M:%S"), 
                                 to=end_date.strftime("%Y-%m-%dT%H:%M:%S"), 
                                 language=language)

results

{'status': 'ok',
 'totalResults': 14530,
 'articles': [{'source': {'id': 'wired', 'name': 'Wired'},
   'author': 'Eve Sneider',
   'title': 'The Pandemic Olympics, Vaccine Misinformation, and More News',
   'description': 'Catch up on the most important updates from this week.',
   'url': 'https://www.wired.com/story/pandemic-olympics-vaccine-misinformation-coronavirus-news/',
   'urlToImage': 'https://media.wired.com/photos/60faf8fb6c21a8379e7e15f7/191:100/w_1280,c_limit/science_corona_olympics_1234133123.jpg',
   'publishedAt': '2021-07-23T18:21:33Z',
   'content': 'The pandemic Olympics, vaccine misinformation, and reinstated Covid-19 passes. Heres what you should know: \r\nWant to receive this weekly roundup and other coronavirus news? Sign uphere!\r\nThe Olympics… [+4571 chars]'},
  {'source': {'id': None, 'name': 'Lifehacker.com'},
   'author': 'Lindsey Ellefson',
   'title': "How to Start Dating Again If You're Unvaccinated",
   'description': 'The post-vax slutty summer is happ

We notice that the list of results is much shorter than the total number of matching articles:

In [9]:
print("Expected number of articles:", results["totalResults"])
print("Number of results:", len(results["articles"]))

Expected number of articles: 14530
Number of results: 20


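This is because results are returned in 'pages' that we have to iterate over. The request above returned only the first page, which here holds 20 articles; the page_size parameter raises this to at most 100 articles per page (this is a specific feature of NewsAPI). However, developer accounts can only access the first 100 articles of a search. 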
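
The arithmetic behind this limit can be sketched as follows. `pages_needed` is a hypothetical helper; the cap of 100 accessible articles is the developer-account restriction described above.

```python
import math


def pages_needed(total_results, page_size=100, max_accessible=100):
    """Return (pages to fetch everything, pages reachable under the cap)."""
    reachable = min(total_results, max_accessible)
    return math.ceil(total_results / page_size), math.ceil(reachable / page_size)


all_pages, reachable_pages = pages_needed(14530)
print(all_pages, reachable_pages)  # 146 1
```

For the 14530 matches reported above, only one of the 146 full-size pages is within reach of a developer account.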

So, to get more results we have to refine the keywords, or restrict the time frame.
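
One way to restrict the time frame is to split it into smaller windows and send one request per window. The generator below is a minimal sketch with a hypothetical name, `daily_windows`; it produces consecutive windows that end exactly at the end date.

```python
from datetime import datetime, timedelta


def daily_windows(start, end, step=timedelta(days=1)):
    """Yield (window_start, window_end) pairs covering the range from start to end."""
    current = start
    while current < end:
        # Clip the last window so it never runs past the end date
        yield current, min(current + step, end)
        current += step


windows = list(daily_windows(datetime(2021, 7, 19), datetime(2021, 7, 26)))
for window_start, window_end in windows:
    print(window_start.strftime("%Y-%m-%dT%H:%M:%S"), "->",
          window_end.strftime("%Y-%m-%dT%H:%M:%S"))
```

Each pair can be formatted with `strftime` and passed as the start and end of a single request.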

We will restrict to a few sources:

In [10]:
# these are supported sources in newsapi
sources = ['bbc-news', 'business-insider-uk', 'independent']

# these are not supported so we have to pass the url
domains = ['mirror.co.uk', 'dailymail.co.uk', 'standard.co.uk', 
           'thesun.co.uk', 'telegraph.co.uk', 'metro.co.uk']

# we want to iterate over sources and dates and organize results better
articles = []

date = start_date
step = timedelta(days=1)  # advancing step

# iterate over dates
while date <= end_date: 
    
    # show advancement
    print(date)
    
    # iterate over sources
    for source in sources:
        print("\t", source)
        results = newsapi.get_everything(q=keyword, 
                                         sources=source,
                                         from_param=date.strftime("%Y-%m-%dT%H:%M:%S"), 
                                         to=(date+step).strftime("%Y-%m-%dT%H:%M:%S"), 
                                         language="en", 
                                         page_size=100)
        
        # check we're not reaching the 100 articles limit
        if len(results["articles"]) == 100:
            print("\t\tAlert: articles limit reached")
        
        # append new articles
        articles.extend(results["articles"])
        
    # iterate over domains
    for domain in domains:
        print("\t", domain)
        results = newsapi.get_everything(q=keyword, 
                                         domains=domain,
                                         from_param=date.strftime("%Y-%m-%dT%H:%M:%S"), 
                                         to=(date+step).strftime("%Y-%m-%dT%H:%M:%S"), 
                                         language="en", 
                                         page_size=100)
        
        # check we're not reaching the 100 articles limit
        if len(results["articles"]) == 100:
            print("\t\tAlert: articles limit reached")
        
        # append new articles
        articles.extend(results["articles"])
    
    # advance time
    date += step


2021-07-19 00:00:00
	 bbc-news
	 business-insider-uk
	 independent
	 mirror.co.uk
	 dailymail.co.uk
	 standard.co.uk
	 thesun.co.uk
	 telegraph.co.uk
	 metro.co.uk
2021-07-20 00:00:00
	 bbc-news
	 business-insider-uk
	 independent
	 mirror.co.uk
	 dailymail.co.uk
	 standard.co.uk
	 thesun.co.uk
	 telegraph.co.uk
	 metro.co.uk
2021-07-21 00:00:00
	 bbc-news
	 business-insider-uk
	 independent
	 mirror.co.uk
	 dailymail.co.uk
		Alert: articles limit reached
	 standard.co.uk
	 thesun.co.uk
	 telegraph.co.uk
	 metro.co.uk
2021-07-22 00:00:00
	 bbc-news
	 business-insider-uk
	 independent
	 mirror.co.uk
	 dailymail.co.uk
		Alert: articles limit reached
	 standard.co.uk
	 thesun.co.uk
	 telegraph.co.uk
	 metro.co.uk
2021-07-23 00:00:00
	 bbc-news
	 business-insider-uk
	 independent
	 mirror.co.uk
	 dailymail.co.uk
		Alert: articles limit reached
	 standard.co.uk
	 thesun.co.uk
	 telegraph.co.uk
	 metro.co.uk
2021-07-24 00:00:00
	 bbc-news
	 business-insider-uk
	 independent
	 mirror.co.uk
	 

How many articles?

In [11]:
print(len(articles))

1890
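
This count may include duplicates: consecutive daily windows share a boundary timestamp, and the same article can in principle come back from more than one request. A minimal sketch of a clean-up step, assuming the URL identifies an article uniquely (`deduplicate_by_url` is a hypothetical helper):

```python
def deduplicate_by_url(articles):
    """Keep the first occurrence of each URL, preserving the original order."""
    seen = set()
    unique = []
    for article in articles:
        url = article.get("url")
        # Articles without a URL cannot be compared, so they are always kept
        if url is not None and url in seen:
            continue
        seen.add(url)
        unique.append(article)
    return unique


demo = [
    {"url": "https://example.com/a", "title": "A"},
    {"url": "https://example.com/b", "title": "B"},
    {"url": "https://example.com/a", "title": "A (again)"},
]
print(len(deduplicate_by_url(demo)))  # 2
```

Applied to the collected list, this would be `articles = deduplicate_by_url(articles)`.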


Save:

In [12]:
articles_dict = {"articles": articles}
with open("../data/articles_uk_" + keyword + ".json", "w") as file:
    json.dump(articles_dict, file)
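
To check that the file can be read back, a save and load round trip can be sketched as below. The helper names are hypothetical, and the example writes to a temporary directory rather than to `../data`; the explicit UTF-8 encoding is a choice made here so that characters such as ’ in the titles survive unchanged.

```python
import json
import os
import tempfile


def save_articles(articles, path):
    """Write the articles to a JSON file under the 'articles' key."""
    with open(path, "w", encoding="utf-8") as file:
        json.dump({"articles": articles}, file, ensure_ascii=False)


def load_articles(path):
    """Read the list of articles back from a file written by save_articles."""
    with open(path, encoding="utf-8") as file:
        return json.load(file)["articles"]


demo_articles = [{"title": "Women’s football latest score", "url": "https://example.com/a"}]

with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "articles_demo.json")
    save_articles(demo_articles, demo_path)
    restored = load_articles(demo_path)

print(restored == demo_articles)  # True
```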