# Scraping news articles with the Guardian API and BeautifulSoup

This notebook creates a dataset for Natural Language Processing (NLP) by scraping news articles from the Guardian. 

First, it collects the locations (URLs) of the desired news articles using the Guardian Open Platform, specifically the content API endpoint. Then, it scrapes the text of each article with the BeautifulSoup Python library and saves it.

## Example

Let's try to obtain the text from a single article first, to get familiar with the API.

In [1]:
import os, sys, time
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from pprint import pprint

apikey = os.getenv('GUARDIAN_APIKEY')
BASE_URL = "http://content.guardianapis.com/search?"

In [None]:
# query = "China"
# query_fields = "body"
section = 'environment'
order_by = 'relevance'
from_date = "2020-1-1T00:00:00"
to_date = "2021-9-22T00:00:00"
query_url = f"{BASE_URL}&api-key={apikey}" \
            f"&section={section}" \
            f"&order-by={order_by}" \
            f"&from-date={from_date}" # \
            # f"&to-date={to_date}" # \
            # f"&show-fields=body"

# query_url 

In [None]:
r = requests.get(query_url)
print(f"Status code: {r.status_code}\n")
print(f"Headers: {r.headers}\n")
#pprint(r.json())
r.json()

So the request was successful, and we displayed the response as JSON. We could also ask for the `body` field to be returned (with `show-fields=body`), as a potential shortcut that avoids calling `requests` a second time on the individual article URLs. The returned HTML, however, contains artifacts such as related-content blocks, so we will take the traditional route. Now, let's look at the URL of the first article in the results!

In [None]:
url = r.json()['response']['results'][0]['webUrl']
print(url)

Now we request the article itself and parse it with BeautifulSoup.

In [None]:
article = requests.get(url)
soup_article = BeautifulSoup(article.content, 'html.parser')
# print(soup_article.prettify())

We find all `p` tags with the specified properties (i.e. a given class, inside a certain `div`), then extract and join their text. And we are done.

In [None]:
body = soup_article.find_all('div', class_='article-body-commercial-selector')
ps = body[0].find_all('p', class_='dcr-s23rjr')
par_list = [p.text for p in ps]
final = " ".join(par_list)
# replace non-breaking spaces (decoded from &nbsp; entities) with regular spaces
final = final.replace('\xa0', ' ')
final
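
The `replace` call above handles one specific character. As a possible alternative, here is a minimal sketch that collapses every kind of whitespace in one step. The helper name `normalize_whitespace` is hypothetical and not part of the notebook; it relies on the fact that `str.split()` without arguments splits on any Unicode whitespace, non-breaking spaces included.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into a single regular space."""
    # str.split() with no arguments splits on any whitespace character,
    # including the non-breaking space '\xa0', and drops empty pieces.
    return " ".join(text.split())


sample = "Rising\xa0seas  threaten\ncoastal\ttowns "
print(normalize_whitespace(sample))  # Rising seas threaten coastal towns
```

This also removes newlines and tabs, which may or may not be desirable depending on whether paragraph breaks matter for the downstream NLP task.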

There should be a shortcut to all this, since the API can return the body as HTML when asked (by adding `show-fields=body` to the API call above). That HTML, however, contains certain artifacts (such as related content) which I haven't been able to remove yet. Ideally, there would be a switch in the API to exclude them. If this worked, the following snippet of code would retrieve the whole text without a second round of HTTP requests.

In [None]:
# article_body = r.json()['response']['results'][0]['fields']['body']
# article_body
# new_soup = BeautifulSoup(article_body, 'html.parser')
# ps2 = new_soup.find_all('p')
# par_list = [p.text for p in ps2]
# final2 = " ".join(par_list)
# final2

## Create the dataset

### Grab article urls and store them

Now we can repeat this process to grab as many articles as needed. We will collect articles from several sections (environment, business, film, culture and education) published since 1 January 2020. Such a query returns many hits, spread over many pages. It is convenient to increase the `page-size` of the server response to the maximum value (200) and to pass the query parameters as a dictionary rather than a hand-built URL, so it's easier to iterate over the parameter `page`.

Let's grab the first 3 pages (600 articles) of each section.
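
To make the parameter-dictionary idea concrete, here is a small sketch of how such a dictionary maps onto one URL per page. It uses only the standard library (`urllib.parse.urlencode`) instead of `requests`, so it can be run without network access. The helper name `build_page_urls` and the placeholder key `YOUR_KEY` are illustrative assumptions.

```python
from urllib.parse import urlencode

API_ENDPOINT = "http://content.guardianapis.com/search"


def build_page_urls(params: dict, n_pages: int) -> list:
    """Return one request URL per page, leaving `params` untouched."""
    urls = []
    for page in range(1, n_pages + 1):
        # copy the shared parameters and set the page number for this request
        page_params = {**params, "page": page}
        urls.append(f"{API_ENDPOINT}?{urlencode(page_params)}")
    return urls


example_params = {
    "api-key": "YOUR_KEY",  # placeholder, not a real key
    "section": "environment",
    "page-size": 200,
}
for url in build_page_urls(example_params, n_pages=2):
    print(url)
```

Passing the same dictionary to `requests.get(url, params)` produces an equivalent query string, which is why only the `page` entry needs to change between requests.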

In [62]:
def get_section_results(base_url, params, section, n_pages=3):
    
    section_results = []
    current_page = 1
    total_pages = n_pages
    
    while current_page <= total_pages:
        params.update({'page': current_page, 'section': section})
        try:
            r = requests.get(base_url, params)
            r.raise_for_status() 
        except requests.exceptions.RequestException as err:
            raise SystemExit(err)
            
        data = r.json()
        section_results.extend(data['response']['results'])
        current_page += 1
        
    print(f"Grabbed {len(section_results)} results.")
    return section_results
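
The function above always requests `n_pages` pages. If a section had fewer pages than that, the extra requests would fail. The following is a hedged sketch of one way to stop early. It assumes the response JSON reports the total number of pages under `response['pages']`, which should be checked against the API documentation. `collect_results` and `fetch_page` are hypothetical names, and `fetch_page` stands in for the real HTTP call so the example runs offline.

```python
def collect_results(fetch_page, n_pages=3):
    """Gather results page by page, stopping early if fewer pages exist.

    `fetch_page(page)` is any callable returning the decoded JSON of one page.
    """
    results = []
    page, last_page = 1, n_pages
    while page <= last_page:
        response = fetch_page(page)["response"]
        results.extend(response["results"])
        # assumption: the API reports the total page count under 'pages'
        last_page = min(n_pages, response.get("pages", n_pages))
        page += 1
    return results


def fake_fetch(page):
    """Stand-in for the API: pretends the query has only 2 pages."""
    return {"response": {"pages": 2, "results": [f"article-{page}"]}}


print(collect_results(fake_fetch, n_pages=3))  # ['article-1', 'article-2']
```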

In [63]:
def results_to_html(results, section, to_file=True):
    
    # grab urls, write to file
    urls = [result['webUrl'] for result in results]
    
    if to_file:
        with open(f"urls_{section}.txt", 'w') as f:
            for url in urls:
                f.write(f"{url}\n")
        print(f"Written urls to urls_{section}.txt.")

    # retrieve HTML from urls
    html_files = {}
    while len(html_files) < len(urls): 
        
        for i, url in enumerate(urls):
            if i not in html_files:
                try:
                    file = requests.get(url)
                    file.raise_for_status()
                    html_files[i] = file
                except requests.exceptions.RequestException as err:
                    print(f"At file {i}: {err}")
                    time.sleep(10)
    
    # tests
    codes = set(file.status_code for file in html_files.values())
    
    if len(html_files) == len(urls):
        print(f"Retrieved html responses for all {len(urls)} urls. Status codes remaining: {codes}.")
        return html_files
    else:
        sys.exit(f"Error: only got {len(html_files)} articles for {len(urls)} urls. Codes: {codes}.")

In [64]:
def html_to_text(html_files, section):
    
    all_texts = []
    
    for file_id, file in html_files.items():
        soup = BeautifulSoup(file.content, 'html.parser')
        body = soup.find_all('div', class_='article-body-commercial-selector')
        
        # items with no body are (few) liveblogs, can ignore
        if len(body) == 1:
            if section in ['film', 'culture']:
                ps = body[0].find_all('p', class_='dcr-1m34hpq')
            else:
                ps = body[0].find_all('p', class_='dcr-s23rjr')
            par_list = [p.text for p in ps]
            text = " ".join(par_list)
            
            # discard items not encompassed by above tags
            if text:
                all_texts.append(text)
            else:
                # keep a record of "bad" urls
                with open('bad_p.txt', 'a') as f:
                    f.write(f"Section {section}: {file.url}\n")
    
    print(f"All done! Correctly parsed documents: {len(all_texts)}, discarded: {len(html_files)-len(all_texts)}.")
    return all_texts

In [66]:
def main():
    
    API_ENDPOINT = "http://content.guardianapis.com/search"
    sections = ['environment', 'business', 'film', 'culture', 'education']
    my_params = {
        'api-key': apikey,
        'order-by': 'relevance', 
        'from-date': "2020-1-1",
        'page-size': 200,
    }
    
    for i, section in enumerate(sections):
        
        print(f"[{i+1}/{len(sections)}] Requesting '{section}' articles from api...")
        results = get_section_results(API_ENDPOINT, my_params, section, n_pages=3)
        
        print(f"Requesting html contents...")
        html_files = results_to_html(results, section, to_file=True)
        
        print(f"Parsing and cleaning contents...")
        texts = html_to_text(html_files, section)
        
        print(f"Saving dataframe...")
        df = pd.DataFrame({'Content': texts})
        df.to_csv(f"{section}_news.csv", index=False)

In [67]:
!rm -f bad_p.txt
main()

[1/5] Requesting 'environment' articles from api...
Grabbed 600 results.
Requesting html contents...
Written urls to urls_environment.txt.
At file 127: 429 Client Error: Too Many Requests for url: https://www.theguardian.com/environment/2020/sep/21/coalitions-gas-plan-would-help-fewer-than-1-of-manufacturing-workers-report-finds
At file 180: 429 Client Error: Too Many Requests for url: https://www.theguardian.com/environment/2020/nov/16/us-and-uk-yet-to-show-support-for-global-treaty-to-tackle-plastic-pollution
At file 229: 429 Client Error: Too Many Requests for url: https://www.theguardian.com/environment/2020/oct/28/trump-permit-logging-alaska-tongass-national-forest
At file 305: 429 Client Error: Too Many Requests for url: https://www.theguardian.com/environment/2020/jun/18/worst-outbreak-ever-nearly-a-million-pigs-culled-in-nigeria-due-to-swine-fever
At file 368: 429 Client Error: Too Many Requests for url: https://www.theguardian.com/environment/2020/apr/29/sweet-city-the-costa-r

### Scrape article's body from url

Once we have all the URLs, we retrieve the HTML of each article (this is what `results_to_html` does above). The server sometimes rate-limits us and responds with a 429 (Too Many Requests) status code, so we wait 10 seconds before resuming our requests. We store successful responses in a dictionary, so that it is simple to check which URLs threw an error and still need to be retrieved.
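
A fixed wait works, but a common alternative is to wait longer after each consecutive failure (exponential backoff) and to give up after a bounded number of attempts. The sketch below is an illustration under stated assumptions, not what the notebook does: `fetch_with_backoff` is a hypothetical helper, and `fetch` is any callable that returns a response or raises, for example a thin wrapper around `requests.get` followed by `raise_for_status()`.

```python
import time


def fetch_with_backoff(fetch, url, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception as err:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({err}); retrying in {delay:.0f}s")
            sleep(delay)


# Offline demonstration: a fake fetcher that fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return f"<html>{url}</html>"


waits = []  # record the delays instead of actually sleeping
page = fetch_with_backoff(flaky_fetch, "https://example.com/a", sleep=waits.append)
print(page, waits)
```

Bounding the number of attempts also avoids looping forever on URLs that fail permanently, such as a page that has been removed.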

We can reload the saved files to check that everything is in order.

In [68]:
sections = ['environment', 'business', 'film', 'culture', 'education']

for section in sections:
    tmp = pd.read_csv(f"{section}_news.csv")
    print(len(tmp))

585
475
599
570
524


In [69]:
tmp.isna().sum()

Content    0
dtype: int64

And we are done. Another way to save text data would be to save each article to a separate .txt file, but for now this will suffice.
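
For reference, the one-file-per-article alternative could look roughly like the sketch below. The helper `save_texts`, the `articles` directory and the file naming scheme are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def save_texts(texts, section, out_dir="articles"):
    """Write each text to <out_dir>/<section>/<section>_<index>.txt."""
    target = Path(out_dir) / section
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, text in enumerate(texts):
        path = target / f"{section}_{i:04d}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


# Demonstration in a temporary directory, so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    written = save_texts(["First article.", "Second article."], "environment", tmp_dir)
    print([p.name for p in written])
```

With the notebook's pipeline, this would be called with the list returned by `html_to_text` for each section.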