# Scraping news articles with the Guardian API and BeautifulSoup

This notebook creates a labelled dataset for Natural Language Processing (NLP) by scraping news articles from different sections of the Guardian. 

First, it collects the locations (URLs) of the desired news articles using the Guardian Open Platform, specifically the content API endpoint. Then, it scrapes the text of each article with the BeautifulSoup Python library and saves it.

In [1]:
import os, sys, time
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from pprint import pprint

apikey = os.getenv('GUARDIAN_APIKEY')
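
`os.getenv` returns `None` when the variable is missing, and the problem would only surface later as a failed API request. A minimal sketch of an early check follows; the helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise if unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value

# Possible usage with the variable name from the cell above:
# apikey = require_env('GUARDIAN_APIKEY')
```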

## Create the dataset

Now we can repeat this process to grab as many articles as needed. We will search for articles from five sections of the newspaper, published since 1 January 2020. A query like this returns thousands of hits spread over many pages, so it is convenient to raise the `page-size` of the server response to its maximum value (200) and to iterate over the `page` and `section` parameters.
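
When a section has fewer hits than expected, requesting a page past the last one may fail. The sketch below caps the page range using the page count reported by the server. It assumes the `response` object of the JSON payload carries a `pages` field; the helper `pages_to_fetch` and the sample payload are hypothetical.

```python
def pages_to_fetch(response, n_pages):
    """Return the page numbers to request, capped by the page count the API reports."""
    # Fall back to n_pages if the (assumed) 'pages' field is absent.
    available = response.get('pages', n_pages)
    return list(range(1, min(n_pages, available) + 1))

# Made-up first-page payload, trimmed to the fields of interest.
first_page = {'total': 450, 'pageSize': 200, 'currentPage': 1, 'pages': 3}
print(pages_to_fetch(first_page, n_pages=5))  # [1, 2, 3]
```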

In [2]:
def get_section_results(base_url, params, section, n_pages=3):
    
    section_results = []
    current_page = 1
    total_pages = n_pages
    
    while current_page <= total_pages:
        params.update({'page': current_page, 'section': section})
        try:
            r = requests.get(base_url, params=params)
            r.raise_for_status() 
        except requests.exceptions.RequestException as err:
            raise SystemExit(err)
            
        data = r.json()
        section_results.extend(data['response']['results'])
        current_page += 1
        
    print(f"Grabbed {len(section_results)} results.")
    return section_results

In [3]:
def results_to_html(results, section, to_file=True):
    
    # grab urls, write to file
    urls = [result['webUrl'] for result in results]
    
    if to_file:
        with open(f"urls/urls_{section}.txt", 'w') as f:
            for url in urls:
                f.write(f"{url}\n")
        print(f"Written urls to urls_{section}.txt.")

    # retrieve HTML from urls
    html_files = {}
    while len(html_files) < len(urls): 
        
        for i, url in enumerate(urls):
            if i not in html_files:
                try:
                    file = requests.get(url)
                    file.raise_for_status()
                    html_files[i] = file
                except requests.exceptions.RequestException as err:
                    with open("logs.txt", 'a') as f:
                        f.write(f"At file {i}: {err}\n")
                    time.sleep(10)
    
    # tests
    codes = set(file.status_code for file in html_files.values())
    
    if len(html_files) == len(urls):
        print(f"Retrieved html responses for all {len(urls)} urls. Status codes remaining: {codes}.")
        return html_files
    else:
        sys.exit(f"Error: only got {len(html_files)} articles for {len(urls)} urls. Codes: {codes}.")
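
Note that the `while` loop above keeps retrying until every URL succeeds, so a permanently broken link would make it spin forever. One possible alternative is a bounded retry, sketched here with a hypothetical helper `fetch_with_retries`. The `fetch` argument stands in for any callable such as `requests.get`; it is injected so the sketch runs without network access.

```python
import time

def fetch_with_retries(fetch, url, max_attempts=3, delay=10.0):
    """Call fetch(url) up to max_attempts times; return None if every attempt fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as err:  # with requests, narrow this to RequestException
            print(f"Attempt {attempt}/{max_attempts} failed for {url}: {err}")
            if attempt < max_attempts:
                time.sleep(delay)  # wait before the next attempt
    return None
```

URLs for which the helper returns `None` can then be logged and skipped instead of blocking the whole run.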

In [4]:
def html_to_text(html_files, section):
    
    all_texts = []
    
    for file_id, file in html_files.items():
        soup = BeautifulSoup(file.content, 'html.parser')
        body = soup.find_all('div', class_='article-body-commercial-selector')
        
        # items with no body are (few) liveblogs, can ignore
        if len(body) == 1:
            if section in ['film', 'culture']:
                ps = body[0].find_all('p', class_='dcr-1m34hpq')
            else:
                ps = body[0].find_all('p', class_='dcr-s23rjr')
            par_list = [p.text for p in ps]
            text = " ".join(par_list)
            
            # replace non-breaking spaces (HTML &nbsp; entities decoded as \xa0)
            text = text.replace('\xa0', ' ')
            
            # discard items not encompassed by above tags
            if not (text == ''):
                all_texts.append(text)
            else:
                # keep a record of "bad" urls
                with open('bad_p.txt', 'a') as f:
                    f.write(f"Section {section}: {file.url}\n")
    
    print(f"All done! Correctly parsed documents: {len(all_texts)}, discarded: {len(html_files)-len(all_texts)}.")
    return all_texts
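
The cleaning step above replaces only `\xa0`. A slightly broader sketch, assuming it is acceptable to collapse all whitespace runs (newlines and tabs included) into single spaces, relies on `str.split()` treating non-breaking spaces as whitespace. The helper name `clean_text` and the sample string are illustrative only.

```python
def clean_text(text):
    """Collapse every whitespace run (including non-breaking spaces) to one space."""
    return " ".join(text.split())

# Made-up sample string with a non-breaking space, a newline and a double space.
sample = "Primark\xa0has said\n it  will"
print(clean_text(sample))  # Primark has said it will
```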

Let's grab the first 3 pages (600 items) of each section. After discarding the items that fail to parse consistently, we are left with a sizeable dataset of roughly 2,750 news articles.

In [5]:
def main():
    
    API_ENDPOINT = "https://content.guardianapis.com/search"
    sections = ['environment', 'business', 'film', 'culture', 'education']
    my_params = {
        'api-key': apikey,
        'order-by': 'relevance', 
        'from-date': "2020-01-01",
        'page-size': 200,
    }
    
    for i, section in enumerate(sections):
        
        print(f"[{i+1}/{len(sections)}] Requesting '{section}' articles from api...")
        results = get_section_results(API_ENDPOINT, my_params, section, n_pages=3)
        
        print("Requesting html contents...")
        html_files = results_to_html(results, section, to_file=True)
        
        print("Parsing and cleaning contents...")
        texts = html_to_text(html_files, section)
        
        print("Saving dataframe...")
        df = pd.DataFrame({'Content': texts, 'Section': section})
        df.to_csv(f"data/{section}_news.csv", index=False)
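
The functions above write to `urls/` and `data/`, and both `open` and `to_csv` fail if the target directory does not exist. A minimal sketch for creating the directories up front follows; the helper name `ensure_dirs` is hypothetical.

```python
import os

def ensure_dirs(*paths):
    """Create each directory if it is missing; existing ones are left untouched."""
    for path in paths:
        os.makedirs(path, exist_ok=True)
    return list(paths)

# Possible usage before calling main():
# ensure_dirs("urls", "data")
```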

In [6]:
!rm -f bad_p.txt
main()

[1/5] Requesting 'environment' articles from api...
Grabbed 600 results.
Requesting html contents...
Written urls to urls_environment.txt.
Retrieved html responses for all 600 urls. Status codes remaining: {200}.
Parsing and cleaning contents...
All done! Correctly parsed documents: 585, discarded: 15.
Saving dataframe...
[2/5] Requesting 'business' articles from api...
Grabbed 600 results.
Requesting html contents...
Written urls to urls_business.txt.
Retrieved html responses for all 600 urls. Status codes remaining: {200}.
Parsing and cleaning contents...
All done! Correctly parsed documents: 475, discarded: 125.
Saving dataframe...
[3/5] Requesting 'film' articles from api...
Grabbed 600 results.
Requesting html contents...
Written urls to urls_film.txt.
Retrieved html responses for all 600 urls. Status codes remaining: {200}.
Parsing and cleaning contents...
All done! Correctly parsed documents: 599, discarded: 1.
Saving dataframe...
[4/5] Requesting 'culture' articles from api...


Check the saved data.

In [7]:
sections = ['environment', 'business', 'film', 'culture', 'education']

for section in sections:
    tmp = pd.read_csv(f"data/{section}_news.csv")
    print(len(tmp))

585
475
599
572
523


In [8]:
df = pd.read_csv("data/business_news.csv")
print(df.shape)
df.head()

(475, 2)


| | Content | Section |
|---|---|---|
| 0 | A consortium that includes high street giant N... | business |
| 1 | Connells has struck an agreed offer for Countr... | business |
| 2 | UK mortgage approvals have risen to the highes... | business |
| 3 | Another 787,000 Americans filed for unemployme... | business |
| 4 | Primark has said it will lose an additional £2... | business |


And we are done. Another way to save text data would be to save each article to a separate .txt file, but for now this will suffice.
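
For reference, the per-file alternative could look like the sketch below. The helper `save_texts`, the output directory `data_txt` and the file naming scheme are all assumptions, not part of the original notebook.

```python
from pathlib import Path

def save_texts(texts, section, out_dir="data_txt"):
    """Write each article to <out_dir>/<section>/<section>_<index>.txt."""
    section_dir = Path(out_dir) / section
    section_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, text in enumerate(texts):
        # Zero-padded index keeps the files in order when listed alphabetically.
        path = section_dir / f"{section}_{i:04d}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths
```

With this layout the section label is carried by the directory name, so the CSV `Section` column is no longer needed.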