# Scraping news articles with TheGuardianAPI and BeautifulSoup

This notebook creates a dataset for Natural Language Processing (NLP) by scraping news articles from the Guardian. 

First, it collects the URLs of the desired news articles using the Guardian Open Platform, specifically the content API endpoint. Then, it scrapes the text of each article with the BeautifulSoup Python library and saves it.

## Example

Let's try to obtain the text from a single article first, to get familiar with the API.

In [1]:
import os, time
import requests
from bs4 import BeautifulSoup
from pprint import pprint

apikey = os.getenv('GUARDIAN_APIKEY')
BASE_URL = "http://content.guardianapis.com/search?"

In [2]:
query = "Hong Kong AND election"
query_fields = "body"
from_date = "2021-9-21T00:00:00"
to_date = "2021-9-21T08:00:00"
query_url = f"{BASE_URL}api-key={apikey}" \
            f"&q={query}" \
            f"&query-fields={query_fields}" \
            f"&from-date={from_date}" \
            f"&to-date={to_date}"
# Optionally append f"&show-fields=body" to have the API return the article body as well.

# query_url

In [3]:
r = requests.get(query_url)
print(f"Status code: {r.status_code}\n")
print(f"Headers: {r.headers}\n")
pprint(r.json())

Status code: 200

Headers: {'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Origin': '*', 'Cache-Control': 'max-age=0, no-cache="set-cookie"', 'Content-Encoding': 'gzip', 'Content-Type': 'application/json', 'Date': 'Tue, 21 Sep 2021 19:09:08 GMT', 'Server': 'Concierge', 'Set-Cookie': 'AWSELB=75B9BD811C5C032EDEF76366759629DCCB8726D7A37401C19A3457074430A4AA14CD1FA6CBE4519DDF3CD336789F71716B110728D8FF3418C2C759D07E5F767DD54D1B7752;PATH=/;MAX-AGE=86400', 'Via': 'kong/0.14.0', 'X-Kong-Proxy-Latency': '1', 'X-Kong-Upstream-Latency': '12', 'X-RateLimit-Limit-day': '5000', 'X-RateLimit-Limit-minute': '720', 'X-RateLimit-Remaining-day': '4980', 'X-RateLimit-Remaining-minute': '719', 'Content-Length': '419', 'Connection': 'keep-alive'}

{'response': {'currentPage': 1,
              'orderBy': 'relevance',
              'pageSize': 10,
              'pages': 1,
              'results': [{'apiUrl': 'https://content.guardianapis.com/world/2021/sep/21/hong-kong-leader-defends-elect
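The response headers printed above include rate-limit counters (`X-RateLimit-Remaining-minute` and `X-RateLimit-Remaining-day`). Before launching a large batch of requests it can be useful to read them. The following is a minimal sketch: the helper name `remaining_quota` is hypothetical, and the sample values are copied from the output above.

```python
def remaining_quota(headers):
    """Return the remaining (per-minute, per-day) allowance; None where a header is absent."""
    def as_int(name):
        value = headers.get(name)
        return int(value) if value is not None else None
    return as_int('X-RateLimit-Remaining-minute'), as_int('X-RateLimit-Remaining-day')

# Header values taken from the response shown above
sample_headers = {
    'X-RateLimit-Limit-day': '5000',
    'X-RateLimit-Limit-minute': '720',
    'X-RateLimit-Remaining-day': '4980',
    'X-RateLimit-Remaining-minute': '719',
}
per_minute, per_day = remaining_quota(sample_headers)
print(per_minute, per_day)
```

With a live response, `r.headers` could be passed directly, since it also supports `.get`.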

The request was successful, and we pretty-printed the JSON response for readability. We could ask the API to return the `body` field as well, which would avoid calling `requests` a second time on the individual article URLs. The returned body has some subtle quirks, however, so we will take the traditional route. Now, let's look at the URL of the article we found.

In [4]:
results = r.json()['response']['results']
for result in results:
    url = result['webUrl']
    print(url)

https://www.theguardian.com/world/2021/sep/21/hong-kong-leader-defends-election-after-single-opposition-figure-makes-it-to-1500-strong-committee


Now we request the article itself and parse it with BeautifulSoup.

In [5]:
article = requests.get(url)
soup_article = BeautifulSoup(article.content, 'html.parser')
# print(soup_article.prettify())

We list all `p` tags with the specified properties (i.e. class and position inside a certain `div`), then extract and join their text. And we are done.

In [6]:
body = soup_article.find_all('div', class_='article-body-commercial-selector')
ps = body[0].find_all('p', class_='dcr-s23rjr')
par_list = [p.text for p in ps]
final = " ".join(par_list)
final

'Hong Kong’s chief executive, Carrie Lam, has defended the weekend’s election of a powerful committee to appoint senior leaders, after just one candidate not strictly aligned with the establishment camp was elected among the 1,500 positions. Under an overhauled electoral system, dubbed “patriots rule Hong Kong”, fewer than 5,000 people were eligible to vote on Sunday, choosing from candidates who had already been vetted for political loyalty and cleared of being a national security threat. The results saw primarily Beijing loyalists and pro-establishment figures elected to the committee. The group will choose nearly half the Hong Kong legislature next year, and a new leader for the territory. Just two candidates described by local media as not strictly from the establishment ranks , were able to run. Only one, Tik Chi-yuen, was elected. On Tuesday at her regular press briefing Lam rejected criticisms of the lack of opposition figures among the candidates and eligible voters, saying “no

There should be a shortcut to all this, since the API can return the body as HTML when asked (by adding `show-fields=body` to the API call above). That HTML, however, contains certain artifacts (such as related content) which I haven't been able to remove yet. Ideally, the API would offer a switch for this. If it did, the following snippet would retrieve the whole text without a second round of HTTP requests.

In [7]:
# article_body = r.json()['response']['results'][0]['fields']['body']
# article_body
# new_soup = BeautifulSoup(article_body, 'html.parser')
# ps2 = new_soup.find_all('p')
# par_list = [p.text for p in ps2]
# final2 = " ".join(par_list)
# final2

## Create the dataset

### Grab article urls and store them

Now we can repeat this process to grab as many articles as needed. We will search for all articles containing the term "Hong Kong" in the body, published since 1 January 2019. This query returns thousands of hits, spread over many pages. It is convenient to increase the `page-size` of the server response to the maximum value (200) and to pass the request parameters as a dictionary, which makes it easier to iterate over the `page` parameter.

In [8]:
API_ENDPOINT = "http://content.guardianapis.com/search"

my_params = {
    'api-key': apikey,
    'q': "Hong Kong",
    'query-fields': "body",
    'from-date': "2019-1-1",
    'page-size': 200,
}
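Passing a dictionary lets the HTTP library build and encode the query string. As a rough illustration of what that encoding looks like, the sketch below uses the standard library's `urllib.parse.urlencode` rather than `requests` itself. The helper `build_url` and the placeholder key `"test"` are assumptions made so that the snippet runs on its own.

```python
from urllib.parse import urlencode

API_ENDPOINT = "http://content.guardianapis.com/search"

example_params = {
    'api-key': "test",  # placeholder, not a real key
    'q': "Hong Kong",
    'query-fields': "body",
    'from-date': "2019-1-1",
    'page-size': 200,
}

def build_url(endpoint, params, page):
    """Return the full URL for one page of results without modifying `params`."""
    return f"{endpoint}?{urlencode({**params, 'page': page})}"

print(build_url(API_ENDPOINT, example_params, 1))
```

Merging the page number into a copy (`{**params, 'page': page}`) leaves the original dictionary untouched between iterations.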

Let's grab the first 15 pages.

In [9]:
def get_all_results(base_url, params):
    """Download the first `total_pages` pages of search results."""
    all_results = []
    current_page = 1
    total_pages = 15
    while current_page <= total_pages:
        if current_page % 5 == 0:
            print(f"Downloading page {current_page}...")
        params['page'] = current_page  # note: updates the caller's dict in place
        try:
            r = requests.get(base_url, params=params)
            r.raise_for_status()
        except requests.exceptions.RequestException as err:
            # Abort on any network or HTTP error
            raise SystemExit(err)
        data = r.json()
        all_results.extend(data['response']['results'])
        current_page += 1

    print("Finished downloading.")
    return all_results

all_results = get_all_results(API_ENDPOINT, my_params)

Downloading page 5...
Downloading page 10...
Downloading page 15...
Finished downloading.
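The function above hard-codes 15 pages. Each response also reports the total number of pages in its `pages` field (visible in the output of the first request), so the loop could instead stop when the results run out. Here is a sketch under that assumption: `fetch_page` is a hypothetical callable standing in for the HTTP request, and `fake_fetch` simulates the API with made-up data so that the snippet runs offline.

```python
def collect_results(fetch_page, max_pages=None):
    """Collect results page by page until the reported last page (or max_pages)."""
    all_results = []
    page = 1
    while True:
        response = fetch_page(page)['response']
        all_results.extend(response['results'])
        last_page = response['pages']
        if max_pages is not None:
            last_page = min(last_page, max_pages)
        if page >= last_page:
            break
        page += 1
    return all_results

# Simulated API: 3 pages with 2 results each (made-up data for illustration)
def fake_fetch(page):
    return {'response': {'currentPage': page, 'pages': 3,
                         'results': [{'id': f"item-{page}-{n}"} for n in range(2)]}}

print(len(collect_results(fake_fetch)))
print(len(collect_results(fake_fetch, max_pages=2)))
```

With the real API, `fetch_page` could wrap a call such as `requests.get(API_ENDPOINT, params={**my_params, 'page': page}).json()`.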


In [10]:
print(f"We grabbed {len(all_results)} articles! The metadata for the first one:\n")

all_results[0]

We grabbed 3000 articles! The metadata for the first one:



{'id': 'world/2021/sep/21/hong-kong-leader-defends-election-after-single-opposition-figure-makes-it-to-1500-strong-committee',
 'type': 'article',
 'sectionId': 'world',
 'sectionName': 'World news',
 'webPublicationDate': '2021-09-21T05:46:48Z',
 'webTitle': 'Hong Kong leader defends election after single non-establishment figure picked for 1,500-strong committee',
 'webUrl': 'https://www.theguardian.com/world/2021/sep/21/hong-kong-leader-defends-election-after-single-opposition-figure-makes-it-to-1500-strong-committee',
 'apiUrl': 'https://content.guardianapis.com/world/2021/sep/21/hong-kong-leader-defends-election-after-single-opposition-figure-makes-it-to-1500-strong-committee',
 'isHosted': False,
 'pillarId': 'pillar/news',
 'pillarName': 'News'}

Let's extract the URLs and save them to a file for future reference.

In [11]:
urls = [res['webUrl'] for res in all_results]
# urls[:5]

In [12]:
with open('urls.txt', 'w') as f:
    for url in urls:
        f.write(f"{url}\n")

We can also check what kind of page we got:

In [13]:
types = [res['type'] for res in all_results]
set(types)

{'article', 'gallery', 'interactive', 'liveblog'}

Not all items are articles. This could be a problem when scraping text, since a liveblog, for example, has a more complex page structure.
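One option would be to keep only the items whose `type` is `article` before scraping. This is a small sketch with made-up sample records; the helper `urls_of_type` is hypothetical.

```python
# Made-up records with the same keys as the API results above
sample_results = [
    {'type': 'article', 'webUrl': 'https://example.com/a'},
    {'type': 'liveblog', 'webUrl': 'https://example.com/b'},
    {'type': 'gallery', 'webUrl': 'https://example.com/c'},
    {'type': 'article', 'webUrl': 'https://example.com/d'},
]

def urls_of_type(results, wanted='article'):
    """Return the webUrl of every result whose type matches `wanted`."""
    return [res['webUrl'] for res in results if res['type'] == wanted]

print(urls_of_type(sample_results))
```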

### Scrape each article's body from its URL

Now that we have all the URLs, let's retrieve the HTML of every page. The server sometimes becomes overloaded and responds with a 429 status code, so we wait a bit before resuming our requests. We store successful responses in a dictionary, which makes it simple to check which URLs failed and still need to be retrieved.

In [14]:
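articles = {}
# Keep sweeping over the URLs until every one has been fetched successfully.
# Note: a URL that fails permanently would make this loop run forever.
while len(articles) < len(urls):
    for i, url in enumerate(urls):
        if i not in articles:
            try:
                article = requests.get(url)
                article.raise_for_status()
                articles[i] = article
            except requests.exceptions.RequestException:
                # Back off for a while; this URL is retried on the next sweep.
                time.sleep(10)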
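The loop above retries indefinitely. A bounded alternative is sketched below, with a growing delay between attempts. It is only an illustration: `fetch_with_retries` is a hypothetical helper, `fetch` stands in for the real download function, and `flaky_fetch` simulates a server that fails twice. With `requests`, the broad `except Exception` would more sensibly be narrowed to `requests.exceptions.RequestException`.

```python
import time

def fetch_with_retries(fetch, url, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds or max_attempts is reached.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    Returns the result, or None if every attempt raised an exception.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return None

# Simulated fetcher that fails twice before succeeding
calls = {'n': 0}
def flaky_fetch(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError("simulated 429")
    return f"content of {url}"

waits = []  # record the delays instead of actually sleeping
result = fetch_with_retries(flaky_fetch, "https://example.com/a", sleep=waits.append)
print(result, waits)
```

URLs for which the helper returns `None` can be logged and skipped, so one permanently broken link does not stall the whole run.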

Let's check that we retrieved all of them successfully.

In [15]:
print(f"n_articles: {len(articles)} vs n_urls: {len(urls)}")

codes = []
for art in articles.values():
    codes.append(art.status_code)
print(f"codes encountered: {set(codes)}")

n_articles: 3000 vs n_urls: 3000
codes encountered: {200}


We now parse the HTML as we did before.

In [16]:
def text_from_response(resp):
    soup_article = BeautifulSoup(resp.content, 'html.parser')
    body = soup_article.find_all('div', class_='article-body-commercial-selector')
    if len(body)==1:
        ps = body[0].find_all('p', class_='dcr-s23rjr')
        par_list = [p.text for p in ps]
        text = " ".join(par_list)
    else:
        text = 'Missing'
    
    return text, len(body)

In [17]:
all_texts = []
lengths = []

for k, art in articles.items():
    text, l_body = text_from_response(art)
    cond = text in ('', 'Missing')  # compare by value; `is` tests object identity
    if not cond:
        all_texts.append(text)
    lengths.append(l_body)

In [18]:
len(all_texts), len(lengths)

(2033, 3000)

Some responses have no body, or an empty one, so we don't collect them (note: it should be possible to catch these cases during the HTML parsing).
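One way to catch these cases during parsing is to have the parsing function return `None` when the container is missing or yields no text. The sketch below makes two simplifying assumptions that differ from the code above: it collects every `p` tag inside the container rather than only those with a specific class, and it uses the first matching `div`. The function name `extract_text` and the inline HTML samples are made up for illustration.

```python
from bs4 import BeautifulSoup

def extract_text(html, container_class='article-body-commercial-selector'):
    """Return the joined paragraph text, or None if the container is missing or empty."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.find('div', class_=container_class)
    if container is None:
        return None
    paragraphs = [p.get_text().strip() for p in container.find_all('p')]
    text = " ".join(par for par in paragraphs if par)
    return text or None

# Made-up HTML fragments for illustration
with_body = '<div class="article-body-commercial-selector"><p>First.</p><p>Second.</p></div>'
without_body = '<div class="other"><p>Not an article.</p></div>'
print(extract_text(with_body))
print(extract_text(without_body))
```

The caller can then filter with `if text is not None`, instead of comparing against sentinel strings.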

In [19]:
# some responses have no body
print(set(lengths))

# how many?
d = {}
for i in lengths:
    d[i] = d.get(i, 0) + 1 
print(d)

{0, 1}
{1: 2831, 0: 169}
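The same tally can be written with `collections.Counter` from the standard library. This is a small sketch, with made-up values standing in for `lengths`.

```python
from collections import Counter

sample_lengths = [1, 1, 0, 1, 0]  # made-up values for illustration
counts = Counter(sample_lengths)
print(dict(counts))
```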


Let's convert this to a DataFrame and save it as a csv file.

In [20]:
import pandas as pd

df = pd.DataFrame({'Content': all_texts})
print(df.shape)
df.head()

(2033, 1)


| | Content |
|---|---|
| 0 | Hong Kong’s chief executive, Carrie Lam, has d... |
| 1 | International companies are being forced to re... |
| 2 | Hong Kong authorities have raided the city’s T... |
| 3 | The Hong Kong activist Andy Li and paralegal C... |
| 4 | For much of the last year Kacey Wong was wakin... |


In [21]:
df.to_csv("articles.csv", index=False)

We can re-load the file to check everything is in order.

In [22]:
df2 = pd.read_csv("articles.csv")
df2.head()

| | Content |
|---|---|
| 0 | Hong Kong’s chief executive, Carrie Lam, has d... |
| 1 | International companies are being forced to re... |
| 2 | Hong Kong authorities have raided the city’s T... |
| 3 | The Hong Kong activist Andy Li and paralegal C... |
| 4 | For much of the last year Kacey Wong was wakin... |


In [23]:
df2.isna().sum()

Content    0
dtype: int64

And we are done. Another way to save text data would be to save each article to a separate .txt file, but for now this will suffice.
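As a sketch of the per-file alternative mentioned above, the snippet below writes each text to its own numbered `.txt` file. The helper `save_texts`, the file-naming scheme and the sample texts are all assumptions made for illustration. It writes into a temporary directory so that it leaves nothing behind.

```python
from pathlib import Path
import tempfile

def save_texts(texts, out_dir):
    """Write each text to out_dir/article_00000.txt, article_00001.txt, ..."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, text in enumerate(texts):
        path = out_dir / f"article_{i:05d}.txt"
        path.write_text(text, encoding='utf-8')
        paths.append(path)
    return paths

sample_texts = ["First article text.", "Second article text."]  # made-up data
with tempfile.TemporaryDirectory() as tmp:
    written = save_texts(sample_texts, tmp)
    print([p.name for p in written])
```

In the notebook, `save_texts(all_texts, "articles")` would write one file per scraped article into an `articles` directory.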