# Fetching Articles from an API

The previous notebook demonstrated how to fetch posts from Reddit.

This notebook demonstrates how to fetch articles from a news site through its API. Specifically, we will get articles from [The Guardian](https://www.theguardian.com/), which offers an [API](https://open-platform.theguardian.com/) for retrieving its content. Registration is easy and free for developer use, which is enough for our purposes.

### Importing libraries

To make web requests to the API, we need to import `requests` and to process JSON, we import `json`.

We will also define a utility function `request_api()` to wrap the request process. The `endpoint` parameter names the endpoint to call; for example, the `search` endpoint returns content. The `apiKey` parameter is our API key, which every request requires. The `params` keyword arguments are the query parameters that the endpoint accepts. They are appended to the URL as a query string, as in any GET request (e.g. `http://example.com?key=value&key2=value2`). Because Python keyword names cannot contain hyphens, underscores in the names are converted to hyphens, so `page_size` becomes `page-size`.

In [None]:
import requests
import json

def request_api(endpoint, apiKey, **params):
    API_SITE = "https://content.guardianapis.com/"

    # The API uses hyphenated parameter names (e.g. page-size), so convert
    # the underscores in the Python keyword names.
    query = {key.replace('_', '-'): value for key, value in params.items()}
    query['api-key'] = apiKey

    # Let requests build and percent-encode the query string.
    return requests.get(API_SITE + endpoint, params=query)
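
To make the URL construction concrete, the sketch below builds only the query string, without sending a request. It uses the standard library's `urllib.parse.urlencode`; `build_query` is a hypothetical helper name, and the `q` value is just an example parameter chosen to show how special characters are encoded.

```python
from urllib.parse import urlencode

def build_query(api_key, **params):
    """Hypothetical helper: turn keyword arguments into a URL query string."""
    # Python keyword names cannot contain hyphens, so underscores are
    # converted to the hyphenated names the API expects.
    pairs = {key.replace('_', '-'): value for key, value in params.items()}
    pairs['api-key'] = api_key
    # urlencode percent-encodes special characters in names and values.
    return urlencode(pairs)

print(build_query('test', page_size=20, q='climate change'))
# page-size=20&q=climate+change&api-key=test
```

Encoding matters as soon as a value contains a space, `&`, or `=`, which plain string concatenation would pass through unchanged.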

### Preparing the API key

To access the API, we need to read the key. The key only allows a limited number of requests per day, so it is important to keep the number of requests low. It must also stay out of the repository, since anyone who can read the repository could then use the key and exhaust our quota.

One way to do this is to store the key in a file named `key.txt` and then add that file name to a file named `.gitignore`, which prevents git from committing the key file.

In [None]:
# Strip surrounding whitespace: a trailing newline would otherwise end up in the request URL.
with open('key.txt', 'r') as file:
    API_KEY = file.read().strip()
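
A stray newline at the end of `key.txt` is easy to miss, because `file.read()` returns it as part of the key. As a minimal sketch, a small loader can strip whitespace and fail early when the file is empty; `load_api_key` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path

def load_api_key(path='key.txt'):
    """Hypothetical helper: read an API key from a text file."""
    # strip() removes the trailing newline that most editors add.
    key = Path(path).read_text(encoding='utf-8').strip()
    if not key:
        raise ValueError('No API key found in ' + str(path))
    return key
```

With this in place, `API_KEY = load_api_key()` could stand in for the cell above.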

### Experimenting with the API

Of interest is the `/search` endpoint, which we will use to fetch news articles. It supports parameters such as the number of results per page, the page number, and a date range. More information can be found in the [documentation](https://open-platform.theguardian.com/documentation/).

Let's give it a try.

In [None]:
content = request_api('search', API_KEY)
test = json.loads(content.text)
len(test['response']['results'])

By default, `/search` returns ten results. If we want more results per page, we set the page size:

In [None]:
content = request_api('search', API_KEY, page_size=20)
test = json.loads(content.text)
len(test['response']['results'])

The documentation states that the maximum accepted value for `page-size` is 50. We will use that value.

To read further pages, we set `page` to the number of the page we want.
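
Paging can be sketched as a loop that requests page 1, 2, 3, and so on, and stops once a page comes back with fewer results than the page size. The generator below is a minimal sketch of that idea; `iter_results` and `fetch_page` are hypothetical names, and `fetch_page` stands in for a call to `request_api()` so the example runs without the API.

```python
def iter_results(fetch_page, page_size=50, max_pages=20):
    """Yield results page by page until a short page or max_pages is reached.

    fetch_page is a hypothetical callable that takes a page number and
    returns the list of results for that page.
    """
    for page in range(1, max_pages + 1):
        results = fetch_page(page)
        yield from results
        # A page with fewer results than page_size is the last one.
        if len(results) < page_size:
            break

# Stand-in for the API: 120 fake results served 50 at a time.
fake_results = list(range(120))

def fake_fetch(page, page_size=50):
    start = (page - 1) * page_size
    return fake_results[start:start + page_size]

print(len(list(iter_results(fake_fetch))))  # 120
```

The `max_pages` cap keeps the loop from running away if the stopping condition is never met.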

### Getting the articles

We need to retrieve all news articles from 11 March to 12 March 2021. Fortunately, the `/search` endpoint accepts the parameters `from-date` and `to-date`, which specify the date range.

In [None]:
content = request_api('search', API_KEY, page_size=50, from_date='2021-03-11', to_date='2021-03-12')
test = json.loads(content.text)
test['response']['results']
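
The date strings follow the `YYYY-MM-DD` pattern used in the call above. As a small sketch, they can be produced from `datetime.date` objects, which also catches an invalid or reversed range before a request is spent on it; `date_range_params` is a hypothetical helper.

```python
from datetime import date

def date_range_params(start, end):
    """Hypothetical helper: build from_date/to_date keyword arguments."""
    if start > end:
        raise ValueError('start date must not be after end date')
    # isoformat() gives YYYY-MM-DD, the format used in the requests above.
    return {'from_date': start.isoformat(), 'to_date': end.isoformat()}

params = date_range_params(date(2021, 3, 11), date(2021, 3, 12))
print(params)  # {'from_date': '2021-03-11', 'to_date': '2021-03-12'}
```

The result can be passed on as `request_api('search', API_KEY, page_size=50, **params)`.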

We also need each article's author(s), which the API calls contributors. By default, responses do not include contributor information; it is stored in tags that must be requested explicitly. To do this, we set `show-tags=contributor`.

In [None]:
content = request_api('search', API_KEY, page_size=50, from_date='2021-03-11', to_date='2021-03-12', show_tags='contributor')
test = json.loads(content.text)
test['response']['results']

In addition, we need the full text of the article. The documentation says one can extract the full text using `show-fields=body`, but that field is HTML, and we do not want markup tags in our text.

We can get the plain text instead by specifying `show-blocks=body` and reading the `bodyTextSummary` field of each block:

In [None]:
content = request_api('search', API_KEY, page_size=50, from_date='2021-03-11', to_date='2021-03-12', show_tags='contributor', show_blocks='body')
test = json.loads(content.text)
test['response']['results']

However, some entries are not news articles. For example, there are `liveblog` entries, which track a developing story; the coverage of the US elections in November 2020, for instance, was updated frequently with new entries as fresh news came in.

We want to filter on `type` so that we only read `article` entries. So, as we read each entry, we check whether its `type` is `article`.

In [None]:
test_articles = list(filter(lambda x: x['type'] == 'article', test['response']['results']))
test_articles
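
Before filtering, it can help to see which entry types a page actually contains. The sketch below counts them with `collections.Counter`; the `sample_results` list is made-up data that only mimics the `type` field, not real API output.

```python
from collections import Counter

# Made-up entries that only carry the field we care about here.
sample_results = [
    {'type': 'article'},
    {'type': 'liveblog'},
    {'type': 'article'},
]

# Count how often each type occurs.
type_counts = Counter(entry['type'] for entry in sample_results)
print(type_counts)

# The same filter as above, written as a list comprehension.
articles = [entry for entry in sample_results if entry['type'] == 'article']
print(len(articles))  # 2
```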

Now, we are ready to read entries into objects so we can save them as a JSON file. To start, we process the first entry:

In [None]:
test_title = test_articles[0]['webTitle']
test_date = test_articles[0]['webPublicationDate']
test_authors = []
for tag in test_articles[0]['tags']:
    if tag['type'] != 'contributor':
        continue
    test_authors.append(tag['webTitle'])
    
# For article entries, the number of blocks is only one, so it should be easy to extract the text.
test_text = test_articles[0]['blocks']['body'][0]['bodyTextSummary']

print(test_title)
print(test_date)
print(test_authors)
print(test_text)
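
Since the same extraction is needed for every entry, the steps above can be sketched as one function. `parse_entry` is a hypothetical helper, and `sample_entry` is made-up data shaped like the fields used above rather than a real API response.

```python
def parse_entry(entry):
    """Hypothetical helper: pull title, date, authors and text from one entry."""
    # Keep only the tags that describe contributors.
    authors = [tag['webTitle'] for tag in entry['tags']
               if tag['type'] == 'contributor']
    # Join the blocks in case an entry has more than one.
    text = ' '.join(block['bodyTextSummary']
                    for block in entry['blocks']['body'])
    return {
        'title': entry['webTitle'],
        'date': entry['webPublicationDate'],
        'authors': authors,
        'text': text,
    }

# Made-up entry for illustration only.
sample_entry = {
    'type': 'article',
    'webTitle': 'Example headline',
    'webPublicationDate': '2021-03-11T09:00:00Z',
    'tags': [
        {'type': 'contributor', 'webTitle': 'Jane Doe'},
        {'type': 'keyword', 'webTitle': 'Example topic'},
    ],
    'blocks': {'body': [{'bodyTextSummary': 'Example body text.'}]},
}

print(parse_entry(sample_entry))
```

Keeping the extraction in one place means a later change, such as cleaning the text, only has to be made once.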

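There seems to be something wrong with the text, probably caused by the encoding. Sequences such as `â€™` are what UTF-8 encoded punctuation looks like when its bytes are decoded with a single-byte encoding such as Windows-1252. Fortunately, we don't need typographic apostrophes and quotation marks, so we will replace these sequences with plain ASCII characters first.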

In [None]:
def clean_text(text):
    # Replace each mis-decoded sequence with a plain ASCII equivalent.
    replacements = {"â€™": "'", "â€œ": "\"", "â€�": "\"", "â€¢": "*", "Ã©": "e", "Ã¼": "u", "â€“": "-"}
    for bad, good in replacements.items():
        text = text.replace(bad, good)
    return text

print(clean_text(test_text))
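
The garbled sequences are consistent with UTF-8 bytes being decoded as Windows-1252. The sketch below reproduces the effect in pure Python and shows that it can be reversed when every byte survives the round trip. It illustrates the likely cause and is not a confirmed diagnosis of what happened to the API response.

```python
original = 'It’s a “test”'

# Encode as UTF-8, then decode with the wrong single-byte encoding.
# errors='replace' turns bytes that Windows-1252 cannot map into U+FFFD.
garbled = original.encode('utf-8').decode('cp1252', errors='replace')
print(garbled)

# A sequence with no unmappable bytes can be repaired by reversing the steps.
apostrophe = '’'.encode('utf-8').decode('cp1252')
print(apostrophe)  # â€™
print(apostrophe.encode('cp1252').decode('utf-8'))  # ’
```

If this is indeed the cause, setting `content.encoding = 'utf-8'` before reading `content.text` is the usual way to avoid the problem with `requests`, and it would preserve the original characters instead of replacing them.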

The same must be done for any field that may contain such characters, like the title of the article.

Now that the cleaning is taken care of, let's compile everything.

In [None]:
test_article_objs = []
for article in test_articles:
    # Permit only articles, not liveblogs or other
    if article['type'] != 'article':
        continue
    
    title = clean_text(article['webTitle'])
    date = article['webPublicationDate']
    authors = []
    for tag in article['tags']:
        if tag['type'] != 'contributor':
            continue
        authors.append(clean_text(tag['webTitle']))

    # For article entries, the number of blocks is only one, so it should be easy to extract the text.
    text = clean_text(article['blocks']['body'][0]['bodyTextSummary'])
    
    test_article_objs.append({
        'title': title,
        'date': date,
        'authors': authors,
        'text': text
    })
    
test_article_objs

### Putting it all together

We need to collect as many articles as we can between 11 and 12 March 2021. Each request returns at most one page of results, so we loop over the pages until a page comes back with fewer results than the page size. We also apply an artificial cap on the number of pages so we do not overload the API.

In [None]:
max_limit = 20
from_date = '2021-03-11'
to_date = '2021-03-12'
page_limit = 50
all_articles = []

for page in range(1, max_limit + 1):
    print("Fetching page " + str(page))
    try:
        response = request_api('search', API_KEY, page=page, page_size=page_limit, from_date=from_date, to_date=to_date, show_tags='contributor', show_blocks='body')
        
        # Avoid upsetting the API :(
        if response.status_code != 200:
            break
        
        results = json.loads(response.text)['response']['results']
        
        for entry in results:
            # Permit only articles, not liveblogs or other
            if entry['type'] != 'article':
                continue

            title = clean_text(entry['webTitle'])
            date = entry['webPublicationDate']
            authors = []
            for tag in entry['tags']:
                if tag['type'] != 'contributor':
                    continue
                authors.append(clean_text(tag['webTitle']))

            # For article entries, the number of blocks is only one, so it should be easy to extract the text.
            text = clean_text(entry['blocks']['body'][0]['bodyTextSummary'])

            all_articles.append({
                'title': title,
                'date': date,
                'authors': authors,
                'text': text
            })
            
        # We have reached the limit (probably)
        if len(results) < page_limit:
            break
    except Exception as error:
        # If we somehow encounter an error, report it and stop
        print("Stopping after error: " + str(error))
        break
print("Done!")

### Sanity Check

In [None]:
all_articles

In [None]:
len(all_articles)

Now that we have collected all the articles, we can serialize the whole list as JSON and write it to the file `articles.json`.

In [None]:
with open("articles.json", "w") as f:
    json.dump(all_articles, f, indent=4)
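
As a final check, the file can be read back and compared with the list in memory. The sketch below uses a small made-up list and a temporary directory so it runs on its own; with the real data, `articles` would be `all_articles` and the path `articles.json`.

```python
import json
import os
import tempfile

# Made-up data standing in for all_articles.
articles = [{'title': 'Example', 'date': '2021-03-11T09:00:00Z',
             'authors': ['Jane Doe'], 'text': 'Example body text.'}]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'articles.json')
    with open(path, 'w') as f:
        json.dump(articles, f, indent=4)

    # Read the file back and confirm nothing was lost in the round trip.
    with open(path) as f:
        loaded = json.load(f)

print(loaded == articles)  # True
```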