# The Guardian API

In the `05_web_scraping_beautiful_soup.ipynb` notebook, we saw examples of how BeautifulSoup can be used 
to parse messy HTML, to extract information, and to act as a rudimentary web crawler. 
We used The Guardian as an illustrative example of how this can be achieved. 
We chose The Guardian because it provides a REST API to its servers. 
With the REST API it is possible to perform specific queries and to receive 
current information in the format described in the API guide (i.e. JSON):

http://open-platform.theguardian.com/

In order to use their API, you will need to register for an API key. 
At the time of writing (Jan 28, 2020) this was an automated process that can be completed at 

https://bonobo.capi.gutools.co.uk/register/developer

On registration you will receive an API key which will look like: 303qwe2k-xxxx-xxxx-xxxx-eff86a248059

The API is documented here: 

http://open-platform.theguardian.com/documentation/

Python bindings to the API are available here:

https://github.com/prabhath6/theguardian-api-python

These can easily be integrated into a web crawler that is based on API calls rather than on HTML parsing. 

We use four parameters in our queries here: 

1. `section`: the section of the newspaper that we are interested in querying. In this case we will look at 
the technology section 

2. `order-by`: We have specified that the newest items should be closer to the front of the query list 

3. `api-key`: In this notebook, the api-key is left as `test` (which works here), but for *real* deployment of such a spider, an API key obtained from The Guardian should be specified. For the lab tasks, you should replace the `test` API key with your personal API key. 

4. `page-size`: The number of results to return. 
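
As a minimal sketch of how these four parameters become a query string, the standard library's `urllib.parse.urlencode` can assemble them. The helper name `build_search_url` and its default values are illustrative assumptions, not part of the Guardian API or its Python bindings.

```python
from urllib.parse import urlencode

BASE_URL = 'https://content.guardianapis.com/search'

def build_search_url(section, api_key='test', order_by='newest', page_size=10):
    """Return a search URL for one section (hypothetical helper)."""
    args = {
        'section': section,
        'order-by': order_by,
        'api-key': api_key,
        'page-size': page_size,
    }
    # urlencode joins the pairs with '&' and escapes unsafe characters.
    return '{}?{}'.format(BASE_URL, urlencode(args))

print(build_search_url('technology'))
```

Passing the same dictionary as `requests.get(BASE_URL, params=args)` lets `requests` do this encoding for you.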

In [74]:
from __future__ import print_function

import requests 
import json 

Hard coding a secret such as a password or a key in a code file is a security smell. Instead of hard coding your api-key, you should store your api-key in a config file.

# Fetch Api-Key from Config File 
#### Note: You need to create a config file containing your api-key and place it in the same directory as this notebook, otherwise the code block below will not work 

The content of my config file looks like this: 

```
[guardian]
api-key=303qwe2k-xxxx-xxxx-xxxx-eff86a248059
```

My config file is named: ```myconfig.cfg```

Note: The api-key above is not a real Guardian api-key. If you use that, you will get an error. 

In [75]:
from configparser import ConfigParser

parser = ConfigParser()
parser.read('./myconfig.cfg')

myapikey = parser.get('guardian', 'api-key')

# If you cannot create a config file, comment the three lines above and uncomment the line below.

# myapikey = '303qwe2k-xxxx-xxxx-xxxx-eff86a248059' # Replace with your api-key. This is not a real api-key.
# print(myapikey)
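
If the config file may be missing on some machines, one possible variation is to fall back to an environment variable and finally to the `test` key. This is only a sketch: the function name `load_api_key` and the environment variable `GUARDIAN_API_KEY` are assumptions made for illustration.

```python
import os
from configparser import ConfigParser

def load_api_key(path='./myconfig.cfg', env_var='GUARDIAN_API_KEY', default='test'):
    """Look for the key in the config file, then the environment, then use a default."""
    parser = ConfigParser()
    parser.read(path)  # read() silently skips files that do not exist
    key = parser.get('guardian', 'api-key', fallback=None)
    if key:
        return key
    # env_var is a hypothetical name; choose whatever suits your setup.
    return os.environ.get(env_var, default)

myapikey = load_api_key()
```

Either way, the key stays out of the notebook itself, so the notebook can be shared without leaking it.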

# Inspect all sections and search for technology-based sections

In [76]:
url = 'https://content.guardianapis.com/sections?api-key=' + myapikey
req = requests.get(url)
src = req.text 

In [77]:
json.loads(src)['response']['status']

'ok'

In [78]:
sections = json.loads(src)['response']

print(sections.keys())

dict_keys(['status', 'userTier', 'total', 'results'])


In [79]:
print(json.dumps(sections['results'][0], indent=2, sort_keys=True))

{
  "apiUrl": "https://content.guardianapis.com/about",
  "editions": [
    {
      "apiUrl": "https://content.guardianapis.com/about",
      "code": "default",
      "id": "about",
      "webTitle": "About",
      "webUrl": "https://www.theguardian.com/about"
    }
  ],
  "id": "about",
  "webTitle": "About",
  "webUrl": "https://www.theguardian.com/about"
}


In [80]:
for result in sections['results']: 
    if 'tech' in result['id'].lower(): 
        print(result['webTitle'], result['apiUrl'])

Technology https://content.guardianapis.com/technology


# Manual query on whole API

In [81]:
# Specify the arguments
args = {
    'section': 'technology', 
    'order-by': 'newest', 
    'api-key': myapikey, 
    'page-size': '100',
#     'q' : 'privacy%20AND%20data'
}

# Construct the URL
base_url = 'https://content.guardianapis.com/search'
url = '{}?{}'.format(
    base_url, 
    '&'.join(["{}={}".format(kk, vv) for kk, vv in args.items()])
)

# Make the request and extract the source
req = requests.get(url) 
src = req.text

In [82]:
print('Number of characters received:', len(src))

Number of characters received: 59669


The API returns JSON, so we parse it using Python's built-in `json` module. 
The API specifies that all data are returned within the `response` key, even under failure. 
Therefore, I descend immediately to the `response` field. 

# Parsing the JSON

In [83]:
response = json.loads(src)['response']
print('The following are available:\n ', sorted(response.keys()))

The following are available:
  ['currentPage', 'orderBy', 'pageSize', 'pages', 'results', 'startIndex', 'status', 'total', 'userTier']


# Verifying the status code

It is important to verify that the status message is `ok` before continuing - if it is not `ok` no 'real' data 
will have been received. 

In [84]:
assert response['status'] == 'ok'
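
An `assert` is skipped when Python runs with the `-O` flag, so for a long-running crawler an explicit check may be preferable. The sketch below wraps parsing and checking in one hypothetical helper, `unwrap_response`; where the API places its error message is an assumption here, so the code falls back to showing the whole payload.

```python
import json

def unwrap_response(src):
    """Parse raw JSON text and return the `response` dict if its status is 'ok'."""
    payload = json.loads(src)
    response = payload.get('response', {})
    if response.get('status') != 'ok':
        # Assumption: an error description may sit under 'message';
        # if it does not, report the full payload instead.
        message = response.get('message') or payload.get('message') or payload
        raise RuntimeError('Guardian API request failed: {}'.format(message))
    return response

# Usage with a hand-written payload rather than a live request:
print(unwrap_response('{"response": {"status": "ok", "total": 3}}')['total'])
```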

# Listing the results 

The API standard states that the results will be found in the `results` field under the `response` field. 
Furthermore, the URLs will be found in the `webUrl` field, and the title will be found in the `webTitle` 
field. 

First let's look at what a single result looks like in full; then I will print a restricted 
set of fields for the full set of results.

In [85]:
print(json.dumps(response['results'][0], indent=2, sort_keys=True))

{
  "apiUrl": "https://content.guardianapis.com/technology/2022/jan/26/spotify-neil-young-joe-rogan-covid-misinformation",
  "id": "technology/2022/jan/26/spotify-neil-young-joe-rogan-covid-misinformation",
  "isHosted": false,
  "pillarId": "pillar/news",
  "pillarName": "News",
  "sectionId": "technology",
  "sectionName": "Technology",
  "type": "article",
  "webPublicationDate": "2022-01-27T12:26:09Z",
  "webTitle": "Spotify to remove Neil Young music in feud over Joe Rogan\u2019s false Covid claims",
  "webUrl": "https://www.theguardian.com/technology/2022/jan/26/spotify-neil-young-joe-rogan-covid-misinformation"
}


## Task 6. Response Statistics  

Use Guardian's API to identify the count of all news stories published under the Technology section. List the page size and the number of pages in the result set.

Note that I commented out ```'q' : 'privacy%20AND%20data'``` in the args a few blocks above.

In [86]:
# Solution to Task 6

print('Total news stories:', response['total']) 
print('Pages:', response['pages'])
print('Page size:', response['pageSize'])

Total news stories: 54785
Pages: 548
Page size: 100
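
The three numbers are related: the page count is the total divided by the page size, rounded up. The sketch below checks that arithmetic against the output above and shows one way to generate the arguments for each page using the API's `page` parameter. The helper names `page_count` and `page_args` are illustrative.

```python
import math

def page_count(total, page_size):
    """Number of pages needed to hold `total` results."""
    return math.ceil(total / page_size)

def page_args(base_args, pages):
    """Yield one copy of the arguments per page, adding the 1-based `page` parameter."""
    for page in range(1, pages + 1):
        yield dict(base_args, page=page)

print(page_count(54785, 100))  # 548, matching the `pages` field above

# Arguments for the first three pages of a technology query
for args in page_args({'section': 'technology', 'page-size': 100}, 3):
    print(args)
```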


## Task 7. News Stories About a Specific Topic 

Return all stories in the technology section that are about privacy.

Solution: ```'q' : 'privacy'```

Filter the stories that talk about WhatsApp and Signal.  

Solution: ```'q' : 'privacy%20AND%20whatsapp%20AND%20signal'```


Are there any stories about privacy, WhatsApp, and Signal that do not mention AI? 

Solution: ```'q' : 'privacy%20AND%20whatsapp%20AND%20signal%20AND%20NOT%20(AI%20OR%20%22Artificial%20Intelligence%22)%20'```
> privacy AND whatsapp AND signal AND NOT (AI OR "Artificial Intelligence")


List these stories. Solution in the code cell below. 


The search queries in 7a and 7b are similar to Task 7; I list the query parameters below. 

#### 7a. All News Stories About a Phrase 

Return all news stories that mention the phrase "short squeeze". 

```'q' : '%22short%20squeeze%22'```

List the ones that are in the business section of the Guardian. 

```
'q' : '%22short%20squeeze%22'
'section' : 'business'
```


#### 7b. All News Stories About a Person 

Return all news stories about Elon Musk published by The Guardian between 2020 and 2022.  

How many of these news stories are about Dogecoin?  Of the stories that are about Elon Musk and Dogecoin, how many of those do not mention Tesla?  

```
'from-date':'2020-01-01'
'to-date':'2022-01-25'
'order-by':'newest'
'page':'1'
'q':'%22Elon%20Musk%22%20AND%20dogecoin%20AND%20NOT%20Tesla' 
```
\# q = "Elon Musk" AND dogecoin AND NOT Tesla

Hints. In the search string, you could use ```%20``` for ```space```; ```%22``` for ```double quote```; ```AND``` for conjunction; ```OR``` for disjunction; and ```NOT``` for negation
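
Rather than encoding the search string by hand, the readable form can be passed through `urllib.parse.quote`. This is a small sketch; `encode_query` is a hypothetical helper name, and parentheses are kept literal only so that the result matches the hand-written strings above.

```python
from urllib.parse import quote

def encode_query(query):
    """Percent-encode a readable search string (spaces -> %20, quotes -> %22)."""
    return quote(query, safe='()')

print(encode_query('"Elon Musk" AND dogecoin AND NOT Tesla'))
print(encode_query('privacy AND whatsapp AND signal AND NOT (AI OR "Artificial Intelligence")'))
```

If the arguments are later passed through `requests.get(..., params=args)`, use the plain readable string instead, because the `%` signs of an already encoded string would be encoded a second time.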

In [87]:
# Solution to Task 7 below; change the q below for each subpart

args = {
    'section': 'technology', 
    'order-by': 'newest', 
    'api-key': myapikey, 
    'page-size': '10',
#     'q' : 'privacy' # All stories about privacy
#     'q' : 'privacy%20AND%20whatsapp%20AND%20signal' # privacy stories mentioning WhatsApp and Signal 
    'q' : 'privacy%20AND%20whatsapp%20AND%20signal%20AND%20NOT%20(AI%20OR%20%22Artificial%20Intelligence%22)%20'
    # privacy AND whatsapp AND signal AND NOT (AI OR "Artificial Intelligence"): stories not mentioning AI
}    

# Construct the URL
base_url = 'https://content.guardianapis.com/search'
url = '{}?{}'.format(
    base_url, 
    '&'.join(["{}={}".format(kk, vv) for kk, vv in args.items()])
)

# Make the request and extract the source
req = requests.get(url) 
src = req.text

print('Number of characters received:', len(src))

# Parsing JSON
response = json.loads(src)['response']
print('The following are available:\n ', sorted(response.keys()))

# Verifying the status code
assert response['status'] == 'ok'

# Listing the results
print(json.dumps(response['results'][0], indent=2, sort_keys=True))

# Response statistics
print('Total news stories:', response['total']) 
print('Pages:', response['pages'])
print('Page size:', response['pageSize'])

Number of characters received: 5992
The following are available:
  ['currentPage', 'orderBy', 'pageSize', 'pages', 'results', 'startIndex', 'status', 'total', 'userTier']
{
  "apiUrl": "https://content.guardianapis.com/technology/2021/oct/08/i-might-delete-it-facebooks-problem-with-younger-users",
  "id": "technology/2021/oct/08/i-might-delete-it-facebooks-problem-with-younger-users",
  "isHosted": false,
  "pillarId": "pillar/news",
  "pillarName": "News",
  "sectionId": "technology",
  "sectionName": "Technology",
  "type": "article",
  "webPublicationDate": "2021-10-08T13:56:08Z",
  "webTitle": "\u2018I might delete it\u2019: Facebook\u2019s problem with younger users",
  "webUrl": "https://www.theguardian.com/technology/2021/oct/08/i-might-delete-it-facebooks-problem-with-younger-users"
}
Total news stories: 40
Pages: 4
Page size: 10


In [88]:
# This cell is part of the initial iPython notebook that was shared.

for result in response['results']: 
    print(result['webUrl'][:70], result['webTitle'][:20])

https://www.theguardian.com/technology/2021/oct/08/i-might-delete-it-f ‘I might delete it’:
https://www.theguardian.com/technology/2021/oct/06/tell-us-are-you-con  Tell us: are you co
https://www.theguardian.com/technology/2021/aug/13/uk-security-chiefs- UK security chiefs i
https://www.theguardian.com/technology/2021/jun/13/whatsapp-boss-decri WhatsApp boss decrie
https://www.theguardian.com/technology/2021/may/11/what-happens-when-w What happens when Wh
https://www.theguardian.com/technology/2021/may/09/how-private-is-your How private is your 
https://www.theguardian.com/technology/2021/feb/22/whatsapp-to-try-aga WhatsApp to try agai
https://www.theguardian.com/technology/2021/feb/14/facebook-v-apple-th Facebook v Apple: th
https://www.theguardian.com/technology/2021/jan/26/uk-regulator-to-wri UK regulator to writ
https://www.theguardian.com/technology/2021/jan/24/whatsapp-loses-mill WhatsApp loses milli


## Task 8. Request Specific Content from the API

Fetch the ith result from the list obtained from the search query formed in Task 7. 

Identify the id of the ith result and fetch the headline and body text of the news story.  

### Solution 
#### 1. Fetching the ith result

Let's now request a specific piece of content from the API. 

We select the ith result from the above response and get its ```apiUrl``` and ```id```:

In [89]:
i = 0
api_url = response['results'][i]['apiUrl']
api_id = response['results'][i]['id']

print(api_url)
print(api_id)

https://content.guardianapis.com/technology/2021/oct/08/i-might-delete-it-facebooks-problem-with-younger-users
technology/2021/oct/08/i-might-delete-it-facebooks-problem-with-younger-users


#### 2. Fetching the headline and body text of the news story with id 

We then use the ```id``` to construct a search URL string to request this piece of content from the API.

(Note that you need to include the ```api-key``` in the search. You also need to specify if you want to include data fields other than the article metadata e.g. ```body``` and ```headline``` are included in the example below.)

In [90]:
base_url = "https://content.guardianapis.com/search?"
search_string = "ids=%s&api-key=%s&show-fields=headline,body" %(api_id, myapikey)

url = base_url + search_string
print(url)

https://content.guardianapis.com/search?ids=technology/2021/oct/08/i-might-delete-it-facebooks-problem-with-younger-users&api-key=303qwe2k-xxxx-xxxx-xxxx-eff86a248059&show-fields=headline,body
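
Printing the full request URL also prints the API key, which then ends up in the saved notebook output. One way to avoid this, sketched below, is to mask the key before printing. The helper name `mask_api_key` is an assumption for illustration.

```python
import re

def mask_api_key(url):
    """Replace the value of the api-key parameter with asterisks."""
    return re.sub(r'(api-key=)[^&]+', r'\1****', url)

example = ('https://content.guardianapis.com/search?ids=technology/2021/oct/08/'
           'i-might-delete-it-facebooks-problem-with-younger-users'
           '&api-key=my-secret-key&show-fields=headline,body')
print(mask_api_key(example))
```

The request itself still uses the unmasked `url`; only the printed copy is altered.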


In [91]:
req = requests.get(url) 
src = req.text

In [92]:
response = json.loads(src)['response']
assert response['status'] == 'ok'

In [93]:
print(response['results'][0]['fields']['headline'])

‘I might delete it’: Facebook’s problem with younger users


In [94]:
body = response['results'][0]['fields']['body']
print(body)

<p>Oliver Coghlan embodies Facebook’s problems with teen and young adult audiences – a growing number of them do not like it. The 23-year-old says he stopped using Facebook regularly three years ago and he is considering deleting the app. His sole use for it now is to check people’s birthdays.</p> <p>“I haven’t deleted it yet but I might do soon – I really don’t like the company’s monopolistic behaviour,” said Coghlan, a British student based in the Netherlands. He added that the EU referendum and the 2016 US presidential election, and the online anger that accompanied those polls, convinced him that he wanted to spend less time on Facebook’s main platform.</p>  <figure class="element element-image element--supporting" data-media-id="fac86f78ac2f178a655ac527d687f5edb9750de7"> <img src="https://media.guim.co.uk/fac86f78ac2f178a655ac527d687f5edb9750de7/0_60_487_609/400.jpg" alt="Oliver Coghlan" width="400" height="500" class="gu-image" /> <figcaption> <span class="element-image__caption"

#### 3. Simple Text Processing: Count Word Frequencies and Store in a Data Frame

We can now do some simple text processing on the article text, e.g. count the word frequencies:

In [95]:
# First, we need to clean that data -- remove HTML tags. 
# Here is a "not so good" way to do it. You could consider BeautifulSoup here!

words = body.replace('<p>','').replace('</p>','').split()
print(len(words))
unique_words = list(set(words))
print(len(unique_words))
#count_dictionary = {word: count for word, count in zip(words, [words.count(w) for w in words])}
count_dictionary = {'word': unique_words, 'count': [words.count(w) for w in unique_words]}

993
534
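
Replacing only `<p>` and `</p>` leaves other markup such as `<figure ...>` and `<img ...>` in the word list. The comment above suggests BeautifulSoup; as a dependency-free stand-in, the sketch below uses the standard library's `html.parser` to keep only the text between tags. The names `TextExtractor` and `strip_tags` are illustrative, and the set of tags treated as word boundaries is an assumption.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML fragment, discarding the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_endtag(self, tag):
        # Add a space after block-level elements so adjacent paragraphs do not merge.
        if tag in ('p', 'li', 'figcaption', 'h2'):
            self.chunks.append(' ')

def strip_tags(html):
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    # Collapse runs of whitespace into single spaces.
    return ' '.join(''.join(extractor.chunks).split())

print(strip_tags('<p>One.</p><p>Two <b>bold</b> words.</p>'))
```

With this in place, `words = strip_tags(body).split()` would replace the chained `replace` calls.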


In [96]:
import pandas as pd

In [97]:
df = pd.DataFrame(count_dictionary)
df.sort_values(by='count', ascending=False)

# We could export the data frame in a CSV and observe the complete output
# df.to_csv('term-frequency.csv')

|     | word | count |
|-----|------|-------|
| 392 | the | 39 |
| 309 | to | 34 |
| 490 | that | 21 |
| 288 | and | 20 |
| 49 | a | 17 |
| ... | ... | ... |
| 205 | users,” | 1 |
| 203 | crossbench | 1 |
| 202 | accountable | 1 |
| 201 | alternative</span> | 1 |


We now have a data frame with the word occurrence frequencies in the article. Try exporting the data frame to a CSV.

We notice that, because of punctuation marks, some words appear more than once. For instance, ```UK``` could occur as ```UK.``` and ```Facebook``` could also occur as ```Facebook,``` and ```Facebook,"```. 

One option to fix this would be to strip out the punctuation using Python string manipulation. 

You could also use regular expressions to remove the punctuation. 

Below is an imperfect solution using regular expressions. You will notice it fails for several instances. As you work on it, you will find a better solution. Please post yours on Teams or Discussion Forum. :)

In [98]:
import re  ## imports the regular expression library
words_wo_punctuation = re.sub(r'[^\w\s]','',body.replace('<p>','').replace('</p>','')).split()  

Note that the regex ```r'[^\w\s]'``` replaces any character in ```body``` that is neither a word character (```\w```) nor whitespace (```\s```) with the empty string ```''```.

In [99]:
unique_words = list(set(words_wo_punctuation))
print(len(unique_words))
count_dictionary = {'word': unique_words, 'count': [words_wo_punctuation.count(w) for w in unique_words]}

469


In [100]:
df = pd.DataFrame(count_dictionary)
df.sort_values(by='count', ascending=False)

# We could export the data frame in a CSV and observe the complete output
# df.to_csv('term-frequency-regex.csv')
# Open the CSV in a text editor or a spread sheet and analyse the output

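|     | word | count |
|-----|------|-------|
| 335 | the | 39 |
| 260 | to | 34 |
| 46 | a | 23 |
| 427 | that | 21 |
| 241 | and | 20 |
| ... | ... | ... |
| 184 | context | 1 |
| 183 | galvanised | 1 |
| 182 | decision | 1 |
| 180 | peoples | 1 |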


In [101]:
df.sort_values(by='count', ascending=False).to_csv('term-frequency-regex.csv')

This Python Regular Expression cheatsheet (Courtesy: Laura Gemmel) is useful: 
https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf

Use the cheat sheet to create a better regular expression and filter the body text. 
Post your solutions on the BB forum.
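
As one possible starting point (not a definitive answer), the sketch below removes anything that looks like an HTML tag first, then extracts words with `re.findall` instead of deleting punctuation, and counts them with `collections.Counter`. The function name `tokenize` and the choice to lowercase and to keep internal apostrophes are assumptions you may want to change.

```python
import re
from collections import Counter

def tokenize(html_text):
    """Split an HTML fragment into lowercase word tokens."""
    # Drop anything that looks like a tag, e.g. <p> or <figure class="...">.
    text = re.sub(r'<[^>]+>', ' ', html_text)
    # Keep runs of word characters, allowing internal apostrophes (haven't, Facebook’s).
    return re.findall(r"\w+(?:['’]\w+)*", text.lower())

sample = '<p>“I haven’t deleted it,” said the UK. The UK</p>'
tokens = tokenize(sample)
print(tokens)

counts = Counter(tokens)
print(counts['uk'], counts['the'])
```

`Counter(tokens).most_common()` returns the words already sorted by frequency, so the result can be passed to `pd.DataFrame(..., columns=['word', 'count'])` without calling `list.count` once per word.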