Name: Kagen Lim

Completed as HW08 of QMSS-G5072, Modern Data Structures, Fall 2020.

Contact: [kagen.lim@columbia.edu](mailto:kagen.lim@columbia.edu)

## 1. Choose your API

In [1]:
import requests
import json
import os
import pandas as pd

#### a) Choose an API and briefly describe the type of data you can obtain from it. Note: Please do not use any of the APIs we covered in lecture (e.g. NYTimes, Github etc.).

The Guardian is a United Kingdom-based newspaper. Their homepage can be found [here](https://www.theguardian.com/international). The API is maintained in-house as an open, publicly accessible service that lets developers access all of The Guardian's web content.

The API describes each article through several fields, for instance `sectionName` (i.e., the section of The Guardian under which the article is classified) and `webTitle` (i.e., the title of the article).

The API allows the user to enter search terms to uncover relevant articles of interest.

#### b) Provide a link to the API documentation and

**Link to API Documentation**: https://open-platform.theguardian.com/documentation/

#### c) the base URL of the API you intend to use.

**Base URL of the API**: https://content.guardianapis.com

## 2. Authentication

#### a) Briefly explain how the API authenticates the user. 

The Guardian API authenticates users through an API key, which is supplied as the `api-key` query parameter of the GET request. API keys are unique values issued to each user upon registration. This enables the API's maintainers to keep track of *exactly who* accesses what information.

#### b) Apply for an API key if necessary and provide the information (with relevant URL) how that can be done. Do not include the API key in the assignment submission.

**Apply for an API Key at this link**: https://open-platform.theguardian.com/access/ 

There are very few barriers to getting an API key for The Guardian's API. One simply needs to enter their particulars, email contact, and product name/reason for the key.

## 3. Send a Simple GET request

#### a) Execute a simple GET request to obtain a small amount of data from the API. Describe a few query parameters and add them to the query. If you have a choice of the output the API returns (e.g. XML or JSON), I suggest to choose JSON because it is easier to work with. Your output here should include the code for the GET request, including the query parameters, as well as a snippet of the output.

In [3]:
guardian_api = os.getenv('guardian_api')
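
Because `os.getenv` returns `None` when the variable is not set, a missing key would likely only surface later as a failed request. A minimal sketch of a stricter loader is shown here; the helper name `load_api_key` is hypothetical, and `guardian_api` is the environment variable name used in the cell above.

```python
import os


def load_api_key(env_var="guardian_api"):
    """Return the API key stored in an environment variable, or fail loudly."""
    key = os.getenv(env_var)
    if not key:
        # Covers both an unset variable (None) and an empty string.
        raise RuntimeError(f"Environment variable {env_var!r} is not set.")
    return key
```

Calling `load_api_key()` in place of `os.getenv('guardian_api')` would stop the notebook at this cell if the key is missing, rather than at the first request.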

#### Query Parameters: 
- `q` (search term): This should be a string input, a single word or phrase, that reflects what you would like to search for.
- `page-size`: This should be an integer input that reflects the number of results you'd like to have in the output.
- `api-key`: This should be the API key requested through the link in 2b).
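
As an illustration of how these three parameters combine into a request URL, here is a small sketch that builds the URL with the standard library. The helper name `build_search_url` and the placeholder key `YOUR_API_KEY` are made up for this example; the endpoint and parameter names are the ones used in this notebook.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://content.guardianapis.com/search"


def build_search_url(search_term, page_size, api_key):
    """Assemble a search URL, percent-encoding each parameter value."""
    params = {"q": search_term, "page-size": page_size, "api-key": api_key}
    # quote_via=quote encodes a space as %20 (the default would use +).
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"


print(build_search_url("journal articles", 6, "YOUR_API_KEY"))
# https://content.guardianapis.com/search?q=journal%20articles&page-size=6&api-key=YOUR_API_KEY
```

Encoding the values this way means a phrase can be written with ordinary spaces. `requests.get` also accepts a `params` dictionary, which performs a similar encoding step automatically.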

In [4]:
r1 = requests.get('https://content.guardianapis.com/search?q=coronavirus&page-size=2&api-key={}'.format(guardian_api))

#I want articles related to coronaviruses, therefore `/search?q=coronavirus`
#I want a sample size of 2, therefore `&page-size=2`
#I insert my API Key in `&api-key={}` 

In [7]:
print(json.dumps(r1.json(), indent=2))

{
  "response": {
    "status": "ok",
    "userTier": "developer",
    "total": 26205,
    "startIndex": 1,
    "pageSize": 2,
    "currentPage": 1,
    "pages": 13103,
    "orderBy": "relevance",
    "results": [
      {
        "id": "world/2020/nov/18/us-passes-250000-deaths-from-coronavirus",
        "type": "article",
        "sectionId": "world",
        "sectionName": "World news",
        "webPublicationDate": "2020-11-18T22:49:04Z",
        "webTitle": "US passes 250,000 deaths from coronavirus",
        "webUrl": "https://www.theguardian.com/world/2020/nov/18/us-passes-250000-deaths-from-coronavirus",
        "apiUrl": "https://content.guardianapis.com/world/2020/nov/18/us-passes-250000-deaths-from-coronavirus",
        "isHosted": false,
        "pillarId": "pillar/news",
        "pillarName": "News"
      },
      {
        "id": "world/2020/sep/28/england-new-coronavirus-restrictions-explained",
        "type": "article",
        "sectionId": "world",
        "sectionName"

#### b) Check (and show) the status of the request.

In [5]:
r1.status_code

#Status code of 200 shows this is working.

200

#### c) Check (and show) the type of the response (e.g. XML, JSON, csv).

In [6]:
r1.headers['content-type']

#The response body is JSON.

'application/json'

## 4. Parse the response and Create a dataset

#### a) Take the response returned by the API and turn it into a useful Python object (e.g. a list, vector, or pandas data frame). Show the code how this is done.

In [8]:
guardian_json = r1.json()

In [9]:
guardian_json_df = pd.DataFrame(guardian_json['response']['results'])
guardian_json_df

#Pandas DataFrame

Unnamed: 0,id,type,sectionId,sectionName,webPublicationDate,webTitle,webUrl,apiUrl,isHosted,pillarId,pillarName
0,world/2020/nov/18/us-passes-250000-deaths-from...,article,world,World news,2020-11-18T22:49:04Z,"US passes 250,000 deaths from coronavirus",https://www.theguardian.com/world/2020/nov/18/...,https://content.guardianapis.com/world/2020/no...,False,pillar/news,News
1,world/2020/sep/28/england-new-coronavirus-rest...,article,world,World news,2020-09-28T18:36:34Z,England’s new coronavirus restrictions explained,https://www.theguardian.com/world/2020/sep/28/...,https://content.guardianapis.com/world/2020/se...,False,pillar/news,News
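
The DataFrame above keeps all eleven fields. Since the assignment allows a smaller subset, one possible approach is to trim each result dictionary before building the DataFrame. This is only a sketch: `select_fields` is a hypothetical helper, and the sample record copies a few fields from the output shown earlier.

```python
def select_fields(results, fields=("sectionName", "webPublicationDate", "webTitle")):
    """Keep only the chosen keys from each result dictionary."""
    # .get() returns None for a key that a particular result lacks.
    return [{field: item.get(field) for field in fields} for item in results]


sample_results = [
    {
        "id": "world/2020/nov/18/us-passes-250000-deaths-from-coronavirus",
        "type": "article",
        "sectionName": "World news",
        "webPublicationDate": "2020-11-18T22:49:04Z",
        "webTitle": "US passes 250,000 deaths from coronavirus",
    },
]

rows = select_fields(sample_results)
print(rows)
```

The resulting list of dictionaries can be passed to `pd.DataFrame` in the same way as `guardian_json['response']['results']`.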


#### b) Using the API, create a dataset (in data frame format) for multiple records. I'd say a sample size greater than 100 is sufficient for the example but feel free to get more data if you feel ambitious and the API allows you to do that fairly easily. The dataset can include only a small subset of the returned data. Just choose some interesting features. There is no need to be inclusive here. 

In [10]:
r2 = requests.get('https://content.guardianapis.com/search?q=coronavirus&page-size=200&api-key={}'.format(guardian_api))

#I want articles related to coronaviruses, therefore `/search?q=coronavirus`
#I now want a sample size of 200, therefore `&page-size=200`
#I insert my API Key in `&api-key={}` 

In [11]:
r2.status_code

200

In [12]:
guardian_json2 = r2.json()

In [13]:
guardian_json_df2 = pd.DataFrame(guardian_json2['response']['results'])
guardian_json_df2.head(10)

Unnamed: 0,id,type,sectionId,sectionName,webPublicationDate,webTitle,webUrl,apiUrl,isHosted,pillarId,pillarName
0,world/2020/nov/18/us-passes-250000-deaths-from...,article,world,World news,2020-11-18T22:49:04Z,"US passes 250,000 deaths from coronavirus",https://www.theguardian.com/world/2020/nov/18/...,https://content.guardianapis.com/world/2020/no...,False,pillar/news,News
1,world/2020/sep/28/england-new-coronavirus-rest...,article,world,World news,2020-09-28T18:36:34Z,England’s new coronavirus restrictions explained,https://www.theguardian.com/world/2020/sep/28/...,https://content.guardianapis.com/world/2020/se...,False,pillar/news,News
2,us-news/2020/nov/20/donald-trump-jr-tests-posi...,article,us-news,US news,2020-11-21T00:23:04Z,Donald Trump Jr tests positive for coronavirus,https://www.theguardian.com/us-news/2020/nov/2...,https://content.guardianapis.com/us-news/2020/...,False,pillar/news,News
3,world/2020/nov/20/coronavirus-australia-the-we...,article,world,World news,2020-11-20T08:21:13Z,Coronavirus Australia: the week at a glance,https://www.theguardian.com/world/2020/nov/20/...,https://content.guardianapis.com/world/2020/no...,False,pillar/news,News
4,australia-news/2020/nov/13/coronavirus-austral...,article,australia-news,Australia news,2020-11-13T08:01:05Z,Coronavirus Australia: the week at a glance,https://www.theguardian.com/australia-news/202...,https://content.guardianapis.com/australia-new...,False,pillar/news,News
5,world/2020/nov/02/latest-coronavirus-lockdowns...,article,world,World news,2020-11-02T15:47:43Z,Latest coronavirus lockdowns spark protests ac...,https://www.theguardian.com/world/2020/nov/02/...,https://content.guardianapis.com/world/2020/no...,False,pillar/news,News
6,australia-news/2020/nov/06/coronavirus-austral...,article,australia-news,Australia news,2020-11-06T04:53:49Z,Coronavirus Australia: the week at a glance,https://www.theguardian.com/australia-news/202...,https://content.guardianapis.com/australia-new...,False,pillar/news,News
7,australia-news/2020/oct/30/coronavirus-austral...,article,australia-news,Australia news,2020-10-30T09:58:06Z,Coronavirus Australia: the week at a glance,https://www.theguardian.com/australia-news/202...,https://content.guardianapis.com/australia-new...,False,pillar/news,News
8,world/2020/oct/27/china-loses-trust-internatio...,article,world,World news,2020-10-27T14:30:00Z,China loses trust internationally over coronav...,https://www.theguardian.com/world/2020/oct/27/...,https://content.guardianapis.com/world/2020/oc...,False,pillar/news,News
9,world/2020/sep/27/lockdowners-v-libertarians-b...,article,world,World news,2020-09-27T07:45:39Z,Lockdowners v libertarians: Britain’s coronavi...,https://www.theguardian.com/world/2020/sep/27/...,https://content.guardianapis.com/world/2020/se...,False,pillar/news,News


In [14]:
print(guardian_json_df2.shape)

#Sample size of 200 observations, with 11 features.

(200, 11)
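
Here a single request with `page-size=200` was enough. If more records were wanted than one request returns, the `pages` and `currentPage` fields in the response suggest that results can be walked page by page. The sketch below shows the looping logic only: `collect_results` and `fake_fetch_page` are hypothetical names, and the network call is replaced by a stand-in so the example runs offline.

```python
def collect_results(fetch_page, wanted):
    """Request successive pages until `wanted` results are gathered or pages run out.

    `fetch_page(page_number)` must return the parsed `response` dictionary
    for that page, containing at least the `results` and `pages` keys.
    """
    collected = []
    page = 1
    while len(collected) < wanted:
        response = fetch_page(page)
        collected.extend(response["results"])
        # Stop on the last page, or if a page unexpectedly comes back empty.
        if page >= response["pages"] or not response["results"]:
            break
        page += 1
    return collected[:wanted]


def fake_fetch_page(page, page_size=2, total=5):
    """Stand-in for the API call: serves 5 fake results, 2 per page."""
    items = [{"id": f"item-{i}"} for i in range(total)]
    start = (page - 1) * page_size
    pages = -(-total // page_size)  # ceiling division
    return {"results": items[start:start + page_size], "pages": pages}


print(len(collect_results(fake_fetch_page, 4)))   # 4
print(len(collect_results(fake_fetch_page, 10)))  # 5, because only 5 exist
```

With the real API, `fetch_page` would send a GET request and return `r.json()['response']`. This assumes the API accepts a page-number query parameter, which should be checked against the documentation linked in 1b).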


#### c) Then, provide some summary statistics of the data. Include the data frame in a .csv file called data.csv with your submission for the grader.

I choose the `sectionName`, `type` and `webTitle` variables, to provide some summary statistics for. 

In [15]:
guardian_json_df2['sectionName'].value_counts()

World news            108
Australia news         22
US news                18
UK news                14
Society                 7
Opinion                 6
Business                5
Music                   3
Politics                3
Football                2
Television & radio      2
Sport                   2
Education               2
Technology              1
Life and style          1
Money                   1
Film                    1
Science                 1
Fashion                 1
Name: sectionName, dtype: int64

This reflects the number of results that fall within each section. More than half of them (108 of 200) fall under 'World news'.

In [16]:
guardian_json_df2['type'].value_counts()

article        195
liveblog         3
interactive      2
Name: type, dtype: int64

This reflects the number of results of each content type. Nearly all of them (195 of 200) are of type 'article'.

In [17]:
guardian_json_df2['webTitle'].str.len()

0      41
1      48
2      46
3      43
4      43
       ..
195    58
196    65
197    71
198    81
199    83
Name: webTitle, Length: 200, dtype: int64

In [18]:
guardian_json_df2['webTitle'].describe()

count                                 200
unique                                174
top       Coronavirus latest: at a glance
freq                                   13
Name: webTitle, dtype: object

The first output gives the length, in characters, of each article title. The second shows that there are 174 unique article titles out of 200, which indicates that some article titles are repeated (possibly routine news, for instance COVID-19 case count updates). This interpretation is bolstered by the statistic that 'Coronavirus latest: at a glance' occurs 13 times in this dataset.
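
To make the two ideas above concrete (title length and repeated titles), here is a small standard-library sketch. It uses the first five titles from the table in 4b) rather than the full dataset, so the numbers describe that sample only.

```python
from collections import Counter
from statistics import mean

titles = [
    "US passes 250,000 deaths from coronavirus",
    "England’s new coronavirus restrictions explained",
    "Donald Trump Jr tests positive for coronavirus",
    "Coronavirus Australia: the week at a glance",
    "Coronavirus Australia: the week at a glance",
]

# Character count of each title, matching the first rows of the output above.
lengths = [len(title) for title in titles]
print(lengths)        # [41, 48, 46, 43, 43]
print(mean(lengths))  # 44.2

# The most frequent title and how many times it appears in the sample.
print(Counter(titles).most_common(1))
# [('Coronavirus Australia: the week at a glance', 2)]
```

On the full DataFrame, `guardian_json_df2['webTitle'].str.len().describe()` would give comparable summary statistics for title length.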

In [19]:
guardian_json_df2.to_csv('data.csv')

#Save as data.csv. This is in the same directory as this ipynb. 

## 5. API Client Function

#### For your API function, try to create a simple function that does the following things:

- allows the user to specify some smallish set of query parameters (from Q.3a)
- run a GET request with these parameters
- check the status of the request the server returns and inform the user of any errors (from Q.3b)
- parse the response and return a Python object to the user of the function. You can choose whether returning a list (from Q.4a) or a data frame (from Q.4b) is best.
- Add docstrings to the API client function that explain the parameters, the output, and ideally include a quick example.
- **Note: There is no need to make this into a Python package here. A simple function is sufficient.**

 - In the notebook, include your full function to access the API functionality. Set some sensible default values for the query parameters.

 - Run the function for the default values and show the output in the notebook.

In [20]:
def get_guardian_articles(api_key, single_search_term='breaking%20news', number_of_results=5):
    """
    Generates a Pandas DataFrame of The Guardian Articles, related to a given search term, from the Guardian API.
    By default, it will search for 'breaking news'.
    
    The first printed line will indicate the status code. 

    Parameters (Inputs)
    ----------
    api_key : str
      Character input.
      Please go to https://open-platform.theguardian.com/access/ to request for your own API Key.
   
    single_search_term : str
      Character input; this should be a single word that reflects what you would like to search for.
      It is possible to enter a phrase too, but the formatting needs to be right (i.e., correct%20formatting). 
      See examples below. 
      By default, the search term is 'breaking%20news' (i.e., 'breaking news').
      
    number_of_results : int
      Integer input.
      Enter the number of results you'd like to have in the output DataFrame.
      By default, a list of 5 articles will be generated.


    Returns (Output)
    -------
    Pandas Dataframe.
        The output is a table of articles from The Guardian that are relevant to the single_search_term.

    Examples
    --------
    >>> get_guardian_articles([YOUR_API], 'singapore', 20) #this is for a single word.
    
    >>> get_guardian_articles([YOUR_API], 'journal%20article', 100) #phrases work great too, but need to be properly formatted.

    """
    r = requests.get(f'https://content.guardianapis.com/search?q={single_search_term}&page-size={number_of_results}&api-key={api_key}')
    r.raise_for_status() #will raise an HTTPError if the Status Code indicates a client or server error.
    payload = r.json() #named `payload` so that the imported `json` module is not shadowed.
    df = pd.DataFrame(payload['response']['results'])
    print(f'Status Code:{r.status_code}. This is working if Status Code is 200.') #for good measure, will print out Status Code.
    return df

In [21]:
get_guardian_articles(guardian_api) #default values

Status Code:200. This is working if Status Code is 200.


Unnamed: 0,id,type,sectionId,sectionName,webPublicationDate,webTitle,webUrl,apiUrl,isHosted,pillarId,pillarName
0,politics/2020/oct/05/breaking-down-labours-red...,article,politics,Politics,2020-10-05T16:37:03Z,Breaking down Labour’s ‘red wall’ | Letters,https://www.theguardian.com/politics/2020/oct/...,https://content.guardianapis.com/politics/2020...,False,pillar/news,News
1,uk-news/2020/nov/22/has-britains-second-larges...,article,uk-news,UK news,2020-11-22T09:15:18Z,Has Britain’s second largest city reached brea...,https://www.theguardian.com/uk-news/2020/nov/2...,https://content.guardianapis.com/uk-news/2020/...,False,pillar/news,News
2,commentisfree/2020/mar/29/late-breaking-news-t...,article,commentisfree,Opinion,2020-03-29T06:00:47Z,Late-breaking news: there's been a pandemic wh...,https://www.theguardian.com/commentisfree/2020...,https://content.guardianapis.com/commentisfree...,False,pillar/opinion,Opinion
3,tv-and-radio/2020/nov/05/cocomelon-netflix-the...,article,tv-and-radio,Television & radio,2020-11-05T08:18:40Z,Cocomelon: the unsettling kids show that's bre...,https://www.theguardian.com/tv-and-radio/2020/...,https://content.guardianapis.com/tv-and-radio/...,False,pillar/arts,Arts
4,money/2020/nov/15/how-much-have-you-got-breaki...,article,money,Money,2020-11-15T08:00:30Z,How much have you got? Breaking the taboos on ...,https://www.theguardian.com/money/2020/nov/15/...,https://content.guardianapis.com/money/2020/no...,False,pillar/lifestyle,Lifestyle


Some additional use cases for `get_guardian_articles` are also available here:

In [22]:
get_guardian_articles(guardian_api, 'singapore', 3) #single search term for 'singapore'

Status Code:200. This is working if Status Code is 200.


Unnamed: 0,id,type,sectionId,sectionName,webPublicationDate,webTitle,webUrl,apiUrl,isHosted,pillarId,pillarName
0,business/2020/oct/08/singapore-launches-covid-...,article,business,Business,2020-10-08T13:55:46Z,Singapore launches Covid-secure luxury cruises...,https://www.theguardian.com/business/2020/oct/...,https://content.guardianapis.com/business/2020...,False,pillar/news,News
1,tv-and-radio/2020/sep/13/tv-tonight-colonial-d...,article,tv-and-radio,Television & radio,2020-09-13T05:00:52Z,TV tonight: colonial dramas in The Singapore Grip,https://www.theguardian.com/tv-and-radio/2020/...,https://content.guardianapis.com/tv-and-radio/...,False,pillar/arts,Arts
2,business/2020/oct/12/cabin-fever-tickets-for-m...,article,business,Business,2020-10-12T17:26:11Z,Cabin fever: tickets for meal onboard Singapor...,https://www.theguardian.com/business/2020/oct/...,https://content.guardianapis.com/business/2020...,False,pillar/news,News


In [23]:
get_guardian_articles(guardian_api, 'journal%20articles', 6) #phrase search term for 'journal articles'

Status Code:200. This is working if Status Code is 200.


Unnamed: 0,id,type,sectionId,sectionName,webPublicationDate,webTitle,webUrl,apiUrl,isHosted,pillarId,pillarName
0,environment/2020/oct/05/3000-articles-100m-rea...,article,environment,Environment,2020-10-05T08:00:15Z,"3,000 articles, 100m readers: a year of our be...",https://www.theguardian.com/environment/2020/o...,https://content.guardianapis.com/environment/2...,False,pillar/news,News
1,us-news/2020/oct/08/new-england-journal-of-med...,article,us-news,US news,2020-10-08T16:04:25Z,‘Dangerously incompetent’: medical journal con...,https://www.theguardian.com/us-news/2020/oct/0...,https://content.guardianapis.com/us-news/2020/...,False,pillar/news,News
2,lifeandstyle/2020/nov/20/gratitude-journal-gre...,article,lifeandstyle,Life and style,2020-11-20T14:00:31Z,Gratitude's great – until my inner cynic kicks...,https://www.theguardian.com/lifeandstyle/2020/...,https://content.guardianapis.com/lifeandstyle/...,False,pillar/lifestyle,Lifestyle
3,media/2020/nov/19/philip-morris-sponsored-arti...,article,media,Media,2020-11-18T16:30:48Z,Philip Morris-sponsored articles in the Austra...,https://www.theguardian.com/media/2020/nov/19/...,https://content.guardianapis.com/media/2020/no...,False,pillar/news,News
4,commentisfree/2020/sep/08/elliot-dallen-reader...,article,commentisfree,Opinion,2020-09-08T16:24:05Z,'You've given me new perspective': readers on ...,https://www.theguardian.com/commentisfree/2020...,https://content.guardianapis.com/commentisfree...,False,pillar/opinion,Opinion
5,world/2020/jun/12/the-antidote-the-most-deeply...,article,world,World news,2020-06-12T06:46:20Z,The antidote: the most deeply read articles be...,https://www.theguardian.com/world/2020/jun/12/...,https://content.guardianapis.com/world/2020/ju...,False,pillar/news,News


In [24]:
get_guardian_articles(123456, 'journal%20articles', 6) #checking whether function raises error for nonsense API Key

HTTPError: 403 Client Error: Forbidden for url: https://content.guardianapis.com/search?q=journal%20articles&page-size=6&api-key=123456
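
The 403 error above is accurate but terse. One way to give the user a friendlier message is to translate the status code before (or instead of) raising. The helper `describe_status` below is a hypothetical sketch; the hint attached to 401/403 follows from the test above, while the 429 branch is an assumption about how rate limiting might be reported.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Translate an HTTP status code into a short message for the user."""
    # HTTPStatus raises ValueError for codes it does not know.
    phrase = HTTPStatus(status_code).phrase
    if status_code == 200:
        return f"{status_code} {phrase}: request succeeded."
    if status_code in (401, 403):
        return f"{status_code} {phrase}: check that the API key is valid."
    if status_code == 429:
        # Assumption: rate limiting is signalled with 429.
        return f"{status_code} {phrase}: rate limit reached, try again later."
    return f"{status_code} {phrase}: request failed."


print(describe_status(403))
# 403 Forbidden: check that the API key is valid.
```

Inside `get_guardian_articles`, such a message could be printed from `r.status_code` before `r.raise_for_status()` is called, so that the user sees the hint alongside the traceback.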