# Web API via HTTP Requests

---
## 1. The minimal picture of the internet

### 1.1. HTTP

HTTP is short for **H**yper**T**ext **T**ransfer **P**rotocol, a set of "rules" that standardize how agents on the World Wide Web communicate with each other (in the application layer, technically speaking).

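There are two kinds of agents that we really need to care about: the **client** (e.g. your browser), which sends requests, and the **server**, which answers them.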

<figure style="text-align: center">

</figure>

For example, this is what happens when we visit www.google.com:

1. User enters the URL in the browser (the client).
2. The browser creates an HTTP `GET` request, i.e. a message indicating that the user wants to "GET" the website data.
3. The computer wraps the `GET` request, the URL, and other information (e.g. identity of the user, security data, etc.) into a single message.
4. The computer sends the message to Google's server through the internet.
5. The server receives the `GET` request and works out which website data it needs to send back.
6. The server creates an HTTP response that contains the website data.
7. The server wraps the HTTP response with other information (e.g. identity of the server, security data, etc.) into a single message.
8. The server sends the message to the user through the internet.
9. The browser receives the response message and works out what should be displayed.
10. The user sees the website in the browser.

### 1.2. HTTP request/response

Understanding what is inside HTTP requests and responses is particularly important, because in web scraping we need to automate the process without the help of a browser. Here are the most important parts of the messages:

- **HTTP method** - Determines what the server should do upon receiving a client's request. Found in HTTP requests only. For example,
    - `GET` - Tell the server to return certain data. The most common method.
    - `POST` - Tell the server to create some new data. E.g. create a new post on a forum website.
    - `PUT` - Tell the server to update existing data.
    - `DELETE` - Tell the server to delete existing data.
    
    There are several more, but they are rarely seen. You can find all of them [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods).

- **Status code** - Tells the client how the server handled the request. Found in HTTP responses only. Some common status codes are
    - `200` = OK, the request succeeded
    - `301` or `302` = The resource has moved and the client is redirected to another URL
    - `404` = Content not found
    - `500` = Internal server error, i.e. the server failed while processing your request
    

- **Header** - Additional information about the message, e.g. 
    - Identification of the message sender
    - Timestamp of the message
    - Message length

- The **main "text"** to be delivered, e.g.
    - The website's HTML
    - Raw data (like `.json` or `.xml`)
    - Documents for downloading
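
To make these parts concrete, here is a minimal sketch that takes apart two hand-written HTTP/1.1 messages with plain string operations. The messages are illustrative only (not captured from a real server), and `split_message` is a hypothetical helper of mine, not something a library provides.

```python
# Hand-written sample messages (illustrative only, not captured from a real server).
RAW_REQUEST = (
    "GET /astros.json HTTP/1.1\r\n"
    "Host: api.open-notify.org\r\n"
    "User-Agent: example-client\r\n"
    "\r\n"
)

RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/json\r\n"
    "Content-Length: 22\r\n"
    "\r\n"
    '{"message": "success"}'
)


def split_message(raw):
    """Split a raw HTTP/1.1 message into (start line, headers dict, body)."""
    # A blank line separates the header section from the body.
    head, _, body = raw.partition("\r\n\r\n")
    start_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return start_line, headers, body


request_line, request_headers, _ = split_message(RAW_REQUEST)
method, path, version = request_line.split(" ")

status_line, response_headers, body = split_message(RAW_RESPONSE)
_, status_code, reason = status_line.split(" ", 2)

print(method, path)          # the HTTP method lives in the request's first line
print(status_code, reason)   # the status code lives in the response's first line
print(response_headers)      # additional information about the message
print(body)                  # the main "text" being delivered
```

In practice a browser or an HTTP library builds and parses these messages for us, so we rarely touch the raw text.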

### 1.3. Web API

API stands for **A**pplication **P**rogramming **I**nterface. It is a general term for the instructions/protocols/tools that different applications should follow when they communicate and interact with each other. Informally speaking, an API is an abstraction layer that tells users how to interact with a product hidden behind a black box.

<figure style="text-align: center">

</figure>

Here are different examples of "API" you definitely know:

<figure style="text-align: center">

</figure>

In summary, as the user of an API, all we need to do is follow the instructions (documentation) provided by the developer to get what we want. What happens behind the API is a matter for the developers only. Retrieving data from a web API is similar - **read the documentation from the developers to access the resources through HTTP requests.**

---
## 2. Accessing web API via Python

To retrieve data from a web API via Python, first import the [requests](https://requests.readthedocs.io/en/master/) package.

In [1]:
import requests


### 2.1. All you need to know is `requests.get()`

When we only need to fetch data from a server once, almost everything can be handled by calling `requests.get()`. It carries out all the steps required to fetch the data, including:

- Establishing a connection to the API's endpoint (i.e. the URL where the data is served)
- Sending an HTTP `GET` request to the API's server
- Receiving the server's response and returning it as a `Response` object
- Giving access to the data as Python objects through the `Response` object (e.g. `.text` and `.json()`)

Below I demonstrate the use of an API with the [Open Notify API](http://open-notify.org/Open-Notify-API/ISS-Location-Now/), which gives data about the International Space Station. The endpoint `http://api.open-notify.org/astros.json` returns data about the astronauts currently in space.

In [3]:
response = requests.get("http://api.open-notify.org/astros.json")

print(type(response))  # The return is an object from the "Response" class

<class 'requests.models.Response'>


We can examine the `Response` object's various properties to see what makes up this connection. (See [here](https://requests.readthedocs.io/en/latest/api/#requests.Response) for the full list of properties.) You should be able to understand at least three of the properties:

#### A. Status code
The `.status_code` property is for checking the HTTP status code. 


In [4]:
# Check the HTTP status code
print('status code = {}'.format(response.status_code))

status code = 200
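
Before using the data, it is usually worth checking which class the status code falls into. The helper below is a small sketch of that idea; `describe_status` is a hypothetical name of mine, and the ranges follow the standard grouping of HTTP status codes by their first digit.

```python
def describe_status(code):
    """Return a coarse description of an HTTP status code by its class."""
    if 100 <= code < 200:
        return "informational"
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirection"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "unknown"


for code in (200, 301, 404, 500):
    print(code, describe_status(code))
```

With a real `Response` object, `response.ok` is `True` for status codes below 400, and `response.raise_for_status()` raises an `HTTPError` for 4xx and 5xx codes, which is often the shortest way to stop early on a failed request.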


#### B. Response header

The HTTP response header contains the information about the response's sender, i.e. the server side. It also carries the identification of the server, proxy authentication, cache, and some more advanced things. It can be retrieved by calling the property `.headers`.

In [5]:
print(response.headers)

{'Server': 'nginx/1.10.3', 'Date': 'Mon, 01 Feb 2021 11:08:31 GMT', 'Content-Type': 'application/json', 'Content-Length': '356', 'Connection': 'keep-alive', 'access-control-allow-origin': '*'}
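
Notice the mixed capitalization of the header names in the output. HTTP header names are case-insensitive, and `response.headers` supports case-insensitive lookup for that reason. The sketch below mimics this with a plain dictionary (values copied from the output above) and a hypothetical helper `get_header`, then uses the `Content-Type` to decide whether the body should be parsed as JSON.

```python
# A few header values copied from the output above, stored in a plain dict.
headers = {
    "Server": "nginx/1.10.3",
    "Content-Type": "application/json",
    "Content-Length": "356",
    "access-control-allow-origin": "*",
}


def get_header(headers, name, default=None):
    """Look up a header by name, ignoring upper/lower case."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return default


content_type = get_header(headers, "content-type")
body_length = int(get_header(headers, "CONTENT-LENGTH"))  # header values are strings

# A Content-Type may carry extra fields after ';' (e.g. a charset), so compare the first part only.
is_json = content_type.split(";")[0].strip() == "application/json"
print(content_type, body_length, is_json)
```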


#### C. Data

This is the part we actually care about. Depending on the type of data, we can use different ways to turn the content of the `Response` object into Python-readable content.

- `response.text` = Decode the content as Unicode text and return **a single Python string**.
- `response.json()` = Interpret the content as JSON and parse it into Python data, here **a Python dictionary**.

** The `requests` package has no dedicated parser for other data types (`response.content` only gives the raw bytes). So use a JSON data source if possible.

In [6]:
response.text

# take a look at the ' signs at the beginning and the end. It is indeed a string.

'{"message": "success", "number": 7, "people": [{"craft": "ISS", "name": "Sergey Ryzhikov"}, {"craft": "ISS", "name": "Kate Rubins"}, {"craft": "ISS", "name": "Sergey Kud-Sverchkov"}, {"craft": "ISS", "name": "Mike Hopkins"}, {"craft": "ISS", "name": "Victor Glover"}, {"craft": "ISS", "name": "Shannon Walker"}, {"craft": "ISS", "name": "Soichi Noguchi"}]}'

In [11]:
print(type(response.json()))

print('\n')

print(response.json())

<class 'dict'>


{'message': 'success', 'number': 7, 'people': [{'craft': 'ISS', 'name': 'Sergey Ryzhikov'}, {'craft': 'ISS', 'name': 'Kate Rubins'}, {'craft': 'ISS', 'name': 'Sergey Kud-Sverchkov'}, {'craft': 'ISS', 'name': 'Mike Hopkins'}, {'craft': 'ISS', 'name': 'Victor Glover'}, {'craft': 'ISS', 'name': 'Shannon Walker'}, {'craft': 'ISS', 'name': 'Soichi Noguchi'}]}
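
Since the parsed result is an ordinary dictionary, the usual indexing and list comprehensions apply. The sketch below works on a copy of the response text shown above, so it runs without a network connection; `json.loads` on that text should give the same dictionary that `response.json()` printed.

```python
import json

# Copy of the response text shown above, so the example runs offline.
sample_text = (
    '{"message": "success", "number": 7, "people": ['
    '{"craft": "ISS", "name": "Sergey Ryzhikov"}, '
    '{"craft": "ISS", "name": "Kate Rubins"}, '
    '{"craft": "ISS", "name": "Sergey Kud-Sverchkov"}, '
    '{"craft": "ISS", "name": "Mike Hopkins"}, '
    '{"craft": "ISS", "name": "Victor Glover"}, '
    '{"craft": "ISS", "name": "Shannon Walker"}, '
    '{"craft": "ISS", "name": "Soichi Noguchi"}]}'
)

data = json.loads(sample_text)

# "people" is a list of dictionaries, one per astronaut.
names = [person["name"] for person in data["people"]]
on_iss = [person["name"] for person in data["people"] if person["craft"] == "ISS"]

print(data["number"], "people in space")
for name in names:
    print("-", name)
```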


### 2.2. Visualizing JSON data

**Once the JSON data is parsed into a python dictionary, the data is already ready for use!**

However, a Python dictionary is hard for human eyes to read: every item is clumped into one paragraph. If we want to add some spacing and new lines to make the dictionary reader-friendly, we can use the [json](https://docs.python.org/3/library/json.html) package. It provides two main functions:
- `json.dumps()` - Serializing Python data to a JSON string.
    - Converts a Python dictionary to a string that is ready to be written to a JSON file.
    - **Can also be used for pretty printing**.
- `json.loads()` - Parsing a JSON string. (Not used in this demo.)
    - Converts text/bytes containing a JSON document to Python data.
    - This is used when the JSON is obtained directly as text, e.g. read from a file in the file directory.

The conversion between datatypes goes as

|JSON|Python|
|:---:|:---:| 
|object|dict|
|array|list (tuple)|
|string|str|
|number (int/float)| int/float|
|true/false/null|True/False/None|
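
A quick round trip shows the table in action. The values below are made up for illustration. Note that a tuple is written out as a JSON array and therefore comes back as a list.

```python
import json

# Made-up values, one per row of the conversion table.
python_data = {
    "label": "demo",                  # str   -> string
    "count": 7,                       # int   -> number
    "ratio": 0.5,                     # float -> number
    "enabled": True,                  # True  -> true
    "missing": None,                  # None  -> null
    "position": (22.337, 114.266),    # tuple -> array
}

text = json.dumps(python_data)    # Python data -> JSON string
restored = json.loads(text)       # JSON string -> Python data

print(text)
print(type(restored["position"]))  # the tuple has become a list
```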


In [12]:
import json

print(type(json.dumps(response.json())), '\n') # return of json.dumps() is a string

print(json.dumps(response.json(), sort_keys=True, indent=4)) # json.dumps() can be used for pretty printing

<class 'str'> 

{
    "message": "success",
    "number": 7,
    "people": [
        {
            "craft": "ISS",
            "name": "Sergey Ryzhikov"
        },
        {
            "craft": "ISS",
            "name": "Kate Rubins"
        },
        {
            "craft": "ISS",
            "name": "Sergey Kud-Sverchkov"
        },
        {
            "craft": "ISS",
            "name": "Mike Hopkins"
        },
        {
            "craft": "ISS",
            "name": "Victor Glover"
        },
        {
            "craft": "ISS",
            "name": "Shannon Walker"
        },
        {
            "craft": "ISS",
            "name": "Soichi Noguchi"
        }
    ]
}


### 2.3. Passing parameters

Most of the time you are required to input some parameters before retrieving the data, e.g. 

- Specify your targeted dataset.
- Specify the output format of the dataset.
- Submit parameters to the server so that it can compute and return the data you want.

Here is an example of passing parameters through `requests.get()`. According to the [documentation](http://open-notify.org/Open-Notify-API/ISS-Pass-Times/), the endpoint `http://api.open-notify.org/iss-pass.json` tells the next times when the ISS will pass over a location. It accepts four parameters, two of which are optional:

- Latitude `lat`
- Longitude `lon`
- Altitude `alt` (optional) 
- Number of times to return `n` (optional). 

To pass these parameters in a request:

1. Construct a dictionary that contains these key-value pairs.
2. Pass this dictionary to the argument `params` of `requests.get()`.

Passing a dictionary of parameters using `params` is equivalent to appending the fields to the endpoint URL directly, i.e.

`http://api.open-notify.org/iss-pass.json?lat=22.337&lon=114.266`
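
The equivalence can be checked by hand with the standard library. This is only a sketch of the encoding step using `urllib.parse`, not a look inside `requests` itself.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

endpoint = "http://api.open-notify.org/iss-pass.json"
parameters = {"lat": 22.337, "lon": 114.266}

# Encode the dictionary as a query string and append it after '?'.
full_url = endpoint + "?" + urlencode(parameters)
print(full_url)

# Going the other way: recover the parameters from the URL.
# parse_qs returns every value as a list of strings.
query = parse_qs(urlsplit(full_url).query)
print(query)
```

After a real request, `response.url` holds the final URL that was actually used, which is handy for checking that the parameters were encoded as expected.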

In [13]:
parameters = {
    "lat": 22.337, 
    "lon": 114.266
}

response = requests.get("http://api.open-notify.org/iss-pass.json", params=parameters)

print(json.dumps(response.json(), sort_keys=True, indent=4))

{
    "message": "success",
    "request": {
        "altitude": 100,
        "datetime": 1612186712,
        "latitude": 22.33731830555699,
        "longitude": 114.26550315684017,
        "passes": 5
    },
    "response": [
        {
            "duration": 500,
            "risetime": 1612191654
        },
        {
            "duration": 458,
            "risetime": 1612234053
        },
        {
            "duration": 646,
            "risetime": 1612239745
        },
        {
            "duration": 323,
            "risetime": 1612245749
        },
        {
            "duration": 604,
            "risetime": 1612269367
        }
    ]
}
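
The `risetime` values are not very readable as they are. Assuming they are Unix timestamps (seconds since 1970-01-01 UTC) and that `duration` is given in seconds, a sketch like the following turns them into something friendlier. `describe_pass` is a hypothetical helper, and the two entries are copied from the output above.

```python
from datetime import datetime, timezone

# First two entries of the "response" list shown above.
passes = [
    {"duration": 500, "risetime": 1612191654},
    {"duration": 458, "risetime": 1612234053},
]


def describe_pass(iss_pass):
    """Format one pass as a UTC time plus a duration in minutes and seconds."""
    # Assumption: risetime is a Unix timestamp and duration is in seconds.
    rise = datetime.fromtimestamp(iss_pass["risetime"], tz=timezone.utc)
    minutes, seconds = divmod(iss_pass["duration"], 60)
    return f"{rise:%Y-%m-%d %H:%M:%S} UTC for {minutes} min {seconds} s"


for iss_pass in passes:
    print(describe_pass(iss_pass))
```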


---
## 3. Intermediate skills for API


Here I play with the Last.fm API, which is a database holding music album data. It is described on its [homepage](https://www.last.fm/api/intro). The API root given is [http://ws.audioscrobbler.com/2.0/](http://ws.audioscrobbler.com/2.0).


### 3.1 Dealing with authentication

Some rarely used APIs do not require authentication, like the ISS API above (how many people really care about the ISS, btw?). However, commonly used APIs usually require authentication to prevent abuse, because bandwidth is limited.

#### A. API key
Some functions in the Last.fm API require authenticated login, as described on [this page](https://www.last.fm/api/authentication). The authentication procedure below is pretty common among other APIs:

1. **Create an account and retrieve an API key** - this key is your identification to tell the server who you are.

2. **Configure API account** - Tell the server more about what you are doing with this API account.

3. **Follow some authentication procedure** - i.e. how to submit the authentication data. This part varies between APIs.


For example, here is the authentication data I got by creating an account.
- Application name = test
- API key = 61ac3db82df2f806da4e8c8431d
- Shared secret = 091adbdffb9e8725f48eae91a47
- Registered to = mtshing

_(These are not the real keys. Of course I am not sharing mine with you.)_


#### B. Request header
Recall that we can use `.headers` to check the server's information when we receive the HTTP response. In the same way, when we send an HTTP request, the server can get our information from the HTTP request header. The request header is quite useful for the server to gather statistics and check whether we are abusing the API. For example, if the server suddenly receives a lot of requests with your API key but not from your usual IP address, it can immediately ban your account until you log in to your account in person.

As mentioned in the first bullet point of the [intro](https://www.last.fm/api/intro), Last.fm requires users to use an identifiable User-Agent header on all requests: "This helps our logging and reduces the risk of you getting banned."

In [18]:
my_headers = {
    'user-agent': 'mtshing' # Use my account username as the identifier 
    
    # other fields of the header are filled automatically when the request is sent
}

For the demo, we can go to the page [REST Requests](https://www.last.fm/api/rest). It says that only two fields are truly essential for any request:
1. The API key, to be specified in the `api_key` field.
2. The method, i.e. the data set I want to retrieve.

Here I try the [method](https://www.last.fm/api/show/chart.getTopArtists) `chart.gettopartists`, which does not require account authentication, i.e. I don't have to prove that I am the API key owner. This is convenient if we are creating an application that will be published for public use: your app's users don't have to log in to your account every time they want to fetch Last.fm data in your app.

In [19]:
my_parameters = {
    'api_key': '61ac3db82df2f806da4e8c8431d5c67b',
    'method': 'chart.gettopartists',
    'format': 'json'
}

response = requests.get('http://ws.audioscrobbler.com/2.0/', headers=my_headers, params=my_parameters)
print(response.status_code, '\n')

print(response.headers, '\n')

#print(response.json()) # The response content is too long to print out here

200 

{'Server': 'openresty/1.9.15.1', 'Date': 'Mon, 01 Feb 2021 15:06:41 GMT', 'Content-Type': 'application/json', 'content-length': '36313', 'Access-Control-Allow-Methods': 'POST, GET, OPTIONS', 'Access-Control-Allow-Origin': '*', 'Access-Control-Max-Age': '86400', 'Via': '1.1 google'} 
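
The cell above writes the API key directly in the code, which is risky if the notebook is ever shared. One common alternative is to read the key from an environment variable. The sketch below assumes a variable named `LASTFM_API_KEY`, which is a name I made up, and `build_lastfm_params` is a hypothetical helper as well.

```python
import os


def build_lastfm_params(method, environ=None):
    """Build the query parameters, reading the API key from the environment."""
    environ = os.environ if environ is None else environ
    api_key = environ.get("LASTFM_API_KEY")  # hypothetical variable name
    if not api_key:
        raise RuntimeError("Set the LASTFM_API_KEY environment variable first")
    return {"api_key": api_key, "method": method, "format": "json"}


# Demonstration with a stand-in environment and a fake key.
demo_params = build_lastfm_params(
    "chart.gettopartists", environ={"LASTFM_API_KEY": "not-a-real-key"}
)
print(demo_params)
```

The resulting dictionary can be passed to `requests.get()` through `params` exactly like `my_parameters`.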



#### C. Account login (Advanced)

This account login procedure is required only by methods that can change the data in the Last.fm database. [Here](https://www.last.fm/api/authspec) is the guide for authentication. Basically,

1. Send a GET request to `http://www.last.fm/api/auth/?api_key=xxxxxxxxxx`

2. The server will send back a GET request whose URL looks like `<callback_url>/?token=yyyyyy`. 
    - The callback URL is your own address for receiving the server's GET request. Imagine the server calling `requests.get()` on you, instead of you calling it on the server!
    - The callback URL is configurable by logging in to your account in person.
    - The `token` parameter is then picked up as a variable.
    
    
3. Combine the API key, token and shared secret to create a signature `api_sig`, as described in section 8 of the guide.

4. Send a GET request with the method `auth.getSession`, together with the API key, token and signature as the other parameters.

5. The response to this GET request is a session key. By submitting requests together with the session key, you are now connected to the server via a web session.
    - Connecting to the server via a web session means that you stay connected to the server continuously, so you don't have to repeat this login process if you need to make multiple operations.
    - In comparison, all the examples above are one-time connections, i.e. once the data is retrieved, the connection is cut off. You need to reconnect for another data retrieval.

**The login process for different APIs can be very different, so it is not meaningful to explain the methods for Last.fm in all details.** 
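
Still, to give a flavour of step 3, here is a sketch of how the signature could be computed. It follows my reading of the guide (sort the parameters by name, concatenate them as `<name><value>`, append the shared secret, take the MD5 hash), so please verify it against section 8 of the guide before relying on it. `make_api_sig` is a hypothetical helper and all values are placeholders.

```python
import hashlib


def make_api_sig(params, shared_secret):
    """Sign request parameters (my reading of the Last.fm guide, not verified here).

    The parameters are sorted by name and concatenated as <name><value>,
    the shared secret is appended, and the MD5 hex digest is returned.
    """
    ordered = "".join(name + str(params[name]) for name in sorted(params))
    return hashlib.md5((ordered + shared_secret).encode("utf-8")).hexdigest()


params = {
    "method": "auth.getSession",
    "api_key": "xxxxxxxxxx",   # placeholder
    "token": "yyyyyy",         # placeholder
}

# The signature is computed from the other parameters, then sent along with them.
params["api_sig"] = make_api_sig(params, "placeholder-secret")
print(params["api_sig"])
```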

### 3.2 Pagination

Sometimes, if the dataset is too big, the dataset owner will chop the data into smaller chunks. Only one chunk is returned per request, so that each request is fast and does not occupy the bandwidth for a long time.

Let's check out the data returned by the `chart.gettopartists` request. Under the top-level `artists` key, it is a Python dictionary (parsed from JSON) that contains two keys:

- `@attr` key = metadata - the data about the data
- `artist` key = the information about the artists, the main content you need.

In the metadata, it says:

In [22]:
print(response.json()['artists']['@attr'])

{'page': '1', 'perPage': '50', 'totalPages': '78019', 'total': '3900942'}


It means that the data is paginated into 78019 pages, and only the first 50 entries (page 1) are returned. As described in the [doc](https://www.last.fm/api/show/chart.getTopArtists), we can control the returned results with the parameters `page` (page number) and `limit` (number of entries per page). We can then use a for loop to call `requests.get()` multiple times to get the results of different pages.
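
Note that every value in the metadata is a string, so it must be converted before doing arithmetic. The sketch below recomputes the number of pages from `total` and `perPage`; `pages_needed` is a hypothetical helper.

```python
import math

# Metadata copied from the output above; note that all values are strings.
attr = {'page': '1', 'perPage': '50', 'totalPages': '78019', 'total': '3900942'}


def pages_needed(total_entries, entries_per_page):
    """Number of pages required to hold all entries."""
    return math.ceil(total_entries / entries_per_page)


total = int(attr['total'])
per_page = int(attr['perPage'])

print(pages_needed(total, per_page))   # should agree with attr['totalPages']
print(pages_needed(total, 2))          # page count if we ask for 2 entries per page
```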


### 3.3 Rate limit

Be careful: don't use a for loop to send requests too frequently. The server may think you are DDoS-ing it (i.e. sending too many requests to occupy its bandwidth, thus paralyzing it), and your account may then be banned. We need to add a time delay between requests, which can be done via `time.sleep()` in the `time` module.

In [24]:
import time

# restrict to show only two results per page
my_parameters['limit'] = 2

result = {} # create an empty dictionary

total_page = 2
for page in range(1,total_page+1):
    
    # turn the page by changing the 'page' value
    my_parameters['page'] = page 
    
    # send a new request
    response = requests.get('http://ws.audioscrobbler.com/2.0/', headers=my_headers, params=my_parameters)
    
    # merge the new response into the dictionary; every page shares the same
    # top-level 'artists' key, so only the most recent page is kept
    result = {**result, **response.json()} 
    
    # pause for 0.1s before the next request
    time.sleep(0.1) 
    
print(json.dumps(result, sort_keys=True, indent=4))

{
    "artists": {
        "@attr": {
            "page": "2",
            "perPage": "2",
            "total": "3900942",
            "totalPages": "1950471"
        },
        "artist": [
            {
                "image": [
                    {
                        "#text": "https://lastfm.freetls.fastly.net/i/u/34s/2a96cbd8b46e442fc41c2b86b821562f.png",
                        "size": "small"
                    },
                    {
                        "#text": "https://lastfm.freetls.fastly.net/i/u/64s/2a96cbd8b46e442fc41c2b86b821562f.png",
                        "size": "medium"
                    },
                    {
                        "#text": "https://lastfm.freetls.fastly.net/i/u/174s/2a96cbd8b46e442fc41c2b86b821562f.png",
                        "size": "large"
                    },
                    {
                        "#text": "https://lastfm.freetls.fastly.net/i/u/300x300/2a96cbd8b46e442fc41c2b86b821562f.png",
                        "s

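One thing to notice in the output: `"page"` is `"2"`. Merging with `{**result, **response.json()}` replaces the top-level `artists` key on every iteration, so only the last page survives. Below is a sketch of one way to keep the artists of all pages, by collecting them in a list. `collect_artists` and `fake_fetch_page` are hypothetical names, and the fake function stands in for the network call so that the sketch runs offline.

```python
import time


def collect_artists(fetch_page, total_pages, delay=0.1):
    """Call fetch_page(page) for each page and gather all artists in one list.

    fetch_page is any function returning the parsed JSON dictionary of one page;
    delay is the pause in seconds between two consecutive requests.
    """
    artists = []
    for page in range(1, total_pages + 1):
        data = fetch_page(page)
        artists.extend(data["artists"]["artist"])
        if page < total_pages:
            time.sleep(delay)  # no need to wait after the last page
    return artists


# Stand-in for the network call, with made-up artist names.
def fake_fetch_page(page):
    return {
        "artists": {
            "@attr": {"page": str(page), "perPage": "2"},
            "artist": [
                {"name": f"artist {2 * page - 1}"},
                {"name": f"artist {2 * page}"},
            ],
        }
    }


all_artists = collect_artists(fake_fetch_page, total_pages=2, delay=0.01)
print([artist["name"] for artist in all_artists])
```

With the real API, `fetch_page` could set `my_parameters['page'] = page` and return `requests.get('http://ws.audioscrobbler.com/2.0/', headers=my_headers, params=my_parameters).json()`.
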
### 3.4 Caching

To avoid sending the same request multiple times, use the package `requests_cache` to create a local cache, so that no request is sent when the response can already be found in the cache.

In [None]:
import requests_cache

requests_cache.install_cache()
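
To illustrate the idea behind caching (not how `requests_cache` is actually implemented), here is a tiny in-memory sketch. `SimpleCache` and `fake_fetch` are hypothetical names, and the fake function replaces the real network call.

```python
import json


class SimpleCache:
    """Remember responses by (url, params) so repeated requests skip the network."""

    def __init__(self, fetch):
        self._fetch = fetch   # function (url, params) -> parsed data
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, url, params=None):
        # Sort the keys so that equal parameter sets map to the same cache key.
        key = (url, json.dumps(params or {}, sort_keys=True))
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._fetch(url, params)
        return self._store[key]


calls = []

def fake_fetch(url, params):
    """Stand-in for a real request; records every call it receives."""
    calls.append((url, params))
    return {"url": url, "params": params}


cache = SimpleCache(fake_fetch)
cache.get("http://example.com/api", {"page": 1, "limit": 2})
cache.get("http://example.com/api", {"limit": 2, "page": 1})  # same request, served from the cache
cache.get("http://example.com/api", {"page": 2, "limit": 2})
print(cache.hits, cache.misses, len(calls))
```

Only two of the three calls reach `fake_fetch`, because the second one asks for data that is already stored.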