# Web Scraping Workshop 1 - Web API Requests
Prepared by: Nickolas K. Freeman, Ph.D.

In this notebook, we will see how to use Python to make HTTP requests. In particular, we will use the Python `requests` library to retrieve data from application programming interfaces (APIs) available online. Additional information on HTTP requests and the `requests` library can be found in the PowerPoint slides included in the repository ('WS1_Slides') and in the article available at https://realpython.com/python-requests/.

The following code block imports the python `requests` library.

In [1]:
import requests

To get a better understanding of HTTP requests, we will first target the website `httpbin.org`. This website allows developers to test their requests before deploying applications. The following code block: 1) creates a variable named `target_url` that points to an area of `httpbin.org` that allows users to test **GET** requests, 2) makes a **GET** request using the `get` method available in the Python `requests` package and stores the response in a variable named `r`, and 3) prints the content of the `r` object in *JavaScript Object Notation* format (json).

In [2]:
# 1) specify target_url 
target_url = 'http://httpbin.org/get'

# 2) make request
r = requests.get(target_url)

# 3) print response as json
print(r.json())

{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.22.0', 'X-Amzn-Trace-Id': 'Root=1-5e5aa644-b9d8afe2a885dca241cfcaae'}, 'origin': '75.137.131.118', 'url': 'http://httpbin.org/get'}


Note that `r.json()` parses the json content of the response into a Python dictionary. Inspecting the dictionary, we can see that there are keys for `args`, `headers`, `origin`, and `url`. The values stored under these keys give us an idea of some of the information that we transmit when making HTTP requests. Moreover, we can control this information to some degree. To demonstrate this, we will look into how we can modify the headers that we send with a request. Specifically, we will modify our `User-Agent` and add a `Referer`. Information on valid HTTP request headers can be found at https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers.
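As a small illustration of inspecting such a dictionary, the sketch below works on a hard-coded, abbreviated copy of the response printed earlier, so it runs without a network connection. The variable name `sample_response` is an assumption introduced here; in the notebook, the same operations would apply to the object returned by `r.json()`.

```python
# Abbreviated, hard-coded copy of the response printed earlier.
# `sample_response` is a hypothetical name; in the notebook this
# dictionary would come from r.json().
sample_response = {
    'args': {},
    'headers': {'Accept': '*/*',
                'Host': 'httpbin.org',
                'User-Agent': 'python-requests/2.22.0'},
    'origin': '75.137.131.118',
    'url': 'http://httpbin.org/get',
}

# List the top-level keys of the dictionary
top_level_keys = sorted(sample_response.keys())
print(top_level_keys)

# Nested values are reached by chaining keys
user_agent = sample_response['headers']['User-Agent']
print(user_agent)
```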

Before we demonstrate how to modify the user agent and referer headers, let's understand the role that they play in an http request:
- `User-Agent`: Contains a characteristic string that allows the network protocol peers to **identify the application type, operating system, software vendor or software version of the requesting software user agent**. 
- `Referer`: The address of the previous web page from which a link to the currently requested page was followed.
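To see which headers `requests` attaches when none are supplied, the following sketch prints the library's defaults without sending a request. It relies on the helper functions `requests.utils.default_headers` and `requests.utils.default_user_agent`; the exact version string printed depends on the installed version of `requests`.

```python
import requests

# Headers that requests attaches when we do not supply our own
default_headers = requests.utils.default_headers()

# The default User-Agent string, e.g. 'python-requests/<version>'
default_agent = requests.utils.default_user_agent()

for name, value in default_headers.items():
    print(f'{name}: {value}')
```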

Note that a website we make a request to can use these two headers to help determine whether we are a real user or a computer program. Specifically, as the response we received earlier shows, the default `User-Agent` used by the `requests` package reveals that we are making the request with that package. A website wishing to deter programmatic access can easily detect and deny such requests. Also, a request may seem more realistic if it appears to have been referred from a search engine such as Google. To modify these headers, we pass a dictionary of headers when we make a request. The following code demonstrates how this can be done. Specifically, we:

1. Define a variable named `my_user_agent`, which stores a realistic browser User-Agent string,
2. Define a dictionary object named `my_headers`,
3. Add the user-agent variable to the dictionary under the `User-Agent` key,
4. Add a `Referer` header that suggests we followed a link from the Google search engine,
5. Make the same request as before with our new headers, and
6. Print the response as json.

In [3]:
my_user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'

my_headers = {'User-Agent': my_user_agent, 
             'Referer': 'https://www.google.com'}

r = requests.get(target_url, headers = my_headers)

print(r.json())

{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'Referer': 'https://www.google.com', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36', 'X-Amzn-Trace-Id': 'Root=1-5e5aa644-070447a8fe2d171c54bbeef2'}, 'origin': '75.137.131.118', 'url': 'http://httpbin.org/get'}


The printed output shows that our headers were correctly modified. Moreover, our request will now look more realistic to a target website. We will now look at how to pass parameters with a request. This is very common when working with web APIs, where the parameters are used to filter the data returned by the request and, oftentimes, to authenticate users. Similar to how we specified headers, we can specify parameters by passing a dictionary of parameters when we make the request. The following code block demonstrates this approach.

In [4]:
# Define a test parameter, param1
my_params = {'param1': 'my_param_value'}

# Make the request, passing headers and parameters
r = requests.get(target_url, 
                 headers = my_headers, 
                 params = my_params)

# Print the response as json
print(r.json())

{'args': {'param1': 'my_param_value'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'Referer': 'https://www.google.com', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36', 'X-Amzn-Trace-Id': 'Root=1-5e5aa644-499a99f21ddd581a1fa3f1fe'}, 'origin': '75.137.131.118', 'url': 'http://httpbin.org/get?param1=my_param_value'}


Notice that the response output includes our parameters as `args`. Also notice that the `url` value has been updated. In particular, the string `?param1=my_param_value` was appended to the end of our `target_url`.
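The way headers and parameters end up in the outgoing request can also be checked locally, before anything is sent. The following sketch builds a `requests.Request` object and calls its `prepare` method, which returns the request as it would be transmitted. It reuses the values from the earlier code blocks; the variable name `prepared` is an assumption introduced here.

```python
import requests

target_url = 'http://httpbin.org/get'
my_headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; WOW64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/79.0.3945.130 Safari/537.36'),
    'Referer': 'https://www.google.com',
}
my_params = {'param1': 'my_param_value'}

# Build the request without sending it
prepared = requests.Request('GET',
                            target_url,
                            headers = my_headers,
                            params = my_params).prepare()

# The parameters have been encoded into the query string of the url
print(prepared.url)

# The headers we supplied are attached to the prepared request
print(prepared.headers['Referer'])
```

Because nothing is transmitted, this is a convenient way to confirm how a parameter dictionary will be encoded before pointing a request at a real API.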

# Requesting Bike-Sharing Data From http://api.citybik.es

We will now use what we have learned so far to harvest bike-sharing data using the open (at the time of writing, 2/29/2020) web API available at http://api.citybik.es. This API allows users to look up bike-sharing networks located across the world and to request the current status of individual networks. We will use the `pandas` package to store the data. The following code block imports `pandas`.

In [5]:
import pandas as pd

Reading the documentation available at http://api.citybik.es/v2/, we can see that the API has an endpoint for obtaining data regarding bike-sharing networks, 'http://api.citybik.es/v2/networks'. The following code block makes a request to this endpoint, stores the response as json, and prints the keys of the json object.

In [6]:
# Specify endpoint
endpoint = 'http://api.citybik.es/v2/networks'

# Make the request
r = requests.get(endpoint)

# Store the response as json
response = r.json()

# Print the keys of the json response
print(response.keys())

dict_keys(['networks'])
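Requests to a live API can fail, for example because the server is slow or returns an error status. The following is a minimal sketch of a defensive wrapper, not part of the original workshop code. The function name `fetch_json` and its `session` parameter are assumptions introduced here; the `timeout` argument and the `raise_for_status` method are standard features of `requests`.

```python
import requests

def fetch_json(url, params=None, headers=None, timeout=10, session=None):
    """Request `url` and return the parsed json content.

    `session` is a hypothetical hook: any object with a compatible
    `get` method (such as a requests.Session) can be passed in;
    otherwise the module-level requests.get is used.
    """
    http = session if session is not None else requests

    # Give up if the server does not respond within `timeout` seconds
    r = http.get(url, params=params, headers=headers, timeout=timeout)

    # Raise requests.HTTPError for 4xx and 5xx status codes
    r.raise_for_status()

    return r.json()
```

Under these assumptions, `fetch_json('http://api.citybik.es/v2/networks')` would return the same dictionary as the code block above, while raising an exception instead of silently continuing when the request fails.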


The following code block shows that the *value* associated with the `networks` key is a python list.

In [7]:
type(response['networks'])

list

The following code block prints the first three values of this list so that we can see the format of each item.

In [8]:
response['networks'][:3]

[{'company': ['ЗАО «СитиБайк»'],
  'href': '/v2/networks/velobike-moscow',
  'id': 'velobike-moscow',
  'location': {'city': 'Moscow',
   'country': 'RU',
   'latitude': 55.75,
   'longitude': 37.616667},
  'name': 'Velobike'},
 {'company': ['Gobike A/S'],
  'href': '/v2/networks/bycyklen',
  'id': 'bycyklen',
  'location': {'city': 'Copenhagen',
   'country': 'DK',
   'latitude': 55.673582,
   'longitude': 12.564984},
  'name': 'Bycyklen'},
 {'company': ['Gobike A/S'],
  'href': '/v2/networks/nu-connect',
  'id': 'nu-connect',
  'location': {'city': 'Utrecht',
   'country': 'NL',
   'latitude': 52.117,
   'longitude': 5.067},
  'name': 'Nu-Connect'}]

As we can see, each item in the list is a Python dictionary that contains information regarding a different bike-sharing network. In the following code block, we parse these dictionaries to create a pandas `DataFrame` object that contains information for all networks in the United States.

**Note: This parsing method is specific to this API. Each API returns content in a different format. Thus, you will need to modify your approach to parsing the response accordingly.**

In [9]:
# Initialize an empty python list
us_list = []

# For each dictionary in the list of dictionaries
# included in the networks key of the response
for network in response['networks']:
    
    # If the network's location is US
    if network['location']['country'] == 'US':
        
        # Append a list with the network id, city, and endpoint (href)
        # to the us_list object
        us_list.append([network['id'], network['location']['city'], network['href']])
        
# Convert the us_list object to a DataFrame and store the DataFrame
# in a variable named us_data
us_data = pd.DataFrame(us_list, columns = ['Company', 'City', 'Endpoint'])

# Print the first five rows
us_data.head()

| | Company | City | Endpoint |
|---|---|---|---|
| 0 | we-cycle | Aspen, CO | /v2/networks/we-cycle |
| 1 | arborbike | Ann Arbor, MI | /v2/networks/arborbike |
| 2 | austin | Austin, TX | /v2/networks/austin |
| 3 | bike-chattanooga | Chattanooga, TN | /v2/networks/bike-chattanooga |
| 4 | biketown | Portland, OR | /v2/networks/biketown |
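The filtering step can be generalized to any country code. The sketch below wraps the same logic in a function and runs it on a hard-coded copy of the three networks printed earlier, so it works without a network connection. The names `networks_in_country` and `sample_networks` are assumptions introduced here, and the use of `.get` to tolerate missing keys is a precaution rather than something the API documentation is known to require.

```python
def networks_in_country(networks, country_code):
    """Return [id, city, href] rows for networks in `country_code`."""
    rows = []
    for network in networks:
        # Use .get so that a record with a missing key is skipped
        # or filled with None instead of raising a KeyError
        location = network.get('location', {})
        if location.get('country') == country_code:
            rows.append([network.get('id'),
                         location.get('city'),
                         network.get('href')])
    return rows

# Abbreviated copy of the first three networks printed earlier
sample_networks = [
    {'href': '/v2/networks/velobike-moscow', 'id': 'velobike-moscow',
     'location': {'city': 'Moscow', 'country': 'RU'}},
    {'href': '/v2/networks/bycyklen', 'id': 'bycyklen',
     'location': {'city': 'Copenhagen', 'country': 'DK'}},
    {'href': '/v2/networks/nu-connect', 'id': 'nu-connect',
     'location': {'city': 'Utrecht', 'country': 'NL'}},
]

dk_rows = networks_in_country(sample_networks, 'DK')
print(dk_rows)
```

The returned list of rows can be passed to `pd.DataFrame` with a `columns` argument, exactly as in the code block above.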


Although we will not go through making requests to all of the endpoints stored in the `us_data` DataFrame, I will demonstrate how to access data for one of the endpoints. In particular, the following code block shows how we can use the endpoint specified in the first row of the `us_data` object to request data for the first network.

In [10]:
# Get endpoint stored in first row
current_endpoint = us_data.loc[0, 'Endpoint']

# Specify base url
base_url = 'http://api.citybik.es'

# Concatenate current endpoint and base url 
# to get endpoint for request
full_url = base_url + current_endpoint

# Make request
r = requests.get(full_url)

# Store the response as json
response = r.json()
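If we did want to visit several endpoints, the full URLs could be built in one pass. The following sketch uses `urljoin` from the standard library as an alternative to string concatenation; the list `sample_endpoints` is a hard-coded copy of the first three `Endpoint` values shown earlier, and both it and `full_urls` are names introduced here for illustration.

```python
from urllib.parse import urljoin

# Base url of the API
base_url = 'http://api.citybik.es'

# Hard-coded copy of the first three Endpoint values shown earlier
sample_endpoints = ['/v2/networks/we-cycle',
                    '/v2/networks/arborbike',
                    '/v2/networks/austin']

# urljoin combines the base url with each endpoint path
full_urls = [urljoin(base_url, endpoint) for endpoint in sample_endpoints]

for url in full_urls:
    print(url)
```

When requesting each of these URLs in a loop, pausing briefly between requests (for example with `time.sleep`) is a courteous way to avoid overloading a free API.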

The response returned for a specific network differs from that returned for all networks. The following code block shows that it contains details on the stations associated with the network and the current status of each station.

In [11]:
response['network']['stations'][:3]

[{'empty_slots': 2,
  'extra': {'address': 'Basalt',
   'last_updated': 1572297881,
   'renting': 1,
   'returning': 1,
   'uid': '1482'},
  'free_bikes': 3,
  'id': 'de99fa376a6ff38db656affe33d40e52',
  'latitude': 39.380066,
  'longitude': -107.081968,
  'name': 'Evans Rd',
  'timestamp': '2020-02-29T17:54:05.234000Z'},
 {'empty_slots': 8,
  'extra': {'address': 'Aspen',
   'last_updated': 1572470181,
   'renting': 0,
   'returning': 1,
   'uid': '1480'},
  'free_bikes': 0,
  'id': 'c7fddca9c7f33987a59095d9fa1dba04',
  'latitude': 39.1916,
  'longitude': -106.823,
  'name': 'Hotel Aspen',
  'timestamp': '2020-02-29T17:54:05.255000Z'},
 {'empty_slots': 6,
  'extra': {'address': 'Basalt',
   'last_updated': 1572384718,
   'renting': 0,
   'returning': 1,
   'uid': '1485'},
  'free_bikes': 0,
  'id': 'edd6bdd575f305389a942010c19b3fd0',
  'latitude': 39.359289,
  'longitude': -107.023232,
  'name': 'Roaring Fork Club Employee Housing',
  'timestamp': '2020-02-29T17:54:05.256000Z'}]
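As a sketch of how station records like these might be flattened for analysis, the following code extracts a few fields from each station and totals the available bikes. It runs on a hard-coded, abbreviated copy of the three stations printed above. The names `sample_stations` and `summarize_station` are introduced here, and treating `last_updated` as a Unix timestamp in seconds is an assumption based on the magnitude of the values, not something confirmed by the API documentation.

```python
from datetime import datetime, timezone

# Abbreviated copy of the three stations printed above
sample_stations = [
    {'empty_slots': 2, 'free_bikes': 3, 'name': 'Evans Rd',
     'extra': {'address': 'Basalt', 'last_updated': 1572297881}},
    {'empty_slots': 8, 'free_bikes': 0, 'name': 'Hotel Aspen',
     'extra': {'address': 'Aspen', 'last_updated': 1572470181}},
    {'empty_slots': 6, 'free_bikes': 0,
     'name': 'Roaring Fork Club Employee Housing',
     'extra': {'address': 'Basalt', 'last_updated': 1572384718}},
]

def summarize_station(station):
    """Flatten one station dictionary into a single-level dictionary."""
    extra = station.get('extra', {})
    last_updated = extra.get('last_updated')
    return {
        'name': station['name'],
        'address': extra.get('address'),
        'free_bikes': station['free_bikes'],
        'empty_slots': station['empty_slots'],
        # Assumption: last_updated is a Unix timestamp in seconds
        'last_updated_utc': (
            datetime.fromtimestamp(last_updated, tz=timezone.utc)
            if last_updated is not None else None),
    }

station_rows = [summarize_station(s) for s in sample_stations]

# Total number of bikes currently available across the stations
total_free_bikes = sum(row['free_bikes'] for row in station_rows)
print(total_free_bikes)
```

A list of flat dictionaries such as `station_rows` can be passed directly to `pd.DataFrame` to obtain one row per station.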

This concludes this introduction to making web API requests using Python. This was a very brief introduction, and there is much more to learn. Fortunately, a lot of good information is available online.