# Fundamentals of Data Analysis with Python 

## Day 2: Collecting Data from the Web

49th [GESIS Spring Seminar: Digital Behavioral Data](https://training.gesis.org/?site=pDetails&pID=0xA33E4024A2554302B3EF4AECFC3484FD)   
Cologne, Germany, March 2-6 2020

### Course Developers and Instructors 

* Dr. [John McLevey](http://www.johnmclevey.com), University of Waterloo (john.mclevey@uwaterloo.ca)
* [Jillian Anderson](https://ca.linkedin.com/in/jillian-anderson-34435714a), Simon Fraser University (jillianderson8@gmail.com)

<hr>


### Overview 

High-level overview coming soon... 

### Plan for the Day

1. [What you need to know about how the Internet works to collect data from the web](#wyntk)
2. [Scraping the Web](#scrape)
    * How to scrape text and tables from static websites with BeautifulSoup
    * An overview of working with (1) multiple pages and (2) interactive content
3. [Collecting data via Application Programming Interfaces](#apis)
    * Understanding APIs 
    * The Twitter API 
    * The Guardian API 
4. [Simple text processing with web data](#text)

<hr>

# What you need to know about how the Internet works to collect data from the web <a id='wyntk'></a>

# Scraping the Web <a id='scrape'></a>

# Collecting data via Application Programming Interfaces <a id='apis'></a>

1. [Understanding APIs](#understanding_apis)
2. [API Best Practices](#api_best)
3. [The Guardian API](#guardian)   
    a. [Overview](#g_overview)      
    b. [API Basics](#g_basics)      
    c. Filtering   
    d. Extra data   
    e. More results   
4. The Twitter API
5. Wikipedia API

<a id='understanding_apis'></a>
## Understanding APIs

Application Programming Interfaces (APIs) offer an alternative way to access data from online sources. They provide an explicit _interface_ to the data behind the website, defining how you can request data and in what format you will receive it.

### Key Components of API Requests & Responses
**Endpoints** are the specific web locations where a request for a particular resource can be sent. Usually they have descriptive names like Content, Tweet, User, etc. We communicate with APIs by sending requests to these endpoints, usually in the form of a URL. 

These URLs usually contain optional **queries**, **parameters**, and **filters** that let us specify exactly what we want the API to return. 
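As a rough illustration of how parameters end up in a request URL, the sketch below uses the standard library's `urllib.parse.urlencode` to attach a query and a filter to the Guardian content endpoint. The parameter names mirror the Guardian examples later in this section, and `YOUR_KEY` is a placeholder rather than a working credential.

```python
from urllib.parse import urlencode

# Endpoint: the web location the request is sent to
endpoint = "https://content.guardianapis.com/search"

# A query, a filter, and a credential, written as key-value pairs.
# 'YOUR_KEY' is a placeholder used only for illustration.
params = {
    "q": "Cologne",
    "to-date": "2019-12-31",
    "api-key": "YOUR_KEY",
}

# Everything after the '?' is the encoded set of parameters
url = endpoint + "?" + urlencode(params)
print(url)
```

Libraries such as `requests` perform this encoding for you when you pass a dictionary through the `params` argument.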

Once a request has been made to the API, it returns a **response**. Every response has a response code, which indicates whether the request was successful (200) or encountered an error (400, 401, 500, etc.). When you encounter a problem, it's a good idea to confirm that you received a successful response rather than one of the [many error responses](https://documentation.commvault.com/commvault/v11/article?p=45599.htm).

As long as a request was successful, it will return a 200 OK response along with all the requested data. We will delve into what this data looks like below. 
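To make the response codes a little more concrete, here is a minimal sketch that labels a status code using the standard library's `http.HTTPStatus`. The helper name `describe_status` is hypothetical and is not part of any API client.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a short, human-readable label for an HTTP status code."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # Codes that the standard library does not know about
        phrase = "Unknown"

    if 200 <= status_code < 300:
        kind = "success"
    elif 400 <= status_code < 500:
        kind = "client error"
    elif 500 <= status_code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{status_code} {phrase} ({kind})"


for code in (200, 400, 401, 500):
    print(describe_status(code))
```

When working with the `requests` library, the code is available as `response.status_code`, and `response.raise_for_status()` raises an exception for 4xx and 5xx responses.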

### APIs vs Web Scraping 

Benefits: 
* Structured data (for the most part). 
* Controlled by an organization or company (Guardian, Twitter, etc.)
* Documented (usually)
* Maintained (usually)
* Rules for access are explicitly stated

Drawbacks: 
* Limited to the data made explicitly available
* Relies on the organization to make updates
* Rate limits & other restrictions apply and are usually based on business reasons rather than technical limitations


<a id='api_best'></a>
## API Best Practices

**Never store credentials in public locations**   
This includes git repositories!

Making an API token public allows others to access the API as if they were you. This puts you at risk if they violate the terms of service you agreed to when you requested an API token. 

For example, if someone were to get ahold of your API token, they could use it to launch a [denial of service attack](https://en.wikipedia.org/wiki/Denial-of-service_attack) on The Guardian's API. In this case, your token may be revoked and you'd be unable to request a new API token in the future without further violating the terms of service.

To mitigate this problem, I would recommend one of two options:
* Storing API tokens as environment variables
* Creating a `cred.py` to store credentials such as API tokens

Personally, I use a `cred.py` containing any of the credentials I need to access APIs, databases, etc. I keep this file stored on my computer in a single location which can be accessed by any Python script on my machine (usually a directory on Python's module search path, such as one listed in `PYTHONPATH`). This way, the API token is outside of a script I might share and the file is outside of a git repo I might make public one day.

If for some reason you need to store the `cred.py` file in the same directory as your Python file and this is within your git repo, make sure to add `cred.py` to the `.gitignore` file. 
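For the environment-variable option, a minimal sketch might look like the following. The variable name `GUARDIAN_API_KEY` and the helper `load_api_key` are assumptions chosen for illustration; use whatever name you set in your own shell.

```python
import os


def load_api_key(var_name="GUARDIAN_API_KEY"):
    """Read an API key from an environment variable (hypothetical name)."""
    key = os.environ.get(var_name)
    if not key:
        # Fail early with a clear message instead of sending a bad request
        raise RuntimeError(f"Environment variable {var_name} is not set.")
    return key
```

With the `cred.py` approach, the file would simply contain a line such as `guardian_key = '...'`, which a script can then read as `cred.guardian_key` after `import cred`.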

**Know your rate limits**   
Understand what you've agreed to when signing up for an API token. Terms of service are often (but not always) summarized to make sure the most important information is readily available. This information usually includes things like: 

* Rate limits
* Disallowed uses
* Limitations for sharing data
* Intellectual Property considerations

While most of these are somewhat self-explanatory, it's worthwhile taking some time to go over what rate limits are and how they are controlled.

Rate limits are the upper bound placed on how many API requests a user can make in a given amount of time. These numbers differ between websites and even user types. The idea is to limit the rate of requests and ensure the website isn't overrun with traffic.

In general, rate limits are controlled in two ways. Some websites have built-in systems that detect over-use and throttle or revoke access for a token that is over-requesting. Other websites rely on the honour system, asking you to abide by their guidelines. In these cases it is easier to exceed the limits (since there is no throttling), and you run a greater risk of being blacklisted if you do.
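When an API relies on the honour system, you can enforce the limit yourself. The sketch below is one simple way to do that: a small `Throttle` class (a hypothetical helper, not part of any library) that sleeps just long enough to keep calls below a chosen rate. The clock and sleep functions are passed in as arguments only so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Space out calls so that at most `max_per_second` happen each second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

In a request loop you would create one `Throttle` (for example `Throttle(max_per_second=10)` to stay under a limit of 12) and call its `wait()` method immediately before each request.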

**Be Frugal**   
Whenever possible, store the results from API requests.    
**<font color='crimson'>Fill this in</font>**
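One way to store results is to write them to a JSON file the first time they are fetched and read them back from disk afterwards. The sketch below assumes the results are JSON-serializable (as parsed API responses are); the helper name `cached` and the `fetch` argument are hypothetical.

```python
import json
from pathlib import Path


def cached(path, fetch):
    """Return results stored at `path`, calling `fetch()` only if needed.

    `fetch` is any zero-argument function that performs the API request
    and returns JSON-serializable data.
    """
    path = Path(path)
    if path.exists():
        # Reuse the stored copy instead of spending another API call
        return json.loads(path.read_text(encoding="utf-8"))

    results = fetch()
    path.write_text(json.dumps(results), encoding="utf-8")
    return results
```

Calling `cached('results.json', my_fetch_function)` twice makes only one API request; the second call reads the file.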


<a id='guardian'></a>
## The Guardian API

<a id='g_overview'></a>
### Overview
The Guardian's API allows us to query and download data related to their published articles. 

#### Endpoints
The Guardian API makes available five **endpoints**: 
* Content (`https://content.guardianapis.com/search`) &mdash; returns content (text only for developer keys). Allows querying and filtering to reduce what is returned.  
* Tags &mdash; will return all API tags (> 50,000). These tags can be used in other queries.
* Sections &mdash; logical grouping of content
* Editions &mdash; the content for each of the three regional main pages
* Single Item &mdash; will return all data related to a specific item (content, tag, or section) in the API. 

#### Limitations
With a non-commercial developer key, you receive:
* Up to 12 calls per second
* Up to 5,000 calls per day
* Access to article text (no image, audio, or video)
* Access to a subset of Guardian content (1.9 million pieces)


<a id='g_basics'></a>
### API Basics
Often, the easiest way to interface with an API is through a client. In Python, these clients are just packages that provide functions to simplify the process of accessing the API. 

The Guardian maintains and supports one client &mdash; the Scala client library. Other clients, such as the [Python client library](https://github.com/prabhath6/theguardian-api-python), are built and supported by the community.

We can also access APIs directly, without relying on a client. This is what we will do here, using the [requests](https://requests.readthedocs.io/en/master/) library.

To start, let's use the requests package to make a generic request to the content endpoint.

In [None]:
# Importing libraries only needs to be done once
import requests
import pprint as pp
import cred  # This is the credentials file you just created

In [None]:
API_ENDPOINT = 'http://content.guardianapis.com/search'

api_key = cred.guardian_key
MY_PARAMS = {'api-key': api_key}

response = requests.get(API_ENDPOINT, params=MY_PARAMS)

response_dict = response.json()['response']
pp.pprint(response_dict)

There is quite a bit of information there...

Let's break it down a bit. What are the individual fields contained within the response?

In [None]:
response_dict.keys()

Each of these is described in the [content endpoint's documentation](https://open-platform.theguardian.com/documentation/search). We can examine each field individually by indexing our response dictionary.

Let's start by seeing what order was used to sort the results.

In [None]:
response_dict['orderBy']

In the cell below, find the total number of items that were returned in this call. Refer to the [documentation](https://open-platform.theguardian.com/documentation/search) if you aren't sure which field you are interested in.   

In [None]:
# Your Answer Here

The interesting part of the response is what is contained within the `results` field. The results contain the individual items provided by the endpoint. In our case, this is content (mainly news articles).

In the cell below, examine what is contained within the results field and answer (1) what data structure is being used to store the results (dictionaries, lists, etc.), (2) what data is stored for each result, and (3) how many results were returned. 

In [None]:
# Your Answer Here

#### Being More Specific
Often we are interested in receiving very specific data from an API, rather than receiving all the data and then sifting through it later on.

Luckily, most APIs have built-in ways to narrow a request. In The Guardian's API these are called queries and filters.

**Queries** allow you to request content containing free text. This works much like a search engine: you can use double quotes to query exact phrase matches, and the AND, OR, and NOT operators are supported.   

**Filters** allow you to request content based on specific [metadata](https://dataedo.com/kb/data-glossary/what-is-metadata). Once again, you can check the [documentation](https://open-platform.theguardian.com/documentation/search) to see what metadata is available for filtering. 

We will start off simple. You might have noticed earlier that our response from the API contained the most recent content available. What if we are actually only interested in retrieving content published before January 1, 2020?

In [None]:
MY_PARAMS = {'api-key': api_key, 
             'to-date': '2019-12-31'}

response = requests.get(API_ENDPOINT, params=MY_PARAMS)

response_dict = response.json()['response']
pp.pprint(response_dict)

We can keep tacking on parameters to further specify the types of results we want to receive.

In [None]:
MY_PARAMS = {'api-key': api_key, 
             'to-date': '2019-12-31', 
             'lang': 'en', 
             'section': 'travel',
             'q': '(Cologne OR Koln) AND Germany'}

response = requests.get(API_ENDPOINT, params=MY_PARAMS)

response_dict = response.json()['response']
pp.pprint(response_dict)

In the cell below, write an API request to fetch content using a query and at least 2 filters. 

In [None]:
# YOUR ANSWER HERE

#### Getting More Information
You may have noticed in the previous API requests and responses that while we were receiving article URLs, sections, and publication dates, we were missing some pretty important data. Things like headlines, bylines, and body text are not included in the default API response. This additional information is available, but needs to be specified using the `show-fields` parameter.

In [None]:
API_ENDPOINT = 'http://content.guardianapis.com/search'

MY_PARAMS = {'api-key': api_key, 
             'to-date': '2019-12-31', 
             'lang': 'en', 
             'section': 'travel',
             'q': '(Cologne OR Koln) AND Germany',
             'show-fields': 'wordcount,body,byline'}

response = requests.get(API_ENDPOINT, params=MY_PARAMS)

response_dict = response.json()['response']

response_dict

In the cell below, write code to access and print the body text of an article from the `response_dict`. 

In [None]:
# Your Answer Here

#### Requesting More Content

In the API response, there are three fields that relate to the number of results obtained from an API request &mdash; `total`, `pages`, and `pageSize`. 

In [None]:
API_ENDPOINT = 'http://content.guardianapis.com/search'
MY_PARAMS = {'api-key': api_key, 
             'to-date': '2019-12-31', 
             'lang': 'en', 
             'section': 'travel', 
             'q': 'Vancouver'}
response = requests.get(API_ENDPOINT, params=MY_PARAMS)
response_dict = response.json()['response']

In [None]:
response_dict['total']

In [None]:
response_dict['pages']

In [None]:
response_dict['pageSize']

When looking at them all together, it becomes clearer how they relate.

* `total` is the number of items available to be returned.  
* `pages` is the number of pages available for return, where each page is a small subset of the total number of items.   
* `pageSize` is how many items are in the current page being returned.   
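The relationship between the three fields is simple arithmetic: the number of pages is the total divided by the page size, rounded up. The sketch below uses a made-up total of 380 items purely for illustration; it is not the value returned by the request above.

```python
import math


def pages_needed(total, page_size):
    """Number of pages required to hold `total` items."""
    return math.ceil(total / page_size)


total = 380  # hypothetical number of matching items
for page_size in (10, 50):
    print(f"page size {page_size}: {pages_needed(total, page_size)} pages")
```

Because of the rounding, the last page usually holds fewer items than the page size.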

If it's hard to imagine the differences between these values, think about how Google search results are split across pages.   

The key point for us to know is that in a basic API request we are likely only receiving a fraction of the total items available for return. If we want to retrieve all the data, we need to look at (1) increasing the page size and (2) automatically requesting data from the next page.

First, in the cell below, update the code to increase the page size from 10 to 50. Use the API [documentation](https://open-platform.theguardian.com/documentation/search) to find the right parameter. 

In [None]:
API_ENDPOINT = 'http://content.guardianapis.com/search'
MY_PARAMS = {'api-key': api_key, 
             'to-date': '2019-12-31', 
             'lang': 'en', 
             'section': 'travel', 
             'q': 'Vancouver'}
response = requests.get(API_ENDPOINT, params=MY_PARAMS)
response_dict = response.json()['response']

Run the cell below to verify you successfully increased the number of results per page to 50. 

In [None]:
if response_dict['pageSize'] < 50:
    print('The page size is still less than 50. Try again.')
elif response_dict['pageSize'] == 50: 
    print('The page size is now 50. Good job!')
elif response_dict['pageSize'] > 50: 
    print('The page size is now greater than 50. How did you do that?')

Now that each page can display 50 results, nearly 5x fewer pages are needed to contain all of the data we need!

In [None]:
response_dict['pages']

However, we still need to find a way to gather data from all the pages, instead of just the first. 

Luckily, The Guardian API has a built-in `page` parameter that allows us to specify which page we want to get results from. We can combine this type of request with a `while` loop to help automate our API requests.

> **Reminder!**   
We are limited to 12 calls per second and 5,000 calls a day. We will only be making 8 calls here, so it won't be a problem. However, if you decide to use a smaller page size or a different request altogether, you need to make sure you are still abiding by these limits. Otherwise you risk having your requests throttled or, under extreme conditions, having your token revoked.

In [None]:
# Normal Setup
API_ENDPOINT = 'http://content.guardianapis.com/search'
MY_PARAMS = {'api-key': api_key, 
             'to-date': '2019-12-31', 
             'lang': 'en', 
             'section': 'travel', 
             'q': 'Vancouver', 
             'page-size': 50, 
             'show-fields': 'wordcount,body,byline'}

# Collect All Results
all_results = []
cur_page = 1
total_pages = 1

while (cur_page <= total_pages) and (cur_page < 10):  # fail-safe: stop after at most 9 requests
    # Make an API request
    MY_PARAMS['page'] = cur_page
    response = requests.get(API_ENDPOINT, params=MY_PARAMS)
    response_dict = response.json()['response']

    # Update our master results list
    all_results.extend(response_dict['results'])
    
    # Update our loop variables
    total_pages = response_dict['pages']
    cur_page += 1

In [None]:
print("Total # of results: {}".format(len(all_results)))

In [None]:
all_results[36]

Now that we have the results, we can continue to access them and work with them, without having to make more API requests. 
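As one example of working with the stored results, the sketch below pulls a few values out of each item into a flat list of dictionaries. The key names (`webTitle`, `webPublicationDate`, and the nested `fields` dictionary) are assumptions about the response layout, so check them against one of your own items first. The `sample_results` list is invented data that stands in for `all_results`.

```python
def summarize(results):
    """Extract a few values from each result into a flat list of rows."""
    rows = []
    for item in results:
        # 'fields' is only present when 'show-fields' was requested
        fields = item.get("fields", {})
        rows.append({
            "title": item.get("webTitle"),
            "date": item.get("webPublicationDate"),
            "byline": fields.get("byline"),
            "wordcount": int(fields.get("wordcount", 0)),
        })
    return rows


# Invented sample data, shaped like the assumed response layout
sample_results = [
    {
        "webTitle": "A weekend in the city",
        "webPublicationDate": "2019-06-01T10:00:00Z",
        "fields": {"byline": "A. Writer", "wordcount": "500"},
    },
    {
        "webTitle": "Ten places to visit",
        "webPublicationDate": "2018-03-15T08:30:00Z",
    },
]

for row in summarize(sample_results):
    print(row)
```

Using `.get()` rather than square-bracket indexing keeps the loop from failing when an item is missing one of the optional values.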

### Learning More
If you want to learn more about the Guardian API or want to ask questions of others working with the API, I would recommend checking out the [Guardian API talk board]() and the [Guardian developer blog](). 

## The Twitter API

## The Wikipedia API

# Simple text processing with web data <a id='text'></a>