# Fundamentals of Data Analysis with Python 

## Day 2: Collecting Data from the Web

49th [GESIS Spring Seminar: Digital Behavioral Data](https://training.gesis.org/?site=pDetails&pID=0xA33E4024A2554302B3EF4AECFC3484FD)   
Cologne, Germany, March 2-6 2010

### Course Developers and Instructors 

* Dr. [John McLevey](http://www.johnmclevey.com), University of Waterloo (john.mclevey@uwaterloo.ca)     
* [Jillian Anderson](https://ca.linkedin.com/in/jillian-anderson-34435714a), Simon Fraser University (jillianderson8@gmail.com) 

<hr>


### Overview 

High-level overview coming soon... 

### Plan for the Day

1. [What you need to know about how the Internet works to collect data from the web](#wyntk)
2. [Scraping the Web](#scrape)
    * How to scrape text and tables from static websites with BeautifulSoup
    * An overview of working with (a) multiple pages and (b) interactive content 
3. [Collecting data via Application Programming Interfaces](#apis)
    * Understanding APIs 
    * The Twitter API 
    * The Guardian API 
4. [Simple text processing with web data](#text)

<hr>

# What you need to know about how the Internet works to collect data from the web <a id='wyntk'></a>

# Scraping the Web <a id='scrape'></a>

# Collecting data via Application Programming Interfaces <a id='apis'></a>

## Understanding APIs

Application Programming Interfaces (APIs) offer an alternative way to access data from online sources. An API provides an explicit _interface_ to the data behind a website: it defines how you can request data from the website and the format in which you will receive it. 

### What APIs are Made of
* Endpoints
* Queries (?)
* Filters (?) 
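
To make these pieces concrete, here is a minimal sketch that takes an API request URL apart using only the Python standard library. The URL, host name, and parameter names are hypothetical and chosen purely for illustration; they do not belong to any real API.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical request URL: everything before "?" identifies the endpoint,
# everything after it is the query (search terms and filters).
url = "https://api.example.com/articles?q=climate&from-date=2020-01-01&page-size=10"

parts = urlsplit(url)

# The endpoint is the address we send the request to.
endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"

# parse_qs returns a dict mapping each parameter name to a list of values.
params = parse_qs(parts.query)

print(endpoint)             # https://api.example.com/articles
print(params["q"])          # ['climate']
print(params["from-date"])  # ['2020-01-01']
```

In this reading, `q` plays the role of the query, while `from-date` and `page-size` act as filters that narrow or shape what comes back.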


### APIs vs Web Scraping

Benefits: 
* Structured data (for the most part). 
* Regulated or Controlled by 
* Usually well documented by the company 
* Maintained by the company/organization (not a random person on GitHub)
* Explicitly allowable

Drawbacks: 
* You get the data you get 
* Relies on the company to update the codebase (changes may not be reflected in the API), whereas with open-source tools anyone can make the change themselves. 
* Rate limits & other restrictions based on company business decisions rather than technical limitations



## API Best Practices

## The Twitter API

## The Guardian API
The Guardian's API allows us to query and download data related to their published articles. 

API token -- important! (Move to Understanding APIs or a Best Practices section under APIs.) Never store your API token in a git repo or any other publicly available location. This is dangerous because it allows other people to use your credentials to access the API, making any of their requests traceable to you. In the case of the Guardian, this is a problem because if someone were to get hold of your key and use it to launch a denial-of-service (DoS) attack on the Guardian, it's quite likely your token would be revoked and you would be unable to request a new one in the future. 

To solve this problem, I would recommend creating a `cred.py` file that is stored on your computer, outside the repo, and imported by the Python files you are working in. Ideally, it lives in one location on your machine that all your Python projects can import from (a directory on Python's import path, for example one listed in `PYTHONPATH`). (Need to look into best practices on this.) If for some reason you need to store this file in the same directory as your Python file (and thus inside the repo directory), make sure to add `cred.py` to the `.gitignore` file. 
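
As a rough sketch of the `cred.py` approach, the helper below imports a credentials module from a directory outside the repo and returns the key stored in it. The function name `load_api_key`, the variable name `GUARDIAN_API_KEY`, and the example directory are all assumptions made for illustration, not an established convention.

```python
import importlib
import sys
from pathlib import Path

# Assumed contents of cred.py (kept OUTSIDE the repo):
#
#     GUARDIAN_API_KEY = "your-key-here"


def load_api_key(cred_dir, module_name="cred", attr="GUARDIAN_API_KEY"):
    """Import `module_name` from `cred_dir` and return the value of `attr`.

    All three names are hypothetical defaults; adjust them to your setup.
    """
    cred_dir = str(Path(cred_dir).expanduser())
    # Make the credentials directory importable for this session only.
    if cred_dir not in sys.path:
        sys.path.insert(0, cred_dir)
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Example usage (assumes ~/.credentials/cred.py exists):
# api_key = load_api_key("~/.credentials")
```

If the directory is already listed in `PYTHONPATH`, a plain `from cred import GUARDIAN_API_KEY` does the same job without the helper.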

### Rate Limits
Rate limits are defined by the [type of key](https://open-platform.theguardian.com/access/) you've applied for. It's important to understand how these limits are enforced. For example, some websites and companies have built-in measures that reject API requests that go over the rate limit. Others rely on the honour system and ask you to abide by their guidelines. In those cases, you run the risk of being blacklisted if you exceed their rate limits, because your traffic may be flagged as a denial-of-service attack. 

For non-commercial developer keys, you receive: 
* Up to 12 calls per second
* Up to 5,000 calls per day
* Access to article text (no image, audio, or video)
* Access to a subset of Guardian content (1.9 million pieces)
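
One way to stay under a per-second limit is to pause between requests on your own side. The sketch below is a minimal client-side throttle, not part of any Guardian tooling: the class name `RateLimiter` is hypothetical, and it only spaces calls out in time; it does not track the daily quota.

```python
import time


class RateLimiter:
    """Space out calls so that at most `calls_per_second` happen each second."""

    def __init__(self, calls_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / calls_per_second
        self._clock = clock      # injectable so the class can be tested
        self._sleep = sleep
        self._last_call = None   # time of the previous call, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


if __name__ == "__main__":
    limiter = RateLimiter(calls_per_second=12)
    for page in range(1, 4):
        limiter.wait()  # call this right before each API request
        print(f"would request page {page} now")
```

Calling `limiter.wait()` before every request keeps consecutive requests at least 1/12 of a second apart, which matches the 12 calls per second listed above.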


### Endpoints
The Guardian API makes available five endpoints: 
* Content &mdash; returns content (text only for developer keys). Supports querying and filtering to reduce what is returned.  
* Tags &mdash; returns all API tags (more than 50,000). These tags can be used in other queries. 
* Sections &mdash; logical grouping of content
* Editions &mdash; the content for each of the three regional main pages
* Single Item &mdash; will return all data related to a specific item (content, tag, or section) in the API. 
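
To show how these endpoints translate into request URLs, here is a small sketch that builds (but does not send) a URL for each of them. The base URL, the paths, and the `api-key` and `q` parameter names reflect my understanding of the Guardian Open Platform documentation and should be checked against it before use; the helper `build_url` is a hypothetical name.

```python
from urllib.parse import urlencode

# Assumed base URL and paths; verify against the Guardian documentation.
BASE_URL = "https://content.guardianapis.com"
ENDPOINTS = {
    "content": "/search",
    "tags": "/tags",
    "sections": "/sections",
    "editions": "/editions",
    # Single Item has no fixed path: the item's own ID is used as the path.
}


def build_url(endpoint, api_key, params=None):
    """Return the request URL for `endpoint` with the given query parameters."""
    query = dict(params or {})
    query["api-key"] = api_key  # the key is sent as a query parameter
    return f"{BASE_URL}{ENDPOINTS[endpoint]}?{urlencode(query)}"


print(build_url("content", "test", {"q": "climate change"}))
# https://content.guardianapis.com/search?q=climate+change&api-key=test
```

The resulting string can be passed to any HTTP client. In real code, the key would come from your credentials file rather than being typed into the script.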

### Working with the API
The Guardian maintains and supports one client &mdash; the Scala client library. However, other clients are supported by the community. We will use the [Python client library], one of the community-built clients, to access the Guardian API. 

### Your Turn

### Learning More
If you want to learn more about the Guardian API or want to ask questions of others working with the API, I would recommend checking out the [Guardian API talk board]() and the [Guardian developer blog](). 

## The Wikipedia API

# Simple text processing with web data <a id='text'></a>