# STA 141B Data & Web Technologies for Data Analysis

### Lecture 8, 2/4/25, APIs


### Announcements

- HW 2 is due this Sunday. 

### Last week's topics

- Exam 
- SQL

### Today's topics

- Final project
- Getting Data from the Web
- Hypertext Transfer Protocol
- Representational State Transfer
- iTunes API
- Caching
- API Keys
- Guardian API

### Resources
 - [iTunes Search API](https://affiliate.itunes.apple.com/resources/documentation/itunes-store-web-service-search-api/)
 - [Guardian API](https://open-platform.theguardian.com/documentation/)

### Getting Data from the Web

We consider three ways one can get data from the web, from most to least convenient:
1. Direct download
2. API
3. Scraping

Always look for a direct download first!

##### Difference between web scraping and API

_Web Scraping_ refers to the process of extracting data from a website or specific webpage.

An _application programming interface_ (API) is a collection of functions and data structures for communicating with other software. For instance, whenever you use a Python package, you're using the API created by the package's developers.

The goal of both web scraping and (web) APIs is to access web data.

Web scraping allows you to extract data from any website through the use of web scraping software. On the other hand, APIs give you direct access to the data you want.

Websites sometimes provide an API so that programmers can access content without web scraping. 

### Hypertext Transfer Protocol

The hypertext transfer protocol (HTTP) is a set of rules for communicating over the internet.

For example, your web browser uses HTTP every time you visit a web page. The browser makes a _request_ to the server for the page, and if nothing goes wrong, the server _responds_ with the page. If you have Firefox or Chrome, you can inspect these requests with your browser's web developer tools (Windows: <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>i</kbd>; MacOS: <kbd>&#8984;</kbd> + <kbd>&#8997;</kbd> + <kbd>i</kbd>).

Several [different kinds of HTTP requests](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#Request_methods) are possible. Think of these as the different "verbs" you can use when communicating in HTTP.

Many protocols exist for communicating over the internet. For instance, you may have heard of _file transfer protocol_ (FTP) for transferring files, or _simple mail transfer protocol_ (SMTP) for sending email. However, web APIs almost always use HTTP.

A response to an HTTP request always includes a status code that summarizes whether the request was successful. Wikipedia has a full [list of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes). Generally,

* 200-299: Your request succeeded.
* 300-399: You need to take further action to complete the request.
* 400-499: Your request wasn't valid (you made a mistake). You've probably seen 404 before!
* 500-599: Your request failed (the server made a mistake).
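As a small illustration of the ranges above, here is a sketch that maps a status code to its range. The helper name `status_category` and its labels are made up for this example; `HTTPStatus` comes from Python's standard library.

```python
from http import HTTPStatus

def status_category(code):
    """Return a short label for the range an HTTP status code falls in."""
    if 200 <= code <= 299:
        return "success"
    if 300 <= code <= 399:
        return "further action needed"
    if 400 <= code <= 499:
        return "client error"
    if 500 <= code <= 599:
        return "server error"
    return "informational or unknown"

for code in (200, 301, 404, 500):
    # HTTPStatus(code).phrase is the standard reason phrase, e.g. "Not Found".
    print(code, HTTPStatus(code).phrase, "->", status_category(code))
```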

### Representational State Transfer 

The most popular kind of web API is a _representational state transfer_ (REST) API. The API needs to meet the following architectural requirements to be considered a REST API:

- Client-server: REST applications have a server that manages application data and state. 
- Stateless: Servers don’t maintain client state, clients manage their own application state. The client’s requests to the server contain all the information required to process them.
- Cacheable: servers must mark their responses as cacheable or not. Systems and clients can cache responses when convenient to improve performance. 
- Uniform interface: This is REST’s most well-known rule. Resources are identified by URLs and accessed through the same small set of HTTP request methods.

A URL through which we can talk to the server is called an *endpoint*.

### iTunes API

We use the iTunes API at `https://itunes.apple.com/search`, see [documentation](https://developer.apple.com/library/archive/documentation/AudioVideo/Conceptual/iTuneSearchAPI/Searching.html#//apple_ref/doc/uid/TP40017632-CH5-SW1). 

When you first use a web API, check the documentation to find out what the endpoints are and what kind of HTTP requests to use. If the documentation doesn't mention what kind of HTTP request to use, then GET is usually the right choice.

#### Making Requests

Python's `requests` package provides functions for making HTTP requests. Let's use the endpoint we learned from the iTunes API.

In [None]:
import requests

A basic GET request with the `requests` package has the form `response = requests.get("WEBSITE ADDRESS")`.

#### Query Strings

Most of the functions we use have parameters, and you can pass arguments for those parameters when you call a function.

Endpoints in REST APIs work the same way, but the syntax is different. You can pass arguments by adding `?PARAMETER=ARGUMENT` to the end of the URL. Parameter and argument pairs are separated by `&`. This syntax is called a _query string_.
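To see what a query string looks like, here is a minimal sketch that builds one by hand with the standard library's `urlencode`. The three parameters are chosen only for illustration, and no request is sent.

```python
from urllib.parse import urlencode

endpoint = "https://itunes.apple.com/search"
params = {"term": "beyonce", "media": "music", "limit": 1}

# urlencode joins PARAMETER=ARGUMENT pairs with "&" and escapes special characters.
query_string = urlencode(params)
url = endpoint + "?" + query_string
print(url)
```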

The search endpoint is `https://itunes.apple.com/search`, and the documentation lists several parameters. We can use `requests` to build the query string automatically.

Let's answer the question: how many *Beyoncé* albums are on iTunes?

In [None]:
r = requests.get("https://itunes.apple.com/search", params = {
        "term": "beyonce", # add multiple terms via +
        "media": "music",
        "entity": "album",
        "attribute": "tvEpisodeTerm", # check iTunes docs
        "country": "US", 
        "limit": "1"
    })

In [None]:
type(r)

In [None]:
r.url

In [None]:
r

You can have `requests` check the status for you with the `.raise_for_status()` method.

In [None]:
r.raise_for_status()

In [None]:
response = requests.get("https://itunes.apple.com/search", params = {
        "term": "beyonce", 
        "media": "music",
        "entity": "album",
        "attribute": "artistTerm", # note: "artistsTerm" (with an extra s) is not a valid attribute!
        "country": "US", 
        "limit": "200"
    })

Once you have the response, now what? Where's the data? Different web APIs use different formats. Again, see the documentation. Two common formats are:

 - _JavaScript Object Notation_ (JSON): JSON looks and works a lot like Python lists and dictionaries. Lists are surrounded with `[ ]`, and dictionaries are surrounded with `{ }`. There are many Python libraries for reading JSON into lists and dictionaries. Jupyter notebooks are an example of a file in JSON format.

 - _eXtensible Markup Language_ (XML): XML uses "tags" denoted by `< >` to mark up sections of text. We'll learn more about XML when we learn about web scraping, since XML is very similar to hypertext markup language (HTML), the language used to build web pages.
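To make the JSON description concrete, here is a small sketch using Python's built-in `json` module. The JSON string is made up for this example and is not real API output.

```python
import json

# A made-up JSON string: an object containing a number and an array of objects.
text = '{"resultCount": 2, "results": [{"name": "A", "tracks": 12}, {"name": "B", "tracks": 9}]}'

data = json.loads(text)  # JSON object -> dict, JSON array -> list
print(type(data))
print(data["resultCount"])
print(data["results"][0]["name"])
```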

The iTunes API returns data in JSON format (derived from JavaScript). We can inspect the raw content (bytes) of a response with the `.content` attribute. If we know the response is in a text format, we can use `.text` to see the content as an ordinary Python string.

In [None]:
response.text

Since the response we got is in JSON format, we'd like to convert the string to lists and dictionaries. The `requests` package provides a method `.json()` to do this.

In [None]:
result = response.json()
result

In [None]:
type(result)

In [None]:
result["results"][0]

In [None]:
import pandas as pd
results = pd.DataFrame(result['results'])
results

In [None]:
results.shape

### Caching

Making an HTTP request is not free! It has a real cost in CPU time and also cash. Server administrators will not appreciate it if you make too many requests or make requests too quickly. So:

* Use `time.sleep()` to slow down any requests you make in a loop. Aim for no more than 20-30 requests per second.
* Install and use the `requests_cache` package to avoid downloading extra data when you make the same request twice.

Failing to be polite can get you banned from websites!

We can use `sleep` from `time` to suspend any operation for the passed number of seconds. 

In [None]:
import time 
print(time.ctime())
time.sleep(0.05)
print(time.ctime())

Time-consuming requests are especially wasteful when the same data is requested more than once. A cache avoids this: each request first checks the cache, and only if the data is not found there is it pulled from the server and copied into the cache.
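The check-the-cache-first idea can be sketched with a plain dictionary. This is a toy illustration, not how any particular caching package works internally; `slow_fetch` is a hypothetical stand-in for a slow HTTP request, and the URL is a placeholder.

```python
import time

cache = {}                # maps URL -> response body
stats = {"fetches": 0}    # counts how often we "hit the server"

def slow_fetch(url):
    """Hypothetical stand-in for a slow HTTP request."""
    stats["fetches"] += 1
    time.sleep(0.1)
    return "data for " + url

def cached_fetch(url):
    # Only go to the "server" if the URL is not in the cache yet.
    if url not in cache:
        cache[url] = slow_fetch(url)
    return cache[url]

for i in range(5):
    body = cached_fetch("http://example.com/a")

print(stats["fetches"])   # the slow function ran only once for five calls
```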

We cache our search results with `requests_cache` ([docs](https://requests-cache.readthedocs.io/en/v0.9.6/user_guide.html)). 

In [None]:
import requests
session = requests.Session() 
print(time.ctime())
for i in range(10):
    session.get('http://httpbin.org/delay/1') # this endpoint delays its response by one second
print(time.ctime())

In [None]:
import requests_cache
session = requests_cache.CachedSession('demo_cache')
print(time.ctime())
for i in range(10):
    res = session.get('http://httpbin.org/delay/1')
print(time.ctime())

In [None]:
res.text

### API Keys

Many APIs use a _key_ or _token_ to identify the user. For instance, The Guardian, a British newspaper, provides a [web API](https://open-platform.theguardian.com/) to access their news articles. You need an API key to use their web APIs. You can get one for free [here](https://bonobo.capi.gutools.co.uk/register/developer).

#### Storing API Keys

Your API key is private and your responsibility. Treat it like a password. Keep it secret! 

In order to keep your API key separate from your code:
1. Save the API key in a text file.
2. Use Python to load the API key into a variable.

Python's built-in `open()` function opens a file, and the `.readline()` method reads a line from a file. Often you'll see these used with `with`, which automatically closes the file at the end of the block:

In [None]:
def read_key(keyfile):
    with open(keyfile) as f:
        return f.readline().strip("\n")

In [None]:
key = read_key("../keys/guardian.txt") # Don't print out your actual API key

In [None]:
type(key)

Now you can use the `key` variable anywhere you need the actual API key.
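If you want to try the key-loading pattern without a real key, the sketch below writes a fake key to a temporary file and reads it back. The file name and key value are made up. It also shows why the trailing newline is stripped.

```python
import os
import tempfile

def read_key(keyfile):
    with open(keyfile) as f:
        return f.readline().strip("\n")

# Write a fake key the way a text editor would, with a trailing newline.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fake_key.txt")
    with open(path, "w") as f:
        f.write("not-a-real-key\n")
    fake_key = read_key(path)

# The newline is gone, so the key can be passed to an API unchanged.
print("\n" in fake_key)
```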

#### Querying The Guardian

We've got our key, so let's use The Guardian API. 

We want to find out whether Harris or Trump got more newspaper coverage in the days leading up to the 2024 U.S. presidential election. Let's start by trying to get all of the articles about one of the candidates.

In [None]:
response = requests.get("https://content.guardianapis.com/search", params = {
        "api-key": key,
        "q": "Harris",
        "from-date": "2024-10-20",
        "to-date": "2024-11-05",
        "page-size": 50,
        "order-by": "newest",
        "page": 1
    }) # try page 12

In [None]:
response.raise_for_status()

In [None]:
response.json()

In [None]:
import time
def get_articles(q, page = 1, from_date = "2024-10-20"):
    time.sleep(0.05) 
    response = requests.get("https://content.guardianapis.com/search", params = {
        "api-key": key,
        "q": q,
        "from-date": from_date,
        "to-date": "2024-11-05",
        "page-size": 50,
        "order-by": "newest",
        "page": page
    })
    response.raise_for_status()
    return response.json()["response"]

In [None]:
harris = get_articles("Harris")

In [None]:
harris

In [None]:
pages = harris["pages"]
pages

In [None]:
total = harris["total"]
total

In [None]:
currentPage = harris["currentPage"]
currentPage

In [None]:
results = harris["results"]
for p in range(2, pages + 1):
    results += get_articles("Harris", p)["results"]
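The paging loop can be tried offline with a stand-in for the API. In the sketch below, `fake_get_articles` is a hypothetical function that returns made-up results one page at a time, using the same `pages`, `total`, `currentPage`, and `results` keys as above. Variable names differ from the notebook's so nothing is overwritten.

```python
def fake_get_articles(q, page=1, page_size=2):
    """Hypothetical stand-in for a paged API: returns one page of made-up results."""
    all_items = [f"{q} article {i}" for i in range(1, 6)]  # 5 made-up items
    pages = -(-len(all_items) // page_size)                # ceiling division
    start = (page - 1) * page_size
    return {
        "total": len(all_items),
        "pages": pages,
        "currentPage": page,
        "results": all_items[start:start + page_size],
    }

first_page = fake_get_articles("Harris")
collected = list(first_page["results"])  # copy, so the first page's list is not modified
for p in range(2, first_page["pages"] + 1):
    collected += fake_get_articles("Harris", p)["results"]

# After looping over all pages, we hold as many items as the API reported in total.
print(len(collected), first_page["total"])
```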

In [None]:
results

In [None]:
type(results)

In [None]:
df = pd.DataFrame(results)

In [None]:
df.shape

In [None]:
df.tail()

In [None]:
df["webPublicationDate"] = pd.to_datetime(df["webPublicationDate"])

In [None]:
type(df["webPublicationDate"][0])

In [None]:
df.head()

In [None]:
date = df["webPublicationDate"].dt
date

In [None]:
date.day_name()

In [None]:
dates = pd.DataFrame({"day": date.day, "day_name": date.day_name()})

In [None]:
dates

In [None]:
dates.groupby(["day", "day_name"]).size()
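The counting step can also be sketched with the standard library alone, which may help clarify what the `groupby` is doing. The timestamps below are made up, and the format string assumes timestamps shaped like `2024-11-04T09:30:00Z`.

```python
from collections import Counter
from datetime import datetime

# Made-up publication timestamps.
stamps = ["2024-11-04T09:30:00Z", "2024-11-04T17:05:00Z", "2024-11-05T08:00:00Z"]

parsed = [datetime.strptime(s, "%Y-%m-%dT%H:%M:%SZ") for s in stamps]

# Count articles per (day of month, weekday name), like groupby(...).size().
counts = Counter((d.day, d.strftime("%A")) for d in parsed)
for key in sorted(counts):
    print(key, counts[key])
```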

Now let's wrap these steps in functions.

In [None]:
def get_articles(q, page = 1):
    response = requests.get("https://content.guardianapis.com/search", params = {
        "api-key": key,
        "q": q,
        "from-date": "2024-10-20",
        "to-date": "2024-11-05",
        "page-size": 50,
        "page": page
    })
    response.raise_for_status()
    return response.json()["response"]

In [None]:
def get_all_articles(q, time_sleep = 0.05):
    # Get the first page, and find out how many pages there are.
    candidate = get_articles(q)
    pages = candidate["pages"]

    # Loop over remaining pages.
    results = candidate["results"]
    for p in range(2, pages + 1):
        results += get_articles(q, p)["results"]
        time.sleep(time_sleep)

    # Convert the articles to data frame, and the date column to a date.
    df = pd.DataFrame(results)
    df["webPublicationDate"] = pd.to_datetime(df["webPublicationDate"])
    
    # Get the day and day name, then count them.
    date = df["webPublicationDate"].dt
    dates = pd.DataFrame({"day": date.day, "day_name": date.day_name()})
    return dates.groupby(["day", "day_name"]).size()

In [None]:
harris=get_all_articles("Harris")
harris

In [None]:
harris.head(10)

In [None]:
trump=get_all_articles("Trump")
trump

In [None]:
df = pd.DataFrame([harris,trump]).T
df = df.rename(columns={0: 'Harris', 1: 'Trump'})
df = df.reset_index()
df

In [None]:
df = df.melt(id_vars = ['day', 'day_name'])
df

In [None]:
import plotnine as p9
(
    p9.ggplot(df, p9.aes(x='day', y='value', color='variable')) +
    p9.geom_line() +
    p9.labs(color='', x='Day', y='Number of articles')
)

What are some ways this analysis could be improved?

* Check that articles about "Trump" and "Harris" are actually about the two candidates. Some may be about other things -- the English word "trump", ...
* Check whether the API searches article text or just article titles.
* Use more sources, and use American newspapers (unless the goal was to analyze international news).
* Make visualizations.
* Use a larger time window.
* Use other kinds of data (e.g., poll results) to look for relationships.

Collecting and cleaning data takes a lot of very technical work, but it's only the first step in the analysis. When you finish data collection and cleaning, it can feel like you're finally done. Take a moment to congratulate yourself and step away from the data, so that when you come back you'll be ready to do a careful statistical analysis.

### OAuth

[OAuth](https://en.wikipedia.org/wiki/OAuth) is a way to give an application access to data on a website or web API.

You might run into OAuth if you use a web API where the data is private. For instance, Twitter provides a [web API](https://developer.twitter.com/en/docs.html) for managing your personal Twitter account. If you want to access the API from a Python script, first you have to use OAuth to tell Twitter that the script has permission to use your data.

OAuth can operate in several different ways. As always, check the documentation for the web API you want to use in order to find out what you need to do.

The simplest case of OAuth requires scripts to have a key or token from the web API provider. This is very similar to using an API key.

For more complicated cases, the `requests-oauthlib` package ([docs](https://requests-oauthlib.readthedocs.io/en/latest/)) may help.

### Summary 

- Third parties provide access to their databases via APIs
- Check API documentation to assemble a valid query
- You are a guest, be polite! 