In [1]:
from lec_utils import *


<div class="alert alert-info" markdown="1">

#### Lecture 9


# APIs I

    
</div>

### Agenda

- Introduction to HTTP
- JSON Format
- APIs

## Introduction to HTTP

---

### Data sources

- Often, the data you need doesn't exist in "clean" `.csv` files.

- **Solution**: Collect your own data from the internet!<br><small>For most questions you can think of, the answer exists somewhere on the internet. If not, you can run your own survey, also on the internet!</small>

<div class="alert alert-danger">
    
#### Reference Slide

### Manual copy-pasting
    
</div>

- If data is already nicely formatted in a table online, sometimes we can easily copy it and paste it into a `.csv` or `.tsv` file.<br><small>`.tsv` stands for "tab-separated values", just like `.csv` stands for "comma-separated values."</small> 

- For example, open the 2025 Dartmouth Football schedule [**here**](https://dartmouthsports.com/sports/football/schedule) and click "Text Only".


<center><img src="imgs/dart-schedule.jpg" width=700><br><small>This is what you should see.</small></center>

- Copy the text in the table at the bottom and save it in a file named `2025-schedule.tsv` in your `data` folder.<br><small>You may need to do some minor reformatting in the `.tsv` file before this works.<br>**As a challenge**, see if you can find a way to do this entirely within your Terminal, i.e. without opening a text editor!</small>

- For Wikipedia specifically, you can use [Wikitable2CSV](https://wikitable2csv.ggor.de/), which converts Wikipedia tables to `.csv` files for you.

### Programmatically accessing data

- We won't always be able to copy-paste tables from online, and even when we can, it's not easily **reproducible**.<br><small>What if [dartmouthsports.com](https://dartmouthsports.com/sports/football/schedule) didn't have a "Text Only" option? Or what if the schedule changes? How can I avoid having to copy-and-paste again?</small>

- To programmatically download data from the internet, we'll need to use the **HTTP protocol**.<br><small>By "programmatically", we mean by writing code.</small>

### The request-response model

- HTTP stands for **Hypertext Transfer Protocol**.<br><small>It was developed in 1989 by Tim Berners-Lee (and friends). The "S" in HTTPS stands for "secure".</small>

- HTTP follows the **request-response** model, in which a <b><span style="color:blue">request</span></b> is made by the <b><span style="color:blue">client</span></b> and a <b><span style="color:orange">response</span></b> is returned by the <b><span style="color:orange">server</span></b>.



<center><img src='imgs/req-response.png' width=500></center>

- **Example**: YouTube search 🎥.
    - Consider the following URL: https://www.youtube.com/results?search_query=chopin+competition+2025
    - Your web browser, a <b><span style="color:blue">client</span></b>, makes an HTTP <b><span style="color:blue">request</span></b> with a search query.
    - The <b><span style="color:orange">server</span></b>, YouTube, is a computer that is sitting somewhere else.
    - The server returns a <b><span style="color:orange">response</span></b> that contains the search results.
    - **Note**: `?search_query=chopin+competition+2025` is called a "query string."
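
To see where a query string comes from, here is a minimal sketch that builds the URL above from a dictionary of parameters using Python's standard library, then takes it apart again. The variable names (`base_url`, `params`) are our own choices for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = 'https://www.youtube.com/results'
params = {'search_query': 'chopin competition 2025'}

# urlencode turns the dictionary into a query string, replacing spaces with '+'.
query_string = urlencode(params)
url = f'{base_url}?{query_string}'
print(url)

# Going the other way: split the URL and recover the parameters.
# parse_qs returns a list of values for each key, since keys can repeat.
parsed = parse_qs(urlsplit(url).query)
print(parsed)
```

When using `requests`, passing the same dictionary as `requests.get(base_url, params=params)` should build an equivalent URL for you.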

<div class="alert alert-danger">
    
### Consequences of the request-response model
    
</div>

- When a request is sent to view content on a webpage, the server must:
    - process your request (i.e. prepare data for the response).
    - send content back to the client in its response.

- Remember, servers are computers.  Someone has to pay to keep these computers running.<br><small>**Every time you access a website, someone has to pay.**</small>

- If you make too many requests, the server may block your IP address, or **you may even take down the website**!<br><small>A journalist scraped and accidentally took down the Cook County Inmate Locator, and as a result, inmates' families weren't able to contact them while the site was down.</small>

### HTTP request methods

- There are several types of request methods; see [Mozilla's web docs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) for a detailed list.

- `GET` is used to request data **from** a specified resource.<br><small>Almost all of the requests we'll make in this class are `GET` requests.<br>To load websites, your web browser uses a lot of `GET` requests!</small>

- `POST` is used to **send** data to the server. <br><small>For example, uploading a photo to Instagram or entering credit card information on Amazon.</small>

- You can make requests directly in your Terminal using the `curl` command. **Here, we'll make requests using the `requests` Python module.**<br><small>There are other packages that work similarly (e.g. `urllib`), but `requests` is arguably the easiest to use.</small>

In [2]:
import requests

### Example: `GET` requests via `requests`

- For example, let's try and learn more about the events listed on the Happening @ Dartmouth home page, https://home.dartmouth.edu/events.

In [28]:
res = requests.get('https://home.dartmouth.edu/events') 

- `res` is now a `Response` object.

In [29]:
res

<Response [200]>

- The `text` attribute of `res` is a string containing the entire response.

In [30]:
type(res.text)

str

In [31]:
len(res.text)

132818

In [32]:
print(res.text[:2000])

<!DOCTYPE html><html lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/ dc: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ og: http://ogp.me/ns# rdfs: http://www.w3.org/2000/01/rdf-schema# schema: http://schema.org/ sioc: http://rdfs.org/sioc/ns# sioct: http://rdfs.org/sioc/types# skos: http://www.w3.org/2004/02/skos/core# xsd: http://www.w3.org/2001/XMLSchema# "><head><meta charset="utf-8" /><noscript><style>form.antibot * :not(.antibot-message) { display: none !important; }</style></noscript><link rel="canonical" href="https://home.dartmouth.edu/events" /><meta property="og:title" content="Events | Dartmouth" /><meta property="og:image" content="https://home.dartmouth.edu/modules/custom/dart_metatag/images/dpine_16x9.webp" /><meta name="twitter:card" content="summary_large_image" /><meta name="twitter:title" content="Events" /><meta name="twitter:image" content="https://home.dartmouth.edu/modules/custom/dart_metatag/images/dpine_16x9.webp" /><me

- The response is a string containing **HTML**, the markup language used to format information on the internet. The events data we're looking for is in `res.text` _somewhere_, but we have to search for it and extract it.

<div class="alert alert-danger" markdown="1">

### Example: `POST` requests via `requests`

- What happens when we try and make a `POST` request somewhere where we're unable to?

In [33]:
yt_res = requests.post('https://youtube.com',
                       data={'name': 'Hello'})
yt_res

<Response [400]>

In [34]:
# This takes the text of yt_res and renders it as an HTML document within our notebook!
HTML(yt_res.text)

### HTTP status codes

- When we **request** data from a website, the server includes an **HTTP status code** in the response.  

* The most common status code is `200`, which means there were no issues.  

In [10]:
res

<Response [200]>

* Other times, you will see a different status code, describing some sort of event or error.
    - Common examples: `403`: forbidden, `404`: page not found, `500`: internal server error.
    - [The first digit of a status describes its general "category."](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)

- For example, [The Economist](http://www.economist.com/) doesn't let us scrape it.<br><small>Nothing is stopping us from opening Chrome, clicking "View Page Source", and manually downloading the HTML, though!</small>

In [35]:
res = requests.get('http://www.economist.com/')
res.status_code

403
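
Since the first digit of a status code gives its category, one way to make sense of an unfamiliar code is to look at `code // 100`. The sketch below uses the standard library's `http.HTTPStatus` to look up the official phrase for a code; `describe_status` is a helper name we made up for illustration.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description of an HTTP status code."""
    categories = {
        1: 'informational',
        2: 'success',
        3: 'redirection',
        4: 'client error',
        5: 'server error',
    }
    # The first digit of the code determines its category.
    category = categories.get(code // 100, 'unknown')
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code isn't one that the standard library knows about.
        phrase = 'Unrecognized'
    return f'{code} {phrase} ({category})'

for code in [200, 403, 404, 500]:
    print(describe_status(code))
```

In practice, a quick check is often enough: the `ok` attribute of a `Response` object is `True` when the status code is below 400.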

- As an aside, you can render HTML directly in a notebook using the `HTML` function.<br><small>We already imported this function by running `from IPython.display import HTML`.</small>

In [36]:
HTML(res.text)

<div class="alert alert-danger" markdown="1">


### Handling unsuccessful requests

- Sometimes, websites either don't want you to scrape, or prohibit you from scraping.<br><small>It's best practice to check the website's `robots.txt` file, where they specify who is and isn't allowed to scrape.</small>

- Some unsuccessful requests can be re-tried, depending on the issue.<br><small>A good first step is to wait a little, then try again.</small>

- A common issue is that you're making too many requests to a particular server at a time. If this is the case, you are being **rate-limited**; one solution is to increase the time between each request.<br><small>You can even do this programmatically, say, using `time.sleep`.</small>
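
Here is one possible sketch of "wait a little, then try again." The helper `get_with_retries` and its parameters are hypothetical names, not part of `requests`. It takes the fetching function as an argument so that the example can run offline against a fake server; in real use, you might pass `requests.get`. The doubling of the wait time after each failure is one common choice, not the only one.

```python
import time

def get_with_retries(fetch, url, max_tries=3, wait=1.0, sleep=time.sleep):
    """Call fetch(url) until the status code is 200, pausing between tries.

    fetch is any function that takes a URL and returns an object with a
    status_code attribute, e.g. requests.get.
    """
    for attempt in range(1, max_tries + 1):
        res = fetch(url)
        if res.status_code == 200:
            return res
        if attempt < max_tries:
            # Wait longer after each failure: wait, 2 * wait, 4 * wait, ...
            sleep(wait * 2 ** (attempt - 1))
    return res  # Still unsuccessful; the caller should check res.status_code.

class FakeResponse:
    """A stand-in for a Response object, for offline demonstration only."""
    def __init__(self, status_code):
        self.status_code = status_code

def make_flaky_fetch(codes):
    """Return a fake fetch function that yields the given status codes in order."""
    remaining = list(codes)
    def fetch(url):
        return FakeResponse(remaining.pop(0))
    return fetch

# Record the waits instead of actually sleeping, so the demo runs instantly.
waits = []
res = get_with_retries(make_flaky_fetch([429, 429, 200]), 'https://example.com',
                       sleep=waits.append)
print(res.status_code, waits)
```

Status code `429` means "Too Many Requests", which is what a rate-limiting server typically sends. With the real `time.sleep`, the call above would pause for 1 second, then 2 seconds, before succeeding on the third try.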

## The structure of HTML

---

### Scraping vs. APIs

- There are two different ways of programmatically accessing data from the internet: either **by scraping**, or **through an API**.

- **Scraping** is the act of emulating a web browser to access a website's HTML source code.<br><small>When scraping, you get back data as HTML and have to **parse** that HTML to extract the information you want. To parse means to "extract meaning from a sequence of symbols".</small>

<center>
    
| ✅ Pros | ❌ Cons |
| --- | --- |
| If the website exists, you can usually scrape it.<br><small>This is what Google does!</small> | Scraping and parsing code gets **messy**, since <br>HTML documents contain lots of content unrelated to the<br>information you're trying to find (advertisements, formatting).<br><br>When the website's structure changes, your code will need to, too.<br><br>The site owner may not _want_ you to scrape it! |
    
    
</center>

- An application programming interface, or **API**, is a service that makes data directly available to the user in a **convenient** fashion. Usually, APIs give us data back as JSON objects.<br><small>APIs are made by organizations that host data. For example, X (formerly known as Twitter) has an [API](https://developer.twitter.com/en/docs/twitter-api), as does [OpenAI](https://platform.openai.com/docs/overview?lang=python), the creators of ChatGPT.</small>


| ✅ Pros | ❌ Cons |
| --- | --- |
| If an API exists, the data are usually clean, up-to-date, and ready to use.<br><br>The presence of an API signals that the data provider<br> is okay with you using their data.<br><br>The data provider can plan and regulate data usage.<br><small>Sometimes, you may need to create an API "key",<br>which is like an account for using the API.<br>APIs can often give you access to data that isn't publicly available.</small> | APIs don't always exist for the data you want! |

- We'll start by learning how to use APIs; we'll discuss scraping in the next lecture.

## APIs and JSON

---

Recall, scraping was one of the ways to access data from the internet. APIs are the other way.

### Application programming interface (API) terminology

- A URL, or uniform resource locator, describes the location of a website or resource.

- API requests are `GET`/`POST` requests to specially maintained URLs.

- As an example, we'll look at the [Pokémon API](https://pokeapi.co).

- All requests are made to:

```
        https://pokeapi.co/api/v2/{endpoint}/{name}
```

- For example, to learn about Pikachu, we use the `pokemon` **endpoint** with name `pikachu`.

        https://pokeapi.co/api/v2/pokemon/pikachu

- Or, to learn about all water Pokémon, we use the `type` endpoint with name `water`.

        https://pokeapi.co/api/v2/type/water
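
The URL pattern above can be captured in a small helper. This is only a sketch: `BASE_URL` and `pokeapi_url` are names we chose for illustration, and the function just fills in the `{endpoint}` and `{name}` slots without checking that they exist.

```python
BASE_URL = 'https://pokeapi.co/api/v2'

def pokeapi_url(endpoint, name):
    """Build the URL for a resource, following the {endpoint}/{name} pattern."""
    return f'{BASE_URL}/{endpoint}/{name}'

print(pokeapi_url('pokemon', 'pikachu'))
print(pokeapi_url('type', 'water'))
```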

### Example: Pokémon API ⚡️

- To illustrate, let's make a `GET` request to learn more about Pikachu.

In [37]:
def request_pokemon(name):
    url = f'https://pokeapi.co/api/v2/pokemon/{name}'
    return requests.get(url)
res = request_pokemon('pikachu')
res

<Response [200]>

- Remember, the 200 status code is good! Let's take a look at the text, the same way we did before:

In [38]:
res.text[:1000]

'{"abilities":[{"ability":{"name":"static","url":"https://pokeapi.co/api/v2/ability/9/"},"is_hidden":false,"slot":1},{"ability":{"name":"lightning-rod","url":"https://pokeapi.co/api/v2/ability/31/"},"is_hidden":true,"slot":3}],"base_experience":112,"cries":{"latest":"https://raw.githubusercontent.com/PokeAPI/cries/main/cries/pokemon/latest/25.ogg","legacy":"https://raw.githubusercontent.com/PokeAPI/cries/main/cries/pokemon/legacy/25.ogg"},"forms":[{"name":"pikachu","url":"https://pokeapi.co/api/v2/pokemon-form/25/"}],"game_indices":[{"game_index":84,"version":{"name":"red","url":"https://pokeapi.co/api/v2/version/1/"}},{"game_index":84,"version":{"name":"blue","url":"https://pokeapi.co/api/v2/version/2/"}},{"game_index":84,"version":{"name":"yellow","url":"https://pokeapi.co/api/v2/version/3/"}},{"game_index":25,"version":{"name":"gold","url":"https://pokeapi.co/api/v2/version/4/"}},{"game_index":25,"version":{"name":"silver","url":"https://pokeapi.co/api/v2/version/5/"}},{"game_index"

- Unlike when we were scraping earlier, the text in the response no longer resembles HTML!

### JSON

- JSON stands for **JavaScript Object Notation**. It is a lightweight format for storing and transferring data.

- It is:
    - very easy for computers to read and write.
    - moderately easy for programmers to read and write by hand.
    - meant to be generated and parsed.

- Most modern languages have an interface for working with JSON objects.<br><small>JSON objects **resemble** Python dictionaries, but are not the same!</small>

### JSON data types

| Type | Description |
| --- | --- |
| String | Anything inside double quotes. |
| Number | Any number (no difference between ints and floats). |
| Boolean | `true` and `false`. |
| Null | JSON's empty value, denoted by `null`. |
| Array | Like Python lists. |
| Object | A collection of key-value pairs, like dictionaries. Keys must be strings, values can be anything (even other objects). |

<br>

<center><small>See <a href="https://json-schema.org/understanding-json-schema/reference/type.html">json-schema.org</a> for more details.</small></center>
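
To make the table concrete, here is a short sketch showing which Python type each JSON type becomes when parsed with the built-in `json` module. The JSON text is made up purely for illustration. It also shows two ways in which JSON objects differ from Python dictionaries: keys are always strings, and `true`, `false`, and `null` are spelled differently.

```python
import json

json_text = '''
{
    "string": "hello",
    "int": 3,
    "float": 2.5,
    "boolean": true,
    "null": null,
    "array": [1, 2],
    "object": {"key": "value"}
}
'''

parsed = json.loads(json_text)
for key, value in parsed.items():
    # Print the Python type that each JSON value was converted to.
    print(f'{key:>8} -> {type(value).__name__}')

# Converting a Python dictionary to JSON text: the integer key 1 becomes
# the string "1", True becomes true, and None becomes null.
as_json = json.dumps({1: 'a', 'flag': True, 'nothing': None})
print(as_json)
```

Note that although JSON has a single Number type, Python's parser returns an `int` for `3` and a `float` for `2.5`.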

### Example JSON object

<center><img src='imgs/hierarchy.png' width=500> <small>See <code>data/family.json</code>.</small></center>

In [39]:
!cat family.json

{
    "name": "Grandma",
    "age": 94,
    "children": [
        {
        "name": "Dad",
        "age": 60,
        "children": [{"name": "Me", "age": 24}, 
                     {"name": "Brother", "age": 22}]
        },
        {
        "name": "My Aunt",
        "children": [{"name": "Cousin 1", "age": 34}, 
                     {"name": "Cousin 2", "age": 36, "children": 
                        [{"name": "Cousin 2 Jr.", "age": 2}]
                     }
                    ]
        }
    ]
}

In [40]:
import json
with open('family.json', 'r') as f:
    family_str = f.read()
    family_tree = json.loads(family_str) # loads stands for load string.

In [47]:
family_tree['children'][1]['children'][1]['children'][0]['name']

'Cousin 2 Jr.'

In [44]:
json.loads(family_str)

{'name': 'Grandma',
 'age': 94,
 'children': [{'name': 'Dad',
   'age': 60,
   'children': [{'name': 'Me', 'age': 24}, {'name': 'Brother', 'age': 22}]},
  {'name': 'My Aunt',
   'children': [{'name': 'Cousin 1', 'age': 34},
    {'name': 'Cousin 2',
     'age': 36,
     'children': [{'name': 'Cousin 2 Jr.', 'age': 2}]}]}]}

In [45]:
family_tree

{'name': 'Grandma',
 'age': 94,
 'children': [{'name': 'Dad',
   'age': 60,
   'children': [{'name': 'Me', 'age': 24}, {'name': 'Brother', 'age': 22}]},
  {'name': 'My Aunt',
   'children': [{'name': 'Cousin 1', 'age': 34},
    {'name': 'Cousin 2',
     'age': 36,
     'children': [{'name': 'Cousin 2 Jr.', 'age': 2}]}]}]}

In [18]:
family_tree['children'][1]['children'][0]['age'] 

34
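
Chaining keys and indices works when we know exactly where a value lives. Because each person in the family tree has the same shape as the whole tree, a recursive function is one natural way to visit everyone. Below is a minimal sketch; `all_names` is a hypothetical helper, and the tree is redefined inline as `tree` (with the same contents as `family.json`) so that the example runs on its own.

```python
def all_names(person):
    """Return the name of a person, followed by the names of all their descendants."""
    names = [person['name']]
    # Not everyone has a 'children' key, so default to an empty list.
    for child in person.get('children', []):
        names.extend(all_names(child))
    return names

tree = {
    'name': 'Grandma',
    'age': 94,
    'children': [
        {'name': 'Dad',
         'age': 60,
         'children': [{'name': 'Me', 'age': 24},
                      {'name': 'Brother', 'age': 22}]},
        {'name': 'My Aunt',
         'children': [{'name': 'Cousin 1', 'age': 34},
                      {'name': 'Cousin 2', 'age': 36,
                       'children': [{'name': 'Cousin 2 Jr.', 'age': 2}]}]},
    ],
}

print(all_names(tree))
```

Using `person.get('children', [])` rather than `person['children']` matters here, since people without children have no `'children'` key at all.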

<div class="alert alert-danger" markdown="1">


### Using the `json` module

- `json.load(f)` parses JSON from a file object `f`.

- `json.loads(s)` parses JSON from a **s**tring `s`.

In [19]:
with open('family.json') as f:
    family_tree = json.load(f)
family_tree

{'name': 'Grandma',
 'age': 94,
 'children': [{'name': 'Dad',
   'age': 60,
   'children': [{'name': 'Me', 'age': 24}, {'name': 'Brother', 'age': 22}]},
  {'name': 'My Aunt',
   'children': [{'name': 'Cousin 1', 'age': 34},
    {'name': 'Cousin 2',
     'age': 36,
     'children': [{'name': 'Cousin 2 Jr.', 'age': 2}]}]}]}

In [20]:
with open('family.json') as f:
    family_tree_string = f.read()
    family_tree = json.loads(family_tree_string)
family_tree

{'name': 'Grandma',
 'age': 94,
 'children': [{'name': 'Dad',
   'age': 60,
   'children': [{'name': 'Me', 'age': 24}, {'name': 'Brother', 'age': 22}]},
  {'name': 'My Aunt',
   'children': [{'name': 'Cousin 1', 'age': 34},
    {'name': 'Cousin 2',
     'age': 36,
     'children': [{'name': 'Cousin 2 Jr.', 'age': 2}]}]}]}

<div class="alert alert-danger">
    
### Aside: `pd.read_json`
    
</div>

- `pandas` also has a built-in `read_json` function.

In [21]:
with open('family.json', 'r') as f:
    family_df = pd.read_json(f)
family_df

|  | name | age | children |
| --- | --- | --- | --- |
| 0 | Grandma | 94 | {'name': 'Dad', 'age': 60, 'children': [{'name... |
| 1 | Grandma | 94 | {'name': 'My Aunt', 'children': [{'name': 'Cou... |


- It only makes sense to use it, though, when you have a JSON file that has some sort of tabular structure. Our family tree example does not.
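
One way to picture "tabular structure" is a JSON array of flat objects that all share the same keys, so that each object is a row and each key is a column. The sketch below checks for that shape; `looks_tabular` is a hypothetical helper written for illustration, and this is only one of several layouts that could reasonably be called tabular.

```python
import json

records_text = '''
[
    {"name": "Me", "age": 24},
    {"name": "Brother", "age": 22}
]
'''

def looks_tabular(data):
    """True if data is a non-empty list of flat dicts that all share the same keys."""
    if not isinstance(data, list) or not data:
        return False
    if not all(isinstance(row, dict) for row in data):
        return False
    same_keys = all(row.keys() == data[0].keys() for row in data)
    # "Flat" means no value is itself a list or dictionary.
    flat = all(not isinstance(value, (list, dict))
               for row in data for value in row.values())
    return same_keys and flat

records = json.loads(records_text)
print(looks_tabular(records))

# A nested object, like the family tree, does not have this shape.
nested = {'name': 'Dad', 'children': [{'name': 'Me', 'age': 24}]}
print(looks_tabular(nested))
```

A list of records like this can be passed directly to `pd.DataFrame(records)` to get one row per object.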

### Example: Pokémon API ⚡️

- The response we get back from the Pokémon API looks like JSON.<br>We can extract the JSON from this response with the `json` method of `res`.<br><small>We could also pass `res.text` to `json.loads`.</small>

In [48]:
res = request_pokemon('pikachu')
res.text[:1000]

'{"abilities":[{"ability":{"name":"static","url":"https://pokeapi.co/api/v2/ability/9/"},"is_hidden":false,"slot":1},{"ability":{"name":"lightning-rod","url":"https://pokeapi.co/api/v2/ability/31/"},"is_hidden":true,"slot":3}],"base_experience":112,"cries":{"latest":"https://raw.githubusercontent.com/PokeAPI/cries/main/cries/pokemon/latest/25.ogg","legacy":"https://raw.githubusercontent.com/PokeAPI/cries/main/cries/pokemon/legacy/25.ogg"},"forms":[{"name":"pikachu","url":"https://pokeapi.co/api/v2/pokemon-form/25/"}],"game_indices":[{"game_index":84,"version":{"name":"red","url":"https://pokeapi.co/api/v2/version/1/"}},{"game_index":84,"version":{"name":"blue","url":"https://pokeapi.co/api/v2/version/2/"}},{"game_index":84,"version":{"name":"yellow","url":"https://pokeapi.co/api/v2/version/3/"}},{"game_index":25,"version":{"name":"gold","url":"https://pokeapi.co/api/v2/version/4/"}},{"game_index":25,"version":{"name":"silver","url":"https://pokeapi.co/api/v2/version/5/"}},{"game_index"

In [49]:
res

<Response [200]>

In [23]:
pikachu = res.json()
pikachu

{'abilities': [{'ability': {'name': 'static',
    'url': 'https://pokeapi.co/api/v2/ability/9/'},
   'is_hidden': False,
   'slot': 1},
  {'ability': {'name': 'lightning-rod',
    'url': 'https://pokeapi.co/api/v2/ability/31/'},
   'is_hidden': True,
   'slot': 3}],
 'base_experience': 112,
 'cries': {'latest': 'https://raw.githubusercontent.com/PokeAPI/cries/main/cries/pokemon/latest/25.ogg',
  'legacy': 'https://raw.githubusercontent.com/PokeAPI/cries/main/cries/pokemon/legacy/25.ogg'},
 'forms': [{'name': 'pikachu',
   'url': 'https://pokeapi.co/api/v2/pokemon-form/25/'}],
 'game_indices': [{'game_index': 84,
   'version': {'name': 'red', 'url': 'https://pokeapi.co/api/v2/version/1/'}},
  {'game_index': 84,
   'version': {'name': 'blue', 'url': 'https://pokeapi.co/api/v2/version/2/'}},
  {'game_index': 84,
   'version': {'name': 'yellow',
    'url': 'https://pokeapi.co/api/v2/version/3/'}},
  {'game_index': 25,
   'version': {'name': 'gold', 'url': 'https://pokeapi.co/api/v2/version

In [50]:
pikachu.keys()

dict_keys(['abilities', 'base_experience', 'cries', 'forms', 'game_indices', 'height', 'held_items', 'id', 'is_default', 'location_area_encounters', 'moves', 'name', 'order', 'past_abilities', 'past_stats', 'past_types', 'species', 'sprites', 'stats', 'types', 'weight'])

In [51]:
pikachu['weight']

60

In [52]:
pikachu['abilities'][1]['ability']['name']

'lightning-rod'

### Invalid `GET` requests

- Let's try a `GET` request for `'wolverine'`.

In [53]:
request_pokemon('wolverine')

<Response [404]>

- We receive a 404 error, since there is no Pokémon named `'wolverine'`!
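
Since the body of an unsuccessful response is not guaranteed to be valid JSON, it is safer to check the status code before calling the `json` method. Here is a minimal sketch; `pokemon_weight` is a hypothetical helper, and `FakeResponse` is a stand-in we define so that the example runs without an internet connection.

```python
def pokemon_weight(res):
    """Return the 'weight' field from a response, or None if the request failed."""
    if res.status_code != 200:
        return None
    return res.json()['weight']

class FakeResponse:
    """Mimics the two pieces of a Response used above (for offline demonstration)."""
    def __init__(self, status_code, payload=None):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload

print(pokemon_weight(FakeResponse(200, {'name': 'pikachu', 'weight': 60})))
print(pokemon_weight(FakeResponse(404)))
```

With real requests, the same function could be called as `pokemon_weight(request_pokemon('pikachu'))`.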

### More on APIs

- We accessed the Pokémon API by making requests. But, some APIs exist as Python _wrappers_, which allow you to make requests by calling Python functions.<br><small>`request_pokemon` is essentially a wrapper for (a small part of) the Pokémon API. If you're curious, try out the [DeepSeek API](https://api-docs.deepseek.com/)!</small>

- Some APIs will require you to create an API key, and send that key as part of your request.<br><small>See Activity 2 today!</small>

- Many of the APIs you'll use are "REST" APIs. Learn more about RESTful APIs [here](https://en.wikipedia.org/wiki/REST#Architectural_constraints).<br><small>REST stands for "Representational State Transfer." One of the key properties of a RESTful API is that servers don't store any information about previous requests, or who is making them.</small>