In [None]:
from lec_utils import *


<div class="alert alert-info" markdown="1">

#### Lecture 8

# Web Scraping and APIs

### EECS 398: Practical Data Science, Spring 2025

<small><a style="text-decoration: none" href="https://practicaldsc.org">practicaldsc.org</a> • <a style="text-decoration: none" href="https://github.com/practicaldsc/sp25">github.com/practicaldsc/sp25</a> • 📣 See latest announcements [**here on Ed**](https://edstem.org/us/courses/78535/discussion/6647877) </small>
    
</div>


### Agenda 📆

- Introduction to HTTP.
- The structure of HTML.
- Parsing HTML.
- Example: Scraping quotes 💬.
- APIs and JSON.

## Introduction to HTTP

---

### Data sources

- Often, the data you need doesn't exist in "clean" `.csv` files.

- **Solution**: Collect your own data from the internet!<br><small>For most questions you can think of, the answer exists somewhere on the internet. If not, you can run your own survey – also on the internet!</small>

<div class="alert alert-danger">
    
#### Reference Slide

### Manual copy-pasting
    
</div>

- If data is already nicely formatted in a table online, sometimes we can easily copy it and paste it into a `.csv` or `.tsv` file.<br><small>`.tsv` stands for "tab-separated values", just like `.csv` stands for "comma-separated values."</small> 

- For example, open the 2024 Michigan Football schedule [**here**](https://mgoblue.com/sports/football/schedule) and click "Text Only".


<center><img src="imgs/mich-schedule.png" width=700><br><small>This is what you should see.</small></center>

- Copy the text in the table at the bottom and save it in a file named `2024-schedule.tsv` in your `data` folder.<br><small>You may need to do some minor reformatting in the `.tsv` file before this works.<br>**As a challenge**, see if you can find a way to do this entirely within your Terminal, i.e. without opening a text editor!</small>

In [None]:
schedule = ...
schedule.head()

- For Wikipedia specifically, you can use [Wikitable2CSV](https://wikitable2csv.ggor.de/), which converts Wikipedia tables to `.csv` files for you.
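
As a rough sketch of how the saved file might then be loaded: `pd.read_csv` accepts a `sep` argument, so a tab-separated file can be read with `sep='\t'`. The column names and rows below are made-up placeholders standing in for the real schedule, and an in-memory string is used so the example runs without the actual file.

```python
import io
import pandas as pd

# Placeholder text standing in for the contents of data/2024-schedule.tsv.
# These columns and rows are invented for illustration only.
tsv_text = 'Date\tOpponent\tLocation\nDay 1\tTeam A\tHome\nDay 2\tTeam B\tAway\n'

# sep='\t' tells pandas that values are separated by tabs, not commas.
schedule = pd.read_csv(io.StringIO(tsv_text), sep='\t')
print(schedule.head())
```

With the real file in place, the call would presumably look like `pd.read_csv('data/2024-schedule.tsv', sep='\t')`.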

### Programmatically accessing data

- We won't always be able to copy-paste tables from online, and even when we can, it's not easily **reproducible**.<br><small>What if [mgoblue.com](https://mgoblue.com/sports/football/schedule) didn't have a "Text Only" option? Or what if the schedule changes – how can I prevent myself from having to copy-and-paste again?</small>

- To programmatically download data from the internet, we'll need to use the **HTTP protocol**.<br><small>By "programmatically", we mean by writing code.</small>

- We'll cover the essentials of how the internet works for the purposes of accessing data, but for more details, take [EECS 485](https://eecs485.org).

### The request-response model

- HTTP stands for **Hypertext Transfer Protocol**.<br><small>It was developed in 1989 by Tim Berners-Lee (and friends). The "S" in HTTPS stands for "secure".</small>

- HTTP follows the **request-response** model, in which a <b><span style="color:blue">request</span></b> is made by the <b><span style="color:blue">client</span></b> and a <b><span style="color:orange">response</span></b> is returned by the <b><span style="color:orange">server</span></b>.



<center><img src='imgs/req-response.png' width=500></center>

- **Example**: YouTube search 🎥.
    - Consider the following URL: https://www.youtube.com/results?search_query=luka+lakers+trade.
    - Your web browser, a <b><span style="color:blue">client</span></b>, makes an HTTP <b><span style="color:blue">request</span></b> with a search query.
    - The <b><span style="color:orange">server</span></b>, YouTube, is a computer that is sitting somewhere else.
    - The server returns a <b><span style="color:orange">response</span></b> that contains the search results.
    - **Note**: ?search_query=luka+lakers+trade is called a "query string."
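
To make the pieces of that URL concrete, here is a small sketch using Python's built-in `urllib.parse` module (not something the lecture itself relies on) to split the URL apart and decode its query string.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = 'https://www.youtube.com/results?search_query=luka+lakers+trade'
parts = urlsplit(url)

print(parts.netloc)  # The server we're talking to.
print(parts.path)    # The resource on that server.

# parse_qs decodes the query string into a dictionary; '+' stands for a space.
print(parse_qs(parts.query))

# urlencode goes the other way, building a query string from a dictionary.
print(urlencode({'search_query': 'luka lakers trade'}))
```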

<div class="alert alert-danger">
    
#### Reference Slide

### Consequences of the request-response model
    
</div>

- When a request is sent to view content on a webpage, the server must:
    - process your request (i.e. prepare data for the response).
    - send content back to the client in its response.

- Remember, servers are computers.  Someone has to pay to keep these computers running.<br><small>**Every time you access a website, someone has to pay.**</small>

- If you make too many requests, the server may block your IP address, or **you may even take down the website**!<br><small>A journalist scraped and accidentally took down the Cook County Inmate Locator, and as a result, inmates' families weren't able to contact them while the site was down.</small>

### HTTP request methods

- There are several types of request methods; see [Mozilla's web docs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) for a detailed list.

- `GET` is used to request data **from** a specified resource.<br><small>Almost all of the requests we'll make in this class are `GET` requests.<br>To load websites, your web browser uses a lot of `GET` requests!</small>

- `POST` is used to **send** data to the server. <br><small>For example, uploading a photo to Instagram or entering credit card information on Amazon.</small>

- You can make requests directly in your Terminal using the `curl` command, which you'll learn more about in EECS 485. **Here, we'll make requests using the `requests` Python module.**<br><small>There are other packages that work similarly (e.g. `urllib`), but `requests` is arguably the easiest to use.</small>

In [None]:
import requests
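
One way to see the difference between `GET` and `POST` without contacting any server is to build the requests and inspect them before they're sent. The sketch below uses `requests.Request(...).prepare()` for that; the URLs and parameters are the ones used as examples in this lecture.

```python
import requests

# Build (but don't send) a GET request.
# The parameters end up in the URL's query string.
get_req = requests.Request(
    'GET',
    'https://www.youtube.com/results',
    params={'search_query': 'luka lakers trade'},
).prepare()
print(get_req.method, get_req.url)

# Build (but don't send) a POST request.
# The data travels in the body of the request, not in the URL.
post_req = requests.Request(
    'POST',
    'https://httpbin.org/post',
    data={'name': 'Go Blue'},
).prepare()
print(post_req.method, post_req.url)
print(post_req.body)
```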

### Example: `GET` requests via `requests`

- For example, let's try and learn more about the events listed on the Happening @ Michigan page, https://events.umich.edu.

In [None]:
res = ...

- `res` is now a `Response` object.

In [None]:
res

- The `text` attribute of `res` is a string containing the entire response.

In [None]:
type(res.text)

In [None]:
len(res.text)

In [None]:
print(res.text[:2000])

- The response is a string containing **HTML**, the markup language used to format information on the internet. The events data we're looking for is in `res.text` _somewhere_, but we have to search for it and extract it.

<div class="alert alert-danger" markdown="1">

#### Reference Slide

### Example: `POST` requests via `requests`

- The following call to `requests.post` makes a `POST` request to https://httpbin.org/post, with a `'name'` parameter of `'Go Blue'`.

In [None]:
post_res = requests.post('https://httpbin.org/post',
                         data={'name': 'Go Blue'})
post_res

In [None]:
post_res.text

- Now, the response is a string describing a JSON object. We'll learn how to work with these later in the lecture, but for now, note that we can use the `.json()` method to convert it to a Python dictionary.

In [None]:
post_res.json()

- What happens when we try and make a `POST` request somewhere where we're unable to?

In [None]:
yt_res = requests.post('https://youtube.com',
                       data={'name': 'Go Blue'})
yt_res

In [None]:
# This takes the text of yt_res and renders it as an HTML document within our notebook!
HTML(yt_res.text)

### HTTP status codes

- When we **request** data from a website, the server includes an **HTTP status code** in the response.  

* The most common status code is `200`, which means there were no issues.  

In [None]:
res

* Other times, you will see a different status code, describing some sort of event or error.
    - Common examples: `403`: forbidden, `404`: page not found, `500`: internal server error.
    - [The first digit of a status describes its general "category."](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)
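
As a small illustration of the "first digit" idea, the sketch below uses Python's built-in `http.HTTPStatus` to look up the standard phrase for a code, and integer division to find its category. The helper name `describe_status` is made up for this example.

```python
from http import HTTPStatus

# The first digit of a status code determines its general category.
CATEGORIES = {
    1: 'informational',
    2: 'success',
    3: 'redirection',
    4: 'client error',
    5: 'server error',
}

def describe_status(code):
    '''Returns a short description of an HTTP status code.'''
    phrase = HTTPStatus(code).phrase  # e.g. 'Not Found' for 404.
    category = CATEGORIES[code // 100]
    return f'{code} {phrase} ({category})'

for code in [200, 403, 404, 500]:
    print(describe_status(code))
```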

- For example, the CSE faculty page doesn't let us scrape it.<br><small>Nothing is stopping us from opening Chrome, clicking "View Page Source", and manually downloading the HTML, though!</small>

In [None]:
res = requests.get('https://cse.engin.umich.edu/people/faculty/')
res.status_code

- As an aside, you can render HTML directly in a notebook using the `HTML` function.<br><small>We already imported this function by running `from IPython.display import HTML`.</small>

In [None]:
HTML(res.text)

<div class="alert alert-danger" markdown="1">

#### Reference Slide

### Handling unsuccessful requests

- Sometimes, websites either don't want you to scrape, or prohibit you from scraping.<br><small>It's best practice to check the website's `robots.txt` file, where they specify who is and isn't allowed to scrape.<br>As we saw on the previous slide, the CSE website blocks us from scraping it, as we got a 403: Forbidden status code.</small>

- Some unsuccessful requests can be re-tried, depending on the issue.<br><small>A good first step is to wait a little, then try again.</small>

- A common issue is that you're making too many requests to a particular server at a time. If this is the case, you are being **rate-limited**; one solution is to increase the time between each request.<br><small>You can even do this programmatically, say, using `time.sleep`.</small>

- See [LDS 14](https://learningds.org/ch/14/web_http.html) for more examples.
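
A minimal sketch of the "wait a little, then try again" idea is below. The helper `fetch_with_retries` and the `FakeResponse` class are hypothetical names invented for this example; `fetch` can be any function that returns an object with a `status_code`, such as `lambda: requests.get(url)`. A fake response is used here so that the example runs without touching a real server. Note that this sketch retries on any non-200 status, which is a simplification: retrying won't help with a 403, for instance.

```python
import time

def fetch_with_retries(fetch, max_tries=3, wait=1.0):
    '''Calls fetch() until it returns a 200 response, or until max_tries attempts have been made.'''
    for attempt in range(1, max_tries + 1):
        res = fetch()
        if res.status_code == 200:
            return res
        if attempt < max_tries:
            # Wait longer after each failure: wait, 2 * wait, 4 * wait, ...
            time.sleep(wait * 2 ** (attempt - 1))
    # Give up and return the last (unsuccessful) response.
    return res

class FakeResponse:
    '''Stand-in for a requests Response, used only so this example runs offline.'''
    def __init__(self, status_code):
        self.status_code = status_code

# Simulate a server that rate-limits us twice (429: Too Many Requests), then succeeds.
responses = iter([FakeResponse(429), FakeResponse(429), FakeResponse(200)])
res = fetch_with_retries(lambda: next(responses), wait=0.01)
print(res.status_code)
```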

## The structure of HTML

---

### Scraping vs. APIs

- There are two different ways of programmatically accessing data from the internet: either **by scraping**, or **through an API**.

- **Scraping** is the act of emulating a web browser to access a website's HTML source code.<br><small>When scraping, you get back data as HTML and have to **parse** that HTML to extract the information you want. Parse means to "extract meaning from a sequence of symbols".</small>

<center>
    
| ✅ Pros | ❌ Cons |
| --- | --- |
| If the website exists, you can usually scrape it.<br><small>This is what Google does!</small> | Scraping and parsing code gets **messy**, since <br>HTML documents contain lots of content unrelated to the<br>information you're trying to find (advertisements, formatting).<br><br>When the website's structure changes, your code will need to, too.<br><br>The site owner may not _want_ you to scrape it! |
    
    
</center>

- An application programming interface, or **API**, is a service that makes data directly available to the user in a **convenient** fashion. Usually, APIs give us data back as JSON objects.<br><small>APIs are made by organizations that host data. For example, X (formerly known as Twitter) has an [API](https://developer.twitter.com/en/docs/twitter-api), as does [OpenAI](https://platform.openai.com/docs/overview?lang=python), the creators of ChatGPT.</small>


| ✅ Pros | ❌ Cons |
| --- | --- |
| If an API exists, the data are usually clean, up-to-date, and ready to use.<br><br>The presence of an API signals that the data provider<br> is okay with you using their data.<br><br>The data provider can plan and regulate data usage.<br><small>Sometimes, you may need to create an API "key",<br>which is like an account for using the API.<br>APIs can often give you access to data that isn't publicly available.</small> | APIs don't always exist for the data you want! |

- We'll start by learning how to scrape; we'll discuss APIs at the end of the lecture.

### What is HTML?

- HTML (Hypertext Markup Language) is **the** basic building block of the internet. 

- It is a **markup language**, not a programming language.<br><small>Markup languages specify what something should _look like_, while programming languages specify what something should _calculate_ or _do_.</small>

- Specifically, it defines the content and layout of a webpage, and as such, it is what you get back when you scrape a webpage.

- We're only going to learn enough HTML to help us scrape information.<br><small>See [this tutorial](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics) for more details on HTML.</small>

### An example webpage

- For instance, here's the source code of a very basic webpage.

In [None]:
html_string = '''
<html>
    <body>
      <div id="content">
        <h1>Heading here</h1>
        <p>My First paragraph</p>
        <p>My <em>second</em> paragraph</p>
        <hr>
      </div>
      <div id="nav">
        <ul>
          <li>item 1</li>
          <li>item 2</li>
          <li>item 3</li>
        </ul>
      </div>
    </body>
</html>'''

- Here's what that webpage actually looks like:

In [None]:
HTML(html_string)

### The anatomy of HTML documents

- **HTML document**: The totality of markup that makes up a webpage.

<center>

<img src="imgs/webpage_anatomy.png" width=500>

</center>

- **Document Object Model (DOM)**: The internal representation of an HTML document as a hierarchical **tree** structure.

<center><img src="imgs/dom_tree.png" width=500></center>

- **Why do we care about the DOM?** Extracting information out of an HTML document will involve **traversing** this tree.

- **HTML element**: An object in the DOM, such as a paragraph, header, or title.

- **HTML tags**: Markers that denote the **start** and **end** of an element, such as `<p>` and `</p>`.<br><small>See the attached reference slides for examples of common tags.</small>

- **Attributes**: Some tags have **attributes**, which further specify how to display information.

```html
        <p style="color: red">Look at my red text!</p>
```
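
To see the tree structure for yourself, here is a rough sketch that prints each opening tag of a small document, indented by its depth in the tree. It uses Python's built-in `html.parser` module rather than the tools introduced later in the lecture; the class name `TreePrinter` and the short list of "void" tags (tags with no closing tag) are assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags like <hr> have no closing tag, so they can't have children.
VOID_TAGS = {'hr', 'br', 'img', 'meta', 'link', 'input'}

class TreePrinter(HTMLParser):
    '''Records each opening tag, indented according to its depth in the DOM tree.'''
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        attr_text = ''.join(f' {name}="{value}"' for name, value in attrs)
        self.lines.append('  ' * self.depth + f'<{tag}{attr_text}>')
        if tag not in VOID_TAGS:
            self.depth += 1  # Everything until the closing tag is a descendant.

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1

doc = ('<html><body><div id="content"><h1>Heading here</h1>'
       '<p>My <em>second</em> paragraph</p><hr></div></body></html>')

printer = TreePrinter()
printer.feed(doc)
print('\n'.join(printer.lines))
```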

<div class="alert alert-danger" markdown="1">

#### Reference Slide

### Useful tags to know

- Often, the information we're looking for is nestled in one of these tags:

|Element|Description|
|:---|:---|
|`<html>`|the document|
|`<head>`|the header|
|`<body>`|the body|
|`<div>` |a logical division of the document|
|`<span>`|an *inline* logical division|
|`<p>`|a paragraph|
| `<a>`| an anchor (hyperlink)|
|`<h1>, <h2>, ...`| header(s) |
|`<img>`| an image |

- There are many, many more. See [this article](https://en.wikipedia.org/wiki/HTML_element) for examples.

<div class="alert alert-danger" markdown="1">

#### Reference Slide

### Example tags and attributes

- Tags can have **attributes**, which further specify how to display information on a webpage.

- For instance, `<img>` tags have `src` and `alt` attributes, among others:<br>

```html
        <img src="cool-visualization.png" alt="My box plot that I'm super proud of." width=500>
```

- Hyperlinks have `href` attributes: 

```html
        Click <a href="https://study.practicaldsc.org">this link</a> to access past exams.
```

- The `<div>` tag is one of the more common tags. It defines a "section" of an HTML document, and is often used as a container for other HTML elements.<br><small>Think of `<div>`s like cells in Jupyter Notebooks.</small>
    
```html
        <div class="background">
          <h3>This is a heading</h3>
          <p>This is a paragraph.</p>
        </div>
```

- Often, the information we're looking for is stored in an attribute!<br><small>You can imagine a situation where we want to get the URL behind a button, for example.</small>

<div class="alert alert-warning">
    <h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
    
<small>Remember that you can always ask questions anonymously at the link above!</small>
    
What lingering questions do we have about the Document Object Model and the structure of HTML?

## Parsing HTML

---

### Beautiful Soup 🍜

- [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python HTML parser.<br><small>Remember, **parse** means to "extract meaning from a sequence of symbols".</small>

- **Warning**: Beautiful Soup 4 and Beautiful Soup 3 work differently, so make sure you are using and looking at documentation for Beautiful Soup 4.<br><small>Rest assured, the `pds` conda environment already has Beautiful Soup 4 installed.</small>

### Example HTML document

- To start, we'll work with the source code for an HTML page with the DOM tree shown below:

<center><img src="imgs/dom_tree_1.png" width=600></center>

- This is the DOM tree of the HTML document `html_string` we created earlier.

In [None]:
print(html_string)

In [None]:
HTML(html_string)

### Instantiating `BeautifulSoup` objects

- `bs4`'s `BeautifulSoup` function takes in a string or file-like object representing HTML and returns a **parsed** document.

In [None]:
# We also could have used:
# import bs4
# But, then we'd need to use bs4.BeautifulSoup every time.
from bs4 import BeautifulSoup

- Normally, we pass the result of a `GET` request to `BeautifulSoup`, but here we will pass our hand-crafted `html_string`.

In [None]:
soup = ...
soup

In [None]:
type(soup)

- `BeautifulSoup` objects have several useful attributes, e.g. `text`:

In [None]:
...
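
One plausible way to fill in cells like the ones above is sketched here on a shortened document. Passing `'html.parser'` as the second argument selects Python's built-in parser; that choice is an assumption of this sketch, not necessarily what is used in lecture.

```python
from bs4 import BeautifulSoup

small_html = '<html><body><h1>Heading here</h1><p>My <em>second</em> paragraph</p></body></html>'

# The first argument is the HTML to parse; the second names the parser to use.
soup = BeautifulSoup(small_html, 'html.parser')
print(type(soup))

# The text attribute strips out all tags and keeps only the text between them.
print(soup.text)
```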

### Finding elements in a BeautifulSoup object

- The two main methods you will use to extract information from a BeautifulSoup object are `find` and `find_all`.

- `soup.find(tag)` finds the **first** instance of a tag (the first one on the page, i.e. the first one that DFS sees), and returns just that tag.<br><small>It has several optional arguments, including some that involve defining `lambda` functions: **look at the documentation!**</small>

- `soup.find_all(tag)` will find **all** instances of a tag, and returns a **list** of tags.

- Remember: **`find` finds tags!**

### Using `find`

- Let's try and extract the first `<div>` subtree.

<center><img src="imgs/dom_tree_1.png" width=600> ⬇️ <center><img src="imgs/dom_subtree_1.png" width=325></center>  </center> 

In [None]:
...

- Let's try and find the `<div>` element that has an `id` attribute equal to `'nav'`.

In [None]:
...

- `find` will return the first occurrence of a tag, regardless of its depth in the tree.

In [None]:
# The ul child is not at the top of the tree, but we can still find it.
...

### Using `find_all`

- `find_all` returns a list of all matching tags.

In [None]:
soup

In [None]:
...

In [None]:
...

- We often use the `find_all` method in conjunction with a `for`-loop or list comprehension, to perform some operation on every matching tag.

In [None]:
...
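
The sketch below shows how `find` and `find_all` might be used on a document like `html_string`. The document is redefined here so that the example stands on its own, and the `'html.parser'` argument is an assumption of this sketch.

```python
from bs4 import BeautifulSoup

html_string = '''
<html>
    <body>
      <div id="content">
        <h1>Heading here</h1>
        <p>My First paragraph</p>
        <p>My <em>second</em> paragraph</p>
      </div>
      <div id="nav">
        <ul>
          <li>item 1</li>
          <li>item 2</li>
          <li>item 3</li>
        </ul>
      </div>
    </body>
</html>'''
soup = BeautifulSoup(html_string, 'html.parser')

# find returns only the first matching tag.
first_div = soup.find('div')
print(first_div.get('id'))

# Keyword arguments filter on attributes, e.g. the <div> whose id is 'nav'.
nav_div = soup.find('div', id='nav')
print(nav_div.get('id'))

# find_all returns every matching tag, no matter how deeply it's nested.
print(len(soup.find_all('div')))

# A list comprehension applies the same operation to every matching tag.
items = [li.text for li in soup.find_all('li')]
print(items)
```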

### Node attributes

- The `text` attribute of a tag element gets the text between the opening and closing tags.

In [None]:
soup.find('p')

In [None]:
soup.find('p').text

- The `attrs` attribute of a tag element lists all of its attributes.

In [None]:
soup.find('div')

In [None]:
soup.find('div').text

In [None]:
soup.find('div').attrs

- The `get` method of a tag element **gets the value of an attribute**.<br><small>`find` and `get` are easy to get confused, but you'll use them both a lot.</small>

In [None]:
soup.find('div').get('id')

- The `get` method must be called directly on the node that contains the attribute you're looking for.

In [None]:
soup

In [None]:
# While there are multiple 'id' attributes, none of them are in the <html> tag at the top.
soup.get('id')

In [None]:
soup.find('div').get('id')

<div class="alert alert-success" markdown="1">
    <h3>Activity</h3>
    
Consider the following HTML document, which represents a webpage containing the top few songs with the most streams on Spotify today in Canada.

```html
<head>
    <title>3*Canada-2022-06-04</title>
</head>
<body>
    <h1>Spotify Top 3 - Canada</h1>
    <table>
        <tr class='heading'>
            <th>Rank</th>
            <th>Artist(s)</th> 
            <th>Song</th>
        </tr>
        <tr class=1>
            <td>1</td>
            <td>Harry Styles</td> 
            <td>As It Was</td>
        </tr>
        <tr class=2>
            <td>2</td>
            <td>Jack Harlow</td> 
            <td>First Class</td>
        </tr>
        <tr class=3>
            <td>3</td>
            <td>Kendrick Lamar</td> 
            <td>N95</td>
        </tr>
    </table>
</body>
```

- **Part 1**: How many leaf nodes are there in the DOM tree of the previous document — that is, how many nodes have no children?

- **Part 2**: What does the following line of code evaluate to?

```python
        len(soup.find_all("td"))
```

- **Part 3**: What does the following line of code evaluate to?

```python
        soup.find("tr").get("class")
```

## Example: Scraping quotes 💬

---

### Example: Scraping quotes

- Navigate to [quotes.toscrape.com](https://quotes.toscrape.com).

<center><img src="imgs/quotes2scrape.png" width=60%></center>

- **Goal**: Extract quotes, and relevant metadata, into a DataFrame.

- Specifically, let's try to make a DataFrame that looks like the one below:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>quote</th>
      <th>author</th>
      <th>author_url</th>
      <th>tags</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</td>
      <td>Albert Einstein</td>
      <td>https://quotes.toscrape.com/author/Albert-Einstein</td>
      <td>change,deep-thoughts,thinking,world</td>
    </tr>
    <tr>
      <th>1</th>
      <td>“It is our choices, Harry, that show what we truly are, far more than our abilities.”</td>
      <td>J.K. Rowling</td>
      <td>https://quotes.toscrape.com/author/J-K-Rowling</td>
      <td>abilities,choices</td>
    </tr>
    <tr>
      <th>2</th>
      <td>“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</td>
      <td>Albert Einstein</td>
      <td>https://quotes.toscrape.com/author/Albert-Einstein</td>
      <td>inspirational,life,live,miracle,miracles</td>
    </tr>
  </tbody>
</table>

### Organizing our work

- Eventually, we will implement a single function, `make_quote_df`, which takes in an integer `n` and returns a **DataFrame** with the quotes on the **first `n` pages** of [quotes.toscrape.com](https://quotes.toscrape.com).

- Along the way, we'll implement several helper functions, with the goal of separating our logic: **each function should either request information, OR parse, but not both!**

- This makes it easier to debug and catch errors.

- It also avoids **unnecessary requests**.

### Downloading a single page

- First, let's figure out how to download a single page from [quotes.toscrape.com](https://quotes.toscrape.com).

- The URLs seem to be formatted a very particular way:

```html
        https://quotes.toscrape.com/page/2
```

In [None]:
def download_page(i):
    ...

- Let's test our function on a single page, like Page 2.<br><small>There's nothing special about Page 2; we chose it arbitrarily.</small>

In [None]:
soup = ...
soup

- Now that this works, later on, we can call `download_page(1)`, `download_page(2)`, `download_page(3)`, ..., `download_page(n)`.
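
A possible implementation of `download_page` is sketched below. The helper `page_url`, the `timeout` value, the call to `raise_for_status`, and the `'html.parser'` argument are all choices made for this sketch rather than details from the lecture. Nothing is downloaded until `download_page` is actually called.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://quotes.toscrape.com'

def page_url(i):
    '''Returns the URL of page i, following the pattern shown above.'''
    return f'{BASE_URL}/page/{i}'

def download_page(i):
    '''Downloads page i and returns a BeautifulSoup object of the response.'''
    res = requests.get(page_url(i), timeout=10)
    # Raise an error right away if the status code signals a problem (4xx or 5xx).
    res.raise_for_status()
    return BeautifulSoup(res.text, 'html.parser')

print(page_url(2))
```

Keeping the URL logic in its own small function makes it easy to check without making any requests.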

### Parsing a single page

- Now, let's try and extract the relevant information out of the `soup` object for Page 2.

- **Open [quotes.toscrape.com/page/2](https://quotes.toscrape.com/page/2/) in Chrome, right click the page, and click "Inspect"!**<br><small>This will help us find where each quote is located in the HTML.</small>

In [None]:
divs = ...
...

In [None]:
divs[0]

- From this `<div>`, we can extract the quote, author name, author's URL, and tags.<br><small>Strategy: Figure out how to process one `<div>`, then put that logic in a function to use on other `<div>`s.</small>

In [None]:
# The quote.
...

In [None]:
# The author.
...

In [None]:
# The URL for the author.
...

In [None]:
# The quote's tags.
...
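
Here is a sketch of what the extraction steps might look like. To keep it runnable without a network connection, it uses a hand-written snippet that imitates the structure of one quote `<div>`; the quote, author, and tags in it are placeholders, and the real page's markup may differ in its details.

```python
from bs4 import BeautifulSoup

# Hand-written imitation of a single quote <div>; the contents are placeholders.
sample_html = '''
<div class="quote">
    <span class="text">“A placeholder quote.”</span>
    <span>by <small class="author">Some Author</small>
        <a href="/author/Some-Author">(about)</a>
    </span>
    <div class="tags">
        <meta class="keywords" content="example,placeholder">
        <a class="tag" href="/tag/example/">example</a>
    </div>
</div>
'''
soup = BeautifulSoup(sample_html, 'html.parser')

# class is a reserved word in Python, so Beautiful Soup uses class_ instead.
divs = soup.find_all('div', class_='quote')
div = divs[0]

# The quote.
quote = div.find('span', class_='text').text

# The author.
author = div.find('small', class_='author').text

# The URL for the author: the href is relative, so prepend the site's address.
author_url = 'https://quotes.toscrape.com' + div.find('a').get('href')

# The quote's tags, stored in an attribute of a <meta> tag.
tags = div.find('meta', class_='keywords').get('content')

print(quote, author, author_url, tags, sep='\n')
```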

### Parsing a single quote, and then a single page

- Let's implement a function that takes in a `<div>` corresponding to a single quote and returns a dictionary containing the quote's information.<br><small>Why use a dictionary? Passing `pd.DataFrame` a list of dictionaries is an easy way to create a DataFrame.</small>

In [None]:
def process_quote(div):
    quote = div.find('span', class_='text').text
    author = div.find('small', class_='author').text
    author_url = 'https://quotes.toscrape.com' + div.find('a').get('href')
    tags = div.find('meta', class_='keywords').get('content')
    return {'quote': quote, 'author': author, 'author_url': author_url, 'tags': tags}

In [None]:
# Make sure everything here looks correct based on what's on the webpage!
...

- Now, we can implement a function that takes in a **list** of `<div>`s, calls `process_quote` on each `<div>` in the list, and returns a **DataFrame**.

In [None]:
def process_page(divs):
    return pd.DataFrame([process_quote(div) for div in divs])

In [None]:
process_page(divs)

### Putting it all together

- Now, we can implement `make_quote_df`.

In [None]:
def make_quote_df(n):
    '''Returns a DataFrame containing the quotes on the first n pages of https://quotes.toscrape.com/.''' # This is called a docstring!
    dfs = []
    for i in range(1, n+1):
        # Download page i and create a BeautifulSoup object.
        soup = download_page(i)
        # Create DataFrame using the information in that page.
        divs = soup.find_all('div', class_='quote')
        df = process_page(divs)
        # Append DataFrame to dfs.
        dfs.append(df)
    # Stitch all DataFrames together.
    return pd.concat(dfs).reset_index(drop=True)

In [None]:
quotes = make_quote_df(3)
quotes.head()

- Now, `quotes` is a DataFrame, like any other!

In [None]:
quotes['author'].value_counts().iloc[:10].sort_values().plot(kind='barh')

<div class="alert alert-danger" markdown="1">

#### Reference Slide

### Summary of our steps

1. Implement `download_page(i)`, which downloads a **single page** (page `i`) and returns a `BeautifulSoup` object of the response.

2. Implement `process_quote(div)`, which takes in a `<div>` tree corresponding to a **single quote** and returns a dictionary containing all of the relevant information for that quote.

3. Implement `process_page(divs)`, which takes in a list of `<div>` trees corresponding to a **single page** and returns a DataFrame containing all of the relevant information for all quotes on that page.

4. Implement `make_quote_df(n)`.

## APIs and JSON

---

Recall, scraping was one of the ways to access data from the internet. APIs are the other way.

### Application programming interface (API) terminology

- A URL, or uniform resource locator, describes the location of a website or resource.

- API requests are `GET`/`POST` requests to specially maintained URLs.

- As an example, we'll look at the [Pokémon API](https://pokeapi.co).

- All requests are made to:

```
        https://pokeapi.co/api/v2/{endpoint}/{name}
```

- For example, to learn about Pikachu, we use the `pokemon` **endpoint** with name `pikachu`.

        https://pokeapi.co/api/v2/pokemon/pikachu

- Or, to learn about all water Pokémon, we use the `type` endpoint with name `water`.

        https://pokeapi.co/api/v2/type/water
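
Since every request follows the same pattern, building the URL can be wrapped in a small function. The name `pokeapi_url` is made up for this sketch.

```python
def pokeapi_url(endpoint, name):
    '''Returns the Pokémon API URL for the given endpoint and name.'''
    return f'https://pokeapi.co/api/v2/{endpoint}/{name}'

print(pokeapi_url('pokemon', 'pikachu'))
print(pokeapi_url('type', 'water'))
```

The resulting string could then be passed to `requests.get` to actually make the request.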

### Example: Pokémon API ⚡️

- To illustrate, let's make a `GET` request to learn more about Pikachu.

In [None]:
def request_pokemon(name):
    ...
res = request_pokemon('pikachu')
res

- Remember, the 200 status code is good! Let's take a look at the text, the same way we did before:

In [None]:
res.text[:1000]

- Unlike when we were scraping earlier, the text in the response no longer resembles HTML!

<center><img src='imgs/json.png' width=500></center>

### JSON

- JSON stands for **JavaScript Object Notation**. It is a lightweight format for storing and transferring data.

- It is:
    - very easy for computers to read and write.
    - moderately easy for programmers to read and write by hand.
    - meant to be generated and parsed.

- Most modern languages have an interface for working with JSON objects.<br><small>JSON objects **resemble** Python dictionaries, but are not the same!</small>

### JSON data types

| Type | Description |
| --- | --- |
| String | Anything inside double quotes. |
| Number | Any number (no difference between ints and floats). |
| Boolean | `true` and `false`. |
| Null | JSON's empty value, denoted by `null`. |
| Array | Like Python lists. |
| Object | A collection of key-value pairs, like dictionaries. Keys must be strings, values can be anything (even other objects). |

<br>

<center><small>See <a href="https://json-schema.org/understanding-json-schema/reference/type.html">json-schema.org</a> for more details.</small></center>
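
To see how these types map onto Python, the sketch below parses a small JSON string with the built-in `json` module and prints the Python type of each value. The keys and values are made up for illustration.

```python
import json

# A made-up JSON object containing one value of each JSON type.
raw = '''
{
    "name": "example",
    "count": 4,
    "ratio": 6.5,
    "active": true,
    "nickname": null,
    "labels": ["a", "b"],
    "details": {"nested": "object"}
}
'''
parsed = json.loads(raw)

# Each JSON value becomes the closest Python equivalent.
for key, value in parsed.items():
    print(key, type(value).__name__)

# Going the other way, Python's None and True become JSON's null and true.
print(json.dumps({'nickname': None, 'active': True}))
```

Although JSON has a single number type, Python's `json` module produces an `int` or a `float` depending on how the number is written.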

### Example JSON object

<center><img src='imgs/hierarchy.png' width=500> <small>See <code>data/family.json</code>.</small></center>

In [None]:
!cat data/family.json

In [None]:
import json
with open('data/family.json', 'r') as f:
    family_str = f.read()
    family_tree = json.loads(family_str) # loads stands for load string.

In [None]:
family_tree

In [None]:
...

<div class="alert alert-danger" markdown="1">

#### Reference Slide

### Using the `json` module

- `json.load(f)` loads a JSON file from a file object.

- `json.loads(s)` loads JSON from a **s**tring.

In [None]:
with open('data/family.json') as f:
    family_tree = json.load(f)
family_tree

In [None]:
with open('data/family.json') as f:
    family_tree_string = f.read()
    family_tree = json.loads(family_tree_string)
family_tree

<div class="alert alert-danger">
    
#### Reference Slide

### Aside: `pd.read_json`
    
</div>

- `pandas` also has a built-in `read_json` function.

In [None]:
with open('data/family.json', 'r') as f:
    family_df = pd.read_json(f)
family_df

- It only makes sense to use it, though, when you have a JSON file that has some sort of tabular structure. Our family tree example does not.
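
For contrast, here is a sketch of JSON that *does* have tabular structure: an array of objects that all share the same keys, so each object becomes a row. The data is made up, and `io.StringIO` is used so the example doesn't depend on a file.

```python
import io
import pandas as pd

# Made-up JSON with tabular structure: every object has the same keys.
tabular_json = '[{"item": "a", "count": 1}, {"item": "b", "count": 2}]'

# Each object becomes a row, and each key becomes a column.
df = pd.read_json(io.StringIO(tabular_json))
print(df)
```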

### Example: Pokémon API ⚡️

- The response we get back from the Pokémon API looks like JSON.<br>We can extract the JSON from this response with the `json` method of `res`.<br><small>We could also pass `res.text` to `json.loads`.</small>

In [None]:
res = request_pokemon('pikachu')
res.text[:1000]

In [None]:
pikachu = res.json()
pikachu

In [None]:
pikachu.keys()

In [None]:
pikachu['weight']

In [None]:
pikachu['abilities'][1]['ability']['name']

### Invalid `GET` requests

- Let's try a `GET` request for `'wolverine'`.

In [None]:
request_pokemon('wolverine')

- We receive a 404 error, since there is no Pokémon named `'wolverine'`!

### More on APIs

- We accessed the Pokémon API by making requests. But, some APIs exist as Python _wrappers_, which allow you to make requests by calling Python functions.<br><small>`request_pokemon` is essentially a wrapper for (a small part of) the Pokémon API. If you're curious, try out the [DeepSeek API](https://api-docs.deepseek.com/)!</small>

- Some APIs will require you to create an API key, and send that key as part of your request.<br><small>See Homework 5 for an example!</small>

- Many of the APIs you'll use are "REST" APIs. Learn more about RESTful APIs [here](https://en.wikipedia.org/wiki/REST#Architectural_constraints).<br><small>REST stands for "Representational State Transfer." One of the key properties of a RESTful API is that servers don't store any information about previous requests, or who is making them.</small>

### Summary, next time

- Beautiful Soup is an HTML parser that allows us to (somewhat) easily extract information from HTML documents.
    - `soup.find` and `soup.find_all` are the methods you will use most often.

- When writing scraping code:
    - Use "inspect element" to identify the names of tags and attributes that are relevant to the information you want to extract.
    - Separate your logic for making requests and for parsing.