
# APIs and web scraping in Python

Connecting to a remote web API in Python is easy with the `requests` library (https://requests.readthedocs.io/en/latest/).

In [None]:
import requests

To *get* data from an API, we make an HTTP GET request.

All we need is the url.

We will use the Star Wars API at https://swapi.dev.
    
Their "root" API returns all the available API endpoints and their urls, so let's start there.

In [None]:
url = "https://swapi.dev/api"

response = requests.get(url)

In [None]:
type(response)

Every HTTP response comes with a *status code* which tells us whether the request was successful.

- 200 means everything was fine (any code in the 200 range indicates success)
- values in the 300 range mean some sort of redirection happened
- the 400 range means client error: the requester did something wrong, such as typing an incorrect url (404 means "not found", for example)
- the 500 range means server error: the request was fine but the web server encountered a problem while trying to respond

If you ever want to know what a status code means, you can use the website http.cat, e.g. https://http.cat/404
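The status code ranges listed above can also be looked up offline. The following is a minimal sketch using `HTTPStatus` from Python's standard library; the helper name `describe_status` is made up for illustration and is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '404 Not Found (client error)'."""
    status = HTTPStatus(code)  # raises ValueError for codes the stdlib doesn't know
    families = {2: "success", 3: "redirection", 4: "client error", 5: "server error"}
    # The first digit of the code identifies its range (1xx is informational)
    family = families.get(code // 100, "informational")
    return f"{code} {status.phrase} ({family})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
# 200 OK (success)
# 301 Moved Permanently (redirection)
# 404 Not Found (client error)
# 500 Internal Server Error (server error)
```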

In [None]:
response.status_code

There is also a convenience method to check whether the request failed. It does nothing if the request was fine; otherwise it raises an exception.

In [None]:
response.raise_for_status()
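In a script, you would typically wrap the request in a `try`/`except` so a failure doesn't stop everything. Here is one possible sketch; the helper names `fetch_json` and `safe_fetch_json` and the 10-second timeout are illustrative choices, not anything the API requires. The example url is malformed on purpose, so the failure path can be seen without touching the network.

```python
import requests


def fetch_json(url, timeout=10):
    """GET a url and return the parsed JSON, raising if the request failed."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx codes
    return response.json()


def safe_fetch_json(url):
    """Like fetch_json, but report the problem and return None on failure."""
    try:
        return fetch_json(url)
    except requests.RequestException as error:
        # RequestException is the base class for connection errors,
        # timeouts, invalid urls and HTTP error status codes
        print(f"Request failed: {error}")
        return None


# A deliberately malformed url: requests rejects it before sending anything
result = safe_fetch_json("not-a-url")
print(result)  # None
```

Calling `safe_fetch_json("https://swapi.dev/api")` instead should return the same dictionary as `response.json()` in the cells that follow, assuming the API is reachable.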

The response object contains the response body as raw text:

In [None]:
response.text

But if the data is sent back in JSON format, we can convert the response to a Python object:

In [None]:
response_json = response.json()

In [None]:
type(response_json)

In [None]:
response_json

And now we can access the data inside it like any other Python object!

In [None]:
response_json["people"]
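To see what `.json()` is doing, it helps to parse a JSON string by hand. The sketch below uses the standard `json` module on a hand-written string that only imitates the shape of the root response; it is a stand-in, not the real API output.

```python
import json

# A hand-written stand-in for the raw text of a JSON response (not real API output)
raw_text = (
    '{"people": "https://swapi.dev/api/people/", '
    '"planets": "https://swapi.dev/api/planets/"}'
)

# Parsing turns the JSON object into a Python dictionary
parsed = json.loads(raw_text)

print(type(parsed).__name__)  # dict
print(list(parsed.keys()))    # ['people', 'planets']
print(parsed["people"])       # https://swapi.dev/api/people/
```

JSON objects become dictionaries, JSON arrays become lists, and strings and numbers become their Python equivalents.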

Let's actually call one of these APIs to gather some data.

In [None]:
people_url = response_json["people"]

people = requests.get(people_url).json()

In [None]:
people

In [None]:
people["results"]

In [None]:
people["results"][0]

<h1 style="color: #fcd805">Exercise: APIs</h1>

1. Every endpoint in the Star Wars API supports searching. Read the documentation at https://swapi.dev/documentation#search and see if you can search the database to find **Darth Vader's height**.

2. Find the **endpoint** (i.e. the specific url) responsible for returning data about starships.

Use this endpoint to search the database and find the Millennium Falcon.

What is its **cargo capacity**?

3. Every starship record contains links to its pilots. Find the characters who have piloted the Millennium Falcon and print their names.

*Hint: you may need to make further API calls...!*

## Converting API data to `pandas`

Not only can we convert an API response to a Python object, we can also convert it to a `pandas` DataFrame (provided the response contains a list of records).

Let's use the endpoint to give us a collection of people:

In [None]:
people_response = requests.get("https://swapi.dev/api/people")

people_response.raise_for_status()

people = people_response.json()

people

`pandas` interprets a list of dictionaries as a collection of rows.

Keys in the dictionaries become columns and the values become the row values:

In [None]:
import pandas as pd

people_df = pd.DataFrame(people["results"])

people_df.head()
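The same conversion can be tried on a tiny hand-made list, which makes the key-to-column mapping easier to see. The records below are invented placeholders, not API data; note what happens when a key is missing from one of the dictionaries.

```python
import pandas as pd

# Invented placeholder records with the same shape as an API "results" list
records = [
    {"name": "Person A", "height": "172"},
    {"name": "Person B", "height": "167", "mass": "75"},
]

example_df = pd.DataFrame(records)

print(example_df.shape)              # (2, 3)
print(sorted(example_df.columns))    # ['height', 'mass', 'name']
print(example_df["mass"].isna().tolist())  # [True, False]
```

Every key seen in any dictionary becomes a column, and rows that lack a key get a missing value (`NaN`) in that column.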

Let's now enhance the data by downloading details of each person's homeworld.

We can do this by calling the url in the `homeworld` column and saving the returned values to another column.

In [None]:
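def fetch_homeworld_data(url):
    """Download one homeworld record, returning None if the request fails."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
        return response.json()
    except (requests.RequestException, ValueError):
        return None  # network error, bad status code, or a body that isn't JSON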

# Apply the function to the 'homeworld' column and save the result in 'homeworld_data'
people_df['homeworld_data'] = people_df['homeworld'].apply(fetch_homeworld_data)

people_df.head()

Pretty good! But there is a problem: each value in the `homeworld_data` column is a dictionary.

We can "unpack" this in `pandas` into separate columns:

In [None]:
people_homeworlds = pd.json_normalize(people_df["homeworld_data"])
people_homeworlds.head()

In [None]:
# we'll rename columns to start with `homeworld_`
people_homeworlds.columns = ["homeworld_" + c for c in people_homeworlds.columns]

people_homeworlds.head()
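As a smaller illustration of the unpacking and renaming steps, here is a sketch on invented records. The nested `orbit` key is hypothetical and only there to show how `json_normalize` flattens dictionaries inside dictionaries; `add_prefix` is an alternative to the list comprehension used above.

```python
import pandas as pd

# Invented records; the nested "orbit" dictionary is hypothetical
records = [
    {"name": "Planet A", "climate": "arid", "orbit": {"period_days": 300, "moons": 2}},
    {"name": "Planet B", "climate": "temperate", "orbit": {"period_days": 365, "moons": 1}},
]

# Nested keys are joined to their parent key with the separator
flat = pd.json_normalize(records, sep="_")

# add_prefix renames every column in one step
flat = flat.add_prefix("homeworld_")

print(sorted(flat.columns))
# ['homeworld_climate', 'homeworld_name', 'homeworld_orbit_moons', 'homeworld_orbit_period_days']
```

One caveat worth keeping in mind: if any entry in the input is `None` rather than a dictionary (for example because a download failed), `json_normalize` may raise an error, so such entries would need handling first.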

Now all that remains is to put these two datasets together.

This isn't a join: we just want to place the two `DataFrame`s side by side, with no join key. Because both share the same default row index, the rows line up.

We can do this with `pd.concat()`:

In [None]:
# concat takes a LIST of DataFrames
# axis is either 0 (vertical, DataFrames stacked on top of one another)
# or 1 (horizontal, DataFrames placed side by side)
people_df_final = pd.concat([people_df, people_homeworlds], axis=1)
people_df_final.head()
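The effect of the `axis` argument is easiest to see by comparing shapes. This is a small sketch on two invented two-row DataFrames.

```python
import pandas as pd

left = pd.DataFrame({"name": ["Person A", "Person B"]})
right = pd.DataFrame({"climate": ["arid", "temperate"]})

# axis=0 stacks rows: 4 rows, and gaps are filled with NaN
# because the two frames have no columns in common
stacked = pd.concat([left, right], axis=0)

# axis=1 places columns side by side, matching rows on the index
side_by_side = pd.concat([left, right], axis=1)

print(stacked.shape)       # (4, 2)
print(side_by_side.shape)  # (2, 2)
```

Since `axis=1` lines rows up by index label rather than by position, it is worth checking that both DataFrames have the same index before concatenating. If one of them has been filtered or sorted, calling `.reset_index(drop=True)` first is one way to realign them.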

We can also drop the original `homeworld` column

In [None]:
people_df_final = people_df_final.drop(columns=["homeworld"])

people_df_final

We need to do some data cleaning and type conversion, but otherwise we can analyse this data in `pandas`!

In [None]:
people_df_final["homeworld_climate"].value_counts()

In [None]:
import numpy as np

people_df_final["homeworld_orbital_period"] = people_df_final["homeworld_orbital_period"].replace("unknown", np.nan)
people_df_final["homeworld_orbital_period"] = people_df_final["homeworld_orbital_period"].astype(float)

people_df_final["homeworld_orbital_period"].mean()
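An alternative to replacing `"unknown"` by hand is `pd.to_numeric` with `errors="coerce"`, which turns anything it cannot parse into a missing value. The sketch below uses an invented Series rather than the API data.

```python
import pandas as pd

# Invented values standing in for a column of numbers stored as strings
raw = pd.Series(["304", "unknown", "365"])

# Anything that cannot be parsed as a number becomes NaN
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric.isna().sum())  # 1
print(numeric.mean())        # 334.5 (NaN values are skipped by default)
```

The trade-off is that `errors="coerce"` silently hides every unparseable value, not only `"unknown"`, so it is worth counting the missing values afterwards.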

### API keys

Most APIs require authentication of some sort.

Often this just means signing up for an API key, which is a string that's unique to you. Keep it safe, like a password.

Depending on the API, using a key can be as easy as adding it into the url as an extra parameter.

For example, Alpha Vantage (a free API service for stock price data) requires an email signup to generate a key.

The example urls all have the key of `"demo"` which you simply replace with your own key:

https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol=IBM&interval=5min&apikey=demo
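Rather than pasting a key into the url by hand, one common pattern is to read it from an environment variable and let `requests` build the query string. In this sketch the variable name `ALPHAVANTAGE_API_KEY` is a made-up convention, and the request is only prepared, not sent, so the resulting url can be inspected.

```python
import os

import requests

# Hypothetical environment variable name; falls back to the public "demo" key
api_key = os.environ.get("ALPHAVANTAGE_API_KEY", "demo")

params = {
    "function": "TIME_SERIES_INTRADAY",
    "symbol": "IBM",
    "interval": "5min",
    "apikey": api_key,
}

# Build the request without sending it, to see the final url
prepared = requests.Request(
    "GET", "https://www.alphavantage.co/query", params=params
).prepare()

print(prepared.url)
```

To actually send the request you would pass the same dictionary to `requests.get(url, params=params)`. Keeping the key out of the notebook itself makes it harder to share by accident.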

<h1 style="color: #fcd805">Exercise: APIs and `pandas`</h1>

We're going to explore a new API, Gutendex (https://gutendex.com/).

This is an API to access data about the Project Gutenberg catalogue. Project Gutenberg (https://www.gutenberg.org/) is an initiative to digitise works of literature.

The url to retrieve all books is https://gutendex.com/books.

1. Look at the documentation on the website to figure out how to modify the url to get only books on the topic of horror.

Call this url using `requests` to get a response.

2. Convert the response to a Python object. How many books are there in total that are tagged "horror"?

_Hint: look at the response and find the right dictionary key to answer the question._

3. Find the right dictionary key within the returned result to retrieve the books as a list. Convert these to a `pandas` DataFrame.

How many books were returned?

4. Each request only retrieves 32 books, but we want all of them. Write a loop to go through all pages of the horror catalogue. In your loop you should:

- request a new page of books by altering the url each time
- take the results, save them into a Python object, then convert it to a `pandas` DataFrame
- collect all these `pandas` DataFrames into a list

At the end of your loop you should have a list of `pandas` DataFrames.

5. Use the `pd.concat()` function to combine your DataFrames into a single DataFrame.

How many horror books do you have in your data? Does the number match the count from question 2?

6. How many downloads of horror books were there in total?

7. BONUS: Which author has the most books in the horror section?

To answer this:

- the `authors` column is a list of dictionaries. Figure out how to extract the *first* dictionary from each list and save these into a new column
- use this new column to "unpack" the dictionary using `json_normalize`
- use this "JSON normalised" data to calculate the most frequent author

# Web scraping

Web scraping is needed when data is on the web but not accessible with a clean API.

In these instances, we can extract the data from the web page directly.

We can use `requests` to get the raw HTML of a web page, which we can then explore.

We're going to scrape data from a fictional bookstore: http://books.toscrape.com/

In [None]:
bookstore_response = requests.get("http://books.toscrape.com/")

bookstore_response.raise_for_status()

This time the returned content is not JSON but raw HTML, held in a string.

In [None]:
bookstore_response.text

To be able to extract components from this, we will use the `BeautifulSoup` library.

In [None]:
from bs4 import BeautifulSoup

We create a "beautiful soup" object from the raw HTML

In [None]:
soup = BeautifulSoup(bookstore_response.text, "html.parser")

In [None]:
type(soup)

Looking at the object, it still looks like the HTML but we have additional methods available to us to explore it.

In [None]:
soup

What we're interested in is extracting specific HTML **elements**.

For this, we need to learn a bit of syntax: CSS selectors. CSS is a way to style a web page, and selectors are how it picks out which elements to style (more info and tutorials here: https://www.w3schools.com/css/).

The simplest form of a selector is using a tag type. That is, finding elements on a page that are all the same type, such as links.

In HTML, a link is an `<a>` tag, so we can find all links like this:

In [None]:
links = soup.select("a")

links

In [None]:
type(links)

In [None]:
type(links[0])

Each item in the result is a `Tag` object, which represents a single HTML element.

These link tags all contain:

- text, which is what we see displayed on the page
- an "href" which is the url to visit when you click the link

We can extract both using `BeautifulSoup`:

In [None]:
[link.text for link in links]

In [None]:
[link["href"] for link in links]
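The same extraction can be tried on a small hand-written HTML snippet, which also shows one edge case: `link["href"]` raises a `KeyError` for a link with no `href`, whereas `link.get("href")` returns `None`. The snippet below is invented for illustration and is not taken from the bookstore page.

```python
from bs4 import BeautifulSoup

# Invented HTML; the last link deliberately has no href attribute
html = """
<ul>
  <li><a href="/catalogue/page-1.html">First page</a></li>
  <li><a href="/catalogue/page-2.html">Second page</a></li>
  <li><a>No destination</a></li>
</ul>
"""

soup_example = BeautifulSoup(html, "html.parser")
example_links = soup_example.select("a")

# get_text(strip=True) trims surrounding whitespace from the displayed text
texts = [link.get_text(strip=True) for link in example_links]

# .get() returns None instead of raising when the attribute is missing
hrefs = [link.get("href") for link in example_links]

print(texts)  # ['First page', 'Second page', 'No destination']
print(hrefs)  # ['/catalogue/page-1.html', '/catalogue/page-2.html', None]
```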

You might find many elements of the same type, but with a different `class`.

A class is a way to tell CSS which elements should look the same.

For example, all buttons on the webpage have the same classes, including one called `"btn"`.

In CSS, to select all items of the same class, we can use `.` like this:

In [None]:
buttons = soup.select(".btn")
buttons
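Selectors can also be combined. As a sketch on invented HTML (not the bookstore page), writing a tag name directly before the `.` restricts the match to elements of that type with that class.

```python
from bs4 import BeautifulSoup

# Invented HTML: three elements share the "btn" class, one does not
html = """
<div>
  <button class="btn btn-primary">Add to basket</button>
  <button class="btn">Details</button>
  <a class="btn" href="/help">Help</a>
  <p class="note">Not a button</p>
</div>
"""

page = BeautifulSoup(html, "html.parser")

all_btn = page.select(".btn")             # any element with class "btn"
only_buttons = page.select("button.btn")  # only <button> elements with class "btn"

print(len(all_btn), len(only_buttons))  # 3 2

# An element can have several classes; BeautifulSoup returns them as a list
print(all_btn[0]["class"])  # ['btn', 'btn-primary']
```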

<h1 style="color: #fcd805">Exercise: web scraping</h1>

Your turn to scrape some data from the bookshop!

We're going to extract all the prices from the page and calculate the average book price.

1. Inspect the web page. What makes each book price element unique?

_Hint: right-click and click Inspect to view the HTML behind an element on the page._

2. Use `BeautifulSoup` to select all the elements that show a book's price.

3. Extract only the displayed text from these elements into a list.

You should end up with a list of strings.

4. Create a `pandas` `Series` from this list of strings by using `pd.Series`.

5. Using your `pandas` knowledge, clean up these strings so they are just numeric prices, and convert the `Series` to be a numeric type.

6. Now calculate the average price of books on the web page.