<h1 style="color: #fcd805">Exercise: APIs</h1>

1. Every endpoint in the Star Wars API supports searching. Read the documentation at https://swapi.dev/documentation#search and see if you can search the database to find **Darth Vader's height**.

In [1]:
import requests

url = "https://swapi.dev/api/people/?search=darth"

response = requests.get(url)
response.raise_for_status()

In [2]:
darth = response.json()
darth["results"][0]["height"]

'202'
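
A search for `darth` can match more than one character, so taking `results[0]` relies on the order the API happens to return. Here is a minimal sketch of picking the record by exact name instead. The helper `find_by_name` and the sample payload are illustrative assumptions (only Darth Vader's height of `'202'` comes from the output above); neither is part of SWAPI.

```python
def find_by_name(results, name):
    """Return the first record whose "name" matches exactly, or None."""
    for record in results:
        if record.get("name") == name:
            return record
    return None


# Made-up payload shaped like a search response; the second record's
# values are placeholders, not real API data.
sample = {
    "results": [
        {"name": "Darth Vader", "height": "202"},
        {"name": "Darth Placeholder", "height": "0"},
    ]
}

vader = find_by_name(sample["results"], "Darth Vader")
print(vader["height"])
```

With the real response, the same helper could be called as `find_by_name(darth["results"], "Darth Vader")`.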

2. Find the **endpoint** (i.e. the specific url) responsible for returning data about starships.

Use this endpoint to search the database and find the Millennium Falcon.

What is its **cargo capacity**?

In [3]:
url = "https://swapi.dev/api/starships/?search=millennium"

response = requests.get(url)
response.raise_for_status()

In [4]:
falcon = response.json()
falcon["results"][0]["cargo_capacity"]

'100000'

3. Every starship record contains links to its pilots. Find the characters who have piloted the Millennium Falcon and print their names.

*Hint: you may need to make further API calls...!*

In [5]:
pilots = falcon["results"][0]["pilots"]

pilots

['https://swapi.dev/api/people/13/',
 'https://swapi.dev/api/people/14/',
 'https://swapi.dev/api/people/25/',
 'https://swapi.dev/api/people/31/']

In [6]:
for pilot_url in pilots:
    # each entry is a url, so one extra request per pilot is needed
    person_response = requests.get(pilot_url)
    person_response.raise_for_status()
    person = person_response.json()
    print(person["name"])

Chewbacca
Han Solo
Lando Calrissian
Nien Nunb


<h1 style="color: #fcd805">Exercise: APIs and <code>pandas</code></h1>

We're going to explore a new API, Gutendex (https://gutendex.com/).

This is an API to access data about the Project Gutenberg catalogue. Project Gutenberg (https://www.gutenberg.org/) is an initiative to digitise works of literature.

The url to retrieve all books is https://gutendex.com/books.

1. Look at the documentation on the website to figure out how to modify the url to get only books on the topic of horror.

Call this url using `requests` to get a response.

In [7]:
book_response = requests.get("https://gutendex.com/books?topic=horror")

book_response.raise_for_status()

books_json = book_response.json()

books_json

{'count': 249,
 'next': 'https://gutendex.com/books/?page=2&topic=horror',
 'previous': None,
 'results': [{'id': 84,
   'title': 'Frankenstein; Or, The Modern Prometheus',
   'authors': [{'name': 'Shelley, Mary Wollstonecraft',
     'birth_year': 1797,
     'death_year': 1851}],
   'translators': [],
   'subjects': ["Frankenstein's monster (Fictitious character) -- Fiction",
    'Frankenstein, Victor (Fictitious character) -- Fiction',
    'Gothic fiction',
    'Horror tales',
    'Monsters -- Fiction',
    'Science fiction',
    'Scientists -- Fiction'],
   'bookshelves': ['Browsing: Culture/Civilization/Society',
    'Browsing: Fiction',
    'Browsing: Gender & Sexuality Studies',
    'Browsing: Literature',
    'Browsing: Science-Fiction & Fantasy',
    'Gothic Fiction',
    'Movie Books',
    'Precursors of Science Fiction',
    'Science Fiction by Women'],
   'languages': ['en'],
   'copyright': False,
   'media_type': 'Text',
   'formats': {'text/html': 'https://www.gutenberg.or

2. Convert the response to a Python object. How many books are there in total that are tagged "horror"?

_Hint: look at the response and find the right dictionary key to answer the question._

In [8]:
books_json["count"]

249

3. Find the right dictionary key within the returned result to retrieve the books as a list. Convert these to a `pandas` DataFrame.

How many books were returned?

In [9]:
import pandas as pd

books = books_json["results"]

books_df = pd.DataFrame(books)
print(books_df.shape)
books_df.head()

(32, 11)


| | id | title | authors | translators | subjects | bookshelves | languages | copyright | media_type | formats | download_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 84 | Frankenstein; Or, The Modern Prometheus | [{'name': 'Shelley, Mary Wollstonecraft', 'bir... | [] | [Frankenstein's monster (Fictitious character)... | [Browsing: Culture/Civilization/Society, Brows... | [en] | False | Text | {'text/html': 'https://www.gutenberg.org/ebook... | 78467 |
| 1 | 5200 | Metamorphosis | [{'name': 'Kafka, Franz', 'birth_year': 1883, ... | [{'name': 'Wyllie, David (Translator)', 'birth... | [Metamorphosis -- Fiction, Psychological fiction] | [Browsing: Fiction, Browsing: Literature, Brow... | [en] | True | Text | {'text/html': 'https://www.gutenberg.org/ebook... | 25124 |
| 2 | 345 | Dracula | [{'name': 'Stoker, Bram', 'birth_year': 1847, ... | [] | [Dracula, Count (Fictitious character) -- Fict... | [Browsing: Fiction, Browsing: Literature, Brow... | [en] | False | Text | {'text/html': 'https://www.gutenberg.org/ebook... | 22060 |
| 3 | 43 | The Strange Case of Dr. Jekyll and Mr. Hyde | [{'name': 'Stevenson, Robert Louis', 'birth_ye... | [] | [Horror tales, London (England) -- Fiction, Mu... | [Browsing: Fiction, Browsing: Psychiatry/Psych... | [en] | False | Text | {'text/html': 'https://www.gutenberg.org/ebook... | 15401 |
| 4 | 8492 | The King in Yellow | [{'name': 'Chambers, Robert W. (Robert William... | [] | [Horror tales, American, Short stories, Americ... | [Browsing: Culture/Civilization/Society, Brows... | [en] | False | Text | {'text/html': 'https://www.gutenberg.org/ebook... | 9822 |
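
With 249 matching books and 32 returned per request, the number of requests needed to fetch everything follows from a ceiling division. A quick check using the figures from the outputs above (the variable names are only illustrative):

```python
import math

total_books = 249  # books_json["count"] from question 2
page_size = 32     # rows returned by a single request

pages_needed = math.ceil(total_books / page_size)
books_on_last_page = total_books - page_size * (pages_needed - 1)

print(pages_needed, books_on_last_page)
```

So a complete download should take 8 requests, with the final page only partly full.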


4. Each request only retrieves 32 books, but we want all of them. Write a loop to go through all pages of the horror catalogue. In your loop you should:

- request a new page of books by altering the url each time
- take the results, save them into a Python object, then convert it to a `pandas` DataFrame
- collect all these `pandas` DataFrames into a list

At the end of your loop you should have a list of `pandas` DataFrames.

In [10]:
import time

# we know we have 249 books and 32 per page
# so we could explicitly loop a certain number of times
# or we could see that the JSON provides a "next" url
# which is a typical pattern to allow pagination
# so we could also keep going until that's None (i.e. blank)

book_dataframes = []

keep_going = True
page_url = "https://gutendex.com/books?topic=horror"

while keep_going:
    print(f"Attempting {page_url}...")
    books_page = requests.get(page_url)
    books_page.raise_for_status()
    
    books_json = books_page.json()
    
    # extract the book DataFrame
    books_df = pd.DataFrame(books_json["results"])
    book_dataframes.append(books_df)
    
    # and extract the next url unless we're done
    if books_json["next"]:
        page_url = books_json["next"]
    else:
        keep_going = False
    
    # a courtesy :-)
    time.sleep(0.5)

print("Done!")

Attempting https://gutendex.com/books?topic=horror...
Attempting https://gutendex.com/books/?page=2&topic=horror...
Attempting https://gutendex.com/books/?page=3&topic=horror...
Attempting https://gutendex.com/books/?page=4&topic=horror...
Attempting https://gutendex.com/books/?page=5&topic=horror...
Attempting https://gutendex.com/books/?page=6&topic=horror...
Attempting https://gutendex.com/books/?page=7&topic=horror...
Attempting https://gutendex.com/books/?page=8&topic=horror...
Done!
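
The same pattern of following `next` until it is `None` can also be written as a generator, which keeps the paging logic separate from whatever is done with each page. This is a sketch under assumptions: `iter_pages`, `fake_fetch` and `fake_pages` are hypothetical names, and the fetch function is passed in so the example runs without a network connection. With the real API it could be a small function that calls `requests.get`, checks `raise_for_status()` and returns `.json()`.

```python
def iter_pages(first_url, fetch_json):
    """Yield each page's "results" list, following "next" links."""
    url = first_url
    while url:
        page = fetch_json(url)
        yield page["results"]
        url = page["next"]  # None on the last page, which ends the loop


# Stand-in for the API: two tiny made-up pages keyed by url.
fake_pages = {
    "page-1": {"results": [{"id": 1}, {"id": 2}], "next": "page-2"},
    "page-2": {"results": [{"id": 3}], "next": None},
}


def fake_fetch(url):
    return fake_pages[url]


all_books = [book for results in iter_pages("page-1", fake_fetch) for book in results]
print(len(all_books))
```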


5. Use the `pd.concat()` function to combine your DataFrames into a single DataFrame.

How many horror books do you have in your data? Does the number match the count from question 2?

In [11]:
books_all = pd.concat(book_dataframes, ignore_index=True)
print(books_all.shape)
books_all.head()

(249, 11)


| | id | title | authors | translators | subjects | bookshelves | languages | copyright | media_type | formats | download_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 84 | Frankenstein; Or, The Modern Prometheus | [{'name': 'Shelley, Mary Wollstonecraft', 'bir... | [] | [Frankenstein's monster (Fictitious character)... | [Browsing: Culture/Civilization/Society, Brows... | [en] | False | Text | {'text/html': 'https://www.gutenberg.org/ebook... | 78467 |
| 1 | 5200 | Metamorphosis | [{'name': 'Kafka, Franz', 'birth_year': 1883, ... | [{'name': 'Wyllie, David (Translator)', 'birth... | [Metamorphosis -- Fiction, Psychological fiction] | [Browsing: Fiction, Browsing: Literature, Brow... | [en] | True | Text | {'text/html': 'https://www.gutenberg.org/ebook... | 25124 |
| 2 | 345 | Dracula | [{'name': 'Stoker, Bram', 'birth_year': 1847, ... | [] | [Dracula, Count (Fictitious character) -- Fict... | [Browsing: Fiction, Browsing: Literature, Brow... | [en] | False | Text | {'text/html': 'https://www.gutenberg.org/ebook... | 22060 |
| 3 | 43 | The Strange Case of Dr. Jekyll and Mr. Hyde | [{'name': 'Stevenson, Robert Louis', 'birth_ye... | [] | [Horror tales, London (England) -- Fiction, Mu... | [Browsing: Fiction, Browsing: Psychiatry/Psych... | [en] | False | Text | {'text/html': 'https://www.gutenberg.org/ebook... | 15401 |
| 4 | 8492 | The King in Yellow | [{'name': 'Chambers, Robert W. (Robert William... | [] | [Horror tales, American, Short stories, Americ... | [Browsing: Culture/Civilization/Society, Brows... | [en] | False | Text | {'text/html': 'https://www.gutenberg.org/ebook... | 9822 |
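
`ignore_index=True` matters here because each page's DataFrame is indexed from 0, so without it the combined index would repeat. A small sketch with made-up two-page data (the ids are borrowed from the rows above purely as sample values):

```python
import pandas as pd

# Two made-up "pages", each indexed from 0 like the per-page DataFrames.
page_1 = pd.DataFrame({"id": [84, 5200]})
page_2 = pd.DataFrame({"id": [345]})

kept = pd.concat([page_1, page_2])
renumbered = pd.concat([page_1, page_2], ignore_index=True)

print(list(kept.index))        # index labels repeat across pages
print(list(renumbered.index))  # one continuous index
```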


6. How many downloads of horror books were there in total?

_Technically, `media_type` could also be Sound (perhaps for audiobooks), but we'll still count those as book downloads._

In [12]:
books_all["download_count"].sum()

277005

7. BONUS: Which author has the most books in the horror section?

To answer this:

- the `authors` column is a list of dictionaries. Figure out how to extract the *first* dictionary from each list and save these into a new column
- use this new column to "unpack" the dictionary using `json_normalize`
- use this "JSON normalised" data to calculate the most frequent author

In [13]:
import numpy as np

# write a custom function
def get_author(authors):
    if len(authors) >= 1:
        return authors[0]
    
    # if the authors list is blank, return NaN
    return np.nan

# use the function to extract a single author
# from the list of authors (which may be empty)
books_all["author"] = books_all["authors"].apply(get_author)

books_all.head()

| | id | title | authors | translators | subjects | bookshelves | languages | copyright | media_type | formats | download_count | author |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 84 | Frankenstein; Or, The Modern Prometheus | [{'name': 'Shelley, Mary Wollstonecraft', 'bir... | [] | [Frankenstein's monster (Fictitious character)... | [Browsing: Culture/Civilization/Society, Brows... | [en] | False | Text | {'text/html': 'https://www.gutenberg.org/ebook... | 78467 | {'name': 'Shelley, Mary Wollstonecraft', 'birt... |
| 1 | 5200 | Metamorphosis | [{'name': 'Kafka, Franz', 'birth_year': 1883, ... | [{'name': 'Wyllie, David (Translator)', 'birth... | [Metamorphosis -- Fiction, Psychological fiction] | [Browsing: Fiction, Browsing: Literature, Brow... | [en] | True | Text | {'text/html': 'https://www.gutenberg.org/ebook... | 25124 | {'name': 'Kafka, Franz', 'birth_year': 1883, '... |
| 2 | 345 | Dracula | [{'name': 'Stoker, Bram', 'birth_year': 1847, ... | [] | [Dracula, Count (Fictitious character) -- Fict... | [Browsing: Fiction, Browsing: Literature, Brow... | [en] | False | Text | {'text/html': 'https://www.gutenberg.org/ebook... | 22060 | {'name': 'Stoker, Bram', 'birth_year': 1847, '... |
| 3 | 43 | The Strange Case of Dr. Jekyll and Mr. Hyde | [{'name': 'Stevenson, Robert Louis', 'birth_ye... | [] | [Horror tales, London (England) -- Fiction, Mu... | [Browsing: Fiction, Browsing: Psychiatry/Psych... | [en] | False | Text | {'text/html': 'https://www.gutenberg.org/ebook... | 15401 | {'name': 'Stevenson, Robert Louis', 'birth_yea... |
| 4 | 8492 | The King in Yellow | [{'name': 'Chambers, Robert W. (Robert William... | [] | [Horror tales, American, Short stories, Americ... | [Browsing: Culture/Civilization/Society, Brows... | [en] | False | Text | {'text/html': 'https://www.gutenberg.org/ebook... | 9822 | {'name': 'Chambers, Robert W. (Robert William)... |


In [14]:
pd.json_normalize(books_all["author"])["name"].value_counts().head(10)

name
Poe, Edgar Allan                      26
Lovecraft, H. P. (Howard Phillips)    17
Howard, Robert E. (Robert Ervin)      11
Stoker, Bram                          10
Blackwood, Algernon                   10
Doyle, Arthur Conan                    9
Radcliffe, Ann Ward                    8
Shelley, Mary Wollstonecraft           8
Stevenson, Robert Louis                8
Bierce, Ambrose                        7
Name: count, dtype: int64
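
Taking only the first author leaves out anyone listed second or later. One possible alternative, sketched here on made-up rows, is `explode`, which gives every author in a list its own row before counting. The sample data and variable names are illustrative only.

```python
import pandas as pd

# Made-up rows in the same shape as the "authors" column.
sample = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "authors": [
        [{"name": "Author One"}],
        [{"name": "Author One"}, {"name": "Author Two"}],
        [],  # no recorded author
    ],
})

# One row per (book, author); empty lists become NaN and are dropped.
exploded = sample["authors"].explode().dropna()
author_counts = pd.json_normalize(exploded.tolist())["name"].value_counts()

print(author_counts)
```

Whether this changes the ranking above depends on how many horror books actually list more than one author.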

<h1 style="color: #fcd805">Exercise: web scraping</h1>

Your turn to scrape some data from the bookshop!

We're going to extract all the prices from the page and calculate the average book price.

1. Inspect the web page. What makes each book price element unique?

_Hint: right-click and click Inspect to view the HTML behind an element on the page._

_Every price is inside a `<p>` tag with the class `price_color`._

2. Use `BeautifulSoup` to select all the elements that show a book's price.

In [15]:
from bs4 import BeautifulSoup

bookstore_response = requests.get("http://books.toscrape.com/")

bookstore_response.raise_for_status()

soup = BeautifulSoup(bookstore_response.text, "html.parser")

price_tags = soup.select("p.price_color")
price_tags

[<p class="price_color">Â£51.77</p>,
 <p class="price_color">Â£53.74</p>,
 <p class="price_color">Â£50.10</p>,
 <p class="price_color">Â£47.82</p>,
 <p class="price_color">Â£54.23</p>,
 <p class="price_color">Â£22.65</p>,
 <p class="price_color">Â£33.34</p>,
 <p class="price_color">Â£17.93</p>,
 <p class="price_color">Â£22.60</p>,
 <p class="price_color">Â£52.15</p>,
 <p class="price_color">Â£13.99</p>,
 <p class="price_color">Â£20.66</p>,
 <p class="price_color">Â£17.46</p>,
 <p class="price_color">Â£52.29</p>,
 <p class="price_color">Â£35.02</p>,
 <p class="price_color">Â£57.25</p>,
 <p class="price_color">Â£23.88</p>,
 <p class="price_color">Â£37.59</p>,
 <p class="price_color">Â£51.33</p>,
 <p class="price_color">Â£45.17</p>]
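
The stray `Â` before each `£` is an encoding artefact rather than page content: it is what appears when the UTF-8 bytes for `£` are decoded as Latin-1. The sketch below reproduces the effect with plain strings. The likely cause here (an assumption, not something the output confirms) is that the server does not declare a charset, so `requests` falls back to ISO-8859-1 when building `.text`.

```python
price = "£51.77"

utf8_bytes = price.encode("utf-8")      # "£" becomes two bytes: 0xC2 0xA3
misread = utf8_bytes.decode("latin-1")  # each byte read as its own character
correct = utf8_bytes.decode("utf-8")

print(misread)
print(correct)
```

Passing `bookstore_response.content` (the raw bytes) to `BeautifulSoup`, or setting `bookstore_response.encoding = "utf-8"` before reading `.text`, should avoid the problem, in which case each price would start with a single `£`.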

3. Extract only the displayed text from these elements into a list.

You should end up with a list of strings.

In [16]:
prices = [tag.text for tag in price_tags]
prices

['Â£51.77',
 'Â£53.74',
 'Â£50.10',
 'Â£47.82',
 'Â£54.23',
 'Â£22.65',
 'Â£33.34',
 'Â£17.93',
 'Â£22.60',
 'Â£52.15',
 'Â£13.99',
 'Â£20.66',
 'Â£17.46',
 'Â£52.29',
 'Â£35.02',
 'Â£57.25',
 'Â£23.88',
 'Â£37.59',
 'Â£51.33',
 'Â£45.17']

4. Create a `pandas` `Series` from this list of strings by using `pd.Series`.

In [17]:
price_series = pd.Series(prices)

5. Using your `pandas` knowledge, clean up these strings so they are just numeric prices, and convert the `Series` to be a numeric type.

In [18]:
# each string starts with the two characters "Â£", so drop them before converting
price_series = price_series.str[2:].astype(float)
price_series

0     51.77
1     53.74
2     50.10
3     47.82
4     54.23
5     22.65
6     33.34
7     17.93
8     22.60
9     52.15
10    13.99
11    20.66
12    17.46
13    52.29
14    35.02
15    57.25
16    23.88
17    37.59
18    51.33
19    45.17
dtype: float64
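
Slicing off the first two characters works for the strings above, but it depends on the prefix being exactly two characters long. A hedged alternative is to strip everything that is not a digit or a decimal point, which behaves the same whether the prefix is `Â£` or `£`. The sample values are adapted from the list above.

```python
import pandas as pd

raw = pd.Series(["Â£51.77", "£53.74", "Â£50.10"])

# Remove every character except digits and the decimal point,
# then convert the remaining text to floats.
cleaned = raw.str.replace(r"[^\d.]", "", regex=True).astype(float)

print(cleaned.tolist())
```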

6. Now calculate the average price of books on the web page.

In [19]:
# the mean answers the question; the median is shown for comparison
print(f"mean: {price_series.mean():.2f}, median: {price_series.median():.2f}")

mean: 38.05, median: 41.38
