---

# APIs and web scraping in Python

Connecting to a remote web API in Python is easy with the `requests` library (https://requests.readthedocs.io/en/latest/).

In [None]:
import requests

To *get* data from an API, we make an HTTP GET request.

All we need is the url.

We will use the Star Wars API at https://swapi.dev.
    
The "root" endpoint returns all the available API endpoints and their urls, so let's start there.

In [None]:
url = "https://swapi.dev/api"

response = requests.get(url)

In [None]:
type(response)

Every HTTP response comes with a *status code* which tells us whether the request was successful.

- 200 means everything was fine
- values in the 300 range mean some sort of redirection happened
- the 400 range means client error - the requester did something wrong like type an incorrect url (404 means "not found" for example)
- the 500 range means server error - the request was fine but the web server encountered a problem in trying to respond

If you ever want to know what a status code means, you can use the website http.cat, e.g. https://http.cat/404
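
The ranges above can also be looked up in code. The sketch below uses `HTTPStatus` from Python's standard library; `describe_status` is a hypothetical helper name chosen for illustration, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'."""
    categories = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    category = categories.get(code // 100, "other")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Not a status code known to the standard library
        phrase = "Unknown"
    return f"{code} {phrase} ({category})"


print(describe_status(200))
print(describe_status(404))
```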

In [None]:
response.status_code

200

There is also a convenience method to check whether the request failed. It does nothing if the request was fine; otherwise it raises an exception.

In [None]:
response.raise_for_status()
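
In a script you would usually catch that exception rather than let it stop the program. A minimal sketch, assuming you want `None` back whenever a request fails; `fetch_json` is a hypothetical helper name and the 10 second timeout is an arbitrary choice.

```python
import requests


def fetch_json(url, timeout=10):
    """Return the parsed JSON body, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raises requests.HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.RequestException as error:
        # RequestException is the base class for errors raised by requests
        print(f"Request failed: {error}")
        return None
    return response.json()
```

For example, `fetch_json("https://swapi.dev/api")` should return the same dictionary of endpoints, while a malformed url returns `None` instead of raising.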

The response object contains the response body as raw text:

In [None]:
response.text

'{"people":"https://swapi.dev/api/people/","planets":"https://swapi.dev/api/planets/","films":"https://swapi.dev/api/films/","species":"https://swapi.dev/api/species/","vehicles":"https://swapi.dev/api/vehicles/","starships":"https://swapi.dev/api/starships/"}'

But if the data is sent back in JSON format, we can convert the response to a Python object:

In [None]:
response_json = response.json()

In [None]:
type(response_json)

dict
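
Conceptually, `.json()` parses the response text in much the same way as the standard library's `json.loads`. The sketch below shows this on a shortened copy of the text above, so it runs without a network connection.

```python
import json

# A shortened copy of the raw text returned by the root endpoint
raw_text = '{"people":"https://swapi.dev/api/people/","planets":"https://swapi.dev/api/planets/"}'

parsed = json.loads(raw_text)

print(type(parsed).__name__)
print(parsed["people"])
```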

In [None]:
response_json

{'people': 'https://swapi.dev/api/people/',
 'planets': 'https://swapi.dev/api/planets/',
 'films': 'https://swapi.dev/api/films/',
 'species': 'https://swapi.dev/api/species/',
 'vehicles': 'https://swapi.dev/api/vehicles/',
 'starships': 'https://swapi.dev/api/starships/'}

And now we can access the data inside it like any other Python object!

In [None]:
response_json["people"]

'https://swapi.dev/api/people/'

Let's actually call one of these APIs to gather some data.

In [None]:
people_url = response_json["people"]

people = requests.get(people_url).json()

In [None]:
people

{'count': 82,
 'next': 'https://swapi.dev/api/people/?page=2',
 'previous': None,
 'results': [{'name': 'Luke Skywalker',
   'height': '172',
   'mass': '77',
   'hair_color': 'blond',
   'skin_color': 'fair',
   'eye_color': 'blue',
   'birth_year': '19BBY',
   'gender': 'male',
   'homeworld': 'https://swapi.dev/api/planets/1/',
   'films': ['https://swapi.dev/api/films/1/',
    'https://swapi.dev/api/films/2/',
    'https://swapi.dev/api/films/3/',
    'https://swapi.dev/api/films/6/'],
   'species': [],
   'vehicles': ['https://swapi.dev/api/vehicles/14/',
    'https://swapi.dev/api/vehicles/30/'],
   'starships': ['https://swapi.dev/api/starships/12/',
    'https://swapi.dev/api/starships/22/'],
   'created': '2014-12-09T13:50:51.644000Z',
   'edited': '2014-12-20T21:17:56.891000Z',
   'url': 'https://swapi.dev/api/people/1/'},
  {'name': 'C-3PO',
   'height': '167',
   'mass': '75',
   'hair_color': 'n/a',
   'skin_color': 'gold',
   'eye_color': 'yellow',
   'birth_year': '112BB

In [None]:
results = people["results"]

In [None]:
import pandas as pd

# Build a DataFrame with one row per person
df = pd.DataFrame(results)

# Save DataFrame to CSV
df.to_csv('output.csv', index=False)


In [None]:
demo = people["results"][0]

In [None]:
demo["name"]

'Luke Skywalker'

<h1 style="color: #fcd805">Exercise: APIs</h1>

1. Every endpoint in the Star Wars API supports searching. Read the documentation at https://swapi.dev/documentation#search and see if you can search the database to find **Darth Vader's height**.

In [None]:
url = "https://swapi.dev/api/people/?search=darth"

response = requests.get(url)
response.raise_for_status()

darth = response.json()
darth["results"][0]["height"]

'202'
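
Instead of typing the query string by hand, `requests` can build it from a dictionary passed as `params`. The sketch below prepares a request without sending it, purely to show the url that would be used.

```python
import requests

# Build (but do not send) a request so we can inspect the final url
request = requests.Request(
    "GET",
    "https://swapi.dev/api/people/",
    params={"search": "darth"},
)
prepared = request.prepare()

print(prepared.url)
```

Calling `requests.get("https://swapi.dev/api/people/", params={"search": "darth"})` sends a request to that same url, and `requests` takes care of escaping spaces and special characters in the values.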

2. Find the **endpoint** (i.e. the specific url) responsible for returning data about starships.

Use this endpoint to search the database and find the Millennium Falcon.

What is its **cargo capacity**?

In [None]:
url = "https://swapi.dev/api/starships?search=millennium"

response = requests.get(url)
response.raise_for_status()

falcon = response.json()
falcon["results"][0]["cargo_capacity"]

'100000'

3. Every starship record contains links to its pilots. Find the characters who have piloted the Millennium Falcon and print their names.

*Hint: you may need to make further API calls...!*

In [None]:
pilots = falcon["results"][0]["pilots"]

pilots

['https://swapi.dev/api/people/13/',
 'https://swapi.dev/api/people/14/',
 'https://swapi.dev/api/people/25/',
 'https://swapi.dev/api/people/31/']

In [None]:
for pilot in pilots:
    person_request = requests.get(pilot)
    person_request.raise_for_status()
    person = person_request.json()
    print(person["name"])

Chewbacca
Han Solo
Lando Calrissian
Nien Nunb


## Converting API data to `pandas`

Not only can we convert an API response to a Python object, we can convert it to a `pandas` DataFrame (if we have a list of values).

Let's use the endpoint to give us a collection of people:

In [None]:
people_response = requests.get("https://swapi.dev/api/people")

people_response.raise_for_status()

people = people_response.json()

people

{'count': 82,
 'next': 'https://swapi.dev/api/people/?page=2',
 'previous': None,
 'results': [{'name': 'Luke Skywalker',
   'height': '172',
   'mass': '77',
   'hair_color': 'blond',
   'skin_color': 'fair',
   'eye_color': 'blue',
   'birth_year': '19BBY',
   'gender': 'male',
   'homeworld': 'https://swapi.dev/api/planets/1/',
   'films': ['https://swapi.dev/api/films/1/',
    'https://swapi.dev/api/films/2/',
    'https://swapi.dev/api/films/3/',
    'https://swapi.dev/api/films/6/'],
   'species': [],
   'vehicles': ['https://swapi.dev/api/vehicles/14/',
    'https://swapi.dev/api/vehicles/30/'],
   'starships': ['https://swapi.dev/api/starships/12/',
    'https://swapi.dev/api/starships/22/'],
   'created': '2014-12-09T13:50:51.644000Z',
   'edited': '2014-12-20T21:17:56.891000Z',
   'url': 'https://swapi.dev/api/people/1/'},
  {'name': 'C-3PO',
   'height': '167',
   'mass': '75',
   'hair_color': 'n/a',
   'skin_color': 'gold',
   'eye_color': 'yellow',
   'birth_year': '112BB

`pandas` interprets a list of dictionaries as a collection of rows.

Keys in the dictionaries become columns and the values become the row values:

In [None]:
import pandas as pd

people_df = pd.DataFrame(people["results"])

people_df.head()

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,films,species,vehicles,starships,created,edited,url
0,Luke Skywalker,172,77,blond,fair,blue,19BBY,male,https://swapi.dev/api/planets/1/,"[https://swapi.dev/api/films/1/, https://swapi...",[],"[https://swapi.dev/api/vehicles/14/, https://s...","[https://swapi.dev/api/starships/12/, https://...",2014-12-09T13:50:51.644000Z,2014-12-20T21:17:56.891000Z,https://swapi.dev/api/people/1/
1,C-3PO,167,75,,gold,yellow,112BBY,,https://swapi.dev/api/planets/1/,"[https://swapi.dev/api/films/1/, https://swapi...",[https://swapi.dev/api/species/2/],[],[],2014-12-10T15:10:51.357000Z,2014-12-20T21:17:50.309000Z,https://swapi.dev/api/people/2/
2,R2-D2,96,32,,"white, blue",red,33BBY,,https://swapi.dev/api/planets/8/,"[https://swapi.dev/api/films/1/, https://swapi...",[https://swapi.dev/api/species/2/],[],[],2014-12-10T15:11:50.376000Z,2014-12-20T21:17:50.311000Z,https://swapi.dev/api/people/3/
3,Darth Vader,202,136,none,white,yellow,41.9BBY,male,https://swapi.dev/api/planets/1/,"[https://swapi.dev/api/films/1/, https://swapi...",[],[],[https://swapi.dev/api/starships/13/],2014-12-10T15:18:20.704000Z,2014-12-20T21:17:50.313000Z,https://swapi.dev/api/people/4/
4,Leia Organa,150,49,brown,light,brown,19BBY,female,https://swapi.dev/api/planets/2/,"[https://swapi.dev/api/films/1/, https://swapi...",[],[https://swapi.dev/api/vehicles/30/],[],2014-12-10T15:20:09.791000Z,2014-12-20T21:17:50.315000Z,https://swapi.dev/api/people/5/
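
To see how the conversion handles rows with different keys, here is a small offline sketch with two hand-written rows (the second one deliberately has no `mass` key). Missing values show up as `NaN`.

```python
import pandas as pd

rows = [
    {"name": "Luke Skywalker", "height": "172", "mass": "77"},
    {"name": "C-3PO", "height": "167"},  # no "mass" key in this row
]

sample_df = pd.DataFrame(rows)

print(sample_df)
```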


Let's now enhance the data by downloading details of each person's homeworld.

We can do this by calling the url in the `homeworld` column and saving the returned values to another column.

In [None]:
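def fetch_homeworld_data(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.json()
    except (requests.RequestException, ValueError):
        return None  # Return None if the request fails or the body isn't valid JSON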

# Apply the function to the 'homeworld' column and save the result in 'homeworld_data'
people_df['homeworld_data'] = people_df['homeworld'].apply(fetch_homeworld_data)

people_df.head()

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,films,species,vehicles,starships,created,edited,url,homeworld_data
0,Luke Skywalker,172,77,blond,fair,blue,19BBY,male,https://swapi.dev/api/planets/1/,"[https://swapi.dev/api/films/1/, https://swapi...",[],"[https://swapi.dev/api/vehicles/14/, https://s...","[https://swapi.dev/api/starships/12/, https://...",2014-12-09T13:50:51.644000Z,2014-12-20T21:17:56.891000Z,https://swapi.dev/api/people/1/,"{'name': 'Tatooine', 'rotation_period': '23', ..."
1,C-3PO,167,75,,gold,yellow,112BBY,,https://swapi.dev/api/planets/1/,"[https://swapi.dev/api/films/1/, https://swapi...",[https://swapi.dev/api/species/2/],[],[],2014-12-10T15:10:51.357000Z,2014-12-20T21:17:50.309000Z,https://swapi.dev/api/people/2/,"{'name': 'Tatooine', 'rotation_period': '23', ..."
2,R2-D2,96,32,,"white, blue",red,33BBY,,https://swapi.dev/api/planets/8/,"[https://swapi.dev/api/films/1/, https://swapi...",[https://swapi.dev/api/species/2/],[],[],2014-12-10T15:11:50.376000Z,2014-12-20T21:17:50.311000Z,https://swapi.dev/api/people/3/,"{'name': 'Naboo', 'rotation_period': '26', 'or..."
3,Darth Vader,202,136,none,white,yellow,41.9BBY,male,https://swapi.dev/api/planets/1/,"[https://swapi.dev/api/films/1/, https://swapi...",[],[],[https://swapi.dev/api/starships/13/],2014-12-10T15:18:20.704000Z,2014-12-20T21:17:50.313000Z,https://swapi.dev/api/people/4/,"{'name': 'Tatooine', 'rotation_period': '23', ..."
4,Leia Organa,150,49,brown,light,brown,19BBY,female,https://swapi.dev/api/planets/2/,"[https://swapi.dev/api/films/1/, https://swapi...",[],[https://swapi.dev/api/vehicles/30/],[],2014-12-10T15:20:09.791000Z,2014-12-20T21:17:50.315000Z,https://swapi.dev/api/people/5/,"{'name': 'Alderaan', 'rotation_period': '24', ..."
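
Notice that several people share a homeworld, so the same url is requested more than once. One way to avoid the repeated calls is to remember each result the first time it is fetched. This is a sketch only: `fetch_unique` is a hypothetical helper, and a stand-in function replaces the real network call so the example runs offline.

```python
def fetch_unique(urls, fetch):
    """Call fetch(url) once per distinct url and return results in the original order."""
    cache = {}
    for url in urls:
        if url not in cache:
            cache[url] = fetch(url)
    return [cache[url] for url in urls]


# Offline demonstration with a stand-in for the real network call
calls = []


def fake_fetch(url):
    calls.append(url)
    return {"url": url}


urls = [
    "https://swapi.dev/api/planets/1/",
    "https://swapi.dev/api/planets/1/",
    "https://swapi.dev/api/planets/8/",
]
results = fetch_unique(urls, fake_fetch)

print(len(results), "results from", len(calls), "calls")
```

In the notebook, the same idea would look like `people_df["homeworld_data"] = fetch_unique(people_df["homeworld"], fetch_homeworld_data)`.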


Pretty good! But there's a problem: each value in the `homeworld_data` column is a dictionary.

We can "unpack" this in `pandas` into separate columns:

In [None]:
people_homeworlds = pd.json_normalize(people_df["homeworld_data"])
people_homeworlds.head()

Unnamed: 0,name,rotation_period,orbital_period,diameter,climate,gravity,terrain,surface_water,population,residents,films,created,edited,url
0,Tatooine,23,304,10465,arid,1 standard,desert,1,200000,"[https://swapi.dev/api/people/1/, https://swap...","[https://swapi.dev/api/films/1/, https://swapi...",2014-12-09T13:50:49.641000Z,2014-12-20T20:58:18.411000Z,https://swapi.dev/api/planets/1/
1,Tatooine,23,304,10465,arid,1 standard,desert,1,200000,"[https://swapi.dev/api/people/1/, https://swap...","[https://swapi.dev/api/films/1/, https://swapi...",2014-12-09T13:50:49.641000Z,2014-12-20T20:58:18.411000Z,https://swapi.dev/api/planets/1/
2,Naboo,26,312,12120,temperate,1 standard,"grassy hills, swamps, forests, mountains",12,4500000000,"[https://swapi.dev/api/people/3/, https://swap...","[https://swapi.dev/api/films/3/, https://swapi...",2014-12-10T11:52:31.066000Z,2014-12-20T20:58:18.430000Z,https://swapi.dev/api/planets/8/
3,Tatooine,23,304,10465,arid,1 standard,desert,1,200000,"[https://swapi.dev/api/people/1/, https://swap...","[https://swapi.dev/api/films/1/, https://swapi...",2014-12-09T13:50:49.641000Z,2014-12-20T20:58:18.411000Z,https://swapi.dev/api/planets/1/
4,Alderaan,24,364,12500,temperate,1 standard,"grasslands, mountains",40,2000000000,"[https://swapi.dev/api/people/5/, https://swap...","[https://swapi.dev/api/films/1/, https://swapi...",2014-12-10T11:35:48.479000Z,2014-12-20T20:58:18.420000Z,https://swapi.dev/api/planets/2/
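
`json_normalize` also flattens dictionaries nested inside each record, joining the keys with a dot. The records below are made up for illustration (the real planet data is not nested like this).

```python
import pandas as pd

# Hypothetical records with one level of nesting
records = [
    {"name": "Tatooine", "climate": {"type": "arid", "surface_water": "1"}},
    {"name": "Naboo", "climate": {"type": "temperate", "surface_water": "12"}},
]

flat = pd.json_normalize(records)

print(flat.columns.tolist())
```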


In [None]:
# we'll rename columns to start with `homeworld_`
people_homeworlds.columns = ["homeworld_" + c for c in people_homeworlds.columns]

people_homeworlds.head()

Unnamed: 0,homeworld_name,homeworld_rotation_period,homeworld_orbital_period,homeworld_diameter,homeworld_climate,homeworld_gravity,homeworld_terrain,homeworld_surface_water,homeworld_population,homeworld_residents,homeworld_films,homeworld_created,homeworld_edited,homeworld_url
0,Tatooine,23,304,10465,arid,1 standard,desert,1,200000,"[https://swapi.dev/api/people/1/, https://swap...","[https://swapi.dev/api/films/1/, https://swapi...",2014-12-09T13:50:49.641000Z,2014-12-20T20:58:18.411000Z,https://swapi.dev/api/planets/1/
1,Tatooine,23,304,10465,arid,1 standard,desert,1,200000,"[https://swapi.dev/api/people/1/, https://swap...","[https://swapi.dev/api/films/1/, https://swapi...",2014-12-09T13:50:49.641000Z,2014-12-20T20:58:18.411000Z,https://swapi.dev/api/planets/1/
2,Naboo,26,312,12120,temperate,1 standard,"grassy hills, swamps, forests, mountains",12,4500000000,"[https://swapi.dev/api/people/3/, https://swap...","[https://swapi.dev/api/films/3/, https://swapi...",2014-12-10T11:52:31.066000Z,2014-12-20T20:58:18.430000Z,https://swapi.dev/api/planets/8/
3,Tatooine,23,304,10465,arid,1 standard,desert,1,200000,"[https://swapi.dev/api/people/1/, https://swap...","[https://swapi.dev/api/films/1/, https://swapi...",2014-12-09T13:50:49.641000Z,2014-12-20T20:58:18.411000Z,https://swapi.dev/api/planets/1/
4,Alderaan,24,364,12500,temperate,1 standard,"grasslands, mountains",40,2000000000,"[https://swapi.dev/api/people/5/, https://swap...","[https://swapi.dev/api/films/1/, https://swapi...",2014-12-10T11:35:48.479000Z,2014-12-20T20:58:18.420000Z,https://swapi.dev/api/planets/2/


Now all that remains is to put these two datasets together.

This isn't a join: we just want to place the two `DataFrame`s side by side without a join key.

We can do this with `pd.concat()`:

In [None]:
# concat takes a LIST of DataFrames
# axis is either 0 (stack vertically, one DataFrame on top of the other)
# or 1 (place horizontally, two DataFrames side by side)
people_df_final = pd.concat([people_df, people_homeworlds], axis=1)
people_df_final.head()

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,films,...,homeworld_climate,homeworld_gravity,homeworld_terrain,homeworld_surface_water,homeworld_population,homeworld_residents,homeworld_films,homeworld_created,homeworld_edited,homeworld_url
0,Luke Skywalker,172,77,blond,fair,blue,19BBY,male,https://swapi.dev/api/planets/1/,"[https://swapi.dev/api/films/1/, https://swapi...",...,arid,1 standard,desert,1,200000,"[https://swapi.dev/api/people/1/, https://swap...","[https://swapi.dev/api/films/1/, https://swapi...",2014-12-09T13:50:49.641000Z,2014-12-20T20:58:18.411000Z,https://swapi.dev/api/planets/1/
1,C-3PO,167,75,,gold,yellow,112BBY,,https://swapi.dev/api/planets/1/,"[https://swapi.dev/api/films/1/, https://swapi...",...,arid,1 standard,desert,1,200000,"[https://swapi.dev/api/people/1/, https://swap...","[https://swapi.dev/api/films/1/, https://swapi...",2014-12-09T13:50:49.641000Z,2014-12-20T20:58:18.411000Z,https://swapi.dev/api/planets/1/
2,R2-D2,96,32,,"white, blue",red,33BBY,,https://swapi.dev/api/planets/8/,"[https://swapi.dev/api/films/1/, https://swapi...",...,temperate,1 standard,"grassy hills, swamps, forests, mountains",12,4500000000,"[https://swapi.dev/api/people/3/, https://swap...","[https://swapi.dev/api/films/3/, https://swapi...",2014-12-10T11:52:31.066000Z,2014-12-20T20:58:18.430000Z,https://swapi.dev/api/planets/8/
3,Darth Vader,202,136,none,white,yellow,41.9BBY,male,https://swapi.dev/api/planets/1/,"[https://swapi.dev/api/films/1/, https://swapi...",...,arid,1 standard,desert,1,200000,"[https://swapi.dev/api/people/1/, https://swap...","[https://swapi.dev/api/films/1/, https://swapi...",2014-12-09T13:50:49.641000Z,2014-12-20T20:58:18.411000Z,https://swapi.dev/api/planets/1/
4,Leia Organa,150,49,brown,light,brown,19BBY,female,https://swapi.dev/api/planets/2/,"[https://swapi.dev/api/films/1/, https://swapi...",...,temperate,1 standard,"grasslands, mountains",40,2000000000,"[https://swapi.dev/api/people/5/, https://swap...","[https://swapi.dev/api/films/1/, https://swapi...",2014-12-10T11:35:48.479000Z,2014-12-20T20:58:18.420000Z,https://swapi.dev/api/planets/2/
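
The effect of `axis` is easiest to see on two tiny DataFrames. In this sketch, `axis=0` stacks the rows and `axis=1` places the columns next to each other, matching rows up by their index.

```python
import pandas as pd

left = pd.DataFrame({"name": ["Luke Skywalker", "Leia Organa"]})
right = pd.DataFrame({"homeworld_name": ["Tatooine", "Alderaan"]})

# axis=0: rows of `right` are appended below the rows of `left`
stacked = pd.concat([left, right], axis=0)

# axis=1: columns of `right` are placed next to the columns of `left`
side_by_side = pd.concat([left, right], axis=1)

print(stacked.shape)
print(side_by_side.shape)
```

Because `axis=1` aligns on the index, both DataFrames need the same index (here the default 0, 1, ...) for the rows to line up.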


We can also drop the original `homeworld` column:

In [None]:
people_df_final = people_df_final.drop(columns=["homeworld"])

people_df_final

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,gender,films,species,...,homeworld_climate,homeworld_gravity,homeworld_terrain,homeworld_surface_water,homeworld_population,homeworld_residents,homeworld_films,homeworld_created,homeworld_edited,homeworld_url
0,Luke Skywalker,172,77,blond,fair,blue,19BBY,male,"[https://swapi.dev/api/films/1/, https://swapi...",[],...,arid,1 standard,desert,1,200000,"[https://swapi.dev/api/people/1/, https://swap...","[https://swapi.dev/api/films/1/, https://swapi...",2014-12-09T13:50:49.641000Z,2014-12-20T20:58:18.411000Z,https://swapi.dev/api/planets/1/
1,C-3PO,167,75,,gold,yellow,112BBY,,"[https://swapi.dev/api/films/1/, https://swapi...",[https://swapi.dev/api/species/2/],...,arid,1 standard,desert,1,200000,"[https://swapi.dev/api/people/1/, https://swap...","[https://swapi.dev/api/films/1/, https://swapi...",2014-12-09T13:50:49.641000Z,2014-12-20T20:58:18.411000Z,https://swapi.dev/api/planets/1/
2,R2-D2,96,32,,"white, blue",red,33BBY,,"[https://swapi.dev/api/films/1/, https://swapi...",[https://swapi.dev/api/species/2/],...,temperate,1 standard,"grassy hills, swamps, forests, mountains",12,4500000000,"[https://swapi.dev/api/people/3/, https://swap...","[https://swapi.dev/api/films/3/, https://swapi...",2014-12-10T11:52:31.066000Z,2014-12-20T20:58:18.430000Z,https://swapi.dev/api/planets/8/
3,Darth Vader,202,136,none,white,yellow,41.9BBY,male,"[https://swapi.dev/api/films/1/, https://swapi...",[],...,arid,1 standard,desert,1,200000,"[https://swapi.dev/api/people/1/, https://swap...","[https://swapi.dev/api/films/1/, https://swapi...",2014-12-09T13:50:49.641000Z,2014-12-20T20:58:18.411000Z,https://swapi.dev/api/planets/1/
4,Leia Organa,150,49,brown,light,brown,19BBY,female,"[https://swapi.dev/api/films/1/, https://swapi...",[],...,temperate,1 standard,"grasslands, mountains",40,2000000000,"[https://swapi.dev/api/people/5/, https://swap...","[https://swapi.dev/api/films/1/, https://swapi...",2014-12-10T11:35:48.479000Z,2014-12-20T20:58:18.420000Z,https://swapi.dev/api/planets/2/
5,Owen Lars,178,120,"brown, grey",light,blue,52BBY,male,"[https://swapi.dev/api/films/1/, https://swapi...",[],...,arid,1 standard,desert,1,200000,"[https://swapi.dev/api/people/1/, https://swap...","[https://swapi.dev/api/films/1/, https://swapi...",2014-12-09T13:50:49.641000Z,2014-12-20T20:58:18.411000Z,https://swapi.dev/api/planets/1/
6,Beru Whitesun lars,165,75,brown,light,blue,47BBY,female,"[https://swapi.dev/api/films/1/, https://swapi...",[],...,arid,1 standard,desert,1,200000,"[https://swapi.dev/api/people/1/, https://swap...","[https://swapi.dev/api/films/1/, https://swapi...",2014-12-09T13:50:49.641000Z,2014-12-20T20:58:18.411000Z,https://swapi.dev/api/planets/1/
7,R5-D4,97,32,,"white, red",red,unknown,,[https://swapi.dev/api/films/1/],[https://swapi.dev/api/species/2/],...,arid,1 standard,desert,1,200000,"[https://swapi.dev/api/people/1/, https://swap...","[https://swapi.dev/api/films/1/, https://swapi...",2014-12-09T13:50:49.641000Z,2014-12-20T20:58:18.411000Z,https://swapi.dev/api/planets/1/
8,Biggs Darklighter,183,84,black,light,brown,24BBY,male,[https://swapi.dev/api/films/1/],[],...,arid,1 standard,desert,1,200000,"[https://swapi.dev/api/people/1/, https://swap...","[https://swapi.dev/api/films/1/, https://swapi...",2014-12-09T13:50:49.641000Z,2014-12-20T20:58:18.411000Z,https://swapi.dev/api/planets/1/
9,Obi-Wan Kenobi,182,77,"auburn, white",fair,blue-gray,57BBY,male,"[https://swapi.dev/api/films/1/, https://swapi...",[],...,temperate,1 standard,grass,unknown,unknown,[https://swapi.dev/api/people/10/],[],2014-12-10T16:16:26.566000Z,2014-12-20T20:58:18.452000Z,https://swapi.dev/api/planets/20/


We need to do some data cleaning and type conversion, but otherwise we can analyse this data in `pandas`!

In [None]:
people_df_final["homeworld_climate"].value_counts()

Unnamed: 0_level_0,count
homeworld_climate,Unnamed: 1_level_1
arid,7
temperate,3


In [None]:
import numpy as np

people_df_final["homeworld_orbital_period"] = people_df_final["homeworld_orbital_period"].replace("unknown", np.nan)
people_df_final["homeworld_orbital_period"] = people_df_final["homeworld_orbital_period"].astype(float)

people_df_final["homeworld_orbital_period"].mean()

311.55555555555554
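
An alternative to replacing `"unknown"` by hand is `pd.to_numeric` with `errors="coerce"`, which turns anything that isn't a number into `NaN`. A small offline sketch with made-up values:

```python
import pandas as pd

orbital_periods = pd.Series(["304", "312", "unknown"])

# Values that cannot be parsed as numbers become NaN
as_numbers = pd.to_numeric(orbital_periods, errors="coerce")

# mean() skips NaN values by default
print(as_numbers.mean())
```

This is convenient when a column may contain several different placeholder strings, though it also silently hides genuine typos, so it is worth checking how many values were coerced.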

### API keys

Most APIs require authentication of some sort.

Often this just means signing up for an API key, which is a string that's unique to you. Keep it safe, like a password.

Depending on the API, using a key can be as easy as adding it into the url as an extra parameter.

For example, Alpha Vantage (a free API service for stock price data) requires an email signup to generate a key.

The example urls all have the key of `"demo"` which you simply replace with your own key:

https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol=IBM&interval=5min&apikey=demo
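
A common way to keep a key out of your code is to read it from an environment variable. The sketch below rebuilds the example url that way; the variable name `ALPHAVANTAGE_API_KEY` is an assumption made for this example, and the code falls back to the public `"demo"` key if it isn't set.

```python
import os
from urllib.parse import urlencode

# Hypothetical environment variable name; falls back to the public demo key
api_key = os.environ.get("ALPHAVANTAGE_API_KEY", "demo")

params = {
    "function": "TIME_SERIES_INTRADAY",
    "symbol": "IBM",
    "interval": "5min",
    "apikey": api_key,
}
url = "https://www.alphavantage.co/query?" + urlencode(params)

# Avoid printing the full url in real code, since it contains the key
print(url.split("apikey=")[0] + "apikey=...")
```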

<h1 style="color: #fcd805">Exercise: APIs and `pandas`</h1>

We're going to explore a new API, the Gutendex (https://gutendex.com/).

This is an API to access data about the Project Gutenberg catalogue. Project Gutenberg (https://www.gutenberg.org/) is an initiative to digitise works of literature.

The url to retrieve all books is https://gutendex.com/books.

1. Look at the documentation on the website to figure out how to modify the url to get only books on the topic of horror.

Call this url using `requests` to get a response.

In [None]:
book_response = requests.get("https://gutendex.com/books?topic=horror")

book_response.raise_for_status()

books_json = book_response.json()

books_json

2. Convert the response to a Python object. How many books are there in total that are tagged "horror"?

_Hint: look at the response and find the right dictionary key to answer the question._

In [None]:
books_json["count"]

249

3. Find the right dictionary key within the returned result to retrieve the books as a list. Convert these to a `pandas` DataFrame.

How many books were returned?

In [None]:
import pandas as pd

books = books_json["results"]

books_df = pd.DataFrame(books)
print(books_df.shape)
books_df.head()

(32, 11)


Unnamed: 0,id,title,authors,translators,subjects,bookshelves,languages,copyright,media_type,formats,download_count
0,84,"Frankenstein; Or, The Modern Prometheus","[{'name': 'Shelley, Mary Wollstonecraft', 'bir...",[],[Frankenstein's monster (Fictitious character)...,"[Browsing: Culture/Civilization/Society, Brows...",[en],False,Text,{'text/html': 'https://www.gutenberg.org/ebook...,78467
1,5200,Metamorphosis,"[{'name': 'Kafka, Franz', 'birth_year': 1883, ...","[{'name': 'Wyllie, David (Translator)', 'birth...","[Metamorphosis -- Fiction, Psychological fiction]","[Browsing: Fiction, Browsing: Literature, Brow...",[en],True,Text,{'text/html': 'https://www.gutenberg.org/ebook...,25124
2,345,Dracula,"[{'name': 'Stoker, Bram', 'birth_year': 1847, ...",[],"[Dracula, Count (Fictitious character) -- Fict...","[Browsing: Fiction, Browsing: Literature, Brow...",[en],False,Text,{'text/html': 'https://www.gutenberg.org/ebook...,22060
3,43,The Strange Case of Dr. Jekyll and Mr. Hyde,"[{'name': 'Stevenson, Robert Louis', 'birth_ye...",[],"[Horror tales, London (England) -- Fiction, Mu...","[Browsing: Fiction, Browsing: Psychiatry/Psych...",[en],False,Text,{'text/html': 'https://www.gutenberg.org/ebook...,15401
4,8492,The King in Yellow,"[{'name': 'Chambers, Robert W. (Robert William...",[],"[Horror tales, American, Short stories, Americ...","[Browsing: Culture/Civilization/Society, Brows...",[en],False,Text,{'text/html': 'https://www.gutenberg.org/ebook...,9822


4. Each request only retrieves 32 books, but we want all of them. Write a loop to go through all pages of the horror catalogue. In your loop you should:

- request a new page of books by altering the url each time
- take the results, save them into a Python object, then convert it to a `pandas` DataFrame
- collect all these `pandas` DataFrames into a list

At the end of your loop you should have a list of `pandas` DataFrames.

In [None]:
import time

# we know the total number of books (the "count" value) and that there are 32 per page
# so we could explicitly loop a certain number of times
# or we could see that the JSON provides a "next" url
# which is a typical pattern to allow pagination
# so we could also keep going until that's None (i.e. blank)

book_dataframes = []

keep_going = True
page_url = "https://gutendex.com/books?topic=horror"

while keep_going:
    print(f"Attempting {page_url}...")
    books_page = requests.get(page_url)
    books_page.raise_for_status()

    books_json = books_page.json()

    # extract the book DataFrame
    books_df = pd.DataFrame(books_json["results"])
    book_dataframes.append(books_df)

    # and extract the next url unless we're done
    if books_json["next"]:
        page_url = books_json["next"]
    else:
        keep_going = False

    # a courtesy :-)
    time.sleep(0.5)

print("Done!")
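
The "follow `next` until it is empty" pattern can be wrapped in a reusable generator. This is a sketch under stated assumptions: `iter_results` is a hypothetical helper, the API is assumed to use `"results"` and `"next"` keys as Gutendex does, and a dictionary of fake pages with made-up urls replaces the network so the example runs offline.

```python
def iter_results(first_url, fetch_json):
    """Yield every item from a paginated API that exposes "results" and "next" keys."""
    url = first_url
    while url:
        page = fetch_json(url)
        yield from page["results"]
        # "next" is None (or missing) on the last page, which ends the loop
        url = page.get("next")


# Offline stand-in for two pages of an API (hypothetical urls and data)
fake_pages = {
    "https://example.com/books?page=1": {
        "results": [{"id": 1}, {"id": 2}],
        "next": "https://example.com/books?page=2",
    },
    "https://example.com/books?page=2": {"results": [{"id": 3}], "next": None},
}

all_items = list(iter_results("https://example.com/books?page=1", fake_pages.__getitem__))

print(len(all_items))
```

With a real API you would pass a function that calls `requests.get(url)`, checks the status and returns `.json()`, ideally with a short `time.sleep` between pages.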

5. Use the `pd.concat()` function to combine your DataFrames into a single DataFrame.

How many horror books do you have in your data? Does the number match the count from question 2?

In [None]:
books_all = pd.concat(book_dataframes, ignore_index=True)
print(books_all.shape)
books_all.head()

(224, 11)


Unnamed: 0,id,title,authors,translators,subjects,bookshelves,languages,copyright,media_type,formats,download_count
0,84,"Frankenstein; Or, The Modern Prometheus","[{'name': 'Shelley, Mary Wollstonecraft', 'bir...",[],[Frankenstein's monster (Fictitious character)...,"[Browsing: Culture/Civilization/Society, Brows...",[en],False,Text,{'text/html': 'https://www.gutenberg.org/ebook...,78467
1,5200,Metamorphosis,"[{'name': 'Kafka, Franz', 'birth_year': 1883, ...","[{'name': 'Wyllie, David (Translator)', 'birth...","[Metamorphosis -- Fiction, Psychological fiction]","[Browsing: Fiction, Browsing: Literature, Brow...",[en],True,Text,{'text/html': 'https://www.gutenberg.org/ebook...,25124
2,345,Dracula,"[{'name': 'Stoker, Bram', 'birth_year': 1847, ...",[],"[Dracula, Count (Fictitious character) -- Fict...","[Browsing: Fiction, Browsing: Literature, Brow...",[en],False,Text,{'text/html': 'https://www.gutenberg.org/ebook...,22060
3,43,The Strange Case of Dr. Jekyll and Mr. Hyde,"[{'name': 'Stevenson, Robert Louis', 'birth_ye...",[],"[Horror tales, London (England) -- Fiction, Mu...","[Browsing: Fiction, Browsing: Psychiatry/Psych...",[en],False,Text,{'text/html': 'https://www.gutenberg.org/ebook...,15401
4,8492,The King in Yellow,"[{'name': 'Chambers, Robert W. (Robert William...",[],"[Horror tales, American, Short stories, Americ...","[Browsing: Culture/Civilization/Society, Brows...",[en],False,Text,{'text/html': 'https://www.gutenberg.org/ebook...,9822


6. How many downloads of horror books were there in total?

In [None]:
books_all["download_count"].sum()

276908

7. BONUS: Which author has the most books in the horror section?

To answer this:

- the `authors` column is a list of dictionaries. Figure out how to extract the *first* dictionary from each list and save these into a new column
- use this new column to "unpack" the dictionary using `json_normalize`
- use this "JSON normalised" data to calculate the most frequent author
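
One possible approach to the bonus question is sketched below on a tiny hand-made sample, so it runs without calling the API. The sample rows and the name `books_sample` are assumptions for illustration; with the real data you would use your combined DataFrame instead.

```python
import pandas as pd

# Hand-made sample in the same shape as the API's "authors" column
books_sample = pd.DataFrame(
    {
        "title": ["Frankenstein", "Dracula", "The Lair of the White Worm"],
        "authors": [
            [{"name": "Shelley, Mary Wollstonecraft", "birth_year": 1797}],
            [{"name": "Stoker, Bram", "birth_year": 1847}],
            [{"name": "Stoker, Bram", "birth_year": 1847}],
        ],
    }
)

# Take the first author of each book (an empty dict if a book has no authors)
first_author = books_sample["authors"].apply(
    lambda authors: authors[0] if authors else {}
)

# Unpack the dictionaries into columns, then count how often each name appears
authors_flat = pd.json_normalize(first_author.tolist())
top_author = authors_flat["name"].value_counts().idxmax()

print(top_author)
```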

# Web scraping

Web scraping is needed when data is on the web but not accessible with a clean API.

In these instances, we can extract the data from the web page directly.

We can use `requests` to get the raw HTML of a web page, which we can then explore.

We're going to scrape data from a fictional bookstore: http://books.toscrape.com/

In [None]:
bookstore_response = requests.get("http://books.toscrape.com/")

bookstore_response.raise_for_status()


The returned content is no longer JSON, but raw HTML in a string:

In [None]:
bookstore_response.text



To be able to extract components from this, we will use the `BeautifulSoup` library.

In [None]:
from bs4 import BeautifulSoup

We create a "beautiful soup" object from the raw HTML:

In [None]:
soup = BeautifulSoup(bookstore_response.text, "html.parser")

In [None]:
type(soup)

Looking at the object, it still looks like the HTML but we have additional methods available to us to explore it.

In [None]:
soup

What we're interested in is extracting specific HTML **elements**.

For this, we need to learn a bit of syntax: CSS selectors. CSS is a way to style a web page, and its selectors are patterns that pick out the elements to style (more info and tutorials here: https://www.w3schools.com/css/).

The simplest form of a selector is using a tag type. That is, finding elements on a page that are all the same type, such as links.

In HTML, a link is an `<a>` tag, so we can find all links like this:

In [None]:
links = soup.select("a")

links

[<a href="index.html">Books to Scrape</a>,
 <a href="index.html">Home</a>,
 <a href="catalogue/category/books_1/index.html">
                             
                                 Books
                             
                         </a>,
 <a href="catalogue/category/books/travel_2/index.html">
                             
                                 Travel
                             
                         </a>,
 <a href="catalogue/category/books/mystery_3/index.html">
                             
                                 Mystery
                             
                         </a>,
 <a href="catalogue/category/books/historical-fiction_4/index.html">
                             
                                 Historical Fiction
                             
                         </a>,
 <a href="catalogue/category/books/sequential-art_5/index.html">
                             
                                 Sequential Art
            

In [None]:
type(links)

In [None]:
links[0]

<a href="index.html">Books to Scrape</a>

In [None]:
type(links[0])

These are all `Tag` objects which represent an HTML element.

These link tags all contain:

- text, which is what we see displayed on the page
- an "href" which is the url to visit when you click the link

We can extract both using `BeautifulSoup`:

In [None]:
[link.text for link in links]

['Books to Scrape',
 'Home',
 '\n                            \n                                Books\n                            \n                        ',
 '\n                            \n                                Travel\n                            \n                        ',
 '\n                            \n                                Mystery\n                            \n                        ',
 '\n                            \n                                Historical Fiction\n                            \n                        ',
 '\n                            \n                                Sequential Art\n                            \n                        ',
 '\n                            \n                                Classics\n                            \n                        ',
 '\n                            \n                                Philosophy\n                            \n                        ',
 '\n                        
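
Much of that text is whitespace from the page's indentation. One way to tidy it is `get_text(strip=True)`, sketched here on a small hand-written HTML snippet rather than the live page.

```python
from bs4 import BeautifulSoup

# A small hand-written snippet imitating the structure of the bookstore's links
html = """
<ul>
  <li><a href="index.html">Books to Scrape</a></li>
  <li><a href="catalogue/category/books/travel_2/index.html">
        Travel
      </a></li>
</ul>
"""

sample_soup = BeautifulSoup(html, "html.parser")

# strip=True removes leading and trailing whitespace from the text
clean_text = [link.get_text(strip=True) for link in sample_soup.select("a")]

print(clean_text)
```

Calling `link.text.strip()` gives the same result for simple links like these.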

In [None]:
[link["href"] for link in links]

['index.html',
 'index.html',
 'catalogue/category/books_1/index.html',
 'catalogue/category/books/travel_2/index.html',
 'catalogue/category/books/mystery_3/index.html',
 'catalogue/category/books/historical-fiction_4/index.html',
 'catalogue/category/books/sequential-art_5/index.html',
 'catalogue/category/books/classics_6/index.html',
 'catalogue/category/books/philosophy_7/index.html',
 'catalogue/category/books/romance_8/index.html',
 'catalogue/category/books/womens-fiction_9/index.html',
 'catalogue/category/books/fiction_10/index.html',
 'catalogue/category/books/childrens_11/index.html',
 'catalogue/category/books/religion_12/index.html',
 'catalogue/category/books/nonfiction_13/index.html',
 'catalogue/category/books/music_14/index.html',
 'catalogue/category/books/default_15/index.html',
 'catalogue/category/books/science-fiction_16/index.html',
 'catalogue/category/books/sports-and-games_17/index.html',
 'catalogue/category/books/add-a-comment_18/index.html',
 'catalogue/ca
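These `href` values are relative to the page they came from, so they cannot be requested as they are. A minimal sketch of turning them into absolute URLs with the standard library's `urljoin` follows. The `base_url` below is an assumption about where the page was fetched from; substitute the URL you actually requested.

```python
from urllib.parse import urljoin

# Assumed address of the page the links were scraped from.
base_url = "https://books.toscrape.com/"

# Two of the relative hrefs from the output above.
hrefs = [
    "index.html",
    "catalogue/category/books/travel_2/index.html",
]

# urljoin resolves each relative href against the base URL.
absolute_urls = [urljoin(base_url, href) for href in hrefs]
print(absolute_urls)
```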

A page often contains many elements of the same type but with different `class` attributes.

A class is a way to tell CSS which elements should look the same.

For example, all buttons on the webpage share the same classes, including one called `"btn"`.

In a CSS selector, a leading `.` selects every element that has the given class:

In [None]:
buttons = soup.select(".btn")
buttons

[<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>,
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>,
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>,
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>,
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>,
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>,
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>,
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>,
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>,
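Class selectors can also be combined with a tag name or with each other. The sketch below uses a small made-up HTML snippet (not the real page) to show how `.btn`, `button.btn` and `.btn.btn-primary` differ; the names `sample_soup`, `by_class`, `by_tag_and_class` and `by_two_classes` are purely illustrative.

```python
from bs4 import BeautifulSoup

# Made-up markup: one styled button, one styled link, one unstyled button.
html = """
<div>
  <button class="btn btn-primary btn-block" type="submit">Add to basket</button>
  <a class="btn" href="#">Details</a>
  <button class="plain">Close</button>
</div>
"""
sample_soup = BeautifulSoup(html, "html.parser")

# Any tag that has the class "btn".
by_class = sample_soup.select(".btn")

# Only <button> tags that have the class "btn".
by_tag_and_class = sample_soup.select("button.btn")

# Tags that have both "btn" and "btn-primary".
by_two_classes = sample_soup.select(".btn.btn-primary")

print(len(by_class), len(by_tag_and_class), len(by_two_classes))
```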
 

<h1 style="color: #fcd805">Exercise: web scraping</h1>

Your turn to scrape some data from the bookshop!

We're going to extract all the prices from the page and calculate the average book price.

1. Inspect the web page. What makes each book price element unique?

_Hint: right-click and click Inspect to view the HTML behind an element on the page._

Each price is inside a `<p>` tag with the class `"price_color"`.

2. Use `BeautifulSoup` to select all the elements that show a book's price.

In [None]:
price_tags = soup.select("p.price_color")
price_tags

[<p class="price_color">Â£51.77</p>,
 <p class="price_color">Â£53.74</p>,
 <p class="price_color">Â£50.10</p>,
 <p class="price_color">Â£47.82</p>,
 <p class="price_color">Â£54.23</p>,
 <p class="price_color">Â£22.65</p>,
 <p class="price_color">Â£33.34</p>,
 <p class="price_color">Â£17.93</p>,
 <p class="price_color">Â£22.60</p>,
 <p class="price_color">Â£52.15</p>,
 <p class="price_color">Â£13.99</p>,
 <p class="price_color">Â£20.66</p>,
 <p class="price_color">Â£17.46</p>,
 <p class="price_color">Â£52.29</p>,
 <p class="price_color">Â£35.02</p>,
 <p class="price_color">Â£57.25</p>,
 <p class="price_color">Â£23.88</p>,
 <p class="price_color">Â£37.59</p>,
 <p class="price_color">Â£51.33</p>,
 <p class="price_color">Â£45.17</p>]

3. Extract only the displayed text from these elements into a list.

You should end up with a list of strings.

In [None]:
prices = [tag.text for tag in price_tags]
prices

['Â£51.77',
 'Â£53.74',
 'Â£50.10',
 'Â£47.82',
 'Â£54.23',
 'Â£22.65',
 'Â£33.34',
 'Â£17.93',
 'Â£22.60',
 'Â£52.15',
 'Â£13.99',
 'Â£20.66',
 'Â£17.46',
 'Â£52.29',
 'Â£35.02',
 'Â£57.25',
 'Â£23.88',
 'Â£37.59',
 'Â£51.33',
 'Â£45.17']
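The stray `Â` in front of each `£` looks like a decoding artifact. One plausible cause is that the page's UTF-8 bytes were decoded as ISO-8859-1 (Latin-1), which turns the two-byte `£` into the two characters `Â£`. The sketch below reproduces and reverses that effect using only the standard library; it illustrates the mechanism and does not confirm what happened to this particular response.

```python
# The price as it is meant to appear.
pound_price = "£51.77"

# In UTF-8 the pound sign is stored as two bytes: 0xC2 0xA3.
raw_bytes = pound_price.encode("utf-8")

# Decoding those bytes as Latin-1 maps each byte to its own character.
garbled = raw_bytes.decode("latin-1")
print(garbled)

# Reversing the wrong decoding recovers the original text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

If this is the cause, setting `response.encoding = "utf-8"` before reading `response.text`, or passing the raw bytes in `response.content` to `BeautifulSoup`, may avoid the problem at the source.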

4. Create a `pandas` `Series` from this list of strings by using `pd.Series`.

In [None]:
price_series = pd.Series(prices)
price_series

0    Â£51.77
1    Â£53.74
2    Â£50.10
3    Â£47.82
4    Â£54.23
5    Â£22.65
6    Â£33.34
7    Â£17.93
8    Â£22.60
9    Â£52.15


5. Using your `pandas` knowledge, clean up these strings so they are just numeric prices, and convert the `Series` to be a numeric type.

In [None]:
# Drop the first two characters ("Â£", a mis-decoded "£"), then convert the rest to float.
price_series = price_series.str[2:].astype(float)
price_series

0    51.77
1    53.74
2    50.1
3    47.82
4    54.23
5    22.65
6    33.34
7    17.93
8    22.6
9    52.15
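Slicing off a fixed number of characters only works while every string has the same prefix. A more defensive sketch, assuming the prices always contain a plain decimal number, extracts the number with a regular expression. The second and third sample values below are hypothetical variations added to show the idea, and `raw`, `numbers` and `clean` are illustrative names.

```python
import pandas as pd

# Sample strings shaped like the scraped prices; only the first
# matches the real output, the others are hypothetical variations.
raw = pd.Series(["Â£51.77", "£53.74", "50.10"])

# Pull out the first run of digits with an optional decimal part.
numbers = raw.str.extract(r"(\d+(?:\.\d+)?)", expand=False)

# to_numeric converts the strings to floats; errors="coerce" turns
# anything unparseable into NaN instead of raising an exception.
clean = pd.to_numeric(numbers, errors="coerce")
print(clean.tolist())
```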


6. Now calculate the average price of books on the web page.

In [None]:
print(price_series.mean(), price_series.median())

38.048500000000004 41.38
