<a href="https://colab.research.google.com/github/josepeon/python_dad_class/blob/main/html_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Extracting Data From HTML

**OBJECTIVES**


- Use `pd.read_html` to extract data from website tables.
- Use `bs4` to parse HTML returned by `requests`.

### Reading in Data from HTML Tables

Now we turn to one more way of accessing data. As we've seen, a data API may return `json` or `csv`. Alternatively, you may receive HTML, where the information is wrapped in tags. Below, we examine some basic HTML tags and their effects.

```html
<h1>A Heading</h1>
<p>A first paragraph</p>
<p>A second paragraph</p>
<table>
  <tr>
    <th>Album</th>
    <th>Rating</th>
  </tr>
  <tr>
    <td>Pink Panther</td>
    <td>10</td>
  </tr>
</table>
```

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import requests

In [2]:
html = '''
<h1>A Heading</h1>
<p>A first paragraph</p>
<p>A second paragraph</p>
<table>
  <tr>
    <th>Album</th>
    <th>Rating</th>
  </tr>
  <tr>
    <td>Pink Panther</td>
    <td>10</td>
  </tr>
</table>
'''

In [3]:
from IPython.display import HTML

In [4]:
HTML(html)

Album,Rating
Pink Panther,10
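
The same small table can also be pulled straight into pandas. This is a minimal sketch, assuming the `lxml` parser is installed (pandas needs it, or `bs4` with `html5lib`, for `read_html`). Wrapping the string in `StringIO` avoids the deprecation warning that recent pandas versions raise for literal HTML strings. The names `html_snippet`, `tables`, and `album_df` are illustrative.

```python
from io import StringIO

import pandas as pd

# Same kind of markup as the cell above, repeated so this block runs on its own
html_snippet = '''
<h1>A Heading</h1>
<p>A first paragraph</p>
<table>
  <tr><th>Album</th><th>Rating</th></tr>
  <tr><td>Pink Panther</td><td>10</td></tr>
</table>
'''

# read_html returns a list with one DataFrame per <table> it finds
tables = pd.read_html(StringIO(html_snippet))
album_df = tables[0]
print(album_df)
```

Only the `<table>` is picked up; the heading and paragraphs are ignored.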


### Making a Request of a URL

Let's begin with some basketball information from basketball-reference.com:

- https://www.basketball-reference.com/wnba

The tables on the page will be picked up (hopefully!) by the `read_html` function in pandas.

In [5]:
#visit the url below
url = 'https://www.basketball-reference.com/wnba'

In [6]:
#read the tables with read_html and assign the results to wnba
wnba = pd.read_html(url)

In [7]:
#what kind of object is wnba?
type(wnba)

list
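
Because `read_html` returns a plain Python list of DataFrames, ordinary list tools apply. Here is a small sketch, using two hand-made stand-in DataFrames (hypothetical data, not the scraped tables), of how one might survey the list and pick out the table that has a given column:

```python
import pandas as pd

# Stand-ins for the list returned by pd.read_html (values are made up)
tables = [
    pd.DataFrame({'Team': ['A', 'B'], 'W': [3, 1], 'L': [1, 3]}),
    pd.DataFrame({'Player': ['X', 'Y'], 'PTS': [20, 15]}),
]

# Shape and column names of every table in the list
for i, table in enumerate(tables):
    print(i, table.shape, list(table.columns))

# Keep only the tables that contain a 'PTS' column
with_points = [t for t in tables if 'PTS' in t.columns]
```

Searching by column name tends to be more robust than a fixed index such as `[0]` or `[-1]`, since the number and order of tables on a live page can change.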

In [8]:
#first element?
wnba[0]

Unnamed: 0,Team,W,L,W/L%,GB
0,Minnesota Lynx*,34,10,0.773,—
1,Las Vegas Aces*,30,14,0.682,4.0
2,Atlanta Dream*,30,14,0.682,4.0
3,Phoenix Mercury*,27,17,0.614,7.0
4,New York Liberty*,27,17,0.614,7.0
5,Indiana Fever*,24,20,0.545,10.0
6,Seattle Storm*,23,21,0.523,11.0
7,Golden State Valkyries*,23,21,0.523,11.0
8,Los Angeles Sparks,21,23,0.477,13.0
9,Washington Mystics,16,28,0.364,18.0


In [9]:
#examine information
wnba[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Team    13 non-null     object 
 1   W       13 non-null     int64  
 2   L       13 non-null     int64  
 3   W/L%    13 non-null     float64
 4   GB      13 non-null     object 
dtypes: float64(1), int64(2), object(2)
memory usage: 652.0+ bytes


In [10]:
#last dataframe?
wnba[-1]

Unnamed: 0.1,Unnamed: 0,PTS,TRB,AST,GmSc
0,Jackie Young (LVA),32,8,2,25.0
1,A'ja Wilson (LVA),28,14,3,21.0
2,Chelsea Gray (LVA),10,8,10,16.4
3,Satou Sabally (PHO),22,9,2,13.2
4,Kahleah Copper (PHO),23,3,0,12.1


**Example 2**

List of best-selling albums from Wikipedia.

- https://en.wikipedia.org/wiki/List_of_best-selling_albums

In [11]:
url = 'https://en.wikipedia.org/wiki/List_of_best-selling_albums'

In [None]:
#read in the tables


In [None]:
#how many tables?


In [None]:
#look at the fourth table


In [None]:
#try to convert sales to float


In [None]:
#replace and coerce as float
# fourth_table['Claimed sales*'] = fourth_table['Claimed sales*'].replace({'20[disputed – discuss]': 20}).astype('float')

In [None]:
#alternative with string method
#fourth_table['Claimed sales*'].str.replace('[disputed – discuss]', '', regex = False)
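
The two commented approaches above can be tried on a toy column first. In this sketch the `sales` Series is made up (only the `'20[disputed – discuss]'` string comes from the cells above), and `pd.to_numeric` with `errors='coerce'` is swapped in as a forgiving alternative to `astype('float')`.

```python
import pandas as pd

# Toy column: made-up values apart from the footnote string seen above
sales = pd.Series(['70', '50', '20[disputed – discuss]'],
                  dtype='object', name='Claimed sales*')

# Option 1: map the awkward value to a number, then cast
option_1 = sales.replace({'20[disputed – discuss]': 20}).astype('float')

# Option 2: strip the footnote text literally (regex=False), then convert;
# errors='coerce' turns anything still unparseable into NaN
stripped = sales.str.replace('[disputed – discuss]', '', regex=False)
option_2 = pd.to_numeric(stripped, errors='coerce')

print(option_1.tolist(), option_2.tolist())
```

Option 1 only handles the exact value listed in the dictionary, while option 2 removes the footnote wherever it appears.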

### Scraping the Web for Data

Sometimes the data is not formatted as an HTML table, or `pd.read_html` simply doesn't work. In these situations you can use the `bs4` library and its `BeautifulSoup` object to parse HTML tags and extract information. First, make sure you have the library installed and can import it below.

In [12]:
# pip install -U bs4

In [13]:
from bs4 import BeautifulSoup
import requests

In [14]:
sample_html = '''
<h1>Music Reviews</h1>
<p>This album was awful. <strong>Score</strong>: <i class = "score">2</i></p>
<p class = "good">This album was great. <strong>Score</strong>: <i class = "score">8</i></p>
'''

In [15]:
# create a soup object
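# name the parser explicitly to avoid a warning; 'lxml' adds the <html><body> wrapper seen in the output
soup = BeautifulSoup(sample_html, 'lxml')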

In [16]:
# examine the soup
soup

<html><body><h1>Music Reviews</h1>
<p>This album was awful. <strong>Score</strong>: <i class="score">2</i></p>
<p class="good">This album was great. <strong>Score</strong>: <i class="score">8</i></p>
</body></html>

In [17]:
# find the first <p> tag
soup.find('p')

<p>This album was awful. <strong>Score</strong>: <i class="score">2</i></p>

In [18]:
# find the first <i> tag
soup.find('i')

<i class="score">2</i>

In [19]:
# find all the i tags
soup.find_all('i')

[<i class="score">2</i>, <i class="score">8</i>]

In [20]:
# find the first paragraph with class "good" (use find_all to get every match)
soup.find('p', {'class': 'good'})

<p class="good">This album was great. <strong>Score</strong>: <i class="score">8</i></p>
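
Once a tag is located, `.text` gives its contents and `.get` gives its attributes. Below is a minimal sketch that turns each review paragraph of the sample into a record. The choice of `html.parser` (the parser bundled with Python) and the names `reviews`, `review_soup`, and `records` are made for this example only.

```python
from bs4 import BeautifulSoup

reviews = '''
<h1>Music Reviews</h1>
<p>This album was awful. <strong>Score</strong>: <i class="score">2</i></p>
<p class="good">This album was great. <strong>Score</strong>: <i class="score">8</i></p>
'''

review_soup = BeautifulSoup(reviews, 'html.parser')

records = []
for p in review_soup.find_all('p'):
    score_tag = p.find('i', {'class': 'score'})
    records.append({
        # text inside <i class="score">, converted to an integer
        'score': int(score_tag.text),
        # the class attribute comes back as a list, or the default [] if absent
        'is_good': 'good' in p.get('class', []),
    })

print(records)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame`.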

#### Extracting Data from a URL

1. Make a request.
2. Turn the request into soup!

In [25]:
url = 'https://pitchfork.com/reviews/albums/'

In [26]:
#make a request
r = requests.get(url)

In [27]:
#examine the text
r.text[:1000]

'<!DOCTYPE html><html lang="en-US"><head><title>New Albums &amp; Music Reviews | Pitchfork</title><meta charSet="utf-8"/><meta content="IE=edge" http-equiv="X-UA-Compatible"/><meta name="msapplication-tap-highlight" content="no"/><meta name="viewport" content="width=device-width, initial-scale=1"/><meta name="author" content="Condé Nast"/><meta name="copyright" content="Copyright (c) Condé Nast 2025"/><meta name="description" content="Daily reviews of every important album in music"/><meta name="id" content="65ce02a52126d093a5f585e1"/><meta name="keywords" content="web"/><meta name="news_keywords" content="web"/><meta name="robots" content="index, follow, max-image-preview:large"/><meta name="content-type" content="bundle"/><meta name="parsely-post-id" content="65ce02a52126d093a5f585e1"/><meta name="parsely-metadata" content="{&quot;description&quot;:&quot;Daily reviews of every important album in music&quot;,&quot;image-16-9&quot;:&quot;https://media.pitchfork.com/photos/5935a027a28a0

In [28]:
#turn it into soup!
soup = BeautifulSoup(r.text, 'lxml')  # name the parser explicitly to avoid a warning
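
A request can fail or be blocked, in which case `r.text` holds an error page rather than the reviews. One hedged way to guard against that is a small helper that checks the status before parsing. The name `soup_from_response` is made up for this sketch, and `html.parser` is used only because it ships with Python.

```python
import requests
from bs4 import BeautifulSoup


def soup_from_response(response):
    """Parse a requests.Response, raising first if the request failed."""
    # raise_for_status() raises requests.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')


# Example usage (needs network access, so it is left commented out):
# r = requests.get('https://pitchfork.com/reviews/albums/', timeout=10)
# soup = soup_from_response(r)
```

Failing loudly here is usually easier to debug than a later `find` call that quietly returns `None`.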

### Using Inspect

You can inspect an item's HTML by right-clicking on the item of interest and selecting **inspect**. Here, you will see the HTML tags that surround the object of interest.

For example, when this lesson was written, a recent album review on Pitchfork was *Mustafa: Dunya*. Right-clicking on the image of the album cover and choosing inspect showed:

![](https://github.com/jfkoehler/bootcamp_spr25/blob/master/images/pitch_cover.png?raw=1)

In [33]:
#find the img tag
dunya = soup.find('img', {'loading': 'eager'})

In [34]:
dunya

<img alt="The Life of a Showgirl" class="ResponsiveImageContainer-eNxvmU cfBbTk responsive-image__image" data-src="https://media.pitchfork.com/photos/68a32e095783f969caddc613/1:1/w_1600%2Cc_limit/Taylor-Swift-The-Life-of-a-Showgirl.jpeg" loading="eager" src="https://media.pitchfork.com/photos/68a32e095783f969caddc613/1:1/w_1600%2Cc_limit/Taylor-Swift-The-Life-of-a-Showgirl.jpeg"/>

In [37]:
#find all img tags
images = soup.find_all('img', {'loading': 'eager'})

In [38]:
#explore attributes
images[0].attrs

{'alt': 'The Life of a Showgirl',
 'loading': 'eager',
 'class': ['ResponsiveImageContainer-eNxvmU',
  'cfBbTk',
  'responsive-image__image'],
 'src': 'https://media.pitchfork.com/photos/68a32e095783f969caddc613/1:1/w_1600%2Cc_limit/Taylor-Swift-The-Life-of-a-Showgirl.jpeg',
 'data-src': 'https://media.pitchfork.com/photos/68a32e095783f969caddc613/1:1/w_1600%2Cc_limit/Taylor-Swift-The-Life-of-a-Showgirl.jpeg'}

In [40]:
#extract source of image url
for img in images:
  print(img.attrs['src'])

https://media.pitchfork.com/photos/68a32e095783f969caddc613/1:1/w_1600%2Cc_limit/Taylor-Swift-The-Life-of-a-Showgirl.jpeg
https://media.pitchfork.com/photos/68d55ec8e0521f5b8408d24a/1:1/w_1600%2Cc_limit/Call%2520Super:%2520A%2520Rhythm%2520Protects%2520One.jpg
https://media.pitchfork.com/photos/5fd24ddc4a647c066bffa914/1:1/w_1600%2Cc_limit/PJ-Harvey.jpg
https://media.pitchfork.com/photos/68dd439cfa33e89f8ae1b578/1:1/w_1600%2Cc_limit/sombr:%2520I%2520Barely%2520Know%2520Her.jpg
https://media.pitchfork.com/photos/68d6883c0f3275dabf461ddc/1:1/w_1600%2Cc_limit/Young-Thug-UY-SCUTI.jpeg
https://media.pitchfork.com/photos/67f802189126c7d0c8f3b783/1:1/w_1600%2Cc_limit/Leon-Vynehall-In-Daytona-Yellow.jpeg
https://media.pitchfork.com/photos/68ded2435e282545940328cf/1:1/w_1600%2Cc_limit/snuggle.jpg
https://media.pitchfork.com/photos/686fda609a8ba5160e24a4f5/1:1/w_1600%2Cc_limit/Rochelle-Jordan-Through-the-Wall.jpeg
https://media.pitchfork.com/photos/68d451a69c3c6f55d4d0056c/1:1/w_1600%2Cc_limit/O
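
Note that `img.attrs['src']` raises a `KeyError` if a tag happens to lack a `src` attribute. Here is a sketch on made-up markup (the URLs are placeholders) using `Tag.get`, which returns `None` instead:

```python
from bs4 import BeautifulSoup

# Made-up markup: the second image has no src attribute
snippet = '''
<img alt="Cover one" loading="eager" src="https://example.com/one.jpg"/>
<img alt="Cover two" loading="eager"/>
'''

img_soup = BeautifulSoup(snippet, 'html.parser')

rows = []
for img in img_soup.find_all('img', {'loading': 'eager'}):
    # .get() returns None when the attribute is missing
    rows.append({'alt': img.get('alt'), 'src': img.get('src')})

print(rows)
```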

In [48]:
# text of the first genre tag
soup.find('span', {'class': 'rubric__name'}).text
# all genre tags on the page
genres = soup.find_all('span', {'class': 'rubric__name'})

In [50]:
# extract the text from the genres
for genre in genres:
  print(genre.text)

Pop/R&B
Electronic
Rock
Pop/R&B
Rock
Rap
Electronic
Rock
Pop/R&B
Pop/R&B
Rock
Rap
Pop/R&B
Electronic
Electronic
Folk/Country
Rock
Folk/Country
Pop/R&B
Pop/R&B
Folk/Country
Rock
Rock
Rock
Experimental
Electronic
Electronic
Electronic
Electronic
Rock
Rap
Rock
Rock
Experimental
Electronic
Experimental
Folk/Country
Pop/R&B
Folk/Country
Electronic
Electronic
Rock
Rock
Rock
Pop/R&B
Pop/R&B
Electronic
Rap
Pop/R&B
Rock
Pop/R&B
Rock
Rock
Rock
Electronic
Rap
Rock
Rock
Experimental
Rock
Rock
Rock
Rock
Experimental
Electronic
Jazz
Rock
Pop/R&B
Pop/R&B
Pop/R&B
Rock
Electronic
Rap
Rock
Experimental
Rock
Pop/R&B
Rock
Metal
Rock
Electronic
Jazz
Rock
Rock
Electronic
Pop/R&B
Folk/Country
Rock
Folk/Country
Folk/Country
Rock
Rock
Electronic
Pop/R&B
Electronic
Pop/R&B
Rock
Rap
Rock
Rock
Pop/R&B
Pop/R&B
Rock
Rock
Experimental


In [57]:
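covers = []
genres = []
# loop over review pages 1 through 19
for i in range(1, 20):
  url = f'https://pitchfork.com/reviews/albums/?page={i}'
  r = requests.get(url)
  soup = BeautifulSoup(r.text, 'lxml')
  cover = soup.find_all('img', {'loading': 'eager'})
  genre = soup.find_all('span', {'class': 'rubric__name'})
  # each append adds one page's list of tags, so covers and genres are lists of lists
  covers.append(cover)
  genres.append(genre)

# number of cover images found on the first page
len(covers[0])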

96
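
The loop leaves one list of tags per page. A sketch of flattening such nested lists into a single table follows, using two tiny stand-in "pages" (made-up markup that mimics the tags used above) so it runs without network access. All names here are illustrative. On the live site the number of covers and genres per page may differ, so compare the lengths before combining them.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Two tiny stand-in pages (made-up markup mimicking the tags used above)
pages = [
    '<img loading="eager" src="a.jpg"/><span class="rubric__name">Rock</span>',
    '<img loading="eager" src="b.jpg"/><span class="rubric__name">Jazz</span>',
]

covers_by_page, genres_by_page = [], []
for page in pages:
    page_soup = BeautifulSoup(page, 'html.parser')
    covers_by_page.append(page_soup.find_all('img', {'loading': 'eager'}))
    genres_by_page.append(page_soup.find_all('span', {'class': 'rubric__name'}))

# Flatten the per-page lists and keep plain strings instead of Tag objects
cover_urls = [img.get('src') for imgs in covers_by_page for img in imgs]
genre_names = [span.text for spans in genres_by_page for span in spans]

# One row per review; this only works when both lists have the same length
reviews_df = pd.DataFrame({'cover': cover_urls, 'genre': genre_names})
print(reviews_df)
```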

#### Problem

Use the URL below for the NPR book review site. Make a request, turn the response into a soup object, and use the inspect tool to locate the title of each article on the page.

In [None]:
url = 'https://www.npr.org/sections/book-reviews/'

#### Problem

Head over to [Quotes to Scrape](https://quotes.toscrape.com/) and use `requests` and `BeautifulSoup` to extract and structure the quotes as a `DataFrame` similar to that below:

| quote | author | tags |
| ------ | --------- | ------- |
| The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking. | Albert Einstein | [change, deep-thoughts, thinking, world] |