### Extracting Data From HTML

**OBJECTIVES**


- Use `pd.read_html` to extract data from website tables
- Use `bs4` to parse HTML returned by `requests`

### Reading in Data from HTML Tables

Now we turn to one more way of accessing data. As we've seen, a data API typically returns `json` or `csv`. Alternatively, you may receive HTML, where the information is contained in tags. Below, we examine some basic HTML tags and their effects.

```html
<h1>A Heading</h1>
<p>A first paragraph</p>
<p>A second paragraph</p>
<table>
  <tr>
    <th>Album</th>
    <th>Rating</th>
  </tr>
  <tr>
    <td>Pink Panther</td>
    <td>10</td>
  </tr>
</table>
```

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import requests

In [2]:
html = '''
<h1>A Heading</h1>
<p>A first paragraph</p>
<p>A second paragraph</p>
<table>
  <tr>
    <th>Album</th>
    <th>Rating</th>
  </tr>
  <tr>
    <td>Pink Panther</td>
    <td>10</td>
  </tr>
</table>
'''

In [3]:
from IPython.display import HTML

In [4]:
HTML(html)

Album,Rating
Pink Panther,10


### Making a request of a url

Let's begin with some basketball information from basketball-reference.com:

- https://www.basketball-reference.com/wnba

The tables on the page will be picked up (hopefully!) by the `read_html` function in pandas.
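To see what `read_html` returns without depending on a live website, here is a minimal sketch that applies it to the small HTML string from earlier. It assumes an HTML parser that pandas can use (such as `lxml`) is installed, and it wraps the string in `StringIO` because recent pandas versions deprecate passing literal HTML text directly.

```python
from io import StringIO

import pandas as pd

# Same toy document as earlier: a heading, two paragraphs, and one table.
html = '''
<h1>A Heading</h1>
<p>A first paragraph</p>
<p>A second paragraph</p>
<table>
  <tr>
    <th>Album</th>
    <th>Rating</th>
  </tr>
  <tr>
    <td>Pink Panther</td>
    <td>10</td>
  </tr>
</table>
'''

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))
print(len(tables))

# The heading and paragraphs are ignored; only the table is extracted.
album_ratings = tables[0]
print(album_ratings)
```

Because the result is always a list, even a page with a single table needs indexing (`tables[0]`) to get at the DataFrame.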

In [5]:
#visit the url below
url = 'https://www.basketball-reference.com/wnba'

In [6]:
#read the tables with read_html
#assign the results as wnba
wnba = pd.read_html(url)

In [7]:
#what kind of object is wnba?
type(wnba)

list

In [8]:
#first element?
wnba[0]

Unnamed: 0,Team,W,L,W/L%,GB
0,New York Liberty*,32,8,0.8,—
1,Minnesota Lynx*,30,10,0.75,2.0
2,Connecticut Sun*,28,12,0.7,4.0
3,Las Vegas Aces*,27,13,0.675,5.0
4,Seattle Storm*,25,15,0.625,7.0
5,Indiana Fever*,20,20,0.5,12.0
6,Phoenix Mercury*,19,21,0.475,13.0
7,Atlanta Dream*,15,25,0.375,17.0
8,Washington Mystics,14,26,0.35,18.0
9,Chicago Sky,13,27,0.325,19.0


In [9]:
#examine information
wnba[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Team    12 non-null     object 
 1   W       12 non-null     int64  
 2   L       12 non-null     int64  
 3   W/L%    12 non-null     float64
 4   GB      12 non-null     object 
dtypes: float64(1), int64(2), object(2)
memory usage: 608.0+ bytes
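The `info()` output shows `GB` stored as `object`, and the team names carry a trailing `*`. A minimal sketch of one way to tidy both is below, using a few rows copied from the standings output. Treating the leader's `—` as `0` games back is an assumption made for illustration, not something the page states.

```python
import pandas as pd

# A few rows copied from the standings output above.
standings = pd.DataFrame({
    'Team': ['New York Liberty*', 'Minnesota Lynx*', 'Washington Mystics'],
    'W': [32, 30, 14],
    'L': [8, 10, 26],
    'GB': ['—', '2.0', '18.0'],
})

cleaned = standings.copy()

# Drop the trailing asterisk marker from team names.
cleaned['Team'] = cleaned['Team'].str.rstrip('*')

# Non-numeric entries such as the dash become NaN, then 0.0 (an assumption).
cleaned['GB'] = pd.to_numeric(cleaned['GB'], errors='coerce').fillna(0.0)

print(cleaned)
print(cleaned.dtypes)
```

Working on a copy keeps the scraped table intact, which helps when a cleaning step needs to be revisited.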


In [10]:
#last dataframe?
wnba[-1]

Unnamed: 0.1,Unnamed: 0,PTS,TRB,AST,GmSc
0,A'ja Wilson (LVA),24,7,4,20.5
1,Sabrina Ionescu (NYL),24,9,5,19.7
2,Alyssa Thomas (CON),18,10,7,19.3
3,DeWanna Bonner (CON),17,6,3,16.0
4,Alanna Smith (MIN),15,6,2,15.7


**Example 2**

List of best-selling albums from Wikipedia.

- https://en.wikipedia.org/wiki/List_of_best-selling_albums

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_best-selling_albums'

In [None]:
#read in the tables


In [None]:
#how many tables?


In [None]:
#look at the fourth table


In [None]:
#try to convert sales to float


In [None]:
#replace and coerce as float
# fourth_table['Claimed sales*'] = fourth_table['Claimed sales*'].replace({'20[disputed – discuss]': 20}).astype('float')

In [None]:
#alternative with string method
#fourth_table['Claimed sales*'].str.replace('[disputed – discuss]', '', regex = False)
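The two commented approaches above each target one specific annotation. A more general sketch strips any bracketed note with a regular expression and then converts with `pd.to_numeric`. The Series below is made up for illustration (only the `[disputed – discuss]` string comes from the cell above), so the real column may need further handling.

```python
import pandas as pd

# Illustrative values only; the real table's contents may differ.
sales = pd.Series(['20[disputed – discuss]', '45', '30.5'], name='Claimed sales*')

# Remove anything inside square brackets, such as footnote-style annotations.
no_notes = sales.str.replace(r'\[.*?\]', '', regex=True)

# Convert to numbers; anything still non-numeric becomes NaN instead of raising.
as_float = pd.to_numeric(no_notes.str.strip(), errors='coerce')

print(as_float)
```

Using `errors='coerce'` trades an exception for `NaN` values, so it is worth checking `as_float.isna().sum()` afterwards to see how many entries failed to convert.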

### Scraping the Web for Data

Sometimes the data is not formatted as an `html` table or `pd.read_html` simply doesn't work.  In these situations you can use the `bs4` library and its `BeautifulSoup` object to parse HTML tags and extract information.  First, make sure you have the library installed and can import it below.

In [None]:
# pip install -U bs4

In [11]:
from bs4 import BeautifulSoup
import requests

In [12]:
sample_html = '''
<h1>Music Reviews</h1>
<p>This album was awful. <strong>Score</strong>: <i class = "score">2</i></p>
<p class = "good">This album was great. <strong>Score</strong>: <i class = "score">8</i></p>
'''

In [13]:
# create a soup object
soup = BeautifulSoup(sample_html)

In [14]:
# examine the soup
soup

<html><body><h1>Music Reviews</h1>
<p>This album was awful. <strong>Score</strong>: <i class="score">2</i></p>
<p class="good">This album was great. <strong>Score</strong>: <i class="score">8</i></p>
</body></html>

In [15]:
# find the first <p> tag
soup.find('p')

<p>This album was awful. <strong>Score</strong>: <i class="score">2</i></p>

In [16]:
# find the first <i> tag
soup.find('i')

<i class="score">2</i>

In [17]:
# find all the i tags
soup.find_all('i')

[<i class="score">2</i>, <i class="score">8</i>]

In [18]:
# find the first paragraph with class "good" (use find_all to get every match)
soup.find('p', {'class': 'good'})

<p class="good">This album was great. <strong>Score</strong>: <i class="score">8</i></p>
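Putting these calls together, here is a minimal sketch that turns the sample reviews into a list of records. It names the parser explicitly (`'html.parser'`, which ships with Python) so that `BeautifulSoup` does not have to guess one; the record keys `text`, `score`, and `good` are names chosen for this example.

```python
from bs4 import BeautifulSoup

sample_html = '''
<h1>Music Reviews</h1>
<p>This album was awful. <strong>Score</strong>: <i class = "score">2</i></p>
<p class = "good">This album was great. <strong>Score</strong>: <i class = "score">8</i></p>
'''

soup = BeautifulSoup(sample_html, 'html.parser')

# find_all returns every match; class_ filters on the class attribute.
good_paragraphs = soup.find_all('p', class_='good')

rows = []
for p in soup.find_all('p'):
    # Search inside this paragraph only for its score tag.
    score_tag = p.find('i', class_='score')
    rows.append({
        'text': p.get_text(),
        'score': int(score_tag.get_text()),
        # The class attribute is a list of names; it is absent on the first <p>.
        'good': 'good' in p.get('class', []),
    })

print(rows)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)` for further analysis.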

#### Extracting Data from a URL

1. Make a request.
2. Turn the response into soup!

In [19]:
url = 'https://pitchfork.com/reviews/albums/'

In [20]:
#make a request
r = requests.get(url)

In [21]:
#examine the text
r.text[:1000]

'<!DOCTYPE html><html lang="en-US"><head><title>New Albums &amp; Music Reviews | Pitchfork</title><meta charSet="utf-8"/><meta content="IE=edge" http-equiv="X-UA-Compatible"/><meta name="msapplication-tap-highlight" content="no"/><meta name="viewport" content="width=device-width, initial-scale=1"/><meta name="author" content="Condé Nast"/><meta name="copyright" content="Copyright (c) Condé Nast 2024"/><meta name="description" content="Daily reviews of every important album in music"/><meta name="id" content="65ce02a52126d093a5f585e1"/><meta name="keywords" content="web"/><meta name="news_keywords" content="web"/><meta name="robots" content="index, follow, max-image-preview:large"/><meta name="content-type" content="bundle"/><meta name="parsely-post-id" content="65ce02a52126d093a5f585e1"/><meta name="parsely-metadata" content="{&quot;description&quot;:&quot;Daily reviews of every important album in music&quot;,&quot;image-16-9&quot;:&quot;https://media.pitchfork.com/photos/5935a027a28a0

In [22]:
#turn it into soup!
soup = BeautifulSoup(r.text)
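The request-then-parse steps can be wrapped in small helpers. The sketch below is one possible arrangement, not the only one: the function names, the timeout, and the `User-Agent` string are all illustrative choices. It also checks the HTTP status before parsing, since a blocked or failed request would otherwise produce soup from an error page.

```python
import requests
from bs4 import BeautifulSoup


def soup_from_text(html_text):
    """Parse HTML text with Python's built-in parser."""
    return BeautifulSoup(html_text, 'html.parser')


def fetch_soup(url, timeout=10):
    """Request a page and return it as soup, raising on HTTP errors."""
    # Illustrative header value; some sites respond differently without one.
    headers = {'User-Agent': 'Mozilla/5.0 (course notebook example)'}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return soup_from_text(response.text)


# Usage (requires network access):
# soup = fetch_soup('https://pitchfork.com/reviews/albums/')
# print(soup.title.get_text())
```

Keeping the parsing step separate from the request makes it easy to test the extraction logic on saved HTML without hitting the site repeatedly.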

### Using Inspect

You can inspect an item's HTML code by right-clicking on the item of interest and selecting **inspect**.  Here, you will see the HTML tags that surround the object of interest.

For example, when this lesson was written, a recent album review on Pitchfork was *Mustafa: Dunya*.  Right-clicking on the image of the album cover and choosing inspect showed:

![](images/pitch_cover.png)

In [24]:
#find the img tag
dunya = soup.find('img', {'alt': 'Dunya'})

In [26]:
dunya.attrs['src']

'https://media.pitchfork.com/photos/668fec739c03086dcec412d6/1:1/w_1600%2Cc_limit/Mustafa-Dunya.jpg'

In [28]:
#find all img tags
images = soup.find_all('img')

In [29]:
#explore attributes
images[0].attrs

{'alt': 'Pitchfork',
 'class': ['ResponsiveImageContainer-eybHBd',
  'fptoWY',
  'responsive-image__image'],
 'src': '/verso/static/pitchfork/assets/logo-inverted.svg',
 'srcset': '',
 'sizes': '100vw'}

In [30]:
#extract source of image url
[img.attrs['src'] for img in images]

['/verso/static/pitchfork/assets/logo-inverted.svg',
 '/verso/static/pitchfork/assets/logo-header.svg',
 'https://media.pitchfork.com/photos/66a3cf7aeca3501f5dc9b121/1:1/w_1600%2Cc_limit/Being%2520Dead-%2520EELS.jpg',
 'https://media.pitchfork.com/photos/66fc0c553dcae43f31bfd01c/1:1/w_1600%2Cc_limit/2300%2520-%2520Bully%2520Tape.jpeg',
 'https://media.pitchfork.com/photos/66f2da330eece3c05910cb10/1:1/w_1600%2Cc_limit/Raphael%2520Raginski%2520-%2520Plays%2520John%2520Coltrane%2520and%2520Langston%2520Hughes.jpeg',
 'https://media.pitchfork.com/photos/668fec739c03086dcec412d6/1:1/w_1600%2Cc_limit/Mustafa-Dunya.jpg',
 'https://media.pitchfork.com/photos/66e07055506fec54a6686125/1:1/w_1600%2Cc_limit/Adeline-Hotel-Whodunnit.jpg',
 'https://media.pitchfork.com/photos/66ed8ef4a29561bba8d0bd0f/1:1/w_1600%2Cc_limit/Tommy%2520Richman%2520-%2520Coyote.jpg',
 'https://media.pitchfork.com/photos/66ed9384d74ab9c23d17f237/1:1/w_1600%2Cc_limit/Merce%2520Lemon%2520-%2520Watch%2520Me%2520Drive%2520Them%
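The list above mixes relative paths (the logo files) with full URLs. A minimal sketch of normalizing them with `urllib.parse.urljoin` follows, using two `src` values copied from the output. The third `<img>` tag, which has no `src`, is a hypothetical case added to show why `img.get('src')` can be safer than `img.attrs['src']`, which raises a `KeyError` when the attribute is missing.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

page_url = 'https://pitchfork.com/reviews/albums/'

# Two src values from the output above, plus a hypothetical tag with no src.
img_html = '''
<img alt="Pitchfork" src="/verso/static/pitchfork/assets/logo-inverted.svg">
<img alt="Dunya" src="https://media.pitchfork.com/photos/668fec739c03086dcec412d6/1:1/w_1600%2Cc_limit/Mustafa-Dunya.jpg">
<img alt="no source here">
'''

soup = BeautifulSoup(img_html, 'html.parser')

# .get returns None instead of raising when the attribute is absent.
sources = [img.get('src') for img in soup.find_all('img')]

# Resolve relative paths against the page URL; full URLs pass through unchanged.
absolute = [urljoin(page_url, src) for src in sources if src]

print(absolute)
```

The same guard applies to `soup.find(...)`: it returns `None` when nothing matches, so checking the result before reading `.attrs` avoids an `AttributeError` once the page content changes.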

In [None]:
# extract the genre tags


In [None]:
# extract the text from the genres


**PROBLEM**

Use the URL below for the NPR book review site.  Make a request, turn the response into a soup object, and use the inspect tool to locate the title of each article on the page.

In [None]:
url = 'https://www.npr.org/sections/book-reviews/'

#### Summary

There are many ways you may get data: a file that somebody shares with you, data obtained through an API, data obtained through scraping and crawling websites, and other sources such as a database that you connect to.  Now that you've got some basics with accessing, cleaning, munging, and visualizing data, it's time to explore a dataset and ask your own questions.