# 🌐 Scraping, Part 5: More practice

*Getting the hang of nested elements.*

## Let's try something just a little more complicated

Let's examine the posts on Beautiful Public Data's homepage: https://www.beautifulpublicdata.com/

First, let's pop open the element inspector to get a sense of how the page is structured and what information we might want to extract.

### __Exercise__: What are some CSS selectors, of varying specificity, we could use to select all of those posts?

`article` is *probably* enough. But you could also write:

- `.post`
- `main article`
- `.posts article.post`
- ... or many other permutations!
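
To see how selectors of different specificity behave, here is a minimal sketch run against a small hand-written HTML fragment. The markup is made up for illustration and only loosely imitates the class names above; it is not the site's real HTML. Note how the broadest selector also picks up an `article` that isn't a post.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on the class names discussed above
sample_html = """
<main class="posts">
  <article class="post"><h2>First</h2></article>
  <article class="post"><h2>Second</h2></article>
</main>
<footer><article class="promo"><h2>Not a post</h2></article></footer>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

selectors = ["article", ".post", "main article", ".posts article.post"]

# Count how many elements each selector matches
match_counts = {sel: len(sample_soup.select(sel)) for sel in selectors}

for sel, count in match_counts.items():
    print(f"{sel!r}: {count}")
```

On a real page, the counts tell you whether a short selector is "enough" or whether it is sweeping in elements you didn't intend.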

Let's try these out. First, we'll load the HTML and convert it to a Python-accessible DOM:

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
html = requests.get("https://www.beautifulpublicdata.com/").text
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
soup = BeautifulSoup(html, "html.parser")

Then, let's try using `.select(...)` and `len(...)` to count the number of matching elements:

In [3]:
len(soup.select("article"))

25

In [4]:
len(soup.select("main article"))

25

In [5]:
len(soup.select(".post"))

25

Finally, let's assign the list of posts to a variable:

In [6]:
post_elements = soup.select("article")

Next, let's grab the headline for each post.

## Q: How would you do this?

In [7]:
for el in post_elements:
    hed = el.select("h2")[0]
    print(hed.text)


                    Aerial Glacier Photographs
                

                    The Naughty Words the FAA Removed From the Sky
                

                    Trademark Design Codes
                

                    FAA Aviation Maps
                

                    Government Comic Books
                

                    Wild Horses
                

                    The Army and Navy Style Guides
                

                    Nuclear Weapon Test Films
                

                    All of the 8,331 License Plates in America
                

                    Mapping Volcano Eruptions With Drones
                

                    Here’s All the Rocks We Hauled Back From the Moon
                

                    1,000 Photos of Dolphin Fins
                

                    The Mirror Fusion Test Facility
                

                    Special Database 18: 3,248 Mugshots Used for Training Image Recognition Systems
      

Let's use the `.strip()` method on each text string to strip out the extra whitespace:

In [8]:
for el in post_elements:
    hed = el.select("h2")[0]
    print(hed.text.strip())

Aerial Glacier Photographs
The Naughty Words the FAA Removed From the Sky
Trademark Design Codes
FAA Aviation Maps
Government Comic Books
Wild Horses
The Army and Navy Style Guides
Nuclear Weapon Test Films
All of the 8,331 License Plates in America
Mapping Volcano Eruptions With Drones
Here’s All the Rocks We Hauled Back From the Moon
1,000 Photos of Dolphin Fins
The Mirror Fusion Test Facility
Special Database 18: 3,248 Mugshots Used for Training Image Recognition Systems
Pilot Manual for a 1940's U.S. Navy Blimp
The United States Frequency Allocation Chart
The Style Guide for America’s Highways: The Manual on Uniform Traffic Control Devices
Mapping the Sea Floor
Vehicle Crash Test Films from the 1970's and 1980s
Visualizing Rivers and Floodplains with USGS Data
A Rover's First 590 Days* on Mars
Utah Highway LiDAR Scans
The GOES-16 Weather Satellite
The Pillbox Database
Photologging Vans
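
One caveat: `select(...)[0]` raises an `IndexError` when nothing matches. A slightly more defensive variant, sketched here on made-up markup, uses Beautiful Soup's `select_one(...)`, which returns `None` when there is no match, and `get_text(strip=True)`, which trims the whitespace in the same step.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second article has no <h2>
sample_html = """
<article><h2>
    Aerial Glacier Photographs
</h2></article>
<article><p>No headline here</p></article>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

headlines = []
for article in sample_soup.select("article"):
    hed = article.select_one("h2")  # None when there is no match
    headlines.append(hed.get_text(strip=True) if hed is not None else None)

print(headlines)  # ['Aerial Glacier Photographs', None]
```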


## Q: How would you get the element representing the *first* post on the page?

In [9]:
first_post = post_elements[0]
first_post

<article class="post-card post tag-usgs tag-nsf tag-glaciers tag-nps featured post-card-large">
<a class="post-card-image-link" href="/aerial-glacier-photographs/">
<img alt="Aerial Glacier Photographs" class="post-card-image" loading="lazy" sizes="(max-width: 1000px) 400px, 800px" src="/content/images/size/w600/2024/06/quilt_4-1-1.png" srcset="/content/images/size/w300/2024/06/quilt_4-1-1.png 300w,
                    /content/images/size/w600/2024/06/quilt_4-1-1.png 600w,
                    /content/images/size/w1000/2024/06/quilt_4-1-1.png 1000w,
                    /content/images/size/w2000/2024/06/quilt_4-1-1.png 2000w"/>
</a>
<div class="post-card-content">
<a class="post-card-content-link" href="/aerial-glacier-photographs/">
<header class="post-card-header">
<div class="post-card-tags">
<span class="post-card-primary-tag">USGS</span>
<span class="post-card-featured"><svg fill="none" height="17" viewbox="0 0 16 17" width="16" xmlns="http://www.w3.org/2000/svg">
<path d="M4.493

## Exercise: How would you get the main topic of a post?

![](../images/bpd-topic-1.png)

![](../images/bpd-topic-2.png)

![](../images/bpd-topic-3.png)

In [10]:
first_post.select(".post-card-primary-tag")

[<span class="post-card-primary-tag">USGS</span>]

In [11]:
first_post.select(".post-card-primary-tag")[0].text

'USGS'

In [12]:
# Now with the second element
post_elements[1].select(".post-card-primary-tag")[0].text

'FAA'

In [13]:
# ... and the third
post_elements[2].select(".post-card-primary-tag")[0].text

'USPTO'

## Exercise: How would you get each post's date?

In [14]:
first_post.select("time")[0].text

'Jun 18, 2024'

In [15]:
# Alternatively:
first_post.select("time")[0]["datetime"]

'2024-06-18'
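
The `datetime` attribute is usually the easier of the two to work with, since it is already in ISO format. As a rough illustration using only the standard library, both strings shown above can be turned into the same `date` object:

```python
from datetime import date, datetime

display_text = "Jun 18, 2024"  # the visible text of the <time> element
iso_text = "2024-06-18"        # its datetime attribute

# ISO dates parse directly
from_iso = date.fromisoformat(iso_text)

# The display text needs an explicit format
# (%b assumes English month abbreviations in the current locale)
from_display = datetime.strptime(display_text, "%b %d, %Y").date()

print(from_iso == from_display)  # True
```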

## Exercise: Put these bits of data in a DataFrame

Let's put everything we have so far — title, topic, date — into a `pandas` `DataFrame`.

In [16]:
import pandas as pd

In [17]:
posts_df = pd.DataFrame([{
    "hed": el.select("h2")[0].text.strip(),
    "topic": el.select(".post-card-primary-tag")[0].text,
    "date": el.select("time")[0]["datetime"],
} for el in post_elements ])

posts_df

| | hed | topic | date |
|---|---|---|---|
| 0 | Aerial Glacier Photographs | USGS | 2024-06-18 |
| 1 | The Naughty Words the FAA Removed From the Sky | FAA | 2024-05-28 |
| 2 | Trademark Design Codes | USPTO | 2024-04-09 |
| 3 | FAA Aviation Maps | FAA | 2024-01-29 |
| 4 | Government Comic Books | ARMY | 2023-11-27 |
| 5 | Wild Horses | BLM | 2023-10-17 |
| 6 | The Army and Navy Style Guides | ARMY | 2023-10-04 |
| 7 | Nuclear Weapon Test Films | LLNL | 2023-09-21 |
| 8 | All of the 8,331 License Plates in America | Cars | 2023-08-21 |
| 9 | Mapping Volcano Eruptions With Drones | USGS | 2023-05-23 |
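
The `date` column above holds plain strings. If you want to sort or filter by date, one option is `pd.to_datetime`. A small sketch, using three rows copied from the table above (the name `sample_df` is just for this example):

```python
import pandas as pd

sample_df = pd.DataFrame([
    {"hed": "Aerial Glacier Photographs", "topic": "USGS", "date": "2024-06-18"},
    {"hed": "The Naughty Words the FAA Removed From the Sky", "topic": "FAA", "date": "2024-05-28"},
    {"hed": "Trademark Design Codes", "topic": "USPTO", "date": "2024-04-09"},
])

# Convert the ISO date strings into real timestamps
sample_df["date"] = pd.to_datetime(sample_df["date"])

# Date-aware operations now work, e.g. finding the oldest post
oldest = sample_df.sort_values("date").iloc[0]
print(oldest["hed"], oldest["date"].date())
```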


## Q: What are the most common topics?

In [18]:
posts_df["topic"].value_counts()

topic
USGS         3
FAA          3
ARMY         2
Photologs    2
NTIA         1
Weather      1
Space        1
LIDAR        1
NHTSA        1
FHWA         1
DOE          1
NIST         1
NOAA         1
NASA         1
Cars         1
LLNL         1
BLM          1
USPTO        1
Pillbox      1
Name: count, dtype: int64
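
`value_counts()` is essentially tallying how often each value appears. The same idea in plain Python, sketched with `collections.Counter` on the topics of only the ten posts shown in the table above (so the totals differ from the full 25-post result):

```python
from collections import Counter

# Topics of the ten posts displayed in the table above
topics = ["USGS", "FAA", "USPTO", "FAA", "ARMY", "BLM", "ARMY", "LLNL", "Cars", "USGS"]

topic_counts = Counter(topics)

# Print the tally, most frequent first
for topic, count in topic_counts.most_common():
    print(f"{topic:<8}{count}")
```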

## __Exercise__: Add a column indicating the post length (in minutes)

... and:

- Find the longest post(s)
- Calculate how long it would take to read all the posts

In [19]:
# `el` still refers to the last post from the earlier for-loop
length_text = el.select(".post-card-meta-length")[0].text
length_text

'3 min read'

In [20]:
length_text.split(" ")

['3', 'min', 'read']

In [21]:
length_text.split(" ")[0]

'3'

In [22]:
int(length_text.split(" ")[0])

3
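
Splitting on spaces works as long as every label looks exactly like `'3 min read'`; a label with leading whitespace, for instance, would make `int(...)` fail on an empty string. A slightly more forgiving sketch uses a regular expression to pull out the first run of digits. The helper name `parse_minutes` is made up for this example.

```python
import re


def parse_minutes(length_text):
    """Return the first number in a label like '3 min read', or None if there is none."""
    match = re.search(r"\d+", length_text)
    return int(match.group()) if match else None


print(parse_minutes("3 min read"))      # 3
print(parse_minutes("  10 min read "))  # 10
print(parse_minutes("Quick read"))      # None
```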

In [23]:
posts_df = pd.DataFrame([{
    "hed": el.select("h2")[0].text.strip(),
    "topic": el.select(".post-card-primary-tag")[0].text,
    "date": el.select("time")[0]["datetime"],
    "minutes": int(el.select(".post-card-meta-length")[0].text.split(" ")[0])
} for el in post_elements ])

posts_df

| | hed | topic | date | minutes |
|---|---|---|---|---|
| 0 | Aerial Glacier Photographs | USGS | 2024-06-18 | 6 |
| 1 | The Naughty Words the FAA Removed From the Sky | FAA | 2024-05-28 | 8 |
| 2 | Trademark Design Codes | USPTO | 2024-04-09 | 7 |
| 3 | FAA Aviation Maps | FAA | 2024-01-29 | 9 |
| 4 | Government Comic Books | ARMY | 2023-11-27 | 7 |
| 5 | Wild Horses | BLM | 2023-10-17 | 5 |
| 6 | The Army and Navy Style Guides | ARMY | 2023-10-04 | 8 |
| 7 | Nuclear Weapon Test Films | LLNL | 2023-09-21 | 5 |
| 8 | All of the 8,331 License Plates in America | Cars | 2023-08-21 | 10 |
| 9 | Mapping Volcano Eruptions With Drones | USGS | 2023-05-23 | 7 |


## Q: How can we get the longest articles?

In [24]:
posts_df.nlargest(1, "minutes", keep="all")

| | hed | topic | date | minutes |
|---|---|---|---|---|
| 8 | All of the 8,331 License Plates in America | Cars | 2023-08-21 | 10 |
| 13 | Special Database 18: 3,248 Mugshots Used for T... | NIST | 2023-03-18 | 10 |
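
`keep="all"` is what lets `nlargest(1, ...)` return more than one row: every row tied for the top value is kept. A small sketch of the difference, using three of the posts and their lengths from the outputs above:

```python
import pandas as pd

sample_df = pd.DataFrame({
    "hed": [
        "Wild Horses",
        "All of the 8,331 License Plates in America",
        "Special Database 18: 3,248 Mugshots Used for Training Image Recognition Systems",
    ],
    "minutes": [5, 10, 10],
})

# Default (keep="first"): exactly one row, even when there is a tie
default_top = sample_df.nlargest(1, "minutes")

# keep="all": every row tied for the largest value
all_top = sample_df.nlargest(1, "minutes", keep="all")

print(len(default_top), len(all_top))  # 1 2
```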


## Q: How can we calculate the total length?

In [25]:
posts_df["minutes"].sum()

159

In [26]:
posts_df["minutes"].sum() / 60

2.65
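
A decimal number of hours is a little awkward to read. As a small add-on, `divmod` splits the total into whole hours and leftover minutes:

```python
total_minutes = 159  # the sum computed above

# divmod returns (quotient, remainder) in one step
hours, minutes = divmod(total_minutes, 60)

print(f"{hours} h {minutes} min")  # 2 h 39 min
```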

---
