# 🌐 Scraping, Part 5: More practice

*Getting the hang of nested elements.*

## Let's try something just a little more complicated

Let's examine the posts on FlowingData's homepage: https://flowingdata.com/

First, let's pop open the element inspector. Let's get a sense of how the page is structured, and what information we might want to extract.

At a high level, we're probably most interested in the `<div id="recent-posts" ...>` element. More specifically, there's a `<ul>` (which stands for unordered list) element. Each `<li class="archive-post">` within it seems to represent a post.

### __Exercise__: What are some CSS selectors, of varying specificity, we could use to select all of those posts?

`.archive-post` is *probably* enough. But you could also write:

- `#recent-posts > ul > li`
- `#recent-posts .archive-post`
- `#recent-posts li.archive-post`

Let's try these out. First, we'll load the HTML and convert it to a Python-accessible DOM:

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
fd_html = requests.get("https://flowingdata.com/").text
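# Name the parser explicitly so the result doesn't depend on which parsers are installed
fd_soup = BeautifulSoup(fd_html, "html.parser")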

Now let's collect our proposed selectors in a list, so we can run each one and compare the results:

In [3]:
selectors_to_try = [
    ".archive-post",
    "#recent-posts > ul > li",
    "#recent-posts .archive-post",
    "#recent-posts li.archive-post"
]
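The list above only names the selectors. One way to compare them is to count how many elements each one matches. Here's a minimal sketch that runs against a small stand-in document; `sample_html`, `sample_soup`, and `match_counts` are hypothetical names, and the markup is only modeled on the structure described earlier, not copied from the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the homepage markup, modeled on the structure
# described above. The real page is larger and may change over time.
sample_html = """
<div id="recent-posts">
  <ul>
    <li class="archive-post"><h1><a href="#">First post</a></h1></li>
    <li class="archive-post"><h1><a href="#">Second post</a></h1></li>
  </ul>
</div>
<div id="sidebar">
  <ul><li>Not a post</li></ul>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

selectors_to_try = [
    ".archive-post",
    "#recent-posts > ul > li",
    "#recent-posts .archive-post",
    "#recent-posts li.archive-post",
]

# Count how many elements each selector matches
match_counts = {
    selector: len(sample_soup.select(selector))
    for selector in selectors_to_try
}

for selector, count in match_counts.items():
    print(f"{selector!r} matched {count} elements")
```

On this stand-in, all four selectors match the same two `<li>` elements. To try it on the real page, swap `sample_soup` for `fd_soup`; if the counts differ there, the mismatch tells you which selector is too broad or too narrow.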

Next, let's grab the headline for each post.

## __Exercise__: How would you do this?

In [4]:
for post_el in fd_soup.select(".archive-post"):
    hed = post_el.select("h1 a")[0]
    print(hed.text)


Visualization Tools and Learning Resources, June 2023 Roundup 

Astericking NBA champions 

Noise and health 

Password game requires more ridiculous rules as you play 

A year of flight paths, for someone with an unlimited pass 

Map of electric grid required for cleaner energy 

Crochet lake map 

Chart Practice: Branch Out Beyond the Visual Bits 

An interactive guide to color and contrast 

Switching from Python to R 

Friend simulation system, with ChatGPT 

To make electric vehicle batteries, China must be involved 

Where people are moving in the U.S. 

Chart Practice: Changing the Audience 

Life timeline in a spreadsheet 

Objectiveness distributions 

Using gaps in location data to track illegal fishing 

Fake location signals from oil tankers avoiding oversight 

Generative AI exaggerates stereotypes 

Smoke from Canada wildfires over the U.S. 


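Each headline's text includes leading and trailing whitespace, which is why blank lines appear in the output above. Let's use the `.strip()` method on each text string to strip out the extra whitespace: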

In [5]:
for post_el in fd_soup.select(".archive-post"):
    hed = post_el.select("h1 a")[0]
    print(hed.text.strip())

Visualization Tools and Learning Resources, June 2023 Roundup
Astericking NBA champions
Noise and health
Password game requires more ridiculous rules as you play
A year of flight paths, for someone with an unlimited pass
Map of electric grid required for cleaner energy
Crochet lake map
Chart Practice: Branch Out Beyond the Visual Bits
An interactive guide to color and contrast
Switching from Python to R
Friend simulation system, with ChatGPT
To make electric vehicle batteries, China must be involved
Where people are moving in the U.S.
Chart Practice: Changing the Audience
Life timeline in a spreadsheet
Objectiveness distributions
Using gaps in location data to track illegal fishing
Fake location signals from oil tankers avoiding oversight
Generative AI exaggerates stereotypes
Smoke from Canada wildfires over the U.S.


## __Exercise__: How would you get the date of a post? And the topic?

In [6]:
first_post = fd_soup.select(".archive-post")[0]
first_post

<li class="archive-post">
<div>
<div class="note-wrapper nine columns offset-by-two alpha"><div class="members-note">Members Only</div></div>
<div class="nine columns offset-by-two alpha">
<h1>
<a href="https://flowingdata.com/2023/06/29/process-245-roundup/" rel="bookmark">
Visualization Tools and Learning Resources, June 2023 Roundup </a>
</h1>
</div>
<div class="clr"></div>
<div class="byinfo two columns alpha">
<a href="https://flowingdata.com/2023/06/29/process-245-roundup/">June 29, 2023</a>
<div style="margin-top:1.5rem">
<h3 class="toplevel">Topic</h3>
<strong><a href="https://flowingdata.com/category/the-process/" rel="category tag">The Process</a></strong>  /  <a href="https://flowingdata.com/tag/roundup/" rel="tag">roundup</a> </div>
</div>
<div class="nine columns omega" id="entry-content-wrapper">
<div class="entry">
<div class="archive-featured-image">
<a href="https://flowingdata.com/2023/06/29/process-245-roundup/">
<img alt="" class="attachment-medium size-medium wp-po

In [7]:
first_post.select(".byinfo a")[0].text

'June 29, 2023'

In [8]:
first_post.select(".byinfo strong a")[0].text

'The Process'
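Indexing with `[0]` raises an `IndexError` when a selector matches nothing, which can happen if one post is missing a piece the others have. As a hedged alternative, BeautifulSoup's `select_one` returns the first match or `None`. The helper `first_text` and the sample markup below are hypothetical, written only to illustrate the idea.

```python
from bs4 import BeautifulSoup

def first_text(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    match = parent.select_one(selector)
    return match.text.strip() if match is not None else None

# Hypothetical post markup, loosely modeled on the structure shown above
sample_post_html = """
<li class="archive-post">
  <h1><a href="#"> A sample headline </a></h1>
  <div class="byinfo">
    <a href="#">June 29, 2023</a>
    <strong><a href="#">The Process</a></strong>
  </div>
</li>
"""
sample_post = BeautifulSoup(sample_post_html, "html.parser").select_one(".archive-post")

sample_hed = first_text(sample_post, "h1 a")
sample_date = first_text(sample_post, ".byinfo a")
sample_topic = first_text(sample_post, ".byinfo strong a")
sample_missing = first_text(sample_post, ".byinfo em")  # no such element here

print(sample_hed, sample_date, sample_topic, sample_missing)
```

The trade-off is that a missing element shows up as `None` in your data instead of stopping the scrape, so it's worth checking for empty values afterward.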

Now let's get those for each post, and put everything we have so far into a `pandas` `DataFrame`.

In [9]:
import pandas as pd

In [10]:
fd_posts = pd.DataFrame([{
    "hed": post_el.select("h1 a")[0].text.strip(),
    "date": post_el.select(".byinfo a")[0].text,
    "topic": post_el.select(".byinfo strong a")[0].text,
} for post_el in fd_soup.select(".archive-post") ])

fd_posts

| | hed | date | topic |
|---|---|---|---|
| 0 | Visualization Tools and Learning Resources, Ju... | June 29, 2023 | The Process |
| 1 | Astericking NBA champions | June 29, 2023 | Statistics |
| 2 | Noise and health | June 28, 2023 | Infographics |
| 3 | Password game requires more ridiculous rules a... | June 27, 2023 | Infographics |
| 4 | A year of flight paths, for someone with an un... | June 27, 2023 | Maps |
| 5 | Map of electric grid required for cleaner energy | June 26, 2023 | Maps |
| 6 | Crochet lake map | June 23, 2023 | Maps |
| 7 | Chart Practice: Branch Out Beyond the Visual Bits | June 22, 2023 | The Process |
| 8 | An interactive guide to color and contrast | June 22, 2023 | Design |
| 9 | Switching from Python to R | June 21, 2023 | Coding |
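The `date` values are still plain strings like `"June 29, 2023"`. If you want to sort or filter by date, one option is to parse them with the standard library. This is a small sketch, assuming every date follows the "Month day, year" pattern seen above and that Python is running with English month names; `date_strings` and `parsed_dates` are hypothetical names.

```python
from datetime import datetime

# A few date strings copied from the scraped output above
date_strings = ["June 29, 2023", "June 28, 2023", "June 21, 2023"]

# %B = full month name, %d = day of month, %Y = four-digit year
parsed_dates = [
    datetime.strptime(date_string, "%B %d, %Y").date()
    for date_string in date_strings
]

print(parsed_dates[0].isoformat())
```

`strptime` raises a `ValueError` on any string that doesn't fit the pattern, which is a useful early warning if the site changes its date format.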


What is the most common topic?

In [11]:
fd_posts["topic"].value_counts()

topic
Maps                         6
Infographics                 5
The Process                  3
Statistics                   1
Design                       1
Coding                       1
Network Visualization        1
Statistical Visualization    1
Self-surveillance            1
Name: count, dtype: int64

## __Exercise__: Add a column indicating whether any given post is for "Members Only"

How many are there?

In [12]:
fd_posts = pd.DataFrame([{
    "hed": post_el.select("h1 a")[0].text.strip(),
    "date": post_el.select(".byinfo a")[0].text,
    "topic": post_el.select(".byinfo strong a")[0].text,
    # Number of "Members Only" notes inside the post (0 if there are none)
    "members_only": len(post_el.select(".members-note")),
} for post_el in fd_soup.select(".archive-post") ])

fd_posts

| | hed | date | topic | members_only |
|---|---|---|---|---|
| 0 | Visualization Tools and Learning Resources, Ju... | June 29, 2023 | The Process | 1 |
| 1 | Astericking NBA champions | June 29, 2023 | Statistics | 0 |
| 2 | Noise and health | June 28, 2023 | Infographics | 0 |
| 3 | Password game requires more ridiculous rules a... | June 27, 2023 | Infographics | 0 |
| 4 | A year of flight paths, for someone with an un... | June 27, 2023 | Maps | 0 |
| 5 | Map of electric grid required for cleaner energy | June 26, 2023 | Maps | 0 |
| 6 | Crochet lake map | June 23, 2023 | Maps | 0 |
| 7 | Chart Practice: Branch Out Beyond the Visual Bits | June 22, 2023 | The Process | 1 |
| 8 | An interactive guide to color and contrast | June 22, 2023 | Design | 0 |
| 9 | Switching from Python to R | June 21, 2023 | Coding | 0 |


In [13]:
fd_posts["members_only"].sum()

3
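Counting with `len(...)` works because each post has at most one note. If you'd rather store an explicit `True`/`False`, a possible variation is to test whether `select_one` finds anything. The markup and variable names below are hypothetical stand-ins, not the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one members-only post and one public post
sample_html = """
<ul>
  <li class="archive-post">
    <div class="members-note">Members Only</div>
    <h1><a href="#">A members-only post</a></h1>
  </li>
  <li class="archive-post">
    <h1><a href="#">A public post</a></h1>
  </li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# True when the post contains a .members-note element, False otherwise
members_only_flags = [
    post_el.select_one(".members-note") is not None
    for post_el in sample_soup.select(".archive-post")
]

# Python treats True as 1, so summing the flags counts members-only posts
members_only_count = sum(members_only_flags)
print(members_only_flags, members_only_count)
```

Booleans still sum the same way in a `pandas` column, so `fd_posts["members_only"].sum()` would keep working with this version.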

---
