# 🌐 Scraping, Part 1.4: Practice, practice, practice!

*Let's scrape some dataaaaaaa!*

*Note: In these examples, I'll be using `lxml`, but feel free to use `BeautifulSoup` if you prefer.*

## Let's start with Soma's personal website

It's https://jonathansoma.com.

Open it in your browser. View the raw HTML, and also practice popping open the element inspector.

## Q: What do you see? What would you want to extract from it?

Some ideas:

- How many hyperlinks does Soma's homepage contain?
- Which paragraph contains the most hyperlinks?

Let's load the HTML in Python. Remember how?

In [1]:
import requests

In [2]:
soma_html = requests.get("https://jonathansoma.com/").text
print(soma_html[:200])

<!DOCTYPE html>
<html>
<head>
<title>Jonathan Soma makes things</title>
<style>
#content {
width: 700px;
color: #333;
margin: 0 auto;
padding-bottom: 100px;
padding-top: 100px;
font-family: Georgia, s


Now let's parse it with `lxml`. Remember how?

In [3]:
import lxml.html
soma_dom = lxml.html.fromstring(soma_html)

## Q: How many links on Soma's homepage?

In [4]:
soma_links = soma_dom.cssselect("a")
len(soma_links)

11

## Q: What are all the URLs those hyperlinks point to?

In [5]:
for link in soma_links:
    print(link.attrib["href"])

http://brooklynbrainery.com
http://dabbles.in
http://www.omgmsg.com
https://investigate.ai
http://jonathansoma.com/singles
http://handsomeatlas.com
http://jonathansoma.com/notes/dosas-and-injera/
http://jonathansoma.com/open-source-language-map
https://tinyletter.com/jsoma
http://twitter.com/dangerscarf
mailto:jonathan.soma@gmail.com


## Exercise: How many links are in each paragraph?

Let's start with grabbing each paragraph:

In [6]:
for i, p in enumerate(soma_dom.cssselect("p")):
    print(f"Paragraph {i+1}: {p.text_content()}")
    print("---")

Paragraph 1: I run a fake school and a paid newsletter about hobbies and have been known to talk too much about food. I love just about everything.
---
Paragraph 2: I've worked on baby-steps data science for journalists and lonely young men and rad old maps and pancakes and crowdsourced linguistics.
---
Paragraph 3: Want updates? I have a newsletter for that, too.
---
Paragraph 4:  
---
Paragraph 5: pithy = @dangerscarf lengthy = jonathan.soma@gmail.com
---


Now let's search *within* each paragraph for its links; we can do this because `.cssselect(...)` works on *any* element:

In [7]:
for i, p in enumerate(soma_dom.cssselect("p")):
    p_links = p.cssselect("a")
    print(f"Paragraph {i+1} has {len(p_links)} link(s)")
    print("---")

Paragraph 1 has 3 link(s)
---
Paragraph 2 has 5 link(s)
---
Paragraph 3 has 1 link(s)
---
Paragraph 4 has 0 link(s)
---
Paragraph 5 has 2 link(s)
---


Now let's print the text and URL of each link:

In [8]:
for i, p in enumerate(soma_dom.cssselect("p")):
    p_links = p.cssselect("a")
    print(f"Paragraph {i+1} has {len(p_links)} link(s):")
    for a in p_links:
        text = a.text_content()
        url = a.attrib["href"]
        print(f"→ {text}: {url}")
    print("---")

Paragraph 1 has 3 link(s):
→ fake school: http://brooklynbrainery.com
→ paid newsletter about hobbies: http://dabbles.in
→ food: http://www.omgmsg.com
---
Paragraph 2 has 5 link(s):
→ baby-steps data science for journalists: https://investigate.ai
→ lonely young men: http://jonathansoma.com/singles
→ rad old maps: http://handsomeatlas.com
→ pancakes: http://jonathansoma.com/notes/dosas-and-injera/
→ crowdsourced linguistics: http://jonathansoma.com/open-source-language-map
---
Paragraph 3 has 1 link(s):
→ newsletter: https://tinyletter.com/jsoma
---
Paragraph 4 has 0 link(s):
---
Paragraph 5 has 2 link(s):
→ @dangerscarf: http://twitter.com/dangerscarf
→ jonathan.soma@gmail.com: mailto:jonathan.soma@gmail.com
---


## Exercise: `pandas` refresher

How would you make a `pandas` `DataFrame` representing each link's text and URL?

(You can forget, for now, about what paragraph the link is in.)

In [9]:
import pandas as pd

In [10]:
soma_link_df = pd.DataFrame([ {
    "text": link.text_content(),
    "url": link.attrib["href"]
} for link in soma_links ])

soma_link_df

Unnamed: 0,text,url
0,fake school,http://brooklynbrainery.com
1,paid newsletter about hobbies,http://dabbles.in
2,food,http://www.omgmsg.com
3,baby-steps data science for journalists,https://investigate.ai
4,lonely young men,http://jonathansoma.com/singles
5,rad old maps,http://handsomeatlas.com
6,pancakes,http://jonathansoma.com/notes/dosas-and-injera/
7,crowdsourced linguistics,http://jonathansoma.com/open-source-language-map
8,newsletter,https://tinyletter.com/jsoma
9,@dangerscarf,http://twitter.com/dangerscarf
10,jonathan.soma@gmail.com,mailto:jonathan.soma@gmail.com


If [list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) aren't your cup of tea, here's another way we could have done that:

In [11]:
soma_link_list = []

for link in soma_links:
    item_data = {
        "text": link.text_content(),
        "url": link.attrib["href"],
    }
    soma_link_list.append(item_data)

soma_link_list

[{'text': 'fake school', 'url': 'http://brooklynbrainery.com'},
 {'text': 'paid newsletter about hobbies', 'url': 'http://dabbles.in'},
 {'text': 'food', 'url': 'http://www.omgmsg.com'},
 {'text': 'baby-steps data science for journalists',
  'url': 'https://investigate.ai'},
 {'text': 'lonely young men', 'url': 'http://jonathansoma.com/singles'},
 {'text': 'rad old maps', 'url': 'http://handsomeatlas.com'},
 {'text': 'pancakes',
  'url': 'http://jonathansoma.com/notes/dosas-and-injera/'},
 {'text': 'crowdsourced linguistics',
  'url': 'http://jonathansoma.com/open-source-language-map'},
 {'text': 'newsletter', 'url': 'https://tinyletter.com/jsoma'},
 {'text': '@dangerscarf', 'url': 'http://twitter.com/dangerscarf'},
 {'text': 'jonathan.soma@gmail.com', 'url': 'mailto:jonathan.soma@gmail.com'}]

In [12]:
soma_link_df = pd.DataFrame(soma_link_list)
soma_link_df

Unnamed: 0,text,url
0,fake school,http://brooklynbrainery.com
1,paid newsletter about hobbies,http://dabbles.in
2,food,http://www.omgmsg.com
3,baby-steps data science for journalists,https://investigate.ai
4,lonely young men,http://jonathansoma.com/singles
5,rad old maps,http://handsomeatlas.com
6,pancakes,http://jonathansoma.com/notes/dosas-and-injera/
7,crowdsourced linguistics,http://jonathansoma.com/open-source-language-map
8,newsletter,https://tinyletter.com/jsoma
9,@dangerscarf,http://twitter.com/dangerscarf
10,jonathan.soma@gmail.com,mailto:jonathan.soma@gmail.com


## Exercise: Add the URL's *protocol* to the DataFrame

(The protocol is the bit that comes before the `:`.)

In [13]:
soma_link_list = []

for link in soma_links:
    item_data = {
        "text": link.text_content(),
        "url": link.attrib["href"],
        "protocol": link.attrib["href"].split(":")[0],
    }
    soma_link_list.append(item_data)

soma_link_df = pd.DataFrame(soma_link_list)

soma_link_df

Unnamed: 0,text,url,protocol
0,fake school,http://brooklynbrainery.com,http
1,paid newsletter about hobbies,http://dabbles.in,http
2,food,http://www.omgmsg.com,http
3,baby-steps data science for journalists,https://investigate.ai,https
4,lonely young men,http://jonathansoma.com/singles,http
5,rad old maps,http://handsomeatlas.com,http
6,pancakes,http://jonathansoma.com/notes/dosas-and-injera/,http
7,crowdsourced linguistics,http://jonathansoma.com/open-source-language-map,http
8,newsletter,https://tinyletter.com/jsoma,https
9,@dangerscarf,http://twitter.com/dangerscarf,http
10,jonathan.soma@gmail.com,mailto:jonathan.soma@gmail.com,mailto


In [14]:
soma_link_df["protocol"].value_counts()

protocol
http      8
https     2
mailto    1
Name: count, dtype: int64
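
Splitting on `":"` works fine for these URLs. As an alternative sketch, Python's standard library can do the same job: `urllib.parse.urlsplit` exposes the protocol as `.scheme`, and `collections.Counter` plays the role of `value_counts()`. The URLs below are a handful copied from the output above:

```python
from collections import Counter
from urllib.parse import urlsplit

# A few of the URLs printed earlier in this notebook.
urls = [
    "http://brooklynbrainery.com",
    "https://investigate.ai",
    "https://tinyletter.com/jsoma",
    "mailto:jonathan.soma@gmail.com",
]

# urlsplit(...).scheme is the part of the URL before the first ":".
protocols = [urlsplit(url).scheme for url in urls]

# Counter tallies how often each protocol appears.
protocol_counts = Counter(protocols)

print(protocols)
print(protocol_counts.most_common())
```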

## Let's try something just a little more complicated

Let's examine the posts on FlowingData's homepage: https://flowingdata.com/

First, let's pop open the element inspector. Let's get a sense of how the page is structured, and what information we might want to extract.

At a high level, we're probably most interested in the `<div id="recent-posts" ...>` element. More specifically, there's a `<ul>` (which stands for unordered list) element. Each `<li class="archive-post">` within it seems to represent a post.

## Exercise: What are some CSS selectors, of varying specificity, we could use to select all of those posts?

`.archive-post` is *probably* enough. But you could also write:

- `#recent-posts > ul > li`
- `#recent-posts .archive-post`
- `#recent-posts li.archive-post`

Let's try these out. First, we'll load the HTML and convert it to a Python-accessible DOM:

In [15]:
fd_html = requests.get("https://flowingdata.com/").text
fd_dom = lxml.html.fromstring(fd_html)

Now we can run all of our proposed selectors, comparing the results:

In [16]:
selectors_to_try = [
    ".archive-post",
    "#recent-posts > ul > li",
    "#recent-posts .archive-post",
    "#recent-posts li.archive-post"
]

In [17]:
for sel in selectors_to_try:
    num_elements = len(fd_dom.cssselect(sel))
    print(f"{sel: <30} matches {num_elements} elements")

.archive-post                  matches 20 elements
#recent-posts > ul > li        matches 20 elements
#recent-posts .archive-post    matches 20 elements
#recent-posts li.archive-post  matches 20 elements


Next, let's grab the headline for each post.

## Q: How would you do this?

In [18]:
for post_el in fd_dom.cssselect(".archive-post"):
    hed = post_el.cssselect("h1 a")[0]
    print(hed.text_content())


Switching from Python to R 

Friend simulation system, with ChatGPT 

To make electric vehicle batteries, China must be involved 

Where people are moving in the U.S. 

Chart Practice: Changing the Audience 

Life timeline in a spreadsheet 

Objectiveness distributions 

Using gaps in location data to track illegal fishing 

Fake location signals from oil tankers avoiding oversight 

Generative AI exaggerates stereotypes 

Smoke from Canada wildfires over the U.S. 

Artificial Data Visualization 

NYC city council district voting guide 

See if you are middle class 

A moving drumbeat, explained visually 

Map of donut federations 

Changes to Blackjack payouts so that gamblers lose more to casinos 

Rights at risk at the U.S. Supreme Court level 

An open-access journal for visualization research 

Chart Practice: Feature Focus 


Let's use the `.strip()` method on each text string to strip out the extra whitespace:

In [19]:
for post_el in fd_dom.cssselect(".archive-post"):
    hed = post_el.cssselect("h1 a")[0]
    print(hed.text_content().strip())

Switching from Python to R
Friend simulation system, with ChatGPT
To make electric vehicle batteries, China must be involved
Where people are moving in the U.S.
Chart Practice: Changing the Audience
Life timeline in a spreadsheet
Objectiveness distributions
Using gaps in location data to track illegal fishing
Fake location signals from oil tankers avoiding oversight
Generative AI exaggerates stereotypes
Smoke from Canada wildfires over the U.S.
Artificial Data Visualization
NYC city council district voting guide
See if you are middle class
A moving drumbeat, explained visually
Map of donut federations
Changes to Blackjack payouts so that gamblers lose more to casinos
Rights at risk at the U.S. Supreme Court level
An open-access journal for visualization research
Chart Practice: Feature Focus


## Exercise: How would you get the date of a post? And the topic?

In [20]:
first_post = fd_dom.cssselect(".archive-post")[0]
first_post

<Element li at 0x1079868e0>

In [21]:
first_post.cssselect(".byinfo a")[0].text_content()

'June 21, 2023'

In [22]:
first_post.cssselect(".byinfo strong a")[0].text_content()

'Coding'

Now let's get those for each post, and put everything we have so far into a `pandas` `DataFrame`.

In [23]:
fd_posts = pd.DataFrame([{
    "hed": post_el.cssselect("h1 a")[0].text_content().strip(),
    "date": post_el.cssselect(".byinfo a")[0].text_content(),
    "topic": post_el.cssselect(".byinfo strong a")[0].text_content(),
} for post_el in fd_dom.cssselect(".archive-post") ])

fd_posts

Unnamed: 0,hed,date,topic
0,Switching from Python to R,"June 21, 2023",Coding
1,"Friend simulation system, with ChatGPT","June 20, 2023",Network Visualization
2,"To make electric vehicle batteries, China must...","June 19, 2023",Infographics
3,Where people are moving in the U.S.,"June 16, 2023",Statistical Visualization
4,Chart Practice: Changing the Audience,"June 15, 2023",The Process
5,Life timeline in a spreadsheet,"June 15, 2023",Self-surveillance
6,Objectiveness distributions,"June 14, 2023",Infographics
7,Using gaps in location data to track illegal f...,"June 13, 2023",Maps
8,Fake location signals from oil tankers avoidin...,"June 13, 2023",Maps
9,Generative AI exaggerates stereotypes,"June 12, 2023",Infographics


What is the most common topic?

In [24]:
fd_posts["topic"].value_counts()

topic
Infographics                 6
Maps                         4
The Process                  3
Statistical Visualization    2
Coding                       1
Network Visualization        1
Self-surveillance            1
Statistics                   1
News                         1
Name: count, dtype: int64

## Exercise: Add a column indicating whether any given post is for "Members Only"

How many are there?

In [25]:
fd_posts = pd.DataFrame([{
    "hed": post_el.cssselect("h1 a")[0].text_content().strip(),
    "date": post_el.cssselect(".byinfo a")[0].text_content(),
    "topic": post_el.cssselect(".byinfo strong a")[0].text_content(),
    "members_only": len(post_el.cssselect(".members-note")),
} for post_el in fd_dom.cssselect(".archive-post") ])

fd_posts

Unnamed: 0,hed,date,topic,members_only
0,Switching from Python to R,"June 21, 2023",Coding,0
1,"Friend simulation system, with ChatGPT","June 20, 2023",Network Visualization,0
2,"To make electric vehicle batteries, China must...","June 19, 2023",Infographics,0
3,Where people are moving in the U.S.,"June 16, 2023",Statistical Visualization,0
4,Chart Practice: Changing the Audience,"June 15, 2023",The Process,1
5,Life timeline in a spreadsheet,"June 15, 2023",Self-surveillance,0
6,Objectiveness distributions,"June 14, 2023",Infographics,0
7,Using gaps in location data to track illegal f...,"June 13, 2023",Maps,0
8,Fake location signals from oil tankers avoidin...,"June 13, 2023",Maps,0
9,Generative AI exaggerates stereotypes,"June 12, 2023",Infographics,0


In [26]:
fd_posts["members_only"].sum()

3

## Let's talk (and parse) `<table>`s

```
 ┌───────────┐
 │  <table>  │
 └─┬─────────┘
   │  ┌───────────┐
   ├─►│  <thead>  │
   │  └─────┬─────┘
   │        │  ┌───────┐
   │        └─►│  <tr> │
   │           └───┬───┘
   │               │  ┌────────┐
   │               └─►│  <th>  │
   │                  └────────┘
   │  ┌───────────┐
   └─►│  <tbody>  │
      └─────┬─────┘
            │  ┌────────┐
            └─►│  <tr>  │
               └───┬────┘
                   │  ┌────────┐
                   └─►│  <td>  │
                      └────────┘
```

Let's try parsing the table of giant watermelons from "`Homework 02, Part 2: The command line is fun (I promise) (optional)`"

Here's the website: http://www.bigpumpkins.com/WeighoffResultsGPC.aspx?c=W&y=2022

In [27]:
watermelon_html = requests.get(
    "http://www.bigpumpkins.com/WeighoffResultsGPC.aspx?c=W&y=2022"
).text

watermelon_dom = lxml.html.fromstring(watermelon_html)

Let's get the table:

In [28]:
watermelon_dom.cssselect("table")

[<Element table at 0x11e59f4c0>,
 <Element table at 0x10b269b20>,
 <Element table at 0x11e5e49a0>,
 <Element table at 0x11e5e6020>]

Hmmmmm. That's four tables, not one. What's happening?

Let's get the table *with more specificity*. How would you do that?

In [29]:
watermelon_dom.cssselect("table.ReportResults")

[<Element table at 0x11e5e6020>]

In [30]:
watermelon_table = watermelon_dom.cssselect("table.ReportResults")[0]

Now let's get the row elements:

In [31]:
row_els = watermelon_table.cssselect("tbody tr")
len(row_els)

300

Let's take a look at the first one, using `lxml.html.tostring(el)`:

In [32]:
lxml.html.tostring(row_els[0])

b'<tr><td align="right">1</td><td align="right">325.40</td><td>Mudd, Framk</td><td>Vine Grove</td><td>Kentucky</td><td>United States</td><td>Allardt Pumpkin Festival</td><td>305 Mudd 16</td><td>305 Mudd</td><td align="right">223.0</td><td align="right">303.00</td><td align="right">7.0</td></tr>'

Let's turn this row into a list, where each cell is one item in the list:

In [33]:
[ cell.text_content() for cell in row_els[0].cssselect("td") ]

['1',
 '325.40',
 'Mudd, Framk',
 'Vine Grove',
 'Kentucky',
 'United States',
 'Allardt Pumpkin Festival',
 '305 Mudd 16',
 '305 Mudd',
 '223.0',
 '303.00',
 '7.0']

## Exercise: How would you extract and represent data for all the rows?

In [34]:
watermelon_entries = [
    [ cell.text_content() for cell in row.cssselect("td") ]
for row in row_els ]

watermelon_entries[:3]

[['1',
  '325.40',
  'Mudd, Framk',
  'Vine Grove',
  'Kentucky',
  'United States',
  'Allardt Pumpkin Festival',
  '305 Mudd 16',
  '305 Mudd',
  '223.0',
  '303.00',
  '7.0'],
 ['2',
  '309.00',
  'McCaslin, Nick',
  'Hawesville',
  'Kentucky',
  'United States',
  'Chillicothe Halloween Festival',
  '301.5 McCaslin',
  'Self',
  '224.0',
  '307.00',
  '1.0'],
 ['3',
  '306.00',
  'Vial, Andrew',
  'Liberty',
  'North Carolina',
  'United States',
  'NC State Fair GPC Weigh-Off',
  '341.5 Vial 19',
  '330.5 Vial B 19',
  '223.0',
  '301.00',
  '2.0']]

Now all we're missing is the header.

In [35]:
watermelon_headers = [ header.text_content() for header in watermelon_table.cssselect("thead th") ]
watermelon_headers

['Place',
 'Weight (lbs)',
 'Grower Name',
 'City',
 'State/Prov',
 'Country',
 'GPC Site',
 'Seed (Mother)',
 'Pollinator (Father)',
 'OTT',
 'Est. Weight',
 'Pct. Chart']

Now let's put it all together, making a `DataFrame` that uses the headers as column names:

In [36]:
watermelon_df = pd.DataFrame(watermelon_entries, columns=watermelon_headers)
watermelon_df.head()

Unnamed: 0,Place,Weight (lbs),Grower Name,City,State/Prov,Country,GPC Site,Seed (Mother),Pollinator (Father),OTT,Est. Weight,Pct. Chart
0,1,325.4,"Mudd, Framk",Vine Grove,Kentucky,United States,Allardt Pumpkin Festival,305 Mudd 16,305 Mudd,223.0,303.0,7.0
1,2,309.0,"McCaslin, Nick",Hawesville,Kentucky,United States,Chillicothe Halloween Festival,301.5 McCaslin,Self,224.0,307.0,1.0
2,3,306.0,"Vial, Andrew",Liberty,North Carolina,United States,NC State Fair GPC Weigh-Off,341.5 Vial 19,330.5 Vial B 19,223.0,301.0,2.0
3,4,302.5,"Mudd, Frank",Vine Grove,Kentucky,United States,Roberts Family Farms,305 Mudd 16,Self,221.0,297.0,2.0
4,5,291.5,"VanBeck, Patrick",Willlow Spring,North Carolina,United States,NC State Fair GPC Weigh-Off,Carolina Cross Burpee,305 Vial DMG,221.0,297.0,-2.0
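
As an aside, `pandas` also has `pd.read_html`, which parses `<table>` elements straight into `DataFrame`s. Here's a sketch on a hand-copied excerpt (two rows, three columns) of the data above rather than the live page. It assumes an HTML parser such as `lxml` is installed, and the reduced markup is my own simplification:

```python
from io import StringIO

import pandas as pd

# Hand-copied excerpt of the first two rows shown above (three columns only).
table_html = """
<table class="ReportResults">
  <thead>
    <tr><th>Place</th><th>Weight (lbs)</th><th>Grower Name</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>325.40</td><td>Mudd, Framk</td></tr>
    <tr><td>2</td><td>309.00</td><td>McCaslin, Nick</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(table_html))
excerpt_df = tables[0]

print(excerpt_df)
print(excerpt_df.dtypes)
```

Unlike the manual approach, `read_html` tries to infer numeric types for you. Doing it by hand, as above, is still the better way to learn what's going on, and it's what you'll fall back on when a table is messy.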


## Q: Which grower entered the most melons?

In [37]:
watermelon_df["Grower Name"].value_counts().head()

Grower Name
Melka, Friedrich    6
Smiley, Samantha    5
Kent, Chris         5
McCaslin, Nick      5
Mudd, Frank         5
Name: count, dtype: int64

## Q: Which country grew the most melons?

In [38]:
watermelon_df["Country"].value_counts().head()

Country
United States    213
Canada            29
Germany           15
Italy             11
Austria            8
Name: count, dtype: int64

## Q: Which growers entered the most total weight in watermelons?

In [39]:
(
    watermelon_df
    .astype({ "Weight (lbs)": float })
    .groupby("Grower Name")
    ["Weight (lbs)"]
    .sum()
    .sort_values(ascending=False)
    .head()
)

Grower Name
McCaslin, Nick      1327.0
Mudd, Frank         1203.2
Kent, Chris         1163.0
Smiley, Samantha    1101.3
Houston, Hank       1044.0
Name: Weight (lbs), dtype: float64
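
One thing to watch for: grouping matches names exactly. The first-place row is spelled `Mudd, Framk` in the scraped data, which looks like a typo for `Mudd, Frank` (that's my assumption, not something the site confirms), so it lands in its own group. Here's a plain-Python sketch of what the group-and-sum computes, using the five rows from `watermelon_df.head()`:

```python
from collections import defaultdict

# (grower name, weight as scraped) for the five rows shown in head().
entries = [
    ("Mudd, Framk", "325.40"),
    ("McCaslin, Nick", "309.00"),
    ("Vial, Andrew", "306.00"),
    ("Mudd, Frank", "302.50"),
    ("VanBeck, Patrick", "291.50"),
]

# Sum weights per exact name string, converting text to float first.
totals = defaultdict(float)
for grower, weight in entries:
    totals[grower] += float(weight)

# Sort growers from heaviest total to lightest.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)

for grower, total in ranked:
    print(f"{grower: <20} {total}")
```

The two spellings stay separate, so any cleanup of names would have to happen before grouping.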

---