## Web Scraping Practice
From:
https://flowingdata.com/ 

In [1]:
import requests
from bs4 import BeautifulSoup

In [6]:
# requests.get(...).text returns the page's HTML, so name the variable accordingly
html = requests.get("https://flowingdata.com/").text

soup = BeautifulSoup(html)
type(soup)

bs4.BeautifulSoup

In [13]:
# Try several CSS selectors to see which ones match the posts on the page

selectors_to_try = [
    ".archive-post",
    "#recent-posts > ul > li",
    "#recent-posts .archive-post",
    "#recent-posts li.archive-post"
]

for sel in selectors_to_try:
    num_elements = len(soup.select(sel))
    print(f"{sel: <30} matches {num_elements} elements")

.archive-post                  matches 20 elements
#recent-posts > ul > li        matches 20 elements
#recent-posts .archive-post    matches 20 elements
#recent-posts li.archive-post  matches 20 elements


In [16]:
# Get headline for each post

for post_el in soup.select(".archive-post"):
    hed = post_el.select("h1 a")[0]
    print(hed.text.strip())


Astericking NBA champions
Noise and health
Password game requires more ridiculous rules as you play
A year of flight paths, for someone with an unlimited pass
Map of electric grid required for cleaner energy
Crochet lake map
Chart Practice: Branch Out Beyond the Visual Bits
An interactive guide to color and contrast
Switching from Python to R
Friend simulation system, with ChatGPT
To make electric vehicle batteries, China must be involved
Where people are moving in the U.S.
Chart Practice: Changing the Audience
Life timeline in a spreadsheet
Objectiveness distributions
Using gaps in location data to track illegal fishing
Fake location signals from oil tankers avoiding oversight
Generative AI exaggerates stereotypes
Smoke from Canada wildfires over the U.S.
Artificial Data Visualization


In [20]:
first_post = soup.select(".archive-post")[0]
first_post

<li class="archive-post">
<div>
<div class="nine columns offset-by-two alpha">
<h1>
<a href="https://flowingdata.com/2023/06/29/astericking-nba-champions/" rel="bookmark">
Astericking NBA champions </a>
</h1>
</div>
<div class="clr"></div>
<div class="byinfo two columns alpha">
<a href="https://flowingdata.com/2023/06/29/astericking-nba-champions/">June 29, 2023</a>
<div style="margin-top:1.5rem">
<h3 class="toplevel">Topic</h3>
<strong><a href="https://flowingdata.com/category/statistics/" rel="category tag">Statistics</a></strong>  /  <a href="https://flowingdata.com/tag/basketball/" rel="tag">basketball</a>, <a href="https://flowingdata.com/tag/pudding/" rel="tag">Pudding</a>, <a href="https://flowingdata.com/tag/russell-samora/" rel="tag">Russell Samora</a> </div>
</div>
<div class="nine columns omega" id="entry-content-wrapper">
<div class="entry">
<div class="archive-featured-image">
<a href="https://flowingdata.com/2023/06/29/astericking-nba-champions/">
<img alt="" class="attac

In [21]:
first_post.select(".byinfo a")[0].text

'June 29, 2023'

In [22]:
# Select the <a> tag nested inside a <strong> tag within .byinfo (the post's topic)
first_post.select(".byinfo strong a")[0].text

'Statistics'

In [25]:
import pandas as pd

# Build a DataFrame from a list of dictionaries.
# Each dictionary describes one post, and its keys become the column names.
# The square brackets make this a list comprehension, and the trailing
# `for` clause loops over every post element on the page.
fd_posts = pd.DataFrame([{
    "hed": post_el.select("h1 a")[0].text.strip(),
    "date": post_el.select(".byinfo a")[0].text,
    "topic": post_el.select(".byinfo strong a")[0].text,
} for post_el in soup.select(".archive-post") ])

fd_posts

Unnamed: 0,hed,date,topic
0,Astericking NBA champions,"June 29, 2023",Statistics
1,Noise and health,"June 28, 2023",Infographics
2,Password game requires more ridiculous rules a...,"June 27, 2023",Infographics
3,"A year of flight paths, for someone with an un...","June 27, 2023",Maps
4,Map of electric grid required for cleaner energy,"June 26, 2023",Maps
5,Crochet lake map,"June 23, 2023",Maps
6,Chart Practice: Branch Out Beyond the Visual Bits,"June 22, 2023",The Process
7,An interactive guide to color and contrast,"June 22, 2023",Design
8,Switching from Python to R,"June 21, 2023",Coding
9,"Friend simulation system, with ChatGPT","June 20, 2023",Network Visualization
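
The `date` column above holds plain strings such as 'June 29, 2023'. A minimal sketch of turning them into real dates, assuming every value follows that exact English "Month day, year" pattern; `parse_post_date` is a hypothetical helper name, not something from the notebook.

```python
from datetime import date, datetime

def parse_post_date(text):
    """Convert a string like 'June 29, 2023' into a datetime.date."""
    # %B = full month name, %d = day of the month, %Y = four-digit year
    return datetime.strptime(text.strip(), "%B %d, %Y").date()

# Sample values copied from the scraped output above
scraped_dates = ["June 29, 2023", "June 28, 2023", "June 27, 2023"]
parsed_dates = [parse_post_date(d) for d in scraped_dates]
print(parsed_dates[0].isoformat())
```

Inside pandas, `pd.to_datetime(fd_posts["date"], format="%B %d, %Y")` should do the same job on the whole column at once.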


# Scraping, Part 6: Table Talk
* From HTML table to data table 

Structure of an HTML table:
<table>
    <thead>
        <tr>
            <th>
    <tbody>
        <tr>
            <td>

tr = table row
th = header cell
td = standard table cell
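
To make the thead/tbody/tr/th/td nesting concrete, here is a rough sketch that reads a tiny hand-written table using only Python's built-in `html.parser` (not BeautifulSoup, which the rest of these notes use). `SAMPLE_TABLE` and `TableReader` are made up for illustration.

```python
from html.parser import HTMLParser

# Hand-written sample; the values mirror the first two rows scraped later
SAMPLE_TABLE = """
<table>
  <thead>
    <tr><th>Place</th><th>Weight (lbs)</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>325.40</td></tr>
    <tr><td>2</td><td>309.00</td></tr>
  </tbody>
</table>
"""

class TableReader(HTMLParser):
    """Collect <th> text into headers and <td> text into rows."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self.rows = []
        self._current_row = []
        self._cell_tag = None  # "th" or "td" while inside a cell
        self._text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current_row = []  # every <tr> starts a fresh row
        elif tag in ("th", "td"):
            self._cell_tag = tag
            self._text = ""

    def handle_data(self, data):
        # Only keep text that sits inside a cell
        if self._cell_tag is not None:
            self._text += data

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell_tag == tag:
            target = self.headers if tag == "th" else self._current_row
            target.append(self._text.strip())
            self._cell_tag = None
        elif tag == "tr" and self._current_row:
            # The header row holds no <td> cells, so it is skipped here
            self.rows.append(self._current_row)

reader = TableReader()
reader.feed(SAMPLE_TABLE)
print(reader.headers)
print(reader.rows)
```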

In [26]:
watermelon_url = "http://www.bigpumpkins.com/WeighoffResultsGPC.aspx?c=W&y=2022"
watermelon_html = requests.get(watermelon_url).text

In [27]:
watermelon_soup = BeautifulSoup(watermelon_html)

In [31]:
tables = watermelon_soup.select("table.ReportResults")
len(tables)

watermelon_table = tables[0]
watermelon_table.text[:500]

'PlaceWeight (lbs)Grower NameCityState/ProvCountryGPC SiteSeed (Mother)Pollinator (Father)OTTEst. WeightPct. Chart1325.40Mudd, FramkVine GroveKentuckyUnited StatesAllardt Pumpkin Festival305 Mudd 16305 Mudd223.0303.007.02309.00McCaslin, NickHawesvilleKentuckyUnited StatesChillicothe Halloween Festival301.5 McCaslinSelf224.0307.001.03306.00Vial, AndrewLibertyNorth CarolinaUnited StatesNC State Fair GPC Weigh-Off341.5 Vial 19330.5 Vial B 19223.0301.002.04302.50Mudd, FrankVine GroveKentuckyUnited St'

In [32]:
row_els = watermelon_table.select("tbody tr")
len(row_els)

300

In [33]:
row_els[0]

<tr><td align="right">1</td><td align="right">325.40</td><td>Mudd, Framk</td><td>Vine Grove</td><td>Kentucky</td><td>United States</td><td>Allardt Pumpkin Festival</td><td>305 Mudd 16</td><td>305 Mudd</td><td align="right">223.0</td><td align="right">303.00</td><td align="right">7.0</td></tr>

In [34]:
# First way: an explicit for loop (more verbose, but more familiar)
row_cells = []
for cell in row_els[0].select("td"):
    row_cells.append(cell.text)
row_cells

['1',
 '325.40',
 'Mudd, Framk',
 'Vine Grove',
 'Kentucky',
 'United States',
 'Allardt Pumpkin Festival',
 '305 Mudd 16',
 '305 Mudd',
 '223.0',
 '303.00',
 '7.0']

In [35]:
# Second way: a list comprehension
[ cell.text for cell in row_els[0].select("td") ]

['1',
 '325.40',
 'Mudd, Framk',
 'Vine Grove',
 'Kentucky',
 'United States',
 'Allardt Pumpkin Festival',
 '305 Mudd 16',
 '305 Mudd',
 '223.0',
 '303.00',
 '7.0']

In [36]:
# How to collect all data - one way, using a nested list comprehension
watermelon_entries = [
    [ cell.text for cell in row.select("td") ]
for row in row_els ]

watermelon_entries[:3]

[['1',
  '325.40',
  'Mudd, Framk',
  'Vine Grove',
  'Kentucky',
  'United States',
  'Allardt Pumpkin Festival',
  '305 Mudd 16',
  '305 Mudd',
  '223.0',
  '303.00',
  '7.0'],
 ['2',
  '309.00',
  'McCaslin, Nick',
  'Hawesville',
  'Kentucky',
  'United States',
  'Chillicothe Halloween Festival',
  '301.5 McCaslin',
  'Self',
  '224.0',
  '307.00',
  '1.0'],
 ['3',
  '306.00',
  'Vial, Andrew',
  'Liberty',
  'North Carolina',
  'United States',
  'NC State Fair GPC Weigh-Off',
  '341.5 Vial 19',
  '330.5 Vial B 19',
  '223.0',
  '301.00',
  '2.0']]

In [37]:
#How to collect all data - second way using for loop
watermelon_entries = []
for row in row_els:
    row_cells = []
    for cell in row.select("td"):
        row_cells.append(cell.text)
    watermelon_entries.append(row_cells)

watermelon_entries[:3]

[['1',
  '325.40',
  'Mudd, Framk',
  'Vine Grove',
  'Kentucky',
  'United States',
  'Allardt Pumpkin Festival',
  '305 Mudd 16',
  '305 Mudd',
  '223.0',
  '303.00',
  '7.0'],
 ['2',
  '309.00',
  'McCaslin, Nick',
  'Hawesville',
  'Kentucky',
  'United States',
  'Chillicothe Halloween Festival',
  '301.5 McCaslin',
  'Self',
  '224.0',
  '307.00',
  '1.0'],
 ['3',
  '306.00',
  'Vial, Andrew',
  'Liberty',
  'North Carolina',
  'United States',
  'NC State Fair GPC Weigh-Off',
  '341.5 Vial 19',
  '330.5 Vial B 19',
  '223.0',
  '301.00',
  '2.0']]

In [38]:
# Get the column headers from the table, to use when building the pandas DataFrame
header_cells = watermelon_table.select("thead th")
watermelon_headers = [ header.text for header in header_cells ]
watermelon_headers

['Place',
 'Weight (lbs)',
 'Grower Name',
 'City',
 'State/Prov',
 'Country',
 'GPC Site',
 'Seed (Mother)',
 'Pollinator (Father)',
 'OTT',
 'Est. Weight',
 'Pct. Chart']

In [39]:
import pandas as pd
watermelon_df = pd.DataFrame(watermelon_entries, columns=watermelon_headers)
watermelon_df.head()

Unnamed: 0,Place,Weight (lbs),Grower Name,City,State/Prov,Country,GPC Site,Seed (Mother),Pollinator (Father),OTT,Est. Weight,Pct. Chart
0,1,325.4,"Mudd, Framk",Vine Grove,Kentucky,United States,Allardt Pumpkin Festival,305 Mudd 16,305 Mudd,223.0,303.0,7.0
1,2,309.0,"McCaslin, Nick",Hawesville,Kentucky,United States,Chillicothe Halloween Festival,301.5 McCaslin,Self,224.0,307.0,1.0
2,3,306.0,"Vial, Andrew",Liberty,North Carolina,United States,NC State Fair GPC Weigh-Off,341.5 Vial 19,330.5 Vial B 19,223.0,301.0,2.0
3,4,302.5,"Mudd, Frank",Vine Grove,Kentucky,United States,Roberts Family Farms,305 Mudd 16,Self,221.0,297.0,2.0
4,5,291.5,"VanBeck, Patrick",Willlow Spring,North Carolina,United States,NC State Fair GPC Weigh-Off,Carolina Cross Burpee,305 Vial DMG,221.0,297.0,-2.0
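
Every value pulled out with `cell.text` is a string ('325.40', not 325.4), as the list outputs earlier show. A small sketch of pairing headers with cells and converting the numeric ones, assuming the cells contain plain digits with no thousands separators or blanks; `NUMERIC_COLUMNS` and `to_record` are hypothetical names.

```python
# A few headers and rows copied from the scraped output above
headers = ["Place", "Weight (lbs)", "Grower Name"]
rows = [
    ["1", "325.40", "Mudd, Framk"],
    ["2", "309.00", "McCaslin, Nick"],
]

# Which columns to convert, and with which function (an assumption for this sketch)
NUMERIC_COLUMNS = {"Place": int, "Weight (lbs)": float}

def to_record(headers, cells):
    """Pair each header with its cell, then convert the numeric columns."""
    record = dict(zip(headers, cells))
    for column, convert in NUMERIC_COLUMNS.items():
        record[column] = convert(record[column])
    return record

records = [to_record(headers, row) for row in rows]
print(records[0])
```

With a DataFrame already built, `pd.to_numeric(watermelon_df["Weight (lbs)"])` is the likely pandas equivalent for a single column.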


# Scraping, Part 7: Scraping multiple pages
Enumerating and Traversing

Considerations for scraping:
* burden on the web server
* purpose/public interest
* accountability: identify yourself so the site owner can contact you

In [40]:
#Enumerating multiple pages
BASE_URL = "https://scraping-practice-jsvine.vercel.app/launches/paginated/"
for i in range(23):
    print(BASE_URL + "?page=" + str(i+1))


https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=1
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=2
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=3
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=4
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=5
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=6
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=7
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=8
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=9
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=10
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=11
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=12
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=13
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=14
h
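
An alternative sketch of the same enumeration, counting from 1 directly with `range(1, n + 1)` and letting `urllib.parse.urlencode` build the query string. `page_urls` is a hypothetical helper; the 23-page count comes from the loop above.

```python
from urllib.parse import urlencode

BASE_URL = "https://scraping-practice-jsvine.vercel.app/launches/paginated/"

def page_urls(base_url, n_pages):
    """Return the URL of every page, numbered 1 through n_pages."""
    # urlencode({"page": 3}) produces "page=3" and escapes values if needed
    return [
        base_url + "?" + urlencode({"page": page})
        for page in range(1, n_pages + 1)
    ]

urls = page_urls(BASE_URL, 23)
print(urls[0])
print(len(urls))
```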

In [41]:
BASE_URL = "https://scraping-practice-jsvine.vercel.app/launches/paginated/"

# Note the shorter range, for practice
for i in range(3):
    page_url = BASE_URL + "?page=" + str(i + 1)
    page_html = requests.get(page_url).text
    page_soup = BeautifulSoup(page_html)
    heading = page_soup.select("h3")[0]
    print(heading.text)

Page 1 of 23
Page 2 of 23
Page 3 of 23


In [42]:
# Collect the table rows from each page and store them in one list:
all_rows = []

for i in range(3):
    print("Fetching page " + str(i + 1))
    page_url = BASE_URL + "?page=" + str(i + 1)
    page_html = requests.get(page_url).text
    page_soup = BeautifulSoup(page_html)
    table = page_soup.select("table")[0]
    row_els = table.select("tbody tr")
    for tr in row_els:
        cells = [ td.text for td in tr.select("td") ]
        all_rows.append(cells)
        
all_rows[:3]

Fetching page 1
Fetching page 2
Fetching page 3


[['Jun 23, 2023',
  'Starlink Group 5-12',
  'Falcon 9',
  'Space Exploration Technologies Corporation',
  'FL'],
 ['Jun 22, 2023',
  'Starlink Gp 5-7',
  'Falcon 9',
  'Space Exploration Technologies Corporation',
  'CA'],
 ['Jun 18, 2023',
  'PSN MFS',
  'Falcon 9',
  'Space Exploration Technologies Corporation',
  'FL']]

In [43]:
len(all_rows)

75

In [45]:
sample_pd = pd.DataFrame(all_rows)
sample_pd

Unnamed: 0,0,1,2,3,4
0,"Jun 23, 2023",Starlink Group 5-12,Falcon 9,Space Exploration Technologies Corporation,FL
1,"Jun 22, 2023",Starlink Gp 5-7,Falcon 9,Space Exploration Technologies Corporation,CA
2,"Jun 18, 2023",PSN MFS,Falcon 9,Space Exploration Technologies Corporation,FL
3,"Jun 17, 2023",FST-1,Electron,Rocket Lab Global,VA
4,"Jun 12, 2023",Transporter-8,Falcon 9,Space Exploration Technologies Corporation,CA
...,...,...,...,...,...
70,"Oct 5, 2022",Crew-5,Falcon 9,Space Exploration Technologies Corporation,FL
71,"Oct 4, 2022",SES 20-21,Atlas V,United Launch Alliance,FL
72,"Oct 1, 2022",FLTA002,Alpha,Firefly Aerospace,CA
73,"Sep 24, 2022",Starlink Group 4-35,Falcon 9,Space Exploration Technologies Corporation,FL


# Traversing directories and other listings

In [46]:
#How to get table from each sub page
#Get URLS for each subpage

BASE_URL = "https://scraping-practice-jsvine.vercel.app/launches/directory/"
html = requests.get(BASE_URL).text
soup = BeautifulSoup(html)

In [47]:
#Try to get href attribute for each of the links:
links = soup.select("ul li a")

for link in links[:3]:
    print(BASE_URL + link["href"])

https://scraping-practice-jsvine.vercel.app/launches/directory/cb50c8c
https://scraping-practice-jsvine.vercel.app/launches/directory/6840f28
https://scraping-practice-jsvine.vercel.app/launches/directory/0ed154e
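
String concatenation works here because the hrefs are bare relative paths and BASE_URL ends with a slash. A more forgiving sketch uses `urllib.parse.urljoin`, which also copes with root-relative and absolute hrefs. Only the first href below comes from the page; the other two are hypothetical forms added for comparison.

```python
from urllib.parse import urljoin

BASE_URL = "https://scraping-practice-jsvine.vercel.app/launches/directory/"

hrefs = [
    "cb50c8c",                      # relative, as seen on the directory page
    "/launches/directory/6840f28",  # hypothetical root-relative form
    "https://example.com/other",    # hypothetical absolute link
]

resolved = [urljoin(BASE_URL, href) for href in hrefs]
for url in resolved:
    print(url)
```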


In [48]:
#Expansion to pull header from each sub page

for link in links[:3]:
    page_url = BASE_URL + link["href"]
    page_html = requests.get(page_url).text
    page_soup = BeautifulSoup(page_html)
    heading = page_soup.select("h1")[0]
    print(heading.text)

Commercial Space Launches: ABL Space Systems
Commercial Space Launches: American Rocket
Commercial Space Launches: Armadillo Aerospace


In [55]:
# Pull all the rows from the first 3 companies into a dataframe
all_rows = []

links = soup.select("ul li a")

for link in links[:3]:
    page_url = BASE_URL + link["href"]
    print("Fetching " + page_url)
    page_html = requests.get(page_url).text
    page_soup = BeautifulSoup(page_html)
    table = page_soup.select("table")[0]
    row_els = table.select("tbody tr")
    for tr in row_els:
        cells = []
        for td in tr.select("td"):
            cells.append(td.text)
        all_rows.append(cells)
        
pd.DataFrame(all_rows)


Fetching https://scraping-practice-jsvine.vercel.app/launches/directory/cb50c8c
Fetching https://scraping-practice-jsvine.vercel.app/launches/directory/6840f28
Fetching https://scraping-practice-jsvine.vercel.app/launches/directory/0ed154e


Unnamed: 0,0,1,2,3,4
0,"Jan 10, 2023",Demonstration Mission-1,RS1,ABL Space Systems,AK
1,"Oct 5, 1989",SET-1,SMLV,American Rocket,CA
2,"Jan 5, 2013",Scientific,STIG-B III,Armadillo Aerospace,NM
3,"Nov 4, 2012",Scientific,STIG-B,Armadillo Aerospace,NM
4,"Oct 6, 2012",Scientific,STIG-B,Armadillo Aerospace,NM


# Scraping, Part 8: Scraping gracefully

1. Sleeping - pausing execution of Python between requests; worth adding some slowness if you are scraping a delicate site
2. Announcing yourself - send information about yourself with each request
3. Caching
4. Catching HTTP errors

In [58]:
# Sleeping pauses execution; the number is the length of the pause in seconds
from time import sleep
for i in range(3):
    print("Fetching page " + str(i+1))
    sleep(1)

Fetching page 1
Fetching page 2
Fetching page 3
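
One way the pause could be folded into a real fetch loop, sketched with a stand-in fetch function so it runs without touching the network. `fetch_politely` and `fake_fetch` are hypothetical names; the `sleep` parameter exists only so the pauses can be recorded instead of actually waited out.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # no pause is needed before the first request
        results.append(fetch(url))
    return results

def fake_fetch(url):
    # Stand-in for something like: requests.get(url).text
    return "<html>" + url + "</html>"

pauses = []  # records each requested pause instead of really sleeping
pages = fetch_politely(
    ["page-1", "page-2", "page-3"],
    fake_fetch,
    delay_seconds=1.0,
    sleep=pauses.append,
)
print(len(pages), pauses)
```

In the notebook, passing `lambda url: requests.get(url).text` as `fetch` and leaving `sleep` at its default would give one real one-second pause between pages.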


In [None]:
# Announcing yourself
ident = (
    "Jeremy Singer-Vine (jsvine@gmail.com), " + 
    "scraping for educational purposes"
)

html = requests.get(
    "https://example.com",
    headers = {
        "From": ident
    }
).text

* Caching - fetch each page only once (unless it's changing rapidly)
* Make a subdirectory next to your notebooks using mkdir table-pages


In [67]:
# Cache each page in a local subdirectory

from pathlib import Path

# Create the table-pages folder if it does not already exist
Path("table-pages").mkdir(exist_ok=True)

BASE_URL = "https://scraping-practice-jsvine.vercel.app/launches/paginated/"

for i in range(3):
    dest = Path("table-pages/" + str(i + 1) + ".html")
    
    if dest.exists(): # ... load it from file
        page_html = dest.read_text()
        
    else: # ... fetch it
        page_url = BASE_URL + "?page=" + str(i + 1)
        print("Fetching " + page_url)
        page_html = requests.get(page_url).text
        
        # ... and then save it to file
        with open(dest, "w") as f:
            f.write(page_html)
            
    page_soup = BeautifulSoup(page_html)
    heading = page_soup.select("h3")[0]
    print(heading.text)


Fetching https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=1
Page 1 of 23
Fetching https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=2
Page 2 of 23
Fetching https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=3
Page 3 of 23
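
The load-or-fetch pattern above can be wrapped in a function so other scrapers can reuse it. A minimal sketch, with a stand-in fetch function and a temporary directory so it runs offline; `cached_fetch` and `fake_fetch` are hypothetical names.

```python
import tempfile
from pathlib import Path

def cached_fetch(url, dest, fetch):
    """Return the page from dest if already saved, else fetch and save it."""
    dest = Path(dest)
    if dest.exists():
        return dest.read_text(encoding="utf-8")
    html = fetch(url)
    dest.parent.mkdir(parents=True, exist_ok=True)  # create the cache folder if missing
    dest.write_text(html, encoding="utf-8")
    return html

calls = []  # every URL that actually had to be fetched

def fake_fetch(url):
    # Stand-in for something like: requests.get(url).text
    calls.append(url)
    return "<h3>Page 1 of 23</h3>"

with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp) / "table-pages" / "1.html"
    first = cached_fetch("https://example.com/?page=1", dest, fake_fetch)
    second = cached_fetch("https://example.com/?page=1", dest, fake_fetch)

print(len(calls))
```

The second call reads from disk, so the stand-in fetcher is only invoked once.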


In [65]:
# Catching HTTP errors
flaky_url = "https://scraping-practice-jsvine.vercel.app/launches/paginated/flaky/"

requests.get(flaky_url)

<Response [500]>

In [66]:
response = requests.get(flaky_url)
print(response.status_code)

200
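
The flaky URL returned 500 on one request and 200 on the next, so one reasonable response is to retry a few times before giving up. A rough sketch of that idea, using a fake response object so it runs offline; `get_with_retries`, `FakeResponse`, and `flaky_fetch` are hypothetical names, and treating only 200 as success is an assumption.

```python
import time

class FakeResponse:
    """Minimal stand-in for a requests response: only carries a status code."""
    def __init__(self, status_code):
        self.status_code = status_code

def get_with_retries(fetch, url, max_attempts=3, wait_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns status 200 or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response
        if attempt < max_attempts:
            sleep(wait_seconds)  # wait a little before trying again
    raise RuntimeError(
        f"{url} still failing after {max_attempts} attempts "
        f"(last status {response.status_code})"
    )

# Simulate a server that fails twice and then succeeds
statuses = iter([500, 500, 200])

def flaky_fetch(url):
    return FakeResponse(next(statuses))

waits = []  # records each wait instead of really sleeping
response = get_with_retries(flaky_fetch, "https://example.com/flaky/", sleep=waits.append)
print(response.status_code, len(waits))
```

With the real library, `requests.get` could be passed as `fetch`. For stopping on the first error instead of retrying, `response.raise_for_status()` raises an exception on 4xx and 5xx responses.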
