# 🌐 Scraping, Part 7: Scraping multiple pages

*Enumerating and traversing.*

In [1]:
import requests
from bs4 import BeautifulSoup

## Discussion time: Let's talk ethics

## Considerations

- Burden on the server? Minimize it.
- Purpose / public interest? Have a clear one.
- Accountability? Provide it.
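
One way to act on these considerations in code is to pause between requests and to identify yourself in the `User-Agent` header. The sketch below is a minimal illustration, not a standard: the `Throttle` class, the one-second interval, and the contact address are all hypothetical placeholders.

```python
import time


class Throttle:
    """Enforce a minimum pause between consecutive requests."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # seconds between requests
        self.last_call = None

    def wait(self):
        # Sleep only for the part of the interval that has not already elapsed.
        if self.last_call is not None:
            elapsed = time.monotonic() - self.last_call
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()


# Hypothetical contact details: replace with your own so the site's
# operators can reach you if your scraper causes problems.
HEADERS = {"User-Agent": "scraping-class-example (contact: you@example.com)"}

throttle = Throttle(min_interval=1.0)

# Inside a scraping loop, you would call throttle.wait() before each
# requests.get(url, headers=HEADERS).
```

The first call to `wait()` returns immediately; later calls sleep only as long as needed to keep requests at least `min_interval` seconds apart.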

## Let's start low stakes: 

## <center>https://scraping-practice-jsvine.vercel.app</center>

## *Enumerating* multiple pages

Often, the data you want is split across a series of numbered pages.

Open the "__Paginated table__" example on the practice server. Click around. Look at the URLs of each page. What do you notice?

## Iterating over the pages

Each page's URL looks like `https://[...]/launches/paginated/?page=NUMBER`.

How would we make a list of *all* the URLs necessary to get the full dataset?

In [2]:
for i in range(23):
    print(i + 1)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


In [3]:
BASE_URL = "https://scraping-practice-jsvine.vercel.app/launches/paginated/"
for i in range(23):
    print(BASE_URL + "?page=" + str(i + 1))

https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=1
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=2
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=3
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=4
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=5
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=6
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=7
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=8
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=9
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=10
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=11
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=12
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=13
https://scraping-practice-jsvine.vercel.app/launches/paginated/?page=14
...
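
The same URLs can also be collected into a list rather than printed, which is handy when you want to loop over them later. A minimal sketch, assuming the 23-page count used in the loop above; `NUM_PAGES` and `page_urls` are names introduced here for illustration:

```python
from urllib.parse import urlencode

BASE_URL = "https://scraping-practice-jsvine.vercel.app/launches/paginated/"
NUM_PAGES = 23  # assumption: the same page count as range(23) above

# range(1, NUM_PAGES + 1) counts 1..23, so no "i + 1" is needed.
page_urls = [
    BASE_URL + "?" + urlencode({"page": page_number})
    for page_number in range(1, NUM_PAGES + 1)
]

print(len(page_urls))
print(page_urls[0])
print(page_urls[-1])
```

Using `urlencode` rather than string concatenation is a design choice here: it makes no difference for a plain number, but it would escape any special characters in a query value.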

Now let's try scraping data from there, starting just with the `Page -- of --` heading, to make sure we're getting the correct information:

In [4]:
import requests
from bs4 import BeautifulSoup

In [5]:
BASE_URL = "https://scraping-practice-jsvine.vercel.app/launches/paginated/"

# Note the shorter range, for practice
for i in range(3):
    page_url = BASE_URL + "?page=" + str(i + 1)
    page_html = requests.get(page_url).text
    page_soup = BeautifulSoup(page_html, "html.parser")
    heading = page_soup.select("h3")[0]
    print(heading.text)

Page 1 of 23
Page 2 of 23
Page 3 of 23
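
Because the heading follows a fixed `Page X of Y` pattern, it can also tell us how many pages exist, so the count need not be hard-coded. A minimal sketch using a regular expression; `total_pages` is a hypothetical helper, and it assumes the heading text keeps exactly this shape:

```python
import re


def total_pages(heading_text):
    """Extract N from a heading shaped like 'Page 3 of N'."""
    match = re.search(r"Page\s+\d+\s+of\s+(\d+)", heading_text)
    if match is None:
        raise ValueError("Unexpected heading: " + repr(heading_text))
    return int(match.group(1))


print(total_pages("Page 1 of 23"))  # prints 23
```

Fetching the first page, passing its heading text to `total_pages`, and using the result as the loop bound would replace the hard-coded `23`.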


Great, now let's try to get the data itself and store it somewhere:

In [6]:
all_rows = []

for i in range(3):
    print("Fetching page " + str(i + 1))
    page_url = BASE_URL + "?page=" + str(i + 1)
    page_html = requests.get(page_url).text
    # Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
    page_soup = BeautifulSoup(page_html, "html.parser")
    table = page_soup.select("table")[0]
    row_els = table.select("tbody tr")
    # Each <tr> becomes a list of its cells' text
    for tr in row_els:
        cells = [td.text for td in tr.select("td")]
        all_rows.append(cells)

all_rows[:3]

Fetching page 1
Fetching page 2
Fetching page 3


[['Jun 23, 2023',
  'Starlink Group 5-12',
  'Falcon 9',
  'Space Exploration Technologies Corporation',
  'FL'],
 ['Jun 22, 2023',
  'Starlink Gp 5-7',
  'Falcon 9',
  'Space Exploration Technologies Corporation',
  'CA'],
 ['Jun 18, 2023',
  'PSN MFS',
  'Falcon 9',
  'Space Exploration Technologies Corporation',
  'FL']]

In [7]:
len(all_rows)

75

... and put it in a `DataFrame`:

In [8]:
import pandas as pd
pd.DataFrame(all_rows)

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | Jun 23, 2023 | Starlink Group 5-12 | Falcon 9 | Space Exploration Technologies Corporation | FL |
| 1 | Jun 22, 2023 | Starlink Gp 5-7 | Falcon 9 | Space Exploration Technologies Corporation | CA |
| 2 | Jun 18, 2023 | PSN MFS | Falcon 9 | Space Exploration Technologies Corporation | FL |
| 3 | Jun 17, 2023 | FST-1 | Electron | Rocket Lab Global | VA |
| 4 | Jun 12, 2023 | Transporter-8 | Falcon 9 | Space Exploration Technologies Corporation | CA |
| ... | ... | ... | ... | ... | ... |
| 70 | Oct 5, 2022 | Crew-5 | Falcon 9 | Space Exploration Technologies Corporation | FL |
| 71 | Oct 4, 2022 | SES 20-21 | Atlas V | United Launch Alliance | FL |
| 72 | Oct 1, 2022 | FLTA002 | Alpha | Firefly Aerospace | CA |
| 73 | Sep 24, 2022 | Starlink Group 4-35 | Falcon 9 | Space Exploration Technologies Corporation | FL |
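
The columns above are labeled only `0` through `4`. One option is to pair each cell with a name before building the `DataFrame`. The sketch below is illustrative only: the column names are guesses based on what the values look like, not headers taken from the site, and `sample_rows` stands in for `all_rows`.

```python
# Hypothetical column names, inferred from the values rather than from the site.
COLUMNS = ["date", "mission", "vehicle", "company", "state"]

# Two rows copied from the output above, standing in for all_rows.
sample_rows = [
    ["Jun 23, 2023", "Starlink Group 5-12", "Falcon 9",
     "Space Exploration Technologies Corporation", "FL"],
    ["Jun 17, 2023", "FST-1", "Electron", "Rocket Lab Global", "VA"],
]

# zip pairs each name with the cell in the same position.
records = [dict(zip(COLUMNS, row)) for row in sample_rows]

print(records[1]["vehicle"])  # prints Electron
```

`pd.DataFrame(records)` would then use the dictionary keys as column names. Passing `columns=COLUMNS` to `pd.DataFrame(all_rows, columns=COLUMNS)` is an alternative that keeps the rows as plain lists.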


## Discussion: What improvements could we make?

- Get the number of pages programmatically
- Fetch each page only once
- Handle errors
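
The last two improvements can be sketched together. The helper below is hypothetical, not part of `requests`: it stores each page's HTML in a dictionary so that a URL is fetched at most once, and it calls `raise_for_status()` so that an HTTP error stops the loop instead of silently yielding an error page. The `get` argument is assumed to behave like `requests.get`.

```python
def fetch_html(url, get, cache):
    """Return the HTML for url, calling get(url) only on a cache miss.

    get is assumed to behave like requests.get: it returns an object with
    a .text attribute and a .raise_for_status() method.
    """
    if url not in cache:
        response = get(url)
        # Raises an exception for 4xx/5xx responses, so bad pages are not cached.
        response.raise_for_status()
        cache[url] = response.text
    return cache[url]


# In the notebook this might be used as:
#   html_cache = {}
#   page_html = fetch_html(page_url, requests.get, html_cache)
```

Keeping the cache in a plain dictionary means it lasts only as long as the notebook session; saving pages to disk would be a natural next step for longer scrapes.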

## *Traversing* directories and other listings

Enumerating and traversing are similar techniques. With enumeration (pagination), you just increment the page number to get the next URL. With traversal, you need to extract the list of URLs *programmatically* from the starting page's HTML.

## Examine this page, and its HTML:

## <center>https://scraping-practice-jsvine.vercel.app/launches/directory/</center>

## How would we get the tables from each sub-page?

Let's get the URLs for each subpage:

In [9]:
BASE_URL = "https://scraping-practice-jsvine.vercel.app/launches/directory/"
html = requests.get(BASE_URL).text
soup = BeautifulSoup(html, "html.parser")

First, we can try just getting the `href` attribute of each link:

In [10]:
links = soup.select("ul li a")

for link in links[:3]:
    print(link["href"])

cb50c8c
6840f28
0ed154e


Let's turn those into proper URLs:

In [11]:
links = soup.select("ul li a")

for link in links[:3]:
    print(BASE_URL + link["href"])

https://scraping-practice-jsvine.vercel.app/launches/directory/cb50c8c
https://scraping-practice-jsvine.vercel.app/launches/directory/6840f28
https://scraping-practice-jsvine.vercel.app/launches/directory/0ed154e
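
String concatenation works here because `BASE_URL` ends in a slash and each `href` is a bare relative path. When that is not guaranteed, the standard library's `urljoin` resolves links the way a browser would. A minimal sketch; the second and third calls use made-up variants of the `href` and base URL to show the difference:

```python
from urllib.parse import urljoin

BASE_URL = "https://scraping-practice-jsvine.vercel.app/launches/directory/"

# A bare relative href, as on the practice site.
print(urljoin(BASE_URL, "cb50c8c"))

# Hypothetical: a root-relative href replaces the whole path.
print(urljoin(BASE_URL, "/launches/directory/cb50c8c"))

# Hypothetical: without the trailing slash, the last path segment is replaced.
print(urljoin(BASE_URL.rstrip("/"), "cb50c8c"))
```

The first two calls yield the same sub-page URL. The third drops `directory` from the path, which is why the trailing slash on the base URL matters.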


Now that we have the URLs, we can take the same approach we took with the paginated tables:

In [12]:
links = soup.select("ul li a")

for link in links[:3]:
    page_url = BASE_URL + link["href"]
    page_html = requests.get(page_url).text
    page_soup = BeautifulSoup(page_html, "html.parser")
    heading = page_soup.select("h1")[0]
    print(heading.text)

Commercial Space Launches: ABL Space Systems
Commercial Space Launches: American Rocket
Commercial Space Launches: Armadillo Aerospace


## Exercise

Create a `DataFrame` of all data for the first three companies in the directory. Use the code we used for the paginated tables as inspiration.

In [13]:
all_rows = []

links = soup.select("ul li a")

# Only the first three companies, for practice
for link in links[:3]:
    page_url = BASE_URL + link["href"]
    print("Fetching " + page_url)
    page_html = requests.get(page_url).text
    # Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
    page_soup = BeautifulSoup(page_html, "html.parser")
    table = page_soup.select("table")[0]
    row_els = table.select("tbody tr")
    # Each <tr> becomes a list of its cells' text
    for tr in row_els:
        cells = [td.text for td in tr.select("td")]
        all_rows.append(cells)

pd.DataFrame(all_rows)

Fetching https://scraping-practice-jsvine.vercel.app/launches/directory/cb50c8c
Fetching https://scraping-practice-jsvine.vercel.app/launches/directory/6840f28
Fetching https://scraping-practice-jsvine.vercel.app/launches/directory/0ed154e


| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | Jan 10, 2023 | Demonstration Mission-1 | RS1 | ABL Space Systems | AK |
| 1 | Oct 5, 1989 | SET-1 | SMLV | American Rocket | CA |
| 2 | Jan 5, 2013 | Scientific | STIG-B III | Armadillo Aerospace | NM |
| 3 | Nov 4, 2012 | Scientific | STIG-B | Armadillo Aerospace | NM |
| 4 | Oct 6, 2012 | Scientific | STIG-B | Armadillo Aerospace | NM |

