# Web Crawling with Python — Notebook 3
## Crawling Multiple Pages

---

### Why Crawl Multiple Pages?

Many websites organize content across multiple pages (pagination).  
A crawler should be able to navigate through these pages and collect data from all of them.


---

## Task 1: Review — Download and Parse a Single Page

Below is a pattern you’ve seen before:


In [2]:
import requests
from bs4 import BeautifulSoup

url = "https://quotes.toscrape.com/page/1/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# TODO: Print the page title (as a warmup)
title_tag = soup.title
print(title_tag)

<title>Quotes to Scrape</title>
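
The download step can fail silently: without a status check, an error page gets parsed as if it were the real content. The sketch below wraps the request in a small helper that sets a timeout and raises on HTTP error codes. The name `fetch_html` and its `http` parameter are hypothetical, introduced here only for illustration; `http` may be the `requests` module itself or a `requests.Session`.

```python
import requests


def fetch_html(url, http=requests, timeout=10):
    """Download url and return its HTML, raising on HTTP error statuses.

    'http' is an assumption for illustration: anything with a
    requests-style get(url, timeout=...) method works.
    """
    response = http.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text


# Usage (requires network access, so it is left commented out):
# html = fetch_html("https://quotes.toscrape.com/page/1/")
```

Passing the HTTP client in as a parameter also makes the helper easy to exercise without touching the network.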


---

## Task 2: Find and Print All Quotes on the Page

This website has quotes inside `<span class="text">`.

**Your task:**  
- Find all `<span>` tags with class `"text"`
- Print the text of each quote


In [3]:
# Starter:
quotes = soup.find_all("span", class_="text")

# TODO: Loop through 'quotes' and print each quote text
for quote in quotes:
    # get_text() returns only the quote text, without the surrounding <span> markup
    print(quote.get_text(), end="\n" * 2)


“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”

“It is our choices, Harry, that show what we truly are, far more than our abilities.”

“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”

“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”

“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”

“Try not to become a man of success. Rather become a man of value.”

“It is better to be hated for what you are than to be loved for what you are not.”
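
A tag object and its text are different things: converting a tag to a string keeps the markup, while `get_text()` returns only the text inside it. A minimal sketch of the difference, using a small inline HTML snippet (an assumption modelled on the site's quote spans) instead of a live request:

```python
from bs4 import BeautifulSoup

# Inline sample markup (hypothetical, shaped like the site's quote spans)
sample_html = """
<div class="quote"><span class="text" itemprop="text">First quote.</span></div>
<div class="quote"><span class="text" itemprop="text">Second quote.</span></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
spans = soup.find_all("span", class_="text")

# str(tag) keeps the markup; get_text() returns only the text inside the tag
as_markup = [str(span) for span in spans]
as_text = [span.get_text() for span in spans]

print(as_markup[0])
print(as_text[0])
```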

---

## Task 3: Crawling Multiple Pages (Pagination)

On this site, there is a **Next** button at the bottom of each page.  
The button’s HTML looks like: `<li class="next"><a href="/page/2/">Next <span aria-hidden="true">→</span></a></li>`

**Your task:**  
- Check if there is a **Next** page
- If so, get its URL and repeat the crawling process

**Hint:** Use a `while` loop.

**Below is a starter:**


In [10]:
base_url = "https://quotes.toscrape.com"
page_url = "/page/1/"
all_quotes = []

while page_url:
    full_url = base_url + page_url
    response = requests.get(full_url)
    soup = BeautifulSoup(response.text, "html.parser")

    # TODO: Extract and print quotes on this page (reuse your code from above)
    quotes = soup.find_all("div", class_="quote")
    for quote in quotes:
        quote_text = quote.find("span", class_="text").text
        print(f"Quote: {quote_text}")
    

    # Find the Next button and update page_url (or set to None if there is no next page)
    next_btn = soup.find("li", class_="next")
    if next_btn:
        next_link = next_btn.find("a")["href"]
        page_url = next_link
    else:
        page_url = None


Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Quote: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Quote: “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Quote: “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Quote: “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Quote: “Try not to become a man of success. Rather become a man of value.”
Quote: “It is better to be hated for what you are than to be loved for what you are not.”
Quote: “I have not failed. I've just found 10,000 ways that won't work.”
Quote: “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
Quote: “A day without sunshine is like, you know, night.”
Quote: “This life is what
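
Concatenating `base_url` and the `href` works on this site because every Next link is a root-relative path. As a sketch of a more general approach, `urllib.parse.urljoin` resolves the link against the page it came from, which also handles relative and absolute hrefs. The helper name `find_next_url` and the two HTML snippets are assumptions for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_next_url(html, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    next_item = soup.find("li", class_="next")
    if next_item is None:
        return None
    link = next_item.find("a", href=True)
    if link is None:
        return None
    # urljoin resolves hrefs such as "/page/2/" against the current page URL
    return urljoin(current_url, link["href"])


# Hypothetical pager fragments: one with a Next link, one without
page_with_next = '<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>'
last_page = '<ul class="pager"><li class="previous"><a href="/page/9/">Previous</a></li></ul>'

print(find_next_url(page_with_next, "https://quotes.toscrape.com/page/1/"))
print(find_next_url(last_page, "https://quotes.toscrape.com/page/10/"))
```

With a helper like this, the `while` loop can track a full URL and stop as soon as the helper returns `None`.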

---

## Task 4: Store All Quotes

**Modify your loop:**  
Instead of just printing, **append** all quote texts to the `all_quotes` list.

After crawling all pages, print the total number of quotes collected.


In [12]:
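# Store all quotes in the all_quotes list, then print the total count at the end
all_quotes = []
page_url = "/page/1/"
while page_url:
    soup = BeautifulSoup(requests.get(base_url + page_url).text, "html.parser")
    # Append every quote on this page, not only the last one seen
    for quote in soup.find_all("div", class_="quote"):
        all_quotes.append({"text": quote.find("span", class_="text").text})
    next_btn = soup.find("li", class_="next")
    page_url = next_btn.find("a")["href"] if next_btn else None

print(f"Total quotes collected: {len(all_quotes)}")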
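
Collecting results is easier to check when the crawl is a function that returns a list. The sketch below follows Next links and appends every quote text it finds. To keep it runnable offline, pages come from a small in-memory dictionary; `FAKE_PAGES`, `crawl_quotes`, and the `fetch` parameter are all hypothetical names. Note that this version stores plain strings rather than dictionaries.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the site: maps a path to that page's HTML
FAKE_PAGES = {
    "/page/1/": (
        '<div class="quote"><span class="text">A</span></div>'
        '<div class="quote"><span class="text">B</span></div>'
        '<ul><li class="next"><a href="/page/2/">Next</a></li></ul>'
    ),
    "/page/2/": '<div class="quote"><span class="text">C</span></div>',
}


def crawl_quotes(fetch, start_path="/page/1/"):
    """Follow Next links from start_path and return every quote text found."""
    collected = []
    path = start_path
    while path:
        soup = BeautifulSoup(fetch(path), "html.parser")
        for span in soup.select("div.quote span.text"):
            collected.append(span.get_text())
        next_item = soup.find("li", class_="next")
        path = next_item.find("a")["href"] if next_item else None
    return collected


quotes_found = crawl_quotes(FAKE_PAGES.__getitem__)
print(len(quotes_found))
```

Against the real site, `fetch` could be something like `lambda path: requests.get(base_url + path).text`, assuming `requests` and `base_url` are defined as in the cells above.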


---

## Challenge: Save Quotes to a File

**Optional:**  
- Write all the quotes you collected into a text file, one per line.


In [16]:
# TODO: Save the quotes in 'all_quotes' to a file (e.g., quotes.txt)
output_file = "quotes.txt"
with open(output_file, "w", encoding="utf-8") as f:
    for quote_content in all_quotes:
        f.write(quote_content["text"] + "\n")  # Write each quote on its own line

print(f"All quotes saved to {output_file}")


All quotes saved to quotes.txt
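
A success message only shows that the code ran, not that the file holds what you expect. One quick check is to read the file back and compare it with the list. This is a sketch under stated assumptions: the sample data is made up, and the file goes into a temporary directory so a real `quotes.txt` is not overwritten.

```python
import os
import tempfile

# Hypothetical sample data in the same shape as the notebook's list of dicts
all_quotes = [{"text": "First quote."}, {"text": "Second quote."}]

# Write into a temporary directory so this sketch does not overwrite a real file
output_file = os.path.join(tempfile.mkdtemp(), "quotes.txt")

with open(output_file, "w", encoding="utf-8") as f:
    for quote in all_quotes:
        f.write(quote["text"] + "\n")

# Read the file back and confirm each quote landed on its own line
with open(output_file, encoding="utf-8") as f:
    lines = f.read().splitlines()

print(len(lines) == len(all_quotes))
```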


---

## Reflection

- What challenges did you face with pagination?
- How might you adapt this approach for sites with more complex navigation?

- Write your thoughts here:

- The pagination approach shown here (looking for an `<li>` with class `next` and the `<a>` tag inside it) depends on this particular site's markup. On other sites, the main challenges are inconsistent HTML structure, rate limiting and IP blocking, and session or login requirements.
- In summary, the key to adapting to complex pagination is a deeper understanding of how the target website works, particularly how it loads new content. This almost always involves careful inspection of the HTML structure and, often, of the network requests made by your browser.
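
Rate limiting is the easiest of these challenges to address in code: pause between requests so the crawler does not hit the server in a tight loop. The sketch below is one possible approach, not a standard API. The `Throttle` class and its `clock` and `sleep` parameters are hypothetical, and the 0.05 second interval is only there to keep the demo fast; a real crawl would likely use a longer delay.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=0.05)
for page_number in range(1, 4):
    throttle.wait()  # in a crawler, call this right before each requests.get(...)
    print(f"would fetch page {page_number}")
```

The first call returns immediately; later calls sleep only for whatever part of the interval has not already elapsed.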