# Web Crawling with Python — Notebook 3
## Crawling Multiple Pages

---

### Why Crawl Multiple Pages?

Many websites organize content across multiple pages (pagination).  
A crawler should be able to navigate through these pages and collect data from all of them.


---

## Task 1: Review — Download and Parse a Single Page

Below is a pattern you’ve seen before:


In [None]:
import requests
from bs4 import BeautifulSoup

url = "https://quotes.toscrape.com/page/1/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# TODO: Print the page title (as a warmup)
print(soup.title.get_text())


---

## Task 2: Find and Print All Quotes on the Page

This website has quotes inside `<span class="text">`.

**Your task:**  
- Find all `<span>` tags with class `"text"`
- Print the text of each quote


In [None]:
# Starter:
quotes = soup.find_all("span", class_="text")

# TODO: Loop through 'quotes' and print each quote text
for quote in quotes:
    # get_text() returns only the quote text, without the surrounding <span> markup
    print(quote.get_text(), end="\n\n")
    

---

## Task 3: Crawling Multiple Pages (Pagination)

On this site, there is a **Next** button at the bottom of each page.  
The button’s HTML looks like: `<li class="next"><a href="/page/2/">Next <span aria-hidden="true">→</span></a></li>`

**Your task:**  
- Check if there is a **Next** page
- If so, get its URL and repeat the crawling process

**Hint:** Use a `while` loop.

**Below is a starter:**


In [None]:
base_url = "https://quotes.toscrape.com"
page_url = "/page/1/"
all_quotes = []

while page_url:
    full_url = base_url + page_url
    response = requests.get(full_url)
    soup = BeautifulSoup(response.text, "html.parser")

    # TODO: Extract and print quotes on this page (reuse your code from above)
    quotes = soup.find_all("div", class_="quote")
    for quote in quotes:
        text = quote.find("span", class_="text").get_text()
        print(f"Quote: {text}")

    # Find the Next button and update page_url (or set to None if there is no next page)
    next_btn = soup.find("li", class_="next")
    if next_btn:
        next_link = next_btn.find("a")["href"]
        page_url = next_link
    else:
        page_url = None
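
The loop above builds each address by concatenating `base_url` and the `href`, which works here because the site uses root-relative links such as `/page/2/`. A minimal sketch of a more tolerant alternative, assuming the standard library's `urllib.parse.urljoin`; the helper name `resolve_next` is hypothetical and not part of the notebook:

```python
from urllib.parse import urljoin


def resolve_next(current_url, href):
    """Return the absolute URL of the next page, or None when there is no link."""
    if not href:
        return None
    # urljoin handles root-relative, relative, and already-absolute hrefs.
    return urljoin(current_url, href)


current = "https://quotes.toscrape.com/page/1/"
print(resolve_next(current, "/page/2/"))  # root-relative href, as on this site
print(resolve_next(current, None))        # no Next button was found
```

With a helper like this, the `while` loop could track full URLs instead of paths, so the same code would also cope with sites whose Next link is relative or absolute.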


---

## Task 4: Store All Quotes

**Modify your loop:**  
Instead of just printing, **append** all quote texts to the `all_quotes` list.

After crawling all pages, print the total number of quotes collected.


In [None]:
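# TODO: Store all quotes in the all_quotes list, then print the total count at the end
all_quotes = []
page_url = "/page/1/"

while page_url:
    soup = BeautifulSoup(requests.get(base_url + page_url).text, "html.parser")
    for quote in soup.find_all("div", class_="quote"):
        all_quotes.append({"text": quote.find("span", class_="text").get_text()})
    next_btn = soup.find("li", class_="next")
    page_url = next_btn.find("a")["href"] if next_btn else None

print(f"Total quotes collected: {len(all_quotes)}")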
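
One way to make the collect-everything loop easier to test is to separate the loop logic from the HTTP request. The sketch below is only an illustration under stated assumptions: `crawl_all`, `get_page`, `max_pages`, and the `fake_site` data are hypothetical names and values, not part of the notebook. It also keeps a set of visited URLs so that a Next link pointing back to an earlier page cannot cause an endless loop.

```python
def crawl_all(start_url, get_page, max_pages=100):
    """Follow 'next' links from start_url and collect every quote.

    get_page(url) is a hypothetical callable returning (quotes, next_url),
    where next_url is None on the last page.
    """
    collected = []
    visited = set()
    url = start_url
    # Stop on the last page, on a repeated URL, or after max_pages pages.
    while url and url not in visited and len(visited) < max_pages:
        visited.add(url)
        quotes, url = get_page(url)
        collected.extend(quotes)
    return collected


# Stand-in data instead of real HTTP requests (made up for illustration).
fake_site = {
    "/page/1/": (["quote A", "quote B"], "/page/2/"),
    "/page/2/": (["quote C"], None),
}

demo_quotes = crawl_all("/page/1/", lambda url: fake_site[url])
print(len(demo_quotes))  # 3
```

In a real crawl, `get_page` would wrap the `requests.get` and `BeautifulSoup` steps from Task 3.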

---

## Challenge: Save Quotes to a File

**Optional:**  
- Write all the quotes you collected into a text file, one per line.


In [None]:
# TODO: Save the quotes in 'all_quotes' to a file (e.g., quotes.txt)
with open("quotes.txt", "w", encoding="utf-8") as f:
    for quote in all_quotes:
        f.write(quote["text"] + "\n")
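
To check that a file really holds one quote per line, it can help to write it and read it straight back. The following is a small sketch with made-up sample data, using `pathlib` and a temporary directory so nothing is left on disk. It assumes the quotes are plain strings and may contain non-ASCII characters such as curly quotation marks, which is why the encoding is set explicitly.

```python
import tempfile
from pathlib import Path

# Made-up sample data standing in for the collected quotes.
sample_quotes = ["“First sample quote.”", "“Second sample quote.”"]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "quotes.txt"
    # Explicit UTF-8 so non-ASCII characters are written the same way on every platform.
    out_path.write_text("\n".join(sample_quotes) + "\n", encoding="utf-8")
    # Read the file back to confirm there is one quote per line.
    lines_read = out_path.read_text(encoding="utf-8").splitlines()

print(len(lines_read))  # 2
```

Note that a quote containing a line break of its own would span several lines in the file; a format such as JSON or CSV avoids that ambiguity.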


---

## Reflection

- What challenges did you face with pagination?
- How might you adapt this approach for sites with more complex navigation?

- Write your thoughts here:

- The main challenges with the pagination approach shown here (finding an `li` with class `next` and the `a` tag inside it) stem from how much website structures vary: inconsistent HTML structure, rate limiting and IP blocking, and session or login requirements.
- In summary, adapting to complex pagination requires a deeper understanding of how the target website works, particularly how it loads new content. This almost always involves careful inspection of the HTML structure and, often, of the network requests made by your browser.
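
Rate limiting is one of the challenges named above. A common mitigation is to pause between requests. The class below is a minimal sketch of that idea, not a complete solution: `Throttle`, its parameters, and the 0.05-second interval are hypothetical choices for illustration, and a suitable delay depends on the site being crawled.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the previous call: sleep for the rest of the interval.
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=0.05)  # 0.05 s is an arbitrary demo value
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # in a crawler, call this right before each requests.get(...)
elapsed = time.monotonic() - start
print(f"{elapsed:.2f}s for 3 calls")
```

The first call returns immediately; each later call waits only as long as needed, so time already spent parsing a page counts toward the delay.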