# Assignment

### Question 1

Question: The scraping of `https://www.scrapethissite.com/pages/forms/` in the last section assumes a hardcoded (fixed) number of pages. Can you improve the code by removing the hardcoded page count and instead using the `»` button to determine whether there are more pages to scrape? Hint: Use a `while` loop.

```python
def parse_and_extract_rows(soup: BeautifulSoup):
    """
    Extract table rows from the parsed HTML.

    Args:
        soup: The parsed HTML.

    Returns:
        An iterator of dictionaries with the data from the current page.
    """
    header = soup.find('tr')
    headers = [th.text.strip() for th in header.find_all('th')]
    teams = soup.find_all('tr', 'team')
    for team in teams:
        row_dict = {}
        # Use a distinct loop name so the header row variable is not shadowed
        for name, col in zip(headers, team.find_all('td')):
            row_dict[name] = col.text.strip()
        yield row_dict
```

In [1]:
import requests
from bs4 import BeautifulSoup
import time

def parse_and_extract_rows(soup: BeautifulSoup):
    """
    Extract table rows from the parsed HTML.

    Args:
        soup: The parsed HTML.

    Returns:
        An iterator of dictionaries with the data from the current page.
    """
    header = soup.find('tr')
    headers = [th.text.strip() for th in header.find_all('th')]
    teams = soup.find_all('tr', 'team')
    for team in teams:
        row_dict = {}
        # Use a distinct loop name so the header row variable is not shadowed
        for name, col in zip(headers, team.find_all('td')):
            row_dict[name] = col.text.strip()
        yield row_dict

def scrape_all_pages():
    """
    Scrape all pages from the hockey teams website using pagination detection.
    Uses the presence of the 'Next' button (») to determine if there are more pages.
    """
    base_url = "https://www.scrapethissite.com/pages/forms/"
    rows = []
    page = 1
    
    while True:
        # Construct URL for current page
        if page == 1:
            url = base_url
        else:
            url = f"{base_url}?page_num={page}"
        
        # Make request; fail fast on network stalls or HTTP errors
        r = requests.get(url, timeout=10)
        r.raise_for_status()
        soup = BeautifulSoup(r.text, "html.parser")
        
        # Extract data from the current page once, using the provided function
        page_rows = list(parse_and_extract_rows(soup))
        rows.extend(page_rows)
        print(f"Scraped page {page}, found {len(page_rows)} teams")

        # Check for a "Next" button (») to determine whether more pages exist
        # The Next button has aria-label="Next" and contains the » symbol
        next_button = soup.find("a", {"aria-label": "Next"})

        if not next_button:
            # No more pages, so stop after logging the final page
            break

        # Move to next page
        page += 1

        # Be respectful to the server - pause between requests
        time.sleep(1)

    print(f"Scraping complete! Total pages: {page}, Total teams: {len(rows)}")
    return rows

# Usage
all_team_data = scrape_all_pages()

Scraped page 1, found 25 teams
Scraped page 2, found 25 teams
Scraped page 3, found 25 teams
Scraped page 4, found 25 teams
Scraped page 5, found 25 teams
Scraped page 6, found 25 teams
Scraped page 7, found 25 teams
Scraped page 8, found 25 teams
Scraped page 9, found 25 teams
Scraped page 10, found 25 teams
Scraped page 11, found 25 teams
Scraped page 12, found 25 teams
Scraped page 13, found 25 teams
Scraped page 14, found 25 teams
Scraped page 15, found 25 teams
Scraped page 16, found 25 teams
Scraped page 17, found 25 teams
Scraped page 18, found 25 teams
Scraped page 19, found 25 teams
Scraped page 20, found 25 teams
Scraped page 21, found 25 teams
Scraped page 22, found 25 teams
Scraped page 23, found 25 teams
Scraped page 24, found 7 teams
Scraping complete! Total pages: 24, Total teams: 582
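
The stopping condition of the `while` loop can also be checked without touching the network. The following is a minimal sketch, not part of the original solution: `collect_rows`, `fake_fetch`, and the `max_pages` safety cap are hypothetical names introduced here, and the fetch step is assumed to return the rows of a page together with a flag saying whether a "Next" link was present.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical page type: (rows on the page, whether a "Next" link is present)
Page = Tuple[List[Dict[str, str]], bool]


def collect_rows(fetch_page: Callable[[int], Page],
                 max_pages: int = 1000) -> List[Dict[str, str]]:
    """Collect rows page by page until a page reports no "Next" link."""
    rows: List[Dict[str, str]] = []
    page = 1
    # max_pages is an assumed safety cap so a misbehaving site cannot loop forever
    while page <= max_pages:
        page_rows, has_next = fetch_page(page)
        rows.extend(page_rows)
        if not has_next:
            break
        page += 1
    return rows


def fake_fetch(page: int) -> Page:
    """Stand-in for requests + BeautifulSoup: a made-up three-page site."""
    data = {1: ["A", "B"], 2: ["C", "D"], 3: ["E"]}
    page_rows = [{"Team Name": name} for name in data[page]]
    return page_rows, page < 3  # only the last page lacks a "Next" link


all_rows = collect_rows(fake_fetch)
print(len(all_rows))
```

Separating the loop from the fetching step in this way makes it possible to confirm that the final page is included before running the scraper against the real site.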


In [3]:
all_team_data[0]

{'Team Name': 'Boston Bruins',
 'Year': '1990',
 'Wins': '44',
 'Losses': '24',
 'OT Losses': '',
 'Win %': '0.55',
 'Goals For (GF)': '299',
 'Goals Against (GA)': '264',
 '+ / -': '35'}

In [4]:
all_team_data[-1]

{'Team Name': 'Winnipeg Jets',
 'Year': '2011',
 'Wins': '37',
 'Losses': '35',
 'OT Losses': '10',
 'Win %': '0.451',
 'Goals For (GF)': '225',
 'Goals Against (GA)': '246',
 '+ / -': '-21'}
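
Every value in the scraped rows is a string, and missing fields such as `'OT Losses'` come back as an empty string. As a possible follow-up, the sketch below converts a row to numeric types where that is possible. The helper `convert_row` is a hypothetical name, and mapping empty strings to `None` is an assumption about how missing values should be represented.

```python
from typing import Dict, Union

Value = Union[str, int, float, None]


def convert_row(row: Dict[str, str]) -> Dict[str, Value]:
    """Convert numeric-looking strings to int or float; empty strings to None."""
    converted: Dict[str, Value] = {}
    for key, text in row.items():
        if text == "":
            converted[key] = None  # assumed representation of a missing value
            continue
        try:
            converted[key] = int(text)
        except ValueError:
            try:
                converted[key] = float(text)
            except ValueError:
                converted[key] = text  # not numeric, keep the original string
    return converted


# Sample row copied from the output of all_team_data[0]
sample = {'Team Name': 'Boston Bruins',
          'Year': '1990',
          'Wins': '44',
          'Losses': '24',
          'OT Losses': '',
          'Win %': '0.55',
          'Goals For (GF)': '299',
          'Goals Against (GA)': '264',
          '+ / -': '35'}

typed = convert_row(sample)
print(typed)
```

Applied to the full result, this would be `[convert_row(row) for row in all_team_data]`.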