## The site we are scraping:

https://www.scrapethissite.com/pages/forms/?page_num=1

#### **Steps:**

1. Import the necessary packages:
   - `import requests` 
   - `from bs4 import BeautifulSoup` 
2. Use `requests.get()` to retrieve the necessary details of the page to be scraped.
   
   - `page = requests.get("page_url")` where `page_url` is the URL of page to be scraped.
   - Check `page.status_code` to confirm the request succeeded: `200` means OK, while a `4xx` or `5xx` code signals an error (for example, `404` if the page doesn't exist).
3. Use `BeautifulSoup` to create a soup object: `page_soup = BeautifulSoup(markup=page.content, features="html.parser")` or simply `page_soup = BeautifulSoup(page.content, "html.parser")`.
   
4. Alternatively, use `page_soup = BeautifulSoup(page.text, "html.parser")`. `page.content` returns the response body as raw `bytes`, whereas `page.text` returns it decoded to a `str`.
   
   The soup object now holds the parsed HTML of the page.
   
5. The next step is to extract the relevant details, guided by inspecting the page's HTML. This is done with the `find`, `find_all` and `findChild` methods (`findAll` and `findChild` are older camelCase aliases of `find_all` and `find`), in conjunction with `get` and `get_text` (also available as `getText`).
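
The steps above can be sketched end to end without touching the network. This is a minimal sketch, assuming a small hand-written HTML snippet: the `html_bytes` value and its contents are made up for illustration and are not copied from the site. In the real workflow the markup would come from `page.content` or `page.text`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for page.content (bytes), for illustration only.
html_bytes = b"""
<table>
  <tr class="team"><td class="name"> Team A </td><td class="wins">10</td></tr>
  <tr class="team"><td class="name"> Team B </td><td class="wins">7</td></tr>
</table>
"""

# BeautifulSoup accepts bytes (like page.content) or str (like page.text).
soup_from_bytes = BeautifulSoup(html_bytes, "html.parser")
soup_from_text = BeautifulSoup(html_bytes.decode("utf-8"), "html.parser")

# find_all returns every matching tag; find returns only the first match.
rows = soup_from_bytes.find_all("tr", class_="team")
first_name = soup_from_bytes.find("td", class_="name").get_text(strip=True)

# get_text(strip=True) removes surrounding whitespace; get() reads an attribute.
teams = [
    {
        "name": row.find("td", class_="name").get_text(strip=True),
        "wins": int(row.find("td", class_="wins").get_text(strip=True)),
        "row_class": row.get("class"),  # class is multi-valued, so this is a list
    }
    for row in rows
]
print(teams)
```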


In [1]:
import requests
from bs4 import BeautifulSoup

We start by writing a function that scrapes a single page, given its URL. Somewhat unconventionally, the function also takes the list to which the scraped rows are appended.

In [2]:
def page_scraper(page_url: str, team_data: list):
    page = requests.get(page_url)
    page_soup = BeautifulSoup(page.content, features="html.parser")

    # Extracting Table Data
    table_data = page_soup.find_all("tr", class_="team")
    # The tag and class names come from inspecting the site's HTML.
    for row in table_data:
        team = {}

        name = row.find("td", class_="name").getText(strip=True)
        year = int(row.find("td", class_="year").getText(strip=True))
        wins = int(row.find("td", class_="wins").getText(strip=True))
        losses = int(row.find("td", class_="losses").getText(strip=True))
        # the ot-losses cell can be empty, in which case we fall back to 0
        ot_text = row.find("td", class_="ot-losses").getText(strip=True)
        ot_losses = int(ot_text) if ot_text else 0
        gf = int(row.find("td", class_="gf").getText(strip=True))
        ga = int(row.find("td", class_="ga").getText(strip=True))

        # On inspection, the win% cell carries one of two class strings
        # ("pct text-success" or "pct text-danger"), so we check for both.
        if not row.find("td", class_="pct text-success"):
            win_pct = float(
                row.find("td", class_="pct text-danger").getText(strip=True))
        else:
            win_pct = float(
                row.find("td", class_="pct text-success").getText(strip=True))
        # Likewise, the goal-difference (gd) cell carries one of two class strings.
        if not row.find("td", class_="diff text-success"):
            gd = int(row.find("td", class_="diff text-danger").getText(strip=True))
        else:
            gd = int(row.find("td", class_="diff text-success").getText(strip=True))

        team["name"] = name
        team["year"] = year
        team["wins"] = wins
        team["losses"] = losses
        team["ot-losses"] = ot_losses
        team["win%"] = win_pct
        team["gf"] = gf
        team["ga"] = ga
        team["gd"] = gd

        team_data.append(team)
    print(f"DONE : {page_url}")  # just to see progress, entirely optional
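
The two-branch checks for `win%` and `gd` above work because a multi-word `class_` value is matched against the exact class string. As a possible simplification, a single-class search such as `class_="pct"` matches a tag whenever that class appears anywhere in its class list. The sketch below shows the difference on a made-up row; the HTML is hypothetical and only mimics the class names used above.

```python
from bs4 import BeautifulSoup

# Hypothetical row mimicking the class names used in page_scraper.
html = """
<table><tr class="team">
  <td class="pct text-danger">0.388</td>
  <td class="diff text-success">14</td>
</tr></table>
"""
row = BeautifulSoup(html, "html.parser").find("tr", class_="team")

# A multi-word class_ is compared with the exact attribute string,
# so searching for the other colour class finds nothing.
assert row.find("td", class_="pct text-success") is None
assert row.find("td", class_="pct text-danger") is not None

# A single class matches whenever it is one of the tag's classes,
# regardless of which colour class accompanies it.
win_pct = float(row.find("td", class_="pct").get_text(strip=True))
gd = int(row.find("td", class_="diff").get_text(strip=True))
print(win_pct, gd)
```

With this approach each of the two `if`/`else` blocks would reduce to a single lookup, at the cost of relying on `pct` and `diff` being unique within a row.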

I do a test run with the function.

In [5]:
data = []
page_scraper("https://www.scrapethissite.com/pages/forms/?page_num=1", data)
len(data), data[:3]

DONE : https://www.scrapethissite.com/pages/forms/?page_num=1


(25,
 [{'name': 'Boston Bruins',
   'year': 1990,
   'wins': 44,
   'losses': 24,
   'ot-losses': 0,
   'win%': 0.55,
   'gf': 299,
   'ga': 264,
   'gd': 35},
  {'name': 'Buffalo Sabres',
   'year': 1990,
   'wins': 31,
   'losses': 30,
   'ot-losses': 0,
   'win%': 0.388,
   'gf': 292,
   'ga': 278,
   'gd': 14},
  {'name': 'Calgary Flames',
   'year': 1990,
   'wins': 46,
   'losses': 26,
   'ot-losses': 0,
   'win%': 0.575,
   'gf': 344,
   'ga': 263,
   'gd': 81}])
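
A quick way to validate a test run is to check an invariant of the data. In the rows printed above, `gd` equals `gf - ga`, so a small helper can flag rows where that does not hold, for instance because a cell was read from the wrong column. The helper name `find_gd_mismatches` is hypothetical; the sample reuses the three rows from the output, trimmed to the relevant keys.

```python
def find_gd_mismatches(rows: list) -> list:
    """Return the rows whose goal difference is not gf - ga."""
    return [row for row in rows if row["gd"] != row["gf"] - row["ga"]]


sample = [
    {"name": "Boston Bruins", "gf": 299, "ga": 264, "gd": 35},
    {"name": "Buffalo Sabres", "gf": 292, "ga": 278, "gd": 14},
    {"name": "Calgary Flames", "gf": 344, "ga": 263, "gd": 81},
]
print(find_gd_mismatches(sample))  # [] means every row is consistent
```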

That was just one page of data. To get the data from every page, we write a function that scrapes the whole site for team details. Once we go past the last page of team data, the site returns a page that displays "0 items", and we use that as the break condition. This is more elegant than manually counting the pages and looping over that fixed number.

In [6]:
def site_scraper(team_data: list):
    page_num = 1

    while True:
        base_url = f"https://www.scrapethissite.com/pages/forms/?page_num={page_num}"
        tester = requests.get(base_url)
        tester_soup = BeautifulSoup(tester.content, "html.parser")
        validity = tester_soup.find("section").findChild(
            "div", class_="col-md-12").findChild("small").get_text()

        if validity == "0 items":
            break
        page_scraper(base_url, team_data)
        page_num += 1
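
Note that `site_scraper` downloads every page twice: once to read the item count and once more inside `page_scraper`. One possible alternative, sketched below with hypothetical names (`scrape_until_empty`, `fetch_rows`, `max_pages`), is to stop as soon as a page yields no rows. The fetching step is passed in as a function so the loop can be tried without network access. This assumes that a page past the end simply contains no `tr.team` rows, which is not verified here.

```python
from typing import Callable


def scrape_until_empty(
    fetch_rows: Callable[[int], list],
    max_pages: int = 1000,
) -> list:
    """Collect rows page by page until a page comes back empty.

    fetch_rows(page_num) is a hypothetical callable returning the list of
    rows parsed from that page; max_pages is a safety cap against an
    endless loop if the stop condition never triggers.
    """
    collected = []
    for page_num in range(1, max_pages + 1):
        rows = fetch_rows(page_num)
        if not rows:  # an empty page marks the end of the data
            break
        collected.extend(rows)
    return collected


# Stand-in for the real fetch-and-parse step: three pages of fake data.
fake_site = {1: ["a", "b"], 2: ["c"], 3: ["d", "e"]}


def fake_fetch(page_num: int) -> list:
    return fake_site.get(page_num, [])


print(scrape_until_empty(fake_fetch))  # ['a', 'b', 'c', 'd', 'e']
```

In real use, `fetch_rows` would wrap the request and the row parsing done in `page_scraper`, returning the parsed rows instead of appending them to a shared list.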

Finally, we run the scraper to get all the team data.

In [7]:
all_the_data = []
site_scraper(all_the_data)

DONE : https://www.scrapethissite.com/pages/forms/?page_num=1
DONE : https://www.scrapethissite.com/pages/forms/?page_num=2
DONE : https://www.scrapethissite.com/pages/forms/?page_num=3
DONE : https://www.scrapethissite.com/pages/forms/?page_num=4
DONE : https://www.scrapethissite.com/pages/forms/?page_num=5
DONE : https://www.scrapethissite.com/pages/forms/?page_num=6
DONE : https://www.scrapethissite.com/pages/forms/?page_num=7
DONE : https://www.scrapethissite.com/pages/forms/?page_num=8
DONE : https://www.scrapethissite.com/pages/forms/?page_num=9
DONE : https://www.scrapethissite.com/pages/forms/?page_num=10
DONE : https://www.scrapethissite.com/pages/forms/?page_num=11
DONE : https://www.scrapethissite.com/pages/forms/?page_num=12
DONE : https://www.scrapethissite.com/pages/forms/?page_num=13
DONE : https://www.scrapethissite.com/pages/forms/?page_num=14
DONE : https://www.scrapethissite.com/pages/forms/?page_num=15
DONE : https://www.scrapethissite.com/pages/forms/?page_num=16
D

In [8]:
len(all_the_data), all_the_data[-10:]

(582,
 [{'name': 'Philadelphia Flyers',
   'year': 2011,
   'wins': 47,
   'losses': 26,
   'ot-losses': 9,
   'win%': 0.573,
   'gf': 264,
   'ga': 232,
   'gd': 32},
  {'name': 'Phoenix Coyotes',
   'year': 2011,
   'wins': 42,
   'losses': 27,
   'ot-losses': 13,
   'win%': 0.512,
   'gf': 216,
   'ga': 204,
   'gd': 12},
  {'name': 'Pittsburgh Penguins',
   'year': 2011,
   'wins': 51,
   'losses': 25,
   'ot-losses': 6,
   'win%': 0.622,
   'gf': 282,
   'ga': 221,
   'gd': 61},
  {'name': 'San Jose Sharks',
   'year': 2011,
   'wins': 43,
   'losses': 29,
   'ot-losses': 10,
   'win%': 0.524,
   'gf': 228,
   'ga': 210,
   'gd': 18},
  {'name': 'St. Louis Blues',
   'year': 2011,
   'wins': 49,
   'losses': 22,
   'ot-losses': 11,
   'win%': 0.598,
   'gf': 210,
   'ga': 165,
   'gd': 45},
  {'name': 'Tampa Bay Lightning',
   'year': 2011,
   'wins': 38,
   'losses': 36,
   'ot-losses': 8,
   'win%': 0.463,
   'gf': 235,
   'ga': 281,
   'gd': -46},
  {'name': 'Toronto Maple Leaf

We now write all of this to a text file (`teams.txt`), for completeness' sake.

In [9]:
with open("teams.txt", "w") as file:
    for data in all_the_data:
        file.write(f"{str(data)}\n")
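
Writing `str(data)` stores Python dict literals, which are awkward to load back. If the data is meant for later analysis, a CSV file is one alternative. The sketch below uses the standard-library `csv` module; the helper names `save_teams_csv` and `load_teams_csv` and the file name `teams.csv` are assumptions for illustration.

```python
import csv
import os
import tempfile

FIELDS = ["name", "year", "wins", "losses", "ot-losses", "win%", "gf", "ga", "gd"]


def save_teams_csv(rows: list, path: str) -> None:
    """Write one CSV line per team, preceded by a header row."""
    with open(path, "w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def load_teams_csv(path: str) -> list:
    """Read the rows back; note that every value comes back as a string."""
    with open(path, newline="", encoding="utf-8") as file:
        return list(csv.DictReader(file))


# Round trip with one row from the output above, in a temporary directory.
sample = [{"name": "Boston Bruins", "year": 1990, "wins": 44, "losses": 24,
           "ot-losses": 0, "win%": 0.55, "gf": 299, "ga": 264, "gd": 35}]
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "teams.csv")
    save_teams_csv(sample, csv_path)
    loaded = load_teams_csv(csv_path)
print(loaded[0]["name"], loaded[0]["gd"])
```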

### The following are some things I tested before writing the final solution above:

In [None]:
test = requests.get("https://www.scrapethissite.com/pages/forms/?page_num=1")
test_soup = BeautifulSoup(test.content, "html.parser")
table_dt = test_soup.find_all("tr", class_="team")

for row in table_dt:
    if not row.find("td", class_="pct text-success"):
        print(row.find("td", class_="pct text-danger").get_text(strip=True))
    else:
        print(row.find("td", class_="pct text-success").get_text(strip=True))

0.55
0.388
0.575
0.613
0.425
0.463
0.388
0.575
0.338
0.487
0.4
0.312
0.45
0.412
0.512
0.2
0.588
0.287
0.35
0.463
0.325
0.45
0.388
0.388
0.45


In [None]:
tester = requests.get(
    "https://www.scrapethissite.com/pages/forms/?page_num=240")
tester_soup = BeautifulSoup(tester.content, "html.parser")
validity = tester_soup.find("section").findChild(
    "div", class_="col-md-12").findChild("small").get_text()
validity
# if tester_soup=="0 items":
#     print("This will work!")

'0 items'
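
The chained lookup above raises an `AttributeError` if any link in the chain is missing, because `find` returns `None` when nothing matches. A more defensive variant might look like the sketch below. The helper name `item_count_text` and the sample HTML are hypothetical; only the `section` / `div.col-md-12` / `small` nesting is taken from the code above.

```python
from bs4 import BeautifulSoup


def item_count_text(html: str):
    """Return the text of section > div.col-md-12 > small, or None if absent."""
    node = BeautifulSoup(html, "html.parser").find("section")
    # Walk down the same chain as above, stopping early if a tag is missing.
    for tag_name, attrs in (("div", {"class": "col-md-12"}), ("small", {})):
        if node is None:
            return None
        node = node.find(tag_name, attrs=attrs)
    return node.get_text(strip=True) if node is not None else None


# Made-up markup that only mimics the nesting used in the lookup above.
empty_page = '<section><div class="col-md-12"><small> 0 items </small></div></section>'
print(item_count_text(empty_page))
print(item_count_text("<p>no section here</p>"))
```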