# Exercise - Regular Expressions & Web Scraping (2022 NHL Statistics)

## Objective: Show the top 3 Colorado Avalanche players who scored the most points during the 2021-2022 NHL season

(Stats source: [Covers.com](https://www.covers.com/sport/hockey/nhl/players/statistics/points/2021-2022))

### To-Do List for future expanded project / separate repository:
* Allow varied user input (e.g. "ranger", "arizona") and use a list of team name abbreviations plus regex to determine which team the user is referring to

* Based on above, update functions to be applicable for all NHL teams

* Update the standard setup for BeautifulSoup so the URL can change season by season (the season may also need to be part of the user input)

* Once everything is functional, consider adding it to website portfolio (more details/image of players?)
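
The first to-do item could be prototyped roughly as follows. This is only a sketch under stated assumptions: `TEAM_ALIASES` and `resolve_team` are hypothetical names, the alias patterns are illustrative guesses at what a user might type, and only the abbreviations that appear in this notebook's output are listed. The remaining teams would have to be added using whatever abbreviations the stats site uses.

```python
import re

# Hypothetical alias table: abbreviation -> regex for ways a user might name the team.
# Only abbreviations seen in this notebook's output are included (assumption: partial list).
TEAM_ALIASES = {
    "COL": r"col(orado)?|av(alanche|s)",
    "EDM": r"edm(onton)?|oilers?",
    "NYR": r"nyr|rangers?",
    "TB": r"tb|tampa( bay)?|lightning",
}


def resolve_team(user_input):
    """Return the team abbreviation matching user_input, or None if nothing matches."""
    text = user_input.strip().lower()
    for abbreviation, pattern in TEAM_ALIASES.items():
        # fullmatch requires the whole input to match, not just part of it
        if re.fullmatch(pattern, text):
            return abbreviation
    return None


print(resolve_team("ranger"))     # NYR
print(resolve_team("Avalanche"))  # COL
print(resolve_team("arizona"))    # None (not in the partial table above)
```

Returning `None` for an unknown team leaves it to the caller to decide whether to re-prompt the user or raise an error.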


In [304]:
import requests # one of the ways to connect to websites via Python
from bs4 import BeautifulSoup # allows you to go through page source and get data
import re # regular expressions library

In [305]:
def get_nhl_stats():

    ###### STANDARD SETUP FOR BEAUTIFULSOUP ######
    url = "https://www.covers.com/sport/hockey/nhl/players/statistics/points/2021-2022"

    # GET request; the response object holds the page's HTML source
    source_code = requests.get(url, timeout=10)
    source_code.raise_for_status()  # fail early on an HTTP error status

    # the response body as text, i.e. the raw HTML that will be parsed
    plain_text = source_code.text

    # parsed document tree that can be searched; naming the parser avoids bs4's guessed-parser warning
    soup = BeautifulSoup(plain_text, "html.parser")


    # select the stats table by its id ("PostSeason", so these rows are playoff figures)
    table = soup.find("table", attrs={"id": "PostSeason"})
    table_headers = table.thead.find_all("th") # find all th's within table's thead section
    table_rows = table.tbody.find_all("tr") # find all tr's within table's tbody section

    ###### CREATE LIST OF HEADERS ######
    list_of_headings = []

    for headers in table_headers:
        # remove any newlines and extra spaces from left and right 
        list_of_headings.append(headers.text.replace('\n', ' ').strip())
        
    #print(list_of_headings)  #['# Player', 'Team', 'POS', 'GP', '+/-', 'SOG', 'PPG', 'SHG', 'G', 'A', 'P']




    ###### CREATE LIST OF TABLE ROW DATA & COMBINE W/ HEADERS TO MAKE DICTIONARY ######
    # list of multiple dictionaries
    table_data = []

    for rows in table_rows: 

        # each row is stored as a dictionary mapping header -> cell text
        # EXAMPLE: rows_dict = {'# Player': '', 'Team': '', 'POS': '', ...}
        rows_dict = {}

        # zip() pairs each cell with its header so both can be iterated together as a tuple
        for td, header in zip(rows.find_all("td"), list_of_headings):
            # the newline becomes a space here so the rank number and the player's name stay separated
            if header == '# Player':
                rows_dict[header] = td.text.replace('\n', ' ').strip()
            else:
                rows_dict[header] = td.text.replace('\n', '').strip()

        table_data.append(rows_dict)




    return table_data
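
The header/cell pairing done in the inner loop can be tried without a network request. Below is a minimal sketch that uses plain lists in place of the `th` and `td` elements; `row_to_dict` is a hypothetical helper name, and the values are copied from the first row of the output printed in this notebook.

```python
def row_to_dict(headings, cells):
    """Pair each heading with the cell in the same position."""
    # zip() stops at the shorter input, so surplus headings or cells are silently dropped
    return dict(zip(headings, cells))


headings = ['# Player', 'Team', 'POS', 'GP', '+/-', 'SOG', 'PPG', 'SHG', 'G', 'A', 'P']
cells = ['1 C. McDavid', 'EDM', 'C', '16', '15', '61', '2', '0', '10', '23', '33']

row = row_to_dict(headings, cells)
print(row['Team'], row['P'])  # EDM 33
```

Because `zip()` truncates silently, a row with fewer cells than headings simply yields a shorter dictionary, which is worth keeping in mind when later code indexes a key such as `'P'`.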


In [306]:
# list of table rows, each a dictionary of {header: cell text}
for x in get_nhl_stats():
    print(x)

{'# Player': '1 C. McDavid', 'Team': 'EDM', 'POS': 'C', 'GP': '16', '+/-': '15', 'SOG': '61', 'PPG': '2', 'SHG': '0', 'G': '10', 'A': '23', 'P': '33'}
{'# Player': '2 L. Draisaitl', 'Team': 'EDM', 'POS': 'C', 'GP': '16', '+/-': '4', 'SOG': '44', 'PPG': '3', 'SHG': '1', 'G': '7', 'A': '25', 'P': '32'}
{'# Player': '3 C. Makar', 'Team': 'COL', 'POS': 'D', 'GP': '18', '+/-': '8', 'SOG': '65', 'PPG': '2', 'SHG': '1', 'G': '7', 'A': '20', 'P': '27'}
{'# Player': '4 N. Kucherov', 'Team': 'TB', 'POS': 'RW', 'GP': '21', '+/-': '5', 'SOG': '69', 'PPG': '5', 'SHG': '0', 'G': '7', 'A': '19', 'P': '26'}
{'# Player': '5 M. Rantanen', 'Team': 'COL', 'POS': 'RW', 'GP': '18', '+/-': '2', 'SOG': '51', 'PPG': '1', 'SHG': '0', 'G': '5', 'A': '20', 'P': '25'}
{'# Player': '6 M. Zibanejad', 'Team': 'NYR', 'POS': 'C', 'GP': '20', '+/-': '0', 'SOG': '63', 'PPG': '6', 'SHG': '0', 'G': '10', 'A': '14', 'P': '24'}
{'# Player': '7 A. Fox', 'Team': 'NYR', 'POS': 'D', 'GP': '20', '+/-': '0', 'SOG': '42', 'PPG': '2

In [307]:
## ASSIGNING VARIABLE
data = get_nhl_stats()

In [308]:
def top_avs_player_stats_regex():
    player_points_tuples_list = []

    for player_row in data:
        # .get() also skips rows that came back empty (no 'Team' key)
        if player_row.get('Team') == 'COL':
            player_name_orig = str(player_row['# Player'])
            # drop the leading rank number: match from the first capital letter to the end
            player_name_txt_only = re.findall(r'([A-Z].+)', player_name_orig)[0]

            points_scored = int(player_row['P'])

            # (points, name) order so that the tuples sort by points first
            player_points_tuples_list.append((points_scored, player_name_txt_only))

    # sort once, after the loop, in descending order of points
    player_points_tuples_list = sorted(player_points_tuples_list, reverse=True)
    for v, k in player_points_tuples_list[:3]:
        print(k, "---", v, "points scored this season")


## Top 3 Scoring Leaders for the 2021-2022 Colorado Avalanche:

In [309]:
top_avs_player_stats_regex()

C. Makar --- 27 points scored this season
M. Rantanen --- 25 points scored this season
N. MacKinnon --- 21 points scored this season
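
As a step toward making the function work for any team, the filtering and sorting could take the rows and the team as parameters instead of reading a global variable. The sketch below is one possible shape, not the notebook's method: `top_scorers` is a hypothetical name, it sorts with a `key` function rather than reversed tuples, and the sample rows are a reduced copy (three keys only) of rows printed earlier in this notebook.

```python
def top_scorers(rows, team, n=3):
    """Return up to n (player, points) pairs for one team, highest points first."""
    team_rows = [row for row in rows if row.get('Team') == team]
    # the scraped values are strings, so convert 'P' to int before comparing
    team_rows.sort(key=lambda row: int(row['P']), reverse=True)
    return [(row['# Player'], int(row['P'])) for row in team_rows[:n]]


sample_rows = [
    {'# Player': '1 C. McDavid', 'Team': 'EDM', 'P': '33'},
    {'# Player': '2 L. Draisaitl', 'Team': 'EDM', 'P': '32'},
    {'# Player': '3 C. Makar', 'Team': 'COL', 'P': '27'},
    {'# Player': '4 N. Kucherov', 'Team': 'TB', 'P': '26'},
    {'# Player': '5 M. Rantanen', 'Team': 'COL', 'P': '25'},
]

for name, points in top_scorers(sample_rows, 'COL'):
    print(name, "---", points)
```

Sorting on `int(row['P'])` alone means ties keep their original table order, whereas sorting `(points, name)` tuples breaks ties alphabetically by name.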


In [310]:
# REGEX TEST EXAMPLE: Separating/removing number before player name

temp = "3 C. Makar"

temp_str = (re.findall(r'([A-Z].+)', temp))

print(temp_str)

print(temp_str[0])

['C. Makar']
C. Makar
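
The pattern `[A-Z].+` relies on the name starting with a capital letter. An alternative sketch, assuming the cell text always has the form "rank, whitespace, name", uses named groups to capture both parts explicitly. `RANKED_NAME` and `split_rank_and_name` are hypothetical names introduced for illustration.

```python
import re

# one or more digits (the rank), whitespace, then everything else as the name
RANKED_NAME = re.compile(r'^(?P<rank>\d+)\s+(?P<name>.+)$')


def split_rank_and_name(text):
    """Return (rank, name); rank is None when the text has no leading number."""
    text = text.strip()
    match = RANKED_NAME.match(text)
    if match is None:
        return None, text
    return int(match.group('rank')), match.group('name')


print(split_rank_and_name("3 C. Makar"))  # (3, 'C. Makar')
print(split_rank_and_name("C. Makar"))    # (None, 'C. Makar')
```

Returning the rank as well keeps the information available in case a later version wants to display league-wide ranking alongside the team ranking.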


[Helpful Documentation](https://www.pluralsight.com/guides/extracting-data-html-beautifulsoup)