## NHL Player Stats Scraper (Automated)

**Run this program before each workshop to get the latest data.**

**Author:** Peter Beens  
**Date:** 2025-03-06  
**Description:**  
This script scrapes NHL player statistics from ESPN using **Selenium** and **Pandas**.  
It automatically clicks the **"Show More"** button to ensure all player data is loaded before extraction.  
The extracted data is saved to a CSV file in the format:  
**`nhl_player_stats_YYYY-YYYY.csv`** (e.g., `nhl_player_stats_2024-2025.csv`).
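
The filename convention can be illustrated with a small helper. This is only a sketch: `season_filename` is a hypothetical name, and it mirrors the f-string the script uses, where the season is identified by its ending year.

```python
def season_filename(end_year: int) -> str:
    """Build the CSV filename for the season that ends in end_year."""
    # The season spans two calendar years: (end_year - 1) to end_year
    return f"nhl_player_stats_{end_year - 1}-{end_year}.csv"


print(season_filename(2025))  # nhl_player_stats_2024-2025.csv
```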

#### **Features:**
✅ **Automated "Show More" Clicking** – Ensures all stats are loaded.  
✅ **Player Name Extraction** – Uses Selenium to scrape player names.  
✅ **Stat Table Parsing** – Extracts and merges statistical tables.  
✅ **CSV Export** – Saves the cleaned dataset for further analysis.  
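
The table-merging feature can be sketched on its own. The script reads two row-aligned tables from the page (one with rank and name, one with the statistics) and joins them column-wise. In the sketch below, `combine_tables` is a hypothetical helper and the rows are placeholder values, not real player data.

```python
import pandas as pd

# Placeholder rows standing in for the two tables read from the page
players = pd.DataFrame({"RK": [1, 2], "Name": ["Player A", "Player B"]})
stats = pd.DataFrame({"POS": ["C", "RW"], "GP": [10, 10], "PTS": [12, 9]})


def combine_tables(players: pd.DataFrame, stats: pd.DataFrame) -> pd.DataFrame:
    """Join two row-aligned tables column-wise after dropping the rank column."""
    if len(players) != len(stats):
        raise ValueError("Tables have different row counts and cannot be aligned")
    # errors="ignore" makes the drop a no-op when there is no 'RK' column
    players = players.drop(columns=["RK"], errors="ignore")
    # reset_index keeps both frames on a 0..n-1 index so rows pair up by position
    return pd.concat(
        [players.reset_index(drop=True), stats.reset_index(drop=True)], axis=1
    )


combined = combine_tables(players, stats)
print(combined.columns.tolist())  # ['Name', 'POS', 'GP', 'PTS']
```

Checking the row counts first matters because a column-wise `pd.concat` aligns on the index and would otherwise pad the shorter table with missing values instead of failing.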

#### **Usage Instructions:**
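1. **Install dependencies:** `pip install selenium pandas lxml` (`pandas.read_html` needs an HTML parser such as `lxml`).  
2. **Ensure `geckodriver` is installed** and on your `PATH` (for the Firefox WebDriver).  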
3. **Run the script in a Python environment** with internet access.  
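
Before running the scraper, the steps above can be checked with a short script. This is a minimal sketch using only the standard library. `missing_prerequisites` is a hypothetical helper, and it only checks that the modules are importable and that `geckodriver` is on `PATH`, not that their versions are compatible.

```python
import importlib.util
import shutil


def missing_prerequisites(modules=("selenium", "pandas"), executables=("geckodriver",)):
    """Return the names of required modules and executables that cannot be found."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not importable
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    for name in executables:
        # shutil.which returns None when the executable is not on PATH
        if shutil.which(name) is None:
            missing.append(name)
    return missing


problems = missing_prerequisites()
if problems:
    print("Missing prerequisites:", ", ".join(problems))
else:
    print("All prerequisites found.")
```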

#### **Dependencies:**
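- `selenium`
- `pandas` (plus an HTML parser such as `lxml` for `pandas.read_html`)
- `time` (Python standard library)
- `io.StringIO` (Python standard library)  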

#### **Output:**
- A CSV file with NHL player stats, named:  

`nhl_player_stats_YYYY-YYYY.csv`




In [None]:
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException
from io import StringIO

# Initialize the WebDriver for Firefox. Ensure geckodriver is installed and in your PATH.
wd = webdriver.Firefox()

# Define the URL and filename
end_year = 2025  # Change this to the desired season
url = f"https://www.espn.com/nhl/stats/player/_/season/{end_year}/seasontype/2"
filename = f'nhl_player_stats_{end_year-1}-{end_year}.csv'

# Open the URL in the WebDriver
wd.get(url)
time.sleep(3)  # Allow initial page load

# Function to click "Show More" until all data is loaded
def click_show_more():
    while True:
        try:
            # Locate the "Show More" link inside the div
            show_more_link = wd.find_element(By.XPATH, "//div[contains(@class, 'loadMore')]//a[contains(@class, 'loadMore__link')]")
            
            # Scroll into view
            wd.execute_script("arguments[0].scrollIntoView();", show_more_link)
            time.sleep(1)  # Allow scrolling time
            
            # Click using JavaScript (ensures it works)
            wd.execute_script("arguments[0].click();", show_more_link)
            time.sleep(2)  # Allow time for new data to load
        except NoSuchElementException:
            print("No more 'Show More' button found. Page is fully loaded.")
            break
        except ElementClickInterceptedException:
            print("Click intercepted. Retrying after a short wait...")
            time.sleep(2)

# Click "Show More" until all data is loaded
click_show_more()

# Extract the player names using Selenium
try:
    player_elements = wd.find_elements(By.XPATH, "//tr[contains(@class, 'Table__TR')]//td[2]//a")
    names = [element.text for element in player_elements if element.text]
    print(f"Number of player names extracted: {len(names)}")
except Exception as e:
    print("Error extracting player names:", e)
    names = []

# Extract tables using pandas
html_source = wd.page_source
try:
    tables = pd.read_html(StringIO(html_source))
except ValueError:
    # pd.read_html raises ValueError when the page contains no tables
    wd.quit()
    raise SystemExit("Error: No tables found on the page.")

# Ensure both the player table and the stats table were found
if len(tables) < 2:
    wd.quit()
    raise SystemExit(f"Error: Expected at least 2 tables but found {len(tables)}")

# Extract player data
players = tables[0]
stats = tables[1]

# Drop 'RK' column if it exists
if 'RK' in players.columns:
    players = players.drop(columns=['RK'])

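# Add extracted player names (the count must match the number of table rows,
# otherwise the column assignment below raises a ValueError)
if len(names) != len(players):
    wd.quit()
    raise SystemExit(f"Error: Extracted {len(names)} names for {len(players)} table rows.")
players['Name'] = names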

# Merge player and stats DataFrames
df = pd.concat([players, stats], axis=1)

# Display the final DataFrame
print(df.head())

# Save the DataFrame to a CSV file
df.to_csv(filename, index=False)
print(f"Data saved to {filename}")

# Close the WebDriver
wd.quit()
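

The `while True` loop in `click_show_more` ends only when the "Show More" link disappears, so it would run indefinitely if the link ever stayed on the page. A bounded variant can be sketched independently of Selenium. Here `click_until_gone`, `click_once`, and the default of 200 clicks are assumptions for illustration. With Selenium, `click_once` would wrap the `find_element` and click calls and return `False` on `NoSuchElementException`.

```python
from typing import Callable


def click_until_gone(click_once: Callable[[], bool], max_clicks: int = 200) -> int:
    """Call click_once until it returns False or max_clicks is reached.

    click_once is a hypothetical callable that clicks the button and returns
    True, or returns False when the button can no longer be found.
    Returns the number of successful clicks.
    """
    clicks = 0
    while clicks < max_clicks:
        if not click_once():
            break
        clicks += 1
    return clicks


# Stand-in for a page whose button disappears after three clicks
remaining = [True, True, True, False]
print(click_until_gone(lambda: remaining.pop(0)))  # 3
```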


## Merge All Years of NHL Player Stats

**Author:** Peter Beens  
**Date:** 2025-03-06  
**Description:** This script merges all individual NHL player stats CSV files into a single dataset.  
It extracts the season's starting year (the first `YYYY`) from each filename and adds it as a new "Year" column, positioned right after "Name."  
All files must follow the naming pattern: **nhl_player_stats_YYYY-YYYY.csv** (e.g., `nhl_player_stats_2024-2025.csv`).  
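
How the filename is parsed can be shown in isolation. The sketch below uses the same pattern as the merge script but also captures the ending year. `season_years` and `SEASON_FILE` are hypothetical names introduced for illustration.

```python
import re

# Same pattern as the merge script, with the season's end year captured too
SEASON_FILE = re.compile(r"nhl_player_stats_(\d{4})-(\d{4})\.csv")


def season_years(filename):
    """Return (start_year, end_year) for a matching filename, else None."""
    match = SEASON_FILE.search(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(season_years("nhl_player_stats_2024-2025.csv"))  # (2024, 2025)
print(season_years("nhl_player_stats_all.csv"))        # None
```

The second call shows why the merged output file, which also matches the glob pattern `nhl_player_stats_*.csv`, is skipped on later runs rather than merged into itself.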

#### Usage:
- Ensure all CSV files are stored in the directory the script is run from (the pattern `nhl_player_stats_*.csv` is resolved relative to the current working directory).
- Run the script in an environment with `pandas` installed.
- The merged file will be saved as **nhl_player_stats_all.csv**.

#### Dependencies:
- `pandas`
- `glob` (Python standard library)
- `re` (Python standard library)

In [None]:
import pandas as pd
import glob
import re

# Define the expected NHL column order with "Year" added after "Name"
column_order = ['Name', 'Year', 'POS', 'GP', 'G', 'A', 'PTS', '+/-', 'PIM', 'TOI/G', 
                'PPG', 'PPA', 'S', 'S%', 'SHFT', 'GWG', 'FW', 'FL', 'FO%', 
                'SOA', 'SOG', 'SO%']

# Initialize an empty list to hold DataFrames
all_data = []

# Use glob to get all filenames matching the pattern, sorted by name so that
# seasons are merged in chronological order
file_pattern = "nhl_player_stats_*.csv"
files = sorted(glob.glob(file_pattern))

# Iterate through the list of files
for file in files:
    # Extract the year from the filename using regex
    match = re.search(r'nhl_player_stats_(\d{4})-\d{4}\.csv', file)
    if not match:
        print(f"Skipping file {file}: Unable to extract year from filename.")
        continue  # Skip files that don't match expected format

    year = int(match.group(1))  # Extracted year (e.g., 2024 from "nhl_player_stats_2024-2025.csv")

    # Read the CSV file into a DataFrame
    df = pd.read_csv(file)

    # Add the extracted Year column
    df.insert(1, 'Year', year)  # Insert after "Name" (column index 1)

    # Append the DataFrame to the list
    all_data.append(df)

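# Stop early if no season files were read; pd.concat raises ValueError on an empty list
if not all_data:
    raise SystemExit("No files matching 'nhl_player_stats_YYYY-YYYY.csv' were found.")

# Concatenate all DataFrames in the list
combined_df = pd.concat(all_data, ignore_index=True)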

# Reorder the columns to match NHL format
combined_df = combined_df[column_order]

# Save the combined DataFrame to a new CSV file
combined_df.to_csv('nhl_player_stats_all.csv', index=False)

print("Files have been successfully combined and saved to 'nhl_player_stats_all.csv'.")
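

The reordering step `combined_df[column_order]` raises a `KeyError` if any expected column is missing from the merged data. One way to check a season file beforehand is to compare its header with the expected names. This is a sketch using only the standard library, and `missing_columns` is a hypothetical helper.

```python
import csv


def missing_columns(csv_path, expected):
    """Return the expected column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # The first row of the file is the header; an empty file yields []
        header = next(csv.reader(handle), [])
    return [name for name in expected if name not in header]


# Example call (the file must exist in the working directory):
# missing_columns("nhl_player_stats_2024-2025.csv", ["Name", "POS", "GP"])
```

"Year" is expected to be reported as missing for an individual season file, since the merge script adds that column itself.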
