# **NHL Player Performance Prediction**

## by Team Lewandowski
### (Elvin Kim, Jay Leung, Adian Murzagaliyev, Sou Hamura)

---

### **Objective**
This project aims to develop a **projection system** that predicts a hockey player’s performance during their **first three NHL seasons** based on their **statistics in previous leagues**. The system processes an **original dataset**, automates data retrieval from **HockeyDB**, and applies **machine learning** to predict future performance.

---

## **Why This Project Matters**

This project is deeply connected to **who we are**: as members of the **UCSB hockey team**, we are combining our passion for hockey with data science to address a **longstanding challenge in player scouting**.

![UCSB Hockey Team](./_LEX8554.jpeg)


---

## **Why This Model Is Unique**

### **Why Predicting NHL Performance Is Difficult**
Unlike the **NFL or NBA**, where most players come from a **single college system (NCAA)**, NHL players come from **dozens of different leagues** across multiple countries, each with varying levels of difficulty.

✅ **Different Leagues, Different Standards** – Scoring in the **KHL, SHL, AHL, or NCAA** is not the same.  
✅ **Same Country, Multiple Leagues** – In North America, players can develop in the **NCAA, OHL, WHL, or USHL**, all with different competition levels.  
✅ **International Complexity** – Players from **Europe, Russia, and North America** follow unique development paths, making performance comparisons difficult.  

### **What Makes Our Model Unique?**
Our **Random Forest-based model** overcomes these challenges by:  
✅ **Adjusting for League Difficulty** – Captures how hard it is to produce points in different leagues.  
✅ **Learning from Global Data** – Incorporates stats from players across all major pre-NHL leagues.  
✅ **Built by Players, for Players** – As hockey players, we understand the **nuances of player development** beyond just numbers.  

---

## **Step 1: Processing the Original CSV File**

The dataset we started with contained hockey player statistics from various leagues and game situations. It did not include the specific statistics we needed, so we extracted only the player names, since the dataset covers all players active in the 2024-2025 season.

### Filtering by "Situation" Column

To create a **comprehensive and unbiased dataset**, we filtered the data to include only rows where "situation" = "all". The original dataset contained performance stats from specific game conditions (e.g., 5on5, 4on4, 5on4), which could introduce bias and limit the model’s generalizability.

This step allows us to work with a **complete statistical profile** of each player, making our **NHL performance projections more robust and accurate**.

### **Keeping Only Name**
After filtering, we **retain only the `name` column** and drop all other variables.  

```python
import os

import pandas as pd


def filter_situation_all(input_file: str):
    """Keep only rows where 'situation' is 'all' and save them to a new CSV."""
    # Read a few rows first to learn how many columns the file should have.
    df_sample = pd.read_csv(input_file, nrows=5)
    expected_columns = len(df_sample.columns)
    print(f"Expected column count: {expected_columns}")

    # Read everything as strings; malformed lines are skipped instead of raising.
    df = pd.read_csv(input_file, header=0, dtype=str, on_bad_lines="skip")

    # Check for the column before using it; dropna would otherwise raise KeyError.
    if 'situation' not in df.columns:
        print("Error: Column 'situation' not found in dataset.")
        return None

    df = df.dropna(subset=['situation'])

    if df.shape[1] != expected_columns:
        print(f"Warning: Dataset has {df.shape[1]} columns instead of {expected_columns}. Some rows may have been removed.")

    print(f"Rows before filtering: {len(df)}")

    df_filtered = df[df['situation'] == 'all']
    print(f"Rows after filtering by 'situation': {len(df_filtered)}")

    # Save next to the input file with a "filtered_" prefix.
    output_file_all = os.path.join(os.path.dirname(input_file), "filtered_" + os.path.basename(input_file))
    df_filtered.to_csv(output_file_all, index=False)

    print(f"Filtered (situation == 'all') file saved as: {output_file_all}")
    return output_file_all


def filter_only_name(input_file: str):
    """Keep only the 'name' column from the filtered dataset."""
    df = pd.read_csv(input_file, header=0, dtype=str, on_bad_lines="skip")

    if 'name' not in df.columns:
        print("Error: Column 'name' not found in dataset.")
        return None

    df_name_only = df[['name']]

    # Save next to the input file with a "name_only_" prefix.
    output_file_name = os.path.join(os.path.dirname(input_file), "name_only_" + os.path.basename(input_file))
    df_name_only.to_csv(output_file_name, index=False)

    print(f"Filtered (name only) file saved as: {output_file_name}")
    return output_file_name


if __name__ == "__main__":
    input_path = r"C:\Users\souha\OneDrive\ドキュメント\Lewandowski (Datathon)\moneypuck downloaded - player - skaters.csv"

    filtered_all_file = filter_situation_all(input_path)

    if filtered_all_file:
        filter_only_name(filtered_all_file)
```

---

## **Step 2: Automating Data Retrieval from HockeyDB**

Since the original dataset **does not contain complete career history** for each player, we **scrape additional player statistics** from **HockeyDB**, a website that tracks player performance across different leagues.

```python
import csv
import time
import urllib.parse

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Run Chrome without a visible window. Recent Selenium releases no longer
# accept `options.headless = True`, so the flag is passed as an argument.
options = Options()
options.add_argument("--headless=new")
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)
driver.set_page_load_timeout(30)

# Load player names from the first column of skaters.csv.
player_names = []
try:
    with open("skaters.csv", mode="r", encoding="utf-8") as file:
        reader = csv.reader(file)
        skater_names = [row[0].strip() for row in reader if row]
        player_names.extend(skater_names)
        print(f"Loaded {len(skater_names)} skaters from skaters.csv")
except Exception as e:
    print(f"Error reading skaters.csv: {e}")

csv_filename = "hockey_players_stats.csv"
headers = ["Player", "Season", "Team", "League", "GP", "G", "A", "PTS", "PIM", "+/-"]

try:
    with open(csv_filename, mode="w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(headers)

        for player_name in player_names:
            print(f"Scraping data for {player_name}...")

            # URL-encode the name: spaces become '+', other special characters are escaped.
            formatted_name = urllib.parse.quote_plus(player_name)
            url = f"https://www.hockeydb.com/ihdb/stats/find_player.php?full_name={formatted_name}"

            try:
                driver.get(url)
            except Exception:
                print(f"Timeout or load error for {player_name}, skipping")
                continue

            time.sleep(2)  # short pause between requests

            soup = BeautifulSoup(driver.page_source, "html.parser")
            table = soup.find("table")
            if not table:
                continue

            for row in table.find_all("tr"):
                cols = [col.text.strip() for col in row.find_all("td")]

                # Stat rows have at least 9 cells; skip the "NHL Totals" summary row.
                if len(cols) >= 9 and "NHL Totals" not in row.text:
                    season, team, league, gp, g, a, pts, pim, plus_minus = cols[:9]
                    writer.writerow([player_name, season, team, league, gp, g, a, pts, pim, plus_minus])
finally:
    # Close the browser even if the loop fails partway through.
    driver.quit()

print(f"\nData saved to {csv_filename}.")
```

### **Reformatting Player Names for URL Generation**
To look up a player on **HockeyDB**, we **convert each player’s name** into a query string for the site’s player search by replacing spaces with `+`.

🔹 **Example:**  
- **Player:** `"Connor McDavid"`  
- **Search URL requested by the script:**  
  `https://www.hockeydb.com/ihdb/stats/find_player.php?full_name=Connor+McDavid`  
- **Player page URL:**  
  `https://www.hockeydb.com/ihdb/stats/pdisplay.php?pid=XXXXX`  
  (*where "XXXXX" represents the player’s unique ID, obtained from HockeyDB*)  
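
The name-to-URL step can be sketched as follows. This is a minimal illustration rather than the project's exact code: `build_search_url` is a hypothetical helper name, and it assumes `urllib.parse.quote_plus` is an acceptable way to encode the name, so that spaces become `+` and characters such as apostrophes are percent-encoded.

```python
from urllib.parse import quote_plus

# Search endpoint taken from the scraping script in this README.
SEARCH_URL = "https://www.hockeydb.com/ihdb/stats/find_player.php"


def build_search_url(player_name: str) -> str:
    """Return the HockeyDB search URL for a player name (hypothetical helper)."""
    return f"{SEARCH_URL}?full_name={quote_plus(player_name.strip())}"


print(build_search_url("Connor McDavid"))
# https://www.hockeydb.com/ihdb/stats/find_player.php?full_name=Connor+McDavid
print(build_search_url("Ryan O'Reilly"))
# https://www.hockeydb.com/ihdb/stats/find_player.php?full_name=Ryan+O%27Reilly
```

Encoding the whole name, instead of only replacing spaces, avoids malformed URLs for names that contain apostrophes or accented letters.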

### **Generating and Accessing URLs**
Using the **reformatted names**, the system **constructs a URL for each player**, then **accesses the player’s page on HockeyDB** to extract:  
✅ **All leagues the player has competed in**  
✅ **Performance statistics** (goals, assists, points, games played, etc.)  

### **Storing Data in a CSV File**
Once the data is extracted, it is **saved into a structured CSV file** (`hockey_players_stats.csv`), where each row corresponds to one stat line (a season a player spent with one team) and contains:  
✅ **Player name**  
✅ **Season, team, and league** (NHL and pre-NHL)  
✅ **Performance stats** (games played, goals, assists, points, penalty minutes, plus/minus)  

**Position** (Forward or Defense) is not among the columns written by the script above. Incorporating **position data** would let the system account for positional differences in performance, which should lead to **more accurate NHL projections**.

---

## **Step 3: Data Cleaning**

### Removing Unnecessary Data
Once we had extracted all statistics from **HockeyDB**, we had a total of **9,500 rows**, far more than we needed. Quite a few of them were **duplicate entries**, which we removed.
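
Duplicate removal of this kind can be sketched with pandas. The snippet below runs on made-up rows and assumes the column names written by the scraper; it is not necessarily the exact cleaning step used in the project.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the scraper's CSV header.
raw = pd.DataFrame({
    "Player": ["Player A", "Player A", "Player A", "Player B"],
    "Season": ["2019-20", "2019-20", "2020-21", "2019-20"],
    "Team": ["Team X", "Team X", "Team X", "Team Y"],
    "League": ["AHL", "AHL", "NHL", "NHL"],
    "GP": [60, 60, 70, 82],
})

# Rows that match on every column are exact duplicates; keep the first copy.
deduped = raw.drop_duplicates().reset_index(drop=True)

print(f"{len(raw)} rows before, {len(deduped)} rows after")
```

Comparing on all columns is the conservative choice: two rows are only merged when they are identical, so a player with the same season in two different leagues keeps both rows.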


### Eliminating Unqualified Players
Since players differ in career length, we aimed to keep the **training process unbiased**.  
To do this, we **removed all players with fewer than three seasons of NHL experience**.  
✅ This ensures that every player in the training set has a complete three-season target, since our projection model **only predicts player stats for the first three NHL seasons**.


### Filtering Out Seasons
For players with **more than three seasons of NHL experience**, we filtered the data so that it **only includes each player’s first three seasons in the NHL**, as well as their **pre-NHL stats**.

✅ This helps maintain consistency in our **training dataset** and ensures accurate predictions.
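
The two filtering rules above can be sketched together in pandas. This is an illustration on made-up data, not the project's cleaning script; it assumes one row per player per NHL season and season strings that sort chronologically (for example `2016-17`).

```python
import pandas as pd

# Made-up data: Player A has four NHL seasons, Player B only two.
df = pd.DataFrame({
    "Player": ["A"] * 5 + ["B"] * 3,
    "Season": ["2015-16", "2016-17", "2017-18", "2018-19", "2019-20",
               "2018-19", "2019-20", "2020-21"],
    "League": ["AHL", "NHL", "NHL", "NHL", "NHL", "SHL", "NHL", "NHL"],
    "PTS": [40, 20, 35, 50, 55, 30, 10, 15],
})

# Rule 1: keep only players with at least three distinct NHL seasons.
nhl_seasons = df[df["League"] == "NHL"].groupby("Player")["Season"].nunique()
qualified = nhl_seasons[nhl_seasons >= 3].index
df = df[df["Player"].isin(qualified)]

# Rule 2: keep all pre-NHL rows plus each player's first three NHL seasons.
is_nhl = df["League"] == "NHL"
first_three_nhl = df[is_nhl].sort_values(["Player", "Season"]).groupby("Player").head(3)
result = pd.concat([df[~is_nhl], first_three_nhl]).sort_values(["Player", "Season"])

print(result)
```

Player B is dropped entirely, and Player A keeps the AHL row plus the 2016-17 through 2018-19 NHL seasons. Players traded mid-season would have two NHL rows for one season and would need those rows combined first.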





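```python
import pandas as pd

df = pd.read_csv("filtered_nhl_history.csv")

# Drop the Team column.
df = df.drop(columns=["Team"])

# Convert stat columns to numbers; non-numeric or missing values become 0.
numeric_cols = ["GP", "G", "A", "PTS", "PIM", "+/-"]
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0)

# Keep NHL rows season by season; collapse non-NHL rows into one total per player and league.
nhl_df = df[df["League"] == "NHL"]
non_nhl_df = df[df["League"] != "NHL"]

non_nhl_grouped = non_nhl_df.groupby(["Player", "League"], as_index=False)[numeric_cols].sum()

# The aggregated rows have no Season, so na_position="first" lists them before each player's NHL seasons.
final_df = pd.concat([nhl_df, non_nhl_grouped]).sort_values(by=["Player", "Season"], na_position="first")

final_df.to_csv("final_nhl_dataset.csv", index=False)

print("The dataset is saved as 'final_nhl_dataset.csv'.")
```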

---

## **Step 4: Building the Machine Learning Model**

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

file_path = "filteredNHLFinal.csv"
df = pd.read_csv(file_path)

nhl_df = df[df["League"] == "NHL"].copy()
non_nhl_df = df[df["League"] != "NHL"].copy()

# Sum each player's pre-NHL stats across all non-NHL leagues.
stat_cols = ["GP", "G", "A", "PTS", "PIM", "+/-"]
career_totals_fixed = non_nhl_df.groupby("Player")[stat_cols].sum().reset_index()
career_totals_fixed = career_totals_fixed.rename(columns={col: f"{col}_non_nhl" for col in stat_cols})

# One indicator column per league (multi-hot), so a player who appeared in the
# AHL and KHL gets League_AHL = 1 and League_KHL = 1. One-hot encoding the
# comma-joined league string would instead create one column per league
# combination, and columns such as League_AHL would not exist.
player_league_history = non_nhl_df.groupby("Player")["League"].unique().reset_index()
encoder = MultiLabelBinarizer()
league_encoded = encoder.fit_transform(player_league_history["League"])
league_cols = [f"League_{name}" for name in encoder.classes_]
league_encoded_df = pd.DataFrame(league_encoded, columns=league_cols)
league_encoded_df["Player"] = player_league_history["Player"]

career_totals_with_leagues = career_totals_fixed.merge(league_encoded_df, on="Player", how="left")

# First three NHL rows per player; relies on rows being in chronological order.
nhl_first_3_seasons = nhl_df.groupby("Player").head(3).copy()
nhl_first_3_seasons["Season_Number"] = nhl_first_3_seasons.groupby("Player").cumcount()

merged_df_seasonal = nhl_first_3_seasons.merge(career_totals_with_leagues, on="Player", how="left")
merged_df_seasonal = merged_df_seasonal.dropna()

features_seasonal = [f"{col}_non_nhl" for col in stat_cols] + ["Season_Number"] + league_cols
targets_seasonal = stat_cols

X_seasonal = merged_df_seasonal[features_seasonal]
y_seasonal = merged_df_seasonal[targets_seasonal]
X_train_seasonal, X_test_seasonal, y_train_seasonal, y_test_seasonal = train_test_split(
    X_seasonal, y_seasonal, test_size=0.2, random_state=42
)

# One multi-output forest predicts all six targets at once.
rf_model_seasonal = RandomForestRegressor(n_estimators=200, random_state=42)
rf_model_seasonal.fit(X_train_seasonal, y_train_seasonal)

y_pred_seasonal = rf_model_seasonal.predict(X_test_seasonal)

# Both metrics are averaged over the six targets.
mae_seasonal = mean_absolute_error(y_test_seasonal, y_pred_seasonal)
rmse_seasonal = mean_squared_error(y_test_seasonal, y_pred_seasonal) ** 0.5

print(f"Mean Absolute Error (MAE): {mae_seasonal:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse_seasonal:.2f}")

# Example player: each total is the sum of three pre-NHL stat lines,
# repeated once per NHL season to predict (Season_Number 0, 1, 2).
test_player_custom = pd.DataFrame({
    "GP_non_nhl": [30 + 127 + 83] * 3,
    "G_non_nhl": [6 + 9 + 21] * 3,
    "A_non_nhl": [6 + 23 + 15] * 3,
    "PTS_non_nhl": [12 + 42 + 36] * 3,
    "PIM_non_nhl": [15 + 18 + 53] * 3,
    "+/-_non_nhl": [-25 + (-4) + 6] * 3,
    "Season_Number": [0, 1, 2]
})

# Flag the leagues this player appeared in.
for league in ["Rus-MHL", "AHL", "KHL"]:
    test_player_custom[f"League_{league}"] = [1] * 3

# Align with the training features: leagues unseen in training are dropped,
# all other league indicators are filled with 0.
test_player_custom = test_player_custom.reindex(columns=features_seasonal, fill_value=0)

predicted_nhl_performance_custom = rf_model_seasonal.predict(test_player_custom)

labels = ["Games Played", "Goals", "Assists", "Points", "Penalty Minutes", "Plus/Minus"]
for season_num in range(3):
    print(f"Predicted NHL Performance for Season {season_num + 1}:")
    for label, value in zip(labels, predicted_nhl_performance_custom[season_num]):
        print(f"{label}: {value:.1f}")
    print("-" * 40)
```
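
One thing to watch with the row-level `train_test_split` above: each player contributes up to three rows, so the same player can end up in both the training and the test set. A possible alternative, sketched here on made-up data, is scikit-learn's `GroupShuffleSplit`, which keeps all rows of a player on one side of the split. This is a suggestion under those assumptions, not what the model above does.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Made-up frame: 5 players x 3 seasons, loosely mimicking the merged training table.
data = pd.DataFrame({
    "Player": [f"Player {i}" for i in range(5) for _ in range(3)],
    "Season_Number": [0, 1, 2] * 5,
    "PTS_non_nhl": [p for p in (90, 40, 120, 65, 75) for _ in range(3)],
    "PTS": [10, 20, 30, 5, 8, 12, 40, 55, 60, 15, 18, 25, 22, 30, 33],
})

# Split by player: roughly 20% of the players (not of the rows) go to the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(data, groups=data["Player"]))

train_players = set(data.iloc[train_idx]["Player"])
test_players = set(data.iloc[test_idx]["Player"])
print("Players in both sets:", train_players & test_players)
```

With a grouped split the reported error reflects players the model has never seen, which is closer to how the projections would be used on new prospects.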

### Comparing Machine Learning Models: **Random Forest vs. Multiple Linear Regression**  

Initially, we experimented with **Multiple Linear Regression (MLR)** as a potential model for predicting **NHL performance**.  
However, we found that **Random Forest Regression** provided **more accurate and reliable projections**, especially given the complexity of the dataset.

#### **Challenges with Multiple Linear Regression**
**Multiple Linear Regression (MLR)** assumes a **linear relationship** between a player's **pre-NHL stats** (goals, assists, points, games played, etc.) and their **NHL performance**.  

However, this approach has **several limitations**, particularly due to:

✅ **Variability in League Strength** – Different **pre-NHL leagues** have varying levels of **competition**, making it difficult for a simple linear model to **generalize across leagues**.  

✅ **Non-Linear Relationships** – The impact of stats like **goals or assists** on future performance is **not always linear** and depends on **league difficulty and competition level**.  

✅ **Interactions Between Variables** – A player’s development depends on **multiple interacting factors** (e.g., **playing time, team strength, league quality**), which **MLR struggles to capture** effectively.  


#### **Why Random Forest is a Better Fit**
Random Forest Regression **outperformed MLR** because it:  
✅ **Captures Non-Linear Patterns** – Handles complex relationships between stats and performance.  
✅ **Accounts for League Differences** – Can learn how the value of a stat line changes with league from the league indicator features, without manually set adjustment factors.  
✅ **Handles Interactions Between Features** – Considers how different variables jointly affect player development.  
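
A comparison like this can be run by fitting both models on the same split and reporting the error per target. The sketch below uses random synthetic numbers as stand-in data, so it does not reproduce our results; `compare_models` is a hypothetical helper name.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def compare_models(X, y, random_state=42):
    """Fit both models on the same split and return per-target MAE for each."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    models = {
        "Linear Regression": LinearRegression(),
        "Random Forest": RandomForestRegressor(n_estimators=50, random_state=random_state),
    }
    results = {}
    for name, model in models.items():
        predictions = model.fit(X_train, y_train).predict(X_test)
        # multioutput="raw_values" returns one MAE per target column.
        results[name] = mean_absolute_error(y_test, predictions, multioutput="raw_values")
    return results


# Synthetic stand-in data (not hockey stats): 200 rows, 4 features, 2 non-linear targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.column_stack([X[:, 0] * X[:, 1], np.abs(X[:, 2])])
y = y + rng.normal(scale=0.1, size=y.shape)

for name, mae in compare_models(X, y).items():
    print(name, np.round(mae, 2))
```

Reporting one MAE per target is useful here because the six NHL targets are on different scales (games played versus plus/minus, for example), and a single averaged MAE is dominated by the largest one.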



---

## **Step 5: Developing the Website**

We are developing an **interactive website** to make our **NHL player prediction model** and **league analysis** easily accessible.  
Our platform allows users to explore **player projections** and **in-depth league analytics** through an **intuitive interface**.  

The **front-end** is built using **JavaScript, HTML, and CSS**, while the **back-end** is powered by **Flask**, which connects **data processing** to **user interaction**.
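
A Flask back-end for this could expose the model through a small JSON endpoint. The sketch below is an assumption about how such a route might look, not the site's actual code: the route `/api/predict`, the response layout, and the `predict_player` placeholder (which returns zeros instead of calling the trained model) are all hypothetical.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

TARGETS = ["GP", "G", "A", "PTS", "PIM", "+/-"]


def predict_player(features: dict) -> dict:
    """Placeholder for the trained model; returns zeros, not real predictions."""
    return {f"season_{n}": {target: 0.0 for target in TARGETS} for n in (1, 2, 3)}


@app.route("/api/predict", methods=["POST"])
def predict():
    # Expect a JSON object of pre-NHL stats, e.g. {"GP_non_nhl": 240, ...}.
    features = request.get_json(silent=True)
    if not isinstance(features, dict):
        return jsonify({"error": "expected a JSON object of pre-NHL stats"}), 400
    return jsonify(predict_player(features))
```

The front-end would send a player's pre-NHL stats to this route and render the three-season projection it receives; in a real deployment `predict_player` would load the fitted Random Forest and build the same feature columns used in training.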

### **Website Features**  

✅ **Player Analysis & Prediction** – View **individual player stats**, their **pre-NHL career**, and **projected NHL performance** based on our model.  

✅ **League Analysis** – Compare the **difficulty levels of different leagues** and analyze how **players from each league historically perform in the NHL**.  

✅ **Player Comparison** – Compare a player’s **projected NHL stats** with **current and historical NHL players**, helping **scouts, analysts, and fans** understand **performance trends over time**.  
