# 🏀 NCAA March Madness Feature Engineering: Building Predictive Team Metrics

## 🚀 Objective
This notebook transforms raw NCAA basketball data into **meaningful features** for predicting tournament winners.  
By engineering **team performance metrics**, we prepare a dataset that can be used for **machine learning models**.

## 🔬 Key Steps
1️⃣ **Load & inspect datasets** (season results, tournament games, seeds, teams)  
2️⃣ **Compute team performance metrics** (win percentage, avg points, point differential)  
3️⃣ **Merge team stats into tournament match data**  
4️⃣ **Create feature differences for predictive modeling**  

📌 **Next Step:** Use these engineered features to **train a machine learning model** to predict NCAA tournament outcomes! 🎯


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np

# Load datasets
season_results = pd.read_csv("../data/MRegularSeasonCompactResults.csv")  # Regular season games
tourney_results = pd.read_csv("../data/MNCAATourneyCompactResults.csv")  # Tournament games
seeds = pd.read_csv("../data/MNCAATourneySeeds.csv")  # Team seed rankings
teams = pd.read_csv("../data/MTeams.csv")  # Team names

# Display first few rows of each dataset
display(season_results.head(), tourney_results.head(), seeds.head(), teams.head())

# Print column names for reference
print("Season Results Columns:", season_results.columns)
print("Tournament Results Columns:", tourney_results.columns)
print("Seeds Columns:", seeds.columns)
print("Teams Columns:", teams.columns)


### 📊 Computing Team Strength
We will calculate:
- **Win Percentage** = Total Wins / Total Games
- **Average Points Scored** (Offensive Strength)
- **Average Points Allowed** (Defensive Strength)
- **Point Differential** = (Avg Points Scored - Avg Points Allowed)
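
As a sanity check on these definitions, here is a minimal sketch on a made-up three-game schedule. The frame `toy_games` and its scores are invented for illustration; only the column names mirror the compact results files. It stacks a winner view and a loser view of each game so that every game counts toward both teams' averages, whether they won or lost.

```python
import pandas as pd

# Hypothetical schedule: team 1 wins twice, team 2 splits, team 3 loses twice
toy_games = pd.DataFrame({
    "WTeamID": [1, 2, 1],
    "WScore":  [80, 65, 85],
    "LTeamID": [2, 3, 3],
    "LScore":  [70, 60, 78],
})

# One row per team per game, seen from the winner's side...
winner_view = pd.DataFrame({
    "TeamID": toy_games["WTeamID"],
    "Scored": toy_games["WScore"],
    "Allowed": toy_games["LScore"],
    "Win": 1,
})
# ...and from the loser's side
loser_view = pd.DataFrame({
    "TeamID": toy_games["LTeamID"],
    "Scored": toy_games["LScore"],
    "Allowed": toy_games["WScore"],
    "Win": 0,
})
long_games = pd.concat([winner_view, loser_view], ignore_index=True)

toy_stats = long_games.groupby("TeamID").agg(
    WinPercentage=("Win", "mean"),        # wins / games played
    AvgPointsScored=("Scored", "mean"),
    AvgPointsAllowed=("Allowed", "mean"),
)
toy_stats["PointDifferential"] = (
    toy_stats["AvgPointsScored"] - toy_stats["AvgPointsAllowed"]
)
print(toy_stats)
```

Team 3 never wins in this toy schedule, yet it still appears in the result with a win percentage of 0, which would not happen if wins were counted from `WTeamID` alone.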


In [None]:
# Stack a winner view and a loser view so each team gets one row per game played.
# Assumes the results file has a "Season" column, as in the Kaggle compact results.
winner_rows = season_results.rename(columns={
    "WTeamID": "TeamID", "WScore": "PointsScored", "LScore": "PointsAllowed"
})[["Season", "TeamID", "PointsScored", "PointsAllowed"]].assign(Win=1)
loser_rows = season_results.rename(columns={
    "LTeamID": "TeamID", "LScore": "PointsScored", "WScore": "PointsAllowed"
})[["Season", "TeamID", "PointsScored", "PointsAllowed"]].assign(Win=0)
team_games = pd.concat([winner_rows, loser_rows], ignore_index=True)

# Aggregate per team and season; winless teams are kept with Wins = 0
team_stats = team_games.groupby(["Season", "TeamID"]).agg(
    Wins=("Win", "sum"),
    TotalGames=("Win", "count"),
    AvgPointsScored=("PointsScored", "mean"),    # offense, over wins and losses
    AvgPointsAllowed=("PointsAllowed", "mean"),  # defense, over wins and losses
).reset_index()
team_stats["WinPercentage"] = team_stats["Wins"] / team_stats["TotalGames"]

# Compute point differential (offense - defense)
team_stats["PointDifferential"] = team_stats["AvgPointsScored"] - team_stats["AvgPointsAllowed"]

# Display updated team stats
display(team_stats.head())


### 🔢 Adding Seed Information
The **seeding rank** of a team is a strong predictor of tournament performance.  
- Lower **seed values** indicate stronger teams.
- We extract **only the numeric value** of the seed (e.g., `"W01"` → `1`).
- Seeds are assigned per season, so they are joined on both `Season` and `TeamID`.


In [None]:
# Extract numeric seed value (e.g., "W01" → 1); characters 2-3 hold the number
seeds["SeedValue"] = seeds["Seed"].str[1:3].astype(int)

# Merge seeds into team stats on Season and TeamID.
# Teams that missed the tournament in a given season keep SeedValue = NaN.
team_stats = team_stats.merge(
    seeds[["Season", "TeamID", "SeedValue"]], on=["Season", "TeamID"], how="left"
)

# Display updated team stats
display(team_stats.head())


### 🔗 Merging Team Stats into Tournament Data
We now merge **team performance metrics** into **tournament match results**,  
matching on season, so we can compare the same-season stats of both competing teams.


In [None]:
stat_cols = ["WinPercentage", "AvgPointsScored", "AvgPointsAllowed", "PointDifferential", "SeedValue"]

# Merge team stats for the winning team (WTeamID), prefixing stat columns with W_
w_stats = team_stats[["Season", "TeamID"] + stat_cols].rename(
    columns={"TeamID": "WTeamID", **{c: f"W_{c}" for c in stat_cols}}
)
tourney_results = tourney_results.merge(
    w_stats, on=["Season", "WTeamID"], how="left", validate="many_to_one"
)

# Merge team stats for the losing team (LTeamID), prefixing stat columns with L_
l_stats = team_stats[["Season", "TeamID"] + stat_cols].rename(
    columns={"TeamID": "LTeamID", **{c: f"L_{c}" for c in stat_cols}}
)
tourney_results = tourney_results.merge(
    l_stats, on=["Season", "LTeamID"], how="left", validate="many_to_one"
)

# Display updated tournament dataset
display(tourney_results.head())
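

A left merge silently multiplies rows when the right-hand table holds more than one row per key, which is easy to trigger when stats exist for several seasons. As a hedged safeguard, `DataFrame.merge` accepts a `validate` argument that raises an error if the key relationship is not the one you expect. The toy frames `games` and `stats` below are invented; only the pattern carries over.

```python
import pandas as pd

# Hypothetical data: team 10 wins a tournament game in each of two seasons
games = pd.DataFrame({"Season": [2021, 2022], "WTeamID": [10, 10]})
stats = pd.DataFrame({
    "Season": [2021, 2022],
    "WTeamID": [10, 10],
    "WinPercentage": [0.80, 0.65],
})

# Keyed on team only, each game matches both seasons of team 10
by_team_only = games[["WTeamID"]].merge(stats, on="WTeamID", how="left")
print(len(by_team_only))

# Keyed on season and team, each game matches exactly one stats row
by_season = games.merge(
    stats, on=["Season", "WTeamID"], how="left", validate="many_to_one"
)
print(len(by_season))

# validate="many_to_one" raises MergeError when right-hand keys are duplicated
try:
    games[["WTeamID"]].merge(stats, on="WTeamID", how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```

Comparing `len(tourney_results)` before and after each merge is a cheap additional check that no games were duplicated.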


### 🔀 Creating Feature Differences
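Instead of using raw team statistics, we calculate **relative differences** (winning team minus losing team) for each stat:
- **WinPercentage_Diff** = (Winning Team % - Losing Team %)
- **AvgPointsScored_Diff** = (Winning Team Avg Points Scored - Losing Team Avg Points Scored)
- **AvgPointsAllowed_Diff** = (Winning Team Avg Points Allowed - Losing Team Avg Points Allowed)
- **PointDifferential_Diff** = (Winning Team Point Differential - Losing Team Point Differential)
- **SeedValue_Diff** = (Winning Team Seed - Losing Team Seed)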
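
To make the sign conventions concrete, the following sketch computes two of the differences for a single made-up matchup. The numbers are invented; only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical matchup: a 2-seed (85% wins) beats a 7-seed (70% wins)
matchup = pd.DataFrame({
    "W_WinPercentage": [0.85], "L_WinPercentage": [0.70],
    "W_SeedValue": [2], "L_SeedValue": [7],
})

# Winner minus loser for each stat
for stat in ["WinPercentage", "SeedValue"]:
    matchup[f"{stat}_Diff"] = matchup[f"W_{stat}"] - matchup[f"L_{stat}"]

print(matchup[["WinPercentage_Diff", "SeedValue_Diff"]])
```

Because lower seed values indicate stronger teams, a negative `SeedValue_Diff` means the winner was the better-seeded team, while a positive `WinPercentage_Diff` means the winner had the better regular-season record.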


In [None]:
# Compute feature differences
tourney_results["WinPercentage_Diff"] = tourney_results["W_WinPercentage"] - tourney_results["L_WinPercentage"]
tourney_results["AvgPointsScored_Diff"] = tourney_results["W_AvgPointsScored"] - tourney_results["L_AvgPointsScored"]
tourney_results["AvgPointsAllowed_Diff"] = tourney_results["W_AvgPointsAllowed"] - tourney_results["L_AvgPointsAllowed"]
tourney_results["PointDifferential_Diff"] = tourney_results["W_PointDifferential"] - tourney_results["L_PointDifferential"]
tourney_results["SeedValue_Diff"] = tourney_results["W_SeedValue"] - tourney_results["L_SeedValue"]

# Select relevant features
model_data = tourney_results[[
    "WinPercentage_Diff",
    "AvgPointsScored_Diff",
    "AvgPointsAllowed_Diff",
    "PointDifferential_Diff",
    "SeedValue_Diff"
]].copy()

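# Add target variable. Every row is framed from the winning team's perspective,
# so Result is 1 for all rows; rows framed from the losing side (Result = 0)
# are still needed before a classifier can be trained on this table.
model_data["Result"] = 1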

# Display final dataset
display(model_data.head())
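

Because every row above is written from the winner's point of view, `Result` is always 1. One possible way to obtain both classes, sketched here as an assumption rather than as this notebook's method, is to append a mirrored copy of each row with the sign of every difference feature flipped and `Result` set to 0. The `toy_model_data` frame is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for model_data with two games
toy_model_data = pd.DataFrame({
    "WinPercentage_Diff": [0.15, -0.05],
    "SeedValue_Diff": [-5, 3],
    "Result": [1, 1],
})
diff_cols = [c for c in toy_model_data.columns if c.endswith("_Diff")]

# Mirror each game: swap perspectives by negating the differences
mirrored = toy_model_data.copy()
mirrored[diff_cols] = -mirrored[diff_cols]
mirrored["Result"] = 0

balanced = pd.concat([toy_model_data, mirrored], ignore_index=True)
print(balanced)
```

If you mirror rows this way, keep each game and its mirror in the same train or validation fold; otherwise the model sees the answer to a validation row during training.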
