# NBA Playoff Predictor - Feature Engineering

This notebook engineers features for predicting NBA playoff qualification from historical NBA data. The features capture team performance, roster composition, and injury patterns that may influence a team's likelihood of making the playoffs.

## Data Sources

1. **Team Statistics** (`team_stats.csv`):
   - Season-level team performance metrics
   - Includes scoring, rebounding, assists, and defensive statistics
   - Contains playoff qualification information (target variable)

2. **Player Statistics** (`player_season.csv`):
   - Player-level seasonal data
   - Includes player age and years of experience
   - Used to analyze team composition

3. **Injury Summary** (`injuries_summary.csv`):
   - Team-level injury counts by year
   - Used to assess impact of injuries on team performance

In [8]:
# Data manipulation
import pandas as pd
import numpy as np

# System utilities
import sys
from pathlib import Path
import json
from datetime import datetime

# Add the src directory to the path
sys.path.append('..')
from src.features.feature_builder import FeatureBuilder

# Configure pandas display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

## 1. Load Data

We begin by loading our preprocessed data sources. Each dataset contributes unique information for feature engineering:

- Team statistics provide the foundation with season-level performance metrics
- Player statistics offer insights into team composition and experience
- Injury summary data helps assess the impact of player availability

In [9]:
# Load all data sources
data_dir = '../data/processed'

# Team statistics
team_stats = pd.read_csv(f'{data_dir}/team_stats.csv')
print("\nTeam Stats Shape:", team_stats.shape)
print("Sample of team stats columns:")
print(team_stats[['season', 'team', 'playoffs', 'pts_per_game', 'fg_percent']].head())

# Player statistics
player_stats = pd.read_csv(f'{data_dir}/player_season.csv')
print("\nPlayer Stats Shape:", player_stats.shape)
print("Sample of player stats columns:")
print(player_stats[['season', 'player', 'team', 'age', 'experience']].head())

# Injury summary
injuries = pd.read_csv(f'{data_dir}/injuries_summary.csv')
print("\nInjuries Shape:", injuries.shape)
print("Sample of injury data:")
print(injuries.head())


Team Stats Shape: (659, 29)
Sample of team stats columns:
   season team  playoffs  pts_per_game  fg_percent
0    2025  ATL     False         116.1       0.463
1    2025  BOS     False         121.2       0.464
2    2025  BKN     False         111.8       0.468
3    2025  CHI     False         118.5       0.475
4    2025  CHA     False         107.5       0.424

Player Stats Shape: (12281, 6)
Sample of player stats columns:
   season          player team  age  experience
0    2004     Aaron McKie  PHI   31          10
1    2004  Aaron Williams  BKN   32          10
2    2004    Adonal Foyle  GSW   28           7
3    2004  Adrian Griffin  HOU   29           5
4    2004   Al Harrington  IND   23           6

Injuries Shape: (591, 4)
Sample of injury data:
   year team  count conference
0  2004  ATL     20       EAST
1  2004  BKN     21       EAST
2  2004  BOS     16       EAST
3  2004  CHA     36       EAST
4  2004  CHI     25       EAST


## 2. Create Team Features

First, we create team-level features from the team statistics data. These features capture various aspects of team performance:

### Efficiency Metrics
- **True Shooting Percentage**: Shooting efficiency that accounts for field goals, three-pointers, and free throws
- **Effective Field Goal Percentage**: Adjusts for three-pointers being worth more
- **Offensive Rebound Percentage**: Team's ability to secure offensive rebounds

### Ball Movement and Control
- **Assist-to-Turnover Ratio**: Measures ball control efficiency
- **Assist Ratio**: Team's tendency to create shots through passing

### Defensive Impact
- **Stocks per Game**: Combined steals and blocks
- **Defensive Rating**: Composite defensive effectiveness metric

### Possession and Pace
- **Possessions per Game**: Estimates team's pace
- **Offensive Efficiency**: Points scored per possession
- **Three Point Rate**: Team's reliance on three-point shots
- **Free Throw Rate**: Ability to get to the free throw line
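
Several of the metrics listed above follow standard basketball formulas. The sketch below shows how they could be computed with pandas. It is a minimal illustration, not the `FeatureBuilder.create_team_features` implementation, which is not shown in this notebook. The input column names (`pts`, `fg`, `fga`, `x3p`, `x3pa`, `fta`, `ast`, `tov`, `stl`, `blk`) and the helper name `basic_efficiency_features` are assumptions made for this example.

```python
import pandas as pd


def basic_efficiency_features(df: pd.DataFrame) -> pd.DataFrame:
    """Compute a few standard efficiency metrics from per-game box-score columns.

    Column names are hypothetical and chosen for this sketch only.
    """
    out = pd.DataFrame(index=df.index)
    # True shooting: points per scoring attempt, using the conventional
    # 0.44 weight on free-throw attempts
    out["true_shooting_pct"] = df["pts"] / (2 * (df["fga"] + 0.44 * df["fta"]))
    # Effective FG%: a made three counts as 1.5 made field goals
    out["efg_pct"] = (df["fg"] + 0.5 * df["x3p"]) / df["fga"]
    # Ball control: assists per turnover
    out["ast_to_ratio"] = df["ast"] / df["tov"]
    # "Stocks": steals plus blocks
    out["stocks_per_game"] = df["stl"] + df["blk"]
    # Shot profile: share of attempts from three, and free throws per attempt
    out["three_point_rate"] = df["x3pa"] / df["fga"]
    out["ft_rate"] = df["fta"] / df["fga"]
    return out


# Made-up single-team example (not taken from team_stats.csv)
example = pd.DataFrame({
    "pts": [110.0], "fg": [40.0], "fga": [88.0], "x3p": [12.0], "x3pa": [34.0],
    "fta": [25.0], "ast": [25.0], "tov": [12.5], "stl": [7.5], "blk": [5.0],
})
features = basic_efficiency_features(example)
print(features.round(3))
```

Possession-based metrics such as offensive efficiency depend on how possessions are estimated, so they are left out of this sketch.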

In [10]:
# Initialize feature builder
builder = FeatureBuilder()

# Create team features
print("Creating team performance features...")
team_features = builder.create_team_features(team_stats)

# Display feature statistics
print(f"\nCreated {len(team_features.columns)} team features for {len(team_features)} seasons")

# Show sample of engineered features
print("\nSample of engineered team features:")
feature_sample = [
    'true_shooting_pct', 'efg_pct', 'ast_to_ratio',
    'stocks_per_game', 'off_efficiency', 'efficiency_rating'
]
print(team_features[['team', 'season'] + feature_sample].head())

# Display summary statistics
print("\nSummary statistics for key features:")
print(team_features[feature_sample].describe())

Creating team performance features...

Created 41 team features for 659 seasons

Sample of engineered team features:
  team  season  true_shooting_pct   efg_pct  ast_to_ratio  stocks_per_game  \
0  ATL    2025           0.568538  0.531694      1.817073             15.5   
1  BOS    2025           0.604260  0.569214      2.232759             12.6   
2  BKN    2025           0.598399  0.563095      1.808219             10.1   
3  CHI    2025           0.594617  0.564978      1.863636             11.6   
4  CHA    2025           0.543368  0.513187      1.413580             13.3   

   off_efficiency  efficiency_rating  
0        1.093179           4.465165  
1        1.190710           4.129762  
2        1.123437           2.581872  
3        1.116408           3.068755  
4        1.056823           3.523508  

Summary statistics for key features:
       true_shooting_pct     efg_pct  ast_to_ratio  stocks_per_game  \
count         659.000000  659.000000    659.000000       659.000000   


## 3. Create Player Features

Next, we create team-level features from player statistics and injury data. These features capture team composition and health:

### Experience and Age
- **Experience Metrics**: Average, maximum, and minimum years of experience
- **Age Demographics**: Average, oldest, and youngest player ages
- **Roster Size**: Number of players on the team

### Health Impact
- **Injury Count** (`count`): Total number of injuries for the team, which provides context for team performance and depth challenges
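
The roster features described above amount to a group-by aggregation over player rows followed by a join with the injury counts. The sketch below illustrates that pattern using the column names visible in the loaded data (`season`, `player`, `team`, `age`, `experience` for players; `year`, `team`, `count` for injuries). The function name `roster_features` and the choice of a left join are assumptions; the actual `create_player_features` logic may differ.

```python
import pandas as pd


def roster_features(players: pd.DataFrame, injuries: pd.DataFrame) -> pd.DataFrame:
    """Aggregate player rows to one row per team-season and attach injury counts."""
    agg = (
        players.groupby(["team", "season"])
        .agg(
            avg_experience=("experience", "mean"),
            max_experience=("experience", "max"),
            min_experience=("experience", "min"),
            avg_age=("age", "mean"),
            max_age=("age", "max"),
            min_age=("age", "min"),
            roster_size=("player", "nunique"),
        )
        .reset_index()
    )
    # The injury table keys on `year`; align it with `season` before joining
    inj = injuries.rename(columns={"year": "season"})[["team", "season", "count"]]
    # Left join (assumed): team-seasons without an injury record keep NaN in `count`
    return agg.merge(inj, on=["team", "season"], how="left")


# Made-up rows for illustration
players = pd.DataFrame({
    "season": [2004, 2004, 2004, 2004, 2004],
    "player": ["P1", "P2", "P3", "P4", "P5"],
    "team": ["ATL", "ATL", "ATL", "BOS", "BOS"],
    "age": [31, 25, 22, 28, 30],
    "experience": [10, 3, 0, 6, 8],
})
injuries = pd.DataFrame({"year": [2004], "team": ["ATL"], "count": [20], "conference": ["EAST"]})

roster = roster_features(players, injuries)
print(roster)
```

A left join would be consistent with `count` appearing as a float in the output below, since the injury table has fewer rows (591) than there are team-seasons (659), but that is an inference rather than something the notebook states.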

In [11]:
# Create player and injury features
print("Creating player composition and injury features...")
player_features = builder.create_player_features(player_stats, injuries)

# Display feature statistics
print(f"\nCreated {len(player_features.columns)} player/injury features for {len(player_features)} seasons")

# Show sample of engineered features
print("\nSample of engineered player features:")
feature_sample = [
    'avg_experience', 'roster_size', 'avg_age',
    'max_age', 'min_age', 'count'
]
print(player_features[['team', 'season'] + feature_sample].head())

# Display summary statistics
print("\nSummary statistics for key features:")
print(player_features[feature_sample].describe())

Creating player composition and injury features...

Created 11 player/injury features for 659 seasons

Sample of engineered player features:
  team  season  avg_experience  roster_size    avg_age  max_age  min_age  \
0  ATL    2004        5.043478           23  26.695652       32       21   
1  ATL    2005        6.750000           20  28.000000       42       19   
2  ATL    2006        3.533333           15  24.133333       32       19   
3  ATL    2007        4.000000           19  24.631579       32       20   
4  ATL    2008        4.875000           16  25.125000       33       21   

   count  
0   20.0  
1   37.0  
2   32.0  
3   78.0  
4   39.0  

Summary statistics for key features:
       avg_experience  roster_size     avg_age     max_age     min_age  \
count      659.000000   659.000000  659.000000  659.000000  659.000000   
mean         5.565131    18.635812   26.403731   34.705615   20.356601   
std          1.233115     3.199685    1.330941    2.648877    1.090996   
mi

## 3.5 Create Conference Features

We create conference-specific features to capture team performance within their conference:

### Conference Performance
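- **Conference Rank** (`conf_rank`): Team's position within its conference for the season
- **Points Behind Leader** (`pts_behind_leader`): Points gap to the conference leader
- **Points Behind 8th** (`pts_behind_8th`): Points gap to the eighth-ranked team in the conference; negative values mean the team is ahead of that mark
- **Points vs Conference Average** (`pts_vs_conf_avg`): Team scoring relative to the conference mean
- **Games vs Conference Average** (`games_vs_conf_avg`): Game total compared to the conference average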
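
Conference-relative features like these are typically computed with grouped transforms, so that every team is compared only with teams in the same season and conference. The sketch below is a hedged illustration: it assumes the input has `season`, `conference`, `team`, and `pts_per_game` columns, ranks teams by points per game, and uses a configurable `playoff_cutoff` (8 would correspond to an "eighth place" gap). The function name `conference_sketch` and the output column `pts_behind_cutoff` are hypothetical, and the real `create_conference_features` method may rank or scale differently.

```python
import pandas as pd


def conference_sketch(df: pd.DataFrame, playoff_cutoff: int = 8) -> pd.DataFrame:
    """Compare each team's scoring with the rest of its conference in a season."""
    out = df[["team", "season", "conference", "pts_per_game"]].copy()
    grp = df.groupby(["season", "conference"])["pts_per_game"]

    # Rank 1 = highest scoring team in the conference that season
    out["conf_rank"] = grp.rank(ascending=False, method="min")
    # Difference from the conference mean (an assumption; a z-score is another option)
    out["pts_vs_conf_avg"] = df["pts_per_game"] - grp.transform("mean")
    # Gap to the conference leader (0 for the leader itself)
    out["pts_behind_leader"] = grp.transform("max") - df["pts_per_game"]
    # Gap to the team at the cutoff rank; negative means ahead of the cutoff
    cutoff_value = grp.transform(lambda s: s.nlargest(playoff_cutoff).iloc[-1])
    out["pts_behind_cutoff"] = cutoff_value - df["pts_per_game"]
    return out


# Made-up six-team league; cutoff of 2 keeps the example small
example = pd.DataFrame({
    "team": ["A", "B", "C", "D", "E", "F"],
    "season": [2004] * 6,
    "conference": ["EAST", "EAST", "EAST", "WEST", "WEST", "WEST"],
    "pts_per_game": [110.0, 105.0, 100.0, 120.0, 115.0, 104.0],
})
conf = conference_sketch(example, playoff_cutoff=2)
print(conf)
```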

In [12]:
# Create conference features
print("Creating conference-based features...")
conference_features = builder.create_conference_features(team_stats)

# Display feature statistics
print(f"\nCreated {len(conference_features.columns)} conference features for {len(conference_features)} seasons")

# Show sample of engineered features
print("\nSample of engineered conference features:")
feature_sample = [
    'conf_rank', 'pts_behind_leader', 'pts_behind_8th',
    'pts_vs_conf_avg', 'games_vs_conf_avg'
]
print(conference_features[['team', 'season'] + feature_sample].head())

# Display summary statistics
print("\nSummary statistics for key features:")
print(conference_features[feature_sample].describe())

Creating conference-based features...

Created 8 conference features for 659 seasons

Sample of engineered conference features:
    team  season  conf_rank  pts_behind_leader  pts_behind_8th  \
630  ATL    2004        5.0                5.2            -1.4   
631  BOS    2004        2.0                2.7            -3.9   
632  CHI    2004       12.0                8.3             1.7   
633  CLE    2004        4.0                5.1            -1.5   
636  DET    2004       11.0                7.9             1.3   

     pts_vs_conf_avg  games_vs_conf_avg  
630         0.397143                0.0  
631         1.205283                0.0  
632        -0.604950                0.0  
633         0.429469                0.0  
636        -0.475648                0.0  

Summary statistics for key features:
        conf_rank  pts_behind_leader  pts_behind_8th  pts_vs_conf_avg  \
count  659.000000         659.000000      659.000000     6.590000e+02   
mean     7.945372           6.937936   

## 4. Combine Features

Finally, we combine all feature sets and prepare the data for modeling:

### Process
1. Merge team, player, and conference features
2. Extract target variable (playoff qualification)
3. Handle any missing values
4. Remove non-feature columns

### Quality Checks
- Verify feature completeness
- Examine feature distributions
- Confirm target variable balance
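
The process and checks above can be sketched as a pair of merges followed by target extraction and imputation. This is an illustration under stated assumptions, not the `FeatureBuilder.combine_features` implementation: it assumes the three frames share `team` and `season` keys, that the `playoffs` target lives in the team frame, and that missing values are filled with the column median. The function name `combine_sketch` is hypothetical.

```python
import pandas as pd


def combine_sketch(team: pd.DataFrame, player: pd.DataFrame, conference: pd.DataFrame):
    """Merge feature frames on (team, season) and split features from target."""
    keys = ["team", "season"]
    merged = (
        team.merge(player, on=keys, how="left")
            .merge(conference, on=keys, how="left")
    )
    # Target: playoff qualification as 0/1
    target = merged["playoffs"].astype(int)
    # Drop identifiers and the target so only features remain
    features = merged.drop(columns=keys + ["playoffs"])
    # Median imputation is an assumption made for this sketch
    features = features.fillna(features.median(numeric_only=True))
    return features, target


# Made-up rows; CHI has no player row, so avg_age must be imputed
team = pd.DataFrame({
    "team": ["ATL", "BOS", "CHI"], "season": [2004] * 3,
    "playoffs": [True, False, True], "off_efficiency": [1.05, 1.10, 1.00],
})
player = pd.DataFrame({"team": ["ATL", "BOS"], "season": [2004] * 2, "avg_age": [26.0, 28.0]})
conference = pd.DataFrame({
    "team": ["ATL", "BOS", "CHI"], "season": [2004] * 3, "conf_rank": [5.0, 2.0, 12.0],
})

X, y = combine_sketch(team, player, conference)

# Quality checks: completeness and target balance
print("Missing values:", int(X.isna().sum().sum()))
print(f"Playoff rate: {y.mean():.2%}")
```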

In [13]:
print("Combining all feature sets...")
feature_matrix, target = builder.combine_features(
    team_features,
    player_features,
    conference_features
)

# Display combined feature statistics
print(f"\nFinal feature matrix shape: {feature_matrix.shape}")
print(f"Target distribution (playoff rate): {target.mean():.2%}")

# Show feature categories
print("\nFeature Categories:")
categories = {
    'Team Performance': [col for col in feature_matrix.columns if any(x in col for x in ['pct', 'ratio', 'rating', 'efficiency'])],
    'Player Impact': [col for col in feature_matrix.columns if any(x in col for x in ['experience', 'age', 'injury'])],
    'Conference Impact': [col for col in feature_matrix.columns if any(x in col for x in ['conf_', 'vs_conf'])]
}

for category, cols in categories.items():
    print(f"\n{category} Features ({len(cols)} features):")
    print("- " + "\n- ".join(cols))

# Display correlation with target
correlations = pd.DataFrame({
    'feature': feature_matrix.columns,
    'correlation': [abs(feature_matrix[col].corr(target)) for col in feature_matrix.columns]
}).sort_values('correlation', ascending=False)

print("\nTop 10 Most Predictive Features:")
print(correlations.head(10))

Combining all feature sets...

Final feature matrix shape: (659, 48)
Target distribution (playoff rate): 48.56%

Feature Categories:

Team Performance Features (8 features):
- true_shooting_pct
- efg_pct
- oreb_pct
- ast_to_ratio
- ast_ratio
- def_rating
- off_efficiency
- efficiency_rating

Player Impact Features (6 features):
- avg_experience
- max_experience
- min_experience
- avg_age
- max_age
- min_age

Conference Impact Features (3 features):
- conf_rank
- pts_vs_conf_avg
- games_vs_conf_avg

Top 10 Most Predictive Features:
              feature  correlation
35     avg_experience     0.378943
39            avg_age     0.377450
43          conf_rank     0.346981
46    pts_vs_conf_avg     0.346426
45     pts_behind_8th     0.340684
7         x3p_percent     0.306230
4          fg_percent     0.294577
44  pts_behind_leader     0.289932
29         def_rating     0.280181
34  efficiency_rating     0.269600


## 5. Save Features

Save the engineered features and metadata for use in modeling:

### Saved Data
- Complete feature matrix
- Target variable (playoff qualification)
- Feature metadata and statistics
- Data quality metrics

In [None]:
def save_features(feature_matrix: pd.DataFrame, target: pd.Series, conference_features: pd.DataFrame) -> None:
    """
    Save features and metadata with detailed logging.

    Relies on the notebook-level `categories` and `correlations` objects
    defined in the previous cell.
    
    Args:
        feature_matrix: DataFrame with engineered features
        target: Series with playoff qualification target
        conference_features: DataFrame with conference information
    """
    output_dir = Path('../data/processed/features')
    output_dir.mkdir(parents=True, exist_ok=True)
    
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    print(f"Saving features with timestamp: {timestamp}")
    
    # Save the main feature matrix with target
    final_data = feature_matrix.copy()
    final_data['playoffs'] = target
    
    # Save conference features separately
    conference_path = output_dir / f'conference_features_{timestamp}.csv'
    conference_features.to_csv(conference_path, index=False)
    print(f"Saved conference features to {conference_path}")
    
    # Save main feature matrix
    feature_path = output_dir / f'playoff_features_{timestamp}.csv'
    final_data.to_csv(feature_path, index=False)
    print(f"Saved feature matrix to {feature_path}")
    
    # Calculate category-wise statistics
    category_stats = {}
    for category, cols in categories.items():
        category_stats[category] = {
            'n_features': len(cols),
            'avg_correlation': float(correlations[correlations['feature'].isin(cols)]['correlation'].mean()),
            'top_features': correlations[correlations['feature'].isin(cols)].head(5).to_dict('records')
        }
    
    # Enhanced metadata with conference info
    metadata = {
        'timestamp': timestamp,
        'n_samples': len(feature_matrix),
        'n_features': len(feature_matrix.columns),
        'feature_names': list(feature_matrix.columns),
        'playoff_rate': float(target.mean()),
        'category_statistics': category_stats,
        'top_correlations': correlations.head(15).to_dict('records'),
        'conference_features': list(conference_features.columns),
        'description': 'Features for NBA playoff prediction based on team stats, player composition, and conference standing',
        'data_quality': {
            'missing_values': feature_matrix.isnull().sum().to_dict(),
            'feature_stats': feature_matrix.describe().to_dict()
        }
    }
    
    metadata_path = output_dir / f'playoff_features_{timestamp}_metadata.json'
    with open(metadata_path, 'w') as f:
        json.dump(metadata, f, indent=2)
    print(f"Saved metadata to {metadata_path}")
    
    print(f"\nFeature Engineering Summary:")
    print(f"- Total samples: {metadata['n_samples']}")
    print(f"- Total features: {metadata['n_features']}")
    print(f"- Playoff rate: {metadata['playoff_rate']:.2%}")
    for category, stats in category_stats.items():
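        print(f"- {category}: {stats['n_features']} features, {stats['avg_correlation']:.3f} avg correlation")

# Run the save step with the objects built above
save_features(feature_matrix, target, conference_features)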

Saving features with timestamp: 20241211_212826
Saved feature matrix to ../data/processed/features/playoff_features_20241211_212826.csv
Saved metadata to ../data/processed/features/playoff_features_20241211_212826_metadata.json

Feature Engineering Summary:
- Total samples: 659
- Total features: 48
- Playoff rate: 48.56%
- Team Performance: 8 features, 0.197 avg correlation
- Player Impact: 6 features, 0.250 avg correlation
- Conference Impact: 3 features, 0.253 avg correlation
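
Because each run writes a new timestamped file, a later modeling step needs a way to pick the most recent one. The sketch below shows one possible approach, relying on the fact that the `%Y%m%d_%H%M%S` timestamp sorts chronologically when sorted as text. The helper name `latest_feature_file` is hypothetical, and the demonstration uses a temporary directory with empty placeholder files rather than the real `../data/processed/features` folder.

```python
import tempfile
from pathlib import Path
from typing import Optional


def latest_feature_file(directory: Path) -> Optional[Path]:
    """Return the newest playoff_features_<timestamp>.csv in `directory`, if any."""
    # Only the .csv files match; the *_metadata.json files are ignored
    candidates = sorted(directory.glob("playoff_features_*.csv"))
    return candidates[-1] if candidates else None


# Demonstration with placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    for name in [
        "playoff_features_20241210_090000.csv",
        "playoff_features_20241211_212826.csv",
        "playoff_features_20241211_212826_metadata.json",
    ]:
        (tmp_dir / name).touch()
    latest = latest_feature_file(tmp_dir)
    latest_name = latest.name if latest else None

print(latest_name)
```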
