# NBA Playoff Predictor - Feature Engineering

This notebook engineers features for predicting NBA playoff qualification from historical data. The features capture team performance, roster composition, and injury patterns that may influence a team's likelihood of making the playoffs.

## Data Sources

1. **Team Statistics** (`team_stats.csv`):
   - Season-level team performance metrics
   - Includes scoring, rebounding, assists, and defensive statistics
   - Contains playoff qualification information (target variable)

2. **Player Statistics** (`player_season.csv`):
   - Player-level seasonal data
   - Includes each player's age and years of experience
   - Used to analyze team composition

3. **Injury Summary** (`injuries_summary.csv`):
   - Team-level injury counts by year
   - Used to assess impact of injuries on team performance

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

# System utilities
import sys
from pathlib import Path
import json
from datetime import datetime

# Add the project root (the parent of this notebook's directory) to the path so `src` is importable
sys.path.append('..')
from src.features.feature_builder import FeatureBuilder

# Configure pandas display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

## 1. Load Data

We begin by loading our preprocessed data sources. Each dataset contributes unique information for feature engineering:

- Team statistics provide the foundation with season-level performance metrics
- Player statistics offer insights into team composition and experience
- Injury summary data helps assess the impact of player availability

In [2]:
# Load all data sources
data_dir = '../data/processed'

# Team statistics
team_stats = pd.read_csv(f'{data_dir}/team_stats.csv')
print("\nTeam Stats Shape:", team_stats.shape)
print("Sample of team stats columns:")
print(team_stats[['season', 'team', 'playoffs', 'pts_per_game', 'fg_percent']].head())

# Player statistics
player_stats = pd.read_csv(f'{data_dir}/player_season.csv')
print("\nPlayer Stats Shape:", player_stats.shape)
print("Sample of player stats columns:")
print(player_stats[['season', 'player', 'team', 'age', 'experience']].head())

# Injury summary
injuries = pd.read_csv(f'{data_dir}/injuries_summary.csv')
print("\nInjuries Shape:", injuries.shape)
print("Sample of injury data:")
print(injuries.head())


Team Stats Shape: (659, 28)
Sample of team stats columns:
   season team  playoffs  pts_per_game  fg_percent
0    2025  ATL     False         116.1       0.463
1    2025  BOS     False         121.2       0.464
2    2025  BKN     False         111.8       0.468
3    2025  CHI     False         118.5       0.475
4    2025  CHA     False         107.5       0.424

Player Stats Shape: (13629, 5)
Sample of player stats columns:
   season          player team  age  experience
0    2004     Aaron McKie  PHI   31          10
1    2004  Aaron Williams  NJN   32          10
2    2004    Adonal Foyle  GSW   28           7
3    2004  Adrian Griffin  HOU   29           5
4    2004   Al Harrington  IND   23           6

Injuries Shape: (603, 3)
Sample of injury data:
   year     team  count
0  2004      ATL     20
1  2004      BKN     21
2  2004  BOBCATS     15
3  2004      BOS     16
4  2004      CHA     21
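
The samples above show that the injury table is keyed on `year` while the other two tables are keyed on `season`, and that it mixes team labels such as `BOBCATS` and `CHA`. Below is a minimal sketch of aligning those keys before any merge. The `TEAM_ALIASES` mapping and the `normalize_injury_keys` helper are hypothetical, the decision to sum rows that collapse onto the same key is an assumption, and this notebook does not show whether `FeatureBuilder` already performs a similar step internally.

```python
import pandas as pd

# Hypothetical alias map: which labels refer to the same franchise is an assumption.
TEAM_ALIASES = {"BOBCATS": "CHA"}


def normalize_injury_keys(injuries: pd.DataFrame) -> pd.DataFrame:
    """Rename `year` to `season`, unify team labels, and re-aggregate counts."""
    out = injuries.rename(columns={"year": "season"}).copy()
    out["team"] = out["team"].str.strip().str.upper().replace(TEAM_ALIASES)
    # Assumption: rows that collapse onto the same (season, team) key are summed.
    return out.groupby(["season", "team"], as_index=False)["count"].sum()


# Toy rows shaped like the injury sample printed above.
demo = pd.DataFrame({
    "year": [2004, 2004, 2004],
    "team": ["ATL", "BOBCATS", "CHA"],
    "count": [20, 15, 21],
})
normalized = normalize_injury_keys(demo)
print(normalized)
```

Checking the number of rows before and after such a step is a cheap way to see how many labels were actually merged.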


## 2. Create Team Features

First, we create team-level features from the team statistics data. These features capture various aspects of team performance:

### Efficiency Metrics
- **True Shooting Percentage**: Measures shooting efficiency accounting for all shot types
- **Effective Field Goal Percentage**: Adjusts for three-pointers being worth more
- **Offensive Rebound Percentage**: Team's ability to secure offensive rebounds

### Ball Movement and Control
- **Assist-to-Turnover Ratio**: Measures ball control efficiency
- **Assist Ratio**: Team's tendency to create shots through passing

### Defensive Impact
- **Stocks per Game**: Combined steals and blocks
- **Defensive Rating**: Composite defensive effectiveness metric

### Possession and Pace
- **Possessions per Game**: Estimates team's pace
- **Offensive Efficiency**: Points scored per possession
- **Three Point Rate**: Team's reliance on three-point shots
- **Free Throw Rate**: Ability to get to the free throw line
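
Several of the metrics listed above have widely used box-score definitions. The sketch below computes a few of them with pandas. It is not the code inside `FeatureBuilder.create_team_features`, which this notebook does not show. The per-game input column names (`pts`, `fga`, `fgm`, `fg3m`, `fg3a`, `fta`, `orb`, `ast`, `tov`, `stl`, `blk`) and the helper name `sketch_team_features` are assumptions, and the 0.44 free-throw weight is the conventional approximation rather than a value taken from this project.

```python
import pandas as pd


def sketch_team_features(df: pd.DataFrame) -> pd.DataFrame:
    """Compute common efficiency metrics from per-game box-score columns."""
    out = df[["team", "season"]].copy()
    # True shooting: points per scoring attempt, weighting free throws at 0.44.
    out["true_shooting_pct"] = df["pts"] / (2 * (df["fga"] + 0.44 * df["fta"]))
    # Effective FG%: a made three counts as 1.5 made field goals.
    out["efg_pct"] = (df["fgm"] + 0.5 * df["fg3m"]) / df["fga"]
    out["ast_to_ratio"] = df["ast"] / df["tov"]
    out["stocks_per_game"] = df["stl"] + df["blk"]
    # Simplified possession estimate (assumption: no opponent or team-rebound terms).
    out["possessions"] = df["fga"] - df["orb"] + df["tov"] + 0.44 * df["fta"]
    out["off_efficiency"] = df["pts"] / out["possessions"]
    out["three_point_rate"] = df["fg3a"] / df["fga"]
    out["ft_rate"] = df["fta"] / df["fga"]
    return out


# One made-up team-season row, purely for illustration.
demo = pd.DataFrame([{
    "team": "AAA", "season": 2004, "pts": 110.0, "fga": 88.0, "fgm": 40.0,
    "fg3m": 12.0, "fg3a": 34.0, "fta": 25.0, "orb": 10.0, "ast": 25.0,
    "tov": 12.5, "stl": 8.0, "blk": 5.0,
}])
feats = sketch_team_features(demo)
print(feats.round(3).T)
```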

In [None]:
# Initialize feature builder
builder = FeatureBuilder()

# Create team features
print("Creating team performance features...")
team_features = builder.create_team_features(team_stats)

# Display feature statistics
print(f"\nCreated {len(team_features.columns)} team feature columns for {len(team_features)} team-season rows")

# Show sample of engineered features
print("\nSample of engineered team features:")
feature_sample = [
    'true_shooting_pct', 'efg_pct', 'ast_to_ratio',
    'stocks_per_game', 'off_efficiency', 'efficiency_rating'
]
print(team_features[['team', 'season'] + feature_sample].head())

# Display summary statistics
print("\nSummary statistics for key features:")
print(team_features[feature_sample].describe())

## 3. Create Player Features

Next, we create team-level features from player statistics and injury data. These features capture team composition and health:

### Experience and Age
- **Experience Metrics**: Average, maximum, and minimum years of experience
- **Age Demographics**: Average, oldest, and youngest player ages
- **Roster Size**: Number of players on the team

### Health Impact
- **Injury Count**: Total number of injuries for the team
- Provides context for team performance and depth challenges
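
The roster and health features above amount to a group-by over player rows followed by a join with the injury counts. Here is a minimal sketch, assuming the column names shown in the data samples (`season`, `player`, `team`, `age`, `experience`, and `year`, `team`, `count`). The helper name `sketch_player_features` is hypothetical, and treating a team-season with no injury row as zero injuries is an assumption, not something the source confirms.

```python
import pandas as pd


def sketch_player_features(players: pd.DataFrame, injuries: pd.DataFrame) -> pd.DataFrame:
    """Aggregate player rows to one row per team-season and attach injury counts."""
    roster = players.groupby(["team", "season"], as_index=False).agg(
        avg_experience=("experience", "mean"),
        max_experience=("experience", "max"),
        min_experience=("experience", "min"),
        avg_age=("age", "mean"),
        max_age=("age", "max"),
        min_age=("age", "min"),
        roster_size=("player", "nunique"),
    )
    # The injury table uses `year`; align it with `season` before merging.
    inj = injuries.rename(columns={"year": "season"})
    merged = roster.merge(inj, on=["team", "season"], how="left")
    # Assumption: a team-season missing from the injury table had no recorded injuries.
    merged["count"] = merged["count"].fillna(0).astype(int)
    return merged


# Made-up players and teams, purely for illustration.
players_demo = pd.DataFrame({
    "season": [2004, 2004, 2004],
    "player": ["Player A", "Player B", "Player C"],
    "team": ["AAA", "AAA", "BBB"],
    "age": [30, 22, 27],
    "experience": [10, 1, 5],
})
injuries_demo = pd.DataFrame({"year": [2004], "team": ["AAA"], "count": [20]})
roster_feats = sketch_player_features(players_demo, injuries_demo)
print(roster_feats)
```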

In [None]:
# Create player and injury features
print("Creating player composition and injury features...")
player_features = builder.create_player_features(player_stats, injuries)

# Display feature statistics
print(f"\nCreated {len(player_features.columns)} player/injury feature columns for {len(player_features)} team-season rows")

# Show sample of engineered features
print("\nSample of engineered player features:")
feature_sample = [
    'avg_experience', 'roster_size', 'avg_age',
    'max_age', 'min_age', 'count'
]
print(player_features[['team', 'season'] + feature_sample].head())

# Display summary statistics
print("\nSummary statistics for key features:")
print(player_features[feature_sample].describe())

## 4. Combine Features

Finally, we combine all feature sets and prepare the data for modeling:

### Process
1. Merge team and player features
2. Extract target variable (playoff qualification)
3. Handle any missing values
4. Remove non-feature columns
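
The four steps above can be sketched as follows. This is an illustration rather than the body of `FeatureBuilder.combine_features`, which is not shown here. It assumes the team feature table still carries the `playoffs` column, that `team` and `season` are the join keys, and that median imputation is an acceptable way to fill gaps. The helper name `sketch_combine` is hypothetical.

```python
import pandas as pd


def sketch_combine(team_features, player_features,
                   target_col="playoffs", id_cols=("team", "season")):
    """Merge feature tables, split off the target, impute, and drop identifiers."""
    keys = list(id_cols)
    # 1. Merge on the shared keys; an inner join drops unmatched team-seasons.
    merged = team_features.merge(player_features, on=keys, how="inner")
    # 2. Extract the target as 0/1.
    target = merged[target_col].astype(int)
    # 4. Remove identifier and target columns from the feature matrix.
    features = merged.drop(columns=keys + [target_col])
    # 3. Fill missing values with each column's median (assumption).
    features = features.fillna(features.median(numeric_only=True))
    return features, target


# Made-up rows, purely for illustration.
team_demo = pd.DataFrame({
    "team": ["AAA", "BBB", "CCC"],
    "season": [2004, 2004, 2004],
    "playoffs": [True, False, True],
    "efg_pct": [0.50, None, 0.54],
})
player_demo = pd.DataFrame({
    "team": ["AAA", "BBB", "CCC"],
    "season": [2004, 2004, 2004],
    "avg_age": [27.0, 25.5, 29.0],
})
X_demo, y_demo = sketch_combine(team_demo, player_demo)
print(X_demo)
print(y_demo.tolist())
```

Comparing `len(team_features)` with the length of the merged result shows how many rows the inner join discarded.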

### Quality Checks
- Verify feature completeness
- Examine feature distributions
- Confirm target variable balance
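
One way to make these checks concrete is a small report function like the sketch below. The helper name `quality_report` and the choice of checks (missing counts, constant columns, positive rate) are assumptions about what "completeness" and "balance" mean here, not a description of the project's own validation.

```python
import pandas as pd


def quality_report(features: pd.DataFrame, target: pd.Series) -> dict:
    """Summarize missing values, constant columns, and target balance."""
    missing = features.isna().sum()
    # A column with at most one distinct non-null value carries no signal.
    constant = [c for c in features.columns if features[c].nunique(dropna=True) <= 1]
    return {
        "n_rows": len(features),
        "columns_with_missing": missing[missing > 0].to_dict(),
        "constant_columns": constant,
        "positive_rate": float(target.mean()),
    }


# Made-up values, purely for illustration.
features_demo = pd.DataFrame({
    "efg_pct": [0.50, None, 0.54, 0.51],
    "flag": [1, 1, 1, 1],
})
target_demo = pd.Series([1, 0, 1, 0])
report = quality_report(features_demo, target_demo)
print(report)
```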

In [None]:
# Combine all features
print("Combining all feature sets...")
feature_matrix, target = builder.combine_features(
    team_features,
    player_features
)

# Display combined feature statistics
print(f"\nFinal feature matrix shape: {feature_matrix.shape}")
print(f"Target distribution (playoff rate): {target.mean():.2%}")

# Show feature categories
print("\nFeature Categories:")
categories = {
    'Team Performance': [col for col in feature_matrix.columns if any(x in col for x in ['pct', 'ratio', 'rating', 'efficiency'])],
    # 'count' is included because the injury feature is named `count` in player_features
    'Player Impact': [col for col in feature_matrix.columns if any(x in col for x in ['experience', 'age', 'injury', 'count'])]
}

for category, cols in categories.items():
    print(f"\n{category} Features ({len(cols)} features):")
    print("- " + "\n- ".join(cols))

# Display absolute correlation with the target (cast to float so boolean labels are handled)
correlations = pd.DataFrame({
    'feature': feature_matrix.columns,
    'correlation': [abs(feature_matrix[col].corr(target.astype(float))) for col in feature_matrix.columns]
}).sort_values('correlation', ascending=False)

print("\nTop 10 Features by Absolute Correlation with Target:")
print(correlations.head(10))

## 5. Save Features

Save the engineered features and metadata for use in modeling:

### Saved Data
- Complete feature matrix
- Target variable (playoff qualification)
- Feature metadata and statistics
- Data quality metrics

In [None]:
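def save_features(feature_matrix: pd.DataFrame, target: pd.Series) -> None:
    """Save features and metadata with detailed logging.

    Relies on the notebook-level `categories` and `correlations` objects
    defined in the previous cell, so run that cell first.
    """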
    output_dir = Path('../data/processed/features')
    output_dir.mkdir(parents=True, exist_ok=True)
    
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    print(f"Saving features with timestamp: {timestamp}")
    
    # Combine features with target for saving
    final_data = feature_matrix.copy()
    final_data['playoffs'] = target
    
    # Save features
    feature_path = output_dir / f'playoff_features_{timestamp}.csv'
    final_data.to_csv(feature_path, index=False)
    print(f"Saved feature matrix to {feature_path}")
    
    # Calculate category-wise statistics
    category_stats = {}
    for category, cols in categories.items():
        category_stats[category] = {
            'n_features': len(cols),
            'avg_correlation': float(correlations[correlations['feature'].isin(cols)]['correlation'].mean()),
            'top_features': correlations[correlations['feature'].isin(cols)].head(5).to_dict('records')
        }
    
    # Save detailed metadata
    metadata = {
        'timestamp': timestamp,
        'n_samples': len(feature_matrix),
        'n_features': len(feature_matrix.columns),
        'feature_names': list(feature_matrix.columns),
        'playoff_rate': float(target.mean()),
        'category_statistics': category_stats,
        'top_correlations': correlations.head(15).to_dict('records'),
        'description': 'Features for NBA playoff prediction based on team stats and player composition',
        'data_quality': {
            'missing_values': feature_matrix.isnull().sum().to_dict(),
            'feature_stats': feature_matrix.describe().to_dict()
        }
    }
    
    metadata_path = output_dir / f'playoff_features_{timestamp}_metadata.json'
    with open(metadata_path, 'w') as f:
        json.dump(metadata, f, indent=2)
    print(f"Saved metadata to {metadata_path}")
    
    # Print summary statistics
    print("\nFeature Engineering Summary:")
    print(f"- Total samples: {metadata['n_samples']}")
    print(f"- Total features: {metadata['n_features']}")
    print(f"- Playoff rate: {metadata['playoff_rate']:.2%}")
    for category, stats in category_stats.items():
        print(f"- {category}: {stats['n_features']} features, {stats['avg_correlation']:.3f} avg correlation")

# Save the engineered features
save_features(feature_matrix, target)
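
Because each run writes a new timestamped pair of files, a downstream notebook needs a way to pick the most recent one. The sketch below relies on the `%Y%m%d_%H%M%S` timestamps sorting lexicographically and on the file naming used in `save_features`. The helper name `load_latest_features` is hypothetical, and the demo writes to a temporary directory rather than `../data/processed/features`.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_latest_features(feature_dir):
    """Load the newest playoff_features_<timestamp>.csv and its metadata, if any."""
    feature_dir = Path(feature_dir)
    # Metadata files end in .json, so this pattern matches only the feature CSVs.
    csv_paths = sorted(feature_dir.glob("playoff_features_*.csv"))
    if not csv_paths:
        raise FileNotFoundError(f"No feature files found in {feature_dir}")
    latest = csv_paths[-1]
    data = pd.read_csv(latest)
    target = data.pop("playoffs")
    meta_path = latest.with_name(latest.stem + "_metadata.json")
    metadata = json.loads(meta_path.read_text()) if meta_path.exists() else {}
    return data, target, metadata


# Demo with two made-up runs in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    for stamp, value in [("20240101_000000", 0.50), ("20240102_000000", 0.55)]:
        pd.DataFrame({"efg_pct": [value], "playoffs": [True]}).to_csv(
            tmp_dir / f"playoff_features_{stamp}.csv", index=False
        )
    (tmp_dir / "playoff_features_20240102_000000_metadata.json").write_text(
        json.dumps({"timestamp": "20240102_000000"})
    )
    X_latest, y_latest, meta_latest = load_latest_features(tmp_dir)

print(X_latest, y_latest.tolist(), meta_latest)
```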