# NBA Playoff Predictor - Feature Engineering

This notebook builds features for predicting NBA playoff qualification from historical data. The features capture team performance, roster composition, and shooting patterns that may influence whether a team makes the playoffs.

## Data Sources

1. **Team Statistics** (`team_stats.csv`):
   - Season-level team performance metrics
   - Includes scoring, rebounding, assists, and defensive statistics
   - Contains playoff qualification information (target variable)

2. **Player Statistics** (`player_season.csv`):
   - Player-level seasonal data
   - Includes experience, position, and age information
   - Used to analyze team composition and depth

3. **Injury Data** (`injuries.csv`):
   - Historical injury records
   - Tracks player availability throughout seasons
   - Used to assess impact of injuries on team performance

4. **Shot Data** (`shots.csv`):
   - Detailed shot-by-shot information
   - Includes location, type, and outcome of shots
   - Used to analyze team shooting patterns and efficiency

## Feature Categories

1. **Team Performance Metrics**:
   - True Shooting Percentage: Measures shooting efficiency accounting for field goals, three-pointers, and free throws
   - Effective Field Goal Percentage: Adjusts for three-pointers being worth more than two-pointers
   - Three Point Reliance: Measures team's dependence on three-point shooting
   - Free Throw Rate: Indicates ability to get to the free throw line
   - Ball Control Metrics: Assist-to-turnover ratio and possession efficiency
   - Defensive Indicators: Combined steals and blocks, defensive rating

2. **Player Impact Features**:
   - Experience Levels: Average, maximum, and minimum team experience
   - Age Distribution: Team age demographics (average, youngest, oldest)
   - Position Distribution: Percentage of players at each position
   - Roster Analysis: Team size and composition metrics
   - Injury Impact: Frequency and timing of injuries

3. **Shot Analysis Features**:
   - Field Goal Percentages: Overall and by shot type
   - Shot Distance Patterns: Average and variation in shot distances
   - Court Location Analysis: Shot distribution across court zones
   - League-Relative Metrics: Team performance vs. league averages

4. **Combined Performance Indicators**:
   - Overall Efficiency Rating: Composite score of offensive and defensive efficiency
   - Possession-Based Metrics: Points per possession and possession control
   - Team Balance Indicators: Distribution of scoring and playmaking

The engineered features aim to capture both direct performance metrics and underlying team characteristics that may influence playoff qualification.
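
As a rough illustration of the team performance metrics listed above, the sketch below computes them from the per-game columns of `team_stats.csv` using the standard basketball definitions. The formulas inside `FeatureBuilder` are not shown in this notebook, so the helper name `add_efficiency_metrics`, the `three_point_reliance` column name, the 0.44 free-throw weight, and the treatment of missing three-point data as zero are all assumptions and may differ from what `create_team_features` does.

```python
import pandas as pd


def add_efficiency_metrics(team_stats: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of team_stats with standard efficiency metrics added."""
    df = team_stats.copy()

    # Assumption: seasons without three-point data are treated as zero makes/attempts
    threes_made = df["x3p_per_game"].fillna(0)
    threes_att = df["x3pa_per_game"].fillna(0)

    # True shooting: points per scoring attempt, 0.44 is the conventional free-throw weight
    df["true_shooting_pct"] = df["pts_per_game"] / (
        2 * (df["fga_per_game"] + 0.44 * df["fta_per_game"])
    )
    # Effective FG%: a made three counts as 1.5 made field goals
    df["efg_pct"] = (df["fg_per_game"] + 0.5 * threes_made) / df["fga_per_game"]
    # Share of field goal attempts taken from three-point range
    df["three_point_reliance"] = threes_att / df["fga_per_game"]
    # Free throw attempts per field goal attempt
    df["ft_rate"] = df["fta_per_game"] / df["fga_per_game"]
    # Ball control and combined steals + blocks
    df["ast_to_ratio"] = df["ast_per_game"] / df["tov_per_game"]
    df["stocks_per_game"] = df["stl_per_game"] + df["blk_per_game"]
    return df


# Made-up single team-season, for illustration only
example = pd.DataFrame({
    "pts_per_game": [110.0], "fg_per_game": [40.0], "fga_per_game": [88.0],
    "x3p_per_game": [12.0], "x3pa_per_game": [33.0], "fta_per_game": [25.0],
    "ast_per_game": [25.0], "tov_per_game": [12.5],
    "stl_per_game": [8.0], "blk_per_game": [5.0],
})
result = add_efficiency_metrics(example)
print(result[["true_shooting_pct", "efg_pct", "three_point_reliance", "ft_rate"]].round(3))
```

Working from per-game averages rather than season totals gives the same ratios, since the games-played factor cancels in each formula.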

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

# System utilities
import sys
from pathlib import Path
import json
from datetime import datetime

# Add the src directory to the path
sys.path.append('..')
from src.features.feature_builder import FeatureBuilder

# Configure pandas display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

## 1. Load Data

We begin by loading our preprocessed data sources. Each dataset contributes unique information for feature engineering:

- Team statistics provide the foundation with season-level performance metrics
- Player statistics offer insights into team composition and experience
- Injury data helps assess the impact of player availability
- Shot data enables detailed analysis of team shooting patterns

In [2]:
# Load all data sources
data_dir = '../data/processed/historical'

# Team statistics
team_stats = pd.read_csv(f'{data_dir}/team_stats.csv')
print("\nTeam Stats Shape:", team_stats.shape)
print("Columns:", team_stats.columns.tolist())

# Player statistics
player_stats = pd.read_csv(f'{data_dir}/player_season.csv')
print("\nPlayer Stats Shape:", player_stats.shape)
print("Columns:", player_stats.columns.tolist())

# Injury data
injuries = pd.read_csv(f'{data_dir}/injuries.csv')
print("\nInjuries Shape:", injuries.shape)
print("Columns:", injuries.columns.tolist())

# Shot data
shots = pd.read_csv(f'{data_dir}/shots.csv')
print("\nShots Shape:", shots.shape)
print("Columns:", shots.columns.tolist())


Team Stats Shape: (1876, 28)
Columns: ['season', 'lg', 'team', 'abbreviation', 'playoffs', 'g', 'mp_per_game', 'fg_per_game', 'fga_per_game', 'fg_percent', 'x3p_per_game', 'x3pa_per_game', 'x3p_percent', 'x2p_per_game', 'x2pa_per_game', 'x2p_percent', 'ft_per_game', 'fta_per_game', 'ft_percent', 'orb_per_game', 'drb_per_game', 'trb_per_game', 'ast_per_game', 'stl_per_game', 'blk_per_game', 'tov_per_game', 'pf_per_game', 'pts_per_game']

Player Stats Shape: (32358, 10)
Columns: ['season', 'seas_id', 'player_id', 'player', 'birth_year', 'pos', 'age', 'lg', 'tm', 'experience']

Injuries Shape: (37667, 6)
Columns: ['Unnamed: 0', 'Date', 'Team', 'Acquired', 'Relinquished', 'Notes']

Shots Shape: (4231262, 26)
Columns: ['SEASON_1', 'SEASON_2', 'TEAM_ID', 'TEAM_NAME', 'PLAYER_ID', 'PLAYER_NAME', 'POSITION_GROUP', 'POSITION', 'GAME_DATE', 'GAME_ID', 'HOME_TEAM', 'AWAY_TEAM', 'EVENT_TYPE', 'SHOT_MADE', 'ACTION_TYPE', 'SHOT_TYPE', 'BASIC_ZONE', 'ZONE_NAME', 'ZONE_ABB', 'ZONE_RANGE', 'LOC_X', 'LOC_Y', 'SHOT_DISTANCE', 'QUARTER', 'MINS_LEFT', 'SECS_LEFT']
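
The injury records above are keyed by calendar `Date` and `Team`, while the team and player tables are keyed by `season`, so injuries need a season label before they can be counted per team-season. The sketch below shows one way to do that. It assumes that `season` denotes the calendar year in which a season ends, that seasons begin in October, and that a non-empty `Relinquished` entry marks a player becoming unavailable. All three are assumptions about this dataset, and `add_season` is a hypothetical helper rather than part of `FeatureBuilder`.

```python
import pandas as pd


def add_season(injuries: pd.DataFrame, start_month: int = 10) -> pd.DataFrame:
    """Attach a season label (assumed to be the ending year) to each injury record."""
    # Drop the leftover index column written by an earlier to_csv call, if present
    df = injuries.drop(columns=["Unnamed: 0"], errors="ignore").copy()
    df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
    # Dates from start_month onward count toward the season ending the next year
    df["season"] = df["Date"].dt.year + (df["Date"].dt.month >= start_month).astype(int)
    return df


# Made-up records, for illustration only
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Date": ["2015-11-03", "2016-02-10", "2016-10-30"],
    "Team": ["Bulls", "Bulls", "Bulls"],
    "Acquired": [None, "Player B", None],
    "Relinquished": ["Player A", None, "Player C"],
    "Notes": ["sprained ankle", "activated", "sore knee"],
})
labeled = add_season(sample)

# Count only rows where a player became unavailable
injury_counts = (
    labeled[labeled["Relinquished"].notna()]
    .groupby(["season", "Team"])
    .size()
    .rename("injury_count")
    .reset_index()
)
print(injury_counts)
```

Note that `Team` here holds a team name, so joining these counts to `team_stats` would also require a mapping to whichever team key the other tables use.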


## 2. Create Features

We use our `FeatureBuilder` class to create features from all data sources. The feature creation process follows these steps:

1. **Team Features**:
   - Calculate advanced efficiency metrics
   - Generate possession-based statistics
   - Create composite performance indicators

2. **Player Features**:
   - Analyze team experience and age profiles
   - Calculate position distributions
   - Assess injury impact

3. **Shot Features**:
   - Evaluate shooting efficiency
   - Analyze shot location patterns
   - Compare to league averages

4. **Feature Combination**:
   - Merge all feature sets
   - Handle missing values
   - Ensure data quality
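
The player and shot steps above come down to grouped aggregations. The sketch below shows one plausible shape for them, using the column names printed when the data was loaded. It is not the `FeatureBuilder` implementation: the helper names `summarize_roster` and `summarize_shots`, the `roster_size`, `avg_shot_distance`, and `shot_distance_std` column names, the use of `SEASON_1` as the season key, and the per-season league average are all assumptions.

```python
import pandas as pd


def summarize_roster(player_stats: pd.DataFrame) -> pd.DataFrame:
    """Aggregate player rows into one row per (season, team)."""
    summary = player_stats.groupby(["season", "tm"]).agg(
        avg_experience=("experience", "mean"),
        max_experience=("experience", "max"),
        min_experience=("experience", "min"),
        avg_age=("age", "mean"),
        roster_size=("player_id", "nunique"),
    )
    # Share of the roster at each listed position, one column per position label
    pos_share = pd.crosstab(
        [player_stats["season"], player_stats["tm"]],
        player_stats["pos"],
        normalize="index",
    ).add_prefix("pos_pct_")
    return summary.join(pos_share).reset_index()


def summarize_shots(shots: pd.DataFrame) -> pd.DataFrame:
    """Aggregate shot rows into one row per (season, team)."""
    made = shots.assign(SHOT_MADE=shots["SHOT_MADE"].astype(float))
    summary = (
        made.groupby(["SEASON_1", "TEAM_NAME"])
        .agg(
            fg_pct=("SHOT_MADE", "mean"),
            avg_shot_distance=("SHOT_DISTANCE", "mean"),
            shot_distance_std=("SHOT_DISTANCE", "std"),
        )
        .reset_index()
    )
    # League-relative metric: team FG% minus the unweighted mean of team FG% that season
    league_avg = summary.groupby("SEASON_1")["fg_pct"].transform("mean")
    summary["fg_pct_vs_avg"] = summary["fg_pct"] - league_avg
    return summary


# Made-up rows, for illustration only
players = pd.DataFrame({
    "season": [2020, 2020, 2020],
    "tm": ["AAA", "AAA", "BBB"],
    "player_id": [1, 2, 3],
    "pos": ["PG", "C", "PG"],
    "age": [22, 30, 25],
    "experience": [1, 5, 3],
})
shot_rows = pd.DataFrame({
    "SEASON_1": [2020, 2020, 2020, 2020],
    "TEAM_NAME": ["AAA", "AAA", "BBB", "BBB"],
    "SHOT_MADE": [True, False, True, True],
    "SHOT_DISTANCE": [2, 24, 10, 12],
})
roster = summarize_roster(players)
shot_summary = summarize_shots(shot_rows)
print(roster)
print(shot_summary)
```

The three sources use different team keys (`abbreviation`, `tm`, `TEAM_NAME`), so the combination step has to reconcile them before merging.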

In [3]:
# Initialize feature builder
builder = FeatureBuilder()

# Create features from each data source
print("Creating team features...")
team_features = builder.create_team_features(team_stats)
print("Team features shape:", team_features.shape)

print("\nCreating player features...")
player_features = builder.create_player_features(player_stats, injuries)
print("Player features shape:", player_features.shape)

print("\nCreating shot features...")
shot_features = builder.create_shot_features(shots)
print("Shot features shape:", shot_features.shape)

# Combine all features
print("\nCombining features...")
feature_matrix, target = builder.combine_features(
    team_features,
    player_features,
    shot_features
)

print("\nFinal Feature Matrix Shape:", feature_matrix.shape)
print(f"Playoff Rate: {target.mean():.2%}")

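# Group columns by keyword. Matching is by substring, so categories can overlap:
# 'pct' also matches 'pos_pct_*', which places position columns in two categories.
print("\nFeatures Created:")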
for category, cols in {
    'Team Performance': [col for col in feature_matrix.columns if any(x in col for x in ['pct', 'ratio', 'rating', 'efficiency'])],
    'Player Impact': [col for col in feature_matrix.columns if any(x in col for x in ['experience', 'age', 'pos_pct', 'injury'])],
    'Shot Analysis': [col for col in feature_matrix.columns if any(x in col for x in ['shot', 'loc', 'distance'])]
}.items():
    print(f"\n{category} Features:")
    print("\n".join(f"- {col}" for col in cols))

Creating team features...
Team features shape: (1876, 40)

Creating player features...
Player features shape: (1866, 35)

Creating shot features...
Shot features shape: (629, 13)

Combining features...

Final Feature Matrix Shape: (1876, 79)
Playoff Rate: 54.10%

Features Created:

Team Performance Features:
- true_shooting_pct
- efg_pct
- oreb_pct
- ast_to_ratio
- ast_ratio
- def_rating
- off_efficiency
- efficiency_rating
- pos_pct_C
- pos_pct_C-F
- pos_pct_C-PF
- pos_pct_C-SF
- pos_pct_F
- pos_pct_F-C
- pos_pct_F-G
- pos_pct_G
- pos_pct_G-F
- pos_pct_PF
- pos_pct_PF-C
- pos_pct_PF-SF
- pos_pct_PG
- pos_pct_PG-SF
- pos_pct_PG-SG
- pos_pct_SF
- pos_pct_SF-C
- pos_pct_SF-PF
- pos_pct_SF-PG
- pos_pct_SF-SG
- pos_pct_SG
- pos_pct_SG-PF
- pos_pct_SG-PG
- pos_pct_SG-PG-SF
- pos_pct_SG-SF
- fg_pct
- fg_pct_vs_avg

Player Impact Features:
- avg_experience
- max_experience
- min_experience
- avg_age
- max_age
- min_age
- pos_pct_C
- pos_pct_C-F
- pos_pct_C-PF
- pos_pct_C-SF
- pos_pct_F
- pos_pct_F-C
- pos_pct_F-G
- pos_pct_G
- pos_pct_G-F
- pos_pct_PF
- pos_pct_PF-C
- pos_pct_PF-SF
- pos_pct_PG
- pos_pct_PG-SF
- pos_pct_PG-SG
- pos_pct_S
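
In the output above, the `pos_pct_*` columns are listed under both Team Performance and Player Impact because each keyword filter matches substrings independently (`'pct'` is contained in `'pos_pct'`). If mutually exclusive categories are preferred, one option is to test the more specific keywords first and stop at the first match. The sketch below uses the same keywords as the cell above. `CATEGORY_RULES`, `assign_category`, the rule ordering, and the `"Other"` fallback are hypothetical and not part of `FeatureBuilder`.

```python
from typing import Dict, List

# Ordered from most to least specific, the first matching rule wins
CATEGORY_RULES = [
    ("Player Impact", ["experience", "age", "pos_pct", "injury"]),
    ("Shot Analysis", ["shot", "loc", "distance"]),
    ("Team Performance", ["pct", "ratio", "rating", "efficiency"]),
]


def assign_category(column: str) -> str:
    """Return the first category whose keywords match the column name."""
    for category, keywords in CATEGORY_RULES:
        if any(keyword in column for keyword in keywords):
            return category
    return "Other"


# A few column names taken from the outputs in this notebook
columns = ["true_shooting_pct", "pos_pct_C", "avg_age", "fg_pct_vs_avg", "ft_per_game"]
by_category: Dict[str, List[str]] = {}
for col in columns:
    by_category.setdefault(assign_category(col), []).append(col)

for category, cols in by_category.items():
    print(f"{category}: {cols}")
```

Substring rules remain fragile (for example, `'age'` would match any column name containing "percentage"), so an explicit column-to-category mapping may be more robust if the feature names are stable.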

## 3. Analyze Feature Importance

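We measure how strongly each feature is associated with playoff qualification using the absolute Pearson correlation between the feature and the binary target. This is a univariate, linear screen rather than a model-based importance score, but it helps us understand:

- Which individual features are most associated with making the playoffs
- How the feature categories compare on average
- Which features may be candidates for selection or removal before modeling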

In [4]:
# Calculate correlations with playoff qualification
correlations = pd.DataFrame({
    'feature': feature_matrix.columns,
    'correlation': [abs(feature_matrix[col].corr(target)) for col in feature_matrix.columns]
})

# Sort by absolute correlation
correlations = correlations.sort_values('correlation', ascending=False)

print("Top 15 Features by Correlation with Playoff Qualification:")
print(correlations.head(15).to_string())

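# Average correlation by category. Keywords match by substring, so categories
# can overlap (e.g. 'pos_pct_*' columns match both 'pct' and 'pos_pct').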
categories = {
    'Team Performance': [col for col in feature_matrix.columns if any(x in col for x in ['pct', 'ratio', 'rating', 'efficiency'])],
    'Player Impact': [col for col in feature_matrix.columns if any(x in col for x in ['experience', 'age', 'pos_pct', 'injury'])],
    'Shot Analysis': [col for col in feature_matrix.columns if any(x in col for x in ['shot', 'loc', 'distance'])]
}

print("\nAverage Correlation by Feature Category:")
for category, cols in categories.items():
    avg_corr = correlations[correlations['feature'].isin(cols)]['correlation'].mean()
    print(f"{category}: {avg_corr:.3f}")

Top 15 Features by Correlation with Playoff Qualification:
              feature  correlation
34  efficiency_rating     0.255863
29         def_rating     0.247222
28    stocks_per_game     0.196960
11        ft_per_game     0.195417
68             fg_pct     0.190601
76      fg_pct_vs_avg     0.190601
33            ft_rate     0.188244
12       fta_per_game     0.168563
19       blk_per_game     0.162190
18       stl_per_game     0.141669
15       drb_per_game     0.137110
26       ast_to_ratio     0.131265
16       trb_per_game     0.120664
17       ast_per_game     0.116156
8        x2p_per_game     0.114536

Average Correlation by Feature Category:
Team Performance: 0.135
Player Impact: nan
Shot Analysis: 0.027
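
The `nan` for Player Impact means that no feature in that category produced a defined correlation: pandas' `mean` skips missing values by default, so a single undefined value would not be enough to yield `nan`. Pearson correlation is undefined when a column is constant or entirely missing, which is worth checking before modeling. A minimal diagnostic sketch follows, with `find_degenerate_columns` as a hypothetical helper name and made-up data standing in for the real feature matrix:

```python
import numpy as np
import pandas as pd


def find_degenerate_columns(features: pd.DataFrame) -> pd.DataFrame:
    """Flag columns whose correlation with any target would be undefined."""
    n_unique = features.nunique(dropna=True)
    return pd.DataFrame({
        "n_missing": features.isna().sum(),
        "n_unique": n_unique,
        # Zero or one distinct non-missing value means no variance to correlate
        "degenerate": n_unique <= 1,
    })


# Made-up data, for illustration only
demo = pd.DataFrame({
    "efficiency_rating": [1.0, 2.0, 3.0, 4.0],
    "avg_experience": [0.0, 0.0, 0.0, 0.0],       # constant column
    "avg_age": [np.nan, np.nan, np.nan, np.nan],  # entirely missing column
})
target_demo = pd.Series([0, 0, 1, 1])

report = find_degenerate_columns(demo)
corr = pd.Series({col: demo[col].corr(target_demo) for col in demo.columns})
print(report)
print(corr)
```

Running the same check on `feature_matrix` would show whether the player columns are constant or empty, for instance because the player table failed to match on its team key during the merge.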


## 4. Save Features

We save our engineered features and associated metadata for use in modeling. The saved data includes:

- Complete feature matrix with all engineered features
- Target variable (playoff qualification)
- Feature metadata (names, descriptions, correlations)
- Data quality metrics

In [5]:
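def save_features(feature_matrix: pd.DataFrame, target: pd.Series) -> None:
    """Save the feature matrix, target, and metadata under a shared timestamp.

    Relies on the notebook-level `categories` and `correlations` objects
    defined in the correlation analysis cell.
    """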
    output_dir = Path('../data/processed/features')
    output_dir.mkdir(parents=True, exist_ok=True)
    
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    
    # Combine features with target for saving
    final_data = feature_matrix.copy()
    final_data['playoffs'] = target
    
    # Save features
    final_data.to_csv(output_dir / f'playoff_features_{timestamp}.csv', index=False)
    
    # Calculate category-wise statistics
    category_stats = {}
    for category, cols in categories.items():
        category_stats[category] = {
            'n_features': len(cols),
            'avg_correlation': float(correlations[correlations['feature'].isin(cols)]['correlation'].mean()),
            'top_features': correlations[correlations['feature'].isin(cols)].head(5).to_dict('records')
        }
    
    # Save metadata
    metadata = {
        'timestamp': timestamp,
        'n_samples': len(feature_matrix),
        'n_features': len(feature_matrix.columns),
        'feature_names': list(feature_matrix.columns),
        'playoff_rate': float(target.mean()),
        'category_statistics': category_stats,
        'top_correlations': correlations.head(15).to_dict('records'),
        'description': 'Comprehensive features for NBA playoff prediction',
        'data_quality': {
            'missing_values': feature_matrix.isnull().sum().to_dict(),
            'feature_stats': feature_matrix.describe().to_dict()
        }
    }
    
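    # json.dump writes undefined values (e.g. a nan average correlation) as the
    # non-standard token NaN, which strict JSON parsers may reject
    with open(output_dir / f'playoff_features_{timestamp}_metadata.json', 'w') as f:
        json.dump(metadata, f, indent=2)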
    
    print(f"Saved features and metadata with timestamp: {timestamp}")

# Save the engineered features
save_features(feature_matrix, target)

Saved features and metadata with timestamp: 20241209_185132
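
Because each run writes a new timestamped file, downstream notebooks need a way to pick one. A minimal sketch, assuming the `playoff_features_<timestamp>.csv` naming and the `playoffs` target column used by `save_features` above. `latest_feature_file` and `load_latest` are hypothetical helpers.

```python
from pathlib import Path
from typing import Optional, Tuple

import pandas as pd


def latest_feature_file(output_dir: Path) -> Optional[Path]:
    """Return the most recent feature CSV in output_dir, or None if there is none."""
    # The %Y%m%d_%H%M%S timestamp sorts chronologically as plain text,
    # and the *.csv pattern excludes the *_metadata.json files
    candidates = sorted(output_dir.glob("playoff_features_*.csv"))
    return candidates[-1] if candidates else None


def load_latest(output_dir: Path) -> Tuple[pd.DataFrame, pd.Series]:
    """Load the most recent feature file and split off the target column."""
    path = latest_feature_file(output_dir)
    if path is None:
        raise FileNotFoundError(f"No feature files found in {output_dir}")
    data = pd.read_csv(path)
    target = data.pop("playoffs")
    return data, target
```

For example, `features, target = load_latest(Path('../data/processed/features'))` would return the matrix saved by the most recent run.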


## 5. Feature Analysis Summary

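The feature engineering process produced the following dataset for NBA playoff prediction:

### Data Coverage
- **Sample Size**: 1,876 team-seasons
- **Feature Count**: 79 columns in the feature matrix, including raw per-game statistics alongside the engineered metrics
- **Source Coverage**: Player features were built for 1,866 rows and shot features for 629; the remaining rows depend on missing-value handling
- **Time Range**: Multiple NBA seasons
- **Playoff Rate**: ~54% of team-seasons end in a playoff appearance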

### Feature Categories

1. **Team Performance Metrics**:
   - Efficiency ratings capture overall team effectiveness
   - Ball control metrics indicate possession management
   - Defensive indicators measure stopping ability
   - Key metrics: true shooting %, effective FG%, possession efficiency

2. **Player Impact Features**:
   - Experience levels show team maturity
   - Position distribution reveals team structure
   - Age demographics indicate development stage
   - Injury tracking assesses health impact

3. **Shot Analysis Features**:
   - Shooting percentages measure efficiency
   - Shot location patterns reveal strategy
   - Distance analysis shows shot selection
   - League-relative metrics provide context

4. **Combined Indicators**:
   - Integrated performance metrics
   - Multi-faceted team evaluation
   - Balanced scoring and efficiency metrics

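### Key Insights

1. **Feature Importance**:
   - Efficiency and defensive ratings show the strongest correlation with playoff qualification (`efficiency_rating` 0.256, `def_rating` 0.247), though no single feature exceeds 0.26
   - Player Impact features returned an undefined (`nan`) average correlation, so their contribution cannot be assessed until that is investigated
   - Features in the Shot Analysis category average only 0.027, and `fg_pct` and `fg_pct_vs_avg` have identical correlations (0.191), which suggests they carry the same information

2. **Data Quality**:
   - All 1,876 team-seasons appear in the final matrix, but shot features were built for only 629 rows and player features for 1,866
   - Missing values are handled when the feature sets are combined
   - Features mix per-game counts, percentages, and ratings, so scaling may be needed before modeling

3. **Predictive Power**:
   - Individual features are weak-to-moderate linear predictors (absolute correlations up to about 0.26)
   - Efficiency, defensive, and free-throw features all appear among the top 15
   - Combining features in a multivariate model is the natural next step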

This feature set provides a starting point for building a playoff prediction model, combining direct performance metrics with team characteristics that may influence playoff qualification.