# DeepShot: Feature Engineering

## Introduction

Feature engineering is a critical step in our NBA shot prediction project. While our raw data contains valuable information, transforming and combining this data into meaningful features can significantly improve our model's predictive power.

In this notebook, we create several categories of features:

1. **Spatial Features**: Derived from court coordinates, these features capture the geometric aspects of shooting, including distance from basket, angle, and court zones. Spatial features are expected to be among the most important predictors of shot success.

2. **Game Context Features**: These features capture the situational aspects of each shot, including time remaining, quarter, score margin, and "clutch" situations. Game context provides important information about the pressure and strategic considerations for each shot.

3. **Historical Shot Features**: These features incorporate each player's prior-season shooting percentage from the same court zone, providing a baseline expectation for shot success based on historical patterns.

4. **Player Performance Features**: These features capture player-specific metrics like true shooting percentage, usage rate, and career stage, helping our model understand individual player tendencies.

5. **Team Features**: These features describe team characteristics like win percentage, offensive/defensive ratings, and playing style, providing context about the team environment for each shot.

By engineering these features, we aim to provide our models with rich, meaningful information that captures the multidimensional nature of basketball shooting. We expect spatial features and player-specific features to be particularly important, based on basketball domain knowledge and our exploratory data analysis.

In [2]:
import pandas as pd
import numpy as np
from pathlib import Path

# Setup directories
processed_dir = Path('../data/processed')
features_dir = processed_dir / 'features'
for directory in [processed_dir, features_dir]:
    directory.mkdir(parents=True, exist_ok=True)

In [3]:
# Load data
shots = pd.read_csv(processed_dir / 'standardized_shots.csv')
player = pd.read_csv(processed_dir / 'standardized_player.csv')
team = pd.read_csv(processed_dir / 'standardized_team.csv')

# Handle column name variations
column_mappings = {
    'PLAYER_NAME': 'player_name',
    'TEAM_NAME': 'team_name',
    'MINUTES_LEFT': 'mins_left',
    'SECONDS_LEFT': 'secs_left',
    'PERIOD': 'quarter',
    'QUARTER': 'quarter',
    'MARGIN': 'score_margin',
    'SHOT_MADE_FLAG': 'shot_made',
    'SHOT_MADE': 'shot_made'
}

# Apply mappings if needed
for old_col, new_col in column_mappings.items():
    if old_col in shots.columns and new_col not in shots.columns:
        shots.rename(columns={old_col: new_col}, inplace=True)
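
The renaming loop above can also be packaged as a reusable function. The following is a minimal sketch, assuming a hypothetical helper name `apply_column_mappings` and toy column names that are not part of the project data. Like the loop, it skips a rename whenever the target column already exists.

```python
import pandas as pd

def apply_column_mappings(df, mappings):
    """Rename columns per `mappings`, skipping targets that already exist (hypothetical helper)."""
    out = df.copy()
    for old_col, new_col in mappings.items():
        if old_col in out.columns and new_col not in out.columns:
            out = out.rename(columns={old_col: new_col})
    return out

# Toy frame in which two source columns map to the same target name
toy = pd.DataFrame({'PERIOD': [1], 'QUARTER': [1], 'SHOT_MADE_FLAG': [0]})
renamed = apply_column_mappings(
    toy, {'PERIOD': 'quarter', 'QUARTER': 'quarter', 'SHOT_MADE_FLAG': 'shot_made'}
)
print(list(renamed.columns))
```

When both `PERIOD` and `QUARTER` are present, only the first is renamed and the second keeps its original name, so it may be worth dropping the leftover column explicitly.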

## Spatial Feature Engineering

Spatial features capture the geometric aspects of shooting. The location on the court is one of the most important factors in predicting shot success, as shooting percentage generally decreases with distance from the basket, with some exceptions based on angle and specific zones.

In [4]:
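# 1. Spatial Features
# Euclidean distance of the shot location from the coordinate origin
shots['shot_distance'] = np.sqrt(shots['loc_x']**2 + shots['loc_y']**2)
# Angle in degrees measured from the +loc_y axis: 0 points straight along loc_y, +/-90 lies along the loc_x axis
shots['shot_angle'] = np.arctan2(shots['loc_x'], shots['loc_y']) * 180 / np.pi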

# Court zones
conditions = [
    (shots['shot_distance'] < 4),
    (shots['shot_distance'] < 8) & (shots['shot_distance'] >= 4),
    (shots['shot_distance'] < 16) & (shots['shot_distance'] >= 8),
    (shots['shot_distance'] < 23.75) & (shots['shot_distance'] >= 16),
    (shots['shot_distance'] >= 23.75)
]
zones = ['Restricted Area', 'Paint', 'Mid-Range', 'Long Mid-Range', 'Three-Point']
shots['court_zone'] = np.select(conditions, zones, default='Unknown')
# Flag three-point-zone shots taken more than 45 degrees off the loc_y axis
shots['corner_three'] = ((shots['court_zone'] == 'Three-Point') & (abs(shots['shot_angle']) > 45)).astype(int)
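
The zone logic can be checked on a handful of hand-picked coordinates. The sketch below assumes coordinates are in feet with the basket at the origin, which the notebook does not state. It uses `pd.cut` with left-closed bins as an alternative to `np.select`; the toy points are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy shots; coordinates are ASSUMED to be in feet with the basket at (0, 0)
toy = pd.DataFrame({'loc_x': [0.0, 3.0, 6.0, 22.5, 0.0],
                    'loc_y': [2.0, 5.0, 8.0, 1.0, 25.0]})
toy['shot_distance'] = np.hypot(toy['loc_x'], toy['loc_y'])
toy['shot_angle'] = np.degrees(np.arctan2(toy['loc_x'], toy['loc_y']))

zones = ['Restricted Area', 'Paint', 'Mid-Range', 'Long Mid-Range', 'Three-Point']
# Left-closed bins reproduce the ">= lower and < upper" comparisons of the zone conditions
toy['court_zone'] = pd.cut(toy['shot_distance'],
                           bins=[0, 4, 8, 16, 23.75, np.inf],
                           labels=zones, right=False).astype(str)
toy['corner_three'] = ((toy['court_zone'] == 'Three-Point')
                       & (toy['shot_angle'].abs() > 45)).astype(int)
print(toy)
```

Under that assumption, the point at (22.5, 1.0) lies beyond the NBA's 22-foot corner three-point line but is labeled Long Mid-Range, because its distance is below 23.75. Its `corner_three` flag therefore stays 0, which suggests the thresholds may need a corner-specific rule.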

## Game Context Feature Engineering

Game context features capture the situational aspects of each shot. Basketball is a dynamic game where time remaining, score differential, and other contextual factors can significantly impact shot selection and success probability.

In [5]:
# 2. Game Context Features
# Handle missing columns
for col in ['mins_left', 'secs_left', 'quarter']:
    if col not in shots.columns and col.upper() in shots.columns:
        shots[col] = shots[col.upper()]

# Calculate time features if possible
if all(col in shots.columns for col in ['mins_left', 'secs_left', 'quarter']):
    # Seconds left in the current period
    shots['time_remaining_seconds'] = shots['mins_left'] * 60 + shots['secs_left']
    # Any period after the fourth is overtime
    shots['period_type'] = np.where(shots['quarter'] <= 4, 'Regulation', 'Overtime')
    # Final two minutes of the fourth quarter or overtime (earlier periods are never flagged)
    shots['end_of_period'] = ((shots['time_remaining_seconds'] < 120) &
                             ((shots['quarter'] == 4) | (shots['period_type'] == 'Overtime'))).astype(int)

# Calculate score situation if possible
if 'score_margin' in shots.columns:
    conditions = [
        (shots['score_margin'] < -15),
        (shots['score_margin'] < -5) & (shots['score_margin'] >= -15),
        (shots['score_margin'] < 0) & (shots['score_margin'] >= -5),
        (shots['score_margin'] == 0),
        (shots['score_margin'] > 0) & (shots['score_margin'] <= 5),
        (shots['score_margin'] > 5) & (shots['score_margin'] <= 15),
        (shots['score_margin'] > 15)
    ]
    values = ['Large Deficit', 'Moderate Deficit', 'Small Deficit', 'Tied', 
             'Small Lead', 'Moderate Lead', 'Large Lead']
    shots['score_situation'] = np.select(conditions, values, default='Unknown')
    
    if 'time_remaining_seconds' in shots.columns:
        # Clutch: margin within 5 points during the last 5 minutes of the fourth quarter or overtime
        shots['clutch_situation'] = ((abs(shots['score_margin']) <= 5) &
                                    (shots['time_remaining_seconds'] < 300) &
                                    ((shots['quarter'] == 4) | (shots['period_type'] == 'Overtime'))).astype(int)
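
The clutch definition is easiest to verify on a few hand-built rows. The sketch below uses toy values, not project data. It relies on `quarter >= 4` being equivalent to "fourth quarter or overtime", since overtime is defined as any quarter above 4.

```python
import pandas as pd

toy = pd.DataFrame({
    'quarter': [4, 4, 2, 5],
    'mins_left': [3, 3, 1, 4],
    'secs_left': [30, 30, 0, 59],
    'score_margin': [-4, 12, 0, 5],
})
toy['time_remaining_seconds'] = toy['mins_left'] * 60 + toy['secs_left']

late_period = toy['quarter'] >= 4  # fourth quarter or any overtime period
toy['clutch_situation'] = ((toy['score_margin'].abs() <= 5)
                           & (toy['time_remaining_seconds'] < 300)
                           & late_period).astype(int)
print(toy[['quarter', 'time_remaining_seconds', 'score_margin', 'clutch_situation']])
```

Only the first and last rows qualify. The second row fails on margin, and the third is a tied game in the second quarter.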

## Historical Shot Feature Engineering

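Historical shot features incorporate a player's past shooting performance. A player's previous success from a particular zone is often predictive of future success, which gives our models a useful baseline. We key the feature on the prior season rather than the current one so that a shot's own outcome never contributes to its feature value.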

In [6]:
# 3. Historical Shot Features
# Handle missing columns
for col in ['player_name', 'court_zone', 'season', 'shot_made']:
    if col not in shots.columns and col.upper() in shots.columns:
        shots[col] = shots[col.upper()]

# Calculate shooting percentages
player_zone_season = shots.groupby(['player_name', 'court_zone', 'season']).agg(
    shots=('shot_made', 'count'),
    makes=('shot_made', 'sum')
).reset_index()

player_zone_season['shooting_pct'] = player_zone_season['makes'] / player_zone_season['shots']
player_zone_season['shooting_pct'] = player_zone_season['shooting_pct'].fillna(0.5)
# Label each row with the season it serves as a prior for (stats from season S feed season S + 1)
player_zone_season['prior_season'] = player_zone_season['season'] + 1

# Rename before merging so the prior-season percentage lands in its own column
prior_stats = player_zone_season[['player_name', 'court_zone', 'prior_season', 'shooting_pct']].rename(
    columns={'shooting_pct': 'prior_pct'}
)

# Merge to get prior season stats
shots_with_prior = shots.merge(
    prior_stats,
    left_on=['player_name', 'court_zone', 'season'],
    right_on=['player_name', 'court_zone', 'prior_season'],
    how='left'
)

# Fall back to 0.5 when a player has no prior-season attempts from the zone
shots_with_prior['prior_pct'] = shots_with_prior['prior_pct'].fillna(0.5)

shots = shots_with_prior
if 'prior_season' in shots.columns:
    shots.drop('prior_season', axis=1, inplace=True)
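
The prior-season lookup can be illustrated on a toy frame. This is a minimal sketch, assuming integer season labels (for example 2020, 2021) and a single hypothetical player. It shifts the season key by one and merges on identical column names, a variation on the `left_on`/`right_on` merge in the cell above.

```python
import pandas as pd

# One hypothetical player: four paint shots in 2020 (three made), one in 2021
toy = pd.DataFrame({
    'player_name': ['A'] * 5,
    'court_zone': ['Paint'] * 5,
    'season': [2020, 2020, 2020, 2020, 2021],
    'shot_made': [1, 1, 1, 0, 1],
})

zone_pct = (
    toy.groupby(['player_name', 'court_zone', 'season'])['shot_made']
    .mean()
    .rename('prior_pct')
    .reset_index()
)
# Stats computed from season S become the prior for season S + 1
zone_pct['season'] = zone_pct['season'] + 1

toy = toy.merge(zone_pct, on=['player_name', 'court_zone', 'season'], how='left')
toy['prior_pct'] = toy['prior_pct'].fillna(0.5)  # no earlier season available
print(toy[['season', 'shot_made', 'prior_pct']])
```

The 2020 shots have no earlier season and receive the 0.5 default, while the 2021 shot receives 0.75, the player's 2020 percentage from the same zone.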

## Player Performance Feature Engineering

Player performance features capture individual player characteristics. Different players have different shooting abilities, tendencies, and roles, which significantly impact shot success probability beyond what court location alone would predict.

In [7]:
# 4. Player Performance Features
# Map column names
player_column_mapping = {
    'pts': 'points',
    'fga': 'field_goal_attempts',
    'fta': 'free_throw_attempts',
    'tov': 'turnovers',
    'mp': 'minutes'
}

for old_name, new_name in player_column_mapping.items():
    if old_name in player.columns and new_name not in player.columns:
        player.rename(columns={old_name: new_name}, inplace=True)

# Add default values for missing columns
for col in ['points', 'field_goal_attempts', 'free_throw_attempts', 'turnovers', 'minutes']:
    if col not in player.columns:
        player[col] = 0

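# Calculate features
# True shooting percentage: points divided by twice the shooting possessions (0.44 weights free throw attempts)
player['true_shooting'] = player['points'] / (2 * (player['field_goal_attempts'] + 0.44 * player['free_throw_attempts']))
player['true_shooting'] = player['true_shooting'].replace([np.inf, -np.inf], np.nan).fillna(0)

# Per-minute usage proxy: possessions ended by a shot, free throws, or a turnover, divided by minutes played.
# Unlike the standard usage rate, it is not normalized by team possessions.
player['usage_rate'] = (player['field_goal_attempts'] + 0.44 * player['free_throw_attempts'] + player['turnovers']) / player['minutes']
player['usage_rate'] = player['usage_rate'].replace([np.inf, -np.inf], np.nan).fillna(0)

# Calculate experience if possible
if 'player' in player.columns and 'season' in player.columns:
    # first_season is the earliest season present in this table, so experience counts
    # seasons since a player's first appearance in the data
    player_first_season = player.groupby('player')['season'].min().reset_index()
    player_first_season.rename(columns={'season': 'first_season'}, inplace=True)
    player = player.merge(player_first_season, on='player', how='left')
    player['experience'] = player['season'] - player['first_season']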
    
    # Create experience bins
    bins = [-1, 2, 5, 9, 100]
    labels = ['Rookie (0-2)', 'Early Career (3-5)', 'Prime (6-9)', 'Veteran (10+)']
    player['career_stage'] = pd.cut(player['experience'], bins=bins, labels=labels, right=True)
else:
    player['experience'] = 0
    player['career_stage'] = 'Unknown'
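
A worked example makes the true shooting formula and the career-stage bins concrete. The sketch below uses two hypothetical stat lines. The second has no attempts, which exercises the zero-denominator fallback.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'points': [25, 0],
    'field_goal_attempts': [18, 0],
    'free_throw_attempts': [5, 0],
    'experience': [3, 10],
})

# 25 / (2 * (18 + 0.44 * 5)) = 25 / 40.4, roughly 0.619
denominator = 2 * (toy['field_goal_attempts'] + 0.44 * toy['free_throw_attempts'])
toy['true_shooting'] = (toy['points'] / denominator).replace([np.inf, -np.inf], np.nan).fillna(0)

# Right-closed bins: (-1, 2], (2, 5], (5, 9], (9, 100]
labels = ['Rookie (0-2)', 'Early Career (3-5)', 'Prime (6-9)', 'Veteran (10+)']
toy['career_stage'] = pd.cut(toy['experience'], bins=[-1, 2, 5, 9, 100], labels=labels, right=True)
print(toy[['true_shooting', 'career_stage']])
```

Filling the no-attempt case with 0 treats "no data" the same as "scored nothing", so a missing-value marker may be preferable if downstream models can handle it.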

## Team Feature Engineering

Team features describe the characteristics and performance of each team. Team playing style, offensive efficiency, and overall quality provide important context for understanding shot patterns and success rates.

In [8]:
# 5. Team Features
# Map column names
team_column_mapping = {
    'win': 'wins',
    'loss': 'losses',
    'pts_per_game': 'points_per_game',
    'pts_against_per_game': 'points_allowed_per_game',
    'fg3a': 'three_point_attempts',
    'fga': 'field_goal_attempts'
}

for old_name, new_name in team_column_mapping.items():
    if old_name in team.columns and new_name not in team.columns:
        team.rename(columns={old_name: new_name}, inplace=True)

# Add default values for missing columns
for col in ['wins', 'losses', 'points_per_game', 'points_allowed_per_game', 'pace', 'three_point_attempts', 'field_goal_attempts']:
    if col not in team.columns:
        team[col] = 0 if col != 'pace' else 100

# Calculate features
team['win_pct'] = team['wins'] / (team['wins'] + team['losses'])
team['win_pct'] = team['win_pct'].replace([np.inf, -np.inf], np.nan).fillna(0.5)

# Approximate per-100-possession ratings by scaling per-game scoring by pace
team['offensive_rating'] = team['points_per_game'] * (100 / team['pace'])
team['defensive_rating'] = team['points_allowed_per_game'] * (100 / team['pace'])
team['net_rating'] = team['offensive_rating'] - team['defensive_rating']

team['three_point_rate'] = team['three_point_attempts'] / team['field_goal_attempts']
team['three_point_rate'] = team['three_point_rate'].replace([np.inf, -np.inf], np.nan).fillna(0.25)

# Categorize playing style
pace_median = team['pace'].median()
team['pace_style'] = np.where(team['pace'] > pace_median, 'Fast', 'Slow')

three_pt_median = team['three_point_rate'].median()
team['shooting_style'] = np.where(team['three_point_rate'] > three_pt_median, 'Three-Heavy', 'Inside')

team['playing_style'] = team['pace_style'] + '-' + team['shooting_style']
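
The rating and playing-style calculations can be traced on four hypothetical teams. The values in this sketch are invented for illustration and are not taken from the team data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'team': ['A', 'B', 'C', 'D'],
    'points_per_game': [110.0, 104.0, 115.0, 108.0],
    'pace': [98.0, 96.0, 102.0, 100.0],
    'three_point_rate': [0.42, 0.30, 0.35, 0.40],
})

# Team A: 110 * 100 / 98, roughly 112.2 points per 100 possessions
toy['offensive_rating'] = toy['points_per_game'] * (100 / toy['pace'])

# Medians here are 99 (pace) and 0.375 (three-point rate)
toy['pace_style'] = np.where(toy['pace'] > toy['pace'].median(), 'Fast', 'Slow')
toy['shooting_style'] = np.where(
    toy['three_point_rate'] > toy['three_point_rate'].median(), 'Three-Heavy', 'Inside'
)
toy['playing_style'] = toy['pace_style'] + '-' + toy['shooting_style']
print(toy[['team', 'offensive_rating', 'playing_style']])
```

Because the comparison is strict, a team sitting exactly on a median is labeled Slow or Inside. The split is also relative to whichever teams and seasons are in the table, not to a fixed threshold.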

## Feature Integration

Now that we've created features from multiple sources, we need to integrate them into a comprehensive dataset that our models can use. This integration process requires careful handling of join conditions and potential missing values.

In [9]:
# 6. Merge Features
# Prepare columns for merging
player_cols = [col for col in ['player', 'season', 'true_shooting', 'usage_rate', 'experience', 'career_stage'] 
               if col in player.columns]

team_cols = [col for col in ['team', 'season', 'win_pct', 'offensive_rating', 'defensive_rating', 'playing_style'] 
             if col in team.columns]

# Merge player features
if 'player_name' in shots.columns and 'player' in player.columns and 'season' in player.columns:
    shots_with_player = shots.merge(
        player[player_cols],
        left_on=['player_name', 'season'],
        right_on=['player', 'season'],
        how='left'
    )
    if 'player' in shots_with_player.columns:
        shots_with_player.drop('player', axis=1, inplace=True)
else:
    shots_with_player = shots

# Merge team features
if 'team_name' in shots_with_player.columns and 'team' in team.columns and 'season' in team.columns:
    final_shots = shots_with_player.merge(
        team[team_cols],
        left_on=['team_name', 'season'],
        right_on=['team', 'season'],
        how='left'
    )
    if 'team' in final_shots.columns:
        final_shots.drop('team', axis=1, inplace=True)
else:
    final_shots = shots_with_player
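
One risk with left merges like these is silent row duplication when the lookup table has more than one row per key. The sketch below is an illustration on toy data and assumes such duplicates could exist in the player table, which this notebook does not confirm. It uses the `validate` argument of `DataFrame.merge` to detect the problem. Keeping the first duplicate is an arbitrary choice made only for the example.

```python
import pandas as pd

shots_toy = pd.DataFrame({'player_name': ['A', 'A', 'B'], 'season': [2021, 2021, 2021]})
# Hypothetical case: player A has two rows for the same season
player_toy = pd.DataFrame({'player': ['A', 'A', 'B'],
                           'season': [2021, 2021, 2021],
                           'true_shooting': [0.58, 0.61, 0.55]})
keys = dict(left_on=['player_name', 'season'], right_on=['player', 'season'], how='left')

# Without a check, each of A's shots is matched twice
inflated = shots_toy.merge(player_toy, **keys)

# validate='many_to_one' raises if the right-hand keys are not unique
try:
    shots_toy.merge(player_toy, validate='many_to_one', **keys)
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True

# One possible remedy (arbitrary for this sketch): keep the first row per key
deduped = player_toy.drop_duplicates(subset=['player', 'season'], keep='first')
merged = shots_toy.merge(deduped, validate='many_to_one', **keys)
print(len(shots_toy), len(inflated), len(merged), duplicate_keys_detected)
```

Comparing `len(final_shots)` with `len(shots)` after the merges is a cheaper check that catches the same problem.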

In [10]:
# 7. Save Features
final_shots.to_csv(features_dir / 'shots_with_features.csv', index=False)
player.to_csv(features_dir / 'player_features.csv', index=False)
team.to_csv(features_dir / 'team_features.csv', index=False)
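
CSV files do not store pandas dtypes, so the categorical `career_stage` column comes back as plain text when the file is reloaded. The following sketch shows this with an in-memory buffer and toy data rather than the project files.

```python
import io
import pandas as pd

labels = ['Rookie (0-2)', 'Early Career (3-5)', 'Prime (6-9)', 'Veteran (10+)']
toy = pd.DataFrame({'experience': [1, 7]})
toy['career_stage'] = pd.cut(toy['experience'], bins=[-1, 2, 5, 9, 100], labels=labels)

# Round-trip through CSV using an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

was_categorical = isinstance(toy['career_stage'].dtype, pd.CategoricalDtype)
still_categorical = isinstance(reloaded['career_stage'].dtype, pd.CategoricalDtype)
print(was_categorical, still_categorical)
```

The labels survive, but the category ordering does not. A downstream notebook that depends on the ordering would need to rebuild it, for example with `pd.Categorical(reloaded['career_stage'], categories=labels, ordered=True)`.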

## Feature Engineering Summary

In this notebook, we've created a rich set of features that capture the multidimensional nature of basketball shooting:

1. **Spatial Features**: We've transformed raw court coordinates into meaningful spatial features including shot distance, angle, court zones, and corner three indicators. These features capture the geometric aspects of shooting and are expected to be strong predictors of shot success.

2. **Game Context Features**: We've created features that capture the situational context of each shot, including time remaining, period type, end-of-period indicators, score situation, and clutch indicators. These features help our models understand how game situation affects shooting.

3. **Historical Shot Features**: We've incorporated each player's prior-season shooting percentage from each court zone, with a default of 0.5 when no prior-season attempts exist, providing a baseline expectation for shot success based on past performance.

4. **Player Performance Features**: We've calculated advanced metrics like true shooting percentage and usage rate, and created career stage indicators based on experience. These features help our models understand player-specific tendencies.

5. **Team Features**: We've included team performance metrics like win percentage, offensive/defensive ratings, and playing style indicators. These features provide context about the team environment for each shot.

By engineering these diverse features, we've transformed our raw data into a feature-rich dataset that captures the complex factors influencing basketball shooting. Whether each feature group actually improves predictive accuracy is left for the modeling stage to evaluate.
