# Feature Engineering

This notebook contains the code for the third part of this data science project: feature engineering. Section headings are included for convenience, and the full write-up is available [on my website](https://www.pineconedata.com/2024-05-30-basketball-feature_engineering/).

## Project Overview
This notebook is part of a series that walks through the entire process of a data science project, from initial steps like data acquisition, preprocessing, and cleaning to more advanced steps like feature engineering, creating visualizations, and machine learning. The dataset used in this project contains individual basketball player statistics (such as total points scored and blocks made) for the 2023-2024 NCAA women’s basketball season.

### Articles in this Series   
1. [Acquiring and Combining the Datasets](https://www.pineconedata.com/2024-04-11-basketball-data-acquisition/)
2. [Cleaning and Preprocessing the Data](https://www.pineconedata.com/2024-05-02-basketball-data-cleaning-preprocessing/)
3. [Engineering New Features](https://www.pineconedata.com/2024-05-30-basketball-feature_engineering/) (This Notebook)
4. [Exploratory Data Analysis](https://www.pineconedata.com/2024-06-28-basketball-data-exploration/)
5. [Visualizations, Charts, and Graphs](https://www.pineconedata.com/2024-07-29-basketball-visualizations/)
6. [Selecting a Machine Learning Model](https://www.pineconedata.com/2024-08-12-basketball-select-ml-ols/)
7. [Training the Machine Learning Model](https://www.pineconedata.com/2024-09-13-basketball-train-ols/)
8. [Evaluating the Machine Learning Model](https://www.pineconedata.com/)


# Getting Started
Full requirements and environment setup information is detailed in the [first article of this series](https://www.pineconedata.com/2024-04-11-basketball-data-acquisition/).

## Import Packages

In [1]:
import pandas as pd
import numpy as np

# Import Data

In [2]:
from pathlib import Path


data_folder = Path.cwd().parent / 'data'

In [3]:
player_data = pd.read_excel(data_folder / 'player_data_clean.xlsx')
player_data.head()

| | PLAYER_NAME | Team | Class | Height | Position | PLAYER_ID | TEAM_NAME | GAMES | MINUTES_PLAYED | FIELD_GOALS_MADE | ... | FREE_THROW_PERCENTAGE | OFFENSIVE_REBOUNDS | DEFENSIVE_REBOUNDS | TOTAL_REBOUNDS | ASSISTS | TURNOVERS | STEALS | BLOCKS | FOULS | POINTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kiara Jackson | UNLV (Mountain West) | Junior | 67 | Guard | ncaaw.p.67149 | UNLV | 29 | 895 | 128 | ... | 75.0 | 27 | 102 | 129 | 135 | 42 | 31 | 5 | 47 | 323 |
| 1 | Raven Johnson | South Carolina (SEC) | Sophomore | 68 | Guard | ncaaw.p.67515 | South Carolina | 30 | 823 | 98 | ... | 64.3 | 33 | 128 | 161 | 148 | 53 | 60 | 5 | 34 | 243 |
| 2 | Gina Marxen | Montana (Big Sky) | Senior | 68 | Guard | ncaaw.p.57909 | Montana | 29 | 778 | 88 | ... | 72.4 | 6 | 54 | 60 | 111 | 38 | 16 | 2 | 26 | 297 |
| 3 | McKenna Hofschild | Colorado St. (Mountain West) | Senior | 62 | Guard | ncaaw.p.60402 | Colorado St. | 29 | 1046 | 231 | ... | 83.5 | 6 | 109 | 115 | 211 | 71 | 36 | 4 | 34 | 654 |
| 4 | Kaylah Ivey | Boston College (ACC) | Junior | 68 | Guard | ncaaw.p.64531 | Boston Coll. | 33 | 995 | 47 | ... | 60.7 | 12 | 45 | 57 | 186 | 64 | 36 | 1 | 48 | 143 |


# Feature Engineering

## Calculate Two-Point Basket Metrics

In [4]:
# Calculate two-pointers made
player_data['TWO_POINTS_MADE'] = player_data['FIELD_GOALS_MADE'] - player_data['THREE_POINTS_MADE']

# Calculate two-point attempts
player_data['TWO_POINT_ATTEMPTS'] = player_data['FIELD_GOAL_ATTEMPTS'] - player_data['THREE_POINT_ATTEMPTS']

# Calculate two-point percentage
player_data['TWO_POINT_PERCENTAGE'] = (player_data['TWO_POINTS_MADE'] / player_data['TWO_POINT_ATTEMPTS']) * 100

player_data.dtypes

PLAYER_NAME                object
Team                       object
Class                      object
Height                      int64
Position                   object
PLAYER_ID                  object
TEAM_NAME                  object
GAMES                       int64
MINUTES_PLAYED              int64
FIELD_GOALS_MADE            int64
FIELD_GOAL_ATTEMPTS         int64
FIELD_GOAL_PERCENTAGE     float64
THREE_POINTS_MADE           int64
THREE_POINT_ATTEMPTS        int64
THREE_POINT_PERCENTAGE    float64
FREE_THROWS_MADE            int64
FREE_THROW_ATTEMPTS         int64
FREE_THROW_PERCENTAGE     float64
OFFENSIVE_REBOUNDS          int64
DEFENSIVE_REBOUNDS          int64
TOTAL_REBOUNDS              int64
ASSISTS                     int64
TURNOVERS                   int64
STEALS                      int64
BLOCKS                      int64
FOULS                       int64
POINTS                      int64
TWO_POINTS_MADE             int64
TWO_POINT_ATTEMPTS          int64
TWO_POINT_PERC
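
One way to sanity-check the derived two-point columns is to confirm that made baskets add up to total points: two per two-pointer, three per three-pointer, and one per free throw. The sketch below uses a small hypothetical `sample` DataFrame (the values are invented for illustration, not taken from the dataset) and assumes `POINTS` counts only those three scoring types.

```python
import pandas as pd

# Hypothetical sample rows; these values are made up for illustration
sample = pd.DataFrame({
    'FIELD_GOALS_MADE': [10, 7],
    'THREE_POINTS_MADE': [4, 0],
    'FREE_THROWS_MADE': [5, 3],
    'POINTS': [29, 17],
})

# Same derivation as the notebook cell above
sample['TWO_POINTS_MADE'] = sample['FIELD_GOALS_MADE'] - sample['THREE_POINTS_MADE']

# Rebuild total points from the three scoring types
expected_points = (
    sample['TWO_POINTS_MADE'] * 2
    + sample['THREE_POINTS_MADE'] * 3
    + sample['FREE_THROWS_MADE']
)

# Rows where the reconstructed total disagrees with the recorded total
mismatches = sample[expected_points != sample['POINTS']]
print(f"Rows with inconsistent point totals: {len(mismatches)}")
```

Running the same comparison on `player_data` would flag any rows where the shooting columns and the recorded point total disagree.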

## Extract Conference from Team Name

In [5]:
player_data[['TEAM_NAME', 'Team']].sample(10)

| | TEAM_NAME | Team |
|---|---|---|
| 339 | Alabama A&M | Ark.-Pine Bluff (SWAC) |
| 730 | St. Mary's | Idaho (Big Sky) |
| 647 | UC Davis | UC Davis (Big West) |
| 376 | E. Tennessee St. | ETSU (SoCon) |
| 527 | SMU | SMU (AAC) |
| 334 | Georgetown | Georgetown (Big East) |
| 688 | Northern Iowa | UNI (MVC) |
| 412 | Washington St. | Washington St. (Pac-12) |
| 648 | Tulane | Tulane (AAC) |
| 229 | Tenn-Martin | UT Martin (OVC) |


In [6]:
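# First attempt: take the text between the first pair of parentheses (raw strings avoid invalid-escape warnings)
player_data['Team'].str.split(r'\(', expand=True)[1].str.split(r'\)', expand=True)[0]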

0      Mountain West
1                SEC
2            Big Sky
3      Mountain West
4                ACC
           ...      
890             MAAC
891              AAC
892            SoCon
893             SWAC
894             ASUN
Name: 0, Length: 895, dtype: object

In [7]:
# Find team names that contain two opening parentheses
player_data[player_data['Team'].str.count(r'\(') == 2]['Team']

76      St. John's (NY) (Big East)
101               Miami (FL) (ACC)
124                 LMU (CA) (WCC)
197       Saint Francis (PA) (NEC)
316     St. John's (NY) (Big East)
342               Miami (FL) (ACC)
389               Miami (OH) (MAC)
483     St. John's (NY) (Big East)
701                 LMU (CA) (WCC)
753               Miami (OH) (MAC)
770        Saint Mary's (CA) (WCC)
802                 LMU (CA) (WCC)
814        Saint Mary's (CA) (WCC)
820               Miami (FL) (ACC)
Name: Team, dtype: object

In [8]:
# For these rows, the split approach returns the state abbreviation instead of the conference
player_data['Team'].str.split(r'\(', expand=True)[1].str.split(r'\)', expand=True)[0].iloc[[124, 820]]

124    CA
820    FL
Name: 0, dtype: object

In [9]:
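# Capture the text inside the final pair of parentheses ($ anchors the match to the end of the string)
player_data['Team'].str.extract(r'\(([^)]+)\)$')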

| | 0 |
|---|---|
| 0 | Mountain West |
| 1 | SEC |
| 2 | Big Sky |
| 3 | Mountain West |
| 4 | ACC |
| ... | ... |
| 890 | MAAC |
| 891 | AAC |
| 892 | SoCon |
| 893 | SWAC |


In [10]:
player_data['Team'].str.extract(r'\(([^)]+)\)$').iloc[[124, 820]]

| | 0 |
|---|---|
| 124 | WCC |
| 820 | ACC |
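
The difference between the two approaches can be reproduced outside pandas with the standard `re` module. This is a minimal sketch using three team strings that appear in the outputs above; the helper names `split_conference` and `regex_conference` are hypothetical and exist only for this illustration.

```python
import re

# \(        literal opening parenthesis
# ([^)]+)   capture one or more characters that are not a closing parenthesis
# \)$       literal closing parenthesis at the very end of the string
CONFERENCE_PATTERN = re.compile(r'\(([^)]+)\)$')


def split_conference(team):
    """Split-based approach: returns the text in the FIRST pair of parentheses."""
    return team.split('(')[1].split(')')[0]


def regex_conference(team):
    """Regex approach: returns the text in the LAST pair of parentheses, or None."""
    match = CONFERENCE_PATTERN.search(team)
    return match.group(1) if match else None


teams = ['UNLV (Mountain West)', 'Miami (FL) (ACC)', "St. John's (NY) (Big East)"]
for team in teams:
    print(f"{team}: split={split_conference(team)!r}, regex={regex_conference(team)!r}")
```

Because the pattern is anchored with `$`, a state abbreviation such as `(FL)` cannot match: it is followed by more text rather than the end of the string.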


In [11]:
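# expand=False returns a Series (rather than a one-column DataFrame) for a single capture group
player_data['Conference'] = player_data['Team'].str.extract(r'\(([^)]+)\)$', expand=False)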

In [12]:
sorted(player_data['Conference'].unique())

['AAC',
 'ACC',
 'ASUN',
 'America East',
 'Atlantic 10',
 'Big 12',
 'Big East',
 'Big Sky',
 'Big South',
 'Big Ten',
 'Big West',
 'CAA',
 'CUSA',
 'DI Independent',
 'Horizon',
 'Ivy League',
 'MAAC',
 'MAC',
 'MEAC',
 'MVC',
 'Mountain West',
 'NEC',
 'OVC',
 'Pac-12',
 'Patriot',
 'SEC',
 'SWAC',
 'SoCon',
 'Southland',
 'Summit League',
 'Sun Belt',
 'WAC',
 'WCC']
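
The sorted list above contains no missing values, which suggests every team string matched the pattern. If a team name ever lacked a trailing conference, `str.extract` would return `NaN` for that row, and sorting a mix of strings and `NaN` would raise a `TypeError`. The sketch below shows one way to check for unmatched rows explicitly; the `teams` Series is a hypothetical sample, and `'Unknown Team'` is an invented value added to trigger the missing case.

```python
import pandas as pd

# Hypothetical sample; 'Unknown Team' is invented to show the unmatched case
teams = pd.Series([
    'UNLV (Mountain West)',
    'Miami (FL) (ACC)',
    'Boston College (ACC)',
    'Unknown Team',
])

# expand=False returns a Series for a single capture group
conference = teams.str.extract(r'\(([^)]+)\)$', expand=False)

# Team strings where no conference could be extracted
unmatched = teams[conference.isna()]
print(unmatched.tolist())

# Number of rows per conference (missing values are excluded by default)
conference_counts = conference.value_counts()
print(conference_counts)
```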

## Calculate per-Game Metrics

In [13]:
player_data['MINUTES_PER_GAME'] = player_data['MINUTES_PLAYED'] / player_data['GAMES']
player_data['FOULS_PER_GAME'] = player_data['FOULS'] / player_data['GAMES']
player_data['POINTS_PER_GAME'] = player_data['POINTS'] / player_data['GAMES']
player_data['ASSISTS_PER_GAME'] = player_data['ASSISTS'] / player_data['GAMES']
player_data['STEALS_PER_GAME'] = player_data['STEALS'] / player_data['GAMES']
player_data['BLOCKS_PER_GAME'] = player_data['BLOCKS'] / player_data['GAMES']
player_data['REBOUNDS_PER_GAME'] = player_data['TOTAL_REBOUNDS'] / player_data['GAMES']

player_data[['PLAYER_NAME', 'MINUTES_PER_GAME', 'FOULS_PER_GAME', 'POINTS_PER_GAME', 'ASSISTS_PER_GAME', 'STEALS_PER_GAME', 'BLOCKS_PER_GAME', 'REBOUNDS_PER_GAME']].sample(5)

| | PLAYER_NAME | MINUTES_PER_GAME | FOULS_PER_GAME | POINTS_PER_GAME | ASSISTS_PER_GAME | STEALS_PER_GAME | BLOCKS_PER_GAME | REBOUNDS_PER_GAME |
|---|---|---|---|---|---|---|---|---|
| 162 | Chloe Hodges | 27.931034 | 1.827586 | 8.931034 | 3.275862 | 1.241379 | 0.344828 | 5.206897 |
| 77 | Caitlin Clark | 34.0 | 1.90625 | 31.875 | 8.8125 | 1.71875 | 0.53125 | 7.3125 |
| 568 | Kiki Jefferson | 24.575758 | 1.757576 | 12.515152 | 2.363636 | 1.090909 | 0.212121 | 4.606061 |
| 148 | Eleyana Tafisi | 25.586207 | 3.551724 | 6.724138 | 3.62069 | 1.62069 | 0.275862 | 3.448276 |
| 787 | Jayda McNabb | 28.193548 | 2.709677 | 6.193548 | 1.741935 | 1.806452 | 0.645161 | 6.290323 |
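
Since every per-game metric is the same division by `GAMES`, the repeated assignments could also be driven by a mapping of new column names to source columns. This is an alternative sketch, not the approach used in the cell above; the `per_game_sources` dictionary is a hypothetical helper, and the two sample rows reuse the totals shown for the first two players in the `head()` output.

```python
import pandas as pd

# Season totals for the first two players shown in the head() output above
sample = pd.DataFrame({
    'GAMES': [29, 30],
    'MINUTES_PLAYED': [895, 823],
    'POINTS': [323, 243],
    'ASSISTS': [135, 148],
})

# Hypothetical mapping: new per-game column -> season-total column
per_game_sources = {
    'MINUTES_PER_GAME': 'MINUTES_PLAYED',
    'POINTS_PER_GAME': 'POINTS',
    'ASSISTS_PER_GAME': 'ASSISTS',
}

# Divide each season total by games played
for new_column, source_column in per_game_sources.items():
    sample[new_column] = sample[source_column] / sample['GAMES']

print(sample[list(per_game_sources)])
```

Adding another per-game metric then only requires one more dictionary entry.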


## Calculate Assist-to-Turnover Ratio

In [14]:
player_data['ASSIST_TO_TURNOVER'] = player_data['ASSISTS'] / player_data['TURNOVERS']

player_data[['PLAYER_NAME', 'ASSISTS', 'TURNOVERS', 'ASSIST_TO_TURNOVER']].sample(5)

| | PLAYER_NAME | ASSISTS | TURNOVERS | ASSIST_TO_TURNOVER |
|---|---|---|---|---|
| 366 | Faith Stinson | 20 | 28 | 0.714286 |
| 860 | Dena Jarrells | 112 | 102 | 1.098039 |
| 467 | Marah Dykstra | 78 | 75 | 1.04 |
| 196 | Samantha Johnston | 99 | 80 | 1.2375 |
| 361 | Clara Strack | 20 | 31 | 0.645161 |
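
One edge case to keep in mind with this ratio: a player with assists but zero turnovers produces `inf`, which can distort later summaries or models. Whether any such rows exist in this dataset is not shown here, so the sketch below is only a precaution. The first two sample rows reuse values from the output above; the last two are hypothetical zero-turnover cases.

```python
import numpy as np
import pandas as pd

# First two rows come from the sample output above; the last two are hypothetical
sample = pd.DataFrame({
    'ASSISTS': [20, 112, 7, 0],
    'TURNOVERS': [28, 102, 0, 0],
})

# Plain division: 7 / 0 becomes inf and 0 / 0 becomes NaN
raw_ratio = sample['ASSISTS'] / sample['TURNOVERS']

# Replacing zero turnovers with NaN first makes both edge cases NaN
safe_ratio = sample['ASSISTS'] / sample['TURNOVERS'].replace(0, np.nan)

print(pd.DataFrame({'raw': raw_ratio, 'safe': safe_ratio}))
```

Treating the undefined ratio as missing is one design choice among several; capping it or substituting the assist count are alternatives that depend on how the feature will be used.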


## Calculate Fantasy Points

In [15]:
# Weighted sum of counting stats; parentheses replace the backslash line continuations
player_data['FANTASY_POINTS'] = (
    (player_data['THREE_POINTS_MADE'] * 3)
    + (player_data['TWO_POINTS_MADE'] * 2)
    + (player_data['FREE_THROWS_MADE'] * 1)
    + (player_data['TOTAL_REBOUNDS'] * 1.2)
    + (player_data['ASSISTS'] * 1.5)
    + (player_data['BLOCKS'] * 2)
    + (player_data['STEALS'] * 2)
    + (player_data['TURNOVERS'] * -1)
)

player_data[['PLAYER_NAME', 'FANTASY_POINTS']].sample(5)

| | PLAYER_NAME | FANTASY_POINTS |
|---|---|---|
| 325 | Taisha Exanor | 544.9 |
| 584 | Jasmine Shavers | 795.4 |
| 704 | Taniya Hanner | 654.4 |
| 535 | Faith Lee | 597.6 |
| 609 | Evanne Turner | 655.1 |
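
The fantasy score is a weighted sum, so the same calculation can be written as a dot product between the stat columns and a Series of weights. This is an alternative sketch rather than the notebook's method: the weights match the formula above, `fantasy_weights` is a hypothetical name, and the single sample row is invented so the arithmetic is easy to follow.

```python
import pandas as pd

# Weights taken from the FANTASY_POINTS formula above
fantasy_weights = pd.Series({
    'THREE_POINTS_MADE': 3,
    'TWO_POINTS_MADE': 2,
    'FREE_THROWS_MADE': 1,
    'TOTAL_REBOUNDS': 1.2,
    'ASSISTS': 1.5,
    'BLOCKS': 2,
    'STEALS': 2,
    'TURNOVERS': -1,
})

# Hypothetical stat line, invented for illustration
sample = pd.DataFrame([{
    'THREE_POINTS_MADE': 2,
    'TWO_POINTS_MADE': 5,
    'FREE_THROWS_MADE': 3,
    'TOTAL_REBOUNDS': 10,
    'ASSISTS': 4,
    'BLOCKS': 1,
    'STEALS': 2,
    'TURNOVERS': 3,
}])

# Select columns in the same order as the weights, then take the weighted sum
fantasy_points = sample[fantasy_weights.index].dot(fantasy_weights)

# 6 + 10 + 3 + 12 + 6 + 2 + 4 - 3 = 40
print(fantasy_points.iloc[0])
```

Keeping the weights in one place makes it straightforward to try a different scoring system without rewriting the formula.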


# Wrap Up

In [16]:
player_data.dtypes

PLAYER_NAME                object
Team                       object
Class                      object
Height                      int64
Position                   object
PLAYER_ID                  object
TEAM_NAME                  object
GAMES                       int64
MINUTES_PLAYED              int64
FIELD_GOALS_MADE            int64
FIELD_GOAL_ATTEMPTS         int64
FIELD_GOAL_PERCENTAGE     float64
THREE_POINTS_MADE           int64
THREE_POINT_ATTEMPTS        int64
THREE_POINT_PERCENTAGE    float64
FREE_THROWS_MADE            int64
FREE_THROW_ATTEMPTS         int64
FREE_THROW_PERCENTAGE     float64
OFFENSIVE_REBOUNDS          int64
DEFENSIVE_REBOUNDS          int64
TOTAL_REBOUNDS              int64
ASSISTS                     int64
TURNOVERS                   int64
STEALS                      int64
BLOCKS                      int64
FOULS                       int64
POINTS                      int64
TWO_POINTS_MADE             int64
TWO_POINT_ATTEMPTS          int64
TWO_POINT_PERC

# Export Data

In [17]:
player_data.to_excel(data_folder / 'player_data_engineered.xlsx', index=False)