# Outlier or Caitlin Clark? A Data Science Project
## Part 3 - Feature Engineering

This notebook contains the code for the third part of this data science project: feature engineering. Section headings have been included for convenience, and the full writeup is available [on my website](https://www.pineconedata.com/2024-05-30-basketball-feature_engineering/).

In summary, there will be a notebook (and post) for each part of the process - from initial steps like data acquisition, preprocessing, and cleaning to more advanced steps like feature engineering, machine learning, and creating visualizations. The dataset used in this project contains individual basketball player statistics (such as total points scored and blocks made) for the 2023-2024 NCAA women’s basketball season.

# Getting Started
Full requirements and environment setup information is detailed in the [first article of this series](https://www.pineconedata.com/2024-04-11-basketball-data-acquisition/).

## Import Packages

In [1]:
import pandas as pd
import requests
import json
import os
import numpy as np
import openpyxl 

# Import Data

In [2]:
from pathlib import Path


data_folder = Path.cwd().parent / 'data'

In [3]:
player_data = pd.read_excel(data_folder / 'player_data_clean.xlsx')
player_data.head()

Unnamed: 0,PLAYER_NAME,Team,Class,Height,Position,PLAYER_ID,TEAM_NAME,GAMES,MINUTES_PLAYED,FIELD_GOALS_MADE,...,FREE_THROW_PERCENTAGE,OFFENSIVE_REBOUNDS,DEFENSIVE_REBOUNDS,TOTAL_REBOUNDS,ASSISTS,TURNOVERS,STEALS,BLOCKS,FOULS,POINTS
0,Kiara Jackson,UNLV (Mountain West),Junior,67,Guard,ncaaw.p.67149,UNLV,29,895,128,...,75.0,27,102,129,135,42,31,5,47,323
1,Raven Johnson,South Carolina (SEC),Sophomore,68,Guard,ncaaw.p.67515,South Carolina,30,823,98,...,64.3,33,128,161,148,53,60,5,34,243
2,Gina Marxen,Montana (Big Sky),Senior,68,Guard,ncaaw.p.57909,Montana,29,778,88,...,72.4,6,54,60,111,38,16,2,26,297
3,McKenna Hofschild,Colorado St. (Mountain West),Senior,62,Guard,ncaaw.p.60402,Colorado St.,29,1046,231,...,83.5,6,109,115,211,71,36,4,34,654
4,Kaylah Ivey,Boston College (ACC),Junior,68,Guard,ncaaw.p.64531,Boston Coll.,33,995,47,...,60.7,12,45,57,186,64,36,1,48,143


# Feature Engineering

## Calculate Two-Point Basket Metrics

In [4]:
# Calculate two-pointers made
player_data['TWO_POINTS_MADE'] = player_data['FIELD_GOALS_MADE'] - player_data['THREE_POINTS_MADE']

# Calculate two-point attempts
player_data['TWO_POINT_ATTEMPTS'] = player_data['FIELD_GOAL_ATTEMPTS'] - player_data['THREE_POINT_ATTEMPTS']

# Calculate two-point percentage
player_data['TWO_POINT_PERCENTAGE'] = (player_data['TWO_POINTS_MADE'] / player_data['TWO_POINT_ATTEMPTS']) * 100

player_data.dtypes

PLAYER_NAME                object
Team                       object
Class                      object
Height                      int64
Position                   object
PLAYER_ID                  object
TEAM_NAME                  object
GAMES                       int64
MINUTES_PLAYED              int64
FIELD_GOALS_MADE            int64
FIELD_GOAL_ATTEMPTS         int64
FIELD_GOAL_PERCENTAGE     float64
THREE_POINTS_MADE           int64
THREE_POINT_ATTEMPTS        int64
THREE_POINT_PERCENTAGE    float64
FREE_THROWS_MADE            int64
FREE_THROW_ATTEMPTS         int64
FREE_THROW_PERCENTAGE     float64
OFFENSIVE_REBOUNDS          int64
DEFENSIVE_REBOUNDS          int64
TOTAL_REBOUNDS              int64
ASSISTS                     int64
TURNOVERS                   int64
STEALS                      int64
BLOCKS                      int64
FOULS                       int64
POINTS                      int64
TWO_POINTS_MADE             int64
TWO_POINT_ATTEMPTS          int64
TWO_POINT_PERCENTAGE      float64
dtype: object
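
One edge case worth keeping in mind: a player who only attempted three-pointers has zero two-point attempts, so the percentage becomes 0 / 0. The sketch below uses a small made-up DataFrame (the values are hypothetical, not rows from `player_data`) to show that pandas returns `NaN` in that situation rather than raising an error.

```python
import pandas as pd

# Hypothetical season totals for illustration only; not taken from player_data
demo = pd.DataFrame({
    'FIELD_GOALS_MADE': [128, 10],
    'FIELD_GOAL_ATTEMPTS': [300, 25],
    'THREE_POINTS_MADE': [28, 10],
    'THREE_POINT_ATTEMPTS': [100, 25],
})

# Same three calculations as the cell above
demo['TWO_POINTS_MADE'] = demo['FIELD_GOALS_MADE'] - demo['THREE_POINTS_MADE']
demo['TWO_POINT_ATTEMPTS'] = demo['FIELD_GOAL_ATTEMPTS'] - demo['THREE_POINT_ATTEMPTS']
demo['TWO_POINT_PERCENTAGE'] = (demo['TWO_POINTS_MADE'] / demo['TWO_POINT_ATTEMPTS']) * 100

# Flag rows where the percentage is undefined because there were no attempts
no_two_point_attempts = demo['TWO_POINT_ATTEMPTS'] == 0
```

Whether to leave those values as `NaN` or fill them with 0 depends on how the column is used later, so the sketch only flags them.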

## Extract Conference from Team Name

In [5]:
player_data[['TEAM_NAME', 'Team']].sample(10)

Unnamed: 0,TEAM_NAME,Team
429,Louisiana Tech,Louisiana Tech (CUSA)
746,Texas A&M,Texas A&M (SEC)
748,Ut. Tech,Niagara (MAAC)
758,Texas A&M-Corpus Christi,A&M-Corpus Christi (Southland)
290,NJIT,NJIT (America East)
142,N.C. A&T,N.C. A&T (CAA)
294,Kennesaw St.,Kennesaw St. (ASUN)
199,Grand Canyon,Grand Canyon (WAC)
800,PFW,Purdue Fort Wayne (Horizon)
597,Memphis,Memphis (AAC)


In [6]:
player_data['Team'].str.split(r'\(', expand=True)[1].str.split(r'\)', expand=True)[0]

0      Mountain West
1                SEC
2            Big Sky
3      Mountain West
4                ACC
           ...      
890             MAAC
891              AAC
892            SoCon
893             SWAC
894             ASUN
Name: 0, Length: 895, dtype: object

In [7]:
player_data.loc[[125, 824], 'Team']

125     Illinois (Big Ten)
824      Oakland (Horizon)
Name: Team, dtype: object

In [8]:
player_data['Team'].str.split(r'\(', expand=True)[1].str.split(r'\)', expand=True)[0].iloc[[125, 824]]

125    Big Ten
824    Horizon
Name: 0, dtype: object

In [9]:
player_data['Team'].str.extract(r'\(([^)]+)\)$')

Unnamed: 0,0
0,Mountain West
1,SEC
2,Big Sky
3,Mountain West
4,ACC
...,...
890,MAAC
891,AAC
892,SoCon
893,SWAC
894,ASUN


In [10]:
player_data['Team'].str.extract(r'\(([^)]+)\)$').iloc[[125, 824]]

Unnamed: 0,0
125,Big Ten
824,Horizon


In [11]:
# expand=False returns a Series, which maps directly onto a single column
player_data['Conference'] = player_data['Team'].str.extract(r'\(([^)]+)\)$', expand=False)

In [12]:
sorted(player_data['Conference'].unique())

['AAC',
 'ACC',
 'ASUN',
 'America East',
 'Atlantic 10',
 'Big 12',
 'Big East',
 'Big Sky',
 'Big South',
 'Big Ten',
 'Big West',
 'CAA',
 'CUSA',
 'DI Independent',
 'Horizon',
 'Ivy League',
 'MAAC',
 'MAC',
 'MEAC',
 'MVC',
 'Mountain West',
 'NEC',
 'OVC',
 'Pac-12',
 'Patriot',
 'SEC',
 'SWAC',
 'SoCon',
 'Southland',
 'Summit League',
 'Sun Belt',
 'WAC',
 'WCC']
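
To illustrate how the split-based and regex-based approaches can differ, here is a minimal sketch on three sample strings. The first two come from the `Team` column shown earlier; the third, `'Miami (FL) (ACC)'`, is a hypothetical value used only to show what happens when a team name contains its own parentheses. It may not appear in this dataset.

```python
import pandas as pd

teams = pd.Series([
    'UNLV (Mountain West)',   # from the data preview
    'Illinois (Big Ten)',     # from the data preview
    'Miami (FL) (ACC)',       # hypothetical: parentheses inside the team name
])

# Split approach: take the text between the FIRST '(' and the next ')'
split_based = teams.str.split(r'\(', expand=True)[1].str.split(r'\)', expand=True)[0]

# Regex approach:
#   \(       literal opening parenthesis
#   ([^)]+)  capture one or more characters that are not ')'
#   \)$      literal closing parenthesis at the end of the string
regex_based = teams.str.extract(r'\(([^)]+)\)$', expand=False)
```

Because the pattern is anchored with `$`, it only matches the final parenthesized group, whereas the split approach picks up the first one. For the team names sampled in this notebook both approaches agree.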

## Calculate Per-Game Metrics

In [13]:
player_data['MINUTES_PER_GAME'] = player_data['MINUTES_PLAYED'] / player_data['GAMES']
player_data['FOULS_PER_GAME'] = player_data['FOULS'] / player_data['GAMES']
player_data['POINTS_PER_GAME'] = player_data['POINTS'] / player_data['GAMES']
player_data['ASSISTS_PER_GAME'] = player_data['ASSISTS'] / player_data['GAMES']
player_data['STEALS_PER_GAME'] = player_data['STEALS'] / player_data['GAMES']
player_data['BLOCKS_PER_GAME'] = player_data['BLOCKS'] / player_data['GAMES']
player_data['REBOUNDS_PER_GAME'] = player_data['TOTAL_REBOUNDS'] / player_data['GAMES']

player_data[['PLAYER_NAME', 'MINUTES_PER_GAME', 'FOULS_PER_GAME', 'POINTS_PER_GAME', 'ASSISTS_PER_GAME', 'STEALS_PER_GAME', 'BLOCKS_PER_GAME', 'REBOUNDS_PER_GAME']].sample(5)

Unnamed: 0,PLAYER_NAME,MINUTES_PER_GAME,FOULS_PER_GAME,POINTS_PER_GAME,ASSISTS_PER_GAME,STEALS_PER_GAME,BLOCKS_PER_GAME,REBOUNDS_PER_GAME
247,Anna Miller,24.451613,1.935484,13.193548,2.096774,1.064516,2.645161,9.870968
196,Samantha Johnston,34.068966,1.862069,10.172414,3.413793,0.931034,1.034483,4.344828
600,Derin Erdogan,35.8,2.4,15.04,4.2,1.44,0.2,4.88
230,Maddie Scherr,33.115385,2.730769,12.461538,3.423077,1.576923,0.730769,4.769231
22,Kindyll Wetta,23.193548,2.419355,5.580645,3.935484,1.709677,0.16129,3.322581
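
Since every per-game metric follows the same pattern (season total divided by games played), the seven assignments could also be written as a loop over a mapping. This is a sketch of that alternative on a made-up DataFrame; the rows are hypothetical and `per_game_columns` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical season totals for two players; not rows from player_data
demo = pd.DataFrame({
    'GAMES': [30, 25],
    'POINTS': [450, 376],
    'ASSISTS': [90, 105],
    'TOTAL_REBOUNDS': [150, 122],
})

# Map each season-total column to the per-game column it should produce
per_game_columns = {
    'POINTS': 'POINTS_PER_GAME',
    'ASSISTS': 'ASSISTS_PER_GAME',
    'TOTAL_REBOUNDS': 'REBOUNDS_PER_GAME',
}

for total_column, per_game_column in per_game_columns.items():
    demo[per_game_column] = demo[total_column] / demo['GAMES']
```

The explicit mapping keeps names like `REBOUNDS_PER_GAME` (from `TOTAL_REBOUNDS`) under control instead of deriving them from the source column name.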


## Calculate Assist-to-Turnover Ratio

In [14]:
player_data['ASSIST_TO_TURNOVER'] = player_data['ASSISTS'] / player_data['TURNOVERS']

player_data[['PLAYER_NAME', 'ASSISTS', 'TURNOVERS', 'ASSIST_TO_TURNOVER']].sample(5)

Unnamed: 0,PLAYER_NAME,ASSISTS,TURNOVERS,ASSIST_TO_TURNOVER
584,Jasmine Shavers,50,80,0.625
676,Sydney Affolter,74,28,2.642857
636,Sarah Te-Biasu,91,68,1.338235
439,Kelsey Rees,24,35,0.685714
606,Taylor Donaldson,44,61,0.721311
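
A player with zero turnovers would make this ratio divide by zero, which pandas reports as `inf` rather than an error. The sketch below shows one possible way to handle that, assuming an undefined ratio should be stored as `NaN`. The first two rows reuse the `ASSISTS` and `TURNOVERS` values from the sample above; the third row is hypothetical.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'ASSISTS': [50, 74, 12],
    'TURNOVERS': [80, 28, 0],   # last row is a hypothetical zero-turnover player
})

# Plain division yields inf when TURNOVERS is 0
raw_ratio = demo['ASSISTS'] / demo['TURNOVERS']

# One option: treat the ratio as undefined (NaN) when there are no turnovers
demo['ASSIST_TO_TURNOVER'] = raw_ratio.replace([np.inf, -np.inf], np.nan)
```

Replacing `inf` with `NaN` is a design choice, not the only one; it keeps summary statistics such as the mean from being dominated by infinite values.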


## Calculate Fantasy Points

In [15]:
# Weighted sum of season totals; turnovers count against the player
player_data['FANTASY_POINTS'] = (
    (player_data['THREE_POINTS_MADE'] * 3)
    + (player_data['TWO_POINTS_MADE'] * 2)
    + (player_data['FREE_THROWS_MADE'] * 1)
    + (player_data['TOTAL_REBOUNDS'] * 1.2)
    + (player_data['ASSISTS'] * 1.5)
    + (player_data['BLOCKS'] * 2)
    + (player_data['STEALS'] * 2)
    + (player_data['TURNOVERS'] * -1)
)

player_data[['PLAYER_NAME', 'FANTASY_POINTS']].sample(5)

Unnamed: 0,PLAYER_NAME,FANTASY_POINTS
765,Jada Lee,542.1
131,Destiny Jackson,651.9
832,Emma Von Essen,468.1
338,Leilani Kapinus,797.9
265,Riley Stack,389.5
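
The fantasy points formula is a weighted sum, so the weights can also be kept in a dictionary and applied in one step. The following is a sketch of that idea on a single made-up player (the stat values are hypothetical); the weights are copied from the formula above, and `weights` is a name introduced here for illustration.

```python
import pandas as pd

# Scoring weights copied from the formula above
weights = {
    'THREE_POINTS_MADE': 3,
    'TWO_POINTS_MADE': 2,
    'FREE_THROWS_MADE': 1,
    'TOTAL_REBOUNDS': 1.2,
    'ASSISTS': 1.5,
    'BLOCKS': 2,
    'STEALS': 2,
    'TURNOVERS': -1,
}

# Hypothetical season totals for one player; not a row from player_data
demo = pd.DataFrame([{
    'THREE_POINTS_MADE': 10,
    'TWO_POINTS_MADE': 20,
    'FREE_THROWS_MADE': 5,
    'TOTAL_REBOUNDS': 10,
    'ASSISTS': 4,
    'BLOCKS': 1,
    'STEALS': 2,
    'TURNOVERS': 3,
}])

# Multiply each stat column by its weight, then sum across columns per row
weight_series = pd.Series(weights)
demo['FANTASY_POINTS'] = demo[weight_series.index].mul(weight_series, axis=1).sum(axis=1)
```

Keeping the weights in one place makes it easier to try a different scoring system without rewriting the expression.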


# Wrap Up

In [16]:
player_data.dtypes

PLAYER_NAME                object
Team                       object
Class                      object
Height                      int64
Position                   object
PLAYER_ID                  object
TEAM_NAME                  object
GAMES                       int64
MINUTES_PLAYED              int64
FIELD_GOALS_MADE            int64
FIELD_GOAL_ATTEMPTS         int64
FIELD_GOAL_PERCENTAGE     float64
THREE_POINTS_MADE           int64
THREE_POINT_ATTEMPTS        int64
THREE_POINT_PERCENTAGE    float64
FREE_THROWS_MADE            int64
FREE_THROW_ATTEMPTS         int64
FREE_THROW_PERCENTAGE     float64
OFFENSIVE_REBOUNDS          int64
DEFENSIVE_REBOUNDS          int64
TOTAL_REBOUNDS              int64
ASSISTS                     int64
TURNOVERS                   int64
STEALS                      int64
BLOCKS                      int64
FOULS                       int64
POINTS                      int64
TWO_POINTS_MADE             int64
TWO_POINT_ATTEMPTS          int64
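TWO_POINT_PERCENTAGE      float64
Conference                 object
MINUTES_PER_GAME          float64
FOULS_PER_GAME            float64
POINTS_PER_GAME           float64
ASSISTS_PER_GAME          float64
STEALS_PER_GAME           float64
BLOCKS_PER_GAME           float64
REBOUNDS_PER_GAME         float64
ASSIST_TO_TURNOVER        float64
FANTASY_POINTS            float64
dtype: object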

In [17]:
player_data.to_excel(data_folder / 'player_data_engineered.xlsx', index=False)