# Cleaning NBA Box Score Data

In this project I import two datasets of NBA box scores, narrow them down to the 2017-2018 season, and limit them to the Portland Trail Blazers. I will take the following steps to make the data more manageable.

* Import all data sets and combine them
* Remove unneeded and duplicate features and find the features that are the most useful
* Fix Structural Errors
* Fill in missing data
* Normalize data


## 1. Import Data

The first step is to import both NBA box score datasets.

In [96]:
import pandas as pd
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Load datasets
player_box_score = '/Users/jordan.kasper/Desktop/ML_Projects/nba_box_score/playerBoxScore.csv'
team_box_score = '/Users/jordan.kasper/Desktop/ML_Projects/nba_box_score/teamBoxScore.csv'

player_dataset = read_csv(player_box_score)
team_dataset = read_csv(team_box_score)
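The introduction mentions narrowing the data to 2017-2018, but none of the cells shown here apply a date filter. In case the CSV files cover more than one season, a minimal sketch of such a filter could look like the following. The toy frame and the season boundaries (`SEASON_START`, `SEASON_END`) are assumptions for illustration, not values taken from the data.

```python
import pandas as pd

# Toy stand-in for player_dataset; only the columns needed for the filter.
toy = pd.DataFrame({
    "gmDate": ["2017-04-12", "2017-10-18", "2018-04-11", "2018-10-16"],
    "teamAbbr": ["POR", "POR", "POR", "POR"],
})

# Assumed season window (hypothetical boundaries; adjust to the real schedule).
SEASON_START = pd.Timestamp("2017-10-01")
SEASON_END = pd.Timestamp("2018-06-30")

# Parse the date strings so the comparison is on real dates, not text.
toy["gmDate"] = pd.to_datetime(toy["gmDate"])

# Keep only rows whose game date falls inside the window (inclusive).
in_season = toy["gmDate"].between(SEASON_START, SEASON_END)
season_df = toy.loc[in_season]
print(season_df)
```

The same mask could be applied to both `player_dataset` and `team_dataset` before merging.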

## 2. Reviewing the Data

After importing the data, we take a quick look at the raw data to see what we are dealing with.

In [42]:
# looking at the size of the data

print(player_dataset.shape)
print(team_dataset.shape)

(26109, 51)
(2460, 123)


In [13]:
# reviewing the column headers

print(player_dataset.columns.values)
print(team_dataset.columns.values)

['gmDate' 'gmTime' 'seasTyp' 'playLNm' 'playFNm' 'teamAbbr' 'teamConf'
 'teamDiv' 'teamLoc' 'teamRslt' 'teamDayOff' 'offLNm1' 'offFNm1' 'offLNm2'
 'offFNm2' 'offLNm3' 'offFNm3' 'playDispNm' 'playStat' 'playMin' 'playPos'
 'playHeight' 'playWeight' 'playBDate' 'playPTS' 'playAST' 'playTO'
 'playSTL' 'playBLK' 'playPF' 'playFGA' 'playFGM' 'playFG%' 'play2PA'
 'play2PM' 'play2P%' 'play3PA' 'play3PM' 'play3P%' 'playFTA' 'playFTM'
 'playFT%' 'playORB' 'playDRB' 'playTRB' 'opptAbbr' 'opptConf' 'opptDiv'
 'opptLoc' 'opptRslt' 'opptDayOff']
['gmDate' 'gmTime' 'seasTyp' 'offLNm1' 'offFNm1' 'offLNm2' 'offFNm2'
 'offLNm3' 'offFNm3' 'teamAbbr' 'teamConf' 'teamDiv' 'teamLoc' 'teamRslt'
 'teamMin' 'teamDayOff' 'teamPTS' 'teamAST' 'teamTO' 'teamSTL' 'teamBLK'
 'teamPF' 'teamFGA' 'teamFGM' 'teamFG%' 'team2PA' 'team2PM' 'team2P%'
 'team3PA' 'team3PM' 'team3P%' 'teamFTA' 'teamFTM' 'teamFT%' 'teamORB'
 'teamDRB' 'teamTRB' 'teamPTS1' 'teamPTS2' 'teamPTS3' 'teamPTS4'
 'teamPTS5' 'teamPTS6' 'teamPTS7' 'team

## 3. Merging the Data

In [14]:
# Merge datasets

df = pd.merge(player_dataset,team_dataset)
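Called without `on`, `pd.merge` joins on every column name the two frames share, which here includes the game, team, and official columns. A hedged sketch of a more explicit variant is shown below: it names the join keys and asks pandas to check that each player row matches at most one team row. The toy frames, their values, and the choice of `gmDate` plus `teamAbbr` as keys are assumptions for illustration.

```python
import pandas as pd

# Toy player-level rows (values are made up).
players = pd.DataFrame({
    "gmDate": ["2017-10-18", "2017-10-18", "2017-10-20"],
    "teamAbbr": ["POR", "POR", "POR"],
    "playLNm": ["Lillard", "Aminu", "Lillard"],
    "playPTS": [27, 5, 20],
})

# Toy team-level rows: one row per team per game (values are made up).
teams = pd.DataFrame({
    "gmDate": ["2017-10-18", "2017-10-20"],
    "teamAbbr": ["POR", "POR"],
    "teamPTS": [124, 114],
})

# Assumed join keys; validate raises MergeError if a key is duplicated in `teams`.
keys = ["gmDate", "teamAbbr"]
merged = pd.merge(players, teams, on=keys, how="inner", validate="many_to_one")
print(merged)
```

Naming the keys makes the join easier to reason about, and `validate` catches accidental row duplication early.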

In [15]:
print(df.columns.values)

['gmDate' 'gmTime' 'seasTyp' 'playLNm' 'playFNm' 'teamAbbr' 'teamConf'
 'teamDiv' 'teamLoc' 'teamRslt' 'teamDayOff' 'offLNm1' 'offFNm1' 'offLNm2'
 'offFNm2' 'offLNm3' 'offFNm3' 'playDispNm' 'playStat' 'playMin' 'playPos'
 'playHeight' 'playWeight' 'playBDate' 'playPTS' 'playAST' 'playTO'
 'playSTL' 'playBLK' 'playPF' 'playFGA' 'playFGM' 'playFG%' 'play2PA'
 'play2PM' 'play2P%' 'play3PA' 'play3PM' 'play3P%' 'playFTA' 'playFTM'
 'playFT%' 'playORB' 'playDRB' 'playTRB' 'opptAbbr' 'opptConf' 'opptDiv'
 'opptLoc' 'opptRslt' 'opptDayOff' 'teamMin' 'teamPTS' 'teamAST' 'teamTO'
 'teamSTL' 'teamBLK' 'teamPF' 'teamFGA' 'teamFGM' 'teamFG%' 'team2PA'
 'team2PM' 'team2P%' 'team3PA' 'team3PM' 'team3P%' 'teamFTA' 'teamFTM'
 'teamFT%' 'teamORB' 'teamDRB' 'teamTRB' 'teamPTS1' 'teamPTS2' 'teamPTS3'
 'teamPTS4' 'teamPTS5' 'teamPTS6' 'teamPTS7' 'teamPTS8' 'teamTREB%'
 'teamASST%' 'teamTS%' 'teamEFG%' 'teamOREB%' 'teamDREB%' 'teamTO%'
 'teamSTL%' 'teamBLK%' 'teamBLKR' 'teamPPS' 'teamFIC' 'teamFIC40'
 'team

## 3. Feature Selection

Here we remove any columns that we do not need to answer our question. We also remove any rows that do not belong to Portland's players.

In [17]:
# removing teams that are not Portland

por_df = df.loc[df['teamAbbr'] == 'POR']

In [18]:
# reviewing the new shape

print(por_df.shape)

(876, 153)


In [19]:
por_df.head()

Unnamed: 0,gmDate,gmTime,seasTyp,playLNm,playFNm,teamAbbr,teamConf,teamDiv,teamLoc,teamRslt,...,opptFIC40,opptOrtg,opptDrtg,opptEDiff,opptPlay%,opptAR,opptAST/TO,opptSTL/TO,poss,pace
227,2017-10-18,10:00,Regular,Lillard,Damian,POR,West,Northwest,Away,Win,...,24.6888,76.2205,124.3598,-48.1393,0.2947,8.1354,0.625,56.25,99.7107,98.8866
228,2017-10-18,10:00,Regular,Aminu,Al-Farouq,POR,West,Northwest,Away,Win,...,24.6888,76.2205,124.3598,-48.1393,0.2947,8.1354,0.625,56.25,99.7107,98.8866
229,2017-10-18,10:00,Regular,Harkless,Maurice,POR,West,Northwest,Away,Win,...,24.6888,76.2205,124.3598,-48.1393,0.2947,8.1354,0.625,56.25,99.7107,98.8866
230,2017-10-18,10:00,Regular,Turner,Evan,POR,West,Northwest,Away,Win,...,24.6888,76.2205,124.3598,-48.1393,0.2947,8.1354,0.625,56.25,99.7107,98.8866
231,2017-10-18,10:00,Regular,Nurkic,Jusuf,POR,West,Northwest,Away,Win,...,24.6888,76.2205,124.3598,-48.1393,0.2947,8.1354,0.625,56.25,99.7107,98.8866


Now that we have removed the non-Portland teams, we are going to drop the following columns, which are not useful features for Portland.

In [93]:
# columns that are not needed for the Portland analysis
columns_to_drop = [
    # game, player and team metadata
    'gmTime','playFNm','playDispNm','seasTyp','teamConf','teamDiv','teamLoc','teamDayOff',
    'offLNm1','offFNm1','offLNm2','offFNm2','offLNm3','offFNm3',
    'playHeight','playWeight','playBDate','playPos',
    # opponent columns
    'opptAbbr','opptDiv','opptConf','opptLoc','opptRslt','opptDayOff','opptMin',
    'opptPTS','opptAST','opptTO','opptSTL','opptBLK','opptPF',
    'opptFGA','opptFGM','opptFG%','oppt2PA','oppt2PM','oppt2P%',
    'oppt3PA','oppt3PM','oppt3P%','opptFTA','opptFTM','opptFT%',
    'opptORB','opptDRB','opptTRB','opptPTS1','opptPTS2','opptPTS3',
    'opptPTS4','opptPTS5','opptPTS6','opptPTS7','opptPTS8','opptTREB%',
    'opptASST%','opptTS%','opptEFG%','opptOREB%','opptDREB%','opptTO%',
    'opptSTL%','opptBLK%','opptBLKR','opptPPS','opptFIC','opptFIC40','opptOrtg',
    'opptDrtg','opptEDiff','opptPlay%','opptAR','opptAST/TO','opptSTL/TO',
    # game-level and per-period team columns
    'poss','pace','teamPTS1','teamPTS2','teamPTS3','teamPTS4','teamPTS5',
    'teamPTS6','teamPTS7','teamPTS8',
    'teamMin','teamPPS','teamFIC','teamFIC40','teamOrtg','teamDrtg','teamEDiff','teamPlay%',
]

port_df = por_df.drop(columns_to_drop, axis=1)
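Most of the dropped columns share a prefix (`oppt` for opponent statistics, `off` for officials), so the list could also be built from the column names instead of being typed out. The sketch below shows the idea on a toy frame. The prefixes are assumptions based on the column names printed earlier and should be checked against the real columns before use.

```python
import pandas as pd

# Toy frame with a few representative column names (values are made up).
toy = pd.DataFrame({
    "playPTS": [27, 5],
    "teamPTS": [124, 124],
    "opptPTS": [76, 76],
    "opptAST": [10, 10],
    "offLNm1": ["A", "A"],
})

# Assumed prefixes to drop; str.startswith accepts a tuple of prefixes.
drop_prefixes = ("oppt", "off")

# Keep every column whose name does not start with one of the prefixes.
keep = [c for c in toy.columns if not c.startswith(drop_prefixes)]
trimmed = toy[keep]
print(trimmed.columns.tolist())
```

Columns that do not follow a prefix pattern, such as `poss` or `pace`, would still need to be listed by name.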

In [94]:
# reviewing the change in shape and the columns that are left

print(port_df.shape)
print(port_df.columns.values)
port_df.head()

(876, 61)
['gmDate' 'playLNm' 'teamAbbr' 'teamRslt' 'playStat' 'playMin' 'playPTS'
 'playAST' 'playTO' 'playSTL' 'playBLK' 'playPF' 'playFGA' 'playFGM'
 'playFG%' 'play2PA' 'play2PM' 'play2P%' 'play3PA' 'play3PM' 'play3P%'
 'playFTA' 'playFTM' 'playFT%' 'playORB' 'playDRB' 'playTRB' 'teamPTS'
 'teamAST' 'teamTO' 'teamSTL' 'teamBLK' 'teamPF' 'teamFGA' 'teamFGM'
 'teamFG%' 'team2PA' 'team2PM' 'team2P%' 'team3PA' 'team3PM' 'team3P%'
 'teamFTA' 'teamFTM' 'teamFT%' 'teamORB' 'teamDRB' 'teamTRB' 'teamTREB%'
 'teamASST%' 'teamTS%' 'teamEFG%' 'teamOREB%' 'teamDREB%' 'teamTO%'
 'teamSTL%' 'teamBLK%' 'teamBLKR' 'teamAR' 'teamAST/TO' 'teamSTL/TO']


Unnamed: 0,gmDate,playLNm,teamAbbr,teamRslt,playStat,playMin,playPTS,playAST,playTO,playSTL,...,teamEFG%,teamOREB%,teamDREB%,teamTO%,teamSTL%,teamBLK%,teamBLKR,teamAR,teamAST/TO,teamSTL/TO
227,2017-10-18,Lillard,POR,Win,Starter,30,27,5,0,1,...,0.5667,39.4737,80.7692,14.9601,8.0232,7.0203,10.6061,15.4581,1.2222,44.4444
228,2017-10-18,Aminu,POR,Win,Starter,28,5,2,1,0,...,0.5667,39.4737,80.7692,14.9601,8.0232,7.0203,10.6061,15.4581,1.2222,44.4444
229,2017-10-18,Harkless,POR,Win,Starter,26,8,2,2,0,...,0.5667,39.4737,80.7692,14.9601,8.0232,7.0203,10.6061,15.4581,1.2222,44.4444
230,2017-10-18,Turner,POR,Win,Starter,25,12,3,2,1,...,0.5667,39.4737,80.7692,14.9601,8.0232,7.0203,10.6061,15.4581,1.2222,44.4444
231,2017-10-18,Nurkic,POR,Win,Starter,23,11,1,5,1,...,0.5667,39.4737,80.7692,14.9601,8.0232,7.0203,10.6061,15.4581,1.2222,44.4444


## 4. Checking for Null Data

In [95]:
# checking what rows have null info

null_data = port_df[port_df.isnull().any(axis=1)]

In [96]:
print(null_data)

Empty DataFrame
Columns: [gmDate, playLNm, teamAbbr, teamRslt, playStat, playMin, playPTS, playAST, playTO, playSTL, playBLK, playPF, playFGA, playFGM, playFG%, play2PA, play2PM, play2P%, play3PA, play3PM, play3P%, playFTA, playFTM, playFT%, playORB, playDRB, playTRB, teamPTS, teamAST, teamTO, teamSTL, teamBLK, teamPF, teamFGA, teamFGM, teamFG%, team2PA, team2PM, team2P%, team3PA, team3PM, team3P%, teamFTA, teamFTM, teamFT%, teamORB, teamDRB, teamTRB, teamTREB%, teamASST%, teamTS%, teamEFG%, teamOREB%, teamDREB%, teamTO%, teamSTL%, teamBLK%, teamBLKR, teamAR, teamAST/TO, teamSTL/TO]
Index: []

[0 rows x 61 columns]
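Printing the rows that contain nulls works when there are few of them. A per-column count is often easier to read when there are many. The following is a small sketch on a toy frame with made-up values, not on the actual `port_df`.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing values in the same row.
toy = pd.DataFrame({
    "playPTS": [27, np.nan, 8],
    "playAST": [5, 2, 2],
    "playStat": ["Starter", None, "Bench"],
})

# Number of missing values in each column.
null_counts = toy.isnull().sum()

# Only the columns that actually have missing values.
cols_with_nulls = null_counts[null_counts > 0]
print(cols_with_nulls)

# Number of rows with at least one missing value.
n_rows_with_nulls = int(toy.isnull().any(axis=1).sum())
print(n_rows_with_nulls)
```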


## 5. Verifying Data

Now that the features are selected and we have confirmed there is no missing data, we review the data to make sure it looks accurate.

In [97]:
# Counting how many players appeared in each game

print(port_df.groupby('gmDate').size())

gmDate
2017-10-18    11
2017-10-20    12
2017-10-21     9
2017-10-24    11
2017-10-26    10
2017-10-28     9
2017-10-30    10
2017-11-01     9
2017-11-02    10
2017-11-05     9
2017-11-07     9
2017-11-10    10
2017-11-13    12
2017-11-15    10
2017-11-17    12
2017-11-18    13
2017-11-20    10
2017-11-22    13
2017-11-24    10
2017-11-25    10
2017-11-27    10
2017-11-30    13
2017-12-02    11
2017-12-05    12
2017-12-09    10
2017-12-11    11
2017-12-13    11
2017-12-15    10
2017-12-16    10
2017-12-18     9
              ..
2018-02-04     9
2018-02-05    13
2018-02-08    11
2018-02-09    13
2018-02-11    12
2018-02-14    10
2018-02-23    10
2018-02-24    10
2018-02-27    13
2018-03-01    10
2018-03-03    10
2018-03-05    10
2018-03-06    12
2018-03-09    12
2018-03-12    11
2018-03-15    10
2018-03-17    11
2018-03-18    13
2018-03-20    10
2018-03-23     9
2018-03-25    10
2018-03-27    10
2018-03-28     9
2018-03-30    12
2018-04-01    12
2018-04-03    10
2018-04-05    12
2018-04
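Beyond eyeballing the per-game counts, a couple of quick checks can confirm the data is plausible. An NBA regular season has 82 games per team, so the number of distinct game dates can be compared against that. The sketch below runs the same checks on a toy frame. It assumes that `gmDate` plus `playLNm` identifies a player-game row, which would not hold if two teammates share a last name.

```python
import pandas as pd

# Toy stand-in for port_df with two games (values are made up).
toy = pd.DataFrame({
    "gmDate": ["2017-10-18"] * 3 + ["2017-10-20"] * 2,
    "playLNm": ["Lillard", "Aminu", "Turner", "Lillard", "Aminu"],
})

# Players per game, as in the cell above.
players_per_game = toy.groupby("gmDate").size()

# Number of distinct game dates; for a full season this would be compared to 82.
n_games = toy["gmDate"].nunique()

# Rows that repeat the same (date, last name) pair, which would hint at duplicates.
n_duplicates = int(toy.duplicated(subset=["gmDate", "playLNm"]).sum())

print(n_games, n_duplicates)
print(players_per_game.describe())
```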

In [98]:
port_df.loc[port_df['gmDate'] == '2017-10-18']

Unnamed: 0,gmDate,playLNm,teamAbbr,teamRslt,playStat,playMin,playPTS,playAST,playTO,playSTL,...,teamEFG%,teamOREB%,teamDREB%,teamTO%,teamSTL%,teamBLK%,teamBLKR,teamAR,teamAST/TO,teamSTL/TO
227,2017-10-18,Lillard,POR,Win,Starter,30,27,5,0,1,...,0.5667,39.4737,80.7692,14.9601,8.0232,7.0203,10.6061,15.4581,1.2222,44.4444
228,2017-10-18,Aminu,POR,Win,Starter,28,5,2,1,0,...,0.5667,39.4737,80.7692,14.9601,8.0232,7.0203,10.6061,15.4581,1.2222,44.4444
229,2017-10-18,Harkless,POR,Win,Starter,26,8,2,2,0,...,0.5667,39.4737,80.7692,14.9601,8.0232,7.0203,10.6061,15.4581,1.2222,44.4444
230,2017-10-18,Turner,POR,Win,Starter,25,12,3,2,1,...,0.5667,39.4737,80.7692,14.9601,8.0232,7.0203,10.6061,15.4581,1.2222,44.4444
231,2017-10-18,Nurkic,POR,Win,Starter,23,11,1,5,1,...,0.5667,39.4737,80.7692,14.9601,8.0232,7.0203,10.6061,15.4581,1.2222,44.4444
232,2017-10-18,Connaughton,POR,Win,Bench,32,24,2,1,0,...,0.5667,39.4737,80.7692,14.9601,8.0232,7.0203,10.6061,15.4581,1.2222,44.4444
233,2017-10-18,Napier,POR,Win,Bench,23,10,3,1,2,...,0.5667,39.4737,80.7692,14.9601,8.0232,7.0203,10.6061,15.4581,1.2222,44.4444
234,2017-10-18,Swanigan,POR,Win,Bench,18,8,3,2,1,...,0.5667,39.4737,80.7692,14.9601,8.0232,7.0203,10.6061,15.4581,1.2222,44.4444
235,2017-10-18,Davis,POR,Win,Bench,14,10,0,1,0,...,0.5667,39.4737,80.7692,14.9601,8.0232,7.0203,10.6061,15.4581,1.2222,44.4444
236,2017-10-18,Layman,POR,Win,Bench,12,3,1,0,2,...,0.5667,39.4737,80.7692,14.9601,8.0232,7.0203,10.6061,15.4581,1.2222,44.4444


The box score above looks good. Now that the data has been cleaned up, we will encode the categorical columns and look at the correlations between features to get a better understanding of the data.

In [137]:
# One-hot encode the two categorical columns, then drop the originals
# along with the identifier columns that are not model features.

def create_dummies(df, column_name):
    # dtype=int keeps the indicator columns as 0/1 (newer pandas versions default to bool)
    dummies = pd.get_dummies(df[column_name], prefix=column_name, dtype=int)
    return pd.concat([df, dummies], axis=1)

newDf = create_dummies(port_df, "teamRslt")
newDf = create_dummies(newDf, "playStat")

newDf = newDf.drop(['playStat', 'teamRslt', 'teamAbbr', 'playLNm', 'gmDate'], axis=1)

newDf.head()

Unnamed: 0,playMin,playPTS,playAST,playTO,playSTL,playBLK,playPF,playFGA,playFGM,playFG%,...,teamSTL%,teamBLK%,teamBLKR,teamAR,teamAST/TO,teamSTL/TO,teamRslt_Loss,teamRslt_Win,playStat_Bench,playStat_Starter
227,30,27,5,0,1,3,1,20,10,0.5,...,8.0232,7.0203,10.6061,15.4581,1.2222,44.4444,0,1,0,1
228,28,5,2,1,0,1,1,4,2,0.5,...,8.0232,7.0203,10.6061,15.4581,1.2222,44.4444,0,1,0,1
229,26,8,2,2,0,1,3,9,3,0.3333,...,8.0232,7.0203,10.6061,15.4581,1.2222,44.4444,0,1,0,1
230,25,12,3,2,1,1,2,8,4,0.5,...,8.0232,7.0203,10.6061,15.4581,1.2222,44.4444,0,1,0,1
231,23,11,1,5,1,0,3,7,2,0.2857,...,8.0232,7.0203,10.6061,15.4581,1.2222,44.4444,0,1,0,1


In [122]:
newDf.corr()

Unnamed: 0,playMin,playPTS,playAST,playTO,playSTL,playBLK,playPF,playFGA,playFGM,playFG%,...,teamSTL%,teamBLK%,teamBLKR,teamAR,teamAST/TO,teamSTL/TO,teamRslt_Loss,teamRslt_Win,playStat_Bench,playStat_Starter
playMin,1.0,0.756291,0.602081,0.463895,0.378479,0.215182,0.353771,0.820549,0.742036,0.239384,...,0.006163,0.067055,0.056622,-0.038609,0.037373,0.049971,0.002237,-0.002237,-0.707605,0.707605
playPTS,0.756291,1.0,0.633087,0.433102,0.323488,0.132646,0.191162,0.898935,0.96228,0.359209,...,-0.012892,0.031049,0.027792,0.036726,0.06937,0.031649,-0.051355,0.051355,-0.591247,0.591247
playAST,0.602081,0.633087,1.0,0.377636,0.24539,0.033916,0.073096,0.644854,0.581795,0.106207,...,0.004408,0.044176,0.042859,0.153472,0.106957,0.024026,-0.051134,0.051134,-0.43449,0.43449
playTO,0.463895,0.433102,0.377636,1.0,0.207745,0.138974,0.240896,0.455856,0.408775,0.095459,...,-0.009186,0.013197,0.041288,-0.036568,-0.190518,-0.147031,0.007609,-0.007609,-0.381039,0.381039
playSTL,0.378479,0.323488,0.24539,0.207745,1.0,0.04697,0.150276,0.313861,0.313621,0.106901,...,0.267288,-0.008254,-0.021384,-0.016688,0.005479,0.211757,0.005934,-0.005934,-0.29094,0.29094
playBLK,0.215182,0.132646,0.033916,0.138974,0.04697,1.0,0.231944,0.147795,0.181069,0.154094,...,-0.034747,0.304159,0.287615,0.012099,0.05182,0.010892,-0.045156,0.045156,-0.215098,0.215098
playPF,0.353771,0.191162,0.073096,0.240896,0.150276,0.231944,1.0,0.206036,0.225158,0.151368,...,0.000691,0.061372,0.059988,-0.043719,-0.02696,0.000427,0.034469,-0.034469,-0.242199,0.242199
playFGA,0.820549,0.898935,0.644854,0.455856,0.313861,0.147795,0.206036,1.0,0.8967,0.139616,...,0.022232,0.051613,0.027331,-0.022759,0.086881,0.092283,-0.002098,0.002098,-0.649475,0.649475
playFGM,0.742036,0.96228,0.581795,0.408775,0.313621,0.181069,0.225158,0.8967,1.0,0.420293,...,-0.002262,0.039336,0.027453,0.047743,0.089726,0.051957,-0.040324,0.040324,-0.590602,0.590602
playFG%,0.239384,0.359209,0.106207,0.095459,0.106901,0.154094,0.151368,0.139616,0.420293,1.0,...,-0.025483,0.020178,0.032341,0.078427,0.008593,-0.044579,-0.049088,0.049088,-0.12116,0.12116


In [123]:
# absolute pairwise correlations, flattened to (feature, feature) pairs
c = newDf.corr().abs()
s = c.unstack()

# keep only the weak-to-moderate correlations (between 0.07 and 0.3), sorted ascending
newSo = s[(s > .07) & (s < .3)].sort_values(kind='quicksort')
print(newSo)

teamFG%         teamDRB           0.070005
teamDRB         teamFG%           0.070005
play3P%         teamSTL/TO        0.070252
teamSTL/TO      play3P%           0.070252
teamBLK         playMin           0.070665
playMin         teamBLK           0.070665
team2P%         teamASST%         0.070806
teamASST%       team2P%           0.070806
teamFT%         playFT%           0.070919
playFT%         teamFT%           0.070919
teamTS%         playTRB           0.071327
playTRB         teamTS%           0.071327
teamTO%         playMin           0.071385
playMin         teamTO%           0.071385
teamOREB%       teamBLK           0.071586
teamBLK         teamOREB%         0.071586
playORB         teamSTL/TO        0.071962
teamSTL/TO      playORB           0.071962
teamBLKR        playDRB           0.072002
playDRB         teamBLKR          0.072002
teamPTS         playFTM           0.072103
playFTM         teamPTS           0.072103
play2PM         team2P%           0.072199
team2P%    
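Because a correlation matrix is symmetric, every pair appears twice in the unstacked output (for example `teamFG%`/`teamDRB` and `teamDRB`/`teamFG%`). One way to list each pair once is to keep only the upper triangle of the matrix before flattening it. This is a sketch on a small toy frame with made-up values.

```python
import numpy as np
import pandas as pd

# Toy numeric frame (values are made up).
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 2],
})

corr = toy.corr().abs()

# Boolean mask that is True strictly above the diagonal (k=1 skips the diagonal).
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Mask out the lower triangle and diagonal, flatten, and drop the masked entries.
pairs = corr.where(upper).stack().dropna().sort_values()
print(pairs)
```

The same range filter used above (between 0.07 and 0.3) can then be applied to `pairs`.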

## 6. Evaluating Algorithms

### 6.1 Separate out a validation dataset

In [138]:
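# Split-out validation dataset
from sklearn.model_selection import train_test_split

array = newDf.values
# Features: every column except the last. Target: the last column (playStat_Starter).
# Note that X still contains playStat_Bench, the exact complement of the target.
X = array[:,:-1]
Y = array[:,-1]
validation_size = 0.20
seed = 7

X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=seed)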
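The outline lists normalization as a step, but no cell applies it. Distance-based models such as KNN and SVM are generally sensitive to feature scale, so standardizing may be worth trying. The following is a minimal sketch of z-score scaling with NumPy on randomly generated toy arrays. The statistics are computed on the training split only and then reused for the validation split, so nothing from the validation data leaks into the scaling.

```python
import numpy as np

# Toy arrays standing in for X_train and X_validation (random, two features on different scales).
rng = np.random.default_rng(7)
X_train_toy = rng.normal(loc=[10.0, 0.5], scale=[5.0, 0.1], size=(80, 2))
X_valid_toy = rng.normal(loc=[10.0, 0.5], scale=[5.0, 0.1], size=(20, 2))

# Per-feature mean and standard deviation from the training split only.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)
std[std == 0] = 1.0  # avoid division by zero for constant columns

# Apply the same transformation to both splits.
X_train_scaled = (X_train_toy - mean) / std
X_valid_scaled = (X_valid_toy - mean) / std

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

scikit-learn's `StandardScaler` does the same thing and can be combined with a model in a `Pipeline`, so the scaling is refit inside each cross-validation fold.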

In [139]:
# Spot-Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

# evaluate each model in turn
results = []
names = []

# random_state only matters when shuffle=True, and recent scikit-learn
# versions raise an error if it is set without shuffling, so it is omitted here
kfold = KFold(n_splits=10)

for name, model in models:
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)

    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

LR: 1.000000 (0.000000)
LDA: 0.824286 (0.060626)
KNN: 0.687143 (0.066225)
CART: 1.000000 (0.000000)
NB: 1.000000 (0.000000)
SVM: 0.671429 (0.063888)
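A perfect cross-validation score for three of the six models is usually a sign of target leakage rather than of a very good model. In the split above, the target is the last column, `playStat_Starter`, while the features still include `playStat_Bench`, which is its exact complement. This likely explains the 1.0 scores. The sketch below selects the features and target by name and leaves the complementary column out. The toy frame and the `leaky` list are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for newDf with the two playStat indicator columns (values are made up).
toy = pd.DataFrame({
    "playMin": [30, 28, 32, 12],
    "playPTS": [27, 5, 24, 3],
    "playStat_Bench": [0, 0, 1, 1],
    "playStat_Starter": [1, 1, 0, 0],
})

target = "playStat_Starter"
leaky = ["playStat_Bench"]  # complement of the target, so it reveals the answer

# The two indicators always sum to 1, which is what makes one of them leaky.
assert ((toy["playStat_Bench"] + toy[target]) == 1).all()

# Features: everything except the target and the leaky column.
X = toy.drop(columns=[target] + leaky)
y = toy[target]

print(X.columns.tolist())
```

Selecting columns by name also makes it explicit which question the model is answering. If the goal were to predict the game result instead, `teamRslt_Win` would be the target and `teamRslt_Loss` the column to leave out.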
