# Exploratory Data Analysis of NBA Shot Data

## Step 3: Data Cleaning (Continued)

This is a continuation of Part I. We pick up from the dataset that was saved at the end of that part.

In [4]:
import pandas as pd

import os

parent_path = os.path.dirname(os.path.dirname(os.getcwd()))

replace_double_slash = parent_path.replace('\\', '/')

data_path = replace_double_slash + '/data/shot_logs_clean_1.csv'

nba_data_clean_1 = pd.read_csv(data_path)
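The path above is built by string replacement and concatenation. As a rough alternative sketch, `pathlib` can assemble the same kind of path without handling slashes by hand. The helper name `data_file` and the example directory are hypothetical and only serve the illustration; the sketch assumes the same layout as above, with the `data` folder two levels above the working directory.

```python
from pathlib import Path

def data_file(working_dir, file_name):
    """Build <two levels above working_dir>/data/<file_name>."""
    # .parent.parent mirrors the two os.path.dirname calls above
    project_root = Path(working_dir).parent.parent
    return project_root / 'data' / file_name

# Hypothetical working directory, used only for illustration
example_path = data_file('/home/user/project/notebooks/eda', 'shot_logs_clean_1.csv')
print(example_path.as_posix())
```

In the notebook itself, `Path.cwd()` would take the place of the hard-coded directory, and the resulting path can be passed straight to `pd.read_csv`.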

In [5]:
nba_data_clean_1.head(5)

Unnamed: 0,Game Id,Location,W,Final Margin,Shot Number,Period,Game Clock,Shot Clock,Dribbles,Touch Time,...,Closest Defender,Closest Defender Player Id,Close Def Dist,Fgm,Pts,Player Name,Player Id,Date,Home Team,Away Team
0,21400899,A,W,24,1,1,1:09,10.8,2,1.9,...,"Anderson, Alan",101187,1.3,1,2,brian roberts,203148,2015-03-04,BKN,CHA
1,21400899,A,W,24,2,1,0:14,3.4,0,0.8,...,"Bogdanovic, Bojan",202711,6.1,0,0,brian roberts,203148,2015-03-04,BKN,CHA
2,21400899,A,W,24,3,1,0:00,,3,2.7,...,"Bogdanovic, Bojan",202711,0.9,0,0,brian roberts,203148,2015-03-04,BKN,CHA
3,21400899,A,W,24,4,2,11:47,10.3,2,1.9,...,"Brown, Markel",203900,3.4,0,0,brian roberts,203148,2015-03-04,BKN,CHA
4,21400899,A,W,24,5,2,10:34,10.9,2,2.7,...,"Young, Thaddeus",201152,1.1,0,0,brian roberts,203148,2015-03-04,BKN,CHA


In [6]:
nba_data_clean_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128069 entries, 0 to 128068
Data columns (total 23 columns):
Game Id                       128069 non-null int64
Location                      128069 non-null object
W                             128069 non-null object
Final Margin                  128069 non-null int64
Shot Number                   128069 non-null int64
Period                        128069 non-null int64
Game Clock                    128069 non-null object
Shot Clock                    122502 non-null float64
Dribbles                      128069 non-null int64
Touch Time                    128069 non-null float64
Shot Dist                     128069 non-null float64
Pts Type                      128069 non-null int64
Shot Result                   128069 non-null object
Closest Defender              128069 non-null object
Closest Defender Player Id    128069 non-null int64
Close Def Dist                128069 non-null float64
Fgm                           128069 non-null int64
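The `info()` summary reports 122502 non-null values for Shot Clock against 128069 rows, which leaves 5567 missing entries. A minimal sketch of how such a count could be checked directly is shown here on a small made-up frame; `toy` and its values are hypothetical stand-ins, not rows taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for nba_data_clean_1 with one incomplete column
toy = pd.DataFrame({
    'Shot Clock': [10.8, 3.4, np.nan, 10.3, 10.9],
    'Dribbles': [2, 0, 3, 2, 2],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# The same arithmetic applied to the figures reported by info()
missing_shot_clock = 128069 - 122502
print(missing_shot_clock)
```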

Let's move the new columns that we created (Date, Home Team and Away Team) from the right side of the table to the left so it is easier to read. We can do this by storing each series in a variable, dropping the columns from the original dataset, then re-inserting them at specific positions.

In [7]:
date_col = nba_data_clean_1['Date']

home_team_col = nba_data_clean_1['Home Team']

away_team_col = nba_data_clean_1['Away Team']

nba_data_clean_1.drop(labels = ['Date', 'Home Team', 'Away Team'], inplace=True, axis = 1)

nba_data_clean_1.insert(1, 'Date', date_col)

nba_data_clean_1.insert(2, 'Home Team', home_team_col)

nba_data_clean_1.insert(3, 'Away Team', away_team_col)

nba_data_clean_1.head(5)

Unnamed: 0,Game Id,Date,Home Team,Away Team,Location,W,Final Margin,Shot Number,Period,Game Clock,...,Shot Dist,Pts Type,Shot Result,Closest Defender,Closest Defender Player Id,Close Def Dist,Fgm,Pts,Player Name,Player Id
0,21400899,2015-03-04,BKN,CHA,A,W,24,1,1,1:09,...,7.7,2,made,"Anderson, Alan",101187,1.3,1,2,brian roberts,203148
1,21400899,2015-03-04,BKN,CHA,A,W,24,2,1,0:14,...,28.2,3,missed,"Bogdanovic, Bojan",202711,6.1,0,0,brian roberts,203148
2,21400899,2015-03-04,BKN,CHA,A,W,24,3,1,0:00,...,10.1,2,missed,"Bogdanovic, Bojan",202711,0.9,0,0,brian roberts,203148
3,21400899,2015-03-04,BKN,CHA,A,W,24,4,2,11:47,...,17.2,2,missed,"Brown, Markel",203900,3.4,0,0,brian roberts,203148
4,21400899,2015-03-04,BKN,CHA,A,W,24,5,2,10:34,...,3.7,2,missed,"Young, Thaddeus",201152,1.1,0,0,brian roberts,203148
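The same reordering could also be written by building the desired column order as a list and indexing the frame with it. This is only a sketch on a hypothetical one-row frame (`toy`, `moved`, `remaining` and `new_order` are illustrative names), not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical miniature frame with the new columns on the right
toy = pd.DataFrame({
    'Game Id': [21400899],
    'Location': ['A'],
    'W': ['W'],
    'Date': ['2015-03-04'],
    'Home Team': ['BKN'],
    'Away Team': ['CHA'],
})

# Columns to move, in the order they should appear
moved = ['Date', 'Home Team', 'Away Team']

# Everything else keeps its original relative order
remaining = [col for col in toy.columns if col not in moved]

# Keep the first column in place, then the moved columns, then the rest
new_order = remaining[:1] + moved + remaining[1:]
toy = toy[new_order]
print(list(toy.columns))
```

Indexing with a list returns a new frame rather than modifying the original in place, which is why the result is assigned back to `toy`.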


To make it easier to see all the columns (we will have to scroll a little), we're going to change the pandas display option settings.

In [8]:
pd.set_option('display.max_columns', 30)

In [9]:
nba_data_clean_1.head(5)

Unnamed: 0,Game Id,Date,Home Team,Away Team,Location,W,Final Margin,Shot Number,Period,Game Clock,Shot Clock,Dribbles,Touch Time,Shot Dist,Pts Type,Shot Result,Closest Defender,Closest Defender Player Id,Close Def Dist,Fgm,Pts,Player Name,Player Id
0,21400899,2015-03-04,BKN,CHA,A,W,24,1,1,1:09,10.8,2,1.9,7.7,2,made,"Anderson, Alan",101187,1.3,1,2,brian roberts,203148
1,21400899,2015-03-04,BKN,CHA,A,W,24,2,1,0:14,3.4,0,0.8,28.2,3,missed,"Bogdanovic, Bojan",202711,6.1,0,0,brian roberts,203148
2,21400899,2015-03-04,BKN,CHA,A,W,24,3,1,0:00,,3,2.7,10.1,2,missed,"Bogdanovic, Bojan",202711,0.9,0,0,brian roberts,203148
3,21400899,2015-03-04,BKN,CHA,A,W,24,4,2,11:47,10.3,2,1.9,17.2,2,missed,"Brown, Markel",203900,3.4,0,0,brian roberts,203148
4,21400899,2015-03-04,BKN,CHA,A,W,24,5,2,10:34,10.9,2,2.7,3.7,2,missed,"Young, Thaddeus",201152,1.1,0,0,brian roberts,203148
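`pd.set_option` changes the display setting for the rest of the session. If the wider display is only wanted for a single printout, `pd.option_context` can apply it temporarily. The sketch below uses a hypothetical 25-column frame named `wide` to show that the setting is restored once the block ends.

```python
import pandas as pd

# Hypothetical wide frame, used only to have something to display
wide = pd.DataFrame([range(25)], columns=[f'col_{i}' for i in range(25)])

# Remember the setting that is active before the temporary change
setting_before = pd.get_option('display.max_columns')

with pd.option_context('display.max_columns', 30):
    # Inside the block up to 30 columns are shown
    setting_inside = pd.get_option('display.max_columns')
    print(wide.head())

# Outside the block the previous setting is active again
setting_after = pd.get_option('display.max_columns')
```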


Right now the **Location** column doesn't make it clear which team the **W** column is talking about. You would have to check the Away Team or Home Team column, then look at the location, then the win indicator to work out which team is meant. We can fix that by replacing Location with a Winning Team column that states directly which team won.

In [10]:
# Return the winning team for a single row (meant to be used with DataFrame.apply and axis=1)

# Scenarios: if the combo is (A, W) or (H, L), return the Away Team. Otherwise return the Home Team

def winning_team(row):
    if (row['Location'] == 'A' and row['W'] == 'W') or (row['Location'] == 'H' and row['W'] == 'L'):
        return row['Away Team']
    else:
        return row['Home Team']

In [11]:
winner = nba_data_clean_1.apply(winning_team, axis = 1)

nba_data_clean_1['Location'] = winner

nba_data_clean_1.rename(columns={'Location': 'Winning Team'}, inplace=True)

nba_data_clean_1.head(5)

Unnamed: 0,Game Id,Date,Home Team,Away Team,Winning Team,W,Final Margin,Shot Number,Period,Game Clock,Shot Clock,Dribbles,Touch Time,Shot Dist,Pts Type,Shot Result,Closest Defender,Closest Defender Player Id,Close Def Dist,Fgm,Pts,Player Name,Player Id
0,21400899,2015-03-04,BKN,CHA,CHA,W,24,1,1,1:09,10.8,2,1.9,7.7,2,made,"Anderson, Alan",101187,1.3,1,2,brian roberts,203148
1,21400899,2015-03-04,BKN,CHA,CHA,W,24,2,1,0:14,3.4,0,0.8,28.2,3,missed,"Bogdanovic, Bojan",202711,6.1,0,0,brian roberts,203148
2,21400899,2015-03-04,BKN,CHA,CHA,W,24,3,1,0:00,,3,2.7,10.1,2,missed,"Bogdanovic, Bojan",202711,0.9,0,0,brian roberts,203148
3,21400899,2015-03-04,BKN,CHA,CHA,W,24,4,2,11:47,10.3,2,1.9,17.2,2,missed,"Brown, Markel",203900,3.4,0,0,brian roberts,203148
4,21400899,2015-03-04,BKN,CHA,CHA,W,24,5,2,10:34,10.9,2,2.7,3.7,2,missed,"Young, Thaddeus",201152,1.1,0,0,brian roberts,203148
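Applying a Python function row by row can be slow on a frame with more than 128,000 rows. As a sketch of a vectorized alternative, the same rule can be expressed with boolean masks and `numpy.where`. The frame `toy` is a hypothetical four-row example covering each Location/W combination, and the Losing Team column is added purely for illustration; the notebook does not create it.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one row per (Location, W) combination
toy = pd.DataFrame({
    'Location': ['A', 'A', 'H', 'H'],
    'W': ['W', 'L', 'W', 'L'],
    'Home Team': ['BKN', 'BKN', 'BKN', 'BKN'],
    'Away Team': ['CHA', 'CHA', 'CHA', 'CHA'],
})

# True where the away team won: (A, W) or (H, L)
away_won = (
    ((toy['Location'] == 'A') & (toy['W'] == 'W'))
    | ((toy['Location'] == 'H') & (toy['W'] == 'L'))
)

# Pick the team column element-wise based on the mask
toy['Winning Team'] = np.where(away_won, toy['Away Team'], toy['Home Team'])
toy['Losing Team'] = np.where(away_won, toy['Home Team'], toy['Away Team'])
print(toy)
```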


Now the first few columns make sense. For example, if I wanted to see the data for games that the Charlotte Hornets won, I can do the following:

In [12]:
CHA_Win = nba_data_clean_1[nba_data_clean_1['Winning Team'] == 'CHA']

CHA_Win.head()

Unnamed: 0,Game Id,Date,Home Team,Away Team,Winning Team,W,Final Margin,Shot Number,Period,Game Clock,Shot Clock,Dribbles,Touch Time,Shot Dist,Pts Type,Shot Result,Closest Defender,Closest Defender Player Id,Close Def Dist,Fgm,Pts,Player Name,Player Id
0,21400899,2015-03-04,BKN,CHA,CHA,W,24,1,1,1:09,10.8,2,1.9,7.7,2,made,"Anderson, Alan",101187,1.3,1,2,brian roberts,203148
1,21400899,2015-03-04,BKN,CHA,CHA,W,24,2,1,0:14,3.4,0,0.8,28.2,3,missed,"Bogdanovic, Bojan",202711,6.1,0,0,brian roberts,203148
2,21400899,2015-03-04,BKN,CHA,CHA,W,24,3,1,0:00,,3,2.7,10.1,2,missed,"Bogdanovic, Bojan",202711,0.9,0,0,brian roberts,203148
3,21400899,2015-03-04,BKN,CHA,CHA,W,24,4,2,11:47,10.3,2,1.9,17.2,2,missed,"Brown, Markel",203900,3.4,0,0,brian roberts,203148
4,21400899,2015-03-04,BKN,CHA,CHA,W,24,5,2,10:34,10.9,2,2.7,3.7,2,missed,"Young, Thaddeus",201152,1.1,0,0,brian roberts,203148


The next steps are more formatting changes to other columns, removing unnecessary data columns, and dealing with missing data in the Shot Clock column.

In [13]:
nba_clean_2 = nba_data_clean_1

save_path = replace_double_slash + '/data/shot_logs_clean_2.csv'

nba_clean_2.to_csv(save_path, index=False)