# 02 Data Wrangling Introduction

## 2.1 Contents

    2.2 Introduction
    2.3 Imports
    2.4 Objectives
    2.5 Load the NBA Data
    2.6 Clean the Data
    2.7 Merging Data Sets
    2.8 Saving the Data

## 2.2 Introduction To The Notebook

The goal of this notebook is to organize the data sets that were scraped from several open-source websites and to make sure the data is well defined, so that later analysis can proceed with minimal mistakes. The full EDA and cleaning will happen in Notebook 03; some cleaning is done here to organize the data for that step.

## 2.3 Imports

In [57]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

## 2.4 Objectives

This notebook aims to answer the following questions:

 - What kind of cleaning steps did you perform?
 - How did you deal with missing values, if there were any?
 - Do you think you may have the data you need to tackle the desired question?
 - Have you identified the required target value?
 - Do you have potentially useful features?
 - Do you have any fundamental issues with the data?

Ultimately, we want a dataset that is clean and ready for EDA in Notebook 03.

## 2.5 Load the NBA Data

In [58]:
stats = pd.read_csv('NBA_Stats1980.csv')

In [59]:
stats.head()

Unnamed: 0.1,Unnamed: 0,Player,Year,Ht,Wt,Colleges,Pos,Age,Tm,Team,...,Pts Won,Pts Max,Share,W,L,W/L%,GB,PS/G,PA/G,SRS
0,0,Alaa Abdelnaby,1991.0,6-10,240.0,Duke,PF,22,POR,Portland Trail Blazers,...,0.0,0.0,0.0,63,19,0.768,0.0,114.7,106.0,8.47
1,1,Danny Ainge,1991.0,6-4,175.0,BYU,SG,31,POR,Portland Trail Blazers,...,0.0,0.0,0.0,63,19,0.768,0.0,114.7,106.0,8.47
2,2,Mark Bryant,1991.0,6-9,245.0,Seton Hall,PF,25,POR,Portland Trail Blazers,...,0.0,0.0,0.0,63,19,0.768,0.0,114.7,106.0,8.47
3,3,Wayne Cooper,1991.0,6-10,220.0,New Orleans,C,34,POR,Portland Trail Blazers,...,0.0,0.0,0.0,63,19,0.768,0.0,114.7,106.0,8.47
4,4,Walter Davis,1991.0,6-6,193.0,UNC,SG,36,POR,Portland Trail Blazers,...,0.0,0.0,0.0,63,19,0.768,0.0,114.7,106.0,8.47


In [60]:
salaries1 = pd.read_csv('salaries.csv')
salaries2 = pd.read_csv('Salary_Cap_By_Year.csv')

In [61]:
salaries1.head()

Unnamed: 0,playerName,seasonStartYear,salary,inflationAdjSalary
0,Michael Jordan,1996,"$30,140,000","$52,258,566"
1,Horace Grant,1996,"$14,857,000","$25,759,971"
2,Reggie Miller,1996,"$11,250,000","$19,505,934"
3,Shaquille O'Neal,1996,"$10,714,000","$18,576,585"
4,Gary Payton,1996,"$10,212,000","$17,706,187"


In [62]:
salaries2.head()

Unnamed: 0,Year,Salary Cap
0,1984,"$3,600,000.00"
1,1985,"$4,233,000.00"
2,1986,"$4,945,000.00"
3,1987,"$6,164,000.00"
4,1988,"$7,232,000.00"


## 2.6 Clean the Data

In [63]:
stats = stats.drop(columns = 'Unnamed: 0')
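An `Unnamed: 0` column usually means the CSV was saved with its row index. As an alternative to dropping it after loading, the index column can be declared when reading. The sketch below uses a made-up in-memory CSV (the names `raw`, `with_extra`, and `clean` are illustrative only) rather than the real `NBA_Stats1980.csv`.

```python
import io
import pandas as pd

# Toy CSV standing in for a file that was saved with its index (made-up rows).
raw = io.StringIO(
    ",Player,Year\n"
    "0,Player A,1996\n"
    "1,Player B,1997\n"
)

# Default read: the unnamed first column comes back as 'Unnamed: 0'.
with_extra = pd.read_csv(raw)

# Alternative: tell pandas that the first column is the index.
raw.seek(0)
clean = pd.read_csv(raw, index_col=0)

print(list(with_extra.columns))  # ['Unnamed: 0', 'Player', 'Year']
print(list(clean.columns))       # ['Player', 'Year']
```

Either approach ends with the same columns; `index_col=0` just avoids the extra `drop` step.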

In [64]:
stats['Year'].isna().sum()

0

In [65]:
stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17976 entries, 0 to 17975
Data columns (total 44 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Player    17976 non-null  object 
 1   Year      17976 non-null  float64
 2   Ht        17976 non-null  object 
 3   Wt        17976 non-null  float64
 4   Colleges  17976 non-null  object 
 5   Pos       17976 non-null  object 
 6   Age       17976 non-null  int64  
 7   Tm        17976 non-null  object 
 8   Team      17976 non-null  object 
 9   G         17976 non-null  int64  
 10  GS        17976 non-null  int64  
 11  MP        17976 non-null  float64
 12  FG        17976 non-null  float64
 13  FGA       17976 non-null  float64
 14  FG%       17976 non-null  float64
 15  3P        17976 non-null  float64
 16  3PA       17976 non-null  float64
 17  3P%       17976 non-null  float64
 18  2P        17976 non-null  float64
 19  2PA       17976 non-null  float64
 20  2P%       17976 non-null  fl

In [66]:
stats['Year'] = stats['Year'].astype('int64')
stats['Year'].head()

0    1991
1    1991
2    1991
3    1991
4    1991
Name: Year, dtype: int64

In [67]:
stats = stats[stats['Year']>1995]

In [68]:
stats.columns

Index(['Player', 'Year', 'Ht', 'Wt', 'Colleges', 'Pos', 'Age', 'Tm', 'Team',
       'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA',
       '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL',
       'BLK', 'TOV', 'PF', 'PTS', 'Pts Won', 'Pts Max', 'Share', 'W', 'L',
       'W/L%', 'GB', 'PS/G', 'PA/G', 'SRS'],
      dtype='object')

Here are the meanings of the column names:

 - 'Player' - player name
 - 'Year' - year played
 - 'Ht' - height
 - 'Wt' - weight
 - 'Colleges' - which school(s) they attended
 - 'Pos' - position
 - 'Age' - age during that year
 - 'Tm' - team (abbr.)
 - 'G' - games played
 - 'GS' - games started
 - 'MP' - minutes played
 - 'FG' - field goals made
 - 'FGA' - field goals attempted
 - 'FG%' - field goal percentage
 - '3P' - 3-pointers made
 - '3PA' - 3-pointers attempted
 - '3P%' - 3-point percentage
 - '2P' - 2-pointers made
 - '2PA' - 2-pointers attempted
 - '2P%' - 2-point percentage
 - 'eFG%' - effective field goal percentage
 - 'FT' - free throws made
 - 'FTA' - free throws attempted
 - 'FT%' - free throw percentage
 - 'ORB' - offensive rebounds
 - 'DRB' - defensive rebounds
 - 'TRB' - total rebounds
 - 'AST' - assists
 - 'STL' - steals
 - 'BLK' - blocks
 - 'TOV' - turnovers
 - 'PF' - personal fouls
 - 'PTS' - points
 - 'Pts Won' - award voting points received (this and the next two columns appear to come from MVP voting tables)
 - 'Pts Max' - maximum award voting points available
 - 'Share' - share of the available voting points received ('Pts Won' / 'Pts Max')
 - 'Team' - team full name
 - 'W' - wins
 - 'L' - losses
 - 'W/L%' - win-loss percentage
 - 'GB' - games behind/back
 - 'PS/G' - team points scored per game
 - 'PA/G' - team points allowed per game
 - 'SRS' - Simple Rating System
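Several of these columns are arithmetically related, which gives a cheap consistency check on the scraped data. The sketch below uses made-up rows (not taken from `NBA_Stats1980.csv`) and assumes the counting stats are per-game averages rounded to one decimal, hence the tolerance `TOL`, which is itself an assumed value.

```python
import pandas as pd

# Made-up box-score rows for illustration only.
toy = pd.DataFrame({
    "FG": [4.0, 6.5], "FGA": [10.0, 13.0],
    "3P": [1.0, 1.5], "2P": [3.0, 5.0],
    "ORB": [1.0, 2.0], "DRB": [3.0, 4.5], "TRB": [4.0, 6.5],
})

# Assumed tolerance: three values each rounded to one decimal can differ by up to 0.15.
TOL = 0.15

# Field goals made should equal 2-pointers made plus 3-pointers made.
fg_ok = ((toy["2P"] + toy["3P"]) - toy["FG"]).abs() <= TOL

# Total rebounds should equal offensive plus defensive rebounds.
trb_ok = ((toy["ORB"] + toy["DRB"]) - toy["TRB"]).abs() <= TOL

# Effective field goal percentage: a made 3-pointer counts as 1.5 field goals.
efg_check = (toy["FG"] + 0.5 * toy["3P"]) / toy["FGA"]

print(bool(fg_ok.all()), bool(trb_ok.all()))
print(efg_check.round(3).tolist())
```

Rows that fail a check like this would be worth inspecting in Notebook 03 before they are used as features.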

In [69]:
salaries1 = salaries1.drop(columns = 'inflationAdjSalary')

In [70]:
salaries1 = salaries1.rename(columns = {'playerName': 'Player', 'seasonStartYear': 'Year', 'salary': 'Salary'})

In [71]:
salaries1.head()

Unnamed: 0,Player,Year,Salary
0,Michael Jordan,1996,"$30,140,000"
1,Horace Grant,1996,"$14,857,000"
2,Reggie Miller,1996,"$11,250,000"
3,Shaquille O'Neal,1996,"$10,714,000"
4,Gary Payton,1996,"$10,212,000"


## 2.7 Merging Data Sets

In [72]:
players1 = pd.merge(stats, salaries1, how = 'left', on = ['Player', 'Year'])

In [73]:
players1.head()

Unnamed: 0,Player,Year,Ht,Wt,Colleges,Pos,Age,Tm,Team,G,...,Pts Max,Share,W,L,W/L%,GB,PS/G,PA/G,SRS,Salary
0,Mahmoud Abdul-Rauf,1996,6-1,162.0,LSU,PG,26,DEN,Denver Nuggets,57,...,0.0,0.0,35,47,0.427,24.0,97.7,100.4,-2.62,"$3,100,000"
1,Rastko Cvetković,1996,7-1,260.0,Not American,C,25,DEN,Denver Nuggets,14,...,0.0,0.0,35,47,0.427,24.0,97.7,100.4,-2.62,
2,Dale Ellis,1996,6-7,205.0,Tennessee,SF,35,DEN,Denver Nuggets,81,...,0.0,0.0,35,47,0.427,24.0,97.7,100.4,-2.62,"$1,600,000"
3,LaPhonso Ellis,1996,6-8,240.0,Notre Dame,SF,25,DEN,Denver Nuggets,45,...,0.0,0.0,35,47,0.427,24.0,97.7,100.4,-2.62,"$3,294,000"
4,Matt Fish,1996,6-11,235.0,UNC Wilmington,C,26,DEN,Denver Nuggets,18,...,0.0,0.0,35,47,0.427,24.0,97.7,100.4,-2.62,"$247,500"
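A left merge keeps every stats row, so unmatched players end up with an empty `Salary`. The sketch below, on made-up frames (`stats_toy` and `sal_toy` are illustrative names), shows how `indicator=True` labels which rows found a match and how `validate` guards against duplicate keys in the salary table. Whether the real salary file has exactly one row per player and year is an assumption that `validate` would test.

```python
import pandas as pd

# Made-up frames standing in for stats and salaries1.
stats_toy = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Year": [1996, 1996, 1997],
    "PTS": [19.2, 3.1, 11.4],
})
sal_toy = pd.DataFrame({
    "Player": ["Player A", "Player C"],
    "Year": [1996, 1997],
    "Salary": ["$1,000,000", "$2,500,000"],
})

merged = pd.merge(
    stats_toy, sal_toy,
    how="left", on=["Player", "Year"],
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
    validate="many_to_one",  # raises MergeError if a (Player, Year) key repeats in sal_toy
)

n_matched = int((merged["_merge"] == "both").sum())
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(n_matched, n_unmatched)  # 2 1
```

The `_merge` column can be dropped once the match rate has been inspected.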


In [74]:
players2 = pd.merge(players1, salaries2, how = 'left', on = 'Year')
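Both merges join on `Year`, so it is worth confirming that all three tables label a season the same way. The salary table's original column name, `seasonStartYear`, suggests it uses the year a season starts; which convention `stats` uses is not stated in this notebook. The sketch below assumes, purely for illustration, that `stats` uses the year a season ends, and shows how the salary years could be shifted before merging. All rows are made up.

```python
import pandas as pd

# ASSUMPTION for illustration: 'Year' here is the season END year (1997 = the 1996-97 season).
stats_toy = pd.DataFrame({"Player": ["Player A"], "Year": [1997], "PTS": [20.0]})

# Salary keyed by the season START year (1996 = the 1996-97 season).
sal_toy = pd.DataFrame({
    "Player": ["Player A"],
    "seasonStartYear": [1996],
    "Salary": [1000000],
})

# Renaming alone pairs 1997 stats with a 1996 key, so nothing matches.
naive = stats_toy.merge(
    sal_toy.rename(columns={"seasonStartYear": "Year"}),
    how="left", on=["Player", "Year"],
)

# Converting start year to end year first lines the seasons up.
sal_aligned = (
    sal_toy.assign(Year=sal_toy["seasonStartYear"] + 1)
           .drop(columns="seasonStartYear")
)
aligned = stats_toy.merge(sal_aligned, how="left", on=["Player", "Year"])

print(naive["Salary"].isna().all())  # True
print(aligned["Salary"].iloc[0])
```

If the conventions already agree, no shift is needed; checking a few well-known player seasons by hand is a quick way to find out.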

In [75]:
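# Count rows without a matching salary (not displayed: only a cell's last expression is echoed)
players2['Salary'].isnull().sum()
# Show those rows
players2[players2['Salary'].isnull()]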

Unnamed: 0,Player,Year,Ht,Wt,Colleges,Pos,Age,Tm,Team,G,...,Share,W,L,W/L%,GB,PS/G,PA/G,SRS,Salary,Salary Cap
1,Rastko Cvetković,1996,7-1,260.0,Not American,C,25,DEN,Denver Nuggets,14,...,0.0,35,47,0.427,24.0,97.7,100.4,-2.62,,"$24,363,000.00"
5,Greg Grant,1996,5-7,140.0,Trenton State University,PG,29,DEN,Denver Nuggets,31,...,0.0,35,47,0.427,24.0,97.7,100.4,-2.62,,"$24,363,000.00"
15,Randy Woods,1996,5-10,185.0,La Salle,PG,25,DEN,Denver Nuggets,8,...,0.0,35,47,0.427,24.0,97.7,100.4,-2.62,,"$24,363,000.00"
21,Jeff Grayer,1997,6-5,200.0,Iowa State,SG,31,SAC,Sacramento Kings,25,...,0.0,34,48,0.415,23.0,96.4,99.8,-3.64,,"$26,900,000.00"
31,Mahmoud Abdul-Rauf,1998,6-1,162.0,LSU,PG,28,SAC,Sacramento Kings,31,...,0.0,27,55,0.329,34.0,93.1,98.7,-5.83,,"$30,000,000.00"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12258,Dahntay Jones,2013,6-6,225.0,"Rutgers University, Duke",SF,32,ATL,Atlanta Hawks,78,...,0.0,44,38,0.537,22.0,98.0,97.5,-0.08,,"$58,679,000.00"
12262,Johan Petro,2013,7-0,247.0,Not American,C,27,ATL,Atlanta Hawks,31,...,0.0,44,38,0.537,22.0,98.0,97.5,-0.08,,"$58,679,000.00"
12265,DeShawn Stevenson,2013,6-5,210.0,Not American,SG,31,ATL,Atlanta Hawks,56,...,0.0,44,38,0.537,22.0,98.0,97.5,-0.08,,"$58,679,000.00"
12267,Anthony Tolliver,2013,6-8,240.0,Creighton,SF,27,ATL,Atlanta Hawks,62,...,0.0,44,38,0.537,22.0,98.0,97.5,-0.08,,"$58,679,000.00"
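Some of these gaps may simply be players the salary source does not cover, but a left merge on `Player` also fails whenever the two sources spell a name differently, for example with or without diacritics. Whether that is happening here is an assumption that would need checking against `salaries.csv`. The sketch below shows one way to build a comparison key; `normalize_name` is a hypothetical helper, not part of the notebook.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Return a lowercase, accent-free, single-spaced version of a player name."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase and collapse repeated or surrounding whitespace.
    return " ".join(stripped.lower().split())


print(normalize_name("Rastko Cvetković"))  # rastko cvetkovic
print(normalize_name("  Dale   Ellis "))   # dale ellis
```

Applying such a function to the `Player` column of both tables and merging on the resulting key would show whether spelling differences account for any of the missing salaries.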


In [76]:
players2 = players2.dropna(subset = ['Salary'])

In [77]:
players2['Salary'].isnull().sum()

0

The salary values are stored as strings containing `$` and `,`, so they cannot be used in arithmetic. We need to strip those characters and convert the columns from objects to integers or floats.

In [80]:
players2['Salary'] = players2['Salary'].replace("[$,]", "", regex=True).astype(int)
players2['Salary Cap'] = players2['Salary Cap'].replace("[$,]", "", regex=True).astype(float)
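The same conversion can be sketched on a small frame with `str.replace` and `pd.to_numeric`, which infers an integer or float dtype from the cleaned strings. The rows below reuse a few values shown in the outputs above plus one deliberately missing salary; the names `toy` and `paid` are illustrative. Taking an explicit `.copy()` after `dropna` is a precaution, since assigning to a filtered frame can raise `SettingWithCopyWarning` in some pandas versions.

```python
import pandas as pd

# A few values copied from the outputs above, plus one missing salary.
toy = pd.DataFrame({
    "Salary": ["$3,100,000", None, "$247,500"],
    "Salary Cap": ["$24,363,000.00"] * 3,
})

# Work on an independent copy of the rows that have a salary.
paid = toy.dropna(subset=["Salary"]).copy()

for col in ["Salary", "Salary Cap"]:
    # Remove '$' and ',' and let pandas pick a numeric dtype.
    cleaned = paid[col].str.replace(r"[$,]", "", regex=True)
    paid[col] = pd.to_numeric(cleaned)

print(paid["Salary"].tolist())      # [3100000, 247500]
print(paid["Salary Cap"].tolist())  # [24363000.0, 24363000.0]
```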

The conversion worked, so now we can create the target column for the EDA: each player's salary as a fraction of that season's salary cap.

In [85]:
players2['League Weight'] = players2['Salary'] / players2['Salary Cap']
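As a worked check of this ratio, the figures from the first row of the merged table (a salary of 3,100,000 against a cap of 24,363,000) can be divided by hand:

```python
# Values taken from the first row of the merged table in this notebook.
salary = 3_100_000
salary_cap = 24_363_000.0

# League Weight is the player's salary as a fraction of the salary cap.
league_weight = salary / salary_cap

print(round(league_weight, 6))  # 0.127242
```

In other words, that salary corresponds to roughly 12.7% of the cap for that year.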

In [93]:
players2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8575 entries, 0 to 12268
Data columns (total 47 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Player         8575 non-null   object 
 1   Year           8575 non-null   int64  
 2   Ht             8575 non-null   object 
 3   Wt             8575 non-null   float64
 4   Colleges       8575 non-null   object 
 5   Pos            8575 non-null   object 
 6   Age            8575 non-null   int64  
 7   Tm             8575 non-null   object 
 8   Team           8575 non-null   object 
 9   G              8575 non-null   int64  
 10  GS             8575 non-null   int64  
 11  MP             8575 non-null   float64
 12  FG             8575 non-null   float64
 13  FGA            8575 non-null   float64
 14  FG%            8575 non-null   float64
 15  3P             8575 non-null   float64
 16  3PA            8575 non-null   float64
 17  3P%            8575 non-null   float64
 18  2P     

In [94]:
players2.head()

Unnamed: 0,Player,Year,Ht,Wt,Colleges,Pos,Age,Tm,Team,G,...,W,L,W/L%,GB,PS/G,PA/G,SRS,Salary,Salary Cap,League Weight
0,Mahmoud Abdul-Rauf,1996,6-1,162.0,LSU,PG,26,DEN,Denver Nuggets,57,...,35,47,0.427,24.0,97.7,100.4,-2.62,3100000,24363000.0,0.127242
2,Dale Ellis,1996,6-7,205.0,Tennessee,SF,35,DEN,Denver Nuggets,81,...,35,47,0.427,24.0,97.7,100.4,-2.62,1600000,24363000.0,0.065673
3,LaPhonso Ellis,1996,6-8,240.0,Notre Dame,SF,25,DEN,Denver Nuggets,45,...,35,47,0.427,24.0,97.7,100.4,-2.62,3294000,24363000.0,0.135205
4,Matt Fish,1996,6-11,235.0,UNC Wilmington,C,26,DEN,Denver Nuggets,18,...,35,47,0.427,24.0,97.7,100.4,-2.62,247500,24363000.0,0.010159
6,Tom Hammonds,1996,6-9,215.0,Georgia Tech,PF,28,DEN,Denver Nuggets,71,...,35,47,0.427,24.0,97.7,100.4,-2.62,1070000,24363000.0,0.043919


## 2.8 Saving the Data

Everything looks good, so we now save the cleaned dataset to a new file. Future notebooks can work from this file, and the original data is not overwritten.

In [98]:
players2.to_csv('nba_players_cleaned.csv')
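One detail to keep in mind when reloading this file: `to_csv` writes the row index as an unnamed first column unless `index=False` is passed, and that column comes back as `Unnamed: 0` on the next `read_csv`. The sketch below shows the round trip on a made-up frame using an in-memory buffer (`toy`, `buffer`, and `reloaded` are illustrative names) instead of a real file.

```python
import io
import pandas as pd

# Made-up frame with a gapped index, like a frame after dropna.
toy = pd.DataFrame(
    {"Player": ["Player A", "Player B"], "Salary": [3100000, 247500]},
    index=[0, 2],
)

# Write without the index, then read the result back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns))  # ['Player', 'Salary']
```

Whether to keep the index depends on whether the original row labels are needed later; for a cleaned table like this one they usually are not.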

Ready for EDA!