# 02 Data Wrangling Introduction

### 2.1 Contents

    2.2 Introduction
    2.3 Imports
    2.4 Objectives
    2.5 Load the NBA Data
    2.6 Merging Data Sets
    2.7 Saving the Data

### 2.2 Introduction To The Notebook

The goal of this notebook is to organize the data sets that were scraped from several open-source websites and to make sure the data is well-defined, so that later analysis can proceed with minimal mistakes. The full EDA and cleaning will be in Notebook 03; however, some cleaning is done at this stage to organize the data for that process.

### 2.3 Imports

In [29]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

### 2.4 Objectives

What kind of cleaning steps did you perform?
How did you deal with missing values, if there were any?
Were there outliers, and how did you handle them?

Do you think you may have the data you need to tackle the desired question?
Have you identified the required target value?
Do you have potentially useful features?
Do you have any fundamental issues with the data?

### 2.5 Load the NBA Data

In [30]:
team_salary = pd.read_csv('team_salary_cap.csv')

In [31]:
team_salary.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Salary,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7
0,Rk,Team,2022-23,2023-24,2024-25,2025-26,2026-27,2027-28
1,1,Los Angeles Clippers,"$194,515,174","$200,523,047","$158,685,012","$20,482,758",,
2,2,Golden State Warriors,"$185,674,582","$160,334,640","$80,779,109","$64,026,973",,
3,3,Brooklyn Nets,"$182,391,756","$138,524,280","$98,224,536","$53,282,609",,
4,4,Milwaukee Bucks,"$175,135,725","$145,750,211","$101,013,102","$70,162,390",,


In [32]:
team_salary.info()
team_salary.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  31 non-null     object
 1   Unnamed: 1  31 non-null     object
 2   Salary      31 non-null     object
 3   Unnamed: 3  31 non-null     object
 4   Unnamed: 4  31 non-null     object
 5   Unnamed: 5  29 non-null     object
 6   Unnamed: 6  14 non-null     object
 7   Unnamed: 7  7 non-null      object
dtypes: object(8)
memory usage: 2.1+ KB


(31, 8)
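The `head()` output above suggests that the real column labels (`Rk`, `Team`, `2022-23`, ...) sit in the first data row rather than in the header. One possible way to handle this at load time is to tell `read_csv` which line holds the labels. The sketch below is only an illustration: it uses a small in-memory sample whose layout is assumed to resemble the real file, and the name `team_salary_sample` is hypothetical.

```python
import io
import pandas as pd

# Hypothetical sample with a two-line header, mimicking the layout seen above.
# The real 'team_salary_cap.csv' may be laid out differently.
sample = io.StringIO(
    ',,Salary,,\n'
    'Rk,Team,2022-23,2023-24,2024-25\n'
    '1,Los Angeles Clippers,"$194,515,174","$200,523,047","$158,685,012"\n'
    '2,Golden State Warriors,"$185,674,582","$160,334,640","$80,779,109"\n'
)

# header=1 takes the second line as the column labels and skips the line above it.
team_salary_sample = pd.read_csv(sample, header=1)
print(team_salary_sample.columns.tolist())
```

Reading the labels directly this way would make the later renaming and row-dropping steps unnecessary, provided the assumption about the file layout holds.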

In [33]:
contracts = pd.read_csv('player_contracts.csv')

In [34]:
contracts.head()

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,player_contracts
,,,Salary,,,,,,,
Rk,Player,Tm,2022-23,2023-24,2024-25,2025-26,2026-27,2027-28,Signed Using,Guaranteed
1,Stephen Curry,GSW,"$48,070,014.00","$51,915,615.00","$55,761,216.00","$59,606,817.00",,,Bird,"$215,353,662.00"
2,Russell Westbrook,LAL,"$47,063,478.00",,,,,,Bird Rights,"$47,063,478.00"
3,LeBron James,LAL,"$44,474,988.00",,,,,,Bird,"$44,474,988.00"


In [35]:
contracts.info()
contracts.shape

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 448 entries, (nan, nan, nan, 'Salary', nan, nan, nan, nan, nan, nan) to ('446', 'Armoni Brooks', 'TOR', '$50,000.00', nan, nan, nan, nan, nan, nan)
Data columns (total 1 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   player_contracts  406 non-null    object
dtypes: object(1)
memory usage: 80.6+ KB


(448, 1)

In [36]:
all_players1991 = pd.read_csv('players.csv')

In [37]:
all_players1991.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
0,1,Alaa Abdelnaby,PF,22,POR,43,0,6.7,1.3,2.7,...,0.6,1.4,2.1,0.3,0.1,0.3,0.5,0.9,3.1,1991
1,2,Mahmoud Abdul-Rauf,PG,21,DEN,67,19,22.5,6.2,15.1,...,0.5,1.3,1.8,3.1,0.8,0.1,1.6,2.2,14.1,1991
2,3,Mark Acres,C,28,ORL,68,0,19.3,1.6,3.1,...,2.1,3.2,5.3,0.4,0.4,0.4,0.6,3.2,4.2,1991
3,4,Michael Adams,PG,28,DEN,66,66,35.5,8.5,21.5,...,0.9,3.0,3.9,10.5,2.2,0.1,3.6,2.5,26.5,1991
4,5,Mark Aguirre,SF,31,DET,78,13,25.7,5.4,11.7,...,1.7,3.1,4.8,1.8,0.6,0.3,1.6,2.7,14.2,1991


In [38]:
all_players1991.info()
all_players1991.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18044 entries, 0 to 18043
Data columns (total 31 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Rk      18044 non-null  object
 1   Player  18044 non-null  object
 2   Pos     18044 non-null  object
 3   Age     18044 non-null  object
 4   Tm      18044 non-null  object
 5   G       18044 non-null  object
 6   GS      18044 non-null  object
 7   MP      18044 non-null  object
 8   FG      18044 non-null  object
 9   FGA     18044 non-null  object
 10  FG%     18044 non-null  object
 11  3P      18044 non-null  object
 12  3PA     18044 non-null  object
 13  3P%     18044 non-null  object
 14  2P      18044 non-null  object
 15  2PA     18044 non-null  object
 16  2P%     18044 non-null  object
 17  eFG%    18044 non-null  object
 18  FT      18044 non-null  object
 19  FTA     18044 non-null  object
 20  FT%     18044 non-null  object
 21  ORB     18044 non-null  object
 22  DRB     18044 non-null

(18044, 31)
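Every column in `all_players1991` is read as `object`, even clearly numeric ones such as `Age` and `PTS`. A common cause in scraped stat tables is header rows repeated inside the data; that is an assumption here, not something the output above confirms. Under that assumption, a minimal sketch of the repair on a toy frame (the frame `toy` and the column subset are illustrative only):

```python
import pandas as pd

# Toy stand-in for all_players1991; the repeated header row is an assumption.
toy = pd.DataFrame({
    'Rk': ['1', '2', 'Rk', '3'],
    'Player': ['Alaa Abdelnaby', 'Mahmoud Abdul-Rauf', 'Player', 'Mark Acres'],
    'Age': ['22', '21', 'Age', '28'],
    'PTS': ['3.1', '14.1', 'PTS', '4.2'],
})

# Drop rows that merely repeat the column labels.
toy = toy[toy['Rk'] != 'Rk'].reset_index(drop=True)

# Convert the numeric columns one by one; errors='coerce' turns any
# remaining unparseable text into NaN instead of raising.
for col in ['Rk', 'Age', 'PTS']:
    toy[col] = pd.to_numeric(toy[col], errors='coerce')

print(toy.dtypes)
```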

In [39]:
regseason_21_22 = pd.read_csv('2021-2022 NBA Player Stats - Regular.csv')

In [40]:
regseason_21_22.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1,Precious Achiuwa,C,22,TOR,73,28,23.6,3.6,8.3,...,0.595,2.0,4.5,6.5,1.1,0.5,0.6,1.2,2.1,9.1
1,2,Steven Adams,C,28,MEM,76,75,26.3,2.8,5.1,...,0.543,4.6,5.4,10.0,3.4,0.9,0.8,1.5,2.0,6.9
2,3,Bam Adebayo,C,24,MIA,56,56,32.6,7.3,13.0,...,0.753,2.4,7.6,10.1,3.4,1.4,0.8,2.6,3.1,19.1
3,4,Santi Aldama,PF,21,MEM,32,0,11.3,1.7,4.1,...,0.625,1.0,1.7,2.7,0.7,0.2,0.3,0.5,1.1,4.1
4,5,LaMarcus Aldridge,C,36,BRK,47,12,22.3,5.4,9.7,...,0.873,1.6,3.9,5.5,0.9,0.3,1.0,0.9,1.7,12.9


In [41]:
regseason_21_22.info()
regseason_21_22.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 812 entries, 0 to 811
Data columns (total 30 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Rk      812 non-null    int64  
 1   Player  812 non-null    object 
 2   Pos     812 non-null    object 
 3   Age     812 non-null    int64  
 4   Tm      812 non-null    object 
 5   G       812 non-null    int64  
 6   GS      812 non-null    int64  
 7   MP      812 non-null    float64
 8   FG      812 non-null    float64
 9   FGA     812 non-null    float64
 10  FG%     812 non-null    float64
 11  3P      812 non-null    float64
 12  3PA     812 non-null    float64
 13  3P%     812 non-null    float64
 14  2P      812 non-null    float64
 15  2PA     812 non-null    float64
 16  2P%     812 non-null    float64
 17  eFG%    812 non-null    float64
 18  FT      812 non-null    float64
 19  FTA     812 non-null    float64
 20  FT%     812 non-null    float64
 21  ORB     812 non-null    float64
 22  DR

(812, 30)

### 2.5.1 Cleaning The Data

In [42]:
# Row 0 repeats the real column labels (Rk, Team, 2022-23, ...), so rename the
# columns we keep to those labels.
team_salary = team_salary.rename(columns={'Unnamed: 0': 'Rk', 'Unnamed: 1': 'Team', 'Salary': '2022-23', 'Unnamed: 3': '2023-24'})

# Keep only the 2022-23 and 2023-24 seasons.
team_salary = team_salary.drop(columns=['Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7'])

In [43]:
# drop() returns a new DataFrame, so assign the result to actually remove row 0.
team_salary = team_salary.drop(index=0)

team_salary = team_salary.set_index('Rk')
team_salary.head(4)

Rk,Team,2022-23,2023-24
1,Los Angeles Clippers,"$194,515,174","$200,523,047"
2,Golden State Warriors,"$185,674,582","$160,334,640"
3,Brooklyn Nets,"$182,391,756","$138,524,280"
4,Milwaukee Bucks,"$175,135,725","$145,750,211"


In [44]:
contracts = contracts.reset_index()
contracts.head()

Unnamed: 0,level_0,level_1,level_2,level_3,level_4,level_5,level_6,level_7,level_8,level_9,player_contracts
0,,,,Salary,,,,,,,
1,Rk,Player,Tm,2022-23,2023-24,2024-25,2025-26,2026-27,2027-28,Signed Using,Guaranteed
2,1,Stephen Curry,GSW,"$48,070,014.00","$51,915,615.00","$55,761,216.00","$59,606,817.00",,,Bird,"$215,353,662.00"
3,2,Russell Westbrook,LAL,"$47,063,478.00",,,,,,Bird Rights,"$47,063,478.00"
4,3,LeBron James,LAL,"$44,474,988.00",,,,,,Bird,"$44,474,988.00"


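In [45]:
# Row 1 holds the real column labels (Rk, Player, Tm, ...); use it to rename the columns.
contracts.rename(columns=contracts.iloc[1], inplace=True)

In [46]:
# Drop the two leftover header rows.
contracts = contracts.drop(labels=[0, 1], axis=0)

In [47]:
# Replace missing values (such as seasons with no salary listed) with 0.
contracts = contracts.fillna(0)
contracts.head()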

Unnamed: 0,Rk,Player,Tm,2022-23,2023-24,2024-25,2025-26,2026-27,2027-28,Signed Using,Guaranteed
2,1,Stephen Curry,GSW,"$48,070,014.00","$51,915,615.00","$55,761,216.00","$59,606,817.00",0,0,Bird,"$215,353,662.00"
3,2,Russell Westbrook,LAL,"$47,063,478.00",0,0,0,0,0,Bird Rights,"$47,063,478.00"
4,3,LeBron James,LAL,"$44,474,988.00",0,0,0,0,0,Bird,"$44,474,988.00"
5,4,Kevin Durant,BRK,"$44,119,845.00","$46,407,433.00","$49,856,021.00","$53,282,609.00",0,0,Bird,"$193,665,908.00"
6,5,Bradley Beal,WAS,"$43,279,250.00","$46,741,590.00","$50,203,930.00","$53,666,270.00","$57,128,610.00",0,Bird,"$193,891,040.00"


In [48]:
contracts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 446 entries, 2 to 447
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Rk            446 non-null    object
 1   Player        446 non-null    object
 2   Tm            446 non-null    object
 3   2022-23       446 non-null    object
 4   2023-24       446 non-null    object
 5   2024-25       446 non-null    object
 6   2025-26       446 non-null    object
 7   2026-27       446 non-null    object
 8   2027-28       446 non-null    object
 9   Signed Using  446 non-null    object
 10  Guaranteed    446 non-null    object
dtypes: object(11)
memory usage: 38.5+ KB
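After this step the salary columns are still `object`: strings such as `$48,070,014.00` mixed with the integer `0` introduced by `fillna(0)`. Before any arithmetic they would need to become numbers. The following is a minimal sketch of one way to do that; the helper name `to_dollars` and the two-row `sample` frame are hypothetical and not part of the notebook.

```python
import pandas as pd

def to_dollars(col: pd.Series) -> pd.Series:
    """Hypothetical helper: strip '$' and ',' and return a numeric Series."""
    cleaned = col.astype(str).str.replace(r'[$,]', '', regex=True)
    return pd.to_numeric(cleaned, errors='coerce')

# Small stand-in for the cleaned contracts frame.
sample = pd.DataFrame({
    'Player': ['Stephen Curry', 'Russell Westbrook'],
    '2022-23': ['$48,070,014.00', '$47,063,478.00'],
    '2023-24': ['$51,915,615.00', 0],  # the 0 comes from the earlier fillna(0)
})

for season in ['2022-23', '2023-24']:
    sample[season] = to_dollars(sample[season])

print(sample.dtypes)
```

Converting through `astype(str)` first lets the same helper handle both the formatted strings and the filled-in zeros.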


### 2.6 Merging Data Sets

In [49]:
players_stats = pd.merge(regseason_21_22, contracts, how='left', on=['Player', 'Tm'])

In [50]:
players_stats.head()

Unnamed: 0,Rk_x,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,PTS,Rk_y,2022-23,2023-24,2024-25,2025-26,2026-27,2027-28,Signed Using,Guaranteed
0,1,Precious Achiuwa,C,22,TOR,73,28,23.6,3.6,8.3,...,9.1,300.0,"$2,840,160.00","$4,379,527.00",0,0,0.0,0.0,1st Round Pick,"$2,840,160.00"
1,2,Steven Adams,C,28,MEM,76,75,26.3,2.8,5.1,...,6.9,76.0,"$17,926,829.00",0,0,0,0.0,0.0,1st Round Pick,"$17,926,829.00"
2,3,Bam Adebayo,C,24,MIA,56,56,32.6,7.3,13.0,...,19.1,33.0,"$30,351,834.00","$32,600,118.00","$34,848,402.00","$37,096,686.00",0.0,0.0,Bird,"$134,897,040.00"
3,4,Santi Aldama,PF,21,MEM,32,0,11.3,1.7,4.1,...,4.1,348.0,"$2,094,240.00","$2,194,200.00","$3,960,531.00",0,0.0,0.0,1st Round Pick,"$2,094,240.00"
4,5,LaMarcus Aldridge,C,36,BRK,47,12,22.3,5.4,9.7,...,12.9,,,,,,,,,
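The last row above shows that a left join keeps players with no matching contract and fills their contract columns with NaN. To see how many rows failed to match, `pd.merge` accepts `indicator=True`. The sketch below uses two toy frames (`stats` and `deals`, both hypothetical) rather than the real data.

```python
import pandas as pd

# Toy stand-ins for regseason_21_22 and contracts.
stats = pd.DataFrame({
    'Player': ['Precious Achiuwa', 'LaMarcus Aldridge'],
    'Tm': ['TOR', 'BRK'],
    'PTS': [9.1, 12.9],
})
deals = pd.DataFrame({
    'Player': ['Precious Achiuwa'],
    'Tm': ['TOR'],
    '2022-23': ['$2,840,160.00'],
})

# indicator=True adds a '_merge' column; for a left join its values are
# 'both' (match found) or 'left_only' (no contract row matched).
checked = pd.merge(stats, deals, how='left', on=['Player', 'Tm'], indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Player'].tolist()
print(unmatched)
```

Because the join key includes `Tm`, a player whose team differs between the two sources would also show up as `left_only`, which is worth checking before relying on the merged salaries.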


In [51]:
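# Rk_y is the contract ranking carried over from the merge; it is not needed.
# (columns= already selects the axis, so no axis argument is required.)
players_stats = players_stats.drop(columns='Rk_y')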

In [52]:
players_stats.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 812 entries, 0 to 811
Data columns (total 38 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rk_x          812 non-null    int64  
 1   Player        812 non-null    object 
 2   Pos           812 non-null    object 
 3   Age           812 non-null    int64  
 4   Tm            812 non-null    object 
 5   G             812 non-null    int64  
 6   GS            812 non-null    int64  
 7   MP            812 non-null    float64
 8   FG            812 non-null    float64
 9   FGA           812 non-null    float64
 10  FG%           812 non-null    float64
 11  3P            812 non-null    float64
 12  3PA           812 non-null    float64
 13  3P%           812 non-null    float64
 14  2P            812 non-null    float64
 15  2PA           812 non-null    float64
 16  2P%           812 non-null    float64
 17  eFG%          812 non-null    float64
 18  FT            812 non-null    

In [55]:
players_stats.set_index('Player', inplace=True)

In [62]:
players_stats.loc['Stephen Curry']

Rk_x                        126
Pos                          PG
Age                          33
Tm                          GSW
G                            64
GS                           64
MP                         34.5
FG                          8.4
FGA                        19.1
FG%                       0.437
3P                          4.5
3PA                        11.7
3P%                        0.38
2P                          3.9
2PA                         7.4
2P%                       0.527
eFG%                      0.554
FT                          4.3
FTA                         4.7
FT%                       0.923
ORB                         0.5
DRB                         4.7
TRB                         5.2
AST                         6.3
STL                         1.3
BLK                         0.4
TOV                         3.2
PF                          2.0
PTS                        25.5
2022-23          $48,070,014.00
2023-24          $51,915,615.00
2024-25 
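`set_index('Player')` does not require the labels to be unique. If a player appears on more than one row (for example, one row per team after a trade, which is an assumption about this data set and not verified here), `.loc` returns a DataFrame for that player instead of a Series. A small sketch of how this could be checked, using a toy frame with a made-up `Traded Player` entry and made-up values for that entry:

```python
import pandas as pd

# Toy frame; 'Traded Player' and its values are invented for illustration.
toy = pd.DataFrame(
    {'Tm': ['GSW', 'TOT', 'BRK'], 'PTS': [25.5, 10.0, 11.0]},
    index=pd.Index(['Stephen Curry', 'Traded Player', 'Traded Player'], name='Player'),
)

# is_unique is False as soon as any label repeats.
print(toy.index.is_unique)

# List each label that occurs more than once.
repeated = toy.index[toy.index.duplicated(keep=False)].unique().tolist()
print(repeated)

one = toy.loc['Stephen Curry']   # unique label: a Series
many = toy.loc['Traded Player']  # repeated label: a DataFrame
```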

### 2.7 Saving the Data

In [63]:
players_stats.to_csv('players_stats.csv')
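Because `Player` is the index at this point, `to_csv` writes it as the first column of the file. When the file is read back, passing `index_col='Player'` restores that index. The sketch below demonstrates the round trip with an in-memory buffer and a one-row toy frame instead of the real file.

```python
import io
import pandas as pd

# One-row stand-in for players_stats, indexed by Player.
toy = pd.DataFrame(
    {'Tm': ['GSW'], 'PTS': [25.5]},
    index=pd.Index(['Stephen Curry'], name='Player'),
)

# Write to an in-memory buffer instead of disk, then rewind and read it back.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col='Player' turns the first column back into the index.
reloaded = pd.read_csv(buffer, index_col='Player')
print(reloaded)
```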