# From Rookies to the All-NBA team
# Step 1: Data Acquisition and Cleaning

## Importing Libraries, Packages and Modules

In [1]:
import pandas as pd
from ydata_profiling import ProfileReport

## Loading and Cleaning All-Rookie Data

To build a dataset suitable for modeling which NBA rookies will eventually reach an All-NBA team, we first load several datasets and then merge them.

The first dataset is 'End of Season Teams.csv', which contains data about which players made end-of-season teams such as All-NBA, All-Defense, All-Rookie, etc.

All data for this project was taken from the NBA Stats (1947-Present) dataset posted on Kaggle by Sumitro Datta. This data was scraped from Basketball-Reference.com.

A link to the dataset can be found here: https://www.kaggle.com/datasets/sumitrodatta/nba-aba-baa-stats

A link to Basketball Reference can be found here: https://www.basketball-reference.com/

In [2]:
df = pd.read_csv('End of Season Teams.csv', na_values = '?')

We want to have a variable indicating whether players made an All-Rookie team or not. For this, we start by filtering the type of team to 'All-Rookie' and drop columns that are irrelevant to the analysis (namely position and birth year).

In [3]:
df = df[df['type'] == 'All-Rookie'].drop(['position', 'birth_year'], axis = 1)

In [4]:
df

Unnamed: 0,season,lg,type,number_tm,player,seas_id,player_id,tm,age
25,2024,NBA,All-Rookie,1st,Chet Holmgren,31230,5125,OKC,21
26,2024,NBA,All-Rookie,1st,Jaime Jaquez Jr.,31426,5151,MIA,22
27,2024,NBA,All-Rookie,1st,Brandon Miller,31200,5121,CHO,21
28,2024,NBA,All-Rookie,1st,Brandin Podziemski,31196,5120,GSW,20
29,2024,NBA,All-Rookie,1st,Victor Wembanyama,31850,5209,SAS,20
...,...,...,...,...,...,...,...,...,...
2024,1963,NBA,All-Rookie,1st,Chet Walker,2476,762,SYR,22
2025,1963,NBA,All-Rookie,1st,Dave DeBusschere,2486,767,DET,22
2026,1963,NBA,All-Rookie,1st,John Havlicek,2526,778,BOS,22
2027,1963,NBA,All-Rookie,1st,Terry Dischinger,2563,788,CHZ,22


## Loading Per-Game Data

The next set of important variables for our model are player per-game statistics. We get this data by loading the 'Player Per Game.csv' dataset.

In [5]:
df2 = pd.read_csv('Player Per Game.csv', na_values = '?')

## Merging and Cleaning Data

Now we merge this new per-game data with the All-Rookie data. We want to keep every player-season, including players who did not make an All-Rookie team, so we perform a left join with the per-game data on the left.

In [6]:
df2 = df2.merge(df, how = "left", on = "seas_id", suffixes=('', '_y'))
df2.drop(df2.filter(regex = '_y$').columns.tolist(), axis = 1, inplace = True)
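
The behavior of this left join can be checked on a toy example. The following is a minimal sketch, assuming two small hypothetical frames (`per_game` and `rookie_teams`, with made-up values) in place of the real data. It shows that every row on the left survives the join, that unmatched rows receive a missing `number_tm`, and that the duplicated right-hand columns carrying the `_y` suffix are dropped.

```python
import pandas as pd

# Hypothetical toy data standing in for the per-game and All-Rookie frames.
per_game = pd.DataFrame({
    "seas_id": [1, 2, 3],
    "player": ["A", "B", "C"],
    "pts_per_game": [10.0, 4.5, 21.4],
})
rookie_teams = pd.DataFrame({
    "seas_id": [3],
    "player": ["C"],
    "number_tm": ["1st"],
})

# Left join: every per-game row is kept, matched or not.
merged = per_game.merge(rookie_teams, how="left", on="seas_id", suffixes=("", "_y"))

# Drop the overlapping right-hand columns, which received the '_y' suffix.
merged = merged.drop(columns=merged.filter(regex="_y$").columns)

print(merged)
```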

Now, we must filter this dataset to only include the rows and columns relevant to our model. We start by dropping the unnecessary 'birth_year' column.

In [8]:
df2 = df2.drop(['birth_year'], axis = 1)

Next, we filter the data to only include rookies (i.e., players whose experience value is 1 season).

In [51]:
df2 = df2[df2['experience'] == 1]

Subsequently, we only include players who had rookie seasons from 1988-89 onward. This is to ensure consistency in the dataset while still giving us enough observations to work with.

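We choose 1988-89 because that is the season in which the NBA expanded its end-of-season teams to include a 3rd All-NBA team and a 2nd All-Rookie team. Moreover, several statistics (e.g., three-point field goals, steals, and blocks) only began to be tracked at various points before 1988, so earlier seasons are less complete. This makes 1988-89 a reasonable cutoff. Note that the 'season' column stores the year in which a season ends, so 1988-89 appears as 1989 and the filter keeps seasons greater than 1988.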

In [10]:
df2 = df2[df2['season'] > 1988]
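
As a small illustration of the season convention behind this filter, the sketch below uses a hypothetical helper, `season_label` (not part of the dataset or of pandas), to convert the end-year stored in the 'season' column into the familiar two-year label. It assumes, as the displayed data suggests, that 'season' holds the year in which the season ends.

```python
def season_label(end_year: int) -> str:
    """Format a season's end year as a 'YYYY-YY' label (hypothetical helper)."""
    return f"{end_year - 1}-{end_year % 100:02d}"


# The cutoff `season > 1988` keeps 1989 (the 1988-89 season) and later.
first_kept = season_label(1989)
last_dropped = season_label(1988)
print(first_kept, last_dropped)
```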

Additionally, we drop the 'type' column, as it is either 'All-Rookie' or missing for every player. We then fill the missing 'number_tm' values with 'Not Selected' and rename the column to 'rookie_tm' so that it clearly refers to All-Rookie teams.

In [11]:
df2 = df2.drop(['type'], axis = 1)

In [12]:
df2.number_tm = df2.number_tm.fillna('Not Selected')

In [13]:
df2.rename(columns={'number_tm': 'rookie_tm'}, inplace = True)

In [14]:
df2

Unnamed: 0,seas_id,season,player_id,player,pos,age,experience,lg,tm,g,...,orb_per_game,drb_per_game,trb_per_game,ast_per_game,stl_per_game,blk_per_game,tov_per_game,pf_per_game,pts_per_game,rookie_tm
7,31143,2024,5109,Adam Flagler,SG,24.0,1,NBA,OKC,2,...,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,1.5,Not Selected
8,31144,2024,5110,Adama Sanogo,PF,21.0,1,NBA,CHI,9,...,2.1,1.9,4.0,0.0,0.1,0.0,0.6,0.6,4.0,Not Selected
18,31154,2024,5111,Alex Fudge,SF,20.0,1,NBA,TOT,6,...,0.7,0.2,0.8,0.0,0.5,0.0,0.3,0.3,2.5,Not Selected
19,31155,2024,5111,Alex Fudge,SF,20.0,1,NBA,LAL,4,...,0.5,0.0,0.5,0.0,0.0,0.0,0.3,0.3,1.0,Not Selected
20,31156,2024,5111,Alex Fudge,SF,20.0,1,NBA,DAL,2,...,1.0,0.5,1.5,0.0,1.5,0.0,0.5,0.5,5.5,Not Selected
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20491,11784,1989,2474,Vernon Maxwell,SG,23.0,1,NBA,SAS,79,...,0.6,1.9,2.6,3.8,1.1,0.1,2.3,1.7,11.7,Not Selected
20493,11786,1989,2475,Vinny Del Negro,SG,22.0,1,NBA,SAC,80,...,0.6,1.5,2.1,2.6,0.8,0.2,1.0,2.0,7.1,Not Selected
20502,11795,1989,2476,Wayne Engelstad,PF,23.0,1,NBA,DEN,11,...,0.5,1.0,1.5,0.6,0.1,0.0,0.3,1.1,2.5,Not Selected
20503,11796,1989,2477,Will Perdue,C,23.0,1,NBA,CHI,30,...,0.6,0.9,1.5,0.4,0.1,0.2,0.5,1.3,2.2,Not Selected


In the data, we can see that Alex Fudge appears multiple times, as he played for multiple teams as a rookie. To account for players such as Fudge, we remove all of their rows except the one containing their full-season stats. These rows are the ones designated with 'TOT' as the team value.

In [15]:
multi_team = df2[df2['tm'] == 'TOT'].player_id

In [16]:
cond = df2['player_id'].isin(list(multi_team))

In [17]:
cond2 = df2['tm'] != 'TOT'

In [18]:
cond2

7         True
8         True
18       False
19        True
20        True
         ...  
20491     True
20493     True
20502     True
20503     True
20504     True
Name: tm, Length: 3170, dtype: bool

In [19]:
df2 = df2[~(cond & cond2)]

In [20]:
df2 = df2.reset_index()

In [21]:
df2 = df2.drop(['index'], axis = 1)
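
The filtering logic above can be traced on a toy example. This is a minimal sketch, assuming a small hypothetical frame `toy` with made-up values; it keeps the 'TOT' row for a multi-team player and leaves single-team players untouched. It also uses `reset_index(drop=True)`, which should be equivalent to resetting the index and then dropping the resulting 'index' column.

```python
import pandas as pd

# Hypothetical toy data: player 2 was traded mid-season and has a 'TOT' row.
toy = pd.DataFrame({
    "player_id": [1, 2, 2, 2, 3],
    "tm": ["OKC", "TOT", "LAL", "DAL", "CHI"],
    "g": [2, 6, 4, 2, 9],
})

# IDs of players who have a full-season 'TOT' row.
multi_team_ids = toy.loc[toy["tm"] == "TOT", "player_id"]

is_multi_team = toy["player_id"].isin(multi_team_ids)  # player has a 'TOT' row
is_team_row = toy["tm"] != "TOT"                       # row is a single-team split

# Remove the per-team splits of multi-team players; keep everything else.
deduped = toy[~(is_multi_team & is_team_row)].reset_index(drop=True)

print(deduped)
```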

## Loading All-Defense Data

We would like to include more data to show which players made an All-Defense team as rookies. Such players are exceedingly rare, but this achievement would be a great indication of future NBA success.

We obtain this data by again using the 'End of Season Teams.csv' file.

In [23]:
df3 = pd.read_csv('End of Season Teams.csv', na_values = '?')

In [24]:
all_defense = df3[df3['type'] == 'All-Defense']
all_defense = all_defense.drop(['season', 'lg', 'type', 'player', 'player_id', 'position', 'tm', 'age', 'birth_year'], axis = 1)

## Merging and Cleaning All-Defense Data

As above, we perform a left join between the existing dataset on the left and the All-Defense data on the right.

In [25]:
df2 = df2.merge(all_defense, how = "left", on = "seas_id", suffixes=('', '_y'))
df2.drop(df2.filter(regex = '_y$').columns.tolist(), axis = 1, inplace = True)
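
Because the same left-join pattern is repeated for each new table, it can be worth confirming that a join does not silently add rows. The sketch below is one way to do that, assuming toy frames with hypothetical names (`left`, `right_ok`, `right_dup`). It relies on the `validate` parameter of `DataFrame.merge`, which raises `pandas.errors.MergeError` when the right-hand keys are not unique under `"many_to_one"`. Whether the real All-Defense data contains repeated `seas_id` values is not checked here.

```python
import pandas as pd

# Hypothetical toy frames; only the merge key matters for this check.
left = pd.DataFrame({"seas_id": [1, 2, 3]})
right_ok = pd.DataFrame({"seas_id": [3], "number_tm": ["1st"]})
right_dup = pd.DataFrame({"seas_id": [3, 3], "number_tm": ["1st", "2nd"]})

# With unique keys on the right, the left join preserves the row count.
merged = left.merge(right_ok, how="left", on="seas_id", validate="many_to_one")
rows_preserved = len(merged) == len(left)

# With duplicated keys on the right, validation fails instead of adding rows.
try:
    left.merge(right_dup, how="left", on="seas_id", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(rows_preserved, raised)
```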

In [26]:
df2

Unnamed: 0,seas_id,season,player_id,player,pos,age,experience,lg,tm,g,...,drb_per_game,trb_per_game,ast_per_game,stl_per_game,blk_per_game,tov_per_game,pf_per_game,pts_per_game,rookie_tm,number_tm
0,31143,2024,5109,Adam Flagler,SG,24.0,1,NBA,OKC,2,...,0.0,0.0,2.0,0.0,0.0,0.0,0.0,1.5,Not Selected,
1,31144,2024,5110,Adama Sanogo,PF,21.0,1,NBA,CHI,9,...,1.9,4.0,0.0,0.1,0.0,0.6,0.6,4.0,Not Selected,
2,31154,2024,5111,Alex Fudge,SF,20.0,1,NBA,TOT,6,...,0.2,0.8,0.0,0.5,0.0,0.3,0.3,2.5,Not Selected,
3,31160,2024,5112,Amari Bailey,PG,19.0,1,NBA,CHO,10,...,0.3,0.9,0.7,0.3,0.0,0.4,0.2,2.3,Not Selected,
4,31161,2024,5113,Amen Thompson,SF,21.0,1,NBA,HOU,62,...,4.2,6.6,2.6,1.3,0.6,1.5,2.3,9.5,2nd,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2803,11784,1989,2474,Vernon Maxwell,SG,23.0,1,NBA,SAS,79,...,1.9,2.6,3.8,1.1,0.1,2.3,1.7,11.7,Not Selected,
2804,11786,1989,2475,Vinny Del Negro,SG,22.0,1,NBA,SAC,80,...,1.5,2.1,2.6,0.8,0.2,1.0,2.0,7.1,Not Selected,
2805,11795,1989,2476,Wayne Engelstad,PF,23.0,1,NBA,DEN,11,...,1.0,1.5,0.6,0.1,0.0,0.3,1.1,2.5,Not Selected,
2806,11796,1989,2477,Will Perdue,C,23.0,1,NBA,CHI,30,...,0.9,1.5,0.4,0.1,0.2,0.5,1.3,2.2,Not Selected,


We again rename the 'number_tm' column appropriately.

In [27]:
df2.rename(columns = {'number_tm': 'defense_tm'}, inplace = True)

In [28]:
df2[df2['defense_tm'] == '1st']

Unnamed: 0,seas_id,season,player_id,player,pos,age,experience,lg,tm,g,...,drb_per_game,trb_per_game,ast_per_game,stl_per_game,blk_per_game,tov_per_game,pf_per_game,pts_per_game,rookie_tm,defense_tm
100,31850,2024,5209,Victor Wembanyama,C,20.0,1,NBA,SAS,71,...,8.4,10.6,3.9,1.2,3.6,3.7,2.2,21.4,1st,1st


## Loading All-NBA Data

We would also like to include more data to show which players made an All-NBA team as rookies. Such players are again exceedingly rare, but this achievement would also be a great indication of future NBA success.

We obtain this data by once again using the 'End of Season Teams.csv' file.

In [29]:
all_nba = df3[df3['type'] == 'All-NBA']

In [30]:
all_nba = all_nba.drop(['season', 'lg', 'type', 'player', 'player_id', 'position', 'tm', 'age', 'birth_year'], axis = 1)

## Merging and Cleaning All-NBA Data

As above, we perform a left join between the existing dataset on the left and the All-NBA data on the right.

In [31]:
df2 = df2.merge(all_nba, how = "left", on = "seas_id", suffixes=('', '_y'))
df2.drop(df2.filter(regex = '_y$').columns.tolist(), axis = 1, inplace = True)

We again rename the 'number_tm' column appropriately.

In [32]:
df2.rename(columns = {'number_tm': 'nba_tm'}, inplace=True)

Moreover, we fill in the missing values with 'Not Selected'.

In [33]:
df2.defense_tm = df2.defense_tm.fillna('Not Selected')

In [34]:
df2.nba_tm = df2.nba_tm.fillna('Not Selected')

In [35]:
df2

Unnamed: 0,seas_id,season,player_id,player,pos,age,experience,lg,tm,g,...,trb_per_game,ast_per_game,stl_per_game,blk_per_game,tov_per_game,pf_per_game,pts_per_game,rookie_tm,defense_tm,nba_tm
0,31143,2024,5109,Adam Flagler,SG,24.0,1,NBA,OKC,2,...,0.0,2.0,0.0,0.0,0.0,0.0,1.5,Not Selected,Not Selected,Not Selected
1,31144,2024,5110,Adama Sanogo,PF,21.0,1,NBA,CHI,9,...,4.0,0.0,0.1,0.0,0.6,0.6,4.0,Not Selected,Not Selected,Not Selected
2,31154,2024,5111,Alex Fudge,SF,20.0,1,NBA,TOT,6,...,0.8,0.0,0.5,0.0,0.3,0.3,2.5,Not Selected,Not Selected,Not Selected
3,31160,2024,5112,Amari Bailey,PG,19.0,1,NBA,CHO,10,...,0.9,0.7,0.3,0.0,0.4,0.2,2.3,Not Selected,Not Selected,Not Selected
4,31161,2024,5113,Amen Thompson,SF,21.0,1,NBA,HOU,62,...,6.6,2.6,1.3,0.6,1.5,2.3,9.5,2nd,Not Selected,Not Selected
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2803,11784,1989,2474,Vernon Maxwell,SG,23.0,1,NBA,SAS,79,...,2.6,3.8,1.1,0.1,2.3,1.7,11.7,Not Selected,Not Selected,Not Selected
2804,11786,1989,2475,Vinny Del Negro,SG,22.0,1,NBA,SAC,80,...,2.1,2.6,0.8,0.2,1.0,2.0,7.1,Not Selected,Not Selected,Not Selected
2805,11795,1989,2476,Wayne Engelstad,PF,23.0,1,NBA,DEN,11,...,1.5,0.6,0.1,0.0,0.3,1.1,2.5,Not Selected,Not Selected,Not Selected
2806,11796,1989,2477,Will Perdue,C,23.0,1,NBA,CHI,30,...,1.5,0.4,0.1,0.2,0.5,1.3,2.2,Not Selected,Not Selected,Not Selected


## Loading Advanced Data

The next set of important variables for our model are player advanced statistics. We get this data by loading the 'Advanced.csv' dataset.

In [36]:
advanced = pd.read_csv('Advanced.csv', na_values = '?')

In [37]:
advanced = advanced.drop(['season', 'player_id', 'player', 'birth_year', 'pos', 'age', 'experience', 'lg', 'tm', 'g'], axis = 1)

## Merging and Cleaning Advanced Data

As above, we perform a left join between the existing dataset on the left and the advanced data on the right.

In [38]:
df2 = df2.merge(advanced, how = "left", on = "seas_id", suffixes=('', '_y'))
df2.drop(df2.filter(regex = '_y$').columns.tolist(), axis = 1, inplace = True)

In [39]:
df2

Unnamed: 0,seas_id,season,player_id,player,pos,age,experience,lg,tm,g,...,tov_percent,usg_percent,ows,dws,ws,ws_48,obpm,dbpm,bpm,vorp
0,31143,2024,5109,Adam Flagler,SG,24.0,1,NBA,OKC,2,...,0.0,21.7,0.0,0.0,0.0,-0.159,-6.7,-3.9,-10.6,0.0
1,31144,2024,5110,Adama Sanogo,PF,21.0,1,NBA,CHI,9,...,13.4,24.8,0.1,0.1,0.2,0.133,-0.4,-6.3,-6.7,-0.1
2,31154,2024,5111,Alex Fudge,SF,20.0,1,NBA,TOT,6,...,11.2,19.2,0.0,0.0,0.0,-0.007,-3.7,-1.5,-5.2,0.0
3,31160,2024,5112,Amari Bailey,PG,19.0,1,NBA,CHO,10,...,12.9,21.1,-0.1,0.0,-0.1,-0.044,-5.2,-3.1,-8.3,-0.1
4,31161,2024,5113,Amen Thompson,SF,21.0,1,NBA,HOU,62,...,14.9,18.5,1.9,2.4,4.3,0.149,-0.1,1.9,1.8,1.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2803,11784,1989,2474,Vernon Maxwell,SG,23.0,1,NBA,SAS,79,...,16.0,20.9,0.1,1.4,1.5,0.035,-1.3,-1.2,-2.5,-0.3
2804,11786,1989,2475,Vinny Del Negro,SG,22.0,1,NBA,SAC,80,...,12.3,16.4,1.5,1.0,2.5,0.077,-1.7,0.5,-1.2,0.3
2805,11795,1989,2476,Wayne Engelstad,PF,23.0,1,NBA,DEN,11,...,8.2,27.9,0.0,0.1,0.0,0.031,-5.0,-1.7,-6.7,-0.1
2806,11796,1989,2477,Will Perdue,C,23.0,1,NBA,CHI,30,...,16.1,21.0,-0.3,0.2,-0.1,-0.021,-6.7,-1.6,-8.3,-0.3


## Adding Actual All-NBA Outcomes

We also need to include the actual career outcomes for players in the dataset. We start this process by acquiring the player_id for every player who made an All-NBA team.

In [52]:
all_nba_ids = df3[df3['type'] == 'All-NBA']['player_id']

Now we create a dataframe with these player ids and 'all_nba' values of True. We also make sure to drop the duplicate rows representing players who were selected to multiple All-NBA teams.

In [41]:
outcomes = pd.DataFrame({'player_id': all_nba_ids, 'all_nba': True})

In [42]:
outcomes = outcomes.reset_index()

In [43]:
outcomes = outcomes.drop(['index'], axis = 1)

In [44]:
outcomes = outcomes.drop_duplicates()

Next, we perform a left-join between the previous features data on the left and the new labels data on the right. We also fill in the resulting missing values in the 'all_nba' column with False.

In [45]:
df2 = df2.merge(outcomes, how = "left", on = "player_id", suffixes=('', '_y'))
df2.drop(df2.filter(regex = '_y$').columns.tolist(), axis = 1, inplace = True)

In [46]:
df2.all_nba = df2.all_nba.fillna(False)
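
The label construction above can be illustrated on toy data. The sketch below assumes two small hypothetical frames (`teams` and `rookies`, with made-up IDs) and builds the label in two ways: with the merge-based approach used in this notebook, and with `Series.isin`, which is offered only as an alternative that should give the same result here.

```python
import pandas as pd

# Hypothetical toy data: player 10 made All-NBA twice, player 30 only All-Defense.
teams = pd.DataFrame({
    "player_id": [10, 10, 30],
    "type": ["All-NBA", "All-NBA", "All-Defense"],
})
rookies = pd.DataFrame({"player_id": [10, 20, 30]})

all_nba_ids = teams.loc[teams["type"] == "All-NBA", "player_id"]

# Merge-based labels: one row per All-NBA player, then a left join.
outcomes = pd.DataFrame({"player_id": all_nba_ids, "all_nba": True}).drop_duplicates()
via_merge = rookies.merge(outcomes, how="left", on="player_id")
via_merge["all_nba"] = via_merge["all_nba"].fillna(False).astype(bool)

# Alternative: membership test, which needs no de-duplication or merge.
via_isin = rookies["player_id"].isin(all_nba_ids)

print(via_merge)
print(via_isin.tolist())
```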

In [47]:
#profile = ProfileReport(df2,title="Data Profile")

# Save the report to .html
#profile.to_file("data_profile.html")

In [48]:
df2[df2['all_nba'] == True]

Unnamed: 0,seas_id,season,player_id,player,pos,age,experience,lg,tm,g,...,usg_percent,ows,dws,ws,ws_48,obpm,dbpm,bpm,vorp,all_nba
314,28972,2021,4808,Anthony Edwards,SG,19.0,1,NBA,MIN,72,...,27.0,-0.6,1.4,0.8,0.017,-0.4,-1.7,-2.1,-0.1,True
398,29613,2021,4892,Tyrese Haliburton,PG,20.0,1,NBA,SAC,58,...,18.1,2.8,0.7,3.5,0.096,1.9,-0.5,1.4,1.5,True
441,28537,2020,4723,Ja Morant,PG,20.0,1,NBA,MEM,67,...,25.9,2.3,1.5,3.8,0.088,1.4,-1.2,0.3,1.2,True
570,27864,2019,4630,Jalen Brunson,PG,22.0,1,NBA,DAL,73,...,19.1,1.5,1.0,2.6,0.077,-1.2,-0.5,-1.7,0.1,True
594,28015,2019,4654,Luka Dončić,SG,19.0,1,NBA,DAL,72,...,30.5,2.1,2.8,4.9,0.101,3.6,0.3,3.9,3.4,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2680,11937,1990,2509,Glen Rice,SF,22.0,1,NBA,MIA,77,...,21.6,-0.4,1.8,1.4,0.029,-2.1,-1.1,-3.2,-0.7,True
2714,12190,1990,2544,Shawn Kemp,PF,20.0,1,NBA,SEA,81,...,22.4,0.5,1.6,2.1,0.092,-2.2,0.2,-2.0,0.0,True
2722,12219,1990,2552,Tim Hardaway,PG,23.0,1,NBA,GSW,79,...,20.5,1.9,1.4,3.3,0.060,0.6,-0.8,-0.2,1.2,True
2778,11657,1989,2449,Mitch Richmond,SG,23.0,1,NBA,GSW,79,...,26.2,3.6,1.9,5.6,0.098,1.3,-1.4,-0.1,1.3,True


## Finishing Touches

There are 18 columns for rate statistics, including 'fg_percent', 'ft_percent', 'x3p_percent', 'ws_48', and more, that contain missing values for players who saw little playing time (for example, a player with no three-point attempts has no three-point percentage). We replace all of these missing values with zeroes.

In [49]:
df2 = df2.fillna(0)
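
Before filling, it can help to list which columns actually contain missing values. The following is a minimal sketch on a hypothetical toy frame (the values are made up, and only the column names echo the real data). It counts missing values per column and then applies the same `fillna(0)` step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: player B took no shots, so the rate stats are missing.
toy = pd.DataFrame({
    "player": ["A", "B"],
    "fg_percent": [0.45, np.nan],
    "x3p_percent": [np.nan, np.nan],
    "pts_per_game": [10.0, 0.0],
})

# Count missing values per column and keep the columns that have any.
missing_per_column = toy.isna().sum()
cols_with_missing = missing_per_column[missing_per_column > 0].index.tolist()

# Replace every remaining missing value with zero, as in the cell above.
filled = toy.fillna(0)

print(cols_with_missing)
print(filled)
```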

In [50]:
df2.to_csv('rookie.csv')
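
One detail to be aware of when the saved file is read back: by default, `to_csv` also writes the row index, which `read_csv` then loads as a column named 'Unnamed: 0'. The sketch below shows the difference on a hypothetical toy frame, using in-memory buffers in place of a file on disk. Whether to keep the index is a design choice; the call above keeps it.

```python
import io

import pandas as pd

# Hypothetical toy data standing in for the final dataset.
toy = pd.DataFrame({"player": ["A", "B"], "pts_per_game": [10.0, 4.5]})

# Default behavior: the index is written as an extra, unnamed column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False, only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```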