# Pre_processing: teams_model_data

This notebook builds a dataset of NBA teams for the 1999/00 through 2020/21 seasons, containing the variables needed to fit the linear model:


- Season
- Team
- W/L%

- DRtg_Unit_1_Rank 
- DRtg_Unit_1 
- DRtg_Unit_2_Rank
- DRtg_Unit_2
- DRtg_Unit_combined_Rank
- DRtg_Unit_combined 

- PER_Unit_1_Rank 
- PER_Unit_1
- PER_Unit_2_Rank
- PER_Unit_2
- PER_Unit_combined_Rank 
- PER_Unit_combined


# Virtual environment and packages

In [1]:
import sys
sys.executable

'C:\\Users\\pipeg\\miniconda3\\envs\\nba_team_maker\\python.exe'

In [2]:
import os

print(os.getcwd())
path = 'c:/Users/pipeg/Documents/GitHub/nba-team-creator/'
os.chdir(path)
os.getcwd()

c:\Users\pipeg\Documents\GitHub\nba-team-creator\pre_processing


'c:\\Users\\pipeg\\Documents\\GitHub\\nba-team-creator'

In [3]:
import numpy as np 
import pandas as pd
from preprocessing_functions import create_unit_indicator, sum_stat_by_unit

# Load raw data

In [4]:
adv_stats = pd.read_csv('raw_data/advanced.csv')
poss_stats = pd.read_csv('raw_data/100_possessions.csv')
teams_data = pd.read_csv('raw_data/teams.csv')
salaries = pd.read_csv('raw_data/salaries.csv')


In [5]:
# Drop the "TOT" rows, which combine a player's stats across several teams in one season
stats = adv_stats[adv_stats.Tm != "TOT"][['Player', 'Pos', 'Team', 'Season', 'MP', 'PER']]
# Create a unit indicator
stats = create_unit_indicator(stats)
# Add DRtg
stats = stats.merge(poss_stats[['Player', 'Team', 'Season', 'DRtg']], on = ['Player', 'Team', 'Season'])

## Creating:

- `MP_Rank`: the rank of a player's regular-season minutes played within their team. Example: for the 2020/21 Washington Wizards, Russell Westbrook has an `MP_Rank` of 1 because he played the most minutes for the team, Bradley Beal has 2, Rui Hachimura has 3, and so on down to the player with the fewest minutes.

- `Unit`: a categorical variable with the values 1, 2, or 3. These labels divide the players into the First Unit, Second Unit, and Third Unit according to their minutes played:

  - First Unit (FU): the 5 players who played the most minutes for a particular team and season.

  - Second Unit (SU): the next 5 players by minutes played for a particular team and season.

  - Third Unit (TU): all remaining players who played any minutes for a particular team and season.

- `PER_Rank`: the rank of a team's summed player PER within a regular season.

- `PER Rank FU`: the rank of the summed PER of the 5 players who played the most minutes for a particular team and season. Example: a "PER Rank FU" of 1 means that the team's First Unit had the highest summed PER that year.

- `PER Rank FU and SU`: the rank of the summed PER of the 10 players who played the most minutes for a particular team and season.

- `PER Rank all players`: the rank of the summed PER of all players on the team's roster in a given season.
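
The definitions of `MP_Rank` and `Unit` above can be sketched in a few lines of pandas. This is only an illustration: `unit_indicator_sketch` is a hypothetical stand-in, not the `create_unit_indicator` function imported from `preprocessing_functions`, whose implementation is not shown in this notebook. Breaking ties in minutes by row order is an assumption, and the toy roster is made up.

```python
import pandas as pd


def unit_indicator_sketch(stats: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for create_unit_indicator (not the project's code)."""
    out = stats.copy()
    # Rank players by minutes within each team-season (1 = most minutes).
    # Assumption: ties are broken by row order via method="first".
    out["MP_Rank"] = (
        out.groupby(["Team", "Season"])["MP"]
        .rank(method="first", ascending=False)
    )
    # Ranks 1-5 form unit 1, ranks 6-10 form unit 2, everyone else is unit 3.
    out["Unit"] = 3
    out.loc[out["MP_Rank"] <= 10, "Unit"] = 2
    out.loc[out["MP_Rank"] <= 5, "Unit"] = 1
    return out


# Made-up roster of 12 players for a single team-season.
toy = pd.DataFrame({
    "Player": [f"P{i:02d}" for i in range(1, 13)],
    "Team": "Toy Team",
    "Season": "2020/21",
    "MP": [2400, 2100, 1800, 1500, 1400, 1300, 1200, 1000, 900, 800, 300, 100],
})
ranked = unit_indicator_sketch(toy)
print(ranked[["Player", "MP", "MP_Rank", "Unit"]])
```

With 12 players, the first five rows land in unit 1, the next five in unit 2, and the last two in unit 3.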

In [8]:
df = pd.DataFrame()


# Sum PER within each unit, per team and season
df['PER_Unit_1'] = sum_stat_by_unit(stats, 1, 'PER')
df['PER_Unit_2'] = sum_stat_by_unit(stats, 2, 'PER')
df['PER_Unit_combined'] = df['PER_Unit_1'] + df['PER_Unit_2']

# Rank the summed PER within each season (index level 1); rank 1 = highest summed PER
df['PER_Unit_1_Rank'] = df.groupby(level=1)['PER_Unit_1'].rank(method='dense', ascending=False)
df['PER_Unit_2_Rank'] = df.groupby(level=1)['PER_Unit_2'].rank(method='dense', ascending=False)
df['PER_Unit_combined_Rank'] = df.groupby(level=1)['PER_Unit_combined'].rank(method='dense', ascending=False)

# Create summed DRtg per units
df['DRtg_Unit_1'] = sum_stat_by_unit(stats, 1, 'DRtg')
df['DRtg_Unit_2'] = sum_stat_by_unit(stats, 2, 'DRtg')
df['DRtg_Unit_combined'] = df['DRtg_Unit_1'] + df['DRtg_Unit_2']

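# Rank the summed DRtg within each season (index level 1).
# DRtg is points allowed per 100 possessions, so with ascending=False
# rank 1 marks the highest summed DRtg, not the best defense.
df['DRtg_Unit_1_Rank'] = df.groupby(level=1)['DRtg_Unit_1'].rank(method='dense', ascending=False)
df['DRtg_Unit_2_Rank'] = df.groupby(level=1)['DRtg_Unit_2'].rank(method='dense', ascending=False)
df['DRtg_Unit_combined_Rank'] = df.groupby(level=1)['DRtg_Unit_combined'].rank(method='dense', ascending=False)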

df.reset_index(inplace = True)
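
The cell above follows a sum-then-rank pattern: sum a stat over one unit for every team-season, then rank teams within each season. Below is a minimal sketch of that pattern on made-up data. `sum_stat_by_unit_sketch` is a hypothetical stand-in for `sum_stat_by_unit`, assuming it returns a Series indexed by (Team, Season); the real helper's implementation is not shown here.

```python
import pandas as pd


def sum_stat_by_unit_sketch(stats: pd.DataFrame, unit: int, stat: str) -> pd.Series:
    """Hypothetical stand-in for sum_stat_by_unit (not the project's code)."""
    in_unit = stats[stats["Unit"] == unit]
    # The result is indexed by (Team, Season), so level 1 is the season.
    return in_unit.groupby(["Team", "Season"])[stat].sum()


# Made-up player rows; the unit-2 row for team A must not count toward unit 1.
toy = pd.DataFrame({
    "Team": ["A", "A", "A", "B", "B", "C", "C"],
    "Season": ["2020/21"] * 7,
    "Unit": [1, 1, 2, 1, 1, 1, 1],
    "PER": [20.0, 15.0, 30.0, 18.0, 17.0, 10.0, 12.0],
})

summary = sum_stat_by_unit_sketch(toy, 1, "PER").to_frame("PER_Unit_1")
# Dense ranking: tied teams share a rank and the next rank is not skipped.
summary["PER_Unit_1_Rank"] = (
    summary.groupby(level=1)["PER_Unit_1"].rank(method="dense", ascending=False)
)
print(summary)
```

Teams A and B tie on summed PER, so both receive rank 1 and team C receives rank 2. With `method='min'`, C would instead receive rank 3.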

In [9]:
df.head()

,Team,Season,PER_Unit_1,PER_Unit_2,PER_Unit_combined,PER_Unit_1_Rank,PER_Unit_2_Rank,PER_Unit_combined_Rank,DRtg_Unit_1,DRtg_Unit_2,DRtg_Unit_combined,DRtg_Unit_1_Rank,DRtg_Unit_2_Rank,DRtg_Unit_combined_Rank
0,Atlanta Hawks,1999/00,77.0,62.7,139.7,20.0,17.0,21.0,540.0,542.0,1082.0,5.0,3.0,4.0
1,Atlanta Hawks,2000/01,73.9,51.7,125.6,24.0,28.0,28.0,519.0,526.0,1045.0,12.0,8.0,10.0
2,Atlanta Hawks,2001/02,77.4,53.9,131.3,19.0,25.0,26.0,531.0,538.0,1069.0,9.0,5.0,5.0
3,Atlanta Hawks,2002/03,80.3,53.4,133.7,20.0,25.0,26.0,530.0,534.0,1064.0,7.0,5.0,6.0
4,Atlanta Hawks,2003/04,75.2,69.6,144.8,24.0,10.0,16.0,528.0,535.0,1063.0,7.0,4.0,5.0


In [10]:
stats[(stats.Team == "Washington Wizards") & (stats.Season == "2020/21")].sort_values(by="MP_Rank", ascending=True)

,Player,Pos,Team,Season,MP,PER,MP_Rank,Unit,DRtg
19,Russell Westbrook,PG,Washington Wizards,2020/21,2369,19.5,1.0,1,110.0
1,Bradley Beal,SG,Washington Wizards,2020/21,2147,22.7,2.0,1,115.0
9,Rui Hachimura,PF,Washington Wizards,2020/21,1797,11.4,3.0,1,115.0
3,Dāvis Bertāns,PF,Washington Wizards,2020/21,1464,11.4,4.0,1,116.0
14,Raul Neto,PG,Washington Wizards,2020/21,1403,13.0,5.0,1,113.0
12,Robin Lopez,C,Washington Wizards,2020/21,1354,16.7,6.0,2,116.0
0,Deni Avdija,SF,Washington Wizards,2020/21,1257,7.6,7.0,2,113.0
13,Garrison Mathews,SG,Washington Wizards,2020/21,1038,9.7,8.0,2,116.0
17,Ish Smith,PG,Washington Wizards,2020/21,924,11.6,9.0,2,113.0
11,Alex Len,C,Washington Wizards,2020/21,903,17.5,10.0,2,110.0


In [11]:
df[(df.Season == "2020/21")].sort_values(by = "DRtg_Unit_1", ascending= False).head()

,Team,Season,PER_Unit_1,PER_Unit_2,PER_Unit_combined,PER_Unit_1_Rank,PER_Unit_2_Rank,PER_Unit_combined_Rank,DRtg_Unit_1,DRtg_Unit_2,DRtg_Unit_combined,DRtg_Unit_1_Rank,DRtg_Unit_2_Rank,DRtg_Unit_combined_Rank
555,Sacramento Kings,2020/21,85.3,64.9,150.2,12.0,21.0,15.0,589.0,584.0,1173.0,1.0,1.0,1.0
116,Cleveland Cavaliers,2020/21,70.2,65.3,135.5,29.0,20.0,27.0,580.0,564.0,1144.0,2.0,11.0,6.0
533,Portland Trail Blazers,2020/21,94.6,71.3,165.9,2.0,12.0,2.0,578.0,581.0,1159.0,3.0,2.0,2.0
52,Brooklyn Nets,2020/21,77.9,84.5,162.4,21.0,1.0,4.0,577.0,567.0,1144.0,4.0,8.0,6.0
467,Orlando Magic,2020/21,70.5,61.7,132.2,28.0,27.0,28.0,573.0,577.0,1150.0,5.0,3.0,4.0


In [12]:
teams_model_data = teams_data[['Team', 'Season', 'W/L%']].merge(df, on = ['Team', 'Season'])
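
`merge` defaults to an inner join, so any team-season present in only one of the two tables is dropped without warning. One possible way to check for that is sketched below on made-up data, using the `indicator` and `validate` parameters of `DataFrame.merge`. The toy frames and their values are hypothetical.

```python
import pandas as pd

# Made-up inputs: team C has a win rate but no unit-level stats.
teams_toy = pd.DataFrame({
    "Team": ["A", "B", "C"],
    "Season": ["2020/21"] * 3,
    "W/L%": [0.7, 0.5, 0.3],
})
units_toy = pd.DataFrame({
    "Team": ["A", "B"],
    "Season": ["2020/21"] * 2,
    "PER_Unit_1": [90.0, 80.0],
})

# A left join keeps every team-season; indicator=True adds a "_merge" column
# marking where each row came from, and validate raises if keys are duplicated.
checked = teams_toy.merge(
    units_toy,
    on=["Team", "Season"],
    how="left",
    indicator=True,
    validate="one_to_one",
)
unmatched = checked.loc[checked["_merge"] == "left_only", ["Team", "Season"]]
print(unmatched)
```

If `unmatched` is empty, the inner merge loses no rows from the left table. Otherwise the listed team-seasons usually point to a naming or season-format mismatch between the two sources.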

In [13]:
print(teams_model_data.columns)
teams_model_data.head()

Index(['Team', 'Season', 'W/L%', 'PER_Unit_1', 'PER_Unit_2',
       'PER_Unit_combined', 'PER_Unit_1_Rank', 'PER_Unit_2_Rank',
       'PER_Unit_combined_Rank', 'DRtg_Unit_1', 'DRtg_Unit_2',
       'DRtg_Unit_combined', 'DRtg_Unit_1_Rank', 'DRtg_Unit_2_Rank',
       'DRtg_Unit_combined_Rank'],
      dtype='object')


,Team,Season,W/L%,PER_Unit_1,PER_Unit_2,PER_Unit_combined,PER_Unit_1_Rank,PER_Unit_2_Rank,PER_Unit_combined_Rank,DRtg_Unit_1,DRtg_Unit_2,DRtg_Unit_combined,DRtg_Unit_1_Rank,DRtg_Unit_2_Rank,DRtg_Unit_combined_Rank
0,Utah Jazz,2020/21,0.722,80.4,78.0,158.4,18.0,3.0,9.0,542.0,543.0,1085.0,21.0,19.0,23.0
1,Phoenix Suns,2020/21,0.708,89.1,71.5,160.6,6.0,11.0,7.0,557.0,557.0,1114.0,16.0,16.0,19.0
2,Philadelphia 76ers,2020/21,0.681,93.6,69.8,163.4,4.0,14.0,3.0,538.0,537.0,1075.0,23.0,21.0,25.0
3,Brooklyn Nets,2020/21,0.667,77.9,84.5,162.4,21.0,1.0,4.0,577.0,567.0,1144.0,4.0,8.0,6.0
4,Los Angeles Clippers,2020/21,0.653,92.7,74.9,167.6,5.0,8.0,1.0,553.0,561.0,1114.0,19.0,14.0,19.0


# Save created data

In [14]:
teams_model_data.to_csv('out_data/teams_model_data.csv', index = False)