<img src="https://theundefeated.com/wp-content/uploads/2017/05/nba-logo.png?w=50" style="float: left; margin: 20px; height: 55px">

# NBA Player Prediction : 1.2 Preprocessing

_Authors: Patrick Wales-Dinan_

---

This project is an attempt to glean information about NBA players. When teams invest in a player they are taking a risk: they pay him millions of dollars for the promise of a return on the basketball court and/or through marketing and ticket sales. These two are obviously related, but the relationship is not one to one. Here I wanted to focus on whether you could predict the future performance of an NBA player based on his statistics in his first two years. I had a number of different theories about what might be interesting and useful. Ultimately I decided that I wanted to answer the following: **`Can I use a player's stats from his first two seasons in the NBA to predict whether he will achieve a benchmark?`** What benchmark? Here I have decided to try to predict whether a player will have a [PER](https://www.basketball-reference.com/about/per.html) of over 20 and play over 1000 minutes in a season. Below I will walk through the steps that I used to preprocess the dataframes and prepare for modeling.

---

## Notebook Contents:
- [Part 1](#Part-1)
- [Part 2](#Part-2)
- [Part 3](#Part-3)
- [Appendix](#Appendix)

## Other Notebooks:

- [1.3_Pre_Processing](/Users/pwalesdi/Desktop/GA/GA_Project_2/Project_2_Graphs_&_Relationships.ipynb)

In [1]:
import numpy as np
import pandas as pd
import os

pd.set_option('display.max_rows', 2000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [2]:
%store -r nba
%store -r improvement
%store -r per_improvement

## Part 1
### Second Year Dataframe Creation
Below I am creating a dataframe that contains only rookie seasons and a dataframe that contains only second seasons. The `rookie_year` DF has every player's rookie season from 2007 - 2019, and the `second_year` DF has every player's second season from 2007 - 2019. Both are limited to seasons with more than 1200 minutes played. I will use these respective dataframes to test different benchmarks in the future.

In [3]:
rookie_year = nba.loc[(nba['SEASON'] ==  nba['DRAFT_YEAR+1'])] # Where the season is equal to their (Draft_Year+1)
rookie_year = rookie_year.loc[rookie_year["MP"] > 1200] # They must play over 1200 minutes to qualify
second_year = nba.loc[(nba['SEASON'] ==  nba['DRAFT_YEAR+1'] + 1)] # Where the season is equal to (Draft_Year+1) + 1
second_year = second_year.loc[second_year["MP"] > 1200] # They must play over 1200 minutes to qualify

Below we are looking at the players who had a PER of over 20 in their rookie season. Since the task is to predict if players will achieve a PER greater than 20 in a future NBA season we are NOT looking to predict these players. These players are, for lack of a better term, superstar players who are above this modeling system.

In [4]:
# These are players who had over 20 PER in their rookie season
rookie_year.loc[(rookie_year["PER"] > 20)]

Unnamed: 0,Player_name,player_id,SEASON,Tm_x,DRAFT_YEAR+1,Draft_team,Pk,Pos,Age,G,MP,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP,College,Yrs,PTS,TRB,AST,FG%,3P%,FT%,PPG,RPG,APG,draft_round
28,Deandre Ayton,aytonde01,2019,PHO,2019.0,PHO,1.0,C,20,71,2183,20.5,0.608,0.005,0.217,11.1,26.1,18.5,9.6,1.3,2.6,11.7,21.2,4.1,1.7,5.8,0.128,-0.1,0.2,0.2,1.2,University of Arizona,1.0,1159.0,729.0,125.0,0.585,0.0,0.746,16.3,10.3,1.8,1.0
567,Mitchell Robinson,robinmi01,2019,NYK,2019.0,NYK,36.0,C,20,66,1360,22.0,0.692,0.0,0.464,13.3,19.4,16.3,4.2,1.8,10.0,9.1,12.1,4.0,2.2,6.1,0.217,0.5,5.2,5.7,2.7,,1.0,485.0,423.0,37.0,0.694,,0.6,7.3,6.4,0.6,2.0
2163,Karl-Anthony Towns,townska01,2016,MIN,2016.0,MIN,1.0,C,20,82,2627,22.5,0.59,0.076,0.239,10.3,27.5,19.0,11.3,1.1,4.3,12.6,24.9,5.5,2.8,8.3,0.151,1.6,1.1,2.8,3.2,University of Kentucky,4.0,7187.0,3831.0,839.0,0.537,0.392,0.836,22.3,11.9,2.6,1.0
3283,Anthony Davis,davisan02,2013,NOP,2013.0,NOH,1.0,PF,19,64,1846,21.7,0.559,0.009,0.333,10.5,23.5,16.8,6.1,2.2,5.1,10.3,21.8,3.7,2.4,6.1,0.159,0.5,1.9,2.3,2.0,University of Kentucky,7.0,11059.0,4906.0,982.0,0.517,0.314,0.795,23.7,10.5,2.1,1.0
3298,Andre Drummond,drumman01,2013,DET,2013.0,DET,9.0,C,19,60,1243,21.6,0.578,0.006,0.465,15.4,27.2,21.2,4.1,2.5,6.1,12.2,17.2,2.2,2.3,4.5,0.172,-0.3,2.6,2.4,1.4,University of Connecticut,7.0,7661.0,7424.0,626.0,0.543,0.147,0.448,14.1,13.7,1.2,1.0
3879,Kyrie Irving,irvinky01,2012,CLE,2012.0,CLE,1.0,PG,19,51,1558,21.4,0.566,0.245,0.262,3.1,11.2,7.0,36.5,1.8,1.0,16.1,28.7,3.4,0.6,4.1,0.125,4.9,-1.7,3.3,2.1,Duke University,8.0,11294.0,1852.0,2884.0,0.465,0.39,0.875,22.2,3.6,5.7,1.0


Below we are looking at the players who had a PER of over 20 in their second season. Since the task is to predict if players will achieve a PER greater than 20 in a future NBA season, we are NOT looking to predict these players. These players are, for lack of a better term, superstar players who are above this modeling system. It is important to note that this list also includes the players from the rookie table above who have played a second season. This is good. It means that those players not only achieved a PER higher than 20 in their first season, but also repeated that feat in their second season. `Deandre Ayton` and `Mitchell Robinson` are not included because they were rookies in 2019 and have not yet played their second season.

In [5]:
# These are players who had over 20 PER in their second season
second_year.loc[second_year["PER"] > 20]

Unnamed: 0,Player_name,player_id,SEASON,Tm_x,DRAFT_YEAR+1,Draft_team,Pk,Pos,Age,G,MP,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP,College,Yrs,PTS,TRB,AST,FG%,3P%,FT%,PPG,RPG,APG,draft_round
90,Thomas Bryant,bryanth01,2019,WAS,2018.0,UTA,42.0,C,21,72,1496,21.0,0.674,0.197,0.273,8.1,25.1,16.5,9.3,0.8,3.8,9.6,17.6,4.3,1.3,5.6,0.178,1.2,0.4,1.6,1.3,Indiana University,2.0,780.0,471.0,98.0,0.606,0.312,0.767,9.0,5.4,1.1,2.0
139,John Collins,collijo01,2019,ATL,2018.0,ATL,19.0,PF,21,61,1829,21.8,0.627,0.19,0.32,12.3,22.1,17.1,11.0,0.6,1.8,11.2,23.7,5.0,1.0,6.0,0.158,2.6,-1.2,1.4,1.6,Wake Forest University,2.0,1965.0,1136.0,219.0,0.566,0.346,0.743,14.6,8.4,1.6,1.0
1680,Karl-Anthony Towns,townska01,2017,MIN,2016.0,MIN,1.0,C,21,82,3030,25.9,0.618,0.186,0.289,11.2,27.8,19.4,13.2,0.9,2.9,11.3,27.5,9.9,2.8,12.7,0.201,4.7,0.2,4.9,5.3,University of Kentucky,4.0,7187.0,3831.0,839.0,0.537,0.392,0.836,22.3,11.9,2.6,1.0
1969,Nikola Jokić,jokicni01,2016,DEN,2015.0,DEN,41.0,C,20,80,1733,21.5,0.582,0.14,0.317,11.4,24.7,17.9,18.1,2.3,2.3,13.2,19.9,4.5,2.2,6.7,0.185,2.6,2.2,4.8,3.0,,4.0,5006.0,2946.0,1586.0,0.523,0.345,0.828,16.3,9.6,5.1,2.0
2382,Rudy Gobert,goberru01,2015,UTA,2014.0,DEN,27.0,C,22,82,2158,21.6,0.627,0.005,0.639,14.3,27.2,20.7,8.2,1.6,7.0,16.9,14.0,5.0,4.3,9.3,0.206,0.7,5.1,5.8,4.3,,6.0,4524.0,4275.0,545.0,0.629,0.0,0.631,11.1,10.5,1.3,1.0
2817,Anthony Davis,davisan02,2014,NOP,2013.0,NOH,1.0,PF,20,67,2358,26.5,0.582,0.009,0.438,10.0,23.3,16.6,8.0,2.0,6.7,8.3,25.2,7.2,3.2,10.4,0.212,2.3,2.2,4.5,3.9,University of Kentucky,7.0,11059.0,4906.0,982.0,0.517,0.314,0.795,23.7,10.5,2.1,1.0
2831,Andre Drummond,drumman01,2014,DET,2013.0,DET,9.0,C,20,81,2619,22.6,0.599,0.003,0.427,17.5,27.7,22.3,2.2,2.0,4.0,10.7,16.7,6.4,3.5,9.9,0.182,1.1,0.5,1.6,2.4,University of Connecticut,7.0,7661.0,7424.0,626.0,0.543,0.147,0.448,14.1,13.7,1.2,1.0
3388,Kyrie Irving,irvinky01,2013,CLE,2012.0,CLE,1.0,PG,20,59,2048,21.4,0.553,0.261,0.271,1.8,10.8,6.1,32.7,2.3,0.8,13.8,30.2,4.2,1.1,5.3,0.125,4.9,-1.6,3.3,2.8,Duke University,8.0,11294.0,1852.0,2884.0,0.465,0.39,0.875,22.2,3.6,5.7,1.0
3749,DeMarcus Cousins,couside01,2012,SAC,2011.0,SAC,5.0,C,21,64,1950,21.7,0.499,0.014,0.373,14.2,25.9,19.8,9.5,2.4,2.8,12.8,29.7,2.1,2.2,4.2,0.104,-0.4,0.0,-0.3,0.8,University of Kentucky,9.0,12006.0,6131.0,1832.0,0.461,0.332,0.737,21.2,10.9,3.2,1.0
3966,Greg Monroe,monrogr01,2012,DET,2011.0,DET,7.0,C,21,66,2082,22.0,0.563,0.001,0.348,13.3,23.6,18.3,14.0,2.1,1.7,15.2,23.6,4.3,2.7,7.0,0.161,2.4,1.4,3.8,3.1,Georgetown University,9.0,8326.0,5229.0,1354.0,0.514,0.059,0.704,13.2,8.3,2.1,1.0


#### Applying Target
I am going to use the `second_year` dataframe to create my model, so I need to remove all of the players listed in the table above. I assign them to the variable `remove`, make a list of those players' IDs, and drop every row for those players from the entire `nba` DF. Next I go through the remaining players and check whether they achieved this metric (a PER over 20 and more than 1000 minutes played) in any of their third through sixth seasons, which is what `range(2, 6)` covers. If they did, that season gets tagged with a 1. You can see below what that dataframe looks like:

In [6]:
# `remove` holds the players who had a PER > 20 in their second year (see the table above).
# `second_year` is already limited to MP > 1200, so the MP > 1000 check is redundant here.
# They don't count in this evaluation, so every one of their rows is dropped.
remove = second_year.loc[(second_year["PER"] > 20) & (second_year["MP"] > 1000)]
names = remove.player_id.unique()
nba = nba.loc[~nba["player_id"].isin(names)].copy()

# Assign a 1 to the target column for any season that meets the criteria below.
# range(2, 6) gives offsets 2-5 from the rookie season, i.e. the 3rd - 6th seasons.
nba["target"] = 0
for i in range(2, 6):
    nba.loc[(nba["PER"] > 20) & (nba["MP"] > 1000) & (nba['SEASON'] ==  nba['DRAFT_YEAR+1'] + i), "target"] = 1
# Take a look at the DF to make sure it looks correct
nba.sort_values(by='target', ascending=False).head(3)

Unnamed: 0,Player_name,player_id,SEASON,Tm_x,DRAFT_YEAR+1,Draft_team,Pk,Pos,Age,G,MP,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP,College,Yrs,PTS,TRB,AST,FG%,3P%,FT%,PPG,RPG,APG,draft_round,target
3352,James Harden,hardeja01,2013,HOU,2010.0,OKC,3.0,SG,23,78,2985,23.0,0.6,0.364,0.592,2.4,11.9,7.2,25.7,2.4,1.0,14.9,29.0,9.8,3.0,12.8,0.206,5.4,0.1,5.5,5.7,Arizona State University,10.0,18627.0,4000.0,4743.0,0.443,0.365,0.857,24.3,5.2,6.2,1.0,1
1671,Isaiah Thomas,thomais02,2017,BOS,2012.0,SAC,60.0,PG,27,76,2569,26.5,0.625,0.439,0.441,1.9,7.0,4.4,32.5,1.4,0.4,10.7,34.0,10.9,1.6,12.5,0.234,8.7,-3.3,5.4,4.8,University of Washington,8.0,9020.0,1220.0,2453.0,0.437,0.36,0.874,18.6,2.5,5.1,2.0,1
3162,Russell Westbrook,westbru01,2014,OKC,2009.0,SEA,4.0,PG,25,46,1412,24.7,0.545,0.271,0.37,4.6,15.9,10.5,40.2,3.1,0.4,16.1,34.4,3.0,2.3,5.2,0.178,5.2,1.2,6.4,3.0,University of California Los Angeles,11.0,18859.0,5760.0,6897.0,0.434,0.308,0.801,23.0,7.0,8.4,1.0,1
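
To see exactly which seasons the labeling loop can tag, here is a minimal sketch on a toy frame. The player `toy01` and all of the values are made up for illustration; only the column names and the loop itself mirror the cell above.

```python
import pandas as pd

# Toy data: one hypothetical player whose rookie season ('DRAFT_YEAR+1') is 2010.
toy = pd.DataFrame({
    "player_id": ["toy01"] * 7,
    "SEASON": [2010, 2011, 2012, 2013, 2014, 2015, 2016],
    "DRAFT_YEAR+1": [2010.0] * 7,
    "PER": [25.0] * 7,   # every season clears the PER bar on purpose
    "MP": [2000] * 7,    # ... and the minutes bar
})

toy["target"] = 0
for i in range(2, 6):  # offsets 2, 3, 4, 5 from the rookie season
    hit = (toy["PER"] > 20) & (toy["MP"] > 1000) & (toy["SEASON"] == toy["DRAFT_YEAR+1"] + i)
    toy.loc[hit, "target"] = 1

labeled_seasons = toy.loc[toy["target"] == 1, "SEASON"].tolist()
print(labeled_seasons)  # [2012, 2013, 2014, 2015]
```

Even though every toy season meets the PER and minutes thresholds, only the third through sixth seasons are tagged; the rookie year, the second year, and the seventh year are left at 0.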


Next I need to remove any players who were drafted before 2006, because I don't have their first or second seasons, so they won't go into this analysis. **Note**: it would be really interesting to go back and do this for more seasons. I think that I'll do this afterwards.

In [7]:
# removing players drafted before 2006
delete = nba.loc[(nba['DRAFT_YEAR+1'].isnull()) & (nba["SEASON"] == 2006)]["player_id"]
nba = nba.loc[~nba["player_id"].isin(delete)]
nba.sort_values(by=["player_id", "SEASON"], ascending=(False, False)).head(8)

Unnamed: 0,Player_name,player_id,SEASON,Tm_x,DRAFT_YEAR+1,Draft_team,Pk,Pos,Age,G,MP,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP,College,Yrs,PTS,TRB,AST,FG%,3P%,FT%,PPG,RPG,APG,draft_round,target
706,Ivica Zubac,zubaciv01,2019,LAL,2017.0,LAL,32.0,C,21,33,516,19.5,0.633,0.0,0.342,11.0,21.2,16.2,7.3,0.4,4.3,12.9,20.5,1.1,0.6,1.8,0.165,-1.3,0.4,-0.9,0.1,,3.0,970.0,644.0,118.0,0.54,0.0,0.761,6.9,4.6,0.8,2.0,0
707,Ivica Zubac,zubaciv01,2019,LAC,2017.0,LAL,32.0,C,21,26,524,18.3,0.574,0.0,0.323,12.7,27.2,20.2,10.9,0.9,3.6,14.8,20.1,0.7,0.7,1.4,0.132,-1.2,1.5,0.3,0.3,,3.0,970.0,644.0,118.0,0.54,0.0,0.761,6.9,4.6,0.8,2.0,0
1247,Ivica Zubac,zubaciv01,2018,LAL,2017.0,LAL,32.0,C,20,43,410,15.3,0.557,0.008,0.418,11.8,20.1,16.0,8.8,0.9,3.0,15.3,17.6,0.5,0.5,1.0,0.118,-2.7,0.5,-2.2,0.0,,3.0,970.0,644.0,118.0,0.54,0.0,0.761,6.9,4.6,0.8,2.0,0
1733,Ivica Zubac,zubaciv01,2017,LAL,2017.0,LAL,32.0,C,19,38,609,17.0,0.547,0.013,0.206,7.1,21.9,14.3,8.1,1.1,4.4,10.4,20.3,0.6,0.5,1.1,0.086,-2.7,0.3,-2.5,-0.1,,3.0,970.0,644.0,118.0,0.54,0.0,0.761,6.9,4.6,0.8,2.0,0
704,Ante Žižić,zizican01,2019,CLE,2017.0,BOS,23.0,C,22,59,1082,16.2,0.59,0.0,0.399,10.6,22.8,16.4,7.7,0.6,1.8,13.6,18.2,1.7,0.3,2.0,0.087,-1.7,-1.4,-3.1,-0.3,,2.0,578.0,380.0,58.0,0.583,,0.708,6.4,4.2,0.6,1.0,0
1246,Ante Žižić,zizican01,2018,CLE,2017.0,BOS,23.0,C,21,32,214,24.2,0.746,0.0,0.433,12.8,18.6,15.7,3.8,0.5,5.2,12.1,18.8,0.9,0.2,1.0,0.231,1.3,-1.2,0.1,0.1,,2.0,578.0,380.0,58.0,0.583,,0.708,6.4,4.2,0.6,1.0,0
1245,Paul Zipser,zipsepa01,2018,CHI,2017.0,CHI,48.0,SF,23,54,824,5.2,0.445,0.47,0.107,1.6,16.0,8.5,8.0,1.2,1.6,14.9,15.2,-1.1,0.6,-0.6,-0.034,-5.5,-0.3,-5.9,-0.8,,2.0,458.0,256.0,82.0,0.371,0.335,0.769,4.7,2.6,0.8,2.0,0
1732,Paul Zipser,zipsepa01,2017,CHI,2017.0,CHI,48.0,SF,22,44,843,6.9,0.503,0.448,0.181,1.9,14.2,8.0,6.1,0.9,1.5,14.4,14.4,-0.3,0.8,0.5,0.03,-3.6,-0.1,-3.7,-0.4,,2.0,458.0,256.0,82.0,0.371,0.335,0.769,4.7,2.6,0.8,2.0,0


Finally I need to pull out just the second year from this NBA DF so that I can model it. I do that below by looking at the rows in the `nba` dataframe where the season is equal to `DRAFT_YEAR+1` plus one. This rendered 490 rows; a player who changed teams mid-season can appear more than once. I then stored that dataframe as **`second_year_prime`** for use in the modeling and visualization notebooks. **Note**: This dataframe only includes the players' second years, so nothing about what they accomplished in their rookie year is expressed in it. This could be a problem, so we also want to look at the rookie year + the second year. We'll do this coming up.

In [8]:
second_year_prime = nba.loc[nba['DRAFT_YEAR+1'] + 1 == nba["SEASON"]]
print(second_year_prime.shape)

(490, 44)
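
The shape above counts rows, and the earlier output shows Ivica Zubac with two 2019 rows (LAL and LAC), so rows and unique players are not necessarily the same number. A minimal sketch of how the two could be compared, using a made-up frame (`toy_second_year` and its player IDs are hypothetical):

```python
import pandas as pd

# Hypothetical second-year rows; "toyb01" was traded mid-season and has two rows.
toy_second_year = pd.DataFrame({
    "player_id": ["toya01", "toyb01", "toyb01", "toyc01"],
    "Tm_x": ["BOS", "LAL", "LAC", "CHI"],
    "SEASON": [2019, 2019, 2019, 2019],
})

n_rows = len(toy_second_year)                         # number of observations
n_players = toy_second_year["player_id"].nunique()    # number of distinct players
# IDs that show up on more than one row
dup_mask = toy_second_year["player_id"].duplicated(keep=False)
repeated = toy_second_year.loc[dup_mask, "player_id"].unique().tolist()
print(n_rows, n_players, repeated)  # 4 3 ['toyb01']
```

Running the same three lines against `second_year_prime` would show whether any player contributes more than one observation to the model.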


## Part 2
### Rookie and Second Year Aggregated Dataframe Creation
Here I am going to create a dataframe that combines each player's first- and second-year stats into one aggregated entry. In order to do that we have to make some important decisions about how to treat each variable or feature. For example, `G` (games played) and `MP` (minutes played) can simply be added together from one season to the next. A feature like [`TS%`](https://www.basketball-reference.com/about/glossary.html) (true shooting percentage), a measure of shooting efficiency that takes into account field goals, 3-point field goals, and free throws, is a rate statistic and thus needs to be weighted before it can be aggregated. I found some code on Stack Overflow showing how to [calculate a weighted average](https://stackoverflow.com/questions/31521027/groupby-weighted-average-and-sum-in-pandas-dataframe) for columns of a dataframe. I used this function to calculate the average of each rate variable, weighted by minutes played, which you can see below, and then I summed the variables that are cumulative in nature, like `points`, `rebounds`, `assists`, `minutes played`, etc.

In [9]:
# Here I am grabbing all of the rows that consist of a player's first or second season in the NBA
first_second_seasons = nba.loc[(nba['SEASON'] == nba['DRAFT_YEAR+1'] + 1) | (nba['SEASON'] == nba['DRAFT_YEAR+1'])].sort_values(by="Player_name")
# A few players have NaN for 3P% because they didn't take any threes. It is appropriate to fill these with zeros.
# Note that this also fills every other missing field (e.g. College) with 0.
first_second_seasons.fillna(0, inplace=True)

In [10]:
# Creating two lists: the columns that will be weighted and the columns that will be summed.
weighted_cols = ['PER', 'TS%', '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%', 'WS/48', 'OBPM', 'DBPM', 'BPM','FG%', '3P%', 'FT%', 'PPG', 'RPG', 'APG']
sum_cols = ['G', 'MP', 'VORP', 'PTS', 'TRB', 'AST', 'WS']

In [11]:
# This is the weighted average function that I found on Stack Overflow, with a few additions of my own.
def weighted_average(df, data_col, weight_col, by_col):
    # Numerator: value * weight for each row
    df['_data_times_weight'] = df[data_col]*df[weight_col]
    # Denominator: weight, counted only where the value is not null
    df['_weight_where_notnull'] = df[weight_col]*pd.notnull(df[data_col])
    g = df.groupby(by_col) # This is my own code
    result = pd.DataFrame() # This is my own code
    result[data_col] = round(g['_data_times_weight'].sum() / g['_weight_where_notnull'].sum(),4)
    # Remove the helper columns so the input dataframe is left as it was
    del df['_data_times_weight'], df['_weight_where_notnull']
    return result
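
As a worked illustration of why rate stats are weighted by minutes rather than simply averaged, here is a minimal sketch on a made-up two-season line. The player `toy01` and the numbers are assumptions for illustration, and the calculation is written out directly rather than through the function above.

```python
import pandas as pd

# Hypothetical two-season line for one player (values are made up).
toy = pd.DataFrame({
    "player_id": ["toy01", "toy01"],
    "MP": [500, 1500],
    "TS%": [0.500, 0.600],
})

# Minutes-weighted mean: sum(TS% * MP) / sum(MP), computed per player.
weighted = (
    (toy["TS%"] * toy["MP"]).groupby(toy["player_id"]).sum()
    / toy["MP"].groupby(toy["player_id"]).sum()
).round(4)

# Plain mean of the two season values, for comparison.
simple = toy.groupby("player_id")["TS%"].mean().round(4)
print(weighted["toy01"], simple["toy01"])  # 0.575 0.55
```

The season with three times the minutes pulls the combined figure toward 0.600, which is what we want from an efficiency stat that covers both seasons.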

In [12]:
# Here I am applying the weighted average function to the DF to transform some of those columns. First I create a new
# dataframe that has the columns that I want. I am calling the DF first_second_seasons_agg
first_second_seasons_agg = first_second_seasons[['Player_name', 'player_id', 'Tm_x', 'DRAFT_YEAR+1', 'Draft_team', 'Pk', 'Pos', 'Age', 'College', 'Yrs', 'draft_round', 'target']]
# drop_duplicates returns a new dataframe; dropping in place on a slice is what triggers pandas' SettingWithCopyWarning
first_second_seasons_agg = first_second_seasons_agg.drop_duplicates(subset='player_id')
# Here I am applying the weighted average function to all of the columns in the weighted_cols list
for col in weighted_cols:
    weighted_df = weighted_average(first_second_seasons, col, 'MP', 'player_id')
    first_second_seasons_agg = first_second_seasons_agg.merge(weighted_df, on='player_id') # Then I am merging this dataframe with the one from above: first_second_seasons_agg
sum_df = first_second_seasons.groupby(by='player_id')[sum_cols].sum() # This is a DF of the sum of the list of sum_cols
first_second_seasons_agg = first_second_seasons_agg.merge(sum_df, on='player_id') # This is merging that sum_df with the first_second_seasons_agg df
first_second_seasons_agg

Unnamed: 0,Player_name,player_id,Tm_x,DRAFT_YEAR+1,Draft_team,Pk,Pos,Age,College,Yrs,draft_round,target,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,WS/48,OBPM,DBPM,BPM,FG%,3P%,FT%,PPG,RPG,APG,G,MP,VORP,PTS,TRB,AST,WS
0,A.J. Hammons,hammoaj01,DAL,2017.0,DAL,46.0,C,24,Purdue University,1.0,2.0,0,8.4,0.472,0.238,0.476,5.4,21.0,12.8,3.8,0.3,7.2,16.4,17.6,-0.001,-7.5,2.0,-5.6,0.405,0.5,0.45,2.2,1.6,0.2,22,163,-0.1,48.0,36.0,4.0,0.0
1,A.J. Price,priceaj01,IND,2010.0,IND,52.0,PG,23,University of Connecticut,6.0,2.0,0,12.4196,0.4936,0.4795,0.2316,1.8831,8.7901,5.3127,21.5536,1.9042,0.1521,13.1042,22.7,0.0434,-0.4226,-2.1437,-2.5663,0.38,0.316,0.742,5.8,1.4,2.2,106,1660,-0.3,3046.0,754.0,1134.0,1.5
2,Aaron Brooks,brookaa01,HOU,2008.0,HOU,26.0,PG,23,University of Oregon,10.0,1.0,0,12.9467,0.5243,0.4158,0.2195,2.1167,7.21,4.74,20.9699,1.1767,0.37,13.2599,22.6434,0.0921,0.1533,-1.57,-1.4167,0.413,0.37,0.837,9.7,1.7,3.0,131,2606,0.4,12518.0,2158.0,3850.0,5.0
3,Aaron Gordon,gordoaa01,ORL,2015.0,ORL,4.0,PF,19,University of Arizona,5.0,1.0,0,15.3221,0.5338,0.2408,0.321,8.221,20.2513,14.1712,9.1015,1.5101,2.3401,10.4682,16.7607,0.1153,-0.4187,0.8405,0.4217,0.45,0.321,0.707,12.5,6.2,2.2,125,2660,1.6,8498.0,4224.0,1472.0,6.4
4,Aaron Gray,grayaa01,CHI,2008.0,CHI,49.0,C,24,University of Pittsburgh,7.0,2.0,0,12.6078,0.5177,0.0055,0.4139,13.0384,20.3152,16.6306,10.4309,1.3693,1.8923,17.8924,17.962,0.0768,-2.2385,0.5077,-1.7308,0.509,0.0,0.562,3.4,3.7,0.7,117,1328,0.1,2132.0,2380.0,414.0,2.1
5,Aaron Holiday,holidaa01,IND,2019.0,IND,23.0,PG,22,University of California Los Angeles,1.0,1.0,0,11.9,0.518,0.485,0.191,0.9,10.4,5.8,19.3,1.6,1.8,12.3,21.9,0.065,-1.5,-1.1,-2.6,0.401,0.339,0.82,5.9,1.3,1.7,50,646,-0.1,294.0,67.0,87.0,0.9
6,Abdel Nader,naderab01,BOS,2017.0,BOS,58.0,SF,24,Iowa State University,2.0,2.0,0,5.1,0.439,0.436,0.262,2.8,12.0,7.4,7.1,1.4,1.6,17.0,17.1,-0.014,-5.9,0.3,-5.6,0.387,0.333,0.667,3.6,1.7,0.4,48,522,-0.5,387.0,187.0,46.0,-0.1
7,Acie Law,lawac01,ATL,2008.0,ATL,11.0,PG,24,Texas A&M University,4.0,1.0,0,8.8396,0.4688,0.1689,0.293,1.4821,7.9256,4.7039,21.5789,1.5249,0.1572,16.6604,16.4214,0.0201,-3.6677,-2.3642,-6.0319,0.413,0.235,0.778,3.9,1.0,1.6,111,1425,-1.4,1450.0,390.0,618.0,0.6
8,Adam Morrison,morriad01,CHO,2007.0,CHA,3.0,PF,22,Gonzaga University,3.0,1.0,0,7.9,0.45,0.273,0.179,2.7,9.2,5.9,11.9,0.6,0.2,11.3,22.4,-0.03,-3.0,-2.5,-5.5,0.373,0.331,0.71,7.5,2.1,1.4,78,2326,-2.0,1200.0,342.0,222.0,-1.5
9,Adreian Payne,paynead01,MIN,2015.0,ATL,15.0,PF,24,Michigan State University,4.0,1.0,0,5.6,0.422,0.221,0.179,4.8,21.5,13.3,8.9,1.7,1.8,18.7,17.7,-0.047,-5.9,-0.2,-6.1,0.406,0.254,0.68,4.0,2.9,0.6,52,486,-0.5,429.0,315.0,66.0,-0.5


## Part 3
### Target Application to both Dataframes
So now I have two dataframes. The first is called `second_year_prime`; it contains every player's second-year data as individual observations. There are 16 players who I removed from this DF. These players were removed because they recorded a PER of over 20 and over 1200 minutes played in their rookie year, their second season, or both. The second dataframe is called `first_second_seasons_agg`. It contains the summed or minutes-weighted stats of every player's rookie and second years combined. Keep in mind that none of the players in these two DFs has a season in them with a PER of over 20 and over 1200 minutes played. What we know is that many of them will. So what I am doing below is tagging the players who go on to achieve this with a 1 in a column called `target`. Remember that earlier we applied 1 as a [target](#Applying-Target) to the qualifying seasons in the entire `nba` dataframe. So now we can look up the players with a tagged season in that DF and give them the same label in these two dataframes. You can see that we are doing that below.

In [13]:
# Getting the subset of data where target = 1
explore = nba.loc[nba["target"] == 1]
# second_year_prime was sliced from nba, so take an explicit copy before assigning to it
second_year_prime = second_year_prime.copy()
# applying the target label to every matching player in both DFs
second_year_prime.loc[second_year_prime["player_id"].isin(explore['player_id']), "target"] = 1
first_second_seasons_agg.loc[first_second_seasons_agg["player_id"].isin(explore['player_id']), "target"] = 1


In [14]:
first_second_seasons_agg.sort_values(by='target', ascending=False).head()

Unnamed: 0,Player_name,player_id,Tm_x,DRAFT_YEAR+1,Draft_team,Pk,Pos,Age,College,Yrs,draft_round,target,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,WS/48,OBPM,DBPM,BPM,FG%,3P%,FT%,PPG,RPG,APG,G,MP,VORP,PTS,TRB,AST,WS
48,Bradley Beal,bealbr01,WAS,2013.0,WAS,3.0,SG,20,University of Florida,7.0,1.0,1,14.0143,0.5103,0.3174,0.1913,2.5225,10.4449,6.4633,15.0612,1.4408,0.8857,9.8898,23.4428,0.0784,0.1184,-0.7694,-0.6918,0.452,0.384,0.798,19.8,3.9,3.7,129,4275,1.4,19368.0,3854.0,3648.0,7.0
84,Clint Capela,capelca01,HOU,2015.0,HOU,25.0,C,21,0,5.0,1.0,1,17.9425,0.5447,0.0028,0.5461,14.2789,23.2247,18.7076,5.0135,1.9193,5.3499,10.9231,16.3998,0.1368,-1.6382,1.6596,0.0215,0.635,0.0,0.525,12.0,9.2,0.9,89,1561,0.8,7064.0,5412.0,552.0,4.4
544,Stephen Curry,curryst01,GSW,2010.0,GSW,7.0,PG,21,Davidson College,10.0,1.0,1,17.7329,0.5805,0.3288,0.194,2.0311,11.4916,6.6613,26.2177,2.3613,0.5462,16.4538,23.0017,0.1006,2.5858,-1.116,1.416,0.477,0.436,0.905,23.5,4.5,6.6,154,5385,4.6,32630.0,6264.0,9176.0,11.3
348,Kenneth Faried,farieke01,DEN,2012.0,DEN,22.0,PF,22,Morehead State University,8.0,1.0,1,19.5733,0.5872,0.0,0.4318,14.2417,23.1053,18.7735,5.2263,1.7684,3.121,12.3316,18.0157,0.1812,1.2735,1.3422,2.5788,0.546,0.222,0.654,11.4,8.1,1.0,126,3285,3.8,10860.0,7768.0,940.0,12.4
208,Gordon Hayward,haywago01,UTA,2011.0,UTA,9.0,SF,20,Butler University,9.0,1.0,1,13.6916,0.5718,0.2642,0.3715,3.5637,9.8623,6.713,13.7526,1.3623,1.487,15.0563,16.8582,0.0904,0.7191,-0.5898,0.1293,0.446,0.364,0.821,15.1,4.2,3.4,138,3233,1.8,17808.0,4944.0,4012.0,6.1
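
The label transfer in the cell above moves a season-level flag onto player-level rows. A minimal sketch of the same pattern on made-up frames (`seasons`, `players`, and the IDs are hypothetical names for illustration):

```python
import pandas as pd

# Hypothetical season-level rows; only "toya01" has a season tagged target == 1.
seasons = pd.DataFrame({
    "player_id": ["toya01", "toya01", "toyb01"],
    "target": [0, 1, 0],
})
# Hypothetical one-row-per-player frame that starts with every target at 0.
players = pd.DataFrame({"player_id": ["toya01", "toyb01"], "target": [0, 0]})

# Collect the IDs with at least one tagged season, then flag those players.
hit_ids = seasons.loc[seasons["target"] == 1, "player_id"].unique()
players.loc[players["player_id"].isin(hit_ids), "target"] = 1
print(players["target"].tolist())  # [1, 0]
```

A player is labeled 1 as soon as any one of his later seasons is tagged, regardless of how many untagged seasons he also has.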


In [15]:
%store first_second_seasons_agg
%store second_year_prime

Stored 'first_second_seasons_agg' (DataFrame)
Stored 'second_year_prime' (DataFrame)


## Appendix
I may use these for future projects or visualizations, but I didn't actually use them here.
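
The per-season frames in this appendix all apply one filter with a different offset, so they could also be built with a single helper. This is only a sketch: `nth_season` and `by_year` are hypothetical names, not part of the notebook, and the toy frame stands in for `nba`.

```python
import pandas as pd

def nth_season(df, n, rookie_col="DRAFT_YEAR+1", season_col="SEASON"):
    """Return the rows for each player's n-th NBA season (n=1 is the rookie year)."""
    return df.loc[df[season_col] == df[rookie_col] + (n - 1)]

# Toy stand-in for the nba dataframe: one made-up player with four seasons.
toy = pd.DataFrame({
    "player_id": ["toy01"] * 4,
    "SEASON": [2010, 2011, 2012, 2013],
    "DRAFT_YEAR+1": [2010.0] * 4,
})

# One dataframe per season number, third through thirteenth.
by_year = {n: nth_season(toy, n) for n in range(3, 14)}
print(by_year[3]["SEASON"].tolist())  # [2012]
print(len(by_year[13]))               # 0
```

Keying the frames by season number keeps the offset arithmetic in one place instead of eleven separate assignments.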

In [None]:
third_year = nba.loc[(nba['SEASON'] ==  nba['DRAFT_YEAR+1'] + 2)]
fourth_year = nba.loc[(nba['SEASON'] ==  nba['DRAFT_YEAR+1'] + 3)]
fifth_year = nba.loc[(nba['SEASON'] ==  nba['DRAFT_YEAR+1'] + 4)]
sixth_year = nba.loc[(nba['SEASON'] ==  nba['DRAFT_YEAR+1'] + 5)]
seventh_year = nba.loc[(nba['SEASON'] ==  nba['DRAFT_YEAR+1'] + 6)]
eight_year = nba.loc[(nba['SEASON'] ==  nba['DRAFT_YEAR+1'] + 7)]
ninth_year = nba.loc[(nba['SEASON'] ==  nba['DRAFT_YEAR+1'] + 8)]
tenth_year = nba.loc[(nba['SEASON'] ==  nba['DRAFT_YEAR+1'] + 9)]
eleventh_year = nba.loc[(nba['SEASON'] ==  nba['DRAFT_YEAR+1'] + 10)]
twelth_year = nba.loc[(nba['SEASON'] ==  nba['DRAFT_YEAR+1'] + 11)]
thirteen_year = nba.loc[(nba['SEASON'] ==  nba['DRAFT_YEAR+1'] + 12)]
eastern = nba.loc[nba['Tm_x'].isin(["MIA", "BRK", "MIL", "TOR", "CHI", "WAS", "ORL", "CHO", "BOS", "ATL", 
                          "IND", "PHI", "DET", "NYK", "CLE"])]
western = nba.loc[nba['Tm_x'].isin(["OKC", "PHO", "DEN", "SAS", "UTA", "POR", "MEM", "DAL", "HOU", "SAC", 
                                "LAL", "MIN", "GSW", "NOP", "LAC"])]