# Name: Michael Taylor
### Date: 8/19

<style>
.jp-Notebook {
    padding: var(--jp-notebook-padding);
    margin-left: 160px;
    outline: none;
    overflow: auto;
    background: var(--jp-layout-color0);
}
</style>

<img src="https://cdn.nba.com/logos/nba/1610612760/primary/L/logo.svg" alt="logo" style="position: fixed; top: -40px; left: 5px; height: 250px;">

# Introduction  

The purpose of this project is to gauge your technical skills and problem solving ability by working through something similar to a real NBA data science project. You will work your way through this Jupyter notebook, answering questions as you go along. Please begin by adding your name to the top markdown chunk in this document. When you're finished with the document, come back and type your answers into the answer key at the top. Please leave all your work below and have your answers where indicated below as well. Please note that we will be reviewing your code, so make it clear and concise, and avoid long printouts. Feel free to add in as many new code chunks as you'd like.

Remember that we will be grading the quality of your code and visuals alongside the correctness of your answers. Please try to use packages like pandas/numpy and matplotlib/seaborn as much as possible (instead of base python data manipulations and explicit loops.)  

**WARNING:** Your project will **ONLY** be graded if it's knit to an HTML document where we can see your code. Be careful to make sure that any long lines of code visibly wrap around to the next line, as code that's cut off from the side of the document cannot be graded.  

**Note:**    

**Throughout this document, any `season` column represents the year each season started. For example, the 2015-16 season will be in the dataset as 2015. For most of the rest of the project, we will refer to a season by just this number (e.g. 2015) instead of the full text (e.g. 2015-16).** 
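
As a quick illustration of this convention, the sketch below converts a season's start year into the full season text. The helper name `season_label` is an assumption made for this example and is not part of the provided data or code.

```python
def season_label(start_year):
    """Return the full season text for a start year, e.g. 2015 -> '2015-16'."""
    # The second half of the label is the last two digits of the following year.
    end_two_digits = (start_year + 1) % 100
    return f"{start_year}-{end_two_digits:02d}"


print(season_label(2015))  # 2015-16
print(season_label(2021))  # 2021-22
```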

# Answers  

## Part 1      

**Question 1:**   

- 1st Team: 25.9 points per game  
- 2nd Team: 23.1 points per game  
- 3rd Team: 20.6 points per game  
- All-Star: 21.6 points per game   

**Question 2:** 4.7 Years  

**Question 3:** 

- Elite: 2 players.  
- All-Star: 1 player.  
- Starter: 11 players.  
- Rotation: 7 players.  
- Roster: 14 players.  
- Out of League: 38 players.  

**Open Ended Modeling Question:** Please show your work and leave all responses below in the document.


## Part 2  

**Question 1:** 30.0%   
**Question 2:** Written question, put answer below in the document.    
**Question 3:** Written question, put answer below in the document.    
  


# Setup and Data    

In [852]:
import pandas as pd
import numpy as np
# Note you will likely have to change these paths. 
# If your data is in the same folder as this project, 
# the paths will likely be fixed for you by deleting ../../Data/awards_project/ from each string.
awards = pd.read_csv("../../Data/awards_project/awards_data.csv")
player_data = pd.read_csv("../../Data/awards_project/player_stats.csv")
team_data = pd.read_csv("../../Data/awards_project/team_stats.csv")
rebounding_data = pd.read_csv("../../Data/awards_project/team_rebounding_data_22.csv")

## Part 1 -- Awards  

In this section, you're going to work with data relating to player awards and statistics. You'll start with some data manipulation questions and work towards building a model to predict broad levels of career success.  


### Question 1  

**QUESTION:** What is the average number of points per game for players in the 2007-2021 seasons who won All NBA First, Second, and Third teams (**not** the All Defensive Teams), as well as for players who were in the All-Star Game (**not** the rookie all-star game)?


 

In [854]:
#Finding regular season ppg of players who made each all nba team
#subset player data to include only needed data
playersP1 = player_data[["season", "nbapersonid", "points", "games", "player"]]
#account for players who switched teams in the middle of the season
agg_functions = {'season': 'first', 'nbapersonid': 'first', 
                 'points': 'sum', 'games': 'sum', 'player': 'first'}
playersP1 = playersP1.groupby(['season', 'nbapersonid']).aggregate(agg_functions)
playersP1 = playersP1[["points", "games", "player"]].reset_index()

#Subset to needed columns and keep only rows with an All NBA or All-Star selection
allNba = awards[["season", "nbapersonid", "All NBA First Team", 
                 "All NBA Second Team", "All NBA Third Team", "all_star_game"]]
allNba = allNba[(allNba["All NBA First Team"] == 1) | 
                (allNba["All NBA Second Team"] == 1) | 
                (allNba["All NBA Third Team"] == 1) | 
                (allNba["all_star_game"] == True)]
#Join dataframes where each row is a unique season for a player
playersP1 = pd.merge(allNba, playersP1, how="left", 
                     left_on=['season','nbapersonid'], right_on = ['season','nbapersonid'])

#Calculate each players ppg and find mean
playersP1["ppg"] = playersP1["points"] / playersP1["games"]

#Finds and prints answers
AN1ppg = playersP1[playersP1['All NBA First Team'] == 1].ppg.mean()
AN2ppg = playersP1[playersP1['All NBA Second Team'] == 1].ppg.mean()
AN3ppg = playersP1[playersP1['All NBA Third Team'] == 1].ppg.mean()
ASppg = playersP1[playersP1['all_star_game'] == True].ppg.mean()

print("1st Team: " + str(round(AN1ppg, 1)) + " points per game")
print("2nd Team: " + str(round(AN2ppg, 1)) + " points per game")
print("3rd Team: " + str(round(AN3ppg, 1)) + " points per game")
print("All-Star: " + str(round(ASppg, 1)) + " points per game")

1st Team: 25.9 points per game
2nd Team: 23.1 points per game
3rd Team: 20.6 points per game
All-Star: 21.6 points per game


<strong><span style="color:red">ANSWER 1:</span></strong>   

1st Team: 25.9 points per game  
2nd Team: 23.1 points per game  
3rd Team: 20.6 points per game  
All-Star: 21.6 points per game  

### Question 2  

**QUESTION:** What was the average number of years of experience in the league it takes for players to make their first All NBA Selection (1st, 2nd, or 3rd team)? Please limit your sample to players drafted in 2007 or later who did eventually go on to win at least one All NBA selection. For example:

- Luka Doncic is in the dataset as 2 years. He was drafted in 2018 and won his first All NBA award in 2019 (which was his second season).  
- LeBron James is not in this dataset, as he was drafted prior to 2007.  
- Lu Dort is not in this dataset, as he has not received any All NBA honors.  



In [855]:
#create column that contains year of experience for player during the given season
def exp(df):
    df["exp"] = df["season"] - df["draftyear"] + 1
    return df

In [856]:
#subset to players with an allNba appearance
allNbaP2 = awards[["season", "nbapersonid", "All NBA First Team", 
                   "All NBA Second Team", "All NBA Third Team"]]
allNbaP2 = allNbaP2[(allNbaP2["All NBA First Team"] == 1) | 
                (allNbaP2["All NBA Second Team"] == 1) | 
                (allNbaP2["All NBA Third Team"] == 1)]

#group awards by player and keep the earliest season in which an award was won
#'min' does not depend on the row order of the awards file
agg_functions2 = {'nbapersonid': 'first', 'season': 'min'}
allNbaP2 = allNbaP2.groupby(["nbapersonid"]).aggregate(agg_functions2)
allNbaP2 = allNbaP2[["season"]]
allNbaP2 = allNbaP2.reset_index()

#group players where each row is a unique player with their draft year
agg_functions3 = {'nbapersonid': 'first', 'player': 'first', 'draftyear': 'first'}
playersP2 = player_data[["nbapersonid", "draftyear", "player"]]
playersP2 = playersP2.groupby(["nbapersonid"]).aggregate(agg_functions3)
playersP2 = playersP2[["player", "draftyear"]]
playersP2 = playersP2.reset_index()

#add column that contains years of experience at the time of the first award
joined1 = pd.merge(allNbaP2, playersP2, how="inner", 
                   left_on=['nbapersonid'], right_on = ['nbapersonid'])
joined1 = exp(joined1)
joined1 = joined1[joined1["draftyear"] > 2006] 
print(round(joined1.exp.mean(), 1))


4.7


<strong><span style="color:red">ANSWER 2:</span></strong>  

4.7 Years  

## Data Cleaning Interlude  

You're going to work to create a dataset with a "career outcome" for each player, representing the highest level of success that the player achieved for **at least two** seasons *after his first four seasons in the league* (examples to follow below!). To do this, you'll start with single season level outcomes. On a single season level, the outcomes are:  

- Elite: A player is "Elite" in a season if he won any All NBA award (1st, 2nd, or 3rd team), MVP, or DPOY in that season.    
- All-Star: A player is "All-Star" in a season if he was selected to be an All-Star that season.   
- Starter:  A player is a "Starter" in a season if he started in at least 41 games in the season OR if he played at least 2000 minutes in the season.    
- Rotation:  A player is a "Rotation" player in a season if he played at least 1000 minutes in the season.   
- Roster:  A player is a "Roster" player in a season if he played at least 1 minute for an NBA team but did not meet any of the above criteria.     
- Out of the League: A player is "Out of the League" if he is not in the NBA in that season.   

We need to make an adjustment for determining Starter/Rotation qualifications for a few seasons that didn't have 82 games per team. Assume that there were 66 possible games in the 2011 lockout season and 72 possible games in each of the 2019 and 2020 seasons that were shortened due to covid. Specifically, if a player played 900 minutes in 2011, he **would** meet the rotation criteria because his final minutes would be considered to be 900 * (82/66) = 1118. Please use this math for both minutes and games started, so a player who started 38 games in 2019 or 2020 would be considered to have started 38 * (82/72) = 43 games, and thus would qualify for starting 41. Any answers should be calculated assuming you round the multiplied values to the nearest whole number.
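
The adjustment described above can be sketched as a small helper. This is a minimal illustration, not the graded solution: the names `SEASON_GAMES` and `adjust_to_full_season` are assumptions, and Python's built-in `round` is used for "nearest whole number" (it rounds exact halves to the nearest even integer, which does not affect the two worked examples).

```python
# Possible games per team in the shortened seasons named above; all others use 82.
SEASON_GAMES = {2011: 66, 2019: 72, 2020: 72}
FULL_SEASON_GAMES = 82


def adjust_to_full_season(value, season):
    """Scale a minutes or games-started total to an 82-game season and round."""
    possible_games = SEASON_GAMES.get(season, FULL_SEASON_GAMES)
    return round(value * FULL_SEASON_GAMES / possible_games)


print(adjust_to_full_season(900, 2011))  # 1118, meets the 1000-minute Rotation bar
print(adjust_to_full_season(38, 2019))   # 43, meets the 41-start Starter bar
print(adjust_to_full_season(900, 2015))  # 900, full seasons are unchanged
```

The same scaling would apply to both `mins` and `games_start` before comparing them with the Starter and Rotation thresholds.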

Note that on a season level, a player's outcome is the highest level of success he qualifies for in that season. Thus, since Shai Gilgeous-Alexander was both All-NBA 1st team and an All-Star last year, he would be considered to be "Elite" for the 2022 season, but would still qualify for a career outcome of All-Star if in the rest of his career he made one more All-Star game but no more All-NBA teams. Note this is a hypothetical, and Shai has not yet played enough to have a career outcome.    

Examples:  

- A player who enters the league as a rookie and has season outcomes of Roster (1), Rotation (2), Rotation (3), Roster (4), Roster (5), Out of the League (6+) would be considered "Out of the League," because after his first four seasons, he only has a single Roster year, which does not qualify him for any success outcome.  
- A player who enters the league as a rookie and has season outcomes of Roster (1), Rotation (2), Starter (3), Starter (4), Starter (5), Starter (6), All-Star (7), Elite (8), Starter (9) would be considered "All-Star," because he had at least two seasons after his first four at all-star level of production or higher.  
- A player who enters the league as a rookie and has season outcomes of Roster (1), Rotation (2), Starter (3), Starter (4), Starter (5), Starter (6), Rotation (7), Rotation (8), Roster (9) would be considered a "Starter" because he has two seasons after his first four at a starter level of production. 
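
The three examples above can be reproduced with a short sketch. It assumes the rule is read cumulatively, so a season counts toward a level when its outcome is that level or higher, which is what the All-Star example implies. The names `LEVELS` and `career_outcome` are illustrative and not part of the provided code.

```python
# Outcome levels ordered from highest to lowest, as defined above.
LEVELS = ["Elite", "All-Star", "Starter", "Rotation", "Roster"]


def career_outcome(season_outcomes):
    """Return the highest level reached in at least two seasons after the first four.

    season_outcomes lists one outcome per season, starting with the rookie year.
    """
    later_seasons = season_outcomes[4:]
    for rank, level in enumerate(LEVELS):
        # A season counts toward this level if it is at this level or above it.
        at_or_above = set(LEVELS[: rank + 1])
        if sum(outcome in at_or_above for outcome in later_seasons) >= 2:
            return level
    return "Out of the League"


example_1 = ["Roster", "Rotation", "Rotation", "Roster", "Roster"]
example_2 = ["Roster", "Rotation", "Starter", "Starter", "Starter",
             "Starter", "All-Star", "Elite", "Starter"]
example_3 = ["Roster", "Rotation", "Starter", "Starter", "Starter",
             "Starter", "Rotation", "Rotation", "Roster"]

print(career_outcome(example_1))  # Out of the League
print(career_outcome(example_2))  # All-Star
print(career_outcome(example_3))  # Starter
```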


### Question 3  

**QUESTION:** There are 73 players in the `player_data` dataset who have 2010 listed as their draft year. How many of those players have a **career** outcome in each of the 6 buckets?  

In [857]:
#function takes in a row that is a unique player season and returns the season outcome
def label_season (row):
    if ((row['All NBA First Team'] == 1) | 
            (row['All NBA Second Team'] == 1) | 
            (row['All NBA Third Team'] == 1) | 
            (row['Defensive Player Of The Year_rk'] == 1) | 
            (row['Most Valuable Player_rk'] == 1)):
        return 'Elite'
    elif row['all_star_game'] == True :
        return 'All-Star'
    elif (row['games_start'] >= 41) | (row['mins'] >= 2000):
        return 'Starter'
    elif (row['mins'] >= 1000):
        return 'Rotation'
    elif (row['mins'] >= 1):
        return 'Roster'
    else:
        return "Out of the League"

In [858]:
#function takes in row containing unique player and counts of season outcomes and returns career label
def label_player (row):
    if row['Elite'] >= 2:
        return 'Elite'
    elif row['All-Star'] >= 2 :
        return 'All-Star'
    elif (row['Starter'] >= 2):
        return 'Starter'
    elif (row['Rotation'] >= 2):
        return 'Rotation'
    elif (row['Roster'] >= 2):
        return 'Roster'
    else:
        return "Out of the League"

In [859]:
#Function takes in dataframe of players to label their current career
#Works when each row is a unique players season
#Returns df with nbapersonid and a count of each seasons outcome starting after their 4th year
#also returns dictionary with counts of each outcome
def label_careers (playersDf):
    awardsDf = awards[["season", "nbapersonid", "All NBA First Team", 
                       "All NBA Second Team", "All NBA Third Team",
                       "Defensive Player Of The Year_rk", "Most Valuable Player_rk", "all_star_game"]]

    #joins players with awards where each row is unique players season
    joined = pd.merge(playersDf, awardsDf, how="left", 
                      left_on=['season','nbapersonid'], 
                      right_on = ['season','nbapersonid'])
    #only includes seasons after 4th year, gets count of players
    #x and y keep track of players that did not have a 5th season in nba
    x = joined.nbapersonid.nunique()
    joined = joined[joined.season.gt(joined.draftyear + 3)]
    y = joined.nbapersonid.nunique()
    
    #creates outcome column that contains label for each season
    joined['outcome'] = joined.apply(label_season, axis=1)

    #create dfs where each row is unique players and column is count of season outcomes
    grouped1 = joined.groupby('nbapersonid')['outcome'].apply(
        lambda x: (x=='Elite').sum()).reset_index(name='Elite')
    grouped2 = joined.groupby('nbapersonid')['outcome'].apply(
        lambda x: (x=='All-Star').sum()).reset_index(name='All-Star')
    grouped3 = joined.groupby('nbapersonid')['outcome'].apply(
        lambda x: (x=='Starter').sum()).reset_index(name='Starter')
    grouped4 = joined.groupby('nbapersonid')['outcome'].apply(
        lambda x: (x=='Rotation').sum()).reset_index(name='Rotation')
    grouped5 = joined.groupby('nbapersonid')['outcome'].apply(
        lambda x: (x=='Roster').sum()).reset_index(name='Roster')

    #Combine dfs
    finalDf = grouped1
    finalDf["All-Star"] = grouped2["All-Star"]
    finalDf["Starter"] = grouped3["Starter"]
    finalDf["Rotation"] = grouped4["Rotation"]
    finalDf["Roster"] = grouped5["Roster"]
    finalDf['career outcome'] = finalDf.apply (lambda row: label_player(row), axis=1)
    
    #create dictionary of counts of each career outcome
    #.get(..., 0) avoids a KeyError when no player falls in a bucket
    outcomeCounts = finalDf['career outcome'].value_counts()
    careerDict = {}
    careerDict['numElites'] = outcomeCounts.get('Elite', 0)
    careerDict['numAllStars'] = outcomeCounts.get('All-Star', 0)
    careerDict['numStarters'] = outcomeCounts.get('Starter', 0)
    careerDict['numRotation'] = outcomeCounts.get('Rotation', 0)
    careerDict['numRoster'] = outcomeCounts.get('Roster', 0)
    careerDict['numOOL'] = outcomeCounts.get('Out of the League', 0) + (x - y)
    
    return (finalDf, careerDict)


In [860]:
#Keep only the needed columns for players drafted in 2010
#each row is a unique player season
playersP3 = player_data[["season", "nbapersonid", "games_start", "mins", "draftyear"]]
playersP3 = playersP3[(playersP3["draftyear"] == 2010)]
agg_functions4 = {'season': 'first', 'nbapersonid': 'first', 'games_start': 'sum', 
                  'mins': 'sum', 'draftyear': 'first'}
playersP3 = playersP3.groupby(['season', 'nbapersonid']).aggregate(agg_functions4)
playersP3 = playersP3[["games_start", "mins", "draftyear"]].reset_index()

#call function to label each player career, output is df and dictionary with outcome counts
finalDf, counts = label_careers(playersP3)

#prints counts of each outcome
print("Elite: " + str(counts['numElites']) + " players.")
print("All-Star: " + str(counts['numAllStars']) + " players.")
print("Starter: " + str(counts['numStarters']) + " players.")
print("Rotation: " + str(counts['numRotation']) + " players.")
print("Roster: " + str(counts['numRoster']) + " players.")
print("Out of League: " + str(counts['numOOL']) + " players.")

Elite: 2 players.
All-Star: 1 players.
Starter: 11 players.
Rotation: 7 players.
Roster: 14 players.
Out of League: 38 players.


<strong><span style="color:red">ANSWER 3:</span></strong>  

Elite: 2 players.  
All-Star: 1 player.  
Starter: 11 players.  
Rotation: 7 players.  
Roster: 14 players.  
Out of League: 38 players.  

### Open Ended Modeling Question   

In this question, you will work to build a model to predict a player's career outcome based on information up through the first four years of his career. 

This question is intentionally left fairly open ended, but here are some notes and specifications.  

1. We know modeling questions can take a long time, and that qualified candidates will have different levels of experience with "formal" modeling. Don't be discouraged. It's not our intention to make you spend excessive time here. If you get your model to a good spot but think you could do better by spending a lot more time, you can just write a bit about your ideas for future improvement and leave it there. Further, we're more interested in your thought process and critical thinking than we are in specific modeling techniques. Using smart features is more important than using fancy mathematical machinery, and a successful candidate could use a simple regression approach. 

2. You may use any data provided in this project, but please do not bring in any external sources of data. Note that while most of the data provided goes back to 2007, All NBA and All Rookie team voting is only included back to 2011.  

3. A player needs to complete three additional seasons after their first four to be considered as having a distinct career outcome for our dataset. Because the dataset in this project ends in 2021, this means that a player would need to have had the chance to play in the '21, '20, and '19 seasons after his first four years, and thus his first four years would have been '18, '17, '16, and '15. **For this reason, limit your training data to players who were drafted in or before the 2015 season.** Karl-Anthony Towns was the #1 pick in that season.  

4. Once you build your model, predict on all players who were drafted in 2018-2021 (They have between 1 and 4 seasons of data available and have not yet started accumulating seasons that inform their career outcome).  

5. You can predict a single career outcome for each player, but it's better if you can predict the probability that each player falls into each outcome bucket.    

6. Include, as part of your answer:  
  - A brief written overview of how your model works, targeted towards a decision maker in the front office without a strong statistical background. 
  - What you view as the strengths and weaknesses of your model.  
  - How you'd address the weaknesses if you had more time and or more data.  
  - A matplotlib or plotly visualization highlighting some part of your modeling process, the model itself, or your results.  
  - Your predictions for Shai Gilgeous-Alexander, Zion Williamson, James Wiseman, and Josh Giddey.  
  - (Bonus!) An html table (for example, see the package `reactable`) containing all predictions for the players drafted in 2019-2021.  
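
For the bonus item, one possible way to present per-bucket probabilities is a plain HTML table built with the standard library. This is only a sketch: `BUCKETS`, `predictions_to_html`, and the input shape (a dict mapping player name to a dict of bucket probabilities) are assumptions, and the example values are placeholders rather than model outputs.

```python
from html import escape

BUCKETS = ["Elite", "All-Star", "Starter", "Rotation", "Roster", "Out of the League"]


def predictions_to_html(predictions):
    """Render {player: {bucket: probability}} as a simple HTML table string."""
    header = "".join(f"<th>{escape(name)}</th>" for name in ["Player"] + BUCKETS)
    rows = []
    for player, probs in predictions.items():
        # Buckets missing from a player's dict are shown as 0.0%.
        cells = "".join(f"<td>{probs.get(bucket, 0.0):.1%}</td>" for bucket in BUCKETS)
        rows.append(f"<tr><td>{escape(player)}</td>{cells}</tr>")
    return f"<table><tr>{header}</tr>{''.join(rows)}</table>"


# Placeholder values for illustration only; these are not model outputs.
example = {"Player A": {"Starter": 0.5, "Rotation": 0.3, "Roster": 0.2}}
table_html = predictions_to_html(example)
print(table_html)
```

In a notebook, the resulting string could be shown with `IPython.display.HTML(table_html)`.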



In [884]:
#Takes a row for one player season, which may cover two team stints after a mid-season trade,
#and combines the stints into a single season stat line.
#Rate stats such as assist percentage would need team totals to recompute exactly,
#so they are approximated with a minutes-weighted average of the two stints.
#Returns a Series containing the cumulative season stats for the player
def combine_season (row):
    #finds if player was traded in middle of season
    if (row['teamid1'] != row['teamid2']):
        games = row['games1'] + row['games2']
        gameStarts = row['gamesStart1'] + row['gamesStart2']
        mins = row['mins1'] + row['mins2']
        fgm2 = row['fgm21'] + row['fgm22']
        fga2 = row['fga21'] + row['fga22']
        fgm3 = row['fgm31'] + row['fgm32']
        fga3 = row['fga31'] + row['fga32']
        ftm = row['ftm1'] + row['ftm2']
        fta = row['fta1'] + row['fta2']
        efg = ((row['fgm21'] + row['fgm22']) + 1.5*(row['fgm31'] + row['fgm32'])) / (
            row['fga21'] + row['fga22'] + row['fga31'] + row['fga32'])
        offReb = row['offReb1'] + row['offReb2']
        defReb = row['defReb1'] + row['defReb2']
        ast = row['ast1'] + row['ast2']
        blocks = row['blocks1'] + row['blocks2']
        tov = row['tov1'] + row['tov2']
        tot_fouls = row['tot_fouls1'] + row['tot_fouls2']
        per = ((row['per1']*row['mins1']) + (row['per2']*row['mins2'])) / (row['mins1']+row['mins2'])
        offRebP = ((row['offRebP1']*row['mins1']) + (row['offRebP2']*row['mins2'])) / (row['mins1']+row['mins2'])
        defRebP = ((row['defRebP1']*row['mins1']) + (row['defRebP2']*row['mins2'])) / (row['mins1']+row['mins2'])
        ast_pct = ((row['ast_pct1']*row['mins1']) + (row['ast_pct2']*row['mins2'])) / (row['mins1']+row['mins2'])
        stl_pct = ((row['stl_pct1']*row['mins1']) + (row['stl_pct2']*row['mins2'])) / (row['mins1']+row['mins2'])
        blk_pct = ((row['blk_pct1']*row['mins1']) + (row['blk_pct2']*row['mins2'])) / (row['mins1']+row['mins2'])
        tov_pct = ((row['tov_pct1']*row['mins1']) + (row['tov_pct2']*row['mins2'])) / (row['mins1']+row['mins2'])
        usg = ((row['usg1']*row['mins1']) + (row['usg2']*row['mins2'])) / (row['mins1']+row['mins2'])
        OWS = row['OWS1'] + row['OWS2']
        DWS = row['DWS1'] + row['DWS2']
        OBPM = ((row['OBPM1']*row['games1']) + (row['OBPM2']*row['games2'])) / (row['games1']+row['games2'])
        DBPM = ((row['DBPM1']*row['games1']) + (row['DBPM2']*row['games2'])) / (row['games1']+row['games2'])
        VORP = row['VORP1'] + row['VORP2']

        return pd.Series([games, gameStarts, mins, fgm2, fga2, fgm3, fga3, ftm, fta, efg, offReb, defReb, ast,
                         blocks, tov, tot_fouls, per, offRebP, defRebP, ast_pct, stl_pct, blk_pct, tov_pct,
                         usg, OWS, DWS, VORP])
    #accounts for players that were not traded in the middle of the season
    else:
        games = row['games1']
        gameStarts = row['gamesStart1']
        mins = row['mins1']
        fgm2 = row['fgm21']
        fga2 = row['fga21']
        fgm3 = row['fgm31']
        fga3 = row['fga31']
        ftm = row['ftm1']
        fta = row['fta1']
        efg = row['efg1']
        offReb = row['offReb1']
        defReb = row['defReb1']
        ast = row['ast1']
        blocks = row['blocks1']
        tov = row['tov1']
        tot_fouls = row['tot_fouls1']
        per = row['per1']
        offRebP = row['offRebP1']
        defRebP = row['defRebP1']
        ast_pct = row['ast_pct1']
        stl_pct = row['stl_pct1']
        blk_pct = row['blk_pct1']
        tov_pct = row['tov_pct1']
        usg = row['usg1']
        OWS = row['OWS1']
        DWS = row['DWS1']
        OBPM = row['OBPM1']
        DBPM = row['DBPM1']
        VORP = row['VORP1']
        return pd.Series([games, gameStarts, mins, fgm2, fga2, fgm3, fga3, ftm, fta, efg, offReb, defReb, ast,
                         blocks, tov, tot_fouls, per, offRebP, defRebP, ast_pct, stl_pct, blk_pct, tov_pct,
                         usg, OWS, DWS, VORP])

In [890]:
#Creates dataframe with each row being a unique players season with correct aggregated data
combPlayers = player_data

teamid1 = pd.NamedAgg(column="nbateamid", aggfunc="first")
teamid2 = pd.NamedAgg(column="nbateamid", aggfunc="last")
playerName = pd.NamedAgg(column="player", aggfunc="first")
draftyear = pd.NamedAgg(column="draftyear", aggfunc="first")
games1 = pd.NamedAgg(column="games", aggfunc="first")
games2 = pd.NamedAgg(column="games", aggfunc="last")
gamesStart1 = pd.NamedAgg(column="games_start", aggfunc="first")
gamesStart2 = pd.NamedAgg(column="games_start", aggfunc="last")
mins1 = pd.NamedAgg(column="mins", aggfunc="first")
mins2 = pd.NamedAgg(column="mins", aggfunc="last")
fgm21 = pd.NamedAgg(column="fgm2", aggfunc="first")
fgm22 = pd.NamedAgg(column="fgm2", aggfunc="last")
fga21 = pd.NamedAgg(column="fga2", aggfunc="first")
fga22 = pd.NamedAgg(column="fga2", aggfunc="last")
fgm31 = pd.NamedAgg(column="fgm3", aggfunc="first")
fgm32 = pd.NamedAgg(column="fgm3", aggfunc="last")
fga31 = pd.NamedAgg(column="fga3", aggfunc="first")
fga32 = pd.NamedAgg(column="fga3", aggfunc="last")
efg1 = pd.NamedAgg(column="efg", aggfunc="first")
efg2 = pd.NamedAgg(column="efg", aggfunc="last")
ftm1 = pd.NamedAgg(column="ftm", aggfunc="first")
ftm2 = pd.NamedAgg(column="ftm", aggfunc="last")
fta1 = pd.NamedAgg(column="fta", aggfunc="first")
fta2 = pd.NamedAgg(column="fta", aggfunc="last")
offReb1 = pd.NamedAgg(column="off_reb", aggfunc="first")
offReb2 = pd.NamedAgg(column="off_reb", aggfunc="last")
defReb1 = pd.NamedAgg(column="def_reb", aggfunc="first")
defReb2 = pd.NamedAgg(column="def_reb", aggfunc="last")
ast1 = pd.NamedAgg(column="ast", aggfunc="first")
ast2 = pd.NamedAgg(column="ast", aggfunc="last")
steals1 = pd.NamedAgg(column="steals", aggfunc="first")
steals2 = pd.NamedAgg(column="steals", aggfunc="last")
blocks1 = pd.NamedAgg(column="blocks", aggfunc="first")
blocks2 = pd.NamedAgg(column="blocks", aggfunc="last")
tov1 = pd.NamedAgg(column="tov", aggfunc="first")
tov2 = pd.NamedAgg(column="tov", aggfunc="last")
tot_fouls1 = pd.NamedAgg(column="tot_fouls", aggfunc="first")
tot_fouls2 = pd.NamedAgg(column="tot_fouls", aggfunc="last")
per1 = pd.NamedAgg(column="PER", aggfunc="first")
per2 = pd.NamedAgg(column="PER", aggfunc="last")
offRebP1 = pd.NamedAgg(column="off_reb_pct", aggfunc="first")
offRebP2 = pd.NamedAgg(column="off_reb_pct", aggfunc="last")
defRebP1 = pd.NamedAgg(column="def_reb_pct", aggfunc="first")
defRebP2 = pd.NamedAgg(column="def_reb_pct", aggfunc="last")
ast_pct1 = pd.NamedAgg(column="ast_pct", aggfunc="first")
ast_pct2 = pd.NamedAgg(column="ast_pct", aggfunc="last")
stl_pct1 = pd.NamedAgg(column="stl_pct", aggfunc="first")
stl_pct2 = pd.NamedAgg(column="stl_pct", aggfunc="last")
blk_pct1 = pd.NamedAgg(column="blk_pct", aggfunc="first")
blk_pct2 = pd.NamedAgg(column="blk_pct", aggfunc="last")
tov_pct1 = pd.NamedAgg(column="tov_pct", aggfunc="first")
tov_pct2 = pd.NamedAgg(column="tov_pct", aggfunc="last")
usg1 = pd.NamedAgg(column="usg", aggfunc="first")
usg2 = pd.NamedAgg(column="usg", aggfunc="last")
OWS1 = pd.NamedAgg(column="OWS", aggfunc="first")
OWS2 = pd.NamedAgg(column="OWS", aggfunc="last")
DWS1 = pd.NamedAgg(column="DWS", aggfunc="first")
DWS2 = pd.NamedAgg(column="DWS", aggfunc="last")
OBPM1 = pd.NamedAgg(column="OBPM", aggfunc="first")
OBPM2 = pd.NamedAgg(column="OBPM", aggfunc="last")
DBPM1 = pd.NamedAgg(column="DBPM", aggfunc="first")
DBPM2 = pd.NamedAgg(column="DBPM", aggfunc="last")
VORP1 = pd.NamedAgg(column="VORP", aggfunc="first")
VORP2 = pd.NamedAgg(column="VORP", aggfunc="last")


combPlayers = combPlayers.groupby(['season', 'nbapersonid']).aggregate(teamid1 = teamid1, teamid2 = teamid2,
                                                                 playerName = playerName, draftyear = draftyear,
                                                                 games1 = games1, games2 = games2,
                                                                 gamesStart1 = gamesStart1, 
                                                                 gamesStart2 = gamesStart2,
                                                                 mins1 = mins1, mins2 = mins2,
                                                                 fgm21 = fgm21, fgm22 = fgm22,
                                                                 fga21 = fga21, fga22 = fga22,
                                                                 fgm31 = fgm31, fgm32 = fgm32,
                                                                 fga31 = fga31, fga32 = fga32,
                                                                 efg1 = efg1, efg2 = efg2,
                                                                 ftm1 = ftm1, ftm2 = ftm2,
                                                                 fta1 = fta1, fta2 = fta2,
                                                                 offReb1 = offReb1, offReb2 = offReb2,
                                                                 defReb1 = defReb1, defReb2 = defReb2,
                                                                 ast1 = ast1, ast2 = ast2, 
                                                                 steals1 = steals1, steals2 = steals2,
                                                                 blocks1 = blocks1, blocks2 = blocks2,
                                                                 tov1 = tov1, tov2 = tov2,
                                                                 tot_fouls1 = tot_fouls1, tot_fouls2 = tot_fouls2,
                                                                 per1 = per1, per2 = per2,
                                                                 offRebP1 = offRebP1, offRebP2 = offRebP2,
                                                                 defRebP1 = defRebP1, defRebP2 = defRebP2,
                                                                 ast_pct1 = ast_pct1, ast_pct2 = ast_pct2,
                                                                 stl_pct1 = stl_pct1, stl_pct2 = stl_pct2,
                                                                 blk_pct1 = blk_pct1, blk_pct2 = blk_pct2,
                                                                 tov_pct1 = tov_pct1, tov_pct2 = tov_pct2,
                                                                 usg1 = usg1, usg2 = usg2,
                                                                 OWS1 = OWS1, OWS2 = OWS2,
                                                                 DWS1 = DWS1, DWS2 = DWS2,
                                                                 OBPM1 = OBPM1, OBPM2 = OBPM2,
                                                                 DBPM1 = DBPM1, DBPM2 = DBPM2,
                                                                 VORP1 = VORP1, VORP2 = VORP2
                                                                )

combPlayers = combPlayers.reset_index()
combPlayers[['games', 'gameStarts', 'mins', 'fgm2', 'fga2', 'fgm3', 'fga3', 'ftm', 
             'fta', 'efg', 'offReb', 'defReb', 'ast', 'blocks', 'tov', 'tot_fouls', 
             'per', 'offRebP', 'defRebP', 'ast_pct', 'stl_pct', 'blk_pct', 'tov_pct',
                         'usg', 'OWS', 'DWS', 'VORP']] = combPlayers.apply(lambda row: combine_season(row), axis=1)
combPlayers = combPlayers[["season", "nbapersonid", "playerName", "draftyear", 'games', 
                           'gameStarts', 'mins', 'fgm2', 'fga2', 'fgm3', 'fga3', 'ftm', 
                           'fta', 'efg', 'offReb', 'defReb', 'ast', 'blocks', 'tov', 
                           'tot_fouls', 'per', 'offRebP', 'defRebP', 'ast_pct', 'stl_pct', 
                           'blk_pct', 'tov_pct','usg', 'OWS', 'DWS', 'VORP']]
combPlayers.rename(columns={'gameStarts':'games_start'}, inplace=True)
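
The minutes-weighted averaging that `combine_season` applies to rate stats can be shown in isolation. The helper below is a simplified sketch with an assumed name (`minutes_weighted`) and hypothetical stint values. It also guards against a zero-minute denominator, which is an added assumption rather than behavior of the code above.

```python
def minutes_weighted(rate_a, mins_a, rate_b, mins_b):
    """Combine two per-stint rate stats, weighting each by minutes played."""
    total_mins = mins_a + mins_b
    if total_mins == 0:
        # No minutes in either stint, so the combined rate is undefined.
        return float("nan")
    return (rate_a * mins_a + rate_b * mins_b) / total_mins


# Hypothetical stints: usage of 20.0 over 1000 minutes, then 26.0 over 500 minutes.
combined_usage = minutes_weighted(20.0, 1000, 26.0, 500)
print(combined_usage)  # 22.0
```

The longer stint pulls the combined value toward its own rate, which is why this is closer to the true season figure than a simple mean of the two stints.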

In [864]:
#add columns to df with the player's minutes next season and the change from this season
def addMins(df):
    df["nextMins"] = np.nan
    df["changeMins"] = np.nan
    for index, row in df.iterrows():
        player = df.loc[index, "nbapersonid"]
        season = df.loc[index, "season"]
        try:
            nextMin = df.loc[(df['nbapersonid'] == player) & (df['season'] == (season + 1))]['mins'].values[0]
            changeMin = nextMin - df.loc[index, "mins"]
            df.loc[index, 'nextMins'] = nextMin
            df.loc[index, 'changeMins'] = changeMin
            
        except:
            df.loc[index, 'nextMins'] = 0
            df.loc[index, 'changeMins'] = 0 - df.loc[index, "mins"]
    return df
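
The same next-season lookup can be written without an explicit loop. The following is a minimal sketch, assuming one row per player per season and the column names used above. `add_next_mins` and the toy `demo` frame are hypothetical and not part of the notebook.

```python
import pandas as pd

def add_next_mins(df):
    """Vectorized sketch: attach next season's minutes and the change in minutes."""
    # Shift each season back by one so that season s+1 lines up with season s.
    nxt = df[["nbapersonid", "season", "mins"]].copy()
    nxt["season"] = nxt["season"] - 1
    nxt = nxt.rename(columns={"mins": "nextMins"})

    out = df.merge(nxt, how="left", on=["nbapersonid", "season"])
    # Players with no following season are treated as playing 0 minutes.
    out["nextMins"] = out["nextMins"].fillna(0)
    out["changeMins"] = out["nextMins"] - out["mins"]
    return out

# Made-up example: player 1 has two seasons, player 2 has one.
demo = pd.DataFrame({
    "nbapersonid": [1, 1, 2],
    "season": [2015, 2016, 2015],
    "mins": [1000, 1500, 400],
})
result = add_next_mins(demo)
print(result)
```

A self-merge like this relies on each (player, season) pair appearing once. Duplicate pairs would multiply rows, so they would need to be aggregated first.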

In [886]:
#Prepare training and predicting dataframes
modelDf = exp(combPlayers)
predictDf = exp(combPlayers)
predictDf = predictDf[(predictDf["draftyear"] > 2017)]
modelDf = modelDf[(modelDf["draftyear"] < 2016) & (modelDf["draftyear"] > 2006)]
modelDf = addMins(modelDf.copy())
outcomes, countsX = label_careers(modelDf)
outcomes = outcomes[["nbapersonid", "career outcome"]]
modelDf = pd.merge(modelDf, outcomes, how="left", on = 'nbapersonid')
modelDf = modelDf.fillna("Out of the League")
modelDf = modelDf.rename(columns={'career outcome': 'Career_Outcome'})

td1 = modelDf[modelDf['exp'] == 1]
td2 = modelDf[modelDf['exp'] == 2]
td3 = modelDf[modelDf['exp'] == 3]
td4 = modelDf[modelDf['exp'] == 4]


In [887]:
from scipy.stats import randint
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score, 
                             recall_score, ConfusionMatrixDisplay)
from sklearn.model_selection import RandomizedSearchCV, train_test_split

In [888]:
#attempt at predicting minutes for the next season using the current season
#ended up not being used in the final analysis because it was not accurate enough
X = td3[['games_start', 'games', 'ast', 'blocks', 'mins', 'fga2', 'DWS']]

#target: minutes played in the following season
y = td3[['nextMins']]

#split X and y into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

#convert pandas dataframes into numpy arrays before fitting
X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values

#create the model
model = LinearRegression()
#model = svm.SVR()


#fit the model using the training data
model.fit(X_train,y_train)

#predict the held-out data and report R^2 on the test set
y_predicted = model.predict(X_test)
scores = model.score(X_test, y_test)


print(y_predicted.size)
print(y.shape)
print(scores)

141
(426, 1)
0.6265768352955958


In [889]:
#Correlation map to find significant variables that predict career outcome
#Subset to players' 1st seasons
corrMapDf1 = td1[['games',
       'games_start', 'mins', 'fgm2', 'fga2', 'fgm3', 'fga3', 'ftm', 'fta',
       'efg', 'offReb', 'defReb', 'ast', 'blocks', 'tov', 'tot_fouls', 'per',
       'offRebP', 'defRebP', 'ast_pct', 'stl_pct', 'blk_pct', 'tov_pct', 'usg',
       'OWS', 'DWS', 'VORP', 'nextMins', 'changeMins', 'Career_Outcome']]
maps = {"Out of the League": 1, 'Roster': 2, 'Rotation': 3, 'Starter': 4, 'All-Star': 5, 'Elite': 6}
corrMapDf1 = corrMapDf1.replace({'Career_Outcome':maps})
corrMapDf=(corrMapDf1-corrMapDf1.mean())/corrMapDf1.std()
corr = corrMapDf.corr()
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,Career_Outcome,DWS,OWS,VORP,ast,ast_pct,blk_pct,blocks,changeMins,defReb,defRebP,fga2,fga3,fgm2,fgm3,fta,ftm,games,games_start,mins,nextMins,offReb,offRebP,per,stl_pct,tot_fouls,tov,usg
Career_Outcome,1.0,0.475872,0.329893,0.291443,0.387174,0.081777,0.012725,0.343641,0.303085,0.476087,0.023303,0.492899,0.335774,0.490264,0.333311,0.516432,0.511592,0.498102,0.422043,0.521565,0.654716,0.390232,-0.064783,0.172627,0.061326,0.508491,0.472694,0.069164
DWS,0.475872,1.0,0.369496,0.416139,0.488388,0.050563,0.122529,0.651331,-0.094685,0.838766,0.182758,0.7089,0.37568,0.71544,0.35765,0.683892,0.650456,0.73395,0.62475,0.76439,0.567964,0.704007,0.027015,0.258377,0.069932,0.810461,0.665065,0.021198
OWS,0.329893,0.369496,1.0,0.745458,0.258763,-0.023309,0.070818,0.361353,0.028994,0.475892,0.107671,0.437115,0.257631,0.5178,0.309204,0.455335,0.465814,0.356745,0.445169,0.467798,0.410762,0.514132,0.109796,0.376979,-0.017037,0.430324,0.27705,-0.01968
VORP,0.291443,0.416139,0.745458,1.0,0.340362,0.135233,0.036109,0.208874,0.066189,0.324009,0.083499,0.327865,0.292414,0.379718,0.33198,0.353062,0.379379,0.134232,0.35888,0.313818,0.309517,0.23996,0.037646,0.334664,0.122011,0.182492,0.264709,0.092308
ast,0.387174,0.488388,0.258763,0.340362,1.0,0.571495,-0.186689,0.153487,-0.154566,0.519368,-0.147654,0.733028,0.666264,0.691713,0.645092,0.694955,0.715979,0.574648,0.677008,0.765764,0.525598,0.240328,-0.216824,0.176001,0.096541,0.559611,0.875714,0.211753
ast_pct,0.081777,0.050563,-0.023309,0.135233,0.571495,1.0,-0.119324,-0.178075,-0.056108,0.029871,-0.222944,0.214273,0.29288,0.178317,0.276134,0.188354,0.217535,0.117806,0.175257,0.210284,0.134404,-0.14644,-0.344903,0.095063,0.129238,0.059411,0.369239,0.339634
blk_pct,0.012725,0.122529,0.070818,0.036109,-0.186689,-0.119324,1.0,0.451052,-0.001388,0.130355,0.361904,0.008336,-0.225864,0.034695,-0.218132,0.028537,-0.011902,0.026143,0.015069,-0.029688,-0.025739,0.24233,0.189798,0.268683,-0.151765,0.107535,-0.069195,-0.050873
blocks,0.343641,0.651331,0.361353,0.208874,0.153487,-0.178075,0.451052,1.0,-0.075304,0.757312,0.287737,0.586724,0.070561,0.621804,0.058114,0.572234,0.514668,0.558452,0.542187,0.562589,0.41394,0.783768,0.169349,0.260369,-0.082158,0.704951,0.446762,-0.033928
changeMins,0.303085,-0.094685,0.028994,0.066189,-0.154566,-0.056108,-0.001388,-0.075304,1.0,-0.148843,0.00593,-0.133547,-0.11131,-0.120952,-0.099299,-0.115592,-0.114637,-0.146231,-0.192564,-0.183335,0.57391,-0.115518,0.027953,0.030162,-0.000345,-0.156798,-0.176055,0.046807
defReb,0.476087,0.838766,0.475892,0.324009,0.519368,0.029871,0.130355,0.757312,-0.148843,1.0,0.251421,0.860247,0.39842,0.876824,0.37888,0.811241,0.775323,0.78656,0.762894,0.875069,0.620811,0.855664,0.051754,0.295396,-0.01672,0.893777,0.764293,0.061797


In [869]:
#Correlation map to find significant variables that predict career outcome
#Subset to players' 2nd seasons
corrMapDf2 = td2[['games',
       'games_start', 'mins', 'fgm2', 'fga2', 'fgm3', 'fga3', 'ftm', 'fta',
       'efg', 'offReb', 'defReb', 'ast', 'blocks', 'tov', 'tot_fouls', 'per',
       'offRebP', 'defRebP', 'ast_pct', 'stl_pct', 'blk_pct', 'tov_pct', 'usg',
       'OWS', 'DWS', 'VORP', 'nextMins', 'changeMins', 'Career_Outcome']]
maps = {"Out of the League": 1, 'Roster': 2, 'Rotation': 3, 'Starter': 4, 'All-Star': 5, 'Elite': 6}
corrMapDf2 = corrMapDf2.replace({'Career_Outcome':maps})
corrMapDf=(corrMapDf2-corrMapDf2.mean())/corrMapDf2.std()
corr = corrMapDf.corr()
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,Career_Outcome,DWS,OWS,VORP,ast,ast_pct,blk_pct,blocks,changeMins,defReb,defRebP,fga2,fga3,fgm2,fgm3,fta,ftm,games,games_start,mins,nextMins,offReb,offRebP,per,stl_pct,tot_fouls,tov,usg
Career_Outcome,1.0,0.599562,0.536428,0.539146,0.472952,0.160208,0.114539,0.441271,0.166142,0.572357,0.125338,0.601677,0.362805,0.600274,0.365474,0.614155,0.612639,0.53876,0.534161,0.627459,0.696814,0.446031,0.021447,0.36254,0.007459,0.593161,0.593197,0.185164
DWS,0.599562,1.0,0.586023,0.638726,0.523101,0.074675,0.214022,0.683795,-0.072245,0.850553,0.288678,0.737756,0.336094,0.748423,0.318154,0.701066,0.665103,0.743429,0.706438,0.798154,0.708429,0.714065,0.163914,0.408875,0.034138,0.809739,0.700171,0.096609
OWS,0.536428,0.586023,1.0,0.858269,0.465938,0.06629,0.101671,0.490298,-0.079831,0.665031,0.160805,0.664216,0.387307,0.715368,0.416595,0.727269,0.736862,0.536491,0.618753,0.6771,0.589348,0.61638,0.163947,0.511449,-0.036637,0.605839,0.574094,0.168015
VORP,0.539146,0.638726,0.858269,1.0,0.526632,0.21095,0.096708,0.42863,0.010356,0.576239,0.144479,0.581618,0.374188,0.618923,0.393645,0.644088,0.660875,0.381593,0.54995,0.566879,0.541849,0.45326,0.060656,0.468992,0.080308,0.438245,0.555173,0.233329
ast,0.472952,0.523101,0.465938,0.526632,1.0,0.642521,-0.1901,0.177297,-0.172585,0.476773,-0.160316,0.704427,0.55778,0.664858,0.531497,0.66763,0.692962,0.600498,0.626704,0.733606,0.584508,0.234665,-0.251949,0.310472,0.07748,0.558364,0.85891,0.313385
ast_pct,0.160208,0.074675,0.06629,0.21095,0.642521,1.0,-0.347371,-0.164422,-0.000709,0.002978,-0.315856,0.212951,0.236515,0.175818,0.212578,0.206217,0.235117,0.129697,0.142052,0.193251,0.182058,-0.145459,-0.437588,0.11149,0.292881,0.053076,0.397715,0.333265
blk_pct,0.114539,0.214022,0.101671,0.096708,-0.1901,-0.347371,1.0,0.60826,0.032446,0.244408,0.503815,0.063672,-0.285391,0.100023,-0.278439,0.079594,0.029678,0.057025,0.06671,0.018884,0.038193,0.374273,0.574098,0.204342,-0.137311,0.195664,-0.01908,-0.182298
blocks,0.441271,0.683795,0.490298,0.42863,0.177297,-0.164422,0.60826,1.0,-0.103441,0.76668,0.427979,0.573318,0.019333,0.611214,0.012962,0.564117,0.50396,0.525962,0.55712,0.568776,0.472234,0.798086,0.385199,0.376115,-0.068598,0.700031,0.467855,0.00268
changeMins,0.166142,-0.072245,-0.079831,0.010356,-0.172585,-0.000709,0.032446,-0.103441,1.0,-0.167481,0.066419,-0.197905,-0.195835,-0.188233,-0.186537,-0.155335,-0.1532,-0.205358,-0.245569,-0.24098,0.399917,-0.104311,0.053805,0.017459,0.054784,-0.196322,-0.199789,0.001516
defReb,0.572357,0.850553,0.665031,0.576239,0.476773,0.002978,0.244408,0.76668,-0.167481,1.0,0.430119,0.83444,0.32497,0.855639,0.311525,0.789057,0.742046,0.764905,0.772935,0.846725,0.694539,0.874521,0.258586,0.455507,-0.051815,0.877487,0.752201,0.147207


In [870]:
#Correlation map to find significant variables that predict career outcome
#Subset to players' 3rd seasons
corrMapDf3 = td3[['games',
       'games_start', 'mins', 'fgm2', 'fga2', 'fgm3', 'fga3', 'ftm', 'fta',
       'efg', 'offReb', 'defReb', 'ast', 'blocks', 'tov', 'tot_fouls', 'per',
       'offRebP', 'defRebP', 'ast_pct', 'stl_pct', 'blk_pct', 'tov_pct', 'usg',
       'OWS', 'DWS', 'VORP', 'nextMins', 'changeMins', 'Career_Outcome']]
maps = {"Out of the League": 1, 'Roster': 2, 'Rotation': 3, 'Starter': 4, 'All-Star': 5, 'Elite': 6}
corrMapDf3 = corrMapDf3.replace({'Career_Outcome':maps})
corrMapDf=(corrMapDf3-corrMapDf3.mean())/corrMapDf3.std()
corr = corrMapDf.corr()
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,games,games_start,mins,fgm2,fga2,fgm3,fga3,ftm,fta,offReb,defReb,ast,blocks,tov,tot_fouls,per,offRebP,defRebP,ast_pct,stl_pct,blk_pct,usg,OWS,DWS,VORP,nextMins,changeMins,Career_Outcome
games,1.0,0.579188,0.860998,0.705027,0.709299,0.487106,0.502015,0.605266,0.633678,0.570495,0.711554,0.549271,0.471829,0.705517,0.847735,0.148784,-0.121346,0.070027,0.081187,0.033269,-0.019029,0.10219,0.49194,0.684152,0.369142,0.652306,-0.21534,0.522771
games_start,0.579188,1.0,0.826258,0.80318,0.801511,0.418598,0.432818,0.698885,0.72557,0.595596,0.740521,0.603171,0.506775,0.76105,0.745388,0.233768,-0.039761,0.110665,0.190318,0.040773,0.024096,0.292248,0.595971,0.682603,0.564429,0.618221,-0.217879,0.592191
mins,0.860998,0.826258,1.0,0.883763,0.894325,0.614384,0.632536,0.81387,0.829341,0.601615,0.807548,0.721846,0.500264,0.875667,0.888366,0.233901,-0.142377,0.022251,0.224912,0.070396,-0.063695,0.299649,0.667996,0.779301,0.589461,0.745543,-0.26756,0.640098
fgm2,0.705027,0.80318,0.883763,1.0,0.990787,0.34889,0.369181,0.866657,0.895509,0.706345,0.820713,0.642683,0.580945,0.867271,0.821271,0.328894,-0.009694,0.173006,0.219695,0.021049,0.052393,0.475472,0.707321,0.703386,0.637068,0.654608,-0.24264,0.601751
fga2,0.709299,0.801511,0.894325,0.990787,1.0,0.385679,0.40728,0.87697,0.898773,0.661928,0.79271,0.677007,0.535555,0.891513,0.812599,0.30466,-0.042844,0.124887,0.254637,0.034266,0.013755,0.501183,0.655206,0.691363,0.601183,0.651057,-0.261986,0.597839
fgm3,0.487106,0.418598,0.614384,0.34889,0.385679,1.0,0.989905,0.44195,0.383805,-0.057746,0.272687,0.559896,-0.047097,0.515544,0.378369,0.085245,-0.349087,-0.308795,0.2747,0.094668,-0.308961,0.281495,0.410147,0.373221,0.441723,0.477466,-0.136312,0.378254
fga3,0.502015,0.432818,0.632536,0.369181,0.40728,0.989905,1.0,0.463865,0.406336,-0.056723,0.28293,0.591896,-0.047165,0.543402,0.389511,0.078094,-0.360349,-0.31425,0.302594,0.111003,-0.319187,0.302086,0.389909,0.386847,0.431796,0.483973,-0.151328,0.379708
ftm,0.605266,0.698885,0.81387,0.866657,0.87697,0.44195,0.463865,1.0,0.98506,0.535231,0.706765,0.657794,0.439323,0.841712,0.692994,0.312364,-0.070807,0.084614,0.274013,0.04158,-0.016091,0.490073,0.75947,0.635304,0.705758,0.596836,-0.232129,0.592374
fta,0.633678,0.72557,0.829341,0.895509,0.898773,0.383805,0.406336,0.98506,1.0,0.623078,0.760182,0.634794,0.510548,0.849462,0.742338,0.319908,-0.02872,0.142284,0.241577,0.036203,0.03127,0.472028,0.745512,0.667773,0.68236,0.614657,-0.227178,0.60887
offReb,0.570495,0.595596,0.601615,0.706345,0.661928,-0.057746,-0.056723,0.535231,0.623078,1.0,0.855452,0.169112,0.753745,0.498541,0.754206,0.275513,0.30316,0.52397,-0.167996,-0.081589,0.309121,0.106148,0.54661,0.639589,0.399568,0.461608,-0.14206,0.434306


In [871]:
#Correlation map to find significant variables that predict career outcome
#Subset to players' 4th seasons
corrMapDf4 = td4[['games_start', 'mins', 'fgm2', 'fga2', 'ftm', 'fta',
        'defReb', 'ast', 'tov', 'OWS', 'DWS', 'VORP', 'Career_Outcome']]
maps = {"Out of the League": 1, 'Roster': 2, 'Rotation': 3, 'Starter': 4, 'All-Star': 5, 'Elite': 6}
corrMapDf4 = corrMapDf4.replace({'Career_Outcome':maps})
corrMapDf=(corrMapDf4-corrMapDf4.mean())/corrMapDf4.std()
corr = corrMapDf.corr()
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,games_start,mins,fgm2,fga2,ftm,fta,defReb,ast,tov,OWS,DWS,VORP,Career_Outcome
games_start,1.0,0.839091,0.785496,0.791251,0.690495,0.704285,0.736622,0.632076,0.76939,0.638387,0.703481,0.641853,0.62995
mins,0.839091,1.0,0.865895,0.877994,0.785011,0.791951,0.787944,0.714939,0.853191,0.721075,0.758483,0.660908,0.6865
fgm2,0.785496,0.865895,1.0,0.989617,0.846417,0.879101,0.830456,0.629886,0.855841,0.740428,0.728522,0.703406,0.673523
fga2,0.791251,0.877994,0.989617,1.0,0.860095,0.881913,0.794685,0.673159,0.879168,0.704839,0.707284,0.681666,0.674404
ftm,0.690495,0.785011,0.846417,0.860095,1.0,0.980045,0.680007,0.672783,0.854629,0.79816,0.622578,0.774631,0.673069
fta,0.704285,0.791951,0.879101,0.881913,0.980045,1.0,0.750916,0.632881,0.847373,0.782493,0.676772,0.760377,0.670676
defReb,0.736622,0.787944,0.830456,0.794685,0.680007,0.750916,1.0,0.425024,0.697169,0.657527,0.847062,0.666192,0.593965
ast,0.632076,0.714939,0.629886,0.673159,0.672783,0.632881,0.425024,1.0,0.863372,0.553466,0.504505,0.632152,0.522197
tov,0.76939,0.853191,0.855841,0.879168,0.854629,0.847373,0.697169,0.863372,1.0,0.650466,0.687456,0.705106,0.647704
OWS,0.638387,0.721075,0.740428,0.704839,0.79816,0.782493,0.657527,0.553466,0.650466,1.0,0.605489,0.879441,0.651394


In [813]:
#Trains random forest using players' rookie seasons and variables considered influential
#based on the correlation map
featureCols1 = ["games", "mins", "fga2", "ftm", "defReb", "DWS", "OWS", "VORP"]
X1 = td1[featureCols1]
y1 = td1["Career_Outcome"]

X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.2)

rf1 = RandomForestClassifier()
rf1.fit(X1_train, y1_train)
y_pred = rf1.predict(X1_test)

#count how often each label is predicted on the held-out rookies
labels1, counts1 = np.unique(y_pred, return_counts = True)
predDf = pd.DataFrame([counts1], columns=[labels1])
#score against this model's own test labels
accuracy = accuracy_score(y1_test, y_pred)
print("Accuracy:", accuracy)
#print(predDf)
#print(y1_test.value_counts())

#.copy() avoids the SettingWithCopyWarning when the prediction column is added
pd1 = predictDf[predictDf['playerName'].isin(['Josh Giddey', 'James Wiseman'])].copy()
preds1 = pd1[featureCols1]
pd1['outcomePredictions'] = rf1.predict(preds1)
pd1 = pd1[["nbapersonid", "playerName", "draftyear", "outcomePredictions"]]
print(pd1)


Accuracy: 0.4105263157894737
      nbapersonid     playerName  draftyear outcomePredictions
6739      1630164  James Wiseman       2020  Out of the League
7375      1630581    Josh Giddey       2021            Starter




In [347]:
#trains random forest for players with 2 years of experience
X = td2[["games_start", "ast_pct", "mins", "fga2", "ftm", "defReb", "DWS"]]
y = td2[["Career_Outcome"]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rf2 = RandomForestClassifier()
rf2.fit(X_train, y_train)
y_pred = rf2.predict(X_test)

x1, y1 = np.unique(y_pred, return_counts = True)
predDf = pd.DataFrame([y1], columns=[x1])
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print(predDf)
print(y_test.value_counts())

Accuracy: 0.5670103092783505
  All-Star Elite Out of the League Roster Rotation Starter
0        1     5                59      6        5      21
Career_Outcome   
Out of the League    48
Starter              21
Roster               13
Rotation              8
Elite                 4
All-Star              3
dtype: int64




In [269]:
#trains random forest for players with 3 years of experience

X = td3[["games_start", "mins", "fgm2", "fta", "defReb", "DWS", "VORP"]]
y = td3[["Career_Outcome"]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rf3 = RandomForestClassifier()
rf3.fit(X_train, y_train)
y_pred = rf3.predict(X_test)

x1, y1 = np.unique(y_pred, return_counts = True)
predDf = pd.DataFrame([y1], columns=[x1])
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print(predDf)
print(y_test.value_counts())

Accuracy: 0.5232558139534884
  Elite Out of the League Roster Rotation Starter
0     6                34      2       12      32
Career_Outcome   
Starter              26
Out of the League    24
Roster               15
Rotation             13
Elite                 5
All-Star              3
dtype: int64




In [357]:
#trains random forest for players with 4 years of experience

X = td4[["games_start", "mins", "fgm2", "fta", "defReb", "OWS", "DWS", "VORP"]]
y = td4[["Career_Outcome"]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rf4 = RandomForestClassifier()
rf4.fit(X_train, y_train)
y_pred = rf4.predict(X_test)

x1, y1 = np.unique(y_pred, return_counts = True)
predDf = pd.DataFrame([y1], columns=[x1])
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print(predDf)
print(y_test.value_counts())

#.copy() avoids the SettingWithCopyWarning when the prediction column is added
pd4 = predictDf[predictDf["playerName"] == "Shai Gilgeous-Alexander"].copy()
preds4 = pd4[["games_start", "mins", "fgm2", "fta", "defReb", "OWS", "DWS", "VORP"]]
testPred4 = rf4.predict(preds4)
pd4['outcomePredictions'] = testPred4.tolist()
pd4 = pd4[["nbapersonid", "playerName", "draftyear", "outcomePredictions"]]
print(pd4)

Accuracy: 0.4567901234567901
  Elite Out of the League Roster Rotation Starter
0     6                31     11       14      19
Career_Outcome   
Out of the League    26
Starter              23
Roster               16
Rotation              9
Elite                 5
All-Star              2
dtype: int64
      nbapersonid               playerName  draftyear outcomePredictions
5670      1628983  Shai Gilgeous-Alexander       2018            Starter
6113      1628983  Shai Gilgeous-Alexander       2018            Starter
6592      1628983  Shai Gilgeous-Alexander       2018            Starter
7104      1628983  Shai Gilgeous-Alexander       2018           All-Star




In [872]:
#Function that takes in a player's season and returns the predicted outcome for the player's career
#models are trained on specific years of experience, so a different model is used depending on 
#the years of experience in the current season
def predict_player_outcome(row):
    #model and feature list for each year of experience
    models = {
        1: (rf1, ["games", "mins", "fga2", "ftm", "defReb", "DWS", "OWS", "VORP"]),
        2: (rf2, ["games_start", "ast_pct", "mins", "fga2", "ftm", "defReb", "DWS"]),
        3: (rf3, ["games_start", "mins", "fgm2", "fta", "defReb", "DWS", "VORP"]),
        4: (rf4, ["games_start", "mins", "fgm2", "fta", "defReb", "OWS", "DWS", "VORP"]),
    }
    #seasons outside years 1-4 get no prediction
    if row['exp'] not in models:
        return np.nan
    model, predVars = models[row['exp']]
    #reshape the single season into a 1 x n_features array for predict()
    features = row[predVars].to_numpy().reshape(1, -1)
    return model.predict(features)

In [873]:
import warnings
warnings.filterwarnings('ignore')

#predicts outcomes of all players drafted after 2017 (2018 onward), using their most recent year of play
predictDf = exp(combPlayers)
predictDf = predictDf[(predictDf["draftyear"] > 2017)]

predictDf['outcome'] = predictDf.apply(lambda row: predict_player_outcome(row), axis=1)

predictDf = predictDf[["nbapersonid", "playerName", "season", "draftyear", "outcome"]]
predictDf = predictDf.sort_values(by=['season'])
agg_functionsF = {'nbapersonid': 'first', 'playerName': 'first', 'draftyear': 'first', 'outcome':'last'}
predictDf = predictDf.groupby(["nbapersonid"]).aggregate(agg_functionsF)

In [883]:
#save predictions to html or csv
predictDf = predictDf.sort_values(by=['outcome'])
predictionsHTML = predictDf.to_html()
#predictDf.to_csv("finalPredictions.csv")

In [874]:
predictedPlayers = predictDf[(predictDf['playerName'] == 'Shai Gilgeous-Alexander') | 
                             (predictDf['playerName'] == 'Zion Williamson') | 
                             (predictDf['playerName'] == 'James Wiseman') | 
                             (predictDf['playerName'] == 'Josh Giddey')]
print(predictedPlayers[["playerName", "outcome"]])

                          playerName              outcome
nbapersonid                                              
1628983      Shai Gilgeous-Alexander           [All-Star]
1629627              Zion Williamson              [Elite]
1630164                James Wiseman  [Out of the League]
1630581                  Josh Giddey            [Starter]


## Written Overview
I used random forest models to predict the career outcome of players with fewer than five seasons in the NBA. To keep the models fast, I used only each player's most recent season for predictions. This created a problem: a rookie season would be judged by the same standards as a player's 4th season. To fix this, I created four different models, one for each level of NBA experience. For example, Josh Giddey performed very well in his rookie season compared to other players' rookie seasons and was categorized as a career starter, whereas if I compared that season to the rest of the NBA, he might look like a career rotation player. The models share most of their variables, with some notable differences. For rookies, free throws made, minutes played, and defensive rebounds were very predictive of career outcome, while for 4th-year players, offensive win shares, 2-point field goals made, and games started were more influential. When tested on players whose career outcomes are already known, the models classified between 40% and 60% of players correctly, depending on years of experience. Since there are 6 different categories and my dataset was limited, these are fair results.
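
One way to read the heatmaps more directly is to pull out only the `Career_Outcome` column of the correlation matrix and sort it by absolute value. The sketch below runs on a made-up `toy` frame (the values are hypothetical; only the column names and the `maps` encoding mirror the notebook). Because the outcome is an ordinal code, the Pearson correlations are a rough screening tool rather than a precise measure.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration.
toy = pd.DataFrame({
    "mins": [200, 900, 1500, 2400, 2800],
    "ftm": [10, 60, 120, 250, 400],
    "blk_pct": [3.0, 1.0, 2.5, 1.5, 2.0],
    "Career_Outcome": ["Out of the League", "Roster", "Rotation", "Starter", "Elite"],
})
maps = {"Out of the League": 1, "Roster": 2, "Rotation": 3,
        "Starter": 4, "All-Star": 5, "Elite": 6}

# Encode the outcome as an ordinal number, then correlate every feature with it.
encoded = toy.assign(Career_Outcome=toy["Career_Outcome"].map(maps))
ranking = (encoded.corr()["Career_Outcome"]
           .drop("Career_Outcome")
           .sort_values(key=lambda s: s.abs(), ascending=False))
print(ranking)
```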

## Strengths
Predicting the career outcome for a rookie is hard, but there are hundreds of rookie seasons in the NBA with labeled career outcomes, so I took advantage of that by training on all of them. My approach also uses a separate model for each level of NBA experience. As I found through my data exploration, different measurements should be used to predict the career outcome of a rookie versus a 4th-year player, and a good value for the same measurement may look different between those years. Another strength is my model's runtime: random forests are efficient to train and predict with while still performing well, and if the model grew to include more variables or data it would scale efficiently too. It also handles edge cases like James Wiseman, who finished his rookie year but was absent for his entire sophomore year. My model compared his rookie season to other rookie seasons and classified him accordingly. My model also predicts a good proportion of each label. You could make a classifier with over 70% accuracy by simply classifying every player as "Out of the League". This is not a good model though, despite the higher accuracy. My model consistently predicted each label in roughly the same proportion as it appeared in the training data.
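
The comparison against a majority-class baseline can be made concrete for a single split. The sketch below uses the actual and predicted label counts printed for the 2-years-of-experience test split earlier in the notebook. The variable names are hypothetical, and the numbers describe only that one random split.

```python
import pandas as pd

# Label counts copied from the printed output of the 2-year model's test split.
actual = pd.Series({"Out of the League": 48, "Starter": 21, "Roster": 13,
                    "Rotation": 8, "Elite": 4, "All-Star": 3})
predicted = pd.Series({"Out of the League": 59, "Starter": 21, "Roster": 6,
                       "Rotation": 5, "Elite": 5, "All-Star": 1})

# Accuracy of always guessing the most common label on this split.
majority_accuracy = actual.max() / actual.sum()

# Share of each label among the true outcomes vs. the model's predictions.
comparison = pd.DataFrame({
    "actual": actual / actual.sum(),
    "predicted": predicted / predicted.sum(),
}).round(3)

print(round(majority_accuracy, 3))
print(comparison)
```

On this split the always-"Out of the League" baseline scores about 0.495, against the 0.567 accuracy printed for the model.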

## Weaknesses and fixes
Weaknesses of my model include potential overfitting to the training data. One iteration of the random forest model for rookies achieved 59% accuracy, while the lowest I found was just below 40%. To address this, I looked for models that gave a representative distribution of outcomes while achieving an accuracy around the middle of the maximum and minimum. For a more in-depth fix, I could analyze each of the trees in the random forest; some trees could be cut simply by inspecting them and finding odd splits in the data. My model also takes only a single season into account. A player who drastically improved from his rookie season to his 2nd year could be viewed the same as another player who had a good rookie season but declined the next season. With more time I would include multiple seasons in the analysis and put emphasis on the improvement or decline from season to season. There is also more work to be done on the analysis of variables. There may be conditional dependencies among the predictor variables that would make a big difference in the predictions if accounted for.
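
The season-to-season improvement features mentioned above could be built with a grouped difference. This is a minimal sketch on made-up data, assuming one row per player per season. The `_change` column names are hypothetical.

```python
import pandas as pd

# Two invented players with mirrored trajectories: one improves, one declines.
seasons = pd.DataFrame({
    "nbapersonid": [1, 1, 2, 2],
    "season": [2018, 2019, 2018, 2019],
    "mins": [800, 1900, 1900, 800],
    "VORP": [0.1, 1.4, 1.4, 0.1],
})

# Sort so that diff() compares each season with the same player's previous one.
seasons = seasons.sort_values(["nbapersonid", "season"])
deltas = (seasons.groupby("nbapersonid")[["mins", "VORP"]]
          .diff()
          .add_suffix("_change"))
features = pd.concat([seasons, deltas], axis=1)

second_year = features[features["season"] == 2019]
print(second_year)
```

The two second-year rows here have very different single-season stat lines but exactly opposite `_change` values, which is the kind of signal a single-season model cannot see.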

## Weaknesses in the data
If I were able to use more data, I would include positional data and game-level data. Centers should be judged on different statistics than point guards when predicting career outcomes. It also makes sense that a player who starts a season strong but progressively underperforms for the rest of the season should be viewed more negatively than a player who improves as the season goes on. Adding more data could account for both.
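
If positional data were available, one simple way to judge players against their own position is to standardize each statistic within position groups. The sketch below assumes a hypothetical `position` column, which is not in the provided data, and uses made-up values.

```python
import pandas as pd

# Hypothetical "position" column and invented rebound totals.
players = pd.DataFrame({
    "playerName": ["A", "B", "C", "D"],
    "position": ["C", "C", "PG", "PG"],
    "offReb": [200, 100, 40, 20],
})

# z-score of offensive rebounds relative to players at the same position.
grouped = players.groupby("position")["offReb"]
players["offReb_z"] = ((players["offReb"] - grouped.transform("mean"))
                       / grouped.transform("std"))
print(players)
```

In this toy example the guard with 40 offensive rebounds receives the same z-score as the center with 200, because each leads their own position group by the same relative margin.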

## Player Predictions
Shai Gilgeous-Alexander: All-Star \
Zion Williamson: Elite \
James Wiseman: Out of the League \
Josh Giddey: Starter

My visualizations are the heatmaps that show the correlation between each variable and the career outcome.

## Part 2 -- Predicting Team Stats  

In this section, we're going to introduce a simple way to predict team offensive rebound percent in the next game and then discuss ways to improve those predictions.  
 
### Question 1   

Using the `rebounding_data` dataset, we'll predict a team's next game's offensive rebounding percent to be their average offensive rebounding percent in all prior games. On a single game level, offensive rebounding percent is the number of offensive rebounds divided by the number of offensive rebound "chances" (essentially the team's missed shots). On a multi-game sample, it should be the total number of offensive rebounds divided by the total number of offensive rebound chances.    

Please calculate what OKC's predicted offensive rebound percent is for game 81 in the data. That is, use games 1-80 to predict game 81.  

In [882]:
thunderTrain = rebounding_data[(rebounding_data['team'] == 'OKC') & 
                               (rebounding_data['game_number'] < 81)]
meanORebPer = thunderTrain.oreb_pct.mean()
print(round(meanORebPer, 1)*100)

30.0


<strong><span style="color:red">ANSWER 1:</span></strong>  

30.0% 
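
The multi-game definition in the question (total offensive rebounds divided by total chances) generally differs from the mean of per-game percentages. The sketch below shows both on made-up numbers. The column names `offensive_rebounds` and `off_rebound_chances` are assumptions and may not match `rebounding_data`. The value is rounded after converting to a percentage so that one decimal place survives.

```python
import pandas as pd

# Hypothetical column names and invented numbers, for illustration only.
games = pd.DataFrame({
    "team": ["OKC", "OKC", "OKC"],
    "game_number": [1, 2, 3],
    "offensive_rebounds": [10, 5, 12],
    "off_rebound_chances": [40, 50, 30],
})

# Use games before game 3 to predict game 3.
prior = games[(games["team"] == "OKC") & (games["game_number"] < 3)]

# Pooled rate: total rebounds over total chances.
pooled = prior["offensive_rebounds"].sum() / prior["off_rebound_chances"].sum()
# Mean of per-game rates, which weights every game equally regardless of chances.
mean_of_games = (prior["offensive_rebounds"] / prior["off_rebound_chances"]).mean()

print(round(pooled * 100, 1), round(mean_of_games * 100, 1))
```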

### Question 2  

There are a few limitations to the method we used above. For example, if a team has a great offensive rebounder who has played in most games this season but will be out due to an injury for the next game, we might reasonably predict a lower team offensive rebound percent for the next game.  

Please discuss how you would think about changing our original model to better account for missing players. You do not have to write any code or implement any changes, and you can assume you have access to any reasonable data that isn't provided in this project. Try to be clear and concise with your answer.  

<strong><span style="color:red">ANSWER 2:</span></strong>  
To start this problem, I would explore which is more predictive of offensive rebound percentage: a small number of dominant rebounders or a team of good rebounders. If one good rebounder is the difference between a team with a very high percentage and a team with a very low percentage, I would expect a significant difference in my estimate with and without the player. If I find the percentage relies more on a team of rebounders than on one great rebounder, and the Thunder have multiple good rebounders, I would not weigh the loss as heavily.
The next variable I expect to be very influential is the replacement player. If the replacement player has similar per-36-minute stats to the injured player, I would not adjust the prediction much, but if there is a significant difference, I would expect a significant change in the prediction.
Next I would explore whether two teams whose players have similar rebounding statistics also have similar team offensive rebound percentages. If so, I would look for a team whose player rebounding performances are similar to those of the new Thunder starting 5. The stats I would compare are rebounds per 36 minutes and player offensive rebound percent. If another team in the NBA this year, or in recent years, had players whose rebounding stats closely matched those of the Thunder minus the injured player, that team's offensive rebound percent would be a good baseline for the prediction.
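
The replacement-player idea can be sketched as a first-order adjustment: shift the team baseline by the gap between the replacement's and the injured player's individual offensive rebound percent, scaled by the share of the game the injured player normally plays. The function name, the inputs, and the additive form are all assumptions made for illustration, not a validated model.

```python
def adjust_for_absence(team_oreb_pct, injured_oreb_pct,
                       replacement_oreb_pct, minutes_share):
    """Hypothetical additive adjustment for one missing player.

    minutes_share is the fraction of game minutes the injured player
    usually plays (e.g. 0.70 for roughly 34 of 48 minutes).
    """
    return team_oreb_pct + minutes_share * (replacement_oreb_pct - injured_oreb_pct)

# Invented inputs: a 30% team baseline, an injured player at 12% individual
# OREB%, and a replacement at 5% who takes over those minutes.
adjusted = adjust_for_absence(0.30, 0.12, 0.05, 0.70)
print(round(adjusted, 3))
```

When the replacement rebounds about as well as the injured player, the adjustment is close to zero, which matches the reasoning above.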

### Question 3  

In question 2, you saw and discussed how to deal with one weakness of the model. For this question, please write about 1-3 other potential weaknesses of the simple average model you made in question 1 and discuss how you would deal with each of them. You may either explain a weakness and discuss how you'd fix that weakness, then move onto the next issue, or you can start by explaining multiple weaknesses with the original approach and discuss one overall modeling methodology you'd use that gets around most or all of them. Again, you do not need to write any code or implement any changes, and you can assume you have access to any reasonable data that isn't provided in this project. Try to be clear and concise with your answer.  


<strong><span style="color:red">ANSWER 3:</span></strong>  
Another weakness of this model is that it does not take the opposing team into account. The opposing team may be an even more influential factor than the Thunder's previous offensive rebound statistics. If an opponent has grabbed 100% of its defensive rebound chances all year, it does not matter what the Thunder have done; the best prediction against that opponent may be 0%. Next, even if we account for injured players, other players' minutes may still change. Depending on whether the Thunder are fighting for a playoff spot or sitting in the 1 seed, some players may play drastically different minutes than earlier in the season. The last weakness I will discuss is the change in performance over the season. While an average is a decent statistic, it ignores the possibility of large variance and of trends over time. If the Thunder started the season weak on the offensive glass but picked it up late in the season, we might weigh the more recent games more heavily in the prediction.
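
Two of these ideas, recency weighting and an opponent adjustment, can be sketched in a few lines. The exponential decay, the additive opponent shift, the function names, and all numbers are assumptions for illustration. In practice the decay rate and the form of the adjustment would need to be tuned on held-out games.

```python
def recency_weighted_oreb_pct(orebs, chances, decay=0.95):
    """Pooled OREB% where a game k games back gets weight decay**k."""
    n = len(orebs)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    weighted_orebs = sum(w * o for w, o in zip(weights, orebs))
    weighted_chances = sum(w * c for w, c in zip(weights, chances))
    return weighted_orebs / weighted_chances

def opponent_adjusted(team_rate, opp_allowed_rate, league_rate):
    """Shift the team rate by how far the opponent's allowed OREB% is from league average."""
    return team_rate + (opp_allowed_rate - league_rate)

# Invented season: weak rebounding early, strong rebounding late.
orebs = [5, 5, 15, 15]
chances = [50, 50, 50, 50]

unweighted = recency_weighted_oreb_pct(orebs, chances, decay=1.0)
weighted = recency_weighted_oreb_pct(orebs, chances, decay=0.5)

# Opponent that allows fewer offensive rebounds than a made-up league average.
prediction = opponent_adjusted(weighted, opp_allowed_rate=0.22, league_rate=0.27)
print(round(unweighted, 3), round(weighted, 3), round(prediction, 3))
```

With `decay=1.0` the function reduces to the plain pooled rate, so the simple average model is a special case of the weighted one.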