# Name: Igor Dantas Gomes Franca
### Date: 19/08/2023

<style>
.jp-Notebook {
    padding: var(--jp-notebook-padding);
    margin-left: 160px;
    outline: none;
    overflow: auto;
    background: var(--jp-layout-color0);
}
</style>

<img src="https://cdn.nba.com/logos/nba/1610612760/primary/L/logo.svg" alt="logo" style="position: fixed; top: -40px; left: 5px; height: 250px;">

# Introduction  

The purpose of this project is to gauge your technical skills and problem-solving ability by working through something similar to a real NBA data science project. You will work your way through this Jupyter notebook, answering questions as you go along. Please begin by adding your name to the top markdown chunk in this document. When you're finished with the document, come back and type your answers into the answer key at the top. Please leave all your work below and have your answers where indicated below as well. Please note that we will be reviewing your code, so make it clear and concise, and avoid long printouts. Feel free to add in as many new code chunks as you'd like.

Remember that we will be grading the quality of your code and visuals alongside the correctness of your answers. Please try to use packages like pandas/numpy and matplotlib/seaborn as much as possible (instead of base Python data manipulations and explicit loops).  

**WARNING:** Your project will **ONLY** be graded if it's knit to an HTML document where we can see your code. Be careful to make sure that any long lines of code wrap visibly to the next line, as code that's cut off at the side of the document cannot be graded.  

**Note:**    

**Throughout this document, any `season` column represents the year each season started. For example, the 2015-16 season will be in the dataset as 2015. For most of the rest of the project, we will refer to a season by just this number (e.g. 2015) instead of the full text (e.g. 2015-16).** 
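
As a small illustration of this convention, the helper below converts a `season` value back to its full text. The function name `season_label` is hypothetical and is not part of the provided data or template.

```python
def season_label(start_year: int) -> str:
    """Return the full season text for a start year, e.g. 2015 -> '2015-16'."""
    # The suffix is the last two digits of the year in which the season ends.
    end_suffix = (start_year + 1) % 100
    return f"{start_year}-{end_suffix:02d}"


print(season_label(2015))
```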

# Answers  

## Part 1      

**Question 1:**   

- 1st Team: 25.9 points per game  
- 2nd Team: 23.1 points per game  
- 3rd Team: 20.5 points per game  
- All-Star: 21.6 points per game   

**Question 2:** 3.6 Years  

**Question 3:** 

- Elite: 2 players.  
- All-Star: 1 player.  
- Starter: 10 players.  
- Rotation: 10 players.  
- Roster: 9 players.  
- Out of League: 41 players.  

**Open Ended Modeling Question:** Please show your work and leave all responses below in the document.


## Part 2  

**Question 1:** XX.X%   
**Question 2:** Written question, put answer below in the document.    
**Question 3:** Written question, put answer below in the document.    
  


# Setup and Data    

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import clear_output
# Note: the paths below assume the CSV files are in the same folder as this notebook.
# If your data lives elsewhere, prepend that directory to each file name.
awards = pd.read_csv("awards_data.csv")
player_data = pd.read_csv("player_stats.csv")
team_data = pd.read_csv("team_stats.csv")
rebounding_data = pd.read_csv("team_rebounding_data_22.csv")

## Part 1 -- Awards  

In this section, you're going to work with data relating to player awards and statistics. You'll start with some data manipulation questions and work towards building a model to predict broad levels of career success.  


In [2]:
awards.columns
player_data.columns

Index(['nbapersonid', 'player', 'draftyear', 'draftpick', 'season',
       'nbateamid', 'team', 'games', 'games_start', 'mins', 'fgm', 'fga',
       'fgp', 'fgm3', 'fga3', 'fgp3', 'fgm2', 'fga2', 'fgp2', 'efg', 'ftm',
       'fta', 'ftp', 'off_reb', 'def_reb', 'tot_reb', 'ast', 'steals',
       'blocks', 'tov', 'tot_fouls', 'points', 'PER', 'FTr', 'off_reb_pct',
       'def_reb_pct', 'tot_reb_pct', 'ast_pct', 'stl_pct', 'blk_pct',
       'tov_pct', 'usg', 'OWS', 'DWS', 'WS', 'OBPM', 'DBPM', 'BPM', 'VORP'],
      dtype='object')

### Question 1  

**QUESTION:** What is the average number of points per game for players in the 2007-2021 seasons who won All NBA First, Second, and Third teams (**not** the All Defensive Teams), as well as for players who were in the All-Star Game (**not** the rookie all-star game)?


 

In [3]:
# player_data and awards both cover only the 2007-2021 seasons, so no season filter is needed

# Add a points-per-game (ppg) column on a copy so player_data itself is left untouched
player_data_copy = player_data.copy()
player_data_copy['ppg'] = player_data_copy['points'] / player_data_copy['games']

# Attach the matching stat line (and its ppg) to each award row for the same player and season
merged_data = awards.merge(player_data_copy, on=['season', 'nbapersonid'], how='left')
# Only matters if both frames carry a 'ppg' column; otherwise this rename is a no-op
merged_data = merged_data.rename(columns={'ppg_y':'ppg'})

# Average ppg among the rows flagged for each award
award_columns = {
    '1st Team': 'All NBA First Team',
    '2nd Team': 'All NBA Second Team',
    '3rd Team': 'All NBA Third Team',
    'All-Star': 'all_star_game',
}
for label, column in award_columns.items():
    mean_ppg = merged_data.loc[merged_data[column] == 1, 'ppg'].mean()
    print(f"{label}: {round(mean_ppg, 1)} points per game")

1st Team: 25.9 points per game
2nd Team: 23.1 points per game
3rd Team: 20.5 points per game
All-Star: 21.6 points per game


<strong><span style="color:red">ANSWER 1:</span></strong>   

1st Team: 25.9 points per game  
2nd Team: 23.1 points per game  
3rd Team: 20.5 points per game  
All-Star: 21.6 points per game  
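
One detail worth checking in a calculation like this is whether a player can have more than one stat row in a season (for example, one row per team after a trade). The sketch below runs on invented toy data, not the project files, and assumes such split rows exist; it sums points and games per player-season before dividing, so each award row matches exactly one season line.

```python
import pandas as pd

# Toy stand-ins for the real tables; every value here is invented.
stats_toy = pd.DataFrame({
    "nbapersonid": [1, 1, 2, 3],       # player 1 has two rows (assumed trade)
    "season": [2020, 2020, 2020, 2020],
    "games": [30, 20, 60, 50],
    "points": [600, 500, 1500, 1000],
})
awards_toy = pd.DataFrame({
    "nbapersonid": [1, 2, 3],
    "season": [2020, 2020, 2020],
    "All NBA First Team": [1, 0, 0],
    "all_star_game": [1, 1, 0],
})

# Collapse multiple rows for the same player-season into one season line.
season_totals = (
    stats_toy.groupby(["nbapersonid", "season"], as_index=False)[["points", "games"]]
    .sum()
)
season_totals["ppg"] = season_totals["points"] / season_totals["games"]

merged_toy = awards_toy.merge(season_totals, on=["nbapersonid", "season"], how="left")

# Mean ppg among the players flagged for each award.
ppg_by_award = {
    col: float(round(merged_toy.loc[merged_toy[col] == 1, "ppg"].mean(), 1))
    for col in ["All NBA First Team", "all_star_game"]
}
print(ppg_by_award)
```

If the real data never splits a player-season across rows, the aggregation step changes nothing and can be dropped.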

### Question 2  

**QUESTION:** What was the average number of years of experience in the league it takes for players to make their first All NBA Selection (1st, 2nd, or 3rd team)? Please limit your sample to players drafted in 2007 or later who did eventually go on to win at least one All NBA selection. For example:

- Luka Doncic is in the dataset as 2 years. He was drafted in 2018 and won his first All NBA award in 2019 (which was his second season).  
- LeBron James is not in this dataset, as he was drafted prior to 2007.  
- Lu Dort is not in this dataset, as he has not received any All NBA honors.  
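
The three examples above can be reproduced with a vectorized sketch on invented toy data. It assumes the counting convention implied by the Luka Doncic example (a first selection one season after the draft year counts as 2 years), and the toy player ids and draft years are hypothetical.

```python
import pandas as pd

# Toy award rows: player 10 is Luka-like, player 20 was drafted before 2007,
# player 30 never makes an All NBA team, player 40 is a second eligible player.
awards_toy = pd.DataFrame({
    "nbapersonid": [10, 10, 10, 20, 30, 40, 40],
    "season": [2018, 2019, 2020, 2010, 2020, 2014, 2015],
    "All NBA First Team": [0, 1, 1, 0, 0, 0, 0],
    "All NBA Second Team": [0, 0, 0, 1, 0, 0, 0],
    "All NBA Third Team": [0, 0, 0, 0, 0, 0, 1],
})
draft_toy = pd.DataFrame({
    "nbapersonid": [10, 20, 30, 40],
    "draftyear": [2018, 2003, 2019, 2012],
})

all_nba_cols = ["All NBA First Team", "All NBA Second Team", "All NBA Third Team"]

# Keep only rows with an All NBA selection, then take each player's earliest one.
made_all_nba = awards_toy[all_nba_cols].eq(1).any(axis=1)
first_selection = (
    awards_toy[made_all_nba]
    .groupby("nbapersonid", as_index=False)["season"].min()
    .rename(columns={"season": "first_all_nba_season"})
)

result = first_selection.merge(draft_toy, on="nbapersonid", how="left")
result = result[result["draftyear"] >= 2007].copy()

# +1 so that the draft-year season counts as year 1 (assumed from the example).
result["years_to_first"] = result["first_all_nba_season"] - result["draftyear"] + 1
print(result[["nbapersonid", "years_to_first"]])
print(f"{result['years_to_first'].mean():.1f} Years")
```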



In [4]:
# Filtering awards data, to keep only the first appearances at one of the awards for each nbapersonid

encountered_ids = set()
rows_filtered = []

for index, row in awards.iterrows():
    nbapersonid = row['nbapersonid']
    
    # Check if nbapersonid has been encountered before with a 1 value in first, second, or third
    if nbapersonid in encountered_ids:
        continue
        
    # If the current row has a 1 value in first, second, or third, mark nbapersonid as encountered
    if row['All NBA First Team'] == 1 or row['All NBA Second Team'] == 1 or row['All NBA Third Team'] == 1:
        encountered_ids.add(nbapersonid)
    
    # Append the current row to the rows_filtered list
    rows_filtered.append(row)
    
filtered_awards = pd.DataFrame(rows_filtered)


In [5]:
# merge data of draft year to awards data
merged_data = filtered_awards.merge(player_data, on=['season','nbapersonid'], how='left')
merged_data = merged_data.rename(columns={'draftyear_y':'draftyear'})

# keeping only players drafted in 2007 or later
merged_data = merged_data[merged_data['draftyear'] >= 2007]

# Creating column for years between the draft and the first All NBA selection
merged_data['time_until_selection'] = merged_data['season'] - merged_data['draftyear']

print(f"{round(merged_data['time_until_selection'].mean(),1)} Years")

3.6 Years


<strong><span style="color:red">ANSWER 2:</span></strong>  

3.6 Years  

## Data Cleaning Interlude  

You're going to work to create a dataset with a "career outcome" for each player, representing the highest level of success that the player achieved for **at least two** seasons *after his first four seasons in the league* (examples to follow below!). To do this, you'll start with single season level outcomes. On a single season level, the outcomes are:  

- Elite: A player is "Elite" in a season if he won any All NBA award (1st, 2nd, or 3rd team), MVP, or DPOY in that season.    
- All-Star: A player is "All-Star" in a season if he was selected to be an All-Star that season.   
- Starter:  A player is a "Starter" in a season if he started in at least 41 games in the season OR if he played at least 2000 minutes in the season.    
- Rotation:  A player is a "Rotation" player in a season if he played at least 1000 minutes in the season.   
- Roster:  A player is a "Roster" player in a season if he played at least 1 minute for an NBA team but did not meet any of the above criteria.     
- Out of the League: A player is "Out of the League" if he is not in the NBA in that season.   

We need to make an adjustment for determining Starter/Rotation qualifications for a few seasons that didn't have 82 games per team. Assume that there were 66 possible games in the 2011 lockout season and 72 possible games in each of the 2019 and 2020 seasons that were shortened due to covid. Specifically, if a player played 900 minutes in 2011, he **would** meet the rotation criteria because his final minutes would be considered to be 900 * (82/66) = 1118. Please use this math for both minutes and games started, so a player who started 38 games in 2019 or 2020 would be considered to have started 38 * (82/72) = 43 games, and thus would qualify for starting 41. Any answers should be calculated assuming you round the multiplied values to the nearest whole number.
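
The adjustment arithmetic above can be sketched as follows. The data frame is invented toy data and the names `GAMES_IN_SEASON`, `mins_adj`, and `games_start_adj` are hypothetical; the season lengths come from the paragraph above.

```python
import pandas as pd

# Possible games per team in the shortened seasons; every other season has 82.
GAMES_IN_SEASON = {2011: 66, 2019: 72, 2020: 72}

toy = pd.DataFrame({
    "season": [2011, 2019, 2018],
    "mins": [900, 1500, 900],
    "games_start": [10, 38, 38],
})

# Per-row scale factor: 82/66, 82/72, or exactly 1.0 for a full season.
scale = 82 / toy["season"].map(GAMES_IN_SEASON).fillna(82)

# Round the multiplied values to the nearest whole number, as specified.
# Note that pandas rounds exact .5 ties to the nearest even integer.
toy["mins_adj"] = (toy["mins"] * scale).round().astype(int)
toy["games_start_adj"] = (toy["games_start"] * scale).round().astype(int)
print(toy)
```

The first two rows reproduce the worked numbers in the text: 900 minutes in 2011 become 1118, and 38 starts in 2019 become 43.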

Note that on a season level, a player's outcome is the highest level of success he qualifies for in that season. Thus, since Shai Gilgeous-Alexander was both All-NBA 1st team and an All-Star last year, he would be considered to be "Elite" for the 2022 season, but would still qualify for a career outcome of All-Star if in the rest of his career he made one more All-Star game but no more All-NBA teams. Note this is a hypothetical, and Shai has not yet played enough to have a career outcome.    
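
Because a season outcome is the highest level a player qualifies for, the rules can be written as an ordered list of conditions where the first match wins. The sketch below uses `numpy.select` on invented toy rows; for brevity it checks only two of the Elite columns, and it assumes minutes and games started have already been adjusted for shortened seasons.

```python
import numpy as np
import pandas as pd

# Toy season rows; NaN mimics a missing award entry after a left merge.
toy = pd.DataFrame({
    "All NBA First Team": [1, 0, 0, 0, 0, 0, 0],
    "Defensive Player Of The Year_rk": [np.nan, 1, np.nan, np.nan, np.nan, np.nan, np.nan],
    "all_star_game": [1, 0, 1, 0, 0, 0, 0],
    "games_start": [70, 60, 65, 41, 5, 0, 0],
    "mins": [2500, 2000, 2300, 1200, 1000, 12, 0],
})

# A full version would list every Elite column (all three All NBA teams, MVP, DPOY).
elite_cols = ["All NBA First Team", "Defensive Player Of The Year_rk"]

# Conditions are ordered from highest to lowest level; np.select takes the first hit.
conditions = [
    toy[elite_cols].eq(1).any(axis=1),
    toy["all_star_game"].eq(1),
    toy["games_start"].ge(41) | toy["mins"].ge(2000),
    toy["mins"].ge(1000),
    toy["mins"].ge(1),
]
labels = ["Elite", "All-Star", "Starter", "Rotation", "Roster"]
toy["season_outcome"] = np.select(conditions, labels, default="Out of the League")
print(toy[["games_start", "mins", "season_outcome"]])
```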

Examples:  

- A player who enters the league as a rookie and has season outcomes of Roster (1), Rotation (2), Rotation (3), Roster (4), Roster (5), Out of the League (6+) would be considered "Out of the League," because after his first four seasons, he only has a single Roster year, which does not qualify him for any success outcome.  
- A player who enters the league as a rookie and has season outcomes of Roster (1), Rotation (2), Starter (3), Starter (4), Starter (5), Starter (6), All-Star (7), Elite (8), Starter (9) would be considered "All-Star," because he had at least two seasons after his first four at all-star level of production or higher.  
- A player who enters the league as a rookie and has season outcomes of Roster (1), Rotation (2), Starter (3), Starter (4), Starter (5), Starter (6), Rotation (7), Rotation (8), Roster (9) would be considered a "Starter" because he has two seasons after his first four at a starter level of production. 
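
The three examples above can be checked with a short sketch. It relies on one observation: if reaching a level means having a season at that level or higher, then the highest level reached at least twice is simply the second-best season after the first four. The names `LEVELS` and `career_outcome` are hypothetical.

```python
LEVELS = ["Out of the League", "Roster", "Rotation", "Starter", "All-Star", "Elite"]


def career_outcome(season_outcomes):
    """Return the career outcome for a list of season outcomes, rookie year first."""
    rank = {name: i for i, name in enumerate(LEVELS)}
    # Only seasons after the first four count toward the career outcome.
    later = sorted((rank[o] for o in season_outcomes[4:]), reverse=True)
    # With fewer than two later seasons, no level can be reached twice.
    if len(later) < 2:
        return LEVELS[0]
    # The second-best later season is the highest level reached at least twice.
    return LEVELS[later[1]]


example_1 = ["Roster", "Rotation", "Rotation", "Roster", "Roster"]
example_2 = ["Roster", "Rotation", "Starter", "Starter", "Starter",
             "Starter", "All-Star", "Elite", "Starter"]
example_3 = ["Roster", "Rotation", "Starter", "Starter", "Starter",
             "Starter", "Rotation", "Rotation", "Roster"]

for example in (example_1, example_2, example_3):
    print(career_outcome(example))
```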


### Question 3  

**QUESTION:** There are 73 players in the `player_data` dataset who have 2010 listed as their draft year. How many of those players have a **career** outcome in each of the 6 buckets?  

In [121]:
# Merge awards onto the stats so every condition can be checked on a single row
player_data_copy = player_data.merge(awards, on=['season','nbapersonid'], how='left')

# Scale 'mins' and 'games' to an 82-game basis for the shortened seasons
condition_2011 = player_data_copy['season'] == 2011
condition_2019_2020 = player_data_copy['season'].isin([2019,2020])

player_data_copy.loc[condition_2011, 'mins'] = round(player_data_copy.loc[condition_2011, 'mins'] * (82 / 66))
player_data_copy.loc[condition_2011, 'games'] = round(player_data_copy.loc[condition_2011, 'games'] * (82 / 66))

player_data_copy.loc[condition_2019_2020, 'mins'] = round(player_data_copy.loc[condition_2019_2020, 'mins'] * (82 / 72))
player_data_copy.loc[condition_2019_2020, 'games'] = round(player_data_copy.loc[condition_2019_2020, 'games'] * (82 / 72))

player_data_copy = player_data_copy[player_data_copy['draftyear']==2010]

In [122]:
# Sorting players in alphabetical order and putting seasons in ascending order
sorted_players = player_data_copy.sort_values(by=['player', 'season'])

# Declaring functions to check conditions for each outcome

# Elite: A player is "Elite" in a season if he won any All NBA award (1st, 2nd, or 3rd team), MVP, or DPOY in that season.
def elite(row):
    first = row['All NBA First Team']==1
    second = row['All NBA Second Team']==1
    third = row['All NBA Third Team']==1
    mvp = row['Most Valuable Player_rk']==1
    dpoy = row['Defensive Player Of The Year_rk']==1
    return first or second or third or mvp or dpoy

# A player is "All-Star" in a season if he was selected to be an All-Star that season.
def all_star(row):
    return row['all_star_game']==1

# A player is a "Starter" in a season if he started in at least 41 games in the season OR if he played at least 2000 minutes in the season.
def starter(row): 
    started = row['games_start'] >= 41
    minuted = row['mins'] >= 2000
    return started or minuted

# Rotation: A player is a "Rotation" player in a season if he played at least 1000 minutes in the season.
def rotation(row):
    return row['mins'] >= 1000

# Roster: A player is a "Roster" player in a season if he played at least 1 minute for an NBA team but did not meet any of the above criteria.
def roster(row):
    return row['mins'] >= 1

In [123]:
outcomes = []
for index, row in sorted_players.iterrows():
    if elite(row):
        outcomes.append(5)
        continue
    if all_star(row):
        outcomes.append(4)
        continue
    if starter(row):
        outcomes.append(3)
        continue
    if rotation(row):
        outcomes.append(2)
        continue
    if roster(row):
        outcomes.append(1)
        continue
    outcomes.append(0)
    
sorted_players['outcome'] = outcomes

In [125]:
sorted_players_listed = sorted_players.groupby('player').agg(list).reset_index()

In [126]:
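# Checking the highest outcome that happened at least twice after first 4 seasons
def check_highest(row):
    # Fewer than six listed seasons leaves at most one season after the first four
    if len(row['outcome']) < 6:
        return 0

    later_seasons = row['outcome'][4:]
    highest = 0

    # Levels are visited in ascending order, so the last level that qualifies is the highest
    for level in range(6):
        count = sum(k >= level for k in later_seasons)
        if count >= 2:
            highest = level
    return highest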

career_outcome = []

for index, row in sorted_players_listed.iterrows():
    career_outcome.append(check_highest(row))
    
sorted_players_listed['career_outcome'] = career_outcome

In [127]:
sorted_players_listed[sorted_players_listed['player'] == 'John Wall'][['player','season','outcome', 'career_outcome']]

|    | player | season | outcome | career_outcome |
|---|---|---|---|---|
| 44 | John Wall | [2010, 2011, 2012, 2013, 2014, 2015, 2016, 201... | [3, 3, 3, 4, 4, 4, 5, 4, 2, 2] | 4 |


In [129]:
print(f"Elite: {career_outcome.count(5)}")
print(f"All-Star: {career_outcome.count(4)}")
print(f"Starter: {career_outcome.count(3)}")
print(f"Rotation: {career_outcome.count(2)}")
print(f"Roster: {career_outcome.count(1)}")
print(f"Out of the league: {career_outcome.count(0)}")

Elite: 2
All-Star: 1
Starter: 10
Rotation: 10
Roster: 9
Out of the league: 41


<strong><span style="color:red">ANSWER 3:</span></strong>  

Elite: 2 players.  
All-Star: 1 player.  
Starter: 10 players.  
Rotation: 10 players.  
Roster: 9 players.  
Out of League: 41 players.  

### Open Ended Modeling Question   

In this question, you will work to build a model to predict a player's career outcome based on information up through the first four years of his career. 

This question is intentionally left fairly open ended, but here are some notes and specifications.  

1. We know modeling questions can take a long time, and that qualified candidates will have different levels of experience with "formal" modeling. Don't be discouraged. It's not our intention to make you spend excessive time here. If you get your model to a good spot but think you could do better by spending a lot more time, you can just write a bit about your ideas for future improvement and leave it there. Further, we're more interested in your thought process and critical thinking than we are in specific modeling techniques. Using smart features is more important than using fancy mathematical machinery, and a successful candidate could use a simple regression approach. 

2. You may use any data provided in this project, but please do not bring in any external sources of data. Note that while most of the data provided goes back to 2007, All NBA and All Rookie team voting is only included back to 2011.  

3. A player needs to complete three additional seasons after their first four to be considered as having a distinct career outcome for our dataset. Because the dataset in this project ends in 2021, this means that a player would need to have had the chance to play in the '21, '20, and '19 seasons after his first four years, and thus his first four years would have been '18, '17, '16, and '15. **For this reason, limit your training data to players who were drafted in or before the 2015 season.** Karl-Anthony Towns was the #1 pick in that season.  

4. Once you build your model, predict on all players who were drafted in 2018-2021 (They have between 1 and 4 seasons of data available and have not yet started accumulating seasons that inform their career outcome).  

5. You can predict a single career outcome for each player, but it's better if you can predict the probability that each player falls into each outcome bucket.    

6. Include, as part of your answer:  
  - A brief written overview of how your model works, targeted towards a decision maker in the front office without a strong statistical background. 
  - What you view as the strengths and weaknesses of your model.  
  - How you'd address the weaknesses if you had more time and/or more data.  
  - A matplotlib or plotly visualization highlighting some part of your modeling process, the model itself, or your results.  
  - Your predictions for Shai Gilgeous-Alexander, Zion Williamson, James Wiseman, and Josh Giddey.  
  - (Bonus!) An html table (for example, see the package `reactable`) containing all predictions for the players drafted in 2019-2021.  
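
As one possible starting point for the specifications above, the sketch below builds per-player features from the first four seasons only and splits players into a training set (drafted in or before 2015) and a prediction set (drafted 2018-2021). It runs on invented toy data. It assumes a player's first season is his draft-year season, which will not hold for players who debut later, and the feature names are hypothetical.

```python
import pandas as pd

# Toy season rows: player 1 has a full early career, player 2 is a recent draftee.
toy = pd.DataFrame({
    "nbapersonid": [1, 1, 1, 1, 1, 2, 2],
    "draftyear": [2012, 2012, 2012, 2012, 2012, 2020, 2020],
    "season": [2012, 2013, 2014, 2015, 2016, 2020, 2021],
    "mins": [800, 1500, 2100, 2400, 2600, 1200, 1900],
    "points": [300, 700, 1100, 1400, 1600, 500, 900],
})

# Career year 1 is the draft-year season (assumption noted above).
toy["career_year"] = toy["season"] - toy["draftyear"] + 1

# Features may only use information from the first four seasons.
early = toy[toy["career_year"].between(1, 4)]
features = early.groupby(["nbapersonid", "draftyear"], as_index=False).agg(
    seasons_played=("season", "nunique"),
    total_mins=("mins", "sum"),
    total_points=("points", "sum"),
)
features["pts_per_36"] = 36 * features["total_points"] / features["total_mins"]

# Training players have observable career outcomes; recent draftees do not yet.
train = features[features["draftyear"] <= 2015]
predict = features[features["draftyear"].between(2018, 2021)]
print(train)
print(predict)
```

Rate features such as points per 36 minutes keep players with one season of data comparable to players with four, which matters because the prediction set has between 1 and 4 seasons available.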



In [None]:
# Note: Here as well as anywhere else, feel free to add as many code chunks as you'd like.

## Part 2 -- Predicting Team Stats  

In this section, we're going to introduce a simple way to predict team offensive rebound percent in the next game and then discuss ways to improve those predictions.  
 
### Question 1   

Using the `rebounding_data` dataset, we'll predict a team's next game's offensive rebounding percent to be their average offensive rebounding percent in all prior games. On a single game level, offensive rebounding percent is the number of offensive rebounds divided by the number of offensive rebound "chances" (essentially the team's missed shots). On a multi-game sample, it should be the total number of offensive rebounds divided by the total number of offensive rebound chances.    

Please calculate what OKC's predicted offensive rebound percent is for game 81 in the data. That is, use games 1-80 to predict game 81.  
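
The prior-games prediction described above can be sketched on invented toy data as a ratio of running totals that excludes the current game. The column names `game_number`, `offensive_rebounds`, and `off_rebound_chances` are assumptions about the layout of `rebounding_data` and should be adjusted to the actual file.

```python
import pandas as pd

# Toy game log for one team; all values are invented.
games = pd.DataFrame({
    "team": ["OKC"] * 4,
    "game_number": [1, 2, 3, 4],
    "offensive_rebounds": [10, 12, 8, 15],
    "off_rebound_chances": [40, 50, 30, 45],
})
games = games.sort_values(["team", "game_number"])

# Running totals through each game, minus that game, give totals over prior games only.
grouped = games.groupby("team")
prior_oreb = grouped["offensive_rebounds"].cumsum() - games["offensive_rebounds"]
prior_chances = grouped["off_rebound_chances"].cumsum() - games["off_rebound_chances"]

# Total rebounds over total chances, not the mean of per-game percentages.
# The first game has no prior games, so its prediction is NaN (0 / 0).
games["pred_oreb_pct"] = prior_oreb / prior_chances
print(games[["game_number", "pred_oreb_pct"]])
```

For game 4 the prediction is (10 + 12 + 8) / (40 + 50 + 30) = 0.25, which differs slightly from averaging the three single-game percentages because games with more chances carry more weight.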

<strong><span style="color:red">ANSWER 1:</span></strong>  

XX.X% 

### Question 2  

There are a few limitations to the method we used above. For example, if a team has a great offensive rebounder who has played in most games this season but will be out due to an injury for the next game, we might reasonably predict a lower team offensive rebound percent for the next game.  

Please discuss how you would think about changing our original model to better account for missing players. You do not have to write any code or implement any changes, and you can assume you have access to any reasonable data that isn't provided in this project. Try to be clear and concise with your answer.  

<strong><span style="color:red">ANSWER 2:</span></strong>  

### Question 3  

In question 2, you saw and discussed how to deal with one weakness of the model. For this question, please write about 1-3 other potential weaknesses of the simple average model you made in question 1 and discuss how you would deal with each of them. You may either explain a weakness and discuss how you'd fix that weakness, then move onto the next issue, or you can start by explaining multiple weaknesses with the original approach and discuss one overall modeling methodology you'd use that gets around most or all of them. Again, you do not need to write any code or implement any changes, and you can assume you have access to any reasonable data that isn't provided in this project. Try to be clear and concise with your answer.  


<strong><span style="color:red">ANSWER 3:</span></strong>  