# NBA Player Career Projection #
## _DSCI 100 Group Project_ ##

## Introduction ## 

Prompts:
- provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your report
- clearly state the question you tried to answer with your project
- identify and describe the dataset that was used to answer the question

Basketball is a globally renowned sport with a massive following, and the professional leagues are the pinnacle of talent and competition. Understanding player statistics is essential for team management, player evaluation, and fan engagement. The `all_seasons` dataset records the performance of NBA players in each season from 1996-97 to 2022-23. Using this dataset, we can ask a predictive question such as how a player's attributes and performance in their rookie season relate to their overall career performance. The dataset encompasses several key attributes, including:

- `player_name`, name of an NBA player
- `team_abbreviation`, abbreviated name of the team they played on
- `age`, a player's age
- `player_height`, a player's height
- `player_weight`, a player's weight 
- `college`, the college they played for
- `country`, their nationality
- `draft_year`, the year they were drafted
- `draft_round`, the round they were drafted in
- `draft_number`, the pick number they were drafted with
- `gp`, number of games played in a season
- `pts`, average points per game in a season
- `reb`, average rebounds per game in a season
- `ast`, average assists per game in a season
- `net_rating`, average net rating in a season
- `oreb_pct`, average offensive rebound percentage in a season
- `dreb_pct`, average defensive rebound percentage in a season
- `usg_pct`, average usage percentage in a season
- `ts_pct`, average true shooting percentage in a season
- `ast_pct`, average assist percentage in a season
- `season`, the season they played in which these stats were recorded


By analyzing this dataset, we aim to draw insights and patterns from past player statistics, potentially aiding in the selection, trading, and performance prediction of rookie players in future basketball seasons by estimating their potential through statistics. 

### Research question: ###
__How do a player's physical attributes and scoring statistics in their rookie year relate to their total career points?__ We will attempt to answer this question by fitting a regression model to the physical attributes and scoring statistics of past players, specifically: `pts` (points per game), `gp` (games played), `player_height` (cm), `player_weight` (kg), `usg_pct` (usage percentage) and `ts_pct` (true shooting percentage).

## Methods and results ##

- describe in written English the methods you used to perform your analysis from beginning to end, narrating the code that does the analysis.
- your report should include code which:
    - loads data from the original source on the web 
    - wrangles and cleans the data from its original (downloaded) format to the format necessary for the planned analysis
    - performs a summary of the data set that is relevant for exploratory data analysis related to the planned analysis 
    - creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis
    - performs the data analysis
    - creates a visualization of the analysis 
    
note: all tables and figures should have a figure/table number and a legend

## Preliminary exploratory data analysis ##

In the preliminary exploratory data analysis completed for our project proposal, we demonstrated that the NBA dataset can be downloaded from Kaggle (link: https://www.kaggle.com/datasets/justinas/nba-players-data/) and read into R using `read_csv` from the tidyverse library. We have stored it in the `data` folder.

In our preliminary exploratory data analysis, we also completed necessary steps to tidy the data to ensure consistency, remove irrelevant information, and maintain the three criteria necessary for tidy data: each column is a variable, each row is a single observation, and each cell is a value. 

In [4]:
# importing the tidyverse library
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


Throughout this report, the dataset we work with is named `all_seasons.csv` and lives in the `data` directory. Below, we read the dataset using `read_csv` and use `tail` to look at its last 6 rows to confirm that the whole file was read correctly.

In [5]:
# reading the dataset in the data folder
nba_raw <- read_csv("data/all_seasons.csv")
# looking at the last 6 rows
tail(nba_raw)

New names:
• `` -> `...1`
Rows: 12844 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): player_name, team_abbreviation, college, country, draft_year, draf...
dbl (14): ...1, age, player_height, player_weight, gp, pts, reb, ast, net_ra...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


...1,player_name,team_abbreviation,age,player_height,player_weight,college,country,draft_year,draft_round,⋯,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct,season
<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
12838,Joe Wieskamp,TOR,23,198.12,92.98636,Iowa,USA,2021,2,⋯,1.0,0.4,0.3,1.0,0.0,0.068,0.115,0.321,0.083,2022-23
12839,Joel Embiid,PHI,29,213.36,127.00576,Kansas,Cameroon,2014,1,⋯,33.1,10.2,4.2,8.8,0.057,0.243,0.37,0.655,0.233,2022-23
12840,John Butler Jr.,POR,20,213.36,86.18248,Florida State,USA,Undrafted,Undrafted,⋯,2.4,0.9,0.6,-16.1,0.012,0.065,0.102,0.411,0.066,2022-23
12841,John Collins,ATL,25,205.74,102.51179,Wake Forest,USA,2017,1,⋯,13.1,6.5,1.2,-0.2,0.035,0.18,0.168,0.593,0.052,2022-23
12842,Jericho Sims,NYK,24,208.28,113.398,Texas,USA,2021,2,⋯,3.4,4.7,0.5,-6.7,0.117,0.175,0.074,0.78,0.044,2022-23
12843,JaMychal Green,GSW,33,205.74,102.96538,Alabama,USA,Undrafted,Undrafted,⋯,6.4,3.6,0.9,-8.2,0.087,0.164,0.169,0.65,0.094,2022-23


**Table 1.** Table of the last 6 rows of raw data from all_seasons.csv

### Data tidying ###
Looking at the columns, we see that `draft_year`, `draft_round` and `draft_number` are character columns instead of numeric. Upon investigating the data, we see that this is because some players entered the NBA undrafted and were picked up by teams through other means, and are therefore marked as "Undrafted" in these columns. Since we want to use rookies whose only recorded season is 2022 as our test data, we must separate players who have played in earlier seasons (training data) from players who have only played in the 2022 season (test data). However, because some players went undrafted, it is difficult to determine in which year those players were rookies. If we define a player's rookie season as the first season in which we observe them, an undrafted player whose first observed season is also the first season recorded in this dataset poses a problem: since we have no data from earlier seasons, we cannot know whether that season was truly their rookie season. Thus, to make data manipulation and analysis easier, we will only consider players who were drafted.

Since college, country and the team that they played for are not important for our data analysis, we will select only the remaining columns during our data processing. Additionally, to make data manipulation easier, we will convert season into a numeric value by keeping only the year the season began (e.g. "1996-97" becomes 1996).

In [6]:
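nba_data <- nba_raw |>
    # keep only drafted players so that the draft columns can be converted to numbers
    filter(draft_year != "Undrafted" & draft_round != "Undrafted" & draft_number != "Undrafted") |>
    # split e.g. "1996-97" into a start part ("1996") and an end part ("97")
    separate(season, into = c("season_start", "season_end"), sep = "-") |>
    mutate(season_start = as.numeric(season_start), draft_year = as.numeric(draft_year), 
           draft_round = as.numeric(draft_round), draft_number = as.numeric(draft_number)) |>
    # drop the row index, team, college, country and season_end columns
    select(player_name, age:player_weight, draft_year:season_start) 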
# looking at the first 6 rows of tidied data
head(nba_data)

player_name,age,player_height,player_weight,draft_year,draft_round,draft_number,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct,season_start
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Randy Livingston,22,193.04,94.80073,1996,2,42,64,3.9,1.5,2.4,0.3,0.042,0.071,0.169,0.487,0.248,1996
Gaylon Nickerson,28,190.5,86.18248,1994,2,34,4,3.8,1.3,0.3,8.9,0.03,0.111,0.174,0.497,0.043,1996
George Lynch,26,203.2,103.41898,1993,1,12,41,8.3,6.4,1.9,-8.2,0.106,0.185,0.175,0.512,0.125,1996
George McCloud,30,203.2,102.0582,1989,1,7,64,10.2,2.8,1.7,-2.7,0.027,0.111,0.206,0.527,0.125,1996
George Zidek,23,213.36,119.74829,1995,1,22,52,2.8,1.7,0.3,-14.1,0.102,0.169,0.195,0.5,0.064,1996
Gerald Wilkins,33,198.12,102.0582,1985,2,47,80,10.6,2.2,2.2,-5.8,0.031,0.064,0.203,0.503,0.143,1996


**Table 2.** Table of the first 6 rows of tidied `all_seasons` data

Prompts: 
- Demonstrate that the dataset can be read from the web into R 
- Clean and wrangle your data into a tidy format
- Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data. 
- Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.

## Methods ##

Prompts 
- Explain how you will conduct either your data analysis and which variables/columns you will use. Note - you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. For each variable think: is this a useful variable for prediction?

- Describe at least one way that you will visualize the results

Response

- To conduct our analysis, we will use a K-nearest neighbors (KNN) regression model with the columns `pts` (points per game), `gp` (games played), `player_height` (cm), `player_weight` (kg), `usg_pct` (usage percentage) and `ts_pct` (true shooting percentage) as predictors, since these factors are likely to have a significant influence on the total number of points scored over a career. Using these predictors, we will project the number of points a player will score as the average career total of their K nearest neighbors, where the value of K will be determined through evaluation and tuning.

- We will use scatter plots overlaid with the fitted regression line to visualize the data and results, since this shows how the predictions follow the nearby training points. Furthermore, we will use color in the plots to distinguish players by the number of seasons they have played.
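
To make the "average of the K nearest neighbors" idea concrete, here is a minimal hand-rolled sketch in base R. It is an illustration only, not the pipeline this project will use: the data frame `toy_train`, the `new_player` row, and the helper `knn_predict` are all hypothetical, the numbers are made up, and only two of the six predictors are included to keep it short.

```r
# Hypothetical toy data: rookie-season stats and career totals (made-up numbers).
toy_train <- data.frame(
  pts = c(3.9, 5.1, 1.5, 4.8, 12.0, 8.3),
  gp = c(64, 72, 33, 68, 80, 41),
  total_points = c(1200, 9000, 3500, 4000, 15000, 6000)
)
new_player <- data.frame(pts = 5.0, gp = 70)

# Hypothetical helper: predict the response for one new row as the mean
# response of its k nearest training rows.
knn_predict <- function(train, new_row, predictors, response, k) {
  # Standardize with the training mean and sd so that no predictor
  # dominates the distance just because of its units.
  centers <- sapply(train[predictors], mean)
  scales <- sapply(train[predictors], sd)
  train_scaled <- scale(train[predictors], center = centers, scale = scales)
  new_scaled <- scale(new_row[predictors], center = centers, scale = scales)
  # Euclidean distance from the new row to every training row.
  dists <- sqrt(rowSums(sweep(train_scaled, 2, as.numeric(new_scaled))^2))
  # Average the response over the k closest training rows.
  nearest <- order(dists)[seq_len(k)]
  mean(train[[response]][nearest])
}

prediction_k2 <- knn_predict(toy_train, new_player, c("pts", "gp"), "total_points", k = 2)
prediction_all <- knn_predict(toy_train, new_player, c("pts", "gp"), "total_points", k = nrow(toy_train))
```

With `k` equal to the number of training rows, the prediction collapses to the overall mean of `total_points`, which is why K has to be tuned rather than simply set as large as possible.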

We first separate the data into training data and testing data.

In [26]:
# separate the data into training data (non-rookies) and test data (rookies)
nba_rookie <- nba_data |>
    filter(draft_year == 2022)

nba_non_rookies <- nba_data |>
    filter(draft_year >= 1996, draft_year <= 2021) 

# gets each player's first recorded season (the first row per name, so this relies on the rows being in chronological order)
nba_first_season <- nba_data[match(unique(nba_data$player_name), nba_data$player_name),]

# first season of rookies (defined as first season is 2022)
nba_test <- nba_first_season |>
    filter(season_start == 2022)

# first season of non-rookies: drafted in 1996 or later, with a first recorded season of 2021 or earlier
nba_training <- nba_first_season |>
    filter(draft_year >= 1996, season_start <= 2021)

nrow(nba_training)

# nba_training <- nba_non_rookies |>
#    filter(draft_year == season_start) |>
#    select(player_name) |>
#    unique()

# the problem is that some players did not play in the season they were drafted: RaiQuan Gray would not be counted as a rookie
# even though he played his first season in 2022, since he was drafted in 2021.
# tail(nba_training)

# nba_ray <- nba_data |>
#    filter(player_name == "RaiQuan Gray")

# nba_ray

# this can be solved using this:
# df1[(df1$name %in% df2$name),] 
# or:
# library(dplyr)
# anti_join(df1, df2, by = "name")

# data of all seasons of non-rookies
nba_non_rookies_data <- nba_data[(nba_data$player_name %in% nba_training$player_name),] 
head(nba_non_rookies_data)

player_name,age,player_height,player_weight,draft_year,draft_round,draft_number,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct,season_start
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Randy Livingston,22,193.04,94.80073,1996,2,42,64,3.9,1.5,2.4,0.3,0.042,0.071,0.169,0.487,0.248,1996
Erick Dampier,21,210.82,120.20188,1996,1,10,72,5.1,4.1,0.6,-2.0,0.107,0.216,0.218,0.451,0.074,1996
Jerome Williams,24,205.74,93.43995,1996,1,26,33,1.5,1.5,0.2,3.0,0.144,0.182,0.181,0.419,0.071,1996
John Wallace,23,205.74,102.0582,1996,1,18,68,4.8,2.3,0.5,2.7,0.08,0.148,0.204,0.571,0.081,1996
Jermaine O'Neal,18,210.82,102.51179,1996,1,17,45,4.1,2.8,0.2,1.3,0.099,0.198,0.199,0.494,0.03,1996
Jeff McInnis,22,193.04,86.18248,1996,2,37,13,5.0,0.5,1.4,-17.8,0.021,0.04,0.259,0.609,0.327,1996
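
The `match(unique(...))` idiom above picks the first row for each name, so it depends on the rows already being in chronological order. As a sketch of an alternative that does not depend on row order, the first recorded season can be selected per player with dplyr. The tibble `toy_seasons` and its values are made up for illustration, and the sketch leaves out the `draft_year >= 1996` condition used for the real training set.

```r
library(dplyr)

# Hypothetical toy rows, deliberately NOT in chronological order.
toy_seasons <- tibble(
  player_name = c("Player A", "Player B", "Player A", "Player B", "Player C"),
  season_start = c(2003, 2022, 2001, 2021, 2022),
  pts = c(10.2, 6.1, 4.5, 3.0, 7.7)
)

# Keep each player's earliest recorded season, regardless of row order.
toy_first_season <- toy_seasons |>
  group_by(player_name) |>
  slice_min(season_start, n = 1, with_ties = FALSE) |>
  ungroup()

# Players whose earliest recorded season is 2022 go to the test set;
# everyone else is a training candidate.
toy_test <- filter(toy_first_season, season_start == 2022)
toy_training <- filter(toy_first_season, season_start <= 2021)
```

Like the original approach, this identifies players by name only, so two different players who share a name would still be merged into one.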


In [28]:
# finding career points of nba players drafted between 1996 and 2021 (inclusive)
# pts is the per-game average, so pts * gp approximates the points scored in one season
nba_total_points <- nba_non_rookies_data |>
    group_by(player_name) |>
    summarize("total_points" = sum(pts*gp))

nrow(nba_total_points)
nba_training_labelled <- cbind(nba_training, nba_total_points)
nba_training_labelled

# new problem: nba_total_points is sorted alphabetically by player_name, but nba_training is in chronological order,
# so cbind pairs each player with another player's total (see the table below); the rows must be matched by player_name instead
options(repr.plot.width = 10, repr.plot.height = 8)

player_name,age,player_height,player_weight,draft_year,draft_round,draft_number,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct,season_start,player_name,total_points
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>.1,<dbl>
Randy Livingston,22,193.04,94.80073,1996,2,42,64,3.9,1.5,2.4,0.3,0.042,0.071,0.169,0.487,0.248,1996,A.J. Bramlett,8.0
Erick Dampier,21,210.82,120.20188,1996,1,10,72,5.1,4.1,0.6,-2.0,0.107,0.216,0.218,0.451,0.074,1996,A.J. Guyton,441.0
Jerome Williams,24,205.74,93.43995,1996,1,26,33,1.5,1.5,0.2,3.0,0.144,0.182,0.181,0.419,0.071,1996,AJ Hammons,48.4
John Wallace,23,205.74,102.05820,1996,1,18,68,4.8,2.3,0.5,2.7,0.080,0.148,0.204,0.571,0.081,1996,AJ Price,1521.7
Jermaine O'Neal,18,210.82,102.51179,1996,1,17,45,4.1,2.8,0.2,1.3,0.099,0.198,0.199,0.494,0.030,1996,Aaron Brooks,6263.7
Jeff McInnis,22,193.04,86.18248,1996,2,37,13,5.0,0.5,1.4,-17.8,0.021,0.040,0.259,0.609,0.327,1996,Aaron Gordon,7993.0
Jamie Feick,22,203.20,115.66596,1996,2,48,41,3.7,5.2,0.6,-12.2,0.133,0.253,0.150,0.405,0.065,1996,Aaron Gray,1066.3
Jason Sasser,23,200.66,102.05820,1996,2,41,8,2.4,1.0,0.3,-24.2,0.018,0.140,0.172,0.413,0.067,1996,Aaron Holiday,2039.8
Chris Robinson,23,195.58,90.71840,1996,2,51,41,4.6,1.7,1.6,-11.4,0.039,0.088,0.155,0.486,0.156,1996,Aaron Nesmith,1151.1
Brian Evans,23,203.20,99.79024,1996,1,27,14,1.4,0.6,0.5,-12.5,0.017,0.121,0.168,0.455,0.194,1996,Aaron Wiggins,891.0


## Discussion ##
Prompts
- summarize what you found
- discuss whether this is what you expected to find?
- discuss what impact could such findings have?
- discuss what future questions could this lead to?

Response from expected outcomes and significance in project proposal
- We are trying to predict the total points that a player will score by the end of his career from the rookie-season statistics of past players, including points per game, games played, height, weight, true shooting percentage and usage percentage.

- NBA teams could use this predicted score to identify whether a rookie player has potential or not. Also, it may be a useful index to determine a player's future trajectory. 

- This prediction model does not account for injuries and other factors. A player could also improve significantly over the years, so their rookie-year statistics may not be helpful in predicting how they will perform in the future. Future questions could explore how attributes and rookie statistics predict other measures of performance, such as assists and rebounds. This would give a fuller description and prediction of a player's career, not just in terms of points.

## References ##
