# NBA Player Career Projection #
## _DSCI 100 Group Project_ ##

## Introduction ## 


Basketball is a globally popular sport, and its professional leagues represent the highest level of talent and competition. Player statistics are central to team management, player evaluation, and fan engagement. The "all_seasons" dataset records the performance of NBA players in each season from 1996-97 to 2022-23. Using this dataset, our predictive question is how a player's attributes and performance in their rookie season relate to their overall career performance. The dataset includes several key attributes:

- player_name
- team_abbreviation
- age
- player_height
- player_weight
- college
- country
- draft_year
- draft_round
- points per game (pts)
- rebounds per game (reb)
- assists per game (ast)
- net_rating
- offensive rebound percentage (oreb_pct)
- defensive rebound percentage (dreb_pct)
- usage percentage (usg_pct)
- true shooting percentage (ts_pct)
- assist percentage (ast_pct)
- season

By analyzing this dataset, we aim to identify patterns in past player statistics that can be used to estimate a rookie's potential. Such estimates could help teams select, trade, and project the performance of rookie players in future seasons.

## Preliminary exploratory data analysis ##


The NBA dataset is available from Kaggle (link: https://www.kaggle.com/datasets/justinas/nba-players-data/). We downloaded it, stored it in the "data" folder, and read it into R using read_csv from the tidyverse library.

In [37]:
# importing the tidyverse library
library(tidyverse)

In [39]:
# reading the dataset in the data folder
nba_raw <- read_csv("data/all_seasons.csv")

# looking at the first 6 rows
head(nba_raw)

New names:
• `` -> `...1`
Rows: 12844 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): player_name, team_abbreviation, college, country, draft_year, draf...
dbl (14): ...1, age, player_height, player_weight, gp, pts, reb, ast, net_ra...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


...1,player_name,team_abbreviation,age,player_height,player_weight,college,country,draft_year,draft_round,⋯,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct,season
<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
0,Randy Livingston,HOU,22,193.04,94.80073,Louisiana State,USA,1996,2,⋯,3.9,1.5,2.4,0.3,0.042,0.071,0.169,0.487,0.248,1996-97
1,Gaylon Nickerson,WAS,28,190.5,86.18248,Northwestern Oklahoma,USA,1994,2,⋯,3.8,1.3,0.3,8.9,0.03,0.111,0.174,0.497,0.043,1996-97
2,George Lynch,VAN,26,203.2,103.41898,North Carolina,USA,1993,1,⋯,8.3,6.4,1.9,-8.2,0.106,0.185,0.175,0.512,0.125,1996-97
3,George McCloud,LAL,30,203.2,102.0582,Florida State,USA,1989,1,⋯,10.2,2.8,1.7,-2.7,0.027,0.111,0.206,0.527,0.125,1996-97
4,George Zidek,DEN,23,213.36,119.74829,UCLA,USA,1995,1,⋯,2.8,1.7,0.3,-14.1,0.102,0.169,0.195,0.5,0.064,1996-97
5,Gerald Wilkins,ORL,33,198.12,102.0582,Tennessee-Chattanooga,USA,1985,2,⋯,10.6,2.2,2.2,-5.8,0.031,0.064,0.203,0.503,0.143,1996-97


In [41]:
# some values in draft_year are "Undrafted", which is why draft_year is a character column
# (filtering nba_raw here, since nba_data is only created in a later cell)
nba_undrafted <- filter(nba_raw, draft_year == "Undrafted")
head(nba_undrafted)

...1,player_name,team_abbreviation,age,player_height,player_weight,college,country,draft_year,draft_round,⋯,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct,season
<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
29,Emanual Davis,HOU,28,195.58,87.99685,Delaware State,USA,Undrafted,Undrafted,⋯,5.0,1.7,2.0,6.6,0.011,0.098,0.144,0.565,0.191,1996-97
39,Erick Strickland,DAL,23,190.5,95.25432,Nebraska,USA,Undrafted,Undrafted,⋯,10.6,3.2,2.4,-6.4,0.032,0.112,0.216,0.51,0.161,1996-97
41,Evric Gray,NJN,27,200.66,106.59412,Nevada-Las Vegas,USA,Undrafted,Undrafted,⋯,2.6,0.6,0.4,17.5,0.026,0.049,0.192,0.388,0.065,1996-97
46,Henry James,ATL,31,203.2,99.79024,St. Mary's (TX),USA,Undrafted,Undrafted,⋯,6.7,1.5,0.4,1.2,0.034,0.067,0.171,0.555,0.036,1996-97
54,Jimmy Carruth,MIL,27,208.28,120.20188,Virginia Tech,USA,Undrafted,Undrafted,⋯,1.3,1.0,0.0,-17.7,0.0,0.211,0.103,0.727,0.0,1996-97
57,Joe Courtney,SAS,27,203.2,106.59412,Southern Mississippi,USA,Undrafted,Undrafted,⋯,2.8,1.8,0.0,-6.1,0.097,0.095,0.151,0.388,0.0,1996-97


### Data tidying ###
Looking at the columns, we see that "draft_year" and "draft_round" are character columns rather than numeric. This is because some players entered the NBA undrafted and were signed by teams through other routes, so they are marked as "Undrafted" in the "draft_year" and "draft_round" columns. Since we want to use rookie players who have only played in the 2022 season as our test data, we must separate players who have played in more than one season from players who have only played in the 2022 season.
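
If numeric draft columns are wanted later, one possible approach is to turn "Undrafted" into a missing value before converting. The sketch below uses a small made-up tibble (`draft_example` is hypothetical, not part of the project data) and keeps a separate `undrafted` flag so that the information is not lost.

```r
library(tidyverse)

# Toy rows (made up for illustration) mimicking the draft columns in all_seasons.csv
draft_example <- tibble(
  player_name = c("Player A", "Player B", "Player C"),
  draft_year  = c("1996", "Undrafted", "2003"),
  draft_round = c("2", "Undrafted", "1")
)

draft_numeric <- draft_example |>
  mutate(
    # record who was undrafted before the text value is replaced
    undrafted = draft_year == "Undrafted",
    # replace "Undrafted" with NA, then convert the remaining text to numbers
    across(c(draft_year, draft_round), ~ as.numeric(na_if(.x, "Undrafted")))
  )

draft_numeric
```

The same pattern could presumably be applied to `draft_number`, which is also read in as a character column.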

Since college, country and team are not important for our analysis, we will drop those columns and keep the rest during data processing. To make data manipulation easier, we will also convert season into a numeric value by keeping only the year in which the season began (e.g. "1996-97" becomes 1996).

In [55]:
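nba_data <- nba_raw |>
    # split "1996-97" into the start year ("1996") and the end year ("97")
    separate(season, into = c("season_start", "season_end"), sep = "-") |>
    mutate(season_start = as.numeric(season_start)) |>
    # drop the index, team, college, country and season_end columns
    select(player_name, age:player_weight, draft_year:season_start)
nba_data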

player_name,age,player_height,player_weight,draft_year,draft_round,draft_number,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct,season_start
<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Randy Livingston,22,193.04,94.80073,1996,2,42,64,3.9,1.5,2.4,0.3,0.042,0.071,0.169,0.487,0.248,1996
Gaylon Nickerson,28,190.50,86.18248,1994,2,34,4,3.8,1.3,0.3,8.9,0.030,0.111,0.174,0.497,0.043,1996
George Lynch,26,203.20,103.41898,1993,1,12,41,8.3,6.4,1.9,-8.2,0.106,0.185,0.175,0.512,0.125,1996
George McCloud,30,203.20,102.05820,1989,1,7,64,10.2,2.8,1.7,-2.7,0.027,0.111,0.206,0.527,0.125,1996
George Zidek,23,213.36,119.74829,1995,1,22,52,2.8,1.7,0.3,-14.1,0.102,0.169,0.195,0.500,0.064,1996
Gerald Wilkins,33,198.12,102.05820,1985,2,47,80,10.6,2.2,2.2,-5.8,0.031,0.064,0.203,0.503,0.143,1996
Gheorghe Muresan,26,231.14,137.43838,1993,2,30,73,10.6,6.6,0.4,6.9,0.098,0.217,0.185,0.618,0.024,1996
Glen Rice,30,203.20,99.79024,1989,1,4,79,26.8,4.0,2.0,3.2,0.025,0.087,0.272,0.605,0.088,1996
Glenn Robinson,24,200.66,106.59412,1994,1,1,80,21.1,6.3,3.1,-2.9,0.051,0.144,0.278,0.528,0.146,1996
Grant Hill,24,203.20,102.05820,1994,1,3,80,21.4,9.0,7.3,6.9,0.049,0.232,0.283,0.556,0.356,1996


In [56]:
# separating data into training data (excludes rookies) and test data (includes rookies)
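
The split described in the comment above has not been written yet. A minimal sketch of one way to do it is shown below on a made-up tibble shaped like `nba_data` (`seasons_example` and the player names are hypothetical). It assumes that "the 2022 season" means `season_start == 2022` and that `player_name` identifies a player uniquely, which may not hold for the real data. It also builds a small summary table from the training rows only.

```r
library(tidyverse)

# Toy player-season rows (made up for illustration), shaped like nba_data
seasons_example <- tibble(
  player_name  = c("Vet A", "Vet A", "Vet B", "Vet B", "Rookie C", "Rookie D", "Past E"),
  season_start = c(2020, 2021, 2021, 2022, 2022, 2022, 2005),
  pts          = c(10.1, 12.3, 8.0, 9.4, 5.5, 7.2, 2.0)
)

# Flag players whose only recorded season is the one starting in 2022
rookie_flags <- seasons_example |>
  group_by(player_name) |>
  summarize(n_seasons = n(), first_season = min(season_start)) |>
  mutate(is_2022_rookie = n_seasons == 1 & first_season == 2022)

rookies_2022 <- filter(rookie_flags, is_2022_rookie)

# Test set: rows of 2022 rookies; training set: every other player-season row
test_example  <- semi_join(seasons_example, rookies_2022, by = "player_name")
train_example <- anti_join(seasons_example, rookies_2022, by = "player_name")

# Exploratory summary computed on the training rows only
train_summary <- train_example |>
  summarize(
    n_rows      = n(),
    n_players   = n_distinct(player_name),
    mean_pts    = mean(pts),
    missing_pts = sum(is.na(pts))
  )

train_summary
```

On the real data, the same steps would start from `nba_data` instead of the toy tibble.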

## Methods ##


- To conduct our analysis, we will use a K-nearest neighbors (KNN) regression model with pts (points per game), gp (games played), player_height (cm), player_weight (kg), usg_pct (usage percentage) and ts_pct (true shooting percentage) as predictors, since these factors are likely to have a significant influence on the total number of points scored over a career. Using these predictors, we will project the number of points a player will score as the average over that player's K nearest neighbors, where the value of K will be chosen through tuning and evaluation.

- We will visualize the data and results using scatter plots with the fitted regression line overlaid, which show how closely the KNN predictions follow the observed values.
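
The workflow described above could be sketched with tidymodels roughly as follows. This is an illustration under stated assumptions rather than the project's final code: `rookies_example` is synthetic data (not real NBA data), `career_pts` is a hypothetical target column, and the grid of K values and the 5-fold cross-validation are arbitrary choices. It also assumes the kknn package is installed.

```r
library(tidymodels)

set.seed(2023)

# Synthetic rookie-season data (NOT real NBA data): one row per player,
# with a made-up career_pts target loosely tied to pts and gp
n <- 60
rookies_example <- tibble(
  pts           = runif(n, 1, 20),
  gp            = round(runif(n, 10, 82)),
  player_height = rnorm(n, 200, 8),
  player_weight = rnorm(n, 100, 10),
  usg_pct       = runif(n, 0.10, 0.30),
  ts_pct        = runif(n, 0.40, 0.65),
  career_pts    = 400 * pts + 20 * gp + rnorm(n, 0, 500)
)

# KNN regression with the number of neighbors left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# Standardize predictors so that no single unit dominates the distances
knn_recipe <- recipe(career_pts ~ ., data = rookies_example) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Cross-validate over a grid of K values and keep the RMSE for each
folds <- vfold_cv(rookies_example, v = 5)
knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = folds, grid = tibble(neighbors = 1:15)) |>
  collect_metrics() |>
  filter(.metric == "rmse")

best_k <- knn_results |>
  slice_min(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit with the chosen K and attach predictions to the data
best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("regression")
knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(best_spec) |>
  fit(data = rookies_example)
predictions <- predict(knn_fit, rookies_example) |>
  bind_cols(rookies_example)

# One possible visualization: predicted against observed career points
results_plot <- ggplot(predictions, aes(x = career_pts, y = .pred)) +
  geom_point(alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "Observed career points (synthetic)", y = "Predicted career points")
```

Here the predictions are made on the same rows used for fitting, purely to keep the sketch short. In the project, they would be made on the held-out rookies instead.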

## Expected outcomes and significance ##


- We expect to estimate the total number of points a player will have scored by the end of their career, based on the past performance of other players, including their points, rebounds, assists and weight.

- NBA teams could use this predicted total to judge whether a player has potential. It could also serve as a useful index when teams adjust their rosters.

- This prediction model may not fully reflect a player's true skill, because it leaves out other factors such as playing style and the ability to cooperate with teammates. Differences in playing style often lead to different game results, and cooperation between players is a key factor in whether a team wins. This leads to the question of how such factors could be included in future models.
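
The career total mentioned at the start of this section is not a column in the dataset. One way to approximate it, sketched below on made-up rows (`career_example` is hypothetical), is to multiply points per game by games played in each season and sum over a player's seasons. Because `pts` is a rounded per-game average, the result is an estimate rather than an exact total.

```r
library(tidyverse)

# Toy player-season rows (made up for illustration), using nba_data column names
career_example <- tibble(
  player_name  = c("Player A", "Player A", "Player A", "Player B", "Player B"),
  season_start = c(2000, 2001, 2002, 2010, 2011),
  gp           = c(50, 82, 70, 20, 40),
  pts          = c(5.0, 10.0, 12.0, 2.0, 3.5)
)

# Approximate career points: sum over seasons of (points per game * games played)
career_totals <- career_example |>
  group_by(player_name) |>
  summarize(
    first_season   = min(season_start),
    seasons_played = n(),
    career_pts     = sum(pts * gp)
  )

# Pair each player's earliest recorded season with their career total
model_data_example <- career_example |>
  group_by(player_name) |>
  slice_min(season_start, n = 1) |>
  ungroup() |>
  left_join(career_totals, by = "player_name")

model_data_example
```

Note that the earliest season in the data is not necessarily a rookie season: players who debuted before 1996-97 would need to be handled separately, for example by comparing `draft_year` with `season_start`.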