In [None]:
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.model_selection import train_test_split
# (x,y,train_size)
import pandas as pd


# No column size limit
pd.set_option('display.max_columns', None)


# My Classification Problem and Dataset

## [NBA 2024 Shots Dataset](https://www.kaggle.com/datasets/mexwell/nba-shots?select=NBA_2024_Shots.csv)  
- It is a binary classification dataset with 10 or more features.
- Each row represents a shot taken by an NBA player in a game during the 2024 NBA season.
### Raw Features
- **SEASON_1 & SEASON_2**  
  Season indicator variables

- **TEAM_ID**  
  NBA's unique ID variable of that specific team in their API.

- **PLAYER_ID**  
  NBA's unique ID variable of that specific player in their API.

- **GAME_DATE**  
  Date of the game (M-D-Y // Month-Day-Year).

- **GAME_ID**  
  NBA's unique ID variable of that specific game in their API.

- **EVENT_TYPE**  
  Character variable denoting a shot outcome (`Made Shot` // `Missed Shot`).

- **SHOT_MADE**  
  True/False variable denoting a shot outcome (`True` // `False`).

- **ACTION_TYPE**  
  Description of shot type (layup, dunk, jump shot, etc.).

- **SHOT_TYPE**  
  Type of shot (`2PT` or `3PT`).

- **BASIC_ZONE**  
  Name of the court zone the shot took place in.  
  Values:  
  - Restricted Area  
  - In the Paint (non-RA)  
  - Midrange  
  - Left Corner 3  
  - Right Corner 3  
  - Above the Break  
  - Backcourt

- **ZONE_NAME**  
  Name of the side of court the shot took place in.  
  Values:  
  - left  
  - left side center  
  - center  
  - right side center  
  - right

- **ZONE_ABB**  
  Abbreviation of the side of court.  
  Values:  
  - (L)  
  - (LC)  
  - (C)  
  - (RC)  
  - (R)

- **ZONE_RANGE**  
  Distance range of shot by zones.  
  Values:  
  - Less than 8 ft.  
  - 8-16 ft.  
  - 16-24 ft.  
  - 24+ ft.

- **LOC_X**  
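  X coordinate of the shot in the x, y plane of the court. The dataset description lists a range of 0 to 50, but the sample rows shown in this notebook contain negative values (e.g. -3.3), so the range should be verified against the data.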

- **LOC_Y**  
  Y coordinate of the shot in the x, y plane of the court (0 to 50).

- **SHOT_DISTANCE**  
  Distance of the shot with respect to the center of the hoop, in feet.

- **QUARTER**  
  Quarter of the game.

- **MINS_LEFT**  
  Minutes remaining in the quarter.

- **SECS_LEFT**  
  Seconds remaining in the minute of the quarter.
  




# Basic Preprocessing

## Dropping Unnecessary Columns
Here is the reasoning for dropping columns:
  - Keep only features that provide signal (help discriminate made vs. missed shots)
  - Reduce redundancy
  - Avoid irrelevant info
  - Avoid data leakage
    - don't train the model on the target, or on any column that restates it
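
A quick way to spot a leaking column is to check whether each of its values maps to exactly one target value. The sketch below uses made-up toy rows (not the real CSV) with the dataset's `EVENT_TYPE` and `SHOT_MADE` column names. It is only a heuristic for low-cardinality columns, since a unique ID column would pass the same check without being a true restatement of the target.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the dataset, values are made up
toy = pd.DataFrame({
    "EVENT_TYPE": ["Made Shot", "Missed Shot", "Made Shot", "Missed Shot"],
    "SHOT_MADE": [True, False, True, False],
    "SHOT_DISTANCE": [3, 25, 12, 1],
})

# Count how many distinct target values appear for each EVENT_TYPE value
targets_per_event = toy.groupby("EVENT_TYPE")["SHOT_MADE"].nunique()

# If every EVENT_TYPE value maps to a single SHOT_MADE value, the column
# restates the target and should not be used as a feature
leaks_target = bool((targets_per_event == 1).all())
print(leaks_target)
```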


In [None]:

from sklearn import tree
import pandas as pd

shotData = pd.read_csv('/content/NBA_2024_Shots.csv')

# These columns are dropped for the reasons stated above
shotData_x = shotData.drop(
    ['SEASON_1', 'SEASON_2', 'TEAM_ID', 'SHOT_MADE', 'PLAYER_NAME', 'POSITION_GROUP',
     'GAME_DATE', 'GAME_ID', 'EVENT_TYPE', 'ZONE_NAME', 'MINS_LEFT', 'SECS_LEFT'],
    axis=1,
)
makeData_y = shotData['SHOT_MADE']
shotData_x.head() # show first 5 rows to make sure correctly loaded

Unnamed: 0,TEAM_NAME,PLAYER_ID,POSITION,HOME_TEAM,AWAY_TEAM,ACTION_TYPE,SHOT_TYPE,BASIC_ZONE,ZONE_ABB,ZONE_RANGE,LOC_X,LOC_Y,SHOT_DISTANCE,QUARTER
0,Washington Wizards,1629673,SG,MIA,WAS,Driving Floating Jump Shot,2PT Field Goal,In The Paint (Non-RA),C,8-16 ft.,-0.4,17.45,12,1
1,Washington Wizards,1630166,SF,MIA,WAS,Jump Shot,3PT Field Goal,Above the Break 3,C,24+ ft.,1.5,30.55,25,1
2,Washington Wizards,1626145,PG,MIA,WAS,Driving Layup Shot,2PT Field Goal,Restricted Area,C,Less Than 8 ft.,-3.3,6.55,3,1
3,Washington Wizards,1629673,SG,MIA,WAS,Running Finger Roll Layup Shot,2PT Field Goal,Restricted Area,C,Less Than 8 ft.,-1.0,5.85,1,1
4,Washington Wizards,1626145,PG,MIA,WAS,Cutting Layup Shot,2PT Field Goal,Restricted Area,C,Less Than 8 ft.,-0.0,6.25,1,1


In [None]:


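# Initialize our decision tree object
shot_tree = tree.DecisionTreeClassifier()

# Load the iris data as a stand-in until the shot features
# (which still contain text columns) are encoded as numbers
iris = load_iris()

# Train our decision tree (tree induction; no pruning by default, ccp_alpha=0.0)
shot_tree = shot_tree.fit(iris.data, iris.target)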

# Background

## Things learned/reviewed.
- Hyperparameters differ from model to model
- The Decision Tree Classifier (CART) has the following hyperparameters in scikit-learn:
1. **max_depth:** The maximum depth of the tree, after which nodes are no longer split. A lower value makes the model faster to train but can underfit; a higher value can improve accuracy but risks overfitting and slower training.

2. **min_samples_split:** The minimum number of samples required to split a node. A small value lets the tree keep splitting on tiny groups of samples, which encourages overfitting; a larger value stops splitting earlier and gives a simpler tree.

3. **max_features:** The number of features to consider when looking for the best split. Higher means potentially better splits, with the tradeoff of training taking longer.

4. **min_impurity_decrease:** Threshold for early stopping in tree growth. A node is split only if the split decreases impurity by at least this value. This can be used to trade off combating overfitting (high value, small tree) against fitting the training data closely (low value, big tree). It replaces the older `min_impurity_split`, which was deprecated and then removed in scikit-learn 1.0.
- scikit-learn and sklearn refer to the same library (`sklearn` is the import name)
- One of the main points of the pandas library is to view and manipulate tabular data
- IPython, Jupyter (and therefore Colab) only display an expression automatically if it is the final line of a cell (like `df.head()`)
- When working with a Decision Tree Classifier, cross-validation is commonly used to evaluate it instead of relying on a single train/test split
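
The hyperparameters listed above can be tried out with a small sketch like the one below. It uses the built-in iris data as a stand-in, and the specific values (`max_depth=3`, `min_samples_split=10`, and so on) are arbitrary choices for illustration, not tuned settings. `min_impurity_decrease` is used because `min_impurity_split` is no longer available in scikit-learn 1.0 and later.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; the shot data would need numeric encoding first
X, y = load_iris(return_X_y=True)

# Default settings: the tree keeps splitting until leaves are pure
unconstrained = DecisionTreeClassifier(random_state=0).fit(X, y)

# Illustrative (untuned) values that restrict how far the tree can grow
constrained = DecisionTreeClassifier(
    max_depth=3,                 # stop splitting below depth 3
    min_samples_split=10,        # need at least 10 samples to split a node
    max_features=2,              # consider 2 features per split
    min_impurity_decrease=0.01,  # split must reduce impurity by >= 0.01
    random_state=0,
).fit(X, y)

# Compare the size of the two trees
print(unconstrained.get_depth(), unconstrained.get_n_leaves())
print(constrained.get_depth(), constrained.get_n_leaves())
```

Comparing depth and leaf counts is a simple way to see how each setting shrinks the tree before looking at accuracy.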


## Python/Sklearn notes
- `shotData = shotData.drop('LOC_X', axis=1)`
  - The first argument is the label of the row or column to drop; `axis` specifies which of the two it is
    - 0: row
    - 1: column
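
As a small illustration of the `axis` argument, the sketch below drops a column and a row from a made-up two-row frame (the values are toy data, not read from the CSV). `drop` returns a new DataFrame and leaves the original unchanged unless the result is assigned back.

```python
import pandas as pd

# Toy frame with a few of the dataset's column names
df = pd.DataFrame(
    {"LOC_X": [-0.4, 1.5], "LOC_Y": [17.45, 30.55], "QUARTER": [1, 1]}
)

without_col = df.drop("LOC_X", axis=1)       # axis=1: drop a column by label
without_row = df.drop(0, axis=0)             # axis=0: drop a row by index label
same_as_axis1 = df.drop(columns=["LOC_X"])   # keyword form, equivalent to axis=1

print(list(without_col.columns))
print(list(without_row.index))
```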


## Terminology Helper
- **Gini Index** - An attribute selection method that measures how mixed the classes are in the groups produced by a split; lower impurity means the split separates the classes better.
- **ASM** - Broader term that stands for attribute selection method. It helps pick which attribute to split on.
  - Gini Index is an ASM
- **[Kaggle](https://www.kaggle.com/)** - Very useful dataset site for machine learning projects.
- **Dropping Features** - It is not 'bad practice' to drop features, and it is sometimes necessary.
- **CART** - Classification and Regression Trees, the algorithm behind scikit-learn's decision tree classifiers.
- **IPython** - An interactive interface on top of Python.
- **Tree Induction** - The process of actually building or learning the decision tree model from the data.
- **Cross Validation** - The dataset is split into k folds (equal parts)
  - there are usually 5 or 10 folds
  - for each fold, you train on the other k-1 folds and validate on the remaining fold
  - you then average the k scores, and that is your cross-validation estimate
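
Two of the terms above, Gini index and cross-validation, can be made concrete with a short sketch. The `gini_impurity` helper is a hypothetical name written for this note (it is not a scikit-learn function), and iris is used as stand-in data with 5 folds chosen for illustration.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


pure = gini_impurity([1, 1, 1, 1])    # one class only, nothing mixed
mixed = gini_impurity([0, 0, 1, 1])   # two classes, evenly mixed

# 5-fold cross-validation: train on 4 folds, validate on the 5th, repeat.
# For classifiers, an integer cv uses stratified folds.
X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(model, X, y, cv=5)

# The cross-validation estimate is the mean of the per-fold scores
cv_estimate = scores.mean()
print(pure, mixed, len(scores))
```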







# To Dos and Questions

## Potential To Dos
- See if I end up needing to convert target to int
  - shouldn't need to with scikit-learn

- Convert non-numeric features to numbers
- Missing value audit
- VIF (variance inflation factor) check
- Split into x and y
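
A minimal sketch of how some of these to-dos might look, assuming one-hot encoding via `pd.get_dummies` is an acceptable way to convert the text columns (other encodings are possible). The rows are made-up toy data that reuse the dataset's column names, and the VIF check is not covered here.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy rows; column names follow the dataset, values are made up
toy = pd.DataFrame({
    "SHOT_TYPE": ["2PT Field Goal", "3PT Field Goal", "2PT Field Goal", "3PT Field Goal"],
    "ZONE_ABB": ["C", "L", "C", "R"],
    "SHOT_DISTANCE": [3, 25, 12, 24],
    "SHOT_MADE": [True, False, True, False],
})

# Missing value audit: count of NaN values per column
missing_per_column = toy.isna().sum()

# Split into features (x) and target (y)
X_raw = toy.drop("SHOT_MADE", axis=1)
y = toy["SHOT_MADE"]

# Convert non-numeric features to numbers with one-hot encoding
X = pd.get_dummies(X_raw, columns=["SHOT_TYPE", "ZONE_ABB"], dtype=int)

# A boolean target can be passed to the classifier without converting to int
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(missing_per_column.to_dict())
print(list(X.columns))
```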



## Questions

