# **NBA Ratings Classifier EDA**

### Dataset: nba_ratings.csv
> We will immediately remove several attributes from the dataset that are either too specific (e.g., player name, team name, season)
or redundant (e.g., offensive rebounds, defensive rebounds, 3PT attempts, FG attempts).


> The dataset we will use consists of the attributes below. All subsets of the data are derived from this dataset.
* GP (Games Played)
* AGE (Age)
* W (wins)
* L (losses)
* MIN (minutes played)
* PTS (points per game)
* FG% (field goal percentage)
* FGM (field goals made)
* 3P% (3-point percentage)
* FT% (free throw percentage)
* REB (rebounds)
* +/- (plus/minus metric)

> These attributes will be used to classify the `rankings` column (the NBA 2K rating) of each instance (player).
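The two preparation steps described above (dropping rows with a zero shooting percentage, then keeping only the listed attributes) can be sketched in plain Python. This is a minimal illustration, not the `MyPyTable` implementation: `select_columns` and `drop_rows_with_zero` are hypothetical helpers, and the rows are made-up toy values rather than data from `nba_ratings.csv`.

```python
# Illustrative stand-ins for the table operations; not the MyPyTable code.

def select_columns(header, rows, keep):
    """Return (new_header, new_rows) containing only the columns in `keep`."""
    # list.index raises ValueError if a requested name is not in the header
    indexes = [header.index(name) for name in keep]
    return list(keep), [[row[i] for i in indexes] for row in rows]


def drop_rows_with_zero(header, rows, cols):
    """Drop every row that has a 0 in any of the named columns."""
    indexes = [header.index(name) for name in cols]
    return [row for row in rows if all(row[i] != 0 for i in indexes)]


# Toy rows (made up for the example)
header = ["PLAYER", "AGE", "PTS", "FG%", "3P%"]
rows = [
    ["Player A", 25, 14.4, 43.7, 30.8],
    ["Player B", 31, 2.1, 50.0, 0.0],   # 0% from three, so this row is dropped
    ["Player C", 22, 9.5, 41.4, 39.4],
]

cleaned = drop_rows_with_zero(header, rows, ["3P%", "FG%"])
sub_header, sub_rows = select_columns(header, cleaned, ["AGE", "PTS", "FG%", "3P%"])
print(sub_header)
print(sub_rows)
```

Cleaning before selecting columns matters here only because the player name is removed by the selection; the order could be swapped if the zero check uses columns that survive it.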





In [2]:
import importlib

import mysklearn.mypytable as mypytable
import mysklearn.myutils as myutils
import mysklearn.myevaluation as myevaluation
import mysklearn.myclassifiers as myclassifiers


importlib.reload(myutils)
importlib.reload(mypytable)
importlib.reload(myevaluation)

<module 'mysklearn.myevaluation' from '/home/CPSC322-Final-Project/mysklearn/myevaluation.py'>

In [5]:
attributes = ["AGE","GP", "W", "L", "MIN", "PTS", "FG%", "FGM", "3P%", "FT%", "REB", "+/-"]

## load data into a MyPyTable
nba_data = mypytable.MyPyTable()
nba_data.load_from_file("nba_ratings.csv")
before_len = len(nba_data.data)


# clean table by removing players with 0% 3 point percentage or FG percentage

nba_data.drop_rows_with_zero_in_col(["3P%", "FG%"])
after_len = len(nba_data.data)
print(before_len - after_len, "rows were removed during cleaning")

# get ratings (y) 
ratings = nba_data.pop_column("rankings")
ratings = myutils.min_max_normalize(ratings)


# get the player names col to use as key
player_names = nba_data.pop_column("PLAYER")

# extract the attributes we are interested in from the original Kaggle dataset
nba_data = nba_data.get_subtable(attributes)
print(nba_data)




simple_dataset = nba_data.get_subtable(["AGE", "W", "L", "MIN", "PTS"])



358 rows were removed during cleaning
GP W L MIN PTS FG% FGM 3P% FT% REB +/- 
62.0 30.0 32.0 32.5 14.4 43.7 5.4 30.8 67.4 7.7 -1.1 
66.0 42.0 24.0 24.5 9.5 41.4 3.5 39.4 85.1 2.4 1.7 
55.0 37.0 18.0 15.8 6.3 46.8 2.2 37.5 77.3 1.8 -1.5 
11.0 3.0 8.0 10.2 2.9 42.9 1.1 40.0 50.0 0.9 4.5 
33.0 9.0 24.0 11.2 3.0 38.0 1.1 31.1 66.7 1.4 -1.7 
67.0 39.0 28.0 30.2 11.9 45.0 4.8 35.0 76.3 6.8 1.9 
18.0 7.0 11.0 21.1 4.3 29.1 1.4 25.0 65.5 4.8 -1.9 
66.0 21.0 45.0 26.6 15.0 41.8 4.9 38.5 88.7 4.3 -3.4 
14.0 1.0 13.0 9.9 4.2 50.0 1.4 23.1 84.2 1.9 -0.6 
64.0 48.0 16.0 18.4 5.5 41.2 1.9 33.3 73.4 1.9 3.8 
55.0 18.0 37.0 17.6 8.0 55.5 3.2 27.1 64.8 5.8 -2.0 
40.0 11.0 29.0 14.8 4.6 42.7 1.8 21.5 71.0 2.8 -2.3 
17.0 10.0 7.0 6.9 2.0 41.4 0.7 37.5 70.0 2.8 1.5 
37.0 5.0 32.0 17.6 4.6 35.6 1.7 30.3 71.4 2.1 -6.0 
24.0 2.0 22.0 12.1 6.5 48.1 2.1 35.8 79.1 1.2 -3.0 
18.0 11.0 7.0 8.8 3.2 42.6 1.3 31.6 54.5 0.9 -0.3 
57.0 21.0 36.0 33.0 17.7 53.3 7.3 14.3 57.5 15.2 -3.4 
21.0 10.0 11.0 19.9 4.6 43.2 1.8 

ValueError: 'AGE' is not in list
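The error message has the form that Python's `list.index` produces when the requested item is absent, and the printed table header above does not include an `AGE` column. Assuming `get_subtable` looks column names up in the table's header list (an assumption, since its source is not shown here), a small pre-check like the hypothetical `missing_columns` below would report which requested names are unavailable before subsetting.

```python
def missing_columns(column_names, wanted):
    """Return the requested names that are not present in the table header."""
    return [name for name in wanted if name not in column_names]


# Header as it appears in the printed table above (no AGE column)
printed_header = ["GP", "W", "L", "MIN", "PTS", "FG%", "FGM", "3P%", "FT%", "REB", "+/-"]
simple_attributes = ["AGE", "W", "L", "MIN", "PTS"]

try:
    printed_header.index("AGE")
except ValueError as err:
    # Same kind of failure as in the traceback: the name is not in the list
    print("lookup failed:", err)

print("missing:", missing_columns(printed_header, simple_attributes))
```

If the check reports `AGE` as missing, the table being subset was built without that column, so the cell that creates it is the place to look.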

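## Naive Bayes Classification
Does the simple dataset classify better than the dataset with all of the attributes?

The cells in this section currently score both datasets with `MyKNeighborsClassifier(10)`; the remaining attributes still need to be discretized before other classifiers can be used.

Simple dataset: AGE, W, L, MIN, PTS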
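The scoring cells in this section pass `MyKNeighborsClassifier(10)` to the evaluation helper. As a rough sketch of what a k-nearest-neighbors classifier does with numeric attributes such as MIN and PTS, the example below scales each column to [0, 1] and takes a majority vote among the closest training rows. The functions `min_max_scale` and `knn_predict` are hypothetical, the "high"/"low" labels are invented, and nothing here is claimed to match the `mysklearn` implementation.

```python
import math
from collections import Counter


def min_max_scale(rows):
    """Scale each column of a numeric table to the [0, 1] range."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [[(v - lo) / span for v, lo, span in zip(row, lows, spans)] for row in rows]


def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest rows."""
    distances = sorted((math.dist(row, x), label) for row, label in zip(X_train, y_train))
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# [MIN, PTS] pairs; the last row is the unlabeled query. Labels are made up.
table = [[32.5, 14.4], [30.2, 11.9], [33.0, 17.7],
         [10.2, 2.9], [11.2, 3.0], [9.9, 4.2],
         [28.0, 13.0]]
train_labels = ["high", "high", "high", "low", "low", "low"]

# Simplification: the query is scaled together with the training rows
scaled = min_max_scale(table)
train, query = scaled[:6], scaled[6]
print(knn_predict(train, train_labels, query, k=3))
```

Scaling matters for distance-based classifiers because an attribute with a large range (for example GP) would otherwise dominate one with a small range (for example FGM).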


In [None]:
## get train / test folds (full and simple datasets)
full_train_folds, full_test_folds = myevaluation.stratified_kfold_cross_validation(nba_data.data, ratings, n_splits=10)
simple_train_folds, simple_test_folds = myevaluation.stratified_kfold_cross_validation(simple_dataset.data, ratings, n_splits=10)

## get scores
full_acc, full_err, full_pres, full_recall, full_f1, full_matrix = \
  myutils.get_scores_from_folds(nba_data.data, ratings, full_train_folds, full_test_folds, myclassifiers.MyKNeighborsClassifier(10))

simple_acc, simple_err, simple_pres, simple_recall, simple_f1, simple_matrix = \
  myutils.get_scores_from_folds(simple_dataset.data, ratings, simple_train_folds, simple_test_folds, myclassifiers.MyKNeighborsClassifier(10))
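Stratified k-fold cross-validation splits the rows so that each fold keeps roughly the same class proportions as the whole dataset. The sketch below shows one simple way to build such folds by dealing each class's row indexes round-robin across the folds. `stratified_kfold_indexes` is a hypothetical function written for illustration (it does no shuffling) and is not the `myevaluation.stratified_kfold_cross_validation` source.

```python
from collections import defaultdict


def stratified_kfold_indexes(y, n_splits=5):
    """Split row indexes into n_splits folds that preserve label proportions.

    Returns (train_folds, test_folds), each a list of index lists.
    """
    # Group row indexes by class label
    by_label = defaultdict(list)
    for index, label in enumerate(y):
        by_label[label].append(index)

    # Deal the indexes of each class round-robin across the folds
    folds = [[] for _ in range(n_splits)]
    position = 0
    for label in sorted(by_label):
        for index in by_label[label]:
            folds[position % n_splits].append(index)
            position += 1

    # Each fold serves once as the test set; the rest form the training set
    train_folds, test_folds = [], []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        train_folds.append(train)
        test_folds.append(test)
    return train_folds, test_folds


# Toy labels (made up): 8 "low" and 4 "high"
labels = ["low"] * 8 + ["high"] * 4
train_folds, test_folds = stratified_kfold_indexes(labels, n_splits=4)
for test in test_folds:
    print(sorted(test), [labels[i] for i in sorted(test)])
```

With the toy labels, every test fold receives one "high" row and two "low" rows, mirroring the 1:2 ratio of the full label list.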


In [None]:
scores_table = mypytable.MyPyTable()
scores_table.column_names = ["Dataset", "Accuracy", "Error", "Precision", "Recall", "F1 score"]
scores_table.data = []
scores_table.data.append(["Full dataset", full_acc, full_err, full_pres, full_recall, full_f1])
scores_table.data.append(["Simple dataset", simple_acc, simple_err, simple_pres, simple_recall, simple_f1])

scores_table.print_data()

## TODO
# need to discretize the rest of the attributes to use other classifiers

Dataset           Accuracy    Error    Precision    Recall    F1 score
--------------  ----------  -------  -----------  --------  ----------
Full dataset          0.82     0.18            0         0           0
Simple dataset        0.81     0.19            0         0           0
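High accuracy alongside zero precision, recall, and F1 is arithmetically possible when the label treated as positive is never predicted. Whether that is what happened in the run above is not confirmed by the notebook. The sketch below uses a hypothetical `binary_scores` helper and invented labels to show the effect; it is not the `myutils.get_scores_from_folds` logic.

```python
def binary_scores(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 with respect to one positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Report 0.0 when a ratio is undefined (no predicted or no actual positives)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1


# Toy case (made up): 8 of 10 players are "low" and the classifier always says "low"
y_true = ["low"] * 8 + ["high"] * 2
y_pred = ["low"] * 10
print(binary_scores(y_true, y_pred, positive="high"))
```

Checking the confusion matrices returned alongside the scores (`full_matrix`, `simple_matrix`) would show whether any label is never predicted.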
