# Authors - Justin Clifton

---
## Table of Contents (Work In Progress)

1. [Introduction](#intro)
2. [Basic Exploration & Cleaning](#cleaning)
3. [Feature Engineering](#engineering)
4. [Data Visualization](#visualization)
---


## Introduction <a class="anchor" id="intro"></a>



This notebook extends the work done in this [notebook](https://github.com/juliajanu/Math475_Project_4-/blob/master/nba_stats.ipynb) and uses the dataset created there. We analyze the data and attempt to build a useful model for predicting the NBA MVP.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

#models
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.neighbors import KNeighborsClassifier
#from sklearn.ensemble import StackingClassifier

#feature selection
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV

#scalers
from sklearn.preprocessing import StandardScaler

#score function
from sklearn.metrics import f1_score

## Basic Exploration & Cleaning <a class="anchor" id="cleaning"></a>

In [2]:
stats_df = pd.read_csv('NBA_Stats_MVP.csv')

In [3]:
stats_df.head()

,Year,Pos,Age,Tm,G,GS,MP,PER,TS%,3PAr,...,TRB,AST,STL,BLK,TOV,PF,PTS,is_allstar,Name,was_mvp
0,1974.0,C,27.0,HOU,79.0,,2459.0,15.9,0.516,,...,923.0,166.0,80.0,104.0,,227.0,865.0,0,Zaid Abdul-Aziz,0
1,1974.0,C,26.0,MIL,81.0,,3548.0,24.4,0.564,,...,1178.0,386.0,112.0,283.0,,238.0,2191.0,1,Kareem Abdul-Jabbar,0
2,1974.0,SF,26.0,DET,74.0,,2298.0,10.9,0.457,,...,448.0,141.0,110.0,12.0,,242.0,759.0,0,Don Adams,0
3,1974.0,PG,27.0,CHI,55.0,,618.0,10.0,0.447,,...,69.0,56.0,36.0,1.0,,63.0,182.0,0,Rick Adelman,0
4,1974.0,PG,26.0,MIL,72.0,,2388.0,18.8,0.536,,...,291.0,374.0,137.0,22.0,,215.0,1268.0,0,Lucius Allen,0


In [4]:
stats_df.describe()

,Year,Age,G,GS,MP,PER,TS%,3PAr,FTr,ORB%,...,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,is_allstar,was_mvp
count,20797.0,20797.0,20797.0,18233.0,20797.0,20792.0,20715.0,18839.0,20703.0,20792.0,...,20797.0,20797.0,20797.0,20797.0,20797.0,19645.0,20797.0,20797.0,20797.0,20797.0
mean,1998.15584,26.773477,50.263836,23.593375,1180.121796,12.453867,0.502274,0.158604,0.318799,6.181565,...,147.199404,209.388614,112.975237,39.897052,24.47026,73.939832,111.22686,496.276242,0.039813,0.002116
std,12.232484,3.94547,26.681105,28.632387,929.215744,6.100548,0.093337,0.187495,0.226552,4.872685,...,145.921912,207.77211,137.240043,38.713053,36.935084,67.713803,82.445297,484.09024,0.195525,0.045949
min,1974.0,18.0,1.0,0.0,0.0,-90.6,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1988.0,24.0,27.0,0.0,325.0,9.8,0.471,0.005,0.199,2.6,...,33.0,48.0,18.0,9.0,3.0,18.0,37.0,100.0,0.0,0.0
50%,1999.0,26.0,56.0,8.0,1013.0,12.7,0.514,0.064,0.286,5.4,...,106.0,149.0,64.0,29.0,11.0,55.0,102.0,348.0,0.0,0.0
75%,2009.0,29.0,76.0,45.0,1926.0,15.6,0.549,0.288,0.392,9.0,...,212.0,302.0,157.0,60.0,29.0,112.0,174.0,764.0,0.0,0.0
max,2017.0,44.0,87.0,83.0,3638.0,129.1,1.136,1.0,6.0,100.0,...,1111.0,1530.0,1164.0,301.0,456.0,464.0,386.0,3041.0,1.0,1.0


In [5]:
stats_df.isnull().sum() > 0

Year          False
Pos           False
Age           False
Tm            False
G             False
GS             True
MP            False
PER            True
TS%            True
3PAr           True
FTr            True
ORB%           True
DRB%           True
TRB%           True
AST%           True
STL%           True
BLK%           True
TOV%           True
USG%           True
OWS           False
DWS           False
WS            False
WS/48          True
OBPM          False
DBPM          False
BPM           False
VORP          False
FG            False
FGA           False
FG%            True
3P             True
3PA            True
3P%            True
2P            False
2PA           False
2P%            True
eFG%           True
FT            False
FTA           False
FT%            True
ORB           False
DRB           False
TRB           False
AST           False
STL           False
BLK           False
TOV            True
PF            False
PTS           False
is_allstar    False
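
The boolean output above only shows whether a column has any missing values, not how many. A minimal sketch of listing just the affected columns with their missing counts follows. It uses a small made-up frame: `toy_df` and its values are illustrative and are not taken from `NBA_Stats_MVP.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for stats_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "G": [79.0, 81.0, 74.0],
    "GS": [np.nan, np.nan, 58.0],
    "3PAr": [np.nan, 0.002, 0.448],
})

# Count missing values per column, then keep only the columns that have any.
missing_counts = toy_df.isnull().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```

Applied to `stats_df`, the same two lines would give a shorter summary than scanning the full True/False listing.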


In [6]:
stats_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20797 entries, 0 to 20796
Data columns (total 52 columns):
Year          20797 non-null float64
Pos           20797 non-null object
Age           20797 non-null float64
Tm            20797 non-null object
G             20797 non-null float64
GS            18233 non-null float64
MP            20797 non-null float64
PER           20792 non-null float64
TS%           20715 non-null float64
3PAr          18839 non-null float64
FTr           20703 non-null float64
ORB%          20792 non-null float64
DRB%          20792 non-null float64
TRB%          20792 non-null float64
AST%          20792 non-null float64
STL%          20792 non-null float64
BLK%          20792 non-null float64
TOV%          19582 non-null float64
USG%          19640 non-null float64
OWS           20797 non-null float64
DWS           20797 non-null float64
WS            20797 non-null float64
WS/48         20792 non-null float64
OBPM          20797 non-null float64
DBPM 

In [7]:
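# Replace every missing value with 0 (this also sets missing percentages, e.g. 3P% with no attempts, to 0)
stats_df.fillna(0, inplace=True)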

In [8]:
stats_df.isnull().sum() > 0

Year          False
Pos           False
Age           False
Tm            False
G             False
GS            False
MP            False
PER           False
TS%           False
3PAr          False
FTr           False
ORB%          False
DRB%          False
TRB%          False
AST%          False
STL%          False
BLK%          False
TOV%          False
USG%          False
OWS           False
DWS           False
WS            False
WS/48         False
OBPM          False
DBPM          False
BPM           False
VORP          False
FG            False
FGA           False
FG%           False
3P            False
3PA           False
3P%           False
2P            False
2PA           False
2P%           False
eFG%          False
FT            False
FTA           False
FT%           False
ORB           False
DRB           False
TRB           False
AST           False
STL           False
BLK           False
TOV           False
PF            False
PTS           False
is_allstar    False


In [9]:
# One-hot encode the position; drop_first drops the first category to avoid a redundant column
pos_dummies = pd.get_dummies(stats_df['Pos'], drop_first=True, prefix='Position')
stats_df = pd.concat([stats_df, pos_dummies], axis=1)
# TODO: consider collapsing swing positions (e.g. SG-SF) so every player falls under one of PG, SG, SF, PF, C, or dropping position entirely.
#stats_df = stats_df.drop(columns=['Pos', 'Name', 'Tm'])
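
The comment above raises the option of collapsing swing positions. One possible approach, sketched here on made-up rows, is to keep only the first label of a hybrid position such as `SG-SF` before one-hot encoding. Treating the first label as the primary position is an assumption, and `primary_pos` is a hypothetical column name.

```python
import pandas as pd

# Toy stand-in for the Pos column; the rows are made up for illustration.
toy_df = pd.DataFrame({"Pos": ["C", "SG-SF", "PG", "PF-C", "SF"]})

# Assumption: the first label of a hybrid position is the primary one.
toy_df["primary_pos"] = toy_df["Pos"].str.split("-").str[0]

# One-hot encode the base positions; drop_first removes the first category.
pos_dummies = pd.get_dummies(toy_df["primary_pos"], drop_first=True, prefix="Position")
print(pos_dummies.columns.tolist())
```

This would reduce the many `Position_*` columns to at most four, at the cost of discarding the secondary position.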

In [10]:
stats_df

,Year,Pos,Age,Tm,G,GS,MP,PER,TS%,3PAr,...,Position_PG,Position_PG-SF,Position_PG-SG,Position_SF,Position_SF-PF,Position_SF-SG,Position_SG,Position_SG-PF,Position_SG-PG,Position_SG-SF
0,1974.0,C,27.0,HOU,79.0,0.0,2459.0,15.9,0.516,0.000,...,0,0,0,0,0,0,0,0,0,0
1,1974.0,C,26.0,MIL,81.0,0.0,3548.0,24.4,0.564,0.000,...,0,0,0,0,0,0,0,0,0,0
2,1974.0,SF,26.0,DET,74.0,0.0,2298.0,10.9,0.457,0.000,...,0,0,0,1,0,0,0,0,0,0
3,1974.0,PG,27.0,CHI,55.0,0.0,618.0,10.0,0.447,0.000,...,1,0,0,0,0,0,0,0,0,0
4,1974.0,PG,26.0,MIL,72.0,0.0,2388.0,18.8,0.536,0.000,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20792,2017.0,PF,24.0,CHO,62.0,58.0,1725.0,16.7,0.604,0.002,...,0,0,0,0,0,0,0,0,0,0
20793,2017.0,C,27.0,BOS,51.0,5.0,525.0,13.0,0.508,0.006,...,0,0,0,0,0,0,0,0,0,0
20794,2017.0,C,20.0,ORL,19.0,0.0,108.0,7.3,0.346,0.000,...,0,0,0,0,0,0,0,0,0,0
20795,2017.0,SF,22.0,CHI,44.0,18.0,843.0,6.9,0.503,0.448,...,0,0,0,1,0,0,0,0,0,0


## Feature Engineering <a class="anchor" id="engineering"></a>

The "50–40–90 club" is an informal distinction for players who shoot at least 50% from the field, 40% from three-point range, and 90% from the free-throw line over the course of a regular season. At the time of writing, only eight players have recorded a recognized 50–40–90 season. The filter below uses a 50-game minimum rather than the league's minimum-makes requirements, so it can return seasons that are not officially recognized. We will add this as a feature in our dataset.

In [22]:
# Seasons meeting all three percentage thresholds, limited to players with at least 50 games
club_50_40_90 = stats_df[(stats_df['FG%'] >= 0.50) & (stats_df['3P%'] >= 0.40) & (stats_df['FT%'] >= 0.90) & (stats_df['G'] >= 50)]
club_50_40_90['Name']

4461         Larry Bird
4840         Larry Bird
5541         Mark Price
7787      Reggie Miller
8660         Steve Kerr
14137        Steve Nash
14685     Dirk Nowitzki
14945     Jose Calderon
15243        Steve Nash
15836        Steve Nash
16436        Steve Nash
17937      Kevin Durant
19321    Meyers Leonard
19740     Stephen Curry
Name: Name, dtype: object
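
The filter above returns the matching rows but does not yet add a column to the dataset. A minimal sketch of storing the result as a 0/1 feature is shown here on made-up rows. The column name `in_50_40_90` is hypothetical, and the 50-game threshold is carried over from the cell above.

```python
import pandas as pd

# Toy stand-in for stats_df; players and values are made up for illustration.
toy_df = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "G": [78.0, 30.0, 80.0],
    "FG%": [0.510, 0.550, 0.480],
    "3P%": [0.420, 0.450, 0.410],
    "FT%": [0.910, 0.950, 0.920],
})

# Boolean mask: all three shooting thresholds plus the games-played minimum.
mask = (
    (toy_df["FG%"] >= 0.50)
    & (toy_df["3P%"] >= 0.40)
    & (toy_df["FT%"] >= 0.90)
    & (toy_df["G"] >= 50)
)

# Store the mask as an integer feature (1 = qualifying season, 0 = otherwise).
toy_df["in_50_40_90"] = mask.astype(int)
print(toy_df[["Name", "in_50_40_90"]])
```

Keeping the mask as a column, rather than a filtered copy, lets the flag be passed to a model alongside the other features.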


## Data Visualization <a class="anchor" id="visualization"></a>