## Overview

In team sports, players' athletic ability and individual performance are critical factors. The motivation for this final tutorial is to investigate the impact of players' individual effort in the National Basketball Association (NBA). This tutorial analyzes players' individual statistics, with a focus on Box Plus/Minus, and looks for correlations with players' individual and team success.

The first part covers collecting players' Box Plus/Minus and other supporting individual stats, along with the data cleaning process. The second part demonstrates how to analyze the data from 2007 to the present and how to visualize it. The third part builds a linear regression model to carry out the analysis and test the hypotheses that follow from it.

Before starting, an explanation of Box Plus/Minus (BPM) is needed to better understand the data during the analysis. According to Basketball-Reference, BPM is a box-score-based metric that evaluates NBA players by estimating each player's approximate contribution to their team. BPM is a per-100-possession stat: 0.0 is league average, +5 means the player is 5 points better than an average player over 100 possessions (about All-NBA level), -2 is replacement level, and -5 is really bad.
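To make the scale concrete, here is a minimal sketch that turns a BPM value into a rough label and into an estimated point difference over a given number of possessions. The function names `describe_bpm` and `points_vs_average` are hypothetical, and the cutoffs between the quoted reference points (+5, 0, -2, -5) are an assumption made for illustration, not part of the Basketball-Reference definition.

```python
def describe_bpm(bpm):
    """Map a BPM value to a rough label.

    The reference points (+5, 0, -2, -5) come from the description above;
    treating them as hard cutoffs is an assumption for illustration.
    """
    if bpm >= 5.0:
        return "All-NBA level or better"
    if bpm >= 0.0:
        return "league average or better"
    if bpm >= -2.0:
        return "between replacement level and league average"
    if bpm > -5.0:
        return "below replacement level"
    return "really bad"


def points_vs_average(bpm, possessions):
    """Estimated points relative to an average player over `possessions`.

    BPM is expressed per 100 possessions, so scale linearly.
    """
    return bpm * possessions / 100.0


# Example values taken from the tables later in this tutorial
for value in (12.5, 0.8, -2.0, -7.7):
    print(value, "->", describe_bpm(value))
```

For example, `points_vs_average(5.0, 200)` estimates that a +5 player is worth about 10 points more than an average player over 200 possessions.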


## Required Tools

Jupyter Notebook is recommended for this tutorial because it lets you create and share documents that contain live Python code, equations, visualizations, and narrative text. It supports data cleaning and transformation, statistical modeling, data visualization, machine learning, and more. The tutorial needs the following libraries, which are available in (or can be installed into) the notebook environment:
1. Pandas
2. Numpy
3. Scikit-learn
4. Matplotlib
5. Folium

The NBA players' and teams' data can be retrieved from https://www.kaggle.com


In [1]:
#Import needed libraries
!pip install folium
from bs4 import BeautifulSoup, SoupStrainer
import requests
import pandas as pd
import numpy as np
import folium
import matplotlib.pyplot as plt
from sklearn import linear_model

Collecting folium
  Downloading folium-0.5.0.tar.gz (79kB)
    100% |████████████████████████████████| 81kB 653kB/s eta 0:00:01
Collecting branca (from folium)
  Downloading branca-0.2.0-py3-none-any.whl
Building wheels for collected packages: folium
  Running setup.py bdist_wheel for folium ... done
  Stored in directory: /home/jovyan/.cache/pip/wheels/04/d0/a0/b2b8356443364ae79743fce0b9b6a5b045f7560742129fde22
Successfully built folium
Installing collected packages: branca, folium
Successfully installed branca-0.2.0 folium-0.5.0


In [2]:
# Create the dataframe from the csv file
players = pd.read_csv("Seasons_Stats.csv")


In [3]:
# Drop unneeded columns, keep identifier columns
# Keep PER: "Player Efficiency Rating"
# Keep BPM: "Box Plus-Minus"
adv_players = players.filter(['Year','Player', 'G', 'MP', 'Pos','Tm', 'PER', 'BPM'], axis=1)
adv_players['MP/G'] = adv_players['MP'] / adv_players['G']
adv_players['MP/G'] = adv_players['MP/G'].fillna(0).astype(int)

# The player data labels each season by the year it ends, while the team data
# uses the year it begins, so subtract 1 to align the two
adv_players['Year'] = adv_players['Year'] - 1

# Keep only the 2011-12 through 2015-16 seasons (start years 2011 to 2015);
# rows with a missing Year are dropped as well
adv_players = adv_players[adv_players['Year'].between(2011, 2015)].copy()
adv_players['Year'] = adv_players['Year'].astype(int)

adv_players

| | Year | Player | G | MP | Pos | Tm | PER | BPM | MP/G |
|---|---|---|---|---|---|---|---|---|---|
| 21127 | 2011 | Jeff Adrien | 8.0 | 63.0 | PF | HOU | 11.2 | -7.7 | 7 |
| 21128 | 2011 | Arron Afflalo | 62.0 | 2086.0 | SG | DEN | 14.7 | 0.8 | 33 |
| 21129 | 2011 | Blake Ahearn | 4.0 | 30.0 | PG | UTA | -7.3 | -16.3 | 7 |
| 21130 | 2011 | Solomon Alabi | 14.0 | 122.0 | C | TOR | 14.2 | -4.1 | 8 |
| 21131 | 2011 | Cole Aldrich | 26.0 | 173.0 | C | OKC | 17.7 | 0.3 | 6 |
| 21132 | 2011 | LaMarcus Aldridge | 55.0 | 1994.0 | PF | POR | 22.7 | 2.4 | 36 |
| 21133 | 2011 | Lavoy Allen | 41.0 | 624.0 | PF | PHI | 12.7 | -1.3 | 15 |
| 21134 | 2011 | Ray Allen | 46.0 | 1565.0 | SG | BOS | 14.8 | 2.6 | 34 |
| 21135 | 2011 | Tony Allen | 58.0 | 1525.0 | SG | MEM | 15.7 | 2.5 | 26 |
| 21136 | 2011 | Morris Almond | 4.0 | 67.0 | SG | WAS | 8.7 | -2.0 | 16 |


In [31]:
groups = adv_players.groupby('Year')

group11 = groups.get_group(2011)
group12 = groups.get_group(2012)
group13 = groups.get_group(2013)
group14 = groups.get_group(2014)
group15 = groups.get_group(2015)
# DataFrame.sort was removed from pandas; sort_values is the current method
group15.sort_values('BPM', ascending=False)

  


| | Year | Player | G | MP | Pos | Tm | PER | BPM | MP/G |
|---|---|---|---|---|---|---|---|---|---|
| 24065 | 2015 | Briante Weber | 1.0 | 3.0 | PG | MIA | 39.3 | 14.0 | 3 |
| 23633 | 2015 | Stephen Curry | 79.0 | 2700.0 | PG | GSW | 31.5 | 12.5 | 34 |
| 24070 | 2015 | Russell Westbrook | 80.0 | 2750.0 | PG | OKC | 27.6 | 10.0 | 34 |
| 23781 | 2015 | LeBron James | 76.0 | 2709.0 | SF | CLE | 27.5 | 9.1 | 35 |
| 23681 | 2015 | Jimmer Fredette | 2.0 | 5.0 | SG | NYK | 47.4 | 8.8 | 2 |
| 23839 | 2015 | Kawhi Leonard | 72.0 | 2380.0 | SF | SAS | 26.0 | 8.3 | 33 |
| 23654 | 2015 | Kevin Durant | 72.0 | 2578.0 | SF | OKC | 28.2 | 7.9 | 35 |
| 23938 | 2015 | Chris Paul | 74.0 | 2420.0 | PG | LAC | 26.2 | 7.8 | 32 |
| 23849 | 2015 | Kyle Lowry | 77.0 | 2851.0 | PG | TOR | 22.2 | 6.8 | 37 |
| 23722 | 2015 | James Harden | 82.0 | 3125.0 | SG | HOU | 25.3 | 6.7 | 38 |
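The ranking above includes players with only a handful of minutes (for example, 3.0 and 5.0 total minutes), whose per-100-possession numbers rest on very small samples. One possible way to handle this, sketched here on a few rows copied from the output, is to require a minimum number of minutes played before ranking. The `MIN_MINUTES` cutoff of 500 is a hypothetical value chosen only for illustration; the tutorial itself does not apply such a filter.

```python
import pandas as pd

# A few rows copied from the sorted output above
sample = pd.DataFrame({
    "Player": ["Briante Weber", "Stephen Curry", "Russell Westbrook", "Jimmer Fredette"],
    "MP": [3.0, 2700.0, 2750.0, 5.0],
    "BPM": [14.0, 12.5, 10.0, 8.8],
})

# Hypothetical cutoff: keep only players with at least this many total minutes
MIN_MINUTES = 500

qualified = (
    sample[sample["MP"] >= MIN_MINUTES]
    .sort_values("BPM", ascending=False)
)
print(qualified)
```

The same mask could be applied to `group15` before sorting, assuming its `MP` column holds total minutes for the season as in the table above.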


In [11]:
# Dataframe of average BPM per position, 2011-2015
positions = ['PG', 'SG', 'SF', 'PF', 'C']
year_groups = [(2015, group15), (2014, group14), (2013, group13),
               (2012, group12), (2011, group11)]

# One row per (position, year), newest season first
bpm_avg_list = [
    [pos, year, group.loc[group['Pos'] == pos, 'BPM'].mean()]
    for year, group in year_groups
    for pos in positions
]

bpm_avg_df = pd.DataFrame(bpm_avg_list, columns=['Position', 'Year', 'BPM Average per Year'])
bpm_avg_df



| | Position | Year | BPM Average per Year |
|---|---|---|---|
| 0 | PG | 2015 | -2.089922 |
| 1 | SG | 2015 | -2.367241 |
| 2 | SF | 2015 | -1.558333 |
| 3 | PF | 2015 | -1.557143 |
| 4 | C | 2015 | -0.888462 |
| 5 | PG | 2014 | -2.470213 |
| 6 | SG | 2014 | -2.753147 |
| 7 | SF | 2014 | -1.755172 |
| 8 | PF | 2014 | -1.900699 |
| 9 | C | 2014 | -0.773 |
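The per-position averages above can also be computed with a single `groupby` over year and position. The sketch below uses a small made-up frame (`toy`) rather than the real data, so the numbers are illustrative only; the column names match the ones used in this tutorial.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the dataset
toy = pd.DataFrame({
    "Year": [2015, 2015, 2015, 2014, 2014],
    "Pos": ["PG", "PG", "C", "PG", "C"],
    "BPM": [12.5, 10.0, -1.0, 2.0, 0.5],
})

# Restrict to the five primary positions, as the cell above does
positions = ["PG", "SG", "SF", "PF", "C"]

avg = (
    toy[toy["Pos"].isin(positions)]
    .groupby(["Year", "Pos"], as_index=False)["BPM"]
    .mean()
    .rename(columns={"Pos": "Position", "BPM": "BPM Average per Year"})
    .sort_values(["Year", "Position"], ascending=[False, True])
    .reset_index(drop=True)
)
print(avg)
```

Unlike the explicit list, this version only produces rows for year and position pairs that actually occur in the data, and it orders positions alphabetically within each year.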


## Team Data Tidying

In [19]:
# Load the team records and keep only the columns needed

teams = pd.read_excel("Historical NBA Performance.xlsx")
teams = teams.filter(['Year','Team', 'Record', 'Winning Percentage'], axis=1)

# Year is read either as a string (the first four characters are used) or as a date;
# reduce both to an integer year
teams["Year"] = teams["Year"].fillna('')
teams["Year"] = teams["Year"].apply(lambda x: int(x[:4]) if isinstance(x, str) else int(x.year))

# Split a "wins-losses" record string into separate integer columns
teams["Record"] = teams["Record"].fillna('')
teams["Record"] = teams["Record"].apply(lambda x: x.split('-'))
teams["Win"] = teams["Record"].apply(lambda x: int(x[0]))
teams["Loss"] = teams["Record"].apply(lambda x: int(x[1]))
teams["Total Games"] = teams["Loss"] + teams["Win"]

teams = teams.filter(['Year','Team', 'Total Games', 'Win', 'Loss','Winning Percentage'], axis=1)

groups = teams.groupby("Year")
teams
# team_group11 = groups.get_group(2011).sort_values("Winning Percentage", ascending=[False])
# team_group12 = groups.get_group(2012).sort_values("Winning Percentage", ascending=[False])
# team_group13 = groups.get_group(2013).sort_values("Winning Percentage", ascending=[False])
# team_group14 = groups.get_group(2014).sort_values("Winning Percentage", ascending=[False])
# team_group15 = groups.get_group(2015).sort_values("Winning Percentage", ascending=[False])

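| | Year | Team | Total Games | Win | Loss | Winning Percentage |
|---|---|---|---|---|---|---|
| 0 | 2016 | Celtics | 40 | 25 | 15 | 0.625 |
| 1 | 2015 | Celtics | 82 | 48 | 34 | 0.585 |
| 2 | 2014 | Celtics | 82 | 40 | 42 | 0.488 |
| 3 | 2013 | Celtics | 82 | 25 | 57 | 0.305 |
| 4 | 2012 | Celtics | 81 | 41 | 40 | 0.506 |
| 5 | 2011 | Celtics | 66 | 39 | 27 | 0.591 |
| 6 | 2010 | Celtics | 82 | 56 | 26 | 0.683 |
| 7 | 2009 | Celtics | 82 | 50 | 32 | 0.610 |
| 8 | 2008 | Celtics | 82 | 62 | 20 | 0.756 |
| 9 | 2007 | Celtics | 82 | 66 | 16 | 0.805 |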
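As a quick sanity check on the tidy step, the winning percentage can be recomputed from the parsed wins and losses. The sketch below rebuilds two `Record` strings from the Win and Loss values shown above and uses the vectorized `Series.str.split` instead of `apply`; the frame name `toy_teams` and the column `Win Pct (recomputed)` are hypothetical.

```python
import pandas as pd

# Record strings rebuilt from the Win/Loss columns of the output above
toy_teams = pd.DataFrame({
    "Year": [2015, 2014],
    "Team": ["Celtics", "Celtics"],
    "Record": ["48-34", "40-42"],
})

# Split "wins-losses" into two integer columns in one vectorized step
wins_losses = toy_teams["Record"].str.split("-", expand=True).astype(int)
toy_teams["Win"] = wins_losses[0]
toy_teams["Loss"] = wins_losses[1]
toy_teams["Total Games"] = toy_teams["Win"] + toy_teams["Loss"]

# Recompute the winning percentage to compare against the source column
toy_teams["Win Pct (recomputed)"] = (
    toy_teams["Win"] / toy_teams["Total Games"]
).round(3)
print(toy_teams)
```

The recomputed values for these two rows agree with the `Winning Percentage` column in the table above (0.585 and 0.488).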


## BPM Per Team for 2015-2016 Season

In [30]:
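# Average BPM and PER per team for the 2015-16 season
season2015 = group15.groupby('Tm')[['BPM', 'PER']].mean()
season2015 = season2015.rename(columns={'BPM': 'BPM Avg', 'PER': 'PER Avg'})

# Column-wise maximum per team (each column's maximum is taken independently,
# so the values in a row do not all belong to the same player)
group15.groupby('Tm').max()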

| Tm | Year | Player | G | MP | Pos | PER | BPM | MP/G |
|---|---|---|---|---|---|---|---|---|
| ATL | 2015 | Walter Tavares | 82.0 | 2647.0 | SG | 21.3 | 5.3 | 32 |
| BOS | 2015 | Tyler Zeller | 82.0 | 2644.0 | SG | 23.9 | 3.0 | 33 |
| BRK | 2015 | Willie Reed | 79.0 | 2457.0 | SG | 21.7 | 1.3 | 33 |
| CHI | 2015 | Tony Snell | 81.0 | 2474.0 | SG | 21.7 | 4.0 | 36 |
| CHO | 2015 | Tyler Hansbrough | 81.0 | 2885.0 | SG | 20.8 | 4.0 | 35 |
| CLE | 2015 | Tristan Thompson | 82.0 | 2709.0 | SG | 27.5 | 9.1 | 42 |
| DAL | 2015 | Zaza Pachulia | 80.0 | 2644.0 | SG | 24.0 | 4.3 | 33 |
| DEN | 2015 | Will Barton | 82.0 | 2439.0 | SG | 21.5 | 4.8 | 34 |
| DET | 2015 | Tobias Harris | 81.0 | 2856.0 | SG | 21.2 | 2.3 | 36 |
| GSW | 2015 | Stephen Curry | 81.0 | 2808.0 | SG | 31.5 | 12.5 | 34 |
