## Overview

In team sports, players’ athletic ability and individual performance are critical factors. The motivation for this final tutorial is to investigate the impact of players’ individual effort in the National Basketball Association (NBA). The tutorial analyzes players’ individual statistics, with a focus on Box Plus/Minus, and looks for correlations between those statistics and both individual and team success.

The first part covers collecting players’ Box Plus/Minus and other supporting individual stats, followed by data cleaning. The second part demonstrates how to analyze the data from 2007 to the present and visualizes the results. The third part builds a linear regression model to support the analysis and to test the hypotheses it suggests.

Before starting, an explanation of Box Plus/Minus (BPM) is needed in order to better understand the data during the analysis. According to Basketball-Reference’s website, BPM is a box-score-based metric that evaluates NBA players by estimating each player's approximate contribution to their team. BPM is a per-100-possession stat: 0.0 is league average, +5 means the player is 5 points better than an average player over 100 possessions (which is about All-NBA level), -2 is replacement level, and -5 is really bad.
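
To make the scale concrete, the thresholds quoted above can be turned into a small helper. This is only an illustrative sketch: the function name `describe_bpm` and the exact tier labels and boundaries between the quoted reference points are assumptions made for this example, not part of Basketball-Reference's definition.

```python
def describe_bpm(bpm: float) -> str:
    """Map a BPM value to a rough tier (labels and cut-offs are illustrative)."""
    if bpm >= 5.0:
        return "All-NBA level"
    if bpm >= 0.0:
        return "league average or better"
    if bpm >= -2.0:
        return "between replacement level and league average"
    if bpm > -5.0:
        return "below replacement level"
    return "very poor"


# Print the tier for a handful of example values
for value in (5.3, 2.1, -0.6, -2.4, -5.5):
    print(f"{value:+.1f}: {describe_bpm(value)}")
```

A helper like this is only useful for reading tables quickly; the analysis itself works with the raw BPM values.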


## Required Tools

Jupyter Notebook is recommended for this tutorial because it lets you create and share documents that contain live Python code, equations, visualizations, and narrative text. It supports data cleaning and transformation, statistical modeling, data visualization, machine learning, and more. The following libraries are needed for this tutorial and must be available in the notebook's Python environment:
1. Pandas
2. NumPy
3. Scikit-learn
4. Matplotlib
5. Folium

For the dataset, the NBA players' and teams' data can be retrieved from https://www.kaggle.com


In [41]:
# Import the needed libraries (folium is installed first in case it is missing)
!pip install folium
from bs4 import BeautifulSoup, SoupStrainer
import requests
import pandas as pd
import numpy as np
import folium
import matplotlib.pyplot as plt
from sklearn import linear_model



In [42]:
# Create the dataframe from the csv file
players = pd.read_csv("Seasons_Stats.csv")


In [161]:
# Drop unneeded columns, keep identifier columns
# Keep PER: "Player Efficiency Rating"
# Keep BPM: "Box Plus-Minus"
adv_players = players.filter(['Year','Player', 'G', 'MP', 'Pos','Tm', 'PER', 'BPM'], axis=1)
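# Minutes per game: total minutes played divided by games played
adv_players['MP/G'] = adv_players['MP'] / adv_players['G']
# Missing values become 0 and the result is truncated to a whole number of minutes
adv_players['MP/G'] = adv_players['MP/G'].fillna(0).astype(int)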

# 'Year' is the year in which the season ends, but the team dataframe uses the
# year in which the season begins, so subtract 1 to match the player stats
# with the team stats
adv_players['Year'] = adv_players['Year'] - 1

# Tidy the data to only include season stats from the 2012 - 2016 seasons.
# Rows with a missing year are dropped as well, because NaN >= 2012 is False.
adv_players = adv_players[adv_players['Year'] >= 2012]

# adv_players
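
The effect of the year shift and the season filter can be checked on a tiny hand-made frame. This is a minimal sketch: the dataframe `toy_players` and its values are made up for illustration and are not taken from `Seasons_Stats.csv`.

```python
import pandas as pd

# Made-up rows: one season that is too old, two valid ones, one with a missing year
toy_players = pd.DataFrame({
    "Year": [2011.0, 2013.0, 2017.0, None],
    "Player": ["A", "B", "C", "D"],
    "G": [10.0, 80.0, 0.0, 5.0],
    "MP": [100.0, 2014.0, 0.0, 50.0],
})

# Minutes per game; 0 minutes / 0 games gives NaN, which is replaced by 0
toy_players["MP/G"] = (toy_players["MP"] / toy_players["G"]).fillna(0)

# Shift from the season's end year to its start year
toy_players["Year"] = toy_players["Year"] - 1

# Keep seasons starting in 2012 or later; the NaN year fails the comparison
toy_players = toy_players[toy_players["Year"] >= 2012]

print(toy_players)
```

Only players B and C survive: A's season starts in 2010 after the shift, and D has no year at all.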

In [159]:
groups = adv_players.groupby('Year')

group12 = groups.get_group(2012.0)
group13 = groups.get_group(2013.0)
group14 = groups.get_group(2014.0)
group15 = groups.get_group(2015.0)
group16 = groups.get_group(2016.0)

In [162]:
# Empty dataframe for the average BPM per position, 2012-2016
# (5 positions x 5 seasons = 25 rows)
bpm_avg = pd.DataFrame(columns=['Pos', 'Year', 'AVERAGE BPM'], index=range(0, 25))

# Per-position subsets of each season
pg_16 = group16.loc[group16['Pos'] == 'PG']
sg_16 = group16.loc[group16['Pos'] == 'SG']
sf_16 = group16.loc[group16['Pos'] == 'SF']
pf_16 = group16.loc[group16['Pos'] == 'PF']
c_16 = group16.loc[group16['Pos'] == 'C']
pg_15 = group15.loc[group15['Pos'] == 'PG']
sg_15 = group15.loc[group15['Pos'] == 'SG']
sf_15 = group15.loc[group15['Pos'] == 'SF']
pf_15 = group15.loc[group15['Pos'] == 'PF']
c_15 = group15.loc[group15['Pos'] == 'C']
pg_14 = group14.loc[group14['Pos'] == 'PG']
sg_14 = group14.loc[group14['Pos'] == 'SG']
sf_14 = group14.loc[group14['Pos'] == 'SF']
pf_14 = group14.loc[group14['Pos'] == 'PF']
c_14 = group14.loc[group14['Pos'] == 'C']
pg_13 = group13.loc[group13['Pos'] == 'PG']
sg_13 = group13.loc[group13['Pos'] == 'SG']
sf_13 = group13.loc[group13['Pos'] == 'SF']
pf_13 = group13.loc[group13['Pos'] == 'PF']
c_13 = group13.loc[group13['Pos'] == 'C']
pg_12 = group12.loc[group12['Pos'] == 'PG']
sg_12 = group12.loc[group12['Pos'] == 'SG']
sf_12 = group12.loc[group12['Pos'] == 'SF']
pf_12 = group12.loc[group12['Pos'] == 'PF']
c_12 = group12.loc[group12['Pos'] == 'C']
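
The average BPM per position and season can also be computed in a single step with `groupby`, instead of one subset per position and year. The sketch below runs on a made-up frame; the names `toy_stats` and `avg_bpm_sketch` are hypothetical, and the same call is assumed to work on `adv_players` because it has the same `Year`, `Pos`, and `BPM` columns.

```python
import pandas as pd

# Made-up BPM values for two positions over two seasons
toy_stats = pd.DataFrame({
    "Year": [2015.0, 2015.0, 2015.0, 2016.0, 2016.0],
    "Pos": ["PG", "PG", "C", "PG", "C"],
    "BPM": [2.0, -1.0, 0.5, 3.0, 2.1],
})

# One row per (Year, Pos) pair with the mean BPM of that group
avg_bpm_sketch = (
    toy_stats.groupby(["Year", "Pos"])["BPM"]
    .mean()
    .reset_index(name="AVERAGE BPM")
)

print(avg_bpm_sketch)
```

The result has the same three columns as `bpm_avg`, so it could be used to fill that dataframe without writing the values in by hand.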


## Team Data Tidying

In [157]:
# Load the team records and keep only the needed columns
teams = pd.read_excel("Historical NBA Performance.xlsx")
teams = teams.filter(['Year', 'Team', 'Record', 'Winning Percentage'], axis=1)

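# Drop rows with a missing season or record, which cannot be parsed below
teams = teams.dropna(subset=["Year", "Record"])

# 'Year' is either a string whose first four characters are the starting year
# or a date-like value; convert both to an integer year
teams["Year"] = teams["Year"].apply(lambda x: int(x[:4]) if isinstance(x, str) else int(x.year))

# Split 'Record' ("wins-losses") into integer win and loss counts
teams["Record"] = teams["Record"].apply(lambda x: x.split('-'))
teams["Win"] = teams["Record"].apply(lambda x: int(x[0]))
teams["Loss"] = teams["Record"].apply(lambda x: int(x[1]))
teams["Total Games"] = teams["Loss"] + teams["Win"]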

teams = teams.filter(['Year','Team', 'Total Games', 'Win', 'Loss','Winning Percentage'], axis=1)

groups = teams.groupby("Year")
# team_group11 = groups.get_group(2011).sort_values("Winning Percentage", ascending=[False])
# team_group12 = groups.get_group(2012).sort_values("Winning Percentage", ascending=[False])
# team_group13 = groups.get_group(2013).sort_values("Winning Percentage", ascending=[False])
# team_group14 = groups.get_group(2014).sort_values("Winning Percentage", ascending=[False])
# team_group15 = groups.get_group(2015).sort_values("Winning Percentage", ascending=[False])
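
The record parsing above can be tried on a small example first. This is a sketch under assumptions: the frame `toy_teams`, its team names, and the `"2015-16"` season format are made up for illustration, and the vectorized `.str` accessors are an alternative to the `apply` calls in the cell, not the method used there.

```python
import pandas as pd

# Made-up team rows; the "YYYY-YY" season format is an assumption
toy_teams = pd.DataFrame({
    "Year": ["2015-16", "2014-15"],
    "Team": ["Team A", "Team B"],
    "Record": ["73-9", "67-15"],
})

# The first four characters of the season string give the starting year
toy_teams["Year"] = toy_teams["Year"].str[:4].astype(int)

# Split "wins-losses" into two integer columns
wins_losses = toy_teams["Record"].str.split("-", expand=True).astype(int)
toy_teams["Win"] = wins_losses[0]
toy_teams["Loss"] = wins_losses[1]
toy_teams["Total Games"] = toy_teams["Win"] + toy_teams["Loss"]

print(toy_teams[["Year", "Team", "Win", "Loss", "Total Games"]])
```

Checking that `Win / Total Games` agrees with the `Winning Percentage` column is a quick way to confirm the parsing worked on the real data.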

## BPM Per Team for 2015-2016 Season

In [163]:
# Display the player stats in group16
group16

| Unnamed: 0 | Year | Player | G | MP | Pos | Tm | PER | WS | BPM | MP/G |
|---|---|---|---|---|---|---|---|---|---|---|
| 23519 | 2016.0 | Steven Adams | 80.0 | 2014.0 | C | OKC | 15.5 | 6.5 | 2.1 | 25.175000 |
| 23520 | 2016.0 | Arron Afflalo | 71.0 | 2371.0 | SG | NYK | 10.9 | 2.7 | -2.4 | 33.394366 |
| 23523 | 2016.0 | LaMarcus Aldridge | 74.0 | 2261.0 | PF | SAS | 22.4 | 10.1 | 1.8 | 30.554054 |
| 23525 | 2016.0 | Lavoy Allen | 79.0 | 1599.0 | PF | IND | 12.4 | 3.7 | -0.6 | 20.240506 |
| 23526 | 2016.0 | Tony Allen | 64.0 | 1620.0 | SG | MEM | 12.9 | 2.4 | -0.1 | 25.312500 |
| 23527 | 2016.0 | Al-Farouq Aminu | 82.0 | 2341.0 | SF | POR | 12.7 | 4.0 | 0.2 | 28.548780 |
| 23535 | 2016.0 | Kyle Anderson | 78.0 | 1245.0 | SF | SAS | 12.9 | 3.5 | 1.8 | 15.961538 |
| 23536 | 2016.0 | Ryan Anderson | 66.0 | 2008.0 | PF | NOP | 17.2 | 3.9 | -1.4 | 30.424242 |
| 23537 | 2016.0 | Giannis Antetokounmpo | 80.0 | 2823.0 | PG | MIL | 18.8 | 7.1 | 2.4 | 35.287500 |
| 23539 | 2016.0 | Carmelo Anthony | 72.0 | 2530.0 | SF | NYK | 20.3 | 6.4 | 2.6 | 35.138889 |
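
One possible way to roll the player rows up to a per-team BPM is a minutes-weighted average, so that players who are on the court more count for more. The sketch below is an assumption about how such an aggregate could be built, not necessarily the method used in this tutorial. It uses four rows copied from the output above, so the resulting numbers cover only those players and are not real team totals; the names `sample` and `team_bpm` are hypothetical.

```python
import pandas as pd

# Four rows copied from the group16 output above (two per team)
sample = pd.DataFrame({
    "Player": ["Arron Afflalo", "Carmelo Anthony", "LaMarcus Aldridge", "Kyle Anderson"],
    "Tm": ["NYK", "NYK", "SAS", "SAS"],
    "MP": [2371.0, 2530.0, 2261.0, 1245.0],
    "BPM": [-2.4, 2.6, 1.8, 1.8],
})

# Weight each player's BPM by minutes played, then normalize per team
sample["BPM x MP"] = sample["BPM"] * sample["MP"]
totals = sample.groupby("Tm")[["BPM x MP", "MP"]].sum()
team_bpm = (totals["BPM x MP"] / totals["MP"]).rename("Weighted BPM")

print(team_bpm.round(2))
```

An unweighted `groupby("Tm")["BPM"].mean()` would be simpler, but it treats a player with very few minutes the same as a starter.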
