## Overview

In team sports, players' athletic ability and individual performance are widely regarded as critical factors. The motivation for this final tutorial is to investigate the impact of players' individual effort in the National Basketball Association (NBA). The tutorial analyzes players' individual statistics, with a focus on Box Plus/Minus, and looks for correlations with individual and team success.

The first part covers collecting players' Box Plus/Minus and other supporting individual stats, along with the data cleaning process. The second part demonstrates how to analyze the data from 2007 to the present and how to visualize it. The third part builds a linear regression model to carry out the analysis and test the hypotheses that follow from it.

Before starting, an explanation of Box Plus/Minus (BPM) is needed to better understand the data during the analysis. According to Basketball-Reference, BPM is a box-score-based metric that evaluates NBA players by estimating their individual contribution to their team. BPM is a per-100-possession stat: 0.0 is league average, +5 means the player is 5 points better than an average player over 100 possessions (which is about All-NBA level), -2 is replacement level, and -5 is really bad.
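
As a quick illustration of the scale described above, the sketch below maps a BPM value to a rough label. The function name `describe_bpm` and the idea of treating the reference points as hard cut-offs are assumptions made for illustration; the description above only gives the reference points 0.0, +5, -2, and -5.

```python
def describe_bpm(bpm):
    """Return a rough label for a BPM value (points per 100 possessions).

    The reference points (+5, 0, -2, -5) come from the description above;
    using them as hard cut-offs is an assumption for illustration only.
    """
    if bpm >= 5.0:
        return "about All-NBA level or better"
    if bpm >= 0.0:
        return "league average or better"
    if bpm >= -2.0:
        return "between replacement level and average"
    if bpm > -5.0:
        return "below replacement level"
    return "really bad"


# Try the labels on a few reference values
for value in (12.5, 0.0, -2.0, -5.0):
    print(value, "->", describe_bpm(value))
```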


## Required Tools

Jupyter Notebook is recommended for this tutorial because it lets you create and share documents that contain live Python code, equations, visualizations, and narrative text. It supports data cleaning and transformation, statistical modeling, data visualization, machine learning, and more. The following libraries are needed for this tutorial:
1. Pandas
2. Numpy
3. Scikit-learn
4. Matplotlib
5. Folium

For the dataset, the NBA players' and teams' data can be retrieved from https://www.kaggle.com


In [653]:
#Import needed libraries
!pip install folium
from bs4 import BeautifulSoup, SoupStrainer
import requests
import pandas as pd
import numpy as np
import folium
import matplotlib.pyplot as plt
from sklearn import linear_model



In [654]:
# Create the dataframe from the csv file
players = pd.read_csv("Seasons_Stats.csv")

In [655]:
# Drop unneeded columns, keep identifier columns
# Keep PER: "Player Efficiency Rating"
# Keep BPM: "Box Plus-Minus"
adv_players = players.filter(['Year','Player', 'Age', 'G', 'MP', 'Pos','Tm', 'PER', 'BPM'], axis=1)
adv_players['MP/G'] = adv_players['MP'] / adv_players['G']
adv_players['MP/G'] = adv_players['MP/G'].fillna(0).astype(int)

# Year is the year the season ends, but the team dataframe uses the year the season begins,
# so subtract 1 to match the years of the player stats and the team stats
adv_players['Year'] = adv_players['Year'] - 1

# Tidy the data to only include season stats from the 2011 - 2015 seasons
adv_players = adv_players.drop(adv_players[adv_players.Year < 2011].index)
adv_players = adv_players.drop(adv_players[adv_players.Year > 2015].index)
# Drop the "TOT" rows (combined totals for players who changed teams mid-season)
adv_players = adv_players.drop(adv_players[adv_players.Tm == "TOT"].index)
# Drop rows with a missing year, then store Year as an integer
adv_players = adv_players[np.isfinite(adv_players['Year'])]
adv_players['Year'] = adv_players['Year'].astype(int)

## Group Players by Year

In [664]:
groups = adv_players.groupby('Year')

# One dataframe per season; these are reused in the cells below
group11 = groups.get_group(2011)
group12 = groups.get_group(2012)
group13 = groups.get_group(2013)
group14 = groups.get_group(2014)
group15 = groups.get_group(2015)

# Top 5 players by BPM in the 2015 season
group15.sort_values('BPM', ascending=False).head()



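| | Year | Player | Age | G | MP | Pos | Tm | PER | BPM | MP/G |
|---|---|---|---|---|---|---|---|---|---|---|
| 24065 | 2015 | Briante Weber | 23.0 | 1.0 | 3.0 | PG | MIA | 39.3 | 14.0 | 3 |
| 23633 | 2015 | Stephen Curry | 27.0 | 79.0 | 2700.0 | PG | GSW | 31.5 | 12.5 | 34 |
| 24070 | 2015 | Russell Westbrook | 27.0 | 80.0 | 2750.0 | PG | OKC | 27.6 | 10.0 | 34 |
| 23781 | 2015 | LeBron James | 31.0 | 76.0 | 2709.0 | SF | CLE | 27.5 | 9.1 | 35 |
| 23681 | 2015 | Jimmer Fredette | 26.0 | 2.0 | 5.0 | SG | NYK | 47.4 | 8.8 | 2 |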
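
The ranking above is led by Briante Weber, who played 3 minutes in a single game, and it also includes Jimmer Fredette with 5 minutes. Rate statistics such as BPM are noisy over so few minutes, so a minimum-minutes filter may give a more meaningful ranking. The sketch below uses only the five rows shown above; the helper name `top_bpm` and the 500-minute threshold are assumptions chosen for illustration, not values taken from the dataset.

```python
import pandas as pd

# The five rows displayed above, retyped by hand so the example is self-contained
sample = pd.DataFrame({
    "Player": ["Briante Weber", "Stephen Curry", "Russell Westbrook",
               "LeBron James", "Jimmer Fredette"],
    "MP": [3.0, 2700.0, 2750.0, 2709.0, 5.0],
    "BPM": [14.0, 12.5, 10.0, 9.1, 8.8],
})


def top_bpm(df, min_minutes=500, n=5):
    """Rank players by BPM after dropping those below a minutes threshold.

    min_minutes=500 is an arbitrary illustrative cut-off (an assumption).
    """
    qualified = df[df["MP"] >= min_minutes]
    return qualified.sort_values("BPM", ascending=False).head(n)


ranked = top_bpm(sample)
print(ranked[["Player", "MP", "BPM"]])
```

With this filter, the two players with single-digit minutes drop out and the remaining three keep their relative order.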


In [657]:
# Average BPM per year for the 2011 - 2015 seasons
# Select BPM before averaging so text columns (Player, Pos, Tm) are not passed to mean()
avg_BPM = adv_players.groupby('Year')[['BPM']].mean()
avg_BPM

| Year | BPM |
|---|---|
| 2011 | -1.682718 |
| 2012 | -2.34283 |
| 2013 | -2.306022 |
| 2014 | -1.965913 |
| 2015 | -1.61553 |
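
Every yearly average above is negative even though 0.0 is described as league average. One likely reason (an assumption, not verified against the full dataset here) is that `mean()` weights every player equally, so low-minute players with poor BPM pull the average down. The sketch below contrasts a plain mean with a minutes-weighted mean on made-up numbers; the helper name `weighted_bpm` and all values in `toy` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up toy data: two heavy-minute players and three deep-bench players
toy = pd.DataFrame({
    "Year": [2015, 2015, 2015, 2015, 2015],
    "MP":   [2800.0, 2500.0, 100.0, 50.0, 50.0],
    "BPM":  [4.0, 1.0, -6.0, -8.0, -10.0],
})


def weighted_bpm(df):
    """Minutes-weighted average BPM for one group of players."""
    return np.average(df["BPM"], weights=df["MP"])


# Plain mean: every player counts the same
plain = toy.groupby("Year")["BPM"].mean()

# Weighted mean: players count in proportion to minutes played
weighted = pd.Series({year: weighted_bpm(season) for year, season in toy.groupby("Year")})

print("Unweighted mean:", plain.loc[2015])
print("Minutes-weighted mean:", round(weighted.loc[2015], 3))
```

In this toy season the unweighted mean is negative while the minutes-weighted mean is positive, which shows how much the choice of weighting can matter.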


In [658]:
# Dataframe of average BPM per position, 2011-2015
# Rows are built in the order 2015 -> 2011, and PG, SG, SF, PF, C within each year
positions = ['PG', 'SG', 'SF', 'PF', 'C']
bpm_avg_list = []
for year in range(2015, 2010, -1):
    season = adv_players[adv_players['Year'] == year]
    for pos in positions:
        bpm_avg_list.append([pos, year, season.loc[season['Pos'] == pos, 'BPM'].mean()])

bpm_avg_df = pd.DataFrame(bpm_avg_list,columns=['Position','Year','BPM Average per Year'])
# bpm_avg_df



## Tidying the Team Data

Create a tidy table from the team dataset with the columns Year, Team, Record, and Winning Percentage. To focus the analysis on one season at a time, group the data by year for 2011 - 2015.

In [659]:
# Load the Excel file into a dataframe
# Tidy the data so that creating our dataset will be easier
teams  = pd.read_excel("Historical NBA Performance.xlsx")
teams = teams.filter(['Year','Team', 'Record', 'Winning Percentage'], axis=1)

#Convert Year data to int
teams["Year"] = teams["Year"].fillna('')
teams["Year"] = teams["Year"].apply(lambda x: int(x[:4]) if isinstance(x, str) else int(str(x.year)))

#Split the Record "Win-Loss" in to two columns and convert the data to int
teams["Record"] = teams["Record"].fillna('')
teams["Record"] = teams["Record"].apply(lambda x: x.split('-'))
teams["Win"] = teams["Record"].apply(lambda x: int(x[0]))
teams["Loss"] = teams["Record"].apply(lambda x: int(x[1]))
teams["Total Games"] = teams["Loss"] + teams["Win"]

#Tidy the data by filtering which columns will be needed for this tutorial
teams = teams.filter(['Year','Team', 'Total Games', 'Win', 'Loss','Winning Percentage'], axis=1)

# Group the teams by year and sort each season by winning percentage (best first)
groups = teams.groupby("Year")
team_group11 = groups.get_group(2011).sort_values("Winning Percentage", ascending=[False])
team_group12 = groups.get_group(2012).sort_values("Winning Percentage", ascending=[False])
team_group13 = groups.get_group(2013).sort_values("Winning Percentage", ascending=[False])
team_group14 = groups.get_group(2014).sort_values("Winning Percentage", ascending=[False])
team_group15 = groups.get_group(2015).sort_values("Winning Percentage", ascending=[False])
team_group15

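| | Year | Team | Total Games | Win | Loss | Winning Percentage |
|---|---|---|---|---|---|---|
| 453 | 2015 | Warriors | 82 | 73 | 9 | 0.89 |
| 1256 | 2015 | Spurs | 82 | 67 | 15 | 0.817 |
| 259 | 2015 | Cavaliers | 82 | 57 | 25 | 0.695 |
| 1297 | 2015 | Raptors | 82 | 56 | 26 | 0.683 |
| 945 | 2015 | Thunder | 82 | 55 | 27 | 0.671 |
| 615 | 2015 | Clippers | 82 | 53 | 29 | 0.646 |
| 1 | 2015 | Celtics | 82 | 48 | 34 | 0.585 |
| 72 | 2015 | Hawks | 82 | 48 | 34 | 0.585 |
| 181 | 2015 | Hornets | 82 | 48 | 34 | 0.585 |
| 753 | 2015 | Heat | 82 | 48 | 34 | 0.585 |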
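
The `Record` parsing above can also be written with pandas' vectorized string methods. The sketch below is an alternative shown on a small hand-typed sample (three rows from the table above), not the tutorial's original code; the names `sample_teams` and `Win Pct` are hypothetical.

```python
import pandas as pd

# Three rows retyped from the table above
sample_teams = pd.DataFrame({
    "Year": [2015, 2015, 2015],
    "Team": ["Warriors", "Spurs", "Cavaliers"],
    "Record": ["73-9", "67-15", "57-25"],
})

# Split "Win-Loss" into two integer columns in one step
sample_teams[["Win", "Loss"]] = (
    sample_teams["Record"].str.split("-", expand=True).astype(int)
)
sample_teams["Total Games"] = sample_teams["Win"] + sample_teams["Loss"]

# Recompute the winning percentage as a sanity check against the source column
sample_teams["Win Pct"] = (sample_teams["Win"] / sample_teams["Total Games"]).round(3)

print(sample_teams.drop(columns="Record"))
```

Recomputing the percentage from the parsed wins and losses is a cheap way to confirm that the split worked as intended.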


## BPM Per Team for 2015-2016 Season

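The first table below shows the average BPM and average PER per team, along with the number of players on each team whose BPM is above the league average. The second table shows, for each team, the player with the highest BPM together with that player's PER and MP/G.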

In [660]:
# Split the data by team and average each numeric column (including BPM and PER)
season2015 = group15.groupby('Tm').mean(numeric_only=True).reset_index()

# Obtain the league-wide average BPM for this season (the mean of the team averages)
BPM_Overall_Mean = season2015['BPM'].mean()

# Count the players per team whose BPM is higher than that average
above_avg = group15.groupby('Tm')['BPM'].apply(lambda x: (x > BPM_Overall_Mean).sum()).reset_index(name='BPM Above Avg')
season2015 = season2015.merge(above_avg, on='Tm')
season2015['BPM Avg'] = season2015['BPM']
season2015['PER Avg'] = season2015['PER']
season2015 = season2015.filter(['Tm','BPM Above Avg', 'BPM Avg', 'PER Avg']).sort_values('BPM Avg', ascending=False)

season2015


| | Tm | BPM Above Avg | BPM Avg | PER Avg |
|---|---|---|---|---|
| 26 | SAS | 15 | 1.752941 | 15.841176 |
| 15 | MIA | 14 | 0.015789 | 15.336842 |
| 9 | GSW | 11 | 0.0125 | 15.65625 |
| 12 | LAC | 9 | -0.288889 | 13.427778 |
| 11 | IND | 10 | -0.66875 | 14.55625 |
| 27 | TOR | 14 | -0.7 | 13.63125 |
| 6 | DAL | 12 | -0.7125 | 14.925 |
| 5 | CLE | 11 | -0.75 | 13.061111 |
| 28 | UTA | 10 | -0.894118 | 12.435294 |
| 4 | CHO | 12 | -0.911765 | 14.252941 |
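
A team-level summary like the one above can also be built with a single `groupby` and named aggregation. The sketch below uses made-up players on two hypothetical teams (`AAA` and `BBB`). It follows the same idea as the cell above, with the benchmark taken as the mean of the team averages, but it is an alternative formulation rather than the tutorial's original code.

```python
import pandas as pd

# Made-up players on two hypothetical teams
toy = pd.DataFrame({
    "Tm":  ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "BPM": [6.0, 1.0, -4.0, 0.0, -2.0, -7.0],
    "PER": [24.0, 15.0, 9.0, 16.0, 12.0, 8.0],
})

# Named aggregation: output column name -> (input column, function)
team_avg = toy.groupby("Tm").agg(
    **{"BPM Avg": ("BPM", "mean"), "PER Avg": ("PER", "mean")}
)

# Benchmark: the mean of the team averages
benchmark = team_avg["BPM Avg"].mean()

# Count, per team, the players whose BPM is above the benchmark
above = toy["BPM"].gt(benchmark).groupby(toy["Tm"]).sum().rename("BPM Above Avg")

summary = (
    team_avg.join(above)
    .sort_values("BPM Avg", ascending=False)
    .reset_index()
)
print(summary)
```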


In [661]:
# Obtain the player with the highest BPM on each team
best_BPM = group15.loc[group15.groupby(["Tm"])["BPM"].idxmax()]
best_BPM = best_BPM.filter(['Tm', 'BPM', 'Player', 'PER', 'MP/G', 'Pos']) 

best_BPM

| | Tm | BPM | Player | PER | MP/G | Pos |
|---|---|---|---|---|---|---|
| 23892 | ATL | 5.3 | Paul Millsap | 21.3 | 32 | PF |
| 23792 | BOS | 3.0 | Amir Johnson | 16.0 | 22 | PF |
| 23846 | BRK | 1.3 | Brook Lopez | 21.7 | 33 | C |
| 23594 | CHI | 4.0 | Jimmy Butler | 21.3 | 36 | SG |
| 24059 | CHO | 4.0 | Kemba Walker | 20.8 | 35 | PG |
| 23781 | CLE | 9.1 | LeBron James | 27.5 | 35 | SF |
| 23837 | DAL | 4.3 | David Lee | 24.0 | 17 | PF |
| 23804 | DEN | 4.8 | Nikola Jokic | 21.5 | 21 | C |
| 23731 | DET | 2.3 | Tobias Harris | 18.2 | 33 | PF |
| 23633 | GSW | 12.5 | Stephen Curry | 31.5 | 34 | PG |
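
The `idxmax` pattern used above works in two steps: `groupby(...)["BPM"].idxmax()` returns, for each team, the index label of the row with the largest BPM, and `.loc` then pulls those full rows. The sketch below shows the two steps on made-up data; the team codes, player names, and index labels are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Tm": ["AAA", "AAA", "BBB", "BBB"],
        "Player": ["Player 1", "Player 2", "Player 3", "Player 4"],
        "BPM": [2.5, 7.0, -1.0, 0.5],
    },
    index=[101, 102, 103, 104],  # arbitrary row labels, like the row numbers above
)

# Step 1: index label of the highest-BPM row within each team
best_rows = toy.groupby("Tm")["BPM"].idxmax()

# Step 2: select those rows from the original frame
best = toy.loc[best_rows]
print(best)
```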
