<a href="https://colab.research.google.com/github/jmgjasongardner/DEEPJocksBox/blob/master/Jocks'_Box.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Jocks' Box

**Organization:** Rice Data Science Club, DEEP Project Team

**Mentor:** Jason Gardner

**Mentees:** Aaron Chang, Kai Hung, Ankit Patel, Nathan Powell, Sanjay Rajasekhar, Daniel Sedano, Luke Stancil, Brian Xu, Jeffrey Zhong

**Data Set:** NBA Enhanced Box Score and Standings (2012-2018 Team Box Scores) https://www.kaggle.com/pablote/nba-enhanced-stats 

**Objective:** 
1. Identify the variables that most strongly influence a team's win rate within a season
2. Check whether the same variables remain influential across the seasons in the data set
3. Predict the win rate for any given team within a season using a model fit on its data prior to February
4. Reason through why our predictions are or are not accurate.

**How:** We split each season's games into training and testing data. The split approximates each season's trade deadline with a month cutoff: games played from October through January are training data, and games played from February onward are testing data. For each season, we average the training games by team and fit a multivariate linear regression of win rate on the explanatory variables below. We then use that model to predict each team's win rate on the testing data.

Our explanatory variables are as follows: 
1. *teamEDiff*: efficiency differential of the team (offensive rating minus defensive rating)
2. *teamASST%*: assisted field goal percentage by team
3. *team2P%*: two point percentage made by team
4. *team3P%*: three pointer percentage made by team
5. *teamFT%*: free throw percentage made by team
6. *opptTO%*: turnover percentage by opponent
7. *pace*: pace per game duration (analogous to possessions)
8. *Q4Diff*: the team's fourth-quarter points minus the opponent's fourth-quarter points


Our response variable, *win_rate*, is each team's win rate: the fraction of its games won, computed as the mean of the 0/1 *win/loss* column.
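
Written out, the model fit for each season has the form below. This is a sketch of the standard linear regression equation applied to the eight explanatory variables listed above; the symbols $\beta_0, \dots, \beta_8$ are notation introduced here, with $\beta_0$ the intercept and the remaining coefficients ordered as the variables are listed.

$$
\widehat{\text{win\_rate}} = \beta_0 + \beta_1\,\text{teamEDiff} + \beta_2\,\text{teamASST\%} + \beta_3\,\text{team2P\%} + \beta_4\,\text{team3P\%} + \beta_5\,\text{teamFT\%} + \beta_6\,\text{opptTO\%} + \beta_7\,\text{pace} + \beta_8\,\text{Q4Diff}
$$

Each coefficient is the change in predicted win rate for a one-unit change in that variable with the others held fixed, so coefficients on variables measured on different scales (for example *team2P%* as a fraction and *pace* in possessions) are not directly comparable in size.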

**Technology:** Python, Jupyter Notebook

# Libraries & Data

We will be using the Python libraries seaborn, pandas, scikit-learn, and matplotlib in our analysis.

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn import metrics

Let's import our data and take a look at it! 

In [None]:
nba_df = pd.read_csv("/2012-18_teamBoxScore.csv")

In [None]:
nba_df.head()

Unnamed: 0,gmDate,gmTime,seasTyp,offLNm1,offFNm1,offLNm2,offFNm2,offLNm3,offFNm3,teamAbbr,teamConf,teamDiv,teamLoc,teamRslt,teamMin,teamDayOff,teamPTS,teamAST,teamTO,teamSTL,teamBLK,teamPF,teamFGA,teamFGM,teamFG%,team2PA,team2PM,team2P%,team3PA,team3PM,team3P%,teamFTA,teamFTM,teamFT%,teamORB,teamDRB,teamTRB,teamPTS1,teamPTS2,teamPTS3,...,oppt2P%,oppt3PA,oppt3PM,oppt3P%,opptFTA,opptFTM,opptFT%,opptORB,opptDRB,opptTRB,opptPTS1,opptPTS2,opptPTS3,opptPTS4,opptPTS5,opptPTS6,opptPTS7,opptPTS8,opptTREB%,opptASST%,opptTS%,opptEFG%,opptOREB%,opptDREB%,opptTO%,opptSTL%,opptBLK%,opptBLKR,opptPPS,opptFIC,opptFIC40,opptOrtg,opptDrtg,opptEDiff,opptPlay%,opptAR,opptAST/TO,opptSTL/TO,poss,pace
0,2012-10-30,19:00,Regular,Brothers,Tony,Smith,Michael,Workman,Haywoode,WAS,East,Southeast,Away,Loss,240,0,84,26,13,11,10,19,90,32,0.3556,58,24,0.4138,32,8,0.25,20,12,0.6,18,21,39,24,15,23,...,0.4915,20,7,0.35,22,15,0.6818,18,36,54,31,19,24,20,0,0,0,0,58.0645,61.1111,0.53,0.5,33.3333,66.6667,19.1466,7.8704,5.6217,8.4746,1.1899,74.0,61.6667,105.6882,94.4447,11.2435,0.439,16.7072,1.0476,33.3333,88.9409,88.9409
1,2012-10-30,19:00,Regular,Brothers,Tony,Smith,Michael,Workman,Haywoode,CLE,East,Central,Home,Win,240,0,94,22,21,7,5,21,79,36,0.4557,59,29,0.4915,20,7,0.35,22,15,0.6818,18,36,54,31,19,24,...,0.4138,32,8,0.25,20,12,0.6,18,21,39,24,15,23,22,0,0,0,0,41.9355,81.25,0.4251,0.4,46.1538,53.8462,11.6279,12.3678,11.2434,17.2414,0.9333,67.25,56.0417,94.4447,105.6882,-11.2435,0.3765,18.8679,2.0,84.6154,88.9409,88.9409
2,2012-10-30,20:00,Regular,McCutchen,Monty,Wright,Sean,Fitzgerald,Kane,BOS,East,Atlantic,Away,Loss,240,0,107,24,16,4,2,23,75,39,0.52,62,33,0.5323,13,6,0.4615,28,23,0.8214,7,34,41,25,29,22,...,0.5556,16,8,0.5,32,26,0.8125,5,31,36,31,31,31,27,0,0,0,0,46.7532,58.1395,0.6446,0.5949,13.8889,86.1111,7.9145,8.4225,5.2641,7.9365,1.519,97.0,80.8333,126.3381,112.6515,13.6866,0.5244,19.8287,3.125,100.0,94.9832,94.9832
3,2012-10-30,20:00,Regular,McCutchen,Monty,Wright,Sean,Fitzgerald,Kane,MIA,East,Southeast,Home,Win,240,0,120,25,8,8,5,20,79,43,0.5443,63,35,0.5556,16,8,0.5,32,26,0.8125,5,31,36,31,31,31,...,0.5323,13,6,0.4615,28,23,0.8214,7,34,41,25,29,22,31,0,0,0,0,53.2468,61.5385,0.6127,0.56,17.0732,82.9268,15.4859,4.2113,2.1056,3.2258,1.4267,75.25,62.7083,112.6515,126.3381,-13.6866,0.4643,18.8501,1.5,25.0,94.9832,94.9832
4,2012-10-30,22:30,Regular,Foster,Scott,Zielinski,Gary,Dalen,Eric,DAL,West,Southwest,Away,Win,240,0,99,22,12,9,5,25,85,40,0.4706,70,35,0.5,15,5,0.3333,18,14,0.7778,9,31,40,25,23,26,...,0.5469,13,3,0.2308,31,12,0.3871,15,31,46,29,17,20,25,0,0,0,0,53.4884,63.1579,0.502,0.513,32.6087,67.3913,13.3792,6.5517,5.4598,7.8125,1.1818,70.375,58.6458,99.3678,108.1034,-8.7356,0.5,18.6567,1.7143,42.8571,91.579,91.579


# Data Cleaning and Manipulation

Adding Season/Type Columns: 

We need to add three columns to help organize our data in the later parts of the analysis: season, type (training vs. testing), and win/loss (1 for a win, 0 for a loss).

In [None]:
month_list = [10, 11, 12, 1]
seasons = []
win_loss = []
types = []

# iterate through every game in the dataframe
for index, row in nba_df.iterrows(): 

  # games from October through January are training data (a month-based proxy for the trade deadline)
  month = int(row["gmDate"][5:7])
  year = int(row["gmDate"][0:4])
  if month in month_list: 
    types.append("Train")
    if month != 1: 
      seasons.append(year + 1)
    else: 
      seasons.append(year)
  else: 
    seasons.append(year)
    types.append("Test")
  
  # now, determine if each of the games resulted in a win or loss
  if row["teamRslt"] == "Loss": 
    win_loss.append(0)
  elif row["teamRslt"] == "Win": 
    win_loss.append(1)

# plain assignment adds the column if it is missing and overwrites it on a re-run
nba_df["season"] = seasons
nba_df["type"] = types
nba_df["win/loss"] = win_loss

nba_df

           gmDate gmTime  seasTyp    offLNm1  ...      pace season   type win/loss
0      2012-10-30  19:00  Regular   Brothers  ...   88.9409   2013  Train        0
1      2012-10-30  19:00  Regular   Brothers  ...   88.9409   2013  Train        1
2      2012-10-30  20:00  Regular  McCutchen  ...   94.9832   2013  Train        0
3      2012-10-30  20:00  Regular  McCutchen  ...   94.9832   2013  Train        1
4      2012-10-30  22:30  Regular     Foster  ...   91.5790   2013  Train        1
...           ...    ...      ...        ...  ...       ...    ...    ...      ...
14753  2018-04-11  10:30  Regular  Garretson  ...  101.7513   2018   Test        0
14754  2018-04-11  10:30  Regular     Cutler  ...   97.2708   2018   Test        0
14755  2018-04-11  10:30  Regular     Cutler  ...   97.6761   2018   Test        1
14756  2018-04-11  10:30  Regular      Tiven  ...   91.6047   2018   Test        0
14757  2018-04-11  10:30  Regular      Tiven  ...   91.98
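
The row-by-row loop above can also be expressed with vectorized pandas operations. The following is a minimal sketch, assuming the same month rule (October through January is training data) and using a made-up four-row frame, `toy_df`, in place of `nba_df`. Unlike the loop, it maps any result other than "Win" to 0.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for nba_df with only the two columns the labeling needs.
toy_df = pd.DataFrame({
    "gmDate": ["2012-10-30", "2013-01-15", "2013-02-01", "2013-04-17"],
    "teamRslt": ["Loss", "Win", "Win", "Loss"],
})

dates = pd.to_datetime(toy_df["gmDate"])

# October through January games are training data; everything else is testing data.
is_train = dates.dt.month.isin([10, 11, 12, 1])
toy_df["type"] = np.where(is_train, "Train", "Test")

# October-December games belong to the season that ends in the following year.
toy_df["season"] = dates.dt.year + dates.dt.month.isin([10, 11, 12]).astype(int)

# 1 for a win, 0 otherwise.
toy_df["win/loss"] = (toy_df["teamRslt"] == "Win").astype(int)

print(toy_df)
```

On the full data set this avoids iterating over every row in Python, which is usually much slower than column-wise operations.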

Next, we take a subset of the data frame that contains only the team names, our explanatory variables, the season, the type (training vs. testing), and win/loss.

In [None]:
# take an explicit copy so that adding a column does not act on a slice of nba_df
predictor_stats = nba_df[["teamAbbr","teamEDiff","teamASST%","team2P%","team3P%","teamFT%","opptTO%","season","type","pace","win/loss"]].copy()
predictor_stats["Q4Diff"] = nba_df["teamPTS4"] - nba_df["opptPTS4"]
predictor_stats.head()


Unnamed: 0,teamAbbr,teamEDiff,teamASST%,team2P%,team3P%,teamFT%,opptTO%,season,type,pace,win/loss,Q4Diff
0,WAS,-11.2435,81.25,0.4138,0.25,0.6,19.1466,2013,Train,88.9409,0,2
1,CLE,11.2435,61.1111,0.4915,0.35,0.6818,11.6279,2013,Train,88.9409,1,-2
2,BOS,-13.6866,61.5385,0.5323,0.4615,0.8214,7.9145,2013,Train,94.9832,0,4
3,MIA,13.6866,58.1395,0.5556,0.5,0.8125,15.4859,2013,Train,94.9832,1,-4
4,DAL,8.7356,55.0,0.5,0.3333,0.7778,13.3792,2013,Train,91.579,1,0
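
As a small illustration of the *Q4Diff* column, the sketch below rebuilds it for the first game in the data set (WAS at CLE). The fourth-quarter points are read from the *opptPTS4* values of the first two rows shown earlier; the frame `toy_df` and the name `subset` are made up for this example. Taking an explicit `.copy()` of a column subset before adding a derived column is one common way to make the subset an independent frame.

```python
import pandas as pd

# Fourth-quarter points for the first game shown earlier (WAS at CLE).
toy_df = pd.DataFrame({
    "teamAbbr": ["WAS", "CLE"],
    "teamPTS4": [22, 20],
    "opptPTS4": [20, 22],
    "pace": [88.9409, 88.9409],
})

# .copy() makes the subset independent of toy_df before a column is added.
subset = toy_df[["teamAbbr", "pace"]].copy()

# Positive values mean the team outscored its opponent in the fourth quarter.
subset["Q4Diff"] = toy_df["teamPTS4"] - toy_df["opptPTS4"]

print(subset)
```

The two rows of one game are mirror images of each other, so their *Q4Diff* values always sum to zero.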


# Modeling & Predicting Win Rate

Create a function that fits and evaluates our linear regression model for any given season.

In [None]:
def nba_prediction(season):
  """
  Fits a linear regression for a given NBA season and evaluates it, based on a dataframe of team names,
  explanatory variables, game season, the type of the data (training vs. testing), and the win/loss of
  a game represented by 0 or 1.

  Input: season - an integer that represents the season we are interested in

  Output: Prints the coefficients, actual vs. predicted win rates, and MSE of our linear model.

  Note: predictor_stats, the dataframe of NBA stats, should be initialized outside of this function
  """
  # initialize training and testing dataframes for the given season
  season_df = predictor_stats.loc[predictor_stats.season == season]
  season_train = season_df.loc[season_df.type == 'Train'].drop(['season', 'type'], axis = 1)
  season_test = season_df.loc[season_df.type == 'Test'].drop(['season', 'type'], axis = 1)

  # aggregate all the data by team; the mean of the 0/1 win/loss column is the team's win rate
  train_by_team = season_train.groupby("teamAbbr").mean().reset_index()
  test_by_team = season_test.groupby("teamAbbr").mean().reset_index()

  # separate the explanatory variables from the response and the team label
  train_features = train_by_team.drop(['win/loss', 'teamAbbr'], axis = 1)
  test_features = test_by_team.drop(['win/loss', 'teamAbbr'], axis = 1)

  # construct the regression model
  regression_model = LinearRegression().fit(train_features, train_by_team['win/loss'])

  # establish a dataframe of the coefficients
  coeff_df = pd.DataFrame(regression_model.coef_, train_features.columns, columns=['Coefficient'])

  # predict win rates for the testing period
  prediction_win_rate = regression_model.predict(test_features)

  actual_vs_prediction_df = pd.DataFrame({'Team': test_by_team['teamAbbr'], 'Actual': test_by_team['win/loss'], 'Predicted': prediction_win_rate})
  # compute the MSE on win rates expressed as percentages (0-100)
  mse = metrics.mean_squared_error(100 * actual_vs_prediction_df['Actual'], 100 * actual_vs_prediction_df['Predicted'])

  print("===============================================================")
  print("Coefficients")
  print(coeff_df)
  print("===============================================================")
  print("Actual Win Rates vs. Predicted Win Rates for", season)
  print(actual_vs_prediction_df)
  print("===============================================================")
  print("Mean Squared Error for", season)
  print(mse)
  print("===============================================================")
  print()
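
Two steps in the function are easy to miss: averaging a 0/1 win/loss column by team yields that team's win rate, and the error is computed on win rates scaled to percentages. The sketch below shows both on a made-up game log; the team codes `AAA` and `BBB` and the predicted values are hypothetical and serve only to illustrate the arithmetic.

```python
import pandas as pd
from sklearn import metrics

# Made-up game log: one row per team per game, 1 = win, 0 = loss.
games = pd.DataFrame({
    "teamAbbr": ["AAA", "AAA", "AAA", "AAA", "BBB", "BBB", "BBB", "BBB"],
    "win/loss": [1, 1, 1, 0, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the fraction of games won.
win_rate = games.groupby("teamAbbr")["win/loss"].mean()

# Hypothetical predictions, listed in the same team order as win_rate.
predicted = pd.Series({"AAA": 0.70, "BBB": 0.35})

# Scaling both series by 100 multiplies the MSE by 100 ** 2.
mse_fraction = metrics.mean_squared_error(win_rate, predicted)
mse_percent = metrics.mean_squared_error(100 * win_rate, 100 * predicted)

print(win_rate)
print(mse_fraction, mse_percent)
```

Because the MSE is in squared percentage points, taking its square root gives a typical error in percentage points, which is easier to compare with the win rates themselves.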

Now, let's run our model for different years! 

2013

In [None]:
nba_prediction(2013)

Coefficients
           Coefficient
teamEDiff     0.031738
teamASST%     0.000639
team2P%      -0.698221
team3P%       0.490663
teamFT%       0.144119
opptTO%       0.011254
pace          0.000282
Q4Diff       -0.002533
Actual Win Rates vs. Predicted Win Rates for 2013
   Team    Actual  Predicted
0   ATL  0.486486   0.474059
1   BKN  0.611111   0.568465
2   BOS  0.527778   0.512662
3   CHA  0.270270   0.140926
4   CHI  0.459459   0.416607
5   CLE  0.305556   0.360151
6   DAL  0.611111   0.554488
7   DEN  0.800000   0.718602
8   DET  0.333333   0.257211
9    GS  0.500000   0.542244
10  HOU  0.588235   0.644551
11  IND  0.628571   0.717781
12  LAC  0.628571   0.634306
13  LAL  0.694444   0.476001
14  MEM  0.729730   0.690038
15  MIA  0.925000   0.834264
16  MIL  0.368421   0.429023
17  MIN  0.350000   0.415616
18   NO  0.333333   0.356020
19   NY  0.666667   0.647013
20  OKC  0.694444   0.818766
21  ORL  0.162162   0.104104
22  PHI  0.405405   0.395106
23  PHO  0.250000   0.188930
24  P

2014

In [None]:
nba_prediction(2014)

Coefficients
           Coefficient
teamEDiff     0.026151
teamASST%     0.001311
team2P%       0.687363
team3P%      -0.054102
teamFT%      -0.460590
opptTO%      -0.011397
pace         -0.003827
Q4Diff        0.016137
Actual Win Rates vs. Predicted Win Rates for 2014
   Team    Actual  Predicted
0   ATL  0.378378   0.421039
1   BKN  0.631579   0.516782
2   BOS  0.294118   0.316174
3   CHA  0.647059   0.660641
4   CHI  0.675676   0.649819
5   CLE  0.472222   0.541265
6   DAL  0.647059   0.630402
7   DEN  0.378378   0.353339
8   DET  0.297297   0.395507
9    GS  0.647059   0.632697
10  HOU  0.676471   0.706954
11  IND  0.567568   0.483834
12  LAC  0.727273   0.802306
13  LAL  0.314286   0.332542
14  MEM  0.675676   0.632428
15  MIA  0.594595   0.686139
16  MIL  0.194444   0.332465
17  MIN  0.472222   0.438248
18   NO  0.405405   0.387525
19   NY  0.500000   0.521986
20  OKC  0.617647   0.578397
21  ORL  0.294118   0.369240
22  PHI  0.114286   0.128344
23  PHO  0.555556   0.539505
24  P

2015

In [None]:
nba_prediction(2015)

Coefficients
           Coefficient
teamEDiff     0.034482
teamASST%    -0.000107
team2P%      -0.595562
team3P%       0.296064
teamFT%       0.039354
opptTO%       0.015050
pace         -0.006124
Q4Diff       -0.013293
Actual Win Rates vs. Predicted Win Rates for 2015
   Team    Actual  Predicted
0   ATL  0.588235   0.612424
1   BKN  0.555556   0.414196
2   BOS  0.648649   0.578157
3   CHA  0.371429   0.332145
4   CHI  0.606061   0.641167
5   CLE  0.727273   0.771175
6   DAL  0.545455   0.456909
7   DEN  0.323529   0.368671
8   DET  0.411765   0.520009
9    GS  0.810811   0.815639
10  HOU  0.676471   0.568994
11  IND  0.636364   0.634103
12  LAC  0.676471   0.652981
13  LAL  0.228571   0.219509
14  MEM  0.571429   0.549327
15  MIA  0.472222   0.468598
16  MIL  0.457143   0.508445
17  MIN  0.228571   0.211330
18   NO  0.571429   0.510730
19   NY  0.228571   0.126707
20  OKC  0.628571   0.600032
21  ORL  0.312500   0.322628
22  PHI  0.235294   0.322209
23  PHO  0.333333   0.309134
24  P

2016

In [None]:
nba_prediction(2016)

Coefficients
           Coefficient
teamEDiff     0.028942
teamASST%     0.006010
team2P%      -0.487660
team3P%       0.025464
teamFT%      -0.220401
opptTO%      -0.006964
pace         -0.008515
Q4Diff        0.004812
Actual Win Rates vs. Predicted Win Rates for 2016
   Team    Actual  Predicted
0   ATL  0.636364   0.741439
1   BKN  0.264706   0.290374
2   BOS  0.636364   0.530832
3   CHA  0.735294   0.694185
4   CHI  0.444444   0.427764
5   CLE  0.638889   0.687452
6   DAL  0.437500   0.506131
7   DEN  0.441176   0.448546
8   DET  0.558824   0.492917
9    GS  0.852941   0.743395
10  HOU  0.500000   0.540018
11  IND  0.571429   0.499641
12  LAC  0.617647   0.639865
13  LAL  0.250000   0.225915
14  MEM  0.411765   0.381517
15  MIA  0.617647   0.578576
16  MIL  0.393939   0.393578
17  MIN  0.454545   0.425569
18   NO  0.333333   0.311884
19   NY  0.281250   0.385531
20  OKC  0.575758   0.654192
21  ORL  0.388889   0.399297
22  PHI  0.088235   0.201698
23  PHO  0.272727   0.242886
24  P

2017

In [None]:
nba_prediction(2017)

Coefficients
           Coefficient
teamEDiff     0.032068
teamASST%     0.000939
team2P%      -0.680116
team3P%      -0.340541
teamFT%      -0.324938
opptTO%       0.013106
pace          0.005210
Q4Diff        0.001935
Actual Win Rates vs. Predicted Win Rates for 2017
   Team    Actual  Predicted
0   ATL  0.441176   0.480407
1   BKN  0.323529   0.404580
2   BOS  0.676471   0.574603
3   CHA  0.393939   0.442479
4   CHI  0.515152   0.532014
5   CLE  0.542857   0.452063
6   DAL  0.441176   0.400832
7   DEN  0.542857   0.533225
8   DET  0.470588   0.484960
9    GS  0.764706   0.802486
10  HOU  0.633333   0.657320
11  IND  0.485714   0.493957
12  LAC  0.617647   0.623026
13  LAL  0.290323   0.298441
14  MEM  0.437500   0.506713
15  MIA  0.666667   0.700534
16  MIL  0.600000   0.460614
17  MIN  0.352941   0.414403
18   NO  0.454545   0.470733
19   NY  0.312500   0.322583
20  OKC  0.575758   0.558464
21  ORL  0.312500   0.234304
22  PHI  0.285714   0.323888
23  PHO  0.264706   0.366308
24  P

2018

In [None]:
nba_prediction(2018)

Coefficients
           Coefficient
teamEDiff     0.025764
teamASST%    -0.001140
team2P%       2.028516
team3P%       1.061127
teamFT%      -0.796107
opptTO%      -0.028980
pace         -0.014573
Q4Diff        0.017864
Actual Win Rates vs. Predicted Win Rates for 2018
   Team    Actual  Predicted
0   ATL  0.290323   0.267136
1   BKN  0.300000   0.403364
2   BOS  0.600000   0.521343
3   CHA  0.468750   0.539990
4   CHI  0.290323   0.145302
5   CLE  0.625000   0.701829
6   DAL  0.266667   0.377741
7   DEN  0.645161   0.679101
8   DET  0.484848   0.530943
9    GS  0.580645   0.604187
10  HOU  0.878788   0.887407
11  IND  0.633333   0.468233
12  LAC  0.531250   0.638956
13  LAL  0.500000   0.611079
14  MEM  0.125000   0.086809
15  MIA  0.483871   0.567393
16  MIL  0.515152   0.444326
17  MIN  0.535714   0.564973
18   NO  0.656250   0.578687
19   NY  0.200000   0.229744
20  OKC  0.580645   0.575951
21  ORL  0.312500   0.324723
22  PHI  0.823529   0.783846
23  PHO  0.100000   0.062608
24  P
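
To compare coefficients across seasons (the second objective), it helps to collect them into one table rather than read them from separate printouts. In the outputs above, for instance, the *team2P%* coefficient ranges from -0.698221 in 2013 to 2.028516 in 2018, while *teamEDiff* stays between about 0.026 and 0.034. The sketch below shows one way to build such a table. It uses synthetic data, not the NBA data set, and `fit_coefficients` is a hypothetical helper, since `nba_prediction` prints its results instead of returning them.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def fit_coefficients(team_df, feature_cols, target_col):
    """Hypothetical helper: fit a linear model and return its coefficients as a Series."""
    model = LinearRegression().fit(team_df[feature_cols], team_df[target_col])
    return pd.Series(model.coef_, index=feature_cols)

# Synthetic stand-in for per-team training averages; NOT real NBA data.
rng = np.random.default_rng(0)
feature_cols = ["teamEDiff", "team2P%", "pace"]
coeff_by_season = {}

for season in [2013, 2014, 2015]:
    team_df = pd.DataFrame({
        "teamEDiff": rng.normal(0, 5, size=30),
        "team2P%": rng.normal(0.49, 0.02, size=30),
        "pace": rng.normal(95, 3, size=30),
    })
    # Made-up relationship: win rate driven by efficiency differential plus noise.
    team_df["win/loss"] = 0.5 + 0.03 * team_df["teamEDiff"] + rng.normal(0, 0.03, size=30)
    coeff_by_season[season] = fit_coefficients(team_df, feature_cols, "win/loss")

# One row per variable, one column per season.
coeff_table = pd.DataFrame(coeff_by_season)
print(coeff_table.round(3))
```

With only 30 teams per season, coefficients on variables that carry little independent signal can change sign from one fit to the next, so a side-by-side table makes it easier to see which variables are stable.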

# Visualization

# Conclusion


**Overview**

**Limitation**

**Future Studies**

**Learned**