<h1 align='center'>NBA SUPERVISED LEARNING CAPSTONE</h1>
<h2 align='center'>Philip Bowman</h2>

## Part 1: NBA Data Aggregation
1. [NBA Data Aggregation](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/supervised_capstone/Jupyter%20Notebooks/Data_Aggregation.ipynb)*
2. [NBA Data Cleaning and Exploration](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/supervised_capstone/Jupyter%20Notebooks/Data_Cleaning_Exploration.ipynb)
3. [NBA Modeling](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/supervised_capstone/Jupyter%20Notebooks/Modeling.ipynb)
4. [NBA Model Testing](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/supervised_capstone/Jupyter%20Notebooks/Model_Testing.ipynb)

## Purpose:
This notebook combines several NBA datasets into one data file to be worked with in the cleaning/exploration section of the project. All of the datasets were obtained through Kaggle. Game and standings data come from Paul Rossotti's [NBA Enhanced Box Score and Standings (2012 - 2018)](https://www.kaggle.com/pablote/nba-enhanced-stats) data aggregation. The betting data comes from Evan Hallmark's [NBA Historical Stats and Betting Data](https://www.kaggle.com/ehallmar/nba-historical-stats-and-betting-data) aggregation.

In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

Several files are combined here:
- 'C:/Users/philb/NBA_data/nba-enhanced-stats/2012-18_teamBoxScore.csv'
- 'C:/Users/philb/NBA_data/nba-enhanced-stats/2012-18_standings.csv'
- 'C:/Users/philb/NBA_data/nba-historical-stats-and-betting-data/nba_betting_money_line.csv'
- 'C:/Users/philb/NBA_data/nba-historical-stats-and-betting-data/nba_betting_spread.csv'
- 'C:/Users/philb/NBA_data/nba-historical-stats-and-betting-data/nba_betting_totals.csv'
- 'C:/Users/philb/NBA_data/nba-historical-stats-and-betting-data/nba_games_all.csv'
- 'C:/Users/philb/NBA_data/nba-historical-stats-and-betting-data/nba_teams_all.csv'

Below are the stats and standings datasets put into dataframes.

In [2]:
team_df = pd.read_csv('C:/Users/philb/NBA_data/nba-enhanced-stats/2012-18_teamBoxScore.csv')
stand_df = pd.read_csv('C:/Users/philb/NBA_data/nba-enhanced-stats/2012-18_standings.csv')

Then load the betting data, the team key, and the NBA games dataset into dataframes.

In [3]:
line_df = pd.read_csv('C:/Users/philb/NBA_data/nba-historical-stats-and-betting-data/nba_betting_money_line.csv')
spread_df = pd.read_csv('C:/Users/philb/NBA_data/nba-historical-stats-and-betting-data/nba_betting_spread.csv')
ptotals_df = pd.read_csv('C:/Users/philb/NBA_data/nba-historical-stats-and-betting-data/nba_betting_totals.csv')
teamKey_df = pd.read_csv('C:/Users/philb/NBA_data/nba-historical-stats-and-betting-data/nba_teams_all.csv')
gameKey_df = pd.read_csv('C:/Users/philb/NBA_data/nba-historical-stats-and-betting-data/nba_games_all.csv')

# Configure `team_df` for merging

The next step is to get the game data (`team_df`) in a format easy to merge with the other datasets of interest. First up, the standings dataset (`stand_df`).

Create a season filter to easily separate seasons within the game data. No regular season games take place between June 1st and October 1st of any year, so this filter is relatively easy to create.

In [4]:
season_filter = dict()
for year in range(2012, 2018):
    season_filter[str(year)] = pd.to_datetime(f'{year}-10-01'), pd.to_datetime(f'{year+1}-06-01')

Create a function to easily insert seasons into the dataframe, so we can use `.apply(assign_seasID)` in a moment.

In [5]:
def assign_seasID(x):
    # Return the starting year of the season whose date window contains x.
    # Dates that fall outside every window return None, as before.
    for year, (start, end) in season_filter.items():
        if start < x < end:
            return int(year)
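
The same season assignment can also be expressed without a lookup table. The sketch below is only an illustration, not the notebook's method: `season_start_year` and `sample_dates` are hypothetical names, and it assumes every date belongs to some season, so unlike `assign_seasID` it never returns `None` for off-season dates.

```python
import pandas as pd

def season_start_year(dates: pd.Series) -> pd.Series:
    """Map each game date to the calendar year in which its season began."""
    dates = pd.to_datetime(dates)
    # October-December games belong to the season starting that year;
    # earlier months belong to the season that started the previous year.
    return dates.dt.year.where(dates.dt.month >= 10, dates.dt.year - 1)

# Hypothetical sample dates, chosen only to exercise both branches.
sample_dates = pd.Series(['2012-11-02', '2013-04-16', '2017-10-17', '2018-04-11'])
sample_seasons = season_start_year(sample_dates)
print(sample_seasons.tolist())
```

Because this works on the whole column at once, it would be called as `season_start_year(team_df.gmDate)` rather than through `.apply`.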

Create a unique ID for each game played (independent of the team perspective). Change the datatype of gmDate to datetime64. Add season information using the function above.

In [6]:
# Rows sharing a date and the same first two officials belong to the same game.
team_df['gameID'] = team_df.groupby(['gmDate', 'offLNm1', 'offLNm2']).ngroup() + 10000
team_df = team_df.sort_values(by=['teamAbbr', 'gmDate']).reset_index(drop=True)
# pd.to_datetime replaces astype('datetime64'), which recent pandas versions reject.
team_df['gmDate'] = pd.to_datetime(team_df.gmDate)
team_df['seasID'] = team_df.gmDate.apply(assign_seasID)

Counting every game from both teams' perspectives (i.e. double counting games), six full regular seasons of 30 teams playing 82 games each should give the total below.

In [7]:
30*82*6

14760

How many (double counted) games are actually in the dataset?

In [8]:
team_df.shape[0]

14758

At first glance it appears that a game is missing (since one game counted twice would bring us to 14760). However, this is not the case. In 2013, the Boston Marathon Bombing occurred. Due to this tragic event, the game between the Boston Celtics and Indiana Pacers scheduled for April 16, 2013 was canceled, resulting in a season of only 81 games for both the Pacers and Celtics in the 2012-2013 season. This means the dataset is complete and there are no missing games from the 2012-2013 season through the 2017-2018 season.
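
The explanation above can also be checked directly by counting rows per team and season and listing any team that falls short of a full schedule. This is a minimal sketch on a made-up frame: `toy_games` and `expected_games` are hypothetical stand-ins, with 3 games playing the role of the real 82.

```python
import pandas as pd

# The expected row count once the two missing team-rows are accounted for.
expected_rows = 30 * 82 * 6 - 2

# Toy stand-in for team_df: IND has a full 3-game "season", BOS is one short.
toy_games = pd.DataFrame({
    'seasID':   [2012, 2012, 2012, 2012, 2012],
    'teamAbbr': ['BOS', 'BOS', 'IND', 'IND', 'IND'],
})
expected_games = 3  # stand-in for the real 82-game schedule

games_per_team = toy_games.groupby(['seasID', 'teamAbbr']).size()
short_seasons = games_per_team[games_per_team != expected_games]
print(expected_rows)
print(short_seasons)
```

On the real data, the equivalent check with `expected_games = 82` should list only the Celtics and Pacers for 2012 if the explanation above holds.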

Manual reorganization of column order.

In [9]:
cols = ['gameID',
 'seasID',
 'gmDate',
 'gmTime',
 'seasTyp',
 'teamAbbr',
 'teamConf',
 'teamDiv',
 'teamLoc',
 'teamRslt',
 'teamMin',
 'teamDayOff',
 'teamPTS',
 'teamAST',
 'teamTO',
 'teamSTL',
 'teamBLK',
 'teamPF',
 'teamFGA',
 'teamFGM',
 'teamFG%',
 'team2PA',
 'team2PM',
 'team2P%',
 'team3PA',
 'team3PM',
 'team3P%',
 'teamFTA',
 'teamFTM',
 'teamFT%',
 'teamORB',
 'teamDRB',
 'teamTRB',
 'teamPTS1',
 'teamPTS2',
 'teamPTS3',
 'teamPTS4',
 'teamPTS5',
 'teamPTS6',
 'teamPTS7',
 'teamPTS8',
 'teamTREB%',
 'teamASST%',
 'teamTS%',
 'teamEFG%',
 'teamOREB%',
 'teamDREB%',
 'teamTO%',
 'teamSTL%',
 'teamBLK%',
 'teamBLKR',
 'teamPPS',
 'teamFIC',
 'teamFIC40',
 'teamOrtg',
 'teamDrtg',
 'teamEDiff',
 'teamPlay%',
 'teamAR',
 'teamAST/TO',
 'teamSTL/TO',
 'poss',
 'pace',
 'offLNm1',
 'offFNm1',
 'offLNm2',
 'offFNm2',
 'offLNm3',
 'offFNm3']

Take each game (from each team's perspective) and merge it onto itself, then eliminate rows where a team is "playing itself." This gets the data into a team A vs. team B perspective. We are still double counting games here, but that is needed in order to merge future data.

In [10]:
sin_team_df = team_df[cols].copy()

merged_df = pd.merge(sin_team_df, sin_team_df, suffixes=('_A', '_B'), 
         on=['gameID',
             'seasID',
             'gmDate',
             'gmTime',
             'seasTyp',
             'offLNm1',
             'offFNm1',
             'offLNm2',
             'offFNm2',
             'offLNm3',
             'offFNm3'])

games_df = merged_df[merged_df.teamAbbr_A != merged_df.teamAbbr_B].copy()
games_df = games_df.reset_index(drop=True).copy()
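
To make the self-merge easier to follow, here is a small sketch on made-up data. `toy_box` is a hypothetical three-column stand-in for `sin_team_df`, and the merge key is reduced to `gameID` alone for brevity.

```python
import pandas as pd

# Two games, one row per team per game.
toy_box = pd.DataFrame({
    'gameID':   [1, 1, 2, 2],
    'teamAbbr': ['ATL', 'HOU', 'ATL', 'OKC'],
    'teamPTS':  [102, 109, 104, 95],
})

# Each game's 2 rows pair with each other in every combination: 4 rows per game.
paired = toy_box.merge(toy_box, on='gameID', suffixes=('_A', '_B'))

# Dropping the 2 self-pairings per game leaves one row per team perspective.
matchups = paired[paired.teamAbbr_A != paired.teamAbbr_B].reset_index(drop=True)
print(matchups)
```

The row count is unchanged by the round trip (4 rows in, 4 rows out), which matches the check on `games_df.shape` further down.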

Quick look at the current format of the dataset.

In [11]:
games_df.head()

Unnamed: 0,gameID,seasID,gmDate,gmTime,seasTyp,teamAbbr_A,teamConf_A,teamDiv_A,teamLoc_A,teamRslt_A,teamMin_A,teamDayOff_A,teamPTS_A,teamAST_A,teamTO_A,teamSTL_A,teamBLK_A,teamPF_A,teamFGA_A,teamFGM_A,teamFG%_A,team2PA_A,team2PM_A,team2P%_A,team3PA_A,team3PM_A,team3P%_A,teamFTA_A,teamFTM_A,teamFT%_A,teamORB_A,teamDRB_A,teamTRB_A,teamPTS1_A,teamPTS2_A,teamPTS3_A,teamPTS4_A,teamPTS5_A,teamPTS6_A,teamPTS7_A,teamPTS8_A,teamTREB%_A,teamASST%_A,teamTS%_A,teamEFG%_A,teamOREB%_A,teamDREB%_A,teamTO%_A,teamSTL%_A,teamBLK%_A,teamBLKR_A,teamPPS_A,teamFIC_A,teamFIC40_A,teamOrtg_A,teamDrtg_A,teamEDiff_A,teamPlay%_A,teamAR_A,teamAST/TO_A,teamSTL/TO_A,poss_A,pace_A,offLNm1,offFNm1,offLNm2,offFNm2,offLNm3,offFNm3,teamAbbr_B,teamConf_B,teamDiv_B,teamLoc_B,teamRslt_B,teamMin_B,teamDayOff_B,teamPTS_B,teamAST_B,teamTO_B,teamSTL_B,teamBLK_B,teamPF_B,teamFGA_B,teamFGM_B,teamFG%_B,team2PA_B,team2PM_B,team2P%_B,team3PA_B,team3PM_B,team3P%_B,teamFTA_B,teamFTM_B,teamFT%_B,teamORB_B,teamDRB_B,teamTRB_B,teamPTS1_B,teamPTS2_B,teamPTS3_B,teamPTS4_B,teamPTS5_B,teamPTS6_B,teamPTS7_B,teamPTS8_B,teamTREB%_B,teamASST%_B,teamTS%_B,teamEFG%_B,teamOREB%_B,teamDREB%_B,teamTO%_B,teamSTL%_B,teamBLK%_B,teamBLKR_B,teamPPS_B,teamFIC_B,teamFIC40_B,teamOrtg_B,teamDrtg_B,teamEDiff_B,teamPlay%_B,teamAR_B,teamAST/TO_B,teamSTL/TO_B,poss_B,pace_B
0,10020,2012,2012-11-02,19:30,Regular,ATL,East,Southeast,Home,Loss,240,0,102,23,13,12,4,26,85,40,0.4706,63,33,0.5238,22,7,0.3182,17,15,0.8824,7,29,36,21,23,30,28,0,0,0,0,38.2979,57.5,0.5515,0.5118,16.6667,55.7692,12.3246,12.3226,4.1075,6.3492,1.2,73.625,61.3542,104.7423,111.9305,-7.1882,0.4396,17.9016,1.7692,92.3077,97.3819,97.3819,Malloy,Ed,Wright,Sean,Barnaky,Brent,HOU,West,Southwest,Away,Win,240,2,109,22,21,8,2,18,90,38,0.4222,60,30,0.5,30,8,0.2667,29,25,0.8621,23,35,58,28,25,28,28,0,0,0,0,61.7021,57.8947,0.5304,0.4667,44.2308,83.3333,16.9683,8.2151,2.0538,3.3333,1.2111,81.875,68.2292,111.9305,104.7423,7.1882,0.4318,15.0933,1.0476,38.0952,97.3819,97.3819
1,10020,2012,2012-11-02,19:30,Regular,HOU,West,Southwest,Away,Win,240,2,109,22,21,8,2,18,90,38,0.4222,60,30,0.5,30,8,0.2667,29,25,0.8621,23,35,58,28,25,28,28,0,0,0,0,61.7021,57.8947,0.5304,0.4667,44.2308,83.3333,16.9683,8.2151,2.0538,3.3333,1.2111,81.875,68.2292,111.9305,104.7423,7.1882,0.4318,15.0933,1.0476,38.0952,97.3819,97.3819,Malloy,Ed,Wright,Sean,Barnaky,Brent,ATL,East,Southeast,Home,Loss,240,0,102,23,13,12,4,26,85,40,0.4706,63,33,0.5238,22,7,0.3182,17,15,0.8824,7,29,36,21,23,30,28,0,0,0,0,38.2979,57.5,0.5515,0.5118,16.6667,55.7692,12.3246,12.3226,4.1075,6.3492,1.2,73.625,61.3542,104.7423,111.9305,-7.1882,0.4396,17.9016,1.7692,92.3077,97.3819,97.3819
2,10039,2012,2012-11-04,19:00,Regular,ATL,East,Southeast,Away,Win,240,2,104,20,11,12,1,20,83,41,0.494,58,33,0.569,25,8,0.32,20,14,0.7,12,26,38,30,17,28,29,0,0,0,0,50.6667,48.7805,0.5664,0.5422,28.5714,78.7879,10.7004,13.2351,1.1029,1.7241,1.253,77.75,64.7917,114.7038,104.7775,9.9263,0.5,16.2866,1.8182,109.0909,90.6683,90.6683,Wall,Scott,Callahan,Mike,Pantoja,Brenda,OKC,West,Northwest,Home,Loss,240,2,95,27,21,4,9,21,71,33,0.4648,49,24,0.4898,22,9,0.4091,22,20,0.9091,7,30,37,22,29,23,21,0,0,0,0,49.3333,81.8182,0.5887,0.5282,21.2121,71.4286,20.653,4.4117,9.9263,18.3673,1.338,71.5,59.5833,104.7775,114.7038,-9.9263,0.3882,20.9823,1.2857,19.0476,90.6683,90.6683
3,10039,2012,2012-11-04,19:00,Regular,OKC,West,Northwest,Home,Loss,240,2,95,27,21,4,9,21,71,33,0.4648,49,24,0.4898,22,9,0.4091,22,20,0.9091,7,30,37,22,29,23,21,0,0,0,0,49.3333,81.8182,0.5887,0.5282,21.2121,71.4286,20.653,4.4117,9.9263,18.3673,1.338,71.5,59.5833,104.7775,114.7038,-9.9263,0.3882,20.9823,1.2857,19.0476,90.6683,90.6683,Wall,Scott,Callahan,Mike,Pantoja,Brenda,ATL,East,Southeast,Away,Win,240,2,104,20,11,12,1,20,83,41,0.494,58,33,0.569,25,8,0.32,20,14,0.7,12,26,38,30,17,28,29,0,0,0,0,50.6667,48.7805,0.5664,0.5422,28.5714,78.7879,10.7004,13.2351,1.1029,1.7241,1.253,77.75,64.7917,114.7038,104.7775,9.9263,0.5,16.2866,1.8182,109.0909,90.6683,90.6683
4,10057,2012,2012-11-07,19:30,Regular,ATL,East,Southeast,Home,Win,240,3,89,24,17,8,3,15,87,38,0.4368,65,31,0.4769,22,7,0.3182,12,6,0.5,17,34,51,22,29,14,24,0,0,0,0,55.4348,63.1579,0.4822,0.477,36.1702,75.5556,15.5564,8.8711,3.3267,4.6154,1.023,72.25,60.2083,98.6912,95.3645,3.3267,0.4368,18.0072,1.4118,47.0588,90.1803,90.1803,Mauer,Ken,Richardson,Derek,Fitzgerald,Kane,IND,East,Central,Away,Loss,240,2,86,18,15,10,7,14,85,35,0.4118,59,27,0.4576,26,8,0.3077,9,8,0.8889,11,30,41,25,25,27,9,0,0,0,0,44.5652,51.4286,0.4834,0.4588,24.4444,63.8298,14.4286,11.0889,7.7622,11.8644,1.0118,65.375,54.4792,95.3645,98.6912,-3.3267,0.3933,14.7589,1.2,66.6667,90.1803,90.1803


Quick look at the dimensions of the dataset.

In [12]:
games_df.shape

(14758, 127)

There now exists a dataframe (`games_df`) which is a derivation of the original game data (`team_df`). Ultimately, we took singular game data (each row containing one team's box score for a particular game) and turned it into team A vs team B data where each row is a game which contains box scores for both team A and team B.

# Combine `games_df` and `stand_df`

Next on the docket is to actually merge the standings data to the box score data, which was the whole reason the previous section was implemented in the first place.

First, change some variable information for the standings dataset so that it can easily merge with the box score data.

In [13]:
stand_df = stand_df.rename(columns={'stDate': 'gmDate'})
# pd.to_datetime replaces astype('datetime64'), which recent pandas versions reject.
stand_df['gmDate'] = pd.to_datetime(stand_df.gmDate)

Next, make two copies of the standings data (one for team A and one for team B).

In [14]:
stand_df_A = stand_df.add_suffix('_A').rename(columns={'gmDate_A': 'gmDate'}).copy()
stand_df_B = stand_df.add_suffix('_B').rename(columns={'gmDate_B': 'gmDate'}).copy()

Quick look at team A standings data.

In [15]:
stand_df_A.head()

Unnamed: 0,gmDate,teamAbbr_A,rank_A,rankOrd_A,gameWon_A,gameLost_A,stk_A,stkType_A,stkTot_A,gameBack_A,ptsFor_A,ptsAgnst_A,homeWin_A,homeLoss_A,awayWin_A,awayLoss_A,confWin_A,confLoss_A,lastFive_A,lastTen_A,gamePlay_A,ptsScore_A,ptsAllow_A,ptsDiff_A,opptGmPlay_A,opptGmWon_A,opptOpptGmPlay_A,opptOpptGmWon_A,sos_A,rel%Indx_A,mov_A,srs_A,pw%_A,pyth%13.91_A,wpyth13.91_A,lpyth13.91_A,pyth%16.5_A,wpyth16.5_A,lpyth16.5_A
0,2012-10-30,ATL,3,3rd,0,0,-,-,0,0.5,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,82.0,0.0,0.0,82.0
1,2012-10-30,BKN,3,3rd,0,0,-,-,0,0.5,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,82.0,0.0,0.0,82.0
2,2012-10-30,BOS,14,14th,0,1,L1,loss,1,1.0,107,120,0,0,0,1,0,1,0,0,1,107.0,120.0,-13.0,0,0,0,0,0.0,0.0,-13.0,-13.0,0.072,0.1687,13.8334,68.1666,0.131,10.742,71.258
3,2012-10-30,CHA,3,3rd,0,0,-,-,0,0.5,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,82.0,0.0,0.0,82.0
4,2012-10-30,CHI,3,3rd,0,0,-,-,0,0.5,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,82.0,0.0,0.0,82.0


Last, merge standings for team A to the box score data (`games_df`), then merge standings for team B to the box score + team A standings dataframe, resulting in a dataframe with the box score data and standings data all in one (`games_stand_AB`).

In [16]:
games_stand_A = games_df.merge(stand_df_A, how='left', on=['gmDate', 'teamAbbr_A']).copy()
games_stand_AB = games_stand_A.merge(stand_df_B, how='left', on=['gmDate', 'teamAbbr_B']).copy()
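
A left merge like this silently adds rows if the right-hand keys repeat, and silently leaves gaps if a key is missing. The sketch below shows one way to guard against both with the `validate` and `indicator` options of `DataFrame.merge`. The frames `toy_games` and `toy_stand_A` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

toy_games = pd.DataFrame({
    'gmDate': pd.to_datetime(['2012-11-02', '2012-11-04']),
    'teamAbbr_A': ['ATL', 'ATL'],
})
# Standings exist for only one of the two game dates.
toy_stand_A = pd.DataFrame({
    'gmDate': pd.to_datetime(['2012-11-02']),
    'teamAbbr_A': ['ATL'],
    'gameWon_A': [0],
})

checked = toy_games.merge(
    toy_stand_A, how='left', on=['gmDate', 'teamAbbr_A'],
    validate='many_to_one',  # raises MergeError if standings keys repeat
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)
unmatched = int((checked['_merge'] == 'left_only').sum())
print(len(checked), unmatched)
```

Here the row count stays at 2 and one game is reported as having no standings row.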

Quick look at the resulting box score data + standings data (`games_stand_AB`).

In [17]:
games_stand_AB.head()

Unnamed: 0,gameID,seasID,gmDate,gmTime,seasTyp,teamAbbr_A,teamConf_A,teamDiv_A,teamLoc_A,teamRslt_A,teamMin_A,teamDayOff_A,teamPTS_A,teamAST_A,teamTO_A,teamSTL_A,teamBLK_A,teamPF_A,teamFGA_A,teamFGM_A,teamFG%_A,team2PA_A,team2PM_A,team2P%_A,team3PA_A,team3PM_A,team3P%_A,teamFTA_A,teamFTM_A,teamFT%_A,teamORB_A,teamDRB_A,teamTRB_A,teamPTS1_A,teamPTS2_A,teamPTS3_A,teamPTS4_A,teamPTS5_A,teamPTS6_A,teamPTS7_A,teamPTS8_A,teamTREB%_A,teamASST%_A,teamTS%_A,teamEFG%_A,teamOREB%_A,teamDREB%_A,teamTO%_A,teamSTL%_A,teamBLK%_A,teamBLKR_A,teamPPS_A,teamFIC_A,teamFIC40_A,teamOrtg_A,teamDrtg_A,teamEDiff_A,teamPlay%_A,teamAR_A,teamAST/TO_A,teamSTL/TO_A,poss_A,pace_A,offLNm1,offFNm1,offLNm2,offFNm2,offLNm3,offFNm3,teamAbbr_B,teamConf_B,teamDiv_B,teamLoc_B,teamRslt_B,teamMin_B,teamDayOff_B,teamPTS_B,teamAST_B,teamTO_B,teamSTL_B,teamBLK_B,teamPF_B,teamFGA_B,teamFGM_B,teamFG%_B,team2PA_B,team2PM_B,team2P%_B,team3PA_B,team3PM_B,team3P%_B,teamFTA_B,teamFTM_B,teamFT%_B,teamORB_B,teamDRB_B,teamTRB_B,teamPTS1_B,teamPTS2_B,teamPTS3_B,teamPTS4_B,teamPTS5_B,teamPTS6_B,teamPTS7_B,teamPTS8_B,teamTREB%_B,teamASST%_B,teamTS%_B,teamEFG%_B,teamOREB%_B,teamDREB%_B,teamTO%_B,teamSTL%_B,teamBLK%_B,teamBLKR_B,teamPPS_B,teamFIC_B,teamFIC40_B,teamOrtg_B,teamDrtg_B,teamEDiff_B,teamPlay%_B,teamAR_B,teamAST/TO_B,teamSTL/TO_B,poss_B,pace_B,rank_A,rankOrd_A,gameWon_A,gameLost_A,stk_A,stkType_A,stkTot_A,gameBack_A,ptsFor_A,ptsAgnst_A,homeWin_A,homeLoss_A,awayWin_A,awayLoss_A,confWin_A,confLoss_A,lastFive_A,lastTen_A,gamePlay_A,ptsScore_A,ptsAllow_A,ptsDiff_A,opptGmPlay_A,opptGmWon_A,opptOpptGmPlay_A,opptOpptGmWon_A,sos_A,rel%Indx_A,mov_A,srs_A,pw%_A,pyth%13.91_A,wpyth13.91_A,lpyth13.91_A,pyth%16.5_A,wpyth16.5_A,lpyth16.5_A,rank_B,rankOrd_B,gameWon_B,gameLost_B,stk_B,stkType_B,stkTot_B,gameBack_B,ptsFor_B,ptsAgnst_B,homeWin_B,homeLoss_B,awayWin_B,awayLoss_B,confWin_B,confLoss_B,lastFive_B,lastTen_B,gamePlay_B,ptsScore_B,ptsAllow_B,ptsDiff_B,opptGmPlay_B,opptGmWon_B,opptOpptGmPlay_B,opptOpptGmWon_B,sos_B,rel%Indx_B,mov_B,srs_B,pw%_B,pyth%13.91_B,wpyth13.91_B,lpyth13.91_B,pyth%16.5_B,wpyth16.5_B,lpyth16.5_B
0,10020,2012,2012-11-02,19:30,Regular,ATL,East,Southeast,Home,Loss,240,0,102,23,13,12,4,26,85,40,0.4706,63,33,0.5238,22,7,0.3182,17,15,0.8824,7,29,36,21,23,30,28,0,0,0,0,38.2979,57.5,0.5515,0.5118,16.6667,55.7692,12.3246,12.3226,4.1075,6.3492,1.2,73.625,61.3542,104.7423,111.9305,-7.1882,0.4396,17.9016,1.7692,92.3077,97.3819,97.3819,Malloy,Ed,Wright,Sean,Barnaky,Brent,HOU,West,Southwest,Away,Win,240,2,109,22,21,8,2,18,90,38,0.4222,60,30,0.5,30,8,0.2667,29,25,0.8621,23,35,58,28,25,28,28,0,0,0,0,61.7021,57.8947,0.5304,0.4667,44.2308,83.3333,16.9683,8.2151,2.0538,3.3333,1.2111,81.875,68.2292,111.9305,104.7423,7.1882,0.4318,15.0933,1.0476,38.0952,97.3819,97.3819,11.0,11th,0.0,1.0,L1,loss,1.0,1.5,102.0,109.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,102.0,109.0,-7.0,1.0,1.0,2.0,0.0,0.0,0.0,-7.0,-7.0,0.2695,0.2843,23.3126,58.6874,0.2506,20.5492,61.4508,1.0,1st,2.0,0.0,W2,win,2.0,0.0,214.0,198.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,2.0,2.0,107.0,99.0,8.0,1.0,0.0,2.0,1.0,0.1667,0.375,8.0,7.8333,0.7634,0.7467,61.2294,20.7706,0.7828,64.1896,17.8104
1,10020,2012,2012-11-02,19:30,Regular,HOU,West,Southwest,Away,Win,240,2,109,22,21,8,2,18,90,38,0.4222,60,30,0.5,30,8,0.2667,29,25,0.8621,23,35,58,28,25,28,28,0,0,0,0,61.7021,57.8947,0.5304,0.4667,44.2308,83.3333,16.9683,8.2151,2.0538,3.3333,1.2111,81.875,68.2292,111.9305,104.7423,7.1882,0.4318,15.0933,1.0476,38.0952,97.3819,97.3819,Malloy,Ed,Wright,Sean,Barnaky,Brent,ATL,East,Southeast,Home,Loss,240,0,102,23,13,12,4,26,85,40,0.4706,63,33,0.5238,22,7,0.3182,17,15,0.8824,7,29,36,21,23,30,28,0,0,0,0,38.2979,57.5,0.5515,0.5118,16.6667,55.7692,12.3246,12.3226,4.1075,6.3492,1.2,73.625,61.3542,104.7423,111.9305,-7.1882,0.4396,17.9016,1.7692,92.3077,97.3819,97.3819,1.0,1st,2.0,0.0,W2,win,2.0,0.0,214.0,198.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,2.0,2.0,107.0,99.0,8.0,1.0,0.0,2.0,1.0,0.1667,0.375,8.0,7.8333,0.7634,0.7467,61.2294,20.7706,0.7828,64.1896,17.8104,11.0,11th,0.0,1.0,L1,loss,1.0,1.5,102.0,109.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,102.0,109.0,-7.0,1.0,1.0,2.0,0.0,0.0,0.0,-7.0,-7.0,0.2695,0.2843,23.3126,58.6874,0.2506,20.5492,61.4508
2,10039,2012,2012-11-04,19:00,Regular,ATL,East,Southeast,Away,Win,240,2,104,20,11,12,1,20,83,41,0.494,58,33,0.569,25,8,0.32,20,14,0.7,12,26,38,30,17,28,29,0,0,0,0,50.6667,48.7805,0.5664,0.5422,28.5714,78.7879,10.7004,13.2351,1.1029,1.7241,1.253,77.75,64.7917,114.7038,104.7775,9.9263,0.5,16.2866,1.8182,109.0909,90.6683,90.6683,Wall,Scott,Callahan,Mike,Pantoja,Brenda,OKC,West,Northwest,Home,Loss,240,2,95,27,21,4,9,21,71,33,0.4648,49,24,0.4898,22,9,0.4091,22,20,0.9091,7,30,37,22,29,23,21,0,0,0,0,49.3333,81.8182,0.5887,0.5282,21.2121,71.4286,20.653,4.4117,9.9263,18.3673,1.338,71.5,59.5833,104.7775,114.7038,-9.9263,0.3882,20.9823,1.2857,19.0476,90.6683,90.6683,8.0,8th,1.0,1.0,W1,win,1.0,1.0,206.0,204.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,2.0,103.0,102.0,1.0,4.0,2.0,12.0,7.0,0.5278,0.520825,1.0,0.4722,0.5329,0.5339,43.7798,38.2202,0.5402,44.2964,37.7036,10.0,10th,1.0,2.0,L1,loss,1.0,2.0,285.0,282.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,3.0,95.0,94.0,1.0,5.0,4.0,16.0,8.0,0.7,0.608325,1.0,0.3,0.5329,0.5367,44.0094,37.9906,0.5435,44.567,37.433
3,10039,2012,2012-11-04,19:00,Regular,OKC,West,Northwest,Home,Loss,240,2,95,27,21,4,9,21,71,33,0.4648,49,24,0.4898,22,9,0.4091,22,20,0.9091,7,30,37,22,29,23,21,0,0,0,0,49.3333,81.8182,0.5887,0.5282,21.2121,71.4286,20.653,4.4117,9.9263,18.3673,1.338,71.5,59.5833,104.7775,114.7038,-9.9263,0.3882,20.9823,1.2857,19.0476,90.6683,90.6683,Wall,Scott,Callahan,Mike,Pantoja,Brenda,ATL,East,Southeast,Away,Win,240,2,104,20,11,12,1,20,83,41,0.494,58,33,0.569,25,8,0.32,20,14,0.7,12,26,38,30,17,28,29,0,0,0,0,50.6667,48.7805,0.5664,0.5422,28.5714,78.7879,10.7004,13.2351,1.1029,1.7241,1.253,77.75,64.7917,114.7038,104.7775,9.9263,0.5,16.2866,1.8182,109.0909,90.6683,90.6683,10.0,10th,1.0,2.0,L1,loss,1.0,2.0,285.0,282.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,3.0,95.0,94.0,1.0,5.0,4.0,16.0,8.0,0.7,0.608325,1.0,0.3,0.5329,0.5367,44.0094,37.9906,0.5435,44.567,37.433,8.0,8th,1.0,1.0,W1,win,1.0,1.0,206.0,204.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,2.0,103.0,102.0,1.0,4.0,2.0,12.0,7.0,0.5278,0.520825,1.0,0.4722,0.5329,0.5339,43.7798,38.2202,0.5402,44.2964,37.7036
4,10057,2012,2012-11-07,19:30,Regular,ATL,East,Southeast,Home,Win,240,3,89,24,17,8,3,15,87,38,0.4368,65,31,0.4769,22,7,0.3182,12,6,0.5,17,34,51,22,29,14,24,0,0,0,0,55.4348,63.1579,0.4822,0.477,36.1702,75.5556,15.5564,8.8711,3.3267,4.6154,1.023,72.25,60.2083,98.6912,95.3645,3.3267,0.4368,18.0072,1.4118,47.0588,90.1803,90.1803,Mauer,Ken,Richardson,Derek,Fitzgerald,Kane,IND,East,Central,Away,Loss,240,2,86,18,15,10,7,14,85,35,0.4118,59,27,0.4576,26,8,0.3077,9,8,0.8889,11,30,41,25,25,27,9,0,0,0,0,44.5652,51.4286,0.4834,0.4588,24.4444,63.8298,14.4286,11.0889,7.7622,11.8644,1.0118,65.375,54.4792,95.3645,98.6912,-3.3267,0.3933,14.7589,1.2,66.6667,90.1803,90.1803,4.0,4th,2.0,1.0,W2,win,2.0,1.0,295.0,290.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,2.0,3.0,98.3,96.7,1.7,10.0,5.0,46.0,19.0,0.471,0.519925,1.6667,1.1957,0.5549,0.5592,45.8544,36.1456,0.5701,46.7482,35.2518,9.0,9th,2.0,3.0,L2,loss,2.0,2.0,450.0,466.0,1.0,0.0,1.0,3.0,1.0,2.0,2.0,2.0,5.0,90.0,93.2,-3.2,16.0,7.0,70.0,38.0,0.4726,0.454475,-3.2,-3.6726,0.3946,0.3808,31.2256,50.7744,0.3597,29.4954,52.5046


Quick look at the dimensions of the new dataframe.

In [18]:
games_stand_AB.shape

(14758, 201)

We now have a dataframe, `games_stand_AB`, which contains all the statistical data at the game level.

# Create a betting dataframe using `line_df`, `spread_df`, `ptotals_df`

Next up, the data from the betting datasets need to be prepared for merging with the box score + standings data.

The data from each respective betting dataset contains multiple oddsmakers' odds and betting statistics. In order to get a singular statistic for each kind of bet, the average of the oddsmakers' odds will be used.

In [19]:
line_df = line_df.drop(columns='book_id').groupby(['game_id','team_id', 'a_team_id']).mean().reset_index().copy()
spread_df = spread_df.drop(columns='book_id').groupby(['game_id','team_id', 'a_team_id']).mean().reset_index().copy()
ptotals_df = ptotals_df.drop(columns='book_id').groupby(['game_id','team_id', 'a_team_id']).mean().reset_index().copy()
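
As a quick illustration of the averaging step, the sketch below collapses several books' prices for one game into a single consensus row. `toy_lines` and all of its values are invented. Only the column names follow the money line dataset used above.

```python
import pandas as pd

# Three books quote game 1, one book quotes game 2 (all values invented).
toy_lines = pd.DataFrame({
    'game_id':   [1, 1, 1, 2],
    'book_id':   [10, 11, 12, 10],
    'team_id':   [100, 100, 100, 200],
    'a_team_id': [200, 200, 200, 100],
    'price1':    [-150.0, -140.0, -145.0, 120.0],
    'price2':    [130.0, 120.0, 125.0, -140.0],
})

# Drop the book identifier, then average every remaining column per matchup.
consensus = (toy_lines.drop(columns='book_id')
             .groupby(['game_id', 'team_id', 'a_team_id'], as_index=False)
             .mean())
print(consensus)
```

Passing `as_index=False` keeps the grouping keys as ordinary columns, which has the same effect as the `.reset_index()` call used above.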

Rename certain columns from each betting dataset in order to easily merge them all into one.

In [20]:
line_df = line_df.rename(columns={'price1': 'line_price1', 'price2': 'line_price2'}).copy()
spread_df = spread_df.rename(columns={'price1':'spread_price1', 'price2':'spread_price2'}).copy()
ptotals_df = ptotals_df.rename(columns={'price1': 'total_price1', 'price2': 'total_price2'}).copy()

Create a singular betting dataframe, `bets_df`, by outer merging all the above dataframes.

In [21]:
bets_df = line_df.merge(spread_df, how='outer').merge(ptotals_df, how='outer').copy()
bets_df = bets_df.rename(columns={'team_id':'teamid1', 'a_team_id':'teamid2'}).copy()

The betting data is currently in a game-by-game format (each game counted only once). To merge it with the double-counted box score/standings dataset, each team's betting information needs to be put on its own row. Essentially, this is the opposite of what was done in the prior section.

In [22]:
bets_df_1s = bets_df[['game_id',
                     'teamid1',
                     'line_price1',
                     'spread1',
                     'spread_price1',
                     'total1',
                     'total_price1']].copy()
bets_df_2s = bets_df[['game_id',
                     'teamid2',
                     'line_price2',
                     'spread2',
                     'spread_price2',
                     'total2',
                     'total_price2']].copy()

Rename the second dataframe columns so that they are identical to the first. This is done in order to allow these two dataframes to easily be concatenated (one on top of the other). Then of course, concatenate the two dataframes. 

In [23]:
bets_df_2s.columns = bets_df_2s.columns.str.replace('2', '1').copy()
bets = pd.concat([bets_df_1s, bets_df_2s], ignore_index=True).copy()
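
The split-and-stack step can be sketched on a single made-up game. `toy_bets`, `side_1` and `side_2` are hypothetical names. Instead of replacing every '2' in the column names, this version uses an explicit rename mapping, which assumes the column names are known in advance but cannot touch an unrelated digit by accident.

```python
import pandas as pd

# One game in wide form: both teams' spreads sit on the same row.
toy_bets = pd.DataFrame({
    'game_id': [1],
    'teamid1': [100], 'teamid2': [200],
    'spread1': [-4.5], 'spread2': [4.5],
})

side_1 = toy_bets[['game_id', 'teamid1', 'spread1']]
side_2 = toy_bets[['game_id', 'teamid2', 'spread2']].rename(
    columns={'teamid2': 'teamid1', 'spread2': 'spread1'})

# Stack the two sides so each team gets its own row.
long_bets = pd.concat([side_1, side_2], ignore_index=True)
print(long_bets)
```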

`gameKey_df` needs a number of cleanup steps. Looking forward a bit, the ultimate goal is to get this betting data to match up with the game box score and standings data. To make them match up correctly, several team abbreviations in `gameKey_df` need to be put in the same format as the team abbreviations in `games_stand_AB`. The game key is then limited to `season_year` values strictly between 2011 and 2018 and to regular season games, to match the coverage of the box score/standings data (the 2012-2013 through 2017-2018 seasons). After a few variable renames, the betting data can be merged with its gameKey.

In [24]:
# Map gameKey abbreviations to the ones used in the box score data.
abbr_fixes = {'GSW': 'GS', 'NOH': 'NO', 'NOK': 'NO', 'NOP': 'NO', 'NYK': 'NY', 'PHX': 'PHO', 'SAS': 'SA'}
for old_abbr, new_abbr in abbr_fixes.items():
    gameKey_df['matchup'] = gameKey_df.matchup.str.replace(old_abbr, new_abbr, regex=False)
gameKey_df = gameKey_df[(2011 < gameKey_df.season_year) & (gameKey_df.season_year < 2018)].copy()
gameKey_df['teamAbbr1'] = gameKey_df.matchup.str.split().apply(lambda x: x[2]).copy()
gameKey_df['teamAbbr2'] = gameKey_df.matchup.str.split().apply(lambda x: x[0]).copy()
# pd.to_datetime replaces astype('datetime64'), which recent pandas versions reject.
gameKey_df['game_date'] = pd.to_datetime(gameKey_df.game_date)
gameKey_df = gameKey_df.rename(columns={'team_id': 'teamid1', 'a_team_id': 'teamid2'}).copy()
gameKey_df = gameKey_df[gameKey_df.season_type == 'Regular Season'].copy()
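
The abbreviation cleanup and matchup parsing can be sketched as follows. The matchup strings here are an assumption about the format (three whitespace-separated tokens with the team abbreviations first and last), and `toy_key` and `abbr_map` are hypothetical names. This variant maps whole tokens after splitting rather than replacing substrings, so it cannot alter text that merely contains an abbreviation.

```python
import pandas as pd

# Assumed matchup format: '<team> @ <team>' or '<team> vs. <team>'.
toy_key = pd.DataFrame({'matchup': ['GSW @ NYK', 'PHX vs. SAS']})
abbr_map = {'GSW': 'GS', 'NYK': 'NY', 'PHX': 'PHO', 'SAS': 'SA'}

tokens = toy_key.matchup.str.split()
# As in the cell above, the third token becomes team 1 and the first token team 2.
toy_key['teamAbbr1'] = tokens.str[2].map(lambda t: abbr_map.get(t, t))
toy_key['teamAbbr2'] = tokens.str[0].map(lambda t: abbr_map.get(t, t))
print(toy_key)
```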

Merge the betting data to the gameKey.

In [25]:
bets_games = gameKey_df.merge(bets, on=['game_id', 'teamid1']).copy()

The majority of the gameKey will be redundant, except for the new betting data, so we can filter out the majority of the columns. Let's call the new filtered dataframe `red_bets` in the vein of red team vs blue team.

In [26]:
filter_cols = ['game_date',
 'teamAbbr1',
 'line_price1',
 'spread1',
 'spread_price1',
 'total1',
 'total_price1']

red_bets = bets_games[filter_cols].copy()

Now create a team A bet dataframe and a team B bet dataframe. This makes it very easy to finally merge the team A betting data and team B betting data to the box score/standings data.

In [27]:
A_cols = red_bets.columns.str.replace('1', '_A')
B_cols = red_bets.columns.str.replace('1', '_B')

red_bets_A = red_bets.copy()
red_bets_B = red_bets.copy()

red_bets_A.columns = A_cols
red_bets_B.columns = B_cols

Rename game_date to gmDate and prepare for final merge!

In [28]:
red_bets_A = red_bets_A.rename(columns={'game_date': 'gmDate'}).copy()
red_bets_B = red_bets_B.rename(columns={'game_date': 'gmDate'}).copy()

Quick look at one of the above dataframes. The only difference in `red_bets_B` is in the column names, where the '_A' suffix is replaced with '_B'.

In [29]:
red_bets_A.head()

Unnamed: 0,gmDate,teamAbbr_A,line_price_A,spread_A,spread_price_A,total_A,total_price_A
0,2017-02-02,LAC,279.7,7.7,-107.6,229.75,-108.6
1,2017-02-01,LAC,-143.0,-2.55,-107.0,220.1,-110.0
2,2017-01-24,LAC,-190.0,-4.55,-108.1,203.45,-109.6
3,2017-01-23,LAC,262.8,7.45,-86.7,206.5,-108.2
4,2017-01-14,LAC,-804.0,-11.5,-106.5,219.85,-107.5


Final look at the dimensions of the two dataframes to confirm.

In [30]:
red_bets_A.shape, red_bets_B.shape

((14266, 7), (14266, 7))

Quick recap for this section: all the betting data has been averaged and merged for each game and connected to the gameKey. Finally, two copies with different columns (one 'A', one 'B') are created in order to easily merge to the box score/standings dataframe.

# Merge `games_stand_AB` with `red_bets_A` and `red_bets_B`

At long last, the box score/standings dataframe can be merged with the betting data. This is done by merging betting data for team A onto `games_stand_AB` then merging betting data for team B to that resulting dataframe.

In [31]:
game_bet_A = games_stand_AB.merge(red_bets_A, how='left', on=['gmDate', 'teamAbbr_A']).copy()
agg_data = game_bet_A.merge(red_bets_B, how='left', on=['gmDate', 'teamAbbr_B']).copy()

There it is, `agg_data`, containing box score/standings/betting data by game for each regular season game in the NBA from the perspective of each team (each game double counted) from the 2012-2013 to 2017-2018 seasons. It contains 14758 rows with 211 variables for each.

In [32]:
agg_data.shape

(14758, 211)
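
Since the betting frames have fewer rows (14266) than `agg_data` (14758), and a left merge fills unmatched rows with NaN, some games necessarily carry no betting data. A sketch of how that coverage might be measured per season is below. `toy_agg` is a made-up stand-in for `agg_data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in: one 2012 game has no money line after the left merge.
toy_agg = pd.DataFrame({
    'seasID': [2012, 2012, 2013, 2013],
    'line_price_A': [-150.0, np.nan, 120.0, 135.0],
})

# Share of rows per season whose team A money line is missing.
missing_by_season = toy_agg['line_price_A'].isna().groupby(toy_agg['seasID']).mean()
print(missing_by_season)
```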

# Shift standings data
Standings data needs to be shifted, as it currently includes data from the current game (e.g. gameWon includes the game played in the given row). For standings data to be predictive, it should contain only data available leading up to the game rather than the data available at the end of the game. An easy way to fix this is to shift each standings column down one row within each team and season.

In [33]:
shift_cols_A = ['rank_A',
 'rankOrd_A',
 'gameWon_A',
 'gameLost_A',
 'stk_A',
 'stkType_A',
 'stkTot_A',
 'gameBack_A',
 'ptsFor_A',
 'ptsAgnst_A',
 'homeWin_A',
 'homeLoss_A',
 'awayWin_A',
 'awayLoss_A',
 'confWin_A',
 'confLoss_A',
 'gamePlay_A',
 'ptsScore_A',
 'ptsAllow_A',
 'ptsDiff_A',
 'opptGmPlay_A',
 'opptGmWon_A',
 'opptOpptGmPlay_A',
 'opptOpptGmWon_A',
 'sos_A',
 'rel%Indx_A',
 'mov_A',
 'srs_A',
 'pw%_A',
 'pyth%13.91_A',
 'wpyth13.91_A',
 'lpyth13.91_A',
 'pyth%16.5_A',
 'wpyth16.5_A',
 'lpyth16.5_A']
shift_cols_B = [ 'rank_B',
 'rankOrd_B',
 'gameWon_B',
 'gameLost_B',
 'stk_B',
 'stkType_B',
 'stkTot_B',
 'gameBack_B',
 'ptsFor_B',
 'ptsAgnst_B',
 'homeWin_B',
 'homeLoss_B',
 'awayWin_B',
 'awayLoss_B',
 'confWin_B',
 'confLoss_B',
 'gamePlay_B',
 'ptsScore_B',
 'ptsAllow_B',
 'ptsDiff_B',
 'opptGmPlay_B',
 'opptGmWon_B',
 'opptOpptGmPlay_B',
 'opptOpptGmWon_B',
 'sos_B',
 'rel%Indx_B',
 'mov_B',
 'srs_B',
 'pw%_B',
 'pyth%13.91_B',
 'wpyth13.91_B',
 'lpyth13.91_B',
 'pyth%16.5_B',
 'wpyth16.5_B',
 'lpyth16.5_B']

Gather only the variables that we want to shift into their own dataframes.

In [34]:
shift_A_df = agg_data[['seasID', 'gmDate', 'teamAbbr_A'] + shift_cols_A].sort_values(by=['teamAbbr_A', 'gmDate']).copy()
shift_B_df = agg_data[['seasID', 'gmDate', 'teamAbbr_B'] + shift_cols_B].sort_values(by=['teamAbbr_B', 'gmDate']).copy()

Set up a function that will go through a dataframe, filter it by team and season, then shift the desired columns for each team and season, then return all the dataframes concatenated together (while maintaining the identifiers for each row so it can be merged to the original dataframe).

In [35]:
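def shift_data(df, team_abbr, shift_cols):
    # Shift the standings columns down one row within each team-season, so each
    # row holds the standings recorded at that team's previous game. The first
    # game of each team-season is left with missing values.
    new_dfs = []
    for team in df[team_abbr].unique():
        for season in df[df[team_abbr] == team].seasID.unique():
            temp_df = df[(df[team_abbr] == team) & (df.seasID == season)]
            # Named `shifted` so it no longer shadows the function name.
            shifted = temp_df[shift_cols].shift()
            new_data = temp_df.drop(columns=shift_cols).join(shifted)
            new_dfs.append(new_data)
    return pd.concat(new_dfs)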

Run the shifting function for team A and B and get the shifted dataframes.

In [36]:
shift_A_df = shift_data(shift_A_df, 'teamAbbr_A', shift_cols_A).copy()
shift_B_df = shift_data(shift_B_df, 'teamAbbr_B', shift_cols_B).copy()
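
For comparison, the same per-team, per-season shift can be sketched with `groupby(...).shift()`. This is an alternative illustration on made-up data, not the function used in this notebook, and `toy_stand` is a hypothetical frame.

```python
import pandas as pd

# One team across two seasons. gameWon_A includes the game played in each row.
toy_stand = pd.DataFrame({
    'seasID':     [2012, 2012, 2012, 2013, 2013],
    'teamAbbr_A': ['ATL'] * 5,
    'gmDate':     pd.to_datetime(['2012-11-02', '2012-11-04', '2012-11-07',
                                  '2013-10-30', '2013-11-01']),
    'gameWon_A':  [0, 1, 2, 1, 1],
})

toy_stand = toy_stand.sort_values(['teamAbbr_A', 'gmDate'])
# Shift within each team-season so no value leaks across a season boundary.
toy_stand['gameWon_A'] = toy_stand.groupby(['teamAbbr_A', 'seasID'])['gameWon_A'].shift()
print(toy_stand)
```

The first game of each season ends up with a missing value, since no earlier standings exist for that team in that season.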

Take the resulting shifted dataframes, drop the shifted columns from the main `agg_data` dataframe, then merge those columns back into `agg_data`.

In [37]:
agg_data = agg_data.drop(columns=shift_cols_A).merge(shift_A_df).drop(columns=shift_cols_B).merge(shift_B_df).copy()
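
The merges above do not pass `on=`, so pandas joins on every column name the two dataframes share. A minimal sketch, on hypothetical toy data, of naming the keys explicitly and asking pandas to verify that the merge is one-to-one (the frames `games` and `shifted_A` and the list `keys_A` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical stand-ins for agg_data (minus the shifted columns) and shift_A_df.
games = pd.DataFrame({
    'seasID': [2012, 2012],
    'gmDate': ['2012-10-30', '2012-11-02'],
    'teamAbbr_A': ['BOS', 'BOS'],
    'teamPTS_A': [107, 88],
})
shifted_A = pd.DataFrame({
    'seasID': [2012, 2012],
    'gmDate': ['2012-10-30', '2012-11-02'],
    'teamAbbr_A': ['BOS', 'BOS'],
    'gameWon_A': [float('nan'), 0.0],
})

keys_A = ['seasID', 'gmDate', 'teamAbbr_A']
# validate='one_to_one' raises a MergeError if the keys are duplicated on either side.
merged = games.merge(shifted_A, on=keys_A, how='left', validate='one_to_one')
print(merged)
```

Comparing `len(merged)` with `len(games)` afterwards is a cheap extra check that no rows were duplicated or lost.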

Define the desired column order manually.

In [38]:
col_order = ['gameID',
 'seasID',
 'gmDate',
 'gmTime',
 'seasTyp',
 'teamAbbr_A',
 'teamConf_A',
 'teamDiv_A',
 'teamLoc_A',
 'teamRslt_A',
 'teamMin_A',
 'teamDayOff_A',
 'teamPTS_A',
 'teamAST_A',
 'teamTO_A',
 'teamSTL_A',
 'teamBLK_A',
 'teamPF_A',
 'teamFGA_A',
 'teamFGM_A',
 'teamFG%_A',
 'team2PA_A',
 'team2PM_A',
 'team2P%_A',
 'team3PA_A',
 'team3PM_A',
 'team3P%_A',
 'teamFTA_A',
 'teamFTM_A',
 'teamFT%_A',
 'teamORB_A',
 'teamDRB_A',
 'teamTRB_A',
 'teamPTS1_A',
 'teamPTS2_A',
 'teamPTS3_A',
 'teamPTS4_A',
 'teamPTS5_A',
 'teamPTS6_A',
 'teamPTS7_A',
 'teamPTS8_A',
 'teamTREB%_A',
 'teamASST%_A',
 'teamTS%_A',
 'teamEFG%_A',
 'teamOREB%_A',
 'teamDREB%_A',
 'teamTO%_A',
 'teamSTL%_A',
 'teamBLK%_A',
 'teamBLKR_A',
 'teamPPS_A',
 'teamFIC_A',
 'teamFIC40_A',
 'teamOrtg_A',
 'teamDrtg_A',
 'teamEDiff_A',
 'teamPlay%_A',
 'teamAR_A',
 'teamAST/TO_A',
 'teamSTL/TO_A',
 'poss_A',
 'pace_A',
 'offLNm1',
 'offFNm1',
 'offLNm2',
 'offFNm2',
 'offLNm3',
 'offFNm3',
 'teamAbbr_B',
 'teamConf_B',
 'teamDiv_B',
 'teamLoc_B',
 'teamRslt_B',
 'teamMin_B',
 'teamDayOff_B',
 'teamPTS_B',
 'teamAST_B',
 'teamTO_B',
 'teamSTL_B',
 'teamBLK_B',
 'teamPF_B',
 'teamFGA_B',
 'teamFGM_B',
 'teamFG%_B',
 'team2PA_B',
 'team2PM_B',
 'team2P%_B',
 'team3PA_B',
 'team3PM_B',
 'team3P%_B',
 'teamFTA_B',
 'teamFTM_B',
 'teamFT%_B',
 'teamORB_B',
 'teamDRB_B',
 'teamTRB_B',
 'teamPTS1_B',
 'teamPTS2_B',
 'teamPTS3_B',
 'teamPTS4_B',
 'teamPTS5_B',
 'teamPTS6_B',
 'teamPTS7_B',
 'teamPTS8_B',
 'teamTREB%_B',
 'teamASST%_B',
 'teamTS%_B',
 'teamEFG%_B',
 'teamOREB%_B',
 'teamDREB%_B',
 'teamTO%_B',
 'teamSTL%_B',
 'teamBLK%_B',
 'teamBLKR_B',
 'teamPPS_B',
 'teamFIC_B',
 'teamFIC40_B',
 'teamOrtg_B',
 'teamDrtg_B',
 'teamEDiff_B',
 'teamPlay%_B',
 'teamAR_B',
 'teamAST/TO_B',
 'teamSTL/TO_B',
 'poss_B',
 'pace_B',
 'rank_A',
 'rankOrd_A',
 'gameWon_A',
 'gameLost_A',
 'stk_A',
 'stkType_A',
 'stkTot_A',
 'gameBack_A',
 'ptsFor_A',
 'ptsAgnst_A',
 'homeWin_A',
 'homeLoss_A',
 'awayWin_A',
 'awayLoss_A',
 'confWin_A',
 'confLoss_A',
 'lastFive_A',
 'lastTen_A',
 'gamePlay_A',
 'ptsScore_A',
 'ptsAllow_A',
 'ptsDiff_A',
 'opptGmPlay_A',
 'opptGmWon_A',
 'opptOpptGmPlay_A',
 'opptOpptGmWon_A',
 'sos_A',
 'rel%Indx_A',
 'mov_A',
 'srs_A',
 'pw%_A',
 'pyth%13.91_A',
 'wpyth13.91_A',
 'lpyth13.91_A',
 'pyth%16.5_A',
 'wpyth16.5_A',
 'lpyth16.5_A',
 'rank_B',
 'rankOrd_B',
 'gameWon_B',
 'gameLost_B',
 'stk_B',
 'stkType_B',
 'stkTot_B',
 'gameBack_B',
 'ptsFor_B',
 'ptsAgnst_B',
 'homeWin_B',
 'homeLoss_B',
 'awayWin_B',
 'awayLoss_B',
 'confWin_B',
 'confLoss_B',
 'lastFive_B',
 'lastTen_B',
 'gamePlay_B',
 'ptsScore_B',
 'ptsAllow_B',
 'ptsDiff_B',
 'opptGmPlay_B',
 'opptGmWon_B',
 'opptOpptGmPlay_B',
 'opptOpptGmWon_B',
 'sos_B',
 'rel%Indx_B',
 'mov_B',
 'srs_B',
 'pw%_B',
 'pyth%13.91_B',
 'wpyth13.91_B',
 'lpyth13.91_B',
 'pyth%16.5_B',
 'wpyth16.5_B',
 'lpyth16.5_B',
 'line_price_A',
 'spread_A',
 'spread_price_A',
 'total_A',
 'total_price_A',
 'line_price_B',
 'spread_B',
 'spread_price_B',
 'total_B',
 'total_price_B']

Apply the column order defined in the `col_order` variable above.

In [39]:
agg_data = agg_data[col_order].copy()
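
Selecting with a list of names raises a `KeyError` if any listed column is absent, and silently leaves out any column that is not listed. A small sketch, on a hypothetical toy dataframe, of checking both cases before reordering (`df`, `order`, `missing`, and `dropped` are illustrative names):

```python
import pandas as pd

# Hypothetical toy frame and desired order.
df = pd.DataFrame({'b': [1], 'a': [2], 'c': [3]})
order = ['a', 'b']

# Names in the order list that the frame does not have (would raise KeyError).
missing = [c for c in order if c not in df.columns]
# Columns in the frame that the order list would leave out.
dropped = [c for c in df.columns if c not in order]

print('missing:', missing)
print('dropped:', dropped)

reordered = df[order]
```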

Export the aggregated data to a CSV file.

In [40]:
agg_data.to_csv('C:/Users/philb/Google Drive/Thinkful/Thinkful_repo/projects/supervised_capstone/Export Data/nba_2012-2018_reg_data.csv', index=False)
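
A quick way to sanity-check an export is to read the file back and compare its shape and column names with the dataframe that was written. The sketch below uses a temporary directory and a hypothetical file name and toy data, so it does not touch the real export path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy frame standing in for agg_data.
df = pd.DataFrame({'gameID': [1, 2], 'gmDate': ['2012-10-30', '2012-11-02']})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'nba_sample.csv'  # hypothetical file name
    df.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == df.shape
same_cols = list(reloaded.columns) == list(df.columns)
print(same_shape, same_cols)
```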

# Quick Recap

Several NBA datasets have been aggregated into a single dataframe, which has been exported as a CSV file for use in the next stages of the project.

Next step: [NBA Data Cleaning and Exploration](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/supervised_capstone/Jupyter%20Notebooks/Data_Cleaning_Exploration.ipynb).

For navigational convenience:
1. [NBA Data Aggregation](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/supervised_capstone/Jupyter%20Notebooks/Data_Aggregation.ipynb)*
2. [NBA Data Cleaning and Exploration](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/supervised_capstone/Jupyter%20Notebooks/Data_Cleaning_Exploration.ipynb)
3. [NBA Modeling](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/supervised_capstone/Jupyter%20Notebooks/Modeling.ipynb)
4. [NBA Model Testing](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/supervised_capstone/Jupyter%20Notebooks/Model_Testing.ipynb)