<h1 align='center'>NBA SUPERVISED LEARNING CAPSTONE</h1>
<h2 align='center'>Philip Bowman</h2>

## Part 4: NBA Model Testing
1. [NBA Data Aggregation](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/supervised_capstone/Jupyter%20Notebooks/Data_Aggregation.ipynb)
2. [NBA Data Cleaning and Exploration](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/supervised_capstone/Jupyter%20Notebooks/Data_Cleaning_Exploration.ipynb)
3. [NBA Modeling](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/supervised_capstone/Jupyter%20Notebooks/Modeling.ipynb)
4. [NBA Model Testing](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/supervised_capstone/Jupyter%20Notebooks/Model_Testing.ipynb)*

## Purpose:

This notebook tests the model selected in the prior notebook by using it to make predictions on games from the current NBA season (2019-2020). The ultimate goal is to see whether the model is viable on current-season games.

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 500)

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# How?

1. Retrain selected model using all the data from the 2012-2018 NBA seasons
2. Reobserve the features decided on in the data cleaning and exploration phase
3. Collect relevant data for current NBA season (2019-2020)
4. Manipulate data so it fits the format of the model
5. Execute and check performance of model

# Retraining selected model

Import data.

In [2]:
df = pd.read_csv('C:/Users/philb/Google Drive/Thinkful/Thinkful_repo/projects/supervised_capstone/Export Data/target_features.csv', index_col=0)

Instantiate scaler for standardization and instantiate selected model.

In [3]:
scaler = StandardScaler()
model = LogisticRegression(C=0.4, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)

Split target and features.

In [4]:
target = df['teamWin_A']
features = df.drop(columns='teamWin_A').copy()

Standardize features (except for `teamHome_A` since it's an indicator).

In [5]:
s_features = pd.DataFrame(scaler.fit_transform(features.iloc[:, 1:]), columns=features.columns[1:], index=features.index).join(features.iloc[:, 0]).copy()

*Note: once the data from the current NBA season is collected, it will have to be transformed using this same fitted scaler.*
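The note above can be illustrated with a small sketch. The toy frames, the column names (`home`, `x1`, `x2`), and the helper `scale_keep_indicator` are hypothetical and are not part of the notebook; the point is only that the scaler is fit once on the training data and then reused, unchanged, on the new data, while the indicator column is left unscaled.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: an indicator column first, then numeric columns.
train = pd.DataFrame({"home": [1, 0, 1, 0],
                      "x1": [1.0, 2.0, 3.0, 4.0],
                      "x2": [10.0, 20.0, 30.0, 40.0]})
new = pd.DataFrame({"home": [0, 1],
                    "x1": [2.5, 5.0],
                    "x2": [25.0, 10.0]})

# Fit on the training numeric columns only (everything except the indicator).
scaler_demo = StandardScaler().fit(train.iloc[:, 1:])

def scale_keep_indicator(frame, fitted_scaler):
    """Scale all but the first column, then re-attach the first column as-is."""
    scaled = pd.DataFrame(fitted_scaler.transform(frame.iloc[:, 1:]),
                          columns=frame.columns[1:], index=frame.index)
    return scaled.join(frame.iloc[:, 0])

s_train = scale_keep_indicator(train, scaler_demo)
# New data is transformed with the training means and variances, never refit.
s_new = scale_keep_indicator(new, scaler_demo)
```

Because the new data is only transformed, a value equal to the training mean (such as `x1 = 2.5` here) maps to zero regardless of how the new season's values are distributed.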

Retrain the model with all the data.

In [6]:
model.fit(s_features, target)

LogisticRegression(C=0.4, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)

# Reobserve selected features

In [7]:
features.columns

Index(['teamHome_A', 'gameBack_A', 'ptsAllow_A', 'teamPTS1_A', 'teamDrtg_A',
       'team3PM_Diff_A', 'teamDRB_Diff_A', 'teamPTS1_Diff_A',
       'teamPTS2_Diff_A', 'teamPTS3_Diff_A', 'teamBLKR_Diff_A',
       'teamAST/TO_Diff_A', 'lastFiveWin%_B', 'teamDrtg_B', 'team3PM_Diff_B',
       'teamPTS1_Diff_B', 'teamPTS3_Diff_B', 'teamBLKR_Diff_B',
       'teamAST/TO_Diff_B', 'ptsAgnst_Diff', 'ptsScore_Diff', 'ptsAllow_Diff',
       'spread_Diff', 'lastFiveWin%_Diff'],
      dtype='object')

## Feature Definitions:
- teamHome_A: Indicates whether team A is home or away (1 is home, 0 is away)
- gameBack_A: The number of games team A is behind the leader of their conference prior to the game
- ptsAllow_A: The average points scored against team A per game of all games in the season prior to the game
- teamPTS1_A: The average of the points scored in the first quarter by team A over games up to the prior 9 games
- teamDrtg_A: The average of the defensive ratings of team A over games up to the prior 9 games
- team3PM_Diff_A: The average of the differences of three point shots made by team A less their opponent over games up to the prior 9 games
- teamDRB_Diff_A: The average of the differences of defensive rebounds of team A less their opponent over games up to the prior 9 games
- teamPTS1_Diff_A: The average of the differences of points scored by team A less their opponent in the first quarter over games up to the prior 9 games
- teamPTS2_Diff_A: The average of the differences of points scored by team A less their opponent in the second quarter over games up to the prior 9 games
- teamPTS3_Diff_A: The average of the differences of points scored by team A less their opponent in the third quarter over games up to the prior 9 games
- teamBLKR_Diff_A: The average of the differences of block rate for team A less their opponent over games up to the prior 9 games
- teamAST/TO_Diff_A: The average of the differences of team assist to turnover ratio for team A less their opponent over games up to the prior 9 games
- lastFiveWin%_B: The win percentage for team B over the prior five games
- teamDrtg_B: The average of the defensive ratings of team B over games up to the prior 9 games
- team3PM_Diff_B: The average of the differences of three point shots made by team B less their opponent over games up to the prior 9 games
- teamPTS1_Diff_B: The average of the differences of points scored by team B less their opponent in the first quarter over games up to the prior 9 games
- teamPTS3_Diff_B: The average of the differences of points scored by team B less their opponent in the third quarter over games up to the prior 9 games
- teamBLKR_Diff_B: The average of the differences of block rate for team B less their opponent over games up to the prior 9 games
- teamAST/TO_Diff_B: The average of the differences of team assist to turnover ratio for team B less their opponent over games up to the prior 9 games
- ptsAgnst_Diff: The difference of the total accumulated points scored against team A up to the date of the game in the season less the total accumulated points scored against team B up to the date of the game in the season
- ptsScore_Diff: The difference of the average points scored by team A per game of all games in the season up to the point of this game less the average points scored by team B per game of all games in the season up to the point of this game
- ptsAllow_Diff: The difference of the average points scored against team A per game of all games in the season up to the point of this game less the average points scored against team B per game of all games in the season up to the point of this game
- spread_Diff: The difference of the average betting spread for team A less the average betting spread for team B
- lastFiveWin%_Diff: The difference between the win percentage for team A over the prior five games and the win percentage for team B over the prior five games

In [8]:
features.columns

Index(['teamHome_A', 'gameBack_A', 'ptsAllow_A', 'teamPTS1_A', 'teamDrtg_A',
       'team3PM_Diff_A', 'teamDRB_Diff_A', 'teamPTS1_Diff_A',
       'teamPTS2_Diff_A', 'teamPTS3_Diff_A', 'teamBLKR_Diff_A',
       'teamAST/TO_Diff_A', 'lastFiveWin%_B', 'teamDrtg_B', 'team3PM_Diff_B',
       'teamPTS1_Diff_B', 'teamPTS3_Diff_B', 'teamBLKR_Diff_B',
       'teamAST/TO_Diff_B', 'ptsAgnst_Diff', 'ptsScore_Diff', 'ptsAllow_Diff',
       'spread_Diff', 'lastFiveWin%_Diff'],
      dtype='object')

## NBA data needed for creation of above features
2019-2020 NBA Data for (parentheses indicate need/feature relevant to data):
1. Game date and team names for each game (for organization and sorting)
2. Home/away information for each team for each game (teamHome_A)
3. Game back information for each team for every date of the 2019-2020 NBA season (gameBack_A)
4. Total points scored for each team in each game (ptsAllow_A, lastFiveWin%_B, ptsAgnst_Diff, ptsScore_Diff, ptsAllow_Diff, lastFiveWin%_Diff)
5. The defensive ratings for each team in each game (teamDrtg_A & teamDrtg_B)
6. The number of three point makes for each team in each game (team3PM_Diff_A & team3PM_Diff_B)
7. The number of defensive rebounds for each team in each game (teamDRB_Diff_A)
8. The points scored in the first quarter of each game by each team (teamPTS1_Diff_A & teamPTS1_Diff_B)
9. The points scored in the second quarter of each game by each team (teamPTS2_Diff_A)
10. The points scored in the third quarter of each game by each team (teamPTS3_Diff_A & teamPTS3_Diff_B)
11. The block rate for each team in each game OR the number of blocks and number of two point attempts for each team in each game (teamBLKR_Diff_A & teamBLKR_Diff_B)
12. The number of assists and turnovers for each team in each game (teamAST/TO_Diff_A & teamAST/TO_Diff_B)
13. Betting spread information for each game (spread_Diff)

# Collect relevant data

The next goal is to collect the above relevant data. In this case there is an API available on GitHub which uses endpoints from the [official NBA stats website](https://stats.nba.com/) appropriately named [nba_api](https://github.com/swar/nba_api). Since betting data is not available through the API, [OddsShark](https://www.oddsshark.com/nba/database) will be used to aggregate the spread data.

The nba_api was used to obtain all the relevant game data that can be used to manipulate it into the form needed for model testing.
The code to collect that data can be viewed [here (2019-2020_data_agg)](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/supervised_capstone/Jupyter%20Notebooks/2019-2020_data_agg.ipynb). The aforementioned file also brings the spread data together with the game data. The spread data was obtained from OddsShark.com and converted to an appropriate format using the [spread_convert](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/supervised_capstone/Jupyter%20Notebooks/spread_convert.ipynb) notebook.

The result is a CSV file that contains all the raw elements needed to derive the above features.

Import data.

In [9]:
seas_data = pd.read_csv('C:/Users/philb/Google Drive/Thinkful/Thinkful_repo/projects/supervised_capstone/Export Data/2019-2020_data.csv', index_col=0)

Look at dimensions of the data.

In [10]:
seas_data.shape

(1076, 32)

This contains 32 variables that can be used to derive all the features needed to test the model. The 1076 rows cover all NBA games from October 22, 2019 (the start of the regular season) to nearly present day (January 5, 2020). Each game appears twice, once from each team's perspective, so the data holds information for 538 games in the 2019-2020 NBA season.

In [11]:
seas_data.head()

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,PTS,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS,PTS_QTR1,PTS_QTR2,PTS_QTR3,Spread
0,22019,1610612737,ATL,Atlanta Hawks,21900526,2020-01-04,ATL vs. IND,W,240,116,42,82,0.512,10,32,0.313,22,24,0.917,4,36,40,22,8,4,13,28,5.0,43,21,32,7.5
1,22019,1610612737,ATL,Atlanta Hawks,21900517,2020-01-03,ATL @ BOS,L,239,106,39,93,0.419,16,45,0.356,12,15,0.8,6,36,42,26,6,5,14,24,-3.0,32,23,25,10.5
2,22019,1610612737,ATL,Atlanta Hawks,21900491,2019-12-30,ATL @ ORL,W,240,101,39,81,0.481,9,29,0.31,14,17,0.824,11,41,52,21,8,4,20,20,8.0,25,22,27,10.5
3,22019,1610612737,ATL,Atlanta Hawks,21900477,2019-12-28,ATL @ CHI,L,239,81,32,86,0.372,9,34,0.265,8,11,0.727,9,30,39,24,8,5,19,16,-35.0,19,24,22,9.5
4,22019,1610612737,ATL,Atlanta Hawks,21900469,2019-12-27,ATL vs. MIL,L,239,86,33,91,0.363,12,41,0.293,8,14,0.571,8,38,46,20,10,8,18,18,-26.0,19,21,23,7.0


# Manipulate data to fit model specifications

Essentially, the aggregated data from the prior step needs to be converted to fit the specifications of the model. The first step in doing that is to create the appropriate features from the given data.

First off, let's limit things to just the columns essential to deriving the proper features.

In [12]:
seas_data.columns

Index(['SEASON_ID', 'TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID',
       'GAME_DATE', 'MATCHUP', 'WL', 'MIN', 'PTS', 'FGM', 'FGA', 'FG_PCT',
       'FG3M', 'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB',
       'REB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PLUS_MINUS', 'PTS_QTR1',
       'PTS_QTR2', 'PTS_QTR3', 'Spread'],
      dtype='object')

For the prior list of 13 needed data items (parentheses indicate the variables from above needed for each derivation):
1. Date & Names (`TEAM_ABBREVIATION`, `GAME_ID`, `GAME_DATE`, `MATCHUP`)
2. Home/Away (`MATCHUP`)
3. Games back (`WL`)
4. Points scored (`PTS`)
5. Defensive Rating (`FGA`, `OREB`, `DREB`, `FGM`, `TOV`, `FTA`, `PTS`)
6. Three pointers made (`FG3M`)
7. Defensive rebounds (`DREB`)
8. First quarter points (`PTS_QTR1`)
9. Second quarter points (`PTS_QTR2`)
10. Third quarter points (`PTS_QTR3`)
11. Block rate (`BLK`, `FGA`, `FG3A`)
12. Assists and turnovers (`AST`, `TOV`)
13. Spread (`Spread`)

So the required variables needed are `TEAM_ABBREVIATION`, `GAME_ID`, `GAME_DATE`, `MATCHUP`, `WL`, `PTS`, `FGA`, `OREB`, `DREB`, `FGM`, `TOV`, `FTA`, `FG3M`, `PTS_QTR1`, `PTS_QTR2`, `PTS_QTR3`, `BLK`, `FG3A`, `AST`, and `Spread`.

Make a list of the above variables.

In [13]:
needed_vars = ['GAME_ID', 'GAME_DATE', 'TEAM_ABBREVIATION', 'MATCHUP', 'WL', 'PTS', 'FGA', 'OREB', 'DREB', 'FGM', 'TOV', 'FTA',
               'FG3M', 'PTS_QTR1', 'PTS_QTR2', 'PTS_QTR3', 'BLK', 'FG3A', 'AST', 'Spread']

Get the data from `seas_data` containing only the variables from `needed_vars` above and place it in a dataframe called `ess_vars`.

In [14]:
ess_vars = seas_data[needed_vars].copy()

Take a look at the top of `ess_vars` to get an idea of its layout.

In [15]:
ess_vars.head()

Unnamed: 0,GAME_ID,GAME_DATE,TEAM_ABBREVIATION,MATCHUP,WL,PTS,FGA,OREB,DREB,FGM,TOV,FTA,FG3M,PTS_QTR1,PTS_QTR2,PTS_QTR3,BLK,FG3A,AST,Spread
0,21900526,2020-01-04,ATL,ATL vs. IND,W,116,82,4,36,42,13,24,10,43,21,32,4,32,22,7.5
1,21900517,2020-01-03,ATL,ATL @ BOS,L,106,93,6,36,39,14,15,16,32,23,25,5,45,26,10.5
2,21900491,2019-12-30,ATL,ATL @ ORL,W,101,81,11,41,39,20,17,9,25,22,27,4,29,21,10.5
3,21900477,2019-12-28,ATL,ATL @ CHI,L,81,86,9,30,32,19,11,9,19,24,22,5,34,24,9.5
4,21900469,2019-12-27,ATL,ATL vs. MIL,L,86,91,8,38,33,18,14,12,19,21,23,8,41,20,7.0


Reminder of the ultimate features needed.

In [16]:
features.columns

Index(['teamHome_A', 'gameBack_A', 'ptsAllow_A', 'teamPTS1_A', 'teamDrtg_A',
       'team3PM_Diff_A', 'teamDRB_Diff_A', 'teamPTS1_Diff_A',
       'teamPTS2_Diff_A', 'teamPTS3_Diff_A', 'teamBLKR_Diff_A',
       'teamAST/TO_Diff_A', 'lastFiveWin%_B', 'teamDrtg_B', 'team3PM_Diff_B',
       'teamPTS1_Diff_B', 'teamPTS3_Diff_B', 'teamBLKR_Diff_B',
       'teamAST/TO_Diff_B', 'ptsAgnst_Diff', 'ptsScore_Diff', 'ptsAllow_Diff',
       'spread_Diff', 'lastFiveWin%_Diff'],
      dtype='object')

The most difficult feature to derive is probably the `gameBack` variable, so we'll start with that. This variable is the number of games back a team is from the leader in their conference, measured in terms of team differential (wins minus losses). To calculate it for a given team on a particular date, find the greatest team differential on that date in the appropriate conference, subtract the given team's differential from it, and divide the result by two. This results in each game being 'worth' 0.5 games.
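As a quick worked example of that formula, here is a minimal sketch. The function name `games_back` and the 20-5 and 17-8 records are hypothetical and chosen only for illustration; the second call uses differentials of 1 and -1, like those that appear early in the season.

```python
def games_back(leader_wins, leader_losses, team_wins, team_losses):
    """Games back = (leader differential - team differential) / 2."""
    leader_diff = leader_wins - leader_losses
    team_diff = team_wins - team_losses
    return (leader_diff - team_diff) / 2

# Hypothetical records: leader is 20-5 (differential 15), team is 17-8 (differential 9).
print(games_back(20, 5, 17, 8))   # 3.0

# A 0-1 team chasing a 1-0 leader is one full game back.
print(games_back(1, 0, 0, 1))     # 1.0
```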

So first, we'll get the total wins and losses for each team including the game currently being played on the given date.

Get wins and losses in numerical form for each game for each team.

In [17]:
ess_vars['W'] = ess_vars.WL.apply(lambda x: 1 if x == 'W' else 0)
ess_vars['L'] = ess_vars.WL.apply(lambda x: 1 if x == 'L' else 0)

Get total wins and losses for each team up to and including the game played on the game date as well as finding their team differential (wins minus losses).

In [18]:
team_dfs = []
for team in ess_vars.TEAM_ABBREVIATION.unique():
    team_df = ess_vars[ess_vars.TEAM_ABBREVIATION==team].sort_values('GAME_DATE').copy()
    team_df['W_TOTAL'] = team_df.W.cumsum()
    team_df['L_TOTAL'] = team_df.L.cumsum()
    team_df['TEAM_DIFFERENTIAL'] = team_df['W_TOTAL'] - team_df['L_TOTAL']
    team_dfs.append(team_df)
wl_vars = pd.concat(team_dfs)

Create two new dataframes with every date between October 22, 2019 (start of season) and Jan 5, 2020 (when data was collected) - one for Eastern teams and one for Western teams.

In [19]:
dates_df_E = pd.DataFrame(pd.date_range(start='2019-10-22', end='2020-01-05'), columns=['GAME_DATE'])
dates_df_W = pd.DataFrame(pd.date_range(start='2019-10-22', end='2020-01-05'), columns=['GAME_DATE'])

Change `wl_vars` `GAME_DATE` variable to datetime64.

In [20]:
wl_vars['GAME_DATE'] = pd.to_datetime(wl_vars.GAME_DATE)

Separate Eastern and Western teams.

In [21]:
east_teams = ['ATL', 'BOS', 'CLE', 'CHI', 'MIA', 'MIL', 'BKN', 'NYK', 'ORL', 'IND', 'PHI', 'TOR', 'WAS', 'DET', 'CHA']
west_teams = ['NOP', 'DAL', 'DEN', 'GSW', 'HOU', 'LAC', 'LAL', 'MIN', 'PHX', 'POR', 'SAC', 'SAS', 'OKC', 'UTA', 'MEM']

Make function to easily convert both `dates_df_E` and `dates_df_W` to format needed.

In [22]:
def make_teamDifferential_df(team_list, dates_df):
    for team in team_list:
        sing_team_diff = wl_vars[wl_vars.TEAM_ABBREVIATION==team][['GAME_DATE','TEAM_DIFFERENTIAL']].rename(columns={'TEAM_DIFFERENTIAL':team})
        dates_df = dates_df.merge(sing_team_diff, how='left', on='GAME_DATE').ffill().fillna(0)
    dates_df['MAX_DIFF'] = dates_df.iloc[:, 1:].max(axis=1) 
    return dates_df
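To see what the left merge followed by `ffill().fillna(0)` does for a single team, here is a small sketch on hypothetical data (the dates and the `BOS` values are made up for illustration). Days without a game carry the last known differential forward, and days before a team's first game are filled with 0.

```python
import pandas as pd

# One row per calendar day.
dates = pd.DataFrame({"GAME_DATE": pd.date_range("2019-10-22", "2019-10-26")})

# Hypothetical team that played on only two of those days.
team_games = pd.DataFrame({
    "GAME_DATE": pd.to_datetime(["2019-10-23", "2019-10-25"]),
    "BOS": [-1, 0],   # differential after each game
})

# Left merge keeps every day; ffill carries the last differential forward;
# fillna(0) covers the days before the first game.
daily = dates.merge(team_games, how="left", on="GAME_DATE").ffill().fillna(0)
print(daily)
```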

Call the function to create the two dataframes.

In [23]:
dates_df_E = make_teamDifferential_df(east_teams, dates_df_E)
dates_df_W = make_teamDifferential_df(west_teams, dates_df_W)

There are now two dataframes with team differential data for every day with the relevant Eastern and Western teams as well as the highest differential for the day (best team according to differential). An idea of what that looks like can be seen below.

In [24]:
dates_df_E.head()

Unnamed: 0,GAME_DATE,ATL,BOS,CLE,CHI,MIA,MIL,BKN,NYK,ORL,IND,PHI,TOR,WAS,DET,CHA,MAX_DIFF
0,2019-10-22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,2019-10-23,0.0,-1.0,-1.0,-1.0,1.0,0.0,-1.0,-1.0,1.0,-1.0,1.0,1.0,-1.0,1.0,1.0,1.0
2,2019-10-24,1.0,-1.0,-1.0,-1.0,1.0,1.0,-1.0,-1.0,1.0,-1.0,1.0,1.0,-1.0,0.0,1.0,1.0
3,2019-10-25,1.0,0.0,-1.0,0.0,1.0,1.0,0.0,-2.0,1.0,-1.0,1.0,0.0,0.0,0.0,0.0,1.0
4,2019-10-26,2.0,1.0,0.0,-1.0,2.0,0.0,0.0,-3.0,0.0,-2.0,2.0,1.0,-1.0,-1.0,0.0,2.0


Create another function that builds a games-back dataframe that can be merged easily with the `wl_vars` dataframe.

In [25]:
def make_games_back_df(team_list, dates_df):
    dates_gb_dfs = []
    for team in team_list:
        dates_gb = pd.DataFrame(dates_df['GAME_DATE'])
        dates_gb['TEAM_ABBREVIATION'] = team
        dates_gb['GAMES_BACK'] = (dates_df['MAX_DIFF'] - dates_df[team]) / 2
        dates_gb_dfs.append(dates_gb)
    gameback_df = pd.concat(dates_gb_dfs)
    return gameback_df

Call that function on the Eastern and Western data.

In [26]:
gameback_df_E = make_games_back_df(east_teams, dates_df_E)
gameback_df_W = make_games_back_df(west_teams, dates_df_W)

Put all the data together now.

In [27]:
gameback_df_EW = pd.concat([gameback_df_E, gameback_df_W])

What we now have is a dataframe that can easily be inner merged with the `wl_vars` dataframe, completing the creation of this variable.

In [28]:
gameback_df_EW.head()

Unnamed: 0,GAME_DATE,TEAM_ABBREVIATION,GAMES_BACK
0,2019-10-22,ATL,0.5
1,2019-10-23,ATL,0.5
2,2019-10-24,ATL,0.0
3,2019-10-25,ATL,0.0
4,2019-10-26,ATL,0.0


Add the variable to the dataset.

In [29]:
gb_vars = wl_vars.merge(gameback_df_EW).copy()

Let's start by creating some other statistics needed:
1. `POSS` or possessions in order to calculate the defensive rating
2. `FG2PA` or two point field goal attempts in order to calculate block rate
3. `AST/TOV` or assist to turnover ratio (one of the needed variables)

In [30]:
gb_vars['POSS'] = gb_vars.FGA - gb_vars.OREB / (gb_vars.OREB + gb_vars.DREB) * (gb_vars.FGA - gb_vars.FGM) * 1.07 + gb_vars.TOV + (.4 * gb_vars.FTA)
gb_vars['FG2PA'] = gb_vars.FGA - gb_vars.FG3A
gb_vars['AST/TOV'] = gb_vars.AST / gb_vars.TOV
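The possession estimate above can be checked by hand against one game. The sketch below wraps the same expression in a hypothetical helper, `estimate_possessions`, and plugs in the box score values for ATL @ DET on 2019-10-24 as they appear in the `vs_data` output later in this notebook.

```python
def estimate_possessions(fga, fgm, oreb, dreb, tov, fta):
    """Same expression as the POSS column: the offensive rebound share is
    computed from the team's own OREB and DREB, as in the notebook."""
    missed_shots = fga - fgm
    oreb_share = oreb / (oreb + dreb)
    return fga - oreb_share * missed_shots * 1.07 + tov + 0.4 * fta

# ATL @ DET, 2019-10-24
poss = estimate_possessions(fga=86, fgm=44, oreb=8, dreb=34, tov=13, fta=21)
fg2pa = 86 - 31        # two point attempts = FGA - FG3A
ast_tov = 27 / 13      # assist to turnover ratio

print(round(poss, 2), fg2pa, round(ast_tov, 6))
```

These agree with the `POSS_A`, `FG2PA_A`, and `AST/TOV_A` values shown for that game (98.84, 55, and 2.076923).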

Next, add a home indicator.

In [31]:
gb_vars['TEAM_HOME'] = gb_vars.MATCHUP.apply(lambda x: 1 if x.split()[1] =='vs.' else 0)

Get average points scored up to that point of the season for each team. 

In [32]:
temp_dfs = []
for team in gb_vars.TEAM_ABBREVIATION.unique():
    temp_df = gb_vars[gb_vars.TEAM_ABBREVIATION==team].sort_values('GAME_DATE')
    temp_df['GAME_PLAYED'] = temp_df.TEAM_ABBREVIATION.apply(lambda x: 1 if x else 0).cumsum()
    temp_df['ACCUM_PTS'] = temp_df.PTS.cumsum()
    temp_df['AVG_PTS'] = temp_df.ACCUM_PTS / temp_df.GAME_PLAYED
    temp_dfs.append(temp_df)
ap_vars = pd.concat(temp_dfs)

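Get the rolling sum of wins over the last 5 games for each team, along with the last-five-games win %. Note that the percentage always divides by 5, so it is understated for a team's first four games of the season.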

In [33]:
temp_dfs = []
for team in ap_vars.TEAM_ABBREVIATION.unique():
    temp_df = ap_vars[ap_vars.TEAM_ABBREVIATION==team].sort_values('GAME_DATE')
    temp_df['LAST_FIVE_W'] = temp_df.W.rolling(5, min_periods=0).sum()
    temp_df['LAST_FIVE_W_PCT'] = temp_df.LAST_FIVE_W / 5
    temp_dfs.append(temp_df)
lf_vars = pd.concat(temp_dfs)
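The effect of dividing by a fixed 5 can be seen on a short, hypothetical win sequence. The second calculation, which divides by the number of games actually in the window, is shown only for comparison. It is not what this notebook uses, and switching to it would make the features inconsistent with the data the model was trained on.

```python
import pandas as pd

# Hypothetical win (1) / loss (0) sequence for one team, in date order.
wins = pd.Series([1, 1, 0, 1, 1, 0, 1])

# Approach used in this notebook: rolling sum of wins, always divided by 5.
last_five_w = wins.rolling(5, min_periods=0).sum()
pct_fixed = last_five_w / 5

# Alternative for comparison only: divide by the games available in the window.
pct_played = wins.rolling(5, min_periods=1).mean()

print(pd.DataFrame({"wins": wins, "fixed_5": pct_fixed, "games_played": pct_played}))
```

The two columns differ only for the first four games, after which every window holds a full five games.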

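Next, certain variables require adversarial (opponent) data in order to be created: points allowed, defensive rating, block rate, and points against. Merging the dataframe with itself on the game identifiers puts each team's stats (`_A`) side by side with its opponent's (`_B`).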

In [34]:
joined_data = pd.merge(lf_vars, lf_vars, on=['GAME_ID', 'GAME_DATE'], suffixes=('_A', '_B'))
vs_data = joined_data[joined_data.TEAM_ABBREVIATION_A != joined_data.TEAM_ABBREVIATION_B].copy()
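The self-merge pattern is easier to follow on a toy frame. The sketch below uses a hypothetical four-row table (two games, two teams each) with made-up game IDs: merging on the game keys produces every pairing within a game, and the filter then drops the rows where a team was paired with itself.

```python
import pandas as pd

# Hypothetical per-team game rows: each game appears once per team.
games = pd.DataFrame({
    "GAME_ID": [1, 1, 2, 2],
    "GAME_DATE": ["2019-10-24", "2019-10-24", "2019-10-26", "2019-10-26"],
    "TEAM_ABBREVIATION": ["ATL", "DET", "ATL", "ORL"],
    "PTS": [117, 100, 103, 99],
})

# Self-merge: 2 teams x 2 teams = 4 rows per game, including self-pairings.
paired = pd.merge(games, games, on=["GAME_ID", "GAME_DATE"], suffixes=("_A", "_B"))

# Keep only the rows where team A faces a different team B.
paired = paired[paired.TEAM_ABBREVIATION_A != paired.TEAM_ABBREVIATION_B]
print(paired)
```

Each game is left with two rows, one from each team's point of view, which is why the index of the real output skips values.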

In [35]:
vs_data.head()

Unnamed: 0,GAME_ID,GAME_DATE,TEAM_ABBREVIATION_A,MATCHUP_A,WL_A,PTS_A,FGA_A,OREB_A,DREB_A,FGM_A,TOV_A,FTA_A,FG3M_A,PTS_QTR1_A,PTS_QTR2_A,PTS_QTR3_A,BLK_A,FG3A_A,AST_A,Spread_A,W_A,L_A,W_TOTAL_A,L_TOTAL_A,TEAM_DIFFERENTIAL_A,GAMES_BACK_A,POSS_A,FG2PA_A,AST/TOV_A,TEAM_HOME_A,GAME_PLAYED_A,ACCUM_PTS_A,AVG_PTS_A,LAST_FIVE_W_A,LAST_FIVE_W_PCT_A,TEAM_ABBREVIATION_B,MATCHUP_B,WL_B,PTS_B,FGA_B,OREB_B,DREB_B,FGM_B,TOV_B,FTA_B,FG3M_B,PTS_QTR1_B,PTS_QTR2_B,PTS_QTR3_B,BLK_B,FG3A_B,AST_B,Spread_B,W_B,L_B,W_TOTAL_B,L_TOTAL_B,TEAM_DIFFERENTIAL_B,GAMES_BACK_B,POSS_B,FG2PA_B,AST/TOV_B,TEAM_HOME_B,GAME_PLAYED_B,ACCUM_PTS_B,AVG_PTS_B,LAST_FIVE_W_B,LAST_FIVE_W_PCT_B
1,21900014,2019-10-24,ATL,ATL @ DET,W,117,86,8,34,44,13,21,11,38,22,31,2,31,27,1.5,1,0,1,0,1,0.0,98.84,55,2.076923,0,1,117,117.0,1.0,0.2,DET,DET vs. ATL,L,100,85,9,30,35,13,22,10,32,31,18,5,37,20,-1.5,0,1,1,1,0,0.5,94.453846,48,1.538462,1,2,219,109.5,1.0,0.2
2,21900014,2019-10-24,DET,DET vs. ATL,L,100,85,9,30,35,13,22,10,32,31,18,5,37,20,-1.5,0,1,1,1,0,0.5,94.453846,48,1.538462,1,2,219,109.5,1.0,0.2,ATL,ATL @ DET,W,117,86,8,34,44,13,21,11,38,22,31,2,31,27,1.5,1,0,1,0,1,0.0,98.84,55,2.076923,0,1,117,117.0,1.0,0.2
5,21900028,2019-10-26,ATL,ATL vs. ORL,W,103,84,9,43,43,18,15,9,23,29,25,9,30,22,2.5,1,0,2,0,2,0.0,100.407115,54,1.222222,1,2,220,110.0,2.0,0.4,ORL,ORL @ ATL,L,99,99,17,29,35,10,26,5,26,24,25,7,31,15,-2.5,0,1,1,1,0,1.0,94.092174,68,1.5,0,2,193,96.5,1.0,0.2
6,21900028,2019-10-26,ORL,ORL @ ATL,L,99,99,17,29,35,10,26,5,26,24,25,7,31,15,-2.5,0,1,1,1,0,1.0,94.092174,68,1.5,0,2,193,96.5,1.0,0.2,ATL,ATL vs. ORL,W,103,84,9,43,43,18,15,9,23,29,25,9,30,22,2.5,1,0,2,0,2,0.0,100.407115,54,1.222222,1,2,220,110.0,2.0,0.4
9,21900043,2019-10-28,ATL,ATL vs. PHI,L,103,84,8,37,36,21,32,9,40,25,18,3,27,23,6.5,0,1,2,1,1,1.0,108.669333,57,1.095238,1,3,323,107.666667,2.0,0.4,PHI,PHI @ ATL,W,105,88,12,37,37,21,25,11,31,32,19,11,41,22,-6.5,1,0,3,0,3,0.0,105.635918,47,1.047619,0,3,329,109.666667,3.0,0.6


We'll get points allowed, defensive rating, block rate, and points against all in one fell swoop.

In [36]:
temp_dfs = []
for team in vs_data.TEAM_ABBREVIATION_A.unique():
    temp_df = vs_data[vs_data.TEAM_ABBREVIATION_A==team].sort_values('GAME_DATE')
    temp_df['PTSAGAINST_A'] = temp_df.PTS_B.cumsum()
    temp_df['PTSALLOW_A'] = temp_df.PTSAGAINST_A / temp_df.GAME_PLAYED_A
    temp_df['DRTG_A'] = temp_df.PTS_B * 100 / temp_df.POSS_B
    temp_df['BLKR_A'] = temp_df.BLK_A * 100 / temp_df.FG2PA_B
    temp_dfs.append(temp_df)
up_vs_data = pd.concat(temp_dfs)
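As a sanity check on the defensive rating and block rate expressions, the sketch below recomputes both for ATL in the ATL @ DET game on 2019-10-24, using the opponent values from the `vs_data` output above (DET scored 100 points on an estimated 94.453846 possessions and took 48 two point attempts; ATL had 2 blocks). The helper names are hypothetical.

```python
def defensive_rating(opp_pts, opp_poss):
    """Points allowed per 100 opponent possessions."""
    return opp_pts * 100 / opp_poss

def block_rate(team_blk, opp_fg2pa):
    """Blocks per 100 opponent two point attempts."""
    return team_blk * 100 / opp_fg2pa

drtg = defensive_rating(opp_pts=100, opp_poss=94.453846)
blkr = block_rate(team_blk=2, opp_fg2pa=48)

print(round(drtg, 4), round(blkr, 6))
```

Both values line up with the `DRTG` and `BLKR` columns shown for that game (105.871814 and 4.166667), up to the rounding of the printed possession estimate.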

Get team A cols to filter out B columns for now.

In [37]:
a_cols = up_vs_data.columns[up_vs_data.columns.str.endswith('_A')].tolist()

New dataframe, `team_stats`, holding almost all the data needed from a single team's perspective.

In [38]:
team_stats = up_vs_data[['GAME_ID', 'GAME_DATE'] + a_cols].copy()

Drop the '_A' from the column names. First rename `TEAM_ABBREVIATION` to `TEAMABRV`, since the name itself contains '_A' (`TEAM_A...`) and would otherwise be mangled by the replacement, both here and going forward.

In [39]:
team_stats.columns = team_stats.columns.str.replace('TEAM_ABBREVIATION', 'TEAMABRV').str.replace('_A', '')

Take a look at the data.

In [40]:
team_stats.head()

Unnamed: 0,GAME_ID,GAME_DATE,TEAMABRV,MATCHUP,WL,PTS,FGA,OREB,DREB,FGM,TOV,FTA,FG3M,PTS_QTR1,PTS_QTR2,PTS_QTR3,BLK,FG3A,AST,Spread,W,L,W_TOTAL,L_TOTAL,TEAM_DIFFERENTIAL,GAMES_BACK,POSS,FG2PA,AST/TOV,TEAM_HOME,GAME_PLAYED,ACCUM_PTS,AVG_PTS,LAST_FIVE_W,LAST_FIVE_W_PCT,PTSAGAINST,PTSALLOW,DRTG,BLKR
1,21900014,2019-10-24,ATL,ATL @ DET,W,117,86,8,34,44,13,21,11,38,22,31,2,31,27,1.5,1,0,1,0,1,0.0,98.84,55,2.076923,0,1,117,117.0,1.0,0.2,100,100.0,105.871814,4.166667
5,21900028,2019-10-26,ATL,ATL vs. ORL,W,103,84,9,43,43,18,15,9,23,29,25,9,30,22,2.5,1,0,2,0,2,0.0,100.407115,54,1.222222,1,2,220,110.0,2.0,0.4,199,99.5,105.215977,13.235294
9,21900043,2019-10-28,ATL,ATL vs. PHI,L,103,84,8,37,36,21,32,9,40,25,18,3,27,23,6.5,0,1,2,1,1,1.0,108.669333,57,1.095238,1,3,323,107.666667,2.0,0.4,304,101.333333,99.398009,6.382979
13,21900052,2019-10-29,ATL,ATL @ MIA,L,97,83,9,24,35,20,24,11,26,23,21,7,39,28,8.5,0,1,2,2,0,1.5,98.592727,44,1.4,0,4,420,105.0,2.0,0.4,416,104.0,106.015531,16.666667
17,21900066,2019-10-31,ATL,ATL vs. MIA,L,97,88,16,34,36,16,26,7,26,20,29,5,34,20,6.5,0,1,2,3,-1,2.5,96.5952,54,1.25,1,5,517,103.4,2.0,0.4,522,104.4,112.448511,10.638298


That gets all the base data needed, next up is to get the data into the appropriate differences, rolling windows, and shifts needed for the ultimate feature data.

Before differencing, let's clean things up by getting things down to essential variables for moving forward.

In [41]:
team_stats = team_stats[['GAME_ID', 'GAME_DATE', 'TEAMABRV', 'W', 'TEAM_HOME', 'GAMES_BACK', 'PTSALLOW', 'DRTG', 'FG3M', 'DREB',
                         'PTS_QTR1', 'PTS_QTR2', 'PTS_QTR3', 'BLKR', 'AST/TOV', 'LAST_FIVE_W_PCT', 'PTSAGAINST', 'AVG_PTS', 'Spread']].copy()

For differencing, get into adversarial format once again.

In [42]:
joined = pd.merge(team_stats, team_stats, on=['GAME_ID', 'GAME_DATE'], suffixes=('_A', '_B'))
vs_df = joined[joined.TEAMABRV_A != joined.TEAMABRV_B].copy()

In [43]:
vs_df.head()

Unnamed: 0,GAME_ID,GAME_DATE,TEAMABRV_A,W_A,TEAM_HOME_A,GAMES_BACK_A,PTSALLOW_A,DRTG_A,FG3M_A,DREB_A,PTS_QTR1_A,PTS_QTR2_A,PTS_QTR3_A,BLKR_A,AST/TOV_A,LAST_FIVE_W_PCT_A,PTSAGAINST_A,AVG_PTS_A,Spread_A,TEAMABRV_B,W_B,TEAM_HOME_B,GAMES_BACK_B,PTSALLOW_B,DRTG_B,FG3M_B,DREB_B,PTS_QTR1_B,PTS_QTR2_B,PTS_QTR3_B,BLKR_B,AST/TOV_B,LAST_FIVE_W_PCT_B,PTSAGAINST_B,AVG_PTS_B,Spread_B
1,21900014,2019-10-24,ATL,1,0,0.0,100.0,105.871814,11,34,38,22,31,4.166667,2.076923,0.2,100,117.0,1.5,DET,0,1,0.5,113.5,118.373128,10,30,32,31,18,9.090909,1.538462,0.2,227,109.5,-1.5
2,21900014,2019-10-24,DET,0,1,0.5,113.5,118.373128,10,30,32,31,18,9.090909,1.538462,0.2,227,109.5,-1.5,ATL,1,0,0.0,100.0,105.871814,11,34,38,22,31,4.166667,2.076923,0.2,100,117.0,1.5
5,21900028,2019-10-26,ATL,1,1,0.0,99.5,105.215977,9,43,23,29,25,13.235294,1.222222,0.4,199,110.0,2.5,ORL,0,0,1.0,94.0,102.582371,5,29,26,24,25,12.962963,1.5,0.2,188,96.5,-2.5
6,21900028,2019-10-26,ORL,0,0,1.0,94.0,102.582371,5,29,26,24,25,12.962963,1.5,0.2,188,96.5,-2.5,ATL,1,1,0.0,99.5,105.215977,9,43,23,29,25,13.235294,1.222222,0.4,199,110.0,2.5
9,21900043,2019-10-28,ATL,0,1,1.0,101.333333,99.398009,9,37,40,25,18,6.382979,1.095238,0.4,304,107.666667,6.5,PHI,1,0,0.0,102.333333,94.78295,11,37,31,32,19,19.298246,1.047619,0.6,307,109.666667,-6.5


There are numerous variables that need to be differenced.

In [44]:
features.columns

Index(['teamHome_A', 'gameBack_A', 'ptsAllow_A', 'teamPTS1_A', 'teamDrtg_A',
       'team3PM_Diff_A', 'teamDRB_Diff_A', 'teamPTS1_Diff_A',
       'teamPTS2_Diff_A', 'teamPTS3_Diff_A', 'teamBLKR_Diff_A',
       'teamAST/TO_Diff_A', 'lastFiveWin%_B', 'teamDrtg_B', 'team3PM_Diff_B',
       'teamPTS1_Diff_B', 'teamPTS3_Diff_B', 'teamBLKR_Diff_B',
       'teamAST/TO_Diff_B', 'ptsAgnst_Diff', 'ptsScore_Diff', 'ptsAllow_Diff',
       'spread_Diff', 'lastFiveWin%_Diff'],
      dtype='object')

Let's get the differences and rename the appropriate columns to be consistent with the above.

In [45]:
vs_df['teamHome'] = vs_df.TEAM_HOME_A
vs_df['gameBack'] = vs_df.GAMES_BACK_A
vs_df['ptsAllow'] = vs_df.PTSALLOW_A
vs_df['teamPTS1'] = vs_df.PTS_QTR1_A
vs_df['teamDrtg'] = vs_df.DRTG_A
vs_df['team3PM_Diff'] = vs_df.FG3M_A - vs_df.FG3M_B
vs_df['teamDRB_Diff'] = vs_df.DREB_A - vs_df.DREB_B
vs_df['teamPTS1_Diff'] = vs_df.PTS_QTR1_A - vs_df.PTS_QTR1_B
vs_df['teamPTS2_Diff'] = vs_df.PTS_QTR2_A - vs_df.PTS_QTR2_B
vs_df['teamPTS3_Diff'] = vs_df.PTS_QTR3_A - vs_df.PTS_QTR3_B
vs_df['teamBLKR_Diff'] = vs_df.BLKR_A - vs_df.BLKR_B
vs_df['teamAST/TO_Diff'] = vs_df['AST/TOV_A'] - vs_df['AST/TOV_B']
vs_df['lastFiveWin%'] = vs_df.LAST_FIVE_W_PCT_A
vs_df['ptsAgnst'] = vs_df.PTSAGAINST_A
vs_df['ptsScore'] = vs_df.AVG_PTS_A
vs_df['spread'] = vs_df.Spread_A

Get in terms of single team again.

In [46]:
vs_df.columns = vs_df.columns.str.replace('_A', '')
sing_df = vs_df[vs_df.columns[~vs_df.columns.str.endswith('_B')]].copy()

Shift appropriate columns forward one game for each team by using altered function from first notebook.

In [47]:
def shift_data(df, team_abbr, shift_cols):
    new_dfs = []
    for team in df[team_abbr].unique():
        temp_df = df[(df[team_abbr]==team)]
        # Use a name that does not shadow the function itself
        shifted = temp_df[shift_cols].shift()
        new_data = temp_df.drop(columns=shift_cols).join(shifted)
        new_dfs.append(new_data)
    shift_df = pd.concat(new_dfs)
    return shift_df
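The shift is what keeps the features free of leakage: a season-to-date statistic recorded after a game is moved down one row, so each game only sees values known before tip-off. The sketch below shows the same idea on hypothetical data using `groupby(...).shift()`, which is an equivalent pandas idiom rather than the function used in this notebook.

```python
import pandas as pd

# Hypothetical season-to-date scoring averages, recorded after each game.
toy = pd.DataFrame({
    "TEAMABRV": ["ATL", "ATL", "ATL", "BOS", "BOS"],
    "GAME_DATE": pd.to_datetime(["2019-10-24", "2019-10-26", "2019-10-28",
                                 "2019-10-23", "2019-10-25"]),
    "ptsScore": [117.0, 110.0, 107.7, 100.0, 104.0],
})

# Sort so that "previous row" means "previous game" within each team.
toy = toy.sort_values(["TEAMABRV", "GAME_DATE"])

# Shift within each team: every game now carries the value from the prior game.
toy["ptsScore_prior"] = toy.groupby("TEAMABRV")["ptsScore"].shift()
print(toy)
```

Each team's first game ends up with a null, which is why first games have to be dropped before testing the model.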

Define columns that need to be shifted forward one game.

In [48]:
shift_cols = ['gameBack', 'ptsAllow', 'lastFiveWin%', 'ptsAgnst', 'ptsScore']

Call function on current dataframe.

In [49]:
shifted_df = shift_data(sing_df, 'TEAMABRV', shift_cols)

Get a rolling window of the last nine games for various variables. First make a list of the columns that need to be rolled over the prior 9 games.

In [50]:
rolling_cols = ['teamPTS1', 'teamDrtg', 'team3PM_Diff', 'teamDRB_Diff', 'teamPTS1_Diff', 'teamPTS2_Diff', 'teamPTS3_Diff', 'teamBLKR_Diff', 'teamAST/TO_Diff']

Then get identifier variables and rolling columns into its own dataframe.

In [51]:
rolls_df = shifted_df[['GAME_ID', 'GAME_DATE', 'TEAMABRV'] + rolling_cols].copy()

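Roll the columns into an average of the last 9 games and shift forward one game, looping through each team. Note that `win_type='triang'` makes this a triangular-weighted average rather than a simple mean, so games in the middle of the window count the most.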

In [52]:
rolling_dfs = []
for team in rolls_df.TEAMABRV.unique():
    temp_df = rolls_df[rolls_df.TEAMABRV==team].sort_values('GAME_DATE')
    rolling_temp_df = temp_df[rolling_cols].rolling(9, 1, win_type='triang').mean().shift()
    temp_df = temp_df.drop(columns=rolling_cols).join(rolling_temp_df)
    rolling_dfs.append(temp_df)
rolling_df = pd.concat(rolling_dfs)
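To get a feel for what the triangular window does, here is a minimal NumPy sketch for one full 9-game window. It assumes the weights follow SciPy's triangular window definition for 9 points (0.2 rising to 1.0 at the center and back down to 0.2), which is what pandas delegates to for `win_type='triang'`. The values are hypothetical, and partial windows early in the season are not covered.

```python
import numpy as np

# Hypothetical first-quarter points over nine games, oldest first,
# with one unusually high game in the middle of the window.
values = np.array([20.0, 20.0, 20.0, 20.0, 40.0, 20.0, 20.0, 20.0, 20.0])

# Assumed triangular weights for a 9-point window (SciPy-style definition).
weights = np.array([0.2, 0.4, 0.6, 0.8, 1.0, 0.8, 0.6, 0.4, 0.2])

weighted = np.average(values, weights=weights)   # triangular-weighted mean
simple = values.mean()                           # plain mean, for comparison

print(weighted, round(simple, 2))
```

The game in the middle of the window pulls the weighted average further than it pulls the plain mean, while the oldest and most recent games in the window contribute the least.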

Put those rolling variables (now moving window of last 9) back into base dataframe.

In [53]:
base_df = shifted_df.drop(columns=rolling_cols).merge(rolling_df, on=['GAME_ID', 'GAME_DATE', 'TEAMABRV']).copy()

At this point the data is VERY close to the form it needs to be in for testing the model. Only a few more steps remain:
1. Remove first games
2. Final differences
3. Random picker
4. Getting final targets and features
5. Scaling the data

Make list of columns we need.

In [54]:
var_cols = ['GAME_ID', 'GAME_DATE', 'TEAMABRV', 'W', 'teamHome', 'gameBack', 'ptsAllow', 'teamPTS1', 'teamDrtg', 'team3PM_Diff', 'teamDRB_Diff', 'teamPTS1_Diff', 
            'teamPTS2_Diff', 'teamPTS3_Diff', 'teamBLKR_Diff', 'teamAST/TO_Diff', 'lastFiveWin%', 'ptsAgnst', 'ptsScore', 'spread']

Gather only those columns into the dataframe.

In [55]:
base_df = base_df[var_cols].copy()

Put the data in adversarial (team A vs. team B) format once again.

In [56]:
joined = pd.merge(base_df, base_df, on=['GAME_ID', 'GAME_DATE'], suffixes=('_A', '_B'))
ab_df = joined[joined.TEAMABRV_A != joined.TEAMABRV_B].copy()
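
As a small illustration of why the self-merge needs the follow-up filter, the sketch below runs the same two steps on invented data with two games.

```python
import pandas as pd

# Toy data: one row per team per game (all values are invented).
games = pd.DataFrame({
    'GAME_ID': [1, 1, 2, 2],
    'GAME_DATE': ['2019-10-23', '2019-10-23', '2019-10-24', '2019-10-24'],
    'TEAMABRV': ['BOS', 'PHI', 'LAL', 'LAC'],
    'W': [0, 1, 0, 1],
})

# The self-merge yields every (team, team) pairing within a game: 4 rows per game,
# including each team paired with itself.
paired = pd.merge(games, games, on=['GAME_ID', 'GAME_DATE'], suffixes=('_A', '_B'))

# Dropping the self-pairings leaves 2 rows per game, one from each team's point of view.
matchups = paired[paired.TEAMABRV_A != paired.TEAMABRV_B]
print(matchups[['GAME_ID', 'TEAMABRV_A', 'TEAMABRV_B', 'W_A', 'W_B']])
```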

At this point, any game row with a null indicates that it is the first game of the season for at least one of the two teams. The model does not work on first-game data because it relies on data from prior games.

In [57]:
ab_df = ab_df.dropna().copy()

Get the final differences needed for the data: `ptsAgnst_Diff`, `ptsScore_Diff`, `ptsAllow_Diff`, `spread_Diff` and `lastFiveWin%_Diff`.

In [58]:
ab_df['ptsAgnst_Diff'] = ab_df.ptsAgnst_A - ab_df.ptsAgnst_B
ab_df['ptsScore_Diff'] = ab_df.ptsScore_A - ab_df.ptsScore_B
ab_df['ptsAllow_Diff'] = ab_df.ptsAllow_A - ab_df.ptsAllow_B
ab_df['spread_Diff'] = ab_df.spread_A - ab_df.spread_B
ab_df['lastFiveWin%_Diff'] = ab_df['lastFiveWin%_A'] - ab_df['lastFiveWin%_B']

All the appropriate variables have now been created. Now let's randomly pick the point of view for each game.

In [59]:
rand_vs_df = ab_df.sample(frac = 1.0).groupby('GAME_ID').head(1).copy()
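
The shuffle-then-`head(1)` pattern keeps exactly one of the two rows for each game. The sketch below shows it on invented matchups. The `random_state` argument is an addition that the cell above does not use; it makes the pick repeatable, since without it the evaluated rows (and possibly the reported accuracy) can change from run to run.

```python
import pandas as pd

# Toy data: two rows per game, one per point of view (values are invented).
matchups = pd.DataFrame({
    'GAME_ID': [1, 1, 2, 2, 3, 3],
    'TEAMABRV_A': ['BOS', 'PHI', 'LAL', 'LAC', 'MIA', 'ORL'],
    'TEAMABRV_B': ['PHI', 'BOS', 'LAC', 'LAL', 'ORL', 'MIA'],
})

# Shuffle all rows, then keep the first row encountered for each game.
picked = matchups.sample(frac=1.0, random_state=42).groupby('GAME_ID').head(1)
print(picked.sort_values('GAME_ID'))
```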

Assign new target and new features.

In [60]:
target_2020 = rand_vs_df.W_A
features_2020 = rand_vs_df[features.columns.to_list()]

Scale the new data using the scaler that was fit to the original features.

In [61]:
s_features_2020 = pd.DataFrame(scaler.transform(features_2020.iloc[:, 1:]), columns=features_2020.columns[1:], index=features_2020.index).join(features_2020.iloc[:, 0]).copy()
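
The key point of this step is that the scaler is only *applied* to the new season, never refit on it. A minimal sketch of that pattern on invented numbers follows. The column split (one binary column left unscaled, one numeric column scaled) is an assumption made for illustration, and `toy_scaler` is a hypothetical name.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented "training" and "new season" frames.
train = pd.DataFrame({'teamHome_A': [1, 0, 1, 0], 'spread_Diff': [2.0, -4.0, 6.0, 0.0]})
new = pd.DataFrame({'teamHome_A': [0, 1], 'spread_Diff': [1.0, 11.0]})

num_cols = ['spread_Diff']

# Fit on the training data only, so the new data is scaled with the
# training mean and standard deviation.
toy_scaler = StandardScaler().fit(train[num_cols])

# Transform the new data, keep its index, and join the unscaled binary column back.
scaled_new = pd.DataFrame(toy_scaler.transform(new[num_cols]),
                          columns=num_cols, index=new.index)
scaled_new = scaled_new.join(new['teamHome_A'])
print(scaled_new)
```

Because the new data is scaled with training statistics, its columns are not guaranteed to have mean 0 and unit variance, which is one thing the eye test can reveal.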

Give new features and old features the eye test to make sure things look good.

In [62]:
s_features_2020.head()

Unnamed: 0,gameBack_A,ptsAllow_A,teamPTS1_A,teamDrtg_A,team3PM_Diff_A,teamDRB_Diff_A,teamPTS1_Diff_A,teamPTS2_Diff_A,teamPTS3_Diff_A,teamBLKR_Diff_A,teamAST/TO_Diff_A,lastFiveWin%_B,teamDrtg_B,team3PM_Diff_B,teamPTS1_Diff_B,teamPTS3_Diff_B,teamBLKR_Diff_B,teamAST/TO_Diff_B,ptsAgnst_Diff,ptsScore_Diff,ptsAllow_Diff,spread_Diff,lastFiveWin%_Diff,teamHome_A
1633,-1.17978,4.668815,2.753561,0.408152,-0.787266,0.607929,3.299822,0.303473,-4.935143,-2.344324,1.313346,-0.35854,3.538459,5.57429,2.652305,-2.031489,-1.11422,-2.31491,-4.932012,3.12858,1.980786,-0.703715,-0.486813,0
1729,-0.975085,0.464835,0.733085,-0.570981,-0.931008,1.036194,0.5243,0.999882,1.720609,1.765521,1.216587,1.943285,0.534177,1.793469,-0.877914,0.76015,-0.612292,-0.589214,-1.140796,-1.340662,-1.625223,0.824914,-0.486813,0
1974,-1.17978,0.109954,1.475907,-0.061549,1.33692,0.766999,0.980879,0.37678,1.708973,1.101831,0.052766,-0.35854,-0.035788,0.022705,-0.063248,0.725543,-0.539323,-0.788928,-9.511801,0.811517,-1.352429,-1.745962,1.466747,1
1698,0.611299,2.240123,0.703372,0.032629,-0.324097,-0.114003,0.103766,0.743311,-1.269825,0.58073,0.944083,-0.35854,-0.80462,0.309856,-1.314764,-0.681812,-0.612414,-0.103356,-0.65215,-1.083205,-0.407721,0.338532,0.001577,0
1994,-0.668043,1.392184,0.643946,0.374362,2.295199,0.179664,0.115781,0.462303,1.627522,-0.313254,0.063134,-1.125814,1.141032,-0.423974,-0.700813,-0.381884,-0.759182,2.969145,-1.319404,1.870602,-0.286342,-1.606996,0.978357,1


In [63]:
s_features.head()

Unnamed: 0,gameBack_A,ptsAllow_A,teamPTS1_A,teamDrtg_A,team3PM_Diff_A,teamDRB_Diff_A,teamPTS1_Diff_A,teamPTS2_Diff_A,teamPTS3_Diff_A,teamBLKR_Diff_A,teamAST/TO_Diff_A,lastFiveWin%_B,teamDrtg_B,team3PM_Diff_B,teamPTS1_Diff_B,teamPTS3_Diff_B,teamBLKR_Diff_B,teamAST/TO_Diff_B,ptsAgnst_Diff,ptsScore_Diff,ptsAllow_Diff,spread_Diff,lastFiveWin%_Diff,teamHome_A
225,1.378905,-0.318634,-0.603995,0.094033,-1.250434,0.1919,-0.737302,-0.612855,-0.548397,0.022798,-0.481386,1.17601,-0.499519,1.251073,-0.31119,1.152363,1.215918,2.10101,1.639429,-0.468623,0.529933,0.540033,-0.486813,1
6430,-0.053959,-0.184871,0.881649,-0.797269,1.09735,-0.738046,0.80065,-0.735032,-0.373858,0.661966,0.747301,-1.125814,-0.763446,-0.041106,-0.960561,0.529436,1.120838,-0.451387,1.494521,1.680383,0.594009,-1.440236,0.978357,1
14229,-0.719217,-0.471506,0.406243,-0.58893,0.586268,0.461095,0.199888,0.26682,-0.280771,-1.147929,-0.566486,-1.125814,0.010098,0.676771,-0.334803,0.829364,-0.362017,-1.18927,0.41613,0.465083,0.369744,0.623413,0.978357,0
4913,-1.17978,-0.930122,0.83708,0.411914,1.049436,1.023958,0.416162,1.659639,0.731555,-0.025002,0.55881,1.17601,-1.406014,0.916064,1.483437,0.725543,-0.77158,2.499623,-1.595742,-0.350057,-0.57537,-0.939958,-0.486813,1
26122,2.14651,-0.20398,-0.425717,-0.232058,0.506411,-1.484451,0.416162,-1.492529,-0.129504,-0.610447,1.599242,-0.35854,0.830076,-1.381143,-1.975942,0.194901,-1.124166,-1.426572,-1.295814,-0.453802,-0.238973,0.824914,-0.975202,0


Looks good. Only one more thing to do. Test the model on this data and see how well it does.

# Execute and check performance of model (in the wild)

For how long the last section was, this one is a very simple one-liner.

In [64]:
model.score(s_features_2020, target_2020)

0.6376518218623481

In [65]:
confusion_matrix(target_2020, model.predict(s_features_2020))

array([[154,  85],
       [ 94, 161]], dtype=int64)

So it looks like this model does not work very well on the new season. The Logistic Regression Model decided on in the prior notebook was only about 64% accurate (315 of 494 games). This makes me wonder whether one of the other previous models generalizes better. The good news, I suppose, is that it was about equally bad at classifying positives (wins, 161 of 255 correct) and negatives (losses, 154 of 239 correct).
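
The per-class rates can be read directly off the confusion matrix reported above. The sketch below recomputes them with NumPy, assuming the target is coded 0 for a loss and 1 for a win (scikit-learn orders rows as actual classes and columns as predicted classes). The majority-class baseline is an added point of comparison, not something computed in the notebook.

```python
import numpy as np

# Confusion matrix from the logistic regression output above.
# Rows: actual (loss, win). Columns: predicted (loss, win).
cm = np.array([[154, 85],
               [94, 161]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
recall_wins = tp / (tp + fn)       # share of actual wins predicted as wins
recall_losses = tn / (tn + fp)     # share of actual losses predicted as losses
baseline = cm.sum(axis=1).max() / cm.sum()  # always predicting the majority class

print('accuracy:      {:.4f}'.format(accuracy))
print('recall wins:   {:.4f}'.format(recall_wins))
print('recall losses: {:.4f}'.format(recall_losses))
print('baseline:      {:.4f}'.format(baseline))
```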

Let's look at a few of the other models from the prior notebook and see if they perform any better.

In [66]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from xgboost import XGBClassifier

Instantiate the models.

In [67]:
dt_model = DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                        max_depth=4, max_features=None, max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, presort='deprecated',
                        random_state=None, splitter='best')
rf_model = RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                        criterion='gini', max_depth=9, max_features='auto',
                        max_leaf_nodes=None, max_samples=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=90, n_jobs=-1,
                        oob_score=False, random_state=None, verbose=0,
                        warm_start=False)
et_model = ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                      criterion='gini', max_depth=None, max_features='auto',
                      max_leaf_nodes=None, max_samples=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=5,
                      min_weight_fraction_leaf=0.0, n_estimators=300, n_jobs=-1,
                      oob_score=False, random_state=None, verbose=0,
                      warm_start=False)
xg_model = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
               colsample_bynode=1, colsample_bytree=1, gamma=0,
               learning_rate=0.05, max_delta_step=0, max_depth=3,
               min_child_weight=1, missing=None, n_estimators=100, n_jobs=-1,
               nthread=None, objective='binary:logistic', random_state=0,
               reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
               silent=None, subsample=1, verbosity=1)

In [68]:
models = [dt_model, rf_model, et_model, xg_model]

Find the accuracy and confusion matrix for the various models.

In [69]:
for clf in models:
    # Use a separate loop variable so the logistic regression stored in `model` is not overwritten.
    name = type(clf).__name__
    clf.fit(s_features, target)
    score = clf.score(s_features_2020, target_2020)
    conf = confusion_matrix(target_2020, clf.predict(s_features_2020))
    print('-'*len(name))
    print(name)
    print('-'*len(name))
    print('Accuracy Score: {}%'.format(round(score*100, 4)))
    print('Confusion Matrix:')
    print(conf)

----------------------
DecisionTreeClassifier
----------------------
Accuracy Score: 64.5749%
Confusion Matrix:
[[154  85]
 [ 90 165]]
----------------------
RandomForestClassifier
----------------------
Accuracy Score: 64.17%
Confusion Matrix:
[[152  87]
 [ 90 165]]
--------------------
ExtraTreesClassifier
--------------------
Accuracy Score: 64.3725%
Confusion Matrix:
[[153  86]
 [ 90 165]]
-------------
XGBClassifier
-------------
Accuracy Score: 64.17%
Confusion Matrix:
[[156  83]
 [ 94 161]]


All of the above models appear to perform slightly better than the logistic regression model, but by less than one percentage point (64.17% to 64.57%, versus 63.77%).
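
To put "very little" in perspective, the sketch below compares the spread between the models with a rough binomial standard error for an accuracy measured on 494 games. The counts come from the confusion matrices above. The standard error uses the normal approximation and treats games as independent, so it is only a rough guide; the models are also scored on the same games, which this simple check ignores.

```python
import math

n_games = 494  # total of each confusion matrix above

# Correct predictions (diagonal sums of the confusion matrices above).
correct = {
    'LogisticRegression': 154 + 161,
    'DecisionTreeClassifier': 154 + 165,
    'RandomForestClassifier': 152 + 165,
    'ExtraTreesClassifier': 153 + 165,
    'XGBClassifier': 156 + 161,
}
accuracies = {name: hits / n_games for name, hits in correct.items()}

# Rough standard error of a single accuracy estimate (normal approximation).
p = accuracies['LogisticRegression']
std_err = math.sqrt(p * (1 - p) / n_games)

# Gap between the best and worst model.
spread = max(accuracies.values()) - min(accuracies.values())

print('standard error: {:.4f}'.format(std_err))
print('best - worst:   {:.4f}'.format(spread))
```

The gap between the best and worst model is smaller than one standard error, so these results alone do not show that any of the models is actually better than the others.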

# Summary

Ultimately, the reason none of the models generalize very well is likely overfitting in the modeling stage. I believe the overfitting originates in the data manipulation step for the original data, where the best rolling window was decided upon. Two things in that step likely contributed to it.

First, the data was scaled to minimize the total sum of the differences. Scaling the data in advance could have leaked information to the models (even though it was inverse scaled as well).

Second, the calculation that chose the rolling window size looked for the number of games that led to a result most similar to the values of the actual game being played. This is likely the greatest culprit: by mimicking the actual game data as closely as it can, that step leaks the data to the model in the modeling stage. The result is pretty good accuracy within the training data but, as seen in this section, models that do not generalize.
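
One common guard against this kind of leakage is to hold out the most recent season and make every tuning decision (scaling, window size, hyperparameters) using only the earlier seasons. The sketch below shows that pattern on synthetic data. It is not what the earlier notebooks did, and the `season` column, the feature values and the outcomes are all invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic games: one numeric feature and a noisy win/loss outcome.
rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)
games = pd.DataFrame({
    'season': np.repeat([2016, 2017, 2018], n // 3),  # hypothetical season label
    'spread_Diff': x,
    'W_A': (x + rng.normal(size=n) > 0).astype(int),
})

# Chronological split: earlier seasons for fitting, the latest season held out.
train = games[games.season < 2018]
valid = games[games.season == 2018]

# The pipeline fits the scaler on the training seasons only, so nothing about
# the held-out season reaches the model before scoring.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(train[['spread_Diff']], train['W_A'])
holdout_score = pipe.score(valid[['spread_Diff']], valid['W_A'])

print('held-out season accuracy: {:.3f}'.format(holdout_score))
```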

The best next step is probably to try to get cleaner and simpler predictive data. There are numerous statistics and variables for a single basketball game, and it is likely that some are more predictive than the features I accumulated in this project. Unfortunately, the models created here don't generalize very well and are not likely to be very useful in the real world. However, much has been learned through this project.
