# Predicting NBA Game Outcomes with Machine Learning - Notebook 1
Ishan Sheth, Independent Study, Fall 2021

In this project, I used data from nba.com and machine learning algorithms to create a model that can predict the outcome of NBA regular season games. I chose to just focus on the regular season due to the vastly different strategies and lineups in the playoffs. To create my models, I used five different classification algorithms (Logistic Regression, K-Nearest Neighbors, Gradient Boosting Classifier, Support Vector Classification, and Random Forest Classifier) and tested various parameters in an attempt to accurately predict wins and losses.

## Part 1: Data Collection and Modification
This part of the project was the longest and most time-consuming. I used the nba_api to collect basic and advanced stats for every team's games in the 2018, 2019, and 2020 seasons. I then took the average stats for each team's last 7 games. This way, I could see how a team was performing entering a certain game. I chose 7 games so I could account for factors such as injuries. For example, if a good team loses its star player to an injury, its statistics without that player will most likely be worse than before. Taking a moving average of the last 7 games accounts for this, providing a good estimate of recent team performance entering a game.
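
As a rough illustration of the 7-game moving average described above, here is a minimal sketch on a toy game log. The point values and dates are made up, the column name `PTS_LAST7` is hypothetical, and this is not necessarily how the averages were computed for the project. The `shift(1)` is an assumption about what "entering a game" means: each row only averages games played before it.

```python
import pandas as pd

# Toy game log for one team (made-up scores), one row per game.
games = pd.DataFrame({
    "TEAM_ABBREVIATION": ["BOS"] * 9,
    "GAME_DATE": pd.date_range("2018-10-16", periods=9, freq="2D"),
    "PTS": [105, 101, 103, 108, 109, 95, 117, 96, 115],
})

# Games must be in chronological order within each team.
games = games.sort_values(["TEAM_ABBREVIATION", "GAME_DATE"])

# shift(1) leaves out the current game, so the average only uses
# earlier games; rolling(7) then averages the 7 games before it.
games["PTS_LAST7"] = (
    games.groupby("TEAM_ABBREVIATION")["PTS"]
    .transform(lambda s: s.shift(1).rolling(window=7).mean())
)

print(games[["GAME_DATE", "PTS", "PTS_LAST7"]])
```

With this setup, a team's first 7 games have no average yet (NaN), because fewer than 7 earlier games exist.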

### 1.1 - Import Libraries and nba_api
Here, I imported the Python libraries that I will use for this part of the project. pandas, NumPy, and Seaborn are common data science libraries, while nba_api and requests allow me to get the data that I need.

In [3]:
import nba_api as nba
import pandas as pd
import numpy as np
import requests
import seaborn as sn

In [4]:
from nba_api.stats.endpoints import leaguegamelog
from nba_api.stats.endpoints import boxscoreadvancedv2

### 1.2 - Create First Part of Dataset
In this part, I take data from the 2018, 2019, and 2020 seasons to get game statistics for each game and team. While these stats will be helpful, there are also advanced stats that can be harvested. A combination of these basic and advanced stats will create a robust dataset and eventually, a better model.

In [5]:
seasons = [2018, 2019, 2020]
new_df = pd.DataFrame()
for year in seasons:
    # One row per team per game for the given season
    df = leaguegamelog.LeagueGameLog(season=year).get_data_frames()[0]
    # Encode the result as 1 (win) or 0 (loss)
    df['WL'] = df['WL'].map({'W': 1, 'L': 0})
    # Keep the two rows of each game next to each other
    df = df.sort_values(['GAME_ID', 'TEAM_NAME'])
    # All columns are kept here; identifier columns are still needed for joining
    new_df = pd.concat([new_df, df])
new_df.shape

(6738, 29)

In [6]:
new_df.head()

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,FGM,...,DREB,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,VIDEO_AVAILABLE
3,22018,1610612738,BOS,Boston Celtics,21800001,2018-10-16,BOS vs. PHI,1,240,42,...,43,55,21,7,5,15,20,105,18,1
2,22018,1610612755,PHI,Philadelphia 76ers,21800001,2018-10-16,PHI @ BOS,0,240,34,...,41,47,18,8,5,16,20,87,-18,1
1,22018,1610612744,GSW,Golden State Warriors,21800002,2018-10-16,GSW vs. OKC,1,240,42,...,41,58,28,7,7,21,29,108,8,1
0,22018,1610612760,OKC,Oklahoma City Thunder,21800002,2018-10-16,OKC @ GSW,0,240,33,...,29,45,21,12,6,15,21,100,-8,1
22,22018,1610612766,CHA,Charlotte Hornets,21800003,2018-10-17,CHA vs. MIL,0,240,41,...,32,41,21,8,9,11,19,112,-1,1


### 1.3 - Create Advanced Dataset and Combine it with the One Above
The next step is to create the advanced stats dataset that I mentioned above. I will get advanced stats such as "Possessions per Game" and "Offensive Rating" (points scored per 100 possessions) for each team and game. While it did not take much code to create the dataset and add it to the one above, perfecting this process took multiple attempts, and I tried numerous strategies before landing on the code below.
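
To make the "points scored per 100 possessions" definition concrete, here is a small worked sketch. The helper name `rating_per_100` is hypothetical, and the inputs are the PTS and POSS values of the first two rows (BOS and PHI) in the combined table later in this notebook. How the API estimates possessions is not shown in this notebook, so this only checks the final ratio.

```python
def rating_per_100(points: float, possessions: float) -> float:
    """Points scored per 100 possessions."""
    return 100.0 * points / possessions


# PTS and POSS taken from the BOS and PHI rows of game 21800001.
bos = rating_per_100(105, 105)
phi = rating_per_100(87, 106)

print(round(bos, 1), round(phi, 1))
```

Rounded to one decimal place, these match the OFF_RATING values reported for the same two rows (100.0 and 82.1).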

In [7]:
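# newlist holds every GAME_ID (each game appears twice, once per team);
# newlist2 will hold each GAME_ID once.
newlist = []
newlist2 = []
for games in new_df['GAME_ID']:
    newlist.append(games)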

In [8]:
# Keep the first occurrence of each GAME_ID, preserving order
for i in newlist:
    if i not in newlist2:
        newlist2.append(i)
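
The two loops above build a list of unique game IDs in first-seen order. A shorter way to get the same result is sketched below on a few sample IDs; the variable names `game_ids` and `unique_ids` are illustrative only.

```python
# Sample IDs: each game appears twice, once per team.
game_ids = ["21800001", "21800001", "21800002", "21800002", "21800003"]

# dict keys are unique and keep insertion order, so this removes
# repeats while preserving the order of first appearance.
unique_ids = list(dict.fromkeys(game_ids))

print(unique_ids)
```

On a DataFrame column, `new_df['GAME_ID'].unique()` gives the same first-appearance ordering, and both approaches avoid the repeated `in` check on a growing list.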

In [9]:
df4 = pd.DataFrame()

In [None]:
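for gid in newlist2:
    # Index 1 of the returned frames is used here for the team-level advanced stats
    new_iter = boxscoreadvancedv2.BoxScoreAdvancedV2(game_id=gid).get_data_frames()[1]
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df4 = pd.concat([df4, new_iter], ignore_index=True)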
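
A loop like this makes one request per game, so a single failed request can interrupt a long run. The sketch below shows one way to add a pause and a few retries. It assumes the endpoint may occasionally time out or throttle requests. The names `fetch_all`, `pause_seconds`, `max_retries`, and `flaky_fetch` are hypothetical and not part of nba_api.

```python
import time


def fetch_all(game_ids, fetch, pause_seconds=0.6, max_retries=3):
    """Call fetch(gid) for every ID, pausing between calls and retrying failures."""
    results = []
    for gid in game_ids:
        for attempt in range(1, max_retries + 1):
            try:
                results.append(fetch(gid))
                break
            except Exception:
                # Give up after the last allowed attempt.
                if attempt == max_retries:
                    raise
                # Wait a little longer after each failure.
                time.sleep(pause_seconds * attempt)
        # Short pause between games to avoid hammering the server.
        time.sleep(pause_seconds)
    return results


# Stand-in for a real request: fails once, then succeeds.
calls = {"n": 0}


def flaky_fetch(gid):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("simulated timeout")
    return {"GAME_ID": gid}


frames = fetch_all(["21800001", "21800002"], flaky_fetch, pause_seconds=0)
print(frames)
```

In this notebook, `fetch` could be a small function wrapping the `BoxScoreAdvancedV2(game_id=gid).get_data_frames()[1]` call, with the returned list passed to `pd.concat` once at the end.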

In [17]:
df4 = df4.sort_values(['GAME_ID', 'TEAM_ABBREVIATION'])


In [147]:
df4 = df4.reset_index()

In [160]:
new_df.shape

(6738, 30)

In [159]:
df4 = df4.drop(labels=[1,3,5,7,9,11,13,15], axis=0)
df4.shape

(6738, 30)
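
The hard-coded labels above appear to remove rows that were fetched twice. That is an assumption based on the gaps in the `index` column of the output further down. If so, a sketch of a label-free alternative is to drop duplicates by game and team. The toy frame `adv` and its values are illustrative.

```python
import pandas as pd

# Toy advanced-stats frame in which the first game was fetched twice.
adv = pd.DataFrame({
    "GAME_ID": ["21800001", "21800001", "21800001", "21800001",
                "21800002", "21800002"],
    "TEAM_ABBREVIATION": ["BOS", "BOS", "PHI", "PHI", "GSW", "OKC"],
    "OFF_RATING": [100.0, 100.0, 82.1, 82.1, 104.9, 97.1],
})

# Keep one row per (game, team) pair, whatever its index label is.
deduped = (
    adv.drop_duplicates(subset=["GAME_ID", "TEAM_ABBREVIATION"], keep="first")
    .reset_index(drop=True)
)

print(deduped)
```

This does not depend on where the repeated rows sit, so it keeps working if the data is collected again in a different order.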

In [161]:
new_df = new_df.sort_values(['GAME_ID', 'TEAM_ABBREVIATION'])
new_df = new_df.reset_index()
new_df.head(50)

Unnamed: 0,level_0,index,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,...,DREB,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,VIDEO_AVAILABLE
0,0,3,22018,1610612738,BOS,Boston Celtics,21800001,2018-10-16,BOS vs. PHI,1,...,43,55,21,7,5,15,20,105,18,1
1,1,2,22018,1610612755,PHI,Philadelphia 76ers,21800001,2018-10-16,PHI @ BOS,0,...,41,47,18,8,5,16,20,87,-18,1
2,2,1,22018,1610612744,GSW,Golden State Warriors,21800002,2018-10-16,GSW vs. OKC,1,...,41,58,28,7,7,21,29,108,8,1
3,3,0,22018,1610612760,OKC,Oklahoma City Thunder,21800002,2018-10-16,OKC @ GSW,0,...,29,45,21,12,6,15,21,100,-8,1
4,4,22,22018,1610612766,CHA,Charlotte Hornets,21800003,2018-10-17,CHA vs. MIL,0,...,32,41,21,8,9,11,19,112,-1,1
5,5,23,22018,1610612749,MIL,Milwaukee Bucks,21800003,2018-10-17,MIL @ CHA,1,...,46,57,26,5,4,21,25,113,1,1
6,6,17,22018,1610612751,BKN,Brooklyn Nets,21800004,2018-10-17,BKN @ DET,0,...,34,39,28,9,5,19,23,100,-3,1
7,7,16,22018,1610612765,DET,Detroit Pistons,21800004,2018-10-17,DET vs. BKN,1,...,32,46,21,5,5,17,20,103,3,1
8,8,19,22018,1610612754,IND,Indiana Pacers,21800005,2018-10-17,IND vs. MEM,1,...,44,57,29,2,7,20,24,111,28,1
9,9,18,22018,1610612763,MEM,Memphis Grizzlies,21800005,2018-10-17,MEM @ IND,0,...,21,28,16,11,3,10,18,83,-28,1


In [164]:
df4 = df4.sort_values(['GAME_ID', 'TEAM_ABBREVIATION'])
df4 = df4.reset_index()
df4.head(50)

Unnamed: 0,level_0,index,GAME_ID,TEAM_ID,TEAM_NAME,TEAM_ABBREVIATION,TEAM_CITY,MIN,E_OFF_RATING,OFF_RATING,...,TM_TOV_PCT,EFG_PCT,TS_PCT,USG_PCT,E_USG_PCT,E_PACE,PACE,PACE_PER40,POSS,PIE
0,0,1,21800001,1610612738,Celtics,BOS,Boston,240:00,98.9,100.0,...,14.3,0.49,0.509,1.0,0.198,106.64,105.5,87.92,105,0.595
1,2,0,21800001,1610612755,76ers,PHI,Philadelphia,240:00,81.2,82.1,...,15.1,0.42,0.448,1.0,0.2,106.64,105.5,87.92,106,0.405
2,4,3,21800002,1610612744,Warriors,GSW,Golden State,240:00,101.0,104.9,...,20.4,0.479,0.525,1.0,0.199,106.6,103.0,85.83,103,0.582
3,6,2,21800002,1610612760,Thunder,OKC,Oklahoma City,240:00,94.1,97.1,...,14.6,0.418,0.466,1.0,0.2,106.6,103.0,85.83,103,0.418
4,8,5,21800003,1610612766,Hornets,CHA,Charlotte,240:00,108.0,107.7,...,10.6,0.533,0.551,1.0,0.199,103.74,103.5,86.25,104,0.473
5,10,4,21800003,1610612749,Bucks,MIL,Milwaukee,240:00,108.9,109.7,...,20.4,0.576,0.602,1.0,0.2,103.74,103.5,86.25,103,0.527
6,12,7,21800004,1610612751,Nets,BKN,Brooklyn,240:00,94.6,99.0,...,18.8,0.518,0.545,1.0,0.195,105.18,100.5,83.75,101,0.523
7,14,6,21800004,1610612765,Pistons,DET,Detroit,240:00,98.4,103.0,...,17.0,0.457,0.506,1.0,0.192,105.18,100.5,83.75,100,0.477
8,16,17,21800005,1610612754,Pacers,IND,Indiana,240:00,116.0,115.6,...,20.8,0.627,0.626,1.0,0.199,97.52,95.5,79.58,96,0.714
9,17,16,21800005,1610612763,Grizzlies,MEM,Memphis,240:00,83.6,87.4,...,10.5,0.357,0.431,1.0,0.196,97.52,95.5,79.58,95,0.286


The dataset below (big_df) is my new combined dataset with basic and advanced stats for thousands of games.

In [166]:
big_df = pd.concat([new_df, df4], axis=1)
pd.set_option('display.max_columns', None)
big_df.head(50)

Unnamed: 0,level_0,index,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,VIDEO_AVAILABLE,level_0.1,index.1,GAME_ID.1,TEAM_ID.1,TEAM_NAME.1,TEAM_ABBREVIATION.1,TEAM_CITY,MIN.1,E_OFF_RATING,OFF_RATING,E_DEF_RATING,DEF_RATING,E_NET_RATING,NET_RATING,AST_PCT,AST_TOV,AST_RATIO,OREB_PCT,DREB_PCT,REB_PCT,E_TM_TOV_PCT,TM_TOV_PCT,EFG_PCT,TS_PCT,USG_PCT,E_USG_PCT,E_PACE,PACE,PACE_PER40,POSS,PIE
0,0,3,22018,1610612738,BOS,Boston Celtics,21800001,2018-10-16,BOS vs. PHI,1,240,42,97,0.433,11,37,0.297,10,14,0.714,12,43,55,21,7,5,15,20,105,18,1,0,1,21800001,1610612738,Celtics,BOS,Boston,240:00,98.9,100.0,81.2,82.1,17.7,17.9,0.5,1.4,15.1,0.25,0.825,0.54,14.13,14.3,0.49,0.509,1.0,0.198,106.64,105.5,87.92,105,0.595
1,1,2,22018,1610612755,PHI,Philadelphia 76ers,21800001,2018-10-16,PHI @ BOS,0,240,34,87,0.391,5,26,0.192,14,23,0.609,6,41,47,18,8,5,16,20,87,-18,1,2,0,21800001,1610612755,76ers,PHI,Philadelphia,240:00,81.2,82.1,98.9,100.0,-17.7,-17.9,0.529,1.13,13.7,0.175,0.75,0.46,14.937,15.1,0.42,0.448,1.0,0.2,106.64,105.5,87.92,106,0.405
2,2,1,22018,1610612744,GSW,Golden State Warriors,21800002,2018-10-16,GSW vs. OKC,1,240,42,95,0.442,7,26,0.269,17,18,0.944,17,41,58,28,7,7,21,29,108,8,1,4,3,21800002,1610612744,Warriors,GSW,Golden State,240:00,101.0,104.9,94.1,97.1,6.9,7.8,0.667,1.33,18.4,0.415,0.652,0.546,19.641,20.4,0.479,0.525,1.0,0.199,106.6,103.0,85.83,103,0.582
3,3,0,22018,1610612760,OKC,Oklahoma City Thunder,21800002,2018-10-16,OKC @ GSW,0,240,33,91,0.363,10,37,0.27,24,37,0.649,16,29,45,21,12,6,15,21,100,-8,1,6,2,21800002,1610612760,Thunder,OKC,Oklahoma City,240:00,94.1,97.1,101.0,104.9,-6.9,-7.8,0.636,1.4,14.7,0.348,0.585,0.454,14.114,14.6,0.418,0.466,1.0,0.2,106.6,103.0,85.83,103,0.418
4,4,22,22018,1610612766,CHA,Charlotte Hornets,21800003,2018-10-17,CHA vs. MIL,0,240,41,92,0.446,16,38,0.421,14,22,0.636,9,32,41,21,8,9,11,19,112,-1,1,8,5,21800003,1610612766,Hornets,CHA,Charlotte,240:00,108.0,107.7,108.9,109.7,-0.8,-2.0,0.512,1.91,15.7,0.193,0.696,0.417,10.61,10.6,0.533,0.551,1.0,0.199,103.74,103.5,86.25,104,0.473
5,5,23,22018,1610612749,MIL,Milwaukee Bucks,21800003,2018-10-17,MIL @ CHA,1,240,42,85,0.494,14,34,0.412,15,20,0.75,11,46,57,26,5,4,21,25,113,1,1,10,4,21800003,1610612749,Bucks,MIL,Milwaukee,240:00,108.9,109.7,108.0,107.7,0.8,2.0,0.619,1.24,18.5,0.304,0.807,0.583,20.231,20.4,0.576,0.602,1.0,0.2,103.74,103.5,86.25,103,0.527
6,6,17,22018,1610612751,BKN,Brooklyn Nets,21800004,2018-10-17,BKN @ DET,0,240,40,82,0.488,5,27,0.185,15,22,0.682,5,34,39,28,9,5,19,23,100,-3,1,12,7,21800004,1610612751,Nets,BKN,Brooklyn,240:00,94.6,99.0,98.4,103.0,-3.8,-4.0,0.7,1.47,20.2,0.261,0.655,0.475,17.979,18.8,0.518,0.545,1.0,0.195,105.18,100.5,83.75,101,0.523
7,7,16,22018,1610612765,DET,Detroit Pistons,21800004,2018-10-17,DET vs. BKN,1,240,39,92,0.424,6,24,0.25,19,22,0.864,14,32,46,21,5,5,17,20,103,3,1,14,6,21800004,1610612765,Pistons,DET,Detroit,240:00,98.4,103.0,94.6,99.0,3.8,4.0,0.538,1.24,15.0,0.345,0.739,0.525,16.24,17.0,0.457,0.506,1.0,0.192,105.18,100.5,83.75,100,0.477
8,8,19,22018,1610612754,IND,Indiana Pacers,21800005,2018-10-17,IND vs. MEM,1,240,47,83,0.566,10,26,0.385,7,13,0.538,13,44,57,29,2,7,20,24,111,28,1,16,17,21800005,1610612754,Pacers,IND,Indiana,240:00,116.0,115.6,83.6,87.4,32.4,28.3,0.617,1.45,21.1,0.436,0.758,0.634,20.894,20.8,0.627,0.626,1.0,0.199,97.52,95.5,79.58,96,0.714
9,9,18,22018,1610612763,MEM,Memphis Grizzlies,21800005,2018-10-17,MEM @ IND,0,240,25,84,0.298,10,29,0.345,23,28,0.821,7,21,28,16,11,3,10,18,83,-28,1,17,16,21800005,1610612763,Grizzlies,MEM,Memphis,240:00,83.6,87.4,116.0,115.6,-32.4,-28.3,0.64,1.6,13.1,0.242,0.564,0.366,10.068,10.5,0.357,0.431,1.0,0.196,97.52,95.5,79.58,95,0.286
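
Concatenating with `axis=1` pairs rows by position, so it relies on both frames being sorted identically. A sketch of a key-based alternative is shown below on two toy frames. The names `basic`, `advanced`, and `merged` are illustrative, and the sketch assumes `GAME_ID` and `TEAM_ID` have the same dtype in both frames.

```python
import pandas as pd

basic = pd.DataFrame({
    "GAME_ID": ["21800001", "21800001"],
    "TEAM_ID": [1610612738, 1610612755],
    "PTS": [105, 87],
})

advanced = pd.DataFrame({
    "GAME_ID": ["21800001", "21800001"],
    "TEAM_ID": [1610612755, 1610612738],  # deliberately in a different row order
    "OFF_RATING": [82.1, 100.0],
})

# Match rows on game and team instead of on row position.
# validate raises an error if either side has repeated keys.
merged = basic.merge(
    advanced,
    on=["GAME_ID", "TEAM_ID"],
    how="inner",
    validate="one_to_one",
)

print(merged)
```

Merging on keys also keeps a single copy of `GAME_ID` and `TEAM_ID` rather than two.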


Below, I am printing the columns in big_df to see the statistics that I will be working with.

In [103]:
for cols in big_df.columns:
    print(cols)

index
SEASON_ID
TEAM_ID
TEAM_ABBREVIATION
TEAM_NAME
GAME_ID
GAME_DATE
MATCHUP
WL
MIN
FGM
FGA
FG_PCT
FG3M
FG3A
FG3_PCT
FTM
FTA
FT_PCT
OREB
DREB
REB
AST
STL
BLK
TOV
PF
PTS
PLUS_MINUS
VIDEO_AVAILABLE
index
GAME_ID
TEAM_ID
TEAM_NAME
TEAM_ABBREVIATION
TEAM_CITY
MIN
E_OFF_RATING
OFF_RATING
E_DEF_RATING
DEF_RATING
E_NET_RATING
NET_RATING
AST_PCT
AST_TOV
AST_RATIO
OREB_PCT
DREB_PCT
REB_PCT
E_TM_TOV_PCT
TM_TOV_PCT
EFG_PCT
TS_PCT
USG_PCT
E_USG_PCT
E_PACE
PACE
PACE_PER40
POSS
PIE
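
Several names in this list appear twice (for example GAME_ID, TEAM_ID, and MIN) because both source frames contained them. A minimal sketch for spotting and trimming repeated column names follows. The toy frames `left` and `right` are illustrative, and keeping the first occurrence is just one possible choice.

```python
import pandas as pd

left = pd.DataFrame({"GAME_ID": ["21800001"], "MIN": [240]})
right = pd.DataFrame({"GAME_ID": ["21800001"], "MIN": ["240:00"], "PACE": [105.5]})

# Side-by-side concatenation keeps both copies of shared column names.
combined = pd.concat([left, right], axis=1)

# Names that occur more than once (every occurrence after the first).
dupes = combined.columns[combined.columns.duplicated()].tolist()

# Keep only the first occurrence of each column name.
trimmed = combined.loc[:, ~combined.columns.duplicated()]

print(dupes)
print(list(trimmed.columns))
```

With repeated names, selecting `combined["MIN"]` returns both columns at once, which is easy to miss later on.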


### 1.4 - Exporting New Dataframe as CSV File
Finally, I saved big_df as a CSV file so I can load it in a new notebook. A CSV file can be opened in Microsoft Excel and read back into Python with pandas. This is a good checkpoint in my code, and exporting the data ensures that I have the initial dataset if anything goes wrong.

In [167]:
big_df.to_csv('big_df.csv', index=False)
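
When the file is loaded again, pandas infers column types from the text. The sketch below round-trips a one-row sample through CSV in memory. It assumes `GAME_ID` is stored as a zero-padded string, and the variable names are illustrative. Passing `dtype` for that column is one way to keep the padding.

```python
import io

import pandas as pd

sample = pd.DataFrame({
    "GAME_ID": ["0021800001"],
    "WL": [1],
    "OFF_RATING": [100.0],
})

# to_csv returns the CSV text when no path is given.
csv_text = sample.to_csv(index=False)

# Default parsing reads the ID as an integer and drops the leading zeros.
naive = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to str keeps the ID exactly as it was written.
restored = pd.read_csv(io.StringIO(csv_text), dtype={"GAME_ID": str})

print(naive["GAME_ID"].iloc[0], restored["GAME_ID"].iloc[0])
```

The same `dtype` argument can be used with `pd.read_csv('big_df.csv', ...)` in the next notebook.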