# NBA Game predictor

## Consists of the Model and the App

### Goal of the Model:

**Given two teams, their per-game stats, and other factors, output the probability that each team wins**

### Goal of the App:

**Given two teams and the day before the game (Home, Away, Day), output the probability that each team wins according to the models**

## Steps:

1. Creating The Model
    1. Retrieve data using the [nba_api](https://github.com/swar/nba_api)
    2. Adjust data and add features so that each row represents one game, with each team's per-game stats before the game and the outcome of the game
    3. Train and test the model on this data
2. Creating The App
    1. Retrieve the current year's game logs
    2. Adjust data and add features so each row includes the team's most up-to-date per-game stats
    3. Create a function that takes in two teams and a day and outputs the probabilities


## 1. Creating The Model

### A. Retrieve data using nba_api

**First let's import our necessary libraries**

In [4]:
from nba_api.stats.endpoints import teamgamelog
from nba_api.stats.static import teams
from time import sleep

- We will be using each team's game logs for a particular season
- We also need a list of all teams to iterate over
- And a sleep function so we don't get timed out

**While we are at it, let's import our other libraries too**

In [5]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

- We use pandas for data analysis
- Numpy for Linear Algebra
- Seaborn for data visualization
- Matplotlib for data visualization

**Next let's get a list of dictionaries with each team's info**

In [3]:
team_dict = teams.get_teams()
team_dict[0]

{'id': 1610612737,
 'full_name': 'Atlanta Hawks',
 'abbreviation': 'ATL',
 'nickname': 'Hawks',
 'city': 'Atlanta',
 'state': 'Atlanta',
 'year_founded': 1949}

**Now we can iterate through each team for a selected season and create a dataframe of all teams for that season**

In [4]:
df = pd.DataFrame()
for team in team_dict:
    #print(team["full_name"])
    gamelogobj = teamgamelog.TeamGameLog(team_id = str(team["id"]), season = "2014")
    gamelogdf = gamelogobj.get_data_frames()[0]
    df = pd.concat([df, gamelogdf])
    sleep(.750) #Need to pause between loops so no timeout occurs
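
The loop above calls `pd.concat` on every iteration, which copies the growing dataframe each time. A minimal sketch of an alternative collects the per-team frames in a list and concatenates once. The helper name `collect_game_logs` and its `fetch_log` parameter are hypothetical, introduced here only so the pattern can be shown without a network call.

```python
from time import sleep

import pandas as pd


def collect_game_logs(team_ids, fetch_log, pause=0.75):
    """Fetch one game-log dataframe per team and concatenate them once.

    fetch_log is any callable that takes a team id and returns a dataframe
    (hypothetical parameter, not part of nba_api).
    """
    frames = []
    for team_id in team_ids:
        frames.append(fetch_log(team_id))
        sleep(pause)  # pause between requests, as in the loop above
    # The original per-team index is kept on purpose: it is renamed to
    # GAMES_LEFT later in the notebook.
    return pd.concat(frames)


# Possible usage with the call from the cell above (requires network access):
# fetch = lambda tid: teamgamelog.TeamGameLog(
#     team_id=str(tid), season="2014").get_data_frames()[0]
# df = collect_game_logs([t["id"] for t in team_dict], fetch)
```

The result should match what the original loop produces; only the way the frames are accumulated differs.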

**A known limitation of the current setup in this notebook:** \
To build a complete dataframe covering multiple seasons, you have to run this section once per season, \
adjust each season's data and features individually, \
and then concatenate the resulting dataframes. \
\
This is because the aggregation of previous stats only looks at Game_ID (or games left), so stats from other seasons would be aggregated as well. \
\
In the future this could be fixed by wrapping the functions in a for loop that iterates over each season you want, \
or by adding a season feature and adjusting the if statements to also match the correct season. \
Another option is to use Game_ID, which is formatted to include the last two digits of the year and a unique game number for that year. \
\
For now, you can download the complete dataframe (seasons 2014-2018) and skip the next section of the notebook to save time. \
You can read in the file below.

In [79]:
complete_df = pd.read_csv("complete_games")
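
The Game_ID idea mentioned above could be sketched as follows. This assumes Game_ID is a 10-character string such as "0021400001", with the two season digits at positions 3 and 4; that layout is an assumption to verify against the actual data. The function and the `SEASON_YY` column name are hypothetical.

```python
import pandas as pd


def add_season_column(games):
    """Add a SEASON_YY column taken from the Game_ID string.

    Assumes IDs like "0021400001": a 3-character prefix, 2 season digits,
    then the game number. Check this against your own data before relying on it.
    """
    out = games.copy()
    # IDs read back from a CSV may have lost their leading zeros, so pad them.
    game_id = out["Game_ID"].astype(str).str.zfill(10)
    out["SEASON_YY"] = game_id.str[3:5]
    return out
```

With such a column, the filters used later could additionally require a matching `SEASON_YY`, so that games from other seasons are not aggregated together.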

### B. Adjust Data and Add Features

#### The data we currently have:
- 2 rows for each game played
- 1 for each team in the game
- Whether the team won or lost
- How many X that team got in that game:
    - Points
    - Rebounds
    - Turnovers
    - Assists
    - Shots Made
    - Shots Missed


#### The stats we want:
- Points Per Game
- Rebounds Per Game
- Assists Per Game
- Turnovers Per Game
- Points Allowed Per Game
- Rebounds Allowed Per Game
- Assists Allowed Per Game
- Turnovers Allowed Per Game
- Field Goal Percent
- Whether the team played the day before (Back to Back)

#### The plan to get the stats we want:
We will walk through the games in the order they were played and, for each team, aggregate the stats we need. Only games played before the current game are included, since we don't want to use data from a game to predict that same game. Dividing each total by the number of games played gives the per-game averages. Each value is then placed in a Home or Away column depending on whether the team was home or away, and the other column is set to 0. Because every game has exactly one home team and one away team, summing the two rows when we group by game fills in both sides without any overlap.

**We first want to reset the index** \
\
The current index is what I call "GAMES_LEFT": more precisely, the number of games this team plays after this one in this season. \
\
This shouldn't be our index, since every team shares the same 82 values. \
\
After the reset, we rename the resulting column to "GAMES_LEFT".

In [77]:
df = df.reset_index()
df.rename(columns = {'index': "GAMES_LEFT"}, inplace = True)

**Next let's add an indicator for whether it's a home game, and a more user-friendly way to see who is playing in the game**

In [None]:
df["Home"] = df.MATCHUP.apply(lambda x: 0 if '@' in x else 1)
df["Team"] = df.MATCHUP.apply(lambda x: x.split(" ")[0])
df["Opp"] = df.MATCHUP.apply(lambda x: x.split(" ")[2]) 

**Now we need to change each team's record to what it was before the game, since we don't want to use POST-game data for PRE-game predictions**

In [None]:
for idx, series in df.iterrows():
    currentW = series.W
    currentL = series.L
    if series.WL =="W":
        
        df.at[idx,'W'] = currentW - 1
    else:
        df.at[idx,'L'] = currentL - 1
df["W_PCT"] = df["W"] / (df["L"] + df["W"])
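
The row-by-row adjustment above can also be written without `iterrows`. The following is a sketch of the same logic using column arithmetic; the function name `pregame_record` is hypothetical.

```python
import pandas as pd


def pregame_record(games):
    """Return a copy with W, L and W_PCT rolled back to their pre-game values."""
    out = games.copy()
    won = out["WL"].eq("W")
    # Remove the current game's result from the running record.
    out["W"] = out["W"] - won.astype(int)
    out["L"] = out["L"] - (~won).astype(int)
    # A team's first game has 0 wins and 0 losses, which yields NaN here.
    out["W_PCT"] = out["W"] / (out["W"] + out["L"])
    return out
```

As in the loop, any row whose WL is not "W" is treated as a loss.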

**Now we add an indicator on whether the home team won or not**

In [None]:
df["Home_Win"] = 2
for idx, series in df.iterrows():
    if df.at[idx, "Home"] == 1 and df.at[idx, "WL"] == "W":
        df.at[idx, "Home_Win"] = 1
    else:
        df.at[idx, "Home_Win"] = 0


**Create an indicator for whether the team played the day before**

In [None]:
df["Back_To_Back"] = 2
for idx, series in df.iterrows():
    Back_To_Back = 0
    if series.GAMES_LEFT != 81:
        if (int(df.loc[(idx + 1), "GAME_DATE"].split(" ")[1].split(",")[0]) + 1) == int(df.loc[idx, "GAME_DATE"].split(" ")[1].split(",")[0]):
            Back_To_Back = 1
    df.at[idx, "Back_To_Back"] = Back_To_Back
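
The check above compares only the day-of-month numbers, so a back-to-back that spans a month boundary (for example March 31 followed by April 1) is not detected. A sketch of a date-based alternative follows. It assumes GAME_DATE strings look like "APR 15, 2015", and the function name and the `GAME_DT` column are hypothetical.

```python
import pandas as pd


def add_back_to_back(games):
    """Flag games played exactly one calendar day after the team's previous game."""
    out = games.copy()
    # Normalise "APR 15, 2015" to "Apr 15, 2015" before parsing.
    out["GAME_DT"] = pd.to_datetime(out["GAME_DATE"].str.title(), format="%b %d, %Y")
    out = out.sort_values(["Team_ID", "GAME_DT"])
    # Days since the same team's previous game (NaN for the first game).
    days_since = out.groupby("Team_ID")["GAME_DT"].diff().dt.days
    out["Back_To_Back"] = (days_since == 1).astype(int)
    return out
```

Because the comparison uses full dates, it does not depend on the row order returned by the API or on the `GAMES_LEFT` value.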

**Sort that indicator based on whether the team is Home or Away**

In [None]:
df["Home_Team_B2B"] = 2
df["Away_Team_B2B"] = 2
for idx, series in df.iterrows():
    if df.at[idx, "Back_To_Back"] == 1 :
        if df.at[idx, "Home"] == 1:
            df.at[idx, "Home_Team_B2B"] = 1
            df.at[idx, "Away_Team_B2B"] = 0
        else:
            df.at[idx, "Away_Team_B2B"] = 1
            df.at[idx, "Home_Team_B2B"] = 0
    else:
        df.at[idx, "Away_Team_B2B"] = 0
        df.at[idx, "Home_Team_B2B"] = 0
        

**It's now time to add each team's per-game averages** \
We will do this by aggregating all stats from games with a matching Team_ID and a Game_ID lower than the current one, then dividing by the number of games played. \
\
Then we will sort these stats based on Home or Away

In [None]:
df["Home_Team_ID"] = 2 #Team_ID
df["Away_Team_ID"] = 2

df["Home_Team"] = "" #3-letter team identifier (CHI, NYK, GSW...)
df["Away_Team"] = ""

df["Home_Team_W_PCT"] = 2.0 #Win percentage before the game (wins / games played)
df["Away_Team_W_PCT"] = 2.0

df["Home_Team_PPG"] = -1.0 #Points per game
df["Away_Team_PPG"] = -1.0

df["Home_Team_RPG"] = -1.0 #Rebounds per game
df["Away_Team_RPG"] = -1.0

df["Home_FGPCT"] = -1.0 #Field Goal Percentage
df["Away_FGPCT"] = -1.0

df["Home_Team_APG"] = -1.0 #Assists per game
df["Away_Team_APG"] = -1.0

df["Home_Team_TOVPG"] = -1.0 #Turnovers per game
df["Away_Team_TOVPG"] = -1.0
for idx, series in df.iterrows():
    team = df.at[idx, "Team_ID"]
    gameID = df.at[idx, "Game_ID"]
    
    total_points = df.loc[(df["Team_ID"] == team) & (df["Game_ID"] < gameID), "PTS"].sum()
    total_rebounds = df.loc[(df["Team_ID"] == team) & (df["Game_ID"] < gameID), "REB"].sum()
    total_assists = df.loc[(df["Team_ID"] == team) & (df["Game_ID"] < gameID), "AST"].sum()
    total_tov = df.loc[(df["Team_ID"] == team) & (df["Game_ID"] < gameID), "TOV"].sum()
    
    PrevFGM = df.loc[(df["Team_ID"] == team) & (df["Game_ID"] < gameID), "FGM"].sum()
    PrevFGA = df.loc[(df["Team_ID"] == team) & (df["Game_ID"] < gameID), "FGA"].sum()
    FGPCT = PrevFGM / PrevFGA
    
    
    if df.at[idx, "Home"] == 1:
        df.at[idx, "Home_Team_ID"] = df.at[idx, "Team_ID"] 
        df.at[idx, "Away_Team_ID"] = 0
    
        df.at[idx, "Home_Team"] = df.at[idx, "Team"] 
        df.at[idx, "Away_Team"] = df.at[idx, "Opp"]
        
        df.at[idx, "Home_Team_W_PCT"] = df.at[idx, "W_PCT"]
        df.at[idx, "Away_Team_W_PCT"] = 0
        
        df.at[idx, "Home_Team_PPG"] = total_points / len(df[(df["Team_ID"] == team) & (df["Game_ID"] < gameID)])
        df.at[idx, "Away_Team_PPG"] = 0
        
        df.at[idx, "Home_Team_RPG"] = total_rebounds / len(df[(df["Team_ID"] == team) & (df["Game_ID"] < gameID)])
        df.at[idx, "Away_Team_RPG"] = 0
        
        df.at[idx,"Home_FGPCT"] = FGPCT
        df.at[idx,"Away_FGPCT"] = 0.0
        
        df.at[idx, "Home_Team_APG"] = total_assists / len(df[(df["Team_ID"] == team) & (df["Game_ID"] < gameID)])
        df.at[idx, "Away_Team_APG"] = 0
        
        df.at[idx, "Home_Team_TOVPG"] = total_tov / len(df[(df["Team_ID"] == team) & (df["Game_ID"] < gameID)])
        df.at[idx, "Away_Team_TOVPG"] = 0
    else:
        df.at[idx, "Home_Team_ID"] = 0
        df.at[idx, "Away_Team_ID"] = df.at[idx, "Team_ID"]
        
        df.at[idx, "Home_Team"] = df.at[idx, "Opp"]
        df.at[idx, "Away_Team"] = df.at[idx, "Team"]
        
        df.at[idx, "Home_Team_W_PCT"] = 0
        df.at[idx, "Away_Team_W_PCT"] = df.at[idx, "W_PCT"]
        
        df.at[idx, "Away_Team_PPG"] = total_points / len(df[(df["Team_ID"] == team) & (df["Game_ID"] < gameID)])
        df.at[idx, "Home_Team_PPG"] = 0
        
        df.at[idx, "Away_Team_RPG"] = total_rebounds / len(df[(df["Team_ID"] == team) & (df["Game_ID"] < gameID)])
        df.at[idx, "Home_Team_RPG"] = 0
        
        df.at[idx,"Home_FGPCT"] = 0.0
        df.at[idx,"Away_FGPCT"] = FGPCT
        
        df.at[idx, "Away_Team_APG"] = total_assists / len(df[(df["Team_ID"] == team) & (df["Game_ID"] < gameID)])
        df.at[idx, "Home_Team_APG"] = 0
        
        df.at[idx, "Away_Team_TOVPG"] = total_tov / len(df[(df["Team_ID"] == team) & (df["Game_ID"] < gameID)])
        df.at[idx, "Home_Team_TOVPG"] = 0
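
The per-row filtering above rescans the whole dataframe for every game. As a rough sketch, the same "average over earlier games only" idea can be expressed with grouped cumulative sums. This assumes a single season per dataframe, as in the rest of this section; the function name and the `_PER_GAME` suffix are hypothetical.

```python
import pandas as pd


def add_pregame_averages(games, stat_cols):
    """Add <stat>_PER_GAME columns averaged over each team's earlier games only."""
    out = games.sort_values(["Team_ID", "Game_ID"]).copy()
    # Number of games the team has already played before this one.
    games_before = out.groupby("Team_ID").cumcount()
    for col in stat_cols:
        # Running total including this game, minus this game's own value.
        total_before = out.groupby("Team_ID")[col].cumsum() - out[col]
        # Division by NaN (no earlier games) leaves NaN, like the 0/0 case above.
        out[col + "_PER_GAME"] = total_before / games_before.where(games_before > 0)
    return out
```

For field goal percentage, the same running totals of FGM and FGA could be divided by each other, matching the `PrevFGM / PrevFGA` calculation in the loop.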

**Similarly, we now add each team's allowed stats per game** \
This is done just like before, except that we need to find the opponent's stats for each game and aggregate those. \
\
And of course sort by Home/Away. \
\
*Note: These loops are not efficient and take some time to run*

In [None]:
df["Home_PAPG"] = -1.0 #Points allowed per game
df["Away_PAPG"] = -1.0

df["Home_RAPG"] = -1.0 #Rebounds allowed per game
df["Away_RAPG"] = -1.0

df["Home_AAPG"] = -1.0 #Assists allowed per game
df["Away_AAPG"] = -1.0

df["Home_TOVAPG"] = -1.0 #Turnovers allowed per game
df["Away_TOVAPG"] = -1.0
for idx, series in df.iterrows():
    team = df.at[idx, "Team_ID"]
    gameID = df.at[idx, "Game_ID"]
    
    PA = 0 #Points Allowed
    RA = 0 #Rebounds Allowed
    AA = 0 #Assists Allowed
    TOVA = 0 #Turnovers Allowed
    
    GP = 0 #Games Played
    for idx2, series2 in df.loc[(df["Team_ID"] == team) & (df["Game_ID"] < gameID)].iterrows():
        opp = series2.Opp
        gameID2 = series2.Game_ID
        
        PA += df.loc[(df["Team"] == opp) & (df["Game_ID"] == gameID2), "PTS"].sum()
        RA += df.loc[(df["Team"] == opp) & (df["Game_ID"] == gameID2), "REB"].sum()
        AA += df.loc[(df["Team"] == opp) & (df["Game_ID"] == gameID2), "AST"].sum()
        TOVA += df.loc[(df["Team"] == opp) & (df["Game_ID"] == gameID2), "TOV"].sum()
        
        GP += 1
        
    PAPG = 0 #Points allowed per game
    TOVAPG = 0 #Turnovers allowed per game
    RAPG = 0 #Rebounds allowed per game
    AAPG = 0 #Assists allowed per game
    
    if GP >0:
        PAPG = PA/ GP
        RAPG = RA/GP
        AAPG = AA/ GP
        TOVAPG = TOVA/ GP
        
    if df.at[idx, "Home"] == 1:
        df.at[idx,"Home_PAPG"] = PAPG
        df.at[idx,"Away_PAPG"] = 0.0
        
        df.at[idx,"Home_RAPG"] = RAPG
        df.at[idx,"Away_RAPG"] = 0.0
        
        df.at[idx,"Home_AAPG"] = AAPG
        df.at[idx,"Away_AAPG"] = 0.0
        
        df.at[idx,"Home_TOVAPG"] = TOVAPG
        df.at[idx,"Away_TOVAPG"] = 0.0
    else:
        df.at[idx,"Home_PAPG"] = 0.0
        df.at[idx,"Away_PAPG"] = PAPG
        
        df.at[idx,"Home_RAPG"] = 0.0
        df.at[idx,"Away_RAPG"] = RAPG
        
        df.at[idx,"Home_AAPG"] = 0.0
        df.at[idx,"Away_AAPG"] = AAPG
        
        df.at[idx,"Home_TOVAPG"] = 0.0
        df.at[idx,"Away_TOVAPG"] = TOVAPG
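
Much of the cost of the nested loop above comes from looking up the opponent's row for every earlier game. One possible shortcut, sketched here, is to attach each opponent's box-score values to the team's own row with a single self-merge on Game_ID. The function name and the `OPP_` prefix are hypothetical.

```python
import pandas as pd


def add_opponent_stats(games, stat_cols):
    """Attach the opponent's value of each stat as OPP_<stat> on the same row."""
    stat_cols = list(stat_cols)
    # Re-label every row as seen from the other side of the same game.
    renamed = {c: "OPP_" + c for c in stat_cols}
    renamed["Team"] = "Opp"
    opp = games[["Game_ID", "Team"] + stat_cols].rename(columns=renamed)
    # A row for team A with Opp == B picks up B's stats from the same Game_ID.
    return games.merge(opp, on=["Game_ID", "Opp"], how="left")
```

The resulting `OPP_PTS`, `OPP_REB`, and similar columns could then be averaged over earlier games in the same way as the team's own stats.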

**Next, we drop all the columns we do not need anymore**

In [None]:
df.drop(["GAMES_LEFT", "GAME_DATE", "MATCHUP", "Team_ID", "Team", "Opp", "W", "L", "WL", "Home", "Back_To_Back", 
         "W_PCT","PTS", "REB", "MIN", "FGM", "FGA", "FG_PCT", "FTM", "FTA", "FT_PCT", "OREB", "DREB", "AST", 
         "STL", "BLK", "TOV", "PF", "FG3M", "FG3A", "FG3_PCT"], inplace = True, axis = 1)

**All null values are caused by games played = 0, so we can fill them with 0, since every other value at the start of the season is also 0**

In [None]:
df=df.fillna(0)

**Finally we can group by Game_ID**\
*We also group by Home_Team and Away_Team so it's more user friendly*

In [None]:
grouped_df = df.groupby(by = ["Game_ID", "Home_Team", "Away_Team"]).sum()
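
To illustrate why the zero-filling makes this grouping work, here is a toy example with one game and made-up numbers (the values are illustrative, not taken from the data). Each of the two rows carries its own side's stats and zeros for the other side, so the sum yields one complete row.

```python
import pandas as pd

# Two rows for the same game: the home team's row and the away team's row.
rows = pd.DataFrame({
    "Game_ID": ["0021400001", "0021400001"],
    "Home_Team": ["CHI", "CHI"],
    "Away_Team": ["NYK", "NYK"],
    "Home_Team_PPG": [101.5, 0.0],  # filled on the home row only
    "Away_Team_PPG": [0.0, 98.2],   # filled on the away row only
    "Home_Win": [1, 0],             # only the home row can be 1
})

game = rows.groupby(by=["Game_ID", "Home_Team", "Away_Team"]).sum()
print(game)
```

The grouped frame has a single row per game, indexed by Game_ID, Home_Team, and Away_Team.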

**grouped_df covers one particular season. I ran the above code for 5 seasons (2014-2018).** \
I did not include the 2019-2020 season, since it was interrupted by Covid-19, or the 2020-2021 season, which was shortened to 72 games. I believe these seasons can still be useful; I just decided not to use them here. \
**I stored each season in its own dataframe, then concatenated everything and wrote the result to a CSV file (complete_games).**

In [None]:

#grouped14 = grouped_df.copy()
#grouped15 = grouped_df.copy()
#grouped16 = grouped_df.copy()
#grouped17 = grouped_df.copy()
#grouped18 = grouped_df.copy()

In [None]:
#complete_df = pd.concat([grouped14, grouped15, grouped16, grouped17, grouped18])

In [None]:
#complete_df.to_csv("complete_games")

### C. Train and Test the model

**First import all necessary libraries**

In [80]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

- We will use train_test_split to split the data
- classification_report to test the model
- confusion_matrix to help illustrate the results
- We will use 3 Models:
    - Logistic Regression
    - Logistic Regression with a balanced class weight
    - Random Forest Classifier
    

**First Split the data into the training and testing sets**

In [81]:
X = complete_df[[ "Home_Team_PPG", "Away_Team_PPG", "Home_Team_RPG", "Away_Team_RPG", "Home_RAPG", "Away_RAPG",
               "Home_FGPCT", "Away_FGPCT", "Home_PAPG", "Away_PAPG", "Home_Team_APG", "Away_Team_APG",
                "Home_AAPG", "Away_AAPG", "Home_Team_TOVPG", "Away_Team_TOVPG", "Home_TOVAPG", "Away_TOVAPG",
                "Home_Team_B2B", "Away_Team_B2B"]]
y= complete_df["Home_Win"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

**Now create and fit the models using the training data**

In [82]:
logmodel = LogisticRegression(max_iter= 1000)
logmodel.fit(X_train, y_train)

logmodel_bal = LogisticRegression(max_iter= 1000, class_weight='balanced')
logmodel_bal.fit(X_train, y_train)

rfc = RandomForestClassifier(n_estimators= 1500)
rfc.fit(X_train, y_train)


RandomForestClassifier(n_estimators=1500)

**Lastly create our predictions and test them**

In [83]:
predictions_log = logmodel.predict(X_test)
print("log")
print(classification_report(y_test, predictions_log))
print(confusion_matrix(y_test, predictions_log))
print("\n")
predictions_logbal = logmodel_bal.predict(X_test)
print("logbal")
print(classification_report(y_test, predictions_logbal))
print(confusion_matrix(y_test, predictions_logbal))
print("\n")
predictions_rfc = rfc.predict(X_test)
print("rfc")
print(classification_report(y_test, predictions_rfc))
print(confusion_matrix(y_test, predictions_rfc))
print("\n")

log
              precision    recall  f1-score   support

           0       0.66      0.41      0.50       871
           1       0.65      0.84      0.73      1159

    accuracy                           0.65      2030
   macro avg       0.65      0.62      0.62      2030
weighted avg       0.65      0.65      0.63      2030

[[353 518]
 [185 974]]


logbal
              precision    recall  f1-score   support

           0       0.59      0.62      0.60       871
           1       0.70      0.67      0.69      1159

    accuracy                           0.65      2030
   macro avg       0.65      0.65      0.65      2030
weighted avg       0.65      0.65      0.65      2030

[[541 330]
 [377 782]]


rfc
              precision    recall  f1-score   support

           0       0.62      0.42      0.50       871
           1       0.65      0.81      0.72      1159

    accuracy                           0.64      2030
   macro avg       0.63      0.61      0.61      2030
weighted 

Overall, the Logistic Regression model has 65% accuracy. Treating an away win (class 0) as the positive class, it has:

- 353 True Positives (correctly predicted away win)
- 185 False Positives (predicted away win when it was a home win)
- 974 True Negatives (correctly predicted home win)
- 518 False Negatives (predicted home win when it was an away win)

These results show that the model over-predicts home wins: it predicts a home win in about 73% of test games (1,492 of 2,030), while the home team actually won about 57% of them (1,159 of 2,030).

The Log model with balanced weights has a similar accuracy but predicts outcomes more evenly, with a home win predicted in only about 55% of games (1,112 of 2,030).

The Random Forest Classifier performed similarly to the original Log model, but slightly worse.
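
The accuracy and home-win shares discussed above can be recomputed directly from the printed confusion matrices. The sketch below copies the two logistic regression matrices from the output; the helper name `summarize` is hypothetical. In scikit-learn's `confusion_matrix`, rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Copied from the output above. Class 0 = away win, class 1 = home win.
cm_log = np.array([[353, 518], [185, 974]])
cm_logbal = np.array([[541, 330], [377, 782]])


def summarize(cm):
    """Return accuracy and the predicted and actual share of home wins."""
    total = cm.sum()
    return {
        "accuracy": np.trace(cm) / total,                # correct / all games
        "predicted_home_share": cm[:, 1].sum() / total,  # column 1 = predicted home win
        "actual_home_share": cm[1, :].sum() / total,     # row 1 = actual home win
    }


for name, cm in [("log", cm_log), ("logbal", cm_logbal)]:
    print(name, {k: round(float(v), 3) for k, v in summarize(cm).items()})
```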

## 2. Creating The App
### A. Retrieve the Data
We will gather the data the same way we did earlier but with the current season selected

In [84]:
team_dict = teams.get_teams()
df = pd.DataFrame()
for team in team_dict:
    #print(team["full_name"])
    gamelogobj = teamgamelog.TeamGameLog(team_id = str(team["id"]), season = "2021")
    gamelogdf = gamelogobj.get_data_frames()[0]
    df = pd.concat([df, gamelogdf])
    sleep(.750)

### B. Adjust Data and Add Features 
Again this will be just like before, with some slight adjustments, most notably:
1. We don't want to group this data, so we don't need to sort the values into Home/Away
2. We don't compute back-to-back games in the dataframe, but rather record the day of each team's last game
3. We include the current game in the per-game averages, since we aren't trying to predict its outcome but rather the game after it (which is not in the dataframe)



First let's reset the index like we did before, and rename the resulting column (named GAMES_LEFT above) to GAMES_PLAYED_AFTER

In [85]:
df.reset_index(inplace = True)
df = df.rename(columns = {'index': "GAMES_PLAYED_AFTER"})

Next we split up the two teams and extract the day of the month the game was played (the DD part of a `MMM DD, YYYY` date, which is what the split on the space and comma below expects)

In [86]:
df["Team"] = df.MATCHUP.apply(lambda x: x.split(" ")[0])
df["Opp"] = df.MATCHUP.apply(lambda x: x.split(" ")[2])
df["Day"] = df.GAME_DATE.apply(lambda x: x.split(" ")[1].split(",")[0])

Now let's calculate the per-game averages

In [87]:
df["PPG"] = -1.0
df["RPG"] = -1.0
df["FGPCT"] = -1.0
df["APG"] = -1.0
df["TOVPG"] = -1.0

for idx, series in df.iterrows():
    team = df.at[idx, "Team_ID"]
    gameID = df.at[idx, "Game_ID"]
    
    total_points = df.loc[(df["Team_ID"] == team) & (df["Game_ID"] <= gameID), "PTS"].sum()
    total_rebounds = df.loc[(df["Team_ID"] == team) & (df["Game_ID"] <= gameID), "REB"].sum()
    total_assists = df.loc[(df["Team_ID"] == team) & (df["Game_ID"] <= gameID), "AST"].sum()
    total_tov = df.loc[(df["Team_ID"] == team) & (df["Game_ID"] <= gameID), "TOV"].sum()
    
    PrevFGM = df.loc[(df["Team_ID"] == team) & (df["Game_ID"] <= gameID), "FGM"].sum()
    PrevFGA = df.loc[(df["Team_ID"] == team) & (df["Game_ID"] <= gameID), "FGA"].sum()
    FGPCT = PrevFGM / PrevFGA
    
    df.at[idx, "PPG"] = total_points / len(df[(df["Team_ID"] == team) & (df["Game_ID"] <= gameID)]) 
    df.at[idx, "RPG"] = total_rebounds / len(df[(df["Team_ID"] == team) & (df["Game_ID"] <= gameID)])
    df.at[idx,"FGPCT"] = FGPCT
    df.at[idx, "APG"] = total_assists / len(df[(df["Team_ID"] == team) & (df["Game_ID"] <= gameID)])   
    df.at[idx, "TOVPG"] = total_tov / len(df[(df["Team_ID"] == team) & (df["Game_ID"] <= gameID)])
    
    

Next are the per-game allowed averages. \
Since we only care about each team's most recent stats, we can speed this up by calculating these stats only for the latest game. \
GAMES_PLAYED_AFTER tracks how many games in the dataframe occur after a given game for each team, so a value of 0 marks the most recent game.

In [88]:
df["PAPG"] = -1.0
df["RAPG"] = -1.0
df["AAPG"] = -1.0
df["TOVAPG"] = -1.0

for idx, series in df.iterrows():
    team = df.at[idx, "Team_ID"]
    gameID = df.at[idx, "Game_ID"]
    
    PA = 0 #Points Allowed
    RA = 0 #Rebounds Allowed
    AA = 0 #Assists Allowed
    TOVA = 0 #Turnovers Allowed
    
    GP = 0 #Games Played
    if series.GAMES_PLAYED_AFTER == 0:
        for idx2, series2 in df.loc[(df["Team_ID"] == team) & (df["Game_ID"] <= gameID)].iterrows():
            opp = series2.Opp
            gameID2 = series2.Game_ID

            PA += df.loc[(df["Team"] == opp) & (df["Game_ID"] == gameID2), "PTS"].sum()
            RA += df.loc[(df["Team"] == opp) & (df["Game_ID"] == gameID2), "REB"].sum()
            AA += df.loc[(df["Team"] == opp) & (df["Game_ID"] == gameID2), "AST"].sum()
            TOVA += df.loc[(df["Team"] == opp) & (df["Game_ID"] == gameID2), "TOV"].sum()

            GP += 1

        PAPG = 0 #Points allowed per game
        TOVAPG = 0 #Turnovers allowed per game
        RAPG = 0 #Rebounds allowed per game
        AAPG = 0 #Assists allowed per game

        if GP >0:
            PAPG = PA/ GP
            RAPG = RA/GP
            AAPG = AA/ GP
            TOVAPG = TOVA/ GP


        df.at[idx,"PAPG"] = PAPG
        df.at[idx,"RAPG"] = RAPG
        df.at[idx,"AAPG"] = AAPG
        df.at[idx,"TOVAPG"] = TOVAPG

   

Then we will drop the unnecessary columns

In [89]:
df.drop(["MATCHUP", "Opp", "W", "L", "WL", "Game_ID", "GAME_DATE",
         "W_PCT","PTS", "REB", "MIN", "FGM", "FGA", "FG_PCT", "FTM", "FTA", "FT_PCT", "OREB", "DREB", "AST", 
         "STL", "BLK", "TOV", "PF", "FG3M", "FG3A", "FG3_PCT"], inplace = True, axis = 1)

Now, instead of grouping, we create a new dataframe containing only the most recent stats

In [90]:
current_stats = df[df["GAMES_PLAYED_AFTER"] == 0].copy()

And we can drop that column now

In [91]:
current_stats.drop(["GAMES_PLAYED_AFTER"], inplace = True, axis = 1)

In [92]:
#current_stats

### C. Create a function that takes in two teams and a day and outputs the probabilities
This is straightforward. We look up each team's values in current_stats, put them in an array in the same order as the training features, then pass that array to each of our models.

*Note: Day is the day of the month (DD) of the day before the game, passed as a string. Taking yesterday's day directly avoids having to subtract one from the game day, which would break on the first of each month.*

In [93]:
def Predict(Home, Away, Day):
    
    HomePPG = current_stats.loc[current_stats["Team"] == Home, "PPG"].sum()
    AwayPPG = current_stats.loc[current_stats["Team"] == Away, "PPG"].sum()
    
    HomeRPG = current_stats.loc[current_stats["Team"] == Home, "RPG"].sum()
    AwayRPG = current_stats.loc[current_stats["Team"] == Away, "RPG"].sum()
    
    HomeRAPG = current_stats.loc[current_stats["Team"] == Home, "RAPG"].sum()
    AwayRAPG = current_stats.loc[current_stats["Team"] == Away, "RAPG"].sum()
    
    Home_FGPCT = current_stats.loc[current_stats["Team"] == Home, "FGPCT"].sum()
    Away_FGPCT = current_stats.loc[current_stats["Team"] == Away, "FGPCT"].sum()
    
    Home_PAPG = current_stats.loc[current_stats["Team"] == Home, "PAPG"].sum()
    Away_PAPG = current_stats.loc[current_stats["Team"] == Away, "PAPG"].sum()
    
    Home_Team_APG = current_stats.loc[current_stats["Team"] == Home, "APG"].sum()
    Away_Team_APG = current_stats.loc[current_stats["Team"] == Away, "APG"].sum()
    
    Home_AAPG = current_stats.loc[current_stats["Team"] == Home, "AAPG"].sum()
    Away_AAPG = current_stats.loc[current_stats["Team"] == Away, "AAPG"].sum()
    
    Home_Team_TOVPG = current_stats.loc[current_stats["Team"] == Home, "TOVPG"].sum()
    Away_Team_TOVPG = current_stats.loc[current_stats["Team"] == Away, "TOVPG"].sum()
    
    Home_TOVAPG = current_stats.loc[current_stats["Team"] == Home, "TOVAPG"].sum()
    Away_TOVAPG = current_stats.loc[current_stats["Team"] == Away, "TOVAPG"].sum()
    
    if current_stats.loc[current_stats["Team"] == Home, "Day"].sum() == Day:
        Home_Team_B2B = 1
        print("htb2b")
    else:
        Home_Team_B2B = 0
    if current_stats.loc[current_stats["Team"] == Away, "Day"].sum() == Day:
        Away_Team_B2B = 1
        print("atb2b")
    else:
        Away_Team_B2B = 0
        
    arr = [
        HomePPG,
        AwayPPG,
        
        HomeRPG,
        AwayRPG,
    
        HomeRAPG,
        AwayRAPG,
    
        Home_FGPCT,
        Away_FGPCT,

        Home_PAPG,
        Away_PAPG,

        Home_Team_APG,
        Away_Team_APG,

        Home_AAPG,
        Away_AAPG,

        Home_Team_TOVPG,
        Away_Team_TOVPG,

        Home_TOVAPG,
        Away_TOVAPG,

        Home_Team_B2B,
        Away_Team_B2B
        
    ]
    
    print(arr[0:2]) #Check if teams were inputted correctly
    
    pred_log = logmodel.predict_proba([arr])
    pred_logbal = logmodel_bal.predict_proba([arr])
    pred_rfc = rfc.predict_proba([arr])
    
    
    return [pred_log, pred_logbal, pred_rfc]
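
Since the array passed to the models must follow the exact column order used for `X` during training, one way to make that explicit is to build a one-row dataframe keyed by the training column names. This is only a sketch: the helper `build_feature_row` and the `FEATURES` constant are hypothetical, and the back-to-back flags are passed in rather than derived from `Day`.

```python
import pandas as pd

# Same columns, in the same order, as X in the training section.
FEATURES = [
    "Home_Team_PPG", "Away_Team_PPG", "Home_Team_RPG", "Away_Team_RPG",
    "Home_RAPG", "Away_RAPG", "Home_FGPCT", "Away_FGPCT",
    "Home_PAPG", "Away_PAPG", "Home_Team_APG", "Away_Team_APG",
    "Home_AAPG", "Away_AAPG", "Home_Team_TOVPG", "Away_Team_TOVPG",
    "Home_TOVAPG", "Away_TOVAPG", "Home_Team_B2B", "Away_Team_B2B",
]


def build_feature_row(current_stats, home, away, home_b2b, away_b2b):
    """Return a one-row dataframe of model inputs for a home/away matchup."""
    stats = current_stats.set_index("Team")
    row = {}
    for name in FEATURES:
        parts = name.split("_")
        side, stat = parts[0], parts[-1]  # e.g. "Home", "PPG"
        if stat == "B2B":
            row[name] = home_b2b if side == "Home" else away_b2b
        else:
            team = home if side == "Home" else away
            # Raises KeyError for an unknown abbreviation instead of returning 0.
            row[name] = stats.at[team, stat]
    return pd.DataFrame([row], columns=FEATURES)
```

A possible call would be `logmodel.predict_proba(build_feature_row(current_stats, "CHI", "MIL", 0, 0))`. Passing named columns also avoids the feature-name warning that recent scikit-learn versions emit when a model fitted on a dataframe receives a plain list.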

In [96]:
# Arguments: ("HOME", "AWAY", "YESTERDAY'S DAY OF THE MONTH" as "DD")
# Example:   ("CHI", "PHI", "09")
# The full list of team abbreviations is below
# Prints both teams' PPG if the abbreviations were entered correctly; a wrong abbreviation prints 0

a = Predict("CHI", "MIL", "04")

[111.67948717948718, 115.0]


In [95]:
print("Log Model:                " + str(a[0]).split(" ")[1].split("]")[0])
print("Balanced Log Model:       " + str(a[1]).split(" ")[1].split("]")[0])
print("Random Forest Classifier: " + str(a[2]).split(" ")[1].split("]")[0])
print("Average:                  " + str((float(str(a[0]).split(" ")[1].split("]")[0]) + float(str(a[1]).split(" ")[1].split("]")[0]) + float(str(a[2]).split(" ")[1].split("]")[0]))/3))

#Outputs chance for a home team win

Log Model:                0.563548
Balanced Log Model:       0.47746438
Random Forest Classifier: 0.68133333
Average:                  0.5741152366666668
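
The string splitting above relies on how NumPy happens to print the array. A sketch of a more direct approach indexes the array returned by `predict_proba`, which has one row per sample and one column per class in the order given by the model's `classes_` attribute. The helper name is hypothetical, and the example array is illustrative (its second value is taken from the Log Model output above).

```python
import numpy as np


def home_win_probability(proba, classes, home_label=1):
    """Pick the home-win column out of a predict_proba result for one game."""
    column = list(classes).index(home_label)
    return float(proba[0, column])


# Stand-in for logmodel.predict_proba([arr]); columns follow classes [0, 1].
example = np.array([[0.436452, 0.563548]])
p_home = home_win_probability(example, classes=[0, 1])
print(p_home)
```

With the fitted models this could be called as `home_win_probability(a[0], logmodel.classes_)`, and the three probabilities averaged as plain floats.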


ATL	Atlanta Hawks\
BKN	Brooklyn Nets\
BOS	Boston Celtics\
CHA	Charlotte Hornets\
CHI	Chicago Bulls\
CLE	Cleveland Cavaliers\
DAL	Dallas Mavericks\
DEN	Denver Nuggets\
DET	Detroit Pistons\
GSW	Golden State Warriors\
HOU	Houston Rockets\
IND	Indiana Pacers\
LAC	Los Angeles Clippers\
LAL	Los Angeles Lakers\
MEM	Memphis Grizzlies\
MIA	Miami Heat\
MIL	Milwaukee Bucks\
MIN	Minnesota Timberwolves\
NOP	New Orleans Pelicans\
NYK	New York Knicks\
OKC	Oklahoma City Thunder\
ORL	Orlando Magic\
PHI	Philadelphia 76ers\
PHX	Phoenix Suns\
POR	Portland Trail Blazers\
SAC	Sacramento Kings\
SAS	San Antonio Spurs\
TOR	Toronto Raptors\
UTA	Utah Jazz\
WAS	Washington Wizards