# NBA Game Predictions - Final Report
## Kevin Yang, Eric Lee, Derek Young


### Introduction
The NBA (National Basketball Association) is an American-based men's basketball league that is widely considered to be the best basketball league in the world. As of the 2016-2017 season, there are 30 teams divided into two conferences, which are further divided into six divisions.

Our main objective is to build prediction models for NBA team performance.

At a high level, we focus our project on predicting the outcome of a given NBA game. Our approach involves determining the features that matter most to game outcomes, and then training a supervised machine learning model on these features over many previous games.

### Report Contents
The sections below outline our pipeline for predicting NBA games.

1. [Data Collection and Scraping](#Data Collection and Scraping)
2. [Exploratory Data Analysis]()
3. [Feature Selection]()
4. [Cutoff and "N"-previous Game Selection]()
5. [Naive Approach]()
6. [Linear Regression Model](#Linear Regression Model)
7. [Logistic Regression Model]()
8. [Support Vector Machines]()
9. [Concluding Thoughts]()

While most of our project is written in Python, a certain subset was done in R, specifically modeling linear and logistic regression. Throughout the report, we will highlight snippets of code that were substantial to our results. All code can be found ____.

<a id='Data Collection and Scraping'></a>
### Data Collection and Scraping

We scraped all of our data off of [stats.nba.com](http://stats.nba.com) and stored all relevant data in a local sqlite database.
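
As a rough illustration of reading stored game logs back out of SQLite, here is a minimal sketch. The table name `game_log`, its columns, the team abbreviations, and the rows are all assumptions made up for this example; the actual schema is not described in this report.

```python
import sqlite3

def load_season_games(conn, season_id):
    """Return (team_abbreviation, matchup, wl) rows for one season, oldest first.

    Assumes a hypothetical table `game_log` with these column names.
    """
    query = (
        "SELECT team_abbreviation, matchup, wl "
        "FROM game_log WHERE season_id = ? ORDER BY game_date"
    )
    return conn.execute(query, (season_id,)).fetchall()

# Build a tiny in-memory database with made-up rows to exercise the query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE game_log (season_id TEXT, game_date TEXT, "
    "team_abbreviation TEXT, matchup TEXT, wl TEXT)"
)
conn.executemany(
    "INSERT INTO game_log VALUES (?, ?, ?, ?, ?)",
    [
        ("22015", "2015-10-28", "AAA", "AAA vs. DDD", "L"),
        ("22015", "2015-10-27", "BBB", "BBB vs. CCC", "W"),
        ("22014", "2014-10-28", "CCC", "CCC vs. AAA", "W"),
    ],
)
conn.commit()

rows = load_season_games(conn, "22015")
```

Passing the season as a query parameter (the `?` placeholder) keeps the filter in SQL, so only the rows for the requested season are pulled into Python.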

### Linear Regression Model



# Exploratory Data Analysis
We perform exploratory data analysis (EDA) on our collected NBA data to gain insight into which variables are likely to be useful predictors for our model. Our EDA examines variables both univariately and bivariately.

## Bivariate Analysis
Based on our intuition and understanding of the NBA, we listed variables that we expected to be important factors in determining which team wins a given matchup. For each of the following variables, we explored bivariate scatterplots, graphing the home team's data on the y axis against the away team's data on the x axis. For the plots of home and away win/loss percentage in the previous N games, we chose N = 8. How we pick N is analyzed in more detail in a different section.

1. Home W/L Percentage
2. Away W/L Percentage
3. Home Average Point Differential
4. Away Average Point Differential
5. Home W/L Percentage in Previous N Games (8 chosen here for the plots)
6. Away W/L Percentage in Previous N Games (8 chosen here for the plots)
7. Away W/L Percentage as Away Team
8. Home W/L Percentage as Home Team
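
To make the "previous N games" variables concrete, the sketch below computes, for each game, a team's win fraction over the N games that came before it. It is an illustration under assumptions rather than the code used in this project: the function name is hypothetical, and returning `None` when a team has no earlier games (and averaging over fewer than N games early in a season) is one possible convention.

```python
from collections import deque

def previous_n_win_pct(results, n=8):
    """For each game, return the win fraction over the up-to-n games before it.

    results: list of 1 (win) / 0 (loss) in chronological order.
    Returns None for a game that has no prior games.
    """
    window = deque(maxlen=n)  # holds only the most recent n results
    out = []
    for result in results:
        if window:
            out.append(sum(window) / float(len(window)))
        else:
            out.append(None)
        # Add the current game only after computing its feature,
        # so the feature never includes the game being predicted.
        window.append(result)
    return out

pcts = previous_n_win_pct([1, 0, 1, 1], n=2)
```

Appending the current result after computing the value matters: it keeps the outcome of the game being predicted out of its own feature.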

Running these graphs over multiple seasons and eyeballing the plots gave us similar results. As an example, we display and inspect these graphs for the most recently completed season (2015-16). Each point is color coded: games won by the home team are blue, while games lost by the home team are red. Hovering over a point displays the home team, the matchup, and a boolean value indicating whether the away team was playing a back to back game (1 for back to back, 0 for not).

In [None]:
from bokeh.io import output_notebook
from bokeh.models import HoverTool
from bokeh.plotting import figure, show, ColumnDataSource

# Plot of Home W/L Percentage vs. Away W/L Percentage
all_years_id = home_games_with_cutoff["season_id"].unique()
# Look at the previous completed season
testYear =  home_games_with_cutoff[(home_games_with_cutoff["season_id"] =="22015")]

# Based on home team, color losses as red and wins as blue 
colormap = {'L': 'red', 'W': 'blue', }
colors = [colormap[x] for x in testYear["wl"]]

# Import the nba dataframe of the given year as ColumnDataSource 
source = ColumnDataSource(data=testYear)

# Fields shown on hover: home team, matchup, and away back to back flag
hover = HoverTool(tooltips=[("Home", "@team_abbreviation"), ("Game", "@matchup"), ("Away BB", "@away_back_to_back")])

# Add in labels to the graph
yearNum = "2015"
title = "Home Vs. Away - Win/Loss % Year " + yearNum
p = figure(title = title, tools = [hover, "resize", "box_zoom", "reset"])
p.xaxis.axis_label = 'Away Team W/L %'
p.yaxis.axis_label = 'Home Team W/L %'

# Plot the points for the graph
p.circle(testYear["away_win_pct"], testYear["home_win_pct"], color=colors, fill_alpha=0.2, size=10, source = source)
# Add a line of slope one to visually divide points
p.line([0, 1], [0, 1], line_width=2, color = "black")

# Display inline
output_notebook()
show(p)

<img src="WinLoss.png" width="500" height="500">

In this graph, we see that the games above the line clearly tend to have more wins than games below the line. This shows that the team with the higher win/loss percentage tends to win more matchups. This pattern in the graph supports the idea that win/loss percentage can be an important predictor for modeling NBA game outcomes.
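
The visual impression can also be checked numerically by comparing the home team's win rate on each side of the line of slope one. The sketch below is a hypothetical helper with made-up input, not part of the original analysis.

```python
def home_win_rate_by_side(games):
    """Split games by which side of the y = x line they fall on.

    games: iterable of (home_win_pct, away_win_pct, home_won) tuples,
           where home_won is 1 if the home team won and 0 otherwise.
    Returns (rate_above, rate_below): the home team's win rate when its
    win percentage is above / below the away team's. Games exactly on
    the line are skipped. A rate is None if that side has no games.
    """
    above = [won for home, away, won in games if home > away]
    below = [won for home, away, won in games if home < away]

    def rate(outcomes):
        return sum(outcomes) / float(len(outcomes)) if outcomes else None

    return rate(above), rate(below)

# Made-up games for illustration only
games = [
    (0.70, 0.40, 1), (0.60, 0.50, 1), (0.55, 0.50, 0),  # above the line
    (0.30, 0.60, 0), (0.40, 0.70, 1),                   # below the line
    (0.50, 0.50, 1),                                    # on the line, skipped
]
rate_above, rate_below = home_win_rate_by_side(games)
```

A clearly higher `rate_above` than `rate_below` on real data would be consistent with what the scatterplot suggests.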

In [None]:
# Plot of Home Avg Pt Diff vs. Away Avg Pt Diff
all_years_id = home_games_with_cutoff["season_id"].unique()

# Look at the previous completed season
testYear =  home_games_with_cutoff[(home_games_with_cutoff["season_id"] =="22015")]

# Based on home team, color losses as red and wins as blue 
colormap = {'L': 'red', 'W': 'blue', }
colors = [colormap[x] for x in testYear["wl"]]

# Import the nba dataframe of the given year as ColumnDataSource
source = ColumnDataSource(data=testYear)

# Fields shown on hover: home team, matchup, and away back to back flag
hover = HoverTool(tooltips=[("Home", "@team_abbreviation"), ("Game", "@matchup"), ("Away BB", "@away_back_to_back")])

# Add in labels to the graph
yearNum = "2015"
title = "Home Vs. Away - Avg Pt Diff Year " + yearNum
p = figure(title = title, tools = [hover, "resize", "box_zoom", "reset"])
p.xaxis.axis_label = 'Away Team Avg Pt Diff'
p.yaxis.axis_label = 'Home Teams Avg Pt Diff'

# Plot the points for the graph
p.circle(testYear["away_avg_pt_diff"], testYear["home_avg_pt_diff"], color=colors, fill_alpha=0.2, size=10, source = source)
# Add a line of slope one to visually divide points
p.line([-15, 15], [-15, 15], line_width=2, color = "black")

# Display inline
output_notebook()
show(p)

<img src="AvgPtDiff.png" width="500" height="500">

Again, we see that the games above the line clearly tend to have more wins than games below the line, showing that the team with a higher average point differential tends to win more matchups. This pattern in the graph supports the idea that this variable can be an important predictor for modeling NBA game outcomes.

In [None]:
# Plot of Home W/L Percentage Previous N Games vs. Away W/L Percentage Previous N games
all_years_id = home_games_with_cutoff["season_id"].unique()
# Hard code the year we are looking at for now
testYear =  home_games_with_cutoff[(home_games_with_cutoff["season_id"] =="22015")]

# Based on home team, color losses as red and wins as blue 
colormap = {'L': 'red', 'W': 'blue', }
colors = [colormap[x] for x in testYear["wl"]]

# Import the nba dataframe of the given year as ColumnDataSource
source = ColumnDataSource(data=testYear)

# Fields shown on hover: home team, matchup, and away back to back flag
hover = HoverTool(tooltips=[("Home", "@team_abbreviation"), ("Game", "@matchup"), ("Away BB", "@away_back_to_back")])

# Add in labels to the graph
yearNum = "2015"
title = "Home Vs. Away - Win/Loss % Last 8 Games " + yearNum
p = figure(title = title, tools = [hover, "resize", "box_zoom", "reset"])
p.xaxis.axis_label = 'Away Team Win/Loss % Last 8 Games'
p.yaxis.axis_label = 'Home Team Win/Loss % Last 8 Games'

# Plot the points for the graph
p.circle(testYear["away_win_pct_N"], testYear["home_win_pct_N"], color=colors, fill_alpha=0.2, size=10, source = source)
p.line([0, 1], [0, 1], line_width=2, color = "black")

# Display inline
output_notebook()
show(p)

<img src="WinLossLastN.png" width="500" height="500">

For this plot, a lot of the points have different shades because there are multiple games with different outcomes that lie at the same point. Again, we see that the games above the line clearly tend to have more wins than games below the line, showing that the team with a higher win/loss percentage in the last 8 games ends up winning more matchups. This pattern in the graph supports the idea that this variable can be an important predictor for modeling NBA game outcomes. 

In [None]:
# Plot of Home Team W/L Percentage as Home vs. Away W/L Percentage as Away 
all_years_id = home_games_with_cutoff["season_id"].unique()
# Hard code the year we are looking at for now
testYear =  home_games_with_cutoff[(home_games_with_cutoff["season_id"] =="22015")]

# Based on home team, color losses as red and wins as blue 
colormap = {'L': 'red', 'W': 'blue', }
colors = [colormap[x] for x in testYear["wl"]]

# Import the nba dataframe of the given year as ColumnDataSource
source = ColumnDataSource(data=testYear)

# Fields shown on hover: home team, matchup, and away back to back flag
hover = HoverTool(tooltips=[("Home", "@team_abbreviation"), ("Game", "@matchup"), ("Away BB", "@away_back_to_back")])

# Add in labels to the graph
yearNum = "2015"
title = "Home Win % as Home vs Away Win % as Away " + yearNum
p = figure(title = title, tools = [hover, "resize", "box_zoom", "reset"])
p.xaxis.axis_label = 'Away Team Win/Loss as Away Team'
p.yaxis.axis_label = 'Home Team Win/Loss as Home Team'

# Plot the points for the graph
p.circle(testYear["away_win_pct_as_away"], testYear["home_win_pct_as_home"], color=colors, fill_alpha=0.2, size=10, source = source)
p.line([0, 1], [0, 1], line_width=2, color = "black")

# Display inline
output_notebook()
show(p)

<img src="HomeAsHome%.png" width="500" height="500">

Again, we see that the games above the line clearly tend to have more wins than games below the line, showing that home teams with a higher win/loss percentage as the home team are more likely to win matchups, and vice versa. This pattern in the graph supports the idea that this variable can be an important predictor for modeling NBA game outcomes. 

## Hypothesis Testing: Do Back to Back Games Matter?
Another variable that we believed to be an important predictor of the outcome of NBA games was whether either of the teams in the current matchup had played a game the day before, as teams playing consecutive games are likely to be more tired and more likely to lose than if they were well rested.

We evaluate this claim with hypothesis testing. For each team in the league, we will look back throughout several seasons and calculate their win/loss percentages for regular games and back to back games. We then run a difference in proportions hypothesis test to see whether or not there is a statistically significant difference between these win/loss percentages. Because we already intuitively believe that playing back to back games will lower a given team's chances of winning, this hypothesis test will be one sided. The hypothesis test below was run at an alpha level of 0.05. The null and alternative hypotheses are listed below:

#### Null Hypothesis: 
The true difference in this team's win/loss percentages between regular games and back to back games is zero.
#### Alternative Hypothesis: 
The team's true win/loss percentage is higher in regular games than in back to back games, i.e. the true difference (regular minus back to back) is greater than zero.
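
For reference, the one-sided difference in proportions test can be written out as follows. The symbols are notation introduced for this sketch: $w_{reg}$ and $n_{reg}$ are a team's wins and games played in regular games, and $w_{bb}$ and $n_{bb}$ are the same counts for back to back games.

$$\hat{p}_{reg} = \frac{w_{reg}}{n_{reg}}, \qquad \hat{p}_{bb} = \frac{w_{bb}}{n_{bb}}, \qquad \hat{p} = \frac{w_{reg} + w_{bb}}{n_{reg} + n_{bb}}$$

$$z = \frac{\hat{p}_{reg} - \hat{p}_{bb}}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_{reg}} + \frac{1}{n_{bb}}\right)}}$$

Under the null hypothesis, $z$ is approximately standard normal, so at an alpha level of 0.05 the one-sided test rejects the null when $z$ exceeds the 95th percentile of the standard normal distribution, roughly 1.645.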

In [None]:
import numpy as np
from scipy.stats import norm

def teamPercentages(teamID, game_log):
    """ Given a team id, count its wins and games played in back to back games and in regular games.
    Input:
        teamID (int): team_id number
        game_log (pd.DataFrame): Game log listing all the home games, one row per game
    Output:
        (list, list): Tuple of lists where each list is [number of wins, number of games played] for a 
                      given category. The first list represents back to back games and the second list 
                      represents regular games.
    """
    seasons = game_log["season_id"].unique()
    # Initialize lists which store the percentages for back to back games and regular games for each season
    bbList = []
    regList = []
    # Initialize counters for games 
    bbWins = 0 
    bbGames = 0
    regWins = 0
    regGames = 0
    # Iterate through all seasons of the dataframe 
    for season in seasons:
        # Get dataframe that only contains games from this season of the team of interest
        currDf = game_log[(game_log["season_id"] == season) & ((game_log["team_id"] == teamID) | (game_log["opp_team_id"] == teamID))]
        # Iterate over the dataframe
        for (index, row) in currDf.iterrows():
            # Determine whether this is a home or away game
            if (row["team_id"] == teamID):
                home = True
            else:
                home = False
            # Case 1: Home back to back game 
            if (home and row["home_back_to_back"] == 1):
                bbGames += 1
                if (row["wl_binary"] == 1):
                    bbWins += 1
            # Case 2: Away back to back game
            # elif so that a home back to back game is not also counted as a regular game below
            elif (not home and row["away_back_to_back"] == 1):
                bbGames += 1
                if (row["wl_binary"] == 0):
                    bbWins += 1
            # Otherwise we have regular games
            else:
                regGames += 1
                if (home and row["wl_binary"] == 1):
                    regWins += 1
                elif (home == False and row["wl_binary"] == 0):
                    regWins += 1
    # Add the num of won and total games in the respective lists 
    bbList.append(bbWins)
    bbList.append(bbGames)
    regList.append(regWins)
    regList.append(regGames)
    return (bbList, regList)

def leaguePercentages(game_log, season_id):
    """ For all teams in the league, generate their win loss percentages for back to back and regular games 
        across all seasons from data from a given season onwards, storing the result in a dictionary. 
    Input: 
        game_log (pd.Dataframe): Dataframe containing all the home games
        season_id (string): season id in which we only look at the results occuring at or after this season 
    Output:
        returnDict: (string, (List, List)) where key represents the team name and value is a tuple of 
                                           lists. Each list contains the number of wins and total games played, 
                                           where the first list represents back to back games and the second list 
                                           represents regular games
    """
    # Index out the seasons 
    newLog = game_log[game_log["season_id"] >= season_id]
    # Initialize the dictionary for all teams
    returnDict = {}
    teamList = newLog["team_id"].unique()
    # Iterate over all teams and call above helper function to generate the lists of win loss % for each team
    for teamID in teamList:
        returnDict[teamID] = teamPercentages(teamID, newLog)
    return returnDict

def hypothesisTest(bbList, regList):
    """ Run a one-sided difference in proportions test of whether a team's true win percentage in regular
        games is higher than its true win percentage in back to back games (alpha = 0.05).
    Input: 
        bbList (List): [number of wins, number of games played] for the team's back to back games
        regList (List): [number of wins, number of games played] for the team's regular games
    Output:
        result: (Boolean) True when the regular game win percentage is significantly greater than the 
                          win percentage for back to back games given the inputted data.
    """
    # Get number of back to back and reg games 
    bbWins = bbList[0]
    nBB = bbList[1]
    regWins = regList[0]
    nReg = regList[1]
    # Get our win loss percentages for back to back and regular games 
    bbProp = bbWins / float(nBB)
    regProp = regWins / float(nReg)
    # Caluclate the pooled proportion and pooled standard deviation 
    pooledProp = (bbWins + regWins) / float(nBB + nReg)
    pooledSE = np.sqrt(pooledProp * (1 - pooledProp) * ((1/float(nBB)) + (1/float(nReg))))
    propDiff = regProp - bbProp
    z = propDiff / float(pooledSE)
    # Find the cutoff for alpha = 0.05 
    cutoff = norm.ppf(0.95, loc = 0, scale = 1)
    return z > cutoff 
    
def testLeague(game_log, season_id):
    """ For all teams in the league, run the hypothesis tests comparing win loss percentages between regular and 
        back to back games.
    Input: 
        game_log (pd.Dataframe): Dataframe containing home game logs 
        season_id (string): season_id in which we only look at the results occuring at or after this season  
    Output:
        result: (Float, Dictionary) Tuple containing a float and a dictionary. The float represents the proportion of teams
                                    in which back to back games is significant from the hypothesis test. The dictionary 
                                    has keys being the team_id, and values as booleans of whether or not there was a 
                                    significant difference between the records
    """
    # Index out the seasons 
    newLog = game_log[game_log["season_id"] >= season_id]
    # Initialize the dictionary for all teams
    returnDict = {}
    teamList = newLog["team_id"].unique()
    for team in teamList:
        # Compute both count lists once, from the same filtered game log
        bbList, regList = teamPercentages(team, newLog)
        returnDict[team] = hypothesisTest(bbList, regList)
    # Find the proportion of teams in which this difference was significant across the league 
    sig = 0 
    total = 0 
    for team in returnDict:
        if returnDict[team] == 1:
            sig += 1
        total += 1
    percSig = (sig) / float(total)
    # Print the percentage of teams in which back to back is a significant variable 
    print(percSig)
    return (percSig, returnDict)

testLeague(home_games, "22005")

## Result
The code above ran the hypothesis test comparing win/loss percentages between regular and back to back games for every team in the league, from the 2005 season through the games that have occurred so far in the current (2016-2017) season. At an alpha level of 0.05, for 53% of teams in the league there is statistically significant evidence to reject the null in favor of the alternative that the team has a worse record in back to back games than in regular games. Thus, we decide that back to back is a meaningful predictor that should be included in our model.
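
To make the decision rule concrete for a single team, here is a small standard-library sketch of the same one-sided difference in proportions test that also reports a p-value. The function name and the win and game counts are hypothetical and are not taken from our data.

```python
import math

def one_sided_two_proportion_z(reg_wins, reg_games, bb_wins, bb_games):
    """Return (z, p_value) for H1: regular win rate > back to back win rate."""
    p_reg = reg_wins / float(reg_games)
    p_bb = bb_wins / float(bb_games)
    # Pooled proportion and standard error under the null of equal win rates
    pooled = (reg_wins + bb_wins) / float(reg_games + bb_games)
    se = math.sqrt(pooled * (1 - pooled) * (1.0 / reg_games + 1.0 / bb_games))
    z = (p_reg - p_bb) / se
    # Upper-tail probability of the standard normal via the error function
    p_value = 0.5 * (1 - math.erf(z / math.sqrt(2)))
    return z, p_value

# Hypothetical counts: 360 wins in 600 regular games, 100 wins in 200 back to back games
z, p_value = one_sided_two_proportion_z(360, 600, 100, 200)
reject_null = p_value < 0.05
```

Comparing the p-value to 0.05 is equivalent to comparing `z` to the 95th percentile cutoff of the standard normal distribution.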

# How has the NBA changed over the years?
We want to examine the change of key statistics over time in the past 40 years of the NBA. 
1. How has the standard deviation of team wins in a given season changed over different seasons?
2. How has the average point differential for a given season changed over different seasons?
3. How has the difference between the mean point differential between top teams and bottom teams changed over different seasons?

Knowing how the league has changed in terms of these potential predictors helps us understand how the league currently operates and tune our model accordingly. It also tells us whether certain variables are likely to vary a lot across seasons or are more stable. The following code addresses these questions by generating visualizations of these values across the past 40 seasons.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

def get_team_wins_count(league_df, team_id, season_id):
    """ Given a df containing ALL game logs (including home and away), 
        team_id and season_id, returns number of wins the team got that season
    Input:
        league_df (pandas.DataFrame): dataframe containing post-processed league logs (both HOME and AWAY)
        team_id (int or string): team ID number
        season_id (int or string): season ID number
    Output:
        (int): number of games team won in season
    """ 
    team_id = int(team_id)
    season_id = str(season_id)
    
    temp_df = league_df[(league_df['season_id'] == season_id) & (league_df['team_id'] == team_id)]
    temp_df = temp_df.sort_values('game_date')
    
    # get last game
    last_game = temp_df.iloc[len(temp_df) - 1]

    wins = 0
    games_won_so_far = 0
    if last_game['is_home']:
        games_won_so_far = int(round(last_game['home_win_pct']*last_game['home_game_count']))
    else:
        games_won_so_far = int(round(last_game['away_win_pct']*last_game['away_game_count']))
    wins = games_won_so_far + last_game['wl_binary']
    return wins
    
get_team_wins_count(all_games, "1610612742", "22004")

def graph_stdev_wins(league_df):
    """ Given a df containing ALL SORTED game logs (including home and away), 
        graphs stdev of team wins over time
    Input:
        league_df (pandas.DataFrame): dataframe containing post-processed league logs (both HOME and AWAY)
    Output:
        None
    """
    season_list = league_df['season_id'].unique().tolist()
    seasons = []
    stdevs = []
    for season in season_list:
        season_df = league_df[league_df['season_id'] == season]
        team_list = season_df['team_id'].unique().tolist()
        win_counts = []
        for team in team_list:
            team_wins = get_team_wins_count(league_df, team, season)
            win_counts.append(team_wins)
        stdev = np.array(win_counts).std(ddof = 1)
        seasons.append(int(season[1:]))
        stdevs.append(stdev)
        
    # remove last, unfinished year
    seasons = seasons[:len(seasons)- 1]
    stdevs = stdevs[:len(stdevs) - 1]
    fig = plt.figure()
    fig.suptitle('Standard Deviation Of Games Won',fontsize=12)
    plt.xlabel('Season Starting Year')
    plt.ylabel('Standard Deviation')
    plt.plot(seasons, stdevs, seasons, 
                  np.poly1d(np.polyfit(seasons, stdevs, 1))(np.unique(seasons)))
         
graph_stdev_wins(all_games)

def graph_avg_ptdiff(league_df):
    """ Given a df containing ALL SORTED game logs (including home and away), 
        graphs average (absolute value) ptdiff over seasons
    Input:
        league_df (pandas.DataFrame): dataframe containing post-processed league logs (both HOME and AWAY)
    Output:
        None
    """
    season_list = league_df['season_id'].unique().tolist()
    seasons = []
    avg_pt_diffs = []
    for season in season_list:
        season_df = league_df[league_df['season_id'] == season]
        pt_diff = season_df['plus_minus'].values
        pt_diff = np.abs(pt_diff)
        seasons.append(int(season[1:]))
        avg_pt_diffs.append(np.mean(pt_diff))
    fig = plt.figure()
    fig.suptitle('Average Point Differentials Over Seasons',fontsize=12)
    plt.xlabel('Season Starting Year')
    plt.ylabel('Average Point Differential')
    plt.plot(seasons, avg_pt_diffs, seasons, 
                  np.poly1d(np.polyfit(seasons, avg_pt_diffs, 1))(np.unique(seasons)))

graph_avg_ptdiff(all_games)
    
def total_games_graph(league_df):
    """ Given a df containing ALL HOME game logs, 
        graphs total games played throughout seasons
    Input:
        league_df (pandas.DataFrame): dataframe containing post-processed league logs (only HOME)
    Output:
        None
    """
    season_list = league_df['season_id'].unique().tolist()
    seasons = []
    total_games = []
    for season in season_list:
        season_df = league_df[league_df['season_id'] == season]
        seasons.append(int(season[1:]))
        total_games.append(len(season_df))
    fig = plt.figure()
    fig.suptitle('Total Games Played Over Different Seasons',fontsize=12)
    plt.xlabel('Season Starting Year')
    plt.ylabel('Games Played')
    plt.plot(seasons, total_games, seasons, 
                  np.poly1d(np.polyfit(seasons, total_games, 1))(np.unique(seasons)))
    
total_games_graph(home_games)

It is clear from the first two plots that the standard deviation of games won and the average point differential follow wave-like patterns that increase only slowly overall (looking across the wave patterns). The increasing trend in the standard deviation of games won shows that the spread of wins across the league within a season is larger in more recent seasons, indicating that there are likely larger gaps between good and bad teams in the league. This could be due to the increasing number of teams in the league. Similarly, the increasing trend in average point differential supports this idea of a larger gap between good and bad teams, as average point differentials are larger in more recent seasons.

These effects, however, do not seem particularly large, as the total increases in the standard deviation of games won and in the average point differential are 4 and 0.5, respectively.

Where we do notice a larger effect is in the total games played graph. The number of games increases from 1940 to 1990, where it starts to level off. Note that the last point on the graph is lower because it is the current season, which is still in progress. Additionally, the years 1998 and 2011 show dips in the plot because of the lockouts, which shortened those seasons. This is important when deciding which seasons to include in our data. Seasons from 1990 onward are much closer to the current NBA in total games played, while earlier seasons differ more and more the further back we go.
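
If the data is restricted to more recent seasons, the filter could look like the sketch below. It assumes, as the plotting code above does, that a `season_id` such as `"22015"` is a one-character prefix followed by the season's starting year. The function names and the default cutoff of 1990 are illustrative choices, not a statement of the cutoff used in our final models.

```python
def season_start_year(season_id):
    """Convert a season_id such as "22015" to its starting year (2015).

    Assumes the first character is a prefix and the remaining
    characters are the starting year.
    """
    return int(str(season_id)[1:])

def seasons_from(season_ids, first_year=1990):
    """Keep only season ids whose starting year is first_year or later."""
    return [s for s in season_ids if season_start_year(s) >= first_year]

kept = seasons_from(["21985", "21990", "22015"])
```

Filtering on the parsed year rather than on the raw string makes the cutoff explicit and independent of the prefix character.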