# DS 100 Final Project: A Basketball Analysis

## Given 2 teams' average statistics before a game, how well can we predict who wins?

### Creators: Alec Zhou, Prashant Malyala

## I. This Project

[Insert description of project here]

## II. Set Up

Below, we will import the necessary libraries for our data analysis.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from datetime import date

Now, we'll read in the relevant basketball data.

In [None]:
#from the provided ds100 dataset
team_box_score = pd.read_csv('basketball/Basketball-TeamBoxScores.csv')
standings = pd.read_csv('basketball/Standings.csv')

## III. Data Preview

Let's briefly see what our data looks like.

### i. Standings Data:

The granularity of this data is quite fine; it seems that the Standings table has a record representing each team's standing at every date of the season (after the games on that date occur). The columns are rollups and seem to represent aggregate statistics (such as total points scored, current streak, etc.) through each date for each team.

In [None]:
standings.head(5)

In [None]:
#all 30 teams are listed on this date even though not all of them played; must be a full date by date record
len(standings[standings['stDate'] == '2012-10-30'])

### ii. Team Box Score Data:

The granularity of the team box score data is also rather fine; it seems that this table has a record for each game of each season. The column statistics provided match the granularity of each record, as these statistics are generally measured or computed using only data from that game.

In [None]:
team_box_score.head(5)

Note that for each set of 3 officials, there are 2 rows listed on each date. This led us to suspect that the Team Box Score data has two entries per game, one for each team. We can verify this in the cell below.

In [None]:
#shows that each game is listed twice, once for each team
team_box_score[team_box_score['gmDate'] == '2012-10-30'][['teamAbbr', 'opptAbbr']]

So not only do we have all the team and opposing team's statistics per game entry, we have two listings of every game! Certainly there's some redundancy in this data, and this is going to be an important point we will revisit when building our model.
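
As a rough sketch of how this double listing could be checked across the whole table rather than for a single date, we can build an order-independent key for each matchup and count rows per game. The toy DataFrame and the `matchup` column are made up for illustration; only `gmDate`, `teamAbbr`, and `opptAbbr` come from the real data.

```python
import pandas as pd

# Toy stand-in for the box score table (rows are invented for illustration).
toy = pd.DataFrame({
    'gmDate':   ['2012-10-30'] * 4,
    'teamAbbr': ['WAS', 'CLE', 'BOS', 'MIA'],
    'opptAbbr': ['CLE', 'WAS', 'MIA', 'BOS'],
})

# Sort each (team, opponent) pair so both listings of a game share one key.
# 'matchup' is a hypothetical helper column, not part of the dataset.
toy['matchup'] = [
    '-'.join(sorted(pair)) for pair in zip(toy['teamAbbr'], toy['opptAbbr'])
]

# Number of rows per (date, matchup); a value of 2 means "listed once per team".
rows_per_game = toy.groupby(['gmDate', 'matchup']).size()
print(rows_per_game)
```

If every value is 2, each game is listed exactly twice. This assumes two teams never meet twice on the same date.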

Finally, let's see how complete this data is. There are 30 teams in the league, 82 games per season, and this data covers the 2012-2013 through 2017-2018 seasons (6 seasons). So we would expect this many records:

In [None]:
30 * 82 * 6

Let's see what percentage of records we actually have.

In [None]:
team_box_score.shape[0] / 14760 * 100

While the data is not perfectly complete, the vast majority of records are present. Since a handful of missing records is unlikely to change our analysis much, we will continue forward.

## IV. Data Cleaning

Right now, both our tables have a lot of columns, making them quite difficult to read or interpret. In this section, we will use domain expertise to compute more useful statistics and to replace or drop some columns.

### i. Cleaning up the Standings Data

Let's look at the standings data again.

In [None]:
standings.head()

**Dropping certain columns:**

Rank, or seed, is determined by a variety of factors, including record (wins vs losses) and performance in a team's division. We believe that because our goal is not to predict standing but rather a win or a loss for a given game, rank isn't as effective a measure as winning percentage for our analysis.

We also believe that totals like ptsFor and ptsAgnst aren't as useful as averages (ptsScore, ptsAllow). Opponent statistics like opptGmPlay and opptGmWon are effectively summarized in the strength of schedule (sos) column. Finally, statistics like rel%Indx and the Pythagorean statistics are difficult to interpret; the goal of our analysis is to find simpler statistics that have important relationships with a team's likelihood of winning. Consequently, we will drop all of these columns below.

In [None]:
standings = standings.drop(['rank', 'rankOrd', 'gameBack', 'ptsFor', 'ptsAgnst', 'opptGmPlay', 'opptGmWon',
                           'opptOpptGmPlay', 'opptOpptGmWon', 'rel%Indx', 'pyth%13.91', 'wpyth13.91',
                           'lpyth13.91', 'pyth%16.5', 'wpyth16.5', 'lpyth16.5'], axis=1)

**Winning percentages:**

There's still a lot of data that could be summarized more easily using numbers and percentages. Let's start with winning percentages.

The Standings table currently has columns for gameWon, gameLost, homeWin, homeLoss, awayWin, awayLoss, confWin, and confLoss. Let's convert all these numbers into more concise percentages, like total winning percentage, home winning percentage, etc.

In [None]:
# total winning percentage is (wins) / (wins + losses)
standings['total_wpc'] = np.divide(standings['gameWon'], standings['gameWon'] + standings['gameLost'])
# wherever 0 games had been played, 0 / 0 produces NaN; replace these with 0
standings = standings.fillna(0)
# can now drop gameWon, gameLost, as this information is summarized by our wpc statistic
standings = standings.drop(['gameWon', 'gameLost'], axis=1)

# follow the same process as above for home_wpc, away_wpc, and conf_wpc
standings['home_wpc'] = np.divide(standings['homeWin'], standings['homeWin'] + standings['homeLoss'])
standings['away_wpc'] = np.divide(standings['awayWin'], standings['awayWin'] + standings['awayLoss'])
standings['conf_wpc'] = np.divide(standings['confWin'], standings['confWin'] + standings['confLoss'])
standings = standings.fillna(0)
standings = standings.drop(['homeWin', 'homeLoss', 'awayWin', 'awayLoss', 'confWin', 'confLoss'], axis=1)
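
One thing to keep in mind is that `standings.fillna(0)` fills NaNs in every column, not just the new percentage columns. A minimal sketch of a column-scoped alternative, on made-up rows (the `sos` values here are invented), might look like this:

```python
import numpy as np
import pandas as pd

# Toy standings rows; the first row is a team that has not played yet.
toy = pd.DataFrame({
    'gameWon':  [0, 3, 1],
    'gameLost': [0, 1, 3],
    'sos':      [np.nan, 0.51, 0.49],  # unrelated column with a genuinely missing value
})

games = toy['gameWon'] + toy['gameLost']
# 0 / 0 yields NaN; fill only this column so NaNs elsewhere are left untouched.
toy['total_wpc'] = (toy['gameWon'] / games).fillna(0)
print(toy)
```

Here the missing `sos` value stays missing, while the undefined winning percentage becomes 0.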

**Streaks:**

Data regarding whether or not a team is on a streak could be quite useful. Right now, this information is encapsulated in strings, but what if we were to represent them more interpretably using numbers?

One way we might do this is by representing losing streaks with negative numbers and winning streaks with positive ones. The length of the streak would be the absolute value of the number.

In [None]:
# function that converts streak strings into numbers
def stk_to_num(x):
    if x == '-':
        return 0
    elif x[0] == 'L':
        return -1 * int(x[1:])
    else:
        return int(x[1:])
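
A quick sanity check of this encoding, assuming streak strings look like `'W3'`, `'L2'`, or `'-'` (the format is inferred from the function itself, and the example strings are made up):

```python
def stk_to_num(x):
    # same logic as the function above, repeated so this snippet runs on its own
    if x == '-':
        return 0
    elif x[0] == 'L':
        return -1 * int(x[1:])
    else:
        return int(x[1:])

examples = ['-', 'W3', 'L2', 'W11']
encoded = [stk_to_num(s) for s in examples]
print(encoded)  # [0, 3, -2, 11]
```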

In [None]:
# once we have numerically represented the streaks, there's no point keeping the string data
standings['streak'] = standings['stk'].apply(stk_to_num)
standings = standings.drop(['stk', 'stkType', 'stkTot'], axis=1)

**Last Five/Last Ten:**

Finally, we have this information about how many games each team won in their last 5 and last 10 games. Let's represent these as percentages as well.

In [None]:
standings['lastTen_wpc'] = standings['lastTen'] / 10
standings['lastFive_wpc'] = standings['lastFive'] / 5

In [None]:
standings = standings.drop(['lastFive', 'lastTen'], axis=1)

### ii. Cleaning up the Team Box Score Data

Now let's take another look at the Team Box Score data.

In [None]:
team_box_score.head()

**Dropping certain columns:**

In general, we believe that factors like gmTime and the identity of the officials are relatively negligible in determining the outcome of a game. seasTyp is also not very informative, as all of these games seem to have taken place during the regular season.

In [None]:
# shows that all games took place during the regular season
team_box_score['seasTyp'].unique()

In [None]:
# dropping these columns
team_box_score = team_box_score.drop(['gmTime', 'seasTyp', 'offLNm1', 'offFNm1', 'offLNm2', 'offFNm2',
                                      'offLNm3', 'offFNm3'], axis=1)

**Representing wins:**

When it comes time to train and test our model, we will want our results to be easily interpretable. Rather than using strings for Win and Loss in the teamRslt column, let's convert this to a binary indicator, where 0's represent losses and 1's represent wins.

In [None]:
#make the result column a binary indicator
team_box_score = team_box_score.replace(to_replace=['Loss', 'Win'], value=[0, 1])
#verify that we only have 0's and 1's in this column
team_box_score['teamRslt'].unique()
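
Note that `DataFrame.replace` looks for 'Loss' and 'Win' in every column of the table. If we only wanted to touch teamRslt, one option is to map that single column. The sketch below uses a made-up table; the `opptRslt` column is an assumption about what else might hold these strings.

```python
import pandas as pd

toy = pd.DataFrame({
    'teamAbbr': ['WAS', 'CLE'],
    'teamRslt': ['Loss', 'Win'],
    'opptRslt': ['Win', 'Loss'],  # assumed column, included to show it is left alone
})

# Map only the target column. Any label other than 'Win'/'Loss' would become NaN,
# which makes unexpected values easy to spot.
toy['teamRslt'] = toy['teamRslt'].map({'Loss': 0, 'Win': 1})
print(toy)
```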

## V. Data Transformations

**Section i:**

Right now, our standings data has aggregate statistics for each team through a certain date and our team box score data has the game stats for each date. Both tables also have date attributes, so if we were to combine them, we could try and make use of the statistics offered by both. Thus, the first part of this section will be merging the two tables together.

**Section ii:**

The second section will be focused on how we can transform our data to represent real world circumstances. In reality, we will never have the team box score data of a game before said game (otherwise, predicting a win is as easy as seeing who scored more points). So instead, what we will do is try to make the team box score data look more like the standings data (i.e. we will calculate the team's and opponent's average points, average assists, average STL/TO ratios, etc. before every game of the season).

Throughout the process we will need to be mindful of the way we take averages. We want our averages to represent statistics we would have BEFORE game *i* happened, so we need to take the average of only games *0...i-1*. As for the standings data, because it has been aggregated after the games that occurred that day, we simply need to move it one game down for each team. We walk through each step of this process in section ii through code comments and markdown cells, so please refer to those for a more detailed walkthrough.

We will be using the following function throughout this section to extract data for each particular season.

In [None]:
def get_nth_season(data, n):
    #1st season begins in 2012 and 6th season begins in 2017
    return data[(data['gmDate'] > date(2011 + n, 10, 1)) & (data['gmDate'] < date(2012 + n, 5, 1))]
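
As a small usage sketch, here is the function applied to a toy table. It assumes gmDate already holds `date` objects (the comparison would not work on the raw strings), and the dates are made up.

```python
from datetime import date
import pandas as pd

def get_nth_season(data, n):
    # same function as above, repeated so this snippet runs on its own
    return data[(data['gmDate'] > date(2011 + n, 10, 1)) & (data['gmDate'] < date(2012 + n, 5, 1))]

toy = pd.DataFrame({
    'gmDate': [date(2012, 10, 30), date(2013, 4, 17), date(2013, 10, 29), date(2018, 4, 11)],
})

first = get_nth_season(toy, 1)   # games between Oct 2012 and May 2013
sixth = get_nth_season(toy, 6)   # games between Oct 2017 and May 2018
print(len(first), len(sixth))
```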

### i. Merging Standings, Team Box Score Data

We will want to merge each team's games in the team box score data with what their standing was on that same date. Consequently, we will need to merge on both date and teamAbbr.

In [None]:
# inner join the tables on the date and teamAbbr
# now we'll have each team's stats for that game and their stats up till that date in the season
team_box_score = pd.merge(team_box_score, standings, how='inner',
                          left_on=['gmDate', 'teamAbbr'], right_on=['stDate', 'teamAbbr'])

#drop the added stDate column, which is redundant as we already have a gmDate column
team_box_score = team_box_score.drop('stDate', axis=1)
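
Because an inner join silently drops box score rows with no matching standings record, it can be worth checking how many rows would be lost. A minimal sketch on made-up tables, using the `indicator` option of `pd.merge` (the names `box`, `stand`, `check`, and `unmatched` are placeholders):

```python
import pandas as pd

box = pd.DataFrame({
    'gmDate':   ['2012-10-30', '2012-10-30', '2012-10-31'],
    'teamAbbr': ['WAS', 'CLE', 'WAS'],
    'teamPTS':  [84, 94, 99],
})
stand = pd.DataFrame({
    'stDate':   ['2012-10-30', '2012-10-30'],
    'teamAbbr': ['WAS', 'CLE'],
    'gamePlay': [1, 1],
})

# A left join with indicator=True adds a '_merge' column telling us where each row came from.
check = pd.merge(box, stand, how='left',
                 left_on=['gmDate', 'teamAbbr'], right_on=['stDate', 'teamAbbr'],
                 indicator=True)

# Rows marked 'left_only' are box score rows an inner join would have dropped.
unmatched = check[check['_merge'] == 'left_only']
print(unmatched[['gmDate', 'teamAbbr']])
```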

Now, we'll make our gmDate column into a datetime object for future use.

In [None]:
#finally, convert gmDate into a datetime type for ease of manipulation
team_box_score['gmDate'] = team_box_score['gmDate'].apply(lambda x: datetime.strptime(x, "%Y-%m-%d").date())

In [None]:
team_box_score.head()

### ii. Computing Averages

The first thing we will need to do is split our team box score data by season. We will do so below using our *get_nth_season* function from the beginning of the section.

In [None]:
# data for each season
all_seasons = []
for i in range(1, 7):
    # compile every season's data into one list
    all_seasons.append(get_nth_season(team_box_score, i))

Now, we will need to split each season up by team in order to compute each team's average statistics. We will use the following functions each time in order to do this. The components of these functions end up being quite long, so we've included a variety of comments to break them down.

In [None]:
def all_team_data(season_data, which_team=0):
    """
    This function takes the data for one season and compiles a list of 30 DataFrames; each DataFrame in the list
    represents the season data for one team in the NBA.
    
    Based on the argument given to which_team, this function will either group by the teamAbbr or the opptAbbr.
    The purpose of this is we will need to group by the former to compute the provided team's average stats before
    every game, but we will later need to group by the latter to compute the opponent's average stats before
    every game.
    """
    if which_team == 0:
        teams = season_data['teamAbbr'].unique()
        grouped = season_data.groupby('teamAbbr')
    else:
        teams = season_data['opptAbbr'].unique()
        grouped = season_data.groupby('opptAbbr')
    teams_data = []
    for team in teams:
        teams_data.append(grouped.get_group(team))
    return teams_data

In [None]:
def compute_averages(team, columns):
    """
    This function takes the season data for one team in the NBA and a list of columns. For every column c,
    this function replaces every entry, i_c, in the column with the average of its previous entries
    mean(0_c, ..., (i-1)_c).
    
    In other words, say I pass in the 2012-2013 season data for the Washington Wizards and the following
    list of columns: ['teamPTS', 'teamAST']. For every game in the season data, I will replace the teamPTS
    and teamAST recorded for that game with the Wizards' average teamPTS and teamAST during all games
    BEFORE that one. Essentially, it will be like the Wizards' 50th game was tomorrow and I had their season
    statistical averages up until that point in time. The point of this transformation is to pretend that
    we are analysts looking at stats before a game happens trying to determine who will win using a model.
    
    This function goes through the entries in reverse order, so entries are only
    replaced after all computation involving them has been completed.
    """
    # first reset the index so that all games are numbered 0-81
    team = team.reset_index(drop=True)
    
    # replace the final entry in each column with the average of the prev n-1 entries
    for col in columns:
        team.loc[len(team)-1, col] = team.loc[0:(len(team)-2)][col].mean()
    
    # for each entry, i, in each column, replace that entry with the average of the previous i entries (0 through i-1)
    for i in range(len(team)-2):
        ind = len(team) - i - 2
        for col in columns:
            # for each game, g, compute the team's total, say, pts after game g. then subtract the points scored in game g
            # and divide by the number of games that happened before game g. this yields the team's avg pts before game g.
            team.loc[ind,col] = (team.loc[ind+1,col]*team.loc[ind,'gamePlay']-team.loc[ind,col])/(team.loc[ind,'gamePlay']-1)

    # set all stats before the first game to 0
    for col in columns:
        team.loc[0,col] = 0
    # return the dataframe
    return team
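
For comparison, the same "average of all previous games" idea can be sketched with pandas' expanding window followed by a one-row shift. This is an alternative sketch, not the function above: it does not rely on the gamePlay column, and the names `pre_game_mean` and `teamPTS_avg` as well as the toy numbers are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'teamAbbr': ['WAS', 'BOS', 'WAS', 'BOS', 'WAS'],
    'gmDate':   ['2012-10-30', '2012-10-30', '2012-11-01', '2012-11-02', '2012-11-03'],
    'teamPTS':  [100, 95, 110, 105, 90],
})
toy = toy.sort_values('gmDate')

def pre_game_mean(s):
    # expanding().mean() at row i averages rows 0..i; shifting by one row
    # turns that into the average of rows 0..i-1. The first game has no
    # history, so its NaN is filled with 0 to match the convention above.
    return s.expanding().mean().shift(1).fillna(0)

# Apply per team so one team's games never leak into another's average.
toy['teamPTS_avg'] = toy.groupby('teamAbbr')['teamPTS'].transform(pre_game_mean)
print(toy)
```

In this toy example, the Wizards' third game gets the average of their first two games, (100 + 110) / 2 = 105.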

In [None]:
team_cols = ['teamMin', 'teamPTS', 'teamAST', 'teamTO', 'teamSTL', 'teamBLK', 'teamPF', 'teamFGA',
               'teamFGM', 'teamFG%', 'team2PA', 'team2PM', 'team2P%', 'team3PA', 'team3PM', 'team3P%', 'teamFTA',
               'teamFTM', 'teamFT%', 'teamORB', 'teamDRB', 'teamTRB', 'teamPTS1', 'teamPTS2', 'teamPTS3', 'teamPTS4',
               'teamPTS5', 'teamPTS6', 'teamPTS7', 'teamPTS8', 'teamTREB%', 'teamASST%', 'teamTS%', 'teamEFG%',
               'teamOREB%', 'teamDREB%', 'teamTO%', 'teamSTL%', 'teamBLK%', 'teamBLKR', 'teamPPS', 'teamFIC',
               'teamFIC40', 'teamOrtg', 'teamDrtg', 'teamEDiff', 'teamPlay%', 'teamAR', 'teamAST/TO', 'teamSTL/TO']
oppt_cols = ['opptMin', 'opptPTS', 'opptAST', 'opptTO', 'opptSTL', 'opptBLK', 'opptPF',
            'opptFGA', 'opptFGM', 'opptFG%', 'oppt2PA', 'oppt2PM', 'oppt2P%', 'oppt3PA', 'oppt3PM', 'oppt3P%',
            'opptFTA', 'opptFTM', 'opptFT%', 'opptORB', 'opptDRB', 'opptTRB', 'opptPTS1', 'opptPTS2', 'opptPTS3',
            'opptPTS4', 'opptPTS5', 'opptPTS6', 'opptPTS7', 'opptPTS8', 'opptTREB%', 'opptASST%', 'opptTS%',
            'opptEFG%', 'opptOREB%', 'opptDREB%', 'opptTO%', 'opptSTL%', 'opptBLK%', 'opptBLKR', 'opptPPS',
            'opptFIC', 'opptFIC40', 'opptOrtg', 'opptDrtg', 'opptEDiff', 'opptPlay%', 'opptAR', 'opptAST/TO',
            'opptSTL/TO']

In [None]:
#all_seasons[0].columns[57:]
all_seasons[0].columns[0:15]

In [None]:
def adjust_standings(team, columns):
    """
    This function essentially shifts all the standings data that we've added one entry down. This way, the data
    we'll use to predict won't include statistics from that game. Then we'll set all standings data for first
    games to 0.
    
    Call this function after calling compute_averages to ensure indices are numbered as desired.
    """
    # for each column we want to shift...
    for col in columns:
        # let entry n take the value of entry n-1
        for i in range(1, len(team)):
            ind = len(team) - i
            team.loc[ind, col] = team.loc[ind-1, col]
        # set entry 0 of each col to 0
        team.loc[0, col] = 0
    return team
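
The same one-game shift can also be sketched with `groupby(...).shift`, which moves each team's rows down by one without an explicit loop. The toy table and the `cols` list are placeholders, and the rows are assumed to be in date order.

```python
import pandas as pd

toy = pd.DataFrame({
    'teamAbbr':  ['WAS', 'BOS', 'WAS', 'BOS'],
    'streak':    [1, -1, 2, -2],
    'total_wpc': [1.0, 0.0, 1.0, 0.0],
})
cols = ['streak', 'total_wpc']  # hypothetical list of standings columns

# Within each team, entry n takes the value of entry n-1;
# each team's first game has nothing before it, so it is set to 0.
toy[cols] = toy.groupby('teamAbbr')[cols].shift(1).fillna(0)
print(toy)
```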

In [None]:
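# NOTE: standings_cols is not defined anywhere above. As an assumption, we treat every
# column of the cleaned standings table other than the merge keys as a standings column.
standings_cols = [c for c in standings.columns if c not in ('stDate', 'teamAbbr')]

# for every season...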
for i in range(len(all_seasons)):
    # get the data for every team from that season
    list_of_teams = all_team_data(all_seasons[i])
    
    # for each team, compute averages for various columns.
    # (see the compute_averages comments for details).
    for k in range(len(list_of_teams)):
        list_of_teams[k] = compute_averages(list_of_teams[k], team_cols)
        # also for each team, move all their standings data one entry down
        # for ex. we want the Wizards' streak number BEFORE game i, not after it.
        list_of_teams[k] = adjust_standings(list_of_teams[k], standings_cols)
    

    # once every team's averages have been computed, join them back into one season
    season_df = pd.concat(list_of_teams)
    
    # then sort the season by game date and assign it back into all_seasons
    season_df = season_df.sort_values(by='gmDate')
    all_seasons[i] = season_df

## VI. Exploratory Data Analysis

Here, we will investigate which features to use in our model. This section will include a heatmap of feature correlations and several other plots.

### i. PCA Analysis

## VII. Inference and Prediction

This is where we will build a model. We will need training, validation, and test data (the validation set is a hold-out set, standing in for cross-validation). We have 6 seasons in total; one option is to use the first 4 seasons plus the first 30 games of the 5th season as training data, the remainder of the 5th season as validation data, and the 6th season as test data. It may be better to keep early-season games out of the test set, since we will probably mispredict many of them simply because all pre-game stats are 0 at the start of a season.
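
A rough sketch of the season-based split described above, on a made-up table with one row per game for a single team. The `season` and `game_no` columns are hypothetical; in the real data they would have to be derived from gmDate and gamePlay.

```python
import pandas as pd

# Hypothetical stand-in: 6 seasons of 82 games each for one team.
toy = pd.DataFrame(
    [(s, g) for s in range(1, 7) for g in range(1, 83)],
    columns=['season', 'game_no'],
)

# Seasons 1-4 plus the first 30 games of season 5 go to training.
is_train = (toy['season'] <= 4) | ((toy['season'] == 5) & (toy['game_no'] <= 30))
# The remainder of season 5 is held out for validation.
is_val = (toy['season'] == 5) & (toy['game_no'] > 30)
# Season 6 is reserved for testing.
is_test = toy['season'] == 6

train, val, test = toy[is_train], toy[is_val], toy[is_test]
print(len(train), len(val), len(test))
```

Splitting by time like this keeps later games out of the training data, which matches the goal of predicting games before they happen.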