# DS100 Final Project: A Basketball Analysis
## Which box score Team Statistics are the best predictors of winning NBA teams?
### Creators: Alec Zhou, Prashant Malyala

## This Project

In this project, we investigate which team statistics (measured or calculated from in-game performance) are the best predictors of team success in the NBA. Our analysis uses Team Box Score and Standings data from the 2012-2013 through 2017-2018 regular seasons.

## Set up

Below, we will import the necessary libraries for our data analysis.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Now, we'll read in the relevant basketball data.

In [None]:
#from the provided ds100 dataset
team_box_score = pd.read_csv('basketball/Basketball-TeamBoxScores.csv')
standings = pd.read_csv('basketball/Standings.csv')

## Data Preview

Let's briefly see what our data looks like.

**Standings Data:**

There are many teams who didn't play on October 30, 2012, who still have their standings listed on that date. Digging deeper, it seems that the Standings data has an entry for every team on every day of each season.

In [None]:
standings.head(5)

In [None]:
#30 teams are listed on this date even though not all of them played. there are only 30 teams in the whole NBA
len(standings[standings['stDate'] == '2012-10-30'])
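
The single-date check above can be extended to every date at once. The sketch below is only an illustration on a tiny hand-made table: `toy_standings` and its values are hypothetical, and only the column names `stDate` and `teamAbbr` come from the real data.

```python
import pandas as pd

# Toy stand-in for the Standings table (hypothetical rows, real column names).
toy_standings = pd.DataFrame({
    'stDate':   ['2012-10-30', '2012-10-30', '2012-10-31', '2012-10-31'],
    'teamAbbr': ['BOS', 'MIA', 'BOS', 'MIA'],
})

# Number of distinct teams listed on each date.
teams_per_date = toy_standings.groupby('stDate')['teamAbbr'].nunique()

# Every date should list every team, and no (date, team) pair should repeat.
n_teams = toy_standings['teamAbbr'].nunique()
complete = bool((teams_per_date == n_teams).all())
no_duplicates = not toy_standings.duplicated(['stDate', 'teamAbbr']).any()
```

Running the same three checks on the real `standings` table should give `complete` and `no_duplicates` both `True` if the table really has one row per team per day.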

Note that much of the information in the Standings table, like streaks or games behind the conference leader, doesn't qualify as a "box score team statistic" and therefore isn't relevant to our question. We just want a record of how many wins and losses each team had at each point in the season.

**Team Box Score Data:**

In the preview below, each crew of 3 officials appears in 2 rows on a given date. This led us to suspect that the Team Box Score data has two entries per game, one for each team. We can verify this in the second cell below.

In [None]:
team_box_score.head(7)

In [None]:
#shows that the same game is listed twice, once for each team
team_box_score[team_box_score['gmDate'] == '2012-10-30'][['teamAbbr', 'opptAbbr']]

So not only do we have all the team and opposing team's statistics per game entry, we have two listings of every game! Certainly there's some redundancy in this data, and thinking about how we should address this redundancy will be part of our data cleaning and EDA process.
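
One possible way to confirm the duplication across the whole table, and to collapse it to one row per game if needed, is to build a game key that ignores which side is listed as `teamAbbr`. This is a sketch on hypothetical rows (`toy_box`), and it assumes the same two teams never meet twice on one date.

```python
import pandas as pd

# Toy stand-in for the Team Box Score table (hypothetical rows, real column names).
toy_box = pd.DataFrame({
    'gmDate':   ['2012-10-30', '2012-10-30', '2012-10-30', '2012-10-30'],
    'teamAbbr': ['WAS', 'CLE', 'BOS', 'MIA'],
    'opptAbbr': ['CLE', 'WAS', 'MIA', 'BOS'],
})

# Order-independent game key: the date plus the alphabetically sorted pair of teams.
pairs = toy_box[['teamAbbr', 'opptAbbr']].apply(lambda r: '-'.join(sorted(r)), axis=1)
game_key = toy_box['gmDate'] + ':' + pairs

# If every game is listed once per team, each key appears exactly twice.
rows_per_game = game_key.value_counts()
all_games_listed_twice = bool((rows_per_game == 2).all())

# Keeping the first row per key leaves one row per game.
one_row_per_game = toy_box[~game_key.duplicated()]
```

Whether to deduplicate at all depends on the analysis: keeping both rows is convenient when each row is treated as "one team's performance in one game".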

Note that we have a lot of quantitative variables in the team box score data; these are the statistics we hope to investigate.

As for the categorical data like who the officials were and in what conference the team was, we don't plan to dig into this information as much. However, because it's still worthwhile to see what the implications are of the data we drop, we will have a section dedicated to doing some EDA with this categorical data (refer to the section titled: *Visualizing some features of the Box Score Data*).

## Data Cleaning and Exploratory Data Analysis

### Cleaning our Standings Data

Looking at the Standings.PDF file provided with our data, it's clear that much of the Standings data isn't relevant to us. Our question is chiefly concerned with which **in-game** team statistics best predict team success, so external measures such as teams' winning/losing streaks, performance relative to the conference, and strength of schedule are all data we can drop.

Note that the assumption that we can predict team success using only team box statistics is inherent to our question. How valid this assumption is will be for us to assess once we build a model.

Here, the measure of success we will refer to is a **team's winning percentage** at different points of the season. 

team winning percentage = $ \frac{W}{W + L} $, where W represents total wins, and L total losses.
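
The formula is undefined before a team has played any games, since $W + L = 0$. A small worked example, using hypothetical win/loss counts in a toy table, shows how that case behaves in pandas and how filling it with 0 (the choice made in this notebook) resolves it:

```python
import pandas as pd

# Hypothetical records: BOS has not played yet, MIA is 3-1, LAL is 1-3.
toy = pd.DataFrame({
    'teamAbbr': ['BOS', 'MIA', 'LAL'],
    'gameWon':  [0, 3, 1],
    'gameLost': [0, 1, 3],
})

games_played = toy['gameWon'] + toy['gameLost']

# 0 / 0 evaluates to NaN in pandas; treat "no games yet" as a winning percentage of 0.
toy['wpc'] = (toy['gameWon'] / games_played).fillna(0)
```

Filling with 0 is a convention rather than a fact about the team; leaving those rows as `NaN` would be an equally defensible alternative if early-season rows are excluded later.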

In [None]:
#really, we only care about the standing date, team name, and win percentage (computed from wins and losses)
#.copy() gives us an independent dataframe, so adding a column doesn't trigger a SettingWithCopyWarning
standings = standings[['stDate', 'teamAbbr', 'gameWon', 'gameLost']].copy()

#wpc represents winning percentage; it is NaN (0/0) before a team's first game, so we fill those with 0
standings['wpc'] = (standings['gameWon'] / (standings['gameWon'] + standings['gameLost'])).fillna(0)
standings = standings.drop(['gameWon', 'gameLost'], axis=1)

In [None]:
standings.head()

### Merging the Standings and Box Score Data

Now that we have the winning percentages of each team throughout the season from the Standings table, we can inner join it with the Team Box Score data based on the date and team in question.

For every game, we can then see what the winning percentage was of both teams and where their statistical performances differed.

In [None]:
#inner join the tables on the date and teamAbbr so every game has a winning percentage attribute
team_box_score = pd.merge(team_box_score, standings, how='inner',
                          left_on=['gmDate', 'teamAbbr'], right_on=['stDate', 'teamAbbr'])
#drop the added stDate column, which is redundant as we already have a gmDate column
team_box_score = team_box_score.drop('stDate', axis=1)

In [None]:
team_box_score.head()
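
An inner join silently discards any box score row that has no matching standings row. A hedged way to check for that is to run a left join with `indicator=True` and look for unmatched rows. The tables and values below are hypothetical; only the key column names mirror the real data.

```python
import pandas as pd

# Hypothetical box score rows; the 2012-10-31 row has no standings entry on purpose.
toy_box = pd.DataFrame({
    'gmDate':   ['2012-10-30', '2012-10-30', '2012-10-31'],
    'teamAbbr': ['BOS', 'MIA', 'BOS'],
    'teamPTS':  [107, 120, 99],
})
toy_standings = pd.DataFrame({
    'stDate':   ['2012-10-30', '2012-10-30'],
    'teamAbbr': ['BOS', 'MIA'],
    'wpc':      [0.0, 1.0],
})

# validate='many_to_one' raises if a (date, team) key repeats in the standings table;
# indicator=True adds a '_merge' column saying where each row came from.
checked = pd.merge(toy_box, toy_standings, how='left',
                   left_on=['gmDate', 'teamAbbr'], right_on=['stDate', 'teamAbbr'],
                   validate='many_to_one', indicator=True)

# Rows an inner join would have dropped.
unmatched = checked[checked['_merge'] == 'left_only']
```

If `unmatched` is empty on the real tables, the inner join above lost nothing.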

### Visualizing some features of the Box Score Data

*Make this section about how officials may affect winning performance, how being the home or away team may affect winning performance, seeing how wins were distributed across each conference/division, and maybe if rest at all played a factor in who was more likely to win a game. We plan to drop all if not most of this data in the next section when we start exploring specific team statistics, so investing some time into exploring the impact of what we're dropping is a good idea.*

### Cleaning our Team Box Score Data

Our box score data has a lot of columns, making it quite difficult to find patterns. Not all of the columns are directly relevant to the question we're trying to answer, so let's see what we can clean up.

Just in case we want to revisit any of the team_box_score data, we'll store our changes in different dataframes.

First of all, the time of each game and the identities of our officials aren't really team statistics. We may choose to revisit them later, but for now, let's ignore them.

In [None]:
#note that none of these seasons were lockouts or out of the ordinary. this seasTyp column also doesn't tell us much
team_box_score['seasTyp'].unique()

In [None]:
#dropping the aforementioned columns
tbs_cleaned = team_box_score.drop(['gmTime', 'seasTyp', 'offLNm1', 'offFNm1',
                                   'offLNm2', 'offFNm2', 'offLNm3', 'offFNm3'], axis=1)

The conference and division a team plays in, the number of days they've had off, and the total team minutes (which would be the same for both teams in a game) also aren't really performance statistics. We will also drop this information for now.

Note that we will also drop this data for the opponent side (opptConf, opptDiv, etc.).

In [None]:
#dropping the columns we've mentioned above
tbs_cleaned = tbs_cleaned.drop(['teamConf', 'teamDiv', 'teamMin', 'teamDayOff', 'opptConf',
                                  'opptDiv', 'opptMin', 'opptDayOff'], axis=1)

In [None]:
tbs_cleaned.head()

Finally, to make the teamRslt column machine-readable, let's replace losses with 0's and wins with 1's.

In [None]:
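#make the result columns binary indicators; limiting the replacement to these two columns
#avoids touching any other column that might contain the strings 'Loss' or 'Win'
result_cols = ['teamRslt', 'opptRslt']
tbs_cleaned[result_cols] = tbs_cleaned[result_cols].replace({'Loss': 0, 'Win': 1})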
#verify that we only have 0's and 1's in this column
tbs_cleaned['teamRslt'].unique()

In [None]:
tbs_cleaned.head()

We've narrowed our data down mostly to the bare-bones numbers and statistics. Awesome!

Now we can focus on exploring our data a bit more thoroughly.

### Exploring the Team Box Score Data

In [None]:
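#the 25 numeric columns with the largest variance (ddof=0 matches np.var's population variance)
tbs_cleaned.var(numeric_only=True, ddof=0).sort_values(ascending=False).head(25)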

In [None]:
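#opptConf and opptDiv were already dropped during cleaning, so listing them again would raise a KeyError
numbers_only = tbs_cleaned.drop(['teamAbbr', 'teamLoc', 'opptAbbr', 'opptLoc', 'opptRslt'], axis=1)
#keep only numeric columns so any remaining text columns (such as gmDate) don't break the math below
numbers_only = numbers_only.select_dtypes(include='number')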

In [None]:
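#center each column on 0 so we can run SVD for PCA
#(DataFrame.mean() is column-wise; np.mean on a DataFrame can collapse to a single number in newer pandas)
numbers_only_centered = numbers_only - numbers_only.mean()
numbers_only_centered.head()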

In [None]:
#divide each column by its standard deviation so high-variance columns don't dominate the PCA
numbers_only_centered_scaled = numbers_only_centered / numbers_only_centered.std(ddof=0)
numbers_only_centered_scaled.head()
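
As a quick sanity check on the centering and scaling steps, the sketch below standardizes a tiny hypothetical table (made-up values under two real column names) and confirms that each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
import pandas as pd

# Hypothetical values on very different scales.
toy = pd.DataFrame({
    'teamPTS': [90.0, 100.0, 110.0, 120.0],
    'teamFG%': [0.40, 0.45, 0.50, 0.55],
})

# Column-wise centering and scaling; ddof=0 uses the population standard deviation.
standardized = (toy - toy.mean()) / toy.std(ddof=0)

column_means = standardized.mean()
column_stds = standardized.std(ddof=0)
```

Note that a column with zero variance would turn into `NaN` under this scaling, so constant columns would need to be dropped before running the SVD.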

In [None]:
u, s, vt = np.linalg.svd(numbers_only_centered_scaled, full_matrices=False)
p_matrix = numbers_only_centered_scaled @ vt.T

In [None]:
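#one point per principal component; len(s) avoids hard-coding the number of columns
pc_numbers = np.arange(1, len(s) + 1)
plt.plot(pc_numbers, s**2 / np.sum(s**2));
plt.xticks(pc_numbers[::10]);
plt.xlabel('PC #');
plt.ylabel('Proportion of Variance Explained');
plt.title('Scree Plot: Variance Explained by Each PC');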
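
To make the SVD step concrete, here is a minimal sketch on synthetic data (random numbers, not basketball statistics). It shows that the squared singular values give each component's share of the variance, that projecting onto `vt.T` is the same as `u * s`, and one possible way to pick a number of components. The 90% threshold is an arbitrary assumption for illustration.

```python
import numpy as np

# Synthetic data: 50 rows, 4 columns, with column 1 made to correlate with column 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 1] += 2 * X[:, 0]

# Standardize each column, as in the cells above.
X = (X - X.mean(axis=0)) / X.std(axis=0)

u, s, vt = np.linalg.svd(X, full_matrices=False)

# Share of total variance captured by each principal component.
var_ratio = s**2 / np.sum(s**2)
cumulative = np.cumsum(var_ratio)

# Principal component scores; algebraically identical to u * s.
scores = X @ vt.T

# Smallest number of components whose cumulative share reaches 90%.
k = int(np.searchsorted(cumulative, 0.90) + 1)
```

Reading `cumulative` alongside the scree plot gives a more direct answer to "how many components matter" than the per-component curve alone.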

In [None]:
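#drop every opponent column, keeping only the team's own statistics
numbers_only_filtered = numbers_only.loc[:, ~numbers_only.columns.str.contains('oppt')]

#hand-picked subset of team statistics to compare in a correlation heatmap
df_rando = numbers_only_filtered[['teamOREB%', 'teamEDiff', 'teamSTL', 'teamBLK', 'teamFG%', 'team3P%', 'teamORB',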
                                 'teamDRB', 'teamPTS3', 'teamPTS4', 'teamASST%', 'teamEFG%', 'teamOrtg', 'teamDrtg',
                                 'teamSTL/TO']]
df_rando.head()

In [None]:
sns.heatmap(df_rando.corr())
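
A heatmap shows the overall pattern, but the strongest relationships can be easier to read as a ranked list. The sketch below is one way to do that, run on hypothetical values under three real column names; it keeps each pair once by masking everything except the upper triangle of the correlation matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical values for three team statistics.
toy = pd.DataFrame({
    'teamFG%':  [0.40, 0.42, 0.45, 0.47, 0.50],
    'teamEFG%': [0.46, 0.49, 0.51, 0.54, 0.57],
    'teamSTL':  [9.0, 6.0, 8.0, 5.0, 7.0],
})

corr = toy.corr()

# Keep only entries above the diagonal so each pair appears once and self-correlations are excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Sort pairs by absolute correlation, strongest first.
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
```

Applied to `df_rando.corr()`, the top of `ranked` would flag statistics that carry nearly the same information, which matters when choosing features for a model.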

## Inference and Prediction