## EECS 731 Project 4: Major Leagues

In this project, I read a dataset of [NBA games](https://github.com/fivethirtyeight/data/tree/master/nba-elo) and perform regression to predict the scores of future games. In particular, I perform regression for all 30 franchises currently in the NBA, taking into consideration their previous game results (points scored, opponent points scored, etc.), each game's forecast, and where each game was played (home, away, or neutral).

In addition to the above link, the dataset I used can also be found in the data/raw/ directory.

### Python Imports

To compare the different regression models, I use the Root Mean Squared Error (RMSE) metric. To calculate that, I use the math sqrt and sklearn mean_squared_error functions. I also limit the display precision of Pandas dataframes to 2 so that the final results are more readable.

In [1]:
import pandas as pd
pd.set_option("display.precision",2)
from math import sqrt

from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
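
As a minimal sketch of what the RMSE metric described above computes, the following pure-Python function mirrors sqrt(mean_squared_error(...)). The function name rmse and the sample point totals are hypothetical and used only for illustration.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical point totals, for illustration only
actual_points = [100, 110, 95]
predicted_points = [98, 114, 95]
example_rmse = rmse(actual_points, predicted_points)
```

Because the errors are squared before averaging, a single badly missed game raises the RMSE more than several small misses of the same total size.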

### Regression Model Creation

For this project, I evaluate the following four regression model approaches:

- Linear Regression
- Random Forest
- Gradient Boosting
- Neural Network (MLP)

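For the Random Forest, I limit the max_depth value to 6 in order to reduce its execution time. For the MLP neural network, I arbitrarily set the max_iter value to 10000 to give the model a better chance of convergence. Note that the Random Forest and Gradient Boosting models are created with scikit-learn's classifier classes, so they treat each distinct point total as a class label rather than as a continuous value.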

In [2]:
linearReg_model = linear_model.LinearRegression()
randomForest_model = RandomForestClassifier(max_depth=6)
gradBoost_model = GradientBoostingClassifier()
neuralNetwork_model = MLPRegressor(max_iter=10000)
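
The cell above builds the Random Forest and Gradient Boosting models from classifier classes. As a hedged sketch of an alternative (not what produced the results reported in this project), scikit-learn also provides RandomForestRegressor and GradientBoostingRegressor, which predict continuous values directly. The synthetic data and the variable names below are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Synthetic data, for illustration only: 200 samples, 4 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 100 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=2.0, size=200)

# Regressor counterparts of the classifier-based models
rf_regressor = RandomForestRegressor(max_depth=6, random_state=0)
gb_regressor = GradientBoostingRegressor(random_state=0)

# Fit on the first 150 rows and predict the remaining 50
rf_pred = rf_regressor.fit(X[:150], y[:150]).predict(X[150:])
gb_pred = gb_regressor.fit(X[:150], y[:150]).predict(X[150:])
```

Swapping these in would change the RMSE values, so the results table and discussion in this project apply only to the classifier-based models.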

### Reading the Raw Dataset

In [3]:
game_list = pd.read_csv("../data/raw/nbaallelo.csv")
game_list

| | gameorder | game_id | lg_id | _iscopy | year_id | date_game | seasongame | is_playoffs | team_id | fran_id | ... | win_equiv | opp_id | opp_fran | opp_pts | opp_elo_i | opp_elo_n | game_location | game_result | forecast | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 194611010TRH | NBA | 0 | 1947 | 11/1/1946 | 1 | 0 | TRH | Huskies | ... | 40.29 | NYK | Knicks | 68 | 1300.00 | 1306.72 | H | L | 0.64 | |
| 1 | 1 | 194611010TRH | NBA | 1 | 1947 | 11/1/1946 | 1 | 0 | NYK | Knicks | ... | 41.71 | TRH | Huskies | 66 | 1300.00 | 1293.28 | A | W | 0.36 | |
| 2 | 2 | 194611020CHS | NBA | 0 | 1947 | 11/2/1946 | 1 | 0 | CHS | Stags | ... | 42.01 | NYK | Knicks | 47 | 1306.72 | 1297.07 | H | W | 0.63 | |
| 3 | 2 | 194611020CHS | NBA | 1 | 1947 | 11/2/1946 | 2 | 0 | NYK | Knicks | ... | 40.69 | CHS | Stags | 63 | 1300.00 | 1309.65 | A | L | 0.37 | |
| 4 | 3 | 194611020DTF | NBA | 0 | 1947 | 11/2/1946 | 1 | 0 | DTF | Falcons | ... | 38.86 | WSC | Capitols | 50 | 1300.00 | 1320.38 | H | L | 0.64 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 126309 | 63155 | 201506110CLE | NBA | 0 | 2015 | 6/11/2015 | 100 | 1 | CLE | Cavaliers | ... | 60.31 | GSW | Warriors | 103 | 1790.96 | 1809.98 | H | L | 0.55 | |
| 126310 | 63156 | 201506140GSW | NBA | 0 | 2015 | 6/14/2015 | 102 | 1 | GSW | Warriors | ... | 68.01 | CLE | Cavaliers | 91 | 1704.39 | 1700.74 | H | W | 0.77 | |
| 126311 | 63156 | 201506140GSW | NBA | 1 | 2015 | 6/14/2015 | 101 | 1 | CLE | Cavaliers | ... | 60.01 | GSW | Warriors | 104 | 1809.98 | 1813.63 | A | L | 0.23 | |
| 126312 | 63157 | 201506170CLE | NBA | 0 | 2015 | 6/16/2015 | 102 | 1 | CLE | Cavaliers | ... | 59.29 | GSW | Warriors | 105 | 1813.63 | 1822.29 | H | L | 0.48 | |


### Preparing Per-Team Extraction

In the nbaallelo.csv dataset, the "team_id" column contains the 3-letter abbreviation for each team's city as denoted by [Basketball Reference](https://www.basketball-reference.com/teams/). In this case, I am only interested in the franchises currently in the NBA, so I hard code a list of the desired team abbreviations.

In [4]:
team_codes = ["ATL","BRK","BOS","CHA","CHI","CLE","DAL","DEN","DET","GSW","HOU",
              "IND","LAC","LAL","MEM","MIA","MIL","MIN","NOP","NYK","OKC","ORL",
              "PHI","PHO","POR","SAC","SAS","TOR","UTA","WAS"]

I then create the necessary lists for storing the regression results and other helpful data (e.g. number of games used for training and testing).

In [5]:
team_num_games = []              
linearReg_results = []
randomForest_results = []
gradBoost_results = []
neuralNetwork_results = []

### Extracting Per-Team Dataset

For all 30 NBA teams, I extract their datasets with the following steps:

- Select all games from the original dataset that the current team played against other current NBA teams.
- Retrieve the desired columns from the reduced dataset.
- Save the new dataset to the data/processed/ directory.

As an example, for the Atlanta Hawks (ATL), I would get the dataset as follows:

In [6]:
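# Keep only the games played by ATL against other current NBA teams
team_list = game_list[game_list["team_id"] == "ATL"]
team_list = team_list[team_list["opp_id"].isin(team_codes)]
# Keep the columns of interest and drop rows with missing values
team_list = team_list[["team_id","pts","win_equiv","opp_id","opp_pts","game_location","forecast"]].dropna()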
team_list

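| | team_id | pts | win_equiv | opp_id | opp_pts | game_location | forecast |
|---|---|---|---|---|---|---|---|
| 17169 | ATL | 125 | 47.31 | MIL | 107 | H | 0.86 |
| 17194 | ATL | 106 | 47.84 | CHI | 91 | H | 0.77 |
| 17233 | ATL | 123 | 47.73 | PHO | 100 | A | 0.61 |
| 17242 | ATL | 124 | 47.51 | LAL | 125 | A | 0.29 |
| 17308 | ATL | 103 | 44.44 | BOS | 123 | H | 0.50 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 126279 | ATL | 94 | 52.19 | WAS | 91 | A | 0.44 |
| 126286 | ATL | 89 | 51.18 | CLE | 97 | H | 0.55 |
| 126290 | ATL | 82 | 49.96 | CLE | 94 | H | 0.52 |
| 126295 | ATL | 111 | 49.74 | CLE | 114 | A | 0.23 |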


Note that I perform the same steps for all 30 teams in a for loop, as shown in the following sections, using a variable for the initial step (e.g. team instead of "ATL"). Also, while not shown above, I save each dataset to the processed directory inside the for loop.

### Performing Regression

In order to perform regression for all 30 teams, I use a for loop to extract datasets for each team and then perform regression on those datasets. Once the datasets are created, I then perform the following steps:

- Store the number of rows/games for later comparisons.
- Split each dataset into two: one with the desired features and one with the points for each game.
- Perform one-hot encoding on the string-based columns in the features dataset (team ids, game location, etc.).
- Create training and testing sets with the two newly created datasets.
- Perform regression with each of the models and store the results.
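
To illustrate the one-hot encoding step listed above, here is a minimal sketch on a hypothetical four-game dataset. The values are made up for illustration; only the column names follow the per-team datasets.

```python
import pandas as pd

# Hypothetical mini-dataset with the same kinds of columns as a per-team dataset
games = pd.DataFrame({
    "opp_id": ["BOS", "CHI", "BOS", "CHI"],
    "opp_pts": [101, 95, 110, 88],
    "game_location": ["H", "A", "A", "H"],
    "forecast": [0.60, 0.45, 0.40, 0.70],
})

# Replace each string column with one indicator column per distinct value
encoded = pd.get_dummies(games, columns=["opp_id", "game_location"])
print(list(encoded.columns))
```

Each string column is replaced by one indicator column per distinct value (e.g. opp_id_BOS and opp_id_CHI), while the numeric columns pass through unchanged. This gives the models purely numeric inputs.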

Note that, to track the progress of the for loop, I print out the current team that is being evaluated. I do this as some models, namely the Gradient Boosting and Neural Network models, can take longer depending on the size of a given team's dataset.

In [7]:
print("\nPerforming Regression for:")
for team in team_codes:
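    # Keep only the games played by this team against other current NBA teams
    team_list = game_list[game_list["team_id"] == team]
    team_list = team_list[team_list["opp_id"].isin(team_codes)]
    # Keep the columns of interest, drop rows with missing values, and save the result
    team_list = team_list[["team_id","pts","win_equiv","opp_id","opp_pts","game_location","forecast"]].dropna()
    team_list.to_csv("../data/processed/{}.csv".format(team))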
    team_num_games.append(len(team_list))
    
    print("{} ({} games)".format(team,len(team_list)))

    team_features = team_list.loc[:,("team_id","win_equiv","opp_id","opp_pts","game_location","forecast")].dropna()
    team_points = team_list.loc[:,"pts"].dropna()
    team_features = pd.get_dummies(team_features, columns=["team_id","opp_id","game_location"])

    features_train, features_test, points_train, points_test = train_test_split(team_features, team_points, test_size=0.50, random_state=0, shuffle=True)

    points_pred = linearReg_model.fit(features_train, points_train).predict(features_test)
    linearReg_results.append(sqrt(mean_squared_error(points_test, points_pred)))

    points_pred = randomForest_model.fit(features_train, points_train).predict(features_test)
    randomForest_results.append(sqrt(mean_squared_error(points_test, points_pred)))

    points_pred = gradBoost_model.fit(features_train, points_train).predict(features_test)
    gradBoost_results.append(sqrt(mean_squared_error(points_test, points_pred)))

    points_pred = neuralNetwork_model.fit(features_train, points_train).predict(features_test)
    neuralNetwork_results.append(sqrt(mean_squared_error(points_test, points_pred)))
    
    print("\tDone")


Performing Regression for:
ATL (3332 games)
	Done
BRK (266 games)
	Done
BOS (4020 games)
	Done
CHA (756 games)
	Done
CHI (3517 games)
	Done
CLE (3156 games)
	Done
DAL (2643 games)
	Done
DEN (2840 games)
	Done
DET (3748 games)
	Done
GSW (3120 games)
	Done
HOU (3217 games)
	Done
IND (2877 games)
	Done
LAC (2279 games)
	Done
LAL (3974 games)
	Done
MEM (1106 games)
	Done
MIA (2088 games)
	Done
MIL (3314 games)
	Done
MIN (1904 games)
	Done
NOP (166 games)
	Done
NYK (3836 games)
	Done
OKC (604 games)
	Done
ORL (1960 games)
	Done
PHI (3568 games)
	Done
PHO (3379 games)
	Done
POR (3249 games)
	Done
SAC (2212 games)
	Done
SAS (3027 games)
	Done
TOR (1464 games)
	Done
UTA (2743 games)
	Done
WAS (1347 games)
	Done


### Displaying the Results

Once all of the teams have been evaluated, I create a new Pandas dataframe with the team IDs, the number of rows/games used for regression and the RMSE of each model tested.

In [8]:
reg_results = pd.DataFrame({"Team":team_codes,"# of Games":team_num_games,
                            "LinearRegression":linearReg_results,"RandomForest":randomForest_results,
                            "GradientBoost":gradBoost_results,"NeuralNetwork":neuralNetwork_results})
reg_results

| | Team | # of Games | LinearRegression | RandomForest | GradientBoost | NeuralNetwork |
|---|---|---|---|---|---|---|
| 0 | ATL | 3332 | 10.3 | 12.22 | 15.06 | 10.27 |
| 1 | BRK | 266 | 466000000000.0 | 13.18 | 14.39 | 12.33 |
| 2 | BOS | 4020 | 10.5 | 12.36 | 14.36 | 10.74 |
| 3 | CHA | 756 | 10.2 | 13.31 | 14.77 | 10.74 |
| 4 | CHI | 3517 | 10.9 | 12.51 | 14.16 | 11.07 |
| 5 | CLE | 3156 | 10.4 | 12.19 | 14.05 | 10.54 |
| 6 | DAL | 2643 | 10.5 | 12.43 | 14.22 | 10.74 |
| 7 | DEN | 2840 | 10.6 | 13.49 | 15.77 | 10.82 |
| 8 | DET | 3748 | 10.1 | 12.42 | 13.77 | 10.19 |
| 9 | GSW | 3120 | 10.6 | 12.6 | 15.09 | 10.76 |
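
As a rough sketch of how the results could be summarized, the snippet below takes the first three rows of the results table and computes the median RMSE per model along with the best model per team. The names sample_results, median_rmse, and best_model are hypothetical. The median is used here as an assumption on my part, since a single extreme value (such as BRK under Linear Regression) would dominate a mean.

```python
import pandas as pd

# First three rows of the results table, copied for illustration
sample_results = pd.DataFrame({
    "Team": ["ATL", "BRK", "BOS"],
    "LinearRegression": [10.3, 466000000000.0, 10.5],
    "RandomForest": [12.22, 13.18, 12.36],
    "GradientBoost": [15.06, 14.39, 14.36],
    "NeuralNetwork": [10.27, 12.33, 10.74],
}).set_index("Team")

# Median RMSE per model (robust to the single extreme value)
median_rmse = sample_results.median()

# Model with the lowest RMSE for each team
best_model = sample_results.idxmin(axis=1)
```

The same two calls could be applied to the full reg_results dataframe after setting "Team" as the index and dropping the "# of Games" column.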


## Discussion

Based on the per-team RMSE results, the Linear Regression and Neural Network models performed best, achieving RMSE values of ~10 for most teams. However, there are noticeable differences between the two models:

- The Linear Regression model largely failed to predict the points for teams with relatively few games in their datasets (e.g. BRK, NOP, etc.), where its RMSE became extremely large (roughly 4.66e11 for BRK). This is most likely because the model overfit to the training data and, as a result, couldn't effectively predict the testing data.
- The Neural Network model was more consistent in that it achieved RMSE values close to 10 for all 30 NBA teams, with BRK the highest of the rows shown above at 12.33. Compared to the Linear Regression model, this implies that the Neural Network model didn't have as many issues, if any, with overfitting to the training data.
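
One way to probe the overfitting hypothesis above is to compare the RMSE on the training data with the RMSE on the testing data. The following is only a sketch on synthetic data with more features than training rows. It does not reproduce the BRK or NOP results, and whether the same mechanism explains them is not verified here.

```python
from math import sqrt

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data, for illustration only: 40 features but just 10 training rows
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 40))
y = rng.normal(loc=100, scale=10, size=20)
X_train, X_test, y_train, y_test = X[:10], X[10:], y[:10], y[10:]

model = LinearRegression().fit(X_train, y_train)

# A near-zero training error next to a much larger testing error suggests overfitting
train_rmse = sqrt(mean_squared_error(y_train, model.predict(X_train)))
test_rmse = sqrt(mean_squared_error(y_test, model.predict(X_test)))
```

In the per-team loop, the same comparison could be made by also predicting on features_train and scoring against points_train.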

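For the other two models, Random Forest and Gradient Boosting, the performance was similar to the Neural Network model in that they achieved consistent RMSE values for all 30 teams, at ~12-13 and ~14-16, respectively. Both models were created from classifier classes, which treat point totals as unordered class labels, and this may contribute to their higher RMSE. Beyond that, their performance is most likely due to the following reasons: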

- For the Random Forest model, the limited max_depth most likely limited the performance of the model. If given unlimited depth, the model may have had the best performance. However, given the increase in execution time that would likely occur as a result, the increase in performance may not ultimately be worth it.
- For the Gradient Boosting model, even though it's similar to the Random Forest model, it most likely encountered local minima, since the model successively builds upon previous trees. As such, when it converges to a local minimum, it most likely doesn't have a way to escape it.

In the end, based on the results for all 30 NBA teams, the Neural Network model gives the best and most consistent performance, while the Linear Regression model can also be effective for teams with enough sample games. The other models, Random Forest and Gradient Boosting, are also consistent but less accurate in this specific instance.