# Predicting NBA Regular Season game results

In this example, we are going to use the results and statistics of past NBA games from 1989 to 2016 to predict the outcome of the remaining games in the 2015-2016 NBA regular season.
First, we need a dataset of NBA games to train a prediction model. All the data we need can be found at <a href="http://www.basketball-reference.com/" target="_blank">basketball-reference.com</a>. It has to be processed into a dataset with features, i.e. the values that will help us predict the outcome of the games, and targets, i.e. the values to be predicted. If you are interested in this preprocessing stage, take a look at this <a href="https://github.com/lucabaroffio/NBA-data-analysis" target="_blank">GitHub repository</a>, where you can find all the scripts and the data.

In our case, we will use team performance indices, such as each team's winning percentage and the number of points it scores and allows per game, to predict the score of a game, that is, the number of points scored by each team.
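As a rough illustration of what such indices look like, the sketch below derives them from a short list of past results. It is not the preprocessing code from the repository: the helper `team_indices` and the sample scores are hypothetical, the keys only mirror the suffixes of the column names used later (reading `PSPG`/`PAPG` as points scored/allowed per game is an assumption), and the code assumes Python 3 division.

```python
def team_indices(games):
    """Summarise a team's past games into simple performance indices.

    `games` is a chronological list of (points_scored, points_allowed) tuples.
    Hypothetical helper, for illustration only.
    """
    if not games:
        raise ValueError("at least one past game is required")
    n = len(games)
    # 1 for a win, 0 for a loss
    wins = [1 if scored > allowed else 0 for scored, allowed in games]
    last5 = wins[-5:]  # most recent five games (fewer if the season just started)
    return {
        "W%": sum(wins) / n,                              # winning percentage
        "PSPG": sum(scored for scored, _ in games) / n,   # points scored per game
        "PAPG": sum(allowed for _, allowed in games) / n, # points allowed per game
        "last5_W%": sum(last5) / len(last5),              # recent form
    }


# made-up results, oldest first
past_games = [(102, 98), (95, 101), (110, 104), (99, 107), (105, 100), (97, 96)]
indices = team_indices(past_games)
print(indices)
```

Each game in the dataset then carries one such set of indices for the home team and one for the away team.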

First of all, we import some libraries that we will need later:

In [1]:
# import libraries
import os
import csv
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn import linear_model
import warnings
warnings.filterwarnings('ignore')

## Predict results

We will use a simple linear regression model to predict the number of points scored by each of the two teams in a game, and hence the result of the game. First, we load the dataset and define two matrices: one containing all the features and the other containing the values to be predicted, i.e. the score of the game.

To evaluate the accuracy of our model, we use 10-fold cross-validation. This means that we repeat the evaluation 10 times, each time training the model on 90% of the data and measuring its performance on the remaining 10%.

In [2]:
# read the csv file into a pandas DataFrame
dataset = pd.read_csv(os.path.join('data', 'dataset.csv.gz'))
dataset['away_team_W%'] = dataset.apply(lambda row: float(row['away_team_W'])/float(row['away_team_GP']), axis=1)
dataset['home_team_W%'] = dataset.apply(lambda row: float(row['home_team_W'])/float(row['home_team_GP']), axis=1)

# define the features (.values replaces DataFrame.as_matrix, removed in pandas 1.0)
X = dataset[
    [
        'away_team_PSPG',
        'away_team_PAPG',
        'home_team_PSPG',
        'home_team_PAPG',
        'home_team_W%',
        'away_team_W%',
        'away_team_last5_W%',
        'home_team_last5_W%'
    ]
].values

# define the variables to be predicted (home score first, away score second)
Y = dataset[['home_team_points', 'away_team_points']].values

# define a 10-fold cross-validation procedure
# (sklearn.cross_validation was removed in scikit-learn 0.20; KFold now lives in model_selection)
k_fold = KFold(n_splits=10)

# declare a regression model
clf = linear_model.LinearRegression()

accuracy = []

# for each fold...
for train_indices, test_indices in k_fold.split(X):
    
    # get training and test samples
    train_X = X[train_indices]
    train_Y = Y[train_indices]
    test_X = X[test_indices]
    test_Y = Y[test_indices]
    
    # fit the model
    model = clf.fit(train_X, train_Y)
    
    # predict the score
    predictions = model.predict(test_X)
    
    # infer the winner
    predictions_winner = [(0 if x[0]>x[1] else 1) for x in predictions]
    
    # get the real result of the game
    gt_winner = [(0 if x[0]>x[1] else 1) for x in test_Y]
    
    # check whether the prediction matches the ground truth and compute the accuracy
    prediction_matches = [predictions_winner[ind] == gt_winner[ind] for ind, _ in enumerate(predictions_winner)]
    accuracy.append(float(sum(prediction_matches))/float(len(prediction_matches)))

# overall accuracy is the mean accuracy for each fold
print('Accuracy: %.4f +/- %.4f' % (np.mean(accuracy), np.std(accuracy)))
    

Accuracy: 0.6818 +/- 0.0196


This means that the model predicts the correct winner in about 68% of the games: the mean accuracy over the 10 folds is 0.6818, with a standard deviation of 0.0196.
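To make the fold structure behind this estimate concrete, here is a minimal pure-Python sketch of contiguous, unshuffled k-fold splitting. The generator `k_fold_indices` is a hypothetical stand-in written for illustration, not the scikit-learn implementation, and the 20-sample size is arbitrary.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # the first `extra` folds take one additional sample
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


folds = list(k_fold_indices(n_samples=20, n_folds=10))
for train, test in folds[:3]:
    print("train size: %d, test indices: %s" % (len(train), test))
```

Every sample lands in exactly one test fold, so each game is predicted once by a model that never saw it during training.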

## Predict results for future games
Now we train the model on all the NBA games in the dataset, from 1989 to 2016, and use it to predict the outcome of the remaining games in the 2015-2016 regular season. To do so, we load another dataset containing the features, i.e. the team stats, for the future games.
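When reading predicted scores, keep in mind that the targets are ordered home team first, away team second, and that two close predictions can display as the same integer. The sketch below shows one way to turn a predicted score pair into a printable line and a winner. The helper `describe_prediction`, the team names, and the numbers are made up for illustration.

```python
def describe_prediction(away_name, home_name, predicted):
    """Format one prediction; `predicted` is (home_points, away_points)."""
    home_points, away_points = predicted
    # Compare the raw values: rounding first could turn a narrow win into a tie.
    # An exact tie goes to the away team, matching the rule used in the evaluation loop.
    winner = home_name if home_points > away_points else away_name
    line = "%s @ %s: %d - %d" % (
        away_name, home_name, round(away_points), round(home_points)
    )
    return line, winner


# made-up prediction: the home team is ahead by 0.3 points
line, winner = describe_prediction("Away Team", "Home Team", (100.4, 100.1))
print(line)
print(winner)
```

Here both scores display as 100, yet the unrounded values still identify the home team as the predicted winner.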

In [3]:
# declare a regression model
clf = linear_model.LinearRegression()
    
# fit the model
model = clf.fit(X, Y)

# read the dataset with the future games
future_dataset = pd.read_csv(os.path.join('data', 'future_dataset.csv.gz'))
future_dataset['away_team_W%'] = future_dataset.apply(
    lambda row: float(row['away_team_W'])/float(row['away_team_GP']), axis=1
)
future_dataset['home_team_W%'] = future_dataset.apply(
    lambda row: float(row['home_team_W'])/float(row['home_team_GP']), axis=1
)

# define the features
test_X = future_dataset[
    [
        'away_team_PSPG',
        'away_team_PAPG',
        'home_team_PSPG',
        'home_team_PAPG',
        'home_team_W%',
        'away_team_W%',
        'away_team_last5_W%',
        'home_team_last5_W%'
    ]
].values  # .values replaces DataFrame.as_matrix, removed in pandas 1.0
    
# predict the score
predictions = model.predict(test_X)
    
# print the scores
for index, row in future_dataset.iterrows():
    print("%s @ %s: \n%s - %s\n" % (
        row['away_team_name'],
        row["home_team_name"],
        int(predictions[index][1]), # away team predicted score
        int(predictions[index][0])  # home team predicted score
    ))

Boston Celtics @ Atlanta Hawks: 
97 - 105

Cleveland Cavaliers @ Chicago Bulls: 
99 - 102

Golden State Warriors @ Memphis Grizzlies: 
101 - 100

Phoenix Suns @ New Orleans Pelicans: 
98 - 104

Minnesota Timberwolves @ Portland Trail Blazers: 
96 - 109

Oklahoma City Thunder @ Sacramento Kings: 
104 - 103

Utah Jazz @ Denver Nuggets: 
99 - 100

Los Angeles Lakers @ Houston Rockets: 
97 - 109

Brooklyn Nets @ Indiana Pacers: 
96 - 102

Dallas Mavericks @ Los Angeles Clippers: 
101 - 108

Orlando Magic @ Miami Heat: 
95 - 100

Toronto Raptors @ New York Knicks: 
103 - 97

Milwaukee Bucks @ Philadelphia 76ers: 
99 - 94

Golden State Warriors @ San Antonio Spurs: 
102 - 104

Charlotte Hornets @ Washington Wizards: 
93 - 100

Charlotte Hornets @ Boston Celtics: 
96 - 102

Washington Wizards @ Brooklyn Nets: 
98 - 99

Atlanta Hawks @ Cleveland Cavaliers: 
98 - 102

Houston Rockets @ Minnesota Timberwolves: 
105 - 99

Chicago Bulls @ New Orleans Pelicans: 
98 - 100

Los Angeles Lakers @ Oklah