# Predicting NBA

This project predicts the outcome of NBA games.
We will use several machine learning algorithms for prediction and compare their results.


Dataset: https://www.kaggle.com/nathanlauga/nba-games

This dataset contains data for every NBA game played from the 2004 season onward. For each test season, we train on
all earlier seasons (starting with 2005) and test on that season; the last season we test on is 2018/2019. We do not use
later seasons (for now) because Covid-19 hit after 2018/2019: games were postponed, players were in quarantine, there were
no fans, and the play-offs were played in the bubble. These real-life events add noise to the data.
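
The season-based split described above can be sketched on a toy table. This is only an illustration: `split_by_season` and the toy rows are hypothetical, and only the column names `SEASON` and `GAME_DATE_EST` come from the dataset.

```python
import pandas as pd


def split_by_season(games, test_season, first_season=2005):
    """Hypothetical helper: train on seasons before `test_season`, test on it."""
    in_train = (games["SEASON"] >= first_season) & (games["SEASON"] < test_season)
    train = games[in_train].sort_values("GAME_DATE_EST")
    test = games[games["SEASON"] == test_season].sort_values("GAME_DATE_EST")
    return train, test


# Toy data (made up), one row per game.
toy_games = pd.DataFrame({
    "SEASON": [2004, 2005, 2006, 2006, 2007],
    "GAME_DATE_EST": pd.to_datetime(
        ["2004-11-02", "2005-11-01", "2006-12-05", "2006-10-31", "2007-10-30"]
    ),
})

toy_train, toy_test = split_by_season(toy_games, test_season=2007)
```

Sorting by date matters because the features built later (ELO rating, win streak, running averages) depend on the order in which games were played.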

Article with the basic idea: Albrecht Zimmermann, Sruthi Moorthy, and Zifan Shi. “Predicting college basketball match outcomes using machine learning techniques: some results and lessons learned”.

Inspired by post: https://towardsdatascience.com/predicting-the-outcome-of-nba-games-with-machine-learning-a810bb768f20


First we need a function for loading the data. It is called several times during the implementation.

In [8]:
import numpy as np
import pandas as pd
from helper_functions import calculate_elo

# Module-level state shared by the functions below; filled in by load_data().
teams_data = None
train_set = None
test_set = None

def load_data(season):

    global teams_data
    global train_set
    global test_set

    filename = "Data/games.csv"
    data = pd.read_csv(filename, parse_dates=["GAME_DATE_EST"])

    train_set = data[(data["SEASON"]) < season]
    test_set = data[(data["SEASON"]) == season]

    train_set = train_set[2004 < (train_set["SEASON"])]

    teams_data = pd.read_csv("Data/teams.csv")

    train_set = train_set.sort_values("GAME_DATE_EST", inplace=False, ascending=True)
    test_set = test_set.sort_values("GAME_DATE_EST", inplace=False, ascending=True)

    # Every team starts with an ELO rating of 1500 and zeroed running statistics.
    # Ratings and averages are stored as floats because they are later updated
    # with fractional values.
    teams_data["ELO_RATING"] = 1500.0
    teams_data["GAMES_PLAYED"] = 0
    teams_data["AVG_POINTS_SCD"] = 0.0
    teams_data["AVG_POINTS_REC"] = 0.0
    teams_data["WIN_STREAK"] = 0

## Establishing baseline
First we need a baseline for the prediction accuracy. One option is to pick the winner at random from the two teams;
another is to predict that the home team wins every time. We can try both and keep the better one.
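
Both baselines can also be computed without explicit loops. The following is a minimal sketch on made-up data, assuming only that the label column is called `HOME_TEAM_WINS` as in the dataset; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy outcomes (made up): 1 means the home team won.
toy_results = pd.DataFrame({"HOME_TEAM_WINS": [1, 0, 1, 1, 0]})
actual = toy_results["HOME_TEAM_WINS"].to_numpy()

# "Home team always wins": accuracy equals the share of home wins.
home_accuracy = actual.mean() * 100

# "Random": flip a fair coin for every game and compare with the outcome.
rng = np.random.default_rng(seed=0)
random_picks = rng.integers(0, 2, size=len(actual))
random_accuracy = (random_picks == actual).mean() * 100
```

On a real season the random baseline should land close to 50%, while the home-win baseline reflects the home-court advantage.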


### Random

In [9]:
def random_score():
    from random import random

    test_out_random = list()

    for n,i in test_set.iterrows():
        if random() < 0.5:
            test_out_random.append(0)
        else:
            test_out_random.append(1)

    random_eval = 0
    for i in range(0, len(test_out_random)):
        if test_out_random[i] == test_set.iloc[i]["HOME_TEAM_WINS"]:
            random_eval += 1

    return random_eval * 100 / len(test_out_random)

### Home team always wins

In [10]:
def home_score():
    home_team_eval = 0
    for i in range(0, len(test_set.index)):
        if test_set.iloc[i]["HOME_TEAM_WINS"] == 1:
            home_team_eval += 1

    return home_team_eval * 100 / len(test_set.index)

Our baseline for the model's accuracy will be the accuracy of always predicting a home win (almost 60%).



## ELO Rating

Until now, we were only looking at the win/loss ratio. ELO rating improves on the win/loss stat because it takes the
history of each team's previous matches into account, and not as a simple 0/1. It is a score: every team starts at
1500 points and then gains or loses points with each win or loss. The size of the gain or deduction depends on
several parameters (the location of the match and others). This metric is a much better representation of a team's
strength than the simple win/loss metric.
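
The project's own `calculate_elo` lives in `helper_functions` and is not shown here. To make the idea concrete, here is a rough sketch of one common Elo variant with a home-court bonus and a margin-of-victory multiplier. Everything in it is an assumption: the function names, the constants `k=20` and `home_advantage=100`, the multiplier formula, and the reading of the last argument as an "is away" flag. Only the 1500 starting value and the `0.75`/`1505` season carry-over appear in the notebook's code.

```python
def expected_score(rating, opponent_rating):
    """Standard Elo win expectancy of `rating` against `opponent_rating`."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def elo_update_sketch(rating, opponent_rating, won, point_diff, is_away,
                      k=20.0, home_advantage=100.0):
    """Hypothetical stand-in, NOT the project's calculate_elo."""
    # The home side gets a rating bonus, the away side the matching penalty.
    adjusted = rating - home_advantage if is_away else rating + home_advantage
    expected = expected_score(adjusted, opponent_rating)
    # Rating gap seen from the winner's side; big favourites earn less
    # for a blowout than underdogs do.
    gap = adjusted - opponent_rating
    winner_gap = gap if won else -gap
    margin_multiplier = (point_diff + 3) ** 0.8 / (7.5 + 0.006 * winner_gap)
    return rating + k * margin_multiplier * (won - expected)


def regress_to_mean(rating):
    """Season carry-over, with the constants used in the notebook's code."""
    return rating * 0.75 + 0.25 * 1505


# Two equally rated teams; the home team wins by 10 points.
new_home = elo_update_sketch(1500, 1500, won=1, point_diff=10, is_away=0)
new_away = elo_update_sketch(1500, 1500, won=0, point_diff=10, is_away=1)
```

In this sketch the update is zero-sum: whatever the winner gains, the loser gives up.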

In [11]:
def base_elo_score():

    score_init = [1500 for i in range(0, len(teams_data.index))]
    teams_data["ELO_RATING"] = score_init

    for i in range(0, len(train_set.index)):

        if i > 0 and train_set.iloc[i]["SEASON"] != train_set.iloc[i - 1]["SEASON"]:
            for j in range(0, len(teams_data.index)):
                teams_data.at[j, "ELO_RATING"] = teams_data.at[j, "ELO_RATING"] * 0.75 + (0.25*1505)

        home_id = train_set.iloc[i]["TEAM_ID_home"]
        away_id = train_set.iloc[i]["TEAM_ID_away"]
        win_lose = train_set.iloc[i]["HOME_TEAM_WINS"]
        home_pts = train_set.iloc[i]["PTS_home"]
        away_pts = train_set.iloc[i]["PTS_away"]
        home_team_index = teams_data.index[teams_data["TEAM_ID"] == home_id].tolist()[0]
        away_team_index = teams_data.index[teams_data["TEAM_ID"] == away_id].tolist()[0]
        home_elo = teams_data.iloc[home_team_index]["ELO_RATING"]
        away_elo = teams_data.iloc[away_team_index]["ELO_RATING"]
        teams_data.at[home_team_index, "ELO_RATING"] = calculate_elo(home_elo, away_elo, win_lose, abs(home_pts-away_pts), 0)
        teams_data.at[away_team_index, "ELO_RATING"] = calculate_elo(away_elo, home_elo, 1-win_lose, abs(home_pts-away_pts), 1)

    # Regress the ratings towards the mean once more before the test season starts.
    for j in range(0, len(teams_data.index)):
        teams_data.at[j, "ELO_RATING"] = teams_data.at[j, "ELO_RATING"] * 0.75 + (0.25*1505)

    prediction_count = 0
    for i in range(0, len(test_set.index)):
        home_id = test_set.iloc[i]["TEAM_ID_home"]
        away_id = test_set.iloc[i]["TEAM_ID_away"]
        home_pts = test_set.iloc[i]["PTS_home"]
        away_pts = test_set.iloc[i]["PTS_away"]
        home_team_index = teams_data.index[teams_data["TEAM_ID"] == home_id].tolist()[0]
        away_team_index = teams_data.index[teams_data["TEAM_ID"] == away_id].tolist()[0]
        home_elo = teams_data.iloc[home_team_index]["ELO_RATING"]
        away_elo = teams_data.iloc[away_team_index]["ELO_RATING"]
        win_lose = 0
        if home_elo >= away_elo:
            win_lose = 1
        teams_data.at[home_team_index, "ELO_RATING"] = calculate_elo(home_elo, away_elo, win_lose, abs(home_pts-away_pts), 0)
        teams_data.at[away_team_index, "ELO_RATING"] = calculate_elo(away_elo, home_elo, 1-win_lose, abs(home_pts-away_pts), 1)
        if win_lose == test_set.iloc[i]["HOME_TEAM_WINS"]:
            prediction_count += 1

    return prediction_count * 100 / len(test_set.index)

As we can see, ELO rating improves the result, and we break the 60% accuracy barrier. More interestingly, we get more than 60% even though the predictions are not independent: the output of one prediction feeds into the next through the ELO rating. With this in mind, ELO rating is a very good and interesting attribute. Unfortunately, this result is not admissible in our case, because the ELO update uses the points scored in the game being predicted, which we would not know in advance. Instead, we will use each team's average points from the previous season. This is a rough estimate, which could be improved in the future.

## Preparing data for training

We have to prepare the data and select features for training.
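
Two of the features, the running point averages and the win streak, are updated game by game. The sketch below isolates those two update rules as small functions so they are easier to follow; the function names are illustrative and do not exist in the project, but the arithmetic mirrors the code in the next cell.

```python
def update_running_mean(mean, games_played, new_value):
    """Fold one more observation into an average over `games_played` games."""
    return (mean * games_played + new_value) / (games_played + 1)


def update_streak(streak, won):
    """Positive values count consecutive wins, negative values consecutive losses."""
    if won:
        return streak + 1 if streak > 0 else 1
    return streak - 1 if streak < 0 else -1


# A team averaging 100 points over 2 games scores 130 in its third game.
new_average = update_running_mean(100.0, 2, 130)

# Win, win, loss, loss, loss, win.
streak = 0
history = []
for won in [1, 1, 0, 0, 0, 1]:
    streak = update_streak(streak, won)
    history.append(streak)
```

A streak never passes through zero: a win after a losing run resets it straight to 1, and a loss after a winning run resets it to -1.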

In [12]:
def prepare_train_data(season, num_test):

    load_data(season)

    # Column order must match the order of values in `row` below.
    train_set_data = pd.DataFrame(columns=["ELO_RATING_HOME", "ELO_RATING_AWAY", "WIN_STREAK_HOME", "WIN_STREAK_AWAY", "AVG_POINTS_SCD_HOME", "AVG_POINTS_SCD_AWAY", "AVG_POINTS_REC_HOME", "AVG_POINTS_REC_AWAY", "HOME_WIN"])

    for i in range(0, len(train_set.index)):

        if i > 0 and train_set.iloc[i]["SEASON"] != train_set.iloc[i - 1]["SEASON"]:
            for j in range(0, len(teams_data.index)):
                teams_data.at[j, "ELO_RATING"] = teams_data.at[j, "ELO_RATING"] * 0.75 + (0.25*1505)
                teams_data.at[j, "GAMES_PLAYED"] = 0
                teams_data.at[j, "AVG_POINTS_SCD"] = 0
                teams_data.at[j, "AVG_POINTS_REC"] = 0
                teams_data.at[j, "WIN_STREAK"] = 0

        home_id = train_set.iloc[i]["TEAM_ID_home"]
        away_id = train_set.iloc[i]["TEAM_ID_away"]
        win_lose = train_set.iloc[i]["HOME_TEAM_WINS"]
        home_pts = train_set.iloc[i]["PTS_home"]
        away_pts = train_set.iloc[i]["PTS_away"]

        home_team_index = teams_data.index[teams_data["TEAM_ID"] == home_id].tolist()[0]
        away_team_index = teams_data.index[teams_data["TEAM_ID"] == away_id].tolist()[0]

        home_elo = teams_data.iloc[home_team_index]["ELO_RATING"]
        away_elo = teams_data.iloc[away_team_index]["ELO_RATING"]

        win_streak_home = teams_data.iloc[home_team_index]["WIN_STREAK"]
        win_streak_away = teams_data.iloc[away_team_index]["WIN_STREAK"]

        avg_points_scd_home = teams_data.iloc[home_team_index]["AVG_POINTS_SCD"]
        avg_points_scd_away = teams_data.iloc[away_team_index]["AVG_POINTS_SCD"]

        avg_points_rec_home = teams_data.iloc[home_team_index]["AVG_POINTS_REC"]
        avg_points_rec_away = teams_data.iloc[away_team_index]["AVG_POINTS_REC"]

        if num_test == 2 and win_streak_home < 0:
            win_streak_home = 0
        if num_test == 2 and win_streak_away < 0:
            win_streak_away = 0

        row = [home_elo, away_elo, win_streak_home,win_streak_away,avg_points_scd_home,avg_points_scd_away,avg_points_rec_home,avg_points_rec_away, win_lose]

        train_set_data.loc[len(train_set_data)] = row

        home_elo_out = calculate_elo(home_elo, away_elo, win_lose, abs(avg_points_scd_home-avg_points_scd_away), 0)
        away_elo_out = calculate_elo(away_elo, home_elo, 1-win_lose, abs(avg_points_scd_home-avg_points_scd_away), 1)

        games_played_home = teams_data.iloc[home_team_index]["GAMES_PLAYED"]
        games_played_away = teams_data.iloc[away_team_index]["GAMES_PLAYED"]

        teams_data.at[home_team_index, "AVG_POINTS_SCD"] = ((games_played_home * avg_points_scd_home) + home_pts) / (games_played_home + 1)
        teams_data.at[away_team_index, "AVG_POINTS_SCD"] = ((games_played_away * avg_points_scd_away) + away_pts) / (games_played_away + 1)

        teams_data.at[home_team_index, "AVG_POINTS_REC"] = ((games_played_home * avg_points_rec_home) + away_pts) / (games_played_home + 1)
        teams_data.at[away_team_index, "AVG_POINTS_REC"] = ((games_played_away * avg_points_rec_away) + home_pts) / (games_played_away + 1)

        if win_lose == 1:
            if win_streak_home > 0:
                teams_data.at[home_team_index, "WIN_STREAK"] = win_streak_home + 1
            else:
                teams_data.at[home_team_index, "WIN_STREAK"] = 1
            if win_streak_away < 0:
                teams_data.at[away_team_index, "WIN_STREAK"] = win_streak_away - 1
            else:
                teams_data.at[away_team_index, "WIN_STREAK"] = -1
        else:
            if win_streak_home < 0:
                teams_data.at[home_team_index, "WIN_STREAK"] = win_streak_home - 1
            else:
                teams_data.at[home_team_index, "WIN_STREAK"] = -1
            if win_streak_away > 0:
                teams_data.at[away_team_index, "WIN_STREAK"] = win_streak_away + 1
            else:
                teams_data.at[away_team_index, "WIN_STREAK"] = 1

        teams_data.at[home_team_index, "GAMES_PLAYED"] = games_played_home + 1
        teams_data.at[away_team_index, "GAMES_PLAYED"] = games_played_away + 1

        teams_data.at[home_team_index, "ELO_RATING"] = home_elo_out
        teams_data.at[away_team_index, "ELO_RATING"] = away_elo_out

    return train_set_data

Now we are ready to train and predict.

## Training the model and predictions

We can now train the model on our training data.

When the training is done, we can predict.

For prediction, we are using 4 algorithms:
- Random Forest (num_test = 1)
- Multinomial Naive Bayes (num_test = 2)
- Logistic Regression (num_test = 3)
- K-Nearest Neighbours with k=3 (num_test = 4)
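
To show how the four models share one `fit`/`predict` interface, here is a small sketch on synthetic data. The data, the train/test cut, and the `random_state` and `n_estimators` values are assumptions made for the example, not the project's settings. The features are kept non-negative because `MultinomialNB` rejects negative inputs, which is presumably why negative win streaks are clamped to 0 when `num_test == 2`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 8 features: the label is 1 when feature 0 beats feature 1.
rng = np.random.default_rng(seed=0)
X = rng.uniform(0, 10, size=(200, 8))
y = (X[:, 0] > X[:, 1]).astype(int)
X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]

models = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "K-nearest neighbours": KNeighborsClassifier(n_neighbors=3),
}

accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # The regressor returns a value in [0, 1]; the classifiers return 0 or 1.
    # Thresholding at 0.5 turns both into a win/loss prediction.
    predicted = (model.predict(X_test) > 0.5).astype(int)
    accuracy[name] = (predicted == y_test).mean() * 100
```

Thresholding at 0.5 lets the random forest regressor be scored the same way as the three classifiers.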

In [17]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

def predict_season(num_test, season):

    train_set_data = prepare_train_data(season, num_test)

    X_train = train_set_data.iloc[:, 0:8].values
    y_train = train_set_data.iloc[:, 8].values
    y_train=y_train.astype('int')

    if num_test == 1:
        regressor = RandomForestRegressor()
    if num_test == 2:
        regressor = MultinomialNB()
    if num_test == 3:
        regressor = make_pipeline(StandardScaler(), LogisticRegression())
    if num_test == 4:
        regressor = KNeighborsClassifier(n_neighbors=3)

    regressor.fit(X_train, y_train)

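    # Start of the test season: regress ratings towards the mean and reset the
    # per-season counters. The point averages are kept from the previous season.
    for j in range(0, len(teams_data.index)):
        teams_data.at[j, "ELO_RATING"] = teams_data.at[j, "ELO_RATING"] * 0.75 + (0.25*1505)
        teams_data.at[j, "GAMES_PLAYED"] = 0
        teams_data.at[j, "WIN_STREAK"] = 0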

    prediction_count = 0
    for i in range(0, len(test_set.index)):

        home_id = test_set.iloc[i]["TEAM_ID_home"]
        away_id = test_set.iloc[i]["TEAM_ID_away"]

        home_team_index = teams_data.index[teams_data["TEAM_ID"] == home_id].tolist()[0]
        away_team_index = teams_data.index[teams_data["TEAM_ID"] == away_id].tolist()[0]

        home_elo = teams_data.iloc[home_team_index]["ELO_RATING"]
        away_elo = teams_data.iloc[away_team_index]["ELO_RATING"]

        win_streak_home = teams_data.iloc[home_team_index]["WIN_STREAK"]
        win_streak_away = teams_data.iloc[away_team_index]["WIN_STREAK"]

        if num_test == 2 and win_streak_home < 0:
            win_streak_home = 0
        if num_test == 2 and win_streak_away < 0:
            win_streak_away = 0

        avg_points_scd_home = teams_data.iloc[home_team_index]["AVG_POINTS_SCD"]
        avg_points_scd_away = teams_data.iloc[away_team_index]["AVG_POINTS_SCD"]

        avg_points_rec_home = teams_data.iloc[home_team_index]["AVG_POINTS_REC"]
        avg_points_rec_away = teams_data.iloc[away_team_index]["AVG_POINTS_REC"]

        row = [home_elo, away_elo, win_streak_home,win_streak_away,avg_points_scd_home,avg_points_scd_away,avg_points_rec_home,avg_points_rec_away]

        # predict() expects a 2D array with one row per sample; take the single result.
        x = np.reshape(row, (1, -1))
        y_pred = regressor.predict(x)[0]

        win_lose = 0
        if y_pred > 0.5:
            win_lose = 1

        if win_lose == test_set.iloc[i]["HOME_TEAM_WINS"]:
            prediction_count += 1

        home_elo_out = calculate_elo(home_elo, away_elo, win_lose, abs(avg_points_scd_home-avg_points_scd_away), 0)
        away_elo_out = calculate_elo(away_elo, home_elo, 1-win_lose, abs(avg_points_scd_home-avg_points_scd_away), 1)

        games_played_home = teams_data.iloc[home_team_index]["GAMES_PLAYED"]
        games_played_away = teams_data.iloc[away_team_index]["GAMES_PLAYED"]

        if win_lose == 1:
            if win_streak_home > 0:
                teams_data.at[home_team_index, "WIN_STREAK"] = win_streak_home + 1
            else:
                teams_data.at[home_team_index, "WIN_STREAK"] = 1
            if win_streak_away < 0:
                teams_data.at[away_team_index, "WIN_STREAK"] = win_streak_away - 1
            else:
                teams_data.at[away_team_index, "WIN_STREAK"] = -1
        else:
            if win_streak_home < 0:
                teams_data.at[home_team_index, "WIN_STREAK"] = win_streak_home - 1
            else:
                teams_data.at[home_team_index, "WIN_STREAK"] = -1
            if win_streak_away > 0:
                teams_data.at[away_team_index, "WIN_STREAK"] = win_streak_away + 1
            else:
                teams_data.at[away_team_index, "WIN_STREAK"] = 1

        teams_data.at[home_team_index, "GAMES_PLAYED"] = games_played_home + 1
        teams_data.at[away_team_index, "GAMES_PLAYED"] = games_played_away + 1

        teams_data.at[home_team_index, "ELO_RATING"] = home_elo_out
        teams_data.at[away_team_index, "ELO_RATING"] = away_elo_out

    return prediction_count * 100 / len(test_set.index)

This is the end of the algorithm.

## Running the code

Here we just run the code and save the results.

In [14]:
import csv

header = ['Season', 'Random', 'Home team wins', 'Higher ELO', 'Random Forest', 'Naive Bayes', 'Logistic Regression', 'K-nearest neighbours']

# newline='' stops the csv module from writing blank lines between rows on Windows.
with open('results.csv', 'w', encoding='UTF8', newline='') as f:
    writer = csv.writer(f)

    writer.writerow(header)

    rf = 0
    nb = 0
    lr = 0
    kn = 0

    rs = 0
    hw = 0
    es = 0

    counter = 0

    for i in range(2006, 2019):

        load_data(i)
        cur_rs = random_score()
        cur_hw = home_score()
        cur_es = base_elo_score()

        cur_rf = predict_season(1, i)
        cur_nb = predict_season(2, i)
        cur_lr = predict_season(3, i)
        cur_kn = predict_season(4, i)

        rs += cur_rs
        hw += cur_hw
        es += cur_es

        rf += cur_rf
        nb += cur_nb
        lr += cur_lr
        kn += cur_kn

        counter += 1

        row = [i, cur_rs, cur_hw, cur_es, cur_rf, cur_nb, cur_lr, cur_kn]

        writer.writerow(row)

    print("===============================================================")
    print("\tRandom: " + str(rs/counter))
    print("\tHome team wins: " + str(hw/counter))
    print("\tHigher ELO: " + str(es/counter))
    print("\tRandom Forest: " + str(rf/counter))
    print("\tNaive Bayes: " + str(nb/counter))
    print("\tLogistic regression: " + str(lr/counter))
    print("\tK-nearest neighbour: " + str(kn/counter))

	Random: 49.95817550794899
	Home team wins: 59.38928298014298
	Higher ELO: 60.775391600921125
	Random Forest: 60.46250933564464
	Naive Bayes: 60.51132142040909
	Logistic regression: 62.20291824572364
	K-nearest neighbour: 58.5196233538248
