# Predicting NBA

This project predicts the outcome of NBA games.
We will use several machine learning models for prediction and compare their results.


Dataset: https://www.kaggle.com/nathanlauga/nba-games

This dataset contains data on every NBA game played from the 2014 season to now. For our training set, we can use seasons
2014/2015-2017/2018 and test our model on the 2018/2019 season. We will not use later seasons (for now) because
Covid-19 hit after that: games were postponed, players were in quarantine, arenas had no fans, and the play-offs were
played in the bubble. These real-life events add noise to the data.

Original post: https://towardsdatascience.com/predicting-the-outcome-of-nba-games-with-machine-learning-a810bb768f20


In [1]:
from datetime import datetime
import pandas as pd
from helper_functions import calculate_elo

filename = "Data/games.csv"
data = pd.read_csv(filename, parse_dates=["GAME_DATE_EST"])

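# Boundary dates: training covers games strictly between start_date and test_date,
# testing covers games strictly between test_date and end_date.
start_date = datetime.strptime("2013-06-30", "%Y-%m-%d")
test_date = datetime.strptime("2018-06-09", "%Y-%m-%d")
end_date = datetime.strptime("2019-06-14", "%Y-%m-%d")

# .copy() gives independent frames, so later sorting does not act on a slice of `data`.
train_set = data[(data["GAME_DATE_EST"] > start_date) & (data["GAME_DATE_EST"] < test_date)].copy()
test_set = data[(data["GAME_DATE_EST"] > test_date) & (data["GAME_DATE_EST"] < end_date)].copy()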

teams_data = pd.read_csv("Data/teams.csv")
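
The same train/test split can also be expressed through the `SEASON` column instead of calendar dates. The sketch below runs on a small made-up frame. It assumes that `SEASON` holds the starting year of each season (for example 2018 for 2018/2019), and the helper name `split_by_season` is hypothetical. Note that the date window above starts on 2013-06-30, so it also admits 2013/2014 games, while the text names 2014/2015 as the first training season. A season-based filter makes that choice explicit.

```python
import pandas as pd

# Toy stand-in for Data/games.csv; only the columns needed here, values are made up.
games = pd.DataFrame(
    {
        "GAME_DATE_EST": pd.to_datetime(
            ["2014-11-01", "2016-01-15", "2018-04-10", "2018-10-20", "2019-05-30", "2020-01-05"]
        ),
        # Assumption: SEASON is the starting year of the season.
        "SEASON": [2014, 2015, 2017, 2018, 2018, 2019],
        "HOME_TEAM_WINS": [1, 0, 1, 1, 0, 1],
    }
)


def split_by_season(df, first_train, last_train, test_season):
    """Return (train, test) copies selected by the SEASON column (bounds inclusive)."""
    train = df[df["SEASON"].between(first_train, last_train)].copy()
    test = df[df["SEASON"] == test_season].copy()
    return train, test


train_toy, test_toy = split_by_season(games, 2014, 2017, 2018)
print(len(train_toy), len(test_toy))  # 3 2
```

Unlike the date filter, a season-based filter also keeps any pre-season rows the file may tag with that season, so the two splits are not guaranteed to select identical games.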

## Establishing baseline
First we need a baseline for prediction accuracy. One option is to pick the winner of each game at random; another is
to predict that the home team wins every time. We can try both and keep the better one.


### Random

In [2]:
from random import random

# Flip a fair coin for every game in the test set (1 = home win, 0 = away win).
test_out_random = [0 if random() < 0.5 else 1 for _ in range(len(test_set.index))]

# Count the coin flips that match the actual result.
random_eval = 0
for i in range(0, len(test_out_random)):
    if test_out_random[i] == test_set.iloc[i]["HOME_TEAM_WINS"]:
        random_eval += 1

print("Accuracy of random winner: " + str(random_eval / len(test_out_random)))

Accuracy of random winner: 0.5029027576197388


### Home team always wins

In [3]:
home_team_eval = 0
for i in range(0, len(test_set.index)):
    if test_set.iloc[i]["HOME_TEAM_WINS"] == 1:
        home_team_eval += 1

print("Accuracy of home team always wins: " + str(home_team_eval / len(test_set.index)))

Accuracy of home team always wins: 0.5878084179970973


Our baseline for the model's accuracy will be the accuracy of always predicting a home-team win (about 58.8%).
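
Both baselines can also be computed without explicit loops. The following is a minimal sketch on a made-up frame. The function names and the fixed seed are assumptions added for illustration, and the toy values are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for test_set; 1 means the home team won. Values are made up.
toy_test = pd.DataFrame({"HOME_TEAM_WINS": [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]})


def home_always_accuracy(df):
    """Accuracy of predicting a home win for every game."""
    return float((df["HOME_TEAM_WINS"] == 1).mean())


def random_accuracy(df, seed=0):
    """Accuracy of one fair coin flip per game; the seed makes the run repeatable."""
    rng = np.random.default_rng(seed)
    guesses = rng.integers(0, 2, size=len(df))  # each guess is 0 or 1
    return float((guesses == df["HOME_TEAM_WINS"].to_numpy()).mean())


print(home_always_accuracy(toy_test))  # 0.6
print(random_accuracy(toy_test))
```

Seeding the generator is a design choice: it makes the random baseline reproducible between runs, which the unseeded loop above is not.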



## ELO Rating

Until now, we were only looking at the win/loss ratio. ELO rating improves on the plain win/loss stat because it takes
the history of each team's previous matches into account, and not just as a simple 0/1. It is a score: every team starts at
1500 points and then gains or loses points with each win or loss. The size of the gain or deduction depends
on several parameters (the location of the match and others). This metric is a much better representation of a team's
strength than the simple win/loss metric.

In [4]:
# Sort chronologically; reassigning avoids an in-place sort on a filtered slice.
train_set = train_set.sort_values("GAME_DATE_EST", ascending=True)
test_set = test_set.sort_values("GAME_DATE_EST", ascending=True)

# Every team starts at 1500 points (float, so fractional updates keep the column dtype).
teams_data["ELO_RATING"] = 1500.0

for i in range(0, len(train_set.index)):

    # At the start of a new season, pull every rating 25% of the way toward 1505.
    if i > 0 and train_set.iloc[i]["SEASON"] != train_set.iloc[i - 1]["SEASON"]:
        teams_data["ELO_RATING"] = teams_data["ELO_RATING"] * 0.75 + (0.25 * 1505)

    home_id = train_set.iloc[i]["TEAM_ID_home"]
    away_id = train_set.iloc[i]["TEAM_ID_away"]
    win_lose = train_set.iloc[i]["HOME_TEAM_WINS"]
    home_pts = train_set.iloc[i]["PTS_home"]
    away_pts = train_set.iloc[i]["PTS_away"]
    # Find each team's row label in teams_data, then read its pre-game rating by label.
    home_team_index = teams_data.index[teams_data["TEAM_ID"] == home_id].tolist()[0]
    away_team_index = teams_data.index[teams_data["TEAM_ID"] == away_id].tolist()[0]
    home_elo = teams_data.at[home_team_index, "ELO_RATING"]
    away_elo = teams_data.at[away_team_index, "ELO_RATING"]
    teams_data.at[home_team_index, "ELO_RATING"] = calculate_elo(home_elo, away_elo, win_lose, abs(home_pts-away_pts), 0)
    teams_data.at[away_team_index, "ELO_RATING"] = calculate_elo(away_elo, home_elo, 1-win_lose, abs(home_pts-away_pts), 1)

print(teams_data)

    LEAGUE_ID     TEAM_ID  MIN_YEAR  MAX_YEAR ABBREVIATION       NICKNAME  \
0           0  1610612737      1949      2019          ATL          Hawks   
1           0  1610612738      1946      2019          BOS        Celtics   
2           0  1610612740      2002      2019          NOP       Pelicans   
3           0  1610612741      1966      2019          CHI          Bulls   
4           0  1610612742      1980      2019          DAL      Mavericks   
5           0  1610612743      1976      2019          DEN        Nuggets   
6           0  1610612745      1967      2019          HOU        Rockets   
7           0  1610612746      1970      2019          LAC       Clippers   
8           0  1610612747      1948      2019          LAL         Lakers   
9           0  1610612748      1988      2019          MIA           Heat   
10          0  1610612749      1968      2019          MIL          Bucks   
11          0  1610612750      1989      2019          MIN   Timberwolves   