# NBA Game Predictor
This project predicts the outcomes of NBA games from historical game logs and team season stats.

It uses the following datasets:
* [Kaggle](https://www.kaggle.com/datasets/eoinamoore/historical-nba-data-and-player-box-scores) - used for NBA game logs.
* [Basketball Reference](https://www.basketball-reference.com/leagues/NBA_2025.html) - used for NBA stats.

To import the game data, download the Kaggle `Games.csv` file. It holds the game logs and is how we set up matchups.

To import NBA season data, go to the page for a given NBA season on Basketball Reference and export three tables: the Per Game stats, the Per Game Opponent stats, and the Advanced stats. For each table, open the share and export menu, click Get as CSV, and save the CSV under the matching file name:
* Per Game stats go into `pg.csv`.
* Per Game Opponent stats go into `pgo.csv`.
* Advanced stats go into `adv.csv`.

Put the three files in a folder named after the starting year of the NBA season. For example, the 2024-2025 season is stored in the `24` folder.
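Before running the notebook, it can help to confirm that a season folder holds all three files. The sketch below is one possible check; `missing_stat_files` and its `root` parameter are hypothetical names that are not part of the notebook, and the file names are the ones listed above.

```python
from pathlib import Path

# File names taken from the import instructions above.
STAT_FILES = ("pg.csv", "pgo.csv", "adv.csv")

def missing_stat_files(year, root="."):
    """Return the expected stat files that are absent from the season folder.

    Hypothetical helper: `year` is the starting year of the season (24 for
    2024-2025) and `root` is the directory that contains the season folders.
    """
    season_dir = Path(root) / str(year)
    return [name for name in STAT_FILES if not (season_dir / name).is_file()]

# An empty list means the folder is ready to be read.
print(missing_stat_files(24))
```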

In [1]:
import pandas as pd

The cell below reads the game log and stores it in `df`. It also defines 2 helper methods.
* `get_type(type)`: gets a dataframe of the games whose type matches `type`, across all seasons. The default is 26, which selects regular season games and NBA Cup games.
* `get_season(year, type)`: gets a full NBA season of the given type. `year` is the starting year of the season, as stated above, so the 2024-2025 season is `get_season(24)`.

Note: `type` is optional in both. Its digits refer to the table below, and a multi-digit value includes every listed type. For example, `45` selects both playoffs and play-in tournament games.

|id|Result|
|:-:|-|
|0|Nothing|
|1|Preseason|
|2|Regular Season / NBA Cup|
|3|Nothing|
|4|Playoffs|
|5|Play-in Tournament|
|6|NBA Cup / Regular Season|
|7|Nothing|
|8|Nothing|
|9|Nothing|
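The helpers select games by matching a regular expression against the start of `gameId`. The sketch below builds the same kind of pattern with the standard `re` module and applies it to a plain list. The ids are made up for illustration, and the layout (type digit, then two-digit starting year, then game number) is an assumption inferred from the pattern the notebook builds. `game_id_pattern` is a hypothetical name.

```python
import re

def game_id_pattern(year, game_type=26):
    """Build a pattern like ^[26]24: one of the type digits, then the year."""
    # zfill keeps single-digit years two digits wide, e.g. 5 -> "05".
    return re.compile(r"^[" + str(game_type) + "]" + str(year).zfill(2))

# Made-up ids: <type digit><two-digit starting year><game number>.
toy_ids = [22400001, 62400002, 42400101, 52400201, 22300001, 12400010]

def select(ids, year, game_type=26):
    pattern = game_id_pattern(year, game_type)
    return [i for i in ids if pattern.match(str(i))]

regular_and_cup = select(toy_ids, 24)      # types 2 and 6
postseason = select(toy_ids, 24, 45)       # playoffs and play-in
print(regular_and_cup)
print(postseason)
```

Because the type digits sit inside a character class, `45` means "4 or 5", not the two-digit number 45.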

In [2]:
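# Load the Kaggle game log once; the helpers below filter this dataframe.
df = pd.read_csv('Games.csv')

def get_type(type=26):
  # Each digit of `type` is an allowed first digit of gameId (see the table above).
  type = str(type)
  return df[df['gameId'].astype(str).str.match(r'^[' + type + ']')]

def get_season(year, type=26):
  # zfill keeps single-digit years two digits wide so that 5 matches "05", not "5x".
  year = str(year).zfill(2)
  type = str(type)
  return df[df['gameId'].astype(str).str.match(r'^[' + type + ']' + year)]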


The cell below iterates through the game logs and builds two dictionaries that map between teams and their ids.
* `teams_dict` maps each team id to the set of all team names that have used it. This is useful for historical data, where a team has had different names but keeps the same id.
* `id_dict` maps each team name to the set of ids seen for that name. This is useful for finding the id of a team, given the team name.
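As a small illustration of the two mappings, the sketch below builds them from a few hand-written rows instead of `Games.csv`. The ids are made up, and `collections.defaultdict` is used here for brevity; it is not what the notebook cell uses.

```python
from collections import defaultdict

# Hand-written rows with made-up ids, using the same column names as the game log.
rows = [
    {"hometeamId": 1, "hometeamCity": "Seattle", "hometeamName": "SuperSonics",
     "awayteamId": 2, "awayteamCity": "Chicago", "awayteamName": "Bulls"},
    {"hometeamId": 1, "hometeamCity": "Oklahoma City", "hometeamName": "Thunder",
     "awayteamId": 2, "awayteamCity": "Chicago", "awayteamName": "Bulls"},
]

teams_by_id = defaultdict(set)   # id -> every name used with that id
ids_by_team = defaultdict(set)   # name -> every id seen for that name

for r in rows:
    for side in ("home", "away"):
        team_id = r[side + "teamId"]
        name = r[side + "teamCity"] + " " + r[side + "teamName"]
        teams_by_id[team_id].add(name)
        ids_by_team[name].add(team_id)

# One id can carry several historical names.
print(sorted(teams_by_id[1]))
# Looking up an id from a name: take any element of the set.
print(next(iter(ids_by_team["Chicago Bulls"])))
```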

In [3]:
teams_dict = {}
id_dict = {}
for i, r in df.iterrows():
  h = r.hometeamId
  a = r.awayteamId
  # Full team name is "<city> <name>"
  hN = r.hometeamCity + " " + r.hometeamName
  aN = r.awayteamCity + " " + r.awayteamName

  # id -> every name the team has played under
  teams_dict.setdefault(h, set()).add(hN)
  teams_dict.setdefault(a, set()).add(aN)

  # name -> every id seen for that name
  id_dict.setdefault(hN, set()).add(h)
  id_dict.setdefault(aN, set()).add(a)

### Preprocessing, getting stats
This cell has all the preprocessing methods. It builds a dataframe of team stats for a given year by combining all the stats (pg, pgo, adv) into one dataframe. Most of the methods are helpers; the only one you need to call directly is `get_all_stats(year)`.
* `map_helper_0(x)`: the adv file has 2 header rows, and the first one marks which columns are offense stats and which are defense stats. `map_helper_0` turns each first-row label into a prefix (`OFF_` or `DFF_`) that is prepended to the matching second-row name, so that there are no duplicate column names and offense and defense can be told apart.
* `map_helper_1(x)`: filters the second header row and blanks out unnecessary columns, like rank, arena, and attendance information, so they can be dropped.
* `remove_asterisk(df)`: Basketball Reference puts an asterisk at the end of a team name to signify a playoff team. We remove the asterisk so team names can be matched to ids.
* `remove_blank(df)`: removes the columns of a dataframe whose name is blank.
* `set_ids(df)`: sets the index of the dataframe to the team ids, looked up from the "Team" column.
* `filter_league_average(df)`: Basketball Reference includes a League Average row, so we remove it here.
* `get_adv(year)`: reads the Advanced stats csv for a given year into a dataframe.
* `get_pg(year)`: reads the Per Game csv for a given year into a dataframe.
* `get_pgo(year)`: reads the Per Game Opponent csv for a given year into a dataframe.
* `get_all_stats(year)`: gets all the stats using the 3 methods above and merges them into one dataframe. The result is the stats dataframe for a given year, which can be used for training.
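To make the two-header handling concrete, the sketch below combines the header rows of a tiny CSV using only the standard library. The CSV text is made up, and it assumes the first header row repeats its label over every column it covers; check this against a real `adv.csv`. The notebook itself does this with pandas, so this is an illustration of the idea rather than the notebook's code. `prefix_for` is a hypothetical name.

```python
import csv
import io

# Made-up file with two header rows, loosely shaped like an Advanced stats export.
RAW = """,,Offense Four Factors,Offense Four Factors,Defense Four Factors,Defense Four Factors
Rk,Team,eFG%,TOV%,eFG%,TOV%
1,Team A*,0.5,10.0,0.25,12.0
2,Team B,0.75,11.0,0.5,9.0
"""

def prefix_for(label):
    """Map a first-row label to the prefix used for the combined column name."""
    if "Offense" in label:
        return "OFF_"
    if "Defense" in label:
        return "DFF_"
    return ""

rows = list(csv.reader(io.StringIO(RAW)))
top, second, data = rows[0], rows[1], rows[2:]

# Prefix + second-row name gives unique names for the repeated stats.
columns = [prefix_for(t) + s for t, s in zip(top, second)]

records = []
for row in data:
    record = dict(zip(columns, row))
    record.pop("Rk")                                # rank is not a feature
    record["Team"] = record["Team"].rstrip("*")     # drop the playoff marker
    records.append(record)

print(columns)
print(records[0])
```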

In [4]:
def map_helper_0(x):
  if "Offense" in x:
    return "OFF_"
  elif "Defense" in x:
    return "DFF_"
  return ""

def map_helper_1(x):
  if "Unnamed" in x or "Arena" in x or "Attend" in x or "Rk" in x:
    return ""
  if ".1" in x:
    return x.replace(".1", "")
  return x

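def remove_asterisk(df):
  # regex=False: "*" is a literal character here, not a regex quantifier
  df["Team"] = df["Team"].str.replace("*", "", regex=False)
  return df

def remove_blank(df):
  # Columns renamed to '' by the helpers above are dropped here
  return df.loc[:, ~df.columns.isin([''])].copy()

def set_ids(df):
  # Use the team id as the index; teams missing from id_dict get None
  df.index = [next(iter(id_dict[team])) if team in id_dict else None for team in df["Team"]]

def filter_league_average(df):
  # .copy() so later in-place edits do not act on a view of the raw dataframe
  return df[df["Team"] != "League Average"].copy()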

def get_adv(year):
  year = str(year)
  # The adv file has two header rows: read each one on its own to build column names
  df1 = pd.read_csv(year + "/adv.csv", header=0)
  df2 = pd.read_csv(year + "/adv.csv", header=1)
  df1.columns = list(map(map_helper_0, df1.columns))
  df2.columns = list(map(map_helper_1, df2.columns))
  # Prefix from the first row + cleaned name from the second row
  combined = [f"{col1}{col2}" for col1, col2 in zip(df1.columns, df2.columns)]
  # Re-read only the data rows and attach the combined names
  final_df = pd.read_csv(year + "/adv.csv", header=None, skiprows=2)
  final_df.columns = combined
  final_df = filter_league_average(final_df)
  final_df = remove_blank(final_df)
  remove_asterisk(final_df)
  set_ids(final_df)
  return final_df

def get_pg(year):
  year = str(year)
  df1 = pd.read_csv(year + "/pg.csv")
  df1 = filter_league_average(df1)
  df1.columns = ["" if i == 'Rk' else i for i in df1.columns]
  df1 = remove_blank(df1)
  remove_asterisk(df1)
  set_ids(df1)
  return df1

def get_pgo(year):
  year = str(year)
  df1 = pd.read_csv(year + "/pgo.csv")
  df1 = filter_league_average(df1)
  remove_asterisk(df1)
  set_ids(df1)
  df1.columns = ["" if i == 'Rk' or i == 'G' else i+"_OPP" for i in df1.columns]
  df1 = df1.rename(columns={'Team_OPP': 'Team'})
  df1 = remove_blank(df1)
  return df1

def get_all_stats(year):
  pg_df = get_pg(year)
  pgo_df = get_pgo(year)
  adv_df = get_adv(year)
  combined_df = pd.concat([pg_df, pgo_df, adv_df], axis=1)
  combined_df = combined_df.loc[:, ~combined_df.columns.duplicated()]
  return combined_df


# get_all_stats(23)

### Preprocessing, game logs
The following cell gets the game logs ready for training and combines them with the stats above. `get_game_log(year)` is the important method to use here.
* `get_games_year(year)`: gets all the games in a season, keeping only the columns needed for the game log.
* `get_stats_year(year)`: gets all the stats for a season and removes the `Team` column, so the result is ready to be combined with the game log.
* `get_game_log(year)`: gets the game log for a season. It first gets all the games, then uses the home and away team ids in each game to look up both teams' stats, subtracts the away team's stats from the home team's, and appends the differences to the game log. Each game in the log contains the team information, the winner (1 if the home team wins, else 0), and the stat differences (home team minus away team).
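The home-minus-away step can also be written without a row loop. The sketch below is a vectorized variant on toy data, not the notebook's implementation. The ids and stat values are made up, and `add_stat_diffs` is a hypothetical name. Unlike the loop in the notebook, it raises a `KeyError` if a team id is missing from `stats` instead of printing a message.

```python
import pandas as pd

# Toy inputs with made-up ids and values, shaped like the notebook's dataframes.
stats = pd.DataFrame(
    {"FG": [42.5, 40.0, 41.0], "TS%": [0.5, 0.75, 0.25]},
    index=[1, 2, 3],  # team ids
)
games = pd.DataFrame(
    {"hometeamId": [1, 2, 3], "awayteamId": [2, 3, 1], "homeWin": [1, 0, 1]}
)

def add_stat_diffs(games, stats):
    """Return a copy of `games` with home-minus-away stat columns appended."""
    # Look up one stats row per game for each side, in game order.
    home = stats.loc[games["hometeamId"]].to_numpy()
    away = stats.loc[games["awayteamId"]].to_numpy()
    diffs = pd.DataFrame(home - away, columns=stats.columns, index=games.index)
    return pd.concat([games, diffs], axis=1)

log = add_stat_diffs(games, stats)
print(log)
```

A positive value means the home team has the higher season figure for that stat.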

In [5]:
def get_games_year(year):
  keep = ['hometeamCity','hometeamName','hometeamId', 'awayteamCity', 'awayteamName', 'awayteamId', 'winner']
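  games = get_season(year)
  # .copy() so the column assignments below modify an independent dataframe
  games = games.loc[:, games.columns.isin(keep)].copy()
  # 1 if the home team won, else 0
  games['winner'] = (games['winner'] == games['hometeamId']) * 1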
  games["hometeamCity"] = games["hometeamCity"] + " " + games["hometeamName"]
  games["awayteamCity"] = games["awayteamCity"] + " " + games["awayteamName"]
  games = games.loc[:, ~games.columns.str.contains("Name")].rename(columns={
    'hometeamCity': 'hometeam',
    'awayteamCity': 'awayteam',
    'winner': 'homeWin'
  })
  return games

def get_stats_year(year):
  stats = get_all_stats(year)
  stats = stats.loc[:, ~stats.columns.isin(["Team"])]
  return stats

def get_game_log(year):
  stats = get_stats_year(year)
  games = get_games_year(year)

  for col in stats.columns:
    games[col] = 0.0

  for index, row in games.iterrows():
    home_id = row['hometeamId']
    away_id = row['awayteamId']

    if home_id in stats.index and away_id in stats.index:
      diff = stats.loc[home_id] - stats.loc[away_id]
      for col in diff.index:
        games.at[index, col] = diff[col]
    else:
      # f-string: the ids are integers, so "+" concatenation would raise a TypeError
      print(f"id's not found? {home_id}, {away_id}")
    
  return games

### Preprocessing, combined game logs
The `get_logs` method gets the logs for all seasons from start to end.
* `get_logs(start, end, exclude)`: appends the game logs from each season into one dataframe, covering every season from `start` through `end`. The default `end` is 24, to be used with the current data. `start` can be moved around to find out whether some years hurt the model. `exclude` lists seasons to skip (like the bubble years).
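Because seasons are two-digit years, a range such as 98 to 02 has to wrap past 99. The sketch below only computes the order in which seasons would be visited, newest first, which is how the cell below walks them. `season_order` is a hypothetical helper for illustration and does not read any data.

```python
def season_order(start, end=24, exclude=()):
    """List two-digit season years from `end` back to `start`, newest first."""
    if start > end:
        # Wraps past the year 2000: end..00, then 99..start
        years = list(range(end, -1, -1)) + list(range(99, start - 1, -1))
    else:
        years = list(range(end, start - 1, -1))
    return [y for y in years if y not in exclude]

print(season_order(19, 24))                # [24, 23, 22, 21, 20, 19]
print(season_order(98, 2))                 # [2, 1, 0, 99, 98]
print(season_order(19, 24, exclude=[19]))  # [24, 23, 22, 21, 20]
```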

In [6]:
def get_logs(start, end=24, exclude=()):
  # exclude defaults to an empty tuple to avoid a shared mutable default argument
  log_list = []
  def append_it(year):
    if year not in exclude:
      try:
        log_list.append(get_game_log(year))
      except Exception as e:
        print(f"Failed to get data for year {year}: {e}")

  if start > end:
    # The range wraps past the year 2000 (e.g. start=98, end=2): end..00, then 99..start
    for i in range(end, -1, -1):
      append_it(i)
    for i in range(99, start - 1, -1):
      append_it(i)
  else:
    # Newest season first
    for i in range(end, start - 1, -1):
      append_it(i)

  return pd.concat(log_list, ignore_index=True)

get_logs(19, 24)

| |hometeam|hometeamId|awayteam|awayteamId|homeWin|G|MP|FG|FGA|FG%|...|3PAr|TS%|OFF_eFG%|OFF_TOV%|OFF_ORB%|OFF_FT/FGA|DFF_eFG%|DFF_TOV%|DFF_DRB%|DFF_FT/FGA|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|Miami Heat|1610612748|Philadelphia 76ers|1610612755|1|0.0|1.6|0.4|-0.6|0.008|...|0.007|0.010|0.014|0.4|-1.1|-0.015|-0.031|-1.8|3.8|-0.036|
|1|Detroit Pistons|1610612765|Sacramento Kings|1610612758|0|0.0|-1.2|-0.6|-0.5|-0.004|...|0.002|-0.003|-0.003|1.1|1.0|0.003|-0.019|0.8|-0.9|0.014|
|2|Golden State Warriors|1610612744|Houston Rockets|1610612745|0|-1.0|-1.0|-1.6|-2.7|-0.004|...|0.087|0.015|0.015|0.5|-4.6|0.001|0.018|1.0|-0.5|0.005|
|3|Denver Nuggets|1610612743|Indiana Pacers|1610612754|0|1.0|0.6|1.8|0.9|0.015|...|-0.045|0.006|0.008|0.8|5.7|0.008|-0.005|-1.8|0.1|-0.018|
|4|New York Knicks|1610612752|Phoenix Suns|1610612756|1|0.0|0.3|2.2|2.9|0.009|...|-0.057|-0.006|-0.004|-0.9|3.3|-0.012|-0.004|2.2|0.3|-0.018|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|6996|Charlotte Hornets|1610612766|Chicago Bulls|1610612741|1|0.0|1.1|-2.3|-2.7|-0.013|...|0.003|-0.008|-0.011|-0.4|1.1|0.013|0.000|-3.2|-1.2|-0.080|
|6997|Indiana Pacers|1610612754|Detroit Pistons|1610612765|0|7.0|-0.5|2.8|2.8|0.017|...|-0.064|0.004|0.005|-1.8|-2.6|-0.024|-0.030|0.3|0.8|0.006|
|6998|Orlando Magic|1610612753|Cleveland Cavaliers|1610612739|1|8.0|-1.2|-1.0|0.7|-0.014|...|0.002|-0.009|-0.016|-3.1|-2.3|0.026|-0.025|1.5|1.7|0.012|
|6999|Los Angeles Clippers|1610612746|Los Angeles Lakers|1610612747|1|1.0|0.7|-0.7|0.9|-0.014|...|0.017|0.004|-0.007|-0.7|-1.0|0.032|-0.009|-1.9|-1.2|0.001|
