In [24]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score

# Introduction

### Sports analytics has been a very fast growing industry in the past decade, and is an area that I have a lot of passions for, and a field I hope to enter in the future.s
An area that has especially seen a lot of growth in recent years is sports betting. In 2018, the gross gaming revenue in the United States was around 400 million dollars, and by 2023, this revenue grew to over 11 billion dollars. This surge is largely attributed to the legalization of sports betting in individual U.S. states, including North Carolina. 

In March 2024, North Carolina legalized sports betting and has reportedly topped 60 million dollars in sports betting tax revenue over the first six months, with over 8.5 million dollars generated in taxes. If you tune into any televised sporting event, it often feels like a Fanduel or DraftKings advertisement appears every few seconds..
nds.

### The ginormous industry that is sports betting is fueled by predicting the outcome of games/player performance.  
Companies like Fanduel and Draftkings hire thousands of data analysts to generate the most accurate predictions for games/player performance, and consumers bet against those predictions (the house almost always comes out
### For this project, I want to produce a predictive model to determine the outcome of NBA games (the winner of a game) at a successful rate (over 50% success rate at the minimum).   minimum).  
I will use data from game outcomes and statistics in the past, and perhaps also use past player performance to come up with my model.

### During this project, I will have the opportunity to delve into the data and data analysis that has created a $200 billion global market.  
If I am successful in creating a model that can consistently predict game outcomes and/or player performances, the model can theoretically be used not only for sports betting markets and insights, but could be useful for teams themselves when making strategic decisions for games.
 decisions for games.
abase of games from 2010-2020


# Reading in the Data

## To create my model, I use a NBA Database from Kaggle that holds data/statistics for all games, teams, players, drafts, and individual plays from 1947-2023

### After reading in all of the datasets and reviewing each table, I decided that game.csv and team.csv were the datasets I would be mainly using to build the model

### Database Source: https://www.kaggle.com/datasets/wyattowalsh/basketball

In [94]:
game = pd.read_csv('game.csv')
game.tail(10)

Unnamed: 0,season_id,team_id_home,team_abbreviation_home,team_name_home,game_id,game_date,matchup_home,wl_home,min,fgm_home,fga_home,fg_pct_home,fg3m_home,fg3a_home,fg3_pct_home,ftm_home,fta_home,ft_pct_home,oreb_home,dreb_home,reb_home,ast_home,stl_home,blk_home,tov_home,pf_home,pts_home,plus_minus_home,video_available_home,team_id_away,team_abbreviation_away,team_name_away,matchup_away,wl_away,fgm_away,fga_away,fg_pct_away,fg3m_away,fg3a_away,fg3_pct_away,ftm_away,fta_away,ft_pct_away,oreb_away,dreb_away,reb_away,ast_away,stl_away,blk_away,tov_away,pf_away,pts_away,plus_minus_away,video_available_away,season_type
65688,42022,1610612738,BOS,Boston Celtics,42200305,2023-05-25 00:00:00,BOS vs. MIA,W,240,40.0,79.0,0.506,16.0,39.0,0.41,14.0,19.0,0.737,12.0,25.0,37.0,23.0,13.0,4.0,10.0,9.0,110.0,13,1,1610612748,MIA,Miami Heat,MIA @ BOS,L,40.0,78.0,0.513,9.0,23.0,0.391,8.0,10.0,0.8,10.0,26.0,36.0,20.0,6.0,2.0,16.0,13.0,97.0,-13,1,Playoffs
65689,42022,1610612748,MIA,Miami Heat,42200306,2023-05-27 00:00:00,MIA vs. BOS,L,240,33.0,93.0,0.355,14.0,30.0,0.467,23.0,29.0,0.793,17.0,30.0,47.0,22.0,5.0,4.0,5.0,24.0,103.0,-1,1,1610612738,BOS,Boston Celtics,BOS @ MIA,W,34.0,78.0,0.436,7.0,35.0,0.2,29.0,34.0,0.853,12.0,35.0,47.0,18.0,4.0,8.0,12.0,22.0,104.0,1,1,Playoffs
65690,42022,1610612738,BOS,Boston Celtics,42200307,2023-05-29 00:00:00,BOS vs. MIA,L,240,32.0,82.0,0.39,9.0,42.0,0.214,11.0,13.0,0.846,10.0,30.0,40.0,18.0,6.0,4.0,15.0,13.0,84.0,-19,1,1610612748,MIA,Miami Heat,MIA @ BOS,W,42.0,86.0,0.488,14.0,28.0,0.5,5.0,6.0,0.833,7.0,35.0,42.0,26.0,7.0,2.0,14.0,15.0,103.0,19,1,Playoffs
65691,42022,1610612743,DEN,Denver Nuggets,42200401,2023-06-01 00:00:00,DEN vs. MIA,W,240,40.0,79.0,0.506,8.0,27.0,0.296,16.0,20.0,0.8,6.0,39.0,45.0,29.0,4.0,4.0,10.0,8.0,104.0,11,1,1610612748,MIA,Miami Heat,MIA @ DEN,L,39.0,96.0,0.406,13.0,39.0,0.333,2.0,2.0,1.0,11.0,32.0,43.0,26.0,5.0,4.0,8.0,15.0,93.0,-11,1,Playoffs
65692,42022,1610612743,DEN,Denver Nuggets,42200402,2023-06-04 00:00:00,DEN vs. MIA,L,240,39.0,75.0,0.52,11.0,28.0,0.393,19.0,22.0,0.864,9.0,29.0,38.0,23.0,7.0,2.0,14.0,21.0,108.0,-3,1,1610612748,MIA,Miami Heat,MIA @ DEN,W,38.0,78.0,0.487,17.0,35.0,0.486,18.0,20.0,0.9,8.0,23.0,31.0,28.0,5.0,4.0,11.0,22.0,111.0,3,1,Playoffs
65693,42022,1610612748,MIA,Miami Heat,42200403,2023-06-07 00:00:00,MIA vs. DEN,L,240,34.0,92.0,0.37,11.0,35.0,0.314,15.0,19.0,0.789,10.0,23.0,33.0,20.0,7.0,3.0,4.0,22.0,94.0,-15,1,1610612743,DEN,Denver Nuggets,DEN @ MIA,W,41.0,80.0,0.513,5.0,18.0,0.278,22.0,27.0,0.815,13.0,45.0,58.0,28.0,3.0,5.0,14.0,18.0,109.0,15,1,Playoffs
65694,42022,1610612748,MIA,Miami Heat,42200404,2023-06-09 00:00:00,MIA vs. DEN,L,240,35.0,78.0,0.449,8.0,25.0,0.32,17.0,20.0,0.85,8.0,29.0,37.0,23.0,2.0,3.0,15.0,19.0,95.0,-13,1,1610612743,DEN,Denver Nuggets,DEN @ MIA,W,39.0,79.0,0.494,14.0,28.0,0.5,16.0,21.0,0.762,5.0,29.0,34.0,26.0,11.0,7.0,8.0,18.0,108.0,13,1,Playoffs
65695,42022,1610612743,DEN,Denver Nuggets,42200405,2023-06-12 00:00:00,DEN vs. MIA,W,240,38.0,84.0,0.452,5.0,28.0,0.179,13.0,23.0,0.565,11.0,46.0,57.0,21.0,6.0,7.0,15.0,13.0,94.0,5,1,1610612748,MIA,Miami Heat,MIA @ DEN,L,33.0,96.0,0.344,9.0,35.0,0.257,14.0,16.0,0.875,11.0,33.0,44.0,18.0,9.0,7.0,8.0,21.0,89.0,-5,1,Playoffs
65696,32022,1610616834,LBN,Team LeBron,32200001,2023-02-19 00:00:00,LBN vs. GNS,L,221,79.0,132.0,0.598,17.0,60.0,0.283,0.0,0.0,,13.0,32.0,45.0,49.0,7.0,2.0,10.0,5.0,175.0,-9,1,1610616833,GNS,Team Giannis,GNS @ LBN,W,76.0,123.0,0.618,29.0,66.0,0.439,3.0,4.0,0.75,10.0,36.0,46.0,43.0,8.0,1.0,12.0,2.0,184.0,9,1,All-Star
65697,32022,1610616834,LBN,Team LeBron,32200001,2023-02-19 00:00:00,LBN vs. GNS,L,221,79.0,132.0,0.598,17.0,60.0,0.283,0.0,0.0,,13.0,32.0,45.0,49.0,7.0,2.0,10.0,5.0,175.0,-9,1,1610616833,GNS,Team Giannis,GNS @ LBN,W,76.0,123.0,0.618,29.0,66.0,0.439,3.0,4.0,0.75,10.0,36.0,46.0,43.0,8.0,1.0,12.0,2.0,184.0,9,1,All Star


In [27]:
line_score = pd.read_csv('line_score.csv')
line_score.tail(10)

Unnamed: 0,game_date_est,game_sequence,game_id,team_id_home,team_abbreviation_home,team_city_name_home,team_nickname_home,team_wins_losses_home,pts_qtr1_home,pts_qtr2_home,...,pts_ot2_away,pts_ot3_away,pts_ot4_away,pts_ot5_away,pts_ot6_away,pts_ot7_away,pts_ot8_away,pts_ot9_away,pts_ot10_away,pts_away
58043,2023-05-25 00:00:00,1.0,42200305,1610612738,BOS,Boston,Celtics,2-3,35.0,26.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,97.0
58044,2023-05-27 00:00:00,1.0,42200306,1610612738,BOS,Boston,Celtics,3-3,34.0,23.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,103.0
58045,2023-05-29 00:00:00,1.0,42200307,1610612748,MIA,Miami,Heat,4-3,22.0,30.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,84.0
58046,2023-06-01 00:00:00,1.0,42200401,1610612743,DEN,Denver,Nuggets,1-0,29.0,30.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,93.0
58047,2023-06-04 00:00:00,1.0,42200402,1610612743,DEN,Denver,Nuggets,1-1,23.0,34.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,111.0
58048,2023-06-07 00:00:00,1.0,42200403,1610612748,MIA,Miami,Heat,1-2,24.0,24.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,109.0
58049,2023-06-09 00:00:00,1.0,42200404,1610612748,MIA,Miami,Heat,1-3,21.0,30.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,108.0
58050,2023-06-12 00:00:00,1.0,42200405,1610612743,DEN,Denver,Nuggets,4-1,22.0,22.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,89.0
58051,2023-02-19 00:00:00,1.0,32200001,1610616834,LBN,Team,LeBron,0-1,46.0,46.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,184.0
58052,2023-02-19 00:00:00,1.0,32200001,1610616834,LBN,Team,LeBron,0-1,46.0,46.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,184.0


In [28]:
game_summary = pd.read_csv('game_summary.csv')
game_summary.head()
game_summary.columns

Index(['game_date_est', 'game_sequence', 'game_id', 'game_status_id',
       'game_status_text', 'gamecode', 'home_team_id', 'visitor_team_id',
       'season', 'live_period', 'live_pc_time',
       'natl_tv_broadcaster_abbreviation', 'live_period_time_bcast',
       'wh_status'],
      dtype='object')

In [29]:
team_info_common = pd.read_csv('team_info_common.csv')
team_info_common.tail(10)

Unnamed: 0,team_id,season_year,team_city,team_name,team_abbreviation,team_conference,team_division,team_code,team_slug,w,...,league_id,season_id,pts_rank,pts_pg,reb_rank,reb_pg,ast_rank,ast_pg,opp_pts_rank,opp_pts_pg


In [30]:
team_history = pd.read_csv('team_history.csv')
team_history.tail(10)

Unnamed: 0,team_id,city,nickname,year_founded,year_active_till
42,1610612764,Washington,Bullets,1974,1996
43,1610612764,Capital,Bullets,1973,1973
44,1610612764,Baltimore,Bullets,1963,1972
45,1610612764,Chicago,Zephyrs,1962,1962
46,1610612764,Chicago,Packers,1961,1961
47,1610612765,Detroit,Pistons,1957,2019
48,1610612765,Ft. Wayne Zollner,Pistons,1948,1956
49,1610612766,Charlotte,Hornets,2014,2019
50,1610612766,Charlotte,Bobcats,2004,2013
51,1610612766,Charlotte,Hornets,1988,2001


In [31]:
team_details = pd.read_csv('team_details.csv')
team_details.tail(10)

Unnamed: 0,team_id,abbreviation,nickname,yearfounded,city,arena,arenacapacity,owner,generalmanager,headcoach,dleagueaffiliation,facebook,instagram,twitter
15,1610612757,POR,Trail Blazers,1970.0,Portland,Moda Center,19980.0,Jody Allen,Joe Cronin,Chauncey Billups,No Affiliate,https://www.facebook.com/trailblazers,https://instagram.com/trailblazers,https://twitter.com/trailblazers
16,1610612758,SAC,Kings,1948.0,Sacramento,Golden 1 Center,17500.0,Vivek Ranadive,Monte McNair,Mike Brown,Stockton Kings,https://www.facebook.com/sacramentokings,https://instagram.com/sacramentokings,https://twitter.com/SacramentoKings
17,1610612759,SAS,Spurs,1976.0,San Antonio,AT&T Center,18694.0,Peter Holt,Brian Wright,Gregg Popovich,Austin Spurs,https://www.facebook.com/Spurs,https://instagram.com/spurs,https://twitter.com/spurs
18,1610612760,OKC,Thunder,1967.0,Oklahoma City,Paycom Center,,Clay Bennett,Sam Presti,Mark Daigneault,Oklahoma City Blue,https://www.facebook.com/OKCThunder,https://instagram.com/okcthunder,https://twitter.com/okcthunder
19,1610612761,TOR,Raptors,1995.0,Toronto,Scotiabank Arena,,Lawrence Tanenbaum,Masai Ujiri,,Raptors 905,https://www.facebook.com/TorontoRaptors,https://instagram.com/raptors,https://twitter.com/Raptors
20,1610612762,UTA,Jazz,1974.0,Utah,Delta Center,,Ryan Smith,Justin Zanik,Will Hardy,Salt Lake City Stars,https://www.facebook.com/utahjazz,https://instagram.com/utahjazz,https://twitter.com/utahjazz
21,1610612763,MEM,Grizzlies,1995.0,Memphis,FedExForum,18119.0,Robert Pera,Zach Kleiman,Taylor Jenkins,Memphis Hustle,https://www.facebook.com/MemphisGrizzlies,https://instagram.com/memgrizz,https://twitter.com/memgrizz
22,1610612764,WAS,Wizards,1961.0,Washington,Capital One Arena,20647.0,Ted Leonsis,Tommy Sheppard,Wes Unseld,Capital City Go-Go,https://www.facebook.com/Wizards,https://instagram.com/washwizards,https://twitter.com/WashWizards
23,1610612765,DET,Pistons,1948.0,Detroit,Little Caesars Arena,,Tom Gores,Ed Stefanski,Monty Williams,Motor City Cruise,https://www.facebook.com/detroitpistons,https://instagram.com/detroitpistons,https://twitter.com/DetroitPistons
24,1610612766,CHA,Hornets,1988.0,Charlotte,Spectrum Center,19026.0,Michael Jordan,Mitch Kupchak,Steve Clifford,Greensboro Swarm,https://www.facebook.com/hornets,https://instagram.com/hornets,https://twitter.com/hornets


In [32]:
team = pd.read_csv('team.csv')
team.tail(10)
team.columns

Index(['id', 'full_name', 'abbreviation', 'nickname', 'city', 'state',
       'year_founded'],
      dtype='object')

In [33]:
player = pd.read_csv('player.csv')
player.tail(10)
player.columns

Index(['id', 'full_name', 'first_name', 'last_name', 'is_active'], dtype='object')

In [34]:
other_stats = pd.read_csv('other_stats.csv')
other_stats.tail(10)
other_stats.columns

Index(['game_id', 'league_id', 'team_id_home', 'team_abbreviation_home',
       'team_city_home', 'pts_paint_home', 'pts_2nd_chance_home',
       'pts_fb_home', 'largest_lead_home', 'lead_changes', 'times_tied',
       'team_turnovers_home', 'total_turnovers_home', 'team_rebounds_home',
       'pts_off_to_home', 'team_id_away', 'team_abbreviation_away',
       'team_city_away', 'pts_paint_away', 'pts_2nd_chance_away',
       'pts_fb_away', 'largest_lead_away', 'team_turnovers_away',
       'total_turnovers_away', 'team_rebounds_away', 'pts_off_to_away'],
      dtype='object')

In [35]:
officials = pd.read_csv('officials.csv')
officials.tail(10)
officials.columns

Index(['game_id', 'official_id', 'first_name', 'last_name', 'jersey_num'], dtype='object')

In [36]:
inactive_players = pd.read_csv('inactive_players.csv')
inactive_players.tail(10)
inactive_players.columns

Index(['game_id', 'player_id', 'first_name', 'last_name', 'jersey_num',
       'team_id', 'team_city', 'team_name', 'team_abbreviation'],
      dtype='object')

In [37]:
game_info = pd.read_csv('game_info.csv')
game_info.tail(10)
game_info.columns

Index(['game_id', 'game_date', 'attendance', 'game_time'], dtype='object')

In [38]:
draft_history = pd.read_csv('draft_history.csv')
draft_history.tail(10)
draft_history.columns

Index(['person_id', 'player_name', 'season', 'round_number', 'round_pick',
       'overall_pick', 'draft_type', 'team_id', 'team_city', 'team_name',
       'team_abbreviation', 'organization', 'organization_type',
       'player_profile_flag'],
      dtype='object')

In [39]:
draft_combine_stats = pd.read_csv('draft_combine_stats.csv')
draft_combine_stats.tail(10)
draft_combine_stats.columns

Index(['season', 'player_id', 'first_name', 'last_name', 'player_name',
       'position', 'height_wo_shoes', 'height_wo_shoes_ft_in',
       'height_w_shoes', 'height_w_shoes_ft_in', 'weight', 'wingspan',
       'wingspan_ft_in', 'standing_reach', 'standing_reach_ft_in',
       'body_fat_pct', 'hand_length', 'hand_width', 'standing_vertical_leap',
       'max_vertical_leap', 'lane_agility_time', 'modified_lane_agility_time',
       'three_quarter_sprint', 'bench_press', 'spot_fifteen_corner_left',
       'spot_fifteen_break_left', 'spot_fifteen_top_key',
       'spot_fifteen_break_right', 'spot_fifteen_corner_right',
       'spot_college_corner_left', 'spot_college_break_left',
       'spot_college_top_key', 'spot_college_break_right',
       'spot_college_corner_right', 'spot_nba_corner_left',
       'spot_nba_break_left', 'spot_nba_top_key', 'spot_nba_break_right',
       'spot_nba_corner_right', 'off_drib_fifteen_break_left',
       'off_drib_fifteen_top_key', 'off_drib_fifteen_bre

In [40]:
common_player_info = pd.read_csv('common_player_info.csv')
common_player_info.tail(10)
common_player_info.columns

Index(['person_id', 'first_name', 'last_name', 'display_first_last',
       'display_last_comma_first', 'display_fi_last', 'player_slug',
       'birthdate', 'school', 'country', 'last_affiliation', 'height',
       'weight', 'season_exp', 'jersey', 'position', 'rosterstatus',
       'games_played_current_season_flag', 'team_id', 'team_name',
       'team_abbreviation', 'team_code', 'team_city', 'playercode',
       'from_year', 'to_year', 'dleague_flag', 'nba_flag', 'games_played_flag',
       'draft_year', 'draft_round', 'draft_number', 'greatest_75_flag'],
      dtype='object')

# Data Preprocessing/Cleaning

## In these below cells, I am updating columns and column values in order to: Minimize incomplete data, remove/replace NaN values, drop unnecessary columns, convert values into formats for compatability, and creating new columns (such as rolling averages) that can help build more accurate results

### Here I am merging columns that I felt were going to be helpful from the line_score subset with the game subset

In [41]:
line_score_subset = line_score[['game_id', 'team_wins_losses_home', 
                                 'team_wins_losses_away']]

In [42]:
merged_game_info = pd.merge(game, line_score_subset, on= 'game_id', how= 'inner')

In [43]:
merged_game_info.tail(10)

Unnamed: 0,season_id,team_id_home,team_abbreviation_home,team_name_home,game_id,game_date,matchup_home,wl_home,min,fgm_home,...,stl_away,blk_away,tov_away,pf_away,pts_away,plus_minus_away,video_available_away,season_type,team_wins_losses_home,team_wins_losses_away
58139,42022,1610612738,BOS,Boston Celtics,42200307,2023-05-29 00:00:00,BOS vs. MIA,L,240,32.0,...,7.0,2.0,14.0,15.0,103.0,19,1,Playoffs,4-3,3-4
58140,42022,1610612743,DEN,Denver Nuggets,42200401,2023-06-01 00:00:00,DEN vs. MIA,W,240,40.0,...,5.0,4.0,8.0,15.0,93.0,-11,1,Playoffs,1-0,0-1
58141,42022,1610612743,DEN,Denver Nuggets,42200402,2023-06-04 00:00:00,DEN vs. MIA,L,240,39.0,...,5.0,4.0,11.0,22.0,111.0,3,1,Playoffs,1-1,1-1
58142,42022,1610612748,MIA,Miami Heat,42200403,2023-06-07 00:00:00,MIA vs. DEN,L,240,34.0,...,3.0,5.0,14.0,18.0,109.0,15,1,Playoffs,1-2,2-1
58143,42022,1610612748,MIA,Miami Heat,42200404,2023-06-09 00:00:00,MIA vs. DEN,L,240,35.0,...,11.0,7.0,8.0,18.0,108.0,13,1,Playoffs,1-3,3-1
58144,42022,1610612743,DEN,Denver Nuggets,42200405,2023-06-12 00:00:00,DEN vs. MIA,W,240,38.0,...,9.0,7.0,8.0,21.0,89.0,-5,1,Playoffs,4-1,1-4
58145,32022,1610616834,LBN,Team LeBron,32200001,2023-02-19 00:00:00,LBN vs. GNS,L,221,79.0,...,8.0,1.0,12.0,2.0,184.0,9,1,All-Star,0-1,1-0
58146,32022,1610616834,LBN,Team LeBron,32200001,2023-02-19 00:00:00,LBN vs. GNS,L,221,79.0,...,8.0,1.0,12.0,2.0,184.0,9,1,All-Star,0-1,1-0
58147,32022,1610616834,LBN,Team LeBron,32200001,2023-02-19 00:00:00,LBN vs. GNS,L,221,79.0,...,8.0,1.0,12.0,2.0,184.0,9,1,All Star,0-1,1-0
58148,32022,1610616834,LBN,Team LeBron,32200001,2023-02-19 00:00:00,LBN vs. GNS,L,221,79.0,...,8.0,1.0,12.0,2.0,184.0,9,1,All Star,0-1,1-0


In [44]:
officials_subset = officials[['game_id', 'official_id']]
merged_game_info = pd.merge(merged_game_info, officials_subset, on= 'game_id', how= 'inner')

### I only want to use rows that represent regular season games

In [45]:
regular_season_games = merged_game_info.loc[merged_game_info['season_type'] == 'Regular Season']
regular_season_games.tail(10)

Unnamed: 0,season_id,team_id_home,team_abbreviation_home,team_name_home,game_id,game_date,matchup_home,wl_home,min,fgm_home,...,blk_away,tov_away,pf_away,pts_away,plus_minus_away,video_available_away,season_type,team_wins_losses_home,team_wins_losses_away,official_id
70662,22022,1610612751,BKN,Brooklyn Nets,22201217,2023-04-09 00:00:00,BKN vs. PHI,L,240,35.0,...,5.0,13.0,23.0,134.0,29,1,Regular Season,54-28,45-37,1627541
70663,22022,1610612741,CHI,Chicago Bulls,22201223,2023-04-09 00:00:00,CHI vs. DET,W,240,40.0,...,3.0,25.0,16.0,81.0,-22,1,Regular Season,40-42,17-65,1193
70664,22022,1610612741,CHI,Chicago Bulls,22201223,2023-04-09 00:00:00,CHI vs. DET,W,240,40.0,...,3.0,25.0,16.0,81.0,-22,1,Regular Season,40-42,17-65,1201
70665,22022,1610612741,CHI,Chicago Bulls,22201223,2023-04-09 00:00:00,CHI vs. DET,W,240,40.0,...,3.0,25.0,16.0,81.0,-22,1,Regular Season,40-42,17-65,1830
70666,22022,1610612761,TOR,Toronto Raptors,22201221,2023-04-09 00:00:00,TOR vs. MIL,W,240,48.0,...,2.0,14.0,13.0,105.0,-16,1,Regular Season,58-24,41-41,1179
70667,22022,1610612761,TOR,Toronto Raptors,22201221,2023-04-09 00:00:00,TOR vs. MIL,W,240,48.0,...,2.0,14.0,13.0,105.0,-16,1,Regular Season,58-24,41-41,101284
70668,22022,1610612761,TOR,Toronto Raptors,22201221,2023-04-09 00:00:00,TOR vs. MIL,W,240,48.0,...,2.0,14.0,13.0,105.0,-16,1,Regular Season,58-24,41-41,1626302
70669,22022,1610612739,CLE,Cleveland Cavaliers,22201218,2023-04-09 00:00:00,CLE vs. CHA,L,240,35.0,...,5.0,19.0,14.0,106.0,11,1,Regular Season,51-31,27-55,2005
70670,22022,1610612739,CLE,Cleveland Cavaliers,22201218,2023-04-09 00:00:00,CLE vs. CHA,L,240,35.0,...,5.0,19.0,14.0,106.0,11,1,Regular Season,51-31,27-55,201246
70671,22022,1610612739,CLE,Cleveland Cavaliers,22201218,2023-04-09 00:00:00,CLE vs. CHA,L,240,35.0,...,5.0,19.0,14.0,106.0,11,1,Regular Season,51-31,27-55,202035


In [46]:
regular_season_games.columns

Index(['season_id', 'team_id_home', 'team_abbreviation_home', 'team_name_home',
       'game_id', 'game_date', 'matchup_home', 'wl_home', 'min', 'fgm_home',
       'fga_home', 'fg_pct_home', 'fg3m_home', 'fg3a_home', 'fg3_pct_home',
       'ftm_home', 'fta_home', 'ft_pct_home', 'oreb_home', 'dreb_home',
       'reb_home', 'ast_home', 'stl_home', 'blk_home', 'tov_home', 'pf_home',
       'pts_home', 'plus_minus_home', 'video_available_home', 'team_id_away',
       'team_abbreviation_away', 'team_name_away', 'matchup_away', 'wl_away',
       'fgm_away', 'fga_away', 'fg_pct_away', 'fg3m_away', 'fg3a_away',
       'fg3_pct_away', 'ftm_away', 'fta_away', 'ft_pct_away', 'oreb_away',
       'dreb_away', 'reb_away', 'ast_away', 'stl_away', 'blk_away', 'tov_away',
       'pf_away', 'pts_away', 'plus_minus_away', 'video_available_away',
       'season_type', 'team_wins_losses_home', 'team_wins_losses_away',
       'official_id'],
      dtype='object')

### Here I am removing some of the unnecessary columns in regular_season_games

In [47]:
cols_to_display = ['season_id', 'team_id_home', 'team_name_home',
       'game_id', 'game_date', 'wl_home', 'fgm_home',
       'fga_home', 'fg_pct_home', 'fg3m_home', 'fg3a_home', 'fg3_pct_home',
       'ftm_home', 'fta_home', 'ft_pct_home', 'oreb_home', 'dreb_home',
       'reb_home', 'ast_home', 'stl_home', 'blk_home', 'tov_home', 'pf_home',
       'pts_home', 'plus_minus_home', 'team_id_away',
       'team_name_away', 'matchup_away', 'wl_away',
       'fgm_away', 'fga_away', 'fg_pct_away', 'fg3m_away', 'fg3a_away',
       'fg3_pct_away', 'ftm_away', 'fta_away', 'ft_pct_away', 'oreb_away',
       'dreb_away', 'reb_away', 'ast_away', 'stl_away', 'blk_away', 'tov_away',
       'pf_away', 'pts_away', 'plus_minus_away',
       'team_wins_losses_home', 'team_wins_losses_away',
       'official_id']
regular_season_games = regular_season_games[cols_to_display]
regular_season_games.tail(10)
                

Unnamed: 0,season_id,team_id_home,team_name_home,game_id,game_date,wl_home,fgm_home,fga_home,fg_pct_home,fg3m_home,...,ast_away,stl_away,blk_away,tov_away,pf_away,pts_away,plus_minus_away,team_wins_losses_home,team_wins_losses_away,official_id
70662,22022,1610612751,Brooklyn Nets,22201217,2023-04-09 00:00:00,L,35.0,83.0,0.422,12.0,...,31.0,10.0,5.0,13.0,23.0,134.0,29,54-28,45-37,1627541
70663,22022,1610612741,Chicago Bulls,22201223,2023-04-09 00:00:00,W,40.0,95.0,0.421,6.0,...,20.0,1.0,3.0,25.0,16.0,81.0,-22,40-42,17-65,1193
70664,22022,1610612741,Chicago Bulls,22201223,2023-04-09 00:00:00,W,40.0,95.0,0.421,6.0,...,20.0,1.0,3.0,25.0,16.0,81.0,-22,40-42,17-65,1201
70665,22022,1610612741,Chicago Bulls,22201223,2023-04-09 00:00:00,W,40.0,95.0,0.421,6.0,...,20.0,1.0,3.0,25.0,16.0,81.0,-22,40-42,17-65,1830
70666,22022,1610612761,Toronto Raptors,22201221,2023-04-09 00:00:00,W,48.0,95.0,0.505,11.0,...,28.0,5.0,2.0,14.0,13.0,105.0,-16,58-24,41-41,1179
70667,22022,1610612761,Toronto Raptors,22201221,2023-04-09 00:00:00,W,48.0,95.0,0.505,11.0,...,28.0,5.0,2.0,14.0,13.0,105.0,-16,58-24,41-41,101284
70668,22022,1610612761,Toronto Raptors,22201221,2023-04-09 00:00:00,W,48.0,95.0,0.505,11.0,...,28.0,5.0,2.0,14.0,13.0,105.0,-16,58-24,41-41,1626302
70669,22022,1610612739,Cleveland Cavaliers,22201218,2023-04-09 00:00:00,L,35.0,89.0,0.393,10.0,...,23.0,8.0,5.0,19.0,14.0,106.0,11,51-31,27-55,2005
70670,22022,1610612739,Cleveland Cavaliers,22201218,2023-04-09 00:00:00,L,35.0,89.0,0.393,10.0,...,23.0,8.0,5.0,19.0,14.0,106.0,11,51-31,27-55,201246
70671,22022,1610612739,Cleveland Cavaliers,22201218,2023-04-09 00:00:00,L,35.0,89.0,0.393,10.0,...,23.0,8.0,5.0,19.0,14.0,106.0,11,51-31,27-55,202035


### In the columns wl_home and wl_away, I want to change the values 'W' and 'L' to 1 and 0. This is because changing the categorical variables into binary values is more compatabile during data analysis.

In [48]:
print(regular_season_games['wl_home'].unique())

['L' 'W']


In [49]:
regular_season_games['wl_home'] = regular_season_games['wl_home'].str.strip()

In [50]:
regular_season_games['wl_home'] = regular_season_games['wl_home'].apply(lambda x: 1 if x == 'W' else 0)
regular_season_games['wl_away'] = regular_season_games['wl_away'].apply(lambda x: 1 if x == 'W' else 0)

### Here I am creating new columns by applying arithmetic between values from two existing columns. I create new columns to display total numbers of games played and win percentage for both the home and away teams. Before calculating any totals or averages, I first replace NaN values with empty strings, and split string values and convert them into numeric values. 

In [51]:
# Replace empty strings with NaN if not already done
regular_season_games[['team_wins_losses_home', 'team_wins_losses_away']] = regular_season_games[['team_wins_losses_home', 'team_wins_losses_away']].replace('', np.nan)

# Split the 'team_wins_losses_home' and 'team_wins_losses_away' columns by '-'
regular_season_games[['home_wins', 'home_losses']] = regular_season_games['team_wins_losses_home'].str.split('-', expand=True)
regular_season_games[['away_wins', 'away_losses']] = regular_season_games['team_wins_losses_away'].str.split('-', expand=True)

# Convert the columns to numeric values 
regular_season_games[['home_wins', 'home_losses']] = regular_season_games[['home_wins', 'home_losses']].apply(pd.to_numeric, errors='coerce')
regular_season_games[['away_wins', 'away_losses']] = regular_season_games[['away_wins', 'away_losses']].apply(pd.to_numeric, errors='coerce')

# Now calculate total games played for home and away teams
regular_season_games['home_total_games'] = regular_season_games['home_wins'] + regular_season_games['home_losses']
regular_season_games['away_total_games'] = regular_season_games['away_wins'] + regular_season_games['away_losses']

# Calculate win percentage for home and away teams
regular_season_games['home_win_pct'] = regular_season_games['home_wins'] / regular_season_games['home_total_games']
regular_season_games['away_win_pct'] = regular_season_games['away_wins'] / regular_season_games['away_total_games']

# Handle possible NaN or division by zero by replacing with 0 
regular_season_games['home_win_pct'] = regular_season_games['home_win_pct'].fillna(0)
regular_season_games['away_win_pct'] = regular_season_games['away_win_pct'].fillna(0)

### Here I am creating the target variable, game_winner, and adding it to the dataset

In [52]:
# Create a target variable where 1 indicates the home team wins, 0 indicates the away team wins
regular_season_games['game_winner'] = (regular_season_games['wl_home'] == 1).astype(int)

### Here I add a new column to regular_season_games that outputs the rolling win percentages for the home and away teams from the previous 10 games

In [53]:
regular_season_games['rolling_win_pct_home'] = regular_season_games['home_wins'].rolling(window=10).sum() / regular_season_games['home_total_games'].rolling(window=10).sum()
regular_season_games['rolling_win_pct_away'] = regular_season_games['away_wins'].rolling(window=10).sum() / regular_season_games['away_total_games'].rolling(window=10).sum()

Unnamed: 0,season_id,team_id_home,team_name_home,game_id,game_date,wl_home,fgm_home,fga_home,fg_pct_home,fg3m_home,...,official_id,away_wins,away_losses,home_total_games,away_total_games,home_win_pct,away_win_pct,game_winner,rolling_win_pct_home,rolling_win_pct_away
0,21996,1610612765,Detroit Pistons,29600059,1996-11-08 00:00:00,0,32.0,75.0,0.427,8.0,...,1140,,,,,0.000000,0.000000,0,,
1,21996,1610612765,Detroit Pistons,29600059,1996-11-08 00:00:00,0,32.0,75.0,0.427,8.0,...,1165,,,,,0.000000,0.000000,0,,
2,21996,1610612765,Detroit Pistons,29600059,1996-11-08 00:00:00,0,32.0,75.0,0.427,8.0,...,1153,,,,,0.000000,0.000000,0,,
3,21996,1610612752,New York Knicks,29600114,1996-11-16 00:00:00,1,31.0,64.0,0.484,4.0,...,1147,,,,,0.000000,0.000000,1,,
4,21996,1610612752,New York Knicks,29600114,1996-11-16 00:00:00,1,31.0,64.0,0.484,4.0,...,1142,,,,,0.000000,0.000000,1,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70667,22022,1610612761,Toronto Raptors,22201221,2023-04-09 00:00:00,1,48.0,95.0,0.505,11.0,...,101284,41.0,41.0,82.0,82.0,0.707317,0.500000,1,0.570732,0.441463
70668,22022,1610612761,Toronto Raptors,22201221,2023-04-09 00:00:00,1,48.0,95.0,0.505,11.0,...,1626302,41.0,41.0,82.0,82.0,0.707317,0.500000,1,0.598780,0.434146
70669,22022,1610612739,Cleveland Cavaliers,22201218,2023-04-09 00:00:00,0,35.0,89.0,0.393,10.0,...,2005,27.0,55.0,82.0,82.0,0.621951,0.329268,0,0.618293,0.409756
70670,22022,1610612739,Cleveland Cavaliers,22201218,2023-04-09 00:00:00,0,35.0,89.0,0.393,10.0,...,201246,27.0,55.0,82.0,82.0,0.621951,0.329268,0,0.614634,0.387805


### This was due to running the model and producing suspiciously high accuracies, I removed some columns that could potentially give away the answer

In [None]:
regular_season_games.drop(columns=['home_wins', 'home_losses', 'team_wins_losses_home', 'team_wins_losses_away'])

### Below, I create rolling averages for all of the other stats in regular_season_games, and append them to rolling_regular_season_stats. I created these columns because I had realized that when I was preparing the data for games_to_predict, there was no way to fill in the columns for the existing stats columns, since the games had not been played yet. Therefore, I created the rolling averages so that I can run the model with averages of these stats from the previous 10 games instead.

In [54]:
# Define the stats for which to calculate rolling averages
stats_to_average = [
    'fg_pct_home', 'fg_pct_away', 'fg3_pct_home', 'fg3_pct_away',
    'ft_pct_home', 'ft_pct_away', 'oreb_home', 'oreb_away', 'dreb_home', 'dreb_away', 'reb_home', 'reb_away',
    'ast_home', 'ast_away', 'stl_home', 'stl_away', 'blk_home', 'blk_away', 'tov_home', 'tov_away', 'pf_home', 'pf_away'
]

# Function to calculate rolling averages
def calculate_rolling_averages(df, stats, window):
    for stat in stats:
        if 'pct' in stat:  # If the stat is a percentage
            df[stat] = df[stat].clip(0, 1)  # Clip percentages to be between 0 and 1
        # Check for NaN values in the stat
        df[stat] = df[stat].fillna(0)

        # Calculate the rolling average
        df[f'rolling_{stat}'] = df[stat].rolling(window, min_periods=1).mean()

    return df

rolling_regular_season_stats = regular_season_games.copy()

calculate_rolling_averages(rolling_regular_season_stats, stats_to_average, 10)

Unnamed: 0,season_id,team_id_home,team_name_home,game_id,game_date,wl_home,fgm_home,fga_home,fg_pct_home,fg3m_home,...,rolling_ast_home,rolling_ast_away,rolling_stl_home,rolling_stl_away,rolling_blk_home,rolling_blk_away,rolling_tov_home,rolling_tov_away,rolling_pf_home,rolling_pf_away
0,21996,1610612765,Detroit Pistons,29600059,1996-11-08 00:00:00,0,32.0,75.0,0.427,8.0,...,9.0,25.00,1.0,8.00,2.00,2.00,14.00,11.00,18.0,17.0
1,21996,1610612765,Detroit Pistons,29600059,1996-11-08 00:00:00,0,32.0,75.0,0.427,8.0,...,9.0,25.00,1.0,8.00,2.00,2.00,14.00,11.00,18.0,17.0
2,21996,1610612765,Detroit Pistons,29600059,1996-11-08 00:00:00,0,32.0,75.0,0.427,8.0,...,9.0,25.00,1.0,8.00,2.00,2.00,14.00,11.00,18.0,17.0
3,21996,1610612752,New York Knicks,29600114,1996-11-16 00:00:00,1,31.0,64.0,0.484,4.0,...,11.0,22.25,3.5,8.25,3.25,1.75,17.25,13.75,19.5,19.0
4,21996,1610612752,New York Knicks,29600114,1996-11-16 00:00:00,1,31.0,64.0,0.484,4.0,...,12.2,20.60,5.0,8.40,4.00,1.60,19.20,15.40,20.4,20.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70667,22022,1610612761,Toronto Raptors,22201221,2023-04-09 00:00:00,1,48.0,95.0,0.505,11.0,...,24.5,27.50,9.5,5.90,5.90,4.60,11.70,17.20,20.0,19.1
70668,22022,1610612761,Toronto Raptors,22201221,2023-04-09 00:00:00,1,48.0,95.0,0.505,11.0,...,24.5,27.00,9.5,5.60,5.30,3.90,11.10,17.10,19.5,18.0
70669,22022,1610612739,Cleveland Cavaliers,22201218,2023-04-09 00:00:00,0,35.0,89.0,0.393,10.0,...,24.3,26.00,9.6,5.60,5.00,3.50,11.20,17.50,19.5,17.0
70670,22022,1610612739,Cleveland Cavaliers,22201218,2023-04-09 00:00:00,0,35.0,89.0,0.393,10.0,...,24.8,25.20,9.9,5.40,4.90,3.50,10.90,18.10,19.7,16.1


In [55]:
regular_season_games.columns

Index(['season_id', 'team_id_home', 'team_name_home', 'game_id', 'game_date',
       'wl_home', 'fgm_home', 'fga_home', 'fg_pct_home', 'fg3m_home',
       'fg3a_home', 'fg3_pct_home', 'ftm_home', 'fta_home', 'ft_pct_home',
       'oreb_home', 'dreb_home', 'reb_home', 'ast_home', 'stl_home',
       'blk_home', 'tov_home', 'pf_home', 'pts_home', 'plus_minus_home',
       'team_id_away', 'team_name_away', 'matchup_away', 'wl_away', 'fgm_away',
       'fga_away', 'fg_pct_away', 'fg3m_away', 'fg3a_away', 'fg3_pct_away',
       'ftm_away', 'fta_away', 'ft_pct_away', 'oreb_away', 'dreb_away',
       'reb_away', 'ast_away', 'stl_away', 'blk_away', 'tov_away', 'pf_away',
       'pts_away', 'plus_minus_away', 'team_wins_losses_home',
       'team_wins_losses_away', 'official_id', 'home_wins', 'home_losses',
       'away_wins', 'away_losses', 'home_total_games', 'away_total_games',
       'home_win_pct', 'away_win_pct', 'game_winner', 'rolling_win_pct_home',
       'rolling_win_pct_away'],
      

### Replace any NaN values in the numeric columns with median values to improve data quality

In [56]:
# Identify numeric columns in your DataFrame
numeric_columns = regular_season_games.select_dtypes(include=['float64', 'int64']).columns

# Fill missing values only for numeric columns using the median
regular_season_games[numeric_columns] = regular_season_games[numeric_columns].fillna(regular_season_games[numeric_columns].median())

### Here I one-hot encode the team_id_home and team_id_away columns in order to convert the values from categorical variables to values that are compatable with the model. One-hot encoding can now identify individual teams and recognize patterns specific to each team.

In [57]:
# One-hot encode team IDs for home and away teams
regular_season_games = pd.get_dummies(regular_season_games, columns=['team_id_home', 'team_id_away'])

### Feature scaling is an important step especially when using a logistic regression model. Here I use the StandardScaler so that all of the features are scaled the same, which helps the model learn from all features equally.

In [58]:
# List of features to scale
features_to_scale = ['fg_pct_home', 'fg_pct_away', 'fg3_pct_home', 'fg3_pct_away', 
                     'ft_pct_home', 'ft_pct_away', 'oreb_home', 'dreb_home', 'reb_home',
                     'ast_home', 'stl_home', 'blk_home', 'tov_home', 'pf_home', 'pts_home',
                     'plus_minus_home', 'oreb_away', 'dreb_away', 'reb_away', 'ast_away', 
                     'stl_away', 'blk_away', 'tov_away', 'pf_away', 'pts_away', 'plus_minus_away',
                     'rolling_win_pct_home', 'rolling_win_pct_away']

# Apply scaling
scaler = StandardScaler()
regular_season_games[features_to_scale] = scaler.fit_transform(regular_season_games[features_to_scale])

### While running the model the first few times, I kept getting accuracies of high 99% and 100%, which is suspiciously high. So I went back and realized that many of the columns were giving away the game winners. So here I drop all of those columns.

In [59]:
# Drop unnecessary columns
regular_season_games.drop(columns=['game_id', 'game_date', 'team_name_home', 'team_name_away', 'matchup_away', 'team_wins_losses_home', 'team_wins_losses_away', 
                                  'fgm_home', 'fga_home', 'fg3m_home', 'fg3a_home', 
                                  'ftm_home', 'fta_home', 'pts_home', 'fgm_away', 'fga_away',
                                  'fg3m_away', 'fg3a_away', 'ftm_away', 'fta_away', 'pts_away',
                                  'official_id', 'home_wins', 'home_losses', 'away_wins', 'away_losses',
                                  'home_total_games', 'away_total_games', 'home_win_pct', 'away_win_pct', 
                                  'wl_home', 'wl_away', 'plus_minus_home', 'plus_minus_away'], inplace=True)

# Creating My Logistic Regression Model

## First, I wanted to run my model by splitting the dataset into training and testing sets, and evaluating the accuracy based on the model's results compared to the testing set. 

In [61]:
# Define features (X) and target (y)
X = regular_season_games.drop(columns=['game_winner']) 
y = regular_season_games['game_winner']  # target column

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    
# Display the size of the splits
print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")

Training set size: 49801
Testing set size: 12451


In [62]:
X_train.columns

Index(['season_id', 'fg_pct_home', 'fg3_pct_home', 'ft_pct_home', 'oreb_home',
       'dreb_home', 'reb_home', 'ast_home', 'stl_home', 'blk_home', 'tov_home',
       'pf_home', 'fg_pct_away', 'fg3_pct_away', 'ft_pct_away', 'oreb_away',
       'dreb_away', 'reb_away', 'ast_away', 'stl_away', 'blk_away', 'tov_away',
       'pf_away', 'rolling_win_pct_home', 'rolling_win_pct_away',
       'team_id_home_1610612737', 'team_id_home_1610612738',
       'team_id_home_1610612739', 'team_id_home_1610612740',
       'team_id_home_1610612741', 'team_id_home_1610612742',
       'team_id_home_1610612743', 'team_id_home_1610612744',
       'team_id_home_1610612745', 'team_id_home_1610612746',
       'team_id_home_1610612747', 'team_id_home_1610612748',
       'team_id_home_1610612749', 'team_id_home_1610612750',
       'team_id_home_1610612751', 'team_id_home_1610612752',
       'team_id_home_1610612753', 'team_id_home_1610612754',
       'team_id_home_1610612755', 'team_id_home_1610612756',
       '

## I initialize, fit, and run the logistic regression model, and come up with an accuracy of about 94%.

In [63]:
# Initialize the model
model = LogisticRegression(max_iter=500)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy}")

Model Accuracy: 0.9375150590314031


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


## I ran this code below when I was getting accuracy rates of 99 and 100%. I realized that some of the columns were giving away the answer to the game winner.

In [64]:
# Ensure all correlations are displayed
pd.set_option('display.max_rows', None)

# Add y_train to a copy of X_train for correlation analysis
correlation_data = X_train.copy()
correlation_data['game_winner'] = y_train

# Compute correlations
correlations = correlation_data.corr()['game_winner']
print(correlations.sort_values(ascending=False))

# Reset the option afterward if needed
pd.reset_option('display.max_rows')

game_winner                1.000000
fg_pct_home                0.431996
dreb_home                  0.321845
ast_home                   0.308401
fg3_pct_home               0.306448
reb_home                   0.256381
blk_home                   0.160417
stl_home                   0.143710
tov_away                   0.112550
pf_away                    0.102873
ft_pct_home                0.094029
oreb_away                  0.057138
team_id_home_1610612759    0.052569
team_id_away_1610612750    0.037103
team_id_home_1610612762    0.033292
team_id_away_1610612752    0.032898
team_id_away_1610612766    0.032633
team_id_home_1610612743    0.030554
team_id_home_1610612742    0.029357
team_id_home_1610612744    0.027253
team_id_home_1610612748    0.026865
team_id_away_1610612764    0.025267
team_id_away_1610612758    0.024644
team_id_away_1610612737    0.022981
team_id_away_1610612753    0.018587
team_id_away_1610612740    0.017227
team_id_away_1610612751    0.016754
team_id_away_1610612765    0

## I wanted to toy around with my model, so I added Recursive Feature Elimination to remove the 'least important features' and run the model based on the selected features.

In [65]:
# Initialize the logistic regression model
model = LogisticRegression(max_iter=500)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Store original feature names before scaling
feature_names = X_train.columns

# Initialize RFE and select the top 'n' 
rfe = RFE(estimator=model, n_features_to_select=10)  # Select top 10 features

# Fit RFE to the scaled training data
X_train_rfe = rfe.fit_transform(X_train_scaled, y_train)

# Get selected feature names
selected_features = feature_names[rfe.support_]
print(f"Selected features: {selected_features}")

# Train the model using only the selected features
model.fit(X_train_rfe, y_train)

# Use the selected features for prediction
X_test_rfe = rfe.transform(X_test_scaled)
y_pred = model.predict(X_test_rfe)

# Evaluate the model
print(f"Accuracy with selected features: {accuracy_score(y_test, y_pred)}")

Selected features: Index(['fg_pct_home', 'fg3_pct_home', 'reb_home', 'tov_home', 'pf_home',
       'fg_pct_away', 'fg3_pct_away', 'reb_away', 'tov_away', 'pf_away'],
      dtype='object')
Accuracy with selected features: 0.9197654806842824


## However, I regularly got accuracy rates lower than the logistic regression model without RFE. 

# Now that I had tested the logistic regression model on the testing set, I wanted to try and use this model to predict game outcomes for future NBA games. 

In [68]:
rolling_regular_season_stats.columns

Index(['season_id', 'team_id_home', 'team_name_home', 'game_id', 'game_date',
       'wl_home', 'fgm_home', 'fga_home', 'fg_pct_home', 'fg3m_home',
       'fg3a_home', 'fg3_pct_home', 'ftm_home', 'fta_home', 'ft_pct_home',
       'oreb_home', 'dreb_home', 'reb_home', 'ast_home', 'stl_home',
       'blk_home', 'tov_home', 'pf_home', 'pts_home', 'plus_minus_home',
       'team_id_away', 'team_name_away', 'matchup_away', 'wl_away', 'fgm_away',
       'fga_away', 'fg_pct_away', 'fg3m_away', 'fg3a_away', 'fg3_pct_away',
       'ftm_away', 'fta_away', 'ft_pct_away', 'oreb_away', 'dreb_away',
       'reb_away', 'ast_away', 'stl_away', 'blk_away', 'tov_away', 'pf_away',
       'pts_away', 'plus_minus_away', 'team_wins_losses_home',
       'team_wins_losses_away', 'official_id', 'home_wins', 'home_losses',
       'away_wins', 'away_losses', 'home_total_games', 'away_total_games',
       'home_win_pct', 'away_win_pct', 'game_winner', 'rolling_win_pct_home',
       'rolling_win_pct_away', 'rollin

## I realized that there was no way I could fill in some of the statistical columns when preparing the rows for my future games, since the games had not been played yet. So, I had to drop these columns and replace them with rolling averages (averages of the individual stats for the home and away teams based on the previous 10 games)

In [69]:
rolling_regular_season_stats.drop(columns= ['game_id', 'game_date',
       'wl_home', 'fgm_home', 'fga_home', 'fg_pct_home', 'fg3m_home',
       'fg3a_home', 'fg3_pct_home', 'ftm_home', 'fta_home', 'ft_pct_home',
       'oreb_home', 'dreb_home', 'reb_home', 'ast_home', 'stl_home',
       'blk_home', 'tov_home', 'pf_home', 'pts_home', 'plus_minus_home',
       'matchup_away', 'wl_away', 'fgm_away',
       'fga_away', 'fg_pct_away', 'fg3m_away', 'fg3a_away', 'fg3_pct_away',
       'ftm_away', 'fta_away', 'ft_pct_away', 'oreb_away', 'dreb_away',
       'reb_away', 'ast_away', 'stl_away', 'blk_away', 'tov_away', 'pf_away',
       'pts_away', 'plus_minus_away', 'team_wins_losses_home',
       'team_wins_losses_away', 'official_id', 'home_wins', 'home_losses',
       'away_wins', 'away_losses', 'home_total_games', 'away_total_games',
       'home_win_pct', 'away_win_pct'], inplace=True)

## I wanted to make sure that I was preparing the data similarly to how I prepared the data for regular_season_games, such as one-hot encoding the team id columns, scaling the features, and replacing NaN values in the numeric columns with median values

In [70]:
# Identify numeric columns in your DataFrame
numeric_columns =rolling_regular_season_stats.select_dtypes(include=['float64', 'int64']).columns

# Fill missing values only for numeric columns using the median
rolling_regular_season_stats[numeric_columns] = rolling_regular_season_stats[numeric_columns].fillna(rolling_regular_season_stats[numeric_columns].median())

## While using the StandardScaler, I saw that the scaled values made no sense (There were percentages below 0% and above 100%, and there were negative values under columns where it is impossible to have a negative average). So I used the MinMax Scaler (scale between 0 and 1), and the scaled values made more sense when looking through the table. Since the accuracies did not change between these two scalers, I decided to use MinMaxScaler so that the table could be more readable/understandable. 

### I tested other scalers like PowerTransformer and RobustScaler

In [71]:
# Apply scaling
scaler = MinMaxScaler()

items_to_scale = ['rolling_win_pct_home',
       'rolling_win_pct_away', 'rolling_fg_pct_home', 'rolling_fg_pct_away',
       'rolling_fg3_pct_home', 'rolling_fg3_pct_away', 'rolling_ft_pct_home',
       'rolling_ft_pct_away', 'rolling_oreb_home', 'rolling_oreb_away',
       'rolling_dreb_home', 'rolling_dreb_away', 'rolling_reb_home',
       'rolling_reb_away', 'rolling_ast_home', 'rolling_ast_away',
       'rolling_stl_home', 'rolling_stl_away', 'rolling_blk_home',
       'rolling_blk_away', 'rolling_tov_home', 'rolling_tov_away',
       'rolling_pf_home', 'rolling_pf_away']

rolling_regular_season_stats[items_to_scale] = scaler.fit_transform(rolling_regular_season_stats[items_to_scale])

In [72]:
# One-hot encode team IDs for home and away teams
rolling_regular_season_stats = pd.get_dummies(rolling_regular_season_stats, columns=['team_id_home', 'team_id_away'])

In [73]:
rolling_regular_season_stats

Unnamed: 0,season_id,team_name_home,team_name_away,game_winner,rolling_win_pct_home,rolling_win_pct_away,rolling_fg_pct_home,rolling_fg_pct_away,rolling_fg3_pct_home,rolling_fg3_pct_away,...,team_id_away_1610612757,team_id_away_1610612758,team_id_away_1610612759,team_id_away_1610612760,team_id_away_1610612761,team_id_away_1610612762,team_id_away_1610612763,team_id_away_1610612764,team_id_away_1610612765,team_id_away_1610612766
0,21996,Detroit Pistons,Chicago Bulls,0,0.501661,0.500000,0.378637,0.648094,0.799248,0.471922,...,False,False,False,False,False,False,False,False,False,False
1,21996,Detroit Pistons,Chicago Bulls,0,0.501661,0.500000,0.378637,0.648094,0.799248,0.471922,...,False,False,False,False,False,False,False,False,False,False
2,21996,Detroit Pistons,Chicago Bulls,0,0.501661,0.500000,0.378637,0.648094,0.799248,0.471922,...,False,False,False,False,False,False,False,False,False,False
3,21996,New York Knicks,Minnesota Timberwolves,1,0.501661,0.500000,0.433193,0.519271,0.693515,0.471922,...,False,False,False,False,False,False,False,False,False,False
4,21996,New York Knicks,Minnesota Timberwolves,1,0.501661,0.500000,0.465926,0.441977,0.630075,0.471922,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70667,22022,Toronto Raptors,Milwaukee Bucks,1,0.570732,0.441463,0.467075,0.522413,0.387782,0.507470,...,False,False,False,False,False,False,False,False,False,False
70668,22022,Toronto Raptors,Milwaukee Bucks,1,0.598780,0.434146,0.476263,0.499791,0.379887,0.472093,...,False,False,False,False,False,False,False,False,False,False
70669,22022,Cleveland Cavaliers,Charlotte Hornets,0,0.618293,0.409756,0.442573,0.474654,0.360150,0.408381,...,False,False,False,False,False,False,False,False,False,True
70670,22022,Cleveland Cavaliers,Charlotte Hornets,0,0.614634,0.387805,0.431470,0.446167,0.340414,0.365447,...,False,False,False,False,False,False,False,False,False,True


## Before running the model on future games, I first wanted to see the accuracy for the model, since the features now only include rolling averages for the stats instead of the actual stats from each individual game. 

In [74]:
# Define features (X) and target (y)
X = rolling_regular_season_stats.drop(columns=['game_winner', 'team_name_home', 'team_name_away']) 
y = rolling_regular_season_stats['game_winner']  # target column

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    
# Display the size of the splits
print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")

Training set size: 49801
Testing set size: 12451


In [75]:
# Initialize the model
model = LogisticRegression(max_iter=500)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy}")

Model Accuracy: 0.6561721950044174


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


## As you can see, the rolling averages produce less accurate results compared to the real stats for each individual game. The accuracy produced with the new features is around 65%.

## Below, I reran the logistic regression model with RFE

In [76]:
# Initialize the logistic regression model
model = LogisticRegression(max_iter=500)

# Scale the features
scaler = MinMaxScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Store original feature names before scaling
feature_names = X_train.columns

# Initialize RFE and select the top 'n' features (e.g., 10 features)
rfe = RFE(estimator=model, n_features_to_select=10)  # Select top 10 features

# Fit RFE to the scaled training data
X_train_rfe = rfe.fit_transform(X_train_scaled, y_train)

# Get selected feature names
selected_features = feature_names[rfe.support_]
print(f"Selected features: {selected_features}")

# Train the model using only the selected features
model.fit(X_train_rfe, y_train)

# Use the selected features for prediction
X_test_rfe = rfe.transform(X_test_scaled)
y_pred = model.predict(X_test_rfe)

# Evaluate the model
print(f"Accuracy with selected features: {accuracy_score(y_test, y_pred)}")

Selected features: Index(['rolling_fg_pct_home', 'rolling_fg_pct_away', 'rolling_fg3_pct_home',
       'rolling_fg3_pct_away', 'rolling_dreb_away', 'rolling_reb_home',
       'rolling_tov_home', 'rolling_tov_away', 'rolling_pf_home',
       'rolling_pf_away'],
      dtype='object')
Accuracy with selected features: 0.639948598506144


In [77]:
X_train

Unnamed: 0,season_id,rolling_win_pct_home,rolling_win_pct_away,rolling_fg_pct_home,rolling_fg_pct_away,rolling_fg3_pct_home,rolling_fg3_pct_away,rolling_ft_pct_home,rolling_ft_pct_away,rolling_oreb_home,...,team_id_away_1610612757,team_id_away_1610612758,team_id_away_1610612759,team_id_away_1610612760,team_id_away_1610612761,team_id_away_1610612762,team_id_away_1610612763,team_id_away_1610612764,team_id_away_1610612765,team_id_away_1610612766
5635,22004,0.550459,0.400943,0.251914,0.458316,0.358271,0.699124,0.410320,0.730720,0.557214,...,False,False,False,False,False,False,False,False,False,False
42826,22015,0.518657,0.475655,0.475115,0.299539,0.351316,0.314958,0.405137,0.501285,0.338308,...,False,False,False,False,False,False,False,False,False,False
22895,22009,0.427273,0.478632,0.521057,0.470046,0.405075,0.371115,0.617846,0.364396,0.398010,...,False,False,False,False,False,False,False,False,False,False
17029,22007,0.661896,0.501792,0.346478,0.275241,0.573684,0.471406,0.579991,0.628963,0.427861,...,False,False,False,False,False,False,False,False,False,False
54664,22018,0.441860,0.537217,0.520674,0.659824,0.455263,0.454577,0.667192,0.667095,0.338308,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27298,22010,0.451883,0.506438,0.444487,0.566401,0.334774,0.563455,0.430149,0.742931,0.383085,...,False,False,False,False,False,False,False,False,False,False
59078,22019,0.558282,0.296371,0.422665,0.587348,0.522556,0.554697,0.886886,0.679520,0.383085,...,False,False,False,False,False,False,False,False,False,False
61265,22020,0.440299,0.498099,0.558959,0.592375,0.357895,0.473982,0.596665,0.757498,0.114428,...,False,False,False,False,False,False,False,False,False,False
4179,22004,0.569767,0.480519,0.474732,0.511940,0.223120,0.255023,0.592384,0.486718,0.472637,...,False,False,False,False,True,False,False,False,False,False


## The first step in predicting future NBA games, I first had to prepare the data for each game. Below, I create a dataset called games_to_predict that contains 5 row, each row representing 1 of the 5 NBA games from Saturday, November 30th.

In [78]:
# Create a DataFrame for the games that I want to predict
games_to_predict = pd.DataFrame(columns=[
    'season_id', 'team_name_home',  'team_name_away', 'rolling_win_pct_home', 'rolling_win_pct_away', 'rolling_fg_pct_home',
    'rolling_fg_pct_away', 'rolling_fg3_pct_home', 'rolling_fg3_pct_away', 
    'rolling_ft_pct_home', 'rolling_ft_pct_away', 'rolling_oreb_home', 'rolling_oreb_away',
    'rolling_dreb_home', 'rolling_dreb_away', 'rolling_reb_home', 'rolling_reb_away',
    'rolling_ast_home', 'rolling_ast_away', 'rolling_stl_home', 'rolling_stl_away',
    'rolling_blk_home', 'rolling_blk_away', 'rolling_tov_home', 'rolling_tov_away',
    'rolling_pf_home', 'rolling_pf_away', 
    'team_id_home_1610612737', 'team_id_home_1610612738', 'team_id_home_1610612739', 
    'team_id_home_1610612740', 'team_home_1610612741', 'team_id_home_1610612742', 
    'team_id_home_1610612743', 'team_id_home_1610612744', 'team_id_home_1610612745', 
    'team_id_home_1610612746', 'team_id_home_1610612747', 'team_id_home_1610612748', 
    'team_id_home_1610612749', 'team_id_home_1610612750', 'team_id_home_1610612751', 
    'team_id_home_1610612752', 'team_id_home_1610612753', 'team_id_home_1610612754', 
    'team_id_home_1610612755', 'team_id_home_1610612756', 'team_id_home_1610612757', 
    'team_id_home_1610612758', 'team_id_home_1610612759', 'team_id_home_1610612760', 
    'team_id_home_1610612761', 'team_id_home_1610612762', 'team_id_home_1610612763', 
    'team_id_home_1610612764', 'team_id_home_1610612765', 'team_id_home_1610612766',
    'team_id_away_1610612737', 'team_id_away_1610612738', 'team_id_away_1610612739', 
    'team_id_away_1610612740', 'team_id_away_1610612741', 'team_id_away_1610612742', 
    'team_id_away_1610612743', 'team_id_away_1610612744', 'team_id_away_1610612745', 
    'team_id_away_1610612746', 'team_id_away_1610612747', 'team_id_away_1610612748', 
    'team_id_away_1610612749', 'team_id_away_1610612750', 'team_id_away_1610612751', 
    'team_id_away_1610612752', 'team_id_away_1610612753', 'team_id_away_1610612754', 
    'team_id_away_1610612755', 'team_id_away_1610612756', 'team_id_away_1610612757', 
    'team_id_away_1610612758', 'team_id_away_1610612759', 'team_id_away_1610612760', 
    'team_id_away_1610612761', 'team_id_away_1610612762', 'team_id_away_1610612763', 
    'team_id_away_1610612764', 'team_id_away_1610612765', 'team_id_away_1610612766'
])

In [79]:
games_to_predict

Unnamed: 0,season_id,team_name_home,team_name_away,rolling_win_pct_home,rolling_win_pct_away,rolling_fg_pct_home,rolling_fg_pct_away,rolling_fg3_pct_home,rolling_fg3_pct_away,rolling_ft_pct_home,...,team_id_away_1610612757,team_id_away_1610612758,team_id_away_1610612759,team_id_away_1610612760,team_id_away_1610612761,team_id_away_1610612762,team_id_away_1610612763,team_id_away_1610612764,team_id_away_1610612765,team_id_away_1610612766


## I got the stats for each team's rolling averages (past 10 games) from statmuse.com

In [80]:
# Define the games with their respective home and away team IDs
games = [
    {'season_id': '70671', 'home_id': 1610612766, 'away_id': 1610612737},  # Hornets vs Hawks
    {'season_id': '70671', 'home_id': 1610612765, 'away_id': 1610612755},  # Pistons vs 76ers
    {'season_id': '70671', 'home_id': 1610612749, 'away_id': 1610612764},  # Bucks vs Wizards
    {'season_id': '270671', 'home_id': 1610612756, 'away_id': 1610612744},  # Suns vs Warriors
    {'season_id': '70671', 'home_id': 1610612762, 'away_id': 1610612742}   # Jazz vs Mavericks
]

# Data for each game (team stats)
games_data = [
    # Game 1: Hawks vs. Hornets
    {
        'season_id': '70672', 'team_name_home': 'Charlotte Hornets', 'team_name_away': 'Atlanta Hawks',
        'rolling_win_pct_home': 0.20, 'rolling_win_pct_away': 0.50,
        'rolling_fg_pct_home': .418, 'rolling_fg_pct_away': .46,
        'rolling_fg3_pct_home': .362, 'rolling_fg3_pct_away': .347,
        'rolling_ft_pct_home': .785, 'rolling_ft_pct_away': .784,
        'rolling_oreb_home': 13.3, 'rolling_oreb_away': 13.4,
        'rolling_dreb_home': 34.4, 'rolling_dreb_away': 33.8,
        'rolling_reb_home': 47.7, 'rolling_reb_away': 47.2,
        'rolling_ast_home': 23.2, 'rolling_ast_away': 32.1,
        'rolling_stl_home': 8.2, 'rolling_stl_away': 9.6,
        'rolling_blk_home': 5.3, 'rolling_blk_away': 5.7,
        'rolling_tov_home': 15.9, 'rolling_tov_away': 15.7,
        'rolling_pf_home': 19.7, 'rolling_pf_away': 18
    },
    # Game 2: Philadelphia 76ers vs. Pistons
    {
        'season_id': '70672', 'team_name_home': 'Detroit Pistons', 'team_name_away': 'Philadelphia 76ers',
        'rolling_win_pct_home': 0.40, 'rolling_win_pct_away': 0.20,
        'rolling_fg_pct_home': .447, 'rolling_fg_pct_away': .425,
        'rolling_fg3_pct_home': .357, 'rolling_fg3_pct_away': .313,
        'rolling_ft_pct_home': .747, 'rolling_ft_pct_away': .814,
        'rolling_oreb_home': 12.6, 'rolling_oreb_away': 10.2,
        'rolling_dreb_home': 34.6, 'rolling_dreb_away': 29.9,
        'rolling_reb_home': 47.2, 'rolling_reb_away': 40.1,
        'rolling_ast_home': 25.7, 'rolling_ast_away': 22.6,
        'rolling_stl_home': 6.3, 'rolling_stl_away': 9.7,
        'rolling_blk_home': 5.4, 'rolling_blk_away': 4.6,
        'rolling_tov_home': 15.9, 'rolling_tov_away': 14.9,
        'rolling_pf_home': 19.7, 'rolling_pf_away': 19.5
    },
    # Game 3: Washington Wizards vs. Milwaukee Bucks
    {
        'season_id': '70672', 'team_name_home': 'Milwaukee Bucks', 'team_name_away': 'Washington Wizards',
        'rolling_win_pct_home': 0.70, 'rolling_win_pct_away': 0.00,
        'rolling_fg_pct_home': .479, 'rolling_fg_pct_away': .445,
        'rolling_fg3_pct_home': .394, 'rolling_fg3_pct_away': .328,
        'rolling_ft_pct_home': .727, 'rolling_ft_pct_away': .727,
        'rolling_oreb_home': 8.8, 'rolling_oreb_away': 9.0,
        'rolling_dreb_home': 35.2, 'rolling_dreb_away': 32.6,
        'rolling_reb_home': 44, 'rolling_reb_away': 41.6,
        'rolling_ast_home': 25, 'rolling_ast_away': 23.1,
        'rolling_stl_home': 7.5, 'rolling_stl_away': 7.5,
        'rolling_blk_home': 6.7, 'rolling_blk_away': 4.4,
        'rolling_tov_home': 12.7, 'rolling_tov_away': 16.1,
        'rolling_pf_home': 17.3, 'rolling_pf_away': 20.7
    },
    # Game 4: Golden State Warriors vs. Phoenix Suns
    {
        'season_id': '70672', 'team_name_home': 'Phoenix Suns', 'team_name_away': 'Golden State Warriors',
        'rolling_win_pct_home': 0.30, 'rolling_win_pct_away': 0.50,
        'rolling_fg_pct_home': .459, 'rolling_fg_pct_away': .447,
        'rolling_fg3_pct_home': .359, 'rolling_fg3_pct_away': .365,
        'rolling_ft_pct_home': .821, 'rolling_ft_pct_away': .665,
        'rolling_oreb_home': 10.9, 'rolling_oreb_away': 13.4,
        'rolling_dreb_home': 31.1, 'rolling_dreb_away': 35.8,
        'rolling_reb_home': 42, 'rolling_reb_away': 49.2,
        'rolling_ast_home': 25.7, 'rolling_ast_away': 30.5,
        'rolling_stl_home': 8, 'rolling_stl_away': 9,
        'rolling_blk_home': 3.6, 'rolling_blk_away': 4.9,
        'rolling_tov_home': 12.3, 'rolling_tov_away': 14.8,
        'rolling_pf_home': 18.7, 'rolling_pf_away': 20.9
    },
    # Game 5: Dallas Mavericks vs. Utah Jazz
    {
        'season_id': '70672', 'team_name_home': 'Utah Jazz', 'team_name_away': 'Dallas Mavericks',
        'rolling_win_pct_home': 0.30, 'rolling_win_pct_away': 0.60,
        'rolling_fg_pct_home': .479, 'rolling_fg_pct_away': .50,
        'rolling_fg3_pct_home': .371, 'rolling_fg3_pct_away': .361,
        'rolling_ft_pct_home': .767, 'rolling_ft_pct_away': .789,
        'rolling_oreb_home': 12, 'rolling_oreb_away': 12.1,
        'rolling_dreb_home': 30.9, 'rolling_dreb_away': 34.2,
        'rolling_reb_home': 42.9, 'rolling_reb_away': 46.3,
        'rolling_ast_home': 24.9, 'rolling_ast_away': 25.4,
        'rolling_stl_home': 7.9, 'rolling_stl_away': 7.8,
        'rolling_blk_home': 3.3, 'rolling_blk_away': 6.1,
        'rolling_tov_home': 16.1, 'rolling_tov_away': 14.1,
        'rolling_pf_home': 20, 'rolling_pf_away': 19.5
    }
]

# Add all rows to the DataFrame
games_to_predict = pd.DataFrame(games_data)

In [81]:
pd.set_option('display.max_columns', None)
games_to_predict

Unnamed: 0,season_id,team_name_home,team_name_away,rolling_win_pct_home,rolling_win_pct_away,rolling_fg_pct_home,rolling_fg_pct_away,rolling_fg3_pct_home,rolling_fg3_pct_away,rolling_ft_pct_home,rolling_ft_pct_away,rolling_oreb_home,rolling_oreb_away,rolling_dreb_home,rolling_dreb_away,rolling_reb_home,rolling_reb_away,rolling_ast_home,rolling_ast_away,rolling_stl_home,rolling_stl_away,rolling_blk_home,rolling_blk_away,rolling_tov_home,rolling_tov_away,rolling_pf_home,rolling_pf_away
0,70672,Charlotte Hornets,Atlanta Hawks,0.2,0.5,0.418,0.46,0.362,0.347,0.785,0.784,13.3,13.4,34.4,33.8,47.7,47.2,23.2,32.1,8.2,9.6,5.3,5.7,15.9,15.7,19.7,18.0
1,70672,Detroit Pistons,Philadelphia 76ers,0.4,0.2,0.447,0.425,0.357,0.313,0.747,0.814,12.6,10.2,34.6,29.9,47.2,40.1,25.7,22.6,6.3,9.7,5.4,4.6,15.9,14.9,19.7,19.5
2,70672,Milwaukee Bucks,Washington Wizards,0.7,0.0,0.479,0.445,0.394,0.328,0.727,0.727,8.8,9.0,35.2,32.6,44.0,41.6,25.0,23.1,7.5,7.5,6.7,4.4,12.7,16.1,17.3,20.7
3,70672,Phoenix Suns,Golden State Warriors,0.3,0.5,0.459,0.447,0.359,0.365,0.821,0.665,10.9,13.4,31.1,35.8,42.0,49.2,25.7,30.5,8.0,9.0,3.6,4.9,12.3,14.8,18.7,20.9
4,70672,Utah Jazz,Dallas Mavericks,0.3,0.6,0.479,0.5,0.371,0.361,0.767,0.789,12.0,12.1,30.9,34.2,42.9,46.3,24.9,25.4,7.9,7.8,3.3,6.1,16.1,14.1,20.0,19.5


## Below I am one-hot encoding the id columns, and ensure that all of the columns from rolling_regular_season_stats are present in games_to_predict (Since there are only 10 teams out of the 30 in games_to_predict)

In [82]:
# Get unique team IDs from the `games` list
team_ids = set(game['home_id'] for game in games).union(set(game['away_id'] for game in games))

all_team_ids = [
    '1610612737', '1610612738', '1610612739', '1610612740', '1610612741',
    '1610612742', '1610612743', '1610612744', '1610612745', '1610612746',
    '1610612747', '1610612748', '1610612749', '1610612750', '1610612751',
    '1610612752', '1610612753', '1610612754', '1610612755', '1610612756',
    '1610612757', '1610612758', '1610612759', '1610612760', '1610612761',
    '1610612762', '1610612763', '1610612764', '1610612765', '1610612766'
]

# Dynamically create one-hot encoded columns for each team ID
for team_id in all_team_ids:
    games_to_predict[f'team_id_home_{team_id}'] = False
    games_to_predict[f'team_id_away_{team_id}'] = False

# Populate the one-hot encoded columns based on the home and away team IDs
for i, game in enumerate(games):
    home_id = game['home_id']
    away_id = game['away_id']
    
    games_to_predict.at[i, f'team_id_home_{home_id}'] = True
    games_to_predict.at[i, f'team_id_away_{away_id}'] = True

In [83]:
games_to_predict

Unnamed: 0,season_id,team_name_home,team_name_away,rolling_win_pct_home,rolling_win_pct_away,rolling_fg_pct_home,rolling_fg_pct_away,rolling_fg3_pct_home,rolling_fg3_pct_away,rolling_ft_pct_home,rolling_ft_pct_away,rolling_oreb_home,rolling_oreb_away,rolling_dreb_home,rolling_dreb_away,rolling_reb_home,rolling_reb_away,rolling_ast_home,rolling_ast_away,rolling_stl_home,rolling_stl_away,rolling_blk_home,rolling_blk_away,rolling_tov_home,rolling_tov_away,rolling_pf_home,rolling_pf_away,team_id_home_1610612737,team_id_away_1610612737,team_id_home_1610612738,team_id_away_1610612738,team_id_home_1610612739,team_id_away_1610612739,team_id_home_1610612740,team_id_away_1610612740,team_id_home_1610612741,team_id_away_1610612741,team_id_home_1610612742,team_id_away_1610612742,team_id_home_1610612743,team_id_away_1610612743,team_id_home_1610612744,team_id_away_1610612744,team_id_home_1610612745,team_id_away_1610612745,team_id_home_1610612746,team_id_away_1610612746,team_id_home_1610612747,team_id_away_1610612747,team_id_home_1610612748,team_id_away_1610612748,team_id_home_1610612749,team_id_away_1610612749,team_id_home_1610612750,team_id_away_1610612750,team_id_home_1610612751,team_id_away_1610612751,team_id_home_1610612752,team_id_away_1610612752,team_id_home_1610612753,team_id_away_1610612753,team_id_home_1610612754,team_id_away_1610612754,team_id_home_1610612755,team_id_away_1610612755,team_id_home_1610612756,team_id_away_1610612756,team_id_home_1610612757,team_id_away_1610612757,team_id_home_1610612758,team_id_away_1610612758,team_id_home_1610612759,team_id_away_1610612759,team_id_home_1610612760,team_id_away_1610612760,team_id_home_1610612761,team_id_away_1610612761,team_id_home_1610612762,team_id_away_1610612762,team_id_home_1610612763,team_id_away_1610612763,team_id_home_1610612764,team_id_away_1610612764,team_id_home_1610612765,team_id_away_1610612765,team_id_home_1610612766,team_id_away_1610612766
0,70672,Charlotte Hornets,Atlanta Hawks,0.2,0.5,0.418,0.46,0.362,0.347,0.785,0.784,13.3,13.4,34.4,33.8,47.7,47.2,23.2,32.1,8.2,9.6,5.3,5.7,15.9,15.7,19.7,18.0,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False
1,70672,Detroit Pistons,Philadelphia 76ers,0.4,0.2,0.447,0.425,0.357,0.313,0.747,0.814,12.6,10.2,34.6,29.9,47.2,40.1,25.7,22.6,6.3,9.7,5.4,4.6,15.9,14.9,19.7,19.5,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
2,70672,Milwaukee Bucks,Washington Wizards,0.7,0.0,0.479,0.445,0.394,0.328,0.727,0.727,8.8,9.0,35.2,32.6,44.0,41.6,25.0,23.1,7.5,7.5,6.7,4.4,12.7,16.1,17.3,20.7,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False
3,70672,Phoenix Suns,Golden State Warriors,0.3,0.5,0.459,0.447,0.359,0.365,0.821,0.665,10.9,13.4,31.1,35.8,42.0,49.2,25.7,30.5,8.0,9.0,3.6,4.9,12.3,14.8,18.7,20.9,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,70672,Utah Jazz,Dallas Mavericks,0.3,0.6,0.479,0.5,0.371,0.361,0.767,0.789,12.0,12.1,30.9,34.2,42.9,46.3,24.9,25.4,7.9,7.8,3.3,6.1,16.1,14.1,20.0,19.5,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False


In [84]:
print("Training features:", X_train.columns.tolist())
print("Prediction features:", games_to_predict.columns.tolist())

Training features: ['season_id', 'rolling_win_pct_home', 'rolling_win_pct_away', 'rolling_fg_pct_home', 'rolling_fg_pct_away', 'rolling_fg3_pct_home', 'rolling_fg3_pct_away', 'rolling_ft_pct_home', 'rolling_ft_pct_away', 'rolling_oreb_home', 'rolling_oreb_away', 'rolling_dreb_home', 'rolling_dreb_away', 'rolling_reb_home', 'rolling_reb_away', 'rolling_ast_home', 'rolling_ast_away', 'rolling_stl_home', 'rolling_stl_away', 'rolling_blk_home', 'rolling_blk_away', 'rolling_tov_home', 'rolling_tov_away', 'rolling_pf_home', 'rolling_pf_away', 'team_id_home_1610612737', 'team_id_home_1610612738', 'team_id_home_1610612739', 'team_id_home_1610612740', 'team_id_home_1610612741', 'team_id_home_1610612742', 'team_id_home_1610612743', 'team_id_home_1610612744', 'team_id_home_1610612745', 'team_id_home_1610612746', 'team_id_home_1610612747', 'team_id_home_1610612748', 'team_id_home_1610612749', 'team_id_home_1610612750', 'team_id_home_1610612751', 'team_id_home_1610612752', 'team_id_home_1610612753'

## Retrain/refit the model to use all 85 of the features wanted from games_to_predict, scale the features, and run the models on the predicted games. Print out the predictions and probabilities.

In [91]:
# Train the model using all 85 columns
model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)

# Now, apply MinMax scaling consistently across training and prediction data
scaler = MinMaxScaler()
scaler.fit(X_train)  # Fit the scaler on all the features from training

# Ensure that the prediction data has the same columns as the training data
X_predict_all_features = games_to_predict[X_train.columns]

# Scale the prediction data using the fitted scaler
games_to_predict_scaled = scaler.transform(X_predict_all_features)

# Make predictions with the trained model
predictions = model.predict(games_to_predict_scaled)

# Add predictions to the games_to_predict DataFrame
games_to_predict['predicted_outcome'] = predictions

# Optionally, predict probabilities if needed
probabilities = model.predict_proba(games_to_predict_scaled)
games_to_predict['probability_home_win'] = probabilities[:, 1]  # Home win probability (class 1)

# Print the predictions and probabilities for review
print(games_to_predict[['team_name_home', 'team_name_away', 'predicted_outcome', 'probability_home_win']])

# Optionally, evaluate the model on a separate test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy}")

      team_name_home         team_name_away  predicted_outcome  \
0  Charlotte Hornets          Atlanta Hawks                  0   
1    Detroit Pistons     Philadelphia 76ers                  1   
2    Milwaukee Bucks     Washington Wizards                  1   
3       Phoenix Suns  Golden State Warriors                  0   
4          Utah Jazz       Dallas Mavericks                  0   

   probability_home_win  
0              0.000825  
1              0.996972  
2              0.999989  
3              0.000777  
4              0.000009  
Model Accuracy: 0.6561721950044174


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


# Results

## These were the model's predictions for each game's winner on November 30th: 

### Charlotte Hornets (home) vs. Atlanta Hawks (away): **Atlanta Hawks** &#x2705;
 
### Destroit Pistons (home) vs Philadelphia 76ers (away): **Detroit Pistons** &#x274C;

### Milwaukee Bucks (home) vs. Washington Wizards (away): **Milwaukee Bucks** &#x2705;

### Phoenix Suns (home) vs. Golden State Warriors (away): **Golden State Warriors** &#x274C;

### Utah Jazz (home) vs. Dallas Mavericks (away): **Dallas Mavericks** &#x2705;

## The logistic regression model correctly predicted 60% (3/5) of the NBA games on November 30th. The accuracy when comparing its accuracy based on the testing set was around 65%, which is similar in accuracy. We must also keep in mind that this is a VERY SMALL sample size.

# Conclusion

<font size="4">I built a logistic regression model to predict the outcomes (winner) of NBA regular season games. I used the game.csv, line_scores.csv, and team.csv tables in an NBA database that recorded game/player stats and information across all seasons from 1947 to 2023. At first, I ran my model on features such as percentages (free throws, 3 point, field goal) and team stats (rebounds, assists, steals, blocks, etc) The dataset contained these stats for every NBA regular season game between 1947 and 2023. I then ran the logistic regression model (500 iterations) and tested the accuracy of the results compared to a test set, which was around 99%. I also modified the model to incorporate Recursive Feature Elimination, which helps select the most valuable features, and the accuracy jumped to 100%. Realizing that these accuracies were too high, I found that some columns such as field goals made, win percentage, and total points were effectively giving away the winner of the game. So after dropping columns with very high correlations with the game_winner column, the accuracies changed to 93.5% and 91.5%. 

After seeing the high accuracy rates from the model, I wanted to run predictions on future NBA games. However, during data preparation for the games I wanted to predict, I realized that it was impossible to fill the values for columns such as total rebounds, total assists, field goal percentage, etc., since these games had not been played yet. Therefore, I had to create rolling averages for these game stats, based on the previous 10 games. I then used a dataset that contained rolling averages instead of the actual stats for each individual game. When rerunning the logistic regression model and the logistic regression model with RFE, the accuracy dropped significantly to 64% and 61%. Still, in terms of an expected range of accuracy for predicting the outcomes of games, 60-65% is still relatively good. 

I ran the model on the games_to_predict, and the model predicted 60% (3/5) games. 

To conclude, an accuracy of 60% is a good rate of success for predicting the winner of NBA games. However, we must keep in mind that this is a very small sample size, and that we cannot conclude that the model will maitain a 60% hit rate in a much larger number of game predictions. 

I used game percentages (field goal, 3 point, free throws) and game stats like rebounds, assists, and steals as my features for this model, however there are many other columns/data that I believe can increase the accuracy. For instance, I did not use any individual player stats and player injuries, which are present in the play_by_play, inactive_players, and player datasets. Between seasons as well as in the middle of a season, team rosters are always changing, which plays a very large factor in team performance game in and game out. Players also can get injured, and often do not play all 82 games in a season. For example, if Lebron James was not playing in a specific game, this would significantly affect the probability of the Lakers winning or losing.

Also, I used a dataset of all games from 1947-2023. I would assume that stats from more recent seasons are more effective for predicting future games than say the game stats from the 1947-1948 season. I also did not have the game stats from the previous 2023-2024 season, nor the stats from the games already played during the current 2024-2025 season. I believe these stats would be very beneficial for predicting future NBA games. 

In conclusion, when running my model on the set of 5 games on November 30th, it was able to produce a 60% success rate, but because of the small sample size, I cannot safely assume that this 60% will remain when a large number of games are added to predict. </font>
