In [26]:
!pip install basketball-reference-scraper
!pip install html5lib



# NBA Betting Models
## Introduction
I have watched sports since I was six years old and have recently started sports betting. Since beginning my study of data science and economics, I've become interested in developing a sports betting model, and this is my attempt to build one. Along the way, I hope to learn a lot about how sports betting is modeled. I will be using data from Basketball Reference as my testing data. There's a very helpful API for scraping data from this website that I will be using a lot; it can be found at https://github.com/vishaalagartha/basketball_reference_scraper.

### Initial Data
I'm going to define some constants and lists up here that will help. You can see in my years list that I start at 2015 and skip 2020 and 2021. I start at 2015 because many people consider that season a major shift in the NBA style of play. I skipped 2020 and 2021 because rest, lack of fans, different stadiums, and many other factors were different from the traditional NBA environment.

In [27]:
from basketball_reference_scraper.seasons import get_schedule, get_standings
from basketball_reference_scraper.players import get_stats, get_game_logs, get_player_headshot
from basketball_reference_scraper.teams import get_roster, get_team_stats, get_opp_stats, get_roster_stats, get_team_misc
from basketball_reference_scraper.box_scores import get_box_scores
import pandas as pd
years = [2015, 2016, 2017, 2018, 2019, 2022] # Games were very weird in 2020 and 2021
teams = {'Atlanta Hawks': 'ATL', 'Boston Celtics':'BOS', 'Charlotte Hornets':'CHO', 'Chicago Bulls':'CHI',
         'Cleveland Cavaliers':'CLE', 'Dallas Mavericks':'DAL','Denver Nuggets':'DEN','Detroit Pistons':'DET',
         'Golden State Warriors':'GSW', 'Houston Rockets':'HOU','Indiana Pacers':'IND',
         'Los Angeles Clippers':'LAC','Los Angeles Lakers':'LAL', 'Memphis Grizzlies':'MEM',
         'Miami Heat':'MIA','Milwaukee Bucks':'MIL', 'Minnesota Timberwolves':'MIN',
         'New Orleans Pelicans':'NOP','New York Knicks':'NYK','Brooklyn Nets':'BRK', 'Oklahoma City Thunder':'OKC',
         'Orlando Magic':'ORL', 'Philadelphia 76ers':'PHI','Phoenix Suns':'PHO','Portland Trail Blazers':'POR',
         'Sacramento Kings':'SAC','San Antonio Spurs':'SAS','Toronto Raptors':'TOR',
         'Utah Jazz':'UTA','Washington Wizards':'WAS'}

## Brother's Help
My brother is a huge basketball fan, so much so that he actually works for the NBA. When I decided to take on this project, I asked him what he thought I should look at. He came up with a few ideas.

### 3-Point Volatility
One of my brother's big suggestions was looking at 3-point variance in attempts and completion percentage. What would this imply for winning percentage? For variance in point differential? What about for variance in overall points? That's what I hope to explore.

I hope to create a table that will eventually look like this:
Year | Team | Volatility in 3PA | Volatility in 3P% | WP | Volatility in PTS | Volatility in PT differential
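
The table above could be built from per-game rows with a single `groupby`. This is a minimal sketch on made-up numbers: measuring "volatility" as the sample standard deviation is an assumption, and the output column names (`VOL_3PA`, `VOL_3P_PCT`, and so on) are hypothetical.

```python
import pandas as pd

# Toy per-game data (made-up values); the real table would hold
# one row per team per game.
games = pd.DataFrame({
    'YEAR':    [2015, 2015, 2015, 2015, 2015, 2015],
    'TEAM':    ['GSW', 'GSW', 'GSW', 'SAS', 'SAS', 'SAS'],
    '3PA':     [25, 31, 28, 20, 20, 20],
    '3P%':     [0.40, 0.35, 0.45, 0.38, 0.36, 0.37],
    'POINTS':  [110, 98, 121, 101, 99, 100],
    'PT_DIFF': [10, -4, 15, 1, -1, 3],
    'RESULT':  ['W', 'L', 'W', 'W', 'L', 'W'],
})

# 1 for a win, 0 for a loss, so the mean is the winning percentage.
games['WIN'] = (games['RESULT'] == 'W').astype(int)

# Assumption: "volatility" is the sample standard deviation within a season.
summary = games.groupby(['YEAR', 'TEAM']).agg(
    VOL_3PA=('3PA', 'std'),
    VOL_3P_PCT=('3P%', 'std'),
    WP=('WIN', 'mean'),
    VOL_PTS=('POINTS', 'std'),
    VOL_PT_DIFF=('PT_DIFF', 'std'),
).reset_index()

print(summary)
```

Other measures of spread, such as the coefficient of variation, would slot into the same `agg` call.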

The first thing we want is a master schedule of sorts. For every game we want all of the advanced stats, not just points. I'm going to loop through all of the years and stack their schedules into one table.

In [28]:
master = pd.DataFrame()

for yr in years: 
    tmp_sched = get_schedule(yr, playoffs=False)
    tmp_sched['YEAR'] = yr
    master = pd.concat([tmp_sched,master], ignore_index=True)
    
master

Unnamed: 0,DATE,VISITOR,VISITOR_PTS,HOME,HOME_PTS,YEAR
0,2021-10-19,Brooklyn Nets,104,Milwaukee Bucks,127,2022
1,2021-10-19,Golden State Warriors,121,Los Angeles Lakers,114,2022
2,2021-10-20,Indiana Pacers,122,Charlotte Hornets,123,2022
3,2021-10-20,Chicago Bulls,94,Detroit Pistons,88,2022
4,2021-10-20,Boston Celtics,134,New York Knicks,138,2022
...,...,...,...,...,...,...
7468,2015-04-15,Detroit Pistons,112,New York Knicks,90,2015
7469,2015-04-15,Miami Heat,105,Philadelphia 76ers,101,2015
7470,2015-04-15,Indiana Pacers,83,Memphis Grizzlies,95,2015
7471,2015-04-15,Denver Nuggets,126,Golden State Warriors,133,2015


In [29]:
master['GAME'] = ''
for idx, row in master.iterrows():
    h = teams[str( master.at[idx, "HOME"])]
    v = teams[str( master.at[idx, "VISITOR"])]
    master.at[idx, "VISITOR"] = v
    master.at[idx, "HOME"] = h
    master.at[idx, "GAME"] = v + " @ " + h
master.tail(6)

Unnamed: 0,DATE,VISITOR,VISITOR_PTS,HOME,HOME_PTS,YEAR,GAME
7467,2015-04-15,SAS,103,NOP,108,2015,SAS @ NOP
7468,2015-04-15,DET,112,NYK,90,2015,DET @ NYK
7469,2015-04-15,MIA,105,PHI,101,2015,MIA @ PHI
7470,2015-04-15,IND,83,MEM,95,2015,IND @ MEM
7471,2015-04-15,DEN,126,GSW,133,2015,DEN @ GSW
7472,2015-04-15,SAC,122,LAL,99,2015,SAC @ LAL
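
The same name-to-abbreviation step can be written without a row loop. The sketch below uses a hypothetical two-row schedule and a four-team subset of the `teams` dictionary; it is an alternative, not the code that produced the output above.

```python
import pandas as pd

# Hypothetical subset of the `teams` dictionary and a two-row schedule
# with the same column names as `master`.
teams = {'Denver Nuggets': 'DEN', 'Golden State Warriors': 'GSW',
         'Sacramento Kings': 'SAC', 'Los Angeles Lakers': 'LAL'}
sched = pd.DataFrame({
    'VISITOR': ['Denver Nuggets', 'Sacramento Kings'],
    'HOME': ['Golden State Warriors', 'Los Angeles Lakers'],
})

# Series.map looks each name up in the dict. Unknown names become NaN
# instead of raising KeyError, so it is worth checking for NaN afterwards.
sched['VISITOR'] = sched['VISITOR'].map(teams)
sched['HOME'] = sched['HOME'].map(teams)

# String columns concatenate element-wise.
sched['GAME'] = sched['VISITOR'] + ' @ ' + sched['HOME']

print(sched)
```

On a table with several thousand rows this is usually much faster than `iterrows`, at the cost of failing silently (with NaN) on a name missing from the dictionary.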


In [30]:
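# Reshape to one row per team per game; var_name is passed as a plain string,
# which newer pandas versions require for non-MultiIndex columns
df = pd.melt(master, id_vars=['DATE','GAME', 'HOME_PTS', 'VISITOR_PTS', 'YEAR'], value_vars=['HOME', 'VISITOR'], 
             var_name='HOME/VISITOR', value_name='TEAM')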
master = df.sort_values(['YEAR','DATE', 'GAME'])
    
master.head()

Unnamed: 0,DATE,GAME,HOME_PTS,VISITOR_PTS,YEAR,HOME/VISITOR,TEAM
6244,2014-10-28,DAL @ SAS,101,100,2015,HOME,SAS
13717,2014-10-28,DAL @ SAS,101,100,2015,VISITOR,DAL
6245,2014-10-28,HOU @ LAL,90,108,2015,HOME,LAL
13718,2014-10-28,HOU @ LAL,90,108,2015,VISITOR,HOU
6243,2014-10-28,ORL @ NOP,101,84,2015,HOME,NOP


In [31]:
master['POINTS'] = 0
master['RESULT'] = ''
master['PT_DIFF'] = 0
master['PTS ALLOWED'] = 0
    
for idx, row in master.iterrows():
    hp = int(row['HOME_PTS'])
    ap = int(row['VISITOR_PTS'])
    #Determine if team should use home or away
    if row["HOME/VISITOR"] == "HOME":
        master.at[idx, 'POINTS'] = hp
        master.at[idx, 'PTS ALLOWED'] = ap
        if hp > ap:
            master.at[idx, 'RESULT'] = "W"
        else:
            master.at[idx, 'RESULT'] = "L"
        master.at[idx, 'PT_DIFF'] = hp - ap
    else:
        master.at[idx, 'POINTS'] = ap
        master.at[idx, 'PTS ALLOWED'] = hp
        if hp > ap:
            master.at[idx, 'RESULT'] = "L"
        else:
            master.at[idx, 'RESULT'] = "W"
        master.at[idx, 'PT_DIFF'] = ap - hp
master.drop(columns=["HOME_PTS", "VISITOR_PTS"], inplace=True)
    
    
master

Unnamed: 0,DATE,GAME,YEAR,HOME/VISITOR,TEAM,POINTS,RESULT,PT_DIFF,PTS ALLOWED
6244,2014-10-28,DAL @ SAS,2015,HOME,SAS,101,W,1,100
13717,2014-10-28,DAL @ SAS,2015,VISITOR,DAL,100,L,-1,101
6245,2014-10-28,HOU @ LAL,2015,HOME,LAL,90,L,-18,108
13718,2014-10-28,HOU @ LAL,2015,VISITOR,HOU,108,W,18,90
6243,2014-10-28,ORL @ NOP,2015,HOME,NOP,101,W,17,84
...,...,...,...,...,...,...,...,...,...
8793,2022-06-10,GSW @ BOS,2022,VISITOR,GSW,107,W,10,97
1321,2022-06-13,BOS @ GSW,2022,HOME,GSW,104,W,10,94
8794,2022-06-13,BOS @ GSW,2022,VISITOR,BOS,94,L,-10,104
1322,2022-06-16,GSW @ BOS,2022,HOME,BOS,90,L,-13,103


At this point we have a good, clean table with tidy data. Each entry is attached to a single team, so we can look up the box score for every team on every date and add it to our table. However, the scraper returns the box score for a whole game, player by player, so each game has to be fetched and reduced to team-level numbers. I'm going to do this for every statistic available, as I expect to use them all.

First off, I'm going to add an opponent column to help with the search.

In [32]:
master['OPPONENT'] = ''
for idx, row in master.iterrows():
    game = str(master.at[idx, 'GAME'])
    spl = game.split(' @ ')
    if row['HOME/VISITOR'] == 'HOME':
        master.at[idx, 'OPPONENT'] = spl[0]
    else:
        master.at[idx, 'OPPONENT'] = spl[1]
master.tail(6)

Unnamed: 0,DATE,GAME,YEAR,HOME/VISITOR,TEAM,POINTS,RESULT,PT_DIFF,PTS ALLOWED,OPPONENT
1320,2022-06-10,GSW @ BOS,2022,HOME,BOS,97,L,-10,107,GSW
8793,2022-06-10,GSW @ BOS,2022,VISITOR,GSW,107,W,10,97,BOS
1321,2022-06-13,BOS @ GSW,2022,HOME,GSW,104,W,10,94,BOS
8794,2022-06-13,BOS @ GSW,2022,VISITOR,BOS,94,L,-10,104,GSW
1322,2022-06-16,GSW @ BOS,2022,HOME,BOS,90,L,-13,103,GSW
8795,2022-06-16,GSW @ BOS,2022,VISITOR,GSW,103,W,13,90,BOS
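
The per-team columns built in the two loops above (`POINTS`, `PTS ALLOWED`, `PT_DIFF`, `RESULT`, `OPPONENT`) can also be derived column-wise. This is a sketch on a toy one-game table using `numpy.where`, not the code that produced the outputs above.

```python
import numpy as np
import pandas as pd

# Toy tidy table: two rows for one game, one per team, as the melt step produces.
tidy = pd.DataFrame({
    'GAME': ['DAL @ SAS', 'DAL @ SAS'],
    'HOME/VISITOR': ['HOME', 'VISITOR'],
    'TEAM': ['SAS', 'DAL'],
    'HOME_PTS': [101, 101],
    'VISITOR_PTS': [100, 100],
})

is_home = tidy['HOME/VISITOR'] == 'HOME'

# Pick the team's own score and the opponent's score for each row.
tidy['POINTS'] = np.where(is_home, tidy['HOME_PTS'], tidy['VISITOR_PTS'])
tidy['PTS ALLOWED'] = np.where(is_home, tidy['VISITOR_PTS'], tidy['HOME_PTS'])
tidy['PT_DIFF'] = tidy['POINTS'] - tidy['PTS ALLOWED']

# NBA games cannot end in a tie, so a positive margin means a win.
tidy['RESULT'] = np.where(tidy['PT_DIFF'] > 0, 'W', 'L')

# GAME is "VISITOR @ HOME", so the opponent is the other side of the split.
sides = tidy['GAME'].str.split(' @ ', expand=True)
tidy['OPPONENT'] = np.where(is_home, sides[0], sides[1])

print(tidy.drop(columns=['HOME_PTS', 'VISITOR_PTS']))
```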


In [33]:
sts_b = ['FGA', 'FG%', '3PA', '3P%', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV']
sts_a = ['AST%', 'STL%', 'ORtg', 'DRtg','TS%', 'eFG%', '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'BLK%', 'TOV%', 'USG%']

master['FGA'] = 0.0

# for st in sts_b:
#     master[st] = 0.0
    
# for st in sts_a:
#     master[st] = 0.0

for idx, row in master.iterrows():
    # Add basic stats
    team = row['TEAM']
    try:
        df = get_box_scores(str(row['DATE']), team, row['OPPONENT'], stat_type='BASIC')[team]
        #for stat in sts_b:
        master.at[idx, 'FGA'] = df.iloc[-1].loc['FGA']
    except ValueError:      
        print('Issue with: ', idx)
        
#     try:
#         df = get_box_scores(str(row['DATE']), team, row['OPPONENT'], stat_type='ADVANCED')[team]
#         for stat in sts_a:
#             master.at[idx, stat] = df.iloc[-1].loc[stat]
#     except ValueError:      
#         print('Issue with: ', idx )

# ### Correlation Matrix
master.to_csv("FGA added.csv")

ConnectionError: HTTPSConnectionPool(host='widgets.sports-reference.com', port=443): Max retries exceeded with url: /wg.fcgi?css=1&site=bbr&url=%2Fboxscores%2F201501050LAC.html&div=div_box-ATL-game-basic (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f50d038d0a0>: Failed to establish a new connection: [Errno -2] Name or service not known'))
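
The loop above only catches `ValueError`, so a single network failure ends the whole run after thousands of requests. One way to make it more forgiving is to wrap each request in a small retry helper. The sketch below is hypothetical: `fetch_with_retry` is not part of the scraper, and it assumes the network errors derive from `OSError`, as the exceptions in the `requests` library do.

```python
import time

def fetch_with_retry(fetch, retries=3, delay=2.0, exceptions=(OSError, ValueError)):
    """Call fetch() up to `retries` times, pausing between failed attempts.

    Returns the result of fetch(), or None if every attempt failed.
    `fetch` is any zero-argument callable (hypothetical helper, not scraper API).
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except exceptions as err:
            print(f'Attempt {attempt} failed: {err!r}')
            if attempt < retries:
                # Wait a little longer after each failure.
                time.sleep(delay * attempt)
    return None

# Hypothetical use inside the loop above:
# box = fetch_with_retry(
#     lambda: get_box_scores(str(row['DATE']), team, row['OPPONENT'], stat_type='BASIC'))
# if box is not None:
#     master.at[idx, 'FGA'] = box[team].iloc[-1].loc['FGA']
```

Pausing between requests also keeps the load on the site down, and saving the table to CSV every few hundred rows would keep partial progress if the run still fails.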

In [None]:
master

### 2-Game Series
Having a brother who works in the NBA scheduling department has its advantages, and this is one of them. My brother says that the NBA schedule will feature a lot more 2-game series. For example, when Milwaukee goes to Dallas, they'll play two games back to back. What does playing a series imply for the results of the two games? More points or fewer?
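
To study this, the second game of each series first has to be identified in the tidy table. A minimal sketch on made-up rows follows. The definition is an assumption: a game counts as the second leg when the team's previous game was against the same opponent at the same venue. The `SERIES_GAME_2` column name is hypothetical.

```python
import pandas as pd

# Made-up rows for one team; column names mirror the `master` table.
log = pd.DataFrame({
    'DATE': pd.to_datetime(['2022-01-01', '2022-01-03', '2022-01-05', '2022-01-07']),
    'TEAM': ['MIL', 'MIL', 'MIL', 'MIL'],
    'OPPONENT': ['DAL', 'DAL', 'HOU', 'DAL'],
    'HOME/VISITOR': ['VISITOR', 'VISITOR', 'VISITOR', 'HOME'],
}).sort_values(['TEAM', 'DATE'])

# Opponent and venue of each team's previous game (NaN for its first game).
prev = log.groupby('TEAM')[['OPPONENT', 'HOME/VISITOR']].shift(1)

# Assumed definition: same opponent and same venue as the previous game.
log['SERIES_GAME_2'] = (
    (log['OPPONENT'] == prev['OPPONENT'])
    & (log['HOME/VISITOR'] == prev['HOME/VISITOR'])
)

print(log)
```

On the real table the grouping would probably need to include `YEAR` as well, so that the last game of one season is not paired with the first game of the next.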

### Referees
To look at the referees, I will need to scrape some additional data that is not in the API; it is available at https://www.basketball-reference.com/referees/2022_register.html. Basketball Reference publishes a lot of stats related to referees, which means I don't have to create a ton of additional statistics myself, and I will probably use this data in the model. Unlike in other sports, NBA referees have a major impact on the game.

In [None]:
master['RESULT']

In [None]:
master['RESULT'][13717]