# Goal

The goal here is to predict the outcomes of NBA games to:
1. Practice scraping
2. Practice using neural nets

We'll use [Basketball Reference](https://www.basketball-reference.com/) for our stats, since they allow scraping as long as it isn't done in a manner that adversely impacts site performance or access. To avoid that, we'll pause between requests and store the results once scraping is complete, so we don't need to re-scrape data we already have. The trade-off is some ongoing maintenance on our end: as new results come in, we'll need to pick which games to scrape next. That's okay.
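The pause-and-store approach described above could look roughly like the sketch below. The names `download`, `fetch_cached`, `cache_dir`, and `pause_seconds` are hypothetical, and the 3-second default is an assumption rather than a limit published by the site. The `fetch` argument exists only so the caching logic can be exercised without touching the network.

```python
import time
from pathlib import Path


def download(url):
    """Fetch a page over the network and return its HTML as text."""
    import requests  # imported here so the cache logic can be tested offline

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def fetch_cached(url, cache_dir="cache", pause_seconds=3.0, fetch=download):
    """Return the page at `url`, reading a saved copy when one exists."""
    # Use the last path segment (e.g. 202110200CHO.html) as the file name
    cache_path = Path(cache_dir) / url.rstrip("/").split("/")[-1]
    if cache_path.exists():
        return cache_path.read_text(encoding="utf-8")

    html = fetch(url)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(html, encoding="utf-8")
    time.sleep(pause_seconds)  # pause only after a real request
    return html
```

Calling `fetch_cached` twice with the same URL should hit the site once and read from disk the second time.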

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

# Grabbing Data

First, we'll grab data from just one game for practice to make sure we can get what we want:
* Score
* Box Score
* Advanced Stats

In [2]:
test_game = requests.get("https://www.basketball-reference.com/boxscores/202110200CHO.html")

# Checking output to make sure info retrieved properly
test_game

<Response [200]>

In [207]:
game_soup = BeautifulSoup(test_game.content, 'html.parser')

We'll start by breaking down the data for one team (in this case, the Indiana Pacers).

In [208]:
IND = game_soup.select('div.section_wrapper.box-IND.box-IND-game')

In [209]:
IND_reg_subset = IND[0]

In [210]:
IND_reg_data = list(IND_reg_subset.find_all('table', class_='sortable'))[0]

In [211]:
players = IND_reg_data.find_all('th', class_='left')
players

[<th class="left" csk="Brogdon,Malcolm" data-append-csv="brogdma01" data-stat="player" scope="row"><a href="/players/b/brogdma01.html">Malcolm Brogdon</a></th>,
 <th class="left" csk="Sabonis,Domantas" data-append-csv="sabondo01" data-stat="player" scope="row"><a href="/players/s/sabondo01.html">Domantas Sabonis</a></th>,
 <th class="left" csk="Duarte,Chris" data-append-csv="duartch01" data-stat="player" scope="row"><a href="/players/d/duartch01.html">Chris Duarte</a></th>,
 <th class="left" csk="Turner,Myles" data-append-csv="turnemy01" data-stat="player" scope="row"><a href="/players/t/turnemy01.html">Myles Turner</a></th>,
 <th class="left" csk="Holiday,Justin" data-append-csv="holidju01" data-stat="player" scope="row"><a href="/players/h/holidju01.html">Justin Holiday</a></th>,
 <th class="left" csk="Craig,Torrey" data-append-csv="craigto01" data-stat="player" scope="row"><a href="/players/c/craigto01.html">Torrey Craig</a></th>,
 <th class="left" csk="Lamb,Jeremy" data-append-csv=

In [212]:
player_reg_stats = IND_reg_data.find_all('td', class_='right')
player_reg_stats

[<td class="right" csk="2418" data-stat="mp">40:18</td>,
 <td class="right" data-stat="fg">8</td>,
 <td class="right" data-stat="fga">20</td>,
 <td class="right" data-stat="fg_pct">.400</td>,
 <td class="right" data-stat="fg3">3</td>,
 <td class="right" data-stat="fg3a">10</td>,
 <td class="right" data-stat="fg3_pct">.300</td>,
 <td class="right" data-stat="ft">9</td>,
 <td class="right" data-stat="fta">9</td>,
 <td class="right" data-stat="ft_pct">1.000</td>,
 <td class="right iz" data-stat="orb">0</td>,
 <td class="right" data-stat="drb">4</td>,
 <td class="right" data-stat="trb">4</td>,
 <td class="right" data-stat="ast">11</td>,
 <td class="right iz" data-stat="stl">0</td>,
 <td class="right" data-stat="blk">1</td>,
 <td class="right" data-stat="tov">1</td>,
 <td class="right" data-stat="pf">2</td>,
 <td class="right" data-stat="pts">28</td>,
 <td class="right iz" data-stat="plus_minus">0</td>,
 <td class="right" csk="2359" data-stat="mp">39:19</td>,
 <td class="right" data-stat="f

In [213]:
IND_adv_subset = IND[1]

In [214]:
IND_adv_data = list(IND_adv_subset.find_all('table', class_='sortable'))[0]

In [215]:
player_adv_stats = IND_adv_data.find_all('td', class_='right')
player_adv_stats

[<td class="right" csk="2418" data-stat="mp">40:18</td>,
 <td class="right" data-stat="ts_pct">.584</td>,
 <td class="right" data-stat="efg_pct">.475</td>,
 <td class="right" data-stat="fg3a_per_fga_pct">.500</td>,
 <td class="right" data-stat="fta_per_fga_pct">.450</td>,
 <td class="right iz" data-stat="orb_pct">0.0</td>,
 <td class="right" data-stat="drb_pct">8.7</td>,
 <td class="right" data-stat="trb_pct">4.9</td>,
 <td class="right" data-stat="ast_pct">40.3</td>,
 <td class="right iz" data-stat="stl_pct">0.0</td>,
 <td class="right" data-stat="blk_pct">1.6</td>,
 <td class="right" data-stat="tov_pct">4.0</td>,
 <td class="right" data-stat="usg_pct">25.5</td>,
 <td class="right" data-stat="off_rtg">134</td>,
 <td class="right" data-stat="def_rtg">117</td>,
 <td class="right poptip" data-stat="bpm" data-tip="OBPM: 7.1&lt;br&gt; DBPM: -0.5&lt;br&gt; VORP: 7.2&lt;br&gt; &lt;em&gt;&lt;small&gt;VORP is prorated to 82 games&lt;/small&gt;&lt;/em&gt; ">6.5</td>,
 <td class="right" csk="235

# Converting data into a dataframe

Now that we have all of the stats for Indiana, let's build a dataframe.

In [216]:
cols = ['player', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', 'FT', 'FTA', 
        'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 
        '+/-', 'TS%', 'eFG%', '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%',
        'STL%', 'BLK%', 'TOV%', 'USG%', 'ORTg', 'DRTg', 'BPM']

In [217]:
players

[<th class="left" csk="Brogdon,Malcolm" data-append-csv="brogdma01" data-stat="player" scope="row"><a href="/players/b/brogdma01.html">Malcolm Brogdon</a></th>,
 <th class="left" csk="Sabonis,Domantas" data-append-csv="sabondo01" data-stat="player" scope="row"><a href="/players/s/sabondo01.html">Domantas Sabonis</a></th>,
 <th class="left" csk="Duarte,Chris" data-append-csv="duartch01" data-stat="player" scope="row"><a href="/players/d/duartch01.html">Chris Duarte</a></th>,
 <th class="left" csk="Turner,Myles" data-append-csv="turnemy01" data-stat="player" scope="row"><a href="/players/t/turnemy01.html">Myles Turner</a></th>,
 <th class="left" csk="Holiday,Justin" data-append-csv="holidju01" data-stat="player" scope="row"><a href="/players/h/holidju01.html">Justin Holiday</a></th>,
 <th class="left" csk="Craig,Torrey" data-append-csv="craigto01" data-stat="player" scope="row"><a href="/players/c/craigto01.html">Torrey Craig</a></th>,
 <th class="left" csk="Lamb,Jeremy" data-append-csv=

In [218]:
test = players[0]
players[0]

<th class="left" csk="Brogdon,Malcolm" data-append-csv="brogdma01" data-stat="player" scope="row"><a href="/players/b/brogdma01.html">Malcolm Brogdon</a></th>

In [219]:
test.string

'Malcolm Brogdon'

In [220]:
# Every header cell except the last one holds a player name
names = [player.string for player in players[:-1]]

names

['Malcolm Brogdon',
 'Domantas Sabonis',
 'Chris Duarte',
 'Myles Turner',
 'Justin Holiday',
 'Torrey Craig',
 'Jeremy Lamb',
 'T.J. McConnell',
 'Isaiah Jackson',
 'Goga Bitadze',
 'Oshae Brissett',
 'Brad Wanamaker',
 'Duane Washington']

In [221]:
player_reg_stats[0]

<td class="right" csk="2418" data-stat="mp">40:18</td>

In [222]:
# Keep only the minutes-played cells
mp_list = [cell.string for cell in player_reg_stats
           if cell.get('data-stat') == 'mp']

# The last value is the team total (240 in a regulation game), so drop it
mp_list = mp_list[:-1]
mp_list

['40:18', '39:19', '32:42', '25:49', '25:43', '27:48', '24:14', '24:07']
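The minutes come back as `MM:SS` strings, which a model can't use directly. A minimal sketch of one way to convert them to decimal minutes follows; `minutes_played` and `sample_mp` are hypothetical names. As a cross-check, the `csk` attribute on the first cell above (`2418`) matches 40:18 expressed in seconds.

```python
def minutes_played(mp):
    """Convert an 'MM:SS' string such as '40:18' into decimal minutes."""
    minutes, seconds = mp.split(":")
    return int(minutes) + int(seconds) / 60


# Values copied from the mp_list output above
sample_mp = ['40:18', '39:19', '32:42', '25:49', '25:43', '27:48', '24:14', '24:07']
sample_minutes = [round(minutes_played(mp), 2) for mp in sample_mp]
print(sample_minutes)
```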

Looking through the HTML more closely, the `data-stat` values for the regular stats are:
* `mp`
* `fg`
* `fga`
* `fg_pct`
* `fg3`
* `fg3a`
* `fg3_pct`
* `ft`
* `fta`
* `ft_pct`
* `orb`
* `drb`
* `trb`
* `ast`
* `stl`
* `blk`
* `tov`
* `pf`
* `pts`
* `plus_minus`

And the following values for advanced stats:
* `ts_pct`
* `efg_pct`
* `fg3a_per_fga_pct`
* `fta_per_fga_pct`
* `orb_pct`
* `drb_pct`
* `trb_pct`
* `ast_pct`
* `stl_pct`
* `blk_pct`
* `tov_pct`
* `usg_pct`
* `off_rtg`
* `def_rtg`
* `bpm`
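These `data-stat` values line up one-to-one, in order, with the labels in `cols` (everything after `'player'`). A small sketch of that mapping is below; `reg_names`, `adv_names`, `labels`, and `label_for` are hypothetical names used only for illustration.

```python
# data-stat values, in the order they appear in each table row
reg_names = ['mp', 'fg', 'fga', 'fg_pct', 'fg3', 'fg3a', 'fg3_pct', 'ft',
             'fta', 'ft_pct', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk',
             'tov', 'pf', 'pts', 'plus_minus']
adv_names = ['ts_pct', 'efg_pct', 'fg3a_per_fga_pct', 'fta_per_fga_pct',
             'orb_pct', 'drb_pct', 'trb_pct', 'ast_pct', 'stl_pct',
             'blk_pct', 'tov_pct', 'usg_pct', 'off_rtg', 'def_rtg', 'bpm']

# Column labels from `cols`, minus the leading 'player'
labels = ['MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', 'FT', 'FTA', 'FT%',
          'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', '+/-',
          'TS%', 'eFG%', '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%',
          'STL%', 'BLK%', 'TOV%', 'USG%', 'ORTg', 'DRTg', 'BPM']

# Pair each data-stat value with its display label
label_for = dict(zip(reg_names + adv_names, labels))
print(label_for['fg3a_per_fga_pct'], label_for['plus_minus'])
```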

Let's initialize some lists to fill while iterating through the extracted data.

In [223]:
# Regular stats
mp_list = []
fg_list = []
fga_list = []
fg_pct_list = []
fg3_list = []
fg3a_list = []
fg3_pct_list = []
ft_list = []
fta_list = []
ft_pct_list = []
orb_list = []
drb_list = []
trb_list = []
ast_list = []
stl_list = []
blk_list = []
tov_list = []
pf_list = []
pts_list = []
pm_list = []

# Advanced Stats
ts_pct_list = []
efg_pct_list = []
fg3a_per_fga_pct_list = []
fta_per_fga_pct_list = []
orb_pct_list = []
drb_pct_list = []
trb_pct_list = []
ast_pct_list = []
stl_pct_list = []
blk_pct_list = []
tov_pct_list = []
usg_pct_list = []
off_rtg_list = []
def_rtg_list = []
bpm_list = []


reg_stat_list = [mp_list, fg_list, fga_list, fg_pct_list, fg3_list, fg3a_list,
                 fg3_pct_list, ft_list, fta_list, ft_pct_list, orb_list, 
                 drb_list, trb_list, ast_list, stl_list, blk_list, tov_list, 
                 pf_list, pts_list, pm_list]
    
adv_stat_list = [ts_pct_list, efg_pct_list, fg3a_per_fga_pct_list, 
                 fta_per_fga_pct_list, orb_pct_list, drb_pct_list, trb_pct_list,
                 ast_pct_list, stl_pct_list, blk_pct_list, tov_pct_list, 
                 usg_pct_list, off_rtg_list, def_rtg_list, bpm_list]

reg_stat_val_list = ['mp', 'fg', 'fga', 'fg_pct', 'fg3', 'fg3a', 'fg3_pct', 
                     'ft','fta', 'ft_pct', 'orb', 'drb', 'trb', 'ast', 'stl',
                     'blk', 'tov', 'pf', 'pts', 'plus_minus']

adv_stat_val_list = ['ts_pct', 'efg_pct', 'fg3a_per_fga_pct', 
                     'fta_per_fga_pct', 'orb_pct', 'drb_pct', 'trb_pct',
                     'ast_pct', 'stl_pct', 'blk_pct', 'tov_pct', 'usg_pct',
                     'off_rtg', 'def_rtg', 'bpm']

We'll loop through each extracted cell and append its value to the list that matches its `data-stat`.

In [224]:
# Loop through regular stats, appending each cell to the list for its data-stat
for cell in player_reg_stats:
    stat = cell.get('data-stat')
    if stat in reg_stat_val_list:
        reg_stat_list[reg_stat_val_list.index(stat)].append(cell.string)

# Loop through advanced stats
for cell in player_adv_stats:
    stat = cell.get('data-stat')
    if stat in adv_stat_val_list:
        adv_stat_list[adv_stat_val_list.index(stat)].append(cell.string)
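As an alternative to keeping parallel lists, the same grouping can be sketched with a dictionary keyed by `data-stat`. This is only an illustration on a hand-written snippet, not the real page: `sample_html` is a cut-down stand-in that assumes the same `th`/`td` layout seen in the output above, and `stats_by_name` is a hypothetical name.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Cut-down stand-in for one box-score table: two players plus the team totals
sample_html = """
<table>
  <tr><th class="left" data-stat="player">Malcolm Brogdon</th>
      <td class="right" data-stat="mp">40:18</td>
      <td class="right" data-stat="pts">28</td></tr>
  <tr><th class="left" data-stat="player">Domantas Sabonis</th>
      <td class="right" data-stat="mp">39:19</td>
      <td class="right" data-stat="pts">33</td></tr>
  <tr><th class="left" data-stat="player">Team Totals</th>
      <td class="right" data-stat="mp">240</td>
      <td class="right" data-stat="pts">122</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# One pass over the cells: each value lands in the list for its data-stat
stats_by_name = defaultdict(list)
for cell in soup.find_all("td", class_="right"):
    stats_by_name[cell.get("data-stat")].append(cell.string)

print(dict(stats_by_name))
```

With this shape there is no need to declare one list per stat up front, and a stat can be looked up by name instead of by position.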

In [225]:
# Checking output to make sure everything went okay:
print(mp_list)
print(bpm_list)

['40:18', '39:19', '32:42', '25:49', '25:43', '27:48', '24:14', '24:07', '240']
['6.5', '3.4', '8.5', '-5.1', '-0.1', '-0.6', '-16.0', '3.5', None]


Since the team totals are the last element of each stat list, we'll first copy them into their own list and then drop that element from each stat list.

In [226]:
total_ind_stats_reg = [stat[-1] for stat in reg_stat_list]
total_ind_stats_adv = [stat[-1] for stat in adv_stat_list]

total_ind_stats = total_ind_stats_reg + total_ind_stats_adv
print(total_ind_stats)

['240', '42', '90', '.467', '17', '47', '.362', '21', '24', '.875', '8', '43', '51', '29', '2', '10', '16', '24', '122', None, '.607', '.561', '.522', '.267', '19.0', '78.2', '52.6', '69.0', '1.8', '13.2', '13.7', '100.0', '112.2', '113.2', None]


In [227]:
# Drop the team-total entry from the end of every stat list
reg_stat_list = [stat[:-1] for stat in reg_stat_list]
adv_stat_list = [stat[:-1] for stat in adv_stat_list]

In [228]:
adv_stat_list

[['.584', '.795', '.827', '.580', '.506', '.512', '.167', '.429'],
 ['.475', '.789', '.800', '.583', '.450', '.375', '.167', '.429'],
 ['.500', '.316', '.600', '.500', '.700', '1.000', '.667', '.286'],
 ['.450', '.211', '.200', '.667', '.200', '.500', '.000', '.000'],
 ['0.0', '8.7', '0.0', '4.4', '4.4', '12.3', '0.0', '0.0'],
 ['8.7', '26.6', '13.3', '20.3', '20.4', '15.7', '7.2', '10.9'],
 ['4.9', '18.9', '7.6', '13.4', '13.5', '14.2', '4.1', '6.2'],
 ['40.3', '9.3', '5.1', '5.1', '21.6', '12.9', '0.0', '38.7'],
 ['0.0', '0.0', '1.4', '0.0', '0.0', '0.0', '0.0', '1.8'],
 ['1.6', '0.0', '0.0', '9.8', '0.0', '2.3', '5.2', '5.2'],
 ['4.0', '22.4', '5.8', '34.0', '0.0', '17.0', '18.2', '12.5'],
 ['25.5', '28.0', '21.8', '18.8', '17.4', '8.7', '18.7', '13.7'],
 ['134', '117', '145', '79', '124', '129', '29', '112'],
 ['117', '113', '114', '106', '115', '114', '114', '109'],
 ['6.5', '3.4', '8.5', '-5.1', '-0.1', '-0.6', '-16.0', '3.5']]

In [229]:
reg_stat_list

[['40:18', '39:19', '32:42', '25:49', '25:43', '27:48', '24:14', '24:07'],
 ['8', '13', '9', '3', '4', '1', '1', '3'],
 ['20', '19', '15', '6', '10', '4', '9', '7'],
 ['.400', '.684', '.600', '.500', '.400', '.250', '.111', '.429'],
 ['3', '4', '6', '1', '1', '1', '1', '0'],
 ['10', '6', '9', '3', '7', '4', '6', '2'],
 ['.300', '.667', '.667', '.333', '.143', '.250', '.167', '.000'],
 ['9', '3', '3', '2', '2', '2', '0', '0'],
 ['9', '4', '3', '4', '2', '2', '0', '0'],
 ['1.000', '.750', '1.000', '.500', '1.000', '1.000', None, None],
 ['0', '3', '0', '1', '1', '3', '0', '0'],
 ['4', '12', '5', '6', '6', '5', '2', '3'],
 ['4', '15', '5', '7', '7', '8', '2', '3'],
 ['11', '2', '1', '1', '4', '3', '0', '7'],
 ['0', '0', '1', '0', '0', '0', '0', '1'],
 ['1', '0', '0', '4', '0', '1', '2', '2'],
 ['1', '6', '1', '4', '0', '1', '2', '1'],
 ['2', '4', '4', '4', '3', '4', '2', '1'],
 ['28', '33', '27', '9', '11', '5', '3', '6'],
 ['0', '-1', '+7', '-8', '+3', '-3', '+5', '-8']]

In [230]:
ind_player_stats = reg_stat_list + adv_stat_list
ind_player_stats

[['40:18', '39:19', '32:42', '25:49', '25:43', '27:48', '24:14', '24:07'],
 ['8', '13', '9', '3', '4', '1', '1', '3'],
 ['20', '19', '15', '6', '10', '4', '9', '7'],
 ['.400', '.684', '.600', '.500', '.400', '.250', '.111', '.429'],
 ['3', '4', '6', '1', '1', '1', '1', '0'],
 ['10', '6', '9', '3', '7', '4', '6', '2'],
 ['.300', '.667', '.667', '.333', '.143', '.250', '.167', '.000'],
 ['9', '3', '3', '2', '2', '2', '0', '0'],
 ['9', '4', '3', '4', '2', '2', '0', '0'],
 ['1.000', '.750', '1.000', '.500', '1.000', '1.000', None, None],
 ['0', '3', '0', '1', '1', '3', '0', '0'],
 ['4', '12', '5', '6', '6', '5', '2', '3'],
 ['4', '15', '5', '7', '7', '8', '2', '3'],
 ['11', '2', '1', '1', '4', '3', '0', '7'],
 ['0', '0', '1', '0', '0', '0', '0', '1'],
 ['1', '0', '0', '4', '0', '1', '2', '2'],
 ['1', '6', '1', '4', '0', '1', '2', '1'],
 ['2', '4', '4', '4', '3', '4', '2', '1'],
 ['28', '33', '27', '9', '11', '5', '3', '6'],
 ['0', '-1', '+7', '-8', '+3', '-3', '+5', '-8'],
 ['.584', '.795'

We'll start the dataframe with the names column.

In [231]:
ind_df = pd.DataFrame(names, columns=['player'])

Now we'll add the rest of the stats to our dataframe.

In [232]:
stat_cols = cols[1:]

In [234]:
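# Add one column per stat; concat aligns on the row index, so players
# without a stat line are left as NaN
for i, col in enumerate(stat_cols):
    stat_df = pd.DataFrame(ind_player_stats[i], columns=[col])
    ind_df = pd.concat([ind_df, stat_df], axis=1)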
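The column-by-column loop can also be collapsed into a single construction. The sketch below uses a three-player, two-stat subset of the values above; `sample_names`, `sample_cols`, `sample_stats`, and `box_df` are hypothetical names. It relies on the same assumption as the loop: stat rows match the names by position, so players without a stat line must come last.

```python
import pandas as pd

# Subset of the data above: two players with stats, one who did not play
sample_names = ['Malcolm Brogdon', 'Domantas Sabonis', 'Isaiah Jackson']
sample_cols = ['MP', 'PTS']
sample_stats = [['40:18', '39:19'], ['28', '33']]

# Build all stat columns at once from {column label: list of values}
stats_df = pd.DataFrame(dict(zip(sample_cols, sample_stats)))

# Joining on the row index leaves NaN where a player has no stat line
box_df = pd.concat([pd.DataFrame({'player': sample_names}), stats_df], axis=1)
print(box_df)
```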

In [237]:
ind_df

Unnamed: 0,player,MP,FG,FGA,FG%,3P,3PA,3P%,FT,FTA,...,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,ORTg,DRTg,BPM
0,Malcolm Brogdon,40:18,8.0,20.0,0.4,3.0,10.0,0.3,9.0,9.0,...,8.7,4.9,40.3,0.0,1.6,4.0,25.5,134.0,117.0,6.5
1,Domantas Sabonis,39:19,13.0,19.0,0.684,4.0,6.0,0.667,3.0,4.0,...,26.6,18.9,9.3,0.0,0.0,22.4,28.0,117.0,113.0,3.4
2,Chris Duarte,32:42,9.0,15.0,0.6,6.0,9.0,0.667,3.0,3.0,...,13.3,7.6,5.1,1.4,0.0,5.8,21.8,145.0,114.0,8.5
3,Myles Turner,25:49,3.0,6.0,0.5,1.0,3.0,0.333,2.0,4.0,...,20.3,13.4,5.1,0.0,9.8,34.0,18.8,79.0,106.0,-5.1
4,Justin Holiday,25:43,4.0,10.0,0.4,1.0,7.0,0.143,2.0,2.0,...,20.4,13.5,21.6,0.0,0.0,0.0,17.4,124.0,115.0,-0.1
5,Torrey Craig,27:48,1.0,4.0,0.25,1.0,4.0,0.25,2.0,2.0,...,15.7,14.2,12.9,0.0,2.3,17.0,8.7,129.0,114.0,-0.6
6,Jeremy Lamb,24:14,1.0,9.0,0.111,1.0,6.0,0.167,0.0,0.0,...,7.2,4.1,0.0,0.0,5.2,18.2,18.7,29.0,114.0,-16.0
7,T.J. McConnell,24:07,3.0,7.0,0.429,0.0,2.0,0.0,0.0,0.0,...,10.9,6.2,38.7,1.8,5.2,12.5,13.7,112.0,109.0,3.5
8,Isaiah Jackson,,,,,,,,,,...,,,,,,,,,,
9,Goga Bitadze,,,,,,,,,,...,,,,,,,,,,
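The scraped values are strings, while the frame displayed above shows floats, so a conversion happened somewhere outside the cells shown here. One possible way to do it is sketched below using `pd.to_numeric`; this is an assumption about the approach, not necessarily what was run. The `sample` frame and `numeric_cols` are hypothetical, with values copied from the lists above.

```python
import pandas as pd

sample = pd.DataFrame({
    'player': ['Malcolm Brogdon', 'Jeremy Lamb', 'Isaiah Jackson'],
    'FT%': ['1.000', None, None],   # None: no free throws attempted, or did not play
    '+/-': ['0', '+5', None],
    'BPM': ['6.5', '-16.0', None],
})

# Convert stat columns to numbers; anything unparseable becomes NaN
numeric_cols = ['FT%', '+/-', 'BPM']
sample[numeric_cols] = sample[numeric_cols].apply(pd.to_numeric, errors='coerce')

# Players with no stat line at all can be dropped using a stat that is
# always present for anyone who played
played = sample.dropna(subset=['BPM'])
print(played)
```

Note that a missing `FT%` does not mean the player sat out, which is why the filter uses `BPM` rather than dropping every row that contains a NaN.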


Next steps: figure out how to do the same thing for the opposing team. Once we can handle both teams, we should be able to grab all of the data and save it to a local file, so we stop hitting Basketball Reference every time we want to work on this.
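A possible starting point for both steps is sketched below. It assumes the opposing team's wrappers follow the same class pattern as Indiana's, with only the abbreviation changed (`CHO` for this game, going by the URL), which has not been verified here. The helper names `team_box_selector`, `save_box_score`, and `load_box_score` are hypothetical.

```python
import pandas as pd


def team_box_selector(team):
    """Build the CSS selector for one team's box-score wrappers, e.g. 'CHO'."""
    return f"div.section_wrapper.box-{team}.box-{team}-game"


def save_box_score(df, path):
    """Write a box score to CSV without the positional index."""
    df.to_csv(path, index=False)


def load_box_score(path):
    """Read a box score previously written by save_box_score."""
    return pd.read_csv(path)
```

Passing `index=False` keeps the row index out of the file. Saving with the index and reading the file back without `index_col=0` is what produces an extra `Unnamed: 0` column.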