# Web Scraping in Python with BeautifulSoup: Collecting NBA Team Box Scores

## Introduction

This project uses Python and BeautifulSoup to scrape ESPN for the box scores of regular season games for NBA teams. The box score is a selected set of statistics which summarize the results of a game.

This data will be useful for answering questions such as:
- Which NBA statistics are actually useful for determining wins?
- Can we predict whether a team will make the playoffs?
- Can we predict whether a team will win a game?

The answers to these questions will be the most useful to a team or a coach when trying to determine what aspects of gameplay to improve upon. If securing more offensive rebounds appears to be more important to winning a game than the percentage of successful free throws, then teams should be drilling rebounds instead of practicing free throws.

"**REWRITE** ...... In the sport of basketball, the box score is used to summarize/average the data of Games played (GP), Games started (GS), Minutes Played (MIN or MPG), Field-goals made (FGM), Field-goals attempted (FGA), Field-goal percentage (FG%), 3-pointers made (3PM), 3-pointers attempted (3PA), 3-point field goal (3P%), Free throws made (FTM), Free throws attempted (FTA), Free throw percentage (FT%), Offensive Rebounds (OREB), Defensive Rebounds (DREB), Total rebounds (REB), Assists (AST), Turnovers (TOV), Steals (STL), Blocked shots (BLK), Personal fouls (PF), Points scored (PTS), and Plus/Minus for Player efficiency (+/-)."

## Scrape ESPN Team Page for Team Names

First, we're going to need a list of the NBA teams, so let's look at the strucure of the ESPN website. The [teams page](https://www.espn.com/nba/teams) looks like the following:
<img src="../images/espnTeamPage.png"/>

Inspection of the page shows that links for each team are in a section container with a 'pl3' class. 
<img src="../images/teamContainer.png"/>
So let's open the team page, parse it, and find all containers of that class.

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from urllib import request
import time

team_url = request.urlopen('http://www.espn.com/nba/teams').read()
team_soup = BeautifulSoup(team_url,'lxml')
team_containers = team_soup.find_all('div',{'class':'pl3'})

To get the team name, we grab the text inside the \<a\> tag inside each team's section container. Then we'll get the abbreviation for each team by splitting on '/' in a team's URL and grabbing the 7th element. Next, we'll create a dictionary of the teams and their abbreviations. This will come in handy later.

In [2]:
team_names = [team.a.text for team in team_containers]
team_abbrs = [team.find_all('a',href=True)[1]['href'].split('/')[6] for team in team_containers]
teams = dict(zip(team_abbrs,team_names))

Let's make sure we got all 30 teams.

In [3]:
print(len(teams))
teams

30


{'bos': 'Boston Celtics',
 'bkn': 'Brooklyn Nets',
 'ny': 'New York Knicks',
 'phi': 'Philadelphia 76ers',
 'tor': 'Toronto Raptors',
 'chi': 'Chicago Bulls',
 'cle': 'Cleveland Cavaliers',
 'det': 'Detroit Pistons',
 'ind': 'Indiana Pacers',
 'mil': 'Milwaukee Bucks',
 'den': 'Denver Nuggets',
 'min': 'Minnesota Timberwolves',
 'okc': 'Oklahoma City Thunder',
 'por': 'Portland Trail Blazers',
 'utah': 'Utah Jazz',
 'gs': 'Golden State Warriors',
 'lac': 'LA Clippers',
 'lal': 'Los Angeles Lakers',
 'phx': 'Phoenix Suns',
 'sac': 'Sacramento Kings',
 'atl': 'Atlanta Hawks',
 'cha': 'Charlotte Hornets',
 'mia': 'Miami Heat',
 'orl': 'Orlando Magic',
 'wsh': 'Washington Wizards',
 'dal': 'Dallas Mavericks',
 'hou': 'Houston Rockets',
 'mem': 'Memphis Grizzlies',
 'no': 'New Orleans Pelicans',
 'sa': 'San Antonio Spurs'}

## Scraping ESPN Team Schedule Page for Game IDs

The page containing a game's box score is of the following format:<br>
https://www.espn.com/nba/matchup?gameId=401070218.<br>
So we need to get a list of all the game IDs. We do this by looking at the schedule page of each team.

A team's regular season schedule is listed on a URL with the following format: https://www.espn.com/nba/team/schedule/_/name/ABBR/season/YEAR/seasontype/2, where ABBR (team abbreviation) and YEAR are our variables of interest.

The class for each table row of a team's schedule contains the string "Table__TR". So we'll search for that and exclude the first row which are just the table's column names.

Then we'll grab the dates and IDs of each game.

The opponent names and the game locations are stored in a section element of class "flex items-center opponent-logo". Opponent names are just the title of the image. Game locations are indicated by "@" or "vs" for away and home, respectively.

In [4]:
# For the 2018-2019 regular season
year = '2019'

team_schedules = dict()
for team_abbr in list(teams.keys()):
    url = 'https://www.espn.com/nba/team/schedule/_/name/'+team_abbr+'/season/'+year+'/seasontype/2'
    schedule_url = request.urlopen(url).read()
    time.sleep(1)
    schedule_soup = BeautifulSoup(schedule_url,'lxml')
    
    game_rows = schedule_soup.select('tr[class*="Table__TR"]')[1:]
    game_dates = [game.td.span.text for game in game_rows]
    game_ids = [game.find_all('a',href=True)[2]['href'].split('=')[1] for game in game_rows]

    game_logos = [game.find_all('div', attrs={'class':'flex items-center opponent-logo'}) for game in game_rows]
    game_locs = [game[0].span.text for game in game_logos]
    
    game_results = [game.text for game in schedule_soup.select('span[class*="fw-bold clr-"]')]
    
    schedule = dict(zip(game_ids, list(zip(game_dates, game_locs, game_results))))
    team_schedules.setdefault(team_abbr, schedule)

# Scraping the Box Scores

There are three tables of class "mod-data" on the Team Matchup page. We're interested in the first table which contains the team box scores for this particular game.

In [5]:
def get_box_score(table_data):
    box_data = []
    for tr in table_data:
        td = tr.find_all('td')
        row = [tr.text.strip('\t\n') for tr in td]
        box_data.append(row) 
    return box_data

In [43]:
for team_abbr in list(teams.keys())[9:10]:
#     for game_id in list(team_schedules[team_abbr].keys()):
    for game_id in ['401070856','401070693']:
        retries = 3
        try:
            box_url = request.urlopen('https://www.espn.com/nba/matchup?gameId='+game_id).read()
            time.sleep(1)
            box_soup = BeautifulSoup(box_url,'lxml')

            # Table containing the box score data
            box_table = box_soup.find_all('table',attrs={'class':'mod-data'})[0]

            home_team_container = box_soup.find_all('div', attrs={'class':'team home'})
            home_team_abbr = home_team_container[0].find_all('span', attrs={'class':'abbrev'})[0].text.lower()
            home_team_pts = box_soup.find_all('div', attrs={'class': 'score icon-font-before'})[0].text

            away_team_container = box_soup.find_all('div', attrs={'class':'team away'})
            away_team_abbr = away_team_container[0].find_all('span', attrs={'class':'abbrev'})[0].text.lower()
            away_team_pts = box_soup.find_all('div', attrs={'class': 'score icon-font-after'})[0].text

            points = ['PTS', away_team_pts, home_team_pts]

            # Table rows with an indent class
            indents = box_table.find_all('tr', attrs={'class':'indent'})
            # Table rows with a highlight class
            highlights = box_table.find_all('tr', attrs={'class':'highlight'})

            highlights_data = get_box_score(highlights)
            indents_data = get_box_score(indents)

            box_data = np.concatenate(([points],highlights_data,indents_data)).T

            if team_abbr == away_team_abbr:
                row = np.concatenate(([team_abbr,teams[team_abbr],game_id],team_schedules[team_abbr][game_id],box_data[1],
                                      [home_team_abbr,teams[home_team_abbr]],box_data[2]))
            else:
                row = np.concatenate(([team_abbr,teams[team_abbr],game_id],team_schedules[team_abbr][game_id],box_data[2],
                                      [away_team_abbr,teams[away_team_abbr]],box_data[1]))

            final_data1.append(row)
#         except:
#             continue

IndexError: list index out of range

In [6]:
final_data = []

In [33]:
final_data1 =[]

In [43]:
for team_abbr in list(teams.keys())[9:10]:
#     for game_id in list(team_schedules[team_abbr].keys()):
    for game_id in ['401070856','401070693']:
        retries = 3
        try:
            box_url = request.urlopen('https://www.espn.com/nba/matchup?gameId='+game_id).read()
            time.sleep(1)
            box_soup = BeautifulSoup(box_url,'lxml')

            # Table containing the box score data
            box_table = box_soup.find_all('table',attrs={'class':'mod-data'})[0]

            home_team_container = box_soup.find_all('div', attrs={'class':'team home'})
            home_team_abbr = home_team_container[0].find_all('span', attrs={'class':'abbrev'})[0].text.lower()
            home_team_pts = box_soup.find_all('div', attrs={'class': 'score icon-font-before'})[0].text

            away_team_container = box_soup.find_all('div', attrs={'class':'team away'})
            away_team_abbr = away_team_container[0].find_all('span', attrs={'class':'abbrev'})[0].text.lower()
            away_team_pts = box_soup.find_all('div', attrs={'class': 'score icon-font-after'})[0].text

            points = ['PTS', away_team_pts, home_team_pts]

            # Table rows with an indent class
            indents = box_table.find_all('tr', attrs={'class':'indent'})
            # Table rows with a highlight class
            highlights = box_table.find_all('tr', attrs={'class':'highlight'})

            highlights_data = get_box_score(highlights)
            indents_data = get_box_score(indents)

            box_data = np.concatenate(([points],highlights_data,indents_data)).T

            if team_abbr == away_team_abbr:
                row = np.concatenate(([team_abbr,teams[team_abbr],game_id],team_schedules[team_abbr][game_id],box_data[1],
                                      [home_team_abbr,teams[home_team_abbr]],box_data[2]))
            else:
                row = np.concatenate(([team_abbr,teams[team_abbr],game_id],team_schedules[team_abbr][game_id],box_data[2],
                                      [away_team_abbr,teams[away_team_abbr]],box_data[1]))

            final_data1.append(row)
#         except:
#             continue

IndexError: list index out of range

In [39]:

final_data[-1]

array(['tor', 'Toronto Raptors', '401071887', 'Tue, Apr 9', '@', 'W',
       '120', '46-88', '52.3', '16-37', '43.2', '12-17', '70.6', '64',
       '24', '10', '5', '18', '19', '48', '17', '32', '7', '47', '17',
       '0', '0', 'min', 'Minnesota Timberwolves', '100', '38-91', '41.8',
       '13-42', '31.0', '11-14', '78.6', '39', '26', '10', '9', '12',
       '14', '36', '16', '7', '5', '30', '14', '0', '0'], dtype='<U22')

In [19]:
retry_on_exceptions = (requests.exceptions.HTTPError, requests.exceptions.ConnectionError)
except retry_on_exceptions:
    time.sleep(5)
else:
    continue

'401070856'

In [41]:
team_schedules['mil']['401070856']

('Sat, Nov 10', '@', 'L')

In [42]:
box_table = box_soup.find_all('table',attrs={'class':'mod-data'})[0]

In [10]:
box_data[0]

array(['PTS', 'FG', 'Field Goal %', '3PT', 'Three Point %', 'FT',
       'Free Throw %', 'Rebounds', 'Assists', 'Steals', 'Blocks',
       'Total Turnovers', 'Fast Break Points', 'Points in Paint', 'Fouls',
       'Largest Lead', 'Offensive Rebounds', 'Defensive Rebounds',
       'Points Off Turnovers', 'Technical Fouls', 'Flagrant Fouls'],
      dtype='<U20')

In [6]:
col_names = ['teamABBR', 'teamName','gameID', 'gameDate', 'gameLoc', 'teamResult', 'teamPTS', 'teamFG', 'teamFG%',
             'team3PT', 'team3PT%', 'teamFT', 'teamFT%', 'teamTREB', 'teamASST', 'teamSTL', 'teamBLK',
             'teamTO', 'teamFB_PTS', 'teamPNT_PTS', 'teamFOUL', 'teamLG_LEAD', 'teamOREB', 'teamDREB',
             'teamTO_PTS', 'teamFOUL_T', 'teamFOUL_F',
             'opptABBR', 'opptName', 'opptPTS', 'opptFG', 'opptFG%',
             'oppt3PT', 'oppt3PT%', 'opptFT', 'opptFT%', 'opptTREB', 'opptASST', 'opptSTL', 'opptBLK',
             'opptTO', 'opptFB_PTS', 'opptPNT_PTS', 'opptFOUL', 'opptLG_LEAD', 'opptOREB', 'opptDREB',
             'opptTO_PTS', 'opptFOUL_T', 'opptFOUL_F']

In [7]:
df = pd.DataFrame(final_data, columns=col_names)

In [8]:
df

Unnamed: 0,teamABBR,teamName,gameID,gameDate,gameLoc,teamResult,teamPTS,teamFG,teamFG%,team3PT,...,opptTO,opptFB_PTS,opptPNT_PTS,opptFOUL,opptLG_LEAD,opptOREB,opptDREB,opptTO_PTS,opptFOUL_T,opptFOUL_F
0,bos,Boston Celtics,401070213,"Tue, Oct 16",vs,W,105,42-97,43.3,11-37,...,16,16,50,20,4,6,41,14,0,0
1,bos,Boston Celtics,401070219,"Fri, Oct 19",@,L,101,40-99,40.4,14-36,...,13,14,46,19,12,12,37,11,0,0
2,bos,Boston Celtics,401070711,"Sat, Oct 20",@,W,103,33-82,40.2,9-25,...,16,11,40,25,1,11,35,15,0,0
3,bos,Boston Celtics,401070721,"Mon, Oct 22",vs,L,90,37-91,40.7,9-40,...,9,6,42,15,13,8,41,9,0,0
4,bos,Boston Celtics,401070746,"Thu, Oct 25",@,W,101,33-86,38.4,11-32,...,16,10,42,27,16,16,41,16,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77,bos,Boston Celtics,401071826,"Mon, Apr 1",vs,W,110,36-91,39.6,17-40,...,15,13,56,17,0,9,39,18,1,0
78,bos,Boston Celtics,401071843,"Wed, Apr 3",@,W,112,38-80,47.5,9-24,...,15,4,46,23,3,16,31,12,2,0
79,bos,Boston Celtics,401071856,"Fri, Apr 5",@,W,117,48-92,52.2,7-27,...,11,16,54,15,3,10,35,14,0,0
80,bos,Boston Celtics,401071876,"Sun, Apr 7",vs,L,108,42-89,47.2,12-31,...,16,17,54,14,14,11,34,16,0,0


In [9]:
df.to_csv('../data/nba_team_box_scores_'+str(year)+'.csv')

2460

In [9]:
list(teams.keys())[0:5]

['bos', 'bkn', 'ny', 'phi', 'tor']

In [13]:
list(teams.keys())[5:]

['chi',
 'cle',
 'det',
 'ind',
 'mil',
 'den',
 'min',
 'okc',
 'por',
 'utah',
 'gs',
 'lac',
 'lal',
 'phx',
 'sac',
 'atl',
 'cha',
 'mia',
 'orl',
 'wsh',
 'dal',
 'hou',
 'mem',
 'no',
 'sa']