# Web Scraping in Python with BeautifulSoup: Collecting NBA Team Box Scores

## Introduction

This project uses Python and BeautifulSoup to scrape ESPN for the box scores of NBA teams' regular-season games. A box score is a selected set of statistics that summarizes the results of a game.

This data will be useful for answering questions such as:
- Which NBA statistics are actually useful for determining wins?
- Can we predict whether a team will make the playoffs?
- Can we predict whether a team will win a game?

The answers to these questions are most useful to a team or coach deciding which aspects of gameplay to improve. If securing more offensive rebounds turns out to matter more for winning than free throw percentage, then teams should drill rebounds rather than practice free throws.

In basketball, a box score summarizes (or, over a season, averages) statistics such as games played (GP), games started (GS), minutes played (MIN or MPG), field goals made (FGM), field goals attempted (FGA), field goal percentage (FG%), 3-pointers made (3PM), 3-pointers attempted (3PA), 3-point percentage (3P%), free throws made (FTM), free throws attempted (FTA), free throw percentage (FT%), offensive rebounds (OREB), defensive rebounds (DREB), total rebounds (REB), assists (AST), turnovers (TOV), steals (STL), blocked shots (BLK), personal fouls (PF), points scored (PTS), and plus/minus (+/-).

## Scrape ESPN Team Page for Team Names

First, we need a list of the NBA teams, so let's look at the structure of the ESPN website. The [teams page](https://www.espn.com/nba/teams) looks like the following:
<img src="../images/espnTeamPage.png"/>

Inspecting the page shows that the links for each team sit in a `div` container with the class 'pl3'.
<img src="../images/teamContainer.png"/>
So let's open the team page, parse it, and find all containers of that class.

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from urllib import request
import time

team_url = request.urlopen('http://www.espn.com/nba/teams').read()
team_soup = BeautifulSoup(team_url,'lxml')
team_containers = team_soup.find_all('div',{'class':'pl3'})

To get the team name, we grab the text inside the \<a\> tag inside each team's container. Then we get each team's abbreviation by splitting the URL of the container's second link on '/' and grabbing the 7th element. Finally, we create a dictionary mapping abbreviations to team names, which will come in handy later.

In [2]:
team_names = [team.a.text for team in team_containers]
team_abbrs = [team.find_all('a',href=True)[1]['href'].split('/')[6] for team in team_containers]
teams = dict(zip(team_abbrs,team_names))
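
Grabbing the 7th element works as long as every link has the same number of path segments. As a hedged alternative, the sketch below looks up the segment that follows `name` instead of counting positions. The helper `abbr_from_href` and the sample href are hypothetical, written to match the shape the cell above relies on; they are not taken from ESPN's markup.

```python
def abbr_from_href(href):
    """Return the path segment that follows 'name' in a team link."""
    parts = href.strip('/').split('/')
    # Find the 'name' marker instead of relying on a fixed index.
    # Raises ValueError if the link has no 'name' segment.
    return parts[parts.index('name') + 1]

# Hypothetical href, shaped like the links split in the cell above.
sample_href = '/nba/team/stats/_/name/bos/boston-celtics'

print(abbr_from_href(sample_href))   # segment after 'name'
print(sample_href.split('/')[6])     # fixed-index version, for comparison
```

Both calls return the same abbreviation for this sample; the lookup version simply keeps working if a link gains or loses a leading path segment.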

Let's make sure we got all 30 teams.

In [3]:
print(len(teams))
teams

30


{'bos': 'Boston Celtics',
 'bkn': 'Brooklyn Nets',
 'ny': 'New York Knicks',
 'phi': 'Philadelphia 76ers',
 'tor': 'Toronto Raptors',
 'chi': 'Chicago Bulls',
 'cle': 'Cleveland Cavaliers',
 'det': 'Detroit Pistons',
 'ind': 'Indiana Pacers',
 'mil': 'Milwaukee Bucks',
 'den': 'Denver Nuggets',
 'min': 'Minnesota Timberwolves',
 'okc': 'Oklahoma City Thunder',
 'por': 'Portland Trail Blazers',
 'utah': 'Utah Jazz',
 'gs': 'Golden State Warriors',
 'lac': 'LA Clippers',
 'lal': 'Los Angeles Lakers',
 'phx': 'Phoenix Suns',
 'sac': 'Sacramento Kings',
 'atl': 'Atlanta Hawks',
 'cha': 'Charlotte Hornets',
 'mia': 'Miami Heat',
 'orl': 'Orlando Magic',
 'wsh': 'Washington Wizards',
 'dal': 'Dallas Mavericks',
 'hou': 'Houston Rockets',
 'mem': 'Memphis Grizzlies',
 'no': 'New Orleans Pelicans',
 'sa': 'San Antonio Spurs'}

## Scraping ESPN Team Schedule Page for Game IDs

The page containing a game's box score is of the following format:<br>
https://www.espn.com/nba/matchup?gameId=401070218.<br>
So we need to get a list of all the game IDs. We do this by looking at the schedule page of each team.

A team's regular season schedule is listed on a URL with the following format: https://www.espn.com/nba/team/schedule/_/name/ABBR/season/YEAR/seasontype/2, where ABBR (team abbreviation) and YEAR are our variables of interest.

In [4]:
year = '2019'
team_abbr = 'gs'

schedule_url = request.urlopen('https://www.espn.com/nba/team/schedule/_/name/'+team_abbr+'/season/'+year+'/seasontype/2').read()
schedule_soup = BeautifulSoup(schedule_url,'lxml')
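
Since we'll eventually need this URL for every team and season, it may help to wrap the pattern in a small helper. This is a minimal sketch; the names `SCHEDULE_URL` and `build_schedule_url` are our own, and only the URL pattern itself comes from the description above.

```python
# URL pattern for a team's regular season schedule (seasontype 2).
SCHEDULE_URL = ('https://www.espn.com/nba/team/schedule/_/name/'
                '{abbr}/season/{year}/seasontype/2')

def build_schedule_url(team_abbr, year):
    """Fill the team abbreviation and season year into the pattern."""
    return SCHEDULE_URL.format(abbr=team_abbr, year=year)

print(build_schedule_url('gs', 2019))
```

The year can be passed as a string or an integer, since `str.format` converts it either way.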

The class of each table row in a team's schedule contains the string "Table__TR". So we'll search for that and exclude the first row, which just holds the table's column names.

Then we'll grab the dates and IDs of each game.

In [5]:
game_rows = schedule_soup.select('tr[class*="Table__TR"]')[1:]

game_dates = [game.td.span.text for game in game_rows]
game_ids = [game.find_all('a',href=True)[2]['href'].split('=')[1] for game in game_rows]
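
Splitting on '=' assumes the game ID is the only query parameter in the link. A slightly more defensive sketch uses the standard library's `urllib.parse` to read the `gameId` parameter by name. The helper name `game_id_from_href` and the sample link are assumptions for illustration; the ID is the one from the matchup URL shown earlier.

```python
from urllib.parse import urlparse, parse_qs

def game_id_from_href(href):
    """Return the gameId query parameter of a link, or None if absent."""
    query = parse_qs(urlparse(href).query)
    ids = query.get('gameId')
    return ids[0] if ids else None

# Hypothetical link carrying the game ID from the example URL above.
sample_href = 'https://www.espn.com/nba/game?gameId=401070218'
print(game_id_from_href(sample_href))
```

Returning `None` for links without a `gameId` makes it easy to filter out rows that don't point at a played game.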

The opponent names and game locations are stored in a `div` element of class "flex items-center opponent-logo". The opponent name is the title of the logo image. Game locations are indicated by "@" for away and "vs" for home.

In [6]:
game_logos = [game.find_all('div', attrs={'class':'flex items-center opponent-logo'}) for game in game_rows]

game_oppts = [game[0].find('img').get('title') for game in game_logos]
game_locs = [game[0].span.text for game in game_logos]
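
For later analysis it may be handier to store "home" and "away" rather than the raw markers. Here is a minimal sketch of that mapping; `LOCATION_LABELS` and `location_label` are hypothetical names, and the 'unknown' fallback is our own choice for any marker other than the two described above.

```python
# Map the schedule page's location markers to readable labels.
LOCATION_LABELS = {'@': 'away', 'vs': 'home'}

def location_label(marker):
    """Translate '@' or 'vs' into 'away' or 'home'."""
    # Strip surrounding whitespace before the lookup; anything
    # unrecognized is labeled 'unknown' instead of raising an error.
    return LOCATION_LABELS.get(marker.strip(), 'unknown')

print([location_label(m) for m in ['@', 'vs']])
```

With the lists from the cell above, this could be applied as `[location_label(loc) for loc in game_locs]`.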

## Scraping the Box Scores

In [7]:
game_id = game_ids[0]
box_url = request.urlopen('https://www.espn.com/nba/matchup?gameId='+game_id)
box_soup = BeautifulSoup(box_url,'lxml')

There are three tables of class "mod-data" on the Team Matchup page. We're interested in the first table which contains the team box scores for this particular game.

In [8]:
# Table containing the box score data
box_table = box_soup.find_all('table',attrs={'class':'mod-data'})[0]

The table headers will be "Matchup", the away team's abbreviation, and the home team's abbreviation.

In [9]:
away_team = box_table.find_all('th')[1].find('img').get('src').split('/')[-1].split('.')[0]
home_team = box_table.find_all('th')[2].find('img').get('src').split('/')[-1].split('.')[0]

table_headers = [box_table.find_all('th')[0].text, away_team, home_team]
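
The abbreviation here comes from the file name of each team's logo image: we take the last path segment of the image's `src` and keep everything before the first '.'. The sketch below shows that chain on a made-up URL. Both the helper `abbr_from_logo_src` and the sample `src` (including its trailing size parameters) are assumptions for illustration, not ESPN's actual markup.

```python
def abbr_from_logo_src(src):
    """Return the logo file name without its extension."""
    filename = src.split('/')[-1]   # last path segment, e.g. 'okc.png...'
    return filename.split('.')[0]   # text before the first '.'

# Hypothetical image URL; the real src attribute may look different.
sample_src = 'https://example.com/i/teamlogos/nba/500/scoreboard/okc.png&h=100&w=100'
print(abbr_from_logo_src(sample_src))
```

Because the split keeps only the text before the first '.', anything appended after the file extension is dropped as well.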

In [10]:
# Table rows with an indent class
indents = box_table.find_all('tr', attrs={'class':'indent'})
# Table rows with a highlight class
highlights = box_table.find_all('tr', attrs={'class':'highlight'})

In [11]:
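def get_box_data(table_data):
    """Return the stripped cell text of each table row as a list of lists."""
    box_data = []
    for tr in table_data:
        cells = tr.find_all('td')
        # Strip the tabs and newlines that pad each cell's text
        row = [cell.text.strip('\t\n') for cell in cells]
        box_data.append(row)
    return box_data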

In [12]:
highlights_data = get_box_data(highlights)
indents_data = get_box_data(indents)

box_data = np.concatenate((highlights_data,indents_data))

In [13]:
df = pd.DataFrame(box_data, columns=table_headers)

In [14]:
df

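| | Matchup | okc | gs |
|---|---|---|---|
| 0 | FG | 33-91 | 42-95 |
| 1 | Field Goal % | 36.3 | 44.2 |
| 2 | 3PT | 10-37 | 7-26 |
| 3 | Three Point % | 27.0 | 26.9 |
| 4 | FT | 24-37 | 17-18 |
| 5 | Free Throw % | 64.9 | 94.4 |
| 6 | Rebounds | 59 | 66 |
| 7 | Assists | 21 | 28 |
| 8 | Steals | 12 | 7 |
| 9 | Blocks | 6 | 7 |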
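
Every value in this table is still a string, and the FG, 3PT, and FT rows pack two numbers into one cell as "made-attempted". Before any modeling, we'd likely want to split those apart. The sketch below is one way to do it, using only the standard library; `split_made_attempted` and `shooting_pct` are hypothetical helper names.

```python
def split_made_attempted(value):
    """Split a 'made-attempted' string such as '33-91' into two ints."""
    made, attempted = value.split('-')
    return int(made), int(attempted)

def shooting_pct(value):
    """Percentage of attempts made, rounded to one decimal place."""
    made, attempted = split_made_attempted(value)
    # Guard against a division by zero when there were no attempts.
    return round(100 * made / attempted, 1) if attempted else 0.0

print(split_made_attempted('33-91'))  # (33, 91)
print(shooting_pct('33-91'))          # 36.3
```

The computed percentage matches the "Field Goal %" row for okc, which is a quick sanity check that the pairs were scraped correctly.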
