In [None]:
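# Imports needed to run this cell on its own
import time
from urllib import request

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

# Scrape per-game player stats from Basketball Reference, one CSV per season
years = np.arange(2000, 2020).tolist()  # 2000 through 2019

for year in years:
    web_name = 'https://www.basketball-reference.com/leagues/NBA_'+str(year)+'_per_game.html'
    website = request.urlopen(web_name).read()
    soup = BeautifulSoup(website,'lxml')

    table = soup.find('table', attrs={'class': 'sortable', 'id': 'per_game_stats'})
    table_headers = [header.text for header in table.find('thead').find_all('th')]
    table_rows = table.find_all('tr')

    final_data = []
    # For each table row,
    for tr in table_rows:
        # Make a list of the table data tags for this row
        cells = tr.find_all('td')
        # Header rows have no <td> cells, so skip them
        if not cells:
            continue
        # Extract just the cell text and make a list
        row = [cell.text for cell in cells]
        # Append the extracted data
        final_data.append(row)

    # Drop the first header, which has no matching <td> in the data rows
    df = pd.DataFrame(final_data, columns=table_headers[1:])
    df.to_csv('../data/nba_stats_'+str(year)+'.csv')

    # Pause between requests to go easy on the server
    time.sleep(1)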

# Web Scraping in Python with BeautifulSoup: Collecting NBA Team Box Scores

## Introduction

This project uses Python and BeautifulSoup to scrape ESPN for the box scores of regular season NBA games. A box score is a selected set of statistics that summarizes the results of a game.

This data will be useful for answering questions such as:
- Which NBA statistics are actually useful for determining wins?
- Can we predict whether a team will make the playoffs?
- Can we predict whether a team will win a game?

The answers to these questions are most useful to a team or coach deciding which aspects of gameplay to improve. For example, if offensive rebounds turn out to matter more for winning than free throw percentage, a team might prioritize rebounding drills over free throw practice.

In basketball, a box score typically reports the following statistics:
- Games played (GP)
- Games started (GS)
- Minutes played (MIN or MPG)
- Field goals made (FGM)
- Field goals attempted (FGA)
- Field goal percentage (FG%)
- 3-pointers made (3PM)
- 3-pointers attempted (3PA)
- 3-point field goal percentage (3P%)
- Free throws made (FTM)
- Free throws attempted (FTA)
- Free throw percentage (FT%)
- Offensive rebounds (OREB)
- Defensive rebounds (DREB)
- Total rebounds (REB)
- Assists (AST)
- Turnovers (TOV)
- Steals (STL)
- Blocked shots (BLK)
- Personal fouls (PF)
- Points scored (PTS)
- Plus/minus (+/-), the point differential while a player is on the court

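As a small worked example of how the percentage statistics relate to the made and attempted counts, the sketch below computes a shooting percentage from a `made-attempted` string such as `33-91`. The helper name `shooting_pct` is hypothetical and introduced only for illustration, and returning 0.0 when nothing was attempted is an assumption.

```python
def shooting_pct(made_attempted: str) -> float:
    """Return the shooting percentage for a 'made-attempted' string, e.g. '33-91'."""
    made, attempted = (int(part) for part in made_attempted.split('-'))
    if attempted == 0:
        return 0.0  # assumption: report 0.0 when no shots were attempted
    return round(100 * made / attempted, 1)

print(shooting_pct('33-91'))  # 36.3
print(shooting_pct('42-95'))  # 44.2
```
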
## Scrape ESPN Team Page for Team Names

First, we're going to need a list of the NBA teams, so let's look at the structure of the ESPN website. The [teams page](https://www.espn.com/nba/teams) looks like the following:
<img src="../images/espnTeamPage.png"/>

Inspection of the page shows that the links for each team are in a \<div\> container with a 'pl3' class.
<img src="../images/teamContainer.png"/>
So let's open the team page, parse it, and find all containers of that class.

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from urllib import request
import time

In [2]:
team_url = request.urlopen('http://www.espn.com/nba/teams').read()
team_soup = BeautifulSoup(team_url,'lxml')
team_containers = team_soup.find_all('div',{'class':'pl3'})

To get the team name, we grab the text of the first \<a\> tag inside each team's container. To get the abbreviation, we take the second link's URL, split it on '/', and keep the 7th element (index 6). Finally, we build a dictionary mapping team names to their abbreviations. This will come in handy later.

In [3]:
team_names = [team.a.text for team in team_containers]
team_abbrs = [team.find_all('a',href=True)[1]['href'].split('/')[6] for team in team_containers]
teams = dict(zip(team_names,team_abbrs))
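
To see why index 6 holds the abbreviation, it can help to split a single link by hand. The sketch below assumes a relative team link of the form `/nba/team/stats/_/name/bos/boston-celtics`; that path is an assumed example of ESPN's link format, not a value taken from the scraped page.

```python
# Assumed example of a team link; the real href comes from the scraped page.
example_href = '/nba/team/stats/_/name/bos/boston-celtics'

parts = example_href.split('/')
# The leading '/' produces an empty string at index 0, so the
# abbreviation ends up as the 7th element, i.e. index 6.
for index, part in enumerate(parts):
    print(index, repr(part))

example_abbr = parts[6]
print(example_abbr)
```

If the hrefs were absolute URLs (starting with `https://`), the positions would shift, so it is worth printing one link before relying on a fixed index.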

Let's make sure we got all 30 teams.

In [4]:
print(len(teams))
teams

30


{'Boston Celtics': 'bos',
 'Brooklyn Nets': 'bkn',
 'New York Knicks': 'ny',
 'Philadelphia 76ers': 'phi',
 'Toronto Raptors': 'tor',
 'Chicago Bulls': 'chi',
 'Cleveland Cavaliers': 'cle',
 'Detroit Pistons': 'det',
 'Indiana Pacers': 'ind',
 'Milwaukee Bucks': 'mil',
 'Denver Nuggets': 'den',
 'Minnesota Timberwolves': 'min',
 'Oklahoma City Thunder': 'okc',
 'Portland Trail Blazers': 'por',
 'Utah Jazz': 'utah',
 'Golden State Warriors': 'gs',
 'LA Clippers': 'lac',
 'Los Angeles Lakers': 'lal',
 'Phoenix Suns': 'phx',
 'Sacramento Kings': 'sac',
 'Atlanta Hawks': 'atl',
 'Charlotte Hornets': 'cha',
 'Miami Heat': 'mia',
 'Orlando Magic': 'orl',
 'Washington Wizards': 'wsh',
 'Dallas Mavericks': 'dal',
 'Houston Rockets': 'hou',
 'Memphis Grizzlies': 'mem',
 'New Orleans Pelicans': 'no',
 'San Antonio Spurs': 'sa'}

## Scraping ESPN Team Schedule Page for Game IDs

The page containing a game's box score has a URL of the following format:<br>
https://www.espn.com/nba/matchup?gameId=401070218<br>
So we need a list of all the game IDs, which we can get from each team's schedule page.

A team's regular season schedule is listed on a URL with the following format: https://www.espn.com/nba/team/schedule/_/name/ABBR/season/YEAR/seasontype/2, where ABBR (team abbreviation) and YEAR are our variables of interest.

In [5]:
year = '2019'
team_abbr = 'gs'

In [6]:
schedule_url = request.urlopen('https://www.espn.com/nba/team/schedule/_/name/'+team_abbr+'/season/'+year+'/seasontype/2').read()
schedule_soup = BeautifulSoup(schedule_url,'lxml')
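
Since the same URL pattern applies to every team, one option is to wrap the string building in a small helper. This is only a sketch: the function name `build_schedule_url` and the `season_type` parameter are hypothetical, and the two-team dictionary is a subset of the `teams` dictionary built earlier.

```python
def build_schedule_url(team_abbr: str, year: str, season_type: int = 2) -> str:
    """Build a team's schedule URL; season type 2 is the value used above for the regular season."""
    return ('https://www.espn.com/nba/team/schedule/_/name/' + team_abbr
            + '/season/' + str(year) + '/seasontype/' + str(season_type))

# A two-team subset of the teams dictionary, for illustration only.
sample_teams = {'Boston Celtics': 'bos', 'Golden State Warriors': 'gs'}

schedule_urls = {name: build_schedule_url(abbr, '2019')
                 for name, abbr in sample_teams.items()}

for name, url in schedule_urls.items():
    print(name, url)
```
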

Each game's ID appears in the link inside a \<span\> element with an 'ml4' class: it is the part of the link's URL after the '='.

In [7]:
games = schedule_soup.find_all('span', attrs={'class':'ml4'})
game_ids = [game.find_all('a',href=True)[0]['href'].split('=')[1] for game in games]
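
Splitting on '=' works as long as the link carries a single query parameter. A slightly more defensive sketch, using the standard library's `urllib.parse`, reads the `gameId` parameter by name instead. The helper name `extract_game_id` is hypothetical, and the example links are modeled on the matchup URL shown earlier rather than copied from the schedule page.

```python
from urllib.parse import urlparse, parse_qs

def extract_game_id(href: str) -> str:
    """Return the value of the gameId query parameter from a game link."""
    query = parse_qs(urlparse(href).query)
    return query['gameId'][0]

# Assumed link shapes, for illustration only.
print(extract_game_id('https://www.espn.com/nba/matchup?gameId=401070218'))
print(extract_game_id('https://www.espn.com/nba/matchup?foo=bar&gameId=401070218'))
```

Both calls return the same ID, whereas `split('=')[1]` would return `bar&gameId` for the second link.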

Let's make sure we got all 82 regular season games.

In [8]:
len(game_ids)

82

In [None]:
# TODO: From the schedule page, also get the opponent name and whether the game
# was a home (vs) or an away (@) game for the team.

In [9]:
# Fetch and parse the matchup (box score) page for the first game ID
game_id = game_ids[0]
box_url = request.urlopen('https://www.espn.com/nba/matchup?gameId='+game_id).read()
box_soup = BeautifulSoup(box_url,'lxml')

In [13]:
box_soup.find_all('table',attrs={'class':'mod-data'})

[<table class="mod-data">
 <thead>
 <tr class="header">
 <th>Matchup</th>
 <th>
 <img src="https://a.espncdn.com/combiner/i?img=/i/teamlogos/nba/500/okc.png&amp;h=100&amp;w=100"/>
 </th>
 <th>
 <img src="https://a.espncdn.com/combiner/i?img=/i/teamlogos/nba/500/gs.png&amp;h=100&amp;w=100"/>
 </th>
 </tr>
 </thead>
 <tbody>
 <tr class="highlight" data-stat-attr="fieldGoalsMade-fieldGoalsAttempted">
 <td>
 									FG
 								</td>
 <td>
 									33-91
 								</td>
 <td>
 									42-95
 								</td>
 </tr>
 <tr class="highlight" data-stat-attr="fieldGoalPct">
 <td>
 									Field Goal %
 								</td>
 <td>
 									36.3
 								</td>
 <td>
 									44.2
 								</td>
 </tr>
 <tr class="highlight" data-stat-attr="threePointFieldGoalsMade-threePointFieldGoalsAttempted">
 <td>
 									3PT
 								</td>
 <td>
 									10-37
 								</td>
 <td>
 									7-26
 								</td>
 </tr>
 <tr class="highlight" data-stat-attr="threePointFieldGoalPct">
 <td>
 									Three Point %
 	