# Tracking Second Half Scoring Contributions in the NBA  

### Introduction  
On June 16, 2021, the Philadelphia 76ers played the Atlanta Hawks in Game 5 of the Eastern Conference Semifinals. The Sixers surrendered a massive second-half lead and ended up losing to the Hawks. Remarkably, in the entire second half, [only two Sixers players scored a field goal](https://www.bardown.com/somehow-only-two-2-sixers-players-managed-to-score-in-the-entire-2nd-half-of-game-5-1.1655958). Just how rare is that? Let's get an answer using some code.

### Imports

In [1]:
import pandas as pd
import numpy as np
import requests
import time
import urllib.request
from IPython.display import clear_output
from bs4 import BeautifulSoup
from bs4 import SoupStrainer

### Getting Unique Game IDs  
To scrape the data that we want, we need the proper HTML links for each game. Each game id is unique, and we can find them stored in the HTML code of specific webpages. These webpages are organized by year and then month, so we need to loop through each combination of year and month and visit that specific webpage. There, we will find the game ids we want.

In [5]:
# We will start with a blank list of Game IDs
# This is where we will store the id info as it is retrieved
game_id_links = []

# Now we need a list of years
# We will start with a blank list and then fill it out in a loop
years = []
for i in range(2000,2022):
    years.append(str(i))

# We also need a list of months
months = ["january","february","march","april","may","june","july","august","september","october","november","december"]

# We now have a list of years and a list of months
# We will go through each combination and scrape the desired info

# For each year ...
for i in years:
    # For each month ...
    for j in months:
        # Create a string of the required webpage link 
        c_link = 'https://www.basketball-reference.com/leagues/NBA_{}_games-{}.html'.format(i,j)
        # Scrape the page, make it pretty, and reduce it to the basic info we want 
        r = requests.get(c_link)
        data = r.text
        soup = BeautifulSoup(data, "html.parser", parse_only=SoupStrainer("td", {"data-stat":"box_score_text"}))
        x = soup.find_all("a")
        # For each tag, strip out the href string (which is the game id) and add it to our list
        for tr in x:
            links = tr.get('href')
            game_id_links.append(links)

# Print out the length of our list to see how many games it includes
len(game_id_links)    

28019

That's a lot of games! We are looking at a 21-year span here, so it's quite reasonable that there would be so many. We need to keep the size of this data set in mind, as any attempt to analyze every single game will take a **long** time.
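
As a rough sketch of the year-and-month loop above, the same set of schedule page addresses can be built up front with `itertools.product`, which makes it easy to see how many pages get requested. The names `schedule_urls`, `MONTHS`, and `URL_TEMPLATE` are introduced here for illustration only; the URL pattern and the year range are the ones used in the cell above.

```python
from itertools import product

# Same month names and URL pattern as the scraping cell above
MONTHS = ["january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"]
URL_TEMPLATE = "https://www.basketball-reference.com/leagues/NBA_{}_games-{}.html"


def schedule_urls(first_year=2000, last_year=2021):
    """Return one schedule-page URL per (year, month) pair, years inclusive."""
    years = range(first_year, last_year + 1)
    return [URL_TEMPLATE.format(year, month)
            for year, month in product(years, MONTHS)]


urls = schedule_urls()
print(len(urls))  # 22 years x 12 months = 264 pages
print(urls[0])
```

Some of these pages may not hold any games (off-season months, for example); in that case the scraping loop simply finds no box score links on them.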

### Get the Data for Each Game  
Okay, now we have the game id for each game over the last 21 years. Awesome! Now, we need to build unique html addresses using those game ids. That will allow us to access the stats for each game, review the box scores, and calculate how many unique scorers each team had in the second half of a given game.
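
Before the full scraping cell, here is a small sketch of how a single game id maps to a box score address and a game date. The sample link is a hypothetical example shaped like the hrefs collected earlier, and `parse_game_id` and `BASE_URL` are names made up for this illustration.

```python
from datetime import datetime

BASE_URL = "https://www.basketball-reference.com"


def parse_game_id(link):
    """Split a relative box score link into its full URL and game date."""
    # The third piece of the path is the file name, e.g. "200001020ORL.html"
    file_name = link.split("/")[2]
    # Its first eight characters are the date in YYYYMMDD form
    game_date = datetime.strptime(file_name[0:8], "%Y%m%d").date()
    return BASE_URL + link, game_date


# Hypothetical example link (assumed format, for illustration)
url, game_date = parse_game_id("/boxscores/200001020ORL.html")
print(url)
print(game_date.isoformat())
```

Parsing the date with `datetime.strptime` has the side benefit of raising an error on a malformed id instead of silently storing a bad date string.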

In [14]:
# We will start with a blank dictionary
# This is where we will store the information as we scrape it
# A dictionary is a good tool here as it will allow us to build a clean data frame at the end
game_level_data = {}

# As noted above, we have a lot of games in our data set. 
# That means it will take a very long time for our code to run.
# For now, we'll just look at the first 10 games as a proof of concept.
# Later, we can run this on the entire data set.

# For each of the first 10 games...
for i in game_id_links[0:10]:
    
    # Extract the date from the game id
    # We'll use this as part of storing our results
    date = i.split('/')[2][0:8]
    
    # We need team abbreviations ('ATL' for Atlanta, etc) in order to properly parse the html code
    # To grab these, we'll first make a blank list of the home and away team abbreviations
    both_teams = []
    
    # Next, we will access the webpage using the game id
    c_link = 'https://www.basketball-reference.com{}'.format(i)
    # Then we'll parse the html code and extract instances of the team abbreviations 
    r = requests.get(c_link)
    data = r.text
    soup = BeautifulSoup(data, "html.parser", parse_only=SoupStrainer("div", {"id":"inner_nav"}))
    x = soup.find_all("a")
    # We'll find every instance of the team abbreviation and add it to our list
    for tr in x:
        links = tr.get('href')
        both_teams.append(links)    
    team_abs = []

    # The html code is a little messy, so we need to do some cleaning to get the two team abbreviations 
    for link in both_teams:
        parts = link.split("/")
        # Team links have 'teams' as the first path piece and the abbreviation as the second
        if len(parts) > 2 and parts[1] == 'teams':
            team_abs.append(parts[2])
    
    # Now that we have just the team abbreviations, we can reduce to only the unique ones
    # Note that a set has no fixed order, so either team may come first
    team_abs = list(set(team_abs))

    # Now, for each team, we need to find the desired data
    # First, we need to get the unique tag that we will search for in the html code
    team_2half_string = "box-{}-h2-basic".format(team_abs[0])

    # Using this tag, we will do our BeautifulSoup magic again and get the second half box score for the given team
    # We reuse the html we already downloaded above rather than requesting the page again
    soup = BeautifulSoup(data,'html.parser')
    table = soup.find_all("table", {"id": team_2half_string})
    # Once we find the box score, we need to make it into a table
    df = pd.read_html(str(table))[0]

    # The data frame has a multi index
    # We can go ahead and drop that
    df.columns = ['_'.join(col) for col in df.columns.values]
    
    # That leaves some ugly column titles which we will clean up
    new_cols= []
    for col in df.columns.values:
        val = col.split("_")[1]
        new_cols.append(val)
    new_cols[0] = 'Player'
    df.columns = new_cols

    # Now our data frame has nice column names but it has some unnecessary rows
    # We can go ahead and drop those two rows
    df = df.query('Player != "Reserves" & Player != "Team Totals"')

    # There are some odd items that may come up depending on the game
    # We need to replace these with NaN
    df = df.replace("Did Not Play", np.nan)
    df = df.replace("Not With Team", np.nan)
    df = df.replace("Did Not Dress", np.nan)
    df = df.replace("Player Suspended", np.nan)

    # Keep only the rows for players who actually played
    df = df[df['MP'].notna()]
    # Reset index
    df = df.reset_index(drop=True)
    # Convert FG to integer
    df['FG'] = pd.to_numeric(df['FG'])

    # We've accessed the second half box score for the given team
    # We can now compress it down into just the info we want and store those values in a dictionary
    
    # Create a unique dictionary key
    game_key = date + team_abs[0]
    # Team
    team = team_abs[0]
    # Date
    game_date = date[0:4]+"-"+date[4:6]+"-"+date[6:8]
    # Calculate the number of unique scorers for the team
    raw_scorers = len(df.query('FG > 0'))
    # Calculate the number of unique scorers as a percent of total players
    percent_scorers = raw_scorers / len(df)

    # Add the information to the dictionary
    # Note that this is a nested dictionary. That is key for unpacking this information later
    game_level_data[game_key] = {'team':team_abs[0],
                                 'date':game_date,
                                 'raw_scorers':raw_scorers,
                                 'percent_scorers':percent_scorers}
    print("Team: {}, Date: {}, Scorers: {}".format(team_abs[0],game_date,raw_scorers))
    
    # Do it all again for the second team
    # The steps are written out a second time rather than looped
    # We reuse the html we already downloaded, so the page is not requested again
    team_2half_string = "box-{}-h2-basic".format(team_abs[1])

    soup = BeautifulSoup(data,'html.parser')
    table = soup.find_all("table", {"id": team_2half_string})
    df = pd.read_html(str(table))[0]

    # Drop top level index
    df.columns = ['_'.join(col) for col in df.columns.values]

    new_cols= []

    for col in df.columns.values:
        val = col.split("_")[1]
        new_cols.append(val)

    new_cols[0] = 'Player'

    df.columns = new_cols

    # Remove "Reserves" and "Team Totals" rows
    df = df.query('Player != "Reserves" & Player != "Team Totals"')

    # Replace the non-playing labels ("Did Not Play", "Not With Team", etc.) with NaN
    df = df.replace("Did Not Play", np.nan)
    df = df.replace("Not With Team", np.nan)
    df = df.replace("Did Not Dress", np.nan)
    df = df.replace("Player Suspended", np.nan)

    # Keep only the rows for players who actually played
    df = df[df['MP'].notna()]

    # Reset index
    df = df.reset_index(drop=True)

    # Convert FG to integer
    df['FG'] = pd.to_numeric(df['FG'])

    # Dictionary Key
    game_key = date + team_abs[1]
    # Team
    team = team_abs[1]
    # Date
    game_date = date[0:4]+"-"+date[4:6]+"-"+date[6:8]
    # Raw Scorers
    raw_scorers = len(df.query('FG > 0'))
    # Percent Scorers
    percent_scorers = raw_scorers / len(df)

    # Save Info
    game_level_data[game_key] = {'team':team_abs[1],
                                 'date':game_date,
                                 'raw_scorers':raw_scorers,
                                 'percent_scorers':percent_scorers}
    
    # Print a general statement to know that the code is running
    print("Team: {}, Date: {}, Scorers: {}".format(team_abs[1],game_date,raw_scorers))
    
    # We will sleep for a tenth of a second to slow the rate of requests we make to the website
    time.sleep(0.10)

Team: ORL, Date: 2000-01-02, Scorers: 9
Team: MIA, Date: 2000-01-02, Scorers: 6
Team: CLE, Date: 2000-01-03, Scorers: 7
Team: BOS, Date: 2000-01-03, Scorers: 7
Team: CHI, Date: 2000-01-03, Scorers: 5
Team: POR, Date: 2000-01-03, Scorers: 9
Team: DET, Date: 2000-01-03, Scorers: 7
Team: ORL, Date: 2000-01-03, Scorers: 8
Team: MIL, Date: 2000-01-03, Scorers: 5
Team: PHI, Date: 2000-01-03, Scorers: 8
Team: DEN, Date: 2000-01-03, Scorers: 4
Team: UTA, Date: 2000-01-03, Scorers: 9
Team: WAS, Date: 2000-01-03, Scorers: 8
Team: GSW, Date: 2000-01-03, Scorers: 7
Team: SAC, Date: 2000-01-04, Scorers: 6
Team: CLE, Date: 2000-01-04, Scorers: 8
Team: DAL, Date: 2000-01-04, Scorers: 7
Team: DEN, Date: 2000-01-04, Scorers: 6
Team: SEA, Date: 2000-01-04, Scorers: 7
Team: HOU, Date: 2000-01-04, Scorers: 6


### Closing Up  
Great! Our code worked and we were able to run through the first 10 games of our data set. Now, we need to unpack our dictionary into a data frame so that we can analyze everything and get an answer to our question.

In [15]:
# We can quickly convert our dictionary to a data frame
# We will sort by raw scorers as that is what we are most interested in seeing
pd.DataFrame.from_dict(game_level_data, orient='index').sort_values('raw_scorers')

| | team | date | raw_scorers | percent_scorers |
|---|---|---|---|---|
| 20000103DEN | DEN | 2000-01-03 | 4 | 0.4 |
| 20000103CHI | CHI | 2000-01-03 | 5 | 0.454545 |
| 20000103MIL | MIL | 2000-01-03 | 5 | 0.625 |
| 20000104HOU | HOU | 2000-01-04 | 6 | 0.666667 |
| 20000102MIA | MIA | 2000-01-02 | 6 | 0.857143 |
| 20000104DEN | DEN | 2000-01-04 | 6 | 0.666667 |
| 20000104SAC | SAC | 2000-01-04 | 6 | 0.666667 |
| 20000103CLE | CLE | 2000-01-03 | 7 | 0.7 |
| 20000103BOS | BOS | 2000-01-03 | 7 | 0.777778 |
| 20000103DET | DET | 2000-01-03 | 7 | 0.875 |


Great! We have exactly what we were looking for. We now have a list of games with two specific measurements (raw scorers and percent scorers) for each game. We can run this code across our entire set of games and see just how rare that Sixers performance really was.
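
As a minimal sketch of that last step, the question "how rare is it?" comes down to the share of team-games with at most two second-half scorers. The example below uses only the ten `raw_scorers` values shown in the table above, so its result says nothing about the real rate; `share_at_or_below` is a hypothetical helper name.

```python
from collections import Counter

# raw_scorers values copied from the ten rows shown above
raw_scorers = [4, 5, 5, 6, 6, 6, 6, 7, 7, 7]


def share_at_or_below(values, threshold):
    """Fraction of team-games with at most `threshold` second-half scorers."""
    return sum(1 for v in values if v <= threshold) / len(values)


# How often each scorer count occurs in the sample
distribution = Counter(raw_scorers)
print(sorted(distribution.items()))       # [(4, 1), (5, 2), (6, 4), (7, 3)]
print(share_at_or_below(raw_scorers, 2))  # 0.0 in this tiny sample
```

Run on the full set of games, the same calculation would give the fraction of team performances that match or fall below the Sixers' two scorers.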