# Basketball Player Size

In this notebook, I am going to pull data from Basketball-Reference.com on all the players who have played in the NBA. In addition to more detailed information about each player, the site lists basic stats such as height, weight and years in the NBA. Once we have this data, we can visualize it in Tableau to get a sense of how the size of players has changed over time.

First, I need to import the packages. I will be using the requests package to get the html for the site and then BeautifulSoup to pull the tables of data out. I will use pandas to store the data in a DataFrame, numpy to mark missing values, and tqdm to track the progress of the scraping.

In [86]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm.notebook import tqdm
import numpy as np

To start, I'm going to look at just the letter "a" to keep things simple and work out any bugs before moving to the full alphabet. Each letter has its own page with a similar structure (players/a, players/b, etc.), which will be helpful when we cycle through them all.

In [35]:
site = "https://www.basketball-reference.com/players/a"
site_html = requests.get(site).content

site_html now holds the html of this page and we can dig deeper to pull out the table of data. By looking at the html using the Developer tools in my browser, I see that the table of interest is called "div_players". I can use BeautifulSoup to pull out this table. The table body is a "tbody" element made up of a series of rows. In between the rows of data are strings that carry no data (typically just whitespace), so we skip those when we go through: they are plain text rather than tags, the attribute lookup fails on them, and the try/except moves on to the next item.

In [43]:
soup = BeautifulSoup(site_html, "html.parser")
stats_table = soup.find("tbody")
stats_dataframe = pd.DataFrame()

ix = 0
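for stat in tqdm(stats_table):
    try:
        # Each cell's data-stat attribute becomes the column name
        for field in stat:
            stats_dataframe.loc[ix, field.get('data-stat')]=field.text
        ix += 1
    except AttributeError:
        # Text between rows has no .get(), so it is skipped
        pass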

stats_dataframe

Unnamed: 0,player,year_min,year_max,pos,height,weight,birth_date,colleges
0,Max Zaslofsky,1947,1956,G-F,6-2,170,"December 7, 1925",St. John's
1,Zeke Zawoluk,1953,1955,F-C,6-7,215,"October 13, 1930",St. John's
2,Cody Zeller,2014,2022,C-F,6-11,240,"October 5, 1992",Indiana
3,Dave Zeller,1962,1962,G,6-1,175,"June 8, 1939",Miami University
4,Gary Zeller,1971,1972,G,6-3,205,"November 20, 1947",Drake University
5,Harry Zeller,1947,1947,C-F,6-4,210,"July 10, 1919",Washington & Jefferson College
6,Luke Zeller,2013,2013,C,6-11,245,"April 7, 1987",Notre Dame
7,Tyler Zeller,2013,2020,F-C,7-0,253,"January 17, 1990",UNC
8,Tony Zeno,1980,1980,F,6-8,210,"October 1, 1957",Arizona State
9,Phil Zevenbergen,1988,1988,C,6-10,230,"April 13, 1964","Seattle Pacific University, Washington"
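
As an alternative to catching errors on the strings between rows, the rows can be requested directly. The following is a minimal sketch on a made-up two-row snippet (the player names and the `sample_html` string are hypothetical, not taken from the site), assuming the real table uses the same `data-stat` attributes seen above.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet imitating the structure described above
sample_html = """
<table><tbody>
<tr><th data-stat="player">Player A</th><td data-stat="height">6-2</td><td data-stat="weight">170</td></tr>
<tr><th data-stat="player">Player B</th><td data-stat="height">7-0</td><td data-stat="weight">240</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
tbody = soup.find("tbody")

# find_all("tr") returns only row tags, so the strings between rows
# never appear and no try/except is needed.
rows = []
for tr in tbody.find_all("tr"):
    cells = tr.find_all(["th", "td"])
    rows.append({cell.get("data-stat"): cell.get_text() for cell in cells})

print(rows)
```

Collecting the rows in a list of dictionaries also means the DataFrame can be built once at the end with `pd.DataFrame(rows)` instead of growing it one cell at a time.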


We need to cycle through every letter of the alphabet to get the full set of players. It's easy enough to make a string with all the letters we want to check and cycle through them to get our full dataset. Since we already did a, we don't need to add those players again.

In [44]:
listofletters = 'bcdefghijklmnopqrstuvwxyz'

parent_site = "https://www.basketball-reference.com/players/"

for letter in tqdm(listofletters):
    site = parent_site + letter
    site_html = requests.get(site).content
    
    soup = BeautifulSoup(site_html, "html.parser")
    stats_table = soup.find("tbody")

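    for stat in stats_table:
        try:
            # Each cell's data-stat attribute becomes the column name
            for field in stat:
                stats_dataframe.loc[ix, field.get('data-stat')]=field.text
            ix += 1
        except AttributeError:
            # Text between rows has no .get(), so it is skipped
            pass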
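
The per-letter addresses can also be generated rather than typed out. This is a small sketch, assuming the same players/a, players/b pattern described earlier; the helper name `letter_urls` is my own and not part of any library.

```python
import string

PARENT_SITE = "https://www.basketball-reference.com/players/"

def letter_urls(parent_site=PARENT_SITE, skip=""):
    """Return one URL per lowercase letter, leaving out any letters in `skip`."""
    return [parent_site + letter
            for letter in string.ascii_lowercase
            if letter not in skip]

# Skip "a" because those players were already collected
urls = letter_urls(skip="a")
print(len(urls))
print(urls[0])
```

Using `string.ascii_lowercase` avoids the risk of dropping or repeating a letter in a hand-typed string.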

Let's take a look at our data!

In [45]:
stats_dataframe

Unnamed: 0,player,year_min,year_max,pos,height,weight,birth_date,colleges
0,Max Zaslofsky,1947,1956,G-F,6-2,170,"December 7, 1925",St. John's
1,Zeke Zawoluk,1953,1955,F-C,6-7,215,"October 13, 1930",St. John's
2,Cody Zeller,2014,2022,C-F,6-11,240,"October 5, 1992",Indiana
3,Dave Zeller,1962,1962,G,6-1,175,"June 8, 1939",Miami University
4,Gary Zeller,1971,1972,G,6-3,205,"November 20, 1947",Drake University
...,...,...,...,...,...,...,...,...
4863,Ante Žižić,2018,2020,F-C,6-10,266,"January 4, 1997",
4864,Jim Zoet,1983,1983,C,7-1,240,"December 20, 1953",Kent State University
4865,Bill Zopf,1971,1971,G,6-1,170,"June 7, 1948",Duquesne
4866,Ivica Zubac,2017,2022,C,7-0,240,"March 18, 1997",


The standard positions reflected in this data are F (Forward), C (Center), and G (Guard). Some players play multiple positions, and they serve a function on the team distinct from players who play only one. The data lists these combinations in both orders (for example, C-F and F-C), so I will merge each pair into a single combo position.

In [46]:
set(stats_dataframe['pos'])

{'C', 'C-F', 'F', 'F-C', 'F-G', 'G', 'G-F'}

In [47]:
stats_dataframe['pos'] = stats_dataframe['pos'].replace({'C-F':'F-C', 'G-F':'F-G'})
set(stats_dataframe['pos'])

{'C', 'F', 'F-C', 'F-G', 'G'}

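I want the height field to be in inches so that I can work with it as a numerical variable. I convert it by splitting the string on the hyphen and converting the feet to inches. I also replace empty weight strings with NaN so that players without a recorded weight can be dropped before converting the column to integers.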

In [88]:
stats_dataframe['height'] = stats_dataframe['height'].apply(lambda x: x.split("-"))
stats_dataframe['height'] = stats_dataframe['height'].apply(lambda x: int(x[0])*12 + int(x[1]))
stats_dataframe['weight'] = stats_dataframe['weight'].replace('', np.nan)
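
The height conversion can be checked in isolation. Here is a minimal sketch of the same arithmetic as a standalone function; the name `height_to_inches` is hypothetical, and it assumes every height string has the "feet-inches" form seen in the table.

```python
def height_to_inches(height):
    """Convert a 'feet-inches' string such as '6-11' to total inches."""
    feet, inches = height.split("-")
    return int(feet) * 12 + int(inches)

print(height_to_inches("6-2"))   # 74
print(height_to_inches("7-0"))   # 84
```

A named function like this could be passed to `apply` in place of the two lambdas, which makes the step easier to test on a few known values first.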

In [97]:
stats_dataframe = stats_dataframe.dropna(subset=['weight'])

In [98]:
stats_dataframe = stats_dataframe.astype({'year_min':int, 'year_max':int, 'weight': int})

Here, I'm going to export my dataframe to a csv before I keep working on it.

In [99]:
stats_dataframe.to_csv('NBAPlayers.csv')

I want to be able to look at how the size of players has changed over time. To do this, I need one record per player per year rather than a single record with year_min and year_max.

In [100]:
roster_by_year = pd.DataFrame(columns = ['year','player','pos','height','weight'])

ix = 0
for i, player in tqdm(stats_dataframe.iterrows(), total=stats_dataframe.shape[0]):
    for year in range(player['year_min'], player['year_max']+ 1):
        roster_by_year.loc[ix] = [year, player['player'], player['pos'], player['height'], player['weight']]
        ix += 1
    

In [101]:
roster_by_year

Unnamed: 0,year,player,pos,height,weight
0,1947,Max Zaslofsky,F-G,74,170
1,1948,Max Zaslofsky,F-G,74,170
2,1949,Max Zaslofsky,F-G,74,170
3,1950,Max Zaslofsky,F-G,74,170
4,1951,Max Zaslofsky,F-G,74,170
...,...,...,...,...,...
25106,2019,Ivica Zubac,C,84,240
25107,2020,Ivica Zubac,C,84,240
25108,2021,Ivica Zubac,C,84,240
25109,2022,Ivica Zubac,C,84,240
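
Writing rows one at a time with `.loc` works but gets slow as the table grows. The same one-row-per-season expansion can be sketched with `DataFrame.explode`. The two players below are made up for illustration, and this is an alternative to the loop above, not the method used to produce the output shown.

```python
import pandas as pd

# Hypothetical players with the same columns as the scraped data
players = pd.DataFrame({
    "player": ["Player A", "Player B"],
    "year_min": [1947, 2020],
    "year_max": [1949, 2020],
    "pos": ["F-G", "C"],
    "height": [74, 84],
    "weight": [170, 240],
})

# One list of seasons per player, year_min through year_max inclusive
seasons = [list(range(lo, hi + 1))
           for lo, hi in zip(players["year_min"], players["year_max"])]

# explode() turns each list element into its own row
roster = (
    players.assign(year=pd.Series(seasons, index=players.index))
    .explode("year")
    .astype({"year": int})
    .reset_index(drop=True)
    [["year", "player", "pos", "height", "weight"]]
)

print(roster)
```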


In [102]:
roster_by_year.to_csv('NBArosterbyyear.csv')
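
Before opening the file in Tableau, a quick aggregate can confirm that the per-year table behaves as expected. This is a sketch on made-up numbers (`roster_sample` is hypothetical); on the real data the same `groupby` call would be applied to `roster_by_year`.

```python
import pandas as pd

# Hypothetical per-year records with the same columns as roster_by_year
roster_sample = pd.DataFrame({
    "year": [1947, 1947, 2022, 2022],
    "height": [74, 76, 84, 78],
    "weight": [170, 210, 240, 200],
})

# Average height (inches) and weight for each season
yearly_size = roster_sample.groupby("year")[["height", "weight"]].mean()
print(yearly_size)
```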