# Winning In the NBA: Golden State Warriors Edition Pt. 1 (Webscraping)

From 2015 to 2018, the Golden State Warriors won three championships in four years, and in 2022 they added another to their dynastic run. It is safe to say the Warriors have been a dominant force in the NBA over the last decade. Their brilliant displays of motion offense and 'strength in numbers' mentality shot them to the upper echelon of the NBA hierarchy during the 2014-15 season, when they won their first NBA championship in 40 years with a first-year head coach in Steve Kerr and a rising trio of Stephen Curry, Klay Thompson, and Draymond Green.

As a San Francisco native and Warriors fan, I am going to focus this analysis on which NBA team statistics influence a team's ability to win. Given the Warriors' considerable success over the last decade, I will focus specifically on the metrics that make up the Golden State Warriors' winning formula and how those metrics compare to the rest of the NBA over the same time frame.

Before I can do any analysis, I first need to obtain the NBA team data. To do this, I am going to webscrape data from [Basketball Reference](https://www.basketball-reference.com), a publicly accessible website that contains all kinds of NBA data on teams and individual players. In this analysis I am going to focus on team data for all 30 NBA teams over the last 10 full seasons (2014-2023, labeled by the year each season ended).

In [1]:
import requests
import bs4
import pandas as pd

After exploring the website, I found that the urls for NBA team statistics vary only by team abbreviation. For example, the Atlanta Hawks statistics page has the url "https://www.basketball-reference.com/teams/ATL/stats_basic_totals.html" and the Boston Celtics page has the url "https://www.basketball-reference.com/teams/BOS/stats_basic_totals.html". To make this process more efficient, I am going to create a list of urls to request data from, which means I first need the list of NBA team abbreviations. I will get it by webscraping this [wikipedia page](https://en.wikipedia.org/wiki/Wikipedia:WikiProject_National_Basketball_Association/National_Basketball_Association_team_abbreviations).

In [3]:
# list of NBA team abbreviations from wikipedia site
req = requests.get("https://en.wikipedia.org/wiki/Wikipedia:WikiProject_National_Basketball_Association/National_Basketball_Association_team_abbreviations")

In [4]:
soup = bs4.BeautifulSoup(req.text, 'lxml')

In [7]:
# add all NBA teams to list: the abbreviations sit in every other table cell
cells = soup.find_all('td')
nba_abb = [cell.get_text() for cell in cells[::2]]

In [8]:
nba_abb

['ATL\n',
 'BOS\n',
 'BKN\n',
 'CHA\n',
 'CHI\n',
 'CLE\n',
 'DAL\n',
 'DEN\n',
 'DET\n',
 'GSWGS[a]\n',
 'HOU\n',
 'IND\n',
 'LAC\n',
 'LAL\n',
 'MEM\n',
 'MIA\n',
 'MIL\n',
 'MIN\n',
 'NOPNO[a]\n',
 'NYKNY[a]\n',
 'OKC\n',
 'ORL\n',
 'PHI\n',
 'PHX\n',
 'POR\n',
 'SAC\n',
 'SASSA[a]\n',
 'TOR\n',
 'UTAUTAH[a]\n',
 'WAS\n']

In [9]:
# clean up list: keep the first three characters, dropping newlines, alternate codes, and footnote markers
for i, name in enumerate(nba_abb):
    nba_abb[i] = name[0:3]

In [10]:
print(nba_abb)

['ATL', 'BOS', 'BKN', 'CHA', 'CHI', 'CLE', 'DAL', 'DEN', 'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL', 'MIN', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI', 'PHX', 'POR', 'SAC', 'SAS', 'TOR', 'UTA', 'WAS']
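Slicing the first three characters works because every primary abbreviation on the page is exactly three letters long. As a minimal sketch (the helper name `clean_abbreviation` is my own and not part of the notebook), the same cleanup can be written as a small function and checked against a few of the raw values:

```python
def clean_abbreviation(raw):
    """Return the primary three-letter code from a raw table cell.

    Hypothetical helper: assumes the primary abbreviation is always the
    first three characters of the cell text.
    """
    return raw.strip()[:3]


# A few raw values copied from the scraped list above.
raw_cells = ['ATL\n', 'GSWGS[a]\n', 'UTAUTAH[a]\n']
cleaned = [clean_abbreviation(cell) for cell in raw_cells]
print(cleaned)
```

If the page ever listed a code of a different length, this assumption would break, so it is worth eyeballing the printed list as done here.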


The NBA team abbreviations need some adjustments because the Basketball Reference website uses different abbreviations for certain teams in its urls: NJN for Brooklyn, NOH for New Orleans, and PHO for Phoenix. Let's correct them!

In [11]:
nba_abb[2] = 'NJN'

In [12]:
nba_abb[18] = 'NOH'

In [13]:
nba_abb[23] = 'PHO'

In [14]:
print(nba_abb)

['ATL', 'BOS', 'NJN', 'CHA', 'CHI', 'CLE', 'DAL', 'DEN', 'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL', 'MIN', 'NOH', 'NYK', 'OKC', 'ORL', 'PHI', 'PHO', 'POR', 'SAC', 'SAS', 'TOR', 'UTA', 'WAS']
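Overwriting by list position works, but it silently breaks if the order of the list ever changes. One possible alternative, sketched here with names of my own choosing (`URL_CODE_OVERRIDES`, `to_url_code`), is to key the corrections by the scraped abbreviation instead:

```python
# Codes used in the Basketball Reference urls, keyed by the abbreviation
# scraped from Wikipedia (the same three corrections made above).
URL_CODE_OVERRIDES = {'BKN': 'NJN', 'NOP': 'NOH', 'PHX': 'PHO'}


def to_url_code(abbreviation):
    """Return the url code for a team, falling back to the abbreviation itself."""
    return URL_CODE_OVERRIDES.get(abbreviation, abbreviation)


sample = ['ATL', 'BKN', 'NOP', 'PHX']
url_codes = [to_url_code(team) for team in sample]
print(url_codes)
```

Teams without an entry in the dictionary pass through unchanged, so only the exceptions need to be listed.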


With my list of abbreviations, I can now create the list of urls to webscrape the data from. The Basketball Reference site uses the same url path for all the team stats I am interested in, with only the team abbreviation varying from team to team, so I am going to build the list of urls from this pattern.

In [15]:
urls = []
for team in nba_abb:
    urls.append("https://www.basketball-reference.com/teams/" + team + "/stats_basic_totals.html")

In [56]:
urls[0:10]

['https://www.basketball-reference.com/teams/ATL/stats_basic_totals.html',
 'https://www.basketball-reference.com/teams/BOS/stats_basic_totals.html',
 'https://www.basketball-reference.com/teams/NJN/stats_basic_totals.html',
 'https://www.basketball-reference.com/teams/CHA/stats_basic_totals.html',
 'https://www.basketball-reference.com/teams/CHI/stats_basic_totals.html',
 'https://www.basketball-reference.com/teams/CLE/stats_basic_totals.html',
 'https://www.basketball-reference.com/teams/DAL/stats_basic_totals.html',
 'https://www.basketball-reference.com/teams/DEN/stats_basic_totals.html',
 'https://www.basketball-reference.com/teams/DET/stats_basic_totals.html',
 'https://www.basketball-reference.com/teams/GSW/stats_basic_totals.html']

## Building Function to Webscrape NBA Data

To cycle through the list of urls I saved, I am going to build a function called ```get_team_stat()```. This function takes a single url and a dataframe to append the webscraped rows to. I will be using Python's BeautifulSoup package to parse the HTML from each url, with ```lxml``` as the underlying parser, and then walk through the resulting 'soup' to pull out the table cells.

In [17]:
nba_team_stats = pd.DataFrame()
nba_team_stats.shape

(0, 0)

In [18]:
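def get_team_stat(url, df):
    """Scrape the most recent seasons from one team's stats page and add them to df."""
    all_stats = []
    req = requests.get(url)
    soup = bs4.BeautifulSoup(req.text, 'lxml')
    table = soup.select("tbody")

    temp = []
    count = 0
    i = 0
    # each table row holds 34 cells, so every 34th cell starts a new season
    for div in table[0].find_all(["th","td"]):
        if count / 34 == 12:  # stop after reading 12 rows; the 12th is never stored
            break
        if count % 34 == 0:
            # store the previous row; the very first pass stores an empty placeholder row
            all_stats.insert(i, temp)
            temp = []
            temp.append(div.get_text())
            count += 1
            i += 1
        else:
            temp.append(div.get_text())
            count += 1

    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
    df = pd.concat([df, pd.DataFrame(all_stats)])

    return df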
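The counting logic in `get_team_stat()` amounts to cutting one flat list of cells into fixed-width rows. Here is a minimal sketch of that idea on its own, using a hypothetical `chunk_rows` helper and a toy row width of 3 instead of 34:

```python
def chunk_rows(cells, width, max_rows):
    """Split a flat list of cell values into rows of `width` cells.

    Hypothetical helper for illustration: stops after `max_rows` complete
    rows and ignores any trailing partial row.
    """
    rows = []
    for start in range(0, len(cells), width):
        row = cells[start:start + width]
        if len(row) < width or len(rows) == max_rows:
            break
        rows.append(row)
    return rows


# Three toy "seasons" of three cells each, flattened into one list.
flat = ['2023-24', 'ATL', '24', '2022-23', 'ATL', '41', '2021-22', 'ATL', '43']
rows = chunk_rows(flat, width=3, max_rows=2)
print(rows)
```

Unlike the function above, this version does not emit an empty first row, so nothing would need to be dropped afterwards.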

In [19]:
import time

frames = []
for url in urls:
    time.sleep(2) # slow down the requests to put less strain on website
    frames.append(get_team_stat(url, pd.DataFrame()))

# combine the per-team dataframes (DataFrame.append was removed in pandas 2.0)
nba_team_stats = pd.concat(frames)
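With 30 requests in a row, a single failed page would otherwise stop the whole loop. As a rough sketch of one way to handle that, the hypothetical `fetch_all` helper below pauses between requests and collects failures instead of raising. The `fetch` argument is an assumption of mine: any callable that takes a url and returns the page text, for example `lambda u: requests.get(u).text`.

```python
import time


def fetch_all(urls, fetch, delay_seconds=2.0):
    """Call `fetch(url)` for each url, pausing between requests.

    Hypothetical helper: failed urls are collected in a list rather than
    aborting the whole run.
    """
    pages, failed = {}, []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # wait between requests to go easy on the site
        try:
            pages[url] = fetch(url)
        except Exception as error:  # deliberately broad for this sketch
            failed.append((url, str(error)))
    return pages, failed


# Offline demonstration with a stand-in fetcher, so no network access is needed.
def fake_fetch(url):
    if url.endswith('/XXX/'):
        raise ValueError('unknown team')
    return '<html>' + url + '</html>'


pages, failed = fetch_all(['teams/ATL/', 'teams/XXX/'], fake_fetch, delay_seconds=0)
print(sorted(pages), failed)
```

Passing the fetcher in as an argument keeps the delay and error handling testable without hitting the real website.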

In [20]:
nba_team_stats.head(11)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | | | | | | | | | ... | | | | | | | | | | |
| 1 | 2023-24 | NBA | ATL | 24.0 | 29.0 | 3.0 | | 26.1 | 6-6 | 209.0 | ... | 0.811 | 686.0 | 1690.0 | 2376.0 | 1386.0 | 407.0 | 238.0 | 710.0 | 997.0 | 6448.0 |
| 2 | 2022-23 | NBA | ATL | 41.0 | 41.0 | 2.0 | | 24.9 | 6-6 | 210.0 | ... | 0.818 | 920.0 | 2719.0 | 3639.0 | 2049.0 | 580.0 | 401.0 | 1060.0 | 1541.0 | 9711.0 |
| 3 | 2021-22 | NBA | ATL | 43.0 | 39.0 | 2.0 | | 26.1 | 6-6 | 211.0 | ... | 0.812 | 823.0 | 2783.0 | 3606.0 | 2017.0 | 587.0 | 348.0 | 972.0 | 1534.0 | 9343.0 |
| 4 | 2020-21 | NBA | ATL | 41.0 | 31.0 | 1.0 | | 25.4 | 6-6 | 214.0 | ... | 0.812 | 760.0 | 2525.0 | 3285.0 | 1737.0 | 503.0 | 342.0 | 953.0 | 1392.0 | 8186.0 |
| 5 | 2019-20 | NBA | ATL | 20.0 | 47.0 | 5.0 | | 24.1 | 6-6 | 216.0 | ... | 0.79 | 661.0 | 2237.0 | 2898.0 | 1605.0 | 523.0 | 341.0 | 1086.0 | 1548.0 | 7488.0 |
| 6 | 2018-19 | NBA | ATL | 29.0 | 53.0 | 5.0 | | 25.1 | 6-7 | 215.0 | ... | 0.752 | 955.0 | 2825.0 | 3780.0 | 2118.0 | 675.0 | 419.0 | 1397.0 | 1932.0 | 9294.0 |
| 7 | 2017-18 | NBA | ATL | 24.0 | 58.0 | 5.0 | | 25.4 | 6-6 | 212.0 | ... | 0.785 | 743.0 | 2693.0 | 3436.0 | 1946.0 | 638.0 | 348.0 | 1276.0 | 1606.0 | 8475.0 |
| 8 | 2016-17 | NBA | ATL | 43.0 | 39.0 | 2.0 | | 27.9 | 6-6 | 219.0 | ... | 0.728 | 842.0 | 2793.0 | 3635.0 | 1938.0 | 672.0 | 397.0 | 1294.0 | 1491.0 | 8459.0 |
| 9 | 2015-16 | NBA | ATL | 48.0 | 34.0 | 2.0 | | 28.2 | 6-6 | 217.0 | ... | 0.783 | 679.0 | 2772.0 | 3451.0 | 2100.0 | 747.0 | 486.0 | 1226.0 | 1570.0 | 8433.0 |


Now that I have the data for all 30 NBA teams over the last decade, I need to clean up the dataset a bit. First, I need to give every column its correct name, which I can get by extracting the table header from a ```soup``` object for one of the team pages.

In [32]:
# get headers for dataframe
req = requests.get("https://www.basketball-reference.com/teams/ATL/stats_basic_totals.html")
soup = bs4.BeautifulSoup(req.text, 'lxml')

In [35]:
# select every table row; the first one holds the column headers
w_header = soup.select("tr")

In [36]:
headers = []
for h in w_header[0].find_all("th"):
    headers.append(h.get_text())
    
print(headers)

['Season', 'Lg', 'Tm', 'W', 'L', 'Finish', '\xa0', 'Age', 'Ht.', 'Wt.', '\xa0', 'G', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']


In [37]:
nba_team_stats.columns = headers

In [38]:
nba_team_stats.columns

Index(['Season', 'Lg', 'Tm', 'W', 'L', 'Finish', ' ', 'Age', 'Ht.', 'Wt.', ' ',
       'G', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%',
       'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV',
       'PF', 'PTS'],
      dtype='object')

## Clean Dataset

Now that the data is successfully extracted, let's prepare the dataframe for export to a csv file. To do this, I am going to drop the two empty columns that were created when webscraping, as well as the empty placeholder row (index 0) that ```get_team_stat()``` added at the top of each team's data.

In [39]:
# drop empty columns
cols = [6,10]
nba_team_stats.drop(nba_team_stats.columns[cols], axis = 1, inplace = True)
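The positions 6 and 10 match the two blank header cells in the scraped table. Rather than hard-coding them, they could be derived from the header list itself. This is a small sketch using a shortened copy of the headers printed earlier:

```python
# First part of the scraped header list; '\xa0' is a non-breaking space.
headers = ['Season', 'Lg', 'Tm', 'W', 'L', 'Finish', '\xa0', 'Age',
           'Ht.', 'Wt.', '\xa0', 'G', 'MP']

# str.strip() also removes '\xa0', so blank headers collapse to an empty string
blank_positions = [i for i, name in enumerate(headers) if not name.strip()]
print(blank_positions)
```

The resulting list could be passed to `nba_team_stats.columns[...]` in place of the literal `[6, 10]`.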

In [40]:
# drop the empty placeholder row (index label 0) left at the top of each team's data
nba_team_stats.drop(index=0, inplace=True)

In [41]:
# copy original dataset to make changes that don't affect the original
cleaned_df = nba_team_stats.copy()

In [42]:
# drop incomplete season
cleaned_df = cleaned_df.loc[cleaned_df["Season"] != '2023-24'].copy()
cleaned_df.shape

(300, 32)

In [43]:
cleaned_df.head(10)

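| | Season | Lg | Tm | W | L | Finish | Age | Ht. | Wt. | G | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2022-23 | NBA | ATL | 41 | 41 | 2 | 24.9 | 6-6 | 210 | 82 | ... | 0.818 | 920 | 2719 | 3639 | 2049 | 580 | 401 | 1060 | 1541 | 9711 |
| 3 | 2021-22 | NBA | ATL | 43 | 39 | 2 | 26.1 | 6-6 | 211 | 82 | ... | 0.812 | 823 | 2783 | 3606 | 2017 | 587 | 348 | 972 | 1534 | 9343 |
| 4 | 2020-21 | NBA | ATL | 41 | 31 | 1 | 25.4 | 6-6 | 214 | 72 | ... | 0.812 | 760 | 2525 | 3285 | 1737 | 503 | 342 | 953 | 1392 | 8186 |
| 5 | 2019-20 | NBA | ATL | 20 | 47 | 5 | 24.1 | 6-6 | 216 | 67 | ... | 0.79 | 661 | 2237 | 2898 | 1605 | 523 | 341 | 1086 | 1548 | 7488 |
| 6 | 2018-19 | NBA | ATL | 29 | 53 | 5 | 25.1 | 6-7 | 215 | 82 | ... | 0.752 | 955 | 2825 | 3780 | 2118 | 675 | 419 | 1397 | 1932 | 9294 |
| 7 | 2017-18 | NBA | ATL | 24 | 58 | 5 | 25.4 | 6-6 | 212 | 82 | ... | 0.785 | 743 | 2693 | 3436 | 1946 | 638 | 348 | 1276 | 1606 | 8475 |
| 8 | 2016-17 | NBA | ATL | 43 | 39 | 2 | 27.9 | 6-6 | 219 | 82 | ... | 0.728 | 842 | 2793 | 3635 | 1938 | 672 | 397 | 1294 | 1491 | 8459 |
| 9 | 2015-16 | NBA | ATL | 48 | 34 | 2 | 28.2 | 6-6 | 217 | 82 | ... | 0.783 | 679 | 2772 | 3451 | 2100 | 747 | 486 | 1226 | 1570 | 8433 |
| 10 | 2014-15 | NBA | ATL | 60 | 22 | 1 | 27.8 | 6-6 | 218 | 82 | ... | 0.778 | 715 | 2611 | 3326 | 2111 | 744 | 380 | 1167 | 1457 | 8409 |
| 11 | 2013-14 | NBA | ATL | 38 | 44 | 4 | 27.6 | 6-6 | 220 | 82 | ... | 0.781 | 713 | 2565 | 3278 | 2041 | 680 | 326 | 1251 | 1577 | 8282 |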


In [44]:
cleaned_df.dtypes

Season    object
Lg        object
Tm        object
W         object
L         object
Finish    object
Age       object
Ht.       object
Wt.       object
G         object
MP        object
FG        object
FGA       object
FG%       object
3P        object
3PA       object
3P%       object
2P        object
2PA       object
2P%       object
FT        object
FTA       object
FT%       object
ORB       object
DRB       object
TRB       object
AST       object
STL       object
BLK       object
TOV       object
PF        object
PTS       object
dtype: object

Looking at the data types in the dataframe, all the columns are objects because every scraped value is stored as a string. A csv file does not store data types, so when the file is read back in with pandas the numeric columns will be inferred automatically, and there is no need to convert every column now.
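If numeric types were needed before the export, for example to compute something on the scraped values right away, one option would be `pd.to_numeric`. The sketch below runs on a tiny made-up frame with the same string-typed layout. The choice of columns to convert is my own assumption, and columns such as `Ht.` (values like `6-6`) would need separate handling.

```python
import pandas as pd

# Small stand-in for cleaned_df: every value is a string, as in the scraped data.
sample = pd.DataFrame({
    'Season': ['2022-23', '2021-22'],
    'Tm': ['ATL', 'ATL'],
    'W': ['41', '43'],
    'FT%': ['0.818', '0.812'],
})

# Convert only the columns that hold plain numbers.
numeric_cols = ['W', 'FT%']
sample[numeric_cols] = sample[numeric_cols].apply(pd.to_numeric)
print(sample.dtypes)
```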

In [45]:
# replace with single year (the year each season ended)
cleaned_df['Season'] = cleaned_df['Season'].replace({'2022-23': 2023, '2021-22' : 2022, '2020-21' : 2021,'2019-20': 2020, '2018-19' : 2019, 
                                                     '2017-18' : 2018, '2016-17' : 2017, '2015-16' : 2016, '2014-15' : 2015, '2013-14' : 2014})
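The mapping follows a simple rule: each season is labeled by the year it ended, which is the four-digit start year plus one. As a minimal sketch, a hypothetical helper (`season_end_year` is my own name) could compute this instead of listing every season by hand:

```python
def season_end_year(label):
    """Convert a season label such as '2022-23' to the year it ended (2023).

    Hypothetical helper: assumes the label starts with a four-digit start year.
    """
    return int(label[:4]) + 1


labels = ['2022-23', '2013-14', '1999-00']
end_years = [season_end_year(label) for label in labels]
print(end_years)
```

On the dataframe, this could be applied with `cleaned_df['Season'].map(season_end_year)`, which would also cover seasons that are not in the hand-written dictionary.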

In [46]:
cleaned_df.head(10)

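| | Season | Lg | Tm | W | L | Finish | Age | Ht. | Wt. | G | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2023 | NBA | ATL | 41 | 41 | 2 | 24.9 | 6-6 | 210 | 82 | ... | 0.818 | 920 | 2719 | 3639 | 2049 | 580 | 401 | 1060 | 1541 | 9711 |
| 3 | 2022 | NBA | ATL | 43 | 39 | 2 | 26.1 | 6-6 | 211 | 82 | ... | 0.812 | 823 | 2783 | 3606 | 2017 | 587 | 348 | 972 | 1534 | 9343 |
| 4 | 2021 | NBA | ATL | 41 | 31 | 1 | 25.4 | 6-6 | 214 | 72 | ... | 0.812 | 760 | 2525 | 3285 | 1737 | 503 | 342 | 953 | 1392 | 8186 |
| 5 | 2020 | NBA | ATL | 20 | 47 | 5 | 24.1 | 6-6 | 216 | 67 | ... | 0.79 | 661 | 2237 | 2898 | 1605 | 523 | 341 | 1086 | 1548 | 7488 |
| 6 | 2019 | NBA | ATL | 29 | 53 | 5 | 25.1 | 6-7 | 215 | 82 | ... | 0.752 | 955 | 2825 | 3780 | 2118 | 675 | 419 | 1397 | 1932 | 9294 |
| 7 | 2018 | NBA | ATL | 24 | 58 | 5 | 25.4 | 6-6 | 212 | 82 | ... | 0.785 | 743 | 2693 | 3436 | 1946 | 638 | 348 | 1276 | 1606 | 8475 |
| 8 | 2017 | NBA | ATL | 43 | 39 | 2 | 27.9 | 6-6 | 219 | 82 | ... | 0.728 | 842 | 2793 | 3635 | 1938 | 672 | 397 | 1294 | 1491 | 8459 |
| 9 | 2016 | NBA | ATL | 48 | 34 | 2 | 28.2 | 6-6 | 217 | 82 | ... | 0.783 | 679 | 2772 | 3451 | 2100 | 747 | 486 | 1226 | 1570 | 8433 |
| 10 | 2015 | NBA | ATL | 60 | 22 | 1 | 27.8 | 6-6 | 218 | 82 | ... | 0.778 | 715 | 2611 | 3326 | 2111 | 744 | 380 | 1167 | 1457 | 8409 |
| 11 | 2014 | NBA | ATL | 38 | 44 | 4 | 27.6 | 6-6 | 220 | 82 | ... | 0.781 | 713 | 2565 | 3278 | 2041 | 680 | 326 | 1251 | 1577 | 8282 |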


In [None]:
cleaned_df.to_csv("C:/Users/Joyce/Downloads/nba_teams.csv")