# Data Collection and Cleaning

### Research Question: What factors affect a player's average points per game?

To determine which factors affect an NBA player's average points per game, we scraped the statistics of every player from basketball-reference.com. We focused on the 2008 to 2017 seasons in order to predict average points per game in the 2018 season.

From the website, we collected each player's performance stats as well as other relevant information. The columns of our dataframe are the player's name, height, weight, age, position, season, team, games played, games started, minutes played per game, field goal percentage, 3-point field goal percentage, 2-point field goal percentage, free throw percentage, offensive rebounds per game, defensive rebounds per game, assists per game, steals per game, blocks per game, turnovers per game, personal fouls per game, and points per game.

In [97]:
# Import statements
%matplotlib inline
import pandas as pd
import numpy as np
import requests
import json
import time
from bs4 import BeautifulSoup

### Web Scraping for Player Info:

In [98]:
# Main link
resp = requests.get("https://www.basketball-reference.com/")
soup = BeautifulSoup(resp.content, "html.parser")
time.sleep(0.5)

In [99]:
# Access the players page
content = soup.find("div", {"id": "content"})
player_link = content.find("div", {"class": "", "id": "players"})
a_tag = player_link.find("a", href=True)
resp_href = requests.get("https://www.basketball-reference.com" + a_tag['href'])
player_soup = BeautifulSoup(resp_href.content, "html.parser")
time.sleep(0.5)
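
The fetch, parse, and pause pattern in the cells above recurs throughout this notebook. A minimal sketch of how it might be factored into one helper follows. `fetch_page`, `BASE_URL`, and the `get` and `sleep` parameters are hypothetical names introduced here, not part of the original code. Passing the getter in as an argument keeps the sketch runnable without network access.

```python
import time
from urllib.parse import urljoin

# Assumed base URL, taken from the requests above
BASE_URL = "https://www.basketball-reference.com"


def fetch_page(href, get, delay=0.5, sleep=time.sleep):
    """Request BASE_URL joined with href, then pause before returning."""
    response = get(urljoin(BASE_URL, href))
    # Pause between requests so the site is not hit in rapid succession
    sleep(delay)
    return response
```

In the notebook this could be called as `fetch_page(a_tag['href'], requests.get)`, with the result's `.content` passed to `BeautifulSoup` as before.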

In [100]:
# Find the list of players A - Z (by last name)
wrap = player_soup.find("div", {"id": "wrap"}).find("div", {"id": "content", "role": "main", "class": "index"})
letter_index = wrap.find("ul", {"class": "page_index"})

letter_arr = letter_index.find_all("a", href=True)
letter_arr = letter_arr[0:118:7] + letter_arr[118:173:7]

**Create the first dataframe with each player's weight and height:**

In [101]:
# Find all of the players and create the data frame
playerLinks = []
weight_list = []
height_list = []
name_list = []
for letter in letter_arr:
    resp_href = requests.get("https://www.basketball-reference.com" + letter['href'])
    players_soup = BeautifulSoup(resp_href.content, "html.parser")
    time.sleep(0.2)
    player_table = players_soup.find("div", {"class": "table_outer_container"})
    for player in player_table.find_all("a", href=True):
        if player["href"].startswith("/players/"):
            playerLinks.append(player)
            name_list.append(player.text)
    for weight in player_table.find_all("td", {"data-stat": "weight"}):
        weight_list.append(weight.text)
    for height in player_table.find_all("td", {"data-stat": "height"}):
        height_list.append(height.text)
print("Number of players: " + str(len(playerLinks)))

Number of players: 4677


The `player_info_df` dataframe lists every player with their height and weight. We converted height from a feet-inches string to total inches and parsed weight (in pounds) as a number.

In [102]:
# Clean up the columns, change height to inches
d = {"name": name_list, "height": height_list, "weight": weight_list}
player_info_df = pd.DataFrame(data=d)
player_info_df["weight"] = pd.to_numeric(player_info_df["weight"])

height_inches = []
height_arr = player_info_df["height"].str.split("-")
for row in height_arr:
    row = list(map(int, row))
    row = [row[0] * 12, row[1]]
    row = row[0] + row[1]
    height_inches.append(row)
player_info_df["height"] = height_inches

player_info_df.head()

| | name | height | weight |
|---|---|---|---|
| 0 | Alaa Abdelnaby | 82 | 240.0 |
| 1 | Zaid Abdul-Aziz | 81 | 235.0 |
| 2 | Kareem Abdul-Jabbar | 86 | 225.0 |
| 3 | Mahmoud Abdul-Rauf | 73 | 162.0 |
| 4 | Tariq Abdul-Wahad | 78 | 223.0 |
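
The height conversion in the cell above can be isolated as a small function, which makes the arithmetic easier to check on its own. This is a sketch under the assumption that every height string has the form `feet-inches`; `height_to_inches` is a hypothetical name, not part of the original notebook.

```python
def height_to_inches(height):
    """Convert a 'feet-inches' string such as '6-10' to total inches."""
    feet, inches = map(int, height.split("-"))
    return feet * 12 + inches


print(height_to_inches("6-10"))  # 82
```

With this helper, the loop could be replaced by `player_info_df["height"].map(height_to_inches)`.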


**Create the second dataframe with each player's game statistics:**

In [103]:
# Create the columns
resp_href = requests.get("https://www.basketball-reference.com" + playerLinks[0]['href'])
players_soup = BeautifulSoup(resp_href.content, "html.parser")
time.sleep(0.1)
main_class = players_soup.find("div", {"class": "overthrow table_container", "id":"div_per_game"})
player_table = main_class.find("table", {"class": "row_summable sortable stats_table"})
year = player_table.find("tbody").find_all("tr")[1]
columns = ["name", "season"]
for stat in year.find_all("td"):
    columns.append(stat["data-stat"])

In [104]:
# Create the player detail array
info_list = []

# Loop through all of the players
for player in playerLinks:
    resp_href = requests.get("https://www.basketball-reference.com" + player['href'])
    players_soup = BeautifulSoup(resp_href.content, "html.parser")
    time.sleep(0.3)
    main_class = players_soup.find("div", {"class": "overthrow table_container", "id":"div_per_game"})
    player_table = main_class.find("table", {"class": "row_summable sortable stats_table"})
    
    for year in player_table.find("tbody").find_all("tr"):
        if year.find("th"):
            year_numeric = int(year.find("th").text[:-3])
            
            # Keep seasons whose starting year is between 2008 and 2017
            if 2008 <= year_numeric <= 2017:
                list_of_val = [player.text, year.find("th").text]
                for stat in year.find_all("td"):
                    list_of_val.append(stat.text)
                info_list.append(list_of_val)
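
Season labels in the per-game table take the form `2016-17`, so dropping the last three characters leaves the starting year, and that starting year is what the loop above filters on. A minimal sketch of the same logic as standalone functions follows; `season_start_year` and `in_study_window` are hypothetical names used only for illustration.

```python
def season_start_year(label):
    """Return the starting year of a season label such as '2016-17'."""
    return int(label[:-3])


def in_study_window(label, first=2008, last=2017):
    """True when the season's starting year lies between first and last, inclusive."""
    return first <= season_start_year(label) <= last


print(season_start_year("2016-17"))  # 2016
```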

The `nba_players_df` dataframe contains the game performance statistics that we scraped from basketball-reference.com.

In [105]:
# Create the DataFrame and clean the columns
player_df = pd.DataFrame(info_list,columns=columns)
player_df["season"] = player_df["season"].str[:-3].astype(float)
for column in player_df:
    if column not in ("name", "team_id", "lg_id", "pos"):
        player_df[column] = player_df[column].replace('', "0")
        player_df[column] = player_df[column].astype(float)
        player_df[column] = player_df[column].fillna(0)
# Copy the filtered rows so the column assignment below does not act on a view of player_df
nba_players_df = player_df[player_df["lg_id"] == "NBA"].copy()
nba_players_df["pos"] = nba_players_df["pos"].map({
       "SG": "SG",
       "PG": "PG",
       "SF": "SF",
       "C": "C",
       "PF": "PF",
       "SG-PG": "SG",
       "PG-SG": "PG",
       "SG-SF": "SG",
       "SF-SG": "SF",
       "C-PF": "C",
       "PF-C": "PF",
       "SF-PF": "SF"
})
nba_players_df = nba_players_df[nba_players_df["team_id"] != "TOT"]
nba_players_df = nba_players_df.drop(columns=["fg_per_g", "fga_per_g", 
                                              "fg3_per_g", "fg3a_per_g",
                                              "fg2_per_g", "fg2a_per_g",
                                              "ft_per_g", "fta_per_g",
                                              "trb_per_g", "lg_id"])
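
Every entry in the position mapping above keeps the first position of a hyphenated label. That rule can be sketched as a function, shown here as an alternative rather than the notebook's method. Unlike `.map` with a dictionary, which yields `NaN` for any label missing from the dictionary, this version also handles combinations that are not listed. `primary_position` is a hypothetical name.

```python
def primary_position(pos):
    """Return the first position listed in a label such as 'SG-PG'."""
    return pos.split("-")[0]


print(primary_position("SG-PG"))  # SG
print(primary_position("C"))      # C
```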

For the final dataframe, we merged `nba_players_df` and `player_info_df` on the player's name.

In [106]:
# Combine the two dataframes
final_df = pd.merge(nba_players_df, player_info_df, on='name')
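
Merging on name assumes that each name identifies a single player in `player_info_df`. If two players share a name, the merge pairs every stat row with every matching info row, which duplicates rows and can attach the wrong height and weight. A minimal sketch of a check that could be run on `name_list` beforehand follows; `duplicated_names` is a hypothetical helper and the sample names are placeholders.

```python
from collections import Counter


def duplicated_names(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in counts if counts[name] > 1]


# Placeholder names for illustration only
sample = ["Player A", "Player B", "Player A", "Player C"]
print(duplicated_names(sample))  # ['Player A']
```

Alternatively, passing `validate="many_to_one"` to `pd.merge` raises an error when the keys in the right-hand dataframe are not unique.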

In [107]:
# Display the final dataframe
final_df

| | name | season | age | team_id | pos | g | gs | mp_per_g | fg_pct | fg3_pct | ... | orb_per_g | drb_per_g | ast_per_g | stl_per_g | blk_per_g | tov_per_g | pf_per_g | pts_per_g | height | weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alex Abrines | 2016.0 | 23.0 | OKC | SG | 68.0 | 6.0 | 15.5 | 0.393 | 0.381 | ... | 0.3 | 1.0 | 0.6 | 0.5 | 0.1 | 0.5 | 1.7 | 6.0 | 78 | 200.0 |
| 1 | Alex Abrines | 2017.0 | 24.0 | OKC | SG | 75.0 | 8.0 | 15.1 | 0.395 | 0.380 | ... | 0.3 | 1.2 | 0.4 | 0.5 | 0.1 | 0.3 | 1.7 | 4.7 | 78 | 200.0 |
| 2 | Alex Acker | 2008.0 | 26.0 | DET | SG | 7.0 | 0.0 | 2.9 | 0.364 | 0.000 | ... | 0.0 | 0.3 | 0.1 | 0.3 | 0.1 | 0.0 | 0.0 | 1.3 | 77 | 185.0 |
| 3 | Alex Acker | 2008.0 | 26.0 | LAC | SG | 18.0 | 0.0 | 9.9 | 0.400 | 0.438 | ... | 0.4 | 0.8 | 0.6 | 0.2 | 0.2 | 0.4 | 0.5 | 3.5 | 77 | 185.0 |
| 4 | Quincy Acy | 2012.0 | 22.0 | TOR | PF | 29.0 | 0.0 | 11.8 | 0.560 | 0.500 | ... | 1.0 | 1.6 | 0.4 | 0.4 | 0.5 | 0.6 | 1.8 | 4.0 | 79 | 240.0 |
| 5 | Quincy Acy | 2013.0 | 23.0 | TOR | SF | 7.0 | 0.0 | 8.7 | 0.429 | 0.400 | ... | 0.7 | 1.4 | 0.6 | 0.6 | 0.4 | 0.3 | 1.1 | 2.7 | 79 | 240.0 |
| 6 | Quincy Acy | 2013.0 | 23.0 | SAC | SF | 56.0 | 0.0 | 14.0 | 0.472 | 0.200 | ... | 1.2 | 2.4 | 0.4 | 0.3 | 0.4 | 0.5 | 2.0 | 2.7 | 79 | 240.0 |
| 7 | Quincy Acy | 2014.0 | 24.0 | NYK | PF | 68.0 | 22.0 | 18.9 | 0.459 | 0.300 | ... | 1.2 | 3.3 | 1.0 | 0.4 | 0.3 | 0.9 | 2.2 | 5.9 | 79 | 240.0 |
| 8 | Quincy Acy | 2015.0 | 25.0 | SAC | PF | 59.0 | 29.0 | 14.8 | 0.556 | 0.388 | ... | 1.1 | 2.1 | 0.5 | 0.5 | 0.4 | 0.5 | 1.7 | 5.2 | 79 | 240.0 |
| 9 | Quincy Acy | 2016.0 | 26.0 | DAL | PF | 6.0 | 0.0 | 8.0 | 0.294 | 0.143 | ... | 0.3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.3 | 1.5 | 2.2 | 79 | 240.0 |


In [108]:
# Convert to CSV file
final_df.to_csv("/Users/lhan/Desktop/Senior/DATA301/project/player.csv", index=False)