# Data Collection and Cleaning

### Research Question: What factors affect a player's average points per game?

To determine which factors affect an NBA player's average points per game, we scraped the statistics of every player and team from basketball-reference.com. We focused on the seasons starting in 2012 through 2017 in order to predict average points per game for the 2017-18 season.

From the website, we collected each player's performance statistics along with relevant information about his team. After merging the dataframes, the final dataframe's columns include the player's name, height, weight, age, position, season, team, games played, games started, minutes played per game, field goal percentage, 3-point field goal percentage, 2-point field goal percentage, free throw percentage, offensive rebounds per game, defensive rebounds per game, assists per game, steals per game, blocks per game, turnovers per game, personal fouls per game, and points per game.

In [65]:
# Import statements
%matplotlib inline
import pandas as pd
import numpy as np
import requests
import json
import time
from bs4 import BeautifulSoup

### Web Scraping for Player Info:

In [66]:
# Main link
resp = requests.get("https://www.basketball-reference.com/")
soup = BeautifulSoup(resp.content, "html.parser")
time.sleep(0.5)

In [67]:
# Access the players page
content = soup.find("div", {"id": "content"})
player_link = content.find("div", {"class": "", "id": "players"})
a_tag = player_link.find("a", href=True)
resp_href = requests.get("https://www.basketball-reference.com" + a_tag['href'])
player_soup = BeautifulSoup(resp_href.content, "html.parser")
time.sleep(0.5)
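
The cells in this section pause with a fixed `time.sleep(...)` after each request. As a rough sketch of an alternative, the helper below waits only as long as needed to keep a minimum gap between requests. The `Throttle` class and its parameters are hypothetical names introduced for illustration; they are not part of the original notebook.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the helper can be tested offline
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first

    def wait(self):
        """Sleep just long enough to respect min_interval, then record the time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

One way to use it would be to create `throttle = Throttle(0.5)` once and call `throttle.wait()` immediately before each `requests.get(...)`.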

In [119]:
# Find the list of players A - Z (by last name)
wrap = player_soup.find("div", {"id": "wrap"}).find("div", {"id": "content", "role": "main", "class": "index"})
letter_index = wrap.find("ul", {"class": "page_index"})

letter_arr = letter_index.find_all("a", href=True)
letter_arr = letter_arr[0:118:7] + letter_arr[118:173:7]
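
The slice indices above (`0:118:7` and `118:173:7`) depend on how many links the index page happened to contain when it was scraped. A possibly more robust sketch is to keep only the links whose `href` looks like a letter page. The pattern `/players/<letter>/` is an assumption about the site's URL layout, and `letter_page_hrefs` is a hypothetical helper name.

```python
import re

# Assumption: each letter page link has the form "/players/a/".
LETTER_HREF = re.compile(r"^/players/[a-z]/$")


def letter_page_hrefs(hrefs):
    """Return the unique letter-page hrefs, in first-seen order."""
    seen = []
    for href in hrefs:
        if LETTER_HREF.match(href) and href not in seen:
            seen.append(href)
    return seen


sample = ["/players/a/", "/players/a/abrinal01.html", "/players/b/",
          "/players/a/", "/leagues/"]
print(letter_page_hrefs(sample))  # ['/players/a/', '/players/b/']
```

In the notebook this would take `[a["href"] for a in letter_index.find_all("a", href=True)]` as input; the later loop would then iterate over plain strings rather than tag objects.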

**Create the first dataframe with each player's height and weight:**

In [146]:
# Find all of the players and create the data frame
playerLinks = []
weight_list = []
height_list = []
name_list = []
for letter in letter_arr:
    resp_href = requests.get("https://www.basketball-reference.com" + letter['href'])
    players_soup = BeautifulSoup(resp_href.content, "html.parser")
    time.sleep(0.2)
    player_table = players_soup.find("div", {"class": "table_outer_container"})
    for player in player_table.find("tbody").find_all("tr"):
        ext = player.find("a", href=True)
        year_val = int(str(player.find("td", {"data-stat": "year_max"}).text))
        weight = player.find("td", {"data-stat": "weight"})
        height = player.find("td", {"data-stat": "height"})
        if (ext["href"].startswith("/players/") and year_val >= 2012):
            playerLinks.append(ext)
            weight_list.append(weight.text)
            height_list.append(height.text)
            name_list.append(ext.text)
print("Number of players: " + str(len(playerLinks)))

Number of players: 1121


The `player_info_df` dataframe lists every player whose last listed year (`year_max`) is 2012 or later, along with his height and weight. We converted height from feet-inches to total inches and parsed weight, in pounds, as a number.

In [147]:
# Clean up the columns, change height to inches
d = {"name": name_list, "height": height_list, "weight": weight_list}
player_info_df = pd.DataFrame(data=d)
player_info_df["weight"] = pd.to_numeric(player_info_df["weight"])

height_inches = []
height_arr = player_info_df["height"].str.split("-")
for row in height_arr:
    row = list(map(int, row))
    row = [row[0] * 12, row[1]]
    row = row[0] + row[1]
    height_inches.append(row)
player_info_df["height"] = height_inches

player_info_df.head()

| | name | height | weight |
|---|---|---|---|
| 0 | Alex Abrines | 78 | 200 |
| 1 | Quincy Acy | 79 | 240 |
| 2 | Jaylen Adams | 74 | 190 |
| 3 | Jordan Adams | 77 | 209 |
| 4 | Steven Adams | 84 | 265 |
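
The height conversion above loops over each row. The same arithmetic (feet times 12 plus inches) can be sketched as a vectorized pandas operation. `feet_inches_to_inches` is a hypothetical helper name, and the sample values are made up for illustration.

```python
import pandas as pd


def feet_inches_to_inches(heights):
    """Convert a Series of "feet-inches" strings (e.g. "6-6") to total inches."""
    # expand=True yields one column for feet (0) and one for inches (1)
    parts = heights.str.split("-", expand=True).astype(int)
    return parts[0] * 12 + parts[1]


sample = pd.Series(["6-6", "7-0", "5-11"])
print(feet_inches_to_inches(sample).tolist())  # [78, 84, 71]
```

Because the same conversion is applied later to the team table's `avg_ht` column, a shared helper like this could serve both places.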


**Create the second dataframe with each player's game statistics:**

In [148]:
# Create the columns
resp_href = requests.get("https://www.basketball-reference.com" + playerLinks[0]['href'])
players_soup = BeautifulSoup(resp_href.content, "html.parser")
time.sleep(0.1)
main_class = players_soup.find("div", {"class": "overthrow table_container", "id":"div_per_game"})
player_table = main_class.find("table", {"class": "row_summable sortable stats_table"})
year = player_table.find("tbody").find_all("tr")[1]
columns = ["name", "season"]
for stat in year.find_all("td"):
    columns.append(stat["data-stat"])

In [150]:
# Create the player detail array
info_list = []

# Loop through all of the players
for player in playerLinks:
    resp_href = requests.get("https://www.basketball-reference.com" + player['href'])
    players_soup = BeautifulSoup(resp_href.content, "html.parser")
    time.sleep(0.2)
    main_class = players_soup.find("div", {"class": "overthrow table_container", "id":"div_per_game"})
    player_table = main_class.find("table", {"class": "row_summable sortable stats_table"})
    
    for year in player_table.find("tbody").find_all("tr"):
        if year.find("th"):
            year_numeric = int(str(year.find("th").text)[:-3])
            
            # Between season 2012 - 2017
            if (year_numeric >= 2012 and year_numeric <= 2017):
                list_of_val = [player.text, year.find("th").text]
                for stat in year.find_all("td"):
                    list_of_val.append(stat.text)
                info_list.append(list_of_val)
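
The season filter above relies on `int(label[:-3])`, which turns a label such as `2016-17` into `2016` but raises a `ValueError` for any label that is not in that form. A more defensive sketch is shown below; `season_start_year` and `in_study_window` are hypothetical helper names, and the default bounds simply mirror the 2012 to 2017 window used in this notebook.

```python
def season_start_year(label):
    """Return the starting year of a label like "2016-17", or None if malformed."""
    head = label.split("-")[0]
    return int(head) if head.isdigit() and len(head) == 4 else None


def in_study_window(label, first=2012, last=2017):
    """True when the season's starting year lies in [first, last]."""
    year = season_start_year(label)
    return year is not None and first <= year <= last


print(season_start_year("2016-17"))  # 2016
print(in_study_window("2011-12"))    # False
```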

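The `nba_players_df` dataframe, built in the next cell, contains the per-game performance statistics that we scraped from basketball-reference.com. It is limited to NBA rows and excludes the combined `TOT` rows listed for players who changed teams mid-season.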

In [151]:
# Create the DataFrame and clean the columns
player_df = pd.DataFrame(info_list,columns=columns)
player_df["season"] = player_df["season"].str[:-3].astype(float)
for column in player_df:
    if column not in ("name", "team_id", "lg_id", "pos"):
        player_df[column] = player_df[column].replace('', "0")
        player_df[column] = player_df[column].astype(float)
        player_df[column] = player_df[column].fillna(0)
# Copy the filtered rows so the column assignment below does not act on a view
nba_players_df = player_df[player_df["lg_id"] == "NBA"].copy()
nba_players_df["pos"] = nba_players_df["pos"].map({
       "SG": "SG",
       "PG": "PG",
       "SF": "SF",
       "C": "C",
       "PF": "PF",
       "SG-PG": "SG",
       "PG-SG": "PG",
       "SG-SF": "SG",
       "SF-SG": "SF",
       "C-PF": "C",
       "PF-C": "PF",
       "SF-PF": "SF"
})
nba_players_df = nba_players_df[nba_players_df["team_id"] != "TOT"]
nba_players_df = nba_players_df.drop(columns=["fg_per_g", "fga_per_g", 
                                              "fg2_per_g", "fg2a_per_g",
                                              "ft_per_g", "fta_per_g",
                                              "trb_per_g", "lg_id"])
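
Every hybrid label in the position dictionary above resolves to the first position listed, and any label missing from the dictionary becomes `NaN` under `.map`. Assuming the first-listed position is always the intended primary one, the same result can be sketched without enumerating the combinations. The label `PF-SF` below is a made-up example of a combination the dictionary does not cover.

```python
import pandas as pd

pos = pd.Series(["SG", "PG-SG", "C-PF", "SF-PF", "PF-SF"])

# Keep only the text before the first hyphen as the primary position
primary = pos.str.split("-").str[0]
print(primary.tolist())  # ['SG', 'PG', 'C', 'SF', 'PF']
```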

### Web Scraping for Team Info:

In [152]:
# Team link page
resp = requests.get("https://www.basketball-reference.com/teams")
soup = BeautifulSoup(resp.content, "html.parser")
teams = soup.find_all("tr", {"class": "full_table"})

# Get team link extension
team_link_ext = []
for i in range(len(teams)):
    team_info = teams[i].find_all("a", href = True)
    team_link_ext.append(team_info[0]["href"])
# First 30 teams are NBA teams
team_link_ext = team_link_ext[:30]

**Get division data to find finishes by conference:**

In [153]:
resp = requests.get("https://www.basketball-reference.com/teams")
soup = BeautifulSoup(resp.content, "html.parser")

division = []
team = []
for i in soup.find_all("div", {"class": "division"}):
    for j in (i.find_all("a")):
        division.append(i.find_all("strong")[0].text)
        team.append(j["href"][7:10]) 

In [154]:
# Create dataframe from scraped division data
division_col = ["division", "team_id"]
division_df = pd.DataFrame([division, team]).T
division_df.columns = division_col
division_df.head()

| | division | team_id |
|---|---|---|
| 0 | Atlantic | TOR |
| 1 | Atlantic | BOS |
| 2 | Atlantic | NYK |
| 3 | Atlantic | BRK |
| 4 | Atlantic | PHI |
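
The team abbreviation above is taken with a fixed slice, `j["href"][7:10]`, which works because `/teams/` is seven characters long. A sketch that states the expected link shape explicitly, and returns `None` for anything else, is shown below. `team_id_from_href` is a hypothetical helper name, and the three-uppercase-letter pattern is an assumption about the site's team URLs.

```python
import re

# Assumption: team links look like "/teams/TOR/" with a three-letter code.
TEAM_HREF = re.compile(r"^/teams/([A-Z]{3})/")


def team_id_from_href(href):
    """Extract the three-letter team code from a team link, or None."""
    match = TEAM_HREF.match(href)
    return match.group(1) if match else None


print(team_id_from_href("/teams/TOR/"))  # TOR
print(team_id_from_href("/players/a/"))  # None
```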


In [155]:
# Create the columns
columns = []
resp_href = requests.get("https://www.basketball-reference.com" + team_link_ext[0] + "/stats_per_game_totals.html")
team_soup = BeautifulSoup(resp_href.content, "html.parser")
main_class = team_soup.find("div", {"class": "overthrow table_container", "id":"div_stats"})
team_table = main_class.find("table", {"class": "sortable stats_table"})
for col in team_table.find_all("th"):
    columns.append(col["data-stat"])
columns = columns[:34]
del columns[6]
del columns[9]

In [156]:
info_list = []
for link in team_link_ext:
    resp = requests.get("https://www.basketball-reference.com" + link + "/stats_per_game_totals.html")
    soup = BeautifulSoup(resp.content, "html.parser")
    time.sleep(0.2)
    main_class = soup.find("div", {"class": "overthrow table_container", "id":"div_stats"})
    team_table = main_class.find("table", {"class": "sortable stats_table"})
    for year in team_table.find("tbody").find_all("tr"):
        if (year.find("th") and year.find("th").text != "Season"):
            year_numeric = int(str(year.find("th").text)[:-3])

            if (year_numeric >= 2012 and year_numeric <= 2017):
                list_of_val = [year.find("th").text]
                for stat in year.find_all("td"):
                    if stat.text != "":
                        list_of_val.append(stat.text)
                info_list.append(list_of_val)

In [157]:
# Create the DataFrame and clean the columns
team_df = pd.DataFrame(info_list,columns=columns)
team_df["season"] = team_df["season"].str[:-3].astype(float)
team_df["avg_wt"] = pd.to_numeric(team_df["avg_wt"])

height_inches = []
height_arr = team_df["avg_ht"].str.split("-")
for row in height_arr:
    row = list(map(int, row))
    row = [row[0] * 12, row[1]]
    row = row[0] + row[1]
    height_inches.append(row)
team_df["avg_ht"] = height_inches

for column in team_df:
    if column not in ("team_id", "lg_id"):
        team_df[column] = team_df[column].replace('', "0")
        team_df[column] = team_df[column].astype(float)
        team_df[column] = team_df[column].fillna(0)
# Rename older team abbreviations (CHA, NOH, NJN) to CHO, NOP, BRK
team_df["team_id"] = team_df["team_id"].replace({"CHA": "CHO", "NOH": "NOP", "NJN": "BRK"})
team_df.head()

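| | season | lg_id | team_id | wins | losses | rank_team | avg_age | avg_ht | avg_wt | g | ... | ft_pct | orb_per_g | drb_per_g | trb_per_g | ast_per_g | stl_per_g | blk_per_g | tov_per_g | pf_per_g | pts_per_g |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017.0 | NBA | ATL | 24.0 | 58.0 | 5.0 | 25.4 | 78.0 | 212.0 | 82.0 | ... | 0.785 | 9.1 | 32.8 | 41.9 | 23.7 | 7.8 | 4.2 | 15.6 | 19.6 | 103.4 |
| 1 | 2016.0 | NBA | ATL | 43.0 | 39.0 | 2.0 | 27.9 | 78.0 | 219.0 | 82.0 | ... | 0.728 | 10.3 | 34.1 | 44.3 | 23.6 | 8.2 | 4.8 | 15.8 | 18.2 | 103.2 |
| 2 | 2015.0 | NBA | ATL | 48.0 | 34.0 | 2.0 | 28.2 | 78.0 | 217.0 | 82.0 | ... | 0.783 | 8.3 | 33.8 | 42.1 | 25.6 | 9.1 | 5.9 | 15.0 | 19.1 | 102.8 |
| 3 | 2014.0 | NBA | ATL | 60.0 | 22.0 | 1.0 | 27.8 | 78.0 | 218.0 | 82.0 | ... | 0.778 | 8.7 | 31.8 | 40.6 | 25.7 | 9.1 | 4.6 | 14.2 | 17.8 | 102.5 |
| 4 | 2013.0 | NBA | ATL | 38.0 | 44.0 | 4.0 | 27.6 | 78.0 | 220.0 | 82.0 | ... | 0.781 | 8.7 | 31.3 | 40.0 | 24.9 | 8.3 | 4.0 | 15.3 | 19.2 | 101.0 |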


In [187]:
# Merge the division and team dataframes
team_div_df = team_df.merge(division_df, on="team_id")

# Map each division to its conference and store it as a new column
team_div_df["conference"] = team_div_df["division"].map({
    "Southeast": "E",
    "Atlantic": "E",
    "Central": "E",
    "Northwest": "W",
    "Pacific": "W",
    "Southwest": "W"
})

In [188]:
# Rank teams by wins within each season and conference to assign the conference finish
seasons = team_div_df.season.value_counts().index
conf = ["E", "W"]
for season in seasons:
    for c in conf:
        conference_ranking = team_div_df[(team_div_df.season == season) & (team_div_df.conference == c)].sort_values("wins", ascending = False)
        conference_ranking["conference_finish"] = range(1,len(conference_ranking) + 1)
        conf_idx = conference_ranking["conference_finish"].index
        team_div_df.loc[conf_idx, "finish"] = conference_ranking["conference_finish"].values
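
The nested loop above sorts each season and conference by wins and numbers the rows. The same idea can be sketched with a grouped rank. The data below is made up, and `method="first"` breaks ties by order of appearance; neither this sketch nor the loop applies the league's actual tiebreak rules.

```python
import pandas as pd

# Made-up standings for illustration only
standings = pd.DataFrame({
    "season": [2016, 2016, 2016, 2016],
    "conference": ["E", "E", "W", "W"],
    "team_id": ["AAA", "BBB", "CCC", "DDD"],
    "wins": [40, 50, 30, 60],
})

# Rank by wins (most wins = 1) within each season and conference
standings["finish"] = (
    standings.groupby(["season", "conference"])["wins"]
    .rank(method="first", ascending=False)
    .astype(int)
)
print(standings["finish"].tolist())  # [2, 1, 2, 1]
```

Unlike the loop, which leaves `finish` as a float column, this version casts the result to integers.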

In [189]:
team_div_df.columns

Index(['season', 'lg_id', 'team_id', 'wins', 'losses', 'rank_team', 'avg_age',
       'avg_ht', 'avg_wt', 'g', 'mp_per_g', 'fg_per_g', 'fga_per_g', 'fg_pct',
       'fg3_per_g', 'fg3a_per_g', 'fg3_pct', 'fg2_per_g', 'fg2a_per_g',
       'fg2_pct', 'ft_per_g', 'fta_per_g', 'ft_pct', 'orb_per_g', 'drb_per_g',
       'trb_per_g', 'ast_per_g', 'stl_per_g', 'blk_per_g', 'tov_per_g',
       'pf_per_g', 'pts_per_g', 'division', 'conference', 'finish'],
      dtype='object')

In [190]:
team_div_df = team_div_df[["season", "team_id", "wins", "losses",
                          "rank_team", "division", "conference",
                          "finish"]]

For the final dataframe, we merged `nba_players_df` with `player_info_df` on the player's name, then merged the result with `team_div_df` on team and season.

In [191]:
# Combine the two dataframes
final_player_df = pd.merge(nba_players_df, player_info_df, on='name')
final_df = pd.merge(final_player_df, team_div_df, on=["team_id", "season"])
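
Merging on `name` assumes each name identifies exactly one player in `player_info_df`. If two players share a name, every stats row for that name is paired with both height and weight records. One way to detect this, sketched below on made-up data, is the `validate` argument of `pd.merge`, which raises a `MergeError` when the right-hand keys are not unique.

```python
import pandas as pd

# Made-up frames: "B Player" appears twice in the info table
stats = pd.DataFrame({"name": ["A Player", "A Player", "B Player"],
                      "season": [2015, 2016, 2016]})
info = pd.DataFrame({"name": ["A Player", "B Player", "B Player"],
                     "height": [78, 80, 75]})

try:
    pd.merge(stats, info, on="name", validate="many_to_one")
    duplicate_names = False
except pd.errors.MergeError:
    duplicate_names = True

print(duplicate_names)  # True
```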

In [192]:
# Display the final dataframe
final_df

| | name | season | age | team_id | pos | g | gs | mp_per_g | fg_pct | fg3_per_g | ... | pf_per_g | pts_per_g | height | weight | wins | losses | rank_team | division | conference | finish |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alex Abrines | 2016.0 | 23.0 | OKC | SG | 68.0 | 6.0 | 15.5 | 0.393 | 1.4 | ... | 1.7 | 6.0 | 78 | 200 | 47.0 | 35.0 | 2.0 | Northwest | W | 6.0 |
| 1 | Steven Adams | 2016.0 | 23.0 | OKC | C | 80.0 | 80.0 | 29.9 | 0.571 | 0.0 | ... | 2.4 | 11.3 | 84 | 265 | 47.0 | 35.0 | 2.0 | Northwest | W | 6.0 |
| 2 | Semaj Christon | 2016.0 | 24.0 | OKC | PG | 64.0 | 1.0 | 15.2 | 0.345 | 0.2 | ... | 1.2 | 2.9 | 75 | 190 | 47.0 | 35.0 | 2.0 | Northwest | W | 6.0 |
| 3 | Norris Cole | 2016.0 | 28.0 | OKC | PG | 13.0 | 0.0 | 9.6 | 0.308 | 0.2 | ... | 1.4 | 3.3 | 74 | 175 | 47.0 | 35.0 | 2.0 | Northwest | W | 6.0 |
| 4 | Nick Collison | 2016.0 | 36.0 | OKC | PF | 20.0 | 0.0 | 6.4 | 0.609 | 0.0 | ... | 0.9 | 1.7 | 82 | 255 | 47.0 | 35.0 | 2.0 | Northwest | W | 6.0 |
| 5 | Taj Gibson | 2016.0 | 31.0 | OKC | PF | 23.0 | 16.0 | 21.2 | 0.497 | 0.0 | ... | 1.7 | 9.0 | 81 | 232 | 47.0 | 35.0 | 2.0 | Northwest | W | 6.0 |
| 6 | Jerami Grant | 2016.0 | 22.0 | OKC | PF | 78.0 | 4.0 | 19.1 | 0.469 | 0.6 | ... | 1.8 | 5.4 | 81 | 220 | 47.0 | 35.0 | 2.0 | Northwest | W | 6.0 |
| 7 | Josh Huestis | 2016.0 | 25.0 | OKC | PF | 2.0 | 0.0 | 15.5 | 0.545 | 1.0 | ... | 0.0 | 7.0 | 79 | 230 | 47.0 | 35.0 | 2.0 | Northwest | W | 6.0 |
| 8 | Ersan Ilyasova | 2016.0 | 29.0 | OKC | PF | 3.0 | 0.0 | 20.7 | 0.375 | 1.0 | ... | 1.7 | 5.0 | 82 | 235 | 47.0 | 35.0 | 2.0 | Northwest | W | 6.0 |
| 9 | Enes Kanter | 2016.0 | 24.0 | OKC | C | 72.0 | 0.0 | 21.3 | 0.545 | 0.1 | ... | 2.1 | 14.3 | 83 | 250 | 47.0 | 35.0 | 2.0 | Northwest | W | 6.0 |


In [193]:
# Convert to CSV file
final_df.to_csv("/home/jovyan/Data301Winter2019/project/player.csv", index=False)