# Set-up
I built my project around two Kaggle datasets (["NBA Players stats since 1950"](https://www.kaggle.com/drgilermo/nba-players-stats) and ["NBA All Star Game 2000-2016"](https://www.kaggle.com/fmejia21/nba-all-star-game-20002016)). I then scraped sites including stats.NBA.com as part of my feature engineering efforts. Below, I've pieced together my workflow as best I can, hopefully in a way that anyone reading it can reasonably follow.

# Imports

In [1]:
import pandas as pd
import numpy as np
import time
import requests
import re
from bs4 import BeautifulSoup
import pickle
from tqdm import tqdm
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import os
import nba_data_functions as nbad
from fake_useragent import UserAgent

ua = UserAgent()
user_agent = {'User-agent': ua.random}
chromedriver = "/Applications/chromedriver" # path to the chromedriver executable
os.environ["webdriver.chrome.driver"] = chromedriver

# Building main dataset
Below is a trimmed outline of the functions, scrapes, calls, etc., that I used to build my final dataset. A note on my target variable: each player-season (datapoint) contains data current as of the end of that season, and my target was that player's All-Star status the *following* season, so I needed to add a column holding, essentially, the "All-Star?" value from his following season. I handled this in my nbad.get_main_data() function.

In [2]:
# Reading in the two Kaggle datasets
all_stars2000_2016 = pd.read_csv("NBA All Stars 2000-2016 - Sheet1.csv")
master_df = pd.read_csv("Seasons_Stats.csv")

In [3]:
# Scraping and building a dictionary including all NBA teams active between 1999-2000 and today, TV abbreviations and TV market data from 2000-01 and 2001-02.
full_team_dict = nbad.get_teams_dict()

Request status code: 200
Request status code: 200
Request status code: 200
100%|██████████| 37/37 [00:00<00:00, 121336.39it/s]


In [5]:
# Combining all data so far and cleaning (see comments in function)
reindexed_master_df = nbad.get_main_data(master_df, all_stars2000_2016, full_team_dict)

In [7]:
# Adding columns/features reflecting core counting stats per game, and per game relative to that of all players active that season, plus the actual season average in case I need it
reindexed_master_df = nbad.per_game_rel_to_season(reindexed_master_df, ["AST", "PTS", "ORB", "DRB", "TRB", "STL", "BLK", "MP", "3P", "FT", "FTA"])
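
As a rough illustration of what a step like this can look like, here is a minimal pandas sketch. It is not the code inside `nbad.per_game_rel_to_season`; the helper name `add_per_game_relative`, the two new column names, and the choice to express "relative" as a ratio to the season mean are all assumptions made for the example. Only the `"<stat>/game"` naming and the `G` and `Year` columns come from the dataset as used elsewhere in this notebook.

```python
import pandas as pd

def add_per_game_relative(df, stats, games_col="G", season_col="Year"):
    """Hypothetical helper: per-game stats, season averages, and their ratio."""
    out = df.copy()
    for stat in stats:
        per_game = f"{stat}/game"
        season_avg = f"{stat}/game season avg"    # assumed column name
        relative = f"{stat}/game rel. to season"  # assumed column name
        # Per-game value for each player-season
        out[per_game] = out[stat] / out[games_col]
        # Mean per-game value across all players active that season
        out[season_avg] = out.groupby(season_col)[per_game].transform("mean")
        # Assumption: "relative" means the ratio to the season average
        out[relative] = out[per_game] / out[season_avg]
    return out

# Toy data, not real player stats
toy = pd.DataFrame({
    "Player": ["A", "B", "A", "B"],
    "Year": [2000, 2000, 2001, 2001],
    "G": [10, 10, 20, 20],
    "PTS": [100, 300, 400, 400],
})
result = add_per_game_relative(toy, ["PTS"])
```

Using `groupby(...).transform("mean")` keeps the season average aligned row by row with the original frame, so no merge is needed afterward.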


In [8]:
# Adding a column that reflects a player's "Years from prime"
reindexed_master_df["Years from prime"] = (27 - reindexed_master_df["Age"])

In [None]:
# Re-sorting to ensure order needed for next steps (probably could be reordered but at this stage I don't want to cause issues right before submission)
ordered_df = reindexed_master_df.sort_values(by=["Player", "Year"])

## Feature engineering spotlight: "Adjusted TV market value"
Here, I wanted to quantify the boost to a player's public profile and All-Star resume by way of TV market size. The formula I came up with, applied in the function called below, is as follows:

- Each player-season has a "TV media market" value, originally reported by Nielsen, with noted exceptions:
    - I used the mean market size to replace the NaNs assigned to the two Canadian cities in my dataset
    - Players who were traded were assigned, in my earlier cleaning, the team they played the most games for that season
- For each season a player is active, he accrues market value in the following weighted increments:
    - Rookie season: "TV market size" * 0.75
    - Year 2: "TV market size"
    - Years 3+: "TV market size" * (1.1 ** n), where n is the number of seasons played since year 2
- I decided that more years in the league shouldn't lessen a player's value, so, in a pretty reductive choice, I used each player's max "TV market value" through the year in question and from there forward until he moved to an even larger market (e.g., when Andrew Bynum was traded from LA to Philly he kept the larger of the two, the LA "TV market value," through the rest of his strangely short career)
- Finally, I divided the accrued total by the number of years played to that point
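
The steps above can be sketched for a single player as follows. This is a minimal sketch, not the body of `nbad.tv_market_cumulative`: the function name `adjusted_tv_market_values` is hypothetical, the input is assumed to be one player's market sizes in chronological order, and I'm assuming the "max so far" rule is applied to the market size before the weighting.

```python
def adjusted_tv_market_values(market_sizes):
    """Hypothetical helper: one adjusted value per season for a single player.

    market_sizes: the player's TV market size for each season, oldest first.
    """
    adjusted = []
    running_max = 0.0  # largest market the player has been in so far
    accrued = 0.0      # weighted market value accumulated so far
    for season_idx, size in enumerate(market_sizes):
        # A move to a smaller market never lowers the value used
        running_max = max(running_max, size)
        if season_idx == 0:
            weight = 0.75                     # rookie season
        elif season_idx == 1:
            weight = 1.0                      # year 2
        else:
            weight = 1.1 ** (season_idx - 1)  # n = seasons played since year 2
        accrued += running_max * weight
        # Divide by the number of years played to that point
        adjusted.append(accrued / (season_idx + 1))
    return adjusted

# Toy market sizes: two seasons in a market of 100, then a move to a market of 50
example = adjusted_tv_market_values([100.0, 100.0, 50.0])
# roughly [75.0, 87.5, 95.0]
```

In the toy example the third season still uses the larger market of 100, which mirrors the Bynum case described above.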

In [9]:
ordered_df = nbad.tv_market_cumulative(ordered_df)

  0%|          | 1/7031 [00:00<00:12, 585.14it/s]


NameError: name 'reindexed_master_df' is not defined

In [None]:
# And then interacting it with a few potentially relevant features
ordered_df["Adjusted TV market value * MP"] = ordered_df["Adjusted TV market value"] * ordered_df["MP/game"]
ordered_df["Adjusted TV market value * GS"] = ordered_df["Adjusted TV market value"] * ordered_df["GS"]

## Feature engineering spotlight: "Trajectory"
Here, I wanted to isolate a player's position on his growth curve to quantify the extent to which he will be better, worse or neither the following season: generally, but as measured in a specific area. I used what is likely the most All-Star-correlated stat (PPG) in my function, but it could be updated to calculate the value based on growth in other areas, or perhaps even a weighted mean of all the features used in the final model. The formula I came up with, applied in the function called below, is as follows:

- Unless the player is a rookie, each player-season has a "Points per game" value that is greater than, less than or equal to the value from his previous season
- For each guy in year 2+, I assigned a value reflecting the expected change in PPG using this system:
    - Year 2: Current year's "Points per game" - last year's "Points per game"
    - Years 3+: The change in the *change* in "Points per game", i.e. (current year's "Points per game" - last year's "Points per game") minus (last year's "Points per game" - the "Points per game" from the year before that)
        - If the resulting value was above 5 or below -5, I multiplied the value by 1.25 or -1.25 respectively to give greater weight to what I'd quantified as an exceptional trend in player growth

One major and problematic decision I made here was to assign players with no previous season to measure against (aka rookies) a "Trajectory" value of 0. In reality, while there is likely more variance, growth between year 1 and year 2 can be expected perhaps more than in any other year-to-year transition. I'll additionally note that, like the math in my "Adjusted TV market value" calculus, this math is inarguably wonky and could be improved with time.
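
For one player's PPG history, the rules above might look like the sketch below. It is not the code in `nbad.get_trajectory`; the name `trajectory_values` and its parameters are hypothetical. One assumption to flag: I read the 1.25 / -1.25 step as scaling the magnitude by 1.25 while keeping the sign, so a big drop becomes a bigger drop rather than flipping to a gain.

```python
def trajectory_values(ppg_by_season, threshold=5.0, boost=1.25):
    """Hypothetical helper: one "Trajectory" value per season for a single player.

    ppg_by_season: the player's points per game for each season, oldest first.
    """
    values = []
    for i, ppg in enumerate(ppg_by_season):
        if i == 0:
            # Rookies have no previous season to measure against
            value = 0.0
        elif i == 1:
            # Year 2: simple year-over-year change
            value = ppg - ppg_by_season[0]
        else:
            # Years 3+: the change in the change
            latest_change = ppg - ppg_by_season[i - 1]
            previous_change = ppg_by_season[i - 1] - ppg_by_season[i - 2]
            value = latest_change - previous_change
            # Assumption: exceptional swings are amplified, sign preserved
            if abs(value) > threshold:
                value *= boost
        values.append(value)
    return values

# Toy PPG history, not a real player
example = trajectory_values([10.0, 12.0, 20.0, 21.0])
# [0.0, 2.0, 7.5, -8.75]
```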

In [None]:
ordered_df = nbad.get_trajectory(ordered_df)

## Removing injury replacement All-Stars
Later than I would have liked, I decided that a guy named an All-Star as a replacement for an injured All-Star was made an All-Star by an alternative process, and not the one I was attempting to build my model around.

In [None]:
ordered_df = nbad.get_adjusted_all_star_games(ordered_df)

# And then manually adding a column called "Adjusted All-Star next season?"
next_year_adj_column = []
for i in tqdm(range(ordered_df.shape[0]-1)):
    current_guy = ordered_df.iloc[i]["Player"]
    next_guy = ordered_df.iloc[i+1]["Player"]
    if current_guy == next_guy:
        if ordered_df.iloc[i+1]["Adjusted All-Star?"] == 1:
            next_year_adj_column.append(1)
        else:
            next_year_adj_column.append(0)
    else:
        next_year_adj_column.append(0)

# Because last player in his last year (like Big Z) can't make All-Star Game next year
next_year_adj_column.append(0)
ordered_df["Adjusted All-Star next season?"] = next_year_adj_column
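
The same "next season" column can also be built without a row-by-row loop. The sketch below uses a toy frame rather than the real `ordered_df`, and it assumes the rows are already sorted by player and year, as they are above. Like the loop, it treats a player's next row as his following season, even if he skipped a year in between.

```python
import pandas as pd

# Toy data, not real All-Star results
toy = pd.DataFrame({
    "Player": ["A", "A", "A", "B", "B"],
    "Year": [2001, 2002, 2003, 2001, 2002],
    "Adjusted All-Star?": [0, 1, 0, 1, 1],
}).sort_values(by=["Player", "Year"])

# Pull each player's next-row flag up one row; a player's last season gets NaN
next_flag = toy.groupby("Player")["Adjusted All-Star?"].shift(-1)

# NaN (no following season in the data) compares unequal to 1, so it becomes 0
toy["Adjusted All-Star next season?"] = next_flag.eq(1).astype(int)
```

Grouping by player before shifting is what keeps one player's final season from picking up the next player's first season.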

## More feature engineering

In [None]:
# Adding a few more features built off existing features
ordered_df["GS/G"] = ordered_df["GS"] / ordered_df["G"]
ordered_df["PTS+AST/game"] = ordered_df["PTS/game"] + ordered_df["AST/game"]
ordered_df["Years from prime ^ 2"] = ordered_df["Years from prime"]**2

In [None]:
# Making a column with values representing how many past All-Star Games a player has participated in and then making dummy columns from it
ordered_df = nbad.get_past_all_star_games(ordered_df)
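# get_dummies returns its columns in sorted order, so attach them by name rather than relying on the order of unique()
past_asg_dummies = pd.get_dummies(ordered_df["Past All-Star Games (incl this season)"])
ordered_df = pd.concat([ordered_df, past_asg_dummies], axis=1)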

In [None]:
# Collecting clutch stats to add as columns via scraping
list_of_seasons = ['1999-00', '2000-01', '2001-02', '2002-03', '2003-04', '2004-05', '2005-06', '2006-07', '2007-08', '2008-09', '2009-10', '2010-11', '2011-12', '2012-13', '2013-14', '2014-15', '2015-16', '2016-17', '2017-18', '2018-19']
clutch_stats_df = nbad.get_clutch_stats(list_of_seasons)
ordered_df = nbad.get_clutch(ordered_df)

# Pickling dataset

In [None]:
with open("updated_df.pickle", "wb") as to_write:
    pickle.dump(ordered_df, to_write)
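
For completeness, here is a small round-trip sketch showing how a pickled object can be read back in a later session. The `payload` dictionary and the temporary directory are stand-ins for illustration only; in the notebook the object would be `ordered_df` and the file would sit next to the notebook.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be ordered_df
payload = {"rows": 3, "columns": ["Player", "Year"]}

# Write to a throwaway directory so the example leaves the project folder alone
path = os.path.join(tempfile.mkdtemp(), "updated_df.pickle")
with open(path, "wb") as to_write:
    pickle.dump(payload, to_write)

# Read it back, e.g. at the top of a modeling notebook
with open(path, "rb") as to_read:
    restored = pickle.load(to_read)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.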