# NBA Injuries
***
## Goal: 
Build a model to predict the probability that a player misses a game due to injury within a given time frame.

## Approach:

### Part I: Data Preparation
Tasks:

1. Scrape injury history data from Pro Sports Transactions using Beautiful Soup
2. Scrape player statistics and information from NBA Stats using Beautiful Soup and Selenium and/or nba-api
3. Clean datasets
4. Merge the two datasets


***

Our data comes from multiple sources and must be compiled into a single dataset before we can train our model(s).

The injury dataset and the yearly player bios have already been scraped from prosportstransactions.com and nba.com, respectively.
Next, we need to gather game data for each player with an injury.
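Gathering game logs for every injured player means one request per player and season, so it helps to pause between calls and record failures instead of stopping at the first one. The following is a minimal sketch, not the notebook's actual collection code: `collect_game_logs`, `fetch`, and `pause_seconds` are hypothetical names, and the pause length is an assumption. In practice `fetch` could wrap a call such as `endpoints.PlayerGameLog(player_id=..., season=...).get_data_frames()[0]`.

```python
import time


def collect_game_logs(player_ids, seasons, fetch, pause_seconds=0.6):
    """Call fetch(player_id, season) for every pair and collect the results.

    Returns (logs, failed): logs maps (player_id, season) to whatever fetch
    returned, and failed lists the pairs whose call raised an exception.
    """
    logs = {}
    failed = []
    for player_id in player_ids:
        for season in seasons:
            try:
                logs[(player_id, season)] = fetch(player_id, season)
            except Exception:
                # Keep going; failed pairs can be retried later.
                failed.append((player_id, season))
            # Pause between requests (the right value is an assumption).
            time.sleep(pause_seconds)
    return logs, failed


# Stand-in fetch function so the sketch runs without network access.
def fake_fetch(player_id, season):
    return {"player_id": player_id, "season": season}


demo_logs, demo_failed = collect_game_logs(
    [201939], ["2012-13"], fake_fetch, pause_seconds=0
)
```

Passing the fetch function in as an argument keeps the loop testable offline and makes it easy to swap in a retry wrapper later.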

My initial vision for the final dataset:

| Player Name/ID | Date of Injury | Injury Type | Repeat Injury? | Contact vs Non-contact | Minutes played in injury game | Minutes played in last n games | Usage rating in last n games | No. games in last n days | Travel time in last n days | Hours since last appearance in game |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
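Several of the proposed columns are rolling-window features over a player's game log. The sketch below shows one way they might be computed with pandas on a made-up log for a single player. The column names (`player_id`, `game_date`, `minutes`) and the window sizes are assumptions for illustration, not the final schema.

```python
import pandas as pd

# Toy game log for one hypothetical player; values are invented.
games = pd.DataFrame({
    "player_id": [1, 1, 1, 1, 1],
    "game_date": pd.to_datetime(
        ["2021-01-01", "2021-01-03", "2021-01-04", "2021-01-08", "2021-01-09"]
    ),
    "minutes": [30, 35, 28, 40, 33],
}).sort_values(["player_id", "game_date"]).reset_index(drop=True)

n_games = 3  # window for "last n games" (assumed)
n_days = 7   # window for "last n days" (assumed)

# Minutes played over the previous n games, excluding the current game.
games["min_last_n_games"] = games.groupby("player_id")["minutes"].transform(
    lambda s: s.shift(1).rolling(n_games, min_periods=1).sum()
)

# Number of games in the trailing n-day window, including the current game.
# The frame is sorted by player and date, so the values line up row by row.
games["games_last_n_days"] = (
    games.groupby("player_id")
    .rolling(f"{n_days}D", on="game_date")["minutes"]
    .count()
    .to_numpy()
)

# Hours since the previous appearance. Dates have day resolution here,
# so this is only an approximation of the true rest time.
games["hours_since_last_game"] = (
    games.groupby("player_id")["game_date"].diff().dt.total_seconds() / 3600
)
```

The first row of `min_last_n_games` and `hours_since_last_game` is missing because the player has no earlier game in the toy log.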

In [1]:
import numpy as np
import pandas as pd

In [2]:
bio1213 = pd.read_csv('data/bios2012-13.csv')
bio1213.head()

Unnamed: 0,PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_ABBREVIATION,AGE,PLAYER_HEIGHT,PLAYER_HEIGHT_INCHES,PLAYER_WEIGHT,COLLEGE,COUNTRY,...,GP,PTS,REB,AST,NET_RATING,OREB_PCT,DREB_PCT,USG_PCT,TS_PCT,AST_PCT
0,203932,Aaron Gordon,1610612743,DEN,25.0,6-8,80,235,Arizona,USA,...,50,618,284,161,2.1,0.055,0.15,0.204,0.547,0.165
1,1628988,Aaron Holiday,1610612754,IND,24.0,6-0,72,185,UCLA,USA,...,66,475,89,123,-0.2,0.012,0.06,0.189,0.503,0.139
2,1630174,Aaron Nesmith,1610612738,BOS,21.0,6-5,77,215,Vanderbilt,USA,...,46,218,127,23,-0.5,0.041,0.146,0.133,0.573,0.047
3,1627846,Abdel Nader,1610612756,PHX,27.0,6-5,77,225,Iowa State,Egypt,...,24,160,62,19,5.0,0.02,0.151,0.183,0.605,0.078
4,1629690,Adam Mokoka,1610612741,CHI,22.0,6-4,76,190,,France,...,14,15,5,5,-7.1,0.017,0.077,0.171,0.386,0.179


In [3]:
import nba_api.stats.static.players as players
from nba_api.stats import endpoints


In [4]:
gamelog = endpoints.LeagueGameLog().get_data_frames()[0]
gamelog.head()

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,FGM,...,DREB,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,VIDEO_AVAILABLE
0,22020,1610612746,LAC,LA Clippers,22000002,2020-12-22,LAC @ LAL,W,240,44,...,29,40,22,10,3,16,29,116,7,1
1,22020,1610612747,LAL,Los Angeles Lakers,22000002,2020-12-22,LAL vs. LAC,L,240,38,...,37,45,22,4,2,19,20,109,-7,1
2,22020,1610612751,BKN,Brooklyn Nets,22000001,2020-12-22,BKN vs. GSW,W,240,42,...,44,57,24,11,7,20,22,125,26,1
3,22020,1610612744,GSW,Golden State Warriors,22000001,2020-12-22,GSW @ BKN,L,240,37,...,34,47,26,6,6,18,24,99,-26,1
4,22020,1610612755,PHI,Philadelphia 76ers,22000013,2020-12-23,PHI vs. WAS,W,240,41,...,37,47,22,11,8,18,25,113,6,1


In [5]:
# GSW @ BKN on 2020-12-22: GAME_ID 22000001 in the game log, zero-padded to 10 digits
gsw_bkn_id = '0022000001'
play_by_play = endpoints.PlayByPlay(game_id=gsw_bkn_id).get_data_frames()[0]

What if we matched each injury date to a game and then looked at that game's play-by-play data?
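One possible way to do that matching is `pd.merge_asof`, which joins each injury to the most recent game on or before the injury date for the same player. This is a sketch on invented data: the frames, column names, game IDs, and the three-day tolerance are all assumptions rather than values from the real datasets.

```python
import pandas as pd

# Toy injury records (invented).
injuries_toy = pd.DataFrame({
    "player_id": [1, 2],
    "injury_date": pd.to_datetime(["2021-01-05", "2021-01-02"]),
})

# Toy game log (invented); game_id values are placeholders.
games_toy = pd.DataFrame({
    "player_id": [1, 1, 2, 2],
    "game_date": pd.to_datetime(
        ["2021-01-01", "2021-01-04", "2021-01-02", "2021-01-06"]
    ),
    "game_id": ["A", "B", "C", "D"],
})

# merge_asof requires both frames to be sorted by their date keys.
matched = pd.merge_asof(
    injuries_toy.sort_values("injury_date"),
    games_toy.sort_values("game_date"),
    left_on="injury_date",
    right_on="game_date",
    by="player_id",                    # only match games by the same player
    direction="backward",              # latest game on or before the injury
    tolerance=pd.Timedelta(days=3),    # assumed cutoff; older games are ignored
)
```

Each matched `game_id` could then be passed to `endpoints.PlayByPlay(game_id=...)` to inspect the game in which the injury may have occurred. Injuries with no game inside the tolerance window get missing values, which is a useful flag for practice or off-court injuries.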

In [7]:
injuries = pd.read_csv('data/injuries.csv')

In [8]:
steph_games2013 = endpoints.PlayerGameLog(player_id=201939, season='2013').get_data_frames()[0]
steph_injuries = injuries[injuries['Player'] == 'Stephen Curry']

The first step is to find the player ID for each player in the injuries dataset.

In [53]:
# Get df of every NBA player ever
all_players = players.get_players()
players_df = pd.DataFrame(all_players)
len(players_df['full_name'].unique())

4465

The dataframe contains 4465 unique player names.

In [34]:
unique_players = injuries['Player'].unique()
# Count number of unique players in injuries database
len(unique_players)

823

In [36]:
# Count rows in players_df whose full_name appears in the injuries dataset
len(players_df.loc[players_df['full_name'].isin(unique_players)])

736

The injuries dataset contains 823 unique player names, but only 736 rows in the players dataframe match one of them.
The gap is most likely due to alternate names, spellings, and nicknames.
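To see which names fail to match and why, it can help to compare normalized versions of the names. The sketch below uses only the standard library; `normalize_name` is a hypothetical helper, and the example names are made up to illustrate punctuation and accent differences rather than taken from either dataset.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase a name, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    kept = "".join(ch for ch in no_accents if ch.isalnum() or ch.isspace())
    return " ".join(kept.lower().split())


# Illustrative inputs only (assumed spellings, not real rows).
injury_names = ["JJ Barea", "Nikola Jokic", "Made Up Player"]
api_names = ["J.J. Barea", "Nikola Jokić"]

# Map normalized name -> original spelling in the players data.
api_lookup = {normalize_name(name): name for name in api_names}

# Names that still have no counterpart after normalization.
unmatched = [name for name in injury_names if normalize_name(name) not in api_lookup]
```

Names left in `unmatched` after normalization are the ones that probably need manual review, for example nicknames or legal name changes.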

In [21]:
# The injuries dataset lists name variations for a player as a single string separated by '/'.
# Split each string into its variants so we can search for all of them in the NBA dataset.
split_injured_players = []
for player in unique_players:
    # Strip surrounding whitespace so 'A / B' and 'A/B' both yield ['A', 'B']
    split_injured_players.extend(name.strip() for name in player.split('/'))
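A flat list of variants loses the link back to the original string in the injuries dataset, which is needed later to attach a player ID to each injury row. One option, sketched here with placeholder strings, is to keep a dictionary from each variant to the raw name it came from. The names `raw_names` and `variant_to_raw` are hypothetical.

```python
# Placeholder strings in the '/'-separated format described above.
raw_names = ["First Variant / Second Variant", "Single Name"]

# Map each individual variant back to the raw string it was split from.
variant_to_raw = {}
for raw in raw_names:
    for variant in raw.split("/"):
        variant = variant.strip()
        if variant:  # skip empty pieces from stray separators
            variant_to_raw[variant] = raw
```

After filtering the players dataframe on the variants, `variant_to_raw` can translate each matched `full_name` back to the key used in the injuries dataset.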

In [30]:
injured_df = players_df[players_df['full_name'].isin(split_injured_players)]
injured_df['full_name'].unique()

array(['Alex Abrines', 'Quincy Acy', 'Jaylen Adams', 'Jordan Adams',
       'Steven Adams', 'Bam Adebayo', 'Arron Afflalo', 'Alexis Ajinca',
       'Furkan Aldemir', 'Cole Aldrich', 'LaMarcus Aldridge',
       'Cliff Alexander', 'Kyle Alexander', 'Nickeil Alexander-Walker',
       'Grayson Allen', 'Jarrett Allen', 'Kadeem Allen', 'Lavoy Allen',
       'Ray Allen', 'Tony Allen', 'Al-Farouq Aminu', 'Lou Amundson',
       'Chris Andersen', 'Alan Anderson', 'James Anderson',
       'Justin Anderson', 'Kyle Anderson', 'Ryan Anderson',
       'Ike Anigbogu', 'Giannis Antetokounmpo', 'Kostas Antetokounmpo',
       'Carmelo Anthony', 'Joel Anthony', 'Pero Antic',
       'Ryan Arcidiacono', 'Trevor Ariza', 'Darrell Arthur', 'Omer Asik',
       'Gustavo Ayon', 'Deandre Ayton', 'Luke Babbitt', 'Dwayne Bacon',
       'Marvin Bagley III', 'Cameron Bairstow', 'Ron Baker', 'Lonzo Ball',
       'Mo Bamba', 'Leandro Barbosa', 'J.J. Barea', 'Andrea Bargnani',
       'Harrison Barnes', 'Matt Barnes', 'Wi