# NBA Injuries
***
## Goal: 
Build a model to predict the probability that a player misses a game due to injury within a given time frame.

## Approach:

### Part I: Data Preparation
Tasks:

1. Scrape injury history data from Pro Sports Transactions using Beautiful Soup
2. Scrape player statistics and information from NBA Stats using Beautiful Soup and Selenium and/or nba-api
3. Clean datasets
4. Merge the two datasets


***

Our data comes from multiple sources and needs to be compiled into a single dataset before we can train our model(s).

The injury dataset and yearly bios have already been scraped from prosportstransactions.com and nba.com, respectively.
Now we need to gather game data for each player with an injury.

My initial vision for the final dataset:
________________________________________________________________
Player Name/ID | Date of Injury | Injury Type | Repeat Injury? | Contact vs Non-contact | Minutes played in injury game | Minutes played in last n games | Usage rating in last n games | No. games in last n days | Travel time in last n days | Hours since last appearance in game
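
To make a couple of these workload columns concrete, here is a minimal sketch of how "minutes played in last n games" and "hours since last appearance" could be derived from a per-player game log. The frame `games`, its column names, and the window size `n` are all assumptions made for illustration; the values are toy data, not real game logs.

```python
import pandas as pd

# Toy game log (hypothetical values); real rows would come from NBA game logs.
games = pd.DataFrame({
    "player_id": [1, 1, 1, 1, 2, 2],
    "game_date": pd.to_datetime([
        "2020-01-01", "2020-01-03", "2020-01-04", "2020-01-07",
        "2020-01-02", "2020-01-05",
    ]),
    "minutes": [30, 35, 28, 40, 20, 25],
})

n = 2  # window size, chosen arbitrarily for illustration

games = games.sort_values(["player_id", "game_date"])

# Minutes in the previous n games, excluding the current game:
# shift(1) drops the current game, rolling(n) sums the games before it.
games["min_last_n"] = (
    games.groupby("player_id")["minutes"]
    .transform(lambda s: s.shift(1).rolling(n, min_periods=1).sum())
)

# Rest since the player's previous appearance, in hours.
games["hours_since_last"] = (
    games.groupby("player_id")["game_date"].diff().dt.total_seconds() / 3600
)

print(games)
```

The first game for each player has no history, so both derived columns are missing there. How to treat those rows (drop, impute, or flag) is a modeling decision left open here.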

In [164]:
import numpy as np
import pandas as pd

In [165]:
bio1213 = pd.read_csv('data/bios2012-13.csv')
bio1213.head()

Unnamed: 0,PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_ABBREVIATION,AGE,PLAYER_HEIGHT,PLAYER_HEIGHT_INCHES,PLAYER_WEIGHT,COLLEGE,COUNTRY,...,GP,PTS,REB,AST,NET_RATING,OREB_PCT,DREB_PCT,USG_PCT,TS_PCT,AST_PCT
0,203932,Aaron Gordon,1610612743,DEN,25.0,6-8,80,235,Arizona,USA,...,50,618,284,161,2.1,0.055,0.15,0.204,0.547,0.165
1,1628988,Aaron Holiday,1610612754,IND,24.0,6-0,72,185,UCLA,USA,...,66,475,89,123,-0.2,0.012,0.06,0.189,0.503,0.139
2,1630174,Aaron Nesmith,1610612738,BOS,21.0,6-5,77,215,Vanderbilt,USA,...,46,218,127,23,-0.5,0.041,0.146,0.133,0.573,0.047
3,1627846,Abdel Nader,1610612756,PHX,27.0,6-5,77,225,Iowa State,Egypt,...,24,160,62,19,5.0,0.02,0.151,0.183,0.605,0.078
4,1629690,Adam Mokoka,1610612741,CHI,22.0,6-4,76,190,,France,...,14,15,5,5,-7.1,0.017,0.077,0.171,0.386,0.179


In [166]:
import nba_api.stats.static.players as players
from nba_api.stats import endpoints


In [167]:
gamelog = endpoints.LeagueGameLog().get_data_frames()[0]
gamelog.head()

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,FGM,...,DREB,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,VIDEO_AVAILABLE
0,22020,1610612746,LAC,LA Clippers,22000002,2020-12-22,LAC @ LAL,W,240,44,...,29,40,22,10,3,16,29,116,7,1
1,22020,1610612747,LAL,Los Angeles Lakers,22000002,2020-12-22,LAL vs. LAC,L,240,38,...,37,45,22,4,2,19,20,109,-7,1
2,22020,1610612744,GSW,Golden State Warriors,22000001,2020-12-22,GSW @ BKN,L,240,37,...,34,47,26,6,6,18,24,99,-26,1
3,22020,1610612751,BKN,Brooklyn Nets,22000001,2020-12-22,BKN vs. GSW,W,240,42,...,44,57,24,11,7,20,22,125,26,1
4,22020,1610612755,PHI,Philadelphia 76ers,22000013,2020-12-23,PHI vs. WAS,W,240,41,...,37,47,22,11,8,18,25,113,6,1


What if we took the injury dates, matched each one to a game, and looked at the play-by-play?
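
One possible way to do the date-to-game matching is `pd.merge_asof`, which pairs each injury with the most recent game on or before the injury date for the same player. This is only a sketch on toy data: the frames `injuries_toy` and `games_toy` and their column names are assumptions, and the real injury dates may need more careful handling (for example, an injury reported the day after the game).

```python
import pandas as pd

# Toy frames (hypothetical values and column names).
injuries_toy = pd.DataFrame({
    "player_id": [1, 2],
    "injury_date": pd.to_datetime(["2020-01-05", "2020-01-03"]),
})
games_toy = pd.DataFrame({
    "player_id": [1, 1, 2, 2],
    "game_date": pd.to_datetime(
        ["2020-01-01", "2020-01-04", "2020-01-03", "2020-01-06"]
    ),
    "game_id": ["G1", "G2", "G3", "G4"],
})

# merge_asof requires both frames to be sorted by their "on" keys.
matched = pd.merge_asof(
    injuries_toy.sort_values("injury_date"),
    games_toy.sort_values("game_date"),
    left_on="injury_date",
    right_on="game_date",
    by="player_id",          # only match games played by the same player
    direction="backward",    # latest game on or before the injury date
)

print(matched)
```

The resulting `game_id` could then be used to look up play-by-play data for that game.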

In [168]:
injuries = pd.read_csv('data/injuries.csv')

The first step is to find the player IDs for each player in the injuries dataset.

In [169]:
# Get df of every NBA player ever
all_players = players.get_players()
players_df = pd.DataFrame(all_players)
len(players_df['full_name'].unique())

4465

We have a total of 4465 unique player names in our dataframe.

In [170]:
unique_players = injuries['Player'].unique()
# Count number of unique players in injuries database
len(unique_players)

823

In [171]:
# Count number of matches between unique_players and players_df 
len(players_df.loc[players_df['full_name'].isin(unique_players)])

736

There are 823 unique players in the injuries dataset, but only 736 match a name in the players dataset.
This is most likely due to alternate names, spellings, or nicknames.
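
A quick way to check that hunch is to list the names that fail to match and see how many contain a `/` separator. The sketch below uses two small toy Series (`injury_names`, `official_names`) standing in for `injuries['Player']` and `players_df['full_name']`; the values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the two name columns (hypothetical values).
injury_names = pd.Series(
    ["Derrick Rose", "Emanuel Ginobili / Manu Ginobili", "Mike Conley Jr"]
)
official_names = pd.Series(["Derrick Rose", "Manu Ginobili", "Mike Conley"])

# Names in the injuries data with no exact match in the official list.
unmatched = injury_names[~injury_names.isin(official_names)]

# Flag which unmatched entries list several name variants.
has_alias = unmatched.str.contains("/", regex=False)

print(unmatched.tolist())
print(has_alias.tolist())
```

Entries flagged by `has_alias` can be resolved by splitting on the separator; the remaining ones point to suffixes or spelling differences.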

In [172]:
# Injuries dataset lists multiple variations on player name separated by ' / ' as a single string
# Need to split into multiple strings so we can search for all variations in the NBA dataset

split_injured_players = []
for player in unique_players:
    # Split directly on the ' / ' separator instead of relying on runs of spaces
    for item in player.split(' / '):
        split_injured_players.append(item)

In [173]:
injured_players = players_df[players_df['full_name'].isin(split_injured_players)]
len(injured_players['full_name'].unique())

778

We're closer at 778 matches. We could try to track down the rest, but I think we're okay to move on for now.
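
If we did want to track down the rest, one option (not used in this notebook) is fuzzy matching with the standard library's `difflib.get_close_matches`. The sketch below is an assumption-laden illustration: the `official` list and the `cutoff` of 0.8 are made up for the example, and any suggestion would need a manual check before being trusted.

```python
from difflib import get_close_matches

# Hypothetical official spellings and leftover unmatched names (toy data).
official = ["Mike Conley", "Tony Parker", "DJ Augustin"]
leftover = ["Mike Conley Jr", "DJ Augustine", "Unknown Player"]

# For each leftover name, suggest at most one close official spelling.
suggestions = {
    name: get_close_matches(name, official, n=1, cutoff=0.8)
    for name in leftover
}

for name, candidates in suggestions.items():
    print(name, "->", candidates)
```

Names with an empty suggestion list have no sufficiently similar official spelling and would still need to be resolved by hand.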

Now we want to join the dataframes.
Reminder of what we want our final dataset to look like:

Player Name/ID | Date of Injury | Injury Type | Repeat Injury? | Contact vs Non-contact | Minutes played in injury game | Minutes played in last n games | Usage rating in last n games | No. games in last n days | Travel time in last n days | Hours since last appearance in game | Player bios

In [174]:
injuries[injuries['Player'].str.contains('/')]

Unnamed: 0,Date,Team,Player,Injury
3,2012-10-30,Knicks,Amare Stoudemire / Amar'e Stoudemire,arthroscopic surgery on left knee (out indefin...
8,2012-10-30,Spurs,Emanuel Ginobili / Manu Ginobili,back spasms (DTD)
16,2012-11-02,Magic,Maurice Harkless / Moe Harkless,surgery to repair hernia (DTD)
38,2012-11-09,Timberwolves,Jose Juan Barea / Jose Barea / J.J. Barea,sprained left foot (DNP)
41,2012-11-10,Jazz,Maurice Williams / Mo Williams,strained right abductor (DNP)
...,...,...,...,...
6969,2020-03-02,Hawks,Cameron Reddish / Cam Reddish,sore lower back (DTD)
6973,2020-03-04,Wizards,Ishmael Smith / Ish Smith,left hamstring injury (DTD)
6993,2020-03-10,Clippers,Louis Williams / Lou Williams,right calf injury (DTD)
7017,2020-07-29,Pacers,Domantas Sabonis / Domas Sabonis,left foot injury (out for season)


Before we can merge the dataframes, we need to deal with rows in the injuries df that list multiple names. We'll start by splitting the injuries df in two: one with multiple names, the other with just one.

In [175]:
mult_names = injuries[injuries['Player'].str.contains('/')]
one_name = injuries[~injuries['Player'].str.contains('/')]

In [176]:
# converting full_name series to dict for performance
player_names_dict = players_df['full_name'].to_dict()

In [177]:
mult_names1 = mult_names.copy()


In [178]:
def match_official_name(df, split_name_dict):
    '''
    Returns, for each row, the first name variation matching the official records
    If no match is found, 'NA'
    '''
    # Build a set once so each membership check is a fast lookup
    official = set(split_name_dict.values())
    official_names = []
    for names in df.Player.str.split(' / '):
        # Keep only the first matching variation so there is exactly one entry per row
        match = next((name for name in names if name in official), 'NA')
        official_names.append(match)

    return official_names

In [179]:
official_names = match_official_name(mult_names1, player_names_dict)
mult_names1['official'] = official_names


In [180]:
mult_names2 = mult_names1[mult_names1.official != 'NA'] \
    .drop(columns=['Player']) \
    .rename(columns={'official':'Player'})

In [181]:
injuries_official = pd.concat([one_name, mult_names2])
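
For comparison, the same name resolution can be sketched without an explicit loop, using `str.split` and `explode`. This is an alternative illustration on toy data, not the approach used above; the frame `injuries_toy` and the set `official` are assumptions made for the example.

```python
import pandas as pd

# Toy injuries frame and official name set (hypothetical values).
injuries_toy = pd.DataFrame({
    "Date": ["2012-10-30", "2012-10-30", "2012-11-02"],
    "Player": [
        "Derrick Rose",
        "Emanuel Ginobili / Manu Ginobili",
        "Made Up / Nobody Real",
    ],
})
official = {"Derrick Rose", "Manu Ginobili"}

# One row per name variant; explode keeps the original row index.
variants = injuries_toy.assign(
    Player=injuries_toy["Player"].str.split(" / ")
).explode("Player")

# Keep variants found in the official names, at most one per original row.
resolved = variants[variants["Player"].isin(official)]
resolved = resolved[~resolved.index.duplicated(keep="first")]

print(resolved)
```

Rows with no matching variant (index 2 here) simply drop out, which mirrors filtering out the 'NA' entries.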

In [186]:
# Removing periods from names for consistency
injuries_official.Player = injuries_official.Player.str.replace('.', '', regex=False)
players_df.full_name = players_df.full_name.str.replace('.', '', regex=False)

Now that the injuries dataset and the players dataset have matching names, we can work on collecting game data for each of these injuries.
But first, let's merge the two on player names.

In [187]:
merged_df = injuries_official.merge(players_df, how='left', left_on='Player', right_on='full_name')
merged_df

Unnamed: 0,Date,Team,Player,Injury,id,full_name,first_name,last_name,is_active
0,2012-10-30,Bulls,Derrick Rose,recovering from surgery on left knee to repair...,201565.0,Derrick Rose,Derrick,Rose,True
1,2012-10-30,Celtics,Darko Milicic,back spasms (DTD),2545.0,Darko Milicic,Darko,Milicic,False
2,2012-10-30,Clippers,Grant Hill,bone bruise in right knee (DTD),255.0,Grant Hill,Grant,Hill,False
3,2012-10-30,Knicks,Iman Shumpert,recovering from surgery on left knee to repair...,202697.0,Iman Shumpert,Iman,Shumpert,True
4,2012-10-30,Mavericks,Jared Cunningham,sprained thumb (DTD),203099.0,Jared Cunningham,Jared,Cunningham,False
...,...,...,...,...,...,...,...,...,...
7030,2020-03-02,Hawks,Cam Reddish,sore lower back (DTD),1629629.0,Cam Reddish,Cam,Reddish,True
7031,2020-03-04,Wizards,Ish Smith,left hamstring injury (DTD),202397.0,Ish Smith,Ish,Smith,True
7032,2020-03-10,Clippers,Lou Williams,right calf injury (DTD),101150.0,Lou Williams,Lou,Williams,True
7033,2020-07-29,Pacers,Domantas Sabonis,left foot injury (out for season),1627734.0,Domantas Sabonis,Domantas,Sabonis,True


Now we should make sure they merged correctly by checking for NAs.

In [184]:
unmerged = merged_df[merged_df.isna().any(axis=1)]
unmerged

Unnamed: 0,Date,Team,Player,Injury,id,full_name,first_name,last_name,is_active
20,2012-11-06,Bobcats,Gerald Henderson Jr,sprained left foot (out indefinitely),,,,,
96,2012-11-26,Grizzlies,Mike Conley Jr,flu (DNP),,,,,
107,2012-11-29,Spurs,(William) Tony Parker,rest (DNP),,,,,
184,2012-12-12,Bucks,Larry Sanders (b 1988-11-21),illness (DNP),,,,,
196,2012-12-14,Bucks,Mike Dunleavy Jr,bruised left knee (DNP),,,,,
...,...,...,...,...,...,...,...,...,...
6416,2019-12-23,Knicks,Marcus Morris,left Achilles injury (DTD),,,,,
6462,2020-01-10,Magic,DJ Augustine,bruised left knee (DTD),,,,,
6554,2020-02-12,Jazz,Mike Conley Jr,illness (DTD),,,,,
6623,2020-03-10,Grizzlies,Jontay Porter,right knee injury (DTD),,,,,
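
Another way to audit a left merge is the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column marking each row as `both` or `left_only`. The sketch below runs on toy frames; the second `id` value is a placeholder, not a real player ID.

```python
import pandas as pd

# Toy frames (the id 999 is a placeholder value).
left = pd.DataFrame({"Player": ["Derrick Rose", "Mike Conley Jr"]})
right = pd.DataFrame({
    "full_name": ["Derrick Rose", "Mike Conley"],
    "id": [201565, 999],
})

check = left.merge(
    right, how="left", left_on="Player", right_on="full_name", indicator=True
)

# Rows that found no partner in the right frame.
unmatched = check[check["_merge"] == "left_only"]

# Share of rows that matched.
match_rate = (check["_merge"] == "both").mean()

print(unmatched["Player"].tolist(), match_rate)
```

Unlike checking for NAs across all columns, this flags only rows that failed to match, even if other columns happen to contain missing values.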


Of the 7035 rows, only 276 were unmatched, so we should be okay dropping these entries. We can also drop the full_name, first_name, last_name, and is_active columns. Finally, we should drop the decimal from id and convert it to a string.
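
If those unmatched rows ever turn out to matter, many of them look recoverable with light name cleaning: suffixes such as "Jr" and parenthetical notes such as "(William)" or "(b 1988-11-21)". Below is a hedged sketch of a hypothetical helper, `normalize_name`, that is not part of this notebook. Whether the cleaned names actually match the official records is an assumption that would need checking.

```python
import re

def normalize_name(name: str) -> str:
    """Strip parenthetical notes, periods, and a trailing generational suffix."""
    name = re.sub(r"\([^)]*\)", "", name)   # drop "(William)", "(b 1988-11-21)"
    name = name.replace(".", "")            # "D.J." -> "DJ"
    name = re.sub(r"\s+(Jr|Sr|II|III|IV)$", "", name.strip())
    return re.sub(r"\s+", " ", name).strip()  # collapse leftover whitespace

examples = [
    "(William) Tony Parker",
    "Larry Sanders (b 1988-11-21)",
    "Mike Conley Jr",
]
print([normalize_name(n) for n in examples])
```

One caveat: the suffixes and birth dates are presumably there to tell apart players who share a name, so stripping them could map two different players to the same string. Any matches made this way deserve a manual check.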

In [191]:
merged_df = merged_df.dropna() \
    .drop(columns=['full_name', 'last_name', 'first_name', 'is_active'])

In [201]:
# Cast to int before str so the conversion does not depend on slicing off a trailing '.0'
merged_df.id = merged_df.id.astype(int).astype(str)
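
The IDs became floats only because the left merge introduced missing values. If we wanted to keep unmatched rows rather than drop them, pandas' nullable integer dtype is one way to get clean IDs anyway. A small sketch on a toy Series (the values are taken from the output above plus one missing entry):

```python
import pandas as pd

# Toy id column as it looks after a left merge: floats with a missing value.
ids = pd.Series([201565.0, 2545.0, None], name="id")

# Nullable integer dtype keeps missing values as <NA> instead of forcing floats.
ids_int = ids.astype("Int64")

# String form without the trailing ".0"; missing values stay missing.
ids_str = ids_int.astype("string")

print(ids_str.tolist())
```

This avoids slicing characters off the string form, which only works while every value ends in ".0".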

In [202]:
merged_df

Unnamed: 0,Date,Team,Player,Injury,id
0,2012-10-30,Bulls,Derrick Rose,recovering from surgery on left knee to repair...,201565
1,2012-10-30,Celtics,Darko Milicic,back spasms (DTD),2545
2,2012-10-30,Clippers,Grant Hill,bone bruise in right knee (DTD),255
3,2012-10-30,Knicks,Iman Shumpert,recovering from surgery on left knee to repair...,202697
4,2012-10-30,Mavericks,Jared Cunningham,sprained thumb (DTD),203099
...,...,...,...,...,...
7030,2020-03-02,Hawks,Cam Reddish,sore lower back (DTD),1629629
7031,2020-03-04,Wizards,Ish Smith,left hamstring injury (DTD),202397
7032,2020-03-10,Clippers,Lou Williams,right calf injury (DTD),101150
7033,2020-07-29,Pacers,Domantas Sabonis,left foot injury (out for season),1627734
