# NBA Injuries
***
## Goal: 
Build model to predict the probability of a player missing a game due to injury within a particular time frame

## Approach:

### Part I: Data Preparation
Tasks:

1. Scrape injury history data from Pro Sports Transactions using Beautiful Soup
2. Scrape player statistics and information from NBA Stats using Beautiful Soup and Selenium and/or nba-api
3. Clean datasets
4. Merge the two datasets


***

Our data is coming from multiple sources, which will need to be compiled into a single dataset before we can train our model(s).

Injury dataset and yearly bios have been already scraped from prosportstransactions.com and nba.com, respectively.
Now we need to gather game data for each player with an injury.

My initial vision for the final dataset:
________________________________________________________________
Player Name/ID | Date of Injury | Injury Type | Repeat Injury? | Contact vs Non-contact | Minutes played in injury game | Minutes played in last n games | Usage rating in last n games | No. games in last n days | Travel time in last n days | Hours since last appearance in game | 

In [66]:
import numpy as np
import pandas as pd

In [67]:
bio1213 = pd.read_csv('data/bios2012-13.csv')
bio1213.head()

Unnamed: 0,PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_ABBREVIATION,AGE,PLAYER_HEIGHT,PLAYER_HEIGHT_INCHES,PLAYER_WEIGHT,COLLEGE,COUNTRY,...,GP,PTS,REB,AST,NET_RATING,OREB_PCT,DREB_PCT,USG_PCT,TS_PCT,AST_PCT
0,203932,Aaron Gordon,1610612743,DEN,25.0,6-8,80,235,Arizona,USA,...,50,618,284,161,2.1,0.055,0.15,0.204,0.547,0.165
1,1628988,Aaron Holiday,1610612754,IND,24.0,6-0,72,185,UCLA,USA,...,66,475,89,123,-0.2,0.012,0.06,0.189,0.503,0.139
2,1630174,Aaron Nesmith,1610612738,BOS,21.0,6-5,77,215,Vanderbilt,USA,...,46,218,127,23,-0.5,0.041,0.146,0.133,0.573,0.047
3,1627846,Abdel Nader,1610612756,PHX,27.0,6-5,77,225,Iowa State,Egypt,...,24,160,62,19,5.0,0.02,0.151,0.183,0.605,0.078
4,1629690,Adam Mokoka,1610612741,CHI,22.0,6-4,76,190,,France,...,14,15,5,5,-7.1,0.017,0.077,0.171,0.386,0.179


In [68]:
import nba_api.stats.static.players as players
from nba_api.stats import endpoints


In [69]:
gamelog = endpoints.LeagueGameLog().get_data_frames()[0]
gamelog.head()

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,FGM,...,DREB,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,VIDEO_AVAILABLE
0,22020,1610612746,LAC,LA Clippers,22000002,2020-12-22,LAC @ LAL,W,240,44,...,29,40,22,10,3,16,29,116,7,1
1,22020,1610612747,LAL,Los Angeles Lakers,22000002,2020-12-22,LAL vs. LAC,L,240,38,...,37,45,22,4,2,19,20,109,-7,1
2,22020,1610612751,BKN,Brooklyn Nets,22000001,2020-12-22,BKN vs. GSW,W,240,42,...,44,57,24,11,7,20,22,125,26,1
3,22020,1610612744,GSW,Golden State Warriors,22000001,2020-12-22,GSW @ BKN,L,240,37,...,34,47,26,6,6,18,24,99,-26,1
4,22020,1610612764,WAS,Washington Wizards,22000013,2020-12-23,WAS @ PHI,L,240,39,...,35,40,28,7,4,20,26,107,-6,1


What if we took injury dates, matched them to game, looked at play by play?

In [70]:
injuries = pd.read_csv('data/injuries.csv')

First step is to find player id's for each player in the injuries dataset

In [71]:
# Get df of every NBA player ever
all_players = players.get_players()
players_df = pd.DataFrame(all_players)
len(players_df['full_name'].unique())

4465

We have a total of 4465 players in our dataframe

In [72]:
unique_players = injuries['Player'].unique()
# Count number of unique players in injuries database
len(unique_players)

823

In [42]:
# Count number of matches between unique_players and players_df 
len(players_df.loc[players_df['full_name'].isin(unique_players)])

736

There are 823 unique players in the injured database vs 736 matches in the players database.
This is most likely due to alternate names/spellings/nicknames.

In [43]:
# Injuries dataset lists multiple variations on player name separated by '/' as a single string
# Need to split into multiple strings so we can search for all variations in the NBA dataset

split_injured_players = []
for player in unique_players:
    split_player = player.replace('/', ' ').split('   ')
    for item in split_player:
        split_injured_players.append(item)

In [44]:
injured_players = players_df[players_df['full_name'].isin(split_injured_players)]
len(injured_players['full_name'].unique())

778

We're closer at 778 matches. We could try to track down the rest, but I think we're okay to move on for now.

Now we want to join the dataframes.
Reminder of what we want our final dataset to look like:

Player Name/ID | Date of Injury | Injury Type | Repeat Injury? | Contact vs Non-contact | Minutes played in injury game | Minutes played in last n games | Usage rating in last n games | No. games in last n days | Travel time in last n days | Hours since last appearance in game | Player bios

In [45]:
injuries[injuries['Player'].str.contains('/')]

Unnamed: 0,Date,Team,Player,Injury
3,2012-10-30,Knicks,Amare Stoudemire / Amar'e Stoudemire,arthroscopic surgery on left knee (out indefin...
8,2012-10-30,Spurs,Emanuel Ginobili / Manu Ginobili,back spasms (DTD)
16,2012-11-02,Magic,Maurice Harkless / Moe Harkless,surgery to repair hernia (DTD)
38,2012-11-09,Timberwolves,Jose Juan Barea / Jose Barea / J.J. Barea,sprained left foot (DNP)
41,2012-11-10,Jazz,Maurice Williams / Mo Williams,strained right abductor (DNP)
...,...,...,...,...
6969,2020-03-02,Hawks,Cameron Reddish / Cam Reddish,sore lower back (DTD)
6973,2020-03-04,Wizards,Ishmael Smith / Ish Smith,left hamstring injury (DTD)
6993,2020-03-10,Clippers,Louis Williams / Lou Williams,right calf injury (DTD)
7017,2020-07-29,Pacers,Domantas Sabonis / Domas Sabonis,left foot injury (out for season)


Before we can merge the dataframes, we need to deal with multiple names in the injuries df. We'll start by separating the injuries df into two: one with multiple names, the other with just one

In [46]:
mult_names = injuries[injuries['Player'].str.contains('/')]
one_name = injuries[~injuries['Player'].str.contains('/')]

In [47]:
# converting full_name series to dict for performance
player_names_dict = players_df['full_name'].to_dict()

In [48]:
mult_names1 = mult_names.copy()


In [49]:
def match_official_name(df, split_name_dict):
    '''
    Returns variation of name matching the official records
    If no match is found, NA
    '''
    splits = df.Player.str.split(' / ')
    official_names = []
    print(type(splits))
    for names in splits:
        match_flag = 0
        for name in names:
            if name in split_name_dict.values():
                official_names.append(name)
                match_flag = 1
            
        if match_flag < 1:
            official_names.append('NA')

    return official_names

In [50]:
official_names = match_official_name(mult_names1, player_names_dict)
mult_names1['official'] = official_names

<class 'pandas.core.series.Series'>


In [51]:
mult_names2 = mult_names1[mult_names1.official != 'NA'] \
    .drop(columns=['Player']) \
    .rename(columns={'official':'Player'})

In [52]:
injuries_official = pd.concat([one_name, mult_names2])

In [53]:
# Removing periods from names for consistency
injuries_official.Player = injuries_official.Player.str.replace('.', '', regex=False)
players_df.full_name = players_df.full_name.str.replace('.', '', regex=False)

Now that the injuries dataset and the players dataset have matching names, we can work on collecting game data for each of these injuries.
But first, let's merge the two on player names

In [54]:
merged_df = injuries_official.merge(players_df[['full_name', 'id']], how='left', left_on='Player', right_on='full_name')
merged_df

Unnamed: 0,Date,Team,Player,Injury,full_name,id
0,2012-10-30,Bulls,Derrick Rose,recovering from surgery on left knee to repair...,Derrick Rose,201565.0
1,2012-10-30,Celtics,Darko Milicic,back spasms (DTD),Darko Milicic,2545.0
2,2012-10-30,Clippers,Grant Hill,bone bruise in right knee (DTD),Grant Hill,255.0
3,2012-10-30,Knicks,Iman Shumpert,recovering from surgery on left knee to repair...,Iman Shumpert,202697.0
4,2012-10-30,Mavericks,Jared Cunningham,sprained thumb (DTD),Jared Cunningham,203099.0
...,...,...,...,...,...,...
7030,2020-03-02,Hawks,Cam Reddish,sore lower back (DTD),Cam Reddish,1629629.0
7031,2020-03-04,Wizards,Ish Smith,left hamstring injury (DTD),Ish Smith,202397.0
7032,2020-03-10,Clippers,Lou Williams,right calf injury (DTD),Lou Williams,101150.0
7033,2020-07-29,Pacers,Domantas Sabonis,left foot injury (out for season),Domantas Sabonis,1627734.0


Now we should make sure they merged correctly by checking for NA's.

In [55]:
unmerged = merged_df[merged_df.isna().any(axis=1)]
unmerged

Unnamed: 0,Date,Team,Player,Injury,full_name,id
20,2012-11-06,Bobcats,Gerald Henderson Jr,sprained left foot (out indefinitely),,
96,2012-11-26,Grizzlies,Mike Conley Jr,flu (DNP),,
107,2012-11-29,Spurs,(William) Tony Parker,rest (DNP),,
184,2012-12-12,Bucks,Larry Sanders (b 1988-11-21),illness (DNP),,
196,2012-12-14,Bucks,Mike Dunleavy Jr,bruised left knee (DNP),,
...,...,...,...,...,...,...
6416,2019-12-23,Knicks,Marcus Morris,left Achilles injury (DTD),,
6462,2020-01-10,Magic,DJ Augustine,bruised left knee (DTD),,
6554,2020-02-12,Jazz,Mike Conley Jr,illness (DTD),,
6623,2020-03-10,Grizzlies,Jontay Porter,right knee injury (DTD),,


Only 7035 rows, only 276 were unmatched. We should be okay dropping these entries. We can also drop the full_name, first_name, last_name, and is_active columns. We should also drop the decimal in id and convert to strings

In [56]:
merged_df = merged_df.dropna() \
    .drop(columns=['full_name'])

In [57]:
merged_df.id = merged_df.id.apply(str) \
    .str[:-2]

In [58]:
merged_df.id

0        201565
1          2545
2           255
3        202697
4        203099
         ...   
7030    1629629
7031     202397
7032     101150
7033    1627734
7034    1628964
Name: id, Length: 6759, dtype: object

The next step is to merge player data with the injuries database. Begin by separating entries by season.

In [59]:
merged_df.Date = pd.to_datetime(merged_df.Date, yearfirst=True)

In [60]:
# Using a mask to separate seasons
mask1213 = (merged_df.Date > '2012-10-29') & (merged_df.Date < '2013-06-21')
mask1314 = (merged_df.Date > '2013-10-28') & (merged_df.Date < '2014-06-16')
mask1415 = (merged_df.Date > '2014-10-29') & (merged_df.Date < '2015-06-17')
mask1516 = (merged_df.Date > '2015-10-28') & (merged_df.Date < '2016-06-20')
mask1617 = (merged_df.Date > '2016-10-24') & (merged_df.Date < '2017-06-19')
mask1718 = (merged_df.Date > '2017-10-16') & (merged_df.Date < '2018-06-18')
mask1819 = (merged_df.Date > '2018-10-15') & (merged_df.Date < '2019-06-14')
mask1920 = (merged_df.Date > '2019-10-21') & (merged_df.Date < '2020-03-12')

inj1213 = merged_df.loc[mask1213]
inj1314 = merged_df.loc[mask1314]
inj1415 = merged_df.loc[mask1415]
inj1516 = merged_df.loc[mask1516]
inj1617 = merged_df.loc[mask1617]
inj1718 = merged_df.loc[mask1718]
inj1819 = merged_df.loc[mask1819]

bios1213 = pd.read_csv('data/bios2012-13.csv')
bios1213.PLAYER_ID = bios1213.PLAYER_ID.apply(str)
bios1314 = pd.read_csv('data/bios2013-14.csv')
bios1314.PLAYER_ID = bios1314.PLAYER_ID.apply(str)
bios1415 = pd.read_csv('data/bios2014-15.csv')
bios1415.PLAYER_ID = bios1415.PLAYER_ID.apply(str)
bios1516 = pd.read_csv('data/bios2015-16.csv')
bios1516.PLAYER_ID = bios1516.PLAYER_ID.apply(str)
bios1617 = pd.read_csv('data/bios2016-17.csv')
bios1617.PLAYER_ID = bios1617.PLAYER_ID.apply(str)
bios1718 = pd.read_csv('data/bios2017-18.csv')
bios1718.PLAYER_ID = bios1718.PLAYER_ID.apply(str)
bios1819 = pd.read_csv('data/bios2018-19.csv')
bios1819.PLAYER_ID = bios1819.PLAYER_ID.apply(str)


In [61]:
bios1213[bios1213.isna().any(axis=1)]

Unnamed: 0,PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_ABBREVIATION,AGE,PLAYER_HEIGHT,PLAYER_HEIGHT_INCHES,PLAYER_WEIGHT,COLLEGE,COUNTRY,...,GP,PTS,REB,AST,NET_RATING,OREB_PCT,DREB_PCT,USG_PCT,TS_PCT,AST_PCT
28,1629717,Armoni Brooks,1610612745,HOU,22.0,6-3,75,195,Houston,USA,...,20,223,68,30,-11.7,0.017,0.112,0.175,0.565,0.09
73,1626184,Chasson Randle,1610612753,ORL,28.0,6-2,74,185,Stanford,USA,...,41,266,82,74,-15.0,0.008,0.084,0.154,0.515,0.143
93,1629751,Dakota Mathias,1610612755,PHI,25.0,6-4,76,200,Purdue,USA,...,8,48,7,13,-10.6,0.007,0.041,0.166,0.474,0.157
133,1629962,Devin Cannady,1610612753,ORL,25.0,6-2,74,183,Princeton,USA,...,8,34,5,1,-13.7,0.0,0.065,0.181,0.547,0.023
286,1629620,Justin Robinson,1610612760,OKC,23.0,6-1,73,195,Virginia Tech,USA,...,9,21,7,9,-22.0,0.011,0.067,0.111,0.453,0.15
297,1629833,Keljin Blevins,1610612757,POR,25.0,6-4,76,200,Montana State,USA,...,17,12,10,4,-22.2,0.031,0.071,0.126,0.3,0.083
355,1629716,Marques Bolden,1610612739,CLE,23.0,6-10,82,249,Duke,USA,...,6,7,6,0,12.3,0.083,0.088,0.136,0.537,0.0
439,1629606,Robert Franks,1610612753,ORL,24.0,6-7,79,225,Washington State,USA,...,7,43,14,5,-14.4,0.053,0.082,0.148,0.638,0.07
475,204456,T.J. McConnell,1610612754,IND,29.0,6-1,73,190,Arizona,USA,...,69,596,256,456,2.5,0.03,0.106,0.149,0.583,0.337


In [62]:
inj1213 = inj1213.merge(bios1213[['AGE', 'PLAYER_HEIGHT_INCHES', 'PLAYER_WEIGHT', 'PLAYER_ID']], how='left', left_on='id', right_on='PLAYER_ID')
inj1314 = inj1314.merge(bios1213[['AGE', 'PLAYER_HEIGHT_INCHES', 'PLAYER_WEIGHT', 'PLAYER_ID']], how='left', left_on='id', right_on='PLAYER_ID')
inj1415 = inj1415.merge(bios1213[['AGE', 'PLAYER_HEIGHT_INCHES', 'PLAYER_WEIGHT', 'PLAYER_ID']], how='left', left_on='id', right_on='PLAYER_ID')
inj1516 = inj1516.merge(bios1213[['AGE', 'PLAYER_HEIGHT_INCHES', 'PLAYER_WEIGHT', 'PLAYER_ID']], how='left', left_on='id', right_on='PLAYER_ID')
inj1617 = inj1617.merge(bios1213[['AGE', 'PLAYER_HEIGHT_INCHES', 'PLAYER_WEIGHT', 'PLAYER_ID']], how='left', left_on='id', right_on='PLAYER_ID')
inj1718 = inj1718.merge(bios1213[['AGE', 'PLAYER_HEIGHT_INCHES', 'PLAYER_WEIGHT', 'PLAYER_ID']], how='left', left_on='id', right_on='PLAYER_ID')
inj1819 = inj1819.merge(bios1213[['AGE', 'PLAYER_HEIGHT_INCHES', 'PLAYER_WEIGHT', 'PLAYER_ID']], how='left', left_on='id', right_on='PLAYER_ID')

In [63]:
inj1617

Unnamed: 0,Date,Team,Player,Injury,id,AGE,PLAYER_HEIGHT_INCHES,PLAYER_WEIGHT,PLAYER_ID
0,2016-10-26,Grizzlies,Tony Allen,right knee injury (DTD),2754,,,,
1,2016-10-26,Pacers,Aaron Brooks,sore knee (DTD),201166,,,,
2,2016-10-26,Raptors,Lucas Nogueira,sore/sprained left ankle (DTD),203512,,,,
3,2016-10-27,Blazers,Meyers Leonard,DNP (P),203086,29.0,84.0,260.0,203086
4,2016-10-28,Knicks,Ron Baker,sore ankle (DTD),1627758,,,,
...,...,...,...,...,...,...,...,...,...
732,2017-04-14,Wizards,Sheldon Mac,strained calf (DTD),1627815,,,,
733,2017-05-08,Rockets,Nene,torn left adductor muscle (out for season),2403,,,,
734,2017-05-13,Cavaliers,Edy Tavares,fractured right hand (out for season),204002,,,,
735,2017-05-17,Warriors,Zaza Pachulia,bruised right heel,2585,,,,


In [64]:
inj1617[inj1617.isna().any(axis=1)]

Unnamed: 0,Date,Team,Player,Injury,id,AGE,PLAYER_HEIGHT_INCHES,PLAYER_WEIGHT,PLAYER_ID
0,2016-10-26,Grizzlies,Tony Allen,right knee injury (DTD),2754,,,,
1,2016-10-26,Pacers,Aaron Brooks,sore knee (DTD),201166,,,,
2,2016-10-26,Raptors,Lucas Nogueira,sore/sprained left ankle (DTD),203512,,,,
4,2016-10-28,Knicks,Ron Baker,sore ankle (DTD),1627758,,,,
5,2016-10-28,Pacers,Rodney Stuckey,sore right hamstring (DTD),201155,,,,
...,...,...,...,...,...,...,...,...,...
732,2017-04-14,Wizards,Sheldon Mac,strained calf (DTD),1627815,,,,
733,2017-05-08,Rockets,Nene,torn left adductor muscle (out for season),2403,,,,
734,2017-05-13,Cavaliers,Edy Tavares,fractured right hand (out for season),204002,,,,
735,2017-05-17,Warriors,Zaza Pachulia,bruised right heel,2585,,,,


It looks like there's a ton of players with missing data - far more than we'd like.
There's an API call to retrieve data on a player-by-player basis, but it'll take a long time and we run into the risk of our connection timing out. Plus, we'll eventually want information for all players, not just the ones who are currently injured. 

Instead, we can extract the data directly from https://www.basketball-reference.com/players/. 

A quick overview of the process:

1. Extract data using BeautifulSoup
2. Transform data using Pandas
3. Load into SQL database

In [65]:
import requests
from bs4 import BeautifulSoup
from string import ascii_lowercase

URL = 'https://www.basketball-reference.com/players/a/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

raw = soup.tbody.find_all(['th', 'td'])
strings = []


In [192]:
from collections import defaultdict

for line in raw:
    strings.append(line.string)

l = []



In [193]:
raw_names = soup.tbody.find_all('th')
raw_names_list = []
for name in raw_names:
    raw_names_list.append(name.a.string)

z = zip(raw_names_list, strings[1::8], strings[2::8], strings[3::8], 
    strings[4::8], strings[5::8], strings[6::8], strings[7::8])

for name, from_, to_, pos, ht, wt, bday, coll in z:
    d = {}
    d['Name'] = name
    d['From'] = from_
    d['Position'] = pos
    d['Height'] = ht
    d['Weight'] = wt
    d['To'] = to_
    d['Birth Date'] = bday
    d['College'] = coll
    l.append(d)


[{'Name': 'Alaa Abdelnaby',
  'From': '1991',
  'Position': 'F-C',
  'Height': '6-10',
  'Weight': '240',
  'To': '1995',
  'Birth Date': 'June 24, 1968',
  'College': 'Duke'},
 {'Name': 'Zaid Abdul-Aziz',
  'From': '1969',
  'Position': 'C-F',
  'Height': '6-9',
  'Weight': '235',
  'To': '1978',
  'Birth Date': 'April 7, 1946',
  'College': 'Iowa State'},
 {'Name': 'Kareem Abdul-Jabbar',
  'From': '1970',
  'Position': 'C',
  'Height': '7-2',
  'Weight': '225',
  'To': '1989',
  'Birth Date': 'April 16, 1947',
  'College': 'UCLA'},
 {'Name': 'Mahmoud Abdul-Rauf',
  'From': '1991',
  'Position': 'G',
  'Height': '6-1',
  'Weight': '162',
  'To': '2001',
  'Birth Date': 'March 9, 1969',
  'College': 'LSU'},
 {'Name': 'Tariq Abdul-Wahad',
  'From': '1998',
  'Position': 'F',
  'Height': '6-6',
  'Weight': '223',
  'To': '2003',
  'Birth Date': 'November 3, 1974',
  'College': None},
 {'Name': 'Shareef Abdur-Rahim',
  'From': '1997',
  'Position': 'F',
  'Height': '6-9',
  'Weight': '225

In [190]:
raw_stats = soup.tbody.find_all('td')

names1 = []

for line in raw_stats:
    print(line)

names1

<td class="right" data-stat="year_min">1991</td>
<td class="right" data-stat="year_max">1995</td>
<td class="center" data-stat="pos">F-C</td>
<td class="right" csk="82.0" data-stat="height">6-10</td>
<td class="right" data-stat="weight">240</td>
<td class="left" csk="19680624" data-stat="birth_date"><a href="/friv/birthdays.cgi?month=6&amp;day=24">June 24, 1968</a></td>
<td class="left" data-stat="colleges"><a href="/friv/colleges.fcgi?college=duke">Duke</a></td>
<td class="right" data-stat="year_min">1969</td>
<td class="right" data-stat="year_max">1978</td>
<td class="center" data-stat="pos">C-F</td>
<td class="right" csk="81.0" data-stat="height">6-9</td>
<td class="right" data-stat="weight">235</td>
<td class="left" csk="19460407" data-stat="birth_date"><a href="/friv/birthdays.cgi?month=4&amp;day=7">April 7, 1946</a></td>
<td class="left" data-stat="colleges"><a href="/friv/colleges.fcgi?college=iowast">Iowa State</a></td>
<td class="right" data-stat="year_min">1970</td>
<td class

[]

In [200]:
import csv
fields = l[0].keys()
with open('testref.csv', 'w') as csvFile:
    dict_writer = csv.DictWriter(csvFile, fields)
    dict_writer.writeheader()
    dict_writer.writerows(l)

In [214]:
from dateutil.parser import parse
import datetime
print(l[0]['Birth Date'])
print(type(parse(l[0]['Birth Date'])))
now = datetime.datetime.now()
print(type(now))
now-parse(l[0]['Birth Date'])


June 24, 1968
<class 'datetime.datetime'>
<class 'datetime.datetime'>


datetime.timedelta(days=19427, seconds=78932, microseconds=664930)

In [215]:
datetime.datetime.str

TypeError: an integer is required (got type NavigableString)