# "NBA Injury Report Sports Analytics Project (Part 1)"
> "Introduction and setup for how to structure a complete sports analytics project"

- toc: true 
- badges: true
- comments: true
- author: Jeremy Abramson
- categories: [sports analytics, nba, python, research, projects, injuries]
- image: images/chart-preview.png

# Introduction



## How To Read This Post

## Project Description

It all started with a tweet from basketball analytics legend and Korean cinema aficionado [Ed Kupfer](https://twitter.com/EdKupfer):
> twitter: https://twitter.com/EdKupfer/status/1385682808174256129

Pretty straightforward, right?

### Setting Goals
As per The Kickstarter™️ model, let's define our three goals.  
We're going to be somewhat aggressive here, since the creation of our dataset involves some [at this point] unknown tooling.

#### 1.  Fallback Goal
This one seems pretty obvious: 
> Download and "lift" the data from the NBA Injury Report PDFs into some sort of machine-readable format (`CSV`, `JSON`, etc.)

#### 2.  Baseline Goal
This one is a little more open ended, but I think a good option might be: 
> Present the data we obtained above in some sort of webapp/dashboard/etc.

#### 3.  Stretch Goal
As we were pretty aggressive with our first two goals, I think it's perfectly reasonable to not have a Stretch Goal.
We could say something vague like "analyze the injury data" but in my opinion, that's actually *counterproductive*.
The reason being that the goal of this project isn't really analysis!
The goal is to unlock some data in one format, and present it to "the user" in another format that's more amenable to *their* absorption.  
This may not sound very "sexy", but the reality is that this data collection/"chewing"/presenting process is basically [one of the most important things](https://www.sportperformanceanalysis.com/article/communication-with-coaches-as-a-performance-analyst) in sports analytics.
So unless you have a *specific* analysis in mind, I think it makes sense to not have a Stretch Goal, and to just focus on the first two goals.
If we do it right, perhaps something interesting will fall out, and we can explore it at that point!


# Fallback Goal: Downloading and Processing NBA Injury Report PDFs

I find it's helpful to spend a little time — *before* coding! — structuring my thinking about the tasks at hand.
By virtue of setting goals above, we already have two "high level" tasks: 

1. Downloading PDFs
2. Processing PDFs.

But we can break those down a little further.
First off, we can't download anything without knowing *what* to download.
There doesn't seem to be any obvious "central repository" for URLs to injury reports, and since Ed's tweet mentions "just change the date", it seems safe to assume we'll need to build a list of URLs based on dates.
At this point, I'll note that I'm not sure exactly how to extract the data from the PDFs (minor detail, right?!?)
But I'm pretty confident it *can* be done, so we'll save a little "Google-fu" for later.
However, it does seem reasonable to use `Pandas` to post-process our data once we do get it out of the PDF, since it's tabular.
Lastly, we might want to get fancy with [asynchronous requests](https://www.twilio.com/blog/asynchronous-http-requests-in-python-with-aiohttp) (or perhaps not!  It might not be worth the effort, but let's note it here).

Taking all of the above into account, our **Fallback Goal** outline might look like this:

1. Download PDFs
    1. Generate list of URLs (dates) to injury report PDFs
    2. Download URLs from list (do this asynchronously?)
2.  Process PDFs
    1. Extract data from PDF (with what?)
    2. Post-process data (Pandas)

> Tip: One thing we'll find as we progress is that this outline, as pedantic as it may seem, is incomplete.
Digging into the above steps, we'll invariably find places where the sub-steps aren't necessarily or obviously entailed by the macro steps.
If this gets too far out of hand — when, exactly, this happens is a bit of a judgment call — it makes sense to go back and revise the outline.
For the sake of this blog post, I'm going to go forward with this level of granularity, and we'll see how far we can get.

Now we can tackle these individually, and make sure we're always progressing toward the end goal!

## Downloading The Injury Reports

Before we can actually do anything interesting with the injury reports, we need to download them.
To do that, we need to know where they are.
After some brief googling, it seems like there really isn't a central repository for these reports, and they're just uploaded one at a time{% fn 1 %}.


The one example URL from Ed's tweet looks like this:
> `https://ak-static.cms.nba.com/referee/injury/Injury-Report_2021-04-23_01PM.pdf`

Looks like there's 4 parts to the URL:
1. Some boilerplate: `https://ak-static.cms.nba.com/referee/injury/Injury-Report_`
2. A date: `2021-04-23`
3. A time: `01PM`
4. The file extension: `.pdf`

We're only interested in the middle two (the date and the time), as the boilerplate URL and the file extension won't change.  From googling{% fn 2 %} `"nba injury report"` and clicking on [the second link](https://official.nba.com/nba-injury-report-2020-21-season/) it seems like the times may be `1:30 PM`, `5:30 PM` and `8:30 PM`.
For simplicity, let's just focus on the `8:30PM` file (it's likely to be the most complete anyway).
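To double-check that four-part breakdown, here's a small sketch that pulls the pieces back out of the example URL with a regular expression. The pattern and the group names (`stem`, `date`, `time`, `extension`) are my own labels for illustration, and the pattern only assumes what we can see in the one example URL.

```python
import re

# Pattern for the four parts seen in the example URL: stem, date, time, extension.
# The group names are illustrative labels, not anything defined by nba.com.
REPORT_URL_PATTERN = re.compile(
    r'^(?P<stem>.+Injury-Report_)'
    r'(?P<date>\d{4}-\d{2}-\d{2})_'
    r'(?P<time>\d{2}[AP]M)'
    r'(?P<extension>\.pdf)$'
)

example_url = 'https://ak-static.cms.nba.com/referee/injury/Injury-Report_2021-04-23_01PM.pdf'
match = REPORT_URL_PATTERN.match(example_url)
url_parts = match.groupdict()

# Print each named part on its own line
for name, value in url_parts.items():
    print(f'{name}: {value}')
```

If a URL ever stops matching this pattern, that's a decent hint the naming scheme has changed.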

### Building a URL List to Download (Finding Gamedays?)


Google helped us sort out what we might use for the `time` portion of the URL, but the `date` part is actually a little more complicated than we might like.
Since, to the best of my knowledge, there's no "central" page that has links to each report, we'll need to know the date for *every* report we want to download.
According to the [official NBA announcement](https://official.nba.com/nba-injury-report-2020-21-season/), the dates we're interested in are actually the day *before* game days, except on the second days of back-to-backs, in which case presumably the injury report comes out the day of the game.
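To make that rule concrete, here's a rough sketch of how it might look for a single team's schedule. The function name `report_dates_for_team` and the example schedule are made up for illustration, and the back-to-back behavior is my reading of the announcement (the "presumably" above), not something I've verified.

```python
from datetime import date, timedelta

def report_dates_for_team(game_dates):
    """Guess the injury report dates for one team's list of game dates."""
    game_days = set(game_dates)
    one_day = timedelta(days=1)
    report_dates = set()
    for game_day in game_days:
        if game_day - one_day in game_days:
            # Second day of a back-to-back: assume the report comes out on game day
            report_dates.add(game_day)
        else:
            # Otherwise assume the report comes out the day before the game
            report_dates.add(game_day - one_day)
    return sorted(report_dates)

# Hypothetical schedule: a back-to-back on Jan 1-2, then a game on Jan 5
example_games = [date(2021, 1, 1), date(2021, 1, 2), date(2021, 1, 5)]
example_reports = report_dates_for_team(example_games)
print(example_reports)
```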

This all sounds complicated, and if there's one thing I've learned about projects like this it's that nothing saps enthusiasm like complication and extra effort that doesn't "feel" like it's in the service of what we're trying to  accomplish.

So, given what we know, it seems as if we have two options:
1. Get a copy of the NBA schedule, and try to download reports for days we know there's a game
2. Just start downloading [a range of?] dates and see what we get

There's nothing wrong with taking a stab at option 1 here, but this project is about injury reports, not the NBA schedule, so we'll start with option 2{% fn 3 %}.

### Using `datetime.timedelta` to Construct A List of Dates
Programmatically dealing with dates can be [frustrating](https://xkcd.com/1179/), and doing so in Python is no exception.
Despite that, it seems relatively obvious that we need to deal with `date`-like objects here, not just `string` or `numeric` representations of the dates themselves.
(Trying to "subtract" a day from `01-01-2021` without some internal representation of the calendar seems problematic).

Thinking at a high level, it seems like what we might want to do is build{% fn 4 %} a list of dates, starting at, say, today (or some other reasonable choice, like the end of the NBA season) and work backward until...well, it's not exactly clear how far back we need to go.
But the notion still seems reasonable: start at a date and count backward day by day, creating a new date each time.
Luckily, Python has exactly the thing we need to do this: the `timedelta` object.

For our purposes, a `timedelta` object is simply something that allows us to perform arithmetic on `datetime` objects.
Consider the following:

```python
from datetime import datetime, timedelta

# random_date is a datetime object
random_date = datetime.strptime('2011-09-18', '%Y-%m-%d')
# Let's create a new datetime object by subtracting a timedelta object
# Note that we're using the "days" parameter.  Try experimenting with other values!
new_date = random_date - timedelta(days=10)

print(f"Ten days before {random_date} is {new_date}")
```

```
Ten days before 2011-09-18 00:00:00 is 2011-09-08 00:00:00
```


> Tip: Adding or subtracting `datetime` and `timedelta` objects from each other can be tricky: some combinations are valid, and others raise a `TypeError`.  Try playing with *negative* `timedelta` values!
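Here's a quick sketch of a few combinations, using only the standard library. The variable names are just for illustration; the point is which operations work and which ones Python rejects.

```python
from datetime import datetime, timedelta

some_date = datetime(2021, 7, 20)
one_week = timedelta(weeks=1)

# Valid: datetime +/- timedelta gives a datetime
print(some_date + one_week)
print(some_date - one_week)
# Valid: datetime - datetime gives a timedelta
print(some_date - datetime(2021, 7, 1))
# Valid: adding a negative timedelta moves backward in time
print(some_date + timedelta(days=-7))

# Invalid: these combinations raise a TypeError
invalid_operations = []
for description, operation in [
    ('datetime + datetime', lambda: some_date + some_date),
    ('timedelta - datetime', lambda: one_week - some_date),
]:
    try:
        operation()
    except TypeError as error:
        invalid_operations.append(description)
        print(f'{description} raised TypeError: {error}')
```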

The 2021 NBA season [ended on July 20th](https://www.youtube.com/watch?v=9O920AD-teU), and from manual testing, [https://ak-static.cms.nba.com/referee/injury/Injury-Report_2021-07-20_08PM.pdf](https://ak-static.cms.nba.com/referee/injury/Injury-Report_2021-07-20_08PM.pdf) is a valid link (the next day, unsurprisingly, is not).
So let's start there, and count backward until we reach the day *before* the [start](https://en.wikipedia.org/wiki/2020–21_NBA_season) of the season, which would be December 21st, 2020.

```python
season_start = datetime.strptime('2020-12-21', '%Y-%m-%d')
season_end = datetime.strptime('2021-07-20', '%Y-%m-%d')
season_duration = season_end - season_start
# season_duration is a timedelta object
print(f'{season_duration=} is of type {type(season_duration)}')
```

```
season_duration=datetime.timedelta(days=211) is of type <class 'datetime.timedelta'>
```


As a quick aside, `pandas` can generate a range of dates in a single call with `pd.date_range`:

```python
import pandas as pd
test = pd.date_range(start='12-21-2020', end='07-20-2021')
test[0].date()
```

```
datetime.date(2020, 12, 21)
```

```python
for thing in test:
    print(datetime.strftime(thing.date(), '%Y-%m-%d'))
```

```
2020-12-21
2020-12-22
2020-12-23
...
```

Now, constructing the list of dates with just the standard library is actually a little tricky.
Unfortunately, Python's built-in `range` only works on integers, so we can't just hand it `season_start` and `season_end` (that would be too easy).
We basically have to turn our `season_duration` (which is a `timedelta` object) into an integral `range`, and then we iterate over *that*, subtracting that many days from the end date for each value in the range.
To do this, we use the `.days` attribute of the `timedelta` object.
That gives us an `integer` representation of how many days are between the two dates used to create the `timedelta` in the first place.
We then convert this integral number of days *back* into a `timedelta`, which we subtract from our `season_end` variable.
Confused?  Hopefully the code will help clarify:

```python
# We'll store our dates here
season_dates = list()
# Convert the timedelta into an integral range, from 0 to the number of days in season_duration (211)
for day_number in range(season_duration.days):
    # Subtract day_number of days from our season endpoint (after converting to a timedelta)
    gameday = season_end - timedelta(days=day_number)
    # Append this new datetime to our list
    season_dates.append(gameday)
print(f'season_dates spans from {season_dates[0]} to {season_dates[-1]} with length {len(season_dates)}')
```

```
season_dates spans from 2021-07-20 00:00:00 to 2020-12-22 00:00:00 with length 211
```


Uh oh, we're off by one, and a day short!
Makes sense, since we're only iterating 211 times, so we lose one of the endpoints.
Fixing that, and converting the above to a list comprehension, we get:

```python
# season_duration.days+1 gives us the range we're looking for
season_dates = [season_end - timedelta(days=day_number) for day_number in range(season_duration.days+1)]
print(f'season_dates spans from {season_dates[0]} to {season_dates[-1]} with length {len(season_dates)}')
```

```
season_dates spans from 2021-07-20 00:00:00 to 2020-12-21 00:00:00 with length 212
```


That's more like it.
Finally, we have our list of dates to use in our URLs.
On to downloading!
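Before moving on: one of the footnotes mentions *generating* the dates instead of building the whole list up front. Here's a minimal sketch of what that might look like. The name `iter_dates_backward` is my own, and this version steps a `datetime` back one day at a time with a `while` loop instead of going through `range`.

```python
from datetime import datetime, timedelta

def iter_dates_backward(last_day, first_day):
    """Yield each day from last_day back to first_day, inclusive."""
    current_day = last_day
    while current_day >= first_day:
        yield current_day
        current_day -= timedelta(days=1)

season_start = datetime(2020, 12, 21)
season_end = datetime(2021, 7, 20)
# Wrapping the generator in list() just makes it easy to inspect the endpoints
generated_dates = list(iter_dates_backward(season_end, season_start))
print(f'{generated_dates[0]} to {generated_dates[-1]}, {len(generated_dates)} dates')
```

For a couple hundred dates the difference is negligible, so the list comprehension above is perfectly fine here.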

# Downloading The Injury Reports

For simplicity's sake, we'll just use the `requests` module to download the PDFs sequentially.
Later, we can see about using *asynchronous* requests to greatly speed up our downloads, but for now let's just do things one at a time.

We need to do three things:
1. Construct a URL from the dates we generated above
2. Download the PDF at the URL (if it exists)
3. Save it to disk; eventually we'll need to process the PDFs, but for now let's just store them


## Constructing an Injury Report URL
As we saw [above](https://jeremyabramson.github.io/blog/sports%20analytics/nba/python/research/projects/injuries/2021/07/25/nba-injuries.html#Decoding-the-Filename-Schema-From-the-URL), the URL has 4 parts.
For simplicity, let's only worry about gathering the 8PM report, which is the one that's most likely to have all the relevant information for the day.

As a reminder, the URLs look like this: 
> https://ak-static.cms.nba.com/referee/injury/Injury-Report_2021-07-20_08PM.pdf

Let's define variables for the parts of the URL that are static (for now, we may want to download other `report_time` values later):

```python
# We don't need to define a variable for the extension
url_stem = 'https://ak-static.cms.nba.com/referee/injury/Injury-Report_'
report_time = '08PM'
```

And let's build an example URL, using the first date from the `season_dates` list we created above:

```python
print(f'{url_stem}{season_dates[0]}_{report_time}.pdf')
```

```
https://ak-static.cms.nba.com/referee/injury/Injury-Report_2021-07-20 00:00:00_08PM.pdf
```


Almost, but not quite!
We need to convert our date back to a string, using `strftime()` ("string from time").
We use the exact same `format string` that we used when we made the call to `strptime()` ("string parse time") previously, in this case `'%Y-%m-%d'`.

```python
# Convert the datetime object back to a string for our URL
print(f'{url_stem}{datetime.strftime(season_dates[0], "%Y-%m-%d")}_{report_time}.pdf')
```

```
https://ak-static.cms.nba.com/referee/injury/Injury-Report_2021-07-20_08PM.pdf
```
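Since we'll be building this URL over and over, it might be worth wrapping it in a small helper. This is just a sketch: `build_report_url` and `URL_STEM` are names I'm making up here, and it uses the f-string format spec (`{report_date:%Y-%m-%d}`), which does the same job as the explicit `strftime()` call.

```python
from datetime import datetime

URL_STEM = 'https://ak-static.cms.nba.com/referee/injury/Injury-Report_'

def build_report_url(report_date, report_time='08PM'):
    """Build an injury report URL for one date and one report time."""
    # The :%Y-%m-%d format spec formats the date the same way strftime() does
    return f'{URL_STEM}{report_date:%Y-%m-%d}_{report_time}.pdf'

# The URL from Ed's tweet, then the same Game 6 URL as above
print(build_report_url(datetime(2021, 4, 23), report_time='01PM'))
print(build_report_url(datetime(2021, 7, 20)))
```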


If you copy and paste that link into your browser's address bar (or click on it if you're viewing this in a notebook) you should be greeted by the injury report for Game 6 of the NBA Finals. 
Now we just need to iterate through the `season_dates` list and download each PDF, if it exists.
We could do this lazily, with something simple like:
```python
import requests
# Loop through our dates lists
for gameday in season_dates:
    # Construct the URL and download the PDF
    resp = requests.get(f'{url_stem}{datetime.strftime(gameday, "%Y-%m-%d")}_{report_time}.pdf')
    # Write to disk
    with open(f'{datetime.strftime(gameday, "%Y-%m-%d")}_{report_time}.pdf', 'wb') as f:
        f.write(resp.content)
```

With the code above, what happens if there's no report for a specific day?
Or if there's a network error, either with the server or our own internet?

In general, it's a good idea to program a little defensively when accessing network assets, especially in cases like this, where there's a high likelihood of accessing an "invalid" URL, with nothing on the other end to download.
It's easy to imagine the downstream PDF processing — that we haven't written yet :-) — breaking when it tries to open an empty file, for example.
We could try to handle the error at that point, but it's probably better to do it here, and only write valid files{% fn 5 %}.

We'll assume that accessing a URL without a valid injury report PDF will return an [HTTP status code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) of something other than **200**, which is "success" {% fn 6 %}.
To handle this, we'll use the `raise_for_status()` method of the response object that `requests.get()` returns, which raises a `requests.HTTPError` when the status code indicates a client or server error (4xx or 5xx).
We can then catch that exception, and skip writing the file.

Let's add that to the code above, refactor things a skosh to be a little cleaner, and bring in some code from above for clarity:

> Important: Note that the `.raise_for_status()` call is the one that raises the `HTTPError`, **not** the `.get()` call, which returns normally even when the server answers with an error status!  Make sure `raise_for_status()` is inside your `try/except` block! 

```python
import requests
url_stem = 'https://ak-static.cms.nba.com/referee/injury/Injury-Report_'
report_time = '08PM'
# Loop over a copy of the list, since we remove dates from season_dates as we go
for gameday in list(season_dates):
    # Construct a URL
    url = f'{url_stem}{datetime.strftime(gameday, "%Y-%m-%d")}_{report_time}.pdf'
    # Make the request to download the PDF
    resp = requests.get(url)
    try:
        resp.raise_for_status()
    # This is the exception to catch for non-200 response codes; this won't necessarily catch network errors
    except requests.HTTPError:
        print(f'No valid PDF on {gameday}, removing from list')
        season_dates.remove(gameday)
    # The "else" block of a try/except happens if there isn't an exception
    else:
        filename = f'{datetime.strftime(gameday, "%Y-%m-%d")}_{report_time}.pdf'
        # Write to disk
        print(f'Writing PDF for {gameday}')
        with open(filename, 'wb') as f:
            f.write(resp.content)
```

The output below is from an earlier, interrupted run that looped over `season_dates` directly.
Removing items from a list while iterating over it makes Python skip the next element, which is why the day after each removed date (2021-07-17, for example) never shows up.
Looping over a copy, as above, avoids that.

```
Writing PDF for 2021-07-20 00:00:00
Writing PDF for 2021-07-19 00:00:00
No valid PDF on 2021-07-18 00:00:00, removing from list
Writing PDF for 2021-07-16 00:00:00
No valid PDF on 2021-07-15 00:00:00, removing from list
Writing PDF for 2021-07-13 00:00:00
No valid PDF on 2021-07-12 00:00:00, removing from list
Writing PDF for 2021-07-10 00:00:00
No valid PDF on 2021-07-09 00:00:00, removing from list
Writing PDF for 2021-07-07 00:00:00
Writing PDF for 2021-07-06 00:00:00
Writing PDF for 2021-07-05 00:00:00
No valid PDF on 2021-07-04 00:00:00, removing from list
Writing PDF for 2021-07-02 00:00:00
Writing PDF for 2021-07-01 00:00:00
Writing PDF for 2021-06-30 00:00:00
Writing PDF for 2021-06-29 00:00:00
Writing PDF for 2021-06-28 00:00:00
Writing PDF for 2021-06-27 00:00:00
Writing PDF for 2021-06-26 00:00:00
Writing PDF for 2021-06-25 00:00:00
Writing PDF for 2021-06-24 00:00:00
Writing PDF for 2021-06-23 00:00:00
Writing PDF for 2021-06-22 00:00:00
Writing PDF for 2021-06-21 00:00:00

KeyboardInterrupt: 
```
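As the comment in the loop notes, catching `requests.HTTPError` won't necessarily cover network errors. Here's a hedged sketch of one way to handle both: `requests.RequestException` is the base class for `HTTPError`, `ConnectionError`, and `Timeout`, so catching it covers bad status codes and network failures alike. The function name `download_pdf`, the `get` parameter (there only so the function can be exercised without hitting the network), and the 30 second timeout are all my own choices for illustration.

```python
from pathlib import Path
import requests

def download_pdf(url, destination, get=requests.get, timeout=30):
    """Try to save the PDF at url to destination; return True on success."""
    try:
        # The request itself can fail (no connection, timeout, ...)
        resp = get(url, timeout=timeout)
        # ...and so can the status check (403, 404, 500, ...)
        resp.raise_for_status()
    except requests.RequestException as error:
        # RequestException is the base class for HTTPError, ConnectionError and Timeout
        print(f'Skipping {url}: {error}')
        return False
    # Only write the file once we know we have a successful response
    Path(destination).write_bytes(resp.content)
    return True
```

Inside the loop, this would replace the `try/except/else` block with something like `download_pdf(url, filename)`, with the return value deciding whether to keep the date.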

# Extracting Tables from PDF Files

Now that we have our injury report PDFs, it's time to extract the data in the tables.
While I've briefly played around with optical character recognition (OCR) tools like [Tesseract](https://github.com/tesseract-ocr/tesseract), strict OCR seemed like a bad fit.
Other than that, I wasn't sure what tool to use, which means...Google to the rescue!

At first, I [foolishly] googled for *"python parse PDF"*, and eventually *"python parse PDF text"*, thinking it'd be obvious I meant "parse PDF **tables**".
Apparently clairvoyance isn't quite built into Google [yet], so I ended up checking out packages such as [PyPDF2](https://pypi.org/project/PyPDF2/), [pdfreader](https://pypi.org/project/pdfreader/), and eventually [pdfminer.six](https://pypi.org/project/pdfminer.six/).
These tools seem much more oriented toward manipulating PDF documents or pulling out raw text than toward extracting tables.
Eventually, further down the list of Google results, I came upon [this Stack Overflow post](https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file).
One of the answers mentioned [tabula](https://tabula.technology) and converting PDF tables to dataframes.
This seemed promising, so let's give it a try.


```python
import tabula
import pandas as pd

pd.set_option('display.max_rows', 500)
```

```python
tabula.read_pdf("2021-07-20_08PM.pdf", pages='all')
```

```
Unnamed: 0,Game Date,Game Time,Matchup,Team,Player Name,Current Status,Reason
0,07/20/2021,09:00 (ET),PHX@MIL,Milwaukee Bucks,"Antetokounmpo, Thanasis",Out,Health and Safety Protocols
1,,"DiVincenzo, Donte",Out,Injury/Illness - Left Ankle; Surgery,,,
2,Phoenix Suns,"Saric, Dario",Out,Injury/Illness - Right Acl; Tear,,,
```

```python
df = tabula.read_pdf("2021-03-20_08PM.pdf", stream=True, pages='all')
```

```python
# Drop stray header rows (rows where the 'Game Date' cell is literally 'Game Date')
df = df.drop(df.loc[df['Game Date'] == 'Game Date'].index).reset_index(drop=True)
# Forward-fill the columns that are left blank on repeated rows
df['Game Date'] = df['Game Date'].ffill()
df['Matchup'] = df['Matchup'].ffill()
df['Team'] = df['Team'].ffill()
df['Game Time'] = df['Game Time'].ffill()
#df[['Last Name', 'First Name']] = df['Player Name'].str.split(',', expand=True)
#df = df.drop(columns=['Player Name'])
```

```python
def clean_injury_report(df):
    # Drop stray header rows (rows where the 'Game Date' cell is literally 'Game Date')
    df = df.drop(df.loc[df['Game Date'] == 'Game Date'].index).reset_index(drop=True)
    # Forward-fill the columns that are left blank on repeated rows
    for column in ['Game Date', 'Matchup', 'Team', 'Game Time']:
        df[column] = df[column].ffill()
    return df
```
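To see what the cleanup is doing without needing a PDF on hand, here's a toy sketch. The rows below are hand-typed to mimic the shape of the tabula output (blank cells where a value repeats, plus a stray header row), and the last step tries out the "Last Name"/"First Name" split from the commented-out code above. Splitting on the first `', '` only is my assumption about how the names are formatted.

```python
import pandas as pd

# Toy rows shaped like the tabula output: repeated values are left blank (None)
raw = pd.DataFrame({
    'Game Date': ['05/20/2021', None, 'Game Date', None],
    'Team': ['Indiana Pacers', None, 'Team', 'Washington Wizards'],
    'Player Name': ['Brogdon, Malcolm', 'Lamb, Jeremy', 'Player Name', 'Avdija, Deni'],
})

# Drop the stray header row, then forward-fill the blanks from the row above
cleaned = raw[raw['Game Date'] != 'Game Date'].reset_index(drop=True)
cleaned[['Game Date', 'Team']] = cleaned[['Game Date', 'Team']].ffill()
# Split "Last, First" on the first comma only
cleaned[['Last Name', 'First Name']] = cleaned['Player Name'].str.split(', ', n=1, expand=True)
print(cleaned)
```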

```python
from pathlib import Path
from tqdm.notebook import tqdm

pdfs = sorted(Path('./').glob('*.pdf'))
```
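One thing worth noting: the tables only contain the *game* date, so the date and time of the report itself live only in the filename. Here's a small sketch for recovering them, assuming the `YYYY-MM-DD_HHPM.pdf` naming we used when saving the files. The function name `parse_report_filename` is hypothetical.

```python
from datetime import datetime
from pathlib import Path

def parse_report_filename(pdf_path):
    """Split a name like '2021-03-14_08PM.pdf' into (report date, report time)."""
    # .stem drops the directory and the '.pdf' extension
    date_text, time_text = Path(pdf_path).stem.split('_')
    return datetime.strptime(date_text, '%Y-%m-%d').date(), time_text

parsed_date, parsed_time = parse_report_filename('2021-03-14_08PM.pdf')
print(parsed_date, parsed_time)
```

Something like this could be used to tag each dataframe with its report date before concatenating them.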

```python
tabula.read_pdf('2021-05-20_08PM.pdf', stream=True, guess=False, pages='all')
```

```
Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Injury Report: 05/20/21 08:30 PM,Unnamed: 5
0,Game Date,Game Time,Matchup,Team,Player Name Current Status,Reason
1,05/20/2021,08:00 (ET),IND@WAS,Indiana Pacers,"Brogdon, Malcolm Available",Injury/Illness - Right Hamstring; Sore
2,,,,,"Lamb, Jeremy Out",Injury/Illness - Left Knee; Sore
3,,,,,"LeVert, Caris Out",Health and Safety Protocols
4,,,,,"Sumner, Edmond Available",Injury/Illness - Left Knee; Contusion
5,,,,,"Turner, Myles Out",Injury/Illness - Right Toe; Partial Plantar Pl...
6,,,,,"Warren, T.J. Out",Injury/Illness - Left Foot; Stress Fracture
7,,,,Washington Wizards,"Avdija, Deni Out",Injury/Illness - Right Ankle; Right ankle frac...
8,,,,,"Bryant, Thomas Out",Injury/Illness - Left Knee; Left ACL injury
9,05/21/2021,09:00 (ET),MEM@GSW,Golden State Warriors,,NOT YET SUBMITTED
```

```python
tabula.read_pdf('2021-05-20_08PM.pdf', stream=True, area=[52.099,16.84,562.561,834.632], pages='all')
```

```
Unnamed: 0,Game Date,Game Time,Matchup,Team,Player Name,Current Status,Reason
0,05/20/2021,08:00 (ET),IND@WAS,Indiana Pacers,"Brogdon, Malcolm",Available,Injury/Illness - Right Hamstring; Sore
1,,,,,"Lamb, Jeremy",Out,Injury/Illness - Left Knee; Sore
2,,,,,"LeVert, Caris",Out,Health and Safety Protocols
3,,,,,"Sumner, Edmond",Available,Injury/Illness - Left Knee; Contusion
4,,,,,"Turner, Myles",Out,Injury/Illness - Right Toe; Partial Plantar Pl...
5,,,,,"Warren, T.J.",Out,Injury/Illness - Left Foot; Stress Fracture
6,,,,Washington Wizards,"Avdija, Deni",Out,Injury/Illness - Right Ankle; Right ankle frac...
7,,,,,"Bryant, Thomas",Out,Injury/Illness - Left Knee; Left ACL injury
8,05/21/2021,09:00 (ET),MEM@GSW,Golden State Warriors,,,NOT YET SUBMITTED
9,,,,Memphis Grizzlies,"McDermott, Sean",Out,Injury/Illness - Left Foot; Soreness
```

```python
tabula.read_pdf('2021-05-16_08PM.pdf', stream=True, pages='all')
```

```
Unnamed: 0,Game Date,Game Time,Matchup,Team,Player Name,Current Status,Reason
0,05/16/2021,01:00 (ET),BOS@NYK,Boston Celtics,"Brown, Jaylen",Out,Injury/Illness - Left Scapholunate; Ligament S...
1,,,,,"Fournier, Evan",Out,Injury/Illness - Right Knee; Hyperflexion
2,,,,,"Smart, Marcus",Out,Injury/Illness - Right Calf; Contusion
3,,,,,"Tatum, Jayson",Out,Injury/Illness - Left Ankle; Impingement
4,,,,,"Thompson, Tristan",Out,Injury/Illness - Left Pectoral; Strain
5,,,,,"Walker, Kemba",Out,Injury/Illness - Left Cervical Nerve; Irritation
6,,,,,"Williams III, Robert",Out,Injury/Illness - Left Foot; Turf Toe
7,,,,New York Knicks,"Robinson, Mitchell",Out,Injury/Illness - Right Foot; surgery
8,,,,,"Vildoza, Luca",Out,Not With Team
9,,,CHA@WAS,Charlotte Hornets,"Hayward, Gordon",Out,Injury/Illness - Right Foot; Sprain
```

```python
dfs = list()
for pdf in tqdm(pdfs):
    print(f'processing {pdf}')
    df = tabula.read_pdf(pdf, stream=True, area=[52.099,16.84,562.561,834.632], pages='all')
    df = clean_injury_report(df)
    dfs.append(df)
```

```
  0%|          | 0/124 [00:00<?, ?it/s]

processing 2021-03-14_08PM.pdf
processing 2021-03-15_08PM.pdf
processing 2021-03-16_08PM.pdf
processing 2021-03-17_08PM.pdf
processing 2021-03-18_08PM.pdf
processing 2021-03-19_08PM.pdf
processing 2021-03-20_08PM.pdf
processing 2021-03-21_08PM.pdf
processing 2021-03-22_08PM.pdf
processing 2021-03-23_08PM.pdf
processing 2021-03-24_08PM.pdf
processing 2021-03-25_08PM.pdf
processing 2021-03-26_08PM.pdf
processing 2021-03-27_08PM.pdf
processing 2021-03-28_08PM.pdf
processing 2021-03-29_08PM.pdf
processing 2021-03-30_08PM.pdf
processing 2021-03-31_08PM.pdf
processing 2021-04-01_08PM.pdf
processing 2021-04-02_08PM.pdf
processing 2021-04-03_08PM.pdf
processing 2021-04-04_08PM.pdf
processing 2021-04-05_08PM.pdf
processing 2021-04-06_08PM.pdf
processing 2021-04-07_08PM.pdf
processing 2021-04-08_08PM.pdf
processing 2021-04-09_08PM.pdf
processing 2021-04-10_08PM.pdf
processing 2021-04-11_08PM.pdf
processing 2021-04-12_08PM.pdf
processing 2021-04-13_08PM.pdf
processing 2021-04-14_08PM.pdf
...
```

```python
df = pd.concat(dfs)
df
```

```
Unnamed: 0,Game Date,Game Time,Matchup,Team,Player Name,Current Status,Reason
0,03/14/2021,02:00 (ET),MEM@OKC,Memphis Grizzlies,"Jackson Jr., Jaren",Out,Injury/Illness - Left Knee; Meniscus Surgery R...
1,03/14/2021,02:00 (ET),MEM@OKC,Oklahoma City Thunder,"Ariza, Trevor",Out,Not With Team
2,03/14/2021,02:00 (ET),MEM@OKC,Oklahoma City Thunder,"Bazley, Darius",Out,Injury/Illness - Left Shoulder; Contusion
3,03/14/2021,02:00 (ET),MEM@OKC,Oklahoma City Thunder,"Dort, Luguentz",Out,Injury/Illness - Left Great Toe; Sprain
4,03/14/2021,02:00 (ET),MEM@OKC,Oklahoma City Thunder,"Hall, Josh",Out,Injury/Illness - Left Knee; Soreness
...,...,...,...,...,...,...,...
1,07/20/2021,09:00 (ET),PHX@MIL,Milwaukee Bucks,"DiVincenzo, Donte",Out,Injury/Illness - Left Ankle; Surgery
2,07/20/2021,09:00 (ET),PHX@MIL,Phoenix Suns,"Saric, Dario",Out,Injury/Illness - Right Acl; Tear
0,07/20/2021,09:00 (ET),PHX@MIL,Milwaukee Bucks,"Antetokounmpo, Thanasis",Out,Health and Safety Protocols
1,07/20/2021,09:00 (ET),PHX@MIL,Milwaukee Bucks,"DiVincenzo, Donte",Out,Injury/Illness - Left Ankle; Surgery
```

```python
df['Game Date'].nunique()
```

```
115
```

```python
len(season_dates)
```

```
207
```
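Those two numbers don't match, so a natural next step is to see *which* dates never show up as a game date. Here's a rough sketch of that comparison on toy data. The function name `dates_without_games` and the toy inputs are made up, and keep in mind that a report's date and its game dates aren't the same thing (reports often cover the next day's games), so treat this as a sanity check and not an exact accounting.

```python
from datetime import datetime, timedelta

def dates_without_games(candidate_dates, game_date_strings):
    """Return the candidate dates that never appear as a 'Game Date' value."""
    # 'Game Date' values in the reports look like '07/20/2021'
    seen = {datetime.strptime(text, '%m/%d/%Y') for text in game_date_strings}
    return sorted(set(candidate_dates) - seen)

# Toy inputs standing in for season_dates and df['Game Date'].unique()
toy_dates = [datetime(2021, 7, 20) - timedelta(days=n) for n in range(4)]
toy_game_dates = ['07/20/2021', '07/17/2021']
missing = dates_without_games(toy_dates, toy_game_dates)
print(missing)
```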

{{ "If you happen to find a central place for these, let me know!" | fndetail: 1 }}

{{ "I'm going to try my best to show the provenance of as much insight as I can here.  I had no idea when the injury reports came out, so I googled it.  Don't be ashamed to do likewise if there's something you need to know!" | fndetail: 2 }}


{{ '...and sincerely apologize to Adam Silver next time we run into him for the extra traffic on the [nba.com](https://nba.com) servers.' | fndetail: 3 }}

{{ 'Ideally we would *[generate](https://www.geeksforgeeks.org/python-list-comprehensions-vs-generator-expressions/)* the list versus building it all at once, but conceptually it&#39;s the same thing' | fndetail: 4 }}

{{ "If you're thinking 'You should do both', you get a gold star!" | fndetail: 5 }}

{{ "From manual inspection of the HTTP headers, it looks like nba.com returns a 403 ('Forbidden') when there isn't a valid PDF" | fndetail: 6 }}
