# "NBA Injury Report Sports Analytics Project (Part 1)"
> "Introduction and setup for how to structure a complete sports analytics project"

- toc: true 
- badges: true
- comments: true
- author: Jeremy Abramson
- categories: [sports analytics, nba, python, research, projects, injuries]
- image: images/chart-preview.png

# Introduction



## How To Read This Post

# Project Description

It all started with a tweet from basketball analytics legend and Korean cinema aficionado [Ed Kupfer](https://twitter.com/EdKupfer):
> twitter: https://twitter.com/EdKupfer/status/1385682808174256129


Pretty straightforward, right?

# Kickstarter Model™️ Goals
As per The Kickstarter Model™️, we need to define our three goals.
At first blush, we might suggest something like the following:
- Fallback Goal: "Get the PDFs"
- Baseline Goal: "Extract data from the PDFs"
- Stretch Goal: "Present the data somehow"

That's absolutely defensible, and has the bonus of structurally aligning with the major technical "tasks" of the project.
But I'm not sure a directory full of PDFs qualifies as a "dataset" you could pass on to someone else as a contribution.
Instead, we'll go with a slightly different approach (but feel free to use the above if it works better for you).

## 1. Fallback Goal
> Download and "lift" the data from the NBA Injury Report PDFs into some sort of machine-readable format (`CSV`, `JSON`, etc.)

Since "getting the data" is kind of the main point of this entire exercise, even though extracting data from the PDFs requires some [at this point] unknown tooling, we're going to have both downloading *and* processing the PDFs be our Fallback Goal.
If we don't do anything else, or if nothing else "works", at least we'll have accomplished something useful that we can share with the community.

## 2.  Baseline Goal
> Present the data we obtained above in some sort of webapp/dashboard/etc.

Structuring things this way now aligns with what we normally consider a "Baseline" Goal: A fully-fledged "deliverable" of some sort that provides value to the community.
Doing this may sound a bit aggressive in terms of complexity, but our "webapp" doesn't need to be all that fancy.
It's just a "skin" to slice and dice the data we generate from the Fallback Goal.
It *should* be reasonably straightforward{% fn 1 %}.

## 3.  Stretch Goal
> None

As we were pretty aggressive with our first two goals, I think it's perfectly reasonable to not have a Stretch Goal.
Normally, you might think to say something vague like "analyze the injury data", but in my opinion, that's actually *counterproductive*.
The reason being that the goal of this project isn't really analysis!
The goal is to unlock some data in one format, and present it to "the user" in another format that's more amenable for *their* absorption.  
This may not sound very "sexy", but the reality is that this data collection/"chewing"/presenting process is basically [one of the most important things](https://www.sportperformanceanalysis.com/article/communication-with-coaches-as-a-performance-analyst) in sports analytics.
So unless you have a *specific* analysis in mind, I think it makes sense to not have a Stretch Goal, and to just focus on the first two goals.
If we do this right, perhaps something interesting will fall out of our dashboard, and we can explore it at that point.

> Note: Again, the point of setting these goals isn't to spur endless debate or to create another roadblock to starting a project.  It's just a way of structuring potential future design decisions, and making sure things like feature creep and "great-being-the-enemy-of-good" don't impede actually delivering on the goals.

Now that we have a conceptual goal framework in which to work, let's get to it!

# Fallback Goal: Downloading and Processing NBA Injury Report PDFs

The Kickstarter Model™️ is there to help us structure our effort wisely, and make productive high level design decisions, but it doesn't really say anything about *how* to do whatever it is we want to do!
To that end, once we have our goals, I find it's helpful to spend a little time — *before* coding! — structuring my thinking about each of the tasks at hand.

## Divvying Up Tasks
As per the discussion above, for our Fallback Goal, it seems like we already have two "high level" tasks: 

1. Downloading PDFs
2. Processing PDFs.

But we can break those down a little further.
First off, we can't download anything without knowing *what* to download.
Since Ed's tweet mentions "just change the date", it seems safe to assume we'll need to build a list of URLs based on dates, unless there's some central page with links to all of the injury reports, in which case we can just scrape the URLs from that.

And, of course, at this point I'm not sure exactly how to extract the data from the PDFs (minor detail, right?!?)
But I'm pretty confident it *can* be done, so we'll save a little "Google-fu" for later.
However, it does seem reasonable to use `Pandas` to post-process our data once we do get it out of the PDF, since it's definitely going to be tabular.
Lastly, we might want to get fancy with [asynchronous requests](https://www.twilio.com/blog/asynchronous-http-requests-in-python-with-aiohttp) (or perhaps not!  It might not be worth the effort, but let's note it here).

Taking all of the above into account, our **Fallback Goal** task outline might look like this:

1. Download PDFs
    1. Generate list of URLs (dates) to injury report PDFs
    2. Download URLs from list (do this asynchronously?)
2.  Process PDFs
    1. Extract data from PDF (with what?)
    2. Post-process data (Pandas)

> Tip: One thing we'll find as we progress is that this outline, as pedantic as it may seem, is incomplete.  Digging into the above steps, we'll invariably find places where the sub-steps aren't necessarily or obviously entailed by the macro steps.  If this gets too far out of hand — when, exactly, this happens is a bit of a judgment call — it makes sense to go back and revise the outline.  For the sake of this blog post, I'm going to go forward with this level of granularity, and we'll see how far we can get.

Now we can tackle these individually, and make sure we're always progressing toward the end goal!

## Task 1: Downloading The Injury Reports

Before we can actually do anything interesting with the injury reports, we need to download them.
To do that, we need to know where they are.
After some brief googling, it seems like there really isn't a central repository for these reports, and they're just uploaded one at a time{% fn 2 %}.


The one example URL from Ed's tweet looks like this:
> `https://ak-static.cms.nba.com/referee/injury/Injury-Report_2021-04-23_01PM.pdf`


Seems like there are four parts to the URL:
1. Some boilerplate: `https://ak-static.cms.nba.com/referee/injury/Injury-Report_`
2. A date: `2021-04-23`
3. A time: `01PM`
4. The file extension: `.pdf`

We're only interested in the middle two (the date and the time), as the boilerplate URL and the file extension won't change.  From googling{% fn 30 %} `"nba injury report"` and clicking on [the second link](https://official.nba.com/nba-injury-report-2020-21-season/) it seems like the times may be `1:30 PM`, `5:30 PM` and `8:30 PM`.
Although from Ed's example, we see a `01PM` time in the URL, *not* `1:30`. 
So we'll go ahead and assume the appropriate times URL-wise are `01PM`, `05PM` and `08PM`.  
Lastly, for simplicity, let's just focus on the `8:30PM` (or `08PM`, in URL-speak) file, as it's likely to be the most complete.
We can download the rest later if we want.
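
To make that four-part anatomy concrete, here's a minimal sketch that splits a report URL back into its pieces with a regular expression. The pattern and the `split_report_url` helper are my own illustration (nothing official from the NBA), and they assume every report URL looks like the single example we have from Ed's tweet.

```python
import re

# Assumption: every report URL follows the shape of the one example we have.
# The four named groups mirror the four parts listed above.
REPORT_URL_PATTERN = re.compile(
    r'^(?P<stem>https://ak-static\.cms\.nba\.com/referee/injury/Injury-Report_)'
    r'(?P<date>\d{4}-\d{2}-\d{2})_'
    r'(?P<time>\d{2}[AP]M)'
    r'(?P<ext>\.pdf)$'
)


def split_report_url(url):
    """Return the URL's four parts as a dict, or None if it doesn't match."""
    match = REPORT_URL_PATTERN.match(url)
    return match.groupdict() if match else None


example_url = 'https://ak-static.cms.nba.com/referee/injury/Injury-Report_2021-04-23_01PM.pdf'
parts = split_report_url(example_url)
print(parts['date'], parts['time'])
```

If the NBA ever changes the naming scheme, a pattern like this should fail loudly (by returning `None`) rather than quietly producing bad dates.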


### Generate a List of URLs to Download
Google helped us sort out what we might use for the `time` portion of the URL, but the `date` part is actually a little more complicated than we might like, as we need to know the specific URL of each injury report we want to download.
According to the [official NBA announcement](https://official.nba.com/nba-injury-report-2020-21-season/), the dates we're interested in are actually the day *before* game days, except on the second days of back-to-backs, in which case presumably the injury report comes out the day of the game.

This all sounds complicated, and if there's one thing I've learned about projects like this it's that nothing saps enthusiasm like complication and extra effort that doesn't "feel" like it's in the service of what we're trying to accomplish{% fn 40 %}.

So, given what we know, it seems as if we have two options:
1. Get a copy of the NBA schedule, and try to download reports for days we know there's a game
2. Just start downloading [a range of?] dates and see what we get

There's nothing wrong with taking a stab at option 1 here, but this project is about injury reports, not the NBA schedule, so we'll start with option 2{% fn 50 %}.

#### Building a List of Dates: The Easy Way
Programmatically dealing with dates can be [frustrating](https://xkcd.com/1179/), and doing so in Python is no exception.
There's a whole rabbit hole we can go down, talking about `timedelta` objects and "date arithmetic" et cetera, but let's just cut right to the chase and use `Pandas`.
Pandas certainly has its issues, but this is one case where it'll make our lives *much* simpler.

So finally...some code!

In [3]:
import pandas as pd

(Okay, so that wasn't very satisfying.
There's more code coming soon, I promise!)

We're going to use the `pandas.date_range` ([docs](https://pandas.pydata.org/docs/reference/api/pandas.date_range.html)) function to generate, well, a range of dates.
We'll use these dates as a portion of the URL schema detailed above, but in order to do so, we need to know what range to generate.

The day before the start of the NBA season until the end of the NBA season seems like a reasonable starting point.
If we needed to do this for some number of seasons, we'd want to automate this, but for now, let's just copy the dates manually. 
Perusing the [Wikipedia page](https://en.wikipedia.org/wiki/2020–21_NBA_season) for the 2020-21 NBA season, it looks like the season [started](https://www.espn.com/nba/recap/_/gameId/401266805) on December 22, 2020, and [ended](https://www.youtube.com/watch?v=9O920AD-teU) on July 20, 2021.
From manual testing, [https://ak-static.cms.nba.com/referee/injury/Injury-Report_2021-07-20_08PM.pdf](https://ak-static.cms.nba.com/referee/injury/Injury-Report_2021-07-20_08PM.pdf) is a valid link, and the next day, unsurprisingly, is not.

Remembering our "the injury reports can come out the day *before*" note from above, this gives us our range: `December 21, 2020` through `July 20, 2021`. 

Looking at the [docs](https://pandas.pydata.org/docs/reference/api/pandas.date_range.html) for `pandas.date_range`, it looks like there are a ton of parameters we can use, but `start` and `end` look like they might be all we need.
Now, for real this time, let's look at some code!

In [4]:
# This should generate some list-like thing of datetime-like things
season_dates = pd.date_range(start='12-21-2020', end='07-20-2021')
# Let's look at the first and last dates in our list
print(f'The 2020-21 NBA season started on {season_dates[0]}, and ended on {season_dates[-1]}')

The 2020-21 NBA season started on 2020-12-21 00:00:00, and ended on 2021-07-20 00:00:00


(It may not look like much, but if you have *any* idea how obnoxious this would be without `pandas`, you'll be impressed!)
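
For comparison, here's roughly what the same range looks like using only the standard library. This is just a sketch (the `daily_dates` helper is a made-up name, not something we use later), and the inclusive end date is exactly the kind of off-by-one that's easy to get wrong by hand.

```python
from datetime import date, timedelta


def daily_dates(start, end):
    """Return every calendar date from start to end, inclusive."""
    # +1 because we want the end date itself in the list
    n_days = (end - start).days + 1
    return [start + timedelta(days=offset) for offset in range(n_days)]


stdlib_dates = daily_dates(date(2020, 12, 21), date(2021, 7, 20))
print(len(stdlib_dates), stdlib_dates[0], stdlib_dates[-1])
```

This should produce the same 212 days as the `pandas` version, just with plain `date` objects instead of `Timestamp`s.
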
Now that we have our list of dates, we just need to add them to our URL schema.
Let's define variables for the parts of the URL that are static (for now, we may want to download other `report_time` values later):

As a reminder, the URLs look like this: 
> `https://ak-static.cms.nba.com/referee/injury/Injury-Report_2021-07-20_08PM.pdf`

In [29]:
# We don't need to define a variable for the extension
url_stem = 'https://ak-static.cms.nba.com/referee/injury/Injury-Report_'
report_time = '08PM'

And let's build an example URL, using the first date from the `season_dates` list we created above:

In [9]:
print(f'{url_stem}{season_dates[0]}_{report_time}.pdf')

https://ak-static.cms.nba.com/referee/injury/Injury-Report_2020-12-21 00:00:00_08PM.pdf


Almost, but not quite!
It looks like our `season_dates[0]` object is some sort of `datetime`-y type thing.
Let's find out exactly what:

In [10]:
print(f'{season_dates[0]} is of type {type(season_dates[0])}')

2020-12-21 00:00:00 is of type <class 'pandas._libs.tslibs.timestamps.Timestamp'>


So, it's not quite a standard Python `datetime` object ([docs](https://docs.python.org/3/library/datetime.html#datetime.datetime)).
But it sure seems like it might behave like one{% fn 60 %}.
That being the case, let's just go ahead and try to use the standard method we use to process Python dates into strings: `strftime()`.

> Tip: I could never remember which one was which between `strftime` and `strptime` until I learned that `strftime()` is "string *from* time" and `strptime()` is "string *parse* time" (That sound you hear is the "The More You Know" star whooshing by)

From our sample URL, it looks like we need things in a `YYYY-MM-DD` format.
Luckily, that's a relatively simple format string, but if you want to try it out and be sure, [this interactive site for Python datetime format strings](https://www.strfti.me) is super cool.
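
If the `strftime`/`strptime` distinction still feels slippery, a quick round trip may help. This is a small standalone sketch using a plain `datetime` (rather than our `pandas` `Timestamp`), with the `%Y-%m-%d` format string we're assuming the URLs need.

```python
from datetime import datetime

report_day = datetime(2020, 12, 21)

# strftime: "string FROM time" -> datetime in, string out
as_text = report_day.strftime('%Y-%m-%d')
print(as_text)

# strptime: "string PARSE time" -> string in, datetime out
round_trip = datetime.strptime(as_text, '%Y-%m-%d')
print(round_trip == report_day)
```
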

In [11]:
from datetime import datetime

In [12]:
print(f"{url_stem}{datetime.strftime(season_dates[0], '%Y-%m-%d')}_{report_time}.pdf")

https://ak-static.cms.nba.com/referee/injury/Injury-Report_2020-12-21_08PM.pdf


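> Note: Two small things changed in that last cell.  First, `from datetime import datetime` pulls the `datetime` class out of the `datetime` module, so we can call `datetime.strftime(...)` directly.  Second, the `f-string` now uses double quotes on the outside, because the format string inside it (`'%Y-%m-%d'`) uses single quotes, and before Python 3.12 reusing the same quote character inside an `f-string` is a syntax error.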

Success!

Now, let's build a list of similarly formatted URLs.

In [13]:
season_urls = list()
for day in season_dates:
    season_urls.append(f"{url_stem}{datetime.strftime(day, '%Y-%m-%d')}_{report_time}.pdf")

Or, as a list comprehension, if you prefer (which seems reasonable here, since we're *building* a list):

In [15]:
season_urls = [f"{url_stem}{datetime.strftime(day, '%Y-%m-%d')}_{report_time}.pdf" for day in season_dates]

Let's sanity check the first five URLs:

In [16]:
season_urls[:5]

['https://ak-static.cms.nba.com/referee/injury/Injury-Report_2020-12-21_08PM.pdf',
 'https://ak-static.cms.nba.com/referee/injury/Injury-Report_2020-12-22_08PM.pdf',
 'https://ak-static.cms.nba.com/referee/injury/Injury-Report_2020-12-23_08PM.pdf',
 'https://ak-static.cms.nba.com/referee/injury/Injury-Report_2020-12-24_08PM.pdf',
 'https://ak-static.cms.nba.com/referee/injury/Injury-Report_2020-12-25_08PM.pdf']

Looks like we're good to go!
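
If we later decide we want all three report times rather than just `08PM`, one option is to pair every date with every time using `itertools.product`. The sketch below is self-contained, so it uses plain `datetime.date` objects instead of our `season_dates`. `build_report_urls` is a hypothetical helper, and remember that `05PM` is still just our guess at the URL form of the 5:30 PM report.

```python
from datetime import date, timedelta
from itertools import product

URL_STEM = 'https://ak-static.cms.nba.com/referee/injury/Injury-Report_'
# Assumption: these are the URL forms of the 1:30, 5:30 and 8:30 PM reports
REPORT_TIMES = ['01PM', '05PM', '08PM']


def build_report_urls(days, report_times=REPORT_TIMES):
    """Build one URL per (day, report time) pair, grouped by day."""
    return [
        f"{URL_STEM}{day.strftime('%Y-%m-%d')}_{report_time}.pdf"
        for day, report_time in product(days, report_times)
    ]


first_two_days = [date(2020, 12, 21) + timedelta(days=n) for n in range(2)]
demo_urls = build_report_urls(first_two_days)
for url in demo_urls:
    print(url)
```
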

### Download the PDFs

For simplicity's sake, we'll just use the `requests` module to download the PDFs sequentially.
Later, we can see about using *asynchronous* requests to greatly speed up our downloads, but for now let's just do things one at a time.
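
For reference, when we do get around to the asynchronous version, the overall shape might look something like the sketch below. To keep it runnable without a network connection (or any extra libraries), `fake_download` is a hypothetical stand-in that just sleeps instead of making a real HTTP request. The `max_concurrent` cap is my own addition, on the theory that we should be polite to the NBA's servers.

```python
import asyncio


async def fake_download(url, delay=0.01):
    # Hypothetical stand-in for a real HTTP request: it only sleeps
    await asyncio.sleep(delay)
    return f'contents of {url}'


async def download_all(urls, max_concurrent=5):
    # Cap the number of in-flight "requests" at any one time
    semaphore = asyncio.Semaphore(max_concurrent)

    async def limited(url):
        async with semaphore:
            return await fake_download(url)

    # gather() returns results in the same order as the inputs
    return await asyncio.gather(*(limited(url) for url in urls))


example_urls = [f'https://example.com/report_{n}.pdf' for n in range(10)]
results = asyncio.run(download_all(example_urls))
print(len(results))
```

One caveat: inside a Jupyter notebook an event loop is already running, so you'd write `results = await download_all(example_urls)` instead of calling `asyncio.run`. For now, though, back to the sequential version.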

We could do this lazily, with something simple like:
```python
import requests
# Loop through our URL list
for url in season_urls:
    # Download the PDF
    resp = requests.get(url)
    # Write to disk; Create a filename from the URL
    filename = url.split('/')[-1]
    with open(filename, 'wb') as f:
        f.write(resp.content)
```

With the code above, what happens if there's no report for a specific day?
Or if there's a network error, either with the server or our own internet?

In general, it's a good idea to program a little defensively when accessing network assets, especially in cases like this, where there's a high likelihood of accessing an "invalid" URL (i.e. one with nothing on the other end to download).
It's easy to imagine the downstream PDF processing — that we haven't written yet 🙂 — breaking when it tries to open an empty file, for example.
We could try to handle the error at that point, but it's probably better to do it here, and only write valid files{% fn 70 %}.
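
As an extra belt-and-suspenders check before writing anything to disk, we could also peek at the first few bytes of what we downloaded. PDF files start with the bytes `%PDF-`, so something like the sketch below would catch an empty body or an HTML error page. `looks_like_pdf` is a hypothetical helper of my own, not part of `requests`, and it's only a cheap sanity check, not a real validation of the file.

```python
PDF_MAGIC = b'%PDF-'


def looks_like_pdf(content):
    """Cheap sanity check: does this bytes object start like a PDF file?"""
    return content.startswith(PDF_MAGIC)


print(looks_like_pdf(b'%PDF-1.7 rest of the file...'))
print(looks_like_pdf(b''))
print(looks_like_pdf(b'<html>Forbidden</html>'))
```

With `requests`, the bytes to check would be `resp.content`.
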

We'll assume that accessing a URL with a valid injury report on the other end will return [HTTP status code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) **200** ("Success!"), and one without a PDF will return something else{% fn 80 %}.
To handle this, we'll use the `requests` library's `raise_for_status()` method, which can raise an exception when we get a non-success status code.
We can then catch that exception, and skip writing the file.

Let's add that to the code above:

> Tip: You'll see some code that explicitly *tests* for a specific `status code` using an `if` statement, but this is almost always a job for a `try/except` block.  One of the core "Pythonic" tenets is "Ask for forgiveness, not permission".  

```python
import requests

# Loop through a copy of our URL list, so we can safely remove entries as we go
for url in list(season_urls):
    resp = requests.get(url)
    try:
        resp.raise_for_status()
    # This is the exception raised for error (4xx/5xx) status codes; this won't necessarily catch network errors
    except requests.HTTPError:
        print(f'No valid PDF at {url}, removing from list')
        season_urls.remove(url)
    # The "else" block of a try/except happens if there isn't an exception
    else:
        filename = url.split('/')[-1]
        # Write to disk
        print(f'Writing PDF for {url}')
        with open(filename, 'wb') as f:
            f.write(resp.content)
```

> Important: Note that the `.raise_for_status()` call is the one that raises the exception, **not** the `.get()` call!  Make sure `raise_for_status()` is inside your `try/except` block! 

One thing we might note is that if we want to download the other injury report times, we'll need to create a new list of URLs.
That being the case, perhaps it might make sense to construct the URL right before we use it, instead of all at once beforehand.
And while we're at it, perhaps we should make this a function, so we don't have to download all the files in a loop if we don't want to.

The function should take two parameters, the `date` to download, and the `time` of the report (unsurprisingly, these are the two "dynamic" parts of our URL schema.)
We can use the chunks of code from building our URL list above to construct the URL right before we use it.

In [22]:
import requests
def download_injury_report(report_date, report_time='08PM'):
    # So we can alert an interactive user if something goes wrong
    from warnings import warn
    url_stem = 'https://ak-static.cms.nba.com/referee/injury/Injury-Report_'
    
    # Build the URL
    url = f"{url_stem}{datetime.strftime(report_date, '%Y-%m-%d')}_{report_time}.pdf"
    # Fetch the PDF
    resp = requests.get(url)
    try:
        resp.raise_for_status()
    # This is the exception to catch for non-200 response codes; this won't necessarily catch network errors
    except requests.HTTPError:
        # Just because the caller will have to make sure to handle the "None" case doesn't mean we can't alert them
        # stacklevel=2 cribbed from: https://stackoverflow.com/questions/60083173/warnings-module-prints-part-of-warning-twice
        warn(f'Could not download URL {url}', stacklevel=2)
        return None
    # The "else" block of a try/except happens if there isn't an exception
    else:
        # Return the content to the caller; they can write it to disk or process it
        return resp

> Note: You may be surprised that I removed the part that writes the PDF to disk.  I'll spare you the bloviating, but in general I try to adopt a somewhat "[functional](https://betterprogramming.pub/what-is-a-pure-function-3b4af9352f6f)" programming style whenever possible.  Basically, this means I try not to have lots of things happening inside my functions that someone using them might not be aware of (in this context, these things are called "side effects").  A function called `download_injury_report` should basically just do that (and nothing else).  If we want to save the report we can do that too, but we should be explicit about where and how that is happening.  By returning the response object, we can let the caller — us, but us outside this function! — decide what to do with it.
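
One way to keep that side effect explicit is to give it its own small function. The sketch below is just an illustration of the idea (`filename_from_url` and `save_report` are hypothetical names, and the rest of this post simply writes the file inline): one pure helper that works out the file name, and one function whose only job is writing bytes to disk.

```python
import tempfile
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def filename_from_url(url):
    """Pure helper: the file name is the last segment of the URL's path."""
    return PurePosixPath(urlparse(url).path).name


def save_report(content, url, directory):
    """The one place with a side effect: write the bytes to directory/<file name>."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / filename_from_url(url)
    target.write_bytes(content)
    return target


report_url = 'https://ak-static.cms.nba.com/referee/injury/Injury-Report_2020-12-21_08PM.pdf'
print(filename_from_url(report_url))

# Write some placeholder bytes to a throwaway directory, just to show the call
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_report(b'%PDF- placeholder bytes', report_url, tmp_dir)
    print(saved_path.name, saved_path.exists())
```
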

Let's give our function a shot!
We'll use one of the dates we generated previously (note that we don't have to provide a time, since we put a default value for the function above).

In [23]:
season_dates[-3]

Timestamp('2021-07-18 00:00:00', freq='D')

In [24]:
resp = download_injury_report(season_dates[-3])

  resp = download_injury_report(season_dates[-3])


Okay, I cheated a bit, and chose a date with no game.
But it looks like our warning is working correctly!
Let's download a valid PDF and write it to disk:

In [25]:
resp = download_injury_report(season_dates[0])

# Never can be too careful!
if resp:
    # Rather than convert season_dates[0] to a string, let's just use the .url attribute
    # from the requests.response object, and process the filename like we did above
    filename = resp.url.split('/')[-1]
    with open(filename, 'wb') as f:
        f.write(resp.content)
        print(f'Wrote {filename} successfully')

Wrote Injury-Report_2020-12-21_08PM.pdf successfully


If you've cloned this notebook and are running it, you should have a copy of `Injury-Report_2020-12-21_08PM.pdf` in the `_notebooks` directory of the repo.

Now let's download the season's worth, by putting something like the above back in a loop.
We'll make two small modifications though, using two of the best and most indispensable libraries in the modern Python ecosystem.
First, we'll define a `data` directory to write our files (using the sublime `pathlib`), so we don't clutter our `_notebooks` directory, and secondly, we'll use the *amazing* [tqdm](https://github.com/tqdm/tqdm) to monitor our progress, instead of printing a status message{% fn 90 %}.

> Tip: If you don't have `tqdm` installed (for shame!), `conda install -c conda-forge tqdm` (or `pip install tqdm` if you *must*) will do the trick.  You *are* using a Conda environment for this, right?!?

In [95]:
from tqdm.notebook import tqdm
from pathlib import Path

# Check out https://stackoverflow.com/questions/50110800/python-pathlib-make-directories-if-they-don-t-exist
# for more on the .mkdir() method
DATA_DIR = Path.cwd() / 'injury_reports'
DATA_DIR.mkdir(parents=True, exist_ok=True)

for day in tqdm(season_dates):
    resp = download_injury_report(day)

    # Never can be too careful!
    if resp:
        # Rather than convert `day` to a string, let's just use the .url attribute
        # from the requests.Response object, and process the filename like we did above
        filename = resp.url.split('/')[-1]
        with open(DATA_DIR / filename, 'wb') as f:
            f.write(resp.content)

  0%|          | 0/212 [00:00<?, ?it/s]

  resp = download_injury_report(day)
  resp = download_injury_report(day)
  resp = download_injury_report(day)
  resp = download_injury_report(day)
  resp = download_injury_report(day)
  resp = download_injury_report(day)
  resp = download_injury_report(day)


## Task 2: Extracting Tables from PDF Files

Now that we have our 205 injury report PDFs (culled from 212 days!), it's time to extract the data in the tables.
While I've briefly played around with optical character recognition (OCR) tools like [Tesseract](https://github.com/tesseract-ocr/tesseract), strict OCR seemed like a bad fit.
Other than that, I wasn't sure what tool to use, which means...google to the rescue!

At first, I [foolishly] googled for "*python parse PDF*", and eventually "*python parse PDF text*", thinking it'd be obvious I meant "parse PDF **tables**".
Apparently clairvoyance isn't quite built into google [yet], so I ended up checking out packages such as [PyPDF2](https://pypi.org/project/PyPDF2/), [pdfreader](https://pypi.org/project/pdfreader/), and eventually [pdfminer.six](https://pypi.org/project/pdfminer.six/).
These tools seem much more oriented toward programmatically creating and manipulating PDF documents than extracting text.
Eventually, down the list of google results, I came upon [this stackoverflow post](https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file).
One of the answers mentioned [tabula](https://tabula.technology) and converting PDF tables to dataframes.
This seemed promising, so let's give it a try.


In [97]:
import tabula
import pandas as pd

In [98]:
pd.set_option('display.max_rows', 500)

In [126]:
injury_reports = sorted(list(DATA_DIR.glob('*.pdf')))

In [127]:
injury_reports[0]

PosixPath('/Users/abramson/git/jeremyabramson/blog/_notebooks/injury_reports/Injury-Report_2020-12-21_08PM.pdf')

In [129]:
tabula.read_pdf(DATA_DIR / injury_reports[0], pages='all')

Unnamed: 0,Game Date,Game Time,Matchup,Team,Player Name,Current Status,Reason
0,12/22/2020,07:00 (ET),GSW@BKN,Brooklyn Nets,"Claxton, Nicolas",Out,Injury/Illness - Right Knee; Tendinopathy
1,,,Golden State Warriors,"Green, Draymond",Out,Injury/Illness - Right foot; soreness,
2,,,,"Smailagic, Alen",Out,Injury/Illness - Right Knee; Soreness,
3,10:00 (ET),LAC@LAL,LA Clippers,"Morris Sr., Marcus",Out,Injury/Illness - Right Knee; Soreness,


In [128]:
tabula.read_pdf(DATA_DIR / injury_reports[-1], pages='all')

Unnamed: 0,Game Date,Game Time,Matchup,Team,Player Name,Current Status,Reason
0,07/20/2021,09:00 (ET),PHX@MIL,Milwaukee Bucks,"Antetokounmpo, Thanasis",Out,Health and Safety Protocols
1,,"DiVincenzo, Donte",Out,Injury/Illness - Left Ankle; Surgery,,,
2,Phoenix Suns,"Saric, Dario",Out,Injury/Illness - Right Acl; Tear,,,


In [None]:
tabula.read_pdf(DATA_DIR / injury_reports[100], pages='all')

In [132]:
df = tabula.read_pdf(injury_reports[40], stream=True, pages='all')

In [136]:
df = df.drop(df.loc[df['Game Date'] == 'Game Date'].index).reset_index(drop=True)
# Forward-fill the cells left empty (.ffill() replaces the deprecated fillna(method='pad'))
df['Game Date'] = df['Game Date'].ffill()
df['Matchup'] = df['Matchup'].ffill()
df['Team'] = df['Team'].ffill()
df['Game Time'] = df['Game Time'].ffill()
#df[['Last Name', 'First Name']] = df['Player Name'].str.split(',', expand=True)
#df = df.drop(columns=['Player Name'])

In [None]:
df

In [101]:
def clean_injury_report(df):
    # Drop any rows that are just the header repeated inside the data
    df = df.drop(df.loc[df['Game Date'] == 'Game Date'].index).reset_index(drop=True)
    # Forward-fill columns that are only populated on the first row of each group
    # (.ffill() replaces the deprecated fillna(method='pad'))
    fill_columns = ['Game Date', 'Game Time', 'Matchup', 'Team']
    df[fill_columns] = df[fill_columns].ffill()
    return df
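
To see what that cleaning function is doing, here's a toy example with completely made-up rows (the teams and players are placeholders, not real injury data). It mimics the two quirks in the extracted tables: a header row that shows up again in the middle of the data, and "grouping" columns that are only filled in on the first row of each group. `.ffill()` is the same forward-fill operation as `fillna(method='pad')`.

```python
import pandas as pd

# Made-up rows that mimic the shape of the extracted tables
toy = pd.DataFrame({
    'Game Date': ['01/01/2021', None, 'Game Date', None],
    'Team': ['Team A', None, 'Team', 'Team B'],
    'Player Name': ['Player A', 'Player B', 'Player Name', 'Player C'],
})

# Step 1: drop rows that are just the header repeated
is_header_row = toy['Game Date'] == 'Game Date'
cleaned = toy[~is_header_row].reset_index(drop=True).copy()

# Step 2: forward-fill the grouping columns from the last non-empty value above
fill_columns = ['Game Date', 'Team']
cleaned[fill_columns] = cleaned[fill_columns].ffill()

print(cleaned)
```

After these two steps, every row should carry its own date and team, which is what makes the table usable as a flat dataset.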

In [180]:
from pathlib import Path
from tqdm.notebook import tqdm

In [100]:
pdfs = sorted(list(Path('./injury_reports').glob('*.pdf')))

In [207]:
tabula.read_pdf('2021-05-20_08PM.pdf', stream=True, guess=False, pages='all')

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Injury Report: 05/20/21 08:30 PM,Unnamed: 5
0,Game Date,Game Time,Matchup,Team,Player Name Current Status,Reason
1,05/20/2021,08:00 (ET),IND@WAS,Indiana Pacers,"Brogdon, Malcolm Available",Injury/Illness - Right Hamstring; Sore
2,,,,,"Lamb, Jeremy Out",Injury/Illness - Left Knee; Sore
3,,,,,"LeVert, Caris Out",Health and Safety Protocols
4,,,,,"Sumner, Edmond Available",Injury/Illness - Left Knee; Contusion
5,,,,,"Turner, Myles Out",Injury/Illness - Right Toe; Partial Plantar Pl...
6,,,,,"Warren, T.J. Out",Injury/Illness - Left Foot; Stress Fracture
7,,,,Washington Wizards,"Avdija, Deni Out",Injury/Illness - Right Ankle; Right ankle frac...
8,,,,,"Bryant, Thomas Out",Injury/Illness - Left Knee; Left ACL injury
9,05/21/2021,09:00 (ET),MEM@GSW,Golden State Warriors,,NOT YET SUBMITTED


In [216]:
tabula.read_pdf('2021-05-20_08PM.pdf', stream=True, area=[52.099,16.84,562.561,834.632], pages='all')

Unnamed: 0,Game Date,Game Time,Matchup,Team,Player Name,Current Status,Reason
0,05/20/2021,08:00 (ET),IND@WAS,Indiana Pacers,"Brogdon, Malcolm",Available,Injury/Illness - Right Hamstring; Sore
1,,,,,"Lamb, Jeremy",Out,Injury/Illness - Left Knee; Sore
2,,,,,"LeVert, Caris",Out,Health and Safety Protocols
3,,,,,"Sumner, Edmond",Available,Injury/Illness - Left Knee; Contusion
4,,,,,"Turner, Myles",Out,Injury/Illness - Right Toe; Partial Plantar Pl...
5,,,,,"Warren, T.J.",Out,Injury/Illness - Left Foot; Stress Fracture
6,,,,Washington Wizards,"Avdija, Deni",Out,Injury/Illness - Right Ankle; Right ankle frac...
7,,,,,"Bryant, Thomas",Out,Injury/Illness - Left Knee; Left ACL injury
8,05/21/2021,09:00 (ET),MEM@GSW,Golden State Warriors,,,NOT YET SUBMITTED
9,,,,Memphis Grizzlies,"McDermott, Sean",Out,Injury/Illness - Left Foot; Soreness


In [None]:
tabula.read_pdf('2021-05-16_08PM.pdf', stream=True, pages='all')

In [None]:
dfs = list()
for pdf in tqdm(pdfs):
    print(f'processing {pdf}')
    df = tabula.read_pdf(pdf, stream=True, area=[52.099,16.84,562.561,834.632], pages='all')
    df = clean_injury_report(df)
    dfs.append(df)

In [103]:
df = pd.concat(dfs)

In [104]:
df

Unnamed: 0,Game Date,Game Time,Matchup,Team,Player Name,Current Status,Reason
0,12/22/2020,07:00 (ET),GSW@BKN,Brooklyn Nets,"Claxton, Nicolas",Out,Injury/Illness - Right Knee; Tendinopathy
1,12/22/2020,07:00 (ET),GSW@BKN,Golden State Warriors,"Green, Draymond",Out,Injury/Illness - Right foot; soreness
2,12/22/2020,07:00 (ET),GSW@BKN,Golden State Warriors,"Smailagic, Alen",Out,Injury/Illness - Right Knee; Soreness
3,12/22/2020,10:00 (ET),LAC@LAL,LA Clippers,"Morris Sr., Marcus",Out,Injury/Illness - Right Knee; Soreness
0,12/22/2020,07:00 (ET),GSW@BKN,Brooklyn Nets,"Claxton, Nicolas",Out,Injury/Illness - Right Knee; Tendinopathy
...,...,...,...,...,...,...,...
1,07/20/2021,09:00 (ET),PHX@MIL,Milwaukee Bucks,"DiVincenzo, Donte",Out,Injury/Illness - Left Ankle; Surgery
2,07/20/2021,09:00 (ET),PHX@MIL,Phoenix Suns,"Saric, Dario",Out,Injury/Illness - Right Acl; Tear
0,07/20/2021,09:00 (ET),PHX@MIL,Milwaukee Bucks,"Antetokounmpo, Thanasis",Out,Health and Safety Protocols
1,07/20/2021,09:00 (ET),PHX@MIL,Milwaukee Bucks,"DiVincenzo, Donte",Out,Injury/Illness - Left Ankle; Surgery


In [105]:
df['Game Date'].nunique()

192

In [106]:
len(season_dates)

212

In [108]:
df['Team'].unique()

array(['Brooklyn Nets', 'Golden State Warriors', 'LA Clippers',
       'Charlotte Hornets', 'Cleveland Cavaliers', 'Miami Heat',
       'Orlando Magic', 'Indiana Pacers', 'New York Knicks',
       'Washington Wizards', 'Boston Celtics', 'Milwaukee Bucks',
       'New Orleans Pelicans', 'Toronto Raptors', 'Atlanta Hawks',
       'Chicago Bulls', 'Detroit Pistons', 'Minnesota Timberwolves',
       'Houston Rockets', 'Oklahoma City Thunder', 'Memphis Grizzlies',
       'San Antonio Spurs', 'Denver Nuggets', 'Sacramento Kings',
       'Portland Trail Blazers', 'Utah Jazz', 'Dallas Mavericks',
       'Phoenix Suns', 'Los Angeles Lakers', 'Philadelphia 76ers',
       'Non-NBA Team'], dtype=object)

In [109]:
df[df['Team'] == 'Non-NBA Team']

Unnamed: 0,Game Date,Game Time,Matchup,Team,Player Name,Current Status,Reason
0,03/07/2021,08:00 (ET),UNK@UNK,Non-NBA Team,,,NOT YET SUBMITTED
1,03/07/2021,08:00 (ET),UNK@UNK,Non-NBA Team,,,NOT YET SUBMITTED
0,03/07/2021,08:00 (ET),UNK@UNK,Non-NBA Team,,,NOT YET SUBMITTED
1,03/07/2021,08:00 (ET),UNK@UNK,Non-NBA Team,,,NOT YET SUBMITTED


{{ "Famous last [words](google.com)" | fndetail: 1 }}

{{ "If you happen to find a central place for these, let me [know](https://www.google.com)!" | fndetail: 2 }}

{{ "I'm going to try my best to show the provenance of as much insight as I can here.  I had no idea when the injury reports came out, so I googled it.  Don't be ashamed to do likewise if there's something you need to know!" | fndetail: 30 }}

{{ "'Decoding the NBA schedule' would certainly be something we could have added to our outline.  This just shows that something as seemingly simple as 'download the PDFs' requires a number of steps that might not be readily apparent!" | fndetail: 40 }}

{{ '...and if we happen to run into Adam Silver at a Christmas party, be sure to apologize for the extra traffic on the NBA.com servers' | fndetail: 50 }}

{{ "The notion that 'something *is* (e.g.) a list because it behaves like a list' is basically the meaning behind [Duck Typing](https://realpython.com/lessons/duck-typing/)" | fndetail: 60 }}

{{ "If you're thinking 'You should do both', you get a gold star!" | fndetail: 70 }}

{{ "From manual inspection of the HTTP headers, it looks like nba.com returns a 403 ('Forbidden') when there isn't a valid PDF" | fndetail: 80 }}

{{ "Or, we could explore the wonderful world of [logging](https://www.toptal.com/python/in-depth-python-logging)!" | fndetail: 90 }}
