# Scraping Player Injury Data
I scraped all player injury data from TSN.ca's player profiles. To find the URL of each profile, I first scraped the main TSN player page for links to every available profile. I then iterated through the profiles, scraping the section of each one that contains injury, transaction, and suspension data. Finally, I converted the list of profile data into a DataFrame, removed non-injury events, and parsed each report for the length and type of injury. I used Selenium for scraping because both the TSN player menu and the profile pages use JavaScript to populate their table data.

### Table of Contents
* [1. Imports and Functions](#sec1)
* [2. Create a Shared Webdriver](#sec2)
* [3. Scrape Player Profile Links](#sec3)
* [4. Scrape Each Player Profile](#sec4)
* [5. Convert List of Profiles to DataFrames](#sec5)
* [6. Cleanup](#sec6)

<a id='sec1'></a>
### 1. Imports and Functions
Most of my code is stored in nhl_injuries_code.py so that it can easily be reused across notebooks. This notebook uses the following functions from that file:
* **read_profile_links**: Scrapes all player profile links from TSN.ca's main player page
* **read_player_profile**: Scrapes a player's injury, transaction, and suspension history from TSN.ca
* **profiles_to_dfs**: Converts a list of player profiles into two DataFrames, one of player names and one containing just injury data
* **var_to_pickle**: Writes the given variable to a pickle file
* **read_pickle**: Reads the given pickle file
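
The two pickle helpers are not shown in this notebook. The sketch below is one plausible way to write them, not the actual contents of nhl_injuries_code.py. It assumes that **read_pickle** returns `None` when the file does not exist, which is what the `None` checks in sections 3 and 4 rely on. Creating the parent directory before writing is an added convenience, not something the source confirms.

```python
import os
import pickle


def var_to_pickle(var, filename):
    """Write `var` to `filename`, creating the parent directory if needed."""
    directory = os.path.dirname(filename)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(filename, "wb") as f:
        pickle.dump(var, f)


def read_pickle(filename):
    """Return the unpickled object, or None if the file does not exist."""
    if not os.path.exists(filename):
        return None
    with open(filename, "rb") as f:
        return pickle.load(f)
```

Returning `None` for a missing file lets the caller decide whether to scrape from scratch, rather than handling a `FileNotFoundError` at every call site.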

In [1]:
import os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

from nhl_injuries_code import (read_profile_links,
                               read_player_profile,
                               profiles_to_dfs,
                               var_to_pickle,
                               read_pickle)

<a id='sec2'></a>
### 2. Create a Shared Webdriver
Because the scraping functions simply run in sequence, they all share a single webdriver, which avoids launching a new browser for each one.

In [2]:
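# Path to the local chromedriver executable
chromedriver = "/Applications/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
# Passing the path positionally works in Selenium 3; newer Selenium 4
# releases expect webdriver.Chrome(service=Service(chromedriver)) instead,
# with Service imported from selenium.webdriver.chrome.service
driver = webdriver.Chrome(chromedriver)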

<a id='sec3'></a>
### 3. Scrape Player Profile Links
Scraping all links takes around 30 seconds. If a pickle of the links already exists, the list is loaded from it instead of being scraped again.

In [3]:
links_pickle = 'pickle_data/player_links.pk'
links = read_pickle(links_pickle)
if links is None:
    links = read_profile_links(driver)
    var_to_pickle(links, links_pickle)

<a id='sec4'></a>
### 4. Scrape Each Player Profile
**Warning**: Scraping all profiles takes around 3 hours. If a pickle exists, the code below uses it as a starting point and scrapes only the remaining profiles, or simply uses the pickle if it already contains every profile. When scraping is required, the pickle is updated automatically after every **save_step** profiles and once more when scraping finishes.

In [4]:
profiles_pickle = 'pickle_data/player_profiles.pk'
bio_url = 'https://www.tsn.ca%s/bio'
save_step = 50

# Load previously scraped profiles from the pickle if it exists;
# otherwise start with an empty list
profiles = read_pickle(profiles_pickle)
if profiles is None:
    profiles = []

# profiles[i] corresponds to links[i], so len(profiles) is the index
# of the first profile that still needs to be scraped
start = len(profiles)
if start < len(links):
    for cnt, link in enumerate(links[start:]):
        profiles.append(read_player_profile(driver, bio_url % link))
        # Checkpoint every save_step profiles
        if (cnt + 1) % save_step == 0:
            var_to_pickle(profiles, profiles_pickle)
    # Final save covers any profiles scraped since the last checkpoint
    var_to_pickle(profiles, profiles_pickle)
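
The resume logic above can be isolated into a small function, which makes it easier to see why an interrupted run loses at most `save_step - 1` profiles. This is only a sketch: `scrape_with_checkpoints` and its `scrape_one` parameter are hypothetical names standing in for the loop and for **read_player_profile**, and the pickle handling is inlined so that the block runs without nhl_injuries_code.py.

```python
import os
import pickle


def scrape_with_checkpoints(links, scrape_one, pickle_path, save_step=50):
    """Scrape links in order, resuming from any results already pickled."""
    results = []
    if os.path.exists(pickle_path):
        with open(pickle_path, "rb") as f:
            results = pickle.load(f)

    def save():
        with open(pickle_path, "wb") as f:
            pickle.dump(results, f)

    # results[i] corresponds to links[i], so len(results) is the resume index.
    for cnt, link in enumerate(links[len(results):]):
        results.append(scrape_one(link))
        if (cnt + 1) % save_step == 0:
            save()

    # Final save covers anything scraped since the last checkpoint.
    save()
    return results
```

If `scrape_one` raises partway through, the pickle still holds everything up to the last checkpoint, and calling the function again with the same arguments continues from there. The approach relies on the order of `links` staying the same between runs.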

<a id='sec5'></a>
### 5. Convert List of Profiles to DataFrames
* **names_df**: player name data with columns [Name, Birth_Date]
* **injuries_df**: player injury data with columns [Name, Birth_Date, Injury_Date, Report, Games_Missed, Cause]

In [5]:
names_df, injuries_df = profiles_to_dfs(profiles)
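
The filtering and parsing inside **profiles_to_dfs** is not shown here, so the following is only a rough illustration of the idea using plain Python. The event records, their `date` and `report` keys, and the report wording "Missed N games (cause)" are all assumptions made up for this example; the real profile structure and TSN's report text may differ.

```python
import re

# Hypothetical event records; the real structure returned by
# read_player_profile is not shown in this notebook.
events = [
    {"date": "2016-01-12", "report": "Missed 4 games (lower body injury)."},
    {"date": "2016-02-01", "report": "Traded to another team."},
    {"date": "2016-03-05", "report": "Missed 1 game (flu)."},
]

# Assumed report wording: "Missed <N> game(s) (<cause>)"
INJURY_PATTERN = re.compile(r"Missed (\d+) games? \(([^)]+)\)", re.IGNORECASE)


def parse_injury(event):
    """Return an injury row, or None if the event is not an injury report."""
    match = INJURY_PATTERN.search(event["report"])
    if match is None:
        return None
    return {
        "Injury_Date": event["date"],
        "Report": event["report"],
        "Games_Missed": int(match.group(1)),
        "Cause": match.group(2),
    }


# Keep only the events that parsed as injuries.
injuries = [row for row in map(parse_injury, events) if row is not None]
```

A list of dictionaries like `injuries` can be passed directly to the pandas `DataFrame` constructor to get columns matching those listed for **injuries_df**.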

<a id='sec6'></a>
### 6. Cleanup
Quits the shared webdriver and saves pickles of the DataFrames.

In [6]:
driver.quit()
names_pickle = '../data/names_df.pk'
injuries_pickle = '../data/injuries_df.pk'
var_to_pickle(names_df, names_pickle)
var_to_pickle(injuries_df, injuries_pickle)