<h1>Gamelog Grabber</h1>
<p>The following code scrapes the game logs for the 4000+ players in the initial directory and saves each player's log to an individual CSV file. Only seasons from the last 30 years (1990-1991 onwards) are considered.</p>

In [5]:
import pandas as pd
import numpy as np
import requests as r
import re
from bs4 import BeautifulSoup as bs

pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 1000)

In [6]:
#import player list from the tab-delimited file and tidy
players = pd.read_csv('nhlplayerlist_goalieonly.txt',sep='\t') #sep must be passed by keyword in recent pandas
players = players.drop(columns=['Unnamed: 0','index'])
players.head(5)

,player,unique_id,year_start,year_finish,position,link
0,David Aebischer,aebisda01,2001,2008,G,https://www.hockey-reference.com/players/a/aeb...
1,Sami Aittokallio,aittosa01,2013,2014,G,https://www.hockey-reference.com/players/a/ait...
2,Jake Allen,allenja01,2013,2020,G,https://www.hockey-reference.com/players/a/all...
3,Jorge Alves,alvesjo01,2017,2017,G,https://www.hockey-reference.com/players/a/alv...
4,Frederik Andersen,anderfr01,2014,2020,G,https://www.hockey-reference.com/players/a/and...


<h3>Functions to aid in Data Scraping</h3>
<p>The following functions are used in the data collection:<ul><li><p><strong>loglink</strong> - Gets the URL of the game log for each year the player played and returns them as a dictionary.</p></li><li><p><strong>gamelogger</strong> - Grabs the game logs for each of the links entered and returns them as a DataFrame.</p></li></ul>
</p>
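
<p>As a rough illustration of the year extraction that loglink performs, the sketch below runs the same kind of regular expression over a handful of hrefs without touching the network. The hrefs, the player id <code>examppl01</code>, and the helper name <code>links_by_year</code> are made up for the example; the real pages may use a different link shape.</p>

```python
import re

# Hypothetical hrefs in the shape the scraper looks for (not taken from the site).
hrefs = [
    "/players/e/examppl01/gamelog/1989",
    "/players/e/examppl01/gamelog/1991",
    "/players/e/examppl01/gamelog/1992",
    "/players/e/examppl01/gamelog/1991",  # duplicate link
    "/players/e/examppl01/gamelog/",      # no year after the slash
    "/players/e/examppl01.html",          # not a game log link
]

def links_by_year(hrefs):
    """Map each season year found after 'gamelog/' to its href."""
    out = {}
    for href in hrefs:
        match = re.search(r"gamelog/(\d+)", href)
        if match:  # links without a year are skipped
            out[int(match.group(1))] = href
    return out

by_year = links_by_year(hrefs)

# Keep only seasons after 1990, mirroring the `key > 1990` check in gamelogger.
recent = {year: href for year, href in by_year.items() if year > 1990}
print(sorted(recent))
```

<p>Because the dictionary is keyed by year, duplicate links collapse to a single entry on their own.</p>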


In [7]:
def loglink(pwebsite):
    """
    Gets the urls for each year the player played and returns it as a dictionary
    input: (string) url for the player in question
    output: (dict) year/link pairs for each year the player competed
    """
    page = r.get(pwebsite)
    soup = bs(page.text,'html.parser')
    links = soup.find_all('a',href=True)

    #find the game log links
    gamelogs = [] #the list of the game log links
    for l in links:
        if 'gamelog' in l['href']:
            gamelogs.append(l['href'])
    gamelogs = list(dict.fromkeys(gamelogs)) #remove duplicate entries

    #map each season year to its game log link
    gameloglinks = {} #a dictionary with the year:gameloglink
    for gl in gamelogs:
        reg = re.search(r'gamelog/(\d+)',gl)
        if reg: #skip links that have no year after 'gamelog/'
            gameloglinks[int(reg[1])] = gl
    return gameloglinks

def gamelogger(loglinks):
    """
    Grabs the game logs for each of the links entered and returns them as a DataFrame
    Input:  (dict) Year/Gamelog pairs
    Output: (dataframe) the entire post-1990 gamelog data (excluding playoffs) for that player
    """
    rows = [] #one dict per game, keyed by each cell's data-stat attribute
    #for every year after 1990 (1991 is the 1990-1991 season), grab the gamelogs
    for key,value in loglinks.items():
        if key>1990:
            page = r.get('https://hockey-reference.com'+value)
            soup = bs(page.text,'html.parser')

            for eachitem in soup.find_all('tr',id=True): #for each row in table
                rowdata = {}
                for eachc in eachitem.find_all('td'): #for each column in row
                    rowdata[eachc['data-stat']] = eachc.text
                rows.append(rowdata)
    #DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
    return pd.DataFrame(rows)
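
<p>To make the row parsing in gamelogger easier to follow, here is a small offline sketch that applies the same two selectors (rows that carry an <code>id</code>, cells that carry a <code>data-stat</code> attribute) to a hand-written table. The HTML, the stat names <code>date_game</code> and <code>saves</code>, and the helper name <code>parse_rows</code> are assumptions for illustration only; the real pages have many more columns.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified table (not copied from the site).
SAMPLE_HTML = """
<table>
  <tr class="thead"><th>Rk</th><th>Date</th><th>Saves</th></tr>
  <tr id="gamelog.1"><th data-stat="ranker">1</th><td data-stat="date_game">2019-10-02</td><td data-stat="saves">30</td></tr>
  <tr id="gamelog.2"><th data-stat="ranker">2</th><td data-stat="date_game">2019-10-05</td><td data-stat="saves">27</td></tr>
</table>
"""

def parse_rows(html):
    """Return one dict per data row, keyed by each cell's data-stat attribute."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # id=True keeps only rows that have an id, which skips the header row here.
    for tr in soup.find_all("tr", id=True):
        rows.append({td["data-stat"]: td.text for td in tr.find_all("td")})
    return rows

rows = parse_rows(SAMPLE_HTML)
print(rows)
```

<p>Only <code>td</code> cells are collected, so a value held in a <code>th</code> cell (the row rank in this sample) does not appear in the result.</p>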

<h3>Get the data</h3>
<p>The data is aggregated and saved with one file per player, named after the player's unique_id. Note that to_csv is called with its default separator, so the .txt files are comma-delimited rather than tab-delimited.</p>

In [8]:
failed_players = [] #capture failed players
for i,player in players.iterrows():
    try:
        gamelogdict = loglink(player['link'])
    except Exception:
        failed_players.append(player['player'])
        print("Failed to get gamelog links for %s" % (player['player']))
        continue #no links, so there is nothing to scrape for this player
    try:
        log = gamelogger(gamelogdict)
    except Exception:
        failed_players.append(player['player'])
        print("Error getting the gamelog for %s" % (player['player']))
        continue #skip the write so a previous player's log is not saved under this id
    #write gamelog to csv for further analysis
    log.to_csv('player_gamelogs\\'+player['unique_id']+'.txt')
    print('finished %d of %d or %.2f%%'%(i,players.shape[0],100*i/players.shape[0]))

finished 0 of 481 or 0.00%
finished 1 of 481 or 0.21%
finished 2 of 481 or 0.42%
finished 3 of 481 or 0.62%
finished 4 of 481 or 0.83%
finished 5 of 481 or 1.04%
finished 6 of 481 or 1.25%
finished 7 of 481 or 1.46%
finished 8 of 481 or 1.66%
finished 9 of 481 or 1.87%
finished 10 of 481 or 2.08%
finished 11 of 481 or 2.29%
finished 12 of 481 or 2.49%
finished 13 of 481 or 2.70%
finished 14 of 481 or 2.91%
finished 15 of 481 or 3.12%
finished 16 of 481 or 3.33%
finished 17 of 481 or 3.53%
finished 18 of 481 or 3.74%
finished 19 of 481 or 3.95%
finished 20 of 481 or 4.16%
finished 21 of 481 or 4.37%
finished 22 of 481 or 4.57%
finished 23 of 481 or 4.78%
finished 24 of 481 or 4.99%
finished 25 of 481 or 5.20%
finished 26 of 481 or 5.41%
finished 27 of 481 or 5.61%
finished 28 of 481 or 5.82%
finished 29 of 481 or 6.03%
finished 30 of 481 or 6.24%
finished 31 of 481 or 6.44%
finished 32 of 481 or 6.65%
finished 33 of 481 or 6.86%
finished 34 of 481 or 7.07%
finished 35 of 481 or 7.28%
...

finished 279 of 481 or 58.00%
finished 280 of 481 or 58.21%
finished 281 of 481 or 58.42%
finished 282 of 481 or 58.63%
finished 283 of 481 or 58.84%
finished 284 of 481 or 59.04%
finished 285 of 481 or 59.25%
finished 286 of 481 or 59.46%
finished 287 of 481 or 59.67%
finished 288 of 481 or 59.88%
finished 289 of 481 or 60.08%
finished 290 of 481 or 60.29%
finished 291 of 481 or 60.50%
finished 292 of 481 or 60.71%
finished 293 of 481 or 60.91%
finished 294 of 481 or 61.12%
finished 295 of 481 or 61.33%
finished 296 of 481 or 61.54%
finished 297 of 481 or 61.75%
finished 298 of 481 or 61.95%
finished 299 of 481 or 62.16%
finished 300 of 481 or 62.37%
finished 301 of 481 or 62.58%
finished 302 of 481 or 62.79%
finished 303 of 481 or 62.99%
finished 304 of 481 or 63.20%
finished 305 of 481 or 63.41%
finished 306 of 481 or 63.62%
finished 307 of 481 or 63.83%
finished 308 of 481 or 64.03%
finished 309 of 481 or 64.24%
finished 310 of 481 or 64.45%
finished 311 of 481 or 64.66%
...

In [None]:
#fix dead link for 3 failed players 
#failed_players=['Todd Reirden','Erik Reitz','Mike Vernon']
for p in failed_players:
    gamelogd = loglink(players.loc[players['player']==p,'link'].values[0])
    log = gamelogger(gamelogd)
    uid = players.loc[players['player']==p,'unique_id'].values[0]
    print(uid)
    log.to_csv('player_gamelogs\\'+uid+'.txt')
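
<p>Once the files exist, they can be loaded back for analysis. The sketch below is one possible round trip, assuming the files were written with the same default <code>to_csv</code> call used above. The frame contents, the id <code>examppl01</code>, and the temporary directory are placeholders, and <code>os.path.join</code> is used here only to keep the path portable across operating systems.</p>

```python
import os
import tempfile

import pandas as pd

# Placeholder game log standing in for the output of gamelogger.
log = pd.DataFrame({"date_game": ["2019-10-02", "2019-10-05"], "saves": [30, 27]})

with tempfile.TemporaryDirectory() as outdir:
    # Hypothetical unique_id; the notebook writes into 'player_gamelogs' instead.
    path = os.path.join(outdir, "examppl01" + ".txt")
    log.to_csv(path)  # default separator is a comma; the index is written as the first column
    restored = pd.read_csv(path, index_col=0)  # index_col=0 reads that first column back as the index

print(restored)
```

<p>Without <code>index_col=0</code>, the saved index comes back as an extra column named <code>Unnamed: 0</code>.</p>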