<h1>Team Log Grabber</h1>
<p>This notebook grabs the regular-season game logs for every hockey team from 1990 onward.</p>

In [None]:
import pandas as pd
import numpy as np
import requests as r
import re
from bs4 import BeautifulSoup as bs

pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 1000)

<h2>First Grab a List of Teams</h2>
<p>Every team's history page is linked as /teams/XXX/history.html. Defunct teams, and teams that relocated (such as the Atlanta Thrashers, who moved to Winnipeg), are listed under the team they transferred to. Before getting the gamelogs, we first scrape the team abbreviations, which can change from season to season. The season, team name, abbreviation, and gamelog link for each season are stored in a DataFrame.</p>
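<p>As a minimal sketch of the abbreviation step, the snippet below pulls the three-character code out of a season link and builds the matching gamelog path. The helper name <code>gamelog_link</code> and the sample hrefs are hypothetical and only mirror the shape of the links described above; the real scraping happens in the next cells.</p>

```python
import re

def gamelog_link(season_href, season):
    """Return (abbreviation, gamelog path) for one team-season."""
    # The three-character abbreviation sits between 'teams/' and the next slash.
    match = re.search(r'teams/(\w{3})/', season_href)
    if match is None:
        raise ValueError('no team abbreviation in ' + season_href)
    abbr = match[1]
    # Gamelog pages are keyed by the year the season ends: start year + 1.
    end_year = int(season[:4]) + 1
    return abbr, 'teams/' + abbr + '/' + str(end_year) + '_gamelog.html'

# Hypothetical inputs: the same franchise can appear under different abbreviations.
print(gamelog_link('/teams/ATL/2011.html', '2010-11'))
print(gamelog_link('/teams/ANA/2020.html', '2019-20'))
```

<p>Keying on the abbreviation found in each season link, rather than the one in the history URL, is what lets relocated teams resolve to the right gamelog page.</p>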

In [None]:
#first get a link to their histories

link_list = [] #list of links to the respective teams
url = 'https://www.hockey-reference.com'
t = r.get(url+'/teams')
soup = bs(t.text, 'html.parser')
links = soup.find_all('a', href=True) #skip anchors that have no href
for l in links:
    if 'history.html' in l['href']:
        link_list.append(l['href'])

In [None]:
#now create a dataframe consisting of the season, the team name, team abbreviation and the gamelog link
rows = [] #one dict per team-season; DataFrame.append was removed in pandas 2.0

for history_link in link_list:
    page = r.get(url+history_link)
    soup = bs(page.text, 'html.parser')
    tables = soup.find_all('table')
    for row in tables[0].find_all('tr')[1:]:

        #get season & link
        main_c = row.find_all('th')[0]
        anchors = main_c.find_all('a')

        if main_c.get('data-stat')!='season' or len(anchors)==0:
            continue   #skip rows that carry no season link
        season = main_c.text #this gets the season
        season_link = anchors[0]['href'] #this gets the link to team main page

        for c in row.find_all('td'):
            if c.get('data-stat')=='team_name':
                tname = c.text.strip('*')   #here's the team name
        abbr = re.search(r'teams/(\w{3})/',season_link)[1]  #the abbreviation is extracted from the link
        team_gamelog = 'teams/'+abbr+'/'+str(int(season[:4])+1)+'_gamelog.html'

        #stop at the first season before 1990 (assumes rows run newest to oldest)
        if int(season[:4])<1990:   #we don't care before 1990
            break
        rows.append({'Season':season,'team_name':tname,'team_abbr':abbr,'link':team_gamelog})

teamDataFrame = pd.DataFrame(rows, columns=['Season','team_name','team_abbr','link'])
teamDataFrame.to_csv('team_gamelog_list.txt', sep='\t')
teamDataFrame.head()

<h2>Now Grab the Gamelogs</h2>
<p>Now I can use the links from above to grab the game logs, which I'll store as tab-separated .txt files named by the unique identifier team_abbr_season (for example, ANA_2020).</p>
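<p>A small sketch of the naming scheme, assuming the season label starts with the four-digit start year (as in '2019-20'). The helper name <code>gamelog_filename</code> is hypothetical and is not used by the cells below, which build the same string inline.</p>

```python
def gamelog_filename(team_abbr, season):
    """Return the file name for one team-season, keyed by the season's end year."""
    end_year = int(season[:4]) + 1  # '2019-20' starts in 2019 and ends in 2020
    return team_abbr + '_' + str(end_year) + '.txt'

print(gamelog_filename('ANA', '2019-20'))
```

<p>Using the end year keeps the file name aligned with the year in the gamelog link itself.</p>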

In [11]:
def gamelogger(link):
    """
    Grabs the regular season game log at the given link and returns it as a DataFrame
    Input:  (string) gamelog link
    Output: (dataframe) gamelog data (excluding playoffs) for that year
    """
    column_names = [] #filled in from the first regular season row
    rows = []         #one dict per game; DataFrame.append was removed in pandas 2.0
    page = r.get('https://hockey-reference.com/'+link)
    soup = bs(page.text,'html.parser')

    for eachitem in soup.find_all('tr',id=True): #for each row in table
        #only count regular season games
        if eachitem['id'][11:13] == 'rs':
            cells = eachitem.find_all('td') #every column in the row
            if not column_names:
                column_names = [c['data-stat'] for c in cells]
            rows.append(dict(zip(column_names,[c.text for c in cells])))
    return pd.DataFrame(rows, columns=column_names)
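
<p>The regular-season filter in <code>gamelogger</code> relies on characters 11 and 12 of each row id being 'rs'. The sketch below shows that slice on a few made-up ids; the 'tm_gamelog_' prefix and the 'po' playoff marker are assumptions about the page markup, not something confirmed here.</p>

```python
def is_regular_season(row_id):
    """True when the row id carries 'rs' at positions 11 and 12."""
    return row_id[11:13] == 'rs'

# Hypothetical row ids, assumed to follow a 'tm_gamelog_<kind>.<n>' pattern.
row_ids = ['tm_gamelog_rs.1', 'tm_gamelog_rs.2', 'tm_gamelog_po.1']
regular = [rid for rid in row_ids if is_regular_season(rid)]
print(regular)
```

<p>Because the check is positional, a change in the id prefix length would silently drop every row, so an empty gamelog is worth treating as a warning sign.</p>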

In [13]:
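import os

os.makedirs('team_gamelogs', exist_ok=True) #make sure the output folder exists
for i,team in teamDataFrame.iterrows():
    try:
        log = gamelogger(team['link'])
        fname = team['team_abbr']+'_'+str(int(team['Season'][:4])+1)+'.txt'
        log.to_csv(os.path.join('team_gamelogs',fname), sep='\t')
    except Exception as e: #report the failing row and keep going
        print("Error with row %d, link %s: %s" % (i,team['link'],e))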