# Scrape NHL Lineups
## Scraping lines from a single date
In this notebook, I show some of the basic functions that I use to gather NHL starting lineups, in an attempt to answer the question of if, when, and how often teams should "shake up their lines." I scrape the data from https://www.rotogrinders.com, a daily fantasy sports website. I use rotogrinders because many other websites (e.g. https://www.nhl.com/stats, https://www.hockey-reference.com) do not list starting lineups in their stats (at least, not where I could find them).

First, let's load in requests and BeautifulSoup to acquire and parse the data.

In [1]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

Let's just do one page. As a test we will scrape all of the team names (and whether they are home or away) and their lineups for Jan 7, 2017.

In [2]:
page = requests.get("https://rotogrinders.com/lineups/nhl?date=2017-01-07&site=fanduel")
soup = BeautifulSoup(page.content, 'html.parser')

Each game can be found in an 'li' tag with the attribute data-role="lineup-card".

In [3]:
games = soup.find_all('li',{'data-role':"lineup-card"})
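
To make the selection step concrete without hitting the website, here is a minimal sketch run against a tiny hand-written snippet. The markup and the team abbreviations (AAA, BBB, ...) are made up for illustration; only the `data-role`, `data-away`, and `data-home` attribute names come from the cells in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the rotogrinders markup
sample_html = """
<ul>
  <li data-role="lineup-card" data-away="AAA" data-home="BBB">game 1</li>
  <li data-role="lineup-card" data-away="CCC" data-home="DDD">game 2</li>
  <li data-role="advert">not a game</li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Passing a dict as the second argument filters on tag attributes
cards = sample_soup.find_all("li", {"data-role": "lineup-card"})

# Each matching tag exposes its attributes through the .attrs dict
matchups = [(c.attrs["data-away"], c.attrs["data-home"]) for c in cards]
print(matchups)
```

The third `li` is skipped because its `data-role` does not match, so only the two game cards come back.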

Let's build some functions to do the dirty low-level scraping. This is useful to reduce repeated code, and to allow the code to be changed more easily if the html on rotogrinders ever changes. For now, I am only scraping the forward lines, and not considering the defensive pairings.

In [4]:
def get_lineups(games):
    """
    Gather all of the lineups for a list of games.
    INPUTS:
        -) games: a list of tags, each of which containing one game
    OUTPUTS:
        -) AwayTeams: list of the away teams
        -) HomeTeams: list of the home teams
        -) AwayLineups: list of dicts of the away teams' lineups
        -) HomeLineups: list of dicts of the home teams' lineups
    """
    Ngames = len(games)
    AwayTeams = [0]*Ngames
    HomeTeams = [0]*Ngames
    AwayLineups = [dict() for x in range(Ngames)]
    HomeLineups = [dict() for x in range(Ngames)]
    for i in range(Ngames):
        AwayTeams[i] = games[i].attrs["data-away"]
        HomeTeams[i] = games[i].attrs["data-home"]   
        AR = games[i].find('div',{'class':'blk away-team'})
        HR = games[i].find('div',{'class':'blk home-team'})
        for l in range(1,5):
            AwayLineups[i]['L'+str(l)] = get_line(AR,l)
            HomeLineups[i]['L'+str(l)] = get_line(HR,l)
    return AwayTeams, HomeTeams, AwayLineups, HomeLineups

In [5]:
def get_line(roster_tag,line):
    """
    Gather a single line given the line number and a tag.
    INPUTS:
        -) roster_tag: tag of the roster for the team
        -) line: integer of the line number to grab
    OUTPUT:
        -) tuple of the names of the players in the line
    """
    line_tags = roster_tag.find("h4",text="Line "+str(line)).parent.find_all('li')
    line = []
    for lt in line_tags:
        try:
            line.append(lt.find('a',{'class':'player-popup'}).attrs['title'])
        except AttributeError:
            # If this player does not have a fantasy score, they built the html differently
            line.append(lt.find('span').text.strip())
    # Return as a tuple
    return line[0], line[1], line[2]
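
The two markup variants handled by the `try`/`except` above can be tried out on a small hand-written roster. This is only a sketch: the HTML below is my simplified guess at the structure (an `h4` heading whose parent holds the `li` items), the player names are placeholders, and `player_name` is a hypothetical helper that checks for `None` explicitly instead of catching `AttributeError`.

```python
from bs4 import BeautifulSoup

# Hypothetical roster markup: one player with a popup link, one without
sample_roster_html = """
<div class="blk away-team">
  <div>
    <h4>Line 1</h4>
    <ul>
      <li><a class="player-popup" title="Player One">P. One</a></li>
      <li><span> Player Two </span></li>
      <li><a class="player-popup" title="Player Three">P. Three</a></li>
    </ul>
  </div>
</div>
"""

def player_name(li_tag):
    """Return the player's name from one list item (hypothetical helper)."""
    link = li_tag.find("a", {"class": "player-popup"})
    if link is not None:
        # Usual case: the full name is stored in the link's title attribute
        return link.attrs["title"]
    # Fallback: no popup link, so take the text of the span instead
    return li_tag.find("span").text.strip()

roster = BeautifulSoup(sample_roster_html, "html.parser")
line_items = roster.find("h4", string="Line 1").parent.find_all("li")
names = tuple(player_name(li) for li in line_items)
print(names)
```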

Then, to gather all of the lineups for Jan 7, 2017, I can just run:

    

In [6]:
AwayTeams, HomeTeams, AwayLineups, HomeLineups = get_lineups(games)


Let's check the output:

In [7]:
for i in range(2):
    print("Game {0}".format(i))
    # Only the away side is printed here; HomeTeams[i] and HomeLineups[i] work the same way
    print('Away Team = {0}'.format(AwayTeams[i]))
    print('Lineup:')
    print(AwayLineups[i])

Game 0
Away Team = WPG
Lineup:
{'L2': ('Patrik Laine', 'Nikolaj Ehlers', 'Mark Scheifele'), 'L3': ('Shawn Matthias', 'Joel Armia', 'Adam Lowry'), 'L4': ('Andrew Copp', 'Nic Petan', 'Drew Stafford'), 'L1': ('Mathieu Perreault', 'Blake Wheeler', 'Bryan Little')}
Game 1
Away Team = TBL
Lineup:
{'L2': ('Alex Killorn', 'Jonathan Drouin', 'Valtteri Filppula'), 'L3': ('Matthew Peca', 'Luke Witkowski', 'Vladislav Namestnikov'), 'L4': ('Ryan Callahan', 'J.T. Brown', 'Cedric Paquette'), 'L1': ('Ondrej Palat', 'Nikita Kucherov', 'Tyler Johnson')}

-----
Looking pretty good!

## Scraping ALL the lineups!
Here I will generalize the functions, add a new function to help clean up some messy data, and then write a code to download all of the regular season lineups from the 2016-2017 season. 

I realized that rotogrinders used the abbreviation for the team names, which I thought was great! Then, I realized that they messed up the abbreviation of the Montreal Canadiens (it should be "MTL", but they use "MON"). So I have a function ``check_abbrev`` to check the abbreviations.

I then generalize the ``get_lineups`` function to a ``get_lineups_date`` function that takes the date and returns the teams and lineups for that date. I also add in a few lines of code to save the html from rotogrinders, if the file doesn't exist. Any reruns will then take significantly less time.

In [8]:
def check_abbrev(abbrev):
    """
    Check and, if necessary, fix the rotogrinders abbreviations.
    """
    if abbrev == "MON":
        abbrev = "MTL"
    return abbrev
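
If more mismatched abbreviations turn up, a lookup table may scale better than a chain of `if` statements. The sketch below is one way to do that; `ABBREV_FIXES` and `check_abbrev_map` are names I made up for the example, and the only entry is the MON/MTL fix described above.

```python
# Map from the abbreviation used by rotogrinders to the one used elsewhere.
# Only the MON -> MTL entry comes from this notebook; add others as needed.
ABBREV_FIXES = {"MON": "MTL"}

def check_abbrev_map(abbrev):
    """Return the corrected abbreviation, or the input unchanged if no fix is listed."""
    return ABBREV_FIXES.get(abbrev, abbrev)

print(check_abbrev_map("MON"))
print(check_abbrev_map("WPG"))
```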

In [9]:
def get_lineups_date(date):
    """
    Gather all of the lineups for the games on a given date.
    INPUTS:
        -) date: a string in the form YYYY-MM-DD
    OUTPUTS:
        -) AwayTeams: list of the away teams
        -) HomeTeams: list of the home teams
        -) AwayLineups: list of dicts of the away teams' lineups
        -) HomeLineups: list of dicts of the home teams' lineups
    """
    fname = "data/rotogriders_date_{0}.html".format(date)
    try:
        # Reuse the saved html if it exists; the with-block closes the file handle
        with open(fname) as f:
            soup = BeautifulSoup(f, "html.parser")
    except FileNotFoundError:
        print("Downloading Data...")
        page = requests.get("https://rotogrinders.com/lineups/nhl?date={0}&site=fanduel".format(date))
        soup = BeautifulSoup(page.content, 'html.parser')
        # Save to file so reruns skip the download (the data/ directory must already exist)
        with open(fname, "w") as f:
            f.write(str(soup))
    games = soup.find_all('li',{'data-role':"lineup-card"})
    Ngames = len(games)
    AwayTeams = [0]*Ngames
    HomeTeams = [0]*Ngames
    AwayLineups = [dict() for x in range(Ngames)]
    HomeLineups = [dict() for x in range(Ngames)]
    for i in range(Ngames):
        AwayTeams[i] = check_abbrev(games[i].attrs["data-away"])
        HomeTeams[i] = check_abbrev(games[i].attrs["data-home"])
        AR = games[i].find('div',{'class':'blk away-team'})
        HR = games[i].find('div',{'class':'blk home-team'})
        for l in range(1,5):
            AwayLineups[i]['L'+str(l)] = get_line(AR,l)
            HomeLineups[i]['L'+str(l)] = get_line(HR,l)
    return AwayTeams, HomeTeams, AwayLineups, HomeLineups
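
The save-if-missing idea in ``get_lineups_date`` can also be pulled out into a small helper. The sketch below is not the code used in this notebook: `load_or_fetch` and `fake_fetch` are hypothetical names, and the download step is passed in as a function so that the example runs without touching the network.

```python
import tempfile
from pathlib import Path

def load_or_fetch(cache_path, fetch):
    """Return the cached text if the file exists; otherwise call fetch() and cache the result."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return cache_path.read_text(encoding="utf-8")
    html = fetch()
    # Create the cache directory on first use instead of assuming it exists
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(html, encoding="utf-8")
    return html

calls = []

def fake_fetch():
    """Stand-in for the real download; records how often it is called."""
    calls.append(1)
    return "<html>lineups</html>"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data" / "example_date.html"
    first = load_or_fetch(path, fake_fetch)   # downloads and writes the cache file
    second = load_or_fetch(path, fake_fetch)  # served from the cache file

print(len(calls))
```

The second call never reaches `fake_fetch`, which is the behavior that makes reruns of the scrape faster.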

I also need to make ``get_line`` more robust. For instance, a team may dress only 11 forwards (and 7 defensemen) after an injury, leaving just 2 players on one line. I will name each opening in an incomplete line "None None".

In [10]:
def get_line(roster_tag,line):
    """
    Gather a single line given the line number and a tag.
    INPUTS:
        -) roster_tag: tag of the roster for the team
        -) line: integer of the line number to grab
    OUTPUT:
        -) tuple of the names of the players in the line
    """
    line_tags = roster_tag.find("h4",text="Line "+str(line)).parent.find_all('li')
    line = []
    for lt in line_tags:
        try:
            line.append(lt.find('a',{'class':'player-popup'}).attrs['title'])
        except AttributeError:
            # If this player does not have a fantasy score, they built the html differently
            line.append(lt.find('span').text.strip())
    # Test if line only has 1-2 players
    assert len(line) > 0
    for i in range(3-len(line)):
        line.append("None None")
    # Return as a tuple
    return line[0], line[1], line[2]
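
The padding step can be looked at on its own. Below is a small sketch with a hypothetical helper, `pad_line`, that fills short lines with the same "None None" placeholder. It raises a `ValueError` for an empty line rather than using `assert`, since assertions are skipped when Python runs with `-O`; the player names are placeholders.

```python
def pad_line(players, size=3, filler="None None"):
    """Return a tuple of exactly `size` names, padding short lines with `filler`."""
    if not players:
        raise ValueError("a line needs at least one player")
    # Multiplying a list by zero or a negative number gives an empty list,
    # so complete lines pass through unchanged
    padded = list(players) + [filler] * (size - len(players))
    return tuple(padded[:size])

print(pad_line(["Player One", "Player Two"]))
print(pad_line(["Player One", "Player Two", "Player Three"]))
```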

Now I just need to gather a list of all of the dates and team names containing NHL games in the 2016-2017 season. 

This is listed very clearly at https://www.hockey-reference.com/leagues/NHL_2017_games.html, so it is pretty straightforward to scrape it from there.

In [11]:
page = requests.get("https://www.hockey-reference.com/leagues/NHL_2017_games.html")
soup = BeautifulSoup(page.content, 'html.parser')

In [12]:
reg_season_tag = soup.find('table',{"id":'games'})
ind_game_tags = reg_season_tag.find('tbody').find_all('tr')
Ngames = len(ind_game_tags)
dates     = pd.Series([np.datetime64('2009-01-01')]*Ngames)
AwayTeams = np.empty(Ngames,dtype='U3')
HomeTeams = np.empty(Ngames,dtype='U3')

In [13]:
for i in range(Ngames):
    dates[i]     = ind_game_tags[i].find('th',{'data-stat':'date_game'}).find('a').text
    AwayTeams[i] = ind_game_tags[i].find('td',{'data-stat':'visitor_team_name'}).attrs['csk'][:3]
    HomeTeams[i] = ind_game_tags[i].find('td',{'data-stat':'home_team_name'}).attrs['csk'][:3]

Awesome! And lucky me! Their date format matches what I need. Yay! Now, let's get all of the lineups from the 2016-2017 season, and save the date, team names, and lineups to a pandas dataframe.

First let's create a data frame with a date and the team names, and empty values for lineups:
(Note: While we assume that the lineups on Rotogrinders are correct, their data isn't perfect. For instance, on 2016-12-19 they list a game that didn't happen. To mitigate that, I use the dates and team names from hockey-reference.com as the ground truth, and only grab the lineups from rotogrinders.)

In [14]:
df = pd.DataFrame(dates,columns=["Date"])
df["Away Team"] = AwayTeams
df["Home Team"] = HomeTeams

In [15]:
for loc in ("A","H"):
    for ln in range(1,5): # Line number
        for pn in range(3): # player in line
            df['{0}L{1}-{2}'.format(loc,ln,pn)] = ''
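
The column names follow the pattern location (A or H), line number (1 to 4), and player slot (0 to 2). As a quick sketch of what the nested loops produce, the same names can be built in a list comprehension:

```python
# Same naming scheme as the loop above: e.g. "AL1-0" is away team, line 1, first player
lineup_columns = ['{0}L{1}-{2}'.format(loc, ln, pn)
                  for loc in ("A", "H")    # away / home
                  for ln in range(1, 5)    # line number
                  for pn in range(3)]      # player slot within the line

print(len(lineup_columns))
print(lineup_columns[:4])
print(lineup_columns[-1])
```

That gives 2 × 4 × 3 = 24 lineup columns per game.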

In [16]:
# Unique dates, in order of first appearance
udates = list(dict.fromkeys(dates))

In [17]:
df.head()

Unnamed: 0,Date,Away Team,Home Team,AL1-0,AL1-1,AL1-2,AL2-0,AL2-1,AL2-2,AL3-0,...,HL1-2,HL2-0,HL2-1,HL2-2,HL3-0,HL3-1,HL3-2,HL4-0,HL4-1,HL4-2
0,2016-10-12,STL,CHI,,,,,,,,...,,,,,,,,,,
1,2016-10-12,CGY,EDM,,,,,,,,...,,,,,,,,,,
2,2016-10-12,TOR,OTT,,,,,,,,...,,,,,,,,,,
3,2016-10-12,LAK,SJS,,,,,,,,...,,,,,,,,,,
4,2016-10-13,MTL,BUF,,,,,,,,...,,,,,,,,,,


Let's do the big scrape! (note: this can take a bit as the html from rotogrinders is somewhat... robust.)

In [18]:
for date in udates:
    # print(date)
    AwayTeams, HomeTeams, AwayLineups, HomeLineups = get_lineups_date(date)
    # Save to Dataframe
    for i in range(len(AwayTeams)):
        # Find game in dataframe
        match = np.where((df['Date'] == date) &
                         (df['Away Team'] == AwayTeams[i]) &
                         (df['Home Team'] == HomeTeams[i]))[0]
        # There are some extra games on rotogrinders that didn't happen
        if len(match) != 1:
            continue
        nloc = match[0]  # Row to write to
        for loc, Lineup in (("A", AwayLineups[i]), ("H", HomeLineups[i])):
            for ln in range(1,5):
                for pn in range(3):
                    col = '{0}L{1}-{2}'.format(loc,ln,pn)
                    # Assign in one step; chained indexing may write to a copy
                    df.iat[nloc, df.columns.get_loc(col)] = Lineup['L{0}'.format(ln)][pn]
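
The row lookup is the part of this loop that is easiest to get wrong, so here is a small sketch of it on a toy dataframe. Everything in it is made up for illustration: the team abbreviations, the player name, and the helper `find_game_row`.

```python
import numpy as np
import pandas as pd

# Toy schedule with placeholder teams, plus one empty lineup column
toy_games = pd.DataFrame({
    "Date": ["2017-01-07", "2017-01-07", "2017-01-08"],
    "Away Team": ["AAA", "CCC", "AAA"],
    "Home Team": ["BBB", "DDD", "CCC"],
})
toy_games["AL1-0"] = ""

def find_game_row(frame, date, away, home):
    """Return the row position of the unique matching game, or None if there is no unique match."""
    mask = ((frame["Date"] == date) &
            (frame["Away Team"] == away) &
            (frame["Home Team"] == home))
    rows = np.where(mask)[0]
    return int(rows[0]) if len(rows) == 1 else None

# A game that is in the schedule: write a name into its row
row = find_game_row(toy_games, "2017-01-07", "CCC", "DDD")
toy_games.iat[row, toy_games.columns.get_loc("AL1-0")] = "Player One"

# A game that is not in the schedule: nothing to write
missing = find_game_row(toy_games, "2017-01-09", "AAA", "BBB")
print(row, missing)
```

Returning `None` for anything other than exactly one match mirrors the way the loop skips games that rotogrinders lists but hockey-reference does not.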

Again, let's check the output:

In [19]:
df.tail()

Unnamed: 0,Date,Away Team,Home Team,AL1-0,AL1-1,AL1-2,AL2-0,AL2-1,AL2-2,AL3-0,...,HL1-2,HL2-0,HL2-1,HL2-2,HL3-0,HL3-1,HL3-2,HL4-0,HL4-1,HL4-2
1225,2017-04-09,CAR,PHI,Teuvo Teravainen,Jordan Staal,Sebastian Aho,Sergey Tolchinsky,Joakim Nordstrom,Victor Rask,Elias Lindholm,...,Claude Giroux,Dale Weise,Brayden Schenn,Sean Couturier,Wayne Simmonds,Valtteri Filppula,Jordan Weal,Travis Konecny,Pierre-Edouard Bellemare,Mike Vecchione
1226,2017-04-09,COL,STL,Rene Bourque,Nathan MacKinnon,Sven Andrighetto,Gabriel Landeskog,Matt Duchene,Tyson Jost,Blake Comeau,...,Jaden Schwartz,Alexander Steen,Patrik Berglund,David Perron,Magnus Paajarvi,Vladimir Sobotka,Jori Lehtera,Ryan Reaves,Dmitrij Jaskin,Zach Sanford
1227,2017-04-09,BUF,TBL,Evander Kane,Ryan O'Reilly,Brian Gionta,Sam Reinhart,Jack Eichel,Tyler Ennis,Evan Rodrigues,...,Nikita Kucherov,Cory Conacher,Yanni Gourde,Alex Killorn,Luke Witkowski,Jonathan Drouin,Vladislav Namestnikov,Greg McKegg,Gabriel Dumont,J.T. Brown
1228,2017-04-09,CBJ,TOR,Nick Foligno,Brandon Saad,Oliver Bjorkstrand,Cam Atkinson,Brandon Dubinsky,Boone Jenner,Lauri Korpikoski,...,Nazem Kadri,Mitchell Marner,Tyler Bozak,James van Riemsdyk,Zach Hyman,William Nylander,Auston Matthews,Kasperi Kapanen,Brian Boyle,Matt Martin
1229,2017-04-09,FLA,WSH,Jaromir Jagr,Jonathan Huberdeau,Vincent Trocheck,Thomas Vanek,Nick Bjugstad,Reilly Smith,Colton Sceviour,...,Nicklas Backstrom,Tom Wilson,Marcus Johansson,Evgeny Kuznetsov,Lars Eller,Paul Carey,Andre Burakovsky,Chandler Stephenson,Garrett Mitchell,Daniel Winnik


And let's make sure we found lineups for all of the games. Are any ``AL1-1`` positions empty?

In [20]:
print((df['AL1-1'] == '').any())

False
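
Checking a single column works as a spot check. A broader version, sketched below on a toy dataframe with placeholder names, counts the empty strings in every lineup column at once. It assumes the lineup columns are exactly those whose names start with "AL" or "HL".

```python
import pandas as pd

# Toy frame: one lineup cell left empty on purpose
toy_lineups = pd.DataFrame({
    "Home Team": ["BBB", "DDD"],
    "AL1-0": ["Player One", ""],
    "AL1-1": ["Player Two", "Player Three"],
})

# Pick out the lineup columns by their prefix
lineup_cols = [c for c in toy_lineups.columns if c[:2] in ("AL", "HL")]

# Number of empty strings per lineup column
empty_counts = (toy_lineups[lineup_cols] == "").sum()
print(empty_counts)
```

Any nonzero count points to a game whose lineup was never filled in.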


Now, let's save the pandas dataframe.


In [21]:
df.to_pickle("data/Lineups.pkl")
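
To load the lineups again in a later session, `pd.read_pickle` is the counterpart of `to_pickle`. The sketch below does a round trip on a toy dataframe in a temporary directory, so it does not touch the real `data/Lineups.pkl`; the file name and contents are placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy_saved = pd.DataFrame({
    "Date": ["2017-01-07"],
    "Away Team": ["AAA"],
    "AL1-0": ["Player One"],
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "Lineups_example.pkl"
    toy_saved.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

# The restored frame should have the same values, columns, and dtypes
same = toy_saved.equals(restored)
print(same)
```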