# Scrape NHL Lineups
## Scraping lines from a single date
In this notebook, I show some of the basic functions that I use to gather NHL starting lineups, in an attempt to answer the question of if, when, and how often teams should "shake up their lines." I scrape the data from https://www.rotogrinders.com, a daily fantasy sports website. I use rotogrinders because many other websites (e.g. https://www.nhl.com/stats, https://www.hockey-reference.com) do not list starting lineups in their stats (at least, not where I could find them).

First, let's load in requests and BeautifulSoup to acquire and parse the data.

In [1]:
import requests
from bs4 import BeautifulSoup


Let's just do one page. As a test we will scrape all of the team names (and whether they are home or away) and their lineups for Jan 7, 2017.

In [2]:
page = requests.get("https://rotogrinders.com/lineups/nhl?date=2017-01-07&site=fanduel")
soup = BeautifulSoup(page.content, 'html.parser')
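
The query string in that URL carries the date and the fantasy site. As a minimal sketch, the URL could be assembled from its parts with the standard library. Here `lineup_url` and `BASE_URL` are hypothetical names that do not appear elsewhere in this notebook, and the parameter names `date` and `site` are simply read off the URL used above.

```python
from urllib.parse import urlencode

# Assumed base path, copied from the URL requested above
BASE_URL = "https://rotogrinders.com/lineups/nhl"

def lineup_url(date, site="fanduel"):
    """Build the lineups URL for a date string in the form YYYY-MM-DD."""
    return BASE_URL + "?" + urlencode({"date": date, "site": site})

example_url = lineup_url("2017-01-07")
```

Building the URL in one place keeps the date format and the site name from being repeated in several cells.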

Each game is stored in an 'li' tag with the attribute data-role="lineup-card".

In [3]:
games = soup.find_all('li',{'data-role':"lineup-card"})
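
To see what this kind of attribute filter returns, here is a small sketch on a made-up HTML snippet. The markup and the team codes (`AAA`, `BBB`, ...) are invented for illustration and are much simpler than the real rotogrinders page. Only the `data-role`, `data-away`, and `data-home` attribute names are taken from the code in this notebook.

```python
from bs4 import BeautifulSoup

# Invented markup: two lineup cards and one unrelated list item
toy_html = """
<ul>
  <li data-role="lineup-card" data-away="AAA" data-home="BBB"></li>
  <li data-role="other"></li>
  <li data-role="lineup-card" data-away="CCC" data-home="DDD"></li>
</ul>
"""

toy_soup = BeautifulSoup(toy_html, 'html.parser')
# Keep only the <li> tags whose data-role attribute is "lineup-card"
toy_cards = toy_soup.find_all('li', {'data-role': 'lineup-card'})
# Each tag exposes its attributes through the .attrs dict
matchups = [(c.attrs['data-away'], c.attrs['data-home']) for c in toy_cards]
```

The unrelated `li` is skipped, so `matchups` holds one (away, home) pair per card.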

Let's build some functions to do the dirty low-level scraping. This is useful to reduce repeated code, and to allow the code to be changed more easily if the html on rotogrinders ever changes. For now, I am only scraping the forward lines, and not considering the defensive pairings.

In [10]:
def get_lineups(games):
    """
    Gather all of the lineups for a list of games.
    INPUTS:
        -) games: a list of tags, each of which containing one game
    OUTPUTS:
        -) AwayTeams: list of the away teams
        -) HomeTeams: list of the home teams
        -) AwayLineups: list of dicts of the away teams' lineups
        -) HomeLineups: list of dicts of the home teams' lineups
    """
    Ngames = len(games)
    AwayTeams = [0]*Ngames
    HomeTeams = [0]*Ngames
    AwayLineups = [dict() for x in range(Ngames)]
    HomeLineups = [dict() for x in range(Ngames)]
    for i in range(Ngames):
        AwayTeams[i] = games[i].attrs["data-away"]
        HomeTeams[i] = games[i].attrs["data-home"]   
        AR = games[i].find('div',{'class':'blk away-team'})
        HR = games[i].find('div',{'class':'blk home-team'})
        for l in range(1,5):
            AwayLineups[i]['L'+str(l)] = get_line(AR,l)
            HomeLineups[i]['L'+str(l)] = get_line(HR,l)
    return AwayTeams, HomeTeams, AwayLineups, HomeLineups

In [11]:
def get_line(roster_tag,line):
    """
    Gather a single line given the line number and a tag.
    INPUTS:
        -) roster_tag: tag of the roster for the team
        -) line: integer of the line number to grab
    OUTPUT:
        -) tuple of the names of the players in the line
    """
    line_tags = roster_tag.find("h4",text="Line "+str(line)).parent.find_all('li')
    line = []
    for lt in line_tags:
        try:
            line.append(lt.find('a',{'class':'player-popup'}).attrs['title'])
        except AttributeError:
            # If this player does not have a fantasy score, they built the html differently
            line.append(lt.find('span').text.strip())
    # Return as a tuple
    return line[0], line[1], line[2]
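
Note that `get_line` indexes `line[0]`, `line[1]`, and `line[2]` directly, so it raises an `IndexError` if a line on the page lists fewer than three players. Below is a minimal sketch of a more forgiving return, assuming (unverified) that some pages do list incomplete lines. `pad_line` is a hypothetical helper, not something used elsewhere in this notebook.

```python
def pad_line(names, size=3, fill=""):
    """Return exactly `size` names as a tuple, padding short lists with `fill`."""
    # Truncate anything beyond `size`, then pad; a negative count adds nothing
    padded = list(names[:size]) + [fill] * (size - len(names))
    return tuple(padded)

short_line = pad_line(["Player A", "Player B"])
```

With this, the last statement of `get_line` could become `return pad_line(line)`, at the cost of empty strings showing up in the data wherever a player is missing.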

Then, to gather all of the lineups for Jan 7, 2017, I can just run:

    

In [13]:
AwayTeams, HomeTeams, AwayLineups, HomeLineups = get_lineups(games)


Let's check the output (only the away-team blocks for the first two games are shown here):

In [17]:
for i in range(len(games)):
    print("Game {0}".format(i))
    print('Away Team = {0}'.format(AwayTeams[i]))
    print('Lineup:')
    print(AwayLineups[i])
    print('Home Team = {0}'.format(HomeTeams[i]))
    print('Lineup:')
    print(HomeLineups[i])


Game 0
Away Team = WPG
Lineup:
{'L1': ('Mathieu Perreault', 'Blake Wheeler', 'Bryan Little'), 'L4': ('Andrew Copp', 'Nic Petan', 'Drew Stafford'), 'L3': ('Shawn Matthias', 'Joel Armia', 'Adam Lowry'), 'L2': ('Patrik Laine', 'Nikolaj Ehlers', 'Mark Scheifele')}
...
Game 1
Away Team = TBL
Lineup:
{'L1': ('Ondrej Palat', 'Nikita Kucherov', 'Tyler Johnson'), 'L4': ('Ryan Callahan', 'J.T. Brown', 'Cedric Paquette'), 'L3': ('Matthew Peca', 'Luke Witkowski', 'Vladislav Namestnikov'), 'L2': ('Alex Killorn', 'Jonathan Drouin', 'Valtteri Filppula')}
...

Looking pretty good!

## Scraping ALL the lineups!
Here I will generalize the functions and write code to download all of the regular-season lineups from the 2016-2017 season.

First, I am going to generalize the ``get_lineups`` function to a ``get_lineups_date`` function that takes the date and returns the teams and lineups for that date.

In [18]:
def get_lineups_date(date):
    """
    Gather all of the lineups for a given date.
    INPUTS:
        -) date: a string in the form YYYY-MM-DD
    OUTPUTS:
        -) AwayTeams: list of the away teams
        -) HomeTeams: list of the home teams
        -) AwayLineups: list of dicts of the away teams' lineups
        -) HomeLineups: list of dicts of the home teams' lineups
    """
    page = requests.get("https://rotogrinders.com/lineups/nhl?date={0}&site=fanduel".format(date))
    soup = BeautifulSoup(page.content, 'html.parser')
    games = soup.find_all('li',{'data-role':"lineup-card"})
    # Reuse get_lineups for the per-game parsing instead of repeating it here
    return get_lineups(games)

Now I just need to gather a list of all of the dates containing NHL games in the 2016-2017 season. 

This is listed very clearly at https://www.hockey-reference.com/leagues/NHL_2017_games.html, so it is pretty straightforward to scrape it from there.

In [19]:
page = requests.get("https://www.hockey-reference.com/leagues/NHL_2017_games.html")
soup = BeautifulSoup(page.content, 'html.parser')

In [46]:
reg_season_tag = soup.find('table',{"id":'games'})
date_tags = reg_season_tag.find_all('th',{'data-stat':'date_game'})

In [84]:
dates = [0]*(len(date_tags)-1)
for i in range(1,len(date_tags)):
    dates[i-1] = date_tags[i].find('a').text
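
Since `get_lineups_date` expects each date as a YYYY-MM-DD string, a quick sanity check on the scraped values can catch a stray placeholder or a differently formatted date before any requests are sent. This is only a sketch: `is_iso_date` is a hypothetical helper, and the sample values below are made up.

```python
from datetime import datetime

def is_iso_date(text):
    """Return True if text parses with the format %Y-%m-%d."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except (TypeError, ValueError):
        # TypeError: not a string at all; ValueError: wrong format
        return False
    return True

# Made-up sample: one good date, one badly formatted date, one placeholder
sample = ["2016-10-12", "Oct 12, 2016", 0]
bad = [d for d in sample if not is_iso_date(d)]
```

Running the same list comprehension over `dates` should give an empty list if every entry is usable.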

Awesome! And lucky me, their date format matches what I need. Now, let's get all of the lineups from the 2016-2017 season and save the date, team names, and lineups to a pandas dataframe. (Note: this can take a while, and it should not be run too often, so as not to flood their servers with requests.)
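
One simple way to keep the request rate down is to pause between dates. The following is a minimal sketch, not what the scrape in this notebook does. `throttled` is a hypothetical helper, and the one-second default is an arbitrary choice rather than a limit published by either site.

```python
import time

def throttled(items, delay_seconds=1.0, sleep=time.sleep):
    """Yield items one at a time, pausing between consecutive items."""
    for n, item in enumerate(items):
        if n > 0:
            # No pause before the first item, only between items
            sleep(delay_seconds)
        yield item

# Demonstration with a recording function in place of time.sleep,
# so the example runs instantly
pauses = []
seen = list(throttled(["2016-10-12", "2016-10-13", "2016-10-14"], 0.5, sleep=pauses.append))
```

In a scraping loop this would be used as `for date in throttled(udates): ...`, leaving the rest of the loop body unchanged.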

First let's create a data frame with a date for each game in the season, and empty values for the team names and lineups:

In [92]:
import pandas as pd
df = pd.DataFrame(dates,columns=["Date"])

In [95]:
df['Away Team'] = ''
df['Home Team'] = ''
for loc in ("A","H"):
    for ln in range(1,5): # Line number
        for pn in range(3): # player in line
            df['{0}L{1}-{2}'.format(loc,ln,pn)] = ''
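
The column names follow a fixed pattern: `A` or `H` for away or home, then `L` and the line number (1 to 4), then a dash and the player's position within the line (0 to 2). For example, `AL2-0` is the first listed player on the away team's second line. The sketch below only builds the names as a list, to make the scheme easy to inspect. `lineup_columns` is a name introduced here for illustration.

```python
# Same three nested loops as above, written as one list comprehension
lineup_columns = [
    "{0}L{1}-{2}".format(loc, ln, pn)
    for loc in ("A", "H")      # away, home
    for ln in range(1, 5)      # line number
    for pn in range(3)         # player in line
]

n_columns = len(lineup_columns)  # 2 teams * 4 lines * 3 players = 24
```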

In [115]:
# dates has one entry per game; collect the unique dates in their original order
udates = []
for d in dates:
    if d not in udates:
        udates.append(d)

Let's do the big scrape!

In [116]:
row = 0  # running row index into df, so each date fills the next block of rows
for date in udates:
    print("Date:",date)
    AwayTeams, HomeTeams, AwayLineups, HomeLineups = get_lineups_date(date)
    # Save to Dataframe
    # (assumes rotogrinders lists the same games per date as hockey-reference)
    Ngames = len(AwayTeams)
    for i in range(Ngames):
        df.at[row, 'Away Team'] = AwayTeams[i]
        df.at[row, 'Home Team'] = HomeTeams[i]
        for loc, Lineup in (("A", AwayLineups[i]), ("H", HomeLineups[i])):
            for ln in range(1,5): # Line number
                for pn in range(3): # player in line
                    df.at[row, '{0}L{1}-{2}'.format(loc,ln,pn)] = Lineup['L{0}'.format(ln)][pn]
        row += 1
# Write the full table once, after the loop (the data/ directory must exist)
df.to_csv("data/lineups2016-2017-reg.csv", index=False)


Date: 2016-10-12
Date: 2016-10-13
Date: 2016-10-14
Date: 2016-10-15
Date: 2016-10-16
Date: 2016-10-17
Date: 2016-10-18


IndexError: list index out of range

In [None]:
 df


In [82]:
dates

[0,
 '2016-10-12',
 '2016-10-12',
 '2016-10-12',
 '2016-10-12',
 '2016-10-13',
 '2016-10-13',
 '2016-10-13',
 '2016-10-13',
 '2016-10-13',
 '2016-10-13',
 '2016-10-13',
 '2016-10-13',
 '2016-10-13',
 '2016-10-14',
 '2016-10-14',
 '2016-10-14',
 '2016-10-15',
 '2016-10-15',
 '2016-10-15',
 '2016-10-15',
 '2016-10-15',
 '2016-10-15',
 '2016-10-15',
 '2016-10-15',
 '2016-10-15',
 '2016-10-15',
 '2016-10-15',
 '2016-10-15',
 '2016-10-15',
 '2016-10-16',
 '2016-10-16',
 '2016-10-16',
 '2016-10-17',
 '2016-10-17',
 '2016-10-17',
 '2016-10-17',
 '2016-10-18',
 '2016-10-18',
 '2016-10-18',
 '2016-10-18',
 '2016-10-18',
 '2016-10-18',
 '2016-10-18',
 '2016-10-18',
 '2016-10-18',
 '2016-10-18',
 '2016-10-18',
 '2016-10-18',
 '2016-10-19',
 '2016-10-19',
 '2016-10-20',
 '2016-10-20',
 '2016-10-20',
 '2016-10-20',
 '2016-10-20',
 '2016-10-20',
 '2016-10-20',
 '2016-10-20',
 '2016-10-20',
 '2016-10-20',
 '2016-10-20',
 '2016-10-21',
 '2016-10-21',
 '2016-10-21',
 '2016-10-22',
 '2016-10-22',
 '2016