## Setup

In [73]:
import requests
import json
import pandas

## Making an API request: The schedule endpoint

A quick Google search turns up the following URL for the NHL's API (more specifically its schedule endpoint, one of many). Opening it in a browser shows JSON output listing all the games scheduled for a given day, the teams participating, their records so far in the season, and the venue.

In [74]:
url = 'https://statsapi.web.nhl.com/api/v1/schedule?date=2020-03-01'
response = requests.get(url)
response

Just executing the request returns a 200 response, which indicates the call was successful. If the endpoint didn't exist (e.g. the misspelled 'https://statsapi.web.nhl.com/api/v1/schedul'), we'd get a 404 response instead.

For any request, successful or not, we can then access a number of attributes on the response object (e.g. response.status_code, response.headers).
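As a minimal sketch of how those attributes might be inspected, the snippet below builds two `requests.Response` objects by hand instead of calling the NHL API, so it runs offline. The `describe` helper is a hypothetical name used only for illustration, and the header value is an assumed example.

```python
import requests


def describe(response):
    """Collect a few commonly used attributes of a response (hypothetical helper)."""
    return {
        "status_code": response.status_code,
        "ok": response.ok,  # True for status codes below 400
        "content_type": response.headers.get("Content-Type"),
    }


# Hand-built responses stand in for real API calls (assumption: no network access)
ok = requests.Response()
ok.status_code = 200
ok.headers["Content-Type"] = "application/json"  # assumed header value

missing = requests.Response()
missing.status_code = 404

print(describe(ok))
print(describe(missing))

# raise_for_status() turns a 4xx/5xx status into an exception
try:
    missing.raise_for_status()
except requests.HTTPError as error:
    print("Request failed:", error)
```

Checking `response.ok` (or calling `raise_for_status()`) before `response.json()` avoids trying to parse an error page as if it were game data.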

In [75]:
response.json()

{'copyright': 'NHL and the NHL Shield are registered trademarks of the National Hockey League. NHL and NHL team marks are the property of the NHL and its teams. © NHL 2020. All Rights Reserved.',
 'totalItems': 6,
 'totalEvents': 0,
 'totalGames': 6,
 'totalMatches': 0,
 'wait': 10,
 'dates': [{'date': '2020-03-01',
   'totalItems': 6,
   'totalEvents': 0,
   'totalGames': 6,
   'totalMatches': 0,
   'games': [{'gamePk': 2019021011,
     'link': '/api/v1/game/2019021011/feed/live',
     'gameType': 'R',
     'season': '20192020',
     'gameDate': '2020-03-01T17:00:00Z',
     'status': {'abstractGameState': 'Final',
      'codedGameState': '7',
      'detailedState': 'Final',
      'statusCode': '7',
      'startTimeTBD': False},
     'teams': {'away': {'leagueRecord': {'wins': 38,
        'losses': 20,
        'ot': 7,
        'type': 'league'},
       'score': 5,
       'team': {'id': 4,
        'name': 'Philadelphia Flyers',
        'link': '/api/v1/teams/4'}},
      'home': {'league

This yields the same output (and format) as what we saw earlier on the webpage itself. The first elements of the JSON output tell us there are 6 items (here, games), and the same information is provided for each game, in the same structure.

We can navigate to the data about the first game using indices and keys, although it's not particularly efficient:

In [76]:
response.json()['dates'][0]['games'][0]

{'gamePk': 2019021011,
 'link': '/api/v1/game/2019021011/feed/live',
 'gameType': 'R',
 'season': '20192020',
 'gameDate': '2020-03-01T17:00:00Z',
 'status': {'abstractGameState': 'Final',
  'codedGameState': '7',
  'detailedState': 'Final',
  'statusCode': '7',
  'startTimeTBD': False},
 'teams': {'away': {'leagueRecord': {'wins': 38,
    'losses': 20,
    'ot': 7,
    'type': 'league'},
   'score': 5,
   'team': {'id': 4,
    'name': 'Philadelphia Flyers',
    'link': '/api/v1/teams/4'}},
  'home': {'leagueRecord': {'wins': 35,
    'losses': 26,
    'ot': 4,
    'type': 'league'},
   'score': 3,
   'team': {'id': 3, 'name': 'New York Rangers', 'link': '/api/v1/teams/3'}}},
 'venue': {'id': 5054,
  'name': 'Madison Square Garden',
  'link': '/api/v1/venues/5054'},
 'content': {'link': '/api/v1/game/2019021011/content'}}

The response itself mentions a few additional API endpoints (e.g. venues, teams, game).
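Rather than chaining indices and keys by hand for each game, we could loop over the nested structure. The sketch below works on a trimmed copy of the response shown above (one game, fewer keys); `summarize_games` is a hypothetical helper name, not part of the API.

```python
# Trimmed copy of the schedule response shown above (one game, fewer keys)
schedule = {
    'totalGames': 6,
    'dates': [{
        'date': '2020-03-01',
        'games': [{
            'gamePk': 2019021011,
            'link': '/api/v1/game/2019021011/feed/live',
            'teams': {
                'away': {'score': 5,
                         'team': {'id': 4, 'name': 'Philadelphia Flyers',
                                  'link': '/api/v1/teams/4'}},
                'home': {'score': 3,
                         'team': {'id': 3, 'name': 'New York Rangers',
                                  'link': '/api/v1/teams/3'}},
            },
            'venue': {'id': 5054, 'name': 'Madison Square Garden',
                      'link': '/api/v1/venues/5054'},
        }],
    }],
}


def summarize_games(schedule):
    """Return one flat summary dict per game, across all dates in the response."""
    summaries = []
    for day in schedule['dates']:
        for game in day['games']:
            away = game['teams']['away']
            home = game['teams']['home']
            summaries.append({
                'gamePk': game['gamePk'],
                'away': away['team']['name'],
                'awayScore': away['score'],
                'home': home['team']['name'],
                'homeScore': home['score'],
                'venue': game['venue']['name'],
                'liveFeedLink': game['link'],  # path to the live feed endpoint
            })
    return summaries


for summary in summarize_games(schedule):
    print(summary)
```

The `liveFeedLink` value is the same relative path that appears under `link` in the response, which is how the schedule points to the other endpoints.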

## Parsing an API response: The live feed endpoint
### Output overview

Here we'll retrieve the game data for game ID 2019021011 (the March 1st, 2020 game between the Flyers and the Rangers).
The (long) output contains three main sections, each a nested dictionary:
1. gameData, which holds the summary info about the game (e.g. teams, lineups, location)
2. liveData, which holds all the events (e.g. period start, shot, goal) along with the relevant players & locations
3. linescore (and others), which hold post-game summary stats and game stars

Below is the event at index 10 of liveData's allPlays list (i.e. the 11th event), as an example:

In [77]:
url = 'https://statsapi.web.nhl.com/api/v1/game/2019021011/feed/live'
responseRaw = requests.get(url)
responseRaw.json()['liveData']['plays']['allPlays'][10]

{'players': [{'player': {'id': 8477948,
    'fullName': 'Travis Sanheim',
    'link': '/api/v1/people/8477948'},
   'playerType': 'Shooter'},
  {'player': {'id': 8468685,
    'fullName': 'Henrik Lundqvist',
    'link': '/api/v1/people/8468685'},
   'playerType': 'Goalie'}],
 'result': {'event': 'Shot',
  'eventCode': 'NYR82',
  'eventTypeId': 'SHOT',
  'description': 'Travis Sanheim Wrist Shot saved by Henrik Lundqvist',
  'secondaryType': 'Wrist Shot'},
 'about': {'eventIdx': 10,
  'eventId': 82,
  'period': 1,
  'periodType': 'REGULAR',
  'ordinalNum': '1st',
  'periodTime': '01:48',
  'periodTimeRemaining': '18:12',
  'dateTime': '2020-03-01T17:25:08Z',
  'goals': {'away': 0, 'home': 0}},
 'coordinates': {'x': 46.0, 'y': 14.0},
 'team': {'id': 4,
  'name': 'Philadelphia Flyers',
  'link': '/api/v1/teams/4',
  'triCode': 'PHI'}}

### Event types
Not all event types carry the same information, and therefore they don't all have the same keys in the output: compare a goal or shot with the start of a period or a penalty, for example.

We can list all the event types that occurred during this one sample game:

In [78]:
response = responseRaw.json()['liveData']['plays']['allPlays']

set([event['result']['eventTypeId'] for event in response])

{'BLOCKED_SHOT',
 'FACEOFF',
 'GAME_END',
 'GAME_OFFICIAL',
 'GAME_SCHEDULED',
 'GIVEAWAY',
 'GOAL',
 'HIT',
 'MISSED_SHOT',
 'PENALTY',
 'PERIOD_END',
 'PERIOD_OFFICIAL',
 'PERIOD_READY',
 'PERIOD_START',
 'SHOT',
 'STOP',
 'TAKEAWAY'}

And how often each occurred:

In [79]:
event_types = [event['result']['eventTypeId'] for event in response]
[[x,event_types.count(x)] for x in set(event_types)]

[['GAME_END', 1],
 ['TAKEAWAY', 13],
 ['PERIOD_END', 3],
 ['GAME_SCHEDULED', 1],
 ['STOP', 34],
 ['FACEOFF', 51],
 ['PENALTY', 12],
 ['GIVEAWAY', 18],
 ['BLOCKED_SHOT', 19],
 ['GAME_OFFICIAL', 1],
 ['PERIOD_READY', 3],
 ['PERIOD_OFFICIAL', 3],
 ['SHOT', 44],
 ['HIT', 52],
 ['GOAL', 8],
 ['MISSED_SHOT', 29],
 ['PERIOD_START', 3]]
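The `event_types.count(x)` call above rescans the whole list once per event type. As an alternative sketch, `collections.Counter` tallies everything in a single pass and can sort the result by frequency. The `sample_events` list here is a small made-up stand-in with the same structure as the allPlays list.

```python
from collections import Counter

# Made-up sample mimicking the structure of the allPlays list (assumption)
sample_events = [
    {'result': {'eventTypeId': 'FACEOFF'}},
    {'result': {'eventTypeId': 'SHOT'}},
    {'result': {'eventTypeId': 'HIT'}},
    {'result': {'eventTypeId': 'SHOT'}},
    {'result': {'eventTypeId': 'GOAL'}},
    {'result': {'eventTypeId': 'FACEOFF'}},
    {'result': {'eventTypeId': 'SHOT'}},
]

# One pass over the events, counting each event type
counts = Counter(event['result']['eventTypeId'] for event in sample_events)

# most_common() returns (event type, count) pairs, most frequent first
for event_type, count in counts.most_common():
    print(event_type, count)
```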

### Parsing the full game output (JSON to dataframe)

For each event, up to 5 top-level keys containing details about the event may be returned:
* 'players' which lists the player(s) involved in an event
* 'result' which describes the event itself (e.g. eventTypeId, description) and is populated for all event types
* 'about' which will provide information about the timing of the event (e.g. period, timestamp, score at time of event)
* 'coordinates' which will only be populated for events with an on-ice location (e.g. shots, hits, faceoffs)
* 'team' which identifies the team the event pertains to
    
Here, processing the JSON output will mean transforming it into a dataframe keeping only a subset of the keys, and populating with null where values are not available. 

For each event, we'll be interested in keeping the following information:
* eventId, eventTypeId
* period, periodTime, dateTime
* xCoordinate, yCoordinate
* teamTriCode (e.g. PHI for Philadelphia)
* player1Name, player1Type, player2Name, player2Type

The code has some very basic error handling to populate with null (or more specifically the None keyword) when the data is not available (e.g. player names when the event is beginning of period). 

In [80]:
#create empty dataframe where final output will be stored
columns = ['eventId', 'eventTypeId', 'period', 'periodTime', 'dateTime',
           'xCoordinate', 'yCoordinate', 'teamTriCode', 'player1Name',
           'player1Type', 'player2Name', 'player2Type']
gameEvents = pandas.DataFrame(columns=columns)

#loop through each element in the JSON output
for event in response:
    #create temporary one-row dataframe to store the data for each individual event
    tempData = pandas.DataFrame(columns=columns, index=range(1))

    #use .loc rather than chained indexing (tempData[col][0]) so the value is written to tempData itself
    tempData.loc[0, 'eventId'] = event['about']['eventId']
    tempData.loc[0, 'eventTypeId'] = event['result']['eventTypeId']
    tempData.loc[0, 'dateTime'] = event['about']['dateTime']
    tempData.loc[0, 'period'] = event['about']['period']
    tempData.loc[0, 'periodTime'] = event['about']['periodTime']

    #optional fields: a missing key or list position is stored as None
    try:
        tempData.loc[0, 'xCoordinate'] = event['coordinates']['x']
    except KeyError:
        tempData.loc[0, 'xCoordinate'] = None
    try:
        tempData.loc[0, 'yCoordinate'] = event['coordinates']['y']
    except KeyError:
        tempData.loc[0, 'yCoordinate'] = None
    try:
        tempData.loc[0, 'teamTriCode'] = event['team']['triCode']
    except KeyError:
        tempData.loc[0, 'teamTriCode'] = None
    try:
        tempData.loc[0, 'player1Name'] = event['players'][0]['player']['fullName']
    except (KeyError, IndexError):
        tempData.loc[0, 'player1Name'] = None
    #playerType sits next to 'player' (not inside it) in each 'players' entry
    try:
        tempData.loc[0, 'player1Type'] = event['players'][0]['playerType']
    except (KeyError, IndexError):
        tempData.loc[0, 'player1Type'] = None
    try:
        tempData.loc[0, 'player2Name'] = event['players'][1]['player']['fullName']
    except (KeyError, IndexError):
        tempData.loc[0, 'player2Name'] = None
    try:
        tempData.loc[0, 'player2Type'] = event['players'][1]['playerType']
    except (KeyError, IndexError):
        tempData.loc[0, 'player2Type'] = None

    #append temp dataframe to final one (DataFrame.append was removed in pandas 2.0)
    gameEvents = pandas.concat([gameEvents, tempData], ignore_index=True)

In [81]:
gameEvents.head()

Unnamed: 0,eventId,eventTypeId,period,periodTime,dateTime,xCoordinate,yCoordinate,teamTriCode,player1Name,player1Type,player2Name,player2Type
0,1,GAME_SCHEDULED,1,00:00,2020-03-01T16:09:34Z,,,,,,,
1,5,PERIOD_READY,1,00:00,2020-03-01T17:22:01Z,,,,,,,
2,51,PERIOD_START,1,00:00,2020-03-01T17:22:15Z,,,,,,,
3,52,FACEOFF,1,00:00,2020-03-01T17:22:15Z,0.0,0.0,PHI,Kevin Hayes,Winner,Ryan Strome,Loser
4,8,HIT,1,00:07,2020-03-01T17:22:32Z,14.0,36.0,NYR,Adam Fox,Hitter,Travis Konecny,Hittee
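
The repeated try/except blocks above could be folded into a single lookup helper. The sketch below is one possible way to do that: `dig` is a hypothetical name, the first sample event is a trimmed copy of the shot event shown earlier, and the second is a made-up event with no players, team, or coordinates.

```python
def dig(data, *path, default=None):
    """Follow a path of dict keys / list indices; return `default` if any step is missing."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Trimmed copy of the shot event shown earlier
shot = {
    'players': [
        {'player': {'id': 8477948, 'fullName': 'Travis Sanheim'}, 'playerType': 'Shooter'},
        {'player': {'id': 8468685, 'fullName': 'Henrik Lundqvist'}, 'playerType': 'Goalie'},
    ],
    'result': {'eventTypeId': 'SHOT'},
    'about': {'eventId': 82, 'period': 1, 'periodTime': '01:48'},
    'coordinates': {'x': 46.0, 'y': 14.0},
    'team': {'triCode': 'PHI'},
}

# Made-up event without players, team, or coordinates (assumption)
period_start = {
    'result': {'eventTypeId': 'PERIOD_START'},
    'about': {'eventId': 51, 'period': 1, 'periodTime': '00:00'},
}

for event in (shot, period_start):
    print(dig(event, 'result', 'eventTypeId'),
          dig(event, 'coordinates', 'x'),
          dig(event, 'team', 'triCode'),
          dig(event, 'players', 1, 'player', 'fullName'),
          dig(event, 'players', 1, 'playerType'))
```

Each optional field then takes one line instead of four, and the set of exceptions being swallowed is stated in one place.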


### Parsing the full game output (JSON to lists to dataframe)

While this works well, I read some StackOverflow comments that mentioned you should actually append to a list first, and once that process is complete convert the list to a DataFrame in one go. It mostly has to do with performance and memory usage: growing a DataFrame row by row copies the whole frame on every iteration, whereas appending to a list is cheap.

I'm not as comfortable with lists, so this might not be the best way to do that using them. I did learn along the way that lists are mutable, meaning that append modifies the list in place, so there's no need to re-assign the result back to the list.

In [82]:
#create 12 empty lists
eventId = []
eventTypeId = []
period = []
periodTime = []
dateTime = []
xCoordinate = []
yCoordinate = []
teamTriCode = []
player1Name = []
player1Type = []
player2Name = []
player2Type = []

for event in response:
    eventId.append(event['about']['eventId'])
    eventTypeId.append(event['result']['eventTypeId'])
    dateTime.append(event['about']['dateTime'])
    period.append(event['about']['period'])
    periodTime.append(event['about']['periodTime'])

    #optional fields: a missing key or list position is stored as None
    try:
        xCoordinate.append(event['coordinates']['x'])
    except KeyError:
        xCoordinate.append(None)

    try:
        yCoordinate.append(event['coordinates']['y'])
    except KeyError:
        yCoordinate.append(None)

    try:
        teamTriCode.append(event['team']['triCode'])
    except KeyError:
        teamTriCode.append(None)

    try:
        player1Name.append(event['players'][0]['player']['fullName'])
    except (KeyError, IndexError):
        player1Name.append(None)

    #playerType sits next to 'player' (not inside it) in each 'players' entry
    try:
        player1Type.append(event['players'][0]['playerType'])
    except (KeyError, IndexError):
        player1Type.append(None)

    try:
        player2Name.append(event['players'][1]['player']['fullName'])
    except (KeyError, IndexError):
        player2Name.append(None)

    try:
        player2Type.append(event['players'][1]['playerType'])
    except (KeyError, IndexError):
        player2Type.append(None)

#to dataframe
gameEvents = pandas.DataFrame(list(zip(eventId, eventTypeId, period, periodTime, dateTime,
                                       xCoordinate, yCoordinate, teamTriCode, player1Name,
                                       player1Type, player2Name, player2Type)),
                              columns = ['eventId', 'eventTypeId', 'period', 'periodTime', 'dateTime',
                                           'xCoordinate', 'yCoordinate', 'teamTriCode', 'player1Name',
                                           'player1Type', 'player2Name', 'player2Type'])

#replace NaNs by None (cast to object first, since float columns cannot hold None)
gameEvents = gameEvents.astype(object).where(gameEvents.notna(), None)

In [83]:
gameEvents.head()

Unnamed: 0,eventId,eventTypeId,period,periodTime,dateTime,xCoordinate,yCoordinate,teamTriCode,player1Name,player1Type,player2Name,player2Type
0,1,GAME_SCHEDULED,1,00:00,2020-03-01T16:09:34Z,,,,,,,
1,5,PERIOD_READY,1,00:00,2020-03-01T17:22:01Z,,,,,,,
2,51,PERIOD_START,1,00:00,2020-03-01T17:22:15Z,,,,,,,
3,52,FACEOFF,1,00:00,2020-03-01T17:22:15Z,0.0,0.0,PHI,Kevin Hayes,Winner,Ryan Strome,Loser
4,8,HIT,1,00:07,2020-03-01T17:22:32Z,14.0,36.0,NYR,Adam Fox,Hitter,Travis Konecny,Hittee


As we can see, the output looks exactly the same. We started with the JSON output from the NHL API, and ended up with a dataframe containing the data points we were interested in keeping for each event.

A superior solution, though, might be to skip the more manual post-processing and instead convert all attributes to their own columns without needing to do so explicitly for each.
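
One candidate for that is `pandas.json_normalize`, which expands nested dictionaries into dotted column names on its own. The sketch below is a rough illustration on two sample events rather than the full game: the first is a trimmed copy of the shot event shown earlier, the second is a made-up event without players, team, or coordinates.

```python
import pandas

# Two sample events: a trimmed copy of the shot event shown earlier,
# and a made-up event without players, team, or coordinates (assumption)
sample_events = [
    {
        'players': [
            {'player': {'fullName': 'Travis Sanheim'}, 'playerType': 'Shooter'},
            {'player': {'fullName': 'Henrik Lundqvist'}, 'playerType': 'Goalie'},
        ],
        'result': {'eventTypeId': 'SHOT'},
        'about': {'eventId': 82, 'period': 1, 'periodTime': '01:48'},
        'coordinates': {'x': 46.0, 'y': 14.0},
        'team': {'triCode': 'PHI'},
    },
    {
        'result': {'eventTypeId': 'PERIOD_START'},
        'about': {'eventId': 51, 'period': 1, 'periodTime': '00:00'},
    },
]

# Nested dictionaries become columns such as 'about.eventId' or 'coordinates.x';
# keys missing from an event are filled with NaN
flat = pandas.json_normalize(sample_events)
print(sorted(flat.columns))
print(flat[['about.eventId', 'result.eventTypeId', 'coordinates.x', 'team.triCode']])
```

Note that the list-valued 'players' key is left as a single column holding the list, so the player names and types would still need a separate step.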

## Parsing an API response: The venue endpoint
The venue endpoint is one of the more straightforward ones in terms of the responses it returns. Unlike the live feed response, each venue entry contains no nested lists or dictionaries. Instead, it is simply a flat dictionary.

In [84]:
url = 'https://statsapi.web.nhl.com/api/v1/venues/5046'
responseRaw = requests.get(url).json()
responseRaw
pandas.DataFrame(responseRaw['venues'][0], index=[0])

Unnamed: 0,id,name,link,appEnabled
0,5046,Honda Center,/api/v1/venues/5046,True


In [85]:
url = 'https://statsapi.web.nhl.com/api/v1/venues/'
responseRaw = requests.get(url).json()
responseRaw
pandas.DataFrame(responseRaw['venues']).head()


Unnamed: 0,id,name,link,appEnabled
0,310,NYCB Live,/api/v1/venues/310,True
1,5017,Amalie Arena,/api/v1/venues/5017,True
2,5019,American Airlines Center,/api/v1/venues/5019,True
3,5027,BB&T Center,/api/v1/venues/5027,True
4,5030,Bridgestone Arena,/api/v1/venues/5030,True
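
The first cell above passes `index=[0]` while the second does not. As a short sketch of why, the snippet below uses a hand-copied version of the Honda Center entry: a dictionary made only of scalar values gives pandas no way to infer the number of rows, whereas a list of dictionaries does.

```python
import pandas

# Hand-copied version of the single venue entry shown above
venue = {'id': 5046, 'name': 'Honda Center',
         'link': '/api/v1/venues/5046', 'appEnabled': True}

# A dict of scalars alone raises a ValueError: pandas cannot infer the row count
try:
    pandas.DataFrame(venue)
except ValueError as error:
    print("ValueError:", error)

# Option 1: supply the index explicitly, as in the cell above
from_dict = pandas.DataFrame(venue, index=[0])

# Option 2: wrap the dict in a list, so each dict becomes one row
from_list = pandas.DataFrame([venue])

print(from_dict)
print(from_list)
```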


## Parsing an API response: The player endpoint
The last endpoint we'll look at here, the player endpoint, returns a response that is slightly more complex than the venue one, but much more straightforward than the live feed.

For one player, the response contains about 20 keys, 2 of which are nested dictionaries (currentTeam and primaryPosition).

In [86]:
url = 'https://statsapi.web.nhl.com/api/v1/people/8468685'
responseRaw = requests.get(url).json()
responseRaw

{'copyright': 'NHL and the NHL Shield are registered trademarks of the National Hockey League. NHL and NHL team marks are the property of the NHL and its teams. © NHL 2020. All Rights Reserved.',
 'people': [{'id': 8468685,
   'fullName': 'Henrik Lundqvist',
   'link': '/api/v1/people/8468685',
   'firstName': 'Henrik',
   'lastName': 'Lundqvist',
   'primaryNumber': '30',
   'birthDate': '1982-03-02',
   'currentAge': 38,
   'birthCity': 'Are',
   'birthCountry': 'SWE',
   'nationality': 'SWE',
   'height': '6\' 1"',
   'weight': 182,
   'active': True,
   'alternateCaptain': False,
   'captain': False,
   'rookie': False,
   'shootsCatches': 'L',
   'rosterStatus': 'Y',
   'currentTeam': {'id': 3,
    'name': 'New York Rangers',
    'link': '/api/v1/teams/3'},
   'primaryPosition': {'code': 'G',
    'name': 'Goalie',
    'type': 'Goalie',
    'abbreviation': 'G'}}]}

If we just take the output and convert it to a pandas dataframe, the two nested dictionaries each become their own column, containing the full dict, rather than separating the underlying keys. 

In [87]:
pandas.DataFrame(responseRaw['people'])

Unnamed: 0,id,fullName,link,firstName,lastName,primaryNumber,birthDate,currentAge,birthCity,birthCountry,...,height,weight,active,alternateCaptain,captain,rookie,shootsCatches,rosterStatus,currentTeam,primaryPosition
0,8468685,Henrik Lundqvist,/api/v1/people/8468685,Henrik,Lundqvist,30,1982-03-02,38,Are,SWE,...,"6' 1""",182,True,False,False,False,L,Y,"{'id': 3, 'name': 'New York Rangers', 'link': ...","{'code': 'G', 'name': 'Goalie', 'type': 'Goali..."


For an individual player's API response, we can use the code below to process the two nested dictionaries separately and bind the resulting columns to the dataframe created in the previous step:

In [88]:
#convert full response to dataframe
playerData = pandas.DataFrame(responseRaw['people'])

#loop through all columns of the new dataframe
for column in playerData:
    #if column is a dictionary, convert to temporary dataframe and append to main one
    if isinstance(playerData[column][0], dict):
        tempData = pandas.DataFrame(playerData[column][0], index=[0])
        tempData.columns = [column + '_' + string for string in tempData.columns]
        playerData = pandas.concat([playerData.reset_index(drop=True), tempData], axis=1)

playerData

Unnamed: 0,id,fullName,link,firstName,lastName,primaryNumber,birthDate,currentAge,birthCity,birthCountry,...,rosterStatus,currentTeam,primaryPosition,currentTeam_id,currentTeam_name,currentTeam_link,primaryPosition_code,primaryPosition_name,primaryPosition_type,primaryPosition_abbreviation
0,8468685,Henrik Lundqvist,/api/v1/people/8468685,Henrik,Lundqvist,30,1982-03-02,38,Are,SWE,...,Y,"{'id': 3, 'name': 'New York Rangers', 'link': ...","{'code': 'G', 'name': 'Goalie', 'type': 'Goali...",3,New York Rangers,/api/v1/teams/3,G,Goalie,Goalie,G
