I will be examining NHL skater season statistics from the 2003-2004 season through the current 2024-2025 season.

## Scraping the Data from Elite Prospects with the TopDownHockey Scraper


In [42]:
import requests
import json
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import TopDownHockey_Scraper.TopDownHockey_EliteProspects_Scraper as tdhepscrape
from unidecode import unidecode

#### Setting the league and seasons we are pulling data from

In [43]:
# I would like NHL data for every season from 2003-2004 through the current 2024-2025 season
league = ["nhl"]
season = ["2003-2004", "2004-2005", "2005-2006", "2006-2007", "2007-2008", "2008-2009", "2009-2010", "2010-2011", "2011-2012", "2012-2013", "2013-2014", "2014-2015", "2015-2016", "2016-2017", "2017-2018", "2018-2019", "2019-2020", "2020-2021", "2021-2022", "2022-2023", "2023-2024", "2024-2025"]
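
The season list can also be generated instead of typed out by hand. This is a minimal sketch, assuming the scraper only needs labels in the same `YYYY-YYYY` format used above; `build_seasons` is a hypothetical helper of my own, not part of the scraper package.

```python
def build_seasons(first_start_year, last_start_year):
    """Return labels such as '2003-2004' for each starting year in the range (inclusive)."""
    return [f"{year}-{year + 1}" for year in range(first_start_year, last_start_year + 1)]

# Same 22 seasons as the hand-written list
season = build_seasons(2003, 2024)
print(len(season), season[0], season[-1])
```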

In [44]:
# I will be using the TopDownHockey_EliteProspects_Scraper to get the data
nhl_since_2003 = tdhepscrape.get_skaters(league, season)
nhl_since_2003.head()

Your scrape request is skater data from the following leagues:
['nhl']
In the following seasons:
2003-2004, 2004-2005, 2005-2006, 2006-2007, 2007-2008, 2008-2009, 2009-2010, 2010-2011, 2011-2012, 2012-2013, 2013-2014, 2014-2015, 2015-2016, 2016-2017, 2017-2018, 2018-2019, 2019-2020, 2020-2021, 2021-2022, 2022-2023, 2023-2024 and 2024-2025
Beginning scrape of nhl skater data from 2003-2004.
Successfully scraped all nhl skater data from 2003-2004.
Beginning scrape of nhl skater data from 2004-2005.
Successfully scraped all nhl skater data from 2004-2005.
Beginning scrape of nhl skater data from 2005-2006.
Successfully scraped all nhl skater data from 2005-2006.
Beginning scrape of nhl skater data from 2006-2007.
Successfully scraped all nhl skater data from 2006-2007.
Beginning scrape of nhl skater data from 2007-2008.
Successfully scraped all nhl skater data from 2007-2008.
Beginning scrape of nhl skater data from 2008-2009.
Successfully scraped all nhl skater data from 2008-2009.
Begin

Unnamed: 0,player,team,gp,g,a,tp,ppg,pim,+/-,link,season,league,playername,position
0,Martin St-Louis (RW),Tampa Bay Lightning,82,38,56,94,1.15,24,35,https://www.eliteprospects.com/player/8772/mar...,2003-2004,nhl,Martin St-Louis (RW),RW
1,Ilya Kovalchuk (LW),Atlanta Thrashers,81,41,46,87,1.07,63,-10,https://www.eliteprospects.com/player/3660/ily...,2003-2004,nhl,Ilya Kovalchuk (LW),LW
2,Joe Sakic (C),Colorado Avalanche,81,33,54,87,1.07,42,11,https://www.eliteprospects.com/player/8862/joe...,2003-2004,nhl,Joe Sakic (C),C
3,Markus Näslund (LW),Vancouver Canucks,78,35,49,84,1.08,58,24,https://www.eliteprospects.com/player/697/mark...,2003-2004,nhl,Markus Näslund (LW),LW
4,Marián Hossa (RW),Ottawa Senators,81,36,46,82,1.01,46,4,https://www.eliteprospects.com/player/4720/mar...,2003-2004,nhl,Marián Hossa (RW),RW


In [45]:
nhl_since_2003.columns

Index(['player', 'team', 'gp', 'g', 'a', 'tp', 'ppg', 'pim', '+/-', 'link',
       'season', 'league', 'playername', 'position'],
      dtype='object')

## Removing irrelevant columns

In [46]:
# 'player' duplicates 'playername', and 'position' will be re-derived from 'playername' later, so drop both
nhl_since_2003 = nhl_since_2003.drop(['player', 'position'], axis = 1)
nhl_since_2003

Unnamed: 0,team,gp,g,a,tp,ppg,pim,+/-,link,season,league,playername
0,Tampa Bay Lightning,82,38,56,94,1.15,24,35,https://www.eliteprospects.com/player/8772/mar...,2003-2004,nhl,Martin St-Louis (RW)
1,Atlanta Thrashers,81,41,46,87,1.07,63,-10,https://www.eliteprospects.com/player/3660/ily...,2003-2004,nhl,Ilya Kovalchuk (LW)
2,Colorado Avalanche,81,33,54,87,1.07,42,11,https://www.eliteprospects.com/player/8862/joe...,2003-2004,nhl,Joe Sakic (C)
3,Vancouver Canucks,78,35,49,84,1.08,58,24,https://www.eliteprospects.com/player/697/mark...,2003-2004,nhl,Markus Näslund (LW)
4,Ottawa Senators,81,36,46,82,1.01,46,4,https://www.eliteprospects.com/player/4720/mar...,2003-2004,nhl,Marián Hossa (RW)
...,...,...,...,...,...,...,...,...,...,...,...,...
19939,Carolina Hurricanes,-,-,-,-,-,-,-,https://www.eliteprospects.com/player/10967/je...,2024-2025,nhl,Jesper Fast (RW)
19940,Colorado Avalanche,-,-,-,-,-,-,-,https://www.eliteprospects.com/player/10393/ga...,2024-2025,nhl,Gabriel Landeskog (LW/RW)
19941,Washington Capitals,-,-,-,-,-,-,-,https://www.eliteprospects.com/player/9209/t.j...,2024-2025,nhl,T.J. Oshie (RW)
19942,Toronto Maple Leafs,-,-,-,-,-,-,-,https://www.eliteprospects.com/player/6020/cal...,2024-2025,nhl,Calle Järnkrok (C/W)


Notice how specific the lookup has to be: currently a player can only be located by the playername value including the parenthesized positions. This is not ideal. We want to locate a player by name alone and filter by position separately. To get there, we will move the parenthesized positions out of the playername column into a new 'position' column and remove the parentheses from playername. We will do this in the steps that follow.

In [47]:
# Currently, to locate a player, we need to use the playername with parenthesized positions
nhl_since_2003.loc[(nhl_since_2003['playername'] == 'Nathan MacKinnon (C/RW)')]


Unnamed: 0,team,gp,g,a,tp,ppg,pim,+/-,link,season,league,playername
8900,Colorado Avalanche,82,24,39,63,0.77,26,20,https://www.eliteprospects.com/player/99204/na...,2013-2014,nhl,Nathan MacKinnon (C/RW)
9926,Colorado Avalanche,64,14,24,38,0.59,34,-7,https://www.eliteprospects.com/player/99204/na...,2014-2015,nhl,Nathan MacKinnon (C/RW)
10728,Colorado Avalanche,72,21,31,52,0.72,20,-4,https://www.eliteprospects.com/player/99204/na...,2015-2016,nhl,Nathan MacKinnon (C/RW)
11640,Colorado Avalanche,82,16,37,53,0.65,16,-14,https://www.eliteprospects.com/player/99204/na...,2016-2017,nhl,Nathan MacKinnon (C/RW)
12471,Colorado Avalanche,74,39,58,97,1.31,55,11,https://www.eliteprospects.com/player/99204/na...,2017-2018,nhl,Nathan MacKinnon (C/RW)
13375,Colorado Avalanche,82,41,58,99,1.21,34,20,https://www.eliteprospects.com/player/99204/na...,2018-2019,nhl,Nathan MacKinnon (C/RW)
14295,Colorado Avalanche,69,35,58,93,1.35,12,13,https://www.eliteprospects.com/player/99204/na...,2019-2020,nhl,Nathan MacKinnon (C/RW)
15206,Colorado Avalanche,48,20,45,65,1.35,37,22,https://www.eliteprospects.com/player/99204/na...,2020-2021,nhl,Nathan MacKinnon (C/RW)
16152,Colorado Avalanche,65,32,56,88,1.35,42,22,https://www.eliteprospects.com/player/99204/na...,2021-2022,nhl,Nathan MacKinnon (C/RW)
17159,Colorado Avalanche,71,42,69,111,1.56,30,29,https://www.eliteprospects.com/player/99204/na...,2022-2023,nhl,Nathan MacKinnon (C/RW)


## Extracting the positions played from the playername column
This lets us include players who play two positions in our analysis

In [48]:
# Extract the positions between the parentheses and move them to a new column called 'position'
nhl_since_2003['position'] = nhl_since_2003['playername'].str.extract(r'\((.*?)\)')
# Remove the parentheses and their contents from the playername column
nhl_since_2003['playername'] = nhl_since_2003['playername'].str.replace(r'\(.*?\)', '', regex=True)
# Once the leftover trailing space is stripped, a player can be located by name alone
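
To see what the two regular expressions do, here is a small sketch on a two-row sample built from names that appear in the output above. The `sample` frame is only for illustration. Note that the replacement removes the parentheses but leaves the space that preceded them.

```python
import pandas as pd

sample = pd.DataFrame({"playername": ["Nathan MacKinnon (C/RW)", "Joe Sakic (C)"]})

# Capture the text between the first pair of parentheses
sample["position"] = sample["playername"].str.extract(r"\((.*?)\)", expand=False)
# Remove the parenthesized part; the space in front of it is left behind
sample["playername"] = sample["playername"].str.replace(r"\(.*?\)", "", regex=True)

print(sample["playername"].tolist())
print(sample["position"].tolist())
```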

##### Some coaches allow their players to try different positions on the ice, so I am choosing to separate those who do into two position columns, position1 and position2

In [49]:
# Using the position column, we are going to split the position into two columns: 'position1' and 'position2'
# We will use the split function to split the position column into two columns

nhl_since_2003[['position1', 'position2']] = nhl_since_2003['position'].str.split('/', expand=True)
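
The two-column assignment above works as long as no row lists more than two positions, which I have not checked across the full scrape. A hedged sketch of a more defensive variant: passing `n=1` to `str.split` caps the result at two columns, so a value with two slashes would keep its remainder in the second column instead of raising an error. The `positions` series is a made-up sample.

```python
import pandas as pd

positions = pd.Series(["RW", "LW/RW", "C/W"])

# n=1 splits on the first slash only, so there are never more than two columns
split = positions.str.split("/", n=1, expand=True)
split.columns = ["position1", "position2"]

# Single-position players get a missing value in position2
print(split)
```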

## Replacing Null Values
Several players have not yet reached the ice this year because of injury. Instead of leaving their statistics blank, I want to replace the respective categories with zero.

In [50]:
# Replace every blank ('-') value with 0. This applies to the whole DataFrame, so it covers 'gp', 'g', 'a', 'tp', 'ppg', 'pim' and '+/-'

nhl_since_2003 = nhl_since_2003.replace('-', 0)
nhl_since_2003

Unnamed: 0,team,gp,g,a,tp,ppg,pim,+/-,link,season,league,playername,position,position1,position2
0,Tampa Bay Lightning,82,38,56,94,1.15,24,35,https://www.eliteprospects.com/player/8772/mar...,2003-2004,nhl,Martin St-Louis,RW,RW,
1,Atlanta Thrashers,81,41,46,87,1.07,63,-10,https://www.eliteprospects.com/player/3660/ily...,2003-2004,nhl,Ilya Kovalchuk,LW,LW,
2,Colorado Avalanche,81,33,54,87,1.07,42,11,https://www.eliteprospects.com/player/8862/joe...,2003-2004,nhl,Joe Sakic,C,C,
3,Vancouver Canucks,78,35,49,84,1.08,58,24,https://www.eliteprospects.com/player/697/mark...,2003-2004,nhl,Markus Näslund,LW,LW,
4,Ottawa Senators,81,36,46,82,1.01,46,4,https://www.eliteprospects.com/player/4720/mar...,2003-2004,nhl,Marián Hossa,RW,RW,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19939,Carolina Hurricanes,0,0,0,0,0,0,0,https://www.eliteprospects.com/player/10967/je...,2024-2025,nhl,Jesper Fast,RW,RW,
19940,Colorado Avalanche,0,0,0,0,0,0,0,https://www.eliteprospects.com/player/10393/ga...,2024-2025,nhl,Gabriel Landeskog,LW/RW,LW,RW
19941,Washington Capitals,0,0,0,0,0,0,0,https://www.eliteprospects.com/player/9209/t.j...,2024-2025,nhl,T.J. Oshie,RW,RW,
19942,Toronto Maple Leafs,0,0,0,0,0,0,0,https://www.eliteprospects.com/player/6020/cal...,2024-2025,nhl,Calle Järnkrok,C/W,C,W
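
If the scraper returns the statistics as text (an assumption, I have not inspected the dtypes here), the columns will not behave as numbers in sums or averages even after the dashes are replaced. A minimal sketch of an alternative that converts and fills in one pass, shown on a made-up sample: `pd.to_numeric` with `errors="coerce"` turns the lone `-` into a missing value while a real negative such as `-10` is parsed normally.

```python
import pandas as pd

# Illustrative rows only; values are stored as strings on purpose
sample = pd.DataFrame({"gp": ["82", "-"], "ppg": ["1.15", "-"], "+/-": ["-10", "-"]})
stat_cols = ["gp", "ppg", "+/-"]

# Unparseable entries (the lone "-") become NaN, then are filled with 0
clean = sample[stat_cols].apply(pd.to_numeric, errors="coerce").fillna(0)

print(clean.dtypes)
print(clean)
```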


To avoid redundancy, I am going to delete the initial position column, since position1 and position2 now hold the same information

In [51]:
nhl_since_2003 = nhl_since_2003.drop(columns= ['position'])

## Removing Padding and Extra Spaces

Removing the parenthesized positions left a trailing space after each name, so an exact-match lookup with loc fails unless that space is included. We could account for the padding in every lookup, but I prefer to strip the extra characters from the dataset.

In [52]:
# Strip the whitespace that trails each player's name in the playername column

nhl_since_2003['playername'] = nhl_since_2003['playername'].str.strip()

In [53]:
nhl_since_2003.loc[(nhl_since_2003['playername'] == 'Sebastian Aho')]

Unnamed: 0,team,gp,g,a,tp,ppg,pim,+/-,link,season,league,playername,position1,position2
11663,Carolina Hurricanes,82,24,25,49,0.6,26,-1,https://www.eliteprospects.com/player/152111/s...,2016-2017,nhl,Sebastian Aho,C,W
12514,Carolina Hurricanes,78,29,36,65,0.83,24,4,https://www.eliteprospects.com/player/152111/s...,2017-2018,nhl,Sebastian Aho,C,W
13143,New York Islanders,22,1,3,4,0.18,6,-5,https://www.eliteprospects.com/player/67208/se...,2017-2018,nhl,Sebastian Aho,D,
13389,Carolina Hurricanes,82,30,53,83,1.01,26,25,https://www.eliteprospects.com/player/152111/s...,2018-2019,nhl,Sebastian Aho,C,W
14310,Carolina Hurricanes,68,38,28,66,0.97,26,10,https://www.eliteprospects.com/player/152111/s...,2019-2020,nhl,Sebastian Aho,C,W
15214,Carolina Hurricanes,56,24,33,57,1.02,32,16,https://www.eliteprospects.com/player/152111/s...,2020-2021,nhl,Sebastian Aho,C,W
15890,New York Islanders,3,1,1,2,0.67,2,-1,https://www.eliteprospects.com/player/67208/se...,2020-2021,nhl,Sebastian Aho,D,
16164,Carolina Hurricanes,79,37,44,81,1.03,38,18,https://www.eliteprospects.com/player/152111/s...,2021-2022,nhl,Sebastian Aho,C,W
16694,New York Islanders,36,2,10,12,0.33,10,-6,https://www.eliteprospects.com/player/67208/se...,2021-2022,nhl,Sebastian Aho,D,
17218,Carolina Hurricanes,75,36,31,67,0.89,42,8,https://www.eliteprospects.com/player/152111/s...,2022-2023,nhl,Sebastian Aho,C,W
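
The lookup above returns two different players who share the name Sebastian Aho: a Carolina forward and a New York Islanders defenseman, each with a different numeric ID in the link column. A name alone is therefore not a unique key. The sketch below shows one way to tell them apart by pulling the ID out of the URL. The `player_id` column name is my own, and the URL slugs are placeholders because the real links are truncated in the output.

```python
import pandas as pd

# The numeric IDs come from the output above; the slugs after them are placeholders
sample = pd.DataFrame({
    "playername": ["Sebastian Aho", "Sebastian Aho"],
    "team": ["Carolina Hurricanes", "New York Islanders"],
    "link": [
        "https://www.eliteprospects.com/player/152111/placeholder-slug",
        "https://www.eliteprospects.com/player/67208/placeholder-slug",
    ],
})

# Extract the digits between '/player/' and the next slash
sample["player_id"] = sample["link"].str.extract(r"/player/(\d+)/", expand=False)

# More than one ID per name means the name is shared by different players
ids_per_name = sample.groupby("playername")["player_id"].nunique()
print(ids_per_name)
```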


In [54]:
# Use unidecode to transliterate accented characters in the 'playername' column to plain ASCII,
# so that names can be typed without accents when locating a player

nhl_since_2003['playername'] = nhl_since_2003['playername'].apply(unidecode)
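
For comparison, here is a standard-library sketch that handles the accented names seen in this dataset without the unidecode dependency. It is a different technique from the one used above: it decomposes each character with `unicodedata` and drops the combining marks. `strip_accents` is a hypothetical helper name. Unlike unidecode, this approach leaves letters that have no decomposition (for example `ø`) unchanged, so unidecode remains the broader option.

```python
import unicodedata

def strip_accents(text):
    """Drop combining marks after NFKD decomposition (a stdlib stand-in, not unidecode)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

names = ["Markus Näslund", "Marián Hossa", "Calle Järnkrok"]
plain_names = [strip_accents(name) for name in names]
print(plain_names)
```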

#### Trying a player name that contains accented characters to test the above code - Teuvo Teravainen

In [55]:
nhl_since_2003.loc[nhl_since_2003['playername'] == 'Teuvo Teravainen']

Unnamed: 0,team,gp,g,a,tp,ppg,pim,+/-,link,season,league,playername,position1,position2
9710,Chicago Blackhawks,3,0,0,0,0.0,0,0,https://www.eliteprospects.com/player/44567/te...,2013-2014,nhl,Teuvo Teravainen,C,W
10293,Chicago Blackhawks,34,4,5,9,0.26,2,4,https://www.eliteprospects.com/player/44567/te...,2014-2015,nhl,Teuvo Teravainen,C,W
10860,Chicago Blackhawks,78,13,22,35,0.45,20,-2,https://www.eliteprospects.com/player/44567/te...,2015-2016,nhl,Teuvo Teravainen,C,W
11709,Carolina Hurricanes,81,15,27,42,0.52,16,-6,https://www.eliteprospects.com/player/44567/te...,2016-2017,nhl,Teuvo Teravainen,C,W
12523,Carolina Hurricanes,82,23,41,64,0.78,14,8,https://www.eliteprospects.com/player/44567/te...,2017-2018,nhl,Teuvo Teravainen,C,W
13404,Carolina Hurricanes,82,21,55,76,0.93,12,30,https://www.eliteprospects.com/player/44567/te...,2018-2019,nhl,Teuvo Teravainen,C,W
14319,Carolina Hurricanes,68,15,48,63,0.93,8,20,https://www.eliteprospects.com/player/44567/te...,2019-2020,nhl,Teuvo Teravainen,C,W
15545,Carolina Hurricanes,21,5,10,15,0.71,4,5,https://www.eliteprospects.com/player/44567/te...,2020-2021,nhl,Teuvo Teravainen,C,W
16199,Carolina Hurricanes,77,22,43,65,0.84,24,22,https://www.eliteprospects.com/player/44567/te...,2021-2022,nhl,Teuvo Teravainen,C,W
17378,Carolina Hurricanes,68,12,25,37,0.54,16,11,https://www.eliteprospects.com/player/44567/te...,2022-2023,nhl,Teuvo Teravainen,C,W


In [56]:
# Reordering the columns so they are more easily readable. This is an NHL-only dataset,
# so the league column is dropped as well (the column selection below simply leaves it out)
nhl_since_2003 = nhl_since_2003[['playername', 'season', 'position1','position2', 'team', 'gp', 'g', 'a', 'tp', 'ppg', 'pim', '+/-', 'link']]

In [57]:
nhl_since_2003

Unnamed: 0,playername,season,position1,position2,team,gp,g,a,tp,ppg,pim,+/-,link
0,Martin St-Louis,2003-2004,RW,,Tampa Bay Lightning,82,38,56,94,1.15,24,35,https://www.eliteprospects.com/player/8772/mar...
1,Ilya Kovalchuk,2003-2004,LW,,Atlanta Thrashers,81,41,46,87,1.07,63,-10,https://www.eliteprospects.com/player/3660/ily...
2,Joe Sakic,2003-2004,C,,Colorado Avalanche,81,33,54,87,1.07,42,11,https://www.eliteprospects.com/player/8862/joe...
3,Markus Naslund,2003-2004,LW,,Vancouver Canucks,78,35,49,84,1.08,58,24,https://www.eliteprospects.com/player/697/mark...
4,Marian Hossa,2003-2004,RW,,Ottawa Senators,81,36,46,82,1.01,46,4,https://www.eliteprospects.com/player/4720/mar...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
19939,Jesper Fast,2024-2025,RW,,Carolina Hurricanes,0,0,0,0,0,0,0,https://www.eliteprospects.com/player/10967/je...
19940,Gabriel Landeskog,2024-2025,LW,RW,Colorado Avalanche,0,0,0,0,0,0,0,https://www.eliteprospects.com/player/10393/ga...
19941,T.J. Oshie,2024-2025,RW,,Washington Capitals,0,0,0,0,0,0,0,https://www.eliteprospects.com/player/9209/t.j...
19942,Calle Jarnkrok,2024-2025,C,W,Toronto Maple Leafs,0,0,0,0,0,0,0,https://www.eliteprospects.com/player/6020/cal...


In [58]:
# Putting our final cleaned dataset into a CSV file
nhl_since_2003.to_csv('nhl_since_2003.csv')
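
One thing to be aware of when reloading this file: `to_csv` writes the row index as an unnamed first column by default, which comes back as `Unnamed: 0` on `read_csv`. The sketch below compares the default with `index=False` on a one-row sample, using an in-memory buffer instead of a file so nothing is written to disk. Whether to keep the index is a judgment call; the original row numbers may still be useful.

```python
import io
import pandas as pd

sample = pd.DataFrame({"playername": ["Joe Sakic"], "gp": [81]})

# Default behaviour: the index is written as an extra, unnamed column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False keeps only the real columns
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```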