<h1>Player Fetcher</h1>
<p>This code scrapes a list of NHL players from the hockey-reference website.</p>

In [1]:
import pandas as pd
import numpy as np
import requests as r
import re
from string import ascii_lowercase as al #these are the lowercase ascii letters
from bs4 import BeautifulSoup as bs

<h2>Gathering the Players</h2>
<p>Before we gather any data, we need to know which players we are gathering it for. The following code sweeps through the player index on <a href="https://hockey-reference.com">hockey-reference.com</a>, capturing all NHL players regardless of the years they played.</p>

In [10]:
#Main Url
url = "https://www.hockey-reference.com/players/"

In [36]:
columns = ['player','unique_id','year_start','year_finish','position','link']
rows = [] #one dict per player; DataFrame.append was removed in pandas 2.0, so the frame is built once at the end

for letter in al: #for each letter in al
    try:
        t = r.get(url+letter) #index page of players whose surname starts with letter
        soup = bs(t.text, 'html.parser')
        players = soup.find_all('p', class_='nhl') #only get the nhl players
        for i,player in enumerate(players):
            plink = player.find_all('a',href=True)
            unique_id = re.search(r'\/([\w\d\.]*)\.html',plink[0]['href'])
            play = re.search(r'([\w\.\-\s]*) \((\d{4})-(\d{4}), ([\w\b]*)',player.text)
            inputdata = [play[1],unique_id[1],play[2],play[3],play[4],'https://www.hockey-reference.com'+plink[0]['href']]
            #print(i,inputdata)
            rows.append(dict(zip(columns, inputdata)))
        #i is the last zero-based index, so the true count is i+1; i is stale when a page lists no nhl players
        print("Finished adding %d players for letter %s" % (i,letter))
    except Exception as exc: #report the failing letter instead of hiding the error
        print("Failed on letter %s: %r" % (letter, exc))

nhlplayers = pd.DataFrame(rows, columns=columns)

Finished adding 224 players for letter a
Finished adding 760 players for letter b
Finished adding 530 players for letter c
Finished adding 373 players for letter d
Finished adding 132 players for letter e
Finished adding 270 players for letter f
Finished adding 397 players for letter g
Finished adding 555 players for letter h
Finished adding 33 players for letter i
Finished adding 209 players for letter j
Finished adding 397 players for letter k
Finished adding 494 players for letter l
Finished adding 850 players for letter m
Finished adding 169 players for letter n
Finished adding 119 players for letter o
Finished adding 422 players for letter p
Finished adding 16 players for letter q
Finished adding 385 players for letter r
Finished adding 805 players for letter s
Finished adding 276 players for letter t
Finished adding 13 players for letter u
Finished adding 140 players for letter v
Finished adding 288 players for letter w
Finished adding 288 players for letter x
Finished adding 42 
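<p>To see what the two regular expressions in the loop extract, here is a minimal sketch that applies them to a single entry without touching the network. The helper name <code>parse_player</code> and the sample text and href are assumptions made up for illustration; they are shaped like what the patterns expect, not copied from the live page.</p>

```python
import re

# The two patterns are copied from the scraping loop above.
ID_PATTERN = re.compile(r'\/([\w\d\.]*)\.html')
PLAYER_PATTERN = re.compile(r'([\w\.\-\s]*) \((\d{4})-(\d{4}), ([\w\b]*)')

def parse_player(text, href):
    """Hypothetical helper: return one player's fields, or None if a pattern fails."""
    id_match = ID_PATTERN.search(href)
    play = PLAYER_PATTERN.search(text)
    if id_match is None or play is None:
        return None
    return {
        'player': play[1],
        'unique_id': id_match[1],
        'year_start': play[2],    # still a string at this point
        'year_finish': play[3],   # still a string at this point
        'position': play[4],
        'link': 'https://www.hockey-reference.com' + href,
    }

# Assumed sample entry, shaped like the text and href the patterns expect.
sample = parse_player("Wendell Young (1986-1995, G)", "/players/y/youngwe01.html")
print(sample)

# An entry without a "(YYYY-YYYY, POS" part is skipped rather than raising.
skipped = parse_player("no years here", "/players/y/youngwe01.html")
print(skipped)
```

<p>Returning <code>None</code> for an unparseable entry means one odd line does not abort the rest of a letter page, which is one possible design and not necessarily how the loop above behaves.</p>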

<h2>Filtering the Players</h2>
<p>The scrape above already kept only NHL players; we now refine the list further by removing players whose careers ended before game logs became available. Initially, I'll keep only players whose final season was 1990 or later, in the hope that I can get injury data for them. This filtering drops the number of available players from 7983 to 4842. For simplicity, I'll also reduce the positions to three: Forward, Defense, or Goalie.</p>

In [43]:
#convert datatypes as appropriate
nhlplayers[['year_start','year_finish']] = nhlplayers[['year_start','year_finish']].astype(int)

In [44]:
nhlplayers.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4842 entries, 0 to 7982
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   player       4842 non-null   object
 1   unique_id    4842 non-null   object
 2   year_start   4842 non-null   int32 
 3   year_finish  4842 non-null   int32 
 4   position     4842 non-null   object
 5   link         4842 non-null   object
dtypes: int32(2), object(4)
memory usage: 227.0+ KB


In [45]:
#drop players who finished before 1990
nhlplayers.drop(nhlplayers[nhlplayers['year_finish']<1990].index,inplace=True)

In [46]:
nhlplayers.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4842 entries, 0 to 7982
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   player       4842 non-null   object
 1   unique_id    4842 non-null   object
 2   year_start   4842 non-null   int32 
 3   year_finish  4842 non-null   int32 
 4   position     4842 non-null   object
 5   link         4842 non-null   object
dtypes: int32(2), object(4)
memory usage: 227.0+ KB
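<p>The year filter can be checked on a tiny frame before trusting it on the full list. The sketch below is an assumption-laden illustration: the three rows are toy data (the "Old Timer" row is invented), and it uses a boolean mask with <code>.copy()</code> instead of the in-place <code>drop</code> used above, so the before and after counts can be compared.</p>

```python
import pandas as pd

# Toy data: years arrive as strings, as they do from the regex groups.
players = pd.DataFrame({
    'player': ['Old Timer', 'Wendell Young', 'Jake Allen'],  # 'Old Timer' is invented
    'year_start': ['1950', '1986', '2013'],
    'year_finish': ['1960', '1995', '2020'],
})

# Convert the year columns to integers so they can be compared numerically.
players = players.astype({'year_start': int, 'year_finish': int})

# Keep players whose final season was 1990 or later.
recent = players[players['year_finish'] >= 1990].copy()
dropped = len(players) - len(recent)

print(len(players), len(recent), dropped)
```

<p>Keeping the original frame around makes it easy to report how many rows the filter removed.</p>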


In [47]:
nhlplayers['position'].unique()

array(['C', 'LW', 'RW', 'G', 'D', 'F', 'W'], dtype=object)

In [48]:
position_remap = {'C':'F','LW':'F','RW':'F','D':'D','G':'G','W':'F'}
nhlplayers['position'] = nhlplayers['position'].replace(position_remap)
nhlplayers['position'].unique()

array(['F', 'G', 'D'], dtype=object)
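<p>A quick sanity check can confirm that every raw position collapses into one of the three groups. This is a minimal sketch on a standalone Series holding the seven position codes listed above; the name <code>unexpected</code> is introduced only for illustration.</p>

```python
import pandas as pd

# The seven raw position codes reported by unique() above.
positions = pd.Series(['C', 'LW', 'RW', 'G', 'D', 'F', 'W'])

# Same mapping as in the notebook; 'F' has no entry and is left unchanged by replace.
position_remap = {'C': 'F', 'LW': 'F', 'RW': 'F', 'D': 'D', 'G': 'G', 'W': 'F'}
collapsed = positions.replace(position_remap)

# Anything outside the three target groups would indicate an unmapped code.
unexpected = set(collapsed) - {'F', 'D', 'G'}
print(sorted(collapsed.unique()), unexpected)
```

<p>If the site ever introduces a new position code, <code>unexpected</code> would be non-empty and the mapping would need a new entry.</p>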

In [50]:
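#write the goalies to a tab-separated file; sep is passed by keyword because newer pandas versions deprecate passing it positionally
nhlplayers[nhlplayers['position']=='G'].reset_index().to_csv('nhlplayerlist_goalieonly.txt',sep='\t')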
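<p>To check that a tab-separated export reads back cleanly, a round trip through an in-memory buffer is enough. This sketch assumes a two-row toy frame and writes with <code>index=False</code>, which differs from the cell above (that cell keeps the index columns); use it only if the extra index columns are not needed.</p>

```python
import io
import pandas as pd

# Toy goalie rows using values shown elsewhere in this notebook.
goalies = pd.DataFrame({
    'player': ['Jake Allen', 'Wendell Young'],
    'unique_id': ['allenja01', 'youngwe01'],
    'year_start': [2013, 1986],
    'year_finish': [2020, 1995],
    'position': ['G', 'G'],
})

# Write tab-separated text to memory instead of a file on disk.
buffer = io.StringIO()
goalies.to_csv(buffer, sep='\t', index=False)

# Read it back with the same separator.
buffer.seek(0)
restored = pd.read_csv(buffer, sep='\t')
print(restored)
```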

In [51]:
nhlplayers[nhlplayers['position']=='G']

Unnamed: 0,player,unique_id,year_start,year_finish,position,link
31,David Aebischer,aebisda01,2001,2008,G,https://www.hockey-reference.com/players/a/aeb...
49,Sami Aittokallio,aittosa01,2013,2014,G,https://www.hockey-reference.com/players/a/ait...
70,Jake Allen,allenja01,2013,2020,G,https://www.hockey-reference.com/players/a/all...
85,Jorge Alves,alvesjo01,2017,2017,G,https://www.hockey-reference.com/players/a/alv...
92,Frederik Andersen,anderfr01,2014,2020,G,https://www.hockey-reference.com/players/a/and...
...,...,...,...,...,...,...
7904,Allen York,yorkal01,2012,2012,G,https://www.hockey-reference.com/players/y/yor...
7917,Wendell Young,youngwe01,1986,1995,G,https://www.hockey-reference.com/players/y/you...
7923,Matt Zaba,zabama01,2010,2010,G,https://www.hockey-reference.com/players/z/zab...
7941,Jeff Zatkoff,zatkoje01,2014,2017,G,https://www.hockey-reference.com/players/z/zat...
