# NHL Project

### NHL Player & Goalie Analysis

In [1]:
# import our tools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

First I'm going to import the goalie stats. I'll import one season, 2010-2011, to see what the file looks like and how it needs to be imported. 

After this step, I'll write a for loop to open each season from 2010-2011 to 2019-2020 and concatenate the files into one dataframe.

## NHL Goalies 2010 - 2019 Stats

In [62]:
# import one season
df = pd.read_csv('2010_2019_goalie_season_stats/2010-goalie.csv', engine='python', sep=',', header=1)

# check
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87 entries, 0 to 86
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Player Name  87 non-null     object 
 1   Team         87 non-null     object 
 2   Games        87 non-null     int64  
 3   W            87 non-null     int64  
 4   L            87 non-null     int64  
 5   OTL          87 non-null     int64  
 6   GAA          87 non-null     float64
 7   GA           87 non-null     int64  
 8   SA           87 non-null     object 
 9   SV           87 non-null     object 
 10  SV%          87 non-null     float64
 11  SO           87 non-null     int64  
 12  MIN          87 non-null     object 
dtypes: float64(2), int64(6), object(5)
memory usage: 9.0+ KB


There is a mix of numeric and non-numeric columns, but every column except `Player Name` and `Team` should be numeric. `SA`, `SV` and `MIN` were read as `object`, so I'll convert them to numeric once I've imported every season.
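One possible shortcut, sketched here on made-up inline data rather than the real files: `read_csv` accepts a `thousands` argument, so comma-grouped counts such as `"1,953"` can be parsed as integers at import time. The sample rows, the placeholder names, and the assumption that the raw files quote these values are illustrative only.

```python
import io

import pandas as pd

# Hypothetical two-row sample; the real files may be laid out differently.
raw = io.StringIO(
    'Player Name,Team,SA,SV,MIN\n'
    'Goalie A,AAA,"1,953","1,823","3,977"\n'
    'Goalie B,BBB,"1,807","1,667","3,851"\n'
)

# thousands=',' tells pandas to treat the comma as a digit-group separator
sample = pd.read_csv(raw, thousands=',')
print(sample.dtypes)
```

If this works on the real files, the later string-replacement step would no longer be needed for these columns.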

In [108]:
# import all files into one dataframe. 
import datetime as dt
import glob

# set file path
path = r'/Users/lizzy/Desktop/nhl_project/2010_2019_goalie_season_stats/'
# path already ends with a slash, so the pattern needs no leading one
all_files = glob.glob(path + "*.csv")

# create empty list 
li = []

# loop through file list
for filename in all_files:
    
    # create a dataframe for each file
    df = pd.read_csv(filename, index_col=None, header=1)
    
    # get the name of the file to add the season to each dataframe
    linearray = filename.split("/")
    season = linearray[-1].split('-')[0]
    
    # add new column for season at the beginning
    df.insert(0, 'Season', season)
    
    # append each dataframe to our list
    li.append(df)

# create final dataframe with all seasons. 
goalies = pd.concat(li, axis=0, ignore_index=True)  
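`glob.glob` makes no promise about the order of the files it returns, which is why the combined dataframe can start with a season other than 2010. A minimal sketch of the same loop with a deterministic order and an OS-independent way to get the file name, run against a throwaway folder: the tiny files and the names `tmp_dir`, `frames` and `demo` are assumptions for illustration, and the real files additionally need `header=1`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a throwaway folder with two tiny files named like '2010-goalie.csv'.
tmp_dir = Path(tempfile.mkdtemp())
for year in (2011, 2010):
    (tmp_dir / f'{year}-goalie.csv').write_text('Player Name,W\nGoalie A,1\n')

frames = []
# sorted() makes the order deterministic; glob alone gives no ordering guarantee
for csv_path in sorted(tmp_dir.glob('*.csv')):
    frame = pd.read_csv(csv_path)
    # Path.name is the file name without directories, on any operating system
    season = int(csv_path.name.split('-')[0])
    frame.insert(0, 'Season', season)
    frames.append(frame)

demo = pd.concat(frames, axis=0, ignore_index=True)
print(demo)
```

Casting the season to `int` here is a design choice: it keeps `Season` numeric so it sorts and plots as a year instead of as text.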

In [109]:
# check
goalies.head()

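|   | Season | Player Name | Team | Games | W | L | OTL | GAA | GA | SA | SV | SV% | SO | MIN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014 | Carey Price | MON | 66 | 44 | 16 | 6 | 2.0 | 130 | 1953 | 1823 | 0.933 | 9 | 3977 |
| 1 | 2014 | Pekka Rinne | NSH | 64 | 41 | 17 | 6 | 2.2 | 140 | 1807 | 1667 | 0.923 | 4 | 3851 |
| 2 | 2014 | Braden Holtby | WAS | 73 | 41 | 20 | 10 | 2.2 | 157 | 2044 | 1887 | 0.923 | 9 | 4247 |
| 3 | 2014 | Ben Bishop | DAL | 62 | 40 | 13 | 5 | 2.3 | 136 | 1620 | 1484 | 0.916 | 4 | 3519 |
| 4 | 2014 | Jaroslav Halak | BOS | 59 | 38 | 17 | 4 | 2.4 | 144 | 1673 | 1529 | 0.914 | 6 | 3550 |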


In [110]:
# null values and datatypes
goalies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 907 entries, 0 to 906
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Season       907 non-null    object 
 1   Player Name  907 non-null    object 
 2   Team         907 non-null    object 
 3   Games        907 non-null    int64  
 4   W            907 non-null    int64  
 5   L            907 non-null    int64  
 6   OTL          907 non-null    int64  
 7   GAA          907 non-null    float64
 8   GA           907 non-null    int64  
 9   SA           907 non-null    object 
 10  SV           907 non-null    object 
 11  SV%          907 non-null    float64
 12  SO           907 non-null    int64  
 13  MIN          907 non-null    object 
dtypes: float64(2), int64(6), object(6)
memory usage: 99.3+ KB


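There are no nulls, but `SA`, `SV` and `MIN` are still stored as `object` because their values contain thousands separators (commas). I'll strip the commas and cast these columns to integers.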

In [111]:
# change SA, SV and MIN to numeric columns. 
goalies['SA'] = goalies['SA'].str.replace(',', '').astype(int)

# change SVs
goalies['SV'] = goalies['SV'].str.replace(',', '').astype(int)

# change MINs
goalies['MIN'] = goalies['MIN'].str.replace(',', '').astype(int)

In [113]:
# check data types again
goalies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 907 entries, 0 to 906
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Season       907 non-null    object 
 1   Player Name  907 non-null    object 
 2   Team         907 non-null    object 
 3   Games        907 non-null    int64  
 4   W            907 non-null    int64  
 5   L            907 non-null    int64  
 6   OTL          907 non-null    int64  
 7   GAA          907 non-null    float64
 8   GA           907 non-null    int64  
 9   SA           907 non-null    int64  
 10  SV           907 non-null    int64  
 11  SV%          907 non-null    float64
 12  SO           907 non-null    int64  
 13  MIN          907 non-null    int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 99.3+ KB
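With the columns numeric, a quick consistency check becomes possible. A minimal sketch, assuming goals against equal shots against minus saves and that `SV%` is saves divided by shots against, rounded to three decimals: both relations hold for the rows from the `goalies.head()` output used here, but I have not verified them for every row of the full data. The names `check`, `ga_ok` and `sv_pct_ok` are illustrative.

```python
import pandas as pd

# Three rows copied from the goalies.head() output
check = pd.DataFrame({
    'Player Name': ['Carey Price', 'Pekka Rinne', 'Braden Holtby'],
    'GA': [130, 140, 157],
    'SA': [1953, 1807, 2044],
    'SV': [1823, 1667, 1887],
    'SV%': [0.933, 0.923, 0.923],
})

# goals against should equal shots against minus saves
ga_ok = (check['SA'] - check['SV']) == check['GA']

# save percentage should match saves / shots against to within rounding
sv_pct_ok = (check['SV'] / check['SA'] - check['SV%']).abs() < 0.001

print(ga_ok.all(), sv_pct_ok.all())
```

Running the same two comparisons on the full `goalies` dataframe would flag any rows where the comma stripping or the source data went wrong.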


In [115]:
# write cleaned up dataframe to new csv file. 
# goalies.to_csv('goalies-all.csv')
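One thing to keep in mind when saving: by default `to_csv` also writes the index, and it comes back as an extra `Unnamed: 0` column when the file is read again. A minimal sketch of the difference, using an in-memory buffer and a made-up two-row dataframe instead of the real file:

```python
import io

import pandas as pd

demo = pd.DataFrame({'Season': ['2010', '2011'], 'W': [44, 41]})

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_clean = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_clean)
```

For the real export this would mean passing `index=False` to `to_csv`.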

## NHL Players 2010 - 2019 Stats

In [78]:
# let's import one season of player stats to check out the data. 
players = pd.read_csv('2010_2019_player_season_stats/2010-player.csv', engine='python', header=1)

# check 
players.head()

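|   | Player Name | Team | Pos | Games | G | A | Pts | +/- | PIM | SOG | GWG | G.1 | A.1 | G.2 | A.2 | Hits | BS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Corey Perry | DAL | RW | 82 | 50 | 48 | 98 | 9 | 104 | 290 | 11 | 14 | 17 | 4 | 1 | 64 | 41 |
| 1 | Steven Stamkos | TB | C | 82 | 45 | 46 | 91 | 3 | 74 | 272 | 8 | 17 | 19 | 0 | 0 | 84 | 37 |
| 2 | Jarome Iginla | FA | RW | 82 | 43 | 43 | 86 | 0 | 40 | 289 | 6 | 14 | 16 | 0 | 0 | 103 | 28 |
| 3 | Ryan Kesler | ANH | C | 82 | 41 | 32 | 73 | 24 | 66 | 260 | 7 | 15 | 15 | 3 | 1 | 124 | 80 |
| 4 | Daniel Sedin | FA | LW | 82 | 41 | 63 | 104 | 30 | 32 | 266 | 10 | 18 | 24 | 0 | 0 | 13 | 12 |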


Ok, that looks good. Now I'll create my dataframe with all 10 seasons. 


In [85]:
# set file path
path = r'/Users/lizzy/Desktop/nhl_project/2010_2019_player_season_stats/'
# path already ends with a slash, so the pattern needs no leading one
all_files = glob.glob(path + "*.csv")

# create empty list 
li = []

# loop through file list
for filename in all_files:
    
    # create a dataframe for each file
    df = pd.read_csv(filename, index_col=None, header=1)
    
    # get the name of the file to add the season to each dataframe
    linearray = filename.split("/")
    season = linearray[-1].split('-')[0]
    
    # add new column for season at the beginning
    df.insert(0, 'Season', season)
    
    # append each dataframe to our list
    li.append(df)

# create final dataframe with all seasons. 
players = pd.concat(li, axis=0, ignore_index=True)  

In [86]:
players.rename(columns={'+/-':'PlusMinus','G.1':'PPG', 'A.1':'PPA', 'G.2':'SHG', 'A.2':"SHA"}, inplace=True)

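I've renamed some of the columns so they are easier to work with later. The raw files repeat the `G` and `A` headers, so pandas appended `.1` and `.2` to the duplicates; these now have their own names (`PPG`, `PPA`, `SHG`, `SHA`), and `+/-` becomes `PlusMinus`, which is a valid Python identifier.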
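To show where names like `G.1` come from, here is a minimal sketch on a made-up one-row file with repeated `G` and `A` headers. The inline data and the placeholder name `Skater A` are assumptions for illustration; the numbers match the first row of the 2010 preview.

```python
import io

import pandas as pd

# Hypothetical header with repeated G and A labels, as in the player files
raw = io.StringIO('Player Name,G,A,G,A,G,A\nSkater A,50,48,14,17,4,1\n')
demo = pd.read_csv(raw)

# pandas keeps the first G and A and suffixes the repeats with .1, .2, ...
mangled = demo.columns.tolist()
print(mangled)

# same mapping as the rename in the notebook, returning a new dataframe
demo = demo.rename(columns={'G.1': 'PPG', 'A.1': 'PPA', 'G.2': 'SHG', 'A.2': 'SHA'})
print(demo.columns.tolist())
```

Using `rename` without `inplace=True` and reassigning the result is equivalent here and makes the step easier to chain.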

In [116]:
players.head()

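|   | Season | Player Name | Team | Pos | Games | G | A | Pts | PlusMinus | PIM | SOG | GWG | PPG | PPA | SHG | SHA | Hits | BS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011 | Steven Stamkos | TB | C | 82 | 60 | 37 | 97 | 7 | 66 | 303 | 12 | 12 | 13 | 0 | 0 | 109 | 37 |
| 1 | 2011 | Evgeni Malkin | PIT | C | 75 | 50 | 59 | 109 | 18 | 70 | 339 | 9 | 12 | 22 | 0 | 0 | 29 | 41 |
| 2 | 2011 | Marian Gaborik | OTT | RW | 82 | 41 | 35 | 76 | 15 | 34 | 276 | 7 | 10 | 11 | 0 | 0 | 63 | 40 |
| 3 | 2011 | James Neal | EDM | RW | 80 | 40 | 41 | 81 | 6 | 87 | 329 | 4 | 18 | 12 | 0 | 0 | 108 | 15 |
| 4 | 2011 | Alex Ovechkin | WAS | LW | 78 | 38 | 27 | 65 | -8 | 26 | 303 | 3 | 13 | 10 | 0 | 0 | 215 | 42 |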


In [117]:
# check datatypes. 
players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8857 entries, 0 to 8856
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Season       8857 non-null   object
 1   Player Name  8857 non-null   object
 2   Team         8857 non-null   object
 3   Pos          8857 non-null   object
 4   Games        8857 non-null   int64 
 5   G            8857 non-null   int64 
 6   A            8857 non-null   int64 
 7   Pts          8857 non-null   int64 
 8   PlusMinus    8857 non-null   int64 
 9   PIM          8857 non-null   int64 
 10  SOG          8857 non-null   int64 
 11  GWG          8857 non-null   int64 
 12  PPG          8857 non-null   int64 
 13  PPA          8857 non-null   int64 
 14  SHG          8857 non-null   int64 
 15  SHA          8857 non-null   int64 
 16  Hits         8857 non-null   int64 
 17  BS           8857 non-null   int64 
dtypes: int64(14), object(4)
memory usage: 1.2+ MB


We have a mix of numeric and non-numeric columns again, but this time every stat column is already `int64` and only `Season`, `Player Name`, `Team` and `Pos` are `object`, so I don't need to convert anything here.

In [118]:
# save to csv file.
# players.to_csv('players-all.csv')