# Hockey fights

David Singer, the gentleman who runs [hockeyfights dot com](http://www.hockeyfights.com/), was kind enough to provide us with a cut of the data powering his website for us to use in training sessions. Thanks, David!

This data lives here: `../data/hockey-fights.xlsx`. Every row in the data is one fight.

Let's take a look, eh?

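First, we'll import pandas, then we'll use the `read_excel()` function to load the data into a data frame. (Note: To use this functionality, pandas needs a separate Excel-reading library. This notebook was set up with `xlrd`, which luckily we've installed already; recent versions of pandas read `.xlsx` files with `openpyxl` instead.)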

In [10]:
import pandas as pd

In [11]:
df = pd.read_excel('../data/hockey-fights.xlsx', sheet_name='fights')

In [12]:
df.head()

Unnamed: 0,fight_id,date,game_id,away_team_id,away_team_name,home_team_id,home_team_name,away_player_id,away_player_name,away_player_got_fm,home_player_id,home_player_name,home_player_got_fm,fight_period,fight_minutes,fight_seconds,fight_votes_away_win,fight_votes_home_win,fight_votes_draw
0,111386,2012-04-07,137330,26,St. Louis Blues,10,Dallas Stars,587,Barret Jackman,True,13274,Ryan Garbutt,True,2,5,34,37,4,1
1,111389,2012-04-07,137327,19,New York Islanders,9,Columbus Blue Jackets,4999,Matt Martin,True,1833,Derek Dorsett,True,2,12,1,13,2,35
2,111388,2012-04-07,137327,19,New York Islanders,9,Columbus Blue Jackets,1743,Micheal Haley,True,2705,Jared Boll,True,2,9,46,3,46,8
3,111387,2012-04-07,137327,19,New York Islanders,9,Columbus Blue Jackets,4999,Matt Martin,True,1833,Derek Dorsett,True,1,5,2,16,2,21
4,111384,2012-04-07,137321,22,Philadelphia Flyers,24,Pittsburgh Penguins,16507,Harry Zolnierczyk,True,6702,Joe Vitale,True,1,2,22,4,128,1


### Check out the data

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2097 entries, 0 to 2096
Data columns (total 19 columns):
fight_id                2097 non-null int64
date                    2097 non-null datetime64[ns]
game_id                 2097 non-null int64
away_team_id            2097 non-null int64
away_team_name          2097 non-null object
home_team_id            2097 non-null int64
home_team_name          2097 non-null object
away_player_id          2097 non-null int64
away_player_name        2097 non-null object
away_player_got_fm      2097 non-null bool
home_player_id          2097 non-null int64
home_player_name        2097 non-null object
home_player_got_fm      2097 non-null bool
fight_period            2097 non-null int64
fight_minutes           2097 non-null int64
fight_seconds           2097 non-null int64
fight_votes_away_win    2097 non-null int64
fight_votes_home_win    2097 non-null int64
fight_votes_draw        2097 non-null int64
dtypes: bool(2), datetime64[ns](1), int64(12), object(4)

In [14]:
df.date.min()

Timestamp('2011-10-07 00:00:00')

In [15]:
df.date.max()

Timestamp('2016-04-10 00:00:00')

In [16]:
df.away_team_name.unique()

array(['St. Louis Blues', 'New York Islanders', 'Philadelphia Flyers',
       'Buffalo Sabres', 'Arizona Coyotes', 'San Jose Sharks',
       'Anaheim Ducks', 'Columbus Blue Jackets', 'Pittsburgh Penguins',
       'New York Rangers', 'Toronto Maple Leafs', 'Edmonton Oilers',
       'Washington Capitals', 'Minnesota Wild', 'Ottawa Senators',
       'Dallas Stars', 'Winnipeg Jets', 'Boston Bruins',
       'Colorado Avalanche', 'Florida Panthers', 'Nashville Predators',
       'Montreal Canadiens', 'Vancouver Canucks', 'Chicago Blackhawks',
       'New Jersey Devils', 'Carolina Hurricanes', 'Los Angeles Kings',
       'Calgary Flames', 'Tampa Bay Lightning', 'Detroit Red Wings'],
      dtype=object)

In [17]:
df.home_team_name.unique()

array(['Dallas Stars', 'Columbus Blue Jackets', 'Pittsburgh Penguins',
       'Boston Bruins', 'St. Louis Blues', 'Los Angeles Kings',
       'Edmonton Oilers', 'Colorado Avalanche', 'Philadelphia Flyers',
       'New Jersey Devils', 'Buffalo Sabres', 'Tampa Bay Lightning',
       'Anaheim Ducks', 'Chicago Blackhawks', 'New York Islanders',
       'San Jose Sharks', 'Arizona Coyotes', 'Carolina Hurricanes',
       'Vancouver Canucks', 'Winnipeg Jets', 'Montreal Canadiens',
       'Washington Capitals', 'Minnesota Wild', 'New York Rangers',
       'Calgary Flames', 'Florida Panthers', 'Ottawa Senators',
       'Toronto Maple Leafs', 'Nashville Predators', 'Detroit Red Wings'],
      dtype=object)

In [18]:
# etc ...

### Come up with a list of questions

- Which player was involved in the most fights?
- Average number of fights per game?
- What was the longest fight?

... what else?

### Q: Which player was involved in the most fights?

This one will be a little tricky because of how the data is structured -- a player could be fighting either as the home or away player, so there's not an obvious column to group or pivot on. There are a couple of strategies we could use to answer this question, but here's what we're going to do:

- Use the [`concat()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) function to stack the values in the two player ID columns into one Series. (We're using player ID instead of name to avoid the "John Smith" problem, or, I guess, the "Graham MacKenzie" problem.)
- Use `value_counts()` to count how many times each ID appears, sorted from most to least frequent
- Grab the player ID with the most fights by getting the first (`[0]`) element of the `index` of the Series returned by `value_counts()`
- Go back to the original data frame and filter for that ID, then use [`iloc`](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.iloc.html) to get a single fight record with the player's name, team, etc.

In [19]:
# use concat to stack the home and away player IDs
all_player_ids = pd.concat([df.home_player_id, df.away_player_id])

In [20]:
# use value_counts() to get a frequency count, then grab the top (first) one
top = all_player_ids.value_counts().index[0]

In [22]:
# filter the main data frame for fights involving that player
# arbitrarily, I have chosen to filter on the away_player_id column
# (this assumes he fought at least once as the away player)
# and grab the first record with iloc
fightiest_player_fight = df[df.away_player_id == top].iloc[0]

# get the away player's name from this fight
fightiest_player_name = fightiest_player_fight['away_player_name']

# and his team name
fightiest_player_team = fightiest_player_fight['away_team_name']

# and print them
print(fightiest_player_name, fightiest_player_team)

Cody McLeod Colorado Avalanche
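
Filtering on `away_player_id` works here, but it would come back empty if the top fighter never happened to appear as the away player. One way around that is to stack the IDs and names together into one long table and count rows per player. Here's a minimal sketch on a tiny made-up data frame: the player IDs and names are invented for illustration, and only the column names come from our data.

```python
import pandas as pd

# toy data with made-up players; column names mirror the fights data
toy = pd.DataFrame({
    'away_player_id':   [1, 2, 3],
    'away_player_name': ['Player A', 'Player B', 'Player C'],
    'home_player_id':   [2, 4, 2],
    'home_player_name': ['Player B', 'Player D', 'Player B'],
})

# rename both halves to the same column names so they stack cleanly
away = toy[['away_player_id', 'away_player_name']].rename(
    columns={'away_player_id': 'player_id', 'away_player_name': 'player_name'})
home = toy[['home_player_id', 'home_player_name']].rename(
    columns={'home_player_id': 'player_id', 'home_player_name': 'player_name'})

# one row per player per fight
players = pd.concat([away, home], ignore_index=True)

# count fights per player, keeping the name alongside the ID
fight_counts = (players.groupby(['player_id', 'player_name'])
                       .size()
                       .sort_values(ascending=False))

# the index is a (player_id, player_name) pair, so unpack the first one
top_id, top_name = fight_counts.index[0]
print(top_id, top_name, fight_counts.iloc[0])
```

One caveat with this approach: grouping on the name as well as the ID would split a player's count if his name were spelled inconsistently across records, so it's worth checking for that in real data.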


### Q: Average number of fights per game?

This one will be pretty easy. We need two numbers: The total number of fights -- which is the same as asking how many records are in our original data frame -- and the total number of games, which will just involve counting the number of unique game IDs in our data.

To get the number of records in our data frame, we shall use the `shape` attribute, which returns a [tuple](https://www.tutorialspoint.com/python/python_tuples.htm) with two things: the number of rows (the first thing) and the number of columns (the second thing). You can access items in a tuple [just like you'd access items in a list](../reference/Python%20data%20types%20and%20basic%20syntax.ipynb#Lists): With square brackets `[]` and the index number of the thing you're trying to get.

To get the number of unique games, we're going to use the [`nunique()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.nunique.html) method to get the number of unique game IDs. (How did I know about the `nunique()` method? I didn't until I Googled "pandas count unique values series.")

In [16]:
df.shape

(2097, 19)

In [17]:
num_fights = df.shape[0]

In [18]:
num_games = df.game_id.nunique()

In [19]:
avg_fights_per_game = num_fights / num_games

print(avg_fights_per_game)

1.2541866028708133
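
As a cross-check, the same average can be reached by counting the fights in each game with `groupby()` and then taking the mean of those counts. A minimal sketch on made-up game IDs (the toy values are for illustration only):

```python
import pandas as pd

# toy data: five fights across three games (IDs are made up)
toy = pd.DataFrame({'game_id': [100, 100, 101, 102, 102]})

# approach from above: number of rows divided by number of unique game IDs
num_fights = toy.shape[0]
num_games = toy.game_id.nunique()
avg_from_shape = num_fights / num_games

# cross-check: count the fights in each game, then average those counts
fights_per_game = toy.groupby('game_id').size()
avg_from_groupby = fights_per_game.mean()

print(avg_from_shape, avg_from_groupby)
```

Either way, keep in mind that every row in our data is a fight, so games without a fight presumably don't show up at all. The number is really the average among games that had at least one fight.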


### Q: What was the longest fight?

We have fight duration as a mixture of minutes and seconds, so we first need to convert to seconds: `(minutes * 60) + seconds`. We'll create a new column, `fight_duration`, for this.

In [24]:
# create a new column that takes fight minutes, times 60, plus fight seconds
df['fight_duration'] = (df.fight_minutes * 60) + df.fight_seconds

Now it's just a matter of sorting our data frame top to bottom by that new column, with `sort_values()`, and using [`.iloc[0]`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html) to grab the first record.

In [25]:
df.sort_values('fight_duration', ascending=False).iloc[0]

fight_id                             116979
date                    2013-03-22 00:00:00
game_id                              155823
away_team_id                             11
away_team_name            Detroit Red Wings
home_team_id                              1
home_team_name                Anaheim Ducks
away_player_id                         8977
away_player_name              Brian Lashoff
away_player_got_fm                     True
home_player_id                        14446
home_player_name              Kyle Palmieri
home_player_got_fm                     True
fight_period                              2
fight_minutes                            20
fight_seconds                             0
fight_votes_away_win                     14
fight_votes_home_win                      5
fight_votes_draw                         34
fight_duration                         1200
Name: 653, dtype: object

A 20-minute fight! Wow! But if you [check the video](https://www.youtube.com/watch?v=1ocGtekdf8o), this one appears to be a data entry error.
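
Because the single top record can turn out to be a bad data point, it can help to look at several of the largest values at once. Here's a small sketch using `nlargest()` on made-up numbers; the toy values are invented, and only the column names match our data.

```python
import pandas as pd

# toy data: minutes and seconds for four made-up fights
toy = pd.DataFrame({
    'fight_id':      [1, 2, 3, 4],
    'fight_minutes': [0, 20, 1, 0],
    'fight_seconds': [45, 0, 10, 30],
})

# same conversion as above: fight minutes, times 60, plus fight seconds
toy['fight_duration'] = (toy.fight_minutes * 60) + toy.fight_seconds

# grab the three longest fights instead of only the first
longest = toy.nlargest(3, 'fight_duration')

print(longest[['fight_id', 'fight_duration']])
```

With a handful of records side by side, an outlier like a 1,200-second fight stands out against the more plausible values below it.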