## Get some context:
- 2018 Rules Changes And Points of Emphasis: https://operations.nfl.com/the-rules/2018-rules-changes-and-points-of-emphasis/
- NFL 2018 Health and Safety Report: https://annualreport.playsmartplaysafe.com/#data-injury-reduction-plan
- NFL 2017 Injury Data: https://www.playsmartplaysafe.com/newsroom/reports/2017-injury-data/
- Evolution of NFL Rules: https://operations.nfl.com/the-rules/evolution-of-the-nfl-rules/
- ESPN coverage of NFL Call to Action: http://www.espn.com/nfl/story/_/id/24743994/really-changed-nfl-call-action-concussions
- 2018 Kickoff Rule Changes: https://www.si.com/nfl/2018/09/07/nfl-kickoff-rule-changes-explained-onside-return-clarified
- Chronology of Kickoff rule changes: http://www.footballzebras.com/2018/05/chronology-of-kickoff-rules-changes/
- Relevant Concussion paper: https://journals.sagepub.com/doi/full/10.1177/0363546518804498

## Game Plan
- Look at the film, be the film...
    - Find key attributes that increase risk of concussion
    - Note if any new rules ('Use of Helmet' Rule) exist for 2018 which possibly could have prevented that concussion
- Compare those features to the control videos and routes of players from punts without concussions
- The goal should be to propose rules that improve player safety while also maintaining the integrity of the game.
- The notebook will be broken up into multiple parts: preprocessing, concussion analysis, and proposed rules with their potential impacts on the game

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import re

## GAME DATA
- <b>Game Data</b>: Game level data that specifies the type of season (pre, reg, post), week, and hosting city and team. Each game is uniquely identified across all seasons using GameKey.
- <b>GameKey</b>: is unique
- Two seasons: 2016, 2017 (333 games each)
    - 65 preseason games
    - 256 regular season games
    - 11 postseason games, 1 all-star game (Pro Bowl)
- Preseason splits: [1:65], [334:398]
- Regular season splits: [66:321], [399:654]
- Postseason splits: [322:332], [655:665]
- Pro Bowl: 333, 666
- You can explore this dataset on your own, but I didn't find it very helpful beyond understanding the seasonal splits above.
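
The splits above can be turned into a small lookup. This is only a sketch: `season_segment` is a hypothetical helper that is not part of the original notebook, and it assumes the GameKey ranges listed above hold exactly (333 games per season, in the same order both years).

```python
def season_segment(game_key):
    """Map a GameKey (1-666) to (season_year, segment) using the splits listed above."""
    if not 1 <= game_key <= 666:
        raise ValueError("GameKey outside the 1-666 range covered by this dataset")
    season_year = 2016 + (game_key - 1) // 333   # assumed: 333 games per season
    k = (game_key - 1) % 333 + 1                 # position of the game within its season
    if k <= 65:
        segment = 'Pre'
    elif k <= 321:
        segment = 'Reg'
    elif k <= 332:
        segment = 'Post'
    else:
        segment = 'Pro Bowl'
    return season_year, segment


print(season_segment(334))
```

In practice the `Season_Year` and `Season_Type` columns already carry this information, so the helper is mainly a sanity check on the ranges.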

In [2]:
game_df = pd.read_csv('data/game_data.csv')
print(game_df.shape)
game_df.head(1)

# Cleanup Memory
del game_df

(666, 18)


# PLAY INFORMATION
- <b>Play Information</b>: Play level data that describes the type of play, possession team, score and a brief narrative of each play. Plays are uniquely identified using the PlayID along with the corresponding GameKey. <b>PlayIDs ARE NOT UNIQUE</b> on their own.
- All plays are punts (just check counts of 'Play_Type')
- This dataset has a lot of information that can be useful in your analysis. 'PlayDescription' gives a summary of the play, which can be parsed to classify each play. You know where the ball is being punted from ('YardLine') and who is punting ('Poss_Team'), so you can combine this information with 'PlayDescription' to work out how far a player returns the ball and to create your own 'reward' metrics that place a value on the punt return. <b>If you're going to implement a rule change that affects how the ball is received, you should understand which plays in this dataset would have been affected (usually negated).</b>
    - I'm sure many folks are considering the CFL "no yards" penalty or some variation of it. Essentially, the punt receiver (PR) gets at least a 5 yard buffer against any opposing player to allow for receiving the punt. A penalty is called if a punt team player, other than the punter or a player who was behind the punter at the time of the punt, enters this restricted area. Note that the CFL does not have a fair catch rule. Anyway, a rule that creates some restricted/safety area for the PR would negate many plays in this dataset. It's important to understand what's being negated and what value to the game, if any, is being lost as a result.
    - The funny CFL punt: https://www.sbnation.com/2017/9/9/16280946/cfl-punt-return-weird-plays-bc-lions-montreal-alouettes-2017

In [3]:
play_df = pd.read_csv('data/play_information.csv')
print(play_df.shape)
play_df.head(1)

(6681, 14)


Unnamed: 0,Season_Year,Season_Type,GameKey,Game_Date,Week,PlayID,Game_Clock,YardLine,Quarter,Play_Type,Poss_Team,Home_Team_Visit_Team,Score_Home_Visiting,PlayDescription
0,2016,Pre,2,08/13/2016,2,191,12:30,LA 47,1,Punt,LA,LA-DAL,0 - 7,"(12:30) J.Hekker punts 52 yards to DAL 1, Cent..."


In [4]:
# HOW MANY GAMES HAD NO PUNTS AND WHICH GAMES
# Collect all GameKeys that appear in the punt data
games_with_punts = set(play_df['GameKey'])
print('Number of games without a punt:', 666 - len(games_with_punts))

# GameKeys run from 1 to 666 (see the game data section)
for game_key in range(1, 667):
    if game_key not in games_with_punts:
        print('Game', game_key, 'had no punts')

Number of games without a punt: 4
Game 1 had no punts
Game 333 had no punts
Game 390 had no punts
Game 666 had no punts


- Game 1: the 2016 'Hall of Fame Game' was cancelled due to unsafe field conditions
- Game 333: Pro Bowl
    - If you search for the game and find the box score, there were 3 punts
- Game 390: cancelled due to Hurricane Harvey :(
- Game 666: Pro Bowl
    - If you search for the game and find the box score, there were 8 punts

- I'm gonna drop <b>a lot of columns</b> because I don't find them very useful. I'll look specifically at <b>'PlayDescription'</b> to get a rough idea of how a play panned out, and use that to create one-hot encodings of play types (touchback, punt return, blocked kick, etc.) and miscellaneous attributes of a play (fumble, muffed, etc.). This will help to understand how many plays could be deemed 'interesting' (exciting, action after the catch, a blocked kick) and 'uninteresting' (out of bounds kicks, touchbacks, fair catches, etc.). This labeling is subjective and is used later to place a value on the result of a punt, both to a team and to its fanbase.
- Note: I'll sometimes use the words 'play' and 'punt' interchangeably. Sorry for any confusion; I always mean the entirety of the punt play when I use those words.

- Example Play Descriptions for one-hot-encodings:
    - <b>Interesting outcomes</b>:
        - <b>Returned Punt</b>: B.Nortman punts 40 yards to BUF 23, Center-C.Holba. B.Tate to BUF 34 for 11 yards (D.Payne).
        - <b>Muffed catch</b>: S.Waters punts 36 yards to BLT 15, Center-J.Jansen. K.Clay MUFFS catch, RECOVERED by CAR-F.Whittaker at BLT 12. F.Whittaker to BLT 12 for no gain (K.Clay).
        - <b>Blocked Punt</b>: B.Wing punt is BLOCKED by B.Carter, Center-Z.DeOssie, recovered by NYG-J.Currie at NYG 15. J.Currie to NYG 15 for no gain (J.Burris).
        - <b>Fumbles</b>: M.Darr punts 42 yards to TEN 14, Center-J.Denney. K.Reed to TEN 21 for 7 yards (Dan.Thomas). FUMBLES (Dan.Thomas), RECOVERED by MIA-J.Denney at TEN 23. J.Denney to TEN 23 for no gain (K.Byard).
        - <b>Touchdown</b>: J.Locke punts 61 yards to CIN 20, Center-K.McDermott. A.Erickson for 80 yards, TOUCHDOWN.
        - <b>Fake Punt</b>: lots of variations in descriptions for these bad boys
            - Passing: P.McAfee pass deep right to E.Swoope to PIT 8 for 35 yards (J.Gilbert).
            - Running: C.Jones left end to PHI 43 for 30 yards (D.Sproles). Fake punt run around left end.
    - <b>Uninteresting outcomes</b>:
        - <b>Fair Catch</b>: J.Locke punts 47 yards to GB 10, Center-K.McDermott, fair catch by M.Hyde.
        - <b>Downed Punt</b>: J.Locke punts 50 yards to GB 9, Center-K.McDermott, downed by MIN-J.Kearse.
            - This is a play where, after the ball has been punted, a punting team player controls the ball before any receiving team player does
        - <b>Touchbacks</b>: J.Hekker punts 50 yards to end zone, Center-J.Overbaugh, Touchback.
        - <b>Out of Bounds Punt</b>: J.Schum punts 35 yards to MIN 34, Center-B.Goode, out of bounds.
        - <b>Dead Ball</b>: B.Nortman punts 51 yards to BUF 34, Center-C.Holba. B.Tate, dead ball declared at BUF 34 for no gain.
        - <b>No Play</b>: (:04) (Punt formation) PENALTY on ATL-M.Bosher, Delay of Game, 5 yards, enforced at ATL 49 - No Play.
            - 'No Play' and '(Punt formation) PENALTY' descriptions vary. In some of them a punt was executed, but a penalty negated the play and the punt was reattempted
            - Such penalties include: False Start, Illegal Substitution, Delay of Game, Illegal Formation, Neutral Zone Infraction, Player Out of Bounds on Punt, Defensive 12 On-field, Ineligible Downfield Kick, Illegal Shift, Unnecessary Roughness, Roughing the Kicker, Defensive Offside, Offensive Holding

- Note: a play may have more than one of the above classifications.

In [16]:
# Create condensed version of play data
keeper_columns = ['GameKey', 'PlayID', 'PlayDescription', 'Poss_Team', 'YardLine']
condensed_play_df = play_df[keeper_columns].copy()

In [17]:
def find_that_play_word(keyword, df):
    """Help to find keywords"""
    df[keyword] = 0
    count = 0
    for i, description in enumerate(df['PlayDescription']):
        game_key = df.loc[i, 'GameKey']
        play_id = df.loc[i, 'PlayID']
        # Find keyword in lowercased string of play description
        if description.lower().find(keyword) != -1:
#             print('Keyword', keyword, 'found for (game, play):', '(' + str(game_key) + ',' + str(play_id) + ')')
#             print('Play description:', description)
#             print('---')
                
            # One-hot encode with keyword
            df.loc[i, keyword] = 1
            count += 1

    print('# of', keyword, 'occuring on a punt play:', count)

- The strings to search for were chosen by just perusing the 'PlayDescription' until my eyes bled. There are probably cases where I'm still making poor assumptions, but I'll have to live with it.
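
For reference, the same keyword flagging can be done without a Python-level loop. This is a minimal sketch on toy data, not the notebook's own function: `flag_keyword` and `toy` are hypothetical names, and `regex=False` is used so that keywords containing parentheses, such as '(punt formation) penalty on', are matched literally.

```python
import pandas as pd


def flag_keyword(df, keyword):
    """Add a 0/1 column named after `keyword`; return the number of matching plays."""
    # Case-insensitive literal substring match (no regex interpretation)
    hits = df['PlayDescription'].str.lower().str.contains(keyword, regex=False)
    df[keyword] = hits.astype(int)
    return int(hits.sum())


# Toy stand-in for condensed_play_df, built from example descriptions in this notebook
toy = pd.DataFrame({'PlayDescription': [
    'J.Locke punts 47 yards to GB 10, Center-K.McDermott, fair catch by M.Hyde.',
    'J.Hekker punts 50 yards to end zone, Center-J.Overbaugh, Touchback.',
    'B.Nortman punts 40 yards to BUF 23, Center-C.Holba. B.Tate to BUF 34 for 11 yards (D.Payne).',
]})

n_fair = flag_keyword(toy, 'fair catch')
n_penalty = flag_keyword(toy, '(punt formation) penalty on')
print(n_fair, n_penalty)
```

Unlike the loop version, this does not depend on the dataframe having a default 0..n-1 index.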

### Uninteresting Outcomes

In [18]:
find_that_play_word('fair catch', condensed_play_df)
find_that_play_word('touchback', condensed_play_df)
find_that_play_word('downed', condensed_play_df)
find_that_play_word(', out of bounds', condensed_play_df)
find_that_play_word('dead ball', condensed_play_df)
find_that_play_word('no play', condensed_play_df)
find_that_play_word('(punt formation) penalty on', condensed_play_df) # Picks up additional 'no play' type punts

# of fair catch occuring on a punt play: 1663
# of touchback occuring on a punt play: 407
# of downed occuring on a punt play: 811
# of , out of bounds occuring on a punt play: 640
# of dead ball occuring on a punt play: 2
# of no play occuring on a punt play: 243
# of (punt formation) penalty on occuring on a punt play: 138


- Some of these counts may overlap (a play can match more than one keyword), but that won't matter for the processing because the flags are combined with a logical OR
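
To see why overlapping counts are harmless, here is a small sketch with made-up flags (the `flags` frame is hypothetical toy data, not taken from the dataset). Summing the per-keyword counts double-counts plays that match two keywords, while an OR across the flag columns counts each play once.

```python
import pandas as pd

# Toy one-hot flags for four plays; play 1 matches two keywords
flags = pd.DataFrame({
    'fair catch': [1, 0, 0, 0],
    'no play': [0, 1, 0, 1],
    '(punt formation) penalty on': [0, 1, 0, 0],
})

summed_counts = int(flags.sum().sum())        # adds up every flag, so play 1 counts twice
uninteresting = flags.any(axis=1)             # True if at least one flag is set for the play
n_uninteresting = int(uninteresting.sum())    # each play counted once

print(summed_counts, n_uninteresting)
```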

In [19]:
# Reduce play_df even further 
where_condition = (
    (condensed_play_df['fair catch'] == 1) |
    (condensed_play_df['touchback'] == 1) |
    (condensed_play_df['downed'] == 1) |
    (condensed_play_df[', out of bounds'] == 1) |
    (condensed_play_df['dead ball'] == 1) |
    (condensed_play_df['no play'] == 1) |
    (condensed_play_df['(punt formation) penalty on'] == 1))
interesting_plays_df = condensed_play_df[~where_condition].reset_index(drop=True)

print('There are now', len(interesting_plays_df), '"interesting plays" from', len(condensed_play_df), 'punt plays')
print('Proportion of interesting punts:', len(interesting_plays_df)/len(condensed_play_df))
interesting_plays_df.head(1)

There are now 2980 "interesting plays" from 6681 punt plays
Proportion of interesting punts: 0.44604101182457717


Unnamed: 0,GameKey,PlayID,PlayDescription,Poss_Team,YardLine,fair catch,touchback,downed,", out of bounds",dead ball,no play,(punt formation) penalty on
0,2,1227,"(10:01) C.Jones punts 40 yards to LA 42, Cente...",DAL,DAL 18,0,0,0,0,0,0,0


- So we can see that around <b>55.4% of punts (3701/6681)</b> result in a play that is 'uninteresting'. Maybe the punt isn't worth the time.
- The <b>touchback rate was 57.6% for kickoffs in 2016</b>. Just for perspective on 'uninteresting outcomes'.
    - Reference: http://www.espn.com/nfl/story/_/id/18393780/kickoff-returns-reduced-18-percentage-points-2016-season
- Now that we have a condensed set of punt plays where something potentially interesting occurred, let's parse for the more-interesting-than-interesting plays on punts (touchdowns, fumbles, blocks, etc.).
    - Note: if you're curious, this filtering does remove some plays that involved concussions (GameKey, PlayID): (21, 2587), (234, 3278), (266, 2902), (280, 2918), (399, 3312), (607, 978)

## LINK TO FAIR CATCH ANALYSIS* * * * * * * * * 

In [9]:
# Create dataset of fair catch plays
# It will be used in later analyses
where_condition = ((condensed_play_df['fair catch'] == 1))
fc_df = condensed_play_df[where_condition].reset_index(drop=True)
# Drop unnecessary columns
keeper_columns = ['GameKey', 'PlayID', 'PlayDescription', 'Poss_Team', 'YardLine']
fc_df = fc_df[keeper_columns]
fc_df.to_csv('data/play-fc.csv', index=False)

In [20]:
# I only have this here for reference of what I've filtered by
uninteresting_keywords = ['fair catch', 'touchback', 'downed', ', out of bounds', 'dead ball', 'no play',
                         '(punt formation) penalty on']
interesting_keywords = ['muffs', 'blocked by', 'touchdown.', 'fumble', 'ruling', 'fake punt', 'safety',
                        'up the middle', 'pass', 'right end', 'left end', 'right guard',
                        'direct snap', 'touchdown nullified']

### Interesting outcomes

In [21]:
# 'Interesting outcomes'
find_that_play_word('muffs', interesting_plays_df)
find_that_play_word('blocked by', interesting_plays_df)
find_that_play_word('touchdown.', interesting_plays_df)
find_that_play_word('fumble', interesting_plays_df)
find_that_play_word('ruling', interesting_plays_df)
find_that_play_word('fake punt', interesting_plays_df)
find_that_play_word('safety', interesting_plays_df)
find_that_play_word('up the middle', interesting_plays_df)
find_that_play_word('pass', interesting_plays_df)
find_that_play_word('right end', interesting_plays_df)
find_that_play_word('left end', interesting_plays_df)
find_that_play_word('right guard', interesting_plays_df)
find_that_play_word('direct snap', interesting_plays_df)
find_that_play_word('touchdown nullified', interesting_plays_df)

# of muffs occuring on a punt play: 198
# of blocked by occuring on a punt play: 29
# of touchdown. occuring on a punt play: 41
# of fumble occuring on a punt play: 68
# of ruling occuring on a punt play: 25
# of fake punt occuring on a punt play: 6
# of safety occuring on a punt play: 8
# of up the middle occuring on a punt play: 11
# of pass occuring on a punt play: 20
# of right end occuring on a punt play: 7
# of left end occuring on a punt play: 3
# of right guard occuring on a punt play: 3
# of direct snap occuring on a punt play: 9
# of touchdown nullified occuring on a punt play: 12


In [47]:
# Create a dataset where plays are currently assumed to be actual punt returns 
where_condition = (
    (interesting_plays_df['muffs'] == 1) |
    (interesting_plays_df['blocked by'] == 1) |
    (interesting_plays_df['touchdown.'] == 1) |
    (interesting_plays_df['fumble'] == 1) |
    (interesting_plays_df['ruling'] == 1) |
    (interesting_plays_df['fake punt'] == 1) |
    (interesting_plays_df['safety'] == 1) |
    (interesting_plays_df['up the middle'] == 1) |
    (interesting_plays_df['pass'] == 1) |
    (interesting_plays_df['right end'] == 1) |
    (interesting_plays_df['left end'] == 1) |
    (interesting_plays_df['right guard'] == 1) |
    (interesting_plays_df['direct snap'] == 1) |
    (interesting_plays_df['touchdown nullified'] == 1))
remainder_df = interesting_plays_df[~where_condition].reset_index(drop=True)

# Isolate touchdowns that were from punt returns
where_condition = ((interesting_plays_df['touchdown.'] == 1) &
                   (interesting_plays_df['blocked by'] == 0) &
                   (interesting_plays_df['direct snap'] == 0) &
                   (interesting_plays_df['right guard'] == 0) &
                   (interesting_plays_df['fumble'] == 0) &
                   (interesting_plays_df['pass'] == 0))
td_df = interesting_plays_df[where_condition].reset_index(drop=True)

# Combine touchdown punt returns and regular punt returns
remainder_df = pd.concat([remainder_df, td_df], axis=0)

# Drop unnecessary columns
keeper_columns = ['GameKey', 'PlayID', 'PlayDescription', 'Poss_Team', 'YardLine']
remainder_df = remainder_df[keeper_columns]
remainder_df.reset_index(inplace=True, drop=True)
print(remainder_df.shape)
remainder_df.head()

(2627, 5)


Unnamed: 0,GameKey,PlayID,PlayDescription,Poss_Team,YardLine
0,2,1227,"(10:01) C.Jones punts 40 yards to LA 42, Cente...",DAL,DAL 18
1,3,455,(6:44) (Punt formation) S.Koch punts 54 yards ...,BLT,BLT 32
2,3,1542,(2:54) (Punt formation) S.Koch punts 45 yards ...,BLT,BLT 34
3,4,927,"(1:53) A.Lee punts 40 yards to GB 27, Center-C...",CLV,CLV 33
4,4,1725,"(2:48) A.Lee punts 66 yards to GB 15, Center-C...",CLV,CLV 19


In [48]:
# Just for reference if you want to filter return plays that have a penalty
find_that_play_word('penalty on', remainder_df)

# of penalty on occuring on a punt play: 531


- So now we have a condensed set of punts that result in some return minus the above filtered 'interesting' plays.
- We'll now look at this set of plays and extract some information from the 'PlayDescription'

### Play Description Parsing
- The following work-up/analysis does not adjust for penalties on the play. I know this isn't clean, and big returns on punts have a pretty good chance that a penalty helped the return succeed, but I'm only parsing the PlayDescription to get a rough idea of the return amounts and the potential value of a return. I'll go back and parse more properly in future versions of the notebook.
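
As a quick illustration of the parsing, here are the two main regular expressions from this notebook's parsing cell applied to the 'Returned Punt' example description from earlier. The variable names other than the two patterns are just for this sketch. Note that the third capture group includes its surrounding spaces (e.g. ' BUF '), so it is stripped here.

```python
import re

# Same patterns as in the notebook's parsing cell
punt_distance_pattern = re.compile(r'punts ((-?)\d+) yards? to(\s| \w+ )((-?)\d+)')
yards_gained_pattern = re.compile(r'for ((-?)\d+) yard')

description = ('B.Nortman punts 40 yards to BUF 23, Center-C.Holba. '
               'B.Tate to BUF 34 for 11 yards (D.Payne).')

# findall returns one tuple per match, with one entry per capture group
punt = punt_distance_pattern.findall(description)
gain = yards_gained_pattern.findall(description)

punt_distance = int(punt[0][0])     # yards the punt travelled
side = punt[0][2].strip()           # side of the field where the ball lands
yardline = int(punt[0][3])          # yard line where the punt is received
yards = int(gain[0][0])             # yardage on the return

print(punt_distance, side, yardline, yards)
```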

In [49]:
# for i, element in enumerate(remainder_df['PlayDescription']):
#     print(i)
#     print(element)

In [50]:
'''
Need to parse through PlayDescription in order to get return distance of play and distance to touchdown
Patterns that return two distances for yardage on play are lateral plays
'''

# Regex for them patterns
punt_distance_pattern = re.compile(r'punts ((-?)\d+) yards? to(\s| \w+ )((-?)\d+)')
yards_gained_pattern = re.compile(r'for ((-?)\d+) yard')
no_yards_gained_pattern = re.compile(r'([A-Z]\w+) ((-?)\d+) (for no gain)')

remainder_df['punt distance'] = 0
remainder_df['side ball lands'] = ''
remainder_df['yardline received'] = 0
remainder_df['yardage on play'] = 0

for i, element in enumerate(remainder_df['PlayDescription']):

    # Each match is a 5-tuple: (punt distance, sign, side ball lands, yardline received, sign)
    punt_distance = punt_distance_pattern.findall(element)
    yards_gained = yards_gained_pattern.findall(element)   # Each match: (yardage on play, sign)
    no_gain = no_yards_gained_pattern.findall(element)
    
#     print(punt_distance)
#     print(yards_gained)
#     print(no_gain)
    
    # A play that results in yards gained or lost
    if yards_gained != []:
        remainder_df.loc[i, 'punt distance'] = int(punt_distance[0][0])
        # The side group captures its surrounding spaces (e.g. ' LA '), so strip them;
        # otherwise the team comparison in calculate_distance_to_td never matches
        remainder_df.loc[i, 'side ball lands'] = punt_distance[0][2].strip()
        remainder_df.loc[i, 'yardline received'] = int(punt_distance[0][3])
        
        # A normal return
        if len(yards_gained) == 1:
            remainder_df.loc[i, 'yardage on play'] = int(yards_gained[0][0])
            
        # For laterals
        else:
            remainder_df.loc[i, 'yardage on play'] = int(yards_gained[0][0]) + int(yards_gained[1][0])
            
    # A play that resulted in no gain in yards
    elif no_gain != []:
        remainder_df.loc[i, 'punt distance'] = int(punt_distance[0][0])
        remainder_df.loc[i, 'side ball lands'] = punt_distance[0][2].strip()
        remainder_df.loc[i, 'yardline received'] = int(punt_distance[0][3])

#     print('---')

In [51]:
# Doing some hand processing of specific returns where the yardage gained on return was
# officially changed (I know, not elegant, especially if dataframe indices change over time)
culprits = [476, 891, 1062, 1064, 1096, 2193]
yard_changes = [14, 6, 0, 3, 0, 4]
for i, element in enumerate(culprits):
    remainder_df.loc[element, 'yardage on play'] = yard_changes[i]

- We'll calculate the distance to a touchdown for each play to create a reward metric for each play
- A more proper metric for the value of a punt return should also take into account the current score, time remaining, playoff implications, and whether the return is by the home team, just to name a few factors. I leave these out to keep the reward model simple.
    - Note that this model is just a proportion (return yards divided by yards needed for a touchdown), which is also a bit flawed. If a kick is very short and stays in the punt team's territory, the chances of a reasonable return are very low, so my calculated 'reward' will come out low even though the value of the play to the return team and its fans is high.
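
A worked example of the proportion may help. The sketch below uses simplified, hypothetical helpers (`distance_to_td` and `reward` take a boolean instead of comparing team names, unlike the notebook's own function). For a punt received at the return team's 42 and returned 25 yards, the returner needs 100 - 42 = 58 yards for a touchdown, so the reward is 25 / 58.

```python
def distance_to_td(yardline_received, lands_on_punt_team_side):
    """Yards the returner needs for a touchdown from where the punt is received."""
    if yardline_received == 50:
        return 50
    if lands_on_punt_team_side:
        # Ball stayed in the punt team's territory: the yard line is the distance
        return yardline_received
    # Ball landed in the return team's territory
    return 100 - yardline_received


def reward(yards_on_return, yardline_received, lands_on_punt_team_side=False):
    """Proportion of the distance to a touchdown gained on the return."""
    return yards_on_return / distance_to_td(yardline_received, lands_on_punt_team_side)


print(round(reward(25, 42), 6))
```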

In [52]:
# Check if punt team is always punting from their side of the field
count = 0
for i in range(len(remainder_df)):
    team_name_len = len(remainder_df.loc[i, 'Poss_Team'])
    if remainder_df.loc[i, 'Poss_Team'] == remainder_df.loc[i, 'YardLine'][:team_name_len]:
        continue
    else:
        count += 1
print("Number of plays where punt team is punting in opponents territory:", count)
print("Proportion of plays that are in opponents territory:", count/remainder_df.shape[0])

Number of plays where punt team is punting in opponents territory: 46
Proportion of plays that are in opponents territory: 0.017510468214693566


In [53]:
def calculate_distance_to_td (data_sample):
    '''Calculate distance needed for touchdown for each play'''
    # Punts that land on the 50 yard line
    if data_sample['yardline received'] == 50:
        distance_to_touchdown = 50
    
    # Punting on punting team's side of field
    elif data_sample['Poss_Team'] == data_sample['YardLine'][:len(data_sample['Poss_Team'])]:
        # Ball remains on punt team's side of field
        if data_sample['side ball lands'] == data_sample['YardLine'][:len(data_sample['Poss_Team'])]:
            distance_to_touchdown = data_sample['yardline received']
        # Ball is punted to return team's side of field
        else:
            distance_to_touchdown = (50 - data_sample['yardline received']) + 50
            
    # Punting on opponents side of field
    else:
        distance_to_touchdown = (50 - data_sample['yardline received']) + 50
    return distance_to_touchdown

In [54]:
# Calculate the value of a punt return based solely on the proportion of yardage gained on the return
# Relative to how many yards are needed to score a touchdown from where the punt initially lands
remainder_df['reward'] = 0.0  # float column, since rewards are fractions
for i in range(len(remainder_df)):
    yards_on_return = remainder_df.loc[i, 'yardage on play']
    distance_to_touchdown = calculate_distance_to_td(remainder_df.iloc[i, :])
    remainder_df.loc[i, 'reward'] = yards_on_return / distance_to_touchdown
#     print('Value of return:', yards_on_return / distance_to_touchdown)

remainder_df.head()

Unnamed: 0,GameKey,PlayID,PlayDescription,Poss_Team,YardLine,penalty on,punt distance,side ball lands,yardline received,yardage on play,reward
0,2,1227,"(10:01) C.Jones punts 40 yards to LA 42, Cente...",DAL,DAL 18,0,40,LA,42,25,0.431034
1,3,455,(6:44) (Punt formation) S.Koch punts 54 yards ...,BLT,BLT 32,0,54,CAR,14,9,0.104651
2,3,1542,(2:54) (Punt formation) S.Koch punts 45 yards ...,BLT,BLT 34,1,45,CAR,21,-1,-0.012658
3,4,927,"(1:53) A.Lee punts 40 yards to GB 27, Center-C...",CLV,CLV 33,0,40,GB,27,1,0.013699
4,4,1725,"(2:48) A.Lee punts 66 yards to GB 15, Center-C...",CLV,CLV 19,0,66,GB,15,5,0.058824


In [55]:
# Create dataset for external usage
remainder_df.to_csv('data/play-punt_returns.csv', index=False)

In [56]:
remainder_df = pd.read_csv('data/play-punt_returns.csv')

### PLAYER PUNT DATA
- <b>Player Punt Data</b>: Player level data that specifies the traditional football position for each player. Each player is identified using his GSISID.
- <b>GSISID</b>: unique id for each player
- This dataset is mostly useful for identifying particular players in the videos provided
- Note that some players are shown to have multiple jersey numbers associated with them
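
One way to list the players with more than one jersey number is a group-by on GSISID. This is a sketch on toy data: the rows for GSISID 28121 mirror the output shown in this section, while GSISID 11111 is a made-up player added for contrast.

```python
import pandas as pd

# Toy stand-in for player_df
toy_player_df = pd.DataFrame({
    'GSISID': [28121, 28121, 28121, 11111],
    'Number': ['22', '31w', '28', '7'],
    'Position': ['RB', 'RB', 'RB', 'P'],
})

# Count distinct jersey numbers per player and keep those with more than one
numbers_per_player = toy_player_df.groupby('GSISID')['Number'].nunique()
multi_number_ids = numbers_per_player[numbers_per_player > 1].index.tolist()

print(multi_number_ids)
```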

In [57]:
player_df = pd.read_csv('data/player_punt_data.csv')
print(player_df.shape)
print(player_df[player_df['GSISID'] == 28121])

# Clear up memory
del player_df

(3259, 3)
      GSISID Number Position
186    28121     22       RB
452    28121    31w       RB
1114   28121     28       RB
1353   28121      6       RB
2051   28121    31o       RB


### PLAY PLAYER ROLE DATA
- <b>Play Player Role Data</b>: Play and player level data that specifies a punt-specific player role. This dataset specifies each player that took part in each play. A player's role in a play is uniquely defined by the GameKey, PlayID, and GSISID.

In [58]:
play_player_role_df = pd.read_csv('data/play_player_role_data.csv')
print(play_player_role_df.shape)
play_player_role_df.tail(2)

(146573, 5)


Unnamed: 0,Season_Year,GameKey,PlayID,GSISID,Role
146571,2017,414,3425,33704,PRG
146572,2017,414,2210,33704,PRG


# NGS DATA
- <b>Next Gen Stats</b>: Player level data that describes the movement of each player during a play. NGS data is processed by BIOCORE to produce relevant speed and direction data. The NGS data is identified using GameKey, PlayID, and GSISID. Player data for each play is provided as a function of time (Time) for the duration of the play.
- Players are recorded every <b>10th of a second (100 milliseconds)</b>
- Field dimensions: 120 yards by 53.3 yards
- Speed can be calculated from Time and dis
- <b>'Event'</b> records events during the play such as the ball snap, the punt, when a punt is received, etc.
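
A minimal sketch of the speed calculation, assuming `dis` is the distance in yards travelled since the player's previous sample (that interpretation is an assumption here, as are the toy values and the `speed_yps` column name):

```python
import pandas as pd

# Toy NGS samples for one player, 100 ms apart
toy_ngs = pd.DataFrame({
    'GSISID': [1, 1, 1],
    'Time': pd.to_datetime(['2016-08-11 23:50:54.300',
                            '2016-08-11 23:50:54.400',
                            '2016-08-11 23:50:54.500']),
    'dis': [0.0, 0.5, 0.9],
})

toy_ngs = toy_ngs.sort_values(['GSISID', 'Time'])
# Seconds elapsed since the same player's previous sample (NaN for the first sample)
dt = toy_ngs.groupby('GSISID')['Time'].diff().dt.total_seconds()
toy_ngs['speed_yps'] = toy_ngs['dis'] / dt   # yards per second

print(toy_ngs[['Time', 'dis', 'speed_yps']])
```

On the real data the group-by would also need GameKey and PlayID so that time differences are not taken across plays.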

# <a id='ngs_analysis'>Analyze NGS play data</a>

[Plot Return Yardage](#return_yardage)
- We'll take our condensed set of punts and natural-join it with the NGS data for each year, then recombine the results so we have a dataset of 'cleanish' punt returns. We'll use this data to get conditional statistics on returns given the distance of the closest punt team player to the punt receiver. We'll assume that the player furthest from the punt at the 'event': 'punt_returned' is the one who actually receives the ball.
- Consider events, speed, and player proximities

In [60]:
# How many unique plays in play_player_role dataset?
print('# of unique plays according to play_player_role dataset:',
      len(play_player_role_df.groupby(['GameKey','PlayID']).size().reset_index().rename(columns={0:'count'})))

# How many roles are there?
print('# of roles in dataset:', len(play_player_role_df['Role'].value_counts()))

# of unique plays according to play_player_role dataset: 6670
# of roles in dataset: 52


- NGS data exists for 6666 punts 
- Play Information dataset has 6681 punts accounted for
- I just wanted to note that there is some missing data between the datasets
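
To track down which plays are missing from one dataset, a left merge with `indicator=True` works as an anti-join on (GameKey, PlayID). The frames below are toy stand-ins with hypothetical values, not the real data.

```python
import pandas as pd

# Toy stand-ins for play_df and play_player_role_df
toy_plays = pd.DataFrame({'GameKey': [2, 2, 3], 'PlayID': [191, 1227, 455]})
toy_roles = pd.DataFrame({'GameKey': [2, 3, 3], 'PlayID': [191, 455, 455],
                          'GSISID': [100, 200, 201]})

# One row per play in the role data
role_keys = toy_roles[['GameKey', 'PlayID']].drop_duplicates()

# '_merge' is 'left_only' for plays that have no matching role rows
merged = toy_plays.merge(role_keys, on=['GameKey', 'PlayID'], how='left', indicator=True)
missing = merged.loc[merged['_merge'] == 'left_only', ['GameKey', 'PlayID']]

print(missing)
```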

### So what do I care about?
- We're going to look at how players are positioned at a particular play event. We'll look specifically at the event 'punt_received' to see how close the punt team players are to the punt receiver. I am interested in whether there is much 'value' or 'reward' in returns where an opposing player is relatively close (<= 8 yards) to the punt receiver. This helps to understand how much value would be lost under a rule where a 'downed ball' is called if punt team players are too close to the punt receiver before the ball is caught.
    - The reason for having a restricted zone for the PR is to hopefully prevent up to <b>19 of the concussions</b> in the concussion dataset. Just watch the film to see which ones this pertains to.
    - Having a restricted zone would help to slow down the punt play and give the PR around a second to decide what to do after the catch. It could also temper player speeds as they approach the PR, since they would want to avoid a penalty.

In [61]:
# Dataset was preprocessed in a separate notebook
condensed_ngs = pd.read_csv('data/NGS-punt_returns.csv')
condensed_ngs.head()

Unnamed: 0,GameKey,PlayID,PlayDescription,Poss_Team,YardLine,penalty on,punt distance,side ball lands,yardline received,yardage on play,reward,GSISID,Time,x,y,dis,o,dir,Event,Role
0,3,455,(6:44) (Punt formation) S.Koch punts 54 yards ...,BLT,BLT 32,0,54,CAR,14,9,0.104651,31762.0,2016-08-11 23:50:54.300,40.450001,30.99,0.01,343.5,74.970001,,PLG
1,3,455,(6:44) (Punt formation) S.Koch punts 54 yards ...,BLT,BLT 32,0,54,CAR,14,9,0.104651,27539.0,2016-08-11 23:50:54.300,38.740002,27.6,0.13,9.73,242.610001,,PPR
2,3,455,(6:44) (Punt formation) S.Koch punts 54 yards ...,BLT,BLT 32,0,54,CAR,14,9,0.104651,32573.0,2016-08-11 23:50:54.300,40.32,27.32,0.02,23.82,269.019989,,PRT
3,3,455,(6:44) (Punt formation) S.Koch punts 54 yards ...,BLT,BLT 32,0,54,CAR,14,9,0.104651,30458.0,2016-08-11 23:50:54.300,40.490002,32.279999,0.01,329.540009,45.400002,,PLT
4,3,455,(6:44) (Punt formation) S.Koch punts 54 yards ...,BLT,BLT 32,0,54,CAR,14,9,0.104651,27557.0,2016-08-11 23:50:54.300,41.200001,29.91,0.01,10.25,175.419998,,PLS


## LINK TO PREPROCESSING NOTEBOOK * * * 

## LINK TO PUNT RETURN ANALYSIS NOTEBOOK * * * 

In [62]:
def event_df_creation(df, event):
    '''Get a new dataframe with data pertinent to a particular event'''
    new_df = df[df['Event'] == event].reset_index(drop=True)
    unique_ids = new_df.groupby(['GameKey','PlayID']).size().reset_index().rename(columns={0:'count'})
    return new_df, unique_ids

In [63]:
# Look at 'punt_received' event
punt_received_df, PR_unique_ids = event_df_creation(condensed_ngs, 'punt_received')
print(punt_received_df.shape)
punt_received_df.head(1)

(57386, 20)


Unnamed: 0,GameKey,PlayID,PlayDescription,Poss_Team,YardLine,penalty on,punt distance,side ball lands,yardline received,yardage on play,reward,GSISID,Time,x,y,dis,o,dir,Event,Role
0,3,455,(6:44) (Punt formation) S.Koch punts 54 yards ...,BLT,BLT 32,0,54,CAR,14,9,0.104651,31338.0,2016-08-11 23:51:08.400,64.199997,9.53,0.91,8.75,101.239998,punt_received,PRW


In [64]:
# Look at 'punt' event
punt_df, P_unique_ids = event_df_creation(condensed_ngs, 'punt')
print(punt_df.shape)
punt_df.head(1)

(57403, 20)


Unnamed: 0,GameKey,PlayID,PlayDescription,Poss_Team,YardLine,penalty on,punt distance,side ball lands,yardline received,yardage on play,reward,GSISID,Time,x,y,dis,o,dir,Event,Role
0,3,455,(6:44) (Punt formation) S.Koch punts 54 yards ...,BLT,BLT 32,0,54,CAR,14,9,0.104651,24417.0,2016-08-11 23:51:04.000,31.67,29.41,0.18,346.529999,130.740005,punt,P


In [65]:
print('Number of plays with punt_received as an event:',
      punt_received_df.groupby(['GameKey','PlayID']).size().reset_index().drop(columns=0).shape[0],
     'vs', remainder_df.shape[0], 'from the remainder dataframe')
print('Number of plays with punt as an event:',
      punt_df.groupby(['GameKey','PlayID']).size().reset_index().drop(columns=0).shape[0],
     'vs', remainder_df.shape[0], 'from the remainder dataframe')

Number of plays with punt_received as an event: 2604 vs 2627 from the remainder dataframe
Number of plays with punt as an event: 2617 vs 2627 from the remainder dataframe


- I'll have to hunt down the 13 plays that have a 'punt' event but no 'punt_received' event (2617 vs 2604) and see what's up later.
- So we're gonna calculate the proximity of players to the punt receiver when the ball is considered caught (event = punt_received). This will help to analyze how a play's outcome is affected by the proximity of punt team players to the PR. The <b>assumption</b> for this calculation is that the ball is always received by the PR.
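
The proximity idea can be sketched in a vectorized form on toy data. This is not the notebook's implementation: the coordinates are made up, the sketch assumes exactly one 'PR' row per play, and the `pr_x`, `pr_y`, and `within_8_yards` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows for one play at the 'punt_received' event
toy = pd.DataFrame({
    'GameKey': [3, 3, 3],
    'PlayID': [455, 455, 455],
    'Role': ['PR', 'GL', 'PLS'],
    'x': [10.0, 13.0, 40.0],
    'y': [20.0, 24.0, 20.0],
})

# Pull out the PR's coordinates per play and attach them to every player in that play
pr = (toy[toy['Role'] == 'PR'][['GameKey', 'PlayID', 'x', 'y']]
      .rename(columns={'x': 'pr_x', 'y': 'pr_y'}))
toy = toy.merge(pr, on=['GameKey', 'PlayID'], how='left')

# Euclidean distance from each player to the PR, in yards
toy['proximity_to_PR'] = np.hypot(toy['x'] - toy['pr_x'], toy['y'] - toy['pr_y'])
toy['within_8_yards'] = toy['proximity_to_PR'] <= 8

print(toy[['Role', 'proximity_to_PR', 'within_8_yards']])
```

Plays without a PR row would get NaN proximities here, so they would need to be handled separately.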

In [None]:
# Let's indicate what team the player is playing on based off player role
return_team_positions = ['PR', 'PDL1', 'PDL2', 'PDL3', 'PDL4', 'PDR1', 'PDR2', 'PDR3', 'PDR4', 'VL', 'VR', 
                         'PLL', 'PLR', 'VRo', 'VRi', 'VLi', 'VLo', 'PLM', 'PLR1', 'PLR2', 'PLL1', 'PLL2',
                         'PFB', 'PDL5', 'PDR5', 'PDL6', 'PLR3', 'PLL3', 'PDR6', 'PLM1', 'PDM']
punt_team_positions = ['P', 'PLS', 'PPR', 'PLG', 'PRG', 'PLT', 'PRT', 'PLW', 'PRW', 'GL', 'GR',
                       'GRo', 'GRi', 'GLi', 'GLo', 'PC', 'PPRo', 'PPRi', 'PPL', 'PPLi', 'PPLo']

def label_team(df):
    '''Label each player by the team they play on (adds a 'team' column to df in place)'''
    # Vectorized lookup: works for any index and avoids printing one line per row
    df['team'] = np.select(
        [df['Role'].isin(return_team_positions), df['Role'].isin(punt_team_positions)],
        ['return team', 'punt team'],
        default='unknown')

In [None]:
# Label punt_received dataset
label_team(punt_received_df)

In [None]:
def calculate_player_proximity(role_x, role_y, player_x, player_y):
    '''Calculate distance of a player to a particular role'''
    leg_x = (role_x - player_x) ** 2
    leg_y = (role_y - player_y) ** 2
    hypotenuse = np.sqrt(leg_x + leg_y)
    return hypotenuse
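
As a quick sanity check of the distance formula, here is a small worked example. The function is repeated so the block runs on its own, and the coordinates are made up for illustration (a 3-4-5 triangle), not taken from the NGS data.

```python
import numpy as np

def calculate_player_proximity(role_x, role_y, player_x, player_y):
    '''Euclidean distance (in yards) between a role's position and a player's position'''
    return np.sqrt((role_x - player_x) ** 2 + (role_y - player_y) ** 2)

# Hypothetical PR at (50, 20) and a gunner 3 yards downfield and 4 yards across
single = calculate_player_proximity(50.0, 20.0, 53.0, 24.0)

# The same function also works on whole arrays of player coordinates at once
xs = np.array([50.0, 53.0, 50.0])
ys = np.array([20.0, 24.0, 30.0])
many = calculate_player_proximity(50.0, 20.0, xs, ys)

print(single)
print(many)
```

Because the function only uses arithmetic and `np.sqrt`, it accepts scalars, NumPy arrays, or pandas Series without modification.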

In [None]:
def calculate_proximity_for_play(df, unique_ids, role):
    '''Add each player's distance to the player holding `role` on the same play (modifies df in place)'''
    # unique_ids is not needed for the calculation; it is kept so existing calls still work
    keys = ['GameKey', 'PlayID']

    # One (x, y) per play for the given role; if the role appears more than once, use the first
    role_xy = (df.loc[df['Role'] == role, keys + ['x', 'y']]
                 .drop_duplicates(subset=keys)
                 .rename(columns={'x': 'role_x', 'y': 'role_y'}))

    # A left merge keeps every row of df in its original order
    merged = df[keys + ['x', 'y']].merge(role_xy, how='left', on=keys)
    proximity = calculate_player_proximity(merged['role_x'], merged['role_y'],
                                           merged['x'], merged['y'])

    # Plays that don't actually have the particular role represented keep a proximity of 0
    df['proximity_to_' + role] = proximity.fillna(0).values

In [None]:
# Calculate proximity to punt receiver for each player
calculate_proximity_for_play(punt_received_df, PR_unique_ids, 'PR')

In [None]:
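# We'll drop punts that have no PR designated and the data point of the PR (proximity_to_PR == 0).
# We also keep only punt team players, since we want the closest opposing player to the PR.
pr_proximities = punt_received_df[(punt_received_df['proximity_to_PR'] != 0) &
                                  (punt_received_df['team'] == 'punt team')].reset_index(drop=True).copy()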

In [None]:
def plot_punt_events(df, unique_ids):
    '''
    Plot player positions for each play where a given event has occurred
    df: NGS dataframe that has already been filtered by some event and labeled by team
    unique_ids: unique (GameKey, PlayID) pairs built from the NGS dataframe of an event
    '''
    # Iterate through unique (GameKey, PlayID)
    for i in range(len(unique_ids)):

        # Play information
        game_key = unique_ids.loc[i, 'GameKey']
        play_id = unique_ids.loc[i, 'PlayID']

        # Get one unique set of data points related to a single (GameKey, PlayID) pair
        where_condition = ((df['GameKey'] == game_key) &\
                           (df['PlayID'] == play_id))
        just_view = df[where_condition].reset_index()
        print('Play Description:', just_view.loc[0, 'PlayDescription'])
        
        # Prepare figure of players positions when punt is received
        sns.set()
        fig, ax = plt.subplots()

        # Color each player by team: red = return team, blue = punt team, green = unknown
        team_colors = {'return team': 'red', 'punt team': 'blue'}
        colors = just_view['team'].map(team_colors).fillna('green').tolist()
        ax.scatter(just_view['x'], just_view['y'], alpha=0.8, c=colors)

        # Normal length of field is 120 yards
        plt.xlim(-10, 130)
        plt.xticks(np.arange(0, 130, step=10),
                   ['End', 'GL', '10', '20', '30', '40', '50', '40', '30', '20', '10', 'GL', 'End'])
        # Normal width is 53.3 yards
        plt.ylim(-10, 65)
        plt.yticks(np.arange(0, 65, 53.3), ['Sideline', 'Sideline'])
        plt.title('Playing Field')
        plt.xlabel('yardline')
        plt.ylabel('width of field')
        red_patch = mpatches.Patch(color='red', label='return team')
        blue_patch = mpatches.Patch(color='blue', label='punt team')
        green_patch = mpatches.Patch(color='green', label='unknown')
        plt.legend(handles=[red_patch, blue_patch, green_patch])
        plt.show()

In [None]:
# Plot all positions at the time of specific event
plot_punt_events(punt_received_df, PR_unique_ids)

In [None]:
pr_proximities.head(1)

In [None]:
# Get set of unique_ids for the punts for the pr_proximities dataset
pr_proximities_unique_ids = pr_proximities.groupby(['GameKey','PlayID']).size()\
    .reset_index().rename(columns={0:'count'})

In [None]:
def calculate_closest_player(df, unique_ids, column):
    '''Store the smallest value of `column` for each play in unique_ids (modifies unique_ids in place)'''
    # Minimum per (GameKey, PlayID) pair
    closest = df.groupby(['GameKey', 'PlayID'])[column].min()

    # Look the minimums up in the same order as the rows of unique_ids
    keys = pd.MultiIndex.from_frame(unique_ids[['GameKey', 'PlayID']])
    unique_ids['proximity'] = closest.reindex(keys).values

In [None]:
calculate_closest_player(pr_proximities, pr_proximities_unique_ids, 'proximity_to_PR')

In [None]:
# Plot the distribution of the distance from the punt receiver to the closest player
bins = list(range(0, 40))
plt.hist(pr_proximities_unique_ids['proximity'], bins=bins)

plt.title('Distance from punt receiver to closest punt team player')
plt.xlabel('Yards')
plt.ylabel('count')
plt.show()

pr_proximities_unique_ids['proximity'].describe()

In [None]:
# Merge the per-play closest proximity with the play-level data
yard_and_proximity = pd.merge(pr_proximities_unique_ids, remainder_df,
                              how='inner',
                              on=['GameKey', 'PlayID'])
print(yard_and_proximity.shape)
yard_and_proximity.head()

### Let's Look at Yardage and Proximity

In [None]:
def build_that_histogram(x, title):
    # One-yard bins; the + 2 adds a final bin [max, max + 1] so the largest value is not dropped
    bins = np.arange(x.min(), x.max() + 2)
    plt.hist(x, bins=bins)

    plt.title(title)
    plt.xlabel('Yards')
    plt.ylabel('count')
    plt.show()

In [None]:
# Let's look at the number of yards gained given how close the nearest opposing player
# is to the punt receiver
mean_dif = []
for i in range(1, 20):
    
    # Check for cases less than or equal to a certain distance
    where_condition = yard_and_proximity['proximity'] <= i
    title = 'Proximity <= ' + str(i) + ' yards'
    x = yard_and_proximity[where_condition]['yardage on play']
    build_that_histogram(x, title)
    print(x.describe())
    mean1 = x.mean()
    
    # Check for cases greater than a certain distance
    where_condition = yard_and_proximity['proximity'] > i
    title = 'Proximity > ' + str(i) + ' yards'
    x = yard_and_proximity[where_condition]['yardage on play']
    build_that_histogram(x, title)
    print(x.describe())
    mean2 = x.mean()
    
    mean_dif.append(mean2 - mean1)
    print('---')
    print('---')

In [None]:
plt.scatter([i for i in range(1, 20)], mean_dif)
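
The threshold sweep above can also be written as a small reusable function that skips the histograms and returns only the mean differences. This is a sketch under assumptions: the helper name `mean_difference_by_threshold` is hypothetical and the toy frame stands in for `yard_and_proximity`.

```python
import pandas as pd

def mean_difference_by_threshold(df, value_col, thresholds, proximity_col='proximity'):
    '''For each threshold t: mean(value | proximity > t) - mean(value | proximity <= t)'''
    diffs = []
    for t in thresholds:
        near = df.loc[df[proximity_col] <= t, value_col]
        far = df.loc[df[proximity_col] > t, value_col]
        # The result is NaN when either side of the split is empty
        diffs.append(far.mean() - near.mean())
    return pd.Series(diffs, index=list(thresholds), name='mean_dif')

# Toy data (made up): closest-player distance and yardage for five plays
toy = pd.DataFrame({'proximity': [1.0, 2.5, 4.0, 9.0, 12.0],
                    'yardage on play': [0, 2, 5, 10, 15]})

toy_dif = mean_difference_by_threshold(toy, 'yardage on play', range(1, 5))
print(toy_dif)
```

The same function should work for the 'reward' column by changing `value_col`, which avoids maintaining two nearly identical loops.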

### Let's Look at Reward and Proximity

In [None]:
def build_that_histogram_rewards(x, title):
    bins = [i * 0.01 for i in range(-15, 100, 1)]
    plt.hist(x, bins=bins)

    plt.title(title)
    plt.xlabel('Reward')
    plt.ylabel('Count')
    plt.show()

In [None]:
mean_dif = []
for i in range(1, 20):
    # Check for cases less than or equal to a certain distance
    where_condition = yard_and_proximity['proximity'] <= i
    title = 'Proximity <= ' + str(i) + ' yards'
    x = yard_and_proximity[where_condition]['reward']
    build_that_histogram_rewards(x, title)
    print(x.describe())
    mean1 = x.mean()
    
    # Check for cases greater than a certain distance
    where_condition = yard_and_proximity['proximity'] > i
    title = 'Proximity > ' + str(i) + ' yards'
    x = yard_and_proximity[where_condition]['reward']
    build_that_histogram_rewards(x, title)
    print(x.describe())
    mean2 = x.mean()
    
    mean_dif.append(mean2 - mean1)
    print('---')
    print('---')

In [None]:
plt.scatter([i for i in range(1, 20)], mean_dif)

- Intuitively, the farther the nearest punt team player is from the PR, the higher the mean return distance and 'reward' value for the return.
    - The plot of mean differences shows the distance at which additional separation yields only marginal gains in yardage between the two proximity distributions. This looks to be about 8-9 yards.
- Considering this, and that 18 of the concussions could have benefited from an 8 yard PR restricted zone, a rule for an 8 yard restricted zone around the PR is proposed.
- <b>PR Restricted Zone Rule:</b> The PR is given 8 yards of separation from the nearest punt team player. If a punt team player infringes on this zone when the punted ball first contacts a surface (player, field, ref), the play is called dead and a 10 yard penalty is assessed from the spot where the ball contacted that surface. A PR must be making an attempt to catch the ball for this restricted zone to apply.
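
To make the proposed rule concrete, the sketch below flags punt team players inside the restricted zone for a single play snapshot. Several things here are assumptions rather than part of the analysis: the helper name `restricted_zone_violations` is hypothetical, the snapshot is taken to be the punt_received event, "inside the zone" is read as strictly less than 8 yards, and the coordinates are toy values.

```python
import numpy as np
import pandas as pd

RESTRICTED_ZONE_YARDS = 8  # zone radius from the proposed rule

def restricted_zone_violations(play_df, zone=RESTRICTED_ZONE_YARDS):
    '''Return the punt team rows within `zone` yards of the PR for one play snapshot'''
    pr = play_df[play_df['Role'] == 'PR']
    if pr.empty:
        # No PR designated on this play, so the rule cannot apply
        return play_df.iloc[0:0].copy()
    pr_x = pr['x'].iloc[0]
    pr_y = pr['y'].iloc[0]

    punt_team = play_df[play_df['team'] == 'punt team'].copy()
    punt_team['proximity_to_PR'] = np.hypot(punt_team['x'] - pr_x, punt_team['y'] - pr_y)
    # Assumption: a violation means strictly inside the zone
    return punt_team[punt_team['proximity_to_PR'] < zone]

# Toy snapshot (made up): one PR, one blocker, two gunners
snapshot = pd.DataFrame({
    'Role': ['PR', 'PLR', 'GL', 'GR'],
    'team': ['return team', 'return team', 'punt team', 'punt team'],
    'x': [20.0, 22.0, 25.0, 30.0],
    'y': [25.0, 25.0, 25.0, 31.0],
})

violations = restricted_zone_violations(snapshot)
print(violations[['Role', 'proximity_to_PR']])
```

Return team players are ignored on purpose, since the rule only restricts the punt team.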

In [None]:
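# Number of current PRs by primary position (32 in total)
RB = 3
WR = 23
CB = 4
S = 2
# Yearly salary of each of the 32 current PRs
salaries = [10250000, 634143, 555000, 622450, 2914000, 964443, 2792829, 705205, 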
            1400000, 10000000, 1907000, 758915, 633944, 1400000, 1486143, 7000000, 
            1005000, 669998, 646555, 525000, 2820696, 769039, 540000, 480000, 
            741497, 2585497, 2340000, 905000, 1005000, 5500000, 6000000, 658961]
np.average(salaries)

### Cost of Losing Punt Receiver
- 5 of the 10 offensive players who received a concussion were PRs
    - PR position breakdown for current rosters:
        - WR: 23, CB: 4, RB: 3, S: 2
    - Their average salary is ~2.2M/year, calculated from:
        - https://www.ourlads.com/nfldepthcharts/depthchartpos/PR
        - https://www.spotrac.com/
    - 2013: players spent 16 days on average off the field with a concussion
        - https://www.usatoday.com/story/sports/nfl/2013/07/31/concussions-keeping-nfl-players-off-the-field-longer/2604023/
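
A rough way to turn these two numbers into a cost per concussion is to pro-rate the average salary over the days missed. This is only a back-of-the-envelope sketch: the pro-rating periods below (a 365 day year, or a 17 week regular season) are assumptions added for illustration and do not come from the sources above.

```python
AVG_PR_SALARY = 2_200_000   # ~2.2M/year, from the salary average above
DAYS_MISSED = 16            # average days off the field with a concussion (2013 figure)

def prorated_cost(salary, days_missed, days_in_period):
    '''Salary attributable to the missed days, assuming pay accrues evenly over the period'''
    return salary * days_missed / days_in_period

# Assumption 1: salary spread evenly over the calendar year
cost_calendar = prorated_cost(AVG_PR_SALARY, DAYS_MISSED, 365)

# Assumption 2: salary spread evenly over a 17 week regular season
cost_season = prorated_cost(AVG_PR_SALARY, DAYS_MISSED, 17 * 7)

print(round(cost_calendar), round(cost_season))
```

The two assumptions give noticeably different figures, so any cost quoted from this should state which pro-rating period was used.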