## Abstract

This kernel produces my analysis for Kaggle's NFL Punt Analytics Competition.  My goal is to reduce the number of concussions on punt plays, through rule changes that are supported by the given data.  Two important aspects of the potential rule changes are that they cannot drastically alter the fabric of the game and that they must have the ability to be easily implemented.  I believe both of my proposed rule changes satisfy these requirements and will have a positive impact on punt safety in the NFL.

## Executive Summary
My analysis concludes with two proposed rule changes regarding punt plays:
1. **Gunner Blockers** - When Team A presents a punt formation, Team B may have at most 1 blocker aligned opposite each of Team A's end men on the line of scrimmage (gunners) at the snap of the ball. 

2. **Touchbacks** - If the result of the punt play is a touchback, the dead-ball spot for the following snap will be the 25-yard line.

The code below walks through my thought process, from exploring the data to the conclusions and proposals.  For each rule change, the code will:
* Describe & visualize the rule changes
* Examine the evidence and reasoning supported by the data
* Forecast the benefits, should the rules be implemented
* Explore potential unintended consequences

The two rule changes are independent of one another; however, both are based on the notion that decreasing punt ***returns*** will decrease injuries.

In [None]:
#Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
%matplotlib inline
import re
import warnings
warnings.filterwarnings('ignore')
from plotly.offline import init_notebook_mode, iplot
from plotly import tools
import plotly.graph_objs as go
init_notebook_mode(connected=True)
from IPython.display import display, HTML

I use the below function to load the field layout. Special thanks to Chris Crawford for providing the code in his starter kernel.

In [None]:
def load_field():
    field = dict(
        title = "Player Activity",
        plot_bgcolor='darkseagreen',
        showlegend=True,
        xaxis=dict(
            autorange=False,
            range=[0, 120],
            showgrid=False,
            zeroline=False,
            showline=True,
            linecolor='black',
            linewidth=1,
            mirror=True,
            ticks='',
            tickmode='array',
            tickvals=[10,20, 30, 40, 50, 60, 70, 80, 90, 100, 110],
            ticktext=['Goal', 10, 20, 30, 40, 50, 40, 30, 20, 10, 'Goal'],
            showticklabels=True
        ),
        yaxis=dict(
            title='',
            autorange=False,
            range=[-3.3,56.3],
            showgrid=False,
            zeroline=False,
            showline=True,
            linecolor='black',
            linewidth=1,
            mirror=True,
            ticks='',
            showticklabels=False
         ),
        shapes=[
            dict(
                type='line',
                layer='below',
                x0=0,
                y0=0,
                x1=120,
                y1=0,
                line=dict(
                    color='white',
                    width=2
                )
            ),
            dict(
                type='line',
                layer='below',
                x0=0,
                y0=53.3,
                x1=120,
                y1=53.3,
                line=dict(
                    color='white',
                    width=2
                )
            ),
            dict(
                type='line',
                layer='below',
                x0=10,
                y0=0,
                x1=10,
                y1=53.3,
                line=dict(
                    color='white',
                    width=10
                )
            ),
            dict(
                type='line',
                layer='below',
                x0=20,
                y0=0,
                x1=20,
                y1=53.3,
                line=dict(
                    color='white'
                )
            ),
            dict(
                type='line',
                layer='below',
                x0=30,
                y0=0,
                x1=30,
                y1=53.3,
                line=dict(
                    color='white'
                )
            ),
            dict(
                type='line',
                layer='below',
                x0=40,
                y0=0,
                x1=40,
                y1=53.3,
                line=dict(
                    color='white'
                )
            ),
            dict(
                type='line',
                layer='below',
                x0=50,
                y0=0,
                x1=50,
                y1=53.3,
                line=dict(
                    color='white'
                )
            ),
            dict(
                type='line',
                layer='below',
                x0=60,
                y0=0,
                x1=60,
                y1=53.3,
                line=dict(
                    color='white'
                )
            ),dict(
                type='line',
                layer='below',
                x0=70,
                y0=0,
                x1=70,
                y1=53.3,
                line=dict(
                    color='white'
                )
            ),dict(
                type='line',
                layer='below',
                x0=80,
                y0=0,
                x1=80,
                y1=53.3,
                line=dict(
                    color='white'
                )
            ),
            dict(
                type='line',
                layer='below',
                x0=90,
                y0=0,
                x1=90,
                y1=53.3,
                line=dict(
                    color='white'
                )
            ),dict(
                type='line',
                layer='below',
                x0=100,
                y0=0,
                x1=100,
                y1=53.3,
                line=dict(
                    color='white'
                )
            ),
            dict(
                type='line',
                layer='below',
                x0=110,
                y0=0,
                x1=110,
                y1=53.3,
                line=dict(
                    color='white',
                    width=10
                )
            )
        ]
    )
    
    return field

In [None]:
#Read in Data
game_data = pd.read_csv('../input/game_data.csv')
play_info = pd.read_csv('../input/play_information.csv')
play_player_role = pd.read_csv('../input/play_player_role_data.csv')
player_punt = pd.read_csv('../input/player_punt_data.csv')
video_footage_control = pd.read_csv('../input/video_footage-control.csv')
video_footage_injury = pd.read_csv('../input/video_footage-injury.csv')
video_review = pd.read_csv('../input/video_review.csv')

I'll create a function to read in all of the Next Gen Stats data.  I'll then concatenate the individual files into one large file to make the NGS data easier to work with.


In [None]:
def read_NGS_data(file_lst):
    #Read each file once, then concatenate all of them in a single call
    frames = []
    for file in file_lst:
        print('Reading in {}'.format(file))
        frames.append(pd.read_csv('../input/' + file))
    return pd.concat(frames)

file_lst = ['NGS-2016-pre.csv','NGS-2016-reg-wk1-6.csv','NGS-2016-reg-wk7-12.csv','NGS-2016-reg-wk13-17.csv','NGS-2016-post.csv','NGS-2017-pre.csv','NGS-2017-reg-wk1-6.csv','NGS-2017-reg-wk7-12.csv','NGS-2017-reg-wk13-17.csv','NGS-2017-post.csv']

NGS_df = read_NGS_data(file_lst)

## Exploratory Data Analysis

In [None]:
print('There are concussion injuries on ' \
      + str(round(len(video_review) / float(len(play_info)) * 100, 2)) \
      + '% of ' + 'punt plays')

My initial reaction is that head injuries are already rare on punt plays (<1%).
However, given the severity of these injuries, any reduction to concussions would be beneficial for the NFL.
### How are players being injured?
Explore:
* Primary player activities
* Impact types (ex. Helmet-to-Helmet)
* Friendly Fire
* Partner activities

I create bar graphs with Plotly to show the counts for each of the bullet points.

In [None]:
trace1 = go.Bar(
        x=video_review.groupby(['Player_Activity_Derived'], \
                               as_index=False)['PlayID'].count()['Player_Activity_Derived'],
        y=video_review.groupby(['Player_Activity_Derived'],\
                               as_index=False)['PlayID'].count()['PlayID']
    )
trace2 = go.Bar(
        x=video_review.groupby(['Primary_Impact_Type'], \
                               as_index=False)['PlayID'].count()['Primary_Impact_Type'],
        y=video_review.groupby(['Primary_Impact_Type'], \
                               as_index=False)['PlayID'].count()['PlayID'],
    )
trace3 = go.Bar(
        x=video_review.groupby(['Friendly_Fire'], \
                               as_index=False)['PlayID'].count()['Friendly_Fire'],
        y=video_review.groupby(['Friendly_Fire'], \
                               as_index=False)['PlayID'].count()['PlayID'],
    )
trace4 = go.Bar(
        x=video_review.groupby(['Primary_Partner_Activity_Derived'], \
                               as_index=False)['PlayID'].count()['Primary_Partner_Activity_Derived'],
        y=video_review.groupby(['Primary_Partner_Activity_Derived'], \
                               as_index=False)['PlayID'].count()['PlayID'],
    )

fig = tools.make_subplots(rows=2,
                          cols=2, 
                          subplot_titles=('Player Activity Derived', 
                                          'Primary Impact Type',
                                          'Friendly Fire', 
                                          'Primary Partner Activity Derived')
                         )

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 2, 1)
fig.append_trace(trace4, 2, 2)

fig['layout'].update(showlegend=False)

iplot(fig, filename='injury-eda')

#### Quick Conclusions
* No primary player activity significantly stands out
* Surprisingly, helmet-to-helmet and helmet-to-body are involved in the same number of injuries
* Friendly fire is only clear 16% of the time
* No partner activity stands out

### Who's getting hurt?
Explore:
* The side of the ball (coverage vs return)
* The punt role of the players involved

Below, I categorize player punt roles into sides of the ball and create bar graphs to visualize the injuries on each side of the ball for primary and partner.
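
The role-to-side mapping can be sketched on its own with plain Python. This is a minimal illustration, not the kernel's implementation: the role sets are shortened, and the names `RETURN_ROLES`, `COVERAGE_ROLES`, and `side_of_ball` are hypothetical.

```python
# Shortened, illustrative role sets (the kernel uses the full lists)
RETURN_ROLES = {"PR", "PDL1", "PDR1", "VL", "VR"}
COVERAGE_ROLES = {"GL", "GR", "P", "PLS", "PRG"}

def side_of_ball(role):
    """Return 'return', 'coverage', or '' for an unrecognized role."""
    if role in RETURN_ROLES:
        return "return"
    if role in COVERAGE_ROLES:
        return "coverage"
    return ""

# "XX" stands in for an unknown role, such as a missing partner
sides = [side_of_ball(r) for r in ["PR", "GL", "XX"]]
print(sides)
```

An unrecognized role maps to an empty string, which matches how plays without an identified partner end up in their own group in the bar graphs.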

In [None]:
return_roles = ['PDL1','PDL2','PDL3','PDL4','PDL5','PDL6','PDM','PDR1','PDR2','PDR3','PDR4','PDR5','PDR6'
                ,'PFB','PLL','PLL1','PLL2','PLL3','PLM','PLM1','PLR','PLR1','PLR2','PLR3','PR','VL','VLi'
                ,'VLo','VR','VRi','VRo']

coverage_roles = ['GL','GLi','GLo','GR','GRi','GRo','P','PC','PLG','PLS','PLT','PLW','PPL','PPLi','PPLo'
                 ,'PPR','PPRi','PPRo','PRG','PRT','PRW']

#Merge the player role data with the injury review data on the primary player
inj_players = video_review.merge(play_player_role, how='inner', on=['Season_Year', 'GameKey', 'PlayID', 'GSISID'])
inj_players.rename(columns={'Role':'inj_role'}, inplace=True)

#Create a column to determine side of the ball
inj_players['inj_side_of_ball'] = np.where(inj_players.inj_role.isin(return_roles), 'return',
                                          np.where(inj_players.inj_role.isin(coverage_roles), 'coverage', ''))

#Merge the player role data with the injury review data on the partner player
partner_players = video_review[['Season_Year', 'GameKey', 'PlayID', 'Primary_Partner_GSISID']].copy()
partner_players['Primary_Partner_GSISID'] = partner_players.loc[:,'Primary_Partner_GSISID'] \
                                                .replace('Unclear','0').fillna(0).astype(int)
partner_players = partner_players.merge(play_player_role, how='left', \
                                        left_on=['Season_Year', 'GameKey', 'PlayID', 'Primary_Partner_GSISID'],\
                                       right_on =['Season_Year', 'GameKey', 'PlayID', 'GSISID'])
partner_players = partner_players.drop('GSISID', axis=1)
partner_players.rename(columns={'Role':'partner_role'}, inplace=True)

#Create a column to determine side of the ball
partner_players['partner_side_of_ball'] = np.where(partner_players.partner_role.isin(return_roles), 'return',
                                          np.where(partner_players.partner_role.isin(coverage_roles), 'coverage', ''))

#Concatenate the primary and partner dataframes
inj_partner_df = pd.concat([inj_players,partner_players[['partner_role','partner_side_of_ball']]], axis = 1)

In [None]:
trace1 = go.Bar(
        x=inj_partner_df.groupby(['inj_side_of_ball'], as_index=False)['PlayID'].count()['inj_side_of_ball'],
        y=inj_partner_df.groupby(['inj_side_of_ball'], as_index=False)['PlayID'].count()['PlayID']
    )

trace2 = go.Bar(
        x=inj_partner_df.groupby(['partner_side_of_ball'], as_index=False)['PlayID'].count()['partner_side_of_ball'],
        y=inj_partner_df.groupby(['partner_side_of_ball'], as_index=False)['PlayID'].count()['PlayID']
    )

fig = tools.make_subplots(rows=1, cols=2, subplot_titles=('Injured Side of Ball', 'Partner Side of Ball'))

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)

fig['layout'].update(showlegend=False)

iplot(fig, filename='side-of-ball')

I pivot the injury dataframe to see the count of injuries for each primary and partner role combination.

In [None]:
pd.pivot_table(inj_partner_df, index=['inj_role'], columns=['partner_role'],
              values='GSISID', aggfunc='count', margins=True).fillna('-')

#### Quick Conclusions
* The primary player injured is a member of the coverage team 73% of the time
* There is less of a chasm between the side of the ball for partner players
    * 4 instances with no partner player
* Looking at specific roles:
    * PR is involved in 13 total injuries - 8 of which are as a partner
    * The top coverage positions as primary injuries are PRG, PLW, PLG, GL
    * There is only 1 combination with more than 1 injury - PRG & PR

### On what types of plays are players getting hurt?

Explore:
* What are the outcomes of punt plays?
* What are the frequencies of each outcome?
* What are the proportions of injuries for each outcome?

I start by adding a column in the play_info dataframe to label the general outcome of each play.  I'll then merge the play_info dataframe with the video_review dataframe in order to see the play outcomes with injuries.  By grouping and aggregating on the outcome column, I can compare the injury rates by outcome.
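
The labelling step depends on rule order: the first pattern that matches a play description decides the outcome. A minimal sketch of that idea follows, assuming a small subset of patterns and made-up play descriptions; `RULES` and `label_outcome` are hypothetical names.

```python
import re

# Ordered (pattern, label) rules: the first matching pattern wins,
# which mirrors a chain of nested conditionals.
# The patterns are an illustrative subset, not the full rule set.
RULES = [
    (r"muffs", "muff"),
    (r"fair catch by", "fair_catch"),
    (r"touchback", "touchback"),
    (r"out of bounds", "oob"),
    (r"downed", "downed"),
    (r"for [-+]?[0-9]+ yards?|for no gain", "return"),
]

def label_outcome(description):
    """Return the label of the first rule that matches, else ''."""
    for pattern, label in RULES:
        if re.search(pattern, description, flags=re.IGNORECASE):
            return label
    return ""

# Made-up descriptions written in the style of play-by-play text
examples = [
    "punts 45 yards to DEN 20, fair catch by J.Doe.",
    "punts 50 yards to end zone, Touchback.",
    "punts 40 yards to NE 30. R.Roe to NE 38 for 8 yards.",
]
labels = [label_outcome(d) for d in examples]
print(labels)
```

Placing the more specific outcomes ahead of the generic return pattern keeps a fair catch or touchback from being counted as a return.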

In [None]:
play_info['outcome'] =  np.where(play_info['PlayDescription'].str.contains('aborted|Fumbled snap|FUMBLES, and recovers', flags=re.IGNORECASE, regex=True), 'aborted',
                        np.where(play_info['PlayDescription'].str.contains('fake|pass|right end|left end|up the middle|Direct snap|right guard', flags=re.IGNORECASE, regex=True), 'fake',
                        np.where(play_info['PlayDescription'].str.contains('muffs', flags=re.IGNORECASE, regex=True), 'muff',         
                        np.where(play_info['PlayDescription'].str.contains('fair catch by', flags=re.IGNORECASE, regex=True), 'fair_catch',
                        np.where(play_info['PlayDescription'].str.contains('touchback', flags=re.IGNORECASE, regex=True), 'touchback',
                        np.where(play_info['PlayDescription'].str.contains('blocked|deflected', flags=re.IGNORECASE, regex=True), 'blocked',
                        np.where(play_info['PlayDescription'].str.contains('out of bounds.', flags=re.IGNORECASE, regex=False), 'oob',
                        np.where(play_info['PlayDescription'].str.contains('downed', flags=re.IGNORECASE, regex=True), 'downed', 
                        np.where(play_info['PlayDescription'].str.contains('safety', flags=re.IGNORECASE, regex=True), 'safety',
                        np.where(play_info['PlayDescription'].str.contains('[0-9]+ for [-+]?[0-9]+ yards?|for no gain|touchdown|(to [A-Z]+ [0-9]+ for [-+]?[0-9]+ yards?)|(to [0-9]+ for [-+]?[0-9]+ yards?)', flags=re.IGNORECASE, regex=True), 'return',         
                        np.where(play_info['PlayDescription'].str.contains('- no play|delay of game|false start, declined|penalty enforced', flags=re.IGNORECASE, regex=True), 'no_play', ' ')))))))))))

#Merge the play info df with the injury df
pi = play_info.merge(video_review[['Season_Year', 'GameKey', 'PlayID','GSISID']], how='left', on =['Season_Year', 'GameKey', 'PlayID'])
pi['injury'] = np.where(pi.GSISID.notnull(), 1, 0).astype(int)
pi.drop('GSISID', axis = 1, inplace=True)
pi_inj_grouped = pi.groupby(['outcome'], as_index=False)['injury'] \
    .agg({'total_plays':'count','injuries':sum}) \
    .sort_values('total_plays', ascending = False) \
    .reset_index(drop=True)

pi_inj_grouped['injury_rate'] = round(pi_inj_grouped['injuries'] / pi_inj_grouped['total_plays'] * 100, 1).astype(str) + '%'
pi_inj_grouped

#### Quick Conclusions: 
* Returns are the most frequent outcome (41%)
* Over 1% of punt plays with a return have an injury
* This is 10x higher than when a fair catch is called
* The next 3 most frequent outcomes have lower than average injury rates:
    * downed 0.4%
    * out of bounds 0%
    * touchback 0%
    
Since the majority of injuries are occurring when there is a return, I'll look to find the differences between returns and other outcomes, starting with the time and distance between the returner and the coverage team.

### Returner and Coverage Team Metrics During Punt
Explore:
* Distribution of punt hang times
    * will need a general sense of how long the ball is in the air
* For each second from the time the ball is punted to the time it is received:
    * calculate distance from each coverage man to the returner
    * calculate the speed (meters per sec) for each player

#### Punt Hang Times
I create a function to calculate the time in seconds between the punt and the catch.  The counts are then visualized in a histogram, followed by summary statistics.  Since we don't have event data on the location of the football, I limit the stop events to punts that are either received or fair caught.
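
For a single play, the calculation reduces to subtracting the earliest punt timestamp from the earliest receiving timestamp. The sketch below shows this with the standard library only; the event tuples are made up and the `hang_time` helper is a hypothetical stand-in for the dataframe version.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

def hang_time(events, start_event="punt",
              stop_events=("punt_received", "fair_catch")):
    """events: list of (timestamp_string, event_name) for one play.

    Returns seconds from the first start event to the first stop event,
    or None when either event is missing.
    """
    starts = [datetime.strptime(t, FMT) for t, e in events if e == start_event]
    stops = [datetime.strptime(t, FMT) for t, e in events if e in stop_events]
    if not starts or not stops:
        return None
    return (min(stops) - min(starts)).total_seconds()

# Made-up event log for one play
play = [
    ("2016-09-01 19:30:00.000", "ball_snap"),
    ("2016-09-01 19:30:02.100", "punt"),
    ("2016-09-01 19:30:06.500", "punt_received"),
]
print(hang_time(play))
```

Plays without a receiving event (for example touchbacks or punts out of bounds) return `None` here, which corresponds to those plays dropping out of the inner merge in the dataframe version.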

In [None]:
def get_hang_time(ngs_df, start_event='punt', *stop_events):
    punt_event = ngs_df.loc[ngs_df.Event==start_event] \
        .groupby(['Season_Year', 'GameKey','PlayID'], as_index = False)['Time'].min()
    punt_event.rename(columns = {'Time':'punt_time'}, inplace=True)
    punt_event['punt_time'] = pd.to_datetime(punt_event['punt_time'],\
                                             format='%Y-%m-%d %H:%M:%S.%f')
    
    receiving_event = ngs_df.loc[ngs_df.Event.isin(stop_events)] \
        .groupby(['Season_Year', 'GameKey','PlayID'], as_index = False)['Time'].min()
    receiving_event.rename(columns = {'Time':'receiving_time'}, inplace=True)
    receiving_event['receiving_time'] = pd.to_datetime(receiving_event['receiving_time'],\
                                             format='%Y-%m-%d %H:%M:%S.%f')
    
    punt_df = punt_event.merge(receiving_event, how='inner', on = ['Season_Year','GameKey','PlayID']) \
                .reset_index(drop=True)
    
    punt_df['hang_time'] = (punt_df['receiving_time'] - punt_df['punt_time']).dt.total_seconds()
    
    return punt_df

punt_df = get_hang_time(NGS_df, 'punt', 'punt_received', 'fair_catch')

#Visualize the hang times with a histogram
data = [go.Histogram(x=punt_df.hang_time)]

layout = go.Layout(
    title='Hang Time Histogram',
    xaxis=dict(
        title='Seconds'
    ),
    yaxis=dict(
        title='Punt Count'
    )
)

fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='hang-hist')

In [None]:
#Print a few summary statistics
print('The mean hang time is ' + str(round(punt_df['hang_time'].mean(), 1)))
print('The median hang time is ' + str(round(punt_df['hang_time'].median(), 1)))
print('The standard deviation is ' + str(round(punt_df['hang_time'].std(), 1)))
print(str(round(len(punt_df.loc[punt_df.hang_time < 6]) / len(punt_df) * 100, 1))\
      + '% of hang times are less than 6 seconds')

### Space, Speed & Blocker Counts
In order to analyze the distance between every coverage player and the returner, I created the below function.  Along with distance (in yards) it also performs a speed calculation (in meters per second) on 1 second intervals starting at the time of the ball being punted. Another aspect of this function is that it counts the number of gunner blockers for each play.

*Note: the coverage_returner_space function may take 30 minutes to an hour to complete, but it is a key component in constructing the dataframe for the rest of the analysis.*
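
The two per-row calculations can be sketched in isolation. The distance is the Euclidean distance between two field coordinates in yards. For speed, the sketch assumes the `dis` column holds yards travelled per tracking sample and that samples arrive every 0.1 seconds; under that assumption the factor is 10 × 0.9144 = 9.144, which is close to the 9.1 used in this kernel. The helper names are hypothetical.

```python
import math

YARDS_TO_METERS = 0.9144
SAMPLES_PER_SECOND = 10  # assumption: one tracking sample every 0.1 s

def distance_yards(x1, y1, x2, y2):
    """Euclidean distance between two field positions, in yards."""
    return math.hypot(x1 - x2, y1 - y2)

def speed_mps(dis_yards_per_sample):
    """Convert yards travelled in one sample to meters per second."""
    return dis_yards_per_sample * SAMPLES_PER_SECOND * YARDS_TO_METERS

# A coverage player at (30, 20) and a returner at (70, 50)
print(distance_yards(30.0, 20.0, 70.0, 50.0))
# A player who moved 0.9 yards during one sample
print(round(speed_mps(0.9), 2))
```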

In [None]:
def coverage_returner_space(play_df, ngs_df):
    cov_ret_lst = []
    gunner_blockers = ['VL','VLi','VLo','VR','VRi','VRo']
    for i in range(0, len(play_df)):
        season_key = play_df['Season_Year'][i]
        game_key = play_df['GameKey'][i]
        play_id = play_df['PlayID'][i]
        outcome = play_df['outcome'][i]
        injury = play_df['injury'][i]
        if i % 100 == 0:
            print('{} / {}'.format(i, len(play_df)))
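        #Compare against the column values; `in` on a Series checks the index, not the values
        if game_key in ngs_df.GameKey.values and play_id in ngs_df.PlayID.values: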
            filtered_play = ngs_df.loc[(ngs_df.GameKey == game_key) \
                                       & (ngs_df.PlayID == play_id)].sort_values('Time').reset_index(drop=True)
            filtered_play = filtered_play.merge(play_player_role, \
                                                how='inner', on = ['Season_Year','GameKey','PlayID','GSISID'])
            if len(filtered_play) > 0:
                filtered_play['Time'] = pd.to_datetime(filtered_play['Time'], \
                                                       format='%Y-%m-%d %H:%M:%S.%f')
                punt_event_time = filtered_play.loc[filtered_play.Event == 'punt'].Time.min()
                receiving_event_time = filtered_play.loc[(filtered_play.Event == 'punt_received') | \
                                                         (filtered_play.Event == 'fair_catch')].Time.min()
                gunner_blocker_count = len(filtered_play.loc[filtered_play['Role'].isin(gunner_blockers)]['Role'].unique())
                filtered_play = filtered_play.loc[(filtered_play.Time >= punt_event_time) & \
                                                  (filtered_play.Time <= receiving_event_time)]
                coverage_df = filtered_play.loc[filtered_play['Role'].isin(coverage_roles)].sort_values('Time')
                coverage_df.rename(columns={'x':'cov_x', 
                                          'y': 'cov_y',
                                          'GSISID': 'cov_GSISID',
                                          'dis': 'cov_dis',
                                          'o': 'cov_o',
                                          'dir': 'cov_dir',
                                          'Role': 'cov_Role'
                                         }, inplace=True)
                
                returner_df = filtered_play.loc[filtered_play['Role'] == 'PR'].sort_values('Time')
                returner_df.rename(columns={'x':'ret_x', 
                                          'y': 'ret_y',
                                          'GSISID': 'ret_GSISID',
                                          'dis': 'ret_dis',
                                          'o': 'ret_o',
                                          'dir': 'ret_dir',
                                          'Role': 'ret_Role'
                                         }, inplace=True)
                returner_df = returner_df.drop('Event', axis = 1)
                
                cov_ret_df = coverage_df.merge(returner_df, how ='inner', on = ['Season_Year','GameKey','PlayID','Time'])
                cov_ret_df['dis_from_ret'] = ((cov_ret_df['cov_x'] -  cov_ret_df['ret_x']) ** 2 \
                                           + (cov_ret_df['cov_y'] -  cov_ret_df['ret_y']) ** 2).apply(np.sqrt)
                cov_ret_df['time_since_punt'] = cov_ret_df['Time'] - punt_event_time
                times_to_capture = [punt_event_time + pd.Timedelta(seconds=i) for i in range(0, 7)]
                cov_ret_df = cov_ret_df.loc[cov_ret_df['Time'].isin(times_to_capture)]
                cov_ret_df['gunner_blockers'] = gunner_blocker_count
                cov_ret_df['outcome'] = outcome
                cov_ret_df['injury'] = injury
                cov_ret_df['cov_speed'] = convert_to_meters_per_sec(cov_ret_df.cov_dis, 9.1)
                cov_ret_df['ret_speed'] = convert_to_meters_per_sec(cov_ret_df.ret_dis, 9.1)
                if len(cov_ret_df) > 0:
                    cov_ret_lst.append(cov_ret_df)
                    
    cov_ret_df = pd.concat(cov_ret_lst).reset_index(drop=True)          
    return cov_ret_df

def convert_to_meters_per_sec(dis_vector, converter):
    mps_vector = dis_vector * converter
    return mps_vector

cov_ret_df = coverage_returner_space(pi, NGS_df)

#### Injury Play Animation and Distance Calculation
- For an added visualization, below is an animation of an example play
- It tracks the paths of an injured punt returner and the cover man involved in the tackle
- The animation also shows the distance between the two players during the play
- Begin by pressing the "Play" button

In [None]:
#Selecting an example injury play for returner and coverman
ex_play = NGS_df.loc[(NGS_df.Season_Year==2016) & (NGS_df.GameKey== 234) \
                     & (NGS_df.PlayID== 3278) & (NGS_df.GSISID== 28620)
                    ].sort_values('Time')

ret_play = NGS_df.loc[(NGS_df.Season_Year==2016) & (NGS_df.GameKey== 234) \
                     & (NGS_df.PlayID== 3278) & (NGS_df.GSISID== 27860)].sort_values('Time')

#Removing time prior to the ball being snapped
ball_snap_time = ex_play.loc[ex_play.Event == 'ball_snap'].Time.min()
ex_play = ex_play.loc[ex_play.Time >= ball_snap_time].reset_index(drop=True)
ret_play = ret_play.loc[ret_play.Time >= ball_snap_time].reset_index(drop=True)

#Filling NA's in the event column with the prior event
ex_play['Event'] = ex_play['Event'].ffill()

#Data for Animation
x = np.array(ex_play.x)
y = np.array(ex_play.y)
xx = np.array(ex_play.x)
yy = np.array(ex_play.y)

x1 = np.array(ret_play.x)
y1 = np.array(ret_play.y)
xx1 = np.array(ret_play.x)
yy1 = np.array(ret_play.y)

N = len(x)

data=[dict(x=x, y=y, 
            name='Distance',
            mode='lines',
            textposition='bottom center',
            line=dict(width=2, color=None)
          ),
      dict(x=x, y=y, 
            name='Injured Player',
            mode='markers',
            marker=dict(color=None, size=15)
          ),
      dict(x=x1, y=y1, 
           name = 'Partner Player',
           mode='markers',
           marker=dict(color='orange', size=15)
         )
    ]

layout = load_field()
layout['hovermode'] = 'closest'
layout['updatemenus'] = [{'type': 'buttons',
                           'buttons': [{'label': 'Play',
                                        'method': 'animate',
                                        'args': [None]}]}]

frames=[dict(data=[dict(x=[x1[k]], 
                        y=[y1[k]], 
                        mode='markers', 
                        marker=dict(color='#013369', size=15),
                        name='Partner Player'
                        ),
                   dict(x=[x[k]], 
                        y=[y[k]], 
                        mode='markers', 
                        marker=dict(color='orange', size=15),
                        name='Injured Player'
                        ), 
                   dict(x=[xx[k], xx1[k], None, xx[k], xx1[k]], 
                        y=[yy[k], yy1[k], None, yy[k], yy1[k]], 
                        mode='lines', 
                        text='Distance: {}'.format(round(np.sqrt((xx1[k] - xx[k])**2 + (yy1[k] - yy[k])**2),0)),
                        textposition='bottom center',
                        line=dict(color='#2c3539', width=2),
                        name='Distance'
                       )
                  ], layout=dict(title=ex_play.Event[k],
                                 annotations=[
                                     dict(x=100,
                                          y=5,
                                          showarrow=False,
                                          font=dict(
                                              family='Courier New, monospace',
                                              size=14,
                                              color='#ffffff'),
                                          align='center',
                                          bordercolor='#c7c7c7',
                                          borderwidth=2,
                                          borderpad=4,
                                          bgcolor='#2c3539',
                                          opacity=0.8,
                                          text='{} Yds'.format(round(np.sqrt((xx1[k] - xx[k])**2 + (yy1[k] - yy[k])**2),0)),
                                          )
                                 ]
                                )
            ) for k in range(0, N, 5)]
          
figure1=dict(data=data, layout=layout, frames=frames)
iplot(figure1)

### Analyzing coverage and returner data frame

Now that the dataframe from the coverage_returner_space function is constructed, I can dive into some serious analysis.

Explore:
* Does the space between the coverage team and the returner impact the return rate?
* How far is each coverage position from the returner during the punt?
* When does the coverage team close the most distance?
* Which coverage positions have the greatest impact on returns?
    * Focusing on returns vs fair catches
    
I start by pivoting the data to show mean distance at each second for return vs fair catch    

In [None]:
dist_pivot = pd.pivot_table(cov_ret_df.loc[cov_ret_df.outcome.isin(['return', 'fair_catch'])]
                            , index='time_since_punt'
                            , columns='outcome'
                            , values='dis_from_ret', aggfunc='mean').round(1)

dist_pivot['delta'] = dist_pivot['return'] - dist_pivot['fair_catch']
dist_pivot

I then want to see how close each position is to the returner on average.

In [None]:
#Group inside and outside gunners into left and right
cov_ret_df = cov_ret_df.replace('GLi', 'GL').replace('GLo', 'GL').replace('GRi', 'GR').replace('GRo', 'GR')
#On average, how close is each position to the returner?
cov_ret_df.groupby(['cov_Role'], as_index=False)['dis_from_ret'] \
    .agg({'count':'count', 'dis_from_ret':np.mean}) \
    .sort_values('dis_from_ret').round(1)

A line graph shows the average distance from the returner for each coverage position at each second of the punt.

In [None]:
by_time_pos = cov_ret_df.groupby(['time_since_punt', 'cov_Role'], as_index=False)['dis_from_ret'].mean()

trace_lst = []

for role in by_time_pos.cov_Role.unique():
    if role not in ['Go', 'PC', 'PPLi', 'PPLo']:
        trace = go.Scatter(
            x = by_time_pos.loc[by_time_pos.cov_Role == role].time_since_punt.dt.total_seconds(),
            y = by_time_pos.loc[by_time_pos.cov_Role == role].dis_from_ret,
            mode = 'lines',
            name = role
        )

        trace_lst.append(trace)

data = trace_lst

layout = dict(title = 'Distance From Returner by Coverage Position ',
              xaxis = dict(title = 'Seconds from Punt'),
              yaxis = dict(title = 'Avg Distance (Yds)', 
                           range = [0,60]),
              hovermode = 'closest')

fig = dict(data=data, layout=layout)

iplot(fig, filename='line-dist')

A grouped bar graph shows the average distance from the returner for each coverage position, split by return vs fair catch.

In [None]:
#Mean distance and row count for each position/outcome pair
pos_grouped_df = cov_ret_df.groupby(['cov_Role', 'outcome'], as_index=False) \
                           .agg(mean_dist=('dis_from_ret', 'mean'),
                                count=('dis_from_ret', 'count'))
                                                                                                  
pos_grouped_df = pos_grouped_df.loc[(pos_grouped_df.cov_Role != 'PPLi') & \
                                    (pos_grouped_df.cov_Role != 'PPLo') & \
                                    (pos_grouped_df.cov_Role != 'PC') & \
                                    (pos_grouped_df.cov_Role != 'Go')]

trace1 = go.Bar(
        x=pos_grouped_df.loc[pos_grouped_df.outcome == 'fair_catch']['cov_Role'],
        y=pos_grouped_df.loc[pos_grouped_df.outcome == 'fair_catch']['mean_dist'],
        name='fair catch'
    )

trace2 = go.Bar(
        x=pos_grouped_df.loc[pos_grouped_df.outcome == 'return']['cov_Role'],
        y=pos_grouped_df.loc[pos_grouped_df.outcome == 'return']['mean_dist'],
        name='return'
    )

data = [trace1, trace2]
layout = go.Layout(
    barmode='group',
    xaxis=dict(title='Position'),
    yaxis= dict(title='Avg Distance From Returner (Yds)')
)

fig=go.Figure(data=data, layout=layout)
iplot(fig, filename='grouped-bar')

From the graphs above, it's clear that on average the gunners are closest to the returner.  The table below shows how often a gunner is the closest coverage player to the returner at each second.

In [None]:
min_dist_from_ret = cov_ret_df.groupby(['Season_Year', 'GameKey','PlayID','time_since_punt'],
                                       as_index = False)['dis_from_ret'].min()

#Match on the timestamp as well, so a player only counts as closest at the second the minimum occurs
closest_df = cov_ret_df.merge(min_dist_from_ret,
                              how = 'inner',
                              on=['Season_Year', 'GameKey', 'PlayID', 'time_since_punt', 'dis_from_ret'])


closest_df = closest_df.groupby(['time_since_punt','cov_Role'], as_index=False)\
        .agg(**{'# of times closest': ('dis_from_ret', 'count'),
                'avg distance': ('dis_from_ret', 'mean')})\
        .sort_values(['time_since_punt', '# of times closest'], ascending=[True, False])\
        .reset_index(drop=True)

gunner_times_closest = []
other_times_closest = []

for sec in closest_df.time_since_punt.unique():
    sec_df = closest_df.loc[closest_df.time_since_punt == sec]
    gunner_times_closest.append(sec_df.loc[sec_df.cov_Role.isin(['GR', 'GL'])]['# of times closest'].sum())
    other_times_closest.append(sec_df.loc[~sec_df.cov_Role.isin(['GR', 'GL'])]['# of times closest'].sum())
    
final_closest_df = pd.DataFrame({'time_since_punt': closest_df.time_since_punt.unique(),
            'gunner_closest': gunner_times_closest,
              'other_closest': other_times_closest
             }, columns = ['time_since_punt', 'gunner_closest', 'other_closest'])

final_closest_df['gunner_closest_perc'] = (final_closest_df['gunner_closest'] / \
                                          (final_closest_df['gunner_closest'] + final_closest_df['other_closest']) \
                                            * 100).round(1).astype(str) + '%'

final_closest_df[['time_since_punt','gunner_closest_perc']].set_index('time_since_punt')

#### Quick Conclusions: 
* It's clear gunners are the most impactful players on the coverage team
    * They close the distance to the returner the quickest
    * On average, they are 8 yards closer than any other position
    * The gunners are the closest cover men to the returner 70-90% of the time
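
The "closest cover man" share above can also be sketched with `idxmin`, which picks one closest player per play and timestamp. This is only an illustration on made-up rows: the `frames` table, its values, and the integer timestamps are assumptions, not the competition data (where `time_since_punt` is a timedelta).

```python
import pandas as pd

# Hypothetical tracking rows: one row per coverage player per timestamp
frames = pd.DataFrame({
    'PlayID':          [1, 1, 1, 1, 1, 1],
    'time_since_punt': [1, 1, 1, 2, 2, 2],
    'cov_Role':        ['GL', 'GR', 'PLW', 'GL', 'GR', 'PLW'],
    'dis_from_ret':    [40.0, 38.5, 45.0, 31.0, 33.0, 29.5],
})

# idxmin returns the row label of the smallest distance within each group,
# so each (play, timestamp) pair keeps exactly one "closest" player
closest_idx = frames.groupby(['PlayID', 'time_since_punt'])['dis_from_ret'].idxmin()
closest = frames.loc[closest_idx].copy()

# Flag gunners and compute their share of "closest" rows at each timestamp
closest['is_gunner'] = closest['cov_Role'].isin(['GL', 'GR'])
gunner_share = closest.groupby('time_since_punt')['is_gunner'].mean() * 100
print(gunner_share)
```

Unlike a merge on the minimum distance, `idxmin` keeps only the first player when two players are tied, so the two approaches can differ slightly on ties.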
    
Since it is very rare for the return team to have 0, 1, or 5 gunner blockers, I filter those plays out.

In [None]:
print(round(cov_ret_df.gunner_blockers.value_counts(normalize=True), 2))
cov_ret_df = cov_ret_df.loc[cov_ret_df.gunner_blockers.isin([2, 3, 4])]

Visualize play counts by alignment with a pie graph

In [None]:
#Count unique plays for each number of gunner blockers
alignment_counts = cov_ret_df.groupby('gunner_blockers', as_index=False) \
                             .agg(play_count=('PlayID', 'nunique'))

labels = alignment_counts['gunner_blockers']

values = alignment_counts['play_count']

trace = go.Pie(labels=labels, values=values)

iplot([trace], filename='alignment_pie')

Create a table to show the return rates by the # of gunner blockers

In [None]:
cov_ret_count = cov_ret_df.groupby('gunner_blockers', as_index=False) \
                          .agg(play_count=('PlayID', 'nunique'))

cov_ret_returns = cov_ret_df.loc[cov_ret_df.outcome == 'return'] \
                            .groupby('gunner_blockers', as_index=False) \
                            .agg(return_count=('PlayID', 'nunique'))

cov_ret_merged = cov_ret_count.merge(cov_ret_returns)

cov_ret_merged['return_rate'] = (cov_ret_merged['return_count'] / 
                                 cov_ret_merged['play_count'] * 100).round(1).astype(str) + '%'

#Create similar table but with injury rates
cov_ret_injury = cov_ret_df.loc[cov_ret_df.injury == 1] \
                           .groupby('gunner_blockers', as_index=False) \
                           .agg(injury_count=('PlayID', 'nunique'))

cov_ret_merged = cov_ret_merged.merge(cov_ret_injury)

cov_ret_merged['injury_rate'] = (cov_ret_merged['injury_count'] / 
                                 cov_ret_merged['play_count'] * 100).round(1).astype(str) + '%'

#Merge the two data frames and only show the rates for easier viewing
cov_ret_merged[['gunner_blockers', 'return_rate', 'injury_rate']]

Speed (in meters per second) vs Number of Gunner Blockers at each second

In [None]:
cov_ret_gun = cov_ret_df.loc[(cov_ret_df.cov_Role == 'GR') |
                              (cov_ret_df.cov_Role == 'GL')]

pd.pivot_table(cov_ret_gun, 
               index=['gunner_blockers'], values='cov_speed',
               columns=['time_since_punt'], aggfunc=[np.mean]).round(1)

With a line graph, visualize the change in speed per second, grouped by the number of blockers

In [None]:
cov_ret_gun_grouped = cov_ret_gun.groupby(['gunner_blockers', 'time_since_punt'], as_index=False)['cov_speed'].mean()

trace_lst = []

for gun in cov_ret_gun_grouped.gunner_blockers.unique():
    trace = go.Scatter(
        x = cov_ret_gun_grouped.loc[cov_ret_gun_grouped.gunner_blockers == gun].time_since_punt.dt.total_seconds(),
        y = cov_ret_gun_grouped.loc[cov_ret_gun_grouped.gunner_blockers == gun].cov_speed,
        mode = 'lines',
        name = str(gun)
    )

    trace_lst.append(trace)

data = trace_lst

layout = dict(title = 'Gunner Speed by # of Blockers',
              xaxis = dict(title = 'Seconds from Punt'),
              yaxis = dict(title = 'Avg Speed (Meters per Sec)'),
              hovermode = 'closest')

fig = dict(data=data, layout=layout)

iplot(fig, filename='line-speed')

#### Quick Conclusions: 
* Alignments with 2 gunner blockers have the lowest return rate:
    * 35% lower than with 3 gunner blockers
    * 38% lower than with 4 gunner blockers
* Alignments with 2 gunner blockers also have the lowest injury rate:
    * 40% lower than with 3 gunner blockers
    * 50% lower than with 4 gunner blockers
* Although gunners facing only 2 blockers reach max speed the quickest, in this alignment they are able to "throttle down" their speed by the time they get close to the returner
    * This is a really interesting finding.  Gunners can actually make contact with the returner at a lower speed if they are able to reach higher speeds earlier in the play
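
The "X% lower" figures above are relative differences between two rates, not percentage-point gaps. A minimal sketch of the distinction follows; the helper name and the two example rates are hypothetical and are not taken from the tables in this kernel.

```python
def relative_reduction(rate, baseline_rate):
    """Return how much lower `rate` is than `baseline_rate`, as a % of the baseline."""
    return (baseline_rate - rate) / baseline_rate * 100


# Hypothetical rates, for illustration only
rate_a = 0.30
rate_b = 0.40

relative = relative_reduction(rate_a, rate_b)   # relative difference, in %
point_gap = (rate_b - rate_a) * 100             # absolute gap, in percentage points

print('Relative reduction: {:.1f}%'.format(relative))
print('Percentage-point gap: {:.1f}'.format(point_gap))
```

With these made-up inputs the relative reduction is larger than the percentage-point gap, which is why it matters which of the two a bullet is quoting.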
    
### Proposal #1
When Team A presents a punt formation:

Team B may have at most 1 blocker aligned opposite each of Team A's end men on the line of scrimmage (gunners) at the snap of the ball

#### Show correct vs incorrect alignment    

In [None]:
def visualize_alignment(next_gen_df, game_id_lst, role_df):
    alignment_df = next_gen_df.loc[(next_gen_df.Season_Year == game_id_lst[0]) & \
                        (next_gen_df.GameKey == game_id_lst[1]) & \
                        (next_gen_df.PlayID == game_id_lst[2]) & \
                        (next_gen_df.Event == game_id_lst[3])].sort_values('y').reset_index(drop=True)
    
    #Use the role_df argument rather than the global play_player_role
    align_merged = alignment_df.merge(role_df, how='left', \
                                          on =['Season_Year', 'GameKey', 'PlayID', 'GSISID'])
    
    align_merged['side_of_ball'] = np.where(align_merged.Role.isin(return_roles), 'return',
                                          np.where(align_merged.Role.isin(coverage_roles), 'coverage', ''))
    
    trace1 = go.Scatter(
        x = align_merged.loc[align_merged.side_of_ball == 'return'].x,
        y = align_merged.loc[align_merged.side_of_ball == 'return'].y,
        mode = 'markers',
        marker = dict(color='#013369', size=10),
        name = 'Return'
    )

    trace2 = go.Scatter(
        x = align_merged.loc[align_merged.side_of_ball == 'coverage'].x,
        y = align_merged.loc[align_merged.side_of_ball == 'coverage'].y,
        mode = 'markers',
        marker = dict(color='orange', size=10),
        name = 'Coverage'
    )
    
    #Change alignment of one of the players
    align_merged.at[2,'y'] = 7

    trace3 = go.Scatter(
        x = align_merged.loc[align_merged.side_of_ball == 'return'].x,
        y = align_merged.loc[align_merged.side_of_ball == 'return'].y,
        mode = 'markers',
        marker = dict(color='#013369', size=10),
        name = 'Return'
    )

    trace4 = go.Scatter(
        x = align_merged.loc[align_merged.side_of_ball == 'coverage'].x,
        y = align_merged.loc[align_merged.side_of_ball == 'coverage'].y,
        mode = 'markers',
        marker = dict(color='orange', size=10),
        name = 'Coverage'
    )

    fig = tools.make_subplots(rows=1, cols=2, subplot_titles=('Legal Alignment', 'Illegal Alignment'))

    fig.append_trace(trace1, 1, 1)
    fig.append_trace(trace2, 1, 1)
    fig.append_trace(trace3, 1, 2)
    fig.append_trace(trace4, 1, 2)
    
    fig['layout'].update(showlegend=False)
    
    return iplot(fig, filename='alignments')

visualize_alignment(NGS_df, [2016, 234, 3278, 'ball_snap'], play_player_role)

### Forecasting the benefit of Proposal #1
The data shows that returns are reduced by a conservative 35% when the return team uses 2 gunner blockers rather than 3 or 4.  The calculation below therefore gives a sense of the approximate impact the rule change would have on injuries:

In [None]:
returns_per_season = len(play_info.loc[play_info.outcome=='return'])/2
return_rate1 = .3
return_rate2 = .4
#injury rates taken from earlier EDA
return_inj_rate = .011
non_return_inj_rate = .0002


#Lower Bound
lower_inj = round(((returns_per_season * return_rate1) * return_inj_rate) \
      - ((returns_per_season * return_rate1) * non_return_inj_rate), 1)

print('With a 30% reduction in return rate, there would be a drop of approximately {} injuries'
     .format(int(lower_inj)))

#Upper Bound
upper_inj = round(((returns_per_season * return_rate2) * return_inj_rate) \
      - ((returns_per_season * return_rate2) * non_return_inj_rate), 1)

print('With a 40% reduction in return rate, there would be a drop of approximately {} injuries'
     .format(round(upper_inj)))


#### Quick Conclusions: 
* If this rule were implemented, I forecast that the return rate would fall by 30-40%, which would reduce the injury rate by a similar margin
* This reduction in the return rate would eliminate 400-550 returns per season, which would lead to approximately 4-6 fewer concussions per season
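
The forecast arithmetic can be wrapped in a small helper to make the assumptions explicit. This is a sketch, not the calculation cell itself: the function name is hypothetical, and the 1,350 returns per season is an assumed round number chosen to be consistent with the 400-550 eliminated returns quoted above (the real value comes from `play_info`). The two injury rates are the ones used in the cell above.

```python
def injuries_avoided(returns_per_season, return_reduction,
                     return_inj_rate=0.011, non_return_inj_rate=0.0002):
    """Estimate injuries avoided per season when a share of returns become non-returns."""
    # Number of returns that would no longer happen
    converted_plays = returns_per_season * return_reduction
    # Each converted play swaps the return injury rate for the non-return rate
    return converted_plays * (return_inj_rate - non_return_inj_rate)


# Assumption for illustration only, not read from the data
ASSUMED_RETURNS_PER_SEASON = 1350

for reduction in (0.30, 0.40):
    avoided = injuries_avoided(ASSUMED_RETURNS_PER_SEASON, reduction)
    print('{:.0%} fewer returns -> about {:.1f} fewer injuries per season'
          .format(reduction, avoided))
```

Writing it this way makes it easy to see how sensitive the forecast is to the assumed reduction and to the two injury rates.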

## Continuing EDA for 2nd Proposal

Unfortunately, the dataset provided does not have event data on the football (only by player).  It would be great to know the location of the football at every 10th of a second.  However, the play information can be parsed to get a general sense of where the ball started, stopped, etc.

Explore:
* How far do punts travel?
* At what yard line do punts land?
* How often do outcomes occur?

The function below uses regular expressions to parse the description field of the play_info data and extract features for the bullet points above. I then create 3 histograms (Punt Distance, Punt To yard line, and Hang Time) and print summary statistics.
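
As a warm-up, here is a minimal sketch of the parsing idea on a single string. The play description is made up in the style of the `PlayDescription` field, and the two patterns are simplified stand-ins rather than the exact expressions used in the full function.

```python
import re

# Hypothetical description, written to resemble the PlayDescription field
description = ('(12:30) J.Doe punts 47 yards to NYG 18, Center-A.Smith. '
               'B.Jones to NYG 27 for 9 yards (C.Brown).')

# Simplified patterns (assumptions for illustration)
punt_match = re.search(r'punts (\d+) yards? to ([A-Z]+) (\d+)', description)
return_match = re.search(r'to [A-Z]+ \d+ for (-?\d+) yards?', description)

punt_dist = int(punt_match.group(1))    # distance of the punt in yards
punt_to = int(punt_match.group(3))      # yard line where the punt was fielded
return_dist = int(return_match.group(1)) if return_match else 0

print(punt_dist, punt_to, return_dist)
```

Capture groups make it explicit which number is which, instead of relying on the position of each number inside the matched text.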

In [None]:
def parse_play_description(df, outcome_lst):
    #Subset the df argument (rather than the global play_info) to the requested outcomes
    parsed_df = df.loc[df.outcome.isin(outcome_lst),\
                             ['Season_Year','GameKey','PlayID',\
                              'PlayDescription','outcome']].reset_index(drop=True)
    punt_to_lst = []
    punt_dist_lst = []
    return_dist_lst = []
    punt_regex = '(punts [0-9]+ yards? to [A-Z]* [-+]?[0-9]+)| (punts [0-9]+ yards? to [-+]?[0-9]+)'
    #Fallback that captures only the distance, for descriptions without a landing yard line
    punt_dist_regex = 'punts ([0-9]+) yards?'
    return_regex = '(to [A-Z]* [0-9]+ for [-+]?[0-9]+ yards?)|(to [0-9]+ for [-+]?[0-9]+ yards?)|(ob at [A-Z]* [-+]?[0-9]+ for [-+]?[0-9]+ yards?)|(ob at [0-9]+ for [-+]?[0-9]+ yards?)|(for [-+]?[0-9]+ yards?, TOUCHDOWN)'
    
    for i in range(0, len(parsed_df)):
        punt_search = re.search(punt_regex, parsed_df.PlayDescription[i])
        dist_search = re.search(punt_dist_regex, parsed_df.PlayDescription[i])
        return_search = re.search(return_regex, parsed_df.PlayDescription[i])
    
        #Punt distance comes from the current row only; NaN if it cannot be parsed
        punt_dist_lst.append(int(dist_search.group(1)) if dist_search else np.nan)
    
        #Punt landing spot: touchbacks are recorded as yard line 0
        if parsed_df.outcome[i] == 'touchback':
            punt_to_lst.append(0)
        elif punt_search:
            punt_snip = re.findall(r'-?\d+', punt_search.group(0))
            punt_to_lst.append(int(punt_snip[-1]))
        else:
            #Append NaN so the list stays aligned with parsed_df
            punt_to_lst.append(np.nan)
            print('Missing Punt Outcome at Row {}'.format(i))
        
        if return_search:
            return_snip = re.findall(r'-?\d+', return_search.group(0))
            return_dist_lst.append(int(return_snip[-1]))
        else:
            if parsed_df.outcome[i] == 'touchback':
                return_dist_lst.append(20)
            elif parsed_df.outcome[i] in ['downed','fair_catch', 'oob', 'muff']:
                return_dist_lst.append(0)
            elif 'no gain' in parsed_df.PlayDescription[i]:
                return_dist_lst.append(0)
            else:
                #Append NaN so the list stays aligned with parsed_df
                return_dist_lst.append(np.nan)
                print('Missing Return Outcome at Row {}'.format(i))   
                
    parsed_df['punt_to'] = punt_to_lst
    parsed_df['punt_dist'] = punt_dist_lst
    parsed_df['return_dist'] = return_dist_lst
                                   
    return parsed_df

pdd = parse_play_description(play_info, ['touchback', 'fair_catch','oob', 'downed', 'return', 'muff'])

#Add in hang_time feature from punt_df
pdd = pdd.merge(punt_df[['Season_Year', 'GameKey', 'PlayID', 'hang_time']], 
                how ='left', 
                on=['Season_Year', 'GameKey', 'PlayID'])

In [None]:
trace1 = go.Histogram(
        x=pdd.punt_dist
    )

trace2 = go.Histogram(
        x=pdd.loc[pdd.punt_to >= 0].punt_to
    )

trace3 = go.Histogram(
        x=pdd.hang_time
    )
fig = tools.make_subplots(rows=1, cols=3, subplot_titles=('Punt Distance', 'Punt To', 'Hang Time'))

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 3)

fig['layout'].update(showlegend=False)
fig['layout'].update(xaxis=dict(title='Yards'))
fig['layout'].update(xaxis2=dict(title='Yard Line'))
fig['layout'].update(xaxis3=dict(title='Seconds'))

iplot(fig, filename='punt_hist')

In [None]:
#Print a few summary statistics
punt_to_mean = round(np.mean(pdd.punt_to), 1)
punt_dist_mean = round(np.mean(pdd.punt_dist), 1)
hang_time_mean = round(np.mean(pdd.hang_time), 1)

print('The mean yard line the ball is punted to is {}'.format(punt_to_mean))
print('The mean distance of a punt is {} yards'.format(punt_dist_mean))
print('The mean hang_time is {} seconds'.format(hang_time_mean))

print('Touchbacks occur on ' + str(round(len(pdd.loc[pdd.outcome=='touchback']) \
                                        / len(pdd) * 100, 1)) + '% of plays')

Visualize play counts by outcome with a pie graph

In [None]:
#Count plays for each outcome
outcome_counts = pdd.groupby('outcome', as_index=False).agg(play_count=('PlayID', 'count'))

labels = outcome_counts['outcome']

values = outcome_counts['play_count']

trace = go.Pie(labels=labels, values=values)

iplot([trace], filename='outcome_pie')

#### Quick Conclusions: 
* Punts are kicked to yard line 0 (touchbacks) with the most frequency by far
* However, touchbacks only make up 6.3% of standard punt plays
* Punt distance is approximately normally distributed around a mean of 45 yds


### How do Return Rates change by Punt Metrics?

Explore:
* Categorize the punt description data into bins
* How do these bins correlate with return/injury rates?

Binning the punt data will make it easier to construct conclusions about clusters of the data rather than looking at it as a whole.  I'll then visualize the return rates by each of these bins.
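
A small sketch of how `pd.cut` assigns values to bins may help when reading the bin labels below. The four yard lines are made-up inputs; the bin edges match `punt_to_bins`. With the default right-closed intervals, a value equal to the lowest edge (such as a touchback recorded at yard line 0) falls outside the first bin unless `include_lowest=True` is passed.

```python
import pandas as pd

# Hypothetical landing yard lines, for illustration only
punt_to = pd.Series([0, 5, 10, 45])
bins = [0, 10, 20, 30, 40, 50]

# Default behaviour: intervals are open on the left and closed on the right
binned = pd.cut(punt_to, bins)
print(binned.isna().tolist())               # which values received no bin
print([str(b) for b in binned.dropna()])    # labels of the assigned bins

# include_lowest=True widens the first interval so the lowest edge is kept
with_lowest = pd.cut(punt_to, bins, include_lowest=True)
print(with_lowest.isna().tolist())
```

Values outside every bin become NaN, which is why later cells can filter on `pd.notnull(...)` of a bin column.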

In [None]:
punt_dist_bins = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80]
punt_to_bins = [0, 10, 20, 30, 40, 50]
hang_time_bins = [3, 3.5, 4, 4.5, 5, 5.5]

def bin_punt_data(pdd, col, bin_lst):
    bin_col = col + '_bin'
    pdd[bin_col] = pd.cut(pdd[col], bin_lst)
    bin_count = pdd \
        .groupby(bin_col, as_index=False) \
        .agg(play_count=('PlayID', 'count'))
        
    bin_returns = pdd.loc[pdd.outcome == 'return'] \
        .groupby(bin_col, as_index=False) \
        .agg(return_count=('PlayID', 'count'))
        
    merged_bins = bin_count.merge(bin_returns)
    merged_bins['return_rate'] = (merged_bins['return_count'] / merged_bins['play_count']).round(2)
    merged_bins[bin_col] = merged_bins[bin_col].astype(str)
    
    return merged_bins

punt_dist_bin = bin_punt_data(pdd, 'punt_dist', punt_dist_bins)
punt_to_bin = bin_punt_data(pdd, 'punt_to', punt_to_bins)
hang_time_bin = bin_punt_data(pdd, 'hang_time', hang_time_bins)

In [None]:
trace1 = go.Bar(
        x=punt_dist_bin.loc[punt_dist_bin.return_count > 50]['punt_dist_bin'],
        y=punt_dist_bin.loc[punt_dist_bin.return_count > 50]['return_rate']
    )

trace2 = go.Bar(
        x=punt_to_bin['punt_to_bin'],
        y=punt_to_bin['return_rate']
    )

trace3 = go.Bar(
        x=hang_time_bin['hang_time_bin'],
        y=hang_time_bin['return_rate']
    )

fig = tools.make_subplots(rows=1, cols=3, subplot_titles=('Punt Distance', 'Punt To', 'Hang Time'), shared_yaxes=True)

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 3)

fig['layout'].update(showlegend=False)
fig['layout'].update(yaxis=dict(title='Return Rate', range=[0,1], tickformat=',.0%'))
fig['layout'].update(xaxis=dict(title='Yards'))
fig['layout'].update(xaxis2=dict(title='Yard Line'))
fig['layout'].update(xaxis3=dict(title='Seconds'))

iplot(fig, filename='punt_bars')

Compare the return rates for punts 45 yards or less

In [None]:
dist_under45 = punt_dist_bin.loc[punt_dist_bin.punt_dist_bin.isin(['(0, 5]',
                                                    '(5, 10]',
                                                    '(10, 15]',
                                                    '(15, 20]',
                                                    '(20, 25]',
                                                    '(25, 30]',
                                                    '(30, 35]',
                                                    '(35, 40]',
                                                    '(40, 45]'
                                                   ])]
dist_over45 = punt_dist_bin.loc[punt_dist_bin.punt_dist_bin.isin(['(0, 5]',
                                                    '(5, 10]',
                                                    '(10, 15]',
                                                    '(15, 20]',
                                                    '(20, 25]',
                                                    '(25, 30]',
                                                    '(30, 35]',
                                                    '(35, 40]',
                                                    '(40, 45]'
                                                   ])==False]
print('Return Rate for punts under 45 yards is {}'\
          .format(str(round(dist_under45.return_count.sum() / dist_under45.play_count.sum() * 100, 1))) +'%')

print('Return Rate for punts over 45 yards is {}'\
          .format(str(round(dist_over45.return_count.sum() / dist_over45.play_count.sum() * 100, 1))) +'%')

Compare the return rates for hang times 4.5 seconds or less

In [None]:
hang_under45 = hang_time_bin.loc[hang_time_bin.hang_time_bin.isin(['(3.0, 3.5]',
                                                                   '(3.5, 4.0]',
                                                                   '(4.0, 4.5]',
                                                                  ])]
hang_over45 = hang_time_bin.loc[hang_time_bin.hang_time_bin.isin(['(3.0, 3.5]',
                                                                   '(3.5, 4.0]',
                                                                   '(4.0, 4.5]'
                                                                  ])==False]

print('Return Rate for punts under 4.5 sec hang time is {}'\
          .format(str(round(hang_under45.return_count.sum() / hang_under45.play_count.sum() * 100, 1))) +'%')

print('Return Rate for punts over 4.5 sec hang time is {}'\
          .format(str(round(hang_over45.return_count.sum() / hang_over45.play_count.sum() * 100, 1))) +'%')

Create a violin plot to visualize return distance by hang time bin

In [None]:
#Copy the subset so the column assignment below does not write to a view of pdd
pdd_ret = pdd.loc[(pdd.return_dist <=100) & (pdd.outcome == 'return') & (pd.notnull(pdd.hang_time_bin))].copy()
pdd_ret['hang_time_bin'] = pdd_ret['hang_time_bin'].astype(str)

data = []
for i in range(0,len(pd.unique(pdd_ret['hang_time_bin']))):
    trace = {
            "type": 'violin',
            "x": pdd_ret['hang_time_bin'][pdd_ret['hang_time_bin']\
                                               == pd.unique(pdd_ret['hang_time_bin'])[i]],
            "y": pdd_ret['return_dist'][pdd_ret['hang_time_bin']\
                                               == pd.unique(pdd_ret['hang_time_bin'])[i]],
            "name": pd.unique(pdd_ret['hang_time_bin'])[i],
            "box": {
                "visible": True
            },
            "meanline": {
                "visible": True
            }
        }
    data.append(trace)

        
fig = {
    "data": data,
    "layout" : {
        "title": "",
        "yaxis": {
            "zeroline": False,
        }
    }
}


iplot(fig, filename='violin-hang-time')

#### Quick Conclusions: 
* Return rates are lowest on punts with short distance and high hang time
* Return rates are 59% lower when the punt distance is under 45 yards compared to over 45 yards
* Return rates are 11% lower when the hang time is above 4.5 seconds compared to below 4.5 seconds 
    
Explore:
* On punts that may have been touchbacks (inside the 10-yard line) but ended up getting returned:
    * How far does the returner get?
    * What % of returns get at least to the 20-yard line?
    * What % of returns get at least to the 25-yard line?
    
I segment the data for punts inside the 10-yard line that are returned and then add a column for the yard line the return is taken to.  If punt returners are not getting past the 20-yard line then they should not be returning the ball.  An additional 5 yards would make that threshold even more difficult for a returner to reach.

In [None]:
#Copy the subset so the column assignments below do not write to a view of pdd
inside_ten_ret = pdd.loc[(pdd.punt_to < 10) & (pdd.punt_to > 0) & (pdd.outcome=='return')].copy()

inside_ten_ret[['punt_to', 'return_dist']] = inside_ten_ret[['punt_to', 'return_dist']].astype(int)
inside_ten_ret['return_to'] = (inside_ten_ret['punt_to'] + inside_ten_ret['return_dist'])

print('The median yard line for a return on a punt inside the 10 yard line is the {}-yard line'\
      .format(int(inside_ten_ret.return_to.median())))

print('The % of returns starting from inside the 10 yard line that get at least to the 20-yard line is {}%'\
    .format(round(len(inside_ten_ret.loc[inside_ten_ret.return_to >= 20]) / len(inside_ten_ret) * 100, 1)))

print('The % of returns starting from inside the 10 yard line that get at least to the 25-yard line is {}%'\
    .format(round(len(inside_ten_ret.loc[inside_ten_ret.return_to >= 25]) / len(inside_ten_ret) * 100, 1)))

#### Quick Conclusions: 
* The returner gets to or beyond the current touchback location less than 1/3 of the time, so it is in their best interest to let the ball bounce into the end zone
* That success rate is cut in half when extending the touchback location to the 25-yard line

Explore:
* Injury rates on each of the bins
    
#### Punt Distance Bin Injury Rate

In [None]:
pdd = pdd.merge(video_review[['Season_Year', 'GameKey', 'PlayID', 'GSISID']], 
                how ='left', 
                on = ['Season_Year', 'GameKey', 'PlayID'])
pdd.rename(columns={'GSISID':'injury'}, inplace=True)
pdd['injury'] = np.where(pd.notnull(pdd['injury']), 1, 0).astype(int)

punt_dist_grouped = pdd.groupby(['punt_dist_bin'], as_index=False)\
    .agg(play_count=('injury', 'count'), injury=('injury', 'sum'))
    
punt_dist_grouped['injury_rate'] = (punt_dist_grouped['injury'] \
                                    / punt_dist_grouped['play_count'] * 100).round(1).astype(str) +'%'

punt_dist_grouped.loc[punt_dist_grouped.play_count > 150].reset_index(drop=True)[['punt_dist_bin', 'injury_rate']]

#### Punt To Bin Injury Rate

In [None]:
punt_to_grouped = pdd.groupby(['punt_to_bin'], as_index=False)\
    .agg(play_count=('injury', 'count'), injury=('injury', 'sum'))
punt_to_grouped['injury_rate'] = (punt_to_grouped['injury'] \
                                  / punt_to_grouped['play_count'] * 100).round(1).astype(str) +'%'

punt_to_grouped[['punt_to_bin', 'injury_rate']]

#### Hang Time Bin Injury Rate

In [None]:
hang_time_grouped = pdd.groupby(['hang_time_bin'], as_index=False)\
    .agg(play_count=('injury', 'count'), injury=('injury', 'sum'))
hang_time_grouped['injury_rate'] = (hang_time_grouped['injury'] \
                                  / hang_time_grouped['play_count'] * 100).round(1).astype(str) +'%'

hang_time_grouped[['hang_time_bin', 'injury_rate']]

#### Quick Conclusions: 
* Not surprisingly, injury rates are highest on long punts with low hang times
    * Higher likelihood of a return on these plays

Explore:
* Correlation between:
    * punt_to
    * punt_dist
    * return_dist
    * hang_time
* Fitting a logistic regression

#### Correlation Heat Map

I look at the correlation between the 4 features listed above in a visual heat map

In [None]:
#Subset the columns to be used in the correlation table
corr_df = pdd_ret[['punt_to', 'punt_dist', 'return_dist', 'hang_time']]
corr_df.corr().round(2)

In [None]:
#Create correlation heatmap
plt.rcParams["figure.figsize"] = (10,10)
cmap = ListedColormap(['#000431', '#01065A', '#021CA4', '#0275F4','#3393FF', '#F5FAFF'])

plt.matshow(corr_df.corr(), cmap=cmap)
plt.xticks(range(len(corr_df.columns)), corr_df.columns)
plt.yticks(range(len(corr_df.columns)), corr_df.columns)
plt.colorbar()
plt.show()

#### Quick Conclusions: 
*  The correlation matrix shows that return distance has a slight negative correlation with hang time and a slight positive correlation with punt distance
*  Therefore if a punter is trying to limit return yards it would be in their best interest to kick the ball high and short

#### Logistic Regression

I fit a logistic regression to find the marginal effects of punt to, punt distance, and hang time on returns

In [None]:
import statsmodels.api as sm

log_df = pdd.loc[(pd.notnull(pdd.punt_to)) & (pd.notnull(pdd.hang_time))][['punt_dist','hang_time','punt_to','outcome']]
log_df['return_bool'] = np.where(log_df.outcome=='return', 1, 0)

logit = sm.Logit(log_df['return_bool'], log_df[['punt_dist','hang_time','punt_to']])
result = logit.fit()
result.summary()

In [None]:
params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']
np.exp(conf).round(2)

#### Quick Conclusions: 
* The logistic regression results demonstrate that short punts and high hang times can significantly decrease the chance of a punt being returned
    * For a 1 unit (yard) increase in punt distance, the odds of a return increase by 17%
    * For a 1 unit (second) increase in hang time, the odds of a return decrease by 20%
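
The percentages above come from exponentiating logit coefficients into odds ratios. A minimal sketch of that conversion follows; the helper name is hypothetical, and the two coefficients are back-calculated from the odds ratios implied by the bullets (1.17 and 0.80) rather than read from the fitted model.

```python
import math


def pct_change_in_odds(coef):
    """Convert a logit coefficient into the % change in odds per one-unit increase."""
    return (math.exp(coef) - 1) * 100


# Coefficients implied by the odds ratios quoted above (illustrative, not model output)
coef_punt_dist = math.log(1.17)
coef_hang_time = math.log(0.80)

print('Punt distance: {:+.0f}% odds of a return per yard'
      .format(pct_change_in_odds(coef_punt_dist)))
print('Hang time: {:+.0f}% odds of a return per second'
      .format(pct_change_in_odds(coef_hang_time)))
```

Note that these are changes in odds, not in probability; the change in return probability depends on the baseline return rate.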

## Proposal #2
If the result of the punt play is a touchback:

The dead-ball spot for the following snap will be from the 25-yard line

#### Intuitive Reasoning:
* Fearing the extra 5-yard penalty, punters will be even more careful to avoid a touchback.  They will punt the ball shorter and with more hang time

* From a returner's perspective, if the ball is heading close to the goal line, they will be more likely to let it bounce and take the touchback than return it because of the extra 5-yard bonus

* Both of these motivations working simultaneously will lead to fewer returns and thus fewer injuries

#### Unintended Consequences
* I don't see a downside to this rule change as it specifically relates to injuries on punts
* However, it could have an impact on other aspects of the game.  For instance, it clearly helps offenses by giving them an extra 5 yards
* Coaches may decide to go for it more often on 4th down, knowing that a possible touchback would bring the ball out to the 25-yard line
    * This would actually be a benefit, as fewer punts means fewer injuries on punts
* The graph below shows that there was already a huge up-tick in 4th down attempt rates in 2018

In [None]:
#*Source Pro Football Reference
years = list(range(2000, 2019, 1))
attempt_rate = [.121, .123, .131, .128, .121, .120, .123, .142, .132, .145, 
                .126, .112, .117, .12, .12, .126, .128, .125, .149]

trace0 = go.Scatter(
    x=years,
    y=attempt_rate,
    line=dict(color='green')
)

layout = dict(title = '4th Down Attempt Rate',
              xaxis = dict(title = 'Year'),
              yaxis = dict(title = 'Attempt Rate', tickformat=',.1%'),
              hovermode = 'closest')

fig = dict(data=[trace0], layout=layout)
             
iplot(fig, filename='attempt_rates')

### Conclusions & Proposals
In summary, my two proposed rule changes (and evidence) are:
1. **Gunner Blockers** - When Team A presents a punt formation, Team B may have at most 1 blocker aligned opposite each of Team A's end men on the line of scrimmage (gunners) at the snap of the ball. 
    - Return alignments with only 2 gunner blockers allow for better coverage from the gunners
    - Better coverage leads to an increase in fair catch signals and fewer returns
    - Fewer returns will decrease injuries and thus make punts safer

2. **Touchbacks** - If the result of the punt play is a touchback, the dead-ball spot for the following snap will be from the 25-yard line.
    - Moving touchbacks on punts to the 25-yard line will incentivize:
        - Returners to let the football bounce into the end zone
        - Kickers to keep their punts high and short
    - Both of these motivations will lead to fewer returns and fewer injuries

### Closing Thoughts
I really enjoyed working on this project and participating in the NFL Punt Analytics Competition.  As a huge fan of the NFL, the project has definitely given me a new perspective while watching games, specifically punt plays.  

I explored a few other ideas for rule changes, such as adding a bonus amount of yards for a fair catch, or allowing all players on the punt team to move past 1 yard from the line of scrimmage as soon as the ball is snapped.  But I ultimately decided that these rules would be a drastic shift in the way the game is played and that there are significant unknowns that could prove to be detrimental.   

I believe both of my proposed rule changes are moderate and have little downside, but will be able to make a significant impact in making the NFL a safer league.

Thanks to any and all for reading this far.  I look forward to feedback.