# NFL Punt Analytics Competition

This notebook documents my submission to a Kaggle competition aimed at reducing the rate of concussions during punt plays.

- Author: Karl Pazdernik
- Date: 2018/01/09

In [None]:
%matplotlib inline

In [None]:
## Load necessary packages ##

import os
import seaborn as sns
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import classification_report, confusion_matrix
from scipy.spatial.distance import pdist, squareform
from scipy.stats import chisquare
from tqdm.autonotebook import tqdm
from matplotlib import pyplot as plt  # pyplot, not the top-level matplotlib package
from matplotlib import cm
from glob import glob
from datetime import timedelta
from shapely.geometry import LineString, Point
from toolz.itertoolz import sliding_window
from itertools import combinations

sns.set_style('darkgrid')

In [None]:
# convert per-frame distance (yards per 0.1 s sample) to miles per hour
dis2mph = 10 * 3600 / 1760 
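
As a sanity check on this constant, here is a minimal sketch of the unit conversion. It assumes that the tracking data reports distance in yards covered per 0.1-second frame; the helper name `frame_distance_to_mph` is hypothetical and is not used elsewhere in the notebook.

```python
YARDS_PER_MILE = 1760
FRAMES_PER_SECOND = 10   # assumption: one tracking sample every 0.1 s
SECONDS_PER_HOUR = 3600

# Same value as the dis2mph constant defined above.
dis2mph = FRAMES_PER_SECOND * SECONDS_PER_HOUR / YARDS_PER_MILE


def frame_distance_to_mph(dis_yards):
    """Convert yards covered in one 0.1 s frame to miles per hour."""
    return dis_yards * dis2mph


# A player covering 1 yard per frame is moving 10 yards per second.
print(round(frame_distance_to_mph(1.0), 2))  # prints 20.45
```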

In [None]:
## Read in auxiliary data ##

review = pd.read_csv('../input/video_review.csv')
players = pd.read_csv('../input/player_punt_data.csv').drop_duplicates('GSISID').set_index('GSISID')['Position']
players_all = pd.read_csv('../input/player_punt_data.csv')
roles = pd.read_csv('../input/play_player_role_data.csv')

To create more interesting slices of the data, we grouped the punt and punt return formations into the following categories:
- Punter (P)
- Punt Returner (PR)
- PFB
- Gunners (G)
- Corners (V)
- Players along or near the line (Line)

In [None]:
## Group the roles ##

roles['RoleGroup'] = ''
roles.loc[roles.Role == 'P','RoleGroup'] = 'P'
roles.loc[roles.Role == 'PR','RoleGroup'] = 'PR'
roles.loc[roles.Role == 'PFB','RoleGroup'] = 'PFB'
roles.loc[roles.Role.str[0] == 'G','RoleGroup'] = 'G'
roles.loc[roles.Role.str[0] == 'V','RoleGroup'] = 'V'
roles.loc[roles.RoleGroup == '','RoleGroup'] = 'Line'

roles.head()

## Data Exploration

There are 37 concussion plays provided with a variety of auxiliary information. First we check to see if there are any obvious factors that correlate with concussion.

The major finding is that concussions are predominantly caused by collisions between players on opposing teams that involve the helmet, i.e. they are not simply caused by hitting the ground.

Note that although none of the injuries are listed as being related to a turnover, that does not mean they are unrelated to turnover-worthy plays. For example, the concussion in game 274 play 3609 was caused by the punter running on a fake punt, losing control of the ball while untouched, and then being hit high to ensure that the ball would remain loose. According to the play-by-play information, Seattle recovered the ball, so it wasn't officially listed as a turnover, but the play unfolded just as it would have if Seattle had lost the ball.

In [None]:
review.Player_Activity_Derived.value_counts().plot.bar()
plt.show()

In [None]:
review.Friendly_Fire.value_counts().plot.bar()
plt.show()

In [None]:
review.Turnover_Related.value_counts().plot.bar()
plt.show()

In [None]:
review.Primary_Impact_Type.value_counts().plot.bar()
plt.show()

In [None]:
review.Primary_Partner_Activity_Derived.value_counts().plot.bar()
plt.show()

In [None]:
## Add player information to the review data ##

review = pd.read_csv('../input/video_review.csv')
review = review.merge(players.reset_index(), on='GSISID', how='inner')
review = review.merge(roles, on=['Season_Year','GameKey','PlayID','GSISID'], how='inner')
review.head()

Looking at player roles within the formation, no trend is immediately obvious other than that punt returners are the most likely to be concussed.

In [None]:
review.Role.value_counts().plot.bar()
plt.figure()
review.RoleGroup.value_counts().plot.bar()
plt.show()

To determine if any position group is more likely to be concussed, we can build a data frame with both observed and expected concussions for each player group. Using the "roles" data, we can obtain the average number of players within each position group across all plays. If we assume that each player is equally likely to be concussed on a play, we can test the hypothesis that the observed concussion counts match our expectation.

A formal chi-squared goodness-of-fit test suggests that even in this small dataset, there is moderate statistical evidence (p-value close to 0.05) that the observed concussion counts for each position group do not match expectation. In particular, we see that the punt returner is a particularly risky position, the gunner is somewhat risky, and the corners tend to be fairly safe.

In [None]:
rcounts = pd.DataFrame(roles.groupby(['Season_Year','GameKey','PlayID','RoleGroup'])['Role'].count()).reset_index()
rave = pd.DataFrame(rcounts.groupby(['RoleGroup'])['Role'].mean()).reset_index()
rave['Expected'] = rave.Role/22*review.shape[0]

EC = pd.DataFrame(review.RoleGroup.value_counts()).reset_index()
EC.columns = ['RoleGroup','Observed']

rave = rave.merge(EC, on='RoleGroup', how='inner')

Xstat, Xp = chisquare(rave.Observed, f_exp=rave.Expected)
print('Goodness of fit p-value = %.3f' % Xp)

rave
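
To make the test statistic concrete, here is a minimal sketch of the same goodness-of-fit calculation in plain Python. The counts are hypothetical and chosen only for illustration; they are not the notebook's results. The helper names `rescale_expected` and `chi_square_statistic` are also hypothetical. The sketch rescales the expected counts so that they sum to the observed total, which I believe recent SciPy releases require of `chisquare` (this may not hold for older versions).

```python
def rescale_expected(weights, observed_total):
    """Scale group weights so the expected counts sum to the observed total."""
    total = sum(weights)
    return [w * observed_total / total for w in weights]


def chi_square_statistic(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E over the groups."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical values for illustration only.
observed = [10, 5, 20, 2]             # concussions per role group
avg_players = [2.0, 1.0, 16.0, 3.0]   # average players per role group

expected = rescale_expected(avg_players, sum(observed))
stat = chi_square_statistic(observed, expected)
dof = len(observed) - 1               # degrees of freedom for the test
print(f'chi-squared = {stat:.2f} on {dof} degrees of freedom')
```

The p-value then comes from comparing the statistic with a chi-squared distribution on `dof` degrees of freedom, which is what `scipy.stats.chisquare` reports.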

## Data Augmentation

Since the descriptions for the incidents are not particularly informative, I reviewed each video individually to assess whether I could determine the primary reason for the concussion. My hand labels are provided below. When I could not determine where or how the concussion occurred, the entry was left blank.

Some basic definitions:
- Collision = standard hit, nothing unusual observed
- Engaged = the would-be tackler is being blocked by one opponent when another opponent delivers an extra hit
- Friendly Fire = hit by a teammate, likely difficult to avoid
- Pile up = another tackler adding to the tackle once the returner is/appears to be down or going down
- Poor blocking/tackling = technique that would endanger the blocker/blocked/tackler/tackled, especially hits to the head
- Pushed from behind/side = the player is pushed into a concussion by the opponent, either from behind (illegal) or from the side

In [None]:
review['Possible_Cause'] = ['pushed from behind','engaged','poor tackling','poor tackling','pushed from side','poor blocking',\
                           'poor blocking','engaged friendly fire','pile up','','','collision',\
                           'poor tackling','illegal hit','turnover','poor blocking','poor tackling','engaged',\
                           'friendly fire','poor tackling','poor blocking','poor blocking','friendly cross fire','collision',\
                           'poor tackling','collision','poor tackling','poor blocking','friendly fire','collision',\
                            'poor blocking','poor blocking','poor tackling','poor blocking','collision','poor blocking',\
                           'poor tackling']
review['Blindsided'] = ['No','Yes','No','No','Yes','No',\
                       'No','Yes','No','','','Yes',\
                       'No','Yes','No','No','No','Yes',\
                       'No','No','No','Yes','No','Yes',\
                       'No','No','No','Yes','No','No',\
                        'Yes','No','No','Yes','No','No',\
                       'No']

## Note: Mislabeling exists on the player jersey number and friendly fire was missed when a player was pushed into another

review[['GameKey', 'PlayID', 'Possible_Cause', 'Blindsided']]
#review

In [None]:
## Summary table of hand curated labels

pd.crosstab(review.Possible_Cause, review.Blindsided, margins=True)

### Observation 1

Although the sample size is limited, especially since 2 video clips are inconclusive, the most common cause is poor tackling or blocking (19/35). Punt and punt return units consist largely of reserves, who usually have less experience and are trying to make "splash plays" to get noticed. Continuing to promote "heads up" tackling is likely the best strategy to minimize these plays, perhaps along with a re-emphasis on avoiding unnecessary roughness (Rule 12, Section 2, Article 6i).


### Rule Change 1

However, in terms of a rule change, one option is to better position officials to notice and call unnecessary roughness. On many of the plays observed, no yellow flag was thrown for unnecessary roughness. While this may be due to an official's unwillingness to throw a flag on a questionable hit, I also noticed that there may not be adequate attention paid to blocking during a punt. According to the "official responsibilities", as outlined by NFL operations, only the Field Judge is responsible for enforcing rules on blocking during a punt. Of the seven officials on the field at any time, only three will be near the returner, where the most vicious hits occur. The Down Judge and Line Judge, however, are only responsible for the line of scrimmage. After players begin to move down the field, these judges could run alongside and look for blocking infractions, which would include unnecessary roughness. This would better enforce the rules without needing to add extra officials.


### Observation 2

Another highly likely cause for concussion is being blindsided (11/35). The remaining causes vary but tend to be the result of routine collisions or unavoidable contact. Players are taught to "keep their head on a swivel", but not everyone is adept at this.

Since a concussion is caused by a blow to the head, in extreme cases causing the brain to be jolted within the skull, it seems that the best way to prevent concussions is to reduce the likelihood of these violent hits. What makes a hit violent is the amount of force exerted, so following this logic, we should consider the speed, direction, and mass of the players involved in collisions.


## Add Player Weight by Position

Without building a complicated physics model that quantifies the transfer of force between two moving players, we can at least assess whether or not the weight of a player has an impact. While the player weight isn't immediately available in this data, we can use a rudimentary form of imputation as a sanity check.

The average weight for each position was collected from https://public.tableau.com/en-us/s/gallery/height-and-weight-nfl. It is from 2015, but I am assuming that these averages haven't changed significantly since then. Note: OLB is taken as the average of DE and LB, since a 3-4 OLB tends to be larger than a typical ILB or MLB.

In [None]:
pos = ['DE', 'DT', 'NT', 'LB', 'ILB', 'OLB', 'MLB', 'CB', 'S', 'FS', 'SS', \
       'QB', 'RB', 'FB', 'WR', 'TE', 'OL', \
       'K', 'P', 'LS']
wgt = [283.1, 312.8, 312.8, 246.0, 246.0, (283.1+246.0)/2, 246.0, 200.2, 200.2, 200.2, 200.2, \
       224.1, 220.2, 220.2, 222.4, 222.4, 314.0, \
       202.3, 213.2, 245.3]

suppl = pd.DataFrame(pos)
suppl.columns = ['position']
suppl['weight'] = wgt

suppl

## Angle and Speed at Collisions

The following code is designed to extract the angle and speed of colliding players at the time of collisions using the RFID data.

In [None]:
def viz_play(iplay, gamekey, playid):
    prole = roles.query('GameKey == @gamekey and PlayID == @playid')
    long_snapper = prole.query('Role == "PLS"').GSISID.iloc[0]
    punter = prole.query('Role == "P"').GSISID.iloc[0]
    iplay = iplay.query('GSISID in @prole.GSISID')
    assert len(iplay.GSISID.unique()) <= 22
    iplay.Time = pd.to_datetime(iplay.Time)
    iplay = iplay.sort_values('Time')

    scrimmage = iplay.query('GSISID == @long_snapper and Event == "ball_snap"').x.iloc[0]

    flip = 1
    punter = iplay.query('GSISID == @punter and Event == "ball_snap"').x.iloc[0]
    if punter > scrimmage:
        flip = -1

    rplay = review.query('GameKey == @gamekey and PlayID == @playid')

    snap = iplay.query('Event == "ball_snap"').Time.iloc[0]
    end = iplay.query('Event in ["tackle", "punt_downed", "out_of_bounds", "fair_catch", "touchdown"]').Time.iloc[-1] + timedelta(seconds=1.5)

    iplay = iplay[iplay.Time.between(snap, end)]

    for x, player in iplay.groupby('GSISID'):
        viridis = cm.viridis(np.linspace(0, 1, player.shape[0]))
        plasma = cm.plasma(np.linspace(0, 1, player.shape[0]))
        cividis = cm.cividis(np.linspace(0, 1, player.shape[0]))
        inferno = cm.inferno(np.linspace(0, 1, player.shape[0]))
        colors = cividis
        alpha = 0.1
        zorder = 0
        if int(rplay.iloc[0].GSISID) == x:
            colors = viridis
            alpha = .6
            zorder = 2
        if (rplay.Primary_Partner_GSISID.notnull().iloc[0]
            and rplay.Primary_Partner_GSISID.iloc[0] != 'Unclear' 
            and int(rplay.Primary_Partner_GSISID.iloc[0]) == x):
            colors = plasma
            alpha = .6
            zorder = 1
        plt.scatter(-flip * player.y + (flip > 0) * 53.3, flip * (player.x - scrimmage), c=colors, alpha=alpha, zorder=zorder)
    plt.xlim(-3, 55)
    plt.title(f'gamekey {gamekey} playid {playid}')
    plt.tight_layout()

In [None]:
def get_collisions(play, gamekey, playid, start, end, prole):
    total = (end - start).total_seconds() / 0.1 - 1
    data = []
    blocks = []
    timepoints = 2
    prev = set()
    for ts in tqdm(sliding_window(timepoints, play[play.Time.between(start, end)].Time.unique()), total=total, leave=False):
        new_prev = set()
        rev = review.query('GameKey == @gamekey and PlayID == @playid')
        try:
            injpair = set([int(rev.GSISID), int(rev.Primary_Partner_GSISID)])
        except ValueError:
            injpair = -1
        iframe = play.query('Time == @ts[1]').sort_values('GSISID')
        gsis_s = iframe.GSISID.values.astype(int)
        pairs = squareform(pdist(iframe[['x', 'y']])) < 5
        for i, j in zip(*pairs.nonzero()):
            if i >= j:
                continue
            gsis1 = gsis_s[i]
            gsis2 = gsis_s[j]
            assert gsis1 < gsis2

            locs1 = play.query('Time in @ts and GSISID == @gsis1')
            if locs1.shape[0] != timepoints:
                continue
            locs2 = play.query('Time in @ts and GSISID == @gsis2')
            if locs2.shape[0] != timepoints:
                continue
            x1 = locs1[['x', 'y']].values
            nx1 = locs1[['nx', 'ny']].values
            # sometimes players slow down so need 2x diff e.g game 448 play 2792
            #ls1 = LineString(np.append(x1[-1], np.diff(x1, axis=0) + x1[-1, :], 0))
            ls1 = LineString(np.c_[x1[-1], 2*np.diff(x1, axis=0).ravel() + x1[-1, :]].T)
            x2 = locs2[['x', 'y']].values
            nx2 = locs2[['nx', 'ny']].values
            #ls2 = LineString(np.append(x2[-1], np.diff(x2, axis=0) + x2[-1, :], 0))
            ls2 = LineString(np.c_[x2[-1], 2*np.diff(x2, axis=0).ravel() + x2[-1, :]].T)
            if ((Point(x1[0]).distance(Point(x2[0])) < 1.5)
                and ((np.abs(np.diff(x1, axis=0) - np.diff(x2, axis=0))).sum() < 0.25)):
                spd1 = locs1.iloc[0].dis * dis2mph
                spd2 = locs2.iloc[0].dis * dis2mph
                ang1 = np.rad2deg(np.arctan2(*locs1.iloc[:2][['x', 'y']].diff().iloc[1].values))
                ang2 = np.rad2deg(np.arctan2(*locs2.iloc[:2][['x', 'y']].diff().iloc[1].values))
                x = ang1 - ang2
                diffa = min(x % 360, abs((x % 360) - 360) % 360)
                pos1 = players.loc[gsis1]
                pos2 = players.loc[gsis2]
                    
                blocks.append({
                    'gamekey': gamekey,
                    'playid': playid,
                    'x': np.mean([x2[-1, 0], x1[-1, 0]]),
                    'y': np.mean([x2[-1, 1], x1[-1, 1]]),
                    'nx': np.mean([nx2[-1, 0], nx1[-1, 0]]),
                    'ny': np.mean([nx2[-1, 1], nx1[-1, 1]]),
                    'gsis1': gsis1,
                    'role1': prole.loc[gsis1, 'Role'],
                    'position1': pos1,
                    'gsis2': gsis2,
                    'role2': prole.loc[gsis2, 'Role'],
                    'position2': pos2,
                    'spd1': spd1,
                    'spd2': spd2,
                    'time': pd.to_datetime(ts[1]),
                    'angle1': ang1,
                    'angle2': ang2,
                    'angle_diff': diffa,
                })
            if ls1.distance(ls2) < 0.5 and Point(x1[0]).distance(Point(x2[0])) > 1.0:
                new_prev.add((gsis1, gsis2))
                if (gsis1, gsis2) not in prev:
                
                    spd1 = locs1.iloc[0].dis * dis2mph
                    spd2 = locs2.iloc[0].dis * dis2mph
                    ang1 = np.rad2deg(np.arctan2(*locs1.iloc[:2][['x', 'y']].diff().iloc[1].values))
                    ang2 = np.rad2deg(np.arctan2(*locs2.iloc[:2][['x', 'y']].diff().iloc[1].values))
                    x = ang1 - ang2
                    diffa = min(x % 360, abs((x % 360) - 360) % 360)
                    #ix = ls1.intersection(ls2)

                    pos1 = players.loc[gsis1]
                    pos2 = players.loc[gsis2]

    #                 if set([gsis1, gsis2]) == injpair:
    #                     print(list(ls1.coords()))

                    data.append({
                        'gamekey': gamekey,
                        'playid': playid,
                        'x': np.mean([x2[-1, 0], x1[-1, 0]]),
                        'y': np.mean([x2[-1, 1], x1[-1, 1]]),
                        'nx': np.mean([nx2[-1, 0], nx1[-1, 0]]),
                        'ny': np.mean([nx2[-1, 1], nx1[-1, 1]]),
                        'gsis1': gsis1,
                        'role1': prole.loc[gsis1, 'Role'],
                        'position1': pos1,
                        'gsis2': gsis2,
                        'role2': prole.loc[gsis2, 'Role'],
                        'position2': pos2,
                        'spd1': spd1,
                        'spd2': spd2,
                        'time': pd.to_datetime(ts[1]),
                        'angle1': ang1,
                        'angle2': ang2,
                        'angle_diff': diffa,
                        'injury': set([gsis1, gsis2]) == injpair
                    })
        prev = new_prev
    data = pd.DataFrame(data)
    blocks = pd.DataFrame(blocks)
    return data, blocks
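
The angle arithmetic inside `get_collisions` is easy to misread, so here is a minimal standalone sketch of it. The helper names `heading_deg` and `angle_diff` are hypothetical; the logic is meant to mirror the `ang1`, `ang2`, and `diffa` lines above, where the heading is computed with the arguments in (dx, dy) order and the difference is folded into the range 0 to 180 degrees.

```python
import math


def heading_deg(dx, dy):
    """Heading in degrees, using the same (dx, dy) argument order as the notebook."""
    return math.degrees(math.atan2(dx, dy))


def angle_diff(a1, a2):
    """Smallest absolute difference between two headings, in [0, 180] degrees."""
    d = (a1 - a2) % 360
    return min(d, 360 - d)


# Headings of 170 and -170 degrees are only 20 degrees apart, not 340.
print(angle_diff(170, -170))  # prints 20

# Players moving in exactly opposite directions differ by 180 degrees.
print(angle_diff(heading_deg(0, 1), heading_deg(0, -1)))  # prints 180.0
```

Under this convention, an angle difference near 0 means the two players were moving in roughly the same direction, and a value near 180 means a head-on approach.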

In [None]:
## Visualize motion ##

plays = pd.read_csv('../input/NGS-2016-reg-wk13-17.csv')
gamekey, playid = 281, 1526
viz_play(plays.query('GameKey == @gamekey and PlayID == @playid'), gamekey, playid)

In [None]:
all_blocks = []
all_collisions = []
for ngs in tqdm(glob('../input/NGS*.csv')):
    plays = pd.read_csv(ngs)
    for (gamekey, playid), play in tqdm(plays.merge(review[['GameKey', 'PlayID']]).groupby(['GameKey', 'PlayID']), leave=False):
        if play.empty:
            continue
        play.Time = pd.to_datetime(play.Time)
        
        
        prole = roles.query('GameKey == @gamekey and PlayID == @playid')
        long_snapper = prole.query('Role == "PLS"').GSISID.iloc[0]
        punter = prole.query('Role == "P"').GSISID.iloc[0]
        scrimmage = play.query('GSISID == @long_snapper and Event == "ball_snap"').x.iloc[0]

        flip = 1
        punter = play.query('GSISID == @punter and Event == "ball_snap"').x.iloc[0]
        if punter > scrimmage:
            flip = -1
            
        play['nx'] = -flip * play.y + (flip > 0) * 53.3
        play['ny'] = flip * (play.x - scrimmage)
            
        start = play.query('Event == "ball_snap"').Time.iloc[0]
        end = play.query('Event in ["tackle", "punt_downed", "out_of_bounds", "fair_catch", "touchdown"]').Time.iloc[-1] + timedelta(seconds=1.5)

        prole = roles.query('GameKey == @gamekey and PlayID == @playid')[['GSISID', 'Role']]
        play = play.merge(prole, on=('GSISID'))
        prole = prole.set_index('GSISID')
        #play = play.sort_values(by=('Time', 'GSISID'))
        play = play.sort_values(['Time', 'GSISID'])

        a, b = get_collisions(play, gamekey, playid, start, end, prole)
        all_collisions.append(a)
        all_blocks.append(b)
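
The `nx` and `ny` columns computed in the loop above put every play in a common frame, so that "down field" always points the same way regardless of which end zone the punting team faces. Here is a minimal sketch of that transformation on single points. The function name `normalize` and the sample coordinates are hypothetical; the formula itself is copied from the loop.

```python
FIELD_WIDTH = 53.3  # yards, the constant used in the notebook


def normalize(x, y, scrimmage_x, punter_x):
    """Return (nx, ny): position across the field and distance down field."""
    # If the punter lines up at a larger x than the line of scrimmage,
    # the punt travels toward decreasing x, so the axes are flipped.
    flip = -1 if punter_x > scrimmage_x else 1
    nx = -flip * y + (flip > 0) * FIELD_WIDTH
    ny = flip * (x - scrimmage_x)
    return nx, ny


# Punting toward increasing x: a player 10 yards past the line of scrimmage.
print(normalize(60, 10, scrimmage_x=50, punter_x=36))

# Punting toward decreasing x: the mirrored situation is also 10 yards down field.
print(normalize(40, 10, scrimmage_x=50, punter_x=64))
```

In both cases `ny` is positive on the receiving team's side of the line of scrimmage, which is what makes the location plots comparable across plays.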

In [None]:
collisions = pd.concat(all_collisions, ignore_index=True)
#blocks = pd.concat(all_blocks, ignore_index=True)

In [None]:
## Data Engineering ##
collisions['sumspd'] = collisions.spd1 + collisions.spd2
collisions['maxspd'] = collisions[['spd1','spd2']].max(axis=1)
collisions['position'] = collisions.position1 + '-' + collisions.position2
collisions['role'] = collisions.role1 + '-' + collisions.role2

## Add auxiliary information ##
punt = collisions.merge(review, left_on=['gamekey','playid'], right_on=['GameKey','PlayID'], how='inner')

Note that not all concussion hits could be detected using the heuristic algorithm developed above. However, I was able to locate 31/37 such hits.

In [None]:
print(punt.shape)
print('%i concussion collisions detected' %sum(punt.injury))
punt.head()

In [None]:
punt_wgt = punt.merge(suppl, left_on='position1', right_on='position', how='outer')
punt_wgt = punt_wgt.merge(suppl, left_on='position2', right_on='position', how='outer')
punt_wgt['weight_diff'] = abs(punt_wgt.weight_y - punt_wgt.weight_x)

punt_wgt = punt_wgt.dropna(subset=['angle1', 'angle2'])
punt_wgt = punt_wgt.reset_index(drop=True)

print(punt_wgt.shape)
punt_wgt.tail()

Whether comparing angle to the maximum speed or the combined speed of the colliding players, there appears to be some evidence that the combination of high speeds and an angle between perpendicular and head-on leads to an increase in concussion risk.

In [None]:
sns.scatterplot(x='angle_diff', y='maxspd', hue='injury', style='injury', data=punt, s=100)
plt.ylabel('Maximum Speed')
plt.xlabel('Angle Difference')
plt.show()

In [None]:
sns.scatterplot(x='angle_diff', y='sumspd', hue='injury', style='injury', data=punt, s=100)
plt.ylabel('Combined Speed')
plt.xlabel('Angle Difference')
plt.show()

The plot below includes the size of the point scaled by the estimated difference in weight between the two players. There does not appear to be any obvious trend with respect to the weight differential between colliding players, so we will not include this factor in any additional modeling.

In [None]:
sns.scatterplot(data=punt_wgt, x='angle_diff', y='sumspd', size='weight_diff', style='injury', hue='injury')
plt.xlabel('Angle Difference')
plt.ylabel('Combined Speed')
plt.show()

## Logistic Regression

To test this theory, I fit a logistic regression model to the data with the sum of the players' speed, the difference in the angle of their trajectories during collision, and the interaction of these two factors. The maximum speed was also tested, but the sum of the speeds was more predictive of concussion.

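Five-fold cross-validation (via LogisticRegressionCV) was used to select the regularization strength, with a fixed random_state so that the results are reproducible. Given the imbalance in the data, summary metrics such as F1-score are misleading. For the purposes of this exercise, we want to increase the recall as much as possible while maintaining a fairly high precision (above 0.95). The predictions below use a probability threshold of alpha = 0.1 rather than the default of 0.5.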
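
For reference, the model described above can be written out explicitly. This is a sketch of the standard logistic regression form with an interaction term; the symbols $a$ (angle difference), $s$ (combined speed), and $\beta_0, \dots, \beta_3$ are my own notation and do not appear in the code.

$$
P(\text{concussion} \mid a, s) = \frac{1}{1 + \exp\left\{-(\beta_0 + \beta_1 a + \beta_2 s + \beta_3 a s)\right\}}
$$

A collision is flagged when this probability exceeds the threshold $\alpha = 0.1$, which is equivalent to

$$
\beta_0 + \beta_1 a + \beta_2 s + \beta_3 a s > \log\frac{\alpha}{1 - \alpha} = \log\frac{1}{9} \approx -2.20 .
$$

Because of the interaction term $\beta_3 a s$, the boundary between the flagged and unflagged regions is a curve in the angle-speed plane rather than a straight line.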

In [None]:
X = np.vstack((punt.angle_diff, punt.sumspd, punt.angle_diff*punt.sumspd)).T
y = punt.injury
alpha = 0.1

print(X.shape, y.shape)

In [None]:
clf_cv = LogisticRegressionCV(cv=5, random_state=0).fit(X, y)


#inj_clf_cv = clf_cv.predict(X)
inj_clf_cv = clf_cv.predict_proba(X)[:,1] > alpha


print(classification_report(y,inj_clf_cv))
print(confusion_matrix(y,inj_clf_cv))
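
To illustrate why the threshold matters, here is a minimal sketch in plain Python of how precision and recall for the injury class change with the cutoff. The labels and probabilities are hypothetical and are not output from the model above; the function names are also hypothetical.

```python
def confusion_at_threshold(y_true, probs, alpha):
    """Count (tp, fp, fn, tn) when a case is flagged if its probability > alpha."""
    tp = fp = fn = tn = 0
    for truth, p in zip(y_true, probs):
        flagged = p > alpha
        if flagged and truth:
            tp += 1
        elif flagged and not truth:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(y_true, probs, alpha):
    """Precision and recall of the positive (injury) class at a given threshold."""
    tp, fp, fn, _ = confusion_at_threshold(y_true, probs, alpha)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
probs = [0.6, 0.05, 0.2, 0.15, 0.02, 0.08, 0.3, 0.04]

for alpha in (0.1, 0.5):
    precision, recall = precision_recall(y_true, probs, alpha)
    print(f'alpha={alpha}: precision={precision:.2f}, recall={recall:.2f}')
```

In this toy example, lowering the threshold from 0.5 to 0.1 raises recall for the injury class from 0.33 to 0.67 while precision drops from 1.00 to 0.50, which is the trade-off that the choice of alpha controls.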

To visualize this pattern, I plotted the regions of higher concussion risk in blue and lower concussion risk in red. As expected, the combination of high speeds and a large angle differential leads to higher rates of concussion.

In [None]:
# Parameters
n_classes = 2
plot_colors = "ryb"
plot_step = 0.02

plt.figure(figsize=(6,5))

x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step), np.arange(y_min, y_max, plot_step))


# Classify each grid point with the same probability threshold used above
Z = clf_cv.predict_proba(np.c_[xx.ravel(), yy.ravel(), xx.ravel()*yy.ravel()])[:,1] > alpha
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu)

# Plot the training points
for i, color in zip(range(n_classes), plot_colors):
    idx = np.where(y == i)
    plt.scatter(X[idx, 0], X[idx, 1], c=color, cmap=plt.cm.RdYlBu, edgecolor='black', s=15)

plt.grid(True)
plt.ylabel("Combined Speed")
plt.xlabel("Angle Difference")
plt.title('Concussion Probability Regions')
plt.show()

More important is likely the distinction between illegal hits and unavoidable contact. So, we removed the hits that could have been prevented by better technique or that were simply illegal.

By removing these preventable concussions, we see that the remaining injuries are most likely due to the interaction between speed and angle of collision.

In [None]:
# Causes judged preventable through better technique or rule enforcement
preventable = ['poor tackling', 'poor blocking', 'illegal hit', 'pushed from behind', 'pushed from side']
punt_sub = punt.query('Possible_Cause not in @preventable')
print('%i concussion collisions detected' %sum(punt_sub.injury))
punt_sub.shape

In [None]:
X = np.vstack((punt_sub.angle_diff, punt_sub.sumspd, punt_sub.angle_diff*punt_sub.sumspd)).T
y = punt_sub.injury
alpha = 0.1

print(X.shape, y.shape)

clf_cv = LogisticRegressionCV(cv=5, random_state=0).fit(X, y)


#inj_clf_cv = clf_cv.predict(X)
inj_clf_cv = clf_cv.predict_proba(X)[:,1] > alpha


print(classification_report(y,inj_clf_cv))
print(confusion_matrix(y,inj_clf_cv))

In [None]:
# Parameters
n_classes = 2
plot_colors = "ryb"
plot_step = 0.02

plt.figure(figsize=(6,5))

x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step), np.arange(y_min, y_max, plot_step))


# Classify each grid point with the same probability threshold used above
Z = clf_cv.predict_proba(np.c_[xx.ravel(), yy.ravel(), xx.ravel()*yy.ravel()])[:,1] > alpha
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu)

# Plot the training points
for i, color in zip(range(n_classes), plot_colors):
    idx = np.where(y == i)
    plt.scatter(X[idx, 0], X[idx, 1], c=color, cmap=plt.cm.RdYlBu, edgecolor='black', s=15)

plt.grid(True)
plt.ylabel("Combined Speed")
plt.xlabel("Angle Difference")
plt.title('Concussion Probability Regions')
plt.show()

### Observation 3

To reduce the rate of concussion on punt returns, we need to reduce the potential for violent hits, here defined as collisions at high speeds from significantly different angles of motion. But how and where do these hits occur?

Below is a plot of all collisions and their location relative to the line of scrimmage. Although all positions are susceptible to a violent hit, it is clear that these hits typically occur down field in a region near the returner. The best approach to improving player safety, therefore, is to reduce contact with or near the returner.

Note that this plot also supports the previously suggested rule change of having the Down Judge and Line Judge monitor more activity down field.

In [None]:
sns.scatterplot(x='nx', y='ny', data=collisions.query('injury == False'), color='blue', s=70, alpha=0.5, label='normal')
sns.scatterplot(x='nx', y='ny', data=collisions.query('injury == True'), color='red', s=70, alpha=1.0, label='injury')
plt.title('Relative location of injuries')
plt.ylabel('distance down field')
plt.xlabel('width of field')
plt.show()

### Rule Change 2

Fair Catch: Rule 10. Section 2. Article 1.
Award the receiving team an additional 5 yards from the spot of reception if a fair catch is called for and successfully received. Optional: Do not award the additional 5 yards if the punt is received within a team's own 5 yard line.

#### Motivation

This will reduce the number of punt plays where a return happens. The returner is incentivized to take the fair catch because he is guaranteed 5 yards. The returner will only return the ball when he has plenty of cushion and thus is not in danger of getting hit immediately. Injuries are dramatically less likely when the return doesn't happen.



### Rule Change 3

Punt Out of Bounds: Rule 9. Section 4. Article 4.
Award the kicking team the shorter of 5 yards or half the distance to the goal line if the punt travels out of bounds.

#### Motivation

This will reduce the number of punt plays where a return happens. The punter is incentivized to kick the ball out of bounds because the kicking team will receive an extra five yards and guaranteed no return yards. Punts without returns have dramatically lower probability of resulting in injuries.




## Conclusion

Adding a potential 5 yard reward for both the kicking and receiving teams will not only improve player safety but also add a layer of intrigue to the punt return. Instead of the common concerns fans have while watching punt returns, namely injuries, penalties, and turnovers, the precision of punters will be on full display, testing how close a punter can put the ball to the sideline without entering the field of play. The returner will be challenged to make tip-toe sideline catches in order to gain the extra 5 yards and will need to think carefully about not catching the ball when it sails beyond their own 10 yard line.

In the event that a punt returner decides to catch and return the punt, however, the added support of sideline judges near the action should help catch unnecessary roughness and, hopefully, increase the rate of safe and clean play. Also, by excluding fair catches within a team's own 5 yard line, the excitement of a punt downed near the goal line and the potential for a safety will remain.
