## Objective:
- Preprocess 'PlayDescription' from the play information dataset to create subsets of data that will be used in other notebooks for analysis.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import re

# PLAY INFORMATION
- <b>Play Information</b>: Play-level data that describes the type of play, possession team, score and a brief narrative of each play. Plays are uniquely identified by PlayID along with the corresponding GameKey.
- All plays are punts (just check counts of 'Play_Type')
- I'll look specifically at <b>'PlayDescription'</b> to get a rough idea of how a play panned out and use that data to create one-hot encodings of types of plays (touchback, punt return, blocked kick, etc) and miscellaneous attributes of a play (fumble, muffed, etc). This will help to understand how many plays could be deemed 'interesting' (exciting, action after the catch, a blocked kick) and 'uninteresting' (out of bounds kicks, touchbacks, fair catches, etc.). This labeling is subjective and is used later to place value on the result of a punt both to a team and its fanbase.

In [None]:
play_df = pd.read_csv('../input/play_information.csv')
print(play_df.shape)
play_df.head(1)

In [None]:
# HOW MANY GAMES HAD NO PUNTS AND WHICH GAMES
# Collect all game IDs that appear in the punt data
games_with_punts = set(play_df['GameKey'])
print('Number of games without a punt:', 666 - len(games_with_punts))

# Game keys are assumed to run from 1 to 666
for game_key in range(1, 667):
    if game_key not in games_with_punts:
        print('Game', game_key, 'had no punts')

- Game 1: 'Hall of Fame Game' was cancelled due to weather
- Game 333: Pro Bowl game
    - If you search the game and find the box-score, there were 3 punts
- Game 399: Was cancelled due to Hurricane Harvey :(
- Game 666: Pro Bowl game
    - If you search the game and find the box-score, there were 8 punts
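
The missing-game check can also be written as a set difference. This is only a minimal sketch: `toy_game_keys` is made-up data standing in for `play_df['GameKey']`, and `games_without_punts` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Made-up stand-in for play_df['GameKey'] (assumption, not real data)
toy_game_keys = pd.Series([2, 2, 3, 5, 5, 5])

def games_without_punts(game_keys, first_key, last_key):
    """Return the sorted game keys in [first_key, last_key] that never appear in game_keys."""
    all_games = set(range(first_key, last_key + 1))
    return sorted(all_games - set(game_keys))

missing = games_without_punts(toy_game_keys, 1, 6)
print('Number of games without a punt:', len(missing))
for game in missing:
    print('Game', game, 'had no punts')
```

On the real data the call would presumably be `games_without_punts(play_df['GameKey'], 1, 666)`, assuming game keys run from 1 to 666.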

- Example Play Descriptions for one-hot-encodings:
    - <b>Interesting outcomes</b>:
        - <b>Returned Punt</b>: B.Nortman punts 40 yards to BUF 23, Center-C.Holba. B.Tate to BUF 34 for 11 yards (D.Payne).
        - <b>Muffed catch</b>: S.Waters punts 36 yards to BLT 15, Center-J.Jansen. K.Clay MUFFS catch, RECOVERED by CAR-F.Whittaker at BLT 12. F.Whittaker to BLT 12 for no gain (K.Clay).
        - <b>Blocked Punt</b>: B.Wing punt is BLOCKED by B.Carter, Center-Z.DeOssie, recovered by NYG-J.Currie at NYG 15. J.Currie to NYG 15 for no gain (J.Burris).
        - <b>Fumbles</b>: M.Darr punts 42 yards to TEN 14, Center-J.Denney. K.Reed to TEN 21 for 7 yards (Dan.Thomas). FUMBLES (Dan.Thomas), RECOVERED by MIA-J.Denney at TEN 23. J.Denney to TEN 23 for no gain (K.Byard).
        - <b>Touchdown</b>: J.Locke punts 61 yards to CIN 20, Center-K.McDermott. A.Erickson for 80 yards, TOUCHDOWN.
        - <b>Fake Punt</b>: P.McAfee pass deep right to E.Swoope to PIT 8 for 35 yards (J.Gilbert).
            - Passing: P.McAfee pass deep right to E.Swoope to PIT 8 for 35 yards (J.Gilbert).
            - Running: C.Jones left end to PHI 43 for 30 yards (D.Sproles). Fake punt run around left end.
                - Lots of variations in descriptions for these bad boys
    - <b>Uninteresting outcomes</b>:
        - <b>Fair Catch</b>: J.Locke punts 47 yards to GB 10, Center-K.McDermott, fair catch by M.Hyde.
        - <b>Downed Punt</b>: J.Locke punts 50 yards to GB 9, Center-K.McDermott, downed by MIN-J.Kearse.
            - After the punt, the punting team controls the ball before any receiving-team player does
        - <b>Touchbacks</b>: J.Hekker punts 50 yards to end zone, Center-J.Overbaugh, Touchback.
        - <b>Out of Bounds Punt</b>: J.Schum punts 35 yards to MIN 34, Center-B.Goode, out of bounds.
        - <b>Dead Ball</b>: B.Nortman punts 51 yards to BUF 34, Center-C.Holba. B.Tate, dead ball declared at BUF 34 for no gain.
        - <b>No Play</b>: (:04) (Punt formation) PENALTY on ATL-M.Bosher, Delay of Game, 5 yards, enforced at ATL 49 - No Play.
            - Descriptions with 'No Play' or '(Punt formation) Penalty' vary. In some, the punt was executed but a penalty negated the play, so the punt was reattempted
            - Such penalties include: False Start, Illegal Substitution, Delay of Game, Illegal Formation, Neutral Zone Infraction, Player Out of Bounds on Punt, Defensive 12 On-field, Ineligible Downfield Kick, Illegal Shift, Unnecessary Roughness, Roughing the Kicker, Defensive Offside, Offensive Holding

- Note: a play may have more than one of the above classifications.

In [None]:
# Create condensed version of play data
keeper_columns = ['GameKey', 'PlayID', 'PlayDescription', 'Poss_Team', 'YardLine']
condensed_play_df = play_df[keeper_columns].copy()

In [None]:
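def find_that_play_word(keyword, df):
    """Flag plays whose description contains `keyword` (case-insensitive).

    Adds a 0/1 column named after the keyword to df in place and prints the
    number of matches. The keyword must be lowercase, and df needs a default
    0..n-1 index.
    """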
    df[keyword] = 0
    count = 0
    for i, description in enumerate(df['PlayDescription']):
        game_key = df.loc[i, 'GameKey']
        play_id = df.loc[i, 'PlayID']
        # Find keyword in lowercased string of play description
        if description.lower().find(keyword) != -1:
#             print('Keyword', keyword, 'found for (game, play):', '(' + str(game_key) + ',' + str(play_id) + ')')
#             print('Play description:', description)
#             print('---')
                
            # One-hot encode with keyword
            df.loc[i, keyword] = 1
            count += 1

    print('# of', keyword, 'occurring on a punt play:', count)

The choice of strings to parse for was determined by reading through the play descriptions. There are probably cases where I'm still making poor assumptions, but I'll have to live with it.
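
The same keyword tagging can also be expressed with pandas string methods instead of a row loop. The sketch below is an alternative, not the function used in this notebook: `flag_keywords` and `toy_df` are hypothetical names, the three descriptions are quoted from the examples above, and it assumes no description is missing (NaN).

```python
import pandas as pd

# Three descriptions quoted from the examples earlier in this notebook
toy_df = pd.DataFrame({'PlayDescription': [
    'J.Locke punts 47 yards to GB 10, Center-K.McDermott, fair catch by M.Hyde.',
    'J.Hekker punts 50 yards to end zone, Center-J.Overbaugh, Touchback.',
    'J.Schum punts 35 yards to MIN 34, Center-B.Goode, out of bounds.',
]})

def flag_keywords(df, keywords):
    """Add one 0/1 column per keyword, matched case-insensitively as a literal substring."""
    lowered = df['PlayDescription'].str.lower()
    for keyword in keywords:
        # regex=False treats characters such as '(' and '.' in a keyword literally
        df[keyword] = lowered.str.contains(keyword, regex=False).astype(int)
    return df

flag_keywords(toy_df, ['fair catch', 'touchback', ', out of bounds'])
print(toy_df[['fair catch', 'touchback', ', out of bounds']].sum())
```

Passing `regex=False` matters here because keywords such as `'(punt formation) penalty on'` and `'touchdown.'` contain characters that a regular expression would interpret specially.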

### Uninteresting Outcomes

In [None]:
find_that_play_word('fair catch', condensed_play_df)
find_that_play_word('touchback', condensed_play_df)
find_that_play_word('downed', condensed_play_df)
find_that_play_word(', out of bounds', condensed_play_df)
find_that_play_word('dead ball', condensed_play_df)
find_that_play_word('no play', condensed_play_df)
find_that_play_word('(punt formation) penalty on', condensed_play_df) # Picks up additional 'no play' type punts

- Some of these counts may overlap, but that won't matter for the processing, since a play is dropped as soon as any one of the flags is set
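
To see why overlap is harmless, here is a small sketch on made-up flags (`toy_flags` is invented for illustration and is not taken from the dataset). A play with two flags set is counted once as overlapping and is still dropped exactly once.

```python
import pandas as pd

# Made-up 0/1 flags for four plays (assumption, not real data)
toy_flags = pd.DataFrame({
    'PlayID': [10, 11, 12, 13],
    'fair catch': [1, 0, 0, 0],
    'downed': [0, 0, 1, 0],
    'no play': [1, 0, 0, 0],
})
flag_columns = ['fair catch', 'downed', 'no play']

# True where at least one 'uninteresting' flag is set
is_uninteresting = toy_flags[flag_columns].eq(1).any(axis=1)

# Number of plays carrying more than one flag
overlap_count = int((toy_flags[flag_columns].sum(axis=1) > 1).sum())

kept = toy_flags[~is_uninteresting].reset_index(drop=True)
print('Plays with overlapping flags:', overlap_count)
print('Plays kept:', kept['PlayID'].tolist())
```

The `any(axis=1)` mask is equivalent to chaining each `== 1` comparison with `|`.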

In [None]:
# Reduce play_df even further 
where_condition = (
    (condensed_play_df['fair catch'] == 1) |
    (condensed_play_df['touchback'] == 1) |
    (condensed_play_df['downed'] == 1) |
    (condensed_play_df[', out of bounds'] == 1) |
    (condensed_play_df['dead ball'] == 1) |
    (condensed_play_df['no play'] == 1) |
    (condensed_play_df['(punt formation) penalty on'] == 1))
interesting_plays_df = condensed_play_df[~where_condition].reset_index(drop=True)

print('There are now', len(interesting_plays_df), '"interesting plays" from', len(condensed_play_df), 'punt plays')
print('Proportion of interesting punts:', len(interesting_plays_df)/len(condensed_play_df))
interesting_plays_df.head(1)

- So we can see that around <b>55.4% of punts (3701/6681)</b> result in a play that is 'uninteresting'. Maybe the punt isn't worth the time.
- The <b>touchback rate was 57.6% for kickoffs in 2016</b>. Just for perspective of 'uninteresting outcomes'.
    - Reference: http://www.espn.com/nfl/story/_/id/18393780/kickoff-returns-reduced-18-percentage-points-2016-season


Now that we have a condensed set of punt plays where something potentially interesting occurred, let's parse for the especially interesting plays on punts (touchdowns, fumbles, blocks, etc.). I'll also create datasets that will be needed for other notebooks.

In [None]:
# Kept only as a reference of the keywords I've filtered by
uninteresting_keywords = ['fair catch', 'touchback', 'downed', ', out of bounds', 'dead ball', 'no play',
                          '(punt formation) penalty on']
interesting_keywords = ['muffs', 'blocked by', 'touchdown.', 'fumble', 'ruling', 'fake punt', 'safety',
                        'up the middle', 'pass', 'right end', 'left end', 'right guard',
                        'direct snap', 'touchdown nullified']

### Interesting outcomes

In [None]:
# 'Interesting outcomes'
find_that_play_word('muffs', interesting_plays_df)
find_that_play_word('blocked by', interesting_plays_df)
find_that_play_word('touchdown.', interesting_plays_df)
find_that_play_word('fumble', interesting_plays_df)
find_that_play_word('ruling', interesting_plays_df)
find_that_play_word('fake punt', interesting_plays_df)
find_that_play_word('safety', interesting_plays_df)
find_that_play_word('up the middle', interesting_plays_df)
find_that_play_word('pass', interesting_plays_df)
find_that_play_word('right end', interesting_plays_df)
find_that_play_word('left end', interesting_plays_df)
find_that_play_word('right guard', interesting_plays_df)
find_that_play_word('direct snap', interesting_plays_df)
find_that_play_word('touchdown nullified', interesting_plays_df)

In [None]:
# Create a dataset where plays are currently assumed to be actual punt returns 
where_condition = (
    (interesting_plays_df['muffs'] == 1) |
    (interesting_plays_df['blocked by'] == 1) |
    (interesting_plays_df['touchdown.'] == 1) |
    (interesting_plays_df['fumble'] == 1) |
    (interesting_plays_df['ruling'] == 1) |
    (interesting_plays_df['fake punt'] == 1) |
    (interesting_plays_df['safety'] == 1) |
    (interesting_plays_df['up the middle'] == 1) |
    (interesting_plays_df['pass'] == 1) |
    (interesting_plays_df['right end'] == 1) |
    (interesting_plays_df['left end'] == 1) |
    (interesting_plays_df['right guard'] == 1) |
    (interesting_plays_df['direct snap'] == 1) |
    (interesting_plays_df['touchdown nullified'] == 1))
remainder_df = interesting_plays_df[~where_condition].reset_index(drop=True)

# Isolate touchdowns that were from punt returns
where_condition = ((interesting_plays_df['touchdown.'] == 1) &
                   (interesting_plays_df['blocked by'] == 0) &
                   (interesting_plays_df['direct snap'] == 0) &
                   (interesting_plays_df['right guard'] == 0) &
                   (interesting_plays_df['fumble'] == 0) &
                   (interesting_plays_df['pass'] == 0))
td_df = interesting_plays_df[where_condition].reset_index(drop=True)

# Combine touchdown punt returns and regular punt returns
remainder_df = pd.concat([remainder_df, td_df], axis=0)

# Drop unnecessary columns
keeper_columns = ['GameKey', 'PlayID', 'PlayDescription', 'Poss_Team', 'YardLine']
remainder_df = remainder_df[keeper_columns]
remainder_df.reset_index(inplace=True, drop=True)
print(remainder_df.shape)
remainder_df.head()

**play-punt_return.csv and play-fair_catch.csv** are used in this notebook: https://www.kaggle.com/jdemeo/preprocessing-ngs

In [None]:
# Create dataset of punt return plays
remainder_df.to_csv('play-punt_return.csv', index=False)

# Create dataset of fair catch plays
where_condition = (condensed_play_df['fair catch'] == 1)
fc_df = condensed_play_df[where_condition].reset_index(drop=True)

# Drop unnecessary columns
keeper_columns = ['GameKey', 'PlayID', 'PlayDescription', 'Poss_Team', 'YardLine']
fc_df = fc_df[keeper_columns]
fc_df.to_csv('play-fair_catch.csv', index=False)

In [None]:
# Just for reference if you want to filter return plays that have a penalty
find_that_play_word('penalty on', remainder_df)

So now we have a condensed set of punts that resulted in some return, minus the 'interesting' plays filtered out above. We'll now look at this set of plays and extract some information from the 'PlayDescription'.

### Play Description Parsing of Punt Return Plays
- The following work-up/analysis does not adjust for penalties on the play. I know this isn't clean, and big returns on punts have a pretty good chance of having been helped by a penalty, but I'm only parsing the PlayDescription to get a rough idea of the return amounts and the potential value of a return.
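
As a quick illustration of what the parsing extracts, the sketch below applies two of the patterns from the next cell to the 'Returned Punt' example description quoted earlier. Note that the side of the field is captured with surrounding spaces, so it is stripped here before use.

```python
import re

# Same patterns as in the parsing cell of this notebook
punt_distance_pattern = re.compile(r'punts ((-?)\d+) yards? to(\s| \w+ )((-?)\d+)')
yards_gained_pattern = re.compile(r'for ((-?)\d+) yard')

description = ('B.Nortman punts 40 yards to BUF 23, Center-C.Holba. '
               'B.Tate to BUF 34 for 11 yards (D.Payne).')

# findall returns one tuple per match, with one entry per capture group
punt_match = punt_distance_pattern.findall(description)
yards_match = yards_gained_pattern.findall(description)

punt_distance = int(punt_match[0][0])
side_ball_lands = punt_match[0][2].strip()  # the raw capture is ' BUF ' with spaces
yardline_received = int(punt_match[0][3])
yardage_on_play = int(yards_match[0][0])

print(punt_distance, side_ball_lands, yardline_received, yardage_on_play)
```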

In [None]:
'''
Need to parse through PlayDescription in order to get return distance of play and distance to touchdown
Patterns that return two distances for yardage on play are lateral plays
'''

# Regex for them patterns
punt_distance_pattern = re.compile(r'punts ((-?)\d+) yards? to(\s| \w+ )((-?)\d+)')
yards_gained_pattern = re.compile(r'for ((-?)\d+) yard')
no_yards_gained_pattern = re.compile(r'([A-Z]\w+) ((-?)\d+) (for no gain)')

remainder_df['punt distance'] = 0
remainder_df['side ball lands'] = ''
remainder_df['yardline received'] = 0
remainder_df['yardage on play'] = 0

for i, element in enumerate(remainder_df['PlayDescription']):

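    # Each findall result is a list of tuples with one entry per capture group
    punt_distance = punt_distance_pattern.findall(element) # ('Punt distance', sign, 'Side Ball Lands' padded with spaces, 'Yardline Received', sign)
    yards_gained = yards_gained_pattern.findall(element)   # ('Yardage on Play', sign), one tuple per gain
    no_gain = no_yards_gained_pattern.findall(element)     # non-empty when the play went 'for no gain'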
    
#     print(punt_distance)
#     print(yards_gained)
#     print(no_gain)
    
    # A play that results in yards gained or lost
    if yards_gained != []:
        remainder_df.loc[i, 'punt distance'] = int(punt_distance[0][0])
        # Strip the spaces the regex captures around the team abbreviation,
        # otherwise the comparison against the team name later never matches
        remainder_df.loc[i, 'side ball lands'] = punt_distance[0][2].strip()
        remainder_df.loc[i, 'yardline received'] = int(punt_distance[0][3])

        # A normal return
        if len(yards_gained) == 1:
            remainder_df.loc[i, 'yardage on play'] = int(yards_gained[0][0])

        # For laterals
        else:
            remainder_df.loc[i, 'yardage on play'] = int(yards_gained[0][0]) + int(yards_gained[1][0])

    # A play that resulted in no gain in yards
    elif no_gain != []:
        remainder_df.loc[i, 'punt distance'] = int(punt_distance[0][0])
        remainder_df.loc[i, 'side ball lands'] = punt_distance[0][2].strip()
        remainder_df.loc[i, 'yardline received'] = int(punt_distance[0][3])

#     print('---')

In [None]:
# Hand-processing specific returns where the yardage gained on the return was
# officially changed (I know, not elegant, especially if dataframe indices change over time)
culprits = [476, 891, 1062, 1064, 1096, 2193]
yard_changes = [14, 6, 0, 3, 0, 4]
for i, element in enumerate(culprits):
    remainder_df.loc[element, 'yardage on play'] = yard_changes[i]

We'll calculate the distance to a touchdown for each play to create a reward metric. A more proper metric for the value of a punt return should also take into account the current score, time remaining, playoff implications, and whether the returning team is at home, to name a few factors. I leave these out to keep the reward model simple.
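
As a worked example of the reward idea, the sketch below uses a simplified rule of my own: it looks only at which side of the field the ball lands on, rather than also checking where the punt was kicked from as the notebook function does. The team abbreviations 'AAA' and 'BBB' and both helper names are hypothetical.

```python
def distance_to_touchdown(punting_team, side_ball_lands, yardline_received):
    """Yards the return team needs to reach the punting team's end zone (simplified rule)."""
    if yardline_received == 50:
        return 50
    if side_ball_lands == punting_team:
        # Ball is already on the punting team's side of the field
        return yardline_received
    # Ball is on the return team's own side of the field
    return 100 - yardline_received

def return_reward(yards_on_return, punting_team, side_ball_lands, yardline_received):
    """Fraction of the distance to a touchdown that the return covered."""
    return yards_on_return / distance_to_touchdown(punting_team, side_ball_lands, yardline_received)

# Hypothetical teams: 'AAA' punts, 'BBB' returns
deep_punt = return_reward(11, 'AAA', 'BBB', 23)   # caught at the returner's own 23, 77 yards to go
short_punt = return_reward(6, 'AAA', 'AAA', 30)   # caught at the punting team's 30, 30 yards to go
print(round(deep_punt, 3), round(short_punt, 3))
```

An 11-yard return from deep in the returner's own territory earns less reward than a 6-yard return that starts 30 yards from the end zone, which is the intended behavior of a proportional metric.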

In [None]:
def calculate_distance_to_td(data_sample):
    '''Calculate distance needed for touchdown for each play'''
    # Punts that land on the 50 yard line
    if data_sample['yardline received'] == 50:
        distance_to_touchdown = 50
    
    # Punting on punting team's side of field
    elif data_sample['Poss_Team'] == data_sample['YardLine'][:len(data_sample['Poss_Team'])]:
        # Ball remains on punt team's side of field
        if data_sample['side ball lands'] == data_sample['YardLine'][:len(data_sample['Poss_Team'])]:
            distance_to_touchdown = data_sample['yardline received']
        # Ball is punted to return team's side of field
        else:
            distance_to_touchdown = (50 - data_sample['yardline received']) + 50
            
    # Punting on opponent's side of field
    else:
        distance_to_touchdown = (50 - data_sample['yardline received']) + 50
    return distance_to_touchdown

In [None]:
# Calculate the value of a punt return based solely on the proportion of yardage gained on the return,
# relative to how many yards are needed to score a touchdown from where the punt initially lands
remainder_df['reward'] = 0.0  # float column, since rewards are fractions
for i in range(len(remainder_df)):
    yards_on_return = remainder_df.loc[i, 'yardage on play']
    distance_to_touchdown = calculate_distance_to_td(remainder_df.iloc[i, :])
    remainder_df.loc[i, 'reward'] = yards_on_return / distance_to_touchdown
#     print('Value of return:', yards_on_return / distance_to_touchdown)

remainder_df.head()

In [None]:
# Cleanup
keepers = ['GameKey', 'PlayID', 'PlayDescription', 'yardage on play', 'reward']
remainder_df = remainder_df[keepers]

# Create dataset for external usage
remainder_df.to_csv('play-punt_return-yardage.csv', index=False)

**play-punt_return-yardage.csv** used in notebook: https://www.kaggle.com/jdemeo/analysis-punt-returns

# PLAY PLAYER ROLE DATA
- <b>Play Player Role Data</b>: Play and player level data that specifies a punt-specific player role. This dataset will specify each player that played in each play. A player's role in a play is uniquely defined by the GameKey, PlayID and GSISID.

In [None]:
play_player_role_df = pd.read_csv('../input/play_player_role_data.csv')
print(play_player_role_df.shape)
play_player_role_df.tail(2)

In [None]:
# How many unique plays in play_player_role dataset?
print('# of unique plays according to play_player_role dataset:',
      play_player_role_df.groupby(['GameKey', 'PlayID']).ngroups)

# How many roles are there?
print('# of roles in dataset:', play_player_role_df['Role'].nunique())

- NGS dataset has 6666 punt plays
- Play Information dataset has 6681 punt plays
- **I wanted to note that there is some missing data between the datasets**
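
One way to pin down which plays are missing is an outer merge on the play keys with `indicator=True`. The sketch below uses made-up keys (`play_info_keys` and `role_keys` are invented stand-ins for the two datasets), so it shows only the mechanics, not the actual missing plays.

```python
import pandas as pd

# Made-up (GameKey, PlayID) pairs standing in for the two datasets (assumption)
play_info_keys = pd.DataFrame({'GameKey': [2, 2, 3], 'PlayID': [100, 200, 150]})
role_keys = pd.DataFrame({'GameKey': [2, 3, 3], 'PlayID': [100, 150, 150]})

# The role data has one row per player, so reduce it to one row per play first
role_plays = role_keys.drop_duplicates()

merged = play_info_keys.merge(role_plays, on=['GameKey', 'PlayID'],
                              how='outer', indicator=True)

# 'left_only' marks plays present in the play information but absent from the role data
only_in_play_info = merged[merged['_merge'] == 'left_only']
print(only_in_play_info[['GameKey', 'PlayID']])
```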

# Links to other notebooks:
- Concussion play analysis with proposed rule changes: https://www.kaggle.com/jdemeo/analysis-concussions
- Analysis of uncalled penalties: https://www.kaggle.com/jdemeo/analysis-uncalled-penalties
- Analysis of punt returns: https://www.kaggle.com/jdemeo/analysis-punt-returns
- Analysis of fair catches: https://www.kaggle.com/jdemeo/analysis-fair-catches
- Preprocessing of NGS data for the above notebooks: https://www.kaggle.com/jdemeo/preprocessing-ngs