# NFL Data: Cleaning
### Cleaning the Big Data Bowl 2022 datasets and filtering to only include Punt Returns 

# Cleaning Players

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as mpl
import seaborn as sns

In [9]:
from google.colab import drive

# Global variables / paths
input_folderpath = "data"
output_folderpath = "data"
drive_folderpath = "Colab Notebooks"
useDrive = True

# Forward slashes work on Colab (Linux) and on Windows; backslashes are not path separators on Linux
drivepath = "drive/MyDrive/" + drive_folderpath + "/"
inputpath = drivepath + input_folderpath + "/" if useDrive else input_folderpath + "/"
outputpath = drivepath + output_folderpath + "/" if useDrive else output_folderpath + "/"
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [11]:
players = pd.read_csv(inputpath+"players.csv")
players.columns = players.columns.str.replace(' ', '')
players

Converting all heights to inches

In [None]:
# Split "feet-inches" strings; values without a dash leave the inches part null
check = players['height'].str.split('-', expand=True)
check.columns = ['feet', 'inches']
# Rows with an inches part are converted to total inches: feet * 12 + inches
has_inches = check['inches'].notnull()
check.loc[has_inches, 'feet'] = check.loc[has_inches, 'feet'].astype(np.int16) * 12 + check.loc[has_inches, 'inches'].astype(np.int16)
players['height'] = check['feet'].astype(np.float32)
players
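
The same conversion can be sketched as a small reusable function. This is only an illustration on toy values, not data from players.csv; the function name `height_to_inches` is hypothetical, and it assumes every height is either a "feet-inches" string or a plain number of inches.

```python
import pandas as pd

def height_to_inches(height: pd.Series) -> pd.Series:
    """Convert 'feet-inches' strings (e.g. '6-2') to inches; pass plain inch values through."""
    parts = height.astype(str).str.split("-", expand=True)
    first = pd.to_numeric(parts[0], errors="coerce")
    if parts.shape[1] < 2:
        # No value contained a dash, so everything is already in inches.
        return first.astype("float32")
    second = pd.to_numeric(parts[1], errors="coerce")
    # Keep the first part where there is no second part; otherwise feet * 12 + inches.
    inches = first.where(second.isna(), first * 12 + second)
    return inches.astype("float32")

# Toy values (assumed), not taken from players.csv
sample = pd.Series(["6-2", "74", "5-11"])
converted = height_to_inches(sample)
print(converted.tolist())
```

Using `pd.to_numeric(..., errors="coerce")` means an unexpected value becomes NaN instead of raising, which makes bad rows easy to find afterwards with `isnull()`.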

Making all dates the same format

In [None]:
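for idx, row in players.iterrows():
    birth_date = row['birthDate']
    # Skip missing values; move the year to the front of slash-separated dates
    if isinstance(birth_date, str) and "/" in birth_date:
        split = birth_date.split("/")
        players.loc[idx, "birthDate"] = split[2].replace(" ", "") + "-" + split[0] + "-" + split[1]
players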
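
A vectorised alternative to the row loop could look like the sketch below. It assumes the slash-separated dates are strictly MM/DD/YYYY with nothing after the year, which may not hold for every row of players.csv; the function name `normalise_birth_dates` and the sample values are made up for illustration.

```python
import pandas as pd

def normalise_birth_dates(dates: pd.Series) -> pd.Series:
    """Rewrite MM/DD/YYYY entries as YYYY-MM-DD and leave all other values untouched."""
    has_slash = dates.str.contains("/", na=False)
    # Assumption: every slash-separated entry is exactly MM/DD/YYYY.
    parsed = pd.to_datetime(dates[has_slash].str.strip(), format="%m/%d/%Y")
    result = dates.copy()
    result[has_slash] = parsed.dt.strftime("%Y-%m-%d")
    return result

# Toy values (assumed), including one missing entry
sample = pd.Series(["1990-03-15", "07/04/1992", None])
cleaned = normalise_birth_dates(sample)
print(cleaned.tolist())
```

Because the dates go through `strftime`, month and day are always written with two digits, which plain string concatenation does not guarantee.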

In [None]:
players.to_csv(outputpath+"cleaned_players.csv", index=False)
cleaned_players = pd.read_csv(outputpath+"cleaned_players.csv")
cleaned_players

# Plays

In [None]:
plays = pd.read_csv(inputpath+"plays.csv")
plays.head()

The data describes four special teams play types. Each should be given its own CSV.

In [None]:
plays['specialTeamsPlayType'].unique()
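
Splitting by play type can also be sketched with a single `groupby`, shown here on a toy frame rather than the real plays.csv. The helper name `split_by_play_type` is hypothetical, and the sketch only drops the type column; the per-type column clean-up done later in this notebook is not reproduced.

```python
import pandas as pd

# Toy stand-in for plays.csv; only the columns needed for the illustration.
plays_demo = pd.DataFrame({
    "playId": [1, 2, 3, 4, 5],
    "specialTeamsPlayType": ["Kickoff", "Punt", "Punt", "Field Goal", "Extra Point"],
})

def split_by_play_type(plays: pd.DataFrame) -> dict:
    """Return one frame per play type, keyed by a camelCase file stem."""
    frames = {}
    for play_type, group in plays.groupby("specialTeamsPlayType"):
        words = play_type.split()
        stem = words[0].lower() + "".join(w.title() for w in words[1:])
        # The type column is constant within each group, so it is dropped.
        frames[stem] = group.drop(columns=["specialTeamsPlayType"])
    return frames

frames = split_by_play_type(plays_demo)
for stem, frame in frames.items():
    # frame.to_csv(stem + ".csv", index=False) would write each file
    print(stem, len(frame))
```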

In [None]:
plays[plays['specialTeamsPlayType'] == "Kickoff"]["specialTeamsResult"].unique()

- Touchback - The kickoff becomes dead in the receiving team's end zone, so the receiving team gains possession at its 25 or 20 yard line. The ball either has to land there and stop, or a player catches it and kneels to end the play.
- Return - The receiving team fields the kickoff and runs the ball up the field. (Is it caught, or does it become dead outside the end zone?)
- Muffed - The receiving team doesn't gain possession of the ball properly, and can only start from where the ball was downed?
- Kickoff Team Recovery - The kickoff team gains possession of the ball after it crosses the receiving team's restraining line (35 yards) or after a member of the receiving team touches the ball first.
- Out of Bounds - The kickoff goes out of bounds.
- Fair Catch - The receiver signals for a fair catch, meaning they can catch the ball without interference. The ball then becomes dead at that spot and the receiving team cannot advance it.
- Downed - Ball brought to the ground??

In [None]:
plays[plays['specialTeamsPlayType'] == "Punt"]["specialTeamsResult"].unique()

- Non-Special Teams Result - The ball is passed instead of being punted.

In [None]:
plays[plays['specialTeamsPlayType'] == "Field Goal"]["specialTeamsResult"].unique()

- Kick Attempt Good - goal scored
- Kick Attempt No Good - goal missed
- Blocked Kick Attempt - kick blocked by an opponent
- Non-Special Teams Result - kick set up but passed instead?

In [None]:
plays[plays['specialTeamsPlayType'] == "Extra Point"]["specialTeamsResult"].unique()

- Non-Special Teams Result - After a touchdown, a team can choose to attempt another score from scrimmage instead of the conversion kick, so no one attempts the kick and kickerId is null. These attempts mostly fail, however.

## Kickoff

In [None]:
kickoff = plays[plays['specialTeamsPlayType'] == "Kickoff"]
kickoff.columns

The percentage of NA values in each column:

In [None]:
for column in kickoff.columns:
  print(column,(kickoff[column].isnull().sum()/len(kickoff[column])*100))
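
The per-column loop can be condensed into one expression. The sketch below uses a toy frame with assumed values; the column names are borrowed from plays.csv only for readability.

```python
import numpy as np
import pandas as pd

# Toy frame (assumed values), standing in for the kickoff data
demo = pd.DataFrame({
    "kickerId": [101.0, 102.0, np.nan, 104.0],
    "kickBlockerId": [np.nan, np.nan, np.nan, np.nan],
    "playResult": [25, 30, 0, 18],
})

# isnull() gives a boolean frame; the column mean of booleans is the NA fraction.
na_percent = demo.isnull().mean().mul(100).sort_values(ascending=False)
print(na_percent)
```

Sorting the result puts the columns with the most missing values first, which makes candidates for dropping easier to spot.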

- Penalty columns have high NA percentages because penalties are rare, but the data is still valid.
- Kickoffs have no kick blocker, so kickBlockerId is irrelevant here.
- passResult is the scrimmage outcome of the play when specialTeamsResult is "Non-Special Teams Result", so it is irrelevant here.
- It looks like yardlineNumber should always be 35 because that's where a kickoff occurs, but some values may be different because of penalties?

In [None]:
kickoff = kickoff.drop(columns=["kickBlockerId","passResult","specialTeamsPlayType"])

In [None]:
kickoff.to_csv(outputpath+"kickoff.csv",index=False)

specialTeamsPlayType is removed because each CSV only has data about one special teams play type, so the column would hold the same value in every row.
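
Columns that carry a single value can also be found programmatically rather than by inspection. This is a minimal sketch on a toy frame with assumed values, not the notebook's own method.

```python
import pandas as pd

# Toy frame (assumed values) mimicking a single-play-type subset
demo = pd.DataFrame({
    "playId": [1, 2, 3],
    "specialTeamsPlayType": ["Kickoff", "Kickoff", "Kickoff"],
    "specialTeamsResult": ["Return", "Touchback", "Return"],
})

# nunique(dropna=False) counts NaN as a value, so an all-null column also counts as constant.
constant_columns = [c for c in demo.columns if demo[c].nunique(dropna=False) <= 1]
trimmed = demo.drop(columns=constant_columns)
print(constant_columns)
```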

## Punt

In [None]:
punt = plays[plays['specialTeamsPlayType'] == "Punt"]
punt

In [None]:
for column in punt.columns:
  print(column,(punt[column].isnull().sum()/len(punt[column])*100))

- Some kickerIds are null because the punt is not kicked (??); it is passed instead. This is indicated by specialTeamsResult being set to Non-Special Teams Result, and passResult then shows the result of the pass.
- kickBlockerId is mostly null because it is rare to block a punt. When it is not null, specialTeamsResult is Blocked Punt.
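
Both observations can be checked against the data instead of being taken on trust. The sketch below runs on a toy frame with assumed values; on the real punt frame, any rows left in the two mismatch frames would be worth inspecting. Note that the first check is stricter than the note above, since it also flags blocked punts with no blocker recorded.

```python
import numpy as np
import pandas as pd

# Toy frame (assumed values) standing in for the punt data
punt_demo = pd.DataFrame({
    "specialTeamsResult": ["Return", "Blocked Punt", "Fair Catch", "Non-Special Teams Result"],
    "kickerId": [11.0, 12.0, 13.0, np.nan],
    "kickBlockerId": [np.nan, 99.0, np.nan, np.nan],
})

has_blocker = punt_demo["kickBlockerId"].notnull()
is_blocked = punt_demo["specialTeamsResult"] == "Blocked Punt"
# Rows where a blocker is recorded without a Blocked Punt result, or the reverse.
blocker_mismatches = punt_demo[has_blocker != is_blocked]

no_kicker = punt_demo["kickerId"].isnull()
is_non_st = punt_demo["specialTeamsResult"] == "Non-Special Teams Result"
# Rows with no kicker that are not marked as Non-Special Teams Result.
kicker_mismatches = punt_demo[no_kicker & ~is_non_st]

print(len(blocker_mismatches), len(kicker_mismatches))
```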


In [None]:
punt = punt.drop(columns=["specialTeamsPlayType"])

In [None]:
punt.to_csv(outputpath+"punt.csv",index=False)

## Field Goal

In [None]:
fieldGoal = plays[plays['specialTeamsPlayType'] == "Field Goal"]
fieldGoal

In [None]:
for column in fieldGoal.columns:
  print(column,(fieldGoal[column].isnull().sum()/len(fieldGoal[column])*100))

- kickReturnYardage is all null because the receiving team cannot (??) advance the ball after a field goal??
- playResult is mostly 0 because most attempts score, so the kicking team essentially gains no yards because play is reset. It will be negative if the kick is missed, so the receiving team gets the ball at their 8 yard mark (??). For blocked kicks, it's anyone's ball afterwards, so the kicking team may or may not gain yards.
- returnerId is mostly null because it's rare to return after a field goal??
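
One way to probe the open questions about playResult is to summarise it per specialTeamsResult. The numbers below are invented for illustration and say nothing about the real data; only the grouping pattern is the point.

```python
import pandas as pd

# Toy frame with invented playResult values
fg_demo = pd.DataFrame({
    "specialTeamsResult": [
        "Kick Attempt Good", "Kick Attempt Good",
        "Kick Attempt No Good", "Blocked Kick Attempt",
    ],
    "playResult": [0, 0, -8, 5],
})

# One row per result type with the count and range of playResult.
summary = fg_demo.groupby("specialTeamsResult")["playResult"].agg(["count", "min", "max"])
print(summary)
```

Run against the real fieldGoal frame, a table like this would show whether successful kicks are always 0 and how far missed or blocked kicks deviate.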

In [None]:
fieldGoal = fieldGoal.drop(columns=["specialTeamsPlayType","kickReturnYardage"])

In [None]:
fieldGoal.to_csv(outputpath+"fieldGoal.csv",index=False)

## Extra Point

In [None]:
extraPoint = plays[plays['specialTeamsPlayType'] == "Extra Point"]
extraPoint

In [None]:
for column in extraPoint.columns:
  print(column,(extraPoint[column].isnull().sum()/len(extraPoint[column])*100))

- returnerId is all null because no one returns the ball.
- kickLength is all null because the kicks happen at the same place.
- kickReturnYardage is all null because you can't advance after an extra point attempt.
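
Entirely empty columns like these can be listed and dropped without naming them by hand. A minimal sketch on a toy frame with assumed values:

```python
import numpy as np
import pandas as pd

# Toy frame (assumed values) with two all-null columns
demo = pd.DataFrame({
    "playId": [1, 2, 3],
    "kickerId": [21.0, np.nan, 23.0],
    "returnerId": [np.nan, np.nan, np.nan],
    "kickLength": [np.nan, np.nan, np.nan],
})

# Columns where every value is null.
all_null = demo.columns[demo.isnull().all()].tolist()
# how="all" drops only columns that are entirely null; partly filled ones are kept.
trimmed = demo.dropna(axis=1, how="all")
print(all_null)
```

Listing the columns first keeps the decision visible, so a column that is empty for an unexpected reason is not dropped silently.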

In [None]:
extraPoint = extraPoint.drop(columns=["specialTeamsPlayType","kickReturnYardage","returnerId","kickLength"])
extraPoint.to_csv(outputpath+"extraPoint.csv",index=False)