# Capstone Two Data Wrangling

Data was scraped from pro-football-reference.com/teams, and includes data from 32 teams for the last 17 years.  Data is only used from 2002-2019, as the Houston Texans were included in the NFL in 2002.  As the 2020 season is currently not over, it is not being used in this analysis.

## Loading and exploring data

In [157]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

nfl = pandas.read_csv('nflstats.csv')

In [158]:
nfl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76662 entries, 0 to 76661
Data columns (total 76 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Coach                      76662 non-null  object 
 1   Def_Align                  76662 non-null  object 
 2   Def_Coor                   75570 non-null  object 
 3   Draft_Position             76662 non-null  object 
 4   Draft_School               76461 non-null  object 
 5   Draft_Team_Selection       76662 non-null  int64  
 6   Drafted_Player             76662 non-null  object 
 7   MoV                        76662 non-null  float64
 8   Off_Coor                   72695 non-null  object 
 9   Off_Scheme                 75528 non-null  object 
 10  Opp_First_down             76662 non-null  int64  
 11  Opp_Fumbles                76662 non-null  int64  
 12  Opp_PF                     76662 non-null  int64  
 13  Opp_Pass_Att               76662 non-null  int

In [159]:
nfl.shape

(76662, 76)

In [160]:
nfl.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Draft_Team_Selection,76662.0,3.698338,2.627572,0.0,1.0,3.0,6.0,13.0
MoV,76662.0,0.200613,6.405410,-16.3,-4.6,0.4,4.9,19.7
Opp_First_down,76662.0,308.160536,29.029262,228.0,289.0,309.0,327.0,419.0
Opp_Fumbles,76662.0,10.486147,3.513725,2.0,8.0,10.0,13.0,23.0
Opp_PF,76662.0,351.177441,58.661711,196.0,311.0,348.0,392.0,517.0
...,...,...,...,...,...,...,...,...
Week_Points_Scored,76662.0,22.052738,10.252568,0.0,14.0,21.0,28.0,62.0
Week_Rush_Yards,76662.0,113.980708,51.649079,-18.0,77.0,107.0,145.0,378.0
Week_Total__Off_Yards,76662.0,336.908260,85.445394,26.0,279.0,336.0,395.0,653.0
Wins,76662.0,8.107994,3.133288,0.0,6.0,8.0,10.0,16.0


Finding Missing Data and correcting for missing Data

In [161]:
missing = pd.concat([nfl.isnull().sum(), 100 * nfl.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by='%', ascending=False)
missing[missing['count'] >0]

Unnamed: 0,count,%
Def_Coor,1092,1.424435
Draft_School,201,0.26219
Off_Coor,3967,5.174663
Off_Scheme,1134,1.47922
Week_Def_Pass_Yards,11,0.014349
Week_Def_Turnovers,17295,22.560069
Week_Off_Turnovers,17364,22.650074
Week_Pass_Yards,21,0.027393


Fill in Jets Missing def_pass_yards with info from https://www.footballdb.com/games/boxscore.html?gid=2010010311

In [162]:
nfl[(nfl['Week_Def_Pass_Yards'].isnull()) & (nfl['Team'] == 'New York Jets')].fillna(31.0, inplace=True)

In [163]:
nfl[(nfl['Week_Pass_Yards']==63) & (nfl['Team'] == 'New York Jets')]['Week_Def_Pass_Yards'].replace(np.nan, 31, inplace=True)

In [171]:
nfl[(nfl['Week_Pass_Yards']==63) & (nfl['Team'] == 'New York Jets')]['Week_Def_Pass_Yards']

20928    31.0
20929    31.0
20930    31.0
Name: Week_Def_Pass_Yards, dtype: float64

In [170]:
nfl.iloc[20930, 62] = 31.0

Fill in Chargers Missing def_pass_yards with info from https://www.footballdb.com/games/boxscore.html?gid=2003122812

In [172]:
nfl.update(nfl[(nfl['Team'] == 'Los Angeles Chargers') & (nfl['Year'] == 2003) & (nfl['Week'] == '17')]['Week_Def_Pass_Yards'].replace(np.nan, 0))

Fill in Raiders Missing pass_yards with info from https://www.footballdb.com/games/boxscore.html?gid=2003122812

In [173]:
nfl.update(nfl[(nfl['Team'] == 'Las Vegas Raiders') & (nfl['Year'] == 2003) & (nfl['Week'] == '17')]['Week_Pass_Yards'].replace(np.nan, 0))

Fill in Bengals Missing pass_yards with info from https://www.footballdb.com/games/boxscore.html?gid=2010010311

In [185]:
nfl.update(nfl[(nfl['Team'] == 'Cincinnati Bengals') & (nfl['Year'] == 2009) & (nfl['Week'] == '17')]['Week_Pass_Yards'].replace(np.nan, 0))

### Mistake when scraping, as I didn't account for null values from website representing zero turnovers.  Fixing that here

In [176]:
nfl['Week_Def_Turnovers'].replace(np.nan,0, inplace=True)

In [177]:
nfl['Week_Off_Turnovers'].replace(np.nan,0, inplace=True)

In [211]:
missing2 = pd.concat([nfl.isnull().sum(), 100 * nfl.isnull().mean()], axis=1)
missing2.columns=['count', '%']
missing2.sort_values(by='%', ascending=False)
missing2[missing['%'] >0]

Unnamed: 0,count,%
Def_Coor,1076,1.403564
Draft_School,0,0.0
Off_Coor,3967,5.174663
Off_Scheme,1134,1.47922
Week_Def_Pass_Yards,0,0.0
Week_Def_Turnovers,0,0.0
Week_Off_Turnovers,0,0.0
Week_Pass_Yards,0,0.0


Filling in missing schools

In [209]:
nfl.update(nfl[nfl['Drafted_Player'] == 'Michael Bowie '].replace(np.nan, 'NE State'))
nfl.update(nfl[nfl['Drafted_Player'] == 'Jordan Mailata'].replace(np.nan, 'None'))
nfl.update(nfl[nfl['Drafted_Player'] == 'Greg Zuerlein'].replace(np.nan, 'Missouri Western St'))
nfl.update(nfl[nfl['Drafted_Player'] == 'Moritz Boehringer'].replace(np.nan, 'None'))
nfl.update(nfl[nfl['Drafted_Player'] == 'Bill Bentley'].replace(np.nan, 'Louisiana'))
nfl.update(nfl[nfl['Drafted_Player'] == 'Matt McCants'].replace(np.nan, 'Ala-Birmingham'))
nfl.update(nfl[nfl['Drafted_Player'] == 'Akiem Hicks'].replace(np.nan, 'Regina'))
nfl.update(nfl[nfl['Drafted_Player'] == 'Christo Bilukidi'].replace(np.nan, 'Georgia St.'))
nfl.update(nfl[nfl['Drafted_Player'] == 'Ladarius Green'].replace(np.nan, 'Louisiana'))
nfl.update(nfl[nfl['Drafted_Player'] == 'Michael Jasper'].replace(np.nan, 'Bethel'))
nfl.update(nfl[nfl['Drafted_Player'] == 'Ryan Jensen'].replace(np.nan, 'Colorado State-Pueblo'))

In [246]:
nfl[nfl['Coach'] == 'Lovie Smith']['Year'].value_counts()

2008.0    192
2007.0    144
2009.0    144
2006.0    133
2004.0    128
2015.0    112
2005.0    102
2012.0     96
2014.0     96
2010.0     90
2011.0     80
Name: Year, dtype: int64

In [229]:
nfl[(nfl['Coach'] == 'Dick Jauron') & (nfl['Off_Scheme'] == ' West Coast')]['Year'].value_counts()

2008.0    160
2005.0     96
Name: Year, dtype: int64

In [236]:
nfl[(nfl['Coach'] == 'Dick Jauron')]['Year'].value_counts()

2003.0    192
2008.0    160
2006.0    144
2002.0    144
2007.0    112
2005.0     96
Name: Year, dtype: int64

In [238]:
nfl[nfl['Coach'] == 'Dick Jauron']['Off_Scheme'].replace(np.NaN, method='ffill')
nfl[nfl['Coach'] == 'Dave Mcginnis']['Off_Scheme'].replace(np.NaN, 'Unknown')

44879     West Coast
44880     West Coast
44881     West Coast
44882     West Coast
44883     West Coast
            ...     
71921     West Coast
71922     West Coast
71923     West Coast
71924     West Coast
71925     West Coast
Name: Off_Scheme, Length: 848, dtype: object

In [216]:
nfl[nfl['Off_Scheme'].isnull()]['Coach'].value_counts()

Dick Jauron       336
Dave McGinnis     240
Lovie Smith       128
Bill Belichick    126
Jim Haslett       112
Dave Wannstedt     96
Dom Capers         96
Name: Coach, dtype: int64

In [215]:
nfl[nfl['Off_Scheme'].isnull()]['Team'].value_counts()

Chicago Bears           464
Arizona Cardinals       240
New England Patriots    126
New Orleans Saints      112
Houston Texans           96
Miami Dolphins           96
Name: Team, dtype: int64

In [137]:
nfl['Team'].value_counts()

New England Patriots        2825
Seattle Seahawks            2796
Green Bay Packers           2772
San Francisco 49ers         2758
Baltimore Ravens            2675
Cincinnati Bengals          2655
Los Angeles Rams            2610
Indianapolis Colts          2567
Tennessee Titans            2565
Philadelphia Eagles         2556
Pittsburgh Steelers         2504
Minnesota Vikings           2481
Dallas Cowboys              2428
Las Vegas Raiders           2415
Denver Broncos              2399
Houston Texans              2393
Cleveland Browns            2392
Detroit Lions               2311
Kansas City Chiefs          2285
Tampa Bay Buccaneers        2270
Buffalo Bills               2270
Jacksonville Jaguars        2243
Carolina Panthers           2238
Washington Football Team    2228
Atlanta Falcons             2208
Miami Dolphins              2193
New York Giants             2191
Chicago Bears               2188
Arizona Cardinals           2161
Los Angeles Chargers        2129
New York J

In [251]:
nfl['Coach'].nunique()

125

In [264]:
nfl.Coach.value_counts().sort_index()

Aaron Kromer       80
Adam Gase         472
Andy Reid        2628
Anthony Lynn      462
Art Shell         112
                 ... 
Tony Sparano      425
Vance Joseph      288
Vic Fangio         96
Wade Phillips     784
Zac Taylor        160
Name: Coach, Length: 125, dtype: int64

In [248]:
nfl['Off_Scheme'].nunique()

10

In [270]:
nfl['Def_Align'].value_counts()

 4-3     49984
 3-4     26486
 3/4/      192
Name: Def_Align, dtype: int64

In [267]:
nfl['Def_Coor'].nunique()

121

In [279]:
nfl[nfl['Off_Scheme'] == ' ???']

Unnamed: 0,Coach,Def_Align,Def_Coor,Draft_Position,Draft_School,Draft_Team_Selection,Drafted_Player,MoV,Off_Coor,Off_Scheme,...,Week_First_Downs,Week_Off_Turnovers,Week_Opp,Week_Pass_Yards,Week_Points_Allowed,Week_Points_Scored,Week_Rush_Yards,Week_Total__Off_Yards,Wins,Year
60138,Gregg Williams,4-3,Jerry Gray,OL,Texas,0.0,Mike Williams,-1.1,Kevin Gilbride,???,...,21.0,0.0,Miami Dolphins,270.0,21.0,38.0,161.0,431.0,8.0,2002.0
60139,Gregg Williams,4-3,Jerry Gray,WR,LSU,1.0,Josh Reed,-1.1,Kevin Gilbride,???,...,21.0,0.0,Miami Dolphins,270.0,21.0,38.0,161.0,431.0,8.0,2002.0
60140,Gregg Williams,4-3,Jerry Gray,DE,BYU,2.0,Ryan Denney,-1.1,Kevin Gilbride,???,...,21.0,0.0,Miami Dolphins,270.0,21.0,38.0,161.0,431.0,8.0,2002.0
60141,Gregg Williams,4-3,Jerry Gray,DB,Stanford,3.0,Coy Wire,-1.1,Kevin Gilbride,???,...,21.0,0.0,Miami Dolphins,270.0,21.0,38.0,161.0,431.0,8.0,2002.0
60142,Gregg Williams,4-3,Jerry Gray,DT,Colorado,4.0,Justin Bannan,-1.1,Kevin Gilbride,???,...,21.0,0.0,Miami Dolphins,270.0,21.0,38.0,161.0,431.0,8.0,2002.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60604,Gregg Williams,4-3,Jerry Gray,DB,UNLV,5.0,Kevin Thomas,-1.1,Kevin Gilbride,???,...,26.0,3.0,New York Jets,242.0,37.0,31.0,142.0,384.0,8.0,2002.0
60605,Gregg Williams,4-3,Jerry Gray,OL,Auburn,6.0,Mike Pucillo,-1.1,Kevin Gilbride,???,...,26.0,3.0,New York Jets,242.0,37.0,31.0,142.0,384.0,8.0,2002.0
60606,Gregg Williams,4-3,Jerry Gray,WR,Fresno St.,7.0,Rodney Wright,-1.1,Kevin Gilbride,???,...,26.0,3.0,New York Jets,242.0,37.0,31.0,142.0,384.0,8.0,2002.0
60607,Gregg Williams,4-3,Jerry Gray,RB,Virginia Tech,8.0,Jarrett Ferguson,-1.1,Kevin Gilbride,???,...,26.0,3.0,New York Jets,242.0,37.0,31.0,142.0,384.0,8.0,2002.0


In [269]:
nfl.groupby(['Off_Scheme', 'Def_Align']).size()

Off_Scheme                      Def_Align
 ???                             4-3           160
 Air Coryell                     3-4          5983
                                 4-3         14167
 Air Coryell/West Coast hybrid   3-4            96
                                 4-3           176
 Balanced                        4-3          1168
 Erhardt-Perkins                 3-4          6988
                                 4-3          7461
 Smashmouth                      3-4           889
                                 4-3           232
 Spread                          3-4          1457
                                 4-3           987
 Spread/Erhardt-Perkins          4-3           248
 Vertical                        4-3           112
 West Coast                      3-4         10851
                                 3/4/          192
                                 4-3         24361
dtype: int64

In [274]:
nfl[nfl['Def_Align'] == ' 3/4/']

Unnamed: 0,Coach,Def_Align,Def_Coor,Draft_Position,Draft_School,Draft_Team_Selection,Drafted_Player,MoV,Off_Coor,Off_Scheme,...,Week_First_Downs,Week_Off_Turnovers,Week_Opp,Week_Pass_Yards,Week_Points_Allowed,Week_Points_Scored,Week_Rush_Yards,Week_Total__Off_Yards,Wins,Year
36688,Mike Shanahan,3/4/,Jim Haslett,DE,Purdue,0.0,Ryan Kerrigan,-4.9,Kyle Shanahan,West Coast,...,21.0,1.0,Philadelphia Eagles,247.0,34.0,10.0,130.0,377.0,5.0,2011.0
36689,Mike Shanahan,3/4/,Jim Haslett,DT,Clemson,1.0,Jarvis Jenkins,-4.9,Kyle Shanahan,West Coast,...,21.0,1.0,Philadelphia Eagles,247.0,34.0,10.0,130.0,377.0,5.0,2011.0
36690,Mike Shanahan,3/4/,Jim Haslett,WR,Miami (FL),2.0,Leonard Hankerson,-4.9,Kyle Shanahan,West Coast,...,21.0,1.0,Philadelphia Eagles,247.0,34.0,10.0,130.0,377.0,5.0,2011.0
36691,Mike Shanahan,3/4/,Jim Haslett,RB,Nebraska,3.0,Roy Helu,-4.9,Kyle Shanahan,West Coast,...,21.0,1.0,Philadelphia Eagles,247.0,34.0,10.0,130.0,377.0,5.0,2011.0
36692,Mike Shanahan,3/4/,Jim Haslett,DB,Nebraska,4.0,DeJon Gomes,-4.9,Kyle Shanahan,West Coast,...,21.0,1.0,Philadelphia Eagles,247.0,34.0,10.0,130.0,377.0,5.0,2011.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37438,Mike Shanahan,3/4/,Jim Haslett,WR,SMU,7.0,Aldrick Robinson,-4.9,Kyle Shanahan,West Coast,...,21.0,1.0,New York Giants,258.0,14.0,28.0,74.0,332.0,5.0,2011.0
37439,Mike Shanahan,3/4/,Jim Haslett,DB,Boise St.,8.0,Brandyn Thompson,-4.9,Kyle Shanahan,West Coast,...,21.0,1.0,New York Giants,258.0,14.0,28.0,74.0,332.0,5.0,2011.0
37440,Mike Shanahan,3/4/,Jim Haslett,OL,Florida,9.0,Maurice Hurt,-4.9,Kyle Shanahan,West Coast,...,21.0,1.0,New York Giants,258.0,14.0,28.0,74.0,332.0,5.0,2011.0
37441,Mike Shanahan,3/4/,Jim Haslett,DE,Florida St.,10.0,Markus White,-4.9,Kyle Shanahan,West Coast,...,21.0,1.0,New York Giants,258.0,14.0,28.0,74.0,332.0,5.0,2011.0


In [280]:
nfl.groupby(['Starting_Player', 'Starting_Position']).size()

Starting_Player  Starting_Position
Aaron Rouse      DB                   144
Aaron Williams   DB                   112
Abram Elam       DB                   320
Adrian Amos      DB                   144
Aeneas Williams  DB                   313
                                     ... 
Will Hill        DB                   112
William Moore    DB                   112
Xavier Woods     DB                   290
Yeremiah Bell    DB                   272
Zack Bronson     DB                   112
Length: 257, dtype: int64

In [281]:
nfl['Starting_Player'].nunique()

257