Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Way to Systematically Fix Issues with "Drive" Column #36

Closed
jacole3 opened this issue Jul 10, 2020 · 4 comments
Closed

New Way to Systematically Fix Issues with "Drive" Column #36

jacole3 opened this issue Jul 10, 2020 · 4 comments
Labels
NFL data issue Error in the underlying data, not a bug

Comments

@jacole3
Copy link

jacole3 commented Jul 10, 2020

EDIT: The following data is updated to work with NFLFastR version 2.1.1. Descriptions have been added with the "#" sign to help increase reproducibility.

If you've done any work that involves the "drive" column in NFLFastR, you've probably noticed that it's an absolute shitshow. If you haven't seen this, here are a few examples (and there are many more that I didn't include here):

View(pbp %>% filter(game_id == "2012_06_CIN_CLE", qtr == 2))

2012 CIN CLE Example

View(pbp %>% filter(game_id == "2011_13_DET_NO", qtr %in% 2:3))

2011 DET NO Example

View(pbp %>% filter(game_id == "2019_16_NYG_WAS", qtr %in% 1:2))

2019 NYG WAS Example

What makes this especially concerning is that we simply can't discount these plays as mistakes coming from the late 90s/early 2000s era, when NFL data tracking wasn't as accurate. For whatever reason, the NFLFastR repository has struggled to competently track drive numbers, and that showed no sign of slowing down in 2019.

Going through every single game by hand and manually correcting the drive numbers is an unreasonable task, but fortunately, I've put together a new system that systematically corrects the drive numbers by creating a new variable. See the code below:

seasons <- 1999:2019 # choose which seasons to include
pbp_Original <- purrr::map_df(seasons, function(x) {
readRDS(
url(
glue::glue("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_{x}.rds")
)
)
})

roster <- readRDS(url("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/roster-data/roster.rds"))

pbp_Original <- pbp_Original %>%
mutate_at(vars(home_team, away_team, posteam, defteam), funs(case_when(
. %in% "JAX" ~ "JAC",
. %in% "STL" ~ "LA",
. %in% "SD" ~ "LAC",
. %in% "OAK" ~ "LV",
TRUE ~ .
)))

pbp_Original$play_id <- 1:nrow(pbp_Original)
pbp <- pbp_Original

# FIND INCORRECTLY LABELED TIMEOUT ERRORS HERE
# View(pbp_Original %>% filter(timeout == 1 & play_type != "no_play" & !str_detect(desc, "(Challenge)") & !str_detect(desc, "(challenge)")))

ShouldBeLabeledNoPlay <- pbp_Original %>% filter(timeout == 1 & play_type != "no_play"
& !str_detect(desc, "(Challenge)") & !str_detect(desc, "(challenge)") & str_detect(desc, "(Timeout)"))
ShouldBeLabeledNoPlayVec <- ShouldBeLabeledNoPlay$play_id
WronglyLabeledTimeout <- pbp_Original %>% filter(timeout == 1 & play_type != "no_play"
& !str_detect(desc, "(Challenge)") & !str_detect(desc, "(challenge)") & !str_detect(desc, "(Timeout)"))
WronglyLabeledTimeoutVec <- WronglyLabeledTimeout$play_id

pbp[ShouldBeLabeledNoPlayVec, "play_type"] <- "no_play"
pbp[WronglyLabeledTimeoutVec, "timeout"] <- 0

Useless_NoSnap <- pbp %>%
filter( (penalty == 0 &
(timeout > 0 & play_type == "no_play")) |
( is.na(play_type) &
(is.na(play_type_nfl) | (play_type_nfl != "PENALTY" & play_type_nfl != "FREE_KICK" & play_type_nfl != "KICK_OFF"))))
Useless_NoSnap_Vec <- Useless_NoSnap$play_id
# Not necessary, but I like to do this to get rid of the timeouts, two-minute warnings, etc.

pbp <- pbp %>% filter(!play_id %in% Useless_NoSnap_Vec)
pbp$play_id <- 1:nrow(pbp)

# Some data entry errors here:
WronglyLabeledAsPunt <- pbp %>% filter(punt_attempt == 1 &
!str_detect(desc, "(Punt)") & !str_detect(desc, "(punt)"))
WronglyLabeledAsPuntVec <- WronglyLabeledAsPunt$play_id
pbp[WronglyLabeledAsPuntVec, "punt_attempt"] <- 0
# Fix the play_types manually
# View(pbp %>% filter(game_id == "2000_06_WAS_PHI", qtr == 2)), 2 of final 3 plays of half
pbp[56335:56336, "play_type"] <- "run"
# View(pbp %>% filter(game_id == "2002_03_DAL_PHI", qtr == 4)), 9:30 to go
pbp[132523, "play_type"] <- "run"

WronglyLabeledAsFumLost <- pbp %>% filter(fumble_lost == 1 & !str_detect(desc, "(Fumble)") &
!str_detect(desc, "(fumble)") & !str_detect(desc, "(FUMBLE)") &
!str_detect(desc, "(muff)") & !str_detect(desc, "(Muff)") &
!str_detect(desc, "(MUFF)") & !str_detect(desc, "(recover)") &
!str_detect(desc, "(Recover)") & !str_detect(desc, "(RECOVER)"))
WronglyLabeledAsFumLostVec <- WronglyLabeledAsFumLost$play_id
pbp[WronglyLabeledAsFumLostVec, "fumble_lost"] <- 0

# A few egregious data entry errors here:
WronglyLabeledAsFGAtt <- pbp %>% filter(field_goal_attempt == 1 & !str_detect(desc, "(goal)") &
!str_detect(desc, "(Goal)") & !str_detect(desc, "(GOAL)"))
WronglyLabeledAsFGAttVec <- WronglyLabeledAsFGAtt$play_id
pbp[WronglyLabeledAsFGAttVec, "field_goal_attempt"] <- 0
pbp[WronglyLabeledAsFGAttVec, "field_goal_result"] <- NA
# Fix play_type, posteam manually
# View(pbp %>% filter(game_id == "2000_11_OAK_DEN", qtr == 3)), 7-9 minutes left
pbp[67642, "play_type"] <- "pass"
pbp[67643, "play_type"] <- "run"
pbp[67645, "play_type"] <- "kickoff"
pbp[67642:67644, "posteam"] <- "DEN"
pbp[67642:67644, "defteam"] <- "LV"
pbp[67642:67644, "posteam_type"] <- "home"

WronglyLabeledAsKickoff <- pbp %>% filter(kickoff_attempt == 1 & !str_detect(desc, "(kick)") &
!str_detect(desc, "(Kick)") & !str_detect(desc, "(KICK)"))
WronglyLabeledAsKickoffVec <- WronglyLabeledAsKickoff$play_id
pbp[WronglyLabeledAsKickoffVec, "kickoff_attempt"] <- 0

# More miscellaneous errors that aren't directly relevant to the "drive" column
# View(pbp %>% filter(game_id == "1999_08_CHI_WAS", qtr == 3)), 1st of 2 straight onsides
pbp[16832, "posteam"] <- "WAS"
pbp[16832, "defteam"] <- "CHI"
pbp[16832, "posteam_type"] <- "home"

# View(pbp %>% filter(game_id == "2012_04_CLE_BAL", qtr == 4)), 2nd to last play of game
pbp[569272, "fourth_down_failed"] <- 0

# View(pbp %>% filter(game_id == "2018_01_ATL_PHI", qtr == 1)), 10:55 left
pbp[827488, "fourth_down_failed"] <- 1

# A weird situation (offsides, but first down given on 3rd and 15?)
# View(pbp %>% filter(game_id == "1999_10_SF_NO", qtr == 1)), 9:37 left
pbp[22971, "first_down_penalty"] <- 1
pbp[22971, "first_down"] <- 1

# Another odd situation (defensive delay of game leads to first)
# View(pbp %>% filter(game_id == "2000_17_ARI_WAS", qtr == 1)), 5:05 left
pbp[80455, "first_down_penalty"] <- 1
pbp[80455, "first_down"] <- 1

# View(pbp %>% filter(game_id == "2006_10_WAS_PHI", qtr == 2)), 7:25 left
pbp[324039, "first_down_penalty"] <- 1
pbp[324039, "first_down"] <- 1

# View(pbp %>% filter(game_id == "2005_04_MIN_ATL", qtr == 3)), 12:02 left
pbp[266050, "first_down_penalty"] <- 1
pbp[266050, "first_down"] <- 1

# View(pbp %>% filter(game_id == "2018_01_ATL_PHI"))
# 8:23 Q2 DPI
pbp[827526, "first_down_penalty"] <- 1
pbp[827526, "first_down"] <- 1
# 2nd to last play of 1st half
pbp[827560, "first_down_penalty"] <- 1
pbp[827560, "first_down"] <- 1
# 1st play of Q4
pbp[827604, "first_down_penalty"] <- 1
pbp[827604, "first_down"] <- 1
# 2nd to last play of game
pbp[827648, "first_down_penalty"] <- 1
pbp[827648, "first_down"] <- 1

# View(pbp %>% filter(game_id == "2000_03_NYG_CHI", qtr == 4)), 3rd and 2 at 9:24
pbp[49030, "first_down"] <- 1
pbp[49030, "first_down_rush"] <- 1

# View(pbp %>% filter(game_id == "2000_16_OAK_SEA", qtr == 3)), 1:12 left
pbp[79728, "down"] <- 3
pbp[79728, "ydstogo"] <- 5

# View(pbp %>% filter(game_id == "2007_03_STL_TB", qtr == 3)), 8:41 3rd down
pbp[351346, "first_down"] <- 1
pbp[351346, "first_down_rush"] <- 1
pbp[351346, "yards_gained"] <- 6

# View(pbp %>% filter(game_id == "2009_12_TB_ATL", qtr == 2)), 2nd and 14, 2:00
pbp[457795, "first_down_penalty"] <- 1
pbp[457795, "first_down"] <- 1

# These data entry corrections also aren't specifically relevant to the "drive" column, but will help you out no matter what you're doing:
# View(pbp %>% filter(game_id == "1999_14_MIN_KC", qtr == 2)), ending with 0:29 TD
pbp[32289, "play_id"] <- 32286
pbp[32286, "play_id"] <- 32287
pbp[32287, "play_id"] <- 32288
pbp[32288, "play_id"] <- 32289
# View(pbp %>% filter(game_id == "2000_01_TEN_BUF", qtr == 1)), 1:35 offsides
pbp[45056, "play_id"] <- 45057
pbp[45057, "play_id"] <- 45056
# View(pbp %>% filter(game_id == "2000_02_CAR_SF", qtr == 2)), 3rd/4th down at 1:23
pbp[45432, "play_id"] <- 45433
pbp[45433, "play_id"] <- 45432
# View(pbp %>% filter(game_id == "2000_02_WAS_DET", qtr == 1)), 9:37 FG
pbp[47513, "play_id"] <- 47514
pbp[47514, "play_id"] <- 47513
# View(pbp %>% filter(game_id == "2000_03_ATL_CAR", qtr == 2)), ending in 1:42 run
pbp[47719, "play_id"] <- 47716
pbp[47716, "play_id"] <- 47717
pbp[47717, "play_id"] <- 47718
pbp[47718, "play_id"] <- 47719
# View(pbp %>% filter(game_id == "2000_04_WAS_NYG", qtr == 2)), 2 plays after 6:27 KO
pbp[51837, "play_id"] <- 51838
pbp[51838, "play_id"] <- 51837
# View(pbp %>% filter(game_id == "2000_06_PIT_NYJ", qtr == 4)), 2:20 INT
pbp[55773, "play_id"] <- 55774
pbp[55774, "play_id"] <- 55773
# View(pbp %>% filter(game_id == "2000_07_BAL_WAS", qtr == 2)), 6:27 1st/2nd down
pbp[56650, "play_id"] <- 56651
pbp[56651, "play_id"] <- 56650
# View(pbp %>% filter(game_id == "2000_07_JAX_TEN", qtr == 4)), 2:15 2nd/3rd down
pbp[57714, "play_id"] <- 57715
pbp[57715, "play_id"] <- 57714
# View(pbp %>% filter(game_id == "2000_09_TEN_WAS", qtr == 4)) 7:45 2nd/3rd down
pbp[63293, "play_id"] <- 63294
pbp[63294, "play_id"] <- 63293
# View(pbp %>% filter(game_id == "2000_11_NYJ_IND", qtr == 2)), 7:07 1st/2nd down
pbp[67430, "play_id"] <- 67431
pbp[67431, "play_id"] <- 67430
# View(pbp %>% filter(game_id == "2000_11_CHI_BUF", qtr == 4)), 4:30 3rd/4th down
pbp[66393, "play_id"] <- 66394
pbp[66394, "play_id"] <- 66393
# View(pbp %>% filter(game_id == "2000_13_PHI_WAS", qtr == 2)), starts at 13:50
pbp[72755, "play_id"] <- 72757
pbp[72756, "play_id"] <- 72755
pbp[72757, "play_id"] <- 72756
# View(pbp %>% filter(game_id == "2000_14_SEA_ATL", qtr == 3)), 12:25 1st/2nd down
pbp[75098, "play_id"] <- 75099
pbp[75099, "play_id"] <- 75098
# View(pbp %>% filter(game_id == "2000_17_CIN_PHI", qtr == 4)), 0:49 is messed up play
pbp[81215, "play_id"] <- 81202
pbp[81202, "play_id"] <- 81203
pbp[81203, "play_id"] <- 81204
pbp[81204, "play_id"] <- 81205
pbp[81205, "play_id"] <- 81206
pbp[81206, "play_id"] <- 81207
pbp[81207, "play_id"] <- 81208
pbp[81208, "play_id"] <- 81209
pbp[81209, "play_id"] <- 81210
pbp[81210, "play_id"] <- 81211
pbp[81211, "play_id"] <- 81212
pbp[81212, "play_id"] <- 81213
pbp[81213, "play_id"] <- 81214
pbp[81214, "play_id"] <- 81215
# View(pbp %>% filter(game_id == "2004_09_PHI_PIT", qtr == 2)), 6:14 kickoff/FG
pbp[234545, "play_id"] <- 234546
pbp[234546, "play_id"] <- 234545
# View(pbp %>% filter(game_id == "2004_09_CLE_BAL", qtr == 2)), 3:09 kickoff/FG
pbp[233119, "play_id"] <- 233120
pbp[233120, "play_id"] <- 233119
# View(pbp %>% filter(game_id == "2001_03_KC_WAS", qtr == 4)), 8:36 1st/2nd down
pbp[90538, "play_id"] <- 90539
pbp[90539, "play_id"] <- 90538
# View(pbp %>% filter(game_id == "2001_05_PIT_KC", qtr == 2)), 0:32 safety/kickoff
pbp[95431, "play_id"] <- 95432
pbp[95432, "play_id"] <- 95431
# View(pbp %>% filter(game_id == "2004_02_NE_ARI", qtr == 4)), 11:20 2nd/3rd down
pbp[218079, "play_id"] <- 218080
pbp[218080, "play_id"] <- 218079
# View(pbp %>% filter(game_id == "2004_07_DAL_GB", qtr == 4)), 3:27 false start
pbp[228742, "play_id"] <- 228743
pbp[228743, "play_id"] <- 228742
# View(pbp %>% filter(game_id == "2004_15_WAS_SF", qtr == 1)), final play of qtr should be 4th
pbp[250272, "play_id"] <- 250240
pbp[250240, "play_id"] <- 250241
pbp[250241, "play_id"] <- 250242
pbp[250242, "play_id"] <- 250243
pbp[250243, "play_id"] <- 250244
pbp[250244, "play_id"] <- 250245
pbp[250245, "play_id"] <- 250246
pbp[250246, "play_id"] <- 250247
pbp[250247, "play_id"] <- 250248
pbp[250248, "play_id"] <- 250249
pbp[250249, "play_id"] <- 250250
pbp[250250, "play_id"] <- 250251
pbp[250251, "play_id"] <- 250252
pbp[250252, "play_id"] <- 250253
pbp[250253, "play_id"] <- 250254
pbp[250254, "play_id"] <- 250255
pbp[250255, "play_id"] <- 250256
pbp[250256, "play_id"] <- 250257
pbp[250257, "play_id"] <- 250258
pbp[250258, "play_id"] <- 250259
pbp[250259, "play_id"] <- 250260
pbp[250260, "play_id"] <- 250261
pbp[250261, "play_id"] <- 250262
pbp[250262, "play_id"] <- 250263
pbp[250263, "play_id"] <- 250264
pbp[250264, "play_id"] <- 250265
pbp[250265, "play_id"] <- 250266
pbp[250266, "play_id"] <- 250267
pbp[250267, "play_id"] <- 250268
pbp[250268, "play_id"] <- 250269
pbp[250269, "play_id"] <- 250270
pbp[250270, "play_id"] <- 250271
pbp[250271, "play_id"] <- 250272
# View(pbp %>% filter(game_id == "2006_08_DAL_CAR", qtr == 4)), 8:04 offsides
pbp[317680, "play_id"] <- 317681
pbp[317681, "play_id"] <- 317680
# View(pbp %>% filter(game_id == "2007_02_HOU_CAR", qtr == 2)), 0:45 3rd/4th down
pbp[347288, "play_id"] <- 347289
pbp[347289, "play_id"] <- 347288
# View(pbp %>% filter(game_id == "2007_06_CIN_KC", qtr == 4)), 3:29 false start
pbp[356327, "play_id"] <- 356328
pbp[356328, "play_id"] <- 356327
# View(pbp %>% filter(game_id == "2009_11_MIA_CAR", qtr == 3)), 2:41 punt
pbp[453996, "play_id"] <- 453997
pbp[453997, "play_id"] <- 453996
# View(pbp %>% filter(game_id == "2010_09_NYG_SEA", qtr == 3)), 1:15 clock reset
pbp[493447, "play_id"] <- 493448
pbp[493448, "play_id"] <- 493447
# View(pbp %>% filter(game_id == "2010_19_SEA_CHI", qtr == 4)), 1:24 XP/kickoff
pbp[515900, "play_id"] <- 515901
pbp[515901, "play_id"] <- 515900
# View(pbp %>% filter(game_id == "2011_13_ATL_HOU", qtr == 4)), 2:35 delay of game
pbp[545633, "play_id"] <- 545634
pbp[545634, "play_id"] <- 545633
# View(pbp %>% filter(game_id == "2011_13_DET_NO", qtr %in% 2:3)), late 2nd and early 3rd kickoffs
pbp[546537, "play_id"] <- 546538
pbp[546538, "play_id"] <- 546537
pbp[546541, "play_id"] <- 546542
pbp[546542, "play_id"] <- 546541
# View(pbp %>% filter(game_id == "2011_14_NE_WAS", qtr == 2)), final 2 plays of half
pbp[549497, "play_id"] <- 549498
pbp[549498, "play_id"] <- 549497
# View(pbp %>% filter(game_id == "2011_15_NYJ_PHI", qtr == 2)), spike at 0:17
pbp[552650, "play_id"] <- 552651
pbp[552651, "play_id"] <- 552650
# View(pbp %>% filter(game_id == "2012_01_SD_OAK", qtr == 3)), 9:16 punt
pbp[562417, "play_id"] <- 562418
pbp[562418, "play_id"] <- 562417
# View(pbp %>% filter(game_id == "2012_01_CIN_BAL", qtr == 2)), 12:25 kickoff/PAT
pbp[561049, "play_id"] <- 561047
pbp[561047, "play_id"] <- 561048
pbp[561048, "play_id"] <- 561049
# View(pbp %>% filter(game_id == "2012_01_WAS_NO", qtr == 2)), 11:33 penalty after PAT
pbp[563041, "play_id"] <- 563039
pbp[563039, "play_id"] <- 563040
pbp[563040, "play_id"] <- 563041
# View(pbp %>% filter(game_id == "2012_01_BUF_NYJ", qtr == 4)), 5:19 2nd down
pbp[560849, "play_id"] <- 560847
pbp[560847, "play_id"] <- 560848
pbp[560848, "play_id"] <- 560849
# View(pbp %>% filter(game_id == "2012_03_BUF_CLE", qtr == 1)), 12:06 false start
pbp[565997, "play_id"] <- 565998
pbp[565998, "play_id"] <- 565997
# View(pbp %>% filter(game_id == "2013_03_NYG_CAR", qtr == 1)), 12:58 2nd and 4th down
pbp[612436, "play_id"] <- 612438
pbp[612438, "play_id"] <- 612436
# View(pbp %>% filter(game_id == "2013_13_JAX_CLE", qtr == 3)), 9:44 PAT/kickoff shitshow
pbp[636014, "play_id"] <- 636011
pbp[636011, "play_id"] <- 636013
pbp[636013, "play_id"] <- 636014
# View(pbp %>% filter(game_id == "2014_06_DEN_NYJ", qtr == 2)), 0:27 PAT shitshow
pbp[663431, "play_id"] <- 663428
pbp[663428, "play_id"] <- 663429
pbp[663429, "play_id"] <- 663430
pbp[663430, "play_id"] <- 663431
# View(pbp %>% filter(game_id == "2014_06_SD_OAK", qtr == 4)), 1:56 PAT shitshow
pbp[664868, "play_id"] <- 664866
pbp[664866, "play_id"] <- 664867
pbp[664867, "play_id"] <- 664868
# View(pbp %>% filter(game_id == "2015_09_PHI_DAL", qtr == 2)), 9:01 PAT shitshow
pbp[716024, "play_id"] <- 716022
pbp[716022, "play_id"] <- 716023
pbp[716023, "play_id"] <- 716024
# View(pbp %>% filter(game_id == "2015_14_DAL_GB", qtr == 2)), 2:40 PAT penalty
pbp[727126, "play_id"] <- 727124
pbp[727124, "play_id"] <- 727125
pbp[727125, "play_id"] <- 727126
# View(pbp %>% filter(game_id == "2016_10_MIN_WAS", qtr == 1)), 9:04 PAT penalty
pbp[763344, "play_id"] <- 763341
pbp[763341, "play_id"] <- 763342
pbp[763342, "play_id"] <- 763343
pbp[763343, "play_id"] <- 763344
# View(pbp %>% filter(game_id == "2016_19_PIT_KC", qtr == 3)), 2:09 3rd down penalty
pbp[782817, "play_id"] <- 782818
pbp[782818, "play_id"] <- 782817
# View(pbp %>% filter(game_id == "2017_02_NYJ_OAK", qtr == 2)), 9:08 PAT/kickoff
pbp[787812, "play_id"] <- 787813
pbp[787813, "play_id"] <- 787812
# View(pbp %>% filter(game_id == "2017_03_NYG_PHI", qtr == 1)), 9:31 3rd/4th down
pbp[790409, "play_id"] <- 790410
pbp[790410, "play_id"] <- 790409
# View(pbp %>% filter(game_id == "2017_06_PIT_KC", qtr == 1)), 7:49 kickoff after safety
pbp[798111, "play_id"] <- 798112
pbp[798112, "play_id"] <- 798111
# View(pbp %>% filter(game_id == "2017_15_HOU_JAX", qtr == 2)), PAT shitshow on final play
pbp[819144, "play_id"] <- 819142
pbp[819142, "play_id"] <- 819143
pbp[819143, "play_id"] <- 819144
# View(pbp %>% filter(game_id == "2018_02_OAK_DEN", qtr == 3)), 9:06 Lynch run
pbp[832446, "play_id"] <- 832448
pbp[832447, "play_id"] <- 832446
pbp[832448, "play_id"] <- 832447
# View(pbp %>% filter(game_id == "2018_10_JAX_IND", qtr == 2)), 5:16 3rd down penalty
pbp[850671, "play_id"] <- 850672
pbp[850672, "play_id"] <- 850671
# View(pbp %>% filter(game_id == "2018_14_CAR_CLE", qtr == 4)), 13:05 PAT shitshow
pbp[859494, "play_id"] <- 859492
pbp[859492, "play_id"] <- 859493
pbp[859493, "play_id"] <- 859494
# View(pbp %>% filter(game_id == "2019_01_TEN_CLE", qtr == 4)), 3:02 Chubb run
pbp[873655, "play_id"] <- 873657
pbp[873656, "play_id"] <- 873655
pbp[873657, "play_id"] <- 873656
# View(pbp %>% filter(game_id == "2019_08_CIN_LA", qtr == 4)), 0:30 spike
pbp[889154, "play_id"] <- 889155
pbp[889155, "play_id"] <- 889154
# View(pbp %>% filter(game_id == "2019_13_NE_HOU", qtr == 3)), last play of quarter
pbp[901489, "play_id"] <- 901487
pbp[901487, "play_id"] <- 901488
pbp[901488, "play_id"] <- 901489
# View(pbp %>% filter(game_id == "2019_16_NYG_WAS", qtr %in% 1:2))
# 1st 2 plays of game, and 12:21 PAT, and 7:37 PAT/kickoff
pbp[910264, "play_id"] <- 910265
pbp[910265, "play_id"] <- 910264
pbp[910272, "play_id"] <- 910270
pbp[910270, "play_id"] <- 910271
pbp[910271, "play_id"] <- 910272
pbp[910281, "play_id"] <- 910282
pbp[910282, "play_id"] <- 910281
pbp <- arrange(pbp, play_id)

pbp <- pbp %>% group_by(game_id, game_half) %>%
mutate(LastPlayOfHalf = as.numeric(play_id == max(play_id))) %>%
ungroup()

pbp <- pbp %>% group_by(game_id, game_half) %>%
mutate(FirstPlayOfHalf = as.numeric(play_id == min(play_id))) %>%
ungroup()

pbp <- pbp %>% group_by(game_id) %>%
mutate(FirstPlayOfGame = as.numeric(play_id == min(play_id))) %>%
ungroup()

# Now, we get into what's most relevant:
pbp <- pbp %>% mutate(LastPlayOfDrive = ifelse(
(is.na(td_team) &
(punt_attempt %in% 1 | play_type %in% "punt" | (extra_point_attempt %in% 1 & (play_type != "no_play" | (str_detect(pbp$desc, "(enforced between downs)")))) |
play_type %in% "extra_point" | two_point_attempt %in% 1 | (str_detect(pbp$desc, "(two-point)") & penalty %in% 0) |
(str_detect(pbp$desc, "(TWO-POINT)") & penalty %in% 0) | (str_detect(pbp$desc, "(Two-point)") & penalty %in% 0) |
(str_detect(pbp$desc, "(Two-Point)") & penalty %in% 0) | field_goal_result %in% "missed" | field_goal_result %in% "blocked" |
field_goal_result %in% "made" | field_goal_attempt %in% 1 | interception %in% 1 | safety %in% 1 |
fumble_lost %in% 1 | fourth_down_failed %in% 1 | LastPlayOfHalf %in% 1)), 1, 0)
)

# This is just a dummy variable that will be fixed later
pbp <- pbp %>%
mutate(NewDrive = 0)

pbp <- pbp %>% group_by(game_id) %>%
mutate(NewDrive = ifelse(FirstPlayOfGame == 1, 1,
ifelse(lag(LastPlayOfDrive == 1), 1 + lag(NewDrive), lag(NewDrive)))) %>%
ungroup()

pbp <- pbp %>% group_by(game_id) %>%
mutate(NewDrive = cumsum(NewDrive)) %>%
ungroup()

pbp <- pbp %>% select(1:19, 327, 20:326)
# This is just to move our new variable, "NewDrive", closer to the original "drive" column, so that we can directly compare the two

# Once we are confident that the "drive" column sucks, we can simply remove it:
pbp <- pbp %>% select(-"drive")

I've gone through more than 15 games, and I've not seen any errors with the "NewDrive" column. For frame of reference, here are the three examples we looked at earlier:

2012 CIN CLE Example 2

2011 DET NO Example 2

2019 NYG WAS Example 2

I hope this is helpful to all other members of the NFLFastR community who are tired of seeing some of the heinous data entry errors in the repository. If anyone does spot mistakes with the "NewDrive" column, feel free to either reply here, or contact me directly at @ColeJacobson32 on Twitter or jacole@sas.upenn.edu.

@guga31bb
Copy link
Member

Thanks! Will take a closer look at this eventually, but note that we can't do anything to manipulate play_id within nflfastR because that would break the ability to link to external datasets (e.g., ESPN play by play).

@jacole3
Copy link
Author

jacole3 commented Jul 11, 2020

@guga31bb Got it, that makes sense. Ultimately, the play_id changes aren't important to the fixing of the "drive" variable - what really matters, and is most easily reproducible, is the code with the "LastPlayOfDrive" column and subsequent mutations. I just thought I'd include the play_id changes here since I had already put in the work on those as well. (But since Version 2.1.1 happened to drop the same day, I'll have to change the numbers so they become accurate again.)

@guga31bb guga31bb added the NFL data issue Error in the underlying data, not a bug label Jul 15, 2020
@guga31bb
Copy link
Member

The next version of the package will have fixed_drive (replacement for drive that is actually in order) and fixed_drive_result, along with many more plays being in the right order. Thank you for this!

@jacole3
Copy link
Author

jacole3 commented Aug 17, 2020

@guga31bb Awesome, that's great to hear. Will be useful to be able to compare fixed_drive to my "NewDrive" to see what discrepancies exist, and why. Looking forward to seeing that next update.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NFL data issue Error in the underlying data, not a bug
Projects
None yet
Development

No branches or pull requests

2 participants