# Research Project

Michael Choi

The purpose of this project is to determine the distribution of NFL scoring plays. Can we model the true distribution using only historical final-score data, or should we model the individual scoring plays themselves? Can we create a model to predict the final score based on the spread, total, and aggression ratings of the two teams?

In [1]:
import numpy as np
import pandas as pd
import altair as alt
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsClassifier

## Final Digits of Each Quarter Analysis

We will start by modeling the historical data of scores. For simplicity, we will look at the final digits of scores for each quarter.

In [2]:
qtr = pd.read_csv('/Users/MC/Downloads/quarters.csv')
qtr

Unnamed: 0,season,week,game_date,team,place,qtr1,qtr2,qtr3,qtr4,ot,spread_line,total_line,half,full
0,2013,1,2013-09-05,BAL,away,7,10,0,10,,7.5,48.5,17,27
1,2013,1,2013-09-05,DEN,home,0,14,21,14,,7.5,48.5,14,49
2,2013,1,2013-09-08,ARI,away,0,10,14,0,,3.5,42.5,10,24
3,2013,1,2013-09-08,LA,home,0,10,3,14,,3.5,42.5,10,27
4,2013,1,2013-09-08,ATL,away,10,0,7,0,,3.5,54.5,10,17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5745,2023,12,2023-11-23,SEA,home,3,0,10,0,,-7.0,43.5,3,13
5746,2023,12,2023-11-23,WAS,away,0,10,0,0,,13.0,48.0,10,10
5747,2023,12,2023-11-23,DAL,home,7,13,0,25,,13.0,48.0,20,45
5748,2023,12,2023-11-24,MIA,away,3,14,3,14,,-9.5,40.0,17,34


We will build one data frame per quarter that counts how often each last digit (0 to 9) occurs in that quarter's scores, and store these data frames in a list called q. We will also make a list of column names, strings, to index into.

In [3]:
strings = ['qtr1','qtr2','qtr3','qtr4','ot']
q = [0]*(len(strings)+1)

for i in range(1,len(strings)+1):
    q[i] = pd.DataFrame((qtr[strings[i-1]]%10).value_counts().sort_index()).reset_index()
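
The last-digit count can also be sketched on a small example. The snippet below is a minimal illustration, not the notebook's method: the data frame `toy` and the helper `last_digit_counts` are hypothetical names, and the `reindex` step is an added assumption so that digits that never occur still appear with a count of zero.

```python
import pandas as pd

# Hypothetical toy data standing in for two columns of quarters.csv.
toy = pd.DataFrame({"qtr1": [7, 0, 10, 3, 14, 17], "qtr2": [10, 14, 0, 21, 3, 13]})

def last_digit_counts(points: pd.Series) -> pd.Series:
    """Count how often each final digit 0-9 occurs, keeping digits that never occur."""
    digits = points.dropna().astype(int) % 10
    return digits.value_counts().reindex(range(10), fill_value=0)

qtr1_counts = last_digit_counts(toy["qtr1"])
print(qtr1_counts.to_dict())
```

Dropping missing values first matters for a column such as ot, which is empty for most games.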

For each quarter, we will make a bar chart with the digits 0 to 9 on the x-axis. The y-axis records how often a score for that quarter ends in the digit shown on the x-axis.

In [4]:
quarters = [q[1],q[2],q[3],q[4],q[5]]
quarter = [0]*(len(quarters)+1)
colors = ['darkred','darkblue','darkorange','purple','darkgreen']

for i in range(1,len(quarters)+1):
    quarter[i] = alt.Chart(quarters[i-1]).mark_bar().encode(
        x = alt.X(f'{strings[i-1]}:N',axis=alt.Axis(labelAngle=0), title = 'Squares'),
        y = alt.Y('count', title = 'Frequency'),
        tooltip= strings[i-1],
        color = alt.value(colors[i-1])
    ).properties(
        title = f'Quarter {i}',
        height=300,
        width=600,
    )
quarter[5] = quarter[5].properties(title = 'Overtime')

In [5]:
alt.vconcat(quarter[1], quarter[2], quarter[3], quarter[4], quarter[5])

## Predict Final Outcome using Aggression Scores

We are also interested in predicting the final score of a game based on how aggressive a team is. We expect more aggressive teams to attempt to score in situations where less aggressive teams would choose to end their possession for fear of allowing the other team to score easily. In particular, we will count each team's fourth-down attempts and two-point conversion attempts. In addition, more aggressive teams will likely pass the ball for high yardage. These three parameters will be combined into an aggression score for each of the 32 NFL teams. Using feature engineering, we can create a parameter to help predict the odds of winning.

In [6]:
all_plays_2023 = pd.read_csv('/Users/MC/Downloads/play_by_play_2023.csv')
all_plays_2023['fourth_down_attempts'] = all_plays_2023['fourth_down_converted'] + all_plays_2023['fourth_down_failed']
all_plays_2023['two_point_conv_result2']= all_plays_2023['two_point_conv_result'].map(lambda x: True if x == 'success' else False)
all_plays_2023


Unnamed: 0,play_id,game_id,old_game_id,home_team,away_team,season_type,week,posteam,posteam_type,defteam,...,qb_epa,xyac_epa,xyac_mean_yardage,xyac_median_yardage,xyac_success,xyac_fd,xpass,pass_oe,fourth_down_attempts,two_point_conv_result2
0,1,2023_01_ARI_WAS,2023091007,WAS,ARI,REG,1,,,,...,0.000000,,,,,,,,,False
1,39,2023_01_ARI_WAS,2023091007,WAS,ARI,REG,1,WAS,home,ARI,...,0.000000,,,,,,,,0.0,False
2,55,2023_01_ARI_WAS,2023091007,WAS,ARI,REG,1,WAS,home,ARI,...,-0.336103,,,,,,0.515058,-51.505846,0.0,False
3,77,2023_01_ARI_WAS,2023091007,WAS,ARI,REG,1,WAS,home,ARI,...,0.703308,0.340652,3.328642,1.0,0.996628,0.583928,0.661106,33.889407,0.0,False
4,102,2023_01_ARI_WAS,2023091007,WAS,ARI,REG,1,WAS,home,ARI,...,0.469799,,,,,,0.196065,-19.606467,0.0,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48420,4253,2023_19_PIT_BUF,2024011501,BUF,PIT,POST,19,PIT,away,BUF,...,0.097917,0.642515,5.621778,4.0,0.988080,0.249705,0.962465,3.753471,0.0,False
48421,4278,2023_19_PIT_BUF,2024011501,BUF,PIT,POST,19,PIT,away,BUF,...,-0.858869,,,,,,0.968867,3.113294,0.0,False
48422,4322,2023_19_PIT_BUF,2024011501,BUF,PIT,POST,19,PIT,away,BUF,...,-0.316456,,,,,,0.940734,5.926609,0.0,False
48423,4349,2023_19_PIT_BUF,2024011501,BUF,PIT,POST,19,PIT,away,BUF,...,-1.543516,,,,,,0.962551,3.744876,0.0,False


We will use the first 15 weeks of the NFL season as a training set. We are most interested in how often a team goes for a fourth-down conversion with more than 5 minutes to go when it is NOT between the 30- and 40-yard lines. If a team is within 30 yards of the end zone, going for a conversion is more aggressive than the safe option of kicking a field goal. Similarly, if a team is beyond the 40-yard line, we want to see how often it goes for a conversion on fourth down rather than the safe option of punting the ball. We note that the 30-to-40-yard "dead man's zone" is chosen arbitrarily. A future extension of this project is to determine what bounds make the most sense for this zone.

In [7]:
plays_2023 = all_plays_2023[['season_type','game_id','week','posteam','side_of_field','game_seconds_remaining','home_team','away_team','yards_gained','yardline_100','down','play_type','pass_length','air_yards','fourth_down_attempts','fourth_down_converted','two_point_attempt','two_point_conv_result2','spread_line','total_line']].copy()
train = plays_2023[(plays_2023["week"] <= 15) & (plays_2023["game_seconds_remaining"] >= 300) & ((plays_2023["yardline_100"] < 30) | (plays_2023["yardline_100"] > 40))].copy()
# two_point_conv_result2 is already boolean (set in cell 6); re-mapping it against 'success' would set every value to False.
train = train.fillna(0)
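
The training filter can be sketched as a reusable mask. This is a minimal illustration under stated assumptions: `toy_plays` and `in_training_window` are hypothetical names, and the default bounds simply mirror the values used in the cell above.

```python
import pandas as pd

# Hypothetical plays; column names follow the notebook's play-by-play data.
toy_plays = pd.DataFrame({
    "week": [1, 1, 16, 3, 3],
    "game_seconds_remaining": [1800, 120, 1800, 900, 900],
    "yardline_100": [25, 25, 25, 35, 55],
})

def in_training_window(df, max_week=15, min_seconds=300, zone=(30, 40)):
    """Boolean mask: early weeks, at least five minutes left, outside the excluded zone."""
    outside_zone = (df["yardline_100"] < zone[0]) | (df["yardline_100"] > zone[1])
    enough_time = df["game_seconds_remaining"] >= min_seconds
    return (df["week"] <= max_week) & enough_time & outside_zone

mask = in_training_window(toy_plays)
kept = toy_plays[mask]
print(kept)
```

Exposing the zone bounds as a parameter would make it easier to explore other choices for the excluded yardage range.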

We will create an index called go to measure how often a team goes for a fourth-down conversion when it is NOT between the 30- and 40-yard lines with more than 5 minutes to go. In a close game with less than 5 minutes left, a team is likely to attempt a conversion regardless of how aggressive it truly is. The area between 30 and 40 yards is similar to a no man's land, where it makes sense to go for it since the opposing team would still have a lot of yardage to cover after a turnover and your team is likely too far out for a field goal attempt. The 30-to-40-yard range is not definitive, and a possible follow-up to this research could model what the true no man's land should be.

In [8]:
four_down = train[train['down'] == 4].copy()
four_down['play'] = four_down['play_type'].map(lambda x: 1 if x == 'pass' or x == 'run' else 0)
plays = four_down.groupby('posteam', as_index = False).sum(numeric_only = True)
fourths = four_down.groupby('posteam',as_index = False).count()
fourths['go'] = plays['play']
fourths['prop'] = plays['play']/fourths['play']
go = fourths.set_index('posteam')['prop']
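
The go rate is the share of a team's fourth-down plays that are a run or a pass. A compact way to express the same idea, shown here as a sketch on hypothetical data (`toy_fourth` and `went_for_it` are illustrative names, not from the notebook), is to take the mean of a boolean column per team:

```python
import pandas as pd

# Hypothetical fourth-down plays for two teams.
toy_fourth = pd.DataFrame({
    "posteam": ["ARI", "ARI", "ARI", "ARI", "DET", "DET", "DET"],
    "play_type": ["punt", "pass", "field_goal", "run", "run", "run", "punt"],
})

# A run or pass on fourth down counts as "going for it".
toy_fourth["went_for_it"] = toy_fourth["play_type"].isin(["pass", "run"])

# The mean of a boolean column is the proportion of True values.
go_rate = toy_fourth.groupby("posteam")["went_for_it"].mean()
print(go_rate)
```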

In [9]:
teams = pd.DataFrame(train[train['air_yards'] > 20].groupby('posteam').count()['pass_length'])
teams['two_point_attempt'] = train[train['two_point_attempt'] == 1].groupby('posteam').count()['two_point_attempt']
teams['fourth_down_attempt'] = train[train['fourth_down_attempts'] == 1].groupby('posteam').count()['fourth_down_attempts']


Below is a data frame of each team and its aggression score, normalized so that the most aggressive team has a score of 1. The most aggressive team is Arizona, with 10 two-point attempts, 18 fourth-down attempts, and a go rate of about 20 percent on the fourth downs considered. In this table, Jacksonville is rated about 78% as aggressive as Arizona.

In [10]:
teams = teams.rename_axis('team').reset_index()
teams = teams.fillna(0)
teams = teams.rename(columns={"pass_length": "deep_passes"})
teams['go'] = teams['team'].map(go)
teams['score'] = teams['deep_passes'] + 5*teams['two_point_attempt'] + 2*teams['fourth_down_attempt'] + 200*teams['go']
teams['score'] = teams['score']/max(teams['score'])
teams.sort_values(['score'],ascending=False)

Unnamed: 0,team,deep_passes,two_point_attempt,fourth_down_attempt,go,score
0,ARI,29,10.0,18,0.204545,1.0
14,JAX,43,3.0,14,0.181818,0.78484
25,PHI,43,2.0,14,0.202899,0.779812
7,CLE,46,5.0,13,0.117117,0.772395
13,IND,31,3.0,18,0.189474,0.769004
11,GB,38,4.0,14,0.150538,0.744713
10,DET,23,3.0,16,0.222222,0.734046
8,DAL,40,4.0,11,0.15942,0.730452
31,WAS,43,4.0,11,0.135802,0.719397
12,HOU,33,5.0,13,0.132653,0.708943
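
As a worked check of the score formula, the sketch below recomputes the raw scores for Arizona and Jacksonville from the table values and divides by Arizona's, which the table shows is the maximum. The helper name `raw_aggression` is hypothetical; the weights are the ones used in the cell above.

```python
def raw_aggression(deep_passes, two_point_attempts, fourth_down_attempts, go_rate):
    """Unnormalized aggression score with the notebook's weights (1, 5, 2, 200)."""
    return deep_passes + 5 * two_point_attempts + 2 * fourth_down_attempts + 200 * go_rate

# Inputs copied from the table above.
ari = raw_aggression(29, 10, 18, 0.204545)
jax = raw_aggression(43, 3, 14, 0.181818)

# Arizona has the maximum raw score, so dividing by it normalizes the scale.
relative = jax / ari
print(round(ari, 3), round(jax, 3), round(relative, 3))
```

The ratio comes out near 0.785, matching Jacksonville's score in the table. Note that the go term contributes roughly 36 to 41 points here, comparable to the deep-pass counts, so the weight of 200 has a large influence on the ranking.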


We will make a pandas Series, agg, that maps each team to its aggression score.

In [11]:
agg = teams.set_index('team')['score']

We will make a games data frame with each game in the NFL season and the spread, total, and aggression scores of the home and away team.

In [12]:
games = plays_2023[plays_2023['season_type'] == 'REG'].groupby('game_id').first()
games_tries = plays_2023[plays_2023['season_type'] == 'REG'].groupby('game_id').sum(numeric_only = True)
games = games[['spread_line','total_line','away_team','home_team']]
games['two_point_attempts'] =games_tries['two_point_attempt']
games['two_point_try'] = games['two_point_attempts'].map(lambda x: 1 if x > .00001 else 0)
games = games.rename_axis('game').reset_index()
games['away_team_score'] = games['away_team'].map(agg)
games['home_team_score'] = games['home_team'].map(agg)
games = games[['game','spread_line','total_line','away_team_score','home_team_score','two_point_attempts','two_point_try']]
games

Unnamed: 0,game,spread_line,total_line,away_team_score,home_team_score,two_point_attempts,two_point_try
0,2023_01_ARI_WAS,7.0,38.0,1.000000,0.719397,0.0,0
1,2023_01_BUF_NYJ,-2.5,44.5,0.620677,0.490472,0.0,0
2,2023_01_CAR_ATL,3.5,40.5,0.695155,0.569172,0.0,0
3,2023_01_CIN_CLE,-1.0,46.5,0.398137,0.772395,1.0,1
4,2023_01_DAL_NYG,-3.5,44.5,0.730452,0.630321,0.0,0
...,...,...,...,...,...,...,...
267,2023_18_NYJ_NE,2.5,28.5,0.490472,0.392452,1.0,1
268,2023_18_PHI_NYG,-4.5,43.0,0.779812,0.630321,0.0,0
269,2023_18_PIT_BAL,-3.0,34.0,0.487600,0.413703,0.0,0
270,2023_18_SEA_ARI,-2.5,48.0,0.447393,1.000000,1.0,1


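We will then fit a decision tree regression model, with the spread, total, and the two aggression scores as features, to predict whether a game includes a two-point attempt. Because the target is 0 or 1, each prediction is the share of training games in a leaf that had an attempt, which we read as an estimated probability.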

In [13]:
X = games[['spread_line','total_line','away_team_score','home_team_score']]
y = games['two_point_try']
clf = DecisionTreeRegressor(max_depth=5)
clf.fit(X,y)
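
The sketch below illustrates, on synthetic data, why a regression tree fitted to a 0/1 target produces values that can be read as probabilities: each leaf predicts the mean of the training targets that fall into it. The arrays `X_toy` and `y_toy`, the seed, and `max_depth=3` are all illustrative assumptions, not values from this project.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic features and a 0/1 target with roughly a 30% positive rate.
rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(200, 2))
y_toy = (rng.uniform(size=200) < 0.3).astype(int)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_toy, y_toy)
preds = tree.predict(X_toy)

# Leaf values are means of 0/1 targets, so they always lie in [0, 1].
print(preds.min(), preds.max())
```

A `DecisionTreeClassifier` with `predict_proba` would be an alternative way to get class probabilities. Since the tree is evaluated on the games it was trained on, a held-out set would give a more honest picture of its accuracy.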

These are the spread, total, and aggression scores for the 2024 Super Bowl matchup.

In [14]:
data = pd.DataFrame({"spread_line":[2],"total_line":[47.5],'away_team_score':agg['SF'],'home_team_score':agg['KC']})
data

Unnamed: 0,spread_line,total_line,away_team_score,home_team_score
0,2,47.5,0.313234,0.421437


In [15]:
p_super_try = clf.predict(data)[0]
p_super_try

0.30344827586206896

In [16]:
(1-p_super_try)/p_super_try*100

229.5454545454546

In [17]:
p_sucess = len(all_plays_2023[all_plays_2023['two_point_conv_result2'] ==  True])/len(all_plays_2023[all_plays_2023['two_point_attempt'] == 1])
p_sucess

0.5538461538461539

In [18]:
p_super_sucess = p_super_try * p_sucess
p_super_sucess

0.16806366047745358

In [19]:
((1-p_super_sucess)/p_super_sucess)*100

495.0126262626263
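
The two ratio cells above compute (1 - p) / p * 100. One reading, which is an assumption here rather than something the notebook states, is that this converts a probability into fair American (moneyline) odds for an event with p below 0.5. A minimal sketch of that conversion, with the hypothetical helper name `american_odds`:

```python
def american_odds(p: float) -> float:
    """Convert a probability into fair American odds (no bookmaker margin)."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    if p < 0.5:
        # Underdog: profit on a 100-unit stake, quoted as a positive number.
        return 100 * (1 - p) / p
    # Favorite: stake needed to win 100 units, quoted as a negative number.
    return -100 * p / (1 - p)

# Probability of a two-point attempt predicted in cell 15.
print(american_odds(0.30344827586206896))
```

For p at or above 0.5 the notebook's ratio would drop below 100, which is why the sketch switches to the negative-odds form in that case.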

We will now model the scoring distributions of games. In the cell below, all scoring play columns are added to a dataset of 2023 plays. We record separate columns for home and away field goals worth 3 points, and touchdowns worth 6 points with a separate column for the extra point and two-point conversion.

In [20]:
plays_2023 = all_plays_2023[['season_type','game_id','week','qtr','total_away_score','total_home_score','quarter_seconds_remaining','sp','posteam','defteam','side_of_field','home_team','away_team','td_team','touchdown','extra_point_result','two_point_conv_result','field_goal_result','total_home_score','total_away_score','desc','play_type','safety']].copy()
all_games = plays_2023[(plays_2023['sp'] == 1) | (plays_2023['extra_point_result'] == 'failed') | (plays_2023['two_point_conv_result'] == 'failure')].copy()
all_games['home_team_score'] = (all_games['td_team'] == all_games['home_team']).map(lambda x:  6 if x == True else 0)
all_games['away_team_score'] = (all_games['td_team'] == all_games['away_team']).map(lambda x:  6 if x == True else 0)
all_games['xp'] = all_games['extra_point_result'].map(lambda x:  1 if x == 'good' else 0).shift(-1).fillna(0)
all_games['2pt'] = all_games['two_point_conv_result'].map(lambda x:  2 if x == 'success' else 0).shift(-1).fillna(0)

homesafety = ((all_games['defteam'] == all_games['home_team']) & (all_games['safety'] == 1)).map(lambda x: 2 if x == True else 0)
awaysafety = ((all_games['defteam'] == all_games['away_team']) & (all_games['safety'] == 1)).map(lambda x: 2 if x == True else 0)
homekick = ((all_games['posteam'] == all_games['home_team']) & (all_games['field_goal_result'] == 'made')).map(lambda x:  3 if x == True else 0)
awaykick = ((all_games['posteam'] == all_games['away_team']) & (all_games['field_goal_result'] == 'made')).map(lambda x:  3 if x == True else 0)

two_h = ((homesafety == 2)).map(lambda x:  2 if x == True else 0)
three_h = ((homekick == 3)).map(lambda x:  3 if x == True else 0)
six_h = ((all_games['home_team_score'] == 6) & (all_games['xp'] == 0) & (all_games['2pt'] == 0)).map(lambda x:  6 if x == True else 0)
seven_h = ((all_games['home_team_score'] == 6) & (all_games['xp'] == 1)).map(lambda x:  7 if x == True else 0)
eight_h = ((all_games['home_team_score'] == 6) & (all_games['2pt'] == 2)).map(lambda x:  8 if x == True else 0)
all_games['home_score'] = two_h + three_h + six_h + seven_h + eight_h

two_a = ((awaysafety == 2)).map(lambda x:  2 if x == True else 0)
three_a = ((awaykick == 3)).map(lambda x:  3 if x == True else 0)
six_a = ((all_games['away_team_score'] == 6) & (all_games['xp'] == 0) & (all_games['2pt'] == 0)).map(lambda x:  6 if x == True else 0)
seven_a = ((all_games['away_team_score'] == 6) & (all_games['xp'] == 1)).map(lambda x:  7 if x == True else 0)
eight_a = ((all_games['away_team_score'] == 6) & (all_games['2pt'] == 2)).map(lambda x:  8 if x == True else 0)
all_games['away_score'] = two_a +three_a + six_a + seven_a + eight_a
all_games

Unnamed: 0,season_type,game_id,week,qtr,total_away_score,total_home_score,quarter_seconds_remaining,sp,posteam,defteam,...,total_away_score.1,desc,play_type,safety,home_team_score,away_team_score,xp,2pt,home_score,away_score
25,REG,2023_01_ARI_WAS,1,1,0,6,262.0,1,WAS,ARI,...,0,(4:22) (Shotgun) 14-S.Howell pass short left t...,pass,0.0,6,0,1.0,0.0,7,0
26,REG,2023_01_ARI_WAS,1,1,0,7,255.0,1,WAS,ARI,...,0,"6-J.Slye extra point is GOOD, Center-54-C.Chee...",extra_point,0.0,0,0,0.0,0.0,0,0
36,REG,2023_01_ARI_WAS,1,1,3,7,56.0,1,ARI,WAS,...,3,"(:56) 5-M.Prater 28 yard field goal is GOOD, C...",field_goal,0.0,0,0,0.0,0.0,0,3
46,REG,2023_01_ARI_WAS,1,2,6,7,811.0,1,ARI,WAS,...,6,"(13:31) 5-M.Prater 54 yard field goal is GOOD,...",field_goal,0.0,0,0,0.0,0.0,0,3
79,REG,2023_01_ARI_WAS,1,2,12,7,62.0,1,WAS,ARI,...,12,(1:02) (Shotgun) 14-S.Howell sacked at WAS 12 ...,pass,0.0,0,6,1.0,0.0,0,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48370,POST,2023_19_PIT_BUF,19,3,10,24,96.0,1,BUF,PIT,...,10,"(1:36) 2-T.Bass 45 yard field goal is GOOD, Ce...",field_goal,0.0,0,0,0.0,0.0,3,0
48384,POST,2023_19_PIT_BUF,19,4,16,24,637.0,1,PIT,BUF,...,16,(10:37) (Shotgun) 2-M.Rudolph pass short left ...,pass,0.0,0,6,1.0,0.0,0,7
48385,POST,2023_19_PIT_BUF,19,4,17,24,632.0,1,PIT,BUF,...,17,"9-C.Boswell extra point is GOOD, Center-46-C.K...",extra_point,0.0,0,0,0.0,0.0,0,0
48394,POST,2023_19_PIT_BUF,19,4,17,30,397.0,1,BUF,PIT,...,17,(6:37) (Shotgun) 17-J.Allen pass short middle ...,pass,0.0,6,0,1.0,0.0,7,0


We will create a dictionary called scores. This dictionary records how many scoring plays occurred in total for home teams and for away teams across the games in the data.

In [21]:
len_h = (all_games['home_score'] > 0).map(lambda x: 1 if x == True else 0).sum()
len_a = (all_games['away_score'] > 0).map(lambda x: 1 if x == True else 0).sum()
scores = {'home_score': [len_h], 'away_score': [len_a]}

Next, we will create a dictionary of the probability that a single scoring play results in 2,3,6,7, or 8 points.

In [22]:
plays = {place: {i: (all_games[place] == i).sum() / scores[place][0] for i in [2, 3, 6, 7, 8]} for place in ['home_score', 'away_score']}
plays

{'home_score': {2: 0.010752688172043012,
  3: 0.39619520264681557,
  6: 0.04218362282878412,
  7: 0.5153019023986766,
  8: 0.03556658395368073},
 'away_score': {2: 0.004651162790697674,
  3: 0.41674418604651164,
  6: 0.0586046511627907,
  7: 0.4930232558139535,
  8: 0.026976744186046512}}
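
As a quick sanity check on these probabilities, the sketch below verifies that the home-team values sum to 1 and computes the expected points per scoring play. The dictionary is copied from the output above; the variable names are illustrative.

```python
# Home-team probabilities copied from the output above.
home = {
    2: 0.010752688172043012,
    3: 0.39619520264681557,
    6: 0.04218362282878412,
    7: 0.5153019023986766,
    8: 0.03556658395368073,
}

# A valid probability distribution sums to 1.
total = sum(home.values())

# Expected value: each point value weighted by its probability.
expected = sum(points * p for points, p in home.items())
print(round(total, 6), round(expected, 2))
```

The expected value is about 5.35 points per home scoring play, which sits between a field goal and a touchdown with an extra point, as the mix of outcomes suggests.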

We are also interested in how many scoring plays will occur for the home and away team. We will create a dictionary giving the empirical probability of each number of scoring plays per game.

In [23]:
num_scores = {place:plays_2023[plays_2023['posteam'] == plays_2023[place]].groupby('game_id',as_index = False).sum()['sp'].value_counts() for place in ['home_team','away_team']}
num_scores['home_score'] = num_scores['home_team'] / sum(num_scores['home_team'] )
num_scores['away_score'] = num_scores['away_team'] / sum(num_scores['away_team'] )
pd.DataFrame(num_scores)

Unnamed: 0_level_0,home_team,away_team,home_score,away_score
sp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,3,3.0,0.010791,0.010791
1,4,10.0,0.014388,0.035971
2,9,13.0,0.032374,0.046763
3,21,34.0,0.07554,0.122302
4,15,27.0,0.053957,0.097122
5,41,43.0,0.147482,0.154676
6,43,43.0,0.154676,0.154676
7,36,31.0,0.129496,0.111511
8,35,21.0,0.125899,0.07554
9,31,24.0,0.111511,0.086331


We will create a function num_sp(), which uses the num_scores distributions above to sample how many scoring plays a team has in each of n games.

In [24]:
rng = np.random.default_rng()
def num_sp(n,team):
    sp = rng.choice(a=list(num_scores[f"{team}_score"].index), size=n, p=list(num_scores[f"{team}_score"].values))
    return sp

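The function games() samples the point values of n individual scoring plays from the distribution above. Note that this rebinds the name games, which previously referred to the games data frame.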

In [25]:
rng = np.random.default_rng()
def games(n,team):
    i = 1
    if (team == 'home'): i = 0
    scores_in_game = rng.choice(a=list(plays[f"{team}_score"].keys()), size=n, p=pd.DataFrame(plays.values()).iloc[i])
    return scores_in_game

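The function scores() combines the previous functions and dictionaries to return simulated final scores for n games, each as a (lower, higher) pair. Defining it rebinds the name scores, which previously referred to the dictionary of scoring-play counts.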

In [26]:
def scores(n):
    home_score_plays = num_sp(n,'home')
    away_score_plays = num_sp(n,'away')
    scores_list = []
    for i in range(len(home_score_plays)):
        h = int(sum(games(home_score_plays[i],'home')))
        a = int(sum(games(away_score_plays[i],'away')))
        if h < a:
            scores_list.append((h,a))
        else:
            scores_list.append((a,h))
    return scores_list
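
The whole simulation can be sketched in a self-contained form: sample how many scoring plays a team has, sample a point value for each play, and sum. The two distributions below are hypothetical placeholders, not the fitted values from this notebook, and the function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical distributions, not the fitted values from the notebook.
plays_per_game = {3: 0.2, 4: 0.3, 5: 0.3, 6: 0.2}
points_per_play = {2: 0.01, 3: 0.40, 6: 0.05, 7: 0.51, 8: 0.03}

def simulate_team_score(rng):
    """Draw a number of scoring plays, then a point value for each, and sum them."""
    n_plays = rng.choice(list(plays_per_game), p=list(plays_per_game.values()))
    points = rng.choice(list(points_per_play), size=n_plays,
                        p=list(points_per_play.values()))
    return int(points.sum())

def simulate_games(n, rng):
    """Return n simulated final scores as (lower, higher) tuples."""
    return [tuple(sorted((simulate_team_score(rng), simulate_team_score(rng))))
            for _ in range(n)]

sim = simulate_games(5, rng)
print(sim)
```

This sketch treats the two teams and the individual plays as independent, which is a simplification: real games have a fixed clock, so one team's scoring affects the other's opportunities.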

For instance, these are simulated scores for 1000 games:

In [27]:
scores(1000)

[(13, 21),
 (7, 30),
 (26, 26),
 (54, 57),
 (30, 36),
 (34, 50),
 (10, 37),
 (24, 38),
 (6, 48),
 (38, 60),
 (24, 51),
 (36, 70),
 (26, 28),
 (7, 20),
 (13, 14),
 (30, 31),
 (20, 47),
 (17, 27),
 (13, 28),
 (33, 71),
 (17, 45),
 (9, 27),
 (27, 54),
 (17, 33),
 (41, 45),
 (23, 30),
 (13, 33),
 (22, 65),
 (35, 50),
 (57, 61),
 (35, 50),
 (21, 40),
 (26, 57),
 (25, 38),
 (40, 55),
 (7, 17),
 (7, 36),
 (13, 30),
 (29, 30),
 (17, 37),
 (15, 43),
 (6, 46),
 (22, 31),
 (21, 39),
 (31, 34),
 (3, 20),
 (24, 34),
 (13, 16),
 (32, 41),
 (35, 45),
 (55, 60),
 (21, 36),
 (34, 40),
 (23, 33),
 (24, 37),
 (24, 38),
 (44, 46),
 (22, 35),
 (39, 45),
 (9, 17),
 (36, 47),
 (10, 60),
 (16, 26),
 (23, 37),
 (16, 23),
 (34, 39),
 (14, 51),
 (15, 46),
 (26, 54),
 (28, 35),
 (31, 40),
 (37, 53),
 (36, 97),
 (24, 33),
 (30, 31),
 (41, 45),
 (13, 17),
 (23, 99),
 (30, 46),
 (27, 60),
 (9, 26),
 (0, 53),
 (14, 61),
 (34, 55),
 (20, 27),
 (33, 47),
 (16, 27),
 (13, 41),
 (29, 30),
 (45, 51),
 (26, 44),
 (17, 59),

## Scorigami Analysis

A Scorigami is a final score that has never occurred in the history of the NFL. We will make a sp data frame to determine the number of scoring plays in each game of the NFL season. 

In [28]:
sp = all_games.copy()
sp['home_score'] = sp['home_score'].map(lambda x: 1 if x>0 else 0)
sp['away_score'] = sp['away_score'].map(lambda x: 1 if x>0 else 0)
sp = sp.groupby('game_id',as_index = False).sum()
sp

Unnamed: 0,game_id,season_type,week,qtr,total_away_score,total_home_score,quarter_seconds_remaining,sp,posteam,defteam,...,total_away_score.1,desc,play_type,safety,home_team_score,away_team_score,xp,2pt,home_score,away_score
0,2023_01_ARI_WAS,REGREGREGREGREGREGREGREGREGREGREG,11,26,111,114,3684.0,11,WASWASARIARIWASARIWASARIWASWASWAS,ARIARIWASWASARIWASARIWASARIARIARI,...,111,(4:22) (Shotgun) 14-S.Howell pass short left t...,passextra_pointfield_goalfield_goalpassextra_p...,0.0,12,6,3.0,0.0,4,4
1,2023_01_BUF_NYJ,REGREGREGREGREGREGREGREGREGREGREG,11,33,122,97,3236.0,11,BUFNYJBUFBUFBUFNYJNYJNYJNYJBUFBUF,NYJBUFNYJNYJNYJBUFBUFBUFBUFNYJNYJ,...,122,"(3:09) 2-T.Bass 40 yard field goal is GOOD, Ce...",field_goalfield_goalpassextra_pointfield_goalf...,0.0,12,6,2.0,0.0,5,4
2,2023_01_CAR_ATL,REGREGREGREGREGREGREGREGREGREG,10,30,73,124,5652.0,10,ATLATLCARCARCARATLATLATLATLATL,CARCARATLATLATLCARCARCARCARCAR,...,73,(15:00) (Shotgun) 9-D.Ridder pass short right ...,passextra_pointpassextra_pointfield_goalfield_...,0.0,18,6,4.0,0.0,4,2
3,2023_01_CIN_CLE,REGREGREGREGREGREGREGREG,8,24,15,107,3549.0,8,CLECLECLECINCLECLECLECLE,CINCINCINCLECINCINCINCIN,...,15,(14:19) 7-D.Hopkins 42 yard field goal is GOOD...,field_goalrunextra_pointfield_goalfield_goalfi...,0.0,12,0,1.0,2.0,5,1
4,2023_01_DAL_NYG,REGREGREGREGREGREGREGREGREGREGREGREG,12,25,266,0,5697.0,11,NYGDALDALNYGDALDALDALDALDALDALDALDAL,DALNYGNYGDALNYGNYGNYGNYGNYGNYGNYGNYG,...,266,(8:14) 9-G.Gano 45 yard field goal is BLOCKED ...,field_goalextra_pointfield_goalpassextra_point...,0.0,0,30,4.0,0.0,0,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
273,2023_19_GB_DAL,POSTPOSTPOSTPOSTPOSTPOSTPOSTPOSTPOSTPOSTPOSTPO...,437,63,717,247,7514.0,21,GBGBGBGBGBGBDALGBDALDALDALGBGBDALDALGBGBGBGBDA...,DALDALDALDALDALDALGBDALGBGBGBDALDALGBGBDALDALD...,...,717,(7:12) (Shotgun) 33-A.Jones right guard for 3 ...,runextra_pointrunextra_pointpassextra_pointpas...,0.0,24,42,7.0,4.0,5,7
274,2023_19_LA_DET,POSTPOSTPOSTPOSTPOSTPOSTPOSTPOSTPOSTPOSTPOSTPO...,266,27,141,230,5749.0,14,DETDETLADETDETLALADETDETLALADETLALA,LALADETLALADETDETLALADETDETLADETDET,...,141,(9:33) 5-D.Montgomery up the middle for 1 yard...,runextra_pointfield_goalrunextra_pointpassextr...,0.0,18,12,5.0,0.0,4,5
275,2023_19_MIA_KC,POSTPOSTPOSTPOSTPOSTPOSTPOSTPOSTPOSTPOST,190,23,48,142,6213.0,10,KCKCKCMIAMIAKCKCKCKCKC,MIAMIAMIAKCKCMIAMIAMIAMIAMIA,...,48,(11:10) (Shotgun) 15-P.Mahomes pass short righ...,passextra_pointfield_goalpassextra_pointfield_...,0.0,12,6,3.0,0.0,6,1
276,2023_19_PHI_TB,POSTPOSTPOSTPOSTPOSTPOSTPOSTPOSTPOSTPOSTPOSTPO...,247,30,69,226,4577.0,12,TBTBTBTBPHITBPHIPHIPHITBTBTBTB,PHIPHIPHIPHITBPHITBTBTBPHIPHIPHIPHI,...,69,(10:05) 4-C.McLaughlin 28 yard field goal is G...,field_goalpassextra_pointfield_goalfield_goalf...,1.0,18,6,3.0,0.0,7,2


Below is a dictionary giving, for home and away teams, the empirical probability of each number of scoring plays in a given NFL game.

In [29]:
num_scores = {place:sp[place].value_counts() for place in ['home_score','away_score']}
num_scores['home_score'] = num_scores['home_score'] / sum(num_scores['home_score'] )
num_scores['away_score'] = num_scores['away_score'] / sum(num_scores['away_score'] )
num_scores

{'home_score': home_score
 4     0.212230
 5     0.212230
 3     0.172662
 6     0.154676
 2     0.097122
 7     0.075540
 1     0.028777
 8     0.021583
 0     0.017986
 10    0.003597
 9     0.003597
 Name: count, dtype: float64,
 'away_score': away_score
 4    0.241007
 3    0.223022
 5    0.147482
 2    0.129496
 6    0.122302
 1    0.064748
 7    0.050360
 8    0.010791
 0    0.010791
 Name: count, dtype: float64}
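
Dividing the counts by their sum, as above, is equivalent to asking pandas for normalized counts directly. A minimal sketch on hypothetical data (`toy_counts` is an illustrative name):

```python
import pandas as pd

# Hypothetical number of scoring plays in eight games.
toy_counts = pd.Series([4, 5, 4, 3, 5, 4, 6, 2])

# normalize=True returns proportions instead of raw counts.
dist = toy_counts.value_counts(normalize=True).sort_index()
print(dist)
```

Sorting by index lists the outcomes in numeric order, which makes the distribution easier to read than the default ordering by frequency.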

We will make a data frame of all scorigamis.

In [30]:
pip install html5lib





In [31]:
multiser = pd.read_html('/Users/MC/Downloads/scoregami.rtf',skiprows = 2)[0].fillna(0).stack()
multiser = pd.DataFrame(multiser)
scorigami = multiser[multiser[0] == 0]
scorigami

Unnamed: 0,Unnamed: 1,0
0,1,0.0
0,4,0.0
0,46,0.0
0,50,0.0
0,61,0.0
...,...,...
73,69,0.0
73,70,0.0
73,71,0.0
73,72,0.0


For each simulated game, we will check whether the score is a scorigami. If it is, we will add it to the scori list. We will also count how many simulated games it takes to generate 1000 scorigamis.

In [32]:
scori = []
temp_array = []
counter = 0
while len(scori) < 1000:
    counter += 1
    temp = scores(1)[0]
    temp_array.append(temp)
    if temp in scorigami.index:
        scori.append(temp)
scori = pd.DataFrame({"score":scori})
counter

38605
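
The membership test and the resulting rate can be sketched as follows. The index `unseen` holds hypothetical score pairs rather than the real scorigami table, and converting it to a set is an added design choice for fast repeated lookups, not what the loop above does.

```python
import pandas as pd

# Hypothetical unseen score pairs standing in for scorigami.index.
unseen = pd.MultiIndex.from_tuples([(0, 1), (0, 4), (4, 5)])

# Iterating a MultiIndex yields tuples, so a set supports fast lookups.
unseen_set = set(unseen)

def is_scorigami(score):
    """Return True if the score pair has not occurred before."""
    return tuple(score) in unseen_set

print(is_scorigami((0, 1)), is_scorigami((13, 21)))

# Share of simulated games that were scorigamis, from the counter above.
rate = 1000 / 38605
print(round(rate, 4))
```

With 1000 scorigamis in 38,605 simulated games, about 2.6% of simulated games produced a score not in the table.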

Finally, we will generate a histogram of scorigamis to determine which ones are the most likely to occur next. 

In [33]:
hist = alt.Chart(scori).mark_bar().encode(
    x = alt.X('score:N',
                sort=alt.EncodingSortField(field="score", op="count", order='descending'),
                axis=alt.Axis(labelAngle=0)), 
    y = 'count()',
    tooltip=['score', 'count()']).interactive()
hist

