# How Bad is the Racing in the Hybrid Era?
### What does the data actually tell us about the quality of racing since the hybrid engine was introduced?

Since the Hybrid was introduced last July at Mid-Ohio, the public perception has been all about weight. Drivers, fans, the media, your uncle Bill, everyone agrees the Dallara DW12 was never designed to carry that much weight, and the racing has suffered as a result. 

It's not all the hybrid's fault, of course. The aeroscreen added a bunch of weight long before the hybrid got here. But the combination of the two is just too much.

I mean, look at the last lap of the Indy 500 this year compared to the last lap last year. Last year, without the hybrid, we got a dramatic back straight overtake that decided the race and somehow led to everyone liking Pato O'Ward even more while cementing Josef Newgarden as an even bigger villain.

This year? Sure there was a close battle for P1 on the last lap, but where was the dramatic overtake? Also, unlike last year, only one of the two guys battling for first was in all those Fox commercials. Clearly the hybrid sucks.

Except this is a data blog, and INDYCAR has thousands of laps of racing every year. So a 2-lap sample size maybe feels like not enough data to draw a definitive conclusion? Ok fine.

### Margin of Victory.

Margin of victory - basically the time gap between P1 and P2 at the end of the race - seems to be the go-to metric whenever someone wants to cite how close the racing is. In fact, this post was inspired by an interview INDYCAR's Managing Director of Engine Development Darren Sansom gave last offseason where he cited margin of victory as evidence that the Hybrid hasn't ruined the racing.

So what does the data actually say about that? Let's see what the average margin of victory was for the entire aeroscreen era 2020-2024 vs the hybrid era 2024-2025:

In [1]:
import pandas as pd
import os
from analytics import RaceData

# load data
rd_oemaero = RaceData()
rd_oemaero.add_races_by_date('2015-01-01','2017-12-31',section_results=False) 
rd_singleaero = RaceData()
rd_singleaero.add_races_by_date('2018-01-01','2019-12-31',section_results=False) 
rd_aeroscreen = RaceData()
rd_aeroscreen.add_races_by_date('2020-01-01','2024-07-01',section_results=False) 
rd_hybrid = RaceData()
rd_hybrid.add_races_by_date('2024-07-01','2025-12-31',section_results=False)

# combine data into one df
oemaero = rd_oemaero.results_df.copy()
oemaero.rename(columns = {'Running/Reason Out':'Running / Reason Out'},inplace=True)
oemaero['Period'] = 'OEM Aero'
singleaero = rd_singleaero.results_df.copy()
singleaero['Period'] = 'Uniform Aero'
aeroscreen = rd_aeroscreen.results_df.copy()
aeroscreen = aeroscreen.loc[aeroscreen.RaceID != '6345'].copy()
aeroscreen['Period'] = 'Aeroscreen'
hybrid = rd_hybrid.results_df.copy()
hybrid['Period'] = 'Hybrid'
df = pd.concat([oemaero,singleaero,aeroscreen,hybrid],ignore_index=True)

In [2]:
# clean DFs
df = df.loc[df.Pos != ''].copy()
df['Pos'] = df.Pos.astype(int)
df['SP'] = df.SP.astype(int)

In [3]:
# gap between each car and the next car back in the finishing order (shift(-1) pulls the next position's time)
df['ElapsedSec'] = df['Elapsed Time'].apply(
    lambda x: int(x[0:2])*3600+int(x[3:5])*60+float(x[6:])
)
df['ElapsedSec-1'] = df.sort_values('Pos').groupby(['RaceID','Lap'])['ElapsedSec'].shift(-1)
df['Gap'] = df['ElapsedSec-1'] - df.ElapsedSec

In [4]:
df.loc[(df.Pos == 1) & df.Period.isin(['Aeroscreen','Hybrid'])].groupby('Period').agg(
    NRaces = ('RaceID','nunique'),
    Average = ('Gap','mean'),
    Median = ('Gap','median'),
    Largest = ('Gap','max')
)

| Period | NRaces | Average | Median | Largest |
|---|---|---|---|---|
| Aeroscreen | 72 | 3.518535 | 1.4091 | 30.3812 |
| Hybrid | 21 | 3.010319 | 1.726 | 16.0035 |


The Hybrid era (July 2024-present) actually saw the average margin of victory shrink by half a second compared to the Aeroscreen era (2020-July 2024), but the median is a little higher (1.73 vs 1.41 seconds). This tells us that the majority of races were probably slightly tighter during the aeroscreen era, but the hybrid era had fewer blowout races, or at least the blowout victories weren't as massive in the hybrid era.
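
To see how a couple of blowouts can drag the average up while barely touching the median, here is a toy illustration. The margins below are made up for the example and are not taken from the results data.

```python
from statistics import mean, median

# Hypothetical margins of victory in seconds (NOT real race data).
tight_with_blowout = [0.5, 0.8, 1.0, 1.2, 30.0]  # mostly close, one runaway win
looser_no_blowout = [1.0, 1.5, 1.7, 2.5, 6.0]    # slightly wider, no runaway

for name, gaps in [("tight + blowout", tight_with_blowout),
                   ("looser, no blowout", looser_no_blowout)]:
    print(f"{name}: mean={mean(gaps):.2f}s median={median(gaps):.2f}s")
```

The first set has the lower median but the higher mean, which is the same pattern the Aeroscreen and Hybrid rows show in the table above.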

If we're worried about weight, though, shouldn't we be looking at before the aeroscreen was added? After all, the aeroscreen added close to 20 pounds before the hybrid ever showed up. So what if we include the era BEFORE the aeroscreen era?

Actually I have no idea what to call that era, so we're going to throw in the era before the era before the aeroscreen which is when each engine manufacturer brought its own aero kit. That way we can call the 2 eras "OEM Aero" (2015-2017) and "Uniform Aero" (2018-2019). 

OK, with that not-at-all-made-up problem solved, here's the analysis:

In [5]:
df.loc[(df.Pos == 1)].groupby('Period').agg(
    NRaces = ('RaceID','nunique'),
    Average = ('Gap','mean'),
    Median = ('Gap','median'),
    Largest = ('Gap','max')
)

| Period | NRaces | Average | Median | Largest |
|---|---|---|---|---|
| Aeroscreen | 72 | 3.518535 | 1.4091 | 30.3812 |
| Hybrid | 21 | 3.010319 | 1.726 | 16.0035 |
| OEM Aero | 49 | 3.508176 | 1.5275 | 30.2703 |
| Uniform Aero | 34 | 4.264303 | 2.68365 | 28.4391 |


OK, so the uniform aero era actually had the largest margin of victory of any era despite being lighter than both the aeroscreen and hybrid era cars. We have done capital S Science and determined that heavier cars lead to better racing. 

"Not so fast," you might say, oh intelligent and handsome reader of this blog. "Margin of Victory only looks at the winning and runner-up cars. How is looking at 2 cars per race definitive when there are 27+ cars in the field each weekend?" You're right, you wise and presumably rich reader. What if we looked at all the cars? What if we looked at more than just margin of victory?

OK that's actually more complex than it seems. You can't determine gap between 2 cars that aren't on the same lap. So first we're just going to look at cars on the lead lap.

In [13]:
df['Finished'] = (df['Running / Reason Out']=='Running')
df['LeadLap'] = (df['Laps Down'] == '0')
df.loc[
    df.Finished,
    'adjSP'
] = (
    df
    .loc[df.Finished]
    .groupby(['RaceID'])
    ['SP']
    .rank()
)
df['adjPosMvmt'] = abs(df['adjSP'] - df['Pos'])

df['LLGap'] = df.loc[df.LeadLap,'Gap']
df.groupby('Period').agg(
    
    Percent_of_Cars_on_Lead_Lap = ('LeadLap','mean'),
    Average_Gap_Lead_Lap_Only = ('LLGap','mean')
)

| Period | Percent_of_Cars_on_Lead_Lap | Average_Gap_Lead_Lap_Only |
|---|---|---|
| Aeroscreen | 0.624868 | 2.800859 |
| Hybrid | 0.583184 | 2.632473 |
| OEM Aero | 0.635472 | 3.294332 |
| Uniform Aero | 0.592777 | 3.594967 |


OK, so in the Hybrid era fewer cars are finishing on the lead lap than in any other era, but those cars are racing tighter to each other than during any other era.

There are 2 reasons that we might be seeing a smaller percentage of cars finishing on the lead lap. Either fewer cars are finishing overall, or there is a wider gap between the front and back of the pack. 

Looking at the percentage of cars that finish the race, as well as the average laps down of cars that finish off the lead lap, will shed some light on which of those 2 possibilities we are seeing:

In [23]:
# 'Laps Down' arrives as a string; convert once so numeric comparisons work
df['Laps Down'] = df['Laps Down'].astype(int)
# 1.0/0.0 for finishers on/off the lead lap; NaN for DNFs so they drop out of the mean
df['LeadLapFinishersOnly'] = (df['Laps Down'] == 0).astype(float).where(df['Finished'])
# laps down for finishers who were lapped; NaN for everyone else
df['FinisherLapsDownNonLL'] = df['Laps Down'].where(df['Finished'] & (df['Laps Down'] > 0))

df.groupby('Period').agg(
    Percent_of_Cars_Finished = ('Finished','mean'),
    Percent_of_Finishers_on_lead_lap = ('LeadLapFinishersOnly','mean'),
    Average_Laps_Down_Finishers_Only = ('FinisherLapsDownNonLL','mean')
)

| Period | Percent_of_Cars_Finished | Percent_of_Finishers_on_lead_lap | Average_Laps_Down_Finishers_Only |
|---|---|---|---|
| Aeroscreen | 0.854057 | 0.731647 | 2.586207 |
| Hybrid | 0.812165 | 0.718062 | 2.453125 |
| OEM Aero | 0.827986 | 0.766416 | 3.179724 |
| Uniform Aero | 0.856787 | 0.69186 | 3.240566 |


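Well, that's a mixed bag. On the one hand, the cars that finish on the lead lap are in a tighter race in the Hybrid era than in any other era, and the finishers who do get lapped are fewer laps down on average. On the other hand, a smaller share of finishers end up on the lead lap than in the Aeroscreen or OEM Aero eras.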

Oh yeah, and a LOT fewer cars are finishing at all. The finishing rate dropped from about 85% in the Aeroscreen era to about 81%, which works out to roughly 29% more DNFs.
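
For anyone checking the math, the 29% figure can be reproduced from the table above as a relative change in the DNF rate, not in the finishing rate. A quick sketch using only the two `Percent_of_Cars_Finished` values:

```python
# Finishing rates copied from the Percent_of_Cars_Finished column above.
finish_rate = {"Aeroscreen": 0.854057, "Hybrid": 0.812165}

# DNF rate is simply the share of cars that did not finish.
dnf_rate = {era: 1 - rate for era, rate in finish_rate.items()}

# Absolute change in finishing rate, in percentage points.
point_drop = (finish_rate["Aeroscreen"] - finish_rate["Hybrid"]) * 100

# Relative change in finishing rate vs. relative change in DNF rate.
rel_finish_drop = 1 - finish_rate["Hybrid"] / finish_rate["Aeroscreen"]
rel_dnf_rise = dnf_rate["Hybrid"] / dnf_rate["Aeroscreen"] - 1

print(f"Finishing rate drop: {point_drop:.1f} points ({rel_finish_drop:.1%} relative)")
print(f"DNF rate increase:   {rel_dnf_rise:.1%} relative")
```

So the finishing rate itself fell by about 4 percentage points (roughly 5% in relative terms), while the DNF rate rose by a bit under 29%.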

Those of you pre-disposed to hate the hybrid I'm sure are licking your chops at that low finishing percentage... I promise we'll look into it, but not in this post. I mean come on, we're trying to define how "good" the racing is, and races aren't judged on how many cars finish, they're judged on what the cars that finish actually do. 

So with all of that in mind, what else do we want to know? For starters, how close is the racing between cars not on the lead lap? Also, what about overtakes? We haven't even looked at a car's ability to move through the pack yet, and that feels important.

Yeah you're right, those things do feel important, let's have a look. 

In [30]:
df['PosMvmt'] = abs(df['SP'] - df['Pos'])
agg = df.groupby("Period").agg(
    Avg_Position_Movement = ("PosMvmt", "mean"),
    Average_Gap_In_Seconds = ('Gap','mean')
    
)
agg

| Period | Avg_Position_Movement | Average_Gap_In_Seconds |
|---|---|---|
| Aeroscreen | 6.079031 | 5.079309 |
| Hybrid | 6.753131 | 6.735813 |
| OEM Aero | 5.796791 | 4.311517 |
| Uniform Aero | 5.863014 | 13.17478 |


Position movement here is defined as the absolute difference between a car's starting and finishing position. For example, say the car on pole finished in 3rd and the car that started 5th won. Then the position movement for the pole sitter would be 2 and the position movement for the winner would be 4.

Gap is the average gap between a car and the next car in the finishing order, counting only pairs of cars that finish on the same lap.

At a surface level, overtaking feels good in the Hybrid era. Cars are moving through the pack much more than in any other era. Gap isn't so great however. 

BUT I can't ignore all the hybrid haters out there. Remember that finishing % number you read like 30 seconds ago? Remember how low it was? Turns out cars have a really easy time moving through the pack when nearly 20% of the field ends up retiring early, so that position movement number is almost definitely inflated by DNFs.

Also do we really care about the gap between cars 3 laps down? For that matter, do we care whether a car is 20 or 25 seconds behind the car in front of it? I mean I don't, and I'm the one writing this blog. So let's fix it, but how?

This is where I pause to address those of you who follow analytics in other sports: Right now you're probably thinking to yourself, "Hey, I'm 5 minutes into reading this blog and I haven't been introduced to a single made-up metric yet. Where are the 7-letter acronyms? Where is the confusing math that turns off casual fans but makes me, the analytics enthusiast, feel smarter?" Well, if that's what you're looking for then you're in luck. Nerd.

For those of you who are just Indycar fans desperately looking for any offseason content, I promise I won't be *that* obnoxious with the math and you'll walk away with some decent insight. 

Ok, with that out of the way, let's get back to the DNF problem. I follow Indycar for the racing and the strategy, not for the attrition. I'm guessing the same is true of the majority of my readers. So let's come up with a way that captures how much cars are able to move through the field based on speed and strategy, not due to cars in front of them DNFing.

This one's actually pretty easy. If we refactor the starting grid to exclude the cars that DNF, we can calculate position movement from the on-track racing and not from attrition. To explain, we can revisit my earlier example where the pole sitter finishes 3rd and SP5 wins the race. Now let's say that the car that started SP2 has a mechanical failure. In this scenario, we would move the starting positions of the cars in SP3-SP27 up by 1. So the pole sitter still has a position movement of 2 (since they started in front of the DNFing car) while the winner now has a position movement of 3 (since they ultimately passed the cars in SP4, SP3 and SP1 due to speed/strategy). We'll call this Adjusted Position Movement (I know, not a 7-letter acronym).
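
Here is that example worked through as a minimal sketch. The six-car field and the other cars' results are hypothetical, invented only to make the example complete. The re-ranking follows the same idea as the notebook's `adjSP` column: rank starting positions among finishers only, then compare against the finishing position.

```python
# Hypothetical six-car race (illustrative only): (car, start_pos, finish_pos, finished)
race = [
    ("A", 1, 3, True),   # pole sitter finishes 3rd
    ("B", 2, 6, False),  # mechanical failure, classified last
    ("C", 3, 2, True),
    ("D", 4, 4, True),
    ("E", 5, 1, True),   # started 5th, wins
    ("F", 6, 5, True),
]

# Re-rank the starting grid among finishers only, so DNFs leave no gaps.
finishers = sorted((r for r in race if r[3]), key=lambda r: r[1])
adj_sp = {r[0]: rank for rank, r in enumerate(finishers, start=1)}

# Raw movement uses the original grid; adjusted movement uses the re-ranked grid.
raw_mvmt = {car: abs(sp - pos) for car, sp, pos, _ in race}
adj_mvmt = {car: abs(adj_sp[car] - pos) for car, _, pos, fin in race if fin}

print(raw_mvmt["E"], adj_mvmt["E"])  # winner: 4 raw, 3 adjusted
print(raw_mvmt["A"], adj_mvmt["A"])  # pole sitter: 2 raw, 2 adjusted
```

The DNFing car gets no adjusted value at all, which is the point: its retirement no longer counts as anyone's "pass".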

Addressing the gap issue also requires a little creativity. If the first 10 cars in a race each finish half a second in front of the next car, then P11 finishes 106 seconds ahead of P12, that's an average gap of about 10 seconds per car - which sounds like a boring race but clearly is not given that P1 and P11 are only separated by 5 seconds. By the same token, a race where every car finishes 10 seconds behind the car in front kind of is a boring race. What we need to capture is how many cars finish a competitive distance behind the car they are chasing.

Let's define any gap of 3 seconds or less between 2 cars as a competitive distance. 3 seconds is a little arbitrary, and certainly we'll want to do some of that capital-S Science later on to refine this number, but for now 3 seconds feels like a small enough gap that a car could feasibly overtake within a few laps but not so small that very few cars are going to finish within that gap. We'll call this Positions at Stake and we can get a sense of how competitive an era is by calculating what percentage of positions are at stake (Position at Stake %) across an entire era.
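
The two races from the example above make the difference concrete. This is a rough sketch, not the notebook's code: `summarize` and `COMPETITIVE_GAP_SEC` are names made up for illustration, and the threshold is applied as a strict `< 3` to match the notebook cell further down.

```python
COMPETITIVE_GAP_SEC = 3.0  # threshold from the text; admittedly a little arbitrary

def summarize(gaps):
    """Return (average gap, share of gaps under the competitive threshold)."""
    avg = sum(gaps) / len(gaps)
    at_stake = sum(1 for g in gaps if g < COMPETITIVE_GAP_SEC) / len(gaps)
    return avg, at_stake

# Race 1: ten half-second gaps (P1 through P11), then P12 finishes 106 s back.
close_race = [0.5] * 10 + [106.0]
# Race 2: every car finishes 10 s behind the one ahead.
spread_race = [10.0] * 11

for name, gaps in [("close", close_race), ("spread", spread_race)]:
    avg, pct = summarize(gaps)
    print(f"{name}: average gap {avg:.1f}s, positions at stake {pct:.0%}")
```

Both races average roughly 10 seconds per gap, but 10 of the 11 gaps in the first race are competitive and none in the second are.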

But wait! One more thing before we crunch those numbers. We haven't really addressed the problem of "how excited are we really going to get over a competitive battle for P23?" I mean, I'd rather P23 and P24 finish 1 second apart vs 30 seconds apart, but I'd even more rather P1 and P2 finish 1 second apart. To address this, we need some kind of way of assigning value to each position, with better finishing positions being assigned a higher value. Sadly no such thing exists, so I'm just going to make up an arbitrary system where P1 is awarded 50 points, P2 is awarded 40 points, and the subsequent positions are awarded decreasing points until any car finishing 25th or worse receives exactly 5 points. (OK fine, I stole the idea from the *Pts* column of the results data, whatever that means). For each Position at Stake as described above, I'll also calculate Points at Stake by using Indycar's base points (not bonus points). With this, I'll look at Points at Stake as a percentage of total points available (Points at Stake %) to see how competitive the racing is - but weighted by the value of the positions being battled over.
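
As a small worked example of the weighting, here is a sketch with a hypothetical five-car finish. The gaps are invented. The points table is the same `pts` list used in the notebook cell below, and each gap is credited to the car ahead, which is how the notebook's `Gap` column is built.

```python
# Base points by finishing position, same values as the notebook's `pts` list (index = position).
pts = [0, 50, 40, 35, 32, 30, 28, 26, 24, 22, 20, 19, 18, 17, 16, 15,
       14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5]

# Hypothetical five-car finish: position -> gap in seconds to the next car back
# (None for the last car, which has nobody behind it).
gap_behind = {1: 0.8, 2: 12.0, 3: 1.5, 4: 25.0, 5: None}

# A position is "at stake" when the next car finished less than 3 seconds back.
at_stake = {pos: gap is not None and gap < 3 for pos, gap in gap_behind.items()}

positions_pct = sum(at_stake.values()) / len(at_stake)
total_points = sum(pts[pos] for pos in gap_behind)
points_at_stake = sum(pts[pos] for pos, flag in at_stake.items() if flag)
points_pct = points_at_stake / total_points

print(f"Positions at stake: {positions_pct:.0%}")
print(f"Points at stake:    {points_at_stake}/{total_points} = {points_pct:.1%}")
```

Two of five positions are at stake (40%), but because one of them is the win, the points-weighted figure comes out higher at about 45%.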

OK, explanation over, here are the numbers:

In [33]:
pts = [0,50,40,35,32,30,28,26,24,22,20,19,18,17,16,15,14,13,12,11,10,9,8,7,6,5,5,5,5,5,5,5,5,5]

df['PosAtStake'] = df.Gap.apply(lambda x: 1 if x <3 else 0)*df.Finished
df['BasePoints'] = df.Pos.apply(lambda x: pts[x])
df['PtsAtStake'] = df.PosAtStake*df.BasePoints

agg2 = df.groupby("Period").agg(
    Avg_Position_Movement = ("PosMvmt", "mean"),
    AvgAdjPosMvmt = ("adjPosMvmt", "mean"),
    AtStakePct = ("PosAtStake","mean"),
    TotalPoints = ("BasePoints", "sum"),
    PointsAtStake = ("PtsAtStake","sum")
)
agg2['PointsAtStakePct'] = agg2.PointsAtStake/agg2.TotalPoints
agg2[['Avg_Position_Movement','AvgAdjPosMvmt','AtStakePct','PointsAtStakePct']]

| Period | Avg_Position_Movement | AvgAdjPosMvmt | AtStakePct | PointsAtStakePct |
|---|---|---|---|---|
| Aeroscreen | 6.079031 | 4.809377 | 0.485774 | 0.612249 |
| Hybrid | 6.753131 | 4.933921 | 0.488372 | 0.617889 |
| OEM Aero | 5.796791 | 4.401507 | 0.463458 | 0.579646 |
| Uniform Aero | 5.863014 | 4.372093 | 0.433375 | 0.549978 |
