# 2025 Data-Driven Driver Rankings

The goal of this notebook is to create an objective driver ranking, based on data.

## Methodology

We will use features generated from F1 data and create a normalization algorithm to process each feature category, creating a hypothetical score for each driver in each feature category.

I will explain everything below as I proceed, which should make things much clearer.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import pandas as pd
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import os

pd.set_option('display.max_columns', None)

In [None]:
# change path for local imports

if not (Path.cwd() / 'src').exists():
    project_root = Path.cwd().parents[1]
    os.chdir(project_root)
    print(f"Directory changed to: {Path.cwd()}")
else:
    print(f"Already at the project root: {Path.cwd()}")

In [None]:
from src.analysis.data_viz.plotter import *
from src.analysis.data_viz.constants import TEAM_COLORS
from src.analysis.data_viz.constants import JOLPICA_CONSTRUCTOR_RENAME
import notebooks.f1_2025_data_rankings.utils as nb_utils # helpers for this specific notebook
from src.analysis.utils.feature_normalizer import FeatureNormalizer

# Features and plan for them:

This notebook assumes the features have already been validated to some extent (see feature_table_validation.ipynb for reference). Here I will dive deeper into the values each feature produces for each driver, explaining and revisiting the features one by one. Afterwards, I will try to create a score/power ranking for each driver.

All of the datasets were also pre-processed so the features could be calculated appropriately. Check the extractor for each feature for more details.

The analysis will focus on the 2025 championship.

In [None]:
# Loading feature datasets:

pace_path = 'data/features/pace_features.csv'
perf_path = 'data/features/performance_features.csv'
exp_path = 'data/features/experience_features.csv'

df_pace = pd.read_csv(pace_path)
df_perf = pd.read_csv(perf_path)
df_exp = pd.read_csv(exp_path)

list_df_features = [df_pace, df_perf, df_exp]

# Data treatment useful for later

def clean_df(df):
    df = nb_utils.filter_year(df, 2025)
    return nb_utils.replace_constructors_names(df)

df_pace, df_perf, df_exp = [clean_df(p) for p in list_df_features]


# Pace features:

In [None]:
df_pace

In [None]:
list_pace_features = nb_utils.get_features_column_list(df_pace)
list_pace_features

In [None]:
selected_feature = 'avg_pace_vs_field'

df_plot = df_pace.sort_values(by=selected_feature, ascending=False)

graf_barras_padrao(
    df_dados=df_plot,
    x_col='driver_surname',
    y_col=selected_feature,
    hue_col='constructor_name',
    cores_map=TEAM_COLORS,
    titulo=f'2025 Championship - Feature Analysis: {selected_feature}',
    ylabel=selected_feature,
    fmt_rotulo='%.2f'
)

The problem with the chart above is that it is not the best way to present this kind of value. The differences are very small, so the chart does not really show how much the drivers diverge from each other. Let's normalize the features first and then get back to analyzing them.

# Normalization/Score Creation

I've created a separate module to normalize the features so we can extract more information when visualizing them. Later, we can add weights to each one and create a combined score for each feature "category".

## How the normalization works:

1- Apply z-score normalization, which uses the mean and standard deviation and is intended to keep outliers from distorting the whole series.

2- Apply min-max scaling so the numbers are easier to interpret. Here the maximum value is 10 and the minimum is 5 (elite baseline logic: the analysis is comparative rather than an absolute test score, and we are only comparing elite drivers).

Now that I have briefly explained the normalization technique, I will apply it to the feature analysis, so we can see the original value first and then the normalized value. I will explain each feature as we go.
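To make the two steps concrete, here is a minimal sketch of a z-score followed by a min-max rescale into the 5 to 10 range. The function `zscore_then_minmax` is a hypothetical stand-in written for this explanation. It is not the code inside `FeatureNormalizer.robust_normalize`, which may handle outliers differently.

```python
import pandas as pd


def zscore_then_minmax(values, target_range=(5.0, 10.0), lower_is_better=False):
    """Hypothetical two-step normalizer: z-score, then min-max rescale."""
    s = pd.Series(values, dtype="float64")
    std = s.std(ddof=0)
    # Step 1: z-score. A constant series has std 0, so fall back to zeros.
    z = (s - s.mean()) / std if std > 0 else pd.Series(0.0, index=s.index)
    if lower_is_better:
        z = -z  # flip so the lowest raw value ends up with the top score
    # Step 2: min-max rescale of the z-scores into target_range.
    lo, hi = target_range
    span = z.max() - z.min()
    if span == 0:
        # All values identical: give everyone the midpoint of the range.
        return pd.Series((lo + hi) / 2, index=s.index)
    return lo + (z - z.min()) / span * (hi - lo)


# Three toy pace ratios, where lower is better.
scores = zscore_then_minmax([0.995, 1.000, 1.010], lower_is_better=True)
print(scores.round(2).tolist())
```

In this sketch both steps are linear, so without any clipping of the z-scores the result is the same as a plain min-max rescale. Any extra outlier handling in the real module is not shown here.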

But before going ahead, I'm going to create a function below to make my life a little easier when analyzing each single feature.

In [None]:
def create_feature_norm_analysis(df_feature, feature, lower_is_better):

    '''
    Plots chart for feature analysis and returns original dataframe with normalized feature col added
    '''

    df = df_feature.copy()

    df_plot = df.sort_values(by=feature, ascending=False)

    graf_barras_padrao(
        df_dados=df_plot,
        x_col='driver_surname',
        y_col=feature,
        hue_col='constructor_name',
        cores_map=TEAM_COLORS,
        titulo=f'2025 Championship - Feature Analysis: {feature}',
        ylabel=feature,
        xlabel='Driver',
        fmt_rotulo='%.2f',
        show_legend=False
    )

    normalizer = FeatureNormalizer()

    df[f'{feature}_norm'] = normalizer.robust_normalize(df[feature], target_range=(5, 10), lower_is_better=lower_is_better).copy()
    df_plot = df.sort_values(by=feature, ascending=False)

    graf_barras_padrao(
        df_dados=df_plot,
        x_col='driver_surname',
        y_col=f'{feature}_norm',
        hue_col='constructor_name',
        cores_map=TEAM_COLORS,
        titulo=f'2025 Championship - Feature Analysis: {feature} Normalized',
        ylabel=f'{feature} Normalized',
        xlabel='Driver',
        fmt_rotulo='%.2f',
        show_legend=False
    )

    return df


# Going back to Pace Features

In [None]:
feature_index = 0

selected_feature = list_pace_features[feature_index]

df_pace = create_feature_norm_analysis(df_feature=df_pace, feature=selected_feature, lower_is_better=True)

feature_index += 1

This feature is generated by the following formula:
- For each race, the driver's median pace is calculated (pace is represented by the lap time);
- The field's median pace is calculated as the median of all drivers' median paces in that race;
- The driver's median pace is then divided by the field's median pace to get the pace vs field. This calculation is still done at the race level;
- Then, I average the value over the whole year to get the final value for the driver.

So, in the end, the feature is the ratio of the driver's median pace to the field's median pace. It can be read as the driver's median lap time as a percentage of the field's median lap time. Lower values are better, since pace is represented by lap time and lower lap times are faster.

A driver with a pace vs field of 1.0 is at the median of the field, while a value of 2.0 would mean a median lap time twice the field's median, and so on. Any driver with a pace vs field below 1.0 has a median lap time lower than the field's median, which means they are faster than the median of the field.
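The four steps of the pace vs field calculation can be sketched with pandas on a toy lap table. The column names (`round`, `driver`, `lap_time`) and the numbers are assumptions for illustration only. The real extractor may filter laps or name things differently.

```python
import pandas as pd

# Toy lap data: three drivers, two races, two laps each (seconds).
laps = pd.DataFrame({
    "round":    [1] * 6 + [2] * 6,
    "driver":   ["A", "A", "B", "B", "C", "C"] * 2,
    "lap_time": [90.0, 90.0, 91.0, 91.0, 92.0, 92.0,
                 80.0, 80.0, 80.0, 80.0, 84.0, 84.0],
})

# 1. Median lap time per driver per race.
per_race = (
    laps.groupby(["round", "driver"])["lap_time"]
    .median()
    .rename("driver_median")
    .reset_index()
)

# 2. Field median per race = median of the drivers' medians.
per_race["field_median"] = per_race.groupby("round")["driver_median"].transform("median")

# 3. Race-level ratio of driver median to field median.
per_race["pace_vs_field"] = per_race["driver_median"] / per_race["field_median"]

# 4. Season average per driver.
avg_pace_vs_field = per_race.groupby("driver")["pace_vs_field"].mean()
print(avg_pace_vs_field)
```

In this toy data, driver B sits exactly on the field median in both races and ends at 1.0, driver A ends below 1.0 (faster) and driver C above 1.0 (slower).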

In [None]:
selected_feature = list_pace_features[feature_index]

df_pace = create_feature_norm_analysis(df_feature=df_pace, feature=selected_feature, lower_is_better=True)

feature_index += 1

This feature is calculated using pretty much the same logic as the previous one, but it compares the driver's median pace to his teammate's median pace instead of the whole field median pace.

We can see that, even though I used z-score normalization so that outliers affect the analysis less, drivers whose feature value is far from the mean still receive a very high or very low score, as we can see with Verstappen and Tsunoda in the chart above.
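For the teammate comparison, the only step that changes is the denominator. The sketch below assumes exactly two drivers per team in each race, so the teammate's median is the team sum minus the driver's own median. The column names and numbers are hypothetical.

```python
import pandas as pd

# Toy race-level median lap times for one team over two races.
race_medians = pd.DataFrame({
    "round":         [1, 1, 2, 2],
    "team":          ["X", "X", "X", "X"],
    "driver":        ["A", "B", "A", "B"],
    "driver_median": [90.0, 91.8, 80.0, 80.0],
})

# With two drivers per team, teammate median = team sum - own median.
team_sum = race_medians.groupby(["round", "team"])["driver_median"].transform("sum")
race_medians["teammate_median"] = team_sum - race_medians["driver_median"]

# Race-level ratio, then season average per driver.
race_medians["pace_vs_teammate"] = (
    race_medians["driver_median"] / race_medians["teammate_median"]
)
avg_pace_vs_teammate = race_medians.groupby("driver")["pace_vs_teammate"].mean()
print(avg_pace_vs_teammate)
```

Races where a driver has no teammate result, or more than one teammate after a mid-season swap, would need extra handling that this sketch leaves out.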

In [None]:
selected_feature = list_pace_features[feature_index]

df_pace = create_feature_norm_analysis(df_feature=df_pace, feature=selected_feature, lower_is_better=True)

feature_index += 1

The feature above is calculated using the following logic:

1. Calculate the standard deviation of lap times for every driver in every race stint, then average the values over all the stints in a race to get each driver's standard deviation for that race;
2. Calculate the mean of the values above to get the standard deviation for each race;
3. Divide the driver's std. dev. by the race std. dev. to get the feature value for each race;
4. Average the value for each driver over the whole season to get the final feature value.

So, like the previous feature, this feature also represents a ratio between two values, which can also be interpreted as a percentage of one relative to the other.
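Here is a small sketch of the stint-based consistency steps, again on invented data. I am assuming the lap table has a `stint` column and that the race-level value in step 2 is the mean across drivers. Neither assumption is confirmed by the extractor code.

```python
import pandas as pd

# Toy laps: one race, two drivers, two stints each, two laps per stint.
laps = pd.DataFrame({
    "round":    [1] * 8,
    "driver":   ["A"] * 4 + ["B"] * 4,
    "stint":    [1, 1, 2, 2, 1, 1, 2, 2],
    "lap_time": [90.0, 92.0, 91.0, 93.0, 90.0, 96.0, 91.0, 97.0],
})

# 1. Std. dev. per stint, then mean over stints per driver per race.
stint_std = laps.groupby(["round", "driver", "stint"])["lap_time"].std()
driver_std = (
    stint_std.groupby(level=["round", "driver"])
    .mean()
    .rename("driver_std")
    .reset_index()
)

# 2. Race-level reference: mean of the drivers' values in that race.
driver_std["race_std"] = driver_std.groupby("round")["driver_std"].transform("mean")

# 3. Race-level ratio.
driver_std["std_vs_field"] = driver_std["driver_std"] / driver_std["race_std"]

# 4. Season average per driver.
avg_std_vs_field = driver_std.groupby("driver")["std_vs_field"].mean()
print(avg_std_vs_field)
```

Driver B's lap times spread three times as much as driver A's in every stint, so A ends at 0.5 and B at 1.5 relative to the race mean.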

In [None]:
selected_feature = list_pace_features[feature_index]

df_pace = create_feature_norm_analysis(df_feature=df_pace, feature=selected_feature, lower_is_better=True)

feature_index += 1

Again, the calculation is the same as explained in the previous cell, but now we are comparing the driver's standard deviation to his teammate's standard deviation instead of the whole field's.

In [None]:
selected_feature = list_pace_features[feature_index]

df_pace = create_feature_norm_analysis(df_feature=df_pace, feature=selected_feature, lower_is_better=True)

feature_index += 1

Here, instead of comparing the standard deviation of the driver's pace to the standard deviation of the field or teammate, we are showing the raw value.

What I found interesting is that Norris shows up as the most consistent driver when looking at the raw value, but in the feature that compares him to his teammate, he has the higher standard deviation.

This can happen because of how the values are aggregated: the teammate feature first compares the drivers' standard deviations in every race, and only then takes the mean for the year. One scenario in which this can happen (and probably did) is that Piastri was more consistent than Norris in a larger number of races, while Lando kept his consistency steadier across the season than Piastri, who may have had a few rounds where he was much less consistent than usual.
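The effect described above is easy to reproduce with invented numbers. In the toy table below, `driver_a` has the lower average raw standard deviation, yet the higher average ratio against the teammate, because `driver_b` is tighter in three races out of four and has one very scattered race. The values are made up for illustration and are not the real 2025 data.

```python
import pandas as pd

# Toy per-race lap-time standard deviations in seconds (invented numbers).
race_std = pd.DataFrame({
    "driver_a": [0.40, 0.40, 0.40, 0.20],
    "driver_b": [0.20, 0.20, 0.20, 0.90],
})

# Raw view: average the standard deviation over the season.
raw_mean = race_std.mean()

# Teammate view: ratio in every race first, season average afterwards.
a_vs_b = (race_std["driver_a"] / race_std["driver_b"]).mean()
b_vs_a = (race_std["driver_b"] / race_std["driver_a"]).mean()

print(raw_mean.round(3).to_dict())
print(round(a_vs_b, 3), round(b_vs_a, 3))
```

The mean of per-race ratios weights every race equally, so one bad race for `driver_b` moves the raw average a lot but only counts as one race in the ratio view.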


In [None]:
selected_feature = list_pace_features[feature_index]

df_pace = create_feature_norm_analysis(df_feature=df_pace, feature=selected_feature, lower_is_better=True)

feature_index += 1

The feature above is calculated through the following process:

1. Get the driver's best lap time for each qualifying session;
2. Get the best overall time for each qualifying session (usually pole);
3. Divide the driver's best lap time by the best overall time for each qualifying session, getting a ratio of how fast the driver was in comparison to the pole position time;
4. Subtract 1 from the ratio to get a value that represents the "percentage off-pole" of the driver in that session;
5. Average the percentage off-pole for each driver across all qualifying sessions to get the value representative of the whole season.
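The off-pole calculation can be sketched as follows. The table layout (`round`, `driver`, `best_lap`) and the lap times are assumptions made for this example.

```python
import pandas as pd

# Toy qualifying results: best lap per driver per session (seconds).
quali = pd.DataFrame({
    "round":    [1, 1, 1, 2, 2, 2],
    "driver":   ["A", "B", "C"] * 2,
    "best_lap": [80.0, 80.4, 81.6, 100.0, 99.0, 100.98],
})

# Steps 1-2: the best overall time of each session.
pole_time = quali.groupby("round")["best_lap"].transform("min")

# Steps 3-4: ratio to the best time, minus 1, gives the fraction off pole.
quali["pct_off_pole"] = quali["best_lap"] / pole_time - 1

# Step 5: season average per driver.
avg_pct_off_pole = quali.groupby("driver")["pct_off_pole"].mean()
print(avg_pct_off_pole)
```

The value is a fraction, so 0.02 means the driver was on average 2% slower than the best time of the session.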


In [None]:
selected_feature = list_pace_features[feature_index]

df_pace = create_feature_norm_analysis(df_feature=df_pace, feature=selected_feature, lower_is_better=True)

feature_index += 1

The calculation for this feature is pretty much the same as above, but it uses the teammate's best time as the basis for comparison instead of the pole time.

# Performance Features

In [None]:
df_perf

In [None]:
list_perf_features = nb_utils.get_features_column_list(df_perf)
list_perf_features

In [None]:
feature_index = 0

selected_feature = list_perf_features[feature_index]

df_perf = create_feature_norm_analysis(df_feature=df_perf, feature=selected_feature, lower_is_better=False)

feature_index += 1

Total points in the season, self-explanatory.

In [None]:
selected_feature = list_perf_features[feature_index]

df_perf = create_feature_norm_analysis(df_feature=df_perf, feature=selected_feature, lower_is_better=True)

feature_index += 1

Mean finishing position.

In [None]:
selected_feature = list_perf_features[feature_index]

df_perf = create_feature_norm_analysis(df_feature=df_perf, feature=selected_feature, lower_is_better=True)

feature_index += 1

Average starting position.

In [None]:
selected_feature = list_perf_features[feature_index]

df_perf = create_feature_norm_analysis(df_feature=df_perf, feature=selected_feature, lower_is_better=True)

feature_index += 1

Average difference between the finishing position and the starting position.
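A minimal sketch of this feature, assuming the difference is taken as finishing position minus starting position. That sign convention is my reading of the `lower_is_better=True` argument in the cell above (a negative value means positions gained). The column names are hypothetical.

```python
import pandas as pd

# Toy race results: grid slot and finishing position per race.
results = pd.DataFrame({
    "driver": ["A", "A", "B", "B"],
    "grid":   [5, 3, 2, 1],
    "finish": [3, 3, 4, 2],
})

# Negative delta = gained positions, positive delta = lost positions.
results["position_delta"] = results["finish"] - results["grid"]
avg_position_delta = results.groupby("driver")["position_delta"].mean()
print(avg_position_delta)
```

Driver A gains one position per race on average (-1.0), while driver B loses 1.5.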

In [None]:
selected_feature = list_perf_features[feature_index]

df_perf = create_feature_norm_analysis(df_feature=df_perf, feature=selected_feature, lower_is_better=False)

feature_index += 1

Average points obtained in each round.

In [None]:
selected_feature = list_perf_features[feature_index]

df_perf = create_feature_norm_analysis(df_feature=df_perf, feature=selected_feature, lower_is_better=False)

feature_index += 1

Team's point share for each driver of the team.
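As a sketch, the share can be computed by dividing each driver's points by the team total. The table below is invented, and a team with zero points would produce a 0/0 division (NaN) that the real extractor presumably handles in some way not shown here.

```python
import pandas as pd

# Toy season totals per driver.
standings = pd.DataFrame({
    "team":   ["X", "X", "Y", "Y"],
    "driver": ["A", "B", "C", "D"],
    "points": [300, 100, 30, 30],
})

# Each driver's fraction of the points scored by their team.
team_total = standings.groupby("team")["points"].transform("sum")
standings["team_points_share"] = standings["points"] / team_total
print(standings[["driver", "team_points_share"]])
```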

In [None]:
selected_feature = list_perf_features[feature_index]

df_perf = create_feature_norm_analysis(df_feature=df_perf, feature=selected_feature, lower_is_better=False)

feature_index += 1

Number of times a driver outperformed their teammate in qualifying.
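One way to count these head-to-head results is shown below. It assumes two classified drivers per team in each session and unique qualifying positions. A session where a driver has no teammate would be counted as a win by this sketch, so it would need to be filtered out first. The column names are hypothetical.

```python
import pandas as pd

# Toy qualifying positions for one team over three sessions.
quali = pd.DataFrame({
    "round":          [1, 1, 2, 2, 3, 3],
    "team":           ["X"] * 6,
    "driver":         ["A", "B"] * 3,
    "quali_position": [1, 2, 4, 3, 2, 5],
})

# A driver beats the teammate when they hold the best position in the team.
best_in_team = quali.groupby(["round", "team"])["quali_position"].transform("min")
quali["beat_teammate"] = quali["quali_position"] == best_in_team

# Summing booleans counts the sessions won per driver.
quali_h2h_wins = quali.groupby("driver")["beat_teammate"].sum()
print(quali_h2h_wins)
```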