# Jolpica Data EDA

This notebook explores the new Jolpica F1 dataset structure loaded via the updated `F1DataRepository`.

**Goals:**
1. Verify connection to the database.
2. Load the full race history using the new schema.
3. Check column names and data types.
4. Assess data quality (missing values).
5. Export sample data for inspection.

In [1]:
import sys
import os
from pathlib import Path
import pandas as pd

# Add the src directory to sys.path so the gonzo_pit_strategy package can be imported
src_dir = Path.cwd().parent / 'src'
if str(src_dir) not in sys.path:
    sys.path.append(str(src_dir))

# Import both modules through the same package path to avoid loading modules twice
from gonzo_pit_strategy.db.repositories.data_repository import F1DataRepository
from gonzo_pit_strategy.log.logger import get_logger

# Setup basic logging
logger = get_logger("eda_notebook")

## 1. Load Race History
Fetch the data with the updated `get_all_race_history()` method, which joins Rounds, Sessions, and Entries.

In [2]:
try:
    df = F1DataRepository.get_all_race_history()
    print(f"Successfully loaded data with shape: {df.shape}")
except Exception as e:
    print(f"Error loading data: {e}")
    raise  # Later cells depend on df, so stop here instead of continuing without it

2026-01-19 22:11:25,631 - gonzo_pit_strategy.security.vault - DEBUG - Successfully authenticated with Vault
2026-01-19 22:11:25,634 - gonzo_pit_strategy.security.vault - INFO - Vault client initialized successfully
2026-01-19 22:11:25,672 - gonzo_pit_strategy.db.connection_pool - INFO - Creating database engine for postgresql+psycopg2 at 10.0.20.18:5432/f1db
2026-01-19 22:11:25,825 - gonzo_pit_strategy.db.connection_pool - DEBUG - Database engine and session factory created successfully
Successfully loaded data with shape: (25873, 29)


## 2. Basic Structure Inspection

In [3]:
# Check column names and types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25873 entries, 0 to 25872
Data columns (total 29 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   round_id                      25873 non-null  int64  
 1   circuit_id                    25873 non-null  int64  
 2   circuit_name                  25873 non-null  object 
 3   season_year                   25873 non-null  int64  
 4   round_date                    25873 non-null  object 
 5   round_number                  25873 non-null  int64  
 6   round_url                     25873 non-null  object 
 7   team_id                       25873 non-null  int64  
 8   team_name                     25873 non-null  object 
 9   driver_id                     25873 non-null  int64  
 10  driver_abbreviation           10644 non-null  object 
 11  q1_time                       0 non-null      object 
 12  q2_time                       0 non-null      object 
 13  q

In [4]:
# View first few rows
pd.set_option('display.max_columns', None)
df.head()

Unnamed: 0,round_id,circuit_id,circuit_name,season_year,round_date,round_number,round_url,team_id,team_name,driver_id,driver_abbreviation,q1_time,q2_time,q3_time,qualifying_position,grid,finish_position,status,status_detail,laps_completed,points,time,fastest_lap_rank,driver_championship_points,driver_championship_position,driver_wins,team_championship_points,team_championship_position,team_wins
0,1,1,Silverstone Circuit,1950,1950-05-13,1,https://en.wikipedia.org/wiki/1950_British_Gra...,1,Alfa Romeo,1,,,,,,1,1,0,Finished,70,9.0,02:13:23.6,,9.0,1.0,1,0.0,,1.0
1,1,1,Silverstone Circuit,1950,1950-05-13,1,https://en.wikipedia.org/wiki/1950_British_Gra...,1,Alfa Romeo,2,,,,,,2,2,0,Finished,70,6.0,02:13:26.2,,6.0,2.0,0,0.0,,1.0
2,1,1,Silverstone Circuit,1950,1950-05-13,1,https://en.wikipedia.org/wiki/1950_British_Gra...,1,Alfa Romeo,3,,,,,,4,3,0,Finished,70,4.0,02:14:15.6,,4.0,3.0,0,0.0,,1.0
3,1,1,Silverstone Circuit,1950,1950-05-13,1,https://en.wikipedia.org/wiki/1950_British_Gra...,5,Talbot-Lago,4,,,,,,6,4,1,+2 Laps,68,3.0,,,3.0,4.0,0,0.0,,0.0
4,1,1,Silverstone Circuit,1950,1950-05-13,1,https://en.wikipedia.org/wiki/1950_British_Gra...,5,Talbot-Lago,5,,,,,,9,5,1,+2 Laps,68,2.0,,,2.0,5.0,0,0.0,,0.0
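
The `df.info()` output lists `round_date` as `object`, and the `time` values in the preview look like `HH:MM:SS.s` strings. A minimal sketch of converting both to proper temporal dtypes is shown below. The `sample` frame is a hypothetical two-row stand-in, not the loaded data, and the assumption that every `time` value follows this format has not been checked against the full dataset.

```python
import pandas as pd

# Hypothetical two-row sample that mimics the columns previewed above.
sample = pd.DataFrame({
    "round_date": ["1950-05-13", "1950-05-21"],
    "time": ["02:13:23.6", None],
})

# Parse ISO date strings into datetime64 values.
sample["round_date"] = pd.to_datetime(sample["round_date"])

# Parse HH:MM:SS.s race times into timedeltas; missing values become NaT.
sample["time"] = pd.to_timedelta(sample["time"])

print(sample.dtypes)
```

If some rows hold a different format (for example a gap to the winner), `pd.to_timedelta` raises by default, which is a useful signal during exploration.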


## 3. Data Quality Checks

In [5]:
# Check for missing values
missing = df.isnull().sum()
missing = missing[missing > 0]
print("Columns with missing values:")
print(missing.sort_values(ascending=False))

Columns with missing values:
q1_time                         25873
q2_time                         25873
q3_time                         25873
time                            17438
fastest_lap_rank                16887
driver_abbreviation             15229
qualifying_position             15023
driver_championship_position     6990
team_championship_position       5114
team_championship_points          328
team_wins                         328
dtype: int64
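
Raw counts are easier to judge next to the share of rows they represent; for instance, the 17438 missing `time` values are about 67% of the 25873 rows. The sketch below adds a percentage column to the same report. `missing_report` and the `toy` frame are hypothetical names used only for illustration.

```python
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages for columns with at least one null."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns that actually have gaps, worst first.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Hypothetical toy frame standing in for the loaded race history.
toy = pd.DataFrame({
    "q1_time": [None, None, None, None],
    "time": ["02:13:23.6", None, None, "01:30:00.0"],
    "grid": [1, 2, 3, 4],
})
report = missing_report(toy)
print(report)
```

In the notebook, the same function could be called as `missing_report(df)`.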


In [6]:
# Check Qualifying Data Consistency
# Q1/Q2/Q3 times should be present for modern era races
recent_races = df[df['season_year'] >= 2023]
print("\nMissing Qualifying info for 2023+ seasons:")
print(recent_races[['q1_time', 'q2_time', 'q3_time']].isnull().sum())


Missing Qualifying info for 2023+ seasons:
q1_time    1398
q2_time    1398
q3_time    1398
dtype: int64
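
Since the Q1/Q2/Q3 columns are empty for every row, a per-season breakdown may help show where other qualifying fields such as `qualifying_position` start to be populated. The following is a rough sketch on a hypothetical `toy` frame; the column names come from the dataset, the values do not.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({
    "season_year": [1950, 1950, 2023, 2023],
    "qualifying_position": [None, None, 1.0, None],
    "q1_time": [None, None, None, None],
})

cols = ["qualifying_position", "q1_time"]

# Fraction of missing values per season (the mean of a boolean mask is a share).
missing_by_season = toy[cols].isnull().groupby(toy["season_year"]).mean()
print(missing_by_season)
```

A value of 1.0 means the column is entirely missing for that season, and 0.0 means it is fully populated.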


## 4. Statistics

In [7]:
df.describe()

Unnamed: 0,round_id,circuit_id,season_year,round_number,team_id,driver_id,qualifying_position,grid,finish_position,status,laps_completed,points,fastest_lap_rank,driver_championship_points,driver_championship_position,driver_wins,team_championship_points,team_championship_position,team_wins
count,25873.0,25873.0,25873.0,25873.0,25873.0,25873.0,10850.0,25873.0,25873.0,25873.0,25873.0,25873.0,8986.0,25873.0,18883.0,25873.0,25545.0,20759.0,25545.0
mean,559.639663,31.627527,1992.484946,8.622077,117.847293,624.251691,11.165069,11.711321,11.936729,4.820083,48.908592,2.149402,10.317828,20.676718,9.800614,0.376802,42.010785,5.271786,0.84635
std,325.101898,22.593345,20.5564,5.171198,65.785258,207.519981,6.238976,6.880082,6.878707,5.902237,28.045228,4.524937,6.106787,46.34602,5.985322,1.197644,89.686718,2.989903,2.008827
min,1.0,1.0,1950.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
25%,287.0,7.0,1977.0,4.0,59.0,515.0,6.0,6.0,6.0,0.0,30.0,0.0,5.0,0.0,5.0,0.0,1.0,3.0,0.0
50%,543.0,34.0,1993.0,8.0,131.0,685.0,11.0,11.0,12.0,1.0,53.0,0.0,10.0,4.0,9.0,0.0,10.0,5.0,0.0
75%,839.0,52.0,2010.0,13.0,177.0,783.0,16.0,17.0,17.0,11.0,67.0,2.0,15.0,19.0,14.0,0.0,38.0,8.0,1.0
max,1149.0,77.0,2025.0,24.0,211.0,864.0,28.0,34.0,39.0,30.0,200.0,50.0,24.0,575.0,29.0,19.0,860.0,16.0,21.0
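
Summary statistics for identifier columns such as `round_id` or `driver_id` carry little meaning, because the mean of an ID is not an interpretable quantity. One possible way to restrict the summary to measure columns is sketched below. `describe_measures` and the `toy` frame are hypothetical, and the `_id` suffix rule is an assumption based on the column names shown above.

```python
import pandas as pd

def describe_measures(frame: pd.DataFrame) -> pd.DataFrame:
    """Describe numeric columns, skipping identifier columns ending in '_id'."""
    id_cols = [col for col in frame.columns if col.endswith("_id")]
    return frame.drop(columns=id_cols).describe()

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "round_id": [1, 1, 2],
    "driver_id": [1, 2, 1],
    "grid": [1, 2, 1],
    "points": [9.0, 6.0, 9.0],
})
summary = describe_measures(toy)
print(summary)
```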


## 5. Export Data
Optionally save the loaded dataframe to CSV for manual inspection.

In [8]:
export_csv = True  # Set to False to skip the export
output_path = "../data/processed/jolpica_eda_export.csv"

if export_csv:
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    df.to_csv(output_path, index=False)
    print(f"Data exported to {output_path}")
else:
    print("Export skipped")

Data exported to ../data/processed/jolpica_eda_export.csv
