# Formula One Insights with Python & SQL
Since its inception in the 1950s, Formula One has represented the pinnacle of global motorsport, pushing the boundaries of racing and automotive engineering. This analysis leverages Python and SQL to uncover insights into the achievements of drivers and constructors across F1's decades-long history.

*For a more detailed exploration, please refer to the accompanying PDF document. The comments within this Jupyter notebook are provided exclusively to explain the functionality of the code.*
*The values and data in this Jupyter Notebook were last updated on the 17th of January 2025.*

# Organizing the Notebook into Multiple Parts
Due to the extensive code in this notebook, it has been divided into five parts. The first notebook focuses on retrieving data from the formula1.com website. The second notebook handles data retrieval from the F1DB database. The third notebook is dedicated to creating statistics and visualizations. The fourth notebook explores the question of who is the Greatest Driver of All Time. The fifth notebook consolidates multiple CSV files into separate Excel worksheets.

# Data Retrieval from F1DB

# Importing Python Libraries
This Jupyter notebook is designed to run on most modern Python installations. However, to ensure reproducibility, note that it was developed and tested with Python 3.12.3. The following libraries and their respective versions were used in this analysis:

- pandas 2.2.2
- requests 2.32.2
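
To compare a local environment against the versions listed above, the installed versions can be printed before running the notebook. This is a minimal sketch using only the standard library; the `TESTED` dictionary and the `installed_versions` helper are illustrative names and not part of the original notebook.

```python
import sys
from importlib import metadata

# Versions the notebook was developed and tested with (taken from the list above)
TESTED = {"pandas": "2.2.2", "requests": "2.32.2"}


def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


print("Python", sys.version.split()[0])
for name, version in installed_versions(TESTED).items():
    print(f"{name}: installed {version}, tested with {TESTED[name]}")
```

A differing version does not necessarily mean the notebook will fail; it only marks where results might deviate.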

In [1]:
# Import libraries
import os
import shutil
import unicodedata
import zipfile

import pandas as pd
import requests

print("Libraries imported")

Libraries imported


# Data Collection - Retrieving Formula One Data from F1DB

Scraping race and qualifying results for every round of every season would be excessively time-consuming, as the structure of the F1 website requires loading hundreds of individual pages. Instead, we utilize the excellent resources of F1DB, a comprehensive and free open-source database containing all-time Formula One data and statistics.

The data, updated after the final Grand Prix of 2024, is freely available on GitHub. We retrieve it from the following URL: https://github.com/f1db/f1db/releases/download/v2024.24.2/f1db-csv.zip

*After downloading, we retain only the relevant datasets needed for our analysis and discard the rest.*

In [2]:
# Define the F1DB directory
f1db_dir = "f1db"
os.makedirs(f1db_dir, exist_ok=True)

# Download the ZIP file from GitHub
url = "https://github.com/f1db/f1db/releases/download/v2024.24.2/f1db-csv.zip"
zip_file_path = "f1db-csv.zip"
response = requests.get(url, timeout=60)
response.raise_for_status()  # Stop early if the download failed
with open(zip_file_path, 'wb') as file:
    file.write(response.content)

# Extract the ZIP file
extract_dir = f1db_dir
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)

# Define the files to keep
files_to_keep = {
    "f1db-circuits.csv",
    "f1db-constructors.csv",
    "f1db-drivers.csv",
    "f1db-engine-manufacturers.csv",
    "f1db-races-race-results.csv",
    "f1db-races-starting-grid-positions.csv",
    "f1db-races.csv",
    "f1db-tyre-manufacturers.csv"
}

# Remove unwanted files
for root, _, files in os.walk(extract_dir):
    for file in files:
        if file not in files_to_keep:
            os.remove(os.path.join(root, file))

# Delete the ZIP file
os.remove(zip_file_path)

print("F1 data from F1DB retrieved and extracted to 'f1db' folder")

F1 data from F1DB retrieved and extracted to 'f1db' folder
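
The cell above extracts the whole archive and then deletes the files we do not need. An alternative is to filter the archive members while extracting. The following is a minimal sketch of that idea, not the method used in this notebook: `extract_selected` is a hypothetical helper, and the demo builds a tiny in-memory archive with placeholder contents so that it runs without network access.

```python
import io
import os
import tempfile
import zipfile


def extract_selected(zip_source, wanted, target_dir):
    """Extract only the archive members whose base name is in `wanted`."""
    extracted = []
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            if os.path.basename(member) in wanted:
                archive.extract(member, target_dir)
                extracted.append(member)
    return sorted(extracted)


# Demo archive with placeholder contents (not real F1DB data)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("f1db-drivers.csv", "id,name\n")
    archive.writestr("f1db-seasons.csv", "year\n")
buffer.seek(0)

with tempfile.TemporaryDirectory() as tmp_dir:
    kept = extract_selected(buffer, {"f1db-drivers.csv"}, tmp_dir)
    on_disk = sorted(os.listdir(tmp_dir))

print(kept)     # members that were extracted
print(on_disk)  # files that reached the target folder
```

With this approach no cleanup loop is needed, because unwanted files are never written to disk.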


# Data Cleaning and Preprocessing - Normalizing Text

As with the datasets from formula1.com, some drivers' and constructors' names contain accented or other non-ASCII characters, which may cause issues in later stages of the analysis. To address this, we normalize all text to plain ASCII: each string is decomposed with Unicode NFKD normalization, and any character that cannot be encoded as ASCII (such as a combining accent) is dropped.

In [3]:
# List of CSV files to edit
csv_files = [
    "f1db/f1db-circuits.csv",
    "f1db/f1db-constructors.csv",
    "f1db/f1db-drivers.csv",
    "f1db/f1db-engine-manufacturers.csv",
    "f1db/f1db-races.csv",
    "f1db/f1db-races-race-results.csv",
    "f1db/f1db-races-starting-grid-positions.csv",
    "f1db/f1db-tyre-manufacturers.csv"
]

# Function to normalize text
def normalize_text(text):
    if isinstance(text, str):
        return unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore').decode('ASCII')
    return text

# Process each file
for file_path in csv_files:
    if os.path.exists(file_path):
        data = pd.read_csv(file_path, low_memory=False)

        # Normalize all string columns
        for column in data.columns:
            if data[column].dtype == 'object':
                data[column] = data[column].apply(normalize_text)

        # Save the updated file back
        data.to_csv(file_path, index=False)
        print(f"Processed file: {file_path}")

Processed file: f1db/f1db-circuits.csv
Processed file: f1db/f1db-constructors.csv
Processed file: f1db/f1db-drivers.csv
Processed file: f1db/f1db-engine-manufacturers.csv
Processed file: f1db/f1db-races.csv
Processed file: f1db/f1db-races-race-results.csv
Processed file: f1db/f1db-races-starting-grid-positions.csv
Processed file: f1db/f1db-tyre-manufacturers.csv
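
To make the effect of the normalization concrete, here is a small standalone sketch of the same NFKD-then-ASCII technique. The function name `to_ascii` and the sample strings are chosen for illustration only. Note the last case: a character without a Unicode decomposition is removed entirely rather than replaced by a similar letter.

```python
import unicodedata


def to_ascii(text):
    """Decompose characters (NFKD), then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Räikkönen"))  # Raikkonen: accents are split off and dropped
print(to_ascii("Pérez"))      # Perez
print(to_ascii("Søren"))      # Sren: "ø" has no decomposition, so it disappears
```

If losing such characters matters for a later join or lookup, the affected names would need an explicit replacement table in addition to the normalization.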


# Data Cleaning and Preprocessing - Editing Race Results Dataset

We retain only a subset of the columns from the original dataset, renaming them using PascalCase for consistency. Empty values in the "Points" and "Laps" columns are filled with zeros. Additionally, we manually correct certain constructor names and time values to ensure data accuracy.
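
The name handling in this step works in three stages: a hyphenated identifier is turned into capitalized words, known acronyms are upper-cased, and selected names are mapped to a single constructor name. The sketch below shows the idea on a few values using plain Python. `id_to_name` is a hypothetical helper, the two lookup tables are small subsets of the ones used in the cell below, and the identifier `max-verstappen` is an assumed example.

```python
ACRONYMS = {"Bmw", "Brm", "Rb"}  # subset, for illustration
RENAMES = {"BMW Sauber": "Sauber", "Kick Sauber": "Sauber", "RB": "Toro Rosso"}


def id_to_name(identifier):
    """Convert an identifier such as 'bmw-sauber' into a display name."""
    words = [word.capitalize() for word in identifier.split("-")]
    words = [word.upper() if word in ACRONYMS else word for word in words]
    name = " ".join(words)
    return RENAMES.get(name, name)


for identifier in ["max-verstappen", "bmw-sauber", "rb", "kick-sauber", "mclaren"]:
    print(identifier, "->", id_to_name(identifier))
```

The last identifier comes out as "Mclaren", which is why the notebook applies a separate correction to "McLaren" afterwards.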

In [4]:
# Load the CSV file
input_file = "f1db/f1db-races-race-results.csv"
race_results = pd.read_csv(input_file, low_memory=False)

# Keep specified columns and rename them
columns_to_keep = [
    "year", "round", "positionDisplayOrder", "driverNumber", "driverId", 
    "constructorId", "engineManufacturerId", "tyreManufacturerId", "laps", "gap", 
    "positionText", "points", "time"
]
race_results = race_results[columns_to_keep]

columns_to_rename = {
    "year": "Season",
    "round": "Round",
    "positionDisplayOrder": "Position",
    "driverNumber": "Number",
    "driverId": "Driver",
    "constructorId": "Constructor",
    "engineManufacturerId": "Engine",
    "tyreManufacturerId": "Tyre",
    "laps": "Laps",
    "gap": "Time",
    "positionText": "PositionText",
    "points": "Points",
    "time": "OriginalTime"
}
race_results = race_results.rename(columns=columns_to_rename)

# Handle "Retired" column based on "positionText"
race_results["Retired"] = race_results["PositionText"].apply(
    lambda x: x if not x.isdigit() else None
)

# Fill empty Points with 0 (note: astype(int) truncates any fractional points)
race_results["Points"] = race_results["Points"].fillna(0).astype(int)

# Fill empty Laps with 0
race_results["Laps"] = race_results["Laps"].fillna(0).astype(int)

# Standardize names in "Driver", "Constructor", "Engine", and "Tyre"
def format_name(name):
    return " ".join(word.capitalize() for word in name.split("-"))

race_results["Driver"] = race_results["Driver"].apply(format_name)
race_results["Constructor"] = race_results["Constructor"].apply(format_name)
race_results["Engine"] = race_results["Engine"].apply(format_name)
race_results["Tyre"] = race_results["Tyre"].apply(format_name)

# Drop intermediate columns
race_results = race_results.drop(columns=["PositionText"], errors="ignore")

# Manually correct some of the constructor and engine entries

# Define a list of words to capitalize
capitalize_words = [
    "Ats", "Bmw", "Bpm", "Brm", "Bwt", "Emw", "Era", "Jap", "Osca", "Rbpt", "Tag",
    "Afm", "Ags", "Bar", "Brp", "Enb", "Hrt", "Hwm", "Jbw", "Lds", "Lec", "Ram", "Rb"
]

# Upper-case these words in the "Constructor" and "Engine" columns
def capitalize_team_names(car):
    words = car.split()
    return " ".join(word.upper() if word in capitalize_words else word for word in words)

race_results["Constructor"] = race_results["Constructor"].apply(capitalize_team_names)
race_results["Engine"] = race_results["Engine"].apply(capitalize_team_names)

# Manually edit constructor names
constructor_replacements = {
    "Alphatauri": "Toro Rosso",
    "Alpine": "Renault",
    "ATS Wheels": "ATS",
    "BMW Sauber": "Sauber",
    "Footwork": "Arrows",
    "Frank Williams Racing Cars": "Williams",
    "Iso Marlboro": "Williams",
    "Kick Sauber": "Sauber",
    "Lotus F1": "Caterham",
    "Lotus Racing": "Lotus",
    "Racing Point": "Force India",
    "RB": "Toro Rosso",
    "Simca Gordini": "Simca-Gordini",
    "Talbot Lago": "Talbot-Lago",
    "Wolf Williams": "Williams"
}

# Replace values in the "Constructor" column
race_results["Constructor"] = race_results["Constructor"].replace(constructor_replacements)

# Manually edit engine names
engine_replacements = {
    "BWT Mercedes": "Mercedes",
    "Honda RBPT": "RBPT",
    "Mugen Honda": "Honda",
    "Petronas": "Ferrari",
    "Sauber": "Mercedes",
    "TAG Heuer": "Renault",
}

# Replace values in the "Engine" column
race_results["Engine"] = race_results["Engine"].replace(engine_replacements)

# Create the "Lapped" column
def extract_lapped(gap_value):
    if pd.notna(gap_value) and "lap" in gap_value:
        return int(gap_value.split()[0][1:])
    return 0

race_results["Lapped"] = race_results["Time"].apply(extract_lapped)

# Directly copy values from the OriginalTime column to the Time column
race_results["Time"] = race_results["OriginalTime"]

# Manually Correct some of the Time entries

# Dictionary for specific manual corrections
manual_corrections = {
    "4:00:1.150": "4:00:01.150",
    "4:00:7.150": "4:00:07.150",
    "2:00:2.699": "2:00:02.699",
    "2:00:2.600": "2:00:02.600",
    "2:00:3.882": "2:00:03.882",
    "2:00:4.803": "2:00:04.803",
    "2:00:4.520": "2:00:04.520",
    "2:00:5.995": "2:00:05.995",
    "2:00:6.370": "2:00:06.370",
    "2:00:2.206": "2:00:02.206",
    "2:00:4.287": "2:00:04.287",
    "2:00:6.291": "2:00:06.291",
    "2:00:0.189": "2:00:00.189",
    "3:00:6.883": "3:00:06.883",
    "3:00:7.195": "3:00:07.195"
}

# Function to correct time values
def correct_time_format(time_value):
    if pd.isna(time_value):
        return time_value
    time_str = str(time_value)
    if len(time_str) == 8:
        return f"0:0{time_str}"
    elif len(time_str) == 9:
        return f"0:{time_str}"
    elif time_str in manual_corrections:
        return manual_corrections[time_str]
    return time_str

# Apply the corrections
race_results['Time'] = race_results['Time'].apply(correct_time_format)

# Correct "Carlos Sainz" name
race_results['Driver'] = race_results['Driver'].replace("Carlos Sainz Jr", "Carlos Sainz")

# Correct "McLaren" name
race_results.replace(to_replace="Mclaren", value="McLaren", regex=True, inplace=True)

# Reorder columns
column_order = [
    "Season", "Round", "Position", "Number", "Driver", "Constructor", "Engine", "Tyre",
    "Laps", "Time", "Lapped", "Retired", "Points"
]
race_results = race_results[column_order]

# Sort the dataset by Season, Round, and Position
race_results = race_results.sort_values(by=["Season", "Round", "Position"]).reset_index(drop=True)

# Save the new dataset (create the output folder if it does not exist yet)
os.makedirs("csv", exist_ok=True)
output_file = "csv/race_results.csv"
race_results.to_csv(output_file, index=False, encoding="utf-8")

print(f"Processed data saved to {output_file}")

Processed data saved to csv/race_results.csv
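
The cell above derives the "Lapped" column from gap values and pads race times to a common layout, partly through a dictionary of manual corrections. As a hedged alternative, the padding can be done by splitting the time into its fields. This sketch is not the notebook's implementation: `laps_behind` and `pad_race_time` are hypothetical helpers, and they assume gaps of the form "+2 laps" and times with at most three colon-separated fields.

```python
def laps_behind(gap):
    """Return how many laps behind the winner a gap such as '+2 laps' represents."""
    if isinstance(gap, str) and "lap" in gap:
        return int(gap.split()[0].lstrip("+"))
    return 0  # time gaps and missing values count as zero laps behind


def pad_race_time(value):
    """Pad a race time such as '58:12.345' or '2:00:2.699' to H:MM:SS.mmm."""
    *front, seconds = value.split(":")
    front = ["0"] * (2 - len(front)) + front  # add missing hours/minutes fields
    hours, minutes = front
    whole, dot, fraction = seconds.partition(".")
    return f"{int(hours)}:{int(minutes):02d}:{int(whole):02d}{dot}{fraction}"


print(laps_behind("+2 laps"))       # 2
print(laps_behind("+5.123"))        # 0
print(pad_race_time("2:00:2.699"))  # 2:00:02.699
print(pad_race_time("58:12.345"))   # 0:58:12.345
```

Because every field is padded individually, single-digit seconds do not have to be enumerated by hand.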


# Data Cleaning and Preprocessing - Editing Qualifying Results Dataset

In [5]:
# Load the CSV file
input_file = "f1db/f1db-races-starting-grid-positions.csv"
qualifying_results = pd.read_csv(input_file, low_memory=False)

# Keep specified columns and rename them
columns_to_keep = [
    "year", "round", "positionDisplayOrder", "driverNumber", "driverId", 
    "constructorId", "engineManufacturerId", "tyreManufacturerId", "time"
]
qualifying_results = qualifying_results[columns_to_keep]

columns_to_rename = {
    "year": "Season",
    "round": "Round",
    "positionDisplayOrder": "Position",
    "driverNumber": "Number",
    "driverId": "Driver",
    "constructorId": "Constructor",
    "engineManufacturerId": "Engine",
    "tyreManufacturerId": "Tyre",
    "time": "Time"
}
qualifying_results = qualifying_results.rename(columns=columns_to_rename)

# Standardize names in "Driver", "Constructor", "Engine", and "Tyre"
def format_name(name):
    return " ".join(word.capitalize() for word in name.split("-"))

qualifying_results["Driver"] = qualifying_results["Driver"].apply(format_name)
qualifying_results["Constructor"] = qualifying_results["Constructor"].apply(format_name)
qualifying_results["Engine"] = qualifying_results["Engine"].apply(format_name)
qualifying_results["Tyre"] = qualifying_results["Tyre"].apply(format_name)

# Drop intermediate columns
qualifying_results = qualifying_results.drop(columns=["positionText"], errors="ignore")

# Manually correct some of the constructor and engine entries

# Define a list of words to capitalize
capitalize_words = [
    "Ats", "Bmw", "Bpm", "Brm", "Bwt", "Emw", "Era", "Jap", "Osca", "Rbpt", "Tag",
    "Afm", "Ags", "Bar", "Brp", "Enb", "Hrt", "Hwm", "Jbw", "Lds", "Lec", "Ram", "Rb"
]

# Upper-case these words in the "Constructor" and "Engine" columns
def capitalize_team_names(car):
    words = car.split()
    return " ".join(word.upper() if word in capitalize_words else word for word in words)

qualifying_results["Constructor"] = qualifying_results["Constructor"].apply(capitalize_team_names)
qualifying_results["Engine"] = qualifying_results["Engine"].apply(capitalize_team_names)

# Manually edit constructor names
constructor_replacements = {
    "Alphatauri": "Toro Rosso",
    "Alpine": "Renault",
    "ATS Wheels": "ATS",
    "BMW Sauber": "Sauber",
    "Footwork": "Arrows",
    "Frank Williams Racing Cars": "Williams",
    "Iso Marlboro": "Williams",
    "Kick Sauber": "Sauber",
    "Lotus F1": "Caterham",
    "Lotus Racing": "Lotus",
    "Racing Point": "Force India",
    "RB": "Toro Rosso",
    "Simca Gordini": "Simca-Gordini",
    "Talbot Lago": "Talbot-Lago",
    "Wolf Williams": "Williams"
}

# Replace values in the "Constructor" column
qualifying_results["Constructor"] = qualifying_results["Constructor"].replace(constructor_replacements)

# Manually edit engine names
engine_replacements = {
    "BWT Mercedes": "Mercedes",
    "Honda RBPT": "RBPT",
    "Mugen Honda": "Honda",
    "Petronas": "Ferrari",
    "Sauber": "Mercedes",
    "TAG Heuer": "Renault",
}

# Replace values in the "Engine" column
qualifying_results["Engine"] = qualifying_results["Engine"].replace(engine_replacements)

# Manually Correct some of the Time entries

# Function to manually correct time values
def correct_time_format(time_value):
    if pd.isna(time_value):
        return time_value
    time_str = str(time_value)
    if len(time_str) == 4:
        return f"0:{time_str}00"
    elif len(time_str) == 5:
        return f"0:{time_str}0"
    elif len(time_str) == 6:
        return f"0:{time_str}"
    return time_str

# Apply the corrections
qualifying_results['Time'] = qualifying_results['Time'].apply(correct_time_format)

# Correct "Carlos Sainz" name
qualifying_results['Driver'] = qualifying_results['Driver'].replace("Carlos Sainz Jr", "Carlos Sainz")

# Correct "McLaren" name
qualifying_results.replace(to_replace="Mclaren", value="McLaren", regex=True, inplace=True)

# Reorder columns
column_order = [
    "Season", "Round", "Position", "Number", "Driver", "Constructor", "Engine", "Tyre", "Time"
]
qualifying_results = qualifying_results[column_order]

# Sort the dataset by Season, Round, and Position
qualifying_results = qualifying_results.sort_values(by=["Season", "Round", "Position"]).reset_index(drop=True)

# Save the new dataset
output_file = "csv/qualifying_results.csv"
qualifying_results.to_csv(output_file, index=False, encoding="utf-8")

print(f"Processed data saved to {output_file}")

Processed data saved to csv/qualifying_results.csv
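
The time correction in the cell above pads lap times that lack a minutes field or have fewer than three decimals, based on the length of the string. The sketch below reaches the same layout by parsing the fields instead of counting characters. `pad_lap_time` is a hypothetical helper and assumes values of the form "SS.f" or "M:SS.f".

```python
def pad_lap_time(value):
    """Normalize a lap time to M:SS.mmm, for example '58.79' -> '0:58.790'."""
    minutes, separator, seconds = value.rpartition(":")
    if not separator:
        minutes = "0"  # no minutes field present
    whole, _, fraction = seconds.partition(".")
    return f"{int(minutes)}:{int(whole):02d}.{fraction.ljust(3, '0')}"


print(pad_lap_time("58.7"))      # 0:58.700
print(pad_lap_time("58.79"))     # 0:58.790
print(pad_lap_time("1:23.456"))  # 1:23.456
```

Unlike the length-based version, this variant also pads times that already have a minutes field but fewer than three decimals.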


# Data Cleaning and Preprocessing - Editing Drivers Dataset

In [6]:
# Load the CSV file
input_file = "f1db/f1db-drivers.csv"
drivers = pd.read_csv(input_file)

# Keep specified columns
columns_to_keep = [
    "name", "abbreviation", "nationalityCountryId"
]
drivers = drivers[columns_to_keep]

# Rename the columns
rename_columns = {
    "name": "Name",
    "abbreviation": "Abbreviation",
    "nationalityCountryId": "Country"
}
drivers = drivers.rename(columns=rename_columns)

# Format the Country column to replace dashes with spaces and capitalize the first letters
drivers["Country"] = drivers["Country"].str.replace("-", " ").str.title()

# Correct "Carlos Sainz" name
drivers['Name'] = drivers['Name'].replace("Carlos Sainz Jr.", "Carlos Sainz")

# Correct "McLaren" name
drivers.replace(to_replace="Mclaren", value="McLaren", regex=True, inplace=True)

# Save the modified DataFrame to a new CSV file
output_file = "csv/drivers.csv"
drivers.to_csv(output_file, index=False, encoding="utf-8")

print(f"Processed data saved to {output_file}")

Processed data saved to csv/drivers.csv


# Data Cleaning and Preprocessing - Editing Constructors Dataset

In [7]:
# Load the CSV file
input_file = "f1db/f1db-constructors.csv"
constructors = pd.read_csv(input_file)

# Keep specified columns
columns_to_keep = [
    "name", "fullName", "countryId"
]
constructors = constructors[columns_to_keep]

# Rename the columns
rename_columns = {
    "name": "Name",
    "fullName": "FullName",
    "countryId": "Country"
}
constructors = constructors.rename(columns=rename_columns)

# Format the Country column to replace dashes with spaces and capitalize the first letters
constructors["Country"] = constructors["Country"].str.replace("-", " ").str.title()

# Correct "McLaren" name
constructors.replace(to_replace="Mclaren", value="McLaren", regex=True, inplace=True)

# Save the modified DataFrame to a new CSV file
output_file = "csv/constructors.csv"
constructors.to_csv(output_file, index=False, encoding="utf-8")

print(f"Processed data saved to {output_file}")

Processed data saved to csv/constructors.csv


# Data Cleaning and Preprocessing - Editing Engine Manufacturers Dataset

In [8]:
# Load the CSV file
input_file = "f1db/f1db-engine-manufacturers.csv"
engine_manufacturers = pd.read_csv(input_file)

# Keep specified columns
columns_to_keep = [
    "name", "countryId"
]
engine_manufacturers = engine_manufacturers[columns_to_keep]

# Rename the columns
rename_columns = {
    "name": "Name",
    "countryId": "Country"
}
engine_manufacturers = engine_manufacturers.rename(columns=rename_columns)

# Format the Country column to replace dashes with spaces and capitalize the first letters
engine_manufacturers["Country"] = engine_manufacturers["Country"].str.replace("-", " ").str.title()

# Save the modified DataFrame to a new CSV file
output_file = "csv/engine_manufacturers.csv"
engine_manufacturers.to_csv(output_file, index=False, encoding="utf-8")

print(f"Processed data saved to {output_file}")

Processed data saved to csv/engine_manufacturers.csv


# Data Cleaning and Preprocessing - Editing Tyre Manufacturers Dataset

In [9]:
# Load the CSV file
input_file = "f1db/f1db-tyre-manufacturers.csv"
tyre_manufacturers = pd.read_csv(input_file)

# Keep the specified columns
columns_to_keep = [
    "name", "countryId"
]
tyre_manufacturers = tyre_manufacturers[columns_to_keep]

# Rename the columns
rename_columns = {
    "name": "Name",
    "countryId": "Country",
}
tyre_manufacturers = tyre_manufacturers.rename(columns=rename_columns)

# Format the Country column to replace dashes with spaces and capitalize the first letters
tyre_manufacturers["Country"] = tyre_manufacturers["Country"].str.replace("-", " ").str.title()

# Save the modified DataFrame to a new CSV file
output_file = "csv/tyre_manufacturers.csv"
tyre_manufacturers.to_csv(output_file, index=False, encoding="utf-8")

print(f"Processed data saved to {output_file}")

Processed data saved to csv/tyre_manufacturers.csv


# Data Cleaning and Preprocessing - Editing Circuits Dataset

In [10]:
# Load the CSV file
input_file = "f1db/f1db-circuits.csv"
circuits = pd.read_csv(input_file)

# Remove the specified columns
columns_to_remove = ["id", "previousNames", "latitude", "longitude", "totalRacesHeld"]
circuits = circuits.drop(columns=columns_to_remove, errors="ignore")

# Rename the columns
rename_columns = {
    "name": "Name",
    "fullName": "FullName",
    "type": "Type",
    "placeName": "Location",
    "countryId": "Country"
}
circuits = circuits.rename(columns=rename_columns)

# Adjust the Type column to capitalize only the first letter
circuits["Type"] = circuits["Type"].str.capitalize()

# Format the Country column to replace dashes with spaces and capitalize the first letters
circuits["Country"] = circuits["Country"].str.replace("-", " ").str.title()

# Save the modified DataFrame to a new CSV file
output_file = "csv/circuits.csv"
circuits.to_csv(output_file, index=False, encoding="utf-8")

print(f"Processed data saved to {output_file}")

Processed data saved to csv/circuits.csv


# Data Cleaning and Preprocessing - Editing Races Dataset

We apply manual corrections to a few Grand Prix names to ensure consistency and accuracy across the dataset.

In [11]:
# Load the CSV file
input_file = "f1db/f1db-races.csv"
races = pd.read_csv(input_file)

# Remove the specified columns
columns_to_remove = ["id", "time", "qualifyingFormat", "sprintQualifyingFormat", 
                     "scheduledLaps", "scheduledDistance", "preQualifyingDate", "preQualifyingTime", 
                     "freePractice1Date", "freePractice1Time", "freePractice2Date", "freePractice2Time", 
                     "freePractice3Date", "freePractice3Time", "freePractice4Date", "freePractice4Time", 
                     "qualifying1Date", "qualifying1Time", "qualifying2Date", "qualifying2Time", 
                     "qualifyingDate", "qualifyingTime", "sprintQualifyingDate", "sprintQualifyingTime", 
                     "sprintRaceDate", "sprintRaceTime", "warmingUpDate", "warmingUpTime"
]
races = races.drop(columns=columns_to_remove, errors="ignore")

# Rename the columns
rename_columns = {
    "year": "Season",
    "round": "Round",
    "date": "Date",
    "grandPrixId": "GrandPrix",
    "officialName": "OfficialName",
    "circuitId": "Circuit",
    "circuitType": "Type",
    "courseLength": "CircuitLength",
    "laps": "Laps",
    "distance": "Distance",
}
races = races.rename(columns=rename_columns)

# Adjust the Type column to capitalize only the first letter
races["Type"] = races["Type"].str.capitalize()

# Format columns to replace dashes with spaces and capitalize the first letters
races["GrandPrix"] = races["GrandPrix"].str.replace("-", " ").str.title()
races["Circuit"] = races["Circuit"].str.replace("-", " ").str.title()

# Correct grand prix names
races['GrandPrix'] = races['GrandPrix'].replace("Emilia Romagna", "Emilia-Romagna")
races['GrandPrix'] = races['GrandPrix'].replace("Caesars Palace", "Caesar's Palace")

# Correct circuit names
races['Circuit'] = races['Circuit'].replace("Ain Diab", "Ain-Diab")
races['Circuit'] = races['Circuit'].replace("Dijon Prenois", "Dijon-Prenois")
races['Circuit'] = races['Circuit'].replace("Magny Cours", "Magny-Cours")
races['Circuit'] = races['Circuit'].replace("Mont Tremblant", "Mont-Tremblant")
races['Circuit'] = races['Circuit'].replace("Nivelles Baulers", "Nivelles-Baulers")
races['Circuit'] = races['Circuit'].replace("Rouen", "Rouen-Les-Essarts")
races['Circuit'] = races['Circuit'].replace("Spa Francorchamps", "Spa-Francorchamps")

# Reorder columns
column_order = [
    "Season", "Round", "Date", "GrandPrix", "Circuit", "OfficialName", "Type", "Laps", "Distance", "CircuitLength"
]
races = races[column_order]

# Sort the dataset
races = races.sort_values(by=["Season", "Round"]).reset_index(drop=True)

# Save the modified DataFrame to a new CSV file
output_file = "csv/races.csv"
races.to_csv(output_file, index=False, encoding="utf-8")

print(f"Processed data saved to {output_file}")

Processed data saved to csv/races.csv
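
The manual corrections are needed because replacing every dash with a space also removes hyphens that belong to the name. The sketch below illustrates this with a single lookup table in place of repeated replacements. `circuit_name` is a hypothetical helper, the table is a subset of the corrections above, and the identifier `monza` is an assumed example.

```python
CIRCUIT_FIXES = {
    "Magny Cours": "Magny-Cours",
    "Rouen": "Rouen-Les-Essarts",
    "Spa Francorchamps": "Spa-Francorchamps",
}


def circuit_name(identifier):
    """Title-case an identifier, then restore names that need a hyphen."""
    title = identifier.replace("-", " ").title()
    return CIRCUIT_FIXES.get(title, title)  # only whole values are replaced


for identifier in ["spa-francorchamps", "magny-cours", "rouen", "monza"]:
    print(identifier, "->", circuit_name(identifier))
```

Matching on the whole value keeps the "Rouen" correction from touching any other name that merely contains that word.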


# Data Cleaning and Preprocessing - Manually Changing "United States Of America" to "United States of America"

Title-casing the country identifiers capitalizes every word, which produces "United States Of America". We restore the lowercase "of" so that the country name matches the datasets retrieved from the F1 website and stays consistent across all subsequent datasets.

In [12]:
# List of CSV files to update
csv_files = [
    "csv/circuits.csv",
    "csv/constructors.csv",
    "csv/drivers.csv",
    "csv/engine_manufacturers.csv",
    "csv/qualifying_results.csv",
    "csv/race_results.csv",
    "csv/races.csv",
    "csv/tyre_manufacturers.csv",
]

# Function to replace "United States Of America" with "United States of America"
def update_us_name(file_path):
    # Read the CSV file
    df = pd.read_csv(file_path)
    
    # Replace "United States Of America" with "United States of America"
    df.replace(to_replace="United States Of America", value="United States of America", inplace=True)
    
    # Save the updated dataframe back to the same CSV file
    df.to_csv(file_path, index=False)
    print(f"Updated file: {file_path}")

# Apply the function to each file
for file in csv_files:
    update_us_name(file)

Updated file: csv/circuits.csv
Updated file: csv/constructors.csv
Updated file: csv/drivers.csv
Updated file: csv/engine_manufacturers.csv
Updated file: csv/qualifying_results.csv
Updated file: csv/race_results.csv
Updated file: csv/races.csv
Updated file: csv/tyre_manufacturers.csv
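
The correction above is required because title-casing capitalizes every word, including "of". The sketch below reproduces the effect and shows one possible way to avoid it at the formatting stage. `country_name` and the `MINOR_WORDS` set are assumptions made for illustration, not part of the notebook.

```python
MINOR_WORDS = {"of", "and", "the"}  # assumption: words to keep in lowercase


def country_name(identifier):
    """Capitalize each word of an identifier except minor words after the first."""
    words = identifier.split("-")
    return " ".join(
        word if index > 0 and word in MINOR_WORDS else word.capitalize()
        for index, word in enumerate(words)
    )


print("united-states-of-america".replace("-", " ").title())  # United States Of America
print(country_name("united-states-of-america"))              # United States of America
```

Handling the exception while formatting would make the later pass over all CSV files unnecessary, at the cost of maintaining the word list.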


# Data Cleaning and Preprocessing - Deleting the F1DB Folder

In [13]:
# Delete the F1DB Folder
if os.path.exists("f1db"):
    shutil.rmtree("f1db")
    print("Folder 'f1db' has been removed")
else:
    print("Folder 'f1db' does not exist")

Folder 'f1db' has been removed
