# 📊 Building the Medallion Architecture
*Create a complete F1 analytics pipeline using the data lake design pattern*

---

## 🎯 What We'll Build

In this notebook, we'll build a complete data pipeline following the medallion architecture pattern:

1. **Bronze Layer**: Raw data ingestion from Volume
2. **Silver Layer**: Clean and transform the data
3. **Gold Layer**: Create analytics-ready tables

We'll use the F1 2025 season dataset that we prepared in the setup notebook to analyze race, qualifying, and sprint results.

**Let's get started!** 🚀

## 📋 Prerequisites

Before running this notebook, make sure you have:
1. Run the `00_Setup_Data.ipynb` notebook to download the F1 data into the Volume
2. Attached this notebook to a cluster running Spark 3.3+; the `read_files` function used below additionally needs Databricks Runtime 13.3 LTS or later

Let's first verify that our data is available in the Volume:
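
One way to run this check from a Python cell is sketched below. It assumes the Volume is reachable as a regular directory at `/Volumes/main/default/formula1` (the path used by the queries in this notebook) and that the setup notebook downloaded the four CSV files those queries read. The helper name `missing_files` is made up for this example.

```python
from pathlib import Path

# Assumption: these are the four CSV files that the cells in this notebook read.
EXPECTED_FILES = [
    "Formula1_2025Season_RaceResults.csv",
    "Formula1_2025Season_QualifyingResults.csv",
    "Formula1_2025Season_SprintResults.csv",
    "Formula1_2025Season_SprintQualifyingResults.csv",
]


def missing_files(volume_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in volume_dir."""
    root = Path(volume_dir)
    return [name for name in expected if not (root / name).is_file()]
```

Calling `missing_files("/Volumes/main/default/formula1")` should return an empty list once the setup notebook has finished; any names it returns are files that still need to be downloaded.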

## 1️⃣ Bronze Layer: Raw Data Ingestion

The Bronze layer holds the raw data ingested from source systems with minimal transformation.
We'll create Delta tables from the CSV files in our Volume and load them with `INSERT INTO ... SELECT` over the `read_files` table function.

Let's start with the race results table, then load the qualifying and sprint data:

In [0]:
%sql
-- Bronze Layer: Create tables for F1 2025 season data

-- Create bronze_race_results with the columns (and column order) that the
-- INSERT into this table writes and that the Silver layer reads
CREATE TABLE IF NOT EXISTS main.default.bronze_race_results (
  position STRING,
  driver STRING,
  team STRING,
  points STRING,
  fastest_lap STRING,
  race STRING,
  date STRING
) USING DELTA
TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name'
);

In [0]:
%sql
DESCRIBE VOLUME main.default.formula1

In [0]:
%sql
-- Insert data into the existing bronze_race_results table
INSERT INTO main.default.bronze_race_results
SELECT 
  Position as position,
  Driver as driver,
  Team as team,
  CAST(Points AS STRING) as points,
  `Fastest Lap Time` as fastest_lap,
  Track as race,
  CAST(current_date() AS STRING) as date
FROM read_files(
  '/Volumes/main/default/formula1/Formula1_2025Season_RaceResults.csv',
  format => 'csv',
  header => true
);

-- Check the loaded data
SELECT COUNT(*) as total_records FROM main.default.bronze_race_results;
SELECT race, position, driver, team, points FROM main.default.bronze_race_results LIMIT 10;

In [0]:
%sql
-- Check qualifying results CSV structure first
SELECT * FROM read_files(
  '/Volumes/main/default/formula1/Formula1_2025Season_QualifyingResults.csv',
  format => 'csv',
  header => true
) LIMIT 3;

-- Create and load qualifying results table
CREATE TABLE IF NOT EXISTS main.default.bronze_qualifying_results (
  race STRING,
  position STRING,
  driver STRING,
  team STRING,
  q1_time STRING,
  q2_time STRING,
  q3_time STRING,
  date STRING
) USING DELTA;

-- Insert qualifying data
INSERT INTO main.default.bronze_qualifying_results
SELECT 
  Track as race,
  Position as position,
  Driver as driver,
  Team as team,
  Q1 as q1_time,
  Q2 as q2_time,
  Q3 as q3_time,
  CAST(current_date() AS STRING) as date
FROM read_files(
  '/Volumes/main/default/formula1/Formula1_2025Season_QualifyingResults.csv',
  format => 'csv',
  header => true
);

-- Check the data
SELECT COUNT(*) as total_records FROM main.default.bronze_qualifying_results;
SELECT * FROM main.default.bronze_qualifying_results LIMIT 5;

In [0]:
%sql
-- Check sprint results CSV structure
SELECT * FROM read_files(
  '/Volumes/main/default/formula1/Formula1_2025Season_SprintResults.csv',
  format => 'csv',
  header => true
) LIMIT 3;

-- Check sprint qualifying CSV structure
SELECT * FROM read_files(
  '/Volumes/main/default/formula1/Formula1_2025Season_SprintQualifyingResults.csv',
  format => 'csv',
  header => true
) LIMIT 3;

In [0]:
%sql
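-- Create the sprint qualifying table if it does not exist yet
-- (column order matches the SELECT that loads it)
CREATE TABLE IF NOT EXISTS main.default.bronze_sprint_qualifying (
  position STRING,
  driver STRING,
  team STRING,
  sq1_time STRING,
  sq2_time STRING,
  sq3_time STRING,
  race STRING,
  date STRING
) USING DELTA;

-- Insert sprint qualifying data
INSERT INTO main.default.bronze_sprint_qualifying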
SELECT 
  CAST(Position AS STRING) as position,
  Driver as driver,
  Team as team,
  Q1 as sq1_time,
  Q2 as sq2_time,
  Q3 as sq3_time,
  Track as race,
  CAST(current_date() AS STRING) as date
FROM read_files(
  '/Volumes/main/default/formula1/Formula1_2025Season_SprintQualifyingResults.csv',
  format => 'csv',
  header => true
);

SELECT COUNT(*) as records FROM main.default.bronze_sprint_qualifying;
SELECT * FROM main.default.bronze_sprint_qualifying LIMIT 5;

In [0]:
%sql
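-- Create the sprint results table if it does not exist yet
-- (column order matches the SELECT that loads it)
CREATE TABLE IF NOT EXISTS main.default.bronze_sprint_results (
  position STRING,
  driver STRING,
  team STRING,
  points STRING,
  fastest_lap STRING,
  race STRING,
  date STRING
) USING DELTA;

-- Insert sprint results data
INSERT INTO main.default.bronze_sprint_results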
SELECT 
  Position as position,
  Driver as driver,
  Team as team,
  CAST(Points AS STRING) as points,
  `Time/Retired` as fastest_lap,  -- Using time/retired as fastest_lap placeholder
  Track as race,
  CAST(current_date() AS STRING) as date
FROM read_files(
  '/Volumes/main/default/formula1/Formula1_2025Season_SprintResults.csv',
  format => 'csv',
  header => true
);

SELECT COUNT(*) as records FROM main.default.bronze_sprint_results;
SELECT * FROM main.default.bronze_sprint_results LIMIT 5;

In [0]:
%sql
-- Silver Layer: Clean and transform race results
CREATE TABLE IF NOT EXISTS main.default.silver_race_results
USING DELTA
AS
SELECT 
  race,
  TRY_CAST(position AS INT) as finish_position,
  driver,
  team,
  CAST(points AS INT) as points_earned,
  fastest_lap as fastest_lap_time,
  date as race_date,  -- date the row was loaded into Bronze, not the date of the race
  -- Add derived columns
  CASE 
    WHEN TRY_CAST(position AS INT) = 1 THEN 'Winner'
    WHEN TRY_CAST(position AS INT) <= 3 THEN 'Podium'
    WHEN TRY_CAST(position AS INT) <= 10 THEN 'Points'
    ELSE 'No Points'
  END as result_category,
  current_timestamp() as processed_at
FROM main.default.bronze_race_results
-- Keep classified finishers only: TRY_CAST returns NULL for missing or
-- non-numeric positions such as 'DNF' and 'DNS'
WHERE TRY_CAST(position AS INT) IS NOT NULL;

SELECT COUNT(*) as total_records FROM main.default.silver_race_results;
SELECT * FROM main.default.silver_race_results ORDER BY finish_position LIMIT 10;
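
To make the Silver rules easier to reason about, here is a minimal pure-Python sketch of the same logic: parse the position the way the `TRY_CAST` filter does, then bucket it like the `CASE` expression. The function names are made up for this example, and the sketch runs on plain dictionaries rather than Spark tables.

```python
def parse_position(raw):
    """Return the position as an int, or None when it is missing or not numeric.

    Plays the role of TRY_CAST(position AS INT): 'DNF', 'DNS', etc. become None.
    """
    if raw is None:
        return None
    text = str(raw).strip()
    return int(text) if text.isascii() and text.isdigit() else None


def result_category(position):
    """Bucket a numeric finishing position like the CASE expression does."""
    if position == 1:
        return "Winner"
    if position <= 3:
        return "Podium"
    if position <= 10:
        return "Points"
    return "No Points"


def clean_rows(rows):
    """Drop rows without a numeric position and add the derived columns."""
    cleaned = []
    for row in rows:
        position = parse_position(row.get("position"))
        if position is None:
            continue  # same effect as the WHERE clause in the SQL
        cleaned.append({
            **row,
            "finish_position": position,
            "result_category": result_category(position),
        })
    return cleaned
```

For example, `clean_rows([{"position": "2"}, {"position": "DNF"}])` keeps only the first row and tags it as `Podium`.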

In [0]:
%sql
-- Silver Layer: Clean and transform qualifying results
CREATE TABLE IF NOT EXISTS main.default.silver_qualifying_results
USING DELTA
AS
SELECT 
  race,
  TRY_CAST(position AS INT) as qualifying_position,
  driver,
  team,
  q1_time,
  q2_time,
  q3_time,
  date as qualifying_date,  -- date the row was loaded into Bronze, not the date of the session
  -- Add derived columns
  CASE 
    WHEN TRY_CAST(position AS INT) = 1 THEN 'Pole Position'
    WHEN TRY_CAST(position AS INT) = 2 THEN 'Front Row'  -- the front row is grid slots 1 and 2
    WHEN TRY_CAST(position AS INT) <= 10 THEN 'Q3'
    WHEN TRY_CAST(position AS INT) <= 15 THEN 'Q2'
    ELSE 'Q1 Only'
  END as qualifying_category,
  current_timestamp() as processed_at
FROM main.default.bronze_qualifying_results
-- TRY_CAST returns NULL for missing or non-numeric positions
WHERE TRY_CAST(position AS INT) IS NOT NULL;

SELECT COUNT(*) as total_records FROM main.default.silver_qualifying_results;
SELECT * FROM main.default.silver_qualifying_results ORDER BY qualifying_position LIMIT 10;

In [0]:
%sql
-- Check what's in the bronze qualifying table
SELECT COUNT(*) as total_records FROM main.default.bronze_qualifying_results;
SELECT * FROM main.default.bronze_qualifying_results LIMIT 5;

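-- Load any rows that are still missing (safe to re-run).
-- Column order matches the table definition:
-- race, position, driver, team, q1_time, q2_time, q3_time, date
INSERT INTO main.default.bronze_qualifying_results
SELECT 
  src.Track as race,
  CAST(src.Position AS STRING) as position,
  src.Driver as driver,
  src.Team as team,
  src.Q1 as q1_time,
  src.Q2 as q2_time,
  src.Q3 as q3_time,
  CAST(current_date() AS STRING) as date
FROM read_files(
  '/Volumes/main/default/formula1/Formula1_2025Season_QualifyingResults.csv',
  format => 'csv',
  header => true
) AS src
WHERE NOT EXISTS (
  -- Qualify both sides: unqualified names resolve case-insensitively,
  -- so `driver = Driver` would compare a column with itself
  SELECT 1 FROM main.default.bronze_qualifying_results AS tgt
  WHERE tgt.driver = src.Driver AND tgt.race = src.Track
);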

In [0]:
%sql
-- Gold Layer: Driver Championship Standings
CREATE TABLE IF NOT EXISTS main.default.gold_driver_championship
USING DELTA
AS
SELECT 
  driver,
  team,
  COUNT(*) as races_entered,  -- classified finishes only: Silver drops DNF/DNS rows
  SUM(points_earned) as total_points,
  SUM(CASE WHEN finish_position = 1 THEN 1 ELSE 0 END) as wins,
  SUM(CASE WHEN finish_position <= 3 THEN 1 ELSE 0 END) as podiums,
  SUM(CASE WHEN finish_position <= 10 THEN 1 ELSE 0 END) as points_finishes,
  ROUND(AVG(CAST(finish_position AS DOUBLE)), 2) as avg_finish_position,
  ROUND(SUM(points_earned) * 1.0 / COUNT(*), 2) as points_per_race,
  current_timestamp() as processed_at
FROM main.default.silver_race_results
GROUP BY driver, team
ORDER BY total_points DESC;

SELECT * FROM main.default.gold_driver_championship LIMIT 10;
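
As a sanity check on what this aggregation computes, the sketch below reproduces the same per-driver metrics in plain Python over a list of Silver-style rows. The function name `driver_standings` is made up for this example, and it works on dictionaries rather than Spark tables.

```python
from collections import defaultdict


def driver_standings(rows):
    """Aggregate Silver-style rows into one summary per (driver, team) pair."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["driver"], row["team"])].append(row)

    standings = []
    for (driver, team), entries in groups.items():
        positions = [e["finish_position"] for e in entries]
        total_points = sum(e["points_earned"] for e in entries)
        standings.append({
            "driver": driver,
            "team": team,
            "races_entered": len(entries),
            "total_points": total_points,
            "wins": sum(p == 1 for p in positions),
            "podiums": sum(p <= 3 for p in positions),
            "points_finishes": sum(p <= 10 for p in positions),
            # Python's round() rounds exact ties to even, while SQL ROUND rounds
            # them half up, so the last digit can differ in rare cases.
            "avg_finish_position": round(sum(positions) / len(positions), 2),
            "points_per_race": round(total_points / len(entries), 2),
        })

    # Highest points total first, like ORDER BY total_points DESC
    standings.sort(key=lambda s: s["total_points"], reverse=True)
    return standings
```

Because the grouping key is `(driver, team)`, a driver who changes teams during the season appears once per team, just as in the SQL above.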

In [0]:
%sql
-- Gold Layer: Constructor/Team Championship Standings
CREATE TABLE IF NOT EXISTS main.default.gold_team_championship
USING DELTA
AS
SELECT 
  team,
  COUNT(DISTINCT driver) as drivers_count,
  COUNT(*) as total_entries,  -- classified finishes only: Silver drops DNF/DNS rows
  SUM(points_earned) as total_points,
  SUM(CASE WHEN finish_position = 1 THEN 1 ELSE 0 END) as wins,
  SUM(CASE WHEN finish_position <= 3 THEN 1 ELSE 0 END) as podiums,
  ROUND(AVG(CAST(finish_position AS DOUBLE)), 2) as avg_finish_position,
  ROUND(SUM(points_earned) * 1.0 / COUNT(*), 2) as points_per_entry,
  current_timestamp() as processed_at
FROM main.default.silver_race_results
GROUP BY team
ORDER BY total_points DESC;

SELECT * FROM main.default.gold_team_championship LIMIT 10;

In [0]:
%sql
-- Summary of our Medallion Architecture implementation
-- Bronze Layer Tables (Raw Data)
SELECT 'Bronze Layer' as layer, 'bronze_race_results' as table_name, COUNT(*) as record_count FROM main.default.bronze_race_results
UNION ALL
SELECT 'Bronze Layer' as layer, 'bronze_qualifying_results' as table_name, COUNT(*) as record_count FROM main.default.bronze_qualifying_results
UNION ALL
SELECT 'Bronze Layer' as layer, 'bronze_sprint_results' as table_name, COUNT(*) as record_count FROM main.default.bronze_sprint_results
UNION ALL
SELECT 'Bronze Layer' as layer, 'bronze_sprint_qualifying' as table_name, COUNT(*) as record_count FROM main.default.bronze_sprint_qualifying
UNION ALL
-- Silver Layer Tables (Cleaned Data)
SELECT 'Silver Layer' as layer, 'silver_race_results' as table_name, COUNT(*) as record_count FROM main.default.silver_race_results
UNION ALL
SELECT 'Silver Layer' as layer, 'silver_qualifying_results' as table_name, COUNT(*) as record_count FROM main.default.silver_qualifying_results
UNION ALL
-- Gold Layer Tables (Analytics Ready)
SELECT 'Gold Layer' as layer, 'gold_driver_championship' as table_name, COUNT(*) as record_count FROM main.default.gold_driver_championship
UNION ALL
SELECT 'Gold Layer' as layer, 'gold_team_championship' as table_name, COUNT(*) as record_count FROM main.default.gold_team_championship
ORDER BY layer, table_name;

In [0]:
# Simple Medallion Architecture Summary

# Check all our tables
print("=== BRONZE LAYER (Raw Data) ===")
print(f"Race Results: {spark.sql('SELECT COUNT(*) FROM main.default.bronze_race_results').collect()[0][0]} records")
print(f"Qualifying Results: {spark.sql('SELECT COUNT(*) FROM main.default.bronze_qualifying_results').collect()[0][0]} records")
print(f"Sprint Results: {spark.sql('SELECT COUNT(*) FROM main.default.bronze_sprint_results').collect()[0][0]} records")
print(f"Sprint Qualifying: {spark.sql('SELECT COUNT(*) FROM main.default.bronze_sprint_qualifying').collect()[0][0]} records")

print("\n=== SILVER LAYER (Clean Data) ===")
print(f"Clean Race Results: {spark.sql('SELECT COUNT(*) FROM main.default.silver_race_results').collect()[0][0]} records")
print(f"Clean Qualifying Results: {spark.sql('SELECT COUNT(*) FROM main.default.silver_qualifying_results').collect()[0][0]} records")

print("\n=== GOLD LAYER (Analytics Ready) ===")
print(f"Driver Championship: {spark.sql('SELECT COUNT(*) FROM main.default.gold_driver_championship').collect()[0][0]} drivers")
print(f"Team Championship: {spark.sql('SELECT COUNT(*) FROM main.default.gold_team_championship').collect()[0][0]} teams")

print("\n✅ Simple Medallion Architecture Complete!")

In [0]:
# Quick analytics from our Gold layer
print("=== F1 2025 SEASON HIGHLIGHTS ===")

# Top 5 drivers
print("\n🏆 TOP 5 DRIVERS:")
top_drivers = spark.sql("""
    SELECT driver, total_points, wins, podiums 
    FROM main.default.gold_driver_championship 
    ORDER BY total_points DESC 
    LIMIT 5
""").collect()

for i, row in enumerate(top_drivers, 1):
    print(f"{i}. {row.driver}: {row.total_points} pts ({row.wins} wins, {row.podiums} podiums)")

# Top 3 teams
print("\n🏁 TOP 3 TEAMS:")
top_teams = spark.sql("""
    SELECT team, total_points, wins, podiums 
    FROM main.default.gold_team_championship 
    ORDER BY total_points DESC 
    LIMIT 3
""").collect()

for i, row in enumerate(top_teams, 1):
    print(f"{i}. {row.team}: {row.total_points} pts ({row.wins} wins, {row.podiums} podiums)")

print("\n📊 Data pipeline ready for dashboards and further analysis!")
