# 📊 Building the Medallion Architecture
*Create a complete F1 analytics pipeline using the data lake design pattern*

---

## 🎯 What We'll Build

In this notebook, we'll build a complete data pipeline following the medallion architecture pattern:

1. **Bronze Layer**: Raw data ingestion from Volume
2. **Silver Layer**: Clean and transform the data
3. **Gold Layer**: Create analytics-ready tables

We'll use the F1 dataset that we prepared in the setup notebook to analyze race results from the 2025 Formula 1 season.

**Let's get started!** 🚀 

## 📋 Prerequisites

Before running this notebook, make sure you have:
1. Run the `00_Setup_Data.ipynb` notebook to download the F1 data into the Volume
2. Attached this notebook to serverless compute

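The data should now be in the Volume at `/Volumes/main/default/formula1`. The `DESCRIBE VOLUME` cell in the Bronze section confirms that the Volume exists before we load from it.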
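One way to check for the CSV file itself from a Python cell is sketched below. It assumes the Volume is reachable at `/Volumes/main/default/formula1` (the path used by the bronze load later in this notebook) and that Volume files can be read through ordinary file paths. `list_csv_files` is a hypothetical helper name, not part of any Databricks API.

```python
from pathlib import Path

# Assumed Volume location, taken from the path used in the bronze load.
VOLUME_DIR = "/Volumes/main/default/formula1"


def list_csv_files(directory):
    """Return the sorted names of CSV files in `directory` (empty list if it is missing)."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        p.name for p in root.iterdir() if p.is_file() and p.suffix.lower() == ".csv"
    )


csv_files = list_csv_files(VOLUME_DIR)
if csv_files:
    print(f"Found {len(csv_files)} CSV file(s): {csv_files}")
else:
    print(f"No CSV files found in {VOLUME_DIR}; re-run the setup notebook first.")
```

If the list is empty, the bronze load further down will have nothing to read, so it is worth fixing the setup step before continuing.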

## 1️⃣ Bronze Layer: Raw Data Ingestion

The Bronze layer holds the raw data ingested from source systems with minimal transformation.
We'll create a table from the CSV file in our Volume and load it with `INSERT INTO ... SELECT` over the `read_files` table-valued function.

Let's start by creating the race results table:

In [0]:
%sql
-- Bronze Layer: Create tables for F1 2025 season data with correct column names

-- Create bronze_race_results table matching CSV structure
CREATE OR REPLACE TABLE main.default.f1_bronze_race_results (
  track STRING,
  position STRING,
  no INT,
  driver STRING,
  team STRING,
  starting_grid INT,
  laps INT,
  time_retired STRING,
  points INT,
  set_fastest_lap STRING,
  fastest_lap_time STRING
) USING DELTA
TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name'
);

In [0]:
%sql
DESCRIBE VOLUME main.default.formula1

In [0]:
%sql
INSERT INTO main.default.f1_bronze_race_results
SELECT 
  Track as track,
  Position as position,
  CAST(No AS INT) as no,
  Driver as driver,
  Team as team,
  CAST(`Starting Grid` AS INT) as starting_grid,
  CAST(Laps AS INT) as laps,
  `Time/Retired` as time_retired,
  CAST(Points AS INT) as points,
  `Set Fastest Lap` as set_fastest_lap,
  `Fastest Lap Time` as fastest_lap_time
FROM read_files(
  '/Volumes/main/default/formula1/Formula1_2025Season_RaceResults.csv',
  format => 'csv',
  header => true
);

-- Check the loaded data
SELECT COUNT(*) as total_records FROM main.default.f1_bronze_race_results;
SELECT track, position, no, driver, team, points FROM main.default.f1_bronze_race_results LIMIT 10;
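The load above renames each CSV header to a snake_case column by hand. As a rough illustration of the naming rule those aliases follow, here is a small Python sketch; `to_snake_case` is a hypothetical helper written for this example, and the header list is copied from the `SELECT` above.

```python
import re


def to_snake_case(header):
    """Lowercase a CSV header and replace runs of non-alphanumerics with '_'."""
    return re.sub(r"[^0-9A-Za-z]+", "_", header).strip("_").lower()


# Headers as they appear in the SELECT of the bronze load.
csv_headers = [
    "Track", "Position", "No", "Driver", "Team", "Starting Grid", "Laps",
    "Time/Retired", "Points", "Set Fastest Lap", "Fastest Lap Time",
]

for header in csv_headers:
    print(f"{header!r} -> {to_snake_case(header)}")
```

Spelling the aliases out in SQL, as this notebook does, keeps the mapping explicit; a helper like this is mainly useful for checking that the bronze column names stay consistent with the file headers.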

## 2️⃣ Silver Layer: Clean and Transform

The Silver layer keeps only rows with a numeric finishing position, casts columns to their proper types, and adds derived columns such as `result_category`.

In [0]:
%sql
-- Silver Layer: keep classified finishers and add typed, derived columns
-- CREATE OR REPLACE rebuilds the table on every run, so it stays in sync with bronze
CREATE OR REPLACE TABLE main.default.f1_silver_race_results
USING DELTA
AS
SELECT 
  track as race,
  TRY_CAST(position AS INT) as finish_position,
  driver,
  team,
  points as points_earned,
  fastest_lap_time,
  -- Placeholder: the bronze table has no race date column, so this is the load date
  current_date() as race_date,
  -- Add derived columns
  CASE 
    WHEN TRY_CAST(position AS INT) = 1 THEN 'Winner'
    WHEN TRY_CAST(position AS INT) <= 3 THEN 'Podium'
    WHEN TRY_CAST(position AS INT) <= 10 THEN 'Points'
    ELSE 'No Points'
  END as result_category,
  current_timestamp() as processed_at
FROM main.default.f1_bronze_race_results
-- TRY_CAST returns NULL for NULL and for non-numeric positions such as 'DNF' or 'DNS'
WHERE TRY_CAST(position AS INT) IS NOT NULL;

SELECT COUNT(*) as total_records FROM main.default.f1_silver_race_results;
SELECT * FROM main.default.f1_silver_race_results ORDER BY finish_position LIMIT 10;
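To make the cleaning rule concrete, the sketch below mirrors the silver logic in plain Python: positions that do not parse as integers are dropped, and the rest are bucketed the same way as the `CASE` expression. Both function names are hypothetical, and `to_int_or_none` is only a rough stand-in for `TRY_CAST(position AS INT)`, not a faithful reimplementation of Spark's casting rules.

```python
def to_int_or_none(value):
    """Rough stand-in for TRY_CAST(value AS INT): None when the text is not an integer."""
    try:
        return int(str(value).strip())
    except (TypeError, ValueError):
        return None


def result_category(position):
    """Bucket a raw position string like the silver CASE expression does."""
    finish = to_int_or_none(position)
    if finish is None:
        return None  # such a row is filtered out of the silver table
    if finish == 1:
        return "Winner"
    if finish <= 3:
        return "Podium"
    if finish <= 10:
        return "Points"
    return "No Points"


for raw in ["1", "3", "10", "11", "DNF", None]:
    print(repr(raw), "->", result_category(raw))
```

Note that the branches are checked in order, so a winner is labeled `Winner` rather than `Podium` even though position 1 also satisfies `<= 3`.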

## 3️⃣ Gold Layer: Analytics-Ready Tables

The Gold layer aggregates the Silver table into driver and team championship standings.

In [0]:
%sql
-- Gold Layer: Driver Championship Standings
CREATE OR REPLACE TABLE main.default.f1_gold_driver_championship
USING DELTA
AS
SELECT 
  driver,
  team,
  COUNT(*) as races_entered,
  SUM(points_earned) as total_points,
  SUM(CASE WHEN finish_position = 1 THEN 1 ELSE 0 END) as wins,
  SUM(CASE WHEN finish_position <= 3 THEN 1 ELSE 0 END) as podiums,
  SUM(CASE WHEN finish_position <= 10 THEN 1 ELSE 0 END) as points_finishes,
  ROUND(AVG(CAST(finish_position AS DOUBLE)), 2) as avg_finish_position,
  ROUND(SUM(points_earned) * 1.0 / COUNT(*), 2) as points_per_race,
  current_timestamp() as processed_at
FROM main.default.f1_silver_race_results
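-- Grouping by team as well yields one row per team for a driver who switched teams mid-season
GROUP BY driver, team;

-- Stored rows have no guaranteed order, so sort when reading the standings
SELECT * FROM main.default.f1_gold_driver_championship ORDER BY total_points DESC LIMIT 10;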
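The aggregation can be easier to follow on a handful of rows. The sketch below recomputes a few of the gold metrics in plain Python over made-up silver rows; the drivers, teams, and results are invented for illustration and are not taken from the dataset, and `driver_standings` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up silver rows: (driver, team, finish_position, points_earned). Not real results.
silver_rows = [
    ("Driver A", "Team X", 1, 25),
    ("Driver A", "Team X", 4, 12),
    ("Driver B", "Team X", 2, 18),
    ("Driver B", "Team Y", 11, 0),
]


def driver_standings(rows):
    """Group by (driver, team) and compute a subset of the gold metrics."""
    groups = defaultdict(list)
    for driver, team, finish, points in rows:
        groups[(driver, team)].append((finish, points))

    standings = []
    for (driver, team), results in groups.items():
        races = len(results)
        total = sum(points for _, points in results)
        standings.append({
            "driver": driver,
            "team": team,
            "races_entered": races,
            "total_points": total,
            "wins": sum(1 for finish, _ in results if finish == 1),
            "podiums": sum(1 for finish, _ in results if finish <= 3),
            "avg_finish_position": round(sum(f for f, _ in results) / races, 2),
            "points_per_race": round(total / races, 2),
        })
    # Sort explicitly, as the SELECT on the gold table should.
    return sorted(standings, key=lambda row: row["total_points"], reverse=True)


for row in driver_standings(silver_rows):
    print(row)
```

Two things stand out in this toy data. `Driver B` appears twice because the grouping key includes the team, and `races_entered` counts only rows present in the silver table, which excludes entries without a numeric finishing position.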

In [0]:
%sql
-- Gold Layer: Constructor/Team Championship Standings
CREATE OR REPLACE TABLE main.default.f1_gold_team_championship
USING DELTA
AS
SELECT 
  team,
  COUNT(DISTINCT driver) as drivers_count,
  COUNT(*) as total_entries,
  SUM(points_earned) as total_points,
  SUM(CASE WHEN finish_position = 1 THEN 1 ELSE 0 END) as wins,
  SUM(CASE WHEN finish_position <= 3 THEN 1 ELSE 0 END) as podiums,
  ROUND(AVG(CAST(finish_position AS DOUBLE)), 2) as avg_finish_position,
  ROUND(SUM(points_earned) * 1.0 / COUNT(*), 2) as points_per_entry,
  current_timestamp() as processed_at
FROM main.default.f1_silver_race_results
GROUP BY team;

-- Stored rows have no guaranteed order, so sort when reading the standings
SELECT * FROM main.default.f1_gold_team_championship ORDER BY total_points DESC LIMIT 10;

In [0]:
%sql
-- Summary of our Medallion Architecture implementation
-- Bronze Layer Tables (Raw Data)
SELECT 'Bronze Layer' as layer, 'f1_bronze_race_results' as table_name, COUNT(*) as record_count FROM main.default.f1_bronze_race_results
UNION ALL
-- Silver Layer Tables (Cleaned Data)
SELECT 'Silver Layer' as layer, 'f1_silver_race_results' as table_name, COUNT(*) as record_count FROM main.default.f1_silver_race_results
UNION ALL
-- Gold Layer Tables (Analytics Ready)
SELECT 'Gold Layer' as layer, 'f1_gold_driver_championship' as table_name, COUNT(*) as record_count FROM main.default.f1_gold_driver_championship
UNION ALL
SELECT 'Gold Layer' as layer, 'f1_gold_team_championship' as table_name, COUNT(*) as record_count FROM main.default.f1_gold_team_championship
ORDER BY layer, table_name;