# 🏎️ Formula 1 Qualifying Results DLT Pipeline

This notebook demonstrates a simple Delta Live Tables (DLT) pipeline for Formula 1 qualifying results. The pipeline has three layers:

* **Bronze**: Ingests raw data from CSV files
* **Silver**: Cleans and standardizes the data
* **Gold**: Aggregates and summarizes for reporting

Each layer builds on the previous one, making the data more useful and reliable for analytics. The code and explanations are designed for new joiners to quickly understand how DLT works.

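## 🥉 Bronze Layer: Raw Ingestion

The bronze table is the first step in our pipeline. It loads the raw qualifying results from the CSV file into a Delta table. No cleaning or transformation happens here: this layer captures the original data as-is, so we always have a source of truth to refer back to.

In [0]:
# No install step is needed: the `dlt` module is provided by the Delta Live Tables
# runtime when this notebook runs as part of a pipeline. The `dlt` package on PyPI
# is an unrelated project, so `pip install dlt` would not provide this API.

In [0]:
import dlt
# Column functions used by the gold tables below. Note that sum, min, max and round
# shadow the Python built-ins of the same name for the rest of the notebook.
from pyspark.sql.functions import (
    avg, col, count, current_timestamp, lit, max, min, round, sum, when
)

@dlt.table(
    name="bronze_qualifying_results",
    comment="Raw Formula 1 qualifying results from CSV file."
)
def bronze_qualifying_results():
    # Batch read of the CSV; columns arrive as strings because no schema is inferred
    return (
        spark.read.format("csv")
        .option("header", True)
        .load("/Volumes/main/default/formula1/Formula1_2025Season_QualifyingResults.csv")
    )

## 🥇 Gold Layer: Analytics-Ready Aggregations

Gold tables aggregate the cleaned silver data into business-ready metrics. The first one computes per-driver career statistics.
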
In [0]:
@dlt.table(
    name="dlt_gold_driver_stats",
    comment="DLT Gold: Comprehensive driver career statistics and performance metrics"
)
@dlt.expect("drivers_with_races", "total_races > 0")
def gold_driver_stats():
    """
    Calculate comprehensive driver career statistics.
    Aggregates from clean silver tables to create analytics-ready data.
    """
    drivers = dlt.read("dlt_silver_drivers_clean")
    results = dlt.read("dlt_silver_results_clean")
    
    return (
        drivers.alias("d")
        .join(results.alias("r"), col("d.driver_id") == col("r.driver_id"), "inner")
        .groupBy(
            col("d.driver_id"),
            col("d.full_name"),
            col("d.nationality"),
            col("d.current_age"),
            col("d.birth_date")
        )
        .agg(
            count("r.result_id").alias("total_races"),
            sum("r.points_scored").alias("career_points"),
            sum(when(col("r.race_winner"), 1).otherwise(0)).alias("wins"),
            sum(when(col("r.podium_finish"), 1).otherwise(0)).alias("podiums"),
            sum(when(col("r.scored_points"), 1).otherwise(0)).alias("points_finishes"),
            avg("r.finish_position").alias("avg_finish_position"),
            min("r.finish_position").alias("best_finish"),
            max("r.finish_position").alias("worst_finish"),
            sum("r.laps_completed").alias("total_laps"),
            # Performance ratios
            round(sum("r.points_scored") / count("r.result_id"), 2).alias("points_per_race"),
            round(sum(when(col("r.race_winner"), 1).otherwise(0)) * 100.0 / count("r.result_id"), 2).alias("win_percentage"),
            round(sum(when(col("r.podium_finish"), 1).otherwise(0)) * 100.0 / count("r.result_id"), 2).alias("podium_percentage"),
            # Data lineage
            current_timestamp().alias("calculated_at"),
            lit("dlt_gold_aggregation").alias("calculation_method")
        )
        .filter(col("total_races") >= 1)  # Only drivers with actual race participation
    )
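
The ratio columns above are simple arithmetic, so they can be sanity-checked outside the pipeline. The following is a minimal plain-Python sketch of the same calculations; the `driver_ratios` helper and the sample rows are hypothetical and only reuse the column names from the gold table.

```python
def driver_ratios(results):
    """Mirror the gold-layer ratio columns for one driver's result rows.

    `results` is a list of dicts with `points_scored` (number) and
    `race_winner` / `podium_finish` (bool), matching the silver column names.
    """
    total_races = len(results)
    if total_races == 0:
        # The pipeline's inner join never yields a zero-race driver,
        # but a standalone helper has to guard the division itself.
        return None
    career_points = sum(r["points_scored"] for r in results)
    wins = sum(1 for r in results if r["race_winner"])
    podiums = sum(1 for r in results if r["podium_finish"])
    return {
        "total_races": total_races,
        "career_points": career_points,
        "points_per_race": round(career_points / total_races, 2),
        "win_percentage": round(wins * 100.0 / total_races, 2),
        "podium_percentage": round(podiums * 100.0 / total_races, 2),
    }


# Made-up rows for illustration only
sample = [
    {"points_scored": 25, "race_winner": True, "podium_finish": True},
    {"points_scored": 15, "race_winner": False, "podium_finish": True},
    {"points_scored": 0, "race_winner": False, "podium_finish": False},
    {"points_scored": 10, "race_winner": False, "podium_finish": False},
]
stats = driver_ratios(sample)
```

Python's `round` and Spark's `round` can disagree on exact ties, so treat this as a check of the formulas and not as a bit-for-bit replica.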

### 🥇 Gold Table: Top Performers

The second gold table builds on `dlt_gold_driver_stats`. It assigns each driver a performance tier, a consistency rating, and an experience level, and keeps only drivers with at least five races.

In [0]:
@dlt.table(
    name="dlt_gold_top_performers",
    comment="DLT Gold: Top performing drivers across different performance categories"
)
@dlt.expect("performance_categories", "performance_tier IS NOT NULL")
def gold_top_performers():
    """
    Create performance tiers and identify top performers.
    Builds on driver stats to create business-friendly categorizations.
    """
    return (
        dlt.read("dlt_gold_driver_stats")
        .select(
            col("driver_id"),
            col("full_name"),
            col("nationality"),
            col("total_races"),
            col("career_points"),
            col("wins"),
            col("podiums"),
            col("points_per_race"),
            col("win_percentage"),
            col("podium_percentage"),
            # Create performance tiers
            when(col("wins") >= 20, "F1 Legend")
            .when(col("wins") >= 5, "Race Winner")
            .when(col("podiums") >= 10, "Podium Regular")
            .when(col("career_points") >= 100, "Points Scorer")
            .when(col("total_races") >= 20, "Veteran")
            .otherwise("Rookie").alias("performance_tier"),
            # Excellence indicators
            when(col("win_percentage") >= 25, "Elite Winner")
            .when(col("podium_percentage") >= 50, "Consistent Podium")
            .when(col("points_per_race") >= 5, "Strong Performer")
            .otherwise("Developing").alias("consistency_rating"),
            # Experience categorization
            when(col("total_races") >= 200, "Ultra Veteran")
            .when(col("total_races") >= 100, "Veteran")
            .when(col("total_races") >= 50, "Experienced")
            .otherwise("Newcomer").alias("experience_level"),
            col("calculated_at"),
            current_timestamp().alias("categorized_at")
        )
        .filter(col("total_races") >= 5)  # Focus on drivers with meaningful careers
    )
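
Because the tiers depend on the order of the chained `when()` calls, it helps to test the boundaries in isolation. This is a small plain-Python sketch of the same thresholds; the `performance_tier` function is a hypothetical helper, not part of the pipeline.

```python
def performance_tier(wins, podiums, career_points, total_races):
    """Return the tier label using the same ordered thresholds as the gold table."""
    # Order matters: the first matching rule wins, as in the chained when() calls.
    if wins >= 20:
        return "F1 Legend"
    if wins >= 5:
        return "Race Winner"
    if podiums >= 10:
        return "Podium Regular"
    if career_points >= 100:
        return "Points Scorer"
    if total_races >= 20:
        return "Veteran"
    return "Rookie"


# A driver with many podiums but only five wins is still a "Race Winner",
# because the wins rule is checked before the podium rule.
tier = performance_tier(wins=5, podiums=50, career_points=1000, total_races=300)
```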

## 📊 Data Quality Expectations Explained

DLT provides powerful data quality features through **expectations**:

### 🎯 Expectation Types:

#### 1. `@dlt.expect()`
```python
@dlt.expect("reasonable_birth_date", "date_of_birth >= '1900-01-01'")
```
- **Behavior:** Records violation but continues processing
- **Use case:** Data quality monitoring and alerts
- **Result:** Violating records included in output with quality metrics tracked

#### 2. `@dlt.expect_or_drop()`
```python
@dlt.expect_or_drop("valid_driver_id", "driver_id IS NOT NULL")
```
- **Behavior:** Drops records that fail the expectation
- **Use case:** Critical data quality requirements
- **Result:** Only valid records in output table

#### 3. `@dlt.expect_or_fail()`
```python
# Expectations are checked row by row, so the constraint must be a per-row condition
@dlt.expect_or_fail("critical_data", "result_id IS NOT NULL")
```
- **Behavior:** Stops pipeline execution if expectation fails
- **Use case:** Critical business rules that cannot be violated
- **Result:** Pipeline failure with clear error message
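
The difference between the three behaviors can be sketched in plain Python. This is an illustrative simulation only, not how DLT is implemented; `apply_expectation`, `ExpectationFailed`, and the mode names are hypothetical.

```python
class ExpectationFailed(Exception):
    """Raised when a fail-mode expectation sees a violating row."""


def apply_expectation(rows, name, predicate, mode="warn"):
    """Apply one row-level rule and return (kept_rows, metrics).

    mode="warn" keeps every row, "drop" removes violators,
    and "fail" raises on the first violator.
    """
    kept, failed = [], 0
    for row in rows:
        if predicate(row):
            kept.append(row)
            continue
        failed += 1
        if mode == "fail":
            raise ExpectationFailed(f"expectation '{name}' violated by {row}")
        if mode == "warn":
            kept.append(row)  # violation is counted, row still flows through
    metrics = {"name": name, "passed": len(rows) - failed, "failed": failed}
    return kept, metrics


# Made-up rows: the second one has no driver_id
rows = [{"driver_id": 1}, {"driver_id": None}]
has_id = lambda r: r["driver_id"] is not None

warned, warn_metrics = apply_expectation(rows, "valid_driver_id", has_id, "warn")
dropped, drop_metrics = apply_expectation(rows, "valid_driver_id", has_id, "drop")
```

In both modes the violation is counted; only the contents of the output differ.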

### 📈 Quality Monitoring:
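- **Pipeline UI** shows pass/fail counts for each expectation on every update
- **Event log** stores the same metrics, so you can query them to build dashboards and alerts
- **Historical tracking** of data quality over time comes from querying past event log entries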
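
DLT writes expectation results to the pipeline event log, where each `flow_progress` event carries a JSON `details` payload. The sketch below totals pass/fail counts from such payloads in plain Python. The payload shape (`flow_progress.data_quality.expectations` with `name`, `dataset`, `passed_records`, `failed_records`) is assumed here, so verify it against your own event log; `summarize_expectations` and the sample events are hypothetical.

```python
import json
from collections import defaultdict


def summarize_expectations(event_details):
    """Total passed/failed counts per (dataset, expectation) pair.

    `event_details` is an iterable of JSON strings, one per `flow_progress`
    event, shaped like the `details` column of the event log.
    """
    totals = defaultdict(lambda: {"passed": 0, "failed": 0})
    for raw in event_details:
        progress = json.loads(raw).get("flow_progress", {})
        # Events without expectations have no (or a null) data_quality block.
        quality = progress.get("data_quality") or {}
        for exp in quality.get("expectations", []):
            key = (exp["dataset"], exp["name"])
            totals[key]["passed"] += exp.get("passed_records", 0)
            totals[key]["failed"] += exp.get("failed_records", 0)
    return dict(totals)


# Hypothetical event payloads for illustration only
sample_events = [
    json.dumps({"flow_progress": {"data_quality": {"expectations": [
        {"name": "drivers_with_races", "dataset": "dlt_gold_driver_stats",
         "passed_records": 20, "failed_records": 0}]}}}),
    json.dumps({"flow_progress": {"status": "RUNNING"}}),
    json.dumps({"flow_progress": {"data_quality": {"expectations": [
        {"name": "drivers_with_races", "dataset": "dlt_gold_driver_stats",
         "passed_records": 18, "failed_records": 2}]}}}),
]
summary = summarize_expectations(sample_events)
```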

## 🚀 DLT Pipeline Creation Guide

### 📋 How to Create a DLT Pipeline:

#### 1. Navigate to Delta Live Tables 🔄
- Click **"Workflows"** in the left sidebar
- Click **"Delta Live Tables"** tab
- Click **"Create Pipeline"** button

#### 2. Configure Pipeline Settings ⚙️
```
Pipeline Name: "F1 Data Pipeline with DLT"
Description: "Managed ETL pipeline for Formula 1 analytics with data quality"
```

#### 3. Source Configuration 📝
- **Source Type:** `Notebook`
- **Notebook Path:** Select this notebook (`05_Delta_Live_Pipeline.ipynb`)
- **Source:** Your workspace location

#### 4. Target Configuration 🎯
```
Target Catalog: main
Target Schema: default
Storage Location: Managed (Unity Catalog)
```

#### 5. Compute Configuration ⚡
```
Compute: Serverless (recommended)
# Worker limits apply only if you choose classic (non-serverless) compute:
Min Workers: 1
Max Workers: 5 (auto-scaling)
```

#### 6. Advanced Settings 🎛️
- **Pipeline Mode:** `Triggered` (runs once per manual or scheduled update) or `Continuous` (keeps running and processes data as it arrives)
- **Channel:** `Current` (latest features)
- **Edition:** `Advanced` (for expectations and monitoring)
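
The same choices can be captured as pipeline settings in code, which is easier to review and reuse than UI clicks. The sketch below builds the settings as a Python dict and serializes it to JSON. The key names are assumed to match the settings shown in the pipeline's JSON view, so check them against your workspace; the notebook path is a hypothetical placeholder.

```python
import json

pipeline_settings = {
    "name": "F1 Data Pipeline with DLT",
    "catalog": "main",
    "target": "default",  # target schema
    "serverless": True,
    "continuous": False,  # False means triggered mode
    "channel": "CURRENT",
    "edition": "ADVANCED",
    "libraries": [
        # Hypothetical workspace path; point this at your own copy of the notebook.
        {"notebook": {"path": "/Workspace/Users/someone@example.com/05_Delta_Live_Pipeline"}}
    ],
}

settings_json = json.dumps(pipeline_settings, indent=2)
```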

## 🔄 DLT vs Jobs: When to Use Each

### 🏗️ **Use Delta Live Tables When:**

✅ **Complex ETL with dependencies**
- Multiple transformation layers (Bronze → Silver → Gold)
- Automatic dependency resolution needed
- Schema evolution and data quality critical

✅ **Data quality is paramount**
- Need built-in expectations and monitoring
- Drop invalid records or fail the update through expectations (quarantine tables can be built on top)
- Quality metrics and alerting required

✅ **Streaming and incremental processing**
- Near real-time data processing
- Change data capture (CDC) patterns
- Efficient incremental updates

✅ **Team collaboration on pipelines**
- Declarative code is easier to understand
- Built-in lineage and documentation
- Standardized patterns across teams

### ⚙️ **Use Jobs When:**

✅ **Simple, scheduled tasks**
- Single notebook execution
- Basic data refresh operations
- Notification and reporting workflows

✅ **Custom orchestration logic**
- Complex conditional workflows
- Integration with external systems
- Custom retry and error handling

✅ **Ad-hoc or exploratory processing**
- One-time data migration
- Experimental data processing
- Quick fixes and patches

### 📊 **Feature Comparison:**

| **Feature** | **Delta Live Tables** | **Jobs** |
|-------------|----------------------|----------|
| **Dependency Management** | ✅ Automatic | ⚙️ Manual |
| **Data Quality** | ✅ Built-in expectations | ⚙️ Custom code |
| **Streaming** | ✅ Native support | ⚙️ Structured streaming |
| **Monitoring** | ✅ Built-in UI and event log | ⚙️ Custom monitoring |
| **Cost** | 💰 DLT premium | 💰 Standard compute |
| **Flexibility** | 🎯 Declarative patterns | 🔧 Full control |

## 📈 Advanced DLT Features

### 🔄 **Change Data Capture (CDC)**
```python
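import dlt
from pyspark.sql.functions import col, expr

# apply_changes is a module-level function that writes into a streaming table declared up front
dlt.create_streaming_table("customers_cdc")

dlt.apply_changes(
    target="customers_cdc",
    source="customers_raw",
    keys=["customer_id"],
    sequence_by=col("update_timestamp"),
    apply_as_deletes=expr("operation = 'DELETE'"),
    except_column_list=["operation", "update_timestamp"],
)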
```
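
To make the roles of `keys`, `sequence_by`, the delete condition, and the excluded columns concrete, here is a toy in-memory version of the same idea. It is a simplified sketch and not DLT's implementation: `apply_changes_sketch` is a hypothetical helper, it handles a single batch only, and the sample change rows are made up.

```python
def apply_changes_sketch(target, changes, keys, sequence_by, is_delete, except_columns=()):
    """Upsert change rows into `target`, a dict keyed by the tuple of key values.

    Changes within the batch are applied in `sequence_by` order, so the
    order in which they arrive inside the batch does not matter.
    """
    for change in sorted(changes, key=lambda c: c[sequence_by]):
        key = tuple(change[k] for k in keys)
        if is_delete(change):
            target.pop(key, None)  # deleting a missing key is a no-op
            continue
        # Keep only business columns; bookkeeping columns are excluded.
        target[key] = {k: v for k, v in change.items() if k not in except_columns}
    return target


# Made-up change feed, deliberately out of order
changes = [
    {"customer_id": 1, "name": "Ana", "operation": "INSERT", "update_timestamp": 1},
    {"customer_id": 2, "name": "Ben", "operation": "INSERT", "update_timestamp": 2},
    {"customer_id": 1, "name": "Ana M.", "operation": "UPDATE", "update_timestamp": 4},
    {"customer_id": 2, "name": "Ben", "operation": "DELETE", "update_timestamp": 3},
]

customers = apply_changes_sketch(
    target={},
    changes=changes,
    keys=["customer_id"],
    sequence_by="update_timestamp",
    is_delete=lambda c: c["operation"] == "DELETE",
    except_columns=("operation", "update_timestamp"),
)
```

After the batch, customer 1 holds the latest name and customer 2 is gone, even though its delete arrived last in the list.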

### 📊 **Pipeline Dependencies**
```python
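# DLT infers the execution order from the tables each function reads
@dlt.table
def downstream_table():
    left = dlt.read("upstream_table_1")
    right = dlt.read("upstream_table_2")
    # "id" is a placeholder join key; without a condition this would be a Cartesian product
    return left.join(right, on="id", how="inner")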
```
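
Dependency resolution amounts to a topological sort of the "reads from" graph. The sketch below shows the idea with Python's standard `graphlib` module (Python 3.9+) and the table names from this notebook. The silver-to-bronze edges are an assumption based on the table list, and this illustrates the concept only, not DLT's internal scheduler.

```python
from graphlib import TopologicalSorter

# Each table maps to the set of tables it reads from.
reads = {
    "dlt_silver_drivers_clean": {"dlt_bronze_drivers"},      # assumed edge
    "dlt_silver_results_clean": {"dlt_bronze_results"},      # assumed edge
    "dlt_gold_driver_stats": {"dlt_silver_drivers_clean", "dlt_silver_results_clean"},
    "dlt_gold_top_performers": {"dlt_gold_driver_stats"},
}

# static_order() yields every table after all of the tables it depends on.
run_order = list(TopologicalSorter(reads).static_order())
```

Any order produced this way runs bronze before silver and silver before gold, which is why the functions in the notebook can be defined in any order.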

### 🎯 **Custom Expectations**
```python
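# Expectations run per row, so aggregate to one summary row first, then check that row
@dlt.table
@dlt.expect_or_fail("freshness_check", "latest_update > current_timestamp() - INTERVAL 1 DAY")
def source_data_freshness():
    return dlt.read("source_data").selectExpr("max(update_time) AS latest_update")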
```

## ✅ Delta Live Tables Complete!

**🎉 Outstanding! You've mastered Delta Live Tables fundamentals!**

### What You've Accomplished:
- ✅ **Built DLT pipeline** with Bronze, Silver, and Gold layers
- ✅ **Implemented data quality expectations** for automatic monitoring
- ✅ **Used declarative transformations** with automatic dependencies
- ✅ **Learned DLT vs Jobs** comparison and use cases
- ✅ **Explored advanced features** (CDC, streaming, quality monitoring)

### 🏗️ Your DLT Pipeline Architecture:
```
📁 Volume CSV Files
    ↓ (CSV read)
🥉 DLT Bronze Tables (Raw ingestion)
    ↓ (Quality expectations)
🥈 DLT Silver Tables (Clean & validated)  
    ↓ (Business aggregations)
🥇 DLT Gold Tables (Analytics ready)
```

### 📊 Tables Created:
- **Bronze:** `dlt_bronze_drivers`, `dlt_bronze_results`
- **Silver:** `dlt_silver_drivers_clean`, `dlt_silver_results_clean`  
- **Gold:** `dlt_gold_driver_stats`, `dlt_gold_top_performers`

## 📋 Delta Live Tables Pipeline Summary

This DLT pipeline will create the following tables:

| Layer | Table | Description |
|-------|------|-------------|
| 🥉 Bronze | `dlt_bronze_drivers` | Raw driver data ingestion |
| 🥉 Bronze | `dlt_bronze_results` | Raw results data ingestion |
| 🥈 Silver | `dlt_silver_drivers_clean` | Validated driver data with quality expectations |
| 🥈 Silver | `dlt_silver_results_clean` | Validated results data with quality expectations |
| 🥇 Gold | `dlt_gold_driver_stats` | Driver career statistics and aggregations |
| 🥇 Gold | `dlt_gold_top_performers` | Performance categorization and tiers |

### Key Pipeline Features
- 📊 **Data Quality**: Built-in expectations and monitoring
- 🔄 **Dependencies**: Automatic resolution and execution order
- ⚡ **Compute**: Serverless managed infrastructure

## 🚀 Next Steps

Ready to explore AI-powered features and advanced analytics?

### Immediate Actions:
1. **🔄 Create Your DLT Pipeline:**
   - Go to Workflows → Delta Live Tables → Create Pipeline
   - Use this notebook as the source
   - Configure with Serverless compute

2. **📊 Monitor Pipeline Execution:**
   - Watch automatic dependency resolution
   - Check data quality expectation results
   - Explore generated lineage graphs

3. **➡️ Next Notebook:** [06_AI_Agent_Bricks.ipynb](06_AI_Agent_Bricks.ipynb)
   - Explore AI Agents and intelligent applications
   - Build F1 Q&A chatbots with your data

### 🎯 Best Practices Checklist:
- ✅ **Start simple** with basic Bronze → Silver → Gold
- ✅ **Add expectations gradually** as you understand your data
- ✅ **Use descriptive table names** for clarity
- ✅ **Document transformations** with comments
- ✅ **Monitor data quality** trends over time
- ✅ **Test expectations** before production deployment
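
One lightweight way to test expectation constraints before deployment is to run them against a handful of sample rows locally. The sketch below uses Python's built-in `sqlite3` for this. It assumes the constraint uses only SQL that SQLite and Spark interpret the same way, and it counts a NULL result as a violation, which is a choice made for this sketch; `count_violations` and the sample rows are hypothetical.

```python
import sqlite3


def count_violations(rows, columns, constraint):
    """Return how many sample rows fail a SQL boolean constraint.

    Uses an in-memory SQLite table, so only constraints written in SQL
    that both SQLite and Spark understand can be checked this way.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(f"CREATE TABLE sample ({', '.join(columns)})")
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(f"INSERT INTO sample VALUES ({placeholders})", rows)
        # In this sketch a row violates the rule when the constraint is false or NULL.
        query = f"SELECT COUNT(*) FROM sample WHERE ({constraint}) IS NOT 1"
        return conn.execute(query).fetchone()[0]
    finally:
        conn.close()


# Made-up sample rows: one missing driver_id, one driver with zero races
columns = ["driver_id", "total_races"]
rows = [(1, 12), (None, 3), (3, 0)]

missing_ids = count_violations(rows, columns, "driver_id IS NOT NULL")
zero_races = count_violations(rows, columns, "total_races > 0")
```

A check like this catches typos and inverted conditions early; it does not replace a development run of the pipeline against real data.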

### 💡 Pro Tips:
- **🔧 Start with `@dlt.expect()`** to understand data patterns
- **📊 Use the pipeline UI and event log** for quality monitoring
- **⚡ Leverage Serverless** for cost-effective execution
- **🔄 Design for incremental processing** from day one

**🔄 Your data pipelines are now production-ready! 🚀**