# 🗄️ Unity Catalog Demo: Data Governance & Lineage
*Explore Unity Catalog features with Formula 1 data lineage in 5 minutes*

---

## 🎯 Learning Objectives

By the end of this demo, you'll understand:
- ✅ **Unity Catalog's 3-level namespace** (catalog.schema.table)
- ✅ **Data lineage tracking** and visualization
- ✅ **Governance features** for enterprise data management
- ✅ **Best practices** for organizing data assets

---

## 📊 What We'll Explore

**Features covered in this demo:**
```
🔍 Unity Catalog Features:
├── 📋 Data Discovery (search and explore tables)
├── 📈 Lineage Visualization (track data flow)
├── 🔒 Governance Controls (permissions and security)
├── 📝 Metadata Management (descriptions and tags)
└── 🔍 Impact Analysis (understand dependencies)
```

### 🎯 Using Our F1 Data
We'll explore the data pipeline we created in previous notebooks:
- **Bronze Tables** → **Silver Tables** → **Gold Tables**
- **Data lineage** from raw CSV files to analytics
- **Impact analysis** for schema changes
- **Governance** for production data assets

First, let's verify that we have our F1 data tables from the setup and medallion notebooks:

In [0]:
-- Let's explore our F1 data catalog structure

-- Show all tables in our schema
SHOW TABLES IN main.default;
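
The `main.default` prefix used above is the first two levels of the three-level namespace. As a minimal sketch, assuming the `silver_drivers` table from the earlier notebooks exists and you can read it, the same table can be addressed with a fully qualified name or relative to the session's current catalog and schema:

```sql
-- Check which catalog and schema the session currently resolves names against
SELECT current_catalog(), current_schema();

-- Fully qualified: catalog.schema.table
SELECT COUNT(*) AS driver_count FROM main.default.silver_drivers;

-- Set session defaults, then use the short name
USE CATALOG main;
USE SCHEMA `default`;  -- backticks are a precaution because DEFAULT is also a SQL keyword
SELECT COUNT(*) AS driver_count FROM silver_drivers;
```

Fully qualified names are usually the safer choice in shared notebooks, since they do not depend on whatever session defaults a previous cell set.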

## 🔍 Data Discovery and Metadata

Unity Catalog provides rich metadata and discovery capabilities. Let's explore our F1 tables:

In [0]:
-- Explore detailed table information

-- Get detailed info for our gold table
DESCRIBE EXTENDED main.default.gold_driver_standings;

-- Get Delta table details (format, location, file count, size in bytes)
DESCRIBE DETAIL main.default.gold_driver_standings;
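
Discovery can also be done in SQL through the catalog's `information_schema`. The queries below are a sketch that assumes you have at least browse or read access to the `main.default` schema; only objects you are allowed to see are returned.

```sql
-- List every table and view in the schema, with its type and comment
SELECT table_name, table_type, comment
FROM main.information_schema.tables
WHERE table_schema = 'default'
ORDER BY table_name;

-- Inspect the columns of one table
SELECT column_name, data_type, is_nullable
FROM main.information_schema.columns
WHERE table_schema = 'default'
  AND table_name = 'gold_driver_standings'
ORDER BY ordinal_position;
```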

## 🔄 Creating Lineage Demo Tables

Let's create a set of tables that demonstrate Unity Catalog's lineage tracking capabilities.
We'll create five tables, each one built from the table or tables before it:

1. `lineage_drivers_source` - Base driver information
2. `lineage_results_source` - Base race results 
3. `lineage_driver_performance` - Joins the two sources
4. `lineage_career_stats` - Aggregates performance data
5. `lineage_championship_tiers` - Classifies drivers by performance

These tables will provide a clean demonstration of data lineage in Unity Catalog.

In [0]:
-- 1. Create the first source table - driver information
CREATE OR REPLACE TABLE main.default.lineage_drivers_source AS
SELECT
  driver_id,
  full_name,
  nationality,
  date_of_birth,
  current_age
FROM main.default.silver_drivers;

-- 2. Create the second source table - race results
CREATE OR REPLACE TABLE main.default.lineage_results_source AS
SELECT
  race_id,
  driver_id,
  finish_position,
  points,
  race_time
FROM main.default.silver_results
WHERE finish_position IS NOT NULL;

-- Show table counts to verify
SELECT 'lineage_drivers_source' as table_name, COUNT(*) as row_count FROM main.default.lineage_drivers_source
UNION ALL
SELECT 'lineage_results_source' as table_name, COUNT(*) as row_count FROM main.default.lineage_results_source;

In [0]:
-- 3. Create the joined performance table
CREATE OR REPLACE TABLE main.default.lineage_driver_performance AS
SELECT
  d.driver_id,
  d.full_name,
  d.nationality,
  r.race_id,
  r.finish_position,
  r.points,
  CASE 
    WHEN r.finish_position = 1 THEN 'Win'
    WHEN r.finish_position = 2 THEN 'Runner-up'
    WHEN r.finish_position = 3 THEN 'Podium'
    WHEN r.finish_position <= 10 THEN 'Points'
    ELSE 'Non-points'
  END AS result_type,
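  -- Re-score the top three finishes on the 25/18/15 scale; other positions keep their recorded points
  CASE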
    WHEN r.finish_position = 1 THEN 25
    WHEN r.finish_position = 2 THEN 18
    WHEN r.finish_position = 3 THEN 15
    ELSE r.points
  END AS normalized_points
FROM main.default.lineage_drivers_source d
JOIN main.default.lineage_results_source r ON d.driver_id = r.driver_id;

-- 4. Create career statistics aggregation
CREATE OR REPLACE TABLE main.default.lineage_career_stats AS
SELECT
  driver_id,
  full_name,
  nationality,
  COUNT(*) AS total_races,
  SUM(CASE WHEN result_type = 'Win' THEN 1 ELSE 0 END) AS total_wins,
  SUM(CASE WHEN result_type IN ('Win', 'Runner-up', 'Podium') THEN 1 ELSE 0 END) AS total_podiums,
  SUM(normalized_points) AS career_points,
  ROUND(SUM(normalized_points) / COUNT(*), 2) AS points_per_race
FROM main.default.lineage_driver_performance
GROUP BY driver_id, full_name, nationality;

-- 5. Create championship tier classification
CREATE OR REPLACE TABLE main.default.lineage_championship_tiers AS
SELECT
  driver_id,
  full_name,
  nationality,
  total_races,
  total_wins,
  total_podiums,
  career_points,
  points_per_race,
  CASE
    WHEN total_wins >= 20 THEN 'Legend'
    WHEN total_wins >= 10 THEN 'Champion'
    WHEN total_wins >= 5 THEN 'Race Winner'
    WHEN total_podiums >= 10 THEN 'Podium Regular'
    WHEN total_podiums >= 1 THEN 'Podium Achiever'
    WHEN points_per_race >= 1 THEN 'Points Scorer'
    ELSE 'Competitor'
  END AS driver_tier
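FROM main.default.lineage_career_stats
-- Only classify drivers with at least 10 recorded results. Row order is not
-- preserved in a table, so sorting is left to the queries that read it.
WHERE total_races >= 10;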

-- Show final classification counts
SELECT driver_tier, COUNT(*) as driver_count 
FROM main.default.lineage_championship_tiers
GROUP BY driver_tier
ORDER BY COUNT(*) DESC;

## 📈 Data Lineage Visualization

Now that we've created our lineage demo tables, let's view the full lineage:

**[Screenshot: Unity Catalog Lineage Graph]**
*📁 Image location: `images/03_lineage_graph.png`*
*Screenshot guidance: Show the Unity Catalog lineage view with F1 tables connected, displaying the flow from Bronze → Silver → Gold*

### 🔄 Our Lineage Demo Pipeline:

```
Source Tables                    Joined Data                  Analytics                    Classification
┌─────────────────────┐
│ lineage_drivers_    │───┐
│ source              │   │      ┌─────────────────────┐      ┌─────────────────────┐      ┌─────────────────────┐
└─────────────────────┘   ├─────▶│ lineage_driver_     │─────▶│ lineage_career_     │─────▶│ lineage_championship│
┌─────────────────────┐   │      │ performance         │      │ stats               │      │ _tiers              │
│ lineage_results_    │───┘      └─────────────────────┘      └─────────────────────┘      └─────────────────────┘
│ source              │
└─────────────────────┘
```

### 🔍 How to Explore This Lineage:

#### 1. **Navigate to Unity Catalog UI**
- Click **"Catalog"** in the left sidebar
- Navigate to **main** → **default** → **lineage_championship_tiers**
- Click the **"Lineage"** tab

#### 2. **Explore Lineage Graph**
- **Upstream dependencies** - see source tables
- **Transformation logic** - view the JOIN operations
- **Column-level lineage** - trace individual metrics
- **Downstream usage** - find dashboards and queries using this data

#### 3. **Impact Analysis**
- **Schema changes** - understand what would break
- **Dependency mapping** - see all affected assets
- **Change propagation** - track impact of modifications
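
The same lineage can be queried in SQL through the lineage system tables. This is a sketch that assumes an administrator has enabled the `system.access` schema and granted you access to it. Lineage events can take a little while to appear after the tables are created.

```sql
-- Upstream and downstream edges between the lineage demo tables
SELECT DISTINCT source_table_full_name, target_table_full_name, entity_type
FROM system.access.table_lineage
WHERE target_table_full_name LIKE 'main.default.lineage%'
  AND source_table_full_name IS NOT NULL
ORDER BY target_table_full_name, source_table_full_name;

-- Impact analysis: which tables read from the drivers source table?
SELECT DISTINCT target_table_full_name
FROM system.access.table_lineage
WHERE source_table_full_name = 'main.default.lineage_drivers_source'
  AND target_table_full_name IS NOT NULL;

-- Column-level lineage: which upstream columns feed driver_tier?
SELECT DISTINCT source_table_full_name, source_column_name, target_column_name
FROM system.access.column_lineage
WHERE target_table_full_name = 'main.default.lineage_championship_tiers'
  AND target_column_name = 'driver_tier';
```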

In [0]:
-- Query to view our driver classification tiers
SELECT 
  full_name, 
  nationality, 
  total_races, 
  total_wins, 
  total_podiums, 
  ROUND(points_per_race, 1) as avg_points, 
  driver_tier
FROM main.default.lineage_championship_tiers
WHERE driver_tier IN ('Legend', 'Champion', 'Race Winner')
ORDER BY career_points DESC
LIMIT 20;

In [0]:
-- Add metadata to our tables for better governance and discoverability
-- This SQL approach is more straightforward than using Python for SQL operations

-- Gold layer tables
ALTER TABLE main.default.gold_driver_standings 
SET TBLPROPERTIES ('comment' = 'Comprehensive F1 driver career statistics and performance metrics aggregated from race results');

ALTER TABLE main.default.gold_season_stats 
SET TBLPROPERTIES ('comment' = 'Annual Formula 1 season-level analytics including driver counts, races, and completion rates');

-- Silver layer tables
ALTER TABLE main.default.silver_drivers 
SET TBLPROPERTIES ('comment' = 'Cleaned and validated F1 driver information with standardized names and data types');

ALTER TABLE main.default.silver_races 
SET TBLPROPERTIES ('comment' = 'Processed F1 race information with validated dates and circuit details');

ALTER TABLE main.default.silver_results 
SET TBLPROPERTIES ('comment' = 'Clean race results with calculated fields for winners, podiums, and performance metrics');
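
As an alternative sketch, comments can also be set with `COMMENT ON TABLE`, and individual columns can be documented with `ALTER COLUMN ... COMMENT`. The comment text below is illustrative wording, not taken from the original notebooks.

```sql
-- Table-level comment on one of the lineage demo tables
COMMENT ON TABLE main.default.lineage_career_stats
IS 'Per-driver career aggregates derived from lineage_driver_performance';

-- Column-level comment
ALTER TABLE main.default.lineage_career_stats
ALTER COLUMN points_per_race COMMENT 'Normalized career points divided by total races';

-- Verify: comments appear next to each column and in the table details
DESCRIBE TABLE EXTENDED main.default.lineage_career_stats;
```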

## 🔒 Governance and Security Features

Unity Catalog provides enterprise-grade governance capabilities:

### 🛡️ Security Controls:
- **Fine-grained access control** (table, column, row level)
- **Dynamic data masking** for sensitive information
- **Audit logging** for compliance and monitoring
- **Attribute-based access control** (ABAC)
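
As a rough sketch of table-level grants and a column mask: the group names `f1_analysts` and `f1_admins` and the function `mask_date_of_birth` are hypothetical, and the example assumes `date_of_birth` is a `DATE` column. Adjust these to your workspace before running.

```sql
-- Let a (hypothetical) analyst group read one gold table
GRANT USE CATALOG ON CATALOG main TO `f1_analysts`;
GRANT USE SCHEMA ON SCHEMA main.default TO `f1_analysts`;
GRANT SELECT ON TABLE main.default.gold_driver_standings TO `f1_analysts`;

-- Review who can access the table
SHOW GRANTS ON TABLE main.default.gold_driver_standings;

-- Column mask: only members of the (hypothetical) admin group see birth dates
CREATE OR REPLACE FUNCTION main.default.mask_date_of_birth(dob DATE)
RETURNS DATE
RETURN CASE WHEN is_account_group_member('f1_admins') THEN dob ELSE NULL END;

ALTER TABLE main.default.lineage_drivers_source
ALTER COLUMN date_of_birth SET MASK main.default.mask_date_of_birth;
```

To remove the mask again, `ALTER TABLE ... ALTER COLUMN date_of_birth DROP MASK` can be used.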

### 📋 Data Classification:
- **Sensitivity labels** (PII, confidential, public)
- **Compliance tags** (GDPR, CCPA, SOX)
- **Business classification** (finance, marketing, operations)
- **Quality indicators** (bronze, silver, gold, certified)
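
Classification is typically applied with tags. The tag keys and values below (`layer`, `domain`, `sensitivity`) are illustrative assumptions rather than a required scheme.

```sql
-- Tag a table with its medallion layer and business domain
ALTER TABLE main.default.gold_driver_standings
SET TAGS ('layer' = 'gold', 'domain' = 'formula1');

-- Tag a single column as personal data
ALTER TABLE main.default.lineage_drivers_source
ALTER COLUMN date_of_birth SET TAGS ('sensitivity' = 'pii');

-- List the table tags in the schema
SELECT table_name, tag_name, tag_value
FROM main.information_schema.table_tags
WHERE schema_name = 'default'
ORDER BY table_name, tag_name;
```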

### 🔍 Monitoring and Auditing:
- **Access patterns** and usage analytics
- **Change history** and version control
- **Performance metrics** and optimization recommendations
- **Data freshness** and quality monitoring
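
Access patterns can be inspected through the audit log system table. This sketch assumes `system.access.audit` is enabled and readable in your workspace, and that table reads are recorded under the `getTable` action with the table name in `request_params.full_name_arg`. Treat both of those names as assumptions to check against your own audit data.

```sql
-- Who looked up the gold standings table during the last 7 days?
SELECT
  event_time,
  user_identity.email AS user_email,
  action_name,
  request_params.full_name_arg AS table_name
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND action_name = 'getTable'
  AND request_params.full_name_arg = 'main.default.gold_driver_standings'
  AND event_date >= date_sub(current_date(), 7)
ORDER BY event_time DESC
LIMIT 50;
```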

## ✅ Unity Catalog Demo Complete!

**🎉 Great job! You've explored Unity Catalog's powerful governance features!**

### What You've Learned:
- ✅ **3-level namespace** organization (catalog.schema.table)
- ✅ **Data lineage** visualization and impact analysis
- ✅ **Metadata management** and table documentation
- ✅ **Governance capabilities** for enterprise data management

### 📋 Tables Created for Lineage Demo:
1. `lineage_drivers_source`: Driver base information
2. `lineage_results_source`: Race results base information
3. `lineage_driver_performance`: Joined race performance data
4. `lineage_career_stats`: Aggregated driver statistics
5. `lineage_championship_tiers`: Driver classification by performance tier
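
Once you have finished exploring the lineage graph, the demo tables can be removed. This is an optional cleanup sketch; skip it if later notebooks or your own queries still need these tables.

```sql
-- Drop downstream tables first, then the sources
DROP TABLE IF EXISTS main.default.lineage_championship_tiers;
DROP TABLE IF EXISTS main.default.lineage_career_stats;
DROP TABLE IF EXISTS main.default.lineage_driver_performance;
DROP TABLE IF EXISTS main.default.lineage_results_source;
DROP TABLE IF EXISTS main.default.lineage_drivers_source;
```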

### 🚀 Next Steps in Unity Catalog:
1. **Explore the UI** - Navigate to Catalog → main → default
2. **View Lineage** - Click on lineage_championship_tiers and explore its lineage
3. **Search Data** - Use the search bar to find F1-related assets
4. **Set up Permissions** - Configure access controls for your team
5. **Add Tags** - Classify your data with business-relevant tags

### 💡 Key Governance Benefits:
- **Data Discovery** - Find relevant datasets quickly
- **Impact Analysis** - Understand change consequences
- **Compliance** - Meet regulatory requirements
- **Quality** - Track data lineage and transformations
- **Security** - Control access at granular levels

**Continue to the next notebook:** `04_Job_Creation.ipynb`

**🏁 Ready to automate your F1 pipeline? Let's create some jobs! 🚀**