# 🗄️ Unity Catalog Demo: Data Governance & Lineage
*Explore Unity Catalog features with Formula 1 data lineage in 5 minutes*

---

## 🎯 Learning Objectives

By the end of this demo, you'll understand:
- ✅ **Unity Catalog's 3-level namespace** (catalog.schema.table)
- ✅ **Data lineage tracking** and visualization
- ✅ **Governance features** for enterprise data management
- ✅ **Best practices** for organizing data assets

---

## 📊 What We'll Build

**Data Lineage Demo Pipeline:**
```
Source Tables               Intermediate Tables           Final Analytics
┌──────────────────┐       ┌─────────────────────┐
│ lineage_drivers_ │   →   │ lineage_driver_     │
│ source           │       │ performance         │
└──────────────────┘       └─────────────────────┘
┌──────────────────┐              ↑       │
│ lineage_results_ │   ───────────┘       │
│ source           │                      ↓
└──────────────────┘       ┌─────────────────────┐       ┌─────────────────────┐
                           │ lineage_career_     │   →   │ lineage_championship│
                           │ stats               │       │ _tiers              │
                           └─────────────────────┘       └─────────────────────┘
```

**🎯 Goal:** Create clear data lineage that you can visualize in the Catalog UI
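
Because each table reads from the ones before it, the creation order matters. The sketch below is plain Python rather than notebook code: it records each table's inputs in a hand-written dictionary (`UPSTREAM` and `build_order` are illustrative names, not part of Unity Catalog) and uses the standard library's `graphlib` to derive one valid build order.

```python
from graphlib import TopologicalSorter

# Each table maps to the set of tables it reads from (its upstream inputs).
# The table names match this demo; the dictionary itself is hand-written.
UPSTREAM = {
    "lineage_drivers_source": set(),
    "lineage_results_source": set(),
    "lineage_driver_performance": {"lineage_drivers_source", "lineage_results_source"},
    "lineage_career_stats": {"lineage_driver_performance"},
    "lineage_championship_tiers": {"lineage_career_stats"},
}


def build_order(upstream):
    """Return one valid creation order: every table appears after its inputs."""
    return list(TopologicalSorter(upstream).static_order())


order = build_order(UPSTREAM)
for position, table in enumerate(order, start=1):
    print(f"{position}. {table}")
```

The two source tables can be created in either order, but the last three positions are fixed: `lineage_driver_performance`, then `lineage_career_stats`, then `lineage_championship_tiers`.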

## 🏗️ Unity Catalog Overview

**Unity Catalog** is Databricks' unified governance solution for data and AI assets.

### 🔧 Key Features:
- **🗂️ 3-Level Namespace:** `catalog.schema.table` hierarchy
- **📈 Data Lineage:** Automatic tracking of data dependencies  
- **🔒 Access Control:** Fine-grained permissions (row/column level)
- **📋 Metadata Management:** Rich descriptions, tags, and discovery
- **🌍 Cross-Cloud:** Works across AWS, Azure, and GCP
- **🔄 Table History:** Schema evolution and time travel (provided by the underlying Delta Lake tables)
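
To make the 3-level namespace concrete, here is a minimal plain-Python sketch of how a short table name can be expanded to `catalog.schema.table`. `TableName` and `parse_table_name` are hypothetical helpers, and the defaults `main` and `default` are assumptions taken from this demo. Spark resolves names itself, so this only illustrates the idea.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    catalog: str
    schema: str
    table: str

    def __str__(self) -> str:
        return f"{self.catalog}.{self.schema}.{self.table}"


def parse_table_name(name, default_catalog="main", default_schema="default"):
    """Resolve a 1-, 2-, or 3-part name into catalog, schema, and table.

    Simplification: backtick-quoted identifiers that contain dots are not handled.
    """
    parts = name.split(".")
    if not 1 <= len(parts) <= 3 or not all(parts):
        raise ValueError(f"not a valid table name: {name!r}")
    # Fill in the missing leading parts from the session defaults.
    defaults = [default_catalog, default_schema]
    return TableName(*(defaults[: 3 - len(parts)] + parts))


print(parse_table_name("silver_drivers"))          # one-part name
print(parse_table_name("default.silver_drivers"))  # two-part name
print(parse_table_name("main.default.silver_drivers").catalog)
```

All three calls refer to the same table. Writing the full three-part name, as the SQL cells in this notebook do, avoids depending on the session's current catalog and schema.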

### 🏢 Enterprise Benefits:
- **Compliance:** Support for GDPR, CCPA, and SOX requirements
- **Audit:** Complete audit trail of data access
- **Collaboration:** Shared catalogs across workspaces
- **Discovery:** Data marketplace for self-service analytics

In [None]:
# Let's start by exploring our current Unity Catalog setup
print("🗄️ Unity Catalog Environment")
print("=" * 40)

# Show current catalog context
current_catalog = spark.sql("SELECT current_catalog()").collect()[0][0]
current_schema = spark.sql("SELECT current_schema()").collect()[0][0]
current_user = spark.sql("SELECT current_user()").collect()[0][0]

print(f"📚 Current Catalog: {current_catalog}")
print(f"📁 Current Schema: {current_schema}")  
print(f"👤 Current User: {current_user}")
print(f"🌐 Full Context: {current_catalog}.{current_schema}")

In [None]:
%sql
-- Let's see what tables we have from our previous notebook
SHOW TABLES IN main.default

## 📈 Step 1: Create Source Tables for Lineage Demo

We'll create simplified source tables to demonstrate clear lineage relationships.

In [None]:
%sql
-- Lineage Source 1: Driver Information
CREATE OR REPLACE TABLE main.default.lineage_drivers_source
USING DELTA
COMMENT 'Source table: Driver master data for lineage demonstration'
AS
SELECT 
  driverId,
  full_name,
  nationality,
  current_age,
  'drivers_master_system' as source_system,
  current_timestamp() as ingested_at
FROM main.default.silver_drivers
WHERE driverId IS NOT NULL

In [None]:
%sql
-- Lineage Source 2: Race Results Information  
CREATE OR REPLACE TABLE main.default.lineage_results_source
USING DELTA
COMMENT 'Source table: Race results data for lineage demonstration'
AS
SELECT 
  r.resultId,
  r.raceId,
  r.driverId,
  r.finish_position,
  r.points,
  r.race_winner,
  race.year as season,
  race.race_name,
  'race_results_system' as source_system,
  current_timestamp() as ingested_at
FROM main.default.silver_results r
JOIN main.default.silver_races race ON r.raceId = race.raceId
WHERE r.driverId IS NOT NULL

## 🔄 Step 2: Create Intermediate Processing Tables

These tables will show how data flows through transformation layers.

In [None]:
%sql
-- Intermediate Table 1: Driver Performance Metrics
CREATE OR REPLACE TABLE main.default.lineage_driver_performance
USING DELTA
COMMENT 'Intermediate table: Driver performance calculated from sources (shows lineage from 2 source tables)'
AS
SELECT 
  d.driverId,
  d.full_name,
  d.nationality,
  d.current_age,
  -- Performance metrics from results
  COUNT(r.resultId) as total_races,
  SUM(r.points) as total_points,
  COUNT(CASE WHEN r.race_winner THEN 1 END) as wins,
  COUNT(CASE WHEN r.finish_position <= 3 THEN 1 END) as podiums,
  ROUND(AVG(r.finish_position), 2) as avg_finish_position,
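  -- Data lineage metadata
  -- r.source_system is a plain string that is constant within each group, so aggregate it with MAX()
  ARRAY(d.source_system, MAX(r.source_system)) as upstream_sources,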
  current_timestamp() as processed_at
FROM main.default.lineage_drivers_source d
JOIN main.default.lineage_results_source r ON d.driverId = r.driverId
GROUP BY d.driverId, d.full_name, d.nationality, d.current_age, d.source_system

In [None]:
%sql
-- Intermediate Table 2: Career Statistics Aggregation
CREATE OR REPLACE TABLE main.default.lineage_career_stats
USING DELTA
COMMENT 'Intermediate table: Career aggregations derived from driver performance'
AS
SELECT 
  driverId,
  full_name,
  nationality,
  total_races,
  total_points,
  wins,
  podiums,
  -- Career performance calculations
  ROUND(total_points / total_races, 2) as points_per_race,
  ROUND(wins * 100.0 / total_races, 2) as win_percentage,
  ROUND(podiums * 100.0 / total_races, 2) as podium_percentage,
  -- Career categories
  CASE 
    WHEN wins >= 20 THEN 'Legend'
    WHEN wins >= 5 THEN 'Star'
    WHEN podiums >= 10 THEN 'Contender'
    WHEN total_points >= 100 THEN 'Regular'
    ELSE 'Rookie'
  END as career_tier,
  upstream_sources,
  current_timestamp() as calculated_at
FROM main.default.lineage_driver_performance
WHERE total_races >= 5  -- Focus on drivers with meaningful careers
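
The `CASE` expression assigns the first tier whose condition matches, so the order of the branches matters. The plain-Python sketch below mirrors that logic (`career_tier` is an illustrative helper and the driver numbers are made up).

```python
def career_tier(wins: int, podiums: int, total_points: float) -> str:
    """Mirror the SQL CASE expression: the first matching branch wins."""
    if wins >= 20:
        return "Legend"
    if wins >= 5:
        return "Star"
    if podiums >= 10:
        return "Contender"
    if total_points >= 100:
        return "Regular"
    return "Rookie"


# Made-up drivers: (wins, podiums, total_points)
examples = [(25, 60, 2000.0), (6, 15, 500.0), (0, 12, 300.0), (0, 2, 150.0), (0, 0, 10.0)]
for wins, podiums, points in examples:
    print(f"wins={wins:>2} podiums={podiums:>2} points={points:>6}: {career_tier(wins, podiums, points)}")
```

A driver with 6 wins and 15 podiums is a `Star`, not a `Contender`, because the wins check comes first.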

## 🏆 Step 3: Create Final Analytics Table

This final table shows the complete lineage from sources through to business-ready analytics.

In [None]:
%sql
-- Final Analytics Table: Championship Tier Analysis
CREATE OR REPLACE TABLE main.default.lineage_championship_tiers
USING DELTA
COMMENT 'Analytics table: Final championship tier analysis showing complete data lineage'
AS
SELECT 
  career_tier,
  COUNT(*) as driver_count,
  ROUND(AVG(total_points), 0) as avg_career_points,
  ROUND(AVG(wins), 1) as avg_wins,
  ROUND(AVG(podiums), 1) as avg_podiums,
  ROUND(AVG(points_per_race), 2) as avg_points_per_race,
  ROUND(AVG(win_percentage), 2) as avg_win_percentage,
  -- Data quality metrics
  MIN(total_races) as min_races_in_tier,
  MAX(total_races) as max_races_in_tier,
  -- Lineage tracking
  'Derived from lineage_career_stats → lineage_driver_performance → lineage_*_source' as data_lineage,
  current_timestamp() as analysis_date
FROM main.default.lineage_career_stats
GROUP BY career_tier
ORDER BY 
  CASE career_tier
    WHEN 'Legend' THEN 1
    WHEN 'Star' THEN 2  
    WHEN 'Contender' THEN 3
    WHEN 'Regular' THEN 4
    ELSE 5
  END

In [None]:
# Let's verify all our lineage tables were created successfully
print("📈 Data Lineage Pipeline Summary")
print("=" * 45)

lineage_tables = [
    'lineage_drivers_source',
    'lineage_results_source', 
    'lineage_driver_performance',
    'lineage_career_stats',
    'lineage_championship_tiers'
]

for table in lineage_tables:
    try:
        # Row subclasses tuple, so `.count` is the tuple method; index by column name instead
        count = spark.sql(f"SELECT COUNT(*) AS count FROM main.default.{table}").collect()[0]["count"]
        if 'source' in table:
            emoji = '📥'
        elif 'championship' in table:
            emoji = '🏆'
        else:
            emoji = '⚙️'
        print(f"{emoji} {table}: {count:,} records")
    except Exception as e:
        print(f"❌ {table}: Error - {str(e)}")

In [None]:
%sql
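-- Let's see our final championship tier analysis.
-- A table has no guaranteed row order, so sort explicitly when reading it.
SELECT * FROM main.default.lineage_championship_tiers
ORDER BY avg_career_points DESC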

## 🔍 Step 4: Viewing Data Lineage in Catalog UI

Now let's explore how to view the lineage we just created!

### 📋 How to View Lineage:

1. **📂 Navigate to Catalog Explorer**
   - Click **"Catalog"** in the left sidebar
   - Expand **"main"** catalog
   - Expand **"default"** schema

2. **🎯 Select a Table**
   - Click on **`lineage_championship_tiers`** (our final table)
   - This will open the table details page

3. **📈 View Lineage Tab**
   - Click the **"Lineage"** tab at the top
   - You'll see a visual graph showing data flow
   - Source tables → Intermediate transformations → Final analytics

4. **🔍 Explore Dependencies**
   - Click on any table node to see its details
   - Hover over connections to see transformation info
   - Use zoom controls to navigate large lineage graphs

### 🎨 What You'll See:
```
lineage_drivers_source ──┐
                         ├─→ lineage_driver_performance ─→ lineage_career_stats ─→ lineage_championship_tiers
lineage_results_source ──┘
```
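
The "upstream" view in the Lineage tab amounts to walking the graph backwards from the selected table. Here is a minimal plain-Python sketch of that idea with a hand-written edge list (`EDGES` and `upstream_tables` are illustrative names, not a Unity Catalog API). In a real workspace the graph would likely also include the `silver_*` tables that the source tables read from.

```python
from collections import deque

# (source, target) pairs matching the CREATE TABLE statements in this notebook.
EDGES = [
    ("lineage_drivers_source", "lineage_driver_performance"),
    ("lineage_results_source", "lineage_driver_performance"),
    ("lineage_driver_performance", "lineage_career_stats"),
    ("lineage_career_stats", "lineage_championship_tiers"),
]


def upstream_tables(edges, table):
    """Return {ancestor: hops} for every table that feeds `table`."""
    parents = {}
    for source, target in edges:
        parents.setdefault(target, []).append(source)

    hops = {}
    queue = deque([(table, 0)])
    while queue:  # breadth-first walk, so the first visit is the shortest path
        current, depth = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in hops:
                hops[parent] = depth + 1
                queue.append((parent, depth + 1))
    return hops


lineage = upstream_tables(EDGES, "lineage_championship_tiers")
for name, depth in sorted(lineage.items(), key=lambda item: item[1]):
    print(f"{depth} hop(s) upstream: {name}")
```

Starting from `lineage_championship_tiers`, both source tables turn up three hops away, which matches the chain shown in the diagram.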

## 🏢 Unity Catalog Enterprise Features

Unity Catalog provides comprehensive governance capabilities for enterprise data management.

In [None]:
# Let's explore some Unity Catalog metadata and governance features
print("🏢 Unity Catalog Governance Features")
print("=" * 45)

In [None]:
%sql
-- Explore table metadata (columns, owner, comment, properties); lineage itself is shown in the Catalog UI
DESCRIBE EXTENDED main.default.lineage_championship_tiers

In [None]:
%sql
-- Show table history (Delta Lake time travel)
DESCRIBE HISTORY main.default.lineage_championship_tiers

In [None]:
%sql
-- Explore catalog-level information
DESCRIBE CATALOG main

In [None]:
%sql
-- Show all schemas in our catalog
SHOW SCHEMAS IN main

## 🔒 Access Control & Security Features

Unity Catalog provides fine-grained access control at multiple levels:

### 🛡️ Security Levels:

1. **📚 Catalog Level**
   - Control who can access entire data domains
   - Example: Finance catalog vs Marketing catalog

2. **📁 Schema Level** 
   - Organize tables by project or team
   - Example: `finance.payroll` vs `finance.budgets`

3. **📊 Table Level**
   - Individual table permissions
   - Example: Read-only vs Read-Write access

4. **📋 Column Level**
   - Hide sensitive columns from specific users
   - Example: Mask PII data for non-admin users

5. **📝 Row Level**
   - Filter data based on user attributes
   - Example: Users only see their region's data

### 🔑 Common Permission Patterns:
```sql
-- Grant read access to analysts
GRANT SELECT ON main.default.lineage_championship_tiers TO analysts;

-- Grant write access to data engineers  
GRANT MODIFY ON SCHEMA main.default TO data_engineers;

-- Grant full catalog admin to data platform team
GRANT ALL PRIVILEGES ON CATALOG main TO data_platform_admins;
```
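
Reading a table in Unity Catalog also requires `USE CATALOG` on its catalog and `USE SCHEMA` on its schema, so a `SELECT` grant alone is usually not enough for a new group. The sketch below builds the three statements as strings. `read_grants` is a hypothetical helper, and `analysts` is assumed to be an existing group.

```python
def read_grants(table_full_name: str, principal: str) -> list[str]:
    """Build the GRANT statements a principal needs to query one table."""
    catalog, schema, _table = table_full_name.split(".")
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`",
        f"GRANT SELECT ON TABLE {table_full_name} TO `{principal}`",
    ]


statements = read_grants("main.default.lineage_championship_tiers", "analysts")
for statement in statements:
    print(statement + ";")
```

In a notebook, each statement could then be passed to `spark.sql(...)` by a user who is allowed to grant on those objects.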

## 📋 Best Practices for Data Organization

### 🏗️ Recommended Catalog Structure:

```
📚 Enterprise Catalog Layout
├── 🏢 prod_catalog (Production data)
│   ├── 📁 finance_schema
│   │   ├── 📊 revenue_table
│   │   └── 📊 costs_table
│   ├── 📁 marketing_schema
│   │   ├── 📊 campaigns_table
│   │   └── 📊 leads_table
│   └── 📁 shared_schema
│       ├── 📊 dim_dates
│       └── 📊 dim_geography
├── 🧪 dev_catalog (Development/Testing)
│   └── 📁 [same schema structure]
└── 📊 analytics_catalog (Curated analytics)
    ├── 📁 executive_dashboards
    └── 📁 self_service_analytics
```

### 🎯 Naming Conventions:
- **Catalogs:** `{environment}_{domain}` (e.g., `prod_finance`, `dev_marketing`)
- **Schemas:** `{team_or_project}` (e.g., `payroll`, `customer_analytics`)
- **Tables:** `{layer}_{entity}_{purpose}` (e.g., `gold_customer_360`, `silver_transactions_clean`)
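
Conventions like these are easier to keep when they are checked automatically. The following plain-Python sketch validates catalog and table names against the patterns above. The allowed environments and layers are assumptions for illustration, and `check_catalog` and `check_table` are hypothetical helpers.

```python
import re

KNOWN_ENVIRONMENTS = {"prod", "dev"}         # assumption: adjust to your environments
KNOWN_LAYERS = {"bronze", "silver", "gold"}  # assumption: medallion-style layers

# Lowercase words separated by single underscores, e.g. gold_customer_360.
NAME_RE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def _has_prefix(name, allowed_prefixes):
    """True if `name` is snake_case and starts with an allowed prefix plus more."""
    prefix, _, rest = name.partition("_")
    return bool(NAME_RE.match(name)) and prefix in allowed_prefixes and bool(rest)


def check_catalog(name):
    """Check the {environment}_{domain} convention."""
    return _has_prefix(name, KNOWN_ENVIRONMENTS)


def check_table(name):
    """Check the {layer}_{entity}_{purpose} convention (layer prefix only)."""
    return _has_prefix(name, KNOWN_LAYERS)


for name in ["gold_customer_360", "silver_transactions_clean", "lineage_career_stats"]:
    print(f"{name}: {'ok' if check_table(name) else 'does not match'}")
```

The demo's own `lineage_*` tables do not match, which is fine for a tutorial but shows how such a check would flag names that skip the layer prefix.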

In [None]:
# Let's demonstrate some catalog exploration capabilities
print("🔍 Catalog Exploration Demo")
print("=" * 35)

In [None]:
%sql
-- Find all tables with 'lineage' in the name
SHOW TABLES IN main.default LIKE 'lineage*'

In [None]:
%sql
-- Search for tables by comment/description
SELECT 
  table_catalog,
  table_schema,
  table_name,
  table_type,
  comment
FROM main.information_schema.tables  -- qualify the catalog so the result does not depend on the current catalog
WHERE table_schema = 'default' 
  AND comment LIKE '%lineage%'
ORDER BY table_name

## ✅ Unity Catalog Demo Complete!

**🎉 Excellent work! You've explored Unity Catalog's powerful governance features!**

### What You've Accomplished:
- ✅ **Built data lineage** with 5 interconnected tables
- ✅ **Explored the 3-level namespace** (catalog.schema.table)
- ✅ **Learned lineage visualization** in the Catalog UI
- ✅ **Discovered governance features** (permissions, metadata, audit)
- ✅ **Applied best practices** for data organization

### 🔍 Next Steps to Explore Lineage:
1. **Navigate to Catalog Explorer** (left sidebar)
2. **Find your tables:** main → default → `lineage_championship_tiers`
3. **Click the Lineage tab** to see the visual data flow
4. **Explore dependencies** by clicking on connected tables

### 📊 Your Lineage Pipeline:
```
Sources → Intermediate → Analytics
✅ lineage_drivers_source      ✅ lineage_driver_performance    ✅ lineage_championship_tiers
✅ lineage_results_source      ✅ lineage_career_stats
```

In [None]:
# Final summary of what we created
print("🗄️ Unity Catalog Demo Summary")
print("=" * 40)

# Count total tables in main.default (index by column name: Row.count is the tuple method)
total_tables = spark.sql("SELECT COUNT(*) AS count FROM main.information_schema.tables WHERE table_schema = 'default'").collect()[0]["count"]
lineage_table_count = len(lineage_tables)

print(f"📊 Total tables in main.default: {total_tables}")
print(f"📈 Lineage demo tables created: {lineage_table_count}")
print(f"🔗 Data lineage relationships: Established")
print(f"🏢 Governance features: Demonstrated")

print(f"\n🎯 Unity Catalog demo completed at {spark.sql('SELECT current_timestamp()').collect()[0][0]}")

## 🚀 Next Steps

Ready to explore more Databricks features? Here's what's next:

### Immediate Next Steps:
1. **➡️ [04_Job_Creation.ipynb](04_Job_Creation.ipynb)** - Automate your data pipelines
2. **➡️ [05_Delta_Live_Pipeline.ipynb](05_Delta_Live_Pipeline.ipynb)** - Build managed ETL workflows
3. **➡️ [07_SQL_Editor.sql](07_SQL_Editor.sql)** - Create analytics queries and dashboards

### 🔍 Explore Your Lineage:
- **Open Catalog Explorer** and navigate to your lineage tables
- **Click the Lineage tab** to see visual data flow
- **Try the search functionality** to find tables by name or description

### 💡 Pro Tips:
- **📌 Bookmark important tables** for quick access
- **📝 Add rich descriptions** to help team members understand data
- **🏷️ Use tags** to categorize and organize data assets
- **🔒 Set up permissions** based on your team's access needs

**🗄️ Unity Catalog is your data governance superpower! 🚀**