# Monitor and Govern Databricks Workspaces

Use system tables to monitor usage, costs, and implement governance with Unity Catalog.

## What You'll Learn

✅ Query system tables for observability  
✅ Analyze billing and cost allocation  
✅ Monitor workspace usage and performance  
✅ Implement Unity Catalog security  
✅ Create governance dashboards  

**Note**: Since students won't have access to actual system tables, we'll use synthetic data that matches the schema.

---

**References:**
- [System Tables](https://docs.databricks.com/aws/en/admin/system-tables/)
- [Billing Tables](https://docs.databricks.com/aws/en/admin/system-tables/billing)
- [Unity Catalog Governance](https://docs.databricks.com/aws/en/data-governance/unity-catalog/)
- [Observability Dashboards](https://github.com/CodyAustinDavis/dbsql_sme/tree/main/Observability%20Dashboards%20and%20DBA%20Resources)

## 1. System Tables Overview

### What are System Tables?

**System Tables** provide observability into:
- Billing and usage
- Query execution
- Warehouse performance
- Audit logs
- Lineage information

### Available Schemas

```
system.billing.*        - Cost and usage data
system.compute.*        - Cluster and warehouse metrics
system.query.*          - Query execution logs
system.access.audit     - Audit logs
system.access.*_lineage - Data lineage (table_lineage, column_lineage)
```

### Access Requirements

**In Production:**
- Account admin privileges
- Unity Catalog enabled
- System tables schema access

**In This Training:**
- We'll use synthetic data with matching schemas
- Demonstrates real-world queries and patterns

---

## 2. Billing and Cost Analysis

### Synthetic Billing Data Setup

```python
# Create synthetic system tables for training.
# Assumes a Databricks notebook, where `spark` is already defined.
from pyspark.sql import functions as F
from datetime import datetime, timedelta
import random

# First, ensure the training schema exists
spark.sql("CREATE SCHEMA IF NOT EXISTS training")

# ========================================
# 1. BILLING DATA
# ========================================

# Generate sample billing records.
# usage_date is kept as a DATE (not a string) so the date filters
# in the later queries compare like with like.
dates = [(datetime.now() - timedelta(days=x)).date() for x in range(30)]
workspaces = ['prod-workspace', 'dev-workspace', 'staging-workspace']
sku_names = ['JOBS_COMPUTE', 'ALL_PURPOSE_COMPUTE', 'SQL_COMPUTE', 'DELTA_LIVE_TABLES']
users = ['user1@company.com', 'user2@company.com', 'user3@company.com', 'system']

billing_data = []
for date in dates:
    for _ in range(50):
        billing_data.append({
            'usage_date': date,
            'workspace_id': random.choice(workspaces),
            'sku_name': random.choice(sku_names),
            'usage_quantity': round(random.uniform(0.1, 10.0), 2),
            'usage_unit': 'DBU',
            'list_price': round(random.uniform(0.1, 2.0), 2),
            'usage_metadata': {
                'job_id': f'job_{random.randint(1, 100)}',
                'cluster_id': f'cluster_{random.randint(1, 20)}',
                'user': random.choice(users)
            }
        })

billing_df = spark.createDataFrame(billing_data)
billing_df.write.mode('overwrite').saveAsTable('training.system_billing')

print('✅ Synthetic billing data created')

# ========================================
# 2. QUERY HISTORY DATA
# ========================================

# Sample query texts related to our IoT dataset
query_texts = [
    'SELECT * FROM sensor_bronze WHERE factory_id = "A06"',
    'SELECT device_id, AVG(temperature) FROM sensor_bronze GROUP BY device_id',
    'SELECT * FROM inspection_gold ORDER BY timestamp DESC',
    'SELECT COUNT(*) FROM anomaly_detected',
    'INSERT INTO sensor_bronze SELECT * FROM sensor_data',
    'OPTIMIZE sensor_bronze ZORDER BY (device_id)',
    'SELECT f.factory_name, COUNT(*) FROM sensor_bronze s JOIN dim_factories f',
    'CREATE TABLE sensor_aggregates AS SELECT device_id, AVG(temp)',
    'SELECT * FROM dim_devices WHERE status = "Active"',
    'UPDATE dim_devices SET status = "Maintenance" WHERE device_id = 123',
]

warehouses = ['iot-analytics-warehouse', 'prod-sql-warehouse', 'dev-warehouse']
query_types = ['SELECT', 'INSERT', 'UPDATE', 'CREATE', 'OPTIMIZE']

query_history_data = []
for i in range(500):
    # Random start time within roughly the last 8 days
    query_start_time = datetime.now() - timedelta(days=random.randint(0, 7),
                                                  hours=random.randint(0, 23),
                                                  minutes=random.randint(0, 59))
    execution_time_ms = random.randint(100, 30000)

    query_history_data.append({
        'query_id': f'query_{i}',
        'query_text': random.choice(query_texts),
        'query_start_time': query_start_time,
        'execution_time_ms': execution_time_ms,
        'rows_produced': random.randint(0, 100000),
        'bytes_scanned': random.randint(1000, 10000000),
        'compute_cost': round(random.uniform(0.01, 5.0), 3),
        'warehouse_id': random.choice(warehouses),
        'user_email': random.choice(users),
        'query_type': random.choice(query_types),
        # Roughly one query in four fails
        'status': random.choice(['FINISHED', 'FINISHED', 'FINISHED', 'FAILED'])
    })

query_history_df = spark.createDataFrame(query_history_data)
query_history_df.write.mode('overwrite').saveAsTable('training.query_history')

print('✅ Synthetic query history created')

# ========================================
# 3. USER PERMISSIONS DATA
# ========================================

# Create user permissions table for row-level security example
user_permissions_data = [
    {'user_email': 'user1@company.com', 'region': 'North Region', 'role': 'analyst'},
    {'user_email': 'user2@company.com', 'region': 'East Region', 'role': 'analyst'},
    {'user_email': 'user3@company.com', 'region': 'South Region', 'role': 'analyst'},
    {'user_email': 'admin@company.com', 'region': 'All', 'role': 'admin'},
    {'user_email': 'system', 'region': 'All', 'role': 'system'},
]

user_permissions_df = spark.createDataFrame(user_permissions_data)
user_permissions_df.write.mode('overwrite').saveAsTable('training.user_permissions')

print('✅ Synthetic user permissions created')

# ========================================
# 4. AUDIT LOG DATA
# ========================================

# Sample audit logs
table_names = ['sensor_bronze', 'inspection_bronze', 'anomaly_detected',
               'inspection_gold', 'dim_devices', 'dim_factories']
action_names = ['SELECT', 'INSERT', 'UPDATE', 'DELETE', 'CREATE_TABLE', 'DROP_TABLE']

audit_data = []
for i in range(300):
    event_time = datetime.now() - timedelta(days=random.randint(0, 7),
                                            hours=random.randint(0, 23),
                                            minutes=random.randint(0, 59))

    audit_data.append({
        'event_id': f'event_{i}',
        'event_time': event_time,
        'event_date': event_time.date(),
        'user_email': random.choice(users),
        'action_name': random.choice(action_names),
        'table_full_name': f'default.training.{random.choice(table_names)}',
        'workspace_id': random.choice(workspaces),
        'request_id': f'req_{i}',
        'source_ip': f'10.0.{random.randint(1, 255)}.{random.randint(1, 255)}',
    })

audit_df = spark.createDataFrame(audit_data)
audit_df.write.mode('overwrite').saveAsTable('training.audit_logs')

print('✅ Synthetic audit logs created')

# ========================================
# SUMMARY
# ========================================

print('\n' + '='*60)
print('📊 SYNTHETIC SYSTEM TABLES CREATED')
print('='*60)
print(f'✅ training.system_billing     - {billing_df.count()} records')
print(f'✅ training.query_history      - {query_history_df.count()} records')
print(f'✅ training.user_permissions   - {user_permissions_df.count()} records')
print(f'✅ training.audit_logs         - {audit_df.count()} records')
print('='*60)

# Display sample data
print('\n📋 Sample Billing Data:')
billing_df.limit(5).display()
```

## 3. Cost Analysis Queries

Now let's analyze our synthetic billing data to understand cost patterns. In a real production environment, you would use `system.billing.*` tables.
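
One difference is worth keeping in mind before running the queries: the synthetic table stores `list_price` on every usage row, whereas in production the price usually has to be looked up separately (in `system.billing.list_prices`) by SKU and by the time window in which the price was in effect. The plain-Python sketch below illustrates only that interval-matching idea. The `PriceWindow` type, the `price_at` helper, and the sample prices are hypothetical and are not part of any Databricks API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class PriceWindow:
    """Hypothetical price-history record: one price per SKU per time window."""
    sku_name: str
    price: float
    start: datetime
    end: Optional[datetime]  # None means the price is still in effect


def price_at(windows, sku_name, usage_time):
    """Return the price whose window contains usage_time, or None if none matches."""
    for w in windows:
        if w.sku_name != sku_name:
            continue
        if w.start <= usage_time and (w.end is None or usage_time < w.end):
            return w.price
    return None


# Made-up prices for illustration only
windows = [
    PriceWindow("JOBS_COMPUTE", 0.10, datetime(2024, 1, 1), datetime(2024, 7, 1)),
    PriceWindow("JOBS_COMPUTE", 0.15, datetime(2024, 7, 1), None),
]

usage_quantity = 4.0  # DBUs consumed
cost = usage_quantity * price_at(windows, "JOBS_COMPUTE", datetime(2024, 8, 15))
```

In SQL the same idea becomes a join between usage rows and price rows on SKU plus a time-range condition. The synthetic data skips that step so the queries below stay short.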

### Total Cost by Day

Compact versions of related cost queries, sized for dashboard panels:

1. **Cost by Workspace** (Bar Chart)
```sql
SELECT workspace_id, SUM(usage_quantity * list_price) as cost
FROM training.system_billing
WHERE usage_date >= CURRENT_DATE - 30
GROUP BY workspace_id;
```

2. **Top Cost Drivers** (Table)
```sql
SELECT 
  usage_metadata.job_id,
  sku_name,
  SUM(usage_quantity * list_price) as total_cost
FROM training.system_billing
WHERE usage_date >= CURRENT_DATE - 7
GROUP BY usage_metadata.job_id, sku_name
ORDER BY total_cost DESC
LIMIT 20;
```

3. **User Cost Allocation** (Pie Chart)
```sql
SELECT 
  usage_metadata.user,
  SUM(usage_quantity * list_price) as cost
FROM training.system_billing
WHERE usage_date >= CURRENT_DATE - 30
GROUP BY usage_metadata.user;
```

```sql
-- Total Cost by Day
SELECT 
  usage_date,
  SUM(usage_quantity * list_price) as total_cost,
  COUNT(*) as num_operations
FROM training.system_billing
GROUP BY usage_date
ORDER BY usage_date DESC
LIMIT 10;
```


### Cost by Workspace

Identify which workspaces are consuming the most resources:


```sql
-- Cost by Workspace
SELECT 
  workspace_id,
  SUM(usage_quantity * list_price) as total_cost,
  COUNT(*) as num_operations,
  AVG(usage_quantity * list_price) as avg_cost_per_operation
FROM training.system_billing
WHERE usage_date >= CURRENT_DATE - 30
GROUP BY workspace_id
ORDER BY total_cost DESC;
```


### Cost by SKU Type

Understand which compute types are driving costs:


```sql
-- Cost by SKU Type
SELECT 
  sku_name,
  SUM(usage_quantity) as total_dbus,
  SUM(usage_quantity * list_price) as total_cost,
  AVG(usage_quantity * list_price) as avg_cost_per_operation,
  COUNT(*) as operations
FROM training.system_billing
WHERE usage_date >= CURRENT_DATE - 30
GROUP BY sku_name
ORDER BY total_cost DESC;
```


### Cost by User

Track cost allocation to individual users and teams:


```sql
-- Cost by User
SELECT 
  usage_metadata.user,
  COUNT(*) as operations,
  SUM(usage_quantity * list_price) as total_cost,
  AVG(usage_quantity * list_price) as avg_cost_per_operation
FROM training.system_billing
WHERE usage_date >= CURRENT_DATE - 30
GROUP BY usage_metadata.user
ORDER BY total_cost DESC
LIMIT 10;
```
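
For cost allocation, a percentage share per user is often easier to communicate than raw totals. The following plain-Python sketch mirrors the per-user aggregation and adds that share. The sample rows and the `cost_share_by_user` helper are hypothetical and use only the fields needed here.

```python
from collections import defaultdict

# Hypothetical rows shaped like training.system_billing (only the fields used here)
rows = [
    {"user": "user1@company.com", "usage_quantity": 2.0, "list_price": 0.5},
    {"user": "user2@company.com", "usage_quantity": 6.0, "list_price": 0.5},
    {"user": "user1@company.com", "usage_quantity": 1.0, "list_price": 1.0},
]


def cost_share_by_user(rows):
    """Sum cost per user and express each total as a percentage of the grand total."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["user"]] += r["usage_quantity"] * r["list_price"]

    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return {
        user: {
            "total_cost": cost,
            "share_pct": round(100 * cost / grand_total, 1) if grand_total else 0.0,
        }
        for user, cost in ranked
    }


shares = cost_share_by_user(rows)
```

With these sample rows, `user2@company.com` accounts for 3.0 of the 5.0 total and `user1@company.com` for the remaining 2.0.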


---

## 4. Usage Monitoring

Monitor query execution patterns and warehouse performance using query history.

### Most Expensive Queries

Identify queries that are consuming the most resources:


```sql
-- Most expensive queries in the last 7 days
SELECT 
  query_id,
  LEFT(query_text, 80) as query_text_preview,
  execution_time_ms,
  rows_produced,
  bytes_scanned,
  compute_cost,
  user_email,
  warehouse_id,
  query_start_time
FROM training.query_history
WHERE query_start_time >= CURRENT_DATE - 7
ORDER BY compute_cost DESC
LIMIT 20;
```


### Query Performance by Type

Analyze performance patterns by query type:


```sql
-- Query performance by type
SELECT 
  query_type,
  COUNT(*) as query_count,
  AVG(execution_time_ms) as avg_duration_ms,
  MAX(execution_time_ms) as max_duration_ms,
  AVG(compute_cost) as avg_cost,
  SUM(compute_cost) as total_cost
FROM training.query_history
WHERE query_start_time >= CURRENT_DATE - 7
GROUP BY query_type
ORDER BY query_count DESC;
```


### User Activity Analysis

Track which users are running the most queries:


```sql
-- User activity patterns
SELECT 
  user_email,
  COUNT(*) as total_queries,
  SUM(CASE WHEN status = 'FAILED' THEN 1 ELSE 0 END) as failed_queries,
  AVG(execution_time_ms) as avg_execution_ms,
  SUM(compute_cost) as total_cost
FROM training.query_history
WHERE query_start_time >= CURRENT_DATE - 7
GROUP BY user_email
ORDER BY total_queries DESC;
```
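
Raw failure counts are hard to compare between users who run very different numbers of queries, so a failure rate can be a useful companion metric. The sketch below computes it in plain Python. The sample records and the `failure_rates` helper are hypothetical. Because the synthetic generator picks `FAILED` for about one query in four, rates near 0.25 are what you should expect on the training data.

```python
from collections import Counter

# Hypothetical records shaped like training.query_history (only the fields used here)
queries = [
    {"user_email": "user1@company.com", "status": "FINISHED"},
    {"user_email": "user1@company.com", "status": "FAILED"},
    {"user_email": "user2@company.com", "status": "FINISHED"},
    {"user_email": "user1@company.com", "status": "FINISHED"},
    {"user_email": "user1@company.com", "status": "FINISHED"},
]


def failure_rates(queries):
    """Return failed / total queries per user as a fraction between 0 and 1."""
    total = Counter(q["user_email"] for q in queries)
    failed = Counter(q["user_email"] for q in queries if q["status"] == "FAILED")
    # Counter returns 0 for users with no failures, so no special-casing is needed
    return {user: failed[user] / total[user] for user in total}


rates = failure_rates(queries)
```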


---

## 5. Unity Catalog Security

Implement data governance using Unity Catalog's security features.

### View User Permissions

First, let's see what permissions we've defined:


```sql
-- View user permissions
SELECT * FROM training.user_permissions;
```


### Grant Privileges (Examples)

**Note:** These are example commands. In production, you would grant privileges to actual user groups.

**Grant SELECT on schema:**
```sql
GRANT SELECT ON SCHEMA training TO `data-analysts`;
```

**Grant table access:**
```sql
GRANT SELECT ON TABLE training.system_billing TO `data-analysts`;
```

**Grant catalog access (needed before objects inside the catalog can be used):**
```sql
GRANT USE CATALOG ON CATALOG <your_catalog> TO `data-analysts`;
```

### Row-Level Security (Conceptual Example)

Unity Catalog supports row filters to restrict data based on user permissions. Here's how it works:

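**Step 1: Create a filter function**
```sql
CREATE FUNCTION training.filter_by_region(row_region STRING)
RETURN
  row_region IN (
    SELECT p.region FROM training.user_permissions p
    WHERE p.user_email = current_user()
  )
  OR EXISTS (
    SELECT 1 FROM training.user_permissions p
    WHERE p.user_email = current_user() AND p.region = 'All'
  );
```

The parameter is named `row_region` so it cannot be confused with the `region` column of `training.user_permissions`. Users mapped to `'All'` pass the filter for every row.

**Step 2: Apply the filter to a table**
```sql
ALTER TABLE <your_table>
SET ROW FILTER training.filter_by_region ON (region);
```

The function is referenced by name only; the column listed in `ON (...)` is passed to it as the argument.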

This ensures users only see data for their authorized regions.
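
To make the filtering logic concrete, here is a plain-Python simulation of what a region-based row filter decides for each row. It is only a sketch: the `row_visible` helper and the sample rows are hypothetical, and treating the `'All'` value from `training.user_permissions` as a wildcard is an assumption of this example.

```python
# A subset of the synthetic training.user_permissions rows
user_permissions = [
    {"user_email": "user1@company.com", "region": "North Region"},
    {"user_email": "admin@company.com", "region": "All"},
]


def row_visible(current_user, row_region, permissions=user_permissions):
    """Return True if the user is mapped to the row's region or to 'All'."""
    return any(
        p["user_email"] == current_user and p["region"] in ("All", row_region)
        for p in permissions
    )


# Hypothetical table rows carrying a region column
rows = [
    {"device_id": "D1", "region": "North Region"},
    {"device_id": "D2", "region": "East Region"},
]


def visible_devices(user):
    """Apply the filter row by row, as the engine would for a protected table."""
    return [r["device_id"] for r in rows if row_visible(user, r["region"])]


visible_to_user1 = visible_devices("user1@company.com")
visible_to_admin = visible_devices("admin@company.com")
visible_to_unknown = visible_devices("someone.else@company.com")
```

A user with no entry in the permissions table sees no rows at all, which is the safe default for this kind of filter.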

### Column Masking (Conceptual Example)

Mask sensitive columns based on user roles:

**Step 1: Create masking function**
```sql
CREATE FUNCTION training.mask_device_id(device_id STRING)
RETURN CASE 
  WHEN is_account_group_member('admin') THEN device_id
  ELSE CONCAT('***', RIGHT(device_id, 4))
END;
```

**Step 2: Apply mask to column**
```sql
ALTER TABLE <your_table>
ALTER COLUMN device_id
SET MASK training.mask_device_id;
```

Non-admin users will only see masked device IDs (e.g., "***1234").
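
The masking rule itself is small enough to check by hand. This plain-Python sketch reproduces it so you can see the output for a few inputs. The `mask_device_id` function here is a stand-in for the SQL function, and the `is_admin` flag is a hypothetical replacement for the group-membership check.

```python
def mask_device_id(device_id: str, is_admin: bool) -> str:
    """Return the full ID for admins, otherwise '***' plus the last four characters."""
    if is_admin:
        return device_id
    # Slicing mirrors SQL RIGHT(device_id, 4): IDs shorter than four
    # characters are kept whole after the prefix.
    return "***" + device_id[-4:]


masked = mask_device_id("DEV-001234", is_admin=False)
unmasked = mask_device_id("DEV-001234", is_admin=True)
short_id = mask_device_id("12", is_admin=False)
```

Note that very short IDs are effectively not hidden by this rule, so a production mask may need a minimum-length check.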


---

## 6. Audit Logging

Track data access and changes using audit logs.

### Recent Data Access Events


```sql
-- Track data access events
SELECT 
  event_time,
  user_email,
  action_name,
  table_full_name,
  workspace_id,
  source_ip
FROM training.audit_logs
WHERE action_name IN ('SELECT', 'UPDATE', 'DELETE', 'INSERT')
  AND event_date >= CURRENT_DATE - 7
ORDER BY event_time DESC
LIMIT 50;
```


### Table Access by User

Who is accessing which tables?


```sql
-- Table access patterns by user
SELECT 
  user_email,
  table_full_name,
  COUNT(*) as access_count,
  COUNT(DISTINCT DATE(event_time)) as days_accessed,
  MAX(event_time) as last_accessed
FROM training.audit_logs
WHERE event_date >= CURRENT_DATE - 7
  AND action_name = 'SELECT'
GROUP BY user_email, table_full_name
ORDER BY access_count DESC
LIMIT 20;
```
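
The three aggregates in this query answer different questions: how often a table was read, on how many distinct days, and how recently. The plain-Python sketch below computes the same three values for a handful of hypothetical events, which makes the difference between `access_count` and `days_accessed` easy to see. The `access_summary` helper and the sample events are illustrative only.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events shaped like training.audit_logs (only the fields used here)
events = [
    {"user_email": "user1@company.com", "table_full_name": "default.training.sensor_bronze",
     "action_name": "SELECT", "event_time": datetime(2024, 5, 1, 9, 0)},
    {"user_email": "user1@company.com", "table_full_name": "default.training.sensor_bronze",
     "action_name": "SELECT", "event_time": datetime(2024, 5, 1, 17, 30)},
    {"user_email": "user1@company.com", "table_full_name": "default.training.sensor_bronze",
     "action_name": "SELECT", "event_time": datetime(2024, 5, 3, 8, 15)},
    {"user_email": "user1@company.com", "table_full_name": "default.training.sensor_bronze",
     "action_name": "UPDATE", "event_time": datetime(2024, 5, 4, 10, 0)},
    {"user_email": "user2@company.com", "table_full_name": "default.training.dim_devices",
     "action_name": "SELECT", "event_time": datetime(2024, 5, 2, 12, 0)},
]


def access_summary(events):
    """Group SELECT events by (user, table) and compute count, distinct days, last access."""
    groups = defaultdict(list)
    for e in events:
        if e["action_name"] == "SELECT":
            groups[(e["user_email"], e["table_full_name"])].append(e["event_time"])
    return {
        key: {
            "access_count": len(times),
            "days_accessed": len({t.date() for t in times}),
            "last_accessed": max(times),
        }
        for key, times in groups.items()
    }


summary = access_summary(events)
```

Here `user1@company.com` read `sensor_bronze` three times but on only two distinct days, and the `UPDATE` event is excluded because the filter keeps reads only.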


### Schema Changes

Track DDL operations (CREATE, DROP, ALTER):


```sql
-- Track schema changes (CREATE, DROP, ALTER)
SELECT 
  event_time,
  user_email,
  action_name,
  table_full_name,
  workspace_id
FROM training.audit_logs
WHERE action_name IN ('CREATE_TABLE', 'DROP_TABLE', 'ALTER_TABLE')
  AND event_date >= CURRENT_DATE - 7
ORDER BY event_time DESC;
```


---

## 7. Create Observability Dashboards

Use the queries above to build governance dashboards. Here are key visualizations to create:

### Dashboard Structure

**Page 1: Cost Overview**
- Daily Cost Trend (Line Chart) - Use the Total Cost by Day query
- Cost by Workspace (Bar Chart) - Use the Cost by Workspace query
- Cost by SKU Type (Pie Chart) - Use the Cost by SKU Type query
- Top Cost Drivers by User (Table) - Use the Cost by User query

**Page 2: Usage Monitoring**
- Most Expensive Queries (Table) - Use the Most Expensive Queries query
- Query Performance by Type (Bar Chart) - Use the Query Performance by Type query
- User Activity (Table) - Use the User Activity Analysis query

**Page 3: Security & Audit**
- Recent Access Events (Table) - Use the Recent Data Access Events query
- Table Access by User (Heatmap) - Use the Table Access by User query
- Schema Changes (Timeline) - Use the Schema Changes query

### How to Create the Dashboard

1. **Open Dashboards** from the left sidebar
2. **Create New Dashboard**
3. **Add Visualizations** for each query above
4. **Set Refresh Schedule** (e.g., every 6 hours)
5. **Share with Team** - Grant view access to stakeholders

### Example: Daily Cost Trend Query

This query works well for a line chart visualization:


```sql
-- Daily cost trend (optimized for line chart)
SELECT 
  usage_date,
  SUM(usage_quantity * list_price) as total_cost,
  SUM(CASE WHEN sku_name = 'JOBS_COMPUTE' THEN usage_quantity * list_price ELSE 0 END) as jobs_cost,
  SUM(CASE WHEN sku_name = 'SQL_COMPUTE' THEN usage_quantity * list_price ELSE 0 END) as sql_cost,
  SUM(CASE WHEN sku_name = 'ALL_PURPOSE_COMPUTE' THEN usage_quantity * list_price ELSE 0 END) as all_purpose_cost
FROM training.system_billing
GROUP BY usage_date
ORDER BY usage_date;
```
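
One detail to watch when charting this result: the query breaks out three SKUs, but the synthetic data also contains `DELTA_LIVE_TABLES`, so the three per-SKU columns will not add up to `total_cost`. The plain-Python sketch below reproduces the conditional aggregation on a few hypothetical rows to show the gap. The `daily_cost_by_sku` helper and the sample rows are illustrative only.

```python
TRACKED_SKUS = ("JOBS_COMPUTE", "SQL_COMPUTE", "ALL_PURPOSE_COMPUTE")

# Hypothetical rows shaped like training.system_billing (only the fields used here)
rows = [
    {"usage_date": "2024-05-01", "sku_name": "JOBS_COMPUTE", "usage_quantity": 2.0, "list_price": 0.5},
    {"usage_date": "2024-05-01", "sku_name": "DELTA_LIVE_TABLES", "usage_quantity": 4.0, "list_price": 0.5},
    {"usage_date": "2024-05-02", "sku_name": "SQL_COMPUTE", "usage_quantity": 1.0, "list_price": 1.0},
]


def daily_cost_by_sku(rows, skus=TRACKED_SKUS):
    """Per day: total cost plus one column per tracked SKU (like SUM(CASE WHEN ...))."""
    trend = {}
    for r in rows:
        cost = r["usage_quantity"] * r["list_price"]
        day = trend.setdefault(
            r["usage_date"], {"total_cost": 0.0, **{s: 0.0 for s in skus}}
        )
        day["total_cost"] += cost
        if r["sku_name"] in skus:
            day[r["sku_name"]] += cost
    return dict(sorted(trend.items()))


trend = daily_cost_by_sku(rows)
first_day = trend["2024-05-01"]
# Cost that belongs to SKUs without their own column
untracked = first_day["total_cost"] - sum(first_day[s] for s in TRACKED_SKUS)
```

If the chart should account for every dollar, one option is to add a fourth series for the remaining SKUs, computed as the difference shown in `untracked`.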


---

## Summary

Congratulations! You've learned how to monitor and govern your Databricks workspace using system tables and Unity Catalog.

### What You've Accomplished

✅ **Created synthetic system tables** - Billing, query history, audit logs  
✅ **Analyzed costs** - By workspace, user, SKU type, and time  
✅ **Monitored usage** - Query patterns, performance, and user activity  
✅ **Implemented security** - Row filters, column masking, access control  
✅ **Built audit trails** - Track data access and schema changes  
✅ **Designed dashboards** - Visual cost tracking and governance  

### Key Takeaways

1. **Monitor costs regularly** - Set up daily/weekly reviews
2. **Implement cost allocation** - Tag and track by team/project
3. **Use row-level security** - Protect sensitive data automatically
4. **Audit access** - Track who accesses what and when
5. **Create dashboards** - Visualize key metrics for stakeholders

### Applying This to the IoT Project

For your IoT manufacturing project, you should:

**Cost Monitoring:**
- Track costs of your daily sensor data pipelines
- Monitor dashboard query costs from Genie spaces
- Allocate costs to different teams (ops, analytics, ML)

**Usage Governance:**
- Ensure analysts only access authorized factory data
- Mask device IDs for non-admin users
- Monitor which ML models are consuming resources

**Security & Compliance:**
- Restrict regional data access based on user location
- Audit access to sensitive inspection data
- Track schema changes to production tables

### Cost Optimization Tips

- **Use job clusters** instead of all-purpose for scheduled workloads
- **Enable auto-termination** on interactive clusters (15-30 minutes)
- **Right-size clusters** - Don't over-provision compute
- **Use spot instances** where possible for non-critical jobs
- **Schedule non-urgent jobs** for off-peak hours
- **Archive old data** to cheaper storage tiers
- **Use Photon** for SQL queries (often noticeably faster, though the gain varies by workload)
- **Enable Predictive Optimization** for automatic table maintenance

### Next Steps

1. **Create a governance dashboard** using the queries above
2. **Set up alerts** for cost thresholds and unusual access patterns
3. **Implement row-level security** on your IoT tables
4. **Review audit logs** weekly to track usage patterns
5. **Share cost reports** with leadership to demonstrate value
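
As a starting point for the cost-threshold alert in step 2, the sketch below flags days whose cost is well above the typical day. It is a minimal example in plain Python, assuming daily totals have already been computed. The `flag_cost_spikes` helper, the factor of 2, and the sample figures are hypothetical choices, not a recommended policy.

```python
from statistics import median


def flag_cost_spikes(daily_costs, factor=2.0):
    """Return the days whose cost exceeds factor x the median daily cost."""
    if not daily_costs:
        return []
    baseline = median(daily_costs.values())
    return [day for day, cost in daily_costs.items() if cost > factor * baseline]


# Made-up daily totals for illustration
daily_costs = {
    "2024-05-01": 100.0,
    "2024-05-02": 110.0,
    "2024-05-03": 95.0,
    "2024-05-04": 105.0,
    "2024-05-05": 400.0,
}

spikes = flag_cost_spikes(daily_costs)
```

The median is used as the baseline because a single expensive day barely moves it, whereas a mean-based threshold is pulled upward by the very spike it is meant to detect.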

---

## Try This Out

**Challenge:** Create a comprehensive governance dashboard for the IoT project with:
1. Cost by factory (join billing data with job metadata)
2. Most queried IoT tables (from query_history)
3. User access patterns (from audit_logs)
4. Failed query analysis (for troubleshooting)

**Bonus:** Set up row-level security so analysts only see data from their assigned factories.

---

**Additional Resources:**
- [System Tables Guide](https://docs.databricks.com/aws/en/admin/system-tables/)
- [Unity Catalog Security](https://docs.databricks.com/aws/en/data-governance/unity-catalog/access-control)
- [Observability Examples](https://github.com/CodyAustinDavis/dbsql_sme/tree/main/Observability%20Dashboards%20and%20DBA%20Resources)
- [Cost Management](https://docs.databricks.com/aws/en/admin/account-settings/usage-detail-tags-aws)
- [Row Filters and Column Masks](https://docs.databricks.com/aws/en/data-governance/unity-catalog/row-and-column-filters)

---

**🎉 You've completed Day 3!** You now have the skills to build end-to-end data and ML pipelines, monitor costs, and govern your Databricks workspace effectively.
