# Databricks Advanced Security and Governance
## Module: RBAC, Data Lineage, and Enterprise Security

---

## Table of Contents
1. [Implementing Role-Based Access Control (RBAC)](#rbac)
2. [Enabling Data Lineage and Auditing with Unity Catalog](#lineage)
3. [Advanced Encryption and Network Security](#security)

---

## 1. Implementing Role-Based Access Control (RBAC) {#rbac}

### 1.1 Introduction to RBAC in Databricks

Role-Based Access Control (RBAC) is a security model that controls access to resources based on user roles and permissions. In Databricks Unity Catalog, RBAC provides fine-grained access control at multiple levels.

#### Key Concepts:
- **Principals**: Users, service principals, and groups
- **Securable Objects**: Catalogs, schemas, tables, volumes, functions
- **Privileges**: Specific permissions (SELECT, CREATE, MODIFY, etc.)
- **Roles**: Unity Catalog has no separate role object; groups play that part, so a "role" here means a group together with the privileges granted to it

### 1.2 Unity Catalog RBAC Hierarchy

```
Account
└── Metastore
    └── Catalog
        └── Schema
            ├── Table/View
            ├── Function
            └── Volume
```

#### Privilege Inheritance:
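- Privileges granted on a parent object are inherited by all of its current and future children
- Privileges are additive: a principal's effective access is the union of its direct grants, inherited grants, and the grants made to its groups
- Unity Catalog has no `DENY` statement; to restrict access, revoke the grant or avoid granting at a broader level
- Reading a table requires `USE CATALOG` and `USE SCHEMA` on its parents in addition to `SELECT`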
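
The inheritance rule can be illustrated with a toy model. The sketch below is plain Python, not a Databricks API: the `GRANTS` dictionary, the helper names, and the table names are hypothetical, and it models only how a grant on a parent applies along the catalog, schema, table path.

```python
# Toy model of Unity Catalog privilege inheritance (not a Databricks API).
# A grant on a parent securable applies to every object beneath it.

# Hypothetical grants: (principal, securable full name) -> set of privileges
GRANTS = {
    ("data_analysts", "production"): {"USE CATALOG"},
    ("data_analysts", "production.sales"): {"USE SCHEMA", "SELECT"},
    ("data_engineers", "production"): {"USE CATALOG", "USE SCHEMA", "SELECT", "MODIFY"},
}


def ancestors(full_name: str) -> list[str]:
    """Return the object and its parents, e.g. a.b.c -> [a, a.b, a.b.c]."""
    parts = full_name.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


def effective_privileges(principal: str, full_name: str) -> set[str]:
    """Union of the grants on the object and on all of its ancestors."""
    result: set[str] = set()
    for name in ancestors(full_name):
        result |= GRANTS.get((principal, name), set())
    return result


def can_select(principal: str, table: str) -> bool:
    """Reading a table needs SELECT plus USE CATALOG and USE SCHEMA."""
    needed = {"USE CATALOG", "USE SCHEMA", "SELECT"}
    return needed <= effective_privileges(principal, table)
```

In this model `data_analysts` can read any table under `production.sales` because the schema-level `SELECT` is inherited, but not tables in other schemas of the same catalog.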

### 1.3 Setting Up RBAC - Step by Step

#### Step 1: Create Groups and Users

```sql
-- Create groups for different roles
CREATE GROUP IF NOT EXISTS data_engineers;
CREATE GROUP IF NOT EXISTS data_analysts;
CREATE GROUP IF NOT EXISTS data_scientists;
CREATE GROUP IF NOT EXISTS business_users;

-- Add users to groups (Admin Console or SQL)
-- Note: User management typically done through Admin Console
```

#### Step 2: Create Service Principals for Applications

```sql
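-- Databricks SQL has no CREATE SERVICE PRINCIPAL statement.
-- Create service principals in the account console, or with the SCIM API,
-- the Databricks CLI, or Terraform, for example:
--   etl-service-principal        (ETL processes)
--   reporting-service-principal  (reporting tools)
-- In GRANT statements, refer to a service principal by its application ID,
-- quoted in backticks.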
```

#### Step 3: Create Catalogs with Appropriate Ownership

```sql
-- Create catalogs for different environments
CREATE CATALOG IF NOT EXISTS development
COMMENT 'Development environment catalog';

CREATE CATALOG IF NOT EXISTS staging
COMMENT 'Staging environment catalog';

CREATE CATALOG IF NOT EXISTS production
COMMENT 'Production environment catalog';

-- Set catalog owners
ALTER CATALOG development OWNER TO data_engineers;
ALTER CATALOG production OWNER TO `admin@company.com`;
```

#### Step 4: Implement Schema-Level Access Control

```sql
-- Create schemas with specific purposes
CREATE SCHEMA IF NOT EXISTS production.sales
COMMENT 'Sales data and analytics';

CREATE SCHEMA IF NOT EXISTS production.finance
COMMENT 'Financial data - restricted access';

CREATE SCHEMA IF NOT EXISTS production.marketing
COMMENT 'Marketing data and campaigns';

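-- Grant schema-level permissions
-- SELECT only takes effect together with USE CATALOG and USE SCHEMA
GRANT USE CATALOG ON CATALOG production TO data_analysts;
GRANT USE CATALOG ON CATALOG production TO business_users;
GRANT USE SCHEMA ON SCHEMA production.sales TO data_analysts;
GRANT USE SCHEMA, SELECT ON SCHEMA production.sales TO business_users;

-- Restrict access to sensitive financial data
GRANT USE CATALOG ON CATALOG production TO data_engineers;
GRANT USE SCHEMA ON SCHEMA production.finance TO data_engineers;
GRANT USE CATALOG ON CATALOG production TO `finance-team`;
GRANT USE SCHEMA, SELECT ON SCHEMA production.finance TO `finance-team`;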
```
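
When the number of schemas and groups grows, grants like these are often generated from a declarative mapping rather than written by hand. The following is a minimal sketch under that assumption: the `SCHEMA_GRANTS` structure and both helper functions are hypothetical, and the script only builds SQL strings for review; it does not execute them.

```python
# Hypothetical role-to-privilege map. Group and schema names follow the
# examples above; the structure itself is an assumption of this sketch.
SCHEMA_GRANTS = {
    "production.sales": {
        "data_analysts": ["USE SCHEMA", "SELECT"],
        "business_users": ["USE SCHEMA", "SELECT"],
    },
    "production.finance": {
        "finance-team": ["USE SCHEMA", "SELECT"],
    },
}


def quote_principal(name: str) -> str:
    """Backtick-quote a principal; needed for names containing '-' or '@'."""
    return f"`{name}`"


def build_grants(schema_grants: dict) -> list[str]:
    """Return USE CATALOG plus schema-level GRANT statements per principal."""
    statements = []
    for schema, principals in sorted(schema_grants.items()):
        catalog = schema.split(".")[0]
        for principal, privileges in sorted(principals.items()):
            who = quote_principal(principal)
            statements.append(f"GRANT USE CATALOG ON CATALOG {catalog} TO {who};")
            statements.append(
                f"GRANT {', '.join(privileges)} ON SCHEMA {schema} TO {who};"
            )
    return statements
```

Calling `build_grants(SCHEMA_GRANTS)` yields a sorted list of statements that can be reviewed or diffed in version control before an administrator runs them.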

### 1.4 Table and Column-Level Security

#### Row-Level Security with Views

```sql
-- Create base table with sensitive data
CREATE TABLE production.sales.customer_transactions (
    transaction_id STRING,
    customer_id STRING,
    amount DECIMAL(10,2),
    transaction_date DATE,
    region STRING,
    customer_ssn STRING,  -- Sensitive data
    customer_email STRING
);

-- Create view with row-level filtering
CREATE VIEW production.sales.regional_transactions AS
SELECT 
    transaction_id,
    customer_id,
    amount,
    transaction_date,
    region,
    -- Mask sensitive data based on user groups
    CASE 
        WHEN is_member('finance-team') THEN customer_ssn
        ELSE 'XXX-XX-XXXX'
    END as customer_ssn,
    customer_email
FROM production.sales.customer_transactions
WHERE 
    -- Row-level filtering based on user's region
    CASE 
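        -- current_user_region() is not a built-in; it stands for a user-defined
        -- function that maps the current user to a region
        WHEN is_member('regional-managers') THEN region = current_user_region()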
        WHEN is_member('data_engineers') THEN 1=1  -- Full access
        ELSE region IN ('North', 'South')  -- Limited regions for others
    END;

-- Grant access to view instead of base table
GRANT SELECT ON VIEW production.sales.regional_transactions TO data_analysts;
REVOKE SELECT ON TABLE production.sales.customer_transactions FROM data_analysts;
```

#### Column-Level Masking with Dynamic Views

```sql
-- Create dynamic masking view
CREATE VIEW production.sales.masked_customer_data AS
SELECT 
    customer_id,
    -- Dynamic masking based on user roles
    CASE 
        WHEN is_member('pii-readers') THEN customer_email
        ELSE regexp_replace(customer_email, '(.{2}).*(@.*)', '$1***$2')
    END as customer_email,
    
    CASE 
        WHEN is_member('finance-team') THEN amount
        ELSE NULL
    END as amount,
    
    transaction_date,
    region
FROM production.sales.customer_transactions;
```
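
To see what the email-masking expression produces, the same pattern can be tried outside Databricks. This sketch uses Python's `re` module as a stand-in for `regexp_replace` (the replacement syntax is `\1` rather than `$1`); the `mask_email` helper is a hypothetical name used only for illustration.

```python
import re


def mask_email(email: str) -> str:
    """Keep the first two characters and the domain, mask everything between."""
    return re.sub(r"(.{2}).*(@.*)", r"\1***\2", email)


if __name__ == "__main__":
    for sample in ["john.doe@email.com", "ab@x.com", "a@x.com"]:
        print(sample, "->", mask_email(sample))
```

One edge case is worth noting: an address whose local part is a single character (such as `a@x.com`) does not match the pattern at all, so it is returned unmasked. A production masking rule would need to handle that case explicitly.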

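#### Restricting Access to Specific Columns

Unity Catalog does not support column-level `GRANT` statements. To expose only some columns, grant access to a view that selects them, or attach a column mask to the table.

```sql
-- Expose only selected columns through a view
-- (employee_public is an illustrative view name)
CREATE VIEW sales_db.employee_public AS
SELECT id, name, department
FROM sales_db.employee;

-- Allow analysts to query the view instead of the base table
GRANT SELECT ON VIEW sales_db.employee_public TO `analyst_group`;
```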

### 1.5 Advanced RBAC Patterns

#### Pattern 1: Environment-Based Access Control

```sql
-- Development environment - Open access for data engineers
GRANT ALL PRIVILEGES ON CATALOG development TO data_engineers;
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG development TO data_scientists;

-- Staging environment - Read-only for testing
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG staging TO data_analysts;
GRANT CREATE SCHEMA, CREATE TABLE ON CATALOG staging TO data_engineers;

-- Production environment - Strict controls
GRANT USE CATALOG ON CATALOG production TO data_analysts;
-- Table-level grants only after approval process
```

#### Pattern 2: Time-Based Access Control

```sql
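-- Create function to check business hours (Monday to Friday, 08:00 to 18:59)
-- Not declared DETERMINISTIC: the result depends on the current time.
-- current_timestamp() is evaluated in the session time zone.
CREATE FUNCTION production.security.is_business_hours()
RETURNS BOOLEAN
LANGUAGE SQL
RETURN hour(current_timestamp()) BETWEEN 8 AND 18 
       AND dayofweek(current_timestamp()) BETWEEN 2 AND 6;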

-- Create view with time-based restrictions
CREATE VIEW production.finance.business_hours_data AS
SELECT *
FROM production.finance.sensitive_financial_data
WHERE production.security.is_business_hours() = true
   OR is_member('24x7-access-group');
```
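
The boundaries of the business-hours predicate are easy to get wrong, so it helps to check them against concrete timestamps. The sketch below mirrors the SQL logic in plain Python; the helper names are hypothetical, and the only Spark behavior it relies on is that `dayofweek()` returns 1 for Sunday through 7 for Saturday.

```python
from datetime import datetime


def spark_dayofweek(ts: datetime) -> int:
    """Map a Python datetime to Spark's dayofweek(): 1 = Sunday ... 7 = Saturday."""
    return ts.isoweekday() % 7 + 1


def is_business_hours(ts: datetime) -> bool:
    """Mirror of the SQL predicate: hour 8..18 inclusive, Monday..Friday."""
    return 8 <= ts.hour <= 18 and 2 <= spark_dayofweek(ts) <= 6
```

Because `BETWEEN 8 AND 18` compares the hour component only, access stays open until 18:59. If the window should close at 18:00 sharp, the upper bound would be 17.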

#### Pattern 3: Data Classification-Based Access

```sql
-- Tag tables with data classification
ALTER TABLE production.sales.customer_data 
SET TAGS ('classification' = 'PII', 'sensitivity' = 'HIGH');

ALTER TABLE production.marketing.campaign_data 
SET TAGS ('classification' = 'INTERNAL', 'sensitivity' = 'MEDIUM');

-- Create policy-based views
CREATE VIEW production.sales.classified_customer_data AS
SELECT 
    customer_id,
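    -- Apply masking based on data classification and user clearance.
    -- There is no built-in function for reading tags; they are queried
    -- from system.information_schema.table_tags.
    CASE 
        WHEN (SELECT max(tag_value)
              FROM system.information_schema.table_tags
              WHERE catalog_name = 'production'
                AND schema_name = 'sales'
                AND table_name = 'customer_data'
                AND tag_name = 'sensitivity') = 'HIGH'
             AND NOT is_member('high-clearance-users')
        THEN 'REDACTED'
        ELSE customer_name
    END as customer_name,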
    purchase_amount,
    purchase_date
FROM production.sales.customer_data;
```

### 1.6 Monitoring and Auditing RBAC

#### Check Current Permissions

```sql
-- Show grants for a specific user
SHOW GRANTS `user@company.com` ON CATALOG production;

-- Show grants for a group
SHOW GRANTS `data_analysts` ON SCHEMA production.sales;

-- Show all grants on a table
SHOW GRANTS ON TABLE production.sales.customer_transactions;
```

#### RBAC Monitoring Queries

```sql
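-- Monitor who has access to sensitive data
-- (table-level view; catalog_privileges and schema_privileges hold
--  grants made at higher levels)
SELECT 
    grantee,
    privilege_type,
    table_catalog,
    table_schema,
    table_name,
    inherited_from
FROM system.information_schema.table_privileges
WHERE table_name LIKE '%customer%'
  AND privilege_type IN ('SELECT', 'MODIFY', 'ALL PRIVILEGES', 'ALL_PRIVILEGES');

-- Track recent permission changes
-- Unity Catalog records GRANT and REVOKE as updatePermissions events
SELECT 
    user_identity.email AS changed_by,
    action_name,
    request_params,
    event_time
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND action_name = 'updatePermissions'
ORDER BY event_time DESC;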
```

### 1.7 RBAC Best Practices

#### 1. Principle of Least Privilege
```sql
-- Grant minimum necessary permissions
GRANT USE CATALOG ON CATALOG production TO data_analysts;
GRANT USE SCHEMA ON SCHEMA production.sales TO data_analysts;
GRANT SELECT ON TABLE production.sales.summary_data TO data_analysts;
-- Don't grant broader permissions unless absolutely necessary
```

#### 2. Use Groups Instead of Individual User Grants
```sql
-- Good: Group-based permissions
GRANT SELECT ON SCHEMA production.sales TO data_analysts;
ALTER GROUP data_analysts ADD USER `new.analyst@company.com`;

-- Avoid: Individual user permissions
-- GRANT SELECT ON SCHEMA production.sales TO `user1@company.com`;
-- GRANT SELECT ON SCHEMA production.sales TO `user2@company.com`;
```

#### 3. Regular Access Reviews
```sql
-- Create view for access review
CREATE VIEW governance.access_review AS
SELECT 
    grantee,
    privilege_type,
    table_catalog,
    table_schema,
    table_name,
    grantor,
    inherited_from
FROM system.information_schema.table_privileges
WHERE table_catalog = 'production';
```
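
An access review is easier when it focuses on what changed since the last review. As a rough sketch, two exported snapshots of grants can be compared in plain Python; the `Grant` tuple and `diff_grants` helper are hypothetical, and the snapshots are assumed to have been exported from a grants query beforehand.

```python
from typing import Iterable, NamedTuple


class Grant(NamedTuple):
    """One row of an exported grants snapshot (hypothetical shape)."""
    principal: str
    privilege: str
    securable: str


def diff_grants(previous: Iterable[Grant], current: Iterable[Grant]):
    """Return (added, removed) grants between two snapshots, each sorted."""
    prev, curr = set(previous), set(current)
    added = sorted(curr - prev)
    removed = sorted(prev - curr)
    return added, removed


if __name__ == "__main__":
    last_month = [Grant("data_analysts", "SELECT", "production.sales")]
    this_month = [
        Grant("data_analysts", "SELECT", "production.sales"),
        Grant("business_users", "SELECT", "production.finance"),
    ]
    added, removed = diff_grants(last_month, this_month)
    print("added:", added)
    print("removed:", removed)
```

Reviewers can then concentrate on the `added` list, which in this example contains the new `SELECT` grant on `production.finance`.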

---

## 2. Enabling Data Lineage and Auditing with Unity Catalog {#lineage}

### 2.1 Introduction to Data Lineage

Data lineage tracks the flow of data from source to destination, including all transformations and dependencies. Unity Catalog provides automatic lineage capture and visualization.

#### Benefits of Data Lineage:
- **Impact Analysis**: Understand downstream effects of changes
- **Root Cause Analysis**: Trace data quality issues to source
- **Compliance**: Meet regulatory requirements for data governance
- **Documentation**: Automatic documentation of data flows

### 2.2 Unity Catalog Lineage Architecture

```
Data Sources → Unity Catalog → Lineage Graph → Governance Tools
     ↓              ↓              ↓              ↓
   Tables      Automatic       Visual        Compliance
   Views       Capture        Display        Reporting
   Functions   Metadata       Analysis       Auditing
```

### 2.3 Automatic Lineage Capture

Unity Catalog automatically captures lineage for:

#### SQL Operations
```sql
-- Create source table
CREATE TABLE production.raw.customer_data (
    customer_id STRING,
    first_name STRING,
    last_name STRING,
    email STRING,
    registration_date DATE
);

-- Insert sample data
INSERT INTO production.raw.customer_data VALUES
('C001', 'John', 'Doe', 'john.doe@email.com', DATE '2023-01-15'),
('C002', 'Jane', 'Smith', 'jane.smith@email.com', DATE '2023-02-20');

-- Create derived table with lineage
CREATE TABLE production.clean.customer_profiles AS
SELECT 
    customer_id,
    CONCAT(first_name, ' ', last_name) as full_name,
    LOWER(email) as normalized_email,
    registration_date,
    YEAR(registration_date) as registration_year
FROM production.raw.customer_data
WHERE email IS NOT NULL;

-- Lineage is automatically captured:
-- production.raw.customer_data → production.clean.customer_profiles
```

#### PySpark Operations
```python
from pyspark.sql.functions import col, concat, lit, lower, year

# Read source data
source_df = spark.table("production.raw.customer_data")

# Apply transformations
cleaned_df = (source_df
    .filter(col("email").isNotNull())
    .withColumn("full_name", concat(col("first_name"), lit(" "), col("last_name")))
    .withColumn("normalized_email", lower(col("email")))
    .withColumn("registration_year", year(col("registration_date")))
    .select("customer_id", "full_name", "normalized_email", "registration_date", "registration_year")
)

# Write to destination (lineage captured automatically)
cleaned_df.write.mode("overwrite").saveAsTable("production.clean.customer_profiles")
```

### 2.4 Advanced Lineage Scenarios

#### Complex ETL Pipeline Lineage
```sql
-- Step 1: Raw data ingestion
CREATE TABLE production.bronze.sales_events (
    event_id STRING,
    customer_id STRING,
    product_id STRING,
    event_timestamp TIMESTAMP,
    event_type STRING,
    amount DECIMAL(10,2)
);

-- Step 2: Data cleaning and validation
CREATE TABLE production.silver.validated_sales AS
SELECT 
    event_id,
    customer_id,
    product_id,
    event_timestamp,
    event_type,
    amount,
    -- Add data quality flags
    CASE 
        WHEN amount < 0 THEN 'INVALID_AMOUNT'
        WHEN customer_id IS NULL THEN 'MISSING_CUSTOMER'
        ELSE 'VALID'
    END as data_quality_flag
FROM production.bronze.sales_events
WHERE event_timestamp >= '2023-01-01';

-- Step 3: Business logic aggregation
CREATE TABLE production.gold.customer_sales_summary AS
SELECT 
    customer_id,
    COUNT(*) as total_transactions,
    SUM(amount) as total_spent,
    AVG(amount) as avg_transaction_amount,
    MAX(event_timestamp) as last_purchase_date,
    MIN(event_timestamp) as first_purchase_date
FROM production.silver.validated_sales
WHERE data_quality_flag = 'VALID'
GROUP BY customer_id;

-- Lineage Chain:
-- bronze.sales_events → silver.validated_sales → gold.customer_sales_summary
```

#### Cross-Schema Lineage with Joins
```sql
-- Customer dimension table
CREATE TABLE production.dimensions.customers AS
SELECT 
    customer_id,
    customer_name,
    customer_tier,
    registration_date
FROM production.raw.customer_master;

-- Product dimension table  
CREATE TABLE production.dimensions.products AS
SELECT 
    product_id,
    product_name,
    category,
    unit_price
FROM production.raw.product_catalog;

-- Fact table with multiple lineage sources
CREATE TABLE production.facts.enriched_sales AS
SELECT 
    s.event_id,
    s.customer_id,
    c.customer_name,
    c.customer_tier,
    s.product_id,
    p.product_name,
    p.category,
    s.amount,
    s.event_timestamp
FROM production.silver.validated_sales s
JOIN production.dimensions.customers c ON s.customer_id = c.customer_id
JOIN production.dimensions.products p ON s.product_id = p.product_id;

-- Multi-source lineage:
-- validated_sales + customers + products → enriched_sales
```

### 2.5 Lineage Visualization and Analysis

#### Accessing Lineage Information
```sql
-- Query table-level lineage from system tables
-- (column names are only available in column_lineage)
SELECT 
    source_table_full_name,
    target_table_full_name,
    entity_type,
    event_time
FROM system.access.table_lineage
WHERE target_table_full_name = 'production.gold.customer_sales_summary';

-- View column-level lineage
SELECT 
    source_table_full_name,
    source_column_name,
    target_table_full_name,
    target_column_name
FROM system.access.column_lineage
WHERE target_table_full_name LIKE 'production.gold.%';
```

#### Creating Lineage Reports
```sql
-- Create lineage impact analysis view
-- Requires a runtime or SQL warehouse version that supports recursive CTEs
CREATE VIEW governance.lineage_impact_analysis AS
WITH RECURSIVE lineage_tree AS (
    -- Base case: direct dependencies
    SELECT 
        source_table_full_name as root_table,
        target_table_full_name as dependent_table,
        1 as level
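    FROM system.access.table_lineage
    -- Lineage rows can describe reads or writes with only one side populated
    WHERE source_table_full_name IS NOT NULL
      AND target_table_full_name IS NOT NULL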
    
    UNION ALL
    
    -- Recursive case: indirect dependencies
    SELECT 
        lt.root_table,
        tl.target_table_full_name,
        lt.level + 1
    FROM lineage_tree lt
    JOIN system.access.table_lineage tl 
        ON lt.dependent_table = tl.source_table_full_name
    WHERE lt.level < 5  -- Prevent infinite recursion
)
SELECT 
    root_table,
    dependent_table,
    level,
    'DOWNSTREAM' as relationship_type
FROM lineage_tree;
```
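
The same impact analysis can be expressed as a breadth-first traversal once the lineage edges have been exported. This is a minimal sketch, assuming the edges are available as `(source, target)` pairs; the `downstream` function and the sample edge list are hypothetical.

```python
from collections import defaultdict, deque


def downstream(edges, root, max_depth=5):
    """Return {table: depth} for every table reachable from root.

    edges is an iterable of (source_table, target_table) pairs.
    Visited tables are tracked, so cycles do not cause repeated work.
    """
    children = defaultdict(list)
    for source, target in edges:
        children[source].append(target)

    depths = {}
    queue = deque([(root, 0)])
    while queue:
        table, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for child in children[table]:
            if child != root and child not in depths:
                depths[child] = depth + 1
                queue.append((child, depth + 1))
    return depths


# Sample edges taken from the pipeline examples in this module
EDGES = [
    ("production.bronze.sales_events", "production.silver.validated_sales"),
    ("production.silver.validated_sales", "production.gold.customer_sales_summary"),
    ("production.silver.validated_sales", "production.facts.enriched_sales"),
    ("production.dimensions.customers", "production.facts.enriched_sales"),
]
```

Unlike a depth-capped recursive query, the traversal records each dependent table once at its shortest distance from the root, which keeps the report free of duplicates.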

### 2.6 Data Auditing with Unity Catalog

#### Audit Log Overview
Unity Catalog captures detailed audit logs for all data access and modifications:

```sql
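-- View recent Unity Catalog audit events
-- action_name holds API-level event names (for example getTable),
-- not SQL verbs such as SELECT or INSERT
SELECT 
    user_identity.email AS user_email,
    service_name,
    action_name,
    request_params,
    response.status_code AS status_code,
    event_time,
    source_ip_address
FROM system.access.audit
WHERE event_time >= current_timestamp() - INTERVAL 1 DAY
  AND service_name = 'unityCatalog'
  AND action_name IN ('getTable', 'generateTemporaryTableCredential')
ORDER BY event_time DESC;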
```

#### Detailed Audit Queries

```sql
-- Monitor sensitive table access
-- request_params is a MAP<STRING, STRING>, so keys are read with []
SELECT 
    user_identity.email AS user_email,
    action_name,
    request_params['full_name_arg'] AS table_name,
    event_time,
    source_ip_address
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND action_name = 'getTable'
  AND request_params['full_name_arg'] LIKE '%customer%'
  AND event_time >= current_date() - INTERVAL 7 DAY;

-- Track table changes recorded by Unity Catalog
-- (INSERT/UPDATE/DELETE/MERGE are not audit action names, and the
--  request_params keys differ per action; run SELECT DISTINCT action_name
--  to see what your workspace records)
SELECT 
    user_identity.email AS user_email,
    action_name,
    request_params['full_name_arg'] AS affected_table,
    response.status_code AS status_code,
    event_time
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND action_name IN ('createTable', 'deleteTable')
  AND event_time >= current_date() - INTERVAL 1 DAY
ORDER BY event_time DESC;

-- Monitor permission changes
SELECT 
    user_identity.email AS user_email,
    action_name,
    request_params,
    event_time
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND action_name = 'updatePermissions'
ORDER BY event_time DESC;
```

### 2.7 Advanced Auditing Scenarios

#### Compliance Reporting
```sql
-- Create GDPR compliance audit view
CREATE VIEW governance.gdpr_audit_trail AS
SELECT 
    user_identity.email AS user_email,
    action_name,
    request_params['full_name_arg'] AS data_asset,
    event_time,
    source_ip_address,
    session_id,
    -- Identify PII access
    CASE 
        WHEN request_params['full_name_arg'] LIKE '%customer%' 
          OR request_params['full_name_arg'] LIKE '%pii%'
        THEN 'PII_ACCESS'
        ELSE 'REGULAR_ACCESS'
    END as access_type
FROM system.access.audit
-- action_name holds API-level event names, not SQL verbs
WHERE service_name = 'unityCatalog'
  AND action_name = 'getTable'
  AND (
    request_params['full_name_arg'] LIKE '%customer%'
    OR request_params['full_name_arg'] LIKE '%personal%'
    OR request_params['full_name_arg'] LIKE '%pii%'
  );
```

#### Anomaly Detection in Data Access
```sql
-- Detect unusual access patterns
CREATE VIEW governance.access_anomalies AS
WITH user_access_stats AS (
    SELECT 
        user_identity,
        COUNT(*) as daily_access_count,
        COUNT(DISTINCT request_params['full_name_arg']) as unique_tables_accessed,
        MIN(event_time) as first_access,
        MAX(event_time) as last_access,
        COUNT(DISTINCT source_ip_address) as unique_ip_addresses
    FROM system.access.audit
    WHERE event_time >= current_date()
      AND service_name = 'unityCatalog'
      AND action_name = 'getTable'  -- audit actions are API event names, not SQL verbs
    GROUP BY user_identity
),
access_thresholds AS (
    SELECT 
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY daily_access_count) as access_threshold,
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY unique_tables_accessed) as table_threshold
    FROM user_access_stats
)
SELECT 
    uas.user_identity,
    uas.daily_access_count,
    uas.unique_tables_accessed,
    uas.unique_ip_addresses,
    uas.first_access,
    uas.last_access,
    CASE 
        WHEN uas.daily_access_count > at.access_threshold THEN 'HIGH_VOLUME_ACCESS'
        WHEN uas.unique_tables_accessed > at.table_threshold THEN 'BROAD_DATA_ACCESS'
        WHEN uas.unique_ip_addresses > 3 THEN 'MULTIPLE_LOCATIONS'
        ELSE 'NORMAL'
    END as anomaly_type
FROM user_access_stats uas
CROSS JOIN access_thresholds at
WHERE uas.daily_access_count > at.access_threshold
   OR uas.unique_tables_accessed > at.table_threshold
   OR uas.unique_ip_addresses > 3;
```
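
The thresholding logic is simple enough to prototype offline before turning it into a view. The sketch below reimplements a continuous percentile with linear interpolation, which is how `PERCENTILE_CONT` is commonly defined, and flags users above it. Both function names and the sample counts are hypothetical.

```python
import math


def percentile_cont(values, fraction):
    """Continuous percentile using linear interpolation between neighbors."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


def flag_anomalies(daily_counts, fraction=0.95):
    """Return the users whose daily access count exceeds the percentile threshold."""
    threshold = percentile_cont(daily_counts.values(), fraction)
    return {user: count for user, count in daily_counts.items() if count > threshold}


if __name__ == "__main__":
    counts = {"alice": 10, "bob": 12, "carol": 11, "dave": 500}
    print(flag_anomalies(counts))
```

Because the threshold is computed from the same day's data, the busiest user will typically be flagged even on a quiet day. Comparing against a threshold derived from a longer historical window is one way to reduce such false positives.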

### 2.8 Lineage and Audit Best Practices

#### 1. Comprehensive Metadata Tagging
```sql
-- Tag tables with business context
ALTER TABLE production.sales.customer_data 
SET TAGS (
    'business_owner' = 'sales_team',
    'data_steward' = 'data_governance_team',
    'compliance_requirement' = 'GDPR',
    'refresh_frequency' = 'daily',
    'data_quality_tier' = 'gold'
);

-- Tag sensitive columns
ALTER TABLE production.sales.customer_data 
ALTER COLUMN customer_email 
SET TAGS ('pii' = 'true', 'encryption_required' = 'true');
```

#### 2. Automated Lineage Validation
```sql
-- Create function to validate expected lineage
CREATE FUNCTION governance.validate_lineage(
    source_table STRING,
    target_table STRING
) RETURNS BOOLEAN
LANGUAGE SQL
RETURN EXISTS (
    SELECT 1 
    FROM system.access.table_lineage 
    WHERE source_table_full_name = source_table 
      AND target_table_full_name = target_table
);

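-- Use in data quality checks
-- Note: table_lineage stores direct edges only, so this check passes only
-- when the target reads directly from the source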
SELECT 
    'lineage_validation' as check_type,
    CASE 
        WHEN governance.validate_lineage(
            'production.raw.customer_data', 
            'production.gold.customer_summary'
        ) THEN 'PASS'
        ELSE 'FAIL'
    END as result;
```

#### 3. Regular Audit Reviews
```sql
-- Create scheduled audit summary
CREATE VIEW governance.daily_audit_summary AS
SELECT 
    DATE(event_time) as audit_date,
    action_name,
    COUNT(*) as action_count,
    COUNT(DISTINCT user_identity) as unique_users,
    COUNT(DISTINCT request_params['full_name_arg']) as unique_resources
FROM system.access.audit
WHERE event_time >= current_date() - INTERVAL 30 DAY
GROUP BY DATE(event_time), action_name
ORDER BY audit_date DESC, action_count DESC;
```

---

## 3. Advanced Encryption and Network Security {#security}

### 3.1 Introduction to Databricks Security Architecture

Databricks implements a comprehensive security model with multiple layers of protection:

```
┌─────────────────────────────────────────────────────────────┐
│                    Application Layer                        │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐│
│  │   RBAC &    │ │   Unity     │ │    Audit & Lineage     ││
│  │ Fine-grained│ │  Catalog    │ │      Tracking          ││
│  │ Permissions │ │ Governance  │ │                        ││
│  └─────────────┘ └─────────────┘ └─────────────────────────┘│
├─────────────────────────────────────────────────────────────┤
│                     Data Layer                              │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐│
│  │ Encryption  │ │   Secret    │ │   Data Classification  ││
│  │ at Rest &   │ │ Management  │ │   & Data Discovery     ││
│  │ in Transit  │ │             │ │                        ││
│  └─────────────┘ └─────────────┘ └─────────────────────────┘│
├─────────────────────────────────────────────────────────────┤
│                   Network Layer                             │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐│
│  │    VPC      │ │  Private    │ │    Network Security    ││
│  │ Isolation & │ │ Endpoints & │ │   Groups & Firewall    ││
│  │  Peering    │ │  IP Access  │ │        Rules           ││
│  └─────────────┘ └─────────────┘ └─────────────────────────┘│
├─────────────────────────────────────────────────────────────┤
│                Infrastructure Layer                         │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐│
│  │   Cloud     │ │  Identity   │ │   Compliance &         ││
│  │ Provider    │ │ Provider    │ │   Certifications       ││
│  │  Security   │ │ Integration │ │   (SOC2, HIPAA, etc.)  ││
│  └─────────────┘ └─────────────┘ └─────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
```

### 3.2 Encryption at Rest

#### AWS Implementation
```python
# Configure cluster with encryption at rest
cluster_config = {
    "cluster_name": "secure-analytics-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "driver_node_type_id": "i3.xlarge",
    "num_workers": 2,
    
    # Encrypt data stored on the cluster's local disks
    "enable_local_disk_encryption": True,

    "aws_attributes": {
        "ebs_volume_type": "GENERAL_PURPOSE_SSD",
        "ebs_volume_count": 1,
        "ebs_volume_size": 100,
        # Instance profile for S3 access with encryption
        "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/databricks-s3-access"
    }
    # Note: the cluster specification has no per-cluster EBS encryption or
    # KMS key fields. EBS volume encryption with a customer-managed KMS key
    # is configured for the workspace at the account level.
}
```

#### S3 Bucket Encryption Configuration
```json
{
    "Rules": [
        {
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "arn:aws:kms:us-west-2:123456789012:key/12345678-1234-1234-1234-123456789012"
            },
            "BucketKeyEnabled": true
        }
    ]
}
```

#### Delta Lake Encryption at Rest
```sql
-- Create external location on a bucket that has default encryption enabled
CREATE EXTERNAL LOCATION secure_data_location
URL 's3://my-encrypted-bucket/data/'
WITH (STORAGE CREDENTIAL my_storage_credential)
COMMENT 'Encrypted storage location for sensitive data';

-- Create table with encryption requirements
CREATE TABLE production.sensitive.encrypted_customer_data (
    customer_id STRING NOT NULL,
    ssn STRING NOT NULL,
    credit_card_number STRING,
    bank_account_number STRING,
    created_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
)
USING DELTA
LOCATION 's3://my-encrypted-bucket/data/customer_data/'
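TBLPROPERTIES (
    -- Required for the DEFAULT clause on created_timestamp
    'delta.feature.allowColumnDefaults' = 'supported',
    -- The properties below are descriptive metadata only: encryption is
    -- enforced by the bucket's KMS configuration, not by a table property
    'encryption.enabled' = 'true',
    'classification' = 'HIGHLY_SENSITIVE',
    'compliance' = 'PCI_DSS'
);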
```

### 3.3 Encryption in Transit

#### Cluster SSL/TLS Configuration
```python
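# Enable SSL/TLS for cluster communications.
# These are open-source Spark security settings; on Databricks they are
# usually applied through a cluster init script.
spark_conf = {
    # Require authentication between Spark processes
    # (a prerequisite for spark.network.crypto.enabled below)
    "spark.authenticate": "true",
    
    # Enable SSL for Spark's web endpoints
    "spark.ssl.enabled": "true",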
    "spark.ssl.keyStore": "/databricks/ssl/keystore.jks",
    "spark.ssl.keyStorePassword": "{{secrets/ssl/keystore-password}}",
    "spark.ssl.trustStore": "/databricks/ssl/truststore.jks",
    "spark.ssl.trustStorePassword": "{{secrets/ssl/truststore-password}}",
    
    # Require SSL for all connections
    "spark.ssl.needClientAuth": "true",
    
    # Enable encryption for shuffle operations
    "spark.io.encryption.enabled": "true",
    
    # Enable encryption for RPC communications
    "spark.network.crypto.enabled": "true"
}
```

#### Database Connection Encryption
```python
# JDBC connection with SSL
# Built from adjacent string literals so the URL contains no newlines,
# which would otherwise make it invalid
jdbc_url = (
    "jdbc:postgresql://prod-db.example.com:5432/analytics"
    "?ssl=true"
    "&sslmode=require"
    "&sslcert=/databricks/ssl/client-cert.pem"
    "&sslkey=/databricks/ssl/client-key.pem"
    "&sslrootcert=/databricks/ssl/ca-cert.pem"
)

# Read from encrypted database connection
encrypted_df = spark.read \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "sensitive_customer_data") \
    .option("user", dbutils.secrets.get("database", "username")) \
    .option("password", dbutils.secrets.get("database", "password")) \
    .load()
```

### 3.4 Secret Management

#### Databricks Secret Scopes
```python
# Create secret scope (current Databricks CLI; legacy versions use --scope/--key flags)
# databricks secrets create-scope production-secrets

# Add secrets to scope (CLI command)
# databricks secrets put-secret production-secrets db-username
# databricks secrets put-secret production-secrets db-password
# databricks secrets put-secret production-secrets api-key
# databricks secrets put-secret production-secrets encryption-key

# Use secrets in code (values are redacted if printed in a notebook)
api_key = dbutils.secrets.get("production-secrets", "api-key")
encryption_key = dbutils.secrets.get("production-secrets", "encryption-key")

# Example: Secure database connection
connection_properties = {
    "user": dbutils.secrets.get("production-secrets", "db-username"),
    "password": dbutils.secrets.get("production-secrets", "db-password"),
    "driver": "org.postgresql.Driver"
}

df = spark.read.jdbc(
    url="jdbc:postgresql://secure-db.example.com:5432/production",
    table="customer_data",
    properties=connection_properties
)
```

#### Azure Key Vault Integration
```python
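# Configure Azure Key Vault-backed secret scope.
# Both the vault's resource ID and its DNS name are required. The flags below
# follow the legacy Databricks CLI and may differ in newer CLI versions:
# databricks secrets create-scope --scope azure-key-vault 
#   --scope-backend-type AZURE_KEYVAULT 
#   --resource-id /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.KeyVault/vaults/<vault-name>
#   --dns-name https://<vault-name>.vault.azure.net/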

# Access secrets from Azure Key Vault
storage_account_key = dbutils.secrets.get("azure-key-vault", "storage-account-key")
client_secret = dbutils.secrets.get("azure-key-vault", "service-principal-secret")

# Configure Azure storage access with encryption
spark.conf.set(
    "fs.azure.account.key.mystorageaccount.dfs.core.windows.net",
    storage_account_key
)
```

### 3.5 Network Security Implementation

#### VPC Configuration for AWS
```yaml
# CloudFormation template for secure VPC setup
Resources:
  DatabricksVPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsHostnames: true
      EnableDnsSupport: true
      Tags:
        - Key: Name
          Value: databricks-secure-vpc

  # Private subnets for Databricks clusters
  PrivateSubnet1:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref DatabricksVPC
      CidrBlock: 10.0.1.0/24
      AvailabilityZone: !Select [0, !GetAZs '']
      MapPublicIpOnLaunch: false

  PrivateSubnet2:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref DatabricksVPC
      CidrBlock: 10.0.2.0/24
      AvailabilityZone: !Select [1, !GetAZs '']
      MapPublicIpOnLaunch: false

  # Security Group for Databricks clusters
  DatabricksSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for Databricks clusters
      VpcId: !Ref DatabricksVPC
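      SecurityGroupIngress:
        # Allow HTTPS from corporate network only
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 192.168.0.0/16  # Corporate network CIDR
        # Allow SSH from bastion host only
        # (BastionSecurityGroup is assumed to be defined elsewhere in the template)
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          SourceSecurityGroupId: !Ref BastionSecurityGroup
      SecurityGroupEgress:
        # Allow HTTPS outbound for package downloads
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0
        # Allow HTTP outbound for package downloads
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0

  # Allow internal communication between cluster nodes.
  # Declared as a separate resource because a security group that references
  # itself inside its own definition creates a circular dependency.
  DatabricksSelfIngress:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: !Ref DatabricksSecurityGroup
      IpProtocol: '-1'
      SourceSecurityGroupId: !Ref DatabricksSecurityGroup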
```
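
Before deploying a template like this, it is worth checking that the subnet ranges fit inside the VPC range and do not overlap. The following sketch does that with Python's standard `ipaddress` module; the `validate_subnets` helper is a hypothetical name, and the CIDR values are the ones used in the template above.

```python
import ipaddress


def validate_subnets(vpc_cidr, subnet_cidrs):
    """Return a list of problems; an empty list means the layout is consistent."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = [ipaddress.ip_network(cidr) for cidr in subnet_cidrs]

    problems = []
    # Every subnet must be contained in the VPC range
    for subnet in subnets:
        if not subnet.subnet_of(vpc):
            problems.append(f"{subnet} is outside {vpc}")
    # No two subnets may share addresses
    for index, first in enumerate(subnets):
        for second in subnets[index + 1:]:
            if first.overlaps(second):
                problems.append(f"{first} overlaps {second}")
    return problems


if __name__ == "__main__":
    print(validate_subnets("10.0.0.0/16", ["10.0.1.0/24", "10.0.2.0/24"]))
```

For the values in the template the result is an empty list, meaning both private subnets are valid, non-overlapping slices of the VPC.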

#### IP Access Lists Configuration
```python
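# Configure IP access lists via Databricks API
import os

import requests

# Workspace URL and admin token are read from the environment
# (the variable names are this example's choice)
databricks_instance = os.environ["DATABRICKS_HOST"]  # e.g. https://<workspace-url>
admin_token = os.environ["DATABRICKS_TOKEN"]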

# Define allowed IP ranges
allowed_ips = [
    "192.168.1.0/24",    # Corporate office
    "10.0.0.0/16",       # VPN users
    "203.0.113.0/24"     # Partner organization
]

# Create IP access list
access_list_config = {
    "label": "corporate-access-list",
    "list_type": "ALLOW",
    "ip_addresses": allowed_ips,
    "enabled": True
}

# API call to create access list (requires admin token)
headers = {
    "Authorization": f"Bearer {admin_token}",
    "Content-Type": "application/json"
}

response = requests.post(
    f"{databricks_instance}/api/2.0/ip-access-lists",
    json=access_list_config,
    headers=headers,
    timeout=30
)
# Fail loudly if the workspace rejects the request
response.raise_for_status()
```
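
A malformed CIDR entry in an allow list can lock users out, so validating the entries locally before calling the API is a cheap safeguard. This is a minimal sketch using the standard `ipaddress` module; `parse_allow_list` and `is_allowed` are hypothetical helpers, not part of any Databricks SDK.

```python
import ipaddress

ALLOWED_IPS = [
    "192.168.1.0/24",    # Corporate office
    "10.0.0.0/16",       # VPN users
    "203.0.113.0/24",    # Partner organization
]


def parse_allow_list(entries):
    """Parse CIDR strings; raises ValueError on malformed entries or host bits set."""
    return [ipaddress.ip_network(entry, strict=True) for entry in entries]


def is_allowed(ip, networks):
    """Return True if the address falls inside any allowed network."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


if __name__ == "__main__":
    networks = parse_allow_list(ALLOWED_IPS)
    print(is_allowed("192.168.1.25", networks))
    print(is_allowed("198.51.100.7", networks))
```

Running the check against the address you are currently connecting from, before enabling the list, helps avoid blocking your own administrative access.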

### 3.6 Private Link and VPC Endpoints

#### AWS Private Link Setup
```yaml
# VPC Endpoint for Databricks workspace
DatabricksVPCEndpoint:
  Type: AWS::EC2::VPCEndpoint
  Properties:
    VpcId: !Ref DatabricksVPC
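    # Placeholder: replace with the region-specific VPC endpoint service name
    # published by Databricks for your region
    ServiceName: com.amazonaws.vpce.us-west-2.vpce-svc-databricks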
    VpcEndpointType: Interface
    SubnetIds:
      - !Ref PrivateSubnet1
      - !Ref PrivateSubnet2
    SecurityGroupIds:
      - !Ref DatabricksVPCEndpointSecurityGroup
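    # Illustrative policy only: confirm that the endpoint service accepts
    # endpoint policies and that the action names are valid before use
    PolicyDocument: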
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal: '*'
          Action:
            - databricks:*
          Resource: '*'
          Condition:
            StringEquals:
              'aws:PrincipalArn': 
                - 'arn:aws:iam::123456789012:role/DatabricksInstanceProfile'

# S3 VPC Endpoint for data access
S3VPCEndpoint:
  Type: AWS::EC2::VPCEndpoint
  Properties:
    VpcId: !Ref DatabricksVPC
    ServiceName: com.amazonaws.us-west-2.s3
    VpcEndpointType: Gateway
    RouteTableIds:
      - !Ref PrivateRouteTable
```

#### Azure Private Link Configuration
```json
{
    "type": "Microsoft.Network/privateEndpoints",
    "apiVersion": "2021-03-01",
    "name": "databricks-private-endpoint",
    "location": "[resourceGroup().location]",
    "properties": {
        "subnet": {
            "id": "[resourceId('Microsoft.Network/virtualNetworks/subnets', variables('vnetName'), variables('subnetName'))]"
        },
        "privateLinkServiceConnections": [
            {
                "name": "databricks-connection",
                "properties": {
                    "privateLinkServiceId": "[resourceId('Microsoft.Databricks/workspaces', variables('workspaceName'))]",
                    "groupIds": ["databricks_ui_api"],
                    "requestMessage": "Please approve this connection"
                }
            }
        ]
    }
}
```
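
Once a private endpoint is in place on either cloud, one quick check is whether the workspace hostname resolves to a private address from inside the network. The sketch below is a minimal illustration using only the standard library; the helper names are hypothetical, and the hostname in the usage note is a placeholder.

```python
import ipaddress
import socket


def is_private_address(ip_text):
    """Return True if the IP literal falls in a private (non-public) range."""
    return ipaddress.ip_address(ip_text).is_private


def resolves_privately(hostname, port=443):
    """Resolve hostname and report whether every returned address is private."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    addresses = {info[4][0] for info in infos}
    return all(is_private_address(address) for address in addresses)


print(is_private_address("10.1.2.3"))   # address inside a private subnet
print(is_private_address("8.8.8.8"))    # public address
```

Run from a host inside the VPC or VNet, `resolves_privately("<workspace-hostname>")` returning `False` would suggest that DNS still points at the public endpoint.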

### 3.7 Data Classification and Discovery

#### Automated Data Classification
```sql
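-- Enable automatic data discovery and classification.
-- The property name below is illustrative; confirm the setting your workspace supports before running it.
ALTER CATALOG production
SET PROPERTIES ('databricks.data_discovery.enabled' = 'true');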

-- Create classification rules (column-name rules first, then sample-value patterns)
CREATE FUNCTION governance.classify_column(column_name STRING, sample_data STRING)
RETURNS STRING
LANGUAGE SQL
DETERMINISTIC
RETURN 
  CASE 
    -- lower() makes the name checks case-insensitive (LIKE is case-sensitive)
    WHEN lower(column_name) LIKE '%ssn%' OR lower(column_name) LIKE '%social%' 
      THEN 'PII_SSN'
    WHEN lower(column_name) LIKE '%email%' 
      THEN 'PII_EMAIL'
    WHEN lower(column_name) LIKE '%phone%' OR lower(column_name) LIKE '%mobile%'
      THEN 'PII_PHONE'
    WHEN lower(column_name) LIKE '%credit%' OR lower(column_name) LIKE '%card%'
      THEN 'PCI_CREDIT_CARD'
    WHEN regexp_like(sample_data, '^[0-9]{3}-[0-9]{2}-[0-9]{4}$')
      THEN 'PII_SSN'
    -- The backslash is doubled so the regex receives a literal-dot escape (\.)
    WHEN regexp_like(sample_data, '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$')
      THEN 'PII_EMAIL'
    ELSE 'UNCLASSIFIED'
  END;

-- Apply classification to existing tables.
-- Only the name-based rules fire here because no sample value is passed.
CREATE VIEW governance.classified_columns AS
SELECT 
    table_catalog,
    table_schema,
    table_name,
    column_name,
    data_type,
    governance.classify_column(column_name, '') as classification
FROM system.information_schema.columns
WHERE table_catalog = 'production';
```
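
Rule-based classifiers like this are easy to get subtly wrong, so it can be worth exercising the rules offline before creating the SQL function. The sketch below mirrors the rule order in Python; it is an illustration, not the function Databricks runs, and it lowercases the column name before matching as a deliberate choice.

```python
import re

# Name-based rules are checked first, in order; labels follow the SQL function
NAME_RULES = [
    (("ssn", "social"), "PII_SSN"),
    (("email",), "PII_EMAIL"),
    (("phone", "mobile"), "PII_PHONE"),
    (("credit", "card"), "PCI_CREDIT_CARD"),
]

# Value-based rules apply only when no name rule matched
VALUE_RULES = [
    (re.compile(r"^[0-9]{3}-[0-9]{2}-[0-9]{4}$"), "PII_SSN"),
    (re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"), "PII_EMAIL"),
]


def classify_column(column_name, sample_data=""):
    """Return a classification label for a column name and optional sample value."""
    name = column_name.lower()
    for keywords, label in NAME_RULES:
        if any(keyword in name for keyword in keywords):
            return label
    for pattern, label in VALUE_RULES:
        if pattern.match(sample_data):
            return label
    return "UNCLASSIFIED"


for name, sample in [("Customer_SSN", ""), ("contact", "jane@example.com"),
                     ("tax_id", "123-45-6789"), ("discard_flag", ""), ("notes", "")]:
    print(name, "->", classify_column(name, sample))
```

The `discard_flag` case shows a limitation of substring rules: the name contains `card`, so it is labeled `PCI_CREDIT_CARD` even though it holds no card data. Results from name matching are best treated as candidates for review.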

#### Implement Data Masking Based on Classification
```sql
-- Create masking functions for different data types.
-- The result depends on the caller's group membership, so the functions are not
-- declared DETERMINISTIC. In Unity Catalog, is_account_group_member() checks
-- account-level groups; is_member() checks workspace-level groups.
-- Values that do not match the expected format pass through unmasked,
-- so validate formats upstream.
CREATE FUNCTION governance.mask_ssn(ssn STRING)
RETURNS STRING
LANGUAGE SQL
RETURN 
  CASE 
    WHEN is_member('pii-readers') THEN ssn
    ELSE regexp_replace(ssn, '^[0-9]{3}-[0-9]{2}', 'XXX-XX')
  END;

CREATE FUNCTION governance.mask_email(email STRING)
RETURNS STRING
LANGUAGE SQL
RETURN 
  CASE 
    WHEN is_member('pii-readers') THEN email
    ELSE regexp_replace(email, '^(.{2}).*(@.*)$', '$1***$2')
  END;

CREATE FUNCTION governance.mask_credit_card(cc STRING)
RETURNS STRING
LANGUAGE SQL
RETURN 
  CASE 
    WHEN is_member('pci-readers') THEN cc
    ELSE regexp_replace(cc, '^[0-9]{4}-[0-9]{4}-[0-9]{4}', 'XXXX-XXXX-XXXX')
  END;

-- Create automatically masked view
CREATE VIEW production.secure.masked_customer_data AS
SELECT 
    customer_id,
    governance.mask_ssn(social_security_number) as social_security_number,
    governance.mask_email(email_address) as email_address,
    governance.mask_credit_card(credit_card_number) as credit_card_number,
    customer_name,
    registration_date
FROM production.raw.customer_data;
```
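
The masking expressions can be checked offline before they are deployed. The sketch below re-implements the three regular expressions with Python's `re` module and leaves out the group-membership branch. It assumes Python's regex engine behaves like the SQL one for these simple patterns, and the sample values are made up.

```python
import re


def mask_ssn(ssn):
    """Mask the first five digits of a dash-formatted SSN."""
    return re.sub(r"^[0-9]{3}-[0-9]{2}", "XXX-XX", ssn)


def mask_email(email):
    """Keep the first two characters and the domain, hide the rest."""
    return re.sub(r"^(.{2}).*(@.*)$", r"\1***\2", email)


def mask_credit_card(cc):
    """Mask the first twelve digits of a dash-formatted card number."""
    return re.sub(r"^[0-9]{4}-[0-9]{4}-[0-9]{4}", "XXXX-XXXX-XXXX", cc)


print(mask_ssn("123-45-6789"))
print(mask_email("john.doe@example.com"))
print(mask_credit_card("4111-1111-1111-1111"))
print(mask_ssn("123456789"))  # no dashes: the pattern does not match
```

The last call returns its input unchanged, which shows that pattern-based masking fails open when a value is stored in an unexpected format. Normalizing formats on ingestion, or masking unconditionally in the `ELSE` branch, avoids that gap.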

### 3.8 Compliance and Monitoring

#### GDPR Compliance Implementation
```sql
-- Create GDPR data subject rights implementation
CREATE TABLE governance.data_subject_requests (
    request_id STRING,
    data_subject_id STRING,
    request_type STRING, -- 'ACCESS', 'DELETION', 'PORTABILITY', 'RECTIFICATION'
    request_date TIMESTAMP,
    status STRING, -- 'PENDING', 'IN_PROGRESS', 'COMPLETED', 'REJECTED'
    completion_date TIMESTAMP,
    requestor_email STRING
);

-- Data deletion procedure for the GDPR right to erasure ("right to be forgotten").
-- Requires a Databricks runtime that supports SQL stored procedures.
CREATE PROCEDURE governance.process_data_deletion(subject_id STRING)
LANGUAGE SQL
AS
BEGIN
  -- Log the deletion request
  INSERT INTO governance.data_subject_requests VALUES (
    uuid(),
    subject_id,
    'DELETION',
    current_timestamp(),
    'IN_PROGRESS',
    NULL,
    current_user()
  );
  
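  -- Delete from all tables containing the subject's data.
  -- Note: DELETE on a Delta table removes rows logically; the old data files stay
  -- readable through time travel until VACUUM removes them.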
  DELETE FROM production.sales.customer_transactions WHERE customer_id = subject_id;
  DELETE FROM production.marketing.customer_preferences WHERE customer_id = subject_id;
  DELETE FROM production.support.customer_interactions WHERE customer_id = subject_id;
  
  -- Update request status
  UPDATE governance.data_subject_requests 
  SET status = 'COMPLETED', completion_date = current_timestamp()
  WHERE data_subject_id = subject_id AND request_type = 'DELETION' AND status = 'IN_PROGRESS';
END;
```
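
When the list of tables grows, the deletion steps are often generated rather than hand-written. The sketch below builds such a plan in Python without executing anything. The table list repeats the three tables from the procedure; the `build_deletion_plan` helper and the `:subject_id` parameter marker are assumptions for illustration.

```python
# Tables that hold data keyed by the data subject (taken from the procedure above)
SUBJECT_TABLES = [
    "production.sales.customer_transactions",
    "production.marketing.customer_preferences",
    "production.support.customer_interactions",
]


def build_deletion_plan(subject_id, tables=SUBJECT_TABLES, key_column="customer_id"):
    """Return (statement, parameters) pairs; nothing is executed here."""
    plan = []
    for table in tables:
        # The subject id is passed as a named parameter, never concatenated into SQL
        plan.append(
            (f"DELETE FROM {table} WHERE {key_column} = :subject_id",
             {"subject_id": subject_id})
        )
        # VACUUM only removes files older than the table's retention threshold
        plan.append((f"VACUUM {table}", {}))
    return plan


for statement, params in build_deletion_plan("cust-42"):
    print(statement, params)
```

In a notebook, each pair could be run with `spark.sql(statement, args=params)` on runtimes whose `spark.sql` accepts named parameters. Keeping the subject identifier out of the SQL text avoids injection through request data.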

#### SOC 2 Compliance Monitoring
```sql
-- Create SOC 2 control monitoring.
-- user_identity is a struct, so users are counted by user_identity.email.
-- action_name holds API-level action names (for example 'getTable'); the SQL verbs
-- below are illustrative and should be mapped to the actions logged in your workspace.
CREATE VIEW governance.soc2_access_controls AS
SELECT 
    'Access Control' as control_area,
    COUNT(DISTINCT user_identity.email) as unique_users_count,
    COUNT(*) as total_access_events,
    COUNT(CASE WHEN action_name = 'SELECT' THEN 1 END) as read_operations,
    COUNT(CASE WHEN action_name IN ('INSERT', 'UPDATE', 'DELETE') THEN 1 END) as write_operations,
    DATE(event_time) as audit_date
FROM system.access.audit
WHERE event_time >= current_date() - INTERVAL 30 DAYS
GROUP BY DATE(event_time);

-- Monitor privileged access.
-- request_params is a map, so keys are read with bracket syntax.
-- Grantees can be groups; expand group membership if users are granted access indirectly.
CREATE VIEW governance.privileged_access_monitoring AS
SELECT 
    user_identity.email as user_email,
    action_name,
    request_params['full_name_arg'] as resource_accessed,
    event_time,
    'PRIVILEGED_ACCESS' as alert_type
FROM system.access.audit
WHERE user_identity.email IN (
    SELECT grantee 
    FROM system.information_schema.table_privileges 
    WHERE privilege_type IN ('ALL PRIVILEGES', 'MODIFY', 'CREATE')
)
AND event_time >= current_timestamp() - INTERVAL 1 DAY;
```

### 3.9 Security Best Practices Implementation

#### 1. Defense in Depth
```python
# Illustrative multi-layer security configuration
# (a planning checklist, not a Databricks API payload)
security_config = {
    # Layer 1: Network Security
    "network": {
        "vpc_id": "vpc-12345678",
        "private_subnets": ["subnet-12345678", "subnet-87654321"],
        "security_groups": ["sg-databricks-restricted"],
        "ip_access_lists": ["corporate-offices", "vpn-users"]
    },
    
    # Layer 2: Identity and Access Management
    "iam": {
        "identity_provider": "SAML_SSO",
        "mfa_required": True,
        "session_timeout": 480,  # minutes (8 hours)
        "rbac_enabled": True
    },
    
    # Layer 3: Data Protection
    "data_protection": {
        "encryption_at_rest": "AES_256_KMS",
        "encryption_in_transit": "TLS_1_3",
        "data_classification": "AUTOMATIC",
        "data_masking": "ROLE_BASED"
    },
    
    # Layer 4: Monitoring and Auditing
    "monitoring": {
        "audit_logging": "COMPREHENSIVE",
        "real_time_alerts": True,
        "anomaly_detection": True,
        "compliance_reporting": "AUTOMATED"
    }
}
```
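
A layered configuration is only useful if gaps are noticed. The sketch below checks a configuration dictionary of the same shape for missing or disabled controls. The `REQUIRED_CONTROLS` table and `find_gaps` helper are hypothetical, and the required set is an example rather than a compliance baseline.

```python
# Controls each layer is expected to define (example set, adjust to your policy)
REQUIRED_CONTROLS = {
    "network": {"private_subnets", "security_groups", "ip_access_lists"},
    "iam": {"identity_provider", "mfa_required", "rbac_enabled"},
    "data_protection": {"encryption_at_rest", "encryption_in_transit"},
    "monitoring": {"audit_logging", "real_time_alerts"},
}


def find_gaps(config):
    """Return 'layer.control' names that are missing, empty, or set to False."""
    gaps = []
    for layer, controls in REQUIRED_CONTROLS.items():
        settings = config.get(layer, {})
        for control in sorted(controls):
            if not settings.get(control):
                gaps.append(f"{layer}.{control}")
    return gaps


sample_config = {
    "network": {"private_subnets": ["subnet-a"], "security_groups": ["sg-1"],
                "ip_access_lists": []},
    "iam": {"identity_provider": "SAML_SSO", "mfa_required": False,
            "rbac_enabled": True},
    "data_protection": {"encryption_at_rest": "AES_256_KMS",
                        "encryption_in_transit": "TLS_1_3"},
}
print(find_gaps(sample_config))
```

Here the empty IP access list, the disabled MFA flag, and the absent monitoring layer are all reported, so a single weak layer is visible even when the others look complete.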

#### 2. Zero Trust Architecture
```sql
-- Score recent audit events against zero trust factors.
-- This view is detective: it evaluates events after the fact and does not block access.
-- The governance.trusted_ips, verified_users, and mfa_verified_sessions lookup tables must already exist.
CREATE VIEW governance.zero_trust_access AS
SELECT 
    user_email,
    resource_accessed,
    access_granted,
    verification_factors,
    risk_score,
    decision
FROM (
    SELECT 
        user_identity.email as user_email,
        request_params['full_name_arg'] as resource_accessed,
        -- Evaluate multiple trust factors
        CASE 
            WHEN source_ip_address IN (SELECT ip FROM governance.trusted_ips) THEN 1
            ELSE 0
        END +
        CASE 
            WHEN user_identity.email IN (SELECT `user` FROM governance.verified_users) THEN 1
            ELSE 0
        END +
        CASE 
            WHEN session_id IN (SELECT `session` FROM governance.mfa_verified_sessions) THEN 1
            ELSE 0
        END as verification_factors,
        
        -- Calculate risk score
        CASE 
            WHEN action_name IN ('DELETE', 'DROP') THEN 3
            WHEN action_name IN ('UPDATE', 'INSERT') THEN 2
            WHEN action_name = 'SELECT' THEN 1
            ELSE 0
        END as risk_score,
        
        -- Make access decision (reuses the aliases defined above, which requires
        -- a runtime with lateral column alias support)
        CASE 
            WHEN verification_factors >= 2 AND risk_score <= 2 THEN 'ALLOW'
            WHEN verification_factors >= 3 THEN 'ALLOW'
            ELSE 'DENY'
        END as decision,
        
        action_name IN ('SELECT', 'INSERT', 'UPDATE', 'DELETE') as access_granted
    FROM system.access.audit
    WHERE event_time >= current_timestamp() - INTERVAL 1 HOUR
);
```
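
The decision rule in the view is compact enough to restate as a small function, which makes its edge cases easier to see. The sketch below copies the risk scores and thresholds from the SQL; the function names are hypothetical.

```python
# Risk scores per action, copied from the CASE expression in the view
RISK_BY_ACTION = {"DELETE": 3, "DROP": 3, "UPDATE": 2, "INSERT": 2, "SELECT": 1}


def risk_score(action_name):
    """Unknown actions score 0, matching the ELSE branch of the SQL."""
    return RISK_BY_ACTION.get(action_name, 0)


def access_decision(verification_factors, action_name):
    """Return 'ALLOW' or 'DENY' using the same thresholds as the view."""
    score = risk_score(action_name)
    if verification_factors >= 2 and score <= 2:
        return "ALLOW"
    if verification_factors >= 3:
        return "ALLOW"
    return "DENY"


for factors, action in [(2, "SELECT"), (2, "DELETE"), (3, "DROP"),
                        (1, "SELECT"), (2, "GRANT")]:
    print(factors, action, "->", access_decision(factors, action))
```

Two consequences stand out. Destructive actions need all three factors, and any action missing from the score table gets a risk of 0, so it is allowed with only two factors. If unlisted actions should be treated as high risk, the default score would need to change.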

#### 3. Incident Response Automation
```python
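# Automated security incident response

def execute_security_action(action, details):
    """Placeholder hook: replace with calls to your identity provider or SOAR tooling."""
    print(f"[action] {action}: {details}")

def create_incident_ticket(alert_type, severity, details, actions):
    """Placeholder hook: replace with your ticketing system integration."""
    print(f"[ticket] {severity} {alert_type}: {details} -> {actions}")

def security_incident_response(alert_type, severity, details):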
    """
    Automated response to security incidents
    """
    
    if severity == "CRITICAL":
        # Immediate actions for critical incidents
        actions = [
            "disable_user_account",
            "revoke_access_tokens", 
            "notify_security_team",
            "initiate_forensic_logging"
        ]
    elif severity == "HIGH":
        # Actions for high severity incidents
        actions = [
            "require_mfa_reauthentication",
            "restrict_network_access",
            "notify_data_steward",
            "enhanced_monitoring"
        ]
    else:
        # Actions for medium/low severity
        actions = [
            "log_incident",
            "notify_user",
            "schedule_security_review"
        ]
    
    # Execute automated responses
    for action in actions:
        execute_security_action(action, details)
    
    # Create incident ticket
    create_incident_ticket(alert_type, severity, details, actions)

# Example: Detect and respond to anomalous access
# Note: action_name holds API-level action names; 'SELECT' is illustrative.
anomalous_access_query = """
SELECT 
    user_identity.email as user_email,
    COUNT(*) as access_count,
    COUNT(DISTINCT request_params['full_name_arg']) as unique_resources,
    'ANOMALOUS_ACCESS' as alert_type,
    'HIGH' as severity
FROM system.access.audit
WHERE event_time >= current_timestamp() - INTERVAL 1 HOUR
  AND action_name = 'SELECT'
GROUP BY user_identity.email
HAVING access_count > 100 OR unique_resources > 50
"""

# Execute monitoring query and trigger response
# (spark is the SparkSession that Databricks notebooks provide)
anomalous_users = spark.sql(anomalous_access_query).collect()
for user in anomalous_users:
    security_incident_response(
        user.alert_type, 
        user.severity, 
        {"user": user.user_email, "access_count": user.access_count}
    )
```
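
Because automated responses can disable accounts, it is worth testing the routing logic separately from the side effects. The sketch below expresses the severity-to-action mapping as data and restates the query's thresholds as a pure function. The names `PLAYBOOK`, `plan_response`, and `is_anomalous` are hypothetical; the action names and thresholds are taken from the example above.

```python
# Severity -> ordered response actions (same actions as the function above)
PLAYBOOK = {
    "CRITICAL": ["disable_user_account", "revoke_access_tokens",
                 "notify_security_team", "initiate_forensic_logging"],
    "HIGH": ["require_mfa_reauthentication", "restrict_network_access",
             "notify_data_steward", "enhanced_monitoring"],
}
DEFAULT_ACTIONS = ["log_incident", "notify_user", "schedule_security_review"]


def plan_response(severity):
    """Return the actions for a severity without executing them (dry run)."""
    return list(PLAYBOOK.get(severity, DEFAULT_ACTIONS))


def is_anomalous(access_count, unique_resources, max_access=100, max_resources=50):
    """Mirror the HAVING clause: either threshold being exceeded flags the user."""
    return access_count > max_access or unique_resources > max_resources


print(plan_response("HIGH"))
print(is_anomalous(access_count=101, unique_resources=3))
```

Keeping the playbook in a dictionary lets a security team review or change responses without touching control flow, and the dry-run plan can be logged for approval before anything is executed.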

---

## Summary and Key Takeaways

### RBAC Implementation
- **Principle of Least Privilege**: Grant minimum necessary permissions
- **Group-Based Management**: Use groups instead of individual user grants
- **Hierarchical Permissions**: Leverage Unity Catalog's hierarchical model
- **Regular Access Reviews**: Implement automated access review processes

### Data Lineage and Auditing
- **Automatic Capture**: Unity Catalog automatically captures lineage for SQL and Spark operations
- **Comprehensive Auditing**: Audit logs in `system.access.audit` record account and workspace events, including Unity Catalog access
- **Impact Analysis**: Use lineage for understanding downstream effects of changes
- **Compliance Reporting**: Leverage audit logs for regulatory compliance

### Advanced Security
- **Defense in Depth**: Implement multiple layers of security controls
- **Encryption Everywhere**: Encrypt data at rest and in transit
- **Network Isolation**: Use VPCs, private endpoints, and IP restrictions
- **Zero Trust Model**: Verify every access request regardless of source
- **Automated Response**: Implement automated incident response procedures

### Best Practices
1. **Start with Security**: Design security into your architecture from the beginning
2. **Automate Everything**: Use automation for monitoring, compliance, and incident response
3. **Regular Reviews**: Conduct regular security and access reviews
4. **Documentation**: Maintain comprehensive documentation of security policies and procedures
5. **Training**: Ensure all team members understand security requirements and procedures

This module provides a foundation for implementing enterprise-grade security and governance in Databricks environments.