# Databricks Operational Workspace Setup Guide
## Data Engineering Best Practices

This notebook lays out operational patterns for a Databricks workspace, covering four areas:

* **Notebooks Structure** - Organized folder hierarchy and naming conventions
* **Cluster Policies** - Standardized compute configurations and cost controls
* **Secrets Management** - Secure credential access patterns
* **Project-Based Catalogs** - Unity Catalog organization by project with medallion architecture

## 1. Notebooks Structure & Organization

### Recommended Folder Hierarchy
```
/Workspace/
├── Shared/
│   ├── libraries/          # Reusable functions and utilities
│   ├── configs/            # Configuration notebooks
│   └── templates/          # Starter templates
├── Projects/
│   ├── CHAT/               # CHAT project (chat_catalog)
│   │   ├── bronze/         # Raw data ingestion notebooks
│   │   ├── silver/         # Data cleaning/transformation
│   │   ├── gold/           # Business aggregates
│   │   ├── orchestration/  # Workflow definitions
│   │   └── tests/          # Unit and integration tests
│   └── SELMAN/             # SELMAN project (selman_catalog)
│       ├── bronze/
│       ├── silver/
│       ├── gold/
│       ├── orchestration/
│       └── tests/
└── Users/                  # Individual development workspaces
```

### Naming Conventions
* **Notebooks:** `01_ingest_source_data.py`, `02_transform_customers.sql`
* **Tables:** `{layer}_{domain}_{entity}` (e.g., `silver_clinical_patients`)
* **Jobs:** `{project}_{layer}_{pipeline}` (e.g., `chat_bronze_daily_ingest`)
* **Catalogs:** `{project_name}_catalog` (e.g., `chat_catalog`, `selman_catalog`)
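
The conventions above can be enforced in code rather than by review alone. The following is a minimal sketch, assuming every name part is lowercase snake_case; the helper names (`table_name`, `job_name`, `catalog_name`) are illustrative and not part of any Databricks API.

```python
import re

# Assumption: every name part is lowercase snake_case (letters, digits, underscores).
_PART = re.compile(r"^[a-z][a-z0-9_]*$")
LAYERS = ("bronze", "silver", "gold")


def _check(*parts: str) -> None:
    """Raise ValueError if any name part breaks the snake_case assumption."""
    for part in parts:
        if not _PART.match(part):
            raise ValueError(f"Invalid name part: {part!r}")


def _check_layer(layer: str) -> None:
    """Raise ValueError if the layer is not bronze, silver, or gold."""
    if layer not in LAYERS:
        raise ValueError(f"Invalid layer: {layer!r}. Must be one of {LAYERS}")


def table_name(layer: str, domain: str, entity: str) -> str:
    """Build a table name as {layer}_{domain}_{entity}."""
    _check_layer(layer)
    _check(domain, entity)
    return f"{layer}_{domain}_{entity}"


def job_name(project: str, layer: str, pipeline: str) -> str:
    """Build a job name as {project}_{layer}_{pipeline}."""
    _check_layer(layer)
    _check(project, pipeline)
    return f"{project}_{layer}_{pipeline}"


def catalog_name(project: str) -> str:
    """Build a catalog name as {project_name}_catalog."""
    _check(project)
    return f"{project}_catalog"


print(table_name("silver", "clinical", "patients"))  # silver_clinical_patients
print(job_name("chat", "bronze", "daily_ingest"))    # chat_bronze_daily_ingest
print(catalog_name("selman"))                        # selman_catalog
```

Building names through one helper keeps job and table names predictable, and a rejected name fails at definition time instead of surfacing later as an oddly named table.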

### Project-Catalog Alignment
Each project folder corresponds to a Unity Catalog:
* **CHAT project** → `chat_catalog` (bronze/silver/gold schemas)
* **SELMAN project** → `selman_catalog` (bronze/silver/gold schemas)
* Notebooks in project folders write to their respective catalogs
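
The folder-to-catalog alignment can also be derived from a notebook's path. The sketch below assumes the `/Workspace/Projects/<PROJECT>/<layer>/...` layout shown earlier; `resolve_target` is a hypothetical helper, and how you obtain the current notebook path is left out.

```python
from pathlib import PurePosixPath

LAYERS = {"bronze", "silver", "gold"}


def resolve_target(notebook_path: str) -> tuple[str, str]:
    """Map /Workspace/Projects/<PROJECT>/<layer>/... to (catalog, schema)."""
    parts = PurePosixPath(notebook_path).parts
    # Expected parts: ('/', 'Workspace', 'Projects', '<PROJECT>', '<layer>', ...)
    if len(parts) < 5 or parts[1:3] != ("Workspace", "Projects"):
        raise ValueError(f"Not a project notebook path: {notebook_path}")
    project, layer = parts[3].lower(), parts[4]
    if layer not in LAYERS:
        # Folders such as orchestration/ or tests/ have no target schema.
        raise ValueError(f"{layer!r} is not a medallion layer folder")
    return f"{project}_catalog", layer


print(resolve_target("/Workspace/Projects/CHAT/bronze/01_ingest_source_data.py"))
# ('chat_catalog', 'bronze')
```

Notebooks under `orchestration/` or `tests/` raise a `ValueError` here by design, since those folders do not map to a schema.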

#### Standard notebook header template for data engineering

In [0]:
# Standard notebook header template for data engineering

# ============================================
# NOTEBOOK: Data Ingestion Template
# PURPOSE: Ingest raw data from source systems
# PROJECT: CHAT (or specify your project)
# AUTHOR: Data Engineering Team
# CREATED: 2026-01-27
# ============================================

# Import standard libraries
from pyspark.sql import functions as F
from datetime import datetime, timedelta
import json

# Notebook parameters (for job orchestration)
dbutils.widgets.text("project", "chat", "Project Name (chat/selman/etc)")
dbutils.widgets.text("layer", "bronze", "Target Layer (bronze/silver/gold)")
dbutils.widgets.text("run_date", "", "Run Date (YYYY-MM-DD)")

project = dbutils.widgets.get("project")
layer = dbutils.widgets.get("layer")
run_date = dbutils.widgets.get("run_date") or datetime.now().strftime("%Y-%m-%d")

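# Get project configuration
# NOTE: WorkspaceConfig is defined in the configuration cell below; run that cell
# (or load it from a shared config notebook) before running this header.
config = WorkspaceConfig.get_project_config(project)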

print(f"Project: {project}")
print(f"Catalog: {config['catalog']}")
print(f"Target Layer: {layer}")
print(f"Run Date: {run_date}")
print(f"Storage Path: {config['storage_path']}")

# Set current catalog
spark.sql(f"USE CATALOG {config['catalog']}")
spark.sql(f"USE SCHEMA {layer}")

print(f"\nInitialized for {config['catalog']}.{layer}")

In [0]:
# Centralized configuration pattern using Python dictionaries

class WorkspaceConfig:
    """Centralized configuration for project-specific settings"""
    
    # Project-based catalog configuration
    PROJECTS = {
        "chat": {
            "catalog": "chat_catalog",
            "storage_path": "/mnt/projects/chat/data",
            "checkpoint_path": "/mnt/projects/chat/checkpoints",
            "owner_team": "chat_engineering_team",
            "description": "CHAT project data assets"
        },
        "selman": {
            "catalog": "selman_catalog",
            "storage_path": "/mnt/projects/selman/data",
            "checkpoint_path": "/mnt/projects/selman/checkpoints",
            "owner_team": "selman_engineering_team",
            "description": "SELMAN project data assets"
        }
        # Add more projects as needed
    }
    
    # Standard schema layers (same across all projects)
    LAYERS = {
        "bronze": {
            "description": "Raw ingested datasets and tables",
            "data_classification": "raw"
        },
        "silver": {
            "description": "Cleaned and validated silver tables",
            "data_classification": "cleaned"
        },
        "gold": {
            "description": "Business-ready gold tables",
            "data_classification": "curated"
        }
    }
    
    @staticmethod
    def get_project_config(project: str) -> dict:
        """Get configuration for specified project"""
        if project not in WorkspaceConfig.PROJECTS:
            raise ValueError(f"Invalid project: {project}. Must be one of {list(WorkspaceConfig.PROJECTS.keys())}")
        return WorkspaceConfig.PROJECTS[project]
    
    @staticmethod
    def get_layer_config(layer: str) -> dict:
        """Get configuration for specified layer"""
        if layer not in WorkspaceConfig.LAYERS:
            raise ValueError(f"Invalid layer: {layer}. Must be one of {list(WorkspaceConfig.LAYERS.keys())}")
        return WorkspaceConfig.LAYERS[layer]
    
    @staticmethod
    def get_table_path(project: str, layer: str, table_name: str) -> str:
        """Generate fully qualified table path"""
        project_config = WorkspaceConfig.get_project_config(project)
        return f"{project_config['catalog']}.{layer}.{table_name}"

# Usage example
project = "chat"  # Change to your project name
config = WorkspaceConfig.get_project_config(project)
print(f"Project: {project}")
print(f"Catalog: {config['catalog']}")
print(f"Storage Path: {config['storage_path']}")
print(f"\nTable Path Example: {WorkspaceConfig.get_table_path(project, 'silver', 'customers')}")

## 2. Cluster Policies & Compute Management

### Policy Strategy
Cluster policies enforce standardized configurations and cost controls across environments.

### Recommended Policies

**Development Policy:**
* Single-node or small clusters (1-3 workers)
* Auto-termination: 30 minutes
* Spot (preemptible) instances enabled
* DBR: Latest LTS version

**Production Policy:**
* Fixed-size or autoscaling clusters
* Auto-termination: 120 minutes
* On-demand instances
* DBR: Stable LTS version
* Enhanced monitoring enabled

**Interactive Policy:**
* For ad-hoc analysis
* Auto-termination: 60 minutes
* Photon enabled for SQL workloads
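
To make the rule types concrete, here is a simplified local check of a cluster spec against `fixed`, `range`, and `allowlist` rules. It is a sketch for reasoning about policies before submitting them, not a reimplementation of Databricks' own policy evaluation; `check_against_policy` and `dev_rules` are hypothetical names.

```python
def check_against_policy(cluster: dict, policy: dict) -> list[str]:
    """Return a list of violations; an empty list means the spec conforms."""
    violations = []
    for attr, rule in policy.items():
        if attr not in cluster:
            continue  # Simplification: unset attributes are not checked here.
        value = cluster[attr]
        kind = rule["type"]
        if kind == "fixed" and value != rule["value"]:
            violations.append(f"{attr}: expected {rule['value']!r}, got {value!r}")
        elif kind == "allowlist" and value not in rule["values"]:
            violations.append(f"{attr}: {value!r} not in {rule['values']}")
        elif kind == "range":
            low = rule.get("minValue", float("-inf"))
            high = rule.get("maxValue", float("inf"))
            if not low <= value <= high:
                violations.append(f"{attr}: {value} outside [{low}, {high}]")
    return violations


# Illustrative rules mirroring the development policy bullets above.
dev_rules = {
    "num_workers": {"type": "range", "minValue": 0, "maxValue": 3},
    "autotermination_minutes": {"type": "fixed", "value": 30},
}

ok = {"num_workers": 2, "autotermination_minutes": 30}
too_big = {"num_workers": 8, "autotermination_minutes": 30}
print(check_against_policy(ok, dev_rules))       # []
print(check_against_policy(too_big, dev_rules))  # ['num_workers: 8 outside [0, 3]']
```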

#### Development Cluster Policy (JSON format)

In [0]:
# Development Cluster Policy (JSON format)
# Apply via Databricks Admin Console > Compute > Policies

dev_cluster_policy = {
    "name": "Data Engineering - Development",
    "definition": {
        "spark_version": {
            "type": "fixed",
            "value": "15.4.x-scala2.12"
        },
        "node_type_id": {
            "type": "allowlist",
            "values": ["n2-highmem-4", "n2-standard-4"],
            "defaultValue": "n2-standard-4"
        },
        "num_workers": {
            "type": "range",
            "minValue": 0,
            "maxValue": 3,
            "defaultValue": 1
        },
        "autotermination_minutes": {
            "type": "fixed",
            "value": 30
        },
        "gcp_attributes.use_preemptible_executors": {
            "type": "fixed",
            "value": True
        },
        "data_security_mode": {
            "type": "fixed",
            "value": "USER_ISOLATION"
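        }
        # The single-node profile is deliberately not fixed here: forcing
        # spark.databricks.cluster.profile to "singleNode" would conflict with
        # the num_workers range above, which allows up to 3 workers.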
    }
}

print(json.dumps(dev_cluster_policy, indent=2))

#### Production Cluster Policy (JSON format)

In [0]:
# Production Cluster Policy (JSON format)

prod_cluster_policy = {
    "name": "Data Engineering - Production",
    "definition": {
        "spark_version": {
            "type": "fixed",
            "value": "15.4.x-scala2.12"  # Use stable LTS
        },
        "node_type_id": {
            "type": "allowlist",
            "values": ["n2-highmem-4", "n2-highmem-8"],
            "defaultValue": "n2-highmem-4"
        },
        # Autoscale bounds are addressed as separate attribute paths
        "autoscale.min_workers": {
            "type": "fixed",
            "value": 2
        },
        "autoscale.max_workers": {
            "type": "fixed",
            "value": 10
        },
        "autotermination_minutes": {
            "type": "fixed",
            "value": 120
        },
        "gcp_attributes.use_preemptible_executors": {
            "type": "fixed",
            "value": False  # On-demand for production
        },
        "data_security_mode": {
            "type": "fixed",
            "value": "USER_ISOLATION"
        },
        "spark_conf.spark.databricks.delta.optimizeWrite.enabled": {
            "type": "fixed",
            "value": "true"
        },
        "spark_conf.spark.databricks.delta.autoCompact.enabled": {
            "type": "fixed",
            "value": "true"
        }
    }
}

print(json.dumps(prod_cluster_policy, indent=2))
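
When a policy is created programmatically rather than through the Admin Console, the Cluster Policies REST API expects `definition` as a JSON-encoded string, not a nested object. The sketch below only builds such a request body; `to_policy_payload` is a hypothetical helper, and the small `example_policy` stands in for the full dictionaries above. Sending the payload (endpoint, authentication) is left out and should be checked against the API reference for your workspace.

```python
import json


def to_policy_payload(policy: dict) -> dict:
    """Convert a {'name', 'definition'} dict into an API-style request body."""
    return {
        "name": policy["name"],
        # The definition is serialized to a JSON string, not sent as a nested object.
        "definition": json.dumps(policy["definition"]),
    }


example_policy = {
    "name": "Data Engineering - Development",
    "definition": {
        "autotermination_minutes": {"type": "fixed", "value": 30},
        "gcp_attributes.use_preemptible_executors": {"type": "fixed", "value": True},
    },
}

payload = to_policy_payload(example_policy)
print(type(payload["definition"]).__name__)  # str
print(payload["definition"])
```

Note that `json.dumps` converts the Python literal `True` into JSON `true`, which is why the policy dictionaries can be written with Python booleans.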

## 3. Secrets Management & Secure Access

### Databricks Secrets Architecture
Use **Databricks Secrets** to securely store credentials, API keys, and connection strings.

### Setup Steps

**1. Create Secret Scopes** (via Databricks CLI or UI):
```bash
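# Using Databricks CLI (legacy flag syntax; newer CLI versions take the scope as a positional argument)
# One scope per project, matching the {project}-secrets pattern used by SecretsManager
databricks secrets create-scope --scope chat-secrets
databricks secrets create-scope --scope selman-secrets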
```

**2. Add Secrets to Scopes:**
```bash
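# Example: Store database password (key follows the {db_type}-password pattern)
databricks secrets put --scope chat-secrets --key postgres-password

# Example: Store API key (key follows the {service}-api-key pattern)
databricks secrets put --scope selman-secrets --key external-api-key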
```

**3. Grant Access via ACLs:**
```bash
# Grant read access to the project's owning group
databricks secrets put-acl --scope chat-secrets --principal chat_engineering_team --permission READ
```

### Best Practices
* Separate scopes per project (e.g., `chat-secrets`, `selman-secrets`), split further by environment only where needed
* For production, consider GCP Secret Manager as the source of truth, if your deployment supports it
* Never hardcode credentials in notebooks
* Rotate secrets regularly
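
The "never hardcode credentials" rule can be backed by a simple pre-commit or CI check. The sketch below is a heuristic only: it assumes a hardcoded credential looks like a string literal assigned to a variable whose name contains `password`, `secret`, `token`, or `api_key`. The function name and the pattern are illustrative, and a real scanner would need to cover more cases.

```python
import re

# Assumption: a "hardcoded credential" is a string literal assigned to a name
# containing password, secret, token, or api_key. This is a heuristic only.
_SUSPICIOUS = re.compile(
    r"""(?ix)
    \b\w*(password|secret|token|api_key)\w*   # variable name
    \s*=\s*                                   # plain assignment
    (["'])[^"']+\2                            # non-empty string literal
    """
)


def find_hardcoded_credentials(source: str) -> list[int]:
    """Return 1-based line numbers that look like hardcoded credentials."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if _SUSPICIOUS.search(line)
    ]


sample = '''jdbc_password = "hunter2"
jdbc_password = dbutils.secrets.get(scope="chat-secrets", key="postgres-password")
'''
print(find_hardcoded_credentials(sample))  # [1]
```

The second line of the sample is not flagged because the value comes from `dbutils.secrets.get` rather than a literal.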

#### Secure secrets access pattern in notebooks

In [0]:
# Secure secrets access pattern in notebooks

class SecretsManager:
    """Wrapper for secure credential access"""
    
    def __init__(self, project: str):
        self.project = project
        self.scope = f"{project}-secrets"
    
    def get_secret(self, key: str) -> str:
        """Retrieve secret from Databricks secret scope"""
        try:
            return dbutils.secrets.get(scope=self.scope, key=key)
        except Exception as e:
            # Chain the original exception so the root cause stays in the traceback
            raise ValueError(f"Failed to retrieve secret '{key}' from scope '{self.scope}': {e}") from e
    
    def get_jdbc_connection(self, db_type: str) -> dict:
        """Get database connection parameters"""
        return {
            "url": self.get_secret(f"{db_type}-url"),
            "user": self.get_secret(f"{db_type}-user"),
            "password": self.get_secret(f"{db_type}-password")
        }
    
    def get_api_credentials(self, service: str) -> dict:
        """Get API credentials for external services"""
        return {
            "api_key": self.get_secret(f"{service}-api-key"),
            "api_secret": self.get_secret(f"{service}-api-secret")
        }

# Usage example for CHAT project
project = "chat"  # Change to your project: 'chat', 'selman', etc.
secrets = SecretsManager(project)

# Access individual secrets
# api_key = secrets.get_secret("external-api-key")

# Access database credentials
# db_config = secrets.get_jdbc_connection("postgres")

print(f"Project: {project}")
print(f"Secrets scope: {secrets.scope}")
print("✓ Secrets manager initialized")
print("\nNote: Create project-specific secret scopes:")
print("  databricks secrets create-scope --scope chat-secrets")
print("  databricks secrets create-scope --scope selman-secrets")

#### Example: Connecting to external database using secrets

In [0]:
# Example: Connecting to external database using secrets

def read_from_external_db(table_name: str, project: str):
    """Read data from external database using secure credentials"""
    
    secrets = SecretsManager(project)
    
    # Retrieve connection details from project-specific secrets
    # (reuses SecretsManager.get_jdbc_connection instead of three separate lookups)
    conn = secrets.get_jdbc_connection("postgres")
    
    # Read data using JDBC
    df = (spark.read
        .format("jdbc")
        .option("url", conn["url"])
        .option("dbtable", table_name)
        .option("user", conn["user"])
        .option("password", conn["password"])
        .option("driver", "org.postgresql.Driver")
        .load()
    )
    
    return df

def write_to_project_catalog(df, project: str, layer: str, table_name: str):
    """Write DataFrame to project catalog"""
    config = WorkspaceConfig.get_project_config(project)
    table_path = f"{config['catalog']}.{layer}.{table_name}"
    
    print(f"Writing to: {table_path}")
    
    (df.write
        .format("delta")
        .mode("overwrite")
        .option("mergeSchema", "true")
        .saveAsTable(table_path)
    )
    
    print(f"✓ Data written to {table_path}")
    return table_path

# Example usage (commented to avoid execution without secrets)
# project = "chat"
# df = read_from_external_db("public.customers", project)
# write_to_project_catalog(df, project, "bronze", "raw_customers")

print("✓ Secure connection functions defined")
print("\nUsage:")
print("  1. Set up project-specific secrets")
print("  2. Read from external source")
print("  3. Write to project catalog (bronze/silver/gold)")

## 4. Project-Based Catalog Strategy

### Project Isolation Architecture

**Catalog Organization:**

Each project receives its own dedicated catalog in the Foundation Workspace with standardized medallion architecture:

| Component | CHAT Project | SELMAN Project | Additional Projects |
|-----------|--------------|----------------|---------------------|
| **Catalog** | `chat_catalog` | `selman_catalog` | `{project}_catalog` |
| **Storage** | `/mnt/projects/chat/` | `/mnt/projects/selman/` | `/mnt/projects/{project}/` |
| **Secrets** | `chat-secrets` | `selman-secrets` | `{project}-secrets` |
| **Owner Team** | CHAT Engineering | SELMAN Engineering | Project Team |
| **Access** | Project-specific | Project-specific | Project-specific |

### Standard Schema Structure (Per Project)

Every project catalog contains three standard schemas:
* **bronze** - Raw ingested datasets and tables from source systems
* **silver** - Cleaned and validated silver tables
* **gold** - Business-ready gold tables for analytics

### Unity Catalog Setup
```sql
-- Create project-specific catalogs
CREATE CATALOG IF NOT EXISTS chat_catalog
  COMMENT 'CHAT project data assets - bronze, silver, and gold layers';

CREATE CATALOG IF NOT EXISTS selman_catalog
  COMMENT 'SELMAN project data assets - bronze, silver, and gold layers';

-- Create schemas within CHAT catalog
USE CATALOG chat_catalog;
CREATE SCHEMA IF NOT EXISTS bronze COMMENT 'Raw ingested datasets';
CREATE SCHEMA IF NOT EXISTS silver COMMENT 'Cleaned and validated tables';
CREATE SCHEMA IF NOT EXISTS gold COMMENT 'Business-ready analytics tables';

-- Create schemas within SELMAN catalog
USE CATALOG selman_catalog;
CREATE SCHEMA IF NOT EXISTS bronze COMMENT 'Raw ingested datasets';
CREATE SCHEMA IF NOT EXISTS silver COMMENT 'Cleaned and validated tables';
CREATE SCHEMA IF NOT EXISTS gold COMMENT 'Business-ready analytics tables';
```
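
As more projects are added, the same statements can be generated instead of copied. The following is a sketch, assuming project names are simple identifiers; `catalog_ddl` is a hypothetical helper that only builds SQL strings, which you could then pass to `spark.sql` in a notebook.

```python
LAYER_COMMENTS = {
    "bronze": "Raw ingested datasets",
    "silver": "Cleaned and validated tables",
    "gold": "Business-ready analytics tables",
}


def catalog_ddl(project: str) -> list[str]:
    """Return CREATE statements for one project's catalog and schemas."""
    if not project.isidentifier():
        # Guard against names that would break or inject into the SQL text.
        raise ValueError(f"Invalid project name: {project!r}")
    catalog = f"{project.lower()}_catalog"
    statements = [
        f"CREATE CATALOG IF NOT EXISTS {catalog} "
        f"COMMENT '{project.upper()} project data assets - bronze, silver, and gold layers'"
    ]
    for layer, comment in LAYER_COMMENTS.items():
        # Fully qualified schema names avoid depending on USE CATALOG state.
        statements.append(
            f"CREATE SCHEMA IF NOT EXISTS {catalog}.{layer} COMMENT '{comment}'"
        )
    return statements


for stmt in catalog_ddl("chat"):
    print(stmt + ";")
```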

### Benefits of Project-Based Catalogs
* **Clear Ownership:** Each project team owns and manages their catalog
* **Independent Access Control:** Project-specific permissions and security
* **Simplified Cost Tracking:** Easy to track resource usage per project
* **Flexible Lifecycle:** Projects can be archived or decommissioned independently
* **Reduced Complexity:** No need for environment-based catalog multiplication
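
Independent access control can be expressed as a small set of grants per catalog. The sketch below generates Unity Catalog `GRANT` statements as strings; it assumes an owning group with full privileges and a read-only group limited to the gold schema. `project_grants` and the `chat_analysts` group are hypothetical, and the privilege split is one possible choice rather than a recommendation from this guide.

```python
def project_grants(catalog: str, owner_group: str, reader_group: str) -> list[str]:
    """Return GRANT statements giving owners full control and readers gold access."""
    # Backticks allow group names that contain characters such as hyphens.
    owner, reader = f"`{owner_group}`", f"`{reader_group}`"
    return [
        f"GRANT ALL PRIVILEGES ON CATALOG {catalog} TO {owner}",
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {reader}",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.gold TO {reader}",
        f"GRANT SELECT ON SCHEMA {catalog}.gold TO {reader}",
    ]


for stmt in project_grants("chat_catalog", "chat_engineering_team", "chat_analysts"):
    print(stmt + ";")
```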

#### Project-aware data access pattern

In [0]:
# Project-aware data access pattern

class DataAccessLayer:
    """Abstraction layer for project-aware data access"""
    
    def __init__(self, project: str):
        self.config = WorkspaceConfig.get_project_config(project)
        self.catalog = self.config['catalog']
        self.project = project
    
    def get_table_path(self, layer: str, table_name: str) -> str:
        """Generate fully qualified table name"""
        return f"{self.catalog}.{layer}.{table_name}"
    
    def read_table(self, layer: str, table_name: str):
        """Read table from specified layer"""
        table_path = self.get_table_path(layer, table_name)
        print(f"Reading from: {table_path}")
        return spark.table(table_path)
    
    def write_table(self, df, layer: str, table_name: str, mode: str = "overwrite"):
        """Write DataFrame to specified layer"""
        table_path = self.get_table_path(layer, table_name)
        print(f"Writing to: {table_path} (mode: {mode})")
        
        (df.write
            .format("delta")
            .mode(mode)
            .option("mergeSchema", "true")
            .saveAsTable(table_path)
        )
        
        return table_path
    
    def list_tables(self, layer: str):
        """List all tables in a specific layer"""
        return spark.sql(f"SHOW TABLES IN {self.catalog}.{layer}")

# Usage example for CHAT project
project = "chat"  # Change to your project: 'chat', 'selman', etc.
data_access = DataAccessLayer(project)

# Example table paths
print(f"Project: {project}")
print("Bronze layer:", data_access.get_table_path("bronze", "raw_events"))
print("Silver layer:", data_access.get_table_path("silver", "cleaned_events"))
print("Gold layer:", data_access.get_table_path("gold", "daily_metrics"))

# For working with multiple projects
print("\nMulti-project example:")
for proj in ["chat", "selman"]:
    dal = DataAccessLayer(proj)
    print(f"{proj}: {dal.get_table_path('gold', 'summary_metrics')}")

#### Project-based deployment and lifecycle management

In [0]:
# Project-based deployment and lifecycle management
# (no extra imports needed; relies on WorkspaceConfig and spark from earlier cells)

class ProjectDeploymentManager:
    """Manage notebook deployments and project lifecycle"""
    
    def __init__(self, project: str, target_layer: str = None):
        self.project = project
        self.target_layer = target_layer
        self.config = WorkspaceConfig.get_project_config(project)
    
    def validate_deployment(self) -> bool:
        """Pre-deployment validation checks"""
        checks = {
            "project_exists": self.project in WorkspaceConfig.PROJECTS,
            "catalog_accessible": self._check_catalog_access(),
            "layer_valid": self.target_layer in ["bronze", "silver", "gold", None]
        }
        
        print(f"Deployment Validation for Project: {self.project}")
        for check, result in checks.items():
            status = "✓" if result else "✗"
            print(f"  {status} {check}: {result}")
        
        return all(checks.values())
    
    def _check_catalog_access(self) -> bool:
        """Verify catalog accessibility"""
        try:
            catalog = self.config['catalog']
            spark.sql(f"USE CATALOG {catalog}")
            return True
        except Exception as e:
            print(f"    Error: {str(e)}")
            return False
    
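    def deploy_to_layer(self, notebook_path: str, layer: str):
        """Deploy notebook for specific data layer"""
        # Validate the layer passed to this call, not only self.target_layer
        WorkspaceConfig.get_layer_config(layer)  # Raises ValueError for an unknown layer
        if not self.validate_deployment():
            raise ValueError("Deployment validation failed")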
        
        print(f"\nDeploying {notebook_path}")
        print(f"  Project: {self.project}")
        print(f"  Catalog: {self.config['catalog']}")
        print(f"  Target Layer: {layer}")
        print("\n[Deployment would execute via Databricks CLI/API]")
    
    def promote_data(self, table_name: str, source_layer: str, target_layer: str):
        """Promote data between layers within project"""
        valid_promotions = [
            ("bronze", "silver"),
            ("silver", "gold")
        ]
        
        if (source_layer, target_layer) not in valid_promotions:
            raise ValueError(f"Invalid promotion path: {source_layer} -> {target_layer}")
        
        source_table = f"{self.config['catalog']}.{source_layer}.{table_name}"
        target_table = f"{self.config['catalog']}.{target_layer}.{table_name}"
        
        print(f"\nData Promotion:")
        print(f"  From: {source_table}")
        print(f"  To: {target_table}")
        print("  [Quality gates and validation would run here]")

# Example usage for CHAT project
project = "chat"  # Change to your project
deployment = ProjectDeploymentManager(project=project)
deployment.validate_deployment()

# Example: Promote data from bronze to silver
# deployment.promote_data("customer_data", "bronze", "silver")
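
The quality-gate placeholder in `promote_data` could be filled by a small decision function. The sketch below assumes check results arrive as a name-to-boolean mapping (for example, collected from a data quality test run); `can_promote` and the check names are hypothetical.

```python
VALID_PROMOTIONS = {("bronze", "silver"), ("silver", "gold")}


def can_promote(source_layer: str, target_layer: str, checks: dict[str, bool]) -> bool:
    """Return True only if the path is valid and every named check passed."""
    if (source_layer, target_layer) not in VALID_PROMOTIONS:
        raise ValueError(f"Invalid promotion path: {source_layer} -> {target_layer}")
    failed = [name for name, passed in checks.items() if not passed]
    for name in failed:
        print(f"  ✗ gate failed: {name}")
    return not failed


# Hypothetical check results for one table.
checks = {"not_empty": True, "no_nulls_id": True, "unique_id": False}
print(can_promote("bronze", "silver", checks))  # reports unique_id, then prints False
```

Keeping the gate separate from the write step means a failed check blocks promotion without leaving a partially written target table.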

#### Unit testing pattern for data pipelines

In [0]:
# Unit testing pattern for data pipelines

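from pyspark.sql import functions as F  # used by test_no_nulls; repeated so this cell runs on its own
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType
from datetime import datetime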

class DataQualityTests:
    """Data quality testing framework"""
    
    def __init__(self, df):
        self.df = df
        self.results = []
    
    def test_not_empty(self, test_name: str = "Not Empty"):
        """Test that DataFrame is not empty"""
        count = self.df.count()
        passed = count > 0
        self.results.append({"test": test_name, "passed": passed, "details": f"Row count: {count}"})
        return self
    
    def test_no_nulls(self, columns: list, test_name: str = "No Nulls"):
        """Test that specified columns have no null values"""
        for col in columns:
            null_count = self.df.filter(F.col(col).isNull()).count()
            passed = null_count == 0
            self.results.append({
                "test": f"{test_name} - {col}",
                "passed": passed,
                "details": f"Null count: {null_count}"
            })
        return self
    
    def test_unique(self, column: str, test_name: str = "Unique Values"):
        """Test that column has unique values"""
        total_count = self.df.count()
        distinct_count = self.df.select(column).distinct().count()
        passed = total_count == distinct_count
        self.results.append({
            "test": f"{test_name} - {column}",
            "passed": passed,
            "details": f"Total: {total_count}, Distinct: {distinct_count}"
        })
        return self
    
    def report(self):
        """Print test results"""
        print("\n" + "="*60)
        print("DATA QUALITY TEST RESULTS")
        print("="*60)
        
        passed_count = sum(1 for r in self.results if r["passed"])
        total_count = len(self.results)
        
        for result in self.results:
            status = "✓ PASS" if result["passed"] else "✗ FAIL"
            print(f"{status} | {result['test']}: {result['details']}")
        
        print("="*60)
        print(f"Results: {passed_count}/{total_count} tests passed")
        print("="*60 + "\n")
        
        return passed_count == total_count

# Example usage with sample data
sample_schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("created_at", TimestampType(), False)
])

sample_data = [
    (1, "Record A", datetime.now()),
    (2, "Record B", datetime.now()),
    (3, "Record C", datetime.now())
]

sample_df = spark.createDataFrame(sample_data, sample_schema)

# Run tests
tests = DataQualityTests(sample_df)
tests.test_not_empty().test_no_nulls(["id", "name"]).test_unique("id").report()

#### Pipeline monitoring and logging pattern

In [0]:
# Pipeline monitoring and logging pattern

import logging
from datetime import datetime

class PipelineMonitor:
    """Monitor pipeline execution and log metrics"""
    
    def __init__(self, pipeline_name: str, environment: str):
        self.pipeline_name = pipeline_name
        self.environment = environment
        self.start_time = datetime.now()
        self.metrics = {}
    
    def log_metric(self, metric_name: str, value):
        """Log a pipeline metric"""
        self.metrics[metric_name] = value
        print(f"[METRIC] {metric_name}: {value}")
    
    def log_stage(self, stage_name: str, status: str = "started"):
        """Log pipeline stage"""
        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        print(f"[{timestamp}] [{self.environment.upper()}] {stage_name}: {status}")
    
    def finalize(self, status: str = "success"):
        """Finalize monitoring and log summary"""
        end_time = datetime.now()
        duration = (end_time - self.start_time).total_seconds()
        
        print("\n" + "="*60)
        print(f"PIPELINE EXECUTION SUMMARY: {self.pipeline_name}")
        print("="*60)
        print(f"Environment: {self.environment}")
        print(f"Status: {status.upper()}")
        print(f"Duration: {duration:.2f} seconds")
        print(f"Start Time: {self.start_time.strftime('%Y-%m-%d %H:%M:%S')}")
        print(f"End Time: {end_time.strftime('%Y-%m-%d %H:%M:%S')}")
        
        if self.metrics:
            print("\nMetrics:")
            for metric, value in self.metrics.items():
                print(f"  - {metric}: {value}")
        
        print("="*60 + "\n")
        
        # In production, send metrics to monitoring system
        # self._send_to_monitoring_system()
    
    def _send_to_monitoring_system(self):
        """Send metrics to external monitoring (e.g., Datadog, CloudWatch)"""
        # Implementation would integrate with your monitoring platform
        pass

# Example usage
env = "dev"  # Assumed environment label; set this from a widget or job parameter in practice
monitor = PipelineMonitor("customer_etl", env)

monitor.log_stage("Data Ingestion", "started")
monitor.log_metric("records_ingested", 10000)
monitor.log_stage("Data Ingestion", "completed")

monitor.log_stage("Data Transformation", "started")
monitor.log_metric("records_transformed", 9850)
monitor.log_stage("Data Transformation", "completed")

monitor.log_stage("Data Quality Checks", "started")
monitor.log_metric("quality_score", 98.5)
monitor.log_stage("Data Quality Checks", "completed")

monitor.finalize(status="success")
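
Paired `log_stage(..., "started")` and `log_stage(..., "completed")` calls are easy to get out of sync when a stage raises. One possible alternative is a context manager that records status and duration automatically. The sketch below is independent of `PipelineMonitor`; `timed_stage` and the `metrics` dictionary are hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_stage(stage_name: str, metrics: dict):
    """Record the stage's final status and duration in `metrics`."""
    start = time.perf_counter()
    status = "completed"
    try:
        yield
    except Exception:
        status = "failed"
        raise  # Re-raise so the pipeline still sees the failure.
    finally:
        metrics[stage_name] = {
            "status": status,
            "seconds": round(time.perf_counter() - start, 3),
        }


metrics = {}
with timed_stage("Data Ingestion", metrics):
    records = list(range(10_000))  # Stand-in for real ingestion work.

print(metrics["Data Ingestion"]["status"])  # completed
```

Because the bookkeeping sits in `finally`, a failed stage is still recorded before the exception propagates.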