# ODIBI Databricks Setup Guide

**Phase 2C: Interactive setup for Databricks + Azure Key Vault**

This notebook helps you configure ODIBI in a Databricks environment with Azure Data Lake Storage and Key Vault authentication.

## What This Notebook Does

1. ✅ Validates your Databricks environment
2. ✅ Tests Azure Key Vault connectivity
3. ✅ Configures multiple ADLS connections in parallel (up to 3x faster)
4. ✅ Verifies your setup with a test pipeline
5. ✅ Provides troubleshooting tips

## Prerequisites

- Databricks cluster with runtime 12.2+ (Delta Lake support)
- ODIBI installed: `%pip install odibi[azure,spark]`
- Azure Key Vault with storage account keys stored
- Managed identity or service principal configured

---

## Step 1: Install ODIBI

Install ODIBI with Azure and Spark extras:

In [None]:
%pip install odibi[azure,spark] --quiet
dbutils.library.restartPython()

## Step 2: Validate Databricks Environment

Check that we're running in Databricks with Spark available:

In [None]:
from odibi.utils import validate_databricks_environment

env_info = validate_databricks_environment(verbose=True)

if not env_info["is_databricks"]:
    print("\n⚠️  WARNING: Not running in Databricks environment")
    print("This notebook is designed for Databricks. Some features may not work.")

if not env_info["spark_available"]:
    print("\n⚠️  WARNING: Spark session not available")
    print("Please start your cluster and retry.")

print("\n✓ Environment validation complete")
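
To see what such an environment check might involve, here is a minimal sketch. It relies on the fact that Databricks clusters set the `DATABRICKS_RUNTIME_VERSION` environment variable. The helper name `detect_databricks` is hypothetical and this is not the ODIBI implementation, which may check more (for example, Spark availability).

```python
import os


def detect_databricks(environ=None):
    """Return a small report about the runtime, based on environment variables only."""
    # Hypothetical helper: accepts a mapping so it can be tried outside Databricks.
    env = os.environ if environ is None else environ
    runtime = env.get("DATABRICKS_RUNTIME_VERSION")
    return {
        "is_databricks": runtime is not None,
        "runtime_version": runtime,
    }


print(detect_databricks({"DATABRICKS_RUNTIME_VERSION": "12.2"}))
print(detect_databricks({}))
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try from a laptop before moving to a cluster.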

## Step 3: Configure Your Connections

**Edit the configuration below** with your storage accounts and Key Vault details:

### Configuration Template

```python
connections_config = {
    "bronze": {
        "account": "mystorageaccount1",
        "container": "bronze",
        "auth_mode": "key_vault",
        "key_vault_name": "mykeyvault",
        "secret_name": "storage1-key",
    },
    "silver": {
        "account": "mystorageaccount2",
        "container": "silver",
        "auth_mode": "key_vault",
        "key_vault_name": "mykeyvault",
        "secret_name": "storage2-key",
    },
}
```

In [None]:
# 📝 EDIT THIS CONFIGURATION
connections_config = {
    "bronze": {
        "account": "YOUR_STORAGE_ACCOUNT_1",
        "container": "bronze",
        "auth_mode": "key_vault",
        "key_vault_name": "YOUR_KEY_VAULT_NAME",
        "secret_name": "YOUR_SECRET_NAME_1",
    },
    # Add more connections as needed
    # "silver": {...},
    # "gold": {...},
}

print(f"✓ Configured {len(connections_config)} connection(s)")
for name in connections_config:
    print(f"  - {name}")
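
Before creating connections, it can help to catch leftover placeholders and missing keys. The sketch below is one way to do that. The set of required keys is an assumption inferred from the template above, and `find_config_problems` is a hypothetical helper, not part of ODIBI.

```python
# Assumed from the template above; adjust if your auth mode needs other keys.
REQUIRED_KEYS = {"account", "container", "auth_mode"}
KEY_VAULT_KEYS = {"key_vault_name", "secret_name"}


def find_config_problems(config):
    """Return a list of readable problems; an empty list means the config looks complete."""
    problems = []
    for name, settings in config.items():
        required = set(REQUIRED_KEYS)
        if settings.get("auth_mode") == "key_vault":
            required |= KEY_VAULT_KEYS
        # Report missing keys in a stable (sorted) order.
        for key in sorted(required - set(settings)):
            problems.append(f"{name}: missing '{key}'")
        # Flag values that were never edited from the YOUR_... placeholders.
        for key, value in settings.items():
            if isinstance(value, str) and value.startswith("YOUR_"):
                problems.append(f"{name}: '{key}' still holds the placeholder {value}")
    return problems


example = {
    "bronze": {
        "account": "YOUR_STORAGE_ACCOUNT_1",
        "container": "bronze",
        "auth_mode": "key_vault",
    }
}
for problem in find_config_problems(example):
    print(problem)
```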

## Step 4: Create and Test Connections

Create ADLS connections from your configuration:

In [None]:
from odibi.connections import AzureADLS

# Create connections (validation deferred until secrets are fetched in Step 5)
connections = {}
for name, config in connections_config.items():
    connections[name] = AzureADLS(**config, validate=False)
    print(f"✓ Created connection: {name}")

print(f"\n✓ All {len(connections)} connections created")

## Step 5: Fetch Secrets in Parallel ⚡

**Performance boost!** Fetch all Key Vault secrets in parallel (up to 3x faster than sequential):

In [None]:
from odibi.utils import configure_connections_parallel
import time

print("🔄 Fetching Key Vault secrets in parallel...\n")
start_time = time.time()

connections, errors = configure_connections_parallel(
    connections,
    prefetch_secrets=True,
    max_workers=5,
    timeout=30.0,
    verbose=True,
)

elapsed = time.time() - start_time

if errors:
    print(f"\n⚠️  Encountered {len(errors)} error(s):")
    for error in errors:
        print(f"  ✗ {error}")
else:
    print(f"\n✓ All secrets fetched successfully in {elapsed:.2f}s")
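
To illustrate why parallel fetching helps, here is a minimal sketch of the general technique using a thread pool. It is not the ODIBI implementation of `configure_connections_parallel`; `fetch_all` and `fake_fetch` are hypothetical names, and the fake fetch only sleeps to stand in for a network round trip to Key Vault.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(names, fetch_one, max_workers=5, timeout=30.0):
    """Call fetch_one(name) for every name concurrently.

    Returns (results, errors): results maps name -> value, errors is a list of messages.
    If the batch exceeds `timeout`, as_completed raises a TimeoutError to the caller.
    """
    results, errors = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, name): name for name in names}
        for future in as_completed(futures, timeout=timeout):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # Collect the failure and keep going with the other names.
                errors.append(f"{name}: {exc}")
    return results, errors


def fake_fetch(name):
    """Stand-in for a Key Vault call: sleeps briefly and fails for one name."""
    time.sleep(0.1)
    if name == "broken":
        raise RuntimeError("secret not found")
    return f"key-for-{name}"


results, errors = fetch_all(["bronze", "silver", "broken"], fake_fetch)
print(results)
print(errors)
```

Because the calls are I/O-bound, threads overlap the waiting time, so the total is closer to the slowest single fetch than to the sum of all of them. Collecting errors instead of raising on the first one mirrors the `(connections, errors)` return shape used above.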

## Step 6: Configure Spark Engine

Create a Spark engine with your configured connections:

In [None]:
from odibi.engine import SparkEngine
from pyspark.sql import SparkSession

# Get active Spark session
spark = SparkSession.getActiveSession()

if not spark:
    print("⚠️  No active Spark session. Starting a new one...")
    spark = SparkSession.builder.appName("ODIBI-Setup").getOrCreate()

# Create Spark engine with connections
engine = SparkEngine(connections=connections, spark_session=spark)

print(f"\n✓ SparkEngine configured with {len(connections)} connection(s)")
print(f"  Spark version: {spark.version}")
print(f"  App name: {spark.sparkContext.appName}")
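
For context, account-key authentication to ADLS Gen2 from Spark is normally expressed through the Hadoop ABFS setting `fs.azure.account.key.<account>.dfs.core.windows.net`. The sketch below shows that setting; whether `SparkEngine` applies it this way internally is an assumption, and both helper names are hypothetical.

```python
def account_key_conf(account, storage_key):
    """Return the Hadoop ABFS setting that authenticates one storage account with its key."""
    return {f"fs.azure.account.key.{account}.dfs.core.windows.net": storage_key}


def apply_conf(spark, settings):
    """Apply each setting to an existing SparkSession passed in as `spark`."""
    for key, value in settings.items():
        spark.conf.set(key, value)


# Only the setting name is printed here; never print a real storage key.
print(list(account_key_conf("mystorageaccount1", "<storage-key>")))
```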

## Step 7: Test Your Setup

Run a simple test to verify everything works:

In [None]:
# Create sample data
import pandas as pd

test_data = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["Alice", "Bob", "Charlie"],
        "value": [100, 200, 300],
    }
)

print("📊 Sample data:")
print(test_data)

# Get first connection for testing
test_conn_name = list(connections.keys())[0]
test_conn = connections[test_conn_name]

test_path = "odibi_test/sample.parquet"
test_uri = test_conn.uri(test_path)

print(f"\n📝 Test path: {test_uri}")
print("\n🔄 Writing test data...")

try:
    # Write using engine
    engine.write(
        data=test_data,
        path=test_uri,
        format="parquet",
        mode="overwrite",
    )
    print("✓ Write successful")

    # Read back
    print("\n🔄 Reading test data back...")
    result_df = engine.read(test_uri, format="parquet")

    print("✓ Read successful")
    print("\n📊 Result:")
    result_df.show()

    print("\n🎉 SUCCESS! Your ODIBI setup is working correctly!")

except Exception as e:
    print(f"\n❌ Test failed: {e}")
    print("\nSee troubleshooting section below.")
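
If the test path printed above looks wrong, it helps to know the standard ADLS Gen2 URI shape: `abfss://<container>@<account>.dfs.core.windows.net/<path>`. The sketch below builds one by hand; that `test_conn.uri(...)` produces exactly this form is an assumption, and `abfss_uri` is a hypothetical helper.

```python
def abfss_uri(account, container, path):
    """Build an ABFS URI for ADLS Gen2 from its three parts."""
    # Strip a leading slash so the result never contains a double slash after the host.
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


print(abfss_uri("mystorageaccount1", "bronze", "odibi_test/sample.parquet"))
# abfss://bronze@mystorageaccount1.dfs.core.windows.net/odibi_test/sample.parquet
```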

## Step 8: Performance Comparison

Compare sequential vs parallel Key Vault fetching:

In [None]:
# Only run if you have 2+ Key Vault connections.
# Requires the Step 5 cell to have run (it imports `time` and `configure_connections_parallel`).
kv_connections = {
    name: conn
    for name, conn in connections.items()
    if hasattr(conn, "auth_mode") and conn.auth_mode == "key_vault"
}

if len(kv_connections) >= 2:
    print(f"📊 Comparing performance with {len(kv_connections)} Key Vault connections\n")

    # Clear cached keys
    for conn in kv_connections.values():
        conn._cached_key = None

    # Sequential fetch
    print("🐌 Sequential fetch...")
    start = time.time()
    for conn in kv_connections.values():
        _ = conn.get_storage_key()
    sequential_time = time.time() - start
    print(f"   Time: {sequential_time:.2f}s\n")

    # Clear cached keys again
    for conn in kv_connections.values():
        conn._cached_key = None

    # Parallel fetch
    print("⚡ Parallel fetch...")
    start = time.time()
    _, _ = configure_connections_parallel(kv_connections, verbose=False)
    parallel_time = time.time() - start
    print(f"   Time: {parallel_time:.2f}s\n")

    speedup = sequential_time / parallel_time
    print(f"🚀 Speedup: {speedup:.1f}x faster with parallel fetching!")
else:
    print(
        f"ℹ️  Need 2+ Key Vault connections for performance comparison (have {len(kv_connections)})"
    )
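
The timing pattern in the cell above can be factored into two small helpers. This is only a sketch with hypothetical names (`timed`, `speedup_ratio`). It uses `time.perf_counter`, which is better suited to measuring intervals than `time.time`, and guards the division in case the parallel run is too fast to measure.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup_ratio(sequential_seconds, parallel_seconds):
    """Return sequential / parallel, or None when the parallel time is not positive."""
    if parallel_seconds <= 0:
        return None
    return sequential_seconds / parallel_seconds


_, elapsed = timed(time.sleep, 0.05)
print(f"slept for about {elapsed:.2f}s")
print(speedup_ratio(3.0, 1.0))
```

A single measurement over a network is noisy, so repeating the comparison a few times gives a more trustworthy ratio.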

## Troubleshooting

### Common Issues

#### 1. Key Vault Access Denied
```
Error: (Forbidden) The user, group or application does not have secrets get permission
```

**Solution:**
- Ensure your Databricks cluster has a managed identity assigned
- Grant the managed identity "Get" permission on secrets in Key Vault
- Or use Azure CLI (for vaults that use access policies rather than Azure RBAC): `az keyvault set-policy --name <vault> --object-id <id> --secret-permissions get`
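
To separate a permissions problem from an ODIBI problem, you can try reading the secret directly with the Azure SDK. The sketch below uses `DefaultAzureCredential` and `SecretClient` from the `azure-identity` and `azure-keyvault-secrets` packages. The helper names are hypothetical, and the URL pattern assumes the Azure public cloud.

```python
def vault_url(key_vault_name):
    """Build the URL for a vault in the Azure public cloud."""
    return f"https://{key_vault_name}.vault.azure.net"


def can_read_secret(key_vault_name, secret_name):
    """Try to read one secret; return (True, None) or (False, error message)."""
    # Imported lazily so this cell loads even if the Azure packages are missing.
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    client = SecretClient(
        vault_url=vault_url(key_vault_name),
        credential=DefaultAzureCredential(),
    )
    try:
        client.get_secret(secret_name)  # The value is deliberately not returned or printed.
        return True, None
    except Exception as exc:  # Forbidden, not found, network errors, ...
        return False, f"{type(exc).__name__}: {exc}"


print(vault_url("mykeyvault"))
```

If `can_read_secret("mykeyvault", "storage1-key")` fails with a Forbidden error here too, the issue lies in the vault permissions rather than in the notebook configuration.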

#### 2. Timeout Errors
```
TimeoutError: Key Vault fetch timed out after 30s
```

**Solution:**
- Check network connectivity from Databricks to Azure
- Increase timeout: `configure_connections_parallel(..., timeout=60.0)`
- Verify Key Vault firewall settings
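
For intermittent timeouts, retrying with exponential backoff is a common complement to a longer timeout. The sketch below is generic Python, not an ODIBI feature. `retry` is a hypothetical helper, and it assumes the failure surfaces as the built-in `TimeoutError` shown in the message above.

```python
import time


def retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait after each TimeoutError."""
    for attempt in range(attempts):
        try:
            return func()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # Out of attempts: let the caller see the last error.
            sleep(base_delay * 2 ** attempt)
```

In the notebook this could wrap the Step 5 call, for example `retry(lambda: configure_connections_parallel(connections, timeout=60.0))`. The `sleep` parameter exists so the waits can be replaced in tests.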

#### 3. Storage Account Access Denied
```
Error: Operation returned an invalid status code 'Forbidden'
```

**Solution:**
- Verify the storage account key in Key Vault is correct
- Check storage account firewall allows Databricks access
- Ensure container exists

#### 4. Module Import Errors
```
ImportError: azure-identity not found
```

**Solution:**
- Re-run: `%pip install odibi[azure,spark]`
- Restart Python: `dbutils.library.restartPython()`
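
A quick way to see which dependencies are missing before reinstalling is to probe for them without importing. In the sketch below, `missing_modules` is a hypothetical helper. Of the modules listed in the example call, only `azure.identity` comes from the error above; the other two are assumptions about what the `azure` and `spark` extras provide.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of module names that cannot be located in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised when a parent package (e.g. "azure") is itself absent.
            found = False
        if not found:
            missing.append(name)
    return missing


print(missing_modules(["azure.identity", "azure.keyvault.secrets", "pyspark"]))
```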

---

### Getting Help

- **Documentation:** See `docs/` folder in the ODIBI repository
- **Examples:** Check `examples/template_full_adls.yaml`
- **Issues:** https://github.com/henryodibi11/Odibi/issues

---

## Next Steps

1. ✅ Create a full YAML pipeline configuration
2. ✅ See `examples/template_full_adls.yaml` for templates
3. ✅ Check out walkthroughs in `walkthroughs/` folder
4. ✅ Review `docs/SUPPORTED_FORMATS.md` for file format options

**Happy data engineering! 🚀**