# Phase 2A Walkthrough: Azure ADLS + Key Vault Authentication

**Purpose:** Test Phase 2A implementation with real Azure ADLS connections

**What we'll test:**
1. ✅ Direct key authentication (local dev)
2. ✅ Key Vault authentication (production pattern)
3. ✅ Multi-account connections
4. ✅ Reading CSV from ADLS
5. ✅ Writing Parquet to ADLS
6. ✅ Both PandasEngine and SparkEngine

**Prerequisites:**
- Azure storage account with some test data
- Storage account key OR Azure Key Vault access
- For Spark: PySpark installed (`pip install pyspark`)
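
Before running the cells below, it can help to confirm that the packages this walkthrough relies on are importable. The following is a minimal sketch: the helper name `missing_packages` is hypothetical (it is not part of ODIBI), and the module list is taken from the install line and imports used in this notebook.

```python
from importlib.util import find_spec


def missing_packages(modules):
    """Return the module names from `modules` that cannot be imported."""
    missing = []
    for name in modules:
        try:
            found = find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec raises if a parent package (e.g. "azure") is absent
            found = False
        if not found:
            missing.append(name)
    return missing


# Modules used in this walkthrough; pyspark is only needed for the Spark test
required = ["pandas", "adlfs", "azure.identity", "azure.keyvault.secrets"]
optional = ["pyspark"]

print("Missing required:", missing_packages(required))
print("Missing optional:", missing_packages(optional))
```

An empty list for the required modules means the Pandas-based tests should be able to import everything they need.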

## Setup: Install Dependencies

In [None]:
# Make the local ODIBI checkout importable (adjust the path to your own clone)
import sys
sys.path.insert(0, r'C:\Users\hodibi\OneDrive - Ingredion\Desktop\Repos\Odibi')

# Verify it worked
import odibi
print(f"✅ ODIBI loaded from: {odibi.__file__}")

# Install the Azure dependencies used in this walkthrough
%pip install azure-identity azure-keyvault-secrets adlfs

## Configuration

**⚠️ IMPORTANT:** Replace these values with your actual Azure resources.

In [None]:
# ===== CONFIGURE YOUR AZURE RESOURCES HERE =====

# Storage Account 1 (Bronze/Source)
STORAGE_ACCOUNT_1 = "mystorageaccount1"  # Replace with your storage account name
CONTAINER_1 = "bronze"  # Replace with your container name

# Storage Account 2 (Silver/Target) - Optional, can use same account
STORAGE_ACCOUNT_2 = "mystorageaccount2"  # Replace or set same as STORAGE_ACCOUNT_1
CONTAINER_2 = "silver"  # Replace with your container name

# Authentication Mode: Choose one
AUTH_MODE = "direct_key"  # or "key_vault"

# For direct_key mode: Set your storage account keys
import os
ACCOUNT_KEY_1 = os.getenv("STORAGE_KEY_1", "your-storage-key-1-here")
ACCOUNT_KEY_2 = os.getenv("STORAGE_KEY_2", "your-storage-key-2-here")

# For key_vault mode: Set your Key Vault details
KEY_VAULT_NAME = "your-keyvault-name"
SECRET_NAME_1 = "bronze-storage-key"
SECRET_NAME_2 = "silver-storage-key"

# Test data paths
TEST_CSV_PATH = "test/sample_data.csv"  # Path to existing CSV in your storage
OUTPUT_PARQUET_PATH = "test/output/sample_data.parquet"  # Where to write output

print(f"✅ Configuration loaded")
print(f"   Storage Account 1: {STORAGE_ACCOUNT_1}/{CONTAINER_1}")
print(f"   Storage Account 2: {STORAGE_ACCOUNT_2}/{CONTAINER_2}")
print(f"   Auth Mode: {AUTH_MODE}")

✅ Configuration loaded
   Storage Account 1: ingrglobaldigitalopsteam/example-container
   Storage Account 2: ingrglobaldigitalopsteam/example-container
   Auth Mode: key_vault
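
The configuration cell falls back to placeholder strings when `STORAGE_KEY_1` or `STORAGE_KEY_2` is unset, so a missing key only surfaces at the first ADLS call. One possible way to fail earlier is sketched below; the helper `require_env` and its placeholder check are hypothetical, not part of ODIBI.

```python
import os


def require_env(name, placeholder_prefix="your-"):
    """Return the value of environment variable `name`, or raise if it is
    unset, empty, or still looks like a placeholder."""
    value = os.getenv(name, "")
    if not value or value.startswith(placeholder_prefix):
        raise RuntimeError(
            f"Environment variable {name} is not set to a real value"
        )
    return value


# Usage sketch for direct_key mode (variable names match the cell above):
# ACCOUNT_KEY_1 = require_env("STORAGE_KEY_1")
# ACCOUNT_KEY_2 = require_env("STORAGE_KEY_2")
```

Reading keys from the environment also keeps them out of the notebook file itself.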


## Test 1: Create Sample Test Data Locally

First, let's create some sample data to work with

In [None]:
import pandas as pd
from datetime import datetime, timedelta
import numpy as np

# Create sample data
np.random.seed(42)
dates = [datetime(2024, 1, 1) + timedelta(days=i) for i in range(100)]
sample_data = pd.DataFrame({
    'date': dates,
    'product': np.random.choice(['Widget A', 'Widget B', 'Widget C'], 100),
    'quantity': np.random.randint(1, 100, 100),
    'price': np.round(np.random.uniform(10, 500, 100), 2),
    'customer': [f'Customer_{i%10}' for i in range(100)]
})

sample_data['amount'] = sample_data['quantity'] * sample_data['price']

print(f"✅ Created sample data with {len(sample_data)} rows")
sample_data.head()

## Test 2: Direct Key Authentication - Write Test Data to ADLS

Upload our sample data to ADLS using direct key authentication

In [16]:
from odibi.connections.azure_adls import AzureADLS

# Create connection with direct key
bronze_conn = AzureADLS(
    account=STORAGE_ACCOUNT_1,
    container=CONTAINER_1,
    auth_mode="direct_key",
    account_key=ACCOUNT_KEY_1
)

print(f"✅ Created Bronze connection: {bronze_conn.account}/{bronze_conn.container}")

# Get the full ADLS URI
test_csv_uri = bronze_conn.uri(TEST_CSV_PATH)
print(f"   Writing to: {test_csv_uri}")

# Write sample data to ADLS
storage_options = bronze_conn.pandas_storage_options()
sample_data.to_csv(test_csv_uri, index=False, storage_options=storage_options)

print(f"✅ Successfully wrote {len(sample_data)} rows to ADLS")

✅ Created Bronze connection: ingrglobaldigitalopsteam/example-container
   Writing to: abfss://example-container@ingrglobaldigitalopsteam.dfs.core.windows.net/test/sample_data.csv
✅ Successfully wrote 100 rows to ADLS
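
The URI printed above follows the `abfss://<container>@<account>.dfs.core.windows.net/<path>` scheme. As an illustration of that format only (this is not ODIBI's `uri()` implementation, which is not shown in this notebook), a hypothetical helper could assemble it like this:

```python
def build_abfss_uri(account, container, path):
    """Assemble an ABFS (secure) URI for a path inside an ADLS Gen2 container."""
    # Strip a leading slash so the result never contains a double slash
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


print(build_abfss_uri("ingrglobaldigitalopsteam", "example-container", "test/sample_data.csv"))
```

Note that the container comes before the `@` and the account after it, which is easy to get backwards when building these URIs by hand.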


## Test 3: PandasEngine - Read from ADLS

Test reading the CSV we just uploaded using PandasEngine

In [17]:
from odibi.engine.pandas_engine import PandasEngine

# Create Pandas engine
pandas_engine = PandasEngine()

# Read from ADLS
df_read = pandas_engine.read(
    connection=bronze_conn,
    format="csv",
    path=TEST_CSV_PATH
)

print(f"✅ Successfully read {len(df_read)} rows from ADLS")
print(f"   Columns: {list(df_read.columns)}")
df_read.head()

✅ Successfully read 100 rows from ADLS
   Columns: ['date', 'product', 'quantity', 'price', 'customer', 'amount']


|   | date | product | quantity | price | customer | amount |
|---|------|---------|----------|-------|----------|--------|
| 0 | 2024-01-01 | Widget C | 8 | 188.18 | Customer_0 | 1505.44 |
| 1 | 2024-01-02 | Widget A | 88 | 486.17 | Customer_1 | 42782.96 |
| 2 | 2024-01-03 | Widget C | 63 | 481.6 | Customer_2 | 30340.8 |
| 3 | 2024-01-04 | Widget C | 11 | 133.37 | Customer_3 | 1467.07 |
| 4 | 2024-01-05 | Widget A | 81 | 253.65 | Customer_4 | 20545.65 |


## Test 4: PandasEngine - Write Parquet to Different Account

Test multi-account support by writing to a different storage account

In [18]:
# Create second connection (silver account)
silver_conn = AzureADLS(
    account=STORAGE_ACCOUNT_2,
    container=CONTAINER_2,
    auth_mode="direct_key",
    account_key=ACCOUNT_KEY_2
)

print(f"✅ Created Silver connection: {silver_conn.account}/{silver_conn.container}")

# Write to silver as Parquet
pandas_engine.write(
    df=df_read,
    connection=silver_conn,
    format="parquet",
    path=OUTPUT_PARQUET_PATH,
    mode="overwrite"
)

output_uri = silver_conn.uri(OUTPUT_PARQUET_PATH)
print(f"✅ Successfully wrote Parquet to: {output_uri}")

✅ Created Silver connection: ingrglobaldigitalopsteam/example-container
✅ Successfully wrote Parquet to: abfss://example-container@ingrglobaldigitalopsteam.dfs.core.windows.net/test/output/sample_data.parquet


## Test 5: Verify Multi-Account Write

Read back the Parquet file to verify the multi-account write worked

In [19]:
# Read back from silver account
df_parquet = pandas_engine.read(
    connection=silver_conn,
    format="parquet",
    path=OUTPUT_PARQUET_PATH
)

print(f"✅ Successfully read {len(df_parquet)} rows from Silver account")
print(f"   Original rows: {len(sample_data)}")
print(f"   Parquet rows: {len(df_parquet)}")
assert len(df_parquet) == len(sample_data), "Row count mismatch!"
print("✅ Row counts match - multi-account write successful!")

✅ Successfully read 100 rows from Silver account
   Original rows: 100
   Parquet rows: 100
✅ Row counts match - multi-account write successful!
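
A matching row count does not guarantee that the values survived the round trip. A stricter check compares columns, dtypes, and values. The sketch below uses pandas' own `assert_frame_equal`; the wrapper name `frames_match` is hypothetical, and the small frames stand in for `df_read` and `df_parquet`.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def frames_match(left, right):
    """Return True if both frames have the same columns, dtypes, and values."""
    try:
        assert_frame_equal(
            left.reset_index(drop=True),
            right.reset_index(drop=True),
        )
        return True
    except AssertionError:
        return False


# Small in-memory stand-ins for df_read and df_parquet
a = pd.DataFrame({"quantity": [8, 88], "price": [188.18, 486.17]})
b = a.copy()
c = a.assign(quantity=[8, 89])

print(frames_match(a, b))
print(frames_match(a, c))
```

In this walkthrough the natural comparison would be `frames_match(df_read, df_parquet)`, since `df_read` is the frame that was written to Parquet. Comparing against `sample_data` directly may fail on the `date` column, because a CSV read without date parsing returns strings rather than datetimes.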


## Test 6: Key Vault Authentication (Optional)

Test Key Vault authentication mode (requires Azure CLI login or managed identity)

In [None]:
# Uncomment to test Key Vault mode
# NOTE: Requires Azure CLI auth or running in Databricks with managed identity

# try:
#     bronze_kv = AzureADLS(
#         account=STORAGE_ACCOUNT_1,
#         container=CONTAINER_1,
#         auth_mode="key_vault",
#         key_vault_name=KEY_VAULT_NAME,
#         secret_name=SECRET_NAME_1
#     )
#     
#     # Try to fetch key (this will use DefaultAzureCredential)
#     key = bronze_kv.get_storage_key()
#     print(f"✅ Successfully retrieved key from Key Vault: {KEY_VAULT_NAME}")
#     print(f"   Key length: {len(key)} characters")
#
#     # Test read with Key Vault auth
#     df_kv = pandas_engine.read(
#         connection=bronze_kv,
#         format="csv",
#         path=TEST_CSV_PATH
#     )
#     print(f"✅ Successfully read {len(df_kv)} rows using Key Vault auth")
#
# except Exception as e:
#     print(f"❌ Key Vault test failed: {e}")
#     print("   Make sure you're authenticated (az login) or running in Databricks")

print("ℹ️  Key Vault test commented out - uncomment to test")
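
For reference, retrieving a secret with the Azure SDK typically looks like the sketch below. `DefaultAzureCredential` and `SecretClient` are the real classes from `azure-identity` and `azure-keyvault-secrets`. The function names `vault_url` and `fetch_storage_key` and the optional `client` parameter are hypothetical, and whether ODIBI's `get_storage_key()` works this way internally is an assumption.

```python
def vault_url(key_vault_name):
    """Build the URL of a Key Vault in the Azure public cloud."""
    return f"https://{key_vault_name}.vault.azure.net"


def fetch_storage_key(key_vault_name, secret_name, client=None):
    """Fetch a secret value from Key Vault.

    `client` is injectable so the function can be exercised without Azure;
    when omitted, a SecretClient is created with DefaultAzureCredential.
    """
    if client is None:
        # Imported lazily so this code loads without the Azure SDK installed
        from azure.identity import DefaultAzureCredential
        from azure.keyvault.secrets import SecretClient

        client = SecretClient(
            vault_url=vault_url(key_vault_name),
            credential=DefaultAzureCredential(),
        )
    return client.get_secret(secret_name).value
```

`DefaultAzureCredential` tries several sources in turn (environment variables, managed identity, Azure CLI login, and others), which is why the same code can work both locally after `az login` and in a hosted environment with a managed identity.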

## Test 7: SparkEngine with Multi-Account (Optional)

Test Spark engine with multiple storage accounts configured upfront

In [None]:
# Uncomment to test Spark engine
# NOTE: Requires PySpark installation and Java

# try:
#     from odibi.engine.spark_engine import SparkEngine
#     
#     # Create connections dict
#     connections = {
#         'bronze': bronze_conn,
#         'silver': silver_conn
#     }
#     
#     # Create Spark engine (will configure all connections)
#     spark_engine = SparkEngine(connections=connections)
#     print("✅ SparkEngine created with multi-account configuration")
#     
#     # Read CSV with Spark
#     spark_df = spark_engine.read(
#         connection=bronze_conn,
#         format="csv",
#         path=TEST_CSV_PATH,
#         options={"header": "true", "inferSchema": "true"}
#     )
#     
#     print(f"✅ Read {spark_df.count()} rows with Spark")
#     spark_df.show(5)
#     
#     # Write to silver with Spark
#     spark_engine.write(
#         df=spark_df,
#         connection=silver_conn,
#         format="parquet",
#         path="test/output/spark_output.parquet",
#         mode="overwrite"
#     )
#     print("✅ Successfully wrote with Spark to different account")
#
# except ImportError:
#     print("⚠️  PySpark not installed - skipping Spark tests")
# except Exception as e:
#     print(f"❌ Spark test failed: {e}")

print("ℹ️  Spark test commented out - uncomment to test")
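
With account-key authentication, the Hadoop ABFS driver used by Spark reads each account's key from a setting named `fs.azure.account.key.<account>.dfs.core.windows.net`, so several accounts can be registered before any read or write. The sketch below only builds those settings as a dictionary. The helper name and input mapping are hypothetical, and how `SparkEngine` applies its connections internally is not shown in this notebook.

```python
def abfs_key_settings(account_keys):
    """Map {account_name: account_key} to Hadoop ABFS account-key settings."""
    return {
        f"fs.azure.account.key.{account}.dfs.core.windows.net": key
        for account, key in account_keys.items()
    }


settings = abfs_key_settings({
    "mystorageaccount1": "<key-1>",
    "mystorageaccount2": "<key-2>",
})
for name in settings:
    print(name)

# With an active SparkSession, each entry could then be applied with
# spark.conf.set(name, value).
```

Because every setting name embeds the account, the entries for different accounts never collide, which is what makes a single Spark session usable across bronze and silver storage.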

## Test 8: Validation Tests

Test that validation works correctly

In [20]:
# Test 1: Missing key_vault_name should fail
try:
    bad_conn = AzureADLS(
        account="test",
        container="test",
        auth_mode="key_vault",
        secret_name="test-secret"
        # Missing key_vault_name
    )
    print("❌ Should have raised ValueError")
except ValueError as e:
    print(f"✅ Validation caught missing key_vault_name: {e}")

# Test 2: Missing account_key should fail
try:
    bad_conn = AzureADLS(
        account="test",
        container="test",
        auth_mode="direct_key"
        # Missing account_key
    )
    print("❌ Should have raised ValueError")
except ValueError as e:
    print(f"✅ Validation caught missing account_key: {e}")

# Test 3: Invalid auth_mode should fail
try:
    bad_conn = AzureADLS(
        account="test",
        container="test",
        auth_mode="invalid_mode"
    )
    print("❌ Should have raised ValueError")
except ValueError as e:
    print(f"✅ Validation caught invalid auth_mode: {e}")

print("\n✅ All validation tests passed!")

✅ Validation caught missing key_vault_name: key_vault mode requires 'key_vault_name' and 'secret_name' for connection to test/test
✅ Validation caught missing account_key: direct_key mode requires 'account_key' for connection to test/test
✅ Validation caught invalid auth_mode: Unsupported auth_mode: 'invalid_mode'. Use 'key_vault' or 'direct_key'.

✅ All validation tests passed!
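
The error messages above suggest validation rules along the following lines. This is a sketch reconstructed from the printed messages, not ODIBI's source code; the function name `validate_auth` is hypothetical.

```python
def validate_auth(account, container, auth_mode,
                  account_key=None, key_vault_name=None, secret_name=None):
    """Raise ValueError if the arguments are inconsistent with auth_mode."""
    target = f"{account}/{container}"
    if auth_mode == "key_vault":
        if not (key_vault_name and secret_name):
            raise ValueError(
                "key_vault mode requires 'key_vault_name' and 'secret_name' "
                f"for connection to {target}"
            )
    elif auth_mode == "direct_key":
        if not account_key:
            raise ValueError(
                f"direct_key mode requires 'account_key' for connection to {target}"
            )
    else:
        raise ValueError(
            f"Unsupported auth_mode: '{auth_mode}'. Use 'key_vault' or 'direct_key'."
        )


# A valid combination passes silently
validate_auth("test", "test", "direct_key", account_key="abc")
```

Checking the arguments at construction time means a misconfigured connection fails before any network call is attempted.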


## Summary & Cleanup

In [21]:
print("="*60)
print("Phase 2A Walkthrough - Test Summary")
print("="*60)
print("")
print("✅ Direct key authentication - PASSED")
print("✅ PandasEngine read from ADLS - PASSED")
print("✅ Multi-account connection - PASSED")
print("✅ Write Parquet to different account - PASSED")
print("✅ Validation tests - PASSED")
print("")
print("Optional tests (commented out):")
print("   - Key Vault authentication")
print("   - SparkEngine with multi-account")
print("")
print("Files created in ADLS:")
print(f"   - {bronze_conn.uri(TEST_CSV_PATH)}")
print(f"   - {silver_conn.uri(OUTPUT_PARQUET_PATH)}")
print("")
print("🎉 Phase 2A implementation validated with real ADLS connections!")
print("="*60)

Phase 2A Walkthrough - Test Summary

✅ Direct key authentication - PASSED
✅ PandasEngine read from ADLS - PASSED
✅ Multi-account connection - PASSED
✅ Write Parquet to different account - PASSED
✅ Validation tests - PASSED

Optional tests (commented out):
   - Key Vault authentication
   - SparkEngine with multi-account

Files created in ADLS:
   - abfss://example-container@ingrglobaldigitalopsteam.dfs.core.windows.net/test/sample_data.csv
   - abfss://example-container@ingrglobaldigitalopsteam.dfs.core.windows.net/test/output/sample_data.parquet

🎉 Phase 2A implementation validated with real ADLS connections!


## Optional: Cleanup Test Data

Uncomment to delete test files from ADLS

In [22]:
# Uncomment to cleanup test data
import adlfs

fs = adlfs.AzureBlobFileSystem(**bronze_conn.pandas_storage_options())
try:
    fs.rm(bronze_conn.uri(TEST_CSV_PATH))
    print(f"✅ Deleted: {bronze_conn.uri(TEST_CSV_PATH)}")
except Exception as exc:  # not a bare except, so Ctrl+C still interrupts
    print(f"File not found or already deleted: {exc}")

fs2 = adlfs.AzureBlobFileSystem(**silver_conn.pandas_storage_options())
try:
    fs2.rm(silver_conn.uri(OUTPUT_PARQUET_PATH), recursive=True)
    print(f"✅ Deleted: {silver_conn.uri(OUTPUT_PARQUET_PATH)}")
except Exception as exc:  # not a bare except, so Ctrl+C still interrupts
    print(f"File not found or already deleted: {exc}")

print("ℹ️  Cleanup commented out - uncomment to delete test files")

✅ Deleted: abfss://example-container@ingrglobaldigitalopsteam.dfs.core.windows.net/test/sample_data.csv
✅ Deleted: abfss://example-container@ingrglobaldigitalopsteam.dfs.core.windows.net/test/output/sample_data.parquet
ℹ️  Cleanup commented out - uncomment to delete test files
