## Architecture Context: Databricks + ADLS Gen2 (Azure Gov) Integration

This notebook is part of a workflow to demonstrate:

✅ How Databricks integrates with Azure Data Lake Storage (ADLS) Gen2 on Azure Government cloud  
✅ Authentication with a storage account key, preferably loaded from Databricks Secrets (optionally backed by Azure Key Vault) rather than hardcoded  
✅ How Spark reads and lists data directly from ADLS Gen2 using the `abfss://` protocol and the `.dfs.core.usgovcloudapi.net` endpoint
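
As a rough illustration of the addressing scheme in the last point, the sketch below assembles an `abfss://` URI and the matching account-key config name for a given endpoint suffix. The helper names (`abfss_uri`, `account_key_conf`) and the two suffix constants are hypothetical conveniences for this notebook, not part of any Databricks API.

```python
# Endpoint suffixes for the ADLS Gen2 (dfs) endpoint in each cloud.
GOV_SUFFIX = "dfs.core.usgovcloudapi.net"    # Azure Government
COMMERCIAL_SUFFIX = "dfs.core.windows.net"   # Azure commercial, shown for contrast


def abfss_uri(container: str, account: str, path: str = "", suffix: str = GOV_SUFFIX) -> str:
    """Build a URI of the form abfss://<container>@<account>.<suffix>/<path>."""
    return f"abfss://{container}@{account}.{suffix}/{path.lstrip('/')}"


def account_key_conf(account: str, suffix: str = GOV_SUFFIX) -> str:
    """Name of the Spark/Hadoop config entry that holds the account key."""
    return f"fs.azure.account.key.{account}.{suffix}"


print(abfss_uri("araolibraryloadtest", "araostorage", "test"))
print(account_key_conf("araostorage"))
```

Keeping the suffix in one place makes it harder to mix a commercial endpoint into a Gov workspace by accident.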

### Key Points:
- **ADLS Gen2** provides scalable, hierarchical storage for big data workloads.
- **Azure Government Cloud** uses dedicated `.usgovcloudapi.net` DNS domains.
- **Databricks Secrets** provide a secure way to manage and access sensitive credentials.
- **Spark on Databricks** can natively connect to ADLS via the Hadoop Azure (ABFS) connector, using account access keys or OAuth 2.0 (for example, with a service principal).
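
The cells in this notebook only exercise access keys. Since the last point also mentions OAuth and service principals, here is a hedged sketch of the ABFS settings such a setup typically involves. The function name `oauth_confs` and every placeholder value are hypothetical; `login.microsoftonline.us` is assumed as the Azure Government sign-in authority and should be verified for your tenant.

```python
def oauth_confs(account: str, client_id: str, client_secret: str, tenant_id: str,
                suffix: str = "dfs.core.usgovcloudapi.net",
                authority: str = "https://login.microsoftonline.us") -> dict:
    """Return ABFS OAuth (client-credentials) settings keyed by config name."""
    host = f"{account}.{suffix}"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": client_id,
        f"fs.azure.account.oauth2.client.secret.{host}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"{authority}/{tenant_id}/oauth2/token",
    }


# Placeholder values only; in a notebook the client secret would come from dbutils.secrets.get(...).
confs = oauth_confs("araostorage", "<client-id>", "<client-secret>", "<tenant-id>")
for name in confs:
    print(name)  # print the config names only, never the secret value
```

In a notebook, each pair would then be applied with `spark.conf.set(name, value)`.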

## Accessing ADLS Gen2 in Azure Government with Hardcoded Storage Account Key

This notebook section demonstrates how to:

✅ Set up Spark configuration to access an Azure Data Lake Storage (ADLS) Gen2 container  
✅ Use the **storage account access key** directly (hardcoded)  
✅ List the contents of a specific folder (`test/`) inside the ADLS container  
✅ Verify available secret scopes using `dbutils.secrets.listScopes()`

⚠ **Warning:**  
For production use, avoid hardcoding sensitive keys directly in notebooks.  
Instead, store them securely in Databricks Secrets or Azure Key Vault.


In [0]:
# Replace these with your actual details
storage_account_name = "araostorage"
container_name = "araolibraryloadtest"
storage_account_key = "YOUR_ACCESS_KEY"  # ⚠ Be careful — plain text!

# Configure Spark to use the access key (Azure Gov cloud domain)
spark.conf.set(
    f"fs.azure.account.key.{storage_account_name}.dfs.core.usgovcloudapi.net",
    storage_account_key
)

# Define the ABFSS path
path = f"abfss://{container_name}@{storage_account_name}.dfs.core.usgovcloudapi.net/test"

# Test listing the folder contents
try:
    files = dbutils.fs.ls(path)
    for file in files:
        print(f"✅ Name: {file.name}, Size: {file.size}, Path: {file.path}")
except Exception as e:
    print(f"❌ Error accessing path: {e}")



dbutils.secrets.listScopes()

✅ Name: Bridging the Digital Divide.pdf, Size: 61282, Path: abfss://araolibraryloadtest@araostorage.dfs.core.usgovcloudapi.net/test/Bridging the Digital Divide.pdf


[SecretScope(name='test-scope')]

## Accessing ADLS Gen2 in Azure Government with Databricks Secrets

This notebook section demonstrates how to:

✅ Securely configure Spark to access Azure Data Lake Storage (ADLS) Gen2  
✅ Retrieve the **storage account access key** from Databricks Secrets  
✅ Apply the key to Spark configs for authentication  
✅ List the contents of a specific folder (`test/`) inside the ADLS container

⚠ **Best Practice:**  
Using Databricks Secrets ensures sensitive credentials are not exposed in plaintext within notebooks or logs.

## Setting Up Databricks Secrets and Accessing ADLS Gen2 (Azure Gov)

This section explains how to securely configure your Databricks environment to access an Azure Data Lake Storage (ADLS) Gen2 container **without hardcoding sensitive credentials**.

---

### Step 1️⃣: Configure the Databricks CLI

Before you can create or manage secrets, you must set up the Databricks CLI.

Run this in your terminal:

```
databricks configure --token
```

✅ You will be prompted to enter:
- **Databricks Host** → e.g., `https://<workspace>.azuredatabricks.net` (commercial) or `https://<workspace>.databricks.azure.us` (Azure Gov)  
- **Personal Access Token (PAT)** → Generate this from the Databricks UI → User Settings → Access Tokens

This will create a `~/.databrickscfg` file to let the CLI communicate with your Databricks workspace.
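
To make the role of that file concrete, the sketch below parses a sample in the INI layout the CLI uses and checks that a profile carries both a host and a token. The helper `profile_is_complete` and the sample values are hypothetical; to check a real setup, the text would be read from `~/.databrickscfg` instead of the inline sample.

```python
import configparser

# Sample contents in the INI layout the CLI writes; host and token are placeholders.
SAMPLE_CFG = """
[DEFAULT]
host = https://<workspace-url>
token = <personal-access-token>
"""


def profile_is_complete(cfg_text: str, profile: str = "DEFAULT") -> bool:
    """Return True if the profile defines both a host and a token."""
    # interpolation=None so a '%' inside a value is not treated as a placeholder.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    # DEFAULT is special in configparser: always present, never listed as a section.
    if profile != "DEFAULT" and not parser.has_section(profile):
        return False
    section = parser[profile]
    return bool(section.get("host")) and bool(section.get("token"))


print(profile_is_complete(SAMPLE_CFG))
```

A check like this only confirms the file is well-formed; it does not confirm that the token is valid for the workspace.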

---

### Step 2️⃣: Use the Script Below to Create the Secret

This shell script:

✅ Checks whether the secret scope exists and creates it if missing  
✅ Reports whether the secret key already exists (the value is written either way)  
✅ Stores the ADLS storage account key in Databricks Secrets

```
#!/bin/bash

# ------------------------------------------------------------
# Databricks Secret Setup Script (Interactive + Param-Driven)
# This script:
# ✅ Accepts optional CLI parameters for scope, key name, and secret value
# ✅ If parameters are missing, prompts the user interactively
# ✅ Checks if the scope exists (creates if missing)
# ✅ Checks if the key exists (updates if present, adds if missing)
# ✅ Stores or updates the secret securely
#
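# Usage:
# ./setup_secret.sh <scope_name> <secret_key_name> [secret_value]
#
# Example (placeholder value; omit the last argument to be prompted instead,
# which keeps the secret out of shell history and process listings):
# ./setup_secret.sh adls-secrets adls-access-key AbC1234xyz==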
# ------------------------------------------------------------

# Get parameters or prompt interactively
SCOPE_NAME=$1
SECRET_KEY_NAME=$2
SECRET_VALUE=$3

if [ -z "$SCOPE_NAME" ]; then
    read -p "Enter the secret scope name: " SCOPE_NAME
fi

if [ -z "$SECRET_KEY_NAME" ]; then
    read -p "Enter the secret key name: " SECRET_KEY_NAME
fi

if [ -z "$SECRET_VALUE" ]; then
    read -s -p "Enter the secret value (input hidden): " SECRET_VALUE
    echo
fi

# Check if scope exists
echo "🔍 Checking if secret scope '$SCOPE_NAME' exists..."
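# Match the scope name exactly (assumes it is the first column of the CLI's table output)
if databricks secrets list-scopes | awk -v s="$SCOPE_NAME" '$1 == s { found = 1 } END { exit !found }'; then
    echo "✅ Secret scope '$SCOPE_NAME' already exists."
else
    echo "⚙️ Creating secret scope '$SCOPE_NAME'..."
    databricks secrets create-scope "$SCOPE_NAME" || { echo "❌ Failed to create scope '$SCOPE_NAME'." >&2; exit 1; }
    echo "✅ Secret scope '$SCOPE_NAME' created."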
fi

# Check if secret key exists
if databricks secrets list-secrets "$SCOPE_NAME" | awk -v k="$SECRET_KEY_NAME" '$1 == k { found = 1 } END { exit !found }'; then
    echo "♻️ Secret key '$SECRET_KEY_NAME' already exists — updating it."
else
    echo "➕ Secret key '$SECRET_KEY_NAME' does not exist — creating it."
fi

# Store or update the secret (printf writes the value without a trailing newline)
printf '%s' "$SECRET_VALUE" | databricks secrets put-secret "$SCOPE_NAME" "$SECRET_KEY_NAME"

# Final list
echo "📋 Final list of secrets in scope '$SCOPE_NAME':"
databricks secrets list-secrets "$SCOPE_NAME"

echo "🚀 Done! You can now access the secret in your Databricks notebooks using dbutils."
```

---

### Step 3️⃣: Use the Secret in Your Notebook

In your notebook, use the following approach to securely access ADLS:

```
# Load the storage account key from Databricks Secrets
storage_account = "araostorage"
container_name = "araolibraryloadtest"
scope = "adls-secrets"
key_name = "adls-access-key"

# Fetch and clean the key
storage_account_key = dbutils.secrets.get(scope=scope, key=key_name).strip()

# Apply Spark config for Azure Gov
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.usgovcloudapi.net",
    storage_account_key
)

# List contents inside the container or specific folder
try:
    files = dbutils.fs.ls(f"abfss://{container_name}@{storage_account}.dfs.core.usgovcloudapi.net/")
    for file in files:
        print(f"✅ Name: {file.name}, Size: {file.size} bytes, Path: {file.path}")
except Exception as e:
    print(f"❌ Error accessing ADLS: {e}")
```

---

### ⚠ Important Best Practices

- Never print or log the full secret value directly in notebooks.
- Always use Databricks Secrets or Key Vault integration in production.
- For Azure Government, ensure you are using the `.dfs.core.usgovcloudapi.net` endpoint.

Once the above steps are complete, you will have:

✅ A secure, reusable secret  
✅ Spark configs wired to access ADLS Gen2  
✅ A tested path to list and process files in your Azure Gov environment

In [0]:
# Define the scope and key
scope = "adls-secrets"
key_name = "adls-access-key"

# Fetch the secret value
secret_value = dbutils.secrets.get(scope=scope, key=key_name)

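# ⚠ Debugging only: Databricks redacts secret values in notebook output, so this prints [REDACTED].
# Do not rely on redaction; transformed copies of the value (e.g. split or encoded) may not be caught.
print(f"🔍 Retrieved secret value: {secret_value}")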

🔍 Retrieved secret value: [REDACTED]


## Accessing and Listing Files in ADLS Gen2 (Azure Gov) Using Databricks Secrets

This notebook cell demonstrates how to:

✅ Load the ADLS storage account access key securely from a Databricks Secret Scope (`adls-secrets`)  
✅ Apply the Spark configuration to authenticate to the Azure Government ADLS endpoint (`.dfs.core.usgovcloudapi.net`)  
✅ List the files and folders at the **root level** of the specified ADLS container (`araolibraryloadtest`)  
✅ Additionally, list the contents specifically inside the `test/` folder within that container

⚠ **Important:**  
- This approach avoids hardcoding sensitive keys directly in the notebook.  
- Always use `.strip()` on the retrieved secret to avoid hidden whitespace issues.

In [0]:
# Load ADLS storage account key from Databricks Secrets
storage_account = "araostorage"
container_name = "araolibraryloadtest"
scope = "adls-secrets"
key_name = "adls-access-key"

# Fetch and clean the key
storage_account_key = dbutils.secrets.get(scope=scope, key=key_name).strip()

# Apply Spark config for Azure Gov
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.usgovcloudapi.net",
    storage_account_key
)

# Test: list files in ADLS container
try:
    files = dbutils.fs.ls(f"abfss://{container_name}@{storage_account}.dfs.core.usgovcloudapi.net/")
    for file in files:
        print(f"✅ {file.name} ({file.size} bytes)")
except Exception as e:
    print(f"❌ Error accessing ADLS: {e}")



folder_path = f"abfss://{container_name}@{storage_account}.dfs.core.usgovcloudapi.net/test"

try:
    files = dbutils.fs.ls(folder_path)
    for file in files:
        print(f"✅ Name: {file.name}, Size: {file.size} bytes, Path: {file.path}")
except Exception as e:
    print(f"❌ Error accessing folder '{folder_path}': {e}")

✅ test/ (0 bytes)


## Recursively Listing All Files and Folders in an ADLS Gen2 Container

This notebook section demonstrates how to:

✅ Walk through all folders and subfolders within the ADLS container  
✅ List all files, along with their names, sizes, and paths  
✅ Handle nested directory structures efficiently

⚠ **Note:**  
Recursive listing is useful for audits or large data sweeps but can be slow on very large directories.
Consider filtering by file type, size, or last modified date if needed.
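
As one possible shape for the filtering suggested above, the sketch below keeps only files that match a set of extensions and a minimum size. The `Entry` tuple is a stand-in for the objects returned by `dbutils.fs.ls` (only `path`, `name`, and `size` are used), and `filter_entries` plus the sample paths are hypothetical. Filtering by last modified date is not covered here.

```python
from collections import namedtuple

# Stand-in for the entries returned by dbutils.fs.ls (only the fields used here).
Entry = namedtuple("Entry", ["path", "name", "size"])


def filter_entries(entries, extensions=None, min_size=0):
    """Keep files whose name ends with one of `extensions` and whose size is at least `min_size`."""
    kept = []
    for entry in entries:
        if entry.name.endswith("/"):  # folders are listed with a trailing slash
            continue
        if extensions and not entry.name.lower().endswith(tuple(extensions)):
            continue
        if entry.size < min_size:
            continue
        kept.append(entry)
    return kept


# Illustrative listing; the paths are shortened placeholders.
sample = [
    Entry("abfss://c@a/test/", "test/", 0),
    Entry("abfss://c@a/report.pdf", "report.pdf", 61282),
    Entry("abfss://c@a/notes.txt", "notes.txt", 120),
]
for entry in filter_entries(sample, extensions=[".pdf"], min_size=1024):
    print(entry.name, entry.size)
```

Applying the filter to each folder's listing as it is read keeps the printed output small even when the container is large.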

In [0]:
# Set up base path
base_path = f"abfss://{container_name}@{storage_account}.dfs.core.usgovcloudapi.net/"

def list_recursively(path, depth=0):
    """Print every file and folder under `path`, descending into subfolders."""
    indent = "   " * depth
    try:
        items = dbutils.fs.ls(path)
    except Exception as e:
        # Report the failure for this folder and keep walking the rest of the tree
        print(f"{indent}❌ Error reading folder {path}: {e}")
        return
    for item in items:
        if item.isDir():
            print(f"{indent}📁 Folder: {item.name}")
            list_recursively(item.path, depth + 1)
        else:
            print(f"{indent}📄 {item.name} ({item.size} bytes)")

list_recursively(base_path)

In [0]:
# Compare hardcoded vs. secret-loaded key lengths
hardcoded_key = "YOUR_ACCESS_KEY".strip()  # Replace with your actual hardcoded key
secret_key = dbutils.secrets.get(scope="adls-secrets", key="adls-access-key").strip()

print(f"Hardcoded key length: {len(hardcoded_key)}")
print(f"Secret key length: {len(secret_key)}")

if hardcoded_key == secret_key:
    print("✅ Keys match exactly")
else:
    print("❌ Keys differ — investigate secret storage")
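
If the keys differ, it helps to see how they differ without printing either one. The sketch below is one hedged way to do that: it reports lengths, a constant-time equality check, whether the values match once surrounding whitespace is stripped, and a short SHA-256 prefix for each. The function `compare_secrets` and the dummy values are hypothetical; a digest prefix is only reasonable to log for high-entropy values such as storage account keys.

```python
import hashlib
import hmac


def compare_secrets(a: str, b: str) -> dict:
    """Compare two secret strings and report differences without exposing either value."""
    def fingerprint(value: str) -> str:
        # First 8 hex characters of the SHA-256 digest, used only to tell values apart.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]

    return {
        "lengths": (len(a), len(b)),
        "exact_match": hmac.compare_digest(a.encode("utf-8"), b.encode("utf-8")),
        "match_after_strip": hmac.compare_digest(
            a.strip().encode("utf-8"), b.strip().encode("utf-8")
        ),
        "fingerprints": (fingerprint(a), fingerprint(b)),
    }


# Dummy values: the second carries the kind of trailing newline that `echo` without -n adds.
result = compare_secrets("AbC1234xyz==", "AbC1234xyz==\n")
print(result["lengths"], result["exact_match"], result["match_after_strip"])
```

A result where `exact_match` is false but `match_after_strip` is true points to stray whitespace introduced when the secret was stored.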