# 02 - Unity Catalog Connection Setup

This notebook configures Unity Catalog connections for the SFTP data sources:
- Create Unity Catalog connections for the source and target SFTP servers
- Test the source connection with Auto Loader
- Create the catalog schemas and the checkpoint location used by the pipeline

## 1. Import Libraries

In [None]:
from pyspark.sql import SparkSession
import json

## 2. Load Configuration from Previous Setup

In [None]:
# Load configuration
config_df = spark.table("sftp_demo.config.connection_params")
config_dict = {row.key: row.value for row in config_df.collect()}

source_host = config_dict["source_host"]
source_username = config_dict["source_username"]
target_host = config_dict["target_host"]
target_username = config_dict["target_username"]
ssh_key_path = config_dict["ssh_key_path"]

print("Configuration loaded successfully")
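
Indexing `config_dict` directly raises a bare `KeyError` on the first missing key, which can be hard to diagnose if the configuration table is incomplete. The sketch below checks all keys up front and reports every missing or blank one together. The helper names `missing_config_keys` and `validate_config` are hypothetical, and the required-key list is an assumption based on the values this notebook actually uses.

```python
# Hypothetical helpers: not part of the original notebook.
# Assumption: these are the keys the connection cells depend on.
REQUIRED_KEYS = ("source_host", "source_username", "target_host", "target_username")


def missing_config_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent or blank in `config`."""
    return [key for key in required if not str(config.get(key) or "").strip()]


def validate_config(config, required=REQUIRED_KEYS):
    """Raise a single error listing every missing key, else return `config`."""
    missing = missing_config_keys(config, required)
    if missing:
        raise KeyError(f"Missing configuration values: {', '.join(missing)}")
    return config
```

Calling `validate_config(config_dict)` right after the table is loaded would fail fast, before any connection is created.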

## 3. Create Unity Catalog Connection for Source SFTP

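**Note:** Creating a connection requires the `CREATE CONNECTION` privilege on the Unity Catalog metastore (or metastore admin rights); workspace admin rights alone may not be sufficient.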

In [None]:
# Create source SFTP connection using SQL
spark.sql(f"""
CREATE CONNECTION IF NOT EXISTS sftp_demo.source_sftp_connection
TYPE sftp
OPTIONS (
  host '{source_host}',
  port '22',
  username '{source_username}',
  privateKey SECRET ('sftp-credentials', 'source-private-key')
)
""")

print("Source SFTP connection created: sftp_demo.source_sftp_connection")

## 4. Create Unity Catalog Connection for Target SFTP

In [None]:
# Create target SFTP connection using SQL
spark.sql(f"""
CREATE CONNECTION IF NOT EXISTS sftp_demo.target_sftp_connection
TYPE sftp
OPTIONS (
  host '{target_host}',
  port '22',
  username '{target_username}',
  privateKey SECRET ('sftp-credentials', 'target-private-key')
)
""")

print("Target SFTP connection created: sftp_demo.target_sftp_connection")
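
The source and target cells differ only in the connection name, host, username, and secret key, and both interpolate configuration values straight into SQL. A minimal sketch of a shared builder follows. It reuses the statement shape and option names from the cells above as-is (it does not verify them against the Databricks SQL reference), and the function name `build_connection_sql` and the allowed-character pattern are assumptions made for illustration.

```python
import re

# Assumption: hosts and usernames only need letters, digits, and . _ @ -
# Anything else (quotes, spaces, semicolons) is rejected instead of escaped.
_SAFE_VALUE = re.compile(r"[A-Za-z0-9._@-]+")


def build_connection_sql(name, host, username, secret_scope, secret_key, port=22):
    """Build the CREATE CONNECTION statement used in this notebook (hypothetical helper)."""
    for label, value in (("host", host), ("username", username)):
        if not _SAFE_VALUE.fullmatch(str(value)):
            raise ValueError(f"Unsafe characters in {label}: {value!r}")
    return (
        f"CREATE CONNECTION IF NOT EXISTS {name}\n"
        "TYPE sftp\n"
        "OPTIONS (\n"
        f"  host '{host}',\n"
        f"  port '{int(port)}',\n"
        f"  username '{username}',\n"
        f"  privateKey SECRET ('{secret_scope}', '{secret_key}')\n"
        ")"
    )
```

With this in place, each cell could reduce to something like `spark.sql(build_connection_sql("sftp_demo.target_sftp_connection", target_host, target_username, "sftp-credentials", "target-private-key"))`.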

## 5. Verify Connections

In [None]:
# List all connections
connections_df = spark.sql("SHOW CONNECTIONS IN sftp_demo")
display(connections_df)

## 6. Test Source Connection with Auto Loader

Read data from the source SFTP server with Auto Loader to verify that the connection works.

In [None]:
# Test reading customers.csv from source SFTP
source_path = f"sftp://{source_host}/customers.csv"

customers_df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.connectionName", "sftp_demo.source_sftp_connection")
    .option("header", "true")
    .option("cloudFiles.inferColumnTypes", "true")  # Auto Loader reads CSV columns as strings unless this is set
    .load(source_path)
)

# Display schema
customers_df.printSchema()

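# Write to an in-memory table for verification
test_customers_query = (
    customers_df.writeStream
    .format("memory")
    .queryName("test_customers")
    .outputMode("append")
    .trigger(availableNow=True)  # process the files currently available, then stop
    .start()
)

# Block until the run finishes so that a connection failure surfaces here
test_customers_query.awaitTermination()
print("Source SFTP read completed; check test_customers to confirm rows were loaded")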

In [None]:
# Display sample data
display(spark.sql("SELECT * FROM test_customers LIMIT 10"))

## 7. Create Catalog and Schema for Pipeline

In [None]:
# Create catalog structure for DLT pipeline
spark.sql("CREATE CATALOG IF NOT EXISTS sftp_demo")
spark.sql("CREATE SCHEMA IF NOT EXISTS sftp_demo.bronze")
spark.sql("CREATE SCHEMA IF NOT EXISTS sftp_demo.silver")
spark.sql("CREATE SCHEMA IF NOT EXISTS sftp_demo.gold")

print("Catalog structure created:")
print("  - sftp_demo.bronze (raw data from source SFTP)")
print("  - sftp_demo.silver (cleaned and validated data)")
print("  - sftp_demo.gold (aggregated business-level data)")

## 8. Create External Location for Checkpoints

In [None]:
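# Create the checkpoint directory in DBFS (a plain DBFS path, not a Unity Catalog external location)
# dbutils.fs expects a dbfs:/ URI; the /dbfs/... form is the local FUSE mount path
checkpoint_location = "dbfs:/sftp_demo/checkpoints"
dbutils.fs.mkdirs(checkpoint_location)

print(f"Checkpoint location created: {checkpoint_location}")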
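
Structured Streaming expects every query to have its own checkpoint directory, so the base location above is usually split into one subdirectory per stream. A minimal sketch of that convention follows; the helper name `checkpoint_path` and the stream names are hypothetical and not taken from the pipeline notebook.

```python
import posixpath


def checkpoint_path(base, stream_name):
    """Return a per-stream checkpoint directory under `base` (hypothetical helper)."""
    if not stream_name or "/" in stream_name:
        raise ValueError(f"Invalid stream name: {stream_name!r}")
    # posixpath keeps forward slashes regardless of the driver's operating system
    return posixpath.join(base.rstrip("/"), stream_name)
```

For example, `checkpoint_path(checkpoint_location, "bronze_customers")` could be passed as the `checkpointLocation` option of a stream that writes customer data.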

## 9. Grant Permissions (if needed)

Grant necessary permissions to use the connections in DLT pipelines.

In [None]:
# Grant USE CONNECTION on the connections to all account users (adjust the principal as needed)
# Uncomment if you need to grant permissions:

# spark.sql("""
# GRANT USE CONNECTION ON CONNECTION sftp_demo.source_sftp_connection
# TO `account users`
# """)

# spark.sql("""
# GRANT USE CONNECTION ON CONNECTION sftp_demo.target_sftp_connection
# TO `account users`
# """)

print("Review connection permissions: the GRANT statements above are commented out by default")
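
When several principals need access to both connections, generating the statements avoids copy-paste drift. The sketch below assumes that the Unity Catalog privilege for using a connection is `USE CONNECTION`; the helper name `grant_statements` is hypothetical, and the principal names in the usage note are placeholders.

```python
def grant_statements(connections, principals, privilege="USE CONNECTION"):
    """Build one GRANT statement per (connection, principal) pair (hypothetical helper)."""
    for principal in principals:
        # Backticks delimit the principal name, so one inside it would break the statement
        if "`" in principal:
            raise ValueError(f"Invalid principal name: {principal!r}")
    return [
        f"GRANT {privilege} ON CONNECTION {connection} TO `{principal}`"
        for connection in connections
        for principal in principals
    ]
```

Each returned string can be passed to `spark.sql(...)`, for instance by looping over `grant_statements(["sftp_demo.source_sftp_connection", "sftp_demo.target_sftp_connection"], ["account users"])`.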

## 10. Test Complete Data Flow

In [None]:
# Read orders.csv from source SFTP
orders_path = f"sftp://{source_host}/orders.csv"

orders_df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.connectionName", "sftp_demo.source_sftp_connection")
    .option("header", "true")
    .option("cloudFiles.inferColumnTypes", "true")  # Auto Loader reads CSV columns as strings unless this is set
    .load(orders_path)
)

# Display schema
orders_df.printSchema()

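# Write to an in-memory table for verification
test_orders_query = (
    orders_df.writeStream
    .format("memory")
    .queryName("test_orders")
    .outputMode("append")
    .trigger(availableNow=True)  # process the files currently available, then stop
    .start()
)

# Block until the run finishes so that a read failure surfaces here
test_orders_query.awaitTermination()
print("Orders read completed; check test_orders to confirm rows were loaded")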

In [None]:
# Display sample orders data
display(spark.sql("SELECT * FROM test_orders LIMIT 10"))
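
A test stream that is still active keeps running on the cluster after its sample rows have been inspected. A minimal cleanup sketch follows; the helper `stop_queries` is hypothetical, and it relies only on the `name` attribute and `stop()` method that streaming query objects expose.

```python
def stop_queries(active_queries, names):
    """Stop every active query whose name is in `names`; return the names stopped."""
    stopped = []
    # Copy to a list first so that stopping a query cannot disturb the iteration
    for query in list(active_queries):
        if query.name in names:
            query.stop()
            stopped.append(query.name)
    return stopped
```

In this notebook it could be called as `stop_queries(spark.streams.active, {"test_customers", "test_orders"})` once both samples have been displayed.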

## Summary

Unity Catalog connection setup completed:
- ✓ Source SFTP connection created and tested
- ✓ Target SFTP connection created
- ✓ Catalog and schema structure created (bronze, silver, gold)
- ✓ Auto Loader successfully reading from source SFTP
- ✓ Checkpoint locations configured

Next step: Run notebook `03_dlt_pipeline.ipynb` to create and execute the Delta Live Tables pipeline.