# 03 - SFTP Structured Streaming Demo

This notebook demonstrates reading from and writing to SFTP servers using:
- **AutoLoader** for reading from SFTP (built-in Databricks feature)
- **Custom SFTP Data Source** for writing to SFTP (Paramiko + Data Source API)

**Prerequisites:**
- Run notebook `01_infrastructure_setup.ipynb` first
- Run notebook `02_uc_connection_setup.ipynb` second

This notebook will:
1. Read data from source SFTP using AutoLoader
2. Write data to target SFTP using custom data source
3. List files in target directory to verify

## 1. Install Package and Load Configuration

In [0]:
# Install package from source (editable mode)
%pip install -q -e ..
dbutils.library.restartPython()

In [0]:
# Create widgets for catalog and schema configuration
dbutils.widgets.text("catalog_name", "sftp_demo", "Catalog Name")
dbutils.widgets.text("schema_name", "default", "Schema Name")

# Get widget values
CATALOG_NAME = dbutils.widgets.get("catalog_name")
SCHEMA_NAME = dbutils.widgets.get("schema_name")

print(f"Catalog: {CATALOG_NAME}")
print(f"Schema: {SCHEMA_NAME}")

In [0]:
# Load configuration from catalog
config_df = spark.table(f"{CATALOG_NAME}.config.connection_params")
config_dict = {row.key: row.value for row in config_df.collect()}

# Get configuration values
source_host = config_dict["source_host"]
source_username = config_dict["source_username"]
target_host = config_dict["target_host"]
target_username = config_dict["target_username"]
secret_scope = config_dict["secret_scope"]
ssh_key_secret = config_dict["ssh_key_secret"]

print("✓ Configuration loaded successfully")
print(f"  Source: {source_username}@{source_host}")
print(f"  Target: {target_username}@{target_host}")

## 2. Read from SFTP using AutoLoader

AutoLoader automatically finds the Unity Catalog connection based on the host in the SFTP URI.

This uses:
- **Checkpoint location**: Managed volume (`/Volumes/{catalog}/{schema}/_checkpoints`)
- **Schema inference**: AutoLoader automatically infers and evolves the schema
- **Serverless-compatible trigger**: `availableNow=True` processes all data available when the query starts (possibly across several micro-batches), then stops
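
To make the "host in the SFTP URI" concrete, the sketch below splits a URI of the shape used in this notebook into its parts with Python's standard `urllib.parse`. The host `sftp.example.com` and user `alice` are placeholder values, and the helper name `describe_sftp_uri` is hypothetical. It only illustrates which part of the URI is the host; how AutoLoader resolves the connection internally is not shown.

```python
from urllib.parse import urlsplit


def describe_sftp_uri(uri: str) -> dict:
    """Split an sftp:// URI into the parts that identify the remote endpoint."""
    parts = urlsplit(uri)
    if parts.scheme != "sftp":
        raise ValueError(f"expected an sftp:// URI, got scheme {parts.scheme!r}")
    return {
        "username": parts.username,
        "host": parts.hostname,
        "port": parts.port or 22,  # fall back to the standard SFTP port
        "path": parts.path,
    }


# Placeholder values, not the demo's real endpoint
info = describe_sftp_uri("sftp://alice@sftp.example.com:22/customers.csv")
print(info)
```

Under this reading, the `host` component is what a connection lookup would key on, so the host stored in the Unity Catalog connection and the host in `source_sftp_uri` need to agree.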

In [0]:
# Test reading customers.csv from source SFTP using AutoLoader
source_sftp_uri = f"sftp://{source_username}@{source_host}:22/customers.csv"

# Use managed volume for checkpoint location
checkpoint_location = f"/Volumes/{CATALOG_NAME}/{SCHEMA_NAME}/_checkpoints/customers"
schema_location = f"/Volumes/{CATALOG_NAME}/{SCHEMA_NAME}/_checkpoints/customers_schema"

print(f"AutoLoader configuration:")
print(f"  Source URI: {source_sftp_uri}")
print(f"  Checkpoint: {checkpoint_location}")
print(f"  Schema: {schema_location}\n")

customers_df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", schema_location)
    .option("header", "true")
    .load(source_sftp_uri)
)

# Display schema
print("Schema:")
customers_df.printSchema()

In [0]:
# Write streaming data to table using checkpoint location in managed volume
table_name = f"{CATALOG_NAME}.{SCHEMA_NAME}.test_customers"

print(f"Writing to table: {table_name}")
print(f"Checkpoint location: {checkpoint_location}\n")

query = (
    customers_df.writeStream
    .trigger(availableNow=True)
    .option("checkpointLocation", checkpoint_location)
    .toTable(table_name)
)

# Block until all available data has been processed and the stream stops
query.awaitTermination()

print("✅ All data processed!")
print(f"\nData read from SFTP:")
spark.sql(f"SELECT * FROM {table_name}").show()

## 3. Write to SFTP using Custom Data Source

The custom SFTP Data Source uses:
- **Databricks Python Data Source API** - Native Spark integration
- **Paramiko 3.4.0** - Secure SSH/SFTP protocol implementation
- **Distributed writes** - Each Spark executor creates its own temp key file

This enables writing using the standard Spark API: `df.write.format("sftp")`
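
The per-executor key handling can be pictured with the sketch below. It is not the code from `CustomDataSource`; the helper name `temporary_key_file` is hypothetical and the key text is a placeholder. It assumes the writer receives the key as a string (as with the `private_key_content` option), writes it to a private temp file, and removes the file once the connection work is done.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_key_file(key_content: str):
    """Write key material to a private temp file and delete it afterwards."""
    # mkstemp creates the file readable and writable only by the current user
    fd, path = tempfile.mkstemp(suffix="_sftp_key")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(key_content)
        os.chmod(path, 0o600)  # explicit, in case platform defaults differ
        yield path
    finally:
        # Always remove the key file, even if the caller's block raised
        if os.path.exists(path):
            os.remove(path)


# Placeholder text standing in for a real private key
with temporary_key_file("not-a-real-key") as key_path:
    with open(key_path) as handle:
        contents = handle.read()
    existed_inside = os.path.exists(key_path)

removed_afterwards = not os.path.exists(key_path)
print(existed_inside, removed_afterwards)
```

Passing key content rather than a path fits distributed writes because a file created on the driver is not visible on the executors, whereas a string option travels with the write configuration.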

In [0]:
import tempfile
import os

# Import custom SFTP Data Source
from CustomDataSource import SFTPDataSource

# Register the SFTP data source with Spark
try:
    spark.dataSource.register(SFTPDataSource)
except AttributeError:
    print("Please ensure you run Serverless Environment version 4 or greater!")
    raise

print("✓ SFTP data source registered with Spark")
print("  Format name: 'sftp'")
print("  Usage: df.write.format('sftp').option(...).save()")

In [0]:
# Create demo DataFrame to write to target SFTP
from datetime import datetime

demo_data = [
    (1, "Demo Customer 1", "demo1@example.com", "USA", datetime.now().strftime("%Y-%m-%d")),
    (2, "Demo Customer 2", "demo2@example.com", "UK", datetime.now().strftime("%Y-%m-%d")),
    (3, "Demo Customer 3", "demo3@example.com", "Canada", datetime.now().strftime("%Y-%m-%d"))
]

demo_df = spark.createDataFrame(demo_data, ["customer_id", "name", "email", "country", "signup_date"])

print("Demo DataFrame created:")
demo_df.show()

print(f"\n✓ Created demo DataFrame with {demo_df.count()} rows")

In [0]:
# Write demo data to target SFTP using Databricks Python Data Source API
# The SSH key content is passed as an option - each Spark executor creates its own temp file

# Get SSH private key content from secrets
ssh_key_content = dbutils.secrets.get(scope=secret_scope, key=ssh_key_secret)

remote_path = "/demo_customers.csv"

print(f"Writing demo DataFrame to SFTP using Spark DataSource API")
print(f"Target: {target_username}@{target_host}{remote_path}")
print(f"Technology: Paramiko SSHv2 library (version 3.4.0)")

# Write using Spark DataSource API - passes key content, not path
# Each executor will create its own temporary key file from the content
demo_df.write \
    .format("sftp") \
    .option("host", target_host) \
    .option("username", target_username) \
    .option("private_key_content", ssh_key_content) \
    .option("port", "22") \
    .option("path", remote_path) \
    .option("format", "csv") \
    .option("header", "true") \
    .mode("overwrite") \
    .save()

print("\n" + "="*70)
print("Custom SFTP Data Source Write Complete")
print("="*70)
print(f"Technology: Paramiko SSHv2 library")
print(f"API: Databricks Python Data Source API")
print(f"Pattern: spark.dataSource.register() + df.write.format('sftp')")
print(f"Key Distribution: private_key_content option")
print(f"Written: {demo_df.count()} rows to {remote_path}")
print("="*70)

## 4. Verify Files Written to Target SFTP

List files in the target SFTP directory to prove the write succeeded.
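
A minimal sketch of the matching step, assuming the listing comes back as a list of file-name strings and that the write may leave one or more files whose names contain the target stem. The helper `find_written_files` and the sample listing are hypothetical; the actual output names depend on how the data source writes.

```python
from typing import Iterable, List


def find_written_files(listing: Iterable[str], stem: str) -> List[str]:
    """Return the entries in a directory listing whose name contains `stem`."""
    return sorted(name for name in listing if stem in name)


# Hypothetical listing; real names depend on how the data source writes output
listing = [
    "customers.csv",
    "demo_customers.csv",
    "demo_customers_part_0.csv",
    "readme.txt",
]
matches = find_written_files(listing, "demo_customers")
print(matches)
```

Returning the matching names, rather than a bare true/false, makes it easier to see whether the write produced a single file or several.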

In [0]:
# Get SSH private key content from secrets and create temp file for connection test
from CustomDataSource import SFTPConnectionTester

ssh_key_content = dbutils.secrets.get(scope=secret_scope, key=ssh_key_secret)
tmp_key_file = tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='_sftp_key')
tmp_key_file.write(ssh_key_content)
tmp_key_file.close()
os.chmod(tmp_key_file.name, 0o600)

try:
    # Connect to target SFTP and list files
    target_tester = SFTPConnectionTester(
        host=target_host,
        username=target_username,
        private_key_path=tmp_key_file.name,
        port=22
    )

    with target_tester as conn:
        files = conn.list_files(".")
        print("Files in target SFTP directory:")
        print("="*70)
        for f in files:
            print(f"  - {f}")
        print("="*70)
        
        # Check whether any listed file name contains "demo_customers"
        # (this also covers an exact match on demo_customers.csv)
        if any("demo_customers" in f for f in files):
            print("\n✅ SUCCESS: demo_customers.csv file(s) found in target SFTP!")
        else:
            print("\n⚠️  WARNING: demo_customers.csv not found in file list")
            
finally:
    # Clean up temporary key file
    if os.path.exists(tmp_key_file.name):
        os.remove(tmp_key_file.name)

## Summary

SFTP structured streaming demo completed:
- ✓ Read data from source SFTP using AutoLoader
- ✓ Wrote the streamed data to a table and displayed it
- ✓ Wrote data to target SFTP using custom data source
- ✓ Verified files exist in target directory

**Key Technologies:**
- AutoLoader (built-in Databricks) for SFTP reads
- Paramiko 3.4.0 + Data Source API for SFTP writes
- Unity Catalog connections for credential management
- Managed volumes for checkpoint storage