# 🧪 Databricks Auto Loader
This notebook scans an S3 landing zone for files whose names contain `edm_entity`, routes each one to an existing Delta table, and loads it with schema inference chosen by file extension. If the target table doesn't exist, the file is left in the landing zone.

## 📁 Step 1: Configuration
Set up base variables for schema name, organization, and S3 path locations.

In [None]:
from pyspark.sql.functions import input_file_name
import os

org_name = "entity"
schema_name = "edm"

landing_path = "s3://your-bucket/landing-zone/"
router_schema_path = "s3://your-bucket/schema-tracking/__router__/"
router_checkpoint_path = "s3://your-bucket/checkpoints/__router__/"

## 📂 Step 2: File Detection Stream
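Use Auto Loader to detect file arrivals in `landing_path`. The `binaryFile` format yields one row per file (`path`, `modificationTime`, `length`, `content`); only `path` is used for routing, and each file is re-read later with a format-specific reader.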

In [None]:
raw_df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "binaryFile")
    .option("cloudFiles.includeExistingFiles", "true")
    .option("cloudFiles.schemaLocation", router_schema_path)
    .load(landing_path)
    .withColumn("source_file", input_file_name()))

## ⚙️ Step 3: Batch Processing Logic
For each micro-batch, collect the file paths and choose a target table from the file name. Files that match no routing rule are skipped.

In [None]:
def process_batch(batch_df, batch_id):
    # Collect the file paths to the driver without going through the RDD API.
    paths = [row.path for row in batch_df.select("path").collect()]

    for path in paths:
        filename = os.path.basename(path)
        ext = os.path.splitext(filename)[1].lower().strip(".")

        if "edm_entity" in filename.lower():
            table_name = "edm_entity"
        else:
            print(f"❌ No routing rule for file: {filename}")
            continue
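The routing step above can also be expressed as a small lookup, which makes it easier to add rules later. This is a minimal sketch, not part of the notebook: `ROUTING_RULES`, `route_filename`, and `file_extension` are hypothetical names introduced for illustration.

```python
import os
from typing import Optional

# Hypothetical rule table: substring expected in the file name -> table name.
ROUTING_RULES = {
    "edm_entity": "edm_entity",
}


def route_filename(path: str) -> Optional[str]:
    """Return the table name for a file path, or None when no rule matches."""
    filename = os.path.basename(path).lower()
    for marker, table_name in ROUTING_RULES.items():
        if marker in filename:
            return table_name
    return None


def file_extension(path: str) -> str:
    """Return the lower-cased extension without the leading dot ('' if none)."""
    return os.path.splitext(os.path.basename(path))[1].lower().lstrip(".")
```

Inside `process_batch`, a `None` result from `route_filename` would correspond to the "no routing rule" branch.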

## 🏷️ Step 4: Build Target Paths
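Construct the target table name and the per-table schema-tracking and checkpoint paths. Note that `schema_tracking_path` and `checkpoint_path` are defined here but not used by the reads and writes in the later steps, and the table's schema level is hardcoded to `bronze` rather than taken from `schema_name`.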

In [None]:
        target_table = f"{org_name}.bronze.{table_name}"
        schema_tracking_path = f"s3://your-bucket/schema-tracking/{org_name}/{schema_name}/{table_name}/"
        checkpoint_path = f"s3://your-bucket/checkpoints/{org_name}/{schema_name}/{table_name}/"
        print(f"🔍 Routing file: {filename} → Table: {target_table}")
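To keep the naming convention in one place, the three strings built above could come from a single helper. The sketch below assumes the same layout as the notebook; `Targets`, `build_targets`, and the `bucket` parameter are hypothetical names, not part of the original code.

```python
from typing import NamedTuple


class Targets(NamedTuple):
    target_table: str
    schema_tracking_path: str
    checkpoint_path: str


def build_targets(org_name: str, schema_name: str, table_name: str,
                  bucket: str = "s3://your-bucket") -> Targets:
    """Build the table name and tracking paths using the notebook's layout."""
    suffix = f"{org_name}/{schema_name}/{table_name}/"
    return Targets(
        target_table=f"{org_name}.bronze.{table_name}",
        schema_tracking_path=f"{bucket}/schema-tracking/{suffix}",
        checkpoint_path=f"{bucket}/checkpoints/{suffix}",
    )
```

For example, `build_targets("entity", "edm", "edm_entity").target_table` gives the same fully qualified name that the cell above prints.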

## 📥 Step 5: Read and Validate File Format
Load the file based on its extension. Skip unsupported formats.

In [None]:
        try:
            if ext in ["csv", "txt"]:
                df = spark.read.option("header", "true").option("inferSchema", "true").csv(path)
            elif ext == "json":
                # The JSON reader always infers the schema; it has no inferSchema option.
                df = spark.read.json(path)
            else:
                print(f"⚠️ Unsupported format: {filename}")
                continue
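The extension check above can be written as a lookup from extension to reader format and options, so that supporting a new format means adding one entry. This is a sketch under assumptions: `READERS`, `reader_config`, and `read_file` are hypothetical names, and `read_file` expects an active `SparkSession` to be passed in.

```python
from typing import Dict, Optional, Tuple

# Hypothetical mapping: file extension -> (Spark reader format, reader options).
READERS: Dict[str, Tuple[str, Dict[str, str]]] = {
    "csv": ("csv", {"header": "true", "inferSchema": "true"}),
    "txt": ("csv", {"header": "true", "inferSchema": "true"}),
    "json": ("json", {}),
}


def reader_config(ext: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (format, options) for an extension, or None if unsupported."""
    return READERS.get(ext.lower())


def read_file(spark, path: str, ext: str):
    """Read one file with the matching reader; return None for unsupported formats."""
    config = reader_config(ext)
    if config is None:
        return None
    fmt, options = config
    return spark.read.format(fmt).options(**options).load(path)
```

A `None` return from `read_file` maps to the "unsupported format" branch in the cell above.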

## 🧱 Step 6: Validate Table Existence
Check if the Delta table exists before writing. If not, leave the file unprocessed.

In [None]:
            if not spark.catalog.tableExists(target_table):
                print(f"🚫 Table not found for {filename}. File remains in landing zone.")
                continue

## 💾 Step 7: Append Data to Table
Write the DataFrame into the matched Delta table.

In [None]:
            df.write.format("delta").mode("append").saveAsTable(target_table)
            print(f"✅ Loaded into: {target_table}")

## 🗃️ Step 8: Archive Processed File
Move the file from landing to archive only after successful ingestion.

In [None]:
            # Replace only the first occurrence so file names containing "landing-zone" are untouched.
            archive_path = path.replace("landing-zone", "archive-zone", 1)
            dbutils.fs.mv(path, archive_path)
            print(f"📦 Archived: {archive_path}")
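A plain string replacement rewrites the text wherever it appears in the path. A stricter variant, sketched below, swaps only the leading prefix and fails loudly on paths outside the landing zone. The function name `archive_path_for` and the `archive-zone` prefix default are assumptions for illustration.

```python
def archive_path_for(path: str,
                     landing_prefix: str = "s3://your-bucket/landing-zone/",
                     archive_prefix: str = "s3://your-bucket/archive-zone/") -> str:
    """Map a landing-zone path to its archive location by swapping the prefix."""
    if not path.startswith(landing_prefix):
        raise ValueError(f"Path is not under the landing zone: {path}")
    # Keep everything after the prefix, including any sub-directories.
    return archive_prefix + path[len(landing_prefix):]
```

The result could then be passed to `dbutils.fs.mv(path, archive_path)` exactly as in the cell above.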

## ❌ Step 9: Handle Errors Gracefully
Catch exceptions and continue with other files.

In [None]:
        except Exception as e:
            print(f"❌ Error processing {filename}: {e}")
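Printing each error keeps the batch moving, but the failures are easy to miss in the output. One possible refinement, shown here as a sketch, records an outcome per file so the batch can be summarized afterwards. `process_all` and its `handler` argument are hypothetical; `handler` stands in for the per-file read, write, and archive logic.

```python
from typing import Callable, Dict, Iterable


def process_all(paths: Iterable[str],
                handler: Callable[[str], None]) -> Dict[str, str]:
    """Run handler on every path; record 'ok' or the error instead of stopping."""
    outcomes: Dict[str, str] = {}
    for path in paths:
        try:
            handler(path)
            outcomes[path] = "ok"
        except Exception as exc:  # one bad file must not block the others
            outcomes[path] = f"error: {exc}"
    return outcomes
```

The returned dictionary can be logged once per micro-batch or written to an audit table.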

## 🚀 Step 10: Start Stream
Run the stream as a one-shot job: process all files currently available in the landing zone, then stop.

In [None]:
(raw_df.writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", router_checkpoint_path)
    .trigger(availableNow=True)  # replaces the deprecated once=True trigger
    .start()
    .awaitTermination())