**⭐ 1. What This Pattern Solves**

Auto Loader simplifies streaming ingestion of files from cloud storage (S3, ADLS, GCS) into Delta tables.

Automatically detects new files without scanning the full directory.

Supports schema inference and evolution.

Handles incremental loads efficiently into Bronze tables.

Used for:

Streaming raw logs to Bronze

Ingesting JSON/CSV/Parquet files as they arrive

Preparing incremental datasets for Silver and Gold transformations

**⭐ 2. SQL Equivalent**

In [0]:
%sql
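-- Create a table to receive files
CREATE TABLE IF NOT EXISTS bronze USING DELTA LOCATION '/delta/bronze';

-- Load files that have not been loaded yet
-- (COPY INTO is an idempotent batch command, not a continuous stream)
COPY INTO bronze
FROM 's3://bucket/raw_data/'
FILEFORMAT = JSON
PATTERN = '*.json'
COPY_OPTIONS ('mergeSchema' = 'true');  -- table was created without columns, so merge the inferred schema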

**⭐ 3. Core Idea**

Incremental file discovery: Only new files are processed

Schema evolution: New columns are detected automatically

Streaming ingestion → Delta Bronze table: Foundation for Bronze → Silver → Gold pipelines

Reusability: Any file-based ingestion pipeline can be automated with Auto Loader for near-real-time ETL.
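
The incremental-discovery idea above can be pictured with a toy model in plain Python. This is only a conceptual sketch, not how Auto Loader is implemented: the function name `discover_new_files` and the JSON checkpoint file are assumptions made for illustration, and unlike Auto Loader the toy version lists the whole directory on every call.

```python
import json
import os
import tempfile


def discover_new_files(source_dir, checkpoint_path):
    """Return files in source_dir that the checkpoint has not seen, then record them."""
    # Load the names of already-processed files (empty on the first run).
    seen = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            seen = set(json.load(f))

    # Anything present now but absent from the checkpoint counts as new.
    new_files = [name for name in sorted(os.listdir(source_dir)) if name not in seen]

    # Persist progress so a later run (or a restart) skips these files.
    with open(checkpoint_path, "w") as f:
        json.dump(sorted(seen | set(new_files)), f)
    return new_files


def demo():
    """Drop two files one after the other and show what each run picks up."""
    root = tempfile.mkdtemp()
    source = os.path.join(root, "raw")
    os.mkdir(source)
    checkpoint = os.path.join(root, "checkpoint.json")

    open(os.path.join(source, "file1.json"), "w").close()
    first = discover_new_files(source, checkpoint)    # only file1.json
    open(os.path.join(source, "file2.json"), "w").close()
    second = discover_new_files(source, checkpoint)   # only file2.json
    third = discover_new_files(source, checkpoint)    # nothing new
    return first, second, third
```

Calling `demo()` returns the three batches in order; the last one is empty because the checkpoint already covers both files.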

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
# Read: discover new files under /mnt/raw_data incrementally
df_stream = (spark.readStream.format("cloudFiles")
             .option("cloudFiles.format", "json")       # or csv/parquet
             .option("cloudFiles.schemaLocation", "/delta/schema")  # inferred schema is tracked here
             .load("/mnt/raw_data"))

# Write: append to the Bronze Delta table; the checkpoint tracks progress
(df_stream.writeStream
 .format("delta")
 .option("checkpointLocation", "/delta/checkpoints/bronze")
 .outputMode("append")
 .toTable("bronze"))
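
To reuse the template for several sources, it could be wrapped in a small helper. The sketch below is one possible shape, not a prescribed API: the function names `auto_loader_options` and `start_bronze_stream` are hypothetical, and it assumes a runtime where `cloudFiles` is available and `DataStreamWriter.toTable` exists (Spark 3.1+).

```python
def auto_loader_options(file_format, schema_location):
    """Reader options for one Auto Loader source (same keys as the template)."""
    return {
        "cloudFiles.format": file_format,              # json, csv, parquet, ...
        "cloudFiles.schemaLocation": schema_location,  # where the inferred schema is tracked
    }


def start_bronze_stream(spark, source_path, table, file_format,
                        schema_location, checkpoint_location):
    """Start an append-only stream from source_path into a Delta table."""
    reader = spark.readStream.format("cloudFiles")
    for key, value in auto_loader_options(file_format, schema_location).items():
        reader = reader.option(key, value)

    # Returns the StreamingQuery handle so the caller can monitor or stop it.
    return (reader.load(source_path)
            .writeStream
            .format("delta")
            .option("checkpointLocation", checkpoint_location)
            .outputMode("append")
            .toTable(table))
```

A CSV source might then be started with `start_bronze_stream(spark, "/mnt/raw_logs", "bronze_logs", "csv", "/delta/schema/logs", "/delta/checkpoints/bronze_logs")`, where all paths and the table name are placeholders. Each stream should get its own schema location and checkpoint location.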


**⭐ 5. Detailed Example**

In [0]:
# Simulated JSON files arriving in S3
# File 1: [{"id":"A", "amount":100}]
# File 2: [{"id":"B", "amount":50}]

df_stream = (spark.readStream.format("cloudFiles")
             .option("cloudFiles.format", "json")
             .option("cloudFiles.schemaLocation", "/delta/schema")
             .load("/mnt/raw_data"))

(df_stream.writeStream
 .format("delta")
 .option("checkpointLocation", "/delta/checkpoints/bronze")
 .outputMode("append")
 .toTable("bronze"))

**Step-by-step:**

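Auto Loader monitors /mnt/raw_data for newly arrived files

Detects new JSON files automatically and skips files it has already ingested

Appends the new rows to the Bronze Delta table

Records progress in the checkpoint, so each file is ingested exactly once, even after a restart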
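
To try these steps out, the two sample files could be produced with a few lines of plain Python. This is a sketch under assumptions: the helper name `write_json_lines` and the temporary landing directory are hypothetical, and the records are written as newline-delimited JSON (one object per line), which is the default layout Spark's JSON reader expects.

```python
import json
import os
import tempfile


def write_json_lines(directory, file_name, records):
    """Write records as newline-delimited JSON and return the file path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, file_name)
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path


# Hypothetical local landing directory; on Databricks this would be the
# storage path that the stream's .load(...) call points at.
landing_dir = os.path.join(tempfile.mkdtemp(), "raw_data")

# File 1 arrives first.
write_json_lines(landing_dir, "file1.json", [{"id": "A", "amount": 100}])

# After the stream has picked up file1.json, drop the second file.
write_json_lines(landing_dir, "file2.json", [{"id": "B", "amount": 50}])
```

Once each file has landed and the stream has processed it, querying the `bronze` table should show one more row than before. Note that Auto Loader infers JSON columns as strings unless `cloudFiles.inferColumnTypes` is enabled.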

**⭐ 6. Mini Practice Problems**

Use Auto Loader to stream CSV logs into a Bronze table.

Simulate schema evolution by adding a new column and verify ingestion works.

Stream daily JSON files and verify deduplication before writing to Silver.

**⭐ 7. Full Data Engineering Problem**

Scenario: A healthcare provider receives hourly patient vitals as JSON from multiple devices:

Use Auto Loader to stream data into Bronze

Transform and clean in Silver (dedup, normalize units)

Aggregate for Gold dashboards (daily averages per patient)

Maintain incremental, near-real-time updates with exactly-once guarantees
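
One way the Silver and Gold steps of this scenario might be sketched is shown below. Everything schema-related is an assumption, since the scenario does not define one: the column names (`patient_id`, `device_id`, `event_time`, `temp_f`, `heart_rate`), the table names `silver` and `gold_daily_vitals`, the Fahrenheit-to-Celsius rule, and the two-hour watermark are all hypothetical. Bronze is assumed to be filled by the Auto Loader template from section 4.

```python
# Hypothetical columns that identify one reading; used for deduplication.
DEDUP_KEYS = ["patient_id", "device_id", "event_time"]


def fahrenheit_to_celsius(temp_f):
    """One possible 'normalize units' rule; works on floats and on Spark Columns."""
    return (temp_f - 32.0) * 5.0 / 9.0


def bronze_to_silver(spark, checkpoint_location):
    """Stream Bronze -> Silver: cast types, normalize units, drop duplicate readings."""
    from pyspark.sql import functions as F  # imported lazily; needs a Spark runtime

    cleaned = (spark.readStream.table("bronze")
               .withColumn("event_time", F.to_timestamp("event_time"))
               .withColumn("temp_c",
                           fahrenheit_to_celsius(F.col("temp_f").cast("double")))
               .withWatermark("event_time", "2 hours")  # bounds the dedup state
               .dropDuplicates(DEDUP_KEYS))

    return (cleaned.writeStream
            .format("delta")
            .option("checkpointLocation", checkpoint_location)
            .outputMode("append")
            .toTable("silver"))


def silver_to_gold(spark):
    """Batch rebuild of Gold: daily average heart rate per patient."""
    from pyspark.sql import functions as F

    daily = (spark.read.table("silver")
             .groupBy("patient_id", F.to_date("event_time").alias("day"))
             .agg(F.avg("heart_rate").alias("avg_heart_rate")))

    # Gold is fully recomputed here, so overwrite is intentional for this table only.
    daily.write.format("delta").mode("overwrite").saveAsTable("gold_daily_vitals")
```

Rebuilding Gold as a scheduled batch job keeps the sketch short; a streaming aggregation would be an alternative when dashboards need fresher numbers.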

**⭐ 8. Time & Space Complexity**

Time: roughly O(new_files) per micro-batch → already-ingested files are not read again

Space: Checkpoint and schema logs are minimal; raw files stored externally

Efficient for high-frequency ingestion at scale

**⭐ 9. Common Pitfalls**

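Forgetting checkpointLocation → the stream cannot resume where it left off, so exactly-once guarantees are lost

Overwriting the Bronze table instead of appending → previously ingested data is dropped

Not planning for schema evolution → by default the stream stops when a new column appears and must be restarted to pick up the updated schema

Starting a stream with a new checkpoint on a directory that was already ingested → existing files are processed again, producing duplicates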
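
The pitfalls above could be caught before a stream starts with a small configuration check. This is a minimal sketch: the function `check_stream_config` is hypothetical, while the option keys it inspects (`checkpointLocation`, `cloudFiles.schemaLocation`, `cloudFiles.includeExistingFiles`) are standard stream and Auto Loader options. `cloudFiles.includeExistingFiles` defaults to true, which is why a brand-new stream ingests files that are already in the directory.

```python
def check_stream_config(reader_options, writer_options, output_mode):
    """Return warnings for the pitfalls listed above (empty list = none found)."""
    warnings = []

    # Without a checkpoint the stream cannot resume where it left off.
    if "checkpointLocation" not in writer_options:
        warnings.append("checkpointLocation is not set")

    # Bronze ingestion should only ever append.
    if output_mode != "append":
        warnings.append(f"output mode is '{output_mode}', expected 'append'")

    # Schema inference and evolution need a place to track the schema.
    if "cloudFiles.schemaLocation" not in reader_options:
        warnings.append("cloudFiles.schemaLocation is not set")

    # A new checkpoint plus includeExistingFiles=true re-ingests old files.
    if reader_options.get("cloudFiles.includeExistingFiles", "true") == "true":
        warnings.append("existing files will be ingested when a new stream starts")

    return warnings


# Example: a configuration that passes every check.
reader_options = {
    "cloudFiles.format": "json",
    "cloudFiles.schemaLocation": "/delta/schema",
    "cloudFiles.schemaEvolutionMode": "addNewColumns",  # Auto Loader default when inferring
    "cloudFiles.includeExistingFiles": "false",
}
writer_options = {"checkpointLocation": "/delta/checkpoints/bronze"}
problems = check_stream_config(reader_options, writer_options, "append")
```

Whether the last warning matters depends on the situation: for the very first load of a directory, ingesting existing files is usually what is wanted.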