# Auto Loader in Databricks
## Efficiently Ingest Data at Scale

**Auto Loader** is a Databricks feature designed to incrementally and efficiently process new data files as they arrive in cloud storage. It removes the need for custom file-tracking logic or scheduled file-listing jobs.

### Key Features:
1.  **Incremental Processing:** Only processes new files.
2.  **Scalability:** Can handle millions of files per hour.
3.  **Schema Evolution:** Detects new columns and evolves the schema. Values that do not match the expected type can be captured in a rescued data column instead of being dropped.
4.  **File Notification & Directory Listing:** Two modes to detect new files.

### The `cloudFiles` Source
Auto Loader is accessed via the Structured Streaming source called `cloudFiles`.

```python
spark.readStream.format("cloudFiles") \
     .option("cloudFiles.format", "csv") \
     .load("/path/to/files")
```

## In this notebook, we will explore:

*   Setting up Auto Loader for CSV ingestion.
*   Using schema hints for data typing.
*   Incremental loading (adding files one by one).
*   Schema evolution modes (`addNewColumns`, `rescue`, `failOnNewColumns`, `none`).

In [None]:
# Setup: Define Paths and Clean up previous runs
# We will use a temporary location for this demo to simulate cloud storage

import time

base_path = "/tmp/databricks_zero_to_hero/autoloader_demo"
input_path = f"{base_path}/input"
checkpoint_path = f"{base_path}/checkpoints"
schema_location = f"{base_path}/schema_log"
table_name = "autoloader_demo_table"

# Clean up
dbutils.fs.rm(base_path, True)
spark.sql(f"DROP TABLE IF EXISTS {table_name}")

print(f"Environment setup complete. \nInput Path: {input_path}")

In [None]:
# Helper function to create dummy CSV data
def create_file(file_id, date_str, data_content):
    # Simulating a nested directory structure: Year/Month/Day
    year, month, day = date_str.split("-")
    file_path = f"{input_path}/{year}/{month}/{day}/file_{file_id}.csv"
    
    dbutils.fs.put(file_path, data_content, True)
    print(f"Created file: {file_path}")

# 1. Create Initial Data (Day 1)
# Schema: id, name, amount, date
data_day_1 = """id,name,amount,date
1,Alice,100,2023-10-01
2,Bob,200,2023-10-01"""

create_file(1, "2023-10-01", data_day_1)

## 1. Basic Auto Loader Implementation

We will configure Auto Loader to read CSV files.
*   **`cloudFiles.format`**: The format of the source files (csv, json, parquet, etc.).
*   **`cloudFiles.schemaLocation`**: Where Auto Loader stores the inferred schema state.
*   **`pathGlobFilter`**: To select specific files (e.g., `*.csv`).
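
The options above can be collected in one place before they are applied to the reader. The sketch below is a minimal illustration and not part of the original notebook: `build_autoloader_options` and `make_reader` are hypothetical helper names, the paths are placeholders, and `make_reader` assumes a live `spark` session is passed in.

```python
from typing import Dict, Optional


def build_autoloader_options(
    file_format: str,
    schema_location: str,
    glob_filter: Optional[str] = None,
    schema_hints: Optional[str] = None,
) -> Dict[str, str]:
    """Collect Auto Loader options in a plain dict (hypothetical helper)."""
    options = {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
    }
    if glob_filter is not None:
        # pathGlobFilter is a generic file-source option, so it has no cloudFiles. prefix
        options["pathGlobFilter"] = glob_filter
    if schema_hints is not None:
        options["cloudFiles.schemaHints"] = schema_hints
    return options


def make_reader(spark, path: str, options: Dict[str, str]):
    """Apply the options to a cloudFiles stream reader (needs a live Spark session)."""
    return spark.readStream.format("cloudFiles").options(**options).load(path)


# Placeholder schema location; only the dict is built here, no stream is started
csv_options = build_autoloader_options(
    "csv", "/tmp/demo/schema_log", glob_filter="*.csv", schema_hints="id INT, amount DOUBLE"
)
csv_options["header"] = "true"  # CSV parser option, passed alongside the cloudFiles options
print(csv_options)
```

Keeping the options in a dictionary makes it easier to reuse the same settings across several streams and to log exactly what a stream was started with.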

In [None]:
# Define the Auto Loader Stream Reader
# Note: We are NOT providing a schema explicitly. Auto Loader infers it.

df_stream = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", schema_location) # Crucial for schema evolution
    .option("header", "true")
    
    # Optional: Use hints to enforce specific types for columns if known
    .option("cloudFiles.schemaHints", "id INT, amount DOUBLE") 
    
    .load(f"{input_path}/*/*/*") # Wildcards for nested folders
)

# Let's inspect the stream object
print("Is streaming:", df_stream.isStreaming)
df_stream.printSchema()

### Writing to a Delta Table
We will use `trigger(availableNow=True)`, which makes the stream behave like a **batch** job: it processes all files that are available when the stream starts (possibly across several micro-batches) and then shuts down. This is cost-effective for periodic ETL.

In [None]:
# Write the stream to a Delta Table
query = (df_stream.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .option("mergeSchema", "true") # Allow writing new columns to the Delta table
    .outputMode("append")
    .trigger(availableNow=True) # Process available files, then stop (otherwise awaitTermination() blocks)
    .table(table_name)
)

# Wait for the batch to finish
query.awaitTermination()

# Check the results
display(spark.sql(f"SELECT * FROM {table_name}"))

## 2. Incremental Processing
Auto Loader tracks processed files in RocksDB (located in the checkpoint directory). If we add a new file, it should pick up **only** the new file in the next run.
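
The idea behind this file tracking can be illustrated without Spark. The sketch below is a simplified stand-in, not Auto Loader's actual implementation: it keeps the set of already-processed file paths in a JSON file (where Auto Loader uses RocksDB), and each run returns only the files it has not seen before. The names `run_once` and `state.json` are made up for this illustration.

```python
import json
import tempfile
from pathlib import Path


def run_once(input_dir: Path, state_file: Path) -> list:
    """Return files not seen in earlier runs and record them as processed."""
    seen = set(json.loads(state_file.read_text())) if state_file.exists() else set()
    current = {str(p) for p in input_dir.rglob("*.csv")}
    new_files = sorted(current - seen)
    state_file.write_text(json.dumps(sorted(seen | current)))
    return new_files


root = Path(tempfile.mkdtemp())
input_dir = root / "input"
state_file = root / "state.json"  # plays the role of the checkpoint

# Day 1: one file lands and the first run picks it up
(input_dir / "2023/10/01").mkdir(parents=True)
(input_dir / "2023/10/01/file_1.csv").write_text("id,name\n1,Alice\n")
first_run = run_once(input_dir, state_file)

# Day 2: a second file lands and the next run returns only the new one
(input_dir / "2023/10/02").mkdir(parents=True)
(input_dir / "2023/10/02/file_2.csv").write_text("id,name\n3,Charlie\n")
second_run = run_once(input_dir, state_file)

print(len(first_run), len(second_run))
```

Deleting `state.json` in this sketch has the same effect as deleting a checkpoint: the next run treats every file as new.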

In [None]:
# 2. Add New Data (Day 2)
data_day_2 = """id,name,amount,date
3,Charlie,150,2023-10-02
4,David,300,2023-10-02"""

create_file(2, "2023-10-02", data_day_2)

# Run the EXACT same stream code again
# It uses the checkpoint to know what to read
query = (df_stream.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .option("mergeSchema", "true") 
    .trigger(availableNow=True) # Process only new data
    .table(table_name)
)

query.awaitTermination()

# Verify that we have 4 records now (2 from Day 1 + 2 from Day 2)
count = spark.sql(f"SELECT count(*) FROM {table_name}").collect()[0][0]
print(f"Total records in table: {count}")
display(spark.sql(f"SELECT * FROM {table_name}"))

## 3. Schema Evolution Modes

What happens if a new file arrives with an **extra column**? Auto Loader has different modes controlled by `cloudFiles.schemaEvolutionMode`.

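| Mode | Description | Behavior |
| :--- | :--- | :--- |
| **`addNewColumns`** | (Default when no schema is provided) The stream stops with an `UnknownFieldException` and the new columns are added to the stored schema. | After a restart, new columns are written to the table. Existing column types do not change. |
| **`failOnNewColumns`** | The stream fails. | It does not restart until the provided schema is updated or the offending file is removed. Useful if schema changes are strictly forbidden. |
| **`rescue`** | Schema is never evolved. | New columns are packed into a `_rescued_data` column. Stream does not fail. |
| **`none`** | Schema is never evolved. | New columns are ignored and are not rescued unless `rescuedDataColumn` is set. Stream does not fail. |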
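
To make the differences between the modes concrete, here is a small pure-Python sketch that applies each mode to a single record. It is an illustration under simplifying assumptions, not how Auto Loader is implemented: `apply_mode` and `UnknownFieldError` are hypothetical names, and `addNewColumns` is shown as evolving immediately, whereas a real stream stops once and picks up the new schema on restart.

```python
import json


class UnknownFieldError(Exception):
    """Stand-in for the error a real stream raises on unexpected columns."""


def apply_mode(schema: list, record: dict, mode: str):
    """Return (schema, row) after applying one evolution mode to one record."""
    extra = {k: v for k, v in record.items() if k not in schema}
    row = {col: record.get(col) for col in schema}

    if not extra or mode == "none":
        return schema, row  # nothing new, or new columns are silently dropped
    if mode == "rescue":
        row["_rescued_data"] = json.dumps(extra)  # schema stays fixed
        return schema, row
    if mode == "failOnNewColumns":
        raise UnknownFieldError(f"unexpected columns: {sorted(extra)}")
    if mode == "addNewColumns":
        # Simplification: the real stream stops once and evolves on restart
        new_schema = schema + list(extra)
        return new_schema, {col: record.get(col) for col in new_schema}
    raise ValueError(f"unknown mode: {mode}")


schema = ["id", "name", "amount", "date"]
record = {"id": "5", "name": "Eve", "amount": "500", "date": "2023-10-03", "region": "US-East"}

for mode in ("addNewColumns", "rescue", "none"):
    new_schema, row = apply_mode(schema, record, mode)
    print(mode, new_schema, row)
```

Only `addNewColumns` changes the schema; `rescue` and `none` keep it fixed and differ in whether the extra value survives.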

Let's test the default behavior: **`addNewColumns`**.

In [None]:
# 3. Add Data with a NEW COLUMN 'region'
# How the stream reacts depends on cloudFiles.schemaEvolutionMode. We did not set it, so the default 'addNewColumns' applies.

data_schema_change = """id,name,amount,date,region
5,Eve,500,2023-10-03,US-East
6,Frank,600,2023-10-03,EU-West"""

create_file(3, "2023-10-03", data_schema_change)

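# Run the stream again.
# Note: in 'addNewColumns' mode, the run that first meets a new column is expected to stop
# with an UnknownFieldException after saving the merged schema to the schema location.
# If that happens, re-run the reader cell and this cell so the stream restarts with the new schema.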
query = (df_stream.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .option("mergeSchema", "true") # Delta table needs this to accept the new column from Auto Loader
    .trigger(availableNow=True)
    .table(table_name)
)

query.awaitTermination()

# Check result - 'region' column should exist now, filled with nulls for older records
display(spark.sql(f"SELECT * FROM {table_name}"))

## 4. Rescue Mode (`rescue`)

Sometimes you don't want to break the schema, but you don't want to lose data either. `rescue` mode puts unexpected data into a special column `_rescued_data`.

*Note: To demo this cleanly, we will start a fresh stream with a new checkpoint and a new schema location, because the inferred schema and the record of processed files from the earlier stream are persisted there.*

In [None]:
# Setup for Rescue Demo
rescue_checkpoint = f"{base_path}/checkpoint_rescue"
rescue_schema_loc = f"{base_path}/schema_rescue"
rescue_table = "autoloader_rescue_demo"

# Create a stream with 'rescue' mode
df_rescue_stream = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    # Set mode to RESCUE
    .option("cloudFiles.schemaEvolutionMode", "rescue") 
    .option("cloudFiles.schemaLocation", rescue_schema_loc)
    .load(f"{input_path}/*/*/*")
)

# Write stream
query_rescue = (df_rescue_stream.writeStream
    .format("delta")
    .option("checkpointLocation", rescue_checkpoint)
    .option("mergeSchema", "true")
    .trigger(availableNow=True)
    .table(rescue_table)
)

query_rescue.awaitTermination()

print("Checking table for _rescued_data column...")
display(spark.sql(f"SELECT * FROM {rescue_table}"))

### Observation on Rescue Mode
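In the output above, because we started fresh, Auto Loader inferred the schema from the files it sampled at startup, which here includes the one with `region`. Since we gave no schema hints this time, the CSV columns are inferred as strings.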

However, if we now add a file with yet another column (e.g., `status`), and the inferred schema is already locked, `rescue` mode will put `status` into the `_rescued_data` JSON column instead of creating a new top-level column.

In [None]:
# Add file with 'status' column
data_rescue_test = """id,name,amount,date,region,status
7,Grace,700,2023-10-04,US-West,Active"""

create_file(4, "2023-10-04", data_rescue_test)

# Run Rescue Stream again
(df_rescue_stream.writeStream
    .format("delta")
    .option("checkpointLocation", rescue_checkpoint)
    .trigger(availableNow=True)
    .table(rescue_table)
).awaitTermination()

# Review Results
# You should see the 'status' data inside the '_rescued_data' column
display(spark.sql(f"SELECT * FROM {rescue_table} WHERE id = 7"))
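
Once rows land with a populated `_rescued_data` column, the rescued values can be pulled back out of the JSON string. The sketch below works on a hand-written sample value rather than on the table itself. The sample string, including its `_file_path` entry and the path inside it, is an assumption about what the column typically holds, so compare it with the actual output above. `rescued_fields` is a hypothetical helper name.

```python
import json

# Hypothetical value of _rescued_data for the row with id = 7 (assumed shape)
sample_rescued = '{"status": "Active", "_file_path": "/tmp/demo/input/2023/10/04/file_4.csv"}'


def rescued_fields(rescued_json):
    """Return rescued column values, dropping bookkeeping keys that start with '_'."""
    if rescued_json is None:
        return {}  # rows that matched the schema carry no rescued data
    parsed = json.loads(rescued_json)
    return {k: v for k, v in parsed.items() if not k.startswith("_")}


print(rescued_fields(sample_rescued))
print(rescued_fields(None))
```

In Spark SQL, a similar lookup can usually be written with `get_json_object(_rescued_data, '$.status')`.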

## 5. File Detection Modes

1.  **Directory Listing (Default):**
    *   Lists files in the input directory to find new ones.
    *   Scales well due to incremental listing, but can be slow for massive historical buckets.
    
2.  **File Notification:**
    *   Uses cloud notification and queue services (AWS SNS and SQS, Azure Event Grid and Queue Storage, Google Cloud Pub/Sub).
    *   Storage sends a notification when a file lands. Auto Loader reads the queue.
    *   **Pros:** Extremely performant for massive directories.
    *   **Cons:** Requires elevated cloud permissions to set up the queues/subscriptions.
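
The difference between the two modes can be sketched in plain Python. This is a conceptual illustration only: a local temporary directory stands in for cloud storage, a `queue.Queue` stands in for the cloud queue, and all function names are made up. The listing function here rescans the whole tree on every call, which is cruder than what Auto Loader does.

```python
import os
import queue
import tempfile


def discover_by_listing(root, seen):
    """Directory listing: walk the tree and keep paths not seen before."""
    found = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if path not in seen:
                seen.add(path)
                found.append(path)
    return sorted(found)


def discover_by_notifications(events):
    """File notification: drain a queue of 'file created' events, no listing needed."""
    found = []
    while True:
        try:
            found.append(events.get_nowait())
        except queue.Empty:
            return found


root = tempfile.mkdtemp()
events = queue.Queue()  # stand-in for a cloud queue
seen = set()


def land_file(name):
    """Write a file and publish the event the storage service would send."""
    path = os.path.join(root, name)
    with open(path, "w") as fh:
        fh.write("id\n1\n")
    events.put(path)
    return path


a = land_file("a.csv")
listed_first = discover_by_listing(root, seen)
notified_first = discover_by_notifications(events)

b = land_file("b.csv")
listed_second = discover_by_listing(root, seen)
notified_second = discover_by_notifications(events)

print(listed_second == notified_second == [b])
```

Both approaches find the same new file. The listing version has to look at every existing path to do so, while the notification version only touches the events that arrived, which is why notifications pay off for very large directories.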

**How to enable File Notification:**
```python
.option("cloudFiles.useNotifications", "true")
```

This requires cloud-side setup and will not run in the standard Community Edition without cloud configuration, so it is kept here as a reference code block only.

In [None]:
# Cleanup
# Remove temp files and tables
dbutils.fs.rm(base_path, True)
spark.sql(f"DROP TABLE IF EXISTS {table_name}")
spark.sql(f"DROP TABLE IF EXISTS {rescue_table}")
print("Cleanup complete.")