## [00] Project Planning

### Objective

Build an LLM-powered **Schema Drift Detection & Auto‑Resolution Agent** in Databricks. The goal is to **automatically identify** structural changes in Delta table schemas and **suggest code updates** (PySpark or SQL) to handle them. This helps avoid broken pipelines and reduces manual intervention.

### Problem Statement

- **Schema drift** occurs when data schemas change unexpectedly (e.g., added, removed, renamed columns, or changed data types).
- These changes can **break ETL pipelines**, cause downstream failures, and require manual fixes.
- We need a system that can **detect drift**, **explain it to humans**, and **generate resolution code** automatically within Databricks.

### High-Level Architecture

1. **Fetch** current and new table schemas from Delta.
2. **Compare** schemas to compute a structured diff.
3. **Generate prompt** for LLM, describing the schema changes.
4. **LLM returns** a code snippet for resolution (e.g., modify DataFrame or ALTER TABLE).
5. **Display** the suggestion in notebook; allow engineer to **review and execute**.
6. *(Future scope)* Automatically apply changes or log outputs for monitoring.
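
Steps 2 and 3 can be sketched in plain Python. This is a minimal illustration, not the final design: it assumes each schema is available as a `{column_name: type_string}` mapping, and the names `SchemaDiff`, `diff_schemas`, and `build_prompt` are hypothetical. A rename shows up here as one removed plus one added column, since a name-based comparison cannot tell the two apart.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaDiff:
    """Structured result of comparing two schemas (column name -> type string)."""
    added: dict = field(default_factory=dict)         # columns only in the new schema
    removed: dict = field(default_factory=dict)       # columns only in the current schema
    type_changed: dict = field(default_factory=dict)  # name -> (old_type, new_type)

    def is_empty(self) -> bool:
        return not (self.added or self.removed or self.type_changed)


def diff_schemas(current: dict, new: dict) -> SchemaDiff:
    """Step 2: compare two {column: type} mappings."""
    diff = SchemaDiff()
    for name, dtype in new.items():
        if name not in current:
            diff.added[name] = dtype
        elif current[name] != dtype:
            diff.type_changed[name] = (current[name], dtype)
    for name, dtype in current.items():
        if name not in new:
            diff.removed[name] = dtype
    return diff


def build_prompt(table: str, diff: SchemaDiff) -> str:
    """Step 3: describe the drift in plain text for the LLM."""
    lines = [f"Schema drift detected on Delta table `{table}`:"]
    for name, dtype in diff.added.items():
        lines.append(f"- added column {name} ({dtype})")
    for name, dtype in diff.removed.items():
        lines.append(f"- removed column {name} ({dtype})")
    for name, (old, new) in diff.type_changed.items():
        lines.append(f"- column {name} changed type from {old} to {new}")
    lines.append("Suggest PySpark or SQL code to resolve these changes.")
    return "\n".join(lines)
```

In a notebook, the two mappings could be built from `dict(df.dtypes)` for the current table and the incoming DataFrame.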

### Milestones Overview

| Week | Main Focus |
|------|------------|
| 1    | Research, planning, schema evolution tests |
| 2    | Schema comparison logic (PySpark) |
| 3    | Prompt engineering + LLM integration |
| 4    | End-to-end prototype development |
| 5    | Testing, validation, prompt refinement |
| 6    | Documentation, demo, final wrap-up |

### Learning: Delta Schema Enforcement & Evolution
**Link**: https://www.databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html

#### 1. Schema Enforcement Definition and Behavior

Delta Lake’s schema enforcement (also known as schema validation) ensures data quality by rejecting any write that doesn’t match the target table’s schema. It acts as a gatekeeper, blocking writes with extra columns or mismatched types.

#### 2. Enforcement Rules

The blog outlines these key rules for schema enforcement:
- **No additional columns** in the incoming DataFrame that aren't already present in the table.

- **Missing columns** are allowed and filled with ```NULL```.

- **Data types must match exactly** - strict type matching is enforced.

- **Column names must match case-insensitively**, and Delta Lake forbids having both ```Foo``` and ```foo``` as separate columns.
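
To make the four rules concrete, here is a simplified pure-Python model of them. It only mirrors the rules as summarized above and is not Delta Lake's actual implementation; `enforcement_violations` is a hypothetical helper, and schemas are assumed to be `{column_name: type_string}` mappings.

```python
def enforcement_violations(table_schema: dict, incoming_schema: dict) -> list:
    """Return reasons a write would be rejected under the rules listed above.

    Both arguments map column name -> type string. An empty list means the
    write passes this simplified check.
    """
    violations = []

    # Rule 4: names are compared case-insensitively, so two incoming columns
    # that differ only by case are ambiguous.
    seen = {}
    for name in incoming_schema:
        key = name.lower()
        if key in seen:
            violations.append(f"columns differ only by case: {seen[key]}, {name}")
        seen[key] = name

    table_by_lower = {name.lower(): dtype for name, dtype in table_schema.items()}
    for name, dtype in incoming_schema.items():
        expected = table_by_lower.get(name.lower())
        if expected is None:
            # Rule 1: no additional columns.
            violations.append(f"unexpected column: {name}")
        elif expected != dtype:
            # Rule 3: data types must match.
            violations.append(
                f"type mismatch on {name}: table has {expected}, got {dtype}"
            )

    # Rule 2: columns missing from the incoming data are allowed (filled with
    # NULL), so they produce no violation.
    return violations
```

A check like this could let the agent predict a rejected write before it happens and explain the reason in plain language.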

#### 3. Schema Evolution Mechanisms
When ```.option("mergeSchema", "true")``` is used:
  - New columns added in the DataFrame are automatically appended to the table’s schema.

  - Nested struct fields are also supported.

  - Conversions from ```NullType``` to any other type (e.g. ```NullType``` → ```StringType```) and certain upcasts (```Byte → Short → Integer```) are handled.
  This makes evolution seamless for common changes.

#### 4. Use Cases & Trade-offs
- **Schema enforcement** is ideal for production-quality tables feeding downstream systems like ML models and BI dashboards—offering strong data integrity and preventing accidental schema drift.

- **Schema evolution** is useful when you intend to change schemas, letting you add columns without manual intervention. However:
  - It doesn’t handle **column removal**, **in-place type changes**, or **renames** (especially case changes)—these require ```.option("overwriteSchema", "true")``` or DDL commands.
  - It’s purposely limited so as not to silently break downstream expectations.

### Learning: Auto Loader Schema Drift
**Link:** https://community.databricks.com/t5/technical-blog/schema-management-and-drift-scenarios-via-databricks-auto-loader/ba-p/63393

#### 1. Schema Inference Mechanism
- Auto Loader **samples the first 50 GB or 1,000 files** it discovers, whichever limit is reached first, to infer the schema for the input directory.

- It writes inferred schemas into a ```_schemas``` folder under the configured ```cloudFiles.schemaLocation```; this becomes the source of truth for schema evolution over time.

#### 2. Supported File Formats & Type Inference
- **JSON, CSV, XML:** Auto Loader infers everything as strings unless you enable ```.option("cloudFiles.inferColumnTypes", "true")```.  

- **Text, Binary:** These formats don't support evolution.

- **Parquet, Avro:** Typed formats retain their native data types and are merged during sampling. On type conflicts, it chooses the widest type (e.g., long over int), unless overridden by ```schemaHints```.
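
The options mentioned so far can be collected in one place. The sketch below builds the option map for a `cloudFiles` reader; `auto_loader_options` is a hypothetical helper, and the paths and hint string in the comments are placeholders.

```python
def auto_loader_options(file_format, schema_location,
                        infer_column_types=False, schema_hints=None):
    """Build the option map for a `cloudFiles` stream reader."""
    options = {
        "cloudFiles.format": file_format,
        # Auto Loader persists inferred schemas under <schema_location>/_schemas.
        "cloudFiles.schemaLocation": schema_location,
    }
    if infer_column_types:
        # Without this, JSON/CSV/XML columns are all inferred as strings.
        options["cloudFiles.inferColumnTypes"] = "true"
    if schema_hints:
        # Example hint string: "amount DECIMAL(18,2), event_ts TIMESTAMP"
        options["cloudFiles.schemaHints"] = schema_hints
    return options


# On Databricks, where `spark` is the active SparkSession (paths are placeholders):
# stream = (
#     spark.readStream.format("cloudFiles")
#     .options(**auto_loader_options("json", "/tmp/schemas/orders", True))
#     .load("/tmp/landing/orders")
# )
```

Keeping the options in a plain dictionary also gives the agent something simple to inspect when it explains how a stream is configured.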

#### 3. Automatic Schema Evolution  
- When Auto Loader detects **new columns**, it:
  1. Stops the stream with an `UnknownFieldException`.
  2. Merges the new column(s) into the schema, placing them at the end.
  3. Keeps existing column types intact.  
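- **Note:** The stream must be restarted to resume with the updated schema. Databricks recommends running Auto Loader streams under Lakeflow Jobs so that they restart automatically after such schema changes.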

#### 4. ```cloudFiles.schemaEvolutionMode``` Control  
Auto Loader supports several modes:

| Mode | Behavior |
|------|----------|
| `addNewColumns` | (Default) Stream **fails**, schema file updated to include new columns. |
| `rescue` | Stream continues; unexpected fields go into a `_rescued_data` column. |
| `failOnNewColumns` | Stream fails; schema is NOT updated until manually changed. |
| `none` | Stream continues; new columns are ignored unless `rescuedDataColumn` is set. |

- The default behavior is `addNewColumns` unless a user-supplied schema is provided, in which case the default changes to `none`.
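
For the agent, the mode table can be encoded as a small lookup. This is only a sketch of how the behaviors above might be represented; `EVOLUTION_MODES`, `default_evolution_mode`, and `needs_restart` are hypothetical names.

```python
# mode -> (stream fails on new columns, stored schema is updated)
EVOLUTION_MODES = {
    "addNewColumns": (True, True),
    "rescue": (False, False),
    "failOnNewColumns": (True, False),
    "none": (False, False),
}


def default_evolution_mode(user_schema_provided: bool) -> str:
    """Default mode as described above: `none` when a schema is supplied."""
    return "none" if user_schema_provided else "addNewColumns"


def needs_restart(mode: str) -> bool:
    """True when new columns stop the stream, so something must restart it."""
    stream_fails, _schema_updated = EVOLUTION_MODES[mode]
    return stream_fails
```

With this lookup, the agent can tell an engineer whether a given configuration will halt on new columns or silently continue.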

#### 5. `_rescued_data` Column  
- If using ```rescue``` mode (or enabling ```rescuedDataColumn```), unmatched fields are captured in a ```_rescued_data``` column rather than being dropped.  
- You can rename this column via the ```rescuedDataColumn``` option.
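
One way the agent might inspect rescued values is sketched below. It assumes each `_rescued_data` cell is a JSON object string that also carries the source file path under a `_file_path` key; both assumptions should be checked against real data, and `rescued_fields` is a hypothetical helper.

```python
import json


def rescued_fields(rescued_json, path_key="_file_path"):
    """Return the rescued column values from one `_rescued_data` cell.

    Assumes the cell is a JSON object string (or None when nothing was
    rescued) and that the source file path is stored under `path_key`.
    """
    if not rescued_json:
        return {}
    payload = json.loads(rescued_json)
    # Drop the bookkeeping key so only the unexpected columns remain.
    return {k: v for k, v in payload.items() if k != path_key}
```

The keys returned here are candidate new columns that the agent could propose adding to the target schema.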

#### 6. Partition Columns Are Ignored in Drift  
- Auto Loader can detect Hive-style partition columns (e.g., ```/date=2025-01-01/```).  
- **Partition evolution is not supported**: partition columns that appear later (for example, a new ```/hour=01/``` level) are not added to the schema unless they are listed explicitly in ```cloudFiles.partitionColumns```.
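
To detect a new partition level, the agent could parse Hive-style paths and build the option value itself. The helpers below are hypothetical and use plain string handling only; the example path in the usage note is made up.

```python
def hive_partition_columns(path: str) -> list:
    """Return partition column names found in a Hive-style path, in order."""
    columns = []
    for segment in path.strip("/").split("/"):
        name, sep, value = segment.partition("=")
        # A partition segment looks like `name=value`; skip everything else.
        if sep and name and value:
            columns.append(name)
    return columns


def partition_columns_option(path: str) -> dict:
    """Build the `cloudFiles.partitionColumns` option for the given layout."""
    return {"cloudFiles.partitionColumns": ",".join(hive_partition_columns(path))}
```

For a path such as `base_path/event=click/date=2021-04-01/hour=01/f1.json`, the option value lists `event`, `date`, and `hour`, which can be compared with the columns the stream already knows about.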

### How Research Informs the Project

#### 1. Schema Enforcement & Evolution (Delta Lake)
- Clarifies which schema changes Delta *handles automatically* (column additions, ```NullType``` conversions, safe upcasts, and new nested fields), so the agent avoids duplicating that work.

- Defines the boundaries of *unsupported changes* (e.g., column drops, renames, type conversions) that require explicit intervention, allowing the agent to detect and prompt for these specific adjustments.

- Supports development of logic that differentiates between "safe" drift (handled by the system) and "risky" drift (requiring agent-generated code), improving system reliability.
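
The safe/risky split could start as a small rule table. The sketch below encodes only the cases named in the research notes; the type names are illustrative strings that should be mapped to whatever the schema diff actually produces, and `classify_change` is a hypothetical helper.

```python
# Upcast chain named in the Delta blog post: Byte -> Short -> Integer.
UPCAST_ORDER = ["byte", "short", "integer"]


def is_safe_type_change(old: str, new: str) -> bool:
    """True for NullType -> anything, or a widening step along UPCAST_ORDER."""
    if old == "null":
        return True
    if old in UPCAST_ORDER and new in UPCAST_ORDER:
        return UPCAST_ORDER.index(old) < UPCAST_ORDER.index(new)
    return False


def classify_change(kind: str, old_type=None, new_type=None) -> str:
    """Label one change as 'safe' (mergeSchema handles it) or 'risky'.

    `kind` is one of 'added', 'removed', 'renamed', 'type_changed'.
    """
    if kind == "added":
        return "safe"
    if kind == "type_changed" and is_safe_type_change(old_type, new_type):
        return "safe"
    # Removed or renamed columns and other type changes need
    # overwriteSchema or DDL, so the agent should generate code for them.
    return "risky"
```

Only changes labeled `risky` would need to reach the LLM; the rest can be reported as handled by Delta.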

#### 2. Auto Loader Schema Management (Streaming Context)
- Highlights how Auto Loader samples schemas and manages drift through ```_schemas``` metadata and ```cloudFiles.schemaEvolutionMode``` configurations (e.g., ```addNewColumns```, ```rescue```), which enables the agent to align its behavior with real-time ingestion pipelines.

- Identifies scenarios—like data captured in ```_rescued_data```, unsupported partition-schema changes, or streaming failures—where the agent should suggest specific modes or extraction logic, enhancing streaming robustness.