## [01] Schema Evolution Test

This notebook demonstrates how Delta Lake enforces and evolves schemas using `.option("mergeSchema", "true")` and `.option("overwriteSchema", "true")`. All code assumes the write path is a valid, writable location (e.g., a mount such as `/mnt/...` or a volume path like the one defined below).

In [0]:
# Define the storage path for the Delta table
path = "/Volumes/workspace/default/schema_evolution_test"
print("Delta write path:", path)

In [0]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create base DataFrame with two columns
df_base = spark.createDataFrame(
    [(1, "Alice")],
    ["id", "name"]
)

# Overwrite any existing table at the path
df_base.write.format("delta") \
    .mode("overwrite") \
    .save(path)

# Confirm the schema
print("Initial schema:")
spark.read.format("delta").load(path).printSchema()

In [0]:
# Introduce a new 'age' column
df_new = spark.createDataFrame([(2, "Bob", 30)], ["id", "name", "age"])

try:
    df_new.write.format("delta") \
        .mode("append") \
        .save(path)
except Exception as e:
    print("Expected enforcement error:\n", e)

In [0]:
df_new.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save(path)

print("Schema after mergeSchema:")
spark.read.format("delta").load(path).printSchema()

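Without `mergeSchema`, the append fails because the incoming DataFrame has an extra column (`age`) that is not in the table schema. With `mergeSchema` set to `true`, Delta adds the column to the table schema and the append succeeds; existing rows read back with `null` in the new column.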
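
To make the mismatch concrete, the following is a minimal sketch of how the extra column could be detected before writing. It is not part of Delta Lake: `new_columns` is a hypothetical helper, and the two lists are hand-written stand-ins shaped like the output of `DataFrame.dtypes` (a list of `(name, type)` tuples), so the snippet runs without a Spark session. Names are compared case-insensitively here, which is an assumption of this sketch.

```python
def new_columns(table_dtypes, incoming_dtypes):
    """Return incoming column names that are absent from the table, in incoming order."""
    # Assumption of this sketch: column names are compared case-insensitively
    existing = {name.lower() for name, _ in table_dtypes}
    return [name for name, _ in incoming_dtypes if name.lower() not in existing]


# Hand-written stand-ins for df.dtypes of the table and of df_new
table_dtypes = [("id", "bigint"), ("name", "string")]
incoming_dtypes = [("id", "bigint"), ("name", "string"), ("age", "bigint")]

added = new_columns(table_dtypes, incoming_dtypes)
print("Columns not yet in the table:", added)
```

In the notebook, the same helper could be called as `new_columns(spark.read.format("delta").load(path).dtypes, df_new.dtypes)` to decide whether `mergeSchema` is needed before appending.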

In [0]:
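# Attempt to change 'age' from a numeric type to string
df_type_change = spark.createDataFrame([(3, "Charlie", "35")], ["id", "name", "age"])

# mergeSchema does not cover incompatible type changes, so this append is
# expected to fail; catch the error so the cell still prints the schema
try:
    df_type_change.write.format("delta") \
        .mode("append") \
        .option("mergeSchema", "true") \
        .save(path)
except Exception as e:
    print("Type change rejected:\n", e)

print("Schema after type change attempt:")
spark.read.format("delta").load(path).printSchema()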

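Delta supports only a limited set of safe type promotions. Changing `age` from a numeric type to string is not one of them, so the append above is expected to be rejected and the table schema should be unchanged. Review the printed schema to verify the behavior in your environment.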
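
As a rough illustration of what "safe" means here, the sketch below models the upcast chain commonly documented for schema evolution (ByteType → ShortType → IntegerType) using the type names that `DataFrame.dtypes` reports. `UPCAST_CHAIN` and `is_safe_upcast` are hypothetical names, and the chain is a simplified assumption rather than Delta's actual merge logic; newer Delta versions may accept additional widenings.

```python
# Assumed upcast chain, narrowest to widest, using Spark SQL type names
# (tinyint = ByteType, smallint = ShortType, int = IntegerType)
UPCAST_CHAIN = ["tinyint", "smallint", "int"]


def is_safe_upcast(table_type, incoming_type):
    """True if incoming_type equals table_type or is wider within the assumed chain."""
    if table_type == incoming_type:
        return True
    if table_type in UPCAST_CHAIN and incoming_type in UPCAST_CHAIN:
        return UPCAST_CHAIN.index(table_type) < UPCAST_CHAIN.index(incoming_type)
    # Anything outside the chain is treated as unsafe in this simplified model
    return False


print(is_safe_upcast("tinyint", "int"))    # widening within the chain
print(is_safe_upcast("bigint", "string"))  # the change attempted in the cell above
```

Under this model the first call returns `True` and the second returns `False`, which matches the rejection of the string `age` column.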

In [0]:
# Rename 'name' to 'full_name'
df_renamed = spark.read.format("delta").load(path) \
    .withColumnRenamed("name", "full_name")

df_renamed.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save(path)

print("Schema after overwriteSchema rename:")
spark.read.format("delta").load(path).printSchema()

Renames (and other structural changes) require an overwrite with `overwriteSchema` set to `true`, which replaces both the table's data and its schema.

| Scenario | Feature Tested | Expected Behavior | Notes |
|----------|----------------|--------------------|-------|
| Append data with same schema | Baseline | ✅ Data appends successfully | Confirms Delta can accept schema-matching data without error |
| Append data with **new column** | Schema Enforcement | ❌ Fails unless `.option("mergeSchema", "true")` is set | Delta blocks writes with unknown columns by default |
| Append data with **new column** + mergeSchema | Schema Evolution | ✅ Delta adds the new column to the table schema | Only supported for additions; no error thrown |
| Append data with **changed type** | Schema Evolution (nullable/upcast) | ✅ Sometimes allowed, depending on compatibility (e.g., ByteType → ShortType → IntegerType) | Delta allows a limited set of upcasts; other type changes require overwrite or DDL |
| Rename column | Not supported by mergeSchema | ❌ Delta throws error unless `.option("overwriteSchema", "true")` is used | Renaming needs full overwrite to avoid data integrity issues |
| Overwrite table with new schema | Overwrite Schema | ✅ Succeeds with `.option("overwriteSchema", "true")` | Replaces entire schema, used for renames or redefinitions |
| Column removal | Not supported by mergeSchema | ❌ Must use overwriteSchema or DDL | Delta does not auto-remove columns; must be done manually |
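
The table can be condensed into a small decision helper. This is a simplified sketch, not Delta's own logic: `required_write_option` is a hypothetical function that answers "which option would make the table schema match the incoming schema?" from two `DataFrame.dtypes`-style lists. It treats every type change as needing an overwrite (ignoring the safe upcasts) and sees a rename as one removed column plus one added column.

```python
def required_write_option(table_dtypes, incoming_dtypes):
    """Return "none", "mergeSchema", or "overwriteSchema" for a schema change."""
    table = dict(table_dtypes)
    incoming = dict(incoming_dtypes)

    removed = [c for c in table if c not in incoming]
    retyped = [c for c in table if c in incoming and table[c] != incoming[c]]
    added = [c for c in incoming if c not in table]

    # Removals, renames, and type changes are treated as structural changes
    if removed or retyped:
        return "overwriteSchema"
    # Pure additions can be handled by schema evolution
    if added:
        return "mergeSchema"
    return "none"


base = [("id", "bigint"), ("name", "string")]
print(required_write_option(base, base))
print(required_write_option(base, base + [("age", "bigint")]))
print(required_write_option(base, [("id", "bigint"), ("full_name", "string")]))
```

The three calls correspond to the baseline append, the new-column append, and the rename scenario from the notebook.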

#### Takeaway
- Delta Lake supports automatic schema evolution **only for column additions** and some **safe type promotions** when `.option("mergeSchema", "true")` is used.

- Schema enforcement is strict by default to prevent data corruption.

- Advanced changes like renaming or deleting columns require `.option("overwriteSchema", "true")` with an overwrite, or manual DDL updates.