In [0]:
# Load data - Explore the smaller October dataset
df_oct = spark.read.csv("/Volumes/workspace/ecommerce/ecommerce_data/2019-Oct.csv", header=True, inferSchema=True)

Step 1: Write df_oct as Delta

df_oct.write
→ Starts a write operation from a Spark DataFrame

.format("delta")
→ Tells Spark to write in Delta Lake format rather than plain Parquet or CSV

.mode("overwrite")
→ Replace existing data if it exists

.save("/Volumes/workspace/ecommerce/ecommerce_data/df_oct_delta")
→ Saves the data to this folder in the volume
→ This folder will now contain:

Parquet data files and
_delta_log/ (the transaction log)
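The folder layout described above can be checked from plain Python once the write has run. This is a minimal sketch, assuming the volume path is reachable through the local filesystem of the cluster; `summarize_delta_dir` is a hypothetical helper written for this note, not part of Spark or Delta Lake, and it only looks at file names.

```python
from pathlib import Path


def summarize_delta_dir(path):
    """Summarize a folder that is expected to hold a Delta table.

    Hypothetical helper for illustration; it inspects file names only.
    """
    root = Path(path)
    log_dir = root / "_delta_log"
    # Data files are Parquet. Search recursively because partitioned
    # tables keep them in subfolders, and skip anything under _delta_log/.
    data_files = [
        p for p in root.rglob("*.parquet")
        if "_delta_log" not in p.relative_to(root).parts
    ]
    # Each commit is recorded as a numbered JSON file in _delta_log/.
    commits = list(log_dir.glob("*.json")) if log_dir.is_dir() else []
    return {
        "has_delta_log": log_dir.is_dir(),
        "parquet_files": len(data_files),
        "commits": len(commits),
    }
```

Calling `summarize_delta_dir("/Volumes/workspace/ecommerce/ecommerce_data/df_oct_delta")` after the write should report `has_delta_log` as `True` if the folder really is a Delta table; the file and commit counts depend on the data and on how often the folder was overwritten.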

In [0]:
%sql
SHOW VOLUMES IN workspace.ecommerce;

database,volume_name
ecommerce,ecommerce_data


In [0]:
df_oct.write \
    .format("delta") \
    .mode("overwrite") \
    .save("/Volumes/workspace/ecommerce/ecommerce_data/df_oct_delta")


The cell above only writes Delta files to a location; it does not create a named table. The files are on disk, but the SQL engine has no table name for them.

To read the data back, we have to specify the path explicitly, because there is no name that Spark remembers.

In [0]:
spark.read.format("delta").load("/Volumes/workspace/ecommerce/ecommerce_data/df_oct_delta")


DataFrame[event_time: timestamp, event_type: string, product_id: int, category_id: bigint, category_code: string, brand: string, price: double, user_id: int, user_session: string]

The Delta data already exists on disk.
The next step is to make it available as a named table.

There are two ways to create Delta tables:

[1] External Delta table → registers data that already exists at a path
[2] Managed Delta table → Databricks manages storage (saveAsTable writes its own copy of the data)
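For option [1], the registration is usually done with a `CREATE TABLE ... USING DELTA LOCATION` statement. The sketch below only builds that statement as a string; `external_table_ddl`, the table name `df_oct_external`, and the path `/path/to/df_oct_delta` are hypothetical names used for illustration. Whether a given path is accepted as a table location depends on the workspace's catalog setup (Unity Catalog may not accept a volume path, for example), and the sketch does not check that.

```python
def external_table_ddl(table_name: str, location: str) -> str:
    """Build a statement that points a table name at existing Delta files."""
    # Keep the sketch simple: refuse paths that would break the SQL string literal.
    if "'" in location:
        raise ValueError("location must not contain single quotes")
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} "
        f"USING DELTA LOCATION '{location}'"
    )


# Hypothetical table name and storage path, for illustration only.
ddl = external_table_ddl("df_oct_external", "/path/to/df_oct_delta")
print(ddl)
# prints: CREATE TABLE IF NOT EXISTS df_oct_external USING DELTA LOCATION '/path/to/df_oct_delta'
```

In a notebook with an active session, the statement would be run with `spark.sql(ddl)`. No data is copied in that case; the table name simply points at the existing folder.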

In [0]:
# Create managed table
df_oct.write.format("delta").saveAsTable("df_oct_table")

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
File <command-7293070745601141>, line 2
      1 # Create managed table
----> 2 df_oct.write.format("delta").saveAsTable("df_oct_table")

File /databricks/python/lib/python3.12/site-packages/pyspark/sql/connect/readwriter.py:737, in DataFrameWriter.saveAsTable(self, name, format, mode, partitionBy, **options)
    735 self._write.table_name = name
    736 self._write.table_save_method = "save_as_table"

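The cell above registers the data under the table name df_oct_table, so Spark remembers where the data resides. Note that saveAsTable uses the errorifexists mode by default, so it raises an AnalysisException when a table with that name already exists; the traceback above is truncated and does not show which cause applied here. Once the table exists, we can read it using SQL or Python:

SELECT COUNT(*) FROM df_oct_table

spark.table("df_oct_table")

No path is needed, because Spark already knows where the data is stored.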

Schema enforcement - small example

In [0]:
# Create a small Delta table that has strict schema
schema_sample = "user_id INT, price DOUBLE, purchase_mode STRING"

data_sample = [(100, 100.0, "store"), (200, 250.5, "online")]

df_sample = spark.createDataFrame(data_sample, schema=schema_sample)

df_sample.write.format("delta").mode("overwrite").saveAsTable("delta_schema_test")


In [0]:
# Insert incompatible data

# Wrong types: user_id and price are strings, but the table expects INT and DOUBLE
flawed_data = [("Shah Rukh Khan", "100", "purchase")]

df_flawed = spark.createDataFrame(flawed_data, ["user_id", "price", "purchase_mode"])

df_flawed.write.format("delta").mode("append").saveAsTable("delta_schema_test")


---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
File <command-6976869407386029>, line 8
      4 flawed_data = [("Shah Rukh Khan", "100", "purchase")]
      6 df_flawed = spark.createDataFrame(flawed_data, ["user_id", "price", "purchase_mode"])
----> 8 df_flawed.write.format("delta").mode("append").saveAsTable("
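The append is rejected because the incoming column types conflict with the table's types. The sketch below mimics that comparison in plain Python so the conflict can be seen before writing. It is an illustration only, not Delta Lake's actual enforcement logic; `type_conflicts` is a hypothetical helper, and the two dictionaries are typed in by hand here (in a notebook they could be built with `dict(df.dtypes)`).

```python
def type_conflicts(table_schema, incoming_schema):
    """List columns present in both schemas whose types differ.

    Hypothetical helper for illustration; schemas are {column: type} dicts.
    """
    conflicts = []
    for name, expected in table_schema.items():
        actual = incoming_schema.get(name)
        # Only compare columns that exist on both sides.
        if actual is not None and actual != expected:
            conflicts.append(f"{name}: expected {expected}, got {actual}")
    return conflicts


# Types of delta_schema_test, as declared in schema_sample above.
table_schema = {"user_id": "int", "price": "double", "purchase_mode": "string"}
# Types Spark infers for flawed_data: every value is a Python str.
flawed_schema = {"user_id": "string", "price": "string", "purchase_mode": "string"}

for conflict in type_conflicts(table_schema, flawed_schema):
    print(conflict)
# prints:
# user_id: expected int, got string
# price: expected double, got string
```

A check like this is only a convenience for readable messages; the table itself still enforces its schema on every write.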

In [0]:
# Test schema enforcement on df_oct

In [0]:
df_oct.printSchema()


root
 |-- event_time: timestamp (nullable = true)
 |-- event_type: string (nullable = true)
 |-- product_id: integer (nullable = true)
 |-- category_id: long (nullable = true)
 |-- category_code: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- price: double (nullable = true)
 |-- user_id: integer (nullable = true)
 |-- user_session: string (nullable = true)



In [0]:
# Test schema enforcement using df_oct
try:
    wrong_schema = spark.createDataFrame([("Ullu","Goobe","Owl")], ["event_time","event_type","product_id"])
    wrong_schema.write.format("delta").mode("append").saveAsTable("df_oct_table")
except Exception as e:
    print(f"Schema enforcement: {e}")

Schema enforcement: [DELTA_FAILED_TO_MERGE_FIELDS] Failed to merge fields 'event_time' and 'event_time'.

JVM stacktrace:
com.databricks.sql.transaction.tahoe.DeltaAnalysisException
	at com.databricks.sql.transaction.tahoe.schema.SchemaMergingUtils$.$anonfun$mergeDataTypes$1(SchemaMergingUtils.scala:231)
	at scala.collection.ArrayOps$.map$extension(ArrayOps.scala:936)
	at com.databricks.sql.transaction.tahoe.schema.SchemaMergingUtils$.merge$1(SchemaMergingUtils.scala:217)
	at com.databricks.sql.transaction.tahoe.schema.SchemaMergingUtils$.mergeDataTypes(SchemaMergingUtils.scala:335)
	at com.databricks.sql.transaction.tahoe.schema.SchemaMergingUtils$.mergeSchemas(SchemaMergingUtils.scala:179)
	at com.databricks.sql.transaction.tahoe.schema.ImplicitMetadataOperation$.mergeSchema(ImplicitMetadataOperation.scala:359)
	at com.databricks.sql.transaction.tahoe.schema.ImplicitMetadataOperation.updateMetadata(ImplicitMetadataOperation.scala:112)
	at com.databricks.sql.transaction.tahoe.schema.I