## Load your e-commerce data

In [0]:
# Load your e-commerce data
events = spark.read.csv("/Volumes/workspace/ecommerce/ecommerce_data/2019-Oct.csv", header=True, inferSchema=True)
print(f"Loaded {events.count()} rows from CSV")

Loaded 42448764 rows from CSV
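
`inferSchema=True` makes Spark take an extra pass over the file to guess column types, which is noticeable at this row count. A minimal sketch of the alternative is to pass an explicit schema as a DDL string. The column names and types are taken from the table schema printed in the schema-enforcement step of this notebook. `EVENT_COLUMNS`, `EVENTS_SCHEMA_DDL`, and `load_events` are hypothetical names, and the sketch assumes the default timestamp parsing handles `event_time` the same way inference did.

```python
# Column names and Spark SQL types, copied from the table schema shown in this notebook.
EVENT_COLUMNS = [
    ("event_time", "TIMESTAMP"),
    ("event_type", "STRING"),
    ("product_id", "INT"),
    ("category_id", "BIGINT"),
    ("category_code", "STRING"),
    ("brand", "STRING"),
    ("price", "DOUBLE"),
    ("user_id", "INT"),
    ("user_session", "STRING"),
]

# DDL-formatted schema string: "event_time TIMESTAMP, event_type STRING, ..."
EVENTS_SCHEMA_DDL = ", ".join(f"{name} {dtype}" for name, dtype in EVENT_COLUMNS)


def load_events(spark, csv_path):
    """Read the CSV with a fixed schema instead of inferring one (hypothetical helper)."""
    return spark.read.csv(csv_path, header=True, schema=EVENTS_SCHEMA_DDL)
```

In the notebook this would be called as `events = load_events(spark, "/Volumes/workspace/ecommerce/ecommerce_data/2019-Oct.csv")`. If `event_time` comes back null, a `timestampFormat` option is probably needed.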


## 1: Convert CSV to Delta format

In [0]:
delta_path = "/Volumes/workspace/ecommerce/ecommerce_data/events_delta"
events.write.format("delta").mode("overwrite").save(delta_path)
print(f"✓ Task 1: Saved Delta to {delta_path}")

✓ Task 1: Saved Delta to /Volumes/workspace/ecommerce/ecommerce_data/events_delta
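
A quick way to confirm the conversion is to read the Delta files back and compare the row count with the source DataFrame. The helper below is only a sketch: `delta_matches_source` is a hypothetical name, and it assumes the `spark`, `events`, and `delta_path` objects from the cells above.

```python
def delta_matches_source(spark, source_df, delta_path):
    """Return True when the Delta table at delta_path has as many rows as source_df."""
    # Load the data back from the Delta path that was just written
    written = spark.read.format("delta").load(delta_path)
    # Both counts trigger a full scan, so this is a one-off sanity check
    return written.count() == source_df.count()
```

Usage in the notebook would look like `print(delta_matches_source(spark, events, delta_path))`.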


## 2: Create managed table

In [0]:
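# mode("overwrite") lets the cell be re-run; the default mode fails if the table already exists
events.write.format("delta").mode("overwrite").saveAsTable("events_table")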
print("✓ Step 2: Created managed table 'events_table'")

✓ Step 2: Created managed table 'events_table'


## 3: SQL approach


In [0]:
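# CREATE OR REPLACE keeps the cell re-runnable; plain CREATE TABLE fails if events_delta exists
spark.sql("""
    CREATE OR REPLACE TABLE events_delta
    USING DELTA
    AS SELECT * FROM events_table
""")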


DataFrame[num_affected_rows: bigint, num_inserted_rows: bigint]
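
Delta records each write as a versioned commit, so the `CREATE TABLE ... AS SELECT` above should appear in the table history. A sketch of how to inspect it follows. `table_history` is a hypothetical helper, and the three selected columns are just one reasonable subset of what `DESCRIBE HISTORY` returns.

```python
def table_history(spark, table_name):
    """Return version, timestamp, and operation for each commit of a Delta table."""
    history = spark.sql(f"DESCRIBE HISTORY {table_name}")
    return history.select("version", "timestamp", "operation")
```

For example, `table_history(spark, "events_delta").show(truncate=False)`.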

## 4: Test schema enforcement

In [0]:
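# Appending a DataFrame whose columns (x, y, z) do not exist in the table should be rejected
try:
    wrong_schema = spark.createDataFrame([("a","b","c")], ["x","y","z"])
    wrong_schema.write.format("delta").mode("append").save(delta_path)
except Exception as e:
    # Delta raises a schema mismatch error instead of silently writing the bad rows
    print(f"Schema enforcement: {e}")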

print("✓ All 4 tasks completed!")

Schema enforcement: [_LEGACY_ERROR_TEMP_DELTA_0007] A schema mismatch detected when writing to the Delta table (Table ID: 3ab275ef-9108-4584-92ac-c13172315126).
To enable schema migration using DataFrameWriter or DataStreamWriter, please set:
'.option("mergeSchema", "true")'.
For other operations, set the session configuration
spark.databricks.delta.schema.autoMerge.enabled to "true". See the documentation
specific to the operation for details.

Table schema:
root
-- event_time: timestamp (nullable = true)
-- event_type: string (nullable = true)
-- product_id: integer (nullable = true)
-- category_id: long (nullable = true)
-- category_code: string (nullable = true)
-- brand: string (nullable = true)
-- price: double (nullable = true)
-- user_id: integer (nullable = true)
-- user_session: string (nullable = true)


Data schema:
root
-- x: string (nullable = true)
-- y: string (nullable = true)
-- z: string (nullable = true)

         
Table ACLs are enabled in this cluster, so automati
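
The mismatch can also be detected before the write is attempted by comparing `DataFrame.dtypes`, which returns a list of `(column, type)` pairs. The following is a minimal sketch and not Delta's own check: `schema_diff` is a hypothetical helper, and Delta's real rules are more permissive in some cases (for example, an append that omits nullable columns can be accepted), so treat it as a coarse pre-check.

```python
def schema_diff(table_dtypes, data_dtypes):
    """Compare two DataFrame.dtypes lists and report how the incoming data differs."""
    table = dict(table_dtypes)
    data = dict(data_dtypes)
    return {
        # Columns the table has but the incoming data lacks
        "missing": sorted(set(table) - set(data)),
        # Columns in the incoming data that the table does not have
        "extra": sorted(set(data) - set(table)),
        # Shared columns whose types disagree: name -> (table type, data type)
        "type_mismatch": {
            name: (table[name], data[name])
            for name in sorted(set(table) & set(data))
            if table[name] != data[name]
        },
    }


# The two schemas from the error message above, written in DataFrame.dtypes form
table_dtypes = [
    ("event_time", "timestamp"), ("event_type", "string"), ("product_id", "int"),
    ("category_id", "bigint"), ("category_code", "string"), ("brand", "string"),
    ("price", "double"), ("user_id", "int"), ("user_session", "string"),
]
data_dtypes = [("x", "string"), ("y", "string"), ("z", "string")]

diff = schema_diff(table_dtypes, data_dtypes)
print(diff["extra"])  # ['x', 'y', 'z']
```

In the notebook the inputs would come from `spark.read.format("delta").load(delta_path).dtypes` and `wrong_schema.dtypes`.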

## 5: Handle duplicate inserts

In [0]:
print("\n--- Step 5: Handle duplicate inserts ---")

# Create sample data with duplicates
duplicate_data = [
    (1, "product_A", 100),
    (1, "product_A", 100),  # Duplicate
    (2, "product_B", 200),
    (1, "product_A", 100)   # Another duplicate
]

dup_df = spark.createDataFrame(duplicate_data, ["user_id", "product", "price"])

# Save with duplicates to a separate test path in the same volume
dup_path = "/Volumes/workspace/ecommerce/ecommerce_data/duplicate_test"
dup_df.write.format("delta").mode("overwrite").save(dup_path)
print(f"Saved {dup_df.count()} rows (with duplicates)")

# Remove exact duplicate rows and write the result to a separate "_clean" path
unique_df = spark.read.format("delta").load(dup_path).distinct()
unique_df.write.format("delta").mode("overwrite").save(dup_path + "_clean")
print(f"After deduplication: {unique_df.count()} rows")

print("✓ Step 5: Duplicate handling demonstrated")


--- Step 5: Handle duplicate inserts ---
Saved 4 rows (with duplicates)
After deduplication: 2 rows
✓ Step 5: Duplicate handling demonstrated
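
`distinct()` removes duplicates after they have already been written. To avoid writing them in the first place, a Delta `MERGE` can insert only rows whose key is not yet in the target. The sketch below uses the `DeltaTable` API from the `delta` Python package. `merge_condition` and `insert_if_absent` are hypothetical helpers, and treating `user_id` plus `product` as the key is an assumption made for illustration.

```python
def merge_condition(key_cols, target="t", source="s"):
    """Build a join condition such as 't.user_id = s.user_id AND t.product = s.product'."""
    return " AND ".join(f"{target}.{col} = {source}.{col}" for col in key_cols)


def insert_if_absent(spark, target_path, new_rows, key_cols):
    """Append rows from new_rows whose key is not already present at target_path."""
    # Imported here so merge_condition can be used without the delta package installed
    from delta.tables import DeltaTable

    # An insert-only MERGE does not deduplicate within the source batch,
    # so drop repeated keys from the batch first
    deduped = new_rows.dropDuplicates(key_cols)
    (
        DeltaTable.forPath(spark, target_path).alias("t")
        .merge(deduped.alias("s"), merge_condition(key_cols))
        .whenNotMatchedInsertAll()
        .execute()
    )
```

With the data above this could be called as `insert_if_absent(spark, dup_path + "_clean", dup_df, ["user_id", "product"])`. Running it twice with the same batch should leave the row count unchanged.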
