**What is Medallion Architecture?** \
A step-by-step way to organize data: raw --> clean --> useful.

**It has three layers** \
Bronze  -->  Silver  -->  Gold

**Bronze layer**  -> Exact copy of source data. No changes.


**Raw ingestion (the data is copied exactly; nothing is changed).**

In [0]:
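# Read the raw CSV export as-is (header row used for column names)
df_raw = spark.read.csv("/Volumes/workspace/ecommerce/ecommerce_data/2019-Oct.csv", header=True, inferSchema=True)

# Append to the bronze Delta table.
# Note: re-running this cell appends the same rows again.
df_raw.write\
    .format("delta")\
    .mode("append")\
    .saveAsTable("bronze_ecommerce_data_Oct_2019")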

**Silver layer** -> Cleaned, standardized, and trusted data (removing duplicates, fixing data types, handling nulls, validating the schema, adding required columns).
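
The "validating the schema" step can be sketched in plain Python. This is only an illustration: the column names are taken from this notebook, but the expected types in `EXPECTED_SCHEMA` and the helper name `schema_problems` are assumptions, not part of the original pipeline.

```python
# Hypothetical expected schema for the silver table. The column names come
# from this notebook; the types are assumptions, adjust them to your data.
EXPECTED_SCHEMA = {
    "event_time": "timestamp",
    "event_type": "string",
    "product_id": "int",
    "brand": "string",
    "price": "double",
    "user_id": "int",
}


def schema_problems(actual_dtypes, expected=EXPECTED_SCHEMA):
    """Return a list of schema problems; an empty list means the schema is OK.

    actual_dtypes is a list of (column, type) pairs, the same shape that a
    Spark DataFrame exposes as `df.dtypes`.
    """
    actual = dict(actual_dtypes)
    problems = []
    for name, expected_type in expected.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != expected_type:
            problems.append(
                f"{name}: expected {expected_type}, got {actual[name]}"
            )
    return problems


# Hand-written example: price was read as a string and brand is absent.
sample_dtypes = [
    ("event_time", "timestamp"),
    ("event_type", "string"),
    ("product_id", "int"),
    ("price", "string"),
    ("user_id", "int"),
]
problems = schema_problems(sample_dtypes)
for problem in problems:
    print(problem)
```

In the notebook, the same helper could be called as `schema_problems(df_silver.dtypes)` before writing the silver table, failing the run if the list is not empty.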

**Incremental Processing** ->\
Only new or changed data is processed.\
Delta MERGE updates existing records and inserts new ones.
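
To make the update-or-insert behavior concrete, here is a minimal pure-Python sketch of upsert semantics. It is not how Delta Lake implements MERGE; the function `merge_upsert` and the sample rows are made up for illustration, and only the key columns match the ones used in this notebook.

```python
KEY_COLS = ["product_id", "user_id", "event_time"]


def merge_upsert(target, source, key_cols=KEY_COLS):
    """Upsert source rows into target rows, matching on key_cols.

    Rows are plain dicts. A source row whose key already exists replaces
    the target row (similar in spirit to whenMatchedUpdateAll); a source
    row with a new key is added (similar to whenNotMatchedInsertAll).
    """
    def key_of(row):
        return tuple(row[c] for c in key_cols)

    merged = {key_of(row): row for row in target}
    for row in source:
        merged[key_of(row)] = row  # update if present, insert otherwise
    return list(merged.values())


# Made-up sample rows, only to show the behavior.
target = [
    {"product_id": 1, "user_id": 10, "event_time": "2019-10-01 00:00:00", "price": 5.0},
    {"product_id": 2, "user_id": 11, "event_time": "2019-10-01 00:00:05", "price": 7.5},
]
source = [
    # Same key as the second target row: treated as an update
    {"product_id": 2, "user_id": 11, "event_time": "2019-10-01 00:00:05", "price": 8.0},
    # New key: treated as an insert
    {"product_id": 3, "user_id": 12, "event_time": "2019-10-01 00:00:09", "price": 1.0},
]

result = merge_upsert(target, source)
print(len(result))         # number of rows after the upsert
print(result[1]["price"])  # price of the updated row
```

The sketch also shows why the merge key has to identify a row uniquely: if two source rows shared a key, only one of them could end up in the result.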

In [0]:
from pyspark.sql.functions import col, year, month
from delta.tables import DeltaTable

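df_silver = (
    spark.table("bronze_ecommerce_data_Oct_2019")
    # Drop events that have no timestamp
    .filter(col("event_time").isNotNull())
    # Deduplicate on the same key the MERGE matches on, so each target row
    # matches at most one source row
    .dropDuplicates(["product_id", "user_id", "event_time"])
    # Derive year and month columns from the event timestamp
    .withColumn("year", year("event_time"))
    .withColumn("month", month("event_time"))
)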

if not spark.catalog.tableExists("silver_ecommerce_data_Oct_2019"):
    # First run: create the silver table from the cleaned data
    df_silver.write \
        .format("delta") \
        .mode("overwrite") \
        .saveAsTable("silver_ecommerce_data_Oct_2019")
else:
    # Later runs: upsert into the existing silver table
    silver_tbl = DeltaTable.forName(spark, "silver_ecommerce_data_Oct_2019")

    silver_tbl.alias("t").merge(
        df_silver.alias("s"),
        "t.product_id = s.product_id AND "
        "t.user_id = s.user_id AND "
        "t.event_time = s.event_time"
    ).whenMatchedUpdateAll() \
     .whenNotMatchedInsertAll() \
     .execute()





**Gold layer** -> Aggregated, business-level insights.

In [0]:
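# Import the module under an alias so Python's built-in sum is not shadowed
from pyspark.sql import functions as F

df_gold = (
    spark.table("silver_ecommerce_data_Oct_2019")
    # Keep purchase events only
    .filter(F.col("event_type") == "purchase")
    .groupBy("brand", "year", "month")
    .agg(
        F.count("*").alias("total_orders"),     # number of purchase events
        F.sum("price").alias("total_revenue")   # summed purchase price
    )
)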

df_gold.write \
  .format("delta") \
  .mode("overwrite") \
  .saveAsTable("gold_monthly_brand_revenue_Oct_2019")


In [0]:
spark.table("gold_monthly_brand_revenue_Oct_2019").orderBy("total_revenue", ascending=False).show()

+--------+----+-----+------------+--------------------+
|   brand|year|month|total_orders|       total_revenue|
+--------+----+-----+------------+--------------------+
|   apple|2019|   10|      142857| 1.111987547299987E8|
| samsung|2019|   10|      172877|4.6401831680000074E7|
|  xiaomi|2019|   10|       56609|   9192640.240000024|
|    NULL|2019|   10|       58209|   8539660.590000002|
|  huawei|2019|   10|       23499|   4883104.729999999|
|    acer|2019|   10|        6880|  3575715.6899999958|
|      lg|2019|   10|        8725|  3387360.9099999974|
| lucente|2019|   10|       11576|   3123438.960000002|
|    sony|2019|   10|        6729|          2478196.68|
|    oppo|2019|   10|       10887|   2412033.980000003|
|  lenovo|2019|   10|        4578|  1752638.5300000017|
| indesit|2019|   10|        5024|  1250060.9500000004|
|   bosch|2019|   10|        5705|  1248729.0900000005|
|      hp|2019|   10|        3596|  1227215.9899999998|
|   artel|2019|   10|        6123|  1033967.3299