**What is Medallion Architecture?** \
A step-by-step way to organize data: raw --> clean --> useful.

**It has three layers** \
Bronze  -->  Silver  -->  Gold
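
Before the Spark cells, the three layers can be sketched in plain Python. This is only a toy illustration, not the notebook's pipeline: the sample rows and the helper names `to_bronze`, `to_silver`, and `to_gold` are made up, and only the field names mirror the columns used later.

```python
from collections import defaultdict

# Hypothetical sample events (not taken from the real dataset).
raw_events = [
    {"event_time": "2019-10-01 00:00:00", "event_type": "purchase",
     "product_id": 1, "user_id": 10, "brand": "meizu", "price": 100.0},
    {"event_time": "2019-10-01 00:00:00", "event_type": "purchase",
     "product_id": 1, "user_id": 10, "brand": "meizu", "price": 100.0},  # exact duplicate
    {"event_time": None, "event_type": "purchase",
     "product_id": 2, "user_id": 11, "brand": "swat", "price": 50.0},    # missing timestamp
    {"event_time": "2019-10-02 08:30:00", "event_type": "view",
     "product_id": 3, "user_id": 12, "brand": "swat", "price": 75.0},
    {"event_time": "2019-10-03 09:00:00", "event_type": "purchase",
     "product_id": 3, "user_id": 12, "brand": "swat", "price": 75.0},
]

def to_bronze(events):
    # Bronze: copy every row as it arrived, bad rows included.
    return [dict(e) for e in events]

def to_silver(bronze):
    # Silver: drop rows without a timestamp, then keep the first row per key.
    seen, silver = set(), []
    for row in bronze:
        if row["event_time"] is None:
            continue
        key = (row["product_id"], row["user_id"], row["event_time"])
        if key not in seen:
            seen.add(key)
            silver.append(row)
    return silver

def to_gold(silver):
    # Gold: aggregate purchase events per brand.
    totals = defaultdict(lambda: {"total_orders": 0, "total_revenue": 0.0})
    for row in silver:
        if row["event_type"] == "purchase":
            totals[row["brand"]]["total_orders"] += 1
            totals[row["brand"]]["total_revenue"] += row["price"]
    return dict(totals)

bronze = to_bronze(raw_events)
silver = to_silver(bronze)
gold = to_gold(silver)
print(len(bronze), len(silver), gold)
```

Each layer reads only from the layer before it, so a mistake in a silver or gold rule can be fixed and the table rebuilt without re-reading the source.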

**Bronze layer**  -> Exact copy of source data. No changes.


**Raw ingestion (the data is copied as-is; nothing else is done to it).**

In [0]:
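# inferSchema=True lets Spark guess each column's type from the file contents.
df_raw = spark.read.csv("/Volumes/workspace/ecommerce/ecommerce_data/2019-Oct.csv", header=True, inferSchema=True)

# "append" adds the rows again on every run; re-running this cell duplicates the data in bronze.
df_raw.write \
    .format("delta") \
    .mode("append") \
    .saveAsTable("bronze_ecommerce_data_Oct_2019")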

**Silver layer** -> Cleaned, standardized, and trusted data (removing duplicates, fixing data types, handling nulls, validating the schema, adding required columns).
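
Schema validation is one of the silver steps listed above. A minimal sketch of such a check follows, assuming the check runs on the `(name, type)` pairs that `DataFrame.dtypes` returns. The expected types in `EXPECTED_SILVER_SCHEMA` and the helper `schema_problems` are hypothetical; the real column types depend on what `inferSchema` produced.

```python
# Hypothetical expected schema: column name -> Spark type string (an assumption).
EXPECTED_SILVER_SCHEMA = {
    "event_time": "timestamp",
    "event_type": "string",
    "product_id": "int",
    "user_id": "int",
    "brand": "string",
    "price": "double",
}

def schema_problems(actual_dtypes, expected):
    """Compare (name, type) pairs, shaped like DataFrame.dtypes, with an expected mapping."""
    actual = dict(actual_dtypes)
    problems = []
    for name, expected_type in expected.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != expected_type:
            problems.append(f"{name}: expected {expected_type}, found {actual[name]}")
    return problems

# Stand-in for df.dtypes so the sketch runs without a Spark session.
sample_dtypes = [
    ("event_time", "string"),   # wrong type on purpose
    ("event_type", "string"),
    ("product_id", "int"),
    ("user_id", "int"),
    ("price", "double"),        # "brand" is missing on purpose
]

problems = schema_problems(sample_dtypes, EXPECTED_SILVER_SCHEMA)
for p in problems:
    print(p)
```

In the notebook, the same function could be called with the bronze table's `dtypes` before writing silver, and the write skipped or failed when the list is not empty.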

In [0]:
from pyspark.sql.functions import col, year, month
from delta.tables import DeltaTable

df_silver = (
    spark.table("bronze_ecommerce_data_Oct_2019")
    .filter(col("event_time").isNotNull())                      # drop rows without a timestamp
    .dropDuplicates(["product_id", "user_id", "event_time"])    # one row per product, user, and timestamp
    .withColumn("year", year("event_time"))                     # derived columns used for grouping in gold
    .withColumn("month", month("event_time"))
)

if not spark.catalog.tableExists("silver_ecommerce_data_Oct_2019"):
    df_silver.write \
        .format("delta") \
        .mode("overwrite") \
        .saveAsTable("silver_ecommerce_data_Oct_2019")
else:
    silver_tbl = DeltaTable.forName(spark, "silver_ecommerce_data_Oct_2019")

    silver_tbl.alias("t").merge(
        df_silver.alias("s"),
        "t.product_id = s.product_id AND "
        "t.user_id = s.user_id AND "
        "t.event_time = s.event_time"
    ).whenMatchedUpdateAll() \
     .whenNotMatchedInsertAll() \
     .execute()
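
The merge above is an upsert: rows whose key already exists in the target are updated, and all other rows are inserted. A toy model of that behaviour in plain Python is sketched below. `merge_upsert` and the sample rows are hypothetical and ignore everything Delta does beyond matching on the key.

```python
def merge_upsert(target, source, key_cols):
    """Toy upsert: overwrite rows whose key matches, insert the rest."""
    merged = {tuple(r[c] for c in key_cols): dict(r) for r in target}
    for row in source:
        key = tuple(row[c] for c in key_cols)
        merged[key] = dict(row)  # matched -> replace all columns; not matched -> insert
    return list(merged.values())

KEYS = ["product_id", "user_id", "event_time"]

target = [
    {"product_id": 1, "user_id": 10, "event_time": "2019-10-01 00:00:00", "price": 100.0},
]
source = [
    # Same key as the target row, so this one is an update.
    {"product_id": 1, "user_id": 10, "event_time": "2019-10-01 00:00:00", "price": 120.0},
    # New key, so this one is an insert.
    {"product_id": 2, "user_id": 11, "event_time": "2019-10-02 00:00:00", "price": 50.0},
]

result = merge_upsert(target, source, KEYS)
rerun = merge_upsert(result, source, KEYS)  # applying the same source again changes nothing
print(result)
```

This is why the silver cell can be re-run safely: merging the same rows a second time leaves the table unchanged, unlike the `append` used for bronze.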





**Gold layer** -> Aggregated, business-level insights.

In [0]:
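# Import sum under an alias so it does not shadow Python's built-in sum().
from pyspark.sql.functions import col, count, sum as spark_sum

df_gold = (
    spark.table("silver_ecommerce_data_Oct_2019")
    .filter(col("event_type") == "purchase")    # keep purchase events only
    .groupBy("brand", "year", "month")
    .agg(
        count("*").alias("total_orders"),       # number of purchase events
        spark_sum("price").alias("total_revenue")
    )
)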

df_gold.write \
  .format("delta") \
  .mode("overwrite") \
  .saveAsTable("gold_monthly_brand_revenue_Oct_2019")
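
Some `total_revenue` values in the output carry long fractional tails such as `15676.239999999998`. That is what summing a double (binary floating-point) column looks like. The effect can be reproduced in plain Python with made-up prices; the sketch below is not part of the notebook.

```python
from decimal import Decimal

# Hypothetical prices, chosen because they show the effect with few values.
prices = [0.1, 0.2, 0.3]

float_total = sum(prices)                               # binary floating-point sum
decimal_total = sum(Decimal(str(p)) for p in prices)    # exact decimal sum
rounded_total = round(float_total, 2)                   # rounded for display

print(float_total)    # not exactly 0.6
print(decimal_total)  # exactly 0.6
print(rounded_total)
```

In Spark, two possible fixes (neither is used in this notebook) are rounding the aggregate with `pyspark.sql.functions.round`, or casting `price` to a decimal type in the silver layer.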


In [0]:
spark.table("gold_monthly_brand_revenue_Oct_2019").show()

+----------+----+-----+------------+------------------+
|     brand|year|month|total_orders|     total_revenue|
+----------+----+-----+------------+------------------+
|    dremel|2019|   10|           8|            639.78|
|  burberry|2019|   10|          32|           2166.82|
|    pituso|2019|   10|         169|15676.239999999998|
|   colombo|2019|   10|           7|            275.51|
|      swat|2019|   10|         148|          10731.75|
|powercolor|2019|   10|          26|           5413.19|
|    gipfel|2019|   10|          29|           2970.15|
|    agness|2019|   10|          70|           2165.96|
|     meizu|2019|   10|        1735|214465.64999999997|
|    momert|2019|   10|           9|            540.53|
|    missha|2019|   10|          60|           1255.34|
|       cnd|2019|   10|          16|           1711.45|
|   neoline|2019|   10|         509| 84460.59000000001|
|   byintek|2019|   10|          24| 3584.129999999999|
|   tomfarr|2019|   10|          45|           7