# 01 — Build Bronze Layer (Spark SQL)

> **Colab-ready Spark SQL notebooks** following Medallion Architecture.
> Run notebooks in order: **01 Bronze → 02 Silver → 03 Gold → Analytics**.

### Conventions
- Databases (schemas): `bronze`, `silver`, `gold`
- Naming: `snake_case`
- Storage: managed tables under `/content/spark-warehouse` (created automatically)
- All code uses **Spark SQL** via `spark.sql(...)` and shows previews with `.show(10, truncate=False)`
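
The naming convention can be sketched as a small helper. It assumes, based on the table names used later in this notebook, that a Bronze table is named with a source-system prefix (`crm` or `erp`) plus the lowercased file stem. The function name `bronze_table_name` and the `SNAKE_CASE` pattern are illustrative and not part of the original notebook.

```python
import re
from pathlib import PurePosixPath

# Assumed definition of snake_case: lowercase letters and digits joined by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

def bronze_table_name(source_system: str, csv_path: str) -> str:
    """Build a fully qualified Bronze table name from a source prefix and a CSV path."""
    stem = PurePosixPath(csv_path).stem.lower()   # "LOC_A101.csv" -> "loc_a101"
    table = f"{source_system.lower()}_{stem}"
    if not SNAKE_CASE.match(table):
        raise ValueError(f"{table!r} is not snake_case")
    return f"bronze.{table}"

print(bronze_table_name("erp", "/content/LOC_A101.csv"))   # bronze.erp_loc_a101
print(bronze_table_name("crm", "/content/cust_info.csv"))  # bronze.crm_cust_info
```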

## What is the Bronze layer?
- **Raw landing zone** — store sources exactly as delivered.
- Preserve raw fidelity and schema drift for auditability.
- Keep loads **idempotent** (truncate → bulk load), so rerunning a load never duplicates rows.
- Avoid business rules; transform later in Silver.
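
To make the idempotency point concrete, here is a minimal sketch of the truncate → bulk load pattern. It uses Python's built-in `sqlite3` as a stand-in for Spark so it runs anywhere. The table `bronze_demo` and its rows are made up for illustration, and the notebook's own loads use Spark's overwrite mode rather than this code.

```python
import sqlite3

def truncate_and_load(conn, rows):
    """Replace the table contents with `rows` inside a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM bronze_demo")  # SQLite has no TRUNCATE statement
        conn.executemany("INSERT INTO bronze_demo (id, name) VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bronze_demo (id TEXT, name TEXT)")

source_rows = [("1", "Ada"), ("2", "Grace")]  # made-up sample data
truncate_and_load(conn, source_rows)
truncate_and_load(conn, source_rows)  # rerun: contents are replaced, not appended

row_count = conn.execute("SELECT COUNT(*) FROM bronze_demo").fetchone()[0]
print(row_count)  # 2
```

Because each run empties the table before loading, running the load twice leaves the same rows as running it once.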

In [None]:
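from pyspark.sql import SparkSession

# Hive support keeps table metadata in a local metastore.
spark = (SparkSession.builder.appName("Medallion-SparkSQL")
         .config("spark.sql.warehouse.dir", "/content/spark-warehouse")
         .enableHiveSupport().getOrCreate())
# Create one database (schema) per Medallion layer.
for db in ["bronze", "silver", "gold"]:
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {db}")
# Read the name by position: the SHOW DATABASES column is `namespace` in Spark 3.x
# and `databaseName` in Spark 2.x.
print("Databases ready:", [r[0] for r in spark.sql("SHOW DATABASES").collect()])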

In [None]:
def preview(table_fqn, limit=10):
    """Print the row count and the first `limit` rows of a table."""
    print("\n=== Preview:", table_fqn, "===")
    spark.sql(f"SELECT COUNT(*) AS row_count FROM {table_fqn}").show(truncate=False)
    spark.sql(f"SELECT * FROM {table_fqn} LIMIT {limit}").show(limit, truncate=False)

## Configure file locations

In [None]:
DATA_BASE = "/content"
FILES = {  # Bronze table name -> source CSV path
    "crm_cust_info": f"{DATA_BASE}/cust_info.csv", "crm_prd_info": f"{DATA_BASE}/prd_info.csv",
    "crm_sales_details": f"{DATA_BASE}/sales_details.csv", "erp_loc_a101": f"{DATA_BASE}/LOC_A101.csv",
    "erp_cust_az12": f"{DATA_BASE}/CUST_AZ12.csv", "erp_px_cat_g1v2": f"{DATA_BASE}/PX_CAT_G1V2.csv",
}
print(FILES)

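Before loading, it can help to confirm that every configured CSV has actually been uploaded to the Colab runtime. The sketch below uses a hypothetical helper, `missing_files`, which is not part of the original notebook. The `example_files` mapping is a shortened stand-in for `FILES`.

```python
import os

def missing_files(files: dict) -> dict:
    """Return the {table: path} entries whose path is not an existing file."""
    return {table: path for table, path in files.items() if not os.path.isfile(path)}

# Shortened stand-in for the notebook's FILES mapping.
example_files = {
    "crm_cust_info": "/content/cust_info.csv",
    "erp_loc_a101": "/content/LOC_A101.csv",
}

missing = missing_files(example_files)
if missing:
    print("Missing source files:", missing)
else:
    print("All source files found.")
```

In the notebook, calling `missing_files(FILES)` before the load cells would surface a missing upload before Spark raises a path-not-found error.
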
### bronze.crm_cust_info

In [None]:
spark.sql("DROP TABLE IF EXISTS bronze.crm_cust_info")
(spark.read.option("header", True).option("inferSchema", True)
      .csv(FILES["crm_cust_info"])
      .write.mode("overwrite").saveAsTable("bronze.crm_cust_info"))
preview("bronze.crm_cust_info")

### bronze.crm_prd_info

In [None]:
spark.sql("DROP TABLE IF EXISTS bronze.crm_prd_info")
(spark.read.option("header", True).option("inferSchema", True)
      .csv(FILES["crm_prd_info"])
      .write.mode("overwrite").saveAsTable("bronze.crm_prd_info"))
preview("bronze.crm_prd_info")

### bronze.crm_sales_details

In [None]:
spark.sql("DROP TABLE IF EXISTS bronze.crm_sales_details")
(spark.read.option("header", True).option("inferSchema", True)
      .csv(FILES["crm_sales_details"])
      .write.mode("overwrite").saveAsTable("bronze.crm_sales_details"))
preview("bronze.crm_sales_details")

### bronze.erp_loc_a101

In [None]:
spark.sql("DROP TABLE IF EXISTS bronze.erp_loc_a101")
(spark.read.option("header", True).option("inferSchema", True)
      .csv(FILES["erp_loc_a101"])
      .write.mode("overwrite").saveAsTable("bronze.erp_loc_a101"))
preview("bronze.erp_loc_a101")

### bronze.erp_cust_az12

In [None]:
spark.sql("DROP TABLE IF EXISTS bronze.erp_cust_az12")
(spark.read.option("header", True).option("inferSchema", True)
      .csv(FILES["erp_cust_az12"])
      .write.mode("overwrite").saveAsTable("bronze.erp_cust_az12"))
preview("bronze.erp_cust_az12")

### bronze.erp_px_cat_g1v2

In [None]:
spark.sql("DROP TABLE IF EXISTS bronze.erp_px_cat_g1v2")
(spark.read.option("header", True).option("inferSchema", True)
      .csv(FILES["erp_px_cat_g1v2"])
      .write.mode("overwrite").saveAsTable("bronze.erp_px_cat_g1v2"))
preview("bronze.erp_px_cat_g1v2")
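
The six load cells above differ only in the table name, so the same pattern could be expressed as one loop. The sketch below is one possible consolidation, not the notebook's own code. `load_bronze_tables` is a hypothetical helper that takes the Spark session and a `{table: csv_path}` mapping as arguments and uses the same reader and writer options as the cells above.

```python
def load_bronze_tables(spark, files):
    """Load every {table: csv_path} entry into bronze.<table>; return the names loaded."""
    loaded = []
    for table, path in files.items():
        fqn = f"bronze.{table}"
        # Same options as the per-table cells: header row, inferred types, overwrite.
        (spark.read
              .option("header", True)
              .option("inferSchema", True)
              .csv(path)
              .write.mode("overwrite")
              .saveAsTable(fqn))
        loaded.append(fqn)
    return loaded

# In the notebook, with `spark`, `FILES`, and `preview` defined by the earlier cells:
# for fqn in load_bronze_tables(spark, FILES):
#     preview(fqn)
```

Keeping one cell per table, as the notebook does, makes each preview easier to find. The loop is the shorter option when more sources are added.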