# DataAnalytics — Spark SQL

> **Colab-ready Spark SQL notebooks** following Medallion Architecture.
> Run notebooks in order: **01 Bronze → 02 Silver → 03 Gold → Analytics**.

### Conventions
- Databases (schemas): `bronze`, `silver`, `gold`
- Naming: `snake_case`
- Storage: managed tables under `/content/spark-warehouse` (created automatically)
- All code uses **Spark SQL** via `spark.sql(...)` and previews results with `.show(n, truncate=False)`, where `n` is the row limit (Spark shows 20 rows when `n` is omitted)
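
The preview convention above could be wrapped in a small helper so that every cell displays results the same way. This is only a sketch: `preview` is a hypothetical name that does not appear in the notebooks, and it assumes an active `SparkSession` is passed in as `spark`.

```python
def preview(spark, query, n=10):
    """Run a Spark SQL query, print up to `n` untruncated rows, return the DataFrame.

    `preview` is a hypothetical helper, not part of the original notebooks.
    `spark` is assumed to be an active SparkSession (or any object with `.sql`).
    """
    df = spark.sql(query)
    df.show(n, truncate=False)  # same call the notebook cells make directly
    return df
```

A call such as `preview(spark, "SELECT DISTINCT country FROM gold.dim_customers ORDER BY country", n=50)` would then behave like the equivalent `spark.sql(...).show(50, truncate=False)` cell, while also returning the DataFrame for further use.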

In [None]:
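from pyspark.sql import SparkSession

# Build (or reuse) a Hive-enabled session whose managed tables live under /content/spark-warehouse.
spark = (SparkSession.builder.appName("Medallion-SparkSQL")
         .config("spark.sql.warehouse.dir", "/content/spark-warehouse")
         .enableHiveSupport().getOrCreate())

# Create one database (schema) per Medallion layer.
for db in ["bronze", "silver", "gold"]:
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {db}")

# Read the name by position: the SHOW DATABASES column is `databaseName` or `namespace` depending on the Spark version.
print("Databases ready:", [r[0] for r in spark.sql("SHOW DATABASES").collect()])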

## Dimensions / Measures

In [None]:
spark.sql("SELECT DISTINCT country FROM gold.dim_customers ORDER BY country").show(50, truncate=False)

In [None]:
spark.sql("SELECT DISTINCT category, subcategory, product_name FROM gold.dim_products ORDER BY 1,2,3").show(50, truncate=False)

In [None]:
spark.sql("SELECT SUM(sales_amount) total_sales, SUM(quantity) total_quantity, AVG(price) avg_price FROM gold.fact_sales").show(truncate=False)

## Ranking & Trends

In [None]:
spark.sql("""
    SELECT p.product_name, SUM(f.sales_amount) AS total_revenue
    FROM gold.fact_sales f
    LEFT JOIN gold.dim_products p ON p.product_key = f.product_key
    GROUP BY p.product_name
    ORDER BY total_revenue DESC
    LIMIT 5
""").show(truncate=False)

In [None]:
spark.sql("""
    SELECT date_trunc('month', order_date) AS order_month, SUM(sales_amount) AS total_sales
    FROM gold.fact_sales
    WHERE order_date IS NOT NULL
    GROUP BY date_trunc('month', order_date)
    ORDER BY order_month
""").show(200, truncate=False)

## Cumulative & YoY

In [None]:
spark.sql("""
    WITH yearly AS (
        SELECT date_trunc('year', order_date) AS y, SUM(sales_amount) AS s
        FROM gold.fact_sales
        WHERE order_date IS NOT NULL
        GROUP BY 1
    )
    SELECT y, s, SUM(s) OVER (ORDER BY y) AS running_total
    FROM yearly
    ORDER BY y
""").show(200, truncate=False)

In [None]:
spark.sql("""
    WITH ys AS (
        SELECT year(f.order_date) AS y, p.product_name, SUM(f.sales_amount) AS s
        FROM gold.fact_sales f
        LEFT JOIN gold.dim_products p ON f.product_key = p.product_key
        WHERE f.order_date IS NOT NULL
        GROUP BY year(f.order_date), p.product_name
    )
    SELECT y, product_name, s,
           LAG(s) OVER (PARTITION BY product_name ORDER BY y) AS py,
           s - LAG(s) OVER (PARTITION BY product_name ORDER BY y) AS diff_py
    FROM ys
    ORDER BY product_name, y
""").show(200, truncate=False)
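
To make the two window queries in this section easier to check by hand, the following plain-Python sketch reproduces what `SUM(s) OVER (ORDER BY y)` and `LAG(s) OVER (... ORDER BY y)` compute for a single product. The yearly figures are hypothetical illustration values, not data from `gold.fact_sales`, and the sketch assumes one row per year (which the `GROUP BY` guarantees), so the running window sum equals a simple cumulative sum.

```python
from itertools import accumulate

# Hypothetical yearly totals for one product (illustrative numbers only).
yearly_sales = {2011: 100, 2012: 150, 2013: 120}

years = sorted(yearly_sales)                  # ORDER BY y
totals = [yearly_sales[y] for y in years]     # s for each year

# SUM(s) OVER (ORDER BY y): cumulative sum in year order.
running_total = list(accumulate(totals))

# LAG(s) OVER (ORDER BY y): previous year's value, None (NULL) for the first year.
previous = [None] + totals[:-1]

# s - LAG(s): NULL propagates, so the first year has no difference.
diff_py = [None if p is None else s - p for s, p in zip(totals, previous)]

for row in zip(years, totals, running_total, previous, diff_py):
    print(row)
```

The first year always has `None` for both the previous-year value and the difference, which matches the `NULL` that `LAG` returns when a partition has no earlier row.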

## Segmentation & Part-to-Whole

In [None]:
spark.sql("""
    WITH ps AS (
        SELECT product_key,
               CASE WHEN cost < 100 THEN 'Below 100'
                    WHEN cost BETWEEN 100 AND 500 THEN '100-500'
                    WHEN cost BETWEEN 500 AND 1000 THEN '500-1000'
                    ELSE 'Above 1000'
               END AS band
        FROM gold.dim_products
    )
    SELECT band, COUNT(*) AS total
    FROM ps
    GROUP BY band
    ORDER BY total DESC
""").show(truncate=False)
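
The `BETWEEN` ranges in the `CASE` expression share their endpoints (500 appears in two branches), so it helps to see how boundary values are assigned. The sketch below mirrors the branch order in plain Python; `cost_band` is a hypothetical name used only for illustration. It also assumes standard SQL behaviour for `NULL`: no comparison is true, so a missing cost falls through to the `ELSE` branch.

```python
def cost_band(cost):
    """Mirror the SQL CASE: branches are tested top to bottom, first match wins.

    `cost_band` is a hypothetical helper for illustration only.
    """
    if cost is None:
        # Comparisons with NULL are never true in SQL, so NULL reaches ELSE.
        return "Above 1000"
    if cost < 100:
        return "Below 100"
    if 100 <= cost <= 500:       # BETWEEN is inclusive on both ends
        return "100-500"
    if 500 <= cost <= 1000:      # 500 was already caught by the branch above
        return "500-1000"
    return "Above 1000"


for value in (99, 100, 500, 1000, 1001, None):
    print(value, "->", cost_band(value))
```

A cost of exactly 500 lands in `100-500` and exactly 1000 lands in `500-1000`, because the earlier branch wins. Products with no recorded cost are counted under `Above 1000`, which may or may not be the intended grouping.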

In [None]:
spark.sql("""
    WITH cs AS (
        SELECT p.category, SUM(f.sales_amount) AS s
        FROM gold.fact_sales f
        LEFT JOIN gold.dim_products p ON p.product_key = f.product_key
        GROUP BY p.category
    )
    SELECT category, s,
           SUM(s) OVER () AS overall,
           ROUND(s / SUM(s) OVER () * 100, 2) AS pct
    FROM cs
    ORDER BY s DESC
""").show(200, truncate=False)
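
The part-to-whole query relies on `SUM(s) OVER ()`, an empty window that repeats the grand total on every row. A minimal plain-Python sketch of the same arithmetic is shown below; the category names and amounts are hypothetical illustration values, not results from the gold tables.

```python
# Hypothetical per-category totals (illustrative numbers only).
category_sales = {"Bikes": 700, "Accessories": 200, "Clothing": 100}

# SUM(s) OVER (): one grand total, shared by every row.
overall = sum(category_sales.values())

# ROUND(s / overall * 100, 2), sorted like ORDER BY s DESC.
shares = [
    (category, s, overall, round(s / overall * 100, 2))
    for category, s in sorted(category_sales.items(), key=lambda kv: kv[1], reverse=True)
]

for row in shares:
    print(row)
```

Because every row is divided by the same total, the `pct` column sums to 100, up to rounding.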