
# End-to-End Lakehouse — Medallion Architecture 

This notebook builds an end-to-end lakehouse on Databricks using the **Medallion Architecture** (Bronze → Silver → Gold).

### What you'll build
- **Bronze**: raw ingestion from CSV → Delta tables, with lineage columns
- **Silver**: cleansing, standardization, type-casting, **deduplication (Window + row_number)**, and a **quarantine** area for invalid rows
- **Gold**: star-like business model (dimensions + fact) and daily revenue aggregates
- **(Optional)**: incremental ingestion using **Structured Streaming** (no DLT)

### How to use
1. Upload your CSVs to DBFS (e.g., `dbfs:/FileStore/lab_data/`).  
2. Adjust the **paths** in the next section if your filenames/locations differ.  
3. Run the notebook from top to bottom on a Databricks cluster.



## 1. Parameters, Imports, and Paths

This cell:
- Imports PySpark helper modules.
- Defines a **database name** in the **legacy Hive Metastore**.
- Builds a per-user **base path** in DBFS for Bronze/Silver/Gold layers.
- Defines **raw CSV paths** (pointing to `FileStore/lab_data`), which you can change.
- Ensures the **directory structure** exists using `dbutils.fs.mkdirs`.
- Creates and selects the **Hive database** with `CREATE DATABASE IF NOT EXISTS` and `USE`.
- Sets `spark.sql.shuffle.partitions` explicitly (200 is also Spark's default; a smaller value usually suits small lab datasets).

### Functions & APIs explained
- `from pyspark.sql import functions as F`: Imports many column functions (e.g., `col`, `trim`, `lower`, `to_date`, etc.) under the alias `F` for readability.
- `from pyspark.sql import Window`: Used to define **window specifications** for analytics like deduplication with `row_number`.
- `from pyspark.sql.types import *`: Imports Spark **data types** like `StructType`, `StructField`, `StringType`, etc.
- `spark.sql(...)`: Executes a SQL statement in the current Spark session; table and database names resolve through the Hive Metastore.
- `dbutils.fs.mkdirs(path)`: Databricks utility to create folders in **DBFS**.
- `spark.conf.set(...)`: Sets a Spark configuration value at runtime, here to control the number of shuffle partitions.


In [0]:

from pyspark.sql import functions as F
from pyspark.sql import Window
from pyspark.sql.types import *

# ---- Database (Hive Metastore) ----
db = "retail_lakehouse"  # change if you prefer a different schema/database name

# ---- Build a user-specific base path so multiple users don't collide ----
_current_user = spark.sql("SELECT current_user()").first()[0]
user_safe = _current_user.replace('@','_').replace('.','_').replace('+','_')
base_path = f"dbfs:/FileStore/lakehouse/{user_safe}/retail"

# ---- Raw CSV paths (match your filenames in DBFS) ----
# If your files are not in dbfs:/FileStore/lab_data/, upload them there or change these paths.
raw_customers_path = "dbfs:/FileStore/lab_data/customer_demographics.csv"
raw_products_path  = "dbfs:/FileStore/lab_data/products.csv"
raw_sales_path     = "dbfs:/FileStore/lab_data/sales.csv"

# ---- Derived storage layout for each medallion layer ----
bronze_root = f"{base_path}/bronze"
silver_root = f"{base_path}/silver"
gold_root   = f"{base_path}/gold"
quarantine_root = f"{base_path}/quarantine"

paths = {
    "bronze_customers": f"{bronze_root}/customers",
    "bronze_products":  f"{bronze_root}/products",
    "bronze_sales":     f"{bronze_root}/sales",
    "silver_customers": f"{silver_root}/customers",
    "silver_products":  f"{silver_root}/products",
    "silver_sales":     f"{silver_root}/sales",
    "gold_dim_customer": f"{gold_root}/dim_customer",
    "gold_dim_product":  f"{gold_root}/dim_product",
    "gold_fact_sales":   f"{gold_root}/fact_sales",
    "gold_sales_daily":  f"{gold_root}/sales_daily",
    "quarantine_sales":  f"{quarantine_root}/sales"
}

# ---- Create folders (idempotent) ----
dbutils.fs.mkdirs(bronze_root)
dbutils.fs.mkdirs(silver_root)
dbutils.fs.mkdirs(gold_root)
dbutils.fs.mkdirs(quarantine_root)

# ---- Create and select Hive database  ----
spark.sql(f"CREATE DATABASE IF NOT EXISTS {db}")
spark.sql(f"USE {db}")

# ---- Optional tuning ----
spark.conf.set("spark.sql.shuffle.partitions", "200")

print("Environment ready in Hive Metastore, database:", db)
print("Base path:", base_path)
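
The per-user path logic above replaces only `@`, `.`, and `+`. A slightly more defensive variant can be sketched in plain Python. The helper name `safe_path_segment` and the example user are hypothetical and not part of the notebook:

```python
import re

def safe_path_segment(user: str) -> str:
    """Replace every character that is not a letter, digit, underscore, or hyphen with '_'."""
    return re.sub(r"[^A-Za-z0-9_-]", "_", user)

# Hypothetical user name, for illustration only
example_user = "jane.doe+lab@example.com"
example_base_path = f"dbfs:/FileStore/lakehouse/{safe_path_segment(example_user)}/retail"
print(example_base_path)
```

For typical e-mail style user names this yields the same result as the three `replace` calls, while also covering characters such as spaces or quotes.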



## 2. Source Schemas (StructType) — why and how

We **explicitly define schemas** to avoid costly inference and to catch malformed data early.

### Concepts & functions
- **`StructType([...]) / StructField(name, type, nullable)`**: Spark’s way to represent a table schema.
- **Types used**: `StringType`, `IntegerType`, `DoubleType`, etc.  
  We keep `*Date*` columns as **strings** in Bronze and convert to `date` in Silver for safer parsing.
- Ingesting with an explicit schema makes CSV reading **faster** (no extra pass over the file to infer types) and **safer** (column types are fixed up front).

Below, the schemas are written out to match the lab CSVs; if your files use different column names or a different column order, edit the schemas to match.


In [0]:

from pyspark.sql.types import *

customersSchema = StructType([
        StructField("CustomerId", StringType(), True),
        StructField("CustomerName", StringType(), True),
        StructField("EmailAddress", StringType(), True),
        StructField("Region", StringType(), True),
        StructField("CustomerType", StringType(), True)
    ])
productsSchema  = StructType([
        StructField("Item", StringType(), True),
        StructField("Category", StringType(), True),
        StructField("ProductName", StringType(), True),
        StructField("Segment", StringType(), True),
        StructField("Price", DoubleType(), True),
        StructField("LaunchDate", StringType(), True)
    ])
salesSchema     = StructType([
        StructField("SalesOrderNumber", StringType(), True),
        StructField("SalesOrderLineNumber", IntegerType(), True),
        StructField("OrderDate", StringType(), True),
        StructField("CustomerName", StringType(), True),
        StructField("EmailAddress", StringType(), True),
        StructField("Item", StringType(), True),
        StructField("Quantity", DoubleType(), True),
        StructField("UnitPrice", DoubleType(), True),
        StructField("TaxAmount", DoubleType(), True),
        StructField("ProductMetadata", StringType(), True)
    ])

print("Customers schema:", [f.name + ":" + f.dataType.simpleString() for f in customersSchema])
print("Products schema:", [f.name + ":" + f.dataType.simpleString() for f in productsSchema])
print("Sales schema:", [f.name + ":" + f.dataType.simpleString() for f in salesSchema])
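
With an explicit schema and Spark's default `enforceSchema` setting, the CSV reader applies columns by position and does not validate header names, so a renamed or reordered column may not raise an error. A quick pre-flight check can be sketched in plain Python. The helper `header_report` and the sample text are assumptions made for illustration:

```python
import csv
import io

def header_report(csv_text, expected_columns):
    """Compare the CSV header row with the expected column names."""
    actual = [name.strip() for name in next(csv.reader(io.StringIO(csv_text)))]
    expected = list(expected_columns)
    return {
        "missing": [c for c in expected if c not in actual],
        "unexpected": [c for c in actual if c not in expected],
        "same_order": actual == expected,
    }

# Column names taken from customersSchema above
expected_customers = ["CustomerId", "CustomerName", "EmailAddress", "Region", "CustomerType"]

# Made-up sample whose id column is cased differently
sample_text = "CustomerID,CustomerName,EmailAddress,Region,CustomerType\n1,Ann,ann@example.com,West,Retail\n"
report = header_report(sample_text, expected_customers)
print(report)
```

In a notebook, the first line of a file could be fetched with `dbutils.fs.head(path)` and passed to the helper before the Bronze load runs.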



## 3. BRONZE — Raw ingestion from CSV to Delta

**Goal:** Land the raw CSV data **as-is** into Delta, and add **lineage columns** for traceability.

### Functions used
- `spark.read.option("header", True).schema(...).csv(path)`: Reads CSV with a fixed schema.
- `withColumn("ingest_ts", current_timestamp())`: Adds an **ingestion timestamp**.
- `withColumn("ingest_file", input_file_name())`: Records the **source file** each row came from.
- `withColumn("record_source", lit("csv"))`: Simple provenance marker.
- `write.format("delta").mode("overwrite").save(path)`: Writes a **Delta Lake** table to storage.
- `CREATE TABLE ... USING DELTA LOCATION '...'`: Registers the Delta folder as a **Hive Metastore table**.

Why Delta? It gives you ACID transactions, time travel, and performance optimizations for analytics.
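
To make the lineage columns concrete, here is a plain-Python analogue of what they capture. It is only a sketch: `read_with_lineage` is a hypothetical helper, the sample data is invented, and Spark itself is not involved:

```python
import csv
import io
from datetime import datetime, timezone

def read_with_lineage(csv_text, source_name):
    """Parse CSV text and attach the three lineage fields to every row."""
    ingest_ts = datetime.now(timezone.utc)  # captured once, so all rows of a load share it
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["ingest_ts"] = ingest_ts
        row["ingest_file"] = source_name
        row["record_source"] = "csv"
        rows.append(row)
    return rows

# Made-up sample using a subset of the productsSchema column names
sample_csv = "Item,Category,Price\nWidget,Tools,9.99\nGadget,Toys,4.50\n"
bronze_rows = read_with_lineage(sample_csv, "dbfs:/FileStore/lab_data/products.csv")
for r in bronze_rows:
    print(r["Item"], r["ingest_file"], r["record_source"])
```

As in Spark, where `current_timestamp()` returns one value per query, every row of a single load carries the same `ingest_ts`. The raw values stay untouched (here `Price` is still a string), which is the point of Bronze.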


In [0]:
from pyspark.sql.functions import input_file_name, current_timestamp, lit

# ---- Customers to Bronze ----
bronze_customers = (
    spark.read.option("header", True)
        .schema(customersSchema)
        .csv(raw_customers_path)
        .withColumn("ingest_ts", current_timestamp())
        .withColumn("ingest_file", input_file_name())
        .withColumn("record_source", lit("csv"))
)
bronze_customers.write.format("delta").mode("overwrite").option("overwriteSchema","true").save(paths["bronze_customers"])
spark.sql(f"CREATE TABLE IF NOT EXISTS bronze_customers USING DELTA LOCATION '{paths['bronze_customers']}'")

# ---- Products to Bronze ----
bronze_products = (
    spark.read.option("header", True)
        .schema(productsSchema)
        .csv(raw_products_path)
        .withColumn("ingest_ts", current_timestamp())
        .withColumn("ingest_file", input_file_name())
        .withColumn("record_source", lit("csv"))
)
bronze_products.write.format("delta").mode("overwrite").option("overwriteSchema","true").save(paths["bronze_products"])
spark.sql(f"CREATE TABLE IF NOT EXISTS bronze_products USING DELTA LOCATION '{paths['bronze_products']}'")

# ---- Sales to Bronze ----
bronze_sales = (
    spark.read.option("header", True)
        .schema(salesSchema)
        .csv(raw_sales_path)
        .withColumn("ingest_ts", current_timestamp())
        .withColumn("ingest_file", input_file_name())
        .withColumn("record_source", lit("csv"))
)
bronze_sales.write.format("delta").mode("overwrite").option("overwriteSchema","true").save(paths["bronze_sales"])
spark.sql(f"CREATE TABLE IF NOT EXISTS bronze_sales USING DELTA LOCATION '{paths['bronze_sales']}'")

for t in ["bronze_customers","bronze_products","bronze_sales"]:
    print(t, "rows:", spark.table(t).count())


## 4. SILVER — Cleanse, standardize, and validate

**Goal:** Produce clean, conformed data ready for analytics and joins.

### Key techniques & functions
- **Deduplication with Window**:
  - `Window.partitionBy(keys).orderBy(ingest_ts desc, ingest_file desc)` defines the latest record per key.
  - `F.row_number().over(window)` assigns row numbers; keeping `row_number == 1` preserves the **latest**.
- **Column cleanup**:
  - `trim` removes stray whitespace.
  - `lower` standardizes case (e.g., emails).
  - `to_date` parses date strings into the `date` type (Silver is where we cast types); when ANSI mode is off, values it cannot parse become `null` and are caught by the validity checks.
  - `cast` forces numeric types like `double` for `Quantity`, `UnitPrice`, `TaxAmount`.
- **Validity checks**:
  - Filter out negative prices/quantities or missing required fields.
  - Use `exceptAll` to capture invalid rows into **quarantine** for inspection.

We also keep **ingestion lineage** columns if useful for traceability.
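
The "latest record per key" rule can be illustrated without Spark. The sketch below is a plain-Python model of the window logic, not the notebook's implementation; the function name `dedup_latest_rows` and the sample rows are made up:

```python
def dedup_latest_rows(rows, keys):
    """Keep one row per key: the one with the greatest (ingest_ts, ingest_file)."""
    latest = {}
    for row in rows:
        k = tuple(row[name] for name in keys)
        rank = (row["ingest_ts"], row["ingest_file"])
        if k not in latest or rank > (latest[k]["ingest_ts"], latest[k]["ingest_file"]):
            latest[k] = row
    return list(latest.values())

# Invented rows: SO1 line 1 was ingested twice, the second time with a new quantity
rows = [
    {"SalesOrderNumber": "SO1", "SalesOrderLineNumber": 1, "Quantity": 1.0,
     "ingest_ts": "2024-01-01T00:00:00", "ingest_file": "a.csv"},
    {"SalesOrderNumber": "SO1", "SalesOrderLineNumber": 1, "Quantity": 2.0,
     "ingest_ts": "2024-01-02T00:00:00", "ingest_file": "b.csv"},
    {"SalesOrderNumber": "SO2", "SalesOrderLineNumber": 1, "Quantity": 5.0,
     "ingest_ts": "2024-01-01T00:00:00", "ingest_file": "a.csv"},
]
result = dedup_latest_rows(rows, ["SalesOrderNumber", "SalesOrderLineNumber"])
print([(r["SalesOrderNumber"], r["Quantity"]) for r in result])
```

One caveat worth knowing: when duplicates tie on both `ingest_ts` and `ingest_file` (for example, two identical keys in one file loaded in a single batch), the ordering does not distinguish them. This sketch keeps the first row seen, whereas Spark's `row_number` may pick either.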


In [0]:

from pyspark.sql.functions import col, trim, lower, to_date

# Load Bronze
b_c = spark.table("bronze_customers")
b_p = spark.table("bronze_products")
b_s = spark.table("bronze_sales")

def has(df, c): 
    return c in df.columns

def dedup_latest(df, keys):
    """
    Deduplicate by keeping the *latest* record per key set based on ingest_ts/ingest_file.
    - keys: list of column names defining natural key (e.g., CustomerName or (SalesOrderNumber, SalesOrderLineNumber))
    - Window.partitionBy(...).orderBy(desc) ranks rows; row_number==1 keeps the most recent.
    """
    w = Window.partitionBy(*[col(k) for k in keys]).orderBy(col("ingest_ts").desc(), col("ingest_file").desc())
    return df.withColumn("rn", F.row_number().over(w)).filter(col("rn") == 1).drop("rn")

# ---- Customers (clean) ----
cust_keys = ["CustomerName"] if has(b_c,"CustomerName") else (["CustomerId"] if has(b_c,"CustomerId") else b_c.columns[:1])
s_customers = dedup_latest(b_c, cust_keys)

for c in [x for x in ["CustomerName","Region","CustomerType","EmailAddress"] if has(s_customers,x)]:
    s_customers = s_customers.withColumn(c, trim(col(c)))
if has(s_customers,"EmailAddress"):
    s_customers = s_customers.withColumn("EmailAddress", lower(col("EmailAddress")))

s_customers.write.format("delta").mode("overwrite").save(paths["silver_customers"])
spark.sql(f"CREATE TABLE IF NOT EXISTS silver_customers USING DELTA LOCATION '{paths['silver_customers']}'")

# ---- Products (clean) ----
prod_keys = ["Item"] if has(b_p,"Item") else b_p.columns[:1]
s_products = dedup_latest(b_p, prod_keys)

for c in [x for x in ["Item","Category","ProductName","Segment"] if has(s_products,x)]:
    s_products = s_products.withColumn(c, trim(col(c)))
for c in [x for x in ["Price"] if has(s_products,x)]:
    s_products = s_products.withColumn(c, col(c).cast("double"))
if has(s_products,"LaunchDate"):
    s_products = s_products.withColumn("LaunchDate", to_date(col("LaunchDate")))

s_products.write.format("delta").mode("overwrite").save(paths["silver_products"])
spark.sql(f"CREATE TABLE IF NOT EXISTS silver_products USING DELTA LOCATION '{paths['silver_products']}'")

# ---- Sales (clean + validate) ----
sales_keys = ["SalesOrderNumber","SalesOrderLineNumber"] if has(b_s,"SalesOrderLineNumber") and has(b_s,"SalesOrderNumber") else (["SalesOrderNumber"] if has(b_s,"SalesOrderNumber") else b_s.columns[:1])
s_sales = dedup_latest(b_s, sales_keys)

for c in [x for x in ["SalesOrderNumber","CustomerName","Item","EmailAddress"] if has(s_sales,x)]:
    s_sales = s_sales.withColumn(c, trim(col(c)))
if has(s_sales,"EmailAddress"):
    s_sales = s_sales.withColumn("EmailAddress", lower(col("EmailAddress")))
if has(s_sales,"OrderDate"):
    s_sales = s_sales.withColumn("OrderDate", to_date(col("OrderDate")))
for c in [x for x in ["Quantity","UnitPrice","TaxAmount"] if has(s_sales,x)]:
    s_sales = s_sales.withColumn(c, col(c).cast("double"))

# Basic validity: non-null keys/dates and non-negative metrics
valid_sales = s_sales
if has(valid_sales,"SalesOrderNumber"):
    valid_sales = valid_sales.filter(col("SalesOrderNumber").isNotNull())
if has(valid_sales,"OrderDate"):
    valid_sales = valid_sales.filter(col("OrderDate").isNotNull())
if has(valid_sales,"Quantity"):
    valid_sales = valid_sales.filter((col("Quantity").isNotNull()) & (col("Quantity") >= 0))
if has(valid_sales,"UnitPrice"):
    valid_sales = valid_sales.filter((col("UnitPrice").isNotNull()) & (col("UnitPrice") >= 0))

invalid_sales = s_sales.exceptAll(valid_sales)

valid_sales.write.format("delta").mode("overwrite").save(paths["silver_sales"])
spark.sql(f"CREATE TABLE IF NOT EXISTS silver_sales USING DELTA LOCATION '{paths['silver_sales']}'")

invalid_sales.write.format("delta").mode("overwrite").save(paths["quarantine_sales"])
spark.sql(f"CREATE TABLE IF NOT EXISTS quarantine_sales USING DELTA LOCATION '{paths['quarantine_sales']}'")

print("Silver written. Quarantine count:", spark.table("quarantine_sales").count())
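
The validity rules used for the valid/quarantine split can be restated as a single predicate. The following is a plain-Python sketch of the same rules over dictionaries; `is_valid_sale`, `split_valid_invalid`, and the sample rows are illustrative assumptions:

```python
def is_valid_sale(row):
    """Required key and date present; Quantity and UnitPrice non-null and non-negative."""
    if row.get("SalesOrderNumber") is None or row.get("OrderDate") is None:
        return False
    for metric in ("Quantity", "UnitPrice"):
        value = row.get(metric)
        if value is None or value < 0:
            return False
    return True

def split_valid_invalid(rows):
    """Return (valid, invalid); every input row lands in exactly one of the two lists."""
    valid = [r for r in rows if is_valid_sale(r)]
    invalid = [r for r in rows if not is_valid_sale(r)]
    return valid, invalid

# Invented rows covering each failure mode
sample_rows = [
    {"SalesOrderNumber": "SO1", "OrderDate": "2024-01-01", "Quantity": 2.0, "UnitPrice": 10.0},
    {"SalesOrderNumber": "SO2", "OrderDate": None, "Quantity": 1.0, "UnitPrice": 5.0},
    {"SalesOrderNumber": "SO3", "OrderDate": "2024-01-02", "Quantity": -1.0, "UnitPrice": 5.0},
    {"SalesOrderNumber": "SO4", "OrderDate": "2024-01-03", "Quantity": 0.0, "UnitPrice": None},
]
valid_rows, invalid_rows = split_valid_invalid(sample_rows)
print(len(valid_rows), "valid;", len(invalid_rows), "quarantined")
```

The notebook reaches the same partition with `exceptAll`, which avoids negating the predicate by hand and keeps duplicate rows counted correctly.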


## 5. GOLD — Star-like model (dimensions + fact) and aggregates

**Goal:** Produce analytics-ready data marts.

### Steps & functions
- Build **dimension tables** by dropping duplicates on natural keys:
  - Customer key: use `CustomerName` if present, else `CustomerId` or the first column.
  - Product key: use `Item` if present else first column.
- Build **fact table** by joining Silver sales with dimensions:
  - `.join(dim_product, sales[prod_key] == dim_product.product_key, "left")`
  - `.join(dim_customer, sales[cust_key] == dim_customer.customer_key, "left")`
- Derive metrics:
  - `pre_tax_amount = Quantity * UnitPrice`
  - `total_amount = pre_tax_amount + coalesce(TaxAmount, 0.0)`
- Derive **date parts** (`year`, `month`, `dayofmonth`, `weekofyear`) for partitioning and drill-downs.
- **Partition** the fact by `(order_year, order_month)` to speed up time-based queries.

We also create a small **daily aggregate** table for quick reporting.
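
As a worked example of the metric formulas and the daily aggregate, here is a small plain-Python sketch. The helper names and the three sample rows are invented for illustration and mirror the formulas listed above:

```python
from collections import defaultdict

def add_amounts(row):
    """pre_tax_amount = Quantity * UnitPrice; total_amount adds TaxAmount, treating None as 0.0."""
    pre_tax = row["Quantity"] * row["UnitPrice"]
    tax = row["TaxAmount"] if row.get("TaxAmount") is not None else 0.0
    return {**row, "pre_tax_amount": pre_tax, "total_amount": pre_tax + tax}

def daily_totals(rows):
    """Sum Quantity and total_amount per OrderDate."""
    totals = defaultdict(lambda: {"total_qty": 0.0, "total_revenue": 0.0})
    for r in rows:
        day = totals[r["OrderDate"]]
        day["total_qty"] += r["Quantity"]
        day["total_revenue"] += r["total_amount"]
    return dict(totals)

# Invented sample rows
sales = [
    {"OrderDate": "2024-01-01", "Quantity": 2.0, "UnitPrice": 10.0, "TaxAmount": 1.5},
    {"OrderDate": "2024-01-01", "Quantity": 1.0, "UnitPrice": 4.0, "TaxAmount": None},
    {"OrderDate": "2024-01-02", "Quantity": 3.0, "UnitPrice": 5.0, "TaxAmount": 0.5},
]
fact_rows = [add_amounts(r) for r in sales]
daily = daily_totals(fact_rows)
print(daily)
```

The second row shows why `coalesce(TaxAmount, 0.0)` matters: without it, a missing tax value would make `total_amount` null in Spark and drop that sale from the revenue sum.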


In [0]:

from pyspark.sql.functions import year, month, dayofmonth, weekofyear, coalesce

s_c = spark.table("silver_customers")
s_p = spark.table("silver_products")
s_s = spark.table("silver_sales")

# Determine dimension keys based on available columns
cust_key = "CustomerName" if "CustomerName" in s_c.columns else ("CustomerId" if "CustomerId" in s_c.columns else s_c.columns[0])
prod_key = "Item" if "Item" in s_p.columns else s_p.columns[0]

# Dimensions
dim_customer = s_c.dropDuplicates([cust_key]).withColumnRenamed(cust_key, "customer_key")
dim_customer.write.format("delta").mode("overwrite").save(paths["gold_dim_customer"])
spark.sql(f"CREATE TABLE IF NOT EXISTS gold_dim_customer USING DELTA LOCATION '{paths['gold_dim_customer']}'")

dim_product = s_p.dropDuplicates([prod_key]).withColumnRenamed(prod_key, "product_key")
dim_product.write.format("delta").mode("overwrite").save(paths["gold_dim_product"])
spark.sql(f"CREATE TABLE IF NOT EXISTS gold_dim_product USING DELTA LOCATION '{paths['gold_dim_product']}'")

# Fact Sales
fact = (s_s
    .join(dim_product, s_s[prod_key] == dim_product["product_key"], "left")
    .join(dim_customer, s_s[cust_key] == dim_customer["customer_key"], "left")
)

if "Quantity" in fact.columns and "UnitPrice" in fact.columns:
    fact = fact.withColumn("pre_tax_amount", F.col("Quantity") * F.col("UnitPrice"))
else:
    fact = fact.withColumn("pre_tax_amount", F.lit(None).cast("double"))

if "TaxAmount" in fact.columns:
    fact = fact.withColumn("total_amount", F.col("pre_tax_amount") + coalesce(F.col("TaxAmount"), F.lit(0.0)))
else:
    fact = fact.withColumn("total_amount", F.col("pre_tax_amount"))

if "OrderDate" in fact.columns:
    fact = (fact
        .withColumn("order_year", year(F.col("OrderDate")))
        .withColumn("order_month", month(F.col("OrderDate")))
        .withColumn("order_day", dayofmonth(F.col("OrderDate")))
        .withColumn("order_week", weekofyear(F.col("OrderDate")))
    )

# Select a curated set of columns if present
select_cols = []
for c in ["SalesOrderNumber","SalesOrderLineNumber","OrderDate","customer_key","product_key",
          "Quantity","UnitPrice","TaxAmount","pre_tax_amount","total_amount",
          "order_year","order_month","order_day","order_week"]:
    if c in fact.columns: 
        select_cols.append(c)

fact = fact.select(*select_cols)

partition_cols = [c for c in ["order_year","order_month"] if c in fact.columns]
(fact.write.format("delta").mode("overwrite").partitionBy(partition_cols).save(paths["gold_fact_sales"]))
spark.sql(f"CREATE TABLE IF NOT EXISTS gold_fact_sales USING DELTA LOCATION '{paths['gold_fact_sales']}'")

# Daily aggregate
if "OrderDate" in fact.columns:
    gold_sales_daily = (fact.groupBy("OrderDate")
        .agg(F.sum("Quantity").alias("total_qty") if "Quantity" in fact.columns else F.lit(None).alias("total_qty"),
             F.sum("total_amount").alias("total_revenue") if "total_amount" in fact.columns else F.lit(None).alias("total_revenue"))
    )
    (gold_sales_daily.write.format("delta").mode("overwrite").save(paths["gold_sales_daily"]))
    spark.sql(f"CREATE TABLE IF NOT EXISTS gold_sales_daily USING DELTA LOCATION '{paths['gold_sales_daily']}'")

print("Gold model created.")
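
Because both joins are `left` joins, a sale whose `Item` or `CustomerName` has no match in the dimensions is kept, but its `product_key` or `customer_key` ends up null in the fact table. A plain-Python sketch of how such orphans could be listed before the join follows; `unmatched_keys` and the sample values are hypothetical:

```python
def unmatched_keys(sales_rows, dim_keys, key_name):
    """Distinct non-null values of key_name in sales_rows that are absent from dim_keys."""
    known = set(dim_keys)
    return sorted({r[key_name] for r in sales_rows
                   if r.get(key_name) is not None and r[key_name] not in known})

# Invented sample data: "Gizmo" is sold but missing from the product dimension
sample_sales = [{"Item": "Widget"}, {"Item": "Gadget"}, {"Item": "Gizmo"}, {"Item": None}]
sample_product_keys = ["Widget", "Gadget"]
orphans = unmatched_keys(sample_sales, sample_product_keys, "Item")
print(orphans)
```

In the notebook itself, counting rows of `gold_fact_sales` where `product_key` or `customer_key` is null gives the same signal after the fact.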



## 6. (Optional) Performance & Maintenance — OPTIMIZE / Z-ORDER / VACUUM

**Delta Lake** includes helpful commands:
- `OPTIMIZE table ZORDER BY (...)` clusters data files by the given columns to speed up range queries.
- `VACUUM table RETAIN <HOURS> HOURS` cleans old files. Be mindful of your retention policy.


In [0]:

# These are optional; uncomment if you have sufficient privileges and the tables exist.
# spark.sql("OPTIMIZE gold_fact_sales ZORDER BY (OrderDate)")
# spark.sql("OPTIMIZE gold_sales_daily ZORDER BY (OrderDate)")
# spark.sql("VACUUM gold_fact_sales RETAIN 168 HOURS")   # keep 7 days of history
# spark.sql("VACUUM gold_sales_daily RETAIN 168 HOURS")
print("Optional maintenance commands are commented out by default.")
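
The `168` in the commented commands is simply 7 days expressed in hours. A small sketch of building the statement from a retention in days is shown below; `vacuum_statement` is a hypothetical helper, and the 7-day floor reflects Delta's default safety check rather than a hard limit:

```python
def vacuum_statement(table: str, retain_days: int) -> str:
    """Build a VACUUM statement, refusing retention below Delta's default 7-day threshold."""
    if retain_days < 7:
        raise ValueError("retention below 7 days is rejected by Delta's default safety check")
    return f"VACUUM {table} RETAIN {retain_days * 24} HOURS"

for table_name in ["gold_fact_sales", "gold_sales_daily"]:
    print(vacuum_statement(table_name, 7))
```

The resulting strings could be passed to `spark.sql(...)` once you decide to enable maintenance.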



## 7. Validation and Sample Queries

This section verifies row counts and shows sample results to confirm the pipeline produced usable outputs.

- `spark.table(name).count()` — quick sanity checks.
- `display(...)` — Databricks utility to render tables in the UI.
- Example queries: **Top orders by revenue**, recent days’ aggregates, etc.


In [0]:

tables = ["bronze_customers","bronze_products","bronze_sales",
          "silver_customers","silver_products","silver_sales","quarantine_sales",
          "gold_dim_customer","gold_dim_product","gold_fact_sales"]

# gold_sales_daily may not exist if OrderDate was absent; handle safely
try:
    spark.table("gold_sales_daily")
    tables.append("gold_sales_daily")
except Exception:
    pass

print("Row counts:")
for t in tables:
    try:
        print(t, spark.table(t).count())
    except Exception as e:
        print(t, "ERROR:", e)

print("\nSample: Top 10 sales orders by total_amount (if column exists)")
try:
    display(spark.table("gold_fact_sales").orderBy(F.desc("total_amount")).limit(10))
except Exception as e:
    print("Display failed:", e)

print("\nSample: Last 20 days in gold_sales_daily (if table exists)")
try:
    display(spark.table("gold_sales_daily").orderBy(F.col("OrderDate").desc()).limit(20))
except Exception as e:
    print("Display failed:", e)
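
Beyond eyeballing row counts, two relationships should hold for this pipeline: Silver plus quarantine cannot exceed Bronze (deduplication only removes rows), and the fact table should have the same number of rows as `silver_sales` (left joins against deduplicated dimensions add no rows). A minimal sketch of such a check follows, with a hypothetical helper and invented counts:

```python
def check_layer_counts(counts):
    """Return a list of problems; an empty list means the counts are consistent."""
    problems = []
    if counts["silver_sales"] + counts["quarantine_sales"] > counts["bronze_sales"]:
        problems.append("silver_sales + quarantine_sales exceeds bronze_sales")
    if counts["gold_fact_sales"] != counts["silver_sales"]:
        problems.append("gold_fact_sales differs from silver_sales")
    return problems

# Invented counts, for illustration only
good_counts = {"bronze_sales": 1000, "silver_sales": 950, "quarantine_sales": 30, "gold_fact_sales": 950}
bad_counts = {"bronze_sales": 1000, "silver_sales": 950, "quarantine_sales": 30, "gold_fact_sales": 960}
print(check_layer_counts(good_counts))
print(check_layer_counts(bad_counts))
```

In the notebook, the dictionary could be filled from `spark.table(name).count()` for each table.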



## 8. (Optional) Incremental Ingestion with **Structured Streaming**
You can make Bronze ingestion incremental by watching a **directory** for new CSV files instead of reloading a single file.

### Concepts & functions
- `spark.readStream.format("csv").schema(...).load(dir)`: Read new files that land in `dir`.
- `withColumn("ingest_ts", current_timestamp())`, etc.: Same lineage columns as batch.
- `writeStream.format("delta").option("checkpointLocation", path).outputMode("append").trigger(availableNow=True).start(dest)`:
  - **`checkpointLocation`** stores offsets and state so the stream can resume safely.
  - **`trigger(availableNow=True)`** processes all files currently available and then stops (good for scheduled, batch-style backfills).

> Set `raw_sales_path` to a **directory** (e.g., `dbfs:/FileStore/lab_data/sales/`) and drop new CSVs there.
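
The role of the checkpoint can be pictured with a toy model: remember which files were already processed, and on each run pick up only the new ones. This is a deliberately simplified plain-Python sketch, not how Spark stores its checkpoint; the function and file names are made up:

```python
def process_new_files(available_files, checkpoint):
    """Return files not yet recorded in the checkpoint, then record them."""
    new_files = [f for f in sorted(available_files) if f not in checkpoint]
    checkpoint.update(new_files)
    return new_files

checkpoint = set()  # stands in for the checkpoint directory

# First run: two files are present
first_run = process_new_files(["sales_1.csv", "sales_2.csv"], checkpoint)
# Second run: one more file has landed; the earlier two are skipped
second_run = process_new_files(["sales_1.csv", "sales_2.csv", "sales_3.csv"], checkpoint)
print(first_run, second_run)
```

This is why deleting the checkpoint directory causes a stream to reprocess every file, which would append duplicates to Bronze.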


In [0]:

# Example streaming job (disabled by default).
# Ensure raw_sales_path points to a DIRECTORY with incoming files before enabling.

# from pyspark.sql.functions import input_file_name, current_timestamp, lit
# checkpoint_dir = f"{base_path}/_checkpoints/bronze_sales"

# stream_df = (spark.readStream
#     .format("csv")
#     .option("header", True)
#     .schema(salesSchema)              # enforce schema
#     .load(raw_sales_path)             # DIRECTORY with new files arriving
#     .withColumn("ingest_ts", current_timestamp())
#     .withColumn("ingest_file", input_file_name())
#     .withColumn("record_source", lit("csv_stream"))
# )

# query = (stream_df.writeStream
#     .format("delta")
#     .option("checkpointLocation", checkpoint_dir)
#     .outputMode("append")
#     .trigger(availableNow=True)       # process existing files then stop
#     .start(paths["bronze_sales"])
# )
# query.awaitTermination()
# print("Streaming ingestion completed (availableNow).")



## 9. Troubleshooting & FAQs

- **File not found**: Make sure your CSVs are in `dbfs:/FileStore/lab_data/` or update the `raw_*_path` variables.
- **Schema mismatch**: Adjust the `customersSchema/productsSchema/salesSchema` in the Schema cell.
- **Delta errors**: Ensure your cluster runtime supports Delta (all current DBR versions do).
- **Performance**: Tune `spark.sql.shuffle.partitions` and consider enabling the optional **OPTIMIZE**/`ZORDER` steps.
- **No Unity Catalog**: This notebook uses `CREATE DATABASE`/`USE` only; **no `USE CATALOG`** or catalog prefixes are present.


Performance and maintenance techniques at a glance:

| Technique              | Purpose                               | Frequency            | Pros                                      | Cons                                  |
|------------------------|---------------------------------------|----------------------|-------------------------------------------|---------------------------------------|
| **OPTIMIZE**           | Compact small files into larger ones  | Weekly/Daily         | Speeds queries, reduces file overhead     | Compute cost for rewriting data       |
| **Z-ORDER**            | Cluster data for better skipping      | With OPTIMIZE        | Faster queries on multi-column filters    | Rewrite cost; choose ≤3 columns       |
| **Auto Optimize**      | Optimize during writes                | Always on (if fits)  | No manual compaction, avoids small files  | Slightly slower writes                |
| **VACUUM**             | Delete old unreferenced files         | Weekly+              | Reclaims storage used by stale files      | Reduces time travel capability        |
| **Partitioning**       | Reduce scanned data by directory      | Table design time    | Large query speedup for targeted filters  | Over-partitioning → small files       |
| **Predicate Pushdown** | Push filters to storage layer         | Always               | Reads less data, improves performance     | Limited by available file statistics  |
| **Caching**            | Keep hot data in memory               | Job runtime          | Very fast re-use of data                  | Consumes cluster memory               |