# Notebook: 04_optimization_schedule

## Purpose:
This notebook serves as the **orchestration layer** of the data pipeline. It executes the full ETL flow, from raw data ingestion to aggregated insights, and applies performance optimizations post-processing.

---

## Key Responsibilities:

1. **Trigger Upstream Notebooks**
   - Executes the Bronze, Silver, and Gold transformation notebooks in sequence using `dbutils.notebook.run`.

2. **Optimize Delta Tables**
   - Applies Delta Lake performance commands such as:
     - `OPTIMIZE` to compact small files.
     - `ZORDER BY` to co-locate related values in the same files, so queries that filter on those columns can skip more data.
     - `VACUUM` to clean obsolete files and reduce storage cost.

3. **Simulate Daily Production Run**
   - Stands in for a real-world job scheduler such as Databricks Workflows, Airflow, or cron.
   - Ensures that the pipeline can be repeatedly and reliably executed end-to-end.

---

## Outcome:
Running this notebook executes the complete pipeline and optimizes its output, so that clean, curated, and performant datasets are available for downstream analytics or BI dashboards.


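## Step 1: Load Bronze, Silver, and Gold Tables

Each Delta table from the ETL pipeline is loaded before optimization. Reading a table that does not exist raises an error, so this step confirms that all three layers are present; previewing one of them gives a quick check of its structure.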


In [0]:
df_bronze = spark.read.table("bronze_sales")
df_silver = spark.read.table("silver_sales")
df_gold = spark.read.table("gold_sales")

df_gold.display()  # Just previewing one layer
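
The preview above is a visual check only. A minimal sketch of an explicit structure check follows, assuming the only required column is `InvoiceDate` (the one column named in this notebook); the helper `missing_columns` is hypothetical and not part of the original pipeline.

```python
def missing_columns(actual_columns, required_columns):
    """Return the required column names that are absent, in the order given."""
    # Compare case-insensitively, matching Spark's default column resolution.
    present = {name.lower() for name in actual_columns}
    return [name for name in required_columns if name.lower() not in present]


# In the notebook, this could be fed from the loaded DataFrame, for example:
#   gaps = missing_columns(df_gold.columns, ["InvoiceDate"])
#   if gaps:
#       raise ValueError(f"gold_sales is missing columns: {gaps}")
```

Failing early here keeps `OPTIMIZE ... ZORDER BY` from running against a table whose schema has drifted.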


## Step 2: Optimize and ZORDER Tables

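The `OPTIMIZE` command compacts many small files into fewer, larger ones, which improves read performance.  
`ZORDER BY` additionally co-locates rows with similar values of a frequently filtered column, such as `InvoiceDate`, so that queries filtering on it can skip unrelated files.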


In [0]:
%sql
OPTIMIZE gold_sales ZORDER BY (InvoiceDate)
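
If the same command needs to be issued from Python, for instance inside a loop over tables, the statement can be built as a string and passed to `spark.sql`. The sketch below is one possible approach; `optimize_statement` is a hypothetical helper, and the identifier check is a deliberately strict assumption that rejects quoted or unusual names.

```python
import re

# Accept only plain, optionally dot-qualified identifiers (an assumption made
# for safety; backtick-quoted names are rejected).
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*$")


def optimize_statement(table, zorder_columns=()):
    """Build an OPTIMIZE statement, with an optional ZORDER BY clause."""
    for name in (table, *zorder_columns):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"Unsupported identifier: {name!r}")
    sql = f"OPTIMIZE {table}"
    if zorder_columns:
        sql += f" ZORDER BY ({', '.join(zorder_columns)})"
    return sql


# In the notebook:
#   spark.sql(optimize_statement("gold_sales", ["InvoiceDate"]))
```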


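## Step 3: Vacuum to Reclaim Storage

To reduce storage costs, the `VACUUM` command deletes data files that are no longer referenced by the current table version and are older than the retention threshold (here 168 hours, i.e. 7 days). Table versions that depend on the deleted files can no longer be reached through time travel.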


In [0]:
%sql
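-- Uncomment only to vacuum with a retention shorter than 168 hours (this disables the safety check):
-- SET spark.databricks.delta.retentionDurationCheck.enabled = false;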

VACUUM gold_sales RETAIN 168 HOURS
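
Because `VACUUM` deletes files permanently, it can help to preview the effect with `DRY RUN` and to guard against an accidentally short retention. A minimal sketch, assuming the 168-hour floor used in the cell above; `vacuum_statement` is a hypothetical helper and expects a trusted table name.

```python
MIN_RETENTION_HOURS = 168  # 7 days, the retention used in the cell above


def vacuum_statement(table, retain_hours=MIN_RETENTION_HOURS, dry_run=False):
    """Build a VACUUM statement, refusing retention below the 7-day floor."""
    if retain_hours < MIN_RETENTION_HOURS:
        raise ValueError(
            f"retain_hours={retain_hours} is below the {MIN_RETENTION_HOURS}-hour floor"
        )
    sql = f"VACUUM {table} RETAIN {retain_hours} HOURS"
    if dry_run:
        # DRY RUN lists the files that would be deleted without removing them.
        sql += " DRY RUN"
    return sql


# In the notebook:
#   spark.sql(vacuum_statement("gold_sales", dry_run=True)).display()
#   spark.sql(vacuum_statement("gold_sales"))
```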


## Step 4: Delta Table History and Time Travel

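Delta Lake records every table change as a new version. Earlier versions can be listed with `DESCRIBE HISTORY` and queried by version number or timestamp, which also makes it possible to roll the table back to a previous state.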


In [0]:
%sql
DESCRIBE HISTORY gold_sales;

-- Preview the table as of version 0 (read-only; this does not roll anything back)
SELECT * FROM gold_sales VERSION AS OF 0;
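
The query above only reads an old version. To pick a rollback target by time rather than by number, the rows returned by `DESCRIBE HISTORY` (which include `version` and `timestamp` columns) can be scanned for the latest version committed at or before a cutoff. The sketch below assumes the history has been collected into dictionaries with naive `datetime` timestamps; `version_as_of` is a hypothetical helper and the dates are made up for illustration.

```python
from datetime import datetime


def version_as_of(history, cutoff):
    """Return the highest version committed at or before `cutoff`, or None."""
    eligible = [row["version"] for row in history if row["timestamp"] <= cutoff]
    return max(eligible) if eligible else None


# Illustrative history; real rows would come from:
#   [r.asDict() for r in spark.sql("DESCRIBE HISTORY gold_sales").collect()]
sample_history = [
    {"version": 2, "timestamp": datetime(2024, 1, 3, 6, 0)},
    {"version": 1, "timestamp": datetime(2024, 1, 2, 6, 0)},
    {"version": 0, "timestamp": datetime(2024, 1, 1, 6, 0)},
]
target = version_as_of(sample_history, datetime(2024, 1, 2, 12, 0))

# An actual rollback would then use Delta's RESTORE command:
#   spark.sql(f"RESTORE TABLE gold_sales TO VERSION AS OF {target}")
```

With the sample data, `target` is `1`, the last version committed before noon on the second day.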


## Step 5: Simulated Job Scheduling

Each pipeline notebook can be chained programmatically using `dbutils.notebook.run` to simulate scheduled execution.


In [0]:
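# Simulate a daily run: execute each stage in order.
# The second argument is the timeout in seconds; a failure or timeout raises and stops the chain.
dbutils.notebook.run("01_bronze_ingestion", 300)
dbutils.notebook.run("02_silver_cleaning", 300)
dbutils.notebook.run("03_gold_aggregation", 300)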
