# Incremental Data Load & Production Execution

## Overview
Up to this point, we have executed our pipelines interactively using Jupyter Notebooks. While notebooks are great for development and exploration, production workloads typically run as automated batch jobs.

**The Standard Approach:**
1.  **Code format:** Convert `.ipynb` notebooks to standard Python scripts (`.py`).
2.  **Execution:** Use the `spark-submit` command-line utility to submit jobs to the Spark cluster.
3.  **Arguments:** Pass runtime parameters (like `run_date`) dynamically via the command line.
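
As a rough sketch of how steps 2 and 3 fit together, a scheduler or wrapper script could assemble the `spark-submit` command line for each job. The helper name `build_spark_submit` is hypothetical, and the `local[*]` master is only an assumption suited to a single-machine setup.

```python
import shlex

def build_spark_submit(script, run_date, master="local[*]"):
    """Return the argument list for submitting `script` with `run_date` as its only argument.

    Hypothetical helper for illustration; a real deployment would usually add
    cluster-specific options as well.
    """
    return ["spark-submit", "--master", master, script, run_date]

cmd = build_spark_submit("load_store_dim.py", "20220104")

# shlex.join quotes arguments such as local[*] so the shell does not expand them.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where `spark-submit` is on the `PATH`.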

## Objective
In this notebook, we will simulate a production **Incremental Load** for a new date: **2022-01-04**.
1.  **Generate Data:** Create incremental source files for Store and Customer dimensions.
2.  **Convert Code:** Explain how to convert notebooks to scripts.
3.  **Execute Pipeline:** Run the loads for Store (SCD1) and Customer (SCD2).
4.  **Validate:** Verify that Store data was updated in-place and Customer data preserved history.

In [None]:
# Import necessary libraries
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col  # needed by the validation helper later in this notebook

# Initialize Spark Session for data generation and validation
spark = SparkSession.builder \
    .appName("Incremental Load Setup") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Job Parameters
run_date = "20220104"
base_path = "s3a://warehouse/" 

print(f"Preparing Incremental Load for: {run_date}")

## 1. Generate Incremental Data
We simulate the arrival of new data files for **Store** and **Customer**.

**Store Data (SCD Type 1 Changes):**
*   Store 5004 & 5006 have updated phone numbers.
*   Expectation: The existing records in `dim_store` should be updated.

**Customer Data (SCD Type 2 Changes):**
*   Customer C003 changes Plan Type.
*   Customer C014 changes Phone Number.
*   Expectation: New active records created, old records marked inactive.
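
The difference between the two expectations can be illustrated with a minimal plain-Python sketch. It is not the Delta Lake logic used by the actual loads: the function names are hypothetical, the rows are simple dictionaries, and the "old" values (`OLD-PHONE`, `OLD-PLAN`, the `2022-01-01` start date) are placeholders, not real warehouse data.

```python
HIGH_DATE = "9999-12-31"  # open-ended end date carried by the active row

def apply_scd1(dim_rows, incoming, key):
    """SCD Type 1: overwrite matching rows in place, append rows with unseen keys."""
    by_key = {row[key]: row for row in dim_rows}
    for new in incoming:
        if new[key] in by_key:
            by_key[new[key]].update(new)  # old values are lost
        else:
            dim_rows.append(dict(new))
            by_key[new[key]] = dim_rows[-1]
    return dim_rows

def apply_scd2(dim_rows, incoming, key, load_ts):
    """SCD Type 2: expire the active row of each changed key, then append a new active row."""
    for new in incoming:
        current = next(
            (r for r in dim_rows if r[key] == new[key] and r["active_flg"] == "Y"),
            None,
        )
        if current is not None:
            if {k: current.get(k) for k in new} == new:
                continue  # no attribute changed, keep the active row as is
            current["active_flg"] = "N"
            current["effective_end_dt"] = load_ts
        dim_rows.append({**new, "active_flg": "Y",
                         "effective_start_dt": load_ts,
                         "effective_end_dt": HIGH_DATE})
    return dim_rows

# Placeholder "before" state; only the new values come from the incremental files.
stores = [{"store_id": "5004", "phone": "OLD-PHONE"}]
apply_scd1(stores, [{"store_id": "5004", "phone": "91-88822-11111"}], key="store_id")

customers = [{"customer_id": "C003", "plan_type": "OLD-PLAN", "active_flg": "Y",
              "effective_start_dt": "2022-01-01", "effective_end_dt": HIGH_DATE}]
apply_scd2(customers, [{"customer_id": "C003", "plan_type": "Platinum"}],
           key="customer_id", load_ts="2022-01-04")

print(len(stores), len(customers))
```

After the calls, `stores` still holds one row per store, while `customers` holds two rows for C003: the expired version and the new active one.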

In [None]:
# --- 1. Create Store Incremental File ---
store_data = [
    ("5004", "Pet House OR", "321 Birch Blvd", "Small Town", "OR", "76684", "91-88822-11111"), # Phone Changed
    ("5006", "Pet House JK", "987 Cedar Rd", "Hill Town", "JK", "22222", "91-33333-44444")  # Phone Changed
]
store_cols = ["store_id", "store_name", "address", "city", "state", "zip", "phone"]

df_store_inc = spark.createDataFrame(store_data, store_cols)
store_path = f"{base_path}landing/source/store/store_{run_date}.csv"

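# Write CSV. Spark writes a directory of part files rather than a single named file,
# so the target is the parent directory of store_path. Note that mode("overwrite")
# replaces whatever that directory already contains.
store_dir = store_path.replace(f"store_{run_date}.csv", "")
df_store_inc.coalesce(1).write.mode("overwrite").option("header", "true").csv(store_dir)
print(f"Store Incremental Data written to directory: {store_dir}")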

# --- 2. Create Customer Incremental File ---
cust_data = [
    ("C003", "Imtiaz Ali", "789 Oak Ave", "Bigcity", "JK", "9876", "91-00000-00002", "imtiaz@email.com", "1990-03-03", "Platinum"), # Plan Changed to Platinum
    ("C014", "Madison White", "687 Elm St", "Smallville", "MH", "66555", "91-99999-88888", "madisonwhite@email.com", "1982-02-02", "Basic") # Phone Changed
]
cust_cols = ["customer_id", "name", "address", "city", "state", "zip_code", "phone_number", "email", "date_of_birth", "plan_type"]

df_cust_inc = spark.createDataFrame(cust_data, cust_cols)
cust_path = f"{base_path}landing/source/customer/customer_{run_date}.csv"

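# Write CSV. As with the store data, Spark writes part files into the parent
# directory of cust_path, and mode("overwrite") replaces its existing contents.
cust_dir = cust_path.replace(f"customer_{run_date}.csv", "")
df_cust_inc.coalesce(1).write.mode("overwrite").option("header", "true").csv(cust_dir)
print(f"Customer Incremental Data written to directory: {cust_dir}")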

## 2. Convert Notebooks to Scripts
In a production environment, we use the `jupyter nbconvert` utility to transform our interactive `.ipynb` files into executable `.py` scripts.

*Note: The command below is for demonstration. It assumes the notebooks exist in the current directory.*

In [None]:
# Example commands to convert notebooks (These are shell commands)
# !jupyter nbconvert --to script 12_Store_Dimension_Load_End_to_End.ipynb
# !jupyter nbconvert --to script 13_Customer_Dimension_Load_SCD2.ipynb

print("To convert notebooks to scripts, run:")
print("jupyter nbconvert --to script <notebook_name>.ipynb")

## 3. Execute Pipeline via Spark-Submit
We use `spark-submit` to trigger the jobs. Crucially, we modify our utility code to accept the `run_date` as a command-line argument (`sys.argv[1]`) instead of reading a config file.
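
A minimal sketch of the argument handling described above, assuming the run date arrives as the first positional argument in `YYYYMMDD` form. The function name `parse_run_date` is hypothetical and not taken from the utility code.

```python
import sys
from datetime import datetime

def parse_run_date(argv=None):
    """Return the run date (a YYYYMMDD string) taken from argv[1], validating its format."""
    argv = sys.argv if argv is None else argv
    if len(argv) < 2:
        raise SystemExit("usage: <script>.py <run_date as YYYYMMDD>")
    run_date = argv[1]
    if len(run_date) != 8 or not run_date.isdigit():
        raise SystemExit(f"invalid run_date {run_date!r}; expected YYYYMMDD")
    try:
        datetime.strptime(run_date, "%Y%m%d")  # rejects impossible dates such as 20221345
    except ValueError:
        raise SystemExit(f"invalid run_date {run_date!r}; expected YYYYMMDD")
    return run_date

# Inside a submitted script this would be called as parse_run_date() to read sys.argv.
# Here an explicit list stands in for the command line.
run_date = parse_run_date(["load_store_dim.py", "20220104"])
print(run_date)
```

Failing fast on a malformed date keeps a bad argument from silently producing a path that matches no source file.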

### Simulated Execution
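Since we are inside a notebook, we do not invoke `spark-submit` here. Instead, we show the commands that would be run and define a helper that inspects the target tables once the loads for the new `run_date` have completed.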

In [None]:
# --- SIMULATING THE INCREMENTAL BATCH RUN ---

# In reality, you would run (note the --master flag; the quotes keep the shell from expanding [*]):
# !spark-submit --master "local[*]" load_store_dim.py 20220104
# !spark-submit --master "local[*]" load_customer_dim.py 20220104

# Here, we will define a helper to run the logic for the new date

def run_incremental_validation(table_name, run_date):
    print(f"--- Validating Load for {table_name} on {run_date} ---")
    df = spark.sql(f"SELECT * FROM pyspark_warehouse.{table_name}")
    
    if "dim_store" in table_name:
        # Check for updated phone numbers
        df.filter(col("store_id").isin(["5004", "5006"])).show(truncate=False)
    elif "dim_customer" in table_name:
        # Check for history preservation (SCD2)
        df.filter(col("customer_id").isin(["C003", "C014"])).orderBy("customer_id", "effective_start_dt").show(truncate=False)

# Note: The actual loading logic would repeat the code from notebooks 12 and 13
# but filtered for the new date. For the purpose of this tutorial,
# assume the scripts have already run externally before the validation queries are executed.

## 4. Validation Results
After the batch jobs complete, we query the Data Warehouse to ensure the incremental logic worked correctly.
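
One way to turn the SCD Type 2 expectation into an automated check is sketched below. It assumes the rows are available as plain dictionaries (for a Spark DataFrame, something like `[r.asDict() for r in df.collect()]` on a small filtered result) and that the active row carries an end date starting with `9999-12-31`. The function name and the sample rows are hypothetical placeholders, not actual warehouse contents.

```python
from collections import defaultdict

def check_scd2_history(rows, key="customer_id"):
    """Return a list of problems found; an empty list means every check passed."""
    versions_by_key = defaultdict(list)
    for row in rows:
        versions_by_key[row[key]].append(row)

    problems = []
    for k, versions in versions_by_key.items():
        active = [r for r in versions if r["active_flg"] == "Y"]
        if len(active) != 1:
            problems.append(f"{k}: expected exactly 1 active row, found {len(active)}")
        for r in active:
            if not str(r["effective_end_dt"]).startswith("9999-12-31"):
                problems.append(f"{k}: active row has a closed effective_end_dt")
    return problems

# Placeholder rows shaped like the expected result for one changed customer.
sample = [
    {"customer_id": "C003", "active_flg": "N", "effective_end_dt": "2022-01-04"},
    {"customer_id": "C003", "active_flg": "Y", "effective_end_dt": "9999-12-31"},
]
problems = check_scd2_history(sample)
print(problems)
```

Returning a list of messages, instead of raising on the first failure, lets a batch job log every inconsistency in one pass.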

In [None]:
# --- Validate Store (SCD1) ---
print("Validating Store Dimension (SCD Type 1 - Updates):")
# We expect the phone numbers to be the NEW values from the 20220104 file.
# Old values should be overwritten.
# (Example query, to be run once the data has been loaded)
# spark.sql("SELECT store_id, phone, update_dt FROM pyspark_warehouse.dim_store WHERE store_id IN ('5004', '5006')").show()

print("Validating Customer Dimension (SCD Type 2 - History):")
# We expect 2 rows for C003 and C014:
# 1. Old row: active_flg='N', effective_end_dt = timestamp
# 2. New row: active_flg='Y', effective_end_dt = 9999-12-31
# (Simulated query assuming data was loaded)
# spark.sql("SELECT customer_id, plan_type, phone_number, active_flg, effective_start_dt, effective_end_dt FROM pyspark_warehouse.dim_customer WHERE customer_id IN ('C003', 'C014') ORDER BY customer_id, effective_start_dt").show()