**⭐ 1. What This Pattern Solves**

Captures pipeline execution metadata (success, failure, row counts, timing) for traceability, debugging, and compliance. Helps monitor production pipelines and maintain audit trails.

Use cases:

Recording the number of rows processed per stage

Capturing pipeline start/end timestamps

Logging errors and exceptions

Maintaining audit tables for regulatory or internal reporting

**⭐ 2. SQL Equivalent**

In [0]:
%sql
-- Insert execution log
INSERT INTO pipeline_audit
(pipeline_name, run_id, start_time, end_time, status, row_count)
VALUES ('customer_pipeline', 'run_20251204', CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, 'SUCCESS', 1000);

**⭐ 3. Core Idea**

Maintain a central audit table and log key metrics at each major step. Use PySpark to compute the metrics, then write them to a Delta or Parquet table, or forward them to an external monitoring tool.

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
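from datetime import datetime

# Assumes spark, df, pipeline_name, and run_id are already defined.
start_time = datetime.now()  # captured before the measured work
row_count = df.count()       # action: triggers a Spark job

status = "SUCCESS"

# Single-row DataFrame holding this run's audit metadata
audit_record = spark.createDataFrame([
    [pipeline_name, run_id, start_time, datetime.now(), status, row_count]
], ["pipeline_name", "run_id", "start_time", "end_time", "status", "row_count"])

# Append so that records from earlier runs are preserved
audit_record.write.format("delta").mode("append").save("/path/to/audit_table")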

**⭐ 5. Detailed Example**

In [0]:
from pyspark.sql import SparkSession
from datetime import datetime

spark = SparkSession.builder.getOrCreate()

pipeline_name = "customer_pipeline"
run_id = "run_20251204"

# Small in-memory DataFrame standing in for real pipeline data
data = [(1, "Alice"), (2, "Bob")]
df = spark.createDataFrame(data, ["id", "name"])

start_time = datetime.now()  # captured before the measured work
row_count = df.count()       # action: triggers a Spark job

status = "SUCCESS"

# Single-row DataFrame holding this run's audit metadata
audit_record = spark.createDataFrame([
    [pipeline_name, run_id, start_time, datetime.now(), status, row_count]
], ["pipeline_name", "run_id", "start_time", "end_time", "status", "row_count"])

audit_record.show()

In [0]:
+-----------------+------------+-------------------+-------------------+-------+---------+
|    pipeline_name|      run_id|         start_time|           end_time| status|row_count|
+-----------------+------------+-------------------+-------------------+-------+---------+
|customer_pipeline|run_20251204|2025-12-04 15:00:00|2025-12-04 15:00:05|SUCCESS|        2|
+-----------------+------------+-------------------+-------------------+-------+---------+


**⭐ 6. Mini Practice Problems**

Log row counts for multiple DataFrames in a single run.

Add a column to capture the environment (dev, prod) in audit logs.

Log pipeline failures using try/except and store status as "FAILURE".
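As a possible starting point for the first two problems, the sketch below builds one audit row per DataFrame and tags each row with the environment. The helper name build_audit_rows, the table_name and environment columns, and the FakeFrame stand-in are assumptions made for illustration; any object with a count() method, including a Spark DataFrame, could be passed instead.

```python
from datetime import datetime, timezone

# Hypothetical column layout: the original audit columns plus
# `environment` and `table_name`, which are assumptions of this sketch.
AUDIT_COLUMNS = ["pipeline_name", "run_id", "environment", "table_name",
                 "start_time", "end_time", "status", "row_count"]

def build_audit_rows(pipeline_name, run_id, environment, frames):
    """Return one audit tuple per named frame in `frames` (a dict)."""
    rows = []
    for table_name, frame in frames.items():
        start_time = datetime.now(timezone.utc)
        row_count = frame.count()  # on a Spark DataFrame this triggers a job
        end_time = datetime.now(timezone.utc)
        rows.append((pipeline_name, run_id, environment, table_name,
                     start_time, end_time, "SUCCESS", row_count))
    return rows

class FakeFrame:
    """Stand-in for a DataFrame so the sketch runs without Spark."""
    def __init__(self, n):
        self._n = n

    def count(self):
        return self._n

rows = build_audit_rows("customer_pipeline", "run_20251204", "dev",
                        {"customers": FakeFrame(2), "orders": FakeFrame(5)})
for row in rows:
    print(row[3], row[7])  # table name and its row count
```

With a live session, spark.createDataFrame(rows, AUDIT_COLUMNS) should turn these tuples into a DataFrame that can be appended to the audit table in a single write.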

**⭐ 7. Full Data Engineering Problem**

Scenario: You run a nightly ETL pipeline that must do the following:

Read multiple raw tables → clean and transform.

Count rows processed per table.

Capture start/end times, environment, run ID, and error messages.

Write audit logs to a Delta table.

Send alert email if any table fails QC checks.
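One way to structure the scenario above is sketched below in plain Python so it runs without a cluster. Everything named here is hypothetical: run_nightly_etl, the send_alert hook, and the convention that a table step returns its processed row count and raises an exception when a QC check fails. In a real pipeline each step would wrap the Spark reads and transformations.

```python
from datetime import datetime, timezone

def run_nightly_etl(run_id, environment, table_steps, send_alert):
    """Run each table step, collect audit rows, and alert on failures.

    `table_steps` maps a table name to a callable that returns the
    processed row count and raises if a QC check fails (an assumption).
    `send_alert` is a hypothetical hook, e.g. an email sender.
    """
    audit_rows, failed = [], []
    for table_name, step in table_steps.items():
        start_time = datetime.now(timezone.utc)
        status, row_count, error_message = "SUCCESS", None, None
        try:
            row_count = step()
        except Exception as exc:
            # Record the failure and keep processing the other tables.
            status = "FAILURE"
            error_message = f"{type(exc).__name__}: {exc}"
            failed.append(table_name)
        audit_rows.append((run_id, environment, table_name, start_time,
                           datetime.now(timezone.utc), status, row_count,
                           error_message))
    if failed:
        send_alert(f"Run {run_id} ({environment}): failed tables: {', '.join(failed)}")
    return audit_rows

def qc_failure():
    raise ValueError("null customer_id found")

alerts = []  # stands in for an email client in this sketch
audit_rows = run_nightly_etl(
    "run_20251204", "prod",
    {"customers": lambda: 1000, "orders": qc_failure},
    alerts.append,
)
```

Because row_count and error_message can be None, passing an explicit schema to spark.createDataFrame is likely safer than relying on type inference when these rows are written to the Delta audit table.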

**⭐ 8. Time & Space Complexity**

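Counting rows → count() is an action, so it triggers a Spark job (O(n) per DataFrame) and recomputes the DataFrame unless it is cached.

The audit record itself is a single small row, so logging adds negligible memory.

Avoid writing audit logs inside every transformation; batch them and write after major steps.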
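The batching advice can be sketched as a small in-memory buffer that is flushed once after the major steps. The AuditBuffer class and its writer callable are hypothetical names introduced for this example, not part of PySpark or Delta.

```python
class AuditBuffer:
    """Collect audit rows in memory and hand them to a writer in one batch."""

    def __init__(self, writer):
        self._writer = writer  # callable that receives a list of rows
        self._rows = []

    def add(self, *row):
        """Buffer one audit row without writing anything yet."""
        self._rows.append(row)

    def flush(self):
        """Write all buffered rows in one call; return how many were written."""
        if not self._rows:
            return 0
        batch, self._rows = self._rows, []
        self._writer(batch)
        return len(batch)

written_batches = []  # stands in for the audit table in this sketch
audit_buffer = AuditBuffer(written_batches.append)
audit_buffer.add("customer_pipeline", "run_20251204", "load", "SUCCESS", 1000)
audit_buffer.add("customer_pipeline", "run_20251204", "clean", "SUCCESS", 990)
audit_buffer.flush()  # one write for both steps
```

In a Spark job the writer could be a function that builds a DataFrame from the batch and appends it to the audit table, giving one small write per flush rather than one per step. Buffered rows are lost if the driver dies before the flush, so calling flush() from a finally block is worth considering.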

**⭐ 9. Common Pitfalls**

Calling .collect() on huge DataFrames just to log them → pulls every row to the driver and can exhaust its memory; log counts or aggregates instead.

Not including run_id or timestamp → difficult to trace executions.

Logging every row → bloats the audit table; log per-stage summaries instead.

Ignoring failures → the pipeline may fail silently with no audit record; log a FAILURE status and the error message.
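For the run_id pitfall, a run identifier that combines a UTC timestamp with a random suffix keeps executions traceable even when a pipeline runs more than once a day. The new_run_id helper and its ID format are assumptions of this sketch; the examples above use the simpler run_20251204 form.

```python
import uuid
from datetime import datetime, timezone

def new_run_id(prefix="run"):
    """Build an ID shaped like <prefix>_<UTC timestamp>_<8 hex chars>."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    suffix = uuid.uuid4().hex[:8]  # random part guards against same-second clashes
    return f"{prefix}_{stamp}_{suffix}"

run_id = new_run_id()
print(run_id)
```

Generating the ID once at the start of the run and passing it to every audit write lets all rows from one execution be joined on run_id.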