**⭐ 1. What This Pattern Solves**

This pattern ensures that incoming or transformed data meets the expected rules before further processing. It detects issues early, prevents corrupt data from propagating downstream, and makes automated alerting possible.

Use cases:

- Ensuring no nulls in critical columns (user_id, transaction_id)
- Checking ranges for numeric fields (amount > 0)
- Validating string formats (email regex)
- Verifying unique keys or foreign key relationships

**⭐ 2. SQL Equivalent**

In [0]:
%sql
-- Check for nulls
SELECT *
FROM raw_table
WHERE user_id IS NULL;

-- Check numeric ranges
SELECT *
FROM raw_table
WHERE amount <= 0;

**⭐ 3. Core Idea**

Implement assertions or filters as validations that run before the write. If a rule fails, log the failure, raise an error, or route the offending rows to a quarantine table.

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
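from pyspark.sql.functions import col

# Fail fast if a required column contains nulls
null_count = df.filter(col("col_name").isNull()).count()
if null_count > 0:
    raise ValueError(f"Column col_name has {null_count} nulls!")

# Collect rows that break any rule (non-positive amount or missing status)
invalid_rows = df.filter((col("amount") <= 0) | (col("status").isNull()))
if invalid_rows.count() > 0:
    invalid_rows.show()  # print a sample of the offending rows before failing
    raise ValueError("Data quality check failed!")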

**⭐ 5. Detailed Example**

In [0]:
from pyspark.sql.functions import col

data = [(1, 100, "completed"), (2, -50, "completed"), (3, 200, None)]
df = spark.createDataFrame(data, ["user_id", "amount", "status"])

# Check for nulls in 'status'
null_count = df.filter(col("status").isNull()).count()
print("Null count in status:", null_count)

# Check for invalid amounts
invalid_rows = df.filter(col("amount") <= 0)
invalid_rows.show()


Output:
Null count in status: 1
+-------+------+---------+
|user_id|amount|   status|
+-------+------+---------+
|      2|   -50|completed|
+-------+------+---------+


**⭐ 6. Mini Practice Problems**

1. Check for nulls in the email and phone_number columns.
2. Validate that the age column is between 0 and 120.
3. Ensure order_id is unique in a DataFrame.

**⭐ 7. Full Data Engineering Problem**

Scenario: You are building a pipeline that ingests healthcare patient records:

1. Read the raw CSV → patient_id, age, weight, diagnosis.
2. Ensure patient_id is not null and is unique.
3. Check that age and weight fall within reasonable ranges.
4. Route invalid records to a quarantine Delta table for review.
5. Write only validated data to the Silver table.

**⭐ 8. Time & Space Complexity**

- Each filter or validation is O(n) in the number of rows.
- Every count() is an action that triggers a Spark job → expensive on large datasets.
- Memory usage is minimal unless invalid rows are cached or collected to the driver for logging.

**⭐ 9. Common Pitfalls**

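- Using multiple .count() calls → each call triggers a separate job that rescans the data unless the DataFrame is cached.
- Not handling edge cases such as empty DataFrames → aggregates like min() or avg() return null on zero rows, so checks built on them can fail (or pass) unintentionally.
- Writing invalid data to production tables instead of routing it to quarantine.
- Ignoring performance → filters on huge datasets without partition pruning scan every row and can be slow.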