**⭐ 1. What This Pattern Solves**

Filtering extracts only the rows that satisfy a condition. It is used in:

- Data cleaning
- Valid record extraction
- Partition pruning
- Business-rule filtering
- Removing bad or null values
- Filtering before aggregations and joins

Filtering is one of the most frequent data engineering (DE) operations.

**⭐ 2. SQL Equivalent**

```sql
-- In Databricks, run this in a %sql cell.
SELECT *
FROM table
WHERE age > 30 AND status = 'ACTIVE';
```

The same filter in PySpark:

```python
from pyspark.sql import functions as F

df.filter((F.col("age") > 30) & (F.col("status") == "ACTIVE"))
```

**⭐ 3. Core Idea**

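Apply Boolean expressions to decide which rows to keep:

- Comparisons
- Logical operators
- Null checks
- String conditions
- Expression strings (`expr`)

Pass the condition to `.filter()` (or its alias `.where()`); it returns a new DataFrame containing only the rows for which the condition evaluates to true. The original DataFrame is unchanged.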

**⭐ 4. Template Code (MEMORIZE THIS)**

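Column-expression form (`>` stands in for any comparison operator, and `value` is a placeholder):

```python
df.filter(
    (F.col("col") > value) &
    (F.col("col2").isNotNull())
)
```

SQL-string form:

```python
df.filter("col > 10 AND status = 'ACTIVE'")
```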

**⭐ 5. Detailed Example**

Input:

```text
+----+-----+----------+
| id | age | status   |
+----+-----+----------+
| 1  | 25  | ACTIVE   |
| 2  | 40  | ACTIVE   |
| 3  | 45  | INACTIVE |
+----+-----+----------+
```


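```python
from pyspark.sql import functions as F

# Input rows from the table above; assumes an active SparkSession named `spark`.
rows = [(1, 25, "ACTIVE"), (2, 40, "ACTIVE"), (3, 45, "INACTIVE")]
df = spark.createDataFrame(rows, ["id", "age", "status"])

# Keep rows where both conditions hold.
out = df.filter(
    (F.col("age") > 30) &
    (F.col("status") == "ACTIVE")
)
out.show()
```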

Output:

```text
+----+-----+--------+
| id | age | status |
+----+-----+--------+
| 2  | 40  | ACTIVE |
+----+-----+--------+
```

**⭐ 6. Mini Practice Problems**

1. Filter rows where `salary > 50000` and `department = 'IT'`.
2. Filter rows where `event_date` is not null.
3. Filter customers whose `country` is either `"US"` or `"CA"`.

**⭐ 7. Full Data Engineering Problem**

**Scenario:**
You receive a Bronze streaming feed of transactions. A transaction is valid when all of the following hold:

- `amount > 0`
- `currency` is one of `"USD"`, `"CAD"`, `"EUR"`
- `transaction_date` is not null

**Task:**
Write a PySpark filter transformation that produces the Silver validated transactions DataFrame.

This mirrors a validation step that is common in fintech ETL pipelines.

**⭐ 8. Time & Space Complexity**

| Aspect | Cost                                                       |
| ------ | ---------------------------------------------------------- |
| Time   | **O(n)**: each row is evaluated once; no shuffle is needed |
| Memory | Minimal: no extra columns are allocated                    |


**⭐ 9. Common Pitfalls**

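❌ Using `and` / `or` instead of `&` / `|`: Python tries to convert the Column to a bool and PySpark raises an error

❌ Forgetting parentheses around each condition: `&` and `|` bind more tightly than comparisons such as `>` and `==`

❌ Filtering nulls with `== None` instead of `.isNull()`: the comparison evaluates to NULL for every row, so no rows are kept

❌ Using expensive functions (for example, Python UDFs) inside filters on huge datasets

❌ Writing long string expressions instead of safer column expressions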