**Filtering & Row Selection**

Q1: Simple Predicate Filtering

Problem: Select all employees with age > 30.

Constraints: Use the DataFrame API and column expressions.

Sample Data:

# columns: id, name, age
data = [
    (1, "Alice", 30),
    (2, "Bob", 25),
    (3, "Charlie", 35)
]

Expected Output:

+---+-------+---+
|id |name   |age|
+---+-------+---+
|3  |Charlie|35 |
+---+-------+---+

Q2: Filtering on Multiple Conditions

Problem: Select employees with age > 25 and name not equal to "Bob".

Constraints: Combine multiple conditions using the & operator.

Expected Output:

+---+-------+---+
|id |name   |age|
+---+-------+---+
|1  |Alice  |30 |
|3  |Charlie|35 |
+---+-------+---+

Q3: Null Handling in Filters

Problem: Select rows where department is not null.

Constraints: Use filter or where with null-safe conditions.

Expected Output:

+---+-------+----------+
|id |name   |department|
+---+-------+----------+
|1  |Alice  |HR        |
|3  |Charlie|Finance   |
+---+-------+----------+

Q4: Exclusion / Anti-Filter

Problem: Select employees whose name is neither "Alice" nor "Bob".

Constraints: Use ~col("name").isin(...) or chained col("name") != ... comparisons.

Expected Output:

+---+-------+
|id |name   |
+---+-------+
|3  |Charlie|
+---+-------+


Q5: Using isin for Filtering

Problem: Select employees from departments ["HR", "Finance"].

Expected Output:

+---+-------+----------+
|id |name   |department|
+---+-------+----------+
|1  |Alice  |HR        |
|3  |Charlie|Finance   |
+---+-------+----------+

Q6: Pre-Join Filtering

Problem: Filter employees with age > 30 before joining with salaries to improve performance.

Constraints: Filter first, then join.

Expected Output:

+---+-------+---+------+
|id |name   |age|salary|
+---+-------+---+------+
|3  |Charlie|35 |7000  |
+---+-------+---+------+


Q7: Complex Boolean Conditions

Problem: Select employees whose (age < 30 AND department = "HR") OR (age > 30 AND department = "Finance").

Expected Output:

+---+-------+---+----------+
|id |name   |age|department|
+---+-------+---+----------+
|3  |Charlie|35 |Finance   |
+---+-------+---+----------+

Q8: Post-Join Filtering

Problem: Join employees with salaries, then select only those with salary > 4500.

Expected Output:

+---+-------+---+------+
|id |name   |age|salary|
+---+-------+---+------+
|1  |Alice  |30 |5000  |
|3  |Charlie|35 |7000  |
+---+-------+---+------+


Q9: Anti-Join / Exclusion Pattern

Problem: Select employees without a salary record.

Expected Output:

+---+-----+---+
|id |name |age|
+---+-----+---+
|4  |David|28 |
+---+-----+---+


Q10: Filtering with coalesce for Nulls

Problem: Select employees where department is null or "HR" using coalesce.

Expected Output:

+---+-------+----------+
|id |name   |department|
+---+-------+----------+
|1  |Alice  |HR        |
|2  |Bob    |None      |
+---+-------+----------+

Q11: Conditional Filtering on Multiple Columns

Problem: Select employees with age > 30 and either department = "Finance" or salary > 6000.

Expected Output:

+---+-------+---+----------+------+
|id |name   |age|department|salary|
+---+-------+---+----------+------+
|3  |Charlie|35 |Finance   |7000  |
+---+-------+---+----------+------+


Q12: Filtering Top-N per Group (Complex)

Problem: Select top-1 salary per department.

Expected Output:

+----------+-------+---+------+
|department|name   |id |salary|
+----------+-------+---+------+
|HR        |Alice  |1  |5000  |
|Finance   |Charlie|3  |7000  |
+----------+-------+---+------+


Q13: Anti-Filter with Multiple Conditions

Problem: Exclude employees who are either under 30 or belong to "HR" department.

Expected Output:

+---+-------+---+
|id |name   |age|
+---+-------+---+
|3  |Charlie|35 |
+---+-------+---+

Q14: Filtering Using Expressions (expr)

Problem: Select employees with salary * 12 > 60000 using expr.

Expected Output:

+---+-------+------+------+
|id |name   |age   |salary|
+---+-------+------+------+
|3  |Charlie|35    |7000  |
+---+-------+------+------+


Q15: Large-Scale Filtering (Scalable)

Problem: For 1M+ employees, select only active employees whose last_login falls within the past 30 days. Simulate this with a small dataset.

Constraints: Use distributed filtering and avoid collecting the data locally.

Expected Output:

+---+-------+----------+------+
|id |name   |last_login|active|
+---+-------+----------+------+
|1  |Alice  |2026-01-10|True  |
|3  |Charlie|2026-01-15|True  |
+---+-------+----------+------+