**⭐ 1. What This Pattern Solves**

Selecting the top N rows per group is one of the most common tasks in data engineering (DE) work.
Use cases include:

- Top 3 products by sales per region
- Most recent orders per customer
- Top N users by activity per day

In effect, this is "group by + order by + limit", applied within each group.

**⭐ 2. SQL Equivalent**

In [0]:
%sql
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY purchase_amount DESC) as rn
    FROM purchases
) tmp
WHERE rn <= 3;

**⭐ 3. Core Idea**

1. Partition the data by the grouping key
2. Order each partition by the metric of interest
3. Assign a row number within each partition
4. Filter to keep only the top N rows

PySpark provides Window functions for this.
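
To make the four steps concrete, here is a minimal plain-Python sketch of what the window computes for each group. It only models the semantics and says nothing about how Spark executes the query; the function name `top_n_per_group` and the sample rows are illustrative assumptions, not part of any library.

```python
from collections import defaultdict


def top_n_per_group(rows, group_key, metric_key, n):
    """Model of PARTITION BY group_key ORDER BY metric_key DESC, keeping rn <= n."""
    # Step 1: partition rows by the grouping key
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[group_key]].append(row)

    result = []
    for group_rows in partitions.values():
        # Step 2: order each partition by the metric, largest first
        ordered = sorted(group_rows, key=lambda r: r[metric_key], reverse=True)
        # Step 3: assign a 1-based row number within the partition
        for rn, row in enumerate(ordered, start=1):
            # Step 4: keep only the top n rows
            if rn <= n:
                result.append({**row, "rn": rn})
    return result


# Hypothetical sample data, mirroring the columns of the SQL example
purchases = [
    {"customer_id": 1, "purchase_amount": 50},
    {"customer_id": 1, "purchase_amount": 80},
    {"customer_id": 1, "purchase_amount": 20},
    {"customer_id": 2, "purchase_amount": 70},
]

top_2 = top_n_per_group(purchases, "customer_id", "purchase_amount", n=2)
for row in top_2:
    print(row)
```

Customer 1 keeps the 80 and 50 purchases (rn 1 and 2), and customer 2 keeps its single row, which is the same outcome the `rn <= N` filter produces in SQL.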

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, desc, row_number

n = 3  # number of rows to keep per group

# Number rows within each group, highest metric first
window_spec = Window.partitionBy("group_col").orderBy(desc("metric_col"))

df_with_rank = df.withColumn("rn", row_number().over(window_spec))
top_n_df = df_with_rank.filter(col("rn") <= n)

**⭐ 5. Detailed Example**

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, desc

spark = SparkSession.builder.getOrCreate()

data = [
    ("A", "product1", 100),
    ("A", "product2", 300),
    ("A", "product3", 200),
    ("B", "product4", 400),
    ("B", "product5", 150)
]

df = spark.createDataFrame(data, ["region", "product", "sales"])

# Number rows within each region, highest sales first
window_spec = Window.partitionBy("region").orderBy(desc("sales"))
df_with_rank = df.withColumn("rn", row_number().over(window_spec))

# Keep the two best-selling products per region
top_2_per_region = df_with_rank.filter("rn <= 2")

top_2_per_region.show()

**Step-by-step:**

1. Partition by region
2. Order by sales descending
3. Assign row_number()
4. Filter rows where rn <= 2

Result: the top 2 products by sales in each region: product2 and product3 for region A, product4 and product5 for region B.

**⭐ 6. Mini Practice Problems**

1. Get the single most recent order per customer using order_date.
2. Find the top 5 highest-paid employees per department.
3. List the top 3 trending videos per category.

**⭐ 7. Full Data Engineering Problem**

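Scenario: A daily ETL job loads e-commerce sales, and a dashboard needs the top 3 products per category. The dataset is 100M+ rows per day.

Solution approach:

1. Read the Bronze daily sales logs
2. If the logs hold one row per sale, aggregate sales_amount per category and product first, so that products are ranked rather than individual sales
3. Partition by category
4. Order by sales_amount descending
5. Apply row_number() over the window
6. Filter to rn <= 3
7. Write the result to the Silver table

Performance tip: cache the intermediate DataFrame only if more than one downstream action reuses it; otherwise the cache just consumes memory.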

**⭐ 8. Time & Space Complexity**

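Time: O(n log n) for a partition of n rows, because each partition is sorted

Space: O(n) per partition, because all rows that share a key are shuffled to a single task and sorted there

Large or skewed partitions can cause memory pressure and spilling to disk → reduce the data before the window (filter early, keep only the needed columns, pre-aggregate), and consider bucketing the table on the grouping key if the same window runs repeatedly.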

**⭐ 9. Common Pitfalls**

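- Using rank() instead of row_number() → tied rows share a rank, so the filter can return more than N rows per group. row_number() always returns at most N, but it picks among tied rows arbitrarily unless orderBy includes a tiebreaker column
- Forgetting orderBy inside the Window → Spark rejects row_number() over an unordered window with an AnalysisException; forgetting desc() silently returns the bottom N instead
- Not filtering rn <= N → the full dataset remains
- One very large partition (a skewed key) → all of its rows are sorted in a single task, causing memory pressure and spills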
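
The tie-related pitfall is easiest to see on a small example. The sketch below emulates the numbering rules of `row_number()`, `rank()`, and `dense_rank()` in plain Python (it does not call Spark) and counts which rows survive a `<= 3` filter under each rule. The helper `number_rows` and the sales figures are made up for illustration.

```python
def number_rows(values_desc):
    """Return (value, row_number, rank, dense_rank) for values sorted descending."""
    numbered = []
    rank = dense_rank = 0
    previous = None
    for position, value in enumerate(values_desc, start=1):
        if value != previous:
            rank = position   # rank jumps to the current position after a tie
            dense_rank += 1   # dense_rank increases by exactly one per new value
        # row_number is simply the position, so ties never share a number
        numbered.append((value, position, rank, dense_rank))
        previous = value
    return numbered


# Hypothetical sales within one group, with ties at 300 and at 200
sales = sorted([300, 300, 200, 200, 100], reverse=True)
numbered = number_rows(sales)

kept_by_row_number = [v for v, rn, rk, drk in numbered if rn <= 3]
kept_by_rank = [v for v, rn, rk, drk in numbered if rk <= 3]
kept_by_dense_rank = [v for v, rn, rk, drk in numbered if drk <= 3]

print(kept_by_row_number)  # [300, 300, 200]
print(kept_by_rank)        # [300, 300, 200, 200]
print(kept_by_dense_rank)  # [300, 300, 200, 200, 100]
```

Only the row-number rule guarantees exactly N rows here. Which of the two 200 rows it keeps depends purely on input order in this sketch; in Spark the choice among tied rows is not guaranteed to be stable unless the ordering includes a tiebreaker column.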