**⭐ 1. What This Pattern Solves**

Removes duplicate rows or keeps only the latest record per key. Critical for Silver/Gold tables where each entity should be unique.

Use cases:

- Keep the latest transaction per CustomerID
- Remove duplicate logs before analytics
- Ensure unique users in a dataset
- Deduplicate based on timestamp or version

**⭐ 2. SQL Equivalent**

In [0]:
%sql
-- Simple deduplication
SELECT DISTINCT CustomerID, OrderDate, Amount
FROM Orders;

-- Keep latest per customer
SELECT *
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate DESC) AS rn
    FROM Orders
) t
WHERE rn = 1;

**⭐ 3. Core Idea**

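Two main approaches:

- Simple deduplication: dropDuplicates() with no arguments removes rows that are identical in every column. dropDuplicates(subset=[...]) keeps one row per distinct combination of the subset columns, but which row survives is not guaranteed.
- Windowed deduplication: use row_number() over a window partitioned by the key and ordered by a timestamp or priority column, then keep rn = 1. This controls which row per group survives.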

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
from pyspark.sql.window import Window
from pyspark.sql import functions as F

# 1. Simple deduplication: one row per (col1, col2); the surviving row is arbitrary
deduped = df.dropDuplicates(subset=["col1", "col2"])

# 2. Windowed deduplication: keep the row with the largest col2 per col1
window_spec = Window.partitionBy("col1").orderBy(F.desc("col2"))
latest = (
    df.withColumn("rn", F.row_number().over(window_spec))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

**⭐ 5. Detailed Example**

In [0]:
data = [
    ("Alice", "2025-01-01", 100),
    ("Alice", "2025-01-01", 100),  # duplicate
    ("Alice", "2025-01-02", 150),
    ("Bob", "2025-01-01", 200)
]

df = spark.createDataFrame(data, ["Customer", "Date", "Amount"])

# Simple deduplication
deduped_simple = df.dropDuplicates(["Customer", "Date", "Amount"])
deduped_simple.show()

# Windowed deduplication (keep latest Date per Customer)
# Date is an ISO-8601 string here, so string order matches chronological order
window_spec = Window.partitionBy("Customer").orderBy(F.desc("Date"))
deduped_window = df.withColumn("rn", F.row_number().over(window_spec)) \
                   .filter("rn = 1") \
                   .drop("rn")
deduped_window.show()

Output of deduped_window.show() (row order may vary):
+--------+----------+------+
|Customer|      Date|Amount|
+--------+----------+------+
|   Alice|2025-01-02|   150|
|     Bob|2025-01-01|   200|
+--------+----------+------+


**⭐ 6. Mini Practice Problems**

1. Drop exact duplicates in a dataset with UserID and SessionID.
2. Keep the latest transaction per CustomerID using window functions.
3. Deduplicate a Product table to keep the highest version number per product.

**⭐ 7. Full Data Engineering Problem**

Scenario: Bronze sales dataset has CustomerID, OrderDate, Amount.

Requirement: Silver table with latest order per customer for analytics.

Steps:

1. Read the Bronze dataset.
2. Remove exact duplicates using dropDuplicates if necessary.
3. Define a window partitioned by CustomerID and ordered by OrderDate DESC.
4. Keep the rows where row_number() = 1.
5. Write the Silver table to Delta for downstream reporting.

This mirrors production pipelines where deduplication ensures data integrity in aggregated tables.

**⭐ 8. Time & Space Complexity**

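dropDuplicates: roughly O(n) work when Spark can use hash aggregation, plus a shuffle so that rows with equal keys land in the same partition. The shuffle happens whether the subset has one column or many; wider keys only make hashing and comparison more expensive.

Windowed deduplication: one shuffle by the partition key, then a sort within each group, so about O(k log k) for a group of k rows and O(n log n) in the worst case. Large groups need memory for the sort and may spill to disk.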

**⭐ 9. Common Pitfalls**

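- Using dropDuplicates(subset=[...]) when a specific row must survive → the row kept per key is arbitrary, so the latest record may be dropped. Without subset, all columns are compared, so rows that differ only in a metadata column (for example a load timestamp) are not treated as duplicates.
- Wrong or incomplete ordering in the window → row_number() requires an ordered window (Spark raises an AnalysisException without orderBy), and ordering ascending or on a column with ties can keep the wrong “latest” record.
- Not caching a large DataFrame that several actions or window functions reuse → the shuffle and sort are recomputed each time.
- Skewed keys (a few keys with very many rows) → memory pressure and spilling in windowed deduplication, because each group is sorted within a single task.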