**⭐ 1. What This Pattern Solves**

Counts unique values per group. Essential for metrics like number of unique customers, sessions, products, or devices.

Use cases:

Unique customers per region

Unique products sold per month

Active users per day/week

Deduplicated counts in reporting and analytics

**⭐ 2. SQL Equivalent**

In [0]:
%sql
SELECT Region, COUNT(DISTINCT CustomerID) AS UniqueCustomers
FROM Orders
GROUP BY Region;

**⭐ 3. Core Idea**

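Use F.countDistinct() inside agg() for exact results, or F.approx_count_distinct() for large datasets. The approximate version still shuffles, but it exchanges a small fixed-size HyperLogLog++ sketch per group instead of every distinct value, trading a small, tunable relative error (rsd, 0.05 by default) for lower memory use and less data movement.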

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
from pyspark.sql import functions as F

# Exact distinct count
df.groupBy("col1").agg(F.countDistinct("col2").alias("unique_count"))

# Approximate distinct count (faster for big data)
df.groupBy("col1").agg(F.approx_count_distinct("col2").alias("approx_unique_count"))


**⭐ 5. Detailed Example**

In [0]:
data = [
    ("East", "Alice"),
    ("East", "Alice"),
    ("East", "Bob"),
    ("West", "Bob"),
    ("West", "Charlie")
]

df = spark.createDataFrame(data, ["Region", "Customer"])

# Exact distinct count
result = df.groupBy("Region").agg(F.countDistinct("Customer").alias("UniqueCustomers"))

result.show()

Output (row order is not guaranteed):
+------+---------------+
|Region|UniqueCustomers|
+------+---------------+
|  East|              2|
|  West|              2|
+------+---------------+


**⭐ 6. Mini Practice Problems**

Count unique products sold per Store.

Count distinct session IDs per Day.

Use approx_count_distinct for user IDs per Region and compare with exact count.

**⭐ 7. Full Data Engineering Problem**

Scenario: Bronze clickstream logs have SessionID, UserID, Page, Date.

Goal: Silver table with unique users per day.

Steps:

Read logs from S3.

Group by Date.

Compute countDistinct(UserID) or approx_count_distinct(UserID) for performance.

Write to Delta for daily reporting.

This is very common in web analytics and product usage metrics.

**⭐ 8. Time & Space Complexity**

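Time: O(n) in the number of rows, plus a shuffle. For an exact count, the distinct (group, value) pairs must be brought together across partitions before they can be counted.

Space: Grows with the number of distinct values per group for exact counts; approx_count_distinct keeps a fixed-size sketch per group, so its memory footprint does not grow with cardinality.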
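To make the space cost concrete, here is a plain-Python illustration of exact distinct counting. It is not how Spark implements the aggregation; exact_distinct_per_group is a hypothetical helper that only shows why the state grows with the number of distinct values.

```python
from collections import defaultdict


def exact_distinct_per_group(pairs):
    """Count distinct values per group by remembering every value seen per group."""
    seen = defaultdict(set)
    for group, value in pairs:
        # One pass over the rows; the set grows only when a new value appears
        seen[group].add(value)
    return {group: len(values) for group, values in seen.items()}


rows = [
    ("East", "Alice"),
    ("East", "Alice"),
    ("East", "Bob"),
    ("West", "Bob"),
    ("West", "Charlie"),
]
counts = exact_distinct_per_group(rows)
print(counts)  # {'East': 2, 'West': 2}
```

The sets hold every distinct value, so memory scales with cardinality. A sketch-based estimator replaces each set with a fixed-size structure, which is where the memory saving of approx_count_distinct comes from.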

**⭐ 9. Common Pitfalls**

Using count() instead of countDistinct() → double counts duplicates.

Forgetting alias() → messy column names.

Insisting on exact counts for very large, high-cardinality datasets when a small error would be acceptable → performance bottleneck; consider approx_count_distinct().

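Combining distinct counts on several high-cardinality columns in a single agg() → Spark typically plans this by expanding each input row once per distinct column before the shuffle, so shuffle volume multiplies. Caching the input does not avoid this; consider approx_count_distinct() or separate aggregations.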