# PySpark: Zero to Hero
## Module 11: Window Functions & Unique Data

In this module, we tackle advanced analytical problems.
1.  **Distinct Data:** Removing duplicates.
2.  **Window Functions:** The superpower of SQL/Spark for ranking and running totals.
3.  **Bonus:** Introduction to Databricks Community Edition.

### Scenario:
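We want to find the **2nd Highest Salary** in each department. A plain `groupBy` makes this awkward, because it collapses each department to a single row and has no direct "nth highest" aggregate. Window Functions solve it in a few lines while keeping every row.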

In [None]:
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number, rank, dense_rank, desc

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Window_Functions_Demo") \
    .master("local[*]") \
    .getOrCreate()

# Create Employee Data with duplicates and varying salaries
data = [
    ("001", "John Doe", "IT", 70000),
    ("002", "Jane Smith", "IT", 65000),
    ("003", "Bob Brown", "IT", 80000),
    ("004", "Alice Lee", "HR", 50000),
    ("005", "Jack Chan", "HR", 50000), # Duplicate Salary in HR
    ("006", "Jill Wong", "HR", 60000),
    ("001", "John Doe", "IT", 70000)  # Duplicate Row
]
columns = ["emp_id", "name", "dept", "salary"]

df = spark.createDataFrame(data, columns)

print("--- Original Data (With Duplicates) ---")
df.show()

In [None]:
# 1. Get Distinct Rows (Removes fully duplicate rows)
df_unique = df.distinct()

print("--- Unique Rows ---")
df_unique.show()

# 2. Drop Duplicates based on specific columns
# E.g., keep only one row per emp_id, even if the other columns differ.
# Which of the matching rows survives is not guaranteed.
df_deduped = df.dropDuplicates(["emp_id"])

print("--- Deduped by Employee ID ---")
df_deduped.show()

In [None]:
# Problem: Find the 2nd Highest Salary per Department.

# Step 1: Define the Window Specification
# Partition by Department -> Sort by Salary Descending
window_spec = Window.partitionBy("dept").orderBy(col("salary").desc())

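# Step 2: Apply Window Function (Row Number)
# row_number() gives a sequential number 1, 2, 3... within each window partition.
# Tied salaries still get different numbers, in no guaranteed order,
# so the "rank" column below is really a row position.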
df_ranked = df_unique.withColumn("rank", row_number().over(window_spec))

print("--- Ranked Data (Row Number) ---")
df_ranked.show()

In [None]:
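# Now that every row has a position, filtering is easy.
# Keep the row at position 2 in each department (rank == 2).
# Caveat: HR has two employees tied at 50,000, so only one of them is returned.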

df_second_highest = df_ranked.filter(col("rank") == 2)

print("--- 2nd Highest Salary per Department ---")
df_second_highest.drop("rank").show()

In [None]:
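# What if salaries are tied? (See HR department: 50,000 appears twice)

# 1. Row Number: Always unique, ties broken arbitrarily (1, 2, 3, 4)
# 2. Rank: Skips numbers after a tie (1, 2, 2, 4)
# 3. Dense Rank: No skipping (1, 2, 2, 3)
# Note: in HR the tie is on the lowest salary, so rank and dense_rank both
# show 1, 2, 2 here; they only diverge for rows that come after a tie.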

window_spec = Window.partitionBy("dept").orderBy(col("salary").desc())

df_comparison = df_unique \
    .withColumn("row_number", row_number().over(window_spec)) \
    .withColumn("rank", rank().over(window_spec)) \
    .withColumn("dense_rank", dense_rank().over(window_spec))

print("--- Row Number vs Rank vs Dense Rank ---")
df_comparison.filter(col("dept") == "HR").show()

## Bonus Tip: Databricks Community Edition

If you cannot install Docker locally, you can use **Databricks Community Edition** for free.

1.  **URL:** [community.cloud.databricks.com](https://community.cloud.databricks.com)
2.  **Sign Up:** Choose the "Community Edition" (Free, no credit card required).
3.  **Cluster:** It gives you a micro-cluster (15GB RAM) that terminates after 2 hours of inactivity.
4.  **Usage:** The code in this module runs unchanged in Databricks notebooks. A `spark` session already exists there, so `getOrCreate()` simply returns it.

## Summary

1.  **`distinct()`**: Removes fully duplicate rows.
2.  **`dropDuplicates([cols])`**: Removes duplicates based on specific columns.
3.  **Window Functions:**
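    *   Ranking functions require an ordered window, typically `Window.partitionBy(...).orderBy(...)`. `partitionBy` is optional, but omitting it moves all rows into a single partition.
    *   **`row_number()`**: Unique sequential number (1, 2, 3, 4); ties are broken arbitrarily.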
    *   **`rank()`**: Skips on ties (1, 2, 2, 4).
    *   **`dense_rank()`**: No skipping on ties (1, 2, 2, 3).

**Next Steps:**
We are now ready to handle complex Data Engineering tasks. In the next module, we will tackle **Data Repartitioning & Coalesce**.