# PySpark: Zero to Hero
## Module 10: Union, Sorting, and Aggregations

In this module, we move from row-level transformations to set-level operations and summarization.

### Agenda:
1.  **Union:** Combining two DataFrames (Appending).
2.  **Sorting:** Ordering data using `orderBy` and `sort`.
3.  **Aggregations:** Summarizing data using `groupBy`, `sum`, `avg`, and `count`.
4.  **Bonus:** Handling Unions when column order is different (`unionByName`).

In [None]:
from pyspark.sql import SparkSession
# Note: importing sum and count by name shadows Python's built-in sum() for the rest of the notebook.
from pyspark.sql.functions import col, sum, avg, count

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Union_Sort_Agg") \
    .master("local[*]") \
    .getOrCreate()

# Prepare Data: We will create TWO DataFrames with identical schemas but different data to demonstrate Union
columns = ["emp_id", "name", "department", "salary"]

data_1 = [
    ("001", "John Doe", "IT", 50000),
    ("002", "Jane Smith", "HR", 45000)
]

data_2 = [
    ("003", "Bob Brown", "IT", 55000),
    ("004", "Alice Lee", "Sales", 48000),
    ("001", "John Doe", "IT", 50000) # Duplicate record for testing
]

df1 = spark.createDataFrame(data_1, columns)
df2 = spark.createDataFrame(data_2, columns)

print("--- DataFrame 1 ---")
df1.show()

print("--- DataFrame 2 ---")
df2.show()

In [None]:
# Union appends the rows of one DataFrame to another.
# Requirement: Both DataFrames must have the SAME number of columns, and columns are matched by position, not by name.
# Note: In PySpark, 'union' behaves like SQL 'UNION ALL' (it keeps duplicates).

df_union = df1.union(df2)

print("--- Unioned DataFrame (Contains Duplicates) ---")
df_union.show()

# To remove duplicates (Simulating SQL 'UNION'), use .distinct()
df_unique = df_union.distinct()

print("--- Distinct Union (No Duplicates) ---")
df_unique.show()

In [None]:
# Sorting data using orderBy (sort is an alias for the same method).
# Let's sort by Salary in Descending order.

# Method 1: Using col() object (Recommended)
df_sorted = df_union.orderBy(col("salary").desc())

# Method 2: Using String syntax (Simple)
# df_sorted = df_union.orderBy("salary", ascending=False)

print("--- Sorted by Salary (Desc) ---")
df_sorted.show()

In [None]:
# Aggregations are used to summarize data.
# Scenario: Count number of employees per department.

df_grouped = df_union.groupBy("department").count()

print("--- Count per Department ---")
df_grouped.show()

In [None]:
# We often need multiple metrics at once (e.g., Total Salary AND Average Salary).
# We use the .agg() function for this.
# Note: df_union still contains the duplicate John Doe row, so the IT figures count him twice.

df_summary = df_union.groupBy("department").agg(
    sum("salary").alias("total_salary"),
    avg("salary").alias("avg_salary"),
    count("emp_id").alias("emp_count")
)

print("--- Department Summary ---")
df_summary.show()

In [None]:
# In SQL, we use 'HAVING' to filter after aggregation.
# In PySpark, we simply chain a .where() method after the aggregation.

# Scenario: Show departments where Total Salary > 50,000
df_high_budget = df_summary.where(col("total_salary") > 50000)

print("--- Departments with Budget > 50k ---")
df_high_budget.show()

In [None]:
# Problem: Standard .union() matches columns by position, not name.
# If df1 has ["id", "name"] and df2 has ["name", "id"], .union() raises no error
# but silently puts df2's names in the "id" column and its ids in the "name" column.

# Solution: Use .unionByName() to match columns safely.

# Create df3 with different column order
data_3 = [("Marketing", 60000, "005", "Mike Ross")]
columns_3 = ["department", "salary", "emp_id", "name"] # Different order

df3 = spark.createDataFrame(data_3, columns_3)

# Safe Union
df_safe_union = df1.unionByName(df3)

print("--- Safe Union (Column Order Auto-Resolved) ---")
df_safe_union.show()

## Summary

1.  **`union()`**: Appends data. Does **not** remove duplicates by default (acts like SQL `UNION ALL`).
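2.  **`distinct()`**: Removes fully duplicate rows. `union()` followed by `distinct()` mimics SQL `UNION`.
3.  **`orderBy(col("salary").desc())`**: Sorts data in descending order. `sort` is an alias for `orderBy`.
4.  **`groupBy().agg()`**: The standard pattern for calculating sums, averages, and counts. Chain `.where()` afterwards to filter the result, like SQL `HAVING`.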
5.  **`unionByName()`**: Essential when datasets have the same columns but in different orders.

**Next Steps:**
In the next module, we will cover **Joins** (Inner, Left, Right, Full) and handling ambiguous columns.