# PySpark: Zero to Hero
## Module 8: Advanced Column Operations & Filtering

In this module, we will perform essential data manipulation tasks that appear in most ETL pipelines.

### Agenda:
1.  **Adding Columns:** `withColumn()` & `lit()`.
2.  **Calculated Columns:** Performing math on existing columns.
3.  **Renaming Columns:** `withColumnRenamed()`.
4.  **Dropping Columns:** `drop()`.
5.  **Limiting Rows:** `limit()`.
6.  **Bonus:** Adding multiple columns from a dictionary using a loop (instead of a long hand-written `withColumn` chain).

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, expr

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Advanced_Column_Ops") \
    .master("local[*]") \
    .getOrCreate()

# Create Dummy Data
data = [
    ("001", "John Doe", "30", "50000"),
    ("002", "Jane Smith", "25", "45000"),
    ("003", "Bob Brown", "35", "55000"),
    ("004", "Alice Lee", "28", "48000")
]
columns = ["emp_id", "name", "age", "salary"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Cast salary to Double for calculations
df = df.withColumn("salary", col("salary").cast("double"))

print("--- Original DataFrame ---")
df.show()

In [None]:
# 1. Calculated Column: Add a 'tax' column (20% of salary)
# 2. Static Column: Add a 'country' column with value 'USA' using lit()

# lit() stands for Literal. It is used to add a constant value to a DataFrame.

df_added = df \
    .withColumn("tax", col("salary") * 0.2) \
    .withColumn("country", lit("USA"))

print("--- Added Tax and Country ---")
df_added.show()

In [None]:
# Renaming 'emp_id' to 'id' and 'name' to 'full_name'
# We use withColumnRenamed(existing_name, new_name)

df_renamed = df_added \
    .withColumnRenamed("emp_id", "id") \
    .withColumnRenamed("name", "full_name")

print("--- Renamed Columns ---")
df_renamed.show()

> **Pro Tip:** Although Spark allows spaces in column names (e.g., "Tax Amount"), it is highly recommended to avoid them. Downstream systems (like Parquet files, Hive, or some SQL databases) often fail or require complex escaping when handling spaces. Stick to `snake_case` (e.g., `tax_amount`).

In [None]:
# Removing columns we don't need.
# Let's drop 'tax' and 'country' to return to a cleaner state.

# You can drop single or multiple columns
df_dropped = df_renamed.drop("tax", "country")

print("--- Dropped Columns ---")
df_dropped.show()

In [None]:
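# Sometimes you don't want to process the whole dataset, just a few rows.
# limit(n) is a transformation that returns a new DataFrame with at most n rows.
# Without an explicit orderBy(), which rows you get is not guaranteed.

# Get the first 2 rows
df_limited = df_dropped.limit(2)

print("--- First 2 Rows ---")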
df_limited.show()

In [None]:
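# Problem: Hand-writing a long chain of .withColumn() calls is messy
# if you have 50 columns to add.

# Solution: Define the columns in a dictionary and apply them in a loop.
# Note: the loop still calls withColumn() once per column, so it tidies the code
# but does not reduce the number of projections Spark adds to the query plan.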

# Define a dictionary of new columns {name: value/expression}
new_columns = {
    "bonus": col("salary") * 0.1,
    "is_active": lit(True),
    "department": lit("IT")
}

# Use a loop to apply them
df_dynamic = df_dropped
for col_name, col_val in new_columns.items():
    df_dynamic = df_dynamic.withColumn(col_name, col_val)

print("--- Dynamically Added Multiple Columns ---")
df_dynamic.show()

## Summary

1.  **`withColumn(name, logic)`**: Used to add a new column or overwrite an existing one.
2.  **`lit(value)`**: Essential function to add static/constant values.
3.  **`withColumnRenamed(old, new)`**: Renames a column.
4.  **`drop(col1, col2)`**: Removes columns.
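5.  **`limit(n)`**: A transformation that returns a new DataFrame with at most n rows; which rows is not guaranteed unless you sort first.
6.  **Dynamic Columns**: Driving `withColumn` from a dictionary and a loop keeps code clean when adding many columns, though each call still adds a projection to the plan.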

**Next Steps:**
In the next module, we will look at **File I/O**: Reading and Writing CSV, Parquet, and JSON files, and understanding the different **Save Modes**.