# PySpark: Zero to Hero
## Module 9: Working with Strings, Dates, and Nulls

In real-world data engineering, data is rarely clean. You will often encounter:
1.  Inconsistent String formatting.
2.  Dates stored as Strings.
3.  Missing (Null) values.

In this notebook, we will learn how to clean and standardize this data.

### Agenda:
1.  **Conditional Logic:** `when().otherwise()` (Case When).
2.  **String Operations:** `regexp_replace`.
3.  **Date Operations:** `to_date`, `current_date`, `date_format`.
4.  **Null Handling:** `na.drop`, `na.fill`, and `coalesce`.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, when, expr, regexp_replace, to_date, current_date, current_timestamp, date_format, coalesce

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Strings_Dates_Nulls") \
    .master("local[*]") \
    .getOrCreate()

# Define Employee Data (Notice the Hire Date is a String)
# Also introducing a record with Unknown Gender for Null testing
data = [
    ("001", "John Doe", "Male", "50000", "2015-01-01"),
    ("002", "Jane Smith", "Female", "45000", "2016-04-15"),
    ("003", "Bob Brown", "Male", "55000", "2014-05-01"),
    ("004", "Alice Lee", "Female", "48000", "2017-09-30"),
    ("005", "Jack Chan", "Male", "60000", "2013-04-01"),
    ("018", "N/A", None, "1000", "2022-01-01") # Data with Null Gender
]

columns = ["emp_id", "name", "gender", "salary", "hire_date"]

df = spark.createDataFrame(data, columns)

print("--- Original Data ---")
df.printSchema()
df.show()

In [None]:
# Scenario: We want to standardize Gender.
# Male -> M
# Female -> F
# Null/Other -> null

# We use the 'when().otherwise()' function chain.
df_gender = df.withColumn("short_gender", 
    when(col("gender") == "Male", "M")
    .when(col("gender") == "Female", "F")
    .otherwise(None) # Sets value to Null if no condition matches
)

print("--- Gender Standardized (Case When) ---")
df_gender.show()

In [None]:
# Scenario: Replace every uppercase 'J' with 'Z' in names (e.g., 'John' -> 'Zohn')
# We use regexp_replace(column, pattern, replacement); the pattern is a case-sensitive regex

df_regex = df_gender.withColumn("cleaned_name", 
    regexp_replace(col("name"), "J", "Z")
)

print("--- Regex Replace Applied ---")
df_regex.select("name", "cleaned_name").show()

In [None]:
# Scenario: 'hire_date' is currently a String. We cannot perform date calculations on it.
# 1. Convert String to DateType
# 2. Add Current Date
# 3. Add Current Timestamp

df_dates = df_regex \
    .withColumn("hire_date_dt", to_date(col("hire_date"), "yyyy-MM-dd")) \
    .withColumn("current_dt", current_date()) \
    .withColumn("current_ts", current_timestamp())

print("--- Date Transformations ---")
df_dates.printSchema() # Notice hire_date_dt is now 'date' type
df_dates.select("hire_date", "hire_date_dt", "current_dt").show()

In [None]:
# Scenario: We want to extract just the Year from the hire date.
# We use date_format() with the "yyyy" pattern. Note that the result is a string, not a number.

df_year = df_dates.withColumn("hire_year", date_format(col("hire_date_dt"), "yyyy"))

print("--- Extracted Year ---")
df_year.select("hire_date_dt", "hire_year").show()

In [None]:
# 1. Handling Nulls in a Column (Filling default value)
# Scenario: If 'short_gender' is Null, fill it with 'O' (Other).
# We use coalesce() which returns the first non-null value.

df_filled = df_year.withColumn("short_gender_fixed", 
    coalesce(col("short_gender"), lit("O"))
)

print("--- Nulls Filled with 'O' ---")
df_filled.select("name", "gender", "short_gender", "short_gender_fixed").show()


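# 2. Dropping Rows with Nulls
# Scenario: Drop any row that still has a missing value.
# Note: na.drop() with no arguments drops rows that contain a null in ANY column.
# Employee 018 is dropped because 'gender' is null, not because its name is 'N/A':
# 'N/A' is an ordinary string, so Spark does not treat it as null.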

df_dropped = df_filled.na.drop()

print("--- Rows with Nulls Dropped ---")
df_dropped.show()

## Summary

1.  **`when(condition, value).otherwise(value)`**: The standard way to implement IF/ELSE logic.
2.  **`regexp_replace`**: Powerful string manipulation using Regex patterns.
3.  **`to_date(col, format)`**: Essential for converting strings to dates for calculation.
4.  **`date_format(col, pattern)`**: Formats a date as a string, which lets you pull out specific parts (Year, Month, etc.).
5.  **`coalesce(col1, col2)`**: Returns the first non-null value (Great for filling defaults).
6.  **`na.drop()`**: Removes rows that contain a null in any column by default.

**Next Steps:**
In the next module, we will dive into **Aggregations**: `GroupBy`, `Count`, `Sum`, `Avg`, and sorting data using `OrderBy`.