# PySpark: Zero to Hero
## Module 7: Columns, Schemas, and Structured Transformations

In the previous module, we created a simple DataFrame. In this session, we dive deep into the structure of a DataFrame: **Rows, Columns, and Schemas**.

We will learn how to manipulate data using the most common transformations.

### Agenda:
1.  **Defining Schemas:** `StructType` vs. DDL Strings.
2.  **Column Expressions:** `col()`, `expr()`, and string references.
3.  **Basic Selection:** Using `.select()`.
4.  **Advanced Selection:** Using `.selectExpr()` (The SQL-like way).
5.  **Data Type Casting:** Converting String to Integer.
6.  **Filtering:** Using `.where()` with conditions.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, expr

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Structured_Transformations") \
    .master("local[*]") \
    .getOrCreate()

print("Spark Session Active")

In [None]:
# In production, we often define schemas explicitly using StructType.
# This fixes each column's type and nullability up front instead of relying on schema inference.

data = [
    ("001", "John Doe", "30", "50000"),
    ("002", "Jane Smith", "25", "45000"),
    ("003", "Bob Brown", "35", "55000"),
    ("004", "Alice Lee", "28", "48000")
]

# Explicit Schema Definition
schema = StructType([
    StructField("emp_id", StringType(), True),  # True = Nullable
    StructField("name", StringType(), True),
    StructField("age", StringType(), True),     # Intentionally keeping age as String to demo Casting later
    StructField("salary", StringType(), True)
])

# Create DataFrame
df = spark.createDataFrame(data, schema)

print("--- Original Schema ---")
df.printSchema()
df.show()

## 2. Referencing Columns

In PySpark, there are four main ways to refer to a column in a transformation:
1.  **String:** `"salary"` (Simplest, but limited features)
2.  **Column Object:** `col("salary")` (Most common, allows operations like `col("age") + 1`)
3.  **DataFrame Reference:** `df["salary"]` or `df.salary`
4.  **Expression:** `expr("salary + 100")` (SQL style)

In [None]:
# Using .select() to pick specific columns
# We will use different reference styles here

df_selected = df.select(
    col("emp_id"),          # Style 2: Column Object
    "name",                 # Style 1: String
    df.salary               # Style 3: DF Reference
)

print("--- Selected Columns ---")
df_selected.show()

In [None]:
# expr() allows you to write SQL fragments directly inside Python code.
# This is useful for renaming or simple arithmetic without importing specific functions.

# Example: Rename 'name' to 'full_name' using alias
df_expr = df.select(
    col("emp_id"),
    expr("name as full_name"),                 # SQL style renaming
    expr("salary + 500 as salary_with_bonus")  # SQL style arithmetic; the String salary is implicitly cast to a number
)

print("--- Using Expressions ---")
df_expr.show()

In [None]:
# .selectExpr() combines .select() and expr().
# It accepts strings that are treated as SQL expressions.
# This is often cleaner than wrapping everything in expr().

# Scenario: 
# 1. Select emp_id
# 2. Rename name -> full_name
# 3. CAST age from String to Integer

df_casted = df.selectExpr(
    "emp_id",
    "name as full_name",
    "cast(age as int) as age_int",  # Casting in SQL style
    "cast(salary as double) as salary_double"
)

print("--- SelectExpr with Casting ---")
df_casted.printSchema()
df_casted.show()

In [None]:
# Filtering using .where() (or .filter(), they are aliases)
# We want employees older than 25

# Note: We use the 'df_casted' dataframe because 'age_int' is now an Integer.
df_filtered = df_casted.where("age_int > 25")

print("--- Filtered Data (Age > 25) ---")
df_filtered.show()

In [None]:
# Bonus Tip: Instead of writing the long StructType code (see the schema cell above),
# Spark allows DDL (Data Definition Language) strings for schemas.

# Simple String Schema
ddl_schema = "emp_id STRING, name STRING, age INT, salary DOUBLE"

data_simple = [("005", "Jack Chan", 40, 60000.0)]

df_ddl = spark.createDataFrame(data_simple, schema=ddl_schema)

print("--- DataFrame created with DDL Schema ---")
df_ddl.printSchema()
df_ddl.show()

## Summary

1.  **Schema:** Can be defined using `StructType` (Programmatic) or DDL String (Simple).
2.  **`select()`:** Used to pick columns. Combine with `col()` for programmatic access.
3.  **`expr()`:** Lets you write SQL snippets inside Python code.
4.  **`selectExpr()`:** A shortcut for running SQL expressions on columns. Excellent for **Casting** data types.
5.  **Casting:** Changing data types (e.g., String -> Int) ensures that numerical filters and arithmetic operate on numbers rather than text.

**Next Steps:**
We will explore more complex transformations, aggregations, and working with different file formats.