# 02 - Catalyst Optimizer & Query Planning

Focus: understanding **what Spark plans**, **how Catalyst transforms plans**, and **how to read plans for performance**.

Quick Catalyst vs Non-Catalyst (Python UDF) performance benchmark: https://rknutalapati.medium.com/understanding-catalyst-optimizer-in-pyspark-catalyst-vs-non-catalyst-functions-48dd1c75d037

## Datasets (local only)

Same **two Hugging Face datasets**, already saved locally.

- Transaction Categorization (local parquet): `TRANSACTION_PATH`
- FreshRetailNet-50K (Retail Sales): `RETAIL_PATH`

(We will mostly use the transactions dataset in this notebook; retail is used later in joins / skew notebooks.)

Spark UI: http://localhost:404

## Helper: print each plan stage separately

Spark's `df.explain("extended")` prints everything at once.
For learning Catalyst, it's useful to inspect each step separately:

- **Parsed** (unresolved names, just what you typed)
- **Analyzed** (resolved columns / types)
- **Optimized** (rule-based rewrites)
- **Physical** (operators Spark will run)

In [5]:
# Below we access Catalyst plans through the underlying JVM DataFrame object (_jdf). 
# Via queryExecution(), Spark exposes the parsed, analyzed, optimized, and physical plans separately.

def get_query_execution(df):
    return df._jdf.queryExecution()

def show_plan(df, plan_type: str):
    qe = get_query_execution(df)
    plan_type = plan_type.lower().strip()

    if plan_type == "parsed":
        print(qe.logical().toString())
    elif plan_type == "analyzed":
        print(qe.analyzed().toString())
    elif plan_type == "optimized":
        print(qe.optimizedPlan().toString())
    elif plan_type == "physical":
        print(qe.executedPlan().toString())
    else:
        raise ValueError("plan_type must be: parsed | analyzed | optimized | physical")

## Load dataset (transactions)

Set your local parquet path and load it.


In [6]:
# Load parquet
txn = spark.read.parquet(r"C:\code\spark-tuning-handbook\data\transaction_cat.parquet")
txn.printSchema()
print("rows:", txn.count())
print("partitions:", txn.rdd.getNumPartitions())

root
 |-- transaction_description: string (nullable = true)
 |-- category: string (nullable = true)
 |-- country: string (nullable = true)
 |-- currency: string (nullable = true)

rows: 4501043
partitions: 4


---

## What is Catalyst

Catalyst is Spark SQL's optimizer and planner.

You write DataFrame / SQL code -> Spark builds a **logical plan** -> Catalyst rewrites it -> Spark produces a **physical plan** (operators like scans, exchanges, hashes, sorts).

For performance you care about:
- where **filters** happen (pushed into scan or not)
- how many **columns** are read (pruning)
- where **shuffles** appear (Exchange)
- whether Spark uses **codegen** (WholeStageCodegen)

The planning pipeline

1) **Parsed** plan: what you typed (names are not validated yet)
2) **Analyzed** plan: columns and types are resolved
3) **Optimized** plan: Catalyst rewrites (push filters/projects, fold constants, simplify expressions)
4) **Physical** plan: concrete execution operators

Next: build one query and inspect each stage **in separate cells**.


## One query, four plans

Query:
- filter by country
- join currency_lookup
- compute a derived column
- select a few columns

In [7]:
from pyspark.sql import functions as F

# currency lookup table
currency_lookup = spark.createDataFrame([
    ("USD", "US Dollar"),
    ("GBP", "British Pound"),
    ("EUR", "Euro")
], ["currency_code", "currency_name"])

query = (
    txn
    .filter(F.col("country") == "USA")
    .join(currency_lookup, txn["currency"] == currency_lookup["currency_code"])
    .withColumn("desc_len", F.length("transaction_description"))
    .select("country", "currency_name", "desc_len")
)

# No action yet: this only describes the transformations (lazy evaluation)
query

DataFrame[country: string, currency_name: string, desc_len: int]

In [19]:
# In production, you would typically use query.explain(extended=True).
# We use show_plan() here to isolate each stage for educational clarity.
# Standard explain() outputs a dense block that can be harder to parse visually.

# query.explain() # short physical plan
# query.explain("formatted")  # detailed physical plan
query.explain(extended=True) # complete plan

== Parsed Logical Plan ==
Join Inner, (currency#15 = currency_code#24)
:- Relation [transaction_description#12,category#13,country#14,currency#15] parquet
+- ResolvedHint (strategy=broadcast)
   +- LogicalRDD [currency_code#24, currency_name#25], false

== Analyzed Logical Plan ==
transaction_description: string, category: string, country: string, currency: string, currency_code: string, currency_name: string
Join Inner, (currency#15 = currency_code#24)
:- Relation [transaction_description#12,category#13,country#14,currency#15] parquet
+- ResolvedHint (strategy=broadcast)
   +- LogicalRDD [currency_code#24, currency_name#25], false

== Optimized Logical Plan ==
Join Inner, (currency#15 = currency_code#24), rightHint=(strategy=broadcast)
:- Filter isnotnull(currency#15)
:  +- Relation [transaction_description#12,category#13,country#14,currency#15] parquet
+- Filter isnotnull(currency_code#24)
   +- LogicalRDD [currency_code#24, currency_name#25], false

== Physical Plan ==
AdaptiveSpark

### Parsed logical plan (unresolved)
**Logical Tree**: Spark organizes the query as a hierarchy. Read it from the bottom up: it starts at the `Relation` (source), applies a `Filter`, and finally performs a `Project` (selection).

**Tree Connectors**: `+-` and `:-` show the parent-child relationship between operators. `+-` marks the last child of a node, while `:-` marks an earlier child (used when a node has multiple inputs, like a `Join`).

Notice:
- **Unresolved parts**: New columns or selections often appear with a tick (e.g., 'country) and no ID yet.
- **Inherited IDs**: Because we started from an existing DataFrame (txn), existing columns already have their internal IDs (like #123).
- **Logic check**: Spark has parsed Python code into a logical tree but hasn't verified if the new operations (like length()) are valid for those types yet.
- **Batch vs. Streaming**: The `false` flag in the `LogicalRDD` node is its `isStreaming` field. It confirms the data is a static batch dataset rather than a continuous stream.
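
To make the connector convention concrete, here is a minimal sketch in plain Python (no Spark needed) that turns a plan string into `(depth, operator)` pairs. The helper name `parse_plan_tree` is hypothetical, and the sample text is an abbreviated copy of the parsed plan for this query. It assumes every nesting level is exactly one 3-character unit, which matches the output shown in this notebook.

```python
import re

# Each nesting level in Spark's tree string is one 3-character unit:
# "+- " (last child), ":- " (earlier child), ":  " or "   " (continuation).
_INDENT_UNITS = re.compile(r"^((?:\+- |:- |:  |   )*)(.*)$")


def parse_plan_tree(plan_text):
    """Return a list of (depth, operator_name) tuples, one per plan line."""
    nodes = []
    for line in plan_text.splitlines():
        if not line.strip():
            continue
        prefix, rest = _INDENT_UNITS.match(line).groups()
        # Unresolved operators carry a leading tick, e.g. 'Project.
        operator = rest.split()[0].lstrip("'")
        nodes.append((len(prefix) // 3, operator))
    return nodes


# Abbreviated copy of the parsed plan (column lists shortened).
sample_plan = """\
'Project ['country, 'currency_name, 'desc_len]
+- Project [transaction_description#12, length(transaction_description#12) AS desc_len#27]
   +- Join Inner, (currency#15 = currency_code#24)
      :- Filter (country#14 = USA)
      :  +- Relation [transaction_description#12,category#13,country#14,currency#15] parquet
      +- LogicalRDD [currency_code#24, currency_name#25], false
"""

for depth, operator in parse_plan_tree(sample_plan):
    print("  " * depth + operator)
```

`Filter` and `LogicalRDD` come out at the same depth, which is how you can tell they are the two children of the `Join`.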

In [9]:
show_plan(query, 'parsed')

'Project ['country, 'currency_name, 'desc_len]
+- Project [transaction_description#12, category#13, country#14, currency#15, currency_code#24, currency_name#25, length(transaction_description#12) AS desc_len#27]
   +- Join Inner, (currency#15 = currency_code#24)
      :- Filter (country#14 = USA)
      :  +- Relation [transaction_description#12,category#13,country#14,currency#15] parquet
      +- LogicalRDD [currency_code#24, currency_name#25], false



### Logical Tree Structure
```text
Project (Root / Output)
  └── Project (Calculation Node)
        └── Join (Binary Node)
              /            \
     Left Child (:-)    Right Child (+-)
            |               |
       Filter Node      Leaf Node (LogicalRDD)
            |
    Leaf Node (Relation)
```

### Analyzed logical plan (resolved)

In our local environment, Spark uses an in-memory catalog to resolve the schema. It retrieves the schema either from file metadata (Parquet, Avro, ORC) or by scanning the data, known as schema inference (CSV, JSON), then caches this structure in the current SparkSession's memory.

Notice:
- **Validation & Errors**: This stage triggers an AnalysisException if Spark fails to map column or table names to the actual schema, ensuring all references are validated before optimization.
- **Attribute Resolution**: Each column is assigned a unique Expression ID, allowing Spark to track attributes even if column names overlap across tables.
- **Type Validation**: Spark verifies the schema and confirms data types. It ensures functions like length() are only applied to compatible types.

In [10]:
show_plan(query, 'analyzed')

Project [country#14, currency_name#25, desc_len#27]
+- Project [transaction_description#12, category#13, country#14, currency#15, currency_code#24, currency_name#25, length(transaction_description#12) AS desc_len#27]
   +- Join Inner, (currency#15 = currency_code#24)
      :- Filter (country#14 = USA)
      :  +- Relation [transaction_description#12,category#13,country#14,currency#15] parquet
      +- LogicalRDD [currency_code#24, currency_name#25], false



### Optimized logical plan (Catalyst rewrites)

This stage is the core of **Rule-Based Optimization (RBO)**. Spark uses the Catalyst optimizer to apply a series of deterministic rules (several dozen; the exact set varies by Spark version) that transform the logical tree into a more efficient version **without relying on data statistics**.

**How it works**: Catalyst groups optimization rules into Batches. Each batch is executed repeatedly until it reaches a Fixed Point (the plan stops changing), ensuring that even simple rules can have a large global impact.
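
The fixed-point idea can be sketched with a toy rule executor in plain Python. This is not Spark's API: the tuple-based expression format, the two rules, and `run_to_fixed_point` are all made up for illustration. It shows why repeating a batch matters: one rule's output can enable another rule on the next pass.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def transform_up(expr, rule):
    """Apply `rule` to the children first, then to the node itself."""
    if expr[0] in OPS:
        op, left, right = expr
        expr = (op, transform_up(left, rule), transform_up(right, rule))
    return rule(expr)


def fold_constants(expr):
    """Replace `literal op literal` with the computed literal."""
    if expr[0] in OPS and expr[1][0] == "lit" and expr[2][0] == "lit":
        return ("lit", OPS[expr[0]](expr[1][1], expr[2][1]))
    return expr


def remove_add_zero(expr):
    """Simplify `x + 0` to `x`."""
    if expr[0] == "add" and expr[2] == ("lit", 0):
        return expr[1]
    return expr


def run_to_fixed_point(expr, rules, max_iterations=100):
    """Re-run the whole batch until a full pass changes nothing."""
    for iteration in range(1, max_iterations + 1):
        new_expr = expr
        for rule in rules:
            new_expr = transform_up(new_expr, rule)
        if new_expr == expr:
            return expr, iteration
        expr = new_expr
    return expr, max_iterations


# amount + (2 - 2)
expr = ("add", ("col", "amount"), ("sub", ("lit", 2), ("lit", 2)))
result, passes = run_to_fixed_point(expr, [remove_add_zero, fold_constants])
print(result, passes)
```

On the first pass `remove_add_zero` finds nothing to do, because the right side is still `2 - 2`. Only after `fold_constants` rewrites it to `0` can the second pass drop the addition, and a third pass confirms nothing else changes.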

Notice:
- **Predicate Pushdown**: Moves filters (like isnotnull or EqualTo) as close to the data source as possible. This minimizes the number of rows that participate in expensive operations like Joins.
- **Column Pruning**: Detects which columns are actually needed for the final result and removes all others early in the plan to reduce memory and CPU overhead.
- **Constant Folding**: Pre-calculates expressions involving constants (e.g., 100 * 0.5 becomes 50.0) during the optimization phase so they aren't calculated for every row during execution.
- **Inferring Null Filters**: As seen in the plan, Spark automatically injects `Filter isnotnull(...)` on join keys. A null key can never satisfy an inner equi-join condition, so dropping those rows early is safe and reduces the data fed into the join.
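
Column pruning for this notebook's query can be worked through by hand. The sketch below is a plain-Python illustration, not Catalyst code, and `columns_to_read` is a hypothetical helper: it collects every column that some operator references and keeps only those from each source.

```python
def columns_to_read(source_columns, referenced_columns):
    """Keep only the source columns that some later operator references."""
    return [c for c in source_columns if c in referenced_columns]


txn_columns = ["transaction_description", "category", "country", "currency"]
lookup_columns = ["currency_code", "currency_name"]

# Every column the query touches, grouped by the operator that needs it.
referenced = set()
referenced |= {"country"}                    # filter: country == "USA"
referenced |= {"currency", "currency_code"}  # join condition
referenced |= {"transaction_description"}    # length(...) AS desc_len
referenced |= {"country", "currency_name"}   # final select

txn_needed = columns_to_read(txn_columns, referenced)
lookup_needed = columns_to_read(lookup_columns, referenced)
print(txn_needed)
print(lookup_needed)
```

`category` is never referenced, so it is the one column dropped from the transactions side, which matches the `Project [transaction_description, country, currency]` node in the optimized plan.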

In [11]:
show_plan(query, 'optimized')

Project [country#14, currency_name#25, length(transaction_description#12) AS desc_len#27]
+- Join Inner, (currency#15 = currency_code#24)
   :- Project [transaction_description#12, country#14, currency#15]
   :  +- Filter ((isnotnull(country#14) AND (country#14 = USA)) AND isnotnull(currency#15))
   :     +- Relation [transaction_description#12,category#13,country#14,currency#15] parquet
   +- Filter isnotnull(currency_code#24)
      +- LogicalRDD [currency_code#24, currency_name#25], false



### Physical plan (execution operators)

This stage transitions from "what to do" (logic) to **"how to do it"** (execution). While RBO worked with logical rules, the physical planning phase selects a concrete algorithm for each operator, mostly from size estimates and configuration thresholds.

**Key Concepts:**
- **Algorithm Selection**: Spark evaluates different physical operators for the same logic. For example, it chooses BroadcastHashJoin (fastest, memory-intensive) or SortMergeJoin (scalable, shuffle-intensive) based on estimated data size, compared against `spark.sql.autoBroadcastJoinThreshold`.
- **Cost-Based Optimizer (CBO)**: When enabled (`spark.sql.cbo.enabled`, off by default), Catalyst uses table and column statistics to estimate row counts and sizes. Those estimates feed decisions such as join reordering and join strategy selection.
- **Statistics Matter**: CBO relies on data stats (row count, table size, histograms). Without running ANALYZE TABLE, Spark makes decisions based on rough estimates, which can lead to suboptimal plans. However, having no statistics is often better than having stale ones: outdated stats can mislead the optimizer into picking a disastrous plan.

Notice:
- **SortMergeJoin**: In this plan, Spark selected SortMergeJoin because neither side was estimated to be small enough to broadcast. The lookup DataFrame was built from a local list, so Spark has no reliable size estimate for it and treats it as potentially large.
- **Exchange (Shuffle)**: You can see `Exchange hashpartitioning`, which indicates a shuffle: Spark redistributes data across the cluster so that rows with the same join key land in the same partition.
- **WholeStageCodegen**: Asterisks (*) next to operators (e.g., *Project) show that Spark has collapsed these steps into a single highly optimized Java function. They are not shown in the output below, because the plan is still `AdaptiveSparkPlan isFinalPlan=false`; they appear once an action has run and the plan is final (`isFinalPlan=true`).
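
The size-based choice between the two join algorithms can be sketched as a toy decision function. This is a heavy simplification of Spark's real join selection, written in plain Python. The function name and the 500 MB figure are made up; the 10 MB value is the documented default of `spark.sql.autoBroadcastJoinThreshold`, and `UNKNOWN_SIZE` stands in for a relation with no usable size estimate.

```python
DEFAULT_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # Spark's default: 10 MB
UNKNOWN_SIZE = 2**63 - 1  # stand-in for "no usable size estimate"


def pick_join_strategy(left_bytes, right_bytes,
                       threshold=DEFAULT_BROADCAST_THRESHOLD,
                       broadcast_hint=False):
    """Toy version of the broadcast-vs-sort-merge decision."""
    if broadcast_hint:
        # An explicit hint wins regardless of the size estimate.
        return "BroadcastHashJoin"
    # A negative threshold disables automatic broadcasting.
    if threshold >= 0 and min(left_bytes, right_bytes) <= threshold:
        return "BroadcastHashJoin"
    return "SortMergeJoin"


big_table = 500 * 1024**2  # hypothetical size of the transactions data

no_stats = pick_join_strategy(big_table, UNKNOWN_SIZE)
with_hint = pick_join_strategy(big_table, UNKNOWN_SIZE, broadcast_hint=True)
known_small = pick_join_strategy(big_table, 2 * 1024)
print(no_stats, with_hint, known_small)
```

The first call mirrors the plan in this section: with no size estimate for the lookup side, the sketch falls back to SortMergeJoin even though the table has only three rows.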

In [12]:
show_plan(query, 'physical')

AdaptiveSparkPlan isFinalPlan=false
+- Project [country#14, currency_name#25, length(transaction_description#12) AS desc_len#27]
   +- SortMergeJoin [currency#15], [currency_code#24], Inner
      :- Sort [currency#15 ASC NULLS FIRST], false, 0
      :  +- Exchange hashpartitioning(currency#15, 200), ENSURE_REQUIREMENTS, [plan_id=113]
      :     +- Filter ((isnotnull(country#14) AND (country#14 = USA)) AND isnotnull(currency#15))
      :        +- FileScan parquet [transaction_description#12,country#14,currency#15] Batched: true, DataFilters: [isnotnull(country#14), (country#14 = USA), isnotnull(currency#15)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/C:/code/spark-tuning-handbook/data/transaction_cat.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(country), EqualTo(country,USA), IsNotNull(currency)], ReadSchema: struct<transaction_description:string,country:string,currency:string>
      +- Sort [currency_code#24 ASC NULLS FIRST], false, 0
         +- Exchan

In [15]:
# Generating statistics locally without a persistent metastore is a headache.
# Another way to "help" the Physical Plan is by using Hints

from pyspark.sql.functions import broadcast

query = txn.join(broadcast(currency_lookup), txn.currency == currency_lookup.currency_code)
show_plan(query, 'physical')

AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [currency#15], [currency_code#24], Inner, BuildRight, false
   :- Filter isnotnull(currency#15)
   :  +- FileScan parquet [transaction_description#12,category#13,country#14,currency#15] Batched: true, DataFilters: [isnotnull(currency#15)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/C:/code/spark-tuning-handbook/data/transaction_cat.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(currency)], ReadSchema: struct<transaction_description:string,category:string,country:string,currency:string>
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false]),false), [plan_id=235]
      +- Filter isnotnull(currency_code#24)
         +- Scan ExistingRDD[currency_code#24,currency_name#25]



---

## Topic 8 - How to read plans fast

Order:
1) Scan (FileScan / pushed filters / read schema)
2) Exchange (shuffles)
3) Aggregations (partial + final)
4) Sort
5) WholeStageCodegen

Spot the expensive ones: **Exchange, Sort, huge scans**.
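
The checklist above can be partly automated with a small plain-Python helper that scans a physical plan string. This is a sketch under assumptions: `summarize_physical_plan` is a hypothetical name, the regular expressions only cover the plan format shown in this notebook, and the sample is an abbreviated excerpt of the SortMergeJoin plan printed earlier.

```python
import re
from collections import Counter

# Skip tree connectors and an optional codegen marker like "*(1) ",
# then capture the operator name.
_OPERATOR = re.compile(r"^[\s:+\-]*(?:\*\(\d+\)\s*)?(\w+)")


def summarize_physical_plan(plan_text):
    """Count the operators worth checking first and pull out scan details."""
    operators = Counter()
    for line in plan_text.splitlines():
        match = _OPERATOR.match(line)
        if match:
            operators[match.group(1)] += 1
    return {
        "scans": operators["FileScan"],
        "shuffles": operators["Exchange"],
        "broadcasts": operators["BroadcastExchange"],
        "aggregates": operators["HashAggregate"] + operators["SortAggregate"],
        "sorts": operators["Sort"],
        "pushed_filters": re.findall(r"PushedFilters: \[(.*?)\]", plan_text),
        "read_schemas": re.findall(r"ReadSchema: (struct<.*?>)", plan_text),
    }


# Abbreviated excerpt of the physical plan shown earlier in this notebook.
sample_plan = """\
AdaptiveSparkPlan isFinalPlan=false
+- Project [country#14, currency_name#25, length(transaction_description#12) AS desc_len#27]
   +- SortMergeJoin [currency#15], [currency_code#24], Inner
      :- Sort [currency#15 ASC NULLS FIRST], false, 0
      :  +- Exchange hashpartitioning(currency#15, 200), ENSURE_REQUIREMENTS, [plan_id=113]
      :     +- Filter ((isnotnull(country#14) AND (country#14 = USA)) AND isnotnull(currency#15))
      :        +- FileScan parquet [transaction_description#12,country#14,currency#15] PushedFilters: [IsNotNull(country), EqualTo(country,USA), IsNotNull(currency)], ReadSchema: struct<transaction_description:string,country:string,currency:string>
      +- Sort [currency_code#24 ASC NULLS FIRST], false, 0
"""

summary = summarize_physical_plan(sample_plan)
for key, value in summary.items():
    print(f"{key}: {value}")
```

In the notebook, the same function could be fed the string behind `show_plan(query, 'physical')`. A non-zero `shuffles` or `sorts` count is the cue to look closer at that part of the plan.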