# PySpark Tutorial - Explain Plan

The creation of a DataFrame involves a sequence of instructions, which can include reading data and applying transformations such as calculated columns, joins and summarisation.

When an action is performed on the DataFrame, Spark analyses the code from each component and uses its optimiser to determine a Plan for executing it.

The steps are as follows:

| Step | Item                   | Description |
| ---: | :--------------------- | :---------- |
|    1 | Parsed Logical Plan    | Initial Plan from Parsing the DataFrame code. |
|    2 | Analyzed Logical Plan  | The Parsed Logical Plan after column names and data types have been resolved (otherwise similar to the Parsed Logical Plan). |
|    3 | Optimized Logical Plan | Plan created after the optimiser has processed the previous plans. This can significantly reduce the number of instructions in the Plan. |
|    4 | Physical Plan          | The final Physical Plan that will be executed. |

The following examples will show the use of the DataFrame `explain()` method to analyse the plans.

## Read Sample Data
The following code creates a DataFrame for both the policies and claims by reading CSV files and applying the data type transformations covered in previous sections.

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("PySparkTutorial").getOrCreate()

policyDF = spark.read.option("header", True).csv("./data/policy.csv") \
                .withColumn("sum_insured", F.col("sum_insured").cast(IntegerType())) \
                .withColumn("vehicle_age", F.col("vehicle_age").cast(IntegerType())) \
                .withColumn("premium", F.col("premium").cast(IntegerType()))
claimsDF = spark.read.option("header", True).csv("./data/claims.csv") \
                .withColumn("cost", F.col("cost").cast(IntegerType()))

def fix_dates(df):
    """Find all columns named *_date and convert from string to Spark Date type."""
    for column in df.columns:
        if column.endswith("_date") and dict(df.dtypes)[column] == 'string':
            print("NOTE: Fixing date column '{}'.".format(column))
            df = df.withColumn(column, F.to_date(df[column], "yyyyMMdd"))
    return df


policyDF = fix_dates(policyDF)
claimsDF = fix_dates(claimsDF)

NOTE: Fixing date column 'inception_date'.
NOTE: Fixing date column 'start_date'.
NOTE: Fixing date column 'end_date'.
NOTE: Fixing date column 'incident_date'.


## Explain the Spark Plan
There are a number of options available to explain the different Plans used with Spark.

The Physical Plan for a Spark DataFrame can be shown with the `explain()` method. This shows the steps that will be performed when an action is taken on the given DataFrame. The default output can be hard to review; the following steps improve on this initial example.

In [2]:
policyDF.explain()

== Physical Plan ==
*(1) Project [policy#16, make#17, cast(vehicle_age#18 as int) AS vehicle_age#41, cast(sum_insured#19 as int) AS sum_insured#32, cast(cast(unix_timestamp(inception_date#20, yyyyMMdd, Some(Australia/Sydney)) as timestamp) as date) AS inception_date#85, cast(cast(unix_timestamp(start_date#21, yyyyMMdd, Some(Australia/Sydney)) as timestamp) as date) AS start_date#94, cast(cast(unix_timestamp(end_date#22, yyyyMMdd, Some(Australia/Sydney)) as timestamp) as date) AS end_date#103, cast(premium#23 as int) AS premium#50]
+- FileScan csv [policy#16,make#17,vehicle_age#18,sum_insured#19,inception_date#20,start_date#21,end_date#22,premium#23] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/home/michael/pyspark-tutorial/data/policy.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<policy:string,make:string,vehicle_age:string,sum_insured:string,inception_date:string,star...




Adding the option `mode="formatted"` improves readability by showing a summary before the details. Note that the summary is displayed as a hierarchy in which the first item (Project) is the last thing executed and the last item (Scan) is the first (i.e. the 'Project' is dependent on the 'Scan').

In [3]:
policyDF.explain(mode="formatted")

== Physical Plan ==
* Project (2)
+- Scan csv  (1)


(1) Scan csv 
Output [8]: [policy#16, make#17, vehicle_age#18, sum_insured#19, inception_date#20, start_date#21, end_date#22, premium#23]
Batched: false
Location: InMemoryFileIndex [file:/home/michael/pyspark-tutorial/data/policy.csv]
ReadSchema: struct<policy:string,make:string,vehicle_age:string,sum_insured:string,inception_date:string,start_date:string,end_date:string,premium:string>

(2) Project [codegen id : 1]
Output [8]: [policy#16, make#17, cast(vehicle_age#18 as int) AS vehicle_age#41, cast(sum_insured#19 as int) AS sum_insured#32, cast(cast(unix_timestamp(inception_date#20, yyyyMMdd, Some(Australia/Sydney)) as timestamp) as date) AS inception_date#85, cast(cast(unix_timestamp(start_date#21, yyyyMMdd, Some(Australia/Sydney)) as timestamp) as date) AS start_date#94, cast(cast(unix_timestamp(end_date#22, yyyyMMdd, Some(Australia/Sydney)) as timestamp) as date) AS end_date#103, cast(premium#23 as int) AS premium#50]
Input [8]: 
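
The bottom-up reading order can be illustrated with a few lines of plain Python. This is a minimal sketch, not part of Spark: the helper name `operators_in_execution_order` is hypothetical, and it assumes that the numbers in parentheses follow the dependency order, as they do in the summary above (Scan is 1, Project is 2).

```python
import re

# The summary tree copied from the formatted output above.
formatted_tree = """\
* Project (2)
+- Scan csv  (1)
"""

def operators_in_execution_order(tree_text):
    """Return operator names sorted by the number shown in parentheses."""
    # Skip the tree-drawing prefix, then capture the name and the operator number.
    pattern = re.compile(r"^[\s*+\-:|]*(?P<name>.+?)\s*\((?P<id>\d+)\)\s*$")
    operators = []
    for line in tree_text.splitlines():
        match = pattern.match(line)
        if match:
            operators.append((int(match.group("id")), match.group("name").strip()))
    return [name for _, name in sorted(operators)]

order = operators_in_execution_order(formatted_tree)
print(order)  # ['Scan csv', 'Project']
```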

The details of all Plans can be produced using the option `extended=True`. In this case the original **Parsed Logical Plan** shows each cast to an Integer or Date type in its own Project layer, one per `withColumn()` call. The **Optimized Logical Plan** reduces these to a single Project layer.

In [4]:
policyDF.explain(extended=True)

== Parsed Logical Plan ==
'Project [policy#16, make#17, vehicle_age#41, sum_insured#32, inception_date#85, start_date#94, to_date(end_date#22, Some(yyyyMMdd)) AS end_date#103, premium#50]
+- Project [policy#16, make#17, vehicle_age#41, sum_insured#32, inception_date#85, to_date(start_date#21, Some(yyyyMMdd)) AS start_date#94, end_date#22, premium#50]
   +- Project [policy#16, make#17, vehicle_age#41, sum_insured#32, to_date(inception_date#20, Some(yyyyMMdd)) AS inception_date#85, start_date#21, end_date#22, premium#50]
      +- Project [policy#16, make#17, vehicle_age#41, sum_insured#32, inception_date#20, start_date#21, end_date#22, cast(premium#23 as int) AS premium#50]
         +- Project [policy#16, make#17, cast(vehicle_age#18 as int) AS vehicle_age#41, sum_insured#32, inception_date#20, start_date#21, end_date#22, premium#23]
            +- Project [policy#16, make#17, vehicle_age#18, cast(sum_insured#19 as int) AS sum_insured#32, inception_date#20, start_date#21, end_date#22, pr
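
The way a stack of Project layers can be folded into one can be sketched with a toy model in plain Python. This is only an illustration of the idea and not Spark's implementation (Spark's optimiser has a rule named `CollapseProject` for this). The function `collapse_projects`, the `"{}"` template convention and the three-column example are all assumptions made for the sketch.

```python
def collapse_projects(columns, layers):
    """Fold a stack of single-purpose projections into one projection.

    `layers` is ordered from the first applied (bottom of the plan) to the
    last applied (top). Each layer maps a column name to a template whose
    "{}" placeholder stands for that column's expression in the layer below.
    """
    exprs = {name: name for name in columns}  # start with a plain pass-through
    for layer in layers:
        for name, template in layer.items():
            # Substitute the expression built so far into the new layer.
            exprs[name] = template.format(exprs[name])
    return [f"{expr} AS {name}" if expr != name else name
            for name, expr in exprs.items()]


# Hypothetical cut-down version of the policy columns.
columns = ["policy", "sum_insured", "start_date"]
layers = [
    {"sum_insured": "cast({} as int)"},       # first withColumn()
    {"start_date": "to_date({}, yyyyMMdd)"},  # second withColumn()
]

single_project = collapse_projects(columns, layers)
print(single_project)
```

Two layers go in and one projection list comes out, which mirrors the reduction from many Project layers to a single one described above.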

## Improving the Code
The following example implements the transformations as a single `select()` rather than a chain of `withColumn()` calls. This simplifies the original **Parsed Logical Plan**; note, however, that the final **Physical Plan** (which is what is executed) is identical in both cases.

In [5]:
otherDF = spark.read.option("header", True).csv("./data/policy.csv") \
               .select("policy",
                       "make",
                       F.col("vehicle_age").cast(IntegerType()).alias("vehicle_age"),
                       F.col("sum_insured").cast(IntegerType()).alias("sum_insured"),
                       F.to_date(F.col("inception_date"), "yyyyMMdd").alias("inception_date"),
                       F.to_date(F.col("start_date"), "yyyyMMdd").alias("start_date"),
                       F.to_date(F.col("end_date"), "yyyyMMdd").alias("end_date"),
                       F.col("premium").cast(IntegerType()).alias("premium")
                      )

This can be checked against the original `policyDF` DataFrame. A `Row` compares by value only, so the column names are checked separately.

In [6]:
# Row objects compare by value only, so check the column names as well.
assert policyDF.columns == otherDF.columns
assert policyDF.collect() == otherDF.collect()

The output of `explain()` below shows that the **Optimized Logical Plan** and **Physical Plan** match those for the `policyDF` DataFrame (apart from the generated column IDs), whereas the initial **Parsed Logical Plan** is much simpler.

In [7]:
otherDF.explain(extended=True)

== Parsed Logical Plan ==
'Project [unresolvedalias('policy, None), unresolvedalias('make, None), cast('vehicle_age as int) AS vehicle_age#148, cast('sum_insured as int) AS sum_insured#149, to_date('inception_date, Some(yyyyMMdd)) AS inception_date#150, to_date('start_date, Some(yyyyMMdd)) AS start_date#151, to_date('end_date, Some(yyyyMMdd)) AS end_date#152, cast('premium as int) AS premium#153]
+- Relation[policy#132,make#133,vehicle_age#134,sum_insured#135,inception_date#136,start_date#137,end_date#138,premium#139] csv

== Analyzed Logical Plan ==
policy: string, make: string, vehicle_age: int, sum_insured: int, inception_date: date, start_date: date, end_date: date, premium: int
Project [policy#132, make#133, cast(vehicle_age#134 as int) AS vehicle_age#148, cast(sum_insured#135 as int) AS sum_insured#149, to_date('inception_date, Some(yyyyMMdd)) AS inception_date#150, to_date('start_date, Some(yyyyMMdd)) AS start_date#151, to_date('end_date, Some(yyyyMMdd)) 

Simplifying the code so that the **Parsed Logical Plan** is smaller can assist the Optimiser, and it also makes the steps that Spark will take easier to understand.

Switching to a single `select()` loses the generic `fix_dates` function from the previous examples. It can be replaced with another generic function that builds the `select()` list instead of chaining `withColumn()` calls:

In [8]:
def fix_types(df):
    """Cast the known integer columns and convert *_date string columns to Spark Date type."""
    sel = []

    intcols = ("premium", "sum_insured", "vehicle_age")
    dtypes = dict(df.dtypes)  # column name -> type name, built once rather than per column

    for column in df.columns:
        if column in intcols:
            print("NOTE: Fixing integer column '{}'.".format(column))
            sel.append(F.col(column).cast(IntegerType()).alias(column))
        elif column.endswith("_date") and dtypes[column] == 'string':
            print("NOTE: Fixing date column '{}'.".format(column))
            sel.append(F.to_date(F.col(column), "yyyyMMdd").alias(column))
        else:
            # Unchanged columns are passed through by name.
            sel.append(column)

    return df.select(sel)

otherDF = spark.read.option("header", True).csv("./data/policy.csv")
otherDF = fix_types(otherDF)

NOTE: Fixing integer column 'vehicle_age'.
NOTE: Fixing integer column 'sum_insured'.
NOTE: Fixing date column 'inception_date'.
NOTE: Fixing date column 'start_date'.
NOTE: Fixing date column 'end_date'.
NOTE: Fixing integer column 'premium'.


This method of generating a `select()` results in a single 'Project' layer in the **Parsed Logical Plan**.

In [9]:
otherDF.explain(extended=True)

== Parsed Logical Plan ==
'Project [unresolvedalias('policy, None), unresolvedalias('make, None), cast('vehicle_age as int) AS vehicle_age#210, cast('sum_insured as int) AS sum_insured#211, to_date('inception_date, Some(yyyyMMdd)) AS inception_date#212, to_date('start_date, Some(yyyyMMdd)) AS start_date#213, to_date('end_date, Some(yyyyMMdd)) AS end_date#214, cast('premium as int) AS premium#215]
+- Relation[policy#194,make#195,vehicle_age#196,sum_insured#197,inception_date#198,start_date#199,end_date#200,premium#201] csv

== Analyzed Logical Plan ==
policy: string, make: string, vehicle_age: int, sum_insured: int, inception_date: date, start_date: date, end_date: date, premium: int
Project [policy#194, make#195, cast(vehicle_age#196 as int) AS vehicle_age#210, cast(sum_insured#197 as int) AS sum_insured#211, to_date('inception_date, Some(yyyyMMdd)) AS inception_date#212, to_date('start_date, Some(yyyyMMdd)) AS start_date#213, to_date('end_date, Some(yyyyMMdd)) AS end_date#214, cast(pr
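
A possible variation, sketched here under stated assumptions, is to build the projection as plain SQL expression strings so that the list can be printed and checked without a Spark session. The function `build_select_exprs` and the sample `dtypes` list are hypothetical; the input is assumed to have the same shape as `df.dtypes`, that is, a list of `(name, type)` tuples.

```python
def build_select_exprs(dtypes,
                       int_cols=("premium", "sum_insured", "vehicle_age"),
                       date_format="yyyyMMdd"):
    """Return one SQL expression string per column, in the original column order."""
    exprs = []
    for name, dtype in dtypes:
        if name in int_cols:
            exprs.append(f"CAST(`{name}` AS INT) AS `{name}`")
        elif name.endswith("_date") and dtype == "string":
            exprs.append(f"to_date(`{name}`, '{date_format}') AS `{name}`")
        else:
            # Pass the column through unchanged.
            exprs.append(f"`{name}`")
    return exprs


# Hypothetical dtypes, shaped like the output of df.dtypes.
sample_dtypes = [("policy", "string"), ("premium", "string"),
                 ("start_date", "string"), ("end_date", "date")]

exprs = build_select_exprs(sample_dtypes)
for expr in exprs:
    print(expr)
```

With a DataFrame `df` read from the CSV file, the strings could then be applied in a single call such as `df.selectExpr(*build_select_exprs(df.dtypes))`, which should likewise produce a single projection in the parsed plan.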

In [10]:
spark.stop()