Partition Pruning:
- A Spark optimization that reduces the amount of data read by skipping partitions that cannot match the filter condition.

Static Partition Pruning:
- When the partition filter can be evaluated statically during query planning (before execution), Spark removes the irrelevant partitions from the query plan, so their files are never scanned.
- Photon supports operations involving statically pruned partitions.
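
To make the idea concrete, here is a minimal plain-Python sketch of what pruning amounts to for a Hive-style directory layout (`column=value` path segments). It is not Spark's implementation: `parse_partition_values` and `prune_partitions` are hypothetical helper names, and the `sales/...` paths are made-up examples. The only point is that the filter is evaluated against values encoded in the directory names, so directories that do not match never have to be opened.

```python
from typing import Dict, List


def parse_partition_values(path: str) -> Dict[str, str]:
    """Extract column=value pairs from a Hive-style partition path."""
    values: Dict[str, str] = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            column, value = segment.split("=", 1)
            values[column] = value
    return values


def prune_partitions(paths: List[str], predicate: Dict[str, str]) -> List[str]:
    """Keep only paths whose partition values satisfy every equality in `predicate`."""
    kept = []
    for path in paths:
        values = parse_partition_values(path)
        if all(values.get(col) == val for col, val in predicate.items()):
            kept.append(path)
    return kept


# Hypothetical partition directories (illustrative only).
partition_paths = [
    "sales/year=2023/month=07",
    "sales/year=2023/month=08",
    "sales/year=2023/month=09",
]

# Equivalent in spirit to: WHERE year = '2023' AND month = '08'
selected = prune_partitions(partition_paths, {"year": "2023", "month": "08"})
print(selected)
```

A filter on a column that is not encoded in the path gives the planner nothing to prune on, which is the situation described under "No Partition Pruning".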

No Partition Pruning:
- If the data is not partitioned, or the filter does not reference a partition column, there is nothing to prune and the filter is applied only after the data has been scanned. Photon then has no partition-related optimizations to apply, and it may fall back to default Spark execution for any plan nodes it does not support.

In [0]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PartitionPruning').getOrCreate()

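# Static Partition Pruning
# Note: repartition("value") only redistributes rows across in-memory partitions
# by hash of "value". Pruning in the strict sense applies to tables or files
# partitioned on disk (e.g. written with partitionBy), so read the plans below with that in mind.
data = [(1, "A"), (2, "B"), (3, "A"), (4, "C"), (5, "A")]
df = spark.createDataFrame(data, ["id", "value"]).repartition("value")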

filter_condition = "value = 'A'"
result_with_partition_pruning = df.filter(filter_condition)

print('Result with Partition Pruning:')
result_with_partition_pruning.show()
print('Execution Plan with Partition Pruning:')
result_with_partition_pruning.explain()

# No Partition Pruning
data = [(1, "A"), (2, "B"), (3, "A"), (4, "C"), (5, "A")]
df2 = spark.createDataFrame(data, ["id", "value"])

filter_condition = "value = 'A'"

result_without_partition_pruning = df2.filter(filter_condition)

print('Result without Partition Pruning:')
result_without_partition_pruning.show()
print('Execution Plan without Partition Pruning:')
result_without_partition_pruning.explain()

# Output: == Photon Explanation ==
# Photon does not fully support the query because:
# 		Unsupported node: LocalTableScan [id#11575L, value#11576].

# Why Photon does not fully support the query without partition pruning:
# The DataFrame is built from a small in-memory collection, so the plan contains a LocalTableScan,
# which is evaluated on the driver node rather than as a distributed scan.
# Photon optimizations target distributed execution, and LocalTableScan is reported above as an unsupported node.
# For nodes it does not support, Photon falls back to default Spark behavior.

# Dynamic partition overwrite

In [0]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("PartitionOverwriteExample")
    # "dynamic": an overwrite replaces only the partitions present in the incoming data
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)

# Sample data about customers from different months
data = [("Rajesh", "2023", "07"), ("Rajesh", "2023", "08"), ("Sunita", "2023", "09")]
columns = ["name", "year", "month"]

df = spark.createDataFrame(data, schema=columns)

# Write data partitioned by year and month
df.write.partitionBy("year", "month").mode("overwrite").format("parquet").save("/Volumes/workspace/default/spark_vol1/partition_dynamic")

In [0]:
# Read each partition directory directly. The year and month values live in the
# directory names, so only the "name" column comes back from these reads.
df_jul = spark.read.format("parquet").load("/Volumes/workspace/default/spark_vol1/partition_dynamic/year=2023/month=07")
df_aug = spark.read.format("parquet").load("/Volumes/workspace/default/spark_vol1/partition_dynamic/year=2023/month=08")
df_sep = spark.read.format("parquet").load("/Volumes/workspace/default/spark_vol1/partition_dynamic/year=2023/month=09")
df_jul.show()
df_aug.show()
df_sep.show()

In [0]:
spark.version

In [0]:
data = [("Update", "2023", "07")]
columns = ["name", "year", "month"]

df = spark.createDataFrame(data, schema=columns)

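# The incoming data only covers year=2023/month=07, so with dynamic mode only that
# partition is replaced; month=08 and month=09 are left untouched.
df.write.option("partitionOverwriteMode", "dynamic").partitionBy("year", "month").mode("overwrite").format("parquet").save("/Volumes/workspace/default/spark_vol1/partition_dynamic")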

In [0]:
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("name", StringType(), True)
])

# After the dynamic overwrite, month=07 should hold the "Update" row,
# while month=08 and month=09 should still hold their original rows.
df_jul = spark.read.format("parquet").load("/Volumes/workspace/default/spark_vol1/partition_dynamic/year=2023/month=07")
df_aug = spark.read.format("parquet").schema(schema).load("/Volumes/workspace/default/spark_vol1/partition_dynamic/year=2023/month=08")
df_sep = spark.read.format("parquet").schema(schema).load("/Volumes/workspace/default/spark_vol1/partition_dynamic/year=2023/month=09")
df_jul.show()
df_aug.show()
df_sep.show()
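
The difference between the two overwrite modes can be sketched without Spark. The following is a simplified model under stated assumptions, not Spark's implementation: a table is a dict from a `(year, month)` partition key to its rows, and `overwrite` is a hypothetical helper. In static mode (the default value of `spark.sql.sources.partitionOverwriteMode`) an overwrite of the path replaces everything under it; in dynamic mode only the partitions present in the incoming data are replaced.

```python
from typing import Dict, List, Tuple

PartitionKey = Tuple[str, str]          # (year, month)
Table = Dict[PartitionKey, List[str]]   # partition key -> rows ("name" values)


def overwrite(existing: Table, incoming: Table, mode: str) -> Table:
    """Model an overwrite of a partitioned table (hypothetical helper)."""
    if mode == "static":
        # Everything that was there is dropped; only the incoming data remains.
        return {key: list(rows) for key, rows in incoming.items()}
    if mode == "dynamic":
        # Start from the existing partitions and replace only those being written.
        result = {key: list(rows) for key, rows in existing.items()}
        for key, rows in incoming.items():
            result[key] = list(rows)
        return result
    raise ValueError(f"unknown mode: {mode!r}")


# Same shape as the data written in the cells above.
existing = {
    ("2023", "07"): ["Rajesh"],
    ("2023", "08"): ["Rajesh"],
    ("2023", "09"): ["Sunita"],
}
incoming = {("2023", "07"): ["Update"]}

static_result = overwrite(existing, incoming, "static")
dynamic_result = overwrite(existing, incoming, "dynamic")

print("static :", sorted(static_result))
print("dynamic:", sorted(dynamic_result))
```

In this model the static result keeps only the `("2023", "07")` partition, while the dynamic result keeps all three and changes just the one that was written, which is the behavior the reads above are meant to confirm.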