In [5]:
# Use the findspark library to locate Spark on the local machine
import findspark
findspark.init(r'C:\spark\spark-3.5.0-bin-hadoop3')
import pyspark # only run this after findspark.init()

from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
          .appName('SparkByExamples.com') \
          .getOrCreate()
          

data = [(1,10),(2,20),(3,10),(4,20),(5,10),
    (6,30),(7,50),(8,50),(9,50),(10,30),
    (11,10),(12,10),(13,40),(14,40),(15,40),
    (16,40),(17,50),(18,10),(19,40),(20,40)
  ]

df=spark.createDataFrame(data,["id","value"])

df.show()

+---+-----+
| id|value|
+---+-----+
|  1|   10|
|  2|   20|
|  3|   10|
|  4|   20|
|  5|   10|
|  6|   30|
|  7|   50|
|  8|   50|
|  9|   50|
| 10|   30|
| 11|   10|
| 12|   10|
| 13|   40|
| 14|   40|
| 15|   40|
| 16|   40|
| 17|   50|
| 18|   10|
| 19|   40|
| 20|   40|
+---+-----+



df is the 20-row DataFrame created above, with the columns "id" and "value".

repartition(3, "value") redistributes df into three partitions keyed on the "value" column. Spark hashes each row's "value" and uses that hash to pick the target partition, so rows with the same value always end up in the same partition. This requires a full shuffle of the data, which appears in the physical plan as an Exchange hashpartitioning step.
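
The idea behind hash partitioning can be sketched in plain Python, without Spark. This is a toy model only: toy_partition is a hypothetical helper, and Python's built-in hash() stands in for Spark's own hash function, so the partition numbers will not match what Spark produces. What it does show is that equal keys always share a partition.

```python
from collections import defaultdict

# Same (id, value) rows as the DataFrame above.
data = [(1, 10), (2, 20), (3, 10), (4, 20), (5, 10),
        (6, 30), (7, 50), (8, 50), (9, 50), (10, 30),
        (11, 10), (12, 10), (13, 40), (14, 40), (15, 40),
        (16, 40), (17, 50), (18, 10), (19, 40), (20, 40)]

NUM_PARTITIONS = 3


def toy_partition(value, num_partitions=NUM_PARTITIONS):
    """Hypothetical stand-in for a hash partitioner: hash the key, then
    take the result modulo the partition count. Not Spark's real hash."""
    return hash(value) % num_partitions


# Group the rows by the partition their "value" maps to.
partitions = defaultdict(list)
for row_id, value in data:
    partitions[toy_partition(value)].append((row_id, value))

# Print partition id, the distinct values it holds, and its row count.
for pid in sorted(partitions):
    distinct_values = sorted({value for _, value in partitions[pid]})
    print(pid, distinct_values, len(partitions[pid]))
```

Because the assignment depends only on the key, partition sizes can be uneven when some values occur much more often than others.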

.explain(True) is called on the repartitioned DataFrame. explain() prints the execution plan; passing True requests the extended output, which shows the parsed, analyzed, and optimized logical plans in addition to the physical plan.

Running this cell only prints the plan; it does not execute the repartitioning. The output shows how Spark intends to carry out the operation, which is useful for checking whether a shuffle (Exchange) is introduced and which partitioning scheme it uses.

In [6]:
df.repartition(3,"value").explain(True)  

== Parsed Logical Plan ==
'RepartitionByExpression ['value], 3
+- LogicalRDD [id#26L, value#27L], false

== Analyzed Logical Plan ==
id: bigint, value: bigint
RepartitionByExpression [value#27L], 3
+- LogicalRDD [id#26L, value#27L], false

== Optimized Logical Plan ==
RepartitionByExpression [value#27L], 3
+- LogicalRDD [id#26L, value#27L], false

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Exchange hashpartitioning(value#27L, 3), REPARTITION_BY_NUM, [plan_id=42]
   +- Scan ExistingRDD[id#26L,value#27L]



repartitionByRange("value") repartitions the DataFrame by ranges of the "value" column. Spark samples the column to estimate range boundaries, and each partition then holds a contiguous range of values, sorted ascending by default. Because no partition count is passed here, Spark falls back to spark.sql.shuffle.partitions (200 by default), which is why the physical plan below shows rangepartitioning(value#27L ASC NULLS FIRST, 200).
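
As a rough illustration of range partitioning, the plain-Python sketch below picks boundaries from the sorted values and assigns each value to the first range that contains it. The helpers range_bounds and partition_of are hypothetical names, and the boundary selection here uses every value instead of a sample, so it is a simplification of what Spark does, not a reproduction of it.

```python
from bisect import bisect_left

# The "value" column of the DataFrame above, in id order.
values = [10, 20, 10, 20, 10, 30, 50, 50, 50, 30,
          10, 10, 40, 40, 40, 40, 50, 10, 40, 40]


def range_bounds(sample, num_partitions):
    """Pick num_partitions - 1 upper bounds at roughly equal-sized
    steps through the sorted sample (toy version of boundary selection)."""
    ordered = sorted(sample)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i) - 1] for i in range(1, num_partitions)]


def partition_of(value, bounds):
    """Index of the first bound that is >= value; values above every
    bound fall into the last partition."""
    return bisect_left(bounds, value)


bounds = range_bounds(values, 3)

# Show which partition each distinct value would be assigned to.
for value in sorted(set(values)):
    print(value, "-> partition", partition_of(value, bounds))
```

Unlike hash partitioning, the partition index here grows with the value, so neighbouring values tend to be stored together.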

In [9]:
df.repartitionByRange("value").explain(True)


== Parsed Logical Plan ==
'RepartitionByExpression ['value ASC NULLS FIRST]
+- LogicalRDD [id#26L, value#27L], false

== Analyzed Logical Plan ==
id: bigint, value: bigint
RepartitionByExpression [value#27L ASC NULLS FIRST]
+- LogicalRDD [id#26L, value#27L], false

== Optimized Logical Plan ==
RepartitionByExpression [value#27L ASC NULLS FIRST]
+- LogicalRDD [id#26L, value#27L], false

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Exchange rangepartitioning(value#27L ASC NULLS FIRST, 200), REPARTITION_BY_COL, [plan_id=124]
   +- Scan ExistingRDD[id#26L,value#27L]



In [None]:
df.repartitionByRange(3,"value").explain(True)