In PySpark, you don't assign a partition a fixed size the way you would set a hard memory limit. Instead, you influence partition size by configuring Spark properties or by repartitioning your data to a target partition count.

Depending on whether you are **reading data**, **shuffling data** (during joins/grouping), or **writing data**, you use different methods to control partition size.

1. Controlling Size During Reads

In [None]:
from pyspark.sql import SparkSession

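# Cap each file-read partition at 64 MB (64 * 1024 * 1024 bytes); the default is 128 MB.
# Parentheses replace backslash continuations so a comment cannot break the chain.
spark = (
    SparkSession.builder
    .appName("PartitionSizeControl")
    .config("spark.sql.files.maxPartitionBytes", "67108864")
    .getOrCreate()
)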

# File-based reads now pack at most about 64 MB of input into each partition
df = spark.read.parquet("path/to/large_data")
print(f"Number of partitions: {df.rdd.getNumPartitions()}")
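
As a rough check on the printed count, the number of read partitions can be estimated by dividing the input size by `spark.sql.files.maxPartitionBytes`. The sketch below is plain Python and does not call Spark. The helper name `estimate_read_partitions` and the 10 GB input size are hypothetical, and the real count can differ because Spark also weighs file-open cost and available parallelism when it plans splits.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB

def estimate_read_partitions(total_input_bytes: int, max_partition_bytes: int) -> int:
    """Rough estimate: input size divided by the partition cap, rounded up."""
    return max(1, math.ceil(total_input_bytes / max_partition_bytes))

# Hypothetical 10 GB dataset, read with a 64 MB cap vs. the 128 MB default
with_64mb_cap = estimate_read_partitions(10 * GB, 64 * MB)
with_128mb_default = estimate_read_partitions(10 * GB, 128 * MB)
print(with_64mb_cap, with_128mb_default)  # 160 80
```

Halving the cap roughly doubles the estimated partition count, which is the effect the `maxPartitionBytes` setting is meant to have on reads.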

2. Controlling Size During Shuffles (Joins/Aggregations)
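When you perform an operation that shuffles data (like a groupBy or join), Spark SQL uses 200 shuffle partitions by default (the `spark.sql.shuffle.partitions` property). That fixed count ignores data volume: if the shuffled data is 200 GB, each partition averages about 1 GB (too large), and if it is 200 MB, each averages about 1 MB (too small), assuming the data is spread evenly.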
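
The arithmetic above can be turned around to pick a shuffle partition count from a target partition size. This is a minimal sketch in plain Python. The helper names and the 128 MB target are assumptions chosen for illustration, not a Spark recommendation, and the calculation assumes the data is spread evenly across partitions.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB

def avg_partition_bytes(total_bytes: int, num_partitions: int) -> float:
    """Average partition size if the data is spread evenly."""
    return total_bytes / num_partitions

def shuffle_partition_count(total_bytes: int, target_partition_bytes: int = 128 * MB) -> int:
    """Partitions needed so each holds about target_partition_bytes (assumed target)."""
    return max(1, math.ceil(total_bytes / target_partition_bytes))

# The two cases from the text, at the default of 200 shuffle partitions
print(avg_partition_bytes(200 * GB, 200) / GB)  # 1.0 (GB per partition)
print(avg_partition_bytes(200 * MB, 200) / MB)  # 1.0 (MB per partition)

# Counts that would bring each case close to the assumed 128 MB target
large_job = shuffle_partition_count(200 * GB)
small_job = shuffle_partition_count(200 * MB)
print(large_job, small_job)  # 1600 2

# With a live session, the result could be applied before the shuffle (not run here):
# spark.conf.set("spark.sql.shuffle.partitions", str(large_job))
```

The count is computed from an estimate of the shuffled data size, so it is only as good as that estimate; skewed keys can still produce partitions far larger than the average.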