## Data Partitioning in PySpark In-depth

Data partitioning is critical to processing performance in Spark, especially for large volumes of data.  
A partition never spans nodes, though one node can hold more than one partition.  
When processing, Spark assigns one task to each partition, and each worker thread can process only one task at a time.  
Thus,
* with too few partitions, the application won’t use all the cores available in the cluster, and it can cause data skew problems;
* with too many partitions, Spark incurs overhead from managing many small tasks.
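
The trade-off above can be made concrete with a little arithmetic. The following is a minimal sketch in plain Python (no Spark needed), assuming one task per partition and one task per core at a time; `task_waves` and `total_cores` are hypothetical names used only for illustration.

```python
import math


def task_waves(num_partitions: int, total_cores: int) -> dict:
    """Estimate how the tasks of one stage map onto the available cores.

    Assumes one task per partition and one running task per core.
    """
    if num_partitions < 0 or total_cores <= 0:
        raise ValueError("num_partitions must be >= 0 and total_cores > 0")
    # Number of rounds needed before every task has been scheduled.
    waves = math.ceil(num_partitions / total_cores)
    # Cores that have work during the first round.
    busy_cores = min(num_partitions, total_cores)
    return {
        "waves": waves,
        "busy_cores": busy_cores,
        "idle_cores": total_cores - busy_cores,
    }


# Hypothetical cluster with 16 cores in total.
for n in (4, 16, 2000):
    print(n, task_waves(n, total_cores=16))
```

With 4 partitions, 12 of the 16 cores sit idle; with 2000 partitions, every core is busy but the stage needs 125 scheduling rounds of small tasks.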

In [None]:
dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling this command.
print(spark.version)

#### Load libraries

In [None]:
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import IntegerType, DateType, StringType, StructType, StructField, ArrayType, MapType, DoubleType
from pyspark.sql.functions import lit, col, expr, when, sum, avg, max, min, mean, count, udf, explode, concat_ws, year, month, dayofmonth

#### Create Spark session

In [None]:
spark = SparkSession.builder.appName('PySpark Partitioning In-depth').getOrCreate()

#### Create Dataframe

In [None]:
from datetime import date, timedelta

start_date = date(2019, 1, 1)
data = []

for i in range(0, 50):
  data.append({'Country': 'CN', 'Date': start_date + timedelta(days=i), 'Amount': 10+i})
  data.append({'Country': 'AU', 'Date': start_date + timedelta(days=i), 'Amount': 10+i})

schema = StructType([
  StructField('Country', StringType(), nullable=False),
  StructField('Date', DateType(), nullable=False),
  StructField('Amount', IntegerType(), nullable=False)]
)

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show()

In [None]:
table_name = 'temp.partitions_testing'
table_path = f'/mnt/{table_name}'

#### Write data frame to file system

In [None]:
(df
.write
.format('delta')
.mode('overwrite')
.save(table_path))

#### Check Number of Partitions

In [None]:
print(df.rdd.getNumPartitions())

In [None]:
%sh
ls -lah /dbfs/mnt/temp.partitions_testing

#### Repartitioning with coalesce function

Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions. If a larger number of partitions is requested, it will stay at the current number of partitions.

However, if you’re doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition(). This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
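
To illustrate the narrow dependency described above, here is a small plain-Python sketch of how existing partitions could be grouped when coalescing. It is a simplification: it assigns contiguous parent partitions to each new partition and ignores data locality, which Spark's own coalescer takes into account. `coalesce_plan` is a hypothetical helper, not a Spark API.

```python
from typing import List


def coalesce_plan(current: int, requested: int) -> List[List[int]]:
    """Group parent partition indices into the partitions left after a coalesce.

    No rows move between groups, so no shuffle is needed.
    """
    if current <= 0 or requested <= 0:
        raise ValueError("partition counts must be positive")
    # coalesce never increases the number of partitions.
    target = min(current, requested)
    groups: List[List[int]] = [[] for _ in range(target)]
    for parent in range(current):
        # Contiguous, near-equal chunks of parent partitions.
        groups[parent * target // current].append(parent)
    return groups


plan = coalesce_plan(1000, 100)
print(len(plan), len(plan[0]))    # 100 10
print(len(coalesce_plan(8, 16)))  # 8: asking for more keeps the current count
```

Each of the 100 new partitions claims 10 of the original 1000, and a request for more partitions than currently exist leaves the count unchanged.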

In [None]:
df = df.coalesce(16)

(df
.write
.format('delta')
.mode('overwrite')
.save(table_path))

In [None]:
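print(df.rdd.getNumPartitions()) # still 8: coalesce cannot increase the number of partitions
# but ls will return 16 (presumably the 8 new files plus the 8 files kept from the previous Delta version)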

In [None]:
%sh
ls -lah /dbfs/mnt/temp.partitions_testing

In [None]:
df = df.coalesce(4)

(df
.write
.format('delta')
.mode('overwrite')
.save(table_path))

In [None]:
print(df.rdd.getNumPartitions()) # will return 4
# in delta there will be 20 files (the 4 new files plus the 16 files kept from earlier writes)

In [None]:
%sh
ls -lah /dbfs/mnt/temp.partitions_testing

In [None]:
%sh
rm -rf /dbfs/mnt/temp.partitions_testing

In [None]:
# after deleting the delta table
# there should be 4 files
(df
.write
.format('delta')
.mode('overwrite')
.save(table_path))

In [None]:
%sh
ls -lah /dbfs/mnt/temp.partitions_testing

In [None]:
%sh
rm -rf /dbfs/mnt/temp.partitions_testing

#### Repartitioning with repartition function

Returns a new DataFrame partitioned by the given partitioning expressions. The resulting DataFrame is hash partitioned.  

##### Repartition by number

PySpark will try to distribute the data evenly across the partitions.  
If the number of partitions is greater than the record count (or RDD size), some partitions will be empty.
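
The behaviour described above can be sketched in plain Python. This assumes a simple round-robin assignment from a single input partition, which is a simplification of what Spark does internally; `round_robin` is a hypothetical helper, not a Spark API.

```python
from typing import Iterable, List


def round_robin(records: Iterable, num_partitions: int, start: int = 0) -> List[list]:
    """Deal records out to partitions one at a time, like dealing cards."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    partitions: List[list] = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[(start + i) % num_partitions].append(record)
    return partitions


# 100 records over 10 partitions: an even split.
print([len(p) for p in round_robin(range(100), 10)])

# 5 records over 8 partitions: 3 partitions stay empty.
sparse = round_robin(range(5), 8)
print(sum(1 for p in sparse if not p))
```

In the notebook itself, the actual row count per partition can be inspected with `df.rdd.glom().map(len).collect()`.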

In [None]:
df = df.repartition(10)

(df
.write
.format('delta')
.mode('overwrite')
.save(table_path))

In [None]:
print(df.rdd.getNumPartitions())

In [None]:
%sh
ls -lah /dbfs/mnt/temp.partitions_testing

In [None]:
%sh
rm -rf /dbfs/mnt/temp.partitions_testing

##### Repartition by column

The script below will create 200 partitions (by default, Spark creates 200 shuffle partitions; see `spark.sql.shuffle.partitions`).  
However, only three sharded files are generated:

* One file stores data for CN country.
* Another file stores data for AU country.
* The other one is empty.
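
Why so few files hold data follows from hash partitioning: every row with the same key lands in the same partition, so two distinct countries can fill at most two of the 200 partitions. The sketch below shows the idea in plain Python. It uses `zlib.crc32` as a stand-in hash (Spark uses its own hash function, so the actual partition indices will differ), and `hash_partition` is a hypothetical helper.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def hash_partition(rows: List[dict], key: str, num_partitions: int) -> Dict[int, List[dict]]:
    """Assign each row to a partition index derived from the hash of its key."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    buckets: Dict[int, List[dict]] = defaultdict(list)
    for row in rows:
        # Stand-in hash; the same key always maps to the same index.
        index = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        buckets[index].append(row)
    return buckets


# Same shape as the notebook's data: 50 rows each for CN and AU.
rows = [{"Country": c, "Amount": 10 + i} for i in range(50) for c in ("CN", "AU")]
buckets = hash_partition(rows, "Country", 200)

print("non-empty partitions:", len(buckets))
for index, bucket in sorted(buckets.items()):
    print(index, {r["Country"] for r in bucket}, len(bucket))
```

The number of non-empty partitions can never exceed the number of distinct key values, however many partitions are requested.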

In [None]:
df = df.repartition('Country')

In [None]:
(df
.write
.format('delta')
.mode('overwrite')
.save(table_path))

In [None]:
print(df.rdd.getNumPartitions())

In [None]:
%sh
ls -lah /dbfs/mnt/temp.partitions_testing

In [None]:
%sh
rm -rf /dbfs/mnt/temp.partitions_testing

##### Repartition by multiple columns

In [None]:
df = (
  df
  .withColumn('Year', year('Date'))
  .withColumn('Month', month('Date'))
  .withColumn('Day', dayofmonth('Date'))
)
df = df.repartition('Year', 'Month', 'Day', 'Country')

In [None]:
(df
.write
.format('delta')
.mode('overwrite')
.save(table_path))

In [None]:
print(df.rdd.getNumPartitions())

In [None]:
%sh
ls -lah /dbfs/mnt/temp.partitions_testing

In [None]:
%sh
rm -rf /dbfs/mnt/temp.partitions_testing

#### Match repartition keys with write partition keys

In [None]:
(df
.write
.partitionBy('Year', 'Month', 'Day', 'Country')
.format('delta')
.mode('overwrite')
.save(table_path))

In [None]:
%sh
find /dbfs/mnt/temp.partitions_testing -maxdepth 4 -type d

#### Read from partitioned data

In [None]:
df = spark.read.format('delta').load(f'{table_path}/Year=2019/Month=2/Day=1/Country=CN')
print(df.rdd.getNumPartitions())
df.show()

In [None]:
df = spark.read.format('delta').load(f'{table_path}/Year=2019/Month=2')
print(df.rdd.getNumPartitions())
df.show()

In [None]:
%sh
rm -rf /dbfs/mnt/temp.partitions_testing

#### The end of the notebook