In [1]:
# enable pyspark
import findspark
findspark.init()

In [2]:
'''
Script that instantiates a SparkSession locally with 8 worker threads.
'''
appName = "PySpark Partition"
master = "local[8]"
from pyspark import SparkContext, SparkConf
# ref: https://towardsai.net/p/programming/pyspark-aws-s3-read-write-operations
#spark configuration
conf = SparkConf().set('spark.executor.extraJavaOptions','-Dcom.amazonaws.services.s3.enableV4=true'). \
 set('spark.driver.extraJavaOptions','-Dcom.amazonaws.services.s3.enableV4=true'). \
 setAppName(appName).setMaster(master)

sc=SparkContext(conf=conf)
sc.setSystemProperty('com.amazonaws.services.s3.enableV4', 'true')

# read aws credentials
import configparser
config = configparser.ConfigParser()
config.read_file(open(r'C:\Users\padma\.aws\credentials'))

accessKeyId= config['default']['AWS_ACCESS_KEY_ID']
secretAccessKey= config['default']['AWS_SECRET_ACCESS_KEY']

hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3a.access.key', accessKeyId)
hadoopConf.set('fs.s3a.secret.key', secretAccessKey)
hadoopConf.set('fs.s3a.endpoint', 's3.amazonaws.com')
hadoopConf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')

print(sc)
from pyspark.sql import SparkSession
spark=SparkSession(sc)

<SparkContext master=local[8] appName=PySpark Partition>


## Data partitioning
Data partitioning is critical to data processing performance, especially when processing large volumes of data in Spark.
A partition in Spark never spans nodes, though a single node can hold more than one partition. When processing,
Spark assigns one task to each partition, and each worker thread can process only one task at a time. Thus,
with too few partitions, the application won’t utilize all the cores available in the cluster, and the work can be
unevenly distributed (data skew); with too many partitions, Spark incurs overhead from managing many small tasks.
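
To make this trade-off concrete, here is a minimal pure-Python sketch (not Spark code). It assumes every task takes the same time and that scheduling overhead is zero; the helper name `schedule_summary` is hypothetical. It counts how many rounds ("waves") of tasks are needed for a given number of partitions and worker threads, and how many threads sit idle in the first wave.

```python
import math


def schedule_summary(num_partitions: int, num_threads: int) -> dict:
    """Toy model: one task per partition, one task per thread at a time.

    Assumes equally sized tasks and no scheduling overhead (a simplification).
    """
    waves = math.ceil(num_partitions / num_threads)
    idle_first_wave = max(num_threads - num_partitions, 0)
    return {"waves": waves, "idle_threads_in_first_wave": idle_first_wave}


# local[8] gives 8 worker threads, as in the session above.
for partitions in (2, 8, 1000):
    print(partitions, schedule_summary(partitions, num_threads=8))
```

Under these assumptions, 2 partitions leave 6 of the 8 threads idle, while 1000 partitions need 125 waves of small tasks on the same 8 threads.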

In [4]:
'''
Script that populates a data frame with 100 records.
'''

from pyspark.sql.functions import year, month, dayofmonth
from pyspark.sql import SparkSession
from datetime import date, timedelta
from pyspark.sql.types import IntegerType, DateType, StringType, StructType, StructField

print(spark.version)
# Populate sample data
start_date = date(2019, 1, 1)
data = []
for i in range(0, 50):
    data.append({"Country": "CN", "Date": start_date +
                 timedelta(days=i), "Amount": 10+i})
    data.append({"Country": "AU", "Date": start_date +
                 timedelta(days=i), "Amount": 10+i})

schema = StructType([StructField('Country', StringType(), nullable=False),
                     StructField('Date', DateType(), nullable=False),
                     StructField('Amount', IntegerType(), nullable=False)])

df = spark.createDataFrame(data, schema=schema)
df.show()
print(df.rdd.getNumPartitions())

3.1.2
+-------+----------+------+
|Country|      Date|Amount|
+-------+----------+------+
|     CN|2019-01-01|    10|
|     AU|2019-01-01|    10|
|     CN|2019-01-02|    11|
|     AU|2019-01-02|    11|
|     CN|2019-01-03|    12|
|     AU|2019-01-03|    12|
|     CN|2019-01-04|    13|
|     AU|2019-01-04|    13|
|     CN|2019-01-05|    14|
|     AU|2019-01-05|    14|
|     CN|2019-01-06|    15|
|     AU|2019-01-06|    15|
|     CN|2019-01-07|    16|
|     AU|2019-01-07|    16|
|     CN|2019-01-08|    17|
|     AU|2019-01-08|    17|
|     CN|2019-01-09|    18|
|     AU|2019-01-09|    18|
|     CN|2019-01-10|    19|
|     AU|2019-01-10|    19|
+-------+----------+------+
only showing top 20 rows

8


In [6]:
# Write the data frame to the file system.
# 8 sharded files will be generated, one per partition, under the folder data/example.csv:
# 7 shards/files with 12 rows and one file with 16 rows.
df.count()
df.write.mode("overwrite").csv("data/example.csv", header=True)


## Repartitioning with coalesce function
This function is defined as follows:
<pre>
def coalesce(numPartitions)
Returns a new DataFrame that has exactly numPartitions partitions.
</pre>

This operation results in a narrow dependency: for example, if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead, each of the 100 new partitions will claim 10 of the current partitions. If a larger number of partitions is requested, the DataFrame stays at its current number of partitions.
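
The narrow-dependency behaviour can be sketched in pure Python (a toy model, not Spark's implementation). The function name `coalesce_model` is hypothetical, and the choice to merge contiguous neighbours is an assumption; which existing partitions Spark actually groups together is an implementation detail. The point is only that whole partitions are merged and the count can never grow.

```python
from typing import List


def coalesce_model(partitions: List[list], num_partitions: int) -> List[list]:
    """Toy model of coalesce: merge whole neighbouring partitions, never split one."""
    current = len(partitions)
    if num_partitions >= current:
        # Without a shuffle, rows cannot be spread over more partitions.
        return partitions
    merged = []
    for i in range(num_partitions):
        start = i * current // num_partitions
        end = (i + 1) * current // num_partitions
        # Each new partition claims a contiguous group of existing partitions.
        merged.append([row for part in partitions[start:end] for row in part])
    return merged


# 8 hypothetical partitions holding 100 row ids (7 with 12 rows, 1 with 16).
rows = list(range(100))
parts = [rows[i * 12:(i + 1) * 12] for i in range(7)] + [rows[84:]]

print(len(coalesce_model(parts, 16)))              # prints 8: the count cannot grow
print([len(p) for p in coalesce_model(parts, 4)])  # sizes after merging pairs
```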

If we run the following code, can you guess how many sharded files will be generated?
The answer is still 8. **This is because the coalesce function doesn’t involve a reshuffle of data.**
In the code below, we want to increase the number of partitions to 16, but it
stays at the current value (8).

In [12]:
df = df.coalesce(16)
print(df.rdd.getNumPartitions())
df.write.mode("overwrite").csv("data/example.csv", header=True)

8


If we decrease the number of partitions to 4 by running the following code, how many files will be generated? The answer is 4.

In [13]:
df = df.coalesce(4)
print(df.rdd.getNumPartitions())
df.write.mode("overwrite").csv("data/example.csv", header=True)

4


## Repartitioning with repartition function
The other method for repartitioning is repartition. It’s defined as follows:
<pre>
def repartition(numPartitions, *cols)
</pre>
Returns a new DataFrame partitioned by the given partitioning expressions. The resulting DataFrame is hash partitioned.

numPartitions can be an int to specify the target number of partitions or a Column. If it is a Column, it will be used as the first partitioning column. If not specified, the default number of partitions is used.

The partitioning columns are optional arguments, and numPartitions is itself optional
when partitioning columns are specified.

A data reshuffle occurs when using this function. Let’s try some examples using the above dataset.

### Repartition by number
Use the code below to repartition the data to 10 partitions.
Spark will try to distribute the data evenly across the partitions. If the total number of partitions is greater than the actual record count (or RDD size), some partitions will be empty. After we run the code below, the data will be reshuffled into 10 partitions, with 10 sharded files generated.

If we repartition the data frame to 1000 partitions, how many sharded files will be generated?
The answer is 100 (assuming the records spread out evenly) because the other 900 partitions are empty and each file has one record.

In [14]:
df = df.repartition(10)
print(df.rdd.getNumPartitions())
df.write.mode("overwrite").csv("data/example.csv", header=True)

10
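
As a rough illustration of the even distribution described above, the following pure-Python sketch deals 100 rows out to partitions in turn. This is an idealized model, not Spark's code: the name `round_robin_model` is hypothetical, and starting at partition 0 is an assumption, so the exact counts Spark produces may differ.

```python
from collections import Counter


def round_robin_model(num_rows: int, num_partitions: int) -> Counter:
    """Toy model: deal rows to partitions in turn, starting at partition 0."""
    return Counter(i % num_partitions for i in range(num_rows))


for target in (10, 1000):
    sizes = round_robin_model(100, target)
    non_empty = len(sizes)
    print(target, "partitions:", non_empty, "non-empty,", target - non_empty, "empty")
```

In this model, 10 partitions receive 10 rows each, while 1000 partitions leave 900 empty and put one row in each of the remaining 100.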


### Repartition by column
We can also repartition by columns.
For example, let’s run the code below to repartition the data by the Country column.
This will create 200 partitions (**by default, Spark creates 200 shuffle partitions**, controlled by the `spark.sql.shuffle.partitions` setting). However, only three sharded files are generated:
- One file stores data for the CN country.
- Another file stores data for the AU country.
- The other one is empty.

In [15]:
df = df.repartition("Country")
print(df.rdd.getNumPartitions())
df.write.mode("overwrite").csv("data/example.csv", header=True)

200


Similarly, we can also partition the data by the Date column:
<pre>
df = df.repartition("Date")
print(df.rdd.getNumPartitions())
df.write.mode("overwrite").csv("data/example.csv", header=True)
</pre>
If you look into the data, you may find that it is not partitioned as you would expect; for example, a single partition file may include data for both countries and for different dates.

**This is because, by default, Spark uses hash partitioning as its partition function**. You can use a range partitioning function or customize the partition function. I will talk more about this in my other posts.
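
The effect of hash partitioning can be sketched in pure Python. This is a stand-in, not Spark's partitioner: `zlib.crc32` replaces Spark's own hash function purely because it is deterministic, and the helper names `partition_for` and `assign` are hypothetical. The sketch shows two properties: equal keys always land in the same partition, and distinct keys may share one, so the number of non-empty partitions can be lower than the number of distinct keys.

```python
import zlib
from collections import defaultdict
from datetime import date, timedelta


def partition_for(key: str, num_partitions: int) -> int:
    """Stand-in hash partitioner: hash the key, then take it modulo the partition count.

    zlib.crc32 is used only because it is deterministic; Spark uses its own hash.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(keys, num_partitions):
    """Group keys by the partition the stand-in partitioner sends them to."""
    buckets = defaultdict(set)
    for key in keys:
        buckets[partition_for(key, num_partitions)].add(key)
    return buckets


start = date(2019, 1, 1)
dates = [(start + timedelta(days=i)).isoformat() for i in range(50)]

by_country = assign(["CN", "AU"], 200)
by_date = assign(dates, 200)
print("Country:", len(by_country), "non-empty partitions out of 200")
print("Date:", len(by_date), "non-empty partitions for", len(dates), "distinct dates")
```

Because the partition is chosen from the key's hash alone, rows for both countries on the same date always end up together when partitioning by Date.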

### Partition by multiple columns
In the real world, you would probably partition your data by multiple columns. To implement a multiple-column partitioning strategy, we need to derive some new columns (year, month, day). The code below derives these columns and then repartitions the data frame by them.


In [18]:
# derive some new columns (year, month, day)
df = df.withColumn("Year", year("Date")).withColumn(
"Month", month("Date")).withColumn("Day", dayofmonth("Date"))
# repartition the data frame with new columns
df = df.repartition("Year", "Month", "Day", "Country")
df.show()
print(df.rdd.getNumPartitions())
df.write.mode("overwrite").csv("data/example.csv", header=True)

+-------+----------+------+----+-----+---+
|Country|      Date|Amount|Year|Month|Day|
+-------+----------+------+----+-----+---+
|     AU|2019-01-21|    30|2019|    1| 21|
|     CN|2019-01-29|    38|2019|    1| 29|
|     AU|2019-01-19|    28|2019|    1| 19|
|     AU|2019-02-02|    42|2019|    2|  2|
|     AU|2019-02-07|    47|2019|    2|  7|
|     AU|2019-02-05|    45|2019|    2|  5|
|     AU|2019-02-08|    48|2019|    2|  8|
|     CN|2019-01-27|    36|2019|    1| 27|
|     CN|2019-01-21|    30|2019|    1| 21|
|     CN|2019-01-25|    34|2019|    1| 25|
|     CN|2019-02-06|    46|2019|    2|  6|
|     AU|2019-01-11|    20|2019|    1| 11|
|     CN|2019-01-19|    28|2019|    1| 19|
|     CN|2019-02-19|    59|2019|    2| 19|
|     AU|2019-02-03|    43|2019|    2|  3|
|     AU|2019-02-09|    49|2019|    2|  9|
|     CN|2019-01-14|    23|2019|    1| 14|
|     AU|2019-01-16|    25|2019|    1| 16|
|     CN|2019-02-16|    56|2019|    2| 16|
|     AU|2019-01-10|    19|2019|    1| 10|
+-------+----------+------+----+-----+---+

### partitionBy
When you look into the saved files, you may find that all the new columns are also saved and that the files still mix different sub-partitions. To improve this, we need to match our write partition keys with the repartition keys.
To match the partition keys, we just need to change the last line to add a partitionBy call:

In [19]:
# derive some new columns (year, month, day)
df = df.withColumn("Year", year("Date")).withColumn(
"Month", month("Date")).withColumn("Day", dayofmonth("Date"))
# repartition the data frame with new columns
df = df.repartition("Year", "Month", "Day", "Country")
df.show()
print(df.rdd.getNumPartitions())
df.write.partitionBy("Year", "Month", "Day", "Country").mode(
"overwrite").csv("data/example.csv", header=True)

+-------+----------+------+----+-----+---+
|Country|      Date|Amount|Year|Month|Day|
+-------+----------+------+----+-----+---+
|     AU|2019-01-21|    30|2019|    1| 21|
|     CN|2019-01-29|    38|2019|    1| 29|
|     AU|2019-01-19|    28|2019|    1| 19|
|     AU|2019-02-02|    42|2019|    2|  2|
|     AU|2019-02-07|    47|2019|    2|  7|
|     AU|2019-02-05|    45|2019|    2|  5|
|     AU|2019-02-08|    48|2019|    2|  8|
|     CN|2019-01-27|    36|2019|    1| 27|
|     CN|2019-01-21|    30|2019|    1| 21|
|     CN|2019-01-25|    34|2019|    1| 25|
|     CN|2019-02-06|    46|2019|    2|  6|
|     AU|2019-01-11|    20|2019|    1| 11|
|     CN|2019-01-19|    28|2019|    1| 19|
|     CN|2019-02-19|    59|2019|    2| 19|
|     AU|2019-02-03|    43|2019|    2|  3|
|     AU|2019-02-09|    49|2019|    2|  9|
|     CN|2019-01-14|    23|2019|    1| 14|
|     AU|2019-01-16|    25|2019|    1| 16|
|     CN|2019-02-16|    56|2019|    2| 16|
|     AU|2019-01-10|    19|2019|    1| 10|
+-------+----------+------+----+-----+---+
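
With partitionBy, the writer lays the output out as nested `column=value` directories under the target path, in the order the columns are passed, and the partition values are encoded in the directory names. The pure-Python sketch below rebuilds the sample (Date, Country) pairs and lists the sub-directories one would expect under data/example.csv. It only models the naming scheme; the helper `partition_dir` is hypothetical, and the names of the files inside each directory are not modelled.

```python
from datetime import date, timedelta

# Rebuild the (Date, Country) pairs of the sample data set.
start_date = date(2019, 1, 1)
records = [(start_date + timedelta(days=i), country)
           for i in range(50) for country in ("CN", "AU")]


def partition_dir(day: date, country: str) -> str:
    """Hive-style sub-directory for one Year/Month/Day/Country combination."""
    return f"Year={day.year}/Month={day.month}/Day={day.day}/Country={country}"


dirs = sorted({partition_dir(d, c) for d, c in records})
print(len(dirs))  # prints 100: one leaf directory per (date, country) pair
print(dirs[0])
```

Since every (date, country) pair in the sample is unique, each leaf directory holds the data for exactly one record.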