# Partitioning

Partitioning is the act of dividing something into separate parts, each bounded so that it can be handled independently of the others.

Data partitioning is used in systems such as Spark, Amazon Athena, and Google BigQuery to improve query performance.

To scale out big data solutions, data is divided into partitions that can be managed, accessed, and processed separately and in parallel.

Spark splits data into smaller chunks called partitions and processes them in parallel (many partitions can be processed concurrently) using executors running on worker nodes.
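The split-then-process idea can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: the helper names `split_into_partitions` and `process_partition` are hypothetical, a thread pool stands in for executors, and the partition size of 25 is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, partition_size):
    """Cut data into consecutive chunks of at most partition_size elements."""
    return [data[i:i + partition_size] for i in range(0, len(data), partition_size)]


def process_partition(partition):
    """Work done independently on one partition (here: a partial sum)."""
    return sum(partition)


data = list(range(1, 101))                  # the numbers 1..100
partitions = split_into_partitions(data, 25)

# Each partition is handed to a worker; results come back in partition order.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(process_partition, partitions))

# Combine the per-partition results into the final answer.
total = sum(partial_sums)
print(len(partitions), partial_sums, total)
```

Each partition is processed without reference to any other, which is what allows the work to be spread over many workers.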

By default, Spark uses a hash partitioner (`HashPartitioner`) to distribute keyed data across partitions, for example when a shuffle groups records by key.

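The hash partitioner is based on the key's hash code (`hashCode()` in Java and Scala): a record is assigned to the partition whose index is the key's hash code modulo the number of partitions.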
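A minimal plain-Python sketch of hash partitioning follows. It is not Spark's implementation: the functions `hash_partition` and `partition_records` are hypothetical names, and Python's built-in `hash` stands in for the hash code. Integer keys are used so the result is deterministic.

```python
from collections import defaultdict


def hash_partition(key, num_partitions):
    """Map a key to a partition index in the range [0, num_partitions)."""
    # Python's % always yields a non-negative result for a positive modulus.
    return hash(key) % num_partitions


def partition_records(records, num_partitions):
    """Group (key, value) records by the partition their key hashes to."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[hash_partition(key, num_partitions)].append((key, value))
    return dict(partitions)


records = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"), (6, "f")]
partitions = partition_records(records, 3)

for index in sorted(partitions):
    print(index, partitions[index])
```

Records with the same key always land in the same partition, which is the property that key-based operations such as grouping rely on.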

For example, if the records are spread evenly across partitions:

    total records: 100,000,000,000
    number of partitions: 10,000
    number of elements per partition: 10,000,000
    maximum possible parallelism: 10,000
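The figures above follow from simple division, which the sketch below reproduces. The cluster size `total_cores = 400` is a made-up value for illustration, and the sketch assumes an even distribution of records and one task per core.

```python
import math

total_records = 100_000_000_000
num_partitions = 10_000

# Even distribution: every partition holds the same number of elements.
elements_per_partition = total_records // num_partitions

# At most one task runs per partition, so this is the upper bound on parallelism.
max_parallelism = num_partitions


def concurrent_tasks(num_partitions, total_cores):
    """Tasks that can actually run at once, assuming one task per core."""
    return min(num_partitions, total_cores)


def waves(num_partitions, total_cores):
    """Number of rounds needed to process every partition once."""
    return math.ceil(num_partitions / total_cores)


total_cores = 400  # hypothetical cluster size
print(elements_per_partition, max_parallelism)
print(concurrent_tasks(num_partitions, total_cores), waves(num_partitions, total_cores))
```

The maximum parallelism is reached only when the cluster has at least as many cores as there are partitions; with fewer cores, the partitions are processed in several rounds.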

Partitioning data can improve manageability and scalability, reduce contention, and optimize performance.

Your physical data is distributed as partitions across the cluster, while Spark exposes it through high-level logical data abstractions (such as RDDs and DataFrames) whose partitions are held in memory (and on disk if there is not sufficient memory).

A Spark cluster optimizes partition access by scheduling each task as close as possible to the data it reads, observing data locality.

In [13]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("data-partition").master("local[*]").getOrCreate()

In [14]:
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

In [15]:
def debug(iterator):
    # Print all elements of a single partition; called once per partition
    print("elements=", list(iterator))

In [16]:
sc = spark.sparkContext
# Distribute the local list as an RDD; the partition count defaults to sc.defaultParallelism
rdd = sc.parallelize(numbers)
num_partitions = rdd.getNumPartitions()
num_partitions

2

In [17]:
rdd.foreachPartition(debug)
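What the cell above prints depends on how the 12 numbers were split. The sketch below emulates that locally in plain Python, assuming two partitions and an even, contiguous split; `slice_evenly` is a hypothetical helper, not a Spark function, and Spark's actual slicing may differ in detail.

```python
def slice_evenly(seq, num_slices):
    """Split seq into num_slices contiguous slices of near-equal length."""
    n = len(seq)
    return [seq[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


def debug(iterator):
    # Same helper as in the notebook: print the elements of one partition
    print("elements=", list(iterator))


numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
partitions = slice_evenly(numbers, 2)

# Local stand-in for rdd.foreachPartition(debug)
for partition in partitions:
    debug(iter(partition))
```

In Spark itself, `foreachPartition` runs on the executors, so the printed lines may show up in executor or driver console output rather than in the notebook, and their order is not guaranteed. To bring the per-partition contents back to the driver as lists, `rdd.glom().collect()` can be used instead.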

# Partition as Text Format

In [None]:
# Read a CSV file (no header row is assumed), infer column types, and name the columns
df = spark.read.option("inferSchema", "true")\
  .csv(input_path)\
  .toDF('customer_id', 'year', 'transaction_id', 'transaction_value')