### 2. Understanding Partitions

In this notebook, we will:

**1.** Understand how Spark partitions the data.

**2.** Interact with the partitions of a dataframe directly.

**3.** See the step in the physical plan that corresponds to a shuffle.

Spark handles large datasets by splitting them into partitions of manageable size, which the executors in the cluster then process in parallel. I like to think of each partition as a local dataframe.

In [60]:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.config("spark.sql.shuffle.partitions", 16).getOrCreate()

The following cell creates the necessary data. You don't have to understand the code. You just need to know that:

- This is a typical dataset (``daily_data``) that we deal with every day.

- The resulting dataframe contains ``n_stores`` x ``n_products`` pairs.

- You can specify start and end dates.

- The resulting dataframe is partitioned by date and written in parquet format.

In [61]:
import pandas as pd
import numpy as np

def create_demo_data(n_products, n_stores, start_date="2021-01-01", end_date="2022-01-01"):
    """Creates demo data, writes it as parquet partitioned by date, reads it and returns the dataframe"""
    dates = pd.date_range(start_date, end_date)
    dates = [str(date)[:10] for date in dates]

    day_index = np.arange(len(dates))
    result = []
    for product in range(n_products):
        for store in range(n_stores):
            sales = np.random.poisson(10, size=len(dates))
            partial_df = (
                pd.DataFrame(dates, columns=["date"])
                .assign(product_id=product)
                .assign(store_id=store)
                .assign(day_index=day_index)
                .assign(sales_quantity=sales)
            )
            result.append(partial_df)
    pdf = pd.concat(result)
    result = spark.createDataFrame(pdf)
    result.repartition("date").write.partitionBy("date").parquet("demo-data", mode="overwrite")
    return spark.read.parquet("demo-data")

Let's create a dataset that contains a single product-store pair and 31 days of data.

In [62]:
df = create_demo_data(n_products=1, n_stores=1, start_date="2021-01-01", end_date="2021-01-31")

In [63]:
df.count()

31

We can obtain the number of partitions in the dataframe with ``rdd.getNumPartitions``.

In [64]:
df.rdd.getNumPartitions()

16

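Although we have 31 dates, the dataframe has 16 partitions. For a file-based read like this one, the partition count comes from how Spark packs the input files into read partitions (influenced by ``spark.sql.files.maxPartitionBytes``, ``spark.sql.files.openCostInBytes`` and the default parallelism), not from the number of dates. ``spark.sql.shuffle.partitions`` applies to shuffles rather than file scans, so the match with its value of 16 here is likely a coincidence.

We can interact with partitions directly using the ``rdd.glom`` method.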

In [33]:
help(df.rdd.glom)

Help on method glom in module pyspark.rdd:

glom() method of pyspark.rdd.RDD instance
    Return an RDD created by coalescing all elements within each partition
    into a list.
    
    Examples
    --------
    >>> rdd = sc.parallelize([1, 2, 3, 4], 2)
    >>> sorted(rdd.glom().collect())
    [[1, 2], [3, 4]]



Let's count the number of rows in each partition:

In [34]:
df.rdd.glom().map(lambda x: len(x)).collect()

[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1]

Spark tries to distribute the rows across partitions as evenly as possible. Here, every partition holds 2 rows except the last one, which holds a single row. We can also collect the data in each partition:

In [36]:
df.rdd.glom().collect()

[[Row(product_id=0, store_id=0, day_index=17, sales_quantity=17, date=datetime.date(2021, 1, 18)),
  Row(product_id=0, store_id=0, day_index=28, sales_quantity=7, date=datetime.date(2021, 1, 29))],
 [Row(product_id=0, store_id=0, day_index=9, sales_quantity=8, date=datetime.date(2021, 1, 10)),
  Row(product_id=0, store_id=0, day_index=6, sales_quantity=8, date=datetime.date(2021, 1, 7))],
 [Row(product_id=0, store_id=0, day_index=20, sales_quantity=6, date=datetime.date(2021, 1, 21)),
  Row(product_id=0, store_id=0, day_index=8, sales_quantity=16, date=datetime.date(2021, 1, 9))],
 [Row(product_id=0, store_id=0, day_index=5, sales_quantity=14, date=datetime.date(2021, 1, 6)),
  Row(product_id=0, store_id=0, day_index=27, sales_quantity=17, date=datetime.date(2021, 1, 28))],
 [Row(product_id=0, store_id=0, day_index=21, sales_quantity=7, date=datetime.date(2021, 1, 22)),
  Row(product_id=0, store_id=0, day_index=16, sales_quantity=13, date=datetime.date(2021, 1, 17))],
 [Row(product_id=

Now we will create a dataset that has two products, one store and five days.

In [65]:
df = create_demo_data(n_products=2, n_stores=1, start_date="2021-01-01", end_date="2021-01-05")

See that we have 5 partitions (one for each date) now.

In [66]:
df.rdd.getNumPartitions()

5

We wrote the dataframe partitioned by date, so each date lives in its own directory of parquet files. When the data is read back, all rows for a date end up in the same partition. For example, the first partition contains all rows for ``'2021-01-05'`` and the second contains all rows for ``'2021-01-04'``.

In [69]:
df.rdd.glom().collect()

[[Row(product_id=0, store_id=0, day_index=4, sales_quantity=14, date=datetime.date(2021, 1, 5)),
  Row(product_id=1, store_id=0, day_index=4, sales_quantity=8, date=datetime.date(2021, 1, 5))],
 [Row(product_id=0, store_id=0, day_index=3, sales_quantity=6, date=datetime.date(2021, 1, 4)),
  Row(product_id=1, store_id=0, day_index=3, sales_quantity=10, date=datetime.date(2021, 1, 4))],
 [Row(product_id=0, store_id=0, day_index=0, sales_quantity=11, date=datetime.date(2021, 1, 1)),
  Row(product_id=1, store_id=0, day_index=0, sales_quantity=7, date=datetime.date(2021, 1, 1))],
 [Row(product_id=0, store_id=0, day_index=1, sales_quantity=9, date=datetime.date(2021, 1, 2)),
  Row(product_id=1, store_id=0, day_index=1, sales_quantity=10, date=datetime.date(2021, 1, 2))],
 [Row(product_id=0, store_id=0, day_index=2, sales_quantity=13, date=datetime.date(2021, 1, 3)),
  Row(product_id=1, store_id=0, day_index=2, sales_quantity=14, date=datetime.date(2021, 1, 3))]]
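Both partition counts seen so far (16 for 31 daily files, 5 for 5 daily files) are consistent with a simple model of how a file-based read packs input files into partitions. The sketch below is a simplified pure-Python model of that packing, not Spark's actual code, and it ignores the splitting of large files. The 128 MB and 4 MB values are the documented defaults of ``spark.sql.files.maxPartitionBytes`` and ``spark.sql.files.openCostInBytes``; the default parallelism of 16, the 1 KB file size and the helper name ``pack_files`` are assumptions made for illustration.

```python
def pack_files(file_sizes, parallelism=16,
               max_partition_bytes=128 * 1024 * 1024,
               open_cost=4 * 1024 * 1024):
    """Return the number of files assigned to each read partition."""
    # Every file is padded with a fixed "open cost", so tiny files still count.
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total // parallelism
    max_split = min(max_partition_bytes, max(open_cost, bytes_per_core))

    partitions, current_files, current_size = [], 0, 0
    for size in sorted(file_sizes, reverse=True):
        # Close the current partition when the next file would not fit.
        if current_files and current_size + size > max_split:
            partitions.append(current_files)
            current_files, current_size = 0, 0
        current_size += size + open_cost
        current_files += 1
    if current_files:
        partitions.append(current_files)
    return partitions


thirty_one_days = pack_files([1000] * 31)  # one small file per date (assumed size)
five_days = pack_files([1000] * 5)

print(len(thirty_one_days), thirty_one_days)
print(len(five_days), five_days)
```

Under these assumptions the model gives 16 partitions (fifteen with 2 files and one with 1 file) for the 31-day dataset and 5 partitions of 1 file each for the 5-day dataset, which matches the outputs above. With a different number of cores the counts would change.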

Repartition by ``product_id`` and see what happens:

In [71]:
df.repartition("product_id").rdd.getNumPartitions()

1

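In this case there is a single partition. I find it easiest to think of it as 16 hash partitions, almost all of which are empty, that Spark then merges into one (adaptive query execution can coalesce small shuffle partitions).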

**Takeaways:**

- Just because there are two products does not mean there will be two partitions. It means that all rows for a product will be in the same partition. In this case, we have a single partition.


Docstring for HashPartitioning in Spark: [partitioning.scala](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala)


```scala
/**
 * Represents a partitioning where rows are split up across partitions based on the hash
 * of `expressions`.  All rows where `expressions` evaluate to the same values are guaranteed to be
 * in the same partition.
 */
 case class HashPartitioning(expressions: Seq[Expression], numPartitions: Int)
 ```
 
It also means that we can pass expressions to the ``repartition`` method. For example, ``df.repartition(F.ceil(F.rand() * 100))`` is valid.
 

In [72]:
df.repartition(F.col("sales_quantity") > 5).rdd.getNumPartitions()

1
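The guarantee in the ``HashPartitioning`` docstring can be illustrated without Spark. The sketch below assigns each row to partition `hash(key) mod 16`; ``zlib.crc32`` is only a stand-in for the hash Spark actually uses, and ``toy_partition`` is a hypothetical helper. Whatever hash is used, an expression with k distinct values can fill at most k partitions, so a boolean expression such as ``sales_quantity > 5`` can never occupy more than two.

```python
import zlib
from collections import defaultdict


def toy_partition(rows, key, num_partitions=16):
    """Group rows by hash(key(row)) mod num_partitions (stand-in hash)."""
    partitions = defaultdict(list)
    for row in rows:
        index = zlib.crc32(repr(key(row)).encode()) % num_partitions
        partitions[index].append(row)
    return dict(partitions)


# (product_id, sales_quantity) pairs taken from the five-day dataset above.
rows = [(0, 14), (1, 8), (0, 6), (1, 10), (0, 11), (1, 7)]

by_product = toy_partition(rows, key=lambda r: r[0])
by_flag = toy_partition(rows, key=lambda r: r[1] > 5)

# Number of non-empty partitions out of the 16 requested.
print(len(by_product), len(by_flag))
```

Every sales value in ``rows`` is greater than 5, so the boolean key takes a single value and all rows land in one partition, while the two product ids can occupy at most two.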

**Shuffle in physical plan**

The ``Exchange`` step in the physical plan corresponds to a shuffle of data between executors. It is an expensive operation because rows may have to be serialized and moved across the cluster. It is also a typical cause of disk spill, since the executors might need to write shuffle data to disk.

In [51]:
df.repartition("product_id").explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Exchange hashpartitioning(product_id#206L, 16), REPARTITION_BY_COL, [id=#480]
   +- FileScan parquet [product_id#206L,store_id#207L,day_index#208L,sales_quantity#209L,date#210] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/workspaces/rocks/Untitled Folder/demo-data], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<product_id:bigint,store_id:bigint,day_index:bigint,sales_quantity:bigint>




Here, we did not specify the number of partitions, but the ``hashpartitioning`` step shows that 16 partitions were requested (decided by ``spark.sql.shuffle.partitions``). This is why tuning ``spark.sql.shuffle.partitions`` is so important: whenever there is a shuffle and no partition count is passed explicitly, the resulting number of partitions depends on ``spark.sql.shuffle.partitions``. See:

+- Exchange hashpartitioning(product_id#206L, **16**), REPARTITION_BY_COL, [id=#480]

Shuffle is often referred to as a necessary evil, but it turns out we can sometimes reduce the number of shuffles needed (which is the topic of another notebook).

**Bonus:** Low-level ``rdd`` API.

- The ``rdd`` API can be used to perform custom low-level operations. The ``DataFrame`` API is recommended over the low-level ``rdd`` API because it provides several optimizations. Still, the ``rdd`` API is useful to know about, and some use cases for it remain.

With ``rdd.mapPartitions`` we can apply a custom function to each partition. Each partition should contain enough information to compute its result.

In [81]:
help(df.rdd.mapPartitions)

Help on method mapPartitions in module pyspark.rdd:

mapPartitions(f, preservesPartitioning=False) method of pyspark.rdd.RDD instance
    Return a new RDD by applying a function to each partition of this RDD.
    
    Examples
    --------
    >>> rdd = sc.parallelize([1, 2, 3, 4], 2)
    >>> def f(iterator): yield sum(iterator)
    >>> rdd.mapPartitions(f).collect()
    [3, 7]



**Exercise**: Find the maximum sales of each partition with ``rdd.mapPartitions``. Remember we have 5 partitions.

In [116]:
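def max_sales(partition):
    """Yield the largest sales_quantity seen in one partition."""
    current_max = 0
    for row in partition:
        sales = row.sales_quantity or 0  # treat missing values as 0
        if sales > current_max:
            current_max = sales
    yield (current_max,)  # toDF expects one tuple per row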

In [117]:
df.rdd.mapPartitions(max_sales).toDF(["max_sales"]).show()

+---------+
|max_sales|
+---------+
|       14|
|       10|
|       11|
|       10|
|       14|
+---------+
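A function passed to ``rdd.mapPartitions`` is an ordinary generator function over an iterator of rows, so it can be tried on plain Python data before it runs on a cluster. The sketch below uses a ``namedtuple`` as a stand-in for ``Row`` and a hypothetical helper, ``local_map_partitions``, that mimics the per-partition call. The variant ``max_sales_or_none`` is not from the exercise above; it is an illustrative alternative that yields ``None`` for an empty partition instead of 0.

```python
from collections import namedtuple
from itertools import chain

# Stand-in for pyspark.sql.Row, with only the fields needed here.
Row = namedtuple("Row", ["product_id", "sales_quantity"])


def max_sales_or_none(partition):
    """Yield the largest sales_quantity in a partition, or None if it is empty."""
    values = (row.sales_quantity or 0 for row in partition)
    yield (max(values, default=None),)


def local_map_partitions(func, partitions):
    """Apply func to each partition (a list of rows) and flatten the results."""
    return list(chain.from_iterable(func(iter(p)) for p in partitions))


partitions = [
    [Row(0, 14), Row(1, 8)],
    [Row(0, 6), Row(1, 10)],
    [],  # an empty partition, as can occur after a repartition
]
local_result = local_map_partitions(max_sales_or_none, partitions)
print(local_result)  # [(14,), (10,), (None,)]
```

Trying the function on an empty list first is a cheap way to find out how it behaves on empty partitions.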



``rdd.mapPartitions`` is very flexible: it simply applies a function to each partition, and that function can yield any kind of value.

For example, we can fit a ``LinearRegression`` to each partition and return the fitted model:

In [122]:
from sklearn.linear_model import LinearRegression

def fit_linear_regression(partition):
    pdf = pd.DataFrame(partition, columns=["product_id", "store_id", "day_index", "sales_quantity", "date"])
    lr = LinearRegression().fit(pdf.loc[:, ["product_id", "day_index"]], pdf.sales_quantity)
    yield (lr,) # needs to be tuple

In [123]:
result = df.rdd.mapPartitions(fit_linear_regression).collect()

In [124]:
result

[(LinearRegression(),),
 (LinearRegression(),),
 (LinearRegression(),),
 (LinearRegression(),),
 (LinearRegression(),)]

In [125]:
result[0][0].coef_

array([-6.,  0.])
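These coefficients can be checked by hand. Assuming ``result[0]`` corresponds to the first partition printed earlier (sales of 14 for product 0 and 8 for product 1 on 2021-01-05), ``day_index`` is constant within the partition and so cannot explain any variation, which fits its coefficient of 0. The ``product_id`` coefficient is then the slope of the line through the two points. A minimal pure-Python check of that arithmetic, with ``slope`` as a hypothetical helper:

```python
def slope(xs, ys):
    """Ordinary least-squares slope for a single feature."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    return covariance / variance


product_id = [0, 1]
sales_quantity = [14, 8]
product_slope = slope(product_id, sales_quantity)
print(product_slope)  # -6.0
```

With only two rows per partition the fit is exact, so these models say little about the data; the point of the example is only that each partition is fitted independently.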