# Partitioning

**Data Source**
* English Wikipedia pageviews by second
* Size on Disk: ~255 MB
* Type: Tab Separated Text File & Parquet files

**Technical Accomplishments:**
* To understand the relationship between Slots/cores and partitions
* To review `repartition(n)` and `coalesce(n)`
* To review one key side effect of shuffle partitions

### **The Data Source**

For this exercise we will use the **Pageviews By Seconds** data set.

The file is located on HDFS at **data/pageviews_by_second.tsv**.

In [6]:
from pyspark.sql.types import *
from pyspark.sql.functions import *

schema = StructType(
  [
    StructField("timestamp", StringType(), False),
    StructField("site", StringType(), False),
    StructField("requests", IntegerType(), False)
  ]
)

fileName = "data/pageviews_by_second.tsv"

# Create our initial DataFrame
initialDF = (spark.read
  .option("header", "true")
  .option("sep", "\t")
  .schema(schema)
  .csv(fileName)
)

Now let's see what our data consists of:

* `timestamp`: when the record was created
* `site`: either mobile or desktop
* `requests`: the number of requests

There is one record per site, per second, and it captures the number of requests made in that one second.

That is, for every second of the day, there are 2 records.

In [7]:
initialDF.show()

+-------------------+-------+--------+
|          timestamp|   site|requests|
+-------------------+-------+--------+
|2015-03-16T00:09:55| mobile|    1595|
|2015-03-16T00:10:39| mobile|    1544|
|2015-03-16T00:19:39|desktop|    2460|
|2015-03-16T00:38:11|desktop|    2237|
|2015-03-16T00:42:40| mobile|    1656|
|2015-03-16T00:52:24|desktop|    2452|
|2015-03-16T00:54:16| mobile|    1654|
|2015-03-16T01:18:11| mobile|    1720|
|2015-03-16T01:30:32|desktop|    2288|
|2015-03-16T01:32:24| mobile|    1609|
|2015-03-16T01:42:08|desktop|    2341|
|2015-03-16T01:45:53| mobile|    1704|
|2015-03-16T01:55:37|desktop|    2554|
|2015-03-16T01:57:29| mobile|    1825|
|2015-03-16T02:03:16|desktop|    2492|
|2015-03-16T02:10:32| mobile|    1667|
|2015-03-16T02:16:45|desktop|    2452|
|2015-03-16T02:19:32|desktop|    2412|
|2015-03-16T02:20:16|desktop|    2350|
|2015-03-16T02:22:08| mobile|    1802|
+-------------------+-------+--------+
only showing top 20 rows



## Initial Steps

Before processing any data, there are several steps to follow simply to prepare the data for analysis:
1. **Read the data**
2. Balance the number of partitions with the number of slots
3. Cache the data
4. Adjust `spark.sql.shuffle.partitions`
5. Perform some basic ETL
6. If the ETL was costly, possibly re-cache the data

## Slots vs Partitions

* We have our `initialDF` (**Step #1**), which does nothing more than read in the data.
* For **Step #2** we must ask the question: what is the relationship between slots and partitions?


**Note:** *The Spark API uses the term **core** meaning a thread available for parallel execution.*<br/>*Here we refer to it as **slot** to avoid confusion with the number of cores in the underlying CPU(s),*<br/>*to which there isn't necessarily an equal number.*

### Slots(Cores)

Generally, you should know how many cores you have from when you created your cluster.

To check programmatically, you can use `SparkContext.defaultParallelism`:

> For operations such as parallelize with no parent RDDs, it depends on the cluster manager:
> * Local mode: number of cores (local machine)
> * Mesos fine grained mode: 8
> * **Others: total number of cores on all executor nodes or 2, whichever is larger**

In [8]:
cores = sc.defaultParallelism

print("You have {} cores, or slots.".format(cores))

You have 2 cores, or slots.


### Partitions

* The second half of the question asked above is: how many partitions of data do I have?
* With that we have 2 sub-questions:
  1. Why are there so many?
  2. What is a partition?

>The answer to the last question is that a **partition** is a small piece of the total dataset.


>Back to the first question: we can answer it by running the following command, which
>>* takes the `initialDF`
>>* converts it to an `RDD`
>>* and then asks the `RDD` for the number of partitions

In [9]:
partitions = initialDF.rdd.getNumPartitions()
print("Partitions: {0:,}".format( partitions ))

Partitions: 2


* In Spark 3.0, a lot of optimizations have been added to the readers.
* Namely, the readers look at **the number of slots** and the **size of the data**, and make an educated guess at how many partitions **should be created**.

But 2 partitions and 2 slots is just too easy.
  * Let's read in another copy of this same data.
  * This copy was saved in 5 partitions.
  * This gives us an excuse to reason about the **relationship between slots and partitions**

In [10]:
# Read the alternate copy as an RDD of text lines. textFile() does
# no schema inference; column names are assigned later via toDF().
alternateRDD = sc.textFile("data/pageviews_by_second_csv")

print("Partitions", alternateRDD.getNumPartitions())

Partitions 5


Now that we have 5 partitions we must ask:

> What is going to happen when we perform an action like `count()` **with 2 slots and 5 partitions?**

In [11]:
alternateRDD.count()

7200000

In [12]:
alternateDF = alternateRDD.map(lambda x: x.split(",")).toDF(["timestamp","site","request"])

### Use Every Slot or Core

With very few exceptions, you always want the number of partitions to be **a multiple of the number of slots**.

That way **every slot is used**.

I.e., every slot is assigned a task in every wave of execution.

With 5 partitions & 2 slots, the tasks run in three waves (2 + 2 + 1), so **one of our 2 slots sits idle during the final wave**.
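The slot/partition relationship can be sketched as simple arithmetic. The helper below is a hypothetical illustration (`task_waves` is not a Spark API), and it assumes one task per partition with every task taking roughly the same time, which is a simplification of real scheduling.

```python
import math

def task_waves(partitions, slots):
    """Return (waves, idle_slots_in_last_wave) for a single stage.

    Assumes one task per partition and equally long tasks; this is a
    simplified model, not a description of Spark's scheduler internals.
    """
    waves = math.ceil(partitions / slots)
    leftover = partitions % slots
    idle_in_last_wave = 0 if leftover == 0 else slots - leftover
    return waves, idle_in_last_wave

print(task_waves(5, 2))   # (3, 1): three waves, one slot idle in the last
print(task_waves(8, 2))   # (4, 0): four waves, every slot busy throughout
```

When the partition count is a multiple of the slot count, the second value is always zero.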

### More or Less Partitions?

As a **general guideline**, each partition (when cached) should be roughly 200MB.
* Size on disk is not a good gauge. For example:
>* CSV files are generally large on disk but small in RAM - consider the string "56789", which is 10 bytes (at 2 bytes per character) compared to the integer 56789, which is only 4 bytes.
>* Parquet files are highly compressed on disk but uncompressed in RAM.

**Question:** If I have 8 slots, and I read in data that comes in with 10 partitions, should I:
* reduce the partitions down to 8 (1x number of slots)
* or increase the partitions up to 16 (2x number of slots)

**Answer:** It depends on the size of each partition
* First, read the data.
* Then cache it.
* Look at the size per partition.
* If your partition size is near or over 200MB, consider increasing the number of partitions.
* If your partition size is well under 200MB, consider decreasing the number of partitions.

The goal will **ALWAYS** be to use as few partitions as possible while maintaining at least 1 x number-of-slots.
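One way to turn this guideline into a number is sketched below. The function name `suggest_partitions` and its parameters are hypothetical, and the 200MB target is just the rule of thumb from the text, so treat the result as a starting point rather than a definitive answer.

```python
import math

TARGET_PARTITION_MB = 200  # guideline from the text, not a hard limit

def suggest_partitions(cached_size_mb, slots, target_mb=TARGET_PARTITION_MB):
    """Smallest multiple of `slots` that keeps each partition <= target_mb."""
    # Partitions needed to stay at or under the target size.
    needed = max(1, math.ceil(cached_size_mb / target_mb))
    # Round up to a whole multiple of the slot count (at least 1 x slots).
    multiples = max(1, math.ceil(needed / slots))
    return multiples * slots

print(suggest_partitions(1000, 8))   # 8  (125 MB per partition)
print(suggest_partitions(2400, 8))   # 16 (150 MB per partition)
```

The cached size has to come from observation (for example, after caching and inspecting the storage details), since size on disk is not a reliable proxy.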

## repartition(n) or coalesce(n)

We have two operations that can help address the problem above: `repartition(n)` and `coalesce(n)`.

If you look at the API docs, `coalesce(n)` is described like this:
> Returns a new Dataset that has exactly numPartitions partitions, when fewer partitions are requested.<br/>
> If a larger number of partitions is requested, it will stay at the current number of partitions.

If you look at the API docs, `repartition(n)` is described like this:
> Returns a new Dataset that has exactly numPartitions partitions.

The key differences between the two are
* `coalesce(n)` is a **narrow** transformation and can only be used to reduce the number of partitions.
* `repartition(n)` is a **wide** transformation and can be used to reduce or increase the number of partitions.
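The narrow/wide distinction can be illustrated with a toy model in plain Python, where each partition is just a list of rows. The functions `toy_coalesce` and `toy_repartition` are hypothetical and are not how Spark implements these operations; they only mimic the observable behavior described above.

```python
def toy_coalesce(partitions, n):
    """Merge whole partitions into n groups; individual rows are never split up."""
    if n >= len(partitions):
        return [list(p) for p in partitions]  # cannot increase the count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)            # each partition moves as a unit
    return merged

def toy_repartition(partitions, n):
    """Redistribute every row round-robin across n new partitions (a full shuffle)."""
    rows = [row for part in partitions for row in part]
    out = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out

data = [[1, 2, 3, 4], [5], [6, 7], [8], [9, 10]]    # 5 uneven partitions
print([len(p) for p in toy_coalesce(data, 2)])      # [8, 2]
print([len(p) for p in toy_coalesce(data, 8)])      # [4, 1, 2, 1, 2]
print([len(p) for p in toy_repartition(data, 8)])   # [2, 2, 1, 1, 1, 1, 1, 1]
```

In this model, merging avoids moving individual rows but can leave partitions uneven, while the round-robin shuffle touches every row and produces an even spread.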

In [13]:
repartitionedDF = alternateDF.repartition(8)

print("Partitions: {0:,}".format( repartitionedDF.rdd.getNumPartitions() ))

Partitions: 8


## Want to Cache, Again?

Back to our general list:
1. **Read the data**
2. **Balance the number of partitions with the number of slots**
3. Cache the data
4. Adjust `spark.sql.shuffle.partitions`
5. Perform some basic ETL
6. If the ETL was costly, possibly re-cache the data

We just balanced the number of partitions with the number of slots.

Depending on the number of partitions and the size of the data, the shuffle operation can be quite expensive.

Now let's cache the result of the `repartition(n)` call
* Or more specifically, let's mark it for caching.
* The actual caching will occur later, once an action is performed.
* Or you could just execute a count to force materialization of the cache.

In [14]:
repartitionedDF.cache()

DataFrame[timestamp: string, site: string, request: string]

## spark.sql.shuffle.partitions

0. <div style="text-decoration:line-through">Read the data</div>
0. <div style="text-decoration:line-through">Balance the number of partitions with the number of slots</div>
0. <div style="text-decoration:line-through">Cache the data</div>
0. Adjust `spark.sql.shuffle.partitions`
0. Perform some basic ETL
0. If the ETL was costly, Possibly re-cache the data

The next problem has to do with a side effect of certain **wide** transformations.

So far, we haven't run any **wide** transformations other than `repartition(n)`
* But eventually we will in this section
* Let's illustrate the problem that we will **eventually** hit
* We can do this by simply sorting our data.

In [15]:
(repartitionedDF
  .orderBy(col("timestamp"), col("site")) # sort the data
  .foreach(lambda x: None)                # literally does nothing except trigger a job
)

### Quick Detour

Something is not right here
* We only executed one action.
* But two jobs were triggered.
* If we look at the physical plan we can see the reason for the extra job.
* The answer lies in the step **Exchange rangepartitioning**
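To see why range partitioning implies an extra pass over the data, consider the toy sketch below. It is a plain-Python illustration under simplifying assumptions (hypothetical helper names, a fixed sample size of 500, integer keys) and not Spark's actual sampling algorithm: the partition boundaries cannot be chosen without first looking at some of the data.

```python
import bisect
import random

def range_boundaries(sample, num_partitions):
    """Pick num_partitions - 1 split points from a sorted sample."""
    ordered = sorted(sample)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]

def range_partition(rows, boundaries):
    """Send each row to the partition whose key range contains it."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect.bisect_right(boundaries, row)].append(row)
    return parts

random.seed(0)
rows = [random.randint(0, 9999) for _ in range(10_000)]

# Pass 1 (the "extra job"): look at a sample to choose the boundaries.
bounds = range_boundaries(random.sample(rows, 500), num_partitions=8)

# Pass 2: shuffle rows into ranges, then sort each partition locally.
parts = [sorted(p) for p in range_partition(rows, bounds)]

# Concatenating the partitions in order yields a globally sorted result.
flat = [r for p in parts for r in p]
print(flat == sorted(rows))   # True
```

Because every key in one partition is no larger than every key in the next, sorting each partition independently is enough to sort the whole dataset.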

In [16]:
(repartitionedDF
  .orderBy(col("timestamp"), col("site"))
  .explain()
)
print("-"*80)

(repartitionedDF
  .orderBy(col("timestamp"), col("site"))
  .limit(3000000)
  .explain()
)
print("-"*80)

== Physical Plan ==
*(1) Sort [timestamp#56 ASC NULLS FIRST, site#57 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(timestamp#56 ASC NULLS FIRST, site#57 ASC NULLS FIRST, 200), true, [id=#77]
   +- InMemoryTableScan [timestamp#56, site#57, request#58]
         +- InMemoryRelation [timestamp#56, site#57, request#58], StorageLevel(disk, memory, deserialized, 1 replicas)
               +- Exchange RoundRobinPartitioning(8), false, [id=#57]
                  +- *(1) Scan ExistingRDD[timestamp#56,site#57,request#58]


--------------------------------------------------------------------------------
== Physical Plan ==
TakeOrderedAndProject(limit=3000000, orderBy=[timestamp#56 ASC NULLS FIRST,site#57 ASC NULLS FIRST], output=[timestamp#56,site#57,request#58])
+- InMemoryTableScan [timestamp#56, site#57, request#58]
      +- InMemoryRelation [timestamp#56, site#57, request#58], StorageLevel(disk, memory, deserialized, 1 replicas)
            +- Exchange RoundRobinPartitioning(8), fals

And just to prove that the extra job is due to the number of records in our DataFrame, re-run it with only 3M records:

In [17]:
(repartitionedDF
  .orderBy(col("timestamp"), col("site")) # sort the data
  .limit(3000000)                         # only 3 million please
  .foreach(lambda x: None)                # literally does nothing except trigger a job
)

Only 1 job.

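Spark's Catalyst Optimizer is optimizing our jobs for us! As the second physical plan above shows, the sort-plus-limit is planned as a single `TakeOrderedAndProject`, so there is no `Exchange rangepartitioning` step and no extra job.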

### The Real Problem

Back to the original issue:
* Re-run the original job.
* Take a look at the second job.
* Look at the 3rd Stage.
* Notice that it has 200 partitions!
* And this is our problem.

In [18]:
funkyDF = (repartitionedDF
  .orderBy(col("timestamp"), col("site")) # sorts the data
)                                         #
funkyDF.foreach(lambda x: None)           # literally does nothing except trigger a job

The problem is the number of partitions we ended up with.

Besides looking at the number of tasks in the final stage, we can simply print out the number of partitions

In [19]:
print("Partitions: {0:,}".format( funkyDF.rdd.getNumPartitions() ))

Partitions: 200


The engineers building Apache Spark chose a default value, 200, for the number of post-shuffle partitions. After all our work to determine the right number of partitions, they go and undo it on us.

The value 200 is actually based on practical experience, attempting to account for the most common scenarios to date. Work is being done to intelligently determine this new value but that is still in progress.

For now, we can tweak it with the configuration value `spark.sql.shuffle.partitions`

We can see below that it is actually configured for 200 partitions

In [20]:
spark.conf.get("spark.sql.shuffle.partitions")

'200'

We can change the config setting with the following command

In [21]:
spark.conf.set("spark.sql.shuffle.partitions", "8")

Now, if we re-run our query, we will see that we end up with the 8 partitions we want post-shuffle.

In [22]:
betterDF = (repartitionedDF
  .orderBy(col("timestamp"), col("site")) # sort the data
)                                         #
betterDF.foreach(lambda x: None)          # literally does nothing except trigger a job

print("Partitions: {0:,}".format( betterDF.rdd.getNumPartitions() ))

Partitions: 8


## Initial ETL

0. <div style="text-decoration:line-through">Read the data</div>
0. <div style="text-decoration:line-through">Balance the number of partitions with the number of slots</div>
0. <div style="text-decoration:line-through">Cache the data</div>
0. <div style="text-decoration:line-through">Adjust `spark.sql.shuffle.partitions`</div>
0. Perform some basic ETL
0. If the ETL was costly, Possibly re-cache the data

We may have some standard ETL.

In this case we will want to do something like convert the `timestamp` column from a **string** to a data type more appropriate for **date & time**.

We are not going to explore that in depth here; instead, we will cover the date & time functions in a future notebook. For now, a single conversion is enough.
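As a small standalone illustration of the string-to-timestamp idea, the same parsing can be tried in plain Python. This sketch assumes that the format string `%Y-%m-%dT%H:%M:%S` is the `strptime` counterpart of the Spark pattern `yyyy-MM-dd'T'HH:mm:ss`; it does not involve Spark at all.

```python
from datetime import datetime

# Assumed Python strptime counterpart of the Spark pattern "yyyy-MM-dd'T'HH:mm:ss".
PYTHON_FORMAT = "%Y-%m-%dT%H:%M:%S"

raw = "2015-03-16T00:09:55"   # first timestamp in the sample output shown earlier
created_at = datetime.strptime(raw, PYTHON_FORMAT)

print(type(raw).__name__, "->", type(created_at).__name__)    # str -> datetime
print(created_at.hour, created_at.minute, created_at.second)  # 0 9 55
```

Once the value is a real date & time type, fields such as the hour can be read directly instead of being sliced out of a string.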

In [23]:
pageviewsDF = (repartitionedDF
  .select(
    unix_timestamp( col("timestamp"), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp").alias("createdAt"), 
    col("site"), 
    col("request") 
  )
)

print("****BEFORE****")
repartitionedDF.printSchema()

print("****AFTER****")
pageviewsDF.printSchema()

****BEFORE****
root
 |-- timestamp: string (nullable = true)
 |-- site: string (nullable = true)
 |-- request: string (nullable = true)

****AFTER****
root
 |-- createdAt: timestamp (nullable = true)
 |-- site: string (nullable = true)
 |-- request: string (nullable = true)



And assuming that initial ETL was expensive... we would want to finish up by caching our final `DataFrame`

In [24]:
# mark it as cached.
pageviewsDF.cache() 

# materialize the cache.
pageviewsDF.count() 

7200000

## All Done

0. <div style="text-decoration:line-through">Read the data</div>
0. <div style="text-decoration:line-through">Balance the number of partitions to the number of slots</div>
0. <div style="text-decoration:line-through">Cache the data</div>
0. <div style="text-decoration:line-through">Adjust `spark.sql.shuffle.partitions`</div>
0. <div style="text-decoration:line-through">Perform some basic ETL</div>
0. <div style="text-decoration:line-through">If the ETL was costly, Possibly re-cache the data</div>