
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

#Partitioning

**Data Source**
* English Wikipedia pageviews by second
* Size on Disk: ~255 MB
* Type: Tab Separated Text File & Parquet files
* More Info: <a href="https://old.datahub.io/dataset/english-wikipedia-pageviews-by-second" target="_blank">https://old.datahub.io/dataset/english-wikipedia-pageviews-by-second</a>

**Technical Accomplishments:**
* Understand the relationship between partitions and slots/cores
* Review `repartition(n)` and `coalesce(n)`
* Review one key side effect of shuffle partitions

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

Run the following cell to configure our "classroom."

In [0]:
%run "../Includes/Classroom-Setup"

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) **The Data Source**

This lesson uses the **Pageviews By Seconds** data set.

The file is located on the DBFS at **dbfs:/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv**.

In [0]:
from pyspark.sql.types import *
from pyspark.sql.functions import *

# Use a schema to avoid the overhead of inferring the schema
# In the case of CSV/TSV it requires a full scan of the file.
schema = StructType(
  [
    StructField("timestamp", StringType(), False),
    StructField("site", StringType(), False),
    StructField("requests", IntegerType(), False)
  ]
)

fileName = "/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv"

# Create our initial DataFrame
initialDF = (spark.read
  .option("header", "true")
  .option("sep", "\t")
  .schema(schema)
  .csv(fileName)
)

We can see below that our data consists of...
* when the record was created
* the site (mobile or desktop) 
* and the number of requests

This amounts to one record per site, per second, and captures the number of requests made in that one second. 

That means for every second of the day, there are two records.

In [0]:
display(initialDF)

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) My First Steps

Before processing any data, there are normally several steps needed simply to prepare the data for analysis, such as:
0. <div style="text-decoration:line-through">Read the data in</div>
0. Balance the number of partitions to the number of slots
0. Cache the data
0. Adjust the `spark.sql.shuffle.partitions`
0. Perform some basic ETL (i.e., convert strings to timestamp)
0. Possibly re-cache the data if the ETL was costly

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Partitions vs Slots

* We have our `initialDF` (**Step #1**) which amounts to nothing more than reading in the data.
* For **Step #2** we have to ask the question: what is the relationship between partitions and slots?


***Note:*** *The Spark API uses the term **core** to mean a thread available for parallel execution.*<br/>*Here we refer to it as a **slot** to avoid confusion with the number of cores in the underlying CPU(s),*<br/>*which is not necessarily the same number.*

### Slots/Cores

In most cases, if you created your cluster, you should know how many cores you have.

However, to check programmatically, you can use `SparkContext.defaultParallelism`

For more information, see the doc <a href="https://spark.apache.org/docs/latest/configuration.html#execution-behavior" target="_blank">Spark Configuration, Execution Behavior</a>
> For operations like parallelize with no parent RDDs, it depends on the cluster manager:
> * Local mode: number of cores on the local machine
> * Mesos fine grained mode: 8
> * **Others: total number of cores on all executor nodes or 2, whichever is larger**

In [0]:
cores = sc.defaultParallelism

print("You have {} cores, or slots.".format(cores))

### Partitions

* The second half of this question is: how many partitions of data do I have?
* With that we have two subsequent questions:
  0. Why do I have that many?
  0. What is a partition?

For the last question, a **partition** is a small piece of the total data set.

Google defines it like this:
> the action or state of dividing or being divided into parts.

If our goal is to process all our data (say 1M records) in parallel, we need to divide that data up.

If I have 8 **slots** for parallel execution, it would stand to reason that I want 1M / 8 or 125,000 records per partition.

Back to the first question, we can answer it by running the following command which
* takes the `initialDF`
* converts it to an `RDD`
* and then asks the `RDD` for the number of partitions

In [0]:
partitions = initialDF.rdd.getNumPartitions()
print("Partitions: {0:,}".format( partitions ))

* It is **NOT** coincidental that we have **8 slots** and **8 partitions**.
* Spark 2.0 added a lot of optimizations to the readers.
* Namely, the readers look at **the number of slots** and the **size of the data**, and make a best guess at how many partitions **should be created**.
* You can actually double the size of the data several times over and Spark will still read in **only 8 partitions**.
* Eventually the data will get so big that Spark will forgo this optimization and read it in as more partitions (10, for example).
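
To make the reader's "best guess" a little more concrete, here is a rough sketch of the sizing heuristic in plain Python. It is an approximation, not Spark's actual code: the function name `estimate_read_partitions` is hypothetical, the 128MB and 4MB constants are assumed to be the defaults of `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`, and it assumes a single, splittable, uncompressed file.

```python
import math

MB = 1024 * 1024

def estimate_read_partitions(total_bytes, slots, num_files=1,
                             max_partition_bytes=128 * MB,
                             open_cost_bytes=4 * MB):
    """Approximate how many partitions a file-based reader might create."""
    # Each file is padded by an "open cost" before the total is divided across slots.
    bytes_per_slot = (total_bytes + num_files * open_cost_bytes) // slots
    # The split size is capped at max_partition_bytes and floored at the open cost.
    split_bytes = min(max_partition_bytes, max(open_cost_bytes, bytes_per_slot))
    # Simplifying assumption: one partition per split of a single splittable file.
    return math.ceil(total_bytes / split_bytes)

for size_mb in (255, 510, 1020, 1280):
    count = estimate_read_partitions(size_mb * MB, slots=8)
    print("{:>5} MB on 8 slots -> {} partitions".format(size_mb, count))
```

Under these assumptions the estimate stays at 8 partitions while the file doubles from 255MB to 510MB to 1020MB, and only grows once each split hits the 128MB cap. The real reader may differ, so always confirm with `getNumPartitions()`.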

But 8 partitions and 8 slots is just too easy.
  * Let's read in another copy of this same data.
  * A parquet file that was saved in 5 partitions.
  * This gives us an excuse to reason about the **relationship between slots and partitions**

In [0]:
# Create our initial DataFrame. We can let it infer the 
# schema because the cost for parquet files is really low.
alternateDF = (spark.read
  .parquet("/mnt/training/wikipedia/pageviews/pageviews_by_second.parquet")
)

print("Partitions: {0:,}".format( alternateDF.rdd.getNumPartitions() ))

Now that we have only 5 partitions we have to ask...

What is going to happen when I perform an action like `count()` **with 8 slots and only 5 partitions?**

In [0]:
alternateDF.count()

**Question #1:** Is it OK to let my code continue to run this way?

**Question #2:** What if it was a **REALLY** big file that read in as **200 partitions** and we had **256 slots**?

**Question #3:** What if it was a **REALLY** big file that read in as **200 partitions** and we had only **8 slots**? How long would it take compared to a dataset that has only 8 partitions?

**Question #4:** Given the previous example (**200 partitions** vs **8 slots**) what are our options (given that we cannot increase the number of partitions)?

### Use Every Slot/Core

With very few exceptions, you always want the number of partitions to be **a multiple of the number of slots**.

That way **every slot is used**.

That is, every slot is assigned a task.

With 5 partitions & 8 slots we are **under-utilizing three of the eight slots**.

With 9 partitions & 8 slots we just guaranteed our **job will take 2x** as long as it may need to.
* 10 seconds, for example, to process the first 8.
* Then as soon as one of the first 8 is done, another 10 seconds to process the last partition.
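
The arithmetic behind these examples can be sketched in a few lines of plain Python. The helper names (`waves`, `estimated_seconds`, `idle_slots_in_last_wave`) are made up for illustration, and the model assumes every task takes the same amount of time, which real workloads rarely do.

```python
import math

def waves(partitions, slots):
    """Rounds of tasks needed when each slot runs one task at a time."""
    return math.ceil(partitions / slots)

def estimated_seconds(partitions, slots, seconds_per_task=10):
    """Total time, assuming every task takes exactly seconds_per_task."""
    return waves(partitions, slots) * seconds_per_task

def idle_slots_in_last_wave(partitions, slots):
    """Slots with nothing to do during the final round of tasks."""
    remainder = partitions % slots
    return 0 if remainder == 0 else slots - remainder

for p in (5, 8, 9, 16, 200):
    print("{:>3} partitions on 8 slots: {} wave(s), ~{}s, {} idle slot(s) in the last wave".format(
        p, waves(p, 8), estimated_seconds(p, 8), idle_slots_in_last_wave(p, 8)))
```

With 9 partitions the ninth task runs alone in a second wave while the other 7 slots sit idle, which is where the 2x comes from.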

### More or Less Partitions?

As a **general guideline** it is advised that each partition (when cached) is roughly 200MB.
* Size on disk is not a good gauge. For example...
* CSV files are large on disk but small in RAM - consider the string "12345", which takes 5 bytes as UTF-8 text (10 bytes as UTF-16), compared to the integer 12345, which is only 4 bytes.
* Parquet files are highly compressed but uncompressed in RAM.
* In a relational database... well... who knows?
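
The string-versus-integer comparison can be checked with a few lines of plain Python. This is only an illustration: the byte count of the text form depends on the encoding, and Spark's own in-memory format carries additional overhead that is not modeled here.

```python
import struct

text = "12345"

utf8_len = len(text.encode("utf-8"))       # one byte per ASCII digit
utf16_len = len(text.encode("utf-16-le"))  # two bytes per digit, no byte-order mark
int_len = len(struct.pack("<i", 12345))    # a 32-bit signed integer

print("As UTF-8 text:  {} bytes".format(utf8_len))
print("As UTF-16 text: {} bytes".format(utf16_len))
print("As a 32-bit int: {} bytes".format(int_len))
```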

The **200 comes from** the real-world experience of Databricks' engineers and is **based largely on efficiency** and not so much resource limitations.

On an executor with a reduced amount of RAM you might need to lower that.

For example, at 8 partitions (corresponding to our max number of slots) & 200MB per partition
* That will use roughly **1.5GB**
* We **might** get away with that on Community Edition (CE).
* If you have transformations that balloon the data size (such as Natural Language Processing) you are sure to run into problems.

**Question:** If I read in my data and it comes in as 10 partitions should I...
* reduce my partitions down to 8 (1x number of slots)
* or increase my partitions up to 16 (2x number of slots)

**Answer:** It depends on the size of each partition
* Read the data in. 
* Cache it. 
* Look at the size per partition.
* If you are near or over 200MB consider increasing the number of partitions.
* If you are under 200MB consider decreasing the number of partitions.

The goal will **ALWAYS** be to use as few partitions as possible while maintaining at least 1 x number-of-slots.
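
One way to turn this guideline into a number is sketched below. The function `suggest_partitions` is hypothetical, the 200MB target is the guideline from above rather than a hard rule, and `cached_mb` is assumed to be the cached size you read off the Spark UI's Storage tab.

```python
import math

def suggest_partitions(cached_mb, slots, target_mb=200):
    """Fewest partitions that is a multiple of `slots` while keeping
    each partition at or under roughly target_mb."""
    # Partitions needed to stay at or under the target size.
    needed = max(1, math.ceil(cached_mb / target_mb))
    # Round up to the next whole multiple of the slot count (at least 1x).
    multiples = max(1, math.ceil(needed / slots))
    return multiples * slots

for size in (1000, 1600, 2400):
    print("{} MB cached on 8 slots -> {} partitions".format(size, suggest_partitions(size, 8)))
```

For a data set that caches at 2,400MB on 8 slots this suggests 16 partitions of about 150MB each, rather than 8 partitions of 300MB.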

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) repartition(n) or coalesce(n)

We have two operations that can help address this problem: `repartition(n)` and `coalesce(n)`.

If you look at the API docs, `coalesce(n)` is described like this:
> Returns a new Dataset that has exactly numPartitions partitions, when fewer partitions are requested.<br/>
> If a larger number of partitions is requested, it will stay at the current number of partitions.

If you look at the API docs, `repartition(n)` is described like this:
> Returns a new Dataset that has exactly numPartitions partitions.

The key differences between the two are
* `coalesce(n)` is a **narrow** transformation and can only be used to reduce the number of partitions.
* `repartition(n)` is a **wide** transformation and can be used to reduce or increase the number of partitions.

So, if I'm increasing the number of partitions I have only one choice: `repartition(n)`

If I'm reducing the number of partitions I can use either one, so how do I decide?
* First off, `coalesce(n)` is a **narrow** transformation and performs better because it avoids a shuffle.
* However, `coalesce(n)` cannot guarantee even **distribution of records** across all partitions.
* For example, with `coalesce(n)` you might end up with **a few partitions containing 80%** of all the data.
* On the other hand, `repartition(n)` will give us a relatively **uniform distribution**.
* And `repartition(n)` is a **wide** transformation meaning we have the added cost of a **shuffle operation**.
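
The trade-off can be seen in a toy model written in plain Python. This is not how Spark implements either operation (its coalesce also considers data locality, and its shuffle places records differently); the two `simulate_*` functions are hypothetical and only track record counts per partition.

```python
def simulate_coalesce(partition_sizes, n):
    """Merge neighbouring partitions into n groups without splitting any of them."""
    if n >= len(partition_sizes):
        # coalesce never increases the number of partitions
        return list(partition_sizes)
    merged = [0] * n
    for i, size in enumerate(partition_sizes):
        # Whole input partitions are assigned to an output group; none are split up.
        merged[i * n // len(partition_sizes)] += size
    return merged

def simulate_repartition(partition_sizes, n):
    """Spread all records evenly across n partitions, as a full shuffle would."""
    base, extra = divmod(sum(partition_sizes), n)
    return [base + (1 if i < extra else 0) for i in range(n)]

# Hypothetical, badly skewed input: one partition holds 70% of 1M records.
skewed = [700000, 50000, 50000, 50000, 50000, 50000, 25000, 25000]

print("coalesce(4):   ", simulate_coalesce(skewed, 4))
print("repartition(4):", simulate_repartition(skewed, 4))
```

In this model `coalesce(4)` leaves one partition with 750,000 of the 1,000,000 records, while `repartition(4)` yields 250,000 in each. On a real DataFrame you could inspect the actual distribution by grouping on `spark_partition_id()`.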

In our case, we "need" to go from 5 partitions up to 8 partitions - our only option here is `repartition(n)`.

In [0]:
repartitionedDF = alternateDF.repartition(8)

print("Partitions: {0:,}".format( repartitionedDF.rdd.getNumPartitions() ))

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Cache, Again?

Back to the list...
0. <div style="text-decoration:line-through">Read the data in</div>
0. <div style="text-decoration:line-through">Balance the number of partitions to the number of slots</div>
0. Cache the data
0. Adjust the `spark.sql.shuffle.partitions`
0. Perform some basic ETL (i.e., convert strings to timestamp)
0. Possibly re-cache the data if the ETL was costly

We just balanced the number of partitions to the number of slots.

Depending on the size of the data and the number of partitions, the shuffle operation can be fairly expensive (though necessary).

Let's cache the result of the `repartition(n)` call.
* Or more specifically, let's mark it for caching.
* The actual cache will occur later, once an action is performed.
* Or you could just execute a count to force materialization of the cache.

In [0]:
repartitionedDF.cache()

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) spark.sql.shuffle.partitions

0. <div style="text-decoration:line-through">Read the data in</div>
0. <div style="text-decoration:line-through">Balance the number of partitions to the number of slots</div>
0. <div style="text-decoration:line-through">Cache the data</div>
0. Adjust the `spark.sql.shuffle.partitions`
0. Perform some basic ETL (i.e., convert strings to timestamp)
0. Possibly re-cache the data if the ETL was costly

The next problem has to do with a side effect of certain **wide** transformations.

So far, we haven't hit any **wide** transformations other than `repartition(n)`
* But eventually we will... 
* Let's illustrate the problem that we will **eventually** hit
* We can do this by simply sorting our data.

In [0]:
(repartitionedDF
  .orderBy(col("timestamp"), col("site")) # sort the data
  .foreach(lambda x: None)                # literally does nothing except trigger a job
)

### Quick Detour
Something isn't right here...
* We only executed one action.
* But two jobs were triggered.
* If we look at the physical plan we can see the reason for the extra job.
* The answer lies in the step **Exchange rangepartitioning**

In [0]:
(repartitionedDF
  .orderBy(col("timestamp"), col("site"))
  .explain()
)
print("-"*80)

(repartitionedDF
  .orderBy(col("timestamp"), col("site"))
  .limit(3000000)
  .explain()
)
print("-"*80)

And just to prove that the extra job is due to the number of records in our DataFrame, re-run it with only 3M records:

In [0]:
(repartitionedDF
  .orderBy(col("timestamp"), col("site")) # sort the data
  .limit(3000000)                         # only 3 million please    
  .foreach(lambda x: None)                # literally does nothing except trigger a job
)

Only 1 job.

Spark's Catalyst Optimizer is optimizing our jobs for us!

### The Real Problem

Back to the original issue...
* Rerun the original job (below).
* Take a look at the second job.
* Look at the 3rd Stage.
* Notice that it has 200 partitions!
* And this is our problem.

In [0]:
funkyDF = (repartitionedDF
  .orderBy(col("timestamp"), col("site")) # sorts the data
)                                         #
funkyDF.foreach(lambda x: None)           # literally does nothing except trigger a job

The problem is the number of partitions we ended up with.

Besides looking at the number of tasks in the final stage, we can simply print out the number of partitions

In [0]:
print("Partitions: {0:,}".format( funkyDF.rdd.getNumPartitions() ))

The engineers building Apache Spark chose a default value, 200, for the number of partitions produced by a shuffle.

After all our work to determine the right number of partitions they go and undo it on us.

The value 200 is actually based on practical experience, attempting to account for the most common scenarios to date.

Work is being done to intelligently determine this new value but that is still in progress.

For now, we can tweak it with the configuration value `spark.sql.shuffle.partitions`

We can see below that it is actually configured for 200 partitions

In [0]:
spark.conf.get("spark.sql.shuffle.partitions")

We can change the config setting with the following command

In [0]:
spark.conf.set("spark.sql.shuffle.partitions", "8")

Now, if we re-run our query, we will see that we end up with the 8 partitions we want post-shuffle.

In [0]:
betterDF = (repartitionedDF
  .orderBy(col("timestamp"), col("site")) # sort the data
)                                         #
betterDF.foreach(lambda x: None)          # literally does nothing except trigger a job

print("Partitions: {0:,}".format( betterDF.rdd.getNumPartitions() ))

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Initial ETL

0. <div style="text-decoration:line-through">Read the data in</div>
0. <div style="text-decoration:line-through">Balance the number of partitions to the number of slots</div>
0. <div style="text-decoration:line-through">Cache the data</div>
0. <div style="text-decoration:line-through">Adjust the `spark.sql.shuffle.partitions`</div>
0. Perform some basic ETL (i.e., convert strings to timestamp)
0. Possibly re-cache the data if the ETL was costly

We may have some standard ETL.

In this case we will want to do something like convert the `timestamp` column from a **string** to a data type more appropriate for **date & time**.

We are not going to do that here; instead we will cover that specific case in a future notebook when we look at all the date & time functions.

But so as to not leave you in suspense...

In [0]:
pageviewsDF = (repartitionedDF
  .select(
    unix_timestamp( col("timestamp"), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp").alias("createdAt"), 
    col("site"), 
    col("requests") 
  )
)

print("****BEFORE****")
repartitionedDF.printSchema()

print("****AFTER****")
pageviewsDF.printSchema()

And assuming that initial ETL was expensive... we would want to finish up by caching our final `DataFrame`

In [0]:
# mark it as cached.
pageviewsDF.cache() 

# materialize the cache.
pageviewsDF.count() 

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) All Done

0. <div style="text-decoration:line-through">Read the data in</div>
0. <div style="text-decoration:line-through">Balance the number of partitions to the number of slots</div>
0. <div style="text-decoration:line-through">Cache the data</div>
0. <div style="text-decoration:line-through">Adjust the `spark.sql.shuffle.partitions`</div>
0. <div style="text-decoration:line-through">Perform some basic ETL (i.e., convert strings to timestamp)</div>
0. <div style="text-decoration:line-through">Possibly re-cache the data if the ETL was costly</div>

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [0]:
%run "../Includes/Classroom-Cleanup"

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/labs.png) Partitioning Lab
It's time to put what we learned into practice.

Go ahead and open the notebook [Partitioning Lab]($./Partitioning Lab) and complete the exercises.

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>