
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

# Caching

**Technical Accomplishments:**
* Understand how caching works
* Explore the different caching mechanisms
* Discuss tips for the best use of the cache

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup<br>

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [0]:
%run "../Includes/Classroom-Setup"

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) A Fresh Start
For this section, we need to clear the existing cache.

There are several ways to accomplish this:
  * Remove each cache one by one, which is tedious and error-prone
  * Restart the cluster, which takes a while to come back online
  * Clear the entire cache at once, which affects every user on the cluster!

In [0]:
#!!! DO NOT RUN THIS ON A SHARED CLUSTER !!!
#YOU WILL CLEAR YOUR CACHE AND YOUR COWORKERS'

# spark.catalog.clearCache()

This will ensure that any caches produced by other labs/notebooks will be removed.

Next, open the **Spark UI** and go to the **Storage** tab - it should be empty.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) The Data Source

This lesson uses the **Pageviews By Seconds** data set.

The file is located on DBFS at **dbfs:/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv**.

In [0]:
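# Import only the types the schema needs; a wildcard import from
# pyspark.sql.functions would shadow Python built-ins such as sum and max.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType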

schema = StructType(
  [
    StructField("timestamp", StringType(), False),
    StructField("site", StringType(), False),
    StructField("requests", IntegerType(), False)
  ]
)

fileName = "dbfs:/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv"

pageviewsDF = (spark.read
  .option("header", "true")
  .option("sep", "\t")
  .schema(schema)
  .csv(fileName)
)

The 255 MB pageviews file is currently in our object store, which means each time you scan through it, your Spark cluster has to read the 255 MB of data remotely over the network.

Once again, use the **`count()`** action to scan the entire 255 MB file in the object store and count how many total records (rows) there are:

In [0]:
total = pageviewsDF.count()

print("Record Count: {0:,}".format( total ))

The pageviews DataFrame contains 7.2 million rows.

Make a note of how long the previous operation takes.

Re-run it several times trying to establish an average.
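One way to establish that average without a stopwatch is a small timing helper. The following is a minimal sketch in plain Python; `average_seconds` and its `runs` parameter are hypothetical names introduced here for illustration and are not part of Spark or the course includes.

```python
import time

def average_seconds(action, runs=3):
    """Call `action` `runs` times and return the mean wall-clock time in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    total_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        action()  # for example: lambda: pageviewsDF.count()
        total_seconds += time.perf_counter() - start
    return total_seconds / runs

# Usage in this notebook (assumes pageviewsDF from the cell above):
# print("Average: {0:.2f} s".format(average_seconds(lambda: pageviewsDF.count())))
```

The helper measures wall-clock time on the driver, so it includes scheduling overhead as well as the scan itself; it is meant for rough before-and-after comparisons, not precise benchmarking.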

Let's try a slightly more complicated operation, such as sorting, which induces an "expensive" shuffle.

In [0]:
(pageviewsDF
  .orderBy("requests")
  .count()
)

Again, make note of how long the operation takes.

Rerun it several times to get an average.

Every time we re-run these operations, Spark goes all the way back to the original data store.

This requires pulling all the data across the network for every execution.

In many cases, this network I/O is the most expensive part of a job.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) cache()

We can avoid all of this overhead by caching the data on the executors.

Go ahead and run the following command.

Make note of how long it takes to execute.

In [0]:
pageviewsDF.cache()

The **`cache(..)`** operation doesn't do anything other than mark a **`DataFrame`** as cacheable.

And while it does return an instance of **`DataFrame`**, it is not technically a transformation or an action.

In order to actually cache the data, Spark has to process every single record.

As Spark processes every record, the cache will be materialized.

A very common method for materializing the cache is to execute a **`count()`**.

**BUT BEFORE YOU DO**, check the **Storage** tab in the **Spark UI** to make sure it's still empty even after calling **`cache()`**.

In [0]:
pageviewsDF.count()

The last **`count()`** will take a little longer than normal.

It has to perform the count and also do the work of materializing the cache.

Now the **`pageviewsDF`** is cached **AND** the cache has been materialized.

Before we rerun our queries, check the **Spark UI** and the **Storage** tab.

Now, run the two queries and compare their execution time to the ones above.

In [0]:
pageviewsDF.count()

In [0]:
(pageviewsDF
  .orderBy("requests")
  .count()
)

Faster, right?

All of our data is being stored in RAM on the executors.

We are no longer making network calls.

Our plain **`count()`** should be sub-second.

Our **`orderBy(..)`** & **`count()`** should be around 3 seconds.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Spark UI - Storage

Now that the pageviews **`DataFrame`** is cached in memory, let's review the **Spark UI** in more detail.

In the **RDDs** table, you should see only one record - multiple if you reran the **`cache()`** operation.

Let's review the **Spark UI**'s **Storage** details:
* RDD Name
* Storage Level
* Cached Partitions
* Fraction Cached
* Size in Memory
* Size on Disk

Next, let's dig deeper into the storage details...

Click on the link in the **RDD Name** column to open the **RDD Storage Info**.

Let's review the **RDD Storage Info**:
* Size in Memory
* Size on Disk
* Executors

If you recall...
* We should have 8 partitions.
* With 255MB of data divided into 8 partitions...
* The first seven partitions should be 32MB each.
* The last partition will be significantly smaller than the others.

**Question:** Why is the **Size in Memory** nowhere near 32MB?

**Question:** What is the difference between **Size in Memory** and **Size on Disk**?

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) persist()

`cache()` is just an alias for **`persist()`** called with the default storage level.

Let's take a look at the API docs for:
* **`Dataset.persist(..)`** if using Scala
* **`DataFrame.persist(..)`** if using Python

`persist()` allows one to specify an additional parameter (storage level) indicating how the data is cached:
* DISK_ONLY
* DISK_ONLY_2
* MEMORY_AND_DISK
* MEMORY_AND_DISK_2
* MEMORY_AND_DISK_SER
* MEMORY_AND_DISK_SER_2
* MEMORY_ONLY
* MEMORY_ONLY_2
* MEMORY_ONLY_SER
* MEMORY_ONLY_SER_2
* OFF_HEAP

***Note:*** *The default storage level for...*
* *RDDs is **MEMORY_ONLY**.*
* *DataFrames is **MEMORY_AND_DISK**.*
* *Streaming is **MEMORY_AND_DISK_2**.*

Before we can use the various storage levels, it's necessary to import the enumerations...

In [0]:
from pyspark import StorageLevel
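With the import in place, a storage level can be passed straight to `persist(..)`. The sketch below wraps the two steps this lesson has covered: marking the `DataFrame`, then materializing the cache with `count()`. The name `persist_and_materialize` is a hypothetical helper, and the usage lines assume a `DataFrame` called `someDF` that is not already cached.

```python
def persist_and_materialize(df, level):
    """Mark `df` for caching at `level`, then force materialization.

    `df` is expected to be a Spark DataFrame and `level` a
    pyspark.StorageLevel constant such as StorageLevel.DISK_ONLY.
    """
    df.persist(level)   # only marks the DataFrame; nothing is cached yet
    rows = df.count()   # the action scans every record and fills the cache
    return rows

# Usage (someDF is a placeholder for any DataFrame that is not yet cached):
# from pyspark import StorageLevel
# persist_and_materialize(someDF, StorageLevel.DISK_ONLY)
# print(someDF.storageLevel)
```

After running it, the **Storage Level** column on the **Storage** tab should reflect the level that was requested.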

**Question:** How do we purge data from the cache?

**`unpersist(..)`** or **`uncache()`**?

Try it...

In [0]:
# pageviewsDF.uncache()
# pageviewsDF.unpersist()

Real quick, go check the **Storage** tab in the **Spark UI** and confirm that the cache has been expunged.

**Question:** What will happen if you take 75% of the cache and then I come along and try to use 50% (of the total)...
* with **MEMORY_ONLY**?
* with **MEMORY_AND_DISK**?
* with **DISK_ONLY**?

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) RDD Name

If you haven't noticed yet, the **RDD Name** on the **Storage** tab in the **Spark UI** is a big ugly name.

It's a bit hacky, but there is a workaround for assigning a name.
0. Create your **`DataFrame`**.
0. From that **`DataFrame`**, create a temporary view with your desired name.
0. Specifically, cache the table via the **`SparkSession`** and its **`Catalog`**.
0. Materialize the cache.

In [0]:
pageviewsDF.unpersist()

pageviewsDF.createOrReplaceTempView("Pageviews_DF_Python")
spark.catalog.cacheTable("Pageviews_DF_Python")

pageviewsDF.count()
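The same four steps can be bundled into a helper that also confirms the result programmatically. This is a sketch rather than the course's own method: `cache_with_name` is a hypothetical name, while `createOrReplaceTempView`, `Catalog.cacheTable`, and `Catalog.isCached` are standard PySpark calls.

```python
def cache_with_name(spark, df, name):
    """Cache `df` under a readable name and report whether Spark lists it as cached.

    `spark` is expected to be the SparkSession and `df` a Spark DataFrame.
    """
    df.createOrReplaceTempView(name)    # step 2: register a temporary view
    spark.catalog.cacheTable(name)      # step 3: cache the view through the Catalog
    df.count()                          # step 4: materialize the cache
    return spark.catalog.isCached(name)

# Usage in this notebook:
# cache_with_name(spark, pageviewsDF, "Pageviews_DF_Python")
```

The returned flag only says that the view is registered as cached; the **Storage** tab remains the place to confirm the name and size.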

And now to clean up after ourselves...

In [0]:
pageviewsDF.unpersist()

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [0]:
%run "../Includes/Classroom-Cleanup"


&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>