# Cache and Persist

**cache()** and **persist()** are methods that store RDDs, DataFrames, and Datasets so that they can be reused across multiple Spark operations. When a dataset is `cached or persisted`, each `worker node` stores the partitions it holds, in memory by default. `Spark's persisted data is fault-tolerant`: if any partition of a Dataset is lost, it is automatically recomputed using the original transformations that created it.

## Advantage

Cost efficient - Spark computations are expensive, so reusing their results saves cost.

Time efficient - Reusing the results of repeated computations saves a lot of time.

## Disadvantage

Large storage cost - The memory of a worker node is shared between computation and storage. If we persist a large dataset on a worker node, less memory is left for computation. If we store the dataset on disk, performance will be impacted. So `don't cache/persist unless a dataset will be reused`.

## Cache vs Persist

**cache()** is a simplified version of the **persist()** method: you can't specify the storage level (e.g. MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK), so it always uses the default storage level. For RDD cache() the default storage level is `MEMORY_ONLY`; for DataFrame and Dataset cache() the default is `MEMORY_AND_DISK`.

```python

df.cache()
```

**persist()** allows you to specify the storage level. Below is an example.

```python

from pyspark import StorageLevel
df.persist(storageLevel=StorageLevel.MEMORY_AND_DISK)
```

## Storage Level

All the storage levels Spark supports are defined in the `org.apache.spark.storage.StorageLevel` class. The storage level specifies how and where to persist or cache an RDD, DataFrame, or Dataset.

- **MEMORY_ONLY**:  This is the default behavior of the RDD cache() method. It stores the RDD or DataFrame as deserialized objects in JVM memory. When there is not enough memory available, some partitions are not saved and are recomputed as and when required. This level takes more memory. For a DataFrame, unlike an RDD, it can be slower than the MEMORY_AND_DISK level, because recomputing the in-memory columnar representation of the unsaved partitions is expensive.


- **MEMORY_ONLY_SER**:  Same as MEMORY_ONLY, except that it stores the RDD as serialized objects in JVM memory. It takes less memory (space-efficient) than MEMORY_ONLY, at the cost of a few more CPU cycles to deserialize the objects.

- **MEMORY_ONLY_2**:  Same as the MEMORY_ONLY storage level, but replicates each partition to two cluster nodes.

- **MEMORY_ONLY_SER_2**:  Same as the MEMORY_ONLY_SER storage level, but replicates each partition to two cluster nodes.

- **MEMORY_AND_DISK**:  This is the default behavior of DataFrame and Dataset cache(). The DataFrame is stored in JVM memory as deserialized objects. When the required storage is greater than the available memory, the excess partitions are written to disk and read back from disk when required. It is slower when partitions spill, as I/O is involved.

- **MEMORY_AND_DISK_SER**:  Same as the MEMORY_AND_DISK storage level, except that it serializes the DataFrame objects in memory, and on disk when memory is not available.

- **MEMORY_AND_DISK_2**:  Same as the MEMORY_AND_DISK storage level, but replicates each partition to two cluster nodes.

- **MEMORY_AND_DISK_SER_2**:  Same as the MEMORY_AND_DISK_SER storage level, but replicates each partition to two cluster nodes.

- **DISK_ONLY**:  The DataFrame is stored only on disk. The CPU time is high, as the data must be deserialized and I/O is involved.

- **DISK_ONLY_2**:  Same as the DISK_ONLY storage level, but replicates each partition to two cluster nodes.

```text
Storage Level        Space used  CPU time  In memory  On disk  Serialized  Recompute some partitions
-----------------------------------------------------------------------------------------------------
MEMORY_ONLY          High        Low       Y          N        N           Y
MEMORY_ONLY_SER      Low         High      Y          N        Y           Y
MEMORY_AND_DISK      High        Medium    Some       Some     Some        N
MEMORY_AND_DISK_SER  Low         High      Some       Some     Y           N
DISK_ONLY            Low         High      N          Y        Y           N
```
## Other important points

- Spark automatically monitors every persist() and cache() call you make. `It checks usage on each node and drops persisted data that is no longer used, following a least-recently-used (LRU) policy`. You can also remove data manually with the **unpersist()** method.
- In the Spark UI, the Storage tab shows where partitions exist in memory or on disk across the cluster.
- Dataset cache() is an alias for persist(StorageLevel.MEMORY_AND_DISK).
- Caching of Spark DataFrame or Dataset is a **lazy operation**, meaning a DataFrame will not be cached until you trigger an action. 
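
To make the LRU idea from the first point concrete, here is a toy model in plain Python. It is not Spark's memory manager: the `ToyLruBlockStore` class, its `capacity` measured in number of blocks, and the block ids are all hypothetical and only illustrate the eviction order.

```python
from collections import OrderedDict


class ToyLruBlockStore:
    """Toy model of LRU eviction; NOT Spark's actual implementation."""

    def __init__(self, capacity):
        self.capacity = capacity  # maximum number of blocks kept
        self.blocks = OrderedDict()  # oldest entry first

    def get(self, block_id):
        if block_id not in self.blocks:
            return None  # Spark would recompute the partition instead
        self.blocks.move_to_end(block_id)  # mark as most recently used
        return self.blocks[block_id]

    def put(self, block_id, data):
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)
        evicted = []
        # Drop the least recently used blocks until we fit again.
        while len(self.blocks) > self.capacity:
            old_id, _ = self.blocks.popitem(last=False)
            evicted.append(old_id)
        return evicted


store = ToyLruBlockStore(capacity=2)
store.put("rdd_1_0", "a")
store.put("rdd_1_1", "b")
store.get("rdd_1_0")  # touching rdd_1_0 makes rdd_1_1 the oldest block
evicted = store.put("rdd_2_0", "c")
print(evicted)  # ['rdd_1_1']
```

The block that was used least recently is the one dropped, which is why data that keeps being read tends to stay cached.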

In [1]:
from pyspark.sql import SparkSession
import os

local=True
if local:
    spark = SparkSession.builder\
        .master("local[4]")\
        .appName("CacheAndPersist")\
        .config("spark.driver.memory", "6g")\
        .getOrCreate()
else:
    spark = SparkSession.builder\
        .master("k8s://https://kubernetes.default.svc:443")\
        .appName("CacheAndPersist")\
        .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:master")\
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT'])\
        .config("spark.executor.instances", "4")\
        .config("spark.executor.memory","2g")\
        .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE'])\
        .getOrCreate()

# enable eager evaluation so that DataFrames are rendered as tables in the notebook
spark.conf.set("spark.sql.repl.eagerEval.enabled",True)

23/09/07 14:38:54 WARN Utils: Your hostname, pengfei-Virtual-Machine resolves to a loopback address: 127.0.1.1; using 10.50.2.80 instead (on interface eth0)
23/09/07 14:38:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/09/07 14:38:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Below is a list of useful functions.

In [2]:
def showPersistedRdd():
    """
    Print the number of persisted RDDs and a description of each one.
    Relies on the internal `_jsc` handle of the SparkContext.
    """
    rdds = spark.sparkContext._jsc.getPersistentRDDs()
    print(f"Persisted rdd numbers: {len(rdds)}")
    # rdd_id avoids shadowing the built-in id()
    for rdd_id, rdd in rdds.items():
        print(f"id: {rdd_id}\ndescription: {rdd}")

In [16]:
def showViews(globalView: bool = False):
    """
    Print the names of the available views. If globalView is True, list the
    global temp views (database `global_temp`); otherwise list the tables and
    temp views visible in the current database.
    """
    base = "global_temp" if globalView else None
    for view in spark.catalog.listTables(base):
        print(view.name)

In [23]:
def dropView(viewName: str, globalView: bool = False):
    """
    Drop the view with the given name. A missing view does not raise an
    exception; Spark reports it through the result it returns.
    :param viewName: name of the view to drop
    :type viewName: str
    :param globalView: if True, drop a global temp view; otherwise drop a temp view
    :type globalView: bool
    :return: None. The result returned by Spark is printed.
    :rtype: None
    """
    if globalView:
        result = spark.catalog.dropGlobalTempView(viewName)
        print(f"Global view '{viewName}' has been deleted with return result: {result}.")
    else:
        # Drop the temporary view
        result = spark.catalog.dropTempView(viewName)
        print(f"Temporary view '{viewName}' has been deleted with return result: {result}.")

In [5]:
filePath = "/home/pengfei/data_set/kaggle/data_format/netflix.parquet"

df = spark.read.parquet(filePath)

                                                                                

In [6]:
df.show()

+-------+------+----------+
|user_id|rating|      date|
+-------+------+----------+
|1488844|     3|2005-09-06|
| 822109|     5|2005-05-13|
| 885013|     4|2005-10-19|
|  30878|     4|2005-12-26|
| 823519|     3|2004-05-03|
| 893988|     3|2005-11-17|
| 124105|     4|2004-08-05|
|1248029|     3|2004-04-22|
|1842128|     4|2004-05-09|
|2238063|     3|2005-05-11|
|1503895|     4|2005-05-19|
|2207774|     5|2005-06-06|
|2590061|     3|2004-08-12|
|   2442|     3|2004-04-14|
| 543865|     4|2004-05-28|
|1209119|     4|2004-03-23|
| 804919|     4|2004-06-10|
|1086807|     3|2004-12-28|
|1711859|     4|2005-05-08|
| 372233|     5|2005-11-23|
+-------+------+----------+


In [7]:
# now we try to cache the dataframe
# note that cache() is lazy: the data is only cached when an action is executed.
cachedDf = df.cache()
cachedDf.show()



+-------+------+----------+
|user_id|rating|      date|
+-------+------+----------+
|1488844|     3|2005-09-06|
| 822109|     5|2005-05-13|
| 885013|     4|2005-10-19|
|  30878|     4|2005-12-26|
| 823519|     3|2004-05-03|
| 893988|     3|2005-11-17|
| 124105|     4|2004-08-05|
|1248029|     3|2004-04-22|
|1842128|     4|2004-05-09|
|2238063|     3|2005-05-11|
|1503895|     4|2005-05-19|
|2207774|     5|2005-06-06|
|2590061|     3|2004-08-12|
|   2442|     3|2004-04-14|
| 543865|     4|2004-05-28|
|1209119|     4|2004-03-23|
| 804919|     4|2004-06-10|
|1086807|     3|2004-12-28|
|1711859|     4|2005-05-08|
| 372233|     5|2005-11-23|
+-------+------+----------+


                                                                                

You can check the cached data in the Spark UI (http://localhost:4040/storage). As mentioned before, the default storage level for a DataFrame is `memory and disk`.

Depending on the config of your cluster, the above operation may fail. For example, if the worker has only 4 GB of memory and half of it is reserved for computation, caching a dataset that is too big can cause an out-of-memory error. When we encounter this kind of situation, we have two solutions:
- increase the worker memory
- change the storage level

The code below uses the `persist` method to cache the data. It is equivalent to `df.cache()`.

```python
from pyspark.storagelevel import StorageLevel
persistDf = df.persist(StorageLevel.MEMORY_AND_DISK)
persistDf.show()
```

In [8]:
# We can also get the persisted rdd information by using the below function 
showPersistedRdd()

Persisted rdd numbers: 1
id: 18
description: *(1) ColumnarToRow
+- FileScan parquet [user_id#0,rating#1,date#2] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/home/pengfei/data_set/kaggle/data_format/netflix.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<user_id:string,rating:string,date:string>
 MapPartitionsRDD[18] at showString at NativeMethodAccessorImpl.java:0


## Performance test Dataframe vs Cached dataframe

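Now let's check whether the cached DataFrame performs better. Keep in mind that `cache()` returns the same DataFrame it was called on, so `df` and `cachedDf` refer to the same cached data here. Because `show()` only computes the partitions it needs, the first aggregation below may also be the one that fills the cache.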

In [9]:
%%time

from pyspark.sql.functions import count
dfResu = df.groupBy("rating").agg(count("*").alias("count"))
dfResu.show()



+------+-------+
|rating|  count|
+------+-------+
|     3|6904181|
|  null|   4498|
|     5|5506583|
|     1|1118186|
|     4|8085741|
|     2|2439073|
+------+-------+

CPU times: user 6.79 ms, sys: 771 µs, total: 7.57 ms
Wall time: 5.34 s


                                                                                

In [10]:
%%time

cachedDfResu = cachedDf.groupBy("rating").agg(count("*").alias("count"))
cachedDfResu.show()



+------+-------+
|rating|  count|
+------+-------+
|     3|6904181|
|  null|   4498|
|     5|5506583|
|     1|1118186|
|     4|8085741|
|     2|2439073|
+------+-------+

CPU times: user 16.4 ms, sys: 0 ns, total: 16.4 ms
Wall time: 3.51 s


                                                                                

## Unpersist to clear the memory

We can also unpersist a persisted DataFrame or Dataset to remove it from memory and disk.
The Scala signature is `unpersist(blocking : scala.Boolean) : Dataset.this.type`; in PySpark it is `DataFrame.unpersist(blocking=False)`.

The default value of the `blocking` parameter is `False`: the call returns immediately and the blocks are deleted asynchronously. If you set it to `True`, the call blocks until all the persisted blocks are deleted.

In [11]:
normalDf = cachedDf.unpersist()

In [12]:
%%time

normalDfResu = normalDf.groupBy("rating").agg(count("*").alias("count"))
normalDfResu.show()



+------+-------+
|rating|  count|
+------+-------+
|     3|6904181|
|  null|   4498|
|     5|5506583|
|     1|1118186|
|     4|8085741|
|     2|2439073|
+------+-------+

CPU times: user 10.4 ms, sys: 1.71 ms, total: 12.1 ms
Wall time: 2.3 s


                                                                                

In [13]:
# after unpersist, we can check the number of persisted rdd
showPersistedRdd()

Persisted rdd numbers: 0


We can see that after unpersist, there are no longer any persisted RDDs.

## What happens when we create a view?

We also want to know whether data is persisted when we create a temp view or a global view in Spark.

### A temp view example

In the code below, we create a temp view.

In [14]:
tempTableName = "netflix"
df.createOrReplaceTempView(tempTableName)

In [17]:
showViews()

netflix


In [22]:
%%time

result = spark.sql(f"Select count(*) as count, rating from {tempTableName} group by rating")
result.show()



+-------+------+
|  count|rating|
+-------+------+
|6904181|     3|
|   4498|  null|
|5506583|     5|
|1118186|     1|
|8085741|     4|
|2439073|     2|
+-------+------+

CPU times: user 1.23 ms, sys: 3.48 ms, total: 4.71 ms
Wall time: 1.97 s


                                                                                

In [19]:
showPersistedRdd()

Persisted rdd numbers: 0


After the above operations, we can conclude that creating and querying a temp view does not persist any RDD.

Now let's clean up the view.

In [24]:
# clean the table
dropView(tempTableName)
# show available view
showViews()

Temporary view 'netflix' has been deleted with return result: True.


### A global view example

Now let's check what happens when we create a global view. 

> Global temp views are shared across all sessions of the same Spark application and are kept alive until the application terminates. They live in the system database `global_temp`.

In [25]:
globalViewName = "gnetflix"
df.createOrReplaceGlobalTempView(globalViewName)

In [31]:
showViews()

In [32]:
showViews(globalView=True)

gnetflix


In [30]:
%%time
result = spark.sql(f"Select count(*) as count, rating from global_temp.{globalViewName} group by rating")
result.show()



+-------+------+
|  count|rating|
+-------+------+
|6904181|     3|
|   4498|  null|
|5506583|     5|
|1118186|     1|
|8085741|     4|
|2439073|     2|
+-------+------+

CPU times: user 3.64 ms, sys: 542 µs, total: 4.18 ms
Wall time: 2.01 s


                                                                                

In [33]:
# after the creation of the global view, let's check if there are any persisted rdd
showPersistedRdd()

Persisted rdd numbers: 0


**We can conclude that creating a global view does not persist any RDD, and neither does running an operation on the view.**

In [59]:
dropView(globalViewName,globalView=True)
showViews(globalView=True)

Global view 'gnetflix' has been deleted with return result: True.
