## Cache & Persist
<ul>
    <li>Spark's cache() and persist() methods are an optimization mechanism: they store the
intermediate result of a DataFrame or RDD computation so that it can be reused in
subsequent actions instead of being recomputed. Both are lazy, so nothing is stored until the first action runs.</li>
    <li>When you persist a dataset, each node stores the partitions it computes (in memory, on disk, or both, depending on the storage level) and reuses
them in other actions on that dataset.</li>
    <li>Spark’s persisted data is fault-tolerant: if any partition of a dataset
is lost, it is automatically recomputed using the original transformations that
created it.</li>
</ul>

## Cache
### Syntax

<pre>rdd.cache()</pre>
<pre>df.cache()</pre>

### To check whether the dataframe is cached or not

<pre>df.is_cached</pre>
<pre>df.storageLevel.useMemory</pre>

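<p>Both expressions return a boolean value, <b>True</b> or <b>False</b>. Note that df.storageLevel.useMemory only reports whether the storage level uses memory, so it is <b>False</b> for a DataFrame persisted with DISK_ONLY.</p>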

In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

In [6]:
spark = SparkSession.builder.appName("Yesttodecide").master("local[*]").getOrCreate()

In [7]:
myDF = spark.read.format("csv")\
        .options(header="true", inferSchema="true")\
        .load('C:\\Personal\\Projects\\Spark\\dataset\\txns')

In [8]:
myDF.show(5,truncate=False)

+-----+----------+-------+------+------------------+---------------------------------+-----------+----------+-------+
|txnno|txndate   |custno |amount|category          |product                          |city       |state     |spendby|
+-----+----------+-------+------+------------------+---------------------------------+-----------+----------+-------+
|0    |06-26-2011|4007024|40.33 |Exercise & Fitness|Cardio Machine Accessories       |Clarksville|Tennessee |credit |
|1    |05-26-2011|4006742|198.44|Exercise & Fitness|Weightlifting Gloves             |Long Beach |California|credit |
|2    |06-01-2011|4009775|5.58  |Exercise & Fitness|Weightlifting Machine Accessories|Anaheim    |California|credit |
|3    |06-05-2011|4002199|198.19|Gymnastics        |Gymnastics Rings                 |Milwaukee  |Wisconsin |credit |
|4    |12-17-2011|4002613|98.81 |Team Sports       |Field Hockey                     |Nashville  |Tennessee |credit |
+-----+----------+-------+------+------------------+---------------------------------+-----------+----------+-------+

### Cache example

In [13]:
myDF.is_cached

False

In [10]:
cacheDf = myDF.where(col("state") == "California").cache()

In [11]:
print(cacheDf.count())

13035


In [12]:
cacheDf.is_cached

True

In [14]:
readingCachedDf = cacheDf.where(col("spendby") == "credit")
print(readingCachedDf.count())

11257


## Persist
<ul>
    <li>For DataFrames, persist() with no argument uses the MEMORY_AND_DISK storage level, the same
level that cache() uses. For RDDs, the default is MEMORY_ONLY.</li>
    <li>With MEMORY_AND_DISK, Spark first stores partitions in JVM memory; when they do not
fit, it spills the excess
partitions to disk and reads them back from disk when they are required.</li>
    <li>cache() is simply persist() with the default storage level, so the two perform identically at that level. Reads are slower only when partitions have to be fetched from disk, for example with DISK_ONLY or after a spill.</li>
    <li>persist() accepts many other storage levels.</li>
</ul>

### Syntax
<pre>
import pyspark

# persist DataFrame with the default storage level
df.persist()

# persist DataFrame with MEMORY_AND_DISK_2 (each partition replicated on two nodes)
df.persist(pyspark.StorageLevel.MEMORY_AND_DISK_2)
</pre>

In [16]:
from pyspark.storagelevel import StorageLevel

persistDf = myDF.persist(StorageLevel.DISK_ONLY)

In [17]:
persistDf.count()

95904

In [18]:
# Other storage levels that can be passed to persist() in PySpark:
# df.persist(pyspark.StorageLevel.MEMORY_ONLY)
# df.persist(pyspark.StorageLevel.MEMORY_AND_DISK)
# df.persist(pyspark.StorageLevel.DISK_ONLY)
# df.persist(pyspark.StorageLevel.OFF_HEAP)
# MEMORY_ONLY_SER and MEMORY_AND_DISK_SER belong to the Scala/Java API;
# recent PySpark versions do not define them on StorageLevel.

### Spark RDD Unpersist
<pre>
df.unpersist()
</pre>

### Difference between Cache and Persist

https://sparkbyexamples.com/spark/spark-difference-between-cache-and-persist/#:~:text=Both%20caching%20and%20persisting%20are,the%20user%2Ddefined%20storage%20level.