#### Persist with storage level: MEMORY_AND_DISK

In [0]:
from pyspark.sql import functions as F
from pyspark import StorageLevel

customer_df= (spark.read.format("parquet")
                  .option("path", "/FileStore/practise-data/*.parquet")
                  .load())

filtered_customer= customer_df.filter(F.col("city") != 'boston')

customer_group_df = filtered_customer.groupBy(F.col("gender")).agg(F.count(F.col("cust_id")).alias("gender_count"))

customer_group_df.persist(StorageLevel.MEMORY_AND_DISK)  # persist to memory and disk; lazy, materialized by the first action below

customer_group_df.selectExpr("*").show()

+------+------------+
|gender|gender_count|
+------+------------+
|Female|        2256|
|  Male|        2227|
+------+------------+



![ ](files/practise-data/persist_storage.png)

#### Storage Level: Disk Memory Serialized 1x Replicated <br> The data is stored as a stream of bytes (serialized form), in memory first and on disk for whatever does not fit. (In our example the data is small, so it was stored completely in memory in serialized form.)
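The label shown in the Storage tab can be read as a list of flags. The sketch below is a plain-Python model, not Spark's own code: `Level` and `describe` are hypothetical names introduced only to illustrate how a label like the one above is put together from the disk, memory, serialization, and replication settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    """Hypothetical stand-in for a Spark storage level (illustration only)."""
    use_disk: bool
    use_memory: bool
    deserialized: bool
    replication: int = 1


def describe(level: Level) -> str:
    """Build a Storage-tab style label from the flags of a level."""
    parts = []
    if level.use_disk:
        parts.append("Disk")
    if level.use_memory:
        parts.append("Memory")
    parts.append("Deserialized" if level.deserialized else "Serialized")
    parts.append(f"{level.replication}x Replicated")
    return " ".join(parts)


# Flags matching the label reported for StorageLevel.MEMORY_AND_DISK in this notebook.
print(describe(Level(use_disk=True, use_memory=True, deserialized=False)))
```

Flipping only the `deserialized` flag changes the label from "Serialized" to "Deserialized" while the disk, memory, and replication parts stay the same.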

In [0]:
customer_group_df.unpersist()

Out[5]: DataFrame[gender: string, gender_count: bigint]

#### Persist without passing any parameter

In [0]:
from pyspark.sql import functions as F
from pyspark import StorageLevel

customer_df_new= (spark.read.format("parquet")
                  .option("path", "/FileStore/practise-data/*.parquet")
                  .load())

filtered_customer_new= customer_df_new.filter(F.col("city") != 'seattle')

customer_group_df_new = filtered_customer_new.groupBy(F.col("gender")).agg(F.count(F.col("cust_id")).alias("total_count"))

customer_group_df_new.persist()  # persist with the default storage level (no parameter passed)

customer_group_df_new.selectExpr("*").show()

+------+-----------+
|gender|total_count|
+------+-----------+
|Female|       2256|
|  Male|       2242|
+------+-----------+



![](files/practise-data/persist.png)

#### Storage Level: Disk Memory Deserialized 1x Replicated <br> Here the data is stored as objects (deserialized form), in memory first and on disk for whatever does not fit. (In our example the data is small, so it was stored completely in memory in deserialized form.)
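The difference between "a stream of bytes" and "objects" can be illustrated with plain Python. This is only an analogy using the standard `pickle` module, which is an assumption made for illustration and not the format Spark uses internally; the `rows` list simply mimics the aggregated output above.

```python
import pickle

# Two rows shaped like the aggregated output above.
rows = [("Female", 2256), ("Male", 2242)]

# Deserialized form: ordinary objects that can be used directly.
total = sum(count for _, count in rows)

# Serialized form: a single byte string that must be decoded before use.
blob = pickle.dumps(rows)
restored = pickle.loads(blob)

print(type(blob).__name__, total, restored == rows)
```

The serialized form has to be decoded on every read, while the deserialized form is ready to use as soon as it is fetched.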

In [0]:
customer_group_df_new.unpersist()

Out[6]: DataFrame[gender: string, total_count: bigint]

#### Using cache to store dataframe

In [0]:
from pyspark.sql import functions as F


customer_cache_df= (spark.read.format("parquet")
                  .option("path", "/FileStore/practise-data/*.parquet")
                  .load())

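# Note: with `|` this condition is true for every non-null city, so the filter drops no such rows;
# use `&` instead to exclude both seattle and boston.
filtered_customer_cache= customer_cache_df.filter((F.col("city") != 'seattle') | (F.col("city") != 'boston'))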

customer_cache_df = filtered_customer_cache.groupBy(F.col("gender")).agg(F.count(F.col("cust_id")).alias("cnt"))

customer_cache_df.cache()  # cache the DataFrame; lazy, materialized by the first action below

customer_cache_df.selectExpr("*").show()

+------+----+
|gender| cnt|
+------+----+
|Female|2502|
|  Male|2498|
+------+----+



![](files/practise-data/cache.png)

#### Storage Level: Disk Memory Deserialized 1x Replicated <br> This matches persist() called without a parameter: by default the data is stored in deserialized form.
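The same comparison can be made without opening the Spark UI by reading each DataFrame's `storageLevel` property. A minimal sketch: `storage_report` and `same_storage_level` are hypothetical helper names, and the code only assumes that each object exposes `storageLevel` the way a PySpark DataFrame does.

```python
def storage_report(frames):
    """Return {name: storage-level label} for a dict of DataFrames."""
    return {name: str(df.storageLevel) for name, df in frames.items()}


def same_storage_level(frames):
    """True when every DataFrame in the dict reports the same storage level."""
    return len(set(storage_report(frames).values())) <= 1


# Usage in the notebook, before calling unpersist() (name taken from the cell above):
# storage_report({"cache": customer_cache_df})
```

Passing a DataFrame persisted with `persist()` alongside one cached with `cache()` to `same_storage_level` should confirm that both report the same level.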

In [0]:
customer_cache_df.unpersist()

Out[9]: DataFrame[gender: string, cnt: bigint]

#### Conclusion: <br> persist(StorageLevel.MEMORY_AND_DISK): memory and disk in serialized form <br> persist() and cache(): memory and disk in deserialized form.