### Cache & Persist
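Spark can keep the result of a computation in memory by using `cache()` or `persist()`.  
This is efficient for iterative workloads such as regression, which read the same data on every pass and would otherwise recompute it each time.  
If there is not enough memory to hold the data, Spark can use disk instead, depending on the storage level.  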

- Cache
    - RDD: defaults to MEMORY_ONLY
    - DataFrame: defaults to MEMORY_AND_DISK
- Persist
    - The storage level can be set explicitly through `StorageLevel`
- unpersist()
    - Removes the cached data
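
The cache, persist, and unpersist lifecycle listed above can be sketched as a small context manager. This is a minimal sketch and not part of the original notebook: the helper name `persisted` is hypothetical, and it assumes only that the object passed in exposes `cache()`, `persist(level)`, and `unpersist()` the way a PySpark DataFrame or RDD does.

```python
from contextlib import contextmanager


@contextmanager
def persisted(df, storage_level=None):
    """Keep `df` cached inside a `with` block and release it afterwards.

    `persisted` is a hypothetical helper. `df` is assumed to be a PySpark
    DataFrame or RDD; `storage_level` is an optional pyspark.StorageLevel.
    """
    if storage_level is None:
        df.cache()                 # use the default storage level
    else:
        df.persist(storage_level)  # use an explicit storage level
    try:
        yield df
    finally:
        df.unpersist()             # always drop the cached data on exit


# Usage sketch (needs a running SparkSession, so it is left as a comment):
#
#   from pyspark import StorageLevel
#   with persisted(spark.range(1000), StorageLevel.DISK_ONLY) as df:
#       df.count()   # the first action fills the cache
#       df.count()   # later actions can read from it
```

Releasing the cache in a `finally` clause means the memory is freed even when the work inside the block raises an exception.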

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local").appName("spark-cache").getOrCreate()



In [2]:
# cache() is lazy; the count() action is what actually materializes the cached data.
dfCached = spark.read.format("parquet").load("./airbnb_listings_parquet").cache()
dfCached.count()

4865

In [6]:
dfCached.rdd.getNumPartitions()

63

In [7]:
# An InMemoryTableScan step has been added to the plan.
dfCached.explain("FORMATTED")

== Physical Plan ==
InMemoryTableScan (1)
   +- InMemoryRelation (2)
         +- * ColumnarToRow (4)
            +- Scan parquet  (3)


(1) InMemoryTableScan
Output [5]: [listing_id#0, listing_url#1, listing_name#2, listing_summary#3, listing_desc#4]
Arguments: [listing_id#0, listing_url#1, listing_name#2, listing_summary#3, listing_desc#4]

(2) InMemoryRelation
Arguments: [listing_id#0, listing_url#1, listing_name#2, listing_summary#3, listing_desc#4], CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@701abd9b,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) ColumnarToRow
+- FileScan parquet [listing_id#0,listing_url#1,listing_name#2,listing_summary#3,listing_desc#4] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/mzc01-hyucksangcho/Downloads/airbnb_listings_parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<listing_id:int,listing_url:string,listing_name:string,listing_sum
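
Reading the plan by eye works in a notebook, but in a script the same check can be automated. The sketch below is not from the original notebook: `plan_uses_cache` is a hypothetical helper, and it assumes that `DataFrame.explain()` writes the plan through Python's standard output so that `redirect_stdout` can capture it.

```python
import io
from contextlib import redirect_stdout


def plan_uses_cache(df):
    """Return True if the printed physical plan of `df` reads from a cache.

    Hypothetical helper: it captures what `df.explain()` prints and looks
    for the InMemoryTableScan operator seen in the plan output.
    """
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        df.explain()
    return "InMemoryTableScan" in buffer.getvalue()


# Usage sketch (left as a comment because it needs the notebook's session):
#
#   plan_uses_cache(dfCached)
```

A string match on the operator name is deliberately simple; it reports whether any part of the plan reads from a cache, not which cached relation is used.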

### Spark DataFrame and Cache


In [8]:
dfTransformed = (
    dfCached
    .selectExpr("listing_id", "listing_name")
    .where(col("listing_id") >= 10000000)
    .cache()
)

dfTransformed.count()

4278

In [9]:
# The plan shows that the cache is being used.
dfTransformed.select("listing_id").explain("FORMATTED")

== Physical Plan ==
InMemoryTableScan (1)
   +- InMemoryRelation (2)
         +- * Filter (7)
            +- InMemoryTableScan (3)
                  +- InMemoryRelation (4)
                        +- * ColumnarToRow (6)
                           +- Scan parquet  (5)


(1) InMemoryTableScan
Output [1]: [listing_id#0]
Arguments: [listing_id#0]

(2) InMemoryRelation
Arguments: [listing_id#0, listing_name#2], CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@701abd9b,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) Filter (isnotnull(listing_id#0) AND (listing_id#0 >= 10000000))
+- InMemoryTableScan [listing_id#0, listing_name#2], [isnotnull(listing_id#0), (listing_id#0 >= 10000000)]
      +- InMemoryRelation [listing_id#0, listing_url#1, listing_name#2, listing_summary#3, listing_desc#4], StorageLevel(disk, memory, deserialized, 1 replicas)
            +- *(1) ColumnarToRow
               +- FileScan parquet [listing_id#0,listing_url#1,listing_

In [10]:
# What happens if we call cache() without assigning the result to a variable?
# The cache still applies: Spark matches it by query plan, not by the Python variable.
dfCached.selectExpr("listing_id", "listing_name").where(col("listing_id") >= 30000000).cache()
dfCached.selectExpr("listing_id", "listing_name").where(col("listing_id") >= 30000000).select("listing_id").explain("FORMATTED")

== Physical Plan ==
InMemoryTableScan (1)
   +- InMemoryRelation (2)
         +- * Filter (7)
            +- InMemoryTableScan (3)
                  +- InMemoryRelation (4)
                        +- * ColumnarToRow (6)
                           +- Scan parquet  (5)


(1) InMemoryTableScan
Output [1]: [listing_id#0]
Arguments: [listing_id#0]

(2) InMemoryRelation
Arguments: [listing_id#0, listing_name#2], CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@701abd9b,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) Filter (isnotnull(listing_id#0) AND (listing_id#0 >= 30000000))
+- InMemoryTableScan [listing_id#0, listing_name#2], [isnotnull(listing_id#0), (listing_id#0 >= 30000000)]
      +- InMemoryRelation [listing_id#0, listing_url#1, listing_name#2, listing_summary#3, listing_desc#4], StorageLevel(disk, memory, deserialized, 1 replicas)
            +- *(1) ColumnarToRow
               +- FileScan parquet [listing_id#0,listing_url#1,listing_
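
A DataFrame rebuilt from the same expression can also be asked directly whether a cache entry exists for it, without reading the plan. The sketch below is an illustration rather than part of the original notebook: `is_plan_cached` is a hypothetical helper, and it relies on the assumption that `DataFrame.storageLevel` is resolved by Spark from the query plan, so that it reports a level with every storage flag off when nothing is cached.

```python
def is_plan_cached(df):
    """Return True if Spark reports a storage level for the plan behind `df`.

    Hypothetical helper. `df.storageLevel` is expected to be a
    pyspark.StorageLevel; useMemory, useDisk and useOffHeap are all False
    when the plan is not cached.
    """
    level = df.storageLevel
    return bool(level.useMemory or level.useDisk or level.useOffHeap)


# Usage sketch (left as a comment because it needs the notebook's session):
#
#   query = (
#       dfCached
#       .selectExpr("listing_id", "listing_name")
#       .where(col("listing_id") >= 30000000)
#   )
#   is_plan_cached(query)
```

Calling `unpersist()` on such a rebuilt DataFrame is likewise expected to release the entry, which is useful when the original `cache()` call was never assigned to a variable.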

### Spark SQL and Cache

In [12]:
dfCached.createOrReplaceTempView("RAW")
spark.sql("SELECT * FROM RAW LIMIT 10").show()

+----------+--------------------+--------------------+--------------------+--------------------+
|listing_id|         listing_url|        listing_name|     listing_summary|        listing_desc|
+----------+--------------------+--------------------+--------------------+--------------------+
|  12276698|https://www.airbn...|Downtown Casa in ...|Built in (Phone n...|Built in (Phone n...|
|  39589825|https://www.airbn...|Comfy Stapleton c...|                null|                null|
|  16676955|https://www.airbn...|Adorable Row Home...|Updated Spanish s...|Updated Spanish s...|
|  38638676|https://www.airbn...|Amenity Rich LUX ...|                null|                null|
|  33396764|https://www.airbn...|NE Dnvr Home 3br/...|                null|                null|
|   9842499|https://www.airbn...|Lg Light Bsmnt in...|Dbl BR and den w/...|Dbl BR and den w/...|
|  39390474|https://www.airbn...|Sonder | Universi...|Featured in The N...|Featured in The N...|
|  18503556|https://www.airbn.

In [13]:
# Cache the query result with CACHE TABLE.
spark.sql("""
CACHE TABLE RAW_CACHED AS SELECT * FROM RAW
""")

DataFrame[]
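
Unlike `DataFrame.cache()`, `CACHE TABLE` is eager by default; the `LAZY` keyword defers caching until the table is first used, and an `OPTIONS` clause can set the storage level. The sketch below only builds the statement text for these variants. It is not from the original notebook: `cache_table_sql` and its parameters are hypothetical names.

```python
def cache_table_sql(name, query, lazy=False, storage_level=None):
    """Build a `CACHE TABLE ... AS <query>` statement as a string.

    Hypothetical helper. `name` and `query` are interpolated as-is, so
    they should only come from trusted code, never from user input.
    """
    parts = ["CACHE"]
    if lazy:
        parts.append("LAZY")  # defer caching until the table is first used
    parts.append(f"TABLE {name}")
    if storage_level is not None:
        # for example 'MEMORY_ONLY' or 'DISK_ONLY'
        parts.append(f"OPTIONS ('storageLevel' '{storage_level}')")
    parts.append(f"AS {query}")
    return " ".join(parts)


# Usage sketch (left as a comment because it needs the notebook's session):
#
#   spark.sql(cache_table_sql("RAW_CACHED", "SELECT * FROM RAW", lazy=True))
```

With the default arguments the helper produces the same statement as the cell above.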

In [14]:
# Remove the cache with UNCACHE TABLE.
spark.sql("""
UNCACHE TABLE IF EXISTS RAW_CACHED
""")

DataFrame[]
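
The same cleanup is available from Python through the catalog API. The sketch below is not from the original notebook: `uncache_if_cached` is a hypothetical helper built on `spark.catalog.isCached` and `spark.catalog.uncacheTable`, and it assumes the named table or view exists in the session.

```python
def uncache_if_cached(spark, name):
    """Uncache the table or view `name` if it is currently cached.

    Hypothetical helper. Returns True if a cache was dropped and False
    if there was nothing to drop. Assumes `name` exists in the session.
    """
    if spark.catalog.isCached(name):
        spark.catalog.uncacheTable(name)
        return True
    return False


# Usage sketch (left as a comment because it needs the notebook's session):
#
#   uncache_if_cached(spark, "RAW")
```

To drop every cached table and DataFrame in the session at once, `spark.catalog.clearCache()` can be used instead.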