## DPP 
Dynamic Partition Pruning

1. Configuration:

`spark.sql.optimizer.dynamicPartitionPruning.enabled`: `true` (default)


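2. Constraints

- The fact table must be physically partitioned, and the join must be on the partition column (`region` in this notebook).
- DPP is most effective when the smaller (dimension) table is small enough to be broadcast.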
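
The two constraints above follow from how DPP works: the filter on the small table is evaluated first, and the surviving join keys decide which partitions of the large table get read at all. The following is a toy pure-Python sketch of that idea, not Spark's actual implementation. The dictionary layout and the function name `pruned_join` are assumptions made for illustration; the sample rows are taken from the tables shown later in this notebook.

```python
# Toy model: one list of rows per partition value, imitating a fact table
# that is physically partitioned by `region`.
fact_partitions = {
    1: [("2022-04-03", "Tory Delgado"), ("2022-05-24", "Jene Franklin")],
    2: [("2022-03-30", "Elenore Moore")],
    3: [("2022-01-19", "Gertude Ramsey")],
}
dimension_rows = [(1, "San Francisco"), (2, "Los Angeles"), (3, "Seattle")]


def pruned_join(fact_partitions, dimension_rows, city):
    """Join fact and dimension rows, reading only the partitions that can match."""
    # Step 1: filter the small dimension side and collect the surviving join keys.
    keys = {region_id for region_id, c in dimension_rows if c == city}
    cities = dict(dimension_rows)
    # Step 2: scan only the fact partitions whose key survived the filter.
    scanned = [p for p in fact_partitions if p in keys]
    joined = [
        (date, name, p, cities[p])
        for p in scanned
        for date, name in fact_partitions[p]
    ]
    return scanned, joined


scanned, joined = pruned_join(fact_partitions, dimension_rows, "San Francisco")
print(scanned)  # partitions actually read
print(joined)   # joined rows
```

If the fact data were not split by `region`, step 2 would have nothing to skip, and if the dimension side were large, collecting its keys up front would cost more than it saves.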

3. Checking with `explain`

- If `dynamicpruningexpression` appears in the `PartitionFilters` of the fact-table scan, DPP was applied.
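
The check above can also be done programmatically on the plan text. The helper below is a minimal sketch: the name `uses_dpp` is hypothetical, and it assumes the plan has already been captured as a string in the `formatted` layout shown in this notebook. The two sample lines are copied from the plans printed later on.

```python
def uses_dpp(plan_text: str) -> bool:
    """Return True if any PartitionFilters line holds a dynamic pruning expression."""
    return any(
        "dynamicpruningexpression(" in line.lower()
        for line in plan_text.splitlines()
        if line.strip().startswith("PartitionFilters:")
    )


# Sample lines copied from the two physical plans in this notebook.
static_plan = "PartitionFilters: [isnotnull(region#120), (region#120 = 2)]"
dpp_plan = (
    "PartitionFilters: [isnotnull(region#120), "
    "dynamicpruningexpression(region#120 IN dynamicpruning#173)]"
)

print(uses_dpp(static_plan))
print(uses_dpp(dpp_plan))
```

Since `DataFrame.explain` prints the plan rather than returning it, one possible way to get the text is to wrap the call in `contextlib.redirect_stdout`; this assumes the plan is printed from the Python side, which may differ between PySpark versions.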


In [22]:
from pyspark.sql import (
    Row,
    SparkSession)
import pyspark.sql.functions as F
import pyspark.sql.types as t

In [2]:
spark=(
    SparkSession
    .builder
    .appName("spark-DPP")
    .master("spark://spark-master:7077")
    .getOrCreate()
)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/02/02 08:09:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
# DPP is not available in Spark 2.x; Spark 3.0 or later is required
spark.version

'3.5.1'

In [5]:
# Check the DPP setting
spark.conf.get("spark.sql.optimizer.dynamicPartitionPruning.enabled")

'true'

In [9]:
# Dimension table (the small table)

In [10]:
dimension_df=spark.read.csv("file:///workspace/data/ecommerce_region.csv",
                       header=True,
                       inferSchema=True)
dimension_df.show()
dimension_df.printSchema()

+---------+-------------+
|region_id|         city|
+---------+-------------+
|        1|San Francisco|
|        2|  Los Angeles|
|        3|      Seattle|
|        4|    San Diego|
|        5|     New York|
+---------+-------------+

root
 |-- region_id: integer (nullable = true)
 |-- city: string (nullable = true)



In [11]:
# Build the fact table (the large table)

In [13]:
table_schema = t.StructType([
    t.StructField("date", t.StringType(), True),
    t.StructField("name", t.StringType(), True),
    t.StructField("region", t.IntegerType(), True)])

csv_file_path="file:///workspace/data/ecommerce_order.csv"

df=spark.read.schema(table_schema).csv(csv_file_path)
df.show()
df.printSchema()

+----------+----------------+------+
|      date|            name|region|
+----------+----------------+------+
|2022-04-03|    Tory Delgado|     1|
|2022-04-22|  Marivel Knight|     5|
|2022-05-24|   Jene Franklin|     1|
|2022-06-22|Jamison Santiago|     4|
|2022-05-28|     Kasey Wolfe|     1|
|2022-01-09|     Kathey Ryan|     5|
|2022-03-30|   Elenore Moore|     2|
|2022-10-07|  Walton Kennedy|     1|
|2022-10-06|Lakiesha Jimenez|     1|
|2022-01-19|  Gertude Ramsey|     3|
|2022-12-08|   Raguel George|     4|
|2022-01-07|      Larry Lowe|     5|
|2022-06-13| Piedad Williams|     1|
|2022-10-17| Melvin Mckinney|     2|
|2022-10-17|    Cher Lambert|     3|
|2022-12-13|    Elvina Grant|     1|
|2022-10-27|   Cristie Stone|     1|
|2022-01-18| Svetlana Hansen|     3|
|2022-07-21|  Roseline Bowen|     5|
|2022-07-03|     Lacy Flores|     1|
+----------+----------------+------+
only showing top 20 rows

root
 |-- date: string (nullable = true)
 |-- name: string (nullable = true)
 |-- region: integer (nullable = true)

In [14]:
# Write the data partitioned by region
# Pruning is only possible when the data is partitioned.

In [15]:
(
    df
    .write
    .partitionBy("region")
    .mode("overwrite")
    .parquet("/workspace/data/output/partition_pruning")
)
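
Writing with `partitionBy("region")` produces one subdirectory per distinct value, named in the `region=<value>` style, and pruning works by skipping whole subdirectories. The sketch below lists the partition values found under a table directory. The helper name `list_partition_values` is hypothetical, and the demo runs on a temporary directory that only imitates the layout, so it makes no claim about the actual contents of `/workspace/data/output/partition_pruning`.

```python
import tempfile
from pathlib import Path


def list_partition_values(table_dir, column):
    """Return the values encoded in `<column>=<value>` subdirectory names."""
    prefix = f"{column}="
    return sorted(
        child.name[len(prefix):]
        for child in Path(table_dir).iterdir()
        if child.is_dir() and child.name.startswith(prefix)
    )


# Demo on a throwaway directory that imitates the partitioned layout.
with tempfile.TemporaryDirectory() as tmp:
    for region in (1, 2, 3, 4, 5):
        (Path(tmp) / f"region={region}").mkdir()
    (Path(tmp) / "_SUCCESS").touch()  # marker file, ignored by the helper
    values = list_partition_values(tmp, "region")

print(values)
```

Pointing the helper at the real output path after the write completes should list the region values present in the data.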


In [15]:
# Check static partition pruning

In [16]:
read_df=spark.read.parquet("/workspace/data/output/partition_pruning")
read_df.show()

+----------+----------------+------+
|      date|            name|region|
+----------+----------------+------+
|2022-06-22|Jamison Santiago|     4|
|2022-12-08|   Raguel George|     4|
|2022-08-03|   Johana Walton|     4|
|2022-06-15|     Yi Robinson|     4|
|2022-10-03|Lawerence Ramsey|     4|
|2022-11-30|  Lesley Collins|     4|
|2022-02-08|   Keneth Bailey|     4|
|2022-10-20|    Hang Robbins|     4|
|2022-12-08|       Wei Lewis|     4|
|2022-01-28|   Rhona Manning|     4|
|2022-06-23|Becki Strickland|     4|
|2022-12-09|      Gayle Byrd|     4|
|2022-07-18|Ellsworth Hunter|     4|
|2022-01-18|   Luna Ferguson|     4|
|2022-07-27| Criselda Murphy|     4|
|2022-01-06|   Alisia Rhodes|     4|
|2022-09-06|     Klara Gibbs|     4|
|2022-12-18|   Irish Goodman|     4|
|2022-04-28| Porfirio Harvey|     4|
|2022-12-27| Andreas Spencer|     4|
+----------+----------------+------+
only showing top 20 rows



In [17]:
# Static partition pruning (the behavior before DPP)
# It only worked when the query contained a constant in the predicate, e.g. WHERE region = 2

In [20]:
# No DPP here: a literal filter on the partition column is pruned statically
read_df.where("region=2").explain("formatted")

== Physical Plan ==
* ColumnarToRow (2)
+- Scan parquet  (1)


(1) Scan parquet 
Output [3]: [date#118, name#119, region#120]
Batched: true
Location: InMemoryFileIndex [file:/workspace/data/output/partition_pruning]
PartitionFilters: [isnotnull(region#120), (region#120 = 2)]
ReadSchema: struct<date:string,name:string>

(2) ColumnarToRow [codegen id : 1]
Input [3]: [date#118, name#119, region#120]




In [21]:
# DPP is triggered: the filter is on the dimension table, not on the partition column

In [24]:
joined_df = read_df.join(
    F.broadcast(dimension_df),
    read_df.region == dimension_df.region_id,
    "inner"
).where(dimension_df.city == "San Francisco")
joined_df.show()

+----------+----------------+------+---------+-------------+
|      date|            name|region|region_id|         city|
+----------+----------------+------+---------+-------------+
|2022-04-03|    Tory Delgado|     1|        1|San Francisco|
|2022-05-24|   Jene Franklin|     1|        1|San Francisco|
|2022-05-28|     Kasey Wolfe|     1|        1|San Francisco|
|2022-10-07|  Walton Kennedy|     1|        1|San Francisco|
|2022-10-06|Lakiesha Jimenez|     1|        1|San Francisco|
|2022-06-13| Piedad Williams|     1|        1|San Francisco|
|2022-12-13|    Elvina Grant|     1|        1|San Francisco|
|2022-10-27|   Cristie Stone|     1|        1|San Francisco|
|2022-07-03|     Lacy Flores|     1|        1|San Francisco|
|2022-01-01|   Kathey Little|     1|        1|San Francisco|
|2022-04-13|        Fe Reyes|     1|        1|San Francisco|
|2022-06-23|   Apryl Holland|     1|        1|San Francisco|
|2022-07-13|  Doloris Farmer|     1|        1|San Francisco|
|2022-07-12| Merrie Eric

In [25]:
joined_df.explain(mode="formatted")

== Physical Plan ==
AdaptiveSparkPlan (6)
+- BroadcastHashJoin Inner BuildRight (5)
   :- Scan parquet  (1)
   +- BroadcastExchange (4)
      +- Filter (3)
         +- Scan csv  (2)


(1) Scan parquet 
Output [3]: [date#118, name#119, region#120]
Batched: true
Location: InMemoryFileIndex [file:/workspace/data/output/partition_pruning]
PartitionFilters: [isnotnull(region#120), dynamicpruningexpression(region#120 IN dynamicpruning#173)]
ReadSchema: struct<date:string,name:string>

(2) Scan csv 
Output [2]: [region_id#50, city#51]
Batched: false
Location: InMemoryFileIndex [file:/workspace/data/ecommerce_region.csv]
PushedFilters: [IsNotNull(city), EqualTo(city,San Francisco), IsNotNull(region_id)]
ReadSchema: struct<region_id:int,city:string>

(3) Filter
Input [2]: [region_id#50, city#51]
Condition : ((isnotnull(city#51) AND (city#51 = San Francisco)) AND isnotnull(region_id#50))

(4) BroadcastExchange
Input [2]: [region_id#50, city#51]
Arguments: HashedRelationBroadcastMode(List(cast(in