# Reading query plans

In [1]:
from pyspark import StorageLevel
from pyspark.sql import functions as F, SQLContext, SparkSession, Window
from pyspark.sql.types import *
from random import randint
import time
import datetime

spark = (SparkSession.builder
         .appName("workshop-spark-optimisation")
         .master("spark://spark-master:7077")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "/opt/workspace/history")
         .config("spark.speculation", "true")
         .enableHiveSupport()
         .getOrCreate()
         )

## Initialize dataframes

### Meteo observations

In [2]:
meteo_data_file = "data/meteo-data/parquet"
meteo_df = spark.read.parquet(meteo_data_file)
meteo_df.printSchema()

root
 |-- station_identifier: string (nullable = true)
 |-- date: date (nullable = true)
 |-- observation_type: string (nullable = true)
 |-- observation_value: integer (nullable = true)
 |-- MFLAG1: string (nullable = true)
 |-- QFLAG1: string (nullable = true)
 |-- SFLAG1: string (nullable = true)
 |-- time: string (nullable = true)
 |-- yyyy: integer (nullable = true)



In [3]:
stations_meta_file = "data/meteo-data/stations.csv"

schema = StructType([
    StructField('station_identifier', StringType(), True),
    StructField('latitude', FloatType(), True),
    StructField('longitude', FloatType(), True),
    StructField('height_above_sea_level', FloatType(), True),
    StructField('station_name', StringType(), True)
])

stations_df = (spark.read
               .schema(schema)
               .option("header", "false")
               .csv(stations_meta_file)
              )
stations_df.printSchema()

root
 |-- station_identifier: string (nullable = true)
 |-- latitude: float (nullable = true)
 |-- longitude: float (nullable = true)
 |-- height_above_sea_level: float (nullable = true)
 |-- station_name: string (nullable = true)



In [4]:
# keep only observations from 2018 onwards; df1 is reused by the cells below
df1 = (meteo_df
       .where("yyyy >=2018"))

In [5]:
df2 = df1.join(stations_df, df1["station_identifier"] == stations_df["station_identifier"], "inner")
df2 = df2.select(df2['observation_type'], df2['observation_value'], df2['station_name'])

In [6]:
df2.count()

180983588

In [7]:
df2.explain(mode='formatted')

== Physical Plan ==
* Project (10)
+- * BroadcastHashJoin Inner BuildRight (9)
   :- * Project (4)
   :  +- * Filter (3)
   :     +- * ColumnarToRow (2)
   :        +- Scan parquet  (1)
   +- BroadcastExchange (8)
      +- * Project (7)
         +- * Filter (6)
            +- Scan csv  (5)


(1) Scan parquet 
Output [4]: [station_identifier#0, observation_type#2, observation_value#3, yyyy#8]
Batched: true
Location: InMemoryFileIndex [file:/opt/workspace/data/meteo-data/parquet]
PartitionFilters: [isnotnull(yyyy#8), (yyyy#8 >= 2018)]
PushedFilters: [IsNotNull(station_identifier)]
ReadSchema: struct<station_identifier:string,observation_type:string,observation_value:int>

(2) ColumnarToRow [codegen id : 2]
Input [4]: [station_identifier#0, observation_type#2, observation_value#3, yyyy#8]

(3) Filter [codegen id : 2]
Input [4]: [station_identifier#0, observation_type#2, observation_value#3, yyyy#8]
Condition : isnotnull(station_identifier#0)

(4) Project [codegen id : 2]
Output [3]: [stat

In [8]:
df2.explain(mode='cost')

== Optimized Logical Plan ==
Project [observation_type#2, observation_value#3, station_name#22], Statistics(sizeInBytes=15.3 PiB)
+- Join Inner, (station_identifier#0 = station_identifier#18), Statistics(sizeInBytes=27.0 PiB)
   :- Project [station_identifier#0, observation_type#2, observation_value#3], Statistics(sizeInBytes=2.9 GiB)
   :  +- Filter ((isnotnull(yyyy#8) AND (yyyy#8 >= 2018)) AND isnotnull(station_identifier#0)), Statistics(sizeInBytes=7.8 GiB)
   :     +- Relation[station_identifier#0,date#1,observation_type#2,observation_value#3,MFLAG1#4,QFLAG1#5,SFLAG1#6,time#7,yyyy#8] parquet, Statistics(sizeInBytes=7.8 GiB)
   +- Project [station_identifier#18, station_name#22], Statistics(sizeInBytes=9.3 MiB)
      +- Filter isnotnull(station_identifier#18), Statistics(sizeInBytes=11.6 MiB)
         +- Relation[station_identifier#18,latitude#19,longitude#20,height_above_sea_level#21,station_name#22] csv, Statistics(sizeInBytes=11.6 MiB)

== Physical Plan ==
*(2) Project [observati

## Explain function

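`explain()` prints the query plan of a DataFrame.

It accepts a `mode` argument: `'simple'` (the default), `'extended'`, `'codegen'`, `'cost'` and `'formatted'`. `'formatted'` is especially useful: it breaks the spaghetti of the default output down into a tree outline followed by a detail section for each operator.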

## How to read text output

### Understanding tree structure

When you see a tree structured output like this:
```
== Physical Plan ==
* SortMergeJoin Inner (12)
:- * Sort (6)
:  +- Exchange (5)
:     +- * Project (4)
:        +- * Filter (3)
:           +- * ColumnarToRow (2)
:              +- Scan parquet  (1)
+- * Sort (11)
   +- Exchange (10)
      +- * Project (9)
         +- * Filter (8)
            +- Scan csv  (7)
```

Read every branch bottom-up, from the most indented line to the least indented one.

For example, take the branch that processes data from the Parquet source:
```
:- * Sort (6)
:  +- Exchange (5)
:     +- * Project (4)
:        +- * Filter (3)
:           +- * ColumnarToRow (2)
:              +- Scan parquet  (1)
```
This part of the query:
1. Scans the Parquet files
2. Converts the columnar format to rows
3. Filters rows based on the predicate
4. Projects (selects) only the required columns
5. Exchanges (shuffles) the data
6. Sorts the shuffled data

Finally, the two branches feed the `SortMergeJoin Inner` operation.
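
The bottom-up reading rule can also be checked mechanically. The following is a minimal sketch, assuming the tree layout shown above and relying on the observation that, in the plans in this notebook, operator ids increase from the leaves towards the root. `operators_in_execution_order` is a hypothetical helper written for this illustration, not part of PySpark.

```python
import re

PLAN = """\
* SortMergeJoin Inner (12)
:- * Sort (6)
:  +- Exchange (5)
:     +- * Project (4)
:        +- * Filter (3)
:           +- * ColumnarToRow (2)
:              +- Scan parquet  (1)
+- * Sort (11)
   +- Exchange (10)
      +- * Project (9)
         +- * Filter (8)
            +- Scan csv  (7)
"""

# Tree-drawing prefix, optional '*' codegen marker, operator name, then "(id)".
LINE_RE = re.compile(r"^[\s:+\-]*(\*\s+)?(?P<name>.+?)\s*\((?P<id>\d+)\)\s*$")


def operators_in_execution_order(plan_text):
    """Return (id, name) pairs sorted by operator id (hypothetical helper)."""
    ops = []
    for line in plan_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            ops.append((int(match.group("id")), match.group("name").strip()))
    return sorted(ops)


for op_id, name in operators_in_execution_order(PLAN):
    print(op_id, name)
```

The first six entries correspond to the Parquet branch listed above, the next five to the CSV branch, and the join comes last.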

### Understanding text

Besides the tree structure, there is a corresponding text explanation:

```
(8) Filter [codegen id : 3]
Input [5]: [station_identifier#449, latitude#450, longitude#451, height_above_sea_level#452, station_name#453]
Condition : isnotnull(station_identifier#449)

(9) Project [codegen id : 3]
Output [5]: [station_identifier#449, latitude#450, longitude#451, height_above_sea_level#452, station_name#453]
Input [5]: [station_identifier#449, latitude#450, longitude#451, height_above_sea_level#452, station_name#453]
```

Here `(8) Filter` gives the id and name of the operator, and `[codegen id : 3]` is the id of the codegen block that encapsulates this Filter operator (see the Codegen section below).

`Input` and `Output` list the columns the operator consumes and produces.
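
To make the layout of a detail section concrete, here is a small sketch that splits one into its parts. It assumes the text format shown above; `parse_operator` and its regular expressions are hypothetical and written only for this illustration.

```python
import re

DETAIL = """\
(8) Filter [codegen id : 3]
Input [5]: [station_identifier#449, latitude#450, longitude#451, height_above_sea_level#452, station_name#453]
Condition : isnotnull(station_identifier#449)
"""

# "(8) Filter [codegen id : 3]"; the codegen part is optional (e.g. for scans).
HEADER_RE = re.compile(
    r"^\((?P<id>\d+)\)\s+(?P<name>.+?)(?:\s+\[codegen id : (?P<codegen>\d+)\])?\s*$"
)
# "Input [5]: [col#1, col#2]" or "Output [3]: [...]".
COLUMNS_RE = re.compile(r"^(?P<kind>Input|Output) \[\d+\]: \[(?P<cols>.*)\]$")


def parse_operator(detail_text):
    """Parse one operator detail section into a dict (hypothetical helper)."""
    lines = detail_text.strip().splitlines()
    header = HEADER_RE.match(lines[0])
    codegen = header.group("codegen")
    op = {
        "id": int(header.group("id")),
        "name": header.group("name"),
        "codegen_id": int(codegen) if codegen else None,
    }
    for line in lines[1:]:
        match = COLUMNS_RE.match(line)
        if match:
            # Drop the "#449"-style expression ids to get plain column names.
            columns = [c.split("#")[0] for c in match.group("cols").split(", ")]
            op[match.group("kind").lower()] = columns
    return op


print(parse_operator(DETAIL))
```

The number after `#` is the expression id Spark assigns to each column reference; the sketch strips it so that only the column name remains.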

## Codegen

A `[codegen id : N]` label (or a `*` prefix in the tree) indicates that Spark has applied whole-stage code generation: consecutive operators that can run within a single stage, without a shuffle between them, are fused into one block.

Codegen is valuable because Spark compiles the fused operators into a single Java function, so each row is processed in one pass instead of being handed from operator to operator.

In the Spark UI's SQL tab, a codegen block appears as a blue `WholeStageCodegen` box that contains smaller boxes, one per merged operator.

![codegen](images/codegen.png)
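
The idea of fusing operators can be illustrated in plain Python. This is only a conceptual sketch: Spark generates Java code, not Python, and the function names and sample rows below are made up for the illustration.

```python
rows = [
    {"station_identifier": "A1", "observation_value": 10, "yyyy": 2017},
    {"station_identifier": None, "observation_value": 20, "yyyy": 2018},
    {"station_identifier": "B2", "observation_value": 30, "yyyy": 2019},
]


# Without fusion: every operator is its own iterator pulling rows from its child.
def filter_op(child):
    for row in child:
        if row["station_identifier"] is not None:
            yield row


def project_op(child):
    for row in child:
        yield {
            "station_identifier": row["station_identifier"],
            "observation_value": row["observation_value"],
        }


unfused = list(project_op(filter_op(iter(rows))))


# With fusion: Filter and Project collapse into one loop over the source.
def fused_filter_project(source):
    out = []
    for row in source:
        if row["station_identifier"] is not None:
            out.append({
                "station_identifier": row["station_identifier"],
                "observation_value": row["observation_value"],
            })
    return out


fused = fused_filter_project(rows)
print(unfused == fused)
```

Both versions produce the same rows; the fused one simply avoids the per-operator hand-off, which is the effect whole-stage codegen aims for.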

## Scan operators

### Parquet
In this example we look at the operator that reads the Parquet data.

```
(1) Scan parquet 
Output [4]: [station_identifier#0, observation_type#2, observation_value#3, yyyy#8]
Batched: true
Location: InMemoryFileIndex [file:/opt/workspace/data/meteo-data/parquet]
PartitionFilters: [isnotnull(yyyy#8), (yyyy#8 >= 2018)]
PushedFilters: [IsNotNull(station_identifier)]
ReadSchema: struct<station_identifier:string,observation_type:string,observation_value:int>
```

#### Pushdown Projection
`Output` lists the columns the query actually needs. The Query Optimizer pushes this list down to the scan, so columns that do not contribute to the end result are never read.

#### Filters

##### PartitionFilters
`yyyy >= 2018` is pushed to the source, so that only files in the relevant partitions will be accessed.
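
Partition pruning can be pictured as a filter over directory names that runs before any file is opened. The sketch below assumes a Hive-style layout with one `yyyy=<year>` directory per partition; the directory names are hypothetical examples, and `prune_partitions` is an illustrative helper, not a Spark API.

```python
# Hypothetical partition directories under the dataset root.
partition_dirs = [
    "data/meteo-data/parquet/yyyy=2016",
    "data/meteo-data/parquet/yyyy=2017",
    "data/meteo-data/parquet/yyyy=2018",
    "data/meteo-data/parquet/yyyy=2019",
]


def prune_partitions(dirs, min_year):
    """Keep only the directories whose yyyy value satisfies yyyy >= min_year."""
    kept = []
    for path in dirs:
        # The last path segment looks like "yyyy=2018".
        key, _, value = path.rsplit("/", 1)[-1].partition("=")
        if key == "yyyy" and int(value) >= min_year:
            kept.append(path)
    return kept


print(prune_partitions(partition_dirs, 2018))
```

Only the surviving directories are scanned, which is why a filter on the partition column is much cheaper than a filter on a regular column.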

##### PushedFilters
`IsNotNull(station_identifier)`: this column is the key of the inner join, and a null key can never match, so Spark pushes a not-null filter down to the source.

![scan_parquet_csv](images/scan_parquet_csv.png)

The Spark UI shows runtime statistics for the scan, for example the number of files read and their size.

### CSV
`Scan csv` is the operator that reads the CSV file(s). The stations data is a single, unpartitioned file, so you see the scan of one file along with some runtime information.

### Broadcast Hash Join

A type of join used when one of the tables is small enough to be sent (broadcast) to every executor, where it is joined with each partition of the larger table without shuffling that table.

![broadcast_hash_join](images/broadcast_hash_join.png)
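
The mechanics can be sketched in plain Python: build a hash map from the small table once, then probe it for every row of every partition of the large table. The sample rows and the `broadcast_hash_join` function are hypothetical and only illustrate the technique; Spark's implementation differs in many details.

```python
# Small side (build side): station id -> station name.
stations = [("S1", "Alpha"), ("S2", "Beta")]

# Large side, already split into partitions: (station id, type, value).
observation_partitions = [
    [("S1", "TMAX", 250), ("S3", "TMIN", 10)],
    [("S2", "PRCP", 5), ("S1", "TMIN", 120)],
]


def broadcast_hash_join(large_partitions, small_rows):
    # Build the hash map once; in Spark a copy is shipped to every executor.
    lookup = {}
    for station_id, station_name in small_rows:
        lookup.setdefault(station_id, []).append(station_name)

    joined_partitions = []
    for partition in large_partitions:
        out = []
        for station_id, obs_type, obs_value in partition:
            # Inner join: rows without a match in the map are dropped.
            for station_name in lookup.get(station_id, []):
                out.append((obs_type, obs_value, station_name))
        joined_partitions.append(out)
    return joined_partitions


print(broadcast_hash_join(observation_partitions, stations))
```

Each partition of the large side is joined independently, which is why the plan contains a `BroadcastExchange` for the small side but no shuffle `Exchange` for the large one.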

### HashAggregate

`HashAggregate` is an aggregation operator, not an exchange. It usually appears as a pair: a partial aggregation runs on each partition first, an exchange moves the partial results, and a final aggregation merges them.

In our example the `count()` operation consists of 3 partial counts on the executors; the partial values are sent over the network (exchange) to a single partition, where the final count is computed before the result is returned to the Driver.

![hash_aggregate](images/hash_aggregate.png)
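
The two-phase count can be mimicked in a few lines of plain Python. The partition contents are made up for the illustration, and `partial_count` and `final_count` are hypothetical names standing in for the partial and final `HashAggregate` operators.

```python
# Three partitions of rows, as they might be spread across executors.
partitions = [
    ["r1", "r2", "r3"],
    ["r4", "r5"],
    ["r6", "r7", "r8", "r9"],
]


def partial_count(partition):
    """Partial aggregation: count the rows of a single partition."""
    count = 0
    for _ in partition:
        count += 1
    return count


def final_count(partial_counts):
    """Final aggregation: merge the partial counts after the exchange."""
    return sum(partial_counts)


# Only one number per partition crosses the network, not the rows themselves.
partials = [partial_count(p) for p in partitions]
total = final_count(partials)
print(partials, total)
```

The point of the split is the amount of data exchanged: each partition contributes a single partial value instead of all of its rows.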

In [None]:
spark.stop()

# Questions

1. Run the following query and explain its query plan.

In [9]:
from pyspark import StorageLevel
from pyspark.sql import functions as F, SQLContext, SparkSession, Window
from pyspark.sql.types import *
from random import randint
import time
import datetime

spark = (SparkSession.builder
         .appName("explore-data")
         .master("spark://spark-master:7077")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "/opt/workspace/history")
         .enableHiveSupport()
         .getOrCreate()
         )

meteo_data_file = "data/meteo-data/parquet"
meteo_df = spark.read.parquet(meteo_data_file)
observation_type_file = "data/meteo-data/observation_type.csv"

schema = StructType([
    StructField('observation_type', StringType(), True),
    StructField('description', StringType(), True)
])

observation_type_df = (spark.read
               .schema(schema)
               .option("header", "false")
               .csv(observation_type_file)
              )

def addStationUDF(s):
    # Append a suffix to the station identifier; nulls pass through unchanged
    if s is None:
        return None
    return s + ' station'
spark.udf.register("addStationUDF", addStationUDF)

meteo_df.createOrReplaceTempView("meteo_table")  # registerTempTable is deprecated

df_udf = spark.sql("""
                    select 
                       addStationUDF(station_identifier) as standard_station_identifier
                       , observation_type
                    from 
                        meteo_table
                    where station_identifier = 'AG000060390'
                    and yyyy >= 2000
                """)
res = df_udf.join(observation_type_df, "observation_type", "left")

res.explain(mode='formatted')


== Physical Plan ==
* Project (12)
+- * BroadcastHashJoin LeftOuter BuildRight (11)
   :- * Project (6)
   :  +- BatchEvalPython (5)
   :     +- * Project (4)
   :        +- * Filter (3)
   :           +- * ColumnarToRow (2)
   :              +- Scan parquet  (1)
   +- BroadcastExchange (10)
      +- * Project (9)
         +- * Filter (8)
            +- Scan csv  (7)


(1) Scan parquet 
Output [3]: [station_identifier#82, observation_type#84, yyyy#90]
Batched: true
Location: InMemoryFileIndex [file:/opt/workspace/data/meteo-data/parquet]
PartitionFilters: [isnotnull(yyyy#90), (yyyy#90 >= 2000)]
PushedFilters: [IsNotNull(station_identifier), EqualTo(station_identifier,AG000060390)]
ReadSchema: struct<station_identifier:string,observation_type:string>

(2) ColumnarToRow [codegen id : 1]
Input [3]: [station_identifier#82, observation_type#84, yyyy#90]

(3) Filter [codegen id : 1]
Input [3]: [station_identifier#82, observation_type#84, yyyy#90]
Condition : (isnotnull(station_identifier#82)

2. What does the `BatchEvalPython` block represent?
3. Where can you find the query plan?
4. How do you get the query plan for a particular stage?

In [None]:
spark.stop()