# Faster PySpark: Understanding Spark's query planning

Imagine the following scenario: you write a readable, well-thought-out PySpark program. When submitting your program to your Spark cluster, it runs. You wait. 

How can we peek under the hood and follow the progress of our program? How can we troubleshoot which step is taking a lot of time? This chapter is about accessing information about our Spark instance, such as its configuration and layout (CPU, memory, etc.). We also follow the execution of a program from raw Python code to optimized Spark instructions.

### Navigating the Spark UI to understand the environment

This section covers how Spark uses allocated computing and memory resources and how we can configure how many resources are assigned to Spark. 

Our program follows a pretty simple set of steps:
1. We create a SparkSession object to access the data frame functionality of PySpark as well as to connect to our Spark instance.
2. We create a data frame containing all the text files (line by line) within the chosen directory, and we count the occurrence of each word.
3. We show the top 10 most frequent words.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName(
    "Counting word occurrences from a book, one more time."
).getOrCreate()

spark
```


```python
results = (
    spark.read.text("./data/gutenberg_books/*.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']+", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby(F.col("word"))
    .count()
)
results.orderBy(F.col("count").desc()).show(10)
```

```
+----+-----+
|word|count|
+----+-----+
| the|39188|
| and|24292|
|  of|21234|
|  to|20581|
|   i|15151|
|   a|14564|
|  in|12857|
|that| 9900|
|  it| 9451|
| was| 8939|
+----+-----+
only showing top 10 rows
```



Navigate to the **Spark UI**. The Spark UI landing page (also known as the Jobs tab on the top menu) contains a lot of information, which we can divide into a few sections:
- The top menu provides access to the main sections of the Spark UI, which we explore in this chapter.
- The timeline provides a visual overview of the activities impacting your SparkSession; in our case, we see the cluster allocating resources (an executor driver, since we work locally) and running our program.
- The jobs, which in our case are triggered by the `show()` action (depicted in the Spark UI as `showString`), are listed at the bottom of the page. A job that is still processing would be listed as in progress.
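
If you are unsure where the Spark UI is served, the running session can tell you. The sketch below wraps PySpark's `SparkContext.uiWebUrl` property in a small helper. The helper name `spark_ui_url` is hypothetical, and it assumes `spark` is an active `SparkSession` like the one created earlier.

```python
def spark_ui_url(spark) -> str:
    """Return the address of the Spark UI for an active SparkSession.

    `spark_ui_url` is an illustrative helper, not part of PySpark; it only
    reads the `uiWebUrl` property exposed by the session's SparkContext.
    """
    return spark.sparkContext.uiWebUrl
```

With a local session, `print(spark_ui_url(spark))` typically shows an address on port 4040; Spark moves to the next free port when 4040 is already taken.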

#### Reviewing the configuration: The Environment tab
This tab contains the configuration of the environment that our Spark instance sits on, so the information is useful for troubleshooting library problems, providing configuration information if you run into weird behavior (or a bug!), or understanding the specific behavior of a Spark instance.

The Environment tab contains all the information about how the machines on your cluster are set up. It covers information about the JVM and Scala versions installed (remember, Spark is a Scala program), as well as the options Spark is using for this session. 
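
Part of what the Environment tab shows can also be read from code, which is handy when you want to paste your configuration into a bug report. The following is a minimal sketch, assuming `spark` is an active `SparkSession`; `spark_options` is a hypothetical helper name, and it relies on `SparkContext.getConf()` and `SparkConf.getAll()`, which return the options that were set for the session rather than every possible default.

```python
from typing import Dict


def spark_options(spark, prefix: str = "spark.") -> Dict[str, str]:
    """Collect the session's Spark options whose key starts with `prefix`.

    Illustrative helper, not part of PySpark. Keys are returned in
    alphabetical order so the listing is easy to scan or diff.
    """
    pairs = spark.sparkContext.getConf().getAll()  # list of (key, value) tuples
    return {key: value for key, value in sorted(pairs) if key.startswith(prefix)}
```

For example, `spark_options(spark, "spark.driver")` narrows the listing to the driver-related options.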

#### Greater than the sum of its parts: The Executors tab and resource management

The Executors tab contains information about the computing and memory resources available to our Spark instance. After clicking on Executors, we are presented with a summary and detailed view of all the nodes in our cluster, including the CPU cores and RAM used. By default, Spark will allocate 1 GiB (gibibyte) of memory to the driver process.

Spark uses RAM for three main purposes:
- A portion of the RAM is reserved for Spark internal processing, such as user data structures, internal metadata, and safeguarding against potential out-of-memory errors when dealing with large records.
- The second portion of the RAM is used for operations (operational memory). This is the RAM used during data transformation.
- The last portion of the RAM is used for the storage (storage memory) of data. RAM access is a lot faster than reading and writing data from and to disk, so Spark will try to put as much data in memory as possible. If operational memory needs grow beyond what’s available, Spark will spill some of the data from RAM to disk.
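
To make the three portions above concrete, here is a rough back-of-the-envelope calculation. It is a sketch under stated assumptions, not Spark's own accounting: it assumes the documented defaults of Spark's unified memory manager (a fixed 300 MiB reserve, `spark.memory.fraction` of 0.6, and `spark.memory.storageFraction` of 0.5), and the function name `memory_breakdown` is hypothetical. The `reserved` and `other` entries together correspond to the first portion in the list.

```python
RESERVED_MIB = 300  # assumption: fixed reserve of Spark's unified memory manager


def memory_breakdown(heap_mib, memory_fraction=0.6, storage_fraction=0.5):
    """Approximate how a JVM heap of `heap_mib` MiB is divided by Spark.

    Illustrative only: the defaults mirror spark.memory.fraction and
    spark.memory.storageFraction, and the result ignores JVM overhead.
    """
    usable = max(heap_mib - RESERVED_MIB, 0)
    unified = usable * memory_fraction    # shared by operations and storage
    storage = unified * storage_fraction  # portion set aside for stored data
    return {
        "reserved": RESERVED_MIB,
        "other": usable - unified,         # user data structures, metadata
        "operational": unified - storage,  # used during data transformation
        "storage": storage,
    }


for name, mib in memory_breakdown(1024).items():
    print(f"{name:>12}: {mib:7.1f} MiB")
```

For the default 1 GiB driver, this gives roughly 217 MiB each for operational and storage memory. The boundary between those two is soft, so treat the numbers as orders of magnitude rather than hard limits.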

<img src="images/spark_resources.png">

Spark provides a few configuration flags to change the memory and number of CPU cores available. We have access to two identical sets of parameters to define the resources our drivers and executors will have access to. 

When creating the `SparkSession`, you can use the `master()` method to connect to
a specific cluster manager (in cluster mode) or, when working locally, to specify the number of cores to allocate from your computer.

We can decide to go from 16 cores to only 8 by passing `master("local[8]")` in the `SparkSession` builder object. Memory allocation is done through configuration flags; the most important when working locally is `spark.driver.memory`. This flag takes a size as its value, written with one of the abbreviations in the table below, and is set via the `config()` method of the `SparkSession` builder object.

|Abbreviation| Definition  |
|------------|-------------|
|1b        |    1 byte|
|1k or 1kb |1 kibibyte = 1,024 bytes|
|1m or 1mb |1 mebibyte = 1,024 kibibytes|
|1g or 1gb |1 gibibyte = 1,024 mebibytes|
|1t or 1tb |1 tebibyte = 1,024 gibibytes|
|1p or 1pb |1 pebibyte = 1,024 tebibytes|
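
The abbreviations in the table are binary multiples, so `16g` means 16 × 1,024³ bytes. The sketch below converts such a string to bytes to illustrate the arithmetic. `size_to_bytes` is a hypothetical helper written for this example, not Spark's own parser, and it only handles whole numbers followed by one of the suffixes listed in the table.

```python
import re

# Power of 1,024 associated with each suffix from the table above.
_UNITS = {
    "b": 0,
    "k": 1, "kb": 1,
    "m": 2, "mb": 2,
    "g": 3, "gb": 3,
    "t": 4, "tb": 4,
    "p": 5, "pb": 5,
}
_SIZE = re.compile(r"^(\d+)([a-z]+)$")


def size_to_bytes(size: str) -> int:
    """Convert a size string such as '16g' or '512mb' to a number of bytes."""
    match = _SIZE.match(size.strip().lower())
    if match is None or match.group(2) not in _UNITS:
        raise ValueError(f"not a size string like '16g': {size!r}")
    number, unit = match.groups()
    return int(number) * 1024 ** _UNITS[unit]


print(size_to_bytes("16g"))
```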


```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("Launching PySpark with custom options")
    .master("local[8]")  # run locally with 8 cores
    .config("spark.driver.memory", "16g")  # give the driver 16 GiB of RAM
    .getOrCreate()
)
```
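
Once the session is up, it is worth confirming that the options took effect. The helper below is a minimal sketch, assuming `spark` is the session created above. `resource_summary` is a hypothetical name; it reads the `master` and `defaultParallelism` attributes of the `SparkContext` and looks up `spark.driver.memory` in the session's `SparkConf`, falling back to the 1 GiB default mentioned earlier when the flag was never set.

```python
def resource_summary(spark) -> dict:
    """Summarize the resources an active SparkSession was launched with.

    Illustrative helper, not part of PySpark.
    """
    sc = spark.sparkContext
    return {
        "master": sc.master,
        "default_parallelism": sc.defaultParallelism,
        # Assumption: "1g" stands in for Spark's default when the flag is unset.
        "driver_memory": sc.getConf().get("spark.driver.memory", "1g"),
    }
```

For the session above, `resource_summary(spark)` should report `local[8]` as the master and `16g` as the driver memory. The same values appear in the Environment and Executors tabs of the Spark UI.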

---
# REFER CHAPTER 11 of the book
---