## SparkSession

- At the core of every Spark application is the `Spark driver program`, which creates a `SparkSession` object. When you work in a Spark shell, the session is created for you and exposed as the `spark` variable.

- `SparkSession` is the entry point to programming Spark with the `Dataset` and `DataFrame` API.

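- A `SparkSession` can be used to `create DataFrames`, `register DataFrames as tables`, execute `SQL` over tables, `cache tables`, and `read Parquet files`. To create a `SparkSession`, use the builder pattern: start from `SparkSession.builder`, chain options such as `master()` and `appName()`, and finish with `getOrCreate()`.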
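
The capabilities listed above can be sketched in a single function. This is a minimal illustration, not a canonical recipe: the function name `session_tour`, the view name `people`, the sample rows, and the `parquet_path` argument are all assumptions made for the example. It expects an already-built `SparkSession` to be passed in.

```python
def session_tour(spark, parquet_path):
    """Exercise the SparkSession features named in the text.

    `spark` is an existing SparkSession; `parquet_path` is a hypothetical
    writable location chosen by the caller.
    """
    # Create a DataFrame from local rows (illustrative data).
    people = spark.createDataFrame(
        [("Alice", 34), ("Bob", 45), ("Cara", 29)],
        ["name", "age"],
    )

    # Register the DataFrame as a temporary view so SQL can refer to it.
    people.createOrReplaceTempView("people")

    # Mark the table for caching; it is materialized on first use.
    spark.catalog.cacheTable("people")

    # Execute SQL over the registered table.
    adults = spark.sql("SELECT name FROM people WHERE age > 30 ORDER BY name")
    names = [row["name"] for row in adults.collect()]

    # Write the DataFrame as Parquet, then read it back.
    people.write.mode("overwrite").parquet(parquet_path)
    restored = spark.read.parquet(parquet_path)

    return names, restored.count()
```

Called with a real session and a writable path, this should return the two names older than 30 and a row count of 3 for the sample data.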

### Spark Job(s) -> Stage(s) -> Task(s)

In [4]:
from IPython.display import Image
Image(url="images/Spark-job-stage-task.png", width=700, height=800)

## Transformations, Actions, and Lazy Evaluation

- Transformations are all `lazy` by default: Spark does not run them when they are called, but records them as a `lineage` of transformations to be executed later, when an action requests a result.
- A recorded lineage ___allows Spark___, later in its execution plan, to rearrange certain transformations, coalesce them, or optimize them into stages for more efficient execution.
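
The idea of a recorded lineage can be illustrated with a toy model in plain Python. This is not Spark's implementation or API: `LazyDataset` and its methods are hypothetical names invented for this sketch. Transformations only append a step to the lineage, and the action `collect()` is what finally runs the steps.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not a Spark class)."""

    def __init__(self, rows, lineage=()):
        self._rows = rows
        self._lineage = tuple(lineage)  # recorded (name, function) steps

    # Transformations: record a step and return a new dataset; compute nothing.
    def map(self, fn):
        return LazyDataset(self._rows, self._lineage + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._rows, self._lineage + (("filter", pred),))

    def lineage(self):
        return [name for name, _ in self._lineage]

    # Action: replay the recorded lineage and return concrete results.
    def collect(self):
        rows = iter(self._rows)
        for name, fn in self._lineage:
            rows = map(fn, rows) if name == "map" else filter(fn, rows)
        return list(rows)


calls = []  # records every row that the map step actually processes


def double(x):
    calls.append(x)
    return x * 2


pipeline = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = list(calls)  # still empty: nothing has executed yet
result = pipeline.collect()        # the action triggers the computation
```

Until `collect()` is called, `double` never runs, which is the window in which a real engine is free to reorder or combine the recorded steps.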

### Narrow vs Wide Dependencies
- Transformations can be classified as having either narrow dependencies or wide dependencies. Any transformation where a single output partition can be computed from a single input partition is a narrow transformation. For example, in the previous code snippet, `filter()` and `contains()` represent narrow transformations because they can operate on a single partition and produce the resulting output partition without any exchange of data.
- However, `groupBy()` and `orderBy()` instruct Spark to perform wide transformations, where data from other partitions is read in, combined, and written to disk. **They go across all partitions.**
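
The difference can be made concrete with a toy model that represents partitions as plain Python lists. This is only a sketch of the dependency pattern, not how Spark implements a shuffle: the helper names `narrow_filter` and `wide_group_sum`, the sample records, and the simple first-character partitioner are assumptions made for the example.

```python
from collections import defaultdict

# Two input partitions of (key, value) records; illustrative data.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5), ("a", 6)],
]


def narrow_filter(parts, pred):
    # Narrow: each output partition is computed from exactly one input partition.
    return [[rec for rec in part if pred(rec)] for part in parts]


def wide_group_sum(parts, num_out):
    # Wide: every record is routed to the output partition that owns its key,
    # so one output partition may need records from all input partitions.
    shuffled = [defaultdict(int) for _ in range(num_out)]
    for part in parts:
        for key, value in part:
            target = ord(key[0]) % num_out  # toy deterministic partitioner
            shuffled[target][key] += value
    return [dict(p) for p in shuffled]


filtered = narrow_filter(partitions, lambda rec: rec[1] > 2)
grouped = wide_group_sum(partitions, num_out=2)
```

Here the key `"a"` appears in both input partitions, so its total cannot be produced without moving data between partitions, while the filter never looks outside the partition it is given.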

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Word Count") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

