# Day 2: My first Spark Applications and Basic Spark Concepts
[pyspark Doc](https://spark.apache.org/docs/2.4.5/api/python/index.html)

Spark provides two sets of APIs, *Structured APIs* and *low-level APIs*. The Structured APIs are designed to implement the business logic of Spark applications and they hide the Spark internals of the *low-level API*. So for me, as an ETL developer, the Structured APIs are the best starting point to dive into data processing with Spark.

Today's challenge is to write my first little Spark application to get a first impression of the *Structured APIs* like `DataFrame`, `Dataset`, *SQL Tables/Views* and *Structured Streaming*, and to understand some basic concepts like lazy evaluation of transformations and data processing actions.

The first question that comes to my mind is: how do I start a Spark application?

## SparkSession
Starting a Spark application generates a Spark job which is controlled and managed by exactly one *driver process*, while several *executor processes* running across the cluster nodes do the actual computational work. The driver process is controlled by a `SparkSession` object, which is the entry point of any Spark application, so there is always a one-to-one relationship between `SparkSession` and Spark application.

So how can I start a Spark session?

On day 1, I started an interactive Spark session by opening the pyspark console (`./bin/pyspark`). This implicitly creates a `SparkSession` object which is referenced by a variable called *spark*.

Since I don't want to type all the code line by line into the interactive console, my Spark application must create its own `SparkSession` object. So every Spark application starts with something like this:

In [1]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
type(spark)

pyspark.sql.session.SparkSession

## Spark UI
Spark provides a UI to monitor the status and progress of Spark jobs. It is available on port 4040 on the driver node in the Spark cluster. Since I'm running Spark in local mode, i.e. all processes are running on my laptop, I can access the UI on http://localhost:4040.

After having executed the next code block in the Spark DataFrames section, the UI showed this to me.

<img src= "./screenshots/day-002/day-002_Spark_UI.jpeg">

Clicking on a job name provides further details and metrics regarding the job execution, including a graphical representation of the execution DAG (directed acyclic graph).

<img src= "./screenshots/day-002/day-002_Spark_UI_job_details.jpeg">


## Spark DataFrames
Now I can run my hello world example from day 1, which creates my first `DataFrame`, and similar to the interactive console, the starting point is the `SparkSession` object named *spark*.

In [2]:
myFirstDataFrame = spark.range(100).toDF("number")
myFirstDataFrame.show(10)

+------+
|number|
+------+
|     0|
|     1|
|     2|
|     3|
|     4|
|     5|
|     6|
|     7|
|     8|
|     9|
+------+
only showing top 10 rows



Spark `DataFrame` objects look quite similar to Pandas dataframes. In fact, I can easily transform a Spark `DataFrame` into a Pandas dataframe.

***But caution:*** This method should only be used if the resulting Pandas DataFrame is expected to be small, as all the data is loaded into the driver's memory.

In [3]:
pandasDF = myFirstDataFrame.toPandas()
pandasDF

Unnamed: 0,number
0,0
1,1
2,2
3,3
4,4
...,...
95,95
96,96
97,97
98,98


By the way, transforming data to JSON is also easy. Each row is turned into a JSON document, which becomes one element in the returned list.

In [4]:
myFirstDataFrame.toJSON().take(3)

['{"number":0}', '{"number":1}', '{"number":2}']

In [5]:
type(myFirstDataFrame.toJSON().take(3))

list

Ok, back to topic. The main difference between Spark and Pandas is that Pandas dataframes reside on a single machine, whereas a Spark `DataFrame` is an abstraction on top of the in-memory optimized low-level API *Resilient Distributed Dataset* (`RDD`), which is designed to split up data into partitions that can be spread across a cluster of potentially thousands of nodes.
Spark `DataFrame` objects have a surprising characteristic: they are immutable once they are created. So how can data processing work when data structures are written in stone?

## Lazy Evaluation, Transformations and Actions
The few lines of my code already demonstrate that the Structured API has a functional design. Since `DataFrame` objects are immutable, I have to use functions which read `DataFrame` objects as input, do some kind of data transformation and create a new `DataFrame`, which again can be the input of another function that does further transformations and generates another `DataFrame`. So finally I can simply concatenate functions to create a sequence of transformations to get my desired data result.

Ok, let's do it. I want to see how it works. First, I want to read some data from a CSV file into a dataframe. The file has a header line and I want Spark to derive the schema, i.e. name and type of the columns, from the file. Alternatively, I could specify a schema explicitly instead of deriving it from the file. Determining the schema at processing time, instead of at load time, is an example of the common *schema-on-read* design of Big Data architectures.

Important to keep in mind is that the column types are not Python types (or Scala or Java types if I use another API language). All language API commands are mapped to Spark's internal engine *Catalyst*, which has its own type system. That's why the Structured APIs provide largely the same performance in all API languages.

In [6]:
dfFlight = spark.read\
   .option("inferSchema", "true")\
   .option("header", "true")\
   .csv("./data/flight-data/2015-summary.csv")

Running this code doesn't produce any visible result, which looks strange at first. Actually, Spark hasn't processed any data yet, apart from scanning the file to derive the schema (schema inference needs its own pass over the data). This is because Spark applies *lazy evaluation* of *transformations*, i.e. no data is moved or processed until Spark is forced to by an *action*, e.g. by calling `show()` or by writing the result out via `write`.

By defining transformations, I just give Spark a set of rules describing how a given `DataFrame` should be logically transformed into a new `DataFrame` object. By calling an action, I give Spark the command to apply these transformations, process the data and provide me the results.

The reason for the lazy evaluation approach is that Spark first wants to know the whole story about *what* should be done before it tries to determine an efficient way *how* to do this. Therefore Spark first compiles all transformations into a **logical** directed acyclic graph (DAG), then analyses this DAG, applies optimizations (e.g. predicate push-down to data sources) whenever possible and splits up the optimized **physical** DAG into stages and parallelised tasks of `RDD` manipulations before starting to execute them.
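
To make the split between transformations and actions more tangible, here is a minimal sketch in plain Python (no Spark involved). The class `LazyFrame` and the sample rows are made up for illustration and only mimic the idea: transformations return a new immutable object with a longer plan, and only the action `collect()` touches the data. It does not show how Spark is implemented internally.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class LazyFrame:
    """Hypothetical stand-in for a DataFrame: it stores a plan, not results."""
    source: Tuple[dict, ...]                      # the input rows
    plan: Tuple[Tuple[str, Callable], ...] = ()   # recorded transformations

    def where(self, predicate):
        # Transformation: returns a NEW frame with a longer plan, runs nothing.
        return LazyFrame(self.source, self.plan + (("filter", predicate),))

    def select(self, *columns):
        # Transformation: remember the projection, do not apply it yet.
        project = lambda row: {c: row[c] for c in columns}
        return LazyFrame(self.source, self.plan + (("map", project),))

    def collect(self):
        # Action: only now the recorded steps are applied to the data.
        rows = iter(self.source)
        for kind, fn in self.plan:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return list(rows)


flights = LazyFrame((
    {"dest": "United States", "origin": "Romania", "count": 15},
    {"dest": "Egypt", "origin": "United States", "count": 15},
    {"dest": "United States", "origin": "Ireland", "count": 344},
))

evaluated = []  # records which rows the predicate has actually looked at


def to_us(row):
    evaluated.append(row["origin"])
    return row["dest"] == "United States"


query = flights.where(to_us).select("origin", "count")
touched_before_action = list(evaluated)   # still empty: nothing has run yet
result = query.collect()                  # the action triggers the processing
print(touched_before_action, result)
```

The original `flights` object keeps its empty plan, which mirrors the immutability of `DataFrame` objects: every transformation creates a new object instead of changing the existing one.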

Important to note is that `DataFrame` objects are kept in memory whenever possible. In contrast to MapReduce, Spark tries to avoid writing intermediate results (i.e. `DataFrame` objects) to disk by *pipelining* consecutive in-memory transformations, to gain better performance.

This pipelining is only possible for *narrow* transformations. These are transformations where each input partition contributes to only one output partition or where the transformation can be applied partition by partition, so the partitions can be processed locally on the same cluster node. Simple row-based filter rules and the per-partition part of commutative operations like summing up values are common examples of narrow transformations.

On the other hand, in a *wide* transformation input partitions contribute to multiple output partitions, so data needs to be shuffled across cluster nodes. Sorting and average calculation are common wide transformations across multiple partitions. During shuffling Spark writes results to disk, so wide transformations are not performed in-memory.
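
The difference can be sketched in plain Python, with a list of lists standing in for the partitions. The data, the function names and the simple byte-sum hash are assumptions chosen for illustration (Spark uses its own partitioners), but the data flow is the point: the filter never looks beyond one partition, while the repartitioning routes rows from one input partition to several output partitions.

```python
# Three input partitions of (country, count) rows; in Spark these could live
# on different nodes. The rows are made up for illustration.
partitions = [
    [("US", 15), ("Egypt", 15)],
    [("US", 344), ("India", 62)],
    [("Egypt", 1), ("India", 5)],
]


def narrow_filter(parts, predicate):
    # Narrow: every output partition is computed from exactly one input partition.
    return [[row for row in part if predicate(row)] for part in parts]


def wide_repartition_by_key(parts, num_out):
    # Wide: rows are routed by key, so one input partition may feed several
    # output partitions. This routing is the shuffle.
    out = [[] for _ in range(num_out)]
    for part in parts:
        for key, value in part:
            target = sum(key.encode()) % num_out  # toy deterministic hash
            out[target].append((key, value))
    return out


filtered = narrow_filter(partitions, lambda row: row[1] > 10)
shuffled = wide_repartition_by_key(partitions, num_out=2)
print(filtered)
print(shuffled)
```

After the shuffle, all rows of one country end up in the same output partition, which is what a grouping or sorting step needs before it can continue.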


Ok, now I want to see some action and have Spark show me the first 10 lines of my data file.

In [7]:
dfFlight.show(10)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
|    United States|          Singapore|    1|
|    United States|            Grenada|   62|
|       Costa Rica|      United States|  588|
|          Senegal|      United States|   40|
|          Moldova|      United States|    1|
+-----------------+-------------------+-----+
only showing top 10 rows



Now I triggered actual data processing, so I can see the results. Besides showing data, there are two other types of actions: writing output data, e.g. to a file, and collecting data into native objects of the respective language.

I can even combine transformations and actions in one single command.

In [8]:
spark.read\
   .option("inferSchema", "true")\
   .option("header", "true")\
   .csv("./data/flight-data/2015-summary.csv")\
   .show(10) 

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
|    United States|          Singapore|    1|
|    United States|            Grenada|   62|
|       Costa Rica|      United States|  588|
|          Senegal|      United States|   40|
|          Moldova|      United States|    1|
+-----------------+-------------------+-----+
only showing top 10 rows



Obviously I can split up any sequence of transformations by assigning the intermediate `DataFrame` object to a variable and have a look at the intermediate results by calling the `show()` function on that `DataFrame`. This is a very nice feature for me when I need to debug complex analytical queries or ETL jobs which combine several subqueries. If my subqueries provide the expected result, the bug must reside in the remaining part of my transformation logic, so I can focus my analysis on that area.

## Query Explain Plans
So far I've learned that I just need to define the business logic by concatenating transformation functions and Spark does the optimisation for me. Fortunately, Spark shows me how it will perform my query when I call the `explain()` function.

I want to see how Spark would **physically** execute the sorting, which is a wide transformation. Each step in the explain plan actually generates a new `DataFrame`.

In [9]:
dfFlight.sort("count").explain()

== Physical Plan ==
*(2) Sort [count#30 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(count#30 ASC NULLS FIRST, 200)
   +- *(1) FileScan csv [DEST_COUNTRY_NAME#28,ORIGIN_COUNTRY_NAME#29,count#30] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/oli/github/pyspark-tutorial/data/flight-data/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:int>


Reading the explain plan from the bottom upwards, it tells me that Spark first performs a file scan and then applies range partitioning, shuffling the data over 200 output partitions by default, to sort the data.
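
Why range partitioning helps with sorting can be sketched in plain Python. The function name, the sample values and the hard-coded boundaries are assumptions for illustration; as far as I understand, Spark derives the boundaries by sampling the data. Once every row sits in the partition that owns its value range, each partition can be sorted locally and the partitions read in order are globally sorted.

```python
import bisect


def range_partition_sort(partitions, boundaries):
    # The boundaries split the value space into len(boundaries) + 1 ranges.
    ranges = [[] for _ in range(len(boundaries) + 1)]
    for part in partitions:
        for value in part:
            # Shuffle step: route each value to the partition owning its range.
            ranges[bisect.bisect_right(boundaries, value)].append(value)
    # Local step: each output partition is sorted on its own.
    return [sorted(r) for r in ranges]


counts = [[15, 1, 344], [15, 62, 1], [62, 588, 40, 1]]
sorted_parts = range_partition_sort(counts, boundaries=[10, 100])
flattened = [value for part in sorted_parts for value in part]
print(sorted_parts)
```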

In [10]:
spark.conf.get("spark.sql.shuffle.partitions")

'200'

Since I'm running in local mode on a single machine, it might be better to limit the number of partitions to 5. I can do this by changing the configuration of the `SparkSession` object *spark*.

In [11]:
spark.conf.set("spark.sql.shuffle.partitions", "5")

The explain plan confirms, that my configuration change has the desired effect.

In [12]:
dfFlight.sort("count").explain()

== Physical Plan ==
*(2) Sort [count#30 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(count#30 ASC NULLS FIRST, 5)
   +- *(1) FileScan csv [DEST_COUNTRY_NAME#28,ORIGIN_COUNTRY_NAME#29,count#30] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/oli/github/pyspark-tutorial/data/flight-data/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:int>


The `explain()` function can help me figure out how flexibly I can chain functions which define `DataFrame` to `DataFrame` transformations. For example, is it relevant for the query execution whether I filter before selecting or the other way around?

In [13]:
dfRetail = spark.read\
   .format("csv")\
   .option("header", "true")\
   .option("inferSchema", "true")\
   .load("./data/retail-data/by-day/*.csv")

Can I get a performance benefit, when I filter very early?

In [14]:
from pyspark.sql.functions import col

dfRetail.where(col("InvoiceNo") != 536365)\
    .select("InvoiceNo", "Description")\
    .explain()

== Physical Plan ==
*(1) Project [InvoiceNo#86, Description#88]
+- *(1) Filter (isnotnull(InvoiceNo#86) && NOT (cast(InvoiceNo#86 as int) = 536365))
   +- *(1) FileScan csv [InvoiceNo#86,Description#88] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/oli/github/pyspark-tutorial/data/retail-data/by-day/2011-09-13.csv, ..., PartitionFilters: [], PushedFilters: [IsNotNull(InvoiceNo)], ReadSchema: struct<InvoiceNo:string,Description:string>


Or can I stick to the well-known SQL pattern: SELECT ... FROM ... WHERE?

In [15]:
dfRetail.select("InvoiceNo", "Description")\
    .where(col("InvoiceNo") != 536365)\
    .explain()

== Physical Plan ==
*(1) Project [InvoiceNo#86, Description#88]
+- *(1) Filter (isnotnull(InvoiceNo#86) && NOT (cast(InvoiceNo#86 as int) = 536365))
   +- *(1) FileScan csv [InvoiceNo#86,Description#88] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/oli/github/pyspark-tutorial/data/retail-data/by-day/2011-09-13.csv, ..., PartitionFilters: [], PushedFilters: [IsNotNull(InvoiceNo)], ReadSchema: struct<InvoiceNo:string,Description:string>


The execution plan is the same. Again Spark does the optimization in the background and performs the filter before the column projection. In this case, the functional API gives me more flexibility than the strict SQL syntax.
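
How two differently ordered queries can end up as the same plan can be pictured with a toy rewrite rule in plain Python. The plan representation and the function `push_filters_down` are made up for this sketch and are far simpler than Spark's real optimizer; the rule only moves a filter in front of a projection if the projection keeps the column the filter needs.

```python
def push_filters_down(plan):
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(plan) - 1):
            first, second = plan[i], plan[i + 1]
            # A filter directly after a projection can move in front of it,
            # because the projection kept the column the filter needs.
            if first[0] == "project" and second[0] == "filter" and second[1] in first[1]:
                plan[i], plan[i + 1] = second, first
                changed = True
    return plan


# Steps are ("filter", column, excluded_value) or ("project", columns).
filter_first = [("filter", "InvoiceNo", 536365), ("project", ("InvoiceNo", "Description"))]
select_first = [("project", ("InvoiceNo", "Description")), ("filter", "InvoiceNo", 536365)]

same_plan = push_filters_down(filter_first) == push_filters_down(select_first)
print(same_plan)
```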

## Spark SQL, Tables and Views
[SQL Language Reference](https://docs.databricks.com/spark/latest/spark-sql/language-manual/index.html) provided by Databricks.


I've been working with relational databases and SQL for many years. So I'm happy to notice that Spark also speaks my language. In fact, the Spark SQL API supports a subset of the ANSI SQL:2003 standard. I can turn a `DataFrame` into a table or view, which I can query with SQL. All I need to do is register a table/view on that `DataFrame`.

In [16]:
dfFlight.createOrReplaceTempView("flight_data_2015")

Now I can write my Spark query the way I have done it for a long time in classical databases, and I will get exactly the same result as when doing it the functional way.

This example calculates the top 5 destination countries by total number of flights in 2015. Obviously the most flights went to the US.

In [17]:
# finding the top five destination countries by SQL
from pyspark.sql.functions import desc  # desc is used by the functional version of this query
# transformation
maxSql = spark.sql("""SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5""")
# action
maxSql.show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



The equivalent functional query looks like this:

In [18]:
# transformations, finished by the action show()
dfFlight\
   .groupBy("DEST_COUNTRY_NAME")\
   .sum("count")\
   .withColumnRenamed("sum(count)","destination_total")\
   .sort(desc("destination_total"))\
   .limit(5)\
   .show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



The functional version looks even more self-explanatory to me regarding the **logical** transformations. Reading it line by line, the story is: first the data is grouped by the destination countries. Then, for each group, the number of flights is summed up, which generates a new, derived column that is then renamed. Afterwards the results are sorted in descending order by the calculated column and the output is limited to the top 5 rows.

Fortunately, the convenience of using the Spark SQL API instead of functions does not have a negative performance impact. The **physical** execution plans of both versions are the same, apart from the internal attribute names.

In [19]:
# both transformations compile to the same plan
maxSql.explain()

== Physical Plan ==
TakeOrderedAndProject(limit=5, orderBy=[aggOrder#108L DESC NULLS LAST], output=[DEST_COUNTRY_NAME#28,destination_total#106L])
+- *(2) HashAggregate(keys=[DEST_COUNTRY_NAME#28], functions=[sum(cast(count#30 as bigint))])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#28, 5)
      +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#28], functions=[partial_sum(cast(count#30 as bigint))])
         +- *(1) FileScan csv [DEST_COUNTRY_NAME#28,count#30] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/oli/github/pyspark-tutorial/data/flight-data/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,count:int>


In [20]:
dfFlight\
   .groupBy("DEST_COUNTRY_NAME")\
   .sum("count")\
   .withColumnRenamed("sum(count)","destination_total")\
   .sort(desc("destination_total"))\
   .limit(5)\
   .explain()

== Physical Plan ==
TakeOrderedAndProject(limit=5, orderBy=[destination_total#151L DESC NULLS LAST], output=[DEST_COUNTRY_NAME#28,destination_total#151L])
+- *(2) HashAggregate(keys=[DEST_COUNTRY_NAME#28], functions=[sum(cast(count#30 as bigint))])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#28, 5)
      +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#28], functions=[partial_sum(cast(count#30 as bigint))])
         +- *(1) FileScan csv [DEST_COUNTRY_NAME#28,count#30] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/oli/github/pyspark-tutorial/data/flight-data/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,count:int>


Interesting to note is that the `sum()` aggregation involves two *HashAggregate* steps. Because `sum()` is a commutative and associative operation, Spark first calculates partial sums partition by partition, which is a narrow transformation. Afterwards the aggregated, i.e. already reduced, data is shuffled (*Exchange hashpartitioning*) to calculate the overall sum across all partitions. This is another example of how Spark optimizes the query execution by first analyzing all transformations before starting data processing.
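
The two aggregation steps can be sketched in plain Python. The function names, the sample rows and the toy byte-sum hash are assumptions for illustration, not Spark's implementation. The sketch only shows why pre-aggregating pays off: fewer rows have to cross the shuffle, and the final totals stay the same.

```python
from collections import defaultdict


def partial_sums(partition):
    # Phase 1 (narrow): aggregate inside one partition, no data movement.
    sums = defaultdict(int)
    for key, value in partition:
        sums[key] += value
    return dict(sums)


def shuffle_and_merge(partials, num_out):
    # Phase 2 (wide): route each partial sum to the partition that owns its
    # key and merge the partial sums there.
    out = [defaultdict(int) for _ in range(num_out)]
    for partial in partials:
        for key, value in partial.items():
            out[sum(key.encode()) % num_out][key] += value  # toy hash
    return [dict(p) for p in out]


partitions = [
    [("United States", 15), ("Egypt", 15), ("United States", 344)],
    [("United States", 62), ("Egypt", 1), ("Canada", 8)],
]

partials = [partial_sums(p) for p in partitions]
final = shuffle_and_merge(partials, num_out=2)
totals = {key: value for part in final for key, value in part.items()}
rows_in = sum(len(p) for p in partitions)
rows_shuffled = sum(len(p) for p in partials)
print(totals, rows_in, rows_shuffled)
```

In this tiny example only 5 partial rows are shuffled instead of 6 input rows. With real data, where each key occurs many times per partition, the reduction is much larger.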
## Spark Datasets
Datasets are a type-safe version of DataFrames. Since Python is a dynamically typed language, they are not available in pyspark, but they can be used in the Java and Scala APIs. Good to keep in mind, but I skip it for now, since I prefer Python.

## Structured Streaming
So far I did all data processing in batch mode, i.e. all data gets processed at once. Batch mode forces me to wait until all data I want to analyse is available. Stream processing, on the other hand, enables me to process data incrementally as it arrives, so I can get insights faster.

Stream processing in Spark is very similar to data processing in batch mode. The following example will demonstrate this. As far as I can see for now, this is because Spark stream processing is actually event-triggered micro-batch processing.

Ok, let's have a closer look and start from scratch, i.e. creating a new `SparkSession`.

In [21]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col  # only the functions used below

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "5")

### Batch processing
This time, my data source is not a single file; instead the data is split into several files on a day-by-day basis. Nevertheless, inferring the schema, i.e. the metadata, works the same way. The only difference is the wildcard `*` in the filename, which tells Spark that I want to process all CSV files in the specified folder.

In [22]:
staticDF = spark.read\
   .format("csv")\
   .option("header", "true")\
   .option("inferSchema", "true")\
   .load("./data/retail-data/by-day/*.csv")

This defines the transformation which tells Spark how to create the source DataFrame.

As a retailer, I want to analyse how much money each customer is spending in my shops per day, i.e. in each 1-day time window (despite its name, `purchaseByCustomerPerHour` groups by day).
So I add a further transformation on that `DataFrame`, which describes the business logic of my data analysis and results in another `DataFrame`.

In [23]:
purchaseByCustomerPerHour = staticDF\
   .selectExpr("CustomerId", "(UnitPrice * Quantity) as total_cost", "InvoiceDate")\
   .groupBy("CustomerId", window("InvoiceDate", "1 day"))\
   .sum("total_cost")

To take a look at the first 10 rows of the result, I have to call the action `show()`.

In [24]:
purchaseByCustomerPerHour.show(10)

+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   14075.0|[2011-12-05 01:00...|316.78000000000003|
|   18180.0|[2011-12-05 01:00...|            310.73|
|   15358.0|[2011-12-05 01:00...| 830.0600000000003|
|   15392.0|[2011-12-05 01:00...|304.40999999999997|
|   15290.0|[2011-12-05 01:00...|263.02000000000004|
|   16811.0|[2011-12-05 01:00...|             232.3|
|   12748.0|[2011-12-05 01:00...| 363.7899999999999|
|   16500.0|[2011-12-05 01:00...| 52.74000000000001|
|   16873.0|[2011-12-05 01:00...|1854.8300000000002|
|   14060.0|[2011-12-05 01:00...|297.47999999999996|
+----------+--------------------+------------------+
only showing top 10 rows
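
The grouping by `window("InvoiceDate", "1 day")` used above can be pictured as assigning each invoice to a fixed, non-overlapping 1-day bucket and summing per customer and bucket. The following plain-Python sketch uses made-up invoices and a hypothetical helper `day_window`. It ignores time zones, whereas the window boundaries in the Spark output above start at 01:00, presumably due to the time zone of my session.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def day_window(ts):
    # Tumbling 1-day window [start, end) that contains the timestamp.
    start = datetime(ts.year, ts.month, ts.day)
    return (start, start + timedelta(days=1))


# (customer_id, unit_price, quantity, invoice_timestamp), made up for illustration
invoices = [
    (14075, 2.5, 4, datetime(2011, 12, 5, 9, 30)),
    (14075, 1.0, 3, datetime(2011, 12, 5, 17, 45)),
    (14075, 5.0, 1, datetime(2011, 12, 6, 8, 0)),
    (18180, 10.0, 2, datetime(2011, 12, 5, 12, 0)),
]

totals = defaultdict(float)
for customer_id, unit_price, quantity, ts in invoices:
    # Same logic as selectExpr + groupBy + sum: total_cost per customer and day.
    totals[(customer_id, day_window(ts))] += unit_price * quantity

for (customer_id, (start, end)), total_cost in sorted(totals.items()):
    print(customer_id, start.date(), total_cost)
```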



In batch mode, I'm actually processing the entire data history, i.e. all files at once, which can take quite a long time for large data sets. To get results sooner, I can switch to stream processing.

### Stream processing


There are just two things I have to do to turn my batch processing into stream processing in Spark:
 - using `readStream` instead of `read`, and
 - defining a trigger that refreshes the result after reading each input file

In the given example, a trigger gets fired after reading each file (*maxFilesPerTrigger* = 1). Since all files are already on my hard drive, Spark will actually refresh the results every few (milli-)seconds, so in this demonstration I end up quite close to real-time processing.

The schema is the same, as for batch processing, so I'm re-using it from the *staticDF*.

In [25]:
streamingDF = spark.readStream\
   .schema(staticDF.schema)\
   .option("maxFilesPerTrigger", 1)\
   .format("csv")\
   .option("header", "true")\
   .load("./data/retail-data/by-day/*.csv")

Let's check whether the stream was created successfully.

In [27]:
streamingDF.isStreaming

True

The transformation is still the same, but now it is applied to a streaming `DataFrame` instead of a static one.

In [28]:
purchaseByCustomerPerHour = streamingDF\
   .selectExpr("CustomerId", "(UnitPrice * Quantity) as total_cost", "InvoiceDate")\
   .groupBy(col("CustomerId"), window(col("InvoiceDate"), "1 day"))\
   .sum("total_cost")

As I've learned so far, Spark evaluates lazily and nothing happens until I call an action to initiate the stream processing. Calling `start()` on `writeStream` starts the query and generates a result table, which gets updated after each trigger event. Important to note is that **streaming tables are mutable** whereas `DataFrame` objects **are immutable.**

Here I stream the results to my console using `format("console")` to make it visible how the result table gets updated regularly. Using `format("memory")` would instead write the results to an in-memory table named after the query (here `customer_purchases`), which could then be queried with SQL.

In [29]:
purchaseByCustomerPerHour\
   .writeStream\
   .format("console")\
   .queryName("customer_purchases")\
   .outputMode("complete")\
   .start()

<pyspark.sql.streaming.StreamingQuery at 0x7feeda8b5650>
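
What the running query does with `maxFilesPerTrigger = 1` and `outputMode("complete")` can be sketched in plain Python: each trigger reads one more file, updates a running aggregate and emits the whole result table again. The function `run_micro_batches` and the sample rows are made up, and the sketch leaves out the time window for brevity, so it is only an approximation of the idea and not of Spark's engine.

```python
from collections import defaultdict


def run_micro_batches(files, max_files_per_trigger=1):
    state = defaultdict(float)  # running aggregate kept across triggers
    for i in range(0, len(files), max_files_per_trigger):
        # One trigger: read the next chunk of files and update the state.
        for rows in files[i:i + max_files_per_trigger]:
            for customer_id, total_cost in rows:
                state[customer_id] += total_cost
        # "complete" output mode: emit the whole result table after each trigger.
        yield dict(state)


# Each inner list stands for one daily file of (customer_id, total_cost) rows.
daily_files = [
    [(14075, 10.0), (18180, 20.0)],
    [(14075, 5.0)],
    [(16811, 7.5), (18180, 2.5)],
]

snapshots = list(run_micro_batches(daily_files))
for number, table in enumerate(snapshots):
    print("Batch", number, table)
```

Each snapshot contains all customers seen so far, which is why the console output of the real query keeps growing and changing while the files are read.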

I think day 2 is completed. My first little Spark applications did their jobs.