<h1 style="text-align:center"> INFO 323: Cloud Computing and Big Data</h1>
<h2 style="text-align:center"> College of Computing and Informatics</h2>
<h2 style="text-align:center">Drexel University</h2>

<h3 style="text-align:center"> Introduction to Spark Programming</h3>
<h3 style="text-align:center"> Yuan An, PhD</h3>
<h3 style="text-align:center">Associate Professor</h3>

# Cloud Computing Recap
- Typically, when we think of a “computer,” we think about one machine sitting on our desk at home or at work.
- There are some things that our computer is not powerful enough to perform. One particularly challenging area is data processing.
- Single machines do not have enough power and resources to perform computations on huge amounts of information (or the user probably does not have the time to wait for the computation to finish). 
- A cluster, or group of computers, pools the resources of many machines together, letting us use their cumulative resources as if they were a single computer.


# Cloud Computing and Spark
- A group of machines alone is not powerful; we need a framework to coordinate work across them.
- Spark does just that, managing and coordinating the execution of tasks on data across a cluster of computers.
- The cluster of machines that Spark will use to execute tasks is managed by a cluster manager such as:
  - Spark’s standalone cluster manager,
  - YARN,
  - or Mesos.
- We then submit Spark Applications to these cluster managers, which will grant resources to our application so that we can complete our work.


# Spark Cluster
A Spark cluster consists of a driver node and worker nodes.
![](https://i.imgur.com/6I4l6ZH.png)

# Spark Applications
Spark Applications consist of a driver process and a set of executor processes. 
![](https://i.imgur.com/zF7Ngfc.png)




# Spark Executor
Each core of an executor gets a partition of data to work on.
![](https://i.imgur.com/FZN5dCb.png)
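
The idea of handing each core its own partition can be sketched in plain Python. This is only a conceptual model, not Spark's API: the helper `split_into_partitions` and the "core" numbering are hypothetical names chosen for illustration.

```python
def split_into_partitions(records, num_partitions):
    """Split a list into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(records), num_partitions)
    partitions = []
    start = 0
    for i in range(num_partitions):
        # The first `extra` partitions receive one additional record.
        end = start + size + (1 if i < extra else 0)
        partitions.append(records[start:end])
        start = end
    return partitions


records = list(range(10))
partitions = split_into_partitions(records, 4)

# One partition per (hypothetical) executor core.
for core_id, part in enumerate(partitions):
    print(f"core {core_id} -> {part}")
```

With four cores, four partitions can be processed at the same time; more partitions than cores simply means some cores work through several partitions in turn.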

# Spark Driver
- Each Spark driver creates one or more Spark jobs.
- Each Spark job creates one or more stages.
- Each Spark stage creates one or more tasks to be distributed to executors.

![](https://i.imgur.com/lMTRd1c.png)
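
The job, stage, and task hierarchy can be modeled roughly in plain Python. This sketch assumes two simplifications: a new stage begins at every wide (shuffle) transformation, and every stage runs one task per partition. In real Spark the partition count can change after a shuffle. The functions `plan_stages` and `plan_tasks` are hypothetical, not part of Spark.

```python
def plan_stages(transformations):
    """Group (name, kind) pairs into stages; a wide transformation starts a new stage."""
    stages = [[]]
    for name, kind in transformations:
        if kind == "wide" and stages[-1]:
            stages.append([])
        stages[-1].append(name)
    return stages


def plan_tasks(stages, num_partitions):
    """Create one (stage_index, partition_index) task per partition in every stage."""
    return [(s, p) for s in range(len(stages)) for p in range(num_partitions)]


# Hypothetical job: the step names and kinds are illustrative only.
job = [("filter", "narrow"), ("select", "narrow"), ("groupBy", "wide"), ("sort", "wide")]
stages = plan_stages(job)
tasks = plan_tasks(stages, num_partitions=4)

print(stages)
print(len(tasks), "tasks to distribute to executors")
```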

# Spark Programs
A program consists of a sequence of transformations followed by an action.
![](https://i.imgur.com/NLTWLo0.png)


# Spark RDD
- The RDD (Resilient Distributed Dataset) is the primary data abstraction for Spark applications and one of the main differentiators between Spark and other cluster computing frameworks.
- RDDs are in-memory collections of data distributed across a cluster.
- Spark programs using the Spark core API consist of:
  - loading input data into an RDD,
  - transforming the RDD into subsequent RDDs,
  - storing or presenting the final output for an application from the resulting final RDD.
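
The three steps above can be sketched in plain Python, with a tuple of partitions standing in for an RDD. This is an analogy only: no Spark API is used, and the sample lines and the two-partition split are made up for illustration.

```python
# Step 1: load input data into partitions (a tuple of tuples stands in for an RDD).
lines = ("spark is fast", "spark is lazy", "rdds are immutable")
rdd = (lines[:2], lines[2:])  # two partitions

# Step 2: transform the "RDD" into subsequent "RDDs"; each step builds a new tuple.
words = tuple(tuple(w for line in part for w in line.split()) for part in rdd)
long_words = tuple(tuple(w for w in part if len(w) > 2) for part in words)

# Step 3: present the final output by gathering all partitions in one place.
result = [w for part in long_words for w in part]
print(result)
```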


# DataFrames
The DataFrame is the most common Structured API; it simply represents a table of data with rows and columns.
![](https://i.imgur.com/qvdB1rT.png)

# Transformations
- In Spark, the core data structures are immutable, meaning they cannot be changed after they’re created. 
- This might seem like a strange concept at first: if we cannot change it, how are we supposed to use it? 
- To “change” a DataFrame, we need to instruct Spark how we would like to modify it to do what we want.
- These instructions are called transformations. 
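
A minimal plain-Python sketch of the same idea: an immutable table whose "transformation" returns a new table and leaves the original untouched. The class `ImmutableTable` and its `where` method are hypothetical stand-ins, not Spark classes.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class ImmutableTable:
    rows: tuple

    def where(self, predicate):
        """Return a NEW table holding only the rows that satisfy predicate."""
        return ImmutableTable(tuple(r for r in self.rows if predicate(r)))


numbers = ImmutableTable(tuple(range(6)))
evens = numbers.where(lambda n: n % 2 == 0)

print(numbers.rows)  # the original table is unchanged
print(evens.rows)    # the transformation produced a new table

try:
    numbers.rows = ()  # attempting to modify the table in place fails
except FrozenInstanceError:
    print("cannot modify an immutable table")
```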


# Narrow Transformation
Transformations consisting of narrow dependencies (we’ll call them narrow transformations) are
those for which each input partition will contribute to only one output partition.
# Wide Transformation
A wide transformation (one with wide dependencies) has input partitions that contribute to many
output partitions. You will often hear this referred to as a shuffle, whereby Spark exchanges
partitions across the cluster. With narrow transformations, Spark automatically performs an
operation called pipelining: if we specify multiple filters on a DataFrame, they are all
performed in memory. The same cannot be said for shuffles. When we perform a shuffle,
Spark writes the results to disk.
![narrow vs wide](https://i.imgur.com/jJ4fypS.png)
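
The difference can be sketched in plain Python with lists standing in for partitions. This is a conceptual model only; `narrow_map` and `wide_shuffle` are hypothetical helpers, and the modulo-based routing is just one simple way to decide where a record goes.

```python
partitions = [[1, 2, 3], [4, 5, 6]]


def narrow_map(parts, fn):
    """Narrow: each input partition contributes to exactly one output partition."""
    return [[fn(x) for x in part] for part in parts]


def wide_shuffle(parts, key_fn, num_out):
    """Wide: records are rerouted by key, so one input partition feeds many outputs."""
    out = [[] for _ in range(num_out)]
    for part in parts:
        for x in part:
            out[key_fn(x) % num_out].append(x)
    return out


doubled = narrow_map(partitions, lambda x: x * 2)
by_parity = wide_shuffle(partitions, lambda x: x, 2)

print(doubled)    # partition boundaries are preserved
print(by_parity)  # records from both inputs are mixed in each output
```

In the narrow case each output partition can be computed from a single input partition with no data movement, which is what makes pipelining possible. In the wide case every output partition needs records from every input partition.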

# Lazy Evaluation
Lazy evaluation means that Spark will wait until the very last moment to execute the graph of
computation instructions. 

In Spark, instead of modifying the data immediately when you express some
operation, you build up a plan of transformations that you would like to apply to your source data. 

By waiting until the last minute to execute the code, Spark compiles this plan from your raw DataFrame
transformations to a streamlined physical plan that will run as efficiently as possible across the
cluster. 

This provides immense benefits because Spark can optimize the entire data flow from end to
end. 

An example of this is something called predicate pushdown on DataFrames. If we build a large
Spark job but specify a filter at the end that only requires us to fetch one row from our source data,
the most efficient way to execute this is to access the single record that we need. Spark will actually
optimize this for us by pushing the filter down automatically.
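
Lazy evaluation can be modeled in plain Python by recording transformations in a plan and running them only when an action is called. The class `LazyTable` below is a hypothetical illustration, not Spark's implementation, and it performs no optimization. It only shows that the full plan exists before any data is read, which is the precondition for optimizations such as predicate pushdown.

```python
class LazyTable:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, source, plan=()):
        self.source = source  # callable returning an iterable of rows
        self.plan = plan      # tuple of ("map" | "filter", function) steps

    def map(self, fn):
        return LazyTable(self.source, self.plan + (("map", fn),))

    def filter(self, fn):
        return LazyTable(self.source, self.plan + (("filter", fn),))

    def collect(self):
        """Action: execute the recorded plan and return the rows as a list."""
        rows = self.source()
        for kind, fn in self.plan:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


reads = []


def source():
    reads.append("read")  # track when the source is actually read
    return range(10)


table = LazyTable(source).map(lambda n: n * n).filter(lambda n: n % 2 == 0)
reads_before_action = len(reads)  # transformations alone read nothing

result = table.collect()          # the action triggers execution
print(reads_before_action, len(reads), result)
```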


# Actions
- To trigger the computation, we run an action. An action instructs Spark to compute a result from a series of transformations.
- There are three kinds of actions:
  - actions to view data in the console,
  - actions to collect data to native objects in the respective language,
  - actions to write to output data sources.
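
The three kinds of actions can be illustrated with plain-Python stand-ins. The functions `show`, `collect`, and `write_csv` below are hypothetical helpers written for this sketch and operate on an ordinary list of rows, with an in-memory buffer assumed as the output data source.

```python
import csv
import io

rows = [{"number": n} for n in range(5)]  # stands in for a computed result


def show(rows, n=2):
    """View: print the first n rows to the console."""
    for row in rows[:n]:
        print(row)


def collect(rows):
    """Collect: return the rows as native Python objects."""
    return list(rows)


def write_csv(rows, sink):
    """Write: send the rows to an output data source (any file-like object)."""
    writer = csv.DictWriter(sink, fieldnames=["number"])
    writer.writeheader()
    writer.writerows(rows)


show(rows)
native = collect(rows)
sink = io.StringIO()
write_csv(rows, sink)
```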


# DataFrame Example
A DataFrame consists of a series of records (like rows in a table) that are of type Row,
and a number of columns (like columns in a spreadsheet) that represent a computation expression that
can be performed on each individual record in the Dataset.

Schemas define the name as well as the
type of data in each column. 

Partitioning defines the layout of the DataFrame or
Dataset’s physical distribution across the cluster.

The partitioning scheme defines how rows are
allocated to partitions. You can set this to be based on values in a certain column or nondeterministically.
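
The two kinds of partitioning scheme can be sketched in plain Python. This is an illustration under stated assumptions, not how Spark assigns partitions: the sample rows, the `country` column, and both helper functions are hypothetical, and the column-based version uses a simple sorted-key mapping where a real system would typically hash the value.

```python
import random

countries = ["US", "CA", "US", "MX", "CA", "US"]
rows = [{"id": i, "country": c} for i, c in enumerate(countries)]


def partition_by_column(rows, column, num_partitions):
    """Rows with the same column value always land in the same partition."""
    keys = sorted({row[column] for row in rows})
    slot = {key: i % num_partitions for i, key in enumerate(keys)}
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[slot[row[column]]].append(row)
    return parts


def partition_randomly(rows, num_partitions, rng=None):
    """Nondeterministic: each row goes to a randomly chosen partition."""
    rng = rng or random.Random()
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[rng.randrange(num_partitions)].append(row)
    return parts


by_country = partition_by_column(rows, "country", 2)
random_parts = partition_randomly(rows, 2)

print([[r["country"] for r in p] for p in by_country])
print([len(p) for p in random_parts])
```

Partitioning by a column keeps related rows together, which helps operations that group on that column. Random placement gives no such guarantee and its layout can differ from run to run.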

```python
# Assumes an active SparkSession named `spark`, as in a notebook or the pyspark shell.
range_df = spark.range(500).toDF("number")
range_df.take(2)
# [Row(number=0), Row(number=1)]
```