
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

# Apache Spark Overview
* Short History of Apache Spark
* Who is Databricks?
* What is Apache Spark?
* A Unifying Engine
* The RDD
* DataFrames, Datasets & SQL
* Scala, Python, Java, R & SQL
* The Cluster: Drivers, Executors, Slots & Tasks
* Quick Note on Jobs & Stages
* Quick Note on Cluster Management
* Local Mode & Databricks CE
* Architectural & Administrative Topics

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Short History of Apache Spark
* <a href="https://en.wikipedia.org/wiki/Apache_Spark" target="_blank">Apache Spark</a> started in 2009 as a research project at the 
University of California, Berkeley's AMPLab, led by <a href="https://en.wikipedia.org/wiki/Matei_Zaharia" target="_blank">Matei Zaharia</a>.
* In 2010, the project was open sourced under a BSD license.
* In 2013, the project was
  * donated to the Apache Software Foundation
  * relicensed under the Apache 2.0 license
* In February 2014, Spark became a Top-Level <a href="https://spark.apache.org/" target="_blank">Apache Project</a>.
* In November 2014, Spark founder <a href="https://en.wikipedia.org/wiki/Matei_Zaharia" target="_blank">Matei Zaharia</a>'s 
company <a href="https://databricks.com" target="_blank">Databricks</a> set a new world record in large-scale sorting using Spark.
* Latest stable release: <a href="https://spark.apache.org/downloads.html" target="_blank">CLICK-HERE</a>
* 600,000+ lines of code (75% Scala)
* Built by 1,000+ developers from 250+ organizations

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Who is Databricks?
* <a href="https://databricks.com" target="_blank">Databricks</a> was started by Spark founder <a href="https://en.wikipedia.org/wiki/Matei_Zaharia" target="_blank">Matei Zaharia</a>.
* Today, Databricks remains the #1 contributor to Apache Spark.
* Fully committed to maintaining Apache Spark as an Open Source project.
* *"Provides a Unified Analytics Platform that accelerates innovation by unifying data science, engineering, and business."*
  * Databricks Workspace - Interactive Data Science & Collaboration.
  * Databricks Workflows - Production Jobs & Workflow Automation.
  * Databricks Runtime
  * Databricks I/O (DBIO) - Optimized Data Access Layer
  * Databricks Serverless - Fully Managed Auto-Tuning Platform
  * Databricks Enterprise Security (DBES) - End-To-End Security & Compliance
* Actively involved with the Apache Spark community:
  * Private & Public Training
  * Consulting Services
  * Hosting Meetups
  * Blogs, Articles, Videos
  * And Much More!

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) What is Apache Spark?

Spark is a unified processing engine that can analyze big data using SQL, machine learning, graph processing or real-time stream analysis:

![Spark Engines](https://files.training.databricks.com/images/wiki-book/book_intro/spark_4engines.png)
<br/>
<br/>
* At its core is the Spark Engine.
* The DataFrames API provides an abstraction above RDDs while simultaneously improving performance 5-20x over traditional RDDs with its Catalyst Optimizer.
* Spark ML provides high quality and finely tuned machine learning algorithms for processing big data.
* The Graph processing API gives us an easily approachable API for modeling pairwise relationships between people, objects, or nodes in a network.
* The Streaming APIs give us End-to-End Fault Tolerance, with Exactly-Once semantics, and the possibility for sub-millisecond latency.

And it all works together seamlessly!

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) A Unifying Engine

And as a compute engine, Apache Spark is not tied to a specific environment or data warehouse strategy.

![Unified Engine](https://files.training.databricks.com/images/105/unified-engine.png)
<br/>
<br/>
* Built upon the Spark Core
* Apache Spark is data and environment agnostic.
* Languages: **Scala, Java, Python, R, SQL**
* Environments: **YARN, Docker, EC2, Mesos, OpenStack, Databricks (our favorite), DigitalOcean, and much more...**
* Data Sources: **Hadoop HDFS, Cassandra, Kafka, Apache Hive, HBase, JDBC (PostgreSQL, MySQL, etc.), CSV, JSON, Azure Blob, Amazon S3, Elasticsearch, Parquet, and much, much more...**

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) RDDs
* The primary data abstraction of the Spark engine is the RDD: Resilient Distributed Dataset
  * Resilient, i.e., fault-tolerant: using the RDD lineage graph, Spark can recompute partitions that are missing or damaged due to node failures.
  * Distributed, i.e., the data resides on multiple nodes in a cluster.
  * Dataset, i.e., a collection of partitioned data holding primitive values or compound values, e.g., tuples or other objects.
* The original paper that gave birth to the concept of RDD is <a href="https://cs.stanford.edu/~matei/papers/2012/nsdi_spark.pdf" target="_blank">Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing</a> by Matei Zaharia et al.
* Starting with Spark 2.0, we treat RDDs as the assembly language of the Spark ecosystem.
* DataFrames, Datasets & SQL provide the higher level abstraction over RDDs.
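
The "resilient" part of an RDD can be illustrated without Spark at all. What follows is a minimal plain-Python sketch, not Spark's implementation: `ToyRDD` is a hypothetical class invented for this example. It keeps its partitions together with its lineage (the parent dataset and the function that was applied), so a lost partition can be recomputed from the parent. Unlike real Spark, the sketch computes eagerly to keep the code short.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Hypothetical stand-in for an RDD: partitioned data plus lineage."""

    def __init__(self, partitions: List[Optional[list]],
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable] = None):
        self.partitions = partitions  # one list per partition; None means "lost"
        self.parent = parent          # lineage: the dataset this one was derived from
        self.fn = fn                  # lineage: the per-element function that was applied

    def map(self, fn: Callable) -> "ToyRDD":
        # Computed eagerly here for simplicity (an assumption of this sketch).
        new_parts = [[fn(x) for x in part] for part in self.partitions]
        return ToyRDD(new_parts, parent=self, fn=fn)

    def get_partition(self, index: int) -> list:
        part = self.partitions[index]
        if part is None:
            if self.parent is None:
                raise RuntimeError("source partition lost; no lineage to replay")
            # Recompute only the missing partition by replaying the lineage.
            part = [self.fn(x) for x in self.parent.get_partition(index)]
            self.partitions[index] = part
        return part

    def collect(self) -> list:
        # Gather every partition, recomputing any that were lost.
        return [x for i in range(len(self.partitions))
                for x in self.get_partition(i)]


source = ToyRDD([[1, 2], [3, 4], [5, 6]])
squared = source.map(lambda x: x * x)

squared.partitions[1] = None   # simulate losing one partition to a node failure
recovered = squared.collect()
print(recovered)               # [1, 4, 9, 16, 25, 36]
```

Only the lost partition is rebuilt; the other two are returned as they were.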

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Scala, Python, Java, R & SQL
* Besides being able to run in many environments...
* Apache Spark makes the platform even more approachable by supporting multiple languages:
  * Scala - Apache Spark's primary language.
  * Python - More commonly referred to as PySpark
  * R - <a href="https://spark.apache.org/docs/latest/sparkr.html" target="_blank">SparkR</a> (R on Spark)
  * Java
  * SQL - Closer to ANSI SQL 2003 compliance
    * Now running all 99 TPC-DS queries
    * New standards-compliant parser (with good error messages!)
    * Subqueries (correlated & uncorrelated)
    * Approximate aggregate stats
* With the older RDD API, each language's implementation differs significantly, most notably in performance.
* With the newer DataFrames API, the performance differences between languages are nearly nonexistent (especially for Scala, Java & Python).
* That said, not all languages get the same amount of love - even so, the API gap between languages is closing rapidly, especially between Spark 1.x and 2.x.

![RDD vs DataFrames](https://files.training.databricks.com/images/105/rdd-vs-dataframes.png)


##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) The Cluster: Drivers, Executors, Slots & Tasks
![Spark Physical Cluster, slots](https://files.training.databricks.com/images/105/spark_cluster_slots.png)

* The **Driver** is the JVM in which our application runs.
* The secret to Spark's awesome performance is parallelism.
  * Scaling vertically is limited by the finite RAM, threads and CPU speed of a single machine.
  * Scaling horizontally means we can simply add new "nodes" to the cluster almost endlessly.
* We parallelize at two levels:
  * The first level of parallelization is the **Executor** - a Java virtual machine running on a node, typically, one instance per node.
  * The second level of parallelization is the **Slot** - the number of which is determined by the number of cores and CPUs of each node.
* Each **Executor** has a number of **Slots** to which the **Driver** can assign parallelized **Tasks**.

![Spark Physical Cluster, tasks](https://files.training.databricks.com/images/105/spark_cluster_tasks.png)
<br/>
<br/>
* The JVM is naturally multithreaded, but a single JVM, such as our **Driver**, has a finite upper limit.
* By creating **Tasks**, the **Driver** can assign units of work to **Slots** for parallel execution.
* Additionally, the **Driver** must also decide how to partition the data so that it can be distributed for parallel processing (not shown here).
* Consequently, the **Driver** assigns a **Partition** of data to each **Task** - in this way each **Task** knows which piece of data it is to process.
* Once started, each **Task** will fetch from the original data source the **Partition** of data assigned to it.
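
The relationship between the Driver, Slots, Tasks and Partitions described above can be sketched in plain Python. This is only an analogy under stated assumptions, not how Spark is implemented: a thread pool stands in for the Slots, the cluster shape (`NUM_EXECUTORS`, `SLOTS_PER_EXECUTOR`) is made up, and the helper names `partition` and `run_task` are hypothetical. For brevity the partition is handed directly to the task, whereas a real Task fetches its Partition from the data source.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster shape, assumed for illustration only.
NUM_EXECUTORS = 2
SLOTS_PER_EXECUTOR = 2
TOTAL_SLOTS = NUM_EXECUTORS * SLOTS_PER_EXECUTOR

data = list(range(1, 21))
NUM_PARTITIONS = 4


def partition(records, n):
    """Driver side: split the records into n roughly equal partitions."""
    return [records[i::n] for i in range(n)]


def run_task(task_id, part):
    """Slot side: a task processes exactly the partition assigned to it."""
    return task_id, sum(part)


partitions = partition(data, NUM_PARTITIONS)

# The thread pool plays the role of the slots; each submit() is the driver
# handing one task, together with its partition, to a free slot.
with ThreadPoolExecutor(max_workers=TOTAL_SLOTS) as slots:
    futures = [slots.submit(run_task, i, p) for i, p in enumerate(partitions)]
    partial_sums = dict(f.result() for f in futures)

# The driver combines the per-task results.
total = sum(partial_sums.values())
print(partial_sums)  # {0: 45, 1: 50, 2: 55, 3: 60}
print(total)         # 210
```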

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Quick Note on Jobs & Stages
* Each parallelized action is referred to as a **Job**.
* The results of each **Job** (parallelized/distributed action) are returned to the **Driver**.
* Depending on the work required, multiple **Jobs** may be needed.
* Each **Job** is broken down into **Stages**. 
* This would be analogous to building a house (the job)
  * The first stage would be to lay the foundation.
  * The second stage would be to erect the walls.
  * The third stage would be to add the roof.
  * Attempting any of these steps out of order just won't make sense, and may be outright impossible.
  
***Note:*** *We will be going much deeper into Jobs & Stages and the*<br/>
*effect they have on our software as we progress through this class.*
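
The ordering constraint between Stages can be sketched in plain Python. This is an illustration only: the two stages below are written by hand, whereas Spark works out the stages of a Job itself, and the function names (`stage_one_task`, `stage_two_task`, `run_job`) are hypothetical. Within a stage the tasks run in parallel; the second stage cannot start until every task of the first has finished.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Three partitions of a tiny word dataset (made up for the example).
partitions = [["a", "b", "a"], ["b", "c"], ["a", "c", "c"]]


def stage_one_task(part):
    """Stage 1: count the words within a single partition."""
    return Counter(part)


def stage_two_task(partial_counts):
    """Stage 2: merge the per-partition counts produced by stage 1."""
    merged = Counter()
    for counts in partial_counts:
        merged.update(counts)
    return merged


def run_job(parts, slots=2):
    """Run the two stages in order, like laying the foundation before the walls."""
    with ThreadPoolExecutor(max_workers=slots) as pool:
        # Stage 1: one task per partition, run in parallel.
        # list() blocks until every stage-1 task has finished...
        stage_one_results = list(pool.map(stage_one_task, parts))
        # ...so stage 2 only starts once its inputs are complete.
        future = pool.submit(stage_two_task, stage_one_results)
        return dict(future.result())


result = run_job(partitions)
print(result)  # {'a': 3, 'b': 2, 'c': 3}
```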

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Quick Note on Cluster Management

* At a much lower level, Spark Core employs a **Cluster Manager** that is responsible for provisioning nodes in our cluster.
  * Databricks provides a robust, high-performing **Cluster Manager** as part of its overall offerings.
  * Additional Cluster Managers are available for 
    <a href="https://spark.apache.org/docs/latest/running-on-mesos.html" target="_blank">Mesos</a>,
    <a href="https://spark.apache.org/docs/latest/running-on-yarn.html" target="_blank">Yarn</a> and by other third parties.
  * In addition to this, Spark has a <a href="https://spark.apache.org/docs/latest/spark-standalone.html" target="_blank">Standalone</a> mode in which you manually configure each node.
* In each of these scenarios, the **Driver** is [presumably] running on one node, with the **Executors** running on N different nodes.
* For the sake of this class, we don't need to concern ourselves with cluster management.
  * Yay, Databricks!
* From a developer's and student's perspective, my primary focus is on...
  * The number of **Partitions** my data is divided into.
  * The number of **Slots** I have for parallel execution.
  * The number of **Jobs** I am triggering.
  * And lastly, the **Stages** those jobs are divided into.
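
Why the number of Partitions and the number of Slots matter together can be shown with a little arithmetic. The sketch below assumes one Task per Partition and, as a simplification, that all Tasks take about the same time; `waves` and `idle_slots_in_last_wave` are hypothetical helper names, not part of any Spark API.

```python
import math


def waves(num_partitions: int, num_slots: int) -> int:
    """Rounds needed to run one task per partition on num_slots slots."""
    if num_slots <= 0:
        raise ValueError("num_slots must be positive")
    return math.ceil(num_partitions / num_slots)


def idle_slots_in_last_wave(num_partitions: int, num_slots: int) -> int:
    """Slots left with nothing to do during the final round."""
    remainder = num_partitions % num_slots
    return 0 if remainder == 0 else num_slots - remainder


# With 8 slots: 8 partitions fit in one round, 9 partitions need a second
# round that keeps only one slot busy, and 16 partitions fill two rounds.
for parts in (8, 9, 16):
    print(parts, waves(parts, 8), idle_slots_in_last_wave(parts, 8))
# 8 1 0
# 9 2 7
# 16 2 0
```

Under these assumptions, one extra partition beyond a multiple of the slot count roughly doubles the run time from one round to two while leaving most of the cluster idle.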

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Architectural & Administrative Topics

Architectural & administrative topics go beyond the scope of this class. Our goal is to focus<br/>
on the core components of Spark that you need to know to get started developing applications.<br/>

**Examples include:**
* Which cluster manager should I use?
* How should I configure the Executor's JVM for maximum performance?
* What is the moral implication of setting the *spark.executor.logs.rolling.strategy* parameter to "time"?
* Why does it make kittens cry in China when I run Apache Spark with the *spark.pet.kitten* flag set to true?

We will be discussing the internals of Apache Spark as it relates to a developer's role - it's not strictly about the API.

And we don't want to leave you hanging!

If you do have an advanced, kitten-type question, we encourage you to post it to this class' Q&A. 

An instructor or engineer will do their best to answer your question or, at the very least, point you towards a solution.

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>