# Distributed Computing Part 2 🗄️🔀

This short lecture focuses on the high-level API components of Spark, which make it easy to use for a variety of tasks.

## What will you learn in this course? 🧐🧐

In this course we will quickly review the different components of Spark and what they are useful for:

* The Spark stack
    * Spark Core - the main functionalities of the framework
    * Spark SQL - to handle structured data and run queries
    * GraphX - Spark's toolbox for graph data structures
    * MLlib - The machine learning toolbox for Spark
    * Spark Streaming - An API to handle continuous inflow of data

## The Spark Stack ✨⚙️

One of Spark's promises is to deliver a unified analytics system. On top of its powerful distributed processing engine (Spark Core) sits a collection of higher-level libraries that all benefit from the strengths of the core library, such as low latency and lazy execution.

*That's true in general, but with some caveats: in particular, Spark Streaming's performance can't rival that of Storm and Flink, two other frameworks for running streaming jobs.*

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/spark-stack-oreilly-674376df-ecdf-45f2-8ef7-539393568c0e.png" />

Source: Learning Spark (O'Reilly - Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia)
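
The lazy execution mentioned above can be illustrated with a small plain-Python analogy. This is only a sketch and does not use Spark: the `LazyPipeline` class and its `map`, `filter` and `collect` methods are hypothetical names chosen for illustration. The idea it shows is that transformations are merely recorded, and nothing is computed until a result is explicitly requested.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Step = Callable[[Iterator[int]], Iterator[int]]


class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark class)."""

    def __init__(self, source: Iterable[int], steps: Tuple[Step, ...] = ()) -> None:
        self._source = source
        self._steps = steps

    def map(self, fn: Callable[[int], int]) -> "LazyPipeline":
        # Record the transformation; no element is processed here.
        return LazyPipeline(self._source, self._steps + (lambda it: (fn(x) for x in it),))

    def filter(self, pred: Callable[[int], bool]) -> "LazyPipeline":
        # Same idea: only the recipe grows, no work is done yet.
        return LazyPipeline(self._source, self._steps + (lambda it: (x for x in it if pred(x)),))

    def collect(self) -> List[int]:
        # The "action": only now are the recorded steps actually executed.
        it: Iterator[int] = iter(self._source)
        for step in self._steps:
            it = step(it)
        return list(it)


calls: List[int] = []


def double(x: int) -> int:
    calls.append(x)  # track when the function really runs
    return x * 2


pipeline = LazyPipeline(range(5)).map(double).filter(lambda x: x > 2)
calls_before_collect = len(calls)  # still 0: nothing has been computed
result = pipeline.collect()        # triggers the computation
```

Because the whole chain of steps is known before anything runs, an engine built this way has the opportunity to plan the work as a whole rather than step by step.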

### Spark Core 💖

Spark Core is the underlying general execution engine for the Spark platform that all other functionalities are built on top of.

It provides many core functionalities such as task dispatching and scheduling, memory management and basic I/O (input/output) functionalities, exposed through an application programming interface.

### Spark SQL 🔢

Spark module for structured data processing.

Spark SQL provides a programming abstraction called DataFrame and can also act as a distributed SQL query engine. DataFrames are the other main data format in Spark, alongside RDDs. Spark DataFrames are column oriented and carry a data schema that describes the name and type of every available column. This schema allows for easier processing but adds constraints on the cleanliness and structure of the data.

Although they're called "DataFrames", Spark's DataFrames are quite different from the pandas DataFrames that you might be familiar with.
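
To make the role of a schema more concrete, here is a minimal plain-Python sketch of a column-oriented table that enforces column names and types. It is an analogy only: `ToyFrame`, `append` and `select` are hypothetical names and do not correspond to Spark's DataFrame API.

```python
from typing import Any, Dict, List


class ToyFrame:
    """Column-oriented table with an explicit schema (illustrative, not Spark's DataFrame)."""

    def __init__(self, schema: Dict[str, type]) -> None:
        self.schema = dict(schema)
        # One list per column: values of the same column are stored together.
        self.columns: Dict[str, List[Any]] = {name: [] for name in schema}

    def append(self, row: Dict[str, Any]) -> None:
        # The schema constrains which columns a row may contain...
        if set(row) != set(self.schema):
            raise ValueError(f"expected columns {sorted(self.schema)}, got {sorted(row)}")
        # ...and which type each value must have.
        for name, expected in self.schema.items():
            if not isinstance(row[name], expected):
                raise TypeError(f"column {name!r} expects {expected.__name__}")
        for name, value in row.items():
            self.columns[name].append(value)

    def select(self, name: str) -> List[Any]:
        # Reading one column does not touch the others.
        return list(self.columns[name])


people = ToyFrame({"name": str, "age": int})
people.append({"name": "Ada", "age": 36})
people.append({"name": "Linus", "age": 54})
ages = people.select("age")

try:
    people.append({"name": "Grace", "age": "unknown"})  # wrong type for "age"
    rejected = False
except TypeError:
    rejected = True
```

The rejected row illustrates the trade-off described above: the schema makes column-wise processing simple, but dirty or loosely structured data has to be cleaned before it fits.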

At the core of Spark SQL is the Catalyst optimizer, which leverages advanced programming language features (such as Scala's pattern matching and quasiquotes) to build an extensible query optimizer.

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/Catalyst-Optimizer-diagram-152974c4-e1fc-4bb5-a788-c1ee71657ecd.png" />

Source: [https://databricks.com/glossary/catalyst-optimizer](https://databricks.com/glossary/catalyst-optimizer)
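
As a rough intuition for how a rule-based optimizer can work, the sketch below applies a single rewrite rule (constant folding) to a tiny expression tree. It is written in plain Python and is not Catalyst's actual code or API: the `Literal`, `Attribute` and `Add` classes and the `fold_constants` function are hypothetical names for illustration, and `isinstance` checks stand in for Scala's pattern matching.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str  # a column reference whose value is unknown at planning time


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]


def fold_constants(expr: Expr) -> Expr:
    """Rewrite the tree bottom-up, replacing Add(Literal, Literal) with a single Literal."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)
        return Add(left, right)
    # Literals and attributes are leaves: nothing to rewrite.
    return expr


# Represents the expression x + (1 + 2)
tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
optimized = fold_constants(tree)  # the constant part is computed once, ahead of execution
```

An extensible optimizer can be thought of as a collection of many such rules, each matching a pattern in the tree and returning an equivalent but cheaper tree.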

### GraphX 📊

Spark module for graph data structures.

GraphX is a graph computation engine built on top of Spark that enables users to interactively build, transform and reason about graph-structured data at scale.

### MLlib 🔮

Machine Learning library for Spark, inspired by Scikit-Learn (in particular, its pipelines system).

Historically an RDD-based API, it now comes with a DataFrame-based API that has become the primary API, while the RDD-based API is now in [maintenance mode](https://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api).

### Spark Streaming 🌊

Spark module for stream processing.

Streaming, also called stream processing, is used to query a continuous data stream and process the data within a short time of receiving it. This is the opposite of batch processing, which occurs at a previously scheduled time, independently of the data influx.

Spark Streaming uses Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches of data. This design enables the same application code written for batch analytics to be used in streaming analytics. It comes at the cost of having to wait for the full mini-batch to be processed, whereas alternatives like Apache Storm and Apache Flink process data event by event and provide lower latency.
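
The mini-batch idea can be sketched in plain Python as follows. This is a simplified analogy, not Spark Streaming code: `word_count` and `micro_batches` are hypothetical helpers, and batches are cut by number of events here, whereas Spark Streaming cuts them by time interval.

```python
from typing import Dict, Iterable, Iterator, List


def word_count(lines: List[str]) -> Dict[str, int]:
    """Batch logic: count words in a finite list of lines."""
    counts: Dict[str, int] = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group incoming events into small lists (assumption: size-based, not time-based)."""
    batch: List[str] = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the last, possibly incomplete, batch


events = ["a b", "b c", "c c", "a"]

# Batch analytics: the whole dataset is processed at once.
batch_result = word_count(events)

# Streaming analytics: the very same function is applied to each mini-batch.
stream_results = [word_count(batch) for batch in micro_batches(iter(events), batch_size=2)]
```

Note that no result for the first event is available until its whole mini-batch has been collected and processed, which is the latency cost described above.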

## Resources 📚📚

- [What is Spark SQL](https://databricks.com/glossary/what-is-spark-sql)
- [Deep dive into Spark SQL's Catalyst Optimizer](https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html)
- [SparkSqlAstBuilder](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-SparkSqlAstBuilder.html)
- [A Gentle Introduction to Stream Processing](https://medium.com/stream-processing/what-is-stream-processing-1eadfca11b97)