# Using Apache Spark
Spark applications run as independent sets of processes on a cluster, coordinated by the **SparkContext object** in your main program (called the driver program).

![Spark Architecture](http://spark.apache.org/docs/latest/img/cluster-overview.png)

The SparkContext connects to a cluster manager, which allocates resources across applications.

Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. 

Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. 

Finally, SparkContext sends tasks to the executors to run.


In [7]:
sc

<pyspark.context.SparkContext at 0x3bf27b8>

**Interactive programming** is the practice of writing parts of a program while it is **already active**. The Jupyter Notebook will be the frontend for our **active program**.

For interactive programming we will have:
* A Jupyter/IPython notebook: where we run Python code
* PySparkShell application UI: to monitor the Spark cluster

## Monitoring Spark Jobs

Every SparkContext launches its own instance of the Web UI, which is available at http://[driver]:4040 by default.

The Web UI comes with the following tabs:

* Jobs
* Stages
* Storage, with RDD size and memory use
* Environment
* Executors
* SQL

By default, this information is available only while the application is running.

### Jobs
The Jobs tab lists the following for each job:

* Job id
* Description
* Submission date
* Job duration
* Stages
* Tasks

![Jobs Page](https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/images/spark-webui-jobs.png)

### Stages
**What is a stage?**

A stage is a physical unit of execution. It is a step in a physical execution plan.

A stage is a set of parallel tasks, one per partition of an RDD, that compute partial results of a function executed as part of a Spark job.

![Stages](https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/diagrams/stage-tasks.png)



In other words, a Spark job is a computation sliced into stages.

A stage is uniquely identified by its `id`. When a stage is created, `DAGScheduler` increments the internal counter `nextStageId` to track the number of stage submissions.

A stage can only work on the partitions of a single RDD (identified by `rdd`), but it can be associated with many other dependent parent stages (via the internal field `parents`). The boundary of a stage is marked by shuffle dependencies.

![Stages](https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/images/spark-webui-stages-completed.png)

### Storage

The Storage page lets us see how persisted RDDs are partitioned across the cluster.

![Storage Page](https://github.com/f-guitart/data_mining/blob/master/notes/img/apache_spark_storage_page.png?raw=true)

### Environment 

This tab shows the configuration properties and variables used by the running Spark application.

### Executors

This tab shows information about the executors available in the cluster.

It includes CPU and memory details, as well as RDD storage.

It also shows information about the tasks each executor has run.

![Executors Page](https://github.com/f-guitart/data_mining/blob/master/notes/img/apache_spark_executors_page.png?raw=true)

### SQL

By default, the SQL tab lists all SQL query executions. Once a query is selected, it displays the details of that query's execution.

## Main Spark Concepts

### Partitions
Spark’s basic abstraction is the **Resilient Distributed Dataset, or RDD**: a collection of elements fragmented across the nodes of the cluster.

That fragmentation is what enables Spark to execute in parallel, and the level of fragmentation is a function of the number of **partitions** of your RDD.

### Caching

You will often hear: "Apache Spark handles all data in memory".

This is only partly true, and it is where the magic lies. Most of the time you work with metadata rather than with the data itself, and computations are deferred until you actually need the results.

Storing those results or computing them again each time has a high impact on response times. Storing the results is called **caching the RDD**.

### 


[992023968016L, 984095744256L, 976215137296L, 968381956096L, 960596010000L]