# Spark Introduction

In this chapter we will cover the following topics:

- What is Spark?
- Spark Shell
- Spark & Jupyter Notebook
- Spark execution
- Spark context

# What is Spark?

 - Spark is a second-generation Big Data platform
 - The first generation was based on MapReduce: powerful, but cumbersome to use and subject to technical limitations.
 - Spark uses a new architecture and leverages functional programming: these combine to make things much easier to implement and perform better than earlier frameworks.
 - Emerged from the AMPLab at UC Berkeley, at the same time as Mesos.
 - Spark is now an Apache project.
 - It's now bundled with Cloudera, although you might need to add a separate parcel for Spark 2.
 - Evolving very quickly.
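
To make the functional style mentioned above concrete, here is a minimal word-count sketch in plain Python, with no Spark involved. The sample `lines` and the helper `add_pair` are made up for illustration. The three steps have the same shape as the `flatMap`, `map` and `reduceByKey` operations on Spark's RDD API, but this is only an analogy that runs on a single machine.

```python
from functools import reduce

# Hypothetical input: in Spark this would be a distributed dataset.
lines = [
    "spark makes big data simple",
    "big data needs big clusters",
]

# "flatMap" step: split every line into words and flatten the result.
words = [word for line in lines for word in line.split()]

# "map" step: pair each word with a count of 1.
pairs = map(lambda word: (word, 1), words)

# "reduceByKey" step: add up the counts for each distinct word.
def add_pair(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

word_counts = reduce(add_pair, pairs, {})
print(word_counts)
```

Each step is a small function applied to a whole collection, which is what lets a framework such as Spark decide where and how to run it.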

# Spark advantages

 - Distributed processing and cluster computing
     - Application processes are distributed across a cluster of worker nodes
     - Data is also distributed across the worker nodes => run computation where the data lives
 - Works with distributed storage
     - supports data locality
 - Data in memory
     - configurable persistence for efficient iteration
 - High level programming framework
     - programmers can focus on logic, not plumbing
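
As a rough illustration of the "data in memory" point, the sketch below marks a derived dataset for caching and then reuses it over several passes. The function name `repeated_sums` is hypothetical, and `sc` is assumed to be a live `SparkContext` such as the one the Spark shell provides.

```python
def repeated_sums(sc, passes=3):
    """Sum the same derived dataset several times, caching it for reuse.

    `sc` is assumed to be a live SparkContext (for example the shell's `sc`).
    """
    squares = sc.parallelize(range(1000)).map(lambda x: x * x)
    # Mark the dataset for in-memory persistence. Nothing is computed yet;
    # the first action below fills the cache, later passes can reuse it.
    squares.cache()
    return [squares.sum() for _ in range(passes)]
```

In the `pyspark` shell this could be called as `repeated_sums(sc)`. Without the `cache()` call each pass would recompute the `map` step from scratch.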
     

# Spark language support

Spark is written in Scala (a functional programming language that runs on the JVM) and provides programming interfaces for the following languages:

- Scala
- Python
- Java
- R


# Working with Spark

There are several options to work with Spark.

- Spark shell (a REPL in Python or Scala)
- Jupyter notebooks
- Spark applications

# Spark Shell for Python

Spark provides a shell for interactive work. To launch it, use the `pyspark` command, which starts a Python REPL connected to Spark:

```
Python 3.5.3 | packaged by conda-forge | (default, Jan 24 2017, 06:45:37)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.54)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 3.5.3 (default, Jan 24 2017 06:45:37)
SparkSession available as 'spark'.
>>>
>>> sc
<pyspark.context.SparkContext object at 0x103ba6d90>
>>> sqlContext
<pyspark.sql.context.SQLContext object at 0x103d7a790>
>>> spark
<pyspark.sql.session.SparkSession object at 0x103d7a590>
>>> exit()
```

# Spark Shell for Scala

Spark provides a shell for interactive work. To launch it, use the `spark-shell` command, which starts a Scala REPL connected to Spark:

```
% spark-shell 
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.1.114:4040
Spark context available as 'sc' (master = local[*], app id = local-1487929648200).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 

scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@37b1218

scala> spark
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@88e85a

scala> :quit
```

# Spark & Jupyter Notebook

Jupyter was formerly part of the IPython project, but is now standalone. It's an excellent way to work with Spark interactively. To start it:

    jupyter notebook

This will start a notebook server and direct your browser to it. Unlike the Spark shell, notebooks aren't connected to Spark by default. The most convenient way to connect is to use the `findspark` package.

```
# This will search in standard locations, but setting SPARK_HOME might be required.
import findspark
findspark.init()
import pyspark

spark = (
    pyspark.sql.SparkSession.builder
    .getOrCreate()
)
sc = spark.sparkContext
```

# Spark execution

Spark's execution environment is divided into the following components:

- Driver
    - Launches applications from outside or inside a cluster.
- Executors
    - Separate execution engines or containers on the worker nodes of a cluster.
    - Tasks (units of work) run within the executors.
- Cluster manager
    - Allocates computing resources (CPU/Memory) in the distributed system the Spark application is run on.
    - Examples are YARN, Mesos, and Spark's own standalone cluster manager.

![Spark execution model](images/spark-cluster.png "Spark execution model")
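
The division of labour between driver, executors and tasks can be sketched with a plain-Python analogy. This is not Spark code: a thread pool stands in for the executors, and the function names (`split_into_partitions`, `task`, `run_job`) are hypothetical. It only shows the flow of splitting work on the driver, running tasks elsewhere, and combining the results back on the driver.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(data, num_partitions):
    """Driver side: cut the data into slices, one per task."""
    return [data[i::num_partitions] for i in range(num_partitions)]

def task(partition):
    """Executor side: the unit of work applied to one partition."""
    return sum(x * x for x in partition)

def run_job(data, num_executors=2, num_partitions=4):
    partitions = split_into_partitions(data, num_partitions)
    # The pool stands in for executors allocated by a cluster manager.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_results = list(executors.map(task, partitions))
    # Back on the driver: combine the per-task results.
    return sum(partial_results)

total = run_job(list(range(10)))
print(total)
```

In a real cluster the partitions live on different machines and the tasks are shipped to them, but the driver still plans the job and collects the final result.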

# SparkContext

The `SparkContext` is the main entry point for Spark applications and a handle to the execution environment. Tab completion in the shell lists its attributes and methods:

```
In [2]: sc.
sc.PACKAGE_EXTENSIONS    sc.broadcast             sc.environment           sc.pickleFile
sc.setCheckpointDir      sc.startTime             sc.accumulator           sc.cancelAllJobs
sc.getLocalProperty      sc.profiler_collector    sc.setJobGroup           sc.statusTracker
sc.addFile               sc.cancelJobGroup        sc.hadoopFile            sc.pythonExec
sc.setLocalProperty      sc.stop                  sc.addPyFile             sc.clearFiles
sc.hadoopRDD             sc.pythonVer             sc.setLogLevel           sc.textFile
sc.appName               sc.defaultMinPartitions  sc.master                sc.range
sc.setSystemProperty     sc.union                 sc.applicationId         sc.defaultParallelism
sc.newAPIHadoopFile      sc.runJob                sc.show_profiles         sc.version
sc.binaryFiles           sc.dump_profiles         sc.newAPIHadoopRDD       sc.sequenceFile
sc.sparkHome             sc.wholeTextFiles        sc.binaryRecords         sc.emptyRDD
sc.parallelize           sc.serializer            sc.sparkUser       
```
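
A few of the attributes in that listing are handy for checking what a session is connected to. The helper below is a small sketch; the name `describe_context` is hypothetical, and `sc` is assumed to be a live `SparkContext` (the shell's `sc`, or `spark.sparkContext` in a notebook).

```python
def describe_context(sc):
    """Collect a few read-only facts about a SparkContext.

    `sc` is assumed to be a live SparkContext.
    """
    return {
        "version": sc.version,                        # Spark version string
        "master": sc.master,                          # e.g. a local or cluster URL
        "app_name": sc.appName,                       # name shown in the web UI
        "application_id": sc.applicationId,           # unique id for this run
        "default_parallelism": sc.defaultParallelism, # default number of partitions
    }
```

Calling `describe_context(sc)` in the `pyspark` shell returns a dictionary with these five entries. The values depend on how the session was started.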