# 04 Spark essentials

## Spark context

The notebook deployment includes Spark automatically within each Python notebook kernel. This means that, upon kernel instantiation, a [SparkContext](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext) object called `sc` is immediately available in the notebook, as in a PySpark shell. Let's take a look at it:

In [1]:
?sc

We can inspect some of the SparkContext properties:

In [1]:
# Spark version we are using
print sc.version

1.5.2


In [2]:
# Name of the application we are running
print sc.appName

PySparkShell


In [3]:
# Some configuration variables
print sc.defaultParallelism
print sc.defaultMinPartitions

1
1
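Both values follow from the local setup. As a rough sketch, assuming Spark's local-mode defaults (default parallelism equal to the number of cores on the machine, and `defaultMinPartitions` being the smaller of that value and 2), the numbers can be anticipated without Spark. The variable names below are illustrative and not part of the notebook:

```python
import multiprocessing

# Number of cores visible to this machine (the VM, in the notebook's case)
cores = multiprocessing.cpu_count()

# Assumption: in local mode Spark uses one task slot per available core
expected_parallelism = cores

# Assumption: defaultMinPartitions is min(defaultParallelism, 2)
expected_min_partitions = min(expected_parallelism, 2)

print(expected_parallelism)
print(expected_min_partitions)
```

On a single-core virtual machine both lines print 1, which matches the output above.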


In [4]:
# Username running all Spark processes
# --> Note this is a method, not a property
print sc.sparkUser()

spark-vm


## Spark configuration

In [5]:
# Print out the SparkContext configuration
print sc._conf.toDebugString()

spark.app.name=PySparkShell
spark.master=local[*]
spark.rdd.compress=True
spark.serializer.objectStreamReset=100
spark.submit.deployMode=client


In [6]:
# Another way to get similar information
from pyspark import SparkConf, SparkContext
SparkConf().getAll()

[(u'spark.master', u'local[*]'),
 (u'spark.submit.deployMode', u'client'),
 (u'spark.app.name', u'PySparkShell')]
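Since `getAll()` returns a list of key/value pairs, it can be turned into a dictionary for individual lookups. A minimal sketch, using the pairs printed above as literal sample data (the names `conf_pairs` and `conf` are illustrative); in a live kernel one would pass `SparkConf().getAll()` instead:

```python
# Sample data copied from the output above; in a live kernel this
# would be the result of SparkConf().getAll()
conf_pairs = [
    ('spark.master', 'local[*]'),
    ('spark.submit.deployMode', 'client'),
    ('spark.app.name', 'PySparkShell'),
]

# A list of (key, value) tuples converts directly into a dict
conf = dict(conf_pairs)

# Look up a key that is set
print(conf['spark.master'])

# dict.get returns a fallback for keys that are not set
print(conf.get('spark.executor.memory', '<not set>'))
```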

## Spark execution modes

The configuration shown above also tells us how Spark is being executed in this kernel:

In [7]:
print sc._conf.toDebugString()

spark.app.name=PySparkShell
spark.master=local[*]
spark.rdd.compress=True
spark.serializer.objectStreamReset=100
spark.submit.deployMode=client


The `spark.master` entry gives the execution mode for Spark. The default mode is *local*, i.e. all Spark processes run locally inside the launched virtual machine. This is fine for developing and testing with small datasets.
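To make the meaning of the `spark.master` value concrete, here is a small sketch that maps the most common master URL forms to a description. The helper `describe_master` is hypothetical (it is not a Spark API), it covers only a few common forms, and the host name `master-host` is a placeholder:

```python
import re

def describe_master(master):
    """Return a short description of a spark.master value (illustrative helper)."""
    if master == 'local':
        return 'local mode, 1 worker thread'
    # local[N] uses N worker threads; local[*] uses one per available core
    match = re.match(r'^local\[(\*|\d+)\]$', master)
    if match:
        threads = match.group(1)
        if threads == '*':
            return 'local mode, one worker thread per available core'
        return 'local mode, %s worker thread(s)' % threads
    if master.startswith('spark://'):
        return 'Spark standalone cluster at ' + master[len('spark://'):]
    if master.startswith('mesos://'):
        return 'Mesos cluster at ' + master[len('mesos://'):]
    if master.startswith('yarn'):
        return 'YARN cluster'
    return 'unrecognized master URL'

# The value reported by this kernel
print(describe_master('local[*]'))
# A placeholder standalone cluster address
print(describe_master('spark://master-host:7077'))
```

In the notebook, the value to pass would be the one reported under `spark.master` in the configuration.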

To run Spark applications on bigger datasets, however, they must be executed on a remote cluster. There are configuration modes for that, which require:
* defining the addresses of the cluster in the configuration (this can be adjusted in the *Vagrantfile* for the virtual machine, which must then be reprovisioned)
* network adjustments to make the VM "visible" from the cluster

This is ongoing work.