# Unit 6 Launching Spark Applications

## Contents

```
6.1 Creating a Spark application
6.2 Submitting an application to YARN
6.3 Simple application submission example
6.4 Cluster vs Client mode
6.5 Dynamic resource allocation
6.6 Adding dependencies
6.7 How-to install additional Python packages
6.8 Native Compression Libraries
6.9 How-to submit the application in the background
6.10 Overriding configuration directory
6.11 Complex application submission example
6.12 Running an interactive shell
```

## Creating a Spark application
An application is very similar to a notebook, but there are some minor changes that must be applied.

The interactive notebook automatically creates the SparkContext (sc) and a SparkSession (spark), but in a standard application you must create them yourself:

    from pyspark.sql import SparkSession
    from pyspark import SparkContext

    if __name__ == '__main__':
        spark = SparkSession \
            .builder \
            .appName('My Application') \
            .getOrCreate()
        sc = spark.sparkContext
        # ...
        # Application specific code
        # ...
        spark.stop()

## Submitting an application to YARN

To submit an application to YARN you use the **spark-submit** utility:

```
spark-submit
  --name NAME                 A name of your application.

  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").

  --num-executors NUM         Number of executors to launch (Default: 2).
  
  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
                              
  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)
```

The main options to take into account for resource allocation are:

* The `--num-executors` (spark.executor.instances configuration property) option controls how many executors are allocated for the application on the cluster.
* The `--executor-memory` (spark.executor.memory configuration property) option controls the memory allocated per executor.
* The `--executor-cores` (spark.executor.cores configuration property) option controls the cores allocated per executor.
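
As a rough illustration of how these three options combine, the following sketch estimates the resources requested for the executors. The function name `estimate_request` is hypothetical. The overhead rule (the larger of 384 MiB or 10% of the executor memory) is an assumption based on Spark's documented default for the executor memory overhead on YARN, which your cluster configuration may override. The driver is not counted.

```python
def estimate_request(num_executors, executor_cores, executor_memory_mib,
                     overhead_factor=0.10, min_overhead_mib=384):
    """Estimate the cores and memory (in MiB) requested for the executors.

    Assumes the default memory overhead rule; the driver is not included.
    """
    # Each executor container holds the executor memory plus an overhead
    overhead_mib = max(min_overhead_mib,
                       int(executor_memory_mib * overhead_factor))
    container_mib = executor_memory_mib + overhead_mib
    return {
        'total_cores': num_executors * executor_cores,
        'memory_per_container_mib': container_mib,
        'total_memory_mib': num_executors * container_mib,
    }


# Example: --num-executors 4 --executor-cores 2 --executor-memory 4G
print(estimate_request(4, 2, 4096))
```

YARN may round each container up to a multiple of its minimum allocation size, so the real figures can be somewhat higher than this estimate.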



## Simple application submission example

In the simplest case we can just use:

    spark-submit test.py

But in general we will frequently use:

    spark-submit --deploy-mode cluster --conf spark.yarn.submit.waitAppCompletion=false test.py

We can also give the application a name (it will override the name given in the Python code) and explicitly indicate that we want to use YARN (which is the default in a Hadoop cluster):

    spark-submit --master yarn --deploy-mode client --name testWC test.py
    spark-submit --master yarn --deploy-mode cluster --name testWC test.py
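
If you launch applications from a script rather than by hand, the same commands can be assembled programmatically. This is only a minimal sketch: the helper `build_submit_command` is a hypothetical name, not part of Spark, and it covers just the options shown above.

```python
import shlex


def build_submit_command(script, master='yarn', deploy_mode='client',
                         name=None, conf=None):
    """Return the spark-submit command as a list of arguments."""
    cmd = ['spark-submit', '--master', master, '--deploy-mode', deploy_mode]
    if name is not None:
        cmd += ['--name', name]
    # Each configuration property becomes one "--conf key=value" pair
    for key, value in sorted((conf or {}).items()):
        cmd += ['--conf', '{}={}'.format(key, value)]
    cmd.append(script)
    return cmd


cmd = build_submit_command(
    'test.py', deploy_mode='cluster', name='testWC',
    conf={'spark.yarn.submit.waitAppCompletion': 'false'})
# Print a shell-quoted version that can be pasted into a terminal
print(' '.join(shlex.quote(arg) for arg in cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with names that contain spaces.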



## Cluster vs Client mode

### Client Mode
![Client mode](https://bigdata.cesga.es/img/spark_client_mode.png)

### Cluster Mode
![Cluster mode](https://bigdata.cesga.es/img/spark_cluster_mode.png)

More information: [Spark-on-YARN: Empower Spark Applications on Hadoop Cluster](https://www.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster)

## Dynamic resource allocation
Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.

Our cluster has this feature enabled, so it **automatically launches new executors when they are needed** instead of fixing their number at launch time with `--num-executors`.

This allows interactive jobs to add and remove executors dynamically during execution.

Note that if you specify the `--num-executors` option without explicitly disabling dynamic resource allocation, it sets the initial number of executors to allocate (by default it is 2).
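
The interaction between `--num-executors` and dynamic allocation can be modelled with a few lines of Python. This is a simplified sketch, not Spark's own code: the function name is hypothetical, and the rule it encodes (the initial number is the largest of `spark.dynamicAllocation.minExecutors`, `spark.dynamicAllocation.initialExecutors` and `--num-executors`) is taken from the Spark configuration documentation. The defaults used here are Spark's documented defaults, which a cluster's own configuration may change.

```python
def initial_executors(dynamic_allocation=True, num_executors=None,
                      min_executors=0, initial=None):
    """Return the number of executors an application starts with."""
    if not dynamic_allocation:
        # Fixed allocation: --num-executors is the final number (default 2)
        return num_executors if num_executors is not None else 2
    if initial is None:
        # initialExecutors falls back to minExecutors when it is not set
        initial = min_executors
    # With dynamic allocation the largest of the three values wins
    return max(min_executors, initial, num_executors or 0)


# --num-executors 4 with dynamic allocation: starts with 4, may grow or shrink
print(initial_executors(num_executors=4))
# --num-executors 4 with dynamic allocation disabled: always 4
print(initial_executors(dynamic_allocation=False, num_executors=4))
```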

If you want to **disable dynamic resource allocation** and request a fixed number of executors you have to use the following spark-submit options:

    spark-submit --conf spark.dynamicAllocation.enabled=false --num-executors 4 ...

## Adding dependencies

To add dependencies we have to distinguish two cases: extending Spark itself, which is written in Scala, or adding the dependencies of our Python code.

To add dependencies for Spark itself we can use:
- The **--packages** option pulls the packages directly from the Maven Central Repository. This approach requires an internet connection.
- The **--jars** option transfers associated jar files to the cluster.


To include the dependencies of our Python program we can use:
- The **--py-files** option adds .zip, .egg, or .py files to the PYTHONPATH.

### Adding Spark dependencies: packages
You can add Spark dependencies directly using the maven coordinates:

    spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 ...
    
The usual place to look for public packages is the Maven Central Repository:

    https://search.maven.org/

### Adding Spark dependencies: jar files
You can also add existing jar files directly as dependencies:

    spark-submit --jars /jar_path/spark-streaming-kafka-assembly_2.10-1.6.1.2.4.2.0-258.jar ...

### Adding Python dependencies: zip files
The easiest way to add our Python dependencies is to package them all in a zip file.
If our program is more than a simple script and defines its own modules and packages, we also need to package and distribute them so the executors can access them.

If we have a `requirements.txt` file, the first step towards a `dependencies.zip` file is to install all the dependencies into a `dependencies` directory:
```
pip install -t dependencies -r requirements.txt
```

If the package we need is not on PyPI but we have its `setup.py`, we can install it together with its dependencies into the same directory by running the following from the directory where `setup.py` is located:
```
pip install -t dependencies .
```

Then we just package all the dependencies in a zip file:
```
cd dependencies
zip -r ../dependencies.zip .
```

We also need to package the code of our application (run this from the directory that contains `my_program`, not from inside `dependencies`):
```
zip -r my_program.zip my_program
```

Finally to submit your application you will use:

    spark-submit --py-files dependencies.zip,my_program.zip ...
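
The two `zip` commands above can also be reproduced from Python, which is convenient in a build script. The sketch below uses only the standard library. The function name `make_py_files` is hypothetical, and it assumes that `pip install -t dependencies ...` has already been run.

```python
import os
import shutil


def make_py_files(dependencies_dir='dependencies', package_dir='my_program',
                  output_dir='.'):
    """Create dependencies.zip and <package>.zip and return their paths."""
    output_dir = os.path.abspath(output_dir)
    # Same as "cd dependencies; zip -r ../dependencies.zip .":
    # the contents of the directory end up at the root of the archive.
    deps_zip = shutil.make_archive(
        os.path.join(output_dir, 'dependencies'), 'zip',
        root_dir=dependencies_dir)
    # Same as "zip -r my_program.zip my_program":
    # the package directory itself is kept as the top-level entry.
    parent, name = os.path.split(os.path.abspath(package_dir))
    app_zip = shutil.make_archive(
        os.path.join(output_dir, name), 'zip',
        root_dir=parent, base_dir=name)
    return deps_zip, app_zip
```

Calling `make_py_files()` from the project directory should produce the same two archives, whose paths can then be joined with a comma and passed to `--py-files`.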

### Adding Python dependencies: egg files

In case you have an egg file of a package you want to use, you can add it directly to the `--py-files` option of spark-submit or to the `sc.addPyFile()` method to make it available to the application. After that you can make use of it in your application in the standard way.

    spark-submit --py-files /egg_path/avro-1.8.1-py2.7.egg ...

    # First we add the egg file to the application environment
    sc.addPyFile('/home/cesga/jlopez/packages/ClusterShell-1.7.3-py2.7.egg')
    # Then we can import and use it in the standard way
    from ClusterShell.NodeSet import NodeSet
    nodeset = NodeSet('c[6601-6610]')

### Adding Python dependencies: wheel files
Unfortunately wheel files are not yet supported.

There was a feature request, but it was recently closed because no progress had been made since 2016:

https://issues.apache.org/jira/browse/SPARK-6764

## How-to install additional Python packages
The simplest way is to use **pip** with the `--user` option:

    pip install --user pymongo
    
You can also create a **virtualenv** and use it to install all your dependencies:

    virtualenv venv
    . venv/bin/activate
    
If you are using a virtualenv, you have to point Spark to the Python interpreter of that virtualenv:

    export PYSPARK_DRIVER_PYTHON=/home/cesga/jlopez/my_app/venv/bin/python
    export PYSPARK_PYTHON=/home/cesga/jlopez/my_app/venv/bin/python

You will also need to adjust the permissions of your HOME directory so the spark user can access this virtualenv.
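
If you prefer not to export these variables in your whole shell session, they can be set for a single submission only. The following sketch builds such an environment in Python. The helper name `virtualenv_spark_env` is hypothetical, and it assumes the usual virtualenv layout on Linux, with the interpreter at `bin/python`.

```python
import os


def virtualenv_spark_env(venv_dir, base_env=None):
    """Return a copy of the environment pointing PySpark at a virtualenv."""
    env = dict(os.environ if base_env is None else base_env)
    python = os.path.join(venv_dir, 'bin', 'python')
    # The driver and the executors must use the same interpreter
    env['PYSPARK_DRIVER_PYTHON'] = python
    env['PYSPARK_PYTHON'] = python
    return env


env = virtualenv_spark_env('/home/cesga/jlopez/my_app/venv')
print(env['PYSPARK_PYTHON'])
```

The resulting dictionary can be passed to `subprocess.run(['spark-submit', 'test.py'], env=env)`.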

This is a quick and dirty approach that you can use during development but, for production, I would recommend the zip file alternative.

## Native Compression Libraries

To check which native Hadoop and compression libraries are available, run the `hadoop checknative` command:

```
[jlopez@cdh61-login6 ~]$ hadoop checknative
...
Native library checking:
hadoop:  true /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop/lib/native/libhadoop.so.1.0.0
zlib:    true /lib64/libz.so.1
zstd  :  true /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop/lib/native/libzstd.so.1
snappy:  true /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop/lib/native/libsnappy.so.1
lz4:     true revision:10301
bzip2:   true /lib64/libbz2.so.1
openssl: true /lib64/libcrypto.so
ISA-L:   true /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop/lib/native/libisal.so.2
```

## How-to submit the application in the background

By default, when you submit an application in cluster mode, the spark-submit command stays active, waiting for the application to complete. To make it return right after the submission, set `spark.yarn.submit.waitAppCompletion=false`:

    spark-submit --conf spark.yarn.submit.waitAppCompletion=false ...
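
Once spark-submit has returned, you need the YARN application ID to check on the job later, for example with `yarn application -status <application_id>` or `yarn logs -applicationId <application_id>`. The sketch below extracts the ID from captured spark-submit output. It assumes only that YARN application IDs have the form `application_<timestamp>_<sequence>`. The function name is hypothetical and the sample log line is illustrative, not real output from our cluster.

```python
import re

# YARN application IDs look like application_<cluster timestamp>_<sequence>
APPLICATION_ID = re.compile(r'application_\d+_\d+')


def find_application_id(submit_output):
    """Return the first YARN application ID found in the text, or None."""
    match = APPLICATION_ID.search(submit_output)
    return match.group(0) if match else None


# Illustrative log line (assumption: the real message may be worded differently)
sample = 'INFO yarn.Client: Submitted application application_1550000000000_0042'
print(find_application_id(sample))
```

spark-submit normally writes its log to stderr, so that is the stream to capture when running it from a script.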



## Overriding configuration directory
To use a configuration directory other than the default `$SPARK_HOME/conf`, set `SPARK_CONF_DIR`. Spark will then read its configuration files (spark-defaults.conf, spark-env.sh, log4j.properties, etc.) from that directory.

Example:

    export SPARK_CONF_DIR=/home/cesga/jlopez/conf/
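
A custom configuration directory needs at least a `spark-defaults.conf` file, in which each line holds a property name and its value separated by whitespace. The following sketch writes such a file from a dictionary. The function name and the property values in the example are illustrative assumptions, not recommended settings.

```python
import os


def write_spark_defaults(conf_dir, properties):
    """Write properties to <conf_dir>/spark-defaults.conf and return its path."""
    if not os.path.isdir(conf_dir):
        os.makedirs(conf_dir)
    path = os.path.join(conf_dir, 'spark-defaults.conf')
    with open(path, 'w') as conf_file:
        # One "key value" pair per line, separated by whitespace
        for key, value in sorted(properties.items()):
            conf_file.write('{} {}\n'.format(key, value))
    return path
```

For example, `write_spark_defaults('/home/cesga/jlopez/conf', {'spark.executor.memory': '2g'})` would create the file used by the `export` above. Since Spark then stops reading the default directory, it is usually safer to start from a copy of the existing configuration and change only what you need.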


## Complex application submission example

Here is a real example of how to submit an application that consumes data from Kafka in Avro format using Spark Streaming:

```
spark-submit --master yarn --deploy-mode cluster \
             --num-executors 2 \
             --conf spark.yarn.submit.waitAppCompletion=false  \
             --packages com.databricks:spark-avro_2.10:2.0.1 \
             --jars /home/cesga/jlopez/packages/spark-streaming-kafka-assembly_2.10-1.6.1.2.4.2.0-258.jar \
             --py-files /home/cesga/jlopez/packages/avro-1.8.1-py2.7.egg \
             --name 'SSH attack detector' \
             ssh_attack_detector.py
```

## Running an interactive shell
In addition to Jupyter notebooks, you can also use the command-line interactive shell provided by Spark:

    pyspark --master yarn --num-executors 4 --executor-cores 6 --queue interactive

    --num-executors NUM         Number of executors to launch (Default: 2).
    --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode)
    --driver-cores NUM          Number of cores used by the driver, only in cluster mode (Default: 1).
    --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).
    --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").    


To use ipython instead of python for an interactive session use:

    module load anaconda2
    PYSPARK_DRIVER_PYTHON=$(which ipython) pyspark --queue interactive


## Exercises

* Exercise: In the solutions folder take a look at "Unit_6_WordCount.py" to see how the "Unit 4 WordCount" notebook was modified to create a spark application. Submit it to YARN.
* Exercise: Modify the "Unit 4 Working with meteorological data 2" notebook and submit it to YARN.
* Exercise: Modify the "Unit 5 Working with meteorological data" notebook and submit it to YARN.