# Deep learning on Spark
![footer_logo_new](images/logo_new.png)

In this module we'll look at how [TensorFlow](https://www.tensorflow.org) integrates with Spark and, more broadly, at how Spark supports deep learning.

## 1. Integration with TensorFlow

TensorFlow is one of the most popular deep learning libraries. As you may know, most high-performance deep learning implementations are single-node only. We'll look at two use cases where distributed processing with Spark helps:

1. Hyperparameter Tuning and training: use Spark to find the best set of hyperparameters for neural network training
2. Deploying models at scale: use Spark to apply a trained neural network model on a large amount of data

### Hyperparameter Tuning and training
The interesting point here is that even though a single TensorFlow training run typically stays on one node, hyperparameter tuning consists of many independent training runs, so it can run in parallel and be distributed with Spark. Several extensions make it possible to train ML models with Apache Spark, including:

- Elephas
- CERN dist-keras
- Intel Analytics BigDL
- Apache Spark SystemML’s Keras2DML
- Databricks Spark Deep Learning
- Yahoo TensorFlowOnSpark
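
The parallel hyperparameter search described at the start of this section can be tried without a cluster. The sketch below is a local stand-in, not the API of any library listed above: a thread pool plays the role of Spark executors, and `train_and_score`, `build_grid`, and `parallel_search` are hypothetical names. On Spark, the same pattern would map over the grid with something like `sc.parallelize(grid).map(...)`.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_score(learning_rate, batch_size):
    """Hypothetical stand-in for one single-node training run.

    A real version would build and fit a TensorFlow model here and
    return its validation loss.
    """
    # Toy loss surface with its minimum at learning_rate=0.01, batch_size=64.
    return (learning_rate - 0.01) ** 2 + ((batch_size - 64) / 1000.0) ** 2


def build_grid(learning_rates, batch_sizes):
    """Expand the search space into one dict of parameters per trial."""
    return [
        {"learning_rate": lr, "batch_size": bs}
        for lr, bs in product(learning_rates, batch_sizes)
    ]


def parallel_search(grid, max_workers=4):
    """Run every trial independently and return (best_params, best_loss)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        losses = list(pool.map(lambda params: train_and_score(**params), grid))
    best = min(range(len(grid)), key=losses.__getitem__)
    return grid[best], losses[best]


grid = build_grid([0.001, 0.01, 0.1], [32, 64, 128])
best_params, best_loss = parallel_search(grid)
print(best_params, best_loss)
```

Because the trials share no state, the search scales with the number of workers; only the final comparison of losses happens in one place.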

Of these, we'll look more closely at [Yahoo TensorFlowOnSpark](https://github.com/yahoo/TensorFlowOnSpark).

TensorFlowOnSpark provides some important benefits over alternative deep learning solutions.

- Easily migrate existing TensorFlow programs with <10 lines of code change.
- Support all TensorFlow functionalities: synchronous/asynchronous training, model/data parallelism, inferencing and TensorBoard.
- Server-to-server direct communication achieves faster learning when available.
- Allow datasets on HDFS and other sources pushed by Spark or pulled by TensorFlow.
- Easily integrate with your existing Spark data processing pipelines.
- Easily deployed on cloud or on-premise and on CPUs or GPUs.

#### Resources 
- https://databricks.com/session/tensorflow-on-spark-scalable-tensorflow-learning-on-spark-clusters
- https://github.com/yahoo/TensorFlowOnSpark/blob/master/examples/mnist/keras/README.md

### Deploying models at scale

MLeap is a common serialization format and execution engine for machine learning pipelines. It supports training pipelines in Spark, Scikit-learn, and TensorFlow and exporting them to an MLeap Bundle. Serialized pipelines (bundles) can be deserialized back into Spark for batch-mode scoring, or into the MLeap runtime to power real-time API services.

Benefits:

- Export and import an MLeap Bundle to run your pipeline wherever it is needed.
- A unified runtime.

![mleap](images/mleap.jpg)

Idea: train different pieces of a pipeline using Spark, Scikit-learn, or TensorFlow, then export them to a single MLeap Bundle file and deploy it anywhere.

#### Common Serialization

In addition to providing a useful execution engine, MLeap Bundles provide a common serialization format for a large set of ML feature extractors and algorithms that can be exported and imported across Spark, Scikit-learn, TensorFlow, and MLeap. This makes it easy to convert pipelines between these technologies, depending on where you need to execute them.

#### Seamless Integrations

No changes to your existing pipeline are required: simply export it to an MLeap Bundle, or deploy it to a Combust API server, to start using the pipeline immediately.

#### Resources
- https://github.com/combust/mleap
- https://mleap-docs.combust.ml



### Horovod 

Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

- An open source distributed deep learning training framework
- Supports multiple frameworks: TensorFlow, Keras, PyTorch, Apache MXNet
- Separates infrastructure capabilities from the ML application
- Installs on top of an existing ML framework: `pip install horovod`
- Uses bandwidth-optimal communication protocols (RDMA, InfiniBand) when available

HorovodEstimator is an Apache Spark MLlib-style estimator API that builds on the Horovod framework developed by Uber. It facilitates distributed, multi-GPU training of deep neural networks on Spark DataFrames, simplifying the integration of ETL in Spark with model training in TensorFlow. Specifically, HorovodEstimator simplifies launching distributed training with Horovod by taking care of the following for you:

- Distributing training code & data to each machine on your cluster
- Enabling passwordless SSH between the driver and workers, and launching training via MPI
- Writing custom data-ingest & model-export logic
- Simultaneously running model training & evaluation

#### Distributed training

HorovodEstimator is a Spark MLlib Estimator and can be used with the Spark MLlib Pipelines API, although estimator persistence is not yet supported. Fitting a HorovodEstimator returns an MLlib Transformer (a TFTransformer) that can be used for distributed inference on a DataFrame. It also stores the following in the specified model directory:

- Model checkpoints, which can be used to resume training
- Event files, which contain the metrics logged during training
- A tf.SavedModel, which can be used to apply the model for inference outside Spark

HorovodEstimator makes no fault-tolerance guarantees. If an error occurs during training, HorovodEstimator does not attempt to recover, although you can rerun fit() to resume training from the latest checkpoint.

Benefits:

- Run multiple TensorFlow nodes in parallel
- Can improve the speed, and in some cases the accuracy, of model training

How it works:

- Model parallelism: different layers of the same model are trained on different nodes
- Data parallelism: the same model is trained on different subsets of the data, on different nodes
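
Data parallelism can be illustrated on a toy problem. The sketch below assumes a one-parameter linear model `y = w * x` trained with mean squared error and equally sized shards; the "workers" are ordinary function calls in one process, and all names are hypothetical. It shows that averaging the per-shard gradients gives the same update as a full-batch gradient, which is why each node can work on its own subset of the data.

```python
def shard_gradient(w, shard):
    """Gradient of mean squared error for y = w * x on one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def split_evenly(data, num_workers):
    """Cut the samples into equally sized shards, one per worker."""
    size = len(data) // num_workers
    return [data[i * size:(i + 1) * size] for i in range(num_workers)]


def data_parallel_step(w, shards, learning_rate):
    """One synchronous step: each worker computes a gradient, then they are averaged."""
    gradients = [shard_gradient(w, shard) for shard in shards]
    averaged = sum(gradients) / len(gradients)
    return w - learning_rate * averaged


# Samples generated from y = 3x, so training should move w towards 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = split_evenly(data, num_workers=4)

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, learning_rate=0.01)
print(round(w, 3))
```

In a real cluster the averaging step is where the nodes have to communicate, which is what the schemes in the next section address.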


#### Distributed Training – Schemes

In a data-parallel implementation:

- The nodes need to synchronize model parameters
- A centralized or decentralized scheme is used to communicate parameter updates
- Centralized schemes use a parameter server to communicate parameter updates to the nodes

![parameter_server](images/parameter-server.png)


Decentralized schemes use ring-allreduce to exchange and combine parameter updates directly between the nodes, without a central server.


![ring-allreduce](images/ring-allreduce.png)

Horovod supports ring-allreduce.
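
To make the ring-allreduce pattern concrete, here is a single-process simulation. It is a sketch of the communication pattern only, not Horovod's implementation: nodes are list entries, "sending" is a list copy, and the function name is hypothetical. It also assumes the vector length is a multiple of the node count.

```python
def ring_allreduce(node_vectors):
    """Sum equal-length vectors across nodes using the ring-allreduce pattern.

    In every step each node sends exactly one chunk to its right-hand neighbour.
    Returns one result vector per node; all of them equal the element-wise sum.
    """
    n = len(node_vectors)
    length = len(node_vectors[0])
    if length % n != 0:
        raise ValueError("this sketch needs the vector length to be a multiple of the node count")
    size = length // n
    # chunks[i][c] is node i's current copy of chunk c.
    chunks = [[list(vec[c * size:(c + 1) * size]) for c in range(n)] for vec in node_vectors]

    # Phase 1, scatter-reduce: after n - 1 steps node i holds the full sum of chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot all outgoing chunks first so the sends happen "simultaneously".
        outgoing = [(i, (i - step) % n, list(chunks[i][(i - step) % n])) for i in range(n)]
        for sender, c, payload in outgoing:
            receiver = (sender + 1) % n
            chunks[receiver][c] = [a + b for a, b in zip(chunks[receiver][c], payload)]

    # Phase 2, allgather: pass the finished chunks around until every node has all of them.
    for step in range(n - 1):
        outgoing = [(i, (i + 1 - step) % n, list(chunks[i][(i + 1 - step) % n])) for i in range(n)]
        for sender, c, payload in outgoing:
            chunks[(sender + 1) % n][c] = payload

    return [[value for chunk in node for value in chunk] for node in chunks]


# Three nodes, each holding its own six-element gradient vector.
gradients = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
reduced = ring_allreduce(gradients)
print(reduced[0])
```

Each node sends `2 * (n - 1)` chunks of size `length / n`, so the amount of data a single node transfers stays roughly constant as nodes are added, whereas a single parameter server has to exchange updates with every worker.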

#### Resources
https://horovod.readthedocs.io/en/stable/spark.html

### Pandas UDFs with Spark 3
Pandas UDFs, built on top of Apache Arrow, let you define low-overhead, high-performance UDFs entirely in Python.
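
A Series-to-Series pandas UDF might look like the sketch below. The temperature conversion, the names `to_celsius` and `make_celsius_udf`, and the column names in the usage comment are all hypothetical. The Spark imports are deferred into the factory function so that the arithmetic can be tried without a cluster; calling the factory requires `pyspark`, `pandas`, and `pyarrow`.

```python
def to_celsius(fahrenheit):
    """Plain arithmetic, so it works on a single float or on a whole pandas Series."""
    return (fahrenheit - 32.0) * 5.0 / 9.0


def make_celsius_udf():
    """Wrap to_celsius as a Series-to-Series pandas UDF (Spark 3 type-hint style)."""
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def celsius_udf(batch: pd.Series) -> pd.Series:
        # Spark calls this once per Arrow batch of rows, not once per row.
        return to_celsius(batch)

    return celsius_udf


# Usage on a cluster, assuming a DataFrame `df` with a double column `temp_f`:
#   df.withColumn("temp_c", make_celsius_udf()(df["temp_f"]))
print(to_celsius(212.0))
```

The function body is ordinary pandas code; the speed-up over a row-at-a-time Python UDF comes from processing a whole batch per call.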


#### Apache Arrow in PySpark

Apache Arrow is an in-memory columnar data format that Spark uses to transfer data efficiently between JVM and Python processes. It currently benefits Python users who work with Pandas/NumPy data the most. Its usage is not automatic and may require minor changes to configuration or code to take full advantage of it and to ensure compatibility.
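
As an example of the configuration changes mentioned above, the sketch below collects the Spark 3 settings that control Arrow-based conversion in `DataFrame.toPandas()` and `SparkSession.createDataFrame(pandas_df)`. The helper name `enable_arrow` and the chosen values are assumptions for illustration; the function only relies on the session exposing `spark.conf.set`.

```python
ARROW_SETTINGS = {
    # Use Arrow for toPandas() and createDataFrame(pandas_df); off by default.
    "spark.sql.execution.arrow.pyspark.enabled": "true",
    # Fall back to the non-Arrow path if a conversion fails before computation starts.
    "spark.sql.execution.arrow.pyspark.fallback.enabled": "true",
    # Upper bound on rows per Arrow record batch (illustrative value).
    "spark.sql.execution.arrow.maxRecordsPerBatch": "10000",
}


def enable_arrow(spark, settings=ARROW_SETTINGS):
    """Apply the Arrow-related settings to an existing SparkSession and return it."""
    for key, value in settings.items():
        spark.conf.set(key, value)
    return spark
```

With a live session this would be called as `enable_arrow(spark)` before converting DataFrames to or from pandas.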

![apache-arrow-and-pandas-udf-on-apache-spark-15-638](images/apache-arrow-and-pandas-udf-on-apache-spark-15-638.jpg)

Spark’s native ML library, though powerful, generally offers fewer features than Python’s single-node libraries. Those libraries usually provide more options for tuning model parameters, which can result in better-tuned models. However, a plain Python process is bound to a single machine and one contiguous block of memory, which makes it infeasible for training on large, distributed datasets. Pandas UDFs help bridge the two: Spark distributes the data, and Python libraries process each batch.

![with_pandas_udf](images/with_pandas_udf.png)


### NVIDIA RAPIDS with Spark 3

NVIDIA has worked with the Apache Spark community to implement GPU acceleration through the release of Spark 3.0 and the open source RAPIDS Accelerator for Spark. 

RAPIDS Accelerator for Apache Spark uses GPUs to:

- Accelerate end-to-end data preparation and model training on the same Spark cluster.
- Accelerate Spark SQL and DataFrame operations without requiring any code changes.
- Accelerate data transfer performance across nodes (Spark shuffles).

#### Machine Learning in Spark < 3.0

![nvidia-2](images/nvidia-2.png)

#### Machine Learning in Spark >= 3.0

![nvidia-3](images/nvidia-3.png)

#### GPU-aware scheduling in Spark

GPUs are a schedulable resource as of Apache Spark 3.0. Spark can schedule executors with a specified number of GPUs, and you can specify how many GPUs each task requires. Spark passes these resource requests to the underlying cluster manager (Kubernetes, YARN, or standalone). You can also configure a discovery script to detect which GPUs the cluster manager assigned. This greatly simplifies running ML applications that need GPUs, because previously you had to work around the lack of GPU scheduling in Spark applications.
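
The resource requests described above are expressed as Spark properties. The sketch below builds them as a dictionary and renders them as `spark-submit` arguments. The helper names and the script path are hypothetical; the property keys are the Spark 3 resource-scheduling settings.

```python
def gpu_scheduling_conf(gpus_per_executor, gpus_per_task, discovery_script=None):
    """Return the Spark 3 properties that request GPUs for executors and tasks."""
    if gpus_per_task > gpus_per_executor:
        raise ValueError("a task cannot need more GPUs than its executor owns")
    conf = {
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        "spark.task.resource.gpu.amount": str(gpus_per_task),
    }
    if discovery_script is not None:
        conf["spark.executor.resource.gpu.discoveryScript"] = discovery_script
    return conf


def to_submit_args(conf):
    """Render the properties as `--conf key=value` arguments for spark-submit."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


# Hypothetical script location; adjust to wherever the script lives on your nodes.
conf = gpu_scheduling_conf(2, 1, discovery_script="/opt/spark/getGpusResources.sh")
print(" ".join(to_submit_args(conf)))
```

With two GPUs per executor and one GPU per task, at most two GPU tasks can run on an executor at the same time, subject to the available cores.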

#### Example GPU scheduling flow

The diagram shows GPU scheduling flow from the Spark driver to the cluster manager, to executor launch, GPU assignment, and task launch.

![nvidia-spark](images/nvidia-spark.png)

1. The user submits an application with a GPU resource configuration and a discovery script.
2. Spark starts the driver, which passes the configuration to the cluster manager to request containers with the specified resources and GPUs.
3. The cluster manager returns the containers, and Spark launches an executor in each one.
4. When an executor starts, it runs the discovery script to find out which GPUs it was assigned.
5. Spark sends that information back to the driver, which then uses it to schedule tasks onto GPUs.
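
The discovery step in the flow above relies on a script that prints a small JSON document naming the resource and listing its addresses. The sketch below is a Python take on such a script, assuming NVIDIA GPUs and the `nvidia-smi` tool; the function names are hypothetical, and the example at the end uses canned output so that it runs on a machine without a GPU.

```python
import json
import subprocess


def format_gpu_resource(addresses):
    """Build the JSON document Spark expects on the discovery script's stdout."""
    return json.dumps({"name": "gpu", "addresses": [str(a) for a in addresses]})


def parse_gpu_indices(nvidia_smi_output):
    """Turn `nvidia-smi --query-gpu=index --format=csv,noheader` output into a list."""
    return [line.strip() for line in nvidia_smi_output.splitlines() if line.strip()]


def discover_gpus():
    """Ask nvidia-smi for the visible GPU indices (requires an NVIDIA driver)."""
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_indices(output)


# Canned output for two GPUs; a real discovery script would instead
# print(format_gpu_resource(discover_gpus())).
print(format_gpu_resource(parse_gpu_indices("0\n1\n")))
```

Inside a task, the assigned addresses can then be read with `TaskContext.get().resources()["gpu"].addresses` and used to pin the work to a specific device.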

#### Resources 

https://rapids.ai

https://developer.nvidia.com/blog/accelerating-apache-spark-3-0-with-gpus-and-rapids/

https://databricks.com/session_na20/deep-dive-into-gpu-support-in-apache-spark-3-x
