# Spark Job with Spark Operator
This example uses the Spark operator to run a Spark job on Kubernetes (k8s).

The `spark-on-k8s-operator` lets you define Spark applications declaratively. It supports one-time Spark applications through `SparkApplication` and cron-scheduled applications through `ScheduledSparkApplication`.

When you send a request to the Spark operator with MLRun, the request contains your full application configuration: the code and dependencies to run (packaged as a Docker image or specified via URIs), the infrastructure parameters (e.g., the memory, CPU, and storage volume specs to allocate to each Spark executor), and the Spark configuration.
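To make the shape of such a request concrete, here is a minimal sketch of a `SparkApplication` manifest built as a plain Python dictionary. The field names follow the operator's `v1beta2` custom resource; the namespace, image name, application file path, and Spark version are hypothetical placeholders, and this is not necessarily the exact manifest that MLRun generates.

```python
import json

# Illustrative SparkApplication manifest. Values marked "hypothetical" are
# placeholders and are not taken from this example's actual deployment.
spark_application = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "sparkreadcsv", "namespace": "default"},  # hypothetical namespace
    "spec": {
        "type": "Python",
        "mode": "cluster",
        "image": "example/spark-py:latest",  # hypothetical image
        "mainApplicationFile": "local:///app/spark_read_csv.py",  # hypothetical path
        "sparkVersion": "3.1.2",  # hypothetical version
        # Infrastructure parameters for the driver and the executors
        "driver": {"cores": 1, "memory": "512m"},
        "executor": {"cores": 1, "instances": 2, "memory": "512m"},
        # Spark configuration passed to the application
        "sparkConf": {"spark.eventLog.enabled": "true"},
    },
}

manifest_json = json.dumps(spark_application, indent=2)
print(manifest_json)
```

The three parts named above map onto the manifest as follows: `image` and `mainApplicationFile` carry the code and dependencies, `driver` and `executor` carry the infrastructure parameters, and `sparkConf` carries the Spark configuration.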

Kubernetes takes this request and starts the Spark driver in a Kubernetes pod (a k8s abstraction; in this case, just a Docker container). The Spark driver can then talk directly to the Kubernetes master to request executor pods, scaling them up and down at runtime according to the load if dynamic allocation is enabled. Kubernetes takes care of bin-packing the pods onto Kubernetes nodes (the underlying VMs) and dynamically scales the node pools to meet the requirements.
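Dynamic allocation is controlled through standard Spark properties. The sketch below lists the ones that are commonly set on Kubernetes and renders them as `--conf` arguments; the executor bounds are illustrative values, and `to_submit_args` is a hypothetical helper written for this example, not part of MLRun or Spark.

```python
# Spark properties that control dynamic allocation (values are illustrative).
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    # Shuffle tracking is commonly enabled on Kubernetes, where an external
    # shuffle service is typically not available.
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "4",
}


def to_submit_args(conf):
    """Render a properties dict as a flat list of --conf key=value arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


print(to_submit_args(dynamic_allocation_conf))
```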

When you use the Spark operator, resources are allocated per task, which means they scale down to zero when the task is done.


## Preparations
The first step is to prepare the iris dataset used in this example.  
We get the file directly from GitHub using `mlrun.get_object()` and save it to our `projects` data container.

In [None]:
import mlrun
import os

# Create the data folder and set the dataset filepath
# We will save the dataset to our projects container
iris_dataset_filepath = os.path.abspath('/v3io/projects/howto/spark-operator/iris.csv')
os.makedirs(os.path.dirname(iris_dataset_filepath), exist_ok=True)

# Download the dataset from GitHub (returned as bytes)
iris_dataset = mlrun.get_object('https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv')

# Save the dataset at the designated path
with open(iris_dataset_filepath, 'wb') as f:
    f.write(iris_dataset)
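Before handing the file to Spark, it can be useful to confirm that the download actually contains CSV data. The following is a minimal sketch using only the standard library; `summarize_csv` is a hypothetical helper, and the sample bytes are a small stand-in for the real download (in the notebook you would pass `iris_dataset` instead).

```python
import csv
import io


def summarize_csv(raw_bytes):
    """Return (header, row_count) for CSV content given as bytes."""
    reader = csv.reader(io.StringIO(raw_bytes.decode("utf-8")))
    header = next(reader)
    # Count only non-empty rows so a trailing blank line is ignored
    row_count = sum(1 for row in reader if row)
    return header, row_count


# Stand-in sample data, assumed to resemble the iris file's layout
sample = (
    b"sepal_length,sepal_width,petal_length,petal_width,species\n"
    b"5.1,3.5,1.4,0.2,setosa\n"
    b"4.9,3.0,1.4,0.2,setosa\n"
)
header, rows = summarize_csv(sample)
print(header, rows)
```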

### Spark Operator Function Setup

In [2]:
# Set up a new Spark function that runs on the Spark operator
# `command` points to our Spark code, which must be located on our file system
# The `name` parameter may contain only lowercase letters (k8s convention)
read_csv_filepath = os.path.join(os.path.abspath('.'), 'spark_read_csv.py')
sj = mlrun.new_function(kind='spark', command=read_csv_filepath, name='sparkreadcsv') 

# Set the Spark driver config (gpu_type and gpus=<number_of_gpus> are supported too)
sj.with_driver_limits(cpu="1300m")
sj.with_driver_requests(cpu=1, mem="512m")

# Set the Spark executor config (gpu_type and gpus=<number_of_gpus> are supported too)
sj.with_executor_limits(cpu="1400m")
sj.with_executor_requests(cpu=1, mem="512m")

# Add support for fuse, the daemon, and Iguazio's JARs
sj.with_igz_spark()

# Arguments are also supported
sj.spec.args = ['-spark.eventLog.enabled', 'true']

# Install an additional Python package when the image is built
sj.spec.build.commands = ['pip install matplotlib']

# Number of executors
sj.spec.replicas = 2
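Two details in the setup above are easy to get wrong: the function name must be lowercase, and CPU values can be given either in cores (`1`) or in millicores (`"1300m"`). The sketch below checks a name against the DNS-1123 label rule that Kubernetes applies to many resource names and converts CPU quantities to cores. Both helpers are hypothetical and written for illustration; whether MLRun enforces exactly this naming rule is an assumption.

```python
import re

# DNS-1123 label: lowercase alphanumerics and '-', starting and ending
# with an alphanumeric character.
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_valid_k8s_name(name):
    """Check that a name is a lowercase DNS-1123 label of at most 63 characters."""
    return len(name) <= 63 and _DNS_LABEL.fullmatch(name) is not None


def cpu_to_cores(cpu):
    """Convert a Kubernetes CPU quantity (e.g. "1300m" or 1) to cores."""
    text = str(cpu)
    if text.endswith("m"):
        return int(text[:-1]) / 1000
    return float(text)


print(is_valid_k8s_name("sparkreadcsv"))
print(is_valid_k8s_name("SparkReadCSV"))
print(cpu_to_cores("1300m"))

# Total CPU requested by one driver plus two executors, each requesting 1 CPU
print(cpu_to_cores(1) + 2 * cpu_to_cores(1))
```

With the values used in this example, each pod requests 1 core and may burst up to its limit (1.3 cores for the driver, 1.4 cores for each executor).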

## Deploy the Spark function
### Build the Docker image
If our function requires additional packages that are not yet available in any of our images, we may want to build a new Docker image for it.  
MLRun builds and deploys the image for you, using `fn.spec.build.baseImage` as the base (defaults to a base Python 3 image) and running the additional `fn.spec.build.commands` on top of it.

> You can skip this step if you have already provided a ready image for the function.

In [None]:
# Rebuilds the image with MLRun
sj.deploy() 

### Run the function

In [None]:
# Run the task and set the artifact path where the run artifacts (if any) will be saved
sj.run(artifact_path='/User')