# Spark Job with MLRun
Run a Spark job that reads a CSV file and logs the dataset to the MLRun database.<br>
This basic example can serve as a template for more complex workloads that use MLRun and Spark.

In [1]:
# nuclio: ignore
import nuclio

## Set Function Kind and Base Image

In [2]:
# nuclio: ignore
# Get the build base image name by running the following code.
# It is used to configure the Docker image of our Spark job.
# Note: the ${env_var} annotation only works when **running** the notebook.
#       To create a function from this notebook directly (instead of saving the function yaml
#       via `fn.export()` and importing the resulting `yaml` file), **copy** the `baseImage_address`
#       value and set it as the `spec.build.baseImage` nuclio configuration directly.
import os
os.environ['baseImage_address'] = os.environ["IGZ_DATANODE_REGISTRY_URL"] + '/iguazio/shell:' + os.environ["IGZ_VERSION"]
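
The cell above raises a bare `KeyError` if either environment variable is unset, which can be hard to diagnose outside the platform. A minimal sketch of a more explicit variant follows; the helper name `build_base_image` and the placeholder values in the example are hypothetical, and only the two variable names and the `/iguazio/shell:` suffix come from this notebook.

```python
import os


def build_base_image(env=None):
    """Compose the base image address from the two Iguazio environment variables."""
    # Fall back to the real process environment when no mapping is passed in.
    env = os.environ if env is None else env
    required = ("IGZ_DATANODE_REGISTRY_URL", "IGZ_VERSION")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return env["IGZ_DATANODE_REGISTRY_URL"] + "/iguazio/shell:" + env["IGZ_VERSION"]


# Example with placeholder values (not a real registry or version).
example_env = {"IGZ_DATANODE_REGISTRY_URL": "registry.example:80", "IGZ_VERSION": "3.0"}
print(build_base_image(example_env))
```

Inside the notebook the result could be assigned to `os.environ['baseImage_address']` exactly as in the cell above.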

In [3]:
# set the function kind and docker image
%nuclio config kind = "job"
%nuclio config spec.build.baseImage = "${baseImage_address}"

%nuclio: setting kind to 'job'
%nuclio: setting spec.build.baseImage to 'datanode-registry.iguazio-platform.app.hsbctesting3.iguazio-cd0.com:80/iguazio/shell:3.0_katyak_debug_b1089_20201214154653'


## Build MLRun Function

In [4]:
#!/usr/local/bin/python

import mlrun
from mlrun.datastore import DataItem
from mlrun.execution import MLClientCtx

from subprocess import run

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

In [5]:
#!/usr/local/bin/python

run(["/bin/bash", "/etc/config/v3io/spark-job-init.sh"])

def read_csv(context: MLClientCtx, 
             dataset: DataItem, 
             artifact_path):
    """
    Read a CSV file with Spark and log it to MLRun as a dataset artifact.
    --------------------------------------------------------------------------------------------
    Parameters:
                context : MLClientCtx
                          The MLRun runtime context. The code can read parameters and inputs
                          from the context, and log run outputs, artifacts, tags, and
                          time-series metrics to it.

                dataset : DataItem
                          CSV file that must be locally accessible to the pod.
                          The default location is "/v3io/projects/<file_name>",
                          which can be changed with mlrun.mount_v3io in the function specs.

                artifact_path : str
                          Path under which the function's outputs/artifacts are saved.

    Returns:
                None. The dataset is logged to the MLRun database as a dataset artifact
                named "df_sample".
    ---------------------------------------------------------------------------------------------
    """
    
    # get csv file location
    location = dataset.local()
    
    # build spark session
    spark = SparkSession.builder.appName("Spark job").getOrCreate()
    
    # read csv
    df = spark.read.load(location, 
                         format="csv", 
                         sep=",", 
                         header="true")
    
    # convert the whole Spark DataFrame to pandas for logging (no sampling is applied)
    df_to_log = df.toPandas()
    
    # log the dataset as an MLRun artifact
    context.log_dataset("df_sample", 
                        df=df_to_log,
                        format="csv", index=False,
                        artifact_path=artifact_path)
    
    spark.stop()

In [6]:
# nuclio: end-code

Please don't remove the `# nuclio: end-code` cell above.

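
The `read_csv` handler above reads the file with `header="true"` and `sep=","` but without schema inference, so Spark keeps every column as a string unless `inferSchema` or an explicit schema is supplied. The following standard-library sketch illustrates the same behavior without a Spark session; it is not part of the job, the helper names `read_rows` and `cast_row` are hypothetical, and the sample rows are taken from the run output shown later in this notebook.

```python
import csv
import io

# Header and first two rows, as they appear in the job output of this notebook.
SAMPLE = (
    "length,width,petallength,petalwidth,label\n"
    "5.1,3.5,1.4,0.2,0\n"
    "4.9,3,1.4,0.2,0\n"
)


def read_rows(text, sep=","):
    """Parse CSV text using the first line as the header; values stay strings."""
    return list(csv.DictReader(io.StringIO(text), delimiter=sep))


def cast_row(row):
    """Cast the feature columns to float and the label column to int."""
    typed = {key: float(value) for key, value in row.items() if key != "label"}
    typed["label"] = int(row["label"])
    return typed


rows = read_rows(SAMPLE)
print(rows[1]["width"], type(rows[1]["width"]).__name__)
print(cast_row(rows[1]))
```

In the Spark job itself, the equivalent step would be to request schema inference or cast the columns explicitly before calling `toPandas()`.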
## Set MLRun Function Specs

In [7]:
# save spark service name (based on Iguazio services dashboard)
spark_service = "spark" 

In [8]:
# mlrun transforms the code above (up to the nuclio: end-code cell) into a serverless function
# that runs in k8s pods
fn = mlrun.code_to_function(handler="read_csv")

In [9]:
# apply mount_v3io to our function so that the k8s pod that runs it
# can access our data (shared data access)
fn.apply(mlrun.mount_v3io_extended())
fn.apply(mlrun.platforms.iguazio.mount_v3iod(namespace="default-tenant", v3io_config_configmap=spark_service + "-submit"))

# skip pulling an image if it already exists. If you would like to always force a pull, 
# you can set the imagePullPolicy of the container to Always.
fn.spec.image_pull_policy = "IfNotPresent"

In [10]:
# add build commands that install the required modules in our docker image
fn.spec.build.commands = ['pip install matplotlib mlrun==0.6.0-rc6 pyspark']

# set an environment variable in our docker image
fn.spec.build.extra = 'ENV PATH $PATH:/igz/.local/bin'
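
The build command above mixes pinned and unpinned packages in a single string. A minimal sketch of one way to keep the pins explicit is shown below; the helper `pip_install_command` is hypothetical and not part of MLRun, and the package list simply mirrors the cell above.

```python
def pip_install_command(packages):
    """Build a single `pip install` command from a name -> version mapping.

    A version of None leaves the package unpinned.
    """
    parts = []
    for name, version in packages.items():
        parts.append(f"{name}=={version}" if version else name)
    return "pip install " + " ".join(parts)


# Same packages as in the notebook cell; only mlrun is pinned there.
commands = [pip_install_command({"matplotlib": None, "mlrun": "0.6.0-rc6", "pyspark": None})]
print(commands)
```

The resulting list could be assigned to `fn.spec.build.commands` in place of the literal string.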

In [None]:
# build and deploy our docker image
fn.deploy(with_mlrun=False)

### Set MLRun and Run Function
Once running, the function can be monitored in the MLRun UI, here in the notebook,<br>
and in the functions dashboard.

In [12]:
# set the mlrun api path and the artifact path for logging
artifact_path = mlrun.set_environment(api_path = 'http://mlrun-api:8080',
                                      artifact_path = os.path.abspath('./'))

In [13]:
# run our functions with the relevant params
fn.run(inputs={"dataset": "iris.csv"},
       artifact_path=artifact_path)

> 2020-12-21 07:54:02,198 [info] starting run spark-mlrun-describe_spark uid=bc460cc9dd2b48ecbccee71fcd320204 DB=http://mlrun-api:8080
> 2020-12-21 07:54:02,327 [info] Job is running in the background, pod: spark-mlrun-describe-spark-pfg8n
20/12/21 07:54:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
+------+-----+-----------+----------+-----+
|length|width|petallength|petalwidth|label|
+------+-----+-----------+----------+-----+
|   5.1|  3.5|        1.4|       0.2|    0|
|   4.9|    3|        1.4|       0.2|    0|
|   4.7|  3.2|        1.3|       0.2|    0|
|   4.6|  3.1|        1.5|       0.2|    0|
|     5|  3.6|        1.4|       0.2|    0|
+------+-----+-----------+----------+-----+
only showing top 5 rows

> 2020-12-21 07:54:37,876 [info] run executed, status=completed

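| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| default | ...cd320204 | 0 | Dec 21 07:54:19 | completed | spark-mlrun-describe_spark | v3io_user=admin<br>kind=job<br>owner=admin<br>host=spark-mlrun-describe-spark-pfg8n | dataset | | | df_sample |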


to track results use .show() or .logs() or in CLI: 
!mlrun get run bc460cc9dd2b48ecbccee71fcd320204 --project default , !mlrun logs bc460cc9dd2b48ecbccee71fcd320204 --project default
> 2020-12-21 07:54:40,740 [info] run executed, status=completed


<mlrun.model.RunObject at 0x7f116b4fc5d0>
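
The run output above suggests tracking a run from the CLI by its uid. As a rough sketch, the uid can be pulled out of the "starting run" log line and the two commands from that output rebuilt for any run; the helper names `extract_uid` and `tracking_commands` are hypothetical, and the log line is copied from the output above.

```python
import re

# The "starting run" line exactly as it appears in the output above.
LOG_LINE = (
    "> 2020-12-21 07:54:02,198 [info] starting run spark-mlrun-describe_spark "
    "uid=bc460cc9dd2b48ecbccee71fcd320204 DB=http://mlrun-api:8080"
)


def extract_uid(line):
    """Return the hexadecimal run uid found in a log line, or None if absent."""
    match = re.search(r"\buid=([0-9a-f]+)", line)
    return match.group(1) if match else None


def tracking_commands(uid, project="default"):
    """Compose the two CLI commands that the run output suggests for tracking."""
    return [
        f"mlrun get run {uid} --project {project}",
        f"mlrun logs {uid} --project {project}",
    ]


uid = extract_uid(LOG_LINE)
for command in tracking_commands(uid):
    print(command)
```

In a notebook cell, each printed command can be run by prefixing it with `!`, as in the output above.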