# Spark Job with MLRun
Run a Spark job that reads a CSV file and logs the dataset to the MLRun database.<br>
This basic example can serve as a template for more complex workloads using MLRun and Spark.

In [1]:
!/User/align_mlrun.sh

Both server & client are aligned (0.6.5rc12).


In [2]:
# nuclio: ignore
import nuclio
import os

## Build MLRun Function

In [3]:
#!/usr/local/bin/python

import mlrun
from mlrun.datastore import DataItem
from mlrun.execution import MLClientCtx

from subprocess import run

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

In [4]:
def read_csv(context: MLClientCtx, 
             dataset: DataItem, 
             artifact_path):
    """
    Read a csv file with a Spark job and MLRun - generates a serverless function
    --------------------------------------------------------------------------------------------
    Parameters:
                context : MLClientCtx
                          MLRun introduces the concept of a runtime "context";
                          the code can be set up to get parameters and inputs from the context,
                          as well as log run outputs, artifacts, tags, and time-series metrics in the context.
                          
                dataset : csv_file
                          csv file which needs to be local (on our machine);
                          the default location is "/v3io/projects/<file_name>",
                          which can be changed by using mlrun.mount_v3io later in the function specs
                          
                artifact_path : String
                          path where the output/artifacts of the function will be saved
                
    Returns:
                logged_dataset : mlrun_artifact
                          dataset will be logged into mlrun database as dataset artifact
    ---------------------------------------------------------------------------------------------
    Notes:
    ---------------------------------------------------------------------------------------------
    Examples:
    """
    
    # get csv file location
    location = dataset.local()
    
    # build spark session
    spark = SparkSession.builder.appName("Spark job").getOrCreate()
    
    # read csv
    df = spark.read.load(location, 
                         format="csv", 
                         sep=",", 
                         header="true")
    
    # convert the full Spark DataFrame to pandas for logging (collects all rows to the driver)
    df_to_log = df.toPandas()
    
    # log the dataset as an MLRun artifact
    context.log_dataset("df_sample", 
                        df=df_to_log,
                        format="csv", index=False,
                        artifact_path=artifact_path)
    
    spark.stop()

In [5]:
# nuclio: end-code

## Download iris data-set

In [6]:
import requests
import shutil

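def download_file(url, path):
    """Stream the file at `url` into the directory `path` and return its file name."""
    local_filename = url.split('/')[-1]
    file_path = os.path.join(path, local_filename)

    with requests.get(url, stream=True) as r:
        r.raise_for_status()  # fail early on HTTP errors instead of saving an error page
        # use a name other than `f`, which is the pyspark.sql.functions alias imported above
        with open(file_path, 'wb') as out_file:
            shutil.copyfileobj(r.raw, out_file)

    return local_filename

url = "https://s3.wasabisys.com/iguazio/data/iris/iris_dataset.csv"

download_file(url, '/v3io/projects')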

'iris_dataset.csv'
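
The download cell derives the target file name from the last segment of the URL. The sketch below isolates that path logic using only the standard library and no network access. The helper name `target_path` is hypothetical and not part of the notebook; it is one possible way to build the destination path, not the notebook's own implementation.

```python
import posixpath
from urllib.parse import urlparse


def target_path(url, directory):
    """Return the path a file downloaded from `url` would get inside `directory`."""
    # Take the last segment of the URL path, ignoring any query string.
    filename = posixpath.basename(urlparse(url).path)
    if not filename:
        raise ValueError(f"URL has no file name: {url}")
    return posixpath.join(directory, filename)


print(target_path(
    "https://s3.wasabisys.com/iguazio/data/iris/iris_dataset.csv",
    "/v3io/projects",
))
```

For the iris URL used here, this prints `/v3io/projects/iris_dataset.csv`, which matches the default location that the `read_csv` docstring describes.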

Please don't remove the `# nuclio: end-code` cell above.

## Set MLRun Function Specs

In [7]:
# get the Spark service name from spark-defaults.conf
from configparser import ConfigParser
from itertools import chain

parser = ConfigParser()
configFilePath = os.environ['SPARK_HOME']+'/conf/spark-defaults.conf'
with open(configFilePath) as lines:
    # the file has no section header, which ConfigParser requires, so prepend a dummy one
    lines = chain(("[top]",), lines)
    parser.read_file(lines)
    # e.g. "spark://<service>-master:<port>" -> "<service>"
    spark_service_name = parser["top"]["spark.master"].split("://")[1].split("-master")[0]
print(spark_service_name)

spark
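
The parsing step above can be tried without a Spark installation. The sketch below applies the same dummy-section trick to an in-memory string. The sample configuration text, the service name `my-spark`, and the helper name `spark_service_name_from` are all hypothetical, and the sketch assumes `key=value` lines, since that is what `ConfigParser` needs to find a `spark.master` key.

```python
from configparser import ConfigParser


def spark_service_name_from(conf_text):
    """Extract the service name from the spark.master entry of a defaults file."""
    parser = ConfigParser()
    # ConfigParser requires a section header, so prepend a dummy one.
    parser.read_string("[top]\n" + conf_text)
    master = parser["top"]["spark.master"]
    # "spark://<service>-master:<port>" -> "<service>"
    return master.split("://")[1].split("-master")[0]


# Hypothetical file content, for illustration only.
sample_conf = (
    "spark.master=spark://my-spark-master:7077\n"
    "spark.executor.memory=2g\n"
)

print(spark_service_name_from(sample_conf))
```

With this sample text the function returns `my-spark`. A master URL without the `-master` suffix would come back unchanged after the `spark://` prefix, port included, so the result is worth checking before passing it to `with_spark_service`.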


In [8]:
# mlrun will transform the code above (up to the nuclio: end-code cell) into a serverless function
# that runs in k8s pods
fn = mlrun.code_to_function(handler="read_csv", kind="remote-spark")

In [9]:
# add build commands that install the required modules into our docker image
fn.spec.build.commands = ['pip install matplotlib pyspark']


In [10]:
# attach the Spark service, then build and deploy our docker image
fn.with_spark_service(spark_service=spark_service_name)
fn.deploy()

> 2021-07-12 14:00:09,417 [info] Started building image: .mlrun/func-default-spark-mlrun-read-csv:latest
E0712 14:00:51.928074       1 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated.
	For verbose messaging see aws.Config.CredentialsChainVerboseErrors
INFO[0040] Retrieving image manifest datanode-registry.iguazio-platform.app.dev39.lab.iguazeng.com:80/iguazio/shell:3.0_b117_20210510150319 
INFO[0040] Retrieving image manifest datanode-registry.iguazio-platform.app.dev39.lab.iguazeng.com:80/iguazio/shell:3.0_b117_20210510150319 
INFO[0040] Built cross stage deps: map[]                
INFO[0040] Retrieving image manifest datanode-registry.iguazio-platform.app.dev39.lab.iguazeng.com:80/iguazio/shell:3.0_b117_20210510150319 
INFO[0040] Retrieving image manifest datanode-registry.iguazio-platform.app.dev39.lab.iguazeng.com:80/iguazio/shell:3.0_b117_20210510150319 
INFO[

True

### Set MLRun and Run Function
Once running, the function can be monitored in the MLRun UI, here in the notebook,<br>
and in our functions dashboard.

In [11]:
# set the mlrun api path and artifact path for logging
# (the returned value is indexed below: element [1] is used as the artifact path)
artifact_path = mlrun.set_environment(api_path = 'http://mlrun-api:8080',
                                      artifact_path = os.path.abspath('./'))

In [12]:
# run our function with the relevant inputs
fn.run(inputs={"dataset": "iris_dataset.csv"},
       artifact_path=artifact_path[1])

> 2021-07-12 14:05:49,539 [info] starting run spark-mlrun-read-csv-read_csv uid=44a37c692e0f4c0690900c2c375e26fb DB=http://mlrun-api:8080
> 2021-07-12 14:05:49,874 [info] Job is running in the background, pod: spark-mlrun-read-csv-read-csv-28bh6
21/07/12 14:06:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> 2021-07-12 14:06:36,641 [info] run executed, status=completed
final state: completed                                                          


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| default | ...375e26fb | 0 | Jul 12 14:06:12 | completed | spark-mlrun-read-csv-read_csv | v3io_user=avia<br>kind=remote-spark<br>owner=avia<br>host=spark-mlrun-read-csv-read-csv-28bh6 | dataset | | | df_sample |


to track results use .show() or .logs() or in CLI: 
!mlrun get run 44a37c692e0f4c0690900c2c375e26fb --project default , !mlrun logs 44a37c692e0f4c0690900c2c375e26fb --project default
> 2021-07-12 14:06:38,906 [info] run executed, status=completed


<mlrun.model.RunObject at 0x7fd364a04090>