# 3c - Training Custom Spark Model <a class="anchor" id="top"></a>
* [Introduction](#intro)
* [Setup](#setup)
* [Loading the data](#load-data)

## Introduction <a class="anchor" id="intro"></a>
In this section, we define and train a custom Spark model.
The model is built on Spark's multilayer perceptron implementation.
We then serialize the model to allow for deployment on an endpoint in the next section.

## Setup <a class="anchor" id="setup"></a>
First, we must import relevant Spark modules as well as libraries for statistical analysis and visualizations.
Note that this will also start the Spark application, which creates the `SparkSession` and assigns it to the `spark` variable.

In [None]:
%%cleanup -f

In [1]:
import pyspark
import pyspark.ml as ml
import pyspark.sql as sql
import pyspark.sql.types as types
import pyspark.sql.functions as F
import mleap.pyspark.spark_support as support

import os
import boto3
import shutil
import zipfile
import tarfile
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("darkgrid")

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
17,application_1643065673430_0016,pyspark,idle,Link,Link,,✔


SparkSession available as 'spark'.


In [2]:
s3 = boto3.resource("s3")

## Loading the data <a class="anchor" id="load-data"></a>

Now, we load the training, validation, and test data from S3.
To do this, we must first get the names of the buckets where the data and models are stored.
These names are written to the local SageMaker notebook instance when the `DevEnvironment` stack is created,
so they must be explicitly passed to the Spark cluster.

In [3]:
%%local
import json
with open("/home/ec2-user/.aiml-bb/stack-data.json", "r") as f:
    data = json.load(f)
    data_bucket = data["data_bucket"]
    model_bucket = data["model_bucket"]
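
The cell above assumes that the file exists and contains both keys. A slightly more defensive variant, sketched below, fails with a clearer message when a key is missing. The helper name `load_stack_data` and the sample bucket names are hypothetical; only the two key names come from this notebook.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("data_bucket", "model_bucket")  # Key names used in this notebook.


def load_stack_data(path):
    """Load the stack metadata file and check that the expected keys exist."""
    with open(path, "r") as f:
        data = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise KeyError(f"Missing keys in {path}: {missing}")
    return {key: data[key] for key in REQUIRED_KEYS}


# Demonstrate with a temporary file; the bucket names are made up.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "stack-data.json")
    with open(sample_path, "w") as f:
        json.dump({"data_bucket": "example-data", "model_bucket": "example-models"}, f)
    buckets = load_stack_data(sample_path)
```

In the notebook, such a helper would run in a `%%local` cell, since the file lives on the notebook instance rather than on the Spark cluster.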

In [4]:
%%send_to_spark -i data_bucket -t str

Successfully passed 'data_bucket' as 'data_bucket' to Spark kernel

In [5]:
%%send_to_spark -i model_bucket -t str

Successfully passed 'model_bucket' as 'model_bucket' to Spark kernel

In [6]:
train_df = spark.read.csv(f"s3a://{data_bucket}/preprocessing_output/train/", inferSchema=True)
validation_df = spark.read.csv(f"s3a://{data_bucket}/preprocessing_output/validation/", inferSchema=True)
test_df = spark.read.csv(f"s3a://{data_bucket}/preprocessing_output/test/", inferSchema=True)
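
`inferSchema=True` makes Spark take an extra pass over the input to work out the column types, which can be noticeable on data of this size. If the layout is known in advance, one option is to pass an explicit schema instead. The sketch below assumes 80 columns named `_c0` to `_c79` that are all doubles (one label plus 79 features) and builds a DDL-formatted schema string, which `spark.read.csv` accepts through its `schema` argument. The helper name `build_ddl_schema` is hypothetical.

```python
# Assumption: 80 numeric columns named _c0.._c79 (label first, then 79 features).
NUM_COLUMNS = 80


def build_ddl_schema(num_columns):
    """Return a DDL schema string such as '_c0 DOUBLE, _c1 DOUBLE, ...'."""
    return ", ".join(f"_c{i} DOUBLE" for i in range(num_columns))


schema_ddl = build_ddl_schema(NUM_COLUMNS)

# On the cluster, the string could replace schema inference, for example:
# train_df = spark.read.csv(
#     f"s3a://{data_bucket}/preprocessing_output/train/", schema=schema_ddl
# )
```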

In [7]:
%%pretty
test_df.select("_c0").groupby("_c0").count().show()

_c0,count
1.0,1239258
0.0,1240932
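
The two classes in the test set are almost perfectly balanced. As a quick check, the arithmetic below takes the counts from the output above and computes the accuracy a classifier would reach by always predicting the majority class, which gives a floor to compare the trained model against.

```python
# Class counts copied from the test set output above.
class_counts = {1.0: 1239258, 0.0: 1240932}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)
majority_baseline = class_counts[majority_label] / total

print(f"total rows: {total}")
print(f"majority class: {majority_label}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

Because the baseline is very close to 0.5, plain accuracy is a reasonable first metric for this data set.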


In [8]:
train_df.count()

18591753

In [9]:
test_df.count()

2480190

In [10]:
%%pretty
train_df.show(5)

_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,_c10,_c11,_c12,_c13,_c14,_c15,_c16,_c17,_c18,_c19,_c20,_c21,_c22,_c23,_c24,_c25,_c26,_c27,_c28,_c29,_c30,_c31,_c32,_c33,_c34,_c35,_c36,_c37,_c38,_c39,_c40,_c41,_c42,_c43,_c44,_c45,_c46,_c47,_c48,_c49,_c50,_c51,_c52,_c53,_c54,_c55,_c56,_c57,_c58,_c59,_c60,_c61,_c62,_c63,_c64,_c65,_c66,_c67,_c68,_c69,_c70,_c71,_c72,_c73,_c74,_c75,_c76,_c77,_c78,_c79
0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.3632772872402005,-0.2667950420048977,0.0,0.0,0.0,1.8263820885807576,0.4388991928782338,0.0,0.0,0.0
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0339430482990322,0.2766763398569309,0.0,0.0,0.0,2.5607006602781754,1.6658219366060238,0.0,0.0,0.0
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0339430482990322,0.2766763398569309,0.0,0.0,0.0,1.986425879848144,1.386522450228966,0.0,0.0,0.0
0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.726554574480401,-0.2075072548926982,0.0,0.0,0.8303506324339892,1.8263820885807576,0.7780485691932327,0.0,0.0,0.0
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0339430482990322,0.2766763398569309,0.0,0.0,0.0,2.7207444515455617,1.6658219366060238,0.0,0.0,0.0


## Model definition and training
For this task, we will be testing Spark's `MultilayerPerceptronClassifier`, which is a feedforward artificial neural network.
This was chosen because perceptrons can perform well when the training set is large, as it is here (about 18.6 million rows).
Also, there is no direct analog among SageMaker's built-in algorithms at the time of writing, so this is a good example of an advantage of training a custom Spark model.
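
To make "feedforward" concrete, the sketch below runs a single forward pass through a tiny network in plain Python. It follows the layout Spark documents for its multilayer perceptron, with sigmoid activations in the hidden layers and softmax on the output layer, but the layer sizes and weights are made up for illustration and this is not Spark's implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    # Subtract the maximum for numerical stability.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Affine layer: one weight row and one bias per output unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Apply sigmoid after every layer except the last, which uses softmax."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        pre_activation = dense(activations, weights, biases)
        if index < len(layers) - 1:
            activations = [sigmoid(z) for z in pre_activation]
        else:
            activations = softmax(pre_activation)
    return activations


# Hypothetical network: 3 inputs, 2 hidden units, 2 output classes.
toy_layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
]
probabilities = forward([1.0, 0.0, 1.0], toy_layers)
```

The output is one probability per class, and the predicted class is the index with the largest probability.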

In [11]:
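# Spark ML estimators expect all features in a single vector column.
# The first column is the label, so every remaining column is a feature.
feature_assembler = ml.feature.VectorAssembler(
    inputCols=train_df.columns[1:],
    outputCol="features",
    handleInvalid="keep"  # Keep rows with invalid values instead of raising an error.
)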

In [12]:
# Define perceptron.
layers = [
    len(train_df.columns) - 1, # Number of inputs.
    256,
    128,
    64,
    2 # Number of outputs.
]
mlp_classifier = ml.classification.MultilayerPerceptronClassifier(
    featuresCol="features",
    labelCol=train_df.columns[0],
    layers=layers
)
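
The `layers` list fully determines the size of the network. Assuming 79 input features (80 columns minus the label), the sketch below counts the trainable parameters: one weight per connection plus one bias per unit in each non-input layer. The helper name `count_parameters` is hypothetical.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected feedforward network."""
    return sum(
        (fan_in + 1) * fan_out  # +1 accounts for the bias of each output unit.
        for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )


# Assumption: 79 features, matching len(train_df.columns) - 1 for 80-column data.
layer_sizes = [79, 256, 128, 64, 2]
total_parameters = count_parameters(layer_sizes)
print(total_parameters)  # 61762
```

After fitting, this number should match the length of the classifier's `weights` vector, which makes it a handy sanity check on the architecture.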

In [13]:
# Wrap the feature assembler and the classifier into a single Pipeline.
inference_pipeline = ml.Pipeline(
    stages=[
        feature_assembler,
        mlp_classifier
    ]
)

## Fit model and evaluate

In [None]:
# Train the model on the training data and transform 
# the test data.
inference_model = inference_pipeline.fit(train_df)
transformed_test_df = inference_model.transform(test_df)
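
The transformed DataFrame contains a `prediction` column next to the label in `_c0`. In Spark, a metric such as accuracy would typically be computed with `MulticlassClassificationEvaluator`. To show what that metric measures, the plain-Python sketch below tallies a confusion matrix and the accuracy from a handful of made-up (label, prediction) pairs. The pairs are illustrative only and are not results from this model.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count occurrences of each (label, prediction) combination."""
    return Counter(pairs)


def accuracy(pairs):
    """Fraction of pairs where the prediction equals the label."""
    correct = sum(1 for label, prediction in pairs if label == prediction)
    return correct / len(pairs)


# Made-up (label, prediction) pairs for illustration only.
example_pairs = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 1.0),
    (1.0, 1.0), (1.0, 0.0), (0.0, 0.0),
]

matrix = confusion_counts(example_pairs)
example_accuracy = accuracy(example_pairs)
```

Since the test set is nearly balanced, a model that has learned anything useful should score clearly above 0.5 on this metric.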

In [None]:
%%pretty
transformed_test_df.show(5)

In [None]:
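# A PipelineModel has no `summary` attribute, so read it from the fitted
# classifier, which is the last stage (assumes Spark 3.1 or later).
inference_model.stages[-1].summary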