# EMR Notebook SageMaker Custom Abalone Ring Estimator

1. [Introduction](#Introduction)
2. [Setup](#Setup)
3. [Load the Data](#Load-the-Data)
4. [Train the Model](#Train-the-Model)
5. [Inference Results](#Inference-Results)
6. [Wrap-Up](#Wrap-Up)

## Introduction

In [1]:

Starting Spark application

| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
|----|---------------------|------|-------|----------|------------|------------------|
| 10 | application_1573168723671_0011 | pyspark | idle | Link | Link | ✔ |

SparkSession available as 'spark'.

## Setup
First, install the Python packages used throughout the notebook. EMR notebooks come with a default set of libraries for data processing; call `sc.list_packages()` to see which ones are installed.

In [2]:
sc.list_packages()

Package                    Version
-------------------------- -------
beautifulsoup4             4.8.0  
boto                       2.49.0 
jmespath                   0.9.4  
lxml                       4.4.1  
mysqlclient                1.4.4  
nltk                       3.4.5  
nose                       1.3.4  
numpy                      1.14.5 
pip                        19.3.1 
py-dateutil                2.2    
python36-sagemaker-pyspark 1.2.4  
pytz                       2019.2 
PyYAML                     3.11   
setuptools                 41.6.0 
six                        1.12.0 
soupsieve                  1.9.3  
wheel                      0.33.6 
windmill                   1.6
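
Before installing anything extra, it can help to check which modules the current Python process can already import. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and is not part of the EMR or SageMaker APIs.

```python
import importlib.util

def missing_modules(module_names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# Modules this notebook relies on; the result depends on the environment.
print(missing_modules(["json", "boto3", "sagemaker"]))
```

An empty list means every listed module is already importable in that process.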

To communicate with SageMaker, we need to install [notebook-scoped libraries](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-scoped-libraries.html). We will install both boto3 (the AWS SDK for Python) and the high-level SageMaker Python SDK.

In [3]:
sc.install_pypi_package("boto3");
sc.install_pypi_package('sagemaker');

Collecting boto3
  Using cached https://files.pythonhosted.org/packages/06/04/257af61ae66188fe4cc7fce0a1b1ad4baf675d9e47924c83d723904ce159/boto3-1.10.17-py2.py3-none-any.whl
Collecting s3transfer<0.3.0,>=0.2.0
  Using cached https://files.pythonhosted.org/packages/16/8a/1fc3dba0c4923c2a76e1ff0d52b305c44606da63f718d14d3231e21c51b0/s3transfer-0.2.1-py2.py3-none-any.whl
Collecting botocore<1.14.0,>=1.13.17
  Using cached https://files.pythonhosted.org/packages/d0/d8/67eec2958748b0084370a6ffb1cdb5c658400fbc7da51915303607ae7af3/botocore-1.13.17-py2.py3-none-any.whl
Collecting docutils<0.16,>=0.10
  Using cached https://files.pythonhosted.org/packages/22/cd/a6aa959dca619918ccb55023b4cb151949c64d4d5d55b3f4ffd7eee0c6e8/docutils-0.15.2-py3-none-any.whl
Collecting python-dateutil<2.8.1,>=2.1; python_version >= "2.7"
  Using cached https://files.pythonhosted.org/packages/41/17/c62faccbfbd163c7f57f3844689e3a78bae1f403648a6afb1d0866d87fbb/python_dateutil-2.8.0-py2.py3-none-any.whl
Collecting urllib

In [6]:
#define user specific parameters
region = 'us-west-2'
source_bucket = 's3a://emr-lab-income-dataset/'
sagemaker_execution_role = 'arn:aws:iam::883624334343:role/service-role/AmazonSageMaker-ExecutionRole-20190906T093404'
num_workers = 2

In [7]:
import boto3
import sagemaker

boto_sess = boto3.Session(region_name=region)
sage_sdk_session = sagemaker.Session(boto_session=boto_sess)
bucket = sage_sdk_session.default_bucket()

print('A SageMaker session was initiated! You are using {} as your S3 bucket for intermediate files.'.format(bucket))

A SageMaker session was initiated! You are using sagemaker-us-west-2-883624334343 as your S3 bucket for intermediate files.

## Load the Data

We will use the abalone data set from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Abalone).

Given is the attribute name, attribute type, the measurement unit and a brief description. The number of rings is the value to predict: either as a continuous value or as a classification problem.

| Name | Data Type | Meas. | Description |
|------|-----------|-------|-------------|
| Rings | integer | | +1.5 gives the age in years |
| Length | continuous | mm | Longest shell measurement |
| Diameter | continuous | mm | perpendicular to length |
| Height | continuous | mm | with meat in shell |
| Whole weight | continuous | grams | whole abalone |
| Shucked weight | continuous | grams | weight of meat |
| Viscera weight | continuous | grams | gut weight (after bleeding) |
| Shell weight | continuous | grams | after being dried |
| Male | integer | 1/0 | 1 encodes true, 0 false |
| Female | integer | 1/0 | 1 encodes true, 0 false |
| Infant | integer | 1/0 | 1 encodes true, 0 false |
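
The description of `Rings` says that adding 1.5 to the ring count gives the age in years. As a small worked example (the function name `rings_to_age` is illustrative and not part of the dataset or the notebook):

```python
def rings_to_age(rings):
    """Estimate abalone age in years from the ring count (rings + 1.5)."""
    return rings + 1.5

# An abalone with 19 rings is estimated to be 20.5 years old.
print(rings_to_age(19))  # 20.5
```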

In [8]:
# Pull the dataset down from S3
abalone_data = spark.read.load(source_bucket + 'clean/', format='csv', inferSchema=True, header=True).repartition(num_workers)
abalone_data.show(n=5)

abalone_data.rdd.getNumPartitions()

+-----+------+--------+------+------------+--------------+--------------+------------+----+------+------+
|Rings|Length|Diameter|Height|Whole_weight|Shucked_weight|Viscera_weight|Shell_weight|Male|Female|Infant|
+-----+------+--------+------+------------+--------------+--------------+------------+----+------+------+
|   19|  0.69|    0.55|   0.2|      1.8465|         0.732|         0.472|        0.57|   1|     0|     0|
|   11| 0.585|   0.455|  0.15|       0.987|        0.4355|        0.2075|        0.31|   1|     0|     0|
|    9| 0.625|     0.5|  0.17|      1.0985|        0.4645|          0.22|       0.354|   1|     0|     0|
|   10|  0.46|    0.35|  0.12|       0.515|         0.224|         0.108|      0.1565|   1|     0|     0|
|    9|  0.43|    0.35|  0.09|       0.397|        0.1575|         0.089|        0.12|   0|     1|     0|
+-----+------+--------+------+------------+--------------+--------------+------------+----+------+------+
only showing top 5 rows

2

In [9]:
# Split the DataFrame into training and test data
trainData, testData = abalone_data.randomSplit([.75,.25])

s3_train_emr = 's3a://'+ bucket + '/train/'
s3_test_emr = 's3a://'+ bucket + '/test/'
data_format = 'csv'

# Save the data to S3 for later training by SageMaker
trainData.write.save(s3_train_emr, format=data_format, mode='overwrite')
testData.write.save(s3_test_emr, format=data_format, mode='overwrite')

print('Training dataset saved in {} format to {}!'.format(data_format, s3_train_emr))
print('Testing dataset saved in {} format to {}!'.format(data_format, s3_test_emr))

Training dataset saved in csv format to s3a://sagemaker-us-west-2-883624334343/train/!
Testing dataset saved in csv format to s3a://sagemaker-us-west-2-883624334343/test/!
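
`randomSplit([.75,.25])` assigns each row to a split at random, so the resulting sizes are only approximately 75% and 25%, and they change between runs unless a seed is supplied (`randomSplit` accepts an optional `seed` argument). The plain-Python sketch below illustrates the idea on a list. It is a simplified model for intuition, not Spark's implementation, and `weighted_random_split` is a hypothetical helper.

```python
import random
from itertools import accumulate

def weighted_random_split(rows, weights, seed=None):
    """Assign each row to one split by drawing a uniform number per row."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds after normalizing the weights to sum to 1.
    bounds = list(accumulate(w / total for w in weights))
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if draw < upper or index == len(bounds) - 1:
                splits[index].append(row)
                break
    return splits

train, test = weighted_random_split(range(1000), [.75, .25], seed=42)
print(len(train), len(test))  # roughly 750 and 250; exact counts depend on the seed
```

Fixing the seed makes the split repeatable, which matters when comparing RMSE across models or hyperparameters.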

## Train the Model

In [10]:
model = 'XGBoost'
#model = 'LinearLearner'

l2 = 1
l1 = 1

training_images = {
    'LinearLearner': '174872318107.dkr.ecr.{}.amazonaws.com/linear-learner:latest'.format(region),
    'XGBoost': '433757028032.dkr.ecr.{}.amazonaws.com/xgboost:latest'.format(region)
}

linear_hyperparams = {
    'feature_dim':len(abalone_data.columns)-1,
    'predictor_type': 'regressor',
    'loss': 'squared_loss',
    'wd': l2,
    'l1': l1
}

xgboost_hyperparams = {
    'num_round':100,
    'lambda': l2,
    'objective': 'reg:linear',
    'alpha': l1
}

hyperparams = {
    'LinearLearner': linear_hyperparams,
    'XGBoost': xgboost_hyperparams
}
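
In both hyperparameter sets, `l2` and `l1` control regularization strength: XGBoost's `lambda` and `alpha` are the L2 and L1 penalty weights, and Linear Learner's `wd` and `l1` play the analogous role. The sketch below shows the penalty in the form it is commonly written for XGBoost, half of `lambda` times the sum of squared weights plus `alpha` times the sum of absolute weights. The function name and sample weights are illustrative only, and the exact scaling inside each algorithm may differ.

```python
def elastic_penalty(weights, l2=1.0, l1=1.0):
    """Return 0.5 * l2 * sum(w^2) + l1 * sum(|w|) for a list of weights."""
    squared = sum(w * w for w in weights)
    absolute = sum(abs(w) for w in weights)
    return 0.5 * l2 * squared + l1 * absolute

# 0.5 * 1 * (0.25 + 4.0 + 0.0) + 1 * (0.5 + 2.0 + 0.0) = 2.125 + 2.5
print(elastic_penalty([0.5, -2.0, 0.0]))  # 4.625
```

Larger values of either term push the fitted weights toward zero, which is the lever to experiment with when tuning `l2` and `l1`.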

In [11]:
estimator = sagemaker.estimator.Estimator(
    image_name=training_images[model],
    role=sagemaker_execution_role, 
    train_instance_count=1, 
    train_instance_type='ml.m5.large',
    sagemaker_session=sage_sdk_session, 
    hyperparameters=hyperparams[model]
)

In [12]:
s3_train = s3_train_emr.replace('s3a://', 's3://')
train_channel = sagemaker.session.s3_input(s3_train + 'part', content_type='text/csv')
estimator.fit({'train': train_channel})

2019-11-14 00:55:56 Starting - Starting the training job...
2019-11-14 00:55:57 Starting - Launching requested ML instances......
2019-11-14 00:56:57 Starting - Preparing the instances for training...
2019-11-14 00:57:47 Downloading - Downloading input data...
2019-11-14 00:58:24 Training - Downloading the training image..Arguments: train
[2019-11-14:00:58:37:INFO] Running standalone xgboost training.
[2019-11-14:00:58:37:INFO] Path /opt/ml/input/data/validation does not exist!
[2019-11-14:00:58:37:INFO] File size need to be processed in the node: 0.15mb. Available memory size in the node: 275.5mb
[2019-11-14:00:58:37:INFO] Determined delimiter of CSV input is ','
[00:58:37] S3DistributionType set as FullyReplicated
[00:58:37] 3103x10 matrix with 31030 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[00:58:37] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 38 extra nodes, 0 pruned nodes, max_depth=6
[0]#011train-rmse:7.20036
[00:58:37] src/t
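
Spark on EMR addresses the bucket through the `s3a://` connector scheme, while SageMaker expects plain `s3://` URIs, which is why the training cell rewrites the prefix before building the channel. A slightly more defensive version of that conversion might look like the following; `to_sagemaker_uri` is a hypothetical helper, not part of either SDK.

```python
def to_sagemaker_uri(uri):
    """Rewrite an s3a:// or s3n:// URI to s3://; leave s3:// URIs untouched."""
    for scheme in ("s3a://", "s3n://"):
        if uri.startswith(scheme):
            return "s3://" + uri[len(scheme):]
    if uri.startswith("s3://"):
        return uri
    raise ValueError("Not an S3 URI: {}".format(uri))

print(to_sagemaker_uri("s3a://my-bucket/train/"))  # s3://my-bucket/train/
```

Unlike `str.replace`, this only touches the leading scheme and fails loudly on anything that is not an S3 URI.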

## Inference Results

How well did our algorithm perform?

In [18]:
s3_inference = s3_train.replace('train', 'inference')

transformer = estimator.transformer(
    instance_count = 1,
    instance_type = 'ml.m5.large',
    strategy = 'MultiRecord',
    output_path = s3_inference,
    assemble_with= 'Line',
    accept='text/csv')

In [14]:
s3_test = s3_test_emr.replace('s3a://', 's3://')

transformer.transform(
    data=s3_test,
    content_type='text/csv',
    split_type='Line',
    input_filter='$[1:]',
    join_source='Input',
    wait=True
)

...................................!
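
In the transform call, `input_filter='$[1:]'` drops the first CSV column (the `Rings` label) before each record is sent to the model, and `join_source='Input'` appends the prediction to the original input record in the output. The plain-Python sketch below mimics that behavior on a single CSV line; `fake_predict` is a stand-in for the trained model and its output is made up.

```python
def fake_predict(features):
    """Stand-in for the model: returns a made-up constant prediction."""
    return 10.5

def transform_line(line, predict=fake_predict):
    """Mimic input_filter='$[1:]' plus join_source='Input' for one CSV record."""
    columns = line.split(",")
    features = columns[1:]           # '$[1:]' keeps everything after the label
    prediction = predict(features)   # the model only ever sees the features
    return ",".join(columns + [str(prediction)])  # input record + prediction

print(transform_line("19,0.69,0.55,0.2"))  # 19,0.69,0.55,0.2,10.5
```

Because the joined output keeps the label column, the actual and estimated ring counts end up side by side in each output row.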

In [23]:
from pyspark.sql.types import FloatType
from copy import deepcopy

#Read the schema from the initial dataset so you can apply it to the inference data.
schema = deepcopy(abalone_data.schema)
schema.add("Estimated_rings", FloatType())

#Pull down the inference data from S3
inference_data = spark.read.load(s3_inference, format='csv', schema=schema)
inference_data.show(n=5)


In [20]:
rings = inference_data.schema.names[0]
predicted_rings = inference_data.schema.names[-1]

sql_rmse = 'SELECT SQRT(AVG(POWER({}-{}, 2))) AS RMSE FROM inference'.format(rings, predicted_rings)

inference_data.createOrReplaceTempView("inference")
test = spark.sql(sql_rmse)
test.show()

+------------------+
|              RMSE|
+------------------+
|2.2379608543835614|
+------------------+
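
The SQL query computes the root-mean-square error: the square root of the mean squared difference between the actual and estimated ring counts. The same calculation in plain Python, on made-up numbers rather than the notebook's data, looks like this:

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

print(rmse([9, 10, 11], [8, 10, 13]))  # sqrt((1 + 0 + 4) / 3), about 1.291
```

RMSE is expressed in the same unit as the target, so the value above is measured in rings.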

## Wrap-Up
Congratulations! You processed data with Apache Spark on EMR, then trained a machine learning model and ran batch inference with it in Amazon SageMaker. Try different combinations of models and hyperparameters to see if you can reduce your model's RMSE.