## Setup

In [1]:
sc.version

Starting Spark application


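| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
|----|---------------------|------|-------|----------|------------|------------------|
| 26 | application_1570741080126_0028 | pyspark | idle | Link | Link | ✔ |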


SparkSession available as 'spark'.


'2.4.4'

In [2]:
sc.install_pypi_package("boto3")
sc.install_pypi_package('sagemaker')

Collecting boto3
  Downloading https://files.pythonhosted.org/packages/80/4d/af562d20771766f79018b15facaa88c70373e534e6bf49c362844e0a0775/boto3-1.10.11-py2.py3-none-any.whl (128kB)
Collecting botocore<1.14.0,>=1.13.11
  Downloading https://files.pythonhosted.org/packages/13/0c/607f0ac508711f8c41c6d5788080c991344acc61d31d32b3585ee66ccb67/botocore-1.13.11-py2.py3-none-any.whl (5.4MB)
Collecting s3transfer<0.3.0,>=0.2.0
  Using cached https://files.pythonhosted.org/packages/16/8a/1fc3dba0c4923c2a76e1ff0d52b305c44606da63f718d14d3231e21c51b0/s3transfer-0.2.1-py2.py3-none-any.whl
Collecting docutils<0.16,>=0.10
  Using cached https://files.pythonhosted.org/packages/22/cd/a6aa959dca619918ccb55023b4cb151949c64d4d5d55b3f4ffd7eee0c6e8/docutils-0.15.2-py3-none-any.whl
Collecting python-dateutil<2.8.1,>=2.1; python_version >= "2.7"
  Using cached https://files.pythonhosted.org/packages/41/17/c62faccbfbd163c7f57f3844689e3a78bae1f403648a6afb1d0866d87fbb/python_dateutil-2.8.0-py2.py3-none-any.whl
Col

In [3]:
import boto3
import sagemaker

region = 'us-west-2'

boto_sess = boto3.Session(region_name=region)
sage_sdk_session = sagemaker.Session(boto_session=boto_sess)

## Loading the Data

We will use the abalone data set from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Abalone).

For each attribute, the table below gives its name, data type, measurement unit, and a brief description. The number of rings is the value to predict, either as a continuous value or as a classification problem.

| Name | Data Type | Unit | Description |
|------|-----------|------|-------------|
| Sex | nominal | | M, F, and I (infant) |
| Length | continuous | mm | Longest shell measurement |
| Diameter | continuous | mm | Perpendicular to length |
| Height | continuous | mm | With meat in shell |
| Whole weight | continuous | grams | Whole abalone |
| Shucked weight | continuous | grams | Weight of meat |
| Viscera weight | continuous | grams | Gut weight (after bleeding) |
| Shell weight | continuous | grams | After being dried |
| Rings | integer | | +1.5 gives the age in years |
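
The description of the Rings attribute implies a simple conversion from ring count to age. A minimal sketch of that arithmetic follows; the helper name `age_from_rings` is hypothetical and is not part of the data set or of any library used in this notebook.

```python
def age_from_rings(rings):
    """Estimate abalone age in years from the ring count (rings + 1.5)."""
    return rings + 1.5


# Sample ring counts, used here only for illustration.
for rings in (15, 7, 9):
    print(rings, "rings ->", age_from_rings(rings), "years")
```

Because age is just a shifted ring count, predicting Rings and predicting age are equivalent regression targets.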

In [4]:
# Pull the dataset down from S3
abaloneData = spark.read.load('s3a://emr-lab-income-dataset/Clean/', format='csv', inferSchema=True, header=True)

for dimension in ['Length', 'Diameter', 'Height']:
    abaloneData = abaloneData.withColumn(dimension,abaloneData[dimension].cast('double'))

abaloneData.printSchema()
abaloneData = abaloneData.select(['Rings', 'Length', 'Diameter', 'Height', 'Whole_weight', 'Shucked_weight',
                                 'Viscera_weight', 'Shell_weight', 'Male', 'Female', 'Infant'])
abaloneData.show(n=5)

root
 |-- Length: double (nullable = true)
 |-- Diameter: double (nullable = true)
 |-- Height: double (nullable = true)
 |-- Whole_weight: double (nullable = true)
 |-- Shucked_weight: double (nullable = true)
 |-- Viscera_weight: double (nullable = true)
 |-- Shell_weight: double (nullable = true)
 |-- Male: integer (nullable = true)
 |-- Female: integer (nullable = true)
 |-- Infant: integer (nullable = true)
 |-- Rings: integer (nullable = true)

+-----+------+--------+------+------------+--------------+--------------+------------+----+------+------+
|Rings|Length|Diameter|Height|Whole_weight|Shucked_weight|Viscera_weight|Shell_weight|Male|Female|Infant|
+-----+------+--------+------+------------+--------------+--------------+------------+----+------+------+
|   15| 0.455|   0.365| 0.095|       0.514|        0.2245|         0.101|        0.15|   1|     0|     0|
|    7|  0.35|   0.265|  0.09|      0.2255|        0.0995|        0.0485|        0.07|   1|     0|     0|
|    9|  0.53| 
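
The schema above contains `Male`, `Female`, and `Infant` indicator columns where the original data set has a single nominal Sex attribute. The notebook does not show how the cleaned file was produced, so the following is only a sketch of the one-hot encoding those columns appear to represent. The function name `one_hot_sex` is hypothetical.

```python
def one_hot_sex(sex):
    """Map a nominal Sex code (M, F, or I) to Male/Female/Infant indicators."""
    codes = {"M": "Male", "F": "Female", "I": "Infant"}
    if sex not in codes:
        raise ValueError("unexpected Sex code: %r" % (sex,))
    # Exactly one indicator is 1; the other two are 0.
    return {column: int(codes[sex] == column)
            for column in ("Male", "Female", "Infant")}


print(one_hot_sex("M"))
```

Encoding the category as three numeric columns lets every feature be passed to the training algorithm as a number.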

In [5]:
# Split the DataFrame into training and test sets
trainData, testData = abaloneData.randomSplit([.8,.2])

# Save the data to S3 for later training by SageMaker
#trainData.write.save('s3a://emr-lab-income-dataset/train/', format='csv', mode='overwrite')
#testData.write.save('s3a://emr-lab-income-dataset/test/', format='csv', mode='overwrite')
# There is an issue with randomSplit. For development purposes, write the full abaloneData for now.
abaloneData.write.save('s3a://emr-lab-income-dataset/test/', format='csv', mode='overwrite')

In [12]:
len(abaloneData.columns)

11

In [13]:
training_images = {'LinearLearner': '174872318107.dkr.ecr.us-west-2.amazonaws.com/linear-learner:latest',
                  'XGBoost': '174872318107.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest'}

hyperparams = {'feature_dim':len(abaloneData.columns)-1,
                  'predictor_type': 'regressor'}

sagemaker_execution_role = 'arn:aws:iam::883624334343:role/service-role/AmazonSageMaker-ExecutionRole-20190906T093404'
estimator = sagemaker.estimator.Estimator(
            image_name=training_images['LinearLearner'],
            role=sagemaker_execution_role, 
            train_instance_count=1, 
            train_instance_type='ml.p2.xlarge',
            output_path=None, 
            output_kms_key=None, 
            base_job_name=None, 
            sagemaker_session=sage_sdk_session, 
            hyperparameters=hyperparams, 
            train_use_spot_instances=False, 
            train_max_wait=None)
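
The `feature_dim` hyperparameter above is computed as the column count minus one because one of the 11 columns, `Rings`, is the label rather than a feature. A small sketch that makes this explicit, using the column names selected earlier; the variable names `label_column` and `feature_columns` are illustrative only.

```python
# Column order as selected when the data was loaded.
columns = ['Rings', 'Length', 'Diameter', 'Height', 'Whole_weight',
           'Shucked_weight', 'Viscera_weight', 'Shell_weight',
           'Male', 'Female', 'Infant']

label_column = 'Rings'  # the value to predict
feature_columns = [name for name in columns if name != label_column]

# Same value as len(columns) - 1, but tied to the label by name.
feature_dim = len(feature_columns)
print(feature_dim)
```

Deriving the count from named columns guards against an off-by-one error if columns are later added or dropped.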

In [14]:
s3_train_data = 's3://{}/{}/'.format('emr-lab-income-dataset', 'test')
estimator.fit({'train': s3_train_data})

Error for Training job linear-learner-2019-11-06-23-36-02-846: Failed. Reason: ClientError: Unable to read data channel 'train'. Requested content-type is 'application/x-recordio-protobuf'. Please verify the data matches the requested content-type. (caused by MXNetError)

Caused by: [23:39:10] /opt/brazil-pkg-cache/packages/AIAlgorithmsCppLibs/AIAlgorithmsCppLibs-2.0.2051.0/AL2012/generic-flavor/src/src/aialgs/io/iterator_base.cpp:100: (Input Error) The header of the MXNet RecordIO record at position 0 in the dataset does not start with a valid magic number.

Stack trace returned 10 entries:
[bt] (0) /opt/amazon/lib/libaialgs.so(+0xb1f0) [0x7fad550de1f0]
[bt] (1) /opt/amazon/lib/libaialgs.so(+0xb54a) [0x7fad550de54a]
[bt] (2) /opt/amazon/lib/libaialgs.so(aialgs::iterator_base::Next()+0x4a6) [0x7fad550e7436]
[bt] (3) /opt/amazon/lib/libmxnet.so(MXDataIterNext+0x21) [0x7fad43bbd131]
[bt] (4) /opt/amazon/python2.7/lib/python2.7/lib-dynload/_ctypes.so(ffi_call_unix64+0x4c) [0x7fad5563a858]
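
The error above says the training job expected `application/x-recordio-protobuf` input, while the `train` channel points at the CSV files written by Spark. A likely remedy is to declare the channel's content type as `text/csv` (in version 1 of the SageMaker Python SDK, through `sagemaker.s3_input(..., content_type='text/csv')`) or to convert the data to RecordIO-protobuf. Treat this as an assumption to verify rather than a confirmed fix. The sketch below only illustrates the "magic number" check the error refers to. The constant is assumed from the MXNet RecordIO format and the function name `looks_like_recordio` is hypothetical.

```python
import struct

# Magic number assumed to begin every MXNet RecordIO record, stored as a
# little-endian unsigned 32-bit integer. Verify against your MXNet version.
RECORDIO_MAGIC = 0xCED7230A


def looks_like_recordio(first_bytes):
    """Return True if the data starts with the RecordIO magic number."""
    if len(first_bytes) < 4:
        return False
    (value,) = struct.unpack("<I", first_bytes[:4])
    return value == RECORDIO_MAGIC


# A CSV row like the ones Spark writes, and a synthetic RecordIO-style header.
csv_sample = b"15,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,1,0,0\n"
recordio_sample = struct.pack("<I", RECORDIO_MAGIC) + b"\x00" * 8

print(looks_like_recordio(csv_sample))
print(looks_like_recordio(recordio_sample))
```

Running a check like this on the first bytes of an object in the training prefix shows quickly whether the data format matches what the algorithm was told to expect.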

In [None]:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

In [16]:
# Write the training split to S3
trainData.write.save('s3a://emr-lab-income-dataset/train/', format='csv', mode='overwrite')

In [15]:
# Test write of the full dataset to S3
abaloneData.write.save('s3a://emr-lab-income-dataset/Clean/abaloneData_writeTest.csv', format='csv', mode='overwrite')

## Training and Hosting a Model

## Inference


How well did the algorithm perform? The model predicts ring counts, so we compare its predictions with the actual `Rings` values and inspect the errors.
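
For a regressor such as the one configured in this notebook, performance is commonly summarized with an error metric. A minimal sketch using root mean squared error (RMSE) follows. The predicted values are hypothetical placeholders, since no endpoint was successfully queried here, and the function name `rmse` is illustrative.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual_rings = [15, 7, 9]
predicted_rings = [13.0, 8.0, 9.0]  # hypothetical predictions, not real model output

print(rmse(actual_rings, predicted_rings))
```

RMSE is expressed in the same unit as the label, so a value near 1 would mean predictions are off by roughly one ring on average.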

We no longer need to make inferences, so we delete the endpoint:

In [None]:
# Delete the endpoint created by estimator.deploy() above
predictor.delete_endpoint()

## More on SageMaker Spark

The SageMaker Spark GitHub repository has more information about SageMaker Spark, including how to use it with your own algorithms on Amazon SageMaker: https://github.com/aws/sagemaker-spark
