## SageMaker Training Job 

### Please go through this notebook only if you have finished Part 1 to Part 4 of the tutorial.

---
#### Step 1: Import packages, get IAM role, get the region and set the S3 bucket.

In [2]:
import os
import boto3
import re
import copy
import time
from time import gmtime, strftime
from sagemaker import get_execution_role

role = get_execution_role()

region = boto3.Session().region_name

bucket = 'keras-sagemaker-train'  # Put your S3 bucket name here

---
#### Step 2: Create the algorithm image and push to Amazon ECR.

In [3]:
%%sh

# The name of our algorithm
algorithm_name=keras-sagemaker-train

chmod +x src/*

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.

aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

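# Get the login command from ECR and execute it directly.
# Note: `aws ecr get-login` exists only in AWS CLI v1. With AWS CLI v2, use
# `aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin ${account}.dkr.ecr.${region}.amazonaws.com` instead.
$(aws ecr get-login --region ${region} --no-include-email)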

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

# On a SageMaker Notebook Instance, the docker daemon may need to be restarted in order
# to detect your network configuration correctly.  (This is a known issue.)
if [ -d "/home/ec2-user/SageMaker" ]; then
  sudo service docker restart
fi

# Comment out the line below if you want to build the GPU image instead
docker build -t ${algorithm_name} -f Dockerfile.cpu .

# Uncomment the line below if you wish to run on a GPU
#docker build -t ${algorithm_name} -f Dockerfile.gpu .

docker tag ${algorithm_name} ${fullname}

docker push ${fullname}

Login Succeeded
Stopping docker: [  OK  ]
Starting docker:	.[  OK  ]
Sending build context to Docker daemon  211.1MB
Step 1/6 : FROM phenompeople/centos-python:3.6.3
 ---> e3d7d8ca4a30
Step 2/6 : ENV PATH="/opt/program:${PATH}"
 ---> Using cache
 ---> b365aa41218a
Step 3/6 : ADD requirements-cpu.txt /
 ---> Using cache
 ---> 3c3af67525fe
Step 4/6 : RUN pip3 install -r requirements-cpu.txt
 ---> Using cache
 ---> c6adb5f69019
Step 5/6 : COPY src /opt/program
 ---> Using cache
 ---> 902eafc56ddd
Step 6/6 : WORKDIR /opt/program
 ---> Using cache
 ---> 633aa8bfbd9d
Successfully built 633aa8bfbd9d
Successfully tagged keras-sagemaker-train:latest
The push refers to repository [850021735523.dkr.ecr.us-east-1.amazonaws.com/keras-sagemaker-train]
bfc4f2733525: Preparing
72ff4d93b480: Preparing
50b4f5bbfa32: Preparing
952e0784686f: Preparing
65c06ae44bbd: Preparing
f194f1dd3e8f: Preparing
ea264623c568: Preparing
c4cd48200f79: Preparing
bcc97fbfc9e1: Preparing
f194f1dd3e8f: Waiting
ea264623c5

---
#### Step 3: Define variables with data location and output location in S3 bucket.

In [4]:
data_location = 's3://{}/data'.format(bucket)
print("data location - " + data_location)

output_location = 's3://{}/output'.format(bucket)
print("output location - " + output_location)

data location - s3://keras-sagemaker-train/data
output location - s3://keras-sagemaker-train/output
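
Both locations are plain `s3://bucket/prefix` strings. If you need the bucket name and key prefix separately (for example, to list the uploaded data with `boto3`), a minimal sketch using only the standard library could look like this. The helper name `split_s3_uri` is hypothetical and not part of the tutorial code.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/prefix' into (bucket, prefix).

    Hypothetical helper for illustration; raises ValueError for anything
    that is not an s3:// URI with a bucket name.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {!r}".format(uri))
    # The path starts with '/', which is not part of the object key prefix.
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, prefix = split_s3_uri("s3://keras-sagemaker-train/data")
print(bucket_name, prefix)
```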


---
#### Step 4: Create a SageMaker session.

In [5]:
import sagemaker as sage
sess = sage.Session()

---
#### Step 5: Define variables for account, region and algorithm image.

In [6]:
account = sess.boto_session.client('sts').get_caller_identity()['Account']  # AWS account ID
region = sess.boto_session.region_name  # AWS region of the session
image = '{}.dkr.ecr.{}.amazonaws.com/keras-sagemaker-train'.format(account, region)  # algorithm image path in ECR
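
The image path follows the same pattern as `fullname` in the shell cell of Step 2, only without the `:latest` tag (Docker resolves an untagged reference to `latest`). As a sketch, the pattern can be wrapped in a small function. The name `ecr_image_uri` is hypothetical, and the `amazonaws.com` suffix assumes a standard commercial AWS region.

```python
def ecr_image_uri(account, region, repository, tag=None):
    """Build an ECR image URI; hypothetical helper mirroring the notebook's format string.

    Assumes the standard 'amazonaws.com' domain suffix.
    """
    uri = "{}.dkr.ecr.{}.amazonaws.com/{}".format(account, region, repository)
    # Append the tag only when one is given; otherwise leave the URI untagged.
    return "{}:{}".format(uri, tag) if tag else uri


print(ecr_image_uri("850021735523", "us-east-1", "keras-sagemaker-train"))
print(ecr_image_uri("850021735523", "us-east-1", "keras-sagemaker-train", tag="latest"))
```

Pinning an explicit tag other than `latest` makes it easier to tell which image build a given training job used.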

---
#### Step 6: Define hyperparameters to be passed to your algorithm.
This project reads two hyperparameters for training, `batch_size` and `epochs`. Using hyperparameters is optional.

In [7]:
hyperparameters = {"batch_size":128, "epochs":30}
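
Inside the training container, SageMaker writes these values to `/opt/ml/input/config/hyperparameters.json`, and they arrive as strings. The tutorial's own training script is not shown in this notebook, so the following is only a sketch of how a script might read and cast them. The function name `load_hyperparameters` and the fallback defaults are assumptions.

```python
import json


def load_hyperparameters(path="/opt/ml/input/config/hyperparameters.json",
                         defaults=None):
    """Read hyperparameters and cast each value to the type of its default.

    Hypothetical helper: the defaults mirror the values used in this notebook.
    """
    defaults = defaults or {"batch_size": 128, "epochs": 30}
    try:
        with open(path) as f:
            raw = json.load(f)
    except FileNotFoundError:
        # No file (e.g. when running outside a training container): use defaults.
        raw = {}
    # Values arrive as strings such as "128", so cast them back.
    return {name: type(default)(raw.get(name, default))
            for name, default in defaults.items()}


print(load_hyperparameters())
```

Outside a training container the file does not exist, so the call falls back to the defaults.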

---
#### Step 7: Create the training job using SageMaker Estimator.

In [8]:
classifier = sage.estimator.Estimator(image_name=image, 
                                      role=role,
                                      train_instance_count=1, 
                                      train_instance_type='ml.c5.2xlarge',
                                      hyperparameters=hyperparameters,
                                      output_path=output_location,
                                      sagemaker_session=sess)
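
The argument names above are those of version 1 of the SageMaker Python SDK. Version 2 renamed `image_name` to `image_uri`, `train_instance_count` to `instance_count`, and `train_instance_type` to `instance_type`. If you run this notebook with SDK v2, a sketch like the following can translate the keyword arguments. The helper `to_v2_kwargs` is hypothetical and covers only the names used in this step.

```python
# Renames between SageMaker Python SDK v1 and v2 for the arguments used above.
V1_TO_V2 = {
    "image_name": "image_uri",
    "train_instance_count": "instance_count",
    "train_instance_type": "instance_type",
}


def to_v2_kwargs(v1_kwargs):
    """Return a copy of v1-style Estimator kwargs with v2 argument names."""
    return {V1_TO_V2.get(name, name): value for name, value in v1_kwargs.items()}


v1_kwargs = {
    "image_name": "<account>.dkr.ecr.<region>.amazonaws.com/keras-sagemaker-train",
    "train_instance_count": 1,
    "train_instance_type": "ml.c5.2xlarge",
}
print(to_v2_kwargs(v1_kwargs))
```

The remaining arguments in this step (`role`, `hyperparameters`, `output_path`, `sagemaker_session`) pass through unchanged.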

---
#### Step 8: Run the training job by passing the data location.

In [9]:
classifier.fit(data_location)

2019-06-14 05:11:15 Starting - Starting the training job...
2019-06-14 05:11:17 Starting - Launching requested ML instances......
2019-06-14 05:12:26 Starting - Preparing the instances for training......
2019-06-14 05:13:36 Downloading - Downloading input data
2019-06-14 05:13:36 Training - Downloading the training image...
2019-06-14 05:14:16 Training - Training image download completed. Training in progress..
2019-06-14 05:14:18.178636: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-06-14 05:14:18.210980: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3000000000 Hz
2019-06-14 05:14:18.212464: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x42fa440 executing computations on platform Host. Devices:
2019-06-14 05:14:18.212487: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0):


2019-06-14 05:14:47 Uploading - Uploading generated training model
2019-06-14 05:14:47 Completed - Training job completed
Epoch 21/30

 128/8000 [..............................] - ETA: 0s - loss: 0.1346 - acc: 0.9766
1152/8000 [===>..........................] - ETA: 0s - loss: 0.2094 - acc: 0.9384
Epoch 22/30

 128/8000 [..............................] - ETA: 0s - loss: 0.2559 - acc: 0.9297
1024/8000 [==>...........................] - ETA: 0s - loss: 0.2043 - acc: 0.9346
Epoch 23/30

 128/8000 [..............................] - ETA: 0s - loss: 0.2356 - acc: 0.9062
1024/8000 [==>...........................] - ETA: 0s - loss: 0.1846 - acc: 0.9424
Epoch 24/30

 128/8000 [..............................] - ETA: 0s - loss: 0.1556 - acc: 0.9609
1024/8000 [==>...........................] - ETA: 0s - loss: 0.1455 - acc: 0.9521
Epoch 25/30

 128/8000 [..............................] - ETA: 0s - loss: 0.1835 - acc: 0.96

Billable seconds: 89
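
If you want the loss and accuracy as numbers rather than reading them off the log, the Keras progress lines can be parsed with a regular expression. The sketch below is an illustration only: the function name `parse_progress` is hypothetical, and the patterns are written against the line format shown in this output, including the color-code markers that SageMaker's streamed logs may carry.

```python
import re

# Color codes such as "\x1b[31m" or a bare "[31m" left behind in captured logs.
ANSI = re.compile(r"(?:\x1b)?\[\d+(?:;\d+)*m")
# Keras progress line: "<seen>/<total> [bar] - ETA: .. - loss: .. - acc: .."
PROGRESS = re.compile(
    r"(\d+)/(\d+) \[.*?\] - ETA: \S+ - loss: ([\d.]+) - acc: ([\d.]+)"
)


def parse_progress(line):
    """Return seen/total/loss/acc for a progress line, or None for other lines."""
    match = PROGRESS.search(ANSI.sub("", line))
    if match is None:
        return None
    seen, total, loss, acc = match.groups()
    return {"seen": int(seen), "total": int(total),
            "loss": float(loss), "acc": float(acc)}


line = "1152/8000 [===>..........................] - ETA: 0s - loss: 0.2094 - acc: 0.9384"
print(parse_progress(line))
```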


## Congratulations! The training job ran successfully in Amazon SageMaker.
#### Please return to the tutorial for Part 6, where we will run a training job on a GPU.