In [184]:
%%time
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

bucket = 'palate'  # customize to your bucket
containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/image-classification:latest'}
training_image = containers[boto3.Session().region_name]

CPU times: user 48 ms, sys: 0 ns, total: 48 ms
Wall time: 391 ms


## Training the ResNet model

In this demo, we use the [Caltech-256](http://www.vision.caltech.edu/Image_Datasets/Caltech256/) dataset, which contains 30608 images of 256 objects. For the training and validation data, we follow the splitting scheme in this MXNet [example](https://github.com/apache/incubator-mxnet/blob/master/example/image-classification/data/caltech256.sh). In particular, it randomly selects 60 images per class for training and uses the remaining data for validation. The algorithm takes a `RecordIO` file as input. The user can also provide the image files as input, which will be converted into `RecordIO` format using MXNet's [im2rec](https://mxnet.incubator.apache.org/how_to/recordio.html?highlight=im2rec) tool. It takes around 50 seconds to convert the entire Caltech-256 dataset (~1.2GB) on a p2.xlarge instance. However, for this demo, we will use the `RecordIO` format.
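The per-class split described above can be sketched in a few lines. This is only an illustration, not the MXNet script itself: the function name `split_per_class`, the fixed seed, and the toy file names are assumptions. With 257 classes (256 objects plus clutter), 60 images per class gives 257 × 60 = 15420 training samples, which is the figure quoted for Caltech later in this notebook.

```python
import random

def split_per_class(labels_to_images, n_train=60, seed=0):
    """Pick n_train random images per class for training; the rest go to validation."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the split is repeatable
    train, validation = {}, {}
    for label, images in labels_to_images.items():
        shuffled = list(images)
        rng.shuffle(shuffled)
        train[label] = shuffled[:n_train]
        validation[label] = shuffled[n_train:]
    return train, validation

# Toy data with hypothetical file names: 3 classes of 80 images each.
toy = {c: ['{}_{:03d}.jpg'.format(c, i) for i in range(80)] for c in ('a', 'b', 'c')}
train, validation = split_per_class(toy)

# Sanity check on the Caltech figure: 257 classes * 60 images per class.
caltech_training_samples = 257 * 60
```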

Once we have the data available in the correct format for training, the next step is to actually train the model using the data. After setting training parameters, we kick off training, and poll for status until training is completed.
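The "poll for status" step can be sketched as a plain loop. The helper name `wait_for_job`, the polling interval, and the fake `describe` function are assumptions made for illustration; in a real run, `describe` would be the SageMaker client's `describe_training_job` method, or one could rely on the client's built-in waiter as the notebook does.

```python
import time

# Terminal values of TrainingJobStatus; anything else is treated as "still running".
TERMINAL_STATES = {'Completed', 'Failed', 'Stopped'}

def wait_for_job(describe, job_name, poll_seconds=30, max_polls=1000, sleep=time.sleep):
    """Call describe(TrainingJobName=...) until the job reaches a terminal state."""
    for _ in range(max_polls):
        status = describe(TrainingJobName=job_name)['TrainingJobStatus']
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
    raise TimeoutError('job {} did not finish after {} polls'.format(job_name, max_polls))

# Stand-in for the real client call so the sketch runs without AWS access.
_fake_statuses = iter(['InProgress', 'InProgress', 'Completed'])

def fake_describe(TrainingJobName):
    return {'TrainingJobStatus': next(_fake_statuses)}

final_status = wait_for_job(fake_describe, 'demo-job', poll_seconds=0)
```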

## Training parameters
There are two kinds of parameters that need to be set for training. The first are the parameters for the training job itself. These include:

* **Input specification**: These are the training and validation channels that specify the path where training data is present. These are specified in the "InputDataConfig" section. The main parameters that need to be set are the "ContentType", which can be set to "rec" or "lst" based on the input data format, and the "S3Uri", which specifies the bucket and the folder where the data is present.
* **Output specification**: This is specified in the "OutputDataConfig" section. We just need to specify the path where the output can be stored after training.
* **Resource config**: This section specifies the type of instance on which to run the training and the number of hosts used for training. If "InstanceCount" is more than 1, then training can be run in a distributed manner.
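Since every channel in "InputDataConfig" has the same shape, a small helper can cut down on repetition. The sketch below is one possible way to build the list; the helper name `make_channel` and the variable `bucket_name` are assumptions, while the keys and the S3 prefixes mirror the ones used in this notebook.

```python
def make_channel(name, s3_uri, content_type='application/x-image'):
    """Build one InputDataConfig entry for a fully replicated S3 prefix."""
    return {
        'ChannelName': name,
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': s3_uri,
                'S3DataDistributionType': 'FullyReplicated',
            }
        },
        'ContentType': content_type,
        'CompressionType': 'None',
    }

bucket_name = 'palate'  # same bucket as in this notebook
input_data_config = [
    make_channel('train', 's3://{}/photos/restaurants/'.format(bucket_name)),
    make_channel('train_lst', 's3://{}/photos/train_lst/'.format(bucket_name)),
    make_channel('validation', 's3://{}/photos/restaurants/'.format(bucket_name)),
    make_channel('validation_lst', 's3://{}/photos/validation_lst/'.format(bucket_name)),
]
```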

Apart from the above set of parameters, there are hyperparameters that are specific to the algorithm. These are:

* **num_layers**: The number of layers (depth) for the network. We use 101 in this sample, but other values such as 50 or 152 can be used.
* **num_training_samples**: This is the total number of training samples. It is set to 15420 for the Caltech dataset with the current split.
* **num_classes**: This is the number of output classes for the new dataset. ImageNet was trained with 1000 output classes, but the number of output classes can be changed for fine-tuning. For Caltech, we use 257 because it has 256 object categories + 1 clutter class.
* **epochs**: Number of training epochs.
* **learning_rate**: Learning rate for training.
* **mini_batch_size**: The number of training samples used for each mini batch. In distributed training, the number of training samples used per batch will be N * mini_batch_size, where N is the number of hosts on which training is run.
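To make the batch-size arithmetic concrete, here is a small sketch that computes the effective batch size and a rough number of batches per epoch. The function names are hypothetical, and rounding up the last partial batch is an assumption; the algorithm may handle a partial batch differently.

```python
import math

def effective_batch_size(mini_batch_size, num_hosts=1):
    """Samples consumed per step across all hosts: N * mini_batch_size."""
    return mini_batch_size * num_hosts

def steps_per_epoch(num_training_samples, mini_batch_size, num_hosts=1):
    """Approximate batches per epoch, counting a final partial batch (assumption)."""
    return math.ceil(num_training_samples / effective_batch_size(mini_batch_size, num_hosts))

# Values used in this notebook: 2306 samples, mini_batch_size 30, one host.
single_host_steps = steps_per_epoch(2306, 30, num_hosts=1)
# Same data spread over two hosts: each step consumes 60 samples.
two_host_steps = steps_per_epoch(2306, 30, num_hosts=2)
```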

In [185]:
# The algorithm supports multiple network depths (number of layers): 18, 34, 50, 101, 152 and 200
# For this training, we will use 18 layers
num_layers = "18" 
# we need to specify the input image shape for the training data
image_shape = "3,750,750"
# we also need to specify the number of training samples in the training set
# (15420 for Caltech; 2306 is the value set for the dataset used here)
num_training_samples = "2306"
# specify the number of output classes
num_classes = "40"
# batch size for training
mini_batch_size =  "30"
# number of epochs, i.e. full passes over the training data
epochs = "10"
# learning rate
learning_rate = "0.01"
# resize (defined here but not passed in the HyperParameters of the training job below)
resize = "750,750"

In [186]:
%%time
import time
import boto3
from time import gmtime, strftime


s3 = boto3.client('s3')
# create unique job name 
job_name_prefix = 'sagemaker-palate-notebook'
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
job_name = job_name_prefix + timestamp
training_params = \
{
    # specify the training docker image
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "File"            
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": 's3://{}/output'.format(bucket)
    },
    "ResourceConfig": {
        "InstanceCount": 1,              
        "InstanceType": "ml.p2.8xlarge",
        "VolumeSizeInGB": 5
    },
    "TrainingJobName": job_name,
    "HyperParameters": {
        "image_shape": image_shape,
        "num_layers": str(num_layers),
        "num_training_samples": str(num_training_samples),
        "num_classes": str(num_classes),
        "mini_batch_size": str(mini_batch_size),
        "epochs": str(epochs),
        "learning_rate": str(learning_rate)
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 360000
    },
    # Training data should be inside a subdirectory called "train"
    # Validation data should be inside a subdirectory called "validation"
    # The algorithm currently only supports the fully replicated mode (where data is copied onto each machine)
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": 's3://{}/photos/restaurants/'.format(bucket),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-image",
            "CompressionType": "None"
        },
        {
            "ChannelName": "train_lst",  
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": 's3://{}/photos/train_lst/'.format(bucket),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-image",
            "CompressionType": "None"
        },
        {
            "ChannelName": "validation", 
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": 's3://{}/photos/restaurants/'.format(bucket),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-image",
            "CompressionType": "None"
        },
        {
            "ChannelName": "validation_lst",  
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": 's3://{}/photos/validation_lst/'.format(bucket),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-image",
            "CompressionType": "None"
        }
    ]
}
print('Training job name: {}'.format(job_name))
print('\nInput Data Location: {}'.format(training_params['InputDataConfig'][0]['DataSource']['S3DataSource']))

Training job name: sagemaker-palate-notebook-2018-04-10-01-19-00

Input Data Location: {'S3DataType': 'S3Prefix', 'S3Uri': 's3://palate/photos/restaurants/', 'S3DataDistributionType': 'FullyReplicated'}
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 4.31 ms


In [187]:
# create the Amazon SageMaker training job
sagemaker = boto3.client(service_name='sagemaker')
sagemaker.create_training_job(**training_params)

# confirm that the training job has started
status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print('Training job current status: {}'.format(status))

try:
    # wait for the job to finish and report the ending status
    sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
    training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    status = training_info['TrainingJobStatus']
    print("Training job ended with status: " + status)
except Exception:
    print('Training failed to start')
    # if an exception is raised, the job has failed; look up the reason
    message = sagemaker.describe_training_job(TrainingJobName=job_name).get('FailureReason', 'unknown')
    print('Training failed with the following error: {}'.format(message))

Training job current status: InProgress
Training failed to start
Training failed with the following error: ClientError: image_shape must be smaller than the actual input image size. Please reduce the image_shape value or use the resize parameter.



In [188]:
training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
status = training_info['TrainingJobStatus']
print("Training job ended with status: " + status)

Training job ended with status: Failed
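
The failure reason above says that `image_shape` must be smaller than the actual input image size. One way to catch this before launching a job is to compare the requested shape against the dimensions of the images. The sketch below assumes the image sizes have already been collected into a dictionary (the file names and sizes are made up), and it assumes an image is a problem when either side is smaller than the requested one; the exact rule the algorithm applies is not stated in the error message.

```python
def parse_image_shape(image_shape):
    """Split a 'channels,height,width' string such as '3,750,750' into integers."""
    channels, height, width = (int(v) for v in image_shape.split(','))
    return channels, height, width

def images_too_small(image_shape, sizes):
    """Return the names of images whose width or height is below the requested shape.

    sizes maps a file name to a (width, height) tuple.
    """
    _, height, width = parse_image_shape(image_shape)
    return sorted(name for name, (w, h) in sizes.items() if w < width or h < height)

# Hypothetical image sizes, for illustration only.
example_sizes = {'a.jpg': (1024, 768), 'b.jpg': (640, 480), 'c.jpg': (750, 750)}
offenders = images_too_small('3,750,750', example_sizes)
```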


## Create Model

We now create a SageMaker Model from the training output. Using the model we can create an Endpoint Configuration.

In [91]:
%%time
import boto3
from time import gmtime, strftime

sage = boto3.Session().client(service_name='sagemaker') 

model_name="test2-image-classification-model"
print(model_name)
info = sage.describe_training_job(TrainingJobName=job_name)
model_data = info['ModelArtifacts']['S3ModelArtifacts']
print(model_data)

containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/image-classification:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/image-classification:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/image-classification:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/image-classification:latest'}
hosting_image = containers[boto3.Session().region_name]
primary_container = {
    'Image': hosting_image,
    'ModelDataUrl': model_data,
}

create_model_response = sage.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

print(create_model_response['ModelArn'])


test2-image-classification-model
s3://palate/output/sagemaker-palate-notebook-2018-04-09-00-27-23/output/model.tar.gz
arn:aws:sagemaker:us-west-2:798315061807:model/test2-image-classification-model
CPU times: user 44 ms, sys: 0 ns, total: 44 ms
Wall time: 461 ms


### Create Endpoint Configuration
At launch, we will support configuring REST endpoints in hosting with multiple models, e.g. for A/B testing purposes. To support this, customers create an endpoint configuration that describes the distribution of traffic across the models, whether split, shadowed, or sampled in some way.

In addition, the endpoint configuration describes the instance type required for model deployment, and at launch will describe the autoscaling configuration.
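
As an illustration of splitting traffic across models, the sketch below builds two production variants with different `InitialVariantWeight` values and computes the share of requests each would receive (a variant's weight divided by the sum of all weights). The helper names, variant names, and the second model name are hypothetical; this notebook itself deploys a single variant named `AllTraffic`.

```python
def make_variant(variant_name, model_name, weight, instance_type='ml.m4.xlarge', count=1):
    """Build one ProductionVariants entry for an endpoint configuration."""
    return {
        'VariantName': variant_name,
        'ModelName': model_name,
        'InitialVariantWeight': weight,
        'InstanceType': instance_type,
        'InitialInstanceCount': count,
    }

def traffic_share(variants):
    """Fraction of requests per variant: its weight over the sum of all weights."""
    total = sum(v['InitialVariantWeight'] for v in variants)
    return {v['VariantName']: v['InitialVariantWeight'] / total for v in variants}

# Hypothetical A/B setup: 90% of traffic to the current model, 10% to a candidate.
variants = [
    make_variant('Current', 'test2-image-classification-model', 9.0),
    make_variant('Candidate', 'candidate-image-classification-model', 1.0),
]
shares = traffic_share(variants)
```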

In [92]:
from time import gmtime, strftime

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_config_name = job_name_prefix + '-epc-' + timestamp
endpoint_config_response = sage.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType':'ml.m4.xlarge',
        'InitialInstanceCount':1,
        'ModelName':model_name,
        'VariantName':'AllTraffic'}])

print('Endpoint configuration name: {}'.format(endpoint_config_name))
print('Endpoint configuration arn:  {}'.format(endpoint_config_response['EndpointConfigArn']))

Endpoint configuration name: sagemaker-palate-notebook-epc--2018-04-09-00-43-19
Endpoint configuration arn:  arn:aws:sagemaker:us-west-2:798315061807:endpoint-config/sagemaker-palate-notebook-epc--2018-04-09-00-43-19


### Create Endpoint
Lastly, the customer creates the endpoint that serves up the model by specifying the name and configuration defined above. The end result is an endpoint that can be validated and incorporated into production applications. This takes 9-11 minutes to complete.

In [93]:
%%time
import time

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_name = job_name_prefix + '-ep-' + timestamp
print('Endpoint name: {}'.format(endpoint_name))

endpoint_params = {
    'EndpointName': endpoint_name,
    'EndpointConfigName': endpoint_config_name,
}
endpoint_response = sagemaker.create_endpoint(**endpoint_params)
print('EndpointArn = {}'.format(endpoint_response['EndpointArn']))

Endpoint name: sagemaker-palate-notebook-ep--2018-04-09-00-43-31
EndpointArn = arn:aws:sagemaker:us-west-2:798315061807:endpoint/sagemaker-palate-notebook-ep--2018-04-09-00-43-31
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 270 ms


The endpoint is now being created. It may take some time before it is in service...

In [94]:
# get the status of the endpoint
response = sagemaker.describe_endpoint(EndpointName=endpoint_name)
status = response['EndpointStatus']
print('EndpointStatus = {}'.format(status))


# wait until the status has changed
sagemaker.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)


# print the status of the endpoint
endpoint_response = sagemaker.describe_endpoint(EndpointName=endpoint_name)
status = endpoint_response['EndpointStatus']
print('Endpoint creation ended with EndpointStatus = {}'.format(status))

if status != 'InService':
    raise Exception('Endpoint creation failed.')

EndpointStatus = Creating
Endpoint creation ended with EndpointStatus = InService


## Perform Inference
Finally, the customer can now validate the model for use. They can obtain the endpoint from the client library using the result from previous operations, and generate classifications from the trained model using that endpoint.


In [95]:
import boto3
runtime = boto3.Session().client(service_name='runtime.sagemaker') 

### Download test image

In [189]:
import boto3

s3 = boto3.resource('s3')
# use a separate name so the `bucket` string defined earlier is not overwritten
photo_bucket = s3.Bucket('palate')

# user = '6ge446UtI_M0yzQa-I66Vg'
# user = 'IYk2DG_yBByVXxrVTo-BOg'
# user = 'izdoRybAthDrhWBqc_16lQ'
# user = 'zkHO17cJXt9Wvde4fcaL8A'
user = 'o5PjjS1IxZ8rvBKSFZeM7Q'
# user = '2CgLe_T0JLIhEGbg60ThEA'
photoCount = 0

userPath = 'photos/users/' + str(user) + '/'
awsUrl = 'https://s3-us-west-2.amazonaws.com/palate/'

# list only the objects under this user's prefix instead of scanning every bucket
for obj in photo_bucket.objects.filter(Prefix=userPath):
    path = str(obj.key)
    photoCount += 1

    print('Found: ' + path)
    url = awsUrl + path
    print('URL: ' + url)
    # https://s3-us-west-2.amazonaws.com/palate/photos/users/b73qtAJ8kWdB7HIsTN0P5w/-zm-T2QyislPgM9ICspxRg.jpg
    # save each photo as /tmp/test<N>.jpg, numbered from 1
    saveAs = '/tmp/test' + str(photoCount) + '.jpg'
    !wget -O {saveAs} {url}


#!wget -O /tmp/test1.jpg https://s3-us-west-2.amazonaws.com/palate/photos/restaurants/the-morrison-los-angeles/1MmjIXWEduQUxs3_83bZDQ.jpg
#!wget -O /tmp/test2.jpg https://s3-us-west-2.amazonaws.com/palate/photos/restaurants/the-morrison-los-angeles/0J37zqRKZzpKzmyixAQSCQ.jpg
#!wget -O /tmp/test3.jpg https://s3-us-west-2.amazonaws.com/palate/photos/restaurants/the-morrison-los-angeles/41jf-3Vh4uY2nwmKd25jcA.jpg
#!wget -O /tmp/test4.jpg https://s3-us-west-2.amazonaws.com/palate/photos/restaurants/providence-los-angeles-2/6FX0CkFxUHwOSb031QwpqA.jpg
#!wget -O /tmp/test5.jpg https://s3-us-west-2.amazonaws.com/palate/photos/restaurants/providence-los-angeles-2/899_MnJupls8moZWrG9QxA.jpg
#!wget -O /tmp/test6.jpg https://s3-us-west-2.amazonaws.com/palate/photos/restaurants/providence-los-angeles-2/6SZNL6WALimZqi-e9EBDnQ.jpg
#!wget -O /tmp/test7.jpg https://s3-us-west-2.amazonaws.com/palate/photos/restaurants/providence-los-angeles-2/86mYtyJrcHdBJUo1Uzr8ZA.jpg
#!wget -O /tmp/test8.jpg https://s3-us-west-2.amazonaws.com/palate/photos/restaurants/pine-and-crane-los-angeles/2dRXRo71aRoF3G7JQ5HJUw.jpg
#!wget -O /tmp/test9.jpg https://s3-us-west-2.amazonaws.com/palate/photos/restaurants/perch-los-angeles/6affCEwXYqbigcoCtsYJqQ.jpg
#!wget -O /tmp/test10.jpg https://s3-us-west-2.amazonaws.com/palate/photos/restaurants/perch-los-angeles/BMGexDWsE7OGsLJBZk3Axg.jpg



#file_name = '/tmp/test'
# test image
#from IPython.display import Image
#for num in range(1,2,1):
#    Image(file_name + str(num) + '.jpg')

Found: photos/users/o5PjjS1IxZ8rvBKSFZeM7Q/-X9AxKzyLO60XrEvdAORLQ.jpg
URL: https://s3-us-west-2.amazonaws.com/palate/photos/users/o5PjjS1IxZ8rvBKSFZeM7Q/-X9AxKzyLO60XrEvdAORLQ.jpg
--2018-04-10 02:08:10--  https://s3-us-west-2.amazonaws.com/palate/photos/users/o5PjjS1IxZ8rvBKSFZeM7Q/-X9AxKzyLO60XrEvdAORLQ.jpg
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 54.231.168.204
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|54.231.168.204|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 76881 (75K) [image/jpeg]
Saving to: ‘/tmp/test1.jpg’


2018-04-10 02:08:10 (47.0 MB/s) - ‘/tmp/test1.jpg’ saved [76881/76881]

Found: photos/users/o5PjjS1IxZ8rvBKSFZeM7Q/0Sg8AUyVcu94_8T_T9m0zg.jpg
URL: https://s3-us-west-2.amazonaws.com/palate/photos/users/o5PjjS1IxZ8rvBKSFZeM7Q/0Sg8AUyVcu94_8T_T9m0zg.jpg
--2018-04-10 02:08:10--  https://s3-us-west-2.amazonaws.com/palate/photos/users/o5PjjS1IxZ8rvBKSFZeM7Q/0Sg8AUyVcu94_8T_T9m0zg.jpg
Resolving 

In [190]:
# fetch object categories from S3
object_categories = []
!wget -O /tmp/object.lst https://s3-us-west-2.amazonaws.com/palate/photos/classes.lst
obj_file_name = '/tmp/object.lst'

with open(obj_file_name, 'r') as f:
    for line in f:
        object_categories.append(line.rstrip('\n'))
    
print(object_categories)

--2018-04-10 02:08:28--  https://s3-us-west-2.amazonaws.com/palate/photos/classes.lst
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.209.184
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.209.184|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1667 (1.6K) [text/plain]
Saving to: ‘/tmp/object.lst’


2018-04-10 02:08:28 (143 MB/s) - ‘/tmp/object.lst’ saved [1667/1667]

['shake-shack-los-angeles-7', 'manuela-los-angeles-2', 'the-kitchen-los-angeles', 'langers-los-angeles-2', 'the-bun-shop-los-angeles', 'little-jewel-of-new-orleans-los-angeles-2', 'wurstküche-los-angeles-2', 'boogie-mcgees-bayou-smokehouse-bbq-los-angeles', 'yuko-kitchen-los-angeles', 'marugame-monzo-los-angeles-2', 'bestia-los-angeles', 'yup-dduk-la-los-angeles', 'breva-restaurant-los-angeles-2', 'eggslut-los-angeles-7', 'animal-los-angeles', 'han-bat-shul-lung-tang-los-angeles', 'prank-los-angeles', 'république-los-angeles-2', 'pasta-sis

### Evaluation

Evaluate the image through the network for inference. The network outputs class probabilities, and typically one selects the class with the maximum probability as the final class output.

**Note:** The output class detected by the network may not be accurate in this example. To limit the time taken and cost of training, we have trained the model only for a couple of epochs. If the network is trained for more epochs (say 20), then the output class will be more accurate.
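
Selecting the most probable class generalizes naturally to the top few classes, which can be more informative when the probabilities are close together. The sketch below is a minimal illustration; the function name `top_k` and the sample labels and probabilities are made up and are not output from the endpoint.

```python
def top_k(probabilities, labels, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up labels and probabilities, for illustration only.
example_labels = ['class-a', 'class-b', 'class-c', 'class-d']
example_probs = [0.10, 0.55, 0.05, 0.30]

best_label, best_prob = top_k(example_probs, example_labels, k=1)[0]
top_two = top_k(example_probs, example_labels, k=2)
```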

In [191]:
import json
import numpy as np

# endpoint created earlier in this notebook
endpoint_name = 'sagemaker-palate-notebook-ep--2018-04-09-00-43-31'

# running sum of the per-class probabilities over all test images
class_sum = [0.0] * 40

# test images were saved as /tmp/test1.jpg ... /tmp/test<photoCount>.jpg
for num in range(1, photoCount + 1):
    with open('/tmp/test' + str(num) + '.jpg', 'rb') as f:
        payload = bytearray(f.read())
    response = runtime.invoke_endpoint(EndpointName=endpoint_name, 
                                       ContentType='application/x-image', 
                                       Body=payload)
    # the response body is JSON: a list with one probability per class
    result = json.loads(response['Body'].read())
    # find the class with maximum probability and print its label
    index = np.argmax(result)
    print("Result: label - " + str(object_categories[index]) + ", probability - " + str(result[index]))

    # accumulate the probabilities so they can be averaged over all images
    for index, prob in enumerate(result):
        class_sum[index] += prob

max_idx = 0
max_val = 0
for index in range(40):
    if max_val < class_sum[index]:
        max_val = class_sum[index]
        max_idx = index
    print("Label: " + object_categories[index] + ", probability - " + str(class_sum[index] / photoCount))
    
print("Top Result: " + object_categories[max_idx] + ", probability - " + str(class_sum[max_idx] / photoCount) )

Result: label - broken-mouth-lees-homestyle-los-angeles-5, probability - 0.047106869518756866
Result: label - broken-mouth-lees-homestyle-los-angeles-5, probability - 0.07139304280281067
Result: label - broken-mouth-lees-homestyle-los-angeles-5, probability - 0.12511301040649414
Result: label - broken-mouth-lees-homestyle-los-angeles-5, probability - 0.12599118053913116
Result: label - broken-mouth-lees-homestyle-los-angeles-5, probability - 0.09249883890151978
Result: label - broken-mouth-lees-homestyle-los-angeles-5, probability - 0.047143127769231796
Result: label - broken-mouth-lees-homestyle-los-angeles-5, probability - 0.06112596020102501
Result: label - broken-mouth-lees-homestyle-los-angeles-5, probability - 0.046864911913871765
Result: label - broken-mouth-lees-homestyle-los-angeles-5, probability - 0.04744726046919823
Result: label - broken-mouth-lees-homestyle-los-angeles-5, probability - 0.13744908571243286
Result: label - broken-mouth-lees-homestyle-los-angeles-5, probabil

### Clean up

When we're done with the endpoint, we can just delete it and the backing instances will be released.  Run the following cell to delete the endpoint.

In [28]:
#sage.delete_endpoint(EndpointName=endpoint_name)

{'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
   'content-length': '0',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Tue, 03 Apr 2018 04:29:26 GMT',
   'x-amzn-requestid': '3b40e456-6e7f-46d4-9c23-bba48cb52e2e'},
  'HTTPStatusCode': 200,
  'RequestId': '3b40e456-6e7f-46d4-9c23-bba48cb52e2e',
  'RetryAttempts': 0}}
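
Deleting the endpoint releases the instances, but the endpoint configuration and the model are separate resources and remain in the account. A fuller clean-up could be sketched as follows. The helper name `clean_up` is an assumption, and the small recording class stands in for the real SageMaker client so the sketch runs without AWS access; with a real client the three calls would delete the actual resources.

```python
def clean_up(client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint first, then its configuration, then the model."""
    client.delete_endpoint(EndpointName=endpoint_name)
    client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    client.delete_model(ModelName=model_name)

class RecordingClient:
    """Stand-in for a SageMaker client that only records the calls made to it."""
    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def call(**kwargs):
            self.calls.append((name, kwargs))
            return {}
        return call

# Dry run with placeholder resource names (hypothetical).
dry_run = RecordingClient()
clean_up(dry_run, 'my-endpoint', 'my-endpoint-config', 'my-model')
```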