# TensorFlow Script Mode Training and Serving

This notebook demonstrates how to train and deploy TensorFlow models on SageMaker. The model is an AutoEncoder for movie recommendation.

## Setup Env

In [13]:
import os
import sagemaker
from sagemaker import get_execution_role
import pandas as pd
import matplotlib.pyplot as plt
import math
import numpy as np

sagemaker_session = sagemaker.Session()
bucket            = sagemaker_session.default_bucket()
prefix            = 'recsys/autoenc_recsys/data'

role = get_execution_role()
region = sagemaker_session.boto_session.region_name

## Dataset

In [8]:
ls ../data/movielens100k/

links.csv  movies.csv  ratings.csv  tags.csv  u.data


In [10]:
df = pd.read_csv('../data/movielens100k/ratings.csv')
df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
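An autoencoder recommender typically takes one vector of ratings per user as input. `train.py` is not shown in this notebook, so the following is only a sketch of how the rating rows could be turned into such a user-by-movie matrix; the function name `build_user_item_matrix`, the zero fill for unrated movies, and the last sample row (a second user) are assumptions made for illustration.

```python
def build_user_item_matrix(ratings):
    """Turn (userId, movieId, rating) triples into a dense user x movie matrix.

    Unrated entries are filled with 0.0. Returns the matrix together with
    the user ids (row labels) and movie ids (column labels).
    """
    user_ids = sorted({user for user, _, _ in ratings})
    movie_ids = sorted({movie for _, movie, _ in ratings})
    row_of = {user: i for i, user in enumerate(user_ids)}
    col_of = {movie: j for j, movie in enumerate(movie_ids)}

    matrix = [[0.0] * len(movie_ids) for _ in user_ids]
    for user, movie, rating in ratings:
        matrix[row_of[user]][col_of[movie]] = rating
    return matrix, user_ids, movie_ids


# First rows of ratings.csv, plus one hypothetical row for a second user.
sample = [(1, 31, 2.5), (1, 1029, 3.0), (1, 1061, 3.0), (2, 31, 4.0)]
matrix, users, movies = build_user_item_matrix(sample)
print(users, movies)
print(matrix)
```

With pandas, `df.pivot(index='userId', columns='movieId', values='rating').fillna(0)` produces the same layout in one call.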


In [12]:
df_movies = pd.read_csv('../data/movielens100k/movies.csv')
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [15]:
# Upload the dataset to S3
inputs     = sagemaker_session.upload_data(path='../data/', 
                                           bucket=bucket, 
                                           key_prefix=prefix)
print('input spec (in this case, just an S3 path): {}'.format(inputs))

input spec (in this case, just an S3 path): s3://sagemaker-us-west-2-549173567278/recsys/autoenc_recsys/data


## Training 

In [39]:
from sagemaker.tensorflow import TensorFlow


estimator = TensorFlow(entry_point='train.py',
                         source_dir='../',
                         role=role,
                         train_instance_count=1,
                         train_instance_type='ml.p3.2xlarge',
                         framework_version='2.1.0',
                         py_version='py3',
                         hyperparameters={
                            'epochs': 20
                         },             
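                         metric_definitions=[
                            # Capture the number explicitly: a lazy group at the end of a pattern, (.*?), matches an empty string.
                            {'Name': "train:loss", 'Regex': "loss: (.*?),",},
                            {'Name': "validation:loss", 'Regex': "val_loss: ([0-9.]+)",},
                         ]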
                        )
estimator

<sagemaker.tensorflow.estimator.TensorFlow at 0x7fd718a815c0>
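The `Regex` entries in `metric_definitions` are applied to the training job's log output, so it can help to try them locally first. The sketch below uses Python's `re` module on a hypothetical log line (the real format depends on what `train.py` prints, and the regex dialect used by SageMaker may differ in details from Python's). It shows two things worth checking: a lazy group at the very end of a pattern captures nothing, and a plain `loss:` pattern also matches inside `val_loss:`.

```python
import re

# Hypothetical log line; the actual format depends on train.py.
log_line = "epoch 3, loss: 0.8421, val_loss: 0.9137"

# A lazy group with nothing after it is satisfied by the empty string.
lazy_at_end = re.search(r"val_loss: (.*?)", log_line)
print(repr(lazy_at_end.group(1)))  # ''

# An explicit numeric character class captures the value.
explicit = re.search(r"val_loss: ([0-9.]+)", log_line)
print(explicit.group(1))  # 0.9137

# "loss: " is also a substring of "val_loss: ", so this finds both values.
all_losses = re.findall(r"loss: ([0-9.]+)", log_line)
print(all_losses)  # ['0.8421', '0.9137']
```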

In [54]:
inputs

's3://sagemaker-us-west-2-549173567278/recsys/autoenc_recsys/data'

In [41]:
estimator.fit({
                'training':   inputs+"/movielens100k/ratings.csv",
                'validation': inputs+"/movielens100k/ratings.csv"
              })

2020-03-11 20:52:37 Starting - Starting the training job...
2020-03-11 20:52:40 Starting - Launching requested ML instances......
2020-03-11 20:53:41 Starting - Preparing the instances for training......
2020-03-11 20:55:03 Downloading - Downloading input data
2020-03-11 20:55:03 Training - Downloading the training image.........
2020-03-11 20:56:17 Training - Training image download completed. Training in progress.
2020-03-11 20:56:21,089 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2020-03-11 20:56:21,789 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training",
        "validation": "/opt/ml/input/data/validation"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_tensorflow_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameter

After training, the model artifact is versioned and saved to S3.
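Inside the training container, the script learns where its inputs are and where to write the model through command-line flags and environment variables: hyperparameters arrive as flags (`--epochs 20`), each channel is exposed as `SM_CHANNEL_<NAME>`, and whatever is written to `SM_MODEL_DIR` is packed into `model.tar.gz` and uploaded to S3. `train.py` is not shown in this notebook, so the following is a minimal sketch of the argument handling such a script might use; the flag names other than `--epochs` and the sample bucket are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters from the estimator are passed as command-line flags.
    parser.add_argument("--epochs", type=int, default=20)
    # Directory whose contents SageMaker archives into model.tar.gz.
    parser.add_argument("--sm-model-dir",
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    # One directory per channel passed to estimator.fit().
    parser.add_argument("--training",
                        default=os.environ.get("SM_CHANNEL_TRAINING",
                                               "/opt/ml/input/data/training"))
    parser.add_argument("--validation",
                        default=os.environ.get("SM_CHANNEL_VALIDATION",
                                               "/opt/ml/input/data/validation"))
    # The TensorFlow estimator may pass extra flags such as --model_dir;
    # parse_known_args ignores anything not declared above.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Hypothetical invocation, similar to what the container would run.
args = parse_args(["--epochs", "20", "--model_dir", "s3://hypothetical-bucket/model"])
print(args.epochs, args.training, args.sm_model_dir)
```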

In [42]:
estimator.model_data

's3://sagemaker-us-west-2-549173567278/tensorflow-training-2020-03-11-20-52-35-085/output/model.tar.gz'

In [44]:
!aws s3 cp {estimator.model_data} ./local_model/model.tar.gz
!tar -xvzf ./local_model/model.tar.gz -C ./local_model

download: s3://sagemaker-us-west-2-549173567278/tensorflow-training-2020-03-11-20-52-35-085/output/model.tar.gz to local_model/model.tar.gz
model.ckpt.data-00001-of-00002
checkpoint
model.ckpt.index
model_info.json
model.ckpt.data-00000-of-00002
movies_idx.pkl


## Deploy Model

In [None]:
predictor = estimator.deploy(initial_instance_count=1, 
                             instance_type='ml.p3.2xlarge')

---------------------

In [None]:
# from sagemaker.predictor import RealTimePredictor, json_serializer, json_deserializer

# class JSONPredictor(RealTimePredictor):
#     def __init__(self, endpoint_name, sagemaker_session):
#         super(JSONPredictor, self).__init__(endpoint_name, sagemaker_session, json_serializer, json_deserializer)

In [48]:
trained_model_location = estimator.model_data
trained_model_location

's3://sagemaker-us-west-2-549173567278/tensorflow-training-2020-03-11-20-52-35-085/output/model.tar.gz'

In [68]:
from sagemaker.tensorflow import TensorFlowModel

model = TensorFlowModel(model_data=trained_model_location,
                         role=role,
                         framework_version='2.0.0',
                         entry_point='predictor.py',
                         source_dir='../',
                         image = '520713654638.dkr.ecr.us-west-2.amazonaws.com/tensorflow-inference:2.0.0-gpu')
#image = '763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:1.13-gpu'
# 520713654638.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tensorflow:2.0.0b1-gpu-py3
model

2.0.0 is the latest version of tensorflow that supports Python 2. Newer versions of tensorflow will only be available for Python 3.Please set the argument "py_version='py3'" to use the Python 3 tensorflow image.


<sagemaker.tensorflow.model.TensorFlowModel at 0x7fd711e000f0>

### Create an endpoint

In [69]:
predictor = model.deploy(initial_instance_count=1, instance_type='ml.p3.2xlarge')
predictor

-*

UnexpectedStatusException: Error hosting endpoint tensorflow-inference-2020-03-12-00-22-38-456: Failed. Reason:  The role 'arn:aws:iam::549173567278:role/SageMakerFull' does not have BatchGetImage permission for the image: '520713654638.dkr.ecr.us-west-2.amazonaws.com/tensorflow-inference:2.0.0-gpu'..
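The error above says the execution role could not pull the image named in `image`. One thing worth checking is whether the registry account, region, repository, and tag in that URI are the ones intended; the commented-out alternative in the cell uses a different account and region. The sketch below only splits an ECR image URI into its parts so they can be compared. The helper name `parse_ecr_uri` is hypothetical, and the code does not verify that the image exists.

```python
import re

# <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_uri(uri):
    """Split an ECR image URI into account, region, repository and tag."""
    match = ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"not an ECR image URI: {uri}")
    return match.groupdict()


parts = parse_ecr_uri(
    "520713654638.dkr.ecr.us-west-2.amazonaws.com/tensorflow-inference:2.0.0-gpu"
)
print(parts)
```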

In [74]:
from sagemaker.tensorflow.serving import Model

model = Model(entry_point='inference.py',
              model_data=trained_model_location,
              framework_version='2.0.0',
              role=role)

In [78]:
predictor = model.deploy(initial_instance_count=1, instance_type='ml.p3.2xlarge')
predictor

Using already existing model: tensorflow-inference-2020-03-12-00-41-08-314


-------------------------------*

UnexpectedStatusException: Error hosting endpoint tensorflow-inference-2020-03-12-00-41-08-314: Failed. Reason:  The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..
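The CloudWatch logs are the authoritative place to find the cause, but one plausible explanation can be checked locally. The TensorFlow Serving container loads models in SavedModel format, while the `model.tar.gz` extracted earlier contains checkpoint files (`model.ckpt.*`) and no `saved_model.pb`. The sketch below lists any `saved_model.pb` entries that sit directly inside a numeric version directory; the helper name `find_saved_models` is hypothetical, and the expected layout is stated here as an assumption to verify against the container's documentation.

```python
import tarfile
from pathlib import PurePosixPath


def find_saved_models(tar_path):
    """Return archive paths of saved_model.pb files inside a numeric version directory."""
    found = []
    with tarfile.open(tar_path, "r:gz") as archive:
        for name in archive.getnames():
            parts = [p for p in PurePosixPath(name).parts if p != "."]
            if (len(parts) >= 2
                    and parts[-1] == "saved_model.pb"
                    and parts[-2].isdigit()):
                found.append("/".join(parts))
    return found


# Usage against the archive downloaded earlier in this notebook:
# find_saved_models("./local_model/model.tar.gz")
# An empty list means the archive holds no SavedModel in that layout.
```

If the list is empty, exporting the trained model as a SavedModel from the training script (instead of, or in addition to, checkpoints) would be the next thing to try.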