# TensorFlow liveness: local and remote training

## Prerequisites

This notebook shows how to use the SageMaker Python SDK to run your code in a local container before deploying to SageMaker's managed training or hosting environments.  This can speed up iterative testing and debugging while using the same familiar Python SDK interface.  Just change your estimator's `instance_type` to `local` (or `local_gpu` if you're using an ml.p2 or ml.p3 notebook instance).

In order to use this feature, you'll need to install docker-compose (and nvidia-docker if training with a GPU).
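As a quick sanity check before going further, the following sketch reports whether the required executables are on the `PATH`. The helper name `find_tools` is hypothetical and not part of the SageMaker SDK; it only wraps the standard-library `shutil.which`.

```python
import shutil


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# docker-compose is needed for local mode; nvidia-docker only for GPU training.
tools = find_tools(["docker-compose", "nvidia-docker"])
for name, path in tools.items():
    print(f"{name}: {path or 'not found on PATH'}")
```

A `None` entry for `nvidia-docker` is only a problem if you intend to use `local_gpu`.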

**Note: you can only run a single local notebook at one time.**

## Overview

The **SageMaker Python SDK** helps you deploy your models for training and hosting in optimized, production-ready containers in SageMaker. The SDK is easy to use, modular, extensible, and compatible with TensorFlow, MXNet, and PyTorch. This tutorial focuses on how to create a convolutional neural network model and train it on the [NUAA dataset](http://parnec.nuaa.edu.cn/_upload/tpl/02/db/731/template731/pages/xtan/NUAAImposterDB_download.html) using **TensorFlow in local mode**.

### Set up the environment

This notebook was created and tested on a single ml.p2.xlarge notebook instance.

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. The bucket should be in the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note: if more than one role is required for notebook instances, training, and/or hosting, replace `sagemaker.get_execution_role()` with the appropriate full IAM role ARN string(s).

In [1]:
!pip install awscli --upgrade
!pip install boto3 sagemaker

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


In [2]:
import tensorflow as tf
tf.__version__

'2.4.1'

In [4]:
import os
import subprocess

instance_type = "local"

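try:
    # nvidia-smi exits with status 0 when a GPU and its driver are available
    if subprocess.call("nvidia-smi") == 0:
        # Set type to GPU if one is present
        instance_type = "local_gpu"
except OSError:
    # nvidia-smi is not installed, so keep the CPU instance type
    pass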

print("Instance type = " + instance_type)

Instance type = local_gpu


In [11]:
import sagemaker

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

Couldn't call 'get_role' to get Role ARN from role name sagemaker_execution to get Role path.


### Download the NUAA dataset

In [5]:
#!pygmentize utils_tf.py
%run utils_tf.py

process_train_data
Read preprocessed data from cache file: train.pkl
process_test_data
Read preprocessed data from cache file: test.pkl
There are 2792 training examples 
There are 699 validation examples
There are 9123 testing examples
Image data shape is (64, 64)
There are 2 classes


### Data Preview

In [6]:
import numpy as np

# get some random training images
#dataiter = iter(trainloader)
#images, labels = dataiter.next()

# show images
#imshow(torchvision.utils.make_grid(images))

# print labels
#print(' '.join('%9s' % classes[labels[j]] for j in range(4)))

### Train locally
Before starting a SageMaker job, we run `train.py` directly for one epoch as a quick check. We then use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. Its return value identifies that location, and we use it later when we start the training job.

In [6]:
!python train.py --epochs 1

2021-11-22 00:27:46.314902: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
tensorflow version: 2.4.1
(2792, 64, 64) (2792,) (699, 64, 64) (699,)
channels_last
x_train shape: (2792, 64, 64, 1)
2792 train samples
699 test samples
2021-11-22 00:27:48.268203: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-11-22 00:27:48.269204: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-11-22 00:27:48.280949: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-22 00:27:48.281846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA GeForce GTX 1080 Ti computeCapability: 6.1
coreClo

In [7]:
prefix = 'keras-liveness'

# To upload the local files instead of using the fixed URIs below, uncomment:
#training_input_path   = sess.upload_data('upload/training.npz', key_prefix=prefix+'/training')
#validation_input_path = sess.upload_data('upload/validation.npz', key_prefix=prefix+'/validation')
training_input_path   = "s3://sagemaker-us-east-1-481126471388/keras-liveness/training/training.npz"
validation_input_path = "s3://sagemaker-us-east-1-481126471388/keras-liveness/validation/validation.npz"
print(training_input_path)
print(validation_input_path)

s3://sagemaker-us-east-1-481126471388/keras-liveness/training/training.npz
s3://sagemaker-us-east-1-481126471388/keras-liveness/validation/validation.npz
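The two URIs above appear to follow the pattern `s3://<bucket>/<prefix>/<channel>/<file>`. A minimal sketch of building them from their parts, rather than hard-coding the full strings, could look like this. The helper `s3_uri` is hypothetical and does not call S3; it only joins strings.

```python
def s3_uri(bucket, *parts):
    """Join a bucket name and key components into an s3:// URI."""
    key = "/".join(p.strip("/") for p in parts if p)
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


bucket = "sagemaker-us-east-1-481126471388"  # bucket name taken from the output above
prefix = "keras-liveness"

training_input_path = s3_uri(bucket, prefix, "training", "training.npz")
validation_input_path = s3_uri(bucket, prefix, "validation", "validation.npz")
print(training_input_path)
print(validation_input_path)
```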


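### Construct a script for training
The network model is defined in `train.py`, which is not reproduced here. The next cell sets the `SM_*` environment variables that SageMaker normally provides inside a training container, so that the script can also be run outside one: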

In [8]:
%env SM_NUM_GPUS=1
%env SM_MODEL_DIR=/tmp/model
%env SM_CHANNEL_TRAINING=upload
%env SM_CHANNEL_VALIDATION=upload
%env AWS_PROFILE=default-api
#DUMMY_IAM_ROLE = 'arn:aws:iam::111111111111:role/service-role/AmazonSageMaker-ExecutionRole-20200101T000001'
print(instance_type, role)

env: SM_NUM_GPUS=1
env: SM_MODEL_DIR=/tmp/model
env: SM_CHANNEL_TRAINING=upload
env: SM_CHANNEL_VALIDATION=upload
env: AWS_PROFILE=default-api
local_gpu sagemaker_execution
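The `SM_*` variables set above mimic what SageMaker provides inside a training container. Since `train.py` is not shown, the following is only a sketch of how a script-mode entry point commonly reads them, under assumed argument names (`--sm-model-dir`, `--training`, `--validation`, `--gpu-count`); the actual script may differ.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided settings (hypothetical layout)."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line flags, e.g. --epochs 1.
    parser.add_argument("--epochs", type=int, default=1)
    # Paths and GPU count fall back to the SM_* variables when no flag is given.
    parser.add_argument("--sm-model-dir", default=os.environ.get("SM_MODEL_DIR", "/tmp/model"))
    parser.add_argument("--training", default=os.environ.get("SM_CHANNEL_TRAINING", "upload"))
    parser.add_argument("--validation", default=os.environ.get("SM_CHANNEL_VALIDATION", "upload"))
    parser.add_argument("--gpu-count", type=int, default=int(os.environ.get("SM_NUM_GPUS", "0")))
    # parse_known_args ignores flags this sketch does not declare, such as --model_dir.
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    print(parse_args())
```

Reading the defaults from the environment lets the same script run under `python train.py --epochs 1` in the notebook and inside a SageMaker container without changes.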


In [13]:
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(entry_point='train.py', 
#                          source_dir = './code',
                          role=role,
                          train_instance_count=1, 
#                           train_instance_type='local_gpu',
                          train_instance_type='ml.m4.xlarge',
                          framework_version='2.4.1',
                          py_version='py37',
                          script_mode=True,
                          hyperparameters={'epochs': 1}
                         )

train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
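The warnings above indicate that `train_instance_count` and `train_instance_type` were renamed in version 2 of the SDK. A sketch of the same configuration with the v2 names `instance_count` and `instance_type` follows. `script_mode` is left out on the assumption that v2 always runs TensorFlow in script mode; check the linked migration guide for your SDK version. The helper `build_estimator` is hypothetical.

```python
# Keyword arguments using the SageMaker Python SDK v2 parameter names.
estimator_kwargs = {
    "entry_point": "train.py",
    "instance_count": 1,              # was train_instance_count
    "instance_type": "ml.m4.xlarge",  # was train_instance_type
    "framework_version": "2.4.1",
    "py_version": "py37",
    "hyperparameters": {"epochs": 1},
}


def build_estimator(role, **overrides):
    """Create a TensorFlow estimator; requires the sagemaker package (v2 or later)."""
    # Imported lazily so the dictionary above can be inspected without the SDK installed.
    from sagemaker.tensorflow import TensorFlow

    return TensorFlow(role=role, **{**estimator_kwargs, **overrides})
```

For local mode, the same helper could be called as `build_estimator(role, instance_type=instance_type)` with the `local` or `local_gpu` value detected earlier.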


In [14]:
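# Note: `%time` on a line by itself times an empty statement, not the cell.
# Use `%%time` as the first line of the cell to time the whole cell.
%time
tf_estimator.fit({'training': training_input_path, 'validation': validation_input_path})
print("jupyter.cell:done!")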

CPU times: user 10 µs, sys: 3 µs, total: 13 µs
Wall time: 29.6 µs
2021-11-21 22:05:20 Starting - Starting the training job...
2021-11-21 22:05:43 Starting - Launching requested ML instancesProfilerReport-1637532318: InProgress
...
2021-11-21 22:06:23 Starting - Preparing the instances for training............
2021-11-21 22:08:24 Downloading - Downloading input data...
2021-11-21 22:09:04 Training - Training image download completed. Training in progress..
2021-11-21 22:09:01.052933: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-11-21 22:09:01.058378: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2021-11-21 22:09:01.188005: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-11-21 22:09:05,126 sagemaker-tr


2021-11-21 22:09:27 Uploading - Uploading generated training model
2021-11-21 22:09:27 Failed - Training job failed
ProfilerReport-1637532318: Stopping


UnexpectedStatusException: Error for Training job tensorflow-training-2021-11-21-22-05-17-428: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/usr/local/bin/python3.7 train.py --epochs 1 --model_dir s3://sagemaker-us-east-1-481126471388/tensorflow-training-2021-11-21-22-05-17-428/model"
2021-11-21 22:09:09.440833: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-11-21 22:09:09.440985: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2021-11-21 22:09:09.472235: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
Traceback (most recent call last):
  File "train.py", line 8, in <module>
    import keras
ModuleNotFoundError: No module named 'keras'
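
The traceback shows that `train.py` imports the standalone `keras` package, which was not found in the training container even though the same script ran in the notebook environment. TensorFlow 2.4.1 ships Keras as `tensorflow.keras`, so replacing `import keras` with `from tensorflow import keras` is a likely remedy. That is an assumption, since `train.py` is not shown here. To catch this kind of failure before submitting a job, a sketch like the one below lists the top-level modules a script imports that cannot be found in the current environment. The helper `missing_imports` is hypothetical, and its result is only meaningful when it runs in the same environment as the job, for example inside the local-mode container.

```python
import ast
import importlib.util


def missing_imports(source):
    """Return sorted top-level module names imported by `source` that are not installed."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b" depends on the top-level package "a".
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from a.b import c"; relative imports are skipped.
            names.add(node.module.split(".")[0])
    return sorted(n for n in names if importlib.util.find_spec(n) is None)


# Example with an inline script; for a real file, pass open("train.py").read().
example = "import os\nimport surely_not_installed_pkg_xyz\nfrom json import loads\n"
print(missing_imports(example))
```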