# Train a Keras Sequential Model
This notebook shows how to train and host a Keras Sequential model on SageMaker. The model used for this notebook is a simple deep CNN that was extracted from [the Keras examples](https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py).

## The dataset
The [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) is one of the most popular machine learning datasets. It consists of 60,000 32x32 images belonging to 10 different classes (6,000 images per class). Here are the classes in the dataset, as well as 10 random images from each:

![cifar10](https://maet3608.github.io/nuts-ml/_images/cifar10.png)

In this tutorial, we will train a deep CNN to recognize these images.

We'll compare training with file mode, training with pipe mode datasets, and distributed training with Horovod.

## Getting the data
Copy the CIFAR-10 TFRecord datasets from s3://floor28/data/cifar10 to your local notebook instance.

You can use the following AWS CLI command:

In [None]:
!aws s3 cp --recursive s3://floor28/data/cifar10 ./data

## Run the training locally

The script is configured through command-line arguments. It requires the following:
1. `model_dir` - location where the script saves checkpoints and logs
2. `train`, `validation`, `eval` - locations of the relevant TFRecord files

Run the script locally:

In [1]:
!mkdir -p logs
!python training_script/cifar10_keras.py --model_dir ./logs \
                                         --train data/train \
                                         --validation data/validation \
                                         --eval data/eval \
                                         --epochs 1
!rm -rf logs

Using TensorFlow backend.
2019-07-08 09:11:31.335035: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
2019-07-08 09:11:31.357245: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2500000000 Hz
2019-07-08 09:11:31.357494: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x56093ec8b4a0 executing computations on platform Host. Devices:
2019-07-08 09:11:31.357516: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-07-08 09:11:31.357709: I tensorflow/core/common_runtime/process_util.cc:71] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
INFO:root:getting data
Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Colocations handled automatically by placer.
INFO:root:configuring model

**Although the script ran on a SageMaker notebook here, you can run the same script on your own computer using the same command.**

## Use TensorFlow Script Mode
For TensorFlow versions 1.11 and later, the Amazon SageMaker Python SDK supports script mode training scripts. Script mode has the following advantages over legacy mode training scripts:

* Script mode training scripts are more similar to training scripts you write for TensorFlow in general, so it is easier to modify your existing TensorFlow training scripts to work with Amazon SageMaker.

* Script mode supports both Python 2.7- and Python 3.6-compatible source files.

* Script mode supports Horovod for distributed training.

For information about writing TensorFlow script mode training scripts and using TensorFlow script mode estimators and models with Amazon SageMaker, see https://sagemaker.readthedocs.io/en/stable/using_tf.html.

### Preparing your script for training in SageMaker
The training script is very similar to a training script you might run outside of SageMaker.
SageMaker runs the script with one argument, `model_dir`, an S3 location that it uses for logs and artifacts.

You can access useful properties about the training environment through various environment variables.
For this script, we send three data channels: train, validation, and eval.

**Create a copy of the script (training_script/cifar10_keras.py) and save it as training_script/cifar10_keras_sm.py.**

In cifar10_keras_sm.py, update the `train`, `validation`, and `eval` arguments so that each takes its default value from the relevant environment variable: `SM_CHANNEL_TRAIN`, `SM_CHANNEL_VALIDATION`, `SM_CHANNEL_EVAL`.

For more information, see the SageMaker Python SDK [documentation](https://sagemaker.readthedocs.io/en/stable/using_tf.html#preparing-a-script-mode-training-script).

SageMaker does not pass the data locations as arguments; it sets environment variables instead. In the script, mark those arguments as not required.
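As a minimal sketch of this change, assuming the script builds its arguments with `argparse`: the `build_parser` helper name is hypothetical, and the argument definitions in the original script may differ.

```python
import argparse
import os


def build_parser():
    """Build a parser whose data channels default to SageMaker's environment variables."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_dir', type=str)
    for channel in ('train', 'validation', 'eval'):
        parser.add_argument(
            '--' + channel,
            type=str,
            required=False,  # SageMaker sets the environment variable instead
            default=os.environ.get('SM_CHANNEL_' + channel.upper()))
    return parser


if __name__ == '__main__':
    # parse_known_args ignores arguments this sketch does not define.
    args, _ = build_parser().parse_known_args()
    print(args.train, args.validation, args.eval)
```

With this shape, an explicit `--train data/train` on the command line still overrides the environment variable, so the same script keeps working for local runs.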

SageMaker passes several useful environment variables to your script, for example:
* `SM_MODEL_DIR`: A string that represents the local path where the training job can write the model artifacts. After training, artifacts in this directory are uploaded to S3 for model hosting. This is different from the `model_dir` argument passed to your training script, which is an S3 location. `SM_MODEL_DIR` is always set to `/opt/ml/model`.
* `SM_NUM_GPUS`: An integer representing the number of GPUs available to the host.
* `SM_OUTPUT_DATA_DIR`: A string that represents the path to the directory to write output artifacts to. Output artifacts might include checkpoints, graphs, and other files to save, but do not include model artifacts. These artifacts are compressed and uploaded to an S3 bucket with the same prefix as the model artifacts.
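A small sketch of reading these variables inside a script. The `training_environment` helper and the fallback values (`./model`, `./output`, `0`) are assumptions chosen so the code also runs outside SageMaker, where the variables are not set.

```python
import os


def training_environment(environ=os.environ):
    """Collect a few SageMaker-provided settings, with fallbacks for local runs."""
    return {
        # Fallback paths are illustrative, not SageMaker defaults.
        'model_dir': environ.get('SM_MODEL_DIR', './model'),
        'output_data_dir': environ.get('SM_OUTPUT_DATA_DIR', './output'),
        # Environment variables are strings, so convert the GPU count.
        'num_gpus': int(environ.get('SM_NUM_GPUS', 0)),
    }


if __name__ == '__main__':
    print(training_environment())
```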

In this example, to reduce network latency, we save the model checkpoints locally; they are uploaded to S3 at the end of the job.

Add the following argument to your script:
```python
parser.add_argument(
        '--model_output_dir',
        type=str,
        default=os.environ.get('SM_MODEL_DIR'))
```
Change the `ModelCheckpoint` line to use the new location:
```python
callbacks.append(ModelCheckpoint(args.model_output_dir + '/checkpoint-{epoch}.h5'))
```
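The `{epoch}` placeholder in the file name is filled in when each checkpoint is written. The sketch below only mimics that formatting step with plain `str.format`, to show which file names to expect; `checkpoint_path` is a hypothetical helper and is not part of the training script.

```python
def checkpoint_path(output_dir, epoch):
    """Return the file name the checkpoint template yields for a given epoch."""
    template = output_dir + '/checkpoint-{epoch}.h5'
    return template.format(epoch=epoch)


if __name__ == '__main__':
    for epoch in (1, 2):
        print(checkpoint_path('/opt/ml/model', epoch))
```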

Change the save_model call to use that folder.  
From:  
```python
return save_model(model, args.model_dir)
```
To:  
```python
return save_model(model, args.model_output_dir)
```

### Test your script locally
To test, run the new script with the same command as above and make sure it runs as expected.  
Add the new `model_output_dir` as an argument for the script.

In [None]:
# Run the script locally
...

### Use SageMaker local for local testing
The local mode in the Amazon SageMaker Python SDK can emulate CPU (single and multi-instance) and GPU (single instance) SageMaker training jobs by changing a single argument in the TensorFlow or MXNet estimators.  To do this, it uses Docker Compose and NVIDIA Docker.  It will also pull the Amazon SageMaker TensorFlow container from Amazon ECR.

Training in local mode also allows us to easily monitor metrics like GPU consumption to ensure that our code is written properly to take advantage of the hardware we’re using.

In [None]:
import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()

Using the `sagemaker.tensorflow.TensorFlow` class, we will create a new SageMaker TensorFlow job.
We can use the estimator to pass different configuration values or hyperparameters to the script.

For more information, see the [documentation](https://sagemaker.readthedocs.io/en/stable/using_tf.html#training-with-tensorflow-estimator).

In [None]:
from sagemaker.tensorflow import TensorFlow
estimator = TensorFlow(base_job_name='cifar10',
                       entry_point='cifar10_keras_sm.py',
                       source_dir='training_script',
                       role=role,
                       framework_version='1.12.0',
                       py_version='py3',
                       hyperparameters={'epochs' : 1},
                       train_instance_count=1, train_instance_type='local')

In [None]:
estimator.fit({'train' :  'file://data/train',
               'validation' :  'file://data/validation',
               'eval' :  'file://data/eval'})

The first time the estimator runs, it needs to download the container image from its Amazon ECR repository, but then training can begin immediately.  There’s no need to wait for a separate training cluster to be provisioned.  In addition, on subsequent runs, which may be necessary when iterating and testing, changes to your MXNet or TensorFlow script will start to run instantaneously.

### Using SageMaker for faster training time
In the next part, we'll use a GPU instance for faster training.
First, we'll upload the data to S3.
SageMaker creates a default bucket per region.

In [None]:
dataset_location = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-cifar10')
display(dataset_location)

Create a new estimator. This time, use **ml.p3.2xlarge** as the instance type and set **epochs** to **20**.

In [None]:
estimator = ...

This time, use the S3 data location for each of the channels.
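One possible way to build the channel mapping is sketched below. It assumes the upload kept the `train`, `validation`, and `eval` subfolders of `data` under the returned prefix; `s3_channels` is a hypothetical helper, and the bucket name in the example is made up.

```python
def s3_channels(dataset_location, names=('train', 'validation', 'eval')):
    """Map each channel name to its S3 prefix under the uploaded dataset."""
    base = dataset_location.rstrip('/')
    return {name: '{}/{}'.format(base, name) for name in names}


if __name__ == '__main__':
    # Hypothetical bucket name, for illustration only.
    print(s3_channels('s3://my-bucket/data/DEMO-cifar10'))
```

In the notebook, the resulting dictionary could be passed to `estimator.fit(...)` in place of the `file://` paths used for local mode.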

In [None]:
estimator.fit()

**Good job!** 
You were able to run 20 epochs on a bigger instance in SageMaker.  
Before continuing to the next notebook, take a look at the training jobs section in the SageMaker console, find your job, and look at its configuration.