# Using TensorFlow Scripts on Amazon SageMaker

Starting with TensorFlow version 1.11, you can use SageMaker's TensorFlow containers to train TensorFlow scripts the same way you would train outside SageMaker. This feature is named **Script Mode**. 

This example uses 
[Multi-layer Recurrent Neural Networks (LSTM, RNN) for character-level language models in Python using Tensorflow](https://github.com/sherjilozair/char-rnn-tensorflow). 
You can use the same technique for other scripts or repositories, including 
[TensorFlow Model Zoo](https://github.com/tensorflow/models) and 
[TensorFlow benchmark scripts](https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks).

## Get the data
For training data, we use plain text versions of Sherlock Holmes stories.
Let's create a folder named **sherlock** to store our dataset:

In [None]:
import os
data_dir = os.path.join(os.getcwd(), 'sherlock')

os.makedirs(data_dir, exist_ok=True)

We need to download the dataset to this folder:

In [None]:
!wget https://sherlock-holm.es/stories/plain-text/cnus.txt --force-directories --output-document=sherlock/input.txt
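Because the model works at the character level, its vocabulary is just the set of distinct characters in `input.txt`. The following rough illustration builds a frequency-ordered character-to-id mapping and encodes a string with it. It is a sketch, not code from the char-rnn-tensorflow repository, and the helper names `build_char_vocab` and `encode` are hypothetical.

```python
from collections import Counter


def build_char_vocab(text):
    """Map each distinct character to an integer id, most frequent first.

    Counter.most_common orders equal counts by first appearance, so the
    mapping is deterministic for a given text.
    """
    counts = Counter(text)
    chars = [ch for ch, _ in counts.most_common()]
    return {ch: idx for idx, ch in enumerate(chars)}


def encode(text, vocab):
    """Convert a string into the list of integer ids a char-level model consumes."""
    return [vocab[ch] for ch in text]


if __name__ == "__main__":
    sample = "the game is afoot"
    vocab = build_char_vocab(sample)
    print(len(vocab), encode("the", vocab))
```

To try it on the downloaded corpus, pass it the contents of `sherlock/input.txt`. The number of entries in the mapping is the vocabulary size that a character-level model predicts over.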

## Preparing the training script

Let's start by cloning the repository that contains the example:

In [None]:
!git clone https://github.com/sherjilozair/char-rnn-tensorflow

To train with default parameters on the tinyshakespeare corpus, run **python train.py**. To see all the parameters, run **python train.py --help**.

[train.py](https://github.com/sherjilozair/char-rnn-tensorflow/blob/master/train.py#L11) uses the [argparse](https://docs.python.org/3/library/argparse.html) library and defines the following arguments:

```python
import argparse

parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
# Data and model checkpoints directories
parser.add_argument('--data_dir', type=str, default='data/tinyshakespeare', help='data directory containing input.txt with training examples')
parser.add_argument('--save_dir', type=str, default='save', help='directory to store checkpointed models')
...
args = parser.parse_args()
```
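To see how these arguments are consumed, here is a trimmed, self-contained parser that mirrors only a few options (`--data_dir`, `--save_dir`, and `--num_epochs`). The `--num_epochs` default below is a placeholder, not necessarily the repository's value. Passing an explicit argument list to `parse_args` lets you check the parsing without running a script.

```python
import argparse


def build_parser():
    """Trimmed-down stand-in for the train.py parser (only three options)."""
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('--data_dir', type=str, default='data/tinyshakespeare',
                        help='data directory containing input.txt with training examples')
    parser.add_argument('--save_dir', type=str, default='save',
                        help='directory to store checkpointed models')
    # Placeholder default: run `python train.py --help` for the repository's real value.
    parser.add_argument('--num_epochs', type=int, default=50,
                        help='number of epochs')
    return parser


# Equivalent to running `python3 train.py --num_epochs 1`;
# options that are not given keep their defaults.
args = build_parser().parse_args(['--num_epochs', '1'])
print(args.data_dir, args.save_dir, args.num_epochs)
```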

## Train locally, without using Amazon SageMaker

In [None]:
%cd char-rnn-tensorflow

In [None]:
!python3 train.py --num_epochs 1

In [None]:
%cd ..

## Train locally using SageMaker Python SDK TensorFlow Estimator

You can use the SageMaker Python SDK [`TensorFlow`](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/README.rst#training-with-tensorflow) estimator to easily train locally and in SageMaker. 

The training script runs in the container as shown below:

```bash
python train.py --num_epochs 1 --data_dir /opt/ml/input/data/training --model_dir /opt/ml/model
```

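Because the container passes the output directory as `--model_dir` while train.py calls that option `--save_dir`, we need to rename `--save_dir` to `--model_dir`.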
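To see why the rename is necessary, the sketch below builds two minimal parsers (hypothetical stand-ins for train.py that keep only three options) and feeds each the arguments the container passes. argparse rejects unrecognized options by printing an error and raising `SystemExit`, so the unmodified parser fails on `--model_dir`.

```python
import argparse


def build_parser(output_flag):
    """Minimal stand-in for train.py with a configurable output-directory flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_dir', type=str, default='data/tinyshakespeare')
    parser.add_argument('--num_epochs', type=int, default=1)
    # '--save_dir' before the rename, '--model_dir' after it.
    parser.add_argument(output_flag, type=str, default='save')
    return parser


def accepts(parser, argv):
    """Return True if `parser` parses `argv` without an argparse error."""
    try:
        parser.parse_args(argv)
    except SystemExit:
        # argparse prints "unrecognized arguments: ..." and exits on unknown flags.
        return False
    return True


# Arguments the container passes to the script.
container_argv = ['--num_epochs', '1',
                  '--data_dir', '/opt/ml/input/data/training',
                  '--model_dir', '/opt/ml/model']

print('before rename:', accepts(build_parser('--save_dir'), container_argv))
print('after rename:', accepts(build_parser('--model_dir'), container_argv))
```

`parse_known_args` would also avoid the error by returning unrecognized arguments instead of exiting, but the checkpoints would then still go to the default `save` directory rather than `/opt/ml/model`, which is why renaming the option is the better fit here.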

In [None]:
# Replace every occurrence of save_dir with model_dir in the training script
!sed -i 's/save_dir/model_dir/g' char-rnn-tensorflow/train.py
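`sed -i` is GNU syntax; BSD sed on macOS expects a backup-suffix argument after `-i`. If you prepare the script outside a Linux notebook, a portable Python equivalent is sketched below. `rename_option` is a hypothetical helper, and the demo runs on a throwaway file rather than the real train.py.

```python
import tempfile
from pathlib import Path


def rename_option(path, old='save_dir', new='model_dir'):
    """Replace every occurrence of `old` with `new` in the file at `path`.

    Mirrors `sed -i 's/save_dir/model_dir/g'` (the pattern has no regex
    metacharacters, so a plain string replace is equivalent).
    Returns the number of replacements made.
    """
    file = Path(path)
    text = file.read_text()
    count = text.count(old)
    file.write_text(text.replace(old, new))
    return count


# Demo on a throwaway file rather than char-rnn-tensorflow/train.py.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / 'train.py'
    demo.write_text("parser.add_argument('--save_dir')\nckpt = args.save_dir\n")
    replaced = rename_option(demo)
    demo_text = demo.read_text()

print(replaced, demo_text)
```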

This notebook shows how to use the SageMaker Python SDK to run your code in a local container before deploying to SageMaker's managed training or hosting environments. To do so, set your estimator's `train_instance_type` to `'local'` or `'local_gpu'`. For more information, see https://github.com/aws/sagemaker-python-sdk#local-mode.

To use this feature, you need to install docker-compose (and nvidia-docker if you train with a GPU). Running the following script installs docker-compose or nvidia-docker-compose and configures the notebook environment for you.

Note: you can run only one notebook in local mode at a time.

In [None]:
!/bin/bash ./setup.sh
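To confirm what is on your `PATH` before using local mode, you can run a quick check like the one below. The tool list is an assumption based on the requirements above, plus `docker` itself, which docker-compose drives; `missing_tools` is a hypothetical helper.

```python
import shutil


def missing_tools(gpu=False):
    """Return the names of local-mode executables that are not found on PATH."""
    tools = ['docker', 'docker-compose']
    if gpu:
        # GPU local mode additionally needs the NVIDIA container runtime wrapper.
        tools.append('nvidia-docker')
    return [tool for tool in tools if shutil.which(tool) is None]


print('missing:', missing_tools(gpu=False))
```

An empty list means every listed tool was found; it does not check versions or whether the Docker daemon is running.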

We create the `TensorFlow` estimator, passing the flag `script_mode=True`. To train locally, set `train_instance_type` to [local](https://github.com/aws/sagemaker-python-sdk#local-mode):

In [None]:
import os

import sagemaker
from sagemaker.tensorflow import TensorFlow

hyperparameters = {'num_epochs': 1, 'data_dir': '/opt/ml/input/data/training'}

estimator = TensorFlow(entry_point='train.py',
                       source_dir='char-rnn-tensorflow',
                       train_instance_type='local',
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=sagemaker.get_execution_role(),
                       framework_version='1.11.0',
                       py_version='py3',
                       script_mode=True)

To start a training job, we call `estimator.fit(inputs)`, where `inputs` is a dictionary whose keys, called **channels**, map to the data locations.

`estimator.fit(inputs)` pulls the TensorFlow container (Python 3, CPU version) locally and simulates a SageMaker training job.
When training starts, the TensorFlow container runs **train.py**, passing `hyperparameters` and `model_dir` as script arguments, so the example executes as follows:

```bash
python -m train --num_epochs 1 --data_dir /opt/ml/input/data/training --model_dir /opt/ml/model
```
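The mapping from the `hyperparameters` dictionary to the flags above can be illustrated with a small helper. `to_cmd_args` is a hypothetical name; the container performs an equivalent conversion internally, and the ordering it uses may differ (keys are sorted here only for determinism).

```python
def to_cmd_args(hyperparameters, model_dir='/opt/ml/model'):
    """Hypothetical illustration: turn hyperparameters into `--key value` flags.

    Values are converted to strings because command-line arguments are always
    strings; argparse's `type=` converts them back inside train.py.
    """
    params = dict(hyperparameters, model_dir=model_dir)
    argv = []
    for key in sorted(params):  # sorted only to make the output deterministic
        argv.extend([f'--{key}', str(params[key])])
    return argv


hyperparameters = {'num_epochs': 1, 'data_dir': '/opt/ml/input/data/training'}
print(' '.join(['python', 'train.py'] + to_cmd_args(hyperparameters)))
```

Because each key becomes `--key` verbatim, the hyperparameter names must match the option names that train.py declares, including underscores.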

`/opt/ml/input/data/training` is the directory inside the container **where the training data is downloaded**. The data lands in this folder because `training` is the channel name defined in `estimator.fit({'training': inputs})`. See training data for more information.
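Inside the container, the SageMaker training toolkit also exposes each channel's path through an environment variable named `SM_CHANNEL_<CHANNEL_NAME>` (here `SM_CHANNEL_TRAINING`). A script can read it instead of hard-coding the path. The helper below is a sketch; its fallback follows the `/opt/ml/input/data/<channel>` layout described above, and the `environ` parameter exists only to make it testable.

```python
import os


def channel_dir(channel, environ=None):
    """Return the local directory for a SageMaker input channel.

    Prefers the SM_CHANNEL_<NAME> environment variable and falls back to
    the /opt/ml/input/data/<channel> convention.
    """
    environ = os.environ if environ is None else environ
    default = f'/opt/ml/input/data/{channel}'
    return environ.get(f'SM_CHANNEL_{channel.upper()}', default)


print(channel_dir('training', environ={}))
print(channel_dir('training', environ={'SM_CHANNEL_TRAINING': '/tmp/sherlock'}))
```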

`/opt/ml/model` is the directory to use to **save models, checkpoints, or any other data**. SageMaker uploads any data saved in this folder to the S3 bucket defined for training. See model data for more information.
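A training script only needs to write into the model directory; SageMaker handles the upload after the script finishes. The sketch below saves a small JSON file of run settings there. The helper `save_run_config` and the file name `run_config.json` are hypothetical, while `SM_MODEL_DIR` is the training toolkit's environment variable for this directory.

```python
import json
import os
import tempfile


def save_run_config(config, model_dir=None):
    """Write `config` as JSON into the model directory and return the file path.

    Uses SM_MODEL_DIR when set, falling back to /opt/ml/model.
    """
    model_dir = model_dir or os.environ.get('SM_MODEL_DIR', '/opt/ml/model')
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, 'run_config.json')
    with open(path, 'w') as f:
        json.dump(config, f, indent=2)
    return path


# Demo against a temporary directory instead of /opt/ml/model.
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_run_config({'num_epochs': 1}, model_dir=tmp)
    with open(saved_path) as f:
        saved_config = json.load(f)

print(saved_config)
```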


In [None]:
inputs = {'training': f'file://{data_dir}'}

estimator.fit(inputs)

## Train on infrastructure managed by Amazon SageMaker

After you test the training job locally, upload the dataset to an S3 bucket so SageMaker can access the data during training:

In [None]:
import sagemaker

inputs = sagemaker.Session().upload_data(path='sherlock', key_prefix='datasets/sherlock')

The returned variable `inputs` is a string containing an S3 location that SageMaker Training has permission to read data from.

In [None]:
inputs
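Because `inputs` is just an `s3://bucket/prefix` string, you can take it apart with the standard library, for example to check which bucket the job will read from. `upload_data` uses the session's default bucket unless you pass one. The URI in the example is a made-up placeholder, and `split_s3_uri` is a hypothetical helper.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key/prefix' into ('bucket', 'key/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3':
        raise ValueError(f'not an S3 URI: {uri!r}')
    return parsed.netloc, parsed.path.lstrip('/')


# Placeholder bucket name, not a real upload result.
bucket, prefix = split_s3_uri('s3://example-bucket/datasets/sherlock')
print(bucket, prefix)
```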

To train in SageMaker:
- Change the estimator argument `train_instance_type` to any SageMaker ML instance type available for training.
- Set the `training` channel to an S3 location.
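These two changes are the only differences between the local and managed runs in this notebook. One way to keep both configurations side by side is a helper that returns the instance type and the channel mapping for each mode. `training_config` is a hypothetical helper, and `ml.c5.9xlarge` simply mirrors the instance type used in the next cell.

```python
def training_config(local, data_location, gpu=False):
    """Return (train_instance_type, fit_inputs) for a local or managed run.

    Hypothetical helper: `gpu` only affects local mode here, and the managed
    instance type mirrors the ml.c5.9xlarge CPU instance used in this notebook.
    """
    if local:
        instance_type = 'local_gpu' if gpu else 'local'
    else:
        # Managed training jobs cannot see the notebook's local disk.
        if not data_location.startswith('s3://'):
            raise ValueError('managed training needs an s3:// data location')
        instance_type = 'ml.c5.9xlarge'
    return instance_type, {'training': data_location}


print(training_config(True, 'file:///home/ec2-user/sherlock'))
print(training_config(False, 's3://example-bucket/datasets/sherlock'))
```

The returned pair plugs into `TensorFlow(train_instance_type=...)` and `estimator.fit(...)` respectively, so switching modes touches a single flag.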

In [None]:
hyperparameters = {'num_epochs': 1, 'data_dir': '/opt/ml/input/data/training'}

estimator = TensorFlow(entry_point='train.py',
                       source_dir='char-rnn-tensorflow',
                       train_instance_type='ml.c5.9xlarge',
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=sagemaker.get_execution_role(),
                       framework_version='1.11.0',
                       py_version='py3',
                       script_mode=True)
             

estimator.fit({'training': inputs})