# Training PGGANs

- [EstimatorBase](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html?highlight=estimatorBase)
- [SageMaker TensorFlow Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html)
- [Inputs](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html)
- [Amazon SageMaker adds Batch Transform feature and Pipe input mode for TensorFlow containers (Korean)](https://aws.amazon.com/ko/blogs/korea/amazon-sagemaker-adds-batch-transform-feature-and-pipe-input-mode-for-tensorflow-containers/)
- [sagemaker-tensorflow-extensions](https://github.com/aws/sagemaker-tensorflow-extensions)

In [87]:
from sagemaker.tensorflow.estimator import TensorFlow
from sagemaker.estimator import Framework
from sagemaker import get_execution_role
from sagemaker.session import s3_input
import boto3
import os 

In [96]:
BUCKET_NAME = "sagemaker-jhgan-workspace" # bucket where the results are stored
JOB_NAME = "train-31"
S3_OUTPUT_LOCATION = f"s3://{BUCKET_NAME}/{JOB_NAME}"
DOCKER_IMAGE_URI = "349048005035.dkr.ecr.us-east-2.amazonaws.com/pggan:1.15.3-gpu-py3"
VOLUME_SIZE = 200
INSTANCE_COUNT = 1
INSTANCE_TYPE = "ml.p2.8xlarge"
FRAMEWORK_VERSION = "1.6.0"
PY_VERSION = "py3"
INPUT_MODE = "Pipe"
INPUTS = f"s3://{BUCKET_NAME}/nii-to-tfrecord-01/output/output-1/"
print(f"S3 Output Path: {S3_OUTPUT_LOCATION}")

S3 Output Path: s3://sagemaker-jhgan-workspace/train-31


In [97]:
inputs = {}
for i in range(2,10):
    name = f"r0{i}"
    inputs[name] = s3_input(INPUTS + name)
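
The `f"r0{i}"` pattern in the cell above only yields valid names for single-digit values, while the `tfrecords` file names in this notebook use the zero-padded `%02d` pattern. A minimal sketch of the same channel mapping with zero-padded names follows. `build_channel_uris` and the bucket name are hypothetical, and the helper returns plain URI strings so that it runs without the SageMaker SDK:

```python
def build_channel_uris(base_uri, first_log2, last_log2):
    """Map zero-padded channel names such as 'r02' to S3 prefixes under base_uri."""
    base_uri = base_uri.rstrip("/") + "/"
    return {
        f"r{size_log2:02d}": f"{base_uri}r{size_log2:02d}"
        for size_log2 in range(first_log2, last_log2 + 1)
    }

# Hypothetical bucket and prefix, used only for illustration.
channel_uris = build_channel_uris("s3://example-bucket/tfrecords/", 2, 9)
for name, uri in channel_uris.items():
    print(name, uri)
```

Each URI could then be wrapped in `s3_input` as in the cell above. With this naming, a 1024-pixel dataset would get the channel `r10` rather than `r010`.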

In [98]:
estimator = TensorFlow(
    entry_point = "train.py",
    image_name = DOCKER_IMAGE_URI,
    role = get_execution_role(),
    output_path=S3_OUTPUT_LOCATION,
    train_instance_count=INSTANCE_COUNT,
    train_instance_type=INSTANCE_TYPE,
    train_volume_size=VOLUME_SIZE,
    framework_version=FRAMEWORK_VERSION,
    py_version=PY_VERSION,
    input_mode=INPUT_MODE,
    source_dir = "./"
)

In [99]:
estimator.fit(
    inputs=inputs,
    job_name=JOB_NAME
)

INFO:sagemaker:Creating training-job with name: train-31


2020-07-24 11:52:36 Starting - Starting the training job...
2020-07-24 11:52:37 Starting - Launching requested ML instances.........
2020-07-24 11:54:12 Starting - Preparing the instances for training.........
2020-07-24 11:55:59 Downloading - Downloading input data...
2020-07-24 11:56:13 Training - Downloading the training image............

2020-07-24 11:58:22,222 sagemaker-training-toolkit INFO     Imported framework sagemaker_tensorflow_container.training
2020-07-24 11:58:22,975 sagemaker-training-toolkit INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "r03": "/opt/ml/input/data/r03",
        "r02": "/opt/ml/input/data/r02",
        "r05": "/opt/ml/input/data/r05",
        "r04": "/opt/ml/input/data/r04",
        "r07": "/opt/ml/input/data/r07",
        "r06": "/opt/ml/input/data/r06",
        "r09": "/opt/ml/input/data/r09",
        "r08": "/opt/ml/input/data/r08"
   

UnexpectedStatusException: Error for Training job train-31: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/usr/bin/python3 train.py --model_dir s3://sagemaker-jhgan-workspace/train-31/train-31/model"
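
The truncated log above lists one input directory per channel. Assuming the container runs the SageMaker training toolkit, as the log lines suggest, the channel names are also exposed to `train.py` through the `SM_CHANNELS` environment variable as a JSON-encoded list. A minimal sketch that reads the names instead of hard-coding them follows. `list_input_channels` is a hypothetical helper, and the environment is simulated with a plain dictionary:

```python
import json
import os

def list_input_channels(environ=None):
    """Return the sorted channel names from SM_CHANNELS, or [] if it is unset."""
    environ = os.environ if environ is None else environ
    return sorted(json.loads(environ.get("SM_CHANNELS", "[]")))

# Simulated environment; inside a training container os.environ would be used.
fake_env = {"SM_CHANNELS": '["r03", "r02", "r05", "r04"]'}
print(list_input_channels(fake_env))
```

Sorting the names gives a stable order from the lowest to the highest resolution, as long as the names are zero-padded.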

# Exercise

## 1. Creating the dataset

The resulting `tfrecords` files are datasets at resolutions from $2^2$ to $2^9$, one file per resolution.

> The datasets are represented by directories containing the same image data in several resolutions to enable efficient streaming. There is a separate *.tfrecords file for each resolution, and if the dataset contains labels, they are stored in a separate file as well:

In [77]:
import numpy as np

shape = (1, 512, 512)  # (channels, height, width)
resolution_log2 = int(np.log2(shape[1]))
print(f"Image Shape: {shape}\nResolutionLog2: {resolution_log2}")

Image Shape: (1, 512, 512)
ResolutionLog2: 9


In [78]:
assert shape[0] in [1, 3]
assert shape[1] == shape[2]
assert shape[1] == 2**resolution_log2

In [None]:
for lod in range(resolution_log2 - 1):
    size_log2 = (resolution_log2 - lod)
    tfr_file = '-r%02d.tfrecords' % (size_log2)
    # Use ** for exponentiation; ^ is bitwise XOR in Python.
    print(tfr_file, 2**size_log2)
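
The conversion job that produced the files is not shown in this notebook. As an illustration of how one image can yield a file per resolution, the sketch below halves the image repeatedly by averaging non-overlapping 2x2 blocks. Both the averaging scheme and the helper name `build_pyramid` are assumptions:

```python
import numpy as np

def build_pyramid(image):
    """Return {size_log2: image} from full resolution down to 4x4 (size_log2 = 2).

    Assumes an array shaped (channels, height, width) with a square,
    power-of-two spatial size. Halving by 2x2 averaging is an assumption.
    """
    channels, height, width = image.shape
    resolution_log2 = int(np.log2(height))
    assert height == width == 2 ** resolution_log2

    pyramid = {}
    current = image.astype(np.float32)
    for lod in range(resolution_log2 - 1):
        size_log2 = resolution_log2 - lod
        pyramid[size_log2] = current
        # Average each non-overlapping 2x2 block to halve height and width.
        current = (current[:, 0::2, 0::2] + current[:, 0::2, 1::2]
                   + current[:, 1::2, 0::2] + current[:, 1::2, 1::2]) * 0.25
    return pyramid

pyramid = build_pyramid(np.ones((1, 512, 512), dtype=np.float32))
for size_log2 in sorted(pyramid):
    print(f"r{size_log2:02d}", pyramid[size_log2].shape)
```

For a 512x512 input this gives eight levels, `r02` through `r09`, which matches the range of resolutions described above.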

## 2. Training the model

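The original code builds a `TFRecordDataset` instance for each `tfrecords` file. Here, however, the data arrives through `Pipe` input mode rather than the file system, so the code has to be modified. First, `tfr_shapes` is assigned hard-coded values, and then the code that depends on it is tested.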

In [40]:
import numpy as np
import tensorflow as tf




In [66]:
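# Hard-coded shapes (channels, height, width), one per resolution from 2^1 to 2^9.
tfr_shapes = [(1, 2**i, 2**i) for i in range(1, 10)]
max_shape = max(tfr_shapes, key=lambda shape: np.prod(shape))
resolution = None  # None means: use the largest available resolution
resolution = resolution if resolution is not None else max_shape[1]
resolution_log2 = int(np.log2(resolution))
# Level of detail per shape: 0 for the full resolution, larger for smaller images.
tfr_lods = [resolution_log2 - int(np.log2(shape[1])) for shape in tfr_shapes]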

In [71]:
tfr_channels = ["r0" + str(i) for i in range(1,10)]

In [72]:
for shape, channel in zip(tfr_shapes, tfr_channels):
    print(shape, channel)

(1, 2, 2) r01
(1, 4, 4) r02
(1, 8, 8) r03
(1, 16, 16) r04
(1, 32, 32) r05
(1, 64, 64) r06
(1, 128, 128) r07
(1, 256, 256) r08
(1, 512, 512) r09


In [67]:
assert all(shape[0] == max_shape[0] for shape in tfr_shapes)
assert all(shape[1] == shape[2] for shape in tfr_shapes)
assert all(shape[1] == resolution // (2**lod) for shape, lod in zip(tfr_shapes, tfr_lods))
assert all(lod in tfr_lods for lod in range(resolution_log2 - 1))
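
The training job earlier in this notebook defines channels `r02` through `r09`, whereas the hard-coded lists above start at `r01` with shape `(1, 2, 2)`. Because the dataset code zips channel names, shapes and lods together, the three lists need the same length and order. A minimal sketch, assuming 512x512 single-channel images and the `rNN` channel naming used here, derives all three from one loop (`describe_channels` is a hypothetical helper, not part of the original code):

```python
def describe_channels(first_log2=2, last_log2=9, num_image_channels=1):
    """Return (channel_name, shape, lod) tuples, ordered from lowest to highest resolution."""
    rows = []
    for size_log2 in range(first_log2, last_log2 + 1):
        size = 2 ** size_log2
        rows.append((
            f"r{size_log2:02d}",
            (num_image_channels, size, size),
            last_log2 - size_log2,  # lod 0 is the full resolution
        ))
    return rows

rows = describe_channels()
tfr_channels = [name for name, _, _ in rows]
tfr_shapes = [shape for _, shape, _ in rows]
tfr_lods = [lod for _, _, lod in rows]

for row in rows:
    print(*row)
```

Building the lists from a single source avoids pairing a channel with the shape of a neighbouring resolution when one list starts at a different exponent.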

Next, the `tfrecords` files for each resolution are assigned to separate input channels, and a `PipeModeDataset` is created for each channel.

```python
# Build TF expressions.
with tf.name_scope('Dataset'), tf.device('/cpu:0'):
    self._tf_minibatch_in = tf.placeholder(tf.int64, name='minibatch_in', shape=[])
    tf_labels_init = tf.zeros(self._np_labels.shape, self._np_labels.dtype)
    self._tf_labels_var = tf.Variable(tf_labels_init, name='labels_var')
    tfutil.set_vars({self._tf_labels_var: self._np_labels})
    self._tf_labels_dataset = tf.data.Dataset.from_tensor_slices(self._tf_labels_var)
#----------------------------------------------------------------------------
# Modified for the SageMaker workflow: build one dataset per input channel instead of per file.
# PipeModeDataset comes from sagemaker-tensorflow-extensions: from sagemaker_tensorflow import PipeModeDataset
# tfr_shapes and tfr_lods must be listed in the same order as tfr_channels.
    tfr_channels = ["r0" + str(i) for i in range(2,10)]
    for tfr_channel, tfr_shape, tfr_lod in zip(tfr_channels, tfr_shapes, tfr_lods):
        if tfr_lod < 0:
            continue
        dset = PipeModeDataset(channel=tfr_channel, record_format='TFRecord')
        # Original: dset = tf.data.TFRecordDataset(tfr_file, compression_type='', buffer_size=buffer_mb<<20)
        dset = dset.map(parse_tfrecord_tf, num_parallel_calls=num_threads)
        dset = tf.data.Dataset.zip((dset, self._tf_labels_dataset))
        bytes_per_item = np.prod(tfr_shape) * np.dtype(self.dtype).itemsize
        if shuffle_mb > 0:
            dset = dset.shuffle(((shuffle_mb << 20) - 1) // bytes_per_item + 1)
        if repeat:
            dset = dset.repeat()
        if prefetch_mb > 0:
            dset = dset.prefetch(((prefetch_mb << 20) - 1) // bytes_per_item + 1)
        dset = dset.batch(self._tf_minibatch_in)
        self._tf_datasets[tfr_lod] = dset
    self._tf_iterator = tf.data.Iterator.from_structure(self._tf_datasets[0].output_types, self._tf_datasets[0].output_shapes)
    self._tf_init_ops = {lod: self._tf_iterator.make_initializer(dset) for lod, dset in self._tf_datasets.items()}
#----------------------------------------------------------------------------
```
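
The shuffle and prefetch sizes in the block above convert a memory budget in MiB into a number of items with the expression `((mb << 20) - 1) // bytes_per_item + 1`, which is an integer ceiling division. A worked sketch of that arithmetic follows. The helper name `items_for_budget`, the `uint8` item size and the 4096 MiB budget are illustrative assumptions:

```python
def items_for_budget(budget_mb, bytes_per_item):
    """Number of items covering a budget given in MiB, rounded up to a whole item."""
    budget_bytes = budget_mb << 20  # 1 MiB = 2**20 bytes
    return (budget_bytes - 1) // bytes_per_item + 1  # integer ceiling division

# Assumed example values: uint8 pixels and a 4096 MiB shuffle budget.
itemsize = 1
shuffle_mb = 4096
for shape in [(1, 4, 4), (1, 512, 512)]:
    bytes_per_item = shape[0] * shape[1] * shape[2] * itemsize
    print(shape, items_for_budget(shuffle_mb, bytes_per_item))
```

Under these assumptions the same budget holds 16384 full-resolution images but over 268 million 4x4 images, so the shuffle buffer size is recomputed for each resolution.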