In [None]:
!git clone https://github.com/facebookresearch/pytorch_GAN_zoo

In [None]:
!pip install -r pytorch_GAN_zoo/requirements.txt

In [None]:
!conda install -y pytorch=1.5 torchvision=0.6

### Data preparation

PyTorch's torchvision.datasets package provides the QMNIST dataset. The following cell downloads QMNIST to a local directory for later use.

In [None]:
from torchvision import datasets

dataroot = './data'

# Download the QMNIST training and test splits into ./data/QMNIST
trainset = datasets.QMNIST(root=dataroot, train=True, download=True)
testset = datasets.QMNIST(root=dataroot, train=False, download=True)

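Amazon SageMaker creates a default Amazon S3 bucket for you to store the files and data your machine learning workflow may need. We can get the name of this bucket from the default_bucket method of the sagemaker.session.Session class in the SageMaker SDK.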
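For reference, the bucket returned by `default_bucket()` follows the naming pattern `sagemaker-{region}-{account_id}` (the real method also creates the bucket if it does not exist yet). The sketch below only reproduces that naming convention and a small helper for joining S3 paths; the region and account ID are placeholders, and no AWS calls are made.

```python
def expected_default_bucket(region: str, account_id: str) -> str:
    """Return the name pattern used by Session.default_bucket()."""
    # Only the naming convention is reproduced; the bucket is not created here.
    return f"sagemaker-{region}-{account_id}"


def s3_uri(bucket: str, *parts: str) -> str:
    """Join a bucket and key parts into an s3:// URI without duplicate slashes."""
    key = "/".join(p.strip("/") for p in parts if p.strip("/"))
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}/"


example_bucket = expected_default_bucket("us-east-1", "123456789012")  # placeholder values
print(example_bucket)
print(s3_uri(example_bucket, "customcode", "byos-pytorch-gan"))
```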

In [None]:
from sagemaker.session import Session

sess = Session()

# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket here if you wish.
bucket = sess.default_bucket()

# Location to save your custom code in tar.gz format.
s3_custom_code_upload_location = f's3://{bucket}/customcode/byos-pytorch-gan'

# Location where results of model training are saved.
s3_model_artifacts_location = f's3://{bucket}/artifacts/'

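The SageMaker SDK provides packages and classes for working with Amazon S3: the S3Downloader class reads or downloads objects from S3, and S3Uploader uploads local files to S3. Upload the data you have already downloaded to Amazon S3 so the training job can use it. Having the training job read its data from S3 instead of downloading it from the internet avoids the network latency of fetching training data over the internet, and also avoids the security risks that direct internet access could pose to model training.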
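When given a directory, the upload walks it recursively and stores each file under the destination prefix, keeping its path relative to that directory. The stdlib sketch below only computes that mapping (it never contacts S3), which can help check what `s3_data_location` will contain; the key layout is modeled on how the SDK's `upload_data` handles directories, and the QMNIST file names are hypothetical stand-ins.

```python
import os
import tempfile


def planned_s3_keys(local_dir: str, key_prefix: str) -> list:
    """List the S3 keys a recursive directory upload would create (no network calls)."""
    keys = []
    for dirpath, _, filenames in os.walk(local_dir):
        rel_dir = os.path.relpath(dirpath, start=local_dir)
        for name in filenames:
            rel_path = name if rel_dir == "." else os.path.join(rel_dir, name)
            # S3 keys always use forward slashes, regardless of the local OS.
            keys.append(f"{key_prefix.rstrip('/')}/{rel_path.replace(os.sep, '/')}")
    return sorted(keys)


# Build a tiny stand-in for ./data/QMNIST with hypothetical file names.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "raw"))
    for rel in ("raw/qmnist-train-images-idx3-ubyte", "raw/qmnist-train-labels-idx2-int"):
        open(os.path.join(root, rel), "wb").close()
    keys = planned_s3_keys(root, "data/qmnist")
print(keys)
```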


In [None]:
import os
from sagemaker.s3 import S3Uploader as s3up

s3_data_location = s3up.upload(os.path.join(dataroot, "QMNIST"), f"s3://{bucket}/data/qmnist")

### Training

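With the sagemaker.get_execution_role() method, the notebook obtains the role that was assigned to the notebook instance in advance. This role is used to acquire the resources needed for training, such as pulling the training framework image and allocating Amazon EC2 compute resources.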

In [None]:
from sagemaker import get_execution_role

# IAM execution role that gives SageMaker access to resources in your AWS account.
# We can use the SageMaker Python SDK to get the role from our notebook environment. 
role = get_execution_role()

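The hyperparameters for training can be defined in the notebook, keeping them separate from the algorithm code. They are passed in when the training job is created and combined with the job at run time.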
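SageMaker's framework containers hand these hyperparameters to the entry-point script as command-line arguments of the form `--key value`, so `train.py` typically reads them with `argparse`. The sketch below assumes that convention; the parser is a hypothetical subset of what `train.py` might declare, not a copy of the real script.

```python
import argparse


def hps_to_argv(hps: dict) -> list:
    """Flatten a hyperparameter dict into '--key value' command-line tokens."""
    argv = []
    for key, value in hps.items():
        argv.extend([f"--{key}", str(value)])
    return argv


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical subset of the arguments train.py might declare.
    parser = argparse.ArgumentParser()
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--learning-rate", type=float, default=0.0002)
    parser.add_argument("--epochs", type=int, default=15)
    parser.add_argument("--dataset", type=str, default="qmnist")
    parser.add_argument("--batch-size", type=int, default=64)
    return parser


example_hps = {"seed": 0, "learning-rate": 0.0002, "epochs": 15, "batch-size": 64}
# parse_known_args ignores any hyperparameters this parser does not declare.
args, unknown = build_parser().parse_known_args(hps_to_argv(example_hps))
print(args.learning_rate, args.epochs, args.batch_size)
```

Note that `argparse` turns a dashed flag such as `--learning-rate` into the attribute `args.learning_rate`.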

In [None]:
import json

hps = {
         'seed': 0,
         'learning-rate': 0.0002,
         'epochs': 15,
         'dataset': 'qmnist',
         'pin-memory': 1,
         'beta1': 0.5,
         'nc': 1,
         'nz': 100,
         'ngf': 64,
         'ndf': 64,
         'batch-size': 64,
         'sample-interval': 100,
         'log-interval': 20,
     }


print(json.dumps(hps, indent = 4))

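The ``PyTorch`` class in the sagemaker.pytorch package is an estimator for the PyTorch framework: it can create and run training jobs and also deploy the trained model. In its parameter list, ``train_instance_type`` specifies the CPU or GPU instance type, ``source_dir`` specifies the directory that contains the training script and model code, and ``entry_point`` must explicitly name the training script file. These parameters, together with the others, are passed to the training job and determine its runtime environment and the parameters used during training.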
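The `train_*` keyword names used here belong to version 1.x of the SageMaker Python SDK; version 2.x renamed them. If the environment has SDK 2.x installed, a small hypothetical helper like the one below can rewrite the v1 names used in this notebook to their v2 equivalents. It covers only these parameters, not every v1-to-v2 change.

```python
# v1 -> v2 renames for the estimator arguments used in this notebook.
V1_TO_V2 = {
    "train_instance_count": "instance_count",
    "train_instance_type": "instance_type",
    "train_use_spot_instances": "use_spot_instances",
    "train_max_wait": "max_wait",
    "train_max_run": "max_run",
}


def to_v2_kwargs(kwargs: dict) -> dict:
    """Return a copy of kwargs with v1 'train_*' names replaced by v2 names."""
    return {V1_TO_V2.get(k, k): v for k, v in kwargs.items()}


v1_kwargs = {
    "entry_point": "train.py",
    "train_instance_count": 1,
    "train_instance_type": "ml.c5.large",
    "train_use_spot_instances": True,
    "train_max_wait": 86400,
}
print(to_v2_kwargs(v1_kwargs))
```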

In [None]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(role=role,
                        entry_point='train.py',
                        source_dir='dcgan',
                        output_path=s3_model_artifacts_location,
                        code_location=s3_custom_code_upload_location,
                        train_instance_count=1,
                        train_instance_type='ml.c5.large',
                        train_use_spot_instances=True,
                        train_max_wait=86400,
                        framework_version='1.4.0',
                        py_version='py3',
                        hyperparameters=hps)

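Pay particular attention to the ``train_use_spot_instances`` parameter: a value of ``True`` means you prefer to use Spot Instances. Because machine learning training usually needs a large amount of compute for a long time, making good use of Spot Instances can help you keep costs under control. A Spot Instance may cost 20% to 60% of the On-Demand price; the actual price varies with the instance type, Region, and time.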
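To make that range concrete, the sketch below applies the 20% to 60% ratios to a hypothetical On-Demand hourly price. It also checks a constraint of managed Spot training: the maximum wait time must be at least the maximum run time. The estimator above sets only ``train_max_wait``, so this assumes the SDK's default ``train_max_run`` of 86400 seconds (24 hours).

```python
def spot_cost_range(on_demand_hourly: float, hours: float,
                    low_ratio: float = 0.2, high_ratio: float = 0.6) -> tuple:
    """Estimate (on-demand, spot-low, spot-high) cost for a training run."""
    on_demand = on_demand_hourly * hours
    return on_demand, on_demand * low_ratio, on_demand * high_ratio


# Hypothetical numbers: a $0.10/hour instance running for 10 hours.
on_demand, spot_low, spot_high = spot_cost_range(0.10, 10)
print(f"on-demand ${on_demand:.2f}, spot ${spot_low:.2f} to ${spot_high:.2f}")


def check_spot_timeouts(max_run: int, max_wait: int) -> None:
    # For managed Spot training, the wait limit must cover the run limit.
    if max_wait < max_run:
        raise ValueError("train_max_wait must be >= train_max_run")


check_spot_timeouts(max_run=86400, max_wait=86400)  # assumed default run limit, wait set above
```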

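With the PyTorch object created, you can now use it to fit the data already stored in Amazon S3. The command below starts the training job, and the training data is brought into the training environment as an input channel named **QMNIST**. When training starts, the training data on Amazon S3 is downloaded to the local file system of the training environment, and the training script ``train.py`` loads the data from local disk. Because ``wait=False`` is passed, the call returns immediately and the job runs in the background.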
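Inside the training container, SageMaker places each input channel under `/opt/ml/input/data/<channel>` and publishes that path in an `SM_CHANNEL_<CHANNEL>` environment variable (here `SM_CHANNEL_QMNIST`). A minimal sketch of how `train.py` might locate its data, assuming that convention and falling back to a local path when run outside SageMaker:

```python
import os


def channel_dir(channel: str = "QMNIST", fallback: str = "./data") -> str:
    """Resolve the local directory for a SageMaker input channel."""
    # SageMaker sets SM_CHANNEL_<NAME> (upper-cased channel name) in the container.
    return os.environ.get(f"SM_CHANNEL_{channel.upper()}", fallback)


print(channel_dir())  # "./data" outside SageMaker unless SM_CHANNEL_QMNIST is set
```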

In [None]:
# Start training
estimator.fit({"QMNIST": s3_data_location}, wait=False)

In [None]:
!conda install -y -c conda-forge ipywidgets nodejs
!conda update -y tqdm

In [None]:
!jupyter labextension install @jupyter-widgets/jupyterlab-manager

## Example: PyTorch deployments using TorchServe and Amazon SageMaker

In this example, we’ll show you how you can build a TorchServe container and host it using Amazon SageMaker. With Amazon SageMaker hosting you get a fully-managed hosting experience. Just specify the type of instance, and the maximum and minimum number desired, and SageMaker takes care of the rest.

With a few lines of code, you can ask Amazon SageMaker to launch the instances, download your model from Amazon S3 to your TorchServe container, and set up the secure HTTPS endpoint for your application. On the client side, you get predictions with a simple API call to this secure endpoint backed by TorchServe.
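Besides the SDK's predictor object, any client can call the endpoint through the SageMaker Runtime `InvokeEndpoint` API. The sketch below wraps boto3's `invoke_endpoint` call; the endpoint name and content type are placeholders (match whatever your handler expects), and the client is passed in so the function can be exercised with a stand-in object when no AWS credentials are available.

```python
import io


def invoke(runtime_client, endpoint_name: str, payload: bytes,
           content_type: str = "application/octet-stream") -> bytes:
    """Send raw bytes to a SageMaker endpoint and return the raw response body."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType=content_type,
        Body=payload,
    )
    # The response "Body" is a streaming object; read() returns the bytes.
    return response["Body"].read()


class FakeRuntime:
    """Stand-in for boto3.client("sagemaker-runtime"), used only for local testing."""

    def invoke_endpoint(self, **kwargs):
        return {"Body": io.BytesIO(b"echo:" + kwargs["Body"])}


# Real usage (requires AWS credentials and a deployed endpoint):
#   import boto3
#   body = invoke(boto3.client("sagemaker-runtime"), "my-endpoint", payload)
print(invoke(FakeRuntime(), "my-endpoint", b"abc"))
```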

Code, configuration files, Jupyter notebooks and Dockerfiles used in this example are available here:
https://github.com/shashankprasanna/torchserve-examples.git


In [None]:
#For CPU
!conda install -y -c pytorch -c powerai pytorch=1.5 torchtext torchvision

In [None]:
#For GPU
!conda install -y -c pytorch -c powerai pytorch=1.5 torchtext torchvision cudatoolkit=10.1

In [None]:
!pip install --upgrade pip
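# Pin the SageMaker Python SDK to v1.x: this notebook uses v1 APIs such as
# train_instance_type, RealTimePredictor and Model(image=...).
!pip install --upgrade "sagemaker<2" awscli boto3 pandas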

### Clone the TorchServe repository and install torch-model-archiver

You'll use `torch-model-archiver` to create a model archive file (.mar). The .mar archive bundles the model's serialized weights (its `state_dict`, a dictionary object that maps each layer to its parameter tensor) together with the model definition and the handler code.
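A .mar file produced by `torch-model-archiver` is a zip archive that includes a `MAR-INF/MANIFEST.json` entry describing the model, so its contents can be inspected with Python's `zipfile` module. The sketch below lists the entries and reads the manifest; the archive it builds is a hypothetical stand-in, not a real model.

```python
import json
import os
import tempfile
import zipfile


def list_mar_entries(mar_path: str) -> list:
    """Return the file names stored in a .mar (zip) archive."""
    with zipfile.ZipFile(mar_path) as archive:
        return sorted(archive.namelist())


def read_manifest(mar_path: str) -> dict:
    """Load MAR-INF/MANIFEST.json from the archive."""
    with zipfile.ZipFile(mar_path) as archive:
        return json.loads(archive.read("MAR-INF/MANIFEST.json"))


# Build a tiny stand-in archive (hypothetical manifest content) to exercise the helpers.
with tempfile.TemporaryDirectory() as tmp:
    mar = os.path.join(tmp, "demo.mar")
    with zipfile.ZipFile(mar, "w") as archive:
        archive.writestr("MAR-INF/MANIFEST.json", json.dumps({"model": {"modelName": "demo"}}))
        archive.writestr("demo.pth", b"")
    entries = list_mar_entries(mar)
    manifest = read_manifest(mar)
print(entries)
```

Once the archiver cell has run, the same helpers can be pointed at `{model_folder}/pgan-celebAHQ-512.mar`.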

### Set up the environment

### Create a boto3 session and specify a role with SageMaker access

In [None]:
import boto3
import sagemaker
from sagemaker.utils import name_from_base

role = sagemaker.get_execution_role()
account_id = role.split(':')[4]

sess = sagemaker.Session()
region = sess.boto_region_name
bucket = sess.default_bucket()

print(account_id)
print(region)
print(role)
print(bucket)
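The `role.split(':')[4]` expression works because IAM role ARNs have the form `arn:aws:iam::<account-id>:role/<name>`: the region field is empty for IAM, which leaves the account ID at index 4. A slightly more defensive version, shown with a placeholder ARN:

```python
def account_id_from_arn(arn: str) -> str:
    """Extract the account ID from an ARN such as arn:aws:iam::123456789012:role/X."""
    parts = arn.split(":")
    # Fields: arn, partition, service, region (empty for IAM), account-id, resource
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return parts[4]


example_role = "arn:aws:iam::123456789012:role/service-role/AmazonSageMaker-ExecutionRole"
print(account_id_from_arn(example_role))
```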

In [None]:
job_name = name_from_base('byos-pytorch-gan')
prefix = job_name + '/pgan'

input_shape = [1, 512]
data_shape = '{"input0":[1, 512]}'

model_prefix = 'pgan'
model_folder = f'./tmp/{model_prefix}'

ecr_repository_name = f"{model_prefix}".replace('_', '-')

print(ecr_repository_name)

## Load PGAN from PyTorch Hub

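We'll load the pretrained [Progressive GAN (PGAN)](https://arxiv.org/abs/1710.10196) generator from the facebookresearch/pytorch_GAN_zoo repository through PyTorch Hub and package it as a model artifact, `pgan-celebAHQ-512.tar.gz`: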

### Download a PyTorch model and create a TorchServe archive

In [None]:
import torch
use_gpu = torch.cuda.is_available()

# https://dl.fbaipublicfiles.com/gan_zoo/PGAN/celebaHQ_s6_i80000-6196db68.pth
# trained on the high-quality celebrity faces dataset "celebA-HQ"
# this model outputs 512 x 512 pixel images
pgan = torch.hub.load('facebookresearch/pytorch_GAN_zoo:hub',
                      'PGAN', model_name='celebAHQ-512',
                      pretrained=True, useGPU=use_gpu)
# this model outputs 256 x 256 pixel images
# pgan = torch.hub.load('facebookresearch/pytorch_GAN_zoo:hub',
#                       'PGAN', model_name='celebAHQ-256',
#                       pretrained=True, useGPU=use_gpu)

model = pgan.netG
print(pgan.getSize())
# print(model)

In [None]:
!mkdir -p {model_folder}

torch.save(model.state_dict(), f"{model_folder}/pgan-celebAHQ-512.pth")

In [None]:
import os

model_file = "model.py"
model_def_path = os.path.join("./pgan/", model_file)
if not os.path.isfile(model_def_path):
    raise RuntimeError("Missing the model.py file")

state_dict = torch.load(f"{model_folder}/pgan-celebAHQ-512.pth", map_location="cpu")

from pgan.progressive_conv_net import GNet
model = GNet(512, 512)
model.addScale(512)
model.addScale(512)
model.addScale(512)
model.addScale(256)
model.addScale(128)
model.addScale(64)
model.addScale(32)
model.load_state_dict(state_dict)

# print(model)
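In the progressive generator, scale 0 produces 4 x 4 images and each `addScale` call adds a stage that doubles the resolution. The seven `addScale` calls above therefore give 4 · 2^7 = 512 pixels per side, matching the celebAHQ-512 checkpoint. A quick check of that arithmetic, with the channel depths copied from the cell above:

```python
# Channel depth passed to GNet(512, 512) for scale 0, then one entry per addScale call.
scale_depths = [512, 512, 512, 512, 256, 128, 64, 32]


def output_resolution(num_scales: int, base: int = 4) -> int:
    """Side length after num_scales stages, doubling from a base x base start."""
    return base * 2 ** (num_scales - 1)


for i, depth in enumerate(scale_depths):
    side = output_resolution(i + 1)
    print(f"scale {i}: {depth:3d} channels, {side} x {side}")
```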

In [None]:
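model_state_dict = torch.load(f"{model_folder}/pgan-celebAHQ-512.pth", map_location="cpu")
model.load_state_dict(model_state_dict)

num_images = 8
noise, _ = pgan.buildNoiseData(num_images)

# The rebuilt GNet lives on the CPU, so move the noise there as well
# (buildNoiseData places it on the GPU when useGPU=True).
with torch.no_grad():
    generated_images = model(noise.cpu())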

generated_images.shape

In [None]:
# let's plot these images using torchvision and matplotlib
import matplotlib.pyplot as plt
import torchvision
grid = torchvision.utils.make_grid(generated_images.clamp(min=-1, max=1),
                                   scale_each=True, normalize=True)
plt.imshow(grid.permute(1, 2, 0).cpu().numpy())
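With `normalize=True` and `scale_each=True`, `make_grid` rescales each image independently into [0, 1] using that image's own minimum and maximum, after the generator output has been clamped to [-1, 1]. The pure-Python sketch below mirrors that per-image min-max step on a flat list of hypothetical pixel values; it illustrates the arithmetic and is not torchvision's implementation.

```python
def clamp(values, low=-1.0, high=1.0):
    """Limit every value to the [low, high] interval."""
    return [min(max(v, low), high) for v in values]


def min_max_normalize(values, eps=1e-5):
    """Map values to [0, 1] using the image's own min and max."""
    low, high = min(values), max(values)
    return [(v - low) / max(high - low, eps) for v in values]


pixels = [-1.7, -0.5, 0.0, 0.5, 1.2]  # hypothetical generator outputs
normalized = min_max_normalize(clamp(pixels))
print(normalized)
```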

In [None]:
!torch-model-archiver --model-name pgan-celebAHQ-512 --export-path {model_folder} \
            --version 1.0 --serialized-file {model_folder}/pgan-celebAHQ-512.pth \
            --handler pgan/handler.py \
            --force \
            --model-file pgan/model.py \
            --extra-files pgan/custom_layers.py,pgan/mini_batch_stddev_module.py,pgan/utils.py


In [None]:
import tarfile

with tarfile.open(f"{model_folder}/pgan-celebAHQ-512.tar.gz", 'w:gz') as f:
    f.add(f"{model_folder}/pgan-celebAHQ-512.mar", arcname="pgan-celebAHQ-512.mar")
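Before uploading, it can be worth confirming that the tarball holds the `.mar` file at its root (the `arcname` argument above drops the local folder path). A small stdlib check, demonstrated on an empty placeholder file rather than the real archive:

```python
import os
import tarfile
import tempfile


def tar_members(tar_path: str) -> list:
    """Return the member names of a .tar.gz archive."""
    with tarfile.open(tar_path, "r:gz") as archive:
        return archive.getnames()


with tempfile.TemporaryDirectory() as tmp:
    mar_path = os.path.join(tmp, "pgan-celebAHQ-512.mar")
    open(mar_path, "wb").close()  # empty placeholder, not a real model archive
    tar_path = os.path.join(tmp, "pgan-celebAHQ-512.tar.gz")
    with tarfile.open(tar_path, "w:gz") as f:
        f.add(mar_path, arcname="pgan-celebAHQ-512.mar")
    members = tar_members(tar_path)
print(members)
```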

### Upload the generated pgan-celebAHQ-512.mar archive file to Amazon S3
Amazon SageMaker expects model artifacts to be packaged as a tar.gz file, which is why the previous cell wrapped pgan-celebAHQ-512.mar in a compressed pgan-celebAHQ-512.tar.gz.
The cell below uploads that file to your default Amazon SageMaker S3 bucket under the models prefix.

In [None]:
s3_model_path = sess.upload_data(path=f"{model_folder}/pgan-celebAHQ-512.tar.gz",
                                 key_prefix=f"{prefix}/models")

### Create an Amazon ECR registry
Create a new Docker container registry (an Amazon ECR repository) for your TorchServe container images.
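Images pushed to Amazon ECR are addressed as `<account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`, the same format the `container_image_uri` cell assembles by hand. A small helper with placeholder values; it assumes the standard AWS partition (China Regions, for example, use a different domain suffix):

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Build an ECR image URI for the standard AWS partition."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


print(ecr_image_uri("123456789012", "us-west-2", "pytorch-torchserve"))
```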

In [None]:
!pygmentize -l docker "docker/Dockerfile.torchserve"

In [None]:
!pygmentize -l bash "docker/build_and_push.sh"

### Build a TorchServe Docker container and push it to Amazon ECR

In [None]:
# %%capture
%cd docker
!sh build_and_push.sh pytorch-torchserve Dockerfile.torchserve #$account_id $region $ecr_repository_name
%cd ..

In [None]:
container_image_uri = f'{account_id}.dkr.ecr.{region}.amazonaws.com/pytorch-torchserve:latest'
print(container_image_uri)

### Deploy endpoint and make prediction using Amazon SageMaker SDK

In [None]:
import time
from sagemaker.model import Model
from sagemaker.pytorch.model import PyTorchModel
from sagemaker.predictor import RealTimePredictor

sm_model_name = f'model-{model_prefix}-'.replace('_', '-') \
                    + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

# model = PyTorchModel(model_data=s3_model_path, 
#                      image=container_image_uri,
#                      role=role,
#                      predictor_cls=RealTimePredictor,
#                      name=sm_model_name,
#                      entry_point='script.py',
#                      framework_version="1.5",
#                      sagemaker_session=sess)


model = Model(model_data=s3_model_path, 
                 image=container_image_uri,
                 role=role,
                 predictor_cls=RealTimePredictor,
                 name=sm_model_name)

In [None]:
import time
from sagemaker.model_monitor import DataCaptureConfig

endpoint_name = f'endpoint-{model_prefix}-'.replace('_', '-') \
                    + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

predictor = model.deploy(instance_type='ml.c5.xlarge',
                            initial_instance_count=1,
                            endpoint_name=endpoint_name)

In [None]:
model.delete_model()

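### Test the TorchServe hosted model

#### Invoke the endpoint

Let's test the endpoint by sending it a batch of random latent (noise) vectors; the PGAN generator turns each vector into one image.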

In [None]:
%%time

import json
import numpy as np

num_images = 5
noise, _ = pgan.buildNoiseData(num_images)
x = noise.cpu().numpy()  # buildNoiseData may return a CUDA tensor
print(x.shape)
x = x.tobytes()
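`x.tobytes()` sends the raw float32 values without any shape information, so the code on the endpoint has to know the latent size (512 for this model) to rebuild the batch. The stdlib sketch below shows that round trip with the `array` module; the decode side is an assumption about what `pgan/handler.py` needs to do, not a copy of it.

```python
from array import array


def encode_latents(rows):
    """Flatten a batch of latent vectors into raw float32 bytes (native byte order)."""
    flat = array("f", [v for row in rows for v in row])
    return flat.tobytes()


def decode_latents(payload: bytes, latent_dim: int = 512):
    """Rebuild the batch: the row count is implied by the payload length."""
    flat = array("f")
    flat.frombytes(payload)
    if len(flat) % latent_dim:
        raise ValueError("payload size is not a multiple of the latent dimension")
    return [list(flat[i:i + latent_dim]) for i in range(0, len(flat), latent_dim)]


batch = [[0.5] * 512 for _ in range(5)]  # 5 hypothetical latent vectors
payload = encode_latents(batch)
print(len(payload))                  # 5 * 512 * 4 bytes
print(len(decode_latents(payload)))  # 5 rows
```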

In [None]:
response = predictor.predict(data=x)
print(response)

In [None]:
predictor.delete_endpoint(delete_endpoint_config=True)

In [None]:
num_images = 4
noise, _ = pgan.buildNoiseData(num_images)

In [None]:
print(noise.size())

In [None]:
with torch.no_grad():
    generated_images = pgan.test(noise)

# let's plot these images using torchvision and matplotlib
import matplotlib.pyplot as plt
import torchvision
grid = torchvision.utils.make_grid(generated_images.clamp(min=-1, max=1),
                                   scale_each=True, normalize=True)
plt.imshow(grid.permute(1, 2, 0).cpu().numpy())
# plt.show()

In [None]:
!python3 pytorch_GAN_zoo/train.py StyleGAN -c pytorch_GAN_zoo/config_celebaHQ.json --restart -n style_gan_celeba