# PyTorch distributed training on SageMaker

## 1 Overview
This chapter trains a model with SageMaker, using the MNIST dataset.

## 2 Environment
Select the pytorch_latest_p36 kernel.  
This notebook was tested with boto3 1.17.109 and sagemaker 2.48.1.

In [None]:
import boto3,sagemaker
print(boto3.__version__)
print(sagemaker.__version__)

If the installed versions are older, run the following commands, restart the kernel, and check the versions again.

In [None]:
!pip install -U boto3 -i https://opentuna.cn/pypi/web/simple/

In [None]:
!pip install -U sagemaker -i https://opentuna.cn/pypi/web/simple/
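
After restarting the kernel, the version comparison can be done in code rather than by eye. The following is a minimal sketch: the helper names `version_tuple` and `is_at_least` are hypothetical, and the minimum versions are simply the tested versions quoted above. It compares only the leading numeric part of each dotted component, so pre-release suffixes are ignored.

```python
import re

def version_tuple(version):
    """Convert a dotted version string such as '1.17.109' into a tuple of ints.

    Only the leading digits of each component are used (assumption: suffixes
    such as 'rc1' can be ignored for this rough check).
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)

def is_at_least(installed, minimum):
    """Return True if the installed version is not older than the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)

# Versions this notebook was tested with.
TESTED = {"boto3": "1.17.109", "sagemaker": "2.48.1"}

print(is_at_least("1.17.109", TESTED["boto3"]))    # True
print(is_at_least("2.40.0", TESTED["sagemaker"]))  # False
```

In the notebook, `boto3.__version__` and `sagemaker.__version__` would be passed as the `installed` argument.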

## 3 Set and retrieve parameters

In [None]:
# Change this to your own bucket
bucket='junzhong'
prefix = 'sagemaker/DEMO-pytorch-mnist'

In [None]:
import boto3
import sagemaker

sagemaker_session = sagemaker.Session()
iam = boto3.client('iam')
roles = iam.list_roles(PathPrefix='/service-role')
role=""
for current_role in roles["Roles"]:
    if current_role["RoleName"].startswith("AmazonSageMaker-ExecutionRole-"):
        role=current_role["Arn"]
        break
# An empty role indicates a problem: first open https://cn-northwest-1.console.amazonaws.cn/sagemaker/home?region=cn-northwest-1#/notebook-instances/create to create an IAM role
print(role)
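
The role lookup can also be written as a small function that fails loudly instead of printing an empty string. This is only a sketch: `find_execution_role` is a hypothetical helper, and the sample list is fabricated data shaped like the `"Roles"` entry of the `iam.list_roles` response, so it runs without AWS credentials.

```python
def find_execution_role(roles, name_prefix="AmazonSageMaker-ExecutionRole-"):
    """Return the ARN of the first role whose name starts with name_prefix, or None."""
    for candidate in roles:
        if candidate["RoleName"].startswith(name_prefix):
            return candidate["Arn"]
    return None

# Fabricated sample data (not real roles), shaped like roles["Roles"].
sample_roles = [
    {"RoleName": "SomeOtherRole",
     "Arn": "arn:aws-cn:iam::123456789012:role/service-role/SomeOtherRole"},
    {"RoleName": "AmazonSageMaker-ExecutionRole-20210101T000000",
     "Arn": "arn:aws-cn:iam::123456789012:role/service-role/"
            "AmazonSageMaker-ExecutionRole-20210101T000000"},
]

example_role = find_execution_role(sample_roles)
if example_role is None:
    raise RuntimeError("No SageMaker execution role found; create one first")
print(example_role)
```

In the notebook, `roles["Roles"]` would be passed instead of `sample_roles`. When running inside a SageMaker notebook instance, `sagemaker.get_execution_role()` is another way to obtain the role ARN.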

## 4 Data
### 4.1 Download the data

In [None]:
from torchvision import datasets, transforms

datasets.MNIST('data', download=True, transform=transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
]))
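
For reference, the two transforms amount to simple arithmetic: `ToTensor` scales 8-bit pixel values to the range [0, 1], and `Normalize` then computes `(x - mean) / std` with the mean 0.1307 and standard deviation 0.3081 given above. The NumPy sketch below reproduces only that arithmetic. `to_tensor_and_normalize` is a hypothetical helper, and it skips the channel dimension that `ToTensor` also adds.

```python
import numpy as np

MNIST_MEAN = 0.1307  # mean passed to transforms.Normalize above
MNIST_STD = 0.3081   # standard deviation passed to transforms.Normalize above

def to_tensor_and_normalize(image_uint8, mean=MNIST_MEAN, std=MNIST_STD):
    """Scale uint8 pixels to [0, 1], then standardize with the given mean and std."""
    scaled = image_uint8.astype(np.float32) / 255.0
    return (scaled - mean) / std

# A black pixel (0) and a white pixel (255).
pixels = np.array([[0, 255]], dtype=np.uint8)
print(to_tensor_and_normalize(pixels))
```

A black pixel maps to a negative value and a white pixel to a positive one, because the mean lies between 0 and 1.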

### 4.2 Upload the data to S3

In [None]:
!aws s3 sync data/ s3://$bucket/$prefix/

In [None]:
inputs='s3://{}/{}/'.format(bucket, prefix)
inputs

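## 5 Training

When training with SageMaker, the following environment variables are already set, so the training script does not need to set them: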
```
NCCL_SOCKET_IFNAME=eth0
MASTER_ADDR=algo-1
MASTER_PORT=7777
```
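
The entry-point script `mnist.py` is not shown in this notebook, so the following is only a sketch of how such a script might derive its rank and world size. It assumes one process per host and relies on the `SM_HOSTS` (a JSON list of host names) and `SM_CURRENT_HOST` variables that the SageMaker training container provides. `resolve_cluster` is a hypothetical helper, and the fallback values for `MASTER_ADDR` and `MASTER_PORT` just mirror the values listed above.

```python
import json
import os

def resolve_cluster(env=os.environ):
    """Derive distributed-training settings from SageMaker environment variables.

    Assumption: one training process per host, so the rank is the position
    of the current host in the SM_HOSTS list.
    """
    hosts = json.loads(env["SM_HOSTS"])
    current_host = env["SM_CURRENT_HOST"]
    return {
        "world_size": len(hosts),
        "rank": hosts.index(current_host),
        "master_addr": env.get("MASTER_ADDR", "algo-1"),
        "master_port": env.get("MASTER_PORT", "7777"),
    }

# Fabricated environment for a two-instance job, as seen from the second host.
fake_env = {"SM_HOSTS": '["algo-1", "algo-2"]', "SM_CURRENT_HOST": "algo-2"}
print(resolve_cluster(fake_env))
```

The resulting `rank` and `world_size` are the values a script would typically pass to `torch.distributed.init_process_group` together with the chosen backend.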

In [None]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point='mnist.py',
                    role=role,
                    framework_version='1.6.0',
                    py_version='py3',
                    instance_count=2,
                    instance_type='ml.p3.8xlarge',  # CPU: ml.m5.xlarge, GPU: ml.p3.8xlarge
                    hyperparameters={
                        'epochs': 10,
                        'backend': 'nccl',  # use gloo on CPU, nccl on GPU
                    })

In [None]:
estimator.fit({'training': inputs})
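
In script mode, the entries of `hyperparameters` reach the entry point as command-line arguments (`--epochs 10 --backend nccl`), and the `training` channel passed to `fit` is exposed through the `SM_CHANNEL_TRAINING` environment variable. Since `mnist.py` is not reproduced here, the parser below is a hypothetical sketch of how the script could read them. The local fallback paths are assumptions for running outside SageMaker.

```python
import argparse
import os

def parse_args(argv=None, env=os.environ):
    """Parse hyperparameters and data locations the way an entry point might."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--backend", type=str, default="gloo")
    # SageMaker sets SM_CHANNEL_TRAINING for the 'training' channel;
    # 'data' is an assumed fallback for local runs.
    parser.add_argument("--data-dir", type=str,
                        default=env.get("SM_CHANNEL_TRAINING", "data"))
    # SM_MODEL_DIR is where the container expects the trained model to be saved.
    parser.add_argument("--model-dir", type=str,
                        default=env.get("SM_MODEL_DIR", "model"))
    # parse_known_args tolerates extra arguments the script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args

args = parse_args(["--epochs", "10", "--backend", "nccl"],
                  env={"SM_CHANNEL_TRAINING": "/opt/ml/input/data/training"})
print(args.epochs, args.backend, args.data_dir)  # 10 nccl /opt/ml/input/data/training
```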

## 6 Deploy the inference endpoint
This step takes about 7 minutes.

In [None]:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m5.large',endpoint_name="mnist")

## 7 Evaluation

In [None]:
import gzip 
import numpy as np
import random
import os

data_dir = 'data/MNIST/raw'
with gzip.open(os.path.join(data_dir, "t10k-images-idx3-ubyte.gz"), "rb") as f:
    images = np.frombuffer(f.read(), np.uint8, offset=16).reshape(-1, 28, 28).astype(np.float32)
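
The `offset=16` above skips the header of the IDX3 image file, which consists of four big-endian 32-bit integers: a magic number (2051 for image files), the image count, the row count, and the column count. The sketch below reads that header explicitly. `parse_idx3_header` is a hypothetical helper, and the header bytes are built in memory so the example runs without the dataset.

```python
import struct

IDX3_IMAGE_MAGIC = 2051  # 0x00000803, identifies an IDX file of unsigned-byte images

def parse_idx3_header(raw):
    """Return (count, rows, cols) from the first 16 bytes of an IDX3 image file."""
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != IDX3_IMAGE_MAGIC:
        raise ValueError("unexpected magic number: {}".format(magic))
    return count, rows, cols

# Header constructed in memory with the dimensions of the MNIST test set.
header = struct.pack(">IIII", IDX3_IMAGE_MAGIC, 10000, 28, 28)
print(parse_idx3_header(header))  # (10000, 28, 28)
```

Checking the header before reshaping guards against reading the wrong file, such as the labels file, whose header is only 8 bytes long.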

In [None]:
image_size = 3
mask1 = random.sample(range(len(images)), image_size) # randomly select some of the test images
mask2 = np.array(mask1, dtype=int)  # np.int was removed in NumPy 1.24; the built-in int is equivalent
data = images[mask2]

In [None]:
from matplotlib import pyplot as plt
plt.figure(figsize=(2,2))
for index, mask in enumerate(mask1):
    plt.subplot(1,image_size,index+1)
    plt.axis('off')
    plt.imshow(images[mask])

If the kernel was restarted, re-create the predictor with the cell below; if it was not restarted, skip this cell.

In [None]:
'''
from sagemaker.pytorch.model import PyTorchPredictor
endpoint_name = "mnist"
predictor = PyTorchPredictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker.Session())
'''

In [None]:
response = predictor.predict(np.expand_dims(data, axis=1))
print("Raw prediction result:")
print(response)
for i in range(0,image_size):
    print("Most likely answer: {}".format(np.argmax(response[i])))
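
Printing the most likely digit does not show whether the prediction is correct. One way to check, sketched below, is to compare the argmax of each row with the true labels. `accuracy` is a hypothetical helper, and the scores and labels here are made-up values rather than endpoint output.

```python
import numpy as np

def accuracy(scores, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predicted = np.argmax(scores, axis=1)
    return float(np.mean(predicted == np.asarray(labels)))

# Made-up scores for three samples over three classes, with made-up labels.
example_scores = np.array([[0.1, 0.9, 0.0],
                           [0.8, 0.1, 0.1],
                           [0.2, 0.3, 0.5]])
example_labels = [1, 0, 1]
print(accuracy(example_scores, example_labels))
```

In the notebook, the labels could be read from `t10k-labels-idx1-ubyte.gz` in the same directory with `np.frombuffer(..., np.uint8, offset=8)`, indexed with `mask2`, and compared against `response`.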

## 8 Cleanup

In [None]:
predictor.delete_endpoint()