# Using FSx for Lustre as the training data input for SageMaker

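## 1 Overview
This notebook shows how to feed S3 data to SageMaker training through FSx for Lustre, which avoids the long download times incurred when training data is pulled directly from S3.  
Note: this feature is not yet available in the AWS China regions.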

## 2 Environment
For the kernel, either tensorflow2_p36 or pytorch_p36 works.  
This notebook was tested with boto3 1.17.99 and sagemaker 2.45.0.  

In [None]:
import boto3,sagemaker
print(boto3.__version__)
print(sagemaker.__version__)

## 3 Configure FSx
Follow https://docs.aws.amazon.com/zh_cn/fsx/latest/LustreGuide/create-fs-linked-data-repo.html to configure FSx and link your file system to an S3 bucket.  
When configuring the S3 data import, do not enter a prefix.

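## 4 Create an S3 endpoint in the VPC
Open the VPC web console, click `Endpoints` in the left navigation pane, then click `Create endpoint`. Enter `S3` in the service name search box, select the result whose type is `Gateway`, tick the main route table entry in the route table configuration, and click `Create endpoint`.  
Skipping this step makes the training job fail with: `Failed. Reason: InternalServerError: We encountered an internal error. Please try again.`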
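The console steps above can also be scripted. The following is a minimal sketch, assuming boto3 credentials with EC2 permissions; the helper names `s3_gateway_endpoint_params` and `create_s3_gateway_endpoint` are hypothetical, and the VPC ID and route table IDs must be supplied by you.

```python
def s3_gateway_endpoint_params(region, vpc_id, route_table_ids):
    """Build the keyword arguments for the EC2 create_vpc_endpoint call."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        # Regional S3 service name, e.g. com.amazonaws.us-east-1.s3
        "ServiceName": "com.amazonaws.{}.s3".format(region),
        "RouteTableIds": list(route_table_ids),
    }


def create_s3_gateway_endpoint(region, vpc_id, route_table_ids):
    """Create the gateway endpoint and return its ID (requires AWS access)."""
    import boto3  # imported lazily so the pure helper above works offline

    ec2 = boto3.client("ec2", region_name=region)
    params = s3_gateway_endpoint_params(region, vpc_id, route_table_ids)
    response = ec2.create_vpc_endpoint(**params)
    return response["VpcEndpoint"]["VpcEndpointId"]
```

Pass the ID of the main route table of the VPC that hosts the FSx file system, mirroring the route table ticked in the console.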

## 5 Get/set the relevant parameters

In [None]:
import boto3
import sagemaker

sagemaker_session = sagemaker.Session()

# Look up the first SageMaker execution role under the /service-role path.
# Only the first page of list_roles results is searched.
iam = boto3.client('iam')
roles = iam.list_roles(PathPrefix='/service-role')
role = ""
for current_role in roles["Roles"]:
    if current_role["RoleName"].startswith("AmazonSageMaker-ExecutionRole-"):
        role = current_role["Arn"]
        break
print(role)

Notes:
- 1. The SageMaker role must have permission to use FSx.
- 2. Make sure the FSx security group allows access from SageMaker.
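The second note can be partly checked in code. The sketch below assumes the port requirements AWS documents for FSx for Lustre (TCP 988 and 1018-1023) and the `IpPermissions` shape returned by EC2 `describe_security_groups`. The helper names are hypothetical. It is deliberately simplified: it only checks ports, not the traffic source, and it expects each port range to be covered by a single rule.

```python
# Ports used by Lustre client traffic (assumption taken from the AWS FSx docs).
LUSTRE_PORT_RANGES = [(988, 988), (1018, 1023)]


def allows_lustre_traffic(ip_permissions):
    """Return True if the inbound rules cover every Lustre port range over TCP."""

    def covered(low, high):
        for rule in ip_permissions:
            protocol = rule.get("IpProtocol")
            if protocol == "-1":  # "-1" means all protocols and all ports
                return True
            if protocol != "tcp":
                continue
            if rule.get("FromPort", 0) <= low and high <= rule.get("ToPort", -1):
                return True
        return False

    return all(covered(low, high) for low, high in LUSTRE_PORT_RANGES)


def check_security_group(group_id):
    """Fetch a group's inbound rules with boto3 and run the check above."""
    import boto3  # imported lazily so the pure helper works without AWS access

    ec2 = boto3.client("ec2")
    group = ec2.describe_security_groups(GroupIds=[group_id])["SecurityGroups"][0]
    return allows_lustre_traffic(group["IpPermissions"])
```

A `False` result from `check_security_group` suggests reviewing the inbound rules of the security group before starting a training job.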

In [None]:
subnets = ["subnet-0eecdb20"]  # Should be same as Subnet used for FSx. Example: subnet-0f9XXXX
security_group_ids = ["sg-6478f13a"]  # Should be same as Security group used for FSx. sg-03ZZZZZZ
file_system_id = "fs-011671baa391568ab"  # FSx file system ID with your training dataset. Example: 'fs-0bYYYYYY'
mount_name = "cm26jbmv"  # Mount name shown on the FSx console page
s3_prefix = "test"  # S3 prefix/directory

In [None]:
from sagemaker.inputs import FileSystemInput
file_system_directory_path = "/{}/{}".format(mount_name,s3_prefix)
file_system_access_mode = "ro"  # read-only
file_system_type = "FSxLustre"
train_fs = FileSystemInput(
    file_system_id=file_system_id,
    file_system_type=file_system_type,
    directory_path=file_system_directory_path,
    file_system_access_mode=file_system_access_mode,
)
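The directory path passed to `FileSystemInput` is the FSx mount name followed by the S3 prefix. As a small sketch, the hypothetical helper `fsx_directory_path` below builds that path while tolerating stray slashes or an empty prefix; it is an illustration, not part of the SageMaker SDK.

```python
def fsx_directory_path(mount_name, s3_prefix=""):
    """Join the FSx mount name and an optional S3 prefix into an absolute path."""
    parts = [mount_name.strip("/")]
    # Drop empty segments produced by leading, trailing, or doubled slashes.
    parts += [segment for segment in s3_prefix.split("/") if segment]
    return "/" + "/".join(parts)


print(fsx_directory_path("cm26jbmv", "test"))  # prints /cm26jbmv/test
```

With an empty prefix the helper returns just the mount root, which exposes the whole linked bucket to the training job.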

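## 6 Training
This notebook only lists the first 100 files in the training directory and does not actually train a model; the goal is to demonstrate reading S3 data through FSx.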
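The entry point `ListFile.py` used by both estimators is not shown in this notebook. The following is a minimal sketch of what such a script might look like, assuming SageMaker script mode passes the `path` hyperparameter as a `--path` command-line argument; the function name `first_files` is hypothetical.

```python
import argparse
import itertools
import os


def first_files(path, limit=100):
    """Return up to `limit` file paths found under `path`, walking lazily."""
    walker = (
        os.path.join(root, name)
        for root, _dirs, files in os.walk(path)
        for name in files
    )
    return list(itertools.islice(walker, limit))


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--path", default="/opt/ml/input/data/training")
    # parse_known_args ignores extra arguments the framework container may pass.
    args, _unknown = parser.parse_known_args(argv)
    for file_path in first_files(args.path):
        print(file_path)


if __name__ == "__main__":
    main()
```

Walking lazily and stopping after 100 entries keeps the listing fast even when the linked bucket holds a very large number of objects.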

### 6.1 TensorFlow

In [None]:
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    base_job_name="big-data-input",
    entry_point="ListFile.py",
    role=role,
    py_version="py37",
    framework_version="2.4.1",
    instance_count=1,
    instance_type="ml.m5.large",
    sagemaker_session=sagemaker_session,
    hyperparameters={"path":"/opt/ml/input/data/training"},
    subnets=subnets,
    security_group_ids=security_group_ids,
)
estimator.fit(train_fs)

### 6.2 PyTorch

In [None]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    base_job_name="big-data-input",
    entry_point="ListFile.py",
    role=role,
    py_version="py36",
    framework_version="1.6.0",
    instance_count=1,
    instance_type="ml.m5.large",
    sagemaker_session=sagemaker_session,
    hyperparameters={"path":"/opt/ml/input/data/training"},
    subnets=subnets,
    security_group_ids=security_group_ids,
)
estimator.fit(train_fs)