# Distributed data parallel MaskRCNN training with TensorFlow 2 and SageMaker

# (WIP)

- Original source : https://github.com/aws/amazon-sagemaker-examples/blob/master/training/distributed_training/tensorflow/data_parallel/maskrcnn/tensorflow2_smdataparallel_maskrcnn_demo.ipynb


## STEP 0. Preparing dataset, pretrained model, and training script

In this workshop, all resources (dataset, pretrained model, and training script) are ready to be run.  

**Workshop participants**  
If you are joining the live workshop, you can skip this step and go straight to STEP 1.


**Self-configuration**  
_If you want to set up these resources yourself, uncomment (if needed) and run the `%%sh` cell below. **(It takes more than 30 minutes.)**_



In [34]:
%%sh
pip install pycocotools

## ========================================================= ##
# 1. Clone Mask-RCNN utility and training code
rm -rf DeepLearningExamples
git clone --recursive https://github.com/HerringForks/DeepLearningExamples.git

## ========================================================= ##
# 2. Download and preprocess the COCO dataset
CURRENT_PATH=$(pwd)
echo "CURRENT PATH="$CURRENT_PATH
DATA_TARGET_PATH="../coco_dataset"
SH_PATH=${CURRENT_PATH}"/DeepLearningExamples/TensorFlow2/Segmentation/MaskRCNN"

if [ -d ${DATA_TARGET_PATH} ]; then
    echo "dataset already exists"
else
    echo $DATA_TARGET_PATH
    ${SH_PATH}/dataset/download_and_preprocess_coco.sh ${DATA_TARGET_PATH}
    mkdir -p ${DATA_TARGET_PATH}/train
    mv ${DATA_TARGET_PATH}/*.tfrecord ${DATA_TARGET_PATH}/train/
fi

## ========================================================= ##
# 3. Download pretrained model weights
cd ${SH_PATH}/weights/
                                        
## Mask RCNN
BASE_URL="https://storage.googleapis.com/cloud-tpu-checkpoints/mask-rcnn/1555659850"
DEST_DIR="mask-rcnn/1555659850"

wget -N ${BASE_URL}/saved_model.pb -P ${DEST_DIR}
wget -N ${BASE_URL}/variables/variables.data-00000-of-00001 -P ${DEST_DIR}/variables
wget -N ${BASE_URL}/variables/variables.index -P ${DEST_DIR}/variables

## resnet-nhwc-2018-02-07 
BASE_URL="https://storage.googleapis.com/cloud-tpu-checkpoints/retinanet/resnet50-checkpoint-2018-02-07"
DEST_DIR="resnet/resnet-nhwc-2018-02-07"

wget -N ${BASE_URL}/checkpoint -P ${DEST_DIR}
wget -N ${BASE_URL}/model.ckpt-112603.data-00000-of-00001 -P ${DEST_DIR}
wget -N ${BASE_URL}/model.ckpt-112603.index  -P ${DEST_DIR}
wget -N ${BASE_URL}/model.ckpt-112603.meta -P ${DEST_DIR}

# VERIFY CHECKPOINTS
echo "Verifying and Processing Checkpoints..."

python ./pb_to_ckpt.py \
    --frozen_model_filename=mask-rcnn/1555659850/ \
    --output_filename=mask-rcnn/1555659850/ckpt/model.ckpt
python ./extract_RN50_weights.py \
    --checkpoint_dir=mask-rcnn/1555659850/ckpt/model.ckpt \
    --save_to=resnet/extracted_from_maskrcnn
echo "Generating list of tensors and their shape..."
python ./inspect_checkpoint.py --file_name=mask-rcnn/1555659850/ckpt/model.ckpt \
    > mask-rcnn/1555659850/tensors_and_shape.txt
python ./inspect_checkpoint.py --file_name=resnet/resnet-nhwc-2018-02-07/model.ckpt-112603 \
    > resnet/resnet-nhwc-2018-02-07/tensors_and_shape.txt
python ./inspect_checkpoint.py --file_name=resnet/extracted_from_maskrcnn/resnet50.ckpt \
    > resnet/extracted_from_maskrcnn/tensors_and_shape.txt
echo "Script Finished with Success"

# COPY PRETRAINED MODEL TO DATASET FOLDER
echo "Copying pretrained model to coco dataset folder..."
cd ${CURRENT_PATH}
cp -r ${SH_PATH}/weights/ ${DATA_TARGET_PATH}/model/

## ========================================================= ##
# 4. Upload your preprocessed dataset and pretrained model to S3

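# --output text returns the bare account ID; the default JSON output would keep the surrounding quotes
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)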
REGION=$(aws configure get region)

aws s3 sync ${DATA_TARGET_PATH}/train s3://sagemaker-${REGION}-${ACCOUNT_ID}/coco_dataset/train/ --quiet
aws s3 sync ${DATA_TARGET_PATH}/model s3://sagemaker-${REGION}-${ACCOUNT_ID}/coco_dataset/model/ --quiet

echo "Data was uploaded"

CURRENT PATH=/home/ec2-user/SageMaker/tf_sm_ddp_workshop
dataset already exists
Verifying and Processing Checkpoints...
Model saved to ckpt format
Total Vars Loaded: 265
Model save location: resnet/extracted_from_maskrcnn/resnet50.ckpt
Generating list of tensors and their shape...
Script Finished with Success
Coping pretrained model to coco dataset folder...
Data was uploaded


Cloning into 'DeepLearningExamples'...
--2021-06-22 14:19:58--  https://storage.googleapis.com/cloud-tpu-checkpoints/mask-rcnn/1555659850/saved_model.pb
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.73.240, 142.250.81.208, 172.217.2.112, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.73.240|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6080081 (5.8M) [application/octet-stream]
Saving to: ‘mask-rcnn/1555659850/saved_model.pb’


After the resources are downloaded, the files in your folder look like this:

```text
# training images
<PATH>/train/train-00000-of-00256.tfrecord
             train-00001-of-00256.tfrecord
             train-00002-of-00256.tfrecord 
             ...
# pretrained model
<PATH>/model/resnet/resnet-nhwc-2018-02-07/checkpoint
                                           model.ckpt-112603.data-000...
                                           model.ckpt-112603.index
                                           model.ckpt-112603.meta
```

Here, `<PATH>` is later synced to the `/opt/ml/input/data/train/` folder inside the SageMaker TensorFlow container.
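
Inside the container, a training script can look this folder up instead of hard-coding it. The following is a minimal sketch, assuming the SageMaker training toolkit exports the `SM_CHANNEL_TRAIN` environment variable for a channel named `train`. The helper names `resolve_train_dir` and `checkpoint_prefix` are hypothetical and are not part of the workshop code.

```python
import os

# Default location of the `train` channel inside a SageMaker training container.
DEFAULT_TRAIN_DIR = "/opt/ml/input/data/train"


def resolve_train_dir(environ=None):
    """Return the `train` channel directory, preferring SM_CHANNEL_TRAIN if set."""
    environ = os.environ if environ is None else environ
    return environ.get("SM_CHANNEL_TRAIN", DEFAULT_TRAIN_DIR)


def checkpoint_prefix(train_dir):
    """Build the pretrained ResNet checkpoint prefix under the channel directory."""
    relative = "model/resnet/resnet-nhwc-2018-02-07/model.ckpt-112603"
    return "/".join([train_dir.rstrip("/"), relative])
```

Outside a training container the environment variable is normally absent, so the helper falls back to the default path shown above.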



### An additional step: configure FSx for Lustre

- Create filesystem with VPC and import S3 dataset

## <--- FSX Lustre guide (TBU) 

<br>
<br>

### Prepare SageMaker Training Images

1. By default, SageMaker uses the latest [Amazon Deep Learning Container Images (DLC)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) TensorFlow training image. In this step, we use it as a base image and install the additional dependencies required to train the MaskRCNN model.
2. The GitHub repository https://github.com/HerringForks/DeepLearningExamples.git provides an `smdistributed.dataparallel` TensorFlow MaskRCNN training script. We install it on the training image.

**Build and Push Docker Image to ECR**

If you are configuring the environment manually, uncomment and run the command below to build the Docker image and push it to ECR. **(It takes 7 to 8 minutes.)**


In [35]:
# %%sh

# REGION=$(aws configure get region)
# IMAGE_NAME="tf2-mask-rcnn-smd-dataparallel-sagemaker"  
# TAG="latest"  

# chmod +x build_and_push.sh; bash build_and_push.sh $REGION $IMAGE_NAME $TAG


Login Succeeded
Sending build context to Docker daemon  1.601GB
Step 1/5 : ARG region
Step 2/5 : FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.4.1-gpu-py37-cu110-ubuntu18.04
2.4.1-gpu-py37-cu110-ubuntu18.04: Pulling from tensorflow-training
171857c49d0f: Pulling fs layer
419640447d26: Pulling fs layer
61e52f862619: Pulling fs layer
2a93278deddf: Pulling fs layer
c9f080049843: Pulling fs layer
8189556b2329: Pulling fs layer
48201db7094c: Pulling fs layer
52c012740c87: Pulling fs layer
4ea8c97a046a: Pulling fs layer
cf567978f146: Pulling fs layer
6debf6d5a580: Pulling fs layer
31a291f7c7a3: Pulling fs layer
79b923967117: Pulling fs layer
87a6c868b245: Pulling fs layer
51c2b0566974: Pulling fs layer
2a93278deddf: Waiting
0670b55c8fe8: Pulling fs layer
c9f080049843: Waiting
d8db51f37e08: Pulling fs layer
4ea8c97a046a: Waiting
8189556b2329: Waiting
54b7c14aac6f: Pulling fs layer
8a648cfd9dd8: Pulling fs layer
48201db7094c: Waiting
cf567978f146: Waiting
9ecba1440a31




To view the Dockerfile and the build-and-push shell script, uncomment and run the commands below.

In [None]:
# !pygmentize ./Dockerfile

In [None]:
# !pygmentize ./build_and_push.sh

---

## STEP 1. Amazon SageMaker Estimator using data parallel



### Amazon SageMaker Initialization

In [36]:
%%time
!python3 -m pip install --upgrade sagemaker
!pip install pycocotools
import sagemaker
from sagemaker import get_execution_role
import boto3

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

role = (
    get_execution_role()
)  # provide a pre-existing role ARN as an alternative to creating a new role
print(f"SageMaker Execution Role:{role}")

client = boto3.client("sts")
account = client.get_caller_identity()["Account"]
print(f"AWS account:{account}")

session = boto3.session.Session()
region = session.region_name
print(f"AWS region:{region}")

SageMaker Execution Role:arn:aws:iam::308961792850:role/service-role/AmazonSageMaker-ExecutionRole-20200220T105738
AWS account:308961792850
AWS region:us-east-1
CPU times: user 234 ms, sys: 0 ns, total: 234 ms
Wall time: 5.77 s


To verify that the role above has required permissions:

1. Go to the IAM console: https://console.aws.amazon.com/iam/home.
2. Select **Roles**.
3. Enter the role name in the search box to search for that role. 
4. Select the role.
5. Use the **Permissions** tab to verify this role has required permissions attached.
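
The same check can be sketched in code. This is a rough example, assuming the notebook's credentials are allowed to call `iam:ListAttachedRolePolicies`. The helper names are hypothetical. The sketch reads only the first page of attached managed policies and does not cover inline policies.

```python
def role_name_from_arn(role_arn):
    """Extract the bare role name (the last path segment) from an IAM role ARN."""
    return role_arn.split("/")[-1]


def list_attached_policies(role_arn):
    """Return the names of managed policies attached to the role.

    Needs AWS credentials; boto3 is imported lazily so that the ARN
    parsing helper above works even where boto3 is not installed.
    """
    import boto3

    iam = boto3.client("iam")
    response = iam.list_attached_role_policies(RoleName=role_name_from_arn(role_arn))
    return [policy["PolicyName"] for policy in response["AttachedPolicies"]]
```

In the notebook, `list_attached_policies(role)` could be called with the `role` value printed above.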

### SageMaker TensorFlow Estimator function options

In the code cells below, you can change the estimator's instance type, instance count, and distribution strategy. The estimator also takes the training script as its entry point.

**Instance types**

`smdistributed.dataparallel` supports model training on SageMaker with the following instance types only. For best performance, it is recommended you use an instance type that supports Amazon Elastic Fabric Adapter (ml.p3dn.24xlarge and ml.p4d.24xlarge).

1. ml.p3.16xlarge
1. ml.p3dn.24xlarge [Recommended]
1. ml.p4d.24xlarge [Recommended]

**Instance count**

To get the best performance and the most out of `smdistributed.dataparallel`, you should use at least 2 instances, but you can also use 1 for testing this example.
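
To see how the instance type and instance count combine, the sketch below computes the total number of GPU workers. It assumes one worker process per GPU and eight GPUs on each of the three instance types listed above. The names `GPUS_PER_INSTANCE` and `total_workers` are hypothetical.

```python
# GPUs per instance for the supported instance types (8 on each of them).
GPUS_PER_INSTANCE = {
    "ml.p3.16xlarge": 8,
    "ml.p3dn.24xlarge": 8,
    "ml.p4d.24xlarge": 8,
}


def total_workers(instance_type, instance_count):
    """Return the number of GPU workers, assuming one worker per GPU."""
    if instance_type not in GPUS_PER_INSTANCE:
        raise ValueError(f"unsupported instance type: {instance_type}")
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return GPUS_PER_INSTANCE[instance_type] * instance_count
```

With two `ml.p3.16xlarge` instances, as configured later in this notebook, this gives 16 workers.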

**Distribution strategy**

To use distributed data parallel (DDP) mode, set the estimator's `distribution` argument so that it enables `smdistributed.dataparallel`.

### Training script

The GitHub repository https://github.com/HerringForks/DeepLearningExamples.git provides a reference `smdistributed.dataparallel` TensorFlow MaskRCNN training script. (The repository is already cloned for the workshop; otherwise, you cloned it in STEP 0.)

In [37]:
from sagemaker.tensorflow import TensorFlow

In [40]:
image = "tf2-mask-rcnn-smd-dataparallel-sagemaker"  
tag = "latest"
instance_type = "ml.p3.16xlarge"   # Other supported instance types: ml.p3dn.24xlarge, ml.p4d.24xlarge
instance_count = 2                 # You can use 2, 4, 8 etc.
docker_image = f"{account}.dkr.ecr.{region}.amazonaws.com/{image}:{tag}"  # YOUR_ECR_IMAGE_BUILT_WITH_ABOVE_DOCKER_FILE
subnets = ["subnet-e7cf5bba"]      # Should be same as Subnet used for FSx. Example: subnet-0f9XXXX
security_group_ids = [
    "sg-a2df64d6"
]                                  # Should be same as Security group used for FSx. sg-03ZZZZZZ

file_system_id = "fs-095c0f2cb65a79424"  # FSx file system ID with your training dataset. Example: 'fs-0bYYYYYY'

In [41]:
SM_DATA_ROOT = "/opt/ml/input/data/train"

hyperparameters = {
    "mode": "train",
    "checkpoint": "/".join([SM_DATA_ROOT, "model/resnet/resnet-nhwc-2018-02-07/model.ckpt-112603"]),
    "eval_samples": 5000,
    "init_learning_rate": 0.04,
    "learning_rate_steps": "3750,5000",
    "model_dir": "/opt/ml/code/checkpoints/tensorflow_mask_rcnn",
    "num_steps_per_eval": 462,
    "total_steps": 500,
    "train_batch_size": 4,
    "eval_batch_size": 8,
    "training_file_pattern": "/".join([SM_DATA_ROOT, "train"]),
    "validation_file_pattern": "/".join([SM_DATA_ROOT, "val"]),
    "val_json_file": "/".join([SM_DATA_ROOT, "annotations/instances_val2017.json"]),
    "amp": "",
    "use_batched_nms": "",
    "xla": "",
    "nouse_custom_box_proposals_op": "",
    "seed": 987,
}
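
Note that `total_steps` (500) is far smaller than both values in `learning_rate_steps` ("3750,5000"). The sketch below checks which decay boundaries a run would actually reach. It assumes the values are global step indices at which the learning rate is reduced, which is not verified against the training script here. The function names are hypothetical.

```python
def parse_steps(learning_rate_steps):
    """Parse a comma-separated string such as "3750,5000" into a list of ints."""
    return [int(step) for step in learning_rate_steps.split(",") if step.strip()]


def reachable_decay_steps(learning_rate_steps, total_steps):
    """Return the decay boundaries that fall before the end of training."""
    return [step for step in parse_steps(learning_rate_steps) if step < total_steps]
```

With the workshop settings the result is an empty list, so under this assumption the short run would finish before any decay boundary. A longer run would need `total_steps` raised past the boundaries.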

### Set SageMaker Debugger Profiler

You can monitor GPU resource consumption with SageMaker Debugger. Configure a `ProfilerConfig` to sample system metrics on the training instances every 500 milliseconds.


In [57]:
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile()
)

In [58]:
estimator = TensorFlow(
    entry_point="DeepLearningExamples/TensorFlow2/Segmentation/MaskRCNN/mask_rcnn_sm.py",
    role=role,
    image_uri=docker_image,
    source_dir=".",
    framework_version="2.4.1",
    py_version="py37",
    instance_count=instance_count,
    instance_type=instance_type,
    sagemaker_session=sagemaker_session,
    subnets=subnets,
    hyperparameters=hyperparameters,
    security_group_ids=security_group_ids,
    debugger_hook_config=False,
    # Training using smdistributed.dataparallel Distributed Training Framework
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    profiler_config=profiler_config,
)

### Run training job

Configure the FSx input dataset and submit the training job. **(It takes 20 to 22 minutes.)**

In [59]:
# Configure FSx Input for your SageMaker Training job

from sagemaker.inputs import FileSystemInput

file_system_directory_path = "/l4po7bmv/coco"   # NOTE: the path starts at the file system's mount name ('/l4po7bmv' here; '/fsx' in older examples such as '/fsx/mask_rcnn/PyTorch')
file_system_access_mode = "ro"
file_system_type = "FSxLustre"
train_fs = FileSystemInput(
    file_system_id=file_system_id,
    file_system_type=file_system_type,
    directory_path=file_system_directory_path,
    file_system_access_mode=file_system_access_mode,
)
data_channels = {"train": train_fs}
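
The directory path above can also be assembled from its parts. This is a small sketch, assuming `l4po7bmv` is the mount name of the FSx for Lustre file system and `coco` is the folder imported from S3. The helper `fsx_directory_path` is hypothetical.

```python
import posixpath


def fsx_directory_path(mount_name, sub_path=""):
    """Build '/<mount_name>/<sub_path>' for use as FileSystemInput.directory_path."""
    mount_name = mount_name.strip("/")
    if not mount_name:
        raise ValueError("mount_name must not be empty")
    # Strip stray slashes so the pieces join cleanly, then drop any trailing slash.
    return posixpath.join("/", mount_name, sub_path.strip("/")).rstrip("/")
```

For example, `fsx_directory_path("l4po7bmv", "coco")` reproduces the value used in the cell above.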

In [None]:
%%time
# Submit SageMaker training job
estimator.fit(inputs=data_channels)

---
## STEP 2. Training script review

## <-- Where to change the code (TBU)

---
## STEP 3. Monitor training instances with SageMaker Debugger profiler

In [46]:
training_job_name = estimator.latest_training_job.name
print(f"Training jobname: {training_job_name}")
print(f"Region: {region}")

Training jobname: tf2-mask-rcnn-smd-dataparallel-sagemake-2021-06-22-14-30-16-185
Region: us-east-1


In [47]:
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob

tj = TrainingJob(training_job_name, region)
tj.wait_for_sys_profiling_data_to_be_available()




[2021-06-22 15:00:22.097 ip-172-16-47-206:2668 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None


ProfilerConfig:{'S3OutputPath': 's3://sagemaker-us-east-1-308961792850/', 'ProfilingIntervalInMilliseconds': 500}
s3 path:s3://sagemaker-us-east-1-308961792850/tf2-mask-rcnn-smd-dataparallel-sagemake-2021-06-22-14-30-16-185/profiler-output


Profiler data from system is available


In [50]:
rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output"
print(f"You will find the profiler report in {rule_output_path}")

You will find the profiler report in s3://sagemaker-us-east-1-308961792850/tf2-mask-rcnn-smd-dataparallel-sagemake-2021-06-22-14-30-16-185/rule-output


In [56]:
!aws s3 sync {rule_output_path}/ ./ --quiet
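
The plain string concatenation used for `rule_output_path` relies on `estimator.output_path` ending with a slash. A slightly more defensive sketch is shown below. The helper name `rule_output_uri` is hypothetical.

```python
def rule_output_uri(output_path, job_name):
    """Join the S3 output path, job name, and 'rule-output' with single slashes."""
    return "/".join([output_path.rstrip("/"), job_name.strip("/"), "rule-output"])
```

It returns the same URI whether or not the output path has a trailing slash.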

### Explore the report from Debugger Profiler

Open the downloaded report in the `ProfilerReport-0000000000` folder.

### <- TBU

---
# Additional Resources

If you are a new user of Amazon SageMaker, you may find the following helpful to understand how SageMaker uses Docker to train custom models.
* To learn more about using Amazon SageMaker with your own training image, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).

* To learn more about using Docker to train your own models with Amazon SageMaker, see [Example Notebooks: Use Your Own Algorithm or Model](https://docs.aws.amazon.com/sagemaker/latest/dg/adv-bring-own-examples.html).
* To see other examples of distributed training using Amazon SageMaker and TensorFlow, see [Distributed TensorFlow training using Amazon SageMaker](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/distributed_tensorflow_mask_rcnn).