## SageMaker YOLO

This notebook shows how to launch YOLO training on SageMaker, including multi-node training, EFA, and SageMaker Debugger.

### Prerequisites

Before running this notebook, make sure to follow the SageMaker instructions in the repo README in order to build your Docker image for training. You should upload this image to your ECR repo.

Second, you need to have the COCO data and labels stored on S3. The dataset is available at the [COCO website](https://cocodataset.org/#home). You'll also need to convert the COCO labels into the YOLO format. The COCO labels are stored in a JSON file with boxes in the format (x, y, w, h), in pixels, where x and y are the top-left corner of the box. YOLO expects a text file for each image, with one line per object specifying the category and the box in the format (x, y, w, h), where x and y are the box center and all four values are normalized by the image width and height. For example, if image `1234.jpg` has 2 people and 1 bicycle, the file `1234.txt` will have three lines:

```
1 0.500000 0.842305 1.000000 0.315391
1 0.531875 0.465930 0.712083 0.782422
2 0.540583 0.837477 0.457042 0.300234
```
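
As a rough illustration of the conversion for a single box, here is a minimal sketch. It assumes the box comes from the COCO JSON as `[x_min, y_min, width, height]` in pixels and that the image size is known. The function names and the example numbers are ours, not taken from the conversion repo linked below.

```python
def coco_box_to_yolo(bbox, img_width, img_height):
    """Convert a COCO [x_min, y_min, width, height] pixel box into a YOLO
    (x_center, y_center, width, height) box normalized by the image size."""
    x_min, y_min, box_w, box_h = bbox
    x_center = (x_min + box_w / 2) / img_width
    y_center = (y_min + box_h / 2) / img_height
    return (x_center, y_center, box_w / img_width, box_h / img_height)


def yolo_label_line(category, bbox, img_width, img_height):
    """Format one object as a single line of a YOLO label file."""
    values = coco_box_to_yolo(bbox, img_width, img_height)
    return f"{category} " + " ".join(f"{v:.6f}" for v in values)


# Hypothetical example: a 640x480 image with one centered object.
print(yolo_label_line(1, [160, 120, 320, 240], 640, 480))
```

A full conversion would loop over every annotation in the JSON file and append one such line to the text file of the matching image.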

If you want to generate these labels yourself, or have your own dataset, [this repo](https://github.com/qwirky-yuzu/COCO-to-YOLO) contains a conversion script. If you just want to train on COCO, you can find labels [here](). Your data should have the following file structure:

```
coco    
│
└───annotations
│   │   instances_val2017.json
│   
└───images
│   │
│   └───train2017
│   │   │   0001.jpg
│   │   │   0002.jpg
│   │   │   ...
│   │
│   └───val2017
│       │   5001.jpg
│       │   5002.jpg
│       │   ...
│
└───labels
│   │
│   └───train2017
│   │   │   0001.txt
│   │   │   0002.txt
│   │   │   ...
│   │
│   └───val2017
│       │   5001.txt
│       │   5002.txt
│       │   ...
│   train2017.txt (list of all image files)
│   val2017.txt 
```
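
If you need to create the `train2017.txt` and `val2017.txt` list files yourself, something like the sketch below could work. The `write_image_list` helper is hypothetical, and the `./images/<split>/<file>` path style is an assumption; check the data YAML and the training script's data loader for the path format they actually expect.

```python
from pathlib import Path

# File extensions treated as images (an assumption; adjust for your dataset).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def write_image_list(root, split):
    """Write <root>/<split>.txt with one relative image path per line."""
    root = Path(root)
    image_dir = root / "images" / split
    paths = sorted(
        p for p in image_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )
    lines = [f"./{p.relative_to(root).as_posix()}" for p in paths]
    list_file = root / f"{split}.txt"
    list_file.write_text("\n".join(lines) + "\n")
    return list_file


# Usage, assuming the directory layout shown above:
# write_image_list("coco", "train2017")
# write_image_list("coco", "val2017")
```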

In order to speed up startup time, we recommend tarring this entire directory, and uploading it to S3. At the start of training, this tar file will be downloaded to each SageMaker instance. Downloading as a single archive is much faster than downloading all ~250,000 files.
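
One way to build the archive is with Python's standard `tarfile` module, as sketched below. The `make_dataset_archive` helper and the `coco.tar` name are illustrative. We write an uncompressed tar on the assumption that JPEG images gain little from further compression.

```python
import tarfile
from pathlib import Path


def make_dataset_archive(dataset_dir, archive_path):
    """Pack dataset_dir into an uncompressed tar archive.

    The archive keeps the top-level directory name (for example `coco/`).
    archive_path should live outside dataset_dir so the archive does not
    try to include itself.
    """
    dataset_dir = Path(dataset_dir)
    with tarfile.open(archive_path, "w") as tar:
        tar.add(dataset_dir, arcname=dataset_dir.name)
    return Path(archive_path)


# Usage, assuming the dataset lives in ./coco:
# make_dataset_archive("coco", "coco.tar")
# The archive can then be uploaded, for example with the boto3 S3 client's
# upload_file(Filename, Bucket, Key) method or the AWS CLI.
```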

### SageMaker Debugger

This training uses SageMaker's "bring your own container" functionality. Using Debugger in this scenario requires a small modification to the training script: it imports the Debugger library and, when it finds a JSON config file in the expected location, wraps the model with a Debugger hook. The JSON file is generated when your training job starts, based on the configuration you give in this notebook. These changes can be found on lines 31 and 206 of `train.py`.

```
import smdebug.pytorch as smd
from smdebug.core.config_constants import DEFAULT_CONFIG_FILE_PATH

...

    # wrap model in debugger
    if Path(DEFAULT_CONFIG_FILE_PATH).exists() and int(os.environ.get("RANK", 0))==0:
        hook = smd.get_hook(create_if_not_exists=True)
        hook.register_module(model)
```

You might also want to wrap your loss function to collect training progress. On many models this can be done in the same file by simply adding `hook.register_loss(loss_func)` after wrapping your model. This case is slightly more complicated, since YOLO uses a more complex loss. In this model, the loss is generated in the `compute_loss` function, which comes from `utils.general`. In that file, we'll again add the same imports, and wrap the loss functions on line 456.

```
    # Wrap loss functions in debug hook
    if Path(DEFAULT_CONFIG_FILE_PATH).exists() and int(os.environ.get("RANK", 0))==0:
        hook = smd.get_hook(create_if_not_exists=True)
        hook.register_loss(BCEcls)
        hook.register_loss(BCEobj)
```

The `get_hook` function checks global variables for an existing hook, so it returns the same hook we used for the model. Notice we only wrapped the classification and objectness loss functions in this case. This is one limitation for monitoring loss: the Debugger expects loss functions to be a subclass of `torch.nn.Module`. The current model uses GIoU loss for box prediction, which is not written as a subclass of `torch.nn.Module`. This is not a major change, and will be updated in a future version.

In [None]:
import os
from datetime import datetime

import boto3
from sagemaker import analytics, image_uris
from sagemaker.pytorch import PyTorch
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker
from sagemaker import get_execution_role
from sagemaker.debugger import (
    Rule,
    DebuggerHookConfig,
    TensorBoardOutputConfig,
    CollectionConfig,
    ProfilerConfig,
    FrameworkProfile,
    DetailedProfilingConfig,
    DataloaderProfilingConfig,
    rule_configs,
)
from smdebug.core.collection import CollectionKeys



### S3 Setup

The cell below sets up your S3 bucket, output locations, and SageMaker job name. This is optional, but makes it easier to keep track of multiple training jobs.

In [None]:
# Reuse a single boto3 session for the region lookup and the SageMaker client
boto_sess = boto3.Session()
region = boto_sess.region_name
sm = boto_sess.client('sagemaker')

s3_bucket = "s3://[your S3 bucket]/"

base_job_name = "[your job name]"
date_str = datetime.now().strftime("%d-%m-%Y")
time_str = datetime.now().strftime("%d-%m-%Y-%H-%M-%S")
job_name = f"{base_job_name}-{time_str}"

output_path = os.path.join(s3_bucket, "sagemaker-output", date_str, job_name)
code_location = os.path.join(s3_bucket, "sagemaker-code", date_str, job_name)
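
Note that `os.path.join` only produces valid S3 URIs where the path separator is `/` (Linux and macOS); on Windows it inserts backslashes. If the notebook may run elsewhere, a small helper like the hypothetical `s3_join` below is one possible alternative. The bucket and job names in the example are placeholders.

```python
def s3_join(base, *parts):
    """Join S3 URI components with forward slashes on any operating system."""
    cleaned = [base.rstrip("/")]
    # Drop empty components and stray leading or trailing slashes.
    cleaned += [p.strip("/") for p in parts if p.strip("/")]
    return "/".join(cleaned)


# Placeholder values for illustration only.
example_output_path = s3_join(
    "s3://my-bucket/", "sagemaker-output", "01-01-2024", "yolo-job"
)
print(example_output_path)
```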

### Studio Experiments

This cell sets up SageMaker Experiments to track training in Studio. This is also optional.

In [None]:
try: # Create new experiment
    experiment = Experiment.create(
        experiment_name=base_job_name,
        description='Yolo Training',
        sagemaker_boto_client=sm)
except Exception:  # Or reload existing
    experiment = Experiment.load(
        experiment_name=base_job_name,
        sagemaker_boto_client=sm)

trial = Trial.create(
    trial_name=job_name,
    experiment_name=experiment.experiment_name,
    sagemaker_boto_client=sm)
experiment_config = {
    'TrialName': trial.trial_name,
    'TrialComponentDisplayName': 'Training'}

# Configure metric definitions
metric_definitions = [
    {'Name': 'train_loss_step', 'Regex': 'train_loss_step: ([0-9\\.]+)'},
    {'Name': 'train_acc_step', 'Regex': 'train_acc_step: ([0-9\\.]+)'}]
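
To check a metric regex before launching a job, you can try it locally against a sample log line. The sketch below only mimics the matching with Python's `re` module; the real extraction happens on the SageMaker side against the job logs. The log line is hypothetical, and the training script has to print its metrics in a matching format. Each regex needs a capture group around the numeric value.

```python
import re

# Same patterns as in the cell above, with a capture group around the value.
example_definitions = [
    {"Name": "train_loss_step", "Regex": r"train_loss_step: ([0-9\.]+)"},
    {"Name": "train_acc_step", "Regex": r"train_acc_step: ([0-9\.]+)"},
]


def extract_metrics(log_line, definitions):
    """Return {metric name: value} for every definition matching log_line."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


# Hypothetical log line, not actual output of train.py.
sample_line = "epoch 1 train_loss_step: 0.4821 train_acc_step: 0.91"
print(extract_metrics(sample_line, example_definitions))
```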

### TensorBoard

This tells the Debugger hook where to store TensorBoard event files. In this case, it will create a `tensorboard` directory in S3 under the `output_path` specified above.

In [None]:
tensorboard_output_config = TensorBoardOutputConfig(s3_output_path=os.path.join(output_path, 'tensorboard'))

### Debugger Hooks

Here we specify the collections for debugger. We collect loss information every 25 steps, and gradients and weights every 500 steps. We don't apply any reductions, so all model gradients and weights will be saved to S3.

In [None]:
collection_configs=[
    CollectionConfig(
        name=CollectionKeys.LOSSES,
        parameters={
            "save_interval": "25",
            # "reductions": "mean",
        }
    ),
    CollectionConfig(
        name=CollectionKeys.GRADIENTS,
        parameters={
            "save_interval": "500",
            # "reductions": "mean",
        }
    ),
    CollectionConfig(
        name=CollectionKeys.WEIGHTS,
        parameters={
            "save_interval": "500",
            # "reductions": "mean",
        }
    )
]

debugger_hook_config=DebuggerHookConfig(
    collection_configs=collection_configs
)

### System Profiler

We collect system level performance data every 500 ms.

In [None]:
profiler_config=ProfilerConfig(
    system_monitor_interval_millis=500,
)


### Model hyperparameters

These will be passed to training as command line arguments.

In [None]:
hyperparameters = {"batch-size": 64,
                   "epochs": 2,
                   "data": "coco_sagemaker.yaml",
                   "cfg": "yolov4-p5.yaml",
                   "sync-bn": "True",
                   "name": "yolov4-p5",
                   "logdir": "/opt/ml/model/"
                   }
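
As a rough illustration of what "passed as command line arguments" means, the sketch below turns a hyperparameter dictionary into `--key value` pairs and parses them with `argparse`. It assumes each entry becomes one such pair. The `hyperparameters_to_args` helper and the small parser are stand-ins of our own, not the SageMaker implementation or the parser in `train.py`.

```python
import argparse


def hyperparameters_to_args(hyperparameters):
    """Flatten a hyperparameter dict into a list of `--key value` arguments."""
    args = []
    for key, value in hyperparameters.items():
        args.extend([f"--{key}", str(value)])
    return args


# A subset of the hyperparameters above, for illustration.
example_hyperparameters = {"batch-size": 64, "epochs": 2, "cfg": "yolov4-p5.yaml"}
cli_args = hyperparameters_to_args(example_hyperparameters)

# Hypothetical parser standing in for the one in the training script.
parser = argparse.ArgumentParser()
parser.add_argument("--batch-size", type=int, default=16)
parser.add_argument("--epochs", type=int, default=300)
parser.add_argument("--cfg", type=str, default="")
opt = parser.parse_args(cli_args)

print(cli_args)
print(opt.batch_size, opt.epochs, opt.cfg)
```

Values arrive as strings on the command line, so the receiving parser is responsible for converting them back to the right type.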

### Distributed training

For distributed training, we'll use DDP and torchrun. SageMaker does not currently have direct support for torchrun, instead favoring MPI. So we turn off SageMaker's distribution and set it up ourselves with the `launch_ddp.py` script. Instead of directly calling our `train.py`, this file will set up EFA, grab the SageMaker environment variables for distributed training, and launch our training script as a subprocess.

In [None]:
distribution=None
entry_point="launch_ddp.py"

### Cluster config and image

Set what type of instance you want, and how many. 

Grab the account number from the current instance in order to get your image from ECR.

In [None]:
instance_type = 'ml.p3.16xlarge'
# instance_type = 'local_gpu'
instance_count = 1

repo = "[your ECR repo]"
tag = "[your image tag]"
account = os.popen(f"aws sts get-caller-identity --region {region} --endpoint-url https://sts.{region}.amazonaws.com --query Account --output text").read().strip()
image_uri = f"{account}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"

### Setup Estimator

In [None]:
estimator = PyTorch(
    source_dir=".",
    entry_point=entry_point,
    base_job_name=job_name,
    role=get_execution_role(),
    instance_count=instance_count,
    instance_type=instance_type,
    distribution=distribution,
    volume_size=400,
    max_run=7200,
    hyperparameters=hyperparameters,
    image_uri=image_uri,
    output_path=os.path.join(output_path, 'training-output'),
    checkpoint_s3_uri=os.path.join(output_path, 'training-checkpoints'),
    model_dir=os.path.join(output_path, 'training-model'),
    code_location=code_location,
    ## Debugger parameters
    metric_definitions=metric_definitions,
    enable_sagemaker_metrics=True,
    #rules=rules,
    debugger_hook_config=debugger_hook_config,
    disable_profiler=False,
    tensorboard_output_config=tensorboard_output_config,
    profiler_config=profiler_config,
    input_mode='File',
)

### Data channels

This should be the location of the coco.tar file you created earlier.

In [None]:
channels={"all_data": "s3://[your-s3-bucket]/data/yolo/"}

### Launch training

In [None]:
# Run training
estimator.fit(
    inputs=channels,
    wait=False,
    job_name=job_name,
    experiment_config=experiment_config,
)

In [None]:
estimator.logs()