# Train a Semantic Segmentation Model Using PyTorch Lightning on Amazon SageMaker

## Overview
This notebook demonstrates how to train a semantic segmentation model with a custom PyTorch Lightning training script, similar to one you would use outside of SageMaker, together with SageMaker's prebuilt containers for various frameworks.

SageMaker Script Mode is flexible, so you will also see examples of how to include your own dependencies, such as a custom Python library, in your training and inference.

### Prerequisites
To follow along, you need to create an IAM role, a SageMaker notebook instance, and an S3 bucket.
Once the notebook instance is created, choose `conda_python3` as the kernel.

### Imports

In [1]:
import sagemaker
import subprocess
import sys
import random
import math
import pandas as pd
import os
import boto3
import numpy as np
from sagemaker.pytorch import PyTorch
from sagemaker.s3 import S3Uploader, s3_path_join

In [3]:
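# SageMaker Python SDK version 2.x is required.
# Any installed version other than 2.103.1 is upgraded to the latest release.
original_version = sagemaker.__version__
if original_version != "2.103.1":
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", "sagemaker"])
    import importlib

    # Reload the module so the upgraded version is used without restarting the kernel
    importlib.reload(sagemaker)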



In [None]:
# skip this step if you have already downloaded and unzipped the data
!wget https://s3.eu-central-1.amazonaws.com/avg-kitti/data_semantics.zip
!unzip data_semantics.zip -d data_semantics

In [14]:
# Run the following cells only when you are using a SageMaker notebook instance
!bash ./prepare-docker.sh

Redirecting to /bin/systemctl stop docker.service
  docker.socket
Redirecting to /bin/systemctl start docker.service


In [15]:
!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/local_mode_setup.sh
!/bin/bash ./local_mode_setup.sh

nvidia-docker2 already installed. We are good to go!
SageMaker instance route table setup is ok. We are good to go.
SageMaker instance routing for Docker is ok. We are good to go!


### Define parameters and upload files to S3

In [2]:
sm_session = sagemaker.session.Session()
bucket = sm_session.default_bucket()
prefix = 'sagemaker/pytorch-lightning-example'
role = sagemaker.get_execution_role()

In [3]:
data_path = s3_path_join("s3://", bucket, prefix + "/data")
print(f"Uploading data to {data_path}")
data_url = S3Uploader.upload('data_semantics', data_path)

Uploading data to s3://sagemaker-us-east-1-631450739534/sagemaker/pytorch-lightning-example/data


In [4]:
data_url #= 's3://sagemaker-us-east-1-631450739534/sagemaker/pytorch-lightning-example/data'

's3://sagemaker-us-east-1-631450739534/sagemaker/pytorch-lightning-example/data'
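
The uploaded data is later passed to `estimator.fit` under the channel name `data_path`. A minimal sketch of how that channel surfaces inside the training container, assuming the standard SageMaker conventions (the `SM_CHANNEL_<NAME>` environment variable and the `/opt/ml/input/data/<channel>` directory). The helper names `channel_env_var` and `channel_dir` are hypothetical and exist only for illustration:

```python
import os


def channel_env_var(channel_name: str) -> str:
    """Name of the environment variable that holds the channel's local path."""
    return "SM_CHANNEL_" + channel_name.upper()


def channel_dir(channel_name: str) -> str:
    """Local directory of an input channel.

    Falls back to the conventional container path when the variable is unset,
    for example when the script runs outside of SageMaker.
    """
    default = "/opt/ml/input/data/" + channel_name
    return os.environ.get(channel_env_var(channel_name), default)


print(channel_env_var("data_path"))  # SM_CHANNEL_DATA_PATH
```

A training script could call `channel_dir("data_path")` to locate the dataset regardless of whether the job runs in local mode or on a managed instance.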

## PyTorch
In this PyTorch example, we show how to use PyTorch Lightning to train a semantic segmentation model on multiple GPUs.
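
SageMaker passes each entry of the `hyperparameters` dictionary to the entry-point script as a command-line argument (for example `--batch_size 8`). The entry-point scripts are not shown in this notebook, so the following is only a sketch of how such a script might read its inputs. The function name `parse_args`, the argument names other than `batch_size`, and the default values are assumptions, not the contents of `semantic_segmentation.py`:

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided settings (illustrative only)."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments, e.g. --batch_size 8
    parser.add_argument("--batch_size", type=int, default=8)
    # Environment variables set by the SageMaker training toolkit, with local fallbacks
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--num_gpus", type=int, default=int(os.environ.get("SM_NUM_GPUS", "0")))
    # Ignore any extra arguments rather than failing on them
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--batch_size", "8"])
print(args.batch_size)  # 8
```

Using `parse_known_args` keeps the script tolerant of arguments it does not recognize, which is convenient when the same script is launched both locally and by SageMaker.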


In [8]:
hyperparameters = {"batch_size": 8}
enable_local_mode_training = True

if enable_local_mode_training:
    train_instance_type = "local_gpu"
    inputs = {"data_path": "file:///home/ec2-user/SageMaker/amazon-sagemaker-pytorch-lightning-distributed-training/data_semantics"}
else:
    train_instance_type = "ml.g4dn.12xlarge"
    inputs = {"data_path": data_url}

estimator_parameters = {
    "entry_point": "semantic_segmentation_single.py",
    "source_dir": "code",
    "instance_type": train_instance_type,
    "instance_count": 1,
    "hyperparameters": hyperparameters,
    "role": role,
    "base_job_name": "pytorch-lightning",
    "image_uri": "570106654206.dkr.ecr.us-east-1.amazonaws.com/pt-ddp-custom:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker-2.6.0-numproc",
    "py_version": "py3",
    "distribution": {"pytorchddp":{"enabled": True}},
}

estimator = PyTorch(**estimator_parameters)
estimator.fit(inputs)

Creating 5jfl2hnhwa-algo-1-5y4z8 ... 
Creating 5jfl2hnhwa-algo-1-5y4z8 ... done
Attaching to 5jfl2hnhwa-algo-1-5y4z8
5jfl2hnhwa-algo-1-5y4z8 | 2022-08-16 02:49:20,901 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
5jfl2hnhwa-algo-1-5y4z8 | 2022-08-16 02:49:20,977 sagemaker-training-toolkit INFO     instance_groups entry not present in resource_config
5jfl2hnhwa-algo-1-5y4z8 | 2022-08-16 02:49:20,978 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
5jfl2hnhwa-algo-1-5y4z8 | 2022-08-16 02:49:20,981 sagemaker_pytorch_container.training INFO     Pytorch_ddp_enabled is:
5jfl2hnhwa-algo-1-5y4z8 | 2022-08-16 02:49:20,981 sagemaker_pytorch_container.training INFO     True
5jfl2hnhwa-algo-1-5y4z8 | 2022-08-16 02:49:20,982 sagemaker_pytorch_container.training INFO     Invoking SMDataParallel for native PT DDP job
5jfl2hnhwa-algo-1-5y4z8 | 2022-08-16 0

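The two training cells in this section differ mainly in scale: a single instance for the first run, and four `ml.p3.16xlarge` instances for the second. As a back-of-the-envelope sketch, assuming one DDP process per GPU and that `batch_size` applies per process (the training scripts are not shown here, so both points are assumptions), the effective global batch size can be estimated as follows. The GPU counts are the published values for these instance types, and the helper names `world_size` and `effective_batch_size` are hypothetical:

```python
# GPUs per instance for the instance types used in this notebook
GPUS_PER_INSTANCE = {
    "ml.g4dn.12xlarge": 4,
    "ml.p3.16xlarge": 8,
}


def world_size(instance_type: str, instance_count: int) -> int:
    """Total number of DDP processes, assuming one process per GPU."""
    return GPUS_PER_INSTANCE[instance_type] * instance_count


def effective_batch_size(per_process_batch_size: int, instance_type: str, instance_count: int) -> int:
    """Samples consumed per optimizer step across all processes."""
    return per_process_batch_size * world_size(instance_type, instance_count)


print(world_size("ml.p3.16xlarge", 4))  # 32
print(effective_batch_size(8, "ml.p3.16xlarge", 4))  # 256
```

If these assumptions hold, moving from one instance to four changes the effective batch size substantially, which may call for adjusting the learning rate.
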
In [15]:
hyperparameters = {"batch_size": 8}
enable_local_mode_training = False

if enable_local_mode_training:
    train_instance_type = "local_gpu"
    inputs = {"data_path": "file:///home/ec2-user/SageMaker/amazon-sagemaker-pytorch-lightning-distributed-training/data_semantics"}
else:
    train_instance_type = "ml.p3.16xlarge"
    inputs = {"data_path": data_url}
    

estimator_parameters = {
    "entry_point": "semantic_segmentation.py",
    "source_dir": "code",
    "instance_type": train_instance_type,
    "instance_count": 4,
    "hyperparameters": hyperparameters,
    "role": role,
    "base_job_name": "pytorch-lightning",
    "image_uri": "570106654206.dkr.ecr.us-east-1.amazonaws.com/pt-ddp-custom:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker-2.6.0-numproc",
    "py_version": "py3",
    "distribution": {"pytorchddp":{"enabled": True}},
    "debugger_hook_config": False,
    "disable_profiler": True
}

estimator = PyTorch(**estimator_parameters)
estimator.fit(inputs)  # pass wait=False to return without blocking until the job finishes

2022-08-16 06:39:31 Starting - Starting the training job.........
2022-08-16 06:40:38 Starting - Preparing the instances for training.........
2022-08-16 06:42:15 Downloading - Downloading input data...
2022-08-16 06:42:51 Training - Downloading the training image..........................
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-08-16 06:47:20,099 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-08-16 06:47:20,180 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-08-16 06:47:20,183 sagemaker_pytorch_container.training INFO    