# Train a Semantic Segmentation Model Using PyTorch Lightning on Amazon SageMaker

## Overview
This notebook demonstrates how to train a semantic segmentation model with a custom PyTorch Lightning training script, similar to one you would use outside of SageMaker, running in one of SageMaker's prebuilt framework containers.

SageMaker Script Mode is flexible, so you will also see how to include your own dependencies, such as a custom Python library, in your training and inference.
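
As a rough illustration of the dependency mechanism: the prebuilt framework containers install the packages listed in a `requirements.txt` that sits in the `source_dir` next to the entry-point script (the training log later in this notebook shows that step). The helper below is a hypothetical pre-flight check, not part of the SageMaker SDK; the name `check_source_dir` is an assumption made for this sketch, while `code` and `semantic_segmentation.py` are the names used in the estimator configuration later in the notebook.

```python
from pathlib import Path


def check_source_dir(source_dir, entry_point):
    """Report whether the expected Script Mode files exist (hypothetical helper)."""
    root = Path(source_dir)
    return {
        # The script that the container invokes to start training.
        "entry_point": (root / entry_point).is_file(),
        # Optional file; when present, its packages are pip-installed before training.
        "requirements": (root / "requirements.txt").is_file(),
    }


if __name__ == "__main__":
    print(check_source_dir("code", "semantic_segmentation.py"))
```

Running this before `estimator.fit` is one way to catch a missing entry point locally instead of waiting for the training container to start.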

### Prerequisites
To follow along, you need to create an IAM role, a SageMaker notebook instance, and an S3 bucket.
Once the SageMaker notebook instance is created, choose conda_python3 as the kernel.

### Imports

In [1]:
import sagemaker
import subprocess
import sys
import random
import math
import pandas as pd
import os
import boto3
import numpy as np
from sagemaker.pytorch import PyTorch
from sagemaker.s3 import S3Uploader, s3_path_join

In [2]:
# SageMaker Python SDK version 2.x is required
original_version = sagemaker.__version__
if sagemaker.__version__ != "2.24.1":
    subprocess.check_call([sys.executable, "-m", "pip", "install", "sagemaker==2.24.1"])
    import importlib

    importlib.reload(sagemaker)

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting sagemaker==2.24.1
  Downloading sagemaker-2.24.1.tar.gz (397 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 397.4/397.4 KB 9.5 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting botocore<1.28.0,>=1.27.42
  Downloading botocore-1.27.51-py3-none-any.whl (9.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.0/9.0 MB 46.8 MB/s eta 0:00:00
Building wheels for collected packages: sagemaker
  Building wheel for sagemaker (setup.py): started
  Building wheel for sagemaker (setup.py): finished with status 'done'
  Created wheel for sagemaker: filename=sagemaker-2.24.1-py2.py3-none-any.whl size=560575 sha256=9a317cf27472e145af4b6f3aa689240d2273c98937d3a3685a373ed4e919f78f
  Stored in directory: /home/ec2-user/.cache/pip/wheels/24/c1/31/f282472572e4dad1cabda2fb0e2c399936b7655010b3717096
Successfully built sagemaker
Insta

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
awscli 1.25.42 requires botocore==1.27.42, but you have botocore 1.27.51 which is incompatible.
aiobotocore 2.0.1 requires botocore<1.22.9,>=1.22.8, but you have botocore 1.27.51 which is incompatible.
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/pytorch_p38/bin/python -m pip install --upgrade pip' command.
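
The cell above pins exactly `2.24.1`, so any other installed release triggers a reinstall, and the output shows the resulting `botocore` conflicts. If any 2.x release is acceptable (an assumption; this notebook is only shown running against 2.24.1), a looser check on the major version could be sketched as follows. The function name `needs_sdk_install` is hypothetical.

```python
def needs_sdk_install(installed_version, required_major=2):
    """Return True when the installed SDK is not in the required major series."""
    # "2.24.1" -> 2; only the leading component is compared.
    major = int(installed_version.split(".")[0])
    return major != required_major


if __name__ == "__main__":
    for version in ("1.72.0", "2.24.1", "2.100.0"):
        print(version, needs_sdk_install(version))
```

With this check, the `pip install` step would run only when `needs_sdk_install(sagemaker.__version__)` is true.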


In [None]:
# skip this step if you have already downloaded and unzipped the data
!wget https://s3.eu-central-1.amazonaws.com/avg-kitti/data_semantics.zip
!unzip data_semantics.zip -d data_semantics

In [14]:
# Run the cells below only when you are using a SageMaker notebook instance
!bash ./prepare-docker.sh

Redirecting to /bin/systemctl stop docker.service
  docker.socket
Redirecting to /bin/systemctl start docker.service


In [15]:
!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/local_mode_setup.sh
!/bin/bash ./local_mode_setup.sh

nvidia-docker2 already installed. We are good to go!
SageMaker instance route table setup is ok. We are good to go.
SageMaker instance routing for Docker is ok. We are good to go!


### Define parameters and upload files to S3

In [5]:
sm_session = sagemaker.session.Session()
bucket = sm_session.default_bucket()
prefix = 'sagemaker/pytorch-lightning-example'
role = sagemaker.get_execution_role()

In [6]:
data_path = s3_path_join("s3://", bucket, prefix + "/data")
print(f"Uploading data to {data_path}")
data_url = S3Uploader.upload('data_semantics', data_path)

Uploading data to s3://sagemaker-us-east-1-631450739534/sagemaker/pytorch-lightning-example/data


In [7]:
data_url

's3://sagemaker-us-east-1-631450739534/sagemaker/pytorch-lightning-example/data'

## PyTorch
In this PyTorch example, we show how to use PyTorch Lightning to train a semantic segmentation model on multiple GPUs.
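
The entry-point script `semantic_segmentation.py` is not reproduced in this notebook, so the following is only a minimal sketch of how such a script might pick up its inputs. It relies on the SageMaker training toolkit conventions: hyperparameters are passed as command-line flags, each input channel is exposed as `SM_CHANNEL_<NAME>` (here `SM_CHANNEL_DATA_PATH` for the `data_path` channel), and `SM_NUM_GPUS` and `SM_MODEL_DIR` describe the instance. The function names `parse_args` and `trainer_kwargs`, the local fallback values, and the choice of the `ddp` strategy are assumptions for illustration, not the original script's code.

```python
import argparse
import os


def parse_args(argv=None):
    """Read hyperparameters and SageMaker-provided locations (illustrative)."""
    parser = argparse.ArgumentParser()
    # Hyperparameters from the estimator arrive as flags, e.g. --batch_size 8.
    parser.add_argument("--batch_size", type=int, default=8)
    # The "data_path" channel is mounted at the path held in SM_CHANNEL_DATA_PATH.
    parser.add_argument(
        "--data_path", default=os.environ.get("SM_CHANNEL_DATA_PATH", "data_semantics")
    )
    # Files written to SM_MODEL_DIR are uploaded to S3 when the job ends.
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument(
        "--num_gpus", type=int, default=int(os.environ.get("SM_NUM_GPUS", "0"))
    )
    # Tolerate any extra flags the platform passes.
    args, _ = parser.parse_known_args(argv)
    return args


def trainer_kwargs(num_gpus):
    """Build keyword arguments for pytorch_lightning.Trainer (assumed settings)."""
    if num_gpus > 1:
        # One process per GPU with distributed data parallel.
        return {"accelerator": "gpu", "devices": num_gpus, "strategy": "ddp"}
    if num_gpus == 1:
        return {"accelerator": "gpu", "devices": 1}
    return {"accelerator": "cpu", "devices": 1}


if __name__ == "__main__":
    args = parse_args()
    print(args, trainer_kwargs(args.num_gpus))
    # In a real script: trainer = pytorch_lightning.Trainer(**trainer_kwargs(args.num_gpus))
```

Deriving the device count from `SM_NUM_GPUS` lets the same script run unchanged in local mode and on a multi-GPU training instance.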


In [17]:
hyperparameters = {"batch_size": 8}
enable_local_mode_training = True

if enable_local_mode_training:
    train_instance_type = "local_gpu"
    inputs = {"data_path": "file:///home/ec2-user/SageMaker/amazon-sagemaker-pytorch-lightning-distributed-training/data_semantics"}
else:
    train_instance_type = "ml.p3.16xlarge"
    inputs = {"data_path": data_url}

estimator_parameters = {
    "entry_point": "semantic_segmentation.py",
    "source_dir": "code",
    "instance_type": train_instance_type,
    "instance_count": 1,
    "hyperparameters": hyperparameters,
    "role": role,
    "base_job_name": "pytorch-lightning",
    "image_uri": "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10.0-gpu-py38",
    "py_version": "py3",
}

estimator = PyTorch(**estimator_parameters)
estimator.fit(inputs)

Creating yc7222r2t8-algo-1-ex77p ... 
Creating yc7222r2t8-algo-1-ex77p ... done
Attaching to yc7222r2t8-algo-1-ex77p
yc7222r2t8-algo-1-ex77p | 2022-08-15 11:25:41,630 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
yc7222r2t8-algo-1-ex77p | 2022-08-15 11:25:41,707 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
yc7222r2t8-algo-1-ex77p | 2022-08-15 11:25:41,710 sagemaker_pytorch_container.training INFO     Invoking user training script.
yc7222r2t8-algo-1-ex77p | 2022-08-15 11:25:41,931 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
yc7222r2t8-algo-1-ex77p | /opt/conda/bin/python3.8 -m pip install -r requirements.txt
yc7222r2t8-algo-1-ex77p | Collecting pytorch-lightning
yc7222r2t8-algo-1-ex77p | Downloading pytorch_lightning-1.7.1-py3-none-any.whl (701 kB)
yc7222r2t8-algo-1-ex77p | Collecting typin

In [18]:
enable_local_mode_training = False

if enable_local_mode_training:
    train_instance_type = "local_gpu"
    inputs = {"data_path": "file:///home/ec2-user/SageMaker/amazon-sagemaker-pytorch-lightning-distributed-training/data_semantics"}
else:
    train_instance_type = "ml.p3.16xlarge"
    inputs = {"data_path": data_url}


estimator_parameters = {
    "entry_point": "semantic_segmentation.py",
    "source_dir": "code",
    "instance_type": train_instance_type,
    "instance_count": 1,
    "hyperparameters": hyperparameters,
    "role": role,
    "base_job_name": "pytorch-lightning",
    "image_uri": "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10.0-gpu-py38",
    "py_version": "py3",
}

estimator = PyTorch(**estimator_parameters)
estimator.fit(inputs)

2022-08-15 11:28:48 Starting - Starting the training job...ProfilerReport-1660562928: InProgress
...
2022-08-15 11:29:33 Starting - Preparing the instances for training.........
2022-08-15 11:31:17 Downloading - Downloading input data...
2022-08-15 11:31:43 Training - Downloading the training image...........................
2022-08-15 11:36:13 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-08-15 11:36:08,728 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-08-15 11:36:08,801 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-08-15 11:36:08,808 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-08-15 11:36:09,368 sagemaker-training-toolkit INFO     Installing dependencies from requir