# PyTorch MNIST Lift and Shift Exercise

For this exercise notebook, use the `Python 3 (PyTorch 1.13 Python 3.9 CPU Optimized)` kernel on SageMaker Studio, or `conda_pytorch_p38` on classic SageMaker Notebook Instances.

---

## Introduction

Your new colleague in the data science team (who isn't very familiar with SageMaker) has written a nice notebook to tackle an image classification problem with PyTorch: [Local Notebook.ipynb](Local%20Notebook.ipynb).

It works OK with the simple MNIST data set they were working on before, but now they'd like to take advantage of some of the features of SageMaker to tackle bigger and harder challenges.

**Can you help refactor the Local Notebook code, to show them how to use SageMaker effectively?**


#### Reference
- [PyTorch MNIST](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-python-sdk/pytorch_mnist)

## Getting Started

First, check you can **run the [Local Notebook.ipynb](Local%20Notebook.ipynb) notebook through** - reviewing what steps it takes.

**This notebook** sets out a structure you can use to migrate code into, and lists out some of the changes you'll need to make at a high level. You can either work directly in here, or duplicate this notebook so you still have an unchanged copy of the original.

Try to work through the sections first with an MVP goal in mind (fitting the model to data in S3 via a SageMaker Training Job, and deploying/using the model through a SageMaker Endpoint). At the end, there are extension exercises to bring in more advanced functionality.


## Dependencies

Listing all our imports at the start keeps the requirements for running any script or file transparent up-front, and is recommended by nearly every style guide, including Python's official [PEP 8](https://www.python.org/dev/peps/pep-0008/#imports).


In [1]:
# !pip install "ipycanvas<0.13" "ipywidgets<8" matplotlib

In [3]:
%load_ext autoreload
%autoreload 2

# Python Built-Ins:
import glob
import os

# External Dependencies:
import matplotlib.pyplot as plt
import numpy as np

# Local Notebook Utils:
import util


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Prepare the Data

Let's download the image data from the Registry of Open Data on AWS and sample a subset like we did in the [Local Notebook.ipynb](Local%20Notebook.ipynb).

**Check you understand** what data it's going to upload from this notebook, and where it's going to store it in S3, then start the upload running while you work on the rest.

In [4]:
local_dir = "/tmp/mnist"
training_dir = f"{local_dir}/training"
testing_dir = f"{local_dir}/testing"

# Download the MNIST data from the Registry of Open Data on AWS
!rm -rf {local_dir}
!mkdir -p {local_dir}
!aws s3 cp s3://fast-ai-imageclas/mnist_png.tgz {local_dir} --no-sign-request

# Un-tar the MNIST data, stripping the leading path element; this will leave us with directories
# {local_dir}/testing/ and {local_dir}/training/
!tar zxf {local_dir}/mnist_png.tgz -C {local_dir}/ --strip-components=1 --no-same-owner

# Get the list of files in the training and testing directories
train_files = sorted(list(glob.iglob(os.path.join(training_dir, "*/*.png"), recursive=True)))
test_files = sorted(list(glob.iglob(os.path.join(testing_dir, "*/*.png"), recursive=True)))

print(f"Training files: {len(train_files)}")
print(f"Testing files:  {len(test_files)}")

# Reduce the data by keeping every Nth file and dropping the rest of the files.
reduction_factor = 2
train_files_to_keep = train_files[::reduction_factor]
test_files_to_keep = test_files[::reduction_factor]

print(f"Training files kept: {len(train_files_to_keep)}")
print(f"Testing files kept:  {len(test_files_to_keep)}")

# Delete all the files not to be kept (set difference: all files minus the kept ones)
for fname in set(train_files) - set(train_files_to_keep):
    os.remove(fname)

for fname in set(test_files) - set(test_files_to_keep):
    os.remove(fname)

print("Done!")

download: s3://fast-ai-imageclas/mnist_png.tgz to ../../../../../../tmp/mnist/mnist_png.tgz
Training files: 60000
Testing files:  10000
Training files kept: 30000
Testing files kept:  5000
Done!
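
The every-Nth reduction can be tried on a throwaway directory before touching real data. The sketch below is illustrative only: the `reduce_files` helper and the placeholder file names are assumptions made for this example, not part of the exercise code.

```python
import os
import tempfile


def reduce_files(files, reduction_factor):
    """Keep every Nth file (hypothetical helper) and delete the rest from disk."""
    files = sorted(files)
    kept = files[::reduction_factor]
    # Set difference: everything that was listed but is not being kept.
    for fname in set(files) - set(kept):
        os.remove(fname)
    return kept


with tempfile.TemporaryDirectory() as tmp_dir:
    # Create ten empty placeholder files standing in for the PNG images.
    paths = []
    for i in range(10):
        path = os.path.join(tmp_dir, f"{i:02d}.png")
        open(path, "w").close()
        paths.append(path)

    kept = reduce_files(paths, reduction_factor=2)
    remaining = sorted(os.listdir(tmp_dir))

print(len(kept), remaining)
```

Because the slice starts at index 0, a `reduction_factor` of 2 keeps the files at even positions in the sorted list, which is why half of each directory survives.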


## Set Up Execution Role, Session and S3 Bucket

Now that we have downloaded and reduced the data in the local directory, we need to upload it to Amazon S3 to make it available for Amazon SageMaker training.

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting. If you don't specify a bucket, SageMaker SDK will create a default bucket following a pre-defined naming convention in the same region.
- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the **get_execution_role** method from the SageMaker Python SDK.
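
As a rough illustration of the default bucket naming convention, the sketch below rebuilds a name of the form `sagemaker-{region}-{account_id}`. The `default_bucket_name` helper and the account ID `123456789012` are placeholders invented for this example; in practice the SDK's `default_bucket()` call is the authoritative source, and its result may differ if your environment overrides the default.

```python
def default_bucket_name(region: str, account_id: str) -> str:
    """Rebuild the `sagemaker-{region}-{account_id}` pattern (illustrative helper)."""
    return f"sagemaker-{region}-{account_id}"


# Placeholder region and account ID, used only for demonstration.
example_bucket = default_bucket_name("us-east-1", "123456789012")
print(example_bucket)
```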

In [5]:
# TODO: This is where you can setup execution role, session and S3 bucket.
# 1. Setup the SageMaker role
from sagemaker import get_execution_role
role = get_execution_role()

# 2. Setup the SageMaker session
import sagemaker
sess = sagemaker.Session()



# 3. Setup the SageMaker default bucket
bucket_name = sess.default_bucket()
print("bucket_name: \n", bucket_name)


# Have a look at the previous examples to find out how to do it

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/SageMaker/.xdg/config/sagemaker/config.yaml
bucket_name: 
 sagemaker-us-east-1-057716757052


## Upload Data to Amazon S3

Next, upload the images to Amazon S3 for SageMaker training. You can refer to the previous example, which used the [aws s3 sync](https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html) CLI command. The high-level `aws s3 sync` command synchronizes a source directory and a target bucket by copying new and updated files from the source to the target. It supports options such as `--delete`, which removes objects from the target that are not present in the source, and `--exclude` / `--include`, which filter the files or objects to skip or keep.

> ⏰ Note: Uploading to Amazon S3 typically takes about 2-3 minutes assuming a reduction_factor of 2.

In [6]:
# TODO: This is where you upload the training images using `aws s3 sync`.
# Fill in the missing source local directory and the target S3 bucket and folder in the command below.
bucket_prefix = "sagemaker-mnist"
# Build the S3 URI with an f-string; os.path.join is meant for local file paths.
s3_data_location = f"s3://{bucket_name}/{bucket_prefix}"
print("s3_data_location: ", s3_data_location)

!aws s3 sync --quiet --delete {local_dir} {s3_data_location} --exclude "*.tgz" && echo "Done!"

s3_data_location:  s3://sagemaker-us-east-1-057716757052/sagemaker-mnist
Done!
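
If you prefer to drive the upload from Python rather than a `!` shell escape, the command can be assembled as an argument list first. This is a minimal sketch: `build_sync_command` is a hypothetical helper and `example-bucket` is a placeholder bucket name. Only the `--delete`, `--exclude` and `--include` options described above are used.

```python
import shlex


def build_sync_command(source, target, delete=False, excludes=(), includes=()):
    """Assemble an `aws s3 sync` command line (hypothetical helper, for illustration)."""
    cmd = ["aws", "s3", "sync", source, target]
    if delete:
        cmd.append("--delete")
    # Filters that appear later on the command line take precedence,
    # so excludes are emitted first and includes can re-admit files.
    for pattern in excludes:
        cmd += ["--exclude", pattern]
    for pattern in includes:
        cmd += ["--include", pattern]
    return cmd


cmd = build_sync_command(
    "/tmp/mnist",
    "s3://example-bucket/sagemaker-mnist",  # placeholder bucket name
    delete=True,
    excludes=["*.tgz"],
)
command_line = shlex.join(cmd)
print(command_line)
```

The resulting list could then be passed to `subprocess.run(cmd, check=True)`; keeping it as a list avoids shell quoting problems with patterns such as `*.tgz`.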


In [7]:
! aws s3 ls {s3_data_location} 


                           PRE sagemaker-mnist/


In [8]:
! aws s3 ls {s3_data_location}/training/ --recursive

In [9]:
! aws s3 ls {s3_data_location}/training
! aws s3 ls {s3_data_location}/testing

                           PRE training/
                           PRE testing/


You can check your data is uploaded by finding your bucket in the [Amazon S3 Console](https://s3.console.aws.amazon.com/s3/home). Do you see the folders of images as expected?

## Data Input ("Channels") Configuration

The draft code has **2 data sets**: One for training, and one for test/validation. (For classification, the folder location of each image is sufficient as a label).

In SageMaker terminology, each input data set is a "channel" and we can name them however we like... Just make sure you're consistent about what you call each one!

For a simple input configuration, a channel spec might just be the S3 URI of the folder. For configuring more advanced options, there's the [TrainingInput](https://sagemaker.readthedocs.io/en/stable/inputs.html) class in the SageMaker SDK (named `s3_input` in version 1 of the SDK).


In [10]:
# TODO: Define your 2 data channels
# The data can be found in: "{s3_data_location}/training" and "{s3_data_location}/testing"
# Look at the previous example to see how the inputs were defined
train_s3_path = f'{s3_data_location}/training'
test_s3_path = f'{s3_data_location}/testing'
s3_inputs = {'train': train_s3_path, 'test': test_s3_path}
print("s3_inputs: \n", s3_inputs)


s3_inputs: 
 {'train': 's3://sagemaker-us-east-1-057716757052/sagemaker-mnist/training', 'test': 's3://sagemaker-us-east-1-057716757052/sagemaker-mnist/testing'}


### Create Local Data Inputs (for Local Mode)

In [11]:
local_inputs = {'train': f'file://{training_dir}',
          'test': f'file://{testing_dir}'}
print("local_inputs: \n", local_inputs)

local_inputs: 
 {'train': 'file:///tmp/mnist/training', 'test': 'file:///tmp/mnist/testing'}


## Algorithm ("Estimator") Configuration and Run

Instead of loading and fitting this data here in the notebook, we'll be creating a [PyTorch Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#pytorch-estimator) through the SageMaker SDK, to run the code on a separate container that can be scaled as required.

The ["Using PyTorch with the SageMaker Python SDK"](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html) docs give a good overview of this process. You should run your estimator in **Python 3**.

**Use the [src/main.py](src/main.py) file** as your entry point to port code into - which has already been created for you with some basic hints.
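
An entry point usually starts by parsing its arguments, since SageMaker passes each hyperparameter to the script as a `--name value` command-line flag. The sketch below is one possible shape for that step, not the contents of the provided script: the flag names mirror the ones used by the local run later in this notebook, while the `SM_*` environment variable defaults and the numeric defaults are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse training arguments, defaulting paths to SageMaker environment variables."""
    parser = argparse.ArgumentParser()
    # Data and output locations: set by SageMaker inside the container,
    # or passed explicitly when running the script on your own machine.
    parser.add_argument("--train_dir", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test_dir", default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--output-data-dir", default=os.environ.get("SM_OUTPUT_DATA_DIR"))
    # Hyperparameters (default values here are assumptions for illustration).
    parser.add_argument("--batch-size", type=int, default=128)
    parser.add_argument("--test-batch-size", type=int, default=1000)
    parser.add_argument("--epochs", type=int, default=1)
    return parser.parse_args(argv)


args = parse_args([
    "--train_dir", "/tmp/mnist/training",
    "--test_dir", "/tmp/mnist/testing",
    "--model-dir", "/tmp/mnist/model",
    "--epochs=3",
])
print(args)
```

Defaulting to environment variables lets the same script run unchanged both from a terminal with explicit paths and inside a training job where no path flags are passed.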


## Run Training with the Naive Script

Before running the actual training as a SageMaker training job, it is a good idea to run the script locally first using the code below. If there are any errors, you can fix them before launching the SageMaker training job.

In [12]:
save_model_dir = f"{local_dir}/model"
print("save_model_dir: ", save_model_dir)

save_model_dir:  /tmp/mnist/model


In [13]:
!python3 src/main-solution.py \
    --train_dir {training_dir} \
    --test_dir {testing_dir} \
    --output-data-dir data/local-output \
    --model-dir {save_model_dir} \
    --epochs=1 --batch-size=128

## Naive mode is set
## args: 
 Namespace(train_dir='/tmp/mnist/training', test_dir='/tmp/mnist/testing', model_dir='/tmp/mnist/model', output_data_dir='data/local-output', batch_size=128, test_batch_size=1000, epochs=1)
args.training data:  /tmp/mnist/training
args.test data:  /tmp/mnist/testing
##: Starting X, y dataset creation
Loading label 0...1...2...3...4...5...6...7...8...9...
Shuffling trainset...
Shuffling testset...
Done!
##: Starting preprocess
x_train shape: (30000, 1, 28, 28)
input_shape: (1, 28, 28)
30000 train samples
5000 test samples
n_labels: 10
y_train shape: (30000, 10)
##: trainloader is successfully loaded 
##: testloader is successfully loaded 
## Start training 
epoch: 1
train_loss: 0.000861
Evaluating model
## Start testing 
val_loss: 0.0397
val_acc: 0.9276
Saving model at /tmp/mnist/model/model.pth
## model is saved at /tmp/mnist/model/model.pth


When you're ready to try your script in a SageMaker training job, you can call `estimator.fit()` as we did in previous exercises:

## Use Local Mode or Cloud Mode

In [25]:
# USE_LOCAL_MODE = True
USE_LOCAL_MODE = False

import torch

if USE_LOCAL_MODE:
    instance_type = 'local_gpu' if torch.cuda.is_available() else 'local'
    instance_count = 1
    from sagemaker.local import LocalSession
    sagemaker_session = LocalSession()
    sagemaker_session.config = {'local': {'local_code': True}}
    inputs = local_inputs 
    nKeepAliveSeconds = None # Warmpool feature
    print("## Local mode is set")
else:
    instance_type = 'ml.g5.2xlarge'
    instance_count = 1
    sagemaker_session = sagemaker.session.Session()
    inputs = s3_inputs
    nKeepAliveSeconds = 3600 # Warmpool feature, 1 hour
    print(f"## Cloud mode is set with {instance_type} and {instance_count} of instance_count")

## Cloud mode is set with ml.g5.2xlarge and 1 of instance_count
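
The mode switch can also be expressed as a small function that returns the settings as a dictionary, which makes the choice easier to test and reuse. This is only a sketch: `select_run_config` is a hypothetical helper, and the instance type and warm pool duration are simply copied from the cell above.

```python
def select_run_config(use_local_mode, gpu_available=False):
    """Pick instance settings for local or cloud mode (illustrative helper)."""
    if use_local_mode:
        return {
            "instance_type": "local_gpu" if gpu_available else "local",
            "instance_count": 1,
            "keep_alive_seconds": None,  # no warm pool is requested in local mode
        }
    return {
        "instance_type": "ml.g5.2xlarge",
        "instance_count": 1,
        "keep_alive_seconds": 3600,  # keep the warm pool alive for one hour
    }


cloud_config = select_run_config(use_local_mode=False)
local_config = select_run_config(use_local_mode=True, gpu_available=True)
print(cloud_config)
print(local_config)
```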


## Create the Estimator

In [26]:
import time

from sagemaker.pytorch import PyTorch

hyperparameters = {'epochs': 3}

# Define the base job name. The SDK appends its own timestamp to this prefix
# to form the final training job name.
job_name = f'sagemaker-101-mnist-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

sm_estimator = PyTorch(
    entry_point='main-solution.py',
    source_dir='src',
    role=role,
    base_job_name=job_name,                # prefix for the training job name
    framework_version='2.0.1',
    py_version='py310',
    instance_count=instance_count,
    instance_type=instance_type,           # 'local' or 'local_gpu' in local mode
    sagemaker_session=sagemaker_session,   # the estimator's keyword is `sagemaker_session`
    hyperparameters=hyperparameters,
    keep_alive_period_in_seconds=nKeepAliveSeconds,  # warm pool
)



### Run Training

In [27]:
sm_estimator.fit(inputs, wait=False)
train_job_name = sm_estimator.latest_training_job.job_name

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: sagemaker-101-mnist-2024-06-23-14-57-55-2024-06-23-14-57-57-565


### View the SageMaker Training Job and CloudWatch Logs

In [28]:
from IPython.display import display, HTML
import boto3
region = boto3.Session().region_name

def make_console_link(region, train_job_name, train_task='[Training]'):
    """Build HTML links to the training job page and its CloudWatch log stream."""
    train_job_link = f'<b> {train_task} Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={region}#/jobs/{train_job_name}">Training Job</a></b>'
    cloudwatch_link = f'<b> {train_task} Review <a target="_blank" href="https://console.aws.amazon.com/cloudwatch/home?region={region}#logStream:group=/aws/sagemaker/TrainingJobs;prefix={train_job_name};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a></b>'
    return train_job_link, cloudwatch_link

if not USE_LOCAL_MODE:
    train_job_link, cloudwatch_link = make_console_link(region, train_job_name, '[PyTorch MNIST Training]')
    display(HTML(train_job_link))
    display(HTML(cloudwatch_link))

### View the Training Job Logs

In [30]:
if not USE_LOCAL_MODE:
    sm_estimator.logs()

2024-06-23 14:58:02 Starting - Found matching resource for reuse
2024-06-23 14:58:02 Downloading - Downloading input data...
2024-06-23 14:58:27 Training - Training image download completed. Training in progress.bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2024-06-23 14:58:28,464 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-06-23 14:58:28,481 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-06-23 14:58:28,492 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2024-06-23 14:58:28,493 sagemaker_pytorch_container.training INFO     Invoking user training script.
2024-06-23 14:58:29,873 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-06-23 14:58:30,041 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-06-

### Save the Model Artifact Location

In [31]:
model_s3_path = sm_estimator.model_data
print("model_s3_path: \n", model_s3_path)

%store model_s3_path


model_s3_path: 
 s3://sagemaker-us-east-1-057716757052/sagemaker-101-mnist-2024-06-23-14-57-55-2024-06-23-14-57-57-565/output/model.tar.gz
Stored 'model_s3_path' (str)
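
SageMaker packages whatever the script wrote to the model directory into `model.tar.gz`. Once the archive has been downloaded, its contents can be checked with the standard `tarfile` module. The sketch below builds a stand-in archive so that it runs anywhere: the `list_model_files` helper is hypothetical, and the assumption that the archive holds a single `model.pth` file is based only on the local run log above.

```python
import os
import tarfile
import tempfile


def list_model_files(archive_path):
    """Return the sorted names of regular files inside a model.tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())


with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a stand-in archive containing an empty model.pth file.
    model_file = os.path.join(tmp_dir, "model.pth")
    open(model_file, "wb").close()
    archive_path = os.path.join(tmp_dir, "model.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_file, arcname="model.pth")

    contents = list_model_files(archive_path)

print(contents)
```

For the real artifact, you would first copy the object at `model_s3_path` to a local path (for example with `aws s3 cp`) and pass that path to the helper.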
