# [Module 1.5] Training with Checkpoints

All notebooks in this workshop run on the `conda_python3` kernel.

This notebook does the following:
- Trains a PyTorch model with SageMaker managed Spot training and uses checkpoints so that an interrupted job can resume.

In [13]:
import sagemaker
import uuid

sagemaker_session = sagemaker.Session()
print('SageMaker version: ' + sagemaker.__version__)

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/DEMO-pytorch-cnn-cifar10'

role = sagemaker.get_execution_role()
checkpoint_suffix = str(uuid.uuid4())[:8]
checkpoint_s3_path = 's3://{}/checkpoint-{}'.format(bucket, checkpoint_suffix)

print('Checkpointing Path: {}'.format(checkpoint_s3_path))

SageMaker version: 2.45.0
Checkpointing Path: s3://sagemaker-ap-northeast-2-057716757052/checkpoint-23753227


In [14]:
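import os
import shutil
import subprocess

instance_type = 'local'

# Use the GPU variant of local mode only if nvidia-smi is installed and succeeds.
# Checking with shutil.which first avoids a FileNotFoundError on CPU-only hosts.
if shutil.which('nvidia-smi') and subprocess.call('nvidia-smi') == 0:
    instance_type = 'local_gpu'

print("Instance type = " + instance_type)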

Instance type = local_gpu


### Upload the data
We use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value, `inputs`, identifies that location; we will use it later when we start the training job.

In [15]:
inputs = sagemaker_session.upload_data(path="../data", bucket=bucket, key_prefix="data/cifar10")
print("s3 inputs: ", inputs)

s3 inputs:  s3://sagemaker-ap-northeast-2-057716757052/data/cifar10


In [25]:
from sagemaker.pytorch import PyTorch

hyperparameters = {'epochs': 1}

spot_estimator = PyTorch(
    entry_point='cifar10-spot.py',
    source_dir='source',
    role=role,
    framework_version='1.6.0',
    py_version='py3',
    instance_count=1,
    instance_type=instance_type,  # 'local' or 'local_gpu', as detected above
    base_job_name='cifar10-pytorch-spot-1',
    hyperparameters=hyperparameters,
)

spot_estimator.fit(inputs, wait=False)

Creating 6nvhc7dnes-algo-1-7y9kb ...
Creating 6nvhc7dnes-algo-1-7y9kb ... done
Attaching to 6nvhc7dnes-algo-1-7y9kb
6nvhc7dnes-algo-1-7y9kb | 2021-07-28 13:30:56,701 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
6nvhc7dnes-algo-1-7y9kb | 2021-07-28 13:30:56,744 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
6nvhc7dnes-algo-1-7y9kb | 2021-07-28 13:30:56,746 sagemaker_pytorch_container.training INFO     Invoking user training script.
6nvhc7dnes-algo-1-7y9kb | 2021-07-28 13:30:56,779 botocore.credentials INFO     Found credentials in environment variables.
6nvhc7dnes-algo-1-7y9kb | 2021-07-28 13:30:56,888 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
6nvhc7dnes-algo-1-7y9kb | /opt/conda/bin/python3.6 -m pip install -r requirements.txt
6nvhc7dnes-algo-1-7y9kb | Collecting torchsummary==1.5.1
6nvhc7dn

## Create a training job using the sagemaker.PyTorch estimator

The `PyTorch` class allows us to run our training function on SageMaker. We need to configure it with our training script, an IAM role, the number of training instances, and the training instance type. For local training with a GPU, we can set the instance type to `local_gpu`. In this case, `instance_type` was set above based on whether you're running on a GPU instance.

After we've constructed our `PyTorch` object, we fit it using the data we uploaded to S3. Even though we're in local mode, using S3 as our data source makes sense because it maintains consistency with how SageMaker's distributed, managed training ingests data.
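
On the script side, these estimator settings surface as command-line arguments and environment variables. The sketch below shows one way an entry point might read them; it is not the workshop's actual `cifar10.py`. The flag names (`--data-dir`, `--model-dir`, `--checkpoint-path`) and the local fallback paths are assumptions. `SM_CHANNEL_TRAINING` and `SM_MODEL_DIR` are environment variables that the SageMaker training toolkit sets inside the container when `fit` is called with a single S3 URI.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse training arguments (flag names are illustrative assumptions)."""
    parser = argparse.ArgumentParser()
    # Entries of the estimator's `hyperparameters` dict arrive as CLI flags,
    # for example {'epochs': 5} becomes `--epochs 5`.
    parser.add_argument("--epochs", type=int, default=1)
    # Inside the container the S3 data is downloaded to the directory named
    # by SM_CHANNEL_TRAINING; the fallback is only for runs outside SageMaker.
    parser.add_argument(
        "--data-dir", default=os.environ.get("SM_CHANNEL_TRAINING", "../data")
    )
    # Files written to SM_MODEL_DIR are packaged into model.tar.gz.
    parser.add_argument(
        "--model-dir", default=os.environ.get("SM_MODEL_DIR", "./model")
    )
    # Local directory that SageMaker syncs with `checkpoint_s3_uri`.
    parser.add_argument("--checkpoint-path", default="/opt/ml/checkpoints")
    return parser.parse_args(argv)
```

Calling `parse_args(["--epochs", "5"])` yields `args.epochs == 5` and `args.checkpoint_path == "/opt/ml/checkpoints"`.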


In [16]:
print("instance_type: ", instance_type)
print("role: ", role)

instance_type:  local_gpu
role:  arn:aws:iam::057716757052:role/secure-vpc-client


In [17]:
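use_spot_instances = True
max_run = 600  # maximum training time, in seconds
max_wait = 1200 if use_spot_instances else None  # maximum total time in seconds, including waiting for Spot capacity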
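
The two limits are related: for managed Spot training, SageMaker expects `max_wait` to be at least as large as `max_run`, because the wait time includes the run time. A small guard like the following can catch a bad combination before the job is submitted. `spot_training_kwargs` is a hypothetical helper written for this illustration, not part of the SageMaker SDK.

```python
def spot_training_kwargs(use_spot_instances, max_run, max_wait=None):
    """Return estimator keyword arguments after checking the Spot limits.

    Hypothetical helper: it only validates the values and builds a dict.
    """
    if not use_spot_instances:
        # On-demand training does not use max_wait at all.
        return {"use_spot_instances": False, "max_run": max_run}
    if max_wait is None or max_wait < max_run:
        raise ValueError(
            "max_wait must be set and >= max_run when using Spot instances"
        )
    return {
        "use_spot_instances": True,
        "max_run": max_run,
        "max_wait": max_wait,
    }
```

The resulting dictionary could be unpacked into the estimator call, for example `PyTorch(..., **spot_training_kwargs(True, 600, 1200))`.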

## Simulating Spot interruption after 5 epochs

Our training job should run for 10 epochs.

However, we will simulate a Spot interruption that occurs after 5 epochs. We do this by running the first job with `epochs` set to 5.

The goal is for the checkpoint data to be copied to S3, so that when Spot capacity becomes available again, the training job can resume from the 6th epoch.

Note the `checkpoint_s3_uri` parameter, which holds the S3 URI where any checkpoints written by the training script are persisted during training.

The `debugger_hook_config` parameter must be set to `False` to enable checkpoints to be copied to S3 successfully.

In [18]:
from sagemaker.pytorch import PyTorch

hyperparameters = {'epochs': 5}

spot_estimator = PyTorch(
    entry_point='cifar10.py',
    source_dir='source',
    role=role,
    framework_version='1.6.0',
    py_version='py3',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    base_job_name='cifar10-pytorch-spot-1',
    hyperparameters=hyperparameters,
    checkpoint_s3_uri=checkpoint_s3_path,
    debugger_hook_config=False,
    use_spot_instances=use_spot_instances,
    max_run=max_run,
    max_wait=max_wait,
)

spot_estimator.fit(inputs, wait=False)

In [19]:
spot_estimator.logs()

2021-07-28 12:52:09 Starting - Launching requested ML instances...ProfilerReport-1627476727: InProgress
......
2021-07-28 12:53:32 Starting - Preparing the instances for training......
2021-07-28 12:54:47 Downloading - Downloading input data...
2021-07-28 12:55:13 Training - Downloading the training image.........
2021-07-28 12:56:42 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-07-28 12:56:43,253 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-07-28 12:56:43,277 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-07-28 12:56:43,284 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-07-28 12:56:43,641 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt

### View the training job's Checkpoint configuration
We can now view the Checkpoint configuration of the training job directly in the SageMaker console.

Log in to the [SageMaker console](https://console.aws.amazon.com/sagemaker/home), choose the latest training job, and scroll down to the Checkpoint configuration section.

Choose the S3 output path link and you'll be directed to the S3 bucket where the checkpoint data is saved.

You can see there is one file there:

```
checkpoint.pth
```

This is the checkpoint file that contains the epoch, model state dict, optimizer state dict, and loss.
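
For reference, a checkpoint with these four entries could be written as in the sketch below. This is an assumed counterpart to the loading code, not the workshop's actual saving routine: the helper names `build_checkpoint_state` and `save_checkpoint` are hypothetical, and the dictionary keys are the ones the loader in this notebook reads. `/opt/ml/checkpoints` is the local directory that SageMaker syncs to `checkpoint_s3_uri`.

```python
import os

CHECKPOINT_FILE = "checkpoint.pth"  # file name used throughout this notebook


def build_checkpoint_state(model, optimizer, epoch, loss):
    """Collect the four entries stored in the checkpoint file."""
    return {
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "loss": loss,
    }


def save_checkpoint(model, optimizer, epoch, loss,
                    checkpoint_dir="/opt/ml/checkpoints"):
    """Write the checkpoint to the directory SageMaker uploads to S3."""
    import torch  # imported here so build_checkpoint_state has no torch dependency

    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, CHECKPOINT_FILE)
    torch.save(build_checkpoint_state(model, optimizer, epoch, loss), path)
    return path
```

Calling `save_checkpoint` at the end of every epoch keeps a single, continually overwritten `checkpoint.pth`, which matches the one file seen in the bucket.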

### Continue training after Spot capacity is resumed

Now we simulate a situation where Spot capacity becomes available again.

We will start a training job again, this time with 10 epochs.

We expect the training job to start from the 6th epoch.

This works because, when a training job starts, SageMaker checks the checkpoint S3 location for checkpoint data. If any exists, it is copied to `/opt/ml/checkpoints` on the training container.

In the code you can see the function that loads the checkpoint data:

```python
def _load_checkpoint(model, optimizer, args):
    print("--------------------------------------------")
    print("Checkpoint file found!")
    print("Loading Checkpoint From: {}".format(args.checkpoint_path + '/checkpoint.pth'))
    checkpoint = torch.load(args.checkpoint_path + '/checkpoint.pth')
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    epoch_number = checkpoint['epoch']
    loss = checkpoint['loss']
    print("Checkpoint File Loaded - epoch_number: {} - loss: {}".format(epoch_number, loss))
    print('Resuming training from epoch: {}'.format(epoch_number+1))
    print("--------------------------------------------")
    return model, optimizer, epoch_number
```
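
The function above assumes a checkpoint file already exists, so the training script has to decide when to call it and which epochs remain. A minimal sketch of that decision follows. `find_checkpoint` and `epochs_to_run` are hypothetical helpers, and the sketch assumes that epochs are numbered from 1 and that `checkpoint['epoch']` holds the last completed epoch, which is consistent with the "Resuming training from epoch" message printed above.

```python
import os


def find_checkpoint(checkpoint_dir, filename="checkpoint.pth"):
    """Return the checkpoint path if the file exists, otherwise None."""
    path = os.path.join(checkpoint_dir, filename)
    return path if os.path.isfile(path) else None


def epochs_to_run(total_epochs, last_completed_epoch=0):
    """Return the 1-based epoch numbers that still need to be trained."""
    return range(last_completed_epoch + 1, total_epochs + 1)


# Sketch of how a training script might combine the two helpers:
#
#   last_epoch = 0
#   if find_checkpoint(args.checkpoint_path):
#       model, optimizer, last_epoch = _load_checkpoint(model, optimizer, args)
#   for epoch in epochs_to_run(args.epochs, last_epoch):
#       ...  # train one epoch, then save a new checkpoint
```

With a checkpoint from epoch 5 and `epochs` set to 10, `epochs_to_run(10, 5)` covers epochs 6 through 10; with no checkpoint, it covers all 10.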


In [9]:
hyperparameters = {'epochs': 10}

spot_estimator = PyTorch(
    entry_point='cifar10.py',
    source_dir='source',
    role=role,
    framework_version='1.7.1',
    py_version='py3',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    base_job_name='cifar10-pytorch-spot-2',
    hyperparameters=hyperparameters,
    checkpoint_s3_uri=checkpoint_s3_path,
    debugger_hook_config=False,
    use_spot_instances=use_spot_instances,
    max_run=max_run,
    max_wait=max_wait,
)

spot_estimator.fit(inputs, wait=False)

In [10]:
spot_estimator.logs()

2021-07-28 12:32:35 Starting - Starting the training job...
2021-07-28 12:32:59 Starting - Launching requested ML instancesProfilerReport-1627475555: InProgress
......
2021-07-28 12:34:03 Starting - Preparing the instances for training......
2021-07-28 12:35:06 Downloading - Downloading input data...
2021-07-28 12:35:24 Training - Downloading the training image..................
2021-07-28 12:38:35 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-07-28 12:38:28,784 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-07-28 12:38:28,808 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-07-28 12:38:31,835 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-07-28 12:38:32,177 sagemaker-traini

### Analyze training job logs

Analyzing the training job logs, we can see that the training job now starts from the 6th epoch.

We can see the output of the `_load_checkpoint` function:

```
--------------------------------------------
Checkpoint file found!
Loading Checkpoint From: /opt/ml/checkpoints/checkpoint.pth
Checkpoint File Loaded - epoch_number: 5 - loss: 0.8455273509025574
Resuming training from epoch: 6
--------------------------------------------
```

### View the training job's Checkpoint configuration after the job completes 10 epochs

We can now view the Checkpoint configuration of the training job directly in the SageMaker console.

Log in to the [SageMaker console](https://console.aws.amazon.com/sagemaker/home), choose the latest training job, and scroll down to the Checkpoint configuration section.

Choose the S3 output path link and you'll be directed to the S3 bucket where the checkpoint data is saved.

You can see there is still that one file there:

```
checkpoint.pth
```

You'll be able to see that the date of the checkpoint file was updated to the time of the 2nd Spot training job.
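
The same check can be done from the notebook instead of the console. The sketch below lists the objects under the checkpoint URI together with their last-modified timestamps. `split_s3_uri` and `list_checkpoint_objects` are hypothetical helpers; the S3 client is passed in as an argument, and the only client call used is `list_objects_v2`, which returns at most 1,000 keys per call (more than enough for a single checkpoint file).

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("Not an S3 URI: {}".format(s3_uri))
    return parsed.netloc, parsed.path.lstrip("/")


def list_checkpoint_objects(s3_client, s3_uri):
    """Return (key, last_modified) pairs for objects under the given URI."""
    bucket, prefix = split_s3_uri(s3_uri)
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    # 'Contents' is absent from the response when nothing matches the prefix.
    return [
        (obj["Key"], obj["LastModified"])
        for obj in response.get("Contents", [])
    ]
```

In this notebook it could be called as `list_checkpoint_objects(boto3.client("s3"), checkpoint_s3_path)` after `import boto3`. Running it after each job should show the timestamp of `checkpoint.pth` moving forward.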

## Save the model artifact

In [26]:
spot_artifact_path = spot_estimator.model_data
print("spot_artifact_path: ", spot_artifact_path)

%store spot_artifact_path

spot_artifact_path:  s3://sagemaker-ap-northeast-2-057716757052/cifar10-pytorch-spot-1-2021-07-28-13-30-49-900/model.tar.gz
Stored 'spot_artifact_path' (str)
