# Run Training in SageMaker
- References:
    - mnist.py example: [python file](https://github.com/aws/amazon-sagemaker-examples/blob/default/%20%20%20%20%20%20build_and_train_models/sm-hyperparameter_tuning_pytorch/mnist.py)
    - training and tuner example: [notebook](https://github.com/aws/amazon-sagemaker-examples/blob/default/%20%20%20%20%20%20build_and_train_models/sm-hyperparameter_tuning_pytorch/sm-hyperparameter_tuning_pytorch.ipynb)
    - Training Toolkit: [sagemaker-training-toolkit](https://github.com/aws/sagemaker-training-toolkit)
    - Python SDK: [link](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html)
    - Logging metrics: [link](https://docs.aws.amazon.com/sagemaker/latest/dg/define-train-metrics.html)

In [1]:
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import CategoricalParameter, ContinuousParameter, HyperparameterTuner

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name

bucket = 'neurobeacon'
prefix = 'dev/jason/data/base_model'
role = sagemaker.get_execution_role()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


## Single Training Job

In [2]:
estimator = PyTorch(
    base_job_name='full-model-mini-jason',
    entry_point='model.py',
    role=role,
    py_version='py38',
    framework_version='1.11.0',
    instance_type='ml.m5.4xlarge',
    instance_count=1,
    metric_definitions=[
        {'Name': 'train:error', 'Regex': r'Train Loss: ([0-9]+\.[0-9]+)'},
        {'Name': 'test:error', 'Regex': r'Test Loss: ([0-9]+\.[0-9]+)'}
    ]
)
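
The `metric_definitions` regexes are matched against the job's log output, so it can help to try them on a few sample lines before launching a job. Below is a minimal local sketch using Python's `re` module. The sample log lines and the `extract_metrics` helper are hypothetical; it assumes `model.py` prints losses as `Train Loss: <float>` and `Test Loss: <float>`.

```python
import re

metric_definitions = [
    {'Name': 'train:error', 'Regex': r'Train Loss: ([0-9]+\.[0-9]+)'},
    {'Name': 'test:error', 'Regex': r'Test Loss: ([0-9]+\.[0-9]+)'},
]

# Hypothetical log lines; the real format depends on what model.py prints.
sample_log = [
    'Epoch 1 Train Loss: 0.6931',
    'Epoch 1 Test Loss: 0.7012',
    'Epoch 2 Train Loss: 0.5120',
    'Epoch 2 Test Loss: 1',  # no decimal point, so the regex does not match
]


def extract_metrics(lines, definitions):
    """Return {metric name: [captured values]} for every matching line."""
    found = {d['Name']: [] for d in definitions}
    for line in lines:
        for d in definitions:
            match = re.search(d['Regex'], line)
            if match:
                found[d['Name']].append(float(match.group(1)))
    return found


metrics = extract_metrics(sample_log, metric_definitions)
print(metrics)
```

Note that a loss logged without a decimal point (the last sample line) is skipped, because the pattern requires digits on both sides of the dot.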

In [3]:
data_dir = 's3://neurobeacon/tst/data'
data_folder = 'mini_dataset'

train_dir = f'{data_dir}/{data_folder}/train'
test_dir = f'{data_dir}/{data_folder}/test'

print('train_dir:', train_dir)
print('test_dir:', test_dir)

train_dir: s3://neurobeacon/tst/data/mini_dataset/train
test_dir: s3://neurobeacon/tst/data/mini_dataset/test


You can close the notebook once the training job has started. After training completes, the training summary is available in SageMaker.
- Open SageMaker Studio > Jobs > Training > select your training job
    - Performance: shows only the last logged value of each metric
    - CloudWatch / SageMaker AI Monitor: training/validation curves still need to be configured

In [4]:
# each key in inputs becomes a channel, exposed to the script as SM_CHANNEL_<KEY>
estimator.fit(
    inputs={
        'train': train_dir,  # SM_CHANNEL_TRAIN
        'test': test_dir  # SM_CHANNEL_TEST
    })

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: full-model-mini-jason-2025-02-28-17-54-16-009


2025-02-28 17:54:18 Starting - Starting the training job...
2025-02-28 17:54:32 Starting - Preparing the instances for training...
2025-02-28 17:55:08 Downloading - Downloading the training image.........
2025-02-28 17:56:29 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2025-02-28 12:56:47,809 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2025-02-28 12:56:47,811 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2025-02-28 12:56:47,814 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-02-28 12:56:47,825 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2025-02-28 12:56:47,827 sagemaker_pytorch_container.training INFO     Invoking user trai
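
Inside the training container, each key passed to `fit(inputs=...)` is exposed to the entry-point script as an environment variable named `SM_CHANNEL_<KEY>` that points at the local copy of the data. The sketch below shows one way a script such as `model.py` might collect those paths. The `channel_dirs` helper and the simulated environment are assumptions for illustration, not the contents of the actual script.

```python
import os


def channel_dirs(environ=os.environ):
    """Map channel names to local paths using SM_CHANNEL_<NAME> variables."""
    prefix = 'SM_CHANNEL_'
    return {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }


# Simulated environment for a local dry run (assumed paths).
fake_env = {
    'SM_CHANNEL_TRAIN': '/opt/ml/input/data/train',
    'SM_CHANNEL_TEST': '/opt/ml/input/data/test',
    'SM_MODEL_DIR': '/opt/ml/model',
}
dirs = channel_dirs(fake_env)
print(dirs['train'], dirs['test'])
```

Passing the environment as an argument keeps the helper testable outside SageMaker; in a real job it would be called with no arguments.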

## Hyperparameter Tuning (Single Run)

In [38]:
estimator = PyTorch(
    base_job_name='full-model-mini-lr-01-gamme-8',
    entry_point="model.py",
    role=role,
    py_version='py38',
    framework_version='1.11.0',
    instance_count=1,
    instance_type='ml.m5.4xlarge',
    hyperparameters={'lr': 0.01, 'gamma': 0.8, 'epochs': 20}, # additional argparse variables
    metric_definitions=[  # regexes applied to the job's log output; matches are sent to CloudWatch
        {'Name': 'train:error', 'Regex': r'Train Loss: ([0-9]+\.[0-9]+)'},
        {'Name': 'test:error', 'Regex': r'Test Loss: ([0-9]+\.[0-9]+)'}
    ]
)
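
Entries in `hyperparameters` reach the entry-point script as command-line arguments (`--lr 0.01 --gamma 0.8 --epochs 20`), which is why the script needs matching `argparse` options. A minimal sketch of that parsing follows. The defaults and the `parse_args` helper are assumptions for illustration and may differ from what `model.py` actually defines.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters that SageMaker passes as CLI flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--lr', type=float, default=0.01)
    parser.add_argument('--gamma', type=float, default=0.8)
    parser.add_argument('--epochs', type=int, default=20)
    # parse_known_args tolerates extra flags the script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the arguments produced by
# hyperparameters={'lr': 0.01, 'gamma': 0.8, 'epochs': 20}.
args = parse_args(['--lr', '0.01', '--gamma', '0.8', '--epochs', '20'])
print(args.lr, args.gamma, args.epochs)
```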

In [None]:
estimator.fit(
    inputs={
        'train': train_dir,  # SM_CHANNEL_TRAIN
        'test': test_dir  # SM_CHANNEL_TEST
    })

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: full-model-mini-lr-01-gamme-8-2025-02-27-23-36-05-402


2025-02-27 23:36:07 Starting - Starting the training job...
2025-02-27 23:36:21 Starting - Preparing the instances for training...
2025-02-27 23:37:03 Downloading - Downloading the training image.........
2025-02-27 23:38:29 Training - Training image download completed. Training in progress...

## Hyperparameter Tuning (Multiple Runs)
- Not tested yet.
- According to this [article](https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-automatic-model-tuning-now-automatically-chooses-tuning-configurations-to-improve-usability-and-cost-efficiency/), we may not need to specify any `hyperparameter_ranges` at all.
- Reference: [tuner example](https://github.com/aws/amazon-sagemaker-examples/blob/default/%20%20%20%20%20%20build_and_train_models/sm-hyperparameter_tuning_pytorch/sm-hyperparameter_tuning_pytorch.ipynb)

In [None]:
estimator = PyTorch(
    base_job_name='full-model-mini-lr-01-gamme-8-jason',
    entry_point="model.py",
    role=role,
    py_version='py38',
    framework_version='1.11.0',
    instance_count=1,
    instance_type='ml.m5.4xlarge',
    # static hyperparameters; lr and gamma are also tuned below, where the tuner's sampled values are expected to take precedence
    hyperparameters={'lr': 0.01, 'gamma': 0.8},
)

In [None]:
hyperparameter_ranges = {
    'lr': ContinuousParameter(0.001, 0.1),
    'gamma': CategoricalParameter([0.8, 0.9, 0.99])
}

In [None]:
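# Assumes model.py logs "Test Loss: <float>", as in the metric_definitions of the earlier cells;
# the mnist.py pattern "Test set: Average loss: ..." would not match that output.
objective_metric_name = "test:error"
objective_type = "Minimize"
metric_definitions = [{"Name": "test:error", "Regex": r"Test Loss: ([0-9]+\.[0-9]+)"}]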
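
Since this section is untested, a quick pre-flight check may save a failed tuning job: the objective metric name should appear in `metric_definitions`, and its regex needs a capture group for the value. The `check_objective` helper below is a hypothetical sketch, shown with the `test:error` definition used by the earlier estimator cells.

```python
import re


def check_objective(objective_metric_name, metric_definitions):
    """Raise ValueError if the objective cannot be extracted from logs."""
    by_name = {d['Name']: d['Regex'] for d in metric_definitions}
    if objective_metric_name not in by_name:
        raise ValueError(f'{objective_metric_name!r} is not in metric_definitions')
    # The metric value is read from a capture group, so one must exist.
    if re.compile(by_name[objective_metric_name]).groups < 1:
        raise ValueError('objective regex has no capture group')
    return by_name[objective_metric_name]


definitions = [{'Name': 'test:error', 'Regex': r'Test Loss: ([0-9]+\.[0-9]+)'}]
regex = check_objective('test:error', definitions)
print(regex)
```

The check only validates the definitions themselves; whether the regex matches the real log format still depends on what the training script prints.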

In [None]:
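tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name=objective_metric_name,
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=metric_definitions,
    max_jobs=9,  # total training jobs in the tuning run
    max_parallel_jobs=3,  # jobs running at the same time
    objective_type=objective_type,
)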

In [None]:
tuner.fit({
    'train': train_dir,  # SM_CHANNEL_TRAIN
    'test': test_dir  # SM_CHANNEL_TEST
})