* Notebook created by nov05 on 2025-02-07
* It was run locally with conda env `sagemaker_py310`.  
* Compare training results with [this model](https://github.com/silverbottlep/abid_challenge/blob/master/counting/train.py#L191)  
* [Issues during training](https://gist.github.com/nov05/1bdc15eda0e781640b46ab28d38f45bd)   
* Check [the wandb logs](https://wandb.ai/nov05/udacity-awsmle-resnet34-amazon-bin?nw=nwusernov05)

# 👉 **AWS Credentials**

In [3]:
%pwd

'd:\\github\\udacity-nd009t-capstone-starter\\starter'

In [8]:
## windows cmd to launch notepad to edit aws credential file
# !notepad C:\Users\guido\.aws\config
!notepad C:\Users\guido\.aws\credentials

In [9]:
## reset the session after updating credentials
import boto3 # type: ignore
boto3.DEFAULT_SESSION = None
import sagemaker # type: ignore
from sagemaker import get_execution_role # type: ignore

# Extract and print the account ID
sts_client = boto3.client('sts')
response = sts_client.get_caller_identity() 
account_id = response['Account']

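role_arn = get_execution_role()  ## get role ARN
## Defaults for the case where the caller already is the SageMaker execution role;
## without them the print statements below raise NameError.
voclabs_role_arn = None
sagemaker_role_arn = role_arn
if 'AmazonSageMaker-ExecutionRole' not in role_arn:
    ## Running locally: go to "IAM - Roles", search for "SageMaker", find the execution role.
    voclabs_role_arn = role_arn
    sagemaker_role_arn = "arn:aws:iam::570668189909:role/service-role/AmazonSageMaker-ExecutionRole-20250126T194519"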
session = sagemaker.Session()  ## "default"
region = session.boto_region_name
bucket = session.default_bucket()

print(f"Current AWS Account ID: {account_id}")
print("AWS Region: {}".format(region))
print("Default Bucket: {}".format(bucket))
print(f"Role voclabs ARN: {voclabs_role_arn}") ## If local, Role ARN: arn:aws:iam::807711953667:role/voclabs
print("SageMaker Role ARN: {}".format(sagemaker_role_arn)) 

## generate secrets.env. remember to add it to .gitignore  
import wandb
wandb.sagemaker_auth(path="../secrets") 

## get my own AWS account info
def get_secrets(name):
    """Return the first line of ../secrets/<name>, stripped; None if the file is empty."""
    path = '../secrets/' + name
    with open(path, 'r') as file:
        for line in file:
            return line.strip()
    return None
aws_account_number = get_secrets('aws_account_number')
aws_account_profile = get_secrets('aws_account_profile')

Current AWS Account ID: 570668189909
AWS Region: us-east-1
Default Bucket: sagemaker-us-east-1-570668189909
Role voclabs ARN: arn:aws:iam::570668189909:role/voclabs
SageMaker Role ARN: arn:aws:iam::570668189909:role/service-role/AmazonSageMaker-ExecutionRole-20250126T194519
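
The cell above hardcodes the SageMaker execution role ARN. As a minimal sketch (not part of the original notebook), the hypothetical helper `parse_role_arn` below splits an IAM role ARN into its account ID and role name, so the hardcoded value can be compared with the account ID reported by STS. It assumes the usual `arn:<partition>:iam::<account-id>:role/<path>/<name>` layout.

```python
def parse_role_arn(role_arn):
    """Split an IAM role ARN into (account_id, role_name).

    Hypothetical helper for illustration; assumes the layout
    arn:<partition>:iam::<account-id>:role/<optional/path/><role-name>.
    """
    parts = role_arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "iam":
        raise ValueError(f"Not an IAM ARN: {role_arn!r}")
    resource = parts[5]
    if not resource.startswith("role/"):
        raise ValueError(f"Not a role ARN: {role_arn!r}")
    account_id = parts[4]
    role_name = resource.rsplit("/", 1)[-1]  # drop any path such as service-role/
    return account_id, role_name


# ARN copied from the cell output above
arn = "arn:aws:iam::570668189909:role/service-role/AmazonSageMaker-ExecutionRole-20250126T194519"
account, name = parse_role_arn(arn)
print(account, name)
```

In the notebook, `account` could then be compared with `account_id` from `get_caller_identity()` before the role is passed to the estimator.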


# 👉 **Training**

In [25]:
%%time 
from sagemaker.pytorch import PyTorch
from datetime import datetime
data_base_path = "s3://p5-amazon-bin-images/webdataset/"
train_data_path = data_base_path + "train/train-shard-{000000..000007}.tar"
val_data_path = data_base_path + "val/val-shard-{000000..000001}.tar"
test_data_path = data_base_path + "test/test-shard-{000000..000001}.tar"
print(train_data_path)
output_path = "s3://p5-amazon-bin-images-train/"  
## Manually set dataset sizes hyperparameters
dataset_size = 10441  ## 10K dataset
split_ratio = [0.7, 0.15, 0.15]  ## train, val, test
train_data_size = int(dataset_size * split_ratio[0])
val_data_size = int(dataset_size * split_ratio[1])
test_data_size = dataset_size - train_data_size - val_data_size  ## remainder goes to test
print(f"train_size: {train_data_size}, val_size: {val_data_size}, test_size: {test_data_size}")
## s3://p5-amazon-bin-images/webdataset/train/train-shard-{000000..000007}.tar
## train_size: 7308, val_size: 1566, test_size: 1567
hyperparameters = {
    'epochs': 40,   
    'batch-size': 128,   
    'opt-type': 'sgd',         ## 👉 SGD optimizer
    'opt-momentum': 0.9,       ## 👉 SGD optimizer
    'opt-learning-rate': 0.1,  ## 👉 SGD optimizer
    # 'opt-weight-decay': 1e-4,  
    'lr-sched-step-size': 10,  ## 👉 optimizer learning rate scheduler
    # 'lr-sched-gamma': 0.5,
    'early-stopping-patience': 100,  ## set larger than epochs if no early stopping
    'model-arch': 'resnet34', 
    'wandb': True,  
    'debug': False, 
    ## input data
    "train-data-path": train_data_path,
    "val-data-path": val_data_path,
    "test-data-path": test_data_path,
    "train-data-size": train_data_size, 
    "val-data-size": val_data_size,
    "test-data-size": test_data_size,
    "class-weights-dict": {
        1: 1.7004885993485341, 
        2: 0.9083079599826012, 
        3: 0.7832708177044261, 
        4: 0.8799831436999579, 
        5: 1.1137066666666666
    },
}
estimator = PyTorch(
    entry_point='train_v1.py',  # Training script that defines the model (ResNet-34 here, see 'model-arch') and the training loop
    source_dir='../scripts_train',  # Directory where your script and dependencies are stored
    role=sagemaker_role_arn,
    framework_version='1.13.1',  # Use the PyTorch version you need
    py_version='py39',
    instance_count=2,  ## multi-instance training, Udacity account level limit 2
    instance_type='ml.g4dn.xlarge',  ## 16GB, 1 GPU per instance
    output_path=output_path,  ## if not specify, output to the sagemaker default bucket
    hyperparameters=hyperparameters,
    distribution={"smdistributed": {"dataparallel": { "enabled": True}}},  # mpirun, activates SMDDP AllReduce OR AllGather
) 
estimator.fit(
    wait=True,  
    job_name=f"p5-amazon-bin-job-{datetime.now().strftime('%Y%m%d-%H%M%S')}", 
)

s3://p5-amazon-bin-images/webdataset/train/train-shard-{000000..000007}.tar
train_size: 7308, val_size: 1566, test_size: 1567


2025-02-08 01:31:58 Starting - Starting the training job...
2025-02-08 01:32:12 Starting - Preparing the instances for training...
2025-02-08 01:32:45 Downloading - Downloading input data...
2025-02-08 01:33:16 Downloading - Downloading the training image............
2025-02-08 01:35:58 Training - Training image download completed. Training in progress...bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
  "cipher": algorithms.TripleDES,
  "class": algorithms.TripleDES,
2025-02-08 01:36:11,259 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2025-02-08 01:36:11,281 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-02-08 01:36:11,295 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2025-02-08 01:36:11,302 sagemaker_pytorch_container.training INFO     Invoking SMDataParallel
2025-02-08 01:36:11,302 sagem
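
The data paths passed as hyperparameters use brace notation (`{000000..000007}`) to name a range of shard files. The training script is not shown here, so how it expands the pattern is an assumption; the sketch below uses a hypothetical helper, `expand_shards`, to show which file names such a pattern stands for. It handles only a single numeric range and is not a replacement for a full brace-expansion library.

```python
import re


def expand_shards(pattern):
    """Expand one numeric brace range, e.g. 'x-{000..002}.tar' -> 3 names.

    Illustrative helper only: it handles a single {start..end} range and
    keeps the zero padding of the start value.
    """
    match = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if match is None:
        return [pattern]  # nothing to expand
    start, end = match.group(1), match.group(2)
    width = len(start)
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [f"{prefix}{i:0{width}d}{suffix}" for i in range(int(start), int(end) + 1)]


# Pattern copied from the training cell above
train_pattern = "s3://p5-amazon-bin-images/webdataset/train/train-shard-{000000..000007}.tar"
shards = expand_shards(train_pattern)
print(len(shards))
print(shards[0])
print(shards[-1])
```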

In [27]:
import boto3
sagemaker_client = boto3.client('sagemaker')
training_job_name = "p5-amazon-bin-job-20250207-193159"
response = sagemaker_client.describe_training_job(TrainingJobName=training_job_name)
status = response['TrainingJobStatus']
print(status)

Completed
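
The cell above checks the job status once. When the job was started with `wait=False`, or the notebook kernel was restarted mid-run, the same `describe_training_job` call can be polled until the job finishes. The following is a minimal sketch: the function name, the poll interval, and the poll limit are illustrative choices, not part of the original notebook.

```python
import time

# Statuses after which a SageMaker training job no longer changes
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(client, job_name, poll_seconds=30, max_polls=240):
    """Poll describe_training_job until the job reaches a terminal status.

    `client` is expected to behave like boto3.client('sagemaker').
    Returns the last status seen, which may still be non-terminal
    if max_polls is exhausted.
    """
    status = None
    for _ in range(max_polls):
        response = client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            break
        time.sleep(poll_seconds)
    return status


# Usage (requires AWS credentials):
# import boto3
# status = wait_for_training_job(boto3.client("sagemaker"),
#                                "p5-amazon-bin-job-20250207-193159")
# print(status)
```

Passing the client in as an argument keeps the helper easy to exercise with a stub object when no AWS session is available.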


* 5-epoch test run
    ```
    2025-02-07 18:25:49 Uploading - Uploading generated training model
    2025-02-07 18:25:49 Completed - Training job completed
    Training seconds: 1356
    Billable seconds: 1356
    CPU times: total: 21.7 s
    Wall time: 12min 44s
    ```

* 40-epoch run
    ```
    Training seconds: 6552
    Billable seconds: 6552
    CPU times: total: 1min 34s
    Wall time: 56min 8s
    ```
    ```
    [1,mpirank:0,algo-1]<stdout>:👉 VAL: Average loss: 1.4845, Accuracy: 448/1536 (29.17%)
    [1,mpirank:0,algo-1]<stdout>:👉 Train Epoch: 19, Learning Rate: 0.010000000000000002
    ```
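
The learning rate of about 0.01 logged at epoch 19 is consistent with step decay: a base rate of 0.1 (`opt-learning-rate`) reduced once after 10 epochs (`lr-sched-step-size`). The sketch below reproduces that arithmetic. The decay factor `gamma=0.1` is an assumption (it is the default of PyTorch's `StepLR`, and `lr-sched-gamma` is commented out in the hyperparameters); the training script itself is not shown, and whether it counts epochs from 0 or 1 does not change the result for epoch 19.

```python
def step_lr(base_lr, step_size, gamma, epoch):
    """Learning rate under step decay: multiply by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


# base_lr and step_size come from the hyperparameters above;
# gamma=0.1 is an assumption (the StepLR default), not confirmed by the source.
for epoch in (0, 9, 10, 19, 20, 39):
    print(epoch, step_lr(0.1, 10, 0.1, epoch))
```

Under these assumptions the rate would fall to roughly 0.001 for epochs 20 to 29 and roughly 0.0001 for epochs 30 to 39.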