* Notebook created by nov05 on 2025-02-07
* It was run locally with conda env `sagemaker_py310`.  
* Compare training results with [this model](https://github.com/silverbottlep/abid_challenge/blob/master/counting/train.py#L191)  
* [Issues during training](https://gist.github.com/nov05/1bdc15eda0e781640b46ab28d38f45bd)   
* Check [the wandb logs](https://wandb.ai/nov05/udacity-awsmle-resnet34-amazon-bin?nw=nwusernov05)

# 👉 **AWS Credentials**

In [2]:
%pwd

'd:\\github\\udacity-nd009t-capstone-starter\\examples'

In [3]:
## windows cmd to launch notepad to edit aws credential file
# !notepad C:\Users\guido\.aws\config
!notepad C:\Users\guido\.aws\credentials

In [4]:
## reset the session after updating credentials
import boto3 # type: ignore
boto3.DEFAULT_SESSION = None
import sagemaker # type: ignore
from sagemaker import get_execution_role # type: ignore

# Extract and print the account ID
sts_client = boto3.client('sts')
response = sts_client.get_caller_identity() 
account_id = response['Account']

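role_arn = get_execution_role()  ## get role ARN
voclabs_role_arn = None  ## only set when running locally with the voclabs role
sagemaker_role_arn = role_arn  ## default: the caller already has the SageMaker execution role
if 'AmazonSageMaker-ExecutionRole' not in role_arn:
    ## Go to "IAM - Roles", search for "SageMaker", find the execution role.
    voclabs_role_arn = role_arn
    sagemaker_role_arn = "arn:aws:iam::570668189909:role/service-role/AmazonSageMaker-ExecutionRole-20250126T194519"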
session = sagemaker.Session()  ## "default"
region = session.boto_region_name
bucket = session.default_bucket()

print(f"Current AWS Account ID: {account_id}")
print("AWS Region: {}".format(region))
print("Default Bucket: {}".format(bucket))
print(f"Role voclabs ARN: {voclabs_role_arn}") ## If local, Role ARN: arn:aws:iam::807711953667:role/voclabs
print("SageMaker Role ARN: {}".format(sagemaker_role_arn)) 

## generate secrets.env. remember to add it to .gitignore  
import wandb
wandb.sagemaker_auth(path="../secrets") 

## get my own AWS account info
def get_secrets(name):
    path = '../secrets/' + name
    with open(path, 'r') as file:
        for line in file:
            return line.strip()
aws_account_number = get_secrets('aws_account_number')
aws_account_profile = get_secrets('aws_account_profile')

Current AWS Account ID: 570668189909
AWS Region: us-east-1
Default Bucket: sagemaker-us-east-1-570668189909
Role voclabs ARN: arn:aws:iam::570668189909:role/voclabs
SageMaker Role ARN: arn:aws:iam::570668189909:role/service-role/AmazonSageMaker-ExecutionRole-20250126T194519


# **👉 Training**

In [15]:
%%time 
from sagemaker.pytorch import PyTorch
from datetime import datetime
data_base_path = "s3://p5-amazon-bin-images/webdataset/"
train_data_path = data_base_path + "train/train-shard-{000000..000007}.tar"
val_data_path = data_base_path + "val/val-shard-{000000..000001}.tar"
test_data_path = data_base_path + "test/test-shard-{000000..000001}.tar"
print(train_data_path)
output_path = "s3://p5-amazon-bin-images-train/"  
## Manually set dataset sizes hyperparameters
dataset_size = 10441  ## 10K dataset
split_ratio = [0.7, 0.15, 0.15]
train_data_size = int(dataset_size * split_ratio[0])
val_data_size = int(dataset_size * split_ratio[1])
test_data_size = dataset_size - train_data_size - val_data_size
print(f"train_size: {train_data_size}, val_size: {val_data_size}, test_size: {test_data_size}")
## s3://p5-amazon-bin-images/webdataset/train/train-shard-{000000..000007}.tar
## train_size: 7308, val_size: 1566, test_size: 1567
hyperparameters = {
    'epochs': 40,   
    'batch-size': 128,   
    'opt-type': 'sgd',         ## 👉 SGD optimizer
    'opt-momentum': 0.9,       ## 👉 SGD optimizer
    'opt-learning-rate': 0.1,  ## 👉 SGD optimizer
    # 'opt-weight-decay': 1e-4,  
    'lr-sched-step-size': 10,  ## 👉 optimizer learning rate scheduler
    # 'lr-sched-gamma': 0.5,
    'early-stopping-patience': 1,  ## set larger than epochs if no early stopping
    'model-arch': 'resnet34', 
    'wandb': False,  
    'debug': False, 
## input data 
    "train-data-path": train_data_path,
    "val-data-path": val_data_path,
    "test-data-path": test_data_path,
    "train-data-size": train_data_size, 
    "val-data-size": val_data_size,
    "test-data-size": test_data_size,
    "class-weights-dict": {
        1: 1.7004885993485341, 
        2: 0.9083079599826012, 
        3: 0.7832708177044261, 
        4: 0.8799831436999579, 
        5: 1.1137066666666666
    },
}
estimator = PyTorch(
    entry_point='train_early_stop.py',  # Training script that builds the model ('model-arch', ResNet34 here) and runs the training loop
    source_dir='../scripts_train',  # Directory where your script and dependencies are stored
    role=sagemaker_role_arn,
    framework_version='1.13.1',  # Use the PyTorch version you need
    py_version='py39',
    instance_count=2,  ## multi-instance training, Udacity account level limit 2
    instance_type='ml.g4dn.xlarge',  ## 16GB, 1 GPU per instance
    output_path=output_path,  ## if not specify, output to the sagemaker default bucket
    hyperparameters=hyperparameters,
    distribution={"smdistributed": {"dataparallel": { "enabled": True}}},  # mpirun, activates SMDDP AllReduce OR AllGather
) 
job_name = f"p5-amazon-bin-job-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
estimator.fit(
    wait=True,  
    job_name=job_name, 
)

s3://p5-amazon-bin-images/webdataset/train/train-shard-{000000..000007}.tar
train_size: 7308, val_size: 1566, test_size: 1567


2025-02-08 05:39:12 Starting - Starting the training job...
2025-02-08 05:39:26 Starting - Preparing the instances for training...
2025-02-08 05:40:20 Downloading - Downloading the training image...............
2025-02-08 05:43:12 Training - Training image download completed. Training in progress...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
  "cipher": algorithms.TripleDES,
  "class": algorithms.TripleDES,
2025-02-08 05:43:22,942 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2025-02-08 05:43:22,963 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-02-08 05:43:22,977 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2025-02-08 05:43:22,981 sagemaker_pytorch_container.training INFO     Invoking SMDataParallel
2025-02-08 05:43:22,981 sagemaker_pytorch_container.training INFO     Invoking user tr
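
The `class-weights-dict` values in the training cell above are consistent with the common "balanced" heuristic, `weight = n_samples / (n_classes * class_count)`. The sketch below works under that assumption. The per-class counts are back-calculated from the weights and the 10441-image total, not read from the dataset, and `balanced_class_weights` is a hypothetical helper name.

```python
def balanced_class_weights(class_counts):
    """Return {label: n_samples / (n_classes * count)} for each class."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Assumed counts: back-calculated from the notebook's weights, not measured.
assumed_counts = {1: 1228, 2: 2299, 3: 2666, 4: 2373, 5: 1875}
weights = balanced_class_weights(assumed_counts)
for label, weight in weights.items():
    print(label, round(weight, 4))
```

Rare classes (here class 1) get weights above 1 and frequent classes get weights below 1, so each class contributes roughly equally to a weighted loss.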

In [16]:
import boto3
sagemaker_client = boto3.client('sagemaker')
response = sagemaker_client.describe_training_job(TrainingJobName=job_name) # type: ignore
status = response['TrainingJobStatus']
print(status)

Completed
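
The cell above reads the job status once. If `estimator.fit` is launched with `wait=False`, one option is to poll `describe_training_job` until the status is terminal. This is a minimal sketch: `wait_for_training_job` and its parameters are hypothetical names, and it accepts any client object exposing `describe_training_job`, so it can be tried without AWS access.

```python
import time

# Statuses after which a training job no longer changes.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(client, job_name, poll_seconds=30, max_polls=120):
    """Poll describe_training_job until the job reaches a terminal status.

    Returns (status, failure_reason); failure_reason is None unless the
    response carries a 'FailureReason' field.
    """
    for _ in range(max_polls):
        response = client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status, response.get("FailureReason")
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} still running after {max_polls} polls")
```

With the notebook's variables this would be called as `wait_for_training_job(sagemaker_client, job_name)`.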


In [18]:
import torch
tensor_early_stop = torch.tensor(0, dtype=torch.int32)
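tensor_early_stop = 1  ## ⚠️ rebinds the name to a Python int and discards the tensor; use tensor_early_stop.fill_(1) to update it in place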
print(tensor_early_stop, type(tensor_early_stop))

1 <class 'int'>


⚠️ Issue: an SMDDP `allGather` call timed out after 120 seconds

```
 terminate called after throwing an instance of ':SMDDPTimeoutError:'
 what()
     Timeout: A call to 'allGather' has taken over 120.000000 seconds. Terminating the distributed job.It might be one of the workers failed during forward and backward propagation and failed to call "allGather".
     Extend timeout using dist.init_process_group(timeout=timedelta(minutes=60)
     Extend timeout using dist.init(timeout=timedelta(minutes=60)
```

⚠️ Issue: SMDDP `broadcast` and `allReduce` calls timed out after 120 seconds
```
     Timeout: A call to 'broadcast' has taken over 120.000000 seconds. Terminating the distributed job.It might be one of the workers failed during forward and backward propagation and failed to call "broadcast".
     Extend timeout using dist.init_process_group(timeout=timedelta(minutes=60)
 ...
     Timeout: A call to 'allReduce' has taken over 120.000000 seconds. Terminating the distributed job.It might be one of the workers failed during forward and backward propagation and failed to call "allReduce".
```

⚠️ Issue: `dist.broadcast` of the early-stop flag timed out on rank 1
```
[1,mpirank:1,algo-2]<stdout>:  File "train_early_stop.py", line 449, in main
[1,mpirank:1,algo-2]<stdout>:    dist.broadcast(braodcast_early_stop, src=0)  ## one to all, src is the process rank
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/smdistributed/dataparallel/torch/distributed.py", line 156, in wrapper
[1,mpirank:1,algo-2]<stdout>:    return func(*args, **kwargs)
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/smdistributed/dataparallel/__init__.py", line 58, in wrapper
[1,mpirank:1,algo-2]<stdout>:    return func(*args, **kwargs)
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/smdistributed/dataparallel/torch/distributed.py", line 200, in broadcast
[1,mpirank:1,algo-2]<stdout>:    return torchdst.broadcast(tensor, src=src, group=None, async_op=async_op)
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1408, in broadcast
[1,mpirank:1,algo-2]<stdout>:    work.wait()
[1,mpirank:1,algo-2]<stdout>:RuntimeError: Timeout: A call to a collective SMDDP operation has taken over 120 seconds. Terminating the distributed job.
[1,mpirank:1,algo-2]<stderr>:terminate called after throwing an instance of '--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI COMMUNICATOR 4 DUP FROM 0
```
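
The timeouts above are what one would expect if the ranks stop agreeing on when to call a collective, for example if one rank leaves the training loop on early stopping while the other still waits in `broadcast`. That cause is an assumption; the logs do not confirm it. The toy simulation below uses Python threads as stand-in ranks and a `threading.Barrier` as a stand-in collective. It is not SMDDP or `torch.distributed`, and `simulate` and its parameters are hypothetical names.

```python
import threading


def simulate(world_size=2, max_epochs=5, stop_epoch=1,
             all_ranks_sync=True, timeout=0.5):
    """Toy model of early stopping across ranks; returns {rank: outcome}."""
    barrier = threading.Barrier(world_size)
    flags = [0] * world_size  # one stop flag per rank
    outcome = {}

    def all_reduce_sum(rank, value):
        # Stand-in collective: every rank must call this, or the others time out.
        flags[rank] = value
        barrier.wait(timeout)  # all ranks have written their flag
        total = sum(flags)
        barrier.wait(timeout)  # all ranks have read the total
        return total

    def worker(rank):
        try:
            for epoch in range(max_epochs):
                # Only rank 0 decides to stop, as if it alone ran validation.
                local_stop = int(rank == 0 and epoch == stop_epoch)
                if not all_ranks_sync and local_stop:
                    # Broken pattern: rank 0 exits without calling the collective.
                    outcome[rank] = f"stopped alone at epoch {epoch}"
                    return
                if all_reduce_sum(rank, local_stop) > 0:
                    outcome[rank] = f"stopped together at epoch {epoch}"
                    return
            outcome[rank] = "finished all epochs"
        except threading.BrokenBarrierError:
            outcome[rank] = "timeout waiting for collective"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


print(simulate(all_ranks_sync=True))
print(simulate(all_ranks_sync=False))
```

In the synchronized run both ranks stop in the same epoch. In the other run rank 0 exits alone and rank 1 times out in the stand-in collective, which mirrors the shape of the error above.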

⚠️ Issue: `dist.all_reduce` raised a `TypeError` when passed `op=torch.distributed.ReduceOp.SUM`

```
[1,mpirank:1,algo-2]<stdout>:  File "train_early_stop.py", line 450, in main
[1,mpirank:1,algo-2]<stdout>:    dist.all_reduce(broadcast_early_stop, op=torch.distributed.ReduceOp.SUM)
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/smdistributed/dataparallel/torch/distributed.py", line 156, in wrapper
[1,mpirank:1,algo-2]<stdout>:    return func(*args, **kwargs)
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/smdistributed/dataparallel/__init__.py", line 58, in wrapper
[1,mpirank:1,algo-2]<stdout>:    return func(*args, **kwargs)
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/smdistributed/dataparallel/torch/distributed.py", line 192, in all_reduce
[1,mpirank:1,algo-2]<stdout>:    op=ReduceOpMap[op],
[1,mpirank:1,algo-2]<stdout>:TypeError: __eq__(): incompatible function arguments. The following argument types are supported:
[1,mpirank:1,algo-2]<stdout>:    1. (self: torch._C._distributed_c10d.ReduceOp, arg0: c10d::ReduceOp::RedOpType) -> bool
[1,mpirank:1,algo-2]<stdout>:    2. (self: torch._C._distributed_c10d.ReduceOp, arg0: torch._C._distributed_c10d.ReduceOp) -> bool
[1,mpirank:1,algo-2]<stdout>:
[1,mpirank:1,algo-2]<stdout>:Invoked with: <torch.distributed.distributed_c10d.ReduceOp object at 0x7f52fa00e8b0>, 0
```
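
The traceback above ends in `ReduceOpMap[op]`, with `__eq__` invoked on a `ReduceOp` object and the integer `0`. One reading, which is an assumption and not confirmed against the library source, is that a mapping lookup triggered an equality comparison between the `ReduceOp` and an integer key, and that `ReduceOp.__eq__` rejects integers. The toy below reproduces that pattern in plain Python. `StrictOp`, `op_names`, and `lookup` are hypothetical names, not part of PyTorch or smdistributed.

```python
class StrictOp:
    """Toy enum-like object whose __eq__ refuses to compare with other types."""

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        # Hashes like its integer value, so it collides with the int key.
        return hash(self.value)

    def __eq__(self, other):
        if not isinstance(other, StrictOp):
            raise TypeError(f"__eq__(): incompatible function arguments: {other!r}")
        return self.value == other.value


op_names = {0: "sum", 1: "product"}  # lookup table keyed by plain integers


def lookup(op):
    """Return the table entry, or the TypeError message if the lookup fails."""
    try:
        return op_names[op]
    except TypeError as exc:
        return f"TypeError: {exc}"


print(lookup(0))                  # integer key: plain dict hit
print(lookup(StrictOp(0)))        # hash collision forces __eq__, which raises
print(lookup(StrictOp(0).value))  # looking up by the integer value works
```

A dict compares keys with `==` whenever two hashes match, so an object with a strict `__eq__` can raise from inside an ordinary lookup.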