* Notebook created by nov05 on 2024-12-29
* It was run locally with conda env `sagemaker_py310`.  

---   
* View the S3 bucket in your account   
    https://s3.console.aws.amazon.com/s3/buckets/aft-vbi-pds
* [Docs > Models and pre-trained weights > ResNet > resnet34](https://pytorch.org/vision/main/models/generated/torchvision.models.resnet34.html)  
* GitHub gist [code snippets](https://gist.github.com/nov05/95cb7edcbe2e8bb68c9d29bdc00b9ca8)   

* Check [SageMaker AI Pricing](https://aws.amazon.com/sagemaker-ai/pricing/) > On-Demand Pricing > Training  
    | Instance Type  | vCPU | Memory | Price per Hour |
    |----------------|------|--------|----------------|
    | ml.g4dn.xlarge | 4    | 16 GiB | $0.736         |
    | ml.p3.2xlarge  | 8    | 61 GiB | $3.825         |

* Documentation > Amazon SageMaker > Developer Guide   
  [**Use the PyTorch framework estimators in the SageMaker Python SDK**](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-framework-estimator.html)    

* sagemaker 2.239.0  
  [**PyTorch Guide to SageMaker’s distributed data parallel library**](https://sagemaker.readthedocs.io/en/stable/api/training/sdp_versions/v1.0.0/smd_data_parallel_pytorch.html)  

In [2]:
%pwd

'd:\\github\\udacity-nd009t-capstone-starter\\starter'

In [33]:
## windows cmd to launch notepad to edit aws credential file
# !notepad C:\Users\guido\.aws\config
!notepad C:\Users\guido\.aws\credentials

In [34]:
## reset the session after updating credentials
import boto3 # type: ignore
boto3.DEFAULT_SESSION = None
import sagemaker # type: ignore
from sagemaker import get_execution_role # type: ignore

# Extract and print the account ID
sts_client = boto3.client('sts')
response = sts_client.get_caller_identity() 
account_id = response['Account']

role_arn = get_execution_role()  ## get role ARN
## Defaults, so both names are defined even when role_arn is already the execution role
voclabs_role_arn = None
sagemaker_role_arn = role_arn
if 'AmazonSageMaker-ExecutionRole' not in role_arn:
    ## Go to "IAM - Roles", search for "SageMaker", find the execution role.
    voclabs_role_arn = role_arn
    sagemaker_role_arn = "arn:aws:iam::570668189909:role/service-role/AmazonSageMaker-ExecutionRole-20250126T194519"
session = sagemaker.Session()  ## "default"
region = session.boto_region_name
bucket = session.default_bucket()

print(f"Current AWS Account ID: {account_id}")
print("AWS Region: {}".format(region))
print("Default Bucket: {}".format(bucket))
print(f"Role voclabs ARN: {voclabs_role_arn}") ## If local, Role ARN: arn:aws:iam::807711953667:role/voclabs
print("SageMaker Role ARN: {}".format(sagemaker_role_arn)) 

## generate secrets.env. remember to add it to .gitignore  
import wandb
wandb.sagemaker_auth(path="../secrets") 

## get my own AWS account info
def get_secrets(name):
    path = '../secrets/' + name
    with open(path, 'r') as file:
        for line in file:
            return line.strip()
aws_account_number = get_secrets('aws_account_number')
aws_account_profile = get_secrets('aws_account_profile')

Current AWS Account ID: 570668189909
AWS Region: us-east-1
Default Bucket: sagemaker-us-east-1-570668189909
Role voclabs ARN: arn:aws:iam::570668189909:role/voclabs
SageMaker Role ARN: arn:aws:iam::570668189909:role/service-role/AmazonSageMaker-ExecutionRole-20250126T194519


## 👉 **Data Preparation** (Run only once)
**TODO:** Run the cell below to download the data.

The cell below creates a folder called `data`, downloads the training data, and arranges it in subfolders. Each subfolder contains images in which the number of objects equals the folder name. For instance, every image in folder `1` contains exactly 1 object. The images are not divided into training, validation, or test sets. If you feel the number of samples is not enough, you can always download more data (instructions can be found [here](https://registry.opendata.aws/amazon-bin-imagery/)). However, we are not assessing you on the accuracy of your final trained model, but on how you create your machine learning engineering pipeline.
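To check the folder layout described above after a download, a small helper can count the files in each class folder. This is a minimal sketch, not part of the starter code: the function name `count_images_per_class` is hypothetical, and it assumes the images sit directly inside numbered subfolders (for example `../data/bin-images/1`).

```python
from pathlib import Path


def count_images_per_class(root, extension=".jpg"):
    """Return {folder_name: number_of_files_with_extension} for each subfolder of root.

    Hypothetical helper: assumes one subfolder per class, named after the
    number of objects in its images.
    """
    counts = {}
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() == extension
        )
    return counts
```

Calling `count_images_per_class("../data/bin-images")` should then return one entry per class folder, which makes it easy to spot an incomplete download.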

* View the S3 bucket in your account   
    https://s3.console.aws.amazon.com/s3/buckets/aft-vbi-pds
    

In [None]:
import os
import json
from tqdm import tqdm
import boto3
from botocore import UNSIGNED
from botocore.client import Config

def download_and_arrange_data(
        prefix='bin-images', 
        file_extension='.jpg',
        download_dir='../data/bin-images',
        partition=True):
    
    s3_client = boto3.client('s3', config=Config(signature_version=UNSIGNED))  ## public access

    ## There are 140536 image file names in the list. 
    with open('file_list.json', 'r') as f:
        d = json.load(f)

    for k, v in d.items():  ## There are 5 items (for 5 classes) in the JSON file.
        print(f"Downloading images/metadata of images with {k} object...")
        ## Build the target folder per class. Don't overwrite download_dir,
        ## or the class folders nest (e.g. 1/2/3) across iterations.
        target_dir = os.path.join(download_dir, k) if partition else download_dir
        os.makedirs(target_dir, exist_ok=True)
        for file_path in tqdm(v):
            file_name = os.path.basename(file_path).split('.')[0] + file_extension
            s3_client.download_file(
                'aft-vbi-pds', 
                prefix+'/'+file_name,  ## e.g. metadata/100313.json
                os.path.join(target_dir, file_name))
            
## download the 10K-dataset metadata, 17.9 MB, 56m 57.4s
download_and_arrange_data(
    prefix='metadata', 
    file_extension='.json',
    download_dir='../data/metadata',
    partition=False)
print("total metadata file number:", 1228 + 2299 + 2666 + 2373 + 1875)

```text
Downloading images/metadata of images with 1 object...
100%|██████████| 1228/1228 [06:36<00:00,  3.09it/s]
Downloading images/metadata of images with 2 object...
100%|██████████| 2299/2299 [12:38<00:00,  3.03it/s]
Downloading images/metadata of images with 3 object...
100%|██████████| 2666/2666 [14:35<00:00,  3.04it/s]
Downloading images/metadata of images with 4 object...
100%|██████████| 2373/2373 [12:54<00:00,  3.06it/s]
Downloading images/metadata of images with 5 object...
100%|██████████| 1875/1875 [10:11<00:00,  3.07it/s]

total metadata file number: 10441
```

## 👉 **Convert 10K-dataset on S3 to WebDataset tar files with SageMaker ScriptProcessor on a custom image** (Run only once) 

In [None]:
%%time
## TODO: Perform any data cleaning or data preprocessing
## This cell shuffles the 10K dataset, splits it into train, val, and test sets,
## and converts them to WebDataset tar files for SageMaker FastFile input mode.
from sagemaker.processing import ScriptProcessor
processor = ScriptProcessor(
    command=['python3'],
    ## You can use a custom image or use the default SageMaker image
    ## You can pull from AWS ECR or DockerHub
    image_uri=f'{aws_account_number}.dkr.ecr.us-east-1.amazonaws.com/udacity/p5-amazon-bin-images:latest', 
    role=sagemaker_role_arn,  # Execution role
    instance_count=1,
    instance_type='ml.t3.large',  # Use the appropriate instance type
    volume_size_in_gb=10,  # Minimal disk space since we're streaming
    base_job_name='p5-amazon-bin-images' 
)
processor.run(
    code='../scripts_process/convert_to_webdataset_10k.py',  # process the 10K files in the list
    arguments=[
        '--SM_INPUT_BUCKET', 'aft-vbi-pds',
        '--SM_INPUT_PREFIX_IMAGES', 'bin-images/',
        '--SM_INPUT_PREFIX_METADATA', 'metadata/',
        '--SM_OUTPUT_BUCKET', 'p5-amazon-bin-images',
        '--SM_OUTPUT_PREFIX', 'webdataset/',
    ]
)
## It took about 13 minutes to process 10.4K files (1.2 GB). If we keep 1K files per shard, 
## processing 500K files could take around 11 hours. I’ll probably increase it to 10K 
## files per shard, which would make each tar file around 1 GB and speed up the process.
## CPU times: total: 21.9 s
## Wall time: 12min 58s

........................................................................Starting data processing...
🟢 File list successfully loaded from s3://p5-amazon-bin-images/file_list.json
    Total number of image files: 10441
# writing train-shard-000000.tar 0 0.0 GB 0
# writing train-shard-000001.tar 1000 0.1 GB 1000
# writing train-shard-000002.tar 1000 0.1 GB 2000
# writing train-shard-000003.tar 1000 0.1 GB 3000
# writing train-shard-000004.tar 1000 0.1 GB 4000
# writing train-shard-000005.tar 1000 0.1 GB 5000
# writing train-shard-000006.tar 1000 0.1 GB 6000
# writing train-shard-000007.tar 1000 0.1 GB 7000
🟢 Successfully uploaded shard files to s3://p5-amazon-bin-images/webdataset/train/:
    ['train-shard-000000.tar', 'train-shard-000001.tar', 'train-shard-000002.tar', 'train-shard-000003.tar', 'train-shard-000004.tar', 'train-shard-000005.tar', 'train-shard-000006.tar', 'train-shard-000007.tar']
# writing val-shard-000000.tar 0 0.0 GB 0
# writing val-shard-000001.tar 1000 0.1 GB 1000
🟢 


## 👉 **Dataset**  

**TODO:** Explain what dataset you are using for this project. Give a small overview of the classes, class distributions etc that can help anyone not familiar with the dataset get a better understanding of it. You can find more information about the data [here](https://registry.opendata.aws/amazon-bin-imagery/).  
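As a starting point for the class overview, the per-class counts reported in the metadata download log (1228, 2299, 2666, 2373, 1875 for 1 to 5 objects) can be turned into a class distribution and a set of loss weights. The sketch below assumes the common "balanced" formula `total / (n_classes * class_count)`; the `class-weights-dict` values used in training appear to match it, but that is an inference from the numbers, not something the training script confirms. The function name is hypothetical.

```python
# Per-class image counts taken from the download log (label = number of objects).
class_counts = {1: 1228, 2: 2299, 3: 2666, 4: 2373, 5: 1875}


def balanced_class_weights(counts):
    """Weight each class by total / (n_classes * class_count).

    Assumed formula ("balanced" weighting): rarer classes get weights above 1.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


total_images = sum(class_counts.values())
class_share = {label: count / total_images for label, count in class_counts.items()}
class_weights = balanced_class_weights(class_counts)

for label in class_counts:
    print(f"{label} object(s): {class_counts[label]:5d} images, "
          f"share {class_share[label]:.1%}, weight {class_weights[label]:.4f}")
```

The classes are moderately imbalanced: images with 3 objects are the most common and images with 1 object the least, which is why the 1-object class gets the largest weight.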

In [35]:
from sagemaker.pytorch import PyTorch
# from sagemaker.inputs import TrainingInput
data_base_path = "s3://p5-amazon-bin-images/webdataset/"
# train_data = TrainingInput(data_base_path + "train/", 
#                            content_type="application/x-tar")
# val_data = TrainingInput(data_base_path + "val/", 
#                          content_type="application/x-tar")
# test_data = TrainingInput(data_base_path + "test/", 
#                           content_type="application/x-tar")
train_data_path = data_base_path + "train/train-shard-{000000..000007}.tar"
val_data_path = data_base_path + "val/val-shard-{000000..000001}.tar"
test_data_path = data_base_path + "test/test-shard-{000000..000001}.tar"
print(train_data_path)
## ⚠️ Don't use a prefix in output_path, because the source folder will be created 
## at the bucket level, while other folders, e.g. debug-output, are created at the prefix level.
output_path = "s3://p5-amazon-bin-images-train/"  

## Manually set the dataset size hyperparameters
total_size = 10441  ## total number of images in the 10K dataset
split_ratio = [0.7, 0.15, 0.15]  ## train, val, test
train_data_size = int(total_size*split_ratio[0])
val_data_size = int(total_size*split_ratio[1])
test_data_size = total_size - train_data_size - val_data_size
print(f"train_size: {train_data_size}, val_size: {val_data_size}, test_size: {test_data_size}")

s3://p5-amazon-bin-images/webdataset/train/train-shard-{000000..000007}.tar
train_size: 7308, val_size: 1566, test_size: 1567
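The shard paths above use brace notation (`{000000..000007}`), which WebDataset expands into one URL per shard. To see what such a pattern resolves to without installing anything, the expansion can be sketched with the standard library. This is an illustration only: `expand_shards` is a hypothetical helper that handles a single zero-padded numeric range, not a replacement for the expansion WebDataset performs itself.

```python
import re


def expand_shards(pattern):
    """Expand one zero-padded numeric brace range, e.g. 'shard-{000..002}.tar'.

    Hypothetical helper: supports a single '{start..end}' range only.
    """
    match = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if match is None:
        return [pattern]  # nothing to expand
    start, end = match.group(1), match.group(2)
    width = len(start)  # keep the zero padding of the original pattern
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [f"{prefix}{str(i).zfill(width)}{suffix}"
            for i in range(int(start), int(end) + 1)]


train_urls = expand_shards(
    "s3://p5-amazon-bin-images/webdataset/train/train-shard-{000000..000007}.tar"
)
print(len(train_urls), train_urls[0], train_urls[-1])
```

A quick check like this helps confirm that the range in the pattern matches the shard files that were actually uploaded.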


## 👉 **Model Training (Distributed Data Parallel)**  

**TODO:** This is the part where you can train a model. The type or architecture of the model you use is not important.   
**Note:** You will need to use the `train.py` script to train your model.

* Official document: [SageMaker distributed data parallel (SDP) with PyTorch](https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/index.html#pytorch-distributed)   

* ⚠️ [Training issues](https://gist.github.com/nov05/1bdc15eda0e781640b46ab28d38f45bd)   
* Training times
    * 2 epochs; train, val, test data sizes: 2K, 1K, 1K
        ```
        2025-02-06 22:26:36 Completed - Training job completed
        Training seconds: 874
        Billable seconds: 874
        CPU times: total: 14 s
        Wall time: 8min 45s
        ```
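The billable seconds reported above can be turned into a rough cost figure using the on-demand prices from the pricing table at the top of the notebook. This is a sketch under stated assumptions: the prices are the ones quoted in this notebook and may change, the function name is hypothetical, and the log does not say which instance type or count that particular run used. SageMaker bills the billable time multiplied by the number of instances.

```python
# On-demand training prices (USD per hour) as quoted in the pricing table above.
PRICE_PER_HOUR = {
    "ml.g4dn.xlarge": 0.736,
    "ml.p3.2xlarge": 3.825,
}


def estimate_training_cost(billable_seconds, instance_type, instance_count=1):
    """Estimate cost as hours * hourly price * number of instances."""
    hours = billable_seconds / 3600
    return hours * PRICE_PER_HOUR[instance_type] * instance_count


# Assumption for illustration: the 874-second run used 2 x ml.g4dn.xlarge.
cost = estimate_training_cost(874, "ml.g4dn.xlarge", instance_count=2)
print(f"Estimated cost: ${cost:.2f}")
```

Under that assumption the run costs roughly $0.36. The same job on `ml.p3.2xlarge` would cost about five times more per billable second, so it only pays off if it shortens training by a similar factor.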

In [None]:
#TODO: Declare your model training hyperparameter.
#NOTE: You do not need to do hyperparameter tuning. You can use fixed hyperparameter values
from sagemaker.debugger import (
    Rule,
## debugger
    DebuggerHookConfig,
    rule_configs,
## profiler 
    ProfilerRule,
    ProfilerConfig,
    FrameworkProfile
)
## SageMaker will automatically append these as command-line arguments  
hyperparameters = {
    'epochs': 40,   
    'batch-size': 128,   
    'opt-learning-rate': 8e-5,  
    'opt-weight-decay': 1e-5,  
    'lr-sched-step-size': 5,  
    'lr-sched-gamma': 0.5,
    'early-stopping-patience': 5,
    'model-arch': 'resnet34', 
    'wandb': True,  
    'debug': False, 
## input data 
    "train-data-path": train_data_path,
    "val-data-path": val_data_path,
    "test-data-path": test_data_path,
    "train-data-size": train_data_size, 
    "val-data-size": val_data_size,
    "test-data-size": test_data_size,
    "class-weights-dict": {
        1: 1.7004885993485341, 
        2: 0.9083079599826012, 
        3: 0.7832708177044261, 
        4: 0.8799831436999579, 
        5: 1.1137066666666666
    },
}
rules = [
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]
hook_config = DebuggerHookConfig(
    hook_parameters={
        "train.save_interval": "100", 
        "eval.save_interval": "10"
    }
)
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500, 
    framework_profile_params=FrameworkProfile(num_steps=10)
)

In [None]:
%%time
#TODO: Create your training estimator
estimator = PyTorch(
    entry_point='train_draft.py',  # Your training script that defines the ResNet model (resnet34 via `model-arch`) and training loop
    source_dir='../scripts_train',  # Directory where your script and dependencies are stored
    role=sagemaker_role_arn,
    framework_version='1.13.1',  # Use the PyTorch version you need
    py_version='py39',
    instance_count=2,  ## multi-instance training, Udacity account level limit 2
    # instance_type='ml.p3.2xlarge',  ## 16GB, Use GPU instances for deep learning
    instance_type='ml.g4dn.xlarge',  ## 16GB, 1 GPU per instance
    output_path=output_path,  ## if not specify, output to the sagemaker default bucket
    hyperparameters=hyperparameters,
    # use_spot_instances=True,
## Debugger and profiler parameters
    # rules=rules,
    # debugger_hook_config=hook_config,    
    # profiler_config=profiler_config,
## Training using SMDataParallel Distributed Training Framework
    # distribution={"pytorchddp": {"enabled": True}}  # mpirun, activates SMDDP AllReduce OR AllGather
    # distribution={"torch_distributed": {"enabled": True}}  # torchrun, activates SMDDP AllGather
    distribution={"smdistributed": {"dataparallel": { "enabled": True}}},  # mpirun, activates SMDDP AllReduce OR AllGather
) 
# TODO: Fit your estimator
from datetime import datetime
estimator.fit(
    wait=True,  
    job_name=f"p5-amazon-bin-job-{datetime.now().strftime('%Y%m%d-%H%M%S')}", 
    ## Use WebDataset pipe to stream data instead 
    # inputs={
    #     "train": train_data,  
    #     "validation": val_data, 
    #     "test": test_data,
    # },  
)

2025-02-07 10:45:06 Starting - Starting the training job...
2025-02-07 10:45:21 Starting - Preparing the instances for training...
2025-02-07 10:46:00 Downloading - Downloading input data...
2025-02-07 10:46:25 Downloading - Downloading the training image..............bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
  "cipher": algorithms.TripleDES,
  "class": algorithms.TripleDES,
2025-02-07 10:49:24,176 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2025-02-07 10:49:24,201 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-02-07 10:49:24,215 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2025-02-07 10:49:24,218 sagemaker_pytorch_container.training INFO     Invoking SMDataParallel
2025-

* Baseline accuracy 28.125%, [wandb logs](https://wandb.ai/nov05/udacity-awsmle-resnet34-amazon-bin/runs/p5-amazon-bin-job-20250207-044502-xfnie3-algo-2?nw=nwusernov05)  
  <img src="https://raw.githubusercontent.com/nov05/pictures/refs/heads/master/Udacity/20241119_aws-mle-nanodegree/2025-02-07%2005_28_14-baseline-p5-amazon-bin-job-20250207-044502-xfnie3-algo-2%20_%20udacity-awsmle-resnet.jpg" width=800>  

* ⚠️✅ **Early stopping** without `dist.broadcast()` and `dist.barrier()` causes an `AllGather` error and eventually a timeout error. 
    I have added the implementation of the SMDDP synchronization for early stopping in `train_v1.py` (NOT in `train_draft.py`).    
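The failure mode is that one rank decides to stop while the others keep running and wait forever in the next collective call. A minimal sketch of the synchronization idea follows. It is not the code in `train_v1.py`: the names `EarlyStopping` and `sync_stop_flag` are hypothetical, rank 0 is assumed to make the decision, and the process group is assumed to be initialized by the training script (with GPU backends the flag tensor must live on the local GPU, hence the `device` argument).

```python
class EarlyStopping:
    """Signal a stop after `patience` epochs without validation-loss improvement."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def sync_stop_flag(should_stop, device="cpu"):
    """Broadcast rank 0's stop decision so every rank leaves the loop together."""
    import torch  # imported lazily so EarlyStopping works without PyTorch
    import torch.distributed as dist

    if not (dist.is_available() and dist.is_initialized()):
        return should_stop  # single-process run: nothing to synchronize
    flag = torch.tensor([1 if should_stop else 0], dtype=torch.int32, device=device)
    dist.broadcast(flag, src=0)  # every rank receives rank 0's decision
    dist.barrier()               # wait until all ranks have reached this point
    return bool(flag.item())
```

In the epoch loop, each rank would call `stop = sync_stop_flag(stopper.step(val_loss), device)` after validation and `break` only on the synchronized result, so no rank is left waiting in a later `AllGather` or `AllReduce`.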

### Standout Suggestions
You do not need to perform the tasks below to finish your project. However, you can attempt these tasks to turn your project into a more advanced portfolio piece.

### Hyperparameter Tuning
**TODO:** Here you can perform hyperparameter tuning to increase the performance of your model. You are encouraged to 
- tune as many hyperparameters as you can to get the best performance from your model
- explain why you chose to tune those particular hyperparameters and the ranges.


In [103]:
#TODO: Create your hyperparameter search space

In [104]:
#TODO: Create your training estimator

In [105]:
# TODO: Fit your estimator

In [106]:
# TODO: Find the best hyperparameters

### Model Profiling and Debugging
**TODO:** Use model debugging and profiling to better monitor and debug your model training job.

In [107]:
# TODO: Set up debugging and profiling rules and hooks

In [108]:
# TODO: Create and fit an estimator

In [109]:
# TODO: Plot a debugging output.

**TODO**: Is there some anomalous behaviour in your debugging output? If so, what is the error and how will you fix it?  
**TODO**: If not, suppose there was an error. What would that error look like and how would you have fixed it?

In [110]:
# TODO: Display the profiler output

### Model Deploying and Querying
**TODO:** Can you deploy your model to an endpoint and then query that endpoint to get a result?

In [111]:
# TODO: Deploy your model to an endpoint

In [112]:
# TODO: Run a prediction on the endpoint

In [113]:
# TODO: Remember to shutdown/delete your endpoint once your work is done

### Cheaper Training and Cost Analysis
**TODO:** Can you perform a cost analysis of your system and then use spot instances to lessen your model training cost?

In [114]:
# TODO: Cost Analysis

In [115]:
# TODO: Train your model using a spot instance

In [116]:
# TODO: Train your model on Multiple Instances