* notebook created by nov05 on 2024-12-05   
* windows os, powershell, conda env `awsmle_py310` (no cuda)    

---  

## **Issue**

* 🟢⚠️ Issue solved:     

  > ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateHyperParameterTuningJob 
  operation: The account-level service limit 'ml.g4dn.xlarge for training job usage' is 2 Instances, with current 
  utilization of 0 Instances and a request delta of 10 Instances. Please use AWS Service Quotas to request an 
  increase for this quota. If AWS Service Quotas is not available, contact AWS support to request an increase for 
  this quota.

  * You can still create an HPO job with any `max_jobs` value. However, with a quota of 2 instances (assuming one instance per training job), at most 2 training jobs can run concurrently, so set `max_parallel_jobs=2`. For example, if `max_jobs` is 20 and each training job takes about an hour, the jobs run in 10 rounds of 2 and the entire HPO job takes at least 10 hours.

  * To request a higher limit, go to `Service Quotas > AWS services > Amazon SageMaker` and search for `ml.g4dn.xlarge`.  

    <img src="https://raw.githubusercontent.com/nov05/pictures/refs/heads/master/Udacity/20241119_aws-mle-nanodegree/2024-12-03%2002_03_35-Quotas%20list%20-%20Amazon%20SageMaker%20_%20AWS%20Service%20Quotas.jpg" width=600>  

    <img src="https://raw.githubusercontent.com/nov05/pictures/refs/heads/master/Udacity/20241119_aws-mle-nanodegree/2024-12-03%2002_06_13-Quotas%20list%20-%20Amazon%20SageMaker%20_%20AWS%20Service%20Quotas.jpg" width=600>  
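
The time estimate above is a ceiling division: the jobs run in rounds of `max_parallel_jobs`. A minimal sketch of that arithmetic, assuming every training job takes about the same time and uses one instance (the function name `estimate_hpo_hours` is made up for illustration):

```python
import math


def estimate_hpo_hours(max_jobs: int, max_parallel_jobs: int, hours_per_job: float) -> float:
    """Lower bound on tuning time when jobs run in rounds of `max_parallel_jobs`."""
    if max_jobs < 1 or max_parallel_jobs < 1:
        raise ValueError("max_jobs and max_parallel_jobs must be positive")
    rounds = math.ceil(max_jobs / max_parallel_jobs)  # number of sequential rounds
    return rounds * hours_per_job


# Values from the example above: 20 jobs, 2 at a time, about 1 hour each.
print(estimate_hpo_hours(max_jobs=20, max_parallel_jobs=2, hours_per_job=1.0))
```

This is only a lower bound; instance start-up time and uneven job durations are ignored.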

---  

## **Issue**  

* 🟢⚠️ Issue solved: This cell keeps running and never returns. The endpoint's CloudWatch log shows a 500 return code for `/ping`. If the `scripts\inference.py` file is removed, it shows 200.   

    ```text
    2024-12-05T14:18:14,087 [INFO ] W-9002-model_1.0-stdout MODEL_LOG -     self._model = self._run_handler_function(self._model_fn, *(model_dir,))  
    2024-12-05T14:18:14,087 [INFO ] W-9002-model_1.0-stdout MODEL_LOG -     raise ModelLoadError(  
    2024-12-05T14:18:14,087 [INFO ] W-9002-model_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/default_pytorch_inference_handler.py", line 80, in default_model_fn  
    2024-12-05T14:20:18,535 [INFO ] W-9002-model_1.0 ACCESS_LOG - /169.254.178.2:53522 "GET /ping HTTP/1.1" 500 1
    ```   

    ```text 
    2024-12-05T14:18:14,087 [INFO ] W-9002-model_1.0-stdout MODEL_LOG - sagemaker_pytorch_serving_container.default_pytorch_inference_handler.ModelLoadError: Failed to load /tmp/models/d40fd8f5cf5f48fb9dfa71137e4db3d9/model/model.pth. Please ensure model is saved using torchscript.   
    ```

* Solution: The training code below causes the failure when the endpoint loads the model, because `torch.save(model.state_dict(), f)` saves only the model's state dictionary (its parameters), not the model architecture. As the `ModelLoadError` message indicates, the serving container's default `model_fn` expects a model saved in TorchScript format.  

    ```python
    ## TODO: Save the trained model
    path = os.path.join(args.model_dir, 'model.pth')
    with open(path, 'wb') as f:
        torch.save(model.state_dict(), f)
    print(f"Model saved at '{path}'")
    ```

* Reference:  
    * https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html  
      > TorchScript is an intermediate representation of a PyTorch model (subclass of nn.Module) that can then be run in a high-performance environment such as C++.

* In the `scripts\train.py` file, replace the saving code with the following function.  

    ```python
    def save(model, model_dir, model_name='model.pt'):
        ## ⚠️ Please ensure model is saved using torchscript.
        model.eval()
        path = os.path.join(model_dir, model_name)
        ## save model weights
        # with open(path, 'wb') as f:
        #     torch.save(model.state_dict(), f)
        ## If your model is simple and has a straightforward forward pass, use torch.jit.trace
        # example_input = torch.randn(1, 3, 224, 224)
        # traced_model = torch.jit.trace(model, example_input)
        # traced_model.save(path)
        ## If your model has dynamic control flow (like if statements based on input), use torch.jit.script
        scripted_model = torch.jit.script(model)
        scripted_model.save(path) 
        print(f"Model saved at '{path}'")
    ```

* For model weights that were already saved to `args.model_dir` as a state dict, download the artifact, load the weights into the original model architecture, and convert the model to TorchScript.   
    * AWS S3 URI: `s3://p3-dog-breed-image-classification/jobs/p3-dog-breeds-debug-20241204-124107/output/model.tar.gz`  

In [1]:
## current dir
%pwd

'd:\\github\\udacity-CD0387-deep-learning-topics-within-computer-vision-nlp-project-starter'

In [2]:
## unpack the file
!tar -xzvf data\models\resnet50_best.tar.gz -C data\models\

x model.pth


In [3]:
## rename the model file
import os
os.rename(r'data\models\model.pth', r'data\models\old_model.pth')

In [4]:
%%time
import torch
import torchvision
import torch.nn as nn
## rebuild the architecture used in training, then load the saved weights
old_model_file = r"data\models\old_model.pth"
model_type = 'resnet50'
num_classes = 133
## `pretrained` is deprecated since torchvision 0.13; these weights are overwritten below anyway
model = getattr(torchvision.models, model_type)(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, num_classes)
model.load_state_dict(torch.load(old_model_file, map_location=torch.device('cpu')))
model.eval()  # Put model in evaluation mode
## convert to TorchScript and save under the file name the endpoint expects
scripted_model = torch.jit.script(model)
model_file = r"data\models\model.pth"
scripted_model.save(model_file)



CPU times: total: 2.53 s
Wall time: 4.72 s


In [None]:
!tar -czvf data/models/model.tar.gz data/models/model.pth
## in windows wsl
# gzip -c model.pth > model.gz
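
Note that the `tar` command above stores the file inside the archive under its full relative path, `data/models/model.pth`, so after extraction the model file sits in a `data/models/` subfolder rather than at the archive root. A minimal sketch of packaging with Python's `tarfile` module so that `model.pth` lands at the root (the helper name `package_model` is hypothetical; the paths follow the cells above):

```python
import os
import tarfile


def package_model(model_file: str, archive_path: str) -> list[str]:
    """Write `model_file` to a .tar.gz archive at the archive root and return the member names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname drops the directory part, so the archive contains just "model.pth"
        tar.add(model_file, arcname=os.path.basename(model_file))
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


if __name__ == "__main__":
    model_file = os.path.join("data", "models", "model.pth")
    if os.path.exists(model_file):
        print(package_model(model_file, os.path.join("data", "models", "model.tar.gz")))
```

On the command line, `tar -czvf data/models/model.tar.gz -C data/models model.pth` should give the same layout, since `-C` changes into the directory before adding the file.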

---   

## **Issue**

* **🟢⚠️ Issue:**  
    ```text
    2024-12-05T18:36:37,292 [INFO ] W-9003-model_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/default_pytorch_inference_handler.py", line 73, in default_model_fn
    2024-12-05T18:36:37,292 [INFO ] W-9003-model_1.0-stdout MODEL_LOG -     raise ValueError(
    2024-12-05T17:34:29,292 [INFO ] W-9002-model_1.0-stdout MODEL_LOG - ValueError: Exactly one .pth or .pt file is required for PyTorch models: []  
    ```

* **Solution**: Use a custom `model_fn` function in `inference.py`.   

    https://github.com/aws/sagemaker-pytorch-inference-toolkit/blob/master/src/sagemaker_pytorch_serving_container/default_pytorch_inference_handler.py   

    > For PyTorch, a default function to load a model only if Elastic Inference is used.  
    > In other cases, users should provide customized model_fn() in script.  

    ```python  
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model_path = os.path.join(model_dir, DEFAULT_MODEL_FILENAME)
    if not os.path.exists(model_path):
        model_files = [file for file in os.listdir(model_dir) if self._is_model_file(file)]
        if len(model_files) != 1:
            raise ValueError(
                "Exactly one .pth or .pt file is required for PyTorch models: {}".format(model_files)
            )
    ```


* Read the endpoint container logs in AWS CloudWatch  
 
  <img src="https://raw.githubusercontent.com/nov05/pictures/refs/heads/master/Udacity/20241119_aws-mle-nanodegree/2024-12-05%2013_54_00-CloudWatch%20_%20us-east-1.jpg" width=600>

* Input arg model_dir=`/tmp/models/4765a8173953463fa048dfd3f5c0f889/model`
    ```text
    2024-12-05T20:07:11,559 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - 👉 Model path: /tmp/models/4765a8173953463fa048dfd3f5c0f889/model/model.pth` 
    ```

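* In this case, there is no `model.pth` in the model directory, probably because the file was packaged into the `.tar.gz` archive improperly: the extracted archive contains a `data` directory instead of `model.pth` at its root.   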

    ```text
    2024-12-05T20:29:09,478 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - 👉 Model dir: /tmp/models/5a7d4e053bc64df2a6a385971f122d3c/model, type: <class 'str'>   
    2024-12-05T20:29:09,478 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Directory: data    
    2024-12-05T20:29:09,478 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Directory: code   
    ```

* With a correctly packaged archive, the log should look like this.   

    ```text
    2024-12-05T22:38:18,849 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - model_name: model, batchSize: 1
    2024-12-05T22:38:18,979 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - 🟢 Loading model...
    2024-12-05T22:38:18,980 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - 👉 Device: cpu
    2024-12-05T22:38:18,981 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - 👉 Model dir: /opt/ml/model, type: <class 'str'>
    2024-12-05T22:38:18,982 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File: model.pth
    2024-12-05T22:38:18,982 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Directory: code
    ```
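
The custom `model_fn` from the solution above could look roughly like the sketch below. It assumes the model was saved as TorchScript and that exactly one `.pt` or `.pth` file sits at the top level of `model_dir`. The helper `find_model_file` is a made-up name, not part of the SageMaker toolkit, and the directory listing in its error message is there only to make packaging mistakes visible in the logs.

```python
import os


def find_model_file(model_dir: str) -> str:
    """Return the path of the single .pt/.pth file at the top level of `model_dir`."""
    candidates = sorted(
        name for name in os.listdir(model_dir)
        if name.endswith((".pt", ".pth")) and os.path.isfile(os.path.join(model_dir, name))
    )
    if len(candidates) != 1:
        # Include the directory listing so a wrongly packaged archive is easy to spot
        raise ValueError(
            f"Expected exactly one .pt or .pth file in {model_dir}, "
            f"found {candidates}; directory contains {sorted(os.listdir(model_dir))}"
        )
    return os.path.join(model_dir, candidates[0])


def model_fn(model_dir: str):
    """Load a TorchScript model for inference (assumes it was saved with torch.jit)."""
    import torch  # imported here so find_model_file can be exercised without PyTorch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.jit.load(find_model_file(model_dir), map_location=device)
    model.eval()
    return model
```

In a real `inference.py`, the `torch` import would normally sit at the top of the file; it is local here only to keep the helper testable on its own.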

---  

## **Issue** 


> ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from primary and could not load the entire response body. See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/p3-dog-breed-classification in account 852125600954 for more information.

In [None]:
# !pip install sagemaker_inference
## Successfully installed retrying-1.3.4 sagemaker_inference-1.10.1

In [10]:
from sagemaker_inference import content_types, decoder
print(content_types.UTF8_TYPES)
# np_array = decoder.decode(input_data, content_type)

['application/json', 'text/csv']


In [16]:
import json
import numpy as np
json_str = "[[219, 233, 234], [224, 238, 239], [223, 237, 238]]"  ## avoid shadowing the built-in `str`
data = np.array(json.loads(json_str))
data


array([[219, 233, 234],
       [224, 238, 239],
       [223, 237, 238]])

In [6]:
class Config:
    def __init__(self):
        self.debug = False
config = Config()
new_config_dict = {"wandb": True}
for key, value in new_config_dict.items():
    setattr(config, key, value)
print(config.__dict__)

{'debug': False, 'wandb': True}
