* notebook created by nov05 on 2024-12-01  
* local conda env [`awsmle_py310`](https://gist.github.com/nov05/d9c3be6c2ab9f6c050e3d988830db08b) (no cuda)    

---   

* https://sagemaker.readthedocs.io/en/v2.34.0/frameworks/pytorch/sagemaker.pytorch.html   
* https://docs.wandb.ai/guides/integrations/sagemaker/  

In [None]:
# TODO: Install any packages that you might need
# !pip install smdebug

In [8]:
!notepad C:\Users\guido\.aws\credentials

In [None]:
## reset the session after updating credentials
import boto3 # type: ignore
boto3.DEFAULT_SESSION = None
import sagemaker # type: ignore
from sagemaker import get_execution_role # type: ignore

role_arn = get_execution_role()  ## get role ARN
if 'AmazonSageMaker-ExecutionRole' not in role_arn:
    ## your own role here
    role_arn = "arn:aws:iam::061096721307:role/service-role/AmazonSageMaker-ExecutionRole-20241128T055392"
print("Role ARN:", role_arn) ## If local, Role ARN: arn:aws:iam::807711953667:role/voclabs
session = sagemaker.Session()
region = session.boto_region_name
bucket = session.default_bucket()

print("AWS Region: {}".format(region))
print("Default Bucket: {}".format(bucket))
print("Role Arn: {}".format(role_arn))

import wandb
## generate secrets.env. remember add it to .gitignore  
wandb.sagemaker_auth(path="scripts")  

Role ARN: arn:aws:iam::061096721307:role/service-role/AmazonSageMaker-ExecutionRole-20241128T055392


AWS Region: us-east-1
Default Bucket: sagemaker-us-east-1-061096721307
Role Arn: arn:aws:iam::061096721307:role/service-role/AmazonSageMaker-ExecutionRole-20241128T055392


In [None]:
from sagemaker.pytorch import PyTorch
from sagemaker.inputs import TrainingInput
from datetime import datetime
## Moving the 1.1 GB of data from one bucket to another takes about 1 hour.
## This is roughly the same amount of time as uploading the data from a local machine to S3.
data_base_path = "s3://p3-dog-breed-classification/dogImages/"
train_data = TrainingInput(data_base_path+"train/", content_type="image/jpeg")
val_data = TrainingInput(data_base_path+"valid/", content_type="image/jpeg")
test_data = TrainingInput(data_base_path+"test/", content_type="image/jpeg")
output_path = "s3://p3-dog-breed-classification/jobs/"
hyperparameters = {
    'epochs': 40,  # Define how many epochs you want to train for
    'batch-size': 16,  ## ⚠️ this probably needs to be small for a smaller training dataset?
    'opt-learning-rate': 1e-5,  ## optimizer lr. ⚠️ keep it small for pre-trained model
    'opt-weight-decay': 1e-4, ## optimizer weight decay
    'model-name': 'resnet152',  # Specify the ResNet model you want to use
}
# Define the PyTorch estimator
estimator = PyTorch(
    entry_point='train.py',  # Your training script that defines the ResNet model (selected via 'model-name') and the training loop
    source_dir='scripts',  # Directory where your script and dependencies are stored
    role=role_arn,
    framework_version='1.13.1',  # Use the PyTorch version you need
    py_version='py39',
    instance_count=1,  # Adjust based on the number of instances you want to use
    # instance_type='ml.p3.2xlarge',  # 16GB, Use GPU instances for deep learning
    instance_type='ml.g4dn.xlarge',  ## 16GB
    output_path=output_path,
    hyperparameters=hyperparameters,
)
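
In script mode, SageMaker passes each entry of `hyperparameters` to the entry point as a command-line argument (for example `--batch-size 16`) and exposes the input channels through `SM_CHANNEL_*` environment variables. The sketch below shows how a script such as `train.py` might read them. The actual `train.py` is not shown in this notebook, so the function name `parse_args` and the default values are assumptions for illustration only.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and channel paths the way a SageMaker entry point might.

    Hypothetical helper: the argument names mirror the keys of the
    `hyperparameters` dict defined in the notebook cell above.
    """
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as "--<key> <value>" command-line arguments.
    parser.add_argument("--epochs", type=int, default=40)
    parser.add_argument("--batch-size", type=int, default=16)
    parser.add_argument("--opt-learning-rate", type=float, default=1e-5)
    parser.add_argument("--opt-weight-decay", type=float, default=1e-4)
    parser.add_argument("--model-name", type=str, default="resnet152")
    # Channel names ("train", "validation", "test") become SM_CHANNEL_<NAME>.
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--validation", default=os.environ.get("SM_CHANNEL_VALIDATION"))
    parser.add_argument("--test", default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR"))
    # Ignore any extra arguments the container may add.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

Outside SageMaker the environment variables are unset, so the channel paths default to `None` and can be supplied on the command line for a local dry run.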

In [None]:
%%time
# Fit the estimator with the input channels (train, validation, test)
estimator.fit(
    wait=True,  
    job_name=f"p3-dog-breeds-job-{datetime.now().strftime('%Y%m%d-%H%M%S')}",  
    inputs={
        "train": train_data, 
        "validation": val_data, 
        "test": test_data,
    },  
)
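
The timestamped `job_name` above keeps each run unique. To my knowledge, SageMaker also requires training job names to be at most 63 characters long and to contain only letters, digits, and hyphens (no underscores). A minimal sketch of a helper that builds and checks the name locally before calling `fit` follows; `make_job_name` is a hypothetical name, not part of the SageMaker SDK.

```python
import re
from datetime import datetime

# Assumed constraint on training job names: alphanumeric start and end,
# hyphens allowed in between, 63 characters at most.
JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}")


def make_job_name(prefix, now=None):
    """Return '<prefix>-YYYYMMDD-HHMMSS', raising ValueError if it looks invalid."""
    now = now or datetime.now()
    name = f"{prefix}-{now.strftime('%Y%m%d-%H%M%S')}"
    if not JOB_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"Invalid SageMaker job name: {name!r}")
    return name
```

For example, `estimator.fit(job_name=make_job_name("p3-dog-breeds-job"), inputs=...)` would fail fast on a bad prefix instead of after the API call.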

## Dataset
TODO: Explain what dataset you are using for this project. Maybe even give a small overview of the classes, class distributions, etc. that can help anyone not familiar with the dataset get a better understanding of it.

In [None]:
#TODO: Fetch and upload the data to AWS S3
# Command to download and unzip data
# !wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
# !unzip dogImages.zip

## Hyperparameter Tuning
**TODO:** This is the part where you will finetune a pretrained model with hyperparameter tuning. Remember that you have to tune a minimum of two hyperparameters. However, you are encouraged to tune more. You are also encouraged to explain why you chose to tune those particular hyperparameters and their ranges.

**Note:** You will need to use the `hpo.py` script to perform hyperparameter tuning.
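
A hedged sketch of what the tuning setup could look like with `sagemaker.tuner`. The ranges, the objective metric, and especially the log format matched by the regexes are assumptions: `hpo.py` is not shown here, so the patterns below presume it prints a line such as `Validation set: Average loss: 0.1234, Accuracy: 85.50%`. `build_tuner` and `extract_metric` are hypothetical helper names.

```python
import re

# Assumed log format; adjust the regexes to whatever hpo.py actually prints.
METRIC_DEFINITIONS = [
    {"Name": "validation:loss", "Regex": r"Average loss: ([0-9.]+)"},
    {"Name": "validation:accuracy", "Regex": r"Accuracy: ([0-9.]+)%"},
]


def build_tuner(estimator, max_jobs=4, max_parallel_jobs=2):
    """Create a HyperparameterTuner over two hyperparameters (example ranges)."""
    # Imported lazily so the regex helpers work without the SageMaker SDK.
    from sagemaker.tuner import (
        CategoricalParameter,
        ContinuousParameter,
        HyperparameterTuner,
    )

    hyperparameter_ranges = {
        "opt-learning-rate": ContinuousParameter(1e-6, 1e-3, scaling_type="Logarithmic"),
        "batch-size": CategoricalParameter([16, 32, 64]),
    }
    return HyperparameterTuner(
        estimator=estimator,
        objective_metric_name="validation:loss",
        objective_type="Minimize",
        hyperparameter_ranges=hyperparameter_ranges,
        metric_definitions=METRIC_DEFINITIONS,
        max_jobs=max_jobs,
        max_parallel_jobs=max_parallel_jobs,
    )


def extract_metric(name, log_line):
    """Apply a metric regex to one log line, mimicking how SageMaker scrapes logs."""
    for definition in METRIC_DEFINITIONS:
        if definition["Name"] == name:
            match = re.search(definition["Regex"], log_line)
            return float(match.group(1)) if match else None
    raise KeyError(name)
```

Checking the regexes locally with `extract_metric` against a real log line is cheaper than discovering after a tuning run that the objective metric was never captured.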

In [None]:
#TODO: Declare your HP ranges, metrics etc.

In [None]:
#TODO: Create estimators for your HPs

estimator = # TODO: Your estimator here

tuner = # TODO: Your HP tuner here

In [None]:
# TODO: Fit your HP Tuner
tuner.fit() # TODO: Remember to include your data channels

In [None]:
# TODO: Get the best estimators and the best HPs

best_estimator = #TODO

#Get the hyperparameters of the best trained model
best_estimator.hyperparameters()

## Model Profiling and Debugging
TODO: Using the best hyperparameters, create and finetune a new model

**Note:** You will need to use the `train.py` script to perform model profiling and debugging.
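
One possible way to set up the rules and hooks, sketched under assumptions: the rule selection and the save intervals below are example choices, not values taken from this project, and `build_debug_config` is a hypothetical helper name. The built-in rules come from `sagemaker.debugger.rule_configs`.

```python
# Hook parameters are passed as strings; the intervals are example values.
HOOK_PARAMETERS = {"train.save_interval": "100", "eval.save_interval": "10"}


def build_debug_config():
    """Return keyword arguments that could be passed to the PyTorch estimator."""
    # Imported lazily so this cell can be read and edited without the SDK installed.
    from sagemaker.debugger import (
        DebuggerHookConfig,
        ProfilerConfig,
        ProfilerRule,
        Rule,
        rule_configs,
    )

    rules = [
        # Debugger rules inspect tensors saved by the hook.
        Rule.sagemaker(rule_configs.vanishing_gradient()),
        Rule.sagemaker(rule_configs.overfit()),
        Rule.sagemaker(rule_configs.loss_not_decreasing()),
        # Profiler rules inspect system metrics.
        ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
        ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
    ]
    return {
        "rules": rules,
        "debugger_hook_config": DebuggerHookConfig(hook_parameters=HOOK_PARAMETERS),
        "profiler_config": ProfilerConfig(system_monitor_interval_millis=500),
    }
```

The returned dict could then be unpacked into the estimator, for example `PyTorch(entry_point='train.py', ..., **build_debug_config())`.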

In [None]:
# TODO: Set up debugging and profiling rules and hooks

In [None]:
# TODO: Create and fit an estimator

estimator = # TODO: Your estimator here

In [None]:
# TODO: Plot a debugging output.

**TODO**: Is there some anomalous behaviour in your debugging output? If so, what is the error and how will you fix it?  
**TODO**: If not, suppose there was an error. What would that error look like and how would you have fixed it?

In [None]:
# TODO: Display the profiler output

## Model Deploying

In [None]:
# TODO: Deploy your model to an endpoint

predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")  # TODO: Adjust the instance type and count (example values) to your needs

In [None]:
# TODO: Run a prediction on the endpoint

image = # TODO: Your code to load and preprocess image to send to endpoint for prediction
response = predictor.predict(image)

In [None]:
# TODO: Remember to shutdown/delete your endpoint once your work is done
predictor.delete_endpoint()

* 🟢⚠️ Issue solved:  
  pytorch: [OSError: image file is truncated (150 bytes not processed)](https://discuss.pytorch.org/t/oserror-image-file-is-truncated-150-bytes-not-processed/64445)   
  github: [OSError: image file is truncated (7 bytes not processed)](https://github.com/eriklindernoren/PyTorch-YOLOv3/issues/162)  
  [OSError: image file is truncated: troubleshooting approach and solution](https://blog.csdn.net/qq_34097715/article/details/109646082)
```
File "/opt/conda/lib/python3.9/site-packages/torchvision/datasets/folder.py", line 268, in default_loader
    return pil_loader(path)
File "/opt/conda/lib/python3.9/site-packages/torchvision/datasets/folder.py", line 248, in pil_loader
    return img.convert("RGB")
File "/opt/conda/lib/python3.9/site-packages/PIL/Image.py", line 995, in convert
    self.load()
File "/opt/conda/lib/python3.9/site-packages/PIL/ImageFile.py", line 290, in load
    raise OSError(msg)
```
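
The note above does not record which fix was applied. The workaround most often suggested in the linked threads is to set Pillow's `ImageFile.LOAD_TRUNCATED_IMAGES` flag before the dataset is loaded. The sketch below shows that flag, plus a rough stdlib-only check for locating suspect files; the helper names are hypothetical, and the end-of-image test is only a heuristic that assumes a well-formed JPEG ends with the bytes `FF D9`.

```python
from pathlib import Path


def enable_truncated_image_loading():
    """Ask Pillow to load truncated files instead of raising OSError."""
    # Imported lazily; requires Pillow, which the training container provides.
    from PIL import ImageFile

    ImageFile.LOAD_TRUNCATED_IMAGES = True


def find_truncated_jpegs(root):
    """Return .jpg paths under `root` that do not end with the JPEG EOI marker."""
    suspects = []
    for path in sorted(Path(root).rglob("*.jpg")):
        # A complete JPEG normally ends with FF D9; anything else is suspect.
        if path.read_bytes()[-2:] != b"\xff\xd9":
            suspects.append(path)
    return suspects
```

Calling `enable_truncated_image_loading()` at the top of the training script hides the error, while running `find_truncated_jpegs("dogImages")` locally would show which files could be re-downloaded or removed instead.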