* notebook created by nov05 on 2024-12-01  
* local conda env [`awsmle_py310`](https://gist.github.com/nov05/d9c3be6c2ab9f6c050e3d988830db08b) (no cuda)     
* AWS SageMaker script mode  

---   

* https://sagemaker.readthedocs.io/en/v2.34.0/frameworks/pytorch/sagemaker.pytorch.html   
* https://docs.wandb.ai/guides/integrations/sagemaker/  

In [None]:
# TODO: Install any packages that you might need
# !pip install smdebug

In [8]:
!notepad C:\Users\guido\.aws\credentials

In [1]:
## reset the session after updating credentials
import boto3 # type: ignore
boto3.DEFAULT_SESSION = None

import sagemaker # type: ignore
from sagemaker import get_execution_role # type: ignore
role_arn = get_execution_role()  ## get role ARN
if 'AmazonSageMaker-ExecutionRole' not in role_arn:
    ## your own role here
    role_arn = "arn:aws:iam::061096721307:role/service-role/AmazonSageMaker-ExecutionRole-20241128T055392"
print("Role ARN:", role_arn) ## If local, Role ARN: arn:aws:iam::807711953667:role/voclabs
session = sagemaker.Session()
region = session.boto_region_name
bucket = session.default_bucket()
print("AWS Region: {}".format(region))
print("Default Bucket: {}".format(bucket))
print("Role Arn: {}".format(role_arn))

import wandb
## generate secrets.env. remember add it to .gitignore  
wandb.sagemaker_auth(path="scripts")  



sagemaker.config INFO - Not applying SDK defaults from location: C:\ProgramData\sagemaker\sagemaker\config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: C:\Users\guido\AppData\Local\sagemaker\sagemaker\config.yaml


Role ARN: arn:aws:iam::061096721307:role/service-role/AmazonSageMaker-ExecutionRole-20241128T055392


AWS Region: us-east-1
Default Bucket: sagemaker-us-east-1-061096721307
Role Arn: arn:aws:iam::061096721307:role/service-role/AmazonSageMaker-ExecutionRole-20241128T055392


[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


In [None]:
from sagemaker.pytorch import PyTorch
from sagemaker.inputs import TrainingInput
from datetime import datetime
## Moving the 1.1 GB dataset from one bucket to another takes about 1 hour,
## roughly the same as uploading it from a local machine to S3.
data_base_path = "s3://p3-dog-breed-classification/dogImages/"
train_data = TrainingInput(data_base_path+"train/", content_type="image/jpeg")
val_data = TrainingInput(data_base_path+"valid/", content_type="image/jpeg")
test_data = TrainingInput(data_base_path+"test/", content_type="image/jpeg")
output_path = "s3://p3-dog-breed-classification/jobs/"
hyperparameters = {
    'epochs': 30,  # Define how many epochs you want to train for
    'batch-size': 64,  ## ⚠️ may need to be smaller for a small training dataset
    'opt-learning-rate': 5e-5,  ## optimizer learning rate. ⚠️ keep it small when fine-tuning a pre-trained model
    'opt-weight-decay': 1e-4, ## optimizer weight decay
    'model-name': 'resnet101',  # Specify the ResNet model you want to use
}
# Define the PyTorch estimator
estimator = PyTorch(
    entry_point='train.py',  # Your training script that defines the ResNet model (selected by 'model-name') and training loop
    source_dir='scripts',  # Directory where your script and dependencies are stored
    role=role_arn,
    framework_version='1.13.1',  # Use the PyTorch version you need
    py_version='py39',
    instance_count=1,  # Adjust based on the number of instances you want to use
    # instance_type='ml.p3.2xlarge',  # Use GPU instances for deep learning
    instance_type='ml.g4dn.xlarge',
    output_path=output_path,
    hyperparameters=hyperparameters,
)
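
In script mode, SageMaker passes each entry of `hyperparameters` to the entry point as a `--key value` command-line argument and exposes each input channel through an `SM_CHANNEL_<NAME>` environment variable. The notebook does not show `scripts/train.py`, so the following is only a minimal sketch of how such a script might read these values. The function name `parse_args` and the local fallback paths are assumptions for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker paths for a script-mode job (sketch)."""
    parser = argparse.ArgumentParser()
    # Hyperparameters: argument names mirror the keys of the `hyperparameters` dict.
    parser.add_argument("--epochs", type=int, default=30)
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--opt-learning-rate", type=float, default=5e-5)
    parser.add_argument("--opt-weight-decay", type=float, default=1e-4)
    parser.add_argument("--model-name", type=str, default="resnet101")
    # Channel and model directories come from SageMaker environment variables.
    # The fallback paths are hypothetical and only matter when running locally.
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "data/train"))
    parser.add_argument("--validation", default=os.environ.get("SM_CHANNEL_VALIDATION", "data/valid"))
    parser.add_argument("--test", default=os.environ.get("SM_CHANNEL_TEST", "data/test"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    # parse_known_args ignores any extra arguments this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the arguments SageMaker would pass for part of the dict above.
args = parse_args(["--epochs", "30", "--batch-size", "64", "--model-name", "resnet101"])
print(args.epochs, args.batch_size, args.model_name, args.train)
```

Because argparse converts dashes to underscores, `'batch-size'` is read back as `args.batch_size`. The channel names (`train`, `validation`, `test`) must match the keys of the `inputs` dict passed to `estimator.fit`.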

In [None]:
%%time
# Fit the estimator with the input channels (train, validation, test)
estimator.fit(
    wait=True,  
    job_name=f"p3-dog-breeds-job-{datetime.now().strftime('%Y%m%d-%H%M%S')}",  
    inputs={
        "train": train_data, 
        "validation": val_data, 
        "test": test_data,
    },  
)
## baseline test accuracy 79%
## 92m 17s

2024-12-02 23:56:22 Starting - Starting the training job...
2024-12-02 23:56:37 Starting - Preparing the instances for training...
2024-12-02 23:57:07 Downloading - Downloading input data......
2024-12-02 23:58:17 Downloading - Downloading the training image..............
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
  "cipher": algorithms.TripleDES,
  "class": algorithms.TripleDES,
2024-12-03 00:01:01,969 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-12-03 00:01:01,992 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-12-03 00:01:02,006 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2024-12-03 00:01:02,010 sagemaker_pytorch_container.training INFO     Invoking user training script.
2024-12-03 00:01:03,377 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt
C

* 🟢⚠️ Issue solved:  
  * pytorch: [OSError: image file is truncated (150 bytes not processed)](https://discuss.pytorch.org/t/oserror-image-file-is-truncated-150-bytes-not-processed/64445)   
  * github: [OSError: image file is truncated (7 bytes not processed)](https://github.com/eriklindernoren/PyTorch-YOLOv3/issues/162)  
  * [OSError: image file is truncated: troubleshooting approach and solution](https://blog.csdn.net/qq_34097715/article/details/109646082)

```text
File "/opt/conda/lib/python3.9/site-packages/torchvision/datasets/folder.py", line 268, in default_loader
    return pil_loader(path)
File "/opt/conda/lib/python3.9/site-packages/torchvision/datasets/folder.py", line 248, in pil_loader
    return img.convert("RGB")
File "/opt/conda/lib/python3.9/site-packages/PIL/Image.py", line 995, in convert
    self.load()
File "/opt/conda/lib/python3.9/site-packages/PIL/ImageFile.py", line 290, in load
    raise OSError(msg)
```
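
The fix commonly suggested for this error, including in the threads linked above, is to set `ImageFile.LOAD_TRUNCATED_IMAGES = True` (after `from PIL import ImageFile`) in the training script before the datasets are loaded. The notebook does not show which fix was actually applied. As a complementary check, the sketch below uses only the standard library to list JPEG files that lack the end-of-image marker. It is a heuristic, not a full decoder, and the helper names `looks_truncated` and `find_suspect_jpegs` are hypothetical.

```python
from pathlib import Path

JPEG_SOI = b"\xff\xd8"  # start-of-image marker
JPEG_EOI = b"\xff\xd9"  # end-of-image marker


def looks_truncated(path):
    """Heuristic: a complete JPEG starts with SOI and ends with EOI.

    Files with extra bytes after the EOI marker are also flagged,
    so treat the result as "worth inspecting", not "definitely broken".
    """
    data = Path(path).read_bytes()
    return not (data.startswith(JPEG_SOI) and data.endswith(JPEG_EOI))


def find_suspect_jpegs(root):
    """Return sorted paths of .jpg/.jpeg files under root that fail the check."""
    suspects = []
    for path in sorted(Path(root).rglob("*")):
        is_jpeg = path.suffix.lower() in {".jpg", ".jpeg"}
        if path.is_file() and is_jpeg and looks_truncated(path):
            suspects.append(path)
    return suspects
```

Running `find_suspect_jpegs("dogImages")` against a local copy of the dataset (the folder name is an assumption) would list candidate files to re-download or re-encode before uploading to S3.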