# Training an image classification model using SageMaker and HLS imagery from Earthdata Cloud (EDC)
Using the training data prepared [here](https://github.com/nasa-esdswg-ml/edc-notebooks/blob/main/Sagemaker/data-preparation.ipynb), we will train a model using SageMaker's 'image-classification' framework.

Note: For this to work directly against EDC, we would need to grant read access on the EDC data to the AWS user associated with the 'sandbox' profile. That is not currently possible. In this notebook, you should instead copy the data to an S3 bucket that your user has normal read access to, using the technique outlined [here](https://github.com/nasa-esdswg-ml/edc-notebooks/blob/main/EDC%20Data%20Access/s3-access.ipynb). This is one of the primary findings of our EDC+ML investigation, and recommendations will be made to EOSDIS to make SageMaker direct data access (i.e. usage of EDC data without having to copy) possible.
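
Once the images have been copied, each manifest entry (a JSON line with a `source-ref` field, as shown later in this notebook) has to reference the copy rather than the EDC bucket. The following is a minimal sketch of that rewrite, assuming the copy keeps the original object keys. The helper `retarget_manifest_line` and the bucket name `my-training-bucket` are hypothetical and are not part of the linked notebooks.

```python
import json
from urllib.parse import urlparse


def retarget_manifest_line(line, dest_bucket):
    """Point a manifest entry's `source-ref` at the same key in `dest_bucket`.

    Hypothetical helper: assumes the objects were copied with unchanged keys.
    """
    entry = json.loads(line)
    parsed = urlparse(entry["source-ref"])
    if parsed.scheme != "s3":
        raise ValueError("expected an s3:// URI, got {!r}".format(entry["source-ref"]))
    # parsed.path keeps its leading "/", so the object key is preserved as-is
    entry["source-ref"] = "s3://{}{}".format(dest_bucket, parsed.path)
    return json.dumps(entry)


original = '{"source-ref": "s3://lp-prod-public/HLSL30.020/example.jpg", "class": "1"}'
print(retarget_manifest_line(original, "my-training-bucket"))
```

Only the bucket name changes; the label and any other attributes on the line are passed through untouched.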

In [None]:
import time
import boto3
import re
import sagemaker
from sagemaker import get_execution_role
from sagemaker import image_uris

role = get_execution_role()

bucket = sagemaker.Session().default_bucket()

training_image = image_uris.retrieve(
    region=boto3.Session().region_name, framework="image-classification"
)

training_manifest_file_s3_location = "<your training data manifest file here>"
validation_manifest_file_s3_location = "<your validation data manifest file here>"
job_name_prefix = "<your image classification job name prefix here>"

# Train model
In order to train our model we need labelled data. The labelling is supplied by an [augmented manifest file](https://github.com/nasa-esdswg-ml/edc-notebooks/blob/main/Sagemaker/data-preparation.ipynb) located in an S3 bucket that you own. 

That file labels data resident in a bucket in the EDC (which you do not own). For example, a line in that file would look like this:

`{"source-ref": "s3://lp-prod-public/HLSL30.020/HLS.L30.T01FBF.2021104T213801.v2.0/HLS.L30.T01FBF.2021104T213801.v2.0.jpg", "class": "1"}`

This line indicates that the browse image in `lp-prod-public` is a cloudy image.

We supply a manifest file for both training and validation of the model.
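
Before submitting a job, it can help to confirm that every manifest line parses and carries both attributes, and to see how the labels are distributed. The sketch below works on manifest lines already read into memory; `count_labels` is a hypothetical helper, and it assumes the label attribute is named `class` as in the example line above.

```python
import json
from collections import Counter


def count_labels(manifest_lines, attribute_names=("source-ref", "class")):
    """Validate augmented-manifest lines and count entries per `class` label."""
    counts = Counter()
    for number, line in enumerate(manifest_lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        entry = json.loads(line)
        missing = [name for name in attribute_names if name not in entry]
        if missing:
            raise ValueError("line {} is missing {}".format(number, missing))
        counts[entry["class"]] += 1
    return counts


sample = [
    '{"source-ref": "s3://my-training-bucket/a.jpg", "class": "1"}',
    '{"source-ref": "s3://my-training-bucket/b.jpg", "class": "0"}',
    '{"source-ref": "s3://my-training-bucket/c.jpg", "class": "1"}',
]
print(count_labels(sample))
```

The total of the counts for the training manifest should match the `num_training_samples` value used when configuring the job.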

The 'training_params' variable contains the configuration we need to train the model for our cloudy/not cloudy use case.
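
As a rough sanity check on the hyperparameters used in this notebook, the number of mini-batches per epoch follows from the sample count and batch size. This is a sketch of the arithmetic only; `summarize_schedule` is a hypothetical helper, and how the algorithm treats a final partial batch is not covered here.

```python
import math


def summarize_schedule(num_training_samples, mini_batch_size, epochs):
    """Return the approximate number of mini-batches per epoch and in total."""
    if mini_batch_size > num_training_samples:
        raise ValueError("mini_batch_size exceeds num_training_samples")
    batches_per_epoch = math.ceil(num_training_samples / mini_batch_size)
    return {
        "batches_per_epoch": batches_per_epoch,
        "total_batches": batches_per_epoch * epochs,
    }


# Values used in this notebook: 100 samples, batch size 50, 2 epochs
print(summarize_schedule(100, 50, 2))
# {'batches_per_epoch': 2, 'total_batches': 4}
```

With only four mini-batches in total, this configuration is a quick smoke test of the pipeline rather than a setting likely to produce a well-trained classifier.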

In [None]:
# For this training, we will use 18 layers
num_layers = "18"
# input image shape for the training data; the algorithm documents this hyperparameter
# as "num_channels,height,width". Note that this variable is not passed to
# HyperParameters below, so the algorithm's default image shape applies
image_shape = "1000,1000"
# In our training data preparation notebook we used 100 training samples
num_training_samples = "100"
# specify the number of output classes. Our classes are cloudy and not cloudy
num_classes = "2"
# batch size for training
mini_batch_size = "50"
# number of epochs
epochs = "2"
# learning rate
learning_rate = "0.01"

# the timestamp format already starts with "-", so no extra separator is needed
job_name = job_name_prefix + time.strftime("-%Y-%m-%d-%H-%M-%S", time.gmtime())
training_params = {
    "AlgorithmSpecification": {"TrainingImage": training_image, "TrainingInputMode": "Pipe"},
    "RoleArn": role,
    "OutputDataConfig": {"S3OutputPath": "s3://{}/{}/output".format(bucket, job_name_prefix)},
    "ResourceConfig": {"InstanceCount": 1, "InstanceType": "ml.p3.2xlarge", "VolumeSizeInGB": 50},
    "TrainingJobName": job_name,
    "HyperParameters": {
        "num_layers": str(num_layers),
        "num_training_samples": str(num_training_samples),
        "num_classes": str(num_classes),
        "mini_batch_size": str(mini_batch_size),
        "epochs": str(epochs),
        "learning_rate": str(learning_rate)
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 360000},
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "AttributeNames": ["source-ref", "class"],
                    "S3DataType": "AugmentedManifestFile",
                    "S3Uri": training_manifest_file_s3_location,
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-recordio",
            "CompressionType": "None",
            "RecordWrapperType": "RecordIO"
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "AttributeNames": ["source-ref", "class"],
                    "S3DataType": "AugmentedManifestFile",
                    "S3Uri": validation_manifest_file_s3_location,
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-recordio",
            "CompressionType": "None",
            "RecordWrapperType": "RecordIO"
        }
    ]
}
    
print("Training job name: {}".format(job_name))
print(
    "\nInput Data Location: {}".format(
        training_params["InputDataConfig"][0]["DataSource"]["S3DataSource"]
    )
)

We create our AWS session and obtain a SageMaker service handle. We then create a training job and wait for that job to complete. 
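
The dictionary returned by `describe_training_job` carries more than the top-level status. A minimal sketch of turning it into a one-line summary is shown here; `summarize_training_job` is a hypothetical helper, and the sample dictionaries stand in for real responses so the example runs without AWS access.

```python
def summarize_training_job(description):
    """Summarize a describe_training_job response as a single line of text."""
    status = description["TrainingJobStatus"]
    if status == "Completed":
        artifacts = description.get("ModelArtifacts", {}).get(
            "S3ModelArtifacts", "unknown location"
        )
        return "Completed; model artifacts at {}".format(artifacts)
    if status == "Failed":
        # FailureReason is only present when the job has failed
        return "Failed: {}".format(description.get("FailureReason", "no reason reported"))
    return "{} ({})".format(status, description.get("SecondaryStatus", "no secondary status"))


# Stand-in responses (illustrative values only)
print(summarize_training_job({"TrainingJobStatus": "InProgress", "SecondaryStatus": "Training"}))
print(summarize_training_job({"TrainingJobStatus": "Failed", "FailureReason": "example reason"}))
```

In the notebook, the argument would be the result of `describe_training_job(TrainingJobName=job_name)` on the SageMaker client.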

In [None]:
# create the Amazon SageMaker training job
session = boto3.Session(profile_name='sandbox')
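# NOTE: this rebinds the name `sagemaker` from the imported module to a boto3 client;
# the cells below rely on the client, so the name is kept
sagemaker = session.client(service_name="sagemaker")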
sagemaker.create_training_job(**training_params)

# confirm that the training job has started
status = sagemaker.describe_training_job(TrainingJobName=job_name)["TrainingJobStatus"]
print("Training job current status: {}".format(status))

try:
    # wait for the job to finish and report the ending status
    sagemaker.get_waiter("training_job_completed_or_stopped").wait(TrainingJobName=job_name)
    training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    status = training_info["TrainingJobStatus"]
    print("Training job ended with status: " + status)
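except Exception:
    # the waiter raises if the job fails or if the wait times out
    print("Training did not complete successfully")
    failed_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    # FailureReason is only present when the job has actually failed
    message = failed_info.get("FailureReason", "no failure reason reported")
    print("Training failed with the following error: {}".format(message))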

In [None]:
training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
status = training_info["TrainingJobStatus"]
print("Training job ended with status: " + status)

# Creating the model
Upon completion of the training job, we can create a model. Once created, we are given a model ARN that we can use to create an inference endpoint.

In [None]:
model_name = "DEMO-full-image-classification-model" + time.strftime(
    "-%Y-%m-%d-%H-%M-%S", time.gmtime()
)
print(model_name)
info = sagemaker.describe_training_job(TrainingJobName=job_name)
model_data = info["ModelArtifacts"]["S3ModelArtifacts"]
print(model_data)

hosting_image = image_uris.retrieve(
    region=session.region_name, framework="image-classification"
)

primary_container = {
    "Image": hosting_image,
    "ModelDataUrl": model_data,
}

create_model_response = sagemaker.create_model(
    ModelName=model_name, ExecutionRoleArn=role, PrimaryContainer=primary_container
)

print("Model ARN: " + create_model_response["ModelArn"])
print("Model Name: " + model_name)
print("Training job prefix: " + job_name_prefix)
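
The model created above can then back an inference endpoint, as mentioned at the start of this section. The sketch below only builds the request arguments for the boto3 `create_endpoint_config` and `create_endpoint` calls and does not contact AWS. The helper `build_endpoint_requests`, the name suffixes, and the `ml.m4.xlarge` instance type are assumptions for illustration, not choices made by this notebook.

```python
def build_endpoint_requests(model_name, instance_type="ml.m4.xlarge", instance_count=1):
    """Build keyword arguments for create_endpoint_config and create_endpoint."""
    endpoint_config_name = model_name + "-cfg"
    endpoint_name = model_name + "-ep"
    # SageMaker limits these resource names to 63 characters
    for name in (endpoint_config_name, endpoint_name):
        if len(name) > 63:
            raise ValueError("name too long for SageMaker: {}".format(name))
    config_request = {
        "EndpointConfigName": endpoint_config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "InitialInstanceCount": instance_count,
                "InstanceType": instance_type,  # assumed instance type
            }
        ],
    }
    endpoint_request = {
        "EndpointName": endpoint_name,
        "EndpointConfigName": endpoint_config_name,
    }
    return config_request, endpoint_request


config_request, endpoint_request = build_endpoint_requests(
    "DEMO-full-image-classification-model-2021-01-01-00-00-00"
)
print(config_request)
print(endpoint_request)
```

In the notebook, these dictionaries could be passed to the SageMaker client as `create_endpoint_config(**config_request)` followed by `create_endpoint(**endpoint_request)`. A running endpoint incurs charges until it is deleted.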