# Data & Model Preparation
This notebook prepares the dataset and model for the module evaluation lab. It is optional if you have kept your artifacts from previous modules.

## Import modules and initialize parameters for this notebook

In [3]:
import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()
sess = sagemaker.Session()

account = sess.account_id()
region = sess.boto_region_name
bucket = sess.default_bucket() # or use your own custom bucket name
prefix = 'clarify-explainability'

## Dataset
The dataset we are using is the [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) dataset, which contains 11,788 images across 200 bird species (see the original technical report for details). Each species comes with around 60 images, with a typical size of about 350 pixels by 500 pixels. Bounding boxes are provided, as are annotations of bird parts. A recommended train/test split is given, but image size data is not.

Run the cell below to download the full dataset, or download it manually [here](https://course.fast.ai/datasets). Note that the file is around 1.2 GB and can take a while to download. If you plan to complete the entire workshop, keep the file so that you do not have to download and process the data again.

In [None]:
!wget 'https://s3.amazonaws.com/fast-ai-imageclas/CUB_200_2011.tgz'
!tar xopf CUB_200_2011.tgz
!rm CUB_200_2011.tgz

s3_raw_data = f's3://{bucket}/{prefix}/full/data'
!aws s3 cp --recursive ./CUB_200_2011 $s3_raw_data
!rm -rf ./CUB_200_2011

In [None]:
s3_raw_data

In [None]:
from sagemaker.sklearn.processing import SKLearnProcessor

from sagemaker.processing import (
    ProcessingInput,
    ProcessingOutput,
)
import time 

timestamp = str(time.time()).split('.')[0]
# SKLearnProcessor for preprocessing
output_prefix = f'{prefix}/outputs'
output_s3_uri = f's3://{bucket}/{output_prefix}'

class_selection = '13, 17, 35, 36, 47, 68, 73, 87'
input_annotation = 'classes.txt'
processing_instance_type = "ml.m5.xlarge"
processing_instance_count = 1

sklearn_processor = SKLearnProcessor(base_job_name = f"{prefix}-preprocess",  # choose any name
                                    framework_version='0.20.0',
                                    role=role,
                                    instance_type=processing_instance_type,
                                    instance_count=processing_instance_count)
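The `class_selection` string and the `classes.txt` annotation file are handed to `preprocessing.py` as command-line arguments. That script is not shown in this notebook, so the following is only a minimal sketch of how such arguments might be parsed and mapped to class names. The helper names and the sample lines are hypothetical, and the `<id> <name>` line format of `classes.txt` is an assumption.

```python
import argparse


def parse_args(argv):
    """Parse the two arguments that the processing job passes to the script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--classes", type=str, default="")
    parser.add_argument("--input-data", type=str, default="classes.txt")
    return parser.parse_args(argv)


def selected_class_names(classes_arg, class_lines):
    """Map a string such as '13, 17' to class names using '<id> <name>' lines."""
    wanted = {int(tok) for tok in classes_arg.split(",") if tok.strip()}
    names = []
    for line in class_lines:
        class_id, name = line.split(maxsplit=1)
        if int(class_id) in wanted:
            names.append(name.strip())
    return names


# Hypothetical excerpt of classes.txt, used only for illustration
sample_lines = ["13 013.Bobolink", "14 014.Indigo_Bunting", "17 017.Cardinal"]
args = parse_args(["--classes", "13, 17", "--input-data", "classes.txt"])
print(selected_class_names(args.classes, sample_lines))
```

Passing the class IDs as a single comma-separated string keeps the `arguments` list of the processing job short, at the cost of a small parsing step inside the script.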

In [None]:
sklearn_processor.run(
    code='preprocessing.py',
    arguments=["--classes", class_selection, 
               "--input-data", input_annotation],
    inputs=[ProcessingInput(source=s3_raw_data, 
            destination="/opt/ml/processing/input")],
    outputs=[
            ProcessingOutput(source="/opt/ml/processing/output/train", destination = output_s3_uri +'/train'),
            ProcessingOutput(source="/opt/ml/processing/output/valid", destination = output_s3_uri +'/valid'),
            ProcessingOutput(source="/opt/ml/processing/output/test", destination = output_s3_uri +'/test'),
            ProcessingOutput(source="/opt/ml/processing/output/manifest", destination = output_s3_uri +'/manifest'),
        ],
    )

['013.Bobolink', '017.Cardinal', '035.Purple_Finch', '036.Northern_Flicker', '047.American_Goldfinch', '068.Ruby_throated_Hummingbird', '073.Blue_Jay', '087.Mallard']
Using 477 images from 8 classes
num images total: 11788
num train: 286
num val: 95
num test: 96
Copying files for 95 images in channel: valid...
Copying files for 96 images in channel: test...
Copying files for 286 images in channel: train...
Finished running processing job



The next cell prints where your images and annotation files are located. You will need these for this module.

In [None]:
print(f"Test dataset located here: {output_s3_uri +'/test'} ===========")

print(f"Test annotation file is located here: {output_s3_uri +'/manifest'} ===========")
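The counts reported by the processing job (286 train, 95 validation, and 96 test images out of 477) are consistent with a 60/20/20 split in which the rounding remainder goes to the test set. The splitting logic in `preprocessing.py` is not shown, so the sketch below is an assumption that merely reproduces those numbers. The function name and fractions are hypothetical.

```python
def split_counts(n_images, train_frac=0.6, val_frac=0.2):
    """Return (train, val, test) sizes; the remainder after rounding down goes to test."""
    n_train = int(n_images * train_frac)
    n_val = int(n_images * val_frac)
    n_test = n_images - n_train - n_val
    return n_train, n_val, n_test


print(split_counts(477))
```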



In [None]:
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep
from sagemaker.tensorflow import TensorFlow

TF_FRAMEWORK_VERSION = '2.1'

hyperparameters = {'initial_epochs':     5,
                   'batch_size':         8,
                   'fine_tuning_epochs': 20, 
                   'dropout':            0.4,
                   'data_dir':           '/opt/ml/input/data'}

metric_definitions = [{'Name': 'loss',      'Regex': 'loss: ([0-9\\.]+)'},
                  {'Name': 'acc',       'Regex': 'accuracy: ([0-9\\.]+)'},
                  {'Name': 'val_loss',  'Regex': 'val_loss: ([0-9\\.]+)'},
                  {'Name': 'val_acc',   'Regex': 'val_accuracy: ([0-9\\.]+)'}]


distribution = {'parameter_server': {'enabled': False}}
DISTRIBUTION_MODE = 'FullyReplicated'
    
train_in = TrainingInput(s3_data=output_s3_uri +'/train', distribution=DISTRIBUTION_MODE)
val_in   = TrainingInput(s3_data=output_s3_uri +'/valid', distribution=DISTRIBUTION_MODE)
test_in  = TrainingInput(s3_data=output_s3_uri +'/test', distribution=DISTRIBUTION_MODE)

inputs = {'train':train_in, 'test': test_in, 'validation': val_in}

training_instance_type = 'ml.c5.4xlarge'

training_instance_count = 1
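The `metric_definitions` above are regular expressions applied to the training logs. As a rough local check, the sketch below runs the same patterns over a single Keras-style log line. The log line is made up for illustration, and taking the first match of each pattern is an assumption about how the values are picked up.

```python
import re

metric_definitions = [
    {"Name": "loss", "Regex": r"loss: ([0-9\.]+)"},
    {"Name": "acc", "Regex": r"accuracy: ([0-9\.]+)"},
    {"Name": "val_loss", "Regex": r"val_loss: ([0-9\.]+)"},
    {"Name": "val_acc", "Regex": r"val_accuracy: ([0-9\.]+)"},
]


def extract_metrics(log_line, definitions):
    """Return {metric name: value} using the first match of each regex."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


# Hypothetical log line in the style Keras prints at the end of an epoch
sample = "50/50 - 12s - loss: 0.8123 - accuracy: 0.7150 - val_loss: 0.9001 - val_accuracy: 0.6800"
metrics = extract_metrics(sample, metric_definitions)
print(metrics)
```

Note that the `loss` and `accuracy` patterns are not anchored, so they would also match inside `val_loss` and `val_accuracy` if those appeared first on a line. In this sample the training metrics come first, so each pattern picks up the intended value.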

In [None]:
model_path = f"s3://{bucket}/{prefix}"

estimator = TensorFlow(entry_point='train-mobilenet.py',
               source_dir='code',
               output_path=model_path,
               instance_type=training_instance_type,
               instance_count=training_instance_count,
               distribution=distribution,
               hyperparameters=hyperparameters,
               metric_definitions=metric_definitions,
               role=role,
               framework_version=TF_FRAMEWORK_VERSION, 
               py_version='py3',
               base_job_name=prefix,
               script_mode=True)

In [None]:
estimator.fit(inputs)

2022-06-09 14:18:24 Starting - Starting the training job...
2022-06-09 14:18:47 Starting - Preparing the instances for training
ProfilerReport-1654784304: InProgress
Epoch 3/5
Epoch 4/5
Epoch 5/5
Train for 50 steps, validate for 21 steps
Epoch 1/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Model has been fit.
Saving model, since we are master host
Saving model to /opt/ml/model...
Model directory files BEFORE save: []
Model directory files AFTER save: ['/opt/ml/model/1/saved_model.pb', '/opt/ml/model/1/variables', '/opt/ml/model/1/assets']
...DONE saving model!
Copying inference source files...
Files after copying custom inference h

## Deploy SageMaker model

In [None]:
from sagemaker.tensorflow import TensorFlowModel
from time import gmtime, strftime
from sagemaker.serializers import IdentitySerializer
from sagemaker.deserializers import JSONDeserializer


timestamp_suffix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())

model_name = f"{prefix}-classification-model-{timestamp_suffix}"


serializer = IdentitySerializer(content_type="application/x-image")
deserializer = JSONDeserializer(accept='application/json')

training_job_name = estimator.latest_training_job.name
model_artifacts = f'{model_path}/{training_job_name}/output/model.tar.gz'

model = TensorFlowModel(name=model_name,
              model_data=model_artifacts, 
              role=sagemaker.get_execution_role(),
              framework_version=TF_FRAMEWORK_VERSION,
              sagemaker_session=sess)

predictor = model.deploy(initial_instance_count=1, 
                         instance_type='ml.m5.xlarge',
                         serializer=serializer,
                         deserializer = deserializer)

update_endpoint is a no-op in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


-----!

In [None]:
from numpy import argmax

possible_classes = ['013.Bobolink',
                    '017.Cardinal',
                    '035.Purple_Finch',
                    '036.Northern_Flicker',
                    '047.American_Goldfinch',
                    '068.Ruby_throated_Hummingbird',
                    '073.Blue_Jay',
                    '087.Mallard']

fn = 'images/Cardinal_0102_17808.jpg'

with open(fn, 'rb') as img:
    f = img.read()
    
x = bytearray(f)

results = predictor.predict(x)['predictions']

predicted_class_idx = argmax(results)
predicted_class = possible_classes[predicted_class_idx]

predicted_class

'017.Cardinal'
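The cell above relies on `argmax` flattening the nested `predictions` list, which works for a single image. The sketch below makes the same lookup explicit without NumPy, assuming the response has the shape `{"predictions": [[score, ...]]}` with one row per image. The sample response and the helper name are hypothetical.

```python
possible_classes = ['013.Bobolink',
                    '017.Cardinal',
                    '035.Purple_Finch',
                    '036.Northern_Flicker',
                    '047.American_Goldfinch',
                    '068.Ruby_throated_Hummingbird',
                    '073.Blue_Jay',
                    '087.Mallard']


def top_class(response, classes):
    """Return (class name, score) for the first image in the response."""
    scores = response["predictions"][0]
    best = max(range(len(scores)), key=scores.__getitem__)
    return classes[best], scores[best]


# Hypothetical endpoint response for one image, for illustration only
sample_response = {"predictions": [[0.01, 0.90, 0.02, 0.01, 0.03, 0.01, 0.01, 0.01]]}
print(top_class(sample_response, possible_classes))
```

Indexing the first row explicitly avoids a wrong class index if a response ever contains scores for more than one image.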

In [None]:
import os
import random

import boto3

s3_client = boto3.client('s3')

# Make sure the local download directory exists before writing into it
os.makedirs('images', exist_ok=True)

objects = s3_client.list_objects_v2(Bucket=bucket, Prefix=f'{output_prefix}/test')

# Copy roughly 10% of the test images to the Clarify input prefix and download them locally
for obj in objects['Contents']:
    rnd_num = random.randint(1, 10)

    if rnd_num == 1:
        filename = obj['Key'].split('/')[-1]
        copy_source = {
            'Bucket': bucket,
            'Key': obj['Key']
        }
        s3_client.copy(copy_source, bucket, f'{prefix}/clarify-images/{filename}')
        s3_client.download_file(bucket, obj['Key'], f'images/{filename}')
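The loop above keeps each test image with a probability of about 10%, so the number of images sent to Clarify changes from run to run. If a fixed, reproducible sample is preferred, one option is to sample the keys up front with a seeded generator. This is a sketch of an alternative, not what the notebook does. The function name, the seed, and the key names are hypothetical.

```python
import random


def sample_keys(keys, fraction=0.1, seed=0):
    """Pick a reproducible subset of about `fraction` of the keys (at least one)."""
    rng = random.Random(seed)
    k = max(1, round(len(keys) * fraction))
    return rng.sample(keys, k)


# Hypothetical S3 keys standing in for the 96 test images
all_keys = [f"clarify-explainability/outputs/test/img_{i:03d}.jpg" for i in range(96)]
chosen = sample_keys(all_keys)
print(len(chosen))
```

The selected keys could then be passed to the same `s3_client.copy` and `s3_client.download_file` calls used in the cell above.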

In [None]:
object_categories = class_selection.split(', ')

model_name

'clarify-explainability-classification-model-2022-06-09-14-30-25'

In [None]:
from sagemaker import clarify

s3_data_input_path = f's3://{bucket}/{prefix}/clarify-images'
clarify_output_prefix = f"{prefix}/cv_analysis_result"
analysis_result_path = "s3://{}/{}".format(bucket, clarify_output_prefix)
explainability_data_config = clarify.DataConfig(
    s3_data_input_path=s3_data_input_path,
    s3_output_path=analysis_result_path,
    dataset_type="application/x-image",
)

model_config = clarify.ModelConfig(
    model_name=model_name, instance_type="ml.m5.xlarge", instance_count=1, content_type="image/jpeg"
)

predictions_config = clarify.ModelPredictedLabelConfig(label_headers=object_categories)

image_config = clarify.ImageConfig(
    model_type="IMAGE_CLASSIFICATION", num_segments=20, segment_compactness=5
)

shap_config = clarify.SHAPConfig(num_samples=500, image_config=image_config)

In [None]:
import os

account_id = os.getenv("AWS_ACCOUNT_ID", "<your-account-id>")
sagemaker_iam_role = "<AmazonSageMaker-ExecutionRole>"

# Fetch the IAM role to initialize the sagemaker processing job
try:
    role = sagemaker.get_execution_role()
except ValueError as e:
    print(e)
    role = f"arn:aws:iam::{account_id}:role/{sagemaker_iam_role}"

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type="ml.m5.xlarge", sagemaker_session=sess
)

In [None]:
clarify_processor.run_explainability(
    data_config=explainability_data_config,
    model_config=model_config,
    explainability_config=shap_config,
    model_scores=predictions_config,
)


Job Name:  Clarify-Explainability-2022-06-09-14-33-03-596
Inputs:  [{'InputName': 'dataset', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-west-2-987720697751/clarify-explainability/clarify-images', 'LocalPath': '/opt/ml/processing/input/data', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'analysis_config', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-west-2-987720697751/clarify-explainability/cv_analysis_result/analysis_config.json', 'LocalPath': '/opt/ml/processing/input/config', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'analysis_result', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-us-west-2-987720697751/clarify-explainability/cv_analysis_result', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]
...................

UnexpectedStatusException: Error for Processing job Clarify-Explainability-2022-06-09-14-33-03-596: Failed. Reason: ClientError: An error occurred (ModelError) when calling the InvokeEndpoint operation (reached max retries: 0): Received server error (500) from primary with message "{"error": "Error: 415, Unsupported content type \"image/jpeg\""}". See https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=/aws/sagemaker/Endpoints/sm-clarify-clarify-explainability-classificatio-1654785474-9299 in account 987720697751 for more information.
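The 415 error above indicates that the endpoint rejected `image/jpeg`, the content type set in `ModelConfig`, whereas the earlier test prediction was sent as `application/x-image`. The inference handler is not shown in this notebook, so the following is only a sketch of one possible remedy, assuming the handler gates requests on their content type: accept both types. All names here are hypothetical.

```python
# Content types this hypothetical handler is willing to accept
SUPPORTED_CONTENT_TYPES = {"application/x-image", "image/jpeg", "image/png"}


class UnsupportedContentType(ValueError):
    """Raised when a request arrives with a content type that cannot be decoded."""


def validate_content_type(content_type):
    """Return the normalized content type, or raise for unsupported ones."""
    normalized = content_type.split(";")[0].strip().lower()
    if normalized not in SUPPORTED_CONTENT_TYPES:
        raise UnsupportedContentType(
            f'Error: 415, Unsupported content type "{normalized}"'
        )
    return normalized


print(validate_content_type("image/jpeg"))
```

With a check like this in the handler, requests sent by the Clarify job as `image/jpeg` and requests sent by the predictor as `application/x-image` would both reach the image-decoding step.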

In [None]:
%%time
# List everything the Clarify job wrote; "Contents" is absent if the job produced no output
output_objects = s3_client.list_objects(Bucket=bucket, Prefix=clarify_output_prefix)
result_images = []

for file_obj in output_objects.get("Contents", []):
    file_name = os.path.basename(file_obj["Key"])
    if os.path.splitext(file_name)[1] == ".jpeg":
        result_images.append(file_name)

    print(f"Downloading s3://{bucket}/{file_obj['Key']} ...")
    s3_client.download_file(bucket, file_obj["Key"], file_name)

In [None]:
from IPython.display import Image, display

for img in result_images:
    display(Image(img))