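# Image Classification

1. [Introduction](#Introduction)
2. [Runtime environment](#Runtime-environment)
3. [Preprocessing](#Preprocessing)
  1. [Permissions and environment variables](#Permissions-and-environment-variables)
  2. [Prepare the data](#Prepare-the-data)
  3. [Split the data](#Split-the-data)
4. [Fine-tuning the image classification model](#Fine-tuning-the-image-classification-model)
  1. [Training parameters](#Training-parameters)
  2. [Training](#Training)
5. [Deploy the model](#Deploy-the-model)
  1. [Create the model](#Create-the-model)
  2. [Inference](#Inference)
    1. [Create the endpoint configuration](#Create-the-endpoint-configuration)
    2. [Create the endpoint](#Create-the-endpoint)
    3. [Perform inference](#Perform-inference)
    4. [Clean-up](#Clean-up)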

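## Introduction
[See the original notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/imageclassification_caltech/Image-classification-lst-format.ipynb)

Welcome to this end-to-end example of training an image classification model. In this demo, we use the Amazon SageMaker image classification algorithm in transfer learning mode to fine-tune a pre-trained model (trained on ImageNet data) so that it learns to classify a new dataset.

Before training, we need to set up the environment with a few prerequisite steps, including permissions and configuration.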

## Runtime environment
Select the `mxnet_p36` kernel.

This notebook was tested with boto3 1.4.57 and sagemaker 2.5.4. It does not work with sagemaker 1.x.

In [None]:
import boto3,sagemaker
print(boto3.__version__)
print(sagemaker.__version__)
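
The version requirement above can also be enforced in code rather than checked by eye. The following is a minimal sketch, assuming the SDK version string has the usual `major.minor.patch` form; `require_sagemaker_v2` is a hypothetical helper name, not part of the SageMaker SDK.

```python
def require_sagemaker_v2(version):
    """Raise if the given SageMaker SDK version string is older than 2.0."""
    major = int(version.split(".")[0])
    if major < 2:
        raise RuntimeError(
            "SageMaker SDK {} found; this walkthrough needs 2.x or later".format(version)
        )


# In the notebook, pass sagemaker.__version__ instead of a literal string.
require_sagemaker_v2("2.5.4")
```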

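## Preprocessing
### Permissions and environment variables
Set up the connection to AWS services and the authentication. There are three parts:

* The role that gives training and hosting access to your data. It is obtained automatically from the role used to start the notebook.

* The S3 bucket used to store the training data and the model.

* The Amazon SageMaker image classification Docker image, which needs no changes.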

In [None]:
%%time
import boto3
from sagemaker import get_execution_role
from sagemaker.image_uris import retrieve

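# Use the next line when running on a SageMaker notebook instance
role = get_execution_role()
# On a self-managed notebook instance, look up the role yourself (for example in the IAM console)
#role = "arn:aws-cn:iam::<<account id>>:role/service-role/AmazonSageMaker-ExecutionRole-20200430T124235"

bucket='<<your bucket>>' # S3 bucket that stores the processed images and the trained model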

training_image = retrieve('image-classification',boto3.Session().region_name)

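### Prepare the data
Instead of the Caltech-256 dataset used in the original notebook, this walkthrough uses Kaggle's [Dogs vs. Cats](https://www.kaggle.com/c/dogs-vs-cats/data?select=train.zip) dataset. After downloading, organize the cat and dog images in the following directory layout, one folder per class.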
```
input_data
├── class1
│   ├── image001.jpg
│   ├── image002.jpg
│   └── ...
├── class2
│   ├── image001.jpg
│   ├── image002.jpg
│   └── ...
└── classn
    ├── image001.jpg
    ├── image002.jpg
    └── ...
```

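#### Option 1: download from Kaggle
Download the [Dogs vs. Cats data](https://www.kaggle.com/c/dogs-vs-cats/data?select=train.zip) and upload it to the notebook's current directory. Download only train.zip; do not choose Download All.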

In [None]:
%%bash
unzip -q train.zip
mv train cat-vs-dog
mkdir -p cat-vs-dog/cat
mkdir -p cat-vs-dog/dog
mv cat-vs-dog/cat.*.jpg cat-vs-dog/cat/
mv cat-vs-dog/dog.*.jpg cat-vs-dog/dog/
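
The shell commands above sort the files by their name prefix. For readers who prefer to stay in Python, the same step can be sketched as follows. This assumes the Kaggle naming scheme `<class>.<id>.jpg` that the `mv` patterns above rely on; `sort_into_class_dirs` is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def sort_into_class_dirs(root, classes=("cat", "dog")):
    """Move files named '<class>.<id>.jpg' in root into root/<class>/."""
    root = Path(root)
    moved = {}
    for name in classes:
        target = root / name
        target.mkdir(parents=True, exist_ok=True)
        # Materialize the listing first so moving files does not disturb it.
        files = sorted(root.glob("{}.*.jpg".format(name)))
        for path in files:
            shutil.move(str(path), str(target / path.name))
        moved[name] = len(files)
    return moved


# Example call after unzipping and renaming the folder:
# sort_into_class_dirs("cat-vs-dog")
```

The function returns the number of files moved per class, which is a quick sanity check that both classes were found.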

#### Option 2: download a prepared archive

In [None]:
%%bash
wget -q -O cat-vs-dog.zip https://xxx/cat-vs-dog.zip
unzip -q cat-vs-dog.zip
# Change the first argument on the next line to the actual directory name
mv cat-vs-dog-1000 cat-vs-dog

### Split the data

In [None]:
%%bash

mkdir -p cat-vs-dog-val
for i in cat-vs-dog/*; do
    c=`basename $i`
    mkdir -p cat-vs-dog-val/$c
    for j in `ls $i/*.jpg | shuf | head -n 100`; do
        mv $j cat-vs-dog-val/$c/
    done
done

# Unlike the original notebook, we also hold out a test dataset
mkdir -p cat-vs-dog-test
for i in cat-vs-dog/*; do
    c=`basename $i`
    mkdir -p cat-vs-dog-test/$c
    for j in `ls $i/*.jpg | shuf | head -n 100`; do
        mv $j cat-vs-dog-test/$c/
    done
done

mv cat-vs-dog cat-vs-dog-train
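
The split above moves 100 randomly chosen images per class into a validation folder and another 100 into a test folder. A Python sketch of the same hold-out step is shown below, which also makes the split reproducible through a seed. The function name `hold_out` and its parameters are illustrative assumptions, not part of any library.

```python
import random
import shutil
from pathlib import Path


def hold_out(src_root, dst_root, per_class=100, seed=None):
    """Move up to per_class random .jpg files from each class folder to dst_root."""
    rng = random.Random(seed)
    src_root, dst_root = Path(src_root), Path(dst_root)
    moved = {}
    for class_dir in sorted(p for p in src_root.iterdir() if p.is_dir()):
        images = sorted(class_dir.glob("*.jpg"))
        chosen = rng.sample(images, min(per_class, len(images)))
        target = dst_root / class_dir.name
        target.mkdir(parents=True, exist_ok=True)
        for image in chosen:
            shutil.move(str(image), str(target / image.name))
        moved[class_dir.name] = len(chosen)
    return moved


# Example calls mirroring the shell cell:
# hold_out("cat-vs-dog", "cat-vs-dog-val", per_class=100, seed=0)
# hold_out("cat-vs-dog", "cat-vs-dog-test", per_class=100, seed=1)
```

Because files are moved rather than copied, the three resulting sets never share an image.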

In [None]:
# Tool for creating lst file
!wget -q https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/im2rec.py

In [None]:
%%bash
python im2rec.py --list --recursive cat-vs-dog-train cat-vs-dog-train/
python im2rec.py --list --recursive cat-vs-dog-val cat-vs-dog-val/

In [None]:
!head -n 3 ./cat-vs-dog-train.lst > example.lst
# read the sample through a context manager so the file handle is closed
with open('example.lst', 'r') as f:
    lst_content = f.read()
print(lst_content)
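
Each record printed above is a tab-separated line written by `im2rec.py`: an image index, a class label written as a float, and the image path relative to the dataset root. The sketch below parses one such record. The sample line is made up for illustration and `parse_lst_line` is a hypothetical helper name.

```python
def parse_lst_line(line):
    """Split one .lst record into (index, label, relative_path)."""
    fields = line.rstrip("\n").split("\t")
    index = int(fields[0])
    label = int(float(fields[1]))  # labels are stored as floats such as 1.000000
    relative_path = fields[-1]
    return index, label, relative_path


# Illustrative record, not taken from a real run.
sample = "42\t1.000000\tdog/dog.123.jpg\n"
print(parse_lst_line(sample))  # (42, 1, 'dog/dog.123.jpg')
```

The label is the class index the model is trained on, so it is also the position to look up in the category list at inference time.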

In [None]:
# Four channels: train, validation, train_lst, and validation_lst
s3train = 's3://{}/image-classification/cat-vs-dog/train/'.format(bucket)
s3validation = 's3://{}/image-classification/cat-vs-dog/validation/'.format(bucket)
s3train_lst = 's3://{}/image-classification/cat-vs-dog/train_lst/'.format(bucket)
s3validation_lst = 's3://{}/image-classification/cat-vs-dog/validation_lst/'.format(bucket)

# upload the image files to train and validation channels
!aws s3 cp cat-vs-dog-train $s3train --recursive --quiet
!aws s3 cp cat-vs-dog-val $s3validation --recursive --quiet

# upload the lst files to train_lst and validation_lst channels
!aws s3 cp cat-vs-dog-train.lst $s3train_lst --quiet
!aws s3 cp cat-vs-dog-val.lst $s3validation_lst --quiet

## Fine-tuning the image classification model
### Training parameters

In [None]:
# The algorithm supports multiple network depth (number of layers). They are 18, 34, 50, 101, 152 and 200
# For this training, we will use 18 layers
num_layers = 18
# we need to specify the input image shape for the training data
image_shape = "3,224,224"
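# we also need to specify the number of training samples in the training set;
# count the records in the .lst file generated above instead of hard-coding a value
with open('cat-vs-dog-train.lst') as lst_file:
    num_training_samples = sum(1 for line in lst_file if line.strip())
# specify the number of output classes (cat and dog)
num_classes = 2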
# batch size for training
mini_batch_size = 128
# number of epochs
epochs = 6
# learning rate
learning_rate = 0.01
# report top-k accuracy; with only two classes, use 2 (the algorithm expects a value greater than 1)
top_k = 2
# resize image before training
resize = 256
# period to store model parameters (in number of epochs), in this case, we will save parameters from epoch 2, 4, and 6
checkpoint_frequency = 2
# Since we are using transfer learning, we set use_pretrained_model to 1 so that weights can be 
# initialized with pre-trained weights
use_pretrained_model = 1
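
Two details of the parameters above are easy to get wrong: the `HyperParameters` field of the training request only accepts string values, and `checkpoint_frequency` determines which epochs are saved. The sketch below illustrates both. The helper names are hypothetical, and `checkpoint_epochs` simply follows the description in the comment above (every N-th epoch) rather than anything verified against the algorithm's internals.

```python
def to_hyperparameters(**values):
    """Convert Python values to the string-valued dict used for 'HyperParameters'."""
    return {name: str(value) for name, value in values.items()}


def checkpoint_epochs(epochs, checkpoint_frequency):
    """List the epochs at which parameters are saved, assuming every N-th epoch."""
    return list(range(checkpoint_frequency, epochs + 1, checkpoint_frequency))


print(checkpoint_epochs(6, 2))  # [2, 4, 6]
print(to_hyperparameters(num_layers=18, learning_rate=0.01))
```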

### Training

In [None]:
%%time
import time
import boto3
from time import gmtime, strftime


s3 = boto3.client('s3')
# create unique job name 
job_name_prefix = 'sagemaker-imageclassification-notebook'
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
job_name = job_name_prefix + timestamp
training_params = \
{
    # specify the training docker image
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": 's3://{}/{}/output'.format(bucket, job_name_prefix)
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.p3.2xlarge",
        "VolumeSizeInGB": 50
    },
    "EnableManagedSpotTraining": True,
    "TrainingJobName": job_name,
    "HyperParameters": {
        "image_shape": image_shape,
        "num_layers": str(num_layers),
        "num_training_samples": str(num_training_samples),
        "num_classes": str(num_classes),
        "mini_batch_size": str(mini_batch_size),
        "epochs": str(epochs),
        "learning_rate": str(learning_rate),
        "top_k": str(top_k),
        "resize": str(resize),
        "checkpoint_frequency": str(checkpoint_frequency),
        "use_pretrained_model": str(use_pretrained_model)    
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 360000,
        "MaxWaitTimeInSeconds": 360000
    },
    # Training data should be inside a subdirectory called "train"
    # Validation data should be inside a subdirectory called "validation"
    # The algorithm currently only supports fully replicated mode (the data is copied onto each machine)
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3train,
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-image",
            "CompressionType": "None"
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3validation,
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-image",
            "CompressionType": "None"
        },
        {
            "ChannelName": "train_lst",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3train_lst,
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-image",
            "CompressionType": "None"
        },
        {
            "ChannelName": "validation_lst",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3validation_lst,
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-image",
            "CompressionType": "None"
        }
    ]
}
print('Training job name: {}'.format(job_name))
print('\nInput Data Location: {}'.format(training_params['InputDataConfig'][0]['DataSource']['S3DataSource']))

In [None]:
# create the Amazon SageMaker training job
# note: this client reuses the name `sagemaker`, which shadows the SDK module imported earlier
sagemaker = boto3.client(service_name='sagemaker')
sagemaker.create_training_job(**training_params)

# confirm that the training job has started
status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print('Training job current status: {}'.format(status))

try:
    # wait for the job to finish and report the ending status
    sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
    training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    status = training_info['TrainingJobStatus']
    print("Training job ended with status: " + status)
except Exception:
    # the waiter raises when the job fails or the wait times out
    print('Training did not complete successfully')
    message = sagemaker.describe_training_job(TrainingJobName=job_name).get('FailureReason', 'unknown')
    print('Training failed with the following error: {}'.format(message))

In [None]:
training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
status = training_info['TrainingJobStatus']
print("Training job ended with status: " + status)
print (training_info)

If you see the message

> `Training job ended with status: Completed`

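it means that training completed successfully and the output model is stored in the output path specified by `training_params['OutputDataConfig']`.

You can also view information about the training job and its status in the AWS SageMaker console.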
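
If you would rather poll the status yourself than rely on the built-in waiter, the loop can be sketched as below. It only uses the `describe_training_job` call already shown in this notebook; the function name, the polling interval and the injectable `sleep` argument are assumptions made for illustration.

```python
import time

# Statuses after which a training job no longer changes.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(client, job_name, poll_seconds=60, sleep=time.sleep):
    """Poll describe_training_job until the job reaches a terminal status."""
    while True:
        info = client.describe_training_job(TrainingJobName=job_name)
        status = info["TrainingJobStatus"]
        if status in TERMINAL_STATES:
            # FailureReason is only present when the job has failed.
            return status, info.get("FailureReason")
        sleep(poll_seconds)


# Example call in the notebook, using the boto3 SageMaker client created earlier:
# status, reason = wait_for_training_job(sagemaker, job_name)
```

Passing `sleep` as a parameter keeps the function easy to test without waiting in real time.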

## Deploy the model

### Create the model

In [None]:
%%time
import time
import boto3
from time import gmtime, strftime

sage = boto3.Session().client(service_name='sagemaker') 

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
model_name="image-classification-model" + timestamp
print(model_name)
info = sage.describe_training_job(TrainingJobName=job_name)
model_data = info['ModelArtifacts']['S3ModelArtifacts']
print(model_data)

hosting_image = retrieve('image-classification',boto3.Session().region_name)

primary_container = {
    'Image': hosting_image,
    'ModelDataUrl': model_data,
}

create_model_response = sage.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

print(create_model_response['ModelArn'])

### Inference

#### Create the endpoint configuration

In [None]:
import time

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_config_name = job_name_prefix + '-epc-' + timestamp
endpoint_config_response = sage.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType':'ml.t2.medium',
        'InitialInstanceCount':1,
        'ModelName':model_name,
        'VariantName':'AllTraffic'}])

print('Endpoint configuration name: {}'.format(endpoint_config_name))
print('Endpoint configuration arn:  {}'.format(endpoint_config_response['EndpointConfigArn']))

#### Create the endpoint

In [None]:
%%time
import time

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_name = job_name_prefix + '-ep-' + timestamp
print('Endpoint name: {}'.format(endpoint_name))

endpoint_params = {
    'EndpointName': endpoint_name,
    'EndpointConfigName': endpoint_config_name,
}
endpoint_response = sagemaker.create_endpoint(**endpoint_params)
print('EndpointArn = {}'.format(endpoint_response['EndpointArn']))

Creating the endpoint takes roughly 10-15 minutes.

In [None]:
# get the status of the endpoint
response = sagemaker.describe_endpoint(EndpointName=endpoint_name)
status = response['EndpointStatus']
print('EndpointStatus = {}'.format(status))
    
try:
    sagemaker.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)
finally:
    resp = sagemaker.describe_endpoint(EndpointName=endpoint_name)
    status = resp['EndpointStatus']
    print("Arn: " + resp['EndpointArn'])
    print("Create endpoint ended with status: " + status)

    if status != 'InService':
        message = resp.get('FailureReason', 'unknown')
        print('Endpoint creation failed with the following error: {}'.format(message))
        raise Exception('Endpoint creation did not succeed')

If you see the message

> `Create endpoint ended with status: InService`

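then congratulations! You now have a working inference endpoint. You can confirm the endpoint configuration and status on the "Endpoints" tab of the AWS SageMaker console.

Next, we create a runtime object through which the endpoint can be invoked.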

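#### Perform inference

The model can now be validated for use. Take the endpoint created in the previous steps and use it to generate classifications from the trained model.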


In [None]:
import boto3
runtime = boto3.Session().client(service_name='runtime.sagemaker') 

In [None]:
import os
cat_test_dir='./cat-vs-dog-test/cat/'
files = os.listdir(cat_test_dir)
file_name = os.path.join(cat_test_dir, files[0])
# test image
from IPython.display import Image
Image(file_name)  

In [None]:
import json
import numpy as np
with open(file_name, 'rb') as f:
    payload = f.read()
    payload = bytearray(payload)
response = runtime.invoke_endpoint(EndpointName=endpoint_name, 
                                   ContentType='application/x-image', 
                                   Body=payload)
result = response['Body'].read()
# the response body is a JSON array; parse it into a Python list
result = json.loads(result)
# the result will output the probabilities for all classes
# find the class with maximum probability and print the class index
index = np.argmax(result)
object_categories = ['cat','dog']
print("Result: label - " + object_categories[index] + ", probability - " + str(result[index]))
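
The cell above classifies a single image. Since a separate test set was held out earlier, the same logic can be wrapped to measure accuracy over a whole folder. The sketch below keeps the endpoint call behind a `classify` callable so that it stays independent of AWS; both function names are hypothetical, and the folder layout `test_root/<label>/*.jpg` is the one produced by the data split.

```python
import json
from pathlib import Path


def top_prediction(body, categories):
    """Return (label, probability) of the highest-scoring class in a JSON response body."""
    probabilities = json.loads(body)
    index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return categories[index], probabilities[index]


def accuracy_on_directory(test_root, categories, classify):
    """Fraction of .jpg files under test_root/<label>/ that classify() labels correctly."""
    correct = total = 0
    for label in categories:
        for image in sorted(Path(test_root, label).glob("*.jpg")):
            predicted = classify(image.read_bytes())
            correct += int(predicted == label)
            total += 1
    return correct / total if total else 0.0


# In the notebook, classify could wrap the endpoint call shown above:
# def classify(image_bytes):
#     response = runtime.invoke_endpoint(EndpointName=endpoint_name,
#                                        ContentType='application/x-image',
#                                        Body=image_bytes)
#     return top_prediction(response['Body'].read(), object_categories)[0]
# print(accuracy_on_directory('./cat-vs-dog-test', object_categories, classify))
```

The category order must match the label indices in the `.lst` file, which is why `object_categories` lists `cat` before `dog`.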

#### Clean-up

When we are done with the endpoint, we can delete it, which releases the instances backing it.

In [None]:
sage.delete_endpoint(EndpointName=endpoint_name)
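
Deleting the endpoint releases the hosting instance, but the endpoint configuration and the model record created earlier stay in the account. As an alternative to the single call above, the sketch below removes all three using the boto3 SageMaker client calls `delete_endpoint`, `delete_endpoint_config` and `delete_model`; the wrapper name `delete_hosting_resources` is an assumption for illustration.

```python
def delete_hosting_resources(client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, then its configuration, then the model record."""
    client.delete_endpoint(EndpointName=endpoint_name)
    client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    client.delete_model(ModelName=model_name)


# Example call in the notebook (use instead of the single delete_endpoint call,
# since deleting an endpoint that no longer exists raises an error):
# delete_hosting_resources(sage, endpoint_name, endpoint_config_name, model_name)
```

The model artifacts in S3 are not touched by these calls and can be removed separately if they are no longer needed.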