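# Training an Image Recognition Model for Car Models with Amazon SageMaker

## An End-to-End Image Classifier Based on Deep Transfer Learning

Deep learning typically requires large amounts of data, compute resources, and training time, which are often hard to come by in practice. Transfer learning substantially reduces the data, compute, and time needed, and lets you tailor a model to the business needs of a new scenario, which makes it a powerful tool.

This solution is built on Amazon SageMaker. It fine-tunes a pretrained image classification model on your own data, here the [cars-data](https://ai.stanford.edu/~jkrause/cars/car_dataset.html) dataset, to build a car model classifier with fairly high accuracy. Amazon SageMaker is a fully managed, modular machine learning service that helps developers and data scientists build, train, and deploy machine learning models at scale.

## Main Goal

On the Amazon SageMaker machine learning platform, using the MXNet framework, apply transfer learning to a pretrained image classification model with data you collect yourself, and build a car model classifier with fairly high accuracy.

# Architecture Diagram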
![image.png](attachment:image.png)

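# Step 1: Create a Notebook Instance
<kbd>About 3 minutes</kbd> <br />
Note: This demo uses the AWS China regions as an example; it also works in the AWS global regions (https://console.aws.amazon.com).<br />
Sign in to the AWS China console (https://console.amazonaws.cn/), switch the region to "China (Beijing) cn-north-1", and create a notebook instance in the SageMaker service. Keep the default "Create an IAM role" option and choose a notebook instance type as needed, for example "ml.t3.medium".<br />
Wait until the notebook instance has been created and its status changes from "Pending" to "InService", then click the "Open Jupyter" link next to it to enter Jupyter.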

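# Step 2: Download and Preprocess the Data

<kbd>About 3 minutes</kbd> <br />
The training data for this demo is taken from the open "Cars" dataset provided by Stanford University (http://ai.stanford.edu/~jkrause/cars/car_dataset.html), which contains 16,185 images of 196 classes of cars. The data is split into 8,144 training images and 8,041 test images, with each class split roughly half and half between training and testing. Classes are typically at the level of make, model, and year, for example 2012 Tesla Model S or 2012 BMW M3 coupe.<br />
For demonstration purposes, the training data here covers only 3 car models, 120 images in total:<br />
Acura Integra Type R 2001, 45 images<br />
Acura RL Sedan 2012, 32 images<br />
Acura TL Sedan 2012, 43 images<br />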
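
To confirm that a local copy of the data matches the per-class counts listed above, the images can be counted folder by folder. This is a minimal sketch, assuming one sub-folder per class (the layout that `im2rec.py --recursive` expects); the helper name `count_images_per_class` and the set of image suffixes are illustrative and not part of the original notebook.

```python
from pathlib import Path

# Assumption: these are the image file types present in the dataset folders.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_class(root):
    """Return {class_folder_name: image_count} for a root with one sub-folder per class."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for p in class_dir.rglob("*")
                if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


# Only runs when the sample data has already been unpacked locally.
sample_root = Path("car_data_sample/train")
if sample_root.is_dir():
    for name, n in count_images_per_class(sample_root).items():
        print("{}: {} images".format(name, n))
```

The sum of the counts is also a natural value to cross-check against the `num_training_samples` hyperparameter used later.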

In [81]:
%%bash
python im2rec.py --list --recursive --train-ratio 1 data_train ./efs/newdataset/car_data/car_data/train
python im2rec.py --resize 224 --center-crop --num-thread 4 ./ ./efs/newdataset/car_data/car_data/train

AM General Hummer SUV 2000 0
Acura Integra Type R 2001 1
Acura RL Sedan 2012 2
Acura TL Sedan 2012 3
Acura TL Type-S 2008 4
Acura TSX Sedan 2012 5
Acura ZDX Hatchback 2012 6
Aston Martin V8 Vantage Convertible 2012 7
Aston Martin V8 Vantage Coupe 2012 8
Aston Martin Virage Convertible 2012 9
Aston Martin Virage Coupe 2012 10
Audi 100 Sedan 1994 11
Audi 100 Wagon 1994 12
Audi A5 Coupe 2012 13
Audi R8 Coupe 2012 14
Audi RS 4 Convertible 2008 15
Audi S4 Sedan 2007 16
Audi S4 Sedan 2012 17
Audi S5 Convertible 2012 18
Audi S5 Coupe 2012 19
Audi S6 Sedan 2011 20
Audi TT Hatchback 2011 21
Audi TT RS Coupe 2012 22
Audi TTS Coupe 2012 23
Audi V8 Sedan 1994 24
BMW 1 Series Convertible 2012 25
BMW 1 Series Coupe 2012 26
BMW 3 Series Sedan 2012 27
BMW 3 Series Wagon 2012 28
BMW 6 Series Convertible 2007 29
BMW ActiveHybrid 5 Sedan 2012 30
BMW M3 Coupe 2012 31
BMW M5 Sedan 2010 32
BMW M6 Convertible 2010 33
BMW X3 SUV 2012 34
BMW X5 SUV 2007 35
BMW X6 SUV 2012 36
BMW Z4 Convertible 2012 37
Bentley 

In [82]:
%%bash
python im2rec.py --list --recursive --train-ratio 1 data_val ./efs/newdataset/car_data/car_data/test
python im2rec.py --resize 224 --center-crop --num-thread 4 ./ ./efs/newdataset/car_data/car_data/test

Fetch the data

In [2]:
%%bash

# Download the im2rec.py conversion script and the sample dataset
wget https://shishuai-share-external.s3.cn-north-1.amazonaws.com.cn/script/im2rec.py
wget https://sagemaker-sample-dataset-bjs.s3.cn-north-1.amazonaws.com.cn/car_data_sample.zip
unzip car_data_sample.zip

# -p avoids an error if the folders already exist
mkdir -p train validation
data_path=car_data_sample
echo "data_path: ${data_path}"
train_path=train/
echo "train_path: ${train_path}"
val_path=validation/
echo "val_path: ${val_path}"

# Generate the .lst files: 80% of the images for training, the rest for validation
python im2rec.py --list --train-ratio 0.8 --recursive $data_path/data $data_path/train

# Pack the resized, center-cropped images into RecordIO (.rec) files
python im2rec.py --resize 224 --center-crop --num-thread 4 $data_path/data $data_path/train

mv ${data_path}/data_train.rec $train_path
mv ${data_path}/data_val.rec $val_path


--2020-12-11 02:30:13--  https://shishuai-share-external.s3.cn-north-1.amazonaws.com.cn/script/im2rec.py
Resolving shishuai-share-external.s3.cn-north-1.amazonaws.com.cn (shishuai-share-external.s3.cn-north-1.amazonaws.com.cn)... 54.222.49.93
Connecting to shishuai-share-external.s3.cn-north-1.amazonaws.com.cn (shishuai-share-external.s3.cn-north-1.amazonaws.com.cn)|54.222.49.93|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13828 (14K) [text/x-python-script]
Saving to: ‘im2rec.py.1’

     0K .......... ...                                        100% 12.1M=0.001s

2020-12-11 02:30:13 (12.1 MB/s) - ‘im2rec.py.1’ saved [13828/13828]

--2020-12-11 02:30:13--  https://sagemaker-sample-dataset-bjs.s3.cn-north-1.amazonaws.com.cn/car_data_sample.zip
Resolving sagemaker-sample-dataset-bjs.s3.cn-north-1.amazonaws.com.cn (sagemaker-sample-dataset-bjs.s3.cn-north-1.amazonaws.com.cn)... 54.222.49.93
Connecting to sagemaker-sample-dataset-bjs.s3.cn-north-1.amazonaws.com.c


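# Step 3: Prepare the Environment and Upload the Processed Data to the Default S3 Bucket
<kbd>About 1 minute</kbd> <br />
Set up the permissions and environment needed to use AWS services. There are three parts:<br />
(1) The permissions required for model training, which are obtained automatically from the role used to create the notebook <br />
(2) The S3 bucket that stores the training data and the model <br />
(3) The Docker image of the pretrained image classification model in Amazon SageMaker<br />

Finally, upload the processed data to the S3 bucket; you can then check in the S3 console that the upload succeeded.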
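
Besides checking in the S3 console, the upload can be verified from the notebook. The sketch below lists the keys under a prefix with the S3 `list_objects_v2` call; the helper name `list_uploaded_keys` is hypothetical, and the client is passed in as a parameter so the function can be tried without AWS access.

```python
def list_uploaded_keys(s3_client, bucket, prefix):
    """Return (key, size_in_bytes) pairs found under the prefix.

    A single list_objects_v2 call returns at most 1000 keys, which is
    plenty for the two .rec files uploaded in this demo.
    """
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [(obj["Key"], obj["Size"]) for obj in response.get("Contents", [])]


# Usage inside the notebook (needs AWS credentials, so it is left commented out):
# import boto3
# print(list_uploaded_keys(boto3.client("s3"), bucket, "train/"))
# print(list_uploaded_keys(boto3.client("s3"), bucket, "validation/"))
```

An empty list means nothing has been uploaded under that prefix yet.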

In [108]:
%%time
!pwd
import boto3
import re
import os 

import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

role = get_execution_role()
print(role)
sess = sagemaker.Session()

bucket = sess.default_bucket()
print('Default bucket:{}'.format(bucket))
prefix = 'car-classifier'

training_image = get_image_uri(sess.boto_region_name, 'image-classification')
print(training_image)

/home/ec2-user/SageMaker
arn:aws-cn:iam::383709301087:role/sgbootcamp-internal-SageMakerExecutionRole-1HIZ19SFK76J4


The method get_image_uri has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


Default bucket:sagemaker-cn-north-1-383709301087
390948362332.dkr.ecr.cn-north-1.amazonaws.com.cn/image-classification:1
CPU times: user 339 ms, sys: 0 ns, total: 339 ms
Wall time: 3.76 s


In [109]:
# Setup an output S3 location for the model artifact
output_location = 's3://{}/{}/output'.format(bucket, prefix)
print('training artifacts will be uploaded to: {}'.format(output_location))

training artifacts will be uploaded to: s3://sagemaker-cn-north-1-383709301087/car-classifier/output


## Prepare File System Input
Next, we specify the file system details as an input to the training job.
Using a file system as the data source eliminates the time the training job spends downloading data, because data is streamed
directly from the file system into the training algorithm.
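
The directory paths in the next cell have to be absolute and normalized. A small check before creating the inputs can catch a mistyped path early. This is a sketch only; `check_directory_path` is a hypothetical helper and not part of the SageMaker SDK.

```python
import posixpath


def check_directory_path(path):
    """Return the path unchanged if it is absolute and already normalized."""
    if not path.startswith("/"):
        raise ValueError("directory path must be absolute: {!r}".format(path))
    # normpath collapses '..', '.', repeated slashes and trailing slashes.
    normalized = posixpath.normpath(path)
    if normalized != path:
        raise ValueError(
            "directory path is not normalized: {!r} (expected {!r})".format(path, normalized)
        )
    return path


print(check_directory_path("/sagemaker/training-data-fsx/train"))
```

Paths such as `newdataset/train` or `/a/../b` raise a `ValueError` instead of failing later when the training job starts.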

In [110]:
from sagemaker.inputs import FileSystemInput

# Specify file system id.
#file_system_id = 'fs-faf02d67'
file_system_id = 'fs-03042ec6053a32690'

# Specify directory path associated with the file system. You need to provide normalized and absolute path here.
#file_system_directory_train_path = '/newdataset/train'
#file_system_directory_val_path = '/newdataset/validation'

file_system_directory_train_path = '/sagemaker/training-data-fsx/train'
file_system_directory_val_path = '/sagemaker/training-data-fsx/training-data-fsx/validation'


# Specify the access mode of the mount of the directory associated with the file system. 
# Directory can be mounted either in 'ro'(read-only) or 'rw' (read-write).
file_system_access_mode = 'ro'

# Specify your file system type, "EFS" or "FSxLustre".
#file_system_type = 'EFS'
file_system_type = 'FSxLustre'

# Give Amazon SageMaker Training Jobs Access to FileSystem Resources in Your Amazon VPC.
security_groups_ids = ['sg-98d94ffe']
subnets = ['subnet-c666f5a2']

file_system_input_train = FileSystemInput(file_system_id=file_system_id,
                                    file_system_type=file_system_type,
                                    directory_path=file_system_directory_train_path,
                                    file_system_access_mode=file_system_access_mode,
                                    content_type='application/x-recordio')
file_system_input_val = FileSystemInput(file_system_id=file_system_id,
                                    file_system_type=file_system_type,
                                    directory_path=file_system_directory_val_path,
                                    content_type='application/x-recordio',
                                    file_system_access_mode=file_system_access_mode)

## S3 as Data Source

In [2]:
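def upload_to_s3(file):
    # The local file path is reused as the S3 object key
    s3 = boto3.resource('s3')
    key = file
    # The context manager closes the file once the upload finishes
    with open(file, "rb") as data:
        s3.Bucket(bucket).put_object(Key=key, Body=data)
    print('Upload {}/{} successful'.format(bucket, key))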


# upload to S3 bucket
s3_train_key = "train"
s3_validation_key = "validation"
s3_train = 's3://{}/{}/'.format(bucket, s3_train_key)
s3_validation = 's3://{}/{}/'.format(bucket, s3_validation_key)

upload_to_s3('train/data_train.rec')
upload_to_s3('validation/data_val.rec')


/home/ec2-user/SageMaker/bootcamp
arn:aws:iam::249517808360:role/service-role/AmazonSageMaker-ExecutionRole-20191211T161371
Default bucket:sagemaker-ap-northeast-1-249517808360
501404015308.dkr.ecr.ap-northeast-1.amazonaws.com/image-classification:1
Upload sagemaker-ap-northeast-1-249517808360/train/data_train.rec successful
Upload sagemaker-ap-northeast-1-249517808360/validation/data_val.rec successful
CPU times: user 1 s, sys: 79 ms, total: 1.08 s
Wall time: 2.05 s


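# Step 4: Train the Model with Transfer Learning
<kbd>About 6 minutes</kbd> <br />
With the dataset ready, we can start training the model. Before launching the training job, we need to configure a set of hyperparameters, which have the following meanings:
```
num_layers: number of layers in the network; in this example you can choose 18, 34, 50, 101, 152, or 200. The number in the name of many classic networks is the layer count; for example, the 16 in VGG16 is the number of weight layers
image_shape: number of channels, height, and width of the input image
num_training_samples: number of training samples
num_classes: number of classes in the training data; for simplicity, the sample data uses only three classes
mini_batch_size: number of samples in each training batch
epochs: number of training epochs
learning_rate: learning rate for training
use_pretrained_model: whether to use a pretrained model for transfer learning; if set to 1, the network is initialized from a model already trained on a large open dataset, such as ResNet trained on ImageNet
```

The built-in image-classification algorithm in Amazon SageMaker: https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html  <br />
Next, we create the necessary SageMaker API objects and build the training job, specifying the training inputs and outputs and the compute instance configuration. The code below uses an ml.p3.8xlarge GPU instance. Note that local data processing in the SageMaker notebook, model training, and model inference run in separate environments, so you can choose a different instance type for each according to its compute needs.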
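
Before launching a job, it can help to sanity-check the hyperparameter values against the list above. The sketch below validates `num_layers` and estimates how many mini-batches a run will process; `training_plan` is a hypothetical helper, and rounding the last partial batch up is an assumption about how batches are counted, not documented behavior of the built-in algorithm.

```python
import math

# Layer counts listed in the hyperparameter description above.
ALLOWED_NUM_LAYERS = {18, 34, 50, 101, 152, 200}


def training_plan(num_layers, num_training_samples, mini_batch_size, epochs):
    """Validate the settings and estimate the number of mini-batches."""
    if num_layers not in ALLOWED_NUM_LAYERS:
        raise ValueError("num_layers must be one of {}".format(sorted(ALLOWED_NUM_LAYERS)))
    if not 0 < mini_batch_size <= num_training_samples:
        raise ValueError("mini_batch_size must be between 1 and num_training_samples")
    # Assumption: a final partial batch still counts as one batch.
    batches_per_epoch = math.ceil(num_training_samples / mini_batch_size)
    return {
        "batches_per_epoch": batches_per_epoch,
        "total_batches": batches_per_epoch * epochs,
    }


# Values for the full Cars training split (8,144 images) with a batch size of 30.
print(training_plan(num_layers=18, num_training_samples=8144, mini_batch_size=30, epochs=10))
```

With the 120-image sample data, the same batch size gives only 4 batches per epoch, which is one reason the sample run finishes quickly.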

In [111]:
import time

job_name = 'DEMO-spot-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)
use_spot_instances = True
# Checkpoint location, set only when training on Spot instances
checkpoint_s3_uri = ('s3://{}/{}/checkpoints/{}'.format(bucket, prefix, job_name)
                     if use_spot_instances else None)
# Parameter names follow sagemaker>=2, where the train_* names were renamed
# (see https://sagemaker.readthedocs.io/en/stable/v2.html)
ic = sagemaker.estimator.Estimator(training_image,
                                   role,
                                   subnets=subnets,
                                   security_group_ids=security_groups_ids,
                                   instance_count=1,
                                   instance_type='ml.p3.8xlarge',
                                   use_spot_instances=use_spot_instances,
                                   max_wait=3600 if use_spot_instances else None,
                                   checkpoint_s3_uri=checkpoint_s3_uri,
                                   volume_size=50,
                                   max_run=3600,
                                   input_mode='File',
                                   output_path=s3_output_location,
                                   sagemaker_session=sess)


In [None]:
ic.set_hyperparameters(num_layers=18,
                         image_shape = "3,224,224",
                         use_pretrained_model=1,
                         num_classes=196,
                         num_training_samples=8144,
                         mini_batch_size=30,
                         epochs=10,
                         learning_rate=0.01,
                         top_k=2,
                         precision_dtype='float32')

In [102]:
'''
train_data = sagemaker.session.s3_input(s3_train, distribution='FullyReplicated', 
                        content_type='application/x-recordio', s3_data_type='S3Prefix')
validation_data = sagemaker.session.s3_input(s3_validation, distribution='FullyReplicated', 
                             content_type='application/x-recordio', s3_data_type='S3Prefix')
'''

data_channels = {'train': file_system_input_train, 'validation': file_system_input_val}

In [107]:
ic.fit(inputs=data_channels, logs=True)

2021-01-29 06:35:55 Starting - Starting the training job...
2021-01-29 06:35:57 Starting - Launching requested ML instances............
2021-01-29 06:38:07 Starting - Preparing the instances for training.........
2021-01-29 06:39:37 Downloading - Downloading input data...
2021-01-29 06:40:19 Training - Downloading the training image..
Docker entrypoint called with argument(s): train
[01/29/2021 06:40:39 INFO 140053937038464] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/image_classification/default-input.json: {u'beta_1': 0.9, u'gamma': 0.9, u'beta_2': 0.999, u'optimizer': u'sgd', u'use_pretrained_model': 0, u'eps': 1e-08, u'epochs': 30, u'lr_scheduler_factor': 0.1, u'num_layers': 152, u'image_shape': u'3,224,224', u'precision_dtype': u'float32', u'mini_batch_size': 32, u'weight_decay': 0.0001, u'learning_rate': 0.1, u'momentum': 0}
[01/29/2021 06:40:39 INFO 140053937038464] Merging with provided configuration from /opt/ml/input/config



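- You can track the progress of the training job in the SageMaker console; training time depends on the training instance type and the number of epochs
- During training, you can view the training loss in CloudWatch

The training job is finished once `Training job completed` is returned
![image.png](attachment:image.png)

After completing the steps above, you can see your training job in the SageMaker console

When the status is Completed, training has finished. The whole training run takes about 6 minutes; this varies with the instance type, the number of epochs, and other settings.<br />

(Demo)

During training, you can also monitor CloudWatch Logs to follow how the loss changes<br />

You can also view the "History" in the SageMaker console, as shown below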
![image.png](attachment:image.png)
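
The status shown in the console can also be polled from code. The sketch below waits until a training job reaches a terminal status by reading the `TrainingJobStatus` field returned by the SageMaker `describe_training_job` API. The helper name `wait_for_training_job` is hypothetical, and the describe function is passed in as a parameter so the loop can be exercised without AWS access.

```python
import time

# Statuses after which a training job no longer changes.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(describe, job_name, poll_seconds=30, max_polls=240):
    """Poll describe(TrainingJobName=...) until the job reaches a terminal status."""
    for _ in range(max_polls):
        status = describe(TrainingJobName=job_name)["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("training job {!r} did not finish in time".format(job_name))


# Usage inside the notebook (needs AWS credentials, so it is left commented out):
# import boto3
# client = boto3.client("sagemaker")
# print(wait_for_training_job(client.describe_training_job, ic.latest_training_job.name))
```

Since `ic.fit(..., logs=True)` already blocks until the job ends, this is mainly useful when the job was started without waiting or from another session.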

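# Step 5: Deploy the Model
<kbd>About 6 minutes</kbd> <br />
After training, the latest model artifact is available in the S3 bucket configured earlier. Next we deploy it to an online endpoint so that it can serve predictions in response to RESTful API requests from clients.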

In [30]:
ic_classifier = ic.deploy(initial_instance_count = 1, instance_type = 'ml.m5.xlarge')

---------------!
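
Once the endpoint is in service, a client that does not use the SageMaker Python SDK can call it through the `sagemaker-runtime` API. The following sketch sends raw image bytes with `invoke_endpoint` and decodes the JSON list of class probabilities; `classify_image` is a hypothetical helper, and the client is injected so the function can be tried without a live endpoint.

```python
import json


def classify_image(runtime_client, endpoint_name, image_bytes):
    """Send one image to the endpoint and return the list of class probabilities."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/x-image",
        Body=image_bytes,
    )
    # The response body is a stream containing a JSON array of probabilities.
    return json.loads(response["Body"].read())


# Usage inside the notebook (needs a deployed endpoint, so it is left commented out):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# with open("car.jpg", "rb") as f:  # hypothetical local image file
#     probabilities = classify_image(runtime, ic_classifier.endpoint_name, f.read())
```

The attribute `endpoint_name` assumes sagemaker>=2; older SDK versions expose the name as `endpoint`.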

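# Step 6: Inference and Application
<kbd>About 2 minutes</kbd> <br />
Now we classify the model of a car from a randomly chosen image.
We call the endpoint we created directly for inference and get back the predicted class and its probability, which can be checked against the folder the image was taken from. The probability here is not very high, which is expected given that the demo trains for only a few epochs. For higher accuracy, train on the full dataset for more epochs.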

In [29]:
import random
from IPython.display import Image

car = 'Acura Integra Type R 2001/'
testls = os.listdir('car_data_sample/test/'+car)
file_name = 'car_data_sample/test/'+car+testls[random.randint(0,len(testls)-1)]

# test image
Image(file_name)  

<IPython.core.display.Image object>

In [31]:
import json
import numpy as np
from sagemaker.serializers import IdentitySerializer

# Read the test image as raw bytes
with open(file_name, 'rb') as f:
    payload = f.read()

# In sagemaker>=2 the request content type is set through the serializer
ic_classifier.serializer = IdentitySerializer(content_type='application/x-image')
result = json.loads(ic_classifier.predict(payload))
# the result will output the probabilities for all classes
# find the class with maximum probability and print the class index
index = np.argmax(result)
# This list must follow the class order (label indices) used for training
object_categories = ['Acura Integra Type R 2001', 'Acura RL Sedan 2012', 'Acura TL Sedan 2012']
print("Result: label - " + object_categories[index] + ", probability - " + str(result[index]))

Result: label - Acura TL Sedan 2012, probability - 0.643511176109314
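
The prediction cell prints only the single most likely class. When the top probability is modest, as it is here, looking at the runners-up is often more informative. A minimal sketch, assuming `result` is the list of per-class probabilities returned by the endpoint and the label list follows the training class order; `top_k_predictions` is a hypothetical helper.

```python
def top_k_predictions(probabilities, labels, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    if len(probabilities) != len(labels):
        # A mismatch usually means the label list does not cover all trained classes.
        raise ValueError(
            "got {} probabilities for {} labels".format(len(probabilities), len(labels))
        )
    ranked = sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up probabilities for illustration only, not output from the endpoint.
example_labels = ['Acura Integra Type R 2001', 'Acura RL Sedan 2012', 'Acura TL Sedan 2012']
for label, p in top_k_predictions([0.2, 0.1, 0.7], example_labels, k=2):
    print("{}: {:.2f}".format(label, p))
```

The length check matters when the model is trained with more classes than the label list contains; indexing the list with the `argmax` of the result would otherwise fail or pick the wrong name.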


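That completes the full process of building, training, and deploying an image classification model with Amazon SageMaker. You can adapt it to your own image classification tasks in other scenarios.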

# Final Step: Clean Up Resources
<kbd>About 1 minute</kbd> <br />
After testing, delete the endpoint to release its resources

In [None]:
ic_classifier.delete_endpoint()