# Using Amazon FSx for Lustre and Amazon EFS as data sources to speed up Amazon SageMaker training

## Preparing the test dataset

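The training data for this demo is taken from the open-source "Cars" dataset published by Stanford University (http://ai.stanford.edu/~jkrause/cars/car_dataset.html). The dataset contains 16,185 images of 196 classes of cars, split into 8,144 training images and 8,041 test images. Classes are defined by make, model, and year, for example 2012 Tesla Model S or 2012 BMW M3 coupe.<br />
This demo uses a copy hosted on Kaggle in which the images are already sorted into one folder per car class.<br />
- First, install the kaggle command-line tool with pip
- Then download the dataset with kaggle datasets download (the Kaggle CLI needs an API token to be configured beforehand)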

In [None]:
%%bash
pip install kaggle
kaggle datasets download -d jutrera/stanford-car-dataset-by-classes-folder

Once the download finishes, unzip the archive and define the input and output data paths used in the following steps.

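In [None]:
%%bash
unzip stanford-car-dataset-by-classes-folder.zip
# -p keeps the cell re-runnable if the folders already exist.
mkdir -p train_data validation
# Each %%bash cell runs in its own shell, so these variables are only visible inside this cell.
data_path=stanford-car-dataset-by-classes-folder
echo "data_path: ${data_path}"
train_path=train_data/
echo "train_path: ${train_path}"
val_path=validation/
echo "val_path: ${val_path}"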
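
Before converting anything, it can help to confirm that the extracted archive really has one folder per class. The following is a minimal sketch, not part of the original notebook: the helper name `count_images_per_class`, the list of image suffixes, and the example path are assumptions for illustration.

```python
from pathlib import Path

# Assumed set of image suffixes; extend it if the dataset uses other formats.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_class(split_dir):
    """Return {class_folder_name: image_count} for one split directory."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files next to the class folders
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


if __name__ == "__main__":
    # Assumed location, based on the data_path value used in this walkthrough.
    train_dir = Path("stanford-car-dataset-by-classes-folder/car_data/train")
    if train_dir.is_dir():
        counts = count_images_per_class(train_dir)
        print(f"{len(counts)} classes, {sum(counts.values())} images")
    else:
        print(f"{train_dir} not found; adjust the path to your extraction directory")
```

If the class or image totals differ from the figures quoted for the dataset, the extraction path is probably not the one assumed here.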

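Run the im2rec script to convert the training images to RecordIO format. "$data_path/car_data/train" can also be replaced with the absolute path of the target folder.
- Alternatively, place pre-built record files directly in the training and validation folders and start training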

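In [81]:
%%bash
# Variables from earlier %%bash cells do not persist, so define them again here.
data_path=stanford-car-dataset-by-classes-folder
train_path=train_data/
# Step 1: generate data_train.lst; step 2: pack the listed images into data_train.rec.
python im2rec.py --list --recursive --train-ratio 1 data_train "$data_path/car_data/train"
python im2rec.py --resize 224 --center-crop --num-thread 4 ./data_train "$data_path/car_data/train"
# The .rec file is written next to the .lst file, i.e. in the current directory.
mv data_train.rec "$train_path"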

AM General Hummer SUV 2000 0
Acura Integra Type R 2001 1
Acura RL Sedan 2012 2
Acura TL Sedan 2012 3
Acura TL Type-S 2008 4
Acura TSX Sedan 2012 5
Acura ZDX Hatchback 2012 6
Aston Martin V8 Vantage Convertible 2012 7
Aston Martin V8 Vantage Coupe 2012 8
Aston Martin Virage Convertible 2012 9
Aston Martin Virage Coupe 2012 10
Audi 100 Sedan 1994 11
Audi 100 Wagon 1994 12
Audi A5 Coupe 2012 13
Audi R8 Coupe 2012 14
Audi RS 4 Convertible 2008 15
Audi S4 Sedan 2007 16
Audi S4 Sedan 2012 17
Audi S5 Convertible 2012 18
Audi S5 Coupe 2012 19
Audi S6 Sedan 2011 20
Audi TT Hatchback 2011 21
Audi TT RS Coupe 2012 22
Audi TTS Coupe 2012 23
Audi V8 Sedan 1994 24
BMW 1 Series Convertible 2012 25
BMW 1 Series Coupe 2012 26
BMW 3 Series Sedan 2012 27
BMW 3 Series Wagon 2012 28
BMW 6 Series Convertible 2007 29
BMW ActiveHybrid 5 Sedan 2012 30
BMW M3 Coupe 2012 31
BMW M5 Sedan 2010 32
BMW M6 Convertible 2010 33
BMW X3 SUV 2012 34
BMW X5 SUV 2007 35
BMW X6 SUV 2012 36
BMW Z4 Convertible 2012 37
Bentley 
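
The output above pairs each class folder with an integer label. The sketch below shows one way such a mapping and the corresponding list entries could be produced. It assumes labels are assigned in sorted folder order (which is consistent with the output shown) and that each list line is tab-separated as index, label, relative path. The function names are hypothetical, and the real im2rec.py may additionally shuffle or split the entries.

```python
import os


def build_label_map(root):
    """Map each class folder under root to an integer label, in sorted order."""
    class_names = sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )
    return {name: label for label, name in enumerate(class_names)}


def build_list_entries(root, exts=(".jpg", ".jpeg", ".png")):
    """Return (index, label, relative_path) tuples for every image under root."""
    entries = []
    for name, label in build_label_map(root).items():
        class_dir = os.path.join(root, name)
        for fname in sorted(os.listdir(class_dir)):
            if fname.lower().endswith(exts):
                entries.append((len(entries), label, os.path.join(name, fname)))
    return entries


def format_lst_line(index, label, rel_path):
    """Format one list entry as a tab-separated line (assumed .lst layout)."""
    return f"{index}\t{label:f}\t{rel_path}"


if __name__ == "__main__":
    # Assumed path; adjust to wherever the training images were extracted.
    root = "stanford-car-dataset-by-classes-folder/car_data/train"
    if os.path.isdir(root):
        for entry in build_list_entries(root)[:3]:
            print(format_lst_line(*entry))
```

Because the label depends only on the sorted position of the folder name, renaming or removing a class folder shifts the labels of every class that sorts after it.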

Convert the test dataset to RecordIO format in the same way.

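In [82]:
%%bash
# Variables from earlier %%bash cells do not persist, so define them again here.
data_path=stanford-car-dataset-by-classes-folder
val_path=validation/
# Step 1: generate data_val.lst; step 2: pack only that list into data_val.rec.
python im2rec.py --list --recursive --train-ratio 1 data_val "${data_path}/car_data/test"
python im2rec.py --resize 224 --center-crop --num-thread 4 ./data_val "${data_path}/car_data/test"
mv data_val.rec "$val_path"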

AM General Hummer SUV 2000 0
Acura Integra Type R 2001 1
Acura RL Sedan 2012 2
Acura TL Sedan 2012 3
Acura TL Type-S 2008 4
Acura TSX Sedan 2012 5
Acura ZDX Hatchback 2012 6
Aston Martin V8 Vantage Convertible 2012 7
Aston Martin V8 Vantage Coupe 2012 8
Aston Martin Virage Convertible 2012 9
Aston Martin Virage Coupe 2012 10
Audi 100 Sedan 1994 11
Audi 100 Wagon 1994 12
Audi A5 Coupe 2012 13
Audi R8 Coupe 2012 14
Audi RS 4 Convertible 2008 15
Audi S4 Sedan 2007 16
Audi S4 Sedan 2012 17
Audi S5 Convertible 2012 18
Audi S5 Coupe 2012 19
Audi S6 Sedan 2011 20
Audi TT Hatchback 2011 21
Audi TT RS Coupe 2012 22
Audi TTS Coupe 2012 23
Audi V8 Sedan 1994 24
BMW 1 Series Convertible 2012 25
BMW 1 Series Coupe 2012 26
BMW 3 Series Sedan 2012 27
BMW 3 Series Wagon 2012 28
BMW 6 Series Convertible 2007 29
BMW ActiveHybrid 5 Sedan 2012 30
BMW M3 Coupe 2012 31
BMW M5 Sedan 2010 32
BMW M6 Convertible 2010 33
BMW X3 SUV 2012 34
BMW X5 SUV 2007 35
BMW X6 SUV 2012 36
BMW Z4 Convertible 2012 37
Bentley 
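
The training and test conversions print the same class-to-label pairs, which is what training needs: a label must mean the same class in both record files. As a hedged sketch, assuming labels follow sorted folder order, the check below compares the class folders of the two splits. The helper names and the example paths are assumptions, not part of the original notebook.

```python
import os


def class_folders(split_dir):
    """Return the sorted class folder names found directly under split_dir."""
    return sorted(
        name for name in os.listdir(split_dir)
        if os.path.isdir(os.path.join(split_dir, name))
    )


def compare_splits(train_dir, test_dir):
    """Return (only_in_train, only_in_test) lists of class folder names."""
    train = set(class_folders(train_dir))
    test = set(class_folders(test_dir))
    return sorted(train - test), sorted(test - train)


if __name__ == "__main__":
    # Assumed paths, based on the data_path value used in this walkthrough.
    base = "stanford-car-dataset-by-classes-folder/car_data"
    train_dir, test_dir = os.path.join(base, "train"), os.path.join(base, "test")
    if os.path.isdir(train_dir) and os.path.isdir(test_dir):
        only_train, only_test = compare_splits(train_dir, test_dir)
        print("only in train:", only_train)
        print("only in test:", only_test)
```

Two empty lists mean both splits contain the same class folders, so their labels should line up under the stated assumption.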

After running the scripts above, the file structure should look like this:

/efs <br />
└─car_data<br />
    ├─train<br />
    │    ├─Audi S4 Sedan 2007 16<br />
    │    ├─Acura ZDX Hatchback 2012 6<br />
    │    └─....<br />
    ├─test<br />
    │    ├─Audi S4 Sedan 2007 16<br />
    │    ├─Acura ZDX Hatchback 2012 6<br />
    │    └─....<br />
    ├─im2rec.py <br /> 
    ├─train_data<br />
    │  └─data_train.rec<br />
    ├─validation<br />
    │  └─data_val.rec<br />
...<br />
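
A quick way to confirm the layout above is to check that both record files exist and are non-empty. This is a minimal sketch under the assumption that it runs from the directory containing train_data and validation. The helper `check_outputs` and the `EXPECTED_FILES` tuple are illustrative names.

```python
from pathlib import Path

# Relative paths taken from the file structure shown above.
EXPECTED_FILES = ("train_data/data_train.rec", "validation/data_val.rec")


def check_outputs(base_dir, expected=EXPECTED_FILES):
    """Return {relative_path: size_in_bytes}, with None for missing files."""
    base = Path(base_dir)
    report = {}
    for rel in expected:
        path = base / rel
        report[rel] = path.stat().st_size if path.is_file() else None
    return report


if __name__ == "__main__":
    for rel, size in check_outputs(".").items():
        status = "missing" if size is None else f"{size} bytes"
        print(f"{rel}: {status}")
```

A missing or zero-byte file usually points to a failed conversion step or a wrong path in the corresponding cell.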

