# CNN Model: Image Recognition

Author: 吴凡璐   
School: Central University of Finance and Economics (中央财经大学)   
Student ID: 2018210803

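### Flower Classification

**Dataset Overview**

The data is stored in one folder with three subfolders, each holding images of a single kind of flower:
* daisy
* dandelion
* sunflowers

This case study uses a CNN model to tell these three kinds of flowers apart.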

In [26]:
%matplotlib inline
import seaborn as sns
import os
import shutil
import random

# Root folder of the flower dataset; it contains one subfolder per class
train = 'C:/Users/wufan/Desktop/data/flowers' 

daisy = [train+'/daisy/'+i for i in os.listdir(train+'/daisy')] # paths of all daisy images
dandelion = [train+ '/dandelion/'+i for i in os.listdir(train+'/dandelion')]
sunflowers = [train+'/sunflowers/'+i for i in os.listdir(train+'/sunflowers')]
print('daisy count:' + str(len(daisy)))
print('dandelion count:' + str(len(dandelion)))
print('sunflowers count:' + str(len(sunflowers)))

daisy count:633
dandelion count:898
sunflowers count:699


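The counts show 633 daisy images, 898 dandelion images, and 699 sunflowers images. When building the training and test sets, two thirds of each class (rounded down) go to the training set and the remaining third goes to the test set, which is stored in the `validation` folder:

| Flower class | Training images | Test images |
| ------------- |:-------------:| -----:|
| daisy      | 422 | 211 |
| dandelion      | 598      |   300 |
| sunflowers | 466      |    233 |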
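
The table entries can be reproduced with integer arithmetic. The sketch below assumes the training share is two thirds rounded down, which matches the `int(2/3*len(sub_list))` slice used in the copy step; `split_sizes` is a hypothetical helper name, not part of the notebook.

```python
# Image counts per class, taken from the printed output above.
counts = {'daisy': 633, 'dandelion': 898, 'sunflowers': 699}

def split_sizes(n):
    """Return (train, test) sizes: two thirds rounded down for training, the rest for testing."""
    n_train = n * 2 // 3
    return n_train, n - n_train

sizes = {name: split_sizes(n) for name, n in counts.items()}
total_train = sum(train for train, _ in sizes.values())
total_test = sum(test for _, test in sizes.values())

for name, (n_train, n_test) in sizes.items():
    print(name, n_train, n_test)
print('total', total_train, total_test)
```

The totals agree with the `Found 1486 images` and `Found 744 images` messages printed later by `flow_from_directory`.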

In [37]:
target = 'C:/Users/wufan/Desktop/data/arrange/' # root folder for the rearranged dataset

# Shuffle each class so that the split is random
random.shuffle(daisy)
random.shuffle(dandelion)
random.shuffle(sunflowers)

def ensure_dir(dir_path):
    if not os.path.exists(dir_path):
        try:
            os.makedirs(dir_path)
        except OSError:
            pass

# Create the class folders
for flower in ['daisy','dandelion','sunflowers']:   
    ensure_dir(target + 'train/'+flower)
    ensure_dir(target + 'validation/'+flower)

# Copy images: the first 2/3 of each shuffled list go to train, the rest to validation
flower_list=[daisy,dandelion,sunflowers]
flower_name=['daisy','dandelion','sunflowers']
for i in range(len(flower_name)):
    sub_list=flower_list[i]
    for file in sub_list[0:int(2/3*len(sub_list))]:
        shutil.copyfile(file, target + 'train/'+flower_name[i]+'/' + os.path.basename(file))
    for file in sub_list[int(2/3*len(sub_list)):]:
        shutil.copyfile(file, target + 'validation/'+flower_name[i]+'/' + os.path.basename(file))
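
The copy step above reshuffles on every run, so rerunning it into the same target folders can mix files from different splits. As a minimal sketch, assuming a fixed seed is acceptable, the same logic can be wrapped in a function and tried on placeholder files in a temporary folder. `split_class` and the `img_*.jpg` file names are hypothetical and used only for illustration.

```python
import os
import random
import shutil
import tempfile

def split_class(files, train_dir, validation_dir, seed=0):
    """Copy 2/3 (rounded down) of a shuffled file list to train_dir and the rest to validation_dir."""
    files = sorted(files)
    random.Random(seed).shuffle(files)  # seeded shuffle gives the same split on every run
    n_train = len(files) * 2 // 3
    for subset, folder in ((files[:n_train], train_dir), (files[n_train:], validation_dir)):
        os.makedirs(folder, exist_ok=True)
        for path in subset:
            shutil.copyfile(path, os.path.join(folder, os.path.basename(path)))
    return n_train, len(files) - n_train

# Demo on nine empty placeholder files in a temporary folder.
root = tempfile.mkdtemp()
source = os.path.join(root, 'daisy')
os.makedirs(source)
paths = []
for k in range(9):
    path = os.path.join(source, 'img_%d.jpg' % k)
    open(path, 'w').close()
    paths.append(path)

sizes = split_class(paths,
                    os.path.join(root, 'train', 'daisy'),
                    os.path.join(root, 'validation', 'daisy'))
print(sizes)
```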

After generating the training and test sets, check that the split is correct, using the dandelion images as an example:

In [45]:
d1 = os.listdir('C:/Users/wufan/Desktop/data/arrange/train/dandelion')       # file names in the dandelion training folder
d2 = os.listdir('C:/Users/wufan/Desktop/data/arrange/validation/dandelion')  # file names in the dandelion test folder
print('Number of dandelion images in the training set: ' + str(len(d1)))
print('Number of dandelion images in the test set: ' + str(len(d2)))

Number of dandelion images in the training set: 598
Number of dandelion images in the test set: 300


Data augmentation applies small random perturbations to the training images, which improves the model's generalization and robustness.

In [46]:
from keras.preprocessing.image import ImageDataGenerator

# Image size
img_width, img_height = 128, 128
input_shape = (img_width, img_height, 3)

train_data_dir = target + 'train'
validation_data_dir = target + 'validation'

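# Generator that applies random transformations to the training images
train_pic_gen = ImageDataGenerator(
        rescale=1./255, # scale pixel values to the [0, 1] range
        rotation_range=20, # random rotations of up to 20 degrees
        width_shift_range=0.2, # random horizontal shifts of up to 20% of the width
        height_shift_range=0.2, # random vertical shifts of up to 20% of the height
        shear_range=0.2, # random shear
        zoom_range=0.5, # random zoom factor in [0.5, 1.5]
        horizontal_flip=True, # random horizontal flips
        fill_mode='nearest') # fill newly exposed pixels with the nearest pixel value

# Test images are not transformed; they are only rescaled.
validation_pic_gen = ImageDataGenerator(rescale=1./255)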

Using TensorFlow backend.


In [47]:
# Stream training batches and labels from the class folders; 'categorical' yields one-hot labels for multi-class classification
train_flow = train_pic_gen.flow_from_directory(
        train_data_dir,
        target_size=(img_width, img_height),
        batch_size=32,
        class_mode='categorical')

# Stream test batches and labels from the class folders
validation_flow = validation_pic_gen.flow_from_directory(
        validation_data_dir,
        target_size=(img_width, img_height),
        batch_size=32,
        class_mode='categorical')

Found 1486 images belonging to 3 classes.
Found 744 images belonging to 3 classes.
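
With `class_mode='categorical'`, each label is a one-hot vector, and the class indices follow the alphanumeric order of the folder names. The sketch below illustrates that encoding and the categorical cross-entropy later used as the loss. The helper functions and the example prediction `[0.1, 0.7, 0.2]` are made up for illustration and are not Keras internals.

```python
import math

# Folder names sorted alphanumerically give the class indices.
class_names = sorted(['sunflowers', 'daisy', 'dandelion'])
class_indices = {name: i for i, name in enumerate(class_names)}

def one_hot(name):
    """Encode a class name as a one-hot vector."""
    vec = [0.0] * len(class_names)
    vec[class_indices[name]] = 1.0
    return vec

def categorical_crossentropy(y_true, y_pred):
    """Cross-entropy between a one-hot label and a predicted probability vector."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

label = one_hot('dandelion')
loss = categorical_crossentropy(label, [0.1, 0.7, 0.2])  # hypothetical softmax output
print(class_indices, label, round(loss, 4))
```

Only the probability assigned to the true class enters the loss, so here it equals `-log(0.7)`.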


First, build a simple CNN model:

In [48]:
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D
from keras.layers import Activation, Dropout, Flatten, Dense

steps_per_epoch = 2000  
validation_steps = 1000
epochs = 1
# epochs = 50  # train for 50 epochs

# Two convolution + pooling stages; the second convolution produces 64 feature maps
model = Sequential([
    Convolution2D(32, (3, 3), input_shape=input_shape, activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Convolution2D(64, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(3, activation='softmax'),
])

# Use categorical cross-entropy as the loss for multi-class classification
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
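
The `steps_per_epoch` and `validation_steps` values above are much larger than a single pass over the data. As a rough sketch, one pass over N images with batch size B takes ceil(N / B) steps; because the directory generators loop indefinitely, extra steps just revisit the same images (with fresh random augmentation on the training side). `steps_for_one_pass` is a hypothetical helper name.

```python
import math

def steps_for_one_pass(num_images, batch_size):
    """Number of batches needed to see every image once."""
    return math.ceil(num_images / batch_size)

train_steps = steps_for_one_pass(1486, 32)       # training images, batch_size=32
validation_steps_one_pass = steps_for_one_pass(744, 32)  # test images, batch_size=32
passes = 2000 / train_steps                      # passes covered by steps_per_epoch = 2000

print(train_steps, validation_steps_one_pass, round(passes, 1))
```

So a single "epoch" with `steps_per_epoch = 2000` corresponds to roughly 43 passes over the 1486 training images.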

In [49]:
model.fit_generator(
        train_flow,
        steps_per_epoch=steps_per_epoch,
        epochs=epochs,
        validation_data=validation_flow,
        validation_steps=validation_steps)

Epoch 1/1


<keras.callbacks.History at 0x1709fd68748>

The resulting model reaches an accuracy of 0.77 on the training set and 0.84 on the test set, which leaves room for improvement.

In [51]:
model.save('C:/Users/wufan/Desktop/data/flowers/model.h5') # save the whole model (architecture, weights, and optimizer state)

In [50]:
model.summary() # show the model architecture

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 126, 126, 32)      896       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 63, 63, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 61, 61, 64)        18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 30, 30, 64)        0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 57600)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                3686464   
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
__________
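
The summary output above is cut off, but the shapes and parameter counts it does show can be checked by hand. The sketch below assumes 3x3 'valid' convolutions with stride 1 and 2x2 max pooling, as in the model definition; the helper names are hypothetical.

```python
def conv_output(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_output(size, pool=2):
    """Spatial size after non-overlapping max pooling (any remainder is dropped)."""
    return size // pool

def conv_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return (inputs + 1) * units

size = pool_output(conv_output(128))   # 128 -> 126 -> 63
size = pool_output(conv_output(size))  # 63 -> 61 -> 30
flat = size * size * 64                # flattened feature vector length

print(conv_params(3, 3, 32))    # conv2d_1
print(conv_params(3, 32, 64))   # conv2d_2
print(flat)                     # flatten_1
print(dense_params(flat, 64))   # dense_1
```

By the same formula, the final `Dense(3)` layer, which the truncated output does not show, would have (64 + 1) * 3 = 195 parameters. Almost all of the parameters sit in `dense_1`, because it is connected to every one of the 57600 flattened features.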

# Fine-Tuning a Pretrained Model

To improve accuracy, we use transfer learning from a model that has already been trained, taking VGG16 as the base model.

In [52]:
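from keras.applications.inception_v3 import InceptionV3

# Note: this InceptionV3 model is not used in the rest of the notebook;
# base_model is reassigned to VGG16 in a later cell.
base_model = InceptionV3(weights='imagenet')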

Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.5/inception_v3_weights_tf_dim_ordering_tf_kernels.h5


In [53]:
from keras.models import Model
from keras.optimizers import SGD
from keras.applications.vgg16 import VGG16

# Image size
img_width, img_height = 128, 128
input_shape = (img_width, img_height, 3)

In [54]:
base_model = VGG16(weights='imagenet', include_top=False, input_shape=input_shape)

Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5


In [55]:
from keras.layers import Dropout, Flatten, Dense

x = base_model.output
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
y = Dense(3, activation='softmax')(x)

model = Model(inputs=base_model.input, outputs=y)

In [60]:
from keras.preprocessing.image import ImageDataGenerator

# Data directories
target = 'C:/Users/wufan/Desktop/data/arrange/'
train_data_dir = target + 'train'
validation_data_dir = target + 'validation'

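# Training parameters
steps_per_epoch = 500
validation_steps = 100
epochs = 1
# epochs = 50  # train for 50 epochs

# Generator with random transformations for training; pixel values are rescaled to [0, 1]
# (this is rescaling only, no mean subtraction is applied)
train_pic_gen = ImageDataGenerator(
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True)

# Test images are not transformed; they are only rescaled
validation_pic_gen = ImageDataGenerator(rescale=1./255)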

# Stream training batches and labels from the class folders
train_flow = train_pic_gen.flow_from_directory(
        train_data_dir,
        target_size=(img_width, img_height),
        batch_size=32,
        class_mode='categorical')

# Stream test batches and labels from the class folders
validation_flow = validation_pic_gen.flow_from_directory(
        validation_data_dir,
        target_size=(img_width, img_height),
        batch_size=32,
        class_mode='categorical')

Found 1486 images belonging to 3 classes.
Found 744 images belonging to 3 classes.
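
The generators in this cell rescale pixel values to [0, 1]. The ImageNet weights of VGG16 were trained with a different preprocessing, which Keras exposes as `keras.applications.vgg16.preprocess_input`: the channels are reordered from RGB to BGR and a per-channel mean is subtracted, with no scaling. The NumPy sketch below only illustrates the difference between the two transformations on a dummy image. The helper names are hypothetical, and whether switching the preprocessing would change the accuracies reported here has not been tested.

```python
import numpy as np

# Per-channel ImageNet means in BGR order, as used by the Keras VGG16 preprocessing.
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def rescale_only(rgb_image):
    """What rescale=1./255 does: map 0..255 pixel values to 0..1."""
    return rgb_image.astype(np.float32) / 255.0

def vgg_style_mean_subtraction(rgb_image):
    """Hypothetical helper: reorder RGB to BGR, then subtract the channel means."""
    bgr = rgb_image.astype(np.float32)[..., ::-1]
    return bgr - IMAGENET_BGR_MEAN

# A dummy 2x2 RGB image filled with mid-gray pixels (value 128).
image = np.full((2, 2, 3), 128, dtype=np.uint8)
scaled = rescale_only(image)
centered = vgg_style_mean_subtraction(image)
print(scaled[0, 0], centered[0, 0])
```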


In [61]:
# Freeze the convolutional base of VGG16 so that the weights pretrained on ImageNet stay unchanged
for layer in base_model.layers:
    layer.trainable = False 

In [62]:
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',  # multi-class cross-entropy
              metrics=['accuracy'])


model.fit_generator(
        train_flow,
        steps_per_epoch=steps_per_epoch,
        epochs=epochs,
        validation_data=validation_flow,
        validation_steps=validation_steps)

Epoch 1/1


<keras.callbacks.History at 0x170af5f2d68>

After this first transfer-learning stage, the model reaches an accuracy of 0.82 on the training set and 0.85 on the test set.

In [63]:
model.save_weights('C:/Users/wufan/Desktop/data/flowers/merge_model.h5') # save the weights only

### Fine-Tuning the Top Layers

In [64]:
for i, layer in enumerate(model.layers):
    print(i, layer.name)

0 input_2
1 block1_conv1
2 block1_conv2
3 block1_pool
4 block2_conv1
5 block2_conv2
6 block2_pool
7 block3_conv1
8 block3_conv2
9 block3_conv3
10 block3_pool
11 block4_conv1
12 block4_conv2
13 block4_conv3
14 block4_pool
15 block5_conv1
16 block5_conv2
17 block5_conv3
18 block5_pool
19 flatten_2
20 dense_3
21 dropout_2
22 dense_4


In [65]:
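# Keep layers 0-14 (input through block4_pool) frozen
for layer in model.layers[:15]:
    layer.trainable = False
# Unfreeze layers 15-22 (block5 and the new classifier head) for fine-tuning
for layer in model.layers[15:]:
    layer.trainable = True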
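
To see exactly which layers the slice at index 15 separates, the same split can be applied to the layer names printed above. This is a plain-Python illustration on the name list and does not touch the Keras model.

```python
# Layer names in the order printed by the enumerate() loop above.
layer_names = [
    'input_2',
    'block1_conv1', 'block1_conv2', 'block1_pool',
    'block2_conv1', 'block2_conv2', 'block2_pool',
    'block3_conv1', 'block3_conv2', 'block3_conv3', 'block3_pool',
    'block4_conv1', 'block4_conv2', 'block4_conv3', 'block4_pool',
    'block5_conv1', 'block5_conv2', 'block5_conv3', 'block5_pool',
    'flatten_2', 'dense_3', 'dropout_2', 'dense_4',
]

frozen = layer_names[:15]      # stays fixed during fine-tuning
trainable = layer_names[15:]   # updated during fine-tuning

print('last frozen layer:', frozen[-1])
print('trainable layers:', trainable)
```

Only the last convolutional block and the newly added classifier head are updated, while blocks 1 to 4 keep their ImageNet weights.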

In [66]:
model.compile(loss='categorical_crossentropy',
              optimizer=SGD(lr=1e-4, momentum=0.9),
              metrics=['accuracy'])

# Fine-tuning run
model.fit_generator(
        train_flow,
        steps_per_epoch=steps_per_epoch,
        epochs=epochs,
        validation_data=validation_flow,
        validation_steps=validation_steps)

Epoch 1/1


<keras.callbacks.History at 0x170af5f2c18>

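After the final fine-tuning, the model reaches an accuracy of 0.94 on the training set and 0.89 on the test set; the gap suggests some overfitting.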