### Dogs vs. Cats: Data Preprocessing

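## Convolutional Neural Network (CNN)

## Project: Dogs vs. Cats

### Project Overview

This project is implemented with Keras, using TensorFlow as the backend. The workflow is as follows:

* [Step 0](#step0): Data preprocessing
* [Step 1](#step1): Building the model
* [Step 2](#step2): Training the model
* [Step 3](#step3): Evaluating the model
* [Step 4](#step4): Visualizing the model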

---
<a id='step0'></a>
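## Step 0: Data Preprocessing

#### Exploring the Dataset
First, download the training archive train.zip and the test archive test.zip from the Kaggle website and extract both into the data directory. Inspecting the files shows that each image filename in the training set carries a "dog" or "cat" prefix according to its class, while the test images are simply numbered. To make the training set usable with sklearn.datasets.load_files, which treats every subdirectory as one class, the training images are sorted into one directory for cats and one for dogs. The resulting directory layout is: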

```
data
├── test
│   ├── 1.jpg
│   ├── 2.jpg
│   ├── .....
│   └── 12500.jpg
└── train
    ├── cat
    │   ├── cat.0.jpg
    │   ├── cat.1.jpg
    │   ├── ......
    │   └── cat.12499.jpg
    └── dog
        ├── dog.0.jpg
        ├── dog.1.jpg
        ├── ......
        └── dog.12499.jpg
```
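
The sorting step described above can be sketched with the standard library alone. This is only a minimal sketch under assumptions: the helper name `sort_train_images` is hypothetical, and it assumes the extracted training images sit directly in one flat directory with names such as cat.0.jpg and dog.0.jpg.

```python
import shutil
from pathlib import Path


def sort_train_images(train_dir, classes=("cat", "dog")):
    """Move files named '<class>.<n>.jpg' into a '<class>' subdirectory.

    Hypothetical helper: assumes the images lie directly inside train_dir.
    Returns the number of files moved per class.
    """
    train_dir = Path(train_dir)
    moved = {name: 0 for name in classes}
    for name in classes:
        target = train_dir / name
        target.mkdir(exist_ok=True)
        # glob only matches entries directly inside train_dir;
        # sorted() materializes the list before any file is moved
        for image_path in sorted(train_dir.glob(name + ".*.jpg")):
            shutil.move(str(image_path), str(target / image_path.name))
            moved[name] += 1
    return moved
```

Calling `sort_train_images('data/train')` once after extraction would produce the cat and dog subdirectories shown in the layout.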

In [None]:
import os, cv2
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.datasets import load_files       
from keras.utils import np_utils
from keras.applications import Xception

%matplotlib inline

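# Load the training file paths and one-hot labels.
# load_files assigns labels alphabetically by directory name: cat -> 0, dog -> 1.
def load_train_dataset(path):
    # load_content=False returns only paths and labels instead of reading
    # every image file into memory as raw bytes
    data = load_files(path, load_content=False)
    files = np.array(data['filenames'])
    targets = np_utils.to_categorical(np.array(data['target']), 2)
    return files, targets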

# load train dataset
train_files, train_targets = load_train_dataset('data/train')
train_dogs = [path for path in train_files if 'dog' in path]
train_cats = [path for path in train_files if 'cat' in path]

# load test dataset
test_files = os.listdir('data/test')

# print statistics about the datasets
print("There are {} training images, including {} dog images and {} cat images."
      .format(len(train_files), len(train_dogs), len(train_cats)))
print("There are {} test images.".format(len(test_files)))

#### Image Size Range

In [None]:
# cv2.imread(file, 0) reads a grayscale image whose shape is (height, width)
train_image_shapes = np.array([cv2.imread(file, 0).shape for file in train_files])
train_image_sizes  = np.array([shape[0] * shape[1] for shape in train_image_shapes])

print("The range of training image heights is [{}, {}]"
      .format(min(train_image_shapes[:,0]), max(train_image_shapes[:,0])))
print("The range of training image widths is [{}, {}]"
      .format(min(train_image_shapes[:,1]), max(train_image_shapes[:,1])))
print("The range of training image areas (width*height) is [{}, {}]"
      .format(min(train_image_sizes), max(train_image_sizes)))

#### Plotting the Image Size Distribution

In [None]:
import matplotlib.pyplot as plt

# plot image height, width, area distribution
# (shape is (height, width): column 0 is height, column 1 is width)
plt.subplot(221)
plt.hist(train_image_shapes[:,0], alpha=0.5, color='green')
plt.title('Image height distribution')

plt.subplot(222)
plt.hist(train_image_shapes[:,1], alpha=0.5, color='green')
plt.title('Image width distribution')

plt.subplot(212)
plt.hist(train_image_sizes, alpha=0.8, color='green')
plt.title('Image area distribution')

# adjust the spacing between subplots
plt.subplots_adjust(wspace=0.3, hspace=0.5)
plt.show()

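### Image Preprocessing

1. Decode each image file into a pixel array (the code below reads 3-channel color images rather than grayscale, since the pretrained Xception model expects 3-channel input)
2. Resize each image to a suitable fixed size
3. Randomly augment the images (flips and rotations in the code below)
4. Split the original training set into training and validation sets

Image preprocessing uses OpenCV together with Keras's ImageDataGenerator. Each image is first resized to a fixed target size (256x256 in the code below) and then normalized by dividing every pixel value by 255, which scales the values into the range 0 to 1. Data augmentation follows: random horizontal and vertical flips, and random rotation by a limited angle.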
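
The normalization and flipping steps described above can be illustrated with plain NumPy. This is a conceptual sketch, not how ImageDataGenerator is implemented: the function names `rescale` and `random_flip` are hypothetical, the tiny array merely stands in for a real image, and rotation is left out for brevity.

```python
import numpy as np


def rescale(image):
    """Scale 8-bit pixel values from [0, 255] to floats in [0, 1]."""
    return image.astype(np.float32) / 255.0


def random_flip(image, rng):
    """Flip a (height, width, channels) image left-right and/or up-down,
    each with probability 0.5."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]  # horizontal (left-right) flip
    if rng.random() < 0.5:
        image = image[::-1, :, :]  # vertical (up-down) flip
    return image


rng = np.random.default_rng(0)
image = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)  # tiny stand-in image
augmented = random_flip(rescale(image), rng)
print(augmented.shape, augmented.min(), augmented.max())
```

Flipping only reorders pixels, so the shape and the value range of the image stay the same; only the rescaling changes the values.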

In [None]:
from keras.preprocessing import image                  
from keras.preprocessing.image import ImageDataGenerator
from tqdm import tqdm

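target_image_size = (256, 256)
batch_size = 25

# Read each image in color (OpenCV returns BGR channel order) and resize it
# to the target size using bicubic interpolation
X = np.array([cv2.resize(cv2.imread(path, cv2.IMREAD_COLOR),
                         target_image_size,
                         interpolation=cv2.INTER_CUBIC)
              for path in train_files])
# The model ends in a single sigmoid unit, so use one binary label per image
# (column 1 of the one-hot targets: 1 = dog, 0 = cat)
Y = train_targets[:, 1]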

In [None]:
print(X.shape, Y.shape)

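### Splitting the Dataset

Goal: split the original training set into training, validation, and test sets in a 3:1:1 ratio.
1. First split the original training set 4:1 into a training set and a test set
2. Then split that training set 3:1 into a training set and a validation set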
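
The arithmetic behind the two-stage split can be checked with a short sketch. The helper name `split_fractions` is hypothetical, and the sample count of 25000 is an assumption about the size of the original training set.

```python
from fractions import Fraction


def split_fractions(train=3, validation=1, test=1):
    """Return the test_size values for two successive splits that
    yield the ratio train:validation:test overall."""
    total = train + validation + test
    first_test_size = Fraction(test, total)                       # share taken from all data
    second_test_size = Fraction(validation, train + validation)  # share taken from the remainder
    return first_test_size, second_test_size


first, second = split_fractions(3, 1, 1)
print(float(first), float(second))  # 0.2 0.25

n_samples = 25000  # assumed size of the original training set
n_test = int(n_samples * first)
n_validation = int((n_samples - n_test) * second)
n_train = n_samples - n_test - n_validation
print(n_train, n_validation, n_test)  # 15000 5000 5000
```

The second fraction is 0.25 rather than 0.2 because it applies to the 80% of the data that remains after the test set has been removed.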

In [None]:
from sklearn.model_selection import train_test_split
# random_state is fixed at 0 so that every run produces the same split
#X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=0.2, random_state=0)
X_1, X_test, Y_1, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
X_train, X_validation, Y_train, Y_validation = train_test_split(X_1, Y_1, test_size=0.25, random_state=0)

print("The size of train set: {}, the size of validation set: {}, the size of test set: {}".
      format(len(X_train), len(X_validation), len(X_test)))

In [None]:
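# Training data: rescale and apply random augmentation
data_gen = ImageDataGenerator(rescale=1.0/255,      # scale pixel values to 0-1
                              rotation_range=45,    # random rotation range in degrees
                              horizontal_flip=True, # random horizontal flip
                              vertical_flip=True)   # random vertical flip
# Validation data: rescale only, so validation metrics are not affected by random augmentation
validation_data_gen = ImageDataGenerator(rescale=1.0/255)

train_generator = data_gen.flow(X_train, Y_train, batch_size=batch_size)
validation_generator = validation_data_gen.flow(X_validation, Y_validation, batch_size=batch_size)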

### Data Preprocessing
#### Obtaining the Model's Feature Vectors

Extract the Xception bottleneck features corresponding to the training, validation, and test sets.
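
Bottleneck extraction amounts to running every image once through the frozen base model and keeping the resulting feature vectors. The sketch below shows only the batching loop, under stated assumptions: `extract_bottleneck_features` and `MeanPoolStub` are hypothetical names, and the stub stands in for a pretrained network so that the example runs without Keras.

```python
import numpy as np


def extract_bottleneck_features(model, images, batch_size=25):
    """Run images through a feature extractor in batches.

    model  -- any object with a Keras-style predict(batch) method
    images -- uint8 array of shape (n, height, width, channels)
    Pixels are scaled to [0, 1] before being passed to the model.
    """
    features = []
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size].astype(np.float32) / 255.0
        features.append(model.predict(batch))
    return np.concatenate(features, axis=0)


class MeanPoolStub:
    """Stand-in for a pretrained base model, used only to exercise the loop:
    averages over the spatial axes like a global average pooling layer."""

    def predict(self, batch):
        return batch.mean(axis=(1, 2))


images = np.random.default_rng(0).integers(0, 256, size=(60, 8, 8, 3), dtype=np.uint8)
features = extract_bottleneck_features(MeanPoolStub(), images)
print(features.shape)  # one feature vector per image
```

With Keras installed, the stub could presumably be replaced by `Xception(include_top=False, weights='imagenet', pooling='avg')`, in which case each image would yield one 2048-dimensional vector.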

---
<a id='step1'></a>
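## Step 1: Building the Model

We now use transfer learning to build a CNN that can tell cats from dogs in images. Among the pretrained models that Keras provides, Xception is relatively recent and has a comparatively small number of parameters, so it is used here as the base model together with its ImageNet pretrained weights.

To fit the binary cat-versus-dog classification task, the output layer of Xception is replaced with a fully connected layer for binary classification.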
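
A binary output layer of this kind produces one probability per image, and the matching loss is binary cross-entropy. The sketch below works through both formulas in plain Python; the function names are hypothetical and the input score of 2.0 is made up for illustration.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, p):
    """Loss for one sample with label y_true in {0, 1} and predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


p = sigmoid(2.0)  # a fairly confident prediction for the class labeled 1
print(round(p, 4))
print(round(binary_cross_entropy(1, p), 4))  # small loss: prediction agrees with the label
print(round(binary_cross_entropy(0, p), 4))  # large loss: prediction contradicts the label
```

Because a single probability fully describes a two-class problem, one sigmoid unit is sufficient and the labels must then be single 0/1 values rather than one-hot pairs.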


In [None]:
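from keras.layers import GlobalAveragePooling2D, Dropout, Dense
from keras.applications import Xception
from keras.models import Model

# Xception convolutional base with ImageNet weights, without its original classifier
base_model = Xception(include_top=False, weights='imagenet')

# New classification head: global average pooling, dropout, a hidden layer,
# and a single sigmoid unit for binary classification
x = base_model.output
x = GlobalAveragePooling2D(name='avg_pool')(x)
x = Dropout(0.5)(x)
x = Dense(1024, activation='relu')(x)
#outputs = Dense(2, activation='softmax')(x)
outputs = Dense(1, activation='sigmoid')(x)

model = Model(inputs=base_model.input, outputs=outputs)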

### Model Visualization

In [None]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
%matplotlib inline

SVG(model_to_dot(model).create(prog='dot', format='svg'))

---
<a id='step2'></a>
## Step 2: Training the Model

In [None]:
## Compile the model
#model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

#### Training the Model

To guard against overfitting, training stops once val_loss has failed to improve for 3 consecutive epochs.
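
The stopping rule can be modeled in a few lines of plain Python. This is a simplified sketch of the patience logic, not Keras's implementation: the function name `early_stopping_epoch` is hypothetical and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the validation loss has gone `patience` consecutive
    epochs without dropping below the best value seen so far.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# made-up validation losses: the best value so far is reached at epoch 3
losses = [0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.39]
print(early_stopping_epoch(losses))  # 6
```

In this made-up run, training stops at epoch 6 and never reaches the lower loss at epoch 7, which is the trade-off a small patience value accepts.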

In [None]:
from keras.callbacks import EarlyStopping

BATCH_SIZE = 25
EPOCHS = 50
earlyStopping = EarlyStopping(monitor='val_loss', patience=3, verbose=0, mode='auto')
CALLBACKS = [earlyStopping]
#history = model.fit(x=X_train, y=Y_train, batch_size=BATCH_SIZE, epochs=EPOCHS, verbose=1,
#                    callbacks=CALLBACKS, validation_data=(X_validation, Y_validation))

history = model.fit_generator(
    train_generator, 
    steps_per_epoch=len(X_train) // BATCH_SIZE,
    epochs=EPOCHS,
    callbacks=CALLBACKS,
    validation_data=validation_generator,
    validation_steps=len(X_validation) // BATCH_SIZE
)

---
<a id='step3'></a>
## Step 3: Model Evaluation

---
<a id='step4'></a>
## Step 4: Model Visualization
