#   <center>Machine Learning Engineer Nanodegree</center>
##   <center>Project: Dogs vs. Cats Redux: Kernels Edition</center>
<center>
Author: Kyle Chen<br>
Date: 20180506<br>
Version: 20180506v1
</center>

---

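###   Preface
-   I chose the dogs-vs-cats project for my capstone partly because there is plenty of material and resources for it, and partly because image recognition is a hot topic right now. Although I may continue working on traditional machine learning after graduation, this experience is still worthwhile, and it enriches both my résumé and my knowledge of deep learning.
-   In this project we evaluate several models with different degrees of tuning, iteratively improving the model framework to reach the final goal of a top 10% ranking.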

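###   Preparing the Data
-   The DataSet used here has already been downloaded from Kaggle and placed in the current working directory. Because of GitHub's file size limits it is not part of the repository, so if you want to run and study the code, please prepare the dataset yourself and put it under ./DataSet. Kaggle link: <a href='https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition/data'>dogs-vs-cats-redux-kernels-edition/data</a>.
-   Alternatively, run the code cell below to create the directory structure and download the data.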

In [None]:
# Do not run this cell if you have already downloaded the data
!mkdir -p DataSet
!wget -c "https://storage.googleapis.com/kaggle-competitions-data/kaggle/5441/train.zip?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1525872567&Signature=Vp1XJm2t%2FeDE9TqhWRHhhDuCD7GmOw4LhuU9cVU%2FNKbur08UKSw8UkDRm%2B6quFq0NL41vn%2BA45YkXvwlmiyM%2Br51%2BvXpWUHtYi3XAMwxjEVn7HI7dwyEP2tSO1H9SS%2Bi5YM8e94zNQ5mrpUypxL52HDBH0BVJBGs40RFR7uAiSeLizUNwArPl5zyP11EkOPrcFC2umd8e5BmfxpfRWUNTwr4%2FpfN6AAu%2BOsPU3QakCnzqxYQ1idOsQ4AO4AecseLYtEdeXJaov0lwaUVh9BIRMZibW4Sylh9RW0QmTspNbeCdZ%2BiTzMMfHxII5DhuXznZcpHOLRIG7%2F8XqOULoMzAA%3D%3D" -O DataSet/train.zip
!wget -c "https://storage.googleapis.com/kaggle-competitions-data/kaggle/5441/test.zip?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1525872601&Signature=aHXH%2B1qiQOk5ny9b3YrivSZQB23neGcHxVubwp6olX8SPz6V5wfkqpbs2ncy%2B%2BLozRBLx%2BG86KdjqsuGuo%2FbYYjMwwh%2F1784dKFaNlFxBR3x8jIn6ji221MWwCkX9Cij20xC9ECpaXWBap8jYypRlAp%2Ff8AlTJF1zY8xQ84su8Gs2y8tDVs9Gt0OuiKu4dNJ017ZPjclPYNjm2%2BCG1GcpgCmZy6qkqvW%2FsuMPr%2BLcGFB1X0xrqYLxmX1JJGlikoZ%2FjQiJ5ZYjIhnLm05BhWdegChS24hnDyF4Mo4DoI9r9NBpRIPqF3kW%2BSZ0ci%2FRgutEvqr7OXcuRpOIR4pPYVZdw%3D%3D" -O DataSet/test.zip
!unzip DataSet/train.zip -d DataSet/
!unzip DataSet/test.zip -d DataSet/
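If the `unzip` command is not available, the extraction step can be sketched with Python's standard `zipfile` module. The helper name `extract_archive` is hypothetical, and the paths assume the archives were saved under `DataSet/` as in the cell above.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the number of archive members."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return len(archive.namelist())


# Only touches archives that actually exist in ./DataSet
for name in ("train.zip", "test.zip"):
    archive_path = Path("DataSet") / name
    if archive_path.exists():
        print(name, extract_archive(archive_path, "DataSet"))
```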

###   About the DataSet
-   First, let's look at how the dataset is organized.

In [24]:
!ls -ahl DataSet/

total 109136
drwxr-xr-x      7 Kyle  staff   224B May  7 21:33 .
drwxr-xr-x     10 Kyle  staff   320B May  7 21:32 ..
-rw-r--r--      1 Kyle  staff   111K May  7 15:08 sample_submission.csv
drwxr-xr-x  12502 Kyle  staff   391K May  7 15:08 test
-rw-r--r--@     1 Kyle  staff   3.7M May  7 21:33 test.zip
drwxr-xr-x  25002 Kyle  staff   781K May  7 15:08 train
-rw-r--r--@     1 Kyle  staff    49M May  7 21:33 train.zip


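-   The test directory holds the test set prepared by Kaggle.
-   The train directory holds the training set, which still needs to be split further into a training part and a validation part.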
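One way to make that split explicit is to divide the file names per class before any images are loaded. This is a minimal sketch, assuming the files are named `cat.N.jpg` and `dog.N.jpg` as in the Kaggle archive; `split_train_valid`, the 0.3 fraction, and the seed are illustrative choices, not part of the original notebook.

```python
import random


def split_train_valid(filenames, valid_fraction=0.3, seed=42):
    """Shuffle file names within each class and split them into train/validation lists."""
    rng = random.Random(seed)
    train, valid = [], []
    for prefix in ("cat.", "dog."):
        group = sorted(f for f in filenames if f.startswith(prefix))
        rng.shuffle(group)
        n_valid = int(len(group) * valid_fraction)
        valid.extend(group[:n_valid])
        train.extend(group[n_valid:])
    return train, valid


# Hypothetical file names standing in for os.listdir('DataSet/train')
names = ["cat.%d.jpg" % i for i in range(10)] + ["dog.%d.jpg" % i for i in range(10)]
train_files, valid_files = split_train_valid(names)
print(len(train_files), len(valid_files))
```

Splitting per class keeps the cat/dog ratio the same in both parts, which a plain random split only achieves approximately.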

In [25]:
!ls -ahl DataSet/train/ | head -n 5

total 1217008
drwxr-xr-x  25002 Kyle  staff   781K May  7 15:08 .
drwxr-xr-x      7 Kyle  staff   224B May  7 21:33 ..
-rw-r--r--      1 Kyle  staff    12K May  7 15:08 cat.0.jpg
-rw-r--r--      1 Kyle  staff    16K May  7 15:08 cat.1.jpg


-   The file names follow a regular pattern: label.n.jpg.
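The naming rule can be turned into a small parser. This is a sketch under the assumption that the only labels are `cat` and `dog` and that `n` is a non-negative integer; `parse_filename` is a hypothetical helper.

```python
import re

# label.n.jpg, where label is cat or dog and n is the image index
FILENAME_PATTERN = re.compile(r'^(cat|dog)\.(\d+)\.jpg$')


def parse_filename(name):
    """Return (label, index) for a training file name, or raise ValueError."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError("unexpected file name: {}".format(name))
    return match.group(1), int(match.group(2))


print(parse_filename("cat.0.jpg"))
print(parse_filename("dog.12499.jpg"))
```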

In [26]:
!echo "cats | $(find DataSet/train/ -name 'cat*' | wc -l)"
!echo "dogs | $(find DataSet/train/ -name 'dog*' | wc -l)"

cats |    12500
dogs |    12500


-   The cats and dogs classes are evenly balanced, with 12,500 images each.
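The same balance check can be done in Python, which is handy on systems without `find` and `wc`. The sketch below uses a short hypothetical list of names; in the notebook the input would be `os.listdir('DataSet/train')`.

```python
from collections import Counter


def class_shares(filenames):
    """Return {label: share of files}, taking the text before the first dot as the label."""
    counts = Counter(name.split(".", 1)[0] for name in filenames)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


sample = ["cat.0.jpg", "cat.1.jpg", "dog.0.jpg", "dog.1.jpg"]
shares = class_shares(sample)
print(shares)
```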

###   Import Libs

In [32]:
# Import the libraries used later in the notebook
import os
import re
import numpy as np
from PIL import Image
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D
from keras.layers import Dropout, Flatten, Dense, BatchNormalization
from keras.models import Sequential, Model
from keras.callbacks import ModelCheckpoint
from keras.applications.xception import Xception

###   Initial Global Variables

In [6]:
# initial global variables
TRAIN_DIR = 'DataSet/train'
TEST_DIR = 'DataSet/test'

###   Loading the Dataset

In [7]:
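# Load every image in TRAIN_DIR into a float32 array, with labels cat -> 0 and dog -> 1
def load_data(width, height, channels):
    # Keep only files that follow the label.n.jpg convention
    img_list = [f for f in os.listdir(TRAIN_DIR) if re.match(r'^(cat|dog)\.', f)]
    nums = len(img_list)
    # Image arrays are laid out as (rows, columns, channels) = (height, width, channels)
    data = np.empty((nums, height, width, channels), dtype="float32")
    label = np.empty((nums, ), dtype="float32")
    mode = 'L' if channels == 1 else 'RGB'

    for i, img in enumerate(img_list):
        with Image.open(os.path.join(TRAIN_DIR, img)) as imgObj:
            # ndarray.resize() only truncates or pads the raw buffer,
            # so resample the picture with PIL before converting it to an array
            resized = imgObj.convert(mode).resize((width, height))
        data[i] = np.asarray(resized, dtype="float32").reshape(height, width, channels)
        label[i] = 0 if img.startswith('cat.') else 1
    return data, label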
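Because `load_data` keeps the whole training set in one float32 array, it is worth estimating the memory it needs before running it. The sketch below is plain arithmetic; `array_gib` is a hypothetical helper, and the 25,000 image count comes from the directory listing above.

```python
def array_gib(n_images, width, height, channels, bytes_per_value=4):
    """Size in GiB of a dense array of n_images images (float32 uses 4 bytes per value)."""
    return n_images * width * height * channels * bytes_per_value / 2 ** 30


size_224 = array_gib(25000, 224, 224, 3)
size_299 = array_gib(25000, 299, 299, 3)
print("224x224: {:.1f} GiB".format(size_224))
print("299x299: {:.1f} GiB".format(size_299))
```

Both figures are close to or above the memory of a 16 GB machine, so loading images in batches from disk may be a more practical design than a single in-memory array.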

###   CNN Model Architecture
-   Load the data

In [60]:
# run load data
data, label = load_data(224, 224, 3)

-   Design the CNN model

In [46]:
cnn_model = Sequential()
shape_input = data.shape[1:]  # shape of one image, without the sample axis
cnn_model.add(Conv2D(filters=16, kernel_size=2, input_shape=shape_input))
cnn_model.add(BatchNormalization())
cnn_model.add(MaxPooling2D(pool_size=2, padding='valid'))
cnn_model.add(Dense(133, activation='relu'))
cnn_model.add(Conv2D(filters=32, kernel_size=2))
cnn_model.add(MaxPooling2D(pool_size=2, padding='valid'))
cnn_model.add(Dense(133, activation='relu'))
cnn_model.add(Conv2D(filters=64, kernel_size=2))
cnn_model.add(MaxPooling2D(pool_size=2, padding='valid'))
# dim_ordering is a Keras 1 argument; Keras 2 uses data_format, and the default follows the backend config
cnn_model.add(GlobalAveragePooling2D())
cnn_model.add(Dense(1, activation='sigmoid'))
cnn_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_17 (Conv2D)           (None, 223, 223, 16)      208       
_________________________________________________________________
batch_normalization_9 (Batch (None, 223, 223, 16)      64        
_________________________________________________________________
max_pooling2d_13 (MaxPooling (None, 111, 111, 16)      0         
_________________________________________________________________
dense_20 (Dense)             (None, 111, 111, 133)     2261      
_________________________________________________________________
conv2d_18 (Conv2D)           (None, 110, 110, 32)      17056     
_________________________________________________________________
max_pooling2d_14 (MaxPooling (None, 55, 55, 32)        0         
_________________________________________________________________
dense_21 (Dense)             (None, 55, 55, 133)       4389      
__________
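The shapes and parameter counts in the summary can be checked by hand. The sketch below assumes 'valid' padding and stride 1 for the convolutions, with the pooling stride equal to the pool size; the helper names are hypothetical.

```python
def conv_output(size, kernel, stride=1):
    """Output length of a 'valid' convolution along one spatial axis."""
    return (size - kernel) // stride + 1


def pool_output(size, pool):
    """Output length of 'valid' max pooling with stride equal to the pool size."""
    return (size - pool) // pool + 1


def conv_params(kernel, in_channels, filters):
    """Weights of a square kernel plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(in_units, out_units):
    """Weight matrix plus one bias per output unit."""
    return in_units * out_units + out_units


size = conv_output(224, 2)             # first Conv2D: 224 -> 223
print(size, conv_params(2, 3, 16))     # 223 and 208 parameters
size = pool_output(size, 2)            # first MaxPooling2D: 223 -> 111
print(size, dense_params(16, 133))     # 111 and 2261 parameters
size = conv_output(size, 2)            # second Conv2D: 111 -> 110
print(size, conv_params(2, 133, 32))   # 110 and 17056 parameters
size = pool_output(size, 2)            # second MaxPooling2D: 110 -> 55
print(size, dense_params(32, 133))     # 55 and 4389 parameters
```

Note that a `Dense` layer placed between convolutions acts on the channel axis only, which is why its parameter count depends on the channel count and not on the 111 x 111 spatial size.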


-   Compile the model

In [47]:
# Compile the model
# Keras provides binary_crossentropy, which is exactly the LogLoss we need
cnn_model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
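For reference, LogLoss averages -(y * log(p) + (1 - y) * log(1 - p)) over all samples, where y is the true label (0 or 1) and p is the predicted probability of class 1. The sketch below is a plain Python version for intuition, not the Keras implementation; the clipping value `eps` is an assumption chosen to avoid log(0).

```python
import math


def log_loss(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Always predicting 0.5 gives ln(2), about 0.693, whatever the labels are
baseline = log_loss([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
print(baseline)

# Confident and correct predictions give a loss close to 0
confident = log_loss([1, 0], [0.99, 0.01])
print(confident)
```

The ln(2) value is a useful baseline when reading `val_loss`: a model whose loss stays near 0.693 is doing little better than guessing.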

-   Train the CNN model

In [48]:
# This took about two and a half hours on an 8-core, 16 GB MacBook Pro
epochs = 5
batch_size = 20
checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.cnn.hdf5', 
                                   verbose=1, save_best_only=True)
cnn_model.fit(data, label, validation_split = 0.3,
                epochs = epochs, batch_size = batch_size, shuffle = True,
                callbacks=[checkpointer], verbose=1)

Train on 17500 samples, validate on 7500 samples
Epoch 1/5

Epoch 00001: val_loss improved from inf to 0.65021, saving model to saved_models/weights.best.cnn.hdf5
Epoch 2/5

Epoch 00002: val_loss improved from 0.65021 to 0.64852, saving model to saved_models/weights.best.cnn.hdf5
Epoch 3/5

Epoch 00003: val_loss improved from 0.64852 to 0.63517, saving model to saved_models/weights.best.cnn.hdf5
Epoch 4/5

Epoch 00004: val_loss improved from 0.63517 to 0.63454, saving model to saved_models/weights.best.cnn.hdf5
Epoch 5/5

Epoch 00005: val_loss improved from 0.63454 to 0.62007, saving model to saved_models/weights.best.cnn.hdf5


<keras.callbacks.History at 0x11ee91be0>

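-   As we can see, after optimizing with the CNN the result is still at 64.4%, and LogLoss has not improved much: val_loss only went from 0.65021 to 0.62007 over 5 epochs.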
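One thing worth checking when reading these numbers: Keras takes the `validation_split` fraction from the last samples of the arrays, before any shuffling, and `load_data` returns images in whatever order `os.listdir` yields. If that order happens to group files by class, the validation set may be unbalanced. A minimal sketch of shuffling data and labels together before calling `fit` follows; `shuffle_in_unison` and the toy arrays are illustrative.

```python
import numpy as np


def shuffle_in_unison(data, label, seed=0):
    """Return copies of data and label permuted with the same random order."""
    order = np.random.RandomState(seed).permutation(len(data))
    return data[order], label[order]


# Toy stand-in for (data, label): the single value in each row equals its label
toy_data = np.arange(6, dtype="float32").reshape(6, 1)
toy_label = np.arange(6, dtype="float32")
shuffled_data, shuffled_label = shuffle_in_unison(toy_data, toy_label)
print(shuffled_label)
```

Indexing with a permutation copies the array, so for the full image set it may be cheaper to shuffle the list of file names before the images are loaded.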

###   Using Xception
-   Load the data

In [8]:
# run load data, change size to Xception default (299,299)
Xception_data, Xception_label = load_data(299, 299, 3)
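The ImageNet weights for Xception were trained on inputs scaled to the range [-1, 1], while `load_data` returns raw pixel values in [0, 255]. Keras provides `keras.applications.xception.preprocess_input` for this; the NumPy sketch below shows the scaling it is assumed to perform, with `scale_for_xception` as a hypothetical name.

```python
import numpy as np


def scale_for_xception(batch):
    """Map pixel values from [0, 255] to [-1, 1] and return a float32 copy."""
    return (batch.astype("float32") / 127.5) - 1.0


# Tiny example batch: one 1x3 image with a single channel
pixels = np.array([[[[0.0], [127.5], [255.0]]]], dtype="float32")
scaled = scale_for_xception(pixels)
print(scaled.ravel())
```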

-   Use Xception to produce feature vectors

In [41]:
# Pretrained Xception without its ImageNet classification head
Xception_base = Xception(include_top = False, weights = 'imagenet')
# Pool the final feature maps into one vector per image, then add a sigmoid output for cat vs. dog
features = GlobalAveragePooling2D()(Xception_base.output)
predictions = Dense(1, activation='sigmoid')(features)
Xception_model = Model(Xception_base.input, predictions)
Xception_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_26 (InputLayer)           (None, None, None, 3 0                                            
__________________________________________________________________________________________________
block1_conv1 (Conv2D)           (None, None, None, 3 864         input_26[0][0]                   
__________________________________________________________________________________________________
block1_conv1_bn (BatchNormaliza (None, None, None, 3 128         block1_conv1[0][0]               
__________________________________________________________________________________________________
block1_conv1_act (Activation)   (None, None, None, 3 0           block1_conv1_bn[0][0]            
__________________________________________________________________________________________________
block1_con

-   Compile the model

In [43]:
Xception_model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

-   Train the model with Xception

In [45]:
epochs = 5
batch_size = 20
checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.Xception.hdf5', 
                                   verbose=1, save_best_only=True)
Xception_model.fit(Xception_data, Xception_label, validation_split = 0.3,
                epochs = epochs, batch_size = batch_size, shuffle = True,
                callbacks=[checkpointer], verbose=1)

Train on 17500 samples, validate on 7500 samples
Epoch 1/5
  260/17500 [..............................] - ETA: 6:40:43 - loss: 0.7012 - acc: 0.5000

KeyboardInterrupt: 

In [1]:
from theano import function, config, shared, tensor
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')

ERROR (theano.gpuarray): Could not initialize pygpu, support disabled
Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/site-packages/theano/gpuarray/__init__.py", line 227, in <module>
    use(config.device)
  File "/anaconda3/lib/python3.6/site-packages/theano/gpuarray/__init__.py", line 214, in use
    init_dev(device, preallocate=preallocate)
  File "/anaconda3/lib/python3.6/site-packages/theano/gpuarray/__init__.py", line 73, in init_dev
    pygpu_version.fullversion)
ValueError: Your installed version of pygpu(0.6.9) is too old, please upgrade to 0.7.0 or later (but below 0.8.0)


[Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
Looping 1000 times took 1.259463 seconds
Result is [ 1.23178029  1.61879337  1.52278066 ...,  2.20771813  2.29967761
  1.62323284]
Used the cpu
