# **Animal Classification**
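This is the notebook for the animal classifier, covering model training, testing, and saving.

The techniques used in this project include neural networks, random forests, naive Bayes, support vector machines, and K-nearest neighbors.

The notebook runs on the free tier of Google Colab.

Environment details:

CPU 2*2.30GHz, GPU Tesla T4, Memory 12.72GB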

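# **Data Extraction**

The dataset is stored on Google Drive, where reading files directly is very slow.

To speed up reading, the archive is first extracted to local storage as follows.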

In [0]:
import zipfile
file_dir = 'drive/My Drive/dataset.zip'  # path to your archive
# Extract every member; the context manager closes the archive afterwards
with zipfile.ZipFile(file_dir) as zip_file:
    zip_file.extractall('/content')  # extraction target


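# **Environment Setup**

This project mainly uses the third-party libraries NumPy, scikit-learn, TensorFlow, Keras, and OpenCV, together with the standard-library modules time and os.

This step loads every library and function used in the project.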

In [0]:
import numpy as np
from sklearn.model_selection import GridSearchCV
import cv2
import os
import time
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
from keras.models import Sequential, Model, load_model
from keras.layers import Dropout, Flatten, Dense
import tensorflow as tf
from sklearn.naive_bayes import GaussianNB 
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

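# **Data Loading**

Load the training and test data.

Each image is labeled as it is read, ready for training.

After several trials to balance training time against model performance, the training set holds 601 images per class (3,606 in total); the remaining 554 images form the test set.

Images are read with OpenCV and resized at the same time, so each sample has shape (224,224,3).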

In [0]:
# categories = {'cane': 'dog', 'gatto': 'cat', 'cavallo': 'horse', 'mucca': 'cow', 'pecora': 'sheep', 'gallina': 'chicken'}
x_train = []
y_train = []
x_test = []
y_test = []
animals = ['dog', 'cat', 'chicken', 'cow', 'horse', 'sheep']
path = "dataset/"
for i, files in enumerate(animals):  # i is the integer class label
    j = 0  # readable images seen so far in this class
    for img in os.listdir(path + files):
        img_array = cv2.imread(path + files + '/' + img)
        if img_array is None:  # skip files OpenCV cannot decode
            continue
        # OpenCV loads BGR, but the VGG16 preprocess_input expects RGB input
        img_array = cv2.cvtColor(img_array, cv2.COLOR_BGR2RGB)
        new_img = cv2.resize(img_array, (224, 224))
        if j <= 600:  # first 601 images of each class go to the training set
            x_train.append(new_img)
            y_train.append(i)
        else:  # the rest go to the test set
            x_test.append(new_img)
            y_test.append(i)
        j = j + 1
    print("Load ", files, "...")
x_train = np.array(x_train)
x_test = np.array(x_test)
print("x_train shape", x_train.shape)
print("x_test shape", x_test.shape)
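
The split above depends on the order in which `os.listdir` returns file names, which is arbitrary, so an image may land in a different set from one run to the next. As a minimal sketch, assuming a fixed seed is acceptable, the hypothetical helper `split_filenames` (not part of the original notebook) sorts the names, shuffles them with a seeded generator, and keeps the first 601 for training, mirroring the `j <= 600` rule. Unlike the loop above, it works on file names and does not check whether each file is readable.

```python
import random

def split_filenames(filenames, n_train=601, seed=0):
    """Return (train, test) lists with a reproducible assignment.

    Hypothetical helper: n_train=601 mirrors the `j <= 600` rule of the
    loading loop; the seed value is an arbitrary assumption.
    """
    ordered = sorted(filenames)            # remove dependence on listing order
    random.Random(seed).shuffle(ordered)   # seeded, so repeatable
    return ordered[:n_train], ordered[n_train:]

# Illustrative file names only; real names would come from os.listdir.
names = [f"img_{k}.jpeg" for k in range(700)]
train_names, test_names = split_filenames(names)
print(len(train_names), len(test_names))
```

Calling the helper again with the same names in any order returns the same two lists, which makes the reported metrics comparable across runs.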

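# **Data Preprocessing**

This step covers reshaping the data and extracting features.

Feature extraction uses the convolutional network VGG16 with its fully connected layers removed, loaded with weights pretrained on ImageNet. The pretrained model classifies 1,000 object categories well, which makes it a suitable basis for transfer learning in this project.

VGG16 feature extraction yields effective features and improves classification accuracy. It also reduces dimensionality, from (224,224,3), or 150,528 dimensions, to (7,7,512), or 25,088 dimensions.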

In [0]:
# VGG16 without its fully connected layers, used only for inference
model_vgg16 = VGG16(weights='imagenet', include_top=False)
print("Feature Extractor information:")
model_vgg16.summary()

y_train = np.array(y_train)
y_test = np.array(y_test)

# Apply the VGG16 input preprocessing, then extract (7,7,512) features per image
x_train = preprocess_input(x_train)
x_train = model_vgg16.predict(x_train)

x_test = preprocess_input(x_test)
x_test = model_vgg16.predict(x_test)
print("x_train.shape",x_train.shape)
print("x_test.shape",x_test.shape)
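
The shapes quoted in this section can be checked by hand. The convolutional part of VGG16 contains five 2x2 max-pooling stages, each halving the height and width, and its last convolutional block has 512 channels. The following sketch is plain arithmetic with no Keras call, and the function name `vgg16_feature_shape` is made up for illustration.

```python
def vgg16_feature_shape(height, width, n_pools=5, channels=512):
    """Output shape of VGG16 without its top for one image (illustrative)."""
    for _ in range(n_pools):
        height //= 2   # each 2x2 max-pool with stride 2 halves the size
        width //= 2
    return (height, width, channels)

input_dims = 224 * 224 * 3
feature_shape = vgg16_feature_shape(224, 224)
feature_dims = feature_shape[0] * feature_shape[1] * feature_shape[2]
print(input_dims, feature_shape, feature_dims)
print(input_dims / feature_dims)  # reduction factor
```

The extracted features are exactly one sixth the size of the raw pixels.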

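# **Neural Network**
This section designs and trains the neural network classifier.

The network has four layers: a Flatten layer; a hidden fully connected layer with 4096 neurons and ReLU activation; a second hidden layer with 512 neurons and ReLU activation; and an output layer with 6 neurons and softmax activation.

The input shape is (7,7,512). Training runs for 15 epochs with the Adam optimizer.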


In [0]:
def softmax_to_category(a):
    # Map each row of softmax probabilities to the index of its largest entry
    return np.argmax(a, axis=1).tolist()

net = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4096, activation='relu'),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(6, activation='softmax'),
])
# Labels are integers 0..5, hence the sparse categorical loss
net.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
startt = time.time()
net.fit(x_train, y_train, epochs=15)
net_traintime = time.time() - startt  # training time in seconds
print(net.evaluate(x_test, y_test))
net_pred = softmax_to_category(net.predict(x_test))
print("Neural Network Summary:")
net.summary()
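
To see where the model's size comes from, the parameter count of each dense layer can be worked out as inputs x units + units (weights plus biases). The sketch below is plain arithmetic on the layer sizes described in this section; `dense_params` is an illustrative helper, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

flat = 7 * 7 * 512              # Flatten output size; Flatten has no parameters
layer_sizes = [4096, 512, 6]    # units of the three Dense layers

counts = []
n_inputs = flat
for units in layer_sizes:
    counts.append(dense_params(n_inputs, units))
    n_inputs = units            # this layer's output feeds the next layer

print(counts)        # per-layer parameter counts
print(sum(counts))   # total trainable parameters
```

Nearly all of the parameters sit in the first dense layer, and the totals should agree with what `net.summary()` prints.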

# **Secondary Data Processing**

The other classifiers expect one feature vector per sample, so the data is flattened.

In [0]:
print("flatten the data:")
x1_train = x_train.reshape(x_train.shape[0], -1)
x1_test = x_test.reshape(x_test.shape[0], -1)
print("x1_train.shape",x1_train.shape)
print("x1_test.shape",x1_test.shape)
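
For intuition about what the reshape does: NumPy's `reshape` uses row-major (C) order by default, so the feature at position (h, w, c) of a (7,7,512) block lands at index (h * 7 + w) * 512 + c of the flattened vector. A small sketch in plain Python; `flat_index` is a hypothetical name used only here.

```python
def flat_index(h, w, c, shape=(7, 7, 512)):
    """Row-major position of element (h, w, c) after flattening."""
    _, width, channels = shape
    return (h * width + w) * channels + c

print(flat_index(0, 0, 0))      # first element
print(flat_index(0, 1, 0))      # moving one step in w skips a full channel run
print(flat_index(6, 6, 511))    # last element of the 25088-long vector
```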

# **Naive Bayes Classifier**

In [0]:
gnb = GaussianNB()
startt = time.time()
gnb.fit(x1_train, y_train)
bayes_traintime = time.time() - startt
bayes_pred = gnb.predict(x1_test)
print("Finish the naive_bayes classification.")

# **Random Forest Classifier**

In [0]:
rf0 = RandomForestClassifier(oob_score=True, random_state=10)
startt = time.time()
rf0.fit(x1_train, y_train)
rf_traintime = time.time() - startt
rf_pred = rf0.predict(x1_test)
print("Finish the random forest classification.")

# **K-Nearest Neighbors Classifier**

In [0]:
knn = KNeighborsClassifier() 
startt = time.time()
knn.fit(x1_train, y_train)
knn_traintime = time.time() - startt
knn_pred = knn.predict(x1_test)
print("Finish the knn classification.")

# **Support Vector Machine Classifier**

In [0]:
svm = SVC(kernel='rbf')
startt = time.time()
svm.fit(x1_train, y_train)
svm_traintime = time.time() - startt
svm_pred = svm.predict(x1_test)
print("Finish the svm classification.")

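# **Classifier Performance Report**

The report gives each classifier's precision, recall, and F1 for every class and averaged overall, along with its training time.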

In [0]:
print("Evaluation Reports")
print("Naive Bayes:")
print(classification_report(y_test, bayes_pred, target_names=animals))
print("Naive Bayes Train Time: ", bayes_traintime, 's')
print("Random Forest:")
print(classification_report(y_test, rf_pred, target_names=animals))
print("Random Forest Train Time: ", rf_traintime, 's')
print("KNN:")
print(classification_report(y_test, knn_pred, target_names=animals))
print("KNN Train Time: ", knn_traintime, 's')
print("SVM:")
print(classification_report(y_test, svm_pred, target_names=animals))
print("SVM Train Time: ", svm_traintime, 's')
print("Neural Network:")
print(classification_report(y_test, net_pred, target_names=animals))
print("Neural Network Train Time: ", net_traintime, 's')
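
For reference, the per-class figures in these reports follow the usual definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is their harmonic mean. The sketch below computes them by hand on made-up labels (not results from this project); `per_class_scores` is an illustrative helper, and `classification_report` remains what the notebook actually uses.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treating it as the positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy labels for three classes, invented for illustration
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
for label in range(3):
    print(label, per_class_scores(toy_true, toy_pred, label))
```

In the toy data, class 1 is always found (recall 1.0) but one class 0 sample is mislabeled as class 1, so its precision drops to 2/3.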

# **Saving the Models**

Save the models so they can be reused elsewhere without retraining.

In [0]:
print("Save the classification model...")
print("save the net...")
net.save("net_model.h5")
print("save the bayes...")
joblib.dump(gnb,"bayes_model.m")
print("save the knn...")
joblib.dump(knn,"knn_model.m")
print("save the rf...")
joblib.dump(rf0,"rf_model.m")
print("save the svm...")
joblib.dump(svm,"svm_model.m")
print("finish!")