## Classifying the handwritten digits dataset (original MNIST)

**This post classifies the original MNIST dataset, which has 70,000 samples in total (60,000 for training and 10,000 for testing). Each sample has 784 features (a 28 x 28 grayscale image).**

In [1]:
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, ShuffleSplit
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pandas as pd
import numpy as np
from keras.layers import Dense, Activation, Conv2D, MaxPool2D, Flatten
from keras.models import Sequential
from keras.utils import np_utils
from keras.datasets import mnist

Using TensorFlow backend.


### 1. Classifying MNIST with traditional machine learning algorithms

### 1.1 Loading the dataset

In [2]:
(X_train, y_train), (X_test, y_test) = mnist.load_data()

In [3]:
print(X_train.shape)

(60000, 28, 28)


In [4]:
print(X_test.shape)

(10000, 28, 28)


**Reshape X_train and X_test**

In [5]:
X_train = X_train.reshape(X_train.shape[0], -1)  # flatten each image: shape becomes 60000*784
X_test = X_test.reshape(X_test.shape[0], -1)  # flatten each image: shape becomes 10000*784
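
The effect of this reshape can be checked on a small array. The following is a minimal sketch that uses a hypothetical stand-in array (`images`) instead of the real MNIST data. It shows that each 28 x 28 image is flattened row by row into one 784-element feature vector.

```python
import numpy as np

# Hypothetical stand-in for MNIST: 5 "images" of 28 x 28 pixels.
images = np.arange(5 * 28 * 28).reshape(5, 28, 28)

# Keep the sample axis and merge the two pixel axes into one.
flat = images.reshape(images.shape[0], -1)

print(flat.shape)  # (5, 784)

# Row-major order: feature 28 of a sample is the first pixel of its second row.
print(flat[0, 28] == images[0, 1, 0])  # True
```

Because the flattening is row-major, pixel `(r, c)` of an image ends up at feature index `r * 28 + c`.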

### 1.2 模型训练

**First, normalize the dataset**

In [6]:
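scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)  # learn the per-pixel min/max from the training set
X_test = scaler.transform(X_test)  # reuse the training-set scaling instead of refitting on the test set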
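
A common convention is to fit the scaler on the training data only and apply that same transformation to the test data, so that both sets are mapped with identical per-feature ranges. The sketch below illustrates this on made-up toy arrays (`train` and `test` are not from the notebook).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (an assumption for illustration): two features, three training rows.
train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[2.5, 40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min/max per column from train
test_scaled = scaler.transform(test)        # applies the same min/max to test

print(train_scaled.min(axis=0), train_scaled.max(axis=0))  # [0. 0.] [1. 1.]
print(test_scaled)  # [[0.25 1.5 ]]
```

Note that a test value can fall outside [0, 1] when it lies beyond the range seen during training, as the second feature does here.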

**Build the classifiers (with default hyperparameters) and find the best one**

In [10]:
clf_list = [
    SGDClassifier(),
    KNeighborsClassifier(),
    SVC(probability=True),
    DecisionTreeClassifier(),
    MultinomialNB()
]  # classifiers to compare

In [11]:
for clf in clf_list:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(accuracy_score(y_test, y_pred))

0.9134
0.9688
0.9446
0.8773
0.8357
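
The loop above prints bare accuracies in the order of `clf_list`. As a hedged sketch, the same comparison can keep each classifier's name next to its score. To run in seconds without a download, this example assumes scikit-learn's small bundled 8 x 8 digits dataset (`load_digits`) as a stand-in for MNIST, so its numbers are not comparable to the results above.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled 8 x 8 digits dataset stands in for MNIST.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

classifiers = {
    "KNN": KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "MultinomialNB": MultinomialNB(),
}

# Fit each classifier and record its test accuracy under its name.
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, clf.predict(X_te))

# Print the classifiers from best to worst.
for name, acc in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.4f}")
```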


**On the full MNIST dataset, KNN gives the best result among these classifiers, reaching 96.88% accuracy. Next, we use a CNN to classify the MNIST dataset.**

### 2. Classifying MNIST with a CNN

### 2.1 Loading the data

In [12]:
(X_train, y_train), (X_test, y_test) = mnist.load_data()

**Reshape the data to match the input format expected by the CNN**

In [13]:
X_train = X_train.reshape(-1, 28, 28, 1) / 255  # add a channel axis and scale pixels to [0, 1]
X_test = X_test.reshape(-1, 28, 28, 1) / 255
y_train = np_utils.to_categorical(y_train)  # one-hot encode the labels
y_test = np_utils.to_categorical(y_test)
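
For readers unfamiliar with one-hot encoding, the following NumPy sketch shows the kind of array that `to_categorical` produces for integer labels. It is an illustrative equivalent under the assumption of 10 classes, not the Keras implementation itself.

```python
import numpy as np

labels = np.array([3, 0, 9])  # example digit labels
num_classes = 10

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype="float32")[labels]

print(one_hot.shape)  # (3, 10)
print(one_hot[0])     # 1.0 at index 3, 0.0 elsewhere

# argmax recovers the original integer labels.
recovered = one_hot.argmax(axis=1)
print(recovered)  # [3 0 9]
```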

In [14]:
print(X_train.shape)

(60000, 28, 28, 1)


### 2.2 Training the model

**Build the CNN**

In [15]:
model = Sequential()
model.add(Conv2D(filters=32, kernel_size=(5, 5), padding="same", input_shape=(28, 28, 1)))  # the first layer must specify input_shape; output is 28*28*32
model.add(Activation("relu"))  # activation
model.add(MaxPool2D(pool_size=(2,2)))  # pooling layer, output is 14*14*32
model.add(Conv2D(filters=64, kernel_size=(5, 5), padding="same"))  # output is 14*14*64
model.add(Activation("relu"))  # activation
model.add(MaxPool2D(pool_size=(2, 2)))  # pooling layer, output is 7*7*64
model.add(Flatten())  # becomes [n_samples, 7*7*64]
model.add(Dense(1024))  # fully connected layer with 1024 units
model.add(Activation("relu"))  # activation
model.add(Dense(10))  # output layer with 10 units, one per digit
model.add(Activation("softmax"))  # softmax activation
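
The output shapes noted in the comments can be verified with simple arithmetic. The sketch below uses two hypothetical helper functions (`conv_params` and `dense_params`, not part of Keras) to trace the feature-map size through the network and count the trainable parameters, assuming every layer has a bias term.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus one bias per filter for a square 2D convolution."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(n_in, n_out):
    """Weights plus one bias per output unit for a fully connected layer."""
    return n_in * n_out + n_out


side = 28  # input images are 28 x 28 x 1

conv1 = conv_params(5, 1, 32)    # "same" padding keeps 28 x 28
side //= 2                       # 2 x 2 max pooling -> 14 x 14
conv2 = conv_params(5, 32, 64)   # "same" padding keeps 14 x 14
side //= 2                       # 2 x 2 max pooling -> 7 x 7
flat = side * side * 64          # Flatten -> 3136 features
dense1 = dense_params(flat, 1024)
dense2 = dense_params(1024, 10)

total = conv1 + conv2 + dense1 + dense2
print(flat)   # 3136
print(total)  # 3274634
```

Almost all of the parameters sit in the first fully connected layer (3136 x 1024 weights), which is typical for this style of architecture.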

**Compile the model**

In [16]:
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
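
To make the `categorical_crossentropy` loss more concrete, here is a minimal NumPy sketch of the formula: the mean over samples of `-sum(y_true * log(y_prob))`. The function and the clipping constant `eps` are assumptions for illustration; the Keras implementation may differ in details.

```python
import numpy as np


def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0)  # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))


y_true = np.eye(10)[[3]]  # one sample whose true class is 3

# A model that spreads probability evenly over the 10 classes.
uniform = np.full((1, 10), 0.1)

# A model that puts 0.91 on the correct class and 0.01 on each other class.
confident = np.full((1, 10), 0.01)
confident[0, 3] = 0.91

loss_uniform = categorical_crossentropy(y_true, uniform)
loss_confident = categorical_crossentropy(y_true, confident)

print(round(loss_uniform, 4))    # 2.3026, i.e. ln(10)
print(round(loss_confident, 4))  # 0.0943, i.e. -ln(0.91)
```

A confident, correct prediction yields a loss near zero, while an uninformative prediction over 10 classes yields about 2.3.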

**Train**

In [17]:
model.fit(X_train, y_train, batch_size=32, epochs=2)

Epoch 1/2
Epoch 2/2


<keras.callbacks.callbacks.History at 0x12c9a810488>

### 2.3 Testing the model

In [18]:
model.evaluate(X_test, y_test)



[0.026893072640671745, 0.9905999898910522]
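
With `metrics=["accuracy"]`, `model.evaluate` returns `[loss, accuracy]`, so the second value above is the test accuracy (about 99.06%). As a rough sketch of how that accuracy is derived from softmax outputs, the example below uses made-up probabilities for a 3-class problem; the arrays are illustrative only.

```python
import numpy as np

# Made-up softmax outputs for 4 samples and 3 classes.
probs = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
])
y_true_onehot = np.eye(3)[[1, 0, 1, 1]]  # one-hot targets

pred = probs.argmax(axis=1)          # predicted class = most probable class
true = y_true_onehot.argmax(axis=1)  # undo the one-hot encoding

accuracy = float(np.mean(pred == true))
print(pred)      # [1 0 2 1]
print(accuracy)  # 0.75
```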

### 3. Summary

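**This post classified the original MNIST dataset using an SGD-trained linear classifier, KNN, SVM, a decision tree, naive Bayes, and a convolutional neural network (CNN). Among the traditional machine learning algorithms, KNN performed best, with a prediction accuracy of 96.88%. The CNN reached 99.06% after two epochs, roughly 2.2 percentage points higher than the best traditional algorithm. These results suggest that, for an image dataset of this size, a CNN gives better training and prediction results.**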