# 6. Trying It Out on the MNIST Handwritten Digit Dataset

In this section we use a larger handwritten digit dataset to see how well PCA dimensionality reduction works.

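In [1]:
import numpy as np
# Note: fetch_mldata was removed in scikit-learn 0.22; on newer versions,
# fetch_openml('mnist_784') is the usual replacement.
from sklearn.datasets import fetch_mldata

mnist = fetch_mldata('MNIST original', data_home='../input/')
mnist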

{'COL_NAMES': ['label', 'data'],
 'DESCR': 'mldata.org dataset: mnist-original',
 'data': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
 'target': array([0., 0., 0., ..., 9., 9., 9.])}

In [2]:
X, y = mnist['data'], mnist['target']
# The first 60,000 samples form the training set; the remaining 10,000 form the test set.
X_train = np.array(X[:60000], dtype=float)
y_train = np.array(y[:60000], dtype=float)
X_test = np.array(X[60000:], dtype=float)
y_test = np.array(y[60000:], dtype=float)

In [3]:
X_train.shape

(60000, 784)

In [4]:
y_train.shape

(60000,)

In [5]:
X_test.shape

(10000, 784)

In [6]:
y_test.shape

(10000,)

## Training a kNN classifier to recognize handwritten digits

In [7]:
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
%time knn_clf.fit(X_train, y_train)

CPU times: user 30.1 s, sys: 158 ms, total: 30.3 s
Wall time: 30.3 s


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

Without dimensionality reduction, fitting on the 784-dimensional images takes about 30 seconds.

+ Note that with k-nearest neighbors, most of the time is spent on prediction, not on fitting.
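
To see why prediction dominates, here is a minimal brute-force sketch of kNN. It is an illustration only, not scikit-learn's implementation: the class name `BruteForceKNN` and the toy data are made up for this example, and labels are assumed to be non-negative integers. `fit` merely stores the data, while `predict` computes one distance per (query, training sample) pair, so its cost grows with both the number of training samples and the number of features. scikit-learn's `fit` may additionally build a tree index, which is presumably where the 30 seconds above go.

```python
import numpy as np


class BruteForceKNN:
    """Toy k-nearest-neighbors classifier (illustrative, not scikit-learn's code)."""

    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        # "Training" only stores the data; no parameters are learned.
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # All the work happens here: squared Euclidean distances between every
        # query row and every stored row, via ||a||^2 - 2 a.b + ||b||^2.
        d2 = (
            (X ** 2).sum(axis=1)[:, None]
            - 2.0 * X @ self.X_.T
            + (self.X_ ** 2).sum(axis=1)[None, :]
        )
        # Indices of the k closest training samples for each query row.
        nearest = np.argpartition(d2, self.k - 1, axis=1)[:, : self.k]
        votes = self.y_[nearest]
        # Majority vote among the k neighbors (labels must be non-negative ints).
        return np.array([np.bincount(row).argmax() for row in votes])


# Toy data: two well-separated clusters in 20 dimensions.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (50, 20)), rng.normal(5, 1, (50, 20))])
y_demo = np.array([0] * 50 + [1] * 50)

clf = BruteForceKNN(k=5).fit(X_demo, y_demo)
pred = clf.predict(X_demo)
```

Because the distance matrix has one entry per pair and each entry sums over all features, cutting the feature count reduces prediction cost roughly in proportion.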

In [8]:
%time knn_clf.score(X_test, y_test)

CPU times: user 9min 56s, sys: 1.41 s, total: 9min 58s
Wall time: 9min 59s


0.9688

## Reducing dimensionality with PCA

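+ Dimensionality reduction also discards part of the noise in the original data, since the low-variance directions that PCA drops tend to carry more noise than signal.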
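
The denoising effect can be checked on synthetic data where the clean signal is known. The sketch below is a hypothetical setup, not part of the MNIST experiment: it assumes 3 underlying factors spread over 50 features plus Gaussian noise, projects the noisy data onto the top 3 principal components, maps it back with `inverse_transform`, and compares both versions against the clean data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic data: 3 latent factors mixed into 50 features (assumed for illustration).
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 50))
clean = latent @ mixing
noisy = clean + rng.normal(scale=0.5, size=clean.shape)

# Project onto the top 3 principal components, then map back to 50 dimensions.
pca_demo = PCA(n_components=3).fit(noisy)
denoised = pca_demo.inverse_transform(pca_demo.transform(noisy))

# Mean squared error of each version relative to the clean signal.
err_noisy = np.mean((noisy - clean) ** 2)
err_denoised = np.mean((denoised - clean) ** 2)
print(f"MSE vs. clean data: noisy={err_noisy:.4f}, denoised={err_denoised:.4f}")
```

Real images are not exactly low-rank, so the effect on MNIST is less clear-cut than in this idealized case.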

In [9]:
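from sklearn.decomposition import PCA

# A float in (0, 1) tells PCA to keep just enough components
# to explain that fraction (here 90%) of the variance.
pca = PCA(0.90)
# Fit on the training set only, then apply the same projection to both sets.
pca.fit(X_train)
X_train_reduction = pca.transform(X_train)
X_test_reduction = pca.transform(X_test)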

In [10]:
X_train_reduction.shape

(60000, 87)
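
How `PCA(0.90)` arrives at a component count can be reproduced by hand. The sketch below uses scikit-learn's small bundled digits dataset (`load_digits`, 8x8 images) instead of MNIST so that it runs in well under a second; the variable names are chosen for this example only. It fits a full PCA, accumulates the explained variance ratios, and picks the smallest number of components whose cumulative ratio exceeds 0.90.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits = load_digits().data  # 64 features per sample (8x8 images)

# Fit with all components to inspect the full variance spectrum.
full_pca = PCA().fit(X_digits)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 0.90.
k_manual = int(np.searchsorted(cumulative, 0.90, side="right") + 1)

# Passing a float lets PCA make the same choice itself.
auto_pca = PCA(0.90).fit(X_digits)

print(k_manual, auto_pca.n_components_)
```

On MNIST the same rule is what reduces 784 dimensions to the 87 shown above.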

In [11]:
knn_clf = KNeighborsClassifier()
%time knn_clf.fit(X_train_reduction, y_train)

CPU times: user 575 ms, sys: 68.5 ms, total: 644 ms
Wall time: 335 ms


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [12]:
%time knn_clf.score(X_test_reduction, y_test)

CPU times: user 1min, sys: 243 ms, total: 1min 1s
Wall time: 1min 1s


0.9728

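The results show that reducing 784 dimensions to 87 cut the fitting time from about 30 seconds to under a second and the scoring time from about 10 minutes to about 1 minute. Accuracy also rose slightly, from 0.9688 to 0.9728, which suggests that discarding low-variance components removed some noise. Such an accuracy gain is possible but not guaranteed on other datasets.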
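
For a quick way to rerun the same comparison without downloading MNIST, here is a small-scale sketch on scikit-learn's bundled digits dataset. The dataset, the 75/25 split, and the variable names are assumptions made for this example, and the numbers it prints will differ from the MNIST results above.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_small, y_small = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_small, y_small, test_size=0.25, random_state=0
)

# Baseline: kNN on the raw 64-dimensional pixels.
acc_raw = KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te)

# PCA is fitted on the training split only, then applied to both splits.
pca_small = PCA(0.90).fit(X_tr)
knn_small = KNeighborsClassifier().fit(pca_small.transform(X_tr), y_tr)
acc_pca = knn_small.score(pca_small.transform(X_te), y_te)

print(f"raw: {X_tr.shape[1]} dims, accuracy {acc_raw:.4f}")
print(f"PCA: {pca_small.n_components_} dims, accuracy {acc_pca:.4f}")
```

Fitting PCA on the training split alone keeps information from the test set out of the projection, mirroring the `pca.fit(X_train)` step used for MNIST.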