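## Example: Handwriting Recognition
A handwriting recognition system built on the k-nearest-neighbor classifier works on image data. To keep things simple, the system built here recognizes only the digits 0 through 9. The digits to be recognized have already been processed with image-editing software into black-and-white images of the same color and size, 32x32 pixels each. Storing images as text does not use memory efficiently, but to make the example easier to follow the images are converted to text format: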

```
00000000000000011100000000000000
00000000000000111100000000000000
00000000000001111110000000000000
00000000000001111110000000000000
00000000000011111110000000000000
00000000000111111100000000000000
00000000001111111100000000000000
00000000011111110000000000000000
00000000011111100000000000000000
00000000111111000000000000000000
00000000111110000000000000000000
00000000111110000000000000000000
00000000111110000000000000000000
00000001111110000000000000000000
00000001111100000000000000000000
00000001111100000000000000000000
00000001111111111111000000000000
00000001111111111111100000000000
00000001111111111111111000000000
00000011111111111111111000000000
00000011111111111111111110000000
00000000111111111111111110000000
00000000111111000011111110000000
00000000111111000000111110000000
00000000011111100000011111000000
00000000001111111000011111000000
00000000001111111111111111000000
00000000000111111111111110000000
00000000000011111111111110000000
00000000000001111111111110000000
00000000000000001111111110000000
00000000000000000011111100000000

```
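### Development Process
(1) Collect the data: text files of handwritten digits are provided.

(2) Prepare the data: convert the image format into a format the classifier can process.

(3) Analyze the data: inspect the data to make sure it meets the requirements.

(4) Train the algorithm: not applicable, since k-nearest neighbors requires no training step.

(5) Test the algorithm: use part of the provided data as test samples. Test samples differ from non-test samples in that they have already been classified; if the predicted class differs from the actual class, it is counted as an error.

(6) Use the algorithm: this example does not carry out this step. A complete application could later be built that extracts digits from images and recognizes them; the mail-sorting system in the United States is a similar system in actual operation.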

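### Preparing the Data: Converting Images into Test Vectors
The training set trainingDigits contains about 2,000 examples, roughly 200 samples per digit. The test set contains about 900 test samples. The two sets should not overlap.

To format an image as a vector, the 32x32 binary image matrix is converted into a 1x1024 vector, so that the classifier developed earlier can process the image information.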
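As a minimal sketch of this flattening step, assume the image is already in memory as a 32x32 NumPy array. The array `image` below is a made-up example, not a file from the data set. Row-major reshaping places pixel (i, j) at index 32*i + j of the vector:

```python
import numpy as np

# Hypothetical 32x32 binary image: all zeros except a short stroke in row 0
image = np.zeros((32, 32))
image[0, 15:18] = 1

# Row-major flattening: pixel (i, j) ends up at column 32*i + j
vector = image.reshape(1, 1024)

print(vector.shape)       # (1, 1024)
print(vector[0, 15:18])   # the three stroke pixels from row 0
```

This is the same index layout that a nested loop over rows and columns would produce when it writes to position 32*i + j.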

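Create a new function, img2vector, that converts an image into a vector. The function creates a 1x1024 NumPy array, opens the given file, reads the first 32 lines in a loop, stores the first 32 characters of each line in the array, and finally returns the array: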

In [41]:
import numpy as np
import operator
from os import listdir

def classify0(inX, dataSet, labels, k):
    """
    kNN classifier.

    Parameters:
        inX - the data to classify (test sample)
        dataSet - the data used for training (training set)
        labels - class labels
        k - kNN parameter: the number of nearest points to consider
    Returns:
        sortedClassCount[0][0] - the predicted class
    """
    # Distance calculation (Euclidean):
    dataSetSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5

    # Vote among the k nearest neighbours:
    sortedDistIndices = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndices[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1

    # Sort the vote counts in descending order:
    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]
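To see the voting logic on a small case, here is a hedged sketch of the same idea on made-up toy data. The function name `knn_classify` and the four sample points are assumptions for illustration only. It uses NumPy broadcasting in place of `np.tile` and `collections.Counter` for the vote, so it is a variant of the classifier above, not the same code:

```python
import numpy as np
from collections import Counter

def knn_classify(in_x, data_set, labels, k):
    # Broadcasting subtracts in_x from every row of data_set
    distances = np.sqrt(((data_set - in_x) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = distances.argsort()[:k]
    # Majority vote among the labels of the k nearest points
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up toy data: two points near (1, 1) labeled 'A', two near (0, 0) labeled 'B'
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

print(knn_classify(np.array([0.0, 0.0]), group, labels, 3))  # B
```

With k = 3, the three nearest neighbours of (0, 0) are the two 'B' points and one 'A' point, so the vote is 2 to 1 in favour of 'B'.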

In [42]:
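def img2vector(filename):
    # 1x1024 row vector that will hold the flattened 32x32 image
    returnVect = np.zeros((1,1024))
    # The with block closes the file once the 32 lines have been read
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                # Character j of line i goes to position 32*i + j
                returnVect[0,32*i+j] = int(lineStr[j])
    return returnVect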

In [43]:
testVector = img2vector('./dataSet/trainingDigits/0_13.txt')
print(testVector[0,0:31])
print(testVector[0,31:63])

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]


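Create the function handwritingClassTest() to test the classifier. The listdir function imported from the os module returns the file names in a given directory. The trainingMat matrix stores all the handwritten images and serves as the feature matrix.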
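The class label of each sample is encoded in its file name. The sketch below assumes the naming pattern `<digit>_<index>.txt`, as in the file `0_13.txt` used earlier. The helper `label_from_filename` is a hypothetical name introduced only for this illustration:

```python
import os

def label_from_filename(file_name):
    # '0_13.txt' -> '0_13' -> ['0', '13'] -> 0
    stem = os.path.splitext(file_name)[0]
    return int(stem.split('_')[0])

print(label_from_filename('0_13.txt'))  # 0
```

Dropping the extension first and then splitting on the underscore keeps the two parsing steps separate, which makes the intent easier to read than splitting on the underscore twice.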

In [48]:
def handwritingClassTest():
    hwLabels = []
    # Load the training set: one 1x1024 row per file
    trainingFileList = listdir('./dataSet/trainingDigits')
    m = len(trainingFileList)
    trainingMat = np.zeros((m,1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        # File names look like 0_13.txt: the digit before '_' is the class label
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i,:] = img2vector('./dataSet/trainingDigits/%s' % fileNameStr)

    # Classify every test file and count the mismatches
    testFileList = listdir('./dataSet/testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('./dataSet/testDigits/%s' % fileNameStr)
        # kNN with k = 3:
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)

        print("the classifier came back with: %d, the real answer is : %d" % (classifierResult, classNumStr))
        if (classifierResult != classNumStr) : errorCount += 1.0

    print("\nthe total number of errors is: %d" % errorCount )
    print("\nthe total error rate is: %f" % (errorCount/float(mTest)))
        

In [49]:
handwritingClassTest()

the classifier came back with: 0, the real answer is : 0
the classifier came back with: 0, the real answer is : 0
the classifier came back with: 0, the real answer is : 0
the classifier came back with: 0, the real answer is : 0
the classifier came back with: 0, the real answer is : 0
the classifier came back with: 0, the real answer is : 0
the classifier came back with: 0, the real answer is : 0
the classifier came back with: 0, the real answer is : 0
the classifier came back with: 0, the real answer is : 0
the classifier came back with: 0, the real answer is : 0
the classifier came back with: 0, the real answer is : 0
the classifier came back with: 0, the real answer is : 0
the classifier came back with: 0, the real answer is : 0
the classifier came back with: 0, the real answer is : 0
the classifier came back with: 0, the real answer is : 0
the classifier came back with: 0, the real answer is : 0
the classifier came back with: 0, the real answer is : 0
...

the classifier came back with: 1, the real answer is : 1
the classifier came back with: 1, the real answer is : 1
the classifier came back with: 1, the real answer is : 1
the classifier came back with: 1, the real answer is : 1
the classifier came back with: 1, the real answer is : 1
the classifier came back with: 1, the real answer is : 1
the classifier came back with: 1, the real answer is : 1
the classifier came back with: 1, the real answer is : 1
the classifier came back with: 1, the real answer is : 1
the classifier came back with: 1, the real answer is : 1
the classifier came back with: 1, the real answer is : 1
the classifier came back with: 1, the real answer is : 1
the classifier came back with: 1, the real answer is : 1
the classifier came back with: 1, the real answer is : 1
the classifier came back with: 1, the real answer is : 1
the classifier came back with: 1, the real answer is : 1
the classifier came back with: 1, the real answer is : 1
...

the classifier came back with: 3, the real answer is : 3
the classifier came back with: 3, the real answer is : 3
the classifier came back with: 3, the real answer is : 3
the classifier came back with: 3, the real answer is : 3
the classifier came back with: 3, the real answer is : 3
the classifier came back with: 3, the real answer is : 3
the classifier came back with: 3, the real answer is : 3
the classifier came back with: 3, the real answer is : 3
the classifier came back with: 3, the real answer is : 3
the classifier came back with: 3, the real answer is : 3
the classifier came back with: 3, the real answer is : 3
the classifier came back with: 3, the real answer is : 3
the classifier came back with: 3, the real answer is : 3
the classifier came back with: 3, the real answer is : 3
the classifier came back with: 3, the real answer is : 3
the classifier came back with: 3, the real answer is : 3
the classifier came back with: 3, the real answer is : 3
...

the classifier came back with: 4, the real answer is : 4
the classifier came back with: 4, the real answer is : 4
the classifier came back with: 4, the real answer is : 4
the classifier came back with: 4, the real answer is : 4
the classifier came back with: 4, the real answer is : 4
the classifier came back with: 4, the real answer is : 4
the classifier came back with: 4, the real answer is : 4
the classifier came back with: 4, the real answer is : 4
the classifier came back with: 4, the real answer is : 4
the classifier came back with: 4, the real answer is : 4
the classifier came back with: 4, the real answer is : 4
the classifier came back with: 4, the real answer is : 4
the classifier came back with: 4, the real answer is : 4
the classifier came back with: 4, the real answer is : 4
the classifier came back with: 4, the real answer is : 4
the classifier came back with: 4, the real answer is : 4
the classifier came back with: 4, the real answer is : 4
...

the classifier came back with: 6, the real answer is : 6
the classifier came back with: 6, the real answer is : 6
the classifier came back with: 6, the real answer is : 6
the classifier came back with: 6, the real answer is : 6
the classifier came back with: 6, the real answer is : 6
the classifier came back with: 6, the real answer is : 6
the classifier came back with: 6, the real answer is : 6
the classifier came back with: 6, the real answer is : 6
the classifier came back with: 6, the real answer is : 6
the classifier came back with: 6, the real answer is : 6
the classifier came back with: 6, the real answer is : 6
the classifier came back with: 6, the real answer is : 6
the classifier came back with: 6, the real answer is : 6
the classifier came back with: 6, the real answer is : 6
the classifier came back with: 6, the real answer is : 6
the classifier came back with: 6, the real answer is : 6
the classifier came back with: 6, the real answer is : 6
...

the classifier came back with: 7, the real answer is : 7
the classifier came back with: 7, the real answer is : 7
the classifier came back with: 7, the real answer is : 7
the classifier came back with: 7, the real answer is : 7
the classifier came back with: 7, the real answer is : 7
the classifier came back with: 7, the real answer is : 7
the classifier came back with: 7, the real answer is : 7
the classifier came back with: 7, the real answer is : 7
the classifier came back with: 7, the real answer is : 7
the classifier came back with: 7, the real answer is : 7
the classifier came back with: 7, the real answer is : 7
the classifier came back with: 7, the real answer is : 7
the classifier came back with: 8, the real answer is : 8
the classifier came back with: 8, the real answer is : 8
the classifier came back with: 8, the real answer is : 8
the classifier came back with: 6, the real answer is : 8
the classifier came back with: 8, the real answer is : 8
...

the classifier came back with: 9, the real answer is : 9
the classifier came back with: 9, the real answer is : 9
the classifier came back with: 9, the real answer is : 9
the classifier came back with: 9, the real answer is : 9
the classifier came back with: 9, the real answer is : 9
the classifier came back with: 9, the real answer is : 9
the classifier came back with: 9, the real answer is : 9
the classifier came back with: 7, the real answer is : 9
the classifier came back with: 9, the real answer is : 9
the classifier came back with: 9, the real answer is : 9
the classifier came back with: 9, the real answer is : 9
the classifier came back with: 9, the real answer is : 9
the classifier came back with: 9, the real answer is : 9
the classifier came back with: 9, the real answer is : 9
the classifier came back with: 9, the real answer is : 9
the classifier came back with: 9, the real answer is : 9
the classifier came back with: 9, the real answer is : 9
...

### 评估算法
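In practice, kNN turns out to be inefficient. The algorithm performs about 2,000 distance calculations for each test vector, each involving floating-point operations over 1,024 dimensions, and this is repeated about 900 times. In addition, about 2 MB of storage must be set aside for the test vectors. The space and computational costs of kNN are therefore very high.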
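The size of that workload can be estimated with a short calculation. The counts below are the approximate figures quoted above, not measured values:

```python
n_train = 2000   # approximate number of training vectors
n_dims = 1024    # 32 x 32 pixels per vector
n_test = 900     # approximate number of test vectors

# One distance calculation per (test vector, training vector) pair
distance_calcs = n_train * n_test
# Each distance calculation touches every one of the 1024 coordinates
coordinate_ops = distance_calcs * n_dims

print(distance_calcs)   # 1800000
print(coordinate_ops)   # 1843200000
```

That is roughly 1.8 million distance calculations and on the order of 1.8 billion per-coordinate operations for a single pass over the test set.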


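### Summary
The k-nearest-neighbor algorithm is one of the simplest and most effective ways to classify data, but using it requires training samples that closely resemble the real data. The algorithm must keep the entire data set, so a very large training set demands a large amount of storage. Because a distance must also be computed to every item in the data set, it can be very slow in practice.

Another weakness of k-nearest neighbors is that it reveals nothing about the underlying structure of the data, so we cannot tell what an average sample or a typical sample of a class looks like.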