# Lab Report
<font size=4>
    
+ **Name: 于成俊**
+ **Student ID: 2112066**
+ **Major: Cryptographic Science and Technology**

</font>

## Requirements

<font size=4>
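Topic: handwritten digit recognition with kNN
    
Setup: the semeion handwritten digit dataset and the kNN classification algorithm are given.
    
1. Basic requirement: implement the kNN algorithm in code and report its recognition accuracy on the handwritten digits for k = 1, 3, 5, evaluated with leave-one-out cross-validation.
2. Intermediate requirement: compare the results with the kNN classifier from the sklearn machine learning package.
3. Advanced requirement: process the original data by rotation or similar transformations to enlarge the dataset, then recognize the digits with a CNN or another deep learning method.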
</font>





## Importing the Required Packages

In [1]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, train_test_split
from sklearn.metrics import accuracy_score
from keras.utils import to_categorical
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.optimizers import Adam

## Data Processing

<font size=4>
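Parse the semeion dataset into a 2-D NumPy array with 257 columns: the 256 pixel values of each 16x16 image, followed by the digit label decoded from its one-hot encoding.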
</font>

In [2]:
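# Parse the data file
def process_data(filename):
    data_table = np.zeros((0, 257))
    with open(filename, 'r') as file:
        while True:
            line = file.readline()
            if len(line) == 0:  # end of file
                break
            data = line.split()
            count = -1  # index of the last pixel column written
            num = -1    # position inside the 10-token one-hot label
            row = np.zeros((1, 257))
            for cursor in data:
                if count == 255:
                    # All 256 pixels are stored; the remaining tokens are the one-hot label
                    num += 1
                    if cursor == '1':
                        row[0, 256] = num  # store the digit as a single number
                else:
                    count += 1
                    row[0, count] = cursor
            data_table = np.append(data_table, row, axis=0)
    return data_table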
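
The same parsing can be written more compactly. The following is a minimal sketch, assuming every row holds exactly 256 pixel values followed by a 10-token one-hot label; `load_semeion` and `make_line` are hypothetical helper names, and the two-row input is synthetic, not real data.

```python
import io
import numpy as np

def load_semeion(source):
    """Load semeion-style rows: 256 pixel values followed by a 10-way one-hot label."""
    raw = np.loadtxt(source, ndmin=2)           # shape (n_samples, 266)
    features = raw[:, :256]
    labels = raw[:, 256:].argmax(axis=1)        # position of the 1 in the one-hot block
    return np.column_stack((features, labels))  # shape (n_samples, 257)

# Synthetic example (not real data): two blank images labeled 3 and 0
def make_line(digit):
    one_hot = ["1" if d == digit else "0" for d in range(10)]
    return " ".join(["0.0000"] * 256 + one_hot)

sample = io.StringIO(make_line(3) + "\n" + make_line(0) + "\n")
table = load_semeion(sample)
print(table.shape)    # (2, 257)
print(table[:, 256])  # [3. 0.]
```

Reading the whole file in one call also avoids growing the array row by row with `np.append`, which copies the table on every iteration.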

## Basic Requirement

<font size=4>
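Implement the kNN algorithm in code and report its recognition accuracy on the handwritten digits for k = 1, 3, 5, evaluated with leave-one-out cross-validation.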
</font>
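
For reference, the quantities computed in this section can be written as follows. The notation is introduced here for illustration and is not part of the original report: $x_i \in \mathbb{R}^{256}$ is the pixel vector of sample $i$, $y_i$ its label, $N$ the number of samples, and $\mathcal{N}_k^{(-i)}(x_i)$ the set of the $k$ samples nearest to $x_i$ among all samples other than $i$.

```latex
d(x, x') = \sqrt{\sum_{j=1}^{256} (x_j - x'_j)^2}

\hat{y}_i = \arg\max_{c \in \{0, \dots, 9\}} \sum_{m \in \mathcal{N}_k^{(-i)}(x_i)} \mathbf{1}[y_m = c]

\mathrm{acc}_{\mathrm{LOO}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\hat{y}_i = y_i]
```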

In [3]:
# Euclidean distance between two feature vectors
def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))  # element-wise vector operations

# Hand-written kNN classifier
def knn(test_features, k, train_features, train_label):
    distances = []
    for i in range(len(train_features)):
        distance = euclidean_distance(train_features[i], test_features)
        distances.append((distance, train_label[i]))
    distances.sort(key=lambda x: x[0])  # sort by distance, nearest first
    neighbors = distances[:k]           # keep the k nearest neighbors
    # Majority vote over the neighbor labels
    counts = {}
    for neighbor in neighbors:
        counts[neighbor[1]] = counts.get(neighbor[1], 0) + 1
    return max(counts, key=counts.get)

# Leave-one-out evaluation
def leave_one_out(data_table):
    k_values = [1, 3, 5]
    # Split into features (first 256 columns) and labels (last column)
    features = data_table[:, :256]     # features
    label = data_table[:, 256]         # labels
    for k in k_values:
        correct_predictions = 0
        for i in range(len(features)):
            # Hold out sample i and use all other samples as the training set
            train_features = np.concatenate((features[:i], features[i + 1:]), axis=0)
            train_labels = np.concatenate((label[:i], label[i + 1:]), axis=0)
            result = knn(features[i], k, train_features, train_labels)
            if result == label[i]:
                correct_predictions += 1
        accuracy = correct_predictions / len(features)
        print(f" k={k}, accuracy: {accuracy}")
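
The loop above recomputes every distance for each held-out sample and each k. A possible speed-up, sketched here under the assumption that the full pairwise distance matrix fits in memory, is to compute all distances once and mask the diagonal so that no sample is its own neighbor. `loo_knn_accuracy` is a hypothetical name, and the tie rule imitates `max(counts, key=counts.get)`; results could still differ on exact distance ties because of floating-point rounding.

```python
import numpy as np

def loo_knn_accuracy(features, labels, k):
    """Leave-one-out kNN accuracy from a single precomputed distance matrix."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels).astype(int)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    # (squaring preserves the neighbor ordering, so no sqrt is needed)
    sq_norms = np.sum(features ** 2, axis=1)
    dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    # A held-out sample must never count as its own neighbor
    np.fill_diagonal(dists, np.inf)
    # Indices of the k nearest other samples, nearest first
    neighbor_idx = np.argsort(dists, axis=1, kind="stable")[:, :k]
    correct = 0
    for i in range(len(features)):
        neighbor_labels = labels[neighbor_idx[i]]
        votes = np.bincount(neighbor_labels, minlength=labels.max() + 1)
        best = votes.max()
        # On a tie, pick the tied label that appears first in nearest-first order
        pred = next(l for l in neighbor_labels if votes[l] == best)
        correct += int(pred == labels[i])
    return correct / len(features)

# Synthetic toy data (not from the dataset): two small clusters
toy_X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [6, 5]], dtype=float)
toy_y = np.array([0, 0, 1, 1, 1])
print(loo_knn_accuracy(toy_X, toy_y, 1))  # 1.0
print(loo_knn_accuracy(toy_X, toy_y, 3))  # 0.6
```

With k=3 the two samples of class 0 are outvoted by class 1 neighbors, which is why the toy accuracy drops to 0.6.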

<font size=4>
Results:
</font>

In [4]:
data_table = process_data("semeion.data")
leave_one_out(data_table)

 k=1, accuracy: 0.9171374764595104
 k=3, accuracy: 0.9165097300690521
 k=5, accuracy: 0.9139987445072191


## Intermediate Requirement

<font size=4>
Compare the results with the kNN classifier from the sklearn machine learning package.
</font>

In [5]:
# kNN using the sklearn package
def sklearn_knn(data_table):
    # Split into features (first 256 columns) and labels (last column)
    X = data_table[:, :256]     # features
    y = data_table[:, 256]      # labels
    k_values = [1, 3, 5]
    for k in k_values:
        # Create a kNN classifier with k neighbors
        knn = KNeighborsClassifier(n_neighbors=k)
        # Leave-one-out cross-validation: record the accuracy of each single-sample fold
        loo = LeaveOneOut()
        accuracies = []
        for train_index, test_index in loo.split(X):
            X_train, X_test = X[train_index], X[test_index]
            y_train, y_test = y[train_index], y[test_index]
            knn.fit(X_train, y_train)     # fit the classifier on the training fold
            y_pred = knn.predict(X_test)  # predict the held-out sample
            accuracy = accuracy_score(y_test, y_pred)
            accuracies.append(accuracy)
        mean_accuracy = np.mean(accuracies)  # average over all folds
        print(f" k={k}, accuracy: {mean_accuracy}")  # print the mean accuracy

<font size=4>
Results:
</font>

In [6]:
sklearn_knn(data_table)

 k=1, accuracy: 0.9171374764595104
 k=3, accuracy: 0.903954802259887
 k=5, accuracy: 0.9052102950408035
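
The two implementations agree at k=1 but differ at k=3 and k=5. One plausible explanation, which this report does not verify, is the handling of vote ties: a single neighbor can never produce a tie, while three or five neighbors can. The hand-written `knn` returns the tied label that appears first in nearest-first order. The sklearn classifier with uniform weights is understood to return the smallest tied class label, which is an assumption about its internals. The sketch below only illustrates the two rules; the function names and the neighbor list are hypothetical.

```python
from collections import Counter

def vote_first_seen(neighbor_labels):
    """Majority vote; ties go to the tied label that appears first (nearest-first order)."""
    counts = {}
    for label in neighbor_labels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts, key=counts.get)

def vote_smallest_label(neighbor_labels):
    """Majority vote; ties go to the numerically smallest tied label."""
    counts = Counter(neighbor_labels)
    best = max(counts.values())
    return min(label for label, c in counts.items() if c == best)

# Three nearest neighbors, nearest first, each with a different label
neighbors = [7, 1, 9]
print(vote_first_seen(neighbors))      # 7
print(vote_smallest_label(neighbors))  # 1
```

When there is no tie, for example `[1, 7, 7]`, both rules return the same label.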


## Advanced Requirement

<font size=4>
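Process the original data by rotation or similar transformations to enlarge the dataset, then recognize the handwritten digits with a CNN.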
</font>

In [7]:
# Augment the original data by rotation, then recognize the digits with a CNN
def cnn(data_table):
    # Split into features (first 256 columns) and labels (last column)
    X = data_table[:, :256]     # features
    y = data_table[:, 256]      # labels

    # Preprocessing
    y = to_categorical(y, num_classes=10)  # one-hot encode the class labels
    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Reshape the feature vectors into 16x16 single-channel images
    X_train_images = X_train.reshape(X_train.shape[0], 16, 16, 1)
    X_test_images = X_test.reshape(X_test.shape[0], 16, 16, 1)

    # Create an ImageDataGenerator that applies random rotations of up to 20 degrees
    datagen = ImageDataGenerator(rotation_range=20)
    datagen.fit(X_train_images)

    # Generate the augmented training data
    augmented_data = datagen.flow(X_train_images, y_train, batch_size=X_train.shape[0], shuffle=False)

    # Draw one batch: a randomly rotated copy of every training image.
    # Note: this batch has the same size as the original training set.
    X_train_augmented, y_train_augmented = augmented_data.next()

    # Define the CNN model
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(16, 16, 1)))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(10, activation='softmax'))  # 10 classes in total

    # Compile the model
    model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])

    # Train the model
    history = model.fit(X_train_augmented, y_train_augmented, batch_size=64, epochs=30,
                        validation_data=(X_test_images, y_test))

    # Evaluate the model
    test_loss, test_acc = model.evaluate(X_test_images, y_test)

    print(f" loss: {test_loss}, accuracy: {test_acc}")

<font size=4>
Results:
</font>

In [8]:
cnn(data_table)

Epoch 1/30 through Epoch 30/30 (per-epoch progress lines not preserved)
 loss: 0.23930761218070984, accuracy: 0.9373040795326233
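
In `cnn`, a single batch of the same size as the training set is drawn from the generator, so the model trains on rotated copies of the images and the number of training samples does not grow. To actually enlarge the dataset, the originals can be kept and concatenated with their rotated copies. The following is a minimal sketch of that idea. It uses a hand-rolled nearest-neighbor rotation in NumPy rather than `ImageDataGenerator`, so it is not the method used in this report; the function names and the angles are illustrative assumptions, and images are expected as an array of shape `(n, 16, 16)`.

```python
import numpy as np

def rotate_image(img, angle_deg):
    """Rotate a 2-D image about its center with nearest-neighbor sampling."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for each output pixel, locate the source pixel
    src_x = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    src_y = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    src_x = np.rint(src_x).astype(int)
    src_y = np.rint(src_y).astype(int)
    # Pixels whose source falls outside the image stay zero
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(img)
    out[valid] = img[src_y[valid], src_x[valid]]
    return out

def augment_with_rotations(images, labels, angles=(-10, 10)):
    """Return the originals plus one rotated copy of every image per angle."""
    aug_images = [images]
    aug_labels = [labels]
    for angle in angles:
        rotated = np.stack([rotate_image(img, angle) for img in images])
        aug_images.append(rotated)
        aug_labels.append(labels)
    return np.concatenate(aug_images), np.concatenate(aug_labels)

# Synthetic example (random pixels, not real digits)
rng = np.random.default_rng(0)
images = rng.random((5, 16, 16))
labels = np.arange(5)
X_aug, y_aug = augment_with_rotations(images, labels)
print(X_aug.shape, y_aug.shape)  # (15, 16, 16) (15,)
```

With two angles the training set triples in size. Before passing the result to the CNN defined above, the images would need a trailing channel axis, for example `X_aug[..., np.newaxis]`.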
