TensorFlow Schoolwork

@author: 刘士坤 2016011371

Design a K-nearest-neighbors (KNN) model with TensorFlow, then train and validate it on the Iris dataset.


Import the required libraries

In [11]:
import tensorflow as tf
import numpy as np
import pandas as pd

### 1. Split the Iris dataset into a training set and a validation set at an 8 : 2 ratio (without using the Dataset API).

#### Load the dataset

In [2]:
data = pd.read_csv('./data/iris.data.csv', header=None)   # Iris dataset; the file has no header row
data.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'species']    # feature names and the class column

#### Split the raw dataset into a training set and a test set
##### Extract the three classes separately; setosa, versicolor, and virginica are encoded as 0, 1, and 2

In [3]:
X = data.iloc[0:150, 0:4].values
y = data.iloc[0:150, 4].values
y[y == 'Iris-setosa'] = 0                                 # Iris-setosa -> label 0
y[y == 'Iris-versicolor'] = 1                             # Iris-versicolor -> label 1
y[y == 'Iris-virginica'] = 2                              # Iris-virginica -> label 2
X_setosa, y_setosa = X[0:50], y[0:50]                     # Iris-setosa rows (4 features each)
X_versicolor, y_versicolor = X[50:100], y[50:100]         # Iris-versicolor rows (4 features each)
X_virginica, y_virginica = X[100:150], y[100:150]         # Iris-virginica rows (4 features each)
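
The in-place assignments above leave `y` as an object-dtype array. As an alternative, here is a minimal sketch of encoding string labels as integers with `np.unique`. The `names` array is a hypothetical stand-in for the species column, not the notebook's data.

```python
import numpy as np

# Hypothetical stand-in for the species column
names = np.array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-setosa'])

# np.unique sorts the distinct names; return_inverse gives each row's index into that sorted list
classes, codes = np.unique(names, return_inverse=True)

print(classes)        # distinct class names in sorted order
print(codes)          # integer label per row
print(codes.dtype)    # an integer dtype rather than object
```

Because the three species names sort alphabetically as setosa, versicolor, virginica, this gives the same 0, 1, 2 encoding used in the cell above.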

##### Split each class into training and test samples, 80% and 20% of that class respectively: a 40 : 10 split per class, then merge the classes

In [4]:
# training set
X_setosa_train = X_setosa[:40, :]
y_setosa_train = y_setosa[:40]
X_versicolor_train = X_versicolor[:40, :]
y_versicolor_train = y_versicolor[:40]
X_virginica_train = X_virginica[:40, :]
y_virginica_train = y_virginica[:40]
X_train = np.vstack([X_setosa_train, X_versicolor_train, X_virginica_train])
y_train = np.hstack([y_setosa_train, y_versicolor_train, y_virginica_train])

# test set
X_setosa_test = X_setosa[40:50, :]
y_setosa_test = y_setosa[40:50]
X_versicolor_test = X_versicolor[40:50, :]
y_versicolor_test = y_versicolor[40:50]
X_virginica_test = X_virginica[40:50, :]
y_virginica_test = y_virginica[40:50]
X_test = np.vstack([X_setosa_test, X_versicolor_test, X_virginica_test])
y_test = np.hstack([y_setosa_test, y_versicolor_test, y_virginica_test])
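
The per-class slicing above could be written more compactly as a helper. This is a sketch only: `stratified_split` and the toy arrays are hypothetical names introduced for illustration. It assumes, as the cell above does, that taking the first 80% of each class in file order is acceptable.

```python
import numpy as np

def stratified_split(X, y, train_fraction=0.8):
    """Split each class separately, keeping the first `train_fraction` of its rows for training."""
    train_idx, test_idx = [], []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)               # row positions of this class, in original order
        n_train = int(round(train_fraction * len(idx)))
        train_idx.append(idx[:n_train])
        test_idx.append(idx[n_train:])
    train_idx = np.concatenate(train_idx)
    test_idx = np.concatenate(test_idx)
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Toy stand-in for the Iris arrays: 3 classes x 50 rows, 4 features
X_demo = np.arange(150 * 4, dtype=float).reshape(150, 4)
y_demo = np.repeat([0, 1, 2], 50)

X_tr, y_tr, X_te, y_te = stratified_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```

With 50 rows per class, each class contributes 40 training rows and 10 test rows, matching the 40 : 10 split done by hand.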

### 2. Design the model:
        Design a K-nearest-neighbors model with TensorFlow (KD-tree optimization is optional)
        Comment the key parts of the model

#### Enable eager mode

In [5]:
tf.enable_eager_execution()    # TF 1.x API; TF 2.x runs eagerly by default

In [6]:
tf.executing_eagerly()    # check whether eager execution is enabled

True

#### Training a KNN model is really just labeling and storing the data

##### Define a class that implements KNN. The `train` method of `KNearestNeighbor` simply stores all training samples.

In [7]:
class KNearestNeighbor(object):
    def __init__(self):
        pass

    # Training: just store the samples and their labels
    def train(self, X, y):
        self.X_train = X
        self.y_train = y
    
    # Prediction
    def predict(self, X, k=1):
        # Compute pairwise L2 distances between test and training samples
        num_test = X.shape[0]
        # ||x - t||^2 = ||x||^2 - 2 x.t + ||t||^2, so all distances come from one matrix product
        d1 = -2 * np.dot(X, self.X_train.T)    # shape (num_test, num_train)
        d2 = np.sum(np.square(X), axis=1, keepdims=True)    # shape (num_test, 1)
        d3 = np.sum(np.square(self.X_train), axis=1)    # shape (num_train,), broadcast across rows
        # Clamp tiny negative values caused by rounding before taking the square root
        dist = np.sqrt(np.maximum(d1 + d2 + d3, 0))
        # Pick the most frequent class among the k nearest neighbors
        y_pred = np.zeros(num_test)
        for i in range(num_test):
            dist_k_min = np.argsort(dist[i])[:k]    # indices of the k nearest training samples
            y_kclose = self.y_train[dist_k_min]     # labels of those k neighbors
            y_pred[i] = np.argmax(np.bincount(y_kclose.tolist()))    # majority vote; ties go to the smallest label

        return y_pred
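
The distance computation in `predict` relies on the expansion ||x - t||^2 = ||x||^2 - 2 x·t + ||t||^2. The sketch below checks that expansion against an explicit difference-based computation on random data. `pairwise_l2`, `A`, and `B` are hypothetical names used only for this check.

```python
import numpy as np

def pairwise_l2(A, B):
    """Euclidean distances between rows of A (m, d) and rows of B (n, d), via the expansion."""
    cross = -2.0 * A @ B.T                            # shape (m, n)
    a_sq = np.sum(A ** 2, axis=1, keepdims=True)      # shape (m, 1)
    b_sq = np.sum(B ** 2, axis=1)                     # shape (n,), broadcast across rows
    sq = np.maximum(cross + a_sq + b_sq, 0.0)         # clamp tiny negatives from rounding
    return np.sqrt(sq)

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 4))
B = rng.normal(size=(7, 4))

# Reference: explicit differences via broadcasting, shape (5, 7, 4) reduced to (5, 7)
naive = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

print(np.allclose(pairwise_l2(A, B), naive))
```

The expanded form avoids materializing the (m, n, d) difference array, which matters once the training set grows. The clamp guards against `sqrt` of a slightly negative number when a test point coincides with a training point.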

### 3. Train the model:
        Write the training code with TensorFlow
        Comment the key parts of training

#### Evaluation is the core part of KNN: choosing a suitable K

In [8]:
KNN = KNearestNeighbor()

In [9]:
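num_folds = 5    # split the training data into 5 folds
K_classes = [3, 5, 7, 9, 11, 13, 15]    # candidate K values

# Split the training data into 5 folds.
# X_train is ordered by class and is not shuffled, so the folds are not class-balanced.
X_train_folds = np.split(X_train, num_folds)
y_train_folds = np.split(y_train, num_folds)

# List of mean accuracies, one per K value in K_classes
K_accuracy = []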
k_best = K_classes[0]

for k in K_classes:
    accuracies = []
    for i in range(num_folds):
        Xtr = np.concatenate(X_train_folds[:i] + X_train_folds[i+1:])
        ytr = np.concatenate(y_train_folds[:i] + y_train_folds[i+1:])
        Xcv = X_train_folds[i]
        ycv = y_train_folds[i]
        KNN.train(Xtr, ytr)
        ycv_pred = KNN.predict(Xcv, k=k)
        accuracy = np.mean(ycv_pred == ycv)
        accuracies.append(accuracy)
    accuracies_avg = np.mean(accuracies)
    # Keep the K with the highest mean accuracy (ties keep the earlier, smaller K)
    if not K_accuracy or accuracies_avg > max(K_accuracy):
        k_best = k
    K_accuracy.append(accuracies_avg)

# Print the cross-validation results
for k in range(len(K_classes)):
    print('k = %d, accuracy: %f' % (K_classes[k], K_accuracy[k]))
print('Best K is: %d\n' % k_best)

k = 3, accuracy: 0.900000
k = 5, accuracy: 0.891667
k = 7, accuracy: 0.891667
k = 9, accuracy: 0.900000
k = 11, accuracy: 0.891667
k = 13, accuracy: 0.900000
k = 15, accuracy: 0.891667
Best K is: 3
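
The folds used above are contiguous slices of a class-ordered training set, so some validation folds contain a single class. One common remedy is to shuffle the indices before splitting. The sketch below is an illustration, not the procedure that produced the numbers above: `shuffled_folds` and `y_demo` are hypothetical names, and shuffled folds would give different accuracies.

```python
import numpy as np

def shuffled_folds(n_samples, num_folds, seed=0):
    """Return `num_folds` index arrays that together cover a random permutation of range(n_samples)."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), num_folds)

# Stand-in labels ordered by class, like y_train: 40 of each class
y_demo = np.repeat([0, 1, 2], 40)
folds = shuffled_folds(len(y_demo), num_folds=5)

for i, fold in enumerate(folds):
    # Class counts inside each shuffled validation fold
    print(i, np.bincount(y_demo[fold], minlength=3))

# Contiguous split, for comparison: the first fold holds 24 samples, all of class 0
print(np.bincount(np.split(y_demo, 5)[0], minlength=3))
```

To use these folds, index `X_train` and `y_train` with `folds[i]` for validation and with the concatenation of the remaining index arrays for training.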



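### 4. Validate the model:
        Evaluate model performance on the validation set
        Tune hyperparameters on the validation set

#### With a suitable K chosen, run predictions on the validation set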

In [10]:
KNN.train(X_train, y_train)
y_pred = KNN.predict(X_test, k=k_best)    # use the K selected by cross-validation (3 here)
accuracy = np.mean(y_pred == y_test)
print('Test set accuracy: %f' % accuracy)

Test set accuracy: 1.000000
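
A single accuracy figure hides which classes get confused with each other. A confusion matrix shows this, and can be built with NumPy alone as sketched below. The label arrays are hypothetical values for illustration and are not results from this notebook. `confusion_matrix` is a helper defined here, not a library call.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)    # accumulate one count per (true, predicted) pair
    return cm

# Hypothetical labels for illustration only
y_true_demo = np.array([0, 0, 1, 1, 2, 2])
y_pred_demo = np.array([0, 0, 1, 2, 2, 2])

cm = confusion_matrix(y_true_demo, y_pred_demo, num_classes=3)
print(cm)
print(cm.diagonal() / cm.sum(axis=1))    # per-class recall
```

Calling `confusion_matrix(y_test, y_pred, 3)` on the arrays from the cell above should work the same way, since both are converted to integers first.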
