## Ensemble Learning

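> *Machine Learning in Action*, Chapter 7

> When making an important decision, we consider the opinions of several experts rather than just one person; this is the idea behind **meta-algorithms**. A meta-algorithm is a way of combining other algorithms.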

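Ensemble learning trains multiple classifiers on the same data set and assigns a weight to each classifier's result.

Steps:
Randomly resample the data set into n data sets and hand them to n classifiers. Multiply each classifier's output by the weight assigned to that classifier, then sum the products to get the final result.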
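
The weighted combination in the steps above can be sketched in NumPy. This is a minimal illustration rather than the chapter's code: the scores in `outputs`, the values in `weights`, and the helper name `weighted_vote` are all made up for the example.

```python
import numpy as np

def weighted_vote(outputs, weights):
    """Combine per-classifier class scores into one prediction per sample.

    outputs: array of shape (n_classifiers, n_samples, n_classes)
    weights: array of shape (n_classifiers,), one weight per classifier
    """
    outputs = np.asarray(outputs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Multiply each classifier's scores by its weight and sum over classifiers.
    combined = np.tensordot(weights, outputs, axes=1)  # (n_samples, n_classes)
    # The predicted class is the index of the largest combined score.
    return combined, np.argmax(combined, axis=1)

# Hypothetical scores from 3 classifiers for 2 samples and 3 classes.
outputs = [
    [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],
    [[0.4, 0.5, 0.1], [0.2, 0.2, 0.6]],
    [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]],
]
weights = [0.5, 0.25, 0.25]  # assumed weights, chosen only for illustration

combined, predicted = weighted_vote(outputs, weights)
print(predicted)
```

With equal weights this reduces to a plain average of the classifiers' outputs.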

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
from sklearn.metrics import confusion_matrix
import time
from datetime import timedelta
import math
import os

# Use PrettyTensor to simplify Neural Network construction.
import prettytensor as pt

In [2]:
from tensorflow.examples.tutorials.mnist import input_data
data = input_data.read_data_sets('data/MNIST/', one_hot=True)

Extracting data/MNIST/train-images-idx3-ubyte.gz
Extracting data/MNIST/train-labels-idx1-ubyte.gz
Extracting data/MNIST/t10k-images-idx3-ubyte.gz
Extracting data/MNIST/t10k-labels-idx1-ubyte.gz


In [15]:
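# By default, argmax indexes into the flattened array; otherwise it works along the given axis.
# With axis=1 it searches within each row, giving one class number per image.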
data.test.cls = np.argmax(data.test.labels, axis=1)
data.validation.cls = np.argmax(data.validation.labels, axis=1)
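
To make the effect of `axis` concrete, here is a small sketch on a hypothetical one-hot array (three images, three classes). The array `labels` is invented for the example and is not part of the MNIST data.

```python
import numpy as np

# Hypothetical one-hot labels for three images with classes 2, 0 and 1.
labels = np.array([[0, 0, 1],
                   [1, 0, 0],
                   [0, 1, 0]])

# axis=1 searches within each row, giving one class number per image.
cls = np.argmax(labels, axis=1)

# Without axis, the array is flattened first and a single index is returned.
flat_index = np.argmax(labels)

print(cls, flat_index)
```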

In [18]:
combined_images = np.concatenate([data.train.images, data.validation.images], axis=0)
combined_labels = np.concatenate([data.train.labels, data.validation.labels], axis=0)

In [20]:
# Size of the combined data set.
combined_size = len(combined_images)
combined_size

60000

In [21]:
# Use 80% of it as the training set.
train_size = int(0.8 * combined_size)
train_size

48000

In [23]:
# The remainder is used for validation.
validation_size = combined_size - train_size
validation_size

12000

In [35]:
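# Randomly generate a training set and a validation set.
def random_training_set():
    # Random permutation of all indices in the combined data set.
    idx = np.random.permutation(combined_size)

    # Split the permutation into training and validation indices.
    idx_train = idx[0:train_size]
    idx_validation = idx[train_size:]

    # Select the images and labels for the training set.
    x_train = combined_images[idx_train, :]
    y_train = combined_labels[idx_train, :]

    # Select the images and labels for the validation set.
    x_validation = combined_images[idx_validation, :]
    y_validation = combined_labels[idx_validation, :]

    return x_train, y_train, x_validation, y_validation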
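
Slicing one random permutation into two parts, as `idx_train` and `idx_validation` do above, gives index sets that never overlap. The sketch below checks this on made-up sizes. The names `total` and `n_train` are stand-ins for `combined_size` and `train_size`, and the fixed seed is only there to make the run repeatable.

```python
import numpy as np

total = 10     # stand-in for combined_size
n_train = 8    # stand-in for train_size

rng = np.random.RandomState(0)   # fixed seed so the run is repeatable
idx = rng.permutation(total)     # every index 0..total-1 exactly once

idx_train = idx[:n_train]
idx_validation = idx[n_train:]

# The two index sets are disjoint and together cover the whole data set.
overlap = np.intersect1d(idx_train, idx_validation)
covered = np.sort(np.concatenate([idx_train, idx_validation]))
print(len(idx_train), len(idx_validation), len(overlap))
```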

Define the data dimensions

In [36]:
# We know that MNIST images are 28 pixels in each dimension.
img_size = 28

# Images are stored in one-dimensional arrays of this length.
img_size_flat = img_size * img_size

# Tuple with height and width of images used to reshape arrays.
img_shape = (img_size, img_size)

# Number of colour channels for the images: 1 channel for gray-scale.
num_channels = 1

# Number of classes, one class for each of 10 digits.
num_classes = 10

Plotting function

In [37]:
def plot_images(images,                  # Images to plot, 2-d array.
                cls_true,                # True class-no for images.
                ensemble_cls_pred=None,  # Ensemble predicted class-no.
                best_cls_pred=None):     # Best-net predicted class-no.

    assert len(images) == len(cls_true)
    
    # Create figure with 3x3 sub-plots.
    fig, axes = plt.subplots(3, 3)

    # Adjust vertical spacing if we need to print ensemble and best-net.
    if ensemble_cls_pred is None:
        hspace = 0.3
    else:
        hspace = 1.0
    fig.subplots_adjust(hspace=hspace, wspace=0.3)

    # For each of the sub-plots.
    for i, ax in enumerate(axes.flat):

        # There may not be enough images for all sub-plots.
        if i < len(images):
            # Plot image.
            ax.imshow(images[i].reshape(img_shape), cmap='binary')

            # Show true and predicted classes.
            if ensemble_cls_pred is None:
                xlabel = "True: {0}".format(cls_true[i])
            else:
                msg = "True: {0}\nEnsemble: {1}\nBest Net: {2}"
                xlabel = msg.format(cls_true[i],
                                    ensemble_cls_pred[i],
                                    best_cls_pred[i])

            # Show the classes as the label on the x-axis.
            ax.set_xlabel(xlabel)
        
        # Remove ticks from the plot.
        ax.set_xticks([])
        ax.set_yticks([])
    
    # Ensure the plot is shown correctly with multiple plots
    # in a single Notebook cell.
    plt.show()

Define the input and output variables

In [39]:
x = tf.placeholder(tf.float32, shape=[None, img_size_flat], name='x')
x_image = tf.reshape(x, [-1, img_size, img_size, num_channels])
y_true = tf.placeholder(tf.float32, shape=[None, 10], name='y_true')
y_true_cls = tf.argmax(y_true, dimension=1)

Build the network with PrettyTensor

In [40]:
x_pretty = pt.wrap(x_image)

In [41]:
with pt.defaults_scope(activation_fn=tf.nn.relu):
    y_pred, loss = x_pretty.\
        conv2d(kernel=5, depth=16, name='layer_conv1').\
        max_pool(kernel=2, stride=2).\
        conv2d(kernel=5, depth=36, name='layer_conv2').\
        max_pool(kernel=2, stride=2).\
        flatten().\
        fully_connected(size=128, name='layer_fc1').\
        softmax_classifier(num_classes=num_classes, labels=y_true)

In [42]:
# Optimizer
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(loss)

Compute the accuracy

In [43]:
y_pred_cls = tf.argmax(y_pred, dimension=1)
correct_prediction = tf.equal(y_pred_cls, y_true_cls)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

## Saver

Save the neural networks to files

In [44]:
saver = tf.train.Saver(max_to_keep=100) # Keep at most 100 saved networks

In [46]:
save_dir = 'checkpoints/'
if not os.path.exists(save_dir):
    os.makedirs(save_dir)

In [47]:
def get_save_path(net_number):
    return save_dir + 'network' + str(net_number)

Session

In [48]:
session = tf.Session()

In [49]:
def init_variables():
    session.run(tf.global_variables_initializer())

In [50]:
train_batch_size = 64

In [51]:
# Unlike the earlier approach of taking batches in order, each batch here is drawn at random.
def random_batch(x_train, y_train):
    
    num_images = len(x_train)
    
    # Draw train_batch_size distinct random indices from [0, num_images).
    idx = np.random.choice(num_images,
                           size=train_batch_size,
                           replace=False)
    
    x_batch = x_train[idx, :]
    y_batch = y_train[idx, :]
    
    return x_batch, y_batch

In [52]:
def optimize(num_iterations, x_train, y_train):
    # Start-time used for printing time-usage below.
    start_time = time.time()

    for i in range(num_iterations):

        # Get a batch of training examples.
        # x_batch now holds a batch of images and
        # y_true_batch are the true labels for those images.
        x_batch, y_true_batch = random_batch(x_train, y_train)

        # Put the batch into a dict with the proper names
        # for placeholder variables in the TensorFlow graph.
        feed_dict_train = {x: x_batch,
                           y_true: y_true_batch}

        # Run the optimizer using this batch of training data.
        # TensorFlow assigns the variables in feed_dict_train
        # to the placeholder variables and then runs the optimizer.
        session.run(optimizer, feed_dict=feed_dict_train)

        # Print status every 100 iterations and after last iteration.
        if i % 100 == 0:

            # Calculate the accuracy on the training-batch.
            acc = session.run(accuracy, feed_dict=feed_dict_train)
            
            # Status-message for printing.
            msg = "Optimization Iteration: {0:>6}, Training Batch Accuracy: {1:>6.1%}"

            # Print it.
            print(msg.format(i + 1, acc))

    # Ending time.
    end_time = time.time()

    # Difference between start and end-times.
    time_dif = end_time - start_time

    # Print the time-usage.
    print("Time usage: " + str(timedelta(seconds=int(round(time_dif)))))

In [53]:
num_networks = 5
num_iterations = 1000

## Ensemble Learning

In [54]:
if True:
    
    for i in range(num_networks):
        print("Neural network: {0}".format(i))
        
        x_train, y_train, _, _ = random_training_set()
        
        session.run(tf.global_variables_initializer())
        
        optimize(num_iterations=num_iterations,
                 x_train=x_train,
                 y_train=y_train)
        
        saver.save(sess=session, save_path=get_save_path(i))
        
        print()

Neural network: 0
Optimization Iteration:      1, Training Batch Accuracy:  10.9%
Optimization Iteration:    101, Training Batch Accuracy:  89.1%
Optimization Iteration:    201, Training Batch Accuracy:  89.1%
Optimization Iteration:    301, Training Batch Accuracy:  95.3%
Optimization Iteration:    401, Training Batch Accuracy:  93.8%
Optimization Iteration:    501, Training Batch Accuracy: 100.0%
Optimization Iteration:    601, Training Batch Accuracy:  96.9%
Optimization Iteration:    701, Training Batch Accuracy:  98.4%
Optimization Iteration:    801, Training Batch Accuracy:  98.4%
Optimization Iteration:    901, Training Batch Accuracy:  96.9%
Time usage: 0:01:55

Neural network: 1
Optimization Iteration:      1, Training Batch Accuracy:   6.2%
Optimization Iteration:    101, Training Batch Accuracy:  90.6%
Optimization Iteration:    201, Training Batch Accuracy:  92.2%
Optimization Iteration:    301, Training Batch Accuracy:  85.9%
Optimization Iteration:    401, Training Batch 

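## Ensemble
For one input image, the ensemble takes the outputs of the five classifiers, averages the predicted values for each class label (10 classes in total), and uses the index of the largest average as the final output for that image.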

In [55]:
# Code still to be written
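
The averaging step described in this section can be sketched with NumPy alone. This is not the notebook's implementation: `pred_labels` holds made-up softmax outputs standing in for what the five saved networks would produce, only 3 classes are used to keep the example short, and `ensemble_predict` is a hypothetical helper name.

```python
import numpy as np

def ensemble_predict(pred_labels):
    """Average class probabilities over networks and pick the best class.

    pred_labels: array of shape (num_networks, num_images, num_classes)
    """
    pred_labels = np.asarray(pred_labels, dtype=float)
    # Mean over the network axis: one averaged probability vector per image.
    ensemble_pred_labels = np.mean(pred_labels, axis=0)
    # The index of the largest averaged probability is the ensemble's class.
    ensemble_cls_pred = np.argmax(ensemble_pred_labels, axis=1)
    return ensemble_pred_labels, ensemble_cls_pred

# Made-up outputs: 5 networks, 2 images, 3 classes (MNIST would have 10).
pred_labels = np.array([
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
    [[0.2, 0.7, 0.1], [0.2, 0.2, 0.6]],
    [[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]],
    [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]],
])

ensemble_pred_labels, ensemble_cls_pred = ensemble_predict(pred_labels)
print(ensemble_cls_pred)
```

In the sample data the third network prefers class 1 for the first image and the first network prefers class 1 for the second image, but the averages still select classes 0 and 2.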