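# Text Classification: An IMDB Example

Another Hello-world-level example, this time for learning the basics of classification.

The IMDB dataset is a collection of movie reviews. Each review carries a label: 0 for negative, 1 for positive.

Training a classifier on the IMDB dataset is therefore a binary classification problem.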


## Getting the dataset

Load the dataset through keras:


In [1]:
from tensorflow import keras

imdb = keras.datasets.imdb
# On first run this downloads imdb.npz to the user directory, in my case /home/allen/.keras/datasets/imdb.npz
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))

Training entries: 25000, labels: 25000


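The data loaded above consists entirely of integers; each integer stands for one word.

To read a review, we therefore need to map each id back to its string.

Here is how: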


In [3]:
from tensorflow import keras

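# Build the word-to-id mapping
# On first run this downloads imdb_word_index.json to the user directory, in my case /home/allen/.keras/datasets/imdb_word_index.json
word2idx = imdb.get_word_index()

# Shift every index in word2idx by 3, because four special tokens take ids 0 to 3.
# The shift is 3 rather than 4 because the original word ids start at 1, so the first real word lands on id 4.
# This matches the ids in train_data: load_data() defaults to index_from=3, and every review starts with id 1.
word2idx = {k: (v + 3) for k, v in word2idx.items()}
# Padding token, appended to short sequences to align lengths
word2idx['<PAD>'] = 0
# Marks the start of a review
word2idx['<START>'] = 1
# Marks unknown words: load_data(num_words=10000) keeps only the most frequent words and replaces the rest with this id
word2idx['<UNK>'] = 2
# Marks an unused id
word2idx['<UNUSED>'] = 3

# Build the id-to-word mapping
idx2word = {value: key for key, value in word2idx.items()}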
print("%s,%s,%s,%s" % (
  idx2word.get(0, "?"), idx2word.get(1, "?"), idx2word.get(2, "?"),
  idx2word.get(3, "?")))

<PAD>,<START>,<UNK>,<UNUSED>


Next, we need to turn the id representation of each review back into a string.

We define a function for this conversion:


In [None]:
def decode_review(text_ids):
    # Look up each id in idx2word (built above); unknown ids become "?"
    return " ".join([idx2word.get(i, "?") for i in text_ids])
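
To see why a shift of 3 lines the vocabulary up with the four special tokens, here is a small self-contained sketch. The three-word `toy_index` and the id sequence are made up for illustration and are not taken from the IMDB data; the rest mirrors the mapping code above.

```python
# Toy stand-in for imdb.get_word_index(): ranks start at 1 (hypothetical words).
toy_index = {"the": 1, "film": 2, "great": 3}

# Shift by 3 so that ids 0..3 are free for the special tokens.
toy_word2idx = {word: rank + 3 for word, rank in toy_index.items()}
toy_word2idx["<PAD>"] = 0
toy_word2idx["<START>"] = 1
toy_word2idx["<UNK>"] = 2
toy_word2idx["<UNUSED>"] = 3

# Reverse mapping: id -> word.
toy_idx2word = {idx: word for word, idx in toy_word2idx.items()}


def decode_toy(text_ids):
    """Turn a list of ids back into a space-separated string."""
    return " ".join(toy_idx2word.get(i, "?") for i in text_ids)


# A made-up padded review: start marker, "the", "film", an unknown word, "great", padding.
decoded = decode_toy([1, 4, 5, 2, 6, 0, 0])
print(decoded)  # <START> the film <UNK> great <PAD> <PAD>
```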

## Preprocessing the data

Because the reviews differ in length, we need to bring them to a common length:


In [None]:
# Pad the training data by appending the `<PAD>` id to the end of each short sequence; the maximum length is 256 tokens
train_data = keras.preprocessing.sequence.pad_sequences(
  train_data,  # sequences of integer ids
  value=word2idx["<PAD>"],
  padding='post',
  maxlen=256)
print(len(train_data[0]), len(train_data[1]))
print(train_data[0])
print(decode_review(train_data[0]))

test_data = keras.preprocessing.sequence.pad_sequences(
  test_data,
  value=word2idx["<PAD>"],
  padding='post',
  maxlen=256)

The output is:

```bash
256 256
[   1   14   22   16   43  530  973 1622 1385   65  458 4468   66 3941
    4  173   36  256    5   25  100   43  838  112   50  670    2    9
   35  480  284    5  150    4  172  112  167    2  336  385   39    4
  172 4536 1111   17  546   38   13  447    4  192   50   16    6  147
 2025   19   14   22    4 1920 4613  469    4   22   71   87   12   16
   43  530   38   76   15   13 1247    4   22   17  515   17   12   16
  626   18    2    5   62  386   12    8  316    8  106    5    4 2223
 5244   16  480   66 3785   33    4  130   12   16   38  619    5   25
  124   51   36  135   48   25 1415   33    6   22   12  215   28   77
   52    5   14  407   16   82    2    8    4  107  117 5952   15  256
    4    2    7 3766    5  723   36   71   43  530  476   26  400  317
   46    7    4    2 1029   13  104   88    4  381   15  297   98   32
 2071   56   26  141    6  194 7486   18    4  226   22   21  134  476
   26  480    5  144   30 5535   18   51   36   28  224   92   25  104
    4  226   65   16   38 1334   88   12   16  283    5   16 4472  113
  103   32   15   16 5345   19  178   32    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0]
    
<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
```
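As you can see, the sequence ends in a run of zeros, because the id of `<PAD>` is 0.
Every raw review in train_data starts with id 1, so the decoded string begins with `<START>`. Reviews longer than 256 tokens are truncated, by default from the front, so they can lose this marker.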
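
The padding step can be pictured with a few lines of plain Python. This is only a sketch of the behaviour, not the Keras implementation, and `pad_post` is a hypothetical helper name. It assumes the `pad_sequences` default `truncating='pre'`, which the calls above do not override, so sequences longer than `maxlen` lose tokens from the front.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end and truncate from the front (a sketch of
    padding='post' combined with the default truncating='pre')."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                  # keep at most the last maxlen ids
        seq = seq + [value] * (maxlen - len(seq))  # fill the tail with the pad id
        padded.append(seq)
    return padded


short_review = [1, 14, 22]               # shorter than maxlen: gets padded
long_review = [1, 14, 22, 16, 43, 530]   # longer than maxlen: gets truncated
result = pad_post([short_review, long_review], maxlen=4)
print(result)  # [[1, 14, 22, 0], [22, 16, 43, 530]]
```

Note that the truncated review no longer starts with the `<START>` id 1.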


## Building the model

Again we use the keras Sequential model:


In [None]:
from tensorflow import keras

vocab_size = 10000

model = keras.Sequential()
# Word embedding: each word is represented by a 1-D vector of 16 floats
model.add(keras.layers.Embedding(vocab_size, 16))
# Average pooling, explained below
model.add(keras.layers.GlobalAveragePooling1D())
# Fully connected layer with ReLU activation
model.add(keras.layers.Dense(16, activation=keras.activations.relu))
# Output layer with a sigmoid, which squashes the output into the (0, 1) interval
model.add(keras.layers.Dense(1, activation=keras.activations.sigmoid))

# Print a summary of the model
model.summary()

```bash
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, None, 16)          160000    
_________________________________________________________________
global_average_pooling1d_1 ( (None, 16)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
=================================================================
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________

```

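The output shape of `embedding_1` is (BATCH_SIZE, STEPS, FEATURES); None stands for any value.
The output shape of `global_average_pooling1d_1` is (BATCH_SIZE, FEATURES).
The output shape of `dense_1` is likewise (BATCH_SIZE, FEATURES).

What does average pooling mean, and what is it for?

According to the tensorflow documentation, average pooling here takes the mean over the TIME_STEPS (STEPS) dimension. Its purpose is to handle variable-length input sequences (STEPS is the sequence length) by turning each one into a fixed-length vector. Average pooling is probably the simplest way to achieve this.

The implementation is just as simple: take the mean over the STEPS dimension.
With TensorFlow as the backend, for example: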
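
The parameter counts in the summary can be checked by hand. The sketch below only redoes the arithmetic; the variable names are illustrative and not part of the model code.

```python
vocab_size = 10000
embedding_dim = 16
hidden_units = 16

# Embedding: one 16-dimensional vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim                  # 160000
# Global average pooling has no trainable parameters.
pooling_params = 0
# Dense(16): a 16x16 weight matrix plus 16 biases.
dense_1_params = embedding_dim * hidden_units + hidden_units   # 272
# Dense(1): 16 weights plus 1 bias.
dense_2_params = hidden_units * 1 + 1                          # 17

total_params = embedding_params + pooling_params + dense_1_params + dense_2_params
print(total_params)  # 160289
```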

```python
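import tensorflow as tf

# inputs is the input tensor, i.e. the output of the Embedding layer, with shape (BATCH_SIZE, STEPS, FEATURES)
inputs = tf.zeros((2, 256, 16))  # placeholder values, for illustration only
# keepdims=False drops the dimension given by axis, so after average pooling the shape becomes (BATCH_SIZE, FEATURES)
pooled = tf.reduce_mean(inputs, axis=1, keepdims=False)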
```
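
To make the "variable length in, fixed length out" point concrete, here is the same averaging written in plain Python for a single sequence. `mean_pool` is a hypothetical helper and the numbers are made up; it illustrates the idea, not how Keras implements the layer.

```python
def mean_pool(steps):
    """Average a (STEPS, FEATURES) list of vectors over the STEPS axis."""
    n_steps = len(steps)
    n_features = len(steps[0])
    return [sum(step[f] for step in steps) / n_steps for f in range(n_features)]


# Two sequences of different lengths; each step is a 2-dimensional vector.
short_seq = [[1.0, 2.0], [3.0, 4.0]]
long_seq = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [6.0, 4.0]]

pooled_short = mean_pool(short_seq)
pooled_long = mean_pool(long_seq)
print(pooled_short)  # [2.0, 3.0]
print(pooled_long)   # [3.0, 1.0]
```

Both results have two entries, one per feature, regardless of how many steps went in.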

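The final output layer has a single node. Our labels are either 0 or 1, so this is a binary classification problem. We therefore use the sigmoid function, which squashes the output into the range 0 to 1.

**Why use the sigmoid function for binary classification?** See this answer on Zhihu: https://www.zhihu.com/question/35322351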
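
As a quick illustration of that squashing, the sigmoid can be written in a few lines of standard-library Python. The input values and the 0.5 decision threshold are assumptions chosen for the example; the threshold is a common convention, not something the model definition fixes.

```python
import math


def sigmoid(x):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


for logit in (-4.0, 0.0, 4.0):
    probability = sigmoid(logit)
    label = 1 if probability >= 0.5 else 0  # assumed threshold of 0.5
    print(logit, round(probability, 3), label)
```

Large negative inputs end up close to 0, large positive inputs close to 1, and an input of 0 maps to exactly 0.5.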

## Compiling the model

Since this is binary classification, we choose binary_crossentropy as the loss.
We use the Adam optimizer.

The code:


In [None]:
model.compile(optimizer=keras.optimizers.Adam(),
              loss=keras.losses.binary_crossentropy,
              metrics=['accuracy'])
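
For intuition about what this loss measures, here is a minimal standard-library sketch of binary cross-entropy averaged over a batch. The function name mirrors the Keras loss, but this is a hand-written illustration; the clipping constant `eps` and the sample values are assumptions.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0); eps is an assumption
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Made-up predictions for one positive and one negative review.
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(round(confident_right, 3))  # 0.105
print(round(confident_wrong, 3))  # 2.303
```

Confident correct predictions give a small loss, while confident wrong predictions are penalized heavily.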

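## Training the model

First we prepare the data by splitting train_data into two parts, one for training and one for validation. This time we train the model with mini-batches of 512 samples.

The code: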


In [None]:
# The first 10000 samples are used for validation, the rest for training
x_val = train_data[:10000]
partial_x_train = train_data[10000:]

y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]

# Train for 40 epochs, validating on validation_data after each epoch
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)
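
The split and batch size determine how many weight updates happen per epoch. The arithmetic below is a small sketch using the sizes from this section; the variable names are illustrative.

```python
import math

total_samples = 25000        # size of train_data
validation_samples = 10000   # held out as x_val / y_val
batch_size = 512

training_samples = total_samples - validation_samples
# The last batch is smaller than 512, hence the ceiling.
steps_per_epoch = math.ceil(training_samples / batch_size)
last_batch_size = training_samples - (steps_per_epoch - 1) * batch_size

print(training_samples, steps_per_epoch, last_batch_size)  # 15000 30 152
```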

## Evaluating the model

We evaluate on the test set:


In [None]:
results = model.evaluate(test_data, test_labels)
print(results)  # [loss, accuracy]

```bash
[0.34129497660636904, 0.86908]
```
The result: a loss of 0.341 and an accuracy of 0.869.
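
The accuracy figure is the fraction of reviews whose predicted label matches the true label. As a rough sketch, assuming sigmoid outputs are turned into labels with a 0.5 threshold, it can be computed like this. The labels and probabilities are invented for the example, and `binary_accuracy` here is a hand-written helper.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions that match the labels after thresholding."""
    correct = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p > threshold else 0
        if predicted == y:
            correct += 1
    return correct / len(y_true)


labels = [1, 0, 1, 1]                  # made-up labels
probabilities = [0.9, 0.2, 0.4, 0.7]   # made-up sigmoid outputs
accuracy = binary_accuracy(labels, probabilities)
print(accuracy)  # 0.75
```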

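## Visualizing the training process

To see more clearly how loss, val_loss, acc and val_acc change during training, we plot them.

The model.fit() method returns a history object, from which we can read these values.

The code: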


In [None]:
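import matplotlib.pyplot as plt

history_dict = history.history
history_dict.keys()

# Older Keras versions name these keys 'acc'/'val_acc';
# TensorFlow 2.x uses 'accuracy'/'val_accuracy' instead
acc = history_dict['acc']
val_acc = history_dict['val_acc']
loss = history_dict['loss']
val_loss = history_dict['val_loss']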

epochs = range(1, len(acc) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

plt.clf()  # clear figure

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()


The final result is shown in the figures below:

![training_and_validation_loss](images/imdb_training_and_validation_loss.png)
![training_and_validation_accuracy](images/imdb_training_and_validation_accuracy.png)
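
One simple thing to read off such curves is the epoch at which validation loss is lowest; a validation loss that rises again afterwards is usually taken as a sign of overfitting. The sketch below uses made-up numbers, and `best_epoch` is a hypothetical helper. In practice the list would come from `history.history['val_loss']`.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1


# Made-up numbers for illustration only, not results from this model.
toy_val_loss = [0.69, 0.52, 0.38, 0.31, 0.30, 0.32, 0.35]
epoch = best_epoch(toy_val_loss)
print(epoch)  # 5
```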