# 1. Loading the IMDB data

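The IMDB dataset contains 50,000 reviews, split into 25,000 for training and 25,000 for testing. Both the training set and the test set consist of 50% positive and 50% negative reviews.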

In [3]:
# Note: numpy==1.16.2 is currently required; imdb.load_data raises an error under 1.16.3
# Reference: https://blog.csdn.net/weixin_42096901/article/details/89855804
from keras.datasets import imdb
# num_words=10000 keeps only the 10000 most frequent words in the training set; rarer words are dropped
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

In [20]:
print('the first review in train dataset: \n', train_data[0])
print('label:', train_labels[0])
print('shape:', train_data.shape)

the first review in train dataset: 
 [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
label: 1
shape: (25000,)

In [5]:
print('the max index:', max([max(sequence) for sequence in train_data]))

the max index: 9999
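
The maximum index of 9999 follows from how `num_words` is applied. As far as I understand the `imdb.load_data` defaults (`start_char=1`, `oov_char=2`, `index_from=3`), each word's frequency rank is shifted by 3, and any resulting index at or above `num_words` is replaced by the out-of-vocabulary index 2. Below is a minimal sketch of that rule on a toy sequence, not the library's actual code; `cap_vocabulary` and the toy ranks are hypothetical.

```python
# Sketch of the index capping, assuming the Keras defaults
# start_char=1, oov_char=2, index_from=3 (hypothetical helper, toy data).
START_CHAR, OOV_CHAR, INDEX_FROM = 1, 2, 3

def cap_vocabulary(ranks, num_words):
    """Shift frequency ranks by INDEX_FROM and replace out-of-range indices by OOV_CHAR."""
    shifted = [START_CHAR] + [rank + INDEX_FROM for rank in ranks]
    return [idx if idx < num_words else OOV_CHAR for idx in shifted]

# Toy review given as frequency ranks (1 = most frequent word).
toy_ranks = [11, 19, 9996, 9997, 25000]
encoded = cap_vocabulary(toy_ranks, num_words=10000)
print(encoded)       # [1, 14, 22, 9999, 2, 2]
print(max(encoded))  # 9999
```

Under these assumptions the rare words do not vanish from the sequence; they all collapse onto index 2, which is why some `2` entries appear in the first review printed above.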


In [7]:
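# Decode the indices back into words

# word_index is a dictionary that maps words to integer indices
word_index = imdb.get_word_index()
# Swap keys and values to build reverse_word_index, which maps integer indices to words
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
# Decode the first review. Subtract 3 from each index because indices 0, 1 and 2
# are reserved for 'padding', 'start of sequence' and 'unknown'
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])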

In [15]:
print('the decoded review: \n', decoded_review)

the decoded review: 
 ? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they h
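
The `?` characters in the decoded text come from the offset of 3: the start marker (1) and the unknown marker (2) map to negative keys that are absent from the reversed dictionary. The sketch below replays the decoding on a tiny hand-built vocabulary, so it runs without downloading anything. `toy_word_index`, `toy_reverse` and `toy_review` are hypothetical names, and the ranks were picked only to mirror the first few indices shown above.

```python
# Toy vocabulary (hypothetical): word -> frequency rank, in the same
# word -> integer format that imdb.get_word_index() returns.
toy_word_index = {'this': 11, 'film': 19, 'was': 13, 'brilliant': 527}
toy_reverse = {rank: word for word, rank in toy_word_index.items()}

# Encoded indices are rank + 3; 1 marks the start of the sequence, 2 an unknown word.
toy_review = [1, 14, 22, 16, 2, 530]

# Same decoding expression as in the cell above, applied to the toy data.
decoded = ' '.join(toy_reverse.get(i - 3, '?') for i in toy_review)
print(decoded)  # ? this film was ? brilliant
```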

# 2. Preparing the data

In [18]:
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # Start from an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    # For each sequence, set the entries of results[i] at the indices it contains to 1
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

# One-hot encode the training and test data
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

In [22]:
print('train data shape: ', x_train.shape)
print('test data shape: ', x_test.shape)

train data shape:  (25000, 10000)
test data shape:  (25000, 10000)
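
To make the encoding concrete, here is a small sketch of the same assignment on toy data (`toy_sequences` and `toy_dimension` are hypothetical). It shows that a repeated index still yields a single 1, so the result records which words occur but not how often or in what order. The last lines estimate the memory of the full training matrix, assuming the default `float64` dtype of `np.zeros`.

```python
import numpy as np

# Toy input (hypothetical): two short index sequences over a vocabulary of 8 indices.
toy_sequences = [[1, 3, 3, 5], [0, 7]]
toy_dimension = 8

toy_matrix = np.zeros((len(toy_sequences), toy_dimension))
for i, sequence in enumerate(toy_sequences):
    # Integer-array indexing sets every listed column at once;
    # the repeated index 3 has no extra effect.
    toy_matrix[i, sequence] = 1.

print(toy_matrix)
# [[0. 1. 0. 1. 0. 1. 0. 0.]
#  [1. 0. 0. 0. 0. 0. 0. 1.]]

# Size of a dense (25000, 10000) float64 matrix in gigabytes.
full_size_gb = 25000 * 10000 * np.dtype('float64').itemsize / 1e9
print(full_size_gb)  # 2.0
```

At roughly 2 GB per matrix, passing a smaller dtype such as `float32` to `np.zeros` may be worth considering on machines with limited memory.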


In [25]:
print('the first vectorized review in train dataset: \n', x_train[0])
print('the first vectorized review in test dataset: \n', x_test[0])

the first vectorized review in train dataset: 
 [0. 1. 1. ... 0. 0. 0.]
the first vectorized review in test dataset: 
 [0. 1. 1. ... 0. 0. 0.]


In [27]:
# Vectorize the labels
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

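# 3. Building the neural network

The network is a stack of three fully connected (Dense) layers. For each Dense layer, two things must be chosen: the number of hidden units and the activation function.

__Role of the activation function: it adds nonlinearity to the network, which enlarges the hypothesis space.__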
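
A small numeric sketch of why the nonlinearity matters, using hand-picked toy weights (`x`, `w1` and `w2` are hypothetical, biases omitted): without an activation, two stacked layers compute exactly what a single layer with weights `w1 @ w2` computes, so depth adds nothing. Placing `relu` between them breaks that equivalence.

```python
import numpy as np

x = np.array([[1.0, -2.0]])                # one toy sample with two features
w1 = np.array([[1.0, 2.0], [1.0, -1.0]])   # first layer weights
w2 = np.array([[1.0], [1.0]])              # second layer weights

# Two linear layers in sequence versus one layer with the merged weights.
two_linear = (x @ w1) @ w2
collapsed = x @ (w1 @ w2)
print(two_linear, collapsed)  # [[3.]] [[3.]]

# relu zeroes the negative hidden value, so the result differs from the linear map.
with_relu = np.maximum(x @ w1, 0) @ w2
print(with_relu)  # [[4.]]
```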

In [28]:
from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

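
As a quick sanity check on the size of this model, the number of trainable parameters can be worked out by hand: each Dense layer has one weight per input-output pair plus one bias per output unit. The sketch below only does the arithmetic for the layer widths used above; `layer_sizes` and `param_counts` are hypothetical names.

```python
# Layer widths taken from the model above: 10000 inputs -> 16 -> 16 -> 1.
layer_sizes = [10000, 16, 16, 1]

# A Dense layer holds a weight matrix (n_in x n_out) plus one bias per output unit.
param_counts = [n_in * n_out + n_out
                for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
print(param_counts)       # [160016, 272, 17]
print(sum(param_counts))  # 160305
```

Almost all parameters sit in the first layer, a direct consequence of the 10000-dimensional input.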


In [29]:
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
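
For intuition about the `binary_crossentropy` loss, the following sketch evaluates the textbook formula, the mean of -[y·log(p) + (1-y)·log(1-p)], on a few toy predictions. It is a plain-Python illustration rather than the Keras implementation, which as far as I know also clips predictions away from 0 and 1. The function here and the toy lists are hypothetical, and the predictions must lie strictly between 0 and 1.

```python
import math

def binary_crossentropy(y_true, y_pred):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the samples (no clipping)."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)

toy_labels = [1.0, 0.0, 1.0]
confident_right = [0.9, 0.1, 0.9]   # high probability on the correct class
confident_wrong = [0.1, 0.9, 0.1]   # high probability on the wrong class

print(round(binary_crossentropy(toy_labels, confident_right), 4))  # 0.1054
print(round(binary_crossentropy(toy_labels, confident_wrong), 4))  # 2.3026
```

The loss grows sharply when the sigmoid output is confidently wrong, which makes it a natural fit for a single-unit probabilistic output.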