## 6.1 Working with text data
### 6.1.1 One-hot encoding of words and characters
**Listing 6-1** Word-level one-hot encoding (toy example)

In [None]:
import numpy as np

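# Initial data: one entry per sample (here a sample is a sentence, but it could be an entire document)
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Build an index of all tokens in the data
token_index = {}
for sample in samples:
    # Tokenize the sample with split(). In a real application, you would also
    # strip punctuation and special characters from the samples
    for word in sample.split():
        if word not in token_index:
            # Assign a unique index to each unique word. Note that index 0 is not assigned to any word
            token_index[word] = len(token_index) + 1

# Vectorize the samples, considering only the first max_length words of each sample
max_length = 10
# Store the results in `results`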
results = np.zeros(shape=(len(samples), max_length, max(token_index.values()) + 1))

for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.

In [None]:
token_index

In [None]:
results
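
The comment in the listing above notes that a real application should strip punctuation before tokenizing. A minimal sketch of one way to do that, assuming lowercasing is also wanted (the helper name `simple_tokenize` is hypothetical and not part of the original listing):

```python
import string

def simple_tokenize(text):
    """Lowercase the text, drop ASCII punctuation, and split on whitespace."""
    table = str.maketrans('', '', string.punctuation)
    return text.lower().translate(table).split()

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Same indexing scheme as above: indices start at 1, index 0 stays unused
token_index = {}
for sample in samples:
    for word in simple_tokenize(sample):
        if word not in token_index:
            token_index[word] = len(token_index) + 1

print(token_index)
```

With this tokenizer, `The` and `the` share one index and `mat.` becomes `mat`, so the index holds 9 entries instead of the 10 produced by a plain `split()`.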

**Listing 6-2** Character-level one-hot encoding (toy example)

In [None]:
import numpy as np
import string

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
# All printable ASCII characters
characters = string.printable
token_index = dict(zip(characters, range(1, len(characters) + 1)))

max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.

In [None]:
token_index.keys()

In [None]:
results.shape

**Listing 6-3** Using Keras for word-level one-hot encoding

In [None]:
from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

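# Create a tokenizer configured to consider only the 1,000 most common words
tokenizer = Tokenizer(num_words=1000)
# Build the word index
tokenizer.fit_on_texts(samples)

# Turn the strings into lists of integer indices
sequences = tokenizer.texts_to_sequences(samples)

# You can also get the one-hot binary representations directly.
# The tokenizer supports vectorization modes other than one-hot encoding as well
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

# Recover the word index that was computed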
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

In [None]:
sequences

In [None]:
one_hot_results.shape

In [None]:
word_index

**Listing 6-4** Word-level one-hot encoding with the hashing trick (toy example)

In [None]:
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

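# Store the words as vectors of size 1000. If you have close to 1000 words (or more),
# you will see many hash collisions, which decreases the accuracy of this encoding method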
dimensionality = 1000
max_length = 10

results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # Hash the word into an integer index in the range 0 to 999
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1.

In [None]:
results
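
Python salts the built-in `hash()` of strings per process unless `PYTHONHASHSEED` is set, so the indices computed above can differ from one run to the next. A sketch of a reproducible variant, assuming an MD5 digest from `hashlib` is an acceptable stand-in hash function (the helper `stable_hash_index` is hypothetical):

```python
import hashlib

def stable_hash_index(word, dimensionality=1000):
    """Map a word to a fixed index in [0, dimensionality) using an MD5 digest."""
    digest = hashlib.md5(word.encode('utf-8')).hexdigest()
    return int(digest, 16) % dimensionality

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
max_length = 10

# One list of hashed indices per sample, truncated to max_length words
indices = [[stable_hash_index(word) for word in sample.split()[:max_length]]
           for sample in samples]
print(indices)
```

The same word now maps to the same index in every run, which matters if an encoded dataset or a trained model is reused across sessions. Collisions remain possible, exactly as with `hash()`.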

### 6.1.2 Using word embeddings
**1. Learning word embeddings with the Embedding layer**

**Listing 6-5** Instantiating an Embedding layer

In [None]:
from keras.layers import Embedding

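# The Embedding layer takes at least two arguments:
# the number of possible tokens (here 1000: 1 + maximum word index) and the dimensionality of the embeddings (here 64)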
embedding_layer = Embedding(1000, 64)
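
Conceptually, an embedding layer behaves like a lookup table from integer indices to dense vectors. The NumPy sketch below illustrates only that lookup, assuming a randomly initialized table; it is not the Keras implementation, and `embedding_table` and `word_indices` are illustrative names:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 1000, 64

# Stand-in for the layer's weight matrix: one row per token index
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# A batch of 2 sequences, each with 3 integer word indices
word_indices = np.array([[4, 25, 7],
                         [9, 0, 0]])

# Integer-array indexing picks one row per index
embedded = embedding_table[word_indices]
print(embedded.shape)  # (2, 3, 64)
```

A 2D integer tensor of shape `(samples, sequence_length)` goes in, and a 3D float tensor of shape `(samples, sequence_length, embedding_dim)` comes out.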

**Listing 6-6** Loading the IMDB data for use with an Embedding layer

In [None]:
from keras.datasets import imdb
from keras import preprocessing

import numpy as np
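# Save the original np.load
np_load_old = np.load

# Temporarily patch np.load to allow pickled arrays; some older Keras versions
# call np.load without allow_pickle=True, which newer NumPy releases reject
np.load = lambda *a, **k: np_load_old(*a, allow_pickle=True, **k)

# Number of words to consider as features
max_features = 10000
# Cut off each text after this many words (all of them among the max_features most common words)
maxlen = 20

# Load the data as lists of integers
(x_train, y_train), (x_test, y_test) = imdb.load_data(
    num_words=max_features)

# Restore np.load for normal use afterwards
np.load = np_load_old

# Turn the lists of integers into a 2D integer tensor of shape (samples, maxlen)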
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
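
By default, `pad_sequences` pads and truncates at the start of each sequence (`padding='pre'`, `truncating='pre'`), so with `maxlen=20` each review keeps its last 20 word indices. A pure-Python sketch of that behavior for a single sequence (the helper `pad_pre` is hypothetical and only mimics the defaults):

```python
def pad_pre(sequence, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad shorter sequences with `value`."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

print(pad_pre([1, 2, 3], 5))              # [0, 0, 1, 2, 3]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], 5))  # [3, 4, 5, 6, 7]
```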

**Listing 6-7** Using an Embedding layer and classifier on the IMDB data

In [None]:
from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding

model = Sequential()
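# Specify the maximum input length for the Embedding layer so that the embedded inputs can be flattened later.
# After the Embedding layer, the activations have shape (samples, maxlen, 8)
model.add(Embedding(10000, 8, input_length=maxlen))

# Flatten the 3D tensor of embeddings into a 2D tensor of shape (samples, maxlen * 8)
model.add(Flatten())

# Add the classifier on top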
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()
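
The parameter counts that `model.summary()` reports for this architecture can be checked by hand. A small arithmetic sketch, assuming the values used above (10,000 tokens, 8 embedding dimensions, `maxlen = 20`):

```python
max_features, embedding_dim, maxlen = 10000, 8, 20

# Embedding: one embedding_dim-sized vector per token
embedding_params = max_features * embedding_dim

# Flatten has no parameters; it reshapes (maxlen, 8) into maxlen * 8 features
flattened_features = maxlen * embedding_dim

# Dense(1): one weight per input feature plus one bias
dense_params = flattened_features * 1 + 1

total_params = embedding_params + dense_params
print(embedding_params, dense_params, total_params)  # 80000 161 80161
```

Almost all of the parameters sit in the embedding table, which is why the vocabulary size dominates the size of this model.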

In [None]:
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

### 6.1.3 Putting it all together: from raw text to word embeddings
**1. Downloading the IMDB data as raw text**

**Listing 6-8** Processing the labels of the raw IMDB data

In [None]:
import os

imdb_dir = '/Users/niujie/.keras/datasets/aclImdb'
train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname[-4:] == '.txt':
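            # Read as UTF-8 explicitly so the result does not depend on the platform default encoding
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())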
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)

**2. Tokenizing the data**

**Listing 6-9** Tokenizing the text of the raw IMDB data

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

# Cut off reviews after 100 words
maxlen = 100
# Train on 5,000 samples
training_samples = 5000
# Validate on 10,000 samples
validation_samples = 10000
# Consider only the 10,000 most common words in the dataset
max_words = 10000

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=maxlen)

labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

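# Split the data into a training set and a validation set, but shuffle it first, because the samples
# start out ordered (all negative reviews first, then all positive reviews)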
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples : training_samples + validation_samples]
y_val = labels[training_samples : training_samples + validation_samples]
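
The shuffle above applies a single permutation to both `data` and `labels`, which keeps every sample paired with its label. A toy sketch of that idea, assuming NumPy's `default_rng` for a reproducible permutation (the arrays here are made-up stand-ins, not IMDB data):

```python
import numpy as np

# Toy data: 5 samples, with a label that can be recomputed from each row
data = np.arange(10).reshape(5, 2)
labels = (data[:, 0] >= 6).astype(int)   # [0, 0, 0, 1, 1]

rng = np.random.default_rng(42)
indices = rng.permutation(len(data))

# Index both arrays with the same permutation
data_shuffled = data[indices]
labels_shuffled = labels[indices]

# Every row still carries its original label
print(np.array_equal((data_shuffled[:, 0] >= 6).astype(int), labels_shuffled))  # True
```

Shuffling the two arrays independently would silently break this pairing.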

**4. Preprocessing the embeddings**

**Listing 6-10** Parsing the GloVe word-embeddings file

In [None]:
glove_dir = '/Users/niujie/.keras/datasets/glove.6B'

embeddings_index = {}
# Each line holds a word followed by its vector components, separated by spaces
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))
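
The parsing loop above can be tried without downloading the GloVe file. The sketch below runs the same logic on an in-memory stand-in; the two lines are made-up values with 3 dimensions, not real GloVe vectors:

```python
import io
import numpy as np

# Hypothetical miniature file in the same "word v1 v2 ..." layout
fake_glove = io.StringIO("the 0.1 -0.2 0.3\ncat 0.5 0.0 -0.7\n")

embeddings_index = {}
for line in fake_glove:
    values = line.split()
    word = values[0]
    # The remaining fields are the vector components, parsed from strings
    embeddings_index[word] = np.asarray(values[1:], dtype='float32')

print(len(embeddings_index), embeddings_index['cat'].shape)  # 2 (3,)
```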

**Listing 6-11** Preparing the GloVe word-embeddings matrix

In [None]:
embedding_dim = 100

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # Words not found in embeddings_index keep an all-zero row in the matrix
            embedding_matrix[i] = embedding_vector
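
It can be useful to know how many vocabulary words actually receive a pretrained vector. A toy sketch of the same loop with a counter added, using made-up indices and vectors (`found` is an illustrative addition, not part of the original listing):

```python
import numpy as np

max_words, embedding_dim = 5, 3

# Hypothetical tokenizer index and pretrained vectors
word_index = {'the': 1, 'cat': 2, 'zzyzx': 3, 'rare': 7}
embeddings_index = {
    'the': np.ones(embedding_dim, dtype='float32'),
    'cat': np.full(embedding_dim, 2.0, dtype='float32'),
    'rare': np.full(embedding_dim, 3.0, dtype='float32'),
}

embedding_matrix = np.zeros((max_words, embedding_dim))
found = 0
for word, i in word_index.items():
    if i < max_words:                        # 'rare' (index 7) is outside the matrix
        vector = embeddings_index.get(word)
        if vector is not None:               # 'zzyzx' has no pretrained vector
            embedding_matrix[i] = vector
            found += 1

print(found)  # 2
```

Row 0 (the unused padding index) and the row for `zzyzx` remain all zeros.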

**5. Defining a model**

**Listing 6-12** Model definition

In [None]:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()

**6. Loading the GloVe embeddings in the model**

**Listing 6-13** Loading pretrained word embeddings into the Embedding layer

In [None]:
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

**7. Training and evaluating the model**

**Listing 6-14** Training and evaluation

In [None]:
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))
model.save_weights('pre_trained_glove_model.h5')

**Listing 6-15** Plotting the results

In [None]:
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

**Listing 6-16** Training the same model without pretrained word embeddings

In [None]:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

embedding_dim = 100

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))

In [None]:
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.figure()

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.show()

**Listing 6-17** Tokenizing the data of the test set

In [None]:
test_dir = os.path.join(imdb_dir, 'test')

labels = []
texts = []

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(test_dir, label_type)
    for fname in sorted(os.listdir(dir_name)):
        if fname[-4:] == '.txt':
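            # Read as UTF-8 explicitly so the result does not depend on the platform default encoding
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())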
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)

sequences = tokenizer.texts_to_sequences(texts)
x_test = pad_sequences(sequences, maxlen=maxlen)
y_test = np.asarray(labels)

**Listing 6-18** Evaluating the model on the test set

In [None]:
model.load_weights('pre_trained_glove_model.h5')
model.evaluate(x_test, y_test)