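# Word Embeddings
There are two ways to obtain word embeddings:
- Learn them with an `Embedding` layer as part of the task itself; the layer's weights are trained in the same way as the weights of any other layer in the network
- Use pretrained word vectors: the embeddings are trained ahead of time, outside the current model, and then loaded and reused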
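## Learning with an Embedding layer
- Each word is mapped into a dense vector space. The individual dimensions of this space have no predefined, human-readable meaning
- The space is learned and therefore carries semantic information: all we can say is that words with similar meanings tend to lie close to each other after the mapping
- An `Embedding` layer behaves like a dictionary: it takes a word's integer index as input and returns the corresponding vector
- The input to an `Embedding` layer is a 2D tensor of shape (samples, sequence_length); the output is a 3D tensor of shape (samples, sequence_length, embedding_size)
- As with the hand-written one-hot encoding, two hyperparameters matter: `max_len`, the maximum number of words kept per sample (shorter samples are padded with 0, longer ones are truncated), and `max_features`, the size of the vocabulary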
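The dictionary-like lookup and the shape change described above can be sketched in plain NumPy. This is only an illustration under assumed toy sizes, not how Keras implements the layer; `embedding_table` and `word_indices` are hypothetical names, and the table holds random numbers where a real layer would hold trained weights.

```python
import numpy as np

# Hypothetical toy sizes, chosen only for illustration.
max_features = 10     # vocabulary size (number of rows in the table)
embedding_size = 4    # length of each word vector

rng = np.random.default_rng(0)
# In a real Embedding layer this table is a trainable weight matrix;
# here it is filled with random numbers.
embedding_table = rng.normal(size=(max_features, embedding_size))

# A batch of 2 samples, each a sequence of 3 word indices: shape (2, 3).
word_indices = np.array([[1, 5, 2],
                         [7, 0, 5]])

# Indexing the table with the integer array returns one row per index,
# which appends a trailing axis of size embedding_size.
vectors = embedding_table[word_indices]

print(word_indices.shape)  # (2, 3)
print(vectors.shape)       # (2, 3, 4)
```

The same index always returns the same row, so the two occurrences of index 5 above map to identical vectors.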
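## Embedding compared with one-hot
- Both produce a 3D tensor. The embedding output has shape (samples, sequence_length, embedding_size); the one-hot output has shape (samples, sequence_length, max_features)
- An embedding has to be learned, so the resulting predictions deviate somewhat from the true labels. In this experiment, which uses only 20 tokens per review, the results agreed with the actual sentiment about 70% of the time. One-hot encoding involves no learning; it is computed directly from the word indices
- An embedding is a dense matrix whose last dimension has size embedding_size. A one-hot encoding is a sparse matrix whose last dimension has size max_features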
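The comparison above can be made concrete with a small NumPy sketch. The sizes and the random `embedding_table` are assumptions for illustration only. It also shows a standard identity: multiplying a one-hot vector by the table selects the same row that a direct index lookup returns, which is why an embedding can be read as a compressed form of one-hot input followed by a linear map.

```python
import numpy as np

# Hypothetical toy sizes for illustration.
max_features, embedding_size = 6, 3
rng = np.random.default_rng(1)
embedding_table = rng.normal(size=(max_features, embedding_size))

# One sample with three word indices: shape (1, 3).
word_indices = np.array([[2, 4, 0]])

# One-hot: sparse, last axis has size max_features.
one_hot = np.eye(max_features)[word_indices]
# Embedding lookup: dense, last axis has size embedding_size.
dense = embedding_table[word_indices]
# One-hot times the table picks out the same rows as the lookup.
via_matmul = one_hot @ embedding_table

print(one_hot.shape)                   # (1, 3, 6)
print(dense.shape)                     # (1, 3, 3)
print(np.allclose(dense, via_matmul))  # True
```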

In [1]:
from keras.layers import Embedding
from keras.datasets import imdb
from keras import preprocessing
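# Keep only 20 tokens per review (pad_sequences truncates from the front by
# default, so these are the last 20), restricted to the 10,000 most frequent tokens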
max_len, max_features = 20, 10000
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(x_train.shape)
# 25,000 reviews in total; each review is already encoded as a list of integer word indices
# x_train


(25000,)


In [2]:
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=max_len)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=max_len)
# Pad or truncate every sample to an integer sequence of length 20
print(x_train.shape)
x_train

(25000, 20)


array([[  65,   16,   38, ...,   19,  178,   32],
       [  23,    4, 1690, ...,   16,  145,   95],
       [1352,   13,  191, ...,    7,  129,  113],
       ...,
       [  11, 1818, 7561, ...,    4, 3586,    2],
       [  92,  401,  728, ...,   12,    9,   23],
       [ 764,   40,    4, ...,  204,  131,    9]], dtype=int32)
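To make the padding and truncation behaviour visible on a tiny input, here is a NumPy sketch that mimics the default settings of `pad_sequences` (`padding='pre'`, `truncating='pre'`, pad value 0). `pad_pre` is a hypothetical helper written for this note, not part of Keras.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Mimic pad_sequences defaults: pad and truncate at the front."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for i, seq in enumerate(sequences):
        # Truncating 'pre' drops the earliest tokens and keeps the last maxlen.
        trimmed = list(seq)[-maxlen:]
        if trimmed:
            # Padding 'pre' leaves the fill value at the front of the row.
            out[i, maxlen - len(trimmed):] = trimmed
    return out

padded = pad_pre([[1, 2, 3, 4, 5, 6], [7, 8]], maxlen=4)
print(padded.tolist())  # [[3, 4, 5, 6], [0, 0, 7, 8]]
```

With these defaults the long sequence keeps its last four tokens, not its first four. To keep the beginning of each review instead, `pad_sequences` accepts `truncating='post'`.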

In [3]:
from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding
model = Sequential()
# The Embedding layer outputs a 3D tensor of shape (samples, maxlen, embedding_dim)
# Compared with one-hot encoding, the last dimension is embedding_dim instead of max_features,
# so the output is far more compact
model.add(Embedding(max_features, 8, input_length=max_len))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 20, 8)             80000     
                                                                 
 flatten (Flatten)           (None, 160)               0         
                                                                 
 dense (Dense)               (None, 1)                 161       
                                                                 
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________



Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
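The parameter counts in the model summary above can be checked by hand. The arithmetic below uses only the sizes from this notebook (10,000 tokens, 8 embedding dimensions, 20 tokens per review); the variable names are chosen for this note.

```python
max_features, embedding_dim, max_len = 10000, 8, 20

# Embedding: one vector of length embedding_dim per vocabulary entry.
embedding_params = max_features * embedding_dim

# Flatten: (20, 8) becomes a single vector; no trainable parameters.
flatten_units = max_len * embedding_dim

# Dense(1): one weight per flattened input plus one bias.
dense_params = flatten_units * 1 + 1

print(embedding_params)                 # 80000
print(flatten_units)                    # 160
print(dense_params)                     # 161
print(embedding_params + dense_params)  # 80161
```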


In [4]:
model.predict(x_test[:10])

array([[0.16350755],
       [0.8678318 ],
       [0.58021647],
       [0.72069895],
       [0.9952805 ],
       [0.03812778],
       [0.60809636],
       [0.21556327],
       [0.22103658],
       [0.70488   ]], dtype=float32)
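Each value above is the sigmoid output of the final `Dense` layer, which can be read as the predicted probability that the review is positive. A minimal sketch of turning these into class labels, assuming the conventional 0.5 threshold (the notebook itself does not state one):

```python
import numpy as np

# Predicted probabilities copied from the output above.
probs = np.array([0.16350755, 0.8678318, 0.58021647, 0.72069895, 0.9952805,
                  0.03812778, 0.60809636, 0.21556327, 0.22103658, 0.70488])

# Assumed decision rule: label 1 (positive) when the probability exceeds 0.5.
labels = (probs > 0.5).astype(int)
print(labels.tolist())  # [0, 1, 1, 1, 1, 0, 1, 0, 0, 1]
```

Comparing these labels with `y_test[:10]` would show which of the ten reviews were classified correctly.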