### Word Embedding
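Word embeddings are almost indispensable in NLP. Take English, the most widely used language, as an example: it has more than 300k words in total, and the vocabulary counted in a large-scale dataset usually exceeds 100k. If each word is represented as a one-hot vector, the representation is extremely sparse, which makes training slow and memory consumption high.  
An effective alternative is word embedding, which maps each word to its own word vector in a continuous n-dimensional space.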
![word-embedding.jpg](word-embedding.jpg)
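
As a minimal sketch of why an embedding can replace a one-hot input, the NumPy snippet below checks that multiplying a one-hot matrix by a weight table gives the same result as indexing the table directly. The table `W`, the toy ids, and the sizes are illustrative assumptions and are not taken from the `deepnotes` library.

```python
import numpy as np

vocab_size, n_dim = 10, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, n_dim))   # embedding table, one row per word

ids = np.array([3, 7, 3])                  # a toy sequence of word ids

# Dense route: build one-hot rows and multiply by the table.
one_hot = np.eye(vocab_size)[ids]          # shape (3, vocab_size)
via_matmul = one_hot @ W                   # shape (3, n_dim)

# Embedding route: index the table directly, no sparse vectors needed.
via_lookup = W[ids]                        # shape (3, n_dim)

print(np.allclose(via_matmul, via_lookup))
```

The lookup touches only `len(ids)` rows, whereas the one-hot route stores and multiplies a `(len(ids), vocab_size)` matrix that is almost entirely zeros.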
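The backward pass of the embedding lookup uses `<SelectBackward>`: because the forward pass is an indexing operation, only the selected rows of the weight receive a gradient.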
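
The gradient of an index lookup can be sketched as a scatter-add: each upstream gradient row is added to the weight row that produced it, and rows that were never selected stay at zero. The helper `embedding_backward` below is a hypothetical illustration written for this note; it is not the `Embedding.backward` method of `deepnotes`.

```python
import numpy as np

def embedding_backward(ids, dout, vocab_size):
    """Accumulate upstream gradients into the rows selected by ids.

    ids:  int array of any shape
    dout: float array, shape = ids.shape + (n_dim,)
    """
    n_dim = dout.shape[-1]
    dw = np.zeros((vocab_size, n_dim))
    # np.add.at accumulates without buffering, so repeated ids add up
    # instead of overwriting each other (dw[ids] += dout would not).
    np.add.at(dw, ids.reshape(-1), dout.reshape(-1, n_dim))
    return dw

ids = np.array([[1, 3], [3, 3]])   # id 3 appears three times, id 1 once
dout = np.ones((2, 2, 2))          # upstream gradient of all ones
dw = embedding_backward(ids, dout, vocab_size=5)
print(dw)
```

With this input, row 1 of `dw` holds ones, row 3 holds threes, and rows 0, 2 and 4 remain zero, which is the "gradient for part of the parameters" behaviour described above.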

In [1]:
import os
path = os.getcwd()
os.chdir('..')
from deepnotes import *
os.chdir(path)
import numpy as np
import matplotlib.pyplot as plt
# Use PyTorch to check the gradient of the embedding layer
import torch
import torch.nn as nn
import torch.nn.functional as F

In [2]:
embed1 = nn.Embedding(10,2)
embed2 = Embedding(10,2)
embed2.weight = embed1.weight.data.numpy()

x_train = torch.randint(0,10,(5,6))
x_train_numpy = x_train.data.numpy()
y_train = torch.randn(5,6,2)
y_train_numpy = y_train.data.numpy()

out = embed1(x_train)
loss = F.mse_loss(out,y_train,reduction='sum')
loss.backward()

out = embed2(x_train_numpy)
embed2.backward(2*(out-y_train_numpy))

print('dw:\n',embed1.weight.grad)
print('dw:\n',torch.FloatTensor(embed2._dw))

dw:
 tensor([[ 9.6908,  1.6885],
        [ 0.1929,  5.4208],
        [ 6.4563, -5.2453],
        [-2.0448, -1.6821],
        [19.7709, -0.1509],
        [-8.8800, 13.5255],
        [-2.9446, -1.8931],
        [-2.3794, -5.8645],
        [-6.3412,  2.5517],
        [ 7.4903,  6.4962]])
dw:
 tensor([[ 9.6908,  1.6885],
        [ 0.1929,  5.4208],
        [ 6.4563, -5.2453],
        [-2.0448, -1.6821],
        [19.7709, -0.1509],
        [-8.8800, 13.5255],
        [-2.9446, -1.8931],
        [-2.3794, -5.8645],
        [-6.3412,  2.5517],
        [ 7.4903,  6.4962]])


### Word2Vec
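A common way to train word embeddings from a corpus is the unsupervised Word2Vec algorithm. A corpus is easy to collect: we only need to download real documents written in human language.  
Common Word2Vec variants include CBOW and Skip-Gram. CBOW takes the words in a word's context as input and the word itself as output: given a context, the model should be able to guess the word and, roughly, its meaning. Skip-Gram does the reverse, taking the word itself as input and its context words as output: given a word, the model should predict the words likely to appear around it. We treat the two algorithms as equally effective here.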
![w2v.png](w2v.png)
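
To make the difference between the two variants concrete, the sketch below builds training pairs from a token sequence for each of them. The function names `cbow_pairs` and `skipgram_pairs` and the toy sentence are assumptions made for illustration. Only positions with a full window on both sides are used, which is one of several possible conventions.

```python
def cbow_pairs(seq, window=2):
    """Yield (context, target): the surrounding words predict the center word."""
    for i in range(window, len(seq) - window):
        context = seq[i - window:i] + seq[i + 1:i + window + 1]
        yield context, seq[i]

def skipgram_pairs(seq, window=2):
    """Yield (center, neighbor): the center word predicts each surrounding word."""
    for i in range(window, len(seq) - window):
        for j in range(i - window, i + window + 1):
            if j != i:
                yield seq[i], seq[j]

tokens = "the quick brown fox jumps".split()
cbow = list(cbow_pairs(tokens))
skipgram = list(skipgram_pairs(tokens))
print(cbow)
print(skipgram)
```

For this five-word sentence CBOW produces a single example whose target is `brown`, while Skip-Gram produces four examples that all share `brown` as input.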

In [3]:
class CBOW:
    def __init__(self, vocab_size, n_dim):
        # CBOW model with context size = 2 (two words on each side of the target)
        self.vocab_size = vocab_size
        self.n_dim = n_dim
        self.embeddings = Embedding(vocab_size, n_dim)
        self.predictor = Sequential(
            Linear(4 * n_dim, 128),
            ReLU(),
            Linear(128, vocab_size)
        )
        self.optim = Adam(0.001)
        self.predictor.apply_optim(self.optim)
        self.optim.add_module(self.embeddings)
        
    def forward(self, inputs):
        '''
        inputs: int array, shape = (batch_size, 4)
        returns: unnormalized logits, shape = (batch_size, vocab_size)
        '''
        embeds = self.embeddings(inputs)
        # (batch_size, 4, n_dim)
        embeds = embeds.reshape(-1,4*self.n_dim)
        logits = self.predictor(embeds)
        return logits
    
    def backward(self, dlogits):
        '''
        dlogits: gradient of the loss w.r.t. the logits, shape = (batch_size, vocab_size)
        '''
        dx = self.predictor.backward(dlogits)
        self.embeddings.backward(dx)
        
    def fit(self, ids, num_iters, batch_size):
        '''
        ids: 1 d array: shape = (text_size,)
        '''
        loss_func = CrossEntropyLossWithSoftMax(self.vocab_size)
        for t in range(num_iters):
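            # Sample batch_size random windows of 5 consecutive tokens
            indices = []
            for _ in range(batch_size):
                idx = np.random.randint(0,len(ids)-4)
                indices += list(range(idx,idx+5))
            x = ids[indices]
            x = x.reshape(batch_size,5)
            # The middle token is the target; the other four are the context
            y = x[:,2]
            x = x[:,[0,1,3,4]]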
            logits = self.forward(x)
            loss,dlogits = loss_func(logits,y)
            if (t+1)%100==0:
                print('iters: %d, cross entropy loss: %.4f'%(t+1,loss))
            self.embeddings.zero_grad()
            self.predictor.zero_grad()
            self.backward(dlogits)
            self.optim.step()
        return self.embeddings

In [4]:
from data_utils import *

In [5]:
corpus = Corpus()
ids = corpus.get_data('data/train.txt')
vocab_size = len(corpus.dictionary)

In [6]:
model = CBOW(vocab_size, 200)
embeddings = model.fit(ids, num_iters = 2000, batch_size = 1000)

iters: 100, cross entropy loss: 6.9052
iters: 200, cross entropy loss: 6.6338
iters: 300, cross entropy loss: 6.1137
iters: 400, cross entropy loss: 5.8990
iters: 500, cross entropy loss: 5.6250
iters: 600, cross entropy loss: 5.4762
iters: 700, cross entropy loss: 5.6873
iters: 800, cross entropy loss: 5.3851
iters: 900, cross entropy loss: 5.2231
iters: 1000, cross entropy loss: 5.2515
iters: 1100, cross entropy loss: 5.0302
iters: 1200, cross entropy loss: 5.1134
iters: 1300, cross entropy loss: 4.9689
iters: 1400, cross entropy loss: 4.9152
iters: 1500, cross entropy loss: 4.8894
iters: 1600, cross entropy loss: 4.6809
iters: 1700, cross entropy loss: 4.9487
iters: 1800, cross entropy loss: 4.7791
iters: 1900, cross entropy loss: 4.5680
iters: 2000, cross entropy loss: 4.4625


In [7]:
def get_word_vec(embeddings,word):
    idx = corpus.dictionary.word2idx[word]
    vec = embeddings(np.array([idx]))[0]
    return vec

def cosine_similarity(vec1, vec2):
    return np.sum(vec1*vec2)/(np.linalg.norm(vec1)*np.linalg.norm(vec2))

In [8]:
vec1 = get_word_vec(embeddings,'we')
vec2 = get_word_vec(embeddings,'i')
vec3 = get_word_vec(embeddings,'me')

print('we - i',cosine_similarity(vec1, vec2))
print('me - i',cosine_similarity(vec2, vec3))

we - i 0.07368085702862864
me - i -0.15907262525966892


In [9]:
vec1 = get_word_vec(embeddings,'dog')
vec2 = get_word_vec(embeddings,'cat')
vec3 = get_word_vec(embeddings,'human')

print('dog - cat',cosine_similarity(vec1, vec2))
print('human - cat',cosine_similarity(vec2, vec3))

dog - cat 0.048153883506455455
human - cat 0.0679982945061183


In [11]:
vec1 = get_word_vec(embeddings,'man')
vec2 = get_word_vec(embeddings,'woman')
vec3 = get_word_vec(embeddings,'girl')

print('man - woman',cosine_similarity(vec1, vec2))
print('girl - woman',cosine_similarity(vec2, vec3))

man - woman 0.0555224650599688
girl - woman -0.01660520614757255


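Word embeddings can be used directly for text classification. We combine the word vectors of a sentence, either with a plain average or with a tf-idf weighted sum, into a sentence vector that represents the text. Running k-nearest neighbors in the embedding space then gives a simple text classifier.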
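
The idea can be sketched in a few lines of NumPy. Everything below is illustrative: the helpers `sentence_vector` and `knn_predict`, the two-dimensional vectors in `vecs`, and the toy labels are made up for this example and are not part of the trained model above. The optional `weights` argument is where tf-idf weights could be passed.

```python
import numpy as np
from collections import Counter

def sentence_vector(word_vecs, weights=None):
    """Weighted average of word vectors; uniform weights when none are given."""
    word_vecs = np.asarray(word_vecs, dtype=float)
    return np.average(word_vecs, axis=0, weights=weights)

def knn_predict(query, train_vecs, train_labels, k=3):
    """Majority vote among the k training vectors with the highest cosine similarity."""
    train_vecs = np.asarray(train_vecs, dtype=float)
    sims = train_vecs @ query / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(query) + 1e-12)
    nearest = np.argsort(-sims)[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

# Toy 2-d "embeddings" with made-up values, purely for illustration
vecs = {"good": [1.0, 0.1], "great": [0.9, 0.2], "bad": [-1.0, 0.1],
        "awful": [-0.9, 0.3], "movie": [0.0, 1.0]}

def embed(sentence):
    return sentence_vector([vecs[w] for w in sentence.split()])

train = ["good movie", "great movie", "bad movie", "awful movie"]
labels = ["pos", "pos", "neg", "neg"]
train_vecs = [embed(s) for s in train]
prediction = knn_predict(embed("great good movie"), train_vecs, labels, k=3)
print(prediction)
```

With real embeddings, `vecs[w]` would be replaced by a lookup such as `get_word_vec(embeddings, w)`, and words missing from the vocabulary would need separate handling.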