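## Finding Synonyms and Analogies

Word vectors pretrained on large corpora can often be applied to downstream natural language processing tasks. This section demonstrates how to use such pretrained word vectors to find synonyms and analogies.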

In [1]:
import torch
import torchtext.vocab as vocab

# List the names of the pretrained word vectors that torchtext can load
vocab.pretrained_aliases.keys()

dict_keys(['charngram.100d', 'fasttext.en.300d', 'fasttext.simple.300d', 'glove.42B.300d', 'glove.840B.300d', 'glove.twitter.27B.25d', 'glove.twitter.27B.50d', 'glove.twitter.27B.100d', 'glove.twitter.27B.200d', 'glove.6B.50d', 'glove.6B.100d', 'glove.6B.200d', 'glove.6B.300d'])

In [2]:
# Keep only the GloVe variants
[key for key in vocab.pretrained_aliases.keys() if 'glove' in key]

['glove.42B.300d',
 'glove.840B.300d',
 'glove.twitter.27B.25d',
 'glove.twitter.27B.50d',
 'glove.twitter.27B.100d',
 'glove.twitter.27B.200d',
 'glove.6B.50d',
 'glove.6B.100d',
 'glove.6B.200d',
 'glove.6B.300d']

In [4]:
# Load the 50-dimensional GloVe vectors (6B-token corpus); downloaded into `cache` on first use
glove = vocab.GloVe(name='6B', dim=50, cache='../data/pretrained_glove')

../data/pretrained_glove/glove.6B.zip: 862MB [07:00, 2.05MB/s]                               
100%|█████████▉| 399999/400000 [00:09<00:00, 40933.34it/s]


In [6]:
len(glove.stoi)

400000

In [7]:
# stoi maps token -> index, itos maps index -> token, vectors holds one embedding per row
glove.stoi['beautiful'], glove.itos[3366], glove.vectors[3366]

(3366,
 'beautiful',
 tensor([ 0.5462,  1.2042, -1.1288, -0.1325,  0.9553,  0.0405, -0.4786, -0.3397,
         -0.2806,  0.7176, -0.5369, -0.0046,  0.7322,  0.1210,  0.2809, -0.0881,
          0.5973,  0.5526,  0.0566, -0.5025, -0.6320,  1.1439, -0.3105,  0.1263,
          1.3155, -0.5244, -1.5041,  1.1580,  0.6880, -0.8505,  2.3236, -0.4179,
          0.4452, -0.0192,  0.2897,  0.5326, -0.0230,  0.5896, -0.7240, -0.8522,
         -0.1776,  0.1443,  0.4066, -0.5200,  0.0908,  0.0830, -0.0220, -1.6214,
          0.3458, -0.0109]))
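
The relationship between the three attributes used above can be sketched with a toy vocabulary. This is only an illustration in plain Python: the tokens, the 2-D vectors, and the helper `lookup` are made up for the example and are not real GloVe data or part of torchtext.

```python
# Toy stand-ins for the three attributes (hypothetical data, not real GloVe values)
itos = ['the', 'cat', 'sat']                           # index -> token
stoi = {token: idx for idx, token in enumerate(itos)}  # token -> index
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]         # row i is the vector of itos[i]

def lookup(token):
    """Return (index, token recovered from the index, vector) for a token."""
    idx = stoi[token]
    return idx, itos[idx], vectors[idx]

print(lookup('cat'))  # prints: (1, 'cat', [0.3, 0.4])
```

As in the real output above, looking a token up in `stoi` and then indexing `itos` with the result returns the original token.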

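### 1 Finding Synonyms
We find synonyms by computing the cosine similarity between the query word's vector and every word vector in the vocabulary, then keeping the top k.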
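
For reference, the cosine similarity between a query vector $\vec{x}$ and the $i$-th word vector $\vec{w}_i$ is usually defined as follows (the symbols $\vec{x}$ and $\vec{w}_i$ are notation introduced here for illustration):

$$\cos(\vec{w}_i, \vec{x}) = \frac{\vec{w}_i^\top \vec{x}}{\|\vec{w}_i\| \, \|\vec{x}\|}$$

The value lies in $[-1, 1]$, and a larger value means the two vectors point in more similar directions.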

In [9]:
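def knn(W, x, k):
    # Cosine similarity between x and every row of W; 1e-9 guards against zero-norm rows
    cos = torch.matmul(W, x.view((-1,))) / (
        (torch.sum(W * W, dim=1) + 1e-9).sqrt() * torch.sum(x * x).sqrt())
    _, top_k = torch.topk(cos, k)  # indices of the k largest similarities
    top_k = top_k.cpu().numpy()
    return top_k, [cos[i].item() for i in top_k]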

In [10]:
def get_similar_tokens(query_token, k, embed):
    # Request k + 1 neighbors because the nearest one is the query token itself
    top_k, cos = knn(embed.vectors, embed.vectors[embed.stoi[query_token]], k + 1)
    for i, c in zip(top_k[1:], cos[1:]):  # skip the query token
        print('cosine sim=%.3f: %s' % (c, (embed.itos[i])))

In [17]:
get_similar_tokens('chip', 3, glove)

cosine sim=0.856: chips
cosine sim=0.749: intel
cosine sim=0.749: electronics


In [12]:
get_similar_tokens('beautiful', 3, glove)

cosine sim=0.921: lovely
cosine sim=0.893: gorgeous
cosine sim=0.830: wonderful


In [13]:
get_similar_tokens('study', 3, glove)

cosine sim=0.949: studies
cosine sim=0.888: researchers
cosine sim=0.883: research


In [15]:
get_similar_tokens('work', 3, glove)

cosine sim=0.916: working
cosine sim=0.891: done
cosine sim=0.878: well
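
The top-k cosine-similarity search used in this section can be checked by hand on a tiny example. The following is a minimal sketch in plain Python rather than torch; the helpers `cosine` and `top_k_similar` and the 2-D vectors are hypothetical and exist only for this illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-9)  # small constant avoids division by zero

def top_k_similar(rows, query, k):
    """Return the indices of the k rows most similar to query, best first."""
    sims = [cosine(row, query) for row in rows]
    return sorted(range(len(rows)), key=lambda i: sims[i], reverse=True)[:k]

# Hypothetical 2-D "embeddings": rows 0 and 1 point in nearly the same direction
rows = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(top_k_similar(rows, rows[0], 2))  # prints: [0, 1]
```

The query vector is always its own nearest neighbor (index 0 here), which is why a synonym search has to ask for one extra neighbor and drop the first result.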


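### 2 Finding Analogies
We can also use pretrained word vectors to find analogies between words. For example, “man” : “woman” :: “son” : “daughter” is an analogy: “man” is to “woman” as “son” is to “daughter”. The analogy task can be stated as follows: for four words in the relation a : b :: c : d, given the first three words a, b, and c, find d. Let $\vec{w}$ denote the vector of word w. To find the analogy, we search for the word vector most similar to the result of $\vec{c}+\vec{b}-\vec{a}$.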
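
The vector arithmetic can be verified by hand on a toy vocabulary. The sketch below uses plain Python; the 3-D vectors are hand-picked (not GloVe values) so that the offset from “man” to “woman” equals the offset from “son” to “daughter”, and the names `emb`, `cosine`, and `toy_analogy` are hypothetical.

```python
import math

# Hypothetical hand-picked 3-D embeddings, chosen so that
# woman - man == daughter - son
emb = {
    'man':      [1.0, 0.0,  1.0],
    'woman':    [0.0, 1.0,  1.0],
    'son':      [1.0, 0.0, -1.0],
    'daughter': [0.0, 1.0, -1.0],
}

def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def toy_analogy(a, b, c):
    """Return the word whose vector is most similar to vec(c) + vec(b) - vec(a)."""
    target = [vc + vb - va for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    return max(emb, key=lambda w: cosine(emb[w], target))

print(toy_analogy('man', 'woman', 'son'))  # prints: daughter
```

Here the target vector is `[0.0, 1.0, -1.0]`, which coincides exactly with the toy vector for “daughter”. With real embeddings the match is only approximate, so the nearest neighbor is taken instead.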

In [18]:
def get_analogy(token_a, token_b, token_c, embed):
    vecs = [embed.vectors[embed.stoi[t]] for t in [token_a, token_b, token_c]]
    x = vecs[2] + vecs[1] - vecs[0]  # c + b - a
    # Nearest word vector to x; note that the input tokens are not excluded from the search
    top_k, cos = knn(embed.vectors, x, 1)
    return embed.itos[top_k[0]]

In [19]:
get_analogy('man', 'woman', 'son', glove)

'daughter'

In [20]:
get_analogy('beijing', 'china', 'tokyo', glove)

'japan'

In [21]:
get_analogy('do', 'did', 'go', glove)

'went'

In [24]:
get_analogy('bad', 'worst', 'big', glove)

'biggest'