>Original author: Liu Huanyong (刘焕勇)
>
>Personal blog: https://liuhuanyong.github.io/
>
>Original article: https://mp.weixin.qq.com/s/fkgk8l_Vd4YDU_K6G54F4Q
>
>WeChat official account: 老刘说NLP


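word2vec and GloVe are static word-embedding models: each word gets a single fixed vector. A word's meaning shifts with context, though, so its vector should ideally shift too. BERT, by contrast, is a dynamic (contextual) technique: it produces word and sentence vectors that vary with the surrounding text.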
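To make the static/dynamic distinction concrete, here is a toy sketch in plain Python. The 2-d vectors, the names `STATIC_TABLE`, `static_vector`, and `contextual_vector`, and the "mix with the sentence mean" rule are all assumptions made for illustration. It is not how word2vec or BERT actually compute vectors; it only shows that a lookup ignores context while a context-aware function does not.

```python
# Toy 2-d vectors chosen by hand for illustration; real models learn theirs.
STATIC_TABLE = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.5, 0.5],
}


def static_vector(tokens, index):
    """Static lookup: the result depends only on the word itself."""
    return STATIC_TABLE[tokens[index]]


def contextual_vector(tokens, index):
    """Toy contextual vector: mix the word's own vector with the sentence mean.

    This is a stand-in for the idea of context sensitivity, not BERT's
    self-attention.
    """
    vectors = [STATIC_TABLE[t] for t in tokens]
    dim = len(vectors[0])
    sentence_mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    own = vectors[index]
    return [0.5 * own[d] + 0.5 * sentence_mean[d] for d in range(dim)]


sent_a = ["river", "bank"]
sent_b = ["money", "bank"]

# Static: "bank" gets the same vector in both sentences.
print(static_vector(sent_a, 1), static_vector(sent_b, 1))
# Contextual: "bank" gets a different vector in each sentence.
print(contextual_vector(sent_a, 1), contextual_vector(sent_b, 1))
```

In the toy, "bank" next to "river" drifts toward the "river" direction and "bank" next to "money" drifts toward "money", which is the behaviour a contextual model is meant to capture.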


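HuggingFace hosts easy-to-use datasets and a large collection of pretrained language models. With the transformers and sentence-transformers libraries (both BERT-related), we can load these pretrained models and apply them to a range of text-analysis problems.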

HuggingFace website: https://huggingface.co/



<br>


## transformers

This section computes dynamic word and sentence vectors with bert-base-uncased as the pretrained model; it is widely used and thoroughly documented.

https://huggingface.co/bert-base-uncased

![](img/bert-base-uncased-1.png)
![](img/bert-base-uncased-2.png)

In [15]:
from transformers import BertTokenizer, BertModel
import torch

# Download the pretrained tokenizer and model weights
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [16]:
# Sample sentence: "Only when the tide goes out do you learn who wasn't wearing trousers"
text = "等潮水退去，就知道谁没穿裤子"
text_dict = tokenizer.encode_plus(text, 
                                  add_special_tokens=True, 
                                  return_attention_mask=True)
text_dict

{'input_ids': [101, 100, 100, 1893, 100, 100, 1989, 100, 100, 1957, 100, 100, 100, 100, 1816, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In these ids, 101 is `[CLS]`, 102 is `[SEP]`, and 100 is `[UNK]`. bert-base-uncased has an English-oriented vocabulary, so most of the Chinese characters in this sentence fall outside it and are mapped to `[UNK]`. For Chinese text, a Chinese or multilingual checkpoint (for example bert-base-chinese) tokenizes more faithfully.

In [17]:
input_ids = torch.tensor(text_dict['input_ids']).unsqueeze(0)
input_ids

tensor([[ 101,  100,  100, 1893,  100,  100, 1989,  100,  100, 1957,  100,  100,
          100,  100, 1816,  102]])

In [18]:
token_type_ids = torch.tensor(text_dict['token_type_ids']).unsqueeze(0)
token_type_ids 

tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [19]:
attention_mask = torch.tensor(text_dict['attention_mask']).unsqueeze(0)
attention_mask

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

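BERT takes three inputs:
- input_ids
- token_type_ids
- attention_mask

All three are produced by the single `encode_plus` call shown earlier; the next cell passes them to the model.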
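To show what each of the three inputs encodes, here is a plain-Python sketch that builds them by hand for a sentence pair with padding. The helper `build_inputs`, the vocabulary `TOY_VOCAB`, and the word ids 7, 8, and 9 are hypothetical. The special-token ids follow the bert-base-uncased vocabulary (0 = `[PAD]`, 100 = `[UNK]`, 101 = `[CLS]`, 102 = `[SEP]`). It does not reproduce the real tokenizer, which also performs WordPiece splitting.

```python
# Special-token ids as in bert-base-uncased; the word ids below are made up.
PAD, UNK, CLS, SEP = 0, 100, 101, 102
TOY_VOCAB = {"hello": 7, "world": 8, "again": 9}


def build_inputs(first, second=None, max_length=8):
    """Return input_ids, token_type_ids and attention_mask as plain lists."""
    # [CLS] first-segment tokens [SEP]; out-of-vocabulary words become [UNK].
    ids = [CLS] + [TOY_VOCAB.get(w, UNK) for w in first] + [SEP]
    types = [0] * len(ids)  # segment 0 for the first sentence
    if second is not None:
        second_ids = [TOY_VOCAB.get(w, UNK) for w in second] + [SEP]
        ids += second_ids
        types += [1] * len(second_ids)  # segment 1 for the second sentence
    mask = [1] * len(ids)  # 1 marks real tokens
    padding = max_length - len(ids)
    if padding < 0:
        raise ValueError("sequence longer than max_length")
    return {
        "input_ids": ids + [PAD] * padding,
        "token_type_ids": types + [0] * padding,
        "attention_mask": mask + [0] * padding,  # 0 marks padding
    }


encoded = build_inputs(["hello", "world"], ["hello", "潮水"])
print(encoded)
```

For the single unpadded sentence used in this post, token_type_ids is all zeros and attention_mask is all ones, which matches the tensors printed above.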

In [20]:
res = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
res

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.3270,  0.2295, -0.5083,  ..., -0.4196,  0.8889,  0.1733],
         [-0.0197,  0.1119, -0.0019,  ..., -0.1796,  1.1686, -0.3201],
         [-0.1917,  0.3858,  0.1898,  ..., -0.6780,  0.8482, -0.2124],
         ...,
         [-0.2072,  0.3184,  0.1925,  ..., -0.9684,  0.6707, -0.3190],
         [-0.2374, -0.2978, -0.5546,  ..., -0.2183,  1.0357, -0.7171],
         [ 0.3437,  0.2083, -0.4760,  ...,  0.0327, -0.3415, -0.3588]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-8.8259e-01, -4.5545e-01, -7.8801e-01,  7.9721e-01,  4.9784e-01,
         -2.2765e-01,  8.9097e-01,  3.0348e-01, -4.5668e-01, -9.9998e-01,
         -2.3392e-01,  8.1729e-01,  9.8620e-01,  3.2406e-01,  9.3232e-01,
         -6.6682e-01, -3.2629e-01, -6.0555e-01,  3.9752e-01, -6.0081e-01,
          7.3349e-01,  9.9986e-01,  1.8554e-01,  3.5466e-01,  5.3127e-01,
          9.0024e-01, -7.2323e-01,  9.0003e-01,  9.5168e-01,  6.062

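res is a model output object (`BaseModelOutputWithPoolingAndCrossAttentions`) that can also be indexed like a tuple. By default it holds two elements; the number varies with the config.

The first is the sequence output, last_hidden_state, with shape ``1*16*768`` (batch, tokens, hidden size). Dropping the batch dimension leaves ``16*768``: one 768-dimensional vector per token.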

In [21]:
(res[0].detach().squeeze(0)).shape

torch.Size([16, 768])

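The second is pooler_output, a 768-dimensional vector. It is not a pooled average over all token vectors: it is the final hidden state of the first token (`[CLS]`) passed through a linear layer and a tanh activation.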

In [22]:
(res[1].detach().squeeze(0)).shape

torch.Size([768])
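
As a rough sketch of what the pooler computes, the following plain-Python function applies `tanh(W · h_cls + b)` to the first token's hidden state. The function name `bert_style_pooler`, the 2-dimensional hidden size, and the identity weight matrix are assumptions for illustration; the real model uses learned 768×768 weights.

```python
import math


def bert_style_pooler(hidden_states, weight, bias):
    """Apply tanh(W @ h_cls + b), where h_cls is the first token's vector."""
    h_cls = hidden_states[0]  # only the first ([CLS]) position is used
    return [
        math.tanh(sum(w * h for w, h in zip(row, h_cls)) + b)
        for row, b in zip(weight, bias)
    ]


# Three tokens with hidden size 2 (toy values).
hidden_states = [[0.5, -1.0], [2.0, 2.0], [-3.0, 0.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix, purely illustrative
bias = [0.0, 0.0]

pooled = bert_style_pooler(hidden_states, weight, bias)
print(pooled)
```

Because of the tanh, every component lies strictly between -1 and 1, which is consistent with the pooler_output values printed earlier.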

<br>


## Generating dynamic word and sentence vectors with sentence-transformers

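The sentence-transformers framework provides a simple way to compute vector representations of sentences and paragraphs (also known as sentence embeddings).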

![](img/sentence-transformer.png)
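
One common way to turn per-token vectors into a single sentence vector is masked mean pooling: average the token vectors and skip padding positions. The sketch below illustrates that idea in plain Python. The function name `mean_pool` and the toy values are assumptions, and which pooling a particular pretrained model uses depends on its configuration.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention_mask entry is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Two real tokens followed by one padding position (toy values).
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)
```

The padding row is ignored, so the result depends only on the real tokens.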

In [None]:
!pip3 install -U sentence-transformers

In [11]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('distiluse-base-multilingual-cased')

In [12]:
# Encode the same sentence in English and in Chinese
emb1 = model.encode('Natural language processing is a hard task for human')

emb2 = model.encode('自然语言处理对于人类来说是个困难的任务')
emb1

array([ 2.58186590e-02,  4.65703346e-02,  4.25276496e-02, -1.67875513e-02,
        5.56012690e-02, -3.44308838e-02, -6.53978735e-02,  1.77450478e-02,
       -3.47155109e-02,  2.86140274e-02,  2.48657260e-02,  7.94188876e-04,
        5.09755425e-02, -1.76107027e-02, -1.04308855e-02,  7.61642214e-03,
       -1.17232790e-02,  7.92307183e-02, -2.17613694e-03, -1.39151979e-03,
       -1.81538903e-03, -3.78509541e-03,  5.15571535e-02,  2.21809130e-02,
       -3.90616916e-02, -6.42164052e-02,  6.99017346e-02, -3.14403623e-02,
       -7.77501054e-03,  5.85549101e-02,  3.69212287e-03, -5.06816618e-03,
        1.64500028e-02, -1.20995415e-03, -5.03546232e-03, -4.83510531e-02,
        1.30432751e-02,  5.17136184e-03, -6.11135624e-02, -9.98734534e-02,
       -3.53461765e-02,  2.77488735e-02,  6.08363673e-02,  2.56946515e-02,
       -1.05534913e-02, -2.77093221e-02,  2.86590215e-02, -3.11025530e-02,
       -3.06264702e-02, -1.39095346e-02,  4.96107489e-02,  1.46360844e-02,
       -5.13296314e-02,  

In [13]:
emb1.shape

(512,)

In [14]:
cos_sim = util.pytorch_cos_sim(emb1, emb2)
cos_sim

tensor([[0.8960]])
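
The score above is a cosine similarity: the dot product of the two embeddings divided by the product of their lengths. A value near 1 means the two vectors point in nearly the same direction, which here suggests the multilingual model places the English sentence and its Chinese counterpart close together. Below is a minimal plain-Python sketch of the formula; the function name `cosine_similarity` and the toy vectors are illustrative, and this is not the library's implementation.

```python
import math


def cosine_similarity(a, b):
    """Return dot(a, b) / (|a| * |b|) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([3.0, 4.0], [4.0, 3.0]))   # in between
```

Cosine similarity ignores vector length, so scaling either embedding leaves the score unchanged.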