# Generating Word Vectors with BERT

Use BERT to generate sentence vectors for English and Chinese text.

In [1]:
# !pip install --upgrade ipywidgets

In [2]:
from transformers import BertModel, BertTokenizer
import torch

In [3]:
EN_BERT_PATH = './data/bert-base-uncased'
CN_BERT_PATH = './data/bert-base-chinese'

## 1. English sentence vectors

Download the model files for [bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased):

```bash
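# Install PyTorch and the Hugging Face Hub client
conda install pytorch -y
pip install -U huggingface_hub
# Optional: route downloads through a mirror if huggingface.co is slow or unreachable
export HF_ENDPOINT=https://hf-mirror.com
# Download the model files into a local directory
huggingface-cli download --resume-download bert-base-uncased --local-dir ./data/bert-base-uncased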
```

In [4]:
# Load the tokenizer and model from the local directory
tokenizer = BertTokenizer.from_pretrained(EN_BERT_PATH)
model = BertModel.from_pretrained(EN_BERT_PATH)

Some weights of the model checkpoint at ./data/bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [5]:
en_text = "china"
en_encoded_input = tokenizer(en_text, return_tensors='pt')
en_encoded_input

{'input_ids': tensor([[ 101, 2859,  102]]), 'token_type_ids': tensor([[0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1]])}

In [6]:
with torch.no_grad():
    outputs = model(**en_encoded_input)
    embeddings = outputs.last_hidden_state.mean(dim=1)

embeddings.shape

torch.Size([1, 768])
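
For a single unpadded sentence, `last_hidden_state.mean(dim=1)` averages over every token, including `[CLS]` and `[SEP]`. If several sentences are encoded together with `padding=True`, a plain mean would also average in the padding positions. The sketch below shows one common way to avoid that by weighting with the attention mask. The helper name `masked_mean_pool` and the toy tensors are assumptions made for illustration; they are not part of the original notebook.

```python
import torch


def masked_mean_pool(last_hidden_state: torch.Tensor,
                     attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors over real tokens only, ignoring padding."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden dim
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    # Zero out padded positions, then sum along the sequence dimension
    summed = (last_hidden_state * mask).sum(dim=1)
    # Number of real tokens per sentence; clamp guards against division by zero
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts


# Toy stand-in for a padded batch: 2 sentences, 3 positions, hidden size 4.
# The last position of the first sentence is padding (mask = 0).
hidden = torch.tensor([
    [[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0], [100.0, 100.0, 100.0, 100.0]],
    [[2.0, 2.0, 2.0, 2.0], [4.0, 4.0, 4.0, 4.0], [6.0, 6.0, 6.0, 6.0]],
])
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)
```

With the notebook's variables, the equivalent call would be `masked_mean_pool(outputs.last_hidden_state, en_encoded_input['attention_mask'])`; for a single sentence it should give the same result as the plain mean.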

In [7]:
embeddings

tensor([[-7.2748e-02, -6.0616e-02, -3.6487e-01, -3.6696e-02, -2.1376e-01,
         -1.2931e-01,  9.5531e-02,  1.6451e-01,  6.3211e-02, -1.1386e-01,
          1.5692e-02, -2.0598e-01, -5.4117e-02,  1.3845e-01, -5.3391e-01,
         -1.0799e-01, -2.5073e-01,  1.4953e-01,  1.6047e-02,  1.5631e-01,
          2.2903e-01, -1.0620e-01,  2.0776e-01,  1.2699e-01, -5.3219e-02,
          2.6649e-01, -1.5513e-01, -1.9328e-02, -2.1553e-01, -1.6270e-01,
         -2.0865e-01, -1.7892e-01,  2.3612e-01,  3.7936e-01,  2.1672e-01,
         -6.1220e-02,  8.7686e-02,  1.0306e-01, -3.0434e-01, -1.2954e-01,
         -3.2665e-01, -1.4788e-01,  2.1459e-02,  9.8637e-02,  5.6040e-01,
         -2.7032e-01, -7.5218e-02,  2.6385e-01, -1.7370e-01,  5.4277e-02,
         -3.2903e-01,  1.4791e-01,  1.3729e-01,  2.1489e-01,  5.0297e-02,
          4.6308e-01,  5.4485e-02, -3.3327e-01,  9.9945e-02, -9.7959e-02,
          2.2447e-01,  2.0381e-02, -2.6579e-01, -2.1536e-01,  2.8245e-01,
          6.0419e-02,  2.7709e-01,  1.

## 2. Chinese sentence vectors

Download the model files for [bert-base-chinese](https://huggingface.co/google-bert/bert-base-chinese):

```bash
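# Install PyTorch and the Hugging Face Hub client (skip if already installed)
conda install pytorch -y
pip install -U huggingface_hub
# Optional: route downloads through a mirror if huggingface.co is slow or unreachable
export HF_ENDPOINT=https://hf-mirror.com
# Download the model files into a local directory
huggingface-cli download --resume-download bert-base-chinese --local-dir ./data/bert-base-chinese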
```

In [8]:
# Load the tokenizer and model from the local directory (replaces the English ones)
tokenizer = BertTokenizer.from_pretrained(CN_BERT_PATH)
model = BertModel.from_pretrained(CN_BERT_PATH)

Some weights of the model checkpoint at ./data/bert-base-chinese were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
cn_text = "阿美莉卡"
cn_encoded_input = tokenizer(cn_text, return_tensors='pt')
cn_encoded_input

{'input_ids': tensor([[ 101, 7350, 5401, 5799, 1305,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}

In [10]:
with torch.no_grad():
    cn_outputs = model(**cn_encoded_input)
    cn_embeddings = cn_outputs.last_hidden_state.mean(dim=1)

cn_embeddings.shape

torch.Size([1, 768])
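
A typical use of sentence vectors like these is to compare two texts by cosine similarity. The sketch below uses small made-up tensors in place of real model output, so the values are illustrative only. Note the assumption that both vectors come from the same model: `bert-base-uncased` and `bert-base-chinese` were trained separately, so their vector spaces are not aligned, and comparing `embeddings` with `cn_embeddings` directly is unlikely to be meaningful.

```python
import torch
import torch.nn.functional as F

# Made-up stand-ins for two sentence vectors of shape (1, hidden_dim),
# both assumed to come from the SAME model
vec_a = torch.tensor([[1.0, 0.0, 1.0]])
vec_b = torch.tensor([[1.0, 1.0, 0.0]])

# Cosine similarity along the hidden dimension; the result has shape (1,)
similarity = F.cosine_similarity(vec_a, vec_b, dim=1)
print(similarity)
```

In the notebook, this would mean encoding a second Chinese sentence with the same tokenizer and model, pooling it the same way, and passing the two `(1, 768)` tensors to `F.cosine_similarity`.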

In [11]:
cn_embeddings

tensor([[ 4.4131e-02,  6.0813e-02, -3.2073e-02, -8.2149e-02,  7.5614e-01,
         -6.4655e-01, -4.7674e-01, -1.8938e-01, -5.8844e-02,  8.4807e-01,
         -3.1951e-01, -2.3787e-01,  2.8421e-01, -3.4114e-01,  1.5973e+00,
         -1.4003e-01,  7.1243e-01, -1.0589e+00,  1.3316e-01, -5.2681e-02,
          6.0259e-01,  4.6172e-01, -8.3582e-01, -2.8237e-01,  6.0521e-01,
         -1.0114e-02,  3.9432e-01, -1.0746e+00, -2.7550e-01, -3.6266e-02,
         -7.7592e-01,  1.6510e-01, -9.8088e-01,  2.5674e-01,  6.2955e-01,
          7.9850e-01, -3.7447e-01,  1.8022e-01, -5.0892e-01, -2.4844e-01,
          4.9330e-01,  2.8054e-01, -3.7991e-01, -5.2307e-01,  1.1925e-01,
         -4.2050e-01,  3.9872e-01, -9.3034e-02, -1.1711e-01,  1.0734e-01,
         -1.0053e+00,  8.1617e+00,  2.8432e-02,  8.6318e-01,  7.6757e-02,
          9.8966e-01,  7.6946e-01,  3.3368e-01,  5.5003e-01,  7.4619e-02,
         -2.6159e-01, -4.5547e-02,  1.1871e-01,  1.0063e+00,  3.9679e-01,
         -6.5552e-01, -3.7278e-01,  3.

Other embedding models to consider for Chinese text (of these, `all-MiniLM-L6-v2` is trained mainly on English data):

- https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
- https://huggingface.co/hfl/chinese-roberta-wwm-ext
- https://huggingface.co/uer/sbert-base-chinese-nli