# Brainstorming

A few experiments.

In [1]:
from transformers import BertModel, BertTokenizer
import torch
import numpy as np

In [2]:
EN_BERT_PATH = './data/bert-base-uncased'

## 1. Embedding stability

If the same word is embedded twice, are the two results identical?

In [3]:
# Load the tokenizer and the model
tokenizer = BertTokenizer.from_pretrained(EN_BERT_PATH)
model = BertModel.from_pretrained(EN_BERT_PATH)

Some weights of the model checkpoint at ./data/bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
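# Compute embeddings for a batch of texts
def get_embeddings(corpus):
    # padding='max_length' pads every input to the tokenizer's maximum length
    encoded_inputs = tokenizer(corpus,
                               padding='max_length',
                               truncation=True,
                               return_tensors='pt')

    with torch.no_grad():
        outputs = model(**encoded_inputs)
        # Mean over all sequence positions, [PAD] positions included
        embeddings = outputs.last_hidden_state.mean(dim=1)

    return embeddings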
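
Because `get_embeddings` averages over every position, the padding positions also enter the mean. A common alternative is to average only over real tokens, using the `attention_mask` that the tokenizer returns. The sketch below illustrates the idea on toy NumPy arrays rather than on BERT outputs; `masked_mean` is a hypothetical helper name, and the numbers are made up for illustration.

```python
import numpy as np

def masked_mean(hidden, mask):
    """Average token vectors over real tokens only.

    hidden: (batch, seq_len, dim) token vectors
    mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    hidden = np.asarray(hidden, dtype=float)
    mask = np.asarray(mask, dtype=float)[..., None]  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)             # padding contributes 0
    counts = np.clip(mask.sum(axis=1), 1.0, None)    # avoid division by zero
    return summed / counts

# Toy batch: one sequence with 2 real tokens followed by 2 padding positions.
hidden = np.array([[[1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0, 0]])

plain = hidden.mean(axis=1)         # averages the padding vectors too
masked = masked_mean(hidden, mask)  # averages the two real tokens only
print(plain, masked)
```

With short inputs padded to a long maximum length, the plain mean is dominated by padding positions, so distances between such embeddings may say less about the words themselves than the masked version would.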

In [5]:
e1 = get_embeddings('usa')
e2 = get_embeddings('usa')

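# torch.equal compares every element; all() over the nested list from
# .tolist() would only check that each row is non-empty
torch.equal(e1, e2)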

True
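
One pitfall worth keeping in mind when checking equality on 2-D results: Python's built-in `all()` applied to a nested list tests the truthiness of each row, and any non-empty row is truthy. The toy example below (plain Python, values made up) shows how such a check can return `True` even when elements differ.

```python
# A 1 x 3 comparison result, shaped like the nested list that
# (e1 == e2).tolist() produces for a single input.
nested = [[False, True, False]]

# Checks only that each row is a non-empty list.
vacuous = all(nested)

# Checks every individual element.
strict = all(x for row in nested for x in row)

print(vacuous, strict)
```

Element-wise checks such as `torch.equal`, or `torch.allclose` when a tolerance is acceptable, avoid this problem.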

## 2. What embedding differences mean

In [6]:
corpus = [
    'king',
    'queen',
    'male',
    'female',
    'man',
    'woman',
    'blue',
    'bad'
]
embeddings = get_embeddings(corpus)
c2e = {c: e for c, e in zip(corpus, embeddings)}

In [7]:
def squared_distances(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.sum((a - b) ** 2)

In [8]:
squared_distances(a=c2e['king'],
                  b=c2e['queen'])

12.46646

In [9]:
squared_distances(a=c2e['male'],
                  b=c2e['female'])

11.630802

In [10]:
squared_distances(a=c2e['man'],
                  b=c2e['woman'])

20.933697

In [11]:
squared_distances(a=c2e['blue'],
                  b=c2e['bad'])

46.82113
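
The squared distances above measure how far apart two embeddings are, but not whether two pairs differ in the same direction. One way to probe that is to compare the difference vectors themselves with cosine similarity. The sketch below uses hypothetical 3-dimensional vectors, not real BERT outputs; the `toy` values and the `cosine` helper are assumptions for illustration only.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings, chosen so the pattern is easy to see.
toy = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'queen': np.array([0.9, 0.2, 0.1]),
    'man':   np.array([0.1, 0.9, 0.3]),
    'woman': np.array([0.1, 0.4, 0.3]),
    'blue':  np.array([0.2, 0.1, 0.9]),
    'bad':   np.array([0.7, 0.1, 0.2]),
}

d_royal = toy['king'] - toy['queen']
d_person = toy['man'] - toy['woman']
d_other = toy['blue'] - toy['bad']

same_direction = cosine(d_royal, d_person)  # close to 1: parallel offsets
unrelated = cosine(d_royal, d_other)        # close to 0: orthogonal offsets
print(same_direction, unrelated)
```

Applied to the real embeddings in `c2e`, a high cosine between `king - queen` and `man - woman` would suggest a shared offset; whether that holds for these mean-pooled BERT vectors is something the notebook would have to measure.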

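## 3. Number of output embeddings

A question to consider: if the input is three English words, will the output be three embedding vectors?

Is the number of output embeddings determined by the number of input words?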

In [12]:
# Compute last_hidden_state for a batch of texts
def get_last_hidden_state(corpus):
    encoded_inputs = tokenizer(corpus,
                               truncation=True,
                               return_tensors='pt')

    with torch.no_grad():
        outputs = model(**encoded_inputs)
        last_hidden_state = outputs.last_hidden_state

    return encoded_inputs, last_hidden_state

In [13]:
corpus_1 = 'The number of embeddings corresponds to the number of tokens generated by the tokenizer.'
print('word count of corpus_1:', len(corpus_1.split(' ')))

corpus_2 = 'input a single word'
print('word count of corpus_2:', len(corpus_2.split(' ')))

corpus_3 = 'BERT will still generate an embedding for it'
print('word count of corpus_3:', len(corpus_3.split(' ')))

word count of corpus_1: 14
word count of corpus_2: 4
word count of corpus_3: 8


In [14]:
encoded_inputs, last_hidden_state = get_last_hidden_state(corpus_1)
print('Number of token:', encoded_inputs['input_ids'].shape[1])
print('Number of embedding:', last_hidden_state.shape[1])

encoded_inputs['input_ids'].shape, last_hidden_state.shape

Number of token: 22
Number of embedding: 22


(torch.Size([1, 22]), torch.Size([1, 22, 768]))

In [15]:
encoded_inputs, last_hidden_state = get_last_hidden_state(corpus_2)
print('Number of token:', encoded_inputs['input_ids'].shape[1])
print('Number of embedding:', last_hidden_state.shape[1])

encoded_inputs['input_ids'].shape, last_hidden_state.shape

Number of token: 6
Number of embedding: 6


(torch.Size([1, 6]), torch.Size([1, 6, 768]))

In [16]:
encoded_inputs, last_hidden_state = get_last_hidden_state(corpus_3)
print('Number of token:', encoded_inputs['input_ids'].shape[1])
print('Number of embedding:', last_hidden_state.shape[1])

encoded_inputs['input_ids'].shape, last_hidden_state.shape

Number of token: 12
Number of embedding: 12


(torch.Size([1, 12]), torch.Size([1, 12, 768]))

A conversation with ChatGPT:

TL;DR: the number of embeddings equals the number of tokens produced by the `tokenizer`, not the number of input words.

> **Q:** What determines the number of embeddings output by BERT? If I input a word, how many embeddings will it output?
> 
> **ChatGPT:**
> 
> The number of embeddings output by BERT depends on the length of the input and the specific tokenization process applied to it. Here's a detailed explanation:
> 
> 1. **Tokenization**: BERT uses a WordPiece tokenizer. This means that the input text is first split into tokens, which may be words or subwords. For example, the word "unhappiness" might be tokenized into ["un", "##happy", "##ness"]. Each of these tokens will then have its own embedding.
> 
> 2. **Input Length**: The number of embeddings corresponds to the number of tokens generated by the tokenizer. If your input is a single word, it will be tokenized into one or more tokens, depending on whether it is present in the tokenizer's vocabulary as a whole or needs to be broken down into subwords.
> 
> 3. **Special Tokens**: For sentences or longer text, BERT also adds special tokens like `[CLS]` at the beginning and `[SEP]` at the end. These special tokens are also considered in the count of embeddings.
> 
> ### Example
> 
> - **Single Word Input**: If you input the word "unhappiness":
>   - It might be tokenized as ["un", "##happy", "##ness"].
>   - BERT will output an embedding for each token, resulting in three embeddings in this case.
> 
> - **Sentence Input**: If you input the sentence "I am happy":
>   - It might be tokenized as ["[CLS]", "i", "am", "happy", "[SEP]"].
>   - BERT will output five embeddings, one for each token.
> 
> ### Embedding Dimensions
> 
> Regardless of the number of tokens, each token embedding from BERT has a fixed size (typically 768 dimensions for BERT base and 1024 dimensions for BERT large). Therefore, if you input a sentence and it gets tokenized into 10 tokens, you will receive 10 embeddings, each of size 768 (for BERT base).
> 
> ### Summary
> 
> - **Number of embeddings**: Determined by the number of tokens produced by the tokenizer.
> - **Embedding size**: Fixed and depends on the BERT model variant (768 for BERT base, 1024 for BERT large).
> 
> If you input a single word, the number of embeddings will be equal to the number of tokens that the word is split into by the tokenizer. For a single word that is not split further, it will output one embedding. If the word is split into subwords, it will output one embedding per subword.
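
The token counts measured in section 3 suggest that the special tokens are added for short inputs too: the 4-word input produced 6 tokens. To make the counting concrete, here is a minimal sketch of greedy longest-match-first WordPiece splitting with `[CLS]` and `[SEP]` added. The vocabulary `toy_vocab` and the helpers `wordpiece` and `encode` are hypothetical; the real `bert-base-uncased` vocabulary may split words differently, and the real tokenizer also separates punctuation, which this sketch omits.

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab):
    """Whitespace-split, WordPiece each word, add [CLS] and [SEP]."""
    tokens = ['[CLS]']
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append('[SEP]')
    return tokens

# Hypothetical vocabulary, for illustration only.
toy_vocab = {'input', 'a', 'single', 'word', 'em', '##bed', '##ding', '##s'}

sentence_tokens = encode('input a single word', toy_vocab)
word_tokens = encode('embeddings', toy_vocab)
print(len(sentence_tokens), sentence_tokens)
print(len(word_tokens), word_tokens)
```

Under this toy vocabulary, a single word that splits into four pieces yields six tokens, and therefore six embedding vectors, once the two special tokens are counted.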