# Fast Tokenizers

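Hugging Face provides two kinds of tokenizers:
- Fast tokenizers: backed by the Tokenizers library, written in Rust
- Slow tokenizers: implemented in the Transformers library, written in Python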

## Inspecting the tokenization result

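The tokenizer returns a BatchEncoding object, which is a subclass of the Python dictionary, so the tokenization result can be read just like a dict.
The is_fast attribute of either the tokenizer or the BatchEncoding object tells you which kind of tokenizer is in use.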

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
example = "hello World!"
encoding = tokenizer(example)

print(type(encoding))
print("------------")
print("tokenizer.is_fast:" , tokenizer.is_fast)
print("------------")
print("encoding.is_fast:" , encoding.is_fast)

<class 'transformers.tokenization_utils_base.BatchEncoding'>
------------
tokenizer.is_fast: True
------------
encoding.is_fast: True


With a fast tokenizer, the BatchEncoding object also provides some extra methods. For example, tokens() returns the tokens the text was split into.

In [4]:
print(encoding.tokens())

['[CLS]', 'hello', 'World', '!', '[SEP]']


## Tracking the mapping

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."

encoding = tokenizer(example)

print(encoding.tokens())

['[CLS]', 'My', 'name', 'is', 'S', '##yl', '##va', '##in', 'and', 'I', 'work', 'at', 'Hu', '##gging', 'Face', 'in', 'Brooklyn', '.', '[SEP]']


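In the example above, the token at index 5 is '##yl', which is a piece of the word 'Sylvain', so it should not be treated as a standalone unit when mapping back to the original text.
The word_ids() method returns, for every token, the index of the word it came from.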

In [6]:
print(encoding.word_ids())

[None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]


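The special tokens [CLS] and [SEP] are mapped to None; every other token is mapped to the index of its source word.
This mapping helps in many tasks. In sequence labeling, it lets you turn word-level labels into token-level labels; in masked language modeling, it enables whole word masking, where all tokens belonging to the same word are masked together.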
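
The two uses above can be sketched in plain Python. The word_ids list is copied from the output above; the word-level labels and the helper names align_labels and whole_word_groups are hypothetical, chosen only for illustration.

```python
# word_ids copied from the output above; None marks the special tokens.
word_ids = [None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]

# Hypothetical word-level labels, one per word of the example sentence
# (13 words, counting the final "." as a word).
word_labels = ["O", "O", "O", "PER", "O", "O", "O", "O", "ORG", "ORG", "O", "LOC", "O"]


def align_labels(word_ids, word_labels, special_label=None):
    """Give every token the label of its source word; special tokens get special_label."""
    return [special_label if w is None else word_labels[w] for w in word_ids]


def whole_word_groups(word_ids):
    """Map each word index to the positions of all tokens that belong to it."""
    groups = {}
    for token_idx, w in enumerate(word_ids):
        if w is not None:
            groups.setdefault(w, []).append(token_idx)
    return groups


token_labels = align_labels(word_ids, word_labels)
groups = whole_word_groups(word_ids)

print(token_labels[4:8])  # labels of 'S', '##yl', '##va', '##in'
print(groups[3])          # token positions that make up 'Sylvain'
```

For whole word masking, one would pick a word index and mask every position in groups[word_index] together, rather than masking tokens independently.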

## Sequence labeling task

The NER pipeline wraps three steps:
- Encoding the text
- Feeding the inputs to the model
- Post-processing the model output

The first two steps are the same in every pipeline; only the third step (post-processing the model output) depends on the task.

By default, the token classification pipeline loads the dbmdz/bert-large-cased-finetuned-conll03-english NER model. Printing its output directly:

In [12]:
from transformers import pipeline

token_classifier = pipeline("token-classification")

results = token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")

for result in results:
    print(result)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


{'entity': 'I-PER', 'score': 0.99938285, 'index': 4, 'word': 'S', 'start': 11, 'end': 12}
{'entity': 'I-PER', 'score': 0.99815494, 'index': 5, 'word': '##yl', 'start': 12, 'end': 14}
{'entity': 'I-PER', 'score': 0.99590707, 'index': 6, 'word': '##va', 'start': 14, 'end': 16}
{'entity': 'I-PER', 'score': 0.99923277, 'index': 7, 'word': '##in', 'start': 16, 'end': 18}
{'entity': 'I-ORG', 'score': 0.9738931, 'index': 12, 'word': 'Hu', 'start': 33, 'end': 35}
{'entity': 'I-ORG', 'score': 0.976115, 'index': 13, 'word': '##gging', 'start': 35, 'end': 40}
{'entity': 'I-ORG', 'score': 0.9887976, 'index': 14, 'word': 'Face', 'start': 41, 'end': 45}
{'entity': 'I-LOC', 'score': 0.9932106, 'index': 16, 'word': 'Brooklyn', 'start': 49, 'end': 57}


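Setting grouped_entities=True makes the pipeline merge the tokens that belong to the same entity. (In recent Transformers releases this parameter is deprecated in favor of aggregation_strategy, for example aggregation_strategy="simple".)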

In [13]:
results = token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.", grouped_entities=True)

for result in results:
    print(result)



{'entity_group': 'PER', 'score': 0.9981694, 'word': 'Sylvain', 'start': 11, 'end': 18}
{'entity_group': 'ORG', 'score': 0.9796019, 'word': 'Hugging Face', 'start': 33, 'end': 45}
{'entity_group': 'LOC', 'score': 0.9932106, 'word': 'Brooklyn', 'start': 49, 'end': 57}
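
To make the grouping step concrete, here is a simplified sketch that merges the token-level predictions printed earlier into entities. It is not the library's actual implementation: the helper name group_entities and its merging rule (same entity type, consecutive token index, no B- prefix) are assumptions for illustration. Averaging the token scores reproduces the grouped scores shown above.

```python
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Token-level predictions copied from the pipeline output above ('word' omitted).
token_preds = [
    {"entity": "I-PER", "score": 0.99938285, "index": 4, "start": 11, "end": 12},
    {"entity": "I-PER", "score": 0.99815494, "index": 5, "start": 12, "end": 14},
    {"entity": "I-PER", "score": 0.99590707, "index": 6, "start": 14, "end": 16},
    {"entity": "I-PER", "score": 0.99923277, "index": 7, "start": 16, "end": 18},
    {"entity": "I-ORG", "score": 0.9738931, "index": 12, "start": 33, "end": 35},
    {"entity": "I-ORG", "score": 0.976115, "index": 13, "start": 35, "end": 40},
    {"entity": "I-ORG", "score": 0.9887976, "index": 14, "start": 41, "end": 45},
    {"entity": "I-LOC", "score": 0.9932106, "index": 16, "start": 49, "end": 57},
]


def group_entities(text, token_preds):
    """Merge adjacent tokens of the same entity type (hypothetical helper)."""
    groups = []
    for pred in token_preds:
        entity_type = pred["entity"].split("-", 1)[-1]  # drop the B-/I- prefix
        last = groups[-1] if groups else None
        continues_last = (
            last is not None
            and last["entity_group"] == entity_type
            and pred["index"] == last["last_index"] + 1
            and not pred["entity"].startswith("B-")
        )
        if continues_last:
            last["scores"].append(pred["score"])
            last["end"] = pred["end"]
            last["last_index"] = pred["index"]
        else:
            groups.append({
                "entity_group": entity_type,
                "scores": [pred["score"]],
                "start": pred["start"],
                "end": pred["end"],
                "last_index": pred["index"],
            })
    return [
        {
            "entity_group": g["entity_group"],
            "score": sum(g["scores"]) / len(g["scores"]),  # mean of token scores
            "word": text[g["start"]:g["end"]],             # slice the original text
            "start": g["start"],
            "end": g["end"],
        }
        for g in groups
    ]


grouped = group_entities(text, token_preds)
for entity in grouped:
    print(entity)
```

Slicing the original text with the character offsets, rather than joining the token strings, is what recovers 'Hugging Face' with its space and without the '##' markers.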


## Building the model output

In [17]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."

inputs = tokenizer(example, return_tensors="pt")

outputs = model(**inputs)

print(inputs["input_ids"].shape)
print("-------")
print(outputs.logits.shape)

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


torch.Size([1, 19])
-------
torch.Size([1, 19, 9])


The model input is a sequence of 19 tokens and the output has shape 1\*19\*9: for every token, the model outputs a vector of 9 logits (a 9-way classification).

The 9 labels can be inspected through the model.config.id2label attribute.

In [21]:
print(model.config.id2label)

{0: 'O', 1: 'B-MISC', 2: 'I-MISC', 3: 'B-PER', 4: 'I-PER', 5: 'B-ORG', 6: 'I-ORG', 7: 'B-LOC', 8: 'I-LOC'}


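The labels use the IOB format: "B-XXX" marks the beginning of an entity of type XXX, "I-XXX" marks a token inside such an entity, and "O" marks a token that belongs to no entity. Note that in the pipeline output shown earlier, the first token of each entity is tagged I-XXX rather than B-XXX; this matches the IOB1 variant, in which B-XXX is only used to separate two adjacent entities of the same type.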

The model therefore recognizes 4 entity types: miscellaneous, person, organization, and location.
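
As a quick check of that count, the entity types can be derived from the label map by stripping the B-/I- prefixes. The dictionary is copied from the output above; the helper name entity_types is hypothetical.

```python
# Label map copied from model.config.id2label as printed above.
id2label = {0: 'O', 1: 'B-MISC', 2: 'I-MISC', 3: 'B-PER', 4: 'I-PER',
            5: 'B-ORG', 6: 'I-ORG', 7: 'B-LOC', 8: 'I-LOC'}


def entity_types(id2label):
    """Collect the distinct entity types, ignoring 'O' and the B-/I- prefixes."""
    types = []
    for label in id2label.values():
        if label == "O":
            continue
        _, _, name = label.partition("-")
        if name not in types:
            types.append(name)
    return types


print(entity_types(id2label))  # ['MISC', 'PER', 'ORG', 'LOC']
```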

As in text classification, the softmax function converts the logits into probabilities, and argmax gives the predicted label for each token.

In [36]:
import torch

probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()

predictions = outputs.logits.argmax(dim=-1)[0].tolist()

print(predictions)

results = []
tokens = inputs.tokens()

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != 'O':
        results.append(
            {'entity' : label, "score" : probabilities[idx][pred], "word" : tokens[idx]}
        )

for result in results:
    print(result)

[0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 6, 0, 8, 0, 0]
{'entity': 'I-PER', 'score': 0.9993828535079956, 'word': 'S'}
{'entity': 'I-PER', 'score': 0.9981549382209778, 'word': '##yl'}
{'entity': 'I-PER', 'score': 0.995907187461853, 'word': '##va'}
{'entity': 'I-PER', 'score': 0.9992327690124512, 'word': '##in'}
{'entity': 'I-ORG', 'score': 0.9738931059837341, 'word': 'Hu'}
{'entity': 'I-ORG', 'score': 0.9761149883270264, 'word': '##gging'}
{'entity': 'I-ORG', 'score': 0.9887976050376892, 'word': 'Face'}
{'entity': 'I-LOC', 'score': 0.9932106137275696, 'word': 'Brooklyn'}
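
For readers who want to see what softmax and argmax do for a single token without torch, here is a minimal pure-Python sketch. The logits vector is made up for illustration and is not taken from the model; only the label map comes from the output above.

```python
import math

# Label map copied from model.config.id2label as printed above.
id2label = {0: 'O', 1: 'B-MISC', 2: 'I-MISC', 3: 'B-PER', 4: 'I-PER',
            5: 'B-ORG', 6: 'I-ORG', 7: 'B-LOC', 8: 'I-LOC'}


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one token (9 values, one per label).
logits = [0.1, -1.2, -0.5, -0.8, 6.3, -1.0, 0.4, -1.5, -0.3]

probs = softmax(logits)
pred = max(range(len(probs)), key=probs.__getitem__)  # argmax over the 9 classes

print(id2label[pred], probs[pred])
```

The cell above does the same thing for all 19 tokens at once by applying softmax and argmax along the last dimension of outputs.logits.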
