### 1. Tokenizer: building the input

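- tokenizer and model must match: the tokenizer's output is the model's input
- Auto\*Tokenizer, AutoModel\*: generic types that resolve to the concrete class for the given checkpoint
- tokenizer: exists to serve the model input
    - len(input_ids) == len(attention_mask)
    - tokenizer(test_senteces[0]) invokes tokenizer.\_\_call\_\_, which encodes the text
    - tokenizer.encode == tokenizer.tokenize + tokenizer.convert_tokens_to_ids, plus the special tokens that encode adds at both ends ([CLS] = 101 and [SEP] = 102 for this model)
    - tokenizer.decode: maps ids back to text
    - At its core, the tokenizer works through tokenizer.vocab: a dictionary that stores the token => id mapping
        - tokenizer.special_tokens_map
    - The attention mask corresponds to the padding: 1 for real tokens, 0 for padded positions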
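
The relationship between `tokenize`, `convert_tokens_to_ids`, `encode` and `decode` can be sketched without downloading anything. The block below is a toy stand-in, not the library's implementation: the `toy_*` functions are hypothetical names, the vocabulary is a hand-picked subset using the ids that appear in this notebook's outputs, and whitespace splitting replaces the real WordPiece algorithm.

```python
# Toy vocabulary: a tiny subset of ids taken from the outputs in this notebook,
# not the real 30522-entry vocab.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "today": 2651, "is": 2003, "not": 2025, "that": 2008,
    "bad": 2919, "so": 2061, "good": 2204,
}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}


def toy_tokenize(text):
    """Lowercase and split on whitespace (the real tokenizer uses WordPiece)."""
    return text.lower().split()


def toy_convert_tokens_to_ids(tokens):
    """Look each token up in the vocab, falling back to [UNK]."""
    return [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]


def toy_encode(text):
    """tokenize + convert_tokens_to_ids, wrapped in [CLS] ... [SEP]."""
    ids = toy_convert_tokens_to_ids(toy_tokenize(text))
    return [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]


def toy_decode(ids):
    """Map ids back to tokens and join them with spaces."""
    return " ".join(ID_TO_TOKEN.get(token_id, "[UNK]") for token_id in ids)


sentence = "today is not that bad"
plain_ids = toy_convert_tokens_to_ids(toy_tokenize(sentence))  # no special tokens
full_ids = toy_encode(sentence)                                # with [CLS] / [SEP]
print(plain_ids)
print(full_ids)
print(toy_decode(full_ids))
```

The only difference between `plain_ids` and `full_ids` is the pair of special tokens at the ends, which is the same gap visible between cells 7 and 8 below.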

In [1]:
test_senteces = ['today is not that bad', 'today is so bad', 'so good']
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

In [2]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [3]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

In [4]:
batch_input = tokenizer(test_senteces, truncation=True, padding=True, return_tensors='pt')

In [5]:
batch_input

{'input_ids': tensor([[ 101, 2651, 2003, 2025, 2008, 2919,  102],
        [ 101, 2651, 2003, 2061, 2919,  102,    0],
        [ 101, 2061, 2204,  102,    0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 0, 0, 0]])}
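
How `padding=True` produces the zeros above can be illustrated in plain Python. This is a minimal sketch under stated assumptions, not the tokenizer's actual code: `pad_batch` is a hypothetical helper, it pads on the right to the longest sequence in the batch, and it uses 0 as the pad id because `[PAD]` maps to 0 for this model.

```python
PAD_ID = 0  # assumption: [PAD] maps to id 0, as it does for this checkpoint


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the batch maximum and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padded position
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded ids of the three test sentences, copied from the output above
encoded = [
    [101, 2651, 2003, 2025, 2008, 2919, 102],
    [101, 2651, 2003, 2061, 2919, 102],
    [101, 2061, 2204, 102],
]
batch = pad_batch(encoded)
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```

Every row of `input_ids` ends up the same length as its `attention_mask` row, which is why the two can be stacked into rectangular tensors.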

In [6]:
tokenizer(test_senteces[0])

{'input_ids': [101, 2651, 2003, 2025, 2008, 2919, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [7]:
tokenizer.encode(test_senteces[0])

[101, 2651, 2003, 2025, 2008, 2919, 102]

In [8]:
tokenizer.convert_tokens_to_ids(tokenizer.tokenize(test_senteces[0]))

[2651, 2003, 2025, 2008, 2919]

In [9]:
tokenizer.decode([101, 2651, 2003, 2025, 2008, 2919, 102])

'[CLS] today is not that bad [SEP]'

In [10]:
tokenizer.special_tokens_map.values()

dict_values(['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'])

In [11]:
tokenizer.convert_tokens_to_ids(list(tokenizer.special_tokens_map.values()))

[100, 102, 0, 101, 103]

### 2. Model: calling the model

In [13]:
import torch
import torch.nn.functional as F

In [14]:
model.config

DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased-finetuned-sst-2-english",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.34.1",
  "vocab_size": 30522
}

In [15]:
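with torch.no_grad():  # inference only, so skip gradient tracking
    # The tokenizer's dict unpacks into input_ids=... and attention_mask=...
    outputs = model(**batch_input)
    print(outputs)
    # Softmax over the class dimension turns logits into probabilities
    scores = F.softmax(outputs.logits, dim=1)
    print(scores)
    # Index of the most probable class for each sentence
    labels = torch.argmax(scores, dim=1)
    print(labels)
    # Map class ids to label names via the model config
    labels = [model.config.id2label[label_id] for label_id in labels.tolist()]
    print(labels)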

SequenceClassifierOutput(loss=None, logits=tensor([[-3.4620,  3.6118],
        [ 4.7508, -3.7899],
        [-4.1938,  4.5566]]), hidden_states=None, attentions=None)
tensor([[8.4632e-04, 9.9915e-01],
        [9.9980e-01, 1.9531e-04],
        [1.5837e-04, 9.9984e-01]])
tensor([1, 0, 1])
['POSITIVE', 'NEGATIVE', 'POSITIVE']
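
The call `model(**batch_input)` works because the keys of the tokenizer's output match the keyword arguments the model accepts. The sketch below shows only the unpacking mechanics in plain Python; `toy_forward` is a hypothetical stand-in for a model's forward method and simply counts the unmasked tokens in each row.

```python
def toy_forward(input_ids, attention_mask=None):
    """Hypothetical stand-in for a model's forward(): count real tokens per row."""
    if attention_mask is None:
        # With no mask given, treat every position as a real token
        attention_mask = [[1] * len(row) for row in input_ids]
    return [sum(mask_row) for mask_row in attention_mask]


# Same shape of dict the tokenizer returns, using the third test sentence
batch = {
    "input_ids": [[101, 2061, 2204, 102, 0, 0, 0]],
    "attention_mask": [[1, 1, 1, 1, 0, 0, 0]],
}

unpacked = toy_forward(**batch)  # keys become keyword arguments
explicit = toy_forward(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
)
print(unpacked, explicit)
```

A key that the function does not accept would raise a `TypeError`, which is one practical reason the tokenizer and the model need to come from the same checkpoint family.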


### 3. Parsing the output

In [16]:
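with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**batch_input)
    print(outputs)
    # Step 1: logits -> probabilities, normalized across the two classes
    scores = F.softmax(outputs.logits, dim=1)
    print(scores)
    # Step 2: probabilities -> predicted class id per sentence
    labels = torch.argmax(scores, dim=1)
    print(labels)
    # Step 3: class id -> human-readable label from model.config.id2label
    labels = [model.config.id2label[label_id] for label_id in labels.tolist()]
    print(labels)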

SequenceClassifierOutput(loss=None, logits=tensor([[-3.4620,  3.6118],
        [ 4.7508, -3.7899],
        [-4.1938,  4.5566]]), hidden_states=None, attentions=None)
tensor([[8.4632e-04, 9.9915e-01],
        [9.9980e-01, 1.9531e-04],
        [1.5837e-04, 9.9984e-01]])
tensor([1, 0, 1])
['POSITIVE', 'NEGATIVE', 'POSITIVE']
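
The three parsing steps (softmax, argmax, id-to-label lookup) can also be reproduced by hand from the logits printed above. This is a minimal pure-Python sketch rather than the PyTorch code path; the `softmax` helper is written here for illustration, and `ID2LABEL` is copied from the `id2label` entry of `model.config`.

```python
import math

# Copied from model.config.id2label shown earlier
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    row_max = max(row)
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the SequenceClassifierOutput above
logits = [
    [-3.4620, 3.6118],
    [4.7508, -3.7899],
    [-4.1938, 4.5566],
]

scores = [softmax(row) for row in logits]
# argmax: position of the largest probability in each row
label_ids = [max(range(len(row)), key=row.__getitem__) for row in scores]
labels = [ID2LABEL[label_id] for label_id in label_ids]
print(scores)
print(label_ids)
print(labels)
```

Since softmax preserves the ordering of the logits, taking the argmax of the raw logits would give the same labels; the softmax step matters only when the probabilities themselves are needed.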
