In Hugging Face, an encoder is a core component of many transformer-based models. It turns input data, such as text or images, into a rich, contextualized representation (numbers or vectors).

This representation can then be used for various downstream tasks such as classification, question answering, or as input to a decoder in sequence-to-sequence models.

## Simulation of an encoder

In [10]:
sentence = 'Hi there , This is a beautiful day .'

vocab = {
    '<SOS>' : 0,
    '<EOS>' : 1,
    'Hi' : 3,
    'there' : 4,
    'This': 5,
    'is' : 6,
    'a' : 7,
    'beautiful': 8,
    'day' : 9,
    ',' : 10,
    '.' :11
}

In English, words are split by spaces. For Chinese, a word segmenter such as jieba can be used to split words.

In [11]:
sent = '<SOS> ' + sentence + ' <EOS>'
print(sent)

<SOS> Hi there , This is a beautiful day . <EOS>


In [12]:
words = sent.split()
print(words)

['<SOS>', 'Hi', 'there', ',', 'This', 'is', 'a', 'beautiful', 'day', '.', '<EOS>']


In [13]:
[vocab[i] for i in words]

[0, 3, 4, 10, 5, 6, 7, 8, 9, 11, 1]
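
The list comprehension above raises a `KeyError` for any word that is not in `vocab`, and it only goes in one direction. A minimal sketch of a more forgiving round trip follows. It assumes a hypothetical `'<UNK>'` entry with id 2 (the id the toy vocabulary happens to leave unused), and the helper names `toy_encode` and `toy_decode` are illustrative only, not part of any library.

```python
# Toy vocabulary from the cell above, plus a hypothetical '<UNK>' entry (assumption).
vocab = {
    '<SOS>': 0, '<EOS>': 1, '<UNK>': 2,
    'Hi': 3, 'there': 4, 'This': 5, 'is': 6, 'a': 7,
    'beautiful': 8, 'day': 9, ',': 10, '.': 11,
}
# Inverse mapping used to turn ids back into words.
id_to_word = {idx: word for word, idx in vocab.items()}


def toy_encode(sentence):
    """Wrap the sentence in <SOS>/<EOS>, split on spaces, and map words to ids."""
    words = ['<SOS>'] + sentence.split() + ['<EOS>']
    # Words missing from the vocabulary fall back to the <UNK> id.
    return [vocab.get(word, vocab['<UNK>']) for word in words]


def toy_decode(ids):
    """Map ids back to words and join them with spaces."""
    return ' '.join(id_to_word[i] for i in ids)


ids = toy_encode('Hi there , This is a wonderful day .')
print(ids)
print(toy_decode(ids))
```

The word `wonderful` is not in the vocabulary, so it is encoded as the `<UNK>` id and decoded as `<UNK>`; every other word survives the round trip.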

## Use of the encoder

Models and encoders come in pairs. If you use a model, use the encoder that was built for it.

The encoder usually shares the same name as its model.

In [14]:
from transformers import BertTokenizer

In [15]:
# Turn on the VPN and run the code below to connect through the VPN network
# import os
# os.environ['https_proxy'] = 'XXX.XX.XX.XX'

In [16]:
tokenizer = BertTokenizer.from_pretrained(
    pretrained_model_name_or_path = 'bert-base-chinese',
    cache_dir = None,
    force_download = False
)

In [17]:
sents = ['我是一个小可爱。',
        '我的裤腰很高。',
        '我喜欢吃冰淇淋。',
        '我喜欢开车。']

In [18]:
out = tokenizer.encode(
    text = sents[0],
    text_pair = sents[1],
    truncation = True,
    # if the sentence is shorter than max_length, pad it up to max_length
    padding = 'max_length',
    add_special_tokens = True,
    max_length = 25,
    return_tensors = None
)
print(out)

[101, 2769, 3221, 671, 702, 2207, 1377, 4263, 511, 102, 2769, 4638, 6175, 5587, 2523, 7770, 511, 102, 0, 0, 0, 0, 0, 0, 0]


`text`: This is your main input sentence. For example, sents[0] could be "The sky is blue."

`text_pair`: This is an optional second sentence that’s paired with the first, like sents[1] = "The grass is green."

When both are provided, the tokenizer encodes them together as a single sequence and inserts special separator tokens (like [SEP]) between them.

You’d use it when your model needs to understand relationships between two sentences, such as:

- Question and answer pairs
- Premise and hypothesis in entailment tasks
- Sentiment toward a specific topic
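
The layout of an encoded pair can be sketched in plain Python, without the library. This is only an illustration under two assumptions: each Chinese character becomes one token (which is what the `bert-base-chinese` output in this notebook shows for these sentences), and the pair is laid out as `[CLS] A [SEP] B [SEP]`. The helper name `build_pair` is hypothetical.

```python
def build_pair(text, text_pair):
    """Lay out two sentences as [CLS] A [SEP] B [SEP] and assign segment ids.

    Assumption: one token per character, matching the notebook's output
    for these sentences.
    """
    first = ['[CLS]'] + list(text) + ['[SEP]']
    second = list(text_pair) + ['[SEP]']
    tokens = first + second
    # Segment ids: 0 for the first sentence (with [CLS] and its [SEP]),
    # 1 for the second sentence and the closing [SEP].
    segment_ids = [0] * len(first) + [1] * len(second)
    return tokens, segment_ids


tokens, segment_ids = build_pair('我是一个小可爱。', '我的裤腰很高。')
print(tokens)
print(segment_ids)
print(len(tokens))
```

The 18 tokens correspond to the 18 non-zero ids in the `encode` output above; the remaining 7 positions up to `max_length = 25` are padding.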

In [19]:
type(out)

list

In [20]:
# Convert the ids back to a string.
tokenizer.decode(out)

'[CLS] 我 是 一 个 小 可 爱 。 [SEP] 我 的 裤 腰 很 高 。 [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'

The decoded string shows that BERT (here `bert-base-chinese`) encodes each Chinese character as one token id.

## Encoder plus

In [21]:
out2 = tokenizer.encode_plus(
    text = sents[0],
    text_pair = sents[1],
    truncation = True,
    padding = 'max_length',
    max_length = 25,
    add_special_tokens = True,
    return_tensors = None,
    #encoder plus parameters below
    return_token_type_ids = True,
    return_attention_mask = True,
    return_special_tokens_mask = True,
    return_length = True
)

In [22]:
out2

{'input_ids': [101, 2769, 3221, 671, 702, 2207, 1377, 4263, 511, 102, 2769, 4638, 6175, 5587, 2523, 7770, 511, 102, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 'special_tokens_mask': [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 'length': 25}

In [23]:
for k, v in out2.items():
    print(k, ':', v)

input_ids : [101, 2769, 3221, 671, 702, 2207, 1377, 4263, 511, 102, 2769, 4638, 6175, 5587, 2523, 7770, 511, 102, 0, 0, 0, 0, 0, 0, 0]
token_type_ids : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
special_tokens_mask : [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
length : 25


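With `encode_plus`, several masks can be requested, so the output carries more information, and it is a dictionary instead of a list. `token_type_ids` marks which sentence each token belongs to, `special_tokens_mask` flags special tokens such as [CLS], [SEP], and [PAD], `attention_mask` flags the non-padding positions, and `length` reports the sequence length.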
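
Two of the masks printed above can be recomputed from `input_ids` alone, which makes their meaning concrete. The sketch below assumes the special-token ids 0, 101, and 102 stand for [PAD], [CLS], and [SEP], as the decoded string earlier suggests; it illustrates what the masks mean and is not how the library computes them.

```python
# input_ids copied from the encode_plus output above.
input_ids = [101, 2769, 3221, 671, 702, 2207, 1377, 4263, 511, 102,
             2769, 4638, 6175, 5587, 2523, 7770, 511, 102,
             0, 0, 0, 0, 0, 0, 0]

# Assumed special-token ids, inferred from the decoded string:
# [PAD] = 0, [CLS] = 101, [SEP] = 102.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102
SPECIAL_IDS = {PAD_ID, CLS_ID, SEP_ID}

# 1 for real tokens (including [CLS] and [SEP]), 0 for padding.
attention_mask = [0 if i == PAD_ID else 1 for i in input_ids]

# 1 for special tokens (including padding), 0 for ordinary tokens.
special_tokens_mask = [1 if i in SPECIAL_IDS else 0 for i in input_ids]

print(attention_mask)
print(special_tokens_mask)
```

Both lists match the `attention_mask` and `special_tokens_mask` rows printed above.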

## Batch encoding

In [24]:
out3 = tokenizer.batch_encode_plus(
    batch_text_or_text_pairs = [(sents[0], sents[1]), (sents[2], sents[3])], #put one pair into one tuple, all the tuples in one list
    truncation = True,
    padding = 'max_length',
    max_length = 25,
    add_special_tokens = True,
    return_tensors = None,
    #encoder plus parameters below
    return_token_type_ids = True,
    return_attention_mask = True,
    return_special_tokens_mask = True,
    return_length = True
)

In [25]:
for k, v in out3.items():
    print(k, ':', v)

input_ids : [[101, 2769, 3221, 671, 702, 2207, 1377, 4263, 511, 102, 2769, 4638, 6175, 5587, 2523, 7770, 511, 102, 0, 0, 0, 0, 0, 0, 0], [101, 2769, 1599, 3614, 1391, 1102, 3899, 3900, 511, 102, 2769, 1599, 3614, 2458, 6756, 511, 102, 0, 0, 0, 0, 0, 0, 0, 0]]
token_type_ids : [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]
special_tokens_mask : [[1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
length : [18, 17]
attention_mask : [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]


With `batch_encode_plus`, the output is again a dictionary, but each value is a list with one entry per input sentence or sentence pair.
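
The padding step behind a batch like the one above can be sketched in plain Python. The helper `pad_batch` is hypothetical and deliberately simplified: it truncates from the right without regard to special tokens or sentence pairs, which the real tokenizer handles more carefully. The two id sequences are the unpadded parts of the `input_ids` printed above.

```python
def pad_batch(sequences, max_length, pad_id=0):
    """Truncate or right-pad each id sequence to max_length and build attention masks."""
    batch = {'input_ids': [], 'attention_mask': [], 'length': []}
    for seq in sequences:
        seq = seq[:max_length]              # naive truncation (simplification)
        n_pad = max_length - len(seq)       # number of padding positions needed
        batch['input_ids'].append(seq + [pad_id] * n_pad)
        batch['attention_mask'].append([1] * len(seq) + [0] * n_pad)
        batch['length'].append(len(seq))    # length before padding
    return batch


pair_1 = [101, 2769, 3221, 671, 702, 2207, 1377, 4263, 511, 102,
          2769, 4638, 6175, 5587, 2523, 7770, 511, 102]
pair_2 = [101, 2769, 1599, 3614, 1391, 1102, 3899, 3900, 511, 102,
          2769, 1599, 3614, 2458, 6756, 511, 102]

batch = pad_batch([pair_1, pair_2], max_length=25)
for key, value in batch.items():
    print(key, ':', value)
```

Every row ends up with 25 entries, and the `length` values are the unpadded lengths, as in the batch output above.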

In [26]:
out4 = tokenizer.batch_encode_plus(
    batch_text_or_text_pairs = sents, # if there are no pairs, simply pass a list of texts
    truncation = True,
    padding = 'max_length',
    max_length = 25,
    add_special_tokens = True,
    return_tensors = None,
    #encoder plus parameters below
    return_token_type_ids = True,
    return_attention_mask = True,
    return_special_tokens_mask = True,
    return_length = True
)

In [27]:
for k, v in out4.items():
    print(k, ':', v)

input_ids : [[101, 2769, 3221, 671, 702, 2207, 1377, 4263, 511, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 2769, 4638, 6175, 5587, 2523, 7770, 511, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 2769, 1599, 3614, 1391, 1102, 3899, 3900, 511, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 2769, 1599, 3614, 2458, 6756, 511, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
token_type_ids : [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
special_tokens_mask : [[1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1

## Dictionary operation

In [28]:
vocab = tokenizer.get_vocab()

In [29]:
vocab

{'[PAD]': 0,
 '[unused1]': 1,
 '[unused2]': 2,
 '[unused3]': 3,
 '[unused4]': 4,
 '[unused5]': 5,
 '[unused6]': 6,
 '[unused7]': 7,
 '[unused8]': 8,
 '[unused9]': 9,
 '[unused10]': 10,
 '[unused11]': 11,
 '[unused12]': 12,
 '[unused13]': 13,
 '[unused14]': 14,
 '[unused15]': 15,
 '[unused16]': 16,
 '[unused17]': 17,
 '[unused18]': 18,
 '[unused19]': 19,
 '[unused20]': 20,
 '[unused21]': 21,
 '[unused22]': 22,
 '[unused23]': 23,
 '[unused24]': 24,
 '[unused25]': 25,
 '[unused26]': 26,
 '[unused27]': 27,
 '[unused28]': 28,
 '[unused29]': 29,
 '[unused30]': 30,
 '[unused31]': 31,
 '[unused32]': 32,
 '[unused33]': 33,
 '[unused34]': 34,
 '[unused35]': 35,
 '[unused36]': 36,
 '[unused37]': 37,
 '[unused38]': 38,
 '[unused39]': 39,
 '[unused40]': 40,
 '[unused41]': 41,
 '[unused42]': 42,
 '[unused43]': 43,
 '[unused44]': 44,
 '[unused45]': 45,
 '[unused46]': 46,
 '[unused47]': 47,
 '[unused48]': 48,
 '[unused49]': 49,
 '[unused50]': 50,
 '[unused51]': 51,
 '[unused52]': 52,
 '[unused53]': 53

In [30]:
len(vocab)

21128

By default, the **bert-base-chinese** vocabulary contains 21,128 tokens.

In [31]:
# bert-base-chinese treats a single Chinese character as a word
'心态' in vocab

False

In [32]:
'态' in vocab

True

In [33]:
#add new tokens
tokenizer.add_tokens(new_tokens =['心态', '正规', '权威'])

3

In [34]:
# add a special token
tokenizer.add_special_tokens({'eos_token': '[EOS]'})

1

In [35]:
for word in ['心态', '正规', '权威','[EOS]']:
    print(tokenizer.get_vocab()[word])

21128
21129
21130
21131
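
The ids printed above continue directly from the original vocabulary size (21128). A toy model of that behaviour, under the assumption that each unseen token simply receives the next free id, is sketched below. `toy_add_tokens` and `toy_vocab` are hypothetical stand-ins; the real tokenizer keeps added tokens in its own internal structures.

```python
def toy_add_tokens(vocab, new_tokens):
    """Give each unseen token the next free id and return how many were added."""
    added = 0
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)   # next free id
            added += 1
    return added


# Stand-in vocabulary with the same size as the one reported above (21128 entries).
toy_vocab = {f'tok{i}': i for i in range(21128)}

print(toy_add_tokens(toy_vocab, ['心态', '正规', '权威']))
print(toy_add_tokens(toy_vocab, ['[EOS]']))
print([toy_vocab[t] for t in ['心态', '正规', '权威', '[EOS]']])
```

Under this assumption the return values (3 and 1) and the resulting ids (21128 to 21131) line up with the notebook output.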


Use the updated vocabulary to encode.

In [39]:
out5 = tokenizer.encode(
    text = '他的心态很正规[EOS]',
    truncation = True,
    padding = 'max_length',
    max_length = 10,
    add_special_tokens = True,
    return_tensors = None,
    # encode_plus flags below: accepted here, but encode still returns only the list of input ids
    return_token_type_ids = True,
    return_attention_mask = True,
    return_special_tokens_mask = True,
    return_length = True
)

In [40]:
out5

[101, 800, 4638, 21128, 2523, 21129, 21131, 102, 0, 0]

In [42]:
out6 = tokenizer.encode_plus(
    text = '他的心态很正规[EOS]',
    truncation = True,
    padding = 'max_length',
    max_length = 10,
    add_special_tokens = True,
    return_tensors = None,
    #encoder plus parameters below
    return_token_type_ids = True,
    return_attention_mask = True,
    return_special_tokens_mask = True,
    return_length = True
)

In [43]:
out6

{'input_ids': [101, 800, 4638, 21128, 2523, 21129, 21131, 102, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'special_tokens_mask': [1, 0, 0, 0, 0, 0, 0, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 0, 0], 'length': 10}

In [44]:
tokenizer.decode(out5)

'[CLS] 他 的 心态 很 正规 [EOS] [SEP] [PAD] [PAD]'

In [46]:
tokenizer.decode(out6['input_ids'])

'[CLS] 他 的 心态 很 正规 [EOS] [SEP] [PAD] [PAD]'

## Workflow of an encoder
- defining dictionary
- preprocessing
- tokenization
- encoding
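
The four steps can be sketched end to end in plain Python. This is a toy whitespace-based pipeline under simple assumptions (the vocabulary is built from the corpus itself, with `<SOS>` and `<EOS>` as the only special tokens); the function names are illustrative and the ids differ from the hand-written vocabulary at the top of the notebook.

```python
def build_vocab(corpus, specials=('<SOS>', '<EOS>')):
    """Step 1, defining the dictionary: specials first, then words by first appearance."""
    vocab = {token: idx for idx, token in enumerate(specials)}
    for sentence in corpus:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))  # assign the next free id to new words
    return vocab


def preprocess(sentence):
    """Step 2, preprocessing: add start and end markers."""
    return '<SOS> ' + sentence + ' <EOS>'


def tokenize(text):
    """Step 3, tokenization: split on whitespace."""
    return text.split()


def encode(tokens, vocab):
    """Step 4, encoding: look up the id of each token."""
    return [vocab[token] for token in tokens]


corpus = ['Hi there , This is a beautiful day .']
vocab = build_vocab(corpus)
ids = encode(tokenize(preprocess(corpus[0])), vocab)
print(vocab)
print(ids)
```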