- date: 2022-02-22 11:17:17
- author: Jerry Su
- slug:  Mapping-Chars-and-Words-to-Tokens
- title: Mapping-Chars-and-Words-to-Tokens
- category:
- tags: NLP

[How to Convert Characters, Tokens, and Words](https://www.kaggle.com/c/feedback-prize-2021/discussion/298094)

## 1. Mapping chars to tokens.

In [3]:
from transformers import BigBirdTokenizerFast

In [4]:
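# Note: the tokenizer class is BigBirdTokenizerFast, but the checkpoint is a Longformer one.
# The outputs shown in this post come from this combination.
tokenizer = BigBirdTokenizerFast.from_pretrained('allenai/longformer-large-4096')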

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
text = "Phones\n\nModern humans today are always on their phone. "
text_encoded = tokenizer(text, return_offsets_mapping=True, max_length=512, truncation=True)
text_encoded

In [None]:
input_ids = text_encoded['input_ids']
tokens = tokenizer.convert_ids_to_tokens(input_ids)
offset_mapping = text_encoded['offset_mapping']
print(f"input_ids:\t {input_ids}, len: {len(input_ids)}")
print(f"tokens:\t\t {tokens}, len: {len(tokens)}")
print(f"offset_mapping:  {offset_mapping}, len: {len(offset_mapping)}")
text

offset_mapping: the (start, end) character positions of each token in the original text, given as a half-open interval [start, end).
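
To make the half-open convention concrete without downloading a model, here is a minimal sketch. It uses a hypothetical whitespace splitter (`toy_offsets`, which is not part of `transformers`) to produce offsets in the same `(start, end)` format. A real subword tokenizer splits the text differently, but slicing with `text[start:end]` works the same way.

```python
import re

def toy_offsets(text):
    """Return (start, end) character offsets of whitespace-separated pieces.

    Hypothetical stand-in for a tokenizer's offset_mapping; the end index
    is exclusive, so text[start:end] recovers the piece exactly.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

text = "Phones\n\nModern humans today are always on their phone. "
offsets = toy_offsets(text)
pieces = [text[start:end] for start, end in offsets]
print(offsets[:3])  # [(0, 6), (8, 14), (15, 21)]
print(pieces[:3])   # ['Phones', 'Modern', 'humans']
```

Because the end index is exclusive, `end - start` is always the length of the piece.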

In [None]:
tokens_with_offset_mapping = [text[ele[0]:ele[1]] for ele in offset_mapping]
print(f"tokens_with_offset_mapping: {tokens_with_offset_mapping}, len: {len(tokens_with_offset_mapping)}")

In [None]:
text = "Phones\n\nModern humans today are always on their phone. "
# If "Modern" is the entity, its character span is (8, 14).

In [None]:
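# Special tokens are reported with offset (0, 0) and are skipped.
# start_mapping: first character index of a token -> token index
start_mapping = {j[0]: i for i, j in enumerate(offset_mapping) if j != (0, 0)}
# end_mapping: last character index of a token (inclusive, hence the -1) -> token index
end_mapping = {j[-1] - 1: i for i, j in enumerate(offset_mapping) if j != (0, 0)}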
print(start_mapping)
print(end_mapping)

In [None]:
# test "Modern humans"
# char_end is inclusive here, to match the keys of end_mapping
char_start, char_end = 8, 20
entity_label = text[char_start:char_end + 1]
print(f"entity_label: {entity_label}")

token_start, token_end = start_mapping[char_start], end_mapping[char_end]
entity_token = tokens[token_start:token_end + 1]
print(f"entity_token: {entity_token}")
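
The dictionary lookups above raise a `KeyError` when a character span does not start or end exactly on a token boundary. One possible alternative, sketched below under the assumption that any overlap should count, is to collect every token whose offsets intersect the span. The function name `char_span_to_token_span` and the hand-written `offsets` list are illustrative only and are not real tokenizer output.

```python
def char_span_to_token_span(offsets, char_start, char_end):
    """Map a half-open character span [char_start, char_end) to an
    inclusive token span (first, last), or None if nothing overlaps.

    Tokens with start == end, such as special tokens reported as (0, 0),
    are skipped.
    """
    hits = [
        i for i, (start, end) in enumerate(offsets)
        if start < end and start < char_end and end > char_start
    ]
    return (hits[0], hits[-1]) if hits else None

# Hand-written offsets for illustration only:
# special token, "Phones", "Modern", "humans", special token.
offsets = [(0, 0), (0, 6), (8, 14), (15, 21), (0, 0)]
print(char_span_to_token_span(offsets, 8, 21))   # (2, 3)
print(char_span_to_token_span(offsets, 10, 17))  # (2, 3), partial overlap still counts
print(char_span_to_token_span(offsets, 6, 8))    # None, the span covers only the newlines
```

Note that this sketch takes a half-open character span, unlike the inclusive `char_end` used with `end_mapping`.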

## 2. Mapping words to tokens.

In [11]:
text = "Phones\n\nModern. humans. today are always on their phone. " 
print(f"text split: {text.split()}")
encoding = tokenizer(text.split(), is_split_into_words=True, truncation=True, max_length=512)
input_ids = encoding['input_ids']
print(f"input_ids: {input_ids}, len: {len(input_ids)}")
tokens = tokenizer.convert_ids_to_tokens(input_ids)
print(f"token: {tokens}, len: {len(tokens)}")
word_ids = encoding.word_ids() 
print(f"word_ids: {word_ids}, len: {len(word_ids)}")

text split: ['Phones', 'Modern.', 'humans.', 'today', 'are', 'always', 'on', 'their', 'phone.']
input_ids: [0, 48083, 39631, 4, 44734, 4, 34375, 1322, 30035, 261, 25017, 17283, 4, 2], len: 14
token: ['<s>', 'Phones', 'Modern', '.', 'humans', '.', 'today', 'are', 'always', 'on', 'their', 'phone', '.', '</s>'], len: 14
word_ids: [None, 0, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8, 8, None], len: 14


In [12]:
print(text.split())

['Phones', 'Modern.', 'humans.', 'today', 'are', 'always', 'on', 'their', 'phone.']


In [13]:
word_start, word_end = 1, 3  # half-open interval [word_start, word_end)
text.split()[word_start:word_end]

['Modern.', 'humans.']

In [14]:
token_start = word_ids.index(word_start)
token_start

2

In [16]:
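def mapping_word_to_token(word_ids, word_start, word_end):
    """Map an inclusive word span to an inclusive token span.

    Returns -1 for an index that is not found in word_ids.
    """
    token_start, token_end = -1, -1
    # first token that belongs to word_start
    for idx, word_id in enumerate(word_ids):
        if word_id == word_start:
            token_start = idx
            break
    # last token that belongs to word_end (no break, so the last match wins)
    for idx, word_id in enumerate(word_ids):
        if word_id == word_end:
            token_end = idx
    return token_start, token_end

mapping_word_to_token(word_ids, 1, 2)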

(2, 5)
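
When many spans need to be converted, scanning `word_ids` twice per span is wasteful. A possible variant, sketched here with a hypothetical helper `word_token_spans`, builds the first and last token of every word in a single pass. The `word_ids` list is copied from the output printed earlier.

```python
def word_token_spans(word_ids):
    """Return {word_index: (first_token, last_token)} with inclusive ends."""
    spans = {}
    for idx, word_id in enumerate(word_ids):
        if word_id is None:  # special tokens carry no word index
            continue
        first, _ = spans.get(word_id, (idx, idx))
        spans[word_id] = (first, idx)
    return spans

word_ids = [None, 0, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8, 8, None]
spans = word_token_spans(word_ids)

# Inclusive word span (1, 2) -> inclusive token span
token_start, token_end = spans[1][0], spans[2][1]
print(token_start, token_end)  # 2 5
```

The result agrees with `mapping_word_to_token(word_ids, 1, 2)` for this input.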

In [18]:
def mapping_token_to_word(word_ids, token_start, token_end):
    # Both token indices are inclusive; a special token maps to None.
    return word_ids[token_start], word_ids[token_end]

mapping_token_to_word(word_ids, 2, 5)

(1, 2)
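
As a quick check of the reverse direction, the sketch below recovers the actual words covered by a token span. The helper `token_span_to_words` is hypothetical; it assumes an inclusive token span and ignores special tokens, whose word index is `None`. The `words` and `word_ids` lists are copied from the outputs above.

```python
def token_span_to_words(words, word_ids, token_start, token_end):
    """Return the words covered by the inclusive token span."""
    covered = sorted(
        {w for w in word_ids[token_start:token_end + 1] if w is not None}
    )
    return [words[w] for w in covered]

words = ['Phones', 'Modern.', 'humans.', 'today', 'are', 'always', 'on', 'their', 'phone.']
word_ids = [None, 0, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8, 8, None]

print(token_span_to_words(words, word_ids, 2, 5))  # ['Modern.', 'humans.']
print(token_span_to_words(words, word_ids, 0, 1))  # ['Phones'], the leading special token is ignored
```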