In [1]:
from transformers import AutoTokenizer

# Using Hugging Face Tokenizers

## Loading a Tokenizer

In this notebook, we'll explore Hugging Face's tokenizers by using a pretrained
tokenizer. Hugging Face hosts many tokenizers that have already been trained
for specific models and tasks!

In [None]:
# Choose & download a pretrained tokenizer to use
# BERT (encoder-only) and CASED: it cares about capitalization
my_tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

## Encoding: Text to Tokens

### Tokens: String Representations

In [None]:
# Simple method for getting tokens from text
raw_text = '''Rory's shoes are magenta and so are Corey's but they aren't nearly as dark!'''
tokens = my_tokenizer.tokenize(raw_text)

print(tokens)
# Sub-word tokenization: words are split into sub-words
# In the case of the BERT tokenizer, the successive split parts are prefixed with '##'
# ['Rory', "'", 's', 'shoes', 'are', 'mage', '##nta', 'and', 'so', 'are', 'Corey', "'", 's', 'but', 'they', 'aren', "'", 't', 'nearly', 'as', 'dark', '!']

['Rory', "'", 's', 'shoes', 'are', 'mage', '##nta', 'and', 'so', 'are', 'Corey', "'", 's', 'but', 'they', 'aren', "'", 't', 'nearly', 'as', 'dark', '!']
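
To make the `'mage'`, `'##nta'` split above less mysterious, here is a minimal sketch of greedy longest-match-first sub-word splitting, which is the general idea behind WordPiece-style tokenizers. The function `wordpiece_split` and the tiny `toy_vocab` are hypothetical names made up for illustration; the real `bert-base-cased` tokenizer has a much larger vocabulary and also handles punctuation and normalization, which this sketch ignores.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one word into sub-word pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word are prefixed with '##'
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (NOT the real BERT vocabulary)
toy_vocab = {"shoes", "mage", "##nta", "dark"}

print(wordpiece_split("magenta", toy_vocab))  # ['mage', '##nta']
print(wordpiece_split("shoes", toy_vocab))    # ['shoes']
print(wordpiece_split("teal", toy_vocab))     # ['[UNK]']
```

Because the toy vocabulary has no entry for `'magenta'` but does contain `'mage'` and `'##nta'`, the word is covered by two pieces, mirroring the output above.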


In [None]:
# This method also returns special tokens depending on the pretrained tokenizer
# Special tokens are used to mark the beginning and end of a sequence, etc.
# BERT uses [CLS] for the start of a sequence and [SEP] for the end
detailed_tokens = my_tokenizer(raw_text).tokens()
print(detailed_tokens)
# ['[CLS]', 'Rory', "'", 's', 'shoes', 'are', 'mage', '##nta', 'and', 'so', 'are', 'Corey', "'", 's', 'but', 'they', 'aren', "'", 't', 'nearly', 'as', 'dark', '!', '[SEP]']

['[CLS]', 'Rory', "'", 's', 'shoes', 'are', 'mage', '##nta', 'and', 'so', 'are', 'Corey', "'", 's', 'but', 'they', 'aren', "'", 't', 'nearly', 'as', 'dark', '!', '[SEP]']


### Tokens: Integer ID Representations

In [None]:
# One way to get tokens as integer IDs
print(my_tokenizer.encode(raw_text))
# [101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102]

[101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102]


In [None]:
print(detailed_tokens)

# Tokenizer method to get the IDs if we already have the tokens as strings
detailed_ids = my_tokenizer.convert_tokens_to_ids(detailed_tokens)
print(detailed_ids)
# [101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102]

['[CLS]', 'Rory', "'", 's', 'shoes', 'are', 'mage', '##nta', 'and', 'so', 'are', 'Corey', "'", 's', 'but', 'they', 'aren', "'", 't', 'nearly', 'as', 'dark', '!', '[SEP]']
[101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102]
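
Conceptually, converting between tokens and IDs is a dictionary lookup in the tokenizer's vocabulary. The sketch below rebuilds a small slice of that mapping from the token/ID pairs printed above. The helper names `tokens_to_ids` and `ids_to_tokens` are hypothetical, and `TOY_UNK_ID = 100` is an assumption used as a stand-in for the ID of `[UNK]`; check your own tokenizer rather than relying on it.

```python
# Token/ID pairs copied from the outputs printed above
known_tokens = ['[CLS]', 'Rory', "'", 's', 'shoes', 'are', 'mage', '##nta', '[SEP]']
known_ids = [101, 14845, 112, 188, 5743, 1132, 27595, 13130, 102]

token_to_id = dict(zip(known_tokens, known_ids))
id_to_token = {i: t for t, i in token_to_id.items()}

TOY_UNK_ID = 100  # assumption: stand-in ID for the unknown token


def tokens_to_ids(tokens):
    """Look up each token's ID, falling back to the unknown ID."""
    return [token_to_id.get(t, TOY_UNK_ID) for t in tokens]


def ids_to_tokens(ids):
    """Look up each ID's token, falling back to '[UNK]'."""
    return [id_to_token.get(i, '[UNK]') for i in ids]


print(tokens_to_ids(['[CLS]', 'mage', '##nta', '[SEP]']))  # [101, 27595, 13130, 102]
print(ids_to_tokens([101, 5743, 102]))                     # ['[CLS]', 'shoes', '[SEP]']
print(tokens_to_ids(['teal']))                             # [100]
```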


Calling the tokenizer directly returns more than just the IDs. The result can
look a little complex, but the extra fields are useful when working with
tokenizers for certain tasks.

In [None]:
# Returns an object that has a few different keys available
my_tokenizer(raw_text)
# {'input_ids': [101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

{'input_ids': [101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
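
For a single sentence the `attention_mask` above is all ones, so its purpose is easy to miss. It becomes meaningful when sequences of different lengths are padded to a common length: real tokens get a `1` and padding gets a `0`. The following is a minimal pure-Python sketch of that idea; `pad_batch` is a hypothetical helper, and `pad_id=0` is an assumption about the padding ID rather than something shown in this notebook.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad ID lists to equal length and build matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two short ID sequences built from IDs that appear in the outputs above
batch = pad_batch([[101, 14845, 112, 188, 102], [101, 1843, 102]])
print(batch["input_ids"])       # [[101, 14845, 112, 188, 102], [101, 1843, 102, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```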

In [None]:
# Focus on `input_ids`, which are the IDs associated with the tokens.
print(my_tokenizer(raw_text).input_ids)
# [101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102]

[101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102]


## Decoding: Tokens to Text

Of course, we can also use the tokenizer to go the other way: from token IDs back to tokens and text!

In [None]:
# Integer IDs for tokens
ids = my_tokenizer.encode(raw_text)

# The inverse of the .encode() method: .decode()
my_tokenizer.decode(ids)
# "[CLS] Rory ' s shoes are magenta and so are Corey ' s but they aren ' t nearly as dark! [SEP]"

"[CLS] Rory ' s shoes are magenta and so are Corey ' s but they aren ' t nearly as dark! [SEP]"

In [None]:
# To ignore special tokens (depending on pretrained tokenizer)
my_tokenizer.decode(ids, skip_special_tokens=True)
# "Rory ' s shoes are magenta and so are Corey ' s but they aren ' t nearly as dark!"

"Rory ' s shoes are magenta and so are Corey ' s but they aren ' t nearly as dark!"
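
Notice that `'mage'` and `'##nta'` come back as the single word `magenta`. A simplified sketch of how sub-word tokens can be merged back into text is shown below. The helper `join_wordpieces` is hypothetical, and it deliberately skips the punctuation cleanup that the real `.decode()` performs, so it only illustrates the `##` merging and the effect of skipping special tokens.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def join_wordpieces(tokens, skip_special_tokens=False):
    """Join tokens with spaces, gluing '##' continuation pieces to the previous token."""
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    return " ".join(tokens).replace(" ##", "")


sample_tokens = ['[CLS]', 'Rory', "'", 's', 'shoes', 'are', 'mage', '##nta', '[SEP]']

print(join_wordpieces(sample_tokens))
# [CLS] Rory ' s shoes are magenta [SEP]
print(join_wordpieces(sample_tokens, skip_special_tokens=True))
# Rory ' s shoes are magenta
```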

In [None]:
# List of tokens as strings instead of one long string
my_tokenizer.convert_ids_to_tokens(ids)
# ['[CLS]', 'Rory', "'", 's', 'shoes', ...

['[CLS]',
 'Rory',
 "'",
 's',
 'shoes',
 'are',
 'mage',
 '##nta',
 'and',
 'so',
 'are',
 'Corey',
 "'",
 's',
 'but',
 'they',
 'aren',
 "'",
 't',
 'nearly',
 'as',
 'dark',
 '!',
 '[SEP]']

## A Note on the Unknown

> One thing to consider is what happens when a string is outside of the
> tokenizer's vocabulary: it gets mapped to an "unknown" token.
> 
> Unknown tokens are typically represented with `[UNK]` or
> some other similar variant.


<!--
If the tokenizer encoded the text so each character was a token (which is
actually not as easy as it sounds), then it would be impossible to have an
"unknown" token. Word-based tokenization will always be in danger of having 
"unknown" tokens since it's virtually impossible to have every possible word (
and "non-word") in its vocabulary!

And so you might think that subword tokenization wouldn't have an issue with
"unknown" tokens. And although there are fewer than word-based tokenization, it
does happen!

--------------------------------------------------------------------------------

Tokenizers are specific, so it's important to use a tokenizer that will recognize
most of the text you're working with! For example, a lot of tokenizers might not
treat emoji as tokens, yet emoji could be really important if they are especially
numerous in your data (like a corpus of chat messages)!

If you're seeing a lot of "unknown" tokens with the text you're working with,
you might consider using a different tokenizer appropriate for the task. It's also
possible to fine-tune a pretrained model or train one from scratch!

-->

In [None]:
phrase = '🥱 the dog next door kept barking all night!!'
ids = my_tokenizer.encode(phrase)
print(phrase)
print(my_tokenizer.convert_ids_to_tokens(ids)) # ['[CLS]', '[UNK]', 'the', 'dog', ...
print(my_tokenizer.decode(ids))

🥱 the dog next door kept barking all night!!
['[CLS]', '[UNK]', 'the', 'dog', 'next', 'door', 'kept', 'barking', 'all', 'night', '!', '!', '[SEP]']
[CLS] [UNK] the dog next door kept barking all night!! [SEP]


In [13]:
phrase = '''wow my dad thought mcdonalds sold tacos \N{SKULL}'''
ids = my_tokenizer.encode(phrase)
print(phrase)
print(my_tokenizer.convert_ids_to_tokens(ids))
print(my_tokenizer.decode(ids))

wow my dad thought mcdonalds sold tacos 💀
['[CLS]', 'w', '##ow', 'my', 'dad', 'thought', 'm', '##c', '##don', '##ald', '##s', 'sold', 'ta', '##cos', '[UNK]', '[SEP]']
[CLS] wow my dad thought mcdonalds sold tacos [UNK] [SEP]
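
If unknown tokens show up in your data, it helps to measure how often. Below is a minimal sketch that computes the share of `[UNK]` among the non-special tokens of a token list. `unknown_rate` is a hypothetical helper, not part of the `transformers` library, and the special-token names are assumed to be the BERT ones seen above.

```python
def unknown_rate(tokens, unk_token="[UNK]", special_tokens=("[CLS]", "[SEP]")):
    """Return the fraction of non-special tokens that are the unknown token."""
    content = [t for t in tokens if t not in special_tokens]
    if not content:
        return 0.0
    return sum(t == unk_token for t in content) / len(content)


# Tokens copied from the first emoji example's output above
example_tokens = ['[CLS]', '[UNK]', 'the', 'dog', 'next', 'door', 'kept',
                  'barking', 'all', 'night', '!', '!', '[SEP]']

rate = unknown_rate(example_tokens)
print(f"{rate:.1%} of tokens are unknown")  # 9.1% of tokens are unknown
```

A consistently high rate on your own corpus is a hint that a different tokenizer may be a better fit for the text.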
