<a href="https://colab.research.google.com/github/martin-fabbri/colab-notebooks/blob/master/subword_tokenization/subword_tokenization_sentencepiece.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install and data preparation

This example uses a small training corpus, botchan.txt.
([Botchan](https://en.wikipedia.org/wiki/Botchan) is a novel written by Natsume Sōseki in 1906; the sample is an English translation.)

In [9]:
!pip install sentencepiece -Uqq
!wget -q https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt

## Basic end-to-end example

In [21]:
import sentencepiece as spm
import tensorflow as tf

In [16]:
# Train a SentencePiece model on 'botchan.txt'; this writes 'm.model' and 'm.vocab'.
# 'm.vocab' is for reference only; it is not used for segmentation.
train_args = '--input=botchan.txt --model_prefix=m --vocab_size=2000'
spm.SentencePieceTrainer.train(train_args)

# Create a segmenter instance and load the model file (m.model).
sp = spm.SentencePieceProcessor()
sp.load('m.model')

# encode: text => id
print(sp.encode_as_pieces('This is a test'))
print(sp.encode_as_ids('This is a test'))

# decode: id => text
print(sp.decode_pieces(['▁This', '▁is', '▁a', '▁t', 'est']))
print(sp.decode_ids([212, 32, 10, 587, 446]))


['▁This', '▁is', '▁a', '▁t', 'est']
[212, 32, 10, 587, 446]
This is a test
This is a test
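
The `▁` (U+2581) prefix in the pieces above marks where a space preceded the piece in the original text, which is why decoding can restore the sentence exactly. As a rough illustration, the sketch below rebuilds the text from the pieces in plain Python. The `detokenize` helper is hypothetical and written only for this example; it is not SentencePiece's implementation and ignores normalization and special symbols.

```python
# Hypothetical helper for illustration; not part of the sentencepiece API.
WHITESPACE_MARK = "\u2581"  # the '▁' meta symbol that stands in for a space


def detokenize(pieces):
    """Join pieces and turn each '▁' back into a space."""
    text = "".join(pieces).replace(WHITESPACE_MARK, " ")
    # Drop the single leading space that comes from the first '▁'.
    return text[1:] if text.startswith(" ") else text


pieces = ["\u2581This", "\u2581is", "\u2581a", "\u2581t", "est"]
print(detokenize(pieces))
```

In practice `sp.decode_pieces` should be used; the sketch only makes the role of the `▁` marker visible.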


In [20]:
# returns vocab size
print(sp.get_piece_size())

# id <=> piece conversion
print(sp.id_to_piece(229))
print(sp.piece_to_id('▁This'))  # the prefix is '▁' (U+2581), not an ASCII underscore

# returns 0 for unknown tokens (the id for <unk> can be changed)
print(sp.piece_to_id('__MUST_BE_UNKNOWN__'))

# <unk>, <s>, </s> are defined by default. Their ids are (0, 1, 2).
# <s> and </s> are defined as 'control' symbols.
for piece_id in range(3):
  print(sp.id_to_piece(piece_id), sp.is_control(piece_id))

2000
▁W
212
0
<unk> False
<s> True
</s> True
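
The piece-to-id lookups above behave like a dictionary with a fallback to the `<unk>` id. The toy sketch below mimics that with a plain Python dict; the `toy_vocab` table and the `toy_piece_to_id` function are made up for illustration and only reuse ids printed in this notebook. It also shows that a piece written with an ASCII underscore (`_This`) is a different string from `▁This` (U+2581) and therefore falls back to `<unk>`.

```python
# Toy stand-in for the piece -> id lookup; not the sentencepiece implementation.
UNK_ID = 0

toy_vocab = {
    "<unk>": 0,
    "<s>": 1,
    "</s>": 2,
    "\u2581This": 212,  # '▁This', id taken from the encode example
}


def toy_piece_to_id(piece):
    """Return the id of a known piece, or UNK_ID for anything else."""
    return toy_vocab.get(piece, UNK_ID)


print(toy_piece_to_id("\u2581This"))           # known piece
print(toy_piece_to_id("_This"))                # ASCII underscore: unknown
print(toy_piece_to_id("__MUST_BE_UNKNOWN__"))  # unknown
```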


## Loading a model from a byte stream

A SentencePiece model file is just a serialized [protocol buffer](https://developers.google.com/protocol-buffers/). We can instantiate a SentencePiece processor from a bytes object with the **load_from_serialized_proto** method.

In [27]:
# Assumes that m.model is stored on a non-POSIX file system.
with tf.io.gfile.GFile('m.model', 'rb') as f:
  serialized_model_proto = f.read()

sp = spm.SentencePieceProcessor()
sp.load_from_serialized_proto(serialized_model_proto)

print(sp.encode_as_pieces('this is a test'))

['▁this', '▁is', '▁a', '▁t', 'est']


## User defined and control symbols

We can define special tokens (symbols) to control the DNN's behavior. Typical examples are [BERT](https://arxiv.org/abs/1810.04805)'s special symbols, e.g., [SEP] and [CLS].

There are two types of special tokens:

- **user defined symbols**: Always treated as one token in any context. These symbols can appear in the input sentence. 
- **control symbols**: We only reserve ids for these tokens. Even if these tokens appear in the input text, they are not handled as one token. The user needs to insert the ids explicitly after encoding.

For experiments, user defined symbols are easier to use, since the user can change the behavior just by modifying the input text. In a production setting, however, control symbols are preferable because they prevent users from altering the behavior by including these special symbols in their input text.
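
Because the encoder never produces control symbols, their ids have to be added by hand after encoding. A minimal sketch, assuming a BERT-style layout (`<cls>` first, `<sep>` last) and the ids 3 and 4 that `<sep>` and `<cls>` receive in the examples that follow; the `wrap_with_control_ids` helper is hypothetical and the token ids are illustrative values.

```python
# Hypothetical helper; the ids below are assumptions taken from this notebook.
SEP_ID = 3  # id reserved for <sep>
CLS_ID = 4  # id reserved for <cls>


def wrap_with_control_ids(ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Return a new list laid out as [cls_id] + ids + [sep_id]."""
    return [cls_id, *ids, sep_id]


encoded = [212, 32, 10, 587, 446]  # illustrative ids, e.g. from encode_as_ids
print(wrap_with_control_ids(encoded))
```

In real code the ids would come from `piece_to_id('<sep>')` and `piece_to_id('<cls>')` on the trained processor rather than from hard-coded constants.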

In [42]:
## Example of control symbols
cs = '--control_symbols=<sep>,<cls>'
train_args = f'--input=botchan.txt --model_prefix=m_ctrl --vocab_size=2000 {cs}'
spm.SentencePieceTrainer.train(train_args)

sp_ctrl = spm.SentencePieceProcessor()
sp_ctrl.load('m_ctrl.model')

# Control symbols only reserve ids: '<sep>' and '<cls>' appearing in the
# input text are not kept as single tokens.
print(sp_ctrl.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_ctrl.piece_to_id('<sep>'))  # 3
print(sp_ctrl.piece_to_id('<cls>'))  # 4
print('3=', sp_ctrl.decode_ids([3]))  # decoded to empty
print('4=', sp_ctrl.decode_ids([4]))  # decoded to empty


In [37]:
print(train_args)

--input=botchan.txt --model_prefix=m_ctrl --vocab_size=2000 --control_symbols=<sep>,<cls>

In [41]:
## Example of user defined symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_user2 --user_defined_symbols=<sep>,<cls> --vocab_size=2000')

sp_user = spm.SentencePieceProcessor()
sp_user.load('m_user2.model')

# ids are reserved in both modes.
# <unk>=0, <s>=1, </s>=2, <sep>=3, <cls>=4
# user defined symbols allow these symbols to appear in the text.
print(sp_user.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_user.piece_to_id('<sep>'))  # 3
print(sp_user.piece_to_id('<cls>'))  # 4
print('3=', sp_user.decode_ids([3]))  # decoded to <sep>
print('4=', sp_user.decode_ids([4]))  # decoded to <cls>

['▁this', '▁is', '▁a', '▁t', 'est', '<sep>', '▁he', 'll', 'o', '▁world', '<cls>']
3
4
3= <sep>
4= <cls>
