<a href="https://colab.research.google.com/github/mkaramib/NLP/blob/main/SentencePiece.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SentencePiece
SentencePiece is a library for tokenization at the subword level.

## Segmentation Algorithms
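Three subword segmentation algorithms are in common use. SentencePiece itself can train BPE and unigram language model tokenizers; WordPiece, a closely related algorithm, is described here for comparison.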

### 1- BPE
Byte Pair Encoding (BPE) is an algorithm that repeatedly merges the most frequent pair of adjacent symbols. It was proposed for subword segmentation by Sennrich et al. (2016), and a customized variant of it is used in GPT-2.

**Algorithm**:
1. Prepare a sufficiently large training corpus.
2. Choose the desired subword vocabulary size.
3. Split each word into a sequence of characters, append the end-of-word suffix `</w>`, and record the word's frequency. At this stage the basic unit is the character. For example, if “low” occurs 5 times, it is represented as `l o w </w>`: 5.
4. Merge the most frequent pair of adjacent symbols into a new subword.
5. Repeat step 4 until the vocabulary reaches the size chosen in step 2 or the most frequent remaining pair occurs only once.
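
The merge loop in steps 3 to 5 can be sketched in plain Python. This is a minimal illustration, not SentencePiece's implementation: the function names (`get_pair_counts`, `merge_pair`, `learn_bpe`) and the toy word frequencies are hypothetical, and the stopping rule uses a fixed number of merges instead of a target vocabulary size.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged = {}
    for word, freq in vocab.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

def learn_bpe(vocab, num_merges):
    """Apply up to `num_merges` merges; stop early when no pair occurs twice."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:  # the most frequent remaining pair occurs only once
            break
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Hypothetical toy corpus: words already split into characters plus "</w>".
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
merges, final_vocab = learn_bpe(toy_vocab, num_merges=3)
print(merges)
print(final_vocab)
```

Each merge adds exactly one new symbol, so the number of merges controls the final vocabulary size.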

### 2- WordPiece
WordPiece is a *likelihood*-based segmentation algorithm proposed by Schuster and Nakajima (2012).

**Algorithm**: 
1. Prepare a sufficiently large training corpus.
2. Choose the desired subword vocabulary size.
3. Split each word into a sequence of characters.
4. Build a language model on the data from step 3.
5. Among all candidate word units, add the one that increases the likelihood of the training data the most.
6. Repeat step 5 until the vocabulary reaches the size chosen in step 2 or the likelihood increase falls below a chosen threshold.
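
Step 5 is often approximated by scoring each adjacent pair as `count(ab) / (count(a) * count(b))` and merging the highest-scoring pair. The sketch below assumes that approximation, which is not spelled out in the text above; the function name `wordpiece_pair_scores` and the toy frequencies are hypothetical.

```python
from collections import Counter

def wordpiece_pair_scores(vocab):
    """Score each adjacent pair by count(ab) / (count(a) * count(b))."""
    symbol_counts = Counter()
    pair_counts = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for s in symbols:
            symbol_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return {
        (a, b): count / (symbol_counts[a] * symbol_counts[b])
        for (a, b), count in pair_counts.items()
    }

# Hypothetical toy corpus: words split into characters, with frequencies.
toy_vocab = {"h u g": 10, "p u g": 5, "p u n": 12, "b u n": 4, "h u g s": 5}
scores = wordpiece_pair_scores(toy_vocab)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```

Unlike BPE, which would merge the most frequent pair here, this score favors pairs whose parts rarely occur apart from each other.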




### 3- Unigram Language Model
The unigram language model is another subword segmentation algorithm, proposed by Kudo. It assumes that subwords occur independently, so the probability of a subword sequence is the product of the occurrence probabilities of its subwords. Like WordPiece, the unigram language model uses a language model to build the subword vocabulary.

**Algorithm**:
1. Prepare a sufficiently large training corpus.
2. Choose the desired subword vocabulary size.
3. With the current vocabulary fixed, optimize the subword occurrence probabilities on the training corpus.
4. Compute the loss of each subword, i.e. how much the likelihood drops when that subword is removed from the vocabulary.
5. Sort the subwords by loss and keep the top X% (e.g. X can be 80). To avoid out-of-vocabulary tokens, it is recommended to always keep the single characters in the vocabulary.
6. Repeat steps 3–5 until the vocabulary reaches the size chosen in step 2 or step 5 no longer changes the vocabulary.
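
Because a segmentation's probability is the product of its subword probabilities, the most probable segmentation of a string can be found with dynamic programming (Viterbi). The sketch below shows only this segmentation step, not the vocabulary pruning; the function name `viterbi_segment` and the toy probabilities are hypothetical.

```python
import math

def viterbi_segment(text, log_probs):
    """Return the highest-probability segmentation of `text` and its log-probability."""
    n = len(text)
    best_score = [0.0] + [-math.inf] * n  # best log-probability of text[:end]
    back = [0] * (n + 1)                  # start index of the last piece ending at `end`
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the string to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], best_score[n]

# Hypothetical subword probabilities; single characters are kept to avoid OOV.
toy_probs = {
    "h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
    "hu": 0.1, "ug": 0.2, "gs": 0.2, "hug": 0.3,
}
log_probs = {piece: math.log(p) for piece, p in toy_probs.items()}
pieces, score = viterbi_segment("hugs", log_probs)
print(pieces, math.exp(score))
```

With these toy values, "hu" + "gs" (0.1 × 0.2 = 0.02) beats "hug" + "s" (0.3 × 0.05 = 0.015), even though "hug" is the single most probable piece.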

## Train Model
A SentencePiece model is trained with the following arguments:
1.  Training content: set with `--input=train.txt`
2.  Model file name: set with `--model_prefix=m`
    * with the prefix `m`, the trained model is written to `m.model` (and its vocabulary to `m.vocab`)
3.  Model type: set with `--model_type`
    * the default is `unigram`: Unigram Language Model
    * `bpe` trains a BPE model
    * `char` uses every character as a vocabulary entry
    * `word` uses every word as a vocabulary entry
4. Vocabulary size: set with `--vocab_size=100`
5. User-defined symbols: set with
    * `--user_defined_symbols=<sep>,<cls>,<s>,</s>`
6. Changing the pre-defined symbols: add the following arguments.
    * `--pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 --pad_piece=[PAD] --unk_piece=[UNK] --bos_piece=[BOS] --eos_piece=[EOS]`

**Note**: UNK, BOS, and EOS are defined by default (ids 0, 1, and 2). PAD is disabled by default (id -1) and has to be enabled with `--pad_id`.
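
Long flag strings are easy to mistype, so it can help to build them from named values. The helper below is a hypothetical convenience function (`build_train_args` is not part of SentencePiece); it only assembles the argument string from the flags listed above and does no training itself.

```python
def build_train_args(input_path, model_prefix, vocab_size,
                     model_type="unigram", user_defined_symbols=(), **extra_flags):
    """Assemble a SentencePiece training flag string from named values."""
    flags = {
        "input": input_path,
        "model_prefix": model_prefix,
        "model_type": model_type,
        "vocab_size": vocab_size,
    }
    if user_defined_symbols:
        flags["user_defined_symbols"] = ",".join(user_defined_symbols)
    flags.update(extra_flags)  # e.g. pad_id=0, unk_id=1, ...
    return " ".join(f"--{name}={value}" for name, value in flags.items())

args = build_train_args(
    "train.txt", "m", 100,
    model_type="bpe",
    user_defined_symbols=["<sep>", "<cls>"],
    pad_id=0, unk_id=1, bos_id=2, eos_id=3,
)
print(args)

# To train, pass the string to the trainer
# (assumes `sentencepiece` is installed and train.txt exists):
# import sentencepiece as spm
# spm.SentencePieceTrainer.train(args)
```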

In [None]:
# install sentencepiece
!pip install sentencepiece
import sentencepiece as spm

model_dir = "./data/model/"

# train a BPE model on `./data/vocab_train.txt`; this writes `./data/mbpe.model` and `./data/mbpe.vocab`
# `mbpe.vocab` is just a reference; it is not used in the segmentation.
spm.SentencePieceTrainer.train('--input=./data/vocab_train.txt --model_prefix=./data/mbpe --model_type=bpe --vocab_size=20000')

# changing the pre-defined symbols
#spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 --pad_piece=[PAD] --unk_piece=[UNK] --bos_piece=[BOS] --eos_piece=[EOS]')

# make a segmenter instance and load the trained model file (mbpe.model)
sp = spm.SentencePieceProcessor()
sp.load('./data/mbpe.model')

# encode: text => id
s1 = "Four groups that advocate for immigrant rights said Thursday they will challenge Arizona 's new immigration law , which allows police to ask anyone for proof of legal U.S. residency ."

# print the encoded results, tokens and ids
print(sp.encode_as_pieces(s1))
print(sp.encode_as_ids(s1))

# decode: pieces => text, ids => text
print(sp.decode_pieces(sp.encode_as_pieces(s1)))
print(sp.decode_ids([1758, 1093, 32, 21, 3370, 25, 8133, 485, 26, 222, 70, 58, 959]))

## Symbols:
There are some pre-defined symbols such as *BOS*, *EOS*, *PAD*, and *UNK*. Their ids can be read from a loaded model with the following code.

In [None]:
# print pre-defined symbols
print('bos=', sp.bos_id())
print('eos=', sp.eos_id())
print('unk=', sp.unk_id())
print('pad=', sp.pad_id())  # disabled by default

# encode a sentence to ids and print them
print(sp.encode_as_ids('Hello world'))

# Prepend or append bos/eos ids.
print([sp.bos_id()] + sp.encode_as_ids('Hello world') + [sp.eos_id()])

## Text normalization
SentencePiece can normalize text before segmentation; the rule is selected with the *--normalization_rule_name* argument. The following normalization rules are available.


1. **nmt_nfkc:** NFKC normalization with some additional normalization around spaces. (default)
2. **nfkc:** original NFKC normalization.
3. **nmt_nfkc_cf:** nmt_nfkc + Unicode case folding (mostly lower casing)
4. **nfkc_cf:** nfkc + Unicode case folding.
5. **identity:** no normalization
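
To get a feel for what NFKC and case folding do, the rules can be approximated with Python's standard `unicodedata` module. This is only an approximation under stated assumptions: `approx_normalize` is a hypothetical helper, it does not call SentencePiece, and it does not reproduce the extra whitespace handling of the `nmt_` variants.

```python
import unicodedata

def approx_normalize(text, rule="nfkc"):
    """Approximate the `nfkc`, `nfkc_cf`, and `identity` rules with the standard library."""
    if rule == "identity":
        return text
    if rule not in ("nfkc", "nfkc_cf"):
        raise ValueError(f"rule not covered by this sketch: {rule}")
    normalized = unicodedata.normalize("NFKC", text)
    if rule == "nfkc_cf":
        normalized = normalized.casefold()  # Unicode case folding
    return normalized

# Full-width "Hello" followed by a word that starts with the "fi" ligature (U+FB01).
sample = "\uff28\uff45\uff4c\uff4c\uff4f \ufb01ne"
for rule in ("identity", "nfkc", "nfkc_cf"):
    print(rule, "=>", approx_normalize(sample, rule))
```

NFKC maps compatibility characters such as full-width letters and ligatures to their plain equivalents, which keeps visually identical strings from producing different tokens.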
