<a href="https://colab.research.google.com/github/mkaramib/nlp/blob/main/SentencePiece.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SentencePiece
SentencePiece is a library for subword-level tokenization.

## Segmentation Algorithms
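Three popular subword segmentation algorithms are described below. SentencePiece's *--model_type* option accepts *unigram*, *bpe*, *char*, and *word*; WordPiece is covered here as a closely related algorithm.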

### BPE
Byte Pair Encoding (BPE) is an algorithm that repeatedly merges the most frequent pair of adjacent symbols. It was proposed for subword segmentation by Sennrich et al. (2016), and a customized variant was adopted in GPT-2.

**Algorithm**:
1. Prepare a sufficiently large training corpus.
2. Define the desired subword vocabulary size.
3. Split each word into a sequence of characters and append the suffix “</w>” to the end of the word, keeping the word frequency. The basic unit at this stage is the character. For example, if the frequency of “low” is 5, it is rewritten as “l o w </w>”: 5.
4. Merge the most frequent pair of adjacent symbols into a new subword.
5. Repeat step 4 until the vocabulary size defined in step 2 is reached or the frequency of the most frequent remaining pair is 1.
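The steps above can be sketched in plain Python. This is a minimal illustration, not SentencePiece's implementation: the function names (`get_pair_counts`, `merge_pair`, `learn_bpe`) and the toy word frequencies are assumptions made for the example, and a fixed number of merges (`num_merges`) stands in for the target vocabulary size.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a word-frequency table."""
    # Step 3: split words into characters and append the end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count <= 1:  # stop when the most frequent pair occurs only once
            break
        vocab = merge_pair(best, vocab)  # step 4
        merges.append(best)
    return merges, vocab


# Hypothetical toy corpus: word -> frequency
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=10)
print(merges)
print(vocab)
```

The learned `merges` list is the model: applying the same merges in the same order to a new word reproduces its segmentation.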

### WordPiece
WordPiece is a segmentation algorithm based on *likelihood*, proposed by Schuster and Nakajima in 2012.

**Algorithm**: 
1. Prepare a sufficiently large training corpus.
2. Define the desired subword vocabulary size.
3. Split each word into a sequence of characters.
4. Build a language model on the data from step 3.
5. Among all possible new word units, choose the one that increases the likelihood of the training data the most when added to the model.
6. Repeat step 5 until the vocabulary size defined in step 2 is reached or the likelihood increase falls below a certain threshold.
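A small sketch of step 5 follows. It assumes one commonly described approximation of the likelihood gain, the score count(ab) / (count(a) × count(b)); the original paper's exact criterion may differ, and the function name `wordpiece_best_pair` and the toy counts are hypothetical.

```python
from collections import Counter


def wordpiece_best_pair(vocab):
    """Return the pair with the highest score count(ab) / (count(a) * count(b)).

    `vocab` maps a tuple of symbols to the frequency of that word.
    The score is an assumed stand-in for the likelihood gain.
    """
    unit_counts = Counter()
    pair_counts = Counter()
    for symbols, freq in vocab.items():
        for s in symbols:
            unit_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    scores = {
        pair: count / (unit_counts[pair[0]] * unit_counts[pair[1]])
        for pair, count in pair_counts.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy corpus, already split into characters (step 3)
vocab = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}
best, scores = wordpiece_best_pair(vocab)
print(best, scores[best])
```

On this toy data the score prefers the pair ("g", "s"), whose parts rarely occur apart, while a purely frequency-based choice as in BPE would pick ("u", "g"), the most frequent pair.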




### Unigram Language Model
The Unigram Language Model is another subword segmentation algorithm, proposed by Kudo (2018). It assumes that all subwords occur independently, so the probability of a subword sequence is the product of the occurrence probabilities of its subwords. Like WordPiece, the Unigram Language Model uses a language model to build the subword vocabulary.

**Algorithm**:
1. Prepare a sufficiently large training corpus.
2. Define the desired subword vocabulary size.
3. With the current vocabulary fixed, optimize the subword occurrence probabilities on the training data.
4. Compute the loss of each subword, i.e. how much the likelihood decreases when that subword is removed from the vocabulary.
5. Sort the subwords by loss and keep the top X% (e.g. X = 80). To avoid out-of-vocabulary tokens, it is recommended to always keep the single characters in the vocabulary.
6. Repeat steps 3–5 until the vocabulary size defined in step 2 is reached or step 5 no longer changes the vocabulary.
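To make the independence assumption concrete, the sketch below finds the most probable segmentation of a string when the score of a segmentation is the product of its subword probabilities. It uses a standard Viterbi search; the function name `viterbi_segment` and the probability table are hypothetical values chosen for illustration, not output of a trained model.

```python
import math


def viterbi_segment(text, log_probs):
    """Most probable segmentation of `text` under a unigram subword model.

    `log_probs` maps each subword to its log probability.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log prob of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    max_len = max(len(piece) for piece in log_probs)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs:
                score = best[start] + log_probs[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back pointers to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], best[n]


# Hypothetical subword probabilities (they sum to 1)
probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
         "hu": 0.1, "ug": 0.2, "hug": 0.3, "gs": 0.2}
log_probs = {piece: math.log(p) for piece, p in probs.items()}

pieces, score = viterbi_segment("hugs", log_probs)
print(pieces, math.exp(score))
```

Keeping the single characters in the table, as step 5 recommends, guarantees that any string made of known characters has at least one valid segmentation.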

## Train Model
A SentencePiece model is trained with the algorithm chosen through the model type argument. Training needs the following arguments:
1.  Training content: defined by *--input=train.txt*
2.  Model file name: defined by *--model_prefix=m*
    * *m* means the resulting model file will be named m.model
3.  Model type: defined by *--model_type*
    * the default is the Unigram Language Model
    * *--model_type=bpe* trains a BPE model
4.  Vocabulary size: defined by *--vocab_size=100*
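The arguments listed above can be passed from Python as a single flag string to `spm.SentencePieceTrainer.train`. The sketch below is a hedged example: the helper names `training_flags` and `train_model` are hypothetical, and `train.txt` is assumed to be a plain-text corpus with one sentence per line. The actual training call is left commented out so the snippet runs even before the corpus exists.

```python
def training_flags(input_path, model_prefix, vocab_size, model_type="unigram"):
    """Build the flag string from the four arguments described above."""
    return (
        f"--input={input_path} --model_prefix={model_prefix} "
        f"--vocab_size={vocab_size} --model_type={model_type}"
    )


def train_model(flags):
    """Train a model; writes <model_prefix>.model and <model_prefix>.vocab."""
    import sentencepiece as spm  # imported lazily; requires `pip install sentencepiece`

    spm.SentencePieceTrainer.train(flags)


flags = training_flags("train.txt", "m", 100, model_type="bpe")
print(flags)
# train_model(flags)  # uncomment once train.txt is available
```

A vocabulary size that is too large for a small corpus makes training fail, so the value may need to be lowered for toy data.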





In [None]:
# install sentencepiece
!pip install sentencepiece

import sentencepiece as spm

# directory that holds a previously trained model file
model_dir = "./data/model/"

# load the trained model
sp = spm.SentencePieceProcessor(model_file=model_dir + 'sentencepiece.model')

# test the model
s0 = "Machine learning is interesting area and it is growing fast."

# encode: text => subword pieces and ids
pieces = sp.encode_as_pieces(s0)
ids = sp.encode_as_ids(s0)
print(pieces)
print(ids)

# decode: pieces or ids => text (round trip of the encodings above)
print(sp.decode_pieces(pieces))
print(sp.decode_ids(ids))

