Byte-Pair Encoding (BPE) 

BPE was initially developed as an algorithm to compress text, and was then used by OpenAI for tokenization when pretraining the GPT model. It is used by many Transformer models, including GPT, GPT-2, RoBERTa, BART, and DeBERTa.

In [27]:
import os
import sentencepiece as spm

In [28]:
data = '/home/silver/NLP_Basics/Tokenazation/phx.txt'

model = 'models'

In [29]:
spm.SentencePieceTrainer.train(
    "--model_type=bpe "
    f"--input={data} "
    f"--model_prefix={model}/bpe "
    "--vocab_size=500"
)


sentencepiece_trainer.cc(178) LOG(INFO) Running command: --model_type=bpe --input=/home/silver/NLP_Basics/Tokenazation/phx.txt --model_prefix=models/bpe --vocab_size=500
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: /home/silver/NLP_Basics/Tokenazation/phx.txt
  input_format: 
  model_prefix: models/bpe
  model_type: BPE
  vocab_size: 500
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
 

- The trainer loaded all 9349 sentences.
- It read more than 1,400,000 characters, and the character coverage is close to 100%, so almost no characters were left out.
- The model trained without errors: the trainer read the input file and learned BPE pieces from its 9349 lines.

You now have two files:

models/bpe.model: contains the trained model.

models/bpe.vocab: contains the vocabulary (the list of pieces).
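To inspect the vocabulary file, a small parser can help. The sketch below assumes each line of bpe.vocab holds a piece and a score separated by a tab, and that a piece's ID is its line index. The function name `parse_vocab` and the sample lines are hypothetical, not taken from the trained model.

```python
from typing import List, Tuple

def parse_vocab(lines: List[str]) -> List[Tuple[str, float]]:
    # Each non-empty line is assumed to be "<piece><TAB><score>".
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        piece, score = line.rsplit("\t", 1)
        entries.append((piece, float(score)))
    return entries

# Hypothetical excerpt; the real file trained above would have 500 lines.
sample = ["<unk>\t0", "<s>\t0", "</s>\t0", "▁t\t-0", "he\t-1"]
vocab = parse_vocab(sample)

# Under the assumption above, the list index doubles as the piece ID.
for piece_id, (piece, score) in enumerate(vocab):
    print(piece_id, piece, score)
```

To read the real file, something like `parse_vocab(open('models/bpe.vocab', encoding='utf-8').readlines())` should work under the same assumption.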

In [30]:
import os
print("File exists?", os.path.exists(data))


File exists? True


In [31]:
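# Create the output directory if it is missing (run this before training,
# since the trainer writes bpe.model and bpe.vocab into it).
os.makedirs(model, exist_ok=True)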

In [32]:
sp = spm.SentencePieceProcessor()
sp.load(os.path.join(model, 'bpe.model'))

True

In [33]:
input_string = "It's a Test"

In [None]:
# encode: text => pieces
print(sp.encode_as_pieces(input_string))  # ['▁I', 't', "'", 's', '▁a', '▁T', 'est']

['▁I', 't', "'", 's', '▁a', '▁T', 'est']


In [None]:
# encode: text => ids
print(sp.encode_as_ids(input_string))  # [81, 439, 454, 445, 5, 84, 355]

[81, 439, 454, 445, 5, 84, 355]


In [36]:
# decode: id => text
print(sp.decode_pieces(['▁I', 't', "'", 's', '▁a', '▁T', 'est']))  #It's a Test

It's a Test


In [37]:
print(sp.decode_ids([81, 439, 454, 445, 5, 84, 355]))  #It's a Test

It's a Test


In [38]:
# returns vocab size
print(f"vocab size: {sp.get_piece_size()}")

vocab size: 500


In [39]:
# id <=> piece conversion
print(f"id 101 to piece: {sp.id_to_piece(101)}")
print(f"Piece ▁is to id: {sp.piece_to_id('▁is')}")

id 101 to piece: ur
Piece ▁is to id: 225


id_to_piece() returns the piece (subword token) stored under a given ID. Conversely, piece_to_id() returns the ID of a given piece.
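Conceptually, the two calls are inverse lookups over the same vocabulary. The sketch below mimics that with a plain list and dictionary. The mini-vocabulary and the helper names are hypothetical, and the fallback to the unknown ID for missing pieces is an assumption about the behaviour, not a copy of the library's code.

```python
# Hypothetical mini-vocabulary; ids and pieces are illustrative only.
pieces = ["<unk>", "<s>", "</s>", "▁a", "▁is", "ur"]

# id -> piece is a list lookup; piece -> id is the inverse dictionary.
piece_ids = {piece: idx for idx, piece in enumerate(pieces)}

def id_to_piece(idx: int) -> str:
    return pieces[idx]

def piece_to_id(piece: str, unk_id: int = 0) -> int:
    # Assumption: a piece outside the vocabulary maps to the <unk> id.
    return piece_ids.get(piece, unk_id)

print(id_to_piece(5))
print(piece_to_id("▁is"))
print(piece_to_id("not-in-vocab"))
```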

What's important?
What distinguishes SentencePiece from many other libraries:

It converts your sentence to pieces or IDs without discarding information from the text.

More importantly, you can reconstruct the sentence from the pieces or IDs without losing spaces, upper/lowercase letters, punctuation, etc.

With many traditional methods, the tokenizer splits the sentence and loses information along the way (for example, where the spaces were or how the punctuation was attached).

With SentencePiece, the pieces carry the whitespace information (the "▁" symbol represents a space), so when you decode them, the sentence comes back as you entered it.

✅ In short, lossless detokenization is a major advantage of SentencePiece over many other tokenizers.
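The difference can be illustrated without the library. The sketch below is a word-level simplification, not SentencePiece's actual implementation: `naive_tokenize`, `marked_tokenize`, and `detokenize` are hypothetical helpers that only show why keeping the "▁" marker makes the round trip lossless with respect to spaces.

```python
import re

MARK = "\u2581"  # the '▁' character used as the whitespace marker

def naive_tokenize(text):
    # Splits into words and punctuation; the positions of spaces are dropped.
    return re.findall(r"\w+|[^\w\s]", text)

def marked_tokenize(text):
    # Turn every space into the marker, add one at the start,
    # then split right before each marker (needs Python 3.7+).
    marked = MARK + text.replace(" ", MARK)
    return [piece for piece in re.split(f"(?={MARK})", marked) if piece]

def detokenize(pieces):
    # Join, turn markers back into spaces, drop the space added at the start.
    return "".join(pieces).replace(MARK, " ")[1:]

text = "It's a Test"
print(" ".join(naive_tokenize(text)))     # spacing around the apostrophe is lost
print(marked_tokenize(text))
print(detokenize(marked_tokenize(text)))  # same as the input
```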

In [40]:
tokens = ['▁I', 't', "'", 's', '▁a', '▁T', 'est']
merged = "".join(tokens).replace('▁', " ").strip()
assert merged == input_string, "Input string and detokenized sentence didn't match"

In [41]:
for piece_id in range(3):
    print(sp.id_to_piece(piece_id), sp.is_control(piece_id))

<unk> False
<s> True
</s> True


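This means that, by default, the first 3 IDs are reserved for special symbols (<unk>, <s>, </s>) rather than pieces learned from the text. Only <s> and </s> are reported as control symbols; <unk> shows False above because it is the unknown-piece symbol, not a control symbol.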

In [43]:
# Example of user-defined symbols
spm.SentencePieceTrainer.train(f'''\
    --model_type=bpe\
    --input={data}\
    --model_prefix={model}/bpe_user\
    --user_defined_symbols=<sep>,<cls>\
    --vocab_size=500''')
sp_user = spm.SentencePieceProcessor()
sp_user.load(os.path.join(model, 'bpe_user.model'))


sentencepiece_trainer.cc(178) LOG(INFO) Running command:     --model_type=bpe    --input=/home/silver/NLP_Basics/Tokenazation/phx.txt    --model_prefix=models/bpe_user    --user_defined_symbols=<sep>,<cls>    --vocab_size=500
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: /home/silver/NLP_Basics/Tokenazation/phx.txt
  input_format: 
  model_prefix: models/bpe_user
  model_type: BPE
  vocab_size: 500
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  user_defined_symbols: <sep>
  user_defined_symbols: <cls>
  required_chars: 
  byte_fallback: 0


True

In [44]:
print(sp_user.encode_as_pieces('this is a test<sep> hello world<cls>'))

['▁this', '▁is', '▁a', '▁t', 'est', '<sep>', '▁he', 'll', 'o', '▁wor', 'ld', '<cls>']


In [45]:
print(sp_user.piece_to_id('<sep>')) 

3


In [46]:
print(sp_user.piece_to_id('<cls>'))

4


In [47]:
print('3=', sp_user.decode_ids([3]))  # decoded to <sep>

3= <sep>


In [48]:
print('4=', sp_user.decode_ids([4]))  # decoded to <cls>

4= <cls>


In [49]:
print('bos=', sp_user.bos_id())     # 1
print('eos=', sp_user.eos_id())     # 2
print('unk=', sp_user.unk_id())     # 0
print('pad=', sp_user.pad_id())     # -1, disabled by default


bos= 1
eos= 2
unk= 0
pad= -1


In [50]:
print(sp_user.encode_as_ids('Hello world'))    

# Prepend or append bos/eos ids.
print([sp_user.bos_id()] + sp_user.encode_as_ids('Hello world') + [sp_user.eos_id()]) 

[28, 148, 441, 340, 79]
[1, 28, 148, 441, 340, 79, 2]
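The wrapping step can be put in a small helper. This is a minimal sketch: `add_bos_eos` is a hypothetical function, and it assumes that an ID of -1 (as printed for pad above) means the symbol is disabled and should be skipped.

```python
from typing import List

def add_bos_eos(ids: List[int], bos_id: int, eos_id: int) -> List[int]:
    # Assumption: a negative id means the symbol is disabled, so skip it.
    prefix = [bos_id] if bos_id >= 0 else []
    suffix = [eos_id] if eos_id >= 0 else []
    return prefix + ids + suffix

ids = [28, 148, 441, 340, 79]  # ids printed above for 'Hello world'
print(add_bos_eos(ids, bos_id=1, eos_id=2))   # [1, 28, 148, 441, 340, 79, 2]
print(add_bos_eos(ids, bos_id=-1, eos_id=2))  # [28, 148, 441, 340, 79, 2]
```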


# Explanation of BPE Tokenization Process

## What is BPE Tokenization?

BPE (Byte Pair Encoding) is a subword tokenization method that:

- Splits text into smaller subword units rather than full words or characters.
- Keeps the vocabulary small and of a fixed size (500 pieces in this notebook).
- Handles rare or unseen words by breaking them into known subword pieces.
- Works well with complex languages and new word forms.
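To see how an unseen word breaks into known pieces, here is a minimal sketch of applying an already learned merge list to a new word. The merge list and the function name `apply_merges` are hypothetical, and this is the textbook BPE procedure rather than SentencePiece's internal code.

```python
from typing import Dict, List, Tuple

def apply_merges(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    # Earlier merges have higher priority (lower rank).
    ranks: Dict[Tuple[str, str], int] = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge list, in the order the merges were learned.
merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]
print(apply_merges("lowest", merges))  # ['low', 'est']
print(apply_merges("xyz", merges))     # ['x', 'y', 'z']
```

Even if "lowest" never appeared in the training text, it is covered by the pieces "low" and "est"; a word with no applicable merges falls back to single characters.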

---

## Steps We Took in BPE Tokenization

1. **Data Preparation**  
   - Used a text file (e.g., `phx.txt`) containing the training corpus for the tokenizer.

2. **Training SentencePiece BPE Model**  
   - Called SentencePiece trainer with parameters:
     - `model_type=bpe`
     - `vocab_size=500`
     - Input file path and model prefix to save output.
   - The trainer finds the most frequent pairs of adjacent symbols and merges them iteratively.

3. **Training Process**  
   - Counts the frequency of characters and of adjacent symbol pairs.
   - Merges the most frequent pair into a new token, then repeats.
   - Builds a vocabulary of subword tokens plus special tokens like `<unk>`, `<s>`, `</s>`.

4. **Output Files**  
   - `bpe.model`: The trained tokenizer model.
   - `bpe.vocab`: The vocabulary list with tokens and their scores.

5. **Encoding & Decoding**  
   - Encode a sentence to a list of subword tokens or integer IDs.
   - Decode tokens or IDs back to the original text, including spaces.
   - Example:
     ```python
     input_string = "This is a test"
     tokens = ['▁T', 'h', 'is', '▁is', '▁a', '▁t', 'est']
     ```
     The special symbol `▁` marks a space before the token.

6. **Special/User-defined Tokens**  
   - Added tokens like `<sep>`, `<cls>` as single indivisible tokens.
   - Useful for models like BERT that need special markers.
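The training loop in step 3 can be sketched in a few lines of plain Python. This is a toy version under simplifying assumptions: words are pre-split and given with counts, there is no whitespace marker, and ties are broken by first occurrence. The function name `learn_bpe` and the word counts are hypothetical, and SentencePiece's real trainer differs in many details.

```python
from collections import Counter
from typing import Dict, List, Tuple

def learn_bpe(word_freqs: Dict[str, int], num_merges: int) -> List[Tuple[str, str]]:
    # Each word starts as a tuple of single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every pair of adjacent symbols, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the best pair merged into one symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges

# Hypothetical word counts, not taken from phx.txt.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(learn_bpe(word_freqs, 3))  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

Each merge adds one new token, so the number of merges is what controls the final vocabulary size.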

---

## Why Use BPE Tokenization?

- Strikes a balance between word-level and character-level tokenization.
- Supports an open vocabulary that can handle unknown words.
- With SentencePiece, tokenization is reversible, so the original text can be recovered from the tokens.
- Allows the use of special tokens to guide deep learning models.

---

