
SentencePiece is a subword text tokenizer that can be used in natural language processing (NLP) tasks such as machine translation, language modeling, and speech recognition. It was developed by Google and is open source.

SentencePiece learns a subword vocabulary directly from raw text. Its default algorithm is the unigram language model, which assigns a probability to each subword unit; byte-pair encoding (BPE) is also supported. The vocabulary is learned from the training corpus before the task-specific model is trained, not jointly with it. Because any word can be decomposed into smaller units, down to single characters or, with byte fallback enabled, bytes, the model can represent rare words and words that were not present in the training corpus.
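
To make the idea of subword probabilities concrete, here is a minimal sketch that turns counts into probabilities and scores two candidate segmentations of the same word. The counts are made up for illustration, and using raw relative frequencies is a simplifying assumption: an actual unigram trainer estimates the probabilities iteratively rather than reading them off a count table.

```python
import math
from collections import Counter

def unigram_probs(subword_counts):
    """Turn raw subword counts into probabilities that sum to 1."""
    total = sum(subword_counts.values())
    return {subword: count / total for subword, count in subword_counts.items()}

def segmentation_log_prob(pieces, probs):
    """Log-probability of a segmentation: the sum of its pieces' log-probabilities."""
    return sum(math.log(probs[piece]) for piece in pieces)

# Hypothetical counts, made up for illustration
counts = Counter({"un": 30, "seen": 20, "u": 10, "n": 10, "s": 10, "e": 15, "en": 5})
probs = unigram_probs(counts)

# Two ways of splitting "unseen": two larger pieces vs. six single characters
coarse = segmentation_log_prob(["un", "seen"], probs)
fine = segmentation_log_prob(["u", "n", "s", "e", "e", "n"], probs)
print(coarse > fine)  # True
```

With these numbers the two-piece split scores higher, because every extra piece multiplies in another probability below 1.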

Advantages

One of the main advantages of SentencePiece is its flexibility. Because it operates directly on raw text and does not rely on language-specific pre-tokenization, it can be used with any language and script, and the tokenization process is easy to customize. This is particularly useful for agglutinative languages with complex morphology, such as Turkish or Korean.

Another advantage of SentencePiece is its ability to handle out-of-vocabulary (OOV) words. Since the model learns subword units based on the training corpus, it can represent words that were not present in the corpus by breaking them down into subword units.

Disadvantages

One potential disadvantage of SentencePiece is memory: the subword vocabulary and its scores must be stored alongside the model, which a simple whitespace or rule-based tokenizer does not need. Tokenization can also be slower, and the subword units have to be learned in a separate training pass over the corpus before the tokenizer can be used.

How SentencePiece works

SentencePiece works by breaking text into smaller units called subwords. These subwords act as building blocks that can represent any word, even one that is not present in the training corpus. This helps with rare or unseen words, and with morphologically complex languages where a word can take many different forms.



1.   First, the tokenizer is trained on a corpus of text. During training, it learns a set of subword units and a probability for each one. This is unsupervised: the unigram language model starts from a large candidate vocabulary and repeatedly prunes the units that contribute least to the likelihood of the corpus until the target vocabulary size is reached.

2.   Once the tokenizer has been trained, it can be used to tokenize new text. Each word, including one never seen during training, is broken down into subword units from the vocabulary. The SentencePiece library selects the most probable segmentation under the unigram model; the simplified code in this notebook instead uses a greedy algorithm that repeatedly takes the longest matching subword unit from the vocabulary.
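
The second step can be sketched in more than one way. Greedy longest-match is the simplest; a unigram model can instead pick the segmentation with the highest total probability using dynamic programming (the Viterbi algorithm). The sketch below shows the latter. The function name `viterbi_segment` and the probabilities are assumptions made up for illustration, not part of the SentencePiece API.

```python
import math

def viterbi_segment(word, log_probs):
    """Return the highest-scoring split of `word` into vocabulary pieces, or None."""
    n = len(word)
    # best[i] = (score, start index of the last piece) for the best split of word[:i]
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # the word cannot be covered by the vocabulary
    # Walk the back-pointers from the end of the word to recover the pieces
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]

# Hypothetical probabilities, made up for illustration
toy_probs = {"un": 0.3, "seen": 0.2, "u": 0.1, "n": 0.1, "s": 0.1, "e": 0.15, "en": 0.05}
log_probs = {piece: math.log(p) for piece, p in toy_probs.items()}

print(viterbi_segment("unseen", log_probs))  # ['un', 'seen']
```

Unlike greedy matching, this search considers every split that the vocabulary allows, so it can prefer a shorter first piece when that leads to a more probable segmentation overall.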



For example, let's say we have the word "unseen" and the tokenizer's vocabulary consists of the subwords "un", "seen", and "k". The tokenizer would break down "unseen" into "un" and "seen", because those subwords are present in the vocabulary. The subwords are then represented as separate tokens in the tokenized output.
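
The example above can be reproduced with a few lines of greedy longest-match code. This is a minimal sketch on plain strings; the function name `greedy_segment` and the choice to emit `<unk>` for a single uncovered character are assumptions made for illustration.

```python
def greedy_segment(word, vocab, unk="<unk>"):
    """Split `word` by repeatedly taking the longest vocabulary entry at the current position."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it one character at a time
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing in the vocabulary starts here: emit <unk> and skip one character
            pieces.append(unk)
            start += 1
    return pieces

vocab = {"un", "seen", "k"}
print(greedy_segment("unseen", vocab))  # ['un', 'seen']
print(greedy_segment("unk", vocab))     # ['un', 'k']
print(greedy_segment("unsung", vocab))  # ['un', '<unk>', 'un', '<unk>']
```

The last call shows what happens when the vocabulary lacks single characters: the uncovered letters fall back to the unknown token, which is why subword vocabularies normally include every character seen in training.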

During training or inference, the subwords can be encoded as integers or embeddings and used as input to a machine learning model. This allows the model to handle words that were not present in the training corpus, as well as to represent rare words more effectively.
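
As a rough illustration of that encoding step, the sketch below maps subwords to integer ids and then looks up one vector per id. The vocabulary and the random embedding table are hypothetical; in a real model the embedding vectors are learned parameters rather than random numbers.

```python
import random

# Hypothetical vocabulary mapping each subword to an integer id
vocab = {"<unk>": 0, "un": 1, "seen": 2, "k": 3}
pieces = ["un", "seen"]

# Map each subword to its id, falling back to the <unk> id for unknown pieces
ids = [vocab.get(piece, vocab["<unk>"]) for piece in pieces]

# Toy embedding table: one small random vector per vocabulary entry
random.seed(0)
embedding_dim = 4
embeddings = [[random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab]

# Embedding lookup is just indexing the table with each id
vectors = [embeddings[i] for i in ids]

print(ids)                            # [1, 2]
print(len(vectors), len(vectors[0]))  # 2 4
```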

In [None]:
import json
from collections import Counter

SPECIAL_TOKENS = ["<unk>", "<s>", "</s>", "<pad>"]

# Simplified stand-in for subword learning: count every substring (up to
# max_len characters) of every word and keep the most frequent ones.
# The real unigram trainer prunes a large seed vocabulary instead.
def get_subwords(corpus, vocab_size, max_len=8):
    counts = Counter()
    for word in corpus.split():
        for start in range(len(word)):
            for end in range(start + 1, min(len(word), start + max_len) + 1):
                counts[word[start:end]] += 1
    # Reserve room in the vocabulary for the special tokens
    budget = max(vocab_size - len(SPECIAL_TOKENS), 0)
    return dict(counts.most_common(budget))

# Train tokenizer on a corpus of text
def train_tokenizer(data_path, model_prefix, vocab_size):
    # Load corpus
    with open(data_path, "r", encoding="utf-8") as f:
        corpus = f.read()

    # Compute subwords and their counts
    subwords = get_subwords(corpus, vocab_size)

    # Write special tokens first, then subwords, as "token<TAB>count" lines
    with open(f"{model_prefix}.vocab", "w", encoding="utf-8") as f:
        for token in SPECIAL_TOKENS:
            f.write(f"{token}\t0\n")
        for subword, freq in subwords.items():
            if subword not in SPECIAL_TOKENS:
                f.write(f"{subword}\t{freq}\n")

    # Save tokenizer config as JSON
    config = {
        "vocab_size": vocab_size,
        "unk_token": "<unk>",
        "bos_token": "<s>",
        "eos_token": "</s>",
        "pad_token": "<pad>",
    }
    with open(f"{model_prefix}.config.json", "w", encoding="utf-8") as f:
        json.dump(config, f)

# Load trained tokenizer
def load_tokenizer(model_prefix):
    # Load subwords and counts from file
    subwords = {}
    with open(f"{model_prefix}.vocab", "r", encoding="utf-8") as f:
        for line in f:
            subword, freq = line.rstrip("\n").split("\t")
            subwords[subword] = int(freq)

    # Load tokenizer config
    with open(f"{model_prefix}.config.json", "r", encoding="utf-8") as f:
        config = json.load(f)

    # Initialize tokenizer
    return Tokenizer(subwords, config)

# Tokenizer class
class Tokenizer:
    def __init__(self, subwords, config):
        self.subwords = subwords
        self.config = config
        # Ids follow the order of the vocab file, so special tokens come first
        self.id_to_subword = {i: subword for i, subword in enumerate(subwords)}
        self.subword_to_id = {subword: i for i, subword in self.id_to_subword.items()}
        self.unk_id = self.subword_to_id[self.config["unk_token"]]
        self.bos_id = self.subword_to_id[self.config["bos_token"]]
        self.eos_id = self.subword_to_id[self.config["eos_token"]]
        self.pad_id = self.subword_to_id[self.config["pad_token"]]
        self.vocab_size = self.config["vocab_size"]

    def tokenize(self, sentence):
        # Split sentence into words
        words = sentence.strip().split()

        # Tokenize each word into subword ids
        tokens = []
        for word in words:
            tokens.extend(self.get_subword_ids(word))

        # Add special tokens
        return [self.bos_id] + tokens + [self.eos_id]

    def get_subword_ids(self, word):
        subword_ids = []
        start = 0
        end = len(word)

        while start < end:
            # Find the longest vocabulary entry that begins at `start`
            match_end = None
            for i in range(end, start, -1):
                if word[start:i] in self.subword_to_id:
                    match_end = i
                    break

            if match_end is None:
                # No match: emit the unknown token and skip one character
                subword_ids.append(self.unk_id)
                start += 1
            else:
                subword_ids.append(self.subword_to_id[word[start:match_end]])
                start = match_end

        return subword_ids


In [None]:
from collections import Counter

class SentencePieceTokenizer:
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.vocab = None
        self.encode_cache = {}

    def train(self, data_path):
        # Load corpus
        with open(data_path, "r", encoding="utf-8") as f:
            corpus = f.read()

        # Learn subwords by merging frequent adjacent pairs (BPE-style, see _get_subwords)
        subwords = self._get_subwords(corpus, self.vocab_size)

        # Build vocabulary
        self.vocab = {"<unk>": 0, "<s>": 1, "</s>": 2}
        for i, subword in enumerate(subwords):
            self.vocab[subword] = i + 3

    def encode(self, sentence):
        # Check cache
        if sentence in self.encode_cache:
            return self.encode_cache[sentence]

        # Split sentence into words
        words = sentence.strip().split()

        # Tokenize each word into subwords
        tokens = []
        for word in words:
            subword_ids = self._get_subword_ids(word)
            tokens.extend(subword_ids)

        # Add special tokens
        tokens = [1] + tokens + [2]

        # Update cache
        self.encode_cache[sentence] = tokens

        return tokens

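    def _get_subwords(self, corpus, vocab_size):
        # Note: this is a BPE-style merge procedure, not the unigram
        # language model that SentencePiece trains by default.

        # Represent each distinct word as a list of symbols (initially characters)
        word_counts = Counter(corpus.split())
        words = {word: list(word) for word in word_counts}

        # Initialize subwords with the individual characters
        subwords = sorted({ch for word in word_counts for ch in word})

        while len(subwords) < vocab_size:
            # Count adjacent symbol pairs, weighted by word frequency
            pairs = Counter()
            for word, symbols in words.items():
                for left, right in zip(symbols, symbols[1:]):
                    pairs[(left, right)] += word_counts[word]

            # Stop early when there is nothing left to merge
            if not pairs:
                break

            # Merge the most frequent pair into a new subword
            (left, right), _ = pairs.most_common(1)[0]
            new_subword = left + right
            if new_subword not in subwords:
                subwords.append(new_subword)

            # Apply the merge to every word so later pair counts see the new symbol
            for word, symbols in words.items():
                merged, i = [], 0
                while i < len(symbols):
                    if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                        merged.append(new_subword)
                        i += 2
                    else:
                        merged.append(symbols[i])
                        i += 1
                words[word] = merged

        return subwords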

    def _get_subword_ids(self, word):
        # Initialize variables
        subword_ids = []
        start = 0
        end = len(word)

        while start < end:
            # Find the longest vocabulary entry that begins at `start`
            match_end = None
            for i in range(end, start, -1):
                if word[start:i] in self.vocab:
                    match_end = i
                    break

            if match_end is None:
                # No match: emit the unknown token and skip one character
                subword_ids.append(self.vocab["<unk>"])
                start += 1
            else:
                # Add the matched subword's id and continue after the match
                subword_ids.append(self.vocab[word[start:match_end]])
                start = match_end

        return subword_ids
