# Tokenization

Tokenization is the process of breaking down text data into smaller, manageable units called tokens, which can be words, phrases, subwords, or even individual characters. This is typically the first step in preprocessing text for machine learning (ML) and natural language processing (NLP) tasks, as it transforms raw text into a format that algorithms can analyze and understand.


Try tokenization interactively: https://tiktokenizer.vercel.app/

## Why is Tokenization Important?


- Text to Numbers: Machine learning models operate on numerical data, not raw text. Tokenization converts text into tokens, which are then mapped to integer IDs so that models can process them.

- Pattern Recognition: By splitting text into tokens, algorithms can more easily identify patterns, relationships, and context within the data.

- Efficiency: Tokenization makes it possible to handle large volumes of text efficiently, optimizing memory usage and computational speed, which is especially important in large language models.


- Generalization: Good tokenization strategies, such as subword or character tokenization, allow models to handle new or rare words by breaking them into familiar components.
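
The "text to numbers" step can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's API: the function names `build_vocab` and `encode`, the `<unk>` placeholder, and the sample sentences are all assumptions made for the example.

```python
def build_vocab(tokens, unk_token="<unk>"):
    """Assign an integer ID to each distinct token; ID 0 is reserved for unknowns."""
    vocab = {unk_token: 0}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab, unk_token="<unk>"):
    """Map tokens to IDs, falling back to the unknown ID for unseen tokens."""
    return [vocab.get(tok, vocab[unk_token]) for tok in tokens]


# Hypothetical toy corpus, split on whitespace for simplicity.
corpus_tokens = "chatbots are helpful and chatbots are fast".split()
vocab = build_vocab(corpus_tokens)

# "slow" never appeared in the corpus, so it maps to the unknown ID.
ids = encode("chatbots are slow".split(), vocab)
print(ids)
```

A word-level vocabulary like this one loses all information about unseen words, which is the problem that subword and character tokenization are meant to reduce.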


## Types of Tokenization:


- Word Tokenization: Splits text into individual words.
Example: "Chatbots are helpful." → ["Chatbots", "are", "helpful"] (punctuation dropped)

- Character Tokenization: Breaks text into individual characters.
Example: "Chatbots" → ["C", "h", "a", "t", "b", "o", "t", "s"]

- Subword Tokenization: Splits words into smaller units (subwords), useful for handling rare or unknown words.
Example: "unhappiness" → ["un", "happiness"] or ["un", "hap", "pi", "ness"]

- Sentence Tokenization: Divides text into sentences, often used for tasks like summarization or translation.
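
The four types above can be approximated with the Python standard library. The sketch below is deliberately naive: the regular expressions, the greedy longest-match rule for subwords, and the tiny hand-written subword vocabularies are assumptions chosen to reproduce the examples in the list, not how production tokenizers are built.

```python
import re


def word_tokenize(text):
    # \w+ matches runs of letters, digits, and underscores; punctuation is dropped.
    return re.findall(r"\w+", text)


def char_tokenize(text):
    # Every character, including spaces and punctuation, becomes a token.
    return list(text)


def sentence_tokenize(text):
    # Naive rule: split on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def subword_tokenize(word, vocab):
    """Greedy longest-match split of a word into pieces found in `vocab`."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return ["<unk>"]  # no known piece starts at this position
        pieces.append(word[start:end])
        start = end
    return pieces


print(word_tokenize("Chatbots are helpful."))
print(char_tokenize("Chatbots"))
print(sentence_tokenize("Chatbots are helpful. Are they fast? Yes!"))
print(subword_tokenize("unhappiness", {"un", "happiness"}))
print(subword_tokenize("unhappiness", {"un", "hap", "pi", "ness"}))
```

The two `subword_tokenize` calls show why the same word can split differently: the result depends entirely on which pieces the vocabulary contains.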


## Tokenization for LLMs:


- Tokenization is the gateway for all LLM operations: text is tokenized into token IDs, the IDs are converted to embeddings and processed by the model, and the output token IDs are detokenized back into text.


- LLMs often use subword tokenization (like Byte-Pair Encoding or WordPiece) to balance vocabulary size, efficiency, and the ability to handle new words.


- Tokenization quality directly impacts the model’s ability to understand context, manage multilingual data, and generate accurate responses.
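
The tokenize, embed, process, detokenize flow can be traced with a toy example. The sketch below assumes the simplest possible scheme, one token per UTF-8 byte, and a randomly initialized embedding table. Real LLMs use learned subword vocabularies and learned embeddings, and the "model" step here is only a placeholder that echoes its input.

```python
import random

text = "Tokenize me"

# 1. Tokenize: one integer ID (0-255) per UTF-8 byte. This is a toy scheme.
token_ids = list(text.encode("utf-8"))

# 2. Embed: look up a vector for each ID in a table (random here, learned in practice).
random.seed(0)
EMBED_DIM = 4  # hypothetical, tiny for readability
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(256)
]
embeddings = [embedding_table[i] for i in token_ids]

# 3. Process: a real model would transform the embeddings and predict new IDs.
#    This placeholder simply returns the input IDs unchanged.
output_ids = token_ids

# 4. Detokenize: map the IDs back to bytes, then to text.
decoded = bytes(output_ids).decode("utf-8")

print(len(token_ids), len(embeddings[0]), decoded)
```

Because every byte has an ID, this scheme never hits an unknown token, but sequences get long. Subword methods trade a larger vocabulary for shorter sequences.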

  

## Popular Tokenization Algorithms


| **Algorithm/Tool**          | **Description & Approach**                                                                            | **Typical Use Cases & Models**                                                                  |
|----------------------------------|------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------|
| **Whitespace/Regex**             | Splits text based on spaces or regular expressions.                                                   | Simple NLP tasks, preprocessing, rule-based systems.                                           |
| **Word Tokenizers**              | Divides text into words, often using libraries like NLTK, SpaCy, or Keras.                           | Text classification, sentiment analysis, topic modeling (Gensim), general NLP pipelines.       |
| **Character Tokenizers**        | Splits text into individual characters.                                                               | Handling misspellings, rare words, languages without clear word boundaries, deep learning models.|
| **Byte-Pair Encoding (BPE)**     | Iteratively merges the most frequent adjacent pair of bytes/characters to form subwords.             | Neural machine translation, GPT-2, RoBERTa, multilingual models.                             |
| **WordPiece**                    | Similar to BPE, but merges pairs that maximize likelihood of training data.                           | BERT, DistilBERT, Electra, other transformer models.                                            |
| **Unigram**                      | Starts with a large vocabulary, then prunes to optimize likelihood.                                  | Used with SentencePiece in models like T5, XLNet, ALBERT.                                      |
| **SentencePiece**                | Unsupervised, language-independent tokenizer supporting BPE and Unigram.                             | Neural machine translation, text generation, T5, XLNet; supports multiple languages.          |
| **Hugging Face Transformers**    | Library providing fast, model-specific tokenizers (BPE, WordPiece, Unigram).                         | BERT, GPT, RoBERTa, T5, and other transformer-based models.                                     |
| **Gensim Tokenizer**             | Tokenization for large-scale topic modeling and document similarity.                                 | Topic modeling, information retrieval.                                                        |
| **Keras Tokenizer**              | Converts text to sequences for deep learning pipelines.                                               | Text classification, sequence modeling in Keras and TensorFlow.                               |
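
To make the BPE row concrete, here is a minimal sketch of the merge-learning loop in plain Python. The word frequencies are a made-up toy corpus, the function names are illustrative, and details that real implementations handle (end-of-word markers, byte-level fallback, pre-tokenization, tie-breaking rules) are omitted.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, words):
    """Replace every occurrence of `pair` with a single concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules, starting from single characters."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first seen
        merges.append(best)
        words = merge_pair(best, words)
    return merges, words


# Hypothetical toy corpus: word -> frequency.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, words = train_bpe(word_freqs, num_merges=3)
print(merges)
print(words)
```

On this toy corpus the first merges build up the shared suffix `est`, which is how BPE ends up with reusable subword units for frequent character sequences.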
