## Summary

[Hugging Face](https://huggingface.co/) provides the [transformers](https://huggingface.co/docs/transformers/en/index) Python library.
`transformers` includes a number of pre-trained tokenizers, which can be used for various tasks:
- Off the shelf LLM use
- Fine-tuning
- Training from scratch

Note: Tokenizers are not 'trained' in the same sense that a neural network is trained on data. They are algorithms that split text into tokens, and one is chosen for a specific LLM before that model is trained. When people say a tokenizer is “pre-trained,” they usually mean:

- ✅ The vocabulary (the set of allowed tokens) has been determined in advance.
- ✅ The rules for splitting words into tokens were learned from massive text corpora.
- ✅ Once chosen, this token vocabulary is frozen and used consistently for the LLM’s entire life.

### How Are Tokenizers “Trained”?

Let’s clarify the “training” process:

- It’s not neural training but statistical or algorithmic analysis of huge text corpora.

- The goal is to balance two things:

  - Keep the vocabulary within a fixed size budget.

  - Represent diverse texts with as few tokens as possible.

- Example (BPE training):

  - Corpus: “low lower lowest”

  - Initially splits to characters:

        ["l", "o", "w", " ", "l", "o", "w", "e", "r", ...]

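  - Finds the most frequent adjacent pair (for example “l” + “o” → “lo”) and merges it into a new token. Later merges can build on earlier ones (“lo” + “w” → “low”).

  - Repeats this until the vocabulary reaches its target size.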

Once trained, the tokenizer rules are saved and reused. The LLM’s embeddings map these tokens into vectors.
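The merge loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by any particular library: the function name `train_bpe` is hypothetical, the corpus is split on whitespace before counting pairs (a simplifying assumption), and ties between equally frequent pairs are broken by first appearance.

```python
from collections import Counter


def train_bpe(corpus: str, num_merges: int):
    """Toy BPE trainer: returns the learned merges and the final word splits."""
    # Start with every whitespace-separated word split into single characters.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []

    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token

        # Pick the most frequent pair (first seen wins on ties).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        # Rewrite every word, replacing the chosen pair with one merged symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words

    return merges, words


merges, words = train_bpe("low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges, "low" has become a single token, while "lower" and "lowest" are split into "low" followed by the remaining characters.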

### Why Are They Custom to Each Model?

A tokenizer must exactly match the model’s embedding layer:

- Token ID #1234 in the vocabulary must correspond to the correct embedding vector in the model.

- If you swap tokenizers, the LLM might produce nonsense because token IDs don’t match embeddings.

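That’s why every model is distributed with a specific tokenizer. Model families such as GPT, LLaMA, and Claude each use their own, although closely related models sometimes share one (GPT-3, for example, reuses the GPT-2 vocabulary).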
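The mismatch described above can be illustrated with a toy lookup table. Everything here is hypothetical: the two vocabularies, the two-dimensional vectors, and the `embed` helper are made up for illustration, and a real model's embedding layer is a learned matrix with tens of thousands of rows.

```python
# Two toy vocabularies containing the same tokens but assigning different IDs.
vocab_a = {"hello": 0, "world": 1, "cat": 2}
vocab_b = {"cat": 0, "hello": 1, "world": 2}

# Toy embedding table "trained" with vocab_a:
# row i holds the vector for the token whose ID is i in vocab_a.
embeddings = [
    [1.0, 0.0],  # row 0: "hello"
    [0.0, 1.0],  # row 1: "world"
    [0.5, 0.5],  # row 2: "cat"
]


def embed(text, vocab):
    """Look up one embedding row per whitespace-separated token."""
    return [embeddings[vocab[token]] for token in text.split()]


matched = embed("hello world", vocab_a)     # the vocabulary the table was built for
mismatched = embed("hello world", vocab_b)  # a swapped-in vocabulary

print(matched)     # [[1.0, 0.0], [0.0, 1.0]]
print(mismatched)  # [[0.0, 1.0], [0.5, 0.5]]
```

With the swapped vocabulary, "hello" is given the vector that belongs to "world", and "world" is given the vector that belongs to "cat", so the model would be reading a different sentence from the one that was typed.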

### In Short
- ✅ Tokenizers are trained, but mostly with algorithms, not neural networks.
- ✅ They’re “pre-trained” to create a stable vocabulary and rules before model training.
- ✅ They’re tied to a specific trained model, because that model’s embedding layer is indexed by token ID.


### Encoding Text as Tokens with Hugging Face
To tokenize text with Hugging Face, instantiate a tokenizer object with the `AutoTokenizer.from_pretrained` method. Pass in the name of the model as a string value.

        from transformers import AutoTokenizer
        # 'bert-base-cased' can be replaced with a different model as needed
        my_tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

Then you can use the tokenizer object to generate either string tokens or integer ID tokens.

To generate string tokens, including special tokens:

        tokens = my_tokenizer(raw_text).tokens()

To generate integer ID tokens, use the `.encode` method on raw text or the `.convert_tokens_to_ids` method on string tokens.

        # Option for raw text
        token_ids = my_tokenizer.encode(raw_text)
        # Option for string tokens
        token_ids = my_tokenizer.convert_tokens_to_ids(tokens)

### Decoding Tokens to Text with Hugging Face
Integer ID tokens can be converted back to text using the `.decode` method:

        decoded_text = my_tokenizer.decode(token_ids)

### Unknown tokens
Pretrained tokenizers have a fixed, predetermined vocabulary. If a piece of text cannot be represented with that vocabulary, it is replaced by a placeholder during encoding, so the original text cannot be recovered by decoding. With `bert-base-cased`, for example, unknown tokens are replaced with `[UNK]`, but this behavior varies by tokenizer.
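The information loss can be seen in a toy word-level tokenizer. This is a simplified sketch, not how Hugging Face tokenizers are implemented: the four-entry vocabulary and the `encode` and `decode` helpers are hypothetical. Real subword tokenizers usually break an unfamiliar word into smaller known pieces first, so they fall back to an unknown token far less often.

```python
# Hypothetical fixed vocabulary with a reserved unknown-token entry.
vocab = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3}
id_to_token = {token_id: token for token, token_id in vocab.items()}


def encode(text):
    """Map each word to its ID, falling back to the [UNK] ID when it is missing."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]


def decode(token_ids):
    """Map IDs back to tokens and join them with spaces."""
    return " ".join(id_to_token[token_id] for token_id in token_ids)


token_ids = encode("the cat sat quietly")
decoded_text = decode(token_ids)

print(token_ids)     # [1, 2, 3, 0]
print(decoded_text)  # the cat sat [UNK]
```

The word "quietly" is not in the vocabulary, so it is encoded as the `[UNK]` ID and cannot be restored when decoding.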

### Documentation on Hugging Face Tokenizers
- [PreTrainedTokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer)
- [AutoTokenizer](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer)

[Hugging Face Tokenizers Exercise](./2.10e.ipynb)


## Additional References

[Karpathy: Deep Dive into LLMs like ChatGPT](https://www.youtube.com/watch?v=7xTGNNLPyMI)