## Part 1: Tiktoken (OpenAI’s Tokenizer)


### 1. What is tiktoken?

* `tiktoken` is OpenAI’s **official library** for tokenization.
* Any text sent to GPT models (GPT-3.5, GPT-4) is first converted into tokens, each represented by an integer ID.
* `tiktoken` lets you see how text is split into tokens and which IDs those tokens map to.


### 2. Installation


In [13]:
# pip install tiktoken


### 3. Basic Usage


In [14]:
import tiktoken

# Load the cl100k_base encoding (used by GPT-4 and GPT-3.5)
enc = tiktoken.get_encoding("cl100k_base")

# Encode text
text = "Hello, how are you?"
tokens = enc.encode(text)
print(tokens)   # [9906, 11, 1268, 527, 499, 30]

# Decode back
print(enc.decode(tokens))  # "Hello, how are you?"

[9906, 11, 1268, 527, 499, 30]
Hello, how are you?


Explanation:

* `encode()` → converts text into tokens (IDs).
* `decode()` → converts tokens back into text.

### 4. Model-specific tokenizers

* `"cl100k_base"` → GPT-4, GPT-3.5
* `"p50k_base"` → Codex models
* `"r50k_base"` → older GPT-3

Each model family is tied to one encoding, and the same text generally maps to different token IDs under different encodings.

---

## Part 2: Hugging Face Tokenizers

### 1. What is a Hugging Face Tokenizer?

* Hugging Face provides the **Transformers library**.
* Each model (BERT, GPT-2, RoBERTa, etc.) has its own tokenizer.
* `AutoTokenizer` automatically loads the correct tokenizer for the model.

### 2. Installation


In [15]:
# pip install transformers


### 3. Basic Usage


In [16]:
from transformers import AutoTokenizer

# Load GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Hello, how are you?"

# Split text into subword tokens ('Ġ' marks a leading space)
tokens = tokenizer.tokenize(text)
print(tokens)   # ['Hello', ',', 'Ġhow', 'Ġare', 'Ġyou', '?']

# Convert to IDs
ids = tokenizer.encode(text)
print(ids)  # [15496, 11, 703, 389, 345, 30]

# Decode back
print(tokenizer.decode(ids))  # "Hello, how are you?"

['Hello', ',', 'Ġhow', 'Ġare', 'Ġyou', '?']
[15496, 11, 703, 389, 345, 30]
Hello, how are you?


Explanation:

* `tokenize()` → returns tokens as strings.
* `encode()` → converts text into token IDs.
* `decode()` → converts IDs back into text.
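
The three methods are related: for GPT-2, `encode()` amounts to `tokenize()` followed by `convert_tokens_to_ids()`. The sketch below checks that, and also shows the more common pattern of calling the tokenizer object directly, which returns `input_ids` together with an `attention_mask`. It assumes the `gpt2` checkpoint can be downloaded.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Hello, how are you?"

# Two-step path: text -> token strings -> IDs
tokens = tokenizer.tokenize(text)
ids_from_tokens = tokenizer.convert_tokens_to_ids(tokens)

# One-step path: text -> IDs (special tokens disabled for a like-for-like comparison)
ids_direct = tokenizer.encode(text, add_special_tokens=False)
print(ids_from_tokens == ids_direct)

# Calling the tokenizer directly returns a dict-like object for model input
batch = tokenizer(text)
print(batch["input_ids"])
print(batch["attention_mask"])
```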

---

### Common Hugging Face Tokenizers

1. GPT family

   * `"gpt2"`
   * `"distilgpt2"`

2. BERT family

   * `"bert-base-uncased"`
   * `"bert-base-cased"`
   * `"bert-base-multilingual-cased"`

3. RoBERTa family

   * `"roberta-base"`
   * `"roberta-large"`

4. Distil models

   * `"distilbert-base-uncased"`
   * `"distilroberta-base"`

5. Others

   * `"albert-base-v2"`
   * `"xlm-roberta-base"`
   * `"t5-small"`, `"t5-base"`
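
Any of the names above can be passed to `AutoTokenizer.from_pretrained`. As a rough illustration of how much the result depends on the model, the sketch below tokenizes one sentence with `gpt2` and `bert-base-uncased`. It assumes both checkpoints can be downloaded. The BERT tokenizer lowercases the text and wraps it in `[CLS]` and `[SEP]` special tokens, while GPT-2 adds none.

```python
from transformers import AutoTokenizer

text = "Hello, how are you?"
tokens_by_model = {}  # checkpoint name -> token strings, including special tokens

for name in ["gpt2", "bert-base-uncased"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode(text)  # adds special tokens if the model defines them
    tokens_by_model[name] = tok.convert_ids_to_tokens(ids)
    print(name, tokens_by_model[name])
```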

---

## Differences Between the Two

| Feature            | Tiktoken (OpenAI)                            | Hugging Face (Transformers)                          |
| ------------------ | -------------------------------------------- | ---------------------------------------------------- |
| Supported models   | OpenAI GPT models only                       | Any supported model (GPT-2, BERT, etc.)              |
| Footprint          | Lightweight, very fast                       | Heavier, installs the full Transformers library      |
| Tokenization style | Byte-level BPE encodings (e.g. `cl100k_base`) | Model-specific (BPE, WordPiece, SentencePiece, etc.) |
| Use case           | Token counting, cost calculation             | Research, training, fine-tuning                      |

---
