In [None]:
# 📦 Install required packages
%pip install tiktoken

# ✂️ Tokenization in Vector Databases
Tokenization is the process of breaking text into smaller pieces called tokens. It is a required preprocessing step, because embedding models and LLMs operate on token IDs rather than raw text.

## 📌 Why Tokenization Matters
- Counting tokens lets you check that input fits within a model's token limit (e.g., OpenAI models cap the number of tokens per request)
- Impacts cost and performance (OpenAI bills API usage per token)
- Affects how text is chunked and embedded
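
The first two points come down to simple arithmetic once you have a token count. The sketch below is illustrative only: `fits_context` and `estimate_cost` are hypothetical helpers, and the limit and price constants are placeholders, not real OpenAI values. Check the current model documentation and pricing page for actual numbers.

```python
def fits_context(num_tokens, max_context_tokens, reserved_for_output=0):
    """Return True if the prompt plus the tokens reserved for the reply fit the limit."""
    return num_tokens + reserved_for_output <= max_context_tokens


def estimate_cost(num_tokens, price_per_1k_tokens):
    """Estimate cost for a per-1,000-token price."""
    return num_tokens / 1000 * price_per_1k_tokens


# Placeholder values for illustration only (assumptions, not real limits or prices)
HYPOTHETICAL_CONTEXT_LIMIT = 4096
HYPOTHETICAL_PRICE_PER_1K = 0.001

prompt_tokens = 3500  # in practice: len(enc.encode(prompt_text))

ok_small_reply = fits_context(prompt_tokens, HYPOTHETICAL_CONTEXT_LIMIT, reserved_for_output=500)
ok_large_reply = fits_context(prompt_tokens, HYPOTHETICAL_CONTEXT_LIMIT, reserved_for_output=1000)
cost = estimate_cost(prompt_tokens, HYPOTHETICAL_PRICE_PER_1K)

print(f"Fits with 500 output tokens reserved:  {ok_small_reply}")
print(f"Fits with 1000 output tokens reserved: {ok_large_reply}")
print(f"Estimated prompt cost: {cost:.4f}")
```

Reserving room for the model's reply matters because the limit usually covers the prompt and the completion together.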

Token ≠ Word: a word like `ChatGPT` may be split into several tokens, and the exact split depends on the model's tokenizer.

In [None]:
# Example: Tokenizing text using OpenAI's tiktoken
import tiktoken

# Look up the encoding that matches the target model (gpt-3.5-turbo uses cl100k_base)
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

sample_text = "LangChain is a framework to build LLM applications."
tokens = enc.encode(sample_text)

print(f"Original text: {sample_text}")
print(f"Number of tokens: {len(tokens)}")
print(f"Token IDs: {tokens}")

## 🧠 How It Connects to Vector DBs
- Most embedding models tokenize their input internally and accept only a limited number of tokens per input
- Counting tokens lets you split long documents into chunks that fit within that limit
- Knowing token counts helps optimize prompt design and chunking logic
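
One common way to act on these points is to chunk by token count rather than by characters or words. The sketch below is one possible approach, not a prescribed method: `chunk_tokens` is a hypothetical helper, and the overlap between neighboring chunks is an assumption added here to preserve context across boundaries. It works on any list of token IDs, so the demo uses stand-in integers instead of real tokens.

```python
def chunk_tokens(tokens, max_tokens, overlap=0):
    """Split token IDs into windows of at most max_tokens.

    Consecutive windows share `overlap` tokens.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once a window reaches the end, to avoid a redundant tail chunk
        if start + max_tokens >= len(tokens):
            break
    return chunks


# Stand-in token IDs; in the notebook you would pass enc.encode(long_text)
token_ids = list(range(10))
chunks = chunk_tokens(token_ids, max_tokens=4, overlap=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With real tokens, each chunk can be turned back into text with `enc.decode(chunk)` before it is sent to an embedding model and stored in the vector database.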

## ✅ Summary
- Tokenization breaks text into machine-readable pieces
- Important for embedding, chunking, and cost control
- Tools like `tiktoken` let you count and manage tokens for OpenAI models; other models ship their own tokenizers, so counts for the same text can differ