# **Level 1: The Origins — Intro to LLMs & Chatbots**

## **Section 2: Introduction to Language Models**

### **Part 3: Tokens & Tokenization**

---

Before we can understand how AI models like ChatGPT process language, we need to appreciate a simple but crucial fact: **computers don’t understand human language the way we do**.

We see language as sentences, ideas, and meaning. Computers, on the other hand, deal with numbers and symbols. To bridge that gap, the first step in building modern AI systems that understand text is **breaking down language into smaller, manageable pieces**. These pieces are called **tokens**.

---

### **What are Tokens?**

In simple terms, a **token** is a unit of text that the model processes. Depending on the model and its design, a token can be:

* A full word (e.g., "cat")
* Part of a word (e.g., "inter" and "national" from "international")
* Punctuation (e.g., ".")
* Special symbols (e.g., `<|endoftext|>`)

Tokens are the building blocks of language for AI models.

---

### **Why Not Just Use Whole Words?**

Language is complex. Words can be long, short, combined, or made-up. If we treated only whole words as units, the model would struggle with:

* Rare words
* Misspellings
* New words never seen before

Instead, breaking text into smaller chunks (tokens) allows the model to handle language flexibly. Even if it has never seen the exact word "antidisestablishmentarianism," it can split the word into familiar pieces and work with those, rather than treating the whole word as unknown.
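A minimal sketch of the difference, assuming two tiny hand-made vocabularies. `WORD_VOCAB`, `SUBWORD_VOCAB`, and both function names are invented for illustration, and the greedy longest-match split is a simplification rather than any particular model's tokenizer:

```python
# Toy vocabularies, invented for illustration only.
WORD_VOCAB = {"the", "establishment", "of", "a", "church"}
SUBWORD_VOCAB = {"anti", "dis", "establish", "ment", "arian", "ism"}


def word_level(word):
    """Whole-word lookup: anything outside the vocabulary becomes <unk>."""
    return [word] if word in WORD_VOCAB else ["<unk>"]


def subword_level(word, vocab=SUBWORD_VOCAB):
    """Greedy longest-match split; unmatched characters become 1-char tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary
        # (or until only a single character is left).
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        tokens.append(word[start:end])
        start = end
    return tokens


rare_word = "antidisestablishmentarianism"
print(word_level(rare_word))     # ['<unk>']
print(subword_level(rare_word))  # ['anti', 'dis', 'establish', 'ment', 'arian', 'ism']
```

The whole-word approach discards the rare word entirely, while the subword approach keeps six pieces the model has seen before.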

---

### **How is Text Broken into Tokens?**

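This process is called **tokenization**. A program called a tokenizer splits text into tokens using a fixed vocabulary and a set of rules, both of which are typically learned from a large body of text before the model itself is trained.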

Different models use different tokenization strategies:

* Some use **WordPiece** (common in BERT models)
* Others use **Byte Pair Encoding (BPE)** (common in GPT models)
* Some use **SentencePiece**, a toolkit that applies BPE or a unigram method directly to raw text (common in multilingual models)

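These methods aim to balance efficiency and flexibility: common words stay whole so sequences remain short, while rare words are split into smaller pieces so the vocabulary stays a manageable size.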
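To give a feel for how such a vocabulary can be learned, here is a toy sketch of the BPE training loop: start from single characters and repeatedly merge the most frequent adjacent pair. The word counts and the names `learn_bpe_merges` and `merge_pair` are made up for illustration; real tokenizers add details such as byte-level handling and word-boundary markers that this sketch leaves out.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Return `symbols` with every adjacent occurrence of `pair` fused."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(word_counts, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} mapping."""
    # Every word starts as a sequence of single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        corpus = {merge_pair(s, best): c for s, c in corpus.items()}
    return merges, corpus


# Hypothetical word frequencies from a tiny corpus.
word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, corpus = learn_bpe_merges(word_counts, num_merges=2)
print(merges)  # [('e', 's'), ('es', 't')]
```

After two merges, the frequent ending "est" has become a single symbol, which is how common fragments end up as tokens of their own.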

---

**Illustration Example:**

Take the sentence:
*"I love international collaborations."*

A tokenization algorithm might break it down like this:

\[`I`, `love`, `inter`, `national`, `collaborations`, `.`]

Notice how:

* "international" becomes two tokens: "inter" and "national"
* Punctuation is kept as its own token

Alternatively, depending on the tokenizer, it might also look like:
\[`I`, `love`, `international`, `collaborations`, `.`]

The key takeaway: tokenization isn't always perfectly intuitive to humans, but it's optimized for the model to handle language efficiently.
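The two splits above can be reproduced with a small sketch. It assumes a simple regex for separating words from punctuation and a hypothetical `splits` table standing in for a learned vocabulary; the integer IDs are arbitrary and only show that each distinct token maps to a number.

```python
import re

sentence = "I love international collaborations."

# First pass: runs of word characters, or single punctuation marks.
pieces = re.findall(r"\w+|[^\w\s]", sentence)


def apply_splits(pieces, splits):
    """Replace each piece found in `splits` with its listed sub-pieces."""
    tokens = []
    for piece in pieces:
        tokens.extend(splits.get(piece, [piece]))
    return tokens


# Hypothetical tokenizer A splits "international"; tokenizer B keeps it whole.
tokens_a = apply_splits(pieces, {"international": ["inter", "national"]})
tokens_b = apply_splits(pieces, {})

# Give every distinct token an arbitrary integer ID.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens_a + tokens_b)))}
ids_a = [vocab[tok] for tok in tokens_a]
ids_b = [vocab[tok] for tok in tokens_b]

print(tokens_a)  # ['I', 'love', 'inter', 'national', 'collaborations', '.']
print(tokens_b)  # ['I', 'love', 'international', 'collaborations', '.']
print(len(ids_a), len(ids_b))  # 6 5
```

The same sentence costs six tokens with one tokenizer and five with the other, which is why token counts are always tied to a specific tokenizer.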

---

### **Why Does Token Count Matter?**

Modern AI models convert each token into a numerical representation before doing any computation, and they generate their output one token at a time. However, the total number of tokens a model can consider at once is capped by a **maximum token limit**, known as the **context window**.

This limit defines how much text the model can handle at once, counting both your input and the model's generated output. For example:

* The original GPT-3.5 models had a limit of around **4,000 tokens**
* GPT-4 can handle up to **128,000 tokens** in some versions (GPT-4 Turbo)

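If your text exceeds this limit, the outcome depends on the API or application wrapped around the model:

* The request may be rejected with an error
* The beginning or end of the text may be truncated to fit
* Anything that is cut off is lost as context, so the model cannot take it into account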
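As a rough sketch of the truncation case, assume the text has already been converted to a list of token IDs. The function `truncate_to_limit` and its `keep` parameter are hypothetical names, not part of any library:

```python
def truncate_to_limit(token_ids, max_tokens, keep="end"):
    """Cut a token list down to `max_tokens`, keeping the start or the end."""
    if keep not in ("start", "end"):
        raise ValueError("keep must be 'start' or 'end'")
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    if len(token_ids) <= max_tokens:
        return list(token_ids)  # already fits, nothing is lost
    if keep == "start":
        return list(token_ids[:max_tokens])
    # Keep the most recent tokens and drop the oldest ones.
    return list(token_ids[len(token_ids) - max_tokens:])


conversation = list(range(10))  # stand-in for 10 token IDs
print(truncate_to_limit(conversation, 4))                # [6, 7, 8, 9]
print(truncate_to_limit(conversation, 4, keep="start"))  # [0, 1, 2, 3]
```

A chatbot might keep the end so that the latest messages survive, while a document reader might keep the start. Either way, the dropped tokens are invisible to the model.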

This is why understanding tokens is important, especially when building applications or chatbots that work with long text.

---

**Real-World Implication:**

Imagine you're building a chatbot to summarize legal contracts. If the contract is too long and exceeds the token limit, the chatbot won’t see the entire document — leading to incomplete or inaccurate responses.
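One way an application might cope, sketched here under simple assumptions, is to split the tokenized document into windows that each fit the limit, with a small overlap so that text at a boundary appears in two windows. The function name `chunk_tokens` and the numbers used are illustrative only:

```python
def chunk_tokens(tokens, max_tokens, overlap=0):
    """Split `tokens` into windows of at most `max_tokens`, sharing `overlap` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


contract = list(range(10))  # stand-in for a 10-token contract
print(chunk_tokens(contract, max_tokens=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each window could then be processed on its own, at the cost of the model never seeing the whole contract in a single pass.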

---

### **Quick Clarifications:**

* **Tokens ≠ Characters.** A single token might contain multiple characters, or a single character might be its own token.
* **Token count ≠ Word count.** A sentence with five words may have 5, 7, or more tokens depending on the tokenizer.
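A quick way to see that the three counts differ, using a deliberately simple regex tokenizer as a stand-in for a real one (actual subword tokenizers will give different numbers):

```python
import re

text = "Tokenization isn't trivial."

characters = len(text)
words = text.split()
# Toy tokenizer: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", text)

print(characters)   # 27
print(len(words))   # 3
print(tokens)       # ['Tokenization', 'isn', "'", 't', 'trivial', '.']
print(len(tokens))  # 6
```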

---

### **Summary:**

* Tokens are the basic units of text that AI models process.
* Tokenization breaks text into these chunks.
* The token limit defines how much information a model can process at once.
* Understanding tokens helps you design better AI applications and prevents errors due to exceeding context limits.

---

In the next part, we'll build on this by exploring the **Transformer**, the engine that processes these tokens and enables models to understand language.