# Interpreting Words: From Strings to Tokens

> *Before we dive into Transformers, let’s master the humble n-gram.*

---

## 1. What Is a “Word” in NLP?

A “word” can be represented in at least three simple ways:  
1. **String of characters** (e.g., `"cat"` → `['c','a','t']`)  
2. **Index in a vocabulary** (e.g., `"cat"` → `537`)  
3. **Dense vector embedding** (e.g., `"cat"` → `[0.12, -0.03, …]`)  

Early models such as n-grams treat words purely as symbolic tokens.
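
As a rough sketch, the three representations can be written in plain Python. The toy vocabulary, its indices, and the vector values are made up for illustration (the embedding is random, not trained), so only the shape of each representation matters here.

```python
import random

word = "cat"

# 1. String of characters
chars = list(word)                      # ['c', 'a', 't']

# 2. Index in a vocabulary (hypothetical toy vocabulary, sorted alphabetically)
vocab = sorted({"cat", "dog", "salad"})
word_to_index = {w: i for i, w in enumerate(vocab)}
index = word_to_index[word]             # 0, because "cat" sorts first

# 3. Dense vector embedding: one row of floats per vocabulary entry.
#    The values are random placeholders; a real model would learn them.
rng = random.Random(0)
embedding_dim = 4                       # assumed size, chosen only for the demo
embeddings = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab]
vector = embeddings[index]

print(chars)
print(index)
print(len(vector))
```

An n-gram model only ever uses the first two forms: it compares tokens for equality and never looks at a vector.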

---

## 2. The n-Gram Model

An n-gram model predicts the next word (token) based only on the previous *n − 1* words:

- **Unigram** (n=1): no history, just the overall frequency of each word  
- **Bigram** (n=2): conditions on the one previous word  
- **Trigram** (n=3): conditions on the two previous words  
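
To make the three cases concrete, here is a small sketch that slices a token list into n-grams. The helper name `ngrams` is our own choice for this illustration, not part of any library. In each tuple, the last element is the word being predicted and the elements before it are the history.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I like cats .".split()

print(ngrams(tokens, 1))  # [('I',), ('like',), ('cats',), ('.',)]
print(ngrams(tokens, 2))  # [('I', 'like'), ('like', 'cats'), ('cats', '.')]
print(ngrams(tokens, 3))  # [('I', 'like', 'cats'), ('like', 'cats', '.')]
```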

> **Note on smoothing:** In real corpora we use techniques like Laplace or Kneser–Ney smoothing to avoid zero probabilities, but we’ll ignore smoothing in this toy example.
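
For reference only, a minimal sketch of Laplace (add-one) smoothing for bigrams might look like the following. The function name and its parameters are assumptions made for this illustration. The idea is to add 1 to every bigram count and add the vocabulary size V to the denominator, so unseen pairs get a small non-zero probability.

```python
from collections import Counter

def laplace_bigram_prob(w1, w2, bigram_counts, history_counts, vocab_size):
    """P(w2 | w1) with add-one smoothing: (count(w1, w2) + 1) / (count(w1) + V)."""
    return (bigram_counts[(w1, w2)] + 1) / (history_counts[w1] + vocab_size)

tokens = "I like cats . I like dogs .".split()
pairs = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(pairs)                    # count(w1, w2)
history_counts = Counter(w1 for w1, _ in pairs)   # count(w1) as a history word
vocab_size = len(set(tokens))                     # 5 distinct tokens

seen = laplace_bigram_prob("like", "cats", bigram_counts, history_counts, vocab_size)
unseen = laplace_bigram_prob("like", "I", bigram_counts, history_counts, vocab_size)
print(seen)    # (1 + 1) / (2 + 5)
print(unseen)  # (0 + 1) / (2 + 5), non-zero even though "like I" never occurs
```

The smoothed probabilities for a given history still sum to 1 over the whole vocabulary, which is why V appears in the denominator.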

---

## 3. Bigram Demo Code: What’s Happening and Why

Below is a minimal Python script that demonstrates:

1. **How we collect statistics** on word pairs (bigrams) from a tiny corpus.  
2. **How we convert counts into probabilities**, forming a simple language model.  
3. **How we sample** (“generate”) the next word given a history.




In [1]:
# Minimal bigram language model demo
from collections import Counter, defaultdict  # count occurrences, default zero for missing keys
import random                               # random sampling utilities

def train(bigrams):
    """
    Given a list of (w1, w2) pairs, count how often each appears,
    then compute P(w2 | w1) = count(w1, w2) / total_count(w1).
    Returns a dict mapping (w1, w2) → probability.
    """
    counts = Counter(bigrams)
    totals = defaultdict(int)
    for w1, _ in bigrams:
        totals[w1] += 1  # one per occurrence of w1 as a history word
    return {
        (w1, w2): c / totals[w1]
        for (w1, w2), c in counts.items()
    }

# Prepare a tiny corpus and extract all adjacent word pairs
corpus = "I like cats . I like dogs .".split()
bigrams = list(zip(corpus, corpus[1:]))

# Train the model: learn P(next_word | current_word)
prob = train(bigrams)

def next_word(w1):
    """
    Sample a next word given the previous word w1,
    using the learned bigram probabilities.
    """
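    # Filter for bigrams that start with w1
    choices = [(w2, p) for (a, w2), p in prob.items() if a == w1]
    if not choices:
        # w1 never appears as a history word in the training corpus
        raise KeyError(f"no bigram starts with {w1!r}")
    words, ps = zip(*choices)
    return random.choices(words, weights=ps)[0]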

# Try it out! Run and see which words are generated.
print(next_word("like"))   # → "cats" or "dogs"
print(next_word("I"))      # → "like"
print(next_word("cats"))   # → "."

dogs
like
.
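
Sampling one word at a time extends naturally to generating a whole sentence: feed each sampled word back in as the next history. The sketch below is self-contained and uses a slightly different layout from the cell above (a per-word `Counter` of followers instead of a flat probability dict). The names `build_successors` and `generate` are our own, chosen for this illustration.

```python
import random
from collections import Counter, defaultdict

def build_successors(tokens):
    """Map each word to a Counter of the words that follow it."""
    successors = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        successors[w1][w2] += 1
    return successors

def generate(successors, start, max_words=10, rng=random):
    """Sample up to max_words words, stopping after '.' or at a dead end."""
    words = [start]
    while len(words) < max_words and words[-1] != "." and words[-1] in successors:
        followers = successors[words[-1]]
        # Raw counts work as weights; random.choices treats them as relative.
        nxt = rng.choices(list(followers), weights=list(followers.values()))[0]
        words.append(nxt)
    return words

tokens = "I like cats . I like dogs .".split()
successors = build_successors(tokens)
print(" ".join(generate(successors, "I")))  # "I like cats ." or "I like dogs ."
```

With such a tiny corpus the model can only replay the two training sentences, which already hints at the limits discussed next.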


> **Why n-grams alone aren’t enough**  
>  
> When words are treated as **isolated tokens**—whether as raw strings or as integer indexes—there is no built-in notion of similarity or meaning.  
>  
> - “dog” and “cat” are just two different symbols, no more related than “cat” and “salad.”  
> - There’s no sense of context beyond the fixed window of n-grams.  
>  
> Ultimately, we want a **model of word meaning** that can tell us:  
>  
> - Which words are similar in meaning (e.g. cat ≈ dog)  
> - Which words are opposites (e.g. cold ↔ hot)  
> - Which carry positive or negative connotations (e.g. happy vs. sad)  
> - How different verbs describe the same event from different perspectives (buy / sell / pay: if I buy something from you, you have probably sold it to me, and I have likely paid you)  
>  
> A good semantic model should let us draw inferences for tasks like question answering, dialogue, or any meaning-driven application.  
>  
> In the next section, **02_Lexical_Semantics_Definitions**, we’ll explore foundational concepts from **lexical semantics**, the linguistic study of word meaning.