# Zero to Hero: Understanding Embeddings

Welcome! In this lesson, we will demystify "Embeddings".
We will strip away the complex libraries and build our own understanding from scratch.

**Goal**: Understand how a computer turns "Text" into "Numbers".

## Level 1: The Primitive Way (Word IDs)
The simplest way a computer understands words is by giving each word a unique ID number.
Imagine a dictionary where every word has a page number.

In [2]:
# A simple dictionary
word_to_id = {
    "apple": 1,
    "banana": 2,
    "dog": 3,
    "cat": 4
}

def get_ids(sentence):
    """Look up the ID of each word; unknown words get 0."""
    words = sentence.lower().split()
    return [word_to_id.get(w, 0) for w in words]

print(f"'Apple' -> {get_ids('apple')}")
print(f"'Dog'   -> {get_ids('dog')}")

'Apple' -> [1]
'Dog'   -> [3]


### The Problem
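Look at the numbers above (`1` and `3`).
Are they close? Yes.
Are "Apple" and "Dog" similar? No.
The IDs only reflect the order in which we typed the words into the dictionary, so the gap between two IDs tells us nothing about the words themselves.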

**Conclusion**: Simply assigning numbers doesn't capture *meaning*.
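
To see the problem in numbers, here is a minimal sketch that treats the gap between two IDs as a "distance". The `id_distance` helper is a hypothetical name introduced only for this illustration; the dictionary is the same one from the cell above.

```python
# Toy vocabulary from Level 1 (the IDs are arbitrary labels).
word_to_id = {"apple": 1, "banana": 2, "dog": 3, "cat": 4}

def id_distance(word_a, word_b):
    """Absolute difference between two word IDs (a hypothetical 'distance')."""
    return abs(word_to_id[word_a] - word_to_id[word_b])

print(f"apple vs banana: {id_distance('apple', 'banana')}")  # fruit vs fruit
print(f"banana vs dog:   {id_distance('banana', 'dog')}")    # fruit vs animal
print(f"apple vs cat:    {id_distance('apple', 'cat')}")     # fruit vs animal
```

"banana" ends up exactly as close to "dog" as it is to "apple", which shows that the gap between IDs carries no information about meaning.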

## Level 2: The "Toy Model" (Dense Vectors)
To capture meaning, we need **Attributes** (Dimensions). 
Instead of 1 number, let's give each word a LIST of numbers representing its features.

Let's build a fake AI model that rates words on 3 features:
1.  **Fruitiness**
2.  **Dog-ness**
3.  **Cat-ness**

**How the Code Works:**
The function `toy_embedding_model` below checks whether certain keywords appear anywhere in the input text. It uses substring matching, so "barks" counts as a match for "bark".
- If it sees "apple", "banana", or "fruit", it sets the **first number** (Fruitiness) to `0.9`.
- If it sees "dog", "puppy", or "bark", it sets the **second number** (Dog-ness) to `0.9`.
- If it sees "cat", "kitten", or "meow", it sets the **third number** (Cat-ness) to `0.9`.

This converts text into a meaningful list of numbers!

In [1]:
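def toy_embedding_model(text):
    """Return a 3-number vector [Fruitiness, Dog-ness, Cat-ness] for the text."""
    text = text.lower()
    # Start with every feature switched off: [Fruitiness, Dog-ness, Cat-ness]
    vector = [0.0, 0.0, 0.0]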
    
    # 1. Check for Fruits
    if any(w in text for w in ["apple", "banana", "fruit"]):
        vector[0] = 0.9
    
    # 2. Check for Dogs
    if any(w in text for w in ["dog", "puppy", "bark"]):
        vector[1] = 0.9

    # 3. Check for Cats
    if any(w in text for w in ["cat", "kitten", "meow"]):
        vector[2] = 0.9
        
    return vector

# Let's embed some sentences!
vec_apple = toy_embedding_model("I ate an apple")
vec_dog   = toy_embedding_model("My puppy barks")
vec_cat   = toy_embedding_model("The kitten meows")

print(f"Apple Vector: {vec_apple}")
print(f"Dog Vector:   {vec_dog}")
print(f"Cat Vector:   {vec_cat}")

Apple Vector: [0.9, 0.0, 0.0]
Dog Vector:   [0.0, 0.9, 0.0]
Cat Vector:   [0.0, 0.0, 0.9]
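
A sentence can switch on more than one feature at once. The sketch below restates the toy model in a compact, table-driven form (the `KEYWORDS` table and the `toy_embed` name are illustrative and not part of the original cell) and tries it on a mixed sentence. It also shows a limitation of substring matching: an unrelated word such as "category" contains "cat".

```python
# Compact restatement of the toy model; KEYWORDS and toy_embed are illustrative names.
KEYWORDS = [
    ["apple", "banana", "fruit"],  # feature 0: Fruitiness
    ["dog", "puppy", "bark"],      # feature 1: Dog-ness
    ["cat", "kitten", "meow"],     # feature 2: Cat-ness
]

def toy_embed(text):
    """Set a feature to 0.9 if any of its keywords appears as a substring."""
    text = text.lower()
    return [0.9 if any(w in text for w in words) else 0.0 for words in KEYWORDS]

print(toy_embed("My dog chased the cat"))  # Dog-ness and Cat-ness both fire
print(toy_embed("Pick a category"))        # "category" contains "cat"
```

The second call is a false positive: hand-written keyword rules are easy to read but easy to fool.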


## Level 3: Similarity Search
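Now that we have vectors, we can measure how similar two of them are.
We use a math operation called the **Dot Product**: multiply the numbers at matching positions, then add up the results. A higher score means the two vectors share more features.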

In [2]:
def calculate_similarity(vec_a, vec_b, label_b="Target"):
    """Print the dot product of two vectors step by step and return it."""
    print(f"\nCalculating Similarity with {label_b}...")
    print(f"  Query Vector:  {vec_a}")
    print(f"  {label_b} Vector: {vec_b}")

    dot_product = 0.0
    log_steps = []
    
    for val_a, val_b in zip(vec_a, vec_b):
        step_val = val_a * val_b
        dot_product += step_val
        log_steps.append(f"({val_a:.1f} * {val_b:.1f})")
        
    # Show the math
    math_str = " + ".join(log_steps)
    print(f"  Math: {math_str} = {dot_product:.2f}")
    return dot_product

# Let's search for "Dog"
query = "I saw a dog"
query_vec = toy_embedding_model(query)

print(f"Query: '{query}' {query_vec}")
print("------------------------------------------------")

score_apple = calculate_similarity(query_vec, vec_apple, "Apple")
score_dog   = calculate_similarity(query_vec, vec_dog, "Dog")

print("\nSummary Results:")
print(f"Similarity to Apple: {score_apple:.2f}")
print(f"Similarity to Dog:   {score_dog:.2f}")

Query: 'I saw a dog' [0.0, 0.9, 0.0]
------------------------------------------------

Calculating Similarity with Apple...
  Query Vector:  [0.0, 0.9, 0.0]
  Apple Vector: [0.9, 0.0, 0.0]
  Math: (0.0 * 0.9) + (0.9 * 0.0) + (0.0 * 0.0) = 0.00

Calculating Similarity with Dog...
  Query Vector:  [0.0, 0.9, 0.0]
  Dog Vector: [0.0, 0.9, 0.0]
  Math: (0.0 * 0.0) + (0.9 * 0.9) + (0.0 * 0.0) = 0.81

Summary Results:
Similarity to Apple: 0.00
Similarity to Dog:   0.81


### The Aha! Moment
The computer scored "I saw a dog" higher against "My puppy barks" (`0.81`) than against "I ate an apple" (`0.00`), even though the query and "My puppy barks" do not share a single word.

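**This is the core idea behind Vector Databases!** They store a vector for every document and return the documents whose vectors score highest against the query vector. Real systems replace our hand-written keyword rules with a trained model.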
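
To make that concrete, here is a minimal sketch of a search over a tiny in-memory "database". It assumes the same toy model as Level 2, restated compactly so the cell runs on its own; the `dot`, `index`, and `search` names are hypothetical helpers for this illustration. Real vector databases use far more sophisticated indexing, so treat the brute-force loop below as a picture of the ranking step only.

```python
def toy_embedding_model(text):
    """Toy model from Level 2: [Fruitiness, Dog-ness, Cat-ness]."""
    text = text.lower()
    groups = [
        ["apple", "banana", "fruit"],
        ["dog", "puppy", "bark"],
        ["cat", "kitten", "meow"],
    ]
    return [0.9 if any(w in text for w in g) else 0.0 for g in groups]

def dot(vec_a, vec_b):
    """Dot product: multiply matching positions and sum the results."""
    return sum(a * b for a, b in zip(vec_a, vec_b))

# Hypothetical in-memory "database": each document is stored with its vector.
documents = ["I ate an apple", "My puppy barks", "The kitten meows"]
index = [(doc, toy_embedding_model(doc)) for doc in documents]

def search(query, top_k=2):
    """Return the top_k (score, document) pairs, highest score first."""
    query_vec = toy_embedding_model(query)
    scored = [(dot(query_vec, vec), doc) for doc, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

for score, doc in search("I saw a dog"):
    print(f"{score:.2f}  {doc}")
```

"My puppy barks" comes out on top with a score of 0.81, while the other documents score 0.00.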