# How Embeddings Work Under the Hood

## 1. The Journey from Text to Numbers

Embeddings transform text into numbers through several steps:

1. **Tokenization**: Breaking text into smaller pieces
   - Words: "hello world" → ["hello", "world"]
   - Subwords: "playing" → ["play", "##ing"]
   - Characters: sometimes used for languages written without spaces between words, such as Chinese

2. **Token IDs**: Each token gets mapped to a number
   - Example: "hello" → 15234 (illustrative; the actual ID depends on the model's vocabulary)
   - These mappings are stored in the model's vocabulary

3. **Neural Network Processing**:
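   - Token vectors pass through multiple transformer layers
   - Different layers tend to capture different aspects (syntax, semantics, context)
   - The final layer outputs one vector per token; sentence-embedding models then pool these (often by averaging) into a single embedding vector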

4. **Vector Space**: The final embedding places similar meanings close together
   - Semantic features are spread across many dimensions; a single dimension rarely has a clear meaning on its own
   - Typical sizes: 384, 768, or 1024 dimensions
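
The four steps above can be sketched end to end with a toy pipeline. This is only an illustration under loud assumptions: the vocabulary and its IDs are made up, tokenization is plain whitespace splitting, and a random lookup table stands in for trained transformer layers. The names `vocab`, `tokenize`, `to_ids`, and `embed` are hypothetical helpers, not part of any library.

```python
import numpy as np

# Toy vocabulary (token -> ID). These IDs are invented for illustration;
# real models ship vocabularies with tens of thousands of entries.
vocab = {"[UNK]": 0, "i": 1, "love": 2, "machine": 3, "learning": 4}

def tokenize(text):
    """Step 1: split on whitespace (real tokenizers apply subword rules)."""
    return text.lower().split()

def to_ids(tokens):
    """Step 2: map each token to its ID, falling back to [UNK]."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

# Step 3 stand-in: a random table instead of trained transformer layers.
rng = np.random.default_rng(seed=0)
embedding_table = rng.normal(size=(len(vocab), 8))  # 8 dims for readability

def embed(text):
    """Steps 3-4: look up one vector per token, then mean-pool into one vector."""
    ids = to_ids(tokenize(text))
    token_vectors = embedding_table[ids]  # shape: (num_tokens, 8)
    return token_vectors.mean(axis=0)     # shape: (8,)

tokens = tokenize("I love machine learning")
print("Tokens:", tokens)
print("Token IDs:", to_ids(tokens))
print("Embedding shape:", embed("I love machine learning").shape)
```

Because the table is random, these vectors carry no meaning; the point is only the data flow from string to tokens to IDs to a single fixed-size vector.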

In [None]:
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer
import numpy as np

# 1. Tokenization
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
text = "I love machine learning"
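tokens = tokenizer.tokenize(text)   # subword strings, without special tokens
token_ids = tokenizer.encode(text)  # integer IDs; by default this adds special tokens such as [CLS] and [SEP]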

print("Original text:", text)
print("Tokens:", tokens)
print("Token IDs:", token_ids)

# 2. Getting Embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embedding = model.encode(text)

print("\nEmbedding shape:", embedding.shape)
print("First 5 dimensions:", embedding[:5])

## 2. Understanding Embedding Dimensions

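An embedding vector encodes many semantic features at once, for example:
- Word types (noun, verb, etc.)
- Topics (technology, nature, etc.)
- Sentiment (positive, negative)
- And many other abstract features

These features are usually distributed across many dimensions rather than assigned one per dimension, so individual values are hard to interpret in isolation.

More dimensions allow a more nuanced representation, but they also bring:
- More computational cost
- More storage needed
- Risk of overfitting when a downstream model is trained on little data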
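
The cost side of this trade-off is easy to estimate. The sketch below assumes float32 values (4 bytes each) and a brute-force search that computes one dot product per stored vector; `search_cost` is a hypothetical helper, and real systems with approximate indexes or quantization will behave differently.

```python
def search_cost(num_vectors, dims, bytes_per_value=4):
    """Return (storage_bytes, multiply_adds_per_query) for brute-force search.

    Assumes uncompressed vectors and one dot product per stored vector.
    """
    storage_bytes = num_vectors * dims * bytes_per_value
    multiply_adds = num_vectors * dims
    return storage_bytes, multiply_adds

for dims in (384, 768, 1024):
    storage, ops = search_cost(num_vectors=1_000_000, dims=dims)
    print(f"{dims:>4} dims: {storage / 1e9:.2f} GB, "
          f"{ops / 1e6:.0f}M multiply-adds per query")
```

Both numbers grow linearly with the dimension, so doubling the vector size roughly doubles storage and brute-force query cost.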

#### Let's see how embeddings group similar concepts


In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Create word groups with clear relationships
words = {
    'food': ['pizza', 'burger', 'pasta', 'sushi'],
    'drinks': ['coffee', 'tea', 'juice', 'water'],
    'colors': ['red', 'blue', 'green', 'yellow']
}

# Get embeddings for all words
all_words = [word for group in words.values() for word in group]
embeddings = model.encode(all_words)

# Reduce to 2D for visualization
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

# Plot with different colors for each group
plt.figure(figsize=(10, 6))
colors = ['#FF9999', '#66B2FF', '#99FF99']
for (group_name, group_words), color in zip(words.items(), colors):
    # Groups were flattened in order, so each one occupies a contiguous slice
    start_idx = all_words.index(group_words[0])
    end_idx = start_idx + len(group_words)
    x = embeddings_2d[start_idx:end_idx, 0]
    y = embeddings_2d[start_idx:end_idx, 1]
    plt.scatter(x, y, c=color, label=group_name)

    # Add word labels
    for i, word in enumerate(group_words):
        plt.annotate(word, (x[i], y[i]))

def cosine_similarity(v1, v2):
    """Cosine of the angle between two vectors (1 = same direction)."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

plt.title("Word Embeddings Visualized in 2D")
plt.legend()
plt.grid(True)
plt.show()

# Show some similarity scores
print("\nSimilarity Scores (closer to 1 = more similar):")
print(f"pizza-burger: {cosine_similarity(model.encode('pizza'), model.encode('burger')):.3f}")
print(f"coffee-tea: {cosine_similarity(model.encode('coffee'), model.encode('tea')):.3f}")
print(f"pizza-coffee: {cosine_similarity(model.encode('pizza'), model.encode('coffee')):.3f}")

## 3. Why Different Models for Different Languages?

Language-specific models often perform better on their target language because:

1. **Vocabulary Coverage**: 
   - Better handling of language-specific words
   - Proper subword tokenization for morphologically rich languages

2. **Cultural Context**:
   - Understanding idioms and expressions
   - Proper handling of formal/informal speech

3. **Syntactic Structure**:
   - Word order differences
   - Grammar patterns
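
The vocabulary-coverage point can be illustrated with a toy greedy longest-match subword splitter in the style of WordPiece. Everything here is an assumption made for illustration: `wordpiece` is a simplified sketch rather than any library's tokenizer, and both vocabularies are tiny and invented.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split (simplified WordPiece style)."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabularies: one English-centric, one with German entries.
english_vocab = {"play", "##ing", "lie", "##b", "##ling", "##s", "##st", "##ad", "##t"}
german_vocab = {"lieblings", "##stadt", "gemüse"}

print(wordpiece("playing", english_vocab))         # ['play', '##ing']
print(wordpiece("lieblingsstadt", german_vocab))   # ['lieblings', '##stadt']
print(wordpiece("lieblingsstadt", english_vocab))  # 7 short fragments
print(wordpiece("gemüse", english_vocab))          # ['[UNK]']
```

With the English-centric vocabulary the German compound breaks into many short fragments and "gemüse" is lost entirely, which gives the model much less to work with than two meaningful pieces.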

This is why the code examples later in this notebook compare an English-focused model with a multilingual one.

## 4. Practical Tips for Using Embeddings

1. **Choosing Model Size**:
   - Smaller models (384 dims): Faster, good for simple tasks
   - Larger models (768+ dims): Better quality, slower

2. **Preprocessing**:
   - Clean text (remove noise)
   - Consistent casing
   - Handle special characters

3. **Storage Considerations**:
   - 384 dimensions × 4 bytes (float32) = 1,536 bytes, or about 1.5 KB per embedding
   - Plan database capacity accordingly
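
The preprocessing tip can be sketched with the standard library alone. The helper `clean_text` below is hypothetical and reflects one reasonable set of choices (Unicode NFKC normalization, replacing control and format characters, collapsing whitespace, optional lowercasing). Whether lowercasing helps depends on the model, since many tokenizers already normalize casing themselves.

```python
import re
import unicodedata

def clean_text(text, lowercase=True):
    """Normalize Unicode, replace control/format characters, collapse whitespace."""
    # NFKC folds compatibility forms, e.g. non-breaking spaces and combining accents
    text = unicodedata.normalize("NFKC", text)
    # Unicode categories starting with "C" cover control and format characters
    text = "".join(
        " " if unicodedata.category(ch).startswith("C") else ch for ch in text
    )
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower() if lowercase else text

print(clean_text("  Hello\u00a0 World\n\n"))          # hello world
print(clean_text("Gemu\u0308se\u200b ist  GUT"))      # gemüse ist gut
```

Applying the same cleaning to documents at indexing time and to queries at search time keeps the two sides comparable.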

--- 

# Understanding Embeddings in Language Models

Embeddings are numerical representations of text that capture semantic meaning. They convert words or sentences into vectors (lists of numbers) that can be compared mathematically.

Key points about embeddings:
- Similar texts should have similar vector representations
- The dimensionality and quality of embeddings affect performance
- Different embedding models are trained on different data
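
"Compared mathematically" usually means cosine similarity, which measures the angle between two vectors and ignores their length. A minimal sketch with hand-picked toy vectors (not real embeddings) shows the three reference points:

```python
import numpy as np

def cosine_similarity(v1, v2):
    """Dot product divided by the product of the vector lengths."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice as long
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a
d = -a                         # opposite direction

print(f"{cosine_similarity(a, b):.3f}")  # 1.000
print(f"{cosine_similarity(a, c):.3f}")  # 0.000
print(f"{cosine_similarity(a, d):.3f}")  # -1.000
```

Real sentence embeddings typically land somewhere between these extremes, so scores are most useful when compared against each other rather than read as absolute values.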

Let's explore this with some examples.

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Load different embedding models
english_model = SentenceTransformer('all-MiniLM-L6-v2')  # English-focused
german_model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)  # Multilingual; used here for German

## Example 1: Comparing Similar Sentences

Let's see how embeddings capture similarity between sentences in different languages:

In [None]:
def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

def compare_sentences(model, sent1, sent2):
    emb1 = model.encode(sent1)
    emb2 = model.encode(sent2)
    return cosine_similarity(emb1, emb2)

# Test pairs in different languages
english_pairs = [
    ("I love Berlin", "Berlin is my favorite city"),
    ("I love Berlin", "I hate vegetables")
]

german_pairs = [
    ("Ich liebe Berlin", "Berlin ist meine Lieblingsstadt"),
    ("Ich liebe Berlin", "Ich hasse Gemüse")
]

print("\nEnglish Model Results:")
for sent1, sent2 in english_pairs:
    sim = compare_sentences(english_model, sent1, sent2)
    print(f"{sent1} <-> {sent2}: {sim:.3f}")

print("\nMultilingual Model Results:")
for sent1, sent2 in german_pairs:
    sim = compare_sentences(german_model, sent1, sent2)
    print(f"{sent1} <-> {sent2}: {sim:.3f}")

## Example 2: Cross-lingual Capabilities

Let's compare how different models handle cross-lingual similarity:

In [None]:
cross_lingual_pairs = [
    ("The weather is nice today", "Das Wetter ist heute schön"),
    ("I need a coffee", "Ich brauche einen Kaffee"),
    ("Berlin is the capital of Germany", "Berlin ist die Hauptstadt von Deutschland")
]

print("Cross-lingual similarity:")
print("\nEnglish-only model:")
for en, de in cross_lingual_pairs:
    sim = compare_sentences(english_model, en, de)
    print(f"{en} <-> {de}: {sim:.3f}")

print("\nMultilingual model:")
for en, de in cross_lingual_pairs:
    sim = compare_sentences(german_model, en, de)
    print(f"{en} <-> {de}: {sim:.3f}")

## Example 3: Domain-Specific Comparisons

Let's see how models handle domain-specific terminology:

In [None]:
tourism_pairs = [
    ("Guided tour of the Brandenburg Gate", "Führung durch das Brandenburger Tor"),
    ("Skip the line tickets for museums", "Eintrittskarten ohne Anstehen für Museen"),
    ("Best restaurants in Berlin", "Beste Restaurants in Berlin")
]

print("Tourism domain comparisons:")
for en, de in tourism_pairs:
    en_sim = compare_sentences(english_model, en, de)
    de_sim = compare_sentences(german_model, en, de)
    print(f"\nPair: {en} <-> {de}")
    print(f"English model similarity: {en_sim:.3f}")
    print(f"Multilingual model similarity: {de_sim:.3f}")

## Key Takeaways

1. **Language Support**: Models trained on a specific language often perform better for that language
2. **Cross-lingual Capabilities**: Multilingual models generally handle cross-language comparisons better than single-language models
3. **Domain Relevance**: Consider your use case when choosing an embedding model

When building a RAG system:
- Choose embedding models that match your content languages
- Consider using specialized models for specific domains
- Test different models on your actual use cases and evaluate the results
- Balance performance vs computational cost
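
One way to make "test and evaluate" concrete is a small labeled set of queries, each with the document it should retrieve. The sketch below is a hypothetical harness, not a standard API: `top1_accuracy` is an illustrative name, and the toy arrays stand in for the output of `model.encode(...)` on your own queries and documents.

```python
import numpy as np

def top1_accuracy(query_vecs, doc_vecs, relevant_doc_idx):
    """Fraction of queries whose most similar document is the labeled one."""
    # Normalize rows so the matrix product yields cosine similarities
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = q @ d.T                 # shape: (num_queries, num_docs)
    best = scores.argmax(axis=1)     # index of the top document per query
    return float((best == np.asarray(relevant_doc_idx)).mean())

# Toy vectors standing in for real embeddings of documents and queries.
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_vecs = np.array([[0.9, 0.1], [0.1, 0.8], [0.2, 1.0]])
relevant = [0, 1, 2]  # the document each query is expected to retrieve

print(f"Top-1 accuracy: {top1_accuracy(query_vecs, doc_vecs, relevant):.2f}")
```

Running the same labeled set through each candidate model gives a like-for-like number to weigh against its size and speed.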