# LangChain Notebook (Part 3): Embeddings (in depth) — Types, Choices, and Code

This notebook continues from:
- Part 1: Readers + `Document` + Cleaning
- Part 2: Chunking + Tokenization + Context Window

Here we focus **only on embeddings**:
- What embeddings are (math + intuition)
- **Different types of embeddings** used in modern RAG/search
- How to use embeddings in LangChain (runnable examples)
- Similarity computation and debugging
- **Embedding cache** (production must-have)

> ✅ Runs even without API keys using `FakeEmbeddings`.
> Optional cells show Hugging Face or OpenAI embeddings.

## 0) Install & Imports

If you see import errors, run:
```python
%pip install -U langchain langchain-core langchain-community numpy
# Optional for Hugging Face local models:
%pip install -U sentence-transformers
# Optional for OpenAI embeddings:
%pip install -U langchain-openai
```

In [1]:
import os
import numpy as np
from pathlib import Path
from typing import List, Dict, Any, Tuple

from langchain_core.documents import Document

print("Ready ✅")

Ready ✅


## 1) What is an embedding?

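An **embedding** is a fixed-length vector of numbers that a model produces from a piece of text, trained so that texts with similar meaning end up close together in vector space.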

**In RAG**:
1) Embed each chunk  
2) Store vectors in a vector DB  
3) Embed the query  
4) Retrieve nearest vectors → relevant chunks
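
The four steps above can be sketched with plain NumPy. This is a toy illustration only: the 3-dimensional vectors are written by hand (a real model would produce them from the text), and the "vector DB" is just a matrix held in memory.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings", written by hand for illustration.
# A real model would produce these vectors from the chunk text.
chunk_vectors = {
    "reset password steps": np.array([0.9, 0.1, 0.0]),
    "contact support": np.array([0.1, 0.9, 0.1]),
    "quarterly revenue": np.array([0.0, 0.1, 0.9]),
}

# Steps 1-2: the "vector DB" here is a matrix plus a parallel list of chunks.
chunks = list(chunk_vectors)
index = np.stack([chunk_vectors[c] for c in chunks])
index = index / np.linalg.norm(index, axis=1, keepdims=True)

# Step 3: embed the query (again a hand-written vector standing in for a model).
query_vec = np.array([0.8, 0.2, 0.1])
query_vec = query_vec / np.linalg.norm(query_vec)

# Step 4: nearest vectors by cosine similarity (dot product of unit vectors).
scores = index @ query_vec
ranking = [chunks[i] for i in np.argsort(-scores)]
print(ranking)  # ['reset password steps', 'contact support', 'quarterly revenue']
```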

### Similarity metric: cosine similarity
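
For reference, the quantity computed by the function in the next cell is the standard cosine of the angle between two vectors. It ranges from -1 to 1, equals 1 when the vectors point in the same direction, and equals 0 when they are orthogonal. Returning 0.0 for a zero vector is a convention chosen by the code, since the formula itself is undefined there.

$$
\cos(\mathbf{a}, \mathbf{b})
= \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
= \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}
$$

Worked example: for $\mathbf{a} = (1, 0)$ and $\mathbf{b} = (1, 1)$, the dot product is $1$ and the norms are $1$ and $\sqrt{2}$, so $\cos(\mathbf{a}, \mathbf{b}) = 1/\sqrt{2} \approx 0.707$.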

In [2]:
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    denom = (np.linalg.norm(a) * np.linalg.norm(b))
    if denom == 0:
        return 0.0
    return float(np.dot(a, b) / denom)

cosine_similarity([1, 0], [1, 0]), cosine_similarity([1, 0], [0, 1])

(1.0, 0.0)

## 2) Embedding types you should know (interview + design)

### A) Dense embeddings (most common)
- Dense vector (e.g., 384/768/1024 dims)
- Best for semantic similarity
- Examples: MiniLM, E5, BGE, OpenAI text-embedding

### B) Sparse embeddings (lexical)
- Mostly zeros (BM25/TF-IDF-like, or neural sparse like SPLADE)
- Strong on exact keywords, IDs, numbers
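
To make "mostly zeros" concrete, here is a minimal TF-IDF-style sketch in plain Python. It is not BM25 or SPLADE: the tokenizer, the smoothed IDF formula, and the helper names (`tokenize`, `tfidf_vectors`, `sparse_dot`) are assumptions chosen for illustration. Each vector is stored as a dict holding only its non-zero terms.

```python
import math
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()

def tfidf_vectors(texts: list[str]) -> list[dict[str, float]]:
    """Return one sparse vector (term -> weight) per text."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    # Document frequency: in how many texts does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF keeps a small positive weight even for very common terms.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors

def sparse_dot(a: dict[str, float], b: dict[str, float]) -> float:
    # Only terms present in both vectors contribute to the score.
    return sum(w * b[t] for t, w in a.items() if t in b)

texts = ["error code E1234 in billing", "reset your password", "billing invoice overview"]
vecs = tfidf_vectors(texts)
vocab_size = len({t for v in vecs for t in v})
print(len(vecs[0]), "non-zero terms out of", vocab_size)  # 5 non-zero terms out of 10
print(sparse_dot(vecs[0], vecs[1]), sparse_dot(vecs[0], vecs[2]) > 0)  # 0 True
```

An exact token such as `e1234` only scores against texts that contain it, which is why sparse methods do well on IDs and numbers.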

### C) Hybrid retrieval (dense + sparse)
- Often best in enterprise search
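
One common way to combine a dense ranking with a sparse one is reciprocal rank fusion (RRF). The notebook does not prescribe a fusion method, so treat this as one possible sketch; the function name, the example rankings, and `k=60` (a frequently used default, not a tuned value) are assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical result lists from a dense retriever and a BM25 retriever.
dense_ranking = ["d1", "d4", "d2"]
sparse_ranking = ["d4", "d3", "d1"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print([doc_id for doc_id, _ in fused])  # ['d4', 'd1', 'd3', 'd2']
```

RRF only uses ranks, so it sidesteps the problem that dense and sparse scores live on different scales.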

### D) Late interaction / multi-vector (e.g., ColBERT)
- Multiple vectors per document (token-level)
- Higher accuracy, higher compute
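
The scoring idea behind late interaction (often called MaxSim) can be sketched in a few lines of NumPy: every query token picks its best-matching document token, and those maxima are summed. The random vectors below merely stand in for the output of a token-level encoder, and the helper names are hypothetical.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Sum, over query tokens, of the best similarity to any document token.

    query_tokens: shape (n_query_tokens, dim), rows L2-normalized
    doc_tokens:   shape (n_doc_tokens, dim), rows L2-normalized
    """
    sim = query_tokens @ doc_tokens.T  # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

# Random token vectors stand in for a real token-level encoder.
rng = np.random.default_rng(0)
query = l2_normalize(rng.normal(size=(4, 8)))
doc_a = l2_normalize(rng.normal(size=(12, 8)))
# doc_b also contains the query's own token vectors, so each query token
# has a perfect match and the score reaches its maximum of 4.0.
doc_b = np.vstack([doc_a, query])

print(maxsim_score(query, doc_a) < maxsim_score(query, doc_b))  # True
```

The extra cost is visible in the shapes: storage and compute grow with the number of tokens per document instead of one vector per document.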

### E) Instruction-tuned embeddings
- Optimized for query↔passage matching (E5/BGE instruction variants)

### F) Multilingual embeddings
- Cross-language aligned vectors (English/Italian/Kannada, etc.)

### G) Domain/code embeddings
- For logs, incidents, code search

## 3) Sample `Document`s for experiments

If you already have `cleaned_docs` or `token_chunks` from previous notebooks,
you can swap `sample_docs` with them later.

In [3]:
sample_docs: List[Document] = [
    Document(page_content="How to reset password in the portal: go to Settings -> Security -> Reset Password.", metadata={"source": "kb", "id": "d1"}),
    Document(page_content="Contact support by emailing support@example.com or opening a ticket in the portal.", metadata={"source": "kb", "id": "d2"}),
    Document(page_content="Quarterly summary report: revenue increased and customer satisfaction improved.", metadata={"source": "report", "id": "d3"}),
    Document(page_content="This doc mentions account security, MFA, and password policies.", metadata={"source": "kb", "id": "d4"}),
]

queries = [
    "I forgot my password. How can I reset it?",
    "How do I reach support?",
    "Tell me about quarterly performance.",
]
len(sample_docs), [d.metadata for d in sample_docs]

(4,
 [{'source': 'kb', 'id': 'd1'},
  {'source': 'kb', 'id': 'd2'},
  {'source': 'report', 'id': 'd3'},
  {'source': 'kb', 'id': 'd4'}])

## 4) LangChain Embeddings interface + a runnable baseline (`FakeEmbeddings`)

LangChain embedding classes expose two methods:
- `embed_documents(list[str]) -> list[list[float]]`
- `embed_query(str) -> list[float]`

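We start with `FakeEmbeddings` so everything runs without downloads/API keys. Note that it returns random vectors, so the similarity scores it produces carry no meaning; it only validates that the pipeline runs and the dimensions line up.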

In [None]:
from langchain_community.embeddings import FakeEmbeddings

emb = FakeEmbeddings(size=384)
doc_vectors = emb.embed_documents([d.page_content for d in sample_docs])
q_vec = emb.embed_query(queries[0])

len(doc_vectors), len(doc_vectors[0]), len(q_vec)
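
Because `FakeEmbeddings` draws random vectors, it cannot show whether a ranking makes sense. As a rough offline alternative, the sketch below defines a hypothetical `HashingEmbeddings` class (not part of LangChain) that duck-types the same two methods. It hashes each token into a bucket, so texts that share words get similar vectors. This is lexical overlap, not semantic similarity.

```python
import hashlib
import numpy as np

class HashingEmbeddings:
    """Hypothetical stand-in that duck-types embed_documents / embed_query.

    Each lowercase token is hashed to one of `size` buckets (feature hashing),
    so texts that share words get similar vectors. Lexical, not semantic.
    """

    def __init__(self, size: int = 384):
        self.size = size

    def _embed(self, text: str) -> list[float]:
        vec = np.zeros(self.size, dtype=np.float32)
        for token in text.lower().split():
            digest = hashlib.md5(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "little") % self.size
            vec[bucket] += 1.0
        norm = np.linalg.norm(vec)
        # Return unit-length vectors; an empty text stays all zeros.
        return (vec / norm if norm > 0 else vec).tolist()

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [self._embed(t) for t in texts]

    def embed_query(self, text: str) -> list[float]:
        return self._embed(text)

hash_emb = HashingEmbeddings(size=384)
vec_a, vec_b = hash_emb.embed_documents(
    ["reset password in the portal", "quarterly revenue report"]
)
vec_q = hash_emb.embed_query("reset password")
print(len(vec_q), float(np.dot(vec_q, vec_a)), float(np.dot(vec_q, vec_b)))
```

Unlike the random baseline, repeated calls return identical vectors, which makes test runs reproducible.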

## 5) Manual similarity search (top-k)

Before adding a vector DB, run a brute-force search to validate:
- vector dims
- similarity code
- query/doc pipeline

In [None]:
def top_k_similar(query: str, docs: List[Document], embedding, k: int = 3) -> List[Tuple[float, Document]]:
    q = np.array(embedding.embed_query(query), dtype=np.float32)
    dvecs = embedding.embed_documents([d.page_content for d in docs])
    scored = []
    for d, v in zip(docs, dvecs):
        s = cosine_similarity(q, np.array(v, dtype=np.float32))
        scored.append((s, d))
    scored.sort(key=lambda x: x[0], reverse=True)
    return scored[:k]

for q in queries:
    print("\nQUERY:", q)
    for score, doc in top_k_similar(q, sample_docs, emb, k=3):
        print(f"  score={score:.4f} id={doc.metadata['id']}  {doc.page_content[:70]}...")
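
`top_k_similar` re-embeds every document on each query, which is fine for four documents but wasteful beyond that. A possible variant, sketched here with a hypothetical helper `top_k_indices`, takes document vectors that were computed once and scores them with a single matrix-vector product. The 2-D vectors are hand-written for illustration.

```python
import numpy as np

def top_k_indices(query_vec, doc_matrix, k: int = 3):
    """Return (indices, scores) of the k rows of doc_matrix most similar to query_vec."""
    q = np.asarray(query_vec, dtype=np.float32)
    d = np.asarray(doc_matrix, dtype=np.float32)
    # Normalize once so one matrix-vector product yields cosine similarities.
    q = q / max(np.linalg.norm(q), 1e-12)
    d = d / np.maximum(np.linalg.norm(d, axis=1, keepdims=True), 1e-12)
    scores = d @ q
    order = np.argsort(-scores)[:k]
    return order.tolist(), scores[order].tolist()

# Hand-written 2-D vectors, for illustration only.
docs_2d = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
idx, sc = top_k_indices([1.0, 0.1], docs_2d, k=2)
print(idx)  # [0, 2]
```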

## 6) Real dense embeddings (Hugging Face) — Optional

If you want real semantic vectors locally, use `sentence-transformers`.

Starter models:
- `sentence-transformers/all-MiniLM-L6-v2` (fast, 384 dim)
- `intfloat/e5-small-v2` (expects `query: ` / `passage: ` prefixes; see section 8)
- `BAAI/bge-small-en-v1.5` (popular for RAG)

> First run downloads weights (internet needed on your machine).

In [None]:
USE_HF = False  # set True when ready

if USE_HF:
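    # %pip install -U sentence-transformers
    # Newer LangChain releases also provide this class in the separate
    # `langchain-huggingface` package (from langchain_huggingface import HuggingFaceEmbeddings).
    from langchain_community.embeddings import HuggingFaceEmbeddings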

    hf_model_name = "sentence-transformers/all-MiniLM-L6-v2"
    hf_emb = HuggingFaceEmbeddings(
        model_name=hf_model_name,
        encode_kwargs={"normalize_embeddings": True},
    )

    for q in queries:
        print("\nQUERY:", q)
        for score, doc in top_k_similar(q, sample_docs, hf_emb, k=3):
            print(f"  score={score:.4f} id={doc.metadata['id']}  {doc.page_content[:70]}...")
else:
    print("HF embeddings not enabled. Set USE_HF=True to run.")

## 7) OpenAI embeddings — Optional (API key required)

If you want OpenAI embeddings:
- Install `langchain-openai`
- Set environment variable `OPENAI_API_KEY`
- Use `OpenAIEmbeddings(model="text-embedding-3-small")`

In [None]:
USE_OPENAI = False  # set True when ready

if USE_OPENAI:
    # %pip install -U langchain-openai
    from langchain_openai import OpenAIEmbeddings

    oai_emb = OpenAIEmbeddings(model="text-embedding-3-small")
    for q in queries:
        print("\nQUERY:", q)
        for score, doc in top_k_similar(q, sample_docs, oai_emb, k=3):
            print(f"  score={score:.4f} id={doc.metadata['id']}  {doc.page_content[:70]}...")
else:
    print("OpenAI embeddings not enabled. Set USE_OPENAI=True to run.")

## 8) Instruction-tuned embeddings (E5/BGE) — input formatting

Some embedding models expect a role prefix on every input:

### E5 format
- query: `query: ...`
- passage: `passage: ...`

Why it matters:
- E5 models were trained with these prefixes, so leaving them out at inference time tends to degrade retrieval quality.

In [None]:
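from typing import Optional

def format_for_e5(query: Optional[str] = None, passage: Optional[str] = None) -> str:
    """Prefix exactly one of `query` or `passage` with its E5 role marker."""
    # Reject ambiguous calls instead of silently ignoring `passage`.
    if (query is None) == (passage is None):
        raise ValueError("Provide exactly one of query or passage")
    if query is not None:
        return f"query: {query}"
    return f"passage: {passage}"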

print(format_for_e5(query="reset password"))
print(format_for_e5(passage="To reset password, go to Settings -> Security"))
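
Formatting strings by hand is easy to forget, so one option is to apply the prefixes inside a thin wrapper. The sketch below is an assumption about how that could look, not a LangChain API: `PrefixedEmbeddings` and the `RecordingEmbeddings` stub are hypothetical names, and `base` can be any object with `embed_documents` and `embed_query`.

```python
class PrefixedEmbeddings:
    """Hypothetical wrapper: adds E5-style prefixes before delegating to `base`.

    `base` is any object exposing embed_documents(list[str]) and embed_query(str).
    """

    def __init__(self, base, query_prefix: str = "query: ", passage_prefix: str = "passage: "):
        self.base = base
        self.query_prefix = query_prefix
        self.passage_prefix = passage_prefix

    def embed_documents(self, texts):
        return self.base.embed_documents([self.passage_prefix + t for t in texts])

    def embed_query(self, text):
        return self.base.embed_query(self.query_prefix + text)


class RecordingEmbeddings:
    """Tiny stub that records what it was asked to embed (demonstration only)."""

    def __init__(self):
        self.seen = []

    def embed_documents(self, texts):
        self.seen.extend(texts)
        return [[0.0] for _ in texts]

    def embed_query(self, text):
        self.seen.append(text)
        return [0.0]


stub = RecordingEmbeddings()
wrapped = PrefixedEmbeddings(stub)
wrapped.embed_documents(["To reset password, go to Settings -> Security"])
wrapped.embed_query("reset password")
print(stub.seen)
# ['passage: To reset password, go to Settings -> Security', 'query: reset password']
```

With a wrapper like this, the same `top_k_similar` call works unchanged, and the prefixes cannot drift apart between indexing and querying.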

## 9) Debugging embeddings (practical checklist)

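When retrieval results look wrong, check:
1) Cleaning quality (noise hurts embeddings)
2) Chunk sizes (too big dilutes the topic; too small loses context)
3) Normalization (cosine similarity ignores vector length, but dot-product search does not; `normalize_embeddings=True` makes the two agree)
4) Query/passage prefixes (E5)
5) Domain mismatch (use domain embeddings or hybrid retrieval)

The next cell builds a quick debug table: vector dimension, norm, and a text preview per document.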

In [None]:
def embedding_debug_table(docs: List[Document], embedding, max_preview: int = 60) -> List[Dict[str, Any]]:
    vectors = embedding.embed_documents([d.page_content for d in docs])
    table = []
    for d, v in zip(docs, vectors):
        v = np.array(v, dtype=np.float32)
        table.append({
            "id": d.metadata.get("id"),
            "source": d.metadata.get("source"),
            "preview": (d.page_content[:max_preview] + "...") if len(d.page_content) > max_preview else d.page_content,
            "dim": int(v.shape[0]),
            "norm": float(np.linalg.norm(v)),
        })
    return table

embedding_debug_table(sample_docs, emb)
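
The `norm` column matters mainly when a vector store ranks by dot product instead of cosine. The following check, using random vectors as stand-ins for real embeddings, illustrates the relationship: cosine is unaffected by scaling, the dot product is not, and the two coincide once both vectors have unit length.

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=384)
b = rng.normal(size=384)

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Cosine similarity ignores vector length: scaling one input changes nothing.
assert np.isclose(cos(a, b), cos(10.0 * a, b))

# The raw dot product does depend on length...
assert not np.isclose(np.dot(a, b), np.dot(10.0 * a, b))

# ...unless both vectors are unit length, in which case it equals the cosine.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
assert np.isclose(np.dot(a_unit, b_unit), cos(a, b))
print("all normalization checks passed")
```

If the debug table shows norms far from 1.0 and the index uses inner product, longer vectors will be favored regardless of meaning.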

## 10) Embedding cache (production must-have)

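Re-embedding unchanged chunks costs time and, with hosted APIs, money.

LangChain's `CacheBackedEmbeddings` wraps any embedding model with a key-value store:
- First run → compute + store
- Next run → load from cache

By default only `embed_documents` is cached; `embed_query` still calls the underlying model unless a `query_embedding_cache` is also supplied.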

This is great for:
- repeated experiments
- incremental indexing
- speeding up development

In [None]:
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore

cache_dir = Path("emb_cache_store")
cache_dir.mkdir(exist_ok=True)

store = LocalFileStore(str(cache_dir))

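cached_emb = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings=emb,   # swap to hf_emb or oai_emb later
    document_embedding_cache=store,
    # Change the namespace whenever the model changes, so vectors from
    # different models never share cache entries.
    namespace="demo_v1",
)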

vecs_1 = cached_emb.embed_documents([d.page_content for d in sample_docs])
vecs_2 = cached_emb.embed_documents([d.page_content for d in sample_docs])

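print("dim:", len(vecs_1[0]))
# With the random FakeEmbeddings, identical vectors on the second call indicate a cache hit.
print("second call served from cache:", np.allclose(vecs_1, vecs_2))
print("cache files:", len(list(cache_dir.glob('*'))))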
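
To see what a cache like this does conceptually, here is a minimal in-memory sketch. It is not how `CacheBackedEmbeddings` is implemented; the class names, the SHA-256 key scheme, and the dict store are assumptions chosen to keep the idea visible: hash the text, look it up, and only embed what is missing.

```python
import hashlib

class CountingEmbeddings:
    """Stub embedder that counts how many texts it actually had to embed."""

    def __init__(self):
        self.calls = 0

    def embed_documents(self, texts):
        self.calls += len(texts)
        return [[float(len(t))] for t in texts]


class DictCachedEmbeddings:
    """Hypothetical in-memory cache keyed by namespace + SHA-256 of the text."""

    def __init__(self, base, namespace: str):
        self.base = base
        self.namespace = namespace
        self.store = {}

    def _key(self, text: str) -> str:
        return self.namespace + ":" + hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed_documents(self, texts):
        keys = [self._key(t) for t in texts]
        # Embed only the texts whose key is not in the store yet (deduplicated).
        missing = {k: t for k, t in zip(keys, texts) if k not in self.store}
        if missing:
            new_vectors = self.base.embed_documents(list(missing.values()))
            self.store.update(zip(missing.keys(), new_vectors))
        return [self.store[k] for k in keys]


base = CountingEmbeddings()
cached = DictCachedEmbeddings(base, namespace="demo_v1")
cached.embed_documents(["alpha", "beta"])
cached.embed_documents(["alpha", "beta", "gamma"])
print(base.calls)  # 3: "alpha", "beta", and "gamma" were each embedded once
```

Including the namespace in the key is what keeps vectors from different models apart when they share one store.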

## 11) Choosing an embedding model (quick decision guide)

- **Prototype fast**: MiniLM
- **Strong general-purpose RAG retrieval**: E5 / BGE instruction variants (confirm on your own queries before committing)
- **Multilingual corpora**: multilingual-e5 / LaBSE-like
- **ID/keyword heavy**: Hybrid retrieval (BM25 + dense)
- **Highest precision**: late interaction (ColBERT) or rerankers

Next notebook (recommended):
✅ Vector DB insertion (Chroma/FAISS/Milvus) + Retrieval + Filters + MMR