# Text Embeddings - Section Summary

## Core Concept
Embeddings convert text into numerical representations: lists of floating-point numbers, each typically between -1 and +1, that capture semantic meaning. This enables semantic search: finding relevant chunks by meaning, not just by keyword matching.

## How It Works
1. Feed text into embedding model → get list of numbers
2. Each number = score for some learned feature of the text
3. Similar meanings = similar number patterns
4. We don't know exactly what each number represents (the model learns these features during training)

## VoyageAI Implementation
Anthropic doesn't provide its own embedding model; its documentation recommends Voyage AI, used here through the `voyageai` Python client.
```python
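# Setup (notebook cell; %pip is an IPython magic, not plain Python)
%pip install voyageai
from dotenv import load_dotenv
import voyageai

load_dotenv()  # load VOYAGE_API_KEY from a local .env file
client = voyageai.Client()  # picks up VOYAGE_API_KEY from the environment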

def generate_embedding(text, model="voyage-3-large", input_type="query"):
    result = client.embed([text], model=model, input_type=input_type)
    return result.embeddings[0]
```
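
Ingestion usually embeds many chunks, not one. The sketch below is one possible batch helper, not part of the original notebook: `generate_embeddings` and its `batch_size` parameter are hypothetical names, and the value 128 is an assumed conservative batch size (check the Voyage AI documentation for the current per-request limit). It relies only on the same `client.embed(...)` call and `.embeddings` attribute used above, and takes the client as an argument so it can be exercised without network access.

```python
def generate_embeddings(client, texts, model="voyage-3-large",
                        input_type="document", batch_size=128):
    """Embed many texts, sending them to the API in batches.

    `client` is expected to expose embed(texts, model=..., input_type=...)
    and return an object with an `.embeddings` list, like voyageai.Client.
    `batch_size` is an assumed value, not a documented limit.
    """
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        result = client.embed(batch, model=model, input_type=input_type)
        # One vector comes back per input text, in the same order.
        embeddings.extend(result.embeddings)
    return embeddings
```

With the client from the setup block, a call would look like `chunk_embeddings = generate_embeddings(client, chunks)`. The default `input_type="document"` reflects that this helper is meant for ingestion rather than for questions.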

## Key Points
- Embeddings are like "GPS coordinates for meaning"
- Use `input_type="query"` for questions and `input_type="document"` for the chunks being indexed
- Generate embeddings once during document ingestion
- At query time: embed question → find closest embeddings → retrieve those chunks
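
The ingestion and query-time steps in the list above can be sketched as follows. This is a minimal illustration under loud assumptions: `toy_embed`, `VOCAB`, `build_index`, and `retrieve` are hypothetical names, and `toy_embed` is a word-counting stand-in so the block runs offline. It does not capture meaning the way a real embedding model does. In a real pipeline the embedding function would wrap `client.embed` with `input_type="document"` at ingestion and `input_type="query"` at query time.

```python
import math
from collections import Counter

# Hypothetical stand-in for an embedding model: counts a few hand-picked
# words. It only exists so the workflow can run without an API key.
VOCAB = ["revenue", "profit", "safety", "incident", "hiring"]

def toy_embed(text):
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)

def build_index(chunks, embed_fn):
    # Ingestion: embed every chunk once and store the vector with its text.
    return [(chunk, embed_fn(chunk)) for chunk in chunks]

def retrieve(question, index, embed_fn, k=1):
    # Query time: embed the question, then rank stored chunks by similarity.
    query_vec = embed_fn(question)
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:k]]

chunks = [
    "revenue and profit grew this quarter",
    "one safety incident was reported",
    "hiring slowed in the second half",
]
index = build_index(chunks, toy_embed)
top = retrieve("how much profit and revenue", index, toy_embed, k=1)
print(top)
```

Passing the embedding function in as an argument keeps the index-and-retrieve logic independent of any particular provider, so the toy function can be swapped for a real one without changing the rest.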

## Next Step
Learn to compare embeddings to find the most similar chunks (similarity/distance metrics).

In [1]:
# Install VoyageAI lib
%pip install voyageai

In [2]:
# Client Setup
from dotenv import load_dotenv
import voyageai

load_dotenv()

client = voyageai.Client()



In [3]:
# Chunk by section
import re


def chunk_by_section(document_text):
    pattern = r"\n## "
    return re.split(pattern, document_text)
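
To see what this splitter returns, here is a small self-contained run on a made-up document (the sample text is invented for illustration). Note that `re.split` consumes the `"\n## "` separator, so every chunk after the first loses its heading marker. The `chunk_by_section_keep_headings` variant is a hypothetical alternative, not part of the notebook, that uses a lookahead to keep the `## ` prefix.

```python
import re

def chunk_by_section(document_text):
    # Same splitter as the cell above, repeated so this block runs on its own.
    return re.split(r"\n## ", document_text)

def chunk_by_section_keep_headings(document_text):
    # Hypothetical variant: split on the newline only when "## " follows,
    # so each section keeps its heading marker.
    return re.split(r"\n(?=## )", document_text)

sample = (
    "# Report\nIntro text.\n"
    "## Methods\nHow it was done.\n"
    "## Results\nWhat was found."
)

chunks = chunk_by_section(sample)
kept = chunk_by_section_keep_headings(sample)

for chunk in chunks:
    print(repr(chunk))
for chunk in kept:
    print(repr(chunk))
```

Whether to keep the heading in each chunk is a design choice: the heading text often carries topic words that may help the embedding, but the original cell drops it.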

In [4]:
# Embedding Generation
def generate_embedding(text, model="voyage-3-large", input_type="query"):
    result = client.embed([text], model=model, input_type=input_type)

    return result.embeddings[0]

In [5]:
with open("./report.md", "r") as f:
    text = f.read()

chunks = chunk_by_section(text)

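# chunks[0] is a document chunk; for ingestion, pass input_type="document".
# The call below uses the function's default, "query".
generate_embedding(chunks[0])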

[-0.05453480780124664,
 0.01431055087596178,
 -0.016921261325478554,
 0.0005922440905123949,
 0.02136913500726223,
 0.037903621792793274,
 -0.047186143696308136,
 0.0024173229467123747,
 0.0025865354109555483,
 -0.01798488199710846,
 0.014020473696291447,
 0.048926617950201035,
 -0.00918582733720541,
 -0.0037710238248109818,
 -0.013730393722653389,
 0.007010236848145723,
 0.03326236084103584,
 0.08857070654630661,
 -0.04737953096628189,
 -0.010201102122664452,
 0.06033638119697571,
 -0.057628974318504333,
 0.06497763842344284,
 -0.03461606428027153,
 -0.018468346446752548,
 0.007638740353286266,
 0.007203621789813042,
 0.04486551135778427,
 -0.03248881921172142,
 -0.014407243579626083,
 -0.000906496134120971,
 0.022142676636576653,
 0.01885511912405491,
 0.05685543641448021,
 0.012473385781049728,
 0.04099779576063156,
 -0.06188346818089485,
 -0.020402204245328903,
 -0.02726740390062332,
 -0.054921574890613556,
 0.010587873868644238,
 0.005245590582489967,
 0.032682206481695175,
 -0.07