# Level 2 - Week 2 - 01 Vector DB Fundamentals

**Estimated time:** 60-90 minutes

## Learning Objectives

- Define a minimal metadata schema
- Keep `chunk_id` and `doc_id` traceable
- Prepare data for upsert


## Overview

A vector DB is not “magic memory”. It is a storage + query system for:

- vectors (embeddings)
- stable identifiers (`chunk_id`)
- metadata (traceability + filters)

The single most important Week 2 outcome is **debuggable retrieval**.

## Intuition

If you can’t answer “why did this chunk come back?”, you can’t improve the system.

So every chunk should be traceable:

- `chunk_id` → source document (`doc_id`)
- `chunk_id` → location (file path / url / section / page)
- `chunk_id` → exact chunk text used later for grounding and citations

## Practice Steps

- Define a minimal `Chunk` data model.
- Define the minimum metadata needed to locate the original source.
- Choose a stable `chunk_id` scheme so re-ingestion overwrites instead of duplicating.

### Sample code

Minimal chunk model and metadata example.


In [None]:
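from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class Chunk:
    doc_id: str  # identifier of the source document
    chunk_id: str  # stable identifier, unique per chunk
    text: str  # exact chunk text used for grounding and citations
    metadata: Dict[str, Any]  # traceability and filter fields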

metadata = {
    'doc_id': 'fastapi_docs',
    'chunk_id': 'fastapi#001',
    'source': 'docs',
    'url': 'https://fastapi.tiangolo.com/',
}
print(metadata)


### Student fill-in

Add any extra metadata fields you want to track.


In [None]:
MIN_METADATA = {
    'doc_id': '',
    'chunk_id': '',
    'source': '',
    'text': '',
}

# TODO: add url or section fields if useful
print(MIN_METADATA)


## Self-check

- Can you trace a `chunk_id` back to its source?
- Is metadata small but sufficient?


## Underlying theory: embeddings + similarity search

### Embeddings (definition)

An embedding model is a function:

$$
f: \text{Text} \rightarrow \mathbb{R}^d
$$

It maps text into a $d$-dimensional vector. Individual coordinates are not “human interpretable”; the vector is meaningful as a whole.

### Similarity search (what retrieval does)

At ingest time you store chunk embeddings $\mathbf{x}_i = f(\text{chunk}_i)$.

At query time you compute $\mathbf{q} = f(\text{query})$ and retrieve the nearest vectors to $\mathbf{q}$ under some metric.

Key implication:

- retrieval returns what is *numerically close* under your embedding + metric, which is not necessarily what is *semantically correct*
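As a rough illustration of the retrieval step, here is a brute-force nearest-neighbour search over a tiny in-memory index. The vectors, the `toy_index` dict, and the helper names `squared_l2` and `nearest` are made up for this sketch, and squared Euclidean distance is only one possible metric; a real vector DB would typically use an approximate index rather than scanning every vector.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance; smaller means closer."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query: Vector, index: Dict[str, Vector], k: int = 2) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k closest."""
    scored = [(chunk_id, squared_l2(query, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Hand-written 2-d vectors standing in for real embeddings (assumption).
toy_index = {
    "doc-1#00000": [1.0, 0.0],
    "doc-1#00001": [0.0, 1.0],
    "doc-2#00000": [0.9, 0.1],
}

results = nearest([1.0, 0.0], toy_index, k=2)
print(results)
```

The search only knows about distances. Whether the closest chunk actually answers the query is a separate question that the numbers cannot settle.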

### Cosine similarity (formula + meaning)

Given vectors $\mathbf{x}$ and $\mathbf{y}$:

$$
\cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|_2\,\|\mathbf{y}\|_2}
$$

- it measures alignment (angle) between vectors
- it is insensitive to magnitude (vector length)
- it is commonly used for embeddings where “meaning” is encoded in direction
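The formula translates almost directly into code. The sketch below uses only the standard library; the function name `cosine_similarity` and the example vectors are illustrative choices, and raising on zero vectors is one reasonable way to handle the undefined case.

```python
import math
from typing import Sequence


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Dot product divided by the product of the L2 norms."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_x * norm_y)


# Same direction, different length: similarity is approximately 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity is 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Doubling the length of a vector leaves the score unchanged, which is the magnitude insensitivity described above.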

### Practical note: normalization

If embeddings are L2-normalized (unit length), dot product and cosine similarity are equivalent:

$$
\mathbf{x}\cdot\mathbf{y} = \cos(\theta) \quad \text{when } \|\mathbf{x}\|_2=\|\mathbf{y}\|_2=1
$$
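A quick numeric check of this equivalence, using hand-picked vectors and hypothetical helper names (`l2_norm`, `l2_normalize`, `dot`):

```python
import math
from typing import List


def dot(x: List[float], y: List[float]) -> float:
    return sum(a * b for a, b in zip(x, y))


def l2_norm(v: List[float]) -> float:
    return math.sqrt(dot(v, v))


def l2_normalize(v: List[float]) -> List[float]:
    """Scale v to unit length; a zero vector cannot be normalized."""
    norm = l2_norm(v)
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [a / norm for a in v]


x = [3.0, 4.0]
y = [4.0, 3.0]

# Cosine similarity on the raw vectors: 24 / (5 * 5) = 0.96
cosine = dot(x, y) / (l2_norm(x) * l2_norm(y))
# Plain dot product after normalizing both vectors to unit length
dot_normalized = dot(l2_normalize(x), l2_normalize(y))

print(cosine, dot_normalized)
```

The two values agree up to floating-point rounding, so on normalized vectors a dot-product index and a cosine index rank results the same way.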

Record:

- embedding model name
- whether vectors are normalized
- which metric the vector DB uses
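One lightweight way to record these three facts is a small config object stored next to the index. The class name `IndexConfig`, its field names, and the model name below are placeholders for this sketch, not part of any particular vector DB.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class IndexConfig:
    embedding_model: str  # which model produced the vectors
    normalized: bool  # whether vectors were L2-normalized before storage
    metric: str  # metric the vector DB was configured with


# Placeholder values; substitute the ones your system actually uses.
config = IndexConfig(
    embedding_model="example-embedding-model",
    normalized=True,
    metric="cosine",
)
print(asdict(config))
```

Keeping this record with the index makes it easier to notice when a query is embedded with a different model, or scored with a different metric, than the stored chunks.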

## Practice: make retrieval debuggable (data model first)

Goal: define the minimum fields you need so that every retrieved chunk can be traced back to its origin.

You should be able to answer, for any returned `chunk_id`:

- what document did this come from (`doc_id`)?
- where did it come from (file path/url/section/page)?
- what exact text will be used for grounding/citations (`text`)?

In practice you’ll store:

- stable ids (`chunk_id`)
- the raw chunk text (`text`)
- metadata for traceability and filtering (`doc_id`, `source`, optional `url`/`section`)
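To see how these fields answer the three questions above, here is a minimal lookup sketch. The `records` dict, the `trace` helper, and the example URL are assumptions made for illustration; in a real system the record would come back from the vector DB alongside the search result.

```python
from typing import Any, Dict

# Hypothetical stored records keyed by chunk_id; all values are made up.
records: Dict[str, Dict[str, Any]] = {
    "doc-1#00000": {
        "doc_id": "doc-1",
        "source": "docs",
        "url": "https://example.com/doc-1",
        "text": "Example document text.",
    },
}


def trace(chunk_id: str, records: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Answer the three debugging questions for one returned chunk_id."""
    record = records[chunk_id]  # a KeyError here means the id is not traceable
    return {
        "document": record["doc_id"],
        # Prefer the most specific location that was stored.
        "location": record.get("url") or record.get("section") or record["source"],
        "grounding_text": record["text"],
    }


traced = trace("doc-1#00000", records)
print(traced)
```

If any of the three answers is missing for a returned chunk, that is a gap in the data model rather than in the retrieval step.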

Next: implement a simple chunker that outputs `Chunk` objects with stable ids + metadata.

### Exercise 1: Chunking (baseline)

Implement `chunk_text`, which splits a single string into fixed-size chunks.

Requirements:

- deterministic: same input => same chunks
- traceable metadata: include `doc_id` and `chunk_index`
- stable `chunk_id` scheme: use `doc_id` + `chunk_index`

In [None]:
from typing import List


def chunk_text(text: str, doc_id: str, chunk_size: int = 200) -> List[Chunk]:
    """Split text into fixed-size character chunks with stable, index-based ids."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be > 0")

    chunks: List[Chunk] = []
    # enumerate gives each chunk its position, which drives the chunk_id
    for chunk_index, start in enumerate(range(0, len(text), chunk_size)):
        chunks.append(
            Chunk(
                doc_id=doc_id,
                # zero-padded index keeps ids sortable, e.g. "doc-1#00000"
                chunk_id=f"{doc_id}#{chunk_index:05d}",
                text=text[start : start + chunk_size],
                metadata={"doc_id": doc_id, "chunk_index": chunk_index},
            )
        )

    return chunks


sample_chunks = chunk_text("Example document text. " * 10, doc_id="doc-1", chunk_size=120)
print("chunks:", len(sample_chunks))
print("first chunk_id:", sample_chunks[0].chunk_id)
print("first metadata:", sample_chunks[0].metadata)

### Exercise 2: Embed and upsert (stubs)

For now, treat embedding and upsert as *stubs* with clear interfaces: well-defined inputs and outputs, but no model calls and no database writes.

In a real system:

- `embed_texts` calls an embedding model and returns vectors (lists of floats)
- `upsert_chunks` writes (ids, vectors, documents, metadata) into your vector DB

Debuggability requirements:

- assert the number of vectors matches the number of chunks
- keep stable ids so repeated ingestion overwrites instead of duplicating
- keep enough metadata to trace each chunk back to its source
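To make the overwrite requirement concrete, here is a sketch of a toy in-memory index keyed by `chunk_id`. The class `InMemoryIndex` and its `upsert` signature are invented for this example and only mimic the (ids, vectors, documents, metadata) shape described above; a real vector DB client will have its own API.

```python
from typing import Any, Dict, List, Tuple

Vector = List[float]
Row = Tuple[Vector, str, Dict[str, Any]]


class InMemoryIndex:
    """Toy stand-in for a vector DB collection, keyed by chunk_id."""

    def __init__(self) -> None:
        self._rows: Dict[str, Row] = {}

    def upsert(
        self,
        ids: List[str],
        vectors: List[Vector],
        documents: List[str],
        metadatas: List[Dict[str, Any]],
    ) -> int:
        if not (len(ids) == len(vectors) == len(documents) == len(metadatas)):
            raise ValueError("ids, vectors, documents and metadatas must have equal length")
        for chunk_id, vector, document, metadata in zip(ids, vectors, documents, metadatas):
            # Same id => the existing row is replaced, not duplicated.
            self._rows[chunk_id] = (vector, document, metadata)
        return len(ids)

    def __len__(self) -> int:
        return len(self._rows)


index = InMemoryIndex()
toy_ids = ["doc-1#00000", "doc-1#00001"]
toy_vectors = [[0.1, 0.2], [0.3, 0.4]]
toy_documents = ["first chunk", "second chunk"]
toy_metadatas = [
    {"doc_id": "doc-1", "chunk_index": 0},
    {"doc_id": "doc-1", "chunk_index": 1},
]

index.upsert(toy_ids, toy_vectors, toy_documents, toy_metadatas)
index.upsert(toy_ids, toy_vectors, toy_documents, toy_metadatas)  # re-ingest the same chunks
print("rows after two ingests:", len(index))
```

Because the ids are stable, ingesting the same chunks twice leaves the row count unchanged. With random ids, the second call would have doubled it.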

In [None]:
from typing import Iterable, List

Vector = List[float]


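def embed_texts(texts: Iterable[str]) -> List[Vector]:
    # Stub: returns one 5-dimensional zero vector per text.
    # Zero vectors are placeholders only; cosine similarity is undefined for them.
    return [[0.0] * 5 for _ in texts]


def upsert_chunks(chunks: List[Chunk], vectors: List[Vector]) -> int:
    # Stub: checks that inputs line up and reports how many chunks would be written.
    assert len(chunks) == len(vectors), "need exactly one vector per chunk"
    return len(chunks)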


vectors = embed_texts([c.text for c in sample_chunks])
count = upsert_chunks(sample_chunks, vectors)
print("upserted:", count)