# Level 2 - Week 2 - 01 Vector DB Fundamentals

**Estimated time:** 60-90 minutes

## Learning Objectives

- Define a minimal metadata schema
- Keep `chunk_id` and `doc_id` traceable back to the source document
- Prepare chunk data for upsert


## Overview

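A vector DB stores each embedding vector together with an ID and a metadata payload.
Your job is to keep retrieval debuggable: every result should be traceable back to the chunk and document it came from.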

## Practice Steps

- Define a Chunk model with metadata.
- Create a sample metadata payload.


### Sample code

Minimal chunk model and metadata example.


In [None]:
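from dataclasses import dataclass
from typing import Dict

@dataclass
class Chunk:
    doc_id: str      # ID of the source document
    chunk_id: str    # unique ID of this chunk
    text: str        # raw chunk text that gets embedded
    metadata: Dict   # small payload stored next to the vector

metadata = {
    'doc_id': 'fastapi_docs',
    'chunk_id': 'fastapi#001',
    'source': 'docs',
    'url': 'https://fastapi.tiangolo.com/',
}

# Build a chunk whose IDs match its metadata, so either one leads back to the source.
chunk = Chunk(
    doc_id=metadata['doc_id'],
    chunk_id=metadata['chunk_id'],
    text='Placeholder chunk text.',
    metadata=metadata,
)
print(metadata)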


### Student fill-in

Add any extra metadata fields you want to track.


In [None]:
MIN_METADATA = {
    'doc_id': '',
    'chunk_id': '',
    'source': '',
    'text': '',
}

# TODO: add url or section fields if useful
print(MIN_METADATA)


## Self-check

- Can you trace a chunk_id back to its source?
- Is metadata small but sufficient?
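
One way to answer both self-check questions is to test them in code. The sketch below is only an illustration: `STORED_METADATA`, `REQUIRED_KEYS`, `missing_keys`, and `trace_chunk` are hypothetical names, and the list stands in for whatever a real vector DB would return.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for metadata payloads stored in a vector DB.
STORED_METADATA: List[Dict[str, str]] = [
    {
        "doc_id": "fastapi_docs",
        "chunk_id": "fastapi#001",
        "source": "docs",
        "url": "https://fastapi.tiangolo.com/",
    },
]

# Assumed minimal schema: the fields every payload must carry.
REQUIRED_KEYS = ("doc_id", "chunk_id", "source")


def missing_keys(metadata: Dict[str, str]) -> List[str]:
    """Return the required keys that are absent or empty in a payload."""
    return [key for key in REQUIRED_KEYS if not metadata.get(key)]


def trace_chunk(chunk_id: str, records: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Return the payload for a chunk_id, or None if it cannot be traced."""
    for record in records:
        if record.get("chunk_id") == chunk_id:
            return record
    return None


hit = trace_chunk("fastapi#001", STORED_METADATA)
print(hit["url"] if hit else "not traceable")
print(missing_keys({"doc_id": "fastapi_docs"}))
```

If `trace_chunk` returns `None` or `missing_keys` returns a non-empty list, the payload is not yet sufficient for debugging retrieval.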


Legacy practice content from practice.ipynb

# Level 2 — Week 2 Practice: Vector DB Fundamentals

**Estimated time:** 60–90 minutes

## Learning Objectives

- Describe the ingestion pipeline (parse → chunk → embed → upsert)
- Implement a basic chunking function with metadata
- Prepare data structures for vector DB ingestion
- Define a minimal retrieval query shape


## Overview

This week focuses on the ingestion side of RAG systems. We will build a small,
checkable pipeline shape and define the data structures you will reuse later.

You will:

1. Define a chunk data model with metadata.
2. Implement a simple chunking function (start naive, improve later).
3. Sketch the embedding and upsert steps.

## Practice Steps

- Start with a single document string.
- Split into chunks with IDs and metadata.
- Stub the embed and upsert calls.


### Task 2.1: Chunking data model

Define the `Chunk` structure and implement `chunk_text`.
Start with naive chunking (e.g., fixed size or paragraph split).

Tip: keep metadata small and consistent (doc_id, chunk_index, source).
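
As an illustration of the paragraph-split option, here is a minimal sketch. It assumes paragraphs are separated by blank lines; `split_paragraphs` and `paragraph_chunks` are hypothetical helper names, and plain dicts are used so the block runs on its own.

```python
import re
from typing import Dict, List


def split_paragraphs(text: str) -> List[str]:
    """Split on one or more blank lines and drop empty pieces."""
    parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]


def paragraph_chunks(text: str, doc_id: str, source: str) -> List[Dict]:
    """Turn each paragraph into a chunk record with small, consistent metadata."""
    records = []
    for chunk_index, paragraph in enumerate(split_paragraphs(text)):
        records.append(
            {
                "chunk_id": f"{doc_id}-{chunk_index}",
                "text": paragraph,
                "metadata": {
                    "doc_id": doc_id,
                    "chunk_index": chunk_index,
                    "source": source,
                },
            }
        )
    return records


doc = "First paragraph.\n\nSecond paragraph.\n\n\nThird paragraph."
for record in paragraph_chunks(doc, doc_id="doc-1", source="docs"):
    print(record["chunk_id"], "->", record["text"])
```

Paragraph splitting keeps related sentences together, but chunk sizes vary widely, so a size cap may still be needed for long paragraphs.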


In [None]:
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class Chunk:
    doc_id: str
    chunk_id: str
    text: str
    metadata: Dict

def chunk_text(text: str, doc_id: str, chunk_size: int = 200) -> List[Chunk]:
    # Naive fixed-size chunking by character count.
    # TODO: replace with your strategy (sentence split, overlap, etc.)
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    for chunk_index, start in enumerate(range(0, len(text), chunk_size)):
        # Named `piece` so it does not shadow the enclosing function name.
        piece = text[start : start + chunk_size]
        chunks.append(
            Chunk(
                doc_id=doc_id,
                chunk_id=f"{doc_id}-{chunk_index}",
                text=piece,
                metadata={"doc_id": doc_id, "chunk_index": chunk_index},
            )
        )
    return chunks

sample = chunk_text("Example document text." * 10, doc_id="doc-1")
print("chunks:", len(sample))


### Task 2.2: Embed and upsert (stubs)

Create stub functions for embedding and upserting. Keep interfaces simple so you
can swap in a real vector DB later.
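
One possible way to keep the interface swappable is to code against a small protocol. This is a sketch under assumptions, not any real vector DB client's API: `VectorStore`, `InMemoryStore`, and `ingest` are hypothetical names.

```python
from typing import Dict, List, Protocol, Sequence

Vector = List[float]


class VectorStore(Protocol):
    """Assumed minimal interface that any backend adapter would satisfy."""

    def upsert(
        self,
        ids: Sequence[str],
        vectors: Sequence[Vector],
        metadatas: Sequence[Dict],
    ) -> int: ...


class InMemoryStore:
    """Dictionary-backed stand-in for testing the pipeline shape."""

    def __init__(self) -> None:
        self.records: Dict[str, Dict] = {}

    def upsert(
        self,
        ids: Sequence[str],
        vectors: Sequence[Vector],
        metadatas: Sequence[Dict],
    ) -> int:
        if not (len(ids) == len(vectors) == len(metadatas)):
            raise ValueError("ids, vectors, and metadatas must have equal length")
        for id_, vector, metadata in zip(ids, vectors, metadatas):
            # Upsert semantics: an existing ID is overwritten, a new ID is inserted.
            self.records[id_] = {"vector": list(vector), "metadata": dict(metadata)}
        return len(ids)


def ingest(store: VectorStore, ids, vectors, metadatas) -> int:
    """Depends only on the protocol, so the backend can be replaced later."""
    return store.upsert(ids, vectors, metadatas)


store = InMemoryStore()
count = ingest(
    store,
    ["doc-1-0", "doc-1-1"],
    [[0.0, 0.1], [0.2, 0.3]],
    [{"doc_id": "doc-1"}, {"doc_id": "doc-1"}],
)
print("upserted:", count, "stored:", len(store.records))
```

Because `ingest` only relies on the `upsert` signature, a wrapper around a real client could be passed in without changing the calling code.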


In [None]:
from typing import Iterable, List

Vector = List[float]

def embed_texts(texts: Iterable[str]) -> List[Vector]:
    # TODO: replace with real embedding model
    return [[0.0] * 5 for _ in texts]

def upsert_chunks(chunks: List[Chunk], vectors: List[Vector]) -> int:
    # TODO: replace with vector DB client
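    # Raise explicitly: assert statements are skipped when Python runs with -O.
    if len(chunks) != len(vectors):
        raise ValueError("chunks and vectors must have the same length")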
    return len(chunks)

vectors = embed_texts([c.text for c in sample])
count = upsert_chunks(sample, vectors)
print("upserted:", count)
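
The learning objectives also mention a minimal retrieval query shape. The sketch below shows one possible shape, assuming a query carries a vector, a `top_k` limit, and an exact-match metadata filter. `Query`, `cosine`, and `search` are hypothetical names, and the brute-force scan stands in for a real vector DB search.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Query:
    vector: List[float]
    top_k: int = 3
    filter: Dict[str, str] = field(default_factory=dict)  # exact-match metadata filter


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query: Query, records: List[Dict]) -> List[Tuple[str, float]]:
    """Filter by metadata, score by cosine similarity, return the top_k (id, score) pairs."""
    hits = []
    for record in records:
        metadata = record["metadata"]
        if all(metadata.get(key) == value for key, value in query.filter.items()):
            hits.append((record["id"], cosine(query.vector, record["vector"])))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[: query.top_k]


records = [
    {"id": "doc-1-0", "vector": [1.0, 0.0], "metadata": {"doc_id": "doc-1"}},
    {"id": "doc-1-1", "vector": [0.0, 1.0], "metadata": {"doc_id": "doc-1"}},
    {"id": "doc-2-0", "vector": [1.0, 0.0], "metadata": {"doc_id": "doc-2"}},
]
print(search(Query(vector=[1.0, 0.0], top_k=2, filter={"doc_id": "doc-1"}), records))
```

Returning chunk IDs with scores keeps results debuggable, since each ID can be traced back through its metadata to the source document.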
