# Basic Example

This tutorial demonstrates the core concepts of TableVault through a practical example: building a document processing pipeline with searchable embeddings.


## Step 1: Install Dependencies

First, install TableVault.

In [1]:
!pip install tablevault



## Step 2: Setup ArangoDB

Run ArangoDB locally using Docker with vector index support enabled.

In [2]:
import subprocess
subprocess.run([
    "docker", "run", "-d",
    "--name", "arangodb",
    "-e", "ARANGO_ROOT_PASSWORD=rootpassword",
    "-p", "8529:8529",
    "arangodb:3.12",
    "arangod", "--experimental-vector-index=true"
], check=True)

16d64ae1085a895ef2a1e1b1d9cdf1d28628fdac98775a685394ec5c79b6d45b


CompletedProcess(args=['docker', 'run', '-d', '--name', 'arangodb', '-e', 'ARANGO_ROOT_PASSWORD=rootpassword', '-p', '8529:8529', 'arangodb:3.12', 'arangod', '--experimental-vector-index=true'], returncode=0)

Once ArangoDB is running, you can explore the database through the built-in web UI at [http://localhost:8529](http://localhost:8529). Log in with username `root` and password `rootpassword` to browse collections, run queries, and inspect your data as you would in any other database.

If the command fails with a port binding error, port 8529 is already in use. Find and stop the conflicting process before continuing:

```bash
lsof -i :8529          # find what is using the port
docker stop <name>     # if it is a Docker container
```
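
The same check can be done from Python before launching the container. This is a minimal sketch using only the standard library; `port_in_use` is a hypothetical helper name and is not part of TableVault or Docker.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


if port_in_use(8529):
    print("Port 8529 is busy; stop the conflicting process first.")
else:
    print("Port 8529 is free.")
```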

Verify ArangoDB is running:

In [3]:
from arango import ArangoClient
from arango.exceptions import ArangoError

client = ArangoClient(hosts="http://localhost:8529")

try:
    sys_db = client.db("_system", username="root", password="rootpassword")
    info = sys_db.version()
    version = info.get("version") if isinstance(info, dict) else info
    print(f"ArangoDB is ready: {version}")
except ArangoError as exc:
    raise RuntimeError("ArangoDB started, but auth failed. Check root password setup.") from exc

ArangoDB is ready: 3.12.7-2
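
Because the container is started in detached mode, the server may still be booting when this check runs. One way to handle that is a small retry loop. The sketch below is generic Python; `wait_until` is a hypothetical helper, and the timeout and interval values are illustrative assumptions.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Call check() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            # Treat any error (for example, connection refused) as "not ready yet"
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the `sys_db` handle from the cell above, a call such as `wait_until(lambda: bool(sys_db.version()))` would keep polling until the version request succeeds or the timeout is reached.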


## Step 3: Initialize the Vault

Create a TableVault instance connected to your ArangoDB. The `process_name` identifies this run — all data written through this vault is attributed to the `document_pipeline` process, making it easy to trace where each item came from later.

In [4]:
from tablevault import Vault

# Create a new TableVault process
vault = Vault(
    user_id="tutorial_user",
    process_name="document_pipeline",
    arango_url="http://localhost:8529",
    arango_db="tutorial_db",
    new_arango_db=True,  # Start fresh
    arango_root_password="rootpassword"
)

print("Vault initialized successfully!")

Vault initialized successfully!


## Step 4: Create Item Lists

TableVault organizes data into typed lists:

- **Document lists**: Store text content
- **Embedding lists**: Store vector embeddings
- **Record lists**: Store structured metadata

Each list stores items at sequential integer positions. Items across lists can be linked by position range to track lineage — for example, recording that embedding position 2 was derived from document positions 2–3.
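
To make position-range links concrete, here is a small in-memory sketch. It is not TableVault's implementation: the `Link` type and the `links_touching` helper are hypothetical, and ranges are assumed to be half-open `[start, end)`.

```python
from typing import List, NamedTuple


class Link(NamedTuple):
    """One lineage edge between position ranges of two lists (hypothetical model)."""
    child_list: str
    child_start: int
    child_end: int      # exclusive (assumption)
    parent_list: str
    parent_start: int
    parent_end: int     # exclusive (assumption)


def links_touching(links: List[Link], child_list: str, start: int, end: int) -> List[Link]:
    """Return links whose child range overlaps the half-open range [start, end)."""
    return [
        link for link in links
        if link.child_list == child_list
        and link.child_start < end
        and start < link.child_end
    ]


links = [
    Link("paper_embeddings", 2, 3, "research_papers", 2, 3),
    Link("paper_embeddings", 3, 4, "research_papers", 3, 4),
]
print(links_touching(links, "paper_embeddings", 2, 3))
```

Storing both ranges on each edge is what allows a later query to ask about a slice of a list rather than the list as a whole.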

In [5]:
# Create a document list for storing text chunks
vault.create_document_list("research_papers")

# Create an embedding list (using 384-dim for this example)
EMBEDDING_DIM = 384
vault.create_embedding_list("paper_embeddings", ndim=EMBEDDING_DIM)

# Create a record list for metadata
vault.create_record_list("paper_metadata", column_names=["title", "author", "chunk_id"])

print("Item lists created!")


---[ TableVault Record ]---
Item lists created!
---[ TableVault Record ]---



## Step 5: Add Documents and Track Lineage

We'll add sample documents and their embeddings, tracking the lineage between them. The `input_items` argument on `append_embedding` records which source positions the embedding was derived from, forming an explicit link that can be queried later.

In [6]:
# Sample document chunks
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are inspired by biological neurons.",
    "Deep learning has revolutionized computer vision.",
    "Transformers have changed natural language processing.",
]

# Mock embedding function (replace with your actual model like sentence-transformers)
def get_embedding(text):
    import hashlib
    import random

    seed = int.from_bytes(hashlib.sha256(text.encode()).digest(), "big")
    rng = random.Random(seed)
    return [rng.random() for _ in range(EMBEDDING_DIM)]

print(f"Mock embedding dimension: {len(get_embedding('test'))}")


---[ TableVault Record ]---
Mock embedding dimension: 384
---[ TableVault Record ]---
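
The mock derives its seed from SHA-256 rather than Python's built-in `hash()`, whose value for strings changes between interpreter runs unless `PYTHONHASHSEED` is fixed. The short sketch below shows the property that matters for this tutorial: equal text always yields an equal vector. `stable_vector` is a hypothetical stand-in for `get_embedding` with a smaller dimension.

```python
import hashlib
import random
from typing import List


def stable_vector(text: str, dim: int = 8) -> List[float]:
    """Deterministic pseudo-random vector seeded from the SHA-256 of the text."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest(), "big")
    rng = random.Random(seed)
    return [rng.random() for _ in range(dim)]


a = stable_vector("neural networks")
b = stable_vector("neural networks")
c = stable_vector("neural network")  # one character shorter

print(a == b)  # same text, same vector
print(a == c)  # different text, unrelated vector
```

Vectors produced this way are reproducible but carry no semantic meaning, so similar sentences do not end up close together.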



In [7]:
# Add documents and their embeddings with lineage tracking
for idx, doc in enumerate(documents):
    # Add document
    vault.append_document("research_papers", doc)

    # Generate and add embedding with lineage tracking
    embedding = get_embedding(doc)
    vault.append_embedding(
        "paper_embeddings",
        embedding,
        input_items={"research_papers": [idx, idx + 1]},  # Links to source document
        index_rebuild_count=max(0, len(documents) - 1),  # Force index build for small demo sets
    )

    # Add metadata
    vault.append_record("paper_metadata", {
        "title": f"Paper Section {idx + 1}",
        "author": "Tutorial Author",
        "chunk_id": idx
    })

    print(f"Added document {idx + 1}: {doc[:50]}...")

    # All writes for this item are complete — safe to stop or pause the process here
    vault.checkpoint_execution()

has_index = vault.has_vector_index(EMBEDDING_DIM)
print(f"\nVector index created: {has_index}")
if not has_index:
    print("Vector index was not created; approximate search may be unavailable on this ArangoDB setup.")
print("All documents added with lineage tracking!")


---[ TableVault Record ]---
Added document 1: Machine learning is a subset of artificial intelli...
Added document 2: Neural networks are inspired by biological neurons...
Added document 3: Deep learning has revolutionized computer vision....
Added document 4: Transformers have changed natural language process...

Vector index created: True
All documents added with lineage tracking!
---[ TableVault Record ]---



### Why `checkpoint_execution` matters

`vault.checkpoint_execution()` marks a safe boundary at the end of each loop iteration.

- Stop/pause behavior: requests only take effect at a checkpoint, never mid-write or during a pending API call.
- Resume behavior: resume also happens at a checkpoint, which keeps pipeline state consistent.

Without checkpoints, stop/pause requests can be deferred indefinitely.
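
The pattern can be illustrated without TableVault. The sketch below is a generic cooperative-stop loop, not the library's mechanism: `stop_requested`, `checkpoint`, and `StopAtCheckpoint` are hypothetical names. The stop request is honored only at the checkpoint call, never between the two writes that belong to one item.

```python
import threading
from typing import List


class StopAtCheckpoint(Exception):
    """Raised at a checkpoint when a stop was requested."""


def run(items: List[str], stop_after: int) -> List[str]:
    stop_requested = threading.Event()
    written: List[str] = []

    def checkpoint() -> None:
        # The only place where a pending stop request takes effect
        if stop_requested.is_set():
            raise StopAtCheckpoint()

    try:
        for i, item in enumerate(items):
            if i == stop_after:
                stop_requested.set()       # request arrives mid-iteration
            written.append(f"doc:{item}")  # first write for this item
            written.append(f"emb:{item}")  # second write still completes
            checkpoint()                   # safe boundary
    except StopAtCheckpoint:
        pass
    return written


print(run(["a", "b", "c"], stop_after=1))
# ['doc:a', 'emb:a', 'doc:b', 'emb:b']
```

Item `b` is written in full even though the stop request arrived before its first write, and item `c` is never started.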

## Step 6: Add Descriptions

Each item list can have a description — a short text and optional embedding that annotates what the list contains. Descriptions serve two purposes: they make lists self-documenting, and they act as a semantic filter when querying. In Step 9 you will see how `description_text` narrows a search to only the lists relevant to your query.

In [8]:
# Add semantic descriptions for queryability
vault.create_description(
    "research_papers",
    description="Collection of machine learning research paper excerpts",
    embedding=get_embedding("machine learning research papers")
)

vault.create_description(
    "paper_embeddings",
    description="Vector embeddings of research paper text chunks",
    embedding=get_embedding("document embeddings vectors")
)

print("Descriptions added!")


---[ TableVault Record ]---
Descriptions added!
---[ TableVault Record ]---



## Step 7: Query Content

Now query the stored content: the whole list, a single item by index, and the list's metadata.

In [9]:
# Get all documents
all_docs = vault.query_item_content("research_papers")
print("All documents:")
for i, doc in enumerate(all_docs):
    print(f"  [{i}]: {doc}")


---[ TableVault Record ]---
All documents:
  [0]: Machine learning is a subset of artificial intelligence.
  [1]: Neural networks are inspired by biological neurons.
  [2]: Deep learning has revolutionized computer vision.
  [3]: Transformers have changed natural language processing.
---[ TableVault Record ]---



In [10]:
# Get specific document by index
first_doc = vault.query_item_content("research_papers", index=0)
print(f"First document: {first_doc}")


---[ TableVault Record ]---
First document: Machine learning is a subset of artificial intelligence.
---[ TableVault Record ]---



In [11]:
# Get list-level metadata (item count, owning process, timestamp)
metadata = vault.query_item_list("research_papers")
print(f"Document list info: {metadata}")


---[ TableVault Record ]---
Document list info: {'_key': 'research_papers', '_id': 'document_list/research_papers', '_rev': '_lGqFMQW--_', 'deleted': -1, 'length': 210, 'n_items': 4, 'name': 'research_papers', 'process_index': 0, 'process_name': 'document_pipeline', 'timestamp': 3}
---[ TableVault Record ]---
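
In the metadata above, `n_items` is 4 while `length` is 210. This tutorial does not define `length`, but the number matches the total character count of the four sample documents. That suggests, as an inference rather than a documented fact, that document lists track character offsets in addition to item counts. A quick check in plain Python:

```python
from itertools import accumulate

# Same sample texts as in Step 5
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are inspired by biological neurons.",
    "Deep learning has revolutionized computer vision.",
    "Transformers have changed natural language processing.",
]

lengths = [len(doc) for doc in documents]
print(lengths)                    # [56, 51, 49, 54]
print(list(accumulate(lengths)))  # [56, 107, 156, 210]
print(sum(lengths))               # 210
```

The same figure appears as the `end_position` of `research_papers` in the Step 10 output.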



## Step 8: Query Lineage

Lineage lets you trace exactly which source data produced each derived item. This is useful for debugging data quality issues, reproducing results, and auditing how your pipeline transformed data over time. You can traverse in either direction — from a derived item back to its sources, or from a source forward to everything derived from it.

In [12]:
# Find what the embeddings were derived from
parents = vault.query_item_parent("paper_embeddings")
print(f"Embedding parents: {parents}")


---[ TableVault Record ]---
Embedding parents: [[3, 4, 'document_list', 'document_list/research_papers', 3, 4], [2, 3, 'document_list', 'document_list/research_papers', 2, 3], [1, 2, 'document_list', 'document_list/research_papers', 1, 2], [0, 1, 'document_list', 'document_list/research_papers', 0, 1]]
---[ TableVault Record ]---



In [13]:
# Find what was derived from the documents
children = vault.query_item_child("research_papers")
print(f"Document children: {children}")


---[ TableVault Record ]---
Document children: [[3, 4, 'embedding', 'paper_embeddings', 3, 4], [2, 3, 'embedding', 'paper_embeddings', 2, 3], [1, 2, 'embedding', 'paper_embeddings', 1, 2], [0, 1, 'embedding', 'paper_embeddings', 0, 1]]
---[ TableVault Record ]---
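
Traversing in both directions amounts to following the same edges forwards or backwards. The sketch below is a toy, list-level model and not TableVault's query engine. The `paper_clusters` list is hypothetical and is included only to show a second hop.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set, Tuple

# (parent_list, child_list) edges; the second edge is hypothetical
edges: List[Tuple[str, str]] = [
    ("research_papers", "paper_embeddings"),
    ("paper_embeddings", "paper_clusters"),
]

children: Dict[str, Set[str]] = defaultdict(set)
parents: Dict[str, Set[str]] = defaultdict(set)
for parent, child in edges:
    children[parent].add(child)
    parents[child].add(parent)


def reachable(start: str, graph: Dict[str, Set[str]]) -> List[str]:
    """Breadth-first walk returning every list reachable from start."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen)


print(reachable("paper_clusters", parents))    # walk back to sources
print(reachable("research_papers", children))  # walk forward to derived lists
```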



In [14]:
# Get specific range lineage
first_embedding_source = vault.query_item_parent(
    "paper_embeddings",
    start_position=0,
    end_position=1
)
print(f"First embedding came from: {first_embedding_source}")


---[ TableVault Record ]---
First embedding came from: []
---[ TableVault Record ]---



## Step 9: Similarity Search

TableVault supports both vector similarity search over embeddings and full-text search over documents. Both query types accept optional filters to narrow the search scope: `description_text` restricts results to lists whose description matches, and `code_text` restricts to lists created by processes whose source code contains the given string.

In [15]:
# Search by embedding similarity
query_text = "artificial intelligence and deep learning"
query_embedding = get_embedding(query_text)

# Find similar embeddings
similar = vault.query_embedding_list(
    embedding=query_embedding,
    use_approx=False,
)
print(f"Similar embeddings: {similar}")


---[ TableVault Record ]---
Similar embeddings: [['paper_embeddings', 3, [], []], ['paper_embeddings', 1, [], []], ['paper_embeddings', 2, [], []], ['paper_embeddings', 0, [], []]]
---[ TableVault Record ]---
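
With `use_approx=False` the query presumably bypasses the vector index and compares against every stored embedding. The sketch below shows what such an exact search does in principle, ranking stored vectors by cosine similarity in pure Python. Cosine similarity is an assumption here (this tutorial does not state which metric TableVault uses), and `rank_by_cosine` is a hypothetical helper.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_cosine(query: Sequence[float], stored: List[Sequence[float]]) -> List[Tuple[int, float]]:
    """Return (position, similarity) pairs, most similar first."""
    scored = [(pos, cosine(query, vec)) for pos, vec in enumerate(stored)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for position, score in rank_by_cosine([1.0, 0.1], stored):
    print(position, round(score, 3))
```

Because the mock embeddings in this tutorial are seeded random vectors, the order of results in the output above reflects no semantic relationship. A real embedding model is needed for meaningful rankings.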



In [16]:
# Search documents by text
results = vault.query_document_list(
    document_text="neural networks"
)
print(f"Documents matching 'neural networks': {results}")


---[ TableVault Record ]---
Documents matching 'neural networks': [['research_papers', 1, [], []]]
---[ TableVault Record ]---



Combining these filters is especially useful in large vaults with many lists: you can target exactly the data that is semantically relevant and was produced by the right pipeline stage.

In [17]:
# Filter embedding search by description text
# Only searches within lists whose description contains "document embeddings"
similar_filtered = vault.query_embedding_list(
    embedding=query_embedding,
    description_text="document embeddings",
    use_approx=False,
)
print(f"Embeddings filtered by description: {similar_filtered}")


---[ TableVault Record ]---
Embeddings filtered by description: []
---[ TableVault Record ]---
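
The empty result above is worth a second look. Neither description created in Step 6 contains the exact phrase "document embeddings" (the `paper_embeddings` description reads "Vector embeddings of research paper text chunks"), so a phrase-based filter would exclude every list before the vector search runs. The sketch below reproduces that behavior with a plain case-insensitive substring match. The actual matching rules used by TableVault are not stated in this tutorial, so treat the matching logic and the `lists_matching` helper as assumptions.

```python
from typing import Dict, List

# Descriptions copied from Step 6
descriptions: Dict[str, str] = {
    "research_papers": "Collection of machine learning research paper excerpts",
    "paper_embeddings": "Vector embeddings of research paper text chunks",
}


def lists_matching(phrase: str, descs: Dict[str, str]) -> List[str]:
    """Names of lists whose description contains the phrase (case-insensitive)."""
    needle = phrase.lower()
    return sorted(name for name, text in descs.items() if needle in text.lower())


print(lists_matching("document embeddings", descriptions))  # []
print(lists_matching("vector embeddings", descriptions))    # ['paper_embeddings']
print(lists_matching("research paper", descriptions))       # ['paper_embeddings', 'research_papers']
```

Under that assumption, a filter phrase taken from the description text itself, such as "vector embeddings", would be expected to keep `paper_embeddings` in scope.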



In [18]:
# Filter document search by description text and code text
# description_text: restricts to lists whose description mentions "research paper"
# code_text: restricts to lists created by processes whose code contains "append_document"
results_filtered = vault.query_document_list(
    document_text="neural networks",
    description_text="research paper",
    code_text="append_document",
)
print(f"Documents filtered by description and code: {results_filtered}")


---[ TableVault Record ]---
Documents filtered by description and code: [['research_papers', 1, ['research_papers_BASE_DESCRIPT'], [['document_pipeline', 2]]]]
---[ TableVault Record ]---



## Step 10: Process Queries

Every item in TableVault is attributed to the process that created it. Process queries let you audit the full output of a given pipeline run, or find which process was responsible for a particular item — useful when you have multiple pipelines writing to the same vault.

In [19]:
# Find which process created these items
creation_process = vault.query_item_creation_process("research_papers")
print(f"Created by process: {creation_process}")


---[ TableVault Record ]---
Created by process: [{'process_id': 'process_list/document_pipeline', 'index': 0}]
---[ TableVault Record ]---



In [20]:
# Find all items created in this process
process_items = vault.query_process_item("document_pipeline")
print(f"Items in process: {process_items}")


---[ TableVault Record ]---
Items in process: [{'name': 'paper_embeddings', 'start_position': None, 'end_position': None}, {'name': 'paper_embeddings', 'start_position': 0, 'end_position': 4}, {'name': 'paper_embeddings_BASE_DESCRIPT', 'start_position': None, 'end_position': None}, {'name': 'paper_metadata', 'start_position': None, 'end_position': None}, {'name': 'paper_metadata', 'start_position': 0, 'end_position': 4}, {'name': 'research_papers', 'start_position': None, 'end_position': None}, {'name': 'research_papers', 'start_position': 0, 'end_position': 210}, {'name': 'research_papers_BASE_DESCRIPT', 'start_position': None, 'end_position': None}]
---[ TableVault Record ]---
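
The flat list above is easier to read once grouped by list name. The sketch below works on a copy of the recorded output. Reading entries with `None` positions as the creation of the list itself, and the others as appended content ranges, is an interpretation rather than something this tutorial states.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Copied from the recorded output above
process_items = [
    {"name": "paper_embeddings", "start_position": None, "end_position": None},
    {"name": "paper_embeddings", "start_position": 0, "end_position": 4},
    {"name": "paper_embeddings_BASE_DESCRIPT", "start_position": None, "end_position": None},
    {"name": "paper_metadata", "start_position": None, "end_position": None},
    {"name": "paper_metadata", "start_position": 0, "end_position": 4},
    {"name": "research_papers", "start_position": None, "end_position": None},
    {"name": "research_papers", "start_position": 0, "end_position": 210},
    {"name": "research_papers_BASE_DESCRIPT", "start_position": None, "end_position": None},
]

summary: Dict[str, List[Optional[Tuple[int, int]]]] = defaultdict(list)
for item in process_items:
    if item["start_position"] is None:
        summary[item["name"]].append(None)  # list-level entry (assumed: creation)
    else:
        summary[item["name"]].append((item["start_position"], item["end_position"]))

for name, ranges in summary.items():
    content = [r for r in ranges if r is not None]
    print(f"{name}: content ranges {content}")
```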

