# beyond-vector-search: Databricks demo

This notebook runs the repo **inside Databricks** (Repos checkout) without any external network calls.

It demonstrates:
- **Adaptive retrieval routing** (keyword vs vector)
- **Offline evaluation loop** that updates router weights
- **Telemetry** inspection (Delta tables in this notebook; SQLite is the local/dev default)

> Tip: If you keep SQLite telemetry and want the DB to persist across cluster restarts, set the `BVS_DB_PATH` environment variable to a DBFS location (example below).
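
As a minimal sketch of the tip above: the setup cell reads `BVS_DB_PATH` from the environment, so the variable has to be set before that cell runs. The helper name `configure_db_path` is hypothetical and not part of the repo; it only sets the variable and creates the parent directory.

```python
from __future__ import annotations

import os
from pathlib import Path


def configure_db_path(db_path: str) -> str:
    """Point BVS_DB_PATH at db_path and make sure its parent directory exists.

    Hypothetical helper for illustration; the repo itself only reads BVS_DB_PATH.
    """
    path = Path(db_path)
    # Create the parent directory so SQLite can create the file later.
    path.parent.mkdir(parents=True, exist_ok=True)
    os.environ["BVS_DB_PATH"] = str(path)
    return str(path)


# Example call on a Databricks cluster (left commented out because /dbfs
# only exists there):
# configure_db_path("/dbfs/tmp/beyond_vector_search.sqlite")
```

Run this before the setup cell; otherwise `DB_PATH` has already been resolved to the default location under `runs/`.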


In [None]:
from __future__ import annotations

import os
import sys
from pathlib import Path


def find_repo_root(start: Path | None = None) -> Path:
    """Find the repo root by walking up until we find pyproject.toml."""
    p = (start or Path.cwd()).resolve()
    for _ in range(12):
        if (p / "pyproject.toml").exists():
            return p
        p = p.parent
    raise RuntimeError("Could not find repo root (pyproject.toml not found).")


REPO_ROOT = find_repo_root()
SRC_DIR = REPO_ROOT / "src"

# Make the package importable without pip install.
if str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))

# Telemetry backend selection:
# - Default (local/dev): SQLite
# - Databricks Lakehouse: Delta tables
#
# For Delta telemetry, set these env vars in the cluster (or in a cell below):
#   BVS_TELEMETRY=delta
#   BVS_DELTA_RUNS_TABLE=<catalog>.<schema>.beyond_vector_search_runs
#   BVS_DELTA_STATE_TABLE=<catalog>.<schema>.beyond_vector_search_router_state
#
# If you keep SQLite, you can set BVS_DB_PATH to a DBFS path like:
#   /dbfs/tmp/beyond_vector_search.sqlite
DB_PATH = os.environ.get("BVS_DB_PATH", str(REPO_ROOT / "runs" / "beyond_vector_search.sqlite"))

print("REPO_ROOT:", REPO_ROOT)
print("PYTHONPATH[0]:", sys.path[0])
print("BVS_TELEMETRY:", os.environ.get("BVS_TELEMETRY", "sqlite"))
print("SQLite DB_PATH (if used):", DB_PATH)
print("Delta runs table:", os.environ.get("BVS_DELTA_RUNS_TABLE"))
print("Delta state table:", os.environ.get("BVS_DELTA_STATE_TABLE"))


In [None]:
import os

# --- Choose your Databricks catalog/schema for Delta telemetry ---
# If you're not using Unity Catalog, you can try "hive_metastore.default".
CATALOG = os.environ.get("BVS_CATALOG", "main")
SCHEMA = os.environ.get("BVS_SCHEMA", "default")

# Enable Delta telemetry (Lakehouse) by default in Databricks.
os.environ.setdefault("BVS_TELEMETRY", "delta")
os.environ.setdefault("BVS_DELTA_RUNS_TABLE", f"{CATALOG}.{SCHEMA}.beyond_vector_search_runs")
os.environ.setdefault("BVS_DELTA_STATE_TABLE", f"{CATALOG}.{SCHEMA}.beyond_vector_search_router_state")

print("Using telemetry:", os.environ["BVS_TELEMETRY"])
print("Runs table:", os.environ["BVS_DELTA_RUNS_TABLE"])
print("State table:", os.environ["BVS_DELTA_STATE_TABLE"])

from beyond_vector_search.run import run_once

out = run_once(query="How to fix INC-10010?", k=5, db_path=DB_PATH)
out


In [None]:
from beyond_vector_search.evaluate import evaluate_all

report = evaluate_all(k=5, db_path=DB_PATH)
{
  "mean_score": report["mean_score"],
  "n": report["n"],
  "router_state": report["router_state"],
}


In [None]:
# Inspect the most recent runs from the Delta runs table
from pyspark.sql import functions as F

runs_df = spark.table(os.environ["BVS_DELTA_RUNS_TABLE"]).orderBy(F.col("run_id").desc()).limit(10)
display(runs_df)


In [None]:
# Inspect the current router state stored in the Delta state table
state_df = spark.table(os.environ["BVS_DELTA_STATE_TABLE"]).orderBy("key")
display(state_df)
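
The two cells above read the Delta tables. If you keep SQLite telemetry instead, a rough equivalent is to open the file at `DB_PATH` with the standard `sqlite3` module. The sketch below makes no assumption about the repo's table names: it discovers them from `sqlite_master` and reports row counts. The function name `list_sqlite_tables` is hypothetical.

```python
from __future__ import annotations

import sqlite3


def list_sqlite_tables(db_path: str) -> dict[str, int]:
    """Return {table_name: row_count} for every user table in a SQLite file.

    Note: sqlite3.connect creates an empty file if db_path does not exist yet.
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        # Skip SQLite's internal bookkeeping tables (e.g. sqlite_sequence).
        names = [name for (name,) in rows if not name.startswith("sqlite_")]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()
```

Calling `list_sqlite_tables(DB_PATH)` after `run_once` should show which tables the SQLite backend created and how many rows each holds.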


## Notes

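- If you want to **reset** learning with SQLite telemetry, delete the SQLite file at `DB_PATH` (or point `BVS_DB_PATH` to a new file). With Delta telemetry, runs and router state live in the tables named by `BVS_DELTA_RUNS_TABLE` and `BVS_DELTA_STATE_TABLE` instead, so deleting the SQLite file does not affect them.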
- The core decision logic lives in `src/beyond_vector_search/router.py`.
- The offline loop that updates weights lives in `src/beyond_vector_search/evaluate.py`.
- The architecture diagram is `diagrams/architecture.html`.
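
The SQLite reset described in the first note can be sketched as a small helper. The name `reset_sqlite_telemetry` is hypothetical and not part of the repo; it only removes the file and does not touch the Delta tables.

```python
from __future__ import annotations

from pathlib import Path


def reset_sqlite_telemetry(db_path: str) -> bool:
    """Delete the SQLite telemetry file; return True if a file was removed."""
    path = Path(db_path)
    if path.is_file():
        path.unlink()
        return True
    # Nothing to delete: the file was never created or is already gone.
    return False
```

In this notebook you would call `reset_sqlite_telemetry(DB_PATH)`. Presumably the next run then starts from a fresh file, though that depends on how the repo initializes its database.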
