# NVIDIA Video Search & Summarization (VSS): A100 Single-GPU (80GB) Local Deployment

This notebook deploys VSS on a **single A100 (80GB)** instance with **all models running locally**:
- Cosmos Reason1 VLM (inside the VSS engine container)
- LLM NIM (local)
- Embedding NIM (local)
- Reranker NIM (local)

It uses the repository's **single-GPU Docker Compose file** at `deploy/docker/local_deployment_single_gpu/compose.yaml`.

Notes:
- This notebook is designed for **1 GPU only**.
- It does **not** assume `/ephemeral` storage.
- It does **not** modify Docker daemon settings.


## 1) Prerequisites

You need:
- NVIDIA driver + CUDA working (`nvidia-smi` should show the A100)
- Docker + Docker Compose v2
- An `NGC_API_KEY` with access to the required images/models

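Before running the cells below, it can help to confirm that the required executables are on `PATH`. The following is a minimal sketch, not part of the VSS repo: the helper name `missing_tools` is hypothetical, and it only checks that each tool can be found, not that the driver or Docker daemon actually works.

```python
import shutil

def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

# Executables this notebook shells out to; `git` is only used to locate the repo root.
required = ["nvidia-smi", "docker", "git"]

missing = missing_tools(required)
if missing:
    print("Missing from PATH:", ", ".join(missing))
else:
    print("All required executables were found on PATH.")
```

This does not verify Compose v2 itself; running `docker compose version` in a terminal is a quick way to check that.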

In [None]:
import os
import subprocess
from pathlib import Path

# ---------------------------
# REQUIRED: set your NGC API key
# ---------------------------
os.environ.setdefault("NGC_API_KEY", "***")  # TODO: replace

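# Resolve the repo root via git; fall back to the parent of the current
# directory (assumes the notebook sits one level below the repo root).
try:
    VSS_REPO_DIR = subprocess.check_output(
        ["git", "rev-parse", "--show-toplevel"],
        text=True,
        stderr=subprocess.DEVNULL,  # keep git warnings out of the captured path
    ).strip()
except (subprocess.CalledProcessError, OSError):
    VSS_REPO_DIR = str(Path.cwd().resolve().parent)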

os.environ["VSS_REPO_DIR"] = VSS_REPO_DIR

COMPOSE_DIR = str(Path(VSS_REPO_DIR) / "deploy" / "docker" / "local_deployment_single_gpu")
os.environ["VSS_COMPOSE_DIR"] = COMPOSE_DIR

print("VSS_REPO_DIR=", VSS_REPO_DIR)
print("VSS_COMPOSE_DIR=", COMPOSE_DIR)


## 2) Configure deployment settings (A100 single-GPU baseline)

This section sets:
- Local data directories (host-mounted)
- Ports
- Cosmos Reason1 VLM selection
- Conservative GPU-memory defaults for running *everything on one GPU*

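Since model caches, TensorRT engines, and assets all land under the home directory, a quick free-space check before creating the data directories may save a failed first run. This is a sketch under assumptions: `free_gib` is a hypothetical helper, and `MIN_FREE_GIB` is an illustrative threshold, not a figure from the VSS documentation.

```python
import shutil
from pathlib import Path

def free_gib(path: Path) -> float:
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30

# Hypothetical threshold; pick one that fits your models and videos.
MIN_FREE_GIB = 100

home = Path.home()
free = free_gib(home)
print(f"Free space under {home}: {free:.1f} GiB")
if free < MIN_FREE_GIB:
    print(f"Warning: less than {MIN_FREE_GIB} GiB free.")
```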

In [None]:
import os
from pathlib import Path

home = Path.home().resolve()
data_root = (home / "vss-data").resolve()

# Host paths (avoid filling container layers on a 256GB disk)
asset_dir = (data_root / "assets").resolve()
milvus_dir = (data_root / "milvus").resolve()
nim_cache_dir = (data_root / "nim-cache").resolve()
ngc_model_cache_dir = (data_root / "ngc-model-cache").resolve()
via_logs_dir = (data_root / "via-logs").resolve()
trt_engine_dir = (data_root / "trt-engines").resolve()

for p in [asset_dir, milvus_dir, nim_cache_dir, ngc_model_cache_dir, via_logs_dir, trt_engine_dir]:
    p.mkdir(parents=True, exist_ok=True)

# Ports
os.environ.setdefault("BACKEND_PORT", "8100")
os.environ.setdefault("FRONTEND_PORT", "9100")

# DB credentials (local dev defaults)
os.environ.setdefault("GRAPH_DB_USERNAME", "neo4j")
os.environ.setdefault("GRAPH_DB_PASSWORD", "password")
os.environ.setdefault("ARANGO_DB_USERNAME", "root")
os.environ.setdefault("ARANGO_DB_PASSWORD", "password")

# Host mounts
os.environ["ASSET_STORAGE_DIR"] = str(asset_dir)
os.environ["NGC_MODEL_CACHE"] = str(ngc_model_cache_dir)
os.environ["VIA_LOG_DIR"] = str(via_logs_dir)
os.environ["TRT_ENGINE_PATH"] = str(trt_engine_dir)
os.environ["LOCAL_NIM_CACHE"] = str(nim_cache_dir)

# Compose + config mounts (important so the VSS container uses the local NIM endpoints)
compose_dir = Path(os.environ["VSS_COMPOSE_DIR"]).resolve()
compose_yaml = (compose_dir / "compose.yaml").resolve()
ca_rag_cfg = (compose_dir / "config.yaml").resolve()
guardrails_dir = (compose_dir / "guardrails").resolve()

if not compose_yaml.exists():
    raise FileNotFoundError(f"compose.yaml not found: {compose_yaml}")
if not ca_rag_cfg.exists():
    raise FileNotFoundError(f"config.yaml not found: {ca_rag_cfg}")
if not guardrails_dir.exists():
    raise FileNotFoundError(f"guardrails dir not found: {guardrails_dir}")

os.environ["CA_RAG_CONFIG"] = str(ca_rag_cfg)
os.environ["GUARDRAILS_CONFIG"] = str(guardrails_dir)

# This env var is used by compose.yaml to persist Milvus data (mounted into milvus-standalone at /var/lib/milvus).
os.environ["MILVUS_DATA_DIR"] = str(milvus_dir)

# VLM: Cosmos Reason1 (local)
os.environ["VLM_MODEL_TO_USE"] = "cosmos-reason1"

# MODEL_PATH should point to a local model directory OR a supported remote spec (e.g. hf/git/ngc).
os.environ.setdefault("MODEL_PATH", "git:https://huggingface.co/nvidia/Cosmos-Reason1-7B")

# Single GPU only
os.environ["NUM_GPUS"] = "1"
os.environ["NIM_GPU_DEVICE"] = "0"

# GPU memory knobs (because LLM + embed + reranker + VLM share the same A100)
os.environ.setdefault("TRT_LLM_MEM_USAGE_FRACTION", "0.6")
os.environ.setdefault("VLLM_GPU_MEMORY_UTILIZATION", "0.6")

# Conservative batching
os.environ.setdefault("VLM_BATCH_SIZE", "8")

# Optional features
os.environ.setdefault("DISABLE_CV_PIPELINE", "true")
os.environ.setdefault("ENABLE_AUDIO", "false")
os.environ.setdefault("DISABLE_GUARDRAILS", "true")

# Keep assets bounded on a 256GB disk (optional)
os.environ.setdefault("MAX_ASSET_STORAGE_SIZE_GB", "80")

print("Configured host data root:", data_root)
print("VSS_COMPOSE_DIR=", str(compose_dir))
print("compose.yaml=", str(compose_yaml))
print("CA_RAG_CONFIG=", os.environ["CA_RAG_CONFIG"])
print("GUARDRAILS_CONFIG=", os.environ["GUARDRAILS_CONFIG"])
print("BACKEND_PORT=", os.environ["BACKEND_PORT"], "FRONTEND_PORT=", os.environ["FRONTEND_PORT"])
print("VLM_MODEL_TO_USE=", os.environ["VLM_MODEL_TO_USE"])
print("MODEL_PATH=", os.environ["MODEL_PATH"])

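Because the LLM, embedding, reranker, and VLM all share one A100, it can be useful to see how much GPU memory is in use before and after the stack starts. The sketch below parses the CSV output of `nvidia-smi --query-gpu=memory.used,memory.total`; the helper name `parse_gpu_memory` is hypothetical, and the check is skipped when `nvidia-smi` is unavailable.

```python
import shutil
import subprocess

def parse_gpu_memory(csv_text: str):
    """Parse nvidia-smi CSV rows of 'used, total' (MiB) into (used, total) int tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows

if shutil.which("nvidia-smi"):
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        for idx, (used, total) in enumerate(parse_gpu_memory(out)):
            print(f"GPU {idx}: {used} MiB used of {total} MiB ({used / total:.0%})")
    except (subprocess.CalledProcessError, ValueError) as exc:
        # The driver may be unavailable, or a field may not be numeric.
        print("Could not read GPU memory:", exc)
else:
    print("nvidia-smi not found; skipping GPU memory check.")
```

Running it again once all services are ready gives a rough picture of how much headroom the chosen memory fractions leave.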

## 3) Log in to NGC (Docker)
Docker must be logged in to `nvcr.io` to pull the NIM images and the VSS engine image.


In [None]:
%%bash
set -euo pipefail
if [ -z "${NGC_API_KEY:-}" ] || [ "${NGC_API_KEY}" = "***" ]; then
  echo "ERROR: Please set NGC_API_KEY in the notebook cell above."
  exit 1
fi
echo "${NGC_API_KEY}" | docker login nvcr.io -u '$oauthtoken' --password-stdin


## 4) Start the stack (VSS + local NIMs + databases)
This uses docker compose in `deploy/docker/local_deployment_single_gpu`.


In [None]:
%%bash
set -euo pipefail
cd "${VSS_COMPOSE_DIR}"
docker compose up -d --quiet-pull
docker compose ps

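`docker compose ps` prints a table; for a programmatic check, the same command accepts `--format json`. The sketch below parses that output and lists services that are not running. It is an illustration rather than part of the repo: the helper names are hypothetical, the `Service` and `State` field names are assumed from Compose v2 output, and the service names in the sample are made up.

```python
import json

def parse_compose_ps(raw: str):
    """Parse `docker compose ps --format json` output into a list of dicts.

    Accepts either a single JSON array or one JSON object per line,
    since the shape differs between Compose releases.
    """
    raw = raw.strip()
    if not raw:
        return []
    if raw.startswith("["):
        return json.loads(raw)
    return [json.loads(line) for line in raw.splitlines() if line.strip()]

def not_running(services):
    """Names of services whose State field is not 'running'."""
    return [
        s.get("Service", s.get("Name", "?"))
        for s in services
        if s.get("State") != "running"
    ]

# Made-up sample in the one-object-per-line shape.
sample = (
    '{"Service": "service-a", "State": "running"}\n'
    '{"Service": "service-b", "State": "exited"}'
)
print(not_running(parse_compose_ps(sample)))  # prints ['service-b']
```

To use it against the live stack, capture the output of `docker compose ps --all --format json` with `subprocess.run(..., cwd=os.environ["VSS_COMPOSE_DIR"], capture_output=True, text=True)` and pass its `stdout` to `parse_compose_ps`.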

## 5) Wait for services to become ready
The next cell polls each health endpoint until it returns HTTP 200:
- LLM NIM (`http://localhost:8000/v1/health/ready`)
- Embedding NIM (`http://localhost:8006/v1/health/ready`)
- Reranker NIM (`http://localhost:8005/v1/health/ready`)
- VSS backend (`http://localhost:${BACKEND_PORT}/health/ready`)


In [None]:
import os
import time
import requests

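def wait_ready(url: str, timeout_s: int = 1800, interval_s: int = 5) -> bool:
    """Poll `url` until it returns HTTP 200; raise if `timeout_s` elapses first."""
    deadline = time.monotonic() + timeout_s
    last_err = None
    while time.monotonic() < deadline:
        try:
            r = requests.get(url, timeout=3)
            if r.status_code == 200:
                return True
            # Record non-200 responses so the timeout message is informative.
            last_err = f"HTTP {r.status_code}"
        except requests.RequestException as e:
            last_err = e
        time.sleep(interval_s)
    raise RuntimeError(f"Timed out waiting for ready: {url} (last_err={last_err})")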

backend_port = os.environ["BACKEND_PORT"]
checks = [
    ("LLM NIM", "http://localhost:8000/v1/health/ready"),
    ("Embedding NIM", "http://localhost:8006/v1/health/ready"),
    ("Reranker NIM", "http://localhost:8005/v1/health/ready"),
    ("VSS Backend", f"http://localhost:{backend_port}/health/ready"),
]

for name, url in checks:
    print(f"Waiting for {name}: {url}")
    wait_ready(url)
    print(f"{name} is ready")

print("All services are ready.")
print("Backend URL: ", f"http://localhost:{backend_port}")
print("Frontend URL: ", f"http://localhost:{os.environ['FRONTEND_PORT']}")

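The cell above waits for each service in turn, so in the worst case the per-service timeouts add up. One alternative is to poll all endpoints in parallel under a single shared deadline. The following is a minimal sketch using only the standard library; `wait_all` and `http_ok` are hypothetical names, and the demo call uses a fake probe so that it returns immediately.

```python
import http.client
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def http_ok(url: str) -> bool:
    """True if a GET on `url` returns HTTP 200; False on any connection or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # urllib.error.URLError and HTTPError are subclasses of OSError.
        return False

def wait_all(checks, probe, timeout_s=1800.0, interval_s=5.0):
    """Poll every (name, url) pair in parallel until `probe(url)` is True
    or the shared deadline passes. Returns {name: became_ready}."""
    deadline = time.monotonic() + timeout_s

    def wait_one(url):
        while time.monotonic() < deadline:
            if probe(url):
                return True
            time.sleep(interval_s)
        return False

    with ThreadPoolExecutor(max_workers=max(1, len(checks))) as pool:
        futures = {name: pool.submit(wait_one, url) for name, url in checks}
        return {name: fut.result() for name, fut in futures.items()}

# Demo with a fake probe that always succeeds, so no network access is needed.
print(wait_all([("demo", "http://example.invalid")], probe=lambda url: True,
               timeout_s=1, interval_s=0.1))  # prints {'demo': True}
```

With the stack running, `wait_all(checks, http_ok)` would reuse the `checks` list from the cell above and report which services became ready within the shared timeout.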

## 6) (Optional) View logs
If a service is slow to become ready (the first run downloads models and builds engines), inspect the most recent log lines.


In [None]:
%%bash
set -euo pipefail
cd "${VSS_COMPOSE_DIR}"
docker compose logs --no-color --tail=200

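With several containers writing to the same log stream, the last 200 lines can be noisy. A small filter that keeps only lines mentioning common failure keywords may help narrow things down. This is a sketch: `error_lines` is a hypothetical helper, the keyword list is an assumption rather than anything VSS-specific, and the sample log lines are made up.

```python
import re

# Case-insensitive keywords that often accompany startup failures (an assumption).
ERROR_PATTERN = re.compile(r"error|exception|traceback|out of memory", re.IGNORECASE)

def error_lines(log_text: str):
    """Return the lines of `log_text` that contain one of the failure keywords."""
    return [line for line in log_text.splitlines() if ERROR_PATTERN.search(line)]

# Made-up log excerpt for illustration.
sample_logs = "\n".join([
    "service-a  | INFO server started",
    "service-b  | ERROR failed to load engine",
    "service-b  | CUDA out of memory",
])
for line in error_lines(sample_logs):
    print(line)
```

To apply it to the real stack, capture the output of `docker compose logs --no-color --tail=200` with `subprocess.run(..., capture_output=True, text=True)` and pass `stdout` to `error_lines`.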

## 7) Shutdown / cleanup
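Bring the stack down when finished. The host-mounted directories under `~/vss-data` (model caches, Milvus data, assets) are not removed by `docker compose down`.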


In [None]:
%%bash
set -euo pipefail
cd "${VSS_COMPOSE_DIR}"
docker compose down

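After shutdown, the host-mounted data directories remain on disk. The sketch below reports how much space each one occupies, which can help decide whether to clear caches before the next run. `dir_size_bytes` is a hypothetical helper, and the path assumes the `~/vss-data` data root configured in section 2.

```python
import os
from pathlib import Path

def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all files under `root`, skipping unreadable entries."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += (Path(dirpath) / name).lstat().st_size
            except OSError:
                pass  # file vanished or is not accessible
    return total

data_root = Path.home() / "vss-data"
if data_root.is_dir():
    for child in sorted(p for p in data_root.iterdir() if p.is_dir()):
        print(f"{child.name}: {dir_size_bytes(child) / 2**30:.2f} GiB")
else:
    print(f"{data_root} does not exist; nothing to report.")
```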