# Streaming Data to RAG

Traditional retrieval-augmented generation (RAG) systems rely on static data ingested in batches, which limits their ability to support time-critical use cases like emergency response or live monitoring. These situations require immediate access to dynamic data sources such as sensor feeds or radio signals. 

The Streaming Data to RAG developer example solves this by enabling RAG systems to process live data streams in real-time. It features a GPU-accelerated software-defined radio (SDR) pipeline that continuously captures radio frequency (RF) signals, transcribes them into searchable text, embeds, and indexes them in real time. This live data is then fed to a large language model (LLM), allowing context-aware queries over dynamic streams.

Designed for scalability across edge and cloud environments, this reference example unlocks real-time situational awareness for use cases like spectrum monitoring, intelligence gathering, and other mission-critical applications—while retaining RAG’s strengths in delivering accurate, relevant results. 


## Full Architecture Diagram
<img src="../docs/arch-diagram.png" alt="Architecture Diagram" style="max-width: 600px; width: 100%;" />

## Submodule Initialization

This blueprint utilizes 2 open-source NVIDIA repositories that have been augmented for this workflow:
- NeMo Agent Toolkit UI: Open-source repository used as the UI
- Context-Aware RAG: Open-source RAG repository originally shown in the [VSS Blueprint](https://build.nvidia.com/nvidia/video-search-and-summarization)

In [1]:
!git submodule update --init --recursive

Submodule 'external/NeMo-Agent-Toolkit-UI' (https://github.com/NVIDIA/NeMo-Agent-Toolkit-UI.git) registered for path '../external/NeMo-Agent-Toolkit-UI'
Submodule 'external/context-aware-rag' (https://github.com/NVIDIA/context-aware-rag.git) registered for path '../external/context-aware-rag'
Cloning into '/home/deustice/Projects/streaming-data-to-rag/external/NeMo-Agent-Toolkit-UI'...
Cloning into '/home/deustice/Projects/streaming-data-to-rag/external/context-aware-rag'...
Submodule path '../external/NeMo-Agent-Toolkit-UI': checked out 'b9ccc559efbd0ac378269da1d3427a5954bd5f8b'
Submodule path '../external/context-aware-rag': checked out '04bb45b89598fb47d253fb15d50a0acb444ef95d'


## Docker Login

To pull images required by the blueprint from NGC, you must first [authenticate Docker with nvcr.io](https://docs.nvidia.com/launchpad/ai/base-command-coe/latest/bc-coe-docker-basics-step-02.html#logging-in-to-ngc-on-a-workstation). Paste your NVIDIA API key in the cell below. If you don't have an API key, one can be obtained for free with an NVIDIA developer account at [build.nvidia.com](https://build.nvidia.com).

In [2]:
import subprocess
import os

# ADD YOUR API KEY
NVIDIA_API_KEY = "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"
os.environ['NVIDIA_API_KEY'] = NVIDIA_API_KEY
os.environ['NGC_API_KEY'] = NVIDIA_API_KEY

# Authenticate local Docker with NGC
cmd = f"echo {NVIDIA_API_KEY} | docker login nvcr.io -u '$oauthtoken' --password-stdin"
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
print(result.stdout)

Login Succeeded



## Model Directory

For the NIMs that will be deployed locally, we will set up a directory to use as a cache for model weights, so that we don't need to re-download the models each time the NIMs are restarted:

In [3]:
# Set up local directory to cache model weights downloaded from NGC
import os
os.environ['MODEL_DIRECTORY'] = os.path.abspath(os.path.join(os.getcwd(), "../models"))
!echo "Setting MODEL_DIRECTORY to '$MODEL_DIRECTORY'"
!mkdir -p $MODEL_DIRECTORY
!chmod 777 -R $MODEL_DIRECTORY

Setting MODEL_DIRECTORY to '/home/deustice/Projects/streaming-data-to-rag/models'


## Expose Frontend

In order to access the frontend, ensure that port 3000 is exposed on whatever instance the blueprint is running on.

For Brev deployments, go to your instance page on [brev.nvidia.com](brev.nvidia.com) and scroll down to "Using Ports", then expose port 3000.

## File Replay Setup

This notebook is configured to run via file replay, which is used when a physical antenna & SDR are not hooked up to the system. The file replay service reads audio from file, FM modulates it, and streams the resulting baseband I/Q samples over UDP.

Here, we only set the files to replay (which **must** be located in `src/file-replay/files/sample_files`) and the max replay time. There are other configurable options to the replay outlined in the README.

Some sample files from [NVIDIA's AI Podcast](https://ai-podcast.nvidia.com/) have been included here for demo purposes. The audio has been clipped for brevity, and names and voices have been modified for privacy.

You may also upload your own audio files to explore the Streaming Data to RAG developer example on custom content. Any audio file that can be loaded with [`librosa.load`](https://librosa.org/doc/latest/generated/librosa.load.html) can be used.

<div class="alert alert-block alert-info">
    <b>Note:</b>
    
In the [Holoscan SDR's parameter file](../src/software-defined-radio/params.yaml), be sure to check that the following options match or do not contradict your replay configuration:
    
- Sample rate (Hz): `sensor.sample_rate`
- UDP port: `network_rx.dst_port`
- UDP max payload size: `network_rx.max_payload_size`
- Number files / channels: `channelizer.num_channels`
- Channel spacing (Hz): `channelizer.channel_spacing`
</div>

### Set Environment Variables

In [4]:
# Replay files must be comma-separated and located in `src/file-replay/files`
%env REPLAY_FILES=sample_files/ai_gtc_1.mp3, sample_files/ai_gtc_2.mp3, sample_files/ai_gtc_3.mp3

# Max time to replay in seconds - set to 1 hour
%env REPLAY_TIME=3600

# Set the default max file size to 50MB
%env REPLAY_MAX_FILE_SIZE=50

env: REPLAY_FILES=sample_files/ai_gtc_1.mp3, sample_files/ai_gtc_2.mp3, sample_files/ai_gtc_3.mp3
env: REPLAY_TIME=3600
env: REPLAY_MAX_FILE_SIZE=50


### Check file sizes

To prevent issues running out of GPU memory, the default max file size for audio files is 50 MB. This is configured by the `REPLAY_MAX_FILE_SIZE` environment variable set in the cell above. The cell below will check the configured replay files and increase the max size to accommodate the largest one. If it is increased, be mindful of the available GPU memory, which can always be checked with `nvidia-smi` from the command line.

In [5]:
import os

# Get REPLAY_FILES from environment and split and strip whitespace
replay_files_env = os.environ.get("REPLAY_FILES", "")
replay_files = [f.strip() for f in replay_files_env.split(",") if f.strip()]

print("Checking file sizes for REPLAY_FILES:")
original_max_size_mb = float(os.environ.get("REPLAY_MAX_FILE_SIZE"))
file_directory = os.path.join("..", "src", "file-replay", "files")
for fname in replay_files:
    file_path = os.path.join(file_directory, fname)  # Files are relative to src/file-replay/files
    file_exists = os.path.exists(file_path)

    # Check the file size and compare to the configured max size
    max_size_mb = float(os.environ.get("REPLAY_MAX_FILE_SIZE"))
    file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
    under_size_limit = file_size_mb < max_size_mb

    if file_exists and under_size_limit:
        # File exists and is under the max size limit
        print(f"  ✅ '{fname}': {file_size_mb:.2f} MB")
    elif not under_size_limit:
        # File exists and exceeds the max size limit
        new_limit = file_size_mb + 1
        print(
            f"  🚧 '{fname}' ({file_size_mb:.2f} MB) exceeds {max_size_mb:0.2f} MB limit... "
            f"REPLAY_MAX_FILE_SIZE --> {new_limit:0.2f} MB"
        )
        os.environ["REPLAY_MAX_FILE_SIZE"] = str(new_limit)
    else:
        # File does not exist
        print(f"  ❌ '{fname}': File not found at {file_path}")
        raise RuntimeError(f"File {fname} not found at {file_path}")

max_size_mb = float(os.environ.get("REPLAY_MAX_FILE_SIZE"))
if original_max_size_mb < max_size_mb:
    print("\n----------------------------------------------------------------------------")
    print(f"🚧 WARNING: REPLAY_MAX_FILE_SIZE was increased to {max_size_mb:0.2f} MB to accommodate the largest file.")
    print("🚧 Be aware that the replay container holds all audio files in GPU memory while running.")
    print("🚧 For higher file sizes, be mindful of the availablility of GPU memory, which can be checked with `nvidia-smi`.")


Checking file sizes for REPLAY_FILES:
  ✅ 'sample_files/ai_gtc_1.mp3': 26.03 MB
  ✅ 'sample_files/ai_gtc_2.mp3': 30.18 MB
  ✅ 'sample_files/ai_gtc_3.mp3': 24.30 MB


## Validate Required Environment Variables Are Set

The below cell will throw and error if there was an issue setting either of the required ENV variables above:

In [6]:
import os

# Check NVIDIA_API_KEY
api_key = os.environ.get("NVIDIA_API_KEY")
if not api_key or api_key == "nvapi-your-api-key":
    raise RuntimeError(
        "NVIDIA_API_KEY environment variable is not set or is set to the default placeholder value. "
        "Please set your NVIDIA API key from build.nvidia.com."
    )

# Check REPLAY_FILES
replay_files = os.environ.get("REPLAY_FILES")
if not replay_files or not replay_files.strip():
    raise RuntimeError(
        "REPLAY_FILES environment variable is not set. "
        "Please specify the files to replay (comma-separated, located in src/file-replay/files)."
    )

## Build Required Docker Services

### Note on `docker_scripts`

In this notebook, we use a small helper module called `docker_scripts` to make running Docker and Docker Compose commands easier within a Jupyter environment. This module simply takes a shell command (like `docker build ...` or `docker compose ...`), runs it, and prints the output in a more readable way for notebook users.

**You do not need to use `docker_scripts` if you are running commands in a terminal.** For normal development or deployment, just use the standard Docker and Docker Compose commands as shown in the README.

The following cells use `docker_scripts` only to improve the notebook experience.

First, we'll build the images needed for the context-aware RAG system:


In [7]:
# Import scripts used to manage docker containers
from docker_scripts import tail_bash_command, wait_for_service, docker_ps

### Build RAG Backend

In [8]:
# Build the Context-Aware RAG docker image
tail_bash_command(
    "docker build -t ctx_rag "
    "-f ../external/context-aware-rag/docker/Dockerfile "
    "../external/context-aware-rag",
    n=25
)

#14 39.71  + sniffio==1.3.1
#14 39.71  + sqlalchemy==2.0.43
#14 39.71  + starlette==0.46.2
#14 39.71  + tenacity==8.5.0
#14 39.71  + tiktoken==0.11.0
#14 39.71  + tqdm==4.67.1
#14 39.71  + typing-extensions==4.15.0
#14 39.71  + typing-inspect==0.9.0
#14 39.71  + typing-inspection==0.4.1
#14 39.71  + tzdata==2025.2
#14 39.71  + ujson==5.11.0
#14 39.71  + urllib3==2.5.0
#14 39.71  + uvicorn==0.30.6
#14 39.71  + vss-ctx-rag==0.5.1rc5 (from file:///app/dist/vss_ctx_rag-0.5.1rc5-py3-none-any.whl)
#14 39.71  + wrapt==1.17.3
#14 39.71  + yarl==1.20.1
#14 39.71  + zipp==3.23.0
#14 DONE 40.1s

#15 exporting to image
#15 exporting layers
#15 exporting layers 2.9s done
#15 writing image sha256:053b6aa06de81550e95ebe69a9ee01120e35662326528b7ac6beabc19b141a05 done
#15 naming to docker.io/library/ctx_rag 0.0s done
#15 DONE 2.9s
✅ Done


In [9]:
# Build remaining images used by Context-Aware RAG workflow
tail_bash_command(
    "docker compose "
    "-f ../external/context-aware-rag/docker/deploy/compose.yaml "
    "build",
    n=25
)

✅ Done


### Build Ingestion Workflow

Now, we build the images for the FM radio ingestion workflow:
- Parakeet ASR NIM (`asr-nim`)
- Holoscan-based SDR (`holscan-sdr`)
- NeMo Agent Toolkit UI (`agentiq-ui`)
- File replay (`fm-file-replay`)

This could take up to 10 minutes as images are pulled onto the machine and built.

In [10]:
# Build containers used in ingestion workflow
tail_bash_command(
    "docker compose "
    "-f ../deploy/docker-compose.yaml "
    "--profile replay build",
    n=25
)

#49 128.3   - Updating numba (0.57.1+1.gc785c8f1f /rapids/numba-0.57.1+1.gc785c8f1f-cp310-cp310-linux_x86_64.whl -> 0.60.0)
#49 128.3   - Updating pooch (1.7.0 -> 1.8.2)
#49 128.3   - Updating scikit-learn (1.2.0 /rapids/scikit_learn-1.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl -> 1.6.1)
#49 128.3   - Updating soundfile (0.12.1 -> 0.13.1)
#49 128.3   - Installing soxr (0.5.0.post1)
#49 128.3   - Downgrading typing-extensions (4.15.0 -> 4.14.0)
#49 138.9   - Updating cupy-cuda12x (12.1.0 /rapids/cupy_cuda12x-12.1.0-cp310-cp310-linux_x86_64.whl -> 13.4.1)
#49 138.9   - Updating librosa (0.9.2 -> 0.11.0)
#49 DONE 143.7s

#53 [replay 5/5] WORKDIR /workspace
#53 DONE 0.1s

#54 [replay] exporting to image
#54 exporting layers
#54 exporting layers 2.4s done
#54 writing image sha256:58b9f7a480914d1a7534699db752db32df2b9f4c7818e858b0e2ebdeca4886e9 done
#54 naming to docker.io/library/fm-file-replay:latest 0.0s done
#54 DONE 2.4s

#55 [replay] resolving provenance for metadat

## Deployment

Now we'll deploy the services needed to run the blueprint. We start by first deploying the context-aware RAG services, then once those are running, the FM radio ingestion pipeline.

At any point, containers can be spun down with the following command:
```bash
docker compose \
    -f ../external/context-aware-rag/docker/deploy/compose.yaml \
    -f ../deploy/docker-compose.yaml --profile replay \
    down
```

<div class="alert alert-block alert-info">
    <b>Note:</b>
    
We use helper scripts from `docker_scripts` to display Docker Compose output and logs in the notebook. This is only necessary because Docker Compose's native STDOUT printing does not display well in Jupyter cells. 

These helpers simply make the logs viewable in the notebook environment. If you are running these commands in your own terminal, you should use the standard Docker CLI commands.
</div>

### Context-Aware RAG

Note that the embedding NIM is deployed locally, while the LLM points to a cloud-hosted endpoint specified in `external/context-aware-rag/config/config.yaml --> chat.llm.base_url`.

<img src="../docs/arch-diagram-retrieval.png" alt="Retrieval Diagram" style="max-width: 600px; width: 100%;" />

In [40]:
# Validate ENV
api_key = os.environ.get("NVIDIA_API_KEY")
assert api_key is not None and api_key != "nvapi-your-api-key", "NVIDIA_API_KEY environment variable is not set"

# Deploy all Context-Aware RAG containers
tail_bash_command(
    "docker compose "
    "-f ../external/context-aware-rag/docker/deploy/compose.yaml "
    "up -d",
    n=25
)

 Container cassandra  Started
 Container cassandra  Waiting
 Container grafana  Started
 Container vss-ctx-rag-retriever  Started
 Container milvus-etcd  Started
 Container milvus-minio  Started
 Container milvus-standalone  Starting
 Container prometheus  Started
 Container embedding-nim  Started
 Container milvus-standalone  Started
 Container milvus-standalone  Waiting
 Container milvus-standalone  Healthy
 Container vss-ctx-rag-data-ingestion  Starting
 Container vss-ctx-rag-data-ingestion  Started
 Container cassandra  Healthy
 Container cassandra_schema  Starting
 Container cassandra_schema  Started
 Container cassandra  Waiting
 Container cassandra  Healthy
 Container jaeger  Starting
 Container jaeger  Started
 Container jaeger  Waiting
 Container jaeger  Healthy
 Container otel_collector  Starting
 Container otel_collector  Started
✅ Done


<div class="alert alert-block alert-info">
    <b>Note: In case of ❌ errors or "unhealthy" services!</b>
    
There are other containers deployed by Context-Aware RAG for telemetry and vector RAG, which we do not use in this walkthough. Feel free to disregard errors that are not for one of:
1. vss-ctx-rag-retriever
2. vss-ctx-rag-data-ingestion
3. embedding-nim
4. neo4j
5. milvus

Telemetry services are: `jaeger`, `cassandra`, `cassandra-schema`, `otel_collector`, `prometheus`, and `grafana`, any errors or "unhealthy" notifications from those services can be ignored without impacting core functionality.
</div>


We'll run a function that will wait for each of the required services to send a healthy signal before moving on to deploying the FM radio ingestion workflow.

In [41]:
# Wait for required services to be ready and healthy
wait_for_service("http://localhost:8000/health", name="CA RAG Retrieval")
wait_for_service("http://localhost:8001/health", name="CA RAG Ingestion")
wait_for_service("http://localhost:7474", name="Neo4j")
wait_for_service("http://localhost:9091/healthz", name="Milvus")
wait_for_service("http://localhost:8002/v1/health/ready", name="Embeddings")

✅ Service 'CA RAG Retrieval' is ready
✅ Service 'CA RAG Ingestion' is ready
✅ Service 'Neo4j' is ready
✅ Service 'Milvus' is ready
✅ Service 'Embeddings' is ready


True

<div class="alert alert-block alert-info">
    <b>Note: In case of Embeddings timeout ⏳</b>
    
If the Embeddings container does not start successfully, make sure that you correctly set your NVIDIA_API_KEY in the setup cell at the top of the notebook. If `!echo $NVIDIA_API_KEY` shows no output or does not display your API key, spin down all containers (see last cell), restart the Jupyter kernel, set your key, and start again.
</div>

<div class="alert alert-block alert-info">
    <b>Note: What to do if other services fail to start ❌</b>
    
If any other of the services above fail to start, re-running the deployment cell is often enough to fix things. Simply re-running won't impact any healthy containers, it just gives one's that had trouble starting another chance.

If that still doesn't work, check out the logs during startup with `tail_bash_command("docker logs <service-name>", n=25)` to watch for obvious errors.
</div>

If the above cell has run, the RAG workflow should be running! Let's check the container status:

In [22]:
docker_ps()  # Print all running containers

CONTAINER ID   NAMES                        STATUS
b6b3ae138b45   otel_collector               Up About a minute
c6f2b06f24e6   jaeger                       Up About a minute (healthy)
a84dbeb6bfb1   milvus-standalone            Up About a minute (healthy)
edbca0350a18   embedding-nim                Up About a minute
c5d8a4dc2cce   milvus-minio                 Up About a minute (healthy)
989691bdda45   prometheus                   Up About a minute
1aae6038b8ef   vss-ctx-rag-retriever        Up About a minute
badbde575665   grafana                      Up 2 seconds (health: starting)
a5bba80cc891   vss-ctx-rag-data-ingestion   Up About a minute
1b13cec7000d   milvus-etcd                  Up About a minute (healthy)
6187bebbb526   cassandra                    Up About a minute (healthy)
3a80a85caa10   neo4j                        Up About a minute



You should see something similar to:
```
CONTAINER ID   NAMES                        STATUS
e242f3243245   otel_collector               Up About a minute
72b7d2678424   jaeger                       Up About a minute (healthy)
e00d263b5dac   milvus-standalone            Up About a minute (healthy)
0a3eb9eebb30   milvus-minio                 Up About a minute (healthy)
48eaec9020ec   embedding-nim                Up About a minute
6dd52a851429   neo4j                        Up About a minute
72872988d2f5   cassandra                    Up About a minute (healthy)
715f6178a6c4   vss-ctx-rag-retriever        Up About a minute
c5049894c769   prometheus                   Up About a minute
35ee497ada16   grafana                      Up About a minute
82d3f9573cb8   milvus-etcd                  Up About a minute (healthy)
77df529fa73e   vss-ctx-rag-data-ingestion   Up About a minute
```

### FM Radio Ingestion Workflow and UI

We deploy the containers needed to run the ingestion workflow and for the UI, again using Python subprocesses.

Once these containers have spun up, go to [http://localhost:3000](http://localhost:3000) or "http://\<your-brev-ip\>:3000" in your brower to view and interact with the UI.

<img src="../docs/arch-diagram-ingestion.png" alt="Ingestion Diagram" style="max-width: 600px; width: 100%;" />

<div class="alert alert-block alert-info">
    <b>Note: The first time you deploy ⏳</b>
   
The first time the ingestion workflow is deployed, the ASR NIM will need to download model weights, which will take a few minutes. The weights are cached in the `MODEL_DIRECTORY` folder that we set above so that subsequent deployments are much faster.
</div>

In [42]:
# Validate ENV
api_key = os.environ.get("NVIDIA_API_KEY")
assert api_key is not None and api_key != "nvapi-your-api-key", "NVIDIA_API_KEY environment variable is not set"
assert os.environ.get("REPLAY_FILES") is not None, "REPLAY_FILES environment variable is not set"

max_file_config = float(os.environ.get("REPLAY_MAX_FILE_SIZE", 50))
assert max_file_config < 1000, \
    f"REPLAY_MAX_FILE_SIZE is recommended to be less than 1 GB (currently {max_file_config/1e3:0.6f} GB). " \
    f"You can bypass this check by commenting out this assertion, however, be mindful of the " \
    "available GPU memory, which can be checked with `nvidia-smi` from the command line."

# Deploy all containers used in ingestion workflow
tail_bash_command(
    "docker compose "
    "-f ../deploy/docker-compose.yaml "
    "--profile replay up -d",
    n=25
)

 Container holoscan-sdr  Creating
 Container agentiq-ui  Creating
 Container fm-file-replay  Creating
 Container asr-nim  Creating
 Container agentiq-ui  Created
 Container holoscan-sdr  Created
 Container fm-file-replay  Created
 Container asr-nim  Created
 Container holoscan-sdr  Starting
 Container agentiq-ui  Starting
 Container fm-file-replay  Starting
 Container asr-nim  Starting
 Container agentiq-ui  Started
 Container fm-file-replay  Started
 Container holoscan-sdr  Started
 Container asr-nim  Started
✅ Done


The ASR NIM typically takes the longest to spin up (see note above), we'll wait for that service to send a healthy signal:

In [24]:
# Wait for ASR NIM to be ready
wait_for_service("http://localhost:50050/v1/health/ready", name="ASR NIM", timeout=600, interval=10)

⏳ Waiting 10 seconds for service 'ASR NIM'...
⏳ Waiting 10 seconds for service 'ASR NIM'...
⏳ Waiting 10 seconds for service 'ASR NIM'...
✅ Service 'ASR NIM' is ready


True

<div class="alert alert-block alert-success">
    <b>Tip:</b>

If using Brev, go to "http://\<your-brev-ip\>:3000". Find your Brev instance's IP on the instance page on [brev.nvidia.com](brev.nvidia.com).
</div>

We'll need to wait a minute or so for those services to spin up.

Once they have, see the container status below:

In [25]:
docker_ps()  # Print all running containers

CONTAINER ID   NAMES                        STATUS
07e88a43ff7e   asr-nim                      Up 31 seconds
9d531e9a63dd   holoscan-sdr                 Up 31 seconds
6cde989449fc   fm-file-replay               Up 31 seconds
7404104de866   agentiq-ui                   Up 31 seconds
b6b3ae138b45   otel_collector               Up About a minute
c6f2b06f24e6   jaeger                       Up About a minute (healthy)
a84dbeb6bfb1   milvus-standalone            Up About a minute (healthy)
edbca0350a18   embedding-nim                Up About a minute
c5d8a4dc2cce   milvus-minio                 Up About a minute (healthy)
989691bdda45   prometheus                   Up About a minute
1aae6038b8ef   vss-ctx-rag-retriever        Up About a minute
badbde575665   grafana                      Up 6 seconds (health: starting)
a5bba80cc891   vss-ctx-rag-data-ingestion   Up About a minute
1b13cec7000d   milvus-etcd                  Up About a minute (healthy)
6187bebbb526   cassandra                   

You should see something similar to:
```
CONTAINER ID   NAMES                        STATUS
71f2de42d1a3   asr-nim                      Up 2 minutes
9bef5ebf2abe   holoscan-sdr                 Up 2 minutes
b41e8615cff1   fm-file-replay               Up 2 minutes
4de614fed065   agentiq-ui                   Up 2 minutes
e242f3243245   otel_collector               Up 5 minutes
72b7d2678424   jaeger                       Up 5 minutes (healthy)
e00d263b5dac   milvus-standalone            Up 5 minutes (healthy)
0a3eb9eebb30   milvus-minio                 Up 5 minutes (healthy)
48eaec9020ec   embedding-nim                Up 5 minutes
6dd52a851429   neo4j                        Up 5 minutes
72872988d2f5   cassandra                    Up 5 minutes (healthy)
715f6178a6c4   vss-ctx-rag-retriever        Up 5 minutes
c5049894c769   prometheus                   Up 5 minutes
35ee497ada16   grafana                      Up 5 minutes (healthy)
82d3f9573cb8   milvus-etcd                  Up 5 minutes (healthy)
77df529fa73e   vss-ctx-rag-data-ingestion   Up 5 minutes
```

## Interacting

View the frontend at `http://<your-brev-ip>:3000` or [http://localhost:3000](http://localhost:3000). You may have to refresh the page once or twice as the services start up.

If using Brev, you can find `<your-brev-ip>` on the instance page on [brev.nvidia.com](brev.nvidia.com).

You should see something like this:

<img src="../docs/ui-example.jpg" alt="Frontend Example" style="max-width: 600px; width: 100%;" />

To view the complete history of transcripts exported to the CA-RAG workflow, click the "History" button in the header tab:

<img src="../docs/ui-header-history.jpg" alt="History Button" style="max-width: 600px; width: 100%;" />

<div class="alert alert-block alert-success">
    <b>Tip:</b>

When stopping / restarting containers, make sure to refresh the frontend.
</div>

### Note about latency on document ingest

Documents are *not* ingested into the database right away. They are batched into sets of N documents; once a batch is full it is processed and subsequently injected into the graph DB.

You will have to wait for documents to be ingested into the database before being able to ask questions about them.

The batch size is parameterized in `external/context-aware-rag/config/config.yaml --> chat.params.batch_size`. Other parameters for chat are set here as well, including the maximum documents retrieved (`top_k`, with a default of 25). Keep in mind that for smaller batch sizes, the graph formation may not be able to keep up.

When the SDR workflow finalizes a transcript and sends it to the ingestion service, it will appear in the transcript history page with a header that looks similar to:

<img src="../docs/pending-document.jpg" alt="Pending Document" style="max-width: 300px; width: 100%;" />

Once the document's batch is processed and available for retrieval, the header will change to:

<img src="../docs/ingested-document.jpg" alt="Ingested Document" style="max-width: 300px; width: 100%;" />

At this point, the document is available to be retrieved by the RAG workflow. See the image below; the top entry is not yet accessible for retrieval, while the bottom is:

<img src="../docs/transcript-history.jpg" alt="Transcript History" style="max-width: 600px; width: 100%;" />

## Sample Questions

The context-aware RAG workflow has the ability to filter and retrieve data based on the document ingestion time and document stream. Note that all times are in UTC - to see the relative time, check the timestamps on the transcript history page.

For best results, be sure to include the relevant channel(s) and a time constraint in your query — for example, “channel 1 at 6:10 AM,” “5 minutes ago in channel 0,” or “a a 10-minute window around 9:00 PM in channels 1 and 2.”

Some suggested questions to start with:

### Recent summary:

+ *"Summarize the last 10 minutes on channel 0"*
    - Retrieves documents added on stream 0 between 600 seconds ago and now
+ *"Summarize the main topics discussed, excluding channel 2, for the past hour."*
    - Retrieves up to `top_k` documents from all streams but stream 2 between 3600 seconds ago and now

### Time window:

+ *"What was the topic of conversation on channel 2 15 minutes ago?"*
    - Retrieves documents added on stream 2 900 seconds ago, with a default 5 minute window
+ *"Between 2 minutes and half an hour ago, what was the most interesting fact you heard?"*
    - Retrieves documents between 120 and 1800 seconds ago on all channels

### Specific time:
+ *"At 9 oclock, what was the topic on channel 3?"*
    - Retrieves documents in a default 5 minute window around either 9 AM or 9 PM, whichever was more recent
+ *"At 10:00 PM, what was the topic on channel 3, using a 10 minute window?"*
    - Retrieves documents in a 10 minute window around 10 PM on channel 3

### Excluding recent entries
+ *"What was the main topic of conversation on channel 0, excluding the past ten minutes?"*
    - Retrieves up to `top_k` documents on channel 0 added prior to 600 seconds ago

## Troubleshooting

The most common error is an incorrect HTTP URL for the Chat Completion call, which may default to `http://127.0.0.1:8000/call`. If this is occuring, the error will likely look something like this:

<img src="../docs/chat-error.jpg" alt="Chat Error" style="max-width: 500px; width: 100%;" />

Go to `Settings -> HTTP URL for Chat Completion` and ensure it is set to `http://vss-ctx-rag-retriever:8000/call`:

<img src="../docs/settings.jpg" alt="Settings" style="max-width: 200px; width: 100%;" />

Then verify clicking "Regenerate Response":

<img src="../docs/chat-success.jpg" alt="Chat Success" style="max-width: 500px; width: 100%;" />

## Spinning Down

To end all services, run the cell below:


<div class="alert alert-block alert-info">
    <b>Note:</b>

If any containers fail to spin down after a reasonable period of time, interrupt the Jupyter process and run: `sudo systemctl restart docker.socket docker.service`.
</div>

In [43]:
# Stop all containers
tail_bash_command(
    "docker compose "
    "-f ../external/context-aware-rag/docker/deploy/compose.yaml "
    "-f ../deploy/docker-compose.yaml --profile replay "
    "down",
    n=25
)

 Container cassandra  Removing
 Container cassandra  Removed
 Container fm-file-replay  Stopped
 Container fm-file-replay  Removing
 Container holoscan-sdr  Stopped
 Container holoscan-sdr  Removing
 Container fm-file-replay  Removed
 Container holoscan-sdr  Removed
 Container vss-ctx-rag-data-ingestion  Stopped
 Container vss-ctx-rag-data-ingestion  Removing
 Container vss-ctx-rag-data-ingestion  Removed
 Container neo4j  Stopped
 Container neo4j  Removing
 Container neo4j  Removed
 Container embedding-nim  Stopped
 Container embedding-nim  Removing
 Container embedding-nim  Removed
 Container asr-nim  Stopped
 Container asr-nim  Removing
 Container asr-nim  Removed
 Container grafana  Stopped
 Container grafana  Removing
 Container grafana  Removed
 Network ctx-rag  Removing
 Network ctx-rag  Removed
✅ Done


Verify that all services are stopped:

In [35]:
docker_ps()  # Print all running containers

CONTAINER ID   NAMES     STATUS



# Next Steps

Now that you've completed the quickstart, you can explore the codebase and start building your own extensions!

- **View the source code and documentation:**  
  Visit the [Streaming Data to RAG GitHub repository](https://github.com/NVIDIA-AI-Blueprints/streaming-data-to-rag) to browse the code, open issues, and read more detailed docs.

- **Clone the repository:**  
  You can clone the repo locally to experiment and develop your own features:
  ```
  git clone git@github.com:NVIDIA-AI-Blueprints/streaming-data-to-rag.git
  ```

- **Extend:**  
  Feel free to fork the repository and use it as a starting point for your own streaming RAG applications.

For more information, see the [README](https://github.com/NVIDIA-AI-Blueprints/streaming-data-to-rag#readme) and the [API integration guide](https://github.com/NVIDIA-AI-Blueprints/streaming-data-to-rag/blob/main/api-integration.md).
