# ChatNVIDIA with Dynamo KV Cache Optimization

[NVIDIA Dynamo](https://developer.nvidia.com/dynamo) is an open-source, low-latency inference framework for serving generative AI models across GPU fleets. It includes four core components:

- **Smart Router** — a KV cache-aware routing engine that uses Radix Tree data structures to track KV cache entries across GPUs. It computes overlap scores between incoming requests and cached KV blocks, routing each request to the worker that already holds the relevant cache — avoiding costly recomputation.
- **GPU Resource Planner** — dynamically allocates resources between prefill and decode phases based on real-time capacity.
- **Distributed KV Cache Manager** — manages KV cache across a memory hierarchy (GPU HBM, host DRAM, SSD, networked storage).
- **NIXL** — a low-latency communication library for rapid KV cache movement between GPUs.
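
To make the Smart Router's overlap scoring more concrete, here is a toy sketch. It is not Dynamo's implementation: it replaces the radix tree with a flat set of chained block hashes per worker, and the block size, worker names, and function names are all made up for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block; illustrative, not a Dynamo default


def block_hashes(tokens: list[int]) -> list[str]:
    """Hash each full block together with everything before it (prefix chaining)."""
    hashes = []
    running = hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore a partial trailing block
    for start in range(0, full, BLOCK_SIZE):
        running.update(repr(tokens[start:start + BLOCK_SIZE]).encode())
        hashes.append(running.copy().hexdigest())
    return hashes


def overlap_score(request_tokens: list[int], cached_hashes: set[str]) -> int:
    """Count the leading request blocks that a worker already has cached."""
    score = 0
    for block_hash in block_hashes(request_tokens):
        if block_hash not in cached_hashes:
            break
        score += 1
    return score


def pick_worker(request_tokens: list[int], workers: dict[str, set[str]]) -> str:
    """Route to the worker with the highest overlap score."""
    return max(workers, key=lambda name: overlap_score(request_tokens, workers[name]))


system_prompt = list(range(8))  # two full blocks shared by many requests
workers = {
    "worker-a": set(block_hashes(system_prompt)),  # has the shared prefix cached
    "worker-b": set(),                             # cold cache
}
request = system_prompt + [100, 101, 102, 103]
print(pick_worker(request, workers))  # worker-a
```

Because each hash covers the whole prefix up to that block, a match on block *n* implies a match on every earlier block, which is the property a radix tree gives the real router.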

`ChatNVIDIADynamo` is a drop-in replacement for `ChatNVIDIA` that automatically injects `nvext.agent_hints` into every request. These hints tell the Smart Router:

- **`osl`** (output sequence length) — how many tokens to expect, so the scheduler can plan memory allocation
- **`iat`** (inter-arrival time) — how quickly requests arrive, so the router can anticipate load
- **`latency_sensitivity`** — how latency-critical a request is, so interactive calls get priority routing
- **`priority`** — request priority, so background work can yield to critical-path requests

A unique `prefix_id` is auto-generated for every request, enabling the router to track KV cache affinity.
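
As a rough sketch, the hint block can be pictured as a small dictionary built from the field names and defaults documented in this notebook. `build_agent_hints` is a hypothetical helper, not part of `langchain-nvidia-ai-endpoints`, and both the 12-character hex suffix and the range check are assumptions.

```python
import uuid


def build_agent_hints(osl=512, iat=250, latency_sensitivity=1.0, priority=1):
    """Return an nvext block shaped like the one described in this notebook."""
    # Assumption: the documented 0.0-1.0 scale is treated as a hard range here.
    if not 0.0 <= latency_sensitivity <= 1.0:
        raise ValueError("latency_sensitivity is expected to be in [0.0, 1.0]")
    return {
        "agent_hints": {
            # Fresh ID per request; the exact suffix format is assumed.
            "prefix_id": f"langchain-dynamo-{uuid.uuid4().hex[:12]}",
            "osl": osl,
            "iat": iat,
            "latency_sensitivity": latency_sensitivity,
            "priority": priority,
        }
    }


hints = build_agent_hints(osl=10, priority=10)
print(hints["agent_hints"])
```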

## Development Setup

If you are working from the `langchain-nvidia` repository, note that the project uses [Poetry](https://python-poetry.org/) to manage dependencies. Run the following from the repository root:

```bash
cd libs/ai-endpoints
pip install poetry                         # if not already installed
poetry config virtualenvs.in-project true --local
poetry install --with test
```

This creates a `.venv` inside `libs/ai-endpoints/`. Then install `ipykernel` directly via the venv's pip (not `poetry run`, which can recreate the environment) and register the Jupyter kernel:

```bash
.venv/bin/pip install ipykernel
.venv/bin/python -m ipykernel install --user --name langchain-nvidia --display-name "langchain-nvidia"
```

After this, reload your editor window and select the **langchain-nvidia** kernel in the notebook kernel picker.

## Install the Package

In [None]:
%pip install --upgrade --quiet langchain-nvidia-ai-endpoints

## Prerequisites

This notebook targets a **local NIM deployment behind NVIDIA Dynamo**. Unlike the standard `ChatNVIDIA` workflow with the NVIDIA API Catalog, you do not need an `NVIDIA_API_KEY` — the NIM is running on your infrastructure.

### Starting a Dynamo Deployment

The fastest way to get started is with the [Dynamo Quickstart Guide](https://docs.nvidia.com/dynamo/latest/getting-started/quickstart). For more details, including Kubernetes deployment and multi-node setups, see the [Dynamo documentation](https://docs.nvidia.com/dynamo/latest/).

> **Important:** When deploying the model, ensure the Dynamo worker's `--context-length` is at least **2x** the `MAX_TOKENS` value configured below. The context window must accommodate both the input prompt tokens and the completion tokens. For example, if `MAX_TOKENS = 4096`, deploy with `--context-length 8192` or higher.

In [None]:
NIM_BASE_URL = "http://localhost:8099/v1"
MODEL = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
MAX_TOKENS = 4096
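
The sizing rule from the deployment note is simple arithmetic. A small hypothetical helper (not part of Dynamo or this package) makes it easy to check a planned `--context-length` before deploying:

```python
def required_context_length(max_tokens: int, factor: int = 2) -> int:
    """Smallest --context-length that satisfies the 2x rule of thumb."""
    return factor * max_tokens


def context_length_ok(context_length: int, max_tokens: int) -> bool:
    """True when the context window leaves room for prompt plus completion."""
    return context_length >= required_context_length(max_tokens)


print(required_context_length(4096))   # 8192
print(context_length_ok(8192, 4096))   # True
print(context_length_ok(4096, 4096))   # False
```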

## Basic Usage

`ChatNVIDIADynamo` accepts all the same parameters as `ChatNVIDIA`, plus four Dynamo-specific fields:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `osl` | `int` | `512` | Expected output sequence length (tokens) |
| `iat` | `int` | `250` | Expected inter-arrival time (ms) |
| `latency_sensitivity` | `float` | `1.0` | Latency sensitivity hint (0.0 = tolerant, 1.0 = critical) |
| `priority` | `int` | `1` | Request priority (higher = more important) |

Because the interface is identical, swapping `ChatNVIDIA` for `ChatNVIDIADynamo` is the only change needed for every request to include routing hints.

In [None]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA, ChatNVIDIADynamo

# Standard ChatNVIDIA — no Dynamo hints
llm_standard = ChatNVIDIA(base_url=NIM_BASE_URL, model=MODEL, max_completion_tokens=MAX_TOKENS)

# ChatNVIDIADynamo — identical interface, automatically injects agent_hints
llm = ChatNVIDIADynamo(base_url=NIM_BASE_URL, model=MODEL, max_completion_tokens=MAX_TOKENS)

result = llm.invoke("What is KV cache optimization?")
print(result.content)

## Setting Defaults at Construction Time

You can configure Dynamo hints when creating the model instance. This is useful when you know a model instance will always serve a particular role — e.g. a high-priority interactive assistant vs. a low-priority background summarizer.

In [None]:
# High-priority: short responses, latency-critical
llm_critical = ChatNVIDIADynamo(
    base_url=NIM_BASE_URL,
    model=MODEL,
    max_completion_tokens=MAX_TOKENS,
    osl=20,
    priority=10,
    latency_sensitivity=1.0,
)

# Low-priority: long responses, latency-tolerant
llm_background = ChatNVIDIADynamo(
    base_url=NIM_BASE_URL,
    model=MODEL,
    max_completion_tokens=MAX_TOKENS,
    osl=512,
    priority=1,
    latency_sensitivity=0.1,
)

print(f"Critical:   osl={llm_critical.osl}, priority={llm_critical.priority}, "
      f"latency_sensitivity={llm_critical.latency_sensitivity}")
print(f"Background: osl={llm_background.osl}, priority={llm_background.priority}, "
      f"latency_sensitivity={llm_background.latency_sensitivity}")

## Per-Invocation Overrides

Dynamo parameters can also be overridden on each call. This is useful when the same model instance handles requests with varying characteristics.

In [None]:
# Override all four Dynamo parameters for a single request
result = llm.invoke(
    "Classify this as positive or negative: 'I love this product!'",
    osl=10,
    iat=100,
    latency_sensitivity=1.0,
    priority=10,
)
print(result.content)
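
Conceptually, call-time values take precedence over the instance defaults, and anything not overridden falls back to the constructor value. The merge below only illustrates that precedence; `resolve_hints` is a hypothetical function, not the library's internals.

```python
HINT_KEYS = ("osl", "iat", "latency_sensitivity", "priority")


def resolve_hints(instance_defaults: dict, **call_kwargs) -> dict:
    """Overlay per-call hint values on top of the instance defaults."""
    resolved = dict(instance_defaults)  # copy so the defaults stay untouched
    for key in HINT_KEYS:
        if call_kwargs.get(key) is not None:
            resolved[key] = call_kwargs[key]
    return resolved


defaults = {"osl": 512, "iat": 250, "latency_sensitivity": 1.0, "priority": 1}
per_call = resolve_hints(defaults, osl=10, priority=10)
print(per_call)
# {'osl': 10, 'iat': 250, 'latency_sensitivity': 1.0, 'priority': 10}
```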

## Streaming with Dynamo Hints

Dynamo hints are included in the initial streaming request. The Smart Router uses them to select the optimal worker before tokens start flowing.

In [None]:
for chunk in llm_critical.stream("Give a one-sentence summary of GPU computing."):
    print(chunk.content, end="", flush=True)
print()  # newline

## LangChain Chain Integration

`ChatNVIDIADynamo` composes with prompts and output parsers in LangChain chains exactly as `ChatNVIDIA` does.

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "Classify the user's intent into exactly one category: "
               "billing, technical_support, general_inquiry, or complaint. "
               "Respond with only the category name."),
    ("user", "{input}"),
])

classifier_chain = (
    prompt
    | ChatNVIDIADynamo(
        base_url=NIM_BASE_URL,
        model=MODEL,
        max_completion_tokens=MAX_TOKENS,
        osl=5,
        priority=10,
        latency_sensitivity=1.0,
    )
    | StrOutputParser()
)

intent = classifier_chain.invoke({"input": "My invoice has the wrong amount"})
print(f"Detected intent: {intent}")

## Latency Sensitivity in Agentic Workflows

The real power of `ChatNVIDIADynamo` emerges in multi-step agentic pipelines where **not all LLM calls are equally urgent**.

Consider a customer support triage workflow. The first call (intent classification, ~10 output tokens) and the final call (quality review, ~20 tokens) are on the **critical path**: the user is actively waiting for a response. In between, several analysis branches run in **parallel**, generating hundreds of tokens each. These background calls are important, but they have slack and don't need to jump the queue.

Without priority scheduling, all requests compete equally for GPU decode slots. When the GPU is saturated with long-running background decode requests, a short critical-path request (like a 10-token classification) gets queued behind them. With Dynamo's priority scheduling enabled (`--enable-priority-scheduling`), the short critical-path request **jumps the queue** instead of waiting for those decodes to drain, which lowers the latency the user perceives.
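
The effect can be illustrated with a deliberately simplified model: a single decode slot that serves one request at a time, with time measured in token-steps. Real schedulers batch and interleave requests, so this is a toy under stated assumptions rather than a description of Dynamo's scheduler; the `Request` class and `completion_times` function are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    tokens: int    # decode steps needed
    priority: int  # higher = more important


def completion_times(queue: list, use_priority: bool) -> dict:
    """Serve requests one at a time; return each finish time in token-steps."""
    order = list(queue)
    if use_priority:
        # sorted() is stable, so equal priorities keep their arrival order.
        order = sorted(order, key=lambda r: -r.priority)
    clock, finished = 0, {}
    for req in order:
        clock += req.tokens
        finished[req.name] = clock
    return finished


arrivals = [Request(f"background-{i}", 500, 1) for i in range(4)]
arrivals.append(Request("classify", 10, 10))  # the short request arrives last

fifo = completion_times(arrivals, use_priority=False)
prio = completion_times(arrivals, use_priority=True)
print(fifo["classify"], prio["classify"])  # 2010 10
```

In this model the classification finishes after 10 steps instead of 2010, while the last background branch finishes only 10 steps later than it would under first-come, first-served ordering.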

### Pipeline Design

```
                           ┌──────────────────────────┐
                           │   Customer Query Input   │
                           └────────────┬─────────────┘
                                        │
                           ┌────────────▼─────────────┐
                           │  classify_query          │
                           │  HIGH priority (10)      │
                           │  osl=10, sensitivity=1.0 │
                           └────────────┬─────────────┘
                                        │
          ┌──────────────┬──────────────┼──────────────┐
          │              │              │              │
  ┌───────▼───────┐ ┌───▼──────┐ ┌──────▼─────┐ ┌──────▼────────┐
  │ research      │ │ lookup   │ │ check      │ │ analyze       │
  │ _context      │ │ _policy  │ │ _compli-   │ │ _sentiment    │
  │ LOW (1)       │ │ LOW (1)  │ │ ance       │ │ LOW (1)       │
  │ osl=500       │ │ osl=500  │ │ LOW (1)    │ │ osl=500       │
  │ sens=0.1      │ │ sens=0.1 │ │ osl=500    │ │ sens=0.1      │
  └───────┬───────┘ └───┬──────┘ │ sens=0.1   │ └────┬──────────┘
          │             │        └─────┬──────┘      │
          └─────────────┴──────────────┼─────────────┘
                                       │
                          ┌────────────▼─────────────┐
                          │  draft_response          │
                          │  MED priority (5)        │
                          │  osl=500, sensitivity=0.5│
                          └────────────┬─────────────┘
                                       │
                          ┌────────────▼─────────────┐
                          │  review_response         │
                          │  HIGH priority (10)      │
                          │  osl=20, sensitivity=1.0 │
                          └──────────────────────────┘
```

| Node | Priority | `osl` | Role |
|------|----------|-------|------|
| `classify_query` | HIGH (10) | 10 | Entry point — all downstream nodes depend on it |
| `research_context` | LOW (1) | 500 | Parallel background — has slack |
| `lookup_policy` | LOW (1) | 500 | Parallel background — has slack |
| `check_compliance` | LOW (1) | 500 | Parallel background — has slack |
| `analyze_sentiment` | LOW (1) | 500 | Parallel background — has slack |
| `draft_response` | MED (5) | 500 | Join point — synthesizes all branches |
| `review_response` | HIGH (10) | 20 | Exit point — user is waiting for this |
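
As a back-of-the-envelope check, the table can be encoded as a small dependency graph to see how many expected output tokens lie on the longest sequential chain. The `osl` values come from the table; the graph encoding and helper are illustrative only, and real latency also depends on prefill, batching, and queueing.

```python
from functools import lru_cache

OSL = {
    "classify_query": 10,
    "research_context": 500,
    "lookup_policy": 500,
    "check_compliance": 500,
    "analyze_sentiment": 500,
    "draft_response": 500,
    "review_response": 20,
}
BRANCHES = ["research_context", "lookup_policy", "check_compliance", "analyze_sentiment"]
DEPENDS_ON = {
    "classify_query": [],
    **{branch: ["classify_query"] for branch in BRANCHES},
    "draft_response": BRANCHES,
    "review_response": ["draft_response"],
}


@lru_cache(maxsize=None)
def tokens_until_done(node: str) -> int:
    """Expected output tokens on the longest dependency chain ending at `node`."""
    upstream = max((tokens_until_done(dep) for dep in DEPENDS_ON[node]), default=0)
    return upstream + OSL[node]


print(tokens_until_done("review_response"))  # 1030
print(sum(OSL.values()))                     # 2530
```

Running the four branches in parallel means only one of them counts toward the sequential chain (10 + 500 + 500 + 20 = 1030 tokens), even though the pipeline generates about 2530 expected output tokens in total.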

### Define Model Instances & Chains

In [None]:
import time

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

from langchain_nvidia_ai_endpoints import ChatNVIDIADynamo

# High-priority model: critical-path calls (classification, review)
llm_high = ChatNVIDIADynamo(
    base_url=NIM_BASE_URL,
    model=MODEL,
    max_completion_tokens=MAX_TOKENS,
    priority=10,
    latency_sensitivity=1.0,
)

# Medium-priority model: join point (draft)
llm_med = ChatNVIDIADynamo(
    base_url=NIM_BASE_URL,
    model=MODEL,
    max_completion_tokens=MAX_TOKENS,
    priority=5,
    latency_sensitivity=0.5,
)

# Low-priority model: parallel background analysis
llm_low = ChatNVIDIADynamo(
    base_url=NIM_BASE_URL,
    model=MODEL,
    max_completion_tokens=MAX_TOKENS,
    priority=1,
    latency_sensitivity=0.1,
)

# --- Stage 1: Intent Classification (HIGH) ---
classify_chain = (
    ChatPromptTemplate.from_messages([
        ("system", "Classify the customer query into one of: billing, "
                   "technical_support, account, or general. Respond with "
                   "only the category."),
        ("user", "{query}"),
    ])
    | llm_high.bind(osl=10)
    | StrOutputParser()
)

# --- Stage 2a: Research Context (LOW) ---
research_chain = (
    ChatPromptTemplate.from_messages([
        ("system", "Research relevant context for this customer query. Provide "
                   "background information, common causes, and any relevant "
                   "product details that would help a support agent."),
        ("user", "Category: {category}\nQuery: {query}"),
    ])
    | llm_low.bind(osl=500)
    | StrOutputParser()
)

# --- Stage 2b: Lookup Policy (LOW) ---
policy_chain = (
    ChatPromptTemplate.from_messages([
        ("system", "Based on this customer query, identify the relevant company "
                   "policies. Include refund policies, SLAs, escalation "
                   "procedures, and any applicable customer guarantees."),
        ("user", "Category: {category}\nQuery: {query}"),
    ])
    | llm_low.bind(osl=500)
    | StrOutputParser()
)

# --- Stage 2c: Check Compliance (LOW) ---
compliance_chain = (
    ChatPromptTemplate.from_messages([
        ("system", "Review this customer interaction for compliance considerations. "
                   "Flag any regulatory requirements, data privacy concerns, "
                   "or mandatory disclosures that must be included in the response."),
        ("user", "Category: {category}\nQuery: {query}"),
    ])
    | llm_low.bind(osl=500)
    | StrOutputParser()
)

# --- Stage 2d: Analyze Sentiment (LOW) ---
sentiment_chain = (
    ChatPromptTemplate.from_messages([
        ("system", "Analyze the sentiment and emotional tone of this customer "
                   "message. Identify the level of urgency, frustration, and "
                   "any specific emotional cues that should inform the response tone."),
        ("user", "{query}"),
    ])
    | llm_low.bind(osl=500)
    | StrOutputParser()
)

# --- Stage 3: Draft Response (MED) ---
draft_chain = (
    ChatPromptTemplate.from_messages([
        ("system", "You are a customer support agent. Using the analysis below, "
                   "draft a helpful and empathetic response to the customer.\n\n"
                   "Category: {category}\n"
                   "Sentiment: {sentiment}\n"
                   "Context: {context}\n"
                   "Policy: {policy}\n"
                   "Compliance: {compliance}"),
        ("user", "{query}"),
    ])
    | llm_med.bind(osl=500)
    | StrOutputParser()
)

# --- Stage 4: Review Response (HIGH) ---
review_chain = (
    ChatPromptTemplate.from_messages([
        ("system", "Review this draft customer support response. Reply with only "
                   "APPROVED if it is ready to send, or NEEDS_REVISION followed "
                   "by a brief reason."),
        ("user", "Draft response:\n{draft}"),
    ])
    | llm_high.bind(osl=20)
    | StrOutputParser()
)

### Run the Pipeline

In [None]:
import asyncio


async def triage_customer_query(query: str) -> dict:
    """Run the full triage pipeline with timing."""

    # Stage 1: Classification (critical path — HIGH priority)
    t0 = time.time()
    category = await classify_chain.ainvoke({"query": query})
    category = category.strip()
    t1 = time.time()
    print(f"  classify_query:    {t1 - t0:.2f}s  [HIGH]  -> {category}")

    # Stage 2: Parallel background analysis (LOW priority)
    t2 = time.time()
    context, policy, compliance, sentiment = await asyncio.gather(
        research_chain.ainvoke({"query": query, "category": category}),
        policy_chain.ainvoke({"query": query, "category": category}),
        compliance_chain.ainvoke({"query": query, "category": category}),
        sentiment_chain.ainvoke({"query": query}),
    )
    t3 = time.time()
    print(f"  parallel_analysis: {t3 - t2:.2f}s  [LOW]   (4 branches)")

    # Stage 3: Draft response (MED priority)
    t4 = time.time()
    draft = await draft_chain.ainvoke({
        "query": query,
        "category": category,
        "sentiment": sentiment,
        "context": context,
        "policy": policy,
        "compliance": compliance,
    })
    t5 = time.time()
    print(f"  draft_response:    {t5 - t4:.2f}s  [MED]")

    # Stage 4: Review (critical path — HIGH priority)
    t6 = time.time()
    review = await review_chain.ainvoke({"draft": draft})
    t7 = time.time()
    print(f"  review_response:   {t7 - t6:.2f}s  [HIGH]  -> {review.strip()[:40]}")

    return {
        "category": category,
        "sentiment": sentiment,
        "context": context,
        "policy": policy,
        "compliance": compliance,
        "draft": draft,
        "review": review,
        "timing": {
            "classify": t1 - t0,
            "parallel_analysis": t3 - t2,
            "draft": t5 - t4,
            "review": t7 - t6,
            "total": t7 - t0,
            "critical_path": (t1 - t0) + (t7 - t6),
        },
    }


customer_query = (
    "I've been a loyal customer for 5 years, but my last three orders "
    "(#ORD-8821, #ORD-8834, #ORD-8901) have all arrived damaged. "
    "I was charged $149.99 for the most recent one on Jan 15th and "
    "still haven't received a refund. Your support chat was down "
    "yesterday when I tried to reach out. I'm very frustrated."
)

print("Running triage pipeline...\n")
results = await triage_customer_query(customer_query)
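
The parallel stage relies on `asyncio.gather` running the four branches concurrently, so its wall-clock time tracks the slowest branch rather than the sum of all four. The stand-in below swaps the LLM calls for `asyncio.sleep` (the branch names and delays are made up) so that it can run without a Dynamo endpoint:

```python
import asyncio
import time


async def fake_branch(name: str, seconds: float) -> str:
    """Stand-in for a chain's ainvoke(); sleeps instead of calling a model."""
    await asyncio.sleep(seconds)
    return f"{name} done"


async def run_branches() -> tuple[list[str], float]:
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_branch("research", 0.10),
        fake_branch("policy", 0.20),
        fake_branch("compliance", 0.15),
        fake_branch("sentiment", 0.05),
    )
    return list(results), time.perf_counter() - start


# In a notebook cell, use `await run_branches()` instead of asyncio.run().
outputs, elapsed = asyncio.run(run_branches())
print(outputs)
print(f"elapsed ~{elapsed:.2f}s (the delays sum to 0.50s)")
```

`asyncio.gather` returns results in the order the awaitables were passed, which is why the pipeline can unpack them positionally into `context, policy, compliance, sentiment`.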

### Display Results

In [None]:
from IPython.display import HTML, display

t = results["timing"]

html = f"""
<h3>Triage Results</h3>

<table style="border-collapse:collapse; width:100%; margin-bottom:20px;">
  <tr>
    <td style="padding:8px; font-weight:bold; width:140px;">Category</td>
    <td style="padding:8px;">{(results['category'].strip().splitlines() or [''])[-1].strip()}</td>
  </tr>
</table>

<details style="margin-bottom:12px;">
  <summary style="cursor:pointer; font-weight:bold;">Sentiment Analysis</summary>
  <pre style="white-space:pre-wrap; background:#f8f8f8; padding:12px; margin-top:8px; border-radius:4px;">{results['sentiment'][:500]}</pre>
</details>

<details style="margin-bottom:12px;">
  <summary style="cursor:pointer; font-weight:bold;">Draft Response</summary>
  <pre style="white-space:pre-wrap; background:#f8f8f8; padding:12px; margin-top:8px; border-radius:4px;">{results['draft'][:800]}</pre>
</details>

<details style="margin-bottom:12px;">
  <summary style="cursor:pointer; font-weight:bold;">Review Verdict</summary>
  <pre style="white-space:pre-wrap; background:#f8f8f8; padding:12px; margin-top:8px; border-radius:4px;">{(results['review'].strip().splitlines() or [''])[-1].strip()}</pre>
</details>

<h3>Timing Breakdown</h3>

<table style="border-collapse:collapse; width:100%;">
  <thead>
    <tr style="border-bottom:2px solid #333;">
      <th style="text-align:left; padding:8px;">Stage</th>
      <th style="text-align:center; padding:8px;">Priority</th>
      <th style="text-align:right; padding:8px;">Time</th>
    </tr>
  </thead>
  <tbody>
    <tr style="background-color:#e8f5e9;">
      <td style="padding:8px;">classify_query</td>
      <td style="text-align:center; padding:8px;"><strong>HIGH</strong></td>
      <td style="text-align:right; padding:8px;">{t['classify']:.2f}s</td>
    </tr>
    <tr>
      <td style="padding:8px;">parallel_analysis (4 branches)</td>
      <td style="text-align:center; padding:8px;">LOW</td>
      <td style="text-align:right; padding:8px;">{t['parallel_analysis']:.2f}s</td>
    </tr>
    <tr>
      <td style="padding:8px;">draft_response</td>
      <td style="text-align:center; padding:8px;">MED</td>
      <td style="text-align:right; padding:8px;">{t['draft']:.2f}s</td>
    </tr>
    <tr style="background-color:#e8f5e9;">
      <td style="padding:8px;">review_response</td>
      <td style="text-align:center; padding:8px;"><strong>HIGH</strong></td>
      <td style="text-align:right; padding:8px;">{t['review']:.2f}s</td>
    </tr>
    <tr style="border-top:2px solid #333; background-color:#e8f5e9;">
      <td style="padding:8px;" colspan="2"><strong>Critical path</strong> (classify + review)</td>
      <td style="text-align:right; padding:8px;"><strong>{t['critical_path']:.2f}s</strong></td>
    </tr>
    <tr style="border-top:1px solid #ccc;">
      <td style="padding:8px;" colspan="2"><strong>Total wall clock</strong></td>
      <td style="text-align:right; padding:8px;"><strong>{t['total']:.2f}s</strong></td>
    </tr>
  </tbody>
</table>
"""

display(HTML(html))

## How It Works Under the Hood

When `ChatNVIDIADynamo` sends a request, it injects an `nvext.agent_hints` section into the request payload. Here is what the payloads look like for our high-priority and low-priority calls:

**High-priority request** (classification / review — critical path):
```json
{
  "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
  "messages": [{"role": "user", "content": "..."}],
  "nvext": {
    "agent_hints": {
      "prefix_id": "langchain-dynamo-a1b2c3d4e5f6",
      "osl": 10,
      "iat": 250,
      "latency_sensitivity": 1.0,
      "priority": 10
    }
  }
}
```

**Low-priority request** (parallel background analysis):
```json
{
  "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
  "messages": [{"role": "user", "content": "..."}],
  "nvext": {
    "agent_hints": {
      "prefix_id": "langchain-dynamo-f6e5d4c3b2a1",
      "osl": 500,
      "iat": 250,
      "latency_sensitivity": 0.1,
      "priority": 1
    }
  }
}
```

The Smart Router uses these hints to make scheduling decisions:

- **`prefix_id`** — auto-generated per request (`langchain-dynamo-<uuid>`), enabling the router to track KV cache entries
- **`osl`** — pre-allocate the right amount of GPU memory for output tokens
- **`iat`** — predict incoming load for capacity planning
- **`latency_sensitivity`** — decide whether to queue or fast-track the request
- **`priority`** — determine scheduling order when requests compete for GPU resources

When the Dynamo worker is started with `--enable-priority-scheduling`, requests with higher priority values are scheduled ahead of lower-priority ones, even if the lower-priority requests arrived first. This means the four parallel background branches (~500 tokens each) yield GPU decode time to the short classification and review calls (~10 to 20 tokens), reducing perceived latency.

## Inspecting the Payload

For debugging, you can inspect the exact payload that will be sent to the NIM endpoint using the internal `_get_payload` method.

In [None]:
import json

payload = llm_critical._get_payload(
    inputs=[{"role": "user", "content": "Hello!"}],
    stop=None,
)

# Show the nvext section with agent_hints
print(json.dumps(payload["nvext"], indent=2))
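
When debugging, it can help to confirm that a payload carries every hint before it goes out. The checker below works on any dict shaped like the payloads shown in this notebook; `missing_hint_keys` and the sample payload are illustrative, not part of the library.

```python
REQUIRED_HINTS = {"prefix_id", "osl", "iat", "latency_sensitivity", "priority"}


def missing_hint_keys(payload: dict) -> set:
    """Return the agent_hints keys that are absent from a request payload."""
    hints = payload.get("nvext", {}).get("agent_hints", {})
    return REQUIRED_HINTS - set(hints)


sample = {
    "messages": [{"role": "user", "content": "Hello!"}],
    "nvext": {"agent_hints": {"prefix_id": "langchain-dynamo-a1b2c3d4e5f6", "osl": 20}},
}
print(sorted(missing_hint_keys(sample)))
# ['iat', 'latency_sensitivity', 'priority']
```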

## Summary

<table>
  <thead>
    <tr>
      <th style="text-align:left">Feature</th>
      <th style="text-align:center"><code>ChatNVIDIA</code></th>
      <th style="text-align:center"><code>ChatNVIDIADynamo</code></th>
    </tr>
  </thead>
  <tbody>
    <tr><td>API Catalog / NIM support</td><td style="text-align:center">Yes</td><td style="text-align:center">Yes</td></tr>
    <tr><td>Streaming</td><td style="text-align:center">Yes</td><td style="text-align:center">Yes</td></tr>
    <tr><td>Tool calling</td><td style="text-align:center">Yes</td><td style="text-align:center">Yes</td></tr>
    <tr><td>Structured output</td><td style="text-align:center">Yes</td><td style="text-align:center">Yes</td></tr>
    <tr><td>LangChain chains & agents</td><td style="text-align:center">Yes</td><td style="text-align:center">Yes</td></tr>
    <tr style="background-color:#f0f7f0"><td><strong>KV cache routing hints</strong> (<code>nvext.agent_hints</code>)</td><td style="text-align:center">&mdash;</td><td style="text-align:center"><strong>Yes</strong></td></tr>
    <tr style="background-color:#f0f7f0"><td><strong>Per-request <code>osl</code> / <code>iat</code></strong></td><td style="text-align:center">&mdash;</td><td style="text-align:center"><strong>Yes</strong></td></tr>
    <tr style="background-color:#f0f7f0"><td><strong>Priority-based scheduling</strong></td><td style="text-align:center">&mdash;</td><td style="text-align:center"><strong>Yes</strong></td></tr>
    <tr style="background-color:#f0f7f0"><td><strong>Latency sensitivity hints</strong></td><td style="text-align:center">&mdash;</td><td style="text-align:center"><strong>Yes</strong></td></tr>
    <tr style="background-color:#f0f7f0"><td><strong>Auto-generated <code>prefix_id</code></strong></td><td style="text-align:center">&mdash;</td><td style="text-align:center"><strong>Yes</strong></td></tr>
  </tbody>
</table>

## Related Topics

- [NVIDIA Dynamo](https://developer.nvidia.com/dynamo) — open-source inference framework
- [Dynamo Quickstart Guide](https://docs.nvidia.com/dynamo/latest/getting-started/quickstart) — get a local deployment running
- [KV Cache-Aware Routing](https://docs.nvidia.com/dynamo/latest/user-guides/kv-cache-aware-routing) — how the Smart Router works
- [AIQ Toolkit Latency Sensitivity Demo](https://github.com/NVIDIA/AIQToolkit/tree/main/examples/dynamo_integration/latency_sensitivity_demo) — extended example with profiling
- [ChatNVIDIA Documentation](nvidia_ai_endpoints.ipynb) — standard ChatNVIDIA usage
- [langchain-nvidia-ai-endpoints README](https://github.com/langchain-ai/langchain-nvidia/blob/main/libs/ai-endpoints/README.md)
- [NVIDIA NIM Microservices](https://www.nvidia.com/en-us/ai-data-science/products/nim-microservices/)