<a href="https://www.nvidia.com/dli"> <img src="./images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>


# 1.0 NVIDIA Dynamo: Lab Infrastructure, Dynamo and Monitoring Setup

In this comprehensive notebook, you'll first gain a deep understanding of NVIDIA Dynamo's architecture, core concepts, and the innovations that make it a powerful framework for high-performance inference deployment. Then, you'll explore the production-grade lab infrastructure where you'll deploy and experiment with Dynamo in real-world scenarios.


# Table of Contents

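- [1.1 Lab Infrastructure](#11-lab-infrastructure)
- [1.2 Dynamo Deployment Overview](#12-dynamo-deployment-overview)
  - [1.2.1 Dynamo Pod Status Verification](#121-dynamo-pod-status-verification)
  - [1.2.2 Dynamo Service Status Verification](#122-dynamo-service-status-verification)
- [1.3 Dynamo and Kubernetes Cluster Monitoring](#13-dynamo-and-kubernetes-cluster-monitoring)
  - [1.3.1 DCGM](#131-dcgm)
  - [1.3.2 Prometheus and Grafana](#132-prometheus-and-grafana)
  - [1.3.3 Configure Dynamo Monitoring](#133-configure-dynamo-monitoring)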


---

The lab environment is designed to mirror real-world AI production setups, providing hands-on experience with the same technologies and patterns used in enterprise deployments.


## 1.1 Lab Infrastructure

NVIDIA Dynamo is designed around several key architectural principles that address the unique challenges of serving large language models (LLMs) and generative AI workloads at scale.

### Why This Infrastructure Matters for Dynamo

The architectural principles you learned about Dynamo - disaggregated serving, intelligent scheduling, and multi-tier memory management - all require a sophisticated orchestration platform. Kubernetes provides:

- **Dynamic Resource Allocation**: Essential for Dynamo's adaptive planning
- **Multi-Node Coordination**: Required for disaggregated prefill/decode operations
- **GPU Resource Management**: Critical for Dynamo's performance optimizations
- **Service Discovery**: Needed for LLM-aware request routing

### Infrastructure Overview

The cluster is built on Azure Kubernetes Service (AKS) and features a multi-node architecture optimized for AI/ML workloads. It includes:

<img src="images/dynamo/aks.png" style="float: right; width: 500px;">

1. **Default Node Pool (CPU Node)**:
   - Handles general-purpose and lightweight CPU-based tasks
   - Manages cluster operations and non-GPU workloads
   - Always active as the baseline compute resource
   - Perfect for Dynamo's control plane components

2. **H100 Node Pool**:
   - Configured with [`Standard_NC40ads_H100_v5`](https://learn.microsoft.com/en-us/azure/virtual-machines/ncads-h100-v5) instances
   - Powered by NVIDIA H100 NVL GPUs
   - Ideal for large-scale AI training and high-performance inference
   - Features 94 GB of GPU memory per device
   - Where Dynamo's prefill and decode workloads will run

In [None]:
# Check available nodes in the cluster
!kubectl get nodes
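
To tell the CPU node pool apart from the H100 node pool programmatically, the JSON form of the same command can be inspected. The following is a minimal sketch, assuming the GPU nodes advertise the `nvidia.com/gpu` resource; the helper names `gpus_per_node` and `fetch_nodes` are hypothetical and not part of the lab.

```python
import json
import subprocess

# Resource name assumed to be advertised on GPU nodes by the NVIDIA device plugin.
GPU_RESOURCE = "nvidia.com/gpu"


def gpus_per_node(node_list: dict) -> dict:
    """Map node name -> allocatable GPU count from `kubectl get nodes -o json` output."""
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Nodes without GPUs simply lack the key, so default to zero.
        counts[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return counts


def fetch_nodes() -> dict:
    """Run kubectl and return the parsed node list (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

In the lab, `gpus_per_node(fetch_nodes())` would be expected to report zero for the default node pool and a non-zero count for each H100 node.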

## 1.2 Dynamo Deployment Overview

The Dynamo Cloud Platform deployment consists of two main phases:
1. **CRD Installation**: Deploy the Dynamo custom resource definitions (CRDs)
2. **Platform Installation**: Deploy the core Dynamo infrastructure (Operator, ETCD, NATS)

### Deployment Method

Use the pre-built Helm charts and container images from NGC (recommended for production):
```
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-{RELEASE_VERSION}.tgz
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-{RELEASE_VERSION}.tgz
```

In this lab, Dynamo version `0.5.0` is already deployed ([release notes](https://github.com/ai-dynamo/dynamo/releases/tag/v0.5.0)).
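
As a small illustration of how the `{RELEASE_VERSION}` placeholder is filled in, the sketch below builds the two chart URLs for one release. It assumes the chart version string is `0.5.0` (the lab's Dynamo version, without the `v` prefix of the Git tag); the helper name `chart_urls` is hypothetical.

```python
# Base URL and chart names taken from the helm fetch commands above.
NGC_CHART_BASE = "https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts"
CHART_NAMES = ("dynamo-crds", "dynamo-platform")  # CRDs are installed first


def chart_urls(release_version: str) -> list:
    """Return the chart archive URLs for one release, CRDs first."""
    return [f"{NGC_CHART_BASE}/{name}-{release_version}.tgz" for name in CHART_NAMES]


# Assumption: the chart version matches the Dynamo version deployed in this lab.
for url in chart_urls("0.5.0"):
    print(f"helm fetch {url}")
```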


In [None]:
# Namespace where Dynamo Cloud Platform will be deployed
NAMESPACE = "dynamo-cloud"

The Dynamo Platform Helm chart deploys several critical components:

1. **Dynamo Operator**: The core controller that manages Dynamo resources
2. **ETCD**: Distributed key-value store for configuration and state management
3. **NATS**: High-performance messaging system for inter-component communication
4. **RBAC Resources**: Service accounts, roles, and bindings for security


#### 1.2.1 Dynamo Pod Status Verification
Expected pods:
```
NAME                                                              READY   STATUS      RESTARTS   AGE
dynamo-platform-dynamo-operator-controller-manager-dc759d4xsh9t   2/2     Running     0          35m
dynamo-platform-etcd-0                                            1/1     Running     0          27m
dynamo-platform-etcd-pre-upgrade-n5xr4                            0/1     Completed   0          27m
dynamo-platform-nats-0                                            2/2     Running     0          27m
dynamo-platform-nats-box-768ddb656d-xqjr5                         1/1     Running     0          35m
```
All pods should show `Running`, except the `dynamo-platform-etcd-pre-upgrade` pod, which runs once and then shows `Completed`.

In [None]:
!kubectl get pods -n $NAMESPACE
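
The same check can be automated instead of read by eye. The sketch below works on the parsed output of `kubectl get pods -n dynamo-cloud -o json` and treats a pod as healthy when it is running with every container ready, or when it has finished successfully (kubectl shows the `Succeeded` phase as `Completed`). The function name `unhealthy_pods` is hypothetical.

```python
def unhealthy_pods(pod_list: dict) -> list:
    """Return names of pods that are neither fully ready nor successfully completed."""
    bad = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            # One-off pods such as the etcd pre-upgrade hook end up here.
            continue
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if phase != "Running" or not all_ready:
            bad.append(name)
    return bad
```

An empty list means every pod matches the expected state shown above.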

#### 1.2.2 Dynamo Service Status Verification
Expected services:
```
NAME                            TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)             AGE
dynamo-platform-etcd            ClusterIP   192.168.100.197   <none>        2379/TCP,2380/TCP   36m
dynamo-platform-etcd-headless   ClusterIP   None              <none>        2379/TCP,2380/TCP   36m
dynamo-platform-nats            ClusterIP   192.168.100.116   <none>        4222/TCP            36m
dynamo-platform-nats-headless   ClusterIP   None              <none>        4222/TCP,8222/TCP   36m

```

In [None]:
!kubectl get services -n $NAMESPACE
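
A quick way to confirm that the ETCD and NATS services expose the ports listed above is to compare them against the parsed output of `kubectl get services -n dynamo-cloud -o json`. This is a minimal sketch; the expected ports are copied from the sample output, and the name `missing_ports` is hypothetical.

```python
# Client-facing ports taken from the sample service listing above.
EXPECTED_PORTS = {
    "dynamo-platform-etcd": {2379, 2380},
    "dynamo-platform-nats": {4222},
}


def missing_ports(service_list: dict, expected: dict = EXPECTED_PORTS) -> dict:
    """Map service name -> expected ports it does not expose (empty dict means all present)."""
    exposed = {
        svc["metadata"]["name"]: {p["port"] for p in svc.get("spec", {}).get("ports", [])}
        for svc in service_list.get("items", [])
    }
    problems = {}
    for name, ports in expected.items():
        # A service that is absent altogether is reported with all its ports missing.
        missing = ports - exposed.get(name, set())
        if missing:
            problems[name] = missing
    return problems
```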

## 1.3 Dynamo and Kubernetes Cluster Monitoring
The monitoring stack is already installed in this lab environment.


### 1.3.1 DCGM
NVIDIA Data Center GPU Manager (DCGM) is a comprehensive suite that provides active GPU health monitoring, diagnostics, system alerts, and governance policies for GPU clusters.

The **DCGM Exporter** is a critical component for GPU observability in Kubernetes environments. It serves as the bridge between DCGM and your monitoring stack.

##### Key Capabilities:
- **GPU Telemetry Collection**: Gathers comprehensive GPU metrics from DCGM
- **Prometheus Integration**: Exposes metrics in Prometheus-compatible format
- **Kubernetes Native**: Uses the kubelet PodResources API to associate GPU metrics with the pods that use each GPU
- **Real-time Monitoring**: Provides continuous GPU performance and health data

##### Architecture Overview:

<center><img  src="images/gpu-telemetry.png"></center>


In [None]:
# Check that the DCGM Exporter pods deployed by the GPU Operator are running
!kubectl get pods -n gpu-operator
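
To make "Prometheus-compatible format" concrete, the sketch below parses a few lines of the text exposition format and pulls out the samples for one metric. `DCGM_FI_DEV_GPU_UTIL` is a GPU utilization metric published by the DCGM Exporter; the parser itself is a simplified illustration (it assumes label values contain no unescaped quotes) and the name `parse_samples` is hypothetical.

```python
import re

# metric_name{label="value",...} value   (labels are optional)
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str, metric: str) -> list:
    """Return (labels, value) pairs for one metric from Prometheus text-format output."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = SAMPLE_LINE.match(line)
        if match and match.group("name") == metric:
            labels = dict(LABEL.findall(match.group("labels") or ""))
            samples.append((labels, float(match.group("value"))))
    return samples
```

Applied to the text served on an exporter's `/metrics` endpoint, `parse_samples(text, "DCGM_FI_DEV_GPU_UTIL")` would yield one entry per GPU.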

### 1.3.2 Prometheus and Grafana

**Prometheus** is an industry-standard open-source monitoring system designed for reliability and scalability. In our GPU monitoring architecture, it performs several key functions:

- It scrapes metrics from instrumented applications and exporters.
- It stores all scraped samples locally.
- It runs rules over this data, either to aggregate and record new time series from existing data or to generate alerts.

Prometheus targets define the endpoints from which Prometheus pulls metrics. Many services, such as Kubernetes components, expose metrics themselves, and Prometheus scrapes them directly. Services that do not expose metrics natively need an exporter: a program that extracts data from the service and converts it into the Prometheus format. The DCGM Exporter described above is one such exporter.

<center><img src="images/k8s/prometheus-architecture.png" width="700"></center>


**[Grafana](https://grafana.com/)** is a visualization platform for building rich, interactive dashboards and alerts on top of data sources such as Prometheus.



Now let's check the status of Prometheus and Grafana. Initially, you may see pods in `ContainerCreating` status, which will transition to `Running` as they initialize.

**Expected Components:**
- **Grafana**: Web-based visualization platform
- **Prometheus Operator**: Manages Prometheus instances
- **Kube State Metrics**: Kubernetes metrics collector
- **Node Exporter**: Host-level metrics exporter

Sample output:
```
NAME                                                        READY   STATUS    RESTARTS   AGE
alertmanager-kube-prometheus-stack-alertmanager-0           2/2     Running   0          41m
kube-prometheus-stack-grafana-dc596f6f7-5b4kg               3/3     Running   0          41m
kube-prometheus-stack-kube-state-metrics-8655798c4c-zxc7t   1/1     Running   0          41m
kube-prometheus-stack-operator-6dc4bf67dd-rjn8j             1/1     Running   0          41m
kube-prometheus-stack-prometheus-node-exporter-2m9gr        1/1     Running   0          41m
prometheus-kube-prometheus-stack-prometheus-0               2/2     Running   0          41m
```

In [None]:
!kubectl get pods -n prometheus

#### 1.3.2.1 Expose Prometheus Service

##### Prometheus Service Access

To interact with the Prometheus metrics database and query interface, we need to expose the Prometheus service for external access. We'll use port forwarding to make the Prometheus UI available on your lab instance.

**Port Forwarding Setup:**
- **Local Port**: 30090 (accessible from your browser)
- **Target Port**: 9090 (Prometheus server default)
- **Access Method**: Direct connection to Prometheus service

This will enable you to query GPU metrics, explore DCGM data, and validate the monitoring pipeline.

In [None]:
# Get the Prometheus service name
PROM_NAME = !kubectl get svc --namespace prometheus -l app=kube-prometheus-stack-prometheus -o custom-columns=NAME:.metadata.name --no-headers
PROM_NAME = PROM_NAME[0]

In [None]:
# Port-forward the Prometheus service in the background (local 30090 -> service 9090)
import subprocess

subprocess.Popen(
    [
        "kubectl",
        "-n",
        "prometheus",
        "port-forward",
        "--address",
        "0.0.0.0",
        f"service/{PROM_NAME}",
        "30090:9090",
    ],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
    close_fds=True,
)
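
Because `subprocess.Popen` returns immediately and its output is discarded, a failed or slow port-forward is easy to miss. The sketch below polls the local port until it accepts TCP connections. It only shows that something is listening locally, not that the service behind it is healthy; the helper name `wait_for_port` and its defaults are hypothetical.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means the forwarder is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

For the forward started above, `wait_for_port("localhost", 30090)` would be the corresponding call.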

##### Prometheus UI Access

The Prometheus interface is now accessible through port forwarding. You can use the Prometheus Query UI to explore GPU metrics, validate data collection, and understand your system's performance characteristics.

[Open Prometheus!](/prom/graph)
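
The queries you run in the UI can also be issued from the notebook through the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch, assuming the port-forward on local port 30090 is active and Prometheus is served at the root path; if the lab configures a route prefix, the base URL would need it. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: local end of the port-forward, with no route prefix.
PROM_URL = "http://localhost:30090"


def query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def extract_values(response: dict) -> list:
    """Flatten an instant-vector response into (labels, value) pairs."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response.get('error', 'unknown error')}")
    # Each result carries its labels and a [timestamp, "value"] pair.
    return [(r["metric"], float(r["value"][1])) for r in response["data"]["result"]]


def instant_query(promql: str, base_url: str = PROM_URL) -> list:
    """Run one instant query against Prometheus (requires the port-forward to be up)."""
    with urllib.request.urlopen(query_url(base_url, promql), timeout=10) as resp:
        return extract_values(json.load(resp))
```

For example, `instant_query("DCGM_FI_DEV_GPU_UTIL")` would return the current utilization per GPU, provided the DCGM Exporter is being scraped.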

#### 1.3.2.2 Configure Grafana Access

##### Grafana Dashboard Access

Now we'll configure access to the Grafana visualization platform. Similar to Prometheus, we'll use port forwarding to make Grafana available through your lab instance's web interface.

**Access Configuration:**
- **Local Port**: 31091 (matches our NodePort configuration)
- **Target Port**: 80 (Grafana web interface)
- **Authentication**: Username `admin`, Password `prom-operator`

In [None]:
# Get the Grafana service name
GRAFANA_NAME = !kubectl get svc --namespace prometheus -l app.kubernetes.io/name=grafana -o custom-columns=NAME:.metadata.name --no-headers
GRAFANA_NAME = GRAFANA_NAME[0]
# Port-forward the Grafana service in the background (local 31091 -> service 80)
import subprocess

subprocess.Popen(
    [
        "kubectl",
        "-n",
        "prometheus",
        "port-forward",
        "--address",
        "0.0.0.0",
        f"service/{GRAFANA_NAME}",
        "31091:80",
    ],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
    close_fds=True,
)

The Grafana interface is now exposed. Let's open the Grafana Dashboard.

To log in, use:
- Username: `admin` 
- Password: `prom-operator` 

The password was originally set in the `kube-prometheus-stack.values` file. If successful, your page should look similar to this:

<center><img src="images/k8s/grafana_page1.png" style="width: 800px"></center>


[Open Grafana!](/grafana/)

#### GPU Telemetry Dashboard Setup

Now let's create a comprehensive GPU monitoring dashboard using a pre-built template optimized for NVIDIA GPUs and DCGM metrics.

**Dashboard Import Process:**

1. **Navigate to Import**: In Grafana, select the "+" icon → "Import" from the left sidebar
2. **Load Dashboard Template**: Enter the dashboard ID `12239`, or paste its URL `https://grafana.com/grafana/dashboards/12239`, then click "Load"
3. **Configure Data Source**: Select "Prometheus" as the data source in the dropdown
4. **Customize Settings**: Adjust dashboard name and folder location if desired
5. **Import Dashboard**: Click "Import" to create your GPU monitoring dashboard

<center><img src="images/grafana.png" style="width: 800px"></center>

### 1.3.3 Configure Dynamo Monitoring

#### Dynamo Component Monitoring

The Dynamo platform exposes metrics that provide insights into:

- **Inference Performance**: Request latency, throughput, and queue depths
- **Resource Utilization**: CPU, memory, and GPU usage across components  
- **System Health**: Component availability, error rates, and resource limits

#### PodMonitor Configuration

Dynamo components expose Prometheus metrics through standard `/metrics` endpoints. We have configured PodMonitor resources to automatically discover and scrape these metrics.

The PodMonitor resources target specific Dynamo components:
- **Frontend**: Request routing and load balancing metrics
- **Prefill Workers**: Input processing performance metrics
- **Decode Workers**: Token generation and KV cache metrics
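
For orientation, a PodMonitor is a Prometheus Operator custom resource that selects pods by label and names the port and path to scrape. The sketch below builds one as a Python dictionary and prints it as JSON. The resource name and namespaces come from this lab, but the label selector, port name, and scrape interval are hypothetical placeholders, not the lab's actual configuration.

```python
import json


def pod_monitor(name: str, namespace: str, match_labels: dict,
                target_namespace: str, port: str = "metrics") -> dict:
    """Build a PodMonitor manifest that scrapes /metrics on matching pods."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PodMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Namespace(s) in which to look for pods.
            "namespaceSelector": {"matchNames": [target_namespace]},
            # Pods must carry all of these labels to be scraped.
            "selector": {"matchLabels": match_labels},
            # `port` refers to a named container port; interval is an assumption.
            "podMetricsEndpoints": [{"port": port, "path": "/metrics", "interval": "15s"}],
        },
    }


manifest = pod_monitor(
    name="dynamo-frontend-metrics",
    namespace="prometheus",
    match_labels={"example.com/dynamo-component": "frontend"},  # hypothetical label
    target_namespace="dynamo-cloud",
)
print(json.dumps(manifest, indent=2))
```

Since `kubectl apply -f` accepts JSON as well as YAML, a manifest produced this way could be applied directly once the placeholder label and port are replaced with real values.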


Sample output:
```
NAMESPACE    NAME                            AGE
prometheus   dynamo-decode-worker-metrics    46m
prometheus   dynamo-frontend-metrics         46m
prometheus   dynamo-prefill-worker-metrics   46m
```

In [None]:
!kubectl get podmonitors -A

#### View the Dashboard
- Navigate to **Dashboards → Dynamo Dashboard**.
- Open your newly imported dashboard.
- You should now see real-time metrics from your cluster.

[Open Grafana!](/grafana/) -> Dynamo Dashboard

<center><img src="images/dynamo-grafana-dashboard.png" width="700px"></center>

---

## Summary

In this comprehensive notebook, you've learned about both NVIDIA Dynamo's innovative architecture and the production-grade infrastructure where you'll deploy and experiment with these concepts.

### Key Takeaways from Dynamo Architecture:

1. **Disaggregated Serving**: Separating prefill and decode operations for optimal resource utilization
2. **KV Cache Management**: Multi-tier cache management with 40% TTFT improvements
3. **LLM-Aware Routing**: Intelligent request routing that understands LLM workload characteristics
4. **Dynamic Planning**: Both load-based and SLA-based planning for different use cases
5. **NIXL**: Accelerated data transfer across memory hierarchies
6. **Performance Benefits**: Significant improvements in latency, throughput, and resource efficiency

### Next Steps:

Now that you understand both Dynamo's architecture and the infrastructure capabilities, and the Dynamo Cloud Platform is already running, the next step is to deploy Dynamo Inference Graphs.

**Continue to**: [vLLM Agg Deployment](Dynamo_02_vLLM_Agg_Deployment.ipynb)


