NO-ISSUE: docs(website): add metrics monitoring#66

Merged
hhk7734 merged 1 commit into main from metrics
Feb 23, 2026

@hhk7734 commented Feb 23, 2026

No description provided.

Copilot AI left a comment

Pull request overview

This PR expands the website operations documentation by adding a new Grafana-based metrics monitoring guide (with supporting dashboard screenshots) and organizing the Monitoring section in the sidebar. It also introduces a new “Resource Allocation” operations guide and updates a prerequisite link in the log-collection doc.

Changes:

  • Add Metrics Monitoring documentation under Operations → Monitoring, including Grafana access steps and metric/dashboard explanations.
  • Add new Resource Allocation guide covering GPU/RDMA requests/limits, node selection, and taints/tolerations.
  • Add Monitoring category metadata and fix the log-collection prerequisites link.

Reviewed changes

Copilot reviewed 4 out of 11 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| website/docs/operations/resource-allocation.mdx | New guide for resource requests/limits, node selection, and tolerations. |
| website/docs/operations/monitoring/metrics/index.mdx | New metrics monitoring guide explaining Grafana access and dashboard metrics. |
| website/docs/operations/monitoring/metrics/heimdall.png | Screenshot asset referenced by metrics guide. |
| website/docs/operations/monitoring/metrics/gpu-total.png | Screenshot asset referenced by metrics guide. |
| website/docs/operations/monitoring/metrics/filters.png | Screenshot asset referenced by metrics guide. |
| website/docs/operations/monitoring/log-collection.mdx | Fix prerequisites link formatting/target. |
| website/docs/operations/monitoring/category.yaml | Adds Monitoring section category configuration for the sidebar. |

Comments suppressed due to low confidence (1)

website/docs/operations/resource-allocation.mdx:211

  • Same as above: the +++ AMD GPUs / +++ NVIDIA GPUs / trailing +++ markers are not standard MDX/Docusaurus formatting and will likely render as stray text. Please convert these to headings or a supported grouping component (Tabs/admonitions/details) for readability.
+++ AMD GPUs

**Node taint:**

```yaml
taints:
  - key: 'amd.com/gpu'
    effect: 'NoSchedule'
```

**Pod toleration:**

```yaml
tolerations:
  - key: 'amd.com/gpu'
    operator: 'Exists'
    effect: 'NoSchedule'
```

+++ NVIDIA GPUs

**Node taint:**

```yaml
taints:
  - key: 'nvidia.com/gpu'
    effect: 'NoSchedule'
```

**Pod toleration:**

```yaml
tolerations:
  - key: 'nvidia.com/gpu'
    operator: 'Exists'
    effect: 'NoSchedule'
```

+++
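For readers checking the quoted taint/toleration pairs, here is a minimal Python sketch of how a toleration is matched against a taint. It is a simplified illustration of the matching rules described in the Kubernetes documentation, not the scheduler's actual code. The `Taint` and `Toleration` classes and the `tolerates` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    """Hypothetical stand-in for a node taint (illustration only)."""
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    """Hypothetical stand-in for a pod toleration (illustration only)."""
    key: str = ""
    operator: str = "Equal"  # Kubernetes treats an unset operator as Equal
    value: str = ""
    effect: str = ""  # an empty effect matches every taint effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint (simplified rules)."""
    # A non-empty toleration effect must equal the taint's effect.
    if toleration.effect and toleration.effect != taint.effect:
        return False
    # An empty key is only meaningful with Exists, where it matches every key.
    if toleration.key == "":
        return toleration.operator == "Exists"
    if toleration.key != taint.key:
        return False
    # With Exists, the value is ignored; with Equal, values must match.
    if toleration.operator == "Exists":
        return True
    return toleration.operator == "Equal" and toleration.value == taint.value


# Keys and effects taken from the snippet quoted in the comment.
amd_taint = Taint(key="amd.com/gpu", effect="NoSchedule")
nvidia_taint = Taint(key="nvidia.com/gpu", effect="NoSchedule")
amd_toleration = Toleration(key="amd.com/gpu", operator="Exists", effect="NoSchedule")

print(tolerates(amd_toleration, amd_taint))     # True
print(tolerates(amd_toleration, nvidia_taint))  # False
```

Under these rules, a pod carrying only the `amd.com/gpu` toleration would be kept off nodes tainted with `nvidia.com/gpu`, which is why the guide lists a separate toleration per vendor.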

3 comment threads on website/docs/operations/resource-allocation.mdx (outdated)
@hhk7734 merged commit 8d9b97e into main Feb 23, 2026
3 checks passed
@hhk7734 deleted the metrics branch February 23, 2026 08:26
2 participants