Ruigao/update for 1.6#177
Merged
added 22 commits
March 31, 2026 04:18
…gle.golang.org/grpc v1.74.2 -> v1.79.3) Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
When a user without permission opens another user's job page, fetchJobInfo returns 403 but the error was silently ignored, causing the page to show "Loading..." forever with a vague empty alert. Now fetchJobInfo checks HTTP status, shows a clear permission error, and skips subsequent requests. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…naged nodes On VMSS nodes where NetworkManager manages IB interfaces, ifconfig sets the IP with the noprefixroute flag, which prevents automatic subnet route creation. As a result, IPoIB TCP (rsync/bcast) fails between nodes even though RDMA works. Add an explicit route check and route creation after ifconfig to ensure connectivity. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…FR pipeline from stalling Nodes with empty NodeId would transition to triaged_hardware but OFR cannot create IcM tickets without a valid NodeId, causing the pipeline to stall. Now these nodes stay in cordoned status so the classifier retries on the next cycle. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
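The guard described in this commit could be sketched roughly as below. This is a minimal illustration and not the classifier's actual code: the function name `next_state` and the dictionary layout are hypothetical. Only the `NodeId` field and the state names `cordoned` and `triaged_hardware` come from the commit message.

```python
def next_state(node: dict) -> str:
    """Pick the next state for a node with a hardware fault (hypothetical helper)."""
    # Treat a missing, None, or whitespace-only NodeId as empty.
    node_id = (node.get("NodeId") or "").strip()
    if not node_id:
        # OFR cannot create an IcM ticket without a NodeId, so keep the node
        # cordoned and let the classifier retry on the next cycle.
        return "cordoned"
    return "triaged_hardware"
```

Keeping the node in `cordoned` rather than inventing a new state means the existing retry loop picks it up again without extra handling.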
….x to 8.0.5 The resolutions field pinned nodemailer to ^7.0.11 which overrode the dependencies entry of ^8.0.5, causing yarn to install 7.0.13 in the image. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
npm upgrades:
- alert-handler: axios 1.13.5->1.15.2, follow-redirects 1.15.11->1.16.0
- database-controller: lodash 4.17.23->4.18.1 (added yarn resolution)
- rest-server, job-status-change-notification, webportal: follow-redirects 1.15.11->1.16.0

Dockerfile updates (add tdnf update for Azure Linux openssl 3.3.5-4->3.3.5-5):
- alert-parser, node-recycler, node-issue-classifier, job-data-recorder

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Hardware issues like FrontendNetworkIssue and DiskError have no matching Azure OFR fault code. Submitting OFR for these results in unresolvable tickets and, combined with the lack of dedup in node-recycler, causes repeated OFR submissions (as seen with openpai-00000s). By downgrading to triaged_unknown the node stays visible for manual investigation while avoiding the broken OFR pipeline. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
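A minimal sketch of the downgrade rule, assuming the classifier chooses between two target states based on the issue type. The set name and the function `triage_state` are hypothetical. The two issue types and the state names `triaged_unknown` and `triaged_hardware` are taken from the commit message, and the real list of issues without an OFR fault code may be longer.

```python
# Issue types the commit message names as having no matching Azure OFR fault code.
ISSUES_WITHOUT_OFR_FAULT_CODE = {"FrontendNetworkIssue", "DiskError"}


def triage_state(issue_type: str) -> str:
    """Return the triage state for a detected hardware issue (hypothetical helper)."""
    if issue_type in ISSUES_WITHOUT_OFR_FAULT_CODE:
        # No OFR fault code exists, so do not enter the OFR pipeline;
        # leave the node visible for manual investigation instead.
        return "triaged_unknown"
    return "triaged_hardware"
```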
…ame node Check the latest action before creating a new IcM OFR ticket — if triaged_hardware-ua already exists, skip ticket creation and reuse the existing ticket ID for polling. This fixes the bug where every pipeline loop could spawn a new OFR request for the same node because get_latest_action_by_state (endswith query) never matches the triaged_hardware-ua action. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
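The dedup check might look something like the sketch below. The function `ticket_to_poll`, the record keys `action` and `ticket_id`, and the `create_ticket` callback are all hypothetical. Only the action name `triaged_hardware-ua` and the fact that a suffix query failed to match it come from the commit message. The assumption that the query matched on the plain `triaged_hardware` suffix is mine.

```python
from typing import Callable, Optional

UA_ACTION = "triaged_hardware-ua"

# Assumed cause of the bug: a suffix match on the plain state name
# does not match the "-ua" variant of the action.
assert not UA_ACTION.endswith("triaged_hardware")


def ticket_to_poll(latest_action: Optional[dict],
                   create_ticket: Callable[[], str]) -> str:
    """Return the ticket ID to poll, creating a ticket only when none exists."""
    if latest_action and latest_action.get("action") == UA_ACTION:
        # A ticket was already opened for this node: reuse it.
        return latest_action["ticket_id"]
    return create_ticket()
```

Comparing against the latest action directly, rather than relying on the suffix query, keeps each pipeline loop from opening a second ticket for the same node.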
A hardcoded 8TB retention size filled the disk on the we cluster (16T disk). Each service can now override retention_size in services-configuration.yaml. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
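The override can be pictured as a per-service lookup with a fallback. This is only a sketch: the function name, the dictionary shape, and the choice to keep 8TB as the fallback default are assumptions. The key `retention_size` and the previous 8TB value come from the commit message.

```python
# Assumed fallback: the value that was previously hardcoded.
DEFAULT_RETENTION_SIZE = "8TB"


def retention_size_for(service_config: dict) -> str:
    """Return the retention size for one service's parsed config section."""
    # An absent or empty retention_size falls back to the default.
    return service_config.get("retention_size") or DEFAULT_RETENTION_SIZE
```

For example, a service on a 16T disk could set a smaller `retention_size` in its section of services-configuration.yaml, while services that set nothing keep the fallback.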
When a validating/available_nodata node has zero alert records in Kusto (e.g. due to Prometheus data gap), find_node_alerts returns an empty DataFrame without columns. Accessing period_alerts['alertname'] then raises KeyError, causing the node to be stuck in validating indefinitely. Add an empty check before accessing DataFrame columns. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
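The failure and the guard can be reproduced with a few lines of pandas. The function name `alert_names` is hypothetical. The column name `alertname` and the variable name `period_alerts` come from the commit message, and what the real code does with the column afterwards is not shown here.

```python
import pandas as pd


def alert_names(period_alerts: pd.DataFrame) -> list:
    """Return distinct alert names, tolerating a column-less empty frame."""
    if period_alerts.empty:
        # An empty DataFrame built without columns has no 'alertname' key,
        # so indexing it would raise KeyError.
        return []
    return period_alerts["alertname"].unique().tolist()
```

`DataFrame.empty` is true both for a frame with no columns and for a frame with columns but no rows, so a single check covers both cases.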
zhogu
approved these changes
Apr 29, 2026
Fix Security warning and update web portal to node.js 24