Ruigao/update for 1.6 #177

Merged
hippogr merged 48 commits into dev from ruigao/update-for-1.6 on Apr 30, 2026

@hippogr hippogr commented Apr 8, 2026

Fix security warnings and update the web portal to Node.js 24.

Copilot AI review requested due to automatic review settings April 8, 2026 08:44

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

…gle.golang.org/grpc v1.74.2 -> v1.79.3)

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
When a user without permission opens another user's job page, fetchJobInfo
returns 403 but the error was silently ignored, causing the page to show
"Loading..." forever with a vague empty alert. Now fetchJobInfo checks HTTP
status, shows a clear permission error, and skips subsequent requests.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…naged nodes

On VMSS nodes where NetworkManager manages IB interfaces, ifconfig sets
the IP with the noprefixroute flag, which prevents automatic subnet route
creation. As a result, IPoIB TCP traffic (rsync/bcast) fails between nodes
even though RDMA works. Add an explicit route check and creation step after
ifconfig to ensure connectivity.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Rui Gao and others added 3 commits April 22, 2026 03:59
…FR pipeline from stalling

Nodes with empty NodeId would transition to triaged_hardware but OFR cannot
create IcM tickets without a valid NodeId, causing the pipeline to stall.
Now these nodes stay in cordoned status so the classifier retries on the next cycle.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
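
The rule described in this commit can be sketched as a small status-transition helper. This is a minimal illustration under stated assumptions, not the classifier's actual code: the function name `next_status_for_hardware_issue` is hypothetical, and only the status names `cordoned` and `triaged_hardware` come from the commit message.

```python
from typing import Optional


def next_status_for_hardware_issue(node_id: Optional[str]) -> str:
    """Choose the next status for a node diagnosed with a hardware issue.

    Hypothetical helper: the real classifier may structure this differently.
    """
    # Assumption: a missing, empty, or whitespace-only NodeId means OFR
    # cannot create an IcM ticket for this node.
    if not node_id or not node_id.strip():
        # Stay cordoned so the classifier retries on the next cycle.
        return "cordoned"
    return "triaged_hardware"
```

Keeping the node in `cordoned` rather than advancing it means no downstream stage ever sees a node it cannot act on.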
….x to 8.0.5

The resolutions field pinned nodemailer to ^7.0.11 which overrode the
dependencies entry of ^8.0.5, causing yarn to install 7.0.13 in the image.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
npm upgrades:
- alert-handler: axios 1.13.5->1.15.2, follow-redirects 1.15.11->1.16.0
- database-controller: lodash 4.17.23->4.18.1 (added yarn resolution)
- rest-server, job-status-change-notification, webportal: follow-redirects 1.15.11->1.16.0

Dockerfile updates (add tdnf update for Azure Linux openssl 3.3.5-4->3.3.5-5):
- alert-parser, node-recycler, node-issue-classifier, job-data-recorder

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Rui Gao and others added 2 commits April 23, 2026 00:54
Hardware issues like FrontendNetworkIssue and DiskError have no matching
Azure OFR fault code. Submitting OFR for these results in unresolvable
tickets and, combined with the lack of dedup in node-recycler, causes
repeated OFR submissions (as seen with openpai-00000s). By downgrading
to triaged_unknown the node stays visible for manual investigation
while avoiding the broken OFR pipeline.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
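
One way to picture the downgrade is a lookup against the issue types that have no OFR fault code. The sketch below is an assumption about shape only: `ISSUES_WITHOUT_OFR_FAULT_CODE` and `triage_status` are hypothetical names, the set contains just the two issue types named in the commit message and may be incomplete, and the input is assumed to be an issue already classified as hardware.

```python
# Issue types named in the commit message as having no matching Azure OFR
# fault code. Assumption: the real list may contain more entries.
ISSUES_WITHOUT_OFR_FAULT_CODE = frozenset({"FrontendNetworkIssue", "DiskError"})


def triage_status(issue_type: str) -> str:
    """Map a hardware issue type to a triage status (hypothetical helper)."""
    if issue_type in ISSUES_WITHOUT_OFR_FAULT_CODE:
        # No fault code to file under: keep the node visible for manual
        # investigation instead of submitting an unresolvable OFR.
        return "triaged_unknown"
    return "triaged_hardware"
```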
…ame node

Check the latest action before creating a new IcM OFR ticket: if
triaged_hardware-ua already exists, skip ticket creation and reuse the
existing ticket ID for polling. This fixes the bug where every pipeline
loop could spawn a new OFR request for the same node because
get_latest_action_by_state (endswith query) never matches the
triaged_hardware-ua action.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
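
The dedup check can be sketched roughly as follows. The record shape (a list of `(action, ticket_id)` tuples, oldest first), the `ensure_ofr_ticket` name, and the `create_ticket` callback are all assumptions made for illustration; only the `triaged_hardware-ua` action name comes from the commit message. As a side note, in Python `"triaged_hardware-ua".endswith("triaged_hardware")` is `False`, which shows how a suffix match on the state name could miss this action.

```python
from typing import Callable, Optional, Sequence, Tuple

# Action name taken from the commit message.
TICKET_CREATED_ACTION = "triaged_hardware-ua"


def ensure_ofr_ticket(
    actions: Sequence[Tuple[str, Optional[str]]],
    create_ticket: Callable[[], str],
) -> Tuple[str, bool]:
    """Return (ticket_id, created) for a node.

    actions: (action, ticket_id) records, oldest first (hypothetical shape).
    create_ticket: callback that files a new ticket and returns its ID.
    """
    if actions:
        latest_action, ticket_id = actions[-1]
        if latest_action == TICKET_CREATED_ACTION and ticket_id:
            # A ticket already exists: reuse its ID for polling.
            return ticket_id, False
    # No ticket recorded for the latest action: create one.
    return create_ticket(), True
```

Comparing the latest action by exact name avoids depending on a suffix query, so repeated pipeline loops reuse the same ticket.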
Hardcoded 8TB retention caused disk full on the we cluster (16T disk).
Now each service can override retention_size in services-configuration.yaml.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
When a validating/available_nodata node has zero alert records in Kusto
(e.g. due to Prometheus data gap), find_node_alerts returns an empty
DataFrame without columns. Accessing period_alerts['alertname'] then
raises KeyError, causing the node to be stuck in validating indefinitely.

Add an empty check before accessing DataFrame columns.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
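
A minimal sketch of the empty check, assuming pandas and a hypothetical helper named `alert_names` (the commit does not show the actual call site). The `period_alerts` variable and the `alertname` column are the only names taken from the commit message.

```python
import pandas as pd


def alert_names(period_alerts: pd.DataFrame) -> list:
    """Return the distinct alert names in a frame, or [] if there are none.

    Hypothetical helper illustrating the guard described in the commit.
    """
    # A DataFrame built from zero records can have no columns at all, so
    # period_alerts["alertname"] would raise KeyError without this guard.
    if period_alerts.empty or "alertname" not in period_alerts.columns:
        return []
    return period_alerts["alertname"].dropna().unique().tolist()
```

With the guard in place, a node with no alert records yields an empty result that the caller can handle, rather than an exception that leaves the node stuck in validating.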
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
@hippogr hippogr requested a review from a team as a code owner April 30, 2026 00:51
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
@hippogr hippogr merged commit c25887f into dev Apr 30, 2026
2 of 3 checks passed

3 participants