Skip to content

zedkube/edgeview: fix stale vmiVNC.run blocking new VNC sessions#5875

Merged
milan-zededa merged 1 commit intolf-edge:masterfrom
naiming-zededa:naiming-vnc-cleanup
May 7, 2026
Merged

zedkube/edgeview: fix stale vmiVNC.run blocking new VNC sessions#5875
milan-zededa merged 1 commit intolf-edge:masterfrom
naiming-zededa:naiming-vnc-cleanup

Conversation

@naiming-zededa
Copy link
Copy Markdown
Contributor

Description

Problem: vmiVNC.run was not being cleaned up when an edgeview instance crashed or was killed.

Fixes:

  • clustertypes.go: Add AppUUID field to VmiVNCConfig to identify file ownership per-app. Add OwnerAlive() which reads /proc//comm and checks the name is edge-view, guarding against PID reuse by unrelated processes.
  • zedkube/vnc.go: Replace the blind os.Stat block with canClaimVNCFile(). It evicts stale files (dead edgeview PID, reused PID, or idle virtctl port) and blocks only on a genuinely live session — a running edgeview process or a virtctl proxy still listening on the port.
  • edgeview/network.go: Replace the blind os.Stat block with removeStaleVNCFile(), the edgeview-side equivalent. Also call it at edgeview startup (runOnServer) with evictIdle=false to clear files left from a previous crash without disturbing a live remote-console session. Rewrite isPortListening to read /proc/net/tcp directly instead of spawning ss (avoids disrupting the virtctl WebSocket connection).
  • vnc-proxy.sh: Move monitor_caller_pid start to the top of handle_vnc before the virtctl retry loop

PR dependencies

How to test and validate this PR

run the edgeview VNC in multiple times, and verify they are not blocked by the stale config file on the device.

Changelog notes

zedkube/edgeview: fix stale vmiVNC.run blocking new VNC sessions

PR Backports

Checklist

  • I've provided a proper description
  • I've added the proper documentation
  • I've tested my PR on amd64 device
  • I've tested my PR on arm64 device
  • I've written the test verification instructions
  • I've set the proper labels to this PR

For backport PRs (remove it if it's not a backport):

  • I've added a reference link to the original PR
  • PR's title follows the template

And the last but not least:

  • I've checked the boxes above, or I've provided a good reason why I didn't
    check them.

@naiming-zededa naiming-zededa force-pushed the naiming-vnc-cleanup branch 2 times, most recently from b49eec0 to b32e0c1 Compare April 28, 2026 04:13
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 28, 2026

Codecov Report

❌ Patch coverage is 31.81818% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 22.05%. Comparing base (3c61838) to head (0750c39).

Files with missing lines Patch % Lines
...f-edge/eve/pkg/pillar/utils/netutils/portlisten.go 0.00% 15 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5875      +/-   ##
==========================================
+ Coverage   21.62%   22.05%   +0.43%     
==========================================
  Files         464      475      +11     
  Lines       83994    85714    +1720     
==========================================
+ Hits        18165    18907     +742     
- Misses      64303    65099     +796     
- Partials     1526     1708     +182     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread pkg/edgeview/src/network.go
Comment thread pkg/kube/vnc-proxy.sh Outdated
logmsg "monitor_caller_pid: Caller PID $caller_pid is gone (crashed?), cleaning up"

# Re-check CallerPID in the file before acting, in case a new edgeview
# process with a recycled PID already wrote a new session file.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really for "recycled PID" if we are checking for changed CallerPID?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated comments.

Comment thread pkg/pillar/docs/vnc-workflows.md Outdated
│ ┌─────────┐ AppInstanceConfig ┌──────────┐ │
│ │zedagent │─────────────────────►│ zedkube │ │
│ └─────────┘ (RemoteConsole) └────┬─────┘ │
│ │ runAppVNC() │
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add space to fix the arrow

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

Comment thread pkg/pillar/docs/vnc-workflows.md Outdated
│ │edgeview │◄────────────────────│ edgeview │ │
│ │ client │ │ (server) │ │
│ └──────────┘ └────┬─────┘ │
│ │ setupEveKVNC() │
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is edgeview used also for Controller-triggered VNC access? Based on this diagram it seems that zedkube talks to edgeview to setup VNC.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redo the drawing to reflect the schemes properly.

Comment thread pkg/pillar/docs/vnc-workflows.md Outdated

Rules:

- `CallerPID > 0` → edgeview owns this session; liveness via `/proc/<pid>/comm`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that in the edgeview-case, the liveness is both PID-check and port-listen check (?)
But the port-listen check is not mentioned for CallerPID > 0

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we actually missing the code in checking the vnc port. updated the code and the doc.

Comment thread pkg/pillar/docs/vnc-workflows.md
Comment thread pkg/pillar/docs/vnc-workflows.md Outdated

Both `removeStaleVNCFile` (edgeview) and `canClaimVNCFile` (zedkube) probe
whether a virtctl proxy is actually running before declaring a remote-console
file stale. They do this differently:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do they really do it differently? They use the same helper function after all

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

Comment thread pkg/pillar/docs/vnc-workflows.md
@github-actions github-actions Bot requested a review from milan-zededa April 30, 2026 20:08
@naiming-zededa naiming-zededa force-pushed the naiming-vnc-cleanup branch 2 times, most recently from 79515e0 to 6f0c5c2 Compare May 1, 2026 03:58
Problem: vmiVNC.run was not being cleaned up when an edgeview instance
crashed or was killed.

  Fixes:
  - clustertypes.go: Add AppUUID field to VmiVNCConfig to identify file
    ownership per-app. Add OwnerAlive() which reads /proc/<pid>/comm and
    checks the name is edge-view, guarding against PID reuse by
    unrelated processes.
  - zedkube/vnc.go: Replace the blind os.Stat block with
    canClaimVNCFile(). It evicts stale files (dead edgeview PID, reused
    PID, or idle virtctl port) and blocks only on a genuinely live
    session — a running edgeview process or a virtctl proxy still
    listening on the port.
  - edgeview/network.go: Replace the blind os.Stat block with
    removeStaleVNCFile(), the edgeview-side equivalent. Also call it at
    edgeview startup (runOnServer) with evictIdle=false to clear files
    left from a previous crash without disturbing a live remote-console
    session. Rewrite isPortListening to read /proc/net/tcp directly
    instead of spawning ss (avoids disrupting the virtctl WebSocket
    connection).
  - vnc-proxy.sh: Move monitor_caller_pid start to the top of handle_vnc
    before the virtctl retry loop

Signed-off-by: naiming-zededa <naiming@zededa.com>
@naiming-zededa naiming-zededa force-pushed the naiming-vnc-cleanup branch from 6f0c5c2 to 0750c39 Compare May 6, 2026 22:44
@github-actions github-actions Bot requested a review from milan-zededa May 6, 2026 22:44
@milan-zededa milan-zededa merged commit 90a79d2 into lf-edge:master May 7, 2026
41 of 44 checks passed
@naiming-zededa naiming-zededa deleted the naiming-vnc-cleanup branch May 9, 2026 21:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants