pillar: nireconciler: fix "already connected" error on app re-activation#5620
Conversation
AddAppConn() rejects an app whose UUID is already present in r.apps,
but it does not check whether that entry is marked deleted (i.e.
pending async teardown from a prior DelAppConn()). This causes
app re-activation to fail during purge when the async cleanup has
not yet completed.
The call sequence that triggers the bug:
zedmanager publishes AppNetworkConfig(Activate=false)
-> handleAppNetworkModify
-> doInactivateAppNetwork
-> DelAppConn
-> r.apps[appID].deleted = true
-> reconcile()
-> updateAppConnStatus: async ops still in progress,
entry stays in map
zedmanager publishes AppNetworkConfig(Activate=true)
-> handleAppNetworkModify
-> doActivateAppNetwork
-> AddAppConn
-> r.apps[appID] exists => "already connected" <-- BUG
The entry lingers because updateAppConnStatus() only calls
delete(r.apps) once all async operations (VIF teardown, namespace
cleanup) have finished. When the BringDown and BringUp pubsub events
are processed back-to-back, the async window is too short for
completion.
Fix AddAppConn() to check the deleted flag on existing entries.
When an entry is pending deletion, remove it immediately and proceed
with the new connection. The subsequent reconcile() call performs a
full current-vs-intended state reconciliation, so any incomplete
teardown from the old connection is carried forward and completed
alongside the new setup — no resources are leaked.
Also skip deleted entries in the DisplayName uniqueness check to
avoid false conflicts with outgoing apps.
Additionally, fix doInactivateAppNetwork() to always set
status.Activated = false, even when DelAppConn() returns an error.
Previously it returned early, leaving status.Activated = true while
the reconciler considered the app removed. This mismatch caused the
next handleAppNetworkModify to take the update-activated-app path
instead of the activate path, compounding the failure.
Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##           master    #5620      +/- ##
==========================================
+ Coverage   19.52%   29.49%   +9.96%
==========================================
  Files          19       18       -1
  Lines        3021     2417     -604
==========================================
+ Hits          590      713     +123
+ Misses       2310     1552     -758
- Partials      121      152      +31
@rene could you approve to run tests?
rene
left a comment
Kicking off tests, but I want @milan-zededa to give the final approval. In any case, LGTM...
Sure! The reconciler is complex, I might have overlooked something. Unfortunately I couldn't add tests because the infrastructure assumes these async ops always succeed. So @milan-zededa, we are waiting for your verdict :)
@milan-zededa can you take a look?
milan-zededa
left a comment
Thanks for fixing this!
Description

The call sequence that triggers the bug:

```mermaid
sequenceDiagram
    participant ZM as zedmanager
    participant ZR as zedrouter
    participant NR as NI Reconciler
    Note over ZM,NR: Purge BringDown
    ZM->>ZR: AppNetworkConfig(Activate=false)
    ZR->>NR: DelAppConn(appID)
    NR->>NR: r.apps[appID].deleted = true
    NR->>NR: reconcile() (async ops pending)
    NR-->>ZR: return (entry stays in map)
    Note over ZM,NR: Purge BringUp
    ZM->>ZR: AppNetworkConfig(Activate=true)
    ZR->>NR: AddAppConn(appID)
    NR->>NR: r.apps[appID] exists!
    alt Before fix
        NR-->>ZR: ERROR "already connected"
    else After fix
        NR->>NR: existing.deleted? → delete stale entry
        NR->>NR: proceed with new connection
        NR-->>ZR: OK
    end
```
PR dependencies
none
Changelog notes
None
Test verification instructions

Steps to reproduce

Preconditions

Trigger
1. PurgeInprogress = BringDown, zedmanager publishes AppNetworkConfig(Activate=false)
2. PurgeInprogress = BringUp, zedmanager publishes AppNetworkConfig(Activate=true)

Expected result
Network deactivates, then re-activates successfully. App proceeds with the new deployment.

Actual result
doActivateAppNetwork fails with the "already connected" error. The app is stuck in error state. Subsequent retries hit the same error until the node is rebooted (which clears the reconciler's in-memory state).
Why it's not always reproducible
The bug requires at least one VIF teardown operation to be async during DelAppConn. Whether that happens depends on the network configuration (bridge type, namespace state, number of VIFs). When all teardown is synchronous, the entry is removed from the map immediately and the next AddAppConn succeeds.

The key thing is being honest that it's not a simple "click here, see bug" scenario: it requires a prior failure plus a specific internal timing condition. Calling that out explicitly saves reviewers from trying to reproduce it naively and concluding the fix is unnecessary.
PR Backports

For PRs that should be backported into any stable branch, please add the label stable.

Checklist

Please check the boxes above after submitting the PR in interactive mode.