
pillar: nireconciler: fix "already connected" error on app re-activation#5620

Merged
milan-zededa merged 1 commit into lf-edge:master from rucoder:rucoder/nireconciler-already-activated-bug
Feb 21, 2026

Conversation

@rucoder
Contributor

@rucoder rucoder commented Feb 18, 2026

Description

AddAppConn() rejects an app whose UUID is already present in r.apps, but it does not check whether that entry is marked deleted (i.e. pending async teardown from a prior DelAppConn()). This causes app re-activation to fail during purge when the async cleanup has not yet completed.

The call sequence that triggers the bug:

```mermaid
sequenceDiagram
    participant ZM as zedmanager
    participant ZR as zedrouter
    participant NR as NI Reconciler

    Note over ZM,NR: Purge BringDown
    ZM->>ZR: AppNetworkConfig(Activate=false)
    ZR->>NR: DelAppConn(appID)
    NR->>NR: r.apps[appID].deleted = true
    NR->>NR: reconcile() — async ops pending
    NR-->>ZR: return (entry stays in map)

    Note over ZM,NR: Purge BringUp
    ZM->>ZR: AppNetworkConfig(Activate=true)
    ZR->>NR: AddAppConn(appID)
    NR->>NR: r.apps[appID] exists!
    alt Before fix
        NR-->>ZR: ERROR "already connected"
    else After fix
        NR->>NR: existing.deleted? → delete stale entry
        NR->>NR: proceed with new connection
        NR-->>ZR: OK
    end
```

The entry lingers because updateAppConnStatus() only calls delete(r.apps) once all async operations (VIF teardown, namespace cleanup) have finished. When the BringDown and BringUp pubsub events are processed back-to-back, the async window is too short for completion.
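The lingering-entry condition can be sketched as follows. This is a simplified model, not the real nireconciler code: the `appConn`/`reconciler` types and the `pendingOps` counter are illustrative stand-ins for the actual async-operation tracking.

```go
package main

import "fmt"

// appConn is a hypothetical, simplified mirror of a reconciler map entry.
type appConn struct {
	deleted    bool // set by DelAppConn()
	pendingOps int  // async ops (VIF teardown, namespace cleanup) still in flight
}

type reconciler struct {
	apps map[string]*appConn
}

// updateAppConnStatus-like logic: the entry is only removed from the map
// once the app is marked deleted AND every async operation has completed.
func (r *reconciler) updateAppConnStatus(appID string) {
	app, ok := r.apps[appID]
	if !ok {
		return
	}
	if app.deleted && app.pendingOps == 0 {
		delete(r.apps, appID) // safe: nothing left to tear down
	}
	// Otherwise the entry lingers — this is the window in which a
	// back-to-back AddAppConn() used to fail with "already connected".
}

func main() {
	r := &reconciler{apps: map[string]*appConn{
		"app-1": {deleted: true, pendingOps: 1}, // teardown still async
	}}
	r.updateAppConnStatus("app-1")
	fmt.Println(len(r.apps)) // entry survives while an op is pending

	r.apps["app-1"].pendingOps = 0
	r.updateAppConnStatus("app-1")
	fmt.Println(len(r.apps)) // entry removed once all ops complete
}
```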

Fix AddAppConn() to check the deleted flag on existing entries. When an entry is pending deletion, remove it immediately and proceed with the new connection. The subsequent reconcile() call performs a full current-vs-intended state reconciliation, so any incomplete teardown from the old connection is carried forward and completed alongside the new setup — no resources are leaked.

Also skip deleted entries in the DisplayName uniqueness check to avoid false conflicts with outgoing apps.
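The fixed admission logic, covering both the stale-entry removal and the DisplayName check, can be sketched like this. Again an illustrative model under assumed names, not the actual `AddAppConn()` signature:

```go
package main

import (
	"errors"
	"fmt"
)

// appConn is a hypothetical stand-in for the reconciler's map entry.
type appConn struct {
	displayName string
	deleted     bool // pending async teardown from a prior DelAppConn()
}

type reconciler struct {
	apps map[string]*appConn
}

func (r *reconciler) AddAppConn(appID, displayName string) error {
	if existing, ok := r.apps[appID]; ok {
		if !existing.deleted {
			return errors.New("already connected")
		}
		// Entry is pending deletion: drop it now and let the next
		// reconcile() carry any unfinished teardown forward.
		delete(r.apps, appID)
	}
	// DisplayName uniqueness check: skip entries that are on their
	// way out, so an outgoing app cannot block an incoming one.
	for id, app := range r.apps {
		if app.displayName == displayName && !app.deleted {
			return fmt.Errorf("app %s already uses name %q", id, displayName)
		}
	}
	r.apps[appID] = &appConn{displayName: displayName}
	return nil
}

func main() {
	r := &reconciler{apps: map[string]*appConn{
		"uuid-1": {displayName: "myapp", deleted: true}, // teardown pending
	}}
	// Re-activation of the same app now succeeds instead of erroring.
	fmt.Println(r.AddAppConn("uuid-1", "myapp"))
}
```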

Additionally, fix doInactivateAppNetwork() to always set status.Activated = false, even when DelAppConn() returns an error. Previously it returned early, leaving status.Activated = true while the reconciler considered the app removed. This mismatch caused the next handleAppNetworkModify to take the update-activated-app path instead of the activate path, compounding the failure.
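The state-consistency part of the fix amounts to clearing the flag unconditionally rather than returning early. A minimal sketch, with illustrative names in place of the real zedrouter types:

```go
package main

import (
	"errors"
	"fmt"
)

// appNetworkStatus is a simplified stand-in for zedrouter's status type.
type appNetworkStatus struct {
	Activated bool
	Error     string
}

// delAppConn simulates the reconciler call, which may fail.
func delAppConn(fail bool) error {
	if fail {
		return errors.New("reconciler error")
	}
	return nil
}

// Before the fix: an early return on error left Activated == true while
// the reconciler already considered the app removed.
// After the fix: Activated is cleared unconditionally, keeping
// zedrouter's view consistent with the reconciler's.
func doInactivateAppNetwork(status *appNetworkStatus, failDel bool) {
	if err := delAppConn(failDel); err != nil {
		status.Error = err.Error() // record the error, but do not return early
	}
	status.Activated = false
}

func main() {
	st := &appNetworkStatus{Activated: true}
	doInactivateAppNetwork(st, true) // DelAppConn fails
	fmt.Println(st.Activated)        // false: state stays consistent
}
```

With the flag cleared, the next handleAppNetworkModify takes the activate path rather than the update-activated-app path, so the recovery no longer compounds.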

PR dependencies

none

Changelog notes

None

Test verification instructions

Steps to reproduce

Preconditions

  • A running EVE node with an application deployed and its network activated
  • The application's container creation must have failed (e.g. cgroup mount error, image pull failure, OCI runtime error) while the network stayed activated — zedmanager intentionally keeps the network ready for retry on domain errors

Trigger

  1. Push an app config update (or purge) from the controller
  2. zedmanager initiates purge: sets PurgeInprogress = BringDown, publishes AppNetworkConfig(Activate=false)
  3. zedmanager transitions to PurgeInprogress = BringUp, publishes AppNetworkConfig(Activate=true)

Expected result

Network deactivates, then re-activates successfully. App proceeds with the new deployment.

Actual result

doActivateAppNetwork fails with:

doActivateAppNetwork: failed to activate application network: NI Reconciler: App <uuid> is already connected

The app is stuck in error state. Subsequent retries hit the same error until the node is rebooted (which clears the reconciler's in-memory state).

Why it's not always reproducible

The bug requires at least one VIF teardown operation to be async during DelAppConn. Whether that happens depends on the network configuration (bridge type, namespace state, number of VIFs). When all teardown is synchronous, the entry is removed from the map immediately and the next AddAppConn succeeds.


Note that this is not a simple "click here, see bug" scenario: it requires a prior container-creation failure plus a specific internal timing condition. Calling that out explicitly saves reviewers from trying to reproduce it naively and concluding the fix is unnecessary.

PR Backports

- 16.0-stable: To be backported.
- 14.5-stable: To be backported.
- 13.4-stable: To be backported.

Please also add the stable label to any PR that should be backported into a stable branch.

Checklist

  • I've provided a proper description
  • I've added the proper documentation
  • I've tested my PR on amd64 device
  • I've tested my PR on arm64 device
  • I've written the test verification instructions
  • I've set the proper labels to this PR

And last but not least:

  • I've checked the boxes above, or I've provided a good reason why I didn't
    check them.

Please check the boxes above after submitting the PR in interactive mode.

AddAppConn() rejects an app whose UUID is already present in r.apps,
but it does not check whether that entry is marked deleted (i.e.
pending async teardown from a prior DelAppConn()). This causes
app re-activation to fail during purge when the async cleanup has
not yet completed.

The call sequence that triggers the bug:

  zedmanager publishes AppNetworkConfig(Activate=false)
    -> handleAppNetworkModify
      -> doInactivateAppNetwork
        -> DelAppConn
          -> r.apps[appID].deleted = true
          -> reconcile()
            -> updateAppConnStatus: async ops still in progress,
               entry stays in map

  zedmanager publishes AppNetworkConfig(Activate=true)
    -> handleAppNetworkModify
      -> doActivateAppNetwork
        -> AddAppConn
          -> r.apps[appID] exists => "already connected"  <-- BUG

Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
@rucoder rucoder added bug Something isn't working stable Should be backported to stable release(s) labels Feb 18, 2026
@rucoder rucoder requested a review from rene February 18, 2026 22:42
@codecov

codecov Bot commented Feb 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 29.49%. Comparing base (2281599) to head (47750ec).
⚠️ Report is 298 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5620      +/-   ##
==========================================
+ Coverage   19.52%   29.49%   +9.96%     
==========================================
  Files          19       18       -1     
  Lines        3021     2417     -604     
==========================================
+ Hits          590      713     +123     
+ Misses       2310     1552     -758     
- Partials      121      152      +31     


@rucoder
Contributor Author

rucoder commented Feb 19, 2026

@rene could you approve to run tests?

Contributor

@rene rene left a comment


Kicking off tests, but I want @milan-zededa to give the final approval. In any case, LGTM...

@rucoder
Contributor Author

rucoder commented Feb 19, 2026

Kicking off tests, but I want @milan-zededa to give the final approval. In any case, LGTM...

Sure! The reconciler is complex, so I might have overlooked something. Unfortunately I couldn't add tests, because the test infrastructure assumes these async ops always succeed. So @milan-zededa, we are waiting for your verdict :)

@eriknordmark
Contributor

@milan-zededa can you take a look?

Contributor

@milan-zededa milan-zededa left a comment


Thanks for fixing this!

@milan-zededa milan-zededa merged commit dc36bd4 into lf-edge:master Feb 21, 2026
51 of 52 checks passed

Labels

bug Something isn't working stable Should be backported to stable release(s)
