pillar: nireconciler: fix "already connected" error on app re-activation#5620
Conversation
AddAppConn() rejects an app whose UUID is already present in r.apps,
but it does not check whether that entry is marked deleted (i.e.
pending async teardown from a prior DelAppConn()). This causes
app re-activation to fail during purge when the async cleanup has
not yet completed.
The call sequence that triggers the bug:
zedmanager publishes AppNetworkConfig(Activate=false)
-> handleAppNetworkModify
-> doInactivateAppNetwork
-> DelAppConn
-> r.apps[appID].deleted = true
-> reconcile()
-> updateAppConnStatus: async ops still in progress,
entry stays in map
zedmanager publishes AppNetworkConfig(Activate=true)
-> handleAppNetworkModify
-> doActivateAppNetwork
-> AddAppConn
-> r.apps[appID] exists => "already connected" <-- BUG
The entry lingers because updateAppConnStatus() only calls
delete(r.apps) once all async operations (VIF teardown, namespace
cleanup) have finished. When the BringDown and BringUp pubsub events
are processed back-to-back, the async window is too short for
completion.
Fix AddAppConn() to check the deleted flag on existing entries.
When an entry is pending deletion, remove it immediately and proceed
with the new connection. The subsequent reconcile() call performs a
full current-vs-intended state reconciliation, so any incomplete
teardown from the old connection is carried forward and completed
alongside the new setup — no resources are leaked.
Also skip deleted entries in the DisplayName uniqueness check to
avoid false conflicts with outgoing apps.
Additionally, fix doInactivateAppNetwork() to always set
status.Activated = false, even when DelAppConn() returns an error.
Previously it returned early, leaving status.Activated = true while
the reconciler considered the app removed. This mismatch caused the
next handleAppNetworkModify to take the update-activated-app path
instead of the activate path, compounding the failure.
Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##           master    #5620      +/- ##
==========================================
+ Coverage   19.52%   29.49%   +9.96%
==========================================
  Files          19       18       -1
  Lines        3021     2417     -604
==========================================
+ Hits          590      713     +123
+ Misses       2310     1552     -758
- Partials      121      152      +31
@rene could you approve to run tests?
rene
left a comment
Kicking off tests, but I want @milan-zededa to give the final approval. In any case, LGTM...
Sure! The reconciler is complex, I might have overlooked something. Unfortunately I couldn't add tests because the infrastructure assumes these async ops always succeed. So @milan-zededa, we are waiting for your verdict :)
@milan-zededa can you take a look?
milan-zededa
left a comment
Thanks for fixing this!
Description

The call sequence that triggers the bug:

```mermaid
sequenceDiagram
    participant ZM as zedmanager
    participant ZR as zedrouter
    participant NR as NI Reconciler
    Note over ZM,NR: Purge BringDown
    ZM->>ZR: AppNetworkConfig(Activate=false)
    ZR->>NR: DelAppConn(appID)
    NR->>NR: r.apps[appID].deleted = true
    NR->>NR: reconcile() (async ops pending)
    NR-->>ZR: return (entry stays in map)
    Note over ZM,NR: Purge BringUp
    ZM->>ZR: AppNetworkConfig(Activate=true)
    ZR->>NR: AddAppConn(appID)
    NR->>NR: r.apps[appID] exists!
    alt Before fix
        NR-->>ZR: ERROR "already connected"
    else After fix
        NR->>NR: existing.deleted? → delete stale entry
        NR->>NR: proceed with new connection
        NR-->>ZR: OK
    end
```
PR dependencies
none
Changelog notes
None
Test verification instructions

Steps to reproduce

Preconditions

Trigger
1. PurgeInprogress = BringDown, zedmanager publishes AppNetworkConfig(Activate=false)
2. PurgeInprogress = BringUp, zedmanager publishes AppNetworkConfig(Activate=true)

Expected result
Network deactivates, then re-activates successfully. App proceeds with the new deployment.

Actual result
doActivateAppNetwork fails with the "already connected" error. The app is stuck in error state. Subsequent retries hit the same error until the node is rebooted (which clears the reconciler's in-memory state).
Why it's not always reproducible
The bug requires at least one VIF teardown operation to be async during DelAppConn. Whether that happens depends on the network configuration (bridge type, namespace state, number of VIFs). When all teardown is synchronous, the entry is removed from the map immediately and the next AddAppConn succeeds.

The key thing is being honest that it's not a simple "click here, see bug" scenario: it requires a prior failure plus a specific internal timing condition. Calling that out explicitly saves reviewers from trying to reproduce it naively and concluding the fix is unnecessary.
PR Backports

For PRs that should be backported into any stable branch, please add the label stable.

Checklist

Please check the boxes above after submitting the PR in interactive mode.