Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EIP: Fix CPIP updates on cloud environments #3942

Merged
merged 2 commits into from Oct 5, 2023

Conversation

tssurya
Copy link
Member

@tssurya tssurya commented Sep 29, 2023

- What this PR does and why is it needed
The EIP controller being event driven means we don't
have control over the order of events that we get.

There is a bug around the ordering of event processing currently:

  • When a node goes offline,
  • cluster-manager calls deleteEgressNode based on healthchecks and then patches the EIP to reflect that on the status by deleting the node immediately from the status
  • this causes zone-controller to take action and do tear down of OVN setup for this EIP in the respective zone
  • the CPIP also reacts and starts deleting the CPIP for this EIP (On clouds like azure this is very slow)
  • node now comes back online - no control on the amount of time its down
  • cluster-manager is super fast in IC as its asynchronous and tries to re-add this egressNode
  • CM calls addEgressNode -> reconcileEIP -> executeCloudPrivateIPConfigOps -> and then we return with CloudPrivateIPConfig: %s already assigned to node: %s", cloudPrivateIPConfigName, cloudPrivateIPConfig.Spec.Node because the deletion for CPIP takes time and its still in CloudResponsePending Deleting IP addres phase.
W0925 18:55:30.984193       1 egressip_healthcheck.go:145] Health checking using insecure connection
I0925 18:55:30.987283       1 egressip_healthcheck.go:168] Connected to huirwang-0925c-8tdq2-worker-westus-85ljx (10.131.0.2:9107)
I0925 18:55:30.987309       1 egressip_controller.go:631] Node: huirwang-0925c-8tdq2-worker-westus-85ljx is detected as reachable and ready again, adding it to egress assignment
I0925 18:55:30.989470       1 egressip_controller.go:1320] Successful assignment of egress IP: 10.0.128.100 on node: &{egressIPConfig:0xc000cee000 mgmtIPs:[[10 131 0 2]] allocations:map[] healthClient:0xc000c3b7d0 isReady:true isReachable:true isEgressAssignable:true name:huirwang-0925c-8tdq2-worker-westus-85ljx}
I0925 18:55:30.989510       1 egressip_controller.go:228] CloudPrivateIPConfig: 10.0.128.100 already assigned to node: huirwang-0925c-8tdq2-worker-westus-85ljx
  • Now we get the request for CPIP delete completed and the object goes away but the add is also considered successful as per above set of events which is bad.
  • This is detrimental since setup won't be done in zone-controller without getting that add patch again.

This PR fixes this by firstly fixing the CPIP config add logic to explicitly fail to add something if the object is currently being deleted. This is the same thing we do for updates. The next thing this PR does is to fix the retries around node's becoming online and offline. If the node add/delete fails, we do not retry, I think we should.

- Special notes for reviewers
Currently I added retry only for when node comes back online, if we want we can also add retries for when node goes offline?
Long term solution: Become level driven

- How to verify it
Tested on cloud env, we are blind here u/s. Not easy to reproduce the ordering of events via a test.

- Description for the changelog

Copy the same logic we use for update

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side effect of this is that:

I0929 11:13:37.416488       1 obj_retry.go:370] Retry add failed for *factory.egressNode ci-ln-tqh690b-1d09d-qbgbr-worker-westus-92czc, will try again later: synthetic update for EgressIP: egressip-3 failed, err: cloud update request failed, CloudPrivateIPConfig: 10.0.128.100 is being deleted
I0929 11:14:07.412775       1 obj_retry.go:296] Retry object setup: *factory.egressNode ci-ln-tqh690b-1d09d-qbgbr-worker-westus-92czc
I0929 11:14:07.412828       1 obj_retry.go:358] Adding new object: *factory.egressNode ci-ln-tqh690b-1d09d-qbgbr-worker-westus-92czc
I0929 11:14:07.415488       1 egressip_controller.go:227] SURYA ci-ln-tqh690b-1d09d-qbgbr-worker-westus-92czc//<nil>
I0929 11:14:07.415551       1 obj_retry.go:370] Retry add failed for *factory.egressNode ci-ln-tqh690b-1d09d-qbgbr-worker-westus-92czc, will try again later: synthetic update for EgressIP: egressip-3 failed, err: cloud update request failed, CloudPrivateIPConfig: 10.0.128.100 is being deleted
I0929 11:14:37.412824       1 obj_retry.go:296] Retry object setup: *factory.egressNode ci-ln-tqh690b-1d09d-qbgbr-worker-westus-92czc
I0929 11:14:37.412878       1 obj_retry.go:358] Adding new object: *factory.egressNode ci-ln-tqh690b-1d09d-qbgbr-worker-westus-92czc
I0929 11:14:37.418902       1 egressip_controller.go:227] SURYA ci-ln-tqh690b-1d09d-qbgbr-worker-westus-92czc//<nil>
I0929 11:14:37.418941       1 obj_retry.go:370] Retry add failed for *factory.egressNode ci-ln-tqh690b-1d09d-qbgbr-worker-westus-92czc, will try again later: synthetic update for EgressIP: egressip-3 failed, err: cloud update request failed, CloudPrivateIPConfig: 10.0.128.100 is being deleted

will happen till CPIP get's recreated and then we are happy:

I0929 11:15:13.854051       1 egressip_controller.go:133] Patching status on EgressIP egressip-3: [{ci-ln-tqh690b-1d09d-qbgbr-worker-westus-92czc 10.0.128.100}]
I0929 11:15:13.865805       1 egressip_controller.go:1621] SURYA 10.0.128.100/[]/{ci-ln-tqh690b-1d09d-qbgbr-worker-westus-92czc 10.0.128.100}/egressip-3/[{ci-ln-tqh690b-1d09d-qbgbr-worker-westus-92czc 10.0.128.100}]

This commit adds the retry framework for when
a node comes back online, we should retry
the nodeAdd till it succeeds without giving
up on the first attempt.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
@coveralls
Copy link

Coverage Status

coverage: 49.841% (-0.02%) from 49.861% when pulling daafa09 on tssurya:eip-cpip-fix-azure into ac329f8 on ovn-org:master.

@tssurya tssurya changed the title Fix EIP CPIP updates on cloud environments EIP: Fix CPIP updates on cloud environments Sep 29, 2023
@trozet trozet merged commit 56a9e30 into ovn-org:master Oct 5, 2023
28 of 29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants