
ovn: fix reserve joinSwitch LRP IPs #2331

Closed · wants to merge 2 commits into ovn-org:master from Reamer:fix_reserve

Conversation

@Reamer (Contributor) commented Jul 12, 2021

- What this PR does and why is it needed
This PR fixes the reservation of join-switch LRP IPs and is needed to keep egressIPs working even if the active OVN master changes due to a node failure or an OVN update.

RedHat-Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1973215

- Special notes for reviewers
namespace.go: When using the HostNetworkNamespace feature, the namespace synchronisation code triggers the ensureJoinLRPIPs method, which returns a valid IP from the join subnet without considering a possibly already active IP address. As a result, the gwLRPIP changes every time OVN is restarted, which breaks features such as egressIPs.

gateway: During startup, getJoinLRPAddresses validates the active joinLRPAddress against the node's subnet, but because of the early startup state the node's subnets are still empty; instead we should validate against the join switch's subnets, which are already initialised.

- How to verify it
I have tested it manually in my environment.
I know that these changes should be covered by a unit test or an e2e test. Unfortunately, my Go expertise is limited and I don't know how to test behaviour as complex as restarting the application. I hope an experienced maintainer can add tests.

- Description for the changelog
Fix the reservation of JoinLRPAddresses

@Reamer (Contributor, Author) commented Jul 13, 2021

Added test correction.

@Reamer (Contributor, Author) commented Jul 14, 2021

Can someone re-trigger the CI jobs?

@Reamer (Contributor, Author) commented Jul 15, 2021

/retest

1 similar comment

@Reamer (Contributor, Author) commented Jul 15, 2021

/retest

@Reamer (Contributor, Author) commented Jul 15, 2021

Does anyone have any idea why my retests are being cancelled?

@Reamer (Contributor, Author) commented Jul 15, 2021

/retest

1 similar comment

@Reamer (Contributor, Author) commented Jul 15, 2021

/retest

@dcbw (Contributor) commented Jul 15, 2021

No clue; I restarted them.

@Reamer (Contributor, Author) commented Jul 16, 2021

/retest

@coveralls commented Jul 16, 2021

Coverage Status

Coverage decreased (-0.07%) to 51.744% when pulling b0cd10e on Reamer:fix_reserve into 2249afa on ovn-org:master.

@Reamer (Contributor, Author) commented Jul 16, 2021

I ran the tests in my fork and they were not cancelled.
Have a look at: https://github.com/Reamer/ovn-kubernetes-origin/actions/runs/1026505700

@Reamer (Contributor, Author) commented Jul 19, 2021

/retest

@Reamer (Contributor, Author) commented Jul 19, 2021

A review would be very nice.

@alexanderConstantinescu (Collaborator) left a comment

In general, I think this looks good. We don't currently have any e2e tests that perform disruptive actions like the one you mention (deleting ovnkube-master while tests are running), so I think this is fine.

@trozet: could you please have a look at this? I noticed the same issue locally on my machine, and I suspect this might have upgrade impacts.

(Resolved review thread on go-controller/pkg/ovn/namespace.go; outdated)
@Reamer (Contributor, Author) commented Aug 16, 2021

I rebased my changes onto the current master.

@girishmg (Member) commented

@Reamer with this change we now do:

				// Because createNamespaceAddrSetAllPods is called before syncNode, we need to
				// reserve its joinSwitch LRP IPs if they already exist.
				gwLRPIPs := oc.getJoinLRPAddresses(node.Name)
				_ = oc.joinSwIPManager.reserveJoinLRPIPs(node.Name, gwLRPIPs)

later, in syncNodes(), we again do the same thing for existing nodes:

		// For each existing node, reserve its joinSwitch LRP IPs if they already exist.
		gwLRPIPs := oc.getJoinLRPAddresses(node.Name)
		_ = oc.joinSwIPManager.reserveJoinLRPIPs(node.Name, gwLRPIPs)

since we have already reserved the IP, the second call will error out here:

			if cidr.Contains(ipnet.IP) {
				if _, ok = allocated[idx]; ok {
					err = fmt.Errorf("Error: attempt to reserve multiple IPs in the same IPAM instance")
					return err
				}

@Reamer (Contributor, Author) commented Aug 17, 2021

Hi @girishmg,
thanks for your review.
I don't think the error thrown on the second call causes problems, since errors from reserveJoinLRPIPs are currently ignored. I understand that it is still not nice to throw an error here.
I see the following possible solutions:

  1. Ignore the error, as is currently done.
  2. Compare the IPs in the cache with the ones to be reserved and skip the call in syncNodes if necessary.
  3. Check for the HostNetworkNamespace feature and skip both reading the currently running OVN state and the reservation.

What do you think?

@Reamer (Contributor, Author) commented Aug 19, 2021

@girishmg
With the help of @alexanderConstantinescu, a second reservation is no longer made. I would be grateful if you could look through the changes again.

@alexanderConstantinescu (Collaborator) left a comment

/lgtm

@Reamer (Contributor, Author) commented Aug 19, 2021

/retest

3 similar comments

@alexanderConstantinescu (Collaborator) commented

/retest

@Reamer (Contributor, Author) commented Aug 19, 2021

/retest

@Reamer (Contributor, Author) commented Aug 19, 2021

/retest

@Reamer (Contributor, Author) commented Aug 19, 2021

@alexanderConstantinescu
I see the following test error, but I don't think it is related to this PR.
https://pastebin.com/geZXxd4B

Do you see why the tests fail?

@alexanderConstantinescu (Collaborator) commented

> @alexanderConstantinescu
> I see the following test error, but I don't think it is related to this PR.
> https://pastebin.com/geZXxd4B
>
> Do you see why the tests fail?

I suspect this means you need to git fetch origin && git rebase -i origin/master. I suspect GitHub is not very explicit about this since you didn't have any merge conflicts. I might be wrong, though, so could you try that?

@Reamer (Contributor, Author) commented Aug 19, 2021

/retest

@alexanderConstantinescu (Collaborator) commented

I managed to work out what the issue is and filed: #2428

Sorry for that!

@alexanderConstantinescu (Collaborator) commented

/retest

1 similar comment

@alexanderConstantinescu (Collaborator) commented

/retest

New commits:

master.go: Use ensureJoinLRPIPs, which also checks the running DB.

logical_switch_manager: ensureJoinLRPIPs now also looks into the running DB and fills the cache on a hit.

Move getJoinLRPAddresses from gateway to logical_switch_manager

Signed-off-by: Philipp Dallig <philipp.dallig@gmail.com>
logical_switch_manager: During startup, getJoinLRPAddresses validates the active joinLRPAddress against the node's subnet, but because of the early startup state the node's subnets are still empty; instead we should validate against the join switch's subnets, which are already initialised.

Signed-off-by: Philipp Dallig <philipp.dallig@gmail.com>
@Reamer (Contributor, Author) commented Aug 20, 2021

I have done a git rebase.
#2428 is now included.

@alexanderConstantinescu (Collaborator) commented

@Reamer : I am going to merge #2434 if that is fine for you? I am not sure in which ways github is broken, but re-running it incessantly like this is not helping.

@Reamer (Contributor, Author) commented Aug 23, 2021

> @Reamer: I am going to merge #2434 if that is fine for you? I am not sure in which ways GitHub is broken, but re-running it incessantly like this is not helping.

Yes, I have no problem with that. It's about the code change. Who is named as the author or who opened the PR is not important to me.

@alexanderConstantinescu (Collaborator) commented

> @Reamer: I am going to merge #2434 if that is fine for you? I am not sure in which ways GitHub is broken, but re-running it incessantly like this is not helping.
>
> Yes, I have no problem with that. It's about the code change. Who is named as the author or who opened the PR is not important to me.

Great! I merged #2434, so I am closing this. Thanks for resolving this!

@Reamer (Contributor, Author) commented Aug 23, 2021

I would like to see this fixed in the OpenShift 4.7 and 4.8 branches as soon as possible. The bug is massively hindering me in setting up my production environment.

@Reamer deleted the fix_reserve branch on August 23, 2021.
@alexanderConstantinescu (Collaborator) commented

> I would like to see this fixed in the OpenShift 4.7 and 4.8 branches as soon as possible. The bug is massively hindering me in setting up my production environment.

It's coming. I am opening up a PR as we speak.
