
ovn: fix reserve joinSwitch LRP IPs #2331

Closed · wants to merge 2 commits into ovn-org:master from Reamer:fix_reserve

Conversation

@Reamer (Contributor) commented Jul 12, 2021

- What this PR does and why is it needed
This PR fixes the reservation of join-switch LRP IPs and is needed to keep egressIPs working even if the active OVN master changes due to a node failure or an OVN update.

RedHat-Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1973215

- Special notes for reviewers
namespace.go: When using the HostNetworkNamespace feature, the namespace synchronisation code triggers the ensureJoinLRPIPs method, which returns a valid IP from the join subnet without considering a possibly already active IP address. As a result, the gwLRPIP changes every time OVN is restarted, which breaks features such as egressIPs.

gateway: During startup, getJoinLRPAddresses validates the active joinLRPAddress against the node's subnet, but because of the early startup state the node's subnets are still empty; instead we should validate against the join switch's subnets, which are already initialised.

- How to verify it
I have tested it manually in my environment.
I know that these changes should be covered by a unit test or an e2e test. Unfortunately, my Go expertise is limited and I don't know how to test behaviour as complex as restarting the application. I hope an experienced maintainer can add tests.

- Description for the changelog
Fix the reservation of JoinLRPAddresses

@Reamer (Contributor, Author) commented Jul 13, 2021

Added test correction.

@Reamer (Contributor, Author) commented Jul 14, 2021

Can someone re-trigger the CI jobs?

@Reamer (Contributor, Author) commented Jul 15, 2021

/retest

1 similar comment

@Reamer (Contributor, Author) commented Jul 15, 2021

/retest

@Reamer (Contributor, Author) commented Jul 15, 2021

Does anyone have any idea why my retests are being cancelled?

@Reamer (Contributor, Author) commented Jul 15, 2021

/retest

1 similar comment

@Reamer (Contributor, Author) commented Jul 15, 2021

/retest

@dcbw (Contributor) commented Jul 15, 2021

No clue; I restarted them.

@Reamer (Contributor, Author) commented Jul 16, 2021

/retest

@coveralls commented Jul 16, 2021

Coverage Status

Coverage decreased (-0.07%) to 51.744% when pulling b0cd10e on Reamer:fix_reserve into 2249afa on ovn-org:master.

@Reamer (Contributor, Author) commented Jul 16, 2021

I ran the tests in my fork and they were not cancelled.
Have a look at: https://github.com/Reamer/ovn-kubernetes-origin/actions/runs/1026505700

@Reamer (Contributor, Author) commented Jul 19, 2021

/retest

@Reamer (Contributor, Author) commented Jul 19, 2021

A review would be very nice.

@alexanderConstantinescu (Collaborator) left a comment

In general, I think this looks good. We don't currently have any e2e tests that perform disruptive actions like the one you mention (deleting ovnkube-master while tests are running), so I think this is fine.

@trozet: could you please have a look at this? I noticed the same issue locally on my machine, and I suspect this might have upgrade impacts.

(Resolved review thread on go-controller/pkg/ovn/namespace.go; outdated)
@Reamer (Contributor, Author) commented Aug 16, 2021

I rebased my changes onto the current master.

@girishmg (Member) commented

@Reamer with this change we now do:

				// Because createNamespaceAddrSetAllPods is called before syncNode, we need to
				// reserve its joinSwitch LRP IPs if they already exist.
				gwLRPIPs := oc.getJoinLRPAddresses(node.Name)
				_ = oc.joinSwIPManager.reserveJoinLRPIPs(node.Name, gwLRPIPs)

later, in syncNodes(), we again do the same thing for existing nodes:

		// For each existing node, reserve its joinSwitch LRP IPs if they already exist.
		gwLRPIPs := oc.getJoinLRPAddresses(node.Name)
		_ = oc.joinSwIPManager.reserveJoinLRPIPs(node.Name, gwLRPIPs)

since we have already reserved the IP, the second call will error out here:

			if cidr.Contains(ipnet.IP) {
				if _, ok = allocated[idx]; ok {
					err = fmt.Errorf("Error: attempt to reserve multiple IPs in the same IPAM instance")
					return err
				}

@Reamer (Contributor, Author) commented Aug 17, 2021

Hi @girishmg,
thanks for your review.
I don't think the error thrown on the second call causes problems, since errors from reserveJoinLRPIPs are currently ignored. I understand that it is still not nice to throw an error here.
I see the following possible solutions:

  1. Ignore the error, as is currently done.
  2. Compare the IPs in the cache with the ones to be reserved and skip the call in syncNodes if necessary.
  3. Check for the HostNetworkNamespace feature and skip both reading the currently running OVN state and the reservation.

What do you think?

@Reamer (Contributor, Author) commented Aug 19, 2021

@girishmg
With the help of @alexanderConstantinescu, a second reservation is no longer made. I would be grateful if you could look through the changes again.

@alexanderConstantinescu (Collaborator) left a comment

/lgtm

@Reamer (Contributor, Author) commented Aug 19, 2021

/retest

3 similar comments

@alexanderConstantinescu (Collaborator) commented

/retest

@Reamer (Contributor, Author) commented Aug 19, 2021

/retest

@Reamer (Contributor, Author) commented Aug 19, 2021

/retest

@Reamer (Contributor, Author) commented Aug 19, 2021

@alexanderConstantinescu
I see the following test error, but I don't think it is related to this PR.
https://pastebin.com/geZXxd4B

Do you see why the tests fail?

@alexanderConstantinescu (Collaborator) commented

> @alexanderConstantinescu
> I see the following test error, but I don't think it is related to this PR.
> https://pastebin.com/geZXxd4B
>
> Do you see why the tests fail?

I suspect this means you need to git fetch origin && git rebase -i origin/master. I suspect GitHub is not very explicit about this since you didn't have any merge conflicts. I might be wrong, though, so could you try that?

@Reamer (Contributor, Author) commented Aug 19, 2021

/retest

@alexanderConstantinescu (Collaborator) commented

I managed to work out what the issue is and filed: #2428

Sorry for that!

@alexanderConstantinescu (Collaborator) commented

/retest

1 similar comment

@alexanderConstantinescu (Collaborator) commented

/retest

New commits:

master.go: Use ensureJoinLRPIPs, which also checks the running DB.

logical_switch_manager: ensureJoinLRPIPs now also looks into the running DB and fills the cache on a hit.

Move getJoinLRPAddresses from gateway to logical_switch_manager

Signed-off-by: Philipp Dallig <philipp.dallig@gmail.com>
logical_switch_manager: During startup, getJoinLRPAddresses validates the active joinLRPAddress against the node's subnet, but because of the early startup state the node's subnets are still empty; instead we should validate against the join switch's subnets, which are already initialised.

Signed-off-by: Philipp Dallig <philipp.dallig@gmail.com>
@Reamer (Contributor, Author) commented Aug 20, 2021

I have done a git rebase.
#2428 is now included.

@alexanderConstantinescu (Collaborator) commented

@Reamer : I am going to merge #2434 if that is fine for you? I am not sure in which ways github is broken, but re-running it incessantly like this is not helping.

@Reamer (Contributor, Author) commented Aug 23, 2021

> @Reamer: I am going to merge #2434 if that is fine for you? I am not sure in which ways GitHub is broken, but re-running it incessantly like this is not helping.

Yes, I have no problem with that. It's about the code change. Who is named as the author or who opened the PR is not important to me.

@alexanderConstantinescu (Collaborator) commented

> @Reamer: I am going to merge #2434 if that is fine for you? I am not sure in which ways GitHub is broken, but re-running it incessantly like this is not helping.
>
> Yes, I have no problem with that. It's about the code change. Who is named as the author or who opened the PR is not important to me.

Great! I merged #2434, so I am closing this. Thanks for resolving this!

@Reamer (Contributor, Author) commented Aug 23, 2021

I would like to see this fixed in the OpenShift 4.7 and 4.8 branches as soon as possible. The bug is massively hindering me in setting up my production environment.

@Reamer deleted the fix_reserve branch on August 23, 2021.
@alexanderConstantinescu (Collaborator) commented

> I would like to see this fixed in the OpenShift 4.7 and 4.8 branches as soon as possible. The bug is massively hindering me in setting up my production environment.

It's coming. I am opening up a PR as we speak.
