
change the name of host-addresses annotation to host-cidrs #3915

Merged
1 commit merged into ovn-org:master on Sep 26, 2023

Conversation

JacobTanenbaum (Contributor)

Commit 8847618bbf2feeb42c30375a6de6f91afe011d91 changed the host-addresses annotation to include a subnet mask. This causes upgrades to experience a network disruption, because masters and nodes expect different data in the host-addresses annotation.

Renaming the host-addresses annotation to host-cidrs eliminates the disruption: upgrading nodes will not remove the host-addresses annotation, and once upgraded, masters will do the right thing with the host-cidrs annotation.

- What this PR does and why is it needed

- Special notes for reviewers

- How to verify it

- Description for the changelog
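To make the incompatibility concrete, here is a minimal, illustrative sketch of why the two annotation formats cannot be read by the same parsing logic. The annotation values below are made-up examples and the parse calls are standard-library stand-ins, not the real ovn-kubernetes helpers; only the annotation keys come from this PR.

package main

// Illustration only: made-up annotation values and stdlib parsing, not the
// ovn-kubernetes helpers. The annotation keys match the ones discussed in this PR.
import (
	"encoding/json"
	"fmt"
	"net"
)

func main() {
	hostAddresses := `["192.168.1.10","fd00::10"]`   // old k8s.ovn.org/host-addresses: bare IPs
	hostCIDRs := `["192.168.1.10/24","fd00::10/64"]` // new k8s.ovn.org/host-cidrs: IPs with masks

	var addrs []string
	_ = json.Unmarshal([]byte(hostAddresses), &addrs)
	for _, a := range addrs {
		// A master expecting bare IPs is fine with the old format,
		// but net.ParseIP returns nil if it is handed "ip/mask".
		fmt.Println("host-addresses entry:", net.ParseIP(a))
	}

	var cidrs []string
	_ = json.Unmarshal([]byte(hostCIDRs), &cidrs)
	for _, c := range cidrs {
		// A master expecting CIDRs is fine with the new format,
		// but net.ParseCIDR errors if it is handed a bare IP.
		ip, ipNet, err := net.ParseCIDR(c)
		fmt.Println("host-cidrs entry:", ip, ipNet, err)
	}
}

Keeping the old key untouched and introducing a new key means each component only ever reads the format it was built for, which is what removes the upgrade disruption.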

@martinkennelly (Member) left a comment


Some extra changes needed:

go-controller/pkg/node/node_ip_handler_linux.go:184:
go-controller/pkg/ovn/controller/services/services_controller.go:473
go-controller/pkg/ovn/controller/services/services_controller.go:486
go-controller/pkg/ovn/controller/services/node_tracker.go:254
go-controller/pkg/ovn/egressservices_test.go:1465
go-controller/pkg/ovn/ovn.go:359

Do you think we need to rename hostAddress below (go-controller/pkg/ovn/controller/services/services_controller.go)? I am on the fence - up to you:

type nodeInfo struct {
	// the node's Name
	name string
	// The list of physical IPs reported by the gatewayconf annotation
	l3gatewayAddresses []net.IP
	// The list of physical IPs and subnet masks the node has, as reported by the host-cidrs annotation
	hostAddresses []net.IP

go-controller/pkg/ovn/controller/services/services_controller.go:473
go-controller/pkg/ovn/controller/services/services_controller.go:486
go-controller/pkg/ovn/controller/services/node_tracker.go:254
go-controller/pkg/ovn/egressservices_test.go:1465
go-controller/pkg/ovn/ovn.go:359
go-controller/pkg/util/node_annotations.go:719
go-controller/pkg/util/node_annotations.go:726
go-controller/pkg/util/node_annotations.go:735
go-controller/pkg/util/node_annotations.go:742
go-controller/pkg/util/node_annotations.go:775
go-controller/pkg/clustermanager/egressip_controller.go:1395

// does not contain EgressIPs that are assigned to interfaces.
for _, node := range nodes {
nodeHostAddressesSet, err := util.ParseNodeHostAddressesDropNetMask(node)
nodeHostAddressesSet, err := util.ParseNodeHostCIDRSDropNetMask(node)
Member

nit, but shouldn't the s in CIDRS be lowercase here and in the rest of the vars? WDYT?

nodeHostAddressesSet, err := util.ParseNodeHostCIDRsDropNetMask(node)

Contributor Author

I like that; I did it.
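As a side note on what the renamed helper does conceptually, here is a minimal sketch of "parse CIDRs, drop the net mask". The helper name util.ParseNodeHostCIDRsDropNetMask comes from this PR; the body below is illustrative only, not the actual implementation.

package main

// Sketch only: illustrates the "drop net mask" idea behind
// util.ParseNodeHostCIDRsDropNetMask. Not the real ovn-kubernetes code.
import (
	"fmt"
	"net"

	"k8s.io/apimachinery/pkg/util/sets"
)

func dropNetMask(cidrs []string) (sets.String, error) {
	out := sets.NewString()
	for _, c := range cidrs {
		ip, _, err := net.ParseCIDR(c)
		if err != nil {
			return nil, fmt.Errorf("failed to parse %q: %v", c, err)
		}
		out.Insert(ip.String()) // keep only the address, discard the mask
	}
	return out, nil
}

func main() {
	s, err := dropNetMask([]string{"192.168.1.10/24", "fd00::10/64"})
	fmt.Println(s.List(), err) // [192.168.1.10 fd00::10] <nil>
}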

test/e2e/service.go (resolved)
@@ -307,7 +307,7 @@ func (c *addressManager) doesNodeHostAddressesMatch() bool {
}
// check to see if ips on the node differ from what we stored
// in host-address annotation
Member

change this to host-cidrs

Contributor Author

changed it to doNodeHostCIDRsMatch
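For readers following along, a rough sketch of what a check like doNodeHostCIDRsMatch boils down to: compare the node's current set of CIDRs against what the k8s.ovn.org/host-cidrs annotation holds. The function below is hypothetical, not the addressManager code.

package main

// Hypothetical sketch of an annotation-vs-reality comparison; not the real
// addressManager implementation.
import (
	"encoding/json"
	"fmt"

	"k8s.io/apimachinery/pkg/util/sets"
)

// cidrsFromAnnotation unpacks the JSON list stored in the node annotation.
func cidrsFromAnnotation(val string) (sets.String, error) {
	var cidrs []string
	if err := json.Unmarshal([]byte(val), &cidrs); err != nil {
		return nil, err
	}
	return sets.NewString(cidrs...), nil
}

func main() {
	current := sets.NewString("192.168.1.10/24", "fd00::10/64") // what the node sees right now
	stored, _ := cidrsFromAnnotation(`["192.168.1.10/24","fd00::10/64"]`)
	// Equal sets: the annotation is up to date; otherwise it needs to be rewritten.
	fmt.Println("annotation matches host CIDRs:", current.Equal(stored))
}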

@JacobTanenbaum (Contributor Author)

/retest

@ovn-robot (Collaborator)

Oops, something went wrong:

Must have admin rights to Repository.

@coveralls

coveralls commented Sep 21, 2023

Coverage Status

coverage: 50.167% (-0.04%) from 50.203% when pulling 8decf7c on JacobTanenbaum:disruption into c463cea on ovn-org:master.

@martinkennelly (Member)

Thanks Jacob - some extra changes:

pkg/ovn/controller/services/node_tracker.go:134
pkg/clustermanager/egressip_controller.go +687

@martinkennelly (Member)

martinkennelly commented Sep 22, 2023

pkg/node/controllers/egressip/egressip_test.go:900
pkg/node/node_ip_handler_linux.go:243
pkg/node/node_ip_handler_linux.go:288
pkg/node/node_ip_handler_linux.go:292
pkg/node/node_ip_handler_linux.go:339
pkg/ovn/controller/services/node_tracker.go:62
pkg/ovn/controller/services/node_tracker.go:252
pkg/ovn/controller/services/node_tracker.go:261
pkg/ovn/gateway_init.go:563
pkg/ovn/master.go:303
pkg/ovn/default_network_controller.go:892
pkg/ovn/egressip_test.go:158
pkg/ovn/egressip_test.go:190
pkg/ovn/hybrid_test.go:336
pkg/ovn/hybrid_test.go:627
pkg/ovn/hybrid_test.go:842
pkg/ovn/hybrid_test.go:1125
pkg/ovn/master_test.go:941
pkg/ovn/ovn.go:355
pkg/ovn/ovn.go:376
pkg/ovn/ovn.go:377
pkg/clustermanager/egressip_controller.go:1394
pkg/clustermanager/egressip_event_handler.go:110
pkg/clustermanager/egressip_event_handler.go:125

@martinkennelly (Member)

CP failures are unrelated to this PR and look like an upstream issue - retesting:

Error: 275.9 internal/k8s/k8s.go:45:2: k8s.io/kubernetes@v1.26.0: read "https:/proxy.golang.org/@v/v1.26.0.zip": read tcp 172.17.0.3:39578->142.250.115.207:443: read: connection reset by peer

@trozet (Contributor) left a comment


Double-check the places where you changed hostAddrs to hostCIDRs. Not all of them should be named CIDRs, as some are truly just IP addresses without a mask.

go-controller/pkg/clustermanager/egressip_controller.go (outdated, resolved)
go-controller/pkg/ovn/controller/services/node_tracker.go (outdated, resolved)
go-controller/pkg/ovn/ovn.go (outdated, resolved)
@trozet (Contributor)

trozet commented Sep 25, 2023

I took the liberty of addressing the comments and pushing a new revision, since @JacobTanenbaum is on PTO and this is urgent.

@trozet (Contributor)

trozet commented Sep 25, 2023

@qinqon fyi this flake keeps happening on several PRs:

[Fail] Kubevirt Virtual Machines when live migrated [It] with pre-copy should keep connectivity
/home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/kubevirt.go:619

[Fail] Kubevirt Virtual Machines when live migrated [It] with pre-copy should keep connectivity
/home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/kubevirt.go:619

https://github.com/ovn-org/ovn-kubernetes/actions/runs/6304894472/job/17118087308?pr=3915

commit: 8847618 changed the
host-addresses annotation to include a subnet mask. This causes updates
to experience a network disruption as masters and nodes expect different
data to be in the host-addresses annotation.

Changing the name of the host-addresses annotation to host-cidrs will
eliminate the network disruption because nodes updating will not remove
the host-addresses annotation and once upgraded masters will do the
right thing with the host-cidrs annotation

Co-authored-by: Tim Rozet <trozet@redhat.com>
Signed-off-by: Jacob Tanenbaum <jtanenba@redhat.com>
@qinqon (Contributor)

qinqon commented Sep 26, 2023

@qinqon fyi this flake keeps happening on several PRs:

[Fail] Kubevirt Virtual Machines when live migrated [It] with pre-copy should keep connectivity /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/kubevirt.go:619

[Fail] Kubevirt Virtual Machines when live migrated [It] with pre-copy should keep connectivity /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/kubevirt.go:619

https://github.com/ovn-org/ovn-kubernetes/actions/runs/6304894472/job/17118087308?pr=3915

We are tracking this in #3902. I am trying to reproduce it, but more or less I think the kubevirt pod that handles the webhook is in a bad state.

@tssurya (Member) left a comment


/lgtm, let's bring this downstream and ensure 12 mins becomes 0 mins for stable upgrades :D
Some questions inline just to clarify some parts.
Thank you @JacobTanenbaum @trozet

// does not contain EgressIPs that are assigned to interfaces.
for _, node := range nodes {
nodeHostAddressesSet, err := util.ParseNodeHostAddressesDropNetMask(node)
nodeHostAddrsSet, err := util.ParseNodeHostCIDRsDropNetMask(node)
Member

So when we upgrade from an older version, since the nodes roll out first, this new annotation will already be set by the time the masters roll out, and thus there won't be any lag/delay in processing ovnManagedEIPs existing in the cluster, right?

Member

The node IP handler will update/sync this annotation under 2 conditions:

  • it gets triggered by an address on the node being created/altered (we subscribe to events)
  • a 30s resync period

so the max delay can be ~30s
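To make the two triggers concrete, here is a minimal sketch of such a loop, with hypothetical names. The real handler subscribes to netlink address events; a plain channel stands in for that subscription here.

package main

// Hypothetical sketch of the two sync triggers: address-change events plus a
// periodic 30s resync. A plain channel stands in for the netlink subscription.
import (
	"fmt"
	"time"
)

func runAnnotationSync(addrEvents <-chan string, stop <-chan struct{}, sync func()) {
	ticker := time.NewTicker(30 * time.Second) // periodic resync bounds staleness to ~30s
	defer ticker.Stop()
	for {
		select {
		case ev := <-addrEvents: // an address on the node was created/altered
			fmt.Println("address event:", ev)
			sync()
		case <-ticker.C: // fallback resync even if no events arrive
			sync()
		case <-stop:
			return
		}
	}
}

func main() {
	events := make(chan string, 1)
	stop := make(chan struct{})
	go runAnnotationSync(events, stop, func() { fmt.Println("syncing host-cidrs annotation") })
	events <- "192.0.2.5/24 added"
	time.Sleep(100 * time.Millisecond) // let the goroutine process the event
	close(stop)
}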

Member

It does do a sync inline when generating a new address manager, so we are good, I think:

func newAddressManagerInternal(nodeName string, k kube.Interface, config *managementPortConfig, watchFactory factory.NodeWatchFactory, useNetlink bool) *addressManager {
	mgr := &addressManager{
		nodeName:       nodeName,
		watchFactory:   watchFactory,
		addresses:      sets.NewString(),
		mgmtPortConfig: config,
		OnChanged:      func() {},
		useNetlink:     useNetlink,
	}
	mgr.nodeAnnotator = kube.NewNodeAnnotator(k, nodeName)
	mgr.sync() // <--- HERE

@@ -220,7 +220,7 @@ func (c *addressManager) handleNodePrimaryAddrChange() {
}

// updateNodeAddressAnnotations updates all relevant annotations for the node including
// k8s.ovn.org/host-addresses, k8s.ovn.org/node-primary-ifaddr, k8s.ovn.org/l3-gateway-config.
Member

Will the host-addresses annotation live on forever? Or will it naturally vanish during upgrades?

Member

It should live on forever until we clean it up in the future, when there's no longer an upgrade path from before IC.

@@ -560,16 +560,16 @@ func (oc *DefaultNetworkController) addExternalSwitch(prefix, interfaceID, nodeN
return nil
}

func (oc *DefaultNetworkController) addPolicyBasedRoutes(nodeName, mgmtPortIP string, hostIfAddr *net.IPNet, otherHostAddrs []string) error {
func (oc *DefaultNetworkController) addPolicyBasedRoutes(nodeName, mgmtPortIP string, hostIfCIDR *net.IPNet, otherHostAddrs []string) error {
Member

NOTE: This code path is not tested at all, which is why this bug was not caught when the match of the policy changed from hostIP to hostIP/16. No code should be able to modify core routing policies on ovn_cluster_router and get away with it; tests should complain.

matchstr2 := fmt.Sprintf(`inport == "rtos-%s" && ip4.dst == nodePhysicalIP /* %s */`, nodeName, nodeName)
matchstr3 := fmt.Sprintf("ip4.src == source && ip4.dst == nodePhysicalIP")
matchstr6 := fmt.Sprintf("ip4.src == NO DELETE && ip4.dst == nodePhysicalIP /* %s-no */", nodeName)

All current unit tests are playing with dummy match expressions, trying to leverage the comment on the actual policy:

We need unit test coverage of the match in the OVN DB libovsdb test framework, in gateway_test.go and gateway_init.go, to prevent this from happening in the future. I will not block this PR on that, but I will open an issue for it.

Member

Hmm, not true, we do have some tests here: addNodeLogicalFlows. However, for non-IC setups these should have clearly failed - we got lucky with upgrades for IC only because the DBs are wiped, so non-IC -> non-IC should have been stuck forever, but it wasn't; unsure why.
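A sketch of the kind of coverage being asked for here: assert the exact policy match built from the node's bare physical IP, rather than a dummy expression, so a regression to "hostIP/16" fails the test. The values and the expected string below are hypothetical; real coverage would live in the libovsdb-backed gateway tests.

package main

// Hypothetical assertion sketch, not the ovn-kubernetes test code.
import "fmt"

func main() {
	// Stand-in for the match string read back from the ovn_cluster_router policy in the NB DB.
	actualMatch := `inport == "rtos-node1" && ip4.dst == 172.18.0.2 /* node1 */`
	// The expectation is built from the node's bare physical IP, never "ip/16".
	expected := fmt.Sprintf(`inport == "rtos-%s" && ip4.dst == %s /* %s */`, "node1", "172.18.0.2", "node1")
	if actualMatch != expected {
		panic(fmt.Sprintf("policy match regressed: got %q want %q", actualMatch, expected))
	}
	fmt.Println("policy match OK")
}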

@@ -25,7 +25,7 @@ type checkNodeAnnot func(v annotationChange, nodeName string) error

// commonNodeAnnotationChecks holds annotations allowed for ovnkube-node:<nodeName> users in non-IC and IC environments
var commonNodeAnnotationChecks = map[string]checkNodeAnnot{
util.OvnNodeHostAddresses: nil,
util.OVNNodeHostCIDRs: nil,
Member

I don't know the order of upgrades for the new node-identity component. My question is: is there a chance that this new component shows up alongside 4.13 ovnkube-nodes? Or will the new component show up only after all ovnkube-nodes are on 4.14? Basically, do we need to handle the presence of an old ovnkube-node that could potentially update the old annotation and fail because the webhook does not know about that annotation?
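One way to picture the concern (purely illustrative, not what this PR does): tolerate both annotation keys in the checks map for the upgrade window, so an old ovnkube-node writing the legacy key is not rejected by the webhook. The checkNodeAnnot signature mirrors the diff above; annotationChange and the key constants are local stand-ins for the real types in go-controller/pkg/util.

package main

// Hypothetical mitigation sketch for the upgrade-window question above; not
// part of this PR.
import "fmt"

type annotationChange struct{}
type checkNodeAnnot func(v annotationChange, nodeName string) error

const (
	ovnNodeHostAddresses = "k8s.ovn.org/host-addresses" // legacy key, still written by not-yet-upgraded nodes
	ovnNodeHostCIDRs     = "k8s.ovn.org/host-cidrs"     // key introduced by this PR
)

// Allow both keys until no pre-rename ovnkube-node can exist in the cluster.
var commonNodeAnnotationChecks = map[string]checkNodeAnnot{
	ovnNodeHostAddresses: nil,
	ovnNodeHostCIDRs:     nil,
}

func main() {
	for key := range commonNodeAnnotationChecks {
		fmt.Println("allowed node annotation:", key)
	}
}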

@trozet trozet merged commit 6487d84 into ovn-org:master Sep 26, 2023
28 checks passed
@openshift-merge-robot

Fix included in accepted release 4.15.0-0.nightly-2023-09-27-073353
