Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Egress IP health monitoring over GRPC #3100

Merged
merged 5 commits into from Sep 2, 2022

Conversation

flavio-fernandes
Copy link
Contributor

@flavio-fernandes flavio-fernandes commented Jul 28, 2022

This PR includes a new functionality to egress ip, so it can
use gRPC in order to determine the health of its egress nodes.

Info on the protobuf definition used for the probing is here:
https://github.com/grpc/grpc/blob/master/doc/health-checking.md

The requirement in order to use this method is that both
ovnkube-master and ovnkube-node pods are started with a
new flag: --egressip-node-healthcheck-port <TCP_PORT>.

The original behavior remains as a viable method, should
ovnkube-master pod get started without the new flag.
For sake of completeness, the original method is to
have the ovnkube-master dial to discard (TCP port 9) and
declare node up upon receiving a TCP reject answer.

Signed-off-by: Flavio Fernandes flaviof@redhat.com
Reported-at: https://issues.redhat.com/browse/SDN-3156

@bpickard22
Copy link
Contributor

bpickard22 commented Jul 28, 2022

/assign @bpickard22

@flavio-fernandes
Copy link
Contributor Author

/retest

@andreaskaris andreaskaris self-assigned this Jul 29, 2022
@andreaskaris
Copy link
Collaborator

/retest-failed

}
}

func checkEgressNodesReachabilityIterate(oc *Controller) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: best to view these changes with ignoring white-spacing option

@flavio-fernandes
Copy link
Contributor Author

/retest-failed

@flavio-fernandes flavio-fernandes changed the title WIP: Egress IP health monitoring over GRPC [WIP] Egress IP health monitoring over GRPC Jul 29, 2022
@andreaskaris
Copy link
Collaborator

/retest-failed

@coveralls
Copy link

coveralls commented Aug 1, 2022

Coverage Status

Coverage increased (+0.1%) to 52.504% when pulling 77ceb00 on flavio-fernandes:eip_health into 9906372 on ovn-org:master.

Copy link
Collaborator

@andreaskaris andreaskaris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Run this against a linter and address all issues:

[root@ovnkubernetes healthcheck]# golint
egressip_healthcheck.go:15:2: const ServiceEgressIpNode should be ServiceEgressIPNode
egressip_healthcheck.go:15:2: exported const ServiceEgressIpNode should have comment (or a comment on this block) or be unexported
egressip_healthcheck.go:36:2: don't use underscores in Go names; struct field node_mgmt_ip should be nodeMgmtIP
egressip_healthcheck.go:39:2: don't use underscores in Go names; struct field health_check_port should be healthCheckPort
egressip_healthcheck.go:42:1: exported function NewEgressIPHealthServer should have comment or be unexported
egressip_healthcheck.go:42:30: don't use underscores in Go names; func parameter node_mgmt_ip should be nodeMgmtIP
egressip_healthcheck.go:42:51: don't use underscores in Go names; func parameter health_check_port should be healthCheckPort
egressip_healthcheck.go:42:75: exported func NewEgressIPHealthServer returns unexported type *healthcheck.egressIPHealthServer, which can be annoying to use
egressip_healthcheck.go:81:6: exported type EgressIPHealthClient should have comment or be unexported
egressip_healthcheck.go:83:10: don't use underscores in Go names; interface method parameter dial_ctx should be dialCtx
egressip_healthcheck.go:83:54: don't use underscores in Go names; interface method parameter health_check_port should be healthCheckPort
egressip_healthcheck.go:85:8: don't use underscores in Go names; interface method parameter dial_ctx should be dialCtx
egressip_healthcheck.go:95:2: don't use underscores in Go names; struct field probe_failed should be probeFailed
egressip_healthcheck.go:98:1: exported function NewEgressIPHealthClient should have comment or be unexported
egressip_healthcheck.go:106:42: don't use underscores in Go names; method parameter dial_ctx should be dialCtx
egressip_healthcheck.go:106:86: don't use underscores in Go names; method parameter health_check_port should be healthCheckPort
egressip_healthcheck.go:109:6: don't use underscores in Go names; var node_addr should be nodeAddr
egressip_healthcheck.go:112:9: don't use underscores in Go names; range var node_mgmt_ip should be nodeMgmtIP
egressip_healthcheck.go:143:40: don't use underscores in Go names; method parameter dial_ctx should be dialCtx
egressip_healthcheck.go:158:3: don't use underscores in Go names; var prev_probe_failed should be prevProbeFailed

contrib/kind.sh Show resolved Hide resolved
go-controller/pkg/ovn/egressip.go Outdated Show resolved Hide resolved
go-controller/pkg/config/config.go Show resolved Hide resolved
go-controller/pkg/node/node.go Outdated Show resolved Hide resolved
go-controller/pkg/ovn/healthcheck/egressip_healthcheck.go Outdated Show resolved Hide resolved
go-controller/pkg/ovn/healthcheck/egressip_healthcheck.go Outdated Show resolved Hide resolved
@flavio-fernandes
Copy link
Contributor Author

Run this against a linter and address all issues:

[root@ovnkubernetes healthcheck]# golint
egressip_healthcheck.go:15:2: const ServiceEgressIpNode should be ServiceEgressIPNode
egressip_healthcheck.go:15:2: exported const ServiceEgressIpNode should have comment (or a comment on this block) or be unexported
egressip_healthcheck.go:36:2: don't use underscores in Go names; struct field node_mgmt_ip should be nodeMgmtIP
egressip_healthcheck.go:39:2: don't use underscores in Go names; struct field health_check_port should be healthCheckPort
egressip_healthcheck.go:42:1: exported function NewEgressIPHealthServer should have comment or be unexported
egressip_healthcheck.go:42:30: don't use underscores in Go names; func parameter node_mgmt_ip should be nodeMgmtIP
egressip_healthcheck.go:42:51: don't use underscores in Go names; func parameter health_check_port should be healthCheckPort
egressip_healthcheck.go:42:75: exported func NewEgressIPHealthServer returns unexported type *healthcheck.egressIPHealthServer, which can be annoying to use
egressip_healthcheck.go:81:6: exported type EgressIPHealthClient should have comment or be unexported
egressip_healthcheck.go:83:10: don't use underscores in Go names; interface method parameter dial_ctx should be dialCtx
egressip_healthcheck.go:83:54: don't use underscores in Go names; interface method parameter health_check_port should be healthCheckPort
egressip_healthcheck.go:85:8: don't use underscores in Go names; interface method parameter dial_ctx should be dialCtx
egressip_healthcheck.go:95:2: don't use underscores in Go names; struct field probe_failed should be probeFailed
egressip_healthcheck.go:98:1: exported function NewEgressIPHealthClient should have comment or be unexported
egressip_healthcheck.go:106:42: don't use underscores in Go names; method parameter dial_ctx should be dialCtx
egressip_healthcheck.go:106:86: don't use underscores in Go names; method parameter health_check_port should be healthCheckPort
egressip_healthcheck.go:109:6: don't use underscores in Go names; var node_addr should be nodeAddr
egressip_healthcheck.go:112:9: don't use underscores in Go names; range var node_mgmt_ip should be nodeMgmtIP
egressip_healthcheck.go:143:40: don't use underscores in Go names; method parameter dial_ctx should be dialCtx
egressip_healthcheck.go:158:3: don't use underscores in Go names; var prev_probe_failed should be prevProbeFailed
❯ make lint 2>&1 | tee ~/x
level=info msg="[config_reader] Config search paths: [./ /app / /root]"
level=info msg="[config_reader] Used config file .golangci.yml"
level=info msg="[lintersdb] Active 13 linters: [deadcode errcheck exportloopref gofmt gosimple govet ineffassign nosprintfhostport staticcheck structcheck typecheck unused varcheck]"
level=info msg="[loader] Go packages loading at mode 575 (types_sizes|deps|exports_file|files|imports|name|compiled_files) took 1m31.31143454s"
level=info msg="[runner/filename_unadjuster] Pre-built 0 adjustments in 237.327084ms"
level=info msg="[linters context/goanalysis] analyzers took 5m53.600139829s with top 10 stages: buildir: 3m53.207123522s, gofmt: 10.513114941s, nilness: 9.511269896s, S1038: 5.990555809s, inspect: 4.666516648s, printf: 3.177458915s, fact_deprecated: 3.01233256s, ctrlflow: 2.702656853s, SA4030: 2.596352305s, directives: 2.588695656s"
level=info msg="[runner] Issues before processing: 725, after processing: 0"
level=info msg="[runner] Processors filtering stat (out/in): filename_unadjuster: 725/725, identifier_marker: 21/21, exclude-rules: 3/21, skip_files: 21/725, exclude: 21/21, path_prettifier: 725/725, autogenerated_exclude: 21/21, nolint: 0/3, cgo: 725/725, skip_dirs: 21/21"
level=info msg="[runner] processing took 44.703253ms with stages: autogenerated_exclude: 17.36304ms, nolint: 12.297915ms, path_prettifier: 12.184903ms, skip_files: 1.160597ms, identifier_marker: 836.52µs, exclude-rules: 614.246µs, skip_dirs: 128.188µs, cgo: 66.865µs, filename_unadjuster: 45.66µs, max_same_issues: 1.736µs, uniq_by_line: 737ns, source_code: 574ns, exclude: 425ns, max_from_linter: 378ns, diff: 374ns, severity-rules: 235ns, sort_results: 233ns, max_per_file_from_linter: 224ns, path_shortener: 221ns, path_prefixer: 182ns"
level=info msg="[runner] linters took 1m4.886636201s with stages: goanalysis_metalinter: 1m4.841802035s"
level=info msg="File cache stats: 0 entries of total size 0B"
level=info msg="Memory: 1445 samples, avg is 391.2MB, max is 1731.6MB"
level=info msg="Execution took 2m36.447970116s"
lint OK!

Lint is playing tricks on me! :P

@flavio-fernandes
Copy link
Contributor Author

@andreaskaris I made the changes you pointed out, including lint.
I may need you help in learning how to lint outside makefile, tho.
PTAL. Thanks!!

@flavio-fernandes flavio-fernandes force-pushed the eip_health branch 2 times, most recently from 24fedd7 to afbe099 Compare August 1, 2022 19:59
@flavio-fernandes
Copy link
Contributor Author

/retest-failed

1 similar comment
@andreaskaris
Copy link
Collaborator

/retest-failed

@andreaskaris
Copy link
Collaborator

@andreaskaris I made the changes you pointed out, including lint. I may need you help in learning how to lint outside makefile, tho. PTAL. Thanks!!

I have this here in my bashrc:

lint ()
{ 
    golint -min_confidence 0.8
}

@flavio-fernandes
Copy link
Contributor Author

I have this here in my bashrc:

TY. I'm addressing the last issues I found soon

❯ ~/go/bin/golint -min_confidence 0.8
egressip_healthcheck.go:34:1: comment on exported type EgressIPHealthServer should be of the form "EgressIPHealthServer ..." (with optional leading article)
egressip_healthcheck.go:46:1: comment on exported function NewEgressIPHealthServer should be of the form "NewEgressIPHealthServer ..."
egressip_healthcheck.go:86:1: comment on exported type EgressIPHealthClient should be of the form "EgressIPHealthClient ..." (with optional leading article)
egressip_healthcheck.go:104:1: comment on exported function NewEgressIPHealthClient should be of the form "NewEgressIPHealthClient ..."

Copy link
Collaborator

@andreaskaris andreaskaris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, some more linter stuff .. feel free to correct what you see fit but at least fix the python underscore syntax to golang camelcase

[akaris@linux ovn ((afbe09999...))]$ cd healthcheck/
[akaris@linux healthcheck ((afbe09999...))]$ lint
egressip_healthcheck.go:37:1: comment on exported type EgressIPHealthServer should be of the form "EgressIPHealthServer ..." (with optional leading article)
egressip_healthcheck.go:49:1: comment on exported function NewEgressIPHealthServer should be of the form "NewEgressIPHealthServer ..."
egressip_healthcheck.go:93:1: comment on exported type EgressIPHealthClient should be of the form "EgressIPHealthClient ..." (with optional leading article)
egressip_healthcheck.go:111:1: comment on exported function NewEgressIPHealthClient should be of the form "NewEgressIPHealthClient ..."
[akaris@linux ovn ((afbe09999...))]$ lint | grep egressip.go
egressip.go:1393:6: exported type EgressIPPatchStatus should have comment or be unexported
egressip.go:1982:9: if block ends with a return statement, so drop this else and outdent its block (move short variable declaration to its own line if necessary)
egressip.go:2241:60: don't use underscores in Go names; method parameter health_check_port should be healthCheckPort
egressip.go:2252:2: don't use underscores in Go names; var dial_ctx should be dialCtx
egressip.go:2252:12: don't use underscores in Go names; var dial_cancel should be dialCancel
[akaris@linux node ((afbe09999...))]$ lint | grep node.go
node.go:586:2: don't use underscores in Go names; var health_check_port should be healthCheckPort
node.go:589:7: don't use underscores in Go names; var node_mgmt_ip should be nodeMgmtIP
node.go:602:3: don't use underscores in Go names; var health_server should be healthServer
node.go:619:1: exported method OvnNode.WatchEndpoints should have comment or be unexported
node.go:742:1: exported method OvnNode.WatchNamespaces should have comment or be unexported

go-controller/pkg/node/node.go Outdated Show resolved Hide resolved
go-controller/pkg/node/node.go Outdated Show resolved Hide resolved
go-controller/pkg/node/node.go Outdated Show resolved Hide resolved
go-controller/pkg/config/config_test.go Outdated Show resolved Hide resolved
go-controller/pkg/config/config_test.go Outdated Show resolved Hide resolved
go-controller/pkg/ovn/egressip.go Outdated Show resolved Hide resolved
go-controller/pkg/ovn/egressip.go Outdated Show resolved Hide resolved
go-controller/pkg/ovn/healthcheck/egressip_healthcheck.go Outdated Show resolved Hide resolved
go-controller/pkg/ovn/healthcheck/egressip_healthcheck.go Outdated Show resolved Hide resolved
@flavio-fernandes flavio-fernandes force-pushed the eip_health branch 4 times, most recently from a9f8ac0 to 48a4bff Compare August 4, 2022 20:02
Copy link
Contributor

@trozet trozet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good except for some test cases and swapping the default config behavior

go-controller/pkg/node/node.go Outdated Show resolved Hide resolved
go-controller/pkg/ovn/egressip.go Outdated Show resolved Hide resolved
go-controller/pkg/ovn/egressip.go Outdated Show resolved Hide resolved
@flavio-fernandes
Copy link
Contributor Author

@flavio-fernandes Do you have any good docs to introduce me to how you created the auto generated grpc code?

@martinkennelly below are the places I used to come up to speed on using gRPC. Please take a look and let me know if that is not good for you:

contrib/kind.sh Outdated Show resolved Hide resolved
Copied proto definition from https://github.com/grpc/grpc/blob/master/doc/health-checking.md
and generated the associated code.

Steps taken:

   $ export PATH="$PATH:$(go env GOPATH)/bin"
   $ go version
   go version go1.18.3 linux/amd64
   $ protoc --version
   libprotoc 3.14.0
   $ cd go-controller/pkg/ovn/healthcheck
   $ protoc --go_out=. --go_opt=paths=source_relative \
       --go-grpc_out=. --go-grpc_opt=paths=source_relative health.proto

Signed-off-by: Flavio Fernandes <flaviof@redhat.com>
Break checkEgressNodesReachability into checkEgressNodesReachabilityIterate
calls so implementation can honor oc.stopChan events and know when egressip
should stop probing its egress nodes.

Signed-off-by: Flavio Fernandes <flaviof@redhat.com>
@flavio-fernandes
Copy link
Contributor Author

/retest

@oribon oribon mentioned this pull request Sep 1, 2022
@trozet
Copy link
Contributor

trozet commented Sep 1, 2022

@flavio-fernandes one of the jobs failed with ovnkbue node crashing 9x:

2022-08-31T22:41:24.067948114Z stderr F F0831 22:41:24.067877   12131 egressip_healthcheck.go:63] Health checking listen failed: listen tcp [fd00:10:244:1::2]:9107: bind: cannot assign requested address

https://github.com/ovn-org/ovn-kubernetes/runs/8123914416?check_suite_focus=true

@flavio-fernandes
Copy link
Contributor Author

@flavio-fernandes one of the jobs failed with ovnkbue node crashing 9x:

2022-08-31T22:41:24.067948114Z stderr F F0831 22:41:24.067877   12131 egressip_healthcheck.go:63] Health checking listen failed: listen tcp [fd00:10:244:1::2]:9107: bind: cannot assign requested address

https://github.com/ovn-org/ovn-kubernetes/runs/8123914416?check_suite_focus=true

my changes broke ipv6. Fixing it now

When ovnkube container, in both master and node pods, is started with
the newly introduced flag 'egressip-node-healthcheck-port', egressip
implementation will now use gRPC with that parameter.

Signed-off-by: Flavio Fernandes <flaviof@redhat.com>
Extend kind and dist to support a new flag added to ovn-kube:

    egress ip health check

This flag is used by ovnkube-master and ovnkube-node templates,
so eip probe uses gRPC to that port instead of dialing discard port (9).

NOTE: With this change, kind will use port 9107 by default.

Signed-off-by: Flavio Fernandes <flaviof@redhat.com>
@flavio-fernandes
Copy link
Contributor Author

@flavio-fernandes one of the jobs failed with ovnkbue node crashing 9x:

2022-08-31T22:41:24.067948114Z stderr F F0831 22:41:24.067877   12131 egressip_healthcheck.go:63] Health checking listen failed: listen tcp [fd00:10:244:1::2]:9107: bind: cannot assign requested address

https://github.com/ovn-org/ovn-kubernetes/runs/8123914416?check_suite_focus=true

Added settling change to ensure it is safe to use ipv6. TY @jcaamano and @trozet !!!

diff --git a/go-controller/pkg/node/node.go b/go-controller/pkg/node/node.go
index 5fdd03eb2..fd5905123 100644
--- a/go-controller/pkg/node/node.go
+++ b/go-controller/pkg/node/node.go
@@ -20,6 +20,7 @@ import (
        "k8s.io/klog/v2"
        utilnet "k8s.io/utils/net"

+       "github.com/containernetworking/plugins/pkg/ip"
        honode "github.com/ovn-org/ovn-kubernetes/go-controller/hybrid-overlay/pkg/controller"
        "github.com/ovn-org/ovn-kubernetes/go-controller/pkg/cni"
        "github.com/ovn-org/ovn-kubernetes/go-controller/pkg/config"
@@ -593,6 +594,11 @@ func (n *OvnNode) startEgressIPHealthCheckingServer(wg *sync.WaitGroup, mgmtPort
                nodeMgmtIP = mgmtPortConfig.ipv4.ifAddr.IP
        } else if mgmtPortConfig.ipv6 != nil {
                nodeMgmtIP = mgmtPortConfig.ipv6.ifAddr.IP
+               // Wait for address to become usable.
+               if err := ip.SettleAddresses(mgmtPortConfig.ifName, 10); err != nil {
+                       return fmt.Errorf("failed start health checking server due to unsettled IPv6: %w", err)
+               }
        } else {
                return fmt.Errorf("unable to start health checking server: no mgmt ip")
        }

@flavio-fernandes flavio-fernandes requested review from trozet, kyrtapz and martinkennelly and removed request for kyrtapz September 1, 2022 21:04
Copy link
Contributor

@trozet trozet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@kyrtapz
Copy link
Contributor

kyrtapz commented Sep 2, 2022

/lgtm

@martinkennelly
Copy link
Member

/lgtm

Thanks for the info flavio.

@trozet trozet merged commit 5c4213d into ovn-org:master Sep 2, 2022
@flavio-fernandes flavio-fernandes deleted the eip_health branch September 2, 2022 17:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants