[NVSHAS-9189] Scan will stuck in scheduling after controller is shutdown and restarted #155

pohanhuangtw · 2024-07-12T02:28:45Z

Summary

Add a HealthCheck gRPC call that checks controller health periodically. If the controller is down, the scanner restarts the re-register loop.
Users can customize the check period and the maximum number of retries.
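
A minimal sketch of the behaviour described in the summary, not the code from this PR: a ticker drives the health check, and once a configurable number of consecutive checks fail, the re-register callback fires. The names `failureTracker`, `watchController` and `errDown` are made up for illustration, and the check function stands in for the real gRPC call.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDown simulates a failed health check in the demo below.
var errDown = errors.New("controller unreachable")

// failureTracker counts consecutive failed health checks (hypothetical type).
type failureTracker struct {
	retryMax int // consecutive failures tolerated before re-registering
	failures int
}

// observe records one health-check result and reports whether the caller
// should restart the re-register loop. A success resets the counter.
func (t *failureTracker) observe(err error) bool {
	if err == nil {
		t.failures = 0
		return false
	}
	t.failures++
	if t.failures >= t.retryMax {
		t.failures = 0
		return true
	}
	return false
}

// watchController runs check every period until done is closed and calls
// reRegister once retryMax consecutive checks have failed.
func watchController(check func() error, reRegister func(), period time.Duration, retryMax int, done <-chan struct{}) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	tracker := &failureTracker{retryMax: retryMax}
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			if tracker.observe(check()) {
				reRegister()
			}
		}
	}
}

func main() {
	done := make(chan struct{})
	triggered := make(chan struct{}, 1)
	alwaysDown := func() error { return errDown }
	notify := func() {
		select {
		case triggered <- struct{}{}:
		default: // a trigger is already pending
		}
	}
	go watchController(alwaysDown, notify, 10*time.Millisecond, 3, done)

	<-triggered
	close(done)
	fmt.Println("re-register triggered")
}
```

Keeping the counting logic in a small type separate from the ticker loop makes the retry rule easy to test without waiting on timers.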

Test Performed

old scanner > old controller: the issue still occurs, as expected. The scanner cannot register with the controller again because the old scanner has no re-register logic.
old scanner > new controller: the issue still occurs, as expected. The scanner cannot register with the controller again because the old scanner has no re-register logic.
new scanner > old controller: the issue still occurs, as expected. The scanner cannot register with the controller again.
new scanner > new controller: the problem is resolved.

williamlin-suse · 2024-07-17T17:43:06Z

Shouldn't the modified shared files (.proto & .pb.go files in neuvector/neuvector#1461) under neuvector/scanner/vendor/github.com/neuvector/neuvector/share be included in this PR?

holyspectral · 2024-07-18T00:49:59Z

monitor/monitor.c

+            if ((period = getenv(ENV_HEALTH_CHECK_PERIOD)) != NULL) {
+                args[a ++] = "--period";
+                args[a ++] = period;
+            } 
+            if ((retry_max = getenv(ENV_RETRY)) != NULL) {
+                args[a ++] = "--retry_max";
+                args[a ++] = retry_max;
+            }            


My understanding is that environment variables are passed through to the scanner, so you don't have to parse these settings in monitor. Is this not the case?

HEALTH_CHECK_PERIOD/MAX_RETRY can be configured in the scanner YAML.
monitor, as the entry process in the scanner container, parses the env vars and translates them to --period/--retry_max arguments when creating the scanner process.
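
For illustration only, here is roughly how the receiving side of that hand-off could look in Go, assuming the scanner accepts the --period and --retry_max arguments quoted in the diff above. The `healthOpts` type, the `parseHealthOpts` helper and the default values 20 and 3 are placeholders, not taken from this PR.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// healthOpts holds the two settings discussed in this thread (hypothetical type).
type healthOpts struct {
	Period   int // seconds between health checks
	RetryMax int // consecutive failures before re-registering
}

// parseHealthOpts parses --period and --retry_max from args.
// The defaults (20 and 3) are placeholders, not values from the PR.
func parseHealthOpts(args []string) (healthOpts, error) {
	var o healthOpts
	fs := flag.NewFlagSet("scanner", flag.ContinueOnError)
	fs.IntVar(&o.Period, "period", 20, "health check period in seconds")
	fs.IntVar(&o.RetryMax, "retry_max", 3, "max consecutive failed checks")
	if err := fs.Parse(args); err != nil {
		return healthOpts{}, err
	}
	return o, nil
}

func main() {
	opts, err := parseHealthOpts(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("period=%ds retry_max=%d\n", opts.Period, opts.RetryMax)
}
```

Running the binary with `--period 30 --retry_max 5` overrides both placeholders; with no arguments the defaults apply.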

@pohanhuangtw

Hi @holyspectral,
The scanner.go is triggered by monitor.c. To configure the HEALTH_CHECK_PERIOD and MAX_RETRY correctly, we need to add these values to the environment variables. This way, the scanner can read them correctly.

Hi @williamlin-suse,

After a quick discussion with @holyspectral, we realized that monitor.c may be shared by many components, and we're not sure whether changing it might break some behavior. Would it be okay to use os.Getenv() to read the environment variables instead of passing them through monitor.c?

To add, I think monitor's design purpose is to monitor and restart its child process, so it's not a perfect place to write component-specific logic. Technically we can still put some logic into monitor if we really want to, but maybe we should only keep generic ones there.
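
A rough sketch of the alternative proposed above, where the scanner reads the settings from its own environment instead of receiving them from monitor. The variable names HEALTH_CHECK_PERIOD and MAX_RETRY come from this thread; the helpers `positiveOr` and `envInt` and the fallback values are assumptions for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Hypothetical defaults; the thread does not state the real ones.
const (
	defaultPeriodSec = 20
	defaultRetryMax  = 3
)

// positiveOr parses raw as a positive integer and falls back to def when
// the value is absent, malformed, or not positive.
func positiveOr(raw string, present bool, def int) int {
	if !present {
		return def
	}
	v, err := strconv.Atoi(raw)
	if err != nil || v <= 0 {
		return def
	}
	return v
}

// envInt reads the named environment variable with the fallback rules above.
func envInt(name string, def int) int {
	raw, ok := os.LookupEnv(name)
	return positiveOr(raw, ok, def)
}

func main() {
	period := envInt("HEALTH_CHECK_PERIOD", defaultPeriodSec)
	retryMax := envInt("MAX_RETRY", defaultRetryMax)
	fmt.Printf("period=%ds retryMax=%d\n", period, retryMax)
}
```

Falling back to a default on bad input keeps a typo in the pod spec from disabling the health check entirely.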

The scanner repo has its own monitor code.
What do you mean by "monitor.c may be shared by many components"?
~~I think another PR is needed for neuvector-helm repo (github.com/neuvector/neuvector-helm/charts/core/templates/scanner-deployment.yaml) regarding the 2 new env variables added in this PR.~~

I think env vars can be provided to scanner pod via the existing cve.scanner.env already, but we probably should document these settings somewhere for sure. Do you have any places in mind?

We can submit a Jira case to Nuno for a doc update.

I do not think we should expose those variables unless customers need them to deal with their situation. @pohanhuangtw, please use a pair of reasonable default values instead.

holyspectral · 2024-07-18T01:00:56Z

server.go

+
+// To ensure the controller's availability, periodCheckHealth use HealthCheck to periodically check if the controller is alive.
+// Additionally, if the controller is deleted or not responsive, the scanner will re-register.
+func periodCheckHealth(joinIP string, joinPort uint16, data *share.ScannerRegisterData, cb *clientCallback, healthCheckCh chan struct{}, done chan bool, period, retryMax int) {


Optional: You can use a context, which is the standard way to control goroutines, to replace healthCheckCh. You can also assign a timeout to it.

I think I will keep it as is, since we may trigger the re-register while the scanner is idle.
So I think we need a way to close the channel manually.
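
For reference, a context can also be stopped manually: `context.WithCancel` returns a cancel function that plays the same role as closing a channel, and `context.WithTimeout` adds a deadline on top. The sketch below is only an illustration of that pattern; `runHealthLoop` and `startHealthLoop` are made-up names, not functions from this PR.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runHealthLoop calls check every period until ctx is cancelled, then
// returns how many checks ran.
func runHealthLoop(ctx context.Context, period time.Duration, check func()) int {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	n := 0
	for {
		select {
		case <-ctx.Done():
			return n
		case <-ticker.C:
			check()
			n++
		}
	}
}

// startHealthLoop launches the loop and returns a stop function plus a
// channel that delivers the check count once the loop has exited.
func startHealthLoop(periodMs int, check func()) (stop func(), result <-chan int) {
	ctx, cancel := context.WithCancel(context.Background())
	out := make(chan int, 1)
	go func() {
		out <- runHealthLoop(ctx, time.Duration(periodMs)*time.Millisecond, check)
	}()
	return cancel, out
}

func main() {
	stop, result := startHealthLoop(5, func() {})
	time.Sleep(30 * time.Millisecond)
	stop() // manual stop, playing the role of closing healthCheckCh
	fmt.Println("checks before stop:", <-result)
}
```

Calling the cancel function more than once is safe, which is one practical difference from closing a channel twice.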

holyspectral · 2024-07-18T01:08:06Z

scanner.go

@@ -98,12 +98,14 @@ func dbRead(path string, maxRetry int, output string) map[string]*share.ScanVuln
 	}
 }

-func connectController(path, advIP, joinIP, selfID string, advPort uint32, joinPort uint16) {
+func connectController(path, advIP, joinIP, selfID string, advPort uint32, joinPort uint16, period, retryMax int, doneCh chan bool) {


Optional: I know we have joinIP in many places, but it can actually contain the service name of the controller. Maybe consider renaming it to joinHost?

It is not a Host target. The joinServiceAddr or joinServiceIP is better. It is okay to keep joinIP, too.

I like joinServiceAddr. Considering scenarios like docker, how about joinAddr?
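
To make the naming point concrete: a dial target can be built the same way whether the value is an IP address or a service name, which is why a neutral name such as joinAddr fits. The sketch below uses the standard `net.JoinHostPort`; the `controllerEndpoint` helper, the service name and the port number are illustrative placeholders.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// controllerEndpoint builds the dial target from joinAddr, which may be an
// IP address or a DNS name such as a Kubernetes service name.
func controllerEndpoint(joinAddr string, joinPort uint16) string {
	return net.JoinHostPort(joinAddr, strconv.Itoa(int(joinPort)))
}

func main() {
	// The addresses and port below are placeholders for illustration.
	fmt.Println(controllerEndpoint("10.0.0.5", 18400))
	fmt.Println(controllerEndpoint("controller-svc.example", 18400))
	fmt.Println(controllerEndpoint("fd00::5", 18400)) // IPv6 literals get brackets
}
```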

jayhuang-suse

Please split this PR into two sequential parts:
(1) Merge the neuvector/share from its main branch.
(2) The code changes at scanner side.

Most likely, it is an internal improvement. We do not need to address this timeout and retry in the documentation. The default values should be enough.

jayhuang-suse · 2024-07-31T09:16:44Z

server.go

Please revert to the original variable names. Thanks.

jayhuang-suse

Good

jayhuang-suse · 2024-08-01T22:37:23Z

@pohanhuangtw Please update the average time to trigger the recovery procedure in the JIRA case.

holyspectral reviewed Jul 18, 2024

pohanhuangtw requested review from williamlin-suse and holyspectral July 18, 2024 07:33

jayhuang-suse reviewed Jul 24, 2024

jayhuang-suse reviewed Jul 31, 2024

pohanhuangtw added 6 commits July 31, 2024 09:35

[NVSHAS-9189] Add HealthCheck grpc call to check controller health periodically.

fa6f87c

[NVSHAS-9189] Use GetCaps grpc call to check controller health periodically.

6292e7b

[NVSHAS-9189] Add comment mesage to the periodCheckHealth

30e056e

[NVSHAS-9189] Improve the periodCheckHealth make sure it only create once, shorten the check period.

eb676ac

[NVSHAS-9189] Improve the periodCheckHealth make sure it can restart the scanner if the scanner is not in the controller list.

b8547b7

[NVSHAS-9189] Add modified shared files.

5a87be6

pohanhuangtw force-pushed the NVSHAS-9189 branch 2 times, most recently from a4667a5 to 3395b8a Compare July 31, 2024 10:28

pohanhuangtw requested a review from jayhuang-suse July 31, 2024 10:28

[NVSHAS-9189] Remove setting from monitor and add isGetCapsActivate to avoid old controller to keep register.

bf28afe

pohanhuangtw force-pushed the NVSHAS-9189 branch from 3395b8a to bf28afe Compare July 31, 2024 10:41

jayhuang-suse approved these changes Aug 1, 2024

williamlin-suse approved these changes Aug 1, 2024

jayhuang-suse merged commit bfc458d into neuvector:main Aug 1, 2024
1 check passed
