Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NVSHAS-9189] Scan will stuck in scheduling after controller is shutdown and restarted #155

Merged
merged 7 commits into from
Aug 1, 2024

Conversation

pohanhuangtw
Copy link
Contributor

@pohanhuangtw pohanhuangtw commented Jul 12, 2024

Summary

  1. Add HealthCheck grpc call to check controller health periodically, if the controller is down, we restart the re-register loop
  2. user is able to customize the period and retry max

Test Performed

  1. old scanner > old controller, Expected cause the scanner is unable to register the controller again, since no logic for scanner to re-register
  2. old scanner > new controller, Expected cause the scanner is unable to register the controller again, since no logic for scanner to re-register
  3. new scanner > old controller, Expected cause the scanner is unable to register the controller again
  4. new scanner > new controller, resolve the problem

@williamlin-suse
Copy link
Contributor

Shouldn't the modified shared files(.proto & .pg.go files in neuvector/neuvector#1461) under neuvector/scanner/vendor/github.com/neuvector/neuvector/share be included in this PR?

Comment on lines 210 to 217
if ((period = getenv(ENV_HEALTH_CHECK_PERIOD)) != NULL) {
args[a ++] = "--period";
args[a ++] = period;
}
if ((retry_max = getenv(ENV_RETRY)) != NULL) {
args[a ++] = "--retry_max";
args[a ++] = retry_max;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that environment variable will be passed to scanner, so you don't have to parse these settings in monitor. Is this not the case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HEALTH_CHECK_PERIOD/MAX_RETRY can be configured in scanner yaml
monitor as the entry process in scanner container, it parses the env vars & translate them to --period/--retry_max arguments when creating scanner process

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @holyspectral,
The scanner.go is triggered by monitor.c. To configure the HEALTH_CHECK_PERIOD and MAX_RETRY correctly, we need to add these values to the environment variables. This way, the scanner can read them correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @williamlin-suse,

After a quick discussion with @holyspectral, we realized that monitor.c may be shared by many components, and we're not sure if making changes might break some behavior. Would it be okay to use os.GetEnv() to read the environment variables instead of passing them through monitor.c?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To add, I think monitor's design purpose is to monitor and restart its child process, so it's not a perfect place to write component-specific logic. Technically we can still put some logic into monitor if we really want to, but maybe we should only keep generic ones there.

Copy link
Contributor

@williamlin-suse williamlin-suse Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scanner repo has its own monitor code.
what do you mean "monitor.c may be shared by many components" ?
I think another PR is needed for neuvector-helm repo (github.com/neuvector/neuvector-helm/charts/core/templates/scanner-deployment.yaml) regarding the 2 new env variables added in this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think env vars can be provided to scanner pod via the existing cve.scanner.env already, but we probably should document these settings somewhere for sure. Do you have any places in mind?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can submit a jira case to Nuno for doc update

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think we should export those variables unless they are useful for customers to escalate their situation. @pohanhuangtw please use a default pair of the reasonable value instead.

server.go Outdated

// To ensure the controller's availability, periodCheckHealth use HealthCheck to periodically check if the controller is alive.
// Additionally, if the controller is deleted or not responsive, the scanner will re-register.
func periodCheckHealth(joinIP string, joinPort uint16, data *share.ScannerRegisterData, cb *clientCallback, healthCheckCh chan struct{}, done chan bool, period, retryMax int) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: You can use context, which is the standard way to control go routine, to replace healthCheckCh. You can also assign timeout to it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I will remain the same, since we may trigger the re-register when we have the idle state.
Thus I think we should have a way to close the channel manually.

server.go Outdated Show resolved Hide resolved
scanner.go Outdated
@@ -98,12 +98,14 @@ func dbRead(path string, maxRetry int, output string) map[string]*share.ScanVuln
}
}

func connectController(path, advIP, joinIP, selfID string, advPort uint32, joinPort uint16) {
func connectController(path, advIP, joinIP, selfID string, advPort uint32, joinPort uint16, period, retryMax int, doneCh chan bool) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional: I know we have many joinIP everywhere, but it can actually contains service name of controller. Maybe consider renaming it to joinHost?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not a Host target. The joinServiceAddr or joinServiceIP is better. It is okay to keep joinIP, too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like joinServiceAddr. Considering scenarios like docker, how about joinAddr?

Copy link
Collaborator

@jayhuang-suse jayhuang-suse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please separate this PR in two parts sequentially:
(1) Merge the neuvector/share from its main branch.
(2) The code changes at scanner side.

Most likely, it is an internal improvement. We do not need to address this timeout and retry in the documentation. The default values should be enough.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert the original variable names. Thanks.

Copy link
Collaborator

@jayhuang-suse jayhuang-suse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good

@jayhuang-suse
Copy link
Collaborator

@pohanhuangtw Please update the average time to trigger the recovery procedure in the JIRA case.

@jayhuang-suse jayhuang-suse merged commit bfc458d into neuvector:main Aug 1, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants