New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(test::npd): provide NPD with proper kubeconfig #96262
Conversation
/priority important-soon |
/test pull-kubernetes-bazel-test Thanks for this fix. Were you able to verify that it actually fixes the test using a local/remote run? |
/assign @wangzhen127 |
Unfortunately I was unable to verify due to lack of testing environment.. |
@karan Are you able to try this PR with the NPD e2e tests? I don't have a setup myself. Thought you may have it already. Do you mind helping verify? |
Sure - let me see if I can get it to work:
|
Doesn't seem to work:
|
NPD container is crash looping (no logs or anything to indicate why in NPD itself):
|
So far I've been unable to figure out why NPD is unable to start up during the test:
There are no container or pod logs on the host. The various configs expected exist in the right place. I've ruled out resource constraints (using a n1-standard-2 machine instead of a 1 core machine). I've also ruled out permissions issues by making sure that the Finally, I tried running NPD manually:
This fails because of a missing config. Weird because it should be using
Regardless at least there's some logging happening. We don't have anything when in the e2e test. So I'm out of ideas for now. |
@karan Thanks for your effort, much appreicated! I think we are very close. As for the Could you try modifying the entrypoint and run the command again to see what is going on? |
It works when manually run.
Note that I changed the e2e to write So it does work manually, but not when run as a Pod from the e2e test. |
Okay, I'm a bit closer to figuring out what's up. If I change the PodSpec to this:
Then at least the container starts (it crashes because So it seems like the issue is with the use of |
@knight42 #96381 fixes the Pod spec. Once that's submitted, you can rebase your PR and the test will be green:
|
/triage accepted |
"-c", | ||
// `ServiceAccount` admission controller is disabled in node e2e tests, so we could not use | ||
// inClusterConfig here. | ||
fmt.Sprintf("touch %s && /node-problem-detector --logtostderr --system-log-monitors=%s --apiserver-override=%s?inClusterConfig=false&auth=%s", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me some time, but in the end, the reason why it failed with no error and logs...
When you execute the command with the sh -c
you should be aware of special characters, in your case you have --apiserver-override=%s?inClusterConfig=false&auth=%s"
that includes &
that interpreted as a special character, so the final command has view
touch %s && \
/node-problem-detector --logtostderr --system-log-monitors=%s --apiserver-override=%s?inClusterConfig=false & \
auth=%s
when &
bitwise and, auth=some_variable
always have zero output, so the final value will be 0(at least I think so, did not check it).
To make it work you just need to escape &
-> \&
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a possibility. I tested it (\&
is not a valid escape sequence):
$ git diff
diff --git a/test/e2e_node/node_problem_detector_linux.go b/test/e2e_node/node_problem_detector_linux.go
index 6750a2ceee5..e91046d9434 100644
--- a/test/e2e_node/node_problem_detector_linux.go
+++ b/test/e2e_node/node_problem_detector_linux.go
@@ -240,7 +240,7 @@ current-context: local-context
"-c",
// `ServiceAccount` admission controller is disabled in node e2e tests, so we could not use
// inClusterConfig here.
- fmt.Sprintf("touch %s && /node-problem-detector --logtostderr --system-log-monitors=%s --apiserver-override=%s?inClusterConfig=false&auth=%s",
+ fmt.Sprintf(`touch %s && /node-problem-detector --logtostderr --system-log-monitors=%s --apiserver-override=%s?inClusterConfig=false&auth=%s`,
logFile,
configFile,
framework.TestContext.Host,
The test still fails - same symptoms - even though the docker entrypoint looks correct:
"Entrypoint": [
"sh",
"-c",
"touch /log/test.log && /node-problem-detector --logtostderr --system-log-monitors=/config/testconfig.json --apiserver-override=https://127.0.0.1:6443?inClusterConfig=false&auth=/config/kubeconfig"
],
Then I tried again with:
$ git diff
diff --git a/test/e2e_node/node_problem_detector_linux.go b/test/e2e_node/node_problem_detector_linux.go
index 6750a2ceee5..3eca64593e9 100644
--- a/test/e2e_node/node_problem_detector_linux.go
+++ b/test/e2e_node/node_problem_detector_linux.go
@@ -231,20 +231,36 @@ current-context: local-context
},
},
},
- Containers: []v1.Container{
+ InitContainers: []v1.Container{
{
- Name: name,
- Image: image,
- Command: []string{
- "sh",
+ Name: "init-log-file",
+ Image: "debian",
+ Command: []string{"/bin/sh"},
+ Args: []string{
"-c",
- // `ServiceAccount` admission controller is disabled in node e2e tests, so we could not use
- // inClusterConfig here.
- fmt.Sprintf("touch %s && /node-problem-detector --logtostderr --system-log-monitors=%s --apiserver-override=%s?inClusterConfig=false&auth=%s",
- logFile,
- configFile,
- framework.TestContext.Host,
- kubeConfigFile),
+ fmt.Sprintf("touch %s", logFile),
+ },
+ VolumeMounts: []v1.VolumeMount{
+ {
+ Name: logVolume,
+ MountPath: path.Dir(logFile),
+ },
+ {
+ Name: localtimeVolume,
+ MountPath: etcLocaltime,
+ },
+ },
+ },
+ },
+ Containers: []v1.Container{
+ {
+ Name: name,
+ Image: image,
+ Command: []string{"/node-problem-detector"},
+ Args: []string{
+ "--logtostderr",
+ fmt.Sprintf("--system-log-monitors=%s", configFile),
+ fmt.Sprintf("--apiserver-override=%s?inClusterConfig=false&auth=%s", framework.TestContext.Host, kubeConfigFile),
},
Env: []v1.EnvVar{
{
This works fine:
Ran 1 of 330 Specs in 112.085 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 329 Skipped
The docker command/args look like this:
"Path": "/node-problem-detector",
"Args": [
"--logtostderr",
"--system-log-monitors=/config/testconfig.json",
"--apiserver-override=https://127.0.0.1:6443?inClusterConfig=false&auth=/config/kubeconfig"
],
So I don't think the escape is the issue (or at least the way I'm escaping).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you execute the command with the
sh -c
you should be aware of special characters
@cynepco3hahue Ah I think this make sense! Because the node-problem-detector
was put in background, so the container exit immediately and we got no error and logs. Thanks for enlightening us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@karan I think the point is sh
would treat &
in a special way, perhaps you could try:
`touch %s && /node-problem-detector --logtostderr --system-log-monitors=%s --apiserver-override='%s?inClusterConfig=false&auth=%s'`
note the single quote around the value of --apiserver-override
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep exactly, I verified your PR with
fmt.Sprintf("touch %s && /node-problem-detector --logtostderr --system-log-monitors=%s --apiserver-override=%s?inClusterConfig=false\\&auth=%s",
logFile,
configFile,
framework.TestContext.Host,
kubeConfigFile),
and it passed.
Take into account that you need to double slash it \\&
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah gotcha! @knight42 feel free to undo my changes in your PR in lieu of the escaping here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@karan I think your change is beneficial which makes the command easier to read, I'd like to keep it.
Signed-off-by: knight42 <anonymousknight96@gmail.com>
56f90be
to
1a9600d
Compare
/assign @derekwaynecarr |
/lgtm /approved |
Thanks for looking into this! |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: derekwaynecarr, knight42 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Signed-off-by: knight42 anonymousknight96@gmail.com
What type of PR is this?
/kind failing-test
What this PR does / why we need it:
The
ServiceAccount
admission controller is disabled in node e2e tests:kubernetes/test/e2e_node/services/apiserver.go
Line 64 in 4364ef6
So we have to create kubeconfig for NPD in test.
Which issue(s) this PR fixes:
Fixes #95955
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: