sig-windows-gce test jobs are failing consistently for a long time #124047

Open
AnishShah opened this issue Mar 25, 2024 · 7 comments
Labels: good first issue, help wanted, kind/failing-test, sig/windows, triage/accepted

Comments

@AnishShah
Contributor

AnishShah commented Mar 25, 2024

Which jobs are failing?

gce-windows-2019-containerd-master and gce-windows-2022-containerd-master are failing

Which tests are failing?

The Windows nodes are failing to come up because the startup scripts are failing.

Since when has it been failing?

The jobs have been failing for a long time: every run currently visible on the testgrid is red.

Testgrid link

https://testgrid.k8s.io/sig-windows-gce

Reason for failure (if possible)

The Windows startup script fails while creating the NPD (node-problem-detector) kubeconfig. I found the following in the Windows node serial log from a failed test run:

2024/03/25 15:15:33 GCEMetadataScripts: windows-startup-script-ps1: Exception caught in script:
2024/03/25 15:15:33 GCEMetadataScripts: windows-startup-script-ps1: At C:\k8s-node-setup.psm1:1549 char:14
2024/03/25 15:15:33 GCEMetadataScripts: windows-startup-script-ps1: +       -Token ${kube_env}['NODE_PROBLEM_DETECTOR_TOKEN']
2024/03/25 15:15:33 GCEMetadataScripts: windows-startup-script-ps1: +              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2024/03/25 15:15:33 GCEMetadataScripts: windows-startup-script-ps1: Kubernetes Windows node setup failed: Cannot bind argument to parameter 'Token' because it is an empty string.
2024/03/25 15:15:34 GCEMetadataScripts: windows-startup-script-ps1: Cleaning up, Unregistering WorkerServices...
2024/03/25 15:15:34 GCEMetadataScripts: windows-startup-script-ps1: [SC] OpenService FAILED 1060:
2024/03/25 15:15:34 GCEMetadataScripts: windows-startup-script-ps1:
2024/03/25 15:15:34 GCEMetadataScripts: windows-startup-script-ps1: The specified service does not exist as an installed service.
2024/03/25 15:15:34 GCEMetadataScripts: windows-startup-script-ps1:
2024/03/25 15:15:34 GCEMetadataScripts: windows-startup-script-ps1: [SC] OpenService FAILED 1060:
2024/03/25 15:15:34 GCEMetadataScripts: windows-startup-script-ps1:
2024/03/25 15:15:34 GCEMetadataScripts: windows-startup-script-ps1: The specified service does not exist as an installed service.

Anything else we need to know?

On Linux nodes, we run NPD in standalone mode and use either a kubeconfig generated from the token specified in NODE_PROBLEM_DETECTOR_TOKEN or the kubelet kubeconfig. ref.

On Windows, however, we do not run NPD at all: according to the serial logs, ENABLE_NODE_PROBLEM_DETECTOR is set to none, and both NODE_PROBLEM_DETECTOR_VERSION and NODE_PROBLEM_DETECTOR_TOKEN are empty.

2024/03/25 15:12:46 GCEMetadataScripts: windows-startup-script-ps1: Logging kube-env key-value pairs except CERT and KEY values
....
....
2024/03/25 15:12:46 GCEMetadataScripts: windows-startup-script-ps1: ENABLE_NODE_PROBLEM_DETECTOR: none
....
2024/03/25 15:12:46 GCEMetadataScripts: windows-startup-script-ps1: NODE_PROBLEM_DETECTOR_VERSION:
2024/03/25 15:12:46 GCEMetadataScripts: windows-startup-script-ps1: NODE_PROBLEM_DETECTOR_TOKEN:
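
For anyone who wants to check these values in another run, here is a minimal sketch that pulls the logged kube-env pairs out of a serial log. It assumes the log lines keep the exact prefix shown above; the function name `parse_kube_env` is hypothetical and not part of any Kubernetes tooling.

```python
import re

# Matches serial-log lines of the form shown above:
#   <date> <time> GCEMetadataScripts: windows-startup-script-ps1: KEY: value
# Assumption: kube-env keys are upper-case with digits and underscores.
LINE_RE = re.compile(
    r"^\S+ \S+ GCEMetadataScripts: windows-startup-script-ps1: "
    r"(?P<key>[A-Z][A-Z0-9_]*):\s*(?P<value>.*)$"
)


def parse_kube_env(serial_log: str) -> dict[str, str]:
    """Collect the KEY: value pairs that the startup script logged."""
    env = {}
    for line in serial_log.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            # An empty value (as with NODE_PROBLEM_DETECTOR_TOKEN) is kept as "".
            env[match.group("key")] = match.group("value").strip()
    return env
```

Keys whose value comes back as an empty string are the ones that would trip a parameter that rejects empty strings, such as `-Token` in the failure above.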

Based on the Linux node behavior, maybe we should set ENABLE_NODE_PROBLEM_DETECTOR to standalone on Windows and use the kubelet kubeconfig if NODE_PROBLEM_DETECTOR_TOKEN is missing?
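
To make the proposal concrete, here is a rough sketch of the decision logic, written in Python only for illustration (the real change would go in the PowerShell startup module). The names `NpdKubeconfig` and `choose_npd_kubeconfig` are hypothetical, and skipping kubeconfig creation when NPD is not in standalone mode is my assumption about one way to avoid the empty-token failure, not confirmed behavior of the existing scripts.

```python
from enum import Enum


class NpdKubeconfig(Enum):
    SKIP = "skip"              # NPD disabled: do not create an NPD kubeconfig
    FROM_TOKEN = "from-token"  # build a kubeconfig from NODE_PROBLEM_DETECTOR_TOKEN
    KUBELET = "kubelet"        # fall back to the kubelet kubeconfig


def choose_npd_kubeconfig(kube_env: dict) -> NpdKubeconfig:
    """Pick the kubeconfig source for NPD from kube-env values."""
    # Treat a missing or empty mode the same as "none".
    mode = (kube_env.get("ENABLE_NODE_PROBLEM_DETECTOR") or "none").strip()
    if mode != "standalone":
        # Assumption: with NPD off, the token is never needed.
        return NpdKubeconfig.SKIP
    token = (kube_env.get("NODE_PROBLEM_DETECTOR_TOKEN") or "").strip()
    # An empty token is what currently breaks the -Token parameter binding.
    return NpdKubeconfig.FROM_TOKEN if token else NpdKubeconfig.KUBELET
```

With the values from the serial log above (mode none, empty token) this sketch would skip NPD kubeconfig creation entirely; with standalone and an empty token it would fall back to the kubelet kubeconfig.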

Relevant SIG(s)

/sig windows
/good-first-issue

@AnishShah added the kind/failing-test label Mar 25, 2024
@k8s-ci-robot
Contributor

@AnishShah:
This request has been marked as suitable for new contributors.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the sig/windows, good first issue, help wanted, and needs-triage labels Mar 25, 2024
@AnishShah
Contributor Author

/triage accepted

@k8s-ci-robot
Contributor

@AnishShah: The label triage/accepted cannot be applied. Only GitHub organization members can add the label.

@AryaMoghaddam

I can give this a shot

@AnishShah
Contributor Author

/triage accepted
/assign @AryaMoghaddam

Thanks! Reach out on Slack if you need help.

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Mar 27, 2024
@kushalShukla-web
Contributor

I think I can work on this, @AnishShah.

@lavishpal

@AryaMoghaddam are you still working on this issue?
