NPD jobs are failing: "failed to push gcr.io/node-problem-detector-staging/ci/node-problem-detector [...] 403 Forbidden" #119211
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I have two potential solutions to this issue and asked for a bit more context in the sig-testing Slack channel:
@rjsadow @Vyom-Yadav @pacoxu do you have any context on this?
I think rolling back to GCP is a fine choice, though I would be interested in understanding more about why it's not working. This would make sense to me and would explain why it's not working in EKS, since there's no previous gcloud authentication I could find. @xmudrii @BenTheElder, do we have an existing mechanism that allows EKS clients to use a specific service account, or maybe to reference a secret that contains the gcloud environment credentials? Please also note I'm making quite a few assumptions here and don't have a lot of experience with GCP or NPD, so if I'm totally off base please let me know. 😃
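For what such a mechanism could look like, here is a hedged sketch of mounting a GCP service-account key into an EKS-hosted job pod and activating it. The secret name, mount path, and key filename are illustrative assumptions, not the actual test-infra setup:

```shell
# Assumed names for illustration only.
SECRET_NAME="gcp-sa-key"
KEY_PATH="/etc/gcp/sa-key.json"

# One-time: store an exported GCP service-account key as a Kubernetes
# secret in the namespace the jobs run in.
create_secret() {
  kubectl create secret generic "${SECRET_NAME}" \
    --from-file=sa-key.json=./sa-key.json
}

# Inside the job container, with the secret mounted under /etc/gcp:
activate_credentials() {
  gcloud auth activate-service-account --key-file="${KEY_PATH}"
  gcloud auth configure-docker --quiet   # lets docker push to gcr.io
}
```

The two functions are just the shape of the flow; whether Prow exposes this via a preset or a volume mount would depend on the cluster configuration.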
@rjsadow We don't want to migrate jobs depending on any GCP resource. It doesn't make much sense to, let's say, push an image from EKS to GKE. Running it on EKS would mean higher bandwidth/traffic charges because of transferring data (in this case images) to GKE, and that's usually much more expensive than just running the job on GCP and uploading from there. That said, I believe those jobs should be reverted back to the GKE cluster.
Of course! That makes a lot of sense. Thank you as always 😄
Job
Checked the logs; the error is still the same. I dug a bit more into the test script. The test script uses GCP projects for two things:
The current setup is:
I'm not sure what the use case is for the images currently hosted there. @Random-Liu, is there a reason why we have two different buckets? I noticed that prod binaries are released in
I think we should put all images in k8s-staging-npd, because that repo is allocated in https://github.com/kubernetes/k8s.io/blob/71e636f5a165d2d29d86331a41709346b20f10eb/infra/aws/terraform/registry.k8s.io/README.md?plain=1#L4
Can we switch back to GCE?
We're using a new GCP project. To unblock, I'll revert to the original project and ensure everything else is ready before using the new GCP project. |
I reverted some jobs to use the default cluster in kubernetes/test-infra#30262. Meanwhile, I'll keep investigating to see which permissions are missing and how we can grant them.
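As a sketch of the kind of permission check involved: a 403 on `docker push` to gcr.io usually means the pushing identity lacks write access to the GCS bucket backing the registry. The service-account email below is a placeholder; only the project name comes from the error message in the title:

```shell
PROJECT="node-problem-detector-staging"
SA="prow-build@example.iam.gserviceaccount.com"   # placeholder, not the real SA

# GCR stores images in a GCS bucket named artifacts.<project>.appspot.com;
# inspect who can currently write to it.
show_iam() {
  gsutil iam get "gs://artifacts.${PROJECT}.appspot.com"
}

# Grant push access: objectAdmin on the backing bucket is sufficient
# for docker push (a project-level roles/storage.admin also works).
grant_push() {
  gsutil iam ch "serviceAccount:${SA}:roles/storage.objectAdmin" \
    "gs://artifacts.${PROJECT}.appspot.com"
}
```

This only sketches the checks; the actual grant would go through the k8s.io infra configuration rather than ad-hoc gsutil commands.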
To the EKS cluster, that is. However, it may be reasonable to go ahead and migrate to the community GKE cluster, and we eventually intend for all jobs to migrate out of the legacy cluster(s).
We've been prioritizing jobs that are cloud independent to identify the scope of what remains and to ramp up usage of the AWS resources, but it's entirely expected that not all jobs can run on AWS, and the other k8s infra cluster should be fine.
/triage important-longterm |
@mmiranda96: The label(s) could not be applied. In response to this:
@mmiranda96 what is left in this bug? Jobs seem to be passing now.
@mmiranda96 I wanted to know the current status of this issue. Can we now move jobs to the
I will update the config after kubernetes/node-problem-detector#882 is merged. |
Looks like we still have the |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
This is fixed. |
@wangzhen127: Closing this issue. In response to this:
I don't think this was really fixed, though, because these jobs are back in https://groups.google.com/a/kubernetes.io/g/dev/c/p6PAML90ZOU
You are correct. I was confused. |
@wangzhen127: Reopened this issue. In response to this:
Which jobs are failing?
ci-npd-build
pull-npd-e2e-test
pull-npd-e2e-node
Which tests are failing?
Jobs are failing: the container image is built, but the push fails with the 403 Forbidden error quoted in the issue title.
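One hedged way to reproduce the push failure locally, assuming gcloud and docker are installed and you are logged in without push rights to the staging project (the `:ci-test` tag is illustrative):

```shell
# Image path from the error in the issue title; tag is an assumption.
IMAGE="gcr.io/node-problem-detector-staging/ci/node-problem-detector:ci-test"

reproduce_push() {
  gcloud auth configure-docker --quiet   # wire docker to gcloud credentials
  docker pull busybox                    # any small image stands in for the build
  docker tag busybox "${IMAGE}"
  docker push "${IMAGE}"                 # expected: 403 Forbidden without push rights
}
```

If the same account can push from a GCP-hosted runner but not elsewhere, that points at missing credentials on the new cluster rather than at the image itself.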
Since when has it been failing?
2023-07-04 ~12:40 PDT (CI job first was in 2023-07-01 ~06:30 PDT)
Testgrid link
No response
Reason for failure (if possible)
Jobs were migrated to EKS in kubernetes/test-infra#29751; it seems that this is the culprit.
Anything else we need to know?
No response
Relevant SIG(s)
/sig node