
NPD jobs are failing: "failed to push gcr.io/node-problem-detector-staging/ci/node-problem-detector [...] 403 Forbidden" #119211

Open
mmiranda96 opened this issue Jul 10, 2023 · 26 comments
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@mmiranda96
Contributor

Which jobs are failing?

Job name Config source Testgrid (or job history) link
ci-npd-build Source https://testgrid.k8s.io/sig-node-node-problem-detector#ci-npd-build
pull-npd-e2e-test Source https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-npd-e2e-test
pull-npd-e2e-node Source https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-npd-e2e-node

Which tests are failing?

Jobs are failing early: the container image builds successfully, but the push fails with the following error:

#33 pushing layers
#33 ...
#34 [auth] node-problem-detector-staging/ci/node-problem-detector:pull,push token for gcr.io
#34 DONE 0.0s
#33 exporting to image
#33 pushing layers 1.4s done
#33 ERROR: failed to push gcr.io/node-problem-detector-staging/ci/node-problem-detector:v0.8.13-44-g5558643-20230710.1614: failed to authorize: failed to fetch oauth token: unexpected status: 403 Forbidden
------
 > exporting to image:
------
ERROR: failed to solve: failed to push gcr.io/node-problem-detector-staging/ci/node-problem-detector:v0.8.13-44-g5558643-20230710.1614: failed to authorize: failed to fetch oauth token: unexpected status: 403 Forbidden
make: *** [Makefile:270: push-container] Error 1

Since when has it been failing?

2023-07-04 ~12:40 PDT (CI job first was in 2023-07-01 ~06:30 PDT)

Testgrid link

No response

Reason for failure (if possible)

Jobs were migrated to EKS in kubernetes/test-infra#29751; this appears to be the culprit.

Anything else we need to know?

No response

Relevant SIG(s)

/sig node

@mmiranda96 mmiranda96 added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Jul 10, 2023
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 10, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mmiranda96
Contributor Author

I have two potential solutions to this issue, and I've asked for a bit more context in the #sig-testing Slack channel:

  1. Revert to using the previous GCP cluster.
  2. Fix the permission issue in EKS.

@rjsadow @Vyom-Yadav @pacoxu do you have any context on this?

@rjsadow
Contributor

rjsadow commented Jul 10, 2023

I think rolling back to GCP is a fine choice, though I would be interested in understanding more about why it's not working. Looking at k/npd, I'm guessing it's this line that's causing the issue. My understanding is that gcloud auth configure-docker relies on an existing gcloud login. I'm guessing (though not 100% sure) that GKE nodes are already automatically authenticated to their environments through their service accounts.
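For context: gcloud auth configure-docker doesn't store credentials itself; it registers gcloud as a Docker credential helper by writing an entry roughly like this to ~/.docker/config.json (a sketch of the typical result; the actual file may list more registries):

```json
{
  "credHelpers": {
    "gcr.io": "gcloud",
    "us.gcr.io": "gcloud"
  }
}
```

At push time Docker shells out to the docker-credential-gcloud helper, which mints a short-lived token from whatever gcloud credentials are active. With no prior gcloud auth login and no ambient service account, there's nothing to mint a token from, which would be consistent with the 403 above.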

This would make sense to me and explain why it's not working on EKS, since there's no prior gcloud authentication that I could find. @xmudrii @BenTheElder, do we have an existing mechanism that allows EKS clients to use a specific service account, or maybe reference a secret containing the gcloud credentials?

Please also note that I'm making quite a few assumptions here and don't have a lot of experience with GCP or NPD, so if I'm totally off base please let me know. 😃

@xmudrii
Member

xmudrii commented Jul 10, 2023

@rjsadow We don't want to migrate jobs that depend on any GCP resource. It doesn't make much sense to, say, push an image from an EKS cluster to GCR. Running the job on EKS would mean higher bandwidth/traffic charges for transferring data (in this case, images) to GCP, and that's usually much more expensive than just running the job on GCP and uploading from there. That said, I believe those jobs should be reverted back to the GKE cluster.

@rjsadow
Contributor

rjsadow commented Jul 10, 2023

Of course! That makes a lot of sense. Thank you as always 😄

@mmiranda96
Contributor Author

Job pull-npd-e2e-node is still failing. It seems that the service account in cluster k8s-infra-prow-build doesn't have permission to push to bucket gs://node-problem-detector-staging. We probably want to grant it permissions, since other jobs depend on it (like ci-npd-build).
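If we go the permissions route, the fix would roughly be an IAM binding on the bucket that grants the job's service account object write access. A sketch of such a policy fragment (the service-account name below is a placeholder, not the real account):

```json
{
  "bindings": [
    {
      "role": "roles/storage.objectAdmin",
      "members": [
        "serviceAccount:PLACEHOLDER-SA@PLACEHOLDER-PROJECT.iam.gserviceaccount.com"
      ]
    }
  ]
}
```

A binding like this can be applied with gsutil iam set (or gsutil iam ch for a single member). Note that pushing to a gcr.io registry also requires write access to the registry's backing bucket, which for gcr.io is typically artifacts.&lt;project&gt;.appspot.com.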

@mmiranda96
Contributor Author

mmiranda96 commented Jul 13, 2023

I checked the logs; the error is still the same. I dug a bit deeper into the test script.

The test script uses GCP projects for two things:

  • Uploading env configuration and tar files to a GCS bucket, with a 7-day TTL.
  • Uploading container images to a GCR registry (under the pr or ci subdirectory, depending on the job).

The current setup is:

  • Project node-problem-detector-staging (currently used by PR/CI jobs):
    • GCR registry (gcr.io/node-problem-detector-staging) has the two directories (screenshot 1)
    • GCS bucket (gs://node-problem-detector-staging) exists and should hold tar/env files (none exist due to the 7-day TTL)
  • Project k8s-staging-npd (unsure what it's used for):
    • GCR registry (gcr.io/k8s-staging-npd/node-problem-detector) has no nested directories; it only contains images (screenshot 2)
    • GCS bucket (gs://k8s-staging-npd) is the same as above: it exists and should hold tar/env files (none exist due to the 7-day TTL)
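For reference, a 7-day object TTL on a GCS bucket usually comes from a lifecycle rule shaped like this (a sketch of the likely configuration; I haven't pulled the actual bucket config):

```json
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 7}
    }
  ]
}
```

The rule on either bucket can be inspected with gsutil lifecycle get gs://&lt;bucket&gt; and applied with gsutil lifecycle set.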

I'm not sure what the images currently hosted in gcr.io/k8s-staging-npd are used for. But if possible, I would argue that we can use that project for the original use case of node-problem-detector-staging: hosting PR and CI images. Similarly, we can use the project's bucket to store tar/env files.

@Random-Liu is there a reason why we have two different buckets? I noticed that prod binaries are released in gcr.io/kubernetes-release/node-problem-detector.


Screenshot 1: (image attachment)

Screenshot 2: (image attachment)

@MartinForReal
Contributor

I think we should put all images in k8s-staging-npd, because that's where the repo is allocated; see https://github.com/kubernetes/k8s.io/blob/71e636f5a165d2d29d86331a41709346b20f10eb/infra/aws/terraform/registry.k8s.io/README.md?plain=1#L4

@MartinForReal
Contributor

Can we switch back to GCE?

@mmiranda96
Contributor Author

We're using a new GCP project. To unblock, I'll revert to the original project and ensure everything else is ready before using the new GCP project.

@mmiranda96
Contributor Author

I reverted some jobs to use the default cluster in kubernetes/test-infra#30262. Meanwhile, I'll keep investigating which permissions are missing and how we can grant them.

@BenTheElder
Member

[...] We don't want to migrate jobs depending on any GCP resource. [...]

To the EKS cluster, that is. However, it may be reasonable to go ahead and migrate to the community GKE cluster; we eventually intend for all jobs to migrate out of the legacy cluster(s).

@BenTheElder
Member

We've been prioritizing jobs that are cloud-independent, both to identify the scope of what remains and to ramp up usage of the AWS resources. But it's entirely expected that not all jobs can run on AWS, and the other k8s infra cluster should be fine.

@mmiranda96
Contributor Author

/triage important-longterm

@k8s-ci-robot
Contributor

@mmiranda96: The label(s) triage/important-longterm cannot be applied, because the repository doesn't have them.

In response to this:

/triage important-longterm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@SergeyKanzhelev
Member

@mmiranda96 what is left in this bug? Jobs seem to be passing now.

@SD-13

SD-13 commented Feb 9, 2024

@mmiranda96 I wanted to know the current status of this issue. Can we now move jobs to the k8s-infra-prow-build cluster? If not, what remains to set up the GKE cluster?

@SergeyKanzhelev SergeyKanzhelev moved this from Issues - In progress to Issues - To do in SIG Node CI/Test Board Mar 4, 2024
@wangzhen127
Member

I will update the config after kubernetes/node-problem-detector#882 is merged.

@wangzhen127
Member

Looks like we still have the 403 Forbidden issue. I wonder why ci-npd-e2e-kubernetes-gce-gci can use k8s-infra-prow-build?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 9, 2024
@wangzhen127
Member

This is fixed.
/close

@k8s-ci-robot
Contributor

@wangzhen127: Closing this issue.

In response to this:

This is fixed.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

SIG Node CI/Test Board automation moved this from Issues - To do to Done Jun 10, 2024
@BenTheElder
Member

I don't think this was really fixed though, because these jobs are back in cluster: default which is subject to removal in August.

https://groups.google.com/a/kubernetes.io/g/dev/c/p6PAML90ZOU


@wangzhen127
Copy link
Member

You are correct. I was confused.
/reopen

@k8s-ci-robot k8s-ci-robot reopened this Jun 11, 2024
@k8s-ci-robot
Contributor

@wangzhen127: Reopened this issue.

In response to this:

You are correct. I was confused.
/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

SIG Node CI/Test Board automation moved this from Done to Issues - In progress Jun 11, 2024