
NPD jobs are failing: "failed to push gcr.io/node-problem-detector-staging/ci/node-problem-detector [...] 403 Forbidden" #119211

Open
mmiranda96 opened this issue Jul 10, 2023 · 26 comments
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@mmiranda96
Contributor

Which jobs are failing?

Job name Config source Testgrid (or job history) link
ci-npd-build Source https://testgrid.k8s.io/sig-node-node-problem-detector#ci-npd-build
pull-npd-e2e-test Source https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-npd-e2e-test
pull-npd-e2e-node Source https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-npd-e2e-node

Which tests are failing?

Jobs are failing early: the container image builds successfully, but the push fails with the following error:

#33 pushing layers
#33 ...
#34 [auth] node-problem-detector-staging/ci/node-problem-detector:pull,push token for gcr.io
#34 DONE 0.0s
#33 exporting to image
#33 pushing layers 1.4s done
#33 ERROR: failed to push gcr.io/node-problem-detector-staging/ci/node-problem-detector:v0.8.13-44-g5558643-20230710.1614: failed to authorize: failed to fetch oauth token: unexpected status: 403 Forbidden
------
 > exporting to image:
------
ERROR: failed to solve: failed to push gcr.io/node-problem-detector-staging/ci/node-problem-detector:v0.8.13-44-g5558643-20230710.1614: failed to authorize: failed to fetch oauth token: unexpected status: 403 Forbidden
make: *** [Makefile:270: push-container] Error 1

Since when has it been failing?

2023-07-04 ~12:40 PDT (CI job first was in 2023-07-01 ~06:30 PDT)

Testgrid link

No response

Reason for failure (if possible)

Jobs were migrated to EKS in kubernetes/test-infra#29751; this appears to be the culprit.

Anything else we need to know?

No response

Relevant SIG(s)

/sig node

@mmiranda96 mmiranda96 added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Jul 10, 2023
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 10, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mmiranda96
Contributor Author

I have two potential solutions to this issue, and I've asked for a bit more context in the #sig-testing Slack channel:

  1. Revert to using the previous GCP cluster.
  2. Fix the permission issue in EKS.

@rjsadow @Vyom-Yadav @pacoxu do you have any context on this?

@rjsadow
Contributor

rjsadow commented Jul 10, 2023

I think rolling back to GCP is a fine choice, though I would be interested in understanding more about why it's not working. Looking at k/npd, I'm guessing it's this line that's causing the issue. My understanding is that gcloud auth configure-docker relies on an existing gcloud login. I'm guessing (though not 100% sure) that GKE nodes are already automatically authenticated to their environments through their service accounts.
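For context: gcloud auth configure-docker doesn't store credentials itself; it registers gcloud as a Docker credential helper by writing an entry roughly like this to ~/.docker/config.json (a sketch of the typical result; the actual file may list more registries):

```json
{
  "credHelpers": {
    "gcr.io": "gcloud",
    "us.gcr.io": "gcloud"
  }
}
```

At push time Docker shells out to the docker-credential-gcloud helper, which mints a short-lived token from whatever gcloud credentials are active. With no prior gcloud auth login and no ambient service account, there's nothing to mint a token from, which would be consistent with the 403 above.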

This would make sense to me and explain why it's not working on EKS, since there's no prior gcloud authentication that I could find. @xmudrii @BenTheElder, do we have an existing mechanism that allows EKS clients to use a specific service account, or maybe reference a secret containing the gcloud credentials?

Please also note that I'm making quite a few assumptions here and don't have a lot of experience with GCP or NPD, so if I'm totally off base please let me know. 😃

@xmudrii
Member

xmudrii commented Jul 10, 2023

@rjsadow We don't want to migrate jobs that depend on any GCP resource. It doesn't make much sense to, say, push an image from an EKS cluster to GCR. Running the job on EKS would mean higher bandwidth/traffic charges for transferring data (in this case, images) to GCP, and that's usually much more expensive than just running the job on GCP and uploading from there. That said, I believe those jobs should be reverted back to the GKE cluster.

@rjsadow
Contributor

rjsadow commented Jul 10, 2023

Of course! That makes a lot of sense. Thank you as always 😄

@mmiranda96
Contributor Author

Job pull-npd-e2e-node is still failing. It seems that the service account in cluster k8s-infra-prow-build doesn't have permission to push to bucket gs://node-problem-detector-staging. We probably want to grant it permissions, since other jobs depend on it (like ci-npd-build).
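If we go the permissions route, the fix would roughly be an IAM binding on the bucket that grants the job's service account object write access. A sketch of such a policy fragment (the service-account name below is a placeholder, not the real account):

```json
{
  "bindings": [
    {
      "role": "roles/storage.objectAdmin",
      "members": [
        "serviceAccount:PLACEHOLDER-SA@PLACEHOLDER-PROJECT.iam.gserviceaccount.com"
      ]
    }
  ]
}
```

A binding like this can be applied with gsutil iam set (or gsutil iam ch for a single member). Note that pushing to a gcr.io registry also requires write access to the registry's backing bucket, which for gcr.io is typically artifacts.&lt;project&gt;.appspot.com.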

@mmiranda96
Contributor Author

mmiranda96 commented Jul 13, 2023

I checked the logs; the error is still the same. I dug a bit deeper into the test script.

The test script uses GCP projects for two things:

  • Uploading env configuration and tar files to a GCS bucket, with a 7-day TTL.
  • Uploading container images to a GCR registry (under the pr or ci subdirectory, depending on the job).

The current setup is:

  • Project node-problem-detector-staging (currently used by PR/CI jobs):
    • GCR registry (gcr.io/node-problem-detector-staging) has the two directories (screenshot 1)
    • GCS bucket (gs://node-problem-detector-staging) exists and should hold tar/env files (none exist due to the 7-day TTL)
  • Project k8s-staging-npd (unsure what it's used for):
    • GCR registry (gcr.io/k8s-staging-npd/node-problem-detector) has no nested directories; it only contains images (screenshot 2)
    • GCS bucket (gs://k8s-staging-npd) is the same as above: it exists and should hold tar/env files (none exist due to the 7-day TTL)
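For reference, a 7-day object TTL on a GCS bucket usually comes from a lifecycle rule shaped like this (a sketch of the likely configuration; I haven't pulled the actual bucket config):

```json
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 7}
    }
  ]
}
```

The rule on either bucket can be inspected with gsutil lifecycle get gs://&lt;bucket&gt; and applied with gsutil lifecycle set.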

I'm not sure what the images currently hosted in gcr.io/k8s-staging-npd are used for. But if possible, I would argue that we can use that project for the original use case of node-problem-detector-staging: hosting PR and CI images. Similarly, we can use the project's bucket to store tar/env files.

@Random-Liu is there a reason why we have two different buckets? I noticed that prod binaries are released in gcr.io/kubernetes-release/node-problem-detector.


Screenshot 1: (image attachment)

Screenshot 2: (image attachment)

@MartinForReal
Contributor

I think we should put all images in k8s-staging-npd, because that's where the repo is allocated; see https://github.com/kubernetes/k8s.io/blob/71e636f5a165d2d29d86331a41709346b20f10eb/infra/aws/terraform/registry.k8s.io/README.md?plain=1#L4

@MartinForReal
Contributor

Can we switch back to GCE?

@mmiranda96
Contributor Author

We're using a new GCP project. To unblock, I'll revert to the original project and ensure everything else is ready before using the new GCP project.

@mmiranda96
Contributor Author

I reverted some jobs to use the default cluster in kubernetes/test-infra#30262. Meanwhile, I'll keep investigating which permissions are missing and how we can grant them.

@BenTheElder
Member

[...] We don't want to migrate jobs depending on any GCP resource. [...]

To the EKS cluster, that is. However, it may be reasonable to go ahead and migrate to the community GKE cluster; we eventually intend for all jobs to migrate out of the legacy cluster(s).

@BenTheElder
Member

We've been prioritizing jobs that are cloud-independent, both to identify the scope of what remains and to ramp up usage of the AWS resources. But it's entirely expected that not all jobs can run on AWS, and the other k8s infra cluster should be fine.

@mmiranda96
Contributor Author

/triage important-longterm

@k8s-ci-robot
Contributor

@mmiranda96: The label(s) triage/important-longterm cannot be applied, because the repository doesn't have them.

In response to this:

/triage important-longterm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@SergeyKanzhelev
Member

@mmiranda96 what is left in this bug? Jobs seem to be passing now.

@SD-13

SD-13 commented Feb 9, 2024

@mmiranda96 I wanted to know the current status of this issue. Can we now move jobs to the k8s-infra-prow-build cluster? If not, what remains to set up the GKE cluster?

@SergeyKanzhelev SergeyKanzhelev moved this from Issues - In progress to Issues - To do in SIG Node CI/Test Board Mar 4, 2024
@wangzhen127
Member

I will update the config after kubernetes/node-problem-detector#882 is merged.

@wangzhen127
Member

Looks like we still have the 403 Forbidden issue. I wonder why ci-npd-e2e-kubernetes-gce-gci can use k8s-infra-prow-build?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 9, 2024
@wangzhen127
Member

This is fixed.
/close

@k8s-ci-robot
Contributor

@wangzhen127: Closing this issue.

In response to this:

This is fixed.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

SIG Node CI/Test Board automation moved this from Issues - To do to Done Jun 10, 2024
@BenTheElder
Member

I don't think this was really fixed though, because these jobs are back in cluster: default which is subject to removal in August.

https://groups.google.com/a/kubernetes.io/g/dev/c/p6PAML90ZOU


@wangzhen127
Copy link
Member

You are correct. I was confused.
/reopen

@k8s-ci-robot k8s-ci-robot reopened this Jun 11, 2024
@k8s-ci-robot
Contributor

@wangzhen127: Reopened this issue.

In response to this:

You are correct. I was confused.
/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

SIG Node CI/Test Board automation moved this from Done to Issues - In progress Jun 11, 2024