[BUG] Backup / restore with AWS IAM Role fails with NoCredentialProviders #2718

Closed
Wykiki opened this issue Jun 21, 2021 · 6 comments
Labels
backport/1.1.2 Require to backport to 1.1.2 release branch kind/bug require/doc Require updating the longhorn.io documentation

Wykiki commented Jun 21, 2021

Describe the bug

After a Longhorn upgrade from 1.1.0 to 1.1.1, I needed to use the Backup / Restore feature with an IAM Role instead of an IAM User.

Once I had updated the appropriate secrets and set the annotation that lets the longhorn-manager pods obtain temporary credentials (via kiam), I tried to trigger a backup, but I received an error message (full output in the Log section below).

To check that the longhorn-manager pod could reach S3, I installed the AWS CLI inside the longhorn-manager pod, and from there I was able to read from and write to the S3 bucket.

  • How can I find out which credentials longhorn-manager actually uses?
  • Is longhorn-manager the only pod involved in the Backup / Restore feature?

To Reproduce

Steps to reproduce the behavior:

  1. Have a Backup / Restore with IAM User working on 1.1.0
  2. Migrate to 1.1.1
  3. Configure the Backup / Restore to use an IAM Role
  4. See error
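
For step 3, the credential secret carries the role under the key AWS_IAM_ROLE_ARN (the key named later in this thread). Below is a minimal sketch of building such a Secret manifest. The secret name, the role ARN, and the `longhorn-system` namespace are assumptions for illustration, not values taken from this report.

```python
import base64
import json


def build_backup_secret(name: str, role_arn: str,
                        namespace: str = "longhorn-system") -> dict:
    """Return a Kubernetes Secret manifest that carries AWS_IAM_ROLE_ARN."""
    # Secret "data" values must be base64-encoded strings.
    encoded = base64.b64encode(role_arn.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {"AWS_IAM_ROLE_ARN": encoded},
    }


if __name__ == "__main__":
    # Hypothetical secret name and role ARN, for illustration only.
    manifest = build_backup_secret(
        "aws-backup-secret",
        "arn:aws:iam::123456789012:role/longhorn-backup",
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`. The secret then still has to be selected as the backup target credential secret in the Longhorn settings.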

Expected behavior

Backup / Restore features working without error.

Log

time="2021-06-21T09:50:01Z" level=error msg="Error in request: fail to backup snapshot: failed to execute: /var/lib/longhorn/engine-binaries/rancher-mirrored-longhornio-longhorn-engine-v1.1.1/longhorn [--url 172.16.141.232:10000 backup create --dest s3://qualif-0.longhorn.backup-kq64d@eu-west-3/ --label KubernetesStatus={\"pvName\":\"pvc-a66d8f8c-4ad0-44b8-ae10-cfadccb055a7\",\"pvStatus\":\"Bound\",\"namespace\":\"default\",\"pvcName\":\"test-volume-backup\",\"lastPVCRefAt\":\"\",\"workloadsStatus\":[{\"podName\":\"test-volume-d7fb8fccf-jc8c8\",\"podStatus\":\"Running\",\"workloadName\":\"test-volume-d7fb8fccf\",\"workloadType\":\"ReplicaSet\"}],\"lastPodRefAt\":\"\"} f1e0d469-e33c-4b77-9b73-0b89550d9cf2], output , stderr, time=\"2021-06-21T09:50:00Z\" level=info msg=\"Backing up f1e0d469-e33c-4b77-9b73-0b89550d9cf2 on tcp://172.16.135.74:10000, to s3://xxx@yyy/\"\ntime=\"2021-06-21T09:50:01Z\" level=fatal msg=\"Error running create backup command: failed to create backup to s3://qualif-0.longhorn.backup-kq64d@eu-west-3/ for volume pvc-a66d8f8c-4ad0-44b8-ae10-cfadccb055a7: rpc error: code = Unknown desc = NoCredentialProviders: no valid providers in chain. Deprecated.\\n\\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors\"\n, error exit status 1"

Environment:

  • Longhorn version: 1.1.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher Catalog App via Cluster Explorer
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.19.7 (Rancher provisioning from existing nodes)
    • Number of management nodes in the cluster: 5
    • Number of worker nodes in the cluster: 10
  • Node config
    • OS type and version: Ubuntu 20.04
    • CPU per node: 4
    • Memory per node: 16
    • Disk type (e.g. SSD/NVMe): SSD (AWS gp3 standard settings)
    • Network bandwidth between the nodes: 10 GBps
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster: 10


@jenting jenting added this to New in Community Issue Review via automation Jun 21, 2021
Contributor

jenting commented Jun 21, 2021

> Once I updated the appropriate secrets and set the appropriate annotation to allow the longhorn-manager pods to get the appropriate temporary credentials (via kiam).

Have you followed the doc to configure AWS_IAM_ROLE_ARN in the secret? Once that is set, Longhorn adds the annotation to the longhorn-manager Pods and the instance manager replica Pods. Could you please check that all the longhorn-manager Pods and the instance manager replica Pods are annotated with iam.amazonaws.com/role?

Also, could you please check that your AWS assume-role setup is configured correctly?
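
The annotation check above can be scripted. This is a rough sketch, not an official Longhorn tool: it reads the output of `kubectl get pods -o json` and lists the pods that lack the iam.amazonaws.com/role annotation. The `longhorn-system` namespace and the pod name prefixes `longhorn-manager-` and `instance-manager-r-` are assumptions and should be adjusted to your cluster.

```python
import json
import subprocess

ROLE_ANNOTATION = "iam.amazonaws.com/role"
# Assumed pod name prefixes; adjust them to match your cluster.
POD_PREFIXES = ("longhorn-manager-", "instance-manager-r-")


def pods_missing_role(pod_list: dict) -> list:
    """Return names of matching pods that lack the role annotation."""
    missing = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        name = meta.get("name", "")
        if not name.startswith(POD_PREFIXES):
            continue  # not a pod that needs the role
        # "annotations" may be absent or null in the API output.
        annotations = meta.get("annotations") or {}
        if not annotations.get(ROLE_ANNOTATION):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumes kubectl is on PATH and Longhorn runs in longhorn-system.
    raw = subprocess.run(
        ["kubectl", "get", "pods", "-n", "longhorn-system", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for pod_name in pods_missing_role(json.loads(raw)):
        print("missing annotation:", pod_name)
```

No output means that every matching pod carries the annotation. It does not show whether kiam can actually serve credentials on the node where the pod runs.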

@jenting jenting self-assigned this Jun 21, 2021
@jenting jenting moved this from New to Pending user response in Community Issue Review Jun 21, 2021
Author

Wykiki commented Jun 21, 2021

  • AWS_IAM_ROLE_ARN configured in the secret ✅
  • instance-manager-replica pods have the annotation iam.amazonaws.com/role
  • longhorn-manager pods have the annotation iam.amazonaws.com/role
  • Assume role works as expected ✅

The hint about which pods need the Role (longhorn-manager AND instance-manager-replica) was invaluable, because our instance-manager-replica pods run on specific nodes that did not have the Kiam agent on them.

Running the Kiam agent on those nodes fixed the problem.
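
To spot this situation before a backup fails, one option is to compare where the Longhorn pods run with where the Kiam agent runs. The sketch below works on the output of `kubectl get pods --all-namespaces -o json`. The `kiam-agent` name prefix and the Longhorn pod name prefixes are assumptions about how the pods are named, so treat them as placeholders.

```python
import json
import subprocess

# Assumed name prefixes; adjust them to your deployment.
NEEDS_ROLE = ("longhorn-manager-", "instance-manager-r-")


def nodes_without_agent(pod_list: dict, agent_prefix: str = "kiam-agent") -> dict:
    """Map each node lacking a Kiam agent to the Longhorn pods running on it."""
    items = pod_list.get("items", [])
    # First pass: collect the nodes that host an agent pod.
    agent_nodes = set()
    for pod in items:
        name = pod.get("metadata", {}).get("name", "")
        if name.startswith(agent_prefix):
            agent_nodes.add(pod.get("spec", {}).get("nodeName"))
    # Second pass: find role-dependent pods scheduled on other nodes.
    uncovered = {}
    for pod in items:
        name = pod.get("metadata", {}).get("name", "")
        node = pod.get("spec", {}).get("nodeName")
        if node is None:
            continue  # pod is not scheduled yet
        if name.startswith(NEEDS_ROLE) and node not in agent_nodes:
            uncovered.setdefault(node, []).append(name)
    return uncovered


if __name__ == "__main__":
    # Assumes kubectl is on PATH with access to all namespaces.
    raw = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for node, pods in nodes_without_agent(json.loads(raw)).items():
        print(node, "has no Kiam agent but runs:", ", ".join(pods))
```

Any node printed here hosts a pod that needs the role but cannot get credentials from a local Kiam agent, which is the condition described above.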

Does the documentation explain that longhorn-manager and instance-manager-replica both need S3 access? I think I could have resolved the issue myself if this information had existed, or if I had found it.

Thanks for your really fast reply, closing it now!

@Wykiki Wykiki closed this as completed Jun 21, 2021
Contributor

jenting commented Jun 21, 2021

> Does the documentation explain that longhorn-manager and instance-manager-replica both need S3 access? I think I could have resolved the issue myself if this information had existed, or if I had found it.

Sorry, no. But we could enhance our documentation on this point. Thank you.

@jenting jenting added the require/doc Require updating the longhorn.io documentation label Jun 21, 2021
Contributor

jenting commented Jun 21, 2021

Let me reopen it. I'll close it after we enhance our documentation.

@jenting jenting reopened this Jun 21, 2021
Community Issue Review automation moved this from Pending user response to New Jun 21, 2021
@jenting jenting moved this from New to Resolved/Scheduled in Community Issue Review Jun 21, 2021
@jenting jenting added this to the Planning milestone Jun 22, 2021
longhorn-io-github-bot commented Jun 22, 2021

Pre Ready-For-Testing Checklist

  • Are the reproduce steps/test steps documented?

  • Is there a workaround for the issue? If so, is it documented?

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, have both YAML file and Chart been updated in the PR?

  • Is the backend code merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed)?

  • Which areas/issues this PR might have potential impacts on?
    Area Website documentation

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed)?

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed)?
    The Doc issue/PR is at Add a note for AWS IAM Role ARN website#329, [cherry-pick-v1.1.2] Add a note for AWS IAM Role ARN website#330, [cherry-pick-v1.2.0] Add a note for AWS IAM Role ARN website#331

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? (including backport-needed)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed)?

  • If labeled: require/manual-test-plan Has the manual test plan been documented?

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?

@innobead
Member

Closing. This is a documentation enhancement, so there is nothing to verify; the feature works as expected.

cc @longhorn/qa
