Jenkins Startup fails on AWS ECS due to secrets-manager-credentials-provider-plugin #117

Closed
thedevopsguyblog opened this issue Jul 7, 2021 · 20 comments
Labels
bug Something isn't working


@thedevopsguyblog

thedevopsguyblog commented Jul 7, 2021

Version report

https://stackoverflow.com/questions/68287374/jenkins-startup-fails-on-aws-ecs-due-to-secrets-manager-credentials-provider-plu <- also posted this.

Jenkins and plugins versions report:

Jenkins 2.289.2 (jenkins/jenkins:lts-jdk11)
I don't specify a version number during the install, so install-plugins.sh always installs the latest version.
  • What Operating System are you using (both controller, and any agents involved in the problem)?
amazonLinux2

Reproduction steps

  1. Create an AWS ECS Cluster Running amazonLinux2
  2. Create a secret in AWS Secrets Manager
  3. Using JCasC specify a secret to pull from AWS
  4. Build a Docker image (FROM jenkins/jenkins:lts-jdk11) and run '/usr/local/bin/install-plugins.sh' to install the secrets plugin; also copy the JCasC YAML file into the image.
  5. Build and upload the Docker image to AWS ECR
  6. Create a task that uses the built Docker image
  7. Deploy the task to the cluster created in step 1

Results

Expected result:

Jenkins starts up, and can access the secrets.

Actual result:

java.lang.NullPointerException
	at io.jenkins.plugins.credentials.secretsmanager.AwsSecretSource.reveal(AwsSecretSource.java:35)
	at io.jenkins.plugins.casc.SecretSourceResolver$ConfigurationContextStringLookup.lambda$lookup$ad236547$1(SecretSourceResolver.java:141)
	at io.vavr.CheckedFunction0.lambda$unchecked$52349c75$1(CheckedFunction0.java:247)
	at io.jenkins.plugins.casc.SecretSourceResolver$ConfigurationContextStringLookup.lambda$lookup$0(SecretSourceResolver.java:141)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
	at java.base/java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1632)
	at java.base/java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:127)
	at java.base/java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:502)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:488)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
	at java.base/java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:150)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:543)
	at io.jenkins.plugins.casc.SecretSourceResolver$ConfigurationContextStringLookup.lookup(SecretSourceResolver.java:143)
	at org.apache.commons.text.lookup.InterpolatorStringLookup.lookup(InterpolatorStringLookup.java:144)
	at org.apache.commons.text.StringSubstitutor.resolveVariable(StringSubstitutor.java:1067)
	at org.apache.commons.text.StringSubstitutor.substitute(StringSubstitutor.java:1433)
	at org.apache.commons.text.StringSubstitutor.substitute(StringSubstitutor.java:1308)
	at org.apache.commons.text.StringSubstitutor.replaceIn(StringSubstitutor.java:1019)
	at io.jenkins.plugins.casc.SecretSourceResolver.resolve(SecretSourceResolver.java:109)
	at io.jenkins.plugins.casc.impl.configurators.PrimitiveConfigurator.configure(PrimitiveConfigurator.java:44)
	at io.jenkins.plugins.casc.impl.configurators.DataBoundConfigurator.tryConstructor(DataBoundConfigurator.java:159)
	at io.jenkins.plugins.casc.impl.configurators.DataBoundConfigurator.instance(DataBoundConfigurator.java:76)
	at io.jenkins.plugins.casc.BaseConfigurator.configure(BaseConfigurator.java:267)
	at io.jenkins.plugins.casc.impl.configurators.DataBoundConfigurator.configure(DataBoundConfigurator.java:82)
	at io.jenkins.plugins.casc.impl.configurators.HeteroDescribableConfigurator.lambda$doConfigure$16668e2$1(HeteroDescribableConfigurator.java:277)
	at io.vavr.CheckedFunction0.lambda$unchecked$52349c75$1(CheckedFunction0.java:247)
	at io.jenkins.plugins.casc.impl.configurators.HeteroDescribableConfigurator.doConfigure(HeteroDescribableConfigurator.java:277)
	at io.jenkins.plugins.casc.impl.configurators.HeteroDescribableConfigurator.lambda$configure$2(HeteroDescribableConfigurator.java:86)
	at io.vavr.control.Option.map(Option.java:392)
	at io.jenkins.plugins.casc.impl.configurators.HeteroDescribableConfigurator.lambda$configure$3(HeteroDescribableConfigurator.java:86)
	at io.vavr.Tuple2.apply(Tuple2.java:238)
	at io.jenkins.plugins.casc.impl.configurators.HeteroDescribableConfigurator.configure(HeteroDescribableConfigurator.java:83)
	at io.jenkins.plugins.casc.impl.configurators.HeteroDescribableConfigurator.configure(HeteroDescribableConfigurator.java:55)
	at io.jenkins.plugins.casc.impl.configurators.DataBoundConfigurator.tryConstructor(DataBoundConfigurator.java:151)
	at io.jenkins.plugins.casc.impl.configurators.DataBoundConfigurator.instance(DataBoundConfigurator.java:76)
	at io.jenkins.plugins.casc.BaseConfigurator.configure(BaseConfigurator.java:267)
	at io.jenkins.plugins.casc.impl.configurators.DataBoundConfigurator.check(DataBoundConfigurator.java:100)
	at io.jenkins.plugins.casc.BaseConfigurator.configure(BaseConfigurator.java:344)
	at io.jenkins.plugins.casc.BaseConfigurator.check(BaseConfigurator.java:287)
	at io.jenkins.plugins.casc.BaseConfigurator.configure(BaseConfigurator.java:351)
	at io.jenkins.plugins.casc.BaseConfigurator.check(BaseConfigurator.java:287)
	at io.jenkins.plugins.casc.ConfigurationAsCode.lambda$checkWith$8(ConfigurationAsCode.java:777)
	at io.jenkins.plugins.casc.ConfigurationAsCode.invokeWith(ConfigurationAsCode.java:713)
	at io.jenkins.plugins.casc.ConfigurationAsCode.checkWith(ConfigurationAsCode.java:777)
	at io.jenkins.plugins.casc.ConfigurationAsCode.configureWith(ConfigurationAsCode.java:762)
	at io.jenkins.plugins.casc.ConfigurationAsCode.configureWith(ConfigurationAsCode.java:638)
	at io.jenkins.plugins.casc.ConfigurationAsCode.configure(ConfigurationAsCode.java:307)
	at io.jenkins.plugins.casc.ConfigurationAsCode.init(ConfigurationAsCode.java:299)
Caused: java.lang.reflect.InvocationTargetException
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at hudson.init.TaskMethodFinder.invoke(TaskMethodFinder.java:104)
Caused: java.lang.Error
	at hudson.init.TaskMethodFinder.invoke(TaskMethodFinder.java:110)
	at hudson.init.TaskMethodFinder$TaskImpl.run(TaskMethodFinder.java:175)
	at org.jvnet.hudson.reactor.Reactor.runTask(Reactor.java:296)
	at jenkins.model.Jenkins$5.runTask(Jenkins.java:1129)
	at org.jvnet.hudson.reactor.Reactor$2.run(Reactor.java:214)
	at org.jvnet.hudson.reactor.Reactor$Node.run(Reactor.java:117)
	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused: org.jvnet.hudson.reactor.ReactorException
	at org.jvnet.hudson.reactor.Reactor.execute(Reactor.java:282)
	at jenkins.InitReactorRunner.run(InitReactorRunner.java:49)
	at jenkins.model.Jenkins.executeReactor(Jenkins.java:1162)
	at jenkins.model.Jenkins.<init>(Jenkins.java:960)
	at hudson.model.Hudson.<init>(Hudson.java:86)
	at hudson.model.Hudson.<init>(Hudson.java:82)
	at hudson.WebAppMain$3.run(WebAppMain.java:295)
Caused: hudson.util.HudsonFailedToLoad
	at hudson.WebAppMain$3.run(WebAppMain.java:312)
thedevopsguyblog added the bug label Jul 7, 2021
@chriskilding
Contributor

First thing to check, what IAM permissions does the Jenkins principal (or if on a cluster, the nearest relevant principal) have? If Jenkins doesn't have the right permissions it will not be able to access secrets.

@thedevopsguyblog
Author

I SSH'd onto my ECS server, git cloned my Dockerfile, built and ran the image, and was able to replicate the error.
In fact, I got a more specific error message from Jenkins during startup, which was very helpful.

2021-07-08 12:10:56.861+0000 [id=28]    WARNING c.a.util.EC2MetadataUtils#getItems: Unable to retrieve the requested metadata (/latest/dynamic/instance-identity/document). Failed to connect to service endpoint: 
java.net.SocketTimeoutException: connect timed out

@chriskilding I also suspected IAM perms were the root cause, but at the time was unsure.

You may or may not know this, but where do I attach the IAM Policy?

  • The EC2 Auto Scaling group?
  • Or the ECS Task?

Any contribution is greatly appreciated.

@chriskilding
Contributor

I've not used Jenkins on ECS myself, but I imagine you could start by putting it on the ECS task definition. Either directly, or you might need to make a Jenkins IAM role with the policy, and set the role ARN on the task definition.
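As a rough sketch of what such a policy could contain, the snippet below builds a minimal IAM policy document that allows reading a single secret. The only action included is `secretsmanager:GetSecretValue`, which is the one named in the AccessDeniedException quoted later in this thread; the ARN is a hypothetical placeholder, and the function name is made up for illustration. On ECS the policy would typically go on the IAM role referenced by the task definition's `taskRoleArn`, though whether that is sufficient for this setup is an assumption.

```python
import json


def secrets_read_policy(secret_arn: str) -> dict:
    """Build a minimal IAM policy document allowing reads of one secret."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Only the action named in the AccessDeniedException in this thread.
                "Action": ["secretsmanager:GetSecretValue"],
                "Resource": secret_arn,
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical ARN, for illustration only.
    arn = "arn:aws:secretsmanager:ap-southeast-2:111122223333:secret:example-secret"
    print(json.dumps(secrets_read_policy(arn), indent=2))
```

The printed JSON can be pasted into an inline policy on whichever role the container ends up assuming.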

@thedevopsguyblog
Author

Upon further investigation, the secrets-manager plugin wants access to the instance identity document. This metadata is only available on EC2 instances.

So anyone who tries to use JCasC + ECS + secrets-manager-credentials-provider-plugin is going to run into this issue, because the containerized application can't natively access the host's metadata endpoint.

I'll see if I can get Jenkins on ECS to access the host's metadata endpoint.

This will also be an issue on EKS.

@thedevopsguyblog
Author

I can confirm that IAM is not the root cause. If the plugin fails to pull secrets due to insufficient permissions in the IAM policy, the error message looks something like this:

2021-07-09 06:23:55.581+0000 [id=29]    SEVERE  jenkins.InitReactorRunner$1#onTaskFailed: Failed ConfigurationAsCode.init
com.amazonaws.services.secretsmanager.model.AWSSecretsManagerException: User: arn:aws:sts::xxxx:assumed-role/SERVERNAME-InstanceRoleID-number is not authorized to perform: secretsmanager:GetSecretValue on resource: sampleSecretVaule (Service: AWSSecretsManager; Status Code: 400; Error Code: AccessDeniedException; 

Instead, the error is this:

2021-07-08 12:10:56.861+0000 [id=28]    WARNING c.a.util.EC2MetadataUtils#getItems: Unable to retrieve the requested metadata (/latest/dynamic/instance-identity/document). Failed to connect to service endpoint: 
java.net.SocketTimeoutException: connect timed out
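To make the distinction between these two failure modes concrete, here is a small sketch that sorts a chunk of Jenkins log text into one of them. It only does substring matching on the fragments visible in the two logs above; the function name and the returned labels are invented for this example, and other failure modes would simply come back as "unknown".

```python
def classify_startup_failure(log_text: str) -> str:
    """Guess the failure mode from Jenkins startup log text."""
    # IAM problems surface as an AccessDeniedException from Secrets Manager.
    if "AccessDeniedException" in log_text:
        return "iam-permissions"
    # The failure in this issue: the SDK cannot reach the EC2 metadata endpoint.
    if "EC2MetadataUtils" in log_text and "connect timed out" in log_text:
        return "metadata-unreachable"
    return "unknown"


if __name__ == "__main__":
    sample = (
        "WARNING c.a.util.EC2MetadataUtils#getItems: Unable to retrieve the "
        "requested metadata\njava.net.SocketTimeoutException: connect timed out"
    )
    print(classify_startup_failure(sample))
```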

In an attempt to mimic AWS ECS I did the following...

  1. Created a new server using the latest amazonLinux2 AMI
  2. Installed awscli
  3. Authenticated with AWS ECR (aws ecr get-login-password) and pulled down my custom Jenkins image
  4. Started the container, and Jenkins came fully up and running

The container started because it could access 169.254.169.254/latest/dynamic/instance-identity/document, which returns a JSON object containing the region and availabilityZone. I'm assuming the plugin needs this to figure out where the secrets are.

When starting Jenkins in AWS ECS, that endpoint is not available.
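For anyone wanting to check this from inside a container, the sketch below probes the same endpoint with a short timeout and pulls the region out of the returned JSON. It assumes the endpoint answers unauthenticated requests (IMDSv1 style); an instance that enforces IMDSv2 would need a session token first, which is not handled here. The function names are made up for this example.

```python
import json
import urllib.request
from typing import Optional

IDENTITY_URL = "http://169.254.169.254/latest/dynamic/instance-identity/document"


def region_from_identity_document(document: str) -> Optional[str]:
    """Extract the 'region' field from an instance identity document."""
    return json.loads(document).get("region")


def fetch_region(timeout_seconds: float = 2.0) -> Optional[str]:
    """Return the region reported by instance metadata, or None if unreachable."""
    try:
        with urllib.request.urlopen(IDENTITY_URL, timeout=timeout_seconds) as resp:
            return region_from_identity_document(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # Covers connection timeouts, refused connections, and malformed JSON.
        return None


if __name__ == "__main__":
    print(fetch_region())
```

A `None` result from inside the Jenkins container, alongside a real region when run on the host, would point at the container being cut off from the metadata endpoint rather than at IAM.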

Not really sure how to proceed now.

@chriskilding
Contributor

I believe we have teams in our workplace that do use the plugin on ECS or EKS, so I'll ask around and see what they suggest

@chriskilding
Contributor

As a short-term suggestion, you could try setting AWS_DEFAULT_REGION on the Jenkins container in ECS and see if that helps.

@chriskilding
Contributor

It's also possible that there is simply a bug to do with this in the AWS Java SDK, that may since have been fixed. Could you post which version of the Jenkins AWS Java SDK plugin you have?

@thedevopsguyblog
Author

thedevopsguyblog commented Jul 13, 2021

It's also possible that there is simply a bug to do with this in the AWS Java SDK, that may since have been fixed. Could you post which version of the Jenkins AWS Java SDK plugin you have?

I'm running Jenkins in docker using the jenkins/jenkins:lts image.

jenkins@12bf88526df6:/$ java -version
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.292-b10, mixed mode)

AWS SDK Plugin

Amazon Web Services SDK - 1.11.995

@chriskilding
Contributor

I'm looking again at the original NullPointerException stack trace you posted. Most likely the client in AwsSecretSource (that's the AWS SDK object) is null when the GetSecretValue happens.

The client is created in an init() method when Jenkins starts up. If this fails the client won't get made and an exception is logged.

Could you scan back in your logs and find a message around the time of that stack trace that starts with:

"Could not set up AWS Secrets Manager client. Reason:"

And post it?
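A quick way to do that scan, sketched here with plain string matching: the marker is the message prefix quoted above, while the function name and the way the log is read are assumptions for illustration.

```python
from typing import List

MARKER = "Could not set up AWS Secrets Manager client. Reason:"


def find_client_setup_failures(log_text: str) -> List[str]:
    """Return every log line that contains the client setup failure marker."""
    return [line.strip() for line in log_text.splitlines() if MARKER in line]


if __name__ == "__main__":
    import sys

    # Usage: python find_failures.py < jenkins.log
    for hit in find_client_setup_failures(sys.stdin.read()):
        print(hit)
```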

@chriskilding
Contributor

(That message should be logged at WARNING level btw)

@thedevopsguyblog
Author

Hey @chriskilding, I'll have a look at this soon.
We're just in the middle of migrating Jenkins to ECS, and we've switched to using Parameter Store for now.

Also, Parameter Store is free, but Secrets Manager costs $0.40 per secret per month.
Financially, Parameter Store seems cheaper.
Why might one use Secrets Manager over Parameter Store (apart from automated rotation)?

@chriskilding
Contributor

It is a good question. AWS themselves seem to have created two services that sort of overlap, but not quite. I've never seen an explanation for why they did this.

I compared them a while back in #72. From that table, you'd use Secrets Manager if you need:

  • Binary secrets
  • A larger max secret size

Beyond that, we'd have to know about the design intent of each service.

@thedevopsguyblog
Author

I was also running into this issue with other plugins that rely on the AWS SDK.
In my Dockerfile I added this line; the credentials file contains region=ap-southeast-2

COPY --chown=jenkins:jenkins credentials /var/jenkins_home/.aws/credentials

It's annoying because now I'm locked into one region; I could see how this might cause issues for larger organisations.
It seems the AWS plugins aren't looking at the ECS metadata.

I wonder how other people are solving this problem? @chriskilding

@chriskilding
Contributor

To authenticate with AWS, the plugin is merely creating a standard version of the Secrets Manager client, which uses the DefaultAWSCredentialsProviderChain under the hood.

This is what any other user of the AWS Java SDK (V1) would do - it's not specific to Jenkins plugins.

Per AWS docs the chain looks for credentials in this order:

  • Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (RECOMMENDED since they are recognized by all the AWS SDKs and CLI except for .NET), or AWS_ACCESS_KEY and AWS_SECRET_KEY (only recognized by Java SDK)
  • Java System Properties - aws.accessKeyId and aws.secretKey
  • Web Identity Token credentials from the environment or container
  • Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI
  • Credentials delivered through the Amazon EC2 container service if the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable is set and the security manager has permission to access the variable
  • Instance profile credentials delivered through the Amazon EC2 metadata service

The last 2 approaches on the list (any form of Amazon container lookup, be that EC2 or ECS) are handled in EC2ContainerCredentialsProviderWrapper. The docs for that say:

AWSCredentialsProvider that loads credentials from an Amazon Container (e.g. EC2). Credentials are resolved in the following order:

  1. If environment variable "AWS_CONTAINER_CREDENTIALS_RELATIVE_URI" is set (typically on EC2) it is used to hit the metadata service at the following endpoint: http://169.254.170.2
  2. If environment variable "AWS_CONTAINER_CREDENTIALS_FULL_URI" is set it is used to hit a metadata service at that URI. Optionally an authorization token can be included in the "Authorization" header of the request by setting the "AWS_CONTAINER_AUTHORIZATION_TOKEN" environment variable.
  3. If neither of the above environment variables are specified credentials are attempted to be loaded from Amazon EC2 Instance Metadata Service using the InstanceProfileCredentialsProvider.

The fact that in your case it's falling through to the EC2 metadata service suggests the ECS code branches didn't supply a credential. Could you check whether one of those AWS_CONTAINER_* environment variables is set (or, if not, set it anyway for Jenkins as an override) and see if that changes things?
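The decision described in the quoted docs can be modelled in a few lines, which may help when checking what a given container environment would lead to. This is a sketch of the documented order only, not the SDK's actual code; the function name and the "ec2-instance-metadata" label are invented for this example.

```python
from typing import Mapping


def container_credentials_source(env: Mapping[str, str]) -> str:
    """Mirror the documented lookup order and report where credentials come from."""
    relative = env.get("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI")
    if relative:
        # The relative URI is resolved against the container metadata endpoint.
        return "http://169.254.170.2" + relative
    full = env.get("AWS_CONTAINER_CREDENTIALS_FULL_URI")
    if full:
        return full
    # Neither variable set: fall through to EC2 instance metadata.
    return "ec2-instance-metadata"


if __name__ == "__main__":
    import os

    print(container_credentials_source(os.environ))
```

Running it inside the Jenkins container shows at a glance whether either of the two variables is present there.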

@kbratanis

Not sure if this was mentioned: several online guides for Jenkins + ECS include a guideline to block access from the container to the instance metadata. Here are the instructions to do so: https://aws.amazon.com/premiumsupport/knowledge-center/ecs-container-ec2-metadata/

I would recommend checking your ECS stack to see whether there are similar instructions in the USER_DATA of the Launch Configuration or Launch Template used for launching ECS Container Instances.
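As a rough aid for that check, the sketch below scans a user-data script for an iptables rule that drops traffic to 169.254.169.254. It assumes the block was implemented with an iptables DROP rule on a single line, which is one common way of doing it; other mechanisms would not be detected, and the function name is made up for this example.

```python
def blocks_instance_metadata(user_data: str) -> bool:
    """Return True if user data contains an iptables DROP rule for the metadata IP."""
    for raw in user_data.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            # Skip comments and the shebang line.
            continue
        if "iptables" in line and "169.254.169.254" in line and "DROP" in line:
            return True
    return False


if __name__ == "__main__":
    import sys

    # Usage: python check_user_data.py < user-data.sh
    print(blocks_instance_metadata(sys.stdin.read()))
```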

@jairov4

jairov4 commented Oct 6, 2021

Hi, I had the same problem that @thedevopsguyblog noted, because that is my exact stack: JCasC + ECS + secrets-manager-credentials-provider-plugin.
I have two experiences to share:

  • It was not happening to me on Fargate, only when I switched to an EC2 cluster
  • @chriskilding's suggestion of adding AWS_DEFAULT_REGION to the task definition didn't work, but it works with AWS_REGION! 💯
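One plausible explanation for the AWS_REGION versus AWS_DEFAULT_REGION difference is the order in which the AWS SDK for Java (V1) looks for a region. The sketch below models that order as I understand it: the AWS_REGION environment variable, then the aws.region system property, then the shared profile file, then EC2 instance metadata. Treat the order as an assumption to verify against the SDK version in use; the function and its parameters are invented for illustration.

```python
from typing import Mapping, Optional


def resolve_region(
    env: Mapping[str, str],
    system_properties: Optional[Mapping[str, str]] = None,
    profile_region: Optional[str] = None,
) -> Optional[str]:
    """Return the first region found, or None if every local source is empty."""
    props = system_properties or {}
    # 1. Environment variable. Note: AWS_REGION, not AWS_DEFAULT_REGION.
    if env.get("AWS_REGION"):
        return env["AWS_REGION"]
    # 2. JVM system property.
    if props.get("aws.region"):
        return props["aws.region"]
    # 3. Region from the shared profile file, passed in pre-parsed here.
    if profile_region:
        return profile_region
    # 4. Nothing local: the SDK would now query EC2 instance metadata,
    #    which is the call that times out in this thread.
    return None


if __name__ == "__main__":
    print(resolve_region({"AWS_DEFAULT_REGION": "ap-southeast-2"}))  # no local region
    print(resolve_region({"AWS_REGION": "ap-southeast-2"}))
```

Under this model, setting only AWS_DEFAULT_REGION leaves the lookup to fall through to instance metadata, which would match the behaviour reported here.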

@chriskilding
Contributor

Great! If setting AWS_REGION is a solution that consistently works for ECS, we can add that to the README.

(It's a little strange that a common AWS_ environment variable wouldn't be set by default in an AWS environment though.)

@thedevopsguyblog
Author

Hi @chriskilding, I will close this issue.

As you mentioned, it's not the plugin that's at fault; it's the AWS SDK behaving weirdly. It also seems that you will only run into this issue if you are using an EC2 cluster with ECS.

@jairov4 has confirmed the workaround.

@chriskilding
Contributor

Great, I've also added the workaround (the note about setting AWS_REGION manually) to the authentication guide.
