Containers are spawned but appear offline and build doesn't start. #740

Closed
phreakadelle opened this issue Jul 5, 2019 · 24 comments

Comments

@phreakadelle

phreakadelle commented Jul 5, 2019

For the past few days I have been facing an issue with my Jenkins. We have used the DockerCloud plugin with the same containers for a long time, but it suddenly stopped working. I am unable to find out why it stopped, so I am kindly asking for support.

Whenever I trigger a build, a new container is started. I can see the Docker container being launched on the DockerHost and the node appearing in the Jenkins master, but the node is shown as offline.

I am able to do a manual SSH login from the Jenkins master into the container on the DockerHost with the private key.

The DockerCloud plugin keeps launching containers until the maximum of 5 is reached.
[Attachments: Logfile.txt, two screenshots]

As said, it worked for a long time with the same containers. The container images have not been updated.

I am using Jenkins 2.176.1 and Docker Plugin 1.1.6

@pjdarton
Member

pjdarton commented Jul 5, 2019

The docker plugin hasn't been updated "for a long time" either; I don't think you can lay any blame on that changing 😁
If you're using SSH to connect then that's probably where you should be looking for the cause - SSH is designed to be secure, and "secure" generally means "doesn't give helpful error messages when things aren't quite correct", so it's very easy to overlook problems.
I'd suggest you also check your cloud configuration is the way you expect - where I work, we've had issues where people using a mouse's scroll-wheel to scroll up & down the page find that the browser "helpfully" transferred focus to a drop-down-list-box as it scrolled past and applied scroll up/down event to that, changing its selection without any user knowledge. We've also had "fun" with some browsers "helpfully" wiping and/or auto-filling fields on forms (even ones not on display) without user knowledge too (which, when it's the LDAP section of the security page, ends up locking everyone out of Jenkins and requiring manual config hackery to restore access - fun times).

It's possible that your networking/security has changed - could you be using selinux or similar where Jenkins isn't permitted to SSH to the containers?

You should also check the slave log on the "offline" slaves to see what Jenkins is reporting. That can be highly instructive.
If you're quick, you could SSH (or docker-exec) into one of the slaves (before Jenkins kills it) and see what's being shown at that end of the conversation too.

If you have shell access to your Jenkins host, take a look at the last-modified date of the config.xml file in the Jenkins home (and if you don't, use the groovy console to do the same) - if that file changed when it stopped working (or afterwards) then that's a clue.
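For anyone who wants to script that timestamp check, here is a minimal sketch in plain Java. It assumes the Jenkins home is /var/lib/jenkins unless another path is passed as the first argument; the class and method names are made up for illustration and are not part of Jenkins.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Instant;

class ConfigTimestamp {
    /** Returns the last-modified time of config.xml inside the given Jenkins home. */
    static Instant lastModified(Path jenkinsHome) {
        Path config = jenkinsHome.resolve("config.xml");
        try {
            return Files.getLastModifiedTime(config).toInstant();
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot read " + config, e);
        }
    }

    public static void main(String[] args) {
        // Assumption: /var/lib/jenkins is the Jenkins home on this host.
        Path home = Paths.get(args.length > 0 ? args[0] : "/var/lib/jenkins");
        System.out.println("config.xml last modified: " + lastModified(home));
    }
}
```

Compare the printed time with the moment the agents stopped connecting; a change at or after that moment is the clue mentioned above.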

TL;DR: The docker-plugin itself is only a part of this; 95% of what's required for things to work is outside its control, so you have to look all over the place to find the cause of this sort of problem.

...and if all else fails, use the "attach" method - that only requires the ability to issue a "docker run" command to the docker host (as all subsequent comms is over that channel with the docker host) and is therefore much simpler (i.e. less vulnerable to "other things going wrong"). Or use JNLP (IME it's less of a problem than SSH, although #739 can be an inconvenience for some).

@phreakadelle
Author

Here you can see the container output. The containers are terminated after 30 minutes.

[image: container output]

Thanks for linking the issue. It seems to be quite similar to what is described in #739.

@alonbl
Contributor

alonbl commented Jul 6, 2019

I'm not sure where issues like this should be reported, so I reported it here [1]. This is an incompatibility with ssh-slaves-plugin >= 1.30.0.

[1] https://issues.jenkins-ci.org/browse/JENKINS-58340

@asadpiz

asadpiz commented Jul 8, 2019

Hey I just encountered a similar issue and the problem seems to be the "Inject SSH Method". If you use "User Generated SSH Credentials" option instead, it should work as expected :)

Issue 738

@pjdarton
Member

pjdarton commented Jul 8, 2019

It would be really useful if someone could post a stacktrace of an exception. Maybe the exception isn't being logged directly from the docker code, but somewhere there's sure to be an exception being thrown.
That exception's full stacktrace will indicate the true cause of the problem and will be the key to fixing any of this.

TL;DR: If you're reading this, check your Jenkins logs (not just the docker plugin logging) for exceptions mentioning SSH or docker and post them here.
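If the logs are large, a small filter can narrow things down. The following is only a sketch in plain Java: it keeps lines that name an exception or error and also mention SSH or docker. The class name, the keyword list and the file-path argument are assumptions for illustration, not anything Jenkins provides.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LogScan {
    /** Returns lines that name an exception or error and also mention SSH or docker. */
    static List<String> suspiciousLines(List<String> logLines) {
        List<String> hits = new ArrayList<>();
        for (String line : logLines) {
            String lower = line.toLowerCase(Locale.ROOT);
            boolean isFailure = lower.contains("exception") || lower.contains("error");
            boolean isRelevant = lower.contains("ssh") || lower.contains("docker");
            if (isFailure && isRelevant) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the path of a saved Jenkins log is passed as the first argument.
        if (args.length == 0) {
            System.err.println("usage: LogScan <logfile>");
            return;
        }
        for (String hit : suspiciousLines(Files.readAllLines(Paths.get(args[0])))) {
            System.out.println(hit);
        }
    }
}
```

This only matches single lines, so an exception whose header does not mention SSH or docker will be missed even if its stack frames do; treat the output as a starting point and read the full stacktrace around each hit.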

@phreakadelle
Author

phreakadelle commented Jul 8, 2019

I've upgraded the plugin again. The exception I can see in the Jenkins log is:

WARNING: Connection #1 failed
java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:197)
	at java.io.DataInputStream.readFully(DataInputStream.java:169)
	at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:254)

Jul 08, 2019 4:40:35 PM hudson.TcpSlaveAgentListener$ConnectionHandler run
INFO: Accepted JNLP4-connect connection #2 from /172.18.24.184:48068
Jul 08, 2019 4:40:37 PM hudson.TcpSlaveAgentListener$ConnectionHandler run
WARNING: Connection #3 failed
java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:197)
	at java.io.DataInputStream.readFully(DataInputStream.java:169)
	at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:254)

Jul 08, 2019 4:40:37 PM hudson.TcpSlaveAgentListener$ConnectionHandler run
INFO: Accepted JNLP4-connect connection #4 from /172.18.24.195:44220
Jul 08, 2019 4:40:38 PM org.jenkinsci.remoting.util.AnonymousClassWarnings warn
WARNING: Attempt to (de-)serialize anonymous class org.jenkinsci.plugins.envinject.EnvInjectComputerListener$2 in file:/var/lib/jenkins/plugins/envinject/WEB-INF/lib/envinject.jar; see: https://jenkins.io/redirect/serialization-of-anonymous-classes/
Jul 08, 2019 4:40:38 PM hudson.TcpSlaveAgentListener$ConnectionHandler run
WARNING: Connection #5 failed
java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:197)
	at java.io.DataInputStream.readFully(DataInputStream.java:169)
	at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:254)

Jul 08, 2019 4:40:38 PM hudson.TcpSlaveAgentListener$ConnectionHandler run

@pjdarton
Member

pjdarton commented Jul 8, 2019

Hmm, unfortunately these are consequences of a slave disconnecting rather than the cause of a slave failing to connect - they are not the key to enlightenment here :-/
Also, these look like errors from JNLP (or direct-attach?) connections, whereas the original issue mentioned SSH - can you check your config to make sure?

FYI while much of the docker-cloud-plugin process is the same for JNLP, SSH and Attach connection methods, the symptoms, especially when things don't work, vary a lot between them; it's helpful to specify what kind of connection is/isn't working when picking through logs.

PS. Single back-quotes only work for text that doesn't split over multiple lines. When posting a multi-line log, use triple back-quotes before and after, e.g.
```
log
log...
```
and it'll look like this:

log
log...

@phreakadelle
Author

Should I enable any other logs? Just let me know...

I am using the SSH connect method. My docker-in-docker image is based on the jenkins/ssh-slave image (FROM jenkins/ssh-slave).

@Pietro-G

Pietro-G commented Jul 8, 2019

Hello
I've been working with this plugin and experienced a very similar issue with the SSH method. I was wondering how the Inject SSH method is affected by this warning, which is listed on the base image's registry page:

To use this image with Docker Plugin, you need to pass the public SSH key using environment variable JENKINS_SLAVE_SSH_PUBKEY and not as a startup argument.

In Environment field of the Docker Template (advanced section), just add:

JENKINS_SLAVE_SSH_PUBKEY=
Don't put quotes around the public key. You should be all set.

source: https://hub.docker.com/r/jenkins/ssh-slave/

As of the time of writing, my problems were fixed by simply changing to the "Attach container mode" which I was doubtful of using because it's listed as experimental, in the official wiki documentation.

@phreakadelle
Author

Thanks Pietro. I've tried that both as a "-D" parameter and in the "Environment" section, without success. Maybe I did it wrong, because I am not sure about the "Environment" field in the advanced section. Are you able to post a screenshot?

@Pietro-G

Pietro-G commented Jul 8, 2019

@phreakadelle Sadly no, as I just switched to the 'Attach Container' mode and discarded the previous configuration.

Sorry about that

@asadpiz

asadpiz commented Jul 9, 2019

@phreakadelle @Pietro-G

Hey, as I mentioned before, the inject SSH method doesn't work. For SSH to work you need User Configured SSH credentials:

  1. Generate a keypair via openssh.
  2. Add the private key to Jenkins credentials (username with key). For ssh-slave container the username is "jenkins".
  3. In Docker Agent Template->Environment: Add the Public Key e.g., JENKINS_SLAVE_SSH_PUBKEY=ssh-rsa....
  4. In Docker Agent Template->Connect Method: Connect with SSH
  5. SSH Key: User configured SSH credentials
  6. SSH Credentials: Choose the Jenkins credentials you created in step 2.
  7. Host key Verification Strategy: Non verifying Verification Strategy.

This should work.
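If you'd rather not shell out to openssh for steps 1 and 3, the sketch below swaps in the JDK's own RSA key generator and builds the Environment line itself. It is an illustration under assumptions, not what the plugin does: the class and method names are made up, the 2048-bit key size is an arbitrary choice, and exporting the private key for the Jenkins credential in step 2 is not covered. The "ssh-rsa" encoding follows the SSH public key wire format (type string, exponent, modulus, each length-prefixed).

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.NoSuchAlgorithmException;
import java.security.interfaces.RSAPublicKey;
import java.util.Base64;

class SshPubKeyEnv {
    /** Generates an RSA keypair (2048 bits is an assumption, not a plugin requirement). */
    static KeyPair generate() {
        try {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(2048);
            return gen.generateKeyPair();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("RSA is not available", e);
        }
    }

    /** Encodes an RSA public key as an OpenSSH-style "ssh-rsa <base64>" string. */
    static String toOpenSsh(RSAPublicKey key) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            writeField(out, "ssh-rsa".getBytes(StandardCharsets.US_ASCII));
            writeField(out, key.getPublicExponent().toByteArray());
            writeField(out, key.getModulus().toByteArray());
            return "ssh-rsa " + Base64.getEncoder().encodeToString(buf.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes one field as a 4-byte big-endian length followed by the bytes. */
    private static void writeField(DataOutputStream out, byte[] data) throws IOException {
        out.writeInt(data.length);
        out.write(data);
    }

    /** Builds the line for the Docker Agent Template's Environment field (no quotes). */
    static String envLine(RSAPublicKey key) {
        return "JENKINS_SLAVE_SSH_PUBKEY=" + toOpenSsh(key);
    }

    public static void main(String[] args) {
        KeyPair pair = generate();
        System.out.println(envLine((RSAPublicKey) pair.getPublic()));
    }
}
```

Paste the printed line into the template's Environment field as-is; as the image documentation quoted earlier says, the key must not be wrapped in quotes.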

@alonbl
Contributor

alonbl commented Jul 9, 2019 via email

@pjdarton
Member

pjdarton commented Jul 9, 2019

@alonbl While I agree that the issue "should be resolved properly", I don't see how one method is deemed insecure where the other is acceptable.
There's a shared secret that's pushed from Jenkins into the container when the container is started; in the case of the Inject SSH key, it's pushed in via one route, in the case of the method described above, it's pushed in via another. In both cases, the shared secret is sent across the docker API connection (if that's insecure then you have much bigger problems than this one) and then Jenkins connects via SSH.
i.e. the two methods are, from a security point of view, pretty much the same thing - one's easier from a user point of view, but they're both doing very similar things "under the hood".
Can you clarify why one is insecure and the other isn't?

@phreakadelle Re: Should i enable any other logs?
I'm not 100% sure. Somewhere there will be an exception or error. Maybe it's in the Jenkins stderr log, maybe it's in https:///log/all, or maybe it's in https:///computer//log, but somewhere there will be an exception or error indicating what's failing to connect and why.

Given that there's a fair amount of interest from multiple people here, can I ask you each to confirm...

  1. Version of all docker-related plugins
  2. Version of all SSH-related plugins.
  3. Version of Jenkins.
  4. For each configuration you're trying.
    a. What docker image you're using
    b. What connection method you're using
    c. What non-default parameters (jenkins username, filesystem root, other SSH-related parameters etc) you're using in your template.
  5. What, if any, exceptions you've got in any Jenkins logs (Jenkins "all", Jenkins stderr, slave-node's logs) that mention connections, docker or SSH

My guess is that the recent "breaking-changes" to the ssh-slaves plugin in 1.30 have caused this, as that's the only area I know of where "brave" changes have happened recently, but without a nice stacktrace pointing the finger-of-blame, it's difficult to be sure.

@phreakadelle
Author

I've enabled logging on the com.* package and this is the result

@pjdarton
Member

pjdarton commented Jul 9, 2019

Awesome; thanks for those logs. OK, so what we're seeing here...
Jul 09, 2019 1:57:25 PM FINEST com.nirima.jenkins.plugins.docker.utils.PortUtils$ConnectionCheck execute Testing connectivity to 172.18.24.195 port 35194
That's where the docker plugin is waiting for the SSH port of the container to open up. It repeatedly tries to get a SSH connection until it succeeds, at which point it lets the Jenkins core take over.
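To make that "keep trying until the port opens" idea concrete, here's a rough sketch in plain Java. It is not the plugin's PortUtils code: it only checks that a TCP connection can be opened (the plugin goes further and attempts an SSH connection), and the class name, attempt count and timeouts are all assumptions for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class PortProbe {
    /** Tries to open a TCP connection until it succeeds or the attempts run out. */
    static boolean waitForPort(String host, int port, int attempts, long sleepMillis) {
        for (int i = 0; i < attempts; i++) {
            try (Socket socket = new Socket()) {
                // 1000 ms connect timeout per attempt (arbitrary choice).
                socket.connect(new InetSocketAddress(host, port), 1000);
                return true; // connected; the socket is closed on leaving the try block
            } catch (IOException e) {
                // Port not reachable yet; fall through and retry.
            }
            if (i < attempts - 1) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Host and port taken from the log line above; adjust for your own container.
        System.out.println(waitForPort("172.18.24.195", 35194, 5, 500));
    }
}
```

As in the plugin, the probe connection is thrown away once it succeeds; only the success/fail answer matters.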

com.trilead.ssh2.transport... is the result of that SSH connection attempt; it looks like it all succeeded, and then the docker plugin then closed the SSH connection (as expected - the docker plugin is only interested in the success/fail state, not in using it).

Jul 09, 2019 1:57:17 PM INFO com.nirima.jenkins.plugins.docker.utils.PortUtils$ConnectionCheckSSH execute SSH port is open on 172.18.24.195:35193
That's where the docker plugin has confirmed that it's possible to SSH to the container. It then closes the SSH connection and returns true to DockerComputerSSHConnector.createLauncher, which then returns sshKeyStrategy.getSSHLauncher(address, this)

When we're using the InjectSSHKey method this is a call to DockerComputerSSHConnector.InjectSSHKey.getSSHLauncher which contains code:

final InstanceIdentity id = InstanceIdentity.get();
final String pem = PEMEncodable.create(id.getPrivate()).encode();
return new DockerSSHLauncher(address.getHostString(), address.getPort(), user, pem,
        connector.jvmOptions, connector.javaPath, connector.prefixStartSlaveCmd, connector.suffixStartSlaveCmd,
        connector.launchTimeoutSeconds, connector.maxNumRetries, connector.retryWaitTime,
        new NonVerifyingKeyVerificationStrategy()
);

i.e. This gets an ID based on the Jenkins server itself, turns that into a private key in PEM form, and passes it to DockerSSHLauncher which is just like a normal SSHLauncher but where credentialsId == "InstanceIdentity" and getCredentials() returns the private key we just generated. So far, so good...

However... I've just looked at the SSHLauncher code, SSHLauncher.java, and it looks like, around 3 months ago (the github "blame" button reveals all!), an extra line was added and, looking at these changes, I'd guess that it'll reject the credentials as invalid because doCheckCredentialsId(...) won't find any credentials called "InstanceIdentity" (as it's looking in the drop-down list that would be displayed to users if they were manually configuring a static slave, but we're not and so that's not relevant).
That'll (now) result in an InterruptedException being thrown during the Jenkins-core's attempts to connect (i.e. not inside the docker plugin's world-view anymore) and that might not be logged anywhere (which would, arguably, be Bad behavior from the ssh-slaves plugin, as suppressing error messages is a Bad Thing and leads to all sorts of confusion ... like this issue).
i.e. there's over-zealous configuration-WebUI data-checking being applied to data that didn't come from the configuration-WebUI and it's failing (silently) somewhere deep in Jenkins-core.

So my guess is that y'all have a new(ish) version of ssh-slaves, 1.30.0 or later (github says 1.30.0 was released on June 9th 2019), and that's what broke it.

If that's the case then one (temporary) workaround would be to downgrade to the previous version of the ssh-slaves plugin.
A proper fix would either mean the ssh-slaves plugin being changed so it does not reject the docker-plugin's configuration or someone adding a Descriptor to DockerSSHLauncher that extends SSHLauncher.DescriptorImpl but overrides the doCheckCredentialsId method with a { return FormValidation.ok(); } stub (at least, I'm hoping that latter method would work - it looks like it's looking up the descriptor dynamically, so our overridden code ought to be the one that's called).
Hmm... in fact, if that "override the descriptor" method works then, ideally, we should also replace the doCheck methods for host and port with a "warn if it isn't empty" validator to warn users that nothing they type in to the host or port will have any effect here...
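The "override the descriptor" idea relies on ordinary dynamic dispatch, which the toy model below illustrates in plain Java. Everything in it is a stand-in: the classes are hypothetical, the method signature is simplified, and none of it is the real SSHLauncher.DescriptorImpl or FormValidation API, so it shows the mechanism only and says nothing about whether the real plugin code would behave the same way.

```java
import java.util.Arrays;
import java.util.List;

class DescriptorOverrideSketch {
    /** Stand-in for a form-validation result; NOT Jenkins' FormValidation. */
    enum Validation { OK, ERROR }

    /** Stand-in for the parent launcher's descriptor (hypothetical, simplified). */
    static class BaseLauncherDescriptor {
        private final List<String> knownCredentialIds;

        BaseLauncherDescriptor(List<String> knownCredentialIds) {
            this.knownCredentialIds = knownCredentialIds;
        }

        /** Accepts only IDs that appear in the configured credentials list. */
        Validation doCheckCredentialsId(String id) {
            return knownCredentialIds.contains(id) ? Validation.OK : Validation.ERROR;
        }
    }

    /** Stand-in for a docker-specific descriptor that stubs out the check. */
    static class DockerLauncherDescriptor extends BaseLauncherDescriptor {
        DockerLauncherDescriptor(List<String> knownCredentialIds) {
            super(knownCredentialIds);
        }

        @Override
        Validation doCheckCredentialsId(String id) {
            return Validation.OK; // the ID never came from the web form, so skip the lookup
        }
    }

    /** Mimics a caller that only knows the base type and looks the check up dynamically. */
    static boolean accepts(BaseLauncherDescriptor descriptor, String credentialsId) {
        return descriptor.doCheckCredentialsId(credentialsId) == Validation.OK;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("my-ssh-key");
        System.out.println(accepts(new BaseLauncherDescriptor(ids), "InstanceIdentity"));
        System.out.println(accepts(new DockerLauncherDescriptor(ids), "InstanceIdentity"));
    }
}
```

In this model the base descriptor rejects "InstanceIdentity" because it is not in the list, while the subclass accepts it even though the caller only holds the base type.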

Can everyone confirm that:

  1. It's the InjectSSHKey method that's broken for you, and
  2. You're using ssh-slaves plugin version 1.30.0 or later.
  3. Reverting to an earlier version of the ssh-slaves plugin or switching to a manually configured SSH key (as described above, or switching to JNLP or Attach instead of SSH) is a workaround until someone codes up a decent fix.

@phreakadelle
Author

So I am using 1.30.0 and it breaks. With 1.29.4 it works. The workaround of manually providing a key does not work for me. For the moment I have reverted the plugin to 1.29.4.

Thanks for the brilliant support. Who is going to fix this now? Jenkins Core Team? Is there a ticket?

@pjdarton
Member

pjdarton commented Jul 9, 2019

Who is going to fix this now?

Well, it's not going to be me any time soon ... my "day job" isn't this; I'm only answering questions because (a) I use this plugin too, (b) I seem to be the "last man standing" with commit-rights to this repo and (c) this provides a welcome break from (and is more interesting than) the work that I should be doing instead ;-) ... but I've got other pressures on my time right now such that I can't go diving into this code at present; I'm happy to "advise", but not to "do".

  • If you have commercial support from CloudBees then I'm sure they'd be able to fix it, especially given the solution hints above.
  • If not then you just have to remind yourself that Jenkins is Free (as in beer) software, and you get what you paid for ;-)

If you (or anyone else reading this!) can code up a pull-request that fixes this then that'll be widely welcomed ... and, because PRs are auto-built by the jenkinsci CI system, any PRs will also result in a downloadable .hpi/.jpi plugin file that folks can try out even before it gets merged into the master code and released.

As for "a ticket", I suggest that you check to see if there's one on the Jenkins JIRA site and, if not, create one and link back here. This plugin uses github issues to drive code changes, but it helps to also have JIRA issues cross-linking back here as lots of folks log bugs, and search for bugs, purely on JIRA.
... and if you do create/update a JIRA ticket, please go looking for duplicates and cross-link them so that everyone finds this github issue; even if you don't feel up to coding changes, linking duplicate/related issues is a general thing that anyone can do to be helpful ;-)

TL;DR: We need a volunteer.

@phreakadelle
Author

Right. Yeah. So maybe someone else can fix that. I've just linked our discussion in the official JIRA.

I have my own plugins to care about, but if I find some spare time I will investigate this issue. For the moment I can live with the 1.29.4 version.

@pjdarton
Member

pjdarton commented Jul 9, 2019

@phreakadelle FYI there's no point pinging Nick - he stopped his involvement in this plugin some time ago, just as I'm trying to do. This plugin is "up for adoption" - I'll help mentor anyone wanting to get involved, but (like Nick!) I have other commitments.

@kuisathaverat
Contributor

The issue happens on https://github.com/jenkinsci/ssh-slaves-plugin/blob/master/src/main/java/hudson/plugins/sshslaves/SSHLauncher.java#L865 when the ssh-slaves-plugin asks for the descriptor of the DockerComputerSSHConnector, which throws a runtime exception. I'll make a patch with some kind of hack to allow the docker-plugin to bypass the checkconfig.

Method threw 'java.lang.AssertionError' exception.
class io.jenkins.docker.connector.DockerComputerSSHConnector$DockerSSHLauncher is missing its descriptor

@pjdarton
Member

@kuisathaverat If you can submit a PR that adds a Descriptor to DockerSSHLauncher (and which extends the SSHLauncher's descriptor but stubs out the validation code being called) then please do so - I'm sure that everyone who's involved in this thread will be very grateful.
If you do submit a PR, GitHub should notify the jenkinsci build servers and cause it to be built & tested, and that should then result in a build artifact containing the plugin file. Anyone can then download that and upload it to their local Jenkins servers (via manage plugins -> advanced) and try it out.

Note: I'm out of the office for the next 3 weeks so I'll take a look when I get back.

@kuisathaverat
Contributor

kuisathaverat commented Jul 17, 2019

In the end I resolved the issue in the ssh-slaves-plugin, which is the better place for it: jenkinsci/ssh-agents-plugin#136. I'll release a new version this weekend.

@pjdarton
Member

pjdarton commented Aug 6, 2019

Closing issue as it's fixed in ssh-slaves plugin version 1.30.1

@pjdarton pjdarton closed this as completed Aug 6, 2019