Virtual machines connected to Jenkins via Compute Engine plugin are terminated periodically within an hour #61

lukasova · 2019-04-05T09:49:51Z

I use the Compute Engine Plugin (v. 3.0.0) to connect GCE instances to Jenkins CI (v. 2.159). Jenkins automatically creates the instances (e.g. CentOS 6, CentOS 7, Debian 9; I tried the official images that Google Compute Engine provides) when a job is started, but at a specific minute of every hour (e.g. every XX:57, yesterday it was every XX:53) all of these machines are terminated, no matter how long they have been running. The machine logs contain only information about the shutdown, nothing special:

...
08:46:33 jenkins-gce-cent-7-cv5jlc systemd: Startup finished in 1min 30.753s.
08:47:54 jenkins-gce-cent-7-cv5jlc systemd-logind: Power key pressed.
08:47:54 jenkins-gce-cent-7-cv5jlc systemd-logind: Powering Off...
...

I have no timeout or preemptibility set on the machines.
When I start the same GCE instance manually in the Google Cloud console and connect it to Jenkins via its IP address (I use the internal IP address and a VPN), the problem does not appear.
I tried changing many parameters of the Google Compute Engine plugin in Jenkins (connection timeout, One Shot option, Node retention time, etc.) but nothing helped.
In the Operations log in Google Compute Engine I can see that the Delete operation was initiated by the Jenkins account.

Steps to reproduce:
Prepare some template in GCE, use it in Jenkins with Google Compute Engine plugin, start some job and during an hour the machines will be terminated.

I attach log from Jenkins about connected machine and log from /var/log/messages from the virtual machine

messages-20190405.txt
jenkins_slave_log.txt

rachely3n · 2019-04-10T00:34:28Z

I'll have to take a closer look, but at first glance it might have to do with this feature:
#17

lukasova · 2019-04-16T11:24:56Z

Any update here please?

rachely3n · 2019-04-16T17:40:01Z

Hi @lukasova, what I mean with #17 is that it might be normal for your instances to be cleaned up after an hour of inactivity. This is intended to save you money in case you have instances running that you're not using.

lukasova · 2019-04-16T19:15:59Z

No, the instances are not inactive. A job is running on them when they are terminated. That's the problem.

rachely3n · 2019-04-16T20:40:47Z

Oh, that is very interesting. Can you show me your instance configuration and more logs if possible?

rachely3n · 2019-04-16T23:53:25Z

Did you see the following in your logs at all:

hudson.model.AsyncPeriodicWork$1 run
INFO: Started Fingerprint cleanup

lukasova · 2019-04-24T06:49:41Z

No, I did not see this INFO.

I have attached some screenshots of the template used on Google Compute Engine and of the Jenkins configuration. On the screenshots you can see that we use our own company image for CentOS 7, but the same problem appears on other systems (CentOS 6, Debian 9) and also when I try the official CentOS 7 image provided by GCE.

What other logs would you like to see?

lukasova · 2019-04-24T07:13:47Z

today's logs (jenkins slave log, jenkins full log and jenkins job failure)

jenkins_system_log.txt
jenkins_slave_log_cent7.txt
jenkins_job_failure.txt

I also archived the instance disk, so if you want some logs from the instance, let me know which ones.

lukasova · 2019-04-24T07:36:31Z

I have created an archive with /var/log/ directory of crashed CentOS 7 instance:

var_log.zip

rachely3n · 2019-04-25T00:36:23Z

Ok, so you seem to be SSH'ing in just fine because of the connect fresh as root INFO log.
However, I see there is a relative remote path of /home/jenkins/./.jenkins-slave,
but it seems agent.jar was copied to /tmp/. It's really interesting that's the path we get.

Can you SSH into your instance manually and see if this is a valid path? I think it might not be, and we'll have to look further into that. I feel like I've seen this issue before and it has to do with faulty directories. Just not quite sure how these incorrect paths get generated.

rachely3n · 2019-04-25T00:50:40Z

OK, so I just tried it with my own agent, and with ./ as my remote location the path resolves to /home/jenkins/. as shown below:

<===[JENKINS REMOTING CAPACITY]===>Remoting version: 3.17
This is a Unix agent
NOTE: Relative remote path resolved to: /home/jenkins/.
Evacuated stdout
Agent successfully connected and online

lukasova · 2019-04-25T08:50:19Z

Yes, I forgot to mention it. I also tried to find out why the agent was copied into the /tmp directory, but I didn't find anything about it. I also looked for some event that could delete the agent in /tmp, but no cron job or anything like that was started, and agent.jar is still present in the /tmp directory.

It is weird that the instance is always terminated at the same minute of the hour. Yesterday it was every XX:47.

One more question: why is the agent named agent.jar here, when the jar is called /home/jenkins/remoting.jar if I connect the instance to Jenkins manually? Is that OK?

Attaching the agent.jar. You may want to check it.
agent.jar.zip

lukasova · 2019-04-25T09:03:27Z

When I change directory to /home/jenkins/./.jenkins-slave with the command 'cd /home/jenkins/./.jenkins-slave', the command succeeds. The directory is present (it is ~/.jenkins-slave). So the path does not seem to be the problem.

lukasova · 2019-04-25T09:33:10Z

I am also attaching the System Information for the new CentOS-7 instance I created a few minutes ago.
jenkins_slave_system_information.zip

lukasova · 2019-04-25T10:52:51Z

Is #69 a similar problem?

But I have Java 8 installed:
[jenkins@jenkins-gce-cent-7-notimer-jb5y6o home]$ java -version
openjdk version "1.8.0_201"
OpenJDK Runtime Environment (build 1.8.0_201-b09)
OpenJDK 64-Bit Server VM (build 25.201-b09, mixed mode)

Sometimes the slave is connected and working almost one hour and then suddenly terminated.

rachely3n · 2019-04-25T17:01:00Z

#69 is because java 8 was not installed. You don't seem to be having that issue since your logs print a bunch of Java errors.

rachely3n · 2019-04-25T17:14:38Z

I doubt this is the issue, but it's worth trying: can you use the same image as me (Debian cloud) and see what happens?

lukasova · 2019-04-26T06:36:58Z

Yes, I can try Cloud Debian. How do you connect these machines to Jenkins? Did you generate SSH keys? Did you create a jenkins account? What other special changes did you make on this machine?

lukasova · 2019-04-26T12:55:20Z

OK, so I tried running the official Debian image and the situation is the same. Attaching the Jenkins logs.
debian_gcloud_official.zip

rachely3n · 2019-04-26T15:30:44Z

For Linux images, we generate the SSH keys for you. And like I said before, you seem to have no issue SSH'ing. For some reason your agent has trouble running the job.

We're going to put out a new release today and I wonder if that will resolve your issues... I'm not able to reproduce this error and it's not clear at all from the stack trace why this is happening. I will work on this extensively the coming week since I will be on bug duty and can dedicate more bandwidth to issues.

rachely3n · 2019-04-26T15:34:35Z

> Prepare some template in GCE, use it in Jenkins with Google Compute Engine plugin, start some job and during an hour the machines will be terminated.

When you say start some job and during an hour, is the job still running when the instance is terminated or did the job complete and you just kept the instance there and it was deleted?

lukasova · 2019-04-26T16:23:17Z

I have already generated an SSH key, so that part is OK.

lukasova · 2019-04-26T16:35:23Z

I start the job, the instance is created and the job is running; then at a specific time (today every xx:47) the instance is terminated (on the Google Cloud operations page I can see that the request to stop and delete the instance comes from the Jenkins account). After that the machine is no longer available in the cloud or in Jenkins. I can set an option not to delete the disk when the instance is terminated, which I already do (so when I need logs from a deleted machine, I create a new one manually in Google Cloud, attach the deleted instance's disk, and connect to it via SSH).

As you can see in logs I have already attached here, the running job is not completed and it is then terminated, because the agent was deleted.

lukasova · 2019-04-26T16:38:28Z

It is weird that the termination happens at the same minute of the hour all day. It does not matter whether the instance (job) has been running for 10 minutes or 50 minutes. If I start the job at 18:40, the instance is terminated at 18:47. The same happens when I start the job at 17:50: it also crashes at 18:47.
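A fixed minute of the hour that does not depend on when the job started usually points to a periodic task rather than a per-instance timeout. As a rough way to check for that pattern, the following Python sketch groups termination times by minute of the hour. The timestamps, their format, and the 0.8 threshold are hypothetical values chosen for illustration; they are not taken from the attached logs.

```python
from collections import Counter
from datetime import datetime


def minute_histogram(timestamps):
    """Count how many terminations fall on each minute of the hour."""
    return Counter(
        datetime.strptime(ts, "%Y-%m-%d %H:%M").minute for ts in timestamps
    )


def looks_hourly(timestamps, threshold=0.8):
    """Return (minute, flag); flag is True if at least `threshold` of the
    events share the same minute of the hour."""
    hist = minute_histogram(timestamps)
    if not hist:
        return None, False
    minute, count = hist.most_common(1)[0]
    return minute, count / len(timestamps) >= threshold


# Hypothetical termination times, shaped like the ones reported above.
terminations = ["2019-04-26 16:47", "2019-04-26 17:47", "2019-04-26 18:47"]
print(looks_hourly(terminations))
```

If most terminations share one minute, it is worth looking for hourly background work on the Jenkins side that runs at that minute.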

lukasova · 2019-05-07T12:31:43Z

any update here please?

rachely3n · 2019-05-07T16:27:14Z

Sorry about the delay.
I was looking at the systems log. Is it possible to get logs before the following line executes?
Apr 26, 2019 1:47:54 PM INFO hudson.remoting.SynchronousCommandTransport$ReaderThread run

The reason I ask is because I want to see if some other plugin or retention strategy is interfering with the agents and terminating them improperly since I notice the following plugin might have something to do with what's happening:

Apr 26, 2019 1:49:50 PM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog$Statistics writeStatisticsToLog

Watchdog Statistics: Number of overall executions: 7204, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms

lukasova · 2019-05-09T08:11:42Z

Hello, ok, i am attaching

full system Jenkins log,
jenkins log from loggers com.nirima.jenkins.plugins.docker.DockerContainerWatchdog plugin + com.google,
jenkins log from debian official instance 1,
jenkins log from debian official instance 2
jenkins_gcloud_log.zip

thanks

lukasova · 2019-05-13T14:07:44Z

was it helpful?

lukasova · 2019-05-21T12:10:24Z

no update here please?

rachely3n · 2019-05-21T21:08:21Z

I can't seem to open any of these files, did you just save everything from the website? There are lots of web-related files.

I had wanted to see the logs since I am guessing there could be other plugins interfering with the instances.
Can you isolate the logs?

rachely3n · 2019-05-21T21:34:43Z

Ok, I've managed to open them.
At 9:53:27 that is where you start getting 404 not found for instances.

At 9:52:57 i see the following:

May 09, 2019 9:49:50 AM INFO hudson.model.AsyncPeriodicWork$1 run
Finished DockerContainerWatchdog Asynchronous Periodic Work. 1 ms
May 09, 2019 9:52:57 AM FINEST com.google.jenkins.plugins.computeengine.CleanLostNodesWork
Starting clean lost nodes worker
May 09, 2019 9:52:57 AM FINEST com.google.jenkins.plugins.computeengine.CleanLostNodesWork
Cleaning cloud Codasip-cloud

However, I'm not seeing any log statements that would indicate we found any instances to terminate. This is possible if no remote instances were found.
However, I'm looking at the method findRemoteInstances (google-compute-engine-plugin/src/main/java/com/google/jenkins/plugins/computeengine/CleanLostNodesWork.java, line 97 in ce5761e: private List<Instance> findRemoteInstances(ComputeEngineCloud cloud) {), and we should be finding remote instances.

This may be overkill, but I wonder if you could run Jenkins with your own local build of the plugin and insert more log statements...

rachely3n · 2019-05-21T21:42:06Z

Alright, at 9:52:58 AM, which is not long after 9:52:57 where we saw Cleaning cloud Codasip-cloud:

  "zones/europe-west3-c": {
   "warning": {
    "code": "NO_RESULTS_ON_PAGE",
    "message": "There are no results for scope 'zones/europe-west3-c' on this page.",
    "data": [
     {
      "key": "scope",
      "value": "zones/europe-west3-c"
     }
    ]
   }
  },
  "zones/europe-west3-a": {
   "warning": {
    "code": "NO_RESULTS_ON_PAGE",
    "message": "There are no results for scope 'zones/europe-west3-a' on this page.",
    "data": [
     {
      "key": "scope",
      "value": "zones/europe-west3-a"
     }
    ]
   }
  },
  "zones/europe-west3-b": {
   "warning": {
    "code": "NO_RESULTS_ON_PAGE",
    "message": "There are no results for scope 'zones/europe-west3-b' on this page.",
    "data": [
     {
      "key": "scope",
      "value": "zones/europe-west3-b"
     }
    ]
   }
  },

The timing of this response makes me suspect it was triggered by CleanLostNodesWork.
However, there should be log statements when instances are terminated by CleanLostNodesWork...
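To read a response like the excerpt above more easily, a small Python sketch can split the scopes into those that returned instances and those that only carry the NO_RESULTS_ON_PAGE warning (which just means nothing matched in that zone). The sketch assumes the excerpt sits under a top-level items map, as in Compute Engine aggregated list responses; the instance name in the sample is hypothetical.

```python
import json


def summarize_scopes(response):
    """Split an aggregated instance listing into scopes that contain
    instances and scopes that reported NO_RESULTS_ON_PAGE."""
    with_instances, empty = {}, []
    for scope, body in response.get("items", {}).items():
        instances = body.get("instances", [])
        if instances:
            with_instances[scope] = [inst.get("name") for inst in instances]
        elif body.get("warning", {}).get("code") == "NO_RESULTS_ON_PAGE":
            empty.append(scope)
    return with_instances, sorted(empty)


# Sample shaped like the excerpt; "hypothetical-agent-1" is made up.
sample = json.loads("""
{"items": {
  "zones/europe-west3-c": {"warning": {"code": "NO_RESULTS_ON_PAGE"}},
  "zones/europe-west3-a": {"instances": [{"name": "hypothetical-agent-1"}]}
}}
""")
print(summarize_scopes(sample))
```

If every zone in the cloud's region ends up in the empty list, the listing found no instances for the filter that was used.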

@ingwarsw Care to contribute any input?

ingwarsw · 2019-05-21T22:05:17Z

@lukasova Are you using the latest version of the plugin?
There was a recent fix for the cleanup removing instances it does not own.

ingwarsw · 2019-05-22T22:16:43Z

@lukasova Do you maybe have several Jenkins instances configured with the same cloud?

rachely3n · 2019-05-22T23:30:04Z

Logs seem to show only 1 cloud?

ingwarsw · 2019-05-23T07:50:41Z

I don't mean several clouds on one Jenkins.
I mean at least two Jenkins instances using the same cloud (maybe a test instance).

ingwarsw · 2019-05-23T07:52:04Z

@lukasova Make sure you are on at least version 3.1.1.

lukasova · 2019-05-23T07:54:29Z

that's true. We have 2 Jenkinses configured with the same cloud. I never realized it could be related. I will update both plugins to version 3.2.0 and if it does not help I will disable testing version of Jenkins and we'll see. Thank you
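Why two Jenkins controllers sharing one cloud can terminate each other's agents can be illustrated with a minimal Python sketch. It is not the plugin's implementation: the function, the controller-id label, and the instance names are all hypothetical. The idea is that a cleanup job which treats every remote instance without a matching local node as lost will also pick up instances created by the other controller, unless the listing is scoped to the controller that owns them.

```python
def find_lost_instances(remote_instances, local_node_names, owner_id=None):
    """Return names of remote instances that have no matching local node.

    remote_instances: list of dicts with "name" and "labels" keys.
    owner_id: if given, only instances labelled with this (hypothetical)
    controller id are considered at all.
    """
    lost = []
    for inst in remote_instances:
        if owner_id is not None and inst["labels"].get("controller-id") != owner_id:
            continue  # created by another controller; leave it alone
        if inst["name"] not in local_node_names:
            lost.append(inst["name"])
    return lost


# Two controllers ("prod" and "test") share one cloud project.
remote = [
    {"name": "agent-prod-1", "labels": {"controller-id": "prod"}},
    {"name": "agent-test-1", "labels": {"controller-id": "test"}},
]
test_nodes = {"agent-test-1"}  # nodes known to the test controller

# Unscoped cleanup on the test controller flags the prod agent as lost.
print(find_lost_instances(remote, test_nodes))
# Scoped cleanup only considers the test controller's own instances.
print(find_lost_instances(remote, test_nodes, owner_id="test"))
```

In the unscoped case the test controller would delete an agent that is busy running a job for the other controller, which matches the symptom of running instances disappearing on a fixed schedule.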

lukasova · 2019-05-28T13:59:47Z

problem seems to be fixed after updating plugin to version 3.2.0. Hope it won't appear again :) thank you

rachely3n · 2019-05-28T17:38:28Z

@lukasova thank you for being patient with us! Glad it worked out.

Mukhtarali212 · 2020-04-17T10:24:37Z

> @lukasova thank you for being patient with us! Glad it worked out.

Hi rachely3n,

I'm trying to use the Google Compute Engine Plugin but I'm getting the error "Could not list in region in project ". Please take a look; I can't figure out where I am going wrong.

rachely3n · 2020-04-20T01:09:51Z

@Mukhtarali212 Usually that issue has to do with your service account credentials. Make sure the credentials you created have the proper permissions.

For reference: https://cloud.google.com/solutions/using-jenkins-for-distributed-builds-on-compute-engine#configure_cloud_identity_and_access_management

Mukhtarali212 · 2020-05-08T09:10:03Z

Hi rachely3n ,

Thanks for the reference, it resolved that issue. I have one more issue, please take a look: when a new instance is provisioned by the GCE plugin, the VM launches and the job is triggered, but cloning the Git repository in Jenkins fails with an error.
![Screenshot from 2020-05-05 15-46-39](https://user-

sindhu-chilukuri · 2023-12-04T07:09:47Z

I have upgraded Jenkins to 2.426.1 and I am facing a similar issue. @Mukhtarali212, can you suggest what can be checked here?

lukasova changed the title ~~Virtual machines connected to Jenkins automatically are terminated every hour~~ Virtual machines connected to Jenkins via Compute Engine plugin are terminated periodically within an hour Apr 8, 2019

rachely3n self-assigned this Apr 10, 2019

rachely3n closed this as completed May 28, 2019
