Virtual machines connected to Jenkins via Compute Engine plugin are terminated periodically within an hour #61

Closed
lukasova opened this issue Apr 5, 2019 · 44 comments

@lukasova

lukasova commented Apr 5, 2019

I use the Compute Engine plugin (v. 3.0.0) to connect GCE instances to Jenkins CI (v. 2.159). Jenkins automatically creates the instances (e.g. CentOS 6 and 7, Debian 9; I tried the official images provided by Google Compute Engine) when a job is started, but at a specific minute of every hour (e.g. every XX:57, yesterday it was every XX:53) all these machines are terminated, no matter how long they have been running. The machine logs contain only information about the shutdown, nothing special:

...
08:46:33 jenkins-gce-cent-7-cv5jlc systemd: Startup finished in 1min 30.753s.
08:47:54 jenkins-gce-cent-7-cv5jlc systemd-logind: Power key pressed.
08:47:54 jenkins-gce-cent-7-cv5jlc systemd-logind: Powering Off...
...

  • I have no timeout or preemptibility set on the machines.
  • When I start the same GCE instance manually in the Google Cloud console and connect it to Jenkins via its IP address (I use the internal IP address and a VPN), the problem does not appear.
  • I tried changing many parameters of the Google Compute Engine plugin in Jenkins (connection timeout, One Shot option, Node retention time, etc.) but nothing helped.
  • In the Operations log in Google Compute Engine I can see that the Delete operation was initiated by the Jenkins account.

Steps to reproduce:
Prepare some template in GCE, use it in Jenkins with Google Compute Engine plugin, start some job and during an hour the machines will be terminated.

I am attaching the Jenkins log for the connected machine and /var/log/messages from the virtual machine:

messages-20190405.txt
jenkins_slave_log.txt

@lukasova changed the title from "Virtual machines connected to Jenkins automatically are terminated every hour" to "Virtual machines connected to Jenkins via Compute Engine plugin are terminated periodically within an hour" on Apr 8, 2019
@rachely3n self-assigned this on Apr 10, 2019
@rachely3n
Contributor

I'll have to take a closer look, but at first glance it might have to do with this feature:
#17

@lukasova
Author

Any update here please?

@rachely3n
Contributor

Hi @lukasova, what I meant by #17 is that it might be normal for your instances to be cleaned up after an hour of inactivity. This is intended to save you money when you have instances running that you're not using.

@lukasova
Author

No, the instances are not inactive. A job is running on them when they are terminated. That's the problem.

@rachely3n
Contributor

Oh, that is very interesting. Can you show me your instance configuration and more logs, if possible?

@rachely3n
Contributor

Did you see the following in your logs at all:

hudson.model.AsyncPeriodicWork$1 run
INFO: Started Fingerprint cleanup

@lukasova
Author

No, I did not see this INFO message.

I have attached some screenshots of the template used on Google Compute Engine and of the Jenkins configuration. On the screenshots you can see that we use our own company image for CentOS 7, but the same problem appears on other systems (CentOS 6, Debian 9) and also when I try the official CentOS 7 image provided by GCE.

What other logs would you like to see?

cloud_template1
cloud_template2
gce_plugin_jenkins1
gce_plugin_jenkins2

@lukasova
Author

lukasova commented Apr 24, 2019

Today's logs (Jenkins slave log, full Jenkins log, and Jenkins job failure):

jenkins_system_log.txt
jenkins_slave_log_cent7.txt
jenkins_job_failure.txt

I also archived the instance disk, so if you want some logs from the instance, let me know which ones.

@lukasova
Author

I have created an archive of the /var/log/ directory from the crashed CentOS 7 instance:

var_log.zip

@rachely3n
Contributor

OK, so you seem to be SSH'ing in just fine, judging by the "connect fresh as root" INFO log.
However, I see there is a relative remote path of /home/jenkins/./.jenkins-slave,
but it seems agent.jar was copied to /tmp/. It's really interesting that's the path we get.

Can you SSH into your instance manually and see if this is a valid path? I think it might not be, and we'll have to look further into that. I feel like I've seen this issue before and it has to do with faulty directories. Just not quite sure how these incorrect paths get generated.

@rachely3n
Contributor

OK, so I just tried it out on my side, and with ./ as my remote location I get /home/jenkins/. as the resolved path:

<===[JENKINS REMOTING CAPACITY]===>Remoting version: 3.17
This is a Unix agent
NOTE: Relative remote path resolved to: /home/jenkins/.
Evacuated stdout
Agent successfully connected and online

@lukasova
Author

Yes, I forgot to mention it. I also tried to find out why the slave was copied into the /tmp directory, but I didn't find anything. I also looked for an event that could delete the agent in /tmp, but no cron job or anything like that was started; agent.jar is still present in the /tmp directory.

It is weird that the instance is always terminated at the same minute of the hour. Yesterday it was every XX:47.

One more question: why is the agent named agent.jar, while when I connect the instance to Jenkins manually the jar is called /home/jenkins/remoting.jar? Is that OK?

Attaching the agent.jar. You may want to check it.
agent.jar.zip

@lukasova
Author

When I try to change directory to /home/jenkins/./.jenkins-slave with the command 'cd /home/jenkins/./.jenkins-slave', it works. The directory is present (it is ~/.jenkins-slave), so that does not seem to be the problem.
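
The extra ./ in that path is harmless on a POSIX system: a "." segment refers to the current directory, so both spellings name the same directory. A minimal sketch using Python's standard posixpath module shows the normalization; nothing in it is specific to Jenkins or the plugin:

```python
import posixpath

# Path exactly as it appears in the agent log.
logged = "/home/jenkins/./.jenkins-slave"

# normpath collapses "." segments purely textually, without touching the filesystem.
normalized = posixpath.normpath(logged)
print(normalized)
```

The leading dot of .jenkins-slave is part of the directory name (a hidden directory), so only the standalone "." segment is dropped.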

@lukasova
Author

Also attaching the System Information for the new CentOS 7 instance I created a few minutes ago.
jenkins_slave_system_information.zip

@lukasova
Author

Is #69 a similar problem?

But I have Java 8 installed:
[jenkins@jenkins-gce-cent-7-notimer-jb5y6o home]$ java -version
openjdk version "1.8.0_201"
OpenJDK Runtime Environment (build 1.8.0_201-b09)
OpenJDK 64-Bit Server VM (build 25.201-b09, mixed mode)

Sometimes the slave is connected and works for almost an hour and is then suddenly terminated.

@rachely3n
Contributor

#69 happened because Java 8 was not installed. You don't seem to be having that issue, since your logs print a bunch of Java errors.

@rachely3n
Contributor

I doubt this is the issue, but it's worth a try: can you use the same image as me (Debian cloud) and see what happens?

@lukasova
Author

Yes, I can try the Debian cloud image. How do you connect these machines to Jenkins? Did you generate SSH keys? Did you create a jenkins account? What other special changes did you make on this machine?

@lukasova
Author

OK, so I tried running the official Debian image and the situation is the same. Attaching Jenkins logs.
debian_gcloud_official.zip

@rachely3n
Contributor

For Linux images, we generate the SSH keys for you. And like I said before, you seem to have no issue SSH'ing. For some reason your agent has trouble running the job.

We're going to put out a new release today and I wonder if that will resolve your issues... I'm not able to reproduce this error and it's not clear at all from the stack trace why this is happening. I will work on this extensively the coming week since I will be on bug duty and can dedicate more bandwidth to issues.

@rachely3n
Contributor

"Prepare some template in GCE, use it in Jenkins with Google Compute Engine plugin, start some job and during an hour the machines will be terminated."

When you say start some job and during an hour, is the job still running when the instance is terminated or did the job complete and you just kept the instance there and it was deleted?

@lukasova
Author

I have already generated an SSH key, so that is OK.

@lukasova
Author

I start the job, the instance is created and the job is running, then at a specific time (today every XX:47) the instance is terminated (on the Google Cloud operations page I can see that the request to stop and delete the instance comes from the Jenkins account). After that the machine is no longer available in the cloud or in Jenkins. I can set an option not to delete the disk when the instance is terminated, which I already do (so when I need logs from a deleted machine, I create a new one manually in Google Cloud, attach the deleted instance's disk, and connect to it via SSH).

As you can see in the logs I have already attached, the running job does not complete; it is terminated because the agent was deleted.

@lukasova
Author

It is weird that the termination happens all day at a specific minute of the hour. It does not matter whether the instance (job) has been running for 10 minutes or 50 minutes. If I run the job at 18:40, the instance is terminated at 18:47. The same happens when I run the job at 17:50: it also crashes at 18:47.
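
The distinction matters for diagnosis: a per-instance timeout would give every instance the same lifetime, whereas an hourly background task would give every instance the same termination minute. A minimal sketch of that check follows. The first two rows are the times mentioned above; the third row is made up for illustration, and the helper names are hypothetical:

```python
from collections import Counter
from datetime import datetime

# (job start, instance termination) as HH:MM.
# Rows 1 and 2 come from the report above; row 3 is an illustrative assumption.
runs = [
    ("18:40", "18:47"),
    ("17:50", "18:47"),
    ("16:10", "16:47"),
]

def parse(hhmm):
    """Parse an HH:MM string into a datetime (date part is irrelevant here)."""
    return datetime.strptime(hhmm, "%H:%M")

def lifetime_minutes(start, end):
    """Minutes between job start and termination on the same day."""
    return int((parse(end) - parse(start)).total_seconds() // 60)

# How often each minute-of-hour shows up among the terminations.
termination_minutes = Counter(parse(end).minute for _, end in runs)
# How long each instance lived.
lifetimes = [lifetime_minutes(start, end) for start, end in runs]

print(termination_minutes)
print(lifetimes)
```

With this data the lifetimes differ widely while the termination minute is always 47, which is the pattern one would expect from something that runs once an hour rather than from a timeout attached to each instance.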

@lukasova
Author

lukasova commented May 7, 2019

Any update here, please?

@rachely3n
Contributor

Sorry about the delay.
I was looking at the systems log. Is it possible to get logs before the following line executes?
Apr 26, 2019 1:47:54 PM INFO hudson.remoting.SynchronousCommandTransport$ReaderThread run

The reason I ask is because I want to see if some other plugin or retention strategy is interfering with the agents and terminating them improperly since I notice the following plugin might have something to do with what's happening:

Apr 26, 2019 1:49:50 PM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog$Statistics writeStatisticsToLog

Watchdog Statistics: Number of overall executions: 7204, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms

@lukasova
Author

lukasova commented May 9, 2019

Hello, OK, I am attaching:

  • the full Jenkins system log,
  • the Jenkins log from the loggers com.nirima.jenkins.plugins.docker.DockerContainerWatchdog and com.google,
  • the Jenkins log from official Debian instance 1,
  • the Jenkins log from official Debian instance 2.

jenkins_gcloud_log.zip

Thanks

@lukasova
Author

lukasova commented May 13, 2019

Was it helpful?

@lukasova
Author

Any update here, please?

@rachely3n
Contributor

I can't seem to open any of these files. Did you just save everything from the website? There are lots of web-related files.

I had wanted to see the logs since I am guessing there could be other plugins interfering with the instances.
Can you isolate the logs?

@rachely3n
Contributor

OK, I've managed to open them.
At 9:53:27 you start getting 404 Not Found responses for instances.

Just before that, at 9:52:57, I see the following:

May 09, 2019 9:49:50 AM INFO hudson.model.AsyncPeriodicWork$1 run
Finished DockerContainerWatchdog Asynchronous Periodic Work. 1 ms
May 09, 2019 9:52:57 AM FINEST com.google.jenkins.plugins.computeengine.CleanLostNodesWork
Starting clean lost nodes worker
May 09, 2019 9:52:57 AM FINEST com.google.jenkins.plugins.computeengine.CleanLostNodesWork
Cleaning cloud Codasip-cloud

However, I'm not seeing any log statements that would indicate we found any instances to terminate. This is possible if no remote instances were found.
However, I'm looking at the method findRemoteInstances, declared as

private List<Instance> findRemoteInstances(ComputeEngineCloud cloud)

and we should be finding remote instances.

This may be overkill, but I wonder if you could run Jenkins with your own local build of the plugin and insert more log statements...
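
To line up events like these without reading the log by eye, the header lines can be parsed into timestamps. A minimal sketch, assuming the header format seen in the excerpt above (date, time, AM/PM, level, logger name) and English month abbreviations; the sample lines are copied from this thread and parse_header is a hypothetical helper name:

```python
import re
from datetime import datetime

# Header lines in the excerpt look like:
#   "May 09, 2019 9:52:57 AM FINEST com.google...CleanLostNodesWork"
HEADER = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2} \d{2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M) "
    r"(?P<level>[A-Z]+) (?P<logger>\S+)"
)

def parse_header(line):
    """Return (timestamp, level, logger) for a header line, else None."""
    match = HEADER.match(line)
    if match is None:
        return None
    # Assumes English month names, as in the default C locale.
    stamp = datetime.strptime(match["stamp"], "%b %d, %Y %I:%M:%S %p")
    return stamp, match["level"], match["logger"]

lines = [
    "May 09, 2019 9:49:50 AM INFO hudson.model.AsyncPeriodicWork$1 run",
    "Finished DockerContainerWatchdog Asynchronous Periodic Work. 1 ms",
    "May 09, 2019 9:52:57 AM FINEST com.google.jenkins.plugins.computeengine.CleanLostNodesWork",
    "Starting clean lost nodes worker",
]

# Keep only the lines that parsed as headers; message lines yield None.
events = [parsed for parsed in map(parse_header, lines) if parsed]
cleanup_start = next(
    stamp for stamp, _, logger in events if logger.endswith("CleanLostNodesWork")
)

# Time of the first 404, read off the log by hand (9:53:27 as noted above).
first_404 = datetime(2019, 5, 9, 9, 53, 27)
gap_seconds = (first_404 - cleanup_start).total_seconds()
print(cleanup_start, gap_seconds)
```

With these inputs the first 404 comes 30 seconds after the cleanup worker starts, which is consistent with the suspicion above but does not by itself prove causation.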

@rachely3n
Contributor

Alright, at 9:52:58 AM, shortly after the 9:52:57 entry where we saw "Cleaning cloud Codasip-cloud", I see:

  "zones/europe-west3-c": {
   "warning": {
    "code": "NO_RESULTS_ON_PAGE",
    "message": "There are no results for scope 'zones/europe-west3-c' on this page.",
    "data": [
     {
      "key": "scope",
      "value": "zones/europe-west3-c"
     }
    ]
   }
  },
  "zones/europe-west3-a": {
   "warning": {
    "code": "NO_RESULTS_ON_PAGE",
    "message": "There are no results for scope 'zones/europe-west3-a' on this page.",
    "data": [
     {
      "key": "scope",
      "value": "zones/europe-west3-a"
     }
    ]
   }
  },
  "zones/europe-west3-b": {
   "warning": {
    "code": "NO_RESULTS_ON_PAGE",
    "message": "There are no results for scope 'zones/europe-west3-b' on this page.",
    "data": [
     {
      "key": "scope",
      "value": "zones/europe-west3-b"
     }
    ]
   }
  },

The timing of this statement makes me suspect the cause is CleanLostNodesWork.
However, there should be log statements when instances are terminated by CleanLostNodesWork...

@ingwarsw Care to contribute any input?

@ingwarsw
Contributor

@lukasova Are you using the latest version of the plugin?
There was a recent fix for the cleanup removing instances that are not its own.

@ingwarsw
Contributor

@lukasova Do you perhaps have several Jenkins instances configured with the same cloud?

@rachely3n
Contributor

Logs seem to show only 1 cloud?

@ingwarsw
Contributor

I don't mean many clouds on one Jenkins,
but at least two Jenkins instances using the same cloud (maybe a test instance).
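
To illustrate why two Jenkins instances pointed at the same cloud could produce these symptoms, here is a toy model. It assumes (this is an assumption for illustration, not the plugin's actual code) that a periodic cleanup deletes every cloud instance that the local Jenkins does not know as one of its nodes. All names in it (Controller, naive_cleanup, owner_aware_cleanup, the agent and controller names) are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Controller:
    """A hypothetical Jenkins controller that only knows its own agents."""
    name: str
    known_agents: set = field(default_factory=set)

def naive_cleanup(controller, cloud_instances):
    """Delete every cloud instance the controller does not recognize."""
    lost = {i for i in cloud_instances if i not in controller.known_agents}
    cloud_instances.difference_update(lost)
    return lost

def owner_aware_cleanup(controller, cloud_instances, owners):
    """Delete only unrecognized instances that this controller itself created."""
    lost = {
        i for i in cloud_instances
        if owners.get(i) == controller.name and i not in controller.known_agents
    }
    cloud_instances.difference_update(lost)
    return lost

# Two controllers share one cloud; only "prod" has a running agent.
prod = Controller("prod", {"agent-a"})
test = Controller("test")
owners = {"agent-a": "prod"}

# The test controller's hourly cleanup sees agent-a as "lost" and deletes it.
deleted_by_naive = naive_cleanup(test, {"agent-a"})

# With an ownership check, the test controller leaves agent-a alone.
deleted_by_owner_aware = owner_aware_cleanup(test, {"agent-a"}, owners)

print(deleted_by_naive, deleted_by_owner_aware)
```

In this model the job on agent-a is killed by the other controller's schedule, not by anything in its own controller's logs, which would match both the fixed minute of the hour and the absence of termination messages on the main Jenkins.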

@ingwarsw
Contributor

@lukasova Check that you are on version 3.1.1 or later.

@lukasova
Author

That's true. We have two Jenkins instances configured with the same cloud. I never realized it could be related. I will update the plugin on both to version 3.2.0, and if that does not help I will disable the test Jenkins and we'll see. Thank you.

@lukasova
Author

The problem seems to be fixed after updating the plugin to version 3.2.0. Hope it won't appear again :) Thank you.

@rachely3n
Contributor

@lukasova thank you for being patient with us! Glad it worked out.

@Mukhtarali212

@lukasova thank you for being patient with us! Glad it worked out.

Hi rachely3n,

I'm trying to use the Google Compute Engine plugin but I'm getting the error "Could not list in region in project ". Please take a look; I can't figure out where I'm going wrong.
Screenshot from 2020-04-01 11-38-55

@rachely3n
Contributor

@Mukhtarali212 Usually that issue has to do with your service account credentials. Make sure the credentials you created have the proper permissions.

For reference: https://cloud.google.com/solutions/using-jenkins-for-distributed-builds-on-compute-engine#configure_cloud_identity_and_access_management

@Mukhtarali212

Hi rachely3n,

Thanks for the reference, it resolved that issue. I have one more issue, please take a look: cloning the Git repository in Jenkins fails with an error. When a new instance is provisioned by the GCE plugin, the VM launches and the job is triggered, but then I get the error.
Screenshot from 2020-05-05 15-46-39
Screenshot from 2020-05-04 12-25-02

@sindhu-chilukuri

sindhu-chilukuri commented Dec 4, 2023

I have upgraded Jenkins to 2.426.1 and I am facing a similar issue. @Mukhtarali212, can you suggest what can be checked here?
