[JENKINS-34150] Test and solution for issue #21

mkaring · 2016-04-17T20:03:23Z

A test and the solution for the following issue:
https://issues.jenkins-ci.org/browse/JENKINS-34150

The issue itself seems to happen in case the check of the file presence happens at the same moment as the file being created. This causes the batch file to not create the file in question. The solution to the issue is to track the alive state of the process that is launched to execute the script. Only if that script is terminated the file is checked.

jenkinsadmin · 2016-04-18T00:02:08Z

Thank you for this pull request! Please check this document for how the Jenkins project handles pull requests.

amuniz · 2016-04-18T08:30:16Z

Did you test this patch across Jenkins restarts? I mean, make the batch script to take longer and restart Jenkins while it is being executed.

mkaring · 2016-04-18T08:51:29Z

I have not. How ever this patch shouldn't alter the behaviour of the plugin during a restart of Jenkins since it is not introducing anything that causes the execution to block.

In case this is a concern it may be a good idea to add a general test that verifies to behaviour of the plugin during a restart cycle. I am not even sure what I would have to check. If the restart cycle hangs or if the scripts are properly cleaned?

amuniz · 2016-04-18T09:46:45Z

this patch shouldn't alter the behaviour of the plugin during a restart of Jenkins since it is not introducing anything that causes the execution to block.

I suspect runningProcess will be always null after a restart.

In case this is a concern it may be a good idea to add a general test that verifies to behaviour of the plugin during a restart cycle

I meant a manual test. An automated restart test would be intrinsically unreliable - as you don't know how long the batch process should wait until Jenkins is restarted.

mkaring · 2016-04-18T09:59:43Z

I am not sure what would happen to the classes during a restart. Does the plugin currently pick a running batch execution up again after a restart?

amuniz · 2016-04-18T10:20:31Z

Does the plugin currently pick a running batch execution up again after a restart?

Yes. That's the reason this plugin exists.

mkaring · 2016-04-18T10:48:07Z

I understand. There may be some issues with that when using my fix. Even when falling back to the classic check it does not work, because it may trigger the hanging again.

The only proper idea I got now to get around this issue is to alter the created batch files and have them create create a file that marks the process as running that is checked before looking for the exit status file. This could act as fallback method in case the reference to the running process is lost.

I will check if that works later today. Thank you for pointing this issue out.

A fix for the issue was implemented using the start command. This has the downside that the spawned processes create a CMD window for the user that is used to execute the script. Also a issue was fixed that caused a error while parsing the error code in case the error file was written just in the same moment as the error file was checked. In this cases the check did read a empty string from the file and issued a error. The check for the hanging issue was rewritten to cause the issue more reliable by running multiple scripts at the same time.

mkaring · 2016-04-19T21:19:05Z

I updated this pull request with a actually working version of the script that utilizes the START command to execute the script. This way there is no more hanging because what ever is killing the parent process can't terminate the wrapper and the actual script any more.

I did some rewriting on the generation and the structure of the wrapper script. That was mostly done so the code is easier readable and any issues that may be caused due to stray variables or due to strange encoding problems may be avoided.

I also updated the test case to check for the issue more aggressive. It now runs multiple instances of the script in order to try to get them to lock up. The test works very reliable now. In case the problem is present, the test will fail.

All in all this is a working solution, but I am not totally happy with it because it still creates those command line windows. But I have no found any way to avoid this aside from using the scripting host, powershell or any third party tools. And all of those may be blocked in a server environment. After all the window spawning is not that big of a problem. I think Jenkins is mostly running on it's own user anyway, so it wouldn't be that much of a problem, because only that user sees those command line windows appearing.

One note that may be relevant: Those windows that now appear. If you close those by hand it cause the same hanging issue as we had before.

jglick · 2016-04-20T20:37:22Z

src/main/java/org/jenkinsci/plugins/durabletask/WindowsBatchScript.java

-        ps.readStdout().readStderr(); // TODO see BourneShellScript
-        ps.start();
+        ps.writeStdin().readStdout().readStderr(); // TODO see BourneShellScript
+        Proc process = ps.start();


That is a left over. Sorry I forgot to remove it. I'll do so in case you accept the general approach of the fix. It has no impact on the behaviour of the plugin.

amuniz · 2016-04-21T11:38:58Z

I think this is looking good 👍

Fixed a missing escaping for the wrapper script name when calling it with the START command. Did some renaming of the functions to provide clearer readable source code.

mkaring · 2016-04-21T22:20:29Z

I did some final clean up. In case there are no objections this could be merged and deployed. I don't see a better solution right now.

mkaring · 2016-04-25T07:40:27Z

Is there any issue left that blocks this from being merged?

sonneveld · 2016-04-26T05:34:57Z

I've noticed issues when terminating a batch script with multiple commands. If aborted during a command, it might run multiple commands after it before actually terminating (if at all, I still get hanging sometimes if aborting while running a bat file)

Example script and logs:
JENKINS-34150.zip
First successfully terminated. the second stopped the first ping then started the second before terminating/hanging.

It might be worth nothing that the command windows are closed after I attempt to terminate the job.

mkaring · 2016-04-26T08:21:52Z

The problem is that the aborting will not work properly, because the process is now running fully detached. If you manage to terminate the window that runs the wrapper script, it will still hang. But Jenkins shouldn't be able to do so, because there is no reference to that process kept around.

But yes... the batch will ignore any abort command.

sonneveld · 2016-04-26T13:55:58Z

I'm only talking about the case where I stop the jenkins job through the web interface.. I am leaving the command window alone while the job runs. I referred to the windows closing because Jenkins managed to close while terminating the job it but still hung. I think whatever fix is implemented should still allow jobs to be safely stopped through the web interface.

jglick · 2016-04-26T15:05:03Z

whatever fix is implemented should still allow jobs to be safely stopped through the web interface

Absolutely. You should assume that the node is a JNLP agent running on a Windows server in a dark closet somewhere, or EC2, etc. The usual means of stopping a runaway process is through the web UI, or a timeout in a Pipeline script, etc. etc. Interacting with the Windows desktop should be considered a corner case.

See BourneShellScriptTest.stop; there is not yet an equivalent test case in WindowsBatchScriptTest.

jtnord · 2016-04-26T15:05:09Z

src/main/java/org/jenkinsci/plugins/durabletask/WindowsBatchScript.java

+         */
+        final String nl = "\r\n";
+        c.getWrapperBatchFile(ws).write(
+                "CHCP 65001 > NUL" + nl +


I have a feeling that this will break some setups where the batch commands are written in the platform native default code page, or call tools that use the codepage to parse files in the workspace.

The batch files written in the pipelines are all written as UTF-8. So I added this to avoid that the scripts created by this plugin are read with the wrong encoding.

mkaring · 2016-04-26T15:14:35Z

@jglick You have a point, how ever this is not part of this fix, because stopping the batch doesn't work currently no matter if this fix is applied or not. This pull request solely addresses the the hanging issue.

jglick · 2016-04-26T15:25:02Z

stopping the batch doesn't work currently

No? Did last time I tested it. Perhaps this regressed.

mkaring · 2016-04-26T16:22:22Z

No? Did last time I tested it. Perhaps this regressed.

We wouldn't have this discussion of the termination of the batch scripts would work properly.

amuniz · 2016-04-26T16:40:57Z

We wouldn't have this discussion of the termination of the batch scripts would work properly.

Well, no. This PR is about builds not terminating by themselves (not about forcibly terminating a build).

Actually, as noted in a JIRA comment, everything works when using a real agent (not master), even surviving a Jenkins restart or build killing. All this PR is about running a bat step in master (which is not recommended, although usable in demo/test cases) and that step hangs forever because jenkins-wrapper.bat is not fully executed so jenkins-result.txt is not created thus the step does not finish.

mkaring · 2016-04-26T16:58:44Z

which is not recommended

Why? I've got a buildserver with Jenkins that doesn't use any slaves at all. Everything happens on the master. And considering the responses on the issues related to this I am not alone.

Well, no. This PR is about builds not terminating by themselves (not about forcibly terminating a build)

Once the hanging happens, the termination doesn't work any more too. You can't apply the timeout and you can't cancel it unless you kill the build the hard way.

Back to topic. I have a idea to get the tracking of the process working in a similar way like it works for the shell script.

The batch process is not tracked using the PID if the wrapper command line. This is far more reliable in case the process dies along the way.

mkaring · 2016-05-06T21:25:03Z

@Aeon512 Jenkins only deletes those files once the execution has finished. So the exit code has to be known. At this point is sounds like it works as expected. The only thing strange is the command line window you describe to be still open.

mkaring · 2016-05-07T10:41:57Z

@Aeon512 Could you give a minimal example configuration of the scenario you are trying to execute?

Aeon512 · 2016-05-07T10:46:49Z

@mkaring Will try to get a small example working. But might take until Monday though. Will follow up asap.

Update 2016-05-08
@mkaring: I'm unable to reproduce the issue (besides when having job names with spaces - see above). Might have been a timing issue within my batch files or simply the wrong plugin version. Sorry for that.

When compiling your pull-request I still get three test failures on Win7 Pro English (spaces and @ in names)

mkaring · 2016-05-11T17:46:03Z

@Aeon512 No amount of quotes was able to get rid of the problem. It seems there is a problem with the START command in case there is a space in the path. I was able to work around this is a fairly easy modification of the command line.

I tried it with all kinds of nasty characters in the paths and the tests don't fail. I will do some more extensive tests tomorrow. Would be nice if you run the new version in your environment.

Saucistophe · 2016-05-11T20:48:47Z

@mkaring Do you refer to this particular bug (For completeness sake) ?http://stackoverflow.com/questions/17674255/why-does-windows-start-command-not-work-with-spaces-in-arguments-and-path

mkaring · 2016-05-12T12:50:29Z

@Saucistophe No idea if it is a bug or just a limitation of the command. But it sounds like just the same problem.

Aeon512 · 2016-05-14T17:55:38Z

When trying to build on Win7 I still end up with three failed tests

Failed tests:
  WindowsBatchScriptTest.percentInPath:65->testWithPath:78 expected:<0> but was:<-2>
  WindowsBatchScriptTest.spaceInPath2:60->testWithPath:78 expected:<0> but was:<1>
  WindowsBatchScriptTest.spaceInPath:55->testWithPath:78 expected:<0> but was:<1>

mkaring · 2016-05-15T00:11:01Z

I have no clue how I missed that. The problem was the CMD call inside the wrapper script. Two quotation marks are required there. I fixed that.

Aeon512 · 2016-05-15T18:39:33Z

@mkaring
Thanks for the update. Now all test pass and I don't have any problems anymore regarding project names with spaces.

However, when using your version of the plugin for long running batch jobs (e.g. compilation for a large C++ project which takes around 1 hour - with nearly 100% cpu and memory load all the time) the batch jobs get aborted after around 10 minutes (between 6 and 17 minutes in my testing) and simply return that the batch file finished with errorlevel -1 (in some rare cases also -2).

Looking at the console output, the batch file simply terminated in the middle of the C++ compilation job (using msbuild from Visual Studio 2012 on the console).

Compared with durable-tasks 1.7 (before this issue was introduced) the same job works fine without any problems for several days of continuously building the whole project over and over again.

Hence, something is going still wrong - but honestly i don't have any idea what this could be.

mkaring · 2016-05-15T19:34:33Z

Okay I have no idea why the batch would be terminated under these conditions. Especially after such a long time. The error level -2 points to the batch file not returning the process id within the expected time frame. I added this as a safeguard to ensure that it does not lock up in case something goes wrong with the PID detection. Maybe this timeout is set too short in case the system is stressed by other processes at the same time.

Aeon512 · 2016-05-16T07:13:35Z

Well, it's not only error level -2 but also error level -1. Example output

[build_debug_x64]          C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\bin\x86_amd64\CL.exe /c .....
[build_debug_x64]          file1.cpp
[build_debug_x64]          file2.cpp
[Pipeline] }
[Pipeline] // wrap <-- for highlighting the output
[Pipeline] }
[Pipeline] // withEnv <-- for specifying additional environment variables
[Pipeline] echo
Error:: hudson.AbortException: script returned exit code -1

The corresponding pipeline script is wrapped in a try catch block where any error is printed out in the catch block.

As you can see, the batch script simply stops executing the middle of the compilation process.

mkaring · 2016-05-16T11:29:39Z

@Aeon512 Could you try to have your compiler write a log file separately? Maybe this gives a indication why the process is terminated.

jtnord · 2016-05-19T11:57:45Z

which is not recommended

Why? I've got a buildserver with Jenkins that doesn't use any slaves at all. Everything happens on the master. And considering the responses on the issues related to this I am not alone.

You have virtually no security if you do that all users should be treated as if they where administrators and nothing on your jenkins master should be considered secret from them. Any user that can create a job is an admin - they can access/modify jenkins configuration files on disk. Any user that can checkin code that gets built is an admin - they can access/modify jenkins configuraton files on disk.
an errant build script/job can take out our entire master (an errant rm will wipe$JENKINS_HOME) the master
Builds will fight for resources on the master which can cause a slow UI.

jtnord · 2016-05-19T12:05:14Z

src/test/java/org/jenkinsci/plugins/durabletask/WindowsBatchScriptTest.java

+        FilePath ws = j.jenkins.getRootPath().child("ws");
+        Launcher launcher = j.jenkins.createLauncher(listener);
+        WindowsBatchScript script = new WindowsBatchScript("TIMEOUT /T 10");


TIMEOUT /T 10 doesn't work correctly when you have no console to read input from and will die

"ERROR: Input redirection is not supported, exiting the process immediately."

Right. This was not a issue, because the scenario this test was working in spawned a new window that worked as input context. The command can easily be changed to a ping or something like this to work around that. I just used the timeout to get a defined runtime.

jtnord · 2016-05-19T12:43:29Z

I believe I have a 100% reproducible testcase for this issue

It seems that in the case that the parent process has been killed (e.g. the slave dies) then all though the script terminates successfully and the wrapper terminates successfully (checked with process monitor) there is no attempt to create the result file.
where the parent process has not been killed I never see this issue.

All I needed to do to fix the issue I was observing is add @echo off as the first line of the wrapper script. Basically I beleive it is trying to echo the commands to be run before running the commands and as there is no longer anything consuming the wrappers input/output when echoing the command it is doomed to fail (but strangly not with an exit code that implies something died!??!

sonneveld · 2016-05-19T13:50:47Z

@jtnord I think that's one way of reproducing it but I've seen it without having killed the slave. There might be some merit to your @echo off suggestion though.

mkaring · 2016-05-19T17:35:09Z

This problem only ever happened in case the scripts were running on the master. So if you terminate any slave or not shouldn't effect this issue at all.

jtnord · 2016-05-19T18:57:17Z

This problem only ever happened in case the scripts were running on the master.

Well I (and a few others) observed it when run on a slave :-)

There may be multiple distinct issues all with the same symptoms. (The first unit test committed here was not fixed when I ran it on top of #24)

mkaring · 2016-05-19T21:20:11Z

There may be multiple distinct issues all with the same symptoms. (The first unit test committed here was not fixed when I ran it on top of #24)

I noticed this problem too, how ever the issue was not that the test ran into a timeout, indicating that the hanging problem happened, but rather this error reading from the result file.

This problem is one I fixed in my changes without mentioning. The change for this problem to happen becomes significant in case the return code file is queried very often. It happens in case the check for the presence for the file happens just in the moment the batch file tries to write it, leading to the file remaining empty. I guess the batch first touches the file handle, opens it for write and then writes to it. This process I think is interrupted by the checks after touching the file. So the file is there, but empty and this causes the exception to be thrown.

In the changes I did I worked around this by writing the return code to a temporary file and renaming the file (move) after the result is there. This way the filled right away when java checks for it.

jtnord · 2016-05-20T10:31:45Z

There may be multiple distinct issues all with the same symptoms. (The first unit test committed here was not fixed when I ran it on top of #24)

I noticed this problem too, how ever the issue was not that the test ran into a timeout, indicating that the hanging problem happened, but rather this error reading from the result file.

But actually by removing the timeout the test is fixed :-)
When I removed the 5 second timeout I see the unit test runs and the "getting the exit part" takes just shy of 10 seconds. Given this is a tight loop I'm not so surprised (even though I am on a 4 core machine with HT and an SSD :-) )

Aeon512 · 2016-05-21T07:52:51Z

My issues got resolved after applying #24. So far, no problems occurred.

mkaring · 2016-05-21T08:04:28Z

@jtnord Both tests seem to work with your change. On my system it finishes before the timeout. The only "problem" that happens some times is that the exitCode function throws a exception because it reads a empty file. I guess this issue is depends highly on how fast the test system is.

But your change really seems to sort it out.

I will fix the tests and send them in as another pull request. Just to have some tests to this issue and to have something come out of this entire discussion. ;-) I think this pull request can be closed since #24 fixes the issue.

jtnord · 2016-05-25T14:44:16Z

@mkaring

The only "problem" that happens some times is that the exitCode function throws a exception because it reads a empty file. I guess this issue is depends highly on how fast the test system is.

I had a reasonable spec machine (i7-4712HQ with SSD) so it's pretty fast. indeed there could be a narrow race condition between the file being created and the data flushed to it, so there probably needs to be another JIRA issue and a retry added to the exit code reader.

I think this pull request can be closed

Closing...

philippe-granet · 2016-09-21T00:59:33Z

Why closing this PR?
According the documentation https://jenkins.io/doc/pipeline/
Accordingly, pipeline functionality is:
Durable: Pipelines can survive both planned and unplanned restarts of your Jenkins master.

It doesn't actually work on Windows if your pipeline run on master nodes and you kill Jenkins. When you restart Jenkins, jobs hang.
This PR seems to fix this, no?

jglick · 2017-01-04T17:22:33Z

It doesn't actually work on Windows if your pipeline run on master nodes and you kill Jenkins.

The preferred fix would be as outlined in JENKINS-27617.

mkaring added 2 commits April 17, 2016 21:32

[JENKINS-34150] Created test case for issue.

5bfbf1c

jglick reviewed Apr 20, 2016
View reviewed changes

[JENKINS-34150] Minor fixes and renames for better readability

20e7fa2

Fixed a missing escaping for the wrapper script name when calling it with the START command. Did some renaming of the functions to provide clearer readable source code.

jtnord reviewed Apr 26, 2016
View reviewed changes

[JENKINS-34150] Improved tracking for running batch process

cd0b28f

The batch process is not tracked using the PID if the wrapper command line. This is far more reliable in case the process dies along the way.

[JENKINS-34150] Fixed problem with spaces in path to scripts

fc76de9

[JENKINS-34150] Fixed problem with spaces in path to scripts

dd97644

jtnord reviewed May 19, 2016
View reviewed changes

jtnord mentioned this pull request May 19, 2016

[JENKINS-34150] Fixes my observed issue. #24

Merged

jtnord closed this May 25, 2016

mkaring mentioned this pull request May 25, 2016

[JENKINS-34150] Test Cases for the issue #25

Closed

[JENKINS-34150] Test and solution for issue #21

[JENKINS-34150] Test and solution for issue #21

Conversation

mkaring commented Apr 17, 2016

jenkinsadmin commented Apr 18, 2016

amuniz commented Apr 18, 2016

mkaring commented Apr 18, 2016

amuniz commented Apr 18, 2016

mkaring commented Apr 18, 2016

amuniz commented Apr 18, 2016

mkaring commented Apr 18, 2016

mkaring commented Apr 19, 2016

jglick Apr 20, 2016

Choose a reason for hiding this comment

mkaring Apr 20, 2016

Choose a reason for hiding this comment

amuniz commented Apr 21, 2016

mkaring commented Apr 21, 2016

mkaring commented Apr 25, 2016

sonneveld commented Apr 26, 2016 • edited Loading

mkaring commented Apr 26, 2016

sonneveld commented Apr 26, 2016

jglick commented Apr 26, 2016

jtnord Apr 26, 2016 • edited Loading

Choose a reason for hiding this comment

mkaring Apr 26, 2016

Choose a reason for hiding this comment

mkaring commented Apr 26, 2016

jglick commented Apr 26, 2016

mkaring commented Apr 26, 2016 • edited Loading

amuniz commented Apr 26, 2016

mkaring commented Apr 26, 2016

mkaring commented May 6, 2016

mkaring commented May 7, 2016

Aeon512 commented May 7, 2016 • edited Loading

mkaring commented May 11, 2016

Saucistophe commented May 11, 2016

mkaring commented May 12, 2016

Aeon512 commented May 14, 2016 • edited Loading

mkaring commented May 15, 2016

Aeon512 commented May 15, 2016

mkaring commented May 15, 2016

Aeon512 commented May 16, 2016

mkaring commented May 16, 2016

jtnord commented May 19, 2016

jtnord May 19, 2016

Choose a reason for hiding this comment

mkaring May 19, 2016

Choose a reason for hiding this comment

jtnord commented May 19, 2016

sonneveld commented May 19, 2016

mkaring commented May 19, 2016

jtnord commented May 19, 2016

mkaring commented May 19, 2016

jtnord commented May 20, 2016 • edited Loading

Aeon512 commented May 21, 2016

mkaring commented May 21, 2016

jtnord commented May 25, 2016

philippe-granet commented Sep 21, 2016 • edited Loading

jglick commented Jan 4, 2017

sonneveld commented Apr 26, 2016 •

edited

Loading

jtnord Apr 26, 2016 •

edited

Loading

mkaring commented Apr 26, 2016 •

edited

Loading

Aeon512 commented May 7, 2016 •

edited

Loading

Aeon512 commented May 14, 2016 •

edited

Loading

jtnord commented May 20, 2016 •

edited

Loading

philippe-granet commented Sep 21, 2016 •

edited

Loading