Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-34150] Test and solution for issue #21

Closed
wants to merge 9 commits into from

Conversation

mkaring
Copy link

@mkaring mkaring commented Apr 17, 2016

A test and the solution for the following issue:
https://issues.jenkins-ci.org/browse/JENKINS-34150

The issue itself seems to happen in case the check of the file presence happens
at the same moment as the file being created. This causes the batch file to not
create the file in question.

The solution to the issue is to track the alive state of the process that is
launched to execute the script. Only if that script is terminated the file is
checked.
@jenkinsadmin
Copy link
Member

Thank you for this pull request! Please check this document for how the Jenkins project handles pull requests.

@amuniz
Copy link
Member

amuniz commented Apr 18, 2016

Did you test this patch across Jenkins restarts? I mean, make the batch script to take longer and restart Jenkins while it is being executed.

@mkaring
Copy link
Author

mkaring commented Apr 18, 2016

I have not. How ever this patch shouldn't alter the behaviour of the plugin during a restart of Jenkins since it is not introducing anything that causes the execution to block.

In case this is a concern it may be a good idea to add a general test that verifies to behaviour of the plugin during a restart cycle. I am not even sure what I would have to check. If the restart cycle hangs or if the scripts are properly cleaned?

@amuniz
Copy link
Member

amuniz commented Apr 18, 2016

this patch shouldn't alter the behaviour of the plugin during a restart of Jenkins since it is not introducing anything that causes the execution to block.

I suspect runningProcess will be always null after a restart.

In case this is a concern it may be a good idea to add a general test that verifies to behaviour of the plugin during a restart cycle

I meant a manual test. An automated restart test would be intrinsically unreliable - as you don't know how long the batch process should wait until Jenkins is restarted.

@mkaring
Copy link
Author

mkaring commented Apr 18, 2016

I am not sure what would happen to the classes during a restart. Does the plugin currently pick a running batch execution up again after a restart?

@amuniz
Copy link
Member

amuniz commented Apr 18, 2016

Does the plugin currently pick a running batch execution up again after a restart?

Yes. That's the reason this plugin exists.

@mkaring
Copy link
Author

mkaring commented Apr 18, 2016

I understand. There may be some issues with that when using my fix. Even when falling back to the classic check it does not work, because it may trigger the hanging again.

The only proper idea I got now to get around this issue is to alter the created batch files and have them create create a file that marks the process as running that is checked before looking for the exit status file. This could act as fallback method in case the reference to the running process is lost.

I will check if that works later today. Thank you for pointing this issue out.

A fix for the issue was implemented using the start command. This has
the downside that the spawned processes create a CMD window for the
user that is used to execute the script.

Also a issue was fixed that caused a error while parsing the error code
in case the error file was written just in the same moment as the error
file was checked. In this cases the check did read a empty string from
the file and issued a error.

The check for the hanging issue was rewritten to cause the issue more
reliable by running multiple scripts at the same time.
@mkaring
Copy link
Author

mkaring commented Apr 19, 2016

I updated this pull request with a actually working version of the script that utilizes the START command to execute the script. This way there is no more hanging because what ever is killing the parent process can't terminate the wrapper and the actual script any more.

I did some rewriting on the generation and the structure of the wrapper script. That was mostly done so the code is easier readable and any issues that may be caused due to stray variables or due to strange encoding problems may be avoided.

I also updated the test case to check for the issue more aggressive. It now runs multiple instances of the script in order to try to get them to lock up. The test works very reliable now. In case the problem is present, the test will fail.

All in all this is a working solution, but I am not totally happy with it because it still creates those command line windows. But I have no found any way to avoid this aside from using the scripting host, powershell or any third party tools. And all of those may be blocked in a server environment. After all the window spawning is not that big of a problem. I think Jenkins is mostly running on it's own user anyway, so it wouldn't be that much of a problem, because only that user sees those command line windows appearing.

One note that may be relevant: Those windows that now appear. If you close those by hand it cause the same hanging issue as we had before.

ps.readStdout().readStderr(); // TODO see BourneShellScript
ps.start();
ps.writeStdin().readStdout().readStderr(); // TODO see BourneShellScript
Proc process = ps.start();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a left over. Sorry I forgot to remove it. I'll do so in case you accept the general approach of the fix. It has no impact on the behaviour of the plugin.

@amuniz
Copy link
Member

amuniz commented Apr 21, 2016

I think this is looking good 👍

Fixed a missing escaping for the wrapper script name when calling it
with the START command.

Did some renaming of the functions to provide clearer readable source
code.
@mkaring
Copy link
Author

mkaring commented Apr 21, 2016

I did some final clean up. In case there are no objections this could be merged and deployed. I don't see a better solution right now.

@mkaring
Copy link
Author

mkaring commented Apr 25, 2016

Is there any issue left that blocks this from being merged?

@sonneveld
Copy link

sonneveld commented Apr 26, 2016

I've noticed issues when terminating a batch script with multiple commands. If aborted during a command, it might run multiple commands after it before actually terminating (if at all, I still get hanging sometimes if aborting while running a bat file)

Example script and logs:
JENKINS-34150.zip
First successfully terminated. the second stopped the first ping then started the second before terminating/hanging.

It might be worth nothing that the command windows are closed after I attempt to terminate the job.

@mkaring
Copy link
Author

mkaring commented Apr 26, 2016

The problem is that the aborting will not work properly, because the process is now running fully detached. If you manage to terminate the window that runs the wrapper script, it will still hang. But Jenkins shouldn't be able to do so, because there is no reference to that process kept around.

But yes... the batch will ignore any abort command.

@sonneveld
Copy link

I'm only talking about the case where I stop the jenkins job through the web interface.. I am leaving the command window alone while the job runs. I referred to the windows closing because Jenkins managed to close while terminating the job it but still hung. I think whatever fix is implemented should still allow jobs to be safely stopped through the web interface.

@jglick
Copy link
Member

jglick commented Apr 26, 2016

whatever fix is implemented should still allow jobs to be safely stopped through the web interface

Absolutely. You should assume that the node is a JNLP agent running on a Windows server in a dark closet somewhere, or EC2, etc. The usual means of stopping a runaway process is through the web UI, or a timeout in a Pipeline script, etc. etc. Interacting with the Windows desktop should be considered a corner case.

See BourneShellScriptTest.stop; there is not yet an equivalent test case in WindowsBatchScriptTest.

*/
final String nl = "\r\n";
c.getWrapperBatchFile(ws).write(
"CHCP 65001 > NUL" + nl +
Copy link
Member

@jtnord jtnord Apr 26, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a feeling that this will break some setups where the batch commands are written in the platform native default code page, or call tools that use the codepage to parse files in the workspace.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The batch files written in the pipelines are all written as UTF-8. So I added this to avoid that the scripts created by this plugin are read with the wrong encoding.

@mkaring
Copy link
Author

mkaring commented Apr 26, 2016

@jglick You have a point, how ever this is not part of this fix, because stopping the batch doesn't work currently no matter if this fix is applied or not. This pull request solely addresses the the hanging issue.

@jglick
Copy link
Member

jglick commented Apr 26, 2016

stopping the batch doesn't work currently

No? Did last time I tested it. Perhaps this regressed.

@mkaring
Copy link
Author

mkaring commented Apr 26, 2016

No? Did last time I tested it. Perhaps this regressed.

We wouldn't have this discussion of the termination of the batch scripts would work properly.

@amuniz
Copy link
Member

amuniz commented Apr 26, 2016

We wouldn't have this discussion of the termination of the batch scripts would work properly.

Well, no. This PR is about builds not terminating by themselves (not about forcibly terminating a build).

Actually, as noted in a JIRA comment, everything works when using a real agent (not master), even surviving a Jenkins restart or build killing. All this PR is about running a bat step in master (which is not recommended, although usable in demo/test cases) and that step hangs forever because jenkins-wrapper.bat is not fully executed so jenkins-result.txt is not created thus the step does not finish.

@mkaring
Copy link
Author

mkaring commented Apr 26, 2016

which is not recommended

Why? I've got a buildserver with Jenkins that doesn't use any slaves at all. Everything happens on the master. And considering the responses on the issues related to this I am not alone.

Well, no. This PR is about builds not terminating by themselves (not about forcibly terminating a build)

Once the hanging happens, the termination doesn't work any more too. You can't apply the timeout and you can't cancel it unless you kill the build the hard way.

Back to topic. I have a idea to get the tracking of the process working in a similar way like it works for the shell script.

The batch process is not tracked using the PID if the wrapper command
line. This is far more reliable in case the process dies along the way.
@mkaring
Copy link
Author

mkaring commented May 6, 2016

@Aeon512 Jenkins only deletes those files once the execution has finished. So the exit code has to be known. At this point is sounds like it works as expected. The only thing strange is the command line window you describe to be still open.

@mkaring
Copy link
Author

mkaring commented May 7, 2016

@Aeon512 Could you give a minimal example configuration of the scenario you are trying to execute?

@Aeon512
Copy link

Aeon512 commented May 7, 2016

@mkaring Will try to get a small example working. But might take until Monday though. Will follow up asap.

Update 2016-05-08
@mkaring: I'm unable to reproduce the issue (besides when having job names with spaces - see above). Might have been a timing issue within my batch files or simply the wrong plugin version. Sorry for that.

When compiling your pull-request I still get three test failures on Win7 Pro English (spaces and @ in names)

@mkaring
Copy link
Author

mkaring commented May 11, 2016

@Aeon512 No amount of quotes was able to get rid of the problem. It seems there is a problem with the START command in case there is a space in the path. I was able to work around this is a fairly easy modification of the command line.

I tried it with all kinds of nasty characters in the paths and the tests don't fail. I will do some more extensive tests tomorrow. Would be nice if you run the new version in your environment.

@Saucistophe
Copy link

@mkaring
Copy link
Author

mkaring commented May 12, 2016

@Saucistophe No idea if it is a bug or just a limitation of the command. But it sounds like just the same problem.

@Aeon512
Copy link

Aeon512 commented May 14, 2016

When trying to build on Win7 I still end up with three failed tests

Failed tests:
  WindowsBatchScriptTest.percentInPath:65->testWithPath:78 expected:<0> but was:<-2>
  WindowsBatchScriptTest.spaceInPath2:60->testWithPath:78 expected:<0> but was:<1>
  WindowsBatchScriptTest.spaceInPath:55->testWithPath:78 expected:<0> but was:<1>

@mkaring
Copy link
Author

mkaring commented May 15, 2016

I have no clue how I missed that. The problem was the CMD call inside the wrapper script. Two quotation marks are required there. I fixed that.

@Aeon512
Copy link

Aeon512 commented May 15, 2016

@mkaring
Thanks for the update. Now all test pass and I don't have any problems anymore regarding project names with spaces.

However, when using your version of the plugin for long running batch jobs (e.g. compilation for a large C++ project which takes around 1 hour - with nearly 100% cpu and memory load all the time) the batch jobs get aborted after around 10 minutes (between 6 and 17 minutes in my testing) and simply return that the batch file finished with errorlevel -1 (in some rare cases also -2).

Looking at the console output, the batch file simply terminated in the middle of the C++ compilation job (using msbuild from Visual Studio 2012 on the console).

Compared with durable-tasks 1.7 (before this issue was introduced) the same job works fine without any problems for several days of continuously building the whole project over and over again.

Hence, something is going still wrong - but honestly i don't have any idea what this could be.

@mkaring
Copy link
Author

mkaring commented May 15, 2016

Okay I have no idea why the batch would be terminated under these conditions. Especially after such a long time. The error level -2 points to the batch file not returning the process id within the expected time frame. I added this as a safeguard to ensure that it does not lock up in case something goes wrong with the PID detection. Maybe this timeout is set too short in case the system is stressed by other processes at the same time.

@Aeon512
Copy link

Aeon512 commented May 16, 2016

Well, it's not only error level -2 but also error level -1. Example output

[build_debug_x64]          C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\bin\x86_amd64\CL.exe /c .....
[build_debug_x64]          file1.cpp
[build_debug_x64]          file2.cpp
[Pipeline] }
[Pipeline] // wrap <-- for highlighting the output
[Pipeline] }
[Pipeline] // withEnv <-- for specifying additional environment variables
[Pipeline] echo
Error:: hudson.AbortException: script returned exit code -1

The corresponding pipeline script is wrapped in a try catch block where any error is printed out in the catch block.

As you can see, the batch script simply stops executing the middle of the compilation process.

@mkaring
Copy link
Author

mkaring commented May 16, 2016

@Aeon512 Could you try to have your compiler write a log file separately? Maybe this gives a indication why the process is terminated.

@jtnord
Copy link
Member

jtnord commented May 19, 2016

which is not recommended

Why? I've got a buildserver with Jenkins that doesn't use any slaves at all. Everything happens on the master. And considering the responses on the issues related to this I am not alone.

  • You have virtually no security if you do that all users should be treated as if they where administrators and nothing on your jenkins master should be considered secret from them. Any user that can create a job is an admin - they can access/modify jenkins configuration files on disk. Any user that can checkin code that gets built is an admin - they can access/modify jenkins configuraton files on disk.
  • an errant build script/job can take out our entire master (an errant rm will wipe$JENKINS_HOME) the master
  • Builds will fight for resources on the master which can cause a slow UI.

FilePath ws = j.jenkins.getRootPath().child("ws");
Launcher launcher = j.jenkins.createLauncher(listener);
WindowsBatchScript script = new WindowsBatchScript("TIMEOUT /T 10");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIMEOUT /T 10 doesn't work correctly when you have no console to read input from and will die

"ERROR: Input redirection is not supported, exiting the process immediately."

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. This was not a issue, because the scenario this test was working in spawned a new window that worked as input context. The command can easily be changed to a ping or something like this to work around that. I just used the timeout to get a defined runtime.

@jtnord
Copy link
Member

jtnord commented May 19, 2016

I believe I have a 100% reproducible testcase for this issue

It seems that in the case that the parent process has been killed (e.g. the slave dies) then all though the script terminates successfully and the wrapper terminates successfully (checked with process monitor) there is no attempt to create the result file.
where the parent process has not been killed I never see this issue.

All I needed to do to fix the issue I was observing is add @echo off as the first line of the wrapper script. Basically I beleive it is trying to echo the commands to be run before running the commands and as there is no longer anything consuming the wrappers input/output when echoing the command it is doomed to fail (but strangly not with an exit code that implies something died!??!

@sonneveld
Copy link

@jtnord I think that's one way of reproducing it but I've seen it without having killed the slave. There might be some merit to your @echo off suggestion though.

@mkaring
Copy link
Author

mkaring commented May 19, 2016

This problem only ever happened in case the scripts were running on the master. So if you terminate any slave or not shouldn't effect this issue at all.

@jtnord
Copy link
Member

jtnord commented May 19, 2016

This problem only ever happened in case the scripts were running on the master.

Well I (and a few others) observed it when run on a slave :-)

There may be multiple distinct issues all with the same symptoms. (The first unit test committed here was not fixed when I ran it on top of #24)

@mkaring
Copy link
Author

mkaring commented May 19, 2016

There may be multiple distinct issues all with the same symptoms. (The first unit test committed here was not fixed when I ran it on top of #24)

I noticed this problem too, how ever the issue was not that the test ran into a timeout, indicating that the hanging problem happened, but rather this error reading from the result file.

This problem is one I fixed in my changes without mentioning. The change for this problem to happen becomes significant in case the return code file is queried very often. It happens in case the check for the presence for the file happens just in the moment the batch file tries to write it, leading to the file remaining empty. I guess the batch first touches the file handle, opens it for write and then writes to it. This process I think is interrupted by the checks after touching the file. So the file is there, but empty and this causes the exception to be thrown.

In the changes I did I worked around this by writing the return code to a temporary file and renaming the file (move) after the result is there. This way the filled right away when java checks for it.

@jtnord
Copy link
Member

jtnord commented May 20, 2016

There may be multiple distinct issues all with the same symptoms. (The first unit test committed here was not fixed when I ran it on top of #24)

I noticed this problem too, how ever the issue was not that the test ran into a timeout, indicating that the hanging problem happened, but rather this error reading from the result file.

But actually by removing the timeout the test is fixed :-)
When I removed the 5 second timeout I see the unit test runs and the "getting the exit part" takes just shy of 10 seconds. Given this is a tight loop I'm not so surprised (even though I am on a 4 core machine with HT and an SSD :-) )

@Aeon512
Copy link

Aeon512 commented May 21, 2016

My issues got resolved after applying #24. So far, no problems occurred.

@mkaring
Copy link
Author

mkaring commented May 21, 2016

@jtnord Both tests seem to work with your change. On my system it finishes before the timeout. The only "problem" that happens some times is that the exitCode function throws a exception because it reads a empty file. I guess this issue is depends highly on how fast the test system is.

But your change really seems to sort it out.

I will fix the tests and send them in as another pull request. Just to have some tests to this issue and to have something come out of this entire discussion. ;-) I think this pull request can be closed since #24 fixes the issue.

@jtnord
Copy link
Member

jtnord commented May 25, 2016

@mkaring

The only "problem" that happens some times is that the exitCode function throws a exception because it reads a empty file. I guess this issue is depends highly on how fast the test system is.

I had a reasonable spec machine (i7-4712HQ with SSD) so it's pretty fast. indeed there could be a narrow race condition between the file being created and the data flushed to it, so there probably needs to be another JIRA issue and a retry added to the exit code reader.

I think this pull request can be closed

Closing...

@philippe-granet
Copy link

philippe-granet commented Sep 21, 2016

Why closing this PR?
According the documentation https://jenkins.io/doc/pipeline/
Accordingly, pipeline functionality is:
Durable: Pipelines can survive both planned and unplanned restarts of your Jenkins master.

It doesn't actually work on Windows if your pipeline run on master nodes and you kill Jenkins. When you restart Jenkins, jobs hang.
This PR seems to fix this, no?

@jglick
Copy link
Member

jglick commented Jan 4, 2017

It doesn't actually work on Windows if your pipeline run on master nodes and you kill Jenkins.

The preferred fix would be as outlined in JENKINS-27617.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants