-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JENKINS-34150] Test and solution for issue #21
Conversation
The issue itself seems to happen in case the check of the file presence happens at the same moment as the file being created. This causes the batch file to not create the file in question. The solution to the issue is to track the alive state of the process that is launched to execute the script. Only if that script is terminated the file is checked.
Thank you for this pull request! Please check this document for how the Jenkins project handles pull requests. |
Did you test this patch across Jenkins restarts? I mean, make the batch script to take longer and restart Jenkins while it is being executed. |
I have not. How ever this patch shouldn't alter the behaviour of the plugin during a restart of Jenkins since it is not introducing anything that causes the execution to block. In case this is a concern it may be a good idea to add a general test that verifies to behaviour of the plugin during a restart cycle. I am not even sure what I would have to check. If the restart cycle hangs or if the scripts are properly cleaned? |
I suspect
I meant a manual test. An automated restart test would be intrinsically unreliable - as you don't know how long the batch process should wait until Jenkins is restarted. |
I am not sure what would happen to the classes during a restart. Does the plugin currently pick a running batch execution up again after a restart? |
Yes. That's the reason this plugin exists. |
I understand. There may be some issues with that when using my fix. Even when falling back to the classic check it does not work, because it may trigger the hanging again. The only proper idea I got now to get around this issue is to alter the created batch files and have them create create a file that marks the process as running that is checked before looking for the exit status file. This could act as fallback method in case the reference to the running process is lost. I will check if that works later today. Thank you for pointing this issue out. |
A fix for the issue was implemented using the start command. This has the downside that the spawned processes create a CMD window for the user that is used to execute the script. Also a issue was fixed that caused a error while parsing the error code in case the error file was written just in the same moment as the error file was checked. In this cases the check did read a empty string from the file and issued a error. The check for the hanging issue was rewritten to cause the issue more reliable by running multiple scripts at the same time.
I updated this pull request with a actually working version of the script that utilizes the I did some rewriting on the generation and the structure of the wrapper script. That was mostly done so the code is easier readable and any issues that may be caused due to stray variables or due to strange encoding problems may be avoided. I also updated the test case to check for the issue more aggressive. It now runs multiple instances of the script in order to try to get them to lock up. The test works very reliable now. In case the problem is present, the test will fail. All in all this is a working solution, but I am not totally happy with it because it still creates those command line windows. But I have no found any way to avoid this aside from using the scripting host, powershell or any third party tools. And all of those may be blocked in a server environment. After all the window spawning is not that big of a problem. I think Jenkins is mostly running on it's own user anyway, so it wouldn't be that much of a problem, because only that user sees those command line windows appearing. One note that may be relevant: Those windows that now appear. If you close those by hand it cause the same hanging issue as we had before. |
ps.readStdout().readStderr(); // TODO see BourneShellScript | ||
ps.start(); | ||
ps.writeStdin().readStdout().readStderr(); // TODO see BourneShellScript | ||
Proc process = ps.start(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a left over. Sorry I forgot to remove it. I'll do so in case you accept the general approach of the fix. It has no impact on the behaviour of the plugin.
I think this is looking good 👍 |
Fixed a missing escaping for the wrapper script name when calling it with the START command. Did some renaming of the functions to provide clearer readable source code.
I did some final clean up. In case there are no objections this could be merged and deployed. I don't see a better solution right now. |
Is there any issue left that blocks this from being merged? |
I've noticed issues when terminating a batch script with multiple commands. If aborted during a command, it might run multiple commands after it before actually terminating (if at all, I still get hanging sometimes if aborting while running a bat file) Example script and logs: It might be worth nothing that the command windows are closed after I attempt to terminate the job. |
The problem is that the aborting will not work properly, because the process is now running fully detached. If you manage to terminate the window that runs the wrapper script, it will still hang. But Jenkins shouldn't be able to do so, because there is no reference to that process kept around. But yes... the batch will ignore any abort command. |
I'm only talking about the case where I stop the jenkins job through the web interface.. I am leaving the command window alone while the job runs. I referred to the windows closing because Jenkins managed to close while terminating the job it but still hung. I think whatever fix is implemented should still allow jobs to be safely stopped through the web interface. |
Absolutely. You should assume that the node is a JNLP agent running on a Windows server in a dark closet somewhere, or EC2, etc. The usual means of stopping a runaway process is through the web UI, or a See |
*/ | ||
final String nl = "\r\n"; | ||
c.getWrapperBatchFile(ws).write( | ||
"CHCP 65001 > NUL" + nl + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a feeling that this will break some setups where the batch commands are written in the platform native default code page, or call tools that use the codepage to parse files in the workspace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The batch files written in the pipelines are all written as UTF-8. So I added this to avoid that the scripts created by this plugin are read with the wrong encoding.
@jglick You have a point, how ever this is not part of this fix, because stopping the batch doesn't work currently no matter if this fix is applied or not. This pull request solely addresses the the hanging issue. |
No? Did last time I tested it. Perhaps this regressed. |
We wouldn't have this discussion of the termination of the batch scripts would work properly. |
Well, no. This PR is about builds not terminating by themselves (not about forcibly terminating a build). Actually, as noted in a JIRA comment, everything works when using a real agent (not master), even surviving a Jenkins restart or build killing. All this PR is about running a |
Why? I've got a buildserver with Jenkins that doesn't use any slaves at all. Everything happens on the master. And considering the responses on the issues related to this I am not alone.
Once the hanging happens, the termination doesn't work any more too. You can't apply the timeout and you can't cancel it unless you kill the build the hard way. Back to topic. I have a idea to get the tracking of the process working in a similar way like it works for the shell script. |
The batch process is not tracked using the PID if the wrapper command line. This is far more reliable in case the process dies along the way.
@Aeon512 Jenkins only deletes those files once the execution has finished. So the exit code has to be known. At this point is sounds like it works as expected. The only thing strange is the command line window you describe to be still open. |
@Aeon512 Could you give a minimal example configuration of the scenario you are trying to execute? |
@mkaring Will try to get a small example working. But might take until Monday though. Will follow up asap. Update 2016-05-08 When compiling your pull-request I still get three test failures on Win7 Pro English (spaces and @ in names) |
@Aeon512 No amount of quotes was able to get rid of the problem. It seems there is a problem with the I tried it with all kinds of nasty characters in the paths and the tests don't fail. I will do some more extensive tests tomorrow. Would be nice if you run the new version in your environment. |
@mkaring Do you refer to this particular bug (For completeness sake) ?http://stackoverflow.com/questions/17674255/why-does-windows-start-command-not-work-with-spaces-in-arguments-and-path |
@Saucistophe No idea if it is a bug or just a limitation of the command. But it sounds like just the same problem. |
When trying to build on Win7 I still end up with three failed tests
|
I have no clue how I missed that. The problem was the |
@mkaring However, when using your version of the plugin for long running batch jobs (e.g. compilation for a large C++ project which takes around 1 hour - with nearly 100% cpu and memory load all the time) the batch jobs get aborted after around 10 minutes (between 6 and 17 minutes in my testing) and simply return that the batch file finished with errorlevel -1 (in some rare cases also -2). Looking at the console output, the batch file simply terminated in the middle of the C++ compilation job (using msbuild from Visual Studio 2012 on the console). Compared with durable-tasks 1.7 (before this issue was introduced) the same job works fine without any problems for several days of continuously building the whole project over and over again. Hence, something is going still wrong - but honestly i don't have any idea what this could be. |
Okay I have no idea why the batch would be terminated under these conditions. Especially after such a long time. The error level |
Well, it's not only error level
The corresponding pipeline script is wrapped in a As you can see, the batch script simply stops executing the middle of the compilation process. |
@Aeon512 Could you try to have your compiler write a log file separately? Maybe this gives a indication why the process is terminated. |
|
FilePath ws = j.jenkins.getRootPath().child("ws"); | ||
Launcher launcher = j.jenkins.createLauncher(listener); | ||
WindowsBatchScript script = new WindowsBatchScript("TIMEOUT /T 10"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIMEOUT /T 10
doesn't work correctly when you have no console to read input from and will die
"ERROR: Input redirection is not supported, exiting the process immediately."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. This was not a issue, because the scenario this test was working in spawned a new window that worked as input context. The command can easily be changed to a ping
or something like this to work around that. I just used the timeout to get a defined runtime.
I believe I have a 100% reproducible testcase for this issue It seems that in the case that the parent process has been killed (e.g. the slave dies) then all though the script terminates successfully and the wrapper terminates successfully (checked with process monitor) there is no attempt to create the result file. All I needed to do to fix the issue I was observing is add |
@jtnord I think that's one way of reproducing it but I've seen it without having killed the slave. There might be some merit to your |
This problem only ever happened in case the scripts were running on the master. So if you terminate any slave or not shouldn't effect this issue at all. |
Well I (and a few others) observed it when run on a slave :-) There may be multiple distinct issues all with the same symptoms. (The first unit test committed here was not fixed when I ran it on top of #24) |
I noticed this problem too, how ever the issue was not that the test ran into a timeout, indicating that the hanging problem happened, but rather this error reading from the result file. This problem is one I fixed in my changes without mentioning. The change for this problem to happen becomes significant in case the return code file is queried very often. It happens in case the check for the presence for the file happens just in the moment the batch file tries to write it, leading to the file remaining empty. I guess the batch first touches the file handle, opens it for write and then writes to it. This process I think is interrupted by the checks after touching the file. So the file is there, but empty and this causes the exception to be thrown. In the changes I did I worked around this by writing the return code to a temporary file and renaming the file ( |
But actually by removing the timeout the test is fixed :-) |
My issues got resolved after applying #24. So far, no problems occurred. |
@jtnord Both tests seem to work with your change. On my system it finishes before the timeout. The only "problem" that happens some times is that the But your change really seems to sort it out. I will fix the tests and send them in as another pull request. Just to have some tests to this issue and to have something come out of this entire discussion. ;-) I think this pull request can be closed since #24 fixes the issue. |
I had a reasonable spec machine (i7-4712HQ with SSD) so it's pretty fast. indeed there could be a narrow race condition between the file being created and the data flushed to it, so there probably needs to be another JIRA issue and a retry added to the exit code reader.
Closing... |
Why closing this PR? It doesn't actually work on Windows if your pipeline run on master nodes and you kill Jenkins. When you restart Jenkins, jobs hang. |
The preferred fix would be as outlined in JENKINS-27617. |
A test and the solution for the following issue:
https://issues.jenkins-ci.org/browse/JENKINS-34150